# **Lecture 8B**
# **Using Functions on DataFrame Columns**

In this part, we introduce functions for creating and modifying DataFrame columns. Some of these functions come from the **numpy** module.

Before you run the examples in this notebook, execute the following two cells to mount your Google Drive, import the pandas module, and load the required Excel file.

In [2]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# import Pandas module
import pandas as pd

# Read student2.xlsx data file
student = pd.read_excel("/content/drive/MyDrive/Data/student2.xlsx",sheet_name="sheet1")
display(student.head())

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True


---
**Example 1:** A convenient way to apply numeric functions to DataFrame columns is to use the **numpy** module.

* The **numpy** module provides math functions that can easily be used to transform numeric columns in a Pandas DataFrame. The function is applied to every data value in the specified column.
* The list of available numpy math functions is at https://numpy.org/doc/stable/reference/routines.math.html
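
As a minimal sketch of the element-wise behaviour described above, the cell below applies `np.sqrt` to a small made-up column. The DataFrame `toy` and its values are hypothetical and are not taken from student2.xlsx.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from student2.xlsx
toy = pd.DataFrame({"GPA": [1.0, 4.0, 2.25]})

# Applying a numpy function to a column returns a new Series
# with one result per row of the original column
root = np.sqrt(toy["GPA"])

# Store the result as a new column
toy["GPA_sqrt"] = root

print(toy)
```

Because the result has the same length and index as the input column, it can be assigned straight back into the DataFrame as a new column.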



In [None]:
# import numpy module
import numpy as np

# Natural log
student["GPA1"] = np.log(student["GPA"])

# Base 10 log
student["GPA2"] = np.log10(student["GPA"])

# Exponential
student["GPA3"] = np.exp(student["GPA"])

# Square root
student["GPA4"] = np.sqrt(student["GPA"])

# Rounding to 1 decimal place
# 2nd argument is the number of decimal places requested
student["GPA5"] = np.round(student["GPA"],1)

display(student)

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan,GPA1,GPA2,GPA3,GPA4,GPA5
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False,0.582216,0.252853,5.989452,1.337909,1.8
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True,-0.544727,-0.236572,1.786038,0.761577,0.6
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False,0.604316,0.262451,6.233887,1.352775,1.8
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False,0.708036,0.307496,7.614086,1.424781,2.0
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True,0.576613,0.25042,5.929856,1.334166,1.8
5,6,Jerry,Li,M,54,99,37,Ping Pong/Swim/Cycling,3.32,True,False,1.199965,0.521138,27.660351,1.822087,3.3
6,7,Bob,Wong,M,88,36,26,Reading/Swim,2.81,False,True,1.033184,0.448706,16.609918,1.676305,2.8
7,8,Peter,Yeung,M,83,90,68,Gaming/Football,2.37,False,False,0.86289,0.374748,10.697392,1.53948,2.4
8,9,Clara,Yau,F,51,65,45,Football/Music/Art,3.02,True,True,1.105257,0.480007,20.491292,1.737815,3.0
9,10,Jacky,Lee,M,90,72,94,Art/Reading/Swim,3.89,True,False,1.358409,0.58995,48.910887,1.972308,3.9


---
**Example 2:** Pandas itself provides many functions for working with string columns. This part shows a few of them. For a complete list, refer to the following page.
https://pandas.pydata.org/pandas-docs/dev/user_guide/text.html

* **df[*string_column*].str.len()** will return a new Series storing the length of the given string column.
* To concatenate two or more string columns, you can simply use the "+" operator.
* **df[*string_column*].str.find(*string*)** will try to find ***string*** in ***string_column*** and return the position of the first occurrence found. Positions are counted from 0. If ***string*** is not found, -1 will be returned.
* **df[*string_column*].str.slice(*start*, *end*)** will extract a substring from ***string_column*** from position ***start*** to the position ***end*** (but not including ***end***).
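
The functions listed above can be tried on a small self-contained column before running them on the full file. This is only a sketch: the DataFrame `toy` is hypothetical, although its two hobby strings are copied from the student data shown earlier.

```python
import pandas as pd

# Hypothetical two-row DataFrame; the strings come from the Hobby column
toy = pd.DataFrame({"Hobby": ["Chess/Swim", "Reading/Dance"]})

# Number of characters in each string
lengths = toy["Hobby"].str.len()

# Position of the first "Swim" (counted from 0), or -1 if it is absent
pos = toy["Hobby"].str.find("Swim")

# Characters at positions 0, 1 and 2 (position 3 is excluded)
first3 = toy["Hobby"].str.slice(0, 3)

print(lengths.tolist())  # [10, 13]
print(pos.tolist())      # [6, -1]
print(first3.tolist())   # ['Che', 'Rea']
```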



In [None]:
# Re-read student2.xlsx data file
student = pd.read_excel("/content/drive/MyDrive/Data/student2.xlsx",sheet_name="sheet1")

# Concatenating Firstname and Lastname into FullName
# Note that we have added a space between Firstname and Lastname
student["FullName"] = student["Firstname"] + " " + student["Lastname"]
display(student.head())

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan,FullName
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False,Amy Chan
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True,Betty Lee
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False,Johnny Lam
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False,Thomson Ho
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True,Mary Cheng


In [None]:
# Re-read student2.xlsx data file
student = pd.read_excel("/content/drive/MyDrive/Data/student2.xlsx",sheet_name="sheet1")

# Find the number of characters in Firstname and put the 
# number in the FnameLen column
student["FnameLen"] = student["Firstname"].str.len() 
display(student.head())

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan,FnameLen
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False,3
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True,5
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False,6
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False,7
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True,4


In [None]:
# Re-read student2.xlsx data file
student = pd.read_excel("/content/drive/MyDrive/Data/student2.xlsx",sheet_name="sheet1")

# We will find the first occurrence of "Swim" in Hobby and put the starting 
# position in the column Pos. If Pos is greater than or equal to zero, then
# one of the hobbies is "Swim".
# 
# We will then create a Boolean column Swim to indicate if a student has the swim hobby.
student["Pos"] = student["Hobby"].str.find("Swim")
student["Swim"] = student["Pos"]>=0
display(student.head())

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan,Pos,Swim
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False,6,True
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True,0,True
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False,12,True
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False,-1,False
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True,-1,False
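
As an aside, pandas also provides `str.contains`, which returns the Boolean result directly, without an intermediate position column. The sketch below is an alternative to the `find` approach above, not a replacement for it. The DataFrame `toy` is hypothetical and reuses three hobby strings from the student data. `regex=False` is passed so that "Swim" is treated as plain text rather than as a regular expression.

```python
import pandas as pd

# Hypothetical three-row DataFrame; the strings come from the Hobby column
toy = pd.DataFrame({"Hobby": ["Chess/Swim", "Swim/Football", "Reading/Dance"]})

# True where the hobby string contains "Swim", False otherwise
has_swim = toy["Hobby"].str.contains("Swim", regex=False)
toy["Swim"] = has_swim

print(toy)
```

The `find` version is still useful when the position itself is needed, for example to slice the string at that point.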


In [4]:
# Re-read student2.xlsx data file
student = pd.read_excel("/content/drive/MyDrive/Data/student2.xlsx",sheet_name="sheet1")

# We want to construct a Name column which contains the first letter of 
# the first name and the lastname of the students.
# E.g. Amy Chan will be stored as A. Chan in Name

# Extract the first letter from Firstname
# slice(0,1) keeps position 0 only, because the end position is excluded
student["Name"] = student["Firstname"].str.slice(0,1)

# Add the dot & space ". " and lastname to Name
student["Name"] = student["Name"] + ". " + student["Lastname"]
display(student.head())

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan,Name
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False,A. Chan
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True,B. Lee
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False,J. Lam
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False,T. Ho
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True,M. Cheng
