# **Lecture 8A**
# **Transforming columns in DataFrame**


In this part, we will look at several ways to create and modify columns in a DataFrame.<br>
Before we start, run the two cells below. They load student100.csv into the **student** DataFrame.

In [1]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# import Pandas module
import pandas as pd

# Read student100.csv data file
student = pd.read_csv("/content/drive/MyDrive/Data/student100.csv")
display(student)

Unnamed: 0,StudentID,Gender,Math,English,Chinese,GPA,Loan
0,1,M,69,49,86,3.01,Yes
1,2,M,42,47,71,1.66,Yes
2,3,M,85,43,65,2.57,No
3,4,F,64,66,66,2.52,No
4,5,F,73,41,69,1.84,No
...,...,...,...,...,...,...,...
95,96,F,60,35,70,1.84,Yes
96,97,M,78,48,95,3.67,No
97,98,M,69,57,64,2.13,Yes
98,99,M,69,60,76,2.81,No


---
**Example 1:** We can create/modify a column in a DataFrame by using other columns in the DataFrame.
* Suppose **df** is a DataFrame.
* **df[*new_column*] = *expression involving other columns*** will create a new column in the DataFrame.
* **df[*old_column*] = *expression involving other columns*** will overwrite an existing column in the DataFrame.


In [4]:
# Create a new variable GPA_New by rescaling the GPA from a 0-4 range to a 0-10 range.
# Multiplying by 10/4 = 2.5 does the rescaling.
student["GPA_New"] = student["GPA"] * 2.5
display(student.head())

Unnamed: 0,StudentID,Gender,Math,English,Chinese,GPA,Loan,GPA_New
0,1,M,69,49,86,3.01,Yes,7.525
1,2,M,42,47,71,1.66,Yes,4.15
2,3,M,85,43,65,2.57,No,6.425
3,4,F,64,66,66,2.52,No,6.3
4,5,F,73,41,69,1.84,No,4.6


In [None]:
# Rescale GPA_New from a range 0-10 to a range 0-5.
student["GPA_New"] = student["GPA_New"]/2
display(student.head())

Unnamed: 0,StudentID,Gender,Math,English,Chinese,GPA,Loan,GPA_New
0,1,M,69,49,86,3.01,Yes,3.7625
1,2,M,42,47,71,1.66,Yes,2.075
2,3,M,85,43,65,2.57,No,3.2125
3,4,F,64,66,66,2.52,No,3.15
4,5,F,73,41,69,1.84,No,2.3
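As a complement to the bracket assignment above, the sketch below uses **DataFrame.assign**, which returns a new DataFrame instead of changing the original one. The **toy** DataFrame and its values are made up for illustration and only stand in for student100.csv.

```python
import pandas as pd

# Toy data standing in for student100.csv (values are made up for illustration)
toy = pd.DataFrame({"StudentID": [1, 2, 3], "GPA": [3.0, 2.0, 4.0]})

# assign() returns a new DataFrame with the extra column; `toy` is left unchanged
rescaled = toy.assign(GPA_New=toy["GPA"] * 2.5)

# Using the name of an existing column overwrites that column in the returned copy
halved = rescaled.assign(GPA_New=rescaled["GPA_New"] / 2)

print(halved)
```

This style can be handy when you want to keep the original DataFrame intact while experimenting with new columns.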


---
**Example 2:** As another example, we calculate the average of **Math**, **English** and **Chinese** for each student and store it in a new column **Overall**.

In [None]:
# Calculate a new column "Overall"
student["Overall"] = (student["Math"]+student["English"]+student["Chinese"])/3
display(student)

Unnamed: 0,StudentID,Gender,Math,English,Chinese,GPA,Loan,GPA_New,Overall
0,1,M,69,49,86,3.01,Yes,3.7625,68.000000
1,2,M,42,47,71,1.66,Yes,2.0750,53.333333
2,3,M,85,43,65,2.57,No,3.2125,64.333333
3,4,F,64,66,66,2.52,No,3.1500,65.333333
4,5,F,73,41,69,1.84,No,2.3000,61.000000
...,...,...,...,...,...,...,...,...,...
95,96,F,60,35,70,1.84,Yes,2.3000,55.000000
96,97,M,78,48,95,3.67,No,4.5875,73.666667
97,98,M,69,57,64,2.13,Yes,2.6625,63.333333
98,99,M,69,60,76,2.81,No,3.5125,68.333333
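The same row-wise average can also be written with **mean(axis=1)** on a selection of columns. The sketch below uses a small made-up **toy** DataFrame (its two rows copy the first two students shown above) so that it runs without the CSV file.

```python
import pandas as pd

# Toy data: the first two students from the table above
toy = pd.DataFrame({"Math": [69, 42], "English": [49, 47], "Chinese": [86, 71]})

# Select the subject columns and average across each row (axis=1)
subjects = ["Math", "English", "Chinese"]
toy["Overall"] = toy[subjects].mean(axis=1)

print(toy)
```

One difference worth knowing: by default **mean** skips missing values, whereas adding the columns with **+** gives a missing result whenever any of the subjects is missing.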


---
**Example 3:** We can also create a new string variable from an existing numeric variable, for example a **Math_Grade** variable derived from the **Math** variable.

* This can be done with the syntax **df.loc[*condition*,*column_name*]=*expression***.
* ***condition*** is a boolean expression involving columns in the DataFrame.
* ***column_name*** is the name of the column that stores the result.
* ***expression*** is the string or numeric value assigned to the rows where ***condition*** is True.

Some alternative methods can be found at https://towardsdatascience.com/efficient-implementation-of-conditional-logic-on-pandas-dataframes-4afa61eb7fce


In [None]:
# Assigning grades to Math_Grade
# Make sure that the conditions cover all possible cases
student.loc[student["Math"]>=80, "Math_Grade"] = "A"
student.loc[(student["Math"]>=60) & (student["Math"]<80), "Math_Grade"] = "B"
student.loc[(student["Math"]>=50) & (student["Math"]<60), "Math_Grade"] = "C"
student.loc[(student["Math"]>=45) & (student["Math"]<50), "Math_Grade"] = "D"
student.loc[(student["Math"]<45), "Math_Grade"] = "F"
display(student)
print(student["Math_Grade"].value_counts())

Unnamed: 0,StudentID,Gender,Math,English,Chinese,GPA,Loan,Math_Grade
0,1,M,69,49,86,3.01,Yes,B
1,2,M,42,47,71,1.66,Yes,F
2,3,M,85,43,65,2.57,No,A
3,4,F,64,66,66,2.52,No,B
4,5,F,73,41,69,1.84,No,B
...,...,...,...,...,...,...,...,...
95,96,F,60,35,70,1.84,Yes,B
96,97,M,78,48,95,3.67,No,B
97,98,M,69,57,64,2.13,Yes,B
98,99,M,69,60,76,2.81,No,B


B    52
C    30
D    10
A     5
F     3
Name: Math_Grade, dtype: int64
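As one possible alternative to the repeated **.loc** assignments above, the sketch below uses NumPy's **np.select** to assign all grades in a single step. The **toy** DataFrame is made up for illustration; the grade boundaries are the same as in the cell above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Math column (values are made up for illustration)
toy = pd.DataFrame({"Math": [69, 42, 85, 47, 55]})

# Conditions are checked in order; for each row the first True condition wins,
# so the upper bounds (<80, <60, ...) do not need to be written out
conditions = [
    toy["Math"] >= 80,
    toy["Math"] >= 60,
    toy["Math"] >= 50,
    toy["Math"] >= 45,
]
choices = ["A", "B", "C", "D"]

# Rows matching none of the conditions get the default value "F"
toy["Math_Grade"] = np.select(conditions, choices, default="F")

print(toy)
```

Note that a missing Math score matches none of the conditions and would therefore also receive the default "F", so missing values should be handled separately if they can occur.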


---
**Example 4:** String variables can be transformed based on conditions as well. For example, we can replace "F" with "Female" and "M" with "Male" in the **Gender** variable.

In [None]:
# Replace "M" with "Male" and "F" with "Female"
student.loc[student["Gender"]=="M", "Gender"] = "Male"
student.loc[student["Gender"]=="F", "Gender"] = "Female"
display(student)

Unnamed: 0,StudentID,Gender,Math,English,Chinese,GPA,Loan,Math_Grade
0,1,Male,69,49,86,3.01,Yes,B
1,2,Male,42,47,71,1.66,Yes,F
2,3,Male,85,43,65,2.57,No,A
3,4,Female,64,66,66,2.52,No,B
4,5,Female,73,41,69,1.84,No,B
...,...,...,...,...,...,...,...,...
95,96,Female,60,35,70,1.84,Yes,B
96,97,Male,78,48,95,3.67,No,B
97,98,Male,69,57,64,2.13,Yes,B
98,99,Male,69,60,76,2.81,No,B
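When the transformation is a simple value-for-value substitution like this one, **Series.replace** with a dictionary is a compact alternative. The sketch below uses a made-up **toy** DataFrame for illustration.

```python
import pandas as pd

# Toy data standing in for the Gender column
toy = pd.DataFrame({"Gender": ["M", "F", "M"]})

# Map each old value to its replacement; values not in the dictionary stay unchanged
toy["Gender"] = toy["Gender"].replace({"M": "Male", "F": "Female"})

print(toy)
```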


---
**Example 5:** Other than creating and modifying columns, we can also drop existing columns. For a DataFrame **df**, you can drop one or more columns with the following syntax.

* **df.drop(*column_name*,axis=1)** will drop the column with the given name.
* **df.drop(*list_of_columns*,axis=1)** will drop the columns in the given list.
* **drop** returns a new DataFrame and leaves **df** unchanged, so assign the result back (as in the code below) to keep the change.

In [None]:
# This is the original DataFrame
student = pd.read_csv("/content/drive/MyDrive/Data/student100.csv")
display(student.head())

# Dropping the Loan column
student = student.drop("Loan",axis=1)
display(student.head())

# Dropping Math, English and Chinese
student = student.drop(["Math","English","Chinese"],axis=1)
display(student.head())


Unnamed: 0,StudentID,Gender,Math,English,Chinese,GPA,Loan
0,1,M,69,49,86,3.01,Yes
1,2,M,42,47,71,1.66,Yes
2,3,M,85,43,65,2.57,No
3,4,F,64,66,66,2.52,No
4,5,F,73,41,69,1.84,No


Unnamed: 0,StudentID,Gender,Math,English,Chinese,GPA
0,1,M,69,49,86,3.01
1,2,M,42,47,71,1.66
2,3,M,85,43,65,2.57
3,4,F,64,66,66,2.52
4,5,F,73,41,69,1.84


Unnamed: 0,StudentID,Gender,GPA
0,1,M,3.01
1,2,M,1.66
2,3,M,2.57
3,4,F,2.52
4,5,F,1.84
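The sketch below shows the equivalent **columns=** keyword of **drop**, which avoids having to remember **axis=1**, and confirms that the original DataFrame is not modified. The **toy** DataFrame is made up for illustration.

```python
import pandas as pd

# Toy data standing in for student100.csv (two rows, four columns)
toy = pd.DataFrame({
    "StudentID": [1, 2],
    "Math": [69, 42],
    "GPA": [3.01, 1.66],
    "Loan": ["Yes", "Yes"],
})

# Drop a single column; the result is a new DataFrame
no_loan = toy.drop(columns="Loan")

# Drop several columns at once by passing a list
slim = toy.drop(columns=["Math", "Loan"])

# The original DataFrame still has all four columns
print(list(toy.columns))
print(list(no_loan.columns))
print(list(slim.columns))
```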
