# **Lecture 6B**
# **Pandas DataFrame Basics**


A Pandas DataFrame is a two-dimensional container. Each row holds the data collected from one entity (e.g. a student or a customer). Each column is a variable (e.g. a student's Math score or English score). Different columns can have different types (e.g. integers, floating point numbers, strings, or boolean values). <br>
Each column in a DataFrame is a Series, which is also a container!
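
To make the row/column picture concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The name `students` and the three-row sample are illustrative assumptions (the values are taken from the first rows of the student data used in this lecture), and the sketch uses the common `import pandas as pd` style rather than `from pandas import *`.

```python
import pandas as pd

# Hypothetical miniature of the student data: each dict key becomes a column,
# and each position in the lists becomes a row.
students = pd.DataFrame({
    "Firstname": ["Amy", "Betty", "Johnny"],
    "Math": [57, 60, 37],
    "GPA": [1.79, 0.58, 1.83],
    "Loan": [False, True, False],
})

print(students.shape)         # (number of rows, number of columns)
print(students.dtypes)        # one data type per column
print(type(students["GPA"]))  # a single column is a Series
```

Each column keeps its own type, which is why `dtypes` reports one entry per column instead of a single type for the whole table.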


---
**Example 1:** We are reading another Excel file **student2.xlsx**. It contains many different variables of different types. This file will be used in the subsequent examples.

In [None]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import everything from the Pandas module
from pandas import *

# Read XLSX file into DataFrame
# We are reading "sheet1" from the file student2.xlsx
datadf = read_excel("/content/drive/MyDrive/Data/student2.xlsx", sheet_name="sheet1")
print("The DataFrame:")
display(datadf)

The DataFrame:


| | StudentID | Firstname | Lastname | Gender | Math | English | Chinese | Hobby | GPA | Scholarship | Loan |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Amy | Chan | F | 57 | 90 | 86 | Chess/Swim | 1.79 | False | False |
| 1 | 2 | Betty | Lee | F | 60 | 68 | 79 | Swim/Football | 0.58 | False | True |
| 2 | 3 | Johnny | Lam | M | 37 | 89 | 65 | Music/Dance/Swim | 1.83 | False | False |
| 3 | 4 | Thomson | Ho | M | 36 | 93 | 43 | Reading/Dance | 2.03 | False | False |
| 4 | 5 | Mary | Cheng | F | 35 | 38 | 80 | Singing/Dance/Chess | 1.78 | False | True |
| 5 | 6 | Jerry | Li | M | 54 | 99 | 37 | Ping Pong/Swim/Cycling | 3.32 | True | False |
| 6 | 7 | Bob | Wong | M | 88 | 36 | 26 | Reading/Swim | 2.81 | False | True |
| 7 | 8 | Peter | Yeung | M | 83 | 90 | 68 | Gaming/Football | 2.37 | False | False |
| 8 | 9 | Clara | Yau | F | 51 | 65 | 45 | Football/Music/Art | 3.02 | True | True |
| 9 | 10 | Jacky | Lee | M | 90 | 72 | 94 | Art/Reading/Swim | 3.89 | True | False |


---
**Example 2:** Extracting a **Series** (column) from a DataFrame. 
* A column of the DataFrame has a special data type called **Series**.
* A Series can be extracted by using the square bracket operator like **datadf[*column_name*]**.
* If ***column_list*** is a list containing column names, then **datadf[*column_list*]** will return a DataFrame containing those columns.

In [None]:
# Get a column (i.e. a Series) from the DataFrame by using the square bracket operator.
# You simply put the name of a column inside the square bracket.
x = datadf["Math"]
print(x)        # The Series contains integers 
print(type(x))  # This is a Series, not a list

0    57
1    60
2    37
3    36
4    35
5    54
6    88
7    83
8    51
9    90
Name: Math, dtype: int64
<class 'pandas.core.series.Series'>


In [None]:
# Get multiple columns from the DataFrame by using square bracket operator.
column_list = ["Math","English","Chinese"]
x = datadf[column_list]
print(x)
print(type(x)) # When you request multiple columns, you get a DataFrame.

   Math  English  Chinese
0    57       90       86
1    60       68       79
2    37       89       65
3    36       93       43
4    35       38       80
5    54       99       37
6    88       36       26
7    83       90       68
8    51       65       45
9    90       72       94
<class 'pandas.core.frame.DataFrame'>
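
One detail worth checking is what happens when the list contains only one column name. The sketch below assumes a hypothetical three-row `sample` DataFrame instead of **student2.xlsx**, and uses `import pandas as pd` so that it is self-contained.

```python
import pandas as pd

# Hypothetical three-row sample of the student data
sample = pd.DataFrame({"Math": [57, 60, 37], "English": [90, 68, 89]})

as_series = sample["Math"]    # a string inside the brackets gives a Series
as_frame = sample[["Math"]]   # a list inside the brackets gives a DataFrame

print(type(as_series))
print(type(as_frame))
print(as_frame.shape)         # still two-dimensional: 3 rows, 1 column
```

The result type depends on whether you pass a name or a list of names, not on how many columns end up being selected.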


---
**Example 3:** Accessing a data value (a cell) in a DataFrame by using the **loc[*row,col*]** property.
* The first argument ***row*** specifies which row(s) will be extracted. The most basic usage is to specify a single row index. **loc** looks rows up by their index labels; with the default index used here, the labels are integers starting from 0, just like the positions in a list.
* The second argument ***col*** specifies which column(s) will be extracted. The most basic usage is to specify the name of a column.

In [None]:
# Get the value located at row 2 and column "Lastname"
x = datadf.loc[2,"Lastname"]
print(x)
print(type(x))

# Get the value located at row 0 and column "English"
x = datadf.loc[0,"English"]
print(x)
print(type(x))


Lam
<class 'str'>
90
<class 'numpy.int64'>


---
**Example 4:** Accessing a column or row by using **loc[*row,col*]** property.
* We can use **loc[*row,:*]** or **loc[*row*]** to extract a row from the DataFrame. ***row*** will be the index of the row we want to extract. A colon **":"** in the column specification means that all columns are needed.
* We can use **loc[*:,col*]** to extract a column from the DataFrame. ***col*** is the name of the column we want to extract. A colon **":"** in the row specification means that all rows are needed.


In [None]:
# Get row 3 from a DataFrame
x = datadf.loc[3,:]
display(x)
print(type(x))

x = datadf.loc[3]
print(x)
print(type(x))

StudentID                  4
Firstname            Thomson
Lastname                  Ho
Gender                     M
Math                      36
English                   93
Chinese                   43
Hobby          Reading/Dance
GPA                     2.03
Scholarship            False
Loan                   False
Name: 3, dtype: object

<class 'pandas.core.series.Series'>
StudentID                  4
Firstname            Thomson
Lastname                  Ho
Gender                     M
Math                      36
English                   93
Chinese                   43
Hobby          Reading/Dance
GPA                     2.03
Scholarship            False
Loan                   False
Name: 3, dtype: object
<class 'pandas.core.series.Series'>


In [None]:
# Get Gender column from a DataFrame
x = datadf.loc[:,"Gender"]
print(x)
print(type(x))

0    F
1    F
2    M
3    M
4    F
5    M
6    M
7    M
8    F
9    M
Name: Gender, dtype: object
<class 'pandas.core.series.Series'>
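
The different spellings in this example select the same data. A minimal sketch that checks this with `Series.equals`, assuming a hypothetical three-row `sample` DataFrame in place of **datadf**:

```python
import pandas as pd

# Hypothetical three-row sample of the student data
sample = pd.DataFrame({"Gender": ["F", "F", "M"], "Math": [57, 60, 37]})

# Two ways to get the same column
col_a = sample.loc[:, "Gender"]
col_b = sample["Gender"]
print(col_a.equals(col_b))

# Two ways to get the same row
row_a = sample.loc[1, :]
row_b = sample.loc[1]
print(row_a.equals(row_b))
```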


---
**Example 5:** We can use ***slicing*** in the row specification of **loc[*row,col*]**.
* **loc[*start:end,:*]** will return all rows with index from ***start*** to ***end INCLUSIVE***.
* If you recall, slicing a list does not include the end index, but the end index is included when slicing with **loc[*row,col*]** in a DataFrame.
* Slicing will produce a DataFrame containing a subset of rows from the original DataFrame.


In [None]:
# Get rows 3 to 7 (inclusive) from datadf
x = datadf.loc[3:7,:]
display(x)
print(type(x))

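| | StudentID | Firstname | Lastname | Gender | Math | English | Chinese | Hobby | GPA | Scholarship | Loan |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | Thomson | Ho | M | 36 | 93 | 43 | Reading/Dance | 2.03 | False | False |
| 4 | 5 | Mary | Cheng | F | 35 | 38 | 80 | Singing/Dance/Chess | 1.78 | False | True |
| 5 | 6 | Jerry | Li | M | 54 | 99 | 37 | Ping Pong/Swim/Cycling | 3.32 | True | False |
| 6 | 7 | Bob | Wong | M | 88 | 36 | 26 | Reading/Swim | 2.81 | False | True |
| 7 | 8 | Peter | Yeung | M | 83 | 90 | 68 | Gaming/Football | 2.37 | False | False |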


<class 'pandas.core.frame.DataFrame'>
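
The difference between list slicing and **loc** slicing is easy to see side by side. The sketch below assumes a hypothetical list `scores` holding the first six Math scores, plus a one-column DataFrame built from it.

```python
import pandas as pd

# Hypothetical data: the first six Math scores
scores = [57, 60, 37, 36, 35, 54]
sample = pd.DataFrame({"Math": scores})

list_part = scores[1:3]        # list slicing: positions 1 and 2 (end excluded)
loc_part = sample.loc[1:3, :]  # loc slicing: labels 1, 2 and 3 (end included)

print(len(list_part))
print(len(loc_part))
print(loc_part["Math"].tolist())
```

The same `1:3` therefore gives two items from the list but three rows from the DataFrame.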


---
**Example 6:** Using the **loc[*row,col*]** property to extract multiple columns from a DataFrame.
* If ***column_list*** is a list containing the names of some columns in the DataFrame, then **loc[*:,column_list*]** will return a DataFrame containing the columns you requested.

In [None]:
# Get StudentID, Math and English from the DataFrame
columns = ["StudentID","Math","English"]
x = datadf.loc[:,columns]
display(x)
print(type(x))


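| | StudentID | Math | English |
|---|---|---|---|
| 0 | 1 | 57 | 90 |
| 1 | 2 | 60 | 68 |
| 2 | 3 | 37 | 89 |
| 3 | 4 | 36 | 93 |
| 4 | 5 | 35 | 38 |
| 5 | 6 | 54 | 99 |
| 6 | 7 | 88 | 36 |
| 7 | 8 | 83 | 90 |
| 8 | 9 | 51 | 65 |
| 9 | 10 | 90 | 72 |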


<class 'pandas.core.frame.DataFrame'>


---
**Example 7:** Using row and column specification at the same time.
* The row specifications and column specifications we talked about in previous examples can be mixed.

In [None]:
# Get StudentID, Math and English of rows 4 to 8 from the DataFrame
columns = ["StudentID","Math","English"]
x = datadf.loc[4:8,columns]
print(x)
print(type(x))


   StudentID  Math  English
4          5    35       38
5          6    54       99
6          7    88       36
7          8    83       90
8          9    51       65
<class 'pandas.core.frame.DataFrame'>


---
**Example 8:** Row indices of extracted DataFrames.<br>
* If you extract some rows from a DataFrame into a new DataFrame, the row indices will carry over to the new DataFrame.
* If you want to reset the row indices to start from zero, you can use the method **reset_index(drop=True)** on the new DataFrame.
* The **drop=True** option means that we are dropping the old indices.

In [None]:
# Get StudentID, Math and English of rows 4 to 8 from the DataFrame
# newdf will keep the row indices from datadf as well
columns = ["StudentID","Math","English"]
newdf = datadf.loc[4:8,columns]
display(newdf)

# Resetting the row indices to start from 0
newdf = newdf.reset_index(drop=True)
display(newdf)

| | StudentID | Math | English |
|---|---|---|---|
| 4 | 5 | 35 | 38 |
| 5 | 6 | 54 | 99 |
| 6 | 7 | 88 | 36 |
| 7 | 8 | 83 | 90 |
| 8 | 9 | 51 | 65 |


| | StudentID | Math | English |
|---|---|---|---|
| 0 | 5 | 35 | 38 |
| 1 | 6 | 54 | 99 |
| 2 | 7 | 88 | 36 |
| 3 | 8 | 83 | 90 |
| 4 | 9 | 51 | 65 |
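
Because **loc** works with row labels, the carried-over indices matter in practice. The following is a minimal sketch, assuming a hypothetical six-row `sample` DataFrame; it also shows what **reset_index()** does when **drop=True** is left out.

```python
import pandas as pd

# Hypothetical six-row sample of the student data
sample = pd.DataFrame({"StudentID": [1, 2, 3, 4, 5, 6],
                       "Math": [57, 60, 37, 36, 35, 54]})

subset = sample.loc[3:5, :]   # the labels 3, 4, 5 carry over
print(list(subset.index))
print(0 in subset.index)      # label 0 does not exist in subset

# drop=True discards the old labels and renumbers from 0
renumbered = subset.reset_index(drop=True)
print(list(renumbered.index))
print(renumbered.loc[0, "StudentID"])

# Without drop=True, the old labels are kept as a new column named "index"
kept = subset.reset_index()
print(list(kept.columns))
```

Before the reset, asking for `subset.loc[0, "StudentID"]` would raise a `KeyError`, since no row carries the label 0.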


--- 
**Example 9:** Processed DataFrames can be written back to files using<br> 
**to_csv(file_path, index=True/False)** or<br> 
**to_excel(file_path, sheet_name=worksheet, index=True/False)**.<br>
* ***file_path*** is the path of the file you are writing to.
* **index=*True/False*** indicates whether the row indices will be written to the file. ***True*** means that the row indices will be written; ***False*** means that they will not be written. Usually, we use ***False*** when writing to a file.
* ***sheet_name=worksheet*** is only used when writing to an Excel file. ***worksheet*** is a string containing the name of the worksheet in the Excel file.


In [None]:
# Get StudentID, Math and English of rows 4 to 8 from the DataFrame
columns = ["StudentID","Math","English"]
newdf = datadf.loc[4:8,columns]
display(newdf)
newdf.to_csv("/content/drive/MyDrive/Data/outfile.csv",index=False)
newdf.to_excel("/content/drive/MyDrive/Data/outfile.xlsx",sheet_name="outdata",index=False)

| | StudentID | Math | English |
|---|---|---|---|
| 4 | 5 | 35 | 38 |
| 5 | 6 | 54 | 99 |
| 6 | 7 | 88 | 36 |
| 7 | 8 | 83 | 90 |
| 8 | 9 | 51 | 65 |
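
To see the effect of the **index** option without opening the output files, you can call **to_csv()** with no file path, in which case Pandas returns the CSV text as a string. This is a minimal sketch, assuming a hypothetical two-row `subset` DataFrame whose row indices are 4 and 5.

```python
import pandas as pd

# Hypothetical two-row extract whose row indices are 4 and 5
subset = pd.DataFrame({"StudentID": [5, 6], "Math": [35, 54]}, index=[4, 5])

# With no file path, to_csv() returns the CSV content as a string
with_index = subset.to_csv(index=True)
without_index = subset.to_csv(index=False)

print(with_index)      # first field of every line is the row index
print(without_index)   # only the column data is written
```

With **index=True**, the header line starts with an empty field because the index has no name. When such a file is read back, that unnamed column typically appears as an extra column, which is one reason ***False*** is the usual choice.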
