# **Lecture 6A**
# **Introduction to Pandas Module**


---
**Example 1:** To use the functions in the pandas module, we have to import the module first. There are several ways to import it.
* **import pandas** - If you use this method, every function must be called with the prefix **pandas**, e.g. **pandas.**DataFrame()
* **import pandas as pd** - If you use this method, we give the module a shorter alias **pd**. Functions are then called with the prefix **pd** instead of **pandas**.
* **from pandas import *** - If you use this method, the module name is not needed.
* There are several other ways to import the module, but the 2nd method is the most common one.
* You only need to import the module once in a session. 

---
We have used the **math** module before for math-related functions. There are many modules in Python for data processing. One of the most important ones is the **pandas** module.<br>

The **pandas** module provides many functions for working with data sets. Data sets are typically two-dimensional tables, in which rows correspond to entities (such as students) and columns correspond to variables (such as scores in different subjects). Below is a typical example of a data set.

<table>
<tr><td><b>Name</b></td><td><b>Math</b></td><td><b>English</b></td><td><b>Chinese</b></td></tr>
<tr><td>Amy</td><td>45</td><td>86</td><td>77</td></tr>
<tr><td>Betty</td><td>78</td><td>65</td><td>89</td></tr>
<tr><td>Johnny</td><td>69</td><td>91</td><td>54</td></tr>
<tr><td>Thomson</td><td>88</td><td>73</td><td>67</td></tr>
<tr><td>Mary</td><td>81</td><td>55</td><td>50</td></tr>
</table>

In [None]:
# Method 1
# all functions will have a prefix "pandas"
import pandas          

# Method 2
# all functions will have a prefix "pd"
import pandas as pd

# Method 3
# No prefix is needed 
from pandas import *
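
As a quick check of the first two import styles, the short sketch below shows that both prefixes refer to the same module, so `pandas.DataFrame` and `pd.DataFrame` are the same class. The variable names `same_module` and `same_class` are illustrative only.

```python
import pandas
import pandas as pd

# Both names are bound to the same module object
same_module = pandas is pd

# So the functions reached through either prefix are identical
same_class = pandas.DataFrame is pd.DataFrame

print(same_module, same_class)
```

This is why the choice between Method 1 and Method 2 only affects how much you type, not what the functions do.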

---
**Example 2:** Creating DataFrame from a dictionary of lists.<br>
Suppose we have the following data set.

<table>
<tr><td><b>Name</b></td><td><b>Math</b></td><td><b>English</b></td><td><b>Chinese</b></td></tr>
<tr><td>Amy</td><td>45</td><td>86</td><td>77</td></tr>
<tr><td>Betty</td><td>78</td><td>65</td><td>89</td></tr>
<tr><td>Johnny</td><td>69</td><td>91</td><td>54</td></tr>
<tr><td>Thomson</td><td>88</td><td>73</td><td>67</td></tr>
<tr><td>Mary</td><td>81</td><td>55</td><td>50</td></tr>
</table>

* We can put each column in a list. e.g. Math scores are stored in a list [45,78,69,88,81].
* We will have 4 lists, one for each column.
* A dictionary is created to store the 4 lists. Key will be the variable name, value will be the list containing the data of the variable. e.g. "Math":[45,78,69,88,81].

Such a data structure can be converted to a **pandas DataFrame**.
* **pd.DataFrame(*data*)** - This function converts the input dictionary ***data*** into a pandas DataFrame.
* When you want to show a DataFrame in the output, it is better to use **display()** instead of **print()**.
* If we can represent the data set using lists and dictionaries, why do we need the pandas module? Because pandas provides many functions for tasks that are not easy to do with plain lists or dictionaries.

In [None]:
# The data set is represented using a dictionary of lists
# key is the variable name, value is a list of data values.
data = {
  "Name":["Amy","Betty","Johnny","Thomson","Mary"],
  "Math":[45,78,69,88,81],
  "English":[86,65,91,73,55],
  "Chinese":[77,89,54,67,50]
}
print("This is the data:")
display(data)
print(type(data))
print()

# Create the DataFrame
import pandas as pd
datadf = pd.DataFrame(data)
print("This is the Pandas DataFrame:")
display(datadf)       # Print the Data Frame
print(type(datadf))   # Show us the type

This is the data:


{'Name': ['Amy', 'Betty', 'Johnny', 'Thomson', 'Mary'],
 'Math': [45, 78, 69, 88, 81],
 'English': [86, 65, 91, 73, 55],
 'Chinese': [77, 89, 54, 67, 50]}

<class 'dict'>

This is the Pandas DataFrame:


Unnamed: 0,Name,Math,English,Chinese
0,Amy,45,86,77
1,Betty,78,65,89
2,Johnny,69,91,54
3,Thomson,88,73,67
4,Mary,81,55,50


<class 'pandas.core.frame.DataFrame'>
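
To make the "why pandas" point above concrete, here is a minimal sketch that computes the average Math score once with plain Python and once with pandas, and then picks out the students with a Math score of at least 80. Column selection, `.mean()` and the filtering expression are not covered in this lecture; they appear here only as a preview, and the threshold of 80 is an arbitrary choice for illustration.

```python
import pandas as pd

data = {
    "Name": ["Amy", "Betty", "Johnny", "Thomson", "Mary"],
    "Math": [45, 78, 69, 88, 81],
    "English": [86, 65, 91, 73, 55],
    "Chinese": [77, 89, 54, 67, 50],
}
datadf = pd.DataFrame(data)

# Plain Python: average Math score from the dictionary of lists
math_mean_plain = sum(data["Math"]) / len(data["Math"])

# pandas: the same average using a column method
math_mean_pd = datadf["Math"].mean()

# pandas: names of students whose Math score is at least 80
high_math = datadf[datadf["Math"] >= 80]["Name"].tolist()

print(math_mean_plain, math_mean_pd)
print(high_math)
```

With only lists and dictionaries, the last step would need an explicit loop over two lists at once; with a DataFrame it is a single expression.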


---
**Example 3:** Creating DataFrame from a list of lists.<br>
Suppose we are reusing the data in Example 2.
* For the student Amy, her name and scores in the 3 subjects will be stored in a list ["Amy",45,86,77]. 
* We will create a list for each student. i.e. 5 lists.
* Finally, we put all 5 lists in a list.

Such a data structure can also be converted to a DataFrame using the function **pd.DataFrame()**.
* Although the usage of pd.DataFrame() is similar in this example, we need one extra argument **columns=** when calling the function.
* Since there are no keys associated with the scores, the program does not know which score corresponds to which subject. The **columns=** argument allows us to specify a list of column names when creating the DataFrame.

In [None]:
# A list containing information about 5 students
# Each student is represented by a list
data = [
    ["Amy",45,86,77],
    ["Betty",78,65,89],
    ["Johnny",69,91,54],
    ["Thomson",88,73,67],
    ["Mary",81,55,50]
]
print("This is the data:")
display(data)
print(type(data))
print()

# Create DataFrame
# You can see that the variable names are specified using columns=[...]
import pandas as pd
datadf = pd.DataFrame(data, columns=["Name","Math","English","Chinese"])
print("This is the Pandas DataFrame:")
display(datadf)
print(type(datadf))

This is the data:


[['Amy', 45, 86, 77],
 ['Betty', 78, 65, 89],
 ['Johnny', 69, 91, 54],
 ['Thomson', 88, 73, 67],
 ['Mary', 81, 55, 50]]

<class 'list'>

This is the Pandas DataFrame:


Unnamed: 0,Name,Math,English,Chinese
0,Amy,45,86,77
1,Betty,78,65,89
2,Johnny,69,91,54
3,Thomson,88,73,67
4,Mary,81,55,50


<class 'pandas.core.frame.DataFrame'>
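
To see why **columns=** matters for a list of lists, the sketch below builds the DataFrame with and without it. Only two students are used to keep it short; the variable names `rows`, `no_names` and `with_names` are illustrative.

```python
import pandas as pd

rows = [
    ["Amy", 45, 86, 77],
    ["Betty", 78, 65, 89],
]

# Without columns=, pandas labels the columns 0, 1, 2, 3
no_names = pd.DataFrame(rows)

# With columns=, each position gets a meaningful variable name
with_names = pd.DataFrame(rows, columns=["Name", "Math", "English", "Chinese"])

print(list(no_names.columns))
print(list(with_names.columns))
```

The data values are identical in both DataFrames; only the column labels differ.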


---
**Example 4:** Creating DataFrame from a list of dictionaries.<br>
* Each row (a student) in the data set is a dictionary. The keys are the variable names and the values are the student's name and scores.
* All the dictionaries will be put in a big list.<br>

This data structure can also be converted to a DataFrame.

In [None]:
# A list containing information about 5 students
# Each student is represented using a dictionary
data = [
    {"Name":"Amy","Math":45,"English":86,"Chinese":77},
    {"Name":"Betty","Math":78,"English":65,"Chinese":89},
    {"Name":"Johnny","Math":69,"English":91,"Chinese":54},
    {"Name":"Thomson","Math":88,"English":73,"Chinese":67},
    {"Name":"Mary","Math":81,"English":55,"Chinese":50},
]
print("This is the data:")
display(data)
print(type(data))
print()

# Create DataFrame
import pandas as pd
datadf = pd.DataFrame(data)
print("This is the Pandas DataFrame:")
display(datadf)
print(type(datadf))


This is the data:


[{'Name': 'Amy', 'Math': 45, 'English': 86, 'Chinese': 77},
 {'Name': 'Betty', 'Math': 78, 'English': 65, 'Chinese': 89},
 {'Name': 'Johnny', 'Math': 69, 'English': 91, 'Chinese': 54},
 {'Name': 'Thomson', 'Math': 88, 'English': 73, 'Chinese': 67},
 {'Name': 'Mary', 'Math': 81, 'English': 55, 'Chinese': 50}]

<class 'list'>

This is the Pandas DataFrame:


Unnamed: 0,Name,Math,English,Chinese
0,Amy,45,86,77
1,Betty,78,65,89
2,Johnny,69,91,54
3,Thomson,88,73,67
4,Mary,81,55,50


<class 'pandas.core.frame.DataFrame'>
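
Because each dictionary carries its own keys, a list of dictionaries does not need **columns=**, and the rows do not even have to share exactly the same keys. The sketch below assumes a hypothetical record for Betty with no "Chinese" score, to show that pandas fills the gap with a missing value (NaN).

```python
import pandas as pd

records = [
    {"Name": "Amy", "Math": 45, "English": 86, "Chinese": 77},
    # Hypothetical incomplete record: the "Chinese" key is left out
    {"Name": "Betty", "Math": 78, "English": 65},
]

df = pd.DataFrame(records)
print(df)

# True where the Chinese score is missing
missing = df["Chinese"].isna().tolist()
print(missing)
```

Note that the "Chinese" column now holds decimal numbers, since NaN is a floating-point value.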


---
**Example 5:** The most commonly used method to create a DataFrame is to read it from a CSV file or an Excel file.
* Make sure student.csv and student.xlsx are stored in your Google Drive.
* To access the files from Google Drive, you need to execute the first cell below. After you execute it, your Google Drive folder will show up on the left.
* A CSV file can be read using **pd.read_csv(*file_path*)**. The argument ***file_path*** is a string containing the location of the file.
* An XLSX file can be read using **pd.read_excel(*file_path*,*worksheet*)**. The first argument ***file_path*** is a string containing the location of the file. The second argument is a string containing the name of the worksheet to be loaded.
* The file location can be found in the Files section on the left panel of Google Colab. Right-click the file and select "Copy path".

Some further notes about CSV & XLSX files:
* XLSX files are the default file format of Excel.
* CSV files are text files in which data values are separated by commas. Since a CSV file is a text file, it can be opened with any text editor.
* You can create CSV files easily by saving an Excel worksheet using "Save as". Make sure you change the file type to **"CSV UTF-8 (Comma delimited)"** in the file dialog. 



In [None]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# import Pandas module
import pandas as pd

# Read CSV file into DataFrame 
datadf1 = pd.read_csv("/content/drive/MyDrive/Data/student.csv")
print("This is the Pandas DataFrame:")
display(datadf1)
print()

# Read XLSX file into DataFrame 
# We are reading "sheet1" from the file student.xlsx
datadf2 = pd.read_excel("/content/drive/MyDrive/Data/student.xlsx",sheet_name="sheet1")
print("This is the Pandas DataFrame:")
display(datadf2)

This is the Pandas DataFrame:


Unnamed: 0,Name,Math,English,Chinese
0,Amy,45,86,77
1,Betty,78,65,89
2,Johnny,69,91,54
3,Thomson,88,73,67
4,Mary,81,55,50



This is the Pandas DataFrame:


Unnamed: 0,Name,Math,English,Chinese
0,Amy,45,86,77
1,Betty,78,65,89
2,Johnny,69,91,54
3,Thomson,88,73,67
4,Mary,81,55,50
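
Since a CSV file is just comma-separated text, you can try **pd.read_csv()** without Google Drive by reading from a string. The sketch below is an illustration only: `csv_text` is made-up content mimicking the first lines of student.csv, and `io.StringIO` wraps the string so that it behaves like an open file.

```python
import io
import pandas as pd

# Text in CSV form: the first line holds the column names
csv_text = """Name,Math,English,Chinese
Amy,45,86,77
Betty,78,65,89
"""

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)   # (number of rows, number of columns)
```

Reading from a real file works the same way; only the argument changes from the file-like object to the file path.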
