# Week 4
# Pandas Data Frames (Part 1)

[Pandas](https://pandas.pydata.org/) is a major tool for data scientists working in Python. It provides data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

References:
- Textbook Chapter 5: Getting Started with Pandas
- [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)
- [Pandas Exercises on W3Resources](https://www.w3resource.com/python-exercises/pandas/index.php)

In [1]:
import numpy as np
import pandas as pd # pd is the universally-used abbreviation

Pandas provides two data types that build on NumPy arrays:
- Series: a labeled 1D array, used to represent a single feature
- DataFrame: a labeled 2D array, used to represent a data table

We will focus on data frames today, as most data sets are stored in table format.
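To make the relationship between the two types concrete, here is a minimal sketch. The table and its column names `a` and `b` are made up for illustration; the point is only that a single column of a data frame is a Series, and that the values can be viewed as a plain NumPy array.

```python
import numpy as np
import pandas as pd

# A small table with hypothetical column names "a" and "b"
table = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

col = table["a"]      # selecting one column gives a Series (1D)
sub = table[["a"]]    # selecting a list of columns gives a DataFrame (2D)

print(type(col).__name__, col.shape)   # Series (3,)
print(type(sub).__name__, sub.shape)   # DataFrame (3, 1)
print(table.to_numpy())                # the values as a 2D NumPy array
```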

In [2]:
# Define a DataFrame from scratch
df1 = pd.DataFrame(np.random.rand(5, 3),
                   columns=['Feature1', 'Feature2', 'Feature3'])
df1.head()  # displays the first 5 rows by default

Unnamed: 0,Feature1,Feature2,Feature3
0,0.914287,0.489586,0.533515
1,0.207462,0.031232,0.241292
2,0.430043,0.689233,0.83744
3,0.797728,0.45387,0.070333
4,0.044556,0.402424,0.074989


In [3]:
# Print the shape of the data frame
print(df1.shape)

(5, 3)


In [4]:
# Row indices
# print(df1.index)
print(df1.index.values)

[0 1 2 3 4]


In [5]:
# Column indices
# print(df1.columns)
print(df1.columns.values)

['Feature1' 'Feature2' 'Feature3']


In [15]:
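# Access elements using .loc[row_label, col_label]
# Ex: Get the Feature1 value on the first row
# (a notebook cell displays only its last expression; wrap this in print() to see it)
df1.loc[0, 'Feature1']

# Ex: Display the value in the lower-right corner
df1.loc[4, 'Feature3']

# # There is also a position-based indexer, .iloc[], which accepts negative positions
# df1.iloc[-1, -1]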

0.4110340764183723

In [16]:
# Index slicing
# Ex: Print the Feature2 value for the first 3 rows
df1.loc[0:2, "Feature2"] # Both the start index and the end index are inclusive

0    0.472880
1    0.800659
2    0.864312
Name: Feature2, dtype: float64

In [18]:
# Ex: Print the Feature2 and Feature3 values for the last 3 rows
# .loc[] is label-based, so a negative number is not treated as a position from the end
df1.loc[2:4, ['Feature2', 'Feature3']]

Unnamed: 0,Feature2,Feature3
2,0.864312,0.374409
3,0.05778,0.461642
4,0.288185,0.411034
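For comparison, the same selection can be written with the position-based indexer `.iloc[]`. This is a small sketch on a made-up table (`demo`, filled with the integers 0 to 14 so the result is easy to check), not the random `df1` above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(15).reshape(5, 3),
                    columns=['Feature1', 'Feature2', 'Feature3'])

# Label-based: the end label 4 is included
by_label = demo.loc[2:4, ['Feature2', 'Feature3']]

# Position-based: negative positions count from the end; the end position is exclusive
by_position = demo.iloc[-3:, 1:]

print(by_label.equals(by_position))   # True
print(demo.iloc[-1, -1])              # lower-right corner: 14
```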


In [21]:
# Ex: Use boolean indexing to extract the last 3 rows
row_indices = (df1.index >= 2)
print(row_indices)
df1.loc[row_indices, :]

[False False  True  True  True]


Unnamed: 0,Feature1,Feature2,Feature3
2,0.131572,0.864312,0.374409
3,0.308173,0.05778,0.461642
4,0.689194,0.288185,0.411034


In [22]:
df1.loc[(df1.index <= 2), :]

Unnamed: 0,Feature1,Feature2,Feature3
0,0.573158,0.47288,0.26109
1,0.458431,0.800659,0.060027
2,0.131572,0.864312,0.374409
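Boolean indexing is more commonly applied to column values than to the index. The sketch below uses a small made-up table (`demo`) with fixed numbers so that the result is predictable; the thresholds 0.3 and 0.6 are arbitrary.

```python
import pandas as pd

demo = pd.DataFrame({'Feature1': [0.9, 0.2, 0.4],
                     'Feature2': [0.5, 0.1, 0.7]})

mask = demo['Feature1'] > 0.3     # boolean Series, one value per row
print(mask.values)                # [ True False  True]
print(demo.loc[mask, :])          # rows 0 and 2

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = demo.loc[(demo['Feature1'] > 0.3) & (demo['Feature2'] > 0.6), :]
print(both.index.values)          # [2]
```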


## Basic Table Operations
- Change a value 
- Add a new row
- Add a new column
- Remove a row
- Remove a column

In [7]:
data = [[60, 70, 80],
        [66, 88, 77],
        [100, 60, 30],
        [85, 87, 83]]
scores = pd.DataFrame(data,
                      index=['Alice', 'Bob', 'Chris', 'David'],
                      columns=['Quiz1', 'Quiz2', 'Final'])
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70,80
Bob,66,88,77
Chris,100,60,30
David,85,87,83


In [8]:
# Change Alice's final score to 90.
scores.loc['Alice', 'Final'] = 90
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70,90
Bob,66,88,77
Chris,100,60,30
David,85,87,83


In [9]:
# Add a new row: "Edward": [77, 88, 99]
scores.loc['Edward', :] = [77, 88, 99]
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0


In [10]:
# Assigning only Quiz1 and Quiz2 for a new row leaves a missing value (NaN) in the Final column
scores.loc["Fred", ["Quiz1", "Quiz2"]] = [100, 100]
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
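The empty cell for Fred is stored as `NaN` ("not a number"). A minimal sketch of how to detect missing values, using a made-up two-row table rather than `scores` itself:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Quiz1': [60, 100], 'Final': [90, np.nan]},
                    index=['Alice', 'Fred'])

print(demo.dtypes)            # Quiz1 keeps an integer dtype; Final is float64 because NaN is a float
print(demo['Final'].isna())   # True only for Fred
print(demo.isna().sum())      # number of missing values per column
```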


In [11]:
# Append a new data frame
more_scores = pd.DataFrame(data={'Quiz1': [67, 76],
                                 'Quiz2': [78, 87],
                                 'Final': [89, 98]},
                           index=['Flora', 'Gabriel']) # Represent data as a dictionary
print(more_scores)
total_scores = pd.concat([scores, more_scores]) # pd.concat() creates a new data frame; DataFrame.append() was removed in pandas 2.0
total_scores

         Quiz1  Quiz2  Final
Flora       67     78     89
Gabriel     76     87     98


Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
Flora,67.0,78.0,89.0
Gabriel,76.0,87.0,98.0


In [12]:
total_scores.shape

(8, 3)

In [13]:
# Add a column "ExtraCredit"
total_scores['ExtraCredit'] = [0, 1, 2, 3, 4, 5, 6, 4.5]
total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0.0
Bob,66.0,88.0,77.0,1.0
Chris,100.0,60.0,30.0,2.0
David,85.0,87.0,83.0,3.0
Edward,77.0,88.0,99.0,4.0
Fred,100.0,100.0,,5.0
Flora,67.0,78.0,89.0,6.0
Gabriel,76.0,87.0,98.0,4.5


In [14]:
total_scores['ExtraCredit'] = [0, 0, 0, 0, 0, 0, 0, 0]
total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,0
Chris,100.0,60.0,30.0,0
David,85.0,87.0,83.0,0
Edward,77.0,88.0,99.0,0
Fred,100.0,100.0,,0
Flora,67.0,78.0,89.0,0
Gabriel,76.0,87.0,98.0,0
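When every row should get the same value, a single scalar is enough; pandas repeats it for each row. A short sketch on a made-up table (`demo`), which also shows what happens when a list has the wrong length:

```python
import pandas as pd

demo = pd.DataFrame({'Quiz1': [60, 66, 100]}, index=['Alice', 'Bob', 'Chris'])

demo['ExtraCredit'] = 0                # the scalar is repeated for every row
print(demo['ExtraCredit'].tolist())    # [0, 0, 0]

# A list must match the number of rows; otherwise pandas raises a ValueError
try:
    demo['ExtraCredit'] = [1, 2]
except ValueError as err:
    print("ValueError:", err)
```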


In [None]:
# Add additional columns from another data frame
# will be discussed in Chapter 8


In [20]:
# Remove record for Chris
scores_without_chris = total_scores.drop('Chris') # drop() creates a new data frame

# total_scores # The original data frame is not affected.
# scores_without_chris 

# If you still want to keep a copy of the original data, assign the result from 
# drop() to a new variable. In this way you have two data frames.

In [21]:
# If the original data frame is no longer needed, then simply assign the drop
# result to the same variable.

total_scores = total_scores.drop('Chris')

total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,0
David,85.0,87.0,83.0,0
Edward,77.0,88.0,99.0,0
Fred,100.0,100.0,,0
Flora,67.0,78.0,89.0,0
Gabriel,76.0,87.0,98.0,0


In [None]:
# Modifying an existing data frame is called an "in-place" operation.
total_scores.drop("David", inplace=True)

In [24]:
total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,0
Edward,77.0,88.0,99.0,0
Fred,100.0,100.0,,0
Flora,67.0,78.0,89.0,0
Gabriel,76.0,87.0,98.0,0
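The difference between the two styles of `drop()` can be checked directly. In this sketch (made-up table `demo`), the plain call returns a new data frame, while the `inplace=True` call returns `None` and changes `demo` itself.

```python
import pandas as pd

demo = pd.DataFrame({'Quiz1': [60, 66, 85]}, index=['Alice', 'Bob', 'David'])

smaller = demo.drop('David')             # new data frame; demo is unchanged
print(demo.shape, smaller.shape)         # (3, 1) (2, 1)

result = demo.drop('David', inplace=True)   # modifies demo and returns None
print(result)                            # None
print(demo.index.tolist())               # ['Alice', 'Bob']
```

A common mistake is writing `demo = demo.drop(..., inplace=True)`, which replaces the data frame with `None`.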


In [26]:
# Remove column "ExtraCredit"
total_scores = total_scores.drop('ExtraCredit', axis=1) # drop() creates a new data frame
# total_scores.drop('ExtraCredit', axis=1, inplace=True)

total_scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
Flora,67.0,78.0,89.0
Gabriel,76.0,87.0,98.0


In [28]:
# Remove both Fred and Flora
total_scores = total_scores.drop(['Fred', 'Flora'])
total_scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Edward,77.0,88.0,99.0
Gabriel,76.0,87.0,98.0
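As an alternative to `axis=1`, `drop()` also accepts the keyword arguments `index=` and `columns=`, which some readers find easier to remember. A minimal sketch on a made-up table (`demo`); the label `'Nobody'` is deliberately one that does not exist.

```python
import pandas as pd

demo = pd.DataFrame({'Quiz1': [60, 100, 67], 'ExtraCredit': [0, 0, 0]},
                    index=['Alice', 'Fred', 'Flora'])

# Remove two rows and one column in a single call
trimmed = demo.drop(index=['Fred', 'Flora'], columns='ExtraCredit')
print(trimmed)

# By default, dropping a missing label raises a KeyError;
# errors='ignore' skips labels that are not found
same = demo.drop('Nobody', errors='ignore')
print(same.shape)   # (3, 2)
```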


## Table Arithmetic
- Perform an operation uniformly to all values in a column
- Arithmetic with multiple columns
- Calculate statistics
- Apply a user-defined function to all rows

In [37]:
# Create a data frame called grades
grades = pd.DataFrame([[56, 67, 78, 5],
                       [66, 77, 88, 8],
                       [98, 97, 85, 3]],
                      index=["Superman", "Hulk", "Thor"],
                      columns=["Quiz1", "Quiz2", "Final", "ExtraCredit"])
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Superman,56,67,78,5
Hulk,66,77,88,8
Thor,98,97,85,3


In [38]:
# Double the extra credits

grades['ExtraCredit'] = grades['ExtraCredit'] * 2

grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Superman,56,67,78,10
Hulk,66,77,88,16
Thor,98,97,85,6


In [59]:
# Calculate the final grades:
#  FinalGrade = Quiz1 * 25% + Quiz2 * 25% + Final * 50% + ExtraCredit

grades['FinalGrade'] = (grades['Quiz1'] * 0.25 + grades['Quiz2'] * 0.25
                        + grades['Final'] * 0.5 + grades['ExtraCredit'])

grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade
Superman,56,67,78,10,79.75
Hulk,66,77,88,16,95.75
Thor,98,97,85,6,97.25


In [65]:
# Ex: Curve the grades:
# Formula: CurvedGrades = sqrt(FinalGrade) * 10

# Attempt 1: use the ** operator
# grades['CurvedGrades'] = grades['FinalGrade'] ** 0.5 * 10
# grades


# Attempt 2: use math.sqrt()
# import math
# math.sqrt(grades['FinalGrade'])
# math.sqrt() accepts only a single number, so passing a whole column
# raises a TypeError

# Attempt 3: use numpy.sqrt()
import numpy as np
grades['CurvedGrades'] = np.sqrt(grades['FinalGrade']) * 10
# NumPy functions operate element-wise on arrays and Series
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade,CurvedGrades
Superman,56,67,78,10,79.75,89.302855
Hulk,66,77,88,16,95.75,97.851929
Thor,98,97,85,6,97.25,98.615415
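The three attempts can be compared side by side. This sketch rebuilds the `FinalGrade` column as a standalone Series (`final`) with the values shown above, so it runs on its own.

```python
import math
import numpy as np
import pandas as pd

final = pd.Series([79.75, 95.75, 97.25], index=['Superman', 'Hulk', 'Thor'])

# math.sqrt() expects one number, not a whole Series
try:
    math.sqrt(final)
except TypeError as err:
    print("TypeError:", err)

curved_np = np.sqrt(final) * 10     # element-wise, returns a Series
curved_pow = final ** 0.5 * 10      # the ** operator is also element-wise

print(np.allclose(curved_np, curved_pow))   # True
print(curved_np.round(6))
```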


In [72]:
# Calculate the min, max, mean, median, variance, and std of the final grades

print("Max grade:", grades['CurvedGrades'].max())
print("Min grade:", grades['CurvedGrades'].min())
print("Use np.var() to calculate the variance:",
      np.var(grades['CurvedGrades']))
# You can break a long statement into multiple lines by starting a new line
# after a comma

Max grade: 98.6154146165801
Min grade: 89.30285549745875
Use np.var() to calculate the variance: 17.821480518660966
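The remaining statistics (mean, median, std) are available as Series methods. One detail worth knowing: the variance printed above is the population variance (divide by n), whereas the pandas methods `.var()` and `.std()` default to the sample versions (divide by n - 1). The sketch below rebuilds the curved grades as a standalone Series (`curved`) using the rounded values from the table.

```python
import numpy as np
import pandas as pd

curved = pd.Series([89.302855, 97.851929, 98.615415],
                   index=['Superman', 'Hulk', 'Thor'])

print("Mean:", curved.mean())
print("Median:", curved.median())
print("Population variance (ddof=0):", curved.var(ddof=0))
print("Sample variance (ddof=1, pandas default):", curved.var())
print("Sample std (ddof=1, pandas default):", curved.std())

# describe() reports count, mean, std, min, quartiles, and max in one call
print(curved.describe())
```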


In [73]:
# Ex: Define a function num2letter() that converts a numerical grade to a letter grade.
# For example, num2letter(95) returns 'A', num2letter(59) returns 'F'

def num2letter(num):
    
    # 90+ -> A
    # >= 80 and < 90 -> B
    # >= 70 and < 80 -> C
    # >= 60 and < 70 -> D
    # otherwise: F
    
    if num >= 90:
        return 'A'
    elif num >= 80:
        return 'B'
    elif num >= 70:
        return 'C'
    elif num >= 60:
        return 'D'
    else:
        return 'F'

In [75]:
num2letter(59)

'F'

In [78]:
# Apply the function to the curved grade column

# num2letter(grades['CurvedGrades']) # Raises a ValueError: `if num >= 90` is ambiguous for a whole column
grades['LetterGrade'] = grades['CurvedGrades'].apply(num2letter)
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade,CurvedGrades,LetterGrade
Superman,56,67,78,10,79.75,89.302855,B
Hulk,66,77,88,16,95.75,97.851929,A
Thor,98,97,85,6,97.25,98.615415,A
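As an aside, the same binning can be done without a user-defined function by using `pd.cut()`. This is an alternative sketch, not the approach used in this lesson. The Series `curved` repeats the curved grades from the table, plus one made-up entry (`'Hypothetical'`, 59.0) to exercise the F branch.

```python
import numpy as np
import pandas as pd

curved = pd.Series([89.302855, 97.851929, 98.615415, 59.0],
                   index=['Superman', 'Hulk', 'Thor', 'Hypothetical'])

# right=False makes each bin include its left edge: [60, 70), [70, 80), ...
bins = [-np.inf, 60, 70, 80, 90, np.inf]
labels = ['F', 'D', 'C', 'B', 'A']
letters = pd.cut(curved, bins=bins, labels=labels, right=False)

print(letters.tolist())   # ['B', 'A', 'A', 'F']
```

`apply()` remains the more general tool, since it works with any function; `pd.cut()` only covers the case of mapping numeric ranges to labels.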


## Example: Revisit 80 Cereal Data

Using pandas DataFrames, let's repeat our analysis of the 80 Cereal Data:
- Load the CSV file using `pd.read_csv()`
- Examine the data
- Explore the ratings
- Analyze sugar contents

In [82]:
# Load the dataset
raw_data = pd.read_csv('cereal.csv') 
raw_data.head(3) # by default, column names come from the first row of the file, and a default integer index is assigned.

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505


In [83]:
# Display the shape

raw_data.shape

(77, 16)

In [86]:
# Display the columns

print(raw_data.columns.values)

['name' 'mfr' 'type' 'calories' 'protein' 'fat' 'sodium' 'fiber' 'carbo'
 'sugars' 'potass' 'vitamins' 'shelf' 'weight' 'cups' 'rating']


In [87]:
# Display the data types

raw_data.dtypes

name         object
mfr          object
type         object
calories      int64
protein       int64
fat           int64
sodium        int64
fiber       float64
carbo       float64
sugars        int64
potass        int64
vitamins      int64
shelf         int64
weight      float64
cups        float64
rating      float64
dtype: object

In [None]:
# Display all cereal names



In [None]:
# Display all cereal ratings



In [None]:
# Find the product name with highest rating



In [None]:
# Display all cereals with rating above 60
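One possible way to approach the exploration steps above is sketched here. To keep it runnable without the file, the data frame `cereal` contains only the three rows shown by `head(3)`; with the real data you would use `raw_data` from `pd.read_csv('cereal.csv')` instead.

```python
import pandas as pd

# Three rows copied from the head(3) output; a stand-in for raw_data
cereal = pd.DataFrame({'name': ['100% Bran', '100% Natural Bran', 'All-Bran'],
                       'rating': [68.402973, 33.983679, 59.425505]})

print(cereal['name'].values)      # all cereal names
print(cereal['rating'].values)    # all cereal ratings

# idxmax() returns the row label of the largest value
best_row = cereal['rating'].idxmax()
print(cereal.loc[best_row, 'name'])   # 100% Bran

# Boolean indexing: keep rows whose rating is above 60
high = cereal.loc[cereal['rating'] > 60, ['name', 'rating']]
print(high)
```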



In [None]:
# Calculate sugar per ounce
# sugar per ounce = sugar per serving / ounce per serving



In [None]:
# Which product has the highest amount of sugar per ounce?

