The pandas library was first developed by Wes McKinney in 2008 for data manipulation and analysis.

#### References:
    www.python.org
    www.numpy.org
    www.matplotlib.org
    https://pandas.pydata.org

#### Questions/feedback: petert@digipen.edu

# Chapter08: Pandas Dataframe
## pandas
   - DataFrame, Index
   - Data Manipulation
   - <font color="grey">Selection and Filtering</font>
   - <font color="grey">Descriptive Statistics</font>
   - <font color="grey">Read, Write and Load Data</font>
   
A **DataFrame** is a two-dimensional tabular data structure capable of holding data of any type. It is similar to a spreadsheet with labeled rows and columns making it easy to manipulate and analyze data.

### Import pandas:
    importing pandas as 'pd' is the standard convention among Python users
    importing the frequently used DataFrame and Series into the local namespace is good practice

In [None]:
import pandas as pd                     # 'pd' is the conventional alias for pandas
#from pandas import DataFrame            # optional, good practice
#from pandas import Series               # optional, good practice

import numpy as np
from matplotlib import pyplot as plt

%matplotlib notebook

## DataFrame
    - rectangular data (table, spreadsheet), similar to an array of arrays
    - ordered set of columns
    - each column could have different type: str, int, float, boolean, ...
    - column index and row index
    - can be interpreted as a dictionary of Series (using the same index)
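
The "dictionary of Series" view in the list above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the `credits` column and its values are made up purely to show a float column next to integer and string columns.

```python
import pandas as pd

# Three Series sharing the same row index; each becomes one column.
idx = ['a', 'b', 'c']
columns = {
    'year':     pd.Series([2019, 2020, 2021], index=idx),
    'courseID': pd.Series(['CS232', 'CS373', 'CS376'], index=idx),
    'credits':  pd.Series([3.0, 3.0, 4.0], index=idx),   # hypothetical values
}

frame = pd.DataFrame(columns)

# Each column keeps its own type, and selecting one returns a Series.
print(frame.dtypes)
print(type(frame['year']))
```

Because all three Series use the same index, the rows line up one to one; the dictionary keys become the column labels.
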
##### Examples and basic functionality:

Create a dataframe using a NumPy array:

In [None]:
np.arange(24)

In [None]:
np.arange(24).reshape(4,6)

In [None]:
frame = pd.DataFrame(np.arange(24).reshape(4,6))
frame

Create a dataframe using random numbers:

In [None]:
frame = pd.DataFrame(np.random.randn(24).reshape(4,6))
frame

Create a dataframe using a list of lists:

In [None]:
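# named 'data_lists' rather than 'list' to avoid shadowing the built-in list type
data_lists = [    [2019, 2019, 2020, 2020, 2021, 2021, 2021], 
            ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'], 
            ['Data Analytics', 'Machine Learning I', 'Data Analytics', 'Machine Learning II', 'Deep Learning', 'Big Data', 'Machine Learning I']
       ]
#frame = pd.DataFrame(data_lists)        # each inner list would become a row
frame = pd.DataFrame(data_lists).T       # transpose so each inner list becomes a column
frame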

Create a dictionary of lists as a base for a dataframe:

In [None]:
data = {
    'year':       [2019, 2019, 2020, 2020, 2021, 2021, 2021],
    'courseID':   ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'],
    'courseName': ['Data Analytics', 'Machine Learning I', 'Data Analytics', 'Machine Learning II', 'Deep Learning', 'Big Data', 'Machine Learning I']
}

In [None]:
data

Create dataframe using the prepared dictionary of lists:

In [None]:
# create dataframe
frame = pd.DataFrame(data)
frame

In [None]:
print(frame)

The *head* and *tail* methods allow a peek at the data and its structure at the beginning and the end:

In [None]:
# peek at the first 5 rows (the default for head)
display(frame.head())
# peek at the last 3 rows
frame.tail(3)

In [None]:
frame.sample(4)

In [None]:
len(frame)

In [None]:
frame.shape

In [None]:
frame.shape[1]

In [None]:
int(len(frame) / 2 + 1)

In [None]:
int(len(frame) / 2 - 1)

In [None]:
frame[2:4]

In [None]:
frame[int(len(frame)/2 - 1) : int(len(frame)/2 + 1)]

In [None]:
print(frame[int(len(frame)/2 - 1) : int(len(frame)/2 + 1)])

Note that print shows the dataframe as plain text, without the table formatting the notebook applies when it displays a dataframe.
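
The difference between the two outputs can be made visible with a small sketch. It uses `to_string` and `to_html` as stand-ins for the plain-text and HTML representations; the two-row frame is just a cut-down version of the course data used above.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2019, 2020], 'courseID': ['CS232', 'CS373']})

text = frame.to_string()   # plain-text table, the kind of output print produces
html = frame.to_html()     # HTML table markup, the kind a notebook renders

print(text)
print(html[:60])           # first characters of the HTML markup
```
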

Create another dataframe using 'data' and
- add another column
- specify row indices other than the default

In [None]:
data

In [None]:
# create another dataframe using 'data' and 
# the same column names but
#    add a new column
#    specify indices different than the default 0, 1, 2, ...
frame2 = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'], columns=['year', 'courseID', 'courseName', 'day'])
frame2

Examples of filtering and manipulating dataframes using column labels and row indices:

In [None]:
# retrieve a column using attribute of the dataframe
frame2.courseID

In [None]:
# retrieve another column using attribute/property of the dataframe
frame2.year

In [None]:
# retrieve a column using the column name of the dataframe
frame2['courseName']

Looks familiar?

The result looks like a Pandas Series: index column, value column and type info

In [None]:
# Check the type:
type(frame2['courseName'])

In [None]:
# Look at the previous dataframe again:
frame2

In [None]:
pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Monday'], index=['a', 'f', 'c', 'd'])

In [None]:
frame2.day

In [None]:
# assign a Series to the 'day' column: values are matched by index label,
# and rows without a matching label ('b', 'e', 'g') get NaN
dayval = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Monday'], index=['a', 'f', 'c', 'd'])
frame2.day = dayval
frame2

In [None]:
pd.Series(['Sunday', 'Saturday'], index=['x', 'y'])

In [None]:
# assign a Series whose index labels ('x', 'y') match no row of the dataframe:
# the whole 'day' column becomes NaN
dayval = pd.Series(['Sunday', 'Saturday'], index=['x', 'y'])
frame2.day = dayval
frame2

In [None]:
pd.Series(['Sunday', 'Saturday'])

In [None]:
# assign a Series with the default integer index (0, 1): these labels do not
# match 'a'..'g' either, so the whole 'day' column becomes NaN again
dayval = pd.Series(['Sunday', 'Saturday'])
frame2.day = dayval
frame2
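
The three assignments above give different results because pandas aligns a Series on its index labels when the Series is assigned to a column. The following sketch shows the same behavior on a smaller, made-up frame; the column names `day`, `other` and `pos` are chosen only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2019, 2020, 2021]}, index=['a', 'b', 'c'])

# Labels 'a' and 'c' match rows; row 'b' has no value and becomes NaN.
frame['day'] = pd.Series(['Monday', 'Wednesday'], index=['a', 'c'])

# No label matches a row, so every entry becomes NaN.
frame['other'] = pd.Series(['Sunday', 'Saturday'], index=['x', 'y'])

# A plain list has no index: it is assigned by position and
# must have exactly as many items as the frame has rows.
frame['pos'] = ['Friday', 'Saturday', 'Sunday']

print(frame)
```

When positional assignment is what is wanted, passing a list (or a Series built with the frame's own index) avoids the surprise of an all-NaN column.
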

#### Transposition of a dataframe is similar to numpy arrays:

In [None]:
frame2.T

In [None]:
# The original dataframe has not changed:
frame2

Examples of modifying DataFrame elements in bulk:

In [None]:
# modify all values of a column at once
frame3 = frame2.copy()
frame3.day = 'Tuesday'
# or:
#frame3['day'] = 'Wednesday'
frame3

Delete a column:

In [None]:
frame3

In [None]:
# delete a column
del frame3['day']
frame3

Display index information:

In [None]:
frame3.index

Display values of a dataframe:

In [None]:
frame3.values

In [None]:
type(frame3.values)
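
As a side note, the pandas documentation recommends the `to_numpy` method over the `values` attribute for getting the underlying data as an array. A minimal sketch on a cut-down version of the course data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'year': [2019, 2020], 'courseID': ['CS232', 'CS373']})

arr = frame.to_numpy()            # whole frame as a 2-D NumPy array
years = frame['year'].to_numpy()  # a single column as a 1-D array

# Mixed integer and string columns need a common type, so arr has dtype object.
print(type(arr), arr.dtype, arr.shape)
print(years.dtype)
```
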

##### Dropping rows or columns:

In [None]:
frame3

In [None]:
# drop rows based on indices
frame3.drop(['b', 'c', 'd'])

In [None]:
# drop column(s) based on column names and specifying axis=1
frame3.drop('courseName', axis=1)

Note that the result is displayed without calling display or print: drop returns a new dataframe, and the notebook shows the value of the last expression in a cell.

The dataframe has not changed:

In [None]:
# the dataframe has not changed
frame3

For the drop to take effect, either assign its result to a dataframe or pass "inplace=True":

In [None]:
print('dataframe frame3:')
print(frame3)
frame4 = frame3.copy()
frame4.drop(['b', 'c', 'd'], inplace=True)
frame4.drop('courseName', axis=1, inplace=True)
print('\ndataframe frame4:')
print(frame4)

In [None]:
frame4
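
As an alternative to `axis=1` and `inplace=True`, `drop` also accepts `index=` and `columns=` keywords, so rows and columns can be removed in one call and the result assigned to a new name. A small sketch, using a three-row excerpt of the course data with assumed index labels:

```python
import pandas as pd

frame = pd.DataFrame(
    {'year': [2019, 2020, 2021],
     'courseID': ['CS232', 'CS373', 'CS376'],
     'courseName': ['Data Analytics', 'Machine Learning II', 'Deep Learning']},
    index=['a', 'b', 'c'])

# Remove row 'b' and column 'courseName'; the original frame is left untouched.
smaller = frame.drop(index=['b'], columns=['courseName'])

print(smaller)
print(frame.shape)
```
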

#### Exercise 8.1:
Create a dataframe and perform the tasks below:
- create a 4 x 2 dataframe (4 rows and 2 columns)
- the column labels should be "class" and "midterm"
- row indices should be "first", "second", "third" and "fourth"
- the values should be the names of 4 of your current (or made-up) classes and the corresponding expected midterm grades
- add a new column with label "final"
- add expected final grade values to "second" and "fourth" (rows/index labels)
- drop one class (it cannot be CS232!)
- display the dataframe after each change

In [None]:
# Exercise 8.1 code:

