The pandas library was first developed by Wes McKinney in 2008 for data manipulation and analysis.

#### References:
    www.python.org
    www.numpy.org
    www.matplotlib.org
    https://pandas.pydata.org

#### Questions/feedback: petert@digipen.edu

# Chapter08: Pandas Dataframe
## pandas
   - DataFrame, Index
   - Data Manipulation
   - <font color="grey">Selection and Filtering</font>
   - <font color="grey">Descriptive Statistics</font>
   - <font color="grey">Read, Write and Load Data</font>
   
A **DataFrame** is a two-dimensional tabular data structure capable of holding data of any type. It is similar to a spreadsheet with labeled rows and columns making it easy to manipulate and analyze data.

### Import pandas:
    using the alias 'pd' is the standard convention among pandas users
    importing the frequently used DataFrame and Series into the local namespace is optional but common practice

In [2]:
import pandas as pd                     # 'pd' is the conventional alias for pandas
#from pandas import DataFrame            # optional, good practice
#from pandas import Series               # optional, good practice

import numpy as np
from matplotlib import pyplot as plt

%matplotlib notebook

## DataFrame
    - rectangular data (table, spreadsheet), similar to an array of arrays
    - ordered set of columns
    - each column could have different type: str, int, float, boolean, ...
    - column index and row index
    - can be interpreted as a dictionary of Series (using the same index)
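
The last point in the list above can be tried out directly. The following is a minimal sketch with made-up values (the names `ids`, `years` and `courses` are only for illustration): two Series that share an index are combined into one DataFrame, and each column keeps its own type.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical example data).
ids = pd.Series(['CS232', 'CS372', 'CS376'], index=['a', 'b', 'c'])
years = pd.Series([2019, 2020, 2021], index=['a', 'b', 'c'])

# A dictionary of Series becomes a DataFrame:
# the keys become column labels, the shared index becomes the row labels.
courses = pd.DataFrame({'courseID': ids, 'year': years})

print(courses)
print(courses.dtypes)           # each column keeps its own type
print(type(courses['year']))    # selecting one column gives back a Series
```
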
##### Examples and basic functionality:

Create a dataframe using numpy array:

In [3]:
np.arange(24)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

In [4]:
np.arange(24).reshape(4,6)

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

In [5]:
frame = pd.DataFrame(np.arange(24).reshape(4,6))
frame

Unnamed: 0,0,1,2,3,4,5
0,0,1,2,3,4,5
1,6,7,8,9,10,11
2,12,13,14,15,16,17
3,18,19,20,21,22,23


Create a dataframe using random numbers:

In [6]:
frame = pd.DataFrame(np.random.randn(24).reshape(4,6))
frame

Unnamed: 0,0,1,2,3,4,5
0,0.001096,-1.623563,0.461319,1.347607,0.260246,-0.283326
1,0.602739,-0.096292,0.427112,1.565015,-0.463952,-0.005377
2,0.51984,0.151343,1.390488,1.100757,0.972321,-0.678986
3,0.728533,0.074987,1.815639,-1.631171,-0.624233,1.058463
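
The random values above change on every run. If reproducible numbers are wanted, one option is a seeded NumPy generator. This is only a sketch: the seed value 42 and the names `frame_a` and `frame_b` are arbitrary choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same "random" numbers on every run.
rng = np.random.default_rng(seed=42)
frame_a = pd.DataFrame(rng.standard_normal((4, 6)))

# Re-creating the generator with the same seed repeats the sequence.
rng = np.random.default_rng(seed=42)
frame_b = pd.DataFrame(rng.standard_normal((4, 6)))

print(frame_a.shape)              # (4, 6)
print(frame_a.equals(frame_b))    # True
```
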


Create a dataframe using list of lists:

In [7]:
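# each inner list holds the values of one column, so the frame is transposed below
# (the name 'columns' is used instead of 'list' to avoid shadowing the built-in)
columns = [ [2019, 2019, 2020, 2020, 2021, 2021, 2021],
            ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'],
            ['Data Analytics', 'Machine Learning I', 'Data Analytics', 'Machine Learning II', 'Deep Learning', 'Big Data', 'Machine Learning I']
          ]
#frame = pd.DataFrame(columns)          # without .T: 3 rows x 7 columns
frame = pd.DataFrame(columns).T
frame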

Unnamed: 0,0,1,2
0,2019,CS232,Data Analytics
1,2019,CS372,Machine Learning I
2,2020,CS232,Data Analytics
3,2020,CS373,Machine Learning II
4,2021,CS376,Deep Learning
5,2021,CS312,Big Data
6,2021,CS372,Machine Learning I
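
In a list of lists, pandas treats each inner list as one row, which is why the cell above needs the transpose. A possible alternative, sketched here with a shortened copy of the data, is to zip the lists into one tuple per record and pass column names. The variable names are illustrative only.

```python
import pandas as pd

years = [2019, 2019, 2020]
ids = ['CS232', 'CS372', 'CS232']
names = ['Data Analytics', 'Machine Learning I', 'Data Analytics']

# Variant 1: build 3 rows (one per list) and transpose, as in the cell above.
transposed = pd.DataFrame([years, ids, names]).T

# Variant 2: zip the lists into one tuple per record and name the columns.
records = list(zip(years, ids, names))
by_rows = pd.DataFrame(records, columns=['year', 'courseID', 'courseName'])

# After the transpose the year column holds generic Python objects,
# while the zipped variant keeps it as an integer column.
print(transposed.dtypes)
print(by_rows.dtypes)
```

Both variants hold the same values; the second one also gives the columns meaningful labels and lets pandas infer a numeric type for the year.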


Create a dictionary of lists as a base for a dataframe:

In [8]:
data = {
    'year':       [2019, 2019, 2020, 2020, 2021, 2021, 2021],
    'courseID':   ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'],
    'courseName': ['Data Analytics', 'Machine Learning I', 'Data Analytics', 'Machine Learning II', 'Deep Learning', 'Big Data', 'Machine Learning I']
}

In [9]:
data

{'year': [2019, 2019, 2020, 2020, 2021, 2021, 2021],
 'courseID': ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'],
 'courseName': ['Data Analytics',
  'Machine Learning I',
  'Data Analytics',
  'Machine Learning II',
  'Deep Learning',
  'Big Data',
  'Machine Learning I']}

Create dataframe using the prepared dictionary of lists:

In [10]:
# create dataframe
frame = pd.DataFrame(data)
frame

Unnamed: 0,year,courseID,courseName
0,2019,CS232,Data Analytics
1,2019,CS372,Machine Learning I
2,2020,CS232,Data Analytics
3,2020,CS373,Machine Learning II
4,2021,CS376,Deep Learning
5,2021,CS312,Big Data
6,2021,CS372,Machine Learning I


In [11]:
print(frame)

   year courseID           courseName
0  2019    CS232       Data Analytics
1  2019    CS372   Machine Learning I
2  2020    CS232       Data Analytics
3  2020    CS373  Machine Learning II
4  2021    CS376        Deep Learning
5  2021    CS312             Big Data
6  2021    CS372   Machine Learning I


The *head* and *tail* methods allow a peek at the data and its structure at the beginning and the end:

In [12]:
# peek at the first 5 rows (the default)
display(frame.head())
# peek at the last 3 rows
frame.tail(3)

Unnamed: 0,year,courseID,courseName
0,2019,CS232,Data Analytics
1,2019,CS372,Machine Learning I
2,2020,CS232,Data Analytics
3,2020,CS373,Machine Learning II
4,2021,CS376,Deep Learning


Unnamed: 0,year,courseID,courseName
4,2021,CS376,Deep Learning
5,2021,CS312,Big Data
6,2021,CS372,Machine Learning I


In [13]:
frame.sample(4)

Unnamed: 0,year,courseID,courseName
6,2021,CS372,Machine Learning I
0,2019,CS232,Data Analytics
5,2021,CS312,Big Data
4,2021,CS376,Deep Learning
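
*sample* picks rows at random, so the output above differs between runs. A small sketch of how the `random_state` parameter can make the selection repeatable (the value 1 is an arbitrary choice, and the frame is a reduced copy of the course data):

```python
import pandas as pd

frame = pd.DataFrame({
    'year':     [2019, 2019, 2020, 2020, 2021, 2021, 2021],
    'courseID': ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'],
})

# The same random_state yields the same rows in the same order.
first = frame.sample(4, random_state=1)
second = frame.sample(4, random_state=1)

print(first)
print(first.equals(second))    # True
```
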


In [14]:
len(frame)

7

In [15]:
frame.shape

(7, 3)

In [16]:
frame.shape[1]

3

In [17]:
int(len(frame) / 2 + 1)

4

In [18]:
int(len(frame) / 2 - 1)

2

In [19]:
frame[2:4]

Unnamed: 0,year,courseID,courseName
2,2020,CS232,Data Analytics
3,2020,CS373,Machine Learning II


In [20]:
frame[int(len(frame)/2 - 1) : int(len(frame)/2 + 1)]

Unnamed: 0,year,courseID,courseName
2,2020,CS232,Data Analytics
3,2020,CS373,Machine Learning II
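
The slice above selects rows by position. A sketch of the same selection written with `iloc` and integer division, which avoids the `int(...)` conversions (the name `mid` is illustrative, and the frame is a reduced copy of the course data):

```python
import pandas as pd

frame = pd.DataFrame({'year': [2019, 2019, 2020, 2020, 2021, 2021, 2021]})

mid = len(frame) // 2                     # 7 // 2 == 3
middle = frame.iloc[mid - 1 : mid + 1]    # positions 2 and 3, end excluded

print(middle)
print(middle.equals(frame[2:4]))          # True
```

`iloc` makes it explicit that the numbers are positions and not index labels, which matters once a dataframe uses labels other than 0, 1, 2, ...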


In [21]:
print(frame[int(len(frame)/2 - 1) : int(len(frame)/2 + 1)])

   year courseID           courseName
2  2020    CS232       Data Analytics
3  2020    CS373  Machine Learning II


Note that print shows the plain-text representation of the dataframe instead of the formatted table that the notebook renders.
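
The difference comes from two representations of the same dataframe. The sketch below assumes the default pandas display options; `_repr_html_` is the hook that notebook front ends call for rich output, and it is shown here only to illustrate the mechanism, not as something to call in everyday code.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2019, 2020], 'courseID': ['CS232', 'CS373']})

text_view = repr(frame)            # the text that print(frame) writes
html_view = frame._repr_html_()    # the HTML a notebook renders for a bare `frame`

print(text_view)
print('<table' in html_view)       # the rich view is an HTML table
```
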

Create another dataframe using 'data' and
- add another column
- specify row indices other than the default

In [22]:
data

{'year': [2019, 2019, 2020, 2020, 2021, 2021, 2021],
 'courseID': ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'],
 'courseName': ['Data Analytics',
  'Machine Learning I',
  'Data Analytics',
  'Machine Learning II',
  'Deep Learning',
  'Big Data',
  'Machine Learning I']}

In [23]:
# create another dataframe using 'data' and
# the same column names but
#    add a new column ('day' is not a key of 'data', so it is filled with NaN)
#    specify indices other than the default 0, 1, 2, ...
frame2 = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'], columns=['year', 'courseID', 'courseName', 'day'])
frame2

Unnamed: 0,year,courseID,courseName,day
a,2019,CS232,Data Analytics,
b,2019,CS372,Machine Learning I,
c,2020,CS232,Data Analytics,
d,2020,CS373,Machine Learning II,
e,2021,CS376,Deep Learning,
f,2021,CS312,Big Data,
g,2021,CS372,Machine Learning I,


Examples of filtering and manipulating dataframes using column labels and row indices:

In [24]:
# retrieve a column using attribute of the dataframe
frame2.courseID

a    CS232
b    CS372
c    CS232
d    CS373
e    CS376
f    CS312
g    CS372
Name: courseID, dtype: object

In [25]:
# retrieve another column using attribute/property of the dataframe
frame2.year

a    2019
b    2019
c    2020
d    2020
e    2021
f    2021
g    2021
Name: year, dtype: int64

In [26]:
# retrieve a column using the column name of the dataframe
frame2['courseName']

a         Data Analytics
b     Machine Learning I
c         Data Analytics
d    Machine Learning II
e          Deep Learning
f               Big Data
g     Machine Learning I
Name: courseName, dtype: object

Looks familiar?

The result looks like a Pandas Series: index column, value column and type info

In [27]:
# Check the type:
type(frame2['courseName'])

pandas.core.series.Series
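
Attribute access and bracket access return the same Series for names like `courseID`, but attribute access has limits. A short sketch with hypothetical column names: a label containing a space cannot be written as an attribute, and a label that matches an existing DataFrame method (here `count`) is hidden by that method.

```python
import pandas as pd

df = pd.DataFrame({'course name': ['Big Data', 'Deep Learning'],
                   'count': [25, 30]})

# Brackets work for any column label.
print(df['course name'].tolist())
print(df['count'].tolist())       # [25, 30]

# df.count is the DataFrame.count method, not the 'count' column.
print(callable(df.count))         # True
```

Bracket access is therefore the safer default; attribute access is a convenience for simple labels.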

In [28]:
# Look at the previous dataframe again:
frame2

Unnamed: 0,year,courseID,courseName,day
a,2019,CS232,Data Analytics,
b,2019,CS372,Machine Learning I,
c,2020,CS232,Data Analytics,
d,2020,CS373,Machine Learning II,
e,2021,CS376,Deep Learning,
f,2021,CS312,Big Data,
g,2021,CS372,Machine Learning I,


In [29]:
pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Monday'], index=['a', 'f', 'c', 'd'])

a       Monday
f      Tuesday
c    Wednesday
d       Monday
dtype: object

In [30]:
frame2.day

a    NaN
b    NaN
c    NaN
d    NaN
e    NaN
f    NaN
g    NaN
Name: day, dtype: object

In [31]:
# assign values to the 'day' column; the Series is aligned on its index labels
dayval = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Monday'], index=['a', 'f', 'c', 'd'])
frame2.day = dayval
frame2

Unnamed: 0,year,courseID,courseName,day
a,2019,CS232,Data Analytics,Monday
b,2019,CS372,Machine Learning I,
c,2020,CS232,Data Analytics,Wednesday
d,2020,CS373,Machine Learning II,Monday
e,2021,CS376,Deep Learning,
f,2021,CS312,Big Data,Tuesday
g,2021,CS372,Machine Learning I,


In [32]:
pd.Series(['Sunday', 'Saturday'], index=['x', 'y'])

x      Sunday
y    Saturday
dtype: object

In [33]:
# the labels 'x' and 'y' do not exist in frame2, so every value of 'day' becomes NaN
dayval = pd.Series(['Sunday', 'Saturday'], index=['x', 'y'])
frame2.day = dayval
frame2

Unnamed: 0,year,courseID,courseName,day
a,2019,CS232,Data Analytics,
b,2019,CS372,Machine Learning I,
c,2020,CS232,Data Analytics,
d,2020,CS373,Machine Learning II,
e,2021,CS376,Deep Learning,
f,2021,CS312,Big Data,
g,2021,CS372,Machine Learning I,


In [34]:
pd.Series(['Sunday', 'Saturday'])

0      Sunday
1    Saturday
dtype: object

In [35]:
# without an explicit index the Series gets labels 0 and 1, which match no row of frame2 either
dayval = pd.Series(['Sunday', 'Saturday'])
frame2.day = dayval
frame2

Unnamed: 0,year,courseID,courseName,day
a,2019,CS232,Data Analytics,
b,2019,CS372,Machine Learning I,
c,2020,CS232,Data Analytics,
d,2020,CS373,Machine Learning II,
e,2021,CS376,Deep Learning,
f,2021,CS312,Big Data,
g,2021,CS372,Machine Learning I,
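
The three assignments above differ only in the index of the assigned Series. When a Series is assigned to a column, pandas aligns it with the dataframe on index labels: rows with a matching label get the value, all other rows get NaN, and labels that exist only in the Series are ignored. A compact sketch on a reduced copy of the data (the counters `matched` and `unmatched` are illustrative):

```python
import pandas as pd

frame2 = pd.DataFrame({'courseID': ['CS232', 'CS372', 'CS376']},
                      index=['a', 'b', 'c'])

# Labels 'a' and 'c' exist in frame2, so only those rows receive a value.
frame2['day'] = pd.Series(['Monday', 'Wednesday'], index=['a', 'c'])
matched = frame2['day'].notna().sum()

# The default labels 0 and 1 match nothing, so the whole column becomes NaN.
frame2['day'] = pd.Series(['Sunday', 'Saturday'])
unmatched = frame2['day'].notna().sum()

# A plain list has no labels: it is assigned by position and
# must contain exactly one value per row.
frame2['day'] = ['Monday', 'Tuesday', 'Wednesday']

print(matched, unmatched)    # 2 0
print(frame2)
```
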


#### Transposition of a dataframe is similar to numpy arrays:

In [36]:
frame2.T

Unnamed: 0,a,b,c,d,e,f,g
year,2019,2019,2020,2020,2021,2021,2021
courseID,CS232,CS372,CS232,CS373,CS376,CS312,CS372
courseName,Data Analytics,Machine Learning I,Data Analytics,Machine Learning II,Deep Learning,Big Data,Machine Learning I
day,,,,,,,


In [37]:
# Transposing returns a new dataframe; the original has not changed:
frame2

Unnamed: 0,year,courseID,courseName,day
a,2019,CS232,Data Analytics,
b,2019,CS372,Machine Learning I,
c,2020,CS232,Data Analytics,
d,2020,CS373,Machine Learning II,
e,2021,CS376,Deep Learning,
f,2021,CS312,Big Data,
g,2021,CS372,Machine Learning I,
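
A quick way to check what the transpose does is to compare shapes and labels. The sketch below uses a reduced copy of the data; the name `flipped` is illustrative.

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2019, 2020, 2021],
                       'courseID': ['CS232', 'CS373', 'CS376']},
                      index=['a', 'b', 'c'])

flipped = frame2.T

print(frame2.shape, flipped.shape)    # (3, 2) (2, 3)
print(list(flipped.index))            # former column labels
print(list(flipped.columns))          # former row labels
```
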


Examples of modifying DataFrame elements in bulk:

In [38]:
# modify all values of a column at once
frame3 = frame2.copy()
frame3.day = 'Tuesday'
# or:
#frame3['day'] = 'Wednesday'
frame3

Unnamed: 0,year,courseID,courseName,day
a,2019,CS232,Data Analytics,Tuesday
b,2019,CS372,Machine Learning I,Tuesday
c,2020,CS232,Data Analytics,Tuesday
d,2020,CS373,Machine Learning II,Tuesday
e,2021,CS376,Deep Learning,Tuesday
f,2021,CS312,Big Data,Tuesday
g,2021,CS372,Machine Learning I,Tuesday


Delete a column:

In [39]:
frame3

Unnamed: 0,year,courseID,courseName,day
a,2019,CS232,Data Analytics,Tuesday
b,2019,CS372,Machine Learning I,Tuesday
c,2020,CS232,Data Analytics,Tuesday
d,2020,CS373,Machine Learning II,Tuesday
e,2021,CS376,Deep Learning,Tuesday
f,2021,CS312,Big Data,Tuesday
g,2021,CS372,Machine Learning I,Tuesday


In [40]:
# delete a column
del frame3['day']
frame3

Unnamed: 0,year,courseID,courseName
a,2019,CS232,Data Analytics
b,2019,CS372,Machine Learning I
c,2020,CS232,Data Analytics
d,2020,CS373,Machine Learning II
e,2021,CS376,Deep Learning
f,2021,CS312,Big Data
g,2021,CS372,Machine Learning I


Display index information:

In [41]:
frame3.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')
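
An Index object is immutable: its individual labels cannot be overwritten in place. The sketch below shows the resulting error and one possible way to get different labels, `rename`, which returns a new dataframe. The label 'z' and the name `renamed` are made up for the example.

```python
import pandas as pd

frame3 = pd.DataFrame({'year': [2019, 2021]}, index=['a', 'b'])

# Item assignment on an Index raises a TypeError.
try:
    frame3.index[0] = 'z'
    changed = True
except TypeError as err:
    changed = False
    print('TypeError:', err)

# rename builds a new dataframe with the relabeled index.
renamed = frame3.rename(index={'a': 'z'})
print(list(frame3.index), list(renamed.index))
```
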

Display values of a dataframe:

In [42]:
frame3.values

array([[2019, 'CS232', 'Data Analytics'],
       [2019, 'CS372', 'Machine Learning I'],
       [2020, 'CS232', 'Data Analytics'],
       [2020, 'CS373', 'Machine Learning II'],
       [2021, 'CS376', 'Deep Learning'],
       [2021, 'CS312', 'Big Data'],
       [2021, 'CS372', 'Machine Learning I']], dtype=object)

In [43]:
type(frame3.values)

numpy.ndarray
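
The array above has `dtype=object` because the columns have different types and a NumPy array holds a single type. The pandas documentation recommends `to_numpy()` over `.values` for this conversion; a small sketch on a reduced copy of the data shows how the resulting dtype depends on the selected columns.

```python
import pandas as pd

frame3 = pd.DataFrame({'year': [2019, 2021],
                       'courseID': ['CS232', 'CS376']})

mixed = frame3.to_numpy()                # integers and strings together
numeric = frame3[['year']].to_numpy()    # integer column only

print(mixed.dtype)      # object
print(numeric.dtype)    # an integer dtype such as int64
```
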

##### Dropping rows or columns:

In [44]:
frame3

Unnamed: 0,year,courseID,courseName
a,2019,CS232,Data Analytics
b,2019,CS372,Machine Learning I
c,2020,CS232,Data Analytics
d,2020,CS373,Machine Learning II
e,2021,CS376,Deep Learning
f,2021,CS312,Big Data
g,2021,CS372,Machine Learning I


In [45]:
# drop rows based on indices
frame3.drop(['b', 'c', 'd'])

Unnamed: 0,year,courseID,courseName
a,2019,CS232,Data Analytics
e,2021,CS376,Deep Learning
f,2021,CS312,Big Data
g,2021,CS372,Machine Learning I


In [46]:
# drop column(s) based on column names and specifying axis=1
frame3.drop('courseName', axis=1)

Unnamed: 0,year,courseID
a,2019,CS232
b,2019,CS372
c,2020,CS232
d,2020,CS373
e,2021,CS376
f,2021,CS312
g,2021,CS372


Note that the notebook displays the result without an explicit call to display or print, because drop returns a new dataframe.

The dataframe has not changed:

In [47]:
# the dataframe has not changed
frame3

Unnamed: 0,year,courseID,courseName
a,2019,CS232,Data Analytics
b,2019,CS372,Machine Learning I
c,2020,CS232,Data Analytics
d,2020,CS373,Machine Learning II
e,2021,CS376,Deep Learning
f,2021,CS312,Big Data
g,2021,CS372,Machine Learning I


To make the drop take effect, either assign its result to a dataframe or specify "inplace=True":

In [48]:
print('dataframe frame3:')
print(frame3)
frame4 = frame3.copy()
frame4.drop(['b', 'c', 'd'], inplace=True)
frame4.drop('courseName', axis=1, inplace=True)
print('\ndataframe frame4:')
print(frame4)

dataframe frame3:
   year courseID           courseName
a  2019    CS232       Data Analytics
b  2019    CS372   Machine Learning I
c  2020    CS232       Data Analytics
d  2020    CS373  Machine Learning II
e  2021    CS376        Deep Learning
f  2021    CS312             Big Data
g  2021    CS372   Machine Learning I

dataframe frame4:
   year courseID
a  2019    CS232
e  2021    CS376
f  2021    CS312
g  2021    CS372


In [49]:
frame4

Unnamed: 0,year,courseID
a,2019,CS232
e,2021,CS376
f,2021,CS312
g,2021,CS372
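
As an alternative to "inplace=True", the result of drop can be assigned to a new name. The sketch below also uses the `index=` and `columns=` keywords, which state what is dropped without needing `axis`. It works on a reduced copy of the data.

```python
import pandas as pd

frame3 = pd.DataFrame({'year': [2019, 2019, 2020, 2021],
                       'courseID': ['CS232', 'CS372', 'CS232', 'CS376'],
                       'courseName': ['Data Analytics', 'Machine Learning I',
                                      'Data Analytics', 'Deep Learning']},
                      index=['a', 'b', 'c', 'd'])

# Drop rows and a column in one call and keep the result under a new name.
frame4 = frame3.drop(index=['b', 'c'], columns='courseName')

print(frame4)
print(frame3.shape)    # (4, 3): the original is untouched
```
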


#### Exercise 8.1:
Create a dataframe and perform the tasks below:
- create a 4 x 2 dataframe (4 rows and 2 columns)
- the column labels should be "class" and "midterm"
- row indices should be "first", "second", "third" and "fourth"
- the values should be the names of 4 of your current (or made up) classes and the corresponding expected midterm grades
- add a new column with label "final"
- add expected final grade values to "second" and "fourth" (rows/index labels)
- drop one class (it cannot be CS232!)
- display the dataframe after each change

In [58]:
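# Exercise 8.1 code:

# create the 4 x 2 dataframe with custom row indices
myData = {'class': ['CS232', 'DES214', 'GAM375', 'DES230'],
          'midterm': ['100', '100', '100', '100']}

myFrame = pd.DataFrame(myData, index=['first', 'second', 'third', 'fourth'])

myFrame        # note: only the last expression of a cell is rendered; use display(myFrame) to show this step

# add the "final" column with values for "second" and "fourth" only
myFrame['final'] = [None, '100', None, '100']

myFrame        # not rendered either, for the same reason

# drop one class (not CS232)
myFrame.drop('third', inplace=True)

myFrame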

Unnamed: 0,class,midterm,final
first,CS232,100,
second,DES214,100,100.0
fourth,DES230,100,100.0
