<h1>Python 2 - Object Oriented Programming and Pandas</h1>


<p>4 Pillars of OOP</p>
<ul>
    
<li>Encapsulation: Group related variables and functions together to reduce complexity and increase reusability</li>
<li>Data Abstraction: Creating methods to interface with attributes of your class. Show only essentials to reduce complexity</li>
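<li>Inheritance: Build new classes on top of existing ones so that shared attributes and methods are written only once</li>
<li>Polymorphism: Let child classes override inherited methods so that the same method call behaves appropriately for each class</li>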

</ul>
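As a minimal sketch of the first two pillars, the class below groups a balance with the functions that act on it (encapsulation) and exposes only a small set of methods for working with it (abstraction). The `BankAccount` class and all of its names are hypothetical and chosen only for illustration.

```python
class BankAccount:
    """Hypothetical example class: keeps a balance together with the functions that use it."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self._balance = balance  # leading underscore: treated as internal by convention

    def deposit(self, amount):
        # Callers go through this method instead of changing _balance directly
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        self._balance += amount

    def get_balance(self):
        # Shows only the essential information to the caller
        return self._balance


account = BankAccount("Peter Ling")
account.deposit(50)
print(account.get_balance())  # 50
```

Note that Python does not enforce the underscore convention; it only signals to other programmers that `_balance` should be reached through the methods.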

## Jupyter Notebook 

Jupyter Notebook is a web-based application (it runs in the browser) that is used to write and run Python code.

- To add more code cells (or blocks) click on the **'+'** button in the top left corner
- There are 3 cell types in Jupyter:
    - Code: Used to write Python code
    - Markdown: Used to write text (explanations and other key information)
    - Raw NBConvert: Holds raw content that is passed through unmodified when the notebook (.ipynb) is converted to other formats (HTML, LaTeX, etc.)
    

- To run Python code in a specific cell, you can click on the **'Run'** button at the top or press **Shift + Enter**
- The number sign (#) starts a comment, which is used to leave messages for yourself or others. Comments are not interpreted as code and are ignored when the program runs


<h1>Inheritance</h1>
<ul>
    <li>New classes do not need to be declared from scratch. They may build on existing classes</li>
    <li>When one class inherits from another, it automatically takes on all the attributes and methods of the first class</li>
    <li>Goal: Eliminate redundant code by inheriting attributes and methods from a parent class</li>
</ul>


In [31]:
class Employee():
    def __init__(self, employee_num, department, name):
        self.employee_num = employee_num
        self.department = department
        self.name = name
        self.days_worked = 0
        
    def get_descriptive_name(self):
        long_name = f"{self.name} ({self.employee_num}) of {self.department}"
        return long_name.title()
    
    def num_days(self):
        print(f"{self.name} has worked {self.days_worked} days")
        
    def increment_days(self):
        self.days_worked += 1
        print("Days worked increased!")

In [32]:
new_hire = Employee(1213, "Machine Learning", "Peter Ling") 
description = new_hire.get_descriptive_name()

In [33]:
print(description)

Peter Ling (1213) Of Machine Learning


In [34]:
new_hire.num_days()

Peter Ling has worked 0 days


In [35]:
new_hire.increment_days()

Days worked increased!


In [36]:
class Engineer(Employee):
    def __init__(self, employee_num, department, name, p_eng):
        super().__init__(employee_num, department, name)
        self.p_eng = p_eng
    

In [44]:
new_eng_hire = Engineer(1213, "Marketing", "Shakti", False)

In [39]:
new_eng_hire.get_descriptive_name()

'Shakti (1213) Of Marketing'

In [40]:
new_eng_hire.num_days()

Shakti has worked 0 days


In [47]:
new_eng_hire.p_eng

False

<h1>Polymorphism</h1>

<ul>
    <li>Because child classes inherit all attributes and methods from their parent class, we may wish to refactor and customize classes to specific use cases.</li>
    <li>Overriding means redefining an inherited method in a child class so that it better suits that class</li>
</ul>

In [67]:
class Recruiter(Employee):
    def __init__(self, employee_num, department, name):
        super().__init__(employee_num, department, name)
        self.hires = []
        
    def get_descriptive_name(self):
        long_name = f"{self.name} ({self.employee_num}) has hired {len(self.hires)} many employees."
        return long_name.title()
    
        
    def add_hire(self, emp_id):
        self.hires.append(emp_id)
        print(self.hires)

In [68]:
new_rec_hire = Recruiter(1000, "Sales", "Robert")

In [69]:
new_rec_hire.get_descriptive_name()

'Robert (1000) Has Hired 0 Many Employees.'

In [70]:
new_rec_hire.add_hire(1080) 

[1080]


In [71]:
new_rec_hire.get_descriptive_name()

'Robert (1000) Has Hired 1 Many Employees.'
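The practical benefit of overriding shows up when objects of different classes are handled by the same code. The sketch below uses trimmed-down versions of the `Employee` and `Recruiter` classes (only the parts needed here, with slightly shortened wording in the recruiter's description) and calls the same method on each object in a list.

```python
class Employee:
    def __init__(self, employee_num, department, name):
        self.employee_num = employee_num
        self.department = department
        self.name = name

    def get_descriptive_name(self):
        return f"{self.name} ({self.employee_num}) of {self.department}".title()


class Recruiter(Employee):
    def __init__(self, employee_num, department, name):
        super().__init__(employee_num, department, name)
        self.hires = []

    # Overrides the method inherited from Employee
    def get_descriptive_name(self):
        return f"{self.name} ({self.employee_num}) has hired {len(self.hires)} employees".title()


staff = [
    Employee(1213, "Machine Learning", "Peter Ling"),
    Recruiter(1000, "Sales", "Robert"),
]

# The same call runs a different method depending on the object's class
descriptions = [person.get_descriptive_name() for person in staff]
for description in descriptions:
    print(description)
# Peter Ling (1213) Of Machine Learning
# Robert (1000) Has Hired 0 Employees
```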

<h1>Pandas</h1>

In [72]:
import pandas as pd

<h1>Reading CSV Files</h1>

<ul>
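    <li>Function to use in Pandas: read_csv()</li>
    <li>The value passed to read_csv() is typically a string containing the <b>exact</b> name (or path) of the file</li>
    <li>If only the file name is given, the CSV file must be in the same directory as the Python file/notebook; otherwise, pass a relative or absolute path</li>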
</ul>
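The points above can be tried without any file on disk. In this sketch an in-memory text buffer stands in for a CSV file; the values are made up for illustration and are not the course data, and the `data/` folder in the comments is hypothetical.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (made-up values)
csv_text = "Store,Temp\n1,42.31\n2,38.51\n"
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)  # (2, 2)

# With a real file, pass a path string instead, for example:
# pd.read_csv("features.csv")        # file in the same directory as the notebook
# pd.read_csv("data/features.csv")   # file in a (hypothetical) subfolder
```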

In [73]:
features_df = pd.read_csv('features.csv')

<h1>Basic DataFrame Functions</h1>

<ul>
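    <li>head() returns the first 5 rows of the DataFrame by default; pass a number, e.g. head(10), to change this</li>
    <li>tail() returns the last 5 rows of the DataFrame by default</li>
    <li>shape is an attribute that holds the dimensions of the DataFrame as (rows, columns)</li>
    <li>columns is an attribute (no parentheses) that holds the column labels as an Index; assigning a list to it renames the columns</li>
    <li>dtypes is an attribute that shows the data type of each column of the DataFrame</li>
    <li>drop() returns a copy of the DataFrame without the given columns; the original only changes if inplace=True is passed or the result is assigned back</li>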
</ul>

In [74]:
print(features_df)

      Store        Date  Temperature  Fuel_Price  MarkDown1  MarkDown2  \
0         1  2010-02-05        42.31       2.572        NaN        NaN   
1         1  2010-02-12        38.51       2.548        NaN        NaN   
2         1  2010-02-19        39.93       2.514        NaN        NaN   
3         1  2010-02-26        46.63       2.561        NaN        NaN   
4         1  2010-03-05        46.50       2.625        NaN        NaN   
...     ...         ...          ...         ...        ...        ...   
8185     45  2013-06-28        76.05       3.639    4842.29     975.03   
8186     45  2013-07-05        77.50       3.614    9090.48    2268.58   
8187     45  2013-07-12        79.37       3.614    3789.94    1827.31   
8188     45  2013-07-19        82.84       3.737    2961.49    1047.07   
8189     45  2013-07-26        76.06       3.804     212.02     851.73   

      MarkDown3  MarkDown4  MarkDown5         CPI  Unemployment  IsHoliday  
0           NaN        NaN        

In [98]:
features_df.head(10)

Unnamed: 0,Store,Date,Temp,Fuel Price,MD1,MD2,MD3,MD4,MD5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,,,,,,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,,,,,,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,,,,,,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,,,,,,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,,,,,,211.350143,8.106,False
5,1,2010-03-12,57.79,2.667,,,,,,211.380643,8.106,False
6,1,2010-03-19,54.58,2.72,,,,,,211.215635,8.106,False
7,1,2010-03-26,51.45,2.732,,,,,,211.018042,8.106,False
8,1,2010-04-02,62.27,2.719,,,,,,210.82045,7.808,False
9,1,2010-04-09,65.86,2.77,,,,,,210.622857,7.808,False


In [76]:
features_df.tail()

Unnamed: 0,Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
8185,45,2013-06-28,76.05,3.639,4842.29,975.03,3.0,2449.97,3169.69,,,False
8186,45,2013-07-05,77.5,3.614,9090.48,2268.58,582.74,5797.47,1514.93,,,False
8187,45,2013-07-12,79.37,3.614,3789.94,1827.31,85.72,744.84,2150.36,,,False
8188,45,2013-07-19,82.84,3.737,2961.49,1047.07,204.19,363.0,1059.46,,,False
8189,45,2013-07-26,76.06,3.804,212.02,851.73,2.06,10.88,1864.57,,,False


In [80]:
features_df.shape

(8190, 12)

In [81]:
features_df.columns

Index(['Store', 'Date', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
       'IsHoliday'],
      dtype='object')

In [82]:
features_df.columns = ['Store', 'Date', 'Temperature', 'Fuel Price', 
                       'MD1', 'MD2', 'MD3', 'MD4', 'MD5', 'CPI', 
                       'Unemployment', 'IsHoliday']

In [83]:
features_df.head()

Unnamed: 0,Store,Date,Temperature,Fuel Price,MD1,MD2,MD3,MD4,MD5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,,,,,,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,,,,,,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,,,,,,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,,,,,,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,,,,,,211.350143,8.106,False


In [84]:
features_df.rename(columns = {'Temperature': 'Temp'}, inplace=True)

In [85]:
features_df.head()

Unnamed: 0,Store,Date,Temp,Fuel Price,MD1,MD2,MD3,MD4,MD5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,,,,,,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,,,,,,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,,,,,,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,,,,,,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,,,,,,211.350143,8.106,False


In [86]:
features_df.dtypes
# in pandas, the 'object' dtype usually means the column holds strings

Store             int64
Date             object
Temp            float64
Fuel Price      float64
MD1             float64
MD2             float64
MD3             float64
MD4             float64
MD5             float64
CPI             float64
Unemployment    float64
IsHoliday          bool
dtype: object

<h1>Indexing and Series Functions</h1>

<ul>
    <li>Columns of a DataFrame can be accessed through the following format: df_name["name_of_column"] </li>
    <li>A single column is returned as a Series, which has its own set of methods, different from those of a DataFrame</li>
    <li>A few useful Series methods: max(), median(), min(), unique(), value_counts(), sort_values()</li>
</ul>
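A short sketch of these Series methods on a small, made-up DataFrame (the values below are for illustration only and are not the course data):

```python
import pandas as pd

# Made-up values for illustration
sample_df = pd.DataFrame(
    {"Store": [1, 1, 2, 3], "CPI": [211.1, 210.8, 126.1, 182.8]}
)

cpi = sample_df["CPI"]            # selecting one column returns a Series
print(type(cpi).__name__)         # Series
print(cpi.max(), cpi.min())       # 211.1 126.1

sorted_cpi = cpi.sort_values()    # ascending by default; returns a new Series
print(sorted_cpi.tolist())        # [126.1, 182.8, 210.8, 211.1]

store_counts = sample_df["Store"].value_counts()  # how often each value occurs
print(store_counts.loc[1])        # 2
```

`sort_values()` keeps the original index labels attached to each value, which makes it easy to see which row a value came from.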

In [87]:
features_df.head()

Unnamed: 0,Store,Date,Temp,Fuel Price,MD1,MD2,MD3,MD4,MD5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,,,,,,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,,,,,,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,,,,,,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,,,,,,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,,,,,,211.350143,8.106,False


In [88]:
features_df['CPI']

0       211.096358
1       211.242170
2       211.289143
3       211.319643
4       211.350143
           ...    
8185           NaN
8186           NaN
8187           NaN
8188           NaN
8189           NaN
Name: CPI, Length: 8190, dtype: float64

In [89]:
features_df.fillna(0)

Unnamed: 0,Store,Date,Temp,Fuel Price,MD1,MD2,MD3,MD4,MD5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,0.00,0.00,0.00,0.00,0.00,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,0.00,0.00,0.00,0.00,0.00,211.242170,8.106,True
2,1,2010-02-19,39.93,2.514,0.00,0.00,0.00,0.00,0.00,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,0.00,0.00,0.00,0.00,0.00,211.319643,8.106,False
4,1,2010-03-05,46.50,2.625,0.00,0.00,0.00,0.00,0.00,211.350143,8.106,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8185,45,2013-06-28,76.05,3.639,4842.29,975.03,3.00,2449.97,3169.69,0.000000,0.000,False
8186,45,2013-07-05,77.50,3.614,9090.48,2268.58,582.74,5797.47,1514.93,0.000000,0.000,False
8187,45,2013-07-12,79.37,3.614,3789.94,1827.31,85.72,744.84,2150.36,0.000000,0.000,False
8188,45,2013-07-19,82.84,3.737,2961.49,1047.07,204.19,363.00,1059.46,0.000000,0.000,False


In [90]:
features_df["CPI"].max()

228.9764563

In [91]:
features_df["CPI"].min()

126.064

In [92]:
features_df["CPI"].median()

182.7640032

In [93]:
features_df["Store"].unique()
#shows the distinct values in a column

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45], dtype=int64)

In [97]:
features_df['Date'].value_counts()

2012-10-05    45
2010-03-19    45
2013-03-01    45
2012-02-17    45
2013-02-01    45
              ..
2012-07-27    45
2011-03-04    45
2011-09-30    45
2012-11-02    45
2011-02-11    45
Name: Date, Length: 182, dtype: int64

In [100]:
features_df.drop(columns="MD1").head()

Unnamed: 0,Store,Date,Temp,Fuel Price,MD2,MD3,MD4,MD5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,,,,,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,,,,,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,,,,,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,,,,,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,,,,,211.350143,8.106,False


In [105]:
features_df.drop(columns=["MD1", "MD2", "MD3", "MD4", "MD5"], inplace = True)
# inplace=True modifies features_df directly instead of returning a new DataFrame
# Running this cell a second time raises the KeyError below, because the columns are already gone

KeyError: "['MD1' 'MD2' 'MD3' 'MD4' 'MD5'] not found in axis"
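One way to make such a cell safe to re-run is the `errors="ignore"` argument of `drop()`, which skips labels that no longer exist. This is a sketch on a small made-up DataFrame, offered as an optional alternative rather than as what the cell above does.

```python
import pandas as pd

# Made-up values for illustration
sample_df = pd.DataFrame({"Store": [1, 2], "MD1": [0.5, 1.5], "MD2": [2.0, 3.0]})

# errors="ignore" skips labels that are already gone instead of raising KeyError
sample_df.drop(columns=["MD1", "MD2"], inplace=True, errors="ignore")
sample_df.drop(columns=["MD1", "MD2"], inplace=True, errors="ignore")  # safe to repeat

print(list(sample_df.columns))  # ['Store']
```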

In [107]:
features_df.head()

Unnamed: 0,Store,Date,Temp,Fuel Price,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,211.350143,8.106,False


<h1>Indexing</h1>

<ul>
    <li>Indexing with square brackets (df_name["name_of_column"]) selects columns by default, so selecting rows requires the loc indexer (selects by label) or the iloc indexer (selects by integer position).
    </li>
    <li>
      Allowed inputs for iloc are:
        <ul>
            <li>An integer, e.g. 5.</li>
            <li>A list or array of integers, e.g. [4, 3, 0].</li>
            <li>A slice object with ints, e.g. 1:7.</li>
        </ul>
    </li>
</ul>
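The main difference between the two indexers can be sketched on a small made-up DataFrame: a slice in `loc` includes its end label, while a slice in `iloc` excludes its end position, like ordinary Python slicing.

```python
import pandas as pd

# Made-up values for illustration
sample_df = pd.DataFrame(
    {
        "Store": [1, 1, 1, 1],
        "Fuel Price": [2.572, 2.548, 2.514, 2.561],
        "IsHoliday": [False, True, False, False],
    }
)

by_label = sample_df.loc[0:2, "Fuel Price":"IsHoliday"]  # label slices include the end
by_position = sample_df.iloc[0:2, 1:3]                    # position slices exclude the end
print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)

single_row = sample_df.iloc[3]   # a single integer returns that row as a Series
picked = sample_df.iloc[[3, 0]]  # a list of integers returns those rows, in that order
print(picked.index.tolist())     # [3, 0]
```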

In [106]:
features_df.loc[0:10, "Fuel Price":"IsHoliday"]

Unnamed: 0,Fuel Price,CPI,Unemployment,IsHoliday
0,2.572,211.096358,8.106,False
1,2.548,211.24217,8.106,True
2,2.514,211.289143,8.106,False
3,2.561,211.319643,8.106,False
4,2.625,211.350143,8.106,False
5,2.667,211.380643,8.106,False
6,2.72,211.215635,8.106,False
7,2.732,211.018042,8.106,False
8,2.719,210.82045,7.808,False
9,2.77,210.622857,7.808,False


In [109]:
features_df.iloc[[0,1], [1,3]]
# iloc = integer location
# first [] gives the rows and second [] gives the columns

Unnamed: 0,Date,Fuel Price
0,2010-02-05,2.572
1,2010-02-12,2.548


<h1>Formatting Data</h1>

<ul>
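    <li>To access and format the string values of a column, we can use the methods available through the "str" accessor of a Series, e.g. df_name["name_of_column"].str.split()</li>
    <li>We may also control how float values are displayed by assigning a formatting function to the options.display.float_format setting in Pandas</li>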
</ul>
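Both ideas can be sketched on a small made-up DataFrame. The two-decimal format used here is only an example choice, and the last line restores the default display so that later cells are unaffected.

```python
import pandas as pd

# Made-up values for illustration
sample_df = pd.DataFrame(
    {"Date": ["2010-02-05", "2010-03-12"], "CPI": [211.096358, 211.380643]}
)

# .str applies a string method to every value; expand=True spreads the pieces into columns
parts = sample_df["Date"].str.split("-", expand=True)
sample_df["Year"] = parts[0]
print(sample_df["Year"].tolist())  # ['2010', '2010'] (the pieces are still strings)

# float_format is a setting that holds a one-argument formatting function
pd.options.display.float_format = "{:.2f}".format
print(sample_df["CPI"])  # displayed with two decimals; the stored values are unchanged

pd.reset_option("display.float_format")  # restore the default display
```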

In [110]:
new = features_df['Date'].str.split("-", expand=True)
# splits each value into separate columns
new.head()

Unnamed: 0,0,1,2
0,2010,2,5
1,2010,2,12
2,2010,2,19
3,2010,2,26
4,2010,3,5


In [111]:
"2010-02-05".split("-")

['2010', '02', '05']

In [112]:
features_df["Year"] = new[0]
features_df["Month"] = new[1]
# adds columns to the dataframe

In [113]:
features_df.head()

Unnamed: 0,Store,Date,Temp,Fuel Price,CPI,Unemployment,IsHoliday,Year,Month
0,1,2010-02-05,42.31,2.572,211.096358,8.106,False,2010,2
1,1,2010-02-12,38.51,2.548,211.24217,8.106,True,2010,2
2,1,2010-02-19,39.93,2.514,211.289143,8.106,False,2010,2
3,1,2010-02-26,46.63,2.561,211.319643,8.106,False,2010,2
4,1,2010-03-05,46.5,2.625,211.350143,8.106,False,2010,3


In [116]:
features_df.to_csv('features_pandas.csv')
features_df.to_excel('features_pandas.xlsx')

<h1>Conditional Indexing</h1>

<ul>
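    <li>Comparison operators (>, ==, >=, etc.) produce a True/False value for each row; placing that result inside df_name[...] returns only the rows where it is True</li>
    <li>Bitwise operators (| for OR, & for AND) can be used to combine conditions; each condition must be wrapped in its own parentheses</li>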
</ul>
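A minimal sketch of the pattern on a small made-up DataFrame (the values and conditions are chosen only for illustration):

```python
import pandas as pd

# Made-up values for illustration
sample_df = pd.DataFrame(
    {
        "Store": [1, 1, 2, 2],
        "Temp": [42.31, 38.51, 45.0, 30.2],
        "IsHoliday": [False, True, False, True],
    }
)

# A single condition: rows where Temp is above 40
warm = sample_df[sample_df["Temp"] > 40]

# AND: both conditions must hold; note the parentheses around each one
warm_store_1 = sample_df[(sample_df["Temp"] > 40) & (sample_df["Store"] == 1)]

# OR: at least one condition must hold; a boolean column can be used directly
cold_or_holiday = sample_df[(sample_df["Temp"] < 35) | sample_df["IsHoliday"]]

print(warm.index.tolist())             # [0, 2]
print(warm_store_1.index.tolist())     # [0]
print(cold_or_holiday.index.tolist())  # [1, 3]
```

The parentheses are needed because `&` and `|` bind more tightly than comparison operators in Python, so leaving them out changes the meaning of the expression and usually raises an error.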

In [1]:
##CLASS EXERCISE 
# find the rows where Fuel Price is larger than 3.00 AND IsHoliday is True

# find the rows with CPI < 200  OR Unemployment < 5

