## Learning Objectives

- Create a Pandas DataFrame from a dictionary
- Explore Pandas methods and see their effect on the original DataFrame

## What is Pandas?

[Pandas](https://pandas.pydata.org/) is an open-source library that provides all kinds of easy-to-use data structures and data analysis tools in Python. It's one of the most popular libraries for data analysis, and is used by numerous companies such as [Instacart](https://tech.instacart.com/jardin-a-dataframe-based-orm-for-python-178e02e1c21) and [Twilio/SendGrid](https://sendgrid.com/blog/benefits-of-python-for-marketers/)!

## Let's create a simple DataFrame in Pandas

A **DataFrame** is a two-dimensional, size-mutable data structure. It has a table structure, where the _data_ is stored in _rows_ and _columns_.

- If we pass a Python dictionary as the `data` argument of `pd.DataFrame`, we can create a Pandas DataFrame. Each key becomes a column name and each list becomes that column's values.

In [38]:
import pandas as pd

df = pd.DataFrame(data={
    'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa'],
    'age': [23, 78, 22, 19, 45, 33],
    'state': ['iowa', 'oregon', 'california', 'texas', 'washington', 'oregon'],
    'num_children': [2, 2, 0, 1, 2, 1],
    'num_pets': [0, 4, 0, 5, 0, 0],
})
# use the name column as the row labels (the index)
df.set_index('name', inplace=True)

In [5]:
df

name,age,state,num_children,num_pets
john,23,iowa,2,0
mary,78,oregon,2,4
peter,22,california,0,0
jeff,19,texas,1,5
bill,45,washington,2,0
lisa,33,oregon,1,0


In [8]:
df.columns

Index(['age', 'state', 'num_children', 'num_pets'], dtype='object')

## If we are only interested in three columns

In [23]:
df[['num_pets', 'age', 'state']] #we can select more than one column by providing a list of the column names

name,num_pets,age,state
john,0,23,iowa
mary,4,78,oregon
peter,0,22,california
jeff,5,19,texas
bill,0,45,washington
lisa,0,33,oregon


What if we tried to use "name" as a column name?
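One way to explore that question, as a minimal sketch: `set_index('name')` moved the names out of the columns and into the index, so looking up `'name'` as a column is expected to raise a `KeyError`, while the names stay reachable through the index. The small `people` frame below is an illustrative stand-in for `df`, not part of the original notebook.

```python
import pandas as pd

# Illustrative stand-in for df, with 'name' already moved into the index
people = pd.DataFrame(
    {'name': ['john', 'mary'], 'age': [23, 78]}
).set_index('name')

try:
    people['name']              # 'name' is no longer a column
    lookup_failed = False
except KeyError:
    lookup_failed = True

names = list(people.index)      # the labels now live in the index
restored = people.reset_index() # turns the index back into a regular column
print(lookup_failed, names, list(restored.columns))
```

If the names are needed as a regular column again, `reset_index()` returns a new DataFrame with `name` restored as a column.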

# How loc[ ] works

`loc[]` is a very flexible way to access rows and columns of your DataFrame. It selects by label, NOT by position.

The inputs to `loc[]` follow the format `loc[row selection, column selection]`.

Allowed inputs are:

- A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

- A list or array of labels, e.g. ['a', 'b', 'c'].

- A slice object with labels, e.g. 'a':'f'.

If `loc[]` selects several rows but a single column label, or a single row label but several columns, it returns a [pandas Series](https://towardsdatascience.com/pandas-series-a-lightweight-intro-b7963a0d62a2), which is a one-dimensional, array-like object with axis labels. Selecting a single row label and a single column label returns the value itself, and selecting several of both returns a DataFrame.
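The return types described above can be checked with a short sketch. The `people` frame is an illustrative stand-in with made-up variable names, not the notebook's `df`.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [23, 78, 22], 'state': ['iowa', 'oregon', 'california']},
    index=['john', 'mary', 'peter'],
)

one_value = people.loc['mary', 'state']                  # single row, single column
one_row = people.loc['mary', ['age', 'state']]           # single row, list of columns
one_col = people.loc[['john', 'mary'], 'age']            # list of rows, single column
block = people.loc[['john', 'mary'], ['age', 'state']]   # lists on both axes

print(type(one_value).__name__, type(one_row).__name__,
      type(one_col).__name__, type(block).__name__)
```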

## Select the row with label john

In [27]:
df.loc["john"]

age               23
state           iowa
num_children       2
num_pets           0
Name: john, dtype: object

## Select the row with label mary and column state

In [29]:
df.loc["mary", "state"]

'oregon'

### Select the rows from peter to lisa based on columns (state and age for example)

`loc[]` also allows slicing, similar to the slicing notation used for Python lists and strings. One difference: a label slice includes both its start and its stop label.

In [27]:
df.loc["peter":"lisa", 'state']

name
peter    california
jeff          texas
bill     washington
lisa         oregon
Name: state, dtype: object

In [28]:
df.loc['peter':'lisa', ['state', 'age']]

name,state,age
peter,california,22
jeff,texas,19
bill,washington,45
lisa,oregon,33


## How iloc[ ] works

`iloc[]` selects by integer position, similar to index-based access in Python lists. It also allows slicing notation, and as in Python the stop position is excluded.
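A small sketch of how the two accessors treat the end of a slice, using an illustrative `people` frame rather than the notebook's `df`:

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [23, 78, 22, 19]},
    index=['john', 'mary', 'peter', 'jeff'],
)

by_position = people.iloc[0:2]          # positions 0 and 1; the stop is excluded
by_label = people.loc['john':'peter']   # john, mary, peter; the stop label is included

print(list(by_position.index))
print(list(by_label.index))
```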

### Select the first two rows 

In [29]:
# select the first 2 rows
df.iloc[:2]  # start at position 0 and stop at position 2 (stop not included)

name,age,state,num_children,num_pets
john,23,iowa,2,0
mary,78,oregon,2,4


### Select the last two rows

In [7]:
# select the last 2 rows
df.iloc[-2:]

name,age,state,num_children,num_pets
bill,45,washington,2,0
lisa,33,oregon,1,0


### Let's do some more practice. Keep an eye on which accessor is being used: loc or iloc

### Select rows up to and including the assigned one

In [33]:
# select rows up to and including the one
# with label='bill' (this retrieves 5 rows)
df.loc[:"bill"]

name,age,state,num_children,num_pets
john,23,iowa,2,0
mary,78,oregon,2,4
peter,22,california,0,0
jeff,19,texas,1,5
bill,45,washington,2,0


### Select the first column without specifying the name of the column

In [30]:
# first column of the DataFrame
df.iloc[:, 0]  # the bare colon selects all rows (no start or stop given),
# and the 0 selects the column at position 0

name
john     23
mary     78
peter    22
jeff     19
bill     45
lisa     33
Name: age, dtype: int64

### Select all rows from two columns by specifying the names of the columns

In [36]:
df.loc[:,['age', 'state']]
# df[['age', 'state']]

name,age,state
john,23,iowa
mary,78,oregon
peter,22,california
jeff,19,texas
bill,45,washington
lisa,33,oregon


### Select the second row of dataframe

In [4]:
# second row of dataframe
df.iloc[1] 

age                 78
state           oregon
num_children         2
num_pets             4
Name: mary, dtype: object

## Practice Exercise

Write code to:
1. Select the number of pets and the number of children for jeff
2. Select the rows starting at the 3rd row and up to the last row
3. Select age and number of pets for mary up to bill

In [37]:
df.loc["jeff", ["num_pets", "num_children"]]
df.iloc[2:]
df.loc["mary":"bill", ["age","num_pets"]]

name,age,num_pets
mary,78,4
peter,22,0
jeff,19,5
bill,45,0


# Filtering with Conditionals

`df[ ]` can do more than take a column name or a list of column names. It can also take a boolean Series (or array) and will give you back the rows where the value is True.
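As a minimal sketch of the two steps involved, the comparison first produces a boolean Series, and that Series is then used to pick rows. The `people` frame and the names `mask` and `older` are illustrative.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [23, 78, 22, 45]},
    index=['john', 'mary', 'peter', 'bill'],
)

mask = people['age'] > 30      # boolean Series aligned with the row labels
older = people[mask]           # keeps only the rows where mask is True

print(mask.tolist())
print(list(older.index))
```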

### Filter or conditional selection of a dataframe 

In [40]:
# people whose "age" is greater than 30
df['age'] > 30
df[df['age'] > 30]

name,age,state,num_children,num_pets
mary,78,oregon,2,4
bill,45,washington,2,0
lisa,33,oregon,1,0


### Another way of the above selection

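We can also refer to columns as attributes of the DataFrame using `.` notation. This only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.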

In [41]:
df[df.age > 30]

name,age,state,num_children,num_pets
mary,78,oregon,2,4
bill,45,washington,2,0
lisa,33,oregon,1,0


### Select the rows where the number of pets is greater than the number of children

In [43]:
# people who have more pets than children
df["num_pets"] > df[ "num_children"]
df[df["num_pets"] > df[ "num_children"]]

name,age,state,num_children,num_pets
mary,78,oregon,2,4
jeff,19,texas,1,5


### Who in the DataFrame is older than 40 AND owns pets

In [10]:
# people older than 40 who own pets
df[ (df["age"] > 40) & (df["num_pets"] > 0) ] 

name,age,state,num_children,num_pets
mary,78,oregon,2,4


### Who in the DataFrame is older than 40 OR owns pets

In [11]:
df[ (df["age"] > 40) | (df["num_pets"] > 0) ] 

name,age,state,num_children,num_pets
mary,78,oregon,2,4
jeff,19,texas,1,5
bill,45,washington,2,0


### Drop age and num_children columns

**Note:** a copy of the DataFrame will be returned. The original DataFrame will **not** be modified by this action.

`drop` can take optional arguments such as `axis` and `inplace`.

`inplace=True` **will** modify the original DataFrame.

`axis` can be 0 or 1: 0 means drop labels from the rows (the index), 1 means drop labels from the columns.
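The effect of `axis` and `inplace` can be sketched on a small illustrative frame (`people` and the other variable names are stand-ins, not part of the original notebook). The `columns=` keyword is shown as an alternative spelling of `axis=1`.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [23, 78], 'state': ['iowa', 'oregon'], 'num_pets': [0, 4]},
    index=['john', 'mary'],
)

no_age = people.drop(['age'], axis=1)           # axis=1: drop a column label
no_john = people.drop(['john'], axis=0)         # axis=0: drop a row label
same_as_no_age = people.drop(columns=['age'])   # keyword form, no axis needed

print(list(no_age.columns), list(no_john.index), list(people.columns))

trimmed = people.copy()
trimmed.drop(['age'], axis=1, inplace=True)     # modifies trimmed and returns None
print(list(trimmed.columns))
```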

In [45]:
# df itself is not modified; a copy is returned instead
df.drop(["age","num_children"],axis=1)

name,state,num_pets
john,iowa,0
mary,oregon,4
peter,california,0
jeff,texas,5
bill,washington,0
lisa,oregon,0


### After dropping, the original DataFrame is untouched

In [14]:
df

name,age,state,num_children,num_pets
john,23,iowa,2,0
mary,78,oregon,2,4
peter,22,california,0,0
jeff,19,texas,1,5
bill,45,washington,2,0
lisa,33,oregon,1,0


### Provide various descriptive properties of the dataframe

In [17]:
df.describe()

,age,num_children,num_pets
count,6.0,6.0,6.0
mean,36.666667,1.333333,1.5
std,22.384518,0.816497,2.345208
min,19.0,0.0,0.0
25%,22.25,1.0,0.0
50%,28.0,1.5,0.0
75%,42.0,2.0,3.0
max,78.0,2.0,5.0


### What is the average for age, num_pets and num_children

One way to do this is with the help of [NumPy](https://www.numpy.org/). NumPy is an open-source Python library for scientific computing that provides high-performance arrays and matrices. Pandas builds on functionality provided by NumPy, so the two are often used together.

### pandas apply

`.apply()` applies a function along an axis: `axis=0` (or `'index'`) passes each column to the function, and `axis=1` (or `'columns'`) passes each row. The final return type is inferred from the return type of the applied function, but it may not be exactly what you expect unless you pass a specific value to the optional `result_type` argument (which only acts when `axis=1`).

We will often see `apply` given a lambda function. Lambda functions are just nameless functions in Python. For example, these two ways of specifying a function are equivalent:

<code> def get_mean(col):
    return np.mean(col)
    
lambda col: np.mean(col) </code>
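As a rough sketch of both points, the block below passes a named function and an equivalent lambda to `apply`, then switches the axis. The `counts` frame and the helper `get_sum` are illustrative names, not taken from the notebook.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame(
    {'num_children': [2, 2, 0], 'num_pets': [0, 4, 0]},
    index=['john', 'mary', 'peter'],
)

def get_sum(values):
    return np.sum(values)

per_column = counts.apply(get_sum, axis=0)                          # one result per column
per_column_lambda = counts.apply(lambda col: np.sum(col), axis=0)   # same, as a lambda
per_row = counts.apply(lambda row: np.sum(row), axis=1)             # one result per row

print(per_column.to_dict())
print(per_row.to_dict())
```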

In [49]:
# Apply an aggregate function to every column
import numpy as np
df[["age","num_pets","num_children"]].mean() #for some functions pandas already has a provided method
df[["age","num_pets","num_children"]].apply(lambda col: np.mean(col),axis=0) #what is the return type?

age             36.666667
num_pets         1.500000
num_children     1.333333
dtype: float64

In [26]:
df[["age","num_pets","num_children"]].apply(lambda col: np.sum(col),axis=0)

age             220
num_pets          9
num_children      8
dtype: int64

In [3]:
df['age'].sum()

220

In [7]:
df.sum()

age                                                   220
state           iowaoregoncaliforniatexaswashingtonoregon
num_children                                            8
num_pets                                                9
dtype: object

### Add numerical values of age, num_pets, num_children

In [50]:
df[["age","num_pets","num_children"]].apply(lambda row: np.sum(row),axis=1) # axis=1 passes each row to the function
# here we are summing the values of age, num_pets, and num_children across each row

name
john     25
mary     84
peter    22
jeff     25
bill     47
lisa     34
dtype: int64

Another cool thing we can do is apply functions to our data and then save the results as new columns

In [23]:
df["row_sum"] = df[["age","num_pets","num_children"]].apply(lambda row: np.sum(row),axis=1)
df

name,age,state,num_children,num_pets,row_sum
john,23,iowa,2,0,25
mary,78,oregon,2,4,84
peter,22,california,0,0,22
jeff,19,texas,1,5,25
bill,45,washington,2,0,47
lisa,33,oregon,1,0,34


### Select dataframe based on column name

In [8]:
df[['age']] #this returns a dataframe

name,age
john,23
mary,78
peter,22
jeff,19
bill,45
lisa,33


In [9]:
df['age'] #this returns a series

name
john     23
mary     78
peter    22
jeff     19
bill     45
lisa     33
Name: age, dtype: int64

### Apply a function to specific column WITHOUT applying the changes to original dataframe

**Note:** This does not affect the original DataFrame object. It returns a transformed copy of the specified column.

In [10]:
df[["age"]].apply(lambda value: value*2)

name,age
john,46
mary,156
peter,44
jeff,38
bill,90
lisa,66


In [24]:
df

name,age,state,num_children,num_pets,row_sum
john,23,iowa,2,0,25
mary,78,oregon,2,4,84
peter,22,california,0,0,22
jeff,19,texas,1,5,25
bill,45,washington,2,0,47
lisa,33,oregon,1,0,34


### Apply a function to specific column and apply the changes to original dataframe

In [35]:
df['age'] = df['age'].apply(lambda x: x*2)

In [36]:
df

name,age,state,num_children,num_pets,row_sum
john,46,iowa,2,0,25
mary,156,oregon,2,4,84
peter,44,california,0,0,22
jeff,38,texas,1,5,25
bill,90,washington,2,0,47
lisa,66,oregon,1,0,34


### Sort dataframe based on a specific column in ascending order

`sort_values()` takes one required argument, `by`.

`by` can be a str or a list of strs.

If `axis` is 0 or `'index'`, then `by` may contain index levels and/or column labels.

If `axis` is 1 or `'columns'`, then `by` may contain column levels and/or index labels.
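A short sketch of the `by` argument taking either a single column name or a list, on an illustrative `people` frame (not the notebook's `df`). It also shows `ascending=False` and that the original frame keeps its order.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [23, 78, 22, 33], 'state': ['iowa', 'oregon', 'california', 'oregon']},
    index=['john', 'mary', 'peter', 'lisa'],
)

oldest_first = people.sort_values('age', ascending=False)   # single column, descending
by_state_then_age = people.sort_values(['state', 'age'])    # ties in state broken by age

print(list(oldest_first.index))
print(list(by_state_then_age.index))
print(list(people.index))   # the original order is unchanged
```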

In [25]:
# Sort DataFrame by column value
df.sort_values("age", ascending= True)


name,age,state,num_children,num_pets,row_sum
jeff,38,texas,1,5,25
peter,44,california,0,0,22
john,46,iowa,2,0,25
lisa,66,oregon,1,0,34
bill,90,washington,2,0,47
mary,156,oregon,2,4,84


In [27]:
df # sort_values returned a sorted copy; the original DataFrame is unchanged

name,age,state,num_children,num_pets,row_sum
john,46,iowa,2,0,25
mary,156,oregon,2,4,84
peter,44,california,0,0,22
jeff,38,texas,1,5,25
bill,90,washington,2,0,47
lisa,66,oregon,1,0,34


### Select rows whose name begins with the letter 'j'

We can also use lambda functions to perform selections. Because `name` is the index here rather than a column, each row's name is available as `row.name`.

In [41]:
# select rows whose name begins with the letter 'j'
df[df.apply(lambda row: row.name.startswith('j'),axis=1)]

name,age,state,num_children,num_pets,row_sum
john,46,iowa,2,0,25
jeff,38,texas,1,5,25


### Create a dataframe from an Ordered Dictionary

An OrderedDict is a dictionary subclass that remembers the order in which its contents are added, supporting the usual dict methods. If a new entry overwrites an existing entry, the original insertion position is left unchanged. Deleting an entry and reinserting it will move it to the end. Since Python 3.7, regular dicts also preserve insertion order, so a plain dict works here as well.
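The ordering rules above can be seen in a small sketch. The `prices` dictionary and its values are made up for illustration.

```python
from collections import OrderedDict

prices = OrderedDict([('Item', 1), ('CType', 2), ('USD', 3)])

prices['Item'] = 10          # overwriting keeps the original position
order_after_overwrite = list(prices)

del prices['Item']
prices['Item'] = 10          # deleting and reinserting moves the key to the end
order_after_reinsert = list(prices)

print(order_after_overwrite)
print(order_after_reinsert)
```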

In [28]:
from collections import OrderedDict
from pandas import DataFrame
import pandas as pd
import numpy as np

table = OrderedDict((
    ("Item", ['Item0', 'Item0', 'Item1', 'Item1']),
    ('CType',['Gold', 'Bronze', 'Gold', 'Silver']),
    ('USD',  ['1$', '2$', '3$', '4$']),
    ('EU',   ['1€', '2€', '3€', '4€'])
))
df = DataFrame(table)

In [29]:
df

,Item,CType,USD,EU
0,Item0,Gold,1$,1€
1,Item0,Bronze,2$,2€
2,Item1,Gold,3$,3€
3,Item1,Silver,4$,4€


# Lab Challenge

- Find a dataset that you find interesting on Kaggle and download it
- Use what we have learned so far to explore the data
- Come up with 2 questions that you can answer with the data. For example, with the Pokémon data we could ask how many water and ice types there are
- Once you have your questions and data, pair up and try to answer someone else's questions with their dataset

In [7]:
import pandas as pd
import numpy as np
df = pd.read_csv("Datasets/tv_shows.csv")
df.dropna()

,Unnamed: 0,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,type
0,0,Breaking Bad,2008,18+,9.5,96%,1,0,0,0,1
1,1,Stranger Things,2016,16+,8.8,93%,1,0,0,0,1
2,2,Money Heist,2017,18+,8.4,91%,1,0,0,0,1
3,3,Sherlock,2010,16+,9.1,78%,1,0,0,0,1
4,4,Better Call Saul,2015,18+,8.7,97%,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
5509,5509,Diary of a Future President,2020,7+,5.5,100%,0,0,0,1,1
5517,5517,Encore!,2019,7+,7.4,68%,0,0,0,1,1
5522,5522,Spider-Man Unlimited,1999,7+,6.5,50%,0,0,0,1,1
5530,5530,The Super Hero Squad Show,2009,7+,6.1,50%,0,0,0,1,1


In [53]:
# get all the info for the shows that had a Rotten Tomatoes rating of 96% or higher and aired in 2019 or later

In [73]:
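def convert(percent):
    # missing ratings are NaN (a float), so only strings like "96%" are converted
    if isinstance(percent, str):
        percent = int(percent.rstrip("%"))
    return percent

# Series.apply calls convert on each value, so no axis argument is needed
df["numerical_percent"] = df["Rotten Tomatoes"].apply(convert)
df[(df["Year"] >= 2019) & (df["numerical_percent"] >= 96)]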


,Unnamed: 0,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,type,numerical_percent
41,41,When They See Us,2019,18+,8.9,96%,1,0,0,0,1,96.0
46,46,Never Have I Ever,2020,16+,8.0,97%,1,0,0,0,1,97.0
56,56,Unbelievable,2019,18+,8.4,97%,1,0,0,0,1,97.0
65,65,Kingdom,2019,18+,8.4,96%,1,0,0,0,1,96.0
90,90,Russian Doll,2019,18+,7.9,96%,1,0,0,0,1,96.0
172,172,Feel Good,2020,,7.5,100%,1,0,0,0,1,100.0
192,192,Middleditch & Schwartz,2020,18+,8.7,100%,1,0,0,0,1,100.0
216,216,Cheer,2020,16+,8.2,96%,1,0,0,0,1,96.0
230,230,The Order,2019,18+,6.9,100%,1,0,0,0,1,100.0
256,256,The Innocence Files,2020,,8.0,100%,1,0,0,0,1,100.0
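As an alternative sketch, the same conversion can be written with pandas' vectorized string methods instead of a Python function. The tiny `shows` frame below is hypothetical stand-in data with made-up titles, not rows from `tv_shows.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; one rating is missing on purpose
shows = pd.DataFrame({
    'Title': ['Show A', 'Show B', 'Show C'],
    'Year': [2018, 2019, 2020],
    'Rotten Tomatoes': ['97%', '96%', np.nan],
})

# .str methods skip missing values, so NaN stays NaN after the conversion
shows['numerical_percent'] = shows['Rotten Tomatoes'].str.rstrip('%').astype(float)

selected = shows[(shows['Year'] >= 2019) & (shows['numerical_percent'] >= 96)]
print(selected['Title'].tolist())
```

Comparisons against NaN evaluate to False, so shows without a rating drop out of the selection without any special handling.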
