----
**Author**: Gunnvant

**Description**: Pandas 101

**Audience**: Beginner

**Pre-requisites**: Python 101, Working with flat files, OOPS for Data Science

---

So far we have seen how to manipulate data using lists, dictionaries and tuples, and how to read and write flat files using the csv reader. We have also built Python classes to organize some data analysis functionality. But as you may have figured out by now, first writing the code that will analyse the data makes the whole data analysis process take a very long time. Hence we mostly use a framework such as `pandas` to do data analysis.


With libraries such as `pandas`, the focus of writing code is on doing data analysis rather than on building functionality, as we were doing in the previous sessions.

Before we discuss `pandas` in detail, let's first discuss the anatomy of tabular data analysis. There are four main things we do when we say that we are doing tabular data analysis:

1. Filtering data
2. Sorting data by column(s)
3. Grouping or pivoting data
4. Joining tables
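
The four operations can be sketched on a tiny example before we touch real data. The tables `sales` and `regions` below are hypothetical and made up only for illustration; each operation is covered in detail later in this session.

```python
import pandas as pd

# Hypothetical toy tables, made up purely for illustration
sales = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "amount": [100, 250, 300, 50],
})
regions = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "region": ["West", "North"],
})

# 1. Filter: keep rows where amount exceeds 75
big_sales = sales[sales["amount"] > 75]

# 2. Sort: order rows by amount, largest first
sorted_sales = sales.sort_values(by="amount", ascending=False)

# 3. Group by: total amount per city
totals = sales.groupby("city")["amount"].sum()

# 4. Join: attach the region of each city to every sale
joined = pd.merge(sales, regions, how="left", on="city")
```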

In [1]:
import pandas as pd
olympics = pd.read_csv("../../data/athlete_events.csv")
olympics.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


## Pandas 101

Pandas is used for tabular data analysis. Whenever we want to slice and dice data, pandas is a good go-to tool.

In [2]:
type(olympics)

pandas.core.frame.DataFrame

In [3]:
olympics.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [4]:
olympics.columns

Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal'],
      dtype='object')

In [5]:
olympics['Name'].head()

0                   A Dijiang
1                    A Lamusi
2         Gunnar Nielsen Aaby
3        Edgar Lindenau Aabye
4    Christine Jacoba Aaftink
Name: Name, dtype: object

In [6]:
type(olympics['Name'])

pandas.core.series.Series

In [7]:
olympics.shape

(271116, 15)

In [8]:
print(dir(olympics['Name']))

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__array_wrap__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__long__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__'

## Filtering data

Many business questions can be answered just by filtering data. For example, to find out how many athlete-event entries there are for the 1992 Summer Olympics, we can filter the data on the column `Games`, keep the relevant rows, and count the number of rows in that subset.

In [9]:
olympics[olympics['Games']=='1992 Summer'].head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
96,33,Mika Lauri Aarnikka,M,24.0,187.0,76.0,Finland,FIN,1992 Summer,1992,Summer,Barcelona,Sailing,Sailing Men's Two Person Dinghy,
118,43,Morten Gjerdrum Aasen,M,34.0,185.0,75.0,Norway,NOR,1992 Summer,1992,Summer,Barcelona,Equestrianism,"Equestrianism Mixed Jumping, Individual",
137,50,Arvi Aavik,M,22.0,185.0,106.0,Estonia,EST,1992 Summer,1992,Summer,Barcelona,Wrestling,"Wrestling Men's Heavyweight, Freestyle",
160,64,M'Bairo Abakar,M,31.0,,,Chad,CHA,1992 Summer,1992,Summer,Barcelona,Judo,Judo Men's Half-Middleweight,


In [10]:
olympics[olympics['Games']=='1992 Summer'].shape

(12977, 15)

In [11]:
olympics[olympics['Games']=='1992 Summer'].shape[0]

12977

### Class Drill: Write the code to find out the data subset where the athletes are from China

In [12]:
### Writing filters with multiple conditions
## table[(cond1) operator (cond2)], where operator is & (and) or | (or)
olympics[(olympics['Team']=='China')|(olympics['Team']=='Norway')].head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
59,20,Kjetil Andr Aamodt,M,20.0,176.0,85.0,Norway,NOR,1992 Winter,1992,Winter,Albertville,Alpine Skiing,Alpine Skiing Men's Downhill,
60,20,Kjetil Andr Aamodt,M,20.0,176.0,85.0,Norway,NOR,1992 Winter,1992,Winter,Albertville,Alpine Skiing,Alpine Skiing Men's Super G,Gold
61,20,Kjetil Andr Aamodt,M,20.0,176.0,85.0,Norway,NOR,1992 Winter,1992,Winter,Albertville,Alpine Skiing,Alpine Skiing Men's Giant Slalom,Bronze


In [13]:
olympics.query("Team=='China' or Team=='Norway'").head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
59,20,Kjetil Andr Aamodt,M,20.0,176.0,85.0,Norway,NOR,1992 Winter,1992,Winter,Albertville,Alpine Skiing,Alpine Skiing Men's Downhill,
60,20,Kjetil Andr Aamodt,M,20.0,176.0,85.0,Norway,NOR,1992 Winter,1992,Winter,Albertville,Alpine Skiing,Alpine Skiing Men's Super G,Gold
61,20,Kjetil Andr Aamodt,M,20.0,176.0,85.0,Norway,NOR,1992 Winter,1992,Winter,Albertville,Alpine Skiing,Alpine Skiing Men's Giant Slalom,Bronze
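
Two details of multi-condition filters are worth a small sketch: why every condition is wrapped in parentheses, and how `isin()` can shorten a chain of `|` comparisons on the same column. The `athletes` table below is hypothetical toy data, not the Olympics file.

```python
import pandas as pd

# Hypothetical toy data
athletes = pd.DataFrame({
    "Team": ["China", "Norway", "India", "China"],
    "Age": [24, 20, 31, 19],
})

# isin() replaces several == comparisons joined by |
china_or_norway = athletes[athletes["Team"].isin(["China", "Norway"])]

# Each condition needs its own parentheses because & and | bind
# more tightly than comparison operators such as == and <
young_chinese = athletes[(athletes["Team"] == "China") & (athletes["Age"] < 21)]
```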


### Sorting

One can use the `.sort_values()` method and arrange the whole dataframe.

Suppose we wanted to find out the ages of the 10 oldest athletes who have participated in the Olympics.

In [14]:
olympics.sort_values(by = "Age",ascending=False).head(10).drop_duplicates()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
257054,128719,John Quincy Adams Ward,M,97.0,,,United States,USA,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Sculpturing, Statues",
98118,49663,Winslow Homer,M,96.0,,,United States,USA,1932 Summer,1932,Summer,Los Angeles,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
60863,31173,Thomas Cowperthwait Eakins,M,88.0,,,United States,USA,1932 Summer,1932,Summer,Los Angeles,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
9371,5146,George Denholm Armour,M,84.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
236912,118789,Louis Tauzin,M,81.0,,,France,FRA,1924 Summer,1924,Summer,Paris,Art Competitions,Art Competitions Mixed Sculpturing,
154855,77710,Robert Tait McKenzie,M,81.0,,,Canada,CAN,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Sculpturing, Unknown Event",
138812,69729,Max Liebermann,M,80.0,,,Germany,GER,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Painting, Graphic Arts",
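
Note that the cell above calls `.head(10)` before `.drop_duplicates()`, so fewer than ten rows can come back (seven in the output above). If the goal is a fixed number of distinct athletes, one option is to deduplicate first. The sketch below uses hypothetical toy data, and deduplicating on `Name` is an assumption about what counts as a duplicate.

```python
import pandas as pd

# Hypothetical toy data: athlete "A" appears twice, once per event
entries = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "D"],
    "Age": [97, 97, 96, 88, 84],
})

# Deduplicate first, then sort and take the top rows: exactly 3 athletes
oldest_three = (
    entries.drop_duplicates(subset="Name")
           .sort_values(by="Age", ascending=False)
           .head(3)
)

# Taking head(3) first and deduplicating afterwards leaves only 2 athletes
fewer = (
    entries.sort_values(by="Age", ascending=False)
           .head(3)
           .drop_duplicates(subset="Name")
)
```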


### Class Exercise:

Read the data called starbucks_final.csv.
An ideal diet should contain an optimum level of nutrients. Can you find the names of the items on the menu that contain:
- Up to 450 calories
- Up to 40 g protein
- Up to 10 g fat
- Up to 40 g carbs
- Up to 5 g fibre

Give the names of the items on the menu that satisfy the above criteria but contain the least carbs.




### Sorting contd...

Sometimes you might want to sort the data by two columns rather than one. Let's look at a different dataset and see how that can be done:

In [17]:
stores = pd.read_csv("../../data/stores.csv")

In [18]:
stores.head(5)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2013-152156,11-09-2013,11-12-2013,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2013-152156,11-09-2013,11-12-2013,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2013-138688,6/13/2013,6/17/2013,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2012-108966,10-11-2012,10/18/2012,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2012-108966,10-11-2012,10/18/2012,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


##### Now imagine we want to find the details of those transactions where the profit is high but the discount given is low. This can be done with a nested sort: sort by `Profit` in descending order and break ties by `Discount` in ascending order.

In [19]:
stores.sort_values(by=['Profit','Discount'], ascending=[False,True]).head(5)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
6826,6827,CA-2013-118689,10-03-2013,10-10-2013,Standard Class,TC-20980,Tamara Chand,Corporate,United States,Lafayette,...,47905,Central,TEC-CO-10004722,Technology,Copiers,Canon imageCLASS 2200 Advanced Copier,17499.95,5,0.0,8399.976
8153,8154,CA-2014-140151,3/24/2014,3/26/2014,First Class,RB-19360,Raymond Buch,Consumer,United States,Seattle,...,98115,West,TEC-CO-10004722,Technology,Copiers,Canon imageCLASS 2200 Advanced Copier,13999.96,4,0.0,6719.9808
4190,4191,CA-2014-166709,11/18/2014,11/23/2014,Standard Class,HL-15040,Hunter Lopez,Consumer,United States,Newark,...,19711,East,TEC-CO-10004722,Technology,Copiers,Canon imageCLASS 2200 Advanced Copier,10499.97,3,0.0,5039.9856
9039,9040,CA-2013-117121,12/18/2013,12/22/2013,Standard Class,AB-10105,Adrian Barton,Consumer,United States,Detroit,...,48205,Central,OFF-BI-10000545,Office Supplies,Binders,GBC Ibimaster 500 Manual ProClick Binding System,9892.74,13,0.0,4946.37
4098,4099,CA-2011-116904,9/23/2011,9/28/2011,Standard Class,SC-20095,Sanjit Chand,Consumer,United States,Minneapolis,...,55407,Central,OFF-BI-10001120,Office Supplies,Binders,Ibico EPK-21 Electric Binding System,9449.95,5,0.0,4630.4755


In [20]:
### To put nested sorting in perspective, let's create a synthetic dataset

profit = [300,200,500,300,100]
discount = [0.2,0.1,0.2,0.05,0.1]
table = pd.DataFrame({'profit':profit,'discount':discount})
table

Unnamed: 0,profit,discount
0,300,0.2
1,200,0.1
2,500,0.2
3,300,0.05
4,100,0.1


In [21]:
table.sort_values(by = ['profit','discount'],ascending=[False,True])

Unnamed: 0,profit,discount
2,500,0.2
3,300,0.05
0,300,0.2
1,200,0.1
4,100,0.1


### Group by/pivots

Look at the following animation to better understand the intuition behind groupby or pivot tasks

![](split_apply_combine.gif)
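
The split, apply and combine steps can also be written out by hand, which may help in seeing what `groupby` does for us. This is a minimal sketch on hypothetical toy data; the `orders` table is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
orders = pd.DataFrame({
    "City": ["Akron", "York", "Akron", "York"],
    "Profit": [10.0, -4.0, 30.0, 8.0],
})

# Split the rows by city, apply mean() to each piece,
# then combine the results into one Series
manual = {}
for city, piece in orders.groupby("City"):
    manual[city] = piece["Profit"].mean()
manual = pd.Series(manual)

# groupby performs the same three steps in a single expression
automatic = orders.groupby("City")["Profit"].mean()
```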

In [22]:
### Average Profit By city:
stores.groupby('City')['Profit'].mean()

City
Aberdeen         6.630000
Abilene         -3.758400
Akron           -8.887410
Albuquerque     45.292007
Alexandria      19.913644
                  ...    
Woonsocket      19.669775
Yonkers        184.517047
York           -20.433840
Yucaipa         13.208000
Yuma          -116.497725
Name: Profit, Length: 531, dtype: float64

In [23]:
stores.groupby('City').agg({'Profit':'mean'})

Unnamed: 0_level_0,Profit
City,Unnamed: 1_level_1
Aberdeen,6.630000
Abilene,-3.758400
Akron,-8.887410
Albuquerque,45.292007
Alexandria,19.913644
...,...
Woonsocket,19.669775
Yonkers,184.517047
York,-20.433840
Yucaipa,13.208000


In [24]:
### Average Profit by Segment and City
stores.groupby(['Segment','City']).agg({'Profit':'mean'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Profit
Segment,City,Unnamed: 2_level_1
Consumer,Aberdeen,6.630000
Consumer,Abilene,-3.758400
Consumer,Akron,-11.610200
Consumer,Albuquerque,9.628450
Consumer,Alexandria,10.528933
...,...,...
Home Office,Wilmington,133.970367
Home Office,Wilson,-5.412000
Home Office,Woodstock,0.910000
Home Office,Yonkers,47.312600


In [25]:
### Multiple Summaries
stores.groupby(['Segment','City']).agg({'Profit':['min','max','mean']})

Unnamed: 0_level_0,Unnamed: 1_level_0,Profit,Profit,Profit
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean
Segment,City,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Consumer,Aberdeen,6.6300,6.6300,6.630000
Consumer,Abilene,-3.7584,-3.7584,-3.758400
Consumer,Akron,-80.9955,6.7950,-11.610200
Consumer,Albuquerque,-5.6943,34.2925,9.628450
Consumer,Alexandria,6.8768,12.5300,10.528933
...,...,...,...,...
Home Office,Wilmington,5.3949,390.9770,133.970367
Home Office,Wilson,-5.4120,-5.4120,-5.412000
Home Office,Woodstock,0.9100,0.9100,0.910000
Home Office,Yonkers,24.1224,70.5028,47.312600


In [26]:
### Multiple Summaries of multiple columns
stores.groupby(['Segment','City']).agg({'Profit':['min','max','mean'],'Discount':['min','max','mean']})

Unnamed: 0_level_0,Unnamed: 1_level_0,Profit,Profit,Profit,Discount,Discount,Discount
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,min,max,mean
Segment,City,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Consumer,Aberdeen,6.6300,6.6300,6.630000,0.0,0.0,0.000000
Consumer,Abilene,-3.7584,-3.7584,-3.758400,0.8,0.8,0.800000
Consumer,Akron,-80.9955,6.7950,-11.610200,0.2,0.7,0.369231
Consumer,Albuquerque,-5.6943,34.2925,9.628450,0.0,0.2,0.050000
Consumer,Alexandria,6.8768,12.5300,10.528933,0.0,0.0,0.000000
...,...,...,...,...,...,...,...
Home Office,Wilmington,5.3949,390.9770,133.970367,0.0,0.0,0.000000
Home Office,Wilson,-5.4120,-5.4120,-5.412000,0.7,0.7,0.700000
Home Office,Woodstock,0.9100,0.9100,0.910000,0.2,0.2,0.200000
Home Office,Yonkers,24.1224,70.5028,47.312600,0.0,0.0,0.000000


In [27]:
### Indexing a hierarchical object
summary = stores.groupby(['Segment','City']).agg({'Profit':['min','max','mean'],'Discount':['min','max','mean']})

In [28]:
summary.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Profit,Profit,Profit,Discount,Discount,Discount
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,min,max,mean
Segment,City,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Consumer,Aberdeen,6.63,6.63,6.63,0.0,0.0,0.0
Consumer,Abilene,-3.7584,-3.7584,-3.7584,0.8,0.8,0.8
Consumer,Akron,-80.9955,6.795,-11.6102,0.2,0.7,0.369231
Consumer,Albuquerque,-5.6943,34.2925,9.62845,0.0,0.2,0.05
Consumer,Alexandria,6.8768,12.53,10.528933,0.0,0.0,0.0


In [29]:
summary['Profit']

Unnamed: 0_level_0,Unnamed: 1_level_0,min,max,mean
Segment,City,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Consumer,Aberdeen,6.6300,6.6300,6.630000
Consumer,Abilene,-3.7584,-3.7584,-3.758400
Consumer,Akron,-80.9955,6.7950,-11.610200
Consumer,Albuquerque,-5.6943,34.2925,9.628450
Consumer,Alexandria,6.8768,12.5300,10.528933
...,...,...,...,...
Home Office,Wilmington,5.3949,390.9770,133.970367
Home Office,Wilson,-5.4120,-5.4120,-5.412000
Home Office,Woodstock,0.9100,0.9100,0.910000
Home Office,Yonkers,24.1224,70.5028,47.312600


In [30]:
summary['Profit']['mean']

Segment      City       
Consumer     Aberdeen         6.630000
             Abilene         -3.758400
             Akron          -11.610200
             Albuquerque      9.628450
             Alexandria      10.528933
                               ...    
Home Office  Wilmington     133.970367
             Wilson          -5.412000
             Woodstock        0.910000
             Yonkers         47.312600
             Yuma          -257.936400
Name: mean, Length: 1026, dtype: float64

In [31]:
### Now how would you filter or manipulate this?
summary['Profit']['mean'].reset_index()

Unnamed: 0,Segment,City,mean
0,Consumer,Aberdeen,6.630000
1,Consumer,Abilene,-3.758400
2,Consumer,Akron,-11.610200
3,Consumer,Albuquerque,9.628450
4,Consumer,Alexandria,10.528933
...,...,...,...
1021,Home Office,Wilmington,133.970367
1022,Home Office,Wilson,-5.412000
1023,Home Office,Woodstock,0.910000
1024,Home Office,Yonkers,47.312600
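
If the hierarchical columns get in the way, one alternative is pandas' named aggregation, which produces flat column names directly. The sketch below uses hypothetical toy data, and the output names `profit_min` and `profit_mean` are our own choice, not something pandas prescribes.

```python
import pandas as pd

# Hypothetical toy data
orders = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Corporate"],
    "City": ["Akron", "Akron", "York"],
    "Profit": [10.0, 30.0, -4.0],
})

# Named aggregation: each keyword becomes one flat output column,
# given as (input column, summary function)
flat = (
    orders.groupby(["Segment", "City"])
          .agg(profit_min=("Profit", "min"), profit_mean=("Profit", "mean"))
          .reset_index()
)

# The result is an ordinary table, so the usual filtering works
profitable = flat[flat["profit_mean"] > 0]
```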


### Working with dates

Data often contains date-time information. When a file is read with `read_csv`, pandas treats date columns as strings (dtype `object`) unless told otherwise. Once the date columns are converted to datetimes, metadata such as day of week, month, year and quarter can be extracted.

In [32]:
stores.dtypes

Row ID             int64
Order ID          object
Order Date        object
Ship Date         object
Ship Mode         object
Customer ID       object
Customer Name     object
Segment           object
Country           object
City              object
State             object
Postal Code        int64
Region            object
Product ID        object
Category          object
Sub-Category      object
Product Name      object
Sales            float64
Quantity           int64
Discount         float64
Profit           float64
dtype: object

In [33]:
stores['Order Date'].head()

0    11-09-2013
1    11-09-2013
2     6/13/2013
3    10-11-2012
4    10-11-2012
Name: Order Date, dtype: object

In [34]:
stores['Order Date'] = pd.to_datetime(stores['Order Date']) 

In [35]:
stores['Ship Date'] = pd.to_datetime(stores['Ship Date'])

In [36]:
stores['Order Date'].dt.weekday.head()

0    5
1    5
2    3
3    3
4    3
Name: Order Date, dtype: int64

In [37]:
stores['Order Date'].dt.month.head()

0    11
1    11
2     6
3    10
4    10
Name: Order Date, dtype: int64

In [38]:
stores['Order Date'].dt.year.head()

0    2013
1    2013
2    2013
3    2012
4    2012
Name: Order Date, dtype: int64
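
The `Order Date` column mixes two separators (`11-09-2013` and `6/13/2013`). Depending on the pandas version, `pd.to_datetime` without a format may guess differently or complain about such a mix, so a more explicit route is sketched below. It assumes the dates are month-first, which is what the weekday output above suggests; the `raw` series is a hypothetical sample, not the full column.

```python
import pandas as pd

# Hypothetical sample mimicking the two separators seen in `Order Date`
raw = pd.Series(["11-09-2013", "6/13/2013", "10-11-2012"])

# Normalise the separator, then state the month-first format explicitly
dates = pd.to_datetime(raw.str.replace("-", "/", regex=False), format="%m/%d/%Y")

quarters = dates.dt.quarter      # calendar quarter, 1 to 4
day_names = dates.dt.day_name()  # weekday as text, e.g. "Saturday"
```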

### Class Exercise: Flight Delays Dataset

**Load the data set FlightDelays, the data has information on the flights over the year 2004 and if a particular flight was delayed or not.**
1. Find out the number of delayed flights for all weekdays
2. Find the average distance, total distance and count for all delayed flights on Friday.
3. Find out how many flights were on time on weekdays and on weekends (consider Saturday and Sunday as the weekend)
4. Find out the number of flights for each destination across all weekdays
5. Find out the number of times weather was bad across all weekdays. (1 indicates bad weather)


### Map and Apply Constructs

The map and apply constructs are used when we want to run a function over the values of a column, or over the rows or columns of a dataframe. `map()` is a Series method: it applies a function to every value of one column. `apply()` on a dataframe passes whole rows or whole columns to the function.

Since a dataframe is a two-dimensional data structure, we can specify the direction in which an `apply()` operation is applied. Pandas uses the idea of axes to specify the direction of the operation. Below is a gif that gives an intuitive idea of the axes:
![](axis.gif)
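
A small sketch of the two directions, on hypothetical toy data. The function computes the range (max minus min) of whatever it receives, so the difference between the two axes shows up in the shape of the result.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({"maths": [50, 70], "physics": [60, 90]})

# axis=0: the function receives one column at a time (one result per column)
column_ranges = scores.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives one row at a time (one result per row)
row_ranges = scores.apply(lambda row: row.max() - row.min(), axis=1)
```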

#### Map

Let's imagine we wanted to bucket the sales column into high, medium and low sales based on the following rule:
- If sales>5000, high
- If 2000<sales<=5000, medium
- If sales<=2000, low

We can use the notion of map to accomplish this.

In [39]:
def categorise(val):
    if val>5000:
        return 'high'
    elif val<=5000 and val>2000:
        return 'medium'
    else:
        return 'low'
stores['Sales'].map(categorise)

0       low
1       low
2       low
3       low
4       low
       ... 
9989    low
9990    low
9991    low
9992    low
9993    low
Name: Sales, Length: 9994, dtype: object

In [40]:
stores['Sales'].map(categorise).value_counts()

low       9854
medium     121
high        19
Name: Sales, dtype: int64
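
For this kind of numeric bucketing, `pd.cut` is an alternative to `map()` that may be worth knowing. The sketch below reuses the same thresholds on hypothetical sales values; it returns a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Hypothetical sales values chosen to hit every bucket and both boundaries
sales = pd.Series([150.0, 2000.0, 2000.01, 5000.0, 7200.0])

# Bins are right-inclusive by default:
# (-inf, 2000] is low, (2000, 5000] is medium, (5000, inf) is high
buckets = pd.cut(
    sales,
    bins=[-np.inf, 2000, 5000, np.inf],
    labels=["low", "medium", "high"],
)
```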

### Class Exercise: Comey Dataset

Use the dataset named `comey.csv`. You need to use the idea of map to find the length of each response and each question. Then you need to see if there is any difference between the lengths by party affiliation. You can read more about the testimony of James Comey (former FBI director) [here](https://www.intelligence.senate.gov/sites/default/files/documents/os-jcomey-060817.pdf?)

### Apply

Apply is used to traverse rows across columns or vice versa. Suppose that in the stores dataset we want to rebalance the sales figures. Imagine the charge for first class shipping was recognised incorrectly and we now need to decrease the sales number of those orders by $200. We can make use of apply here. We will write a function that receives one row at a time and reads values across its columns, hence the correct axis is 1.

In [41]:
def rebalance(row):
    if row['Ship Mode']=='First Class':
        return row['Sales'] - 200
    else:
        return row['Sales']
stores.apply(rebalance,axis=1)

0       261.9600
1       731.9400
2        14.6200
3       957.5775
4        22.3680
          ...   
9989     25.2480
9990     91.9600
9991    258.5760
9992     29.6000
9993    243.1600
Length: 9994, dtype: float64
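
Row-wise `apply()` calls the Python function once per row, which can be slow on large tables. The same rebalancing can be sketched without a row loop by using a boolean mask. The `toy` table below is hypothetical and only borrows the column names of `stores`.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as `stores`
toy = pd.DataFrame({
    "Ship Mode": ["First Class", "Second Class", "First Class"],
    "Sales": [500.0, 300.0, 250.0],
})

# Boolean mask that is True for first class orders
is_first_class = toy["Ship Mode"] == "First Class"

# where() keeps Sales for rows that are not first class
# and uses Sales - 200 for the rows that are
rebalanced = toy["Sales"].where(~is_first_class, toy["Sales"] - 200)
```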

### Joins

There are scenarios where the data is not contained in a single file. It's not unusual to find data spread across many files. When that happens, we need some mechanism for joining different tables.

In [43]:
customers = pd.read_csv("../../data/customers.csv")
accounts = pd.read_csv("../../data/accounts.csv")

In [44]:
customers.head()

Unnamed: 0,Cust_id,Age
0,AA1,17
1,AA2,18
2,AA3,33
3,AA4,21
4,AA5,14


In [45]:
accounts.head()

Unnamed: 0,CustID,Account Type
0,AA1,AAA
1,AA6,AA
2,AA4,B
3,AA7,CCC
4,AA12,AAA


### Inner Join

- Keep only the rows whose key value appears in both tables

In [46]:
pd.merge(customers,accounts,how='inner',left_on="Cust_id",right_on="CustID")

Unnamed: 0,Cust_id,Age,CustID,Account Type
0,AA1,17,AA1,AAA
1,AA4,21,AA4,B
2,AA6,81,AA6,AA


### Left Outer Join:
- Retain all the rows of the left table and attach the matching rows of the right table; left rows without a match get missing values

In [47]:
pd.merge(customers,accounts,how='left',left_on="Cust_id",right_on="CustID")

Unnamed: 0,Cust_id,Age,CustID,Account Type
0,AA1,17,AA1,AAA
1,AA2,18,,
2,AA3,33,,
3,AA4,21,AA4,B
4,AA5,14,,
5,AA6,81,AA6,AA


### Right Outer Join:
- Retain all the rows of the right table and attach the matching rows of the left table; right rows without a match get missing values

In [48]:
pd.merge(customers,accounts,how='right',left_on="Cust_id",right_on="CustID")

Unnamed: 0,Cust_id,Age,CustID,Account Type
0,AA1,17.0,AA1,AAA
1,AA6,81.0,AA6,AA
2,AA4,21.0,AA4,B
3,,,AA7,CCC
4,,,AA12,AAA
5,,,AA10,DDD


### Outer Join
- Retain all the rows of both tables, whether or not the key has a match in the other table

In [49]:
pd.merge(customers,accounts,how='outer',left_on="Cust_id",right_on="CustID")

Unnamed: 0,Cust_id,Age,CustID,Account Type
0,AA1,17.0,AA1,AAA
1,AA2,18.0,,
2,AA3,33.0,,
3,AA4,21.0,AA4,B
4,AA5,14.0,,
5,AA6,81.0,AA6,AA
6,,,AA7,CCC
7,,,AA12,AAA
8,,,AA10,DDD
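
When checking a join, it can help to see where each row came from. `pd.merge` accepts `indicator=True`, which adds a `_merge` column for this purpose. The sketch below uses hypothetical toy tables that only reuse the key column names from above.

```python
import pandas as pd

# Hypothetical toy tables with the same key column names as above
customers_toy = pd.DataFrame({"Cust_id": ["AA1", "AA2"], "Age": [17, 18]})
accounts_toy = pd.DataFrame({"CustID": ["AA1", "AA7"], "Account Type": ["AAA", "CCC"]})

# indicator=True adds a `_merge` column with the values
# 'both', 'left_only' or 'right_only'
merged = pd.merge(
    customers_toy, accounts_toy,
    how="outer", left_on="Cust_id", right_on="CustID",
    indicator=True,
)

# Customers that have no matching account
no_account = merged[merged["_merge"] == "left_only"]
```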


### Class Exercise (Joins)

- Use the files contributions.csv and candidates.csv. The file contributions.csv contains data on contributions made to political parties. The file candidates.csv contains data on the demographics of candidates belonging to different political parties. What was the highest contribution made on a Sunday?

- Use the files contributions.csv and candidates.csv. The file contributions.csv contains data on contributions made to political parties. The file candidates.csv contains data on the demographics of candidates belonging to different political parties. Is there a difference between the average donations received by Democrats on weekdays vs weekends? (In the column party, R stands for Republican and D stands for Democrats)

- Use the files contributions.csv and candidates.csv. The file contributions.csv contains data on contributions made to political parties. The file candidates.csv contains data on the demographics of candidates belonging to different political parties. What is the highest amount contributed to Democrats on weekdays? (In the column party, R stands for Republican and D stands for Democrats)