## Session 07: Relational Structure + Pivot


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
work_dir = os.getcwd()
print(work_dir)

data_dir = os.path.join(work_dir,'_data')
file_path = os.path.join(data_dir, 'flights.csv')
flights = pd.read_csv(file_path)
display(flights.head())

The `tailnum` column in `flights` contains the plain tail number:

In [None]:
flights['tailnum'].value_counts()

Let's load the `planes` DataFrame now:

In [None]:
planes = pd.read_csv(os.path.join(data_dir, 'planes.csv'), index_col=0)
display(planes)

# This DataFrame has a 'manufacturer' column and a 'tailnum' column (which is also in flights).
# Idea: count the flights per 'tailnum' in flights, then join those counts with ['tailnum', 'manufacturer'] from planes.

The same column, `tailnum`, is present in `planes` too. 

**N.B.** For what follows, it is not necessary that any pair of columns containing data that can be matched also bear the same name in two DataFrames (to be elaborated in the live session).
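
As a minimal sketch of that point, the two toy tables below are hypothetical (they are not the course data), and the key column is deliberately named `tailnum` on the left and `plane_id` on the right. The `left_on` and `right_on` arguments of `merge` name the matching column in each table:

```python
import pandas as pd

# Hypothetical toy tables: the key is 'tailnum' on the left, 'plane_id' on the right
left = pd.DataFrame({'tailnum': ['N1', 'N2', 'N3'],
                     'number_of_flights': [10, 5, 7]})
right = pd.DataFrame({'plane_id': ['N1', 'N2'],
                      'manufacturer': ['BOEING', 'AIRBUS']})

# left_on / right_on tell merge which column to match in each table
merged = left.merge(right, how='left', left_on='tailnum', right_on='plane_id')
print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.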

We want to learn which `manufacturer` (in `planes`) produced the planes that flew the most flights recorded in `flights`. The frequency data (how many flights were observed per plane) come from the `flights` DataFrame, where planes are identified by their `tailnum`. However, the `manufacturer` data are found only in the `planes` DataFrame, where planes are also identified by their `tailnum`. We need to bring these two DataFrames, or subsets of them, together in order to merge the information from `flights` with the data in `planes` and proceed with our analysis of manufacturers.

These are the steps that we will follow in order to perform our analysis:
- group `flights` by `tailnum` and count the number of flights per plane;
- **join** the resulting DataFrame to `planes`;
- inspect which manufacturers produced the most frequently flown aircraft.

### Left and right join in Pandas

Step 1: Group `flights` by `tailnum` and count the number of flights per plane.

In [None]:
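# Count flights per plane; reset_index(name=...) names the count column directly,
# so the result does not depend on how the pandas version labels value_counts() output
flights_count = flights.groupby('tailnum')\
            .size()\
            .reset_index(name='number_of_flights')\
            .sort_values(by='number_of_flights')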


flights_count

Step 2. **Join** the resulting DataFrame to `manufacturers` from `planes`.

In [None]:
planes['manufacturer']

In [None]:
manufacturers= planes[['tailnum','manufacturer']]
display(manufacturers)


In [None]:
# 'tailnum' is the only shared column; naming it makes the join key explicit
manufacturers = manufacturers.merge(flights_count, how='left', on='tailnum')
display(manufacturers)

In [None]:
# Total number of flights per manufacturer ('sum' as a string avoids passing the Python builtin)
manufacturer_counts = manufacturers[['manufacturer','number_of_flights']]\
            .groupby('manufacturer')\
            .agg('sum')\
            .reset_index()\
            .rename(columns={'number_of_flights':'sum'})\
            .sort_values(by='sum', ascending=False)

manufacturer_counts

In [None]:
fig, ax = plt.subplots()
ax.barh(y='manufacturer', width='sum', data=manufacturer_counts)
ax.set_title('Number of flights per manufacturer', size=12, pad=10)
ax.set_ylabel(ylabel='Manufacturer', fontsize=8)
ax.tick_params(axis='y', which='major', labelsize=6)

ax.grid(alpha=.5)

When a left join is performed, every row of the left table is kept; the matching columns from the right table are added, and rows without a match get `NaN` in those columns. Let's see: who works where?

In [None]:
df_left = pd.DataFrame({'Name':['Peter', 'Anna', 'John', 'Marina', 'Suzana'],
                        'Department':['English', 'English', 'Russian', 'German', 'Spanish']})
df_left

In [None]:
df_right= pd.DataFrame({'Department':['English', 'Russian', 'German'],
                         'Building':['IA', 'II', 'II']})
df_right

In [None]:
df_result = df_left.merge(df_right, how='left')
df_result

### This part focuses on practicing joins in pandas, as part of my main goal of learning the pandas library.

If there is more than one match, the data from the left table will be replicated when necessary:

In [None]:
df_left = pd.DataFrame({'Name':['Peter', 'Anna', 'John', 'Marina', 'Suzana'],
                        'Department':['English', 'English', 'Russian', 'German', 'Spanish']})
df_left

In [None]:
df_right = pd.DataFrame({'Department':['English', 'English', 'Russian', 'German'],
                         'Building':['IA', 'IB', 'II', 'II']})
df_right

In [None]:
df_result = df_left.merge(df_right, how='left', on='Department')
df_result

(df_result.isna().sum())


One `NaN` in `Building` is expected: Suzana's department (Spanish) has no match in `df_right`.
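
Row replication is easy to miss on larger tables. As a hedged sketch (the data are a shortened, hypothetical version of the example above), the `validate` argument of `merge` can state the expected key relationship; with `'many_to_one'`, pandas raises `pd.errors.MergeError` when a key is repeated in the right table:

```python
import pandas as pd

df_left = pd.DataFrame({'Name': ['Peter', 'Anna', 'John'],
                        'Department': ['English', 'English', 'Russian']})
df_right = pd.DataFrame({'Department': ['English', 'English', 'Russian'],
                         'Building': ['IA', 'IB', 'II']})

# Without a check, each 'English' row on the left is matched twice
n_rows = len(df_left.merge(df_right, how='left', on='Department'))
print(n_rows)

# 'many_to_one' requires every key to be unique in the right table
try:
    df_left.merge(df_right, how='left', on='Department', validate='many_to_one')
    duplicated_keys = False
except pd.errors.MergeError as err:
    duplicated_keys = True
    print('Merge rejected:', err)
```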

### Inner and outer (full) join in Pandas

**Inner join**: keep only the rows from the left and right table that match. Nothing else. `pd.DataFrame.merge` performs an inner join **by default**; rows from either table are **replicated** when a key matches more than once.

## `.join()` 

- Joins combine two DataFrames based on a key column(s) or index.

### Inner Join: 

- Returns only matching rows between two DataFrames.

### Outer (Full) Join: 

- Returns all rows from both DataFrames. Missing values in non-matching rows are filled with NaN.

In [None]:
#  Inner Join example:

df1= pd.DataFrame({'employee_id':[1,2,3],
                   'name':['Alice','Bob','Skyler']})

df1

In [None]:
df2 = pd.DataFrame({
    'employee_id': [2, 3, 4],
    'salary': [70000, 80000, 90000]
})
df2

In [None]:
# Inner join

inner_join = pd.merge(df1,df2, on='employee_id', how='inner')
display(inner_join)

# Only the rows whose employee_id appears in both DataFrames are joined

Only employees with IDs 2 and 3 are included because they exist in both DataFrames.

- Employee 1 (from df1) and Employee 4 (from df2) are excluded.

### Now Outer join (full join)

In [None]:
outer_join = pd.merge(df1,df2, how='outer',on='employee_id')
outer_join

Employee 1 and 4 are included, but their non-matching values are filled with NaN.

For Employee 1, the salary is NaN.

For Employee 4, the name is NaN.

Conclusion: an outer join keeps all rows from both DataFrames and fills the missing values with `NaN`.

An inner join keeps only the rows whose key appears in both DataFrames.
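
To see where each row of a join comes from, one option is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. A minimal sketch, rebuilding the two employee tables so that it runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [1, 2, 3],
                    'name': ['Alice', 'Bob', 'Skyler']})
df2 = pd.DataFrame({'employee_id': [2, 3, 4],
                    'salary': [70000, 80000, 90000]})

# indicator=True adds a '_merge' column describing the origin of each row
checked = pd.merge(df1, df2, on='employee_id', how='outer', indicator=True)
print(checked[['employee_id', '_merge']])
```

The rows marked `both` are exactly the rows an inner join would return.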

#### Let's do some practice:

In [None]:
#DF 1 product

products = pd.DataFrame({
    'product_id':[101,102,103,104],
    'product_name':['Laptop', 'Tablet', 'Smartphone', 'Monitor']
})

products

## DF2 sales

sales = pd.DataFrame({
    'product_id': [101, 103, 105],
    'units_sold': [500, 200, 50]
})

sales

`Inner Join: Only products with sales`

In [None]:
inner_join = pd.merge(products,sales, on='product_id', how='inner')
inner_join

` Outer Join: All products and sales data`

In [None]:
outer_join = pd.merge(products,sales, how='outer', on='product_id')
outer_join

### `how='left'` and `how='right'` can be used to include all rows from only one DataFrame:

In [None]:
left_join = pd.merge(products,sales, how='left', on='product_id')
left_join

In [None]:
right_join = pd.merge(products,sales, how='right', on='product_id')
right_join

### Semi-join and anti-join

A **semi-join** *filters the left table* down to those observations that have a match in the right table:

In [None]:
df_left = pd.DataFrame({'ID':[1, 2, 3, 4, 5, 6, 7, 8],
                        'Price':[173, 452, 333, 98, 76, 899, 200, 201]})
df_left

In [None]:
df_right = pd.DataFrame({'ID':[1, 2, 3, 3, 3, 7, 7, 8],
                         'Store':['A4', 'A4', 'A1', 'B3', 'C6', 'A1', 'B2', 'B2']})
df_right

In [None]:
df_result = df_left[df_left['ID'].isin(df_right['ID'])]
df_result
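
A semi-join differs from an inner join in that it never replicates rows of the left table and adds no columns. The sketch below uses shortened, hypothetical versions of the two tables to compare the row counts:

```python
import pandas as pd

df_left = pd.DataFrame({'ID': [1, 2, 3, 4],
                        'Price': [173, 452, 333, 98]})
df_right = pd.DataFrame({'ID': [1, 3, 3, 3],
                         'Store': ['A4', 'A1', 'B3', 'C6']})

# Semi-join: each matching left row appears once, with only the left columns
semi = df_left[df_left['ID'].isin(df_right['ID'])]

# Inner join: the left row with ID 3 is repeated once per match on the right
inner = df_left.merge(df_right, on='ID', how='inner')

print(len(semi), len(inner))
```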

An **anti-join** returns the observations in the left table *that do not have* a matching observation in the right table.

In [None]:
df_left = pd.DataFrame({'ID':[1, 2, 3, 4, 5, 6, 7, 8],
                        'Price':[173, 452, 333, 98, 76, 899, 200, 201]})
df_left

In [None]:
df_right = pd.DataFrame({'ID':[1, 2, 3, 3, 3, 7, 7, 8],
                         'Store':['A4', 'A4', 'A1', 'B3', 'C6', 'A1', 'B2', 'B2']})
df_right

In [None]:
df_anti_join = df_left[~(df_left['ID']).isin(df_right['ID'])]
df_anti_join

# Keep only the rows of df_left whose ID does *not* appear in the right table
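
The `isin` filter works for a single key column. As an alternative sketch (not the method used above), an anti-join can also be written with a left `merge` and `indicator=True`, which extends to several key columns. The tables here are shortened, hypothetical versions of the ones above:

```python
import pandas as pd

df_left = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                        'Price': [173, 452, 333, 98, 76]})
df_right = pd.DataFrame({'ID': [1, 3, 3],
                         'Store': ['A4', 'A1', 'B3']})

# Keep only the key column, deduplicated, so left rows are not replicated
right_keys = df_right[['ID']].drop_duplicates()

# Rows flagged 'left_only' have no match in the right table
flagged = df_left.merge(right_keys, on='ID', how='left', indicator=True)
anti = flagged[flagged['_merge'] == 'left_only'].drop(columns='_merge')
print(anti)
```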



### Multiple keys and different column names

In [None]:
df_left = pd.DataFrame({'ID':[1, 2, 3, 4, 5, 6, 7, 8],
                        'Demand':['Low', 'Low', 'Low', 'Low', 'High', 'High', 'High', 'High'],
                        'Price':[173, 452, 333, 98, 76, 899, 200, 201]})
df_left

In [None]:
df_right = pd.DataFrame({'ID':[1, 2, 3, 5, 5, 7, 7, 8],
                         'InStoreDemand':['Low', 'High', 'High', 'Low', 'High', 'Low', 'High', 'High'],
                         'Store':['A4', 'A4', 'A1', 'B3', 'C6', 'A1', 'B2', 'B2']})
df_right

In [None]:
df_result = df_left.merge(df_right, 
                          left_on=['ID', 'Demand'], 
                          right_on=['ID', 'InStoreDemand'])
df_result

#### More examples, Practice joins in Pandas 

1) (left join example)

Compute the average delay of departure for flights across different *destination* airports. Include the standard deviation of the delay too.

Load `airports`:

In [None]:
airports = pd.read_csv(os.path.join(data_dir,'airports.csv'),index_col=0)
airports.head()



In [None]:
flights.dtypes

In [None]:
mean_dep_delay = flights[['dest','dep_delay']]
print(mean_dep_delay)

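# Let's drop the missing values and aggregate

In [None]:
# Drop rows with missing values, then compute the mean and standard deviation
# of the departure delay per destination; the result is stored back in mean_dep_delay
# so that the join below uses one row per destination
mean_dep_delay = flights[['dest','dep_delay']].dropna()\
            .groupby('dest')['dep_delay']\
            .agg(['mean','std'])\
            .reset_index()
mean_dep_delay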

Join to `airports`:

In [None]:
airports_delay = airports.merge(mean_dep_delay,
                                how='left',
                                left_on='faa',
                                right_on='dest')

airports_delay=airports_delay.dropna()
airports_delay.sort_values(by='mean', ascending=False, inplace=True)
airports_delay[['name', 'mean', 'std']]


### Pivot: Long-to-Wide and Wide-to-Long transformations 

The so-called **wide** and **long** data formats are two different ways of representing the same information in a DataFrame.

Let's begin with the **long** format:

In [106]:
df_long = pd.DataFrame({'store': ['Center', 'Center', 'Center', 'Center', 'North', 'North', 'North', 'North'],
     'itemID': [1, 2, 3, 4, 1, 2, 3, 4],
     'price': [110, 25, 47, 200, 105, 20, 41, 150]})
df_long


|   | store | itemID | price |
|---|---|---|---|
| 0 | Center | 1 | 110 |
| 1 | Center | 2 | 25 |
| 2 | Center | 3 | 47 |
| 3 | Center | 4 | 200 |
| 4 | North | 1 | 105 |
| 5 | North | 2 | 20 |
| 6 | North | 3 | 41 |
| 7 | North | 4 | 150 |


The **wide** representation of the same data can be obtained from `pandas.pivot` in the following way:

In [107]:
df_wide = pd.pivot(df_long, 
                   index='itemID', 
                   columns='store', 
                   values='price').reset_index().rename_axis(None, axis=1)
df_wide

|   | itemID | Center | North |
|---|---|---|---|
| 0 | 1 | 110 | 105 |
| 1 | 2 | 25 | 20 |
| 2 | 3 | 47 | 41 |
| 3 | 4 | 200 | 150 |


In [109]:
df_wide = pd.pivot(df_long,
                   index='store',
                   columns='itemID',
                   values='price').reset_index().rename_axis(None, axis=1)

df_wide

|   | store | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Center | 110 | 25 | 47 | 200 |
| 1 | North | 105 | 20 | 41 | 150 |
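
For the opposite, wide-to-long direction, one common option is `pandas.melt`. The sketch below assumes a wide table with one row per `itemID` and one price column per store, rebuilt here so that the block runs on its own; the name `df_long_again` is only illustrative:

```python
import pandas as pd

df_wide = pd.DataFrame({'itemID': [1, 2, 3, 4],
                        'Center': [110, 25, 47, 200],
                        'North': [105, 20, 41, 150]})

# id_vars stay as columns; the store columns are stacked into name/value pairs
df_long_again = pd.melt(df_wide,
                        id_vars='itemID',
                        value_vars=['Center', 'North'],
                        var_name='store',
                        value_name='price')
print(df_long_again)
```

The result has one row per (item, store) pair, which is the same information as the long table at the start of this section, with the columns in a different order.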


### Readings and Videos
- [Visual JOIN](https://joins.spathon.com/)
- [Combining Data in pandas With merge(), .join(), and concat()](https://realpython.com/pandas-merge-join-and-concat/)
- [Pivot tables with Pandas, Reuven Lerner](https://www.youtube.com/watch?v=abO6e_b-EHs)

### A highly recommended To Do
- [Pivot Tables, from Python Data Science Handbook by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html)