<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Combining DataFrames
_**Author**: Boom D. (DSI-NYC), Mahdi S. (DSI-NYC)_
***

__First, we'll cover a _simplification_ of the two most common ways to combine dataframes in Pandas.__

## Import packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # for Pandas plotting

## Loading data

_Note: I've drastically modified and simplified the data from its original source, the [Central Park Squirrel Dataset](https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw)_

In [2]:
age      = pd.read_csv("./datasets/squirrel_age.csv")
color    = pd.read_csv("./datasets/squirrel_color.csv")
location = pd.read_csv("./datasets/squirrel_location.csv")

In [3]:
age # notice number of observations

Unnamed: 0,Unique Squirrel ID,Age
0,8A-AM-1013-06,Juvenile
1,7H-PM-1006-07,Adult
2,3G-PM-1013-03,Adult
3,22F-AM-1007-07,Juvenile
4,20A-PM-1017-01,Adult


In [4]:
color # notice number of observations

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,20A-PM-1017-01,Gray
1,22F-AM-1007-07,Cinnamon
2,7H-PM-1006-07,Gray
3,3G-PM-1013-03,Gray


In [5]:
location # notice number of observations

Unnamed: 0,unique_squirrel_id,lat,long
0,3G-PM-1013-03,-73.974437,40.767428
1,7H-PM-1006-07,-73.970026,40.769934
2,31F-AM-1013-01,-73.959687,40.789379
3,8A-AM-1013-06,-73.97731,40.773805
4,22F-AM-1007-07,-73.96466,40.78277
5,20A-PM-1017-01,-73.970069,40.782889


Take a moment to notice the shape of these data: `age` has 5 rows, `color` has 4, and `location` has 6.

## `.merge()`

When we use `.merge()`:
- We can only merge 2 dataframes at a time.
- We MUST merge on a common column: information that is shared by both dataframes.
---
__What is the common column in the `age` and `color` dataframes?__
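One way to answer this question programmatically is to intersect the column labels of the two dataframes. The sketch below rebuilds `age` and `color` inline from the tables shown above (an assumption, so the cell runs without the CSV files); `common` is a hypothetical variable name.

```python
import pandas as pd

# Inline stand-ins for the CSV files, copied from the tables above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})

# Column labels that appear in both dataframes
common = age.columns.intersection(color.columns).tolist()
print(common)

# Merging on those shared labels is what .merge() does when `on` is omitted
merged = age.merge(color, on=common)
print(merged)
```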

In [6]:
# pd.merge(left = ,
#          right = ,
#          on = )
age.merge(color)

Unnamed: 0,Unique Squirrel ID,Age,Primary Fur Color
0,7H-PM-1006-07,Adult,Gray
1,3G-PM-1013-03,Adult,Gray
2,22F-AM-1007-07,Juvenile,Cinnamon
3,20A-PM-1017-01,Adult,Gray


In [7]:
age.merge(color, on='Unique Squirrel ID')

Unnamed: 0,Unique Squirrel ID,Age,Primary Fur Color
0,7H-PM-1006-07,Adult,Gray
1,3G-PM-1013-03,Adult,Gray
2,22F-AM-1007-07,Juvenile,Cinnamon
3,20A-PM-1017-01,Adult,Gray


In [8]:
age.merge(color, on='Unique Squirrel ID', how='inner')

Unnamed: 0,Unique Squirrel ID,Age,Primary Fur Color
0,7H-PM-1006-07,Adult,Gray
1,3G-PM-1013-03,Adult,Gray
2,22F-AM-1007-07,Juvenile,Cinnamon
3,20A-PM-1017-01,Adult,Gray


In [9]:
age.merge(color, left_on='Unique Squirrel ID', right_on='Unique Squirrel ID', how='inner')

Unnamed: 0,Unique Squirrel ID,Age,Primary Fur Color
0,7H-PM-1006-07,Adult,Gray
1,3G-PM-1013-03,Adult,Gray
2,22F-AM-1007-07,Juvenile,Cinnamon
3,20A-PM-1017-01,Adult,Gray


In [11]:
# When the column names do not match, use left_on="<left column>" and right_on="<right column>" to name the key column in each dataframe
# Using on= here would raise a KeyError, because neither name exists in both dataframes
age.merge(location, left_on='Unique Squirrel ID', right_on='unique_squirrel_id', how='inner')

Unnamed: 0,Unique Squirrel ID,Age,unique_squirrel_id,lat,long
0,8A-AM-1013-06,Juvenile,8A-AM-1013-06,-73.97731,40.773805
1,7H-PM-1006-07,Adult,7H-PM-1006-07,-73.970026,40.769934
2,3G-PM-1013-03,Adult,3G-PM-1013-03,-73.974437,40.767428
3,22F-AM-1007-07,Juvenile,22F-AM-1007-07,-73.96466,40.78277
4,20A-PM-1017-01,Adult,20A-PM-1017-01,-73.970069,40.782889


__Are we missing an observation?__
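To check which observation an inner merge drops, one option is an outer merge with `indicator=True`, which adds a `_merge` column saying where each row came from. This is a minimal sketch using inline copies of the tables above (an assumption); `unmatched` is a hypothetical name.

```python
import pandas as pd

# Inline stand-ins for the CSV files, copied from the tables above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
location = pd.DataFrame({
    "unique_squirrel_id": ["3G-PM-1013-03", "7H-PM-1006-07", "31F-AM-1013-01",
                           "8A-AM-1013-06", "22F-AM-1007-07", "20A-PM-1017-01"],
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# An outer merge keeps rows from both sides; `_merge` records their origin
merged = age.merge(location,
                   left_on="Unique Squirrel ID",
                   right_on="unique_squirrel_id",
                   how="outer",
                   indicator=True)

# Rows that exist in only one of the two dataframes
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

Here the only unmatched row is the squirrel that appears in `location` but not in `age`.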

In [None]:
# Alternative syntax that does the same thing
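One possible way to fill in this cell: the top-level `pd.merge()` function takes the two dataframes as `left` and `right` and otherwise accepts the same arguments as the `.merge()` method. The sketch below uses inline copies of the tables above (an assumption) and hypothetical names `via_method` and `via_function`.

```python
import pandas as pd

# Inline stand-ins for the CSV files, copied from the tables above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
location = pd.DataFrame({
    "unique_squirrel_id": ["3G-PM-1013-03", "7H-PM-1006-07", "31F-AM-1013-01",
                           "8A-AM-1013-06", "22F-AM-1007-07", "20A-PM-1017-01"],
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# Method syntax: the calling dataframe is the left table
via_method = age.merge(location,
                       left_on="Unique Squirrel ID",
                       right_on="unique_squirrel_id",
                       how="inner")

# Function syntax: both tables are passed explicitly
via_function = pd.merge(left=age,
                        right=location,
                        left_on="Unique Squirrel ID",
                        right_on="unique_squirrel_id",
                        how="inner")

# Both calls should produce the same dataframe
print(via_method.equals(via_function))
```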


### What if we reverse the input order? What changes?

In [14]:
# pd.merge(left = ,
#          right = ,
#          on = "Unique Squirrel ID")
color.merge(age, left_on="Unique Squirrel ID", right_on="Unique Squirrel ID", how='left')

Unnamed: 0,Unique Squirrel ID,Primary Fur Color,Age
0,20A-PM-1017-01,Gray,Adult
1,22F-AM-1007-07,Cinnamon,Juvenile
2,7H-PM-1006-07,Gray,Adult
3,3G-PM-1013-03,Gray,Adult


In [15]:
age.merge(color, left_on="Unique Squirrel ID", right_on="Unique Squirrel ID", how='right')

Unnamed: 0,Unique Squirrel ID,Age,Primary Fur Color
0,20A-PM-1017-01,Adult,Gray
1,22F-AM-1007-07,Juvenile,Cinnamon
2,7H-PM-1006-07,Adult,Gray
3,3G-PM-1013-03,Adult,Gray
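The two cells above keep the same rows (everything in `color`) and differ only in column order. A small sketch of that comparison, assuming inline copies of the tables and hypothetical names `left_from_color` and `right_from_age`:

```python
import pandas as pd

# Inline stand-ins for the CSV files, copied from the tables above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})

# Keep every row of `color`, with `color` as the left table...
left_from_color = color.merge(age, on="Unique Squirrel ID", how="left")
# ...or keep every row of `color`, with `color` as the right table
right_from_age = age.merge(color, on="Unique Squirrel ID", how="right")

# Column order follows the input order (left table's columns come first)
print(list(left_from_color.columns))
print(list(right_from_age.columns))

# After aligning the column order, the two results hold the same data
column_order = list(left_from_color.columns)
same_rows = left_from_color.equals(right_from_age[column_order])
print(same_rows)
```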


### What if I don't want the _intersection_ and, instead, I want to keep everything from the right table (i.e. `age`, the bigger one)?

In [16]:
# pd.merge(left = color,
#          right = age,
#          how = "right",
#          on = "Unique Squirrel ID")
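# The call below is the inner merge of `age` and `location` again, for comparison with the right merge in the template above
age.merge(location, left_on="Unique Squirrel ID", right_on='unique_squirrel_id', how='inner')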

Unnamed: 0,Unique Squirrel ID,Age,unique_squirrel_id,lat,long
0,8A-AM-1013-06,Juvenile,8A-AM-1013-06,-73.97731,40.773805
1,7H-PM-1006-07,Adult,7H-PM-1006-07,-73.970026,40.769934
2,3G-PM-1013-03,Adult,3G-PM-1013-03,-73.974437,40.767428
3,22F-AM-1007-07,Juvenile,22F-AM-1007-07,-73.96466,40.78277
4,20A-PM-1017-01,Adult,20A-PM-1017-01,-73.970069,40.782889


Using `how="right"`, what's changed?
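To see the effect of `how="right"` directly, here is a minimal sketch of the commented-out template, run on inline copies of the `color` and `age` tables (an assumption); `kept_all_ages` is a hypothetical name.

```python
import pandas as pd

# Inline stand-ins for the CSV files, copied from the tables above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})

# A right merge keeps every row of the right table (`age`),
# even when the left table (`color`) has no matching ID
kept_all_ages = pd.merge(left=color,
                         right=age,
                         how="right",
                         on="Unique Squirrel ID")
print(kept_all_ages)

# Squirrels with an age but no recorded fur color get NaN
missing_color = kept_all_ages[kept_all_ages["Primary Fur Color"].isna()]
print(missing_color["Unique Squirrel ID"].tolist())
```

Compared with the inner merge, the result has 5 rows instead of 4: the squirrel that has no entry in `color` is kept, with `NaN` as its fur color.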

### What if I have a dataframe with a _different_ name for the column I wish to join "on"?

In [12]:
age

Unnamed: 0,Unique Squirrel ID,Age
0,8A-AM-1013-06,Juvenile
1,7H-PM-1006-07,Adult
2,3G-PM-1013-03,Adult
3,22F-AM-1007-07,Juvenile
4,20A-PM-1017-01,Adult


In [13]:
location

Unnamed: 0,unique_squirrel_id,lat,long
0,3G-PM-1013-03,-73.974437,40.767428
1,7H-PM-1006-07,-73.970026,40.769934
2,31F-AM-1013-01,-73.959687,40.789379
3,8A-AM-1013-06,-73.97731,40.773805
4,22F-AM-1007-07,-73.96466,40.78277
5,20A-PM-1017-01,-73.970069,40.782889


In [None]:
# This breaks with a KeyError: `location` has no column named "Unique Squirrel ID"
# pd.merge(left = age,
#          right = location,
#          on = "Unique Squirrel ID")

In [None]:
# This WORKS!
# pd.merge(left = age,
#          right = location,
#          left_on = ,
#          right_on = )

We see some redundancy (both key columns appear in the result), which is working as expected...
- You may have code that breaks if it expects an incoming dataframe to have the column "unique_squirrel_id" in some places and "Unique Squirrel ID" in others.
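If the duplicated key column is unwanted, one option (a sketch, not the only approach) is to rename the key in one dataframe before merging so that a single `on=` column can be used. The inline tables and the names `location_renamed` and `tidy` are assumptions for illustration.

```python
import pandas as pd

# Inline stand-ins for the CSV files, copied from the tables above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
location = pd.DataFrame({
    "unique_squirrel_id": ["3G-PM-1013-03", "7H-PM-1006-07", "31F-AM-1013-01",
                           "8A-AM-1013-06", "22F-AM-1007-07", "20A-PM-1017-01"],
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# Give the key column the same name in both dataframes
# (rename returns a copy; `location` itself is unchanged)
location_renamed = location.rename(
    columns={"unique_squirrel_id": "Unique Squirrel ID"})

# With matching names, `on=` works and the key appears only once
tidy = age.merge(location_renamed, on="Unique Squirrel ID", how="inner")
print(tidy)
```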

## `.concat()`

#### Concatenating by columns _(not recommended)_

`concat` stacks the dataframes together.

`axis=1` => by column, `axis=0` => by row

In [23]:
# axis 1 => by column, axis 0 => by row
df3 = pd.DataFrame([['8A-AM-1013-06', 'Juvenile']], columns=['Unique Squirrel ID', 'Age'])
df3

Unnamed: 0,Unique Squirrel ID,Age
0,8A-AM-1013-06,Juvenile


In [24]:
pd.concat(objs=[age, df3], axis=0)

Unnamed: 0,Unique Squirrel ID,Age
0,8A-AM-1013-06,Juvenile
1,7H-PM-1006-07,Adult
2,3G-PM-1013-03,Adult
3,22F-AM-1007-07,Juvenile
4,20A-PM-1017-01,Adult
0,8A-AM-1013-06,Juvenile


In [32]:
# To reset the index while concatenating, pass ignore_index=True
pd.concat(objs=[age, df3], axis=0, ignore_index=True)
# Equivalent: pd.concat(objs=[age, df3], axis=0).reset_index(drop=True)

Unnamed: 0,Unique Squirrel ID,Age
0,8A-AM-1013-06,Juvenile
1,7H-PM-1006-07,Adult
2,3G-PM-1013-03,Adult
3,22F-AM-1007-07,Juvenile
4,20A-PM-1017-01,Adult
5,8A-AM-1013-06,Juvenile


In [None]:
# If the columns do not match, the missing values are filled with NaN

Notice how we can concatenate two dataframes without the same number of rows, but...
- Wherever the dataframes don't overlap, the gaps are filled with `NaN` values
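The `NaN` filling can be seen on a small case. The sketch below stacks inline copies of `age` and `color` (an assumption) by rows; `stacked` and `shared_only` are hypothetical names, and `join="inner"` is shown as one way to keep only the columns both dataframes share.

```python
import pandas as pd

# Inline stand-ins for the CSV files, copied from the tables above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})

# Default (outer) join: every column is kept, gaps become NaN
stacked = pd.concat(objs=[age, color], axis=0, ignore_index=True)
print(stacked)
print(stacked.isna().sum())  # NaN count per column

# Inner join: only columns present in every dataframe are kept
shared_only = pd.concat(objs=[age, color], axis=0, join="inner",
                        ignore_index=True)
print(shared_only.columns.tolist())
```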

### Can we `.concat()` more than 2 dataframes?

In [33]:
pd.concat(objs=[age, color, location], axis=0, ignore_index=True)

Unnamed: 0,Unique Squirrel ID,Age,Primary Fur Color,unique_squirrel_id,lat,long
0,8A-AM-1013-06,Juvenile,,,,
1,7H-PM-1006-07,Adult,,,,
2,3G-PM-1013-03,Adult,,,,
3,22F-AM-1007-07,Juvenile,,,,
4,20A-PM-1017-01,Adult,,,,
5,20A-PM-1017-01,,Gray,,,
6,22F-AM-1007-07,,Cinnamon,,,
7,7H-PM-1006-07,,Gray,,,
8,3G-PM-1013-03,,Gray,,,
9,,,,3G-PM-1013-03,-73.974437,40.767428


In [35]:
pd.concat(objs=[age, color, location], axis=1)

Unnamed: 0,Unique Squirrel ID,Age,Unique Squirrel ID.1,Primary Fur Color,unique_squirrel_id,lat,long
0,8A-AM-1013-06,Juvenile,20A-PM-1017-01,Gray,3G-PM-1013-03,-73.974437,40.767428
1,7H-PM-1006-07,Adult,22F-AM-1007-07,Cinnamon,7H-PM-1006-07,-73.970026,40.769934
2,3G-PM-1013-03,Adult,7H-PM-1006-07,Gray,31F-AM-1013-01,-73.959687,40.789379
3,22F-AM-1007-07,Juvenile,3G-PM-1013-03,Gray,8A-AM-1013-06,-73.97731,40.773805
4,20A-PM-1017-01,Adult,,,22F-AM-1007-07,-73.96466,40.78277
5,,,,,20A-PM-1017-01,-73.970069,40.782889
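The output above shows why concatenating by columns is risky: rows are lined up by index label, not by squirrel ID, so each row mixes different squirrels. If column-wise concatenation is still wanted, one possible workaround is to make the ID the index first. This sketch assumes inline copies of `age` and `color`; `by_position` and `by_key` are hypothetical names.

```python
import pandas as pd

# Inline stand-ins for the CSV files, copied from the tables above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})

# Aligned on the default 0..n index: row 0 pairs two different squirrels
by_position = pd.concat(objs=[age, color], axis=1)
print(by_position.iloc[0].tolist())

# Aligned on the squirrel ID: each row describes a single squirrel
by_key = pd.concat(objs=[age.set_index("Unique Squirrel ID"),
                         color.set_index("Unique Squirrel ID")],
                   axis=1)
print(by_key)
```

For key-based matching like this, `.merge()` is usually the more direct tool.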


#### Concatenating by rows _(useful)_

In [None]:
# Creating a new data point (row)
new_datapoint = pd.DataFrame(data = [['8A-AM-1013-06', "Cinnamon"]],
                             columns = ['Unique Squirrel ID', 'Primary Fur Color'])


In [None]:
new_datapoint

In [None]:
# Concatenate new datapoint to existing dataframe
new_color = pd.concat(objs = [color, new_datapoint], axis = 0)
new_color

__Is there anything odd about this new dataframe?__

In [None]:
# Reset index


In [None]:
new_color
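One possible way to complete the cells above: after stacking, the new row keeps its original index label `0`, so the index contains a duplicate. The sketch below uses an inline copy of `color` (an assumption) and checks `index.is_unique` before and after resetting.

```python
import pandas as pd

# Inline stand-in for the color CSV, copied from the table above
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})
new_datapoint = pd.DataFrame(data=[["8A-AM-1013-06", "Cinnamon"]],
                             columns=["Unique Squirrel ID", "Primary Fur Color"])

# Stacking keeps each dataframe's own index labels
new_color = pd.concat(objs=[color, new_datapoint], axis=0)
index_before = new_color.index.tolist()
unique_before = new_color.index.is_unique
print(index_before, unique_before)

# Reset index: drop=True discards the old labels instead of
# keeping them as a new column
new_color = new_color.reset_index(drop=True)
print(new_color.index.tolist(), new_color.index.is_unique)
```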

## References
- [Central Park Squirrel Census](https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw)