# Joining Multiple Datasets

In many applications, data is spread across several files or databases, and we need to combine them to perform an analysis. Let's look at an example where we need to do a join.

For this example we will use the data repository provided by The New York Times (https://github.com/nytimes/covid-19-data). Our dataset contains daily (not cumulative) case counts for all the counties in Ohio. Before exploring the dataset, let's look at the various types of joins.

## Types of Joins

Let's create our sample datasets.

In [1]:
import pandas as pd
basicDetails = pd.DataFrame([[1,'Jay',35,'M'],[2,'Sam',23,'M'],[3,'Joanne',20,'F']],columns=['uid','Name','Age','Sex'])
basicDetails

Unnamed: 0,uid,Name,Age,Sex
0,1,Jay,35,M
1,2,Sam,23,M
2,3,Joanne,20,F


In [3]:
bloodGroup = pd.DataFrame([[1,'A+'],[2,'B+'],[5,'C+']],columns=['bid','Blood Group'])
bloodGroup

Unnamed: 0,bid,Blood Group
0,1,A+
1,2,B+
2,5,C+


There are four basic types of join: inner, left, right and outer.

### Inner Join

![innerjoin](images/innerjoin.png)

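An inner join retains only the records whose keys are common to both datasets. Let us look at a simple example that uses the merge method to perform a join. Because the key columns have different names (uid and bid), we pass them with left_on and right_on; merge performs an inner join by default.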

In [4]:
innerjoin = basicDetails.merge(bloodGroup,left_on='uid',right_on='bid')
innerjoin

Unnamed: 0,uid,Name,Age,Sex,bid,Blood Group
0,1,Jay,35,M,1,A+
1,2,Sam,23,M,2,B+


As you can see, the record with uid 3 does not have a corresponding match in the other DataFrame, so it gets discarded. The same happens to the record with bid 5.

### Left Join
![leftjoin](images/leftjoin.png)

A left join returns all records from the left DataFrame and the matching records from the right DataFrame. Where there is no match, the columns from the right DataFrame are left empty (NaN).

In [5]:
leftjoin = basicDetails.merge(bloodGroup,left_on='uid',right_on='bid',how='left')
leftjoin

Unnamed: 0,uid,Name,Age,Sex,bid,Blood Group
0,1,Jay,35,M,1.0,A+
1,2,Sam,23,M,2.0,B+
2,3,Joanne,20,F,,


Here the left DataFrame is the one on which we call the merge method (basicDetails). Let's try calling merge on the other DataFrame instead.

In [6]:
leftjoin2 = bloodGroup.merge(basicDetails,left_on='bid',right_on='uid',how='left')
leftjoin2

Unnamed: 0,bid,Blood Group,uid,Name,Age,Sex
0,1,A+,1.0,Jay,35.0,M
1,2,B+,2.0,Sam,23.0,M
2,5,C+,,,,
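Notice that uid and Age now show up as 1.0 and 35.0. A missing value (NaN) is a float, so pandas upcasts integer columns that receive one. The following is a minimal sketch, reusing the sample DataFrames from above, of converting such columns to the nullable Int64 dtype; the name leftjoin2_int is only illustrative.

```python
import pandas as pd

basicDetails = pd.DataFrame(
    [[1, 'Jay', 35, 'M'], [2, 'Sam', 23, 'M'], [3, 'Joanne', 20, 'F']],
    columns=['uid', 'Name', 'Age', 'Sex'])
bloodGroup = pd.DataFrame(
    [[1, 'A+'], [2, 'B+'], [5, 'C+']], columns=['bid', 'Blood Group'])

leftjoin2 = bloodGroup.merge(basicDetails, left_on='bid', right_on='uid', how='left')

# The unmatched row (bid 5) introduces NaN, so uid and Age become float64.
print(leftjoin2.dtypes)

# The nullable Int64 dtype keeps whole numbers and marks the gap as <NA>.
leftjoin2_int = leftjoin2.astype({'uid': 'Int64', 'Age': 'Int64'})
print(leftjoin2_int)
```

Whether the conversion is worth it depends on what you do next; for arithmetic, the float columns work just as well.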


### Right Join
![rightjoin](images/rightjoin.png)

A right join returns all records from the right DataFrame and the matching records from the left DataFrame.

In [43]:
rightjoin = bloodGroup.merge(basicDetails,left_on='bid',right_on='uid',how='right')
rightjoin

Unnamed: 0,bid,Blood Group,uid,Name,Age,Sex
0,1.0,A+,1,Jay,35,M
1,2.0,B+,2,Sam,23,M
2,,,3,Joanne,20,F
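A right join is a left join with the two DataFrames swapped. As a quick sanity check (a sketch on the same sample data; swapped_left is an illustrative name), we can build both and compare them after putting the columns in the same order.

```python
import pandas as pd

basicDetails = pd.DataFrame(
    [[1, 'Jay', 35, 'M'], [2, 'Sam', 23, 'M'], [3, 'Joanne', 20, 'F']],
    columns=['uid', 'Name', 'Age', 'Sex'])
bloodGroup = pd.DataFrame(
    [[1, 'A+'], [2, 'B+'], [5, 'C+']], columns=['bid', 'Blood Group'])

rightjoin = bloodGroup.merge(basicDetails, left_on='bid', right_on='uid', how='right')
swapped_left = basicDetails.merge(bloodGroup, left_on='uid', right_on='bid', how='left')

# Both results hold the same rows; only the column order differs,
# so reorder the columns of one before comparing.
same = rightjoin.equals(swapped_left[rightjoin.columns])
print(same)
```

In practice this means you can pick whichever form reads more naturally for the DataFrame you consider the "main" one.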


### Outer Join

![outerjoin](images/outerjoin.png)

An outer join returns the matched records as well as the unmatched records from both DataFrames.

In [7]:
outerjoin = bloodGroup.merge(basicDetails,left_on='bid',right_on='uid',how='outer')
outerjoin

Unnamed: 0,bid,Blood Group,uid,Name,Age,Sex
0,1.0,A+,1.0,Jay,35.0,M
1,2.0,B+,2.0,Sam,23.0,M
2,5.0,C+,,,,
3,,,3.0,Joanne,20.0,F
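To see where each row of a join came from, merge accepts indicator=True, which adds a _merge column with the values both, left_only or right_only. Here is a short sketch on the sample data; the variable unmatched is only for illustration.

```python
import pandas as pd

basicDetails = pd.DataFrame(
    [[1, 'Jay', 35, 'M'], [2, 'Sam', 23, 'M'], [3, 'Joanne', 20, 'F']],
    columns=['uid', 'Name', 'Age', 'Sex'])
bloodGroup = pd.DataFrame(
    [[1, 'A+'], [2, 'B+'], [5, 'C+']], columns=['bid', 'Blood Group'])

outerjoin = bloodGroup.merge(basicDetails, left_on='bid', right_on='uid',
                             how='outer', indicator=True)
print(outerjoin[['bid', 'uid', '_merge']])

# Keep only the rows that found no partner in the other DataFrame.
unmatched = outerjoin[outerjoin['_merge'] != 'both']
print(unmatched)
```

Filtering on _merge like this is a handy way to find keys that would be silently dropped by an inner join.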


Now let's explore our dataset.

In [8]:
# import pandas
import pandas as pd

In [9]:
dataset = pd.read_csv('data/us-counties_oh.csv')
dataset

Unnamed: 0,date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k
0,2020-03-09,USA-39035,Cuyahoga,Ohio,3,0.43,0.03,0,0.00,0.00
1,2020-03-10,USA-39035,Cuyahoga,Ohio,0,0.43,0.03,0,0.00,0.00
2,2020-03-11,USA-39151,Stark,Ohio,1,0.14,0.04,0,0.00,0.00
3,2020-03-11,USA-39035,Cuyahoga,Ohio,0,0.43,0.03,0,0.00,0.00
4,2020-03-12,USA-39155,Trumbull,Ohio,1,0.14,0.07,0,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...
49257,2021-09-29,USA-39009,Athens,Ohio,30,31.43,48.11,0,0.00,0.00
49258,2021-09-29,USA-39007,Ashtabula,Ohio,47,42.29,43.49,0,0.43,0.44
49259,2021-09-29,USA-39005,Ashland,Ohio,41,34.43,64.37,0,0.57,1.07
49260,2021-09-29,USA-39003,Allen,Ohio,90,72.43,70.76,0,0.71,0.70


So our dataset contains 49,262 rows and 10 columns. Now let us do some basic grouping to get a feel for our data. First, let us convert the date column into a datetime column so that we can do date-based aggregations.

In [10]:
dataset['date_converted'] = pd.to_datetime(dataset['date'])

In [11]:
dataset.head()

Unnamed: 0,date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k,date_converted
0,2020-03-09,USA-39035,Cuyahoga,Ohio,3,0.43,0.03,0,0.0,0.0,2020-03-09
1,2020-03-10,USA-39035,Cuyahoga,Ohio,0,0.43,0.03,0,0.0,0.0,2020-03-10
2,2020-03-11,USA-39151,Stark,Ohio,1,0.14,0.04,0,0.0,0.0,2020-03-11
3,2020-03-11,USA-39035,Cuyahoga,Ohio,0,0.43,0.03,0,0.0,0.0,2020-03-11
4,2020-03-12,USA-39155,Trumbull,Ohio,1,0.14,0.07,0,0.0,0.0,2020-03-12
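With a datetime column in place, date-based aggregations become straightforward. The sketch below uses a few made-up rows shaped like the Ohio file (the values are not from the real dataset) and sums the daily cases per calendar month.

```python
import pandas as pd

# Toy rows shaped like the Ohio file; the numbers are invented for illustration.
sample = pd.DataFrame({
    'date': ['2020-03-09', '2020-03-11', '2020-04-01', '2020-04-15'],
    'county': ['Cuyahoga', 'Stark', 'Cuyahoga', 'Stark'],
    'cases': [3, 1, 10, 7],
})
sample['date_converted'] = pd.to_datetime(sample['date'])

# Turn each date into its calendar month and sum the daily counts per month.
monthly = sample.groupby(sample['date_converted'].dt.to_period('M'))['cases'].sum()
print(monthly)
```

The same pattern should work on the full dataset by replacing sample with dataset.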


Now let us group the data by county so that we can then get the county totals using sum().

In [12]:
countyGroup = dataset.groupby('county')
len(countyGroup)

89

Hmm, it seems like we have 89 counties in Ohio (but Ohio only has 88, right?). So let's look at the county names.

In [13]:
sorted(countyGroup.groups.keys())

['Adams',
 'Allen',
 'Ashland',
 'Ashtabula',
 'Athens',
 'Auglaize',
 'Belmont',
 'Brown',
 'Butler',
 'Carroll',
 'Champaign',
 'Clark',
 'Clermont',
 'Clinton',
 'Columbiana',
 'Coshocton',
 'Crawford',
 'Cuyahoga',
 'Darke',
 'Defiance',
 'Delaware',
 'Erie',
 'Fairfield',
 'Fayette',
 'Franklin',
 'Fulton',
 'Gallia',
 'Geauga',
 'Greene',
 'Guernsey',
 'Hamilton',
 'Hancock',
 'Hardin',
 'Harrison',
 'Henry',
 'Highland',
 'Hocking',
 'Holmes',
 'Huron',
 'Jackson',
 'Jefferson',
 'Knox',
 'Lake',
 'Lawrence',
 'Licking',
 'Logan',
 'Lorain',
 'Lucas',
 'Madison',
 'Mahoning',
 'Marion',
 'Medina',
 'Meigs',
 'Mercer',
 'Miami',
 'Monroe',
 'Montgomery',
 'Morgan',
 'Morrow',
 'Muskingum',
 'Noble',
 'Ottawa',
 'Paulding',
 'Perry',
 'Pickaway',
 'Pike',
 'Portage',
 'Preble',
 'Putnam',
 'Richland',
 'Ross',
 'Sandusky',
 'Scioto',
 'Seneca',
 'Shelby',
 'Stark',
 'Summit',
 'Trumbull',
 'Tuscarawas',
 'Union',
 'Unknown',
 'Van Wert',
 'Vinton',
 'Warren',
 'Washington',
 'Wayne',
 'Williams',
 'Wood',
 'Wyandot']

It seems we have records with the county name Unknown, which accounts for the extra group. Issues like this are common in real-world datasets.
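One way to handle such placeholder records is to inspect them and, if appropriate, drop them before grouping. This is a minimal sketch on invented data (the names sample, unknown_rows and known are illustrative); whether dropping is the right choice depends on the analysis.

```python
import pandas as pd

# Invented rows; only the 'Unknown' placeholder mirrors the real dataset.
sample = pd.DataFrame({
    'county': ['Adams', 'Unknown', 'Allen', 'Adams', 'Unknown'],
    'cases': [5, 2, 7, 1, 4],
})

# How many rows and cases sit under the placeholder name?
unknown_rows = sample[sample['county'] == 'Unknown']
print(len(unknown_rows), unknown_rows['cases'].sum())

# Drop them before grouping if only real counties should be counted.
known = sample[sample['county'] != 'Unknown']
print(known.groupby('county')['cases'].sum())
```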

Let us calculate the total cases for each county.

In [14]:
countyGroup['cases'].sum()

county
Adams          3805
Allen         14622
Ashland        6189
Ashtabula      8613
Athens         7191
              ...  
Washington     7561
Wayne         12469
Williams       4780
Wood          16195
Wyandot        2906
Name: cases, Length: 89, dtype: int64

As you can see, we get a Series object as the result. We can convert it to a DataFrame using our reset_index() trick.

In [15]:
totalCasesForCounties = countyGroup['cases'].sum().reset_index()

Let's rename the column cases to Total Cases.

In [16]:
totalCasesForCounties.rename(columns={'cases':'Total Cases'},inplace=True)

In [17]:
totalCasesForCounties

Unnamed: 0,county,Total Cases
0,Adams,3805
1,Allen,14622
2,Ashland,6189
3,Ashtabula,8613
4,Athens,7191
...,...,...
84,Washington,7561
85,Wayne,12469
86,Williams,4780
87,Wood,16195
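As an alternative to calling reset_index() afterwards, groupby accepts as_index=False, which keeps the grouping key as a regular column. The sketch below compares both routes on a small invented DataFrame (via_reset and via_flag are illustrative names).

```python
import pandas as pd

# Invented rows for illustration.
sample = pd.DataFrame({
    'county': ['Adams', 'Allen', 'Adams', 'Allen'],
    'cases': [5, 7, 1, 3],
})

# Route 1: sum to a Series, then turn the index back into a column.
via_reset = sample.groupby('county')['cases'].sum().reset_index()

# Route 2: ask groupby to keep 'county' as a column from the start.
via_flag = sample.groupby('county', as_index=False)['cases'].sum()

# rename returns a new DataFrame, so no inplace=True is needed.
via_flag = via_flag.rename(columns={'cases': 'Total Cases'})
print(via_flag)
```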


Now suppose we want to calculate the total cases per population (that is, normalize the data by population).

Let us look at another file that contains population data.

In [18]:
popData = pd.read_csv(r'data/UID_ISO_FIPS_LookUp_Table.csv')
popData

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Population
0,4,AF,AFG,4.0,,,,Afghanistan,33.939110,67.709953,Afghanistan,38928341.0
1,8,AL,ALB,8.0,,,,Albania,41.153300,20.168300,Albania,2877800.0
2,10,AQ,ATA,10.0,,,,Antarctica,-71.949900,23.347000,Antarctica,
3,12,DZ,DZA,12.0,,,,Algeria,28.033900,1.659600,Algeria,43851043.0
4,20,AD,AND,20.0,,,,Andorra,42.506300,1.521800,Andorra,77265.0
...,...,...,...,...,...,...,...,...,...,...,...,...
4316,84056037,US,USA,840.0,56037.0,Sweetwater,Wyoming,US,41.659439,-108.882788,"Sweetwater, Wyoming, US",42343.0
4317,84056039,US,USA,840.0,56039.0,Teton,Wyoming,US,43.935225,-110.589080,"Teton, Wyoming, US",23464.0
4318,84056041,US,USA,840.0,56041.0,Uinta,Wyoming,US,41.287818,-110.547578,"Uinta, Wyoming, US",20226.0
4319,84056043,US,USA,840.0,56043.0,Washakie,Wyoming,US,43.904516,-107.680187,"Washakie, Wyoming, US",7805.0


Let us keep only the rows whose Province_State is Ohio.

In [19]:
popOhio = popData[popData.Province_State=='Ohio']
popOhio

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Population
1038,84000039,US,USA,840.0,39.0,,Ohio,US,40.388800,-82.764900,"Ohio, US",11689100.0
1110,84080039,US,USA,840.0,80039.0,Out of OH,Ohio,US,,,"Out of OH, Ohio, US",
1161,84090039,US,USA,840.0,90039.0,Unassigned,Ohio,US,,,"Unassigned, Ohio, US",
3222,84039001,US,USA,840.0,39001.0,Adams,Ohio,US,38.845411,-83.471896,"Adams, Ohio, US",27698.0
3223,84039003,US,USA,840.0,39003.0,Allen,Ohio,US,40.772852,-84.108023,"Allen, Ohio, US",102351.0
...,...,...,...,...,...,...,...,...,...,...,...,...
3305,84039167,US,USA,840.0,39167.0,Washington,Ohio,US,39.456906,-81.491214,"Washington, Ohio, US",59911.0
3306,84039169,US,USA,840.0,39169.0,Wayne,Ohio,US,40.829259,-81.888448,"Wayne, Ohio, US",115710.0
3307,84039171,US,USA,840.0,39171.0,Williams,Ohio,US,41.560520,-84.584296,"Williams, Ohio, US",36692.0
3308,84039173,US,USA,840.0,39173.0,Wood,Ohio,US,41.362248,-83.622851,"Wood, Ohio, US",130817.0


The columns we are interested in are FIPS, Admin2 and Population.

In [20]:
popOhio = popOhio[['FIPS','Admin2','Population']]
popOhio

Unnamed: 0,FIPS,Admin2,Population
1038,39.0,,11689100.0
1110,80039.0,Out of OH,
1161,90039.0,Unassigned,
3222,39001.0,Adams,27698.0
3223,39003.0,Allen,102351.0
...,...,...,...
3305,39167.0,Washington,59911.0
3306,39169.0,Wayne,115710.0
3307,39171.0,Williams,36692.0
3308,39173.0,Wood,130817.0


Now we have our totalCasesForCounties DataFrame, which contains the counties and their total cases, and the popOhio DataFrame, which contains the population data for the counties.

What do the two have in common? We can see that county in totalCasesForCounties and Admin2 in popOhio both refer to the county names in Ohio. So let's try to merge the datasets using county/Admin2 as the key.

In [21]:
mergedOhioDataset = totalCasesForCounties.merge(popOhio,left_on='county',right_on='Admin2')
mergedOhioDataset

Unnamed: 0,county,Total Cases,FIPS,Admin2,Population
0,Adams,3805,39001.0,Adams,27698.0
1,Allen,14622,39003.0,Allen,102351.0
2,Ashland,6189,39005.0,Ashland,53484.0
3,Ashtabula,8613,39007.0,Ashtabula,97241.0
4,Athens,7191,39009.0,Athens,65327.0
...,...,...,...,...,...
83,Washington,7561,39167.0,Washington,59911.0
84,Wayne,12469,39169.0,Wayne,115710.0
85,Williams,4780,39171.0,Williams,36692.0
86,Wood,16195,39173.0,Wood,130817.0
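An inner join drops unmatched keys silently, so it can be worth checking what was left out. The following sketch, on a few toy rows (the Unknown total of 100 is made up), uses a left join with indicator=True to list the counties that found no population row, and validate='one_to_one' to raise an error if a county name appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for totalCasesForCounties and popOhio.
cases = pd.DataFrame({'county': ['Adams', 'Allen', 'Unknown'],
                      'Total Cases': [3805, 14622, 100]})  # 100 is invented
pop = pd.DataFrame({'Admin2': ['Adams', 'Allen', 'Unassigned'],
                    'Population': [27698.0, 102351.0, None]})

# Keep every county from the cases side and record whether it matched.
checked = cases.merge(pop, left_on='county', right_on='Admin2',
                      how='left', indicator=True, validate='one_to_one')

# Counties that have cases but no population row.
missing = checked.loc[checked['_merge'] == 'left_only', 'county'].tolist()
print(missing)
```

Joining on names works here because the spellings agree in both files; when they do not, a shared code such as FIPS is usually a safer key.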


Voila! We now have the required 88 counties (the Unknown key never got joined because it exists in only one of the datasets; likewise, the Out of OH and Unassigned rows of popOhio have no match in totalCasesForCounties). So the only job left is to calculate the rate. Let's do that.

In [22]:
mergedOhioDataset['casesPerPopulation'] = mergedOhioDataset['Total Cases']/mergedOhioDataset['Population']
mergedOhioDataset

Unnamed: 0,county,Total Cases,FIPS,Admin2,Population,casesPerPopulation
0,Adams,3805,39001.0,Adams,27698.0,0.137375
1,Allen,14622,39003.0,Allen,102351.0,0.142861
2,Ashland,6189,39005.0,Ashland,53484.0,0.115717
3,Ashtabula,8613,39007.0,Ashtabula,97241.0,0.088574
4,Athens,7191,39009.0,Athens,65327.0,0.110077
...,...,...,...,...,...,...
83,Washington,7561,39167.0,Washington,59911.0,0.126204
84,Wayne,12469,39169.0,Wayne,115710.0,0.107761
85,Williams,4780,39171.0,Williams,36692.0,0.130274
86,Wood,16195,39173.0,Wood,130817.0,0.123799


And we can sort this DataFrame to see the counties with the highest cases per population.

In [23]:
mergedOhioDataset.sort_values(by='casesPerPopulation',ascending=False)

Unnamed: 0,county,Total Cases,FIPS,Admin2,Population,casesPerPopulation
64,Pickaway,10779,39129.0,Pickaway,58457.0,0.184392
50,Marion,11076,39101.0,Marion,65093.0,0.170157
43,Lawrence,9136,39087.0,Lawrence,59463.0,0.153642
59,Muskingum,13202,39119.0,Muskingum,86215.0,0.153129
19,Defiance,5658,39039.0,Defiance,38087.0,0.148555
...,...,...,...,...,...,...
9,Carroll,2709,39019.0,Carroll,26914.0,0.100654
52,Meigs,2291,39105.0,Meigs,22907.0,0.100013
27,Geauga,8339,39055.0,Geauga,93649.0,0.089045
3,Ashtabula,8613,39007.0,Ashtabula,97241.0,0.088574
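Rates like 0.184392 are often easier to read when expressed per 100,000 people. As a small sketch, using three rows copied from the merged table above (casesPer100k and top are illustrative names), we can scale the rate and pick the top rows with nlargest instead of sorting the whole DataFrame.

```python
import pandas as pd

# Three rows copied from the merged table above.
merged = pd.DataFrame({
    'county': ['Adams', 'Pickaway', 'Ashtabula'],
    'Total Cases': [3805, 10779, 8613],
    'Population': [27698.0, 58457.0, 97241.0],
})

# Same ratio as casesPerPopulation, scaled to cases per 100,000 residents.
merged['casesPer100k'] = merged['Total Cases'] / merged['Population'] * 100_000

# nlargest returns the n rows with the highest values, already ordered.
top = merged.nlargest(2, 'casesPer100k')
print(top[['county', 'casesPer100k']].round(1))
```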


In the next chapter we are going to do some basic plotting with Pandas and Matplotlib.