In [1]:
import numpy as np
import pandas as pd

### Merge Operation

Two DataFrames can be combined into one DataFrame using the *merge* function.  More than two can be combined by chaining merges.

Merge is the Pandas version of the SQL *join* operation.  To merge DataFrames, we have to tell the merge function how to connect them.  This is done by naming a column, chosen by the programmer, as the *key*; rows are matched when their key values are equal.

Often, the *key* is just a column that appears in both DataFrames and, hopefully, has values shared by both DataFrames.
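
Before merging, it can help to check which columns and which key values the two DataFrames actually share. The sketch below uses two small hypothetical DataFrames, `left` and `right` (not part of the course examples), and is only meant to illustrate the check.

```python
import pandas as pd

# Hypothetical toy DataFrames, used only for this illustration.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [10, 20, 30]})

# Columns that appear in both DataFrames are candidate keys.
shared_columns = left.columns.intersection(right.columns)

# Key values present on both sides are the ones that can be matched.
shared_values = set(left['key']) & set(right['key'])

print(list(shared_columns))
print(sorted(shared_values))
```

If `shared_values` is empty, an inner merge on that column returns no rows.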

The next few examples come straight from our course text.

##### Example 1

In [2]:
df1 = pd.DataFrame({'key':['b','b','a','c','a','a','b'], 'data1':range(7)})
df2 = pd.DataFrame({'key':['a', 'b', 'd','b'], 'data2':range(4)})

In [3]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [4]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2
3,b,3


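First, we perform an *inner* merge (or join).  In an inner merge, only the rows with key values in df1.key $\cap$ df2.key are considered.  For each shared key value, every row of df1 with that key is paired with every row of df2 with that key, so the result contains all possible rows of the form 

(row of df1 with shared key)(row of df2 with shared key)

Here 'b' appears 3 times in df1 and 2 times in df2, which gives 6 rows, and 'a' appears 3 times in df1 and once in df2, which gives 3 rows.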

In [5]:
pd.merge(df1, df2, on = 'key', how = 'inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,6,1
5,b,6,3
6,a,2,0
7,a,4,0
8,a,5,0
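
The row count of the inner merge can be predicted from the key counts alone. The following is a minimal sketch using the same df1 and df2 as above; the helper names `left_counts`, `right_counts`, and `pairs_per_key` are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd', 'b'], 'data2': range(4)})

# Count how often each key value occurs in each DataFrame.
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# Multiplication aligns on the key value.  Keys missing from one side
# become NaN and are dropped, leaving only the shared keys.
pairs_per_key = (left_counts * right_counts).dropna().astype(int)

inner = pd.merge(df1, df2, on='key', how='inner')

print(pairs_per_key.to_dict())
print(len(inner), pairs_per_key.sum())
```

The sum of the per-key products should equal the number of rows in the inner merge.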


Next, we perform a *left* merge (or join).  A left merge keeps every row of df1: the key values in df1.key $\cap$ df2.key and the key values in df1.key $\setminus$ df2.key are both considered.  The result contains all possible rows of the form 

(row of df1 with shared key)(row of df2 with shared key)

and

(row of df1 without shared key)(NaN)

Because NaN is a floating-point value, the data2 column is displayed as floats.

In [6]:
pd.merge(df1, df2, on = 'key', how = 'left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,c,3,
6,a,4,0.0
7,a,5,0.0
8,b,6,1.0
9,b,6,3.0


Next, we perform a *right* merge (or join).  A right merge keeps every row of df2: the key values in df1.key $\cap$ df2.key and the key values in df2.key $\setminus$ df1.key are both considered.  The result contains all possible rows of the form 

(row of df1 with shared key)(row of df2 with shared key)

and

(NaN)(row of df2 without shared key)

This time the data1 column holds the NaN, so data1 is displayed as floats.

In [7]:
pd.merge(df1, df2, on = 'key', how = 'right')

Unnamed: 0,key,data1,data2
0,a,2.0,0
1,a,4.0,0
2,a,5.0,0
3,b,0.0,1
4,b,1.0,1
5,b,6.0,1
6,d,,2
7,b,0.0,3
8,b,1.0,3
9,b,6.0,3
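
A right merge can be viewed as a left merge with the two DataFrames swapped. The sketch below checks this for df1 and df2; the column reordering and sorting are only there to make the two results directly comparable, since the swapped merge lists its columns and rows in a different order.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd', 'b'], 'data2': range(4)})

right_merge = pd.merge(df1, df2, on='key', how='right')
swapped_left = pd.merge(df2, df1, on='key', how='left')

def normalize(df):
    # Put columns and rows in a fixed order so the results can be compared.
    cols = ['key', 'data1', 'data2']
    return df[cols].sort_values(cols).reset_index(drop=True)

same = normalize(right_merge).equals(normalize(swapped_left))
print(same)
```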


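Lastly, we perform an *outer* merge (or join).  In an outer merge, all key values in df1.key $\cup$ df2.key are considered.  The result contains every row of the inner merge, plus the unmatched rows of df1 and the unmatched rows of df2, with NaN filling the missing side.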

In [8]:
pd.merge(df1, df2, on = 'key', how = 'outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,0.0,3.0
2,b,1.0,1.0
3,b,1.0,3.0
4,b,6.0,1.0
5,b,6.0,3.0
6,a,2.0,0.0
7,a,4.0,0.0
8,a,5.0,0.0
9,c,3.0,
10,d,,2.0
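
To see where each row of an outer merge came from, the merge function accepts an `indicator` argument, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. A minimal sketch with the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd', 'b'], 'data2': range(4)})

# indicator=True adds a '_merge' column describing the origin of each row.
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# One line per key value, showing which DataFrame(s) contained it.
print(outer[['key', '_merge']].drop_duplicates())
```

Keys marked `both` are the ones an inner merge would keep.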


$\Box$

What happens when the key column goes by a different name in each DataFrame?

##### Example 2

In [10]:
df3 = pd.DataFrame({'lkey':['b','b','a','c','a','a','b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey':['a','b','d'], 'data':range(3)})

In [11]:
df3

Unnamed: 0,lkey,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6



In [12]:
df4

Unnamed: 0,rkey,data
0,a,0
1,b,1
2,d,2


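Things work exactly the same as in Example 1, except that the key column is named separately for each side: left_on = 'lkey' names the key column of the left DataFrame and right_on = 'rkey' names the key column of the right DataFrame.  Both key columns appear in the result.

For an inner merge (or join), only the key values that are common to both lkey and rkey are considered.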

In [49]:
pd.merge(df3, df4, left_on = 'lkey', right_on = 'rkey', how = 'inner')

Unnamed: 0,lkey,data1,rkey,data
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0
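
After an inner merge with left_on and right_on, the two key columns carry identical values, so one of them is redundant. The sketch below shows two ways one might end up with a single key column; neither is prescribed by the course text.

```python
import pandas as pd

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data': range(3)})

# Option 1: merge as in the example, then drop the duplicate key column.
merged = pd.merge(df3, df4, left_on='lkey', right_on='rkey', how='inner')
tidy = merged.drop(columns='rkey')

# Option 2: rename the right key first so both sides share one key name.
renamed = pd.merge(df3, df4.rename(columns={'rkey': 'lkey'}), on='lkey', how='inner')

print(list(tidy.columns))
print(list(renamed.columns))
```

Dropping rkey is only safe for an inner merge; in a left, right, or outer merge the two key columns can differ where one side has no match.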


For a left merge (or join), the key values in both lkey $\cap$ rkey and lkey $\setminus$ rkey are considered.  The row of df3 with key 'c' has no match in df4, so its rkey and data entries are NaN.

In [50]:
pd.merge(df3, df4, left_on = 'lkey', right_on = 'rkey', how = 'left')

Unnamed: 0,lkey,data1,rkey,data
0,b,0,b,1.0
1,b,1,b,1.0
2,a,2,a,0.0
3,c,3,,
4,a,4,a,0.0
5,a,5,a,0.0
6,b,6,b,1.0


$\Box$

##### Example 3

In this example we use two real-world datasets:
+ Chicago Census Dataset
+ Chicago Crime Dataset

We use the Pandas *read_csv* function to load each of these datasets into a DataFrame.

In [13]:
census = pd.read_csv('ChicagoCensusData.csv')
crime = pd.read_csv('ChicagoCrimeData.csv')

In [19]:
census.head()

Unnamed: 0,COMMUNITY_AREA_NUMBER,COMMUNITY_AREA_NAME,PERCENT_OF_HOUSING_CROWDED,PERCENT_HOUSEHOLDS_BELOW_POVERTY,PERCENT_AGED_16__UNEMPLOYED,PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA,PERCENT_AGED_UNDER_18_OR_OVER_64,PER_CAPITA_INCOME,HARDSHIP_INDEX
0,1.0,Rogers Park,7.7,23.6,8.7,18.2,27.5,23939,39.0
1,2.0,West Ridge,7.8,17.2,8.8,20.8,38.5,23040,46.0
2,3.0,Uptown,3.8,24.0,8.9,11.8,22.2,35787,20.0
3,4.0,Lincoln Square,3.4,10.9,8.2,13.4,25.5,37524,17.0
4,5.0,North Center,0.3,7.5,5.2,4.5,26.2,57123,6.0


In [20]:
crime.head()

Unnamed: 0,ID,CASE_NUMBER,DATE,BLOCK,IUCR,PRIMARY_TYPE,DESCRIPTION,LOCATION_DESCRIPTION,ARREST,DOMESTIC,...,DISTRICT,WARD,COMMUNITY_AREA_NUMBER,FBICODE,X_COORDINATE,Y_COORDINATE,YEAR,LATITUDE,LONGITUDE,LOCATION
0,3512276,HK587712,2004-08-28,047XX S KEDZIE AVE,890,THEFT,FROM BUILDING,SMALL RETAIL STORE,False,False,...,9,14.0,58.0,6,1155838.0,1873050.0,2004,41.80744,-87.703956,"(41.8074405, -87.703955849)"
1,3406613,HK456306,2004-06-26,009XX N CENTRAL PARK AVE,820,THEFT,$500 AND UNDER,OTHER,False,False,...,11,27.0,23.0,6,1152206.0,1906127.0,2004,41.89828,-87.716406,"(41.898279962, -87.716405505)"
2,8002131,HT233595,2011-04-04,043XX S WABASH AVE,820,THEFT,$500 AND UNDER,NURSING HOME/RETIREMENT HOME,False,False,...,2,3.0,38.0,6,1177436.0,1876313.0,2011,41.815933,-87.624642,"(41.815933131, -87.624642127)"
3,7903289,HT133522,2010-12-30,083XX S KINGSTON AVE,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,...,4,7.0,46.0,6,1194622.0,1850125.0,2010,41.743665,-87.562463,"(41.743665322, -87.562462756)"
4,10402076,HZ138551,2016-02-02,033XX W 66TH ST,820,THEFT,$500 AND UNDER,ALLEY,False,False,...,8,15.0,66.0,6,1155240.0,1860661.0,2016,41.773455,-87.70648,"(41.773455295, -87.706480471)"


In [14]:
census = census.rename(columns = {'COMMUNITY_AREA_NUMBER':'CAN', 'COMMUNITY_AREA_NAME':'comm_area_name', 'HARDSHIP_INDEX':'hardship'})

In [15]:
# Chicago has 77 community areas; keep only the first 77 rows.
census = census.iloc[:77]

In [16]:
crime = crime.rename(columns = {'CASE_NUMBER':'case', 'LOCATION_DESCRIPTION':'location','COMMUNITY_AREA_NUMBER':'CAN'})

What kinds of crimes took place at a school?  What communities did this happen in?

In [17]:
crime_census_merged = pd.merge(crime, census, on = 'CAN', how = 'left')
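
This left merge attaches one census row to each crime, which only works as intended if each CAN value appears once in the census table. The merge function's `validate` argument can check that assumption. The sketch below uses small hypothetical stand-ins, `crime_toy` and `census_toy`, rather than the real CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for the crime and census tables.
crime_toy = pd.DataFrame({'CAN': [1.0, 1.0, 4.0],
                          'PRIMARY_TYPE': ['THEFT', 'BATTERY', 'THEFT']})
census_toy = pd.DataFrame({'CAN': [1.0, 4.0],
                           'comm_area_name': ['Rogers Park', 'Lincoln Square']})

# many_to_one: many crimes per area, but each area appears once in the census.
merged_toy = pd.merge(crime_toy, census_toy, on='CAN', how='left',
                      validate='many_to_one')

# A census table with a duplicated CAN value violates the assumption.
bad_census = pd.concat([census_toy, census_toy.iloc[[0]]], ignore_index=True)
rejected = False
try:
    pd.merge(crime_toy, bad_census, on='CAN', how='left', validate='many_to_one')
except pd.errors.MergeError as err:
    rejected = True
    print('merge rejected:', err)
```

When the check passes, the merged table has exactly one row per crime.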

In [21]:
crime.location.str.contains('SCHOOL')

0      False
1      False
2      False
3      False
4      False
       ...  
528    False
529    False
530    False
531    False
532    False
Name: location, Length: 533, dtype: bool

In [25]:
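# Build the mask from the merged DataFrame itself so the row labels are guaranteed to line up.
crime_census_merged[crime_census_merged.location.str.contains('SCHOOL')]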

Unnamed: 0,ID,case,DATE,BLOCK,IUCR,PRIMARY_TYPE,DESCRIPTION,location,ARREST,DOMESTIC,...,LONGITUDE,LOCATION,comm_area_name,PERCENT_OF_HOUSING_CROWDED,PERCENT_HOUSEHOLDS_BELOW_POVERTY,PERCENT_AGED_16__UNEMPLOYED,PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA,PERCENT_AGED_UNDER_18_OR_OVER_64,PER_CAPITA_INCOME,hardship
118,4006321,HL353697,2005-05-04,077XX S BURNHAM AVE,460,BATTERY,SIMPLE,"SCHOOL, PUBLIC, GROUNDS",False,False,...,-87.557039,"(41.754691074, -87.557038686)",South Shore,2.8,31.1,20.0,14.0,35.7,19398.0,55.0
121,4430638,HL725506,2005-11-09,048XX N FRANCISCO AVE,484,BATTERY,PRO EMP HANDS NO/MIN INJURY,"SCHOOL, PUBLIC, BUILDING",True,False,...,-87.700489,"(41.96938944, -87.700488807)",Lincoln Square,3.4,10.9,8.2,13.4,25.5,37524.0,17.0
141,6644618,HP716225,2008-12-04,030XX S DR MARTIN LUTHER KING JR DR,460,BATTERY,SIMPLE,"SCHOOL, PUBLIC, BUILDING",False,False,...,-87.617516,"(41.839816207, -87.617516172)",Douglas,1.8,29.6,18.2,14.3,30.7,23791.0,47.0
166,2341955,HH639427,2002-09-10,005XX N WALLER AVE,460,BATTERY,SIMPLE,"SCHOOL, PUBLIC, BUILDING",False,False,...,-87.767781,"(41.890459933, -87.767780886)",Austin,6.3,28.6,22.6,24.4,37.9,15957.0,73.0
183,11110571,JA460432,2017-10-05,076XX S HOMAN AVE,460,BATTERY,SIMPLE,"SCHOOL, PUBLIC, GROUNDS",False,False,...,-87.70746,"(41.754121535, -87.707460248)",Ashburn,4.0,10.4,11.7,17.7,36.9,23482.0,37.0
220,7399281,HS200939,2010-03-10,053XX W CONGRESS PKWY,1320,CRIMINAL DAMAGE,TO VEHICLE,"SCHOOL, PUBLIC, GROUNDS",False,False,...,-87.758439,"(41.873901397, -87.758439102)",Austin,6.3,28.6,22.6,24.4,37.9,15957.0,73.0
263,3530721,HK577020,2004-08-23,016XX W JONQUIL TER,2024,NARCOTICS,POSS: HEROIN(WHITE),"SCHOOL, PUBLIC, GROUNDS",True,False,...,-87.672208,"(42.021177601, -87.67220843)",Rogers Park,7.7,23.6,8.7,18.2,27.5,23939.0,39.0
265,7502426,HS305355,2010-05-13,035XX S WASHTENAW AVE,1821,NARCOTICS,MANU/DEL:CANNABIS 10GM OR LESS,"SCHOOL, PUBLIC, BUILDING",True,False,...,-87.692349,"(41.828907913, -87.692349187)",Brighton Park,14.4,23.6,13.9,45.1,39.3,13089.0,84.0
364,8082600,HT315369,2011-05-26,032XX W ADAMS ST,545,ASSAULT,PRO EMP HANDS NO/MIN INJURY,"SCHOOL, PUBLIC, GROUNDS",False,False,...,-87.707248,"(41.878370307, -87.707248137)",East Garfield Park,8.2,42.4,19.6,21.3,43.2,12961.0,83.0
471,7174283,HR585012,2009-10-13,043XX W 79TH ST,1330,CRIMINAL TRESPASS,TO LAND,"SCHOOL, PUBLIC, GROUNDS",True,False,...,-87.730447,"(41.749414464, -87.730446597)",Ashburn,4.0,10.4,11.7,17.7,36.9,23482.0,37.0
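
To turn the filtered table into answers to the two questions, the rows can be counted by crime type and by community. The sketch below retypes only the PRIMARY_TYPE and comm_area_name values of the ten rows visible above into a small DataFrame named `school_crimes` (a name introduced here for illustration); with the real data one would apply the same calls to the filtered `crime_census_merged`.

```python
import pandas as pd

# PRIMARY_TYPE and comm_area_name copied from the ten rows shown above.
school_crimes = pd.DataFrame({
    'PRIMARY_TYPE': ['BATTERY', 'BATTERY', 'BATTERY', 'BATTERY', 'BATTERY',
                     'CRIMINAL DAMAGE', 'NARCOTICS', 'NARCOTICS',
                     'ASSAULT', 'CRIMINAL TRESPASS'],
    'comm_area_name': ['South Shore', 'Lincoln Square', 'Douglas', 'Austin',
                       'Ashburn', 'Austin', 'Rogers Park', 'Brighton Park',
                       'East Garfield Park', 'Ashburn'],
})

# What kinds of crimes took place at a school?
by_type = school_crimes['PRIMARY_TYPE'].value_counts()

# What communities did this happen in?
by_area = school_crimes['comm_area_name'].value_counts()

print(by_type)
print(by_area)
```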
