### This notebook covers:
* How to combine multiple sets of data
* The details of merge mechanics in pandas
* Inner, Outer, Left, Right Joins
* Join cardinalities (one-to-one, one-to-many, many-to-many)

### Revision:
* Concatenating DataFrames:
    - pd.concat(list_of_dfs)
    - drop_duplicates(subset='col_name', keep='first') to remove duplicate rows
    - duplicated index labels: 1. pass ignore_index=True to pd.concat(), or 2. call reset_index(drop=True) afterwards
    - keep the old index and enforce uniqueness: pd.concat([df1, df2], verify_integrity=True) raises a ValueError if index labels overlap
* Creating multiple index levels with MultiIndex:
    - pd.concat(df_list, keys=['idx1', 'idx2'])
    - accessing: label-based indexing takes a tuple, df.loc[('idx1', 'a')]; position-based indexing is unchanged, df.iloc[4]
* Concatenating columns:
    - pd.concat(df_list, axis=1)
* append() is a special case of concat() that only works on rows (deprecated since pandas 1.4 and removed in 2.0; prefer pd.concat())
* pd.concat(df_list, join='outer')  # default join='outer' keeps the union of columns; join='inner' keeps only the shared columns
* merge():
    - pd.merge(df1, df2, on='col_name', how='inner')  # default how='inner'; other options include 'outer', 'left', 'right'
    - pd.merge(df1, df2, left_on='df1_col', right_on='df2_col') - to merge when the key column is named differently in each DataFrame.
    - pd.merge(df1, df2, left_index=True, right_index=True)   - to merge on the index.
    - The two can be mixed: left_index with right_on, or right_index with left_on
    - df1.join(df2)   - DataFrame method that joins on the index of df2 (there is no top-level pd.join); convenient when at least one data set is joined on its index.
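
The index-based options in the list above can be sketched on toy data. The frames and column names here (`salaries`, `regions_by_school`, `school`, `starting`, `region`) are made up for illustration and are not part of the course data sets.

```python
import pandas as pd

# Hypothetical toy data: keys live in a column on the left and in the index on the right.
salaries = pd.DataFrame({"school": ["A", "B", "C"], "starting": [50, 60, 70]})
regions_by_school = pd.DataFrame({"region": ["North", "South"]}, index=["A", "B"])

# Mix a column key (left_on) with an index key (right_index); the default is an inner join.
merged = pd.merge(salaries, regions_by_school, left_on="school", right_index=True)

# DataFrame.join matches on the index and defaults to a left join,
# so school "C" is kept with a missing region.
joined = salaries.set_index("school").join(regions_by_school)

print(merged)
print(joined)
```

Note that `join` is a method on the DataFrame, so it is called as `left.join(right)` rather than through the `pd` namespace.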
  
   

In [1]:
import pandas as pd
import numpy as np
pd.__version__

'1.4.2'

### Introducing Five New Datasets

In [16]:
eng_url = 'https://andybek.com/pandas-eng'
state_url = 'https://andybek.com/pandas-state'
party_url = 'https://andybek.com/pandas-party'
liberal_url = 'https://andybek.com/pandas-liberal'
ivies_url = 'https://andybek.com/pandas-ivies'

In [17]:
eng = pd.read_csv(eng_url)
state = pd.read_csv(state_url)
party = pd.read_csv(party_url)
liberal = pd.read_csv(liberal_url)
ivies = pd.read_csv(ivies_url)

In [18]:
state.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


In [19]:
eng.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00"


In [20]:
party.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,University of Illinois at Urbana-Champaign (UIUC),Party,"$52,900.00","$96,100.00"
1,"University of Maryland, College Park",Party,"$52,000.00","$95,000.00"
2,"University of California, Santa Barbara (UCSB)",Party,"$50,500.00","$95,000.00"
3,University of Texas (UT) - Austin,Party,"$49,700.00","$93,900.00"
4,State University of New York (SUNY) at Albany,Party,"$44,500.00","$92,200.00"


In [21]:
liberal.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00"
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00"


In [22]:
ivies.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"


### Concatenating DataFrames

In [23]:
dfs = [eng, state, party, liberal, ivies]
for df in dfs:
    print(df.shape)

(19, 4)
(175, 4)
(20, 4)
(47, 4)
(8, 4)


In [24]:
pd.concat(dfs).shape

(269, 4)

In [25]:
pd.concat(dfs).head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00"


In [27]:
# party and state schools overlap: which party schools are NOT in the state list?
set(party['School Name']).difference(set(state['School Name']))

{'Randolph-Macon College'}

In [32]:
'Randolph-Macon College' in liberal['School Name'].values

True

In [34]:
pd.concat(dfs)[pd.concat(dfs).duplicated(subset=['School Name'], keep='first')]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,University of Illinois at Urbana-Champaign (UIUC),Party,"$52,900.00","$96,100.00"
1,"University of Maryland, College Park",Party,"$52,000.00","$95,000.00"
2,"University of California, Santa Barbara (UCSB)",Party,"$50,500.00","$95,000.00"
3,University of Texas (UT) - Austin,Party,"$49,700.00","$93,900.00"
4,State University of New York (SUNY) at Albany,Party,"$44,500.00","$92,200.00"
5,University of Florida (UF),Party,"$47,100.00","$87,900.00"
6,Louisiana State University (LSU),Party,"$46,900.00","$87,800.00"
7,University of Georgia (UGA),Party,"$44,100.00","$86,000.00"
8,Pennsylvania State University (PSU),Party,"$49,900.00","$85,700.00"
9,Arizona State University (ASU),Party,"$47,400.00","$84,100.00"


In [37]:
schools = pd.concat(dfs).drop_duplicates(subset=['School Name'])

In [38]:
schools.shape

(249, 4)

### Duplicated Index Issue

In [36]:
schools.loc[0]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"


In [40]:
# schools.loc[0:2] - label slicing no longer works: the index labels are not unique, so pandas raises a KeyError

In [41]:
# 1: reset_index
schools.reset_index(drop=True, inplace=False)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00"
...,...,...,...,...
244,Harvard University,Ivy League,"$63,400.00","$124,000.00"
245,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
246,Cornell University,Ivy League,"$60,300.00","$110,000.00"
247,Brown University,Ivy League,"$56,200.00","$109,000.00"


In [45]:
# 2: discard the old index while concatenating
schools = pd.concat(dfs, ignore_index=True).drop_duplicates(subset=['School Name'])

In [46]:
pd.concat(dfs, ignore_index=True).drop_duplicates(subset=['School Name']).index.duplicated().sum()

0
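
A minimal sketch of the duplicated-index issue and both fixes, using two tiny made-up frames (`first`, `second`) rather than the school data:

```python
import pandas as pd

# Hypothetical frames; each one carries its own default 0..n-1 index.
first = pd.DataFrame({"name": ["a", "b"]})
second = pd.DataFrame({"name": ["c", "d"]})

# Plain concat keeps the original labels, so 0 and 1 each appear twice.
stacked = pd.concat([first, second])
print(stacked.index.is_unique)   # False
print(stacked.loc[0])            # returns two rows, not one

# Fix 1: renumber after the fact.
renumbered_later = stacked.reset_index(drop=True)

# Fix 2: renumber during concatenation.
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.is_unique)  # True
```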

### Enforcing Unique Indices

In [47]:
# previously: df.reset_index(drop=True) or pd.concat(dfs, ignore_index=True)
# GOAL: keep the old index labels while guaranteeing that they are unique

In [48]:
ivies2 =ivies.set_index('School Name')

In [49]:
eng2 = eng.set_index('School Name')

In [52]:
pd.concat([ivies2, eng2], ignore_index=True)  # index is unique, but the School Name labels are lost

Unnamed: 0,School Type,Starting Median Salary,Mid-Career Median Salary
0,Ivy League,"$58,000.00","$134,000.00"
1,Ivy League,"$66,500.00","$131,000.00"
2,Ivy League,"$59,100.00","$126,000.00"
3,Ivy League,"$63,400.00","$124,000.00"
4,Ivy League,"$60,900.00","$120,000.00"
5,Ivy League,"$60,300.00","$110,000.00"
6,Ivy League,"$56,200.00","$109,000.00"
7,Ivy League,"$59,400.00","$107,000.00"
8,Engineering,"$72,200.00","$126,000.00"
9,Engineering,"$75,500.00","$123,000.00"


In [53]:
pd.concat([ivies2, eng2], verify_integrity=True)

Unnamed: 0_level_0,School Type,Starting Median Salary,Mid-Career Median Salary
School Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"
Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


In [54]:
random_eng_school = eng2.sample()

In [55]:
random_eng_school

Unnamed: 0_level_0,School Type,Starting Median Salary,Mid-Career Median Salary
School Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Georgia Institute of Technology,Engineering,"$58,300.00","$106,000.00"


In [59]:
ivies2 = pd.concat([ivies2, random_eng_school])  # replaces ivies2.append(...), which is deprecated since pandas 1.4 and removed in 2.0


In [60]:
pd.concat([ivies2, eng2], verify_integrity=True)

ValueError: Indexes have overlapping values: Index(['Georgia Institute of Technology'], dtype='object', name='School Name')
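
The same check can be reproduced on a small scale. This is a sketch with invented labels (`school_x`, `school_y`), showing how the `ValueError` might be caught and how the overlap could be inspected up front:

```python
import pandas as pd

# Hypothetical frames that share one index label ("school_x").
left = pd.DataFrame({"type": ["Ivy"]}, index=["school_x"])
right = pd.DataFrame({"type": ["Eng", "Ivy"]}, index=["school_y", "school_x"])

try:
    pd.concat([left, right], verify_integrity=True)
    overlap_detected = False
except ValueError as err:
    # pandas refuses to build a result with duplicate index labels
    overlap_detected = True
    print(err)

# Inspecting the overlap directly avoids relying on the exception.
overlap = left.index.intersection(right.index)
print(overlap.tolist())
```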

### Creating multiple indices with concat()

In [62]:
# previously: pd.concat(dfs, ignore_index=True)
# alternative: keep the old labels and add an outer index level with keys (MultiIndex)

In [63]:
new_df = pd.concat([ivies,eng], keys=['ivyleague_schools', 'engineering_schools'])

In [64]:
new_df

Unnamed: 0,Unnamed: 1,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
ivyleague_schools,0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
ivyleague_schools,1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
ivyleague_schools,2,Yale University,Ivy League,"$59,100.00","$126,000.00"
ivyleague_schools,3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
ivyleague_schools,4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
ivyleague_schools,5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
ivyleague_schools,6,Brown University,Ivy League,"$56,200.00","$109,000.00"
ivyleague_schools,7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
engineering_schools,0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
engineering_schools,1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


In [68]:
new_df.loc[('engineering_schools', 3)]  # label-based indexing now takes a tuple: (outer key, inner label)

School Name                 Polytechnic University of New York, Brooklyn
School Type                                                  Engineering
Starting Median Salary                                       $62,400.00 
Mid-Career Median Salary                                    $114,000.00 
Name: (engineering_schools, 3), dtype: object

In [69]:
new_df.iloc[10]   # position-based indexing works the same as before

School Name                 Harvey Mudd College
School Type                         Engineering
Starting Median Salary              $71,800.00 
Mid-Career Median Salary           $122,000.00 
Name: (engineering_schools, 2), dtype: object
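
A small sketch of the access patterns on a keyed concat, using invented frames and keys (`ivy`, `eng`). It also shows that passing only the outer key to `.loc` returns the whole group:

```python
import pandas as pd

# Hypothetical frames standing in for two data sets.
ivy = pd.DataFrame({"name": ["i0", "i1"]})
engineering = pd.DataFrame({"name": ["e0", "e1", "e2"]})

combined = pd.concat([ivy, engineering], keys=["ivy", "eng"])

one_group = combined.loc["eng"]        # every row under the outer key
one_row = combined.loc[("eng", 1)]     # a single row, addressed by (outer, inner)
by_position = combined.iloc[3]         # fourth row overall, regardless of labels

print(combined.index.nlevels)          # 2
print(one_row["name"], by_position["name"])
```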

### Columns Concatenation

In [70]:
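# note: the salary columns are strings, so this sort is only reliable while every value has the same number of digits
ivies3 = ivies.sort_values(by='Starting Median Salary', ascending=False)[:5].reset_index(drop=True)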
eng3 = eng.sort_values(by='Starting Median Salary', ascending=False)[:5].reset_index(drop=True)

In [71]:
pd.concat([ivies3,eng3], axis=1)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,School Name.1,School Type.1,Starting Median Salary.1,Mid-Career Median Salary.1
0,Princeton University,Ivy League,"$66,500.00","$131,000.00",California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
1,Harvard University,Ivy League,"$63,400.00","$124,000.00",Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
2,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,Cornell University,Ivy League,"$60,300.00","$110,000.00","Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Columbia University,Ivy League,"$59,400.00","$107,000.00",Cooper Union,Engineering,"$62,200.00","$114,000.00"


### append() - special case of concat()

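In [72]:
"""append() = DataFrame instance method, only operates on the row axis (deprecated since pandas 1.4, removed in 2.0)
concat() = top-level pandas function, operates on the row or column axis"""

'append() = DataFrame instance method, only operates on the row axis (deprecated since pandas 1.4, removed in 2.0)\nconcat() = top-level pandas function, operates on the row or column axis'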
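
Because `DataFrame.append()` is deprecated in pandas 1.4 and no longer exists in pandas 2.0, the row-wise case is usually written with `pd.concat()`. A minimal sketch with made-up frames (`base`, `extra`):

```python
import pandas as pd

# Hypothetical frames with distinct index labels.
base = pd.DataFrame({"name": ["a", "b"]}, index=["x", "y"])
extra = pd.DataFrame({"name": ["c"]}, index=["z"])

# Row-wise equivalent of the older base.append(extra)
appended = pd.concat([base, extra])
print(appended)
```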

### Concat on Different Columns

In [73]:
eng4 = eng.copy()
eng4['Stem'] = True

In [74]:
eng4

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Stem
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00",True
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00",True
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00",True
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00",True
4,Cooper Union,Engineering,"$62,200.00","$114,000.00",True
5,Worcester Polytechnic Institute (WPI),Engineering,"$61,000.00","$114,000.00",True
6,Carnegie Mellon University (CMU),Engineering,"$61,800.00","$111,000.00",True
7,Rensselaer Polytechnic Institute (RPI),Engineering,"$61,100.00","$110,000.00",True
8,Georgia Institute of Technology,Engineering,"$58,300.00","$106,000.00",True
9,Colorado School of Mines,Engineering,"$58,100.00","$106,000.00",True


In [76]:
pd.concat([ivies, eng4])  # default join is outer, that's why Stem column shows up

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Stem
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",
2,Yale University,Ivy League,"$59,100.00","$126,000.00",
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",
6,Brown University,Ivy League,"$56,200.00","$109,000.00",
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00",True
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00",True


In [78]:
pd.concat([ivies,eng4], join='inner').head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
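
The effect of `join` on mismatched columns can be seen on two one-row frames. The frames below (`plain`, `flagged`) and the `stem` column are illustrative stand-ins for `ivies` and `eng4`:

```python
import pandas as pd

# Hypothetical frames; only the second one has the extra "stem" column.
plain = pd.DataFrame({"name": ["a"], "type": ["Ivy"]})
flagged = pd.DataFrame({"name": ["b"], "type": ["Eng"], "stem": [True]})

# join='outer' (the default) keeps the union of columns and fills gaps with NaN.
outer = pd.concat([plain, flagged], ignore_index=True)

# join='inner' keeps only the columns present in every frame.
inner = pd.concat([plain, flagged], join="inner", ignore_index=True)

print(list(outer.columns))
print(list(inner.columns))
```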


In [121]:
# Challenge
# 1: concatenate liberal and state schools. How many unique school names are there?
ls = pd.concat([liberal, state], ignore_index=True)
ls.shape  # total row count only; ls['School Name'].nunique() would give the number of unique names

(222, 4)

In [124]:
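# 2: what is the average median salary in the above df? (the code below uses the mid-career column; the same parsing works for 'Starting Median Salary')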
ls = pd.concat([liberal,state])
ls['Mid-Career Median Salary'] = [float(x[1:].replace(',','')) for x in ls['Mid-Career Median Salary']]
ls

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00",110000.0
1,Colgate University,Liberal Arts,"$52,800.00",108000.0
2,Amherst College,Liberal Arts,"$54,500.00",107000.0
3,Lafayette College,Liberal Arts,"$53,900.00",107000.0
4,Bowdoin College,Liberal Arts,"$48,100.00",107000.0
...,...,...,...,...
170,Austin Peay State University,State,"$37,700.00",59200.0
171,Pittsburg State University,State,"$40,400.00",58200.0
172,Southern Utah University,State,"$41,900.00",56500.0
173,Montana State University - Billings,State,"$37,900.00",50600.0


In [125]:
ls['Mid-Career Median Salary'].mean()

80856.3063063063
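
The list comprehension above can also be written with vectorized string methods. This is a sketch on a made-up Series (`raw`); one value carries a trailing space, as some salary strings do in the outputs shown earlier:

```python
import pandas as pd

# Hypothetical currency strings in the same format as the salary columns.
raw = pd.Series(["$54,100.00", "$110,000.00 ", "$48,100.00"])

numeric = (
    raw.str.strip()                          # drop stray whitespace
       .str.replace("$", "", regex=False)    # literal dollar sign, not a regex anchor
       .str.replace(",", "", regex=False)    # thousands separators
       .astype(float)
)

print(numeric.mean())
```

Passing `regex=False` matters for the dollar sign, since `$` would otherwise be interpreted as an end-of-string anchor.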

In [107]:
# 3
# create a short dataframe that shows the top 3 liberal arts and state schools that produce the highest mid-career earners.
# show the school name and mid-career salary for each data set side by side
# nest the column labels within liberal and state labels

In [113]:
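# Caveat: the salary columns are still strings, so this sort is lexicographic ('$97,900.00' ranks above '$110,000.00').
# Convert the column to numbers first for a true top 3; the same applies to short_state in the next cell.
short_liberal = liberal.sort_values(by='Mid-Career Median Salary', ascending=False)[:3].reset_index(drop=True)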

In [114]:
short_state = state.sort_values(by='Mid-Career Median Salary', ascending=False)[:3].reset_index(drop=True)

In [115]:
pd.concat([short_liberal, short_state], axis=1)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,School Name.1,School Type.1,Starting Median Salary.1,Mid-Career Median Salary.1
0,"Wesleyan University (Middletown, Connecticut)",Liberal Arts,"$46,500.00","$97,900.00","University of California, Davis",State,"$52,300.00","$99,600.00"
1,Bates College,Liberal Arts,"$47,300.00","$96,500.00",University of Colorado - Boulder (UCB),State,"$47,100.00","$97,600.00"
2,Union College,Liberal Arts,"$47,200.00","$95,800.00","University of California, Irvine (UCI)",State,"$48,300.00","$96,700.00"
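
Sorting a currency column as text compares characters, not amounts, which is why six-figure salaries can end up below five-figure ones. A minimal sketch of the difference, using a made-up frame (`df`) with invented names and a hypothetical helper column `salary_num`:

```python
import pandas as pd

# Hypothetical data: "high" has the largest salary but the "smallest" string.
df = pd.DataFrame({
    "name": ["low", "high", "mid"],
    "salary": ["$97,900.00", "$110,000.00", "$99,600.00"],
})

# Text sort: '$9...' sorts after '$1...', so the six-figure value lands last.
lexicographic_order = df.sort_values("salary", ascending=False)["name"].tolist()

# Numeric sort: strip '$' and ',' and convert before ranking.
df["salary_num"] = df["salary"].str.replace(r"[$,]", "", regex=True).astype(float)
numeric_top = df.nlargest(2, "salary_num")["name"].tolist()

print(lexicographic_order)  # ['mid', 'low', 'high']
print(numeric_top)          # ['high', 'mid']
```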


In [116]:
pd.concat([short_liberal, short_state], keys=['liberal','state'], axis=1)

Unnamed: 0_level_0,liberal,liberal,liberal,liberal,state,state,state,state
Unnamed: 0_level_1,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"Wesleyan University (Middletown, Connecticut)",Liberal Arts,"$46,500.00","$97,900.00","University of California, Davis",State,"$52,300.00","$99,600.00"
1,Bates College,Liberal Arts,"$47,300.00","$96,500.00",University of Colorado - Boulder (UCB),State,"$47,100.00","$97,600.00"
2,Union College,Liberal Arts,"$47,200.00","$95,800.00","University of California, Irvine (UCI)",State,"$48,300.00","$96,700.00"


In [118]:
pd.concat([short_liberal, short_state], keys=['liberal','state'], axis=1).columns

MultiIndex([('liberal',              'School Name'),
            ('liberal',              'School Type'),
            ('liberal',   'Starting Median Salary'),
            ('liberal', 'Mid-Career Median Salary'),
            (  'state',              'School Name'),
            (  'state',              'School Type'),
            (  'state',   'Starting Median Salary'),
            (  'state', 'Mid-Career Median Salary')],
           )

### merge() similar to SQL

In [126]:
# concat() - glues data together, structure focused operation
# merge() - combines data sets based on content they share, much more flexible than concat

In [127]:
regions_url = 'https://andybek.com/pandas-regions'
regions = pd.read_csv(regions_url)

In [128]:
regions.head()

Unnamed: 0,School Name,Region
0,Massachusetts Institute of Technology (MIT),Northeastern
1,California Institute of Technology (CIT),California
2,Harvey Mudd College,California
3,"Polytechnic University of New York, Brooklyn",Northeastern
4,Cooper Union,Northeastern


In [129]:
regions.shape

(269, 2)

In [130]:
pd.merge(schools, regions)  # with no `on`, merge uses the columns both frames share (here 'School Name')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00",Northeastern
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00",California
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00",California
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00",Northeastern
4,Cooper Union,Engineering,"$62,200.00","$114,000.00",Northeastern
...,...,...,...,...,...
264,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
265,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
266,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
267,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern


In [131]:
pd.merge(schools, regions, on='School Name')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00",Northeastern
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00",California
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00",California
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00",Northeastern
4,Cooper Union,Engineering,"$62,200.00","$114,000.00",Northeastern
...,...,...,...,...,...
264,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
265,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
266,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
267,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
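
When `on` is omitted, `pd.merge` joins on every column name the two frames have in common, so the two calls above are equivalent here. A sketch on toy frames (`left`, `right`, with a made-up `school` key):

```python
import pandas as pd

# Hypothetical frames that share exactly one column name: "school".
left = pd.DataFrame({"school": ["A", "B", "C"], "type": ["Eng", "Ivy", "State"]})
right = pd.DataFrame({"school": ["A", "B", "D"], "region": ["West", "East", "South"]})

implicit = pd.merge(left, right)               # key inferred from shared column names
explicit = pd.merge(left, right, on="school")  # key stated explicitly

print(implicit)
```

Stating `on` explicitly is the safer habit: if the frames later gain another shared column, the implicit version would silently start joining on both.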


### left_on and right_on parameter in merge()

In [135]:
income_url = 'https://andybek.com/pandas-mid'

In [136]:
incomes = pd.read_csv(income_url)
incomes

Unnamed: 0,school_name,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,Massachusetts Institute of Technology (MIT),"$76,800.00","$99,200.00","$168,000.00","$220,000.00"
1,California Institute of Technology (CIT),,"$104,000.00","$161,000.00",
2,Harvey Mudd College,,"$96,000.00","$180,000.00",
3,"Polytechnic University of New York, Brooklyn","$66,800.00","$94,300.00","$143,000.00","$190,000.00"
4,Cooper Union,,"$80,200.00","$142,000.00",
...,...,...,...,...,...
264,Austin Peay State University,"$32,200.00","$40,500.00","$73,900.00","$96,200.00"
265,Pittsburg State University,"$25,600.00","$46,000.00","$84,600.00","$117,000.00"
266,Southern Utah University,"$30,700.00","$39,700.00","$78,400.00","$116,000.00"
267,Montana State University - Billings,"$22,600.00","$31,800.00","$78,500.00","$98,900.00"


In [138]:
pd.merge(schools, incomes, left_on='School Name', right_on = 'school_name')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,school_name,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00",Massachusetts Institute of Technology (MIT),"$76,800.00","$99,200.00","$168,000.00","$220,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00",California Institute of Technology (CIT),,"$104,000.00","$161,000.00",
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00",Harvey Mudd College,,"$96,000.00","$180,000.00",
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00","Polytechnic University of New York, Brooklyn","$66,800.00","$94,300.00","$143,000.00","$190,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00",Cooper Union,,"$80,200.00","$142,000.00",
...,...,...,...,...,...,...,...,...,...
264,Harvard University,Ivy League,"$63,400.00","$124,000.00",Harvard University,"$54,800.00","$86,200.00","$179,000.00","$288,000.00"
265,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",University of Pennsylvania,"$55,900.00","$79,200.00","$192,000.00","$282,000.00"
266,Cornell University,Ivy League,"$60,300.00","$110,000.00",Cornell University,"$56,800.00","$79,800.00","$160,000.00","$210,000.00"
267,Brown University,Ivy League,"$56,200.00","$109,000.00",Brown University,"$55,400.00","$74,400.00","$159,000.00","$228,000.00"


In [139]:
pd.merge(schools, incomes, left_on='School Name', right_on = 'school_name').drop('school_name', axis=1)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00","$76,800.00","$99,200.00","$168,000.00","$220,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00",,"$104,000.00","$161,000.00",
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00",,"$96,000.00","$180,000.00",
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00","$66,800.00","$94,300.00","$143,000.00","$190,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00",,"$80,200.00","$142,000.00",
...,...,...,...,...,...,...,...,...
264,Harvard University,Ivy League,"$63,400.00","$124,000.00","$54,800.00","$86,200.00","$179,000.00","$288,000.00"
265,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00","$55,900.00","$79,200.00","$192,000.00","$282,000.00"
266,Cornell University,Ivy League,"$60,300.00","$110,000.00","$56,800.00","$79,800.00","$160,000.00","$210,000.00"
267,Brown University,Ivy League,"$56,200.00","$109,000.00","$55,400.00","$74,400.00","$159,000.00","$228,000.00"
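
An alternative to dropping the redundant key afterwards is to rename it before merging. Both routes are sketched below on invented frames (`left`, `right`) with a made-up `p90` column:

```python
import pandas as pd

# Hypothetical frames whose key columns are named differently.
left = pd.DataFrame({"School Name": ["A", "B"], "type": ["Eng", "Ivy"]})
right = pd.DataFrame({"school_name": ["A", "B"], "p90": [220, 288]})

# Route 1: merge on differently named keys, then drop the duplicate key column.
merged = pd.merge(left, right, left_on="School Name", right_on="school_name")
tidy = merged.drop(columns="school_name")

# Route 2: align the names first so a plain `on` works and no extra column appears.
renamed = pd.merge(left, right.rename(columns={"school_name": "School Name"}), on="School Name")

print(list(merged.columns))
print(list(tidy.columns))
```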


### how parameter in merge()

In [141]:
pd.merge(ivies, regions)  # how='inner' is the default

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
6,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern


In [143]:
pd.merge(ivies, regions, how='outer')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
...,...,...,...,...,...
264,Austin Peay State University,,,,Southern
265,Pittsburg State University,,,,Midwestern
266,Southern Utah University,,,,Western
267,Montana State University - Billings,,,,Western


In [145]:
pd.merge(ivies, regions, how='right')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Massachusetts Institute of Technology (MIT),,,,Northeastern
1,California Institute of Technology (CIT),,,,California
2,Harvey Mudd College,,,,California
3,"Polytechnic University of New York, Brooklyn",,,,Northeastern
4,Cooper Union,,,,Northeastern
...,...,...,...,...,...
264,Austin Peay State University,,,,Southern
265,Pittsburg State University,,,,Midwestern
266,Southern Utah University,,,,Western
267,Montana State University - Billings,,,,Western
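
The four `how` options differ only in which keys survive. A compact sketch on toy frames (`left`, `right`) counts the rows for each option and uses `indicator=True` to label where every row came from:

```python
import pandas as pd

# Hypothetical frames: "B" is the only key present on both sides.
left = pd.DataFrame({"school": ["A", "B"], "type": ["Ivy", "Ivy"]})
right = pd.DataFrame({"school": ["B", "C"], "region": ["East", "West"]})

row_counts = {
    how: len(pd.merge(left, right, how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)  # {'inner': 1, 'outer': 3, 'left': 2, 'right': 2}

# indicator=True adds a `_merge` column: 'left_only', 'right_only', or 'both'.
audit = pd.merge(left, right, how="outer", indicator=True)
origin = dict(zip(audit["school"], audit["_merge"].astype(str)))
print(origin)
```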


### One-to-One and One-to-Many Joins

In [146]:
# one-to-one: each record in one data set matches exactly one record in the other data set.

In [147]:
ivies

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"


In [148]:
regions.head()

Unnamed: 0,School Name,Region
0,Massachusetts Institute of Technology (MIT),Northeastern
1,California Institute of Technology (CIT),California
2,Harvey Mudd College,California
3,"Polytechnic University of New York, Brooklyn",Northeastern
4,Cooper Union,Northeastern


In [149]:
pd.merge(ivies, regions, on='School Name', how='inner')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
6,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern


In [150]:
ivies['School Name'].is_unique

True

In [151]:
regions['School Name'].is_unique

False

In [156]:
regions[regions['School Name'].isin(ivies['School Name'])]['School Name'].is_unique

True

In [157]:
regions[regions['School Name'].isin(ivies['School Name'])]

Unnamed: 0,School Name,Region
86,Dartmouth College,Northeastern
87,Princeton University,Northeastern
88,Yale University,Northeastern
89,Harvard University,Northeastern
90,University of Pennsylvania,Northeastern
91,Cornell University,Northeastern
92,Brown University,Northeastern
93,Columbia University,Northeastern


In [158]:
# one-to-many: a record in one data set can match more than one record in the other data set

In [159]:
state

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
170,Austin Peay State University,State,"$37,700.00","$59,200.00"
171,Pittsburg State University,State,"$40,400.00","$58,200.00"
172,Southern Utah University,State,"$41,900.00","$56,500.00"
173,Montana State University - Billings,State,"$37,900.00","$50,600.00"


In [160]:
state['School Name'].is_unique

True

In [163]:
regions[regions['School Name'].isin(state['School Name'])]

Unnamed: 0,School Name,Region
19,University of Illinois at Urbana-Champaign (UIUC),Midwestern
20,"University of Maryland, College Park",Southern
21,"University of California, Santa Barbara (UCSB)",California
22,University of Texas (UT) - Austin,Southern
23,State University of New York (SUNY) at Albany,Northeastern
...,...,...
264,Austin Peay State University,Southern
265,Pittsburg State University,Midwestern
266,Southern Utah University,Western
267,Montana State University - Billings,Western


In [165]:
regions[regions['School Name'].isin(state['School Name'])]['School Name'].value_counts()

University of Illinois at Urbana-Champaign (UIUC)    2
Indiana University (IU), Bloomington                 2
University of Maryland, College Park                 2
Ohio University                                      2
University of Tennessee                              2
                                                    ..
University of Illinois at Chicago                    1
State University of New York (SUNY) at Buffalo       1
University of Kansas                                 1
University of New Mexico (UNM)                       1
Black Hills State University                         1
Name: School Name, Length: 175, dtype: int64

In [167]:
pd.merge(state, regions).sort_values(by='School Name').drop_duplicates()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
169,Appalachian State University,State,"$40,400.00","$69,100.00",Southern
58,Arizona State University (ASU),State,"$47,400.00","$84,100.00",Western
184,Arkansas State University (ASU),State,"$38,700.00","$63,300.00",Southern
48,Auburn University,State,"$45,400.00","$84,700.00",Southern
189,Austin Peay State University,State,"$37,700.00","$59,200.00",Southern
...,...,...,...,...,...
117,Wayne State University,State,"$42,800.00","$76,100.00",Midwestern
107,West Virginia University (WVU),State,"$43,100.00","$78,100.00",Southern
175,Western Carolina University,State,"$36,900.00","$66,600.00",Southern
131,Western Michigan University (WMU),State,"$42,300.00","$73,800.00",Midwestern


In [168]:
pd.merge(state, regions.drop_duplicates())

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00",California
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00",Southern
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00",California
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00",California
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00",California
...,...,...,...,...,...
170,Austin Peay State University,State,"$37,700.00","$59,200.00",Southern
171,Pittsburg State University,State,"$40,400.00","$58,200.00",Midwestern
172,Southern Utah University,State,"$41,900.00","$56,500.00",Western
173,Montana State University - Billings,State,"$37,900.00","$50,600.00",Western
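
The effect of duplicate rows in the lookup table can be reproduced on a small scale. The sketch below uses hypothetical toy frames (`salaries` and `lookup` are made-up names and values, not the notebook's datasets) and assumes the duplicates are exact row copies, as they appear to be in `regions`. The `validate` argument of `pd.merge` is used as an extra guard.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's datasets
salaries = pd.DataFrame({'School Name': ['A', 'B'],
                         'Starting': [50000, 45000]})
lookup = pd.DataFrame({'School Name': ['A', 'A', 'B'],
                       'Region': ['Western', 'Western', 'Southern']})

# 'A' appears twice in lookup, so its salary row is repeated in the result
inflated = pd.merge(salaries, lookup)

# Dropping the duplicate rows first restores one row per school;
# validate='one_to_one' raises MergeError if either key still repeats
clean = pd.merge(salaries, lookup.drop_duplicates(), validate='one_to_one')

print(len(inflated), len(clean))
```

Deduplicating before the merge, rather than after it, keeps the result in the left frame's row order and avoids building the inflated intermediate frame.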


### Many-to-Many

In [170]:
survey = pd.DataFrame({
    'School Type': ['Ivy League', 'Ivy League', 'Engineering', 'Engineering'],
    'Prestige': ['High', 'Good', 'Good', 'Okay'],
    'Respondent': [1,2,2,3]
})
survey

Unnamed: 0,School Type,Prestige,Respondent
0,Ivy League,High,1
1,Ivy League,Good,2
2,Engineering,Good,2
3,Engineering,Okay,3


In [171]:
ivies

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"


In [172]:
pd.merge(ivies, survey)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Prestige,Respondent
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",High,1
1,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Good,2
2,Princeton University,Ivy League,"$66,500.00","$131,000.00",High,1
3,Princeton University,Ivy League,"$66,500.00","$131,000.00",Good,2
4,Yale University,Ivy League,"$59,100.00","$126,000.00",High,1
5,Yale University,Ivy League,"$59,100.00","$126,000.00",Good,2
6,Harvard University,Ivy League,"$63,400.00","$124,000.00",High,1
7,Harvard University,Ivy League,"$63,400.00","$124,000.00",Good,2
8,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",High,1
9,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Good,2


In [173]:
# A many-to-many join occurs when the key column contains duplicated values in both datasets.

In [174]:
# These relationships (one-to-one, one-to-many, many-to-many) are called join cardinalities.
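
A minimal sketch of how cardinality drives the row count of a merge. The frames `schools` and `votes` are hypothetical toy data, not the notebook's datasets. `validate` is a `pd.merge` parameter that raises `MergeError` when the keys do not match the stated cardinality.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's datasets
schools = pd.DataFrame({
    'School Name': ['A', 'B', 'C'],
    'School Type': ['Ivy League', 'Ivy League', 'Engineering'],
})
votes = pd.DataFrame({
    'School Type': ['Ivy League', 'Ivy League', 'Engineering'],
    'Prestige': ['High', 'Good', 'Okay'],
})

# 'School Type' repeats on both sides, so every left row pairs with every
# matching right row: 2 * 2 Ivy League rows plus 1 * 1 Engineering row
many_to_many = pd.merge(schools, votes, on='School Type')
row_count = len(many_to_many)

# Asking pandas to enforce one-to-one on these keys fails
try:
    pd.merge(schools, votes, on='School Type', validate='one_to_one')
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True

print(row_count, validation_failed)
```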

### Merging by Index

In [176]:
ivies4 = ivies.set_index('School Name')

In [177]:
region4 = regions.set_index('School Name')

In [178]:
pd.merge(ivies4, region4)

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

In [179]:
pd.merge(ivies4, region4, left_index=True, right_index=True)

Unnamed: 0_level_0,School Type,Starting Median Salary,Mid-Career Median Salary,Region
School Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern
Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern


In [180]:
# To merge one frame's index with the other frame's column, combine left_index=True with right_on, or right_index=True with left_on.
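
The index-to-column case might look like the sketch below. The frames and their values are hypothetical toy data. `left` keeps the school name in its index, while `right` keeps it in a regular column.

```python
import pandas as pd

# Hypothetical toy data: school name in the index on the left,
# in a regular column on the right
left = pd.DataFrame({'Starting': [50000, 45000]}, index=['A', 'B'])
right = pd.DataFrame({'School Name': ['B', 'A'],
                      'Region': ['Southern', 'Western']})

# Match left's index against right's 'School Name' column
merged = pd.merge(left, right, left_index=True, right_on='School Name')
print(merged)
```

Swapping the roles (`right_index=True, left_on=...`) works the same way when the index sits on the right frame.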

### join() method

In [184]:
ivies4.join(region4)  # index to index

Unnamed: 0_level_0,School Type,Starting Median Salary,Mid-Career Median Salary,Region
School Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern
Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern


In [185]:
ivies.join(region4)  # joins on ivies' integer index, so nothing matches; a column-to-index join needs on='School Name'

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",
2,Yale University,Ivy League,"$59,100.00","$126,000.00",
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",
6,Brown University,Ivy League,"$56,200.00","$109,000.00",
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",


In [188]:
# Under the hood, join() uses merge().
# It is useful when at least one dataset has to be joined on its index.
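
The all-NaN `Region` column in the previous output comes from `join()` aligning on the caller's index by default. A small sketch with hypothetical toy frames shows the difference the `on` parameter of `DataFrame.join` makes.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's datasets
left = pd.DataFrame({'School Name': ['X', 'Y'],
                     'School Type': ['Ivy League', 'Ivy League']})
right = pd.DataFrame({'Region': ['Northeastern', 'Western']},
                     index=pd.Index(['X', 'Y'], name='School Name'))

# Default: left's integer index is compared with right's string index,
# so no label matches and Region is filled with NaN
no_match = left.join(right)

# on='School Name' compares left's column with right's index instead
matched = left.join(right, on='School Name')
print(matched)
```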

In [190]:
# Challenge

In [198]:
# 1 - Merge the liberal arts schools with regions and assign the resulting df to dfm.
# Which region has the highest number of schools?
dfm = pd.merge(liberal, regions)
dfm['Region'].value_counts()

Northeastern    25
Midwestern       8
Western          7
Southern         5
California       3
Name: Region, dtype: int64

In [208]:
# 2 - Set school_name as the index of the mid_career (incomes) dataframe. Do this in place.
incomes.set_index('school_name', inplace=True)

In [211]:
# 3 - Merge the dfm and mid_career (incomes) dataframes
pd.merge(dfm, incomes, right_index=True, left_on='School Name')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00",Northeastern,"$62,800.00","$80,600.00","$156,000.00","$251,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00",Northeastern,"$60,000.00","$76,700.00","$167,000.00","$265,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00",Northeastern,,"$84,900.00","$162,000.00",
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00",Northeastern,"$70,600.00","$79,300.00","$144,000.00","$204,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00",Northeastern,,"$74,600.00","$146,000.00",
5,College of the Holy Cross,Liberal Arts,"$50,200.00","$106,000.00",Northeastern,,"$65,600.00","$143,000.00",
6,Occidental College,Liberal Arts,"$51,900.00","$105,000.00",California,,"$54,800.00","$157,000.00",
7,Washington and Lee University,Liberal Arts,"$53,600.00","$104,000.00",Southern,,"$82,800.00","$146,000.00",
8,Swarthmore College,Liberal Arts,"$49,700.00","$104,000.00",Northeastern,,"$67,200.00","$167,000.00",
9,Davidson College,Liberal Arts,"$46,100.00","$104,000.00",Southern,,"$70,500.00","$146,000.00",


In [223]:
# 4 - Is the join operation one-to-one? No: neither key is unique.
dfm['School Name'].is_unique, incomes.index.is_unique

(False, False)

In [224]:
incomes[incomes.index.isin(dfm['School Name'])].index.is_unique

False
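
The uniqueness checks above could be wrapped in a small helper. `join_cardinality` is a hypothetical function written for illustration, not part of pandas. It classifies a join purely from whether each side's key values are unique.

```python
import pandas as pd

def join_cardinality(left_keys, right_keys):
    """Classify a join from key uniqueness (hypothetical helper, not a pandas API)."""
    left_unique = pd.Index(left_keys).is_unique
    right_unique = pd.Index(right_keys).is_unique
    if left_unique and right_unique:
        return 'one-to-one'
    if left_unique:
        return 'one-to-many'
    if right_unique:
        return 'many-to-one'
    return 'many-to-many'

print(join_cardinality(['a', 'b'], ['a', 'a', 'b']))
```

For the challenge it would be called as `join_cardinality(dfm['School Name'], incomes.index)`. Note that it inspects all key values, including ones with no partner on the other side; filtering with `isin()` first, as in the cell above, narrows the check to keys that actually take part in the merge.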

In [225]:
incomes[incomes.index.isin(dfm['School Name'])].index.value_counts()

Randolph-Macon College                           2
Gustavus Adolphus College                        1
Siena College                                    1
Smith College                                    1
Hamilton College                                 1
Wellesley College                                1
Denison University                               1
Oberlin College                                  1
University of Puget Sound                        1
Colorado College (CC)                            1
Reed College                                     1
Whitman College                                  1
Colby College                                    1
Ursinus College                                  1
Juniata College                                  1
Wittenberg University                            1
Grinnell College                                 1
Skidmore College                                 1
Moravian College                                 1
Lewis & Clark College          