In [1]:
import pandas as pd

## Rationale
You will often have two or more separate but related datasets that become more useful when combined. One way to combine them is to load the data into tables in a SQL database and write a JOIN query. Alternatively, you can use Pandas, which supports SQL-like joins and more.

## Concatenating DataFrames
The most basic way to combine DataFrames is to stack them on top of one another, which you can do with pd.concat.



In [2]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

In [3]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

In [4]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

In [5]:
frames = [df1, df2, df3]

In [6]:
result = pd.concat(frames)

In [7]:
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11
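Stacking works cleanly above because the three frames have non-overlapping index labels. As a minimal sketch of what happens when they overlap, the example below uses two small hypothetical frames (`a` and `b`, not part of the original data) that both carry the default index, and shows two pd.concat options that may help: `ignore_index` and `keys`.

```python
import pandas as pd

# Hypothetical frames for illustration; both use the default index 0..1.
a = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
b = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Plain stacking keeps the original labels, so 0 and 1 each appear twice.
stacked = pd.concat([a, b])

# ignore_index=True discards the old labels and renumbers the rows from 0.
renumbered = pd.concat([a, b], ignore_index=True)

# keys= adds an outer index level recording which frame each row came from.
labelled = pd.concat([a, b], keys=['first', 'second'])

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
print(labelled.loc['second'])     # only the rows that came from b
```

Which option fits depends on whether the original labels carry meaning: `ignore_index` is usually enough when they do not, while `keys` preserves provenance.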


In [8]:
pd.concat(frames, axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,A0,B0,C0,D0,,,,,,,,
1,A1,B1,C1,D1,,,,,,,,
2,A2,B2,C2,D2,,,,,,,,
3,A3,B3,C3,D3,,,,,,,,
4,,,,,A4,B4,C4,D4,,,,
5,,,,,A5,B5,C5,D5,,,,
6,,,,,A6,B6,C6,D6,,,,
7,,,,,A7,B7,C7,D7,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9
10,,,,,,,,,A10,B10,C10,D10
11,,,,,,,,,A11,B11,C11,D11
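The output above is mostly empty because pd.concat with axis=1 aligns rows by index label, and the three frames share no labels. The sketch below illustrates this with two small hypothetical frames (`left_part` and `right_part` are names invented for this example) and shows two ways you might respond: resetting the index so rows pair up by position, or passing `join='inner'` to keep only shared labels.

```python
import pandas as pd

# Hypothetical frames for illustration, with disjoint index labels.
left_part = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
right_part = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[2, 3])

# Disjoint labels: the union of labels is kept and the gaps are filled with NaN.
misaligned = pd.concat([left_part, right_part], axis=1)

# After resetting, both frames are labelled 0..1, so their rows pair up.
side_by_side = pd.concat(
    [left_part.reset_index(drop=True), right_part.reset_index(drop=True)],
    axis=1,
)

# join='inner' keeps only labels present in every frame (none in this case).
common_only = pd.concat([left_part, right_part], axis=1, join='inner')

print(misaligned.shape)    # (4, 4)
print(side_by_side.shape)  # (2, 4)
print(len(common_only))    # 0
```

Resetting the index is only appropriate when row position really does identify matching records; otherwise a key-based merge is the safer choice.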


## Merging DataFrames
For more complex combinations, pd.merge offers a great deal of flexibility. It performs SQL-style joins on columns, on indexes, or on a combination of both.

In [9]:
left = pd.DataFrame({'key': ['dog', 'cat', 'fish', 'bird'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

In [10]:
right = pd.DataFrame({'key': ['bird', 'fish', 'cat', 'dog'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [11]:
result = pd.merge(left, right, on='key')

In [12]:
result

Unnamed: 0,key,A,B,C,D
0,dog,A0,B0,C3,D3
1,cat,A1,B1,C2,D2
2,fish,A2,B2,C1,D1
3,bird,A3,B3,C0,D0
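The example above joins on a shared column. Since pd.merge can also join a column against an index, here is a minimal sketch of that case using the `left_on` and `right_index` parameters. The frames `pets` and `sounds` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical frames: 'pets' stores the key in a column,
# while 'sounds' stores the same key values in its index.
pets = pd.DataFrame({'key': ['dog', 'cat', 'fish'],
                     'A': ['A0', 'A1', 'A2']})
sounds = pd.DataFrame({'C': ['C0', 'C1', 'C2']},
                      index=['fish', 'dog', 'cat'])

# Match the 'key' column of the left frame against the index of the right frame.
combined = pd.merge(pets, sounds, left_on='key', right_index=True)

# Look up which C value ended up next to each key.
print(combined.set_index('key')['C'].to_dict())
```

As with the column-based merge, the rows are matched by key value rather than by position, so the differing order of the two frames does not matter.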


In [13]:
left = pd.DataFrame({'city': ['Springfield', 'Springfield',
                              'Dover', 'Chicago'],
                     'state': ['IL', 'OH', 'DE', 'IL'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

In [14]:
right = pd.DataFrame({'city': ['Cleveland', 'Dover',
                               'Springfield', 'Chicago'],
                      'state': ['OH', 'NH', 'IL', 'IL'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [15]:
pd.merge(left, right, on=['city', 'state'])

Unnamed: 0,city,state,A,B,C,D
0,Springfield,IL,A0,B0,C2,D2
1,Chicago,IL,A3,B3,C3,D3


Notice that merge defaults to an inner join: only rows whose city and state both match in the two DataFrames appear in the result. Other types of joins can be specified with the how keyword. (Hint: read the DataFrame merge documentation to learn about 'how', as it may be useful soon...)
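To make the effect of the how keyword concrete, the following sketch runs the same merge with each of the four common join types and prints which keys survive. The frames `owners` and `vets` are hypothetical, and `indicator=True` is an optional pd.merge parameter that adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical frames for illustration: 'dog' appears only on the left,
# 'bird' only on the right, and 'cat' and 'fish' in both.
owners = pd.DataFrame({'key': ['dog', 'cat', 'fish'],
                       'owner': ['Ann', 'Bo', 'Cy']})
vets = pd.DataFrame({'key': ['cat', 'fish', 'bird'],
                     'vet': ['Dr. X', 'Dr. Y', 'Dr. Z']})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    # indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
    merged = pd.merge(owners, vets, on='key', how=how, indicator=True)
    row_counts[how] = len(merged)
    print(how, sorted(merged['key']), sorted(set(merged['_merge'].astype(str))))

print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

An inner join keeps only keys found in both frames, left and right joins keep every key from the named side, and an outer join keeps every key from either side.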

Both the merge and concat functions have many options that allow a wide range of possible merges. For many more examples, consult the Pandas Merging Tutorial, which is the definitive resource for merging functionality.


Join the DataFrames below to return a new DataFrame of users with listed birthdays, along with their addresses if you have them. The resulting DataFrame should include all of the names and birthdays from dobs, even if you don't have a corresponding address. Individuals in the addresses DataFrame should be included only if they are also in the dobs DataFrame.

In [19]:
import pandas as pd

dobs = pd.DataFrame({'name': ['Suzy', 'Wei', 'Yulia', 'Arvind'],
                     'day': ['12', '19', '2', '23'],
                     'month': ['Dec', 'Nov', 'May', 'Jul']})

addresses = pd.DataFrame({'name': ['Marisol', 'Arvind', 'Stephan', 'Suzy'],
                          'city': ['San Francisco', 'Denver', 'Austin', 'Seattle'],
                          'state': ['CA', 'CO', 'TX', 'WA']})


birthday_address = pd.merge(dobs, addresses, on='name', how='left')

In [20]:
dobs

Unnamed: 0,name,day,month
0,Suzy,12,Dec
1,Wei,19,Nov
2,Yulia,2,May
3,Arvind,23,Jul


In [21]:
addresses

Unnamed: 0,name,city,state
0,Marisol,San Francisco,CA
1,Arvind,Denver,CO
2,Stephan,Austin,TX
3,Suzy,Seattle,WA


In [23]:
birthday_address

Unnamed: 0,name,day,month,city,state
0,Suzy,12,Dec,Seattle,WA
1,Wei,19,Nov,,
2,Yulia,2,May,,
3,Arvind,23,Jul,Denver,CO


The common column between the two DataFrames is "name", so that's what you merge on. To return every user with a listed birthday, you need a left merge, because dobs is passed as the first (left) argument.

In [17]:
birthday_address = pd.merge(dobs, addresses, on='name', how='left')

In [18]:
birthday_address

Unnamed: 0,name,day,month,city,state
0,Suzy,12,Dec,Seattle,WA
1,Wei,19,Nov,,
2,Yulia,2,May,,
3,Arvind,23,Jul,Denver,CO
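As a cross-check on the solution, the sketch below rebuilds the same two DataFrames and confirms that swapping the argument order and using a right merge should produce the same rows, with only the column order differing. The variable names `left_version`, `right_version`, `same` and `missing_address` are introduced here for illustration.

```python
import pandas as pd

dobs = pd.DataFrame({'name': ['Suzy', 'Wei', 'Yulia', 'Arvind'],
                     'day': ['12', '19', '2', '23'],
                     'month': ['Dec', 'Nov', 'May', 'Jul']})
addresses = pd.DataFrame({'name': ['Marisol', 'Arvind', 'Stephan', 'Suzy'],
                          'city': ['San Francisco', 'Denver', 'Austin', 'Seattle'],
                          'state': ['CA', 'CO', 'TX', 'WA']})

# The solution: keep every row of dobs, the left argument.
left_version = pd.merge(dobs, addresses, on='name', how='left')

# Equivalent formulation: dobs is now the right argument, so use how='right'.
right_version = pd.merge(addresses, dobs, on='name', how='right')

# Put the columns in the same order and sort by name before comparing.
same = (
    left_version.sort_values('name').reset_index(drop=True)
    .equals(right_version[left_version.columns]
            .sort_values('name').reset_index(drop=True))
)
print(same)  # True

# Names with a birthday but no known address have missing city values.
missing_address = left_version.loc[left_version['city'].isna(), 'name'].tolist()
print(missing_address)  # ['Wei', 'Yulia']
```

Marisol and Stephan never appear in either version because they have no entry in dobs.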
