# Combine data

To combine data we can use the methods `concat`, `merge` and `join`. Sometimes all three can be applied to get the same end result; which one you pick depends on your data and your preference.

In [1]:
import pandas as pd
import numpy as np

In [25]:
pd.concat?

In [3]:
pd.merge?

In [None]:
pd.DataFrame.join?

## Concat 

With `concat` we can combine data. This is especially handy for stacking DataFrames that share the same layout. By default `concat` works with `axis=0`, i.e. row-wise concatenation: it glues the rows of one DataFrame onto another.

In [4]:
df_01 = pd.DataFrame(np.random.randn(3, 5))
df_01

,0,1,2,3,4
0,-0.180766,0.205277,0.264044,1.680505,0.857488
1,-1.383589,-0.604053,0.81467,0.601188,-0.483399
2,1.131062,1.385822,-1.301972,-0.758514,3.111644


In [5]:
df_02 = pd.DataFrame(np.random.randn(3,5))
df_02

,0,1,2,3,4
0,-1.584656,-0.604503,-0.075125,-0.509854,-1.822336
1,-0.235976,0.31159,1.823117,0.286824,2.080734
2,1.954269,-0.383021,0.36128,-1.193056,1.735152


In [None]:
df_03 = pd.concat([df_01, df_02])
df_03

If you pass `axis=1`, it glues the DataFrames together in the column direction.

In [None]:
df_03 = pd.concat([df_01, df_02], axis=1)
df_03
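
The two directions can be compared on a small deterministic example. This is only a sketch: the names `top`, `bottom`, `stacked`, `renumbered` and `side_by_side` are made up for illustration, and `np.arange` replaces the random values so the result is reproducible. It also shows that row-wise concatenation keeps each frame's original index labels, which `ignore_index=True` replaces with a fresh range.

```python
import numpy as np
import pandas as pd

# Two small frames with the same 3 x 5 layout (illustrative values)
top = pd.DataFrame(np.arange(15).reshape(3, 5))
bottom = pd.DataFrame(np.arange(15, 30).reshape(3, 5))

# axis=0 (default): rows of `bottom` are appended below `top`
stacked = pd.concat([top, bottom])

# Same stacking, but with a fresh 0..n-1 index instead of repeated labels
renumbered = pd.concat([top, bottom], ignore_index=True)

# axis=1: columns of `bottom` are placed to the right of `top`
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked.shape, list(stacked.index))
print(renumbered.shape, list(renumbered.index))
print(side_by_side.shape)
```

Whether repeated index labels are a problem depends on what you do next; label-based lookups such as `stacked.loc[0]` return two rows here.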

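If the frames do not share the same labels on the other axis, `concat` fills the gaps with NaN (the default `join='outer'`). With `join='inner'` only the labels present in every frame are kept.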

In [20]:
A = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]}).set_index('key')
B = pd.DataFrame({'key': ['A', 'B', 'X', 'Y'], 'value': [3, 4, 5, 7]}).set_index('key')

In [27]:
A

key,value
A,1
B,2
C,3


In [28]:
B

key,value
A,3
B,4
X,5
Y,7


In [29]:
df = pd.concat([A, B], axis=0, join='inner')
df

key,value
A,1
B,2
C,3
A,3
B,4
X,5
Y,7
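
Because `A` and `B` have exactly the same column, the result above contains no NaN. A minimal sketch of the gap-filling behaviour, assuming a second frame with a hypothetical `extra` column that the first one lacks (the frames `a` and `b` are made up for this illustration):

```python
import pandas as pd

# `b` has an `extra` column that `a` does not have (illustrative data)
a = pd.DataFrame({'value': [1, 2, 3]}, index=['A', 'B', 'C'])
b = pd.DataFrame({'value': [3, 4, 5, 7], 'extra': [30, 40, 50, 70]},
                 index=['A', 'B', 'X', 'Y'])

# Row-wise, join='outer' (default): union of columns, gaps become NaN
outer_rows = pd.concat([a, b])

# Row-wise, join='inner': only the columns found in both frames survive
inner_rows = pd.concat([a, b], join='inner')

# Column-wise: the same logic is applied to the index labels instead
outer_cols = pd.concat([a, b], axis=1)
inner_cols = pd.concat([a, b], axis=1, join='inner')

print(outer_rows)
print(inner_cols)
```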


## Merge

You can also perform a SQL-style join using the `pd.merge` function:

In [33]:
pd.merge?

In [30]:
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
left 

,key,left_value
0,A,1
1,B,2
2,C,3


In [31]:
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, '51,3']})
right

,key,right_value
0,A,3
1,B,4
2,D,513


In [35]:
pd.merge(left, right, on='key', how='outer')

,key,left_value,right_value
0,A,1.0,3.0
1,B,2.0,4.0
2,C,3.0,
3,D,,513.0


In [None]:
pd.merge(left, right, how='inner', left_on=['key'], right_on=['key'])

In [None]:
pd.merge(left, right, how='outer', left_on=['key'], right_on=['key'])

In [None]:
pd.merge(left, right, how='right', left_on=['key'], right_on=['key'])

In [None]:
pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
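
The four `how` options differ only in which keys survive the merge. The sketch below collects the surviving keys per option and uses `indicator=True` to show which side each row came from. The names `left_demo`, `right_demo`, `keys_by_how` and `audit` are made up for illustration, and the right-hand values are simplified to plain integers.

```python
import pandas as pd

# Simplified copies of the frames above (illustrative values)
left_demo = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
right_demo = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3, 4, 5]})

# Which keys remain for each join type
keys_by_how = {
    how: sorted(pd.merge(left_demo, right_demo, on='key', how=how)['key'])
    for how in ['inner', 'left', 'right', 'outer']
}
print(keys_by_how)

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
audit = pd.merge(left_demo, right_demo, on='key', how='outer', indicator=True)
print(audit)
```

The `_merge` column is a quick way to check how many rows failed to find a partner before deciding between an inner and an outer join.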

In [None]:
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3], 'other_key': ['X','Y','Z']})
left

In [None]:
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_value': [3,'53,2', 5], 'some_key': ['W','Y', 'Z']})
right

In [None]:
pd.merge(left, right, how='inner', left_on=['key', 'other_key'], right_on=['key', 'some_key' ])

In [None]:
df_03 = pd.merge(left, right, how='left', left_on=['key'], right_on=['key'])
df_03


## Join
A pandas DataFrame also has a `join` method for merging on the index. Overlapping column names are not allowed unless you pass `lsuffix` and/or `rsuffix`.
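
A minimal sketch of the overlapping-column restriction, using two made-up frames `a` and `b` that both have a `value` column. Without suffixes `join` raises a `ValueError`; with `lsuffix`/`rsuffix` the clashing columns are renamed. The suffix strings are arbitrary choices for this example.

```python
import pandas as pd

# Both frames carry a column named `value` (illustrative data)
a = pd.DataFrame({'value': [1, 2]}, index=['x', 'y'])
b = pd.DataFrame({'value': [10, 20]}, index=['x', 'z'])

# Without suffixes the overlapping column name is rejected
try:
    a.join(b)
    overlap_error = None
except ValueError as err:
    overlap_error = err
print(overlap_error)

# With suffixes the two `value` columns are kept side by side
joined = a.join(b, lsuffix='_left', rsuffix='_right', how='outer')
print(joined)
```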

In [None]:
left

In [None]:
right

In [None]:
left.set_index('key').join(right.set_index('key'), how='outer')

In [None]:
right = right.rename(columns = {'key': 'name'})
right

In [None]:
df_04 = left.join(right, how='outer')
df_04

With the `on=` argument you can match a column of the calling DataFrame against the index of the other one. For example:

In [None]:
left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'], 'value': range(6)})
left1

In [None]:
right1 = pd.DataFrame({'group_val': [3.5,7]}, index = ['a','b'])
right1

In [None]:
df_05 = left1.join(right1, on='key')
df_05
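
The same result can be obtained with `pd.merge` by pairing `left_on` with `right_index=True`. The sketch below rebuilds the two frames under the hypothetical names `events` and `lookup` and checks that both routes agree; `how='left'` is spelled out because it is the default for `join` but not for `merge`.

```python
import pandas as pd

# Same shape of data as left1 / right1 above, under illustrative names
events = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
lookup = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Column `key` of `events` is matched against the index of `lookup`
via_join = events.join(lookup, on='key')

# Equivalent merge: left column against right index, left join
via_merge = pd.merge(events, lookup, left_on='key', right_index=True, how='left')

print(via_join)
print(via_join.equals(via_merge))
```

Key `c` has no entry in `lookup`, so its `group_val` is NaN in both results.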

Merging on indexes with `merge` is also possible:

In [None]:
left1 = left1.set_index('key')
left1

In [None]:
df_06 = pd.merge(left1, right1, how = 'outer', left_index=True, right_index=True)
df_06

## Pivoting

In [None]:
def reformat():
    # Load the tab-separated glucose log and index it by timestamp
    df = pd.read_table('data/JamesBondGlucoseLevels25.txt')
    df = df.set_index('Time')
    # Shorten the raw column names
    df = df.rename(columns={
        'hist.glucose.mg.dL.': 'glucose.hist',
        'Scan.glucose.mg.dL.': 'glucose.scan',
        'previous.time': 'time.previous',
        'adjusted.time': 'time.new'})
    columns_to_keep = [
        'Type',
        'glucose.hist',
        'glucose.scan',
        'time.previous',
        'time.new']
    # Keep the selected columns; cells that contain whitespace become NaN
    df = df[columns_to_keep].replace(r'\s+', np.nan, regex=True)
    return df

df = reformat()
df.head(5)

In [None]:
df = df.reset_index()
df = df.pivot(index='Time', columns='Type', values='glucose.scan')
df.head(5)
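
Since the glucose file is not reproduced here, the reshaping done by `pivot` can be sketched on a tiny made-up table. The frame `long_df`, its `reading` column and all values are illustrative only. Each distinct `Type` becomes a column, each distinct `Time` becomes a row, and the cells are filled from `reading`.

```python
import pandas as pd

# Long format: one row per (Time, Type) combination (made-up values)
long_df = pd.DataFrame({
    'Time': ['08:00', '08:00', '09:00', '09:00'],
    'Type': [0, 1, 0, 1],
    'reading': [95.0, 97.0, 110.0, 108.0],
})

# Wide format: one row per Time, one column per Type
wide_df = long_df.pivot(index='Time', columns='Type', values='reading')
print(wide_df)
```

`pivot` raises a `ValueError` when the same `index`/`columns` pair occurs more than once; in that case `pivot_table`, which aggregates duplicates, is the usual alternative.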