# Pandas Merge and Groupby

## Merge

`pandas` provides facilities for combining Series and DataFrame objects: set logic for the indexes, and relational-algebra functionality for join- and merge-type operations.

In [5]:
import pandas as pd
import numpy as np

In [6]:
df = pd.DataFrame(np.random.randn(10, 4))
df

,0,1,2,3
0,-0.190119,-0.929216,-0.052559,-0.691737
1,0.187512,-0.560223,-0.631252,1.248516
2,0.883326,0.699764,1.192473,1.5445
3,-0.621701,0.976108,0.018224,-0.158604
4,-1.143894,-1.322625,-0.694949,-0.278043
5,-1.731017,0.994448,-0.586827,-0.373763
6,-0.522256,-0.499761,-1.294147,-1.134487
7,1.631342,2.257581,0.357565,0.12149
8,-1.216415,-0.121272,0.617422,0.408866
9,-0.088069,0.331678,-0.023853,0.830165


In [7]:
# break into pieces
pieces = [df[:3], df[3:7], df[7:]]

In [8]:
# put it all back together
pd.concat(pieces)

,0,1,2,3
0,-0.190119,-0.929216,-0.052559,-0.691737
1,0.187512,-0.560223,-0.631252,1.248516
2,0.883326,0.699764,1.192473,1.5445
3,-0.621701,0.976108,0.018224,-0.158604
4,-1.143894,-1.322625,-0.694949,-0.278043
5,-1.731017,0.994448,-0.586827,-0.373763
6,-0.522256,-0.499761,-1.294147,-1.134487
7,1.631342,2.257581,0.357565,0.12149
8,-1.216415,-0.121272,0.617422,0.408866
9,-0.088069,0.331678,-0.023853,0.830165
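As a small check of the round trip above, the sketch below splits a frame and puts it back together. The frame `small` and its values are made up for illustration; they are not part of the original example.

```python
import numpy as np
import pandas as pd

# hypothetical 6x2 frame, used only for this illustration
small = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['x', 'y'])

# split by row position, then concatenate in the original order
pieces = [small[:2], small[2:4], small[4:]]
rebuilt = pd.concat(pieces)

# the round trip preserves both the values and the index labels
print(rebuilt.equals(small))

# concatenating in a different order keeps the original labels;
# ignore_index=True renumbers the rows 0..n-1 instead
shuffled = pd.concat([small[4:], small[:4]], ignore_index=True)
print(shuffled.index.tolist())
```

Note that `pd.concat` stacks rows along `axis=0` by default and aligns on column labels, so the pieces do not need to arrive in any particular order.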


To join two DataFrames, use `merge()`, the pandas equivalent of a `JOIN` in `SQL`.

In [9]:
# define two DataFrames for the example
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [10]:
# inner join (the default) of the two tables above on column 'key'
pd.merge(left, right, on='key')

,key,lval,rval
0,foo,1,4
1,bar,2,5


In [11]:
# outer join of the two tables above on column 'key';
# both tables hold the same keys, so the rows match the inner join
pd.merge(left, right, on='key', how='outer')

,key,lval,rval
0,foo,1,4
1,bar,2,5
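The two joins above return the same rows because both tables share exactly the same keys. The difference shows up once the keys only partly overlap. The sketch below uses hypothetical tables `left2` and `right2`, which are not part of the original example.

```python
import pandas as pd

# hypothetical tables whose keys only partly overlap
left2 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right2 = pd.DataFrame({'key': ['bar', 'baz'], 'rval': [5, 6]})

# inner join keeps only the keys present in both tables
inner = pd.merge(left2, right2, on='key', how='inner')

# outer join keeps every key; values missing on one side become NaN
# indicator=True adds a '_merge' column naming each row's origin
outer = pd.merge(left2, right2, on='key', how='outer', indicator=True)

print(inner)
print(outer)
```

Here the inner join should contain only `bar`, while the outer join contains `foo`, `bar`, and `baz`, with `NaN` wherever one side has no match.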


## Grouping

By `group by`, we mean a process involving one or more of the following steps:
- splitting the data into groups based on some criteria
- applying a function to each group independently
- combining the results into a data structure
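The three steps can be spelled out by hand, which may help to see what `groupby` does for us. This is only a sketch on a made-up frame `toy`; it does not show how pandas implements grouping internally.

```python
import pandas as pd

# hypothetical data, used only for this illustration
toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0]})

# split:   iterating a groupby yields (key, sub-DataFrame) pairs
# apply:   sum column C within each sub-DataFrame
# combine: collect the per-group results into a Series keyed by group
manual = pd.Series({key: group['C'].sum() for key, group in toy.groupby('A')})

# the same result in a single call
builtin = toy.groupby('A')['C'].sum()

print(manual)
print(builtin)
```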

Let's create the DataFrame we will work on.

In [14]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

df

,A,B,C,D
0,foo,one,-0.910111,-1.200847
1,bar,one,-1.067202,1.321423
2,foo,two,0.753015,-0.708634
3,bar,three,-1.851238,0.645349
4,foo,two,-1.528145,-0.46581
5,bar,two,1.014305,-0.108448
6,foo,one,-0.061367,0.278837
7,foo,three,0.255566,-1.376579


In [17]:
# group the DataFrame by column A and sum the values of C and D
# (selecting C and D explicitly keeps the string column B out of the sum)
df.groupby('A')[['C', 'D']].sum()


A,C,D
bar,-1.904135,1.858323
foo,-1.491042,-3.473033


In [18]:
# we can also group by multiple columns; the result has a hierarchical (MultiIndex) index
df.groupby(['A', 'B']).sum()

A,B,C,D
bar,one,-1.067202,1.321423
bar,three,-1.851238,0.645349
bar,two,1.014305,-0.108448
foo,one,-0.971478,-0.92201
foo,three,0.255566,-1.376579
foo,two,-0.77513,-1.174444
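A result with a hierarchical index can be queried by label or flattened back into ordinary columns. The sketch below shows both on a made-up frame `toy` with fixed values, so the numbers differ from the random data above.

```python
import pandas as pd

# hypothetical data, used only for this illustration
toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'B': ['one', 'one', 'two', 'one'],
                    'C': [1.0, 2.0, 3.0, 4.0]})

by_ab = toy.groupby(['A', 'B']).sum()

# select a single value by its (A, B) label pair
print(by_ab.loc[('foo', 'two'), 'C'])

# select every row under the outer label 'bar'
print(by_ab.loc['bar'])

# turn the index levels back into ordinary columns
flat = by_ab.reset_index()
print(flat.columns.tolist())
```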


In [19]:
# agg() lets you apply a different aggregation function to each column in one `groupby` statement
df.groupby('A').agg({'C': 'sum', 'D': 'max'})

A,C,D
bar,-1.904135,1.321423
foo,-1.491042,0.278837
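Besides the dictionary form, `agg()` also accepts named aggregations and lists of function names. The sketch below uses a made-up frame `toy`; the output column names `c_sum` and `d_max` are arbitrary choices for this illustration.

```python
import pandas as pd

# hypothetical data, used only for this illustration
toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0],
                    'D': [10.0, 20.0, 30.0, 40.0]})

# named aggregation: output_column=(input_column, function)
named = toy.groupby('A').agg(c_sum=('C', 'sum'), d_max=('D', 'max'))

# several functions on one column: pass a list of function names
multi = toy.groupby('A')['C'].agg(['sum', 'mean'])

print(named)
print(multi)
```

Named aggregation is handy when the result feeds into later steps, because the output columns get explicit names.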
