# Ultimate Guide to Merging data in Pandas
## From semi/anti joins to validating data merges

### Introduction
With each data science project or dataset, you want to perform several analyses and create plots to find insights. Often, the raw data comes not in one massive table but in many separate ones. To answer your questions, you need to be able to join multiple tables into one and then perform operations on the result.

You can acquire these skills by learning different kinds of merge operations such as inner join, left and right joins, self and anti joins, merging on indexes, etc.

The goal of this article is that you come away with a strong knowledge of combining data in pandas using precise methods suited for any question you want to ask about your data.

### Pandas merge()
pandas provides several methods for merging dataframes. Of these, `merge()` is the most flexible. It is a dataframe method with the following general syntax:



df1.merge(df2, on='common_column')

When combining tables, there are two terms you should know: the table you name first is called __the left table__, and the other is called __the right table__. In the code snippet above, the left table is `df1` and the right table is `df2`. The verbs join, combine and merge are used interchangeably.

Now let's see how we perform an inner join:

An inner join returns only the rows that have matching values in both tables. To perform the join, you need to know the name of the common column that exists in both tables.
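Before turning to real data, here is a minimal sketch of an inner join on two toy tables. The frames `left` and `right` and their column names are hypothetical, made up purely for illustration. Only the keys present in both tables should survive the merge.

```python
import pandas as pd

# Hypothetical toy tables (not part of the dataset used in this article)
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Inner join: keep only rows whose key appears in both tables
inner = left.merge(right, on="key", how="inner")
print(inner)
```

Key `1` exists only in `left` and key `4` only in `right`, so neither appears in the result.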

In [5]:
# Load necessary libraries
import pandas as pd
import numpy as np

In [6]:
# Enable multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [17]:
# Load necessary dataframes
user_usage = pd.read_csv('data/user_usage.csv')
user_devices = pd.read_csv('data/user_device.csv').drop('user_id', axis='columns')

### Basic Exploration

In [19]:
user_usage.info()
user_devices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   outgoing_mins_per_month  240 non-null    float64
 1   outgoing_sms_per_month   240 non-null    float64
 2   monthly_mb               240 non-null    float64
 3   use_id                   240 non-null    int64  
dtypes: float64(3), int64(1)
memory usage: 7.6 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272 entries, 0 to 271
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   use_id            272 non-null    int64  
 1   platform          272 non-null    object 
 2   platform_version  272 non-null    float64
 3   device            272 non-null    object 
 4   use_type_id       272 non-null    int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 10.8+ KB


In [21]:
user_usage.describe()

| | outgoing_mins_per_month | outgoing_sms_per_month | monthly_mb | use_id |
|---|---|---|---|---|
| count | 240.0 | 240.0 | 240.0 | 240.0 |
| mean | 274.559167 | 98.968292 | 3628.602042 | 23285.516667 |
| std | 293.745744 | 111.172685 | 4486.311513 | 624.139253 |
| min | 0.5 | 0.25 | 0.0 | 22787.0 |
| 25% | 74.59 | 29.03 | 1132.23 | 22888.75 |
| 50% | 189.705 | 70.775 | 1797.975 | 22987.5 |
| 75% | 336.045 | 125.6275 | 4246.6175 | 23482.5 |
| max | 1816.63 | 906.92 | 31146.67 | 25220.0 |


Let's say we have these two tables:

In [18]:
user_usage.head()
user_devices.head()

| | outgoing_mins_per_month | outgoing_sms_per_month | monthly_mb | use_id |
|---|---|---|---|---|
| 0 | 21.97 | 4.82 | 1557.33 | 22787 |
| 1 | 1710.08 | 136.88 | 7267.55 | 22788 |
| 2 | 1710.08 | 136.88 | 7267.55 | 22789 |
| 3 | 94.46 | 35.17 | 519.12 | 22790 |
| 4 | 71.59 | 79.26 | 1557.33 | 22792 |


| | use_id | platform | platform_version | device | use_type_id |
|---|---|---|---|---|---|
| 0 | 22782 | ios | 10.2 | iPhone7,2 | 2 |
| 1 | 22783 | android | 6.0 | Nexus 5 | 3 |
| 2 | 22784 | android | 5.1 | SM-G903F | 1 |
| 3 | 22785 | ios | 10.2 | iPhone7,2 | 3 |
| 4 | 22786 | android | 6.0 | ONE E1003 | 1 |


This data was downloaded from the KillBiller application. KillBiller was a free service that compared every mobile tariff in the UK and Ireland. The first table, `user_usage`, contains monthly statistics on users' mobile usage. The `user_devices` table provides details about each user's phone, such as its operating system and model.

#### Question 1:
How many users use Android OS and how many use iOS?

To answer this question, we need information from both tables. One attribute links the two tables: `use_id`. We will use this column in our merge:

In [31]:
usage_w_os = user_usage.merge(user_devices[['platform', 'use_id']], on='use_id', how='inner')
usage_w_os.sample(5)
print(f'Number of users for each OS: {usage_w_os["platform"].value_counts()}')

| | outgoing_mins_per_month | outgoing_sms_per_month | monthly_mb | use_id | platform |
|---|---|---|---|---|---|
| 72 | 8.14 | 0.79 | 1777.61 | 22912 | android |
| 64 | 145.55 | 11.5 | 3114.67 | 22895 | android |
| 61 | 28.85 | 30.22 | 3114.67 | 22890 | android |
| 137 | 101.59 | 84.41 | 5191.12 | 23018 | android |
| 17 | 797.06 | 7.67 | 15573.33 | 22816 | android |


Number of users for each OS: android    157
ios          2
Name: platform, dtype: int64


It looks like there is a huge difference between the two operating systems in our dataset. 

The merge above is an example of an inner join. In `merge()`, the `how` argument defaults to `'inner'`, so we could have omitted it. The `on` argument specifies the common column. If the tables share several key columns, you can pass a list of column names to `on`, and `pandas` will match rows on all of them.

Note that for the right table, I subset `user_devices` to exclude columns that are irrelevant to the question.
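Passing a list to `on` can be sketched with two small tables. The `sales` and `costs` frames and their columns are hypothetical and unrelated to the KillBiller data. A row is kept only when every listed column matches, and `how` is left out to rely on the inner-join default.

```python
import pandas as pd

# Hypothetical tables keyed by the combination of store and year
sales = pd.DataFrame({
    "store": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "revenue": [100, 120, 90],
})
costs = pd.DataFrame({
    "store": ["A", "B", "B"],
    "year": [2020, 2020, 2021],
    "cost": [70, 60, 65],
})

# Match on both columns at once; how defaults to 'inner'
merged = sales.merge(costs, on=["store", "year"])
print(merged)
```

Here only the (A, 2020) and (B, 2020) pairs exist in both tables, so the result has two rows.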

Now, as we explore further, we will notice that the number of users in the two datasets differs:

In [33]:
print(f'Dimensions of user_usage table: {user_usage.shape}')
print(f'Dimensions of user_devices table: {user_devices.shape}')
print(f'Dimensions of the joined table: {usage_w_os.shape}')

Dimensions of user_usage table: (240, 4)
Dimensions of user_devices table: (272, 5)
Dimensions of the joined table: (159, 5)
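The shapes alone do not show which side the unmatched IDs come from. One way to check, sketched here on hypothetical toy frames rather than the real CSV files, is an outer merge with `indicator=True`. This adds a `_merge` column that labels each row as `both`, `left_only` or `right_only`.

```python
import pandas as pd

# Hypothetical stand-ins for the usage and device tables
usage = pd.DataFrame({"use_id": [1, 2, 3], "monthly_mb": [500.0, 1200.0, 800.0]})
devices = pd.DataFrame({
    "use_id": [2, 3, 4, 5],
    "platform": ["android", "ios", "android", "android"],
})

# An outer merge keeps every row from both sides; indicator=True records its origin
check = usage.merge(devices, on="use_id", how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)
```

The same call on `user_usage` and `user_devices` would show how many IDs are missing from each table, assuming `use_id` is unique within each of them.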


Clearly, 159 user IDs appear in both tables. This means there are user IDs that are in the `user_devices` table but not in the `user_usage` table, and vice versa. So, the next question we want to ask is:

#### Question 2
How many users use Android OS and iOS, including all the users without any monthly stats in `user_usage` table?

We can answer this question by using either a __left__ or a __right__ join. First let's look at the general case of one-sided joins: