# 4.06 Combining and Exporting Data

## Table of contents

#### 1. Preparing the Notebook and Creating Data to Experiment On
#### 2. Concatenating Data
#### 3. Appending Data
#### 4. Merging Data

## 1. Preparing the Notebook and Creating Data to Experiment On

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import os

In [11]:
# Define a dictionary containing January 2020 data
data1 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'],
    'purchased_meat':[0, 13, 3, 4],
    'purchased_alcohol':[1, 2, 10, 0],
    'purchased_snacks':[10, 5, 1, 7]}

In [12]:
data1

{'customer_id': ['6732', '767', '890', '635'],
 'month': ['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'],
 'purchased_meat': [0, 13, 3, 4],
 'purchased_alcohol': [1, 2, 10, 0],
 'purchased_snacks': [10, 5, 1, 7]}

In [3]:
# Define a dictionary containing February 2020 data
data2 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Feb-20', 'Feb-20', 'Feb-20', 'Feb-20'],
    'purchased_meat':[0, 10, 5, 3],
    'purchased_alcohol':[2, 4, 14, 0],
    'purchased_snacks':[15, 3, 2, 6]}

In [5]:
data2

{'customer_id': ['6732', '767', '890', '635'],
 'month': ['Feb-20', 'Feb-20', 'Feb-20', 'Feb-20'],
 'purchased_meat': [0, 10, 5, 3],
 'purchased_alcohol': [2, 4, 14, 0],
 'purchased_snacks': [15, 3, 2, 6]}

In [29]:
# Define a dictionary whose columns differ from those of data1
data3 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'],
    'days_purchased_on':[0, 13, 3, 4]}

In [30]:
data3

{'customer_id': ['6732', '767', '890', '635'],
 'month': ['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'],
 'days_purchased_on': [0, 13, 3, 4]}

In [31]:
# Convert the dictionaries into dataframes
df = pd.DataFrame(data1,index=[0,1,2,3])
df_1 = pd.DataFrame(data2,index=[0,1,2,3])
df_2 = pd.DataFrame(data3,index=[0, 1, 2, 3])

In [16]:
df

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7


In [17]:
df_1

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Feb-20,0,2,15
1,767,Feb-20,10,4,3
2,890,Feb-20,5,14,2
3,635,Feb-20,3,0,6


In [32]:
df_2

Unnamed: 0,customer_id,month,days_purchased_on
0,6732,Jan-20,0
1,767,Jan-20,13
2,890,Jan-20,3
3,635,Jan-20,4


## 2. Concatenating Data

#### Concatenation is a good choice for combining data sets that have the SAME columns (when stacking them on top of each other) or the SAME number of rows (when placing them side by side).

#### One common use case of concatenation is combining two or more data sets that share the same characteristics but refer to different time periods.

In [18]:
# create a list with data that we want to concatenate
frames = [df, df_1]

In [22]:
# check if frames is really a list
type(frames)

list

In [23]:
# concatenate data with pd.concat
df_concat = pd.concat(frames)

In [24]:
# et voilà, the result
df_concat

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7
0,6732,Feb-20,0,2,15
1,767,Feb-20,10,4,3
2,890,Feb-20,5,14,2
3,635,Feb-20,3,0,6


In [25]:
# creating a wide-format dataframe (placing the two dataframes side by side)
df_concat2 = pd.concat(frames, axis = 1)

In [26]:
df_concat2

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,customer_id.1,month.1,purchased_meat.1,purchased_alcohol.1,purchased_snacks.1
0,6732,Jan-20,0,1,10,6732,Feb-20,0,2,15
1,767,Jan-20,13,2,5,767,Feb-20,10,4,3
2,890,Jan-20,3,10,1,890,Feb-20,5,14,2
3,635,Jan-20,4,0,7,635,Feb-20,3,0,6


#### The process for concatenating with pd.concat() is as follows:
#### 1. Check that the data is suitable: do the dataframes have the same columns (or, for side-by-side concatenation, the same number of rows)?
#### 2. By default, pd.concat() places the dataframes on top of each other (axis = 0)
#### 3. It requires a list of dataframes as its main argument
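#### The steps above can be sketched on a smaller example. The dataframes `jan` and `feb` are hypothetical stand-ins for df and df_1; `ignore_index` and `keys` are standard pd.concat() parameters that are not used elsewhere in this notebook and are shown here only as options.

```python
import pandas as pd

# Hypothetical stand-ins for the January and February dataframes
jan = pd.DataFrame({'customer_id': ['6732', '767'],
                    'month': ['Jan-20', 'Jan-20'],
                    'purchased_meat': [0, 13]})
feb = pd.DataFrame({'customer_id': ['6732', '767'],
                    'month': ['Feb-20', 'Feb-20'],
                    'purchased_meat': [0, 10]})

# Default axis=0 stacks the frames and keeps the original index labels
stacked = pd.concat([jan, feb])

# ignore_index=True renumbers the rows from 0 instead
renumbered = pd.concat([jan, feb], ignore_index=True)

# keys= adds an outer index level recording which frame each row came from
labelled = pd.concat([jan, feb], keys=['jan', 'feb'])

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(labelled.loc['feb'])
```

#### Renumbering the index is worth considering whenever the repeated labels (0, 1, 2, 3, 0, 1, 2, 3) in df_concat would get in the way of selecting rows by label.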

## 3. Appending Data

#### Appending data is a straightforward approach for adding rows to an existing dataframe with the SAME columns.

In [27]:
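# appending data with .append()
# note: DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0;
# on newer versions use pd.concat([df, df_1]) instead
df_append = df.append(df_1)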

In [28]:
df_append

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7
0,6732,Feb-20,0,2,15
1,767,Feb-20,10,4,3
2,890,Feb-20,5,14,2
3,635,Feb-20,3,0,6


#### The dataframe you want to append to goes before the dot (df), while the dataframe being appended goes inside the parentheses (df_1).
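#### Because DataFrame.append() is no longer available from pandas 2.0 onwards, the same result can be obtained with pd.concat(). A minimal sketch, assuming two hypothetical dataframes `base` and `extra` that play the roles of df and df_1:

```python
import pandas as pd

# Hypothetical stand-ins for df (the frame appended to) and df_1 (the frame being appended)
base = pd.DataFrame({'customer_id': ['6732', '767'],
                     'month': ['Jan-20', 'Jan-20'],
                     'purchased_snacks': [10, 5]})
extra = pd.DataFrame({'customer_id': ['6732', '767'],
                      'month': ['Feb-20', 'Feb-20'],
                      'purchased_snacks': [15, 3]})

# Equivalent of base.append(extra): the frame appended to comes first in the list
appended = pd.concat([base, extra])

print(appended)
```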

In [33]:
# appending a dataframe with different columns
df_append_test = df.append(df_2)

In [34]:
df_append_test

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0.0,1.0,10.0,
1,767,Jan-20,13.0,2.0,5.0,
2,890,Jan-20,3.0,10.0,1.0,
3,635,Jan-20,4.0,0.0,7.0,
0,6732,Jan-20,,,,0.0
1,767,Jan-20,,,,13.0
2,890,Jan-20,,,,3.0
3,635,Jan-20,,,,4.0


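#### The command .append() does not work well if the columns are NOT the same: it still runs, but every column missing from one of the dataframes is filled with NaN (and the integer columns become floats), as the output above shows.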
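#### The behaviour with mismatched columns can be reproduced with pd.concat(), which follows the same alignment rules. The sketch below uses two hypothetical dataframes, `purchases` and `days`, and also shows the `join='inner'` option, which keeps only the columns the two frames share.

```python
import pandas as pd

# Hypothetical stand-ins for df and df_2: they share only customer_id and month
purchases = pd.DataFrame({'customer_id': ['6732', '767'],
                          'month': ['Jan-20', 'Jan-20'],
                          'purchased_meat': [0, 13]})
days = pd.DataFrame({'customer_id': ['6732', '767'],
                     'month': ['Jan-20', 'Jan-20'],
                     'days_purchased_on': [0, 13]})

# Default join='outer': union of all columns, gaps filled with NaN
union = pd.concat([purchases, days])

# join='inner': only the columns present in both frames are kept
shared = pd.concat([purchases, days], join='inner')

print(union.isna().sum())       # number of missing values per column
print(union.dtypes)             # integer columns with gaps are stored as floats
print(shared.columns.tolist())
```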

## 4. Merging Data

#### The best use cases for the df.merge() function are those where the dataframes you want to combine DON'T MATCH IN SHAPE but share at least one common column (a key) on which they can be matched.

### Inner Join: Used to keep only information that's present in BOTH data sets
### Left Join: Used to keep information from the left dataframe, combining it with any information in the right dataframe that can be mapped back to the dataframe on the left
### Right Join: Operates under the same principle as the left join, only reversed.
### Full Outer Join: Used to keep all information from both dataframes, regardless of whether they match. When using this method on data sets of different sizes, you'll end up with a lot of missing data in your final dataframe.
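#### The four join types can be compared through the `how` parameter of merge(). The sketch below uses two hypothetical dataframes, `left` and `right`, whose customer_id values only partly overlap, so each join type returns a different set of rows.

```python
import pandas as pd

# Hypothetical dataframes: '767' and '890' appear in both,
# '6732' only in left, '635' only in right
left = pd.DataFrame({'customer_id': ['6732', '767', '890'],
                     'purchased_meat': [0, 13, 3]})
right = pd.DataFrame({'customer_id': ['767', '890', '635'],
                      'days_purchased_on': [13, 3, 4]})

# Run the same merge once per join type and keep the results for comparison
results = {how: left.merge(right, on='customer_id', how=how)
           for how in ['inner', 'left', 'right', 'outer']}

for how, result in results.items():
    print(how, len(result), sorted(result['customer_id']))
```

#### When `how` is omitted, merge() performs an inner join, which is what the cells below do.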

In [35]:
# merge two dataframes on one common identifier (inner join by default)
df_merged = df.merge(df_2, on = 'customer_id')

In [36]:
df_merged

Unnamed: 0,customer_id,month_x,purchased_meat,purchased_alcohol,purchased_snacks,month_y,days_purchased_on
0,6732,Jan-20,0,1,10,Jan-20,0
1,767,Jan-20,13,2,5,Jan-20,13
2,890,Jan-20,3,10,1,Jan-20,3
3,635,Jan-20,4,0,7,Jan-20,4


In [38]:
# merge two dataframes on two common identifiers
df_merged = df.merge(df_2, on = ['customer_id','month'])

In [39]:
df_merged

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0,1,10,0
1,767,Jan-20,13,2,5,13
2,890,Jan-20,3,10,1,3
3,635,Jan-20,4,0,7,4


In [40]:
# merge two dataframes on two common identifiers
# and record in '_merge' which dataframe each row was found in
df_merged = df.merge(df_2, on =['customer_id','month'], indicator = True)

In [41]:
df_merged

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on,_merge
0,6732,Jan-20,0,1,10,0,both
1,767,Jan-20,13,2,5,13,both
2,890,Jan-20,3,10,1,3,both
3,635,Jan-20,4,0,7,4,both


In [44]:
# Counting the values in the column '_merge'
df_merged['_merge'].value_counts()

both          4
left_only     0
right_only    0
Name: _merge, dtype: int64
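#### In the output above every row is marked "both" because the two dataframes contain exactly the same customers. As a sketch of what the indicator shows when they do not, the example below merges two hypothetical, partly overlapping dataframes with a full outer join.

```python
import pandas as pd

# Hypothetical dataframes: '6732' exists only in left, '635' only in right
left = pd.DataFrame({'customer_id': ['6732', '767', '890'],
                     'purchased_meat': [0, 13, 3]})
right = pd.DataFrame({'customer_id': ['767', '890', '635'],
                      'days_purchased_on': [13, 3, 4]})

# An outer join keeps unmatched rows; indicator=True labels where each row came from
checked = left.merge(right, on='customer_id', how='outer', indicator=True)
counts = checked['_merge'].value_counts()

# Rows that were not found in both dataframes
unmatched = checked[checked['_merge'] != 'both']

print(counts)
print(unmatched)
```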

In [45]:
# Test the function form pd.merge() without overwriting df_merged
pd.merge(df,df_2, on = ['customer_id','month'], indicator = True)

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on,_merge
0,6732,Jan-20,0,1,10,0,both
1,767,Jan-20,13,2,5,13,both
2,890,Jan-20,3,10,1,3,both
3,635,Jan-20,4,0,7,4,both
