<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

In [None]:
import pandas as pd
import numpy as np
import datetime

pd.options.display.float_format = '{:,.2f}'.format

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Merge DataFrames

* Similar functionality to VLOOKUP in Excel
<br><br>
* Useful when data is distributed across multiple files / database tables
<br><br>
* we match rows from one DataFrame to rows in another DataFrame
    * the row matching is done on "keys"
        * these are either DataFrame columns or the row index
<br><br>        
* Several ways to merge two DataFrames: left merge, right merge, inner merge, outer merge, cross merge (added in pandas 1.2)
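
Of the merge types listed above, the cross merge is the only one not demonstrated in the sections that follow. Here is a minimal sketch, assuming pandas 1.2 or later; the `sizes` and `colors` DataFrames are made up for illustration and are not part of the course data:

```python
import pandas as pd

# Hypothetical example data (not from the course datasets)
sizes = pd.DataFrame({'Size': ['S', 'M', 'L']})
colors = pd.DataFrame({'Color': ['red', 'blue']})

# how='cross' pairs every left row with every right row;
# no key columns are passed
combos = pd.merge(sizes, colors, how='cross')

print(combos)
print(combos.shape)  # 3 left rows x 2 right rows = 6 rows, 2 columns
```

Because every row is paired with every other row, the result grows quickly, so a cross merge is usually only practical for small DataFrames.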

In [None]:
countries = pd.DataFrame({
    'Letter': ['a', 'b', 'c', 'd', 'n', 'o'],
    'Country': ['Andorra', 'Belgium', 'Croatia', 'Denmark', 'Niger', 'Oman']})

In [None]:
capitals = pd.DataFrame( {
    'Name': ['Andorra', 'Denmark', 'Spain', 'Portugal'], 
    'Capital': ['Andorra la Vella', 'Copenhagen', 'Madrid', 'Lisbon']
} )

In [None]:
countries

In [None]:
capitals

* we want to look up the capital city of each country and append it to our `countries` DataFrame
* essentially an Excel VLOOKUP
    * we want to match the `Country` values in `countries` to the `Name` values in `capitals`

<br><br>

## Left Merge

* a new DataFrame object is created
<br><br>
* columns from both DataFrames are added to the result
    * if there are columns with identical names, a suffix will be added to each resulting column (e.g. `country_x`, `country_y`)
<br><br>
* **ALL rows from the left-hand side DataFrame are included**
<br><br>
* **right-hand side DataFrame rows are included ONLY IF they match**
    * for our example, the country names are the keys on which we match rows from the right-hand side DataFrame (`capitals`) to rows in the left-hand side DataFrame (`countries`)
    * we include rows from the `capitals` DataFrame only if the country name in those rows matches a country name in the `countries` DataFrame
<br><br>
* similar to a SQL left outer join

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/left_merge.png" width="400px" />

In [None]:
countries

In [None]:
capitals

In [None]:
pd.merge(
    countries, capitals, 
    left_on='Country', right_on='Name', 
    how='left'
)
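
The `countries` and `capitals` DataFrames have no column names in common, so the suffix behavior described above does not show up in this result. A small sketch of it, using two made-up DataFrames (`left_df`, `right_df`) that share a hypothetical `Value` column:

```python
import pandas as pd

# Hypothetical frames that share a non-key column name, `Value`
left_df = pd.DataFrame({'Country': ['Andorra', 'Denmark'], 'Value': [1, 2]})
right_df = pd.DataFrame({'Name': ['Andorra', 'Denmark'], 'Value': [10, 20]})

# By default pandas appends '_x' (left) and '_y' (right)
default = pd.merge(
    left_df, right_df,
    left_on='Country', right_on='Name',
    how='left'
)
print(default.columns.tolist())

# The `suffixes` argument swaps in more descriptive labels
labeled = pd.merge(
    left_df, right_df,
    left_on='Country', right_on='Name',
    how='left',
    suffixes=('_left', '_right')
)
print(labeled.columns.tolist())
```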

<br><br>

## Right Merge

* a new DataFrame object is created
<br><br>
* columns from both DataFrames are added to the result
    * if there are columns with identical names, a suffix will be added to each resulting column (e.g. `country_x`, `country_y`)
<br><br>
* **ALL rows from the right-hand side DataFrame are included**
<br><br>
* **left-hand side DataFrame rows are included ONLY IF they match**
    * for our example, the country names are the keys on which we match rows from the left-hand side DataFrame (`countries`) to rows in the right-hand side DataFrame (`capitals`)
    * we include data from the `countries` DataFrame only if the country name in those rows matches a country name in the `capitals` DataFrame    
<br><br>
* similar to a SQL right outer join

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/right_merge.png" width="400px" />

In [None]:
countries

In [None]:
capitals

In [None]:
pd.merge(
    countries, capitals, 
    left_on='Country', right_on='Name', 
    how='right'
)

<br><br>

## Inner Merge

* a new DataFrame object is created
<br><br>
* columns from both DataFrames are added to the result
    * if there are columns with identical names, a suffix will be added to each resulting column (e.g. `country_x`, `country_y`)
<br><br>
* **ONLY rows that have key values present in both DataFrames are included**
    * i.e. if we merge on country names, only rows that have country names that are present in **both** DataFrames are included
<br><br>
* similar to a SQL inner join

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/inner_merge.png" width="400px" />

In [None]:
countries

In [None]:
capitals

In [None]:
pd.merge(
    countries, capitals, 
    left_on='Country', right_on='Name', 
    how='inner'
)

<br><br>

## Outer Merge

* a new DataFrame object is created
<br><br>
* columns from both DataFrames are added to the result
    * if there are columns with identical names, a suffix will be added to each resulting column (e.g. `country_x`, `country_y`)
<br><br>
* **ALL rows from both DataFrames are included**
<br><br>
* similar to a SQL full outer join

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/outer_merge.png" width="400px" />

In [None]:
countries

In [None]:
capitals

In [None]:
pd.merge(
    countries, capitals, 
    left_on='Country', right_on='Name', 
    how='outer'
)
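
With an outer merge it is not always obvious which side a row came from. One way to check is the `indicator` argument of `pd.merge`, which adds a `_merge` column. The sketch below recreates the same `countries` and `capitals` data so that it runs on its own:

```python
import pandas as pd

countries = pd.DataFrame({
    'Letter': ['a', 'b', 'c', 'd', 'n', 'o'],
    'Country': ['Andorra', 'Belgium', 'Croatia', 'Denmark', 'Niger', 'Oman']})

capitals = pd.DataFrame({
    'Name': ['Andorra', 'Denmark', 'Spain', 'Portugal'],
    'Capital': ['Andorra la Vella', 'Copenhagen', 'Madrid', 'Lisbon']})

# indicator=True adds a `_merge` column with the values
# 'both', 'left_only' or 'right_only'
merged = pd.merge(
    countries, capitals,
    left_on='Country', right_on='Name',
    how='outer',
    indicator=True
)

print(merged[['Country', 'Name', '_merge']])
print(merged['_merge'].value_counts())
```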

<br><br><br>

## Merging using the row index

In [None]:
countries

In [None]:
capitals

In [None]:
# Approach 1: make the key column the row index of the
# right-hand DataFrame, then merge with right_index=True

capitals.set_index('Name', inplace=True)
capitals

In [None]:
pd.merge(
    countries, capitals, 
    left_on='Country', right_index=True, 
    how='outer'
)

<br>

In [None]:
countries

In [None]:
capitals = capitals.reset_index()
capitals

In [None]:
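# Approach 2: use the row index of both DataFrames.
# Both still have the default 0, 1, 2, ... index here, so rows
# are paired by position, not by country name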

pd.merge(
    countries, capitals, 
    left_index=True, right_index=True, 
    how='right'
)

**NOTE: `merge` DOES NOT modify any of the dataframes in place. You need to assign the result to a variable to store it for future use.**

In [None]:
countries

In [None]:
capitals

In [None]:
data = pd.merge(
    countries, capitals, 
    left_on='Country', right_on='Name', 
    how='outer'
)

data

<br><br><br><br>

## Merging on multiple columns

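* `merge` accepts a list of column names for `on`, `left_on` and `right_on`, so matching on several columns at once is supported
* What it cannot do directly is match several columns in one DataFrame against a single combined column in the other
* In that case, remember: you can always create "artificial" keys!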

In [None]:
df1 = pd.DataFrame({
    "first_name": ["James", "Janet", "Jamie"],
    "last_name": ["Bond", "Brown", "Baker"],
    "phone_numbers": ["555-1111", "555-1010", "555-0010"]
})

df1

In [None]:
df2 = pd.DataFrame({
    "name": ["James Bond", "Janet Brown", "Jamie Baker"],
    "id": ["007", "008", "009"]
})

df2

### How do we merge these DataFrames?

* Start by creating any required artificial keys

In [None]:
df1["first_name"] + " " + df1["last_name"]

In [None]:
df1["name"] = df1["first_name"] + " " + df1["last_name"]

In [None]:
df1

* And then merge away!

In [None]:
pd.merge(
    df1, df2, 
    left_on="name", right_on="name",
    how="inner"
)
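
As an alternative sketch, the single `name` column can be split into two columns, and the merge done on a list of key columns. The names `people` and `ids` are stand-ins for `df1` and `df2`, and the split assumes every name contains exactly one first name followed by a space:

```python
import pandas as pd

people = pd.DataFrame({
    "first_name": ["James", "Janet", "Jamie"],
    "last_name": ["Bond", "Brown", "Baker"],
    "phone_numbers": ["555-1111", "555-1010", "555-0010"]
})

ids = pd.DataFrame({
    "name": ["James Bond", "Janet Brown", "Jamie Baker"],
    "id": ["007", "008", "009"]
})

# Split "James Bond" into "James" and "Bond" (at the first space only)
ids[["first_name", "last_name"]] = ids["name"].str.split(" ", n=1, expand=True)

# Passing a list to `on` matches rows on both columns together
merged = pd.merge(
    people, ids,
    on=["first_name", "last_name"],
    how="inner"
)

print(merged)
```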

<br><br><br><br><br><br>

## Exercise

Write code to do the following:

* Create a dataframe, call it `customers`, that has two columns, `Customer ID` and `Customer Name`. Store the following data in this dataframe:
<table>
    <tr>
        <th>Customer ID</th>
        <th>Customer Name</th>
    </tr>
    <tr>
        <td>1234</td>
        <td>James</td>
    </tr>
    <tr>
        <td>3034</td>
        <td>Eileen</td>
    </tr>
</table>

* Create a dataframe, call it `orders`, that has two columns, `Reference ID` and `Amount`. Store the following data in this dataframe:

<table>
    <tr>
        <th>Reference ID</th>
        <th>Amount</th>
    </tr>
    <tr>
        <td>30-34</td>
        <td>20</td>
    </tr>
    <tr>
        <td>30-34</td>
        <td>17</td>
    </tr>
    <tr>
        <td>12-34</td>
        <td>12</td>
    </tr>
    <tr>
        <td>12-34</td>
        <td>27</td>
    </tr>
    <tr>
        <td>30-34</td>
        <td>21</td>
    </tr>
    <tr>
        <td>40-01</td>
        <td>25</td>
    </tr>
</table>

1. Merge the two dataframes so that, for each order, we can see the customer name. Store the result in `data`. Note that the `Reference ID` has an extra `-` that you will have to remove before you can do the merging.

2. What are all the purchases made by Eileen?

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

#### SOLUTION

In [None]:
customers = pd.DataFrame( {
    'Customer ID': [1234, 3034], 
    'Customer Name': ['James', 'Eileen'] 
} )

customers

In [None]:
orders = pd.DataFrame( {
    'Reference ID': ['30-34', '30-34', '12-34', 
                     '12-34', '30-34', '40-01'], 
    'Amount': [20, 17, 12, 27, 21, 25]
} )

orders

1. Merge the two dataframes so that, for each order, we can see the customer name. Store the result in `data`. Note that the `Reference ID` has an extra `-` that you will have to remove before you can do the merging.

In [None]:
orders['Customer ID'] = (
    orders['Reference ID']
    .str
    .replace('-', '')
    .astype('int')
)

In [None]:
orders

In [None]:
data = pd.merge(
    orders, customers, 
    left_on='Customer ID', right_on='Customer ID', 
    how='outer'
)

data

<br>

2. What are all the purchases made by Eileen?

In [None]:
data[ data['Customer Name'] == 'Eileen' ]

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Sorting a DataFrame

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv"
)

data.head()

We can use `sort_values()` to sort a dataframe
* sorting can be done in place (`inplace=True`)
* or can be achieved by creating a new (sorted) DataFrame object (`inplace=False`)

In [None]:
data.sort_values(
    by='Helpfulness', 
    ascending=False, 
    inplace=True
)

In [None]:
data.head()

In [None]:
data.sort_values(
    by=['Helpfulness', 'Courtesy'], 
    ascending=[True, False], 
    inplace=True
)

In [None]:
data.head()

In [None]:
data.tail()
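
The cells above sort in place. A short sketch of the other option, where `sort_values()` returns a new DataFrame and leaves the original untouched. The `scores` data is made up; only the column names are borrowed from the survey file:

```python
import pandas as pd

# Hypothetical scores (not the real survey data)
scores = pd.DataFrame({
    'Helpfulness': [3, 5, 3, 4],
    'Courtesy': [2, 1, 5, 4]
})

# Without inplace=True a new, sorted DataFrame is returned
ranked = scores.sort_values(
    by=['Helpfulness', 'Courtesy'],
    ascending=[True, False]
)

print(scores.index.tolist())  # original row order is unchanged
print(ranked.index.tolist())  # rows keep their original index labels

# ignore_index=True renumbers the sorted rows 0, 1, 2, ...
renumbered = scores.sort_values(
    by='Helpfulness',
    ascending=False,
    ignore_index=True
)

print(renumbered)
```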

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Dealing with missing data

Essentially, there are a few things we can do when we have missing (`NaN`) values:
    
* we can drop any row that has a missing value
* we can drop any column that has a missing value
* we can replace missing values with some default value

In [None]:
dt = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    parse_dates=['Date']
)

dt['Location'].head(10)

In [None]:
dt.shape

In [None]:
dt.info()

<br><br>

### Dropping rows that have missing data

In [None]:
dt.dropna(inplace=True)

dt['Location'].unique()

In [None]:
dt.head()

In [None]:
dt.shape
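
The list at the start of this section also mentions dropping columns, which `dropna()` supports through `axis=1`. A minimal sketch on a made-up DataFrame (`sample` is hypothetical, not the survey data), which also shows the `subset` argument for checking only selected columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two columns
sample = pd.DataFrame({
    'Location': ['store', np.nan, 'online'],
    'Score': [4, 5, 3],
    'Comment': [np.nan, np.nan, 'ok'],
})

# axis=1 drops every column that contains at least one NaN
complete_columns = sample.dropna(axis=1)
print(complete_columns.columns.tolist())

# subset= drops rows only when the listed columns are missing
has_location = sample.dropna(subset=['Location'])
print(has_location)
```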

<br><br>

### Fill `NaN` values with some default value

In [None]:
dt = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    parse_dates=['Date']
)

dt['Location'].head(10)

In [None]:
dt.shape

In [None]:
dt.info()

In [None]:
dt['Location']

In [None]:
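# Assign the result back to the column; `fillna(..., inplace=True)` on a
# selected column is not guaranteed to update `dt` in recent pandas versions
dt['Location'] = dt['Location'].fillna('store')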

dt['Location'].head(10)

In [None]:
dt.shape

In [None]:
dt.info()
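
When several columns need different defaults, `fillna()` also accepts a dictionary that maps column names to fill values. A minimal sketch on made-up data (`sample` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a text column and a numeric column
sample = pd.DataFrame({
    'Location': ['store', np.nan, 'online'],
    'Amount': [10.0, np.nan, 5.0]
})

# Each column gets its own default value
filled = sample.fillna({'Location': 'unknown', 'Amount': 0})

print(filled)
```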

<br><br>

### More generally, you can also `replace` a value with another value

In [None]:
dt = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    parse_dates=['Date']
)

dt['Location'].head(10)

In [None]:
# Assign the result back to the column rather than relying on inplace=True
dt['Location'] = dt['Location'].replace('online', 'store')

dt['Location'].head(10)

In [None]:
dt['Location'].unique()
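
`replace()` can also take a dictionary, which is handy when several values need to be remapped in one call. A short sketch on a made-up Series (the labels are hypothetical):

```python
import pandas as pd

# Hypothetical location labels
locations = pd.Series(['online', 'store', 'phone', 'online'])

# Keys are the values to find, values are their replacements;
# anything not listed (here 'store') is left as is
cleaned = locations.replace({'online': 'web', 'phone': 'call center'})

print(cleaned)
```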

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Exercise

For this exercise, we will use the following files:
    
* `survey_sample.csv`, available here: https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv 
<br><br>
* `survey_mappings.xlsx`, available here: https://edlitera-datasets.s3.amazonaws.com/survey_mappings.xlsx, which has two sheets, `Departments` and `Reps`
<br><br>
* `customer_ltv.xlsx`, available here: https://edlitera-datasets.s3.amazonaws.com/customer_ltv.xlsx

Import the files above into the following data frames:
* `data`, which stores the survey sample data (`survey_sample.csv`)
<br><br>
* `departments`, which stores a mapping of department codes to department names (the `Departments` sheet of the `survey_mappings.xlsx` file)
<br><br>
* `reps`, which stores a mapping of rep IDs to rep names (the `Reps` sheet of the `survey_mappings.xlsx` file)
<br><br>
* `ltv`, which stores our customers' life-time value (i.e. how much money they spent with us so far; the data is available in the `customer_ltv.xlsx` file)

Then, write code to perform the following actions:

**(1) look up the name of each of the reps and add it to the survey data (`data`)**

**(2) taking into account that the first two digits of the rep id code are the department code (the department where the customer representative works), add the Department name to the survey data as a new column**

**(3) add the life-time value to the survey data (`data`) as a new column and fill it in for customers who have purchased from our stores**

**(4) for customers that have not purchased from our stores, fill in the default life-time value of 0**

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

#### SOLUTION

#### Import the data

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv', 
    parse_dates=['Date']
)

data.head()

In [None]:
departments = pd.read_excel(
    'https://edlitera-datasets.s3.amazonaws.com/survey_mappings.xlsx',
    'Departments'
)

departments.head()

In [None]:
reps = pd.read_excel(
    'https://edlitera-datasets.s3.amazonaws.com/survey_mappings.xlsx',
    'Reps'
)

reps.head()

In [None]:
ltv = pd.read_excel('https://edlitera-datasets.s3.amazonaws.com/customer_ltv.xlsx')

ltv.head()

**(1) look up the name of each of the reps and add it to the survey data**

In [None]:
data.head(3)

In [None]:
reps.head(3)

In [None]:
data = pd.merge(
    data, reps, 
    left_on='Rep Id', right_on='Rep Id', 
    how='left'
)

data.head()

**(2) taking into account that the first two digits of the rep id code are the department code (the department where the customer representative works), add the Department name to the survey data as a new column**

In [None]:
data.head(3)

In [None]:
departments.head(3)

In [None]:
# First, let's create a column that stores
# only the first 2 digits of the Rep Id

data['Department Id'] = data['Rep Id'].astype(str).str[:2]
data.head()

In [None]:
# The `Code` column in the departments DataFrame
# is an int64, while the newly created `Department Id`
# column in the data DataFrame is an object. We know
# we'll need to merge on these columns, so we need to
# convert them to the same data type

data['Department Id'] = data['Department Id'].astype(departments['Code'].dtype)
data.info()

In [None]:
# And now we can do a regular merge

data = pd.merge(
    data, departments, 
    left_on='Department Id', right_on='Code',
    how='left'
)

data.head()

In [None]:
# The `Department Id` and `Code` columns contain
# the same data, so we can drop one of them
data.drop(columns=['Code'], inplace=True)
data.head()

**(3) add the life-time value as a new column and fill it in for customers who have purchased from us**

In [None]:
ltv.head(3)

In [None]:
data = pd.merge(
    data, ltv, 
    left_on='Customer Id', right_on='Customer Id', 
    how='left'
)

data.head()

**(4) for customers that have not purchased from our stores, fill in the default life-time value of 0**

In [None]:
data['Amount'].isnull()

In [None]:
data[ data['Amount'].isnull() ]

In [None]:
data[ data['Amount'].isnull() ].shape

In [None]:
# Assign the result back to the column rather than relying on inplace=True
data['Amount'] = data['Amount'].fillna(0)

data.head()

In [None]:
data.tail()

In [None]:
data[ data['Amount'].isnull() ]

<br><br><br><br><br><br>

## How to set a custom DataFrame index

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv', 
    parse_dates=['Date']
)

data.head()

#### We can use the `set_index()` method to set a custom index:

In [None]:
data.set_index('Date')

In [None]:
data.head()

<br><br>

#### The `set_index()` method has an optional `inplace` argument we can use to modify DataFrames in place:

In [None]:
data.set_index('Date', inplace=True)

In [None]:
data.head()

In [None]:
data.index

<br><br>

#### You can create a multi-index by using several columns

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv', 
    parse_dates=['Date']
)

data.set_index( ['Date', 'Location'] )

In [None]:
data.set_index(['Date', 'Location']).index

<br><br>

#### You can actually avoid removing the column(s) you designate as the row index

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv', 
    parse_dates=['Date']
)

data.set_index('Date', drop=False, inplace=True)

In [None]:
data.head()
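
A custom index can be undone with `reset_index()`, which moves the index back into a regular column and restores the default 0, 1, 2, ... labels. A minimal sketch on a made-up DataFrame (`sample` and its dates are hypothetical):

```python
import pandas as pd

# Hypothetical data with a date column
sample = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'Score': [4, 5]
})

# `Date` becomes the row index and is removed from the columns
indexed = sample.set_index('Date')
print(indexed.columns.tolist())

# reset_index() turns the index back into a `Date` column
restored = indexed.reset_index()
print(restored.columns.tolist())
```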

<br><br><br><br><br><br>

## Concatenating multiple DataFrame objects

#### Use cases:

* appending rows to existing DataFrames
    * e.g. when we have to combine multiple files
<br><br>    
* appending columns from one DataFrame (or file) to another DataFrame    
<br><br>
* achieved using the `pd.concat` function
    * https://pandas.pydata.org/docs/reference/api/pandas.concat.html
    * works for both `Series` and `DataFrame` objects
    
<br><br>    
#### Difference between `merge` and `concat`
* `merge` combines DataFrames based on shared values
* `concat` appends rows / columns from a DataFrame to another DataFrame
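
A small sketch of that difference, using two made-up DataFrames (`left_df`, `right_df`) whose `key` values appear in a different order:

```python
import pandas as pd

# Hypothetical data: same keys, different row order
left_df = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right_df = pd.DataFrame({'key': ['b', 'a'], 'y': [20, 10]})

# merge lines rows up by the values in `key`
merged = pd.merge(left_df, right_df, on='key')
print(merged)

# concat only aligns on index labels, so row 0 is placed
# next to row 0 regardless of what `key` contains
side_by_side = pd.concat([left_df, right_df], axis=1)
print(side_by_side)
```

In `merged`, the `y` value for key `a` is 10; in `side_by_side`, the first row pairs key `a` with the `y` value that belongs to key `b`.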

<br><br><br><br><br><br>

## Concatenating DataFrames
* very useful when combining multiple files, etc.

In [None]:
df1 = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['a', 'b', 'c', 'd', 'e']
})

df1

In [None]:
df2 = pd.DataFrame({
    'col1': [10, 20, 30, 40, 50],
    'col3': ['Andorra', 'Belgium', 'Canada', 'Denmark', 'Estonia']
})

df2

<br><br>

#### To understand `concat`, we must understand the concept of `axes`

<img src="https://edlitera-images.s3.amazonaws.com/dataframe_axis.png" width="450px">

<br><br>

#### Two ways to concatenate DataFrames

<img src="https://edlitera-images.s3.amazonaws.com/concat_axes.png" width="500px"/>

<br><br>

### We can append the rows of one DataFrame to another DataFrame

In [None]:
pd.concat( [df1, df2],  axis=0)

**NOTE:** By default, `axis=0`

In [None]:
pd.concat([df1, df2])

In [None]:
df2

<br><br>

### We can append the columns of one DataFrame to another DataFrame

In [None]:
data = pd.concat([df1, df2], axis=1)
data

In [None]:
data.columns

In [None]:
data.loc[:, 'col1']

**NOTE:** Generally, column names should be unique. Duplicate column names can cause confusion, resulting in code bugs.
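
A short sketch of how duplicates can be detected and avoided. The DataFrames `first` and `second` are shortened stand-ins for `df1` and `df2`, and the new column name `col1_second` is an arbitrary choice:

```python
import pandas as pd

first = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})
second = pd.DataFrame({'col1': [10, 20], 'col3': ['Andorra', 'Belgium']})

combined = pd.concat([first, second], axis=1)

# Flags every column name that has already appeared earlier
print(combined.columns.duplicated())

# Selecting a duplicated name returns a DataFrame, not a Series
print(type(combined['col1']))

# One option: rename the clashing column before concatenating
unique = pd.concat(
    [first, second.rename(columns={'col1': 'col1_second'})],
    axis=1
)
print(unique.columns.tolist())
```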

<br><br><br><br><br><br>

### When concatenating DataFrames, we can identify the data source

In [None]:
df1

In [None]:
df2

In [None]:
data = pd.concat(
    [df1, df2], 
    keys=['source1', 'source2']
)

data

In [None]:
data.index

In [None]:
data = pd.concat(
    [df1, df2], 
    keys=['source1', 'source2'],
    axis=1
)

data

In [None]:
data.columns
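
Once the sources are labeled, the labels can be used to pull one source back out of the combined DataFrame. A minimal sketch with shortened stand-ins for `df1` and `df2` (named `first` and `second` here):

```python
import pandas as pd

first = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})
second = pd.DataFrame({'col1': [10, 20], 'col3': ['Andorra', 'Belgium']})

combined = pd.concat([first, second], keys=['source1', 'source2'])

# The row index now has two levels: the source label and the original label
print(combined.index.nlevels)

# .loc with a source label returns only the rows from that source
from_first = combined.loc['source1']
print(from_first)
```

The rows selected this way still contain every column of the combined DataFrame, so `col3` is present (and empty) for the rows that came from `first`.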

<br><br><br><br><br><br>

### We can choose to reset the index labels when concatenating DataFrame objects

In [None]:
df1

In [None]:
df2

In [None]:
pd.concat([df1, df2])

In [None]:
pd.concat([df1, df2], ignore_index=True)

<br><br><br><br><br><br>

### When concatenating DataFrames, we can choose to only keep the overlapping columns

In [None]:
df1

In [None]:
df2

In [None]:
pd.concat([df1, df2])

In [None]:
# join='inner' keeps only the columns present in both DataFrames (here `col1`)
pd.concat([df1, df2], join='inner')

<br><br><br><br><br><br>

### When concatenating DataFrames, we can choose to only keep the rows that have matching index labels

In [None]:
df1 = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['a', 'b', 'c', 'd', 'e']}, 
    index=[1, 2, 3, 7, 8])

df2 = pd.DataFrame({
    'col1': [10, 20, 30, 40, 50], 
    'col3': ['Andorra', 'Belgium', 'Canada', 'Denmark', 'Estonia']}, 
    index=[1, 2, 3, 20, 21])

In [None]:
df1

In [None]:
df2

In [None]:
pd.concat([df1, df2], axis=1)

In [None]:
pd.concat([df1, df2], join='inner', axis=1)