# Intro to Pandas
by Ryan Orsinger

## Module 5: Combining Dataframes
- Using `pd.concat` to combine dataframes vertically or horizontally
- Intro to joining dataframes together like database tables
- Understanding different types of joins
- Using `.merge` to join dataframes together based on column values in common

In [1]:
import pandas as pd

In [2]:
# String concatenation
"con" + "cat" + "e" + "nation"

'concatenation'

In [3]:
# List concatenation
["con", "cat"] + ["e", "nation"]

['con', 'cat', 'e', 'nation']

In [4]:
# Dataframe Concatenation 
fruits = pd.DataFrame({
    "name": ["mango", "guava", "orange"],
    "quantity": [2, 1, 3]
})

vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [1, 7, 4]
})

In [5]:
# Default arguments preserve the original index for each dataframe
pd.concat([fruits, vegetables])

Unnamed: 0,name,quantity
0,mango,2
1,guava,1
2,orange,3
0,Brussels sprouts,1
1,spinach,7
2,broccoli,4


In [6]:
# Axis=0 is the default argument for concatenating dataframes
# This is vertical concatenation, since we're adding row-wise
pd.concat([fruits, vegetables], axis=0)

Unnamed: 0,name,quantity
0,mango,2
1,guava,1
2,orange,3
0,Brussels sprouts,1
1,spinach,7
2,broccoli,4


In [7]:
pd.concat([fruits, vegetables], ignore_index=True)

Unnamed: 0,name,quantity
0,mango,2
1,guava,1
2,orange,3
3,Brussels sprouts,1
4,spinach,7
5,broccoli,4
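
The difference between keeping the original indexes and passing `ignore_index=True` matters once you start selecting rows by label. The sketch below rebuilds the same `fruits` and `vegetables` dataframes and looks up label `0` both ways. The variable names `stacked` and `renumbered` are only illustrative. `verify_integrity=True` is an optional `pd.concat` argument that raises an error instead of producing duplicate labels.

```python
import pandas as pd

fruits = pd.DataFrame({
    "name": ["mango", "guava", "orange"],
    "quantity": [2, 1, 3],
})
vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [1, 7, 4],
})

# Default: each dataframe keeps its own index, so label 0 appears twice
stacked = pd.concat([fruits, vegetables])
rows_at_zero = stacked.loc[0]      # a 2-row DataFrame (mango and Brussels sprouts)

# ignore_index=True: a fresh 0..5 index, so every label is unique
renumbered = pd.concat([fruits, vegetables], ignore_index=True)
row_at_zero = renumbered.loc[0]    # a single row, returned as a Series

print(rows_at_zero)
print(row_at_zero)

# verify_integrity=True makes concat raise instead of duplicating labels
try:
    pd.concat([fruits, vegetables], verify_integrity=True)
except ValueError as err:
    print("concat refused:", err)
```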


In [8]:
# Dataframe Concatenation 
fruits = pd.DataFrame({
    "name": ["mango", "guava", "orange"],
})

# Notice that this instance of fruits (above) lacks a quantity column; vegetables still has one
vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [2, 3, 4]
})

# If a column is missing from one dataframe, the concatenation still succeeds
# The rows from that dataframe get NaN in the missing column, so quantity becomes a float column
pd.concat([fruits, vegetables])

Unnamed: 0,name,quantity
0,mango,
1,guava,
2,orange,
0,Brussels sprouts,2.0
1,spinach,3.0
2,broccoli,4.0
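
If you would rather drop columns that are not shared than fill them with missing values, `pd.concat` accepts a `join` argument. The sketch below compares the default (`join="outer"`) with `join="inner"` on the same two dataframes. The names `outer` and `inner` are only illustrative.

```python
import pandas as pd

fruits = pd.DataFrame({
    "name": ["mango", "guava", "orange"],
})
vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [2, 3, 4],
})

# join="outer" (the default): keep the union of columns, fill gaps with NaN
outer = pd.concat([fruits, vegetables], ignore_index=True)

# join="inner": keep only the columns present in every dataframe
inner = pd.concat([fruits, vegetables], join="inner", ignore_index=True)

print(outer.columns.tolist())
print(inner.columns.tolist())
print(outer["quantity"].isna().sum(), "rows have no quantity")
```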


In [9]:
# Axis=1 concatenates dataframes horizontally
# This is a column-wise concatenation
price_quality = pd.DataFrame({
    "price": [2.99, 1.99, 3.99],
    "presentation": ["frozen", "washed", "raw, bunch"] 
})

pd.concat([vegetables, price_quality], axis=1)

Unnamed: 0,name,quantity,price,presentation
0,Brussels sprouts,2,2.99,frozen
1,spinach,3,1.99,washed
2,broccoli,4,3.99,"raw, bunch"
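
Horizontal concatenation lines rows up by index label, not by position. The example above works because both dataframes share the index 0, 1, 2. The sketch below uses a hypothetical `prices` dataframe whose index starts at 1 to show what happens when the labels differ, and one possible fix (resetting the index first).

```python
import pandas as pd

vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [2, 3, 4],
})

# Hypothetical price table whose index labels are 1, 2, 3 instead of 0, 1, 2
prices = pd.DataFrame({"price": [2.99, 1.99, 3.99]}, index=[1, 2, 3])

# Rows are matched by label, so the result has 4 rows and some NaN values
misaligned = pd.concat([vegetables, prices], axis=1)

# Resetting the index makes the rows line up by position again
aligned = pd.concat([vegetables, prices.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```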


In [10]:
# concat can combine an arbitrary number of dataframes
# This can be helpful if you have many dataframes from multiple sources
pd.concat([vegetables, vegetables, vegetables, vegetables])

Unnamed: 0,name,quantity
0,Brussels sprouts,2
1,spinach,3
2,broccoli,4
0,Brussels sprouts,2
1,spinach,3
2,broccoli,4
0,Brussels sprouts,2
1,spinach,3
2,broccoli,4
0,Brussels sprouts,2
1,spinach,3
2,broccoli,4
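
When you stack dataframes from several sources, it is often useful to remember where each row came from. One way to do that is the `keys` argument of `pd.concat`, sketched below. The labels `"fruit"` and `"vegetable"` and the column name `kind` are made up for this illustration.

```python
import pandas as pd

fruits = pd.DataFrame({
    "name": ["mango", "guava", "orange"],
    "quantity": [2, 1, 3],
})
vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [2, 3, 4],
})

# keys adds an outer index level naming the source of each row
labeled = pd.concat([fruits, vegetables], keys=["fruit", "vegetable"])

# Select every row that came from one source
only_vegetables = labeled.loc["vegetable"]

# Turn the outer index level into an ordinary column called "kind"
flat = (
    labeled
    .reset_index(level=0)
    .rename(columns={"level_0": "kind"})
    .reset_index(drop=True)
)

print(labeled)
print(flat)
```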


## Using `.merge` to combine dataframes on common column values
- Database style join for Pandas Dataframes
- Pandas `.join` joins dataframes on their index by default (its `on` argument matches a column of the calling dataframe against the index of the other)
- Using `.merge` can be more flexible, since sometimes the column names are not identical
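
To make the comparison between `.join` and `.merge` concrete, here is a small sketch that produces the same result both ways. It uses a trimmed-down, hypothetical `users` dataframe with no missing values, so it is an illustration rather than the dataset used later in this module.

```python
import pandas as pd

# Trimmed-down, hypothetical data for illustration
users = pd.DataFrame({
    "name": ["bob", "mary", "sally"],
    "role_id": [1, 2, 3],
})
roles = pd.DataFrame({
    "role_id": [1, 2, 3, 4],
    "role": ["admin", "author", "reviewer", "commenter"],
})

# .join matches the caller's role_id column against the *index* of the other
# dataframe, so roles needs role_id as its index first. The default is a left join.
via_join = users.join(roles.set_index("role_id"), on="role_id")

# .merge matches column to column, so no index juggling is needed
via_merge = users.merge(roles, on="role_id", how="left")

print(via_join)
print(via_merge)
```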

## Types of Joins
- "Inner" returns records that have matching values in both tables.
- "Left" returns all records from the left table, and the matched records from the right table.
- "Right" returns all records from the right table, and the matched records from the left table.
- "Outer" returns all records from both tables, matching records where possible and filling in missing values where there is no match.
![diagram of different types of joins](types_of_joins.png)
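
The four join types can be compared side by side on a tiny example. The `left` and `right` dataframes below are made up for this sketch. `indicator=True` is an optional `.merge` argument that adds a `_merge` column saying whether each row came from the left dataframe only, the right dataframe only, or both.

```python
import pandas as pd

# Made-up tables: keys 2 and 3 exist on both sides
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Collect which keys survive each type of join
keys_by_join = {}
for how in ["inner", "left", "right", "outer"]:
    result = left.merge(right, on="key", how=how)
    keys_by_join[how] = sorted(result["key"].tolist())
    print(how, keys_by_join[how])

# indicator=True records where each row of the result came from
outer = left.merge(right, on="key", how="outer", indicator=True)
print(outer)
```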

In [11]:
# Notice how role_id points to the role_id on the roles dataframe
# Take note of the missing data
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'name': ['bob', 'mary', 'sally', 'adam', 'jane', 'mike'],
    'role_id': [1, 2, 3, 3, None, None]
})

users

Unnamed: 0,user_id,name,role_id
0,1,bob,1.0
1,2,mary,2.0
2,3,sally,3.0
3,4,adam,3.0
4,5,jane,
5,6,mike,
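
The missing values are also why `role_id` prints as `1.0` instead of `1`: an ordinary integer column cannot hold missing values, so pandas stores the column as floats. The sketch below shows the dtype and one optional alternative, the nullable `"Int64"` dtype. The name `users_nullable` is only illustrative, and the rest of this module keeps the default float column.

```python
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "mary", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, None, None],
})

# None forces the column to float64, with NaN for the missing entries
print(users["role_id"].dtype)

# The nullable Int64 dtype keeps whole numbers and marks missing entries as <NA>
users_nullable = users.astype({"role_id": "Int64"})
print(users_nullable["role_id"])

# Find the users who have no role at all
no_role = users[users["role_id"].isna()]
print(no_role["name"].tolist())
```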


In [12]:
# Notice that the role id column is also called "role_id" on the roles dataframe
roles = pd.DataFrame({
    'role_id': [1, 2, 3, 4],
    'role': ['admin', 'author', 'reviewer', 'commenter']
})

roles

Unnamed: 0,role_id,role
0,1,admin
1,2,author
2,3,reviewer
3,4,commenter


In [13]:
# An inner join returns only the rows whose key exists in both dataframes
users.merge(roles, left_on='role_id', right_on='role_id', how='inner')

Unnamed: 0,user_id,name,role_id,role
0,1,bob,1.0,admin
1,2,mary,2.0,author
2,3,sally,3.0,reviewer
3,4,adam,3.0,reviewer


In [14]:
# If the same exact column name exists on both dataframes, we can use the "on" argument
users.merge(roles, on='role_id', how='inner')

Unnamed: 0,user_id,name,role_id,role
0,1,bob,1.0,admin
1,2,mary,2.0,author
2,3,sally,3.0,reviewer
3,4,adam,3.0,reviewer


In [15]:
# Notice that the left join keeps all records from the users dataframe, even if they are missing on the right dataframe
users.merge(roles, on='role_id', how='left')

Unnamed: 0,user_id,name,role_id,role
0,1,bob,1.0,admin
1,2,mary,2.0,author
2,3,sally,3.0,reviewer
3,4,adam,3.0,reviewer
4,5,jane,,
5,6,mike,,


In [16]:
# Notice that the right join keeps all records from the roles dataframe, even if they are missing on the left dataframe
users.merge(roles, left_on='role_id', right_on='role_id', how='right')

Unnamed: 0,user_id,name,role_id,role
0,1.0,bob,1.0,admin
1,2.0,mary,2.0,author
2,3.0,sally,3.0,reviewer
3,4.0,adam,3.0,reviewer
4,,,4.0,commenter


In [17]:
# The outer join keeps all records from both dataframes, matching rows where the key exists on both sides
# Rows without a match are kept too, with nulls in the columns that come from the other dataframe
users.merge(roles, on='role_id', how='outer')

Unnamed: 0,user_id,name,role_id,role
0,1.0,bob,1.0,admin
1,2.0,mary,2.0,author
2,3.0,sally,3.0,reviewer
3,4.0,adam,3.0,reviewer
4,5.0,jane,,
5,6.0,mike,,
6,,,4.0,commenter


In [18]:
# Relationship between dataframe order and join type 
# Consider the result of starting with users and left joining roles
users.merge(roles, on="role_id", how='left')

Unnamed: 0,user_id,name,role_id,role
0,1,bob,1.0,admin
1,2,mary,2.0,author
2,3,sally,3.0,reviewer
3,4,adam,3.0,reviewer
4,5,jane,,
5,6,mike,,


In [19]:
# Compare to starting with roles and using right join with users
roles.merge(users, on="role_id", how='right')

Unnamed: 0,role_id,role,user_id,name
0,1.0,admin,1,bob
1,2.0,author,2,mary
2,3.0,reviewer,3,sally
3,3.0,reviewer,4,adam
4,,,5,jane
5,,,6,mike


## Additional Resources
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
- https://pandas.pydata.org/docs/user_guide/merging.html
- https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join

## Exercises
- Read "2020_sales.csv", "2021_sales.csv", and "2022_sales.csv" into dataframes, then concatenate these 3 dataframes vertically.
- Create a `posts` dataframe of the following information. 
```
[
    {
        "author_id": 1,
        "title": "How I Learned Python"
    },
    {
        "author_id": 2,
        "title": "How I Learned to Stop Worrying and Love Pandas"
    },
    {
        "author_id": 2,
        "title": "Quick Tutorial on Installing Anaconda"
    },
    {
        "author_id": 9,
        "title": "Learning Pandas If You Already Work With Spreadsheets"
    }
]
```
- Perform an inner join of `users` and `posts`. *Hint:* Think about what data these two dataframes share in common.
- Start with `users` then left join the `posts` dataframe.
- Start with `users` then right join the `posts` dataframe.
- Finally, perform an outer join of `users` and `posts`.

In [20]:
# Read "2020_sales.csv", "2021_sales.csv", and "2022_sales.csv" into dataframes
sales_2020 = pd.read_csv("../datasets/2020_sales.csv")
sales_2021 = pd.read_csv("../datasets/2021_sales.csv")
sales_2022 = pd.read_csv("../datasets/2022_sales.csv")

# Concatenate these 3 dataframes together, vertically
# ignore_index=True gives the combined dataframe a fresh index
sales = pd.concat([sales_2020, sales_2021, sales_2022], ignore_index=True)

In [21]:
# Create a `posts` dataframe of the above blog post data
posts = pd.DataFrame([
    {
        "author_id": 1,
        "title": "How I Learned Python"
    },
    {
        "author_id": 2,
        "title": "How I Learned to Stop Worrying and Love Pandas"
    },
    {
        "author_id": 2,
        "title": "Quick Tutorial on Installing Anaconda"
    },
    {
        "author_id": 9,
        "title": "Learning Pandas If You Already Work With Spreadsheets"
    }
])
posts

Unnamed: 0,author_id,title
0,1,How I Learned Python
1,2,How I Learned to Stop Worrying and Love Pandas
2,2,Quick Tutorial on Installing Anaconda
3,9,Learning Pandas If You Already Work With Sprea...


In [22]:
# Perform an inner join of `users` and `posts`. 
# Hint: Think about what data these two dataframes share in common.
users.merge(posts, left_on='user_id', right_on='author_id', how='inner')

Unnamed: 0,user_id,name,role_id,author_id,title
0,1,bob,1.0,1,How I Learned Python
1,2,mary,2.0,2,How I Learned to Stop Worrying and Love Pandas
2,2,mary,2.0,2,Quick Tutorial on Installing Anaconda


In [23]:
# Start with `users` then left join the `posts` dataframe
users.merge(posts, left_on='user_id', right_on='author_id', how='left')

Unnamed: 0,user_id,name,role_id,author_id,title
0,1,bob,1.0,1.0,How I Learned Python
1,2,mary,2.0,2.0,How I Learned to Stop Worrying and Love Pandas
2,2,mary,2.0,2.0,Quick Tutorial on Installing Anaconda
3,3,sally,3.0,,
4,4,adam,3.0,,
5,5,jane,,,
6,6,mike,,,


In [24]:
# Start with `users` then right join the `posts` dataframe
users.merge(posts, left_on='user_id', right_on='author_id', how='right')

Unnamed: 0,user_id,name,role_id,author_id,title
0,1.0,bob,1.0,1,How I Learned Python
1,2.0,mary,2.0,2,How I Learned to Stop Worrying and Love Pandas
2,2.0,mary,2.0,2,Quick Tutorial on Installing Anaconda
3,,,,9,Learning Pandas If You Already Work With Sprea...


In [25]:
# Finally, perform an outer join of `users` and `posts`
users.merge(posts, left_on='user_id', right_on='author_id', how='outer')

Unnamed: 0,user_id,name,role_id,author_id,title
0,1.0,bob,1.0,1.0,How I Learned Python
1,2.0,mary,2.0,2.0,How I Learned to Stop Worrying and Love Pandas
2,2.0,mary,2.0,2.0,Quick Tutorial on Installing Anaconda
3,3.0,sally,3.0,,
4,4.0,adam,3.0,,
5,5.0,jane,,,
6,6.0,mike,,,
7,,,,9.0,Learning Pandas If You Already Work With Sprea...
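
The outer join also answers two follow-up questions: which users have no posts, and which posts have no matching user. One way to pull those rows out is the optional `indicator=True` argument, which adds a `_merge` column to the result. This is a sketch of one possible approach. The names `users_without_posts` and `orphan_posts` are only illustrative.

```python
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "mary", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, None, None],
})
posts = pd.DataFrame({
    "author_id": [1, 2, 2, 9],
    "title": [
        "How I Learned Python",
        "How I Learned to Stop Worrying and Love Pandas",
        "Quick Tutorial on Installing Anaconda",
        "Learning Pandas If You Already Work With Spreadsheets",
    ],
})

# _merge is "left_only", "right_only", or "both" for each row
outer = users.merge(
    posts, left_on="user_id", right_on="author_id",
    how="outer", indicator=True,
)

# Users that matched no post
users_without_posts = outer.loc[outer["_merge"] == "left_only", "name"].tolist()

# Posts whose author_id matched no user
orphan_posts = outer.loc[outer["_merge"] == "right_only", "title"].tolist()

print(users_without_posts)
print(orphan_posts)
```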
