## 1.4 Combining Data from Multiple Tables


In [4]:
import pandas as pd

## Introduction

Welcome to the "Combining Data from Multiple Tables" lesson! I am excited to introduce you to this topic, which lets you work with data from multiple sources to strengthen your storytelling and uncover insights that no single table can provide. Practicing a few key concepts here will give you a solid foundation for combining tables. In this Jupyter notebook, we will explore how to merge data from multiple tables using the pandas library in Python.

## What does it mean to combine tables, and why would we want to do it?
We now define what it means to combine tables and provide motivation for doing so. First, consider what a table is. The following is an example of a table:

In [6]:
# Larger product table
product_data = {
    'Product ID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Product Name': ['Laptop', 'Smartphone', 'Tablet', 'Smartwatch', 'Laptop', 'Smartphone', 'Tablet', 'Smartwatch'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics'],
    'Brand': ['Brand A', 'Brand A', 'Brand A', 'Brand A', 'Brand B', 'Brand B', 'Brand B', 'Brand B']
}

product_df = pd.DataFrame(product_data)
print("Product Table:")
display(product_df)

Product Table:


|   | Product ID | Product Name | Category | Brand |
|---|---|---|---|---|
| 0 | 101 | Laptop | Electronics | Brand A |
| 1 | 102 | Smartphone | Electronics | Brand A |
| 2 | 103 | Tablet | Electronics | Brand A |
| 3 | 104 | Smartwatch | Electronics | Brand A |
| 4 | 105 | Laptop | Electronics | Brand B |
| 5 | 106 | Smartphone | Electronics | Brand B |
| 6 | 107 | Tablet | Electronics | Brand B |
| 7 | 108 | Smartwatch | Electronics | Brand B |


Generally speaking, in a table, each row provides information about a separate entity. In the table above, the entities are products. Each column contains a different attribute of the entity. In the example above, those attributes are Product ID, Product Name, Category, and Brand. A pandas DataFrame also contains an index for each row, which you can see on the left side of the table.

We *combine tables* when we integrate the information in one table with the information in another to obtain a larger dataset or a dataset with a larger capacity for insights. Suppose we had another table with the following information:


In [7]:
# Separate table with product ID and cost
cost_data = {
    'Product ID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Cost': [800, 500, 300, 250, 850, 550, 350, 275]
}

cost_df = pd.DataFrame(cost_data)
print("\nCost Table:")
display(cost_df)


Cost Table:


|   | Product ID | Cost |
|---|---|---|
| 0 | 101 | 800 |
| 1 | 102 | 500 |
| 2 | 103 | 300 |
| 3 | 104 | 250 |
| 4 | 105 | 850 |
| 5 | 106 | 550 |
| 6 | 107 | 350 |
| 7 | 108 | 275 |


Alone, the first table does not provide much opportunity for insight. It lets us plot a histogram of product types and see which brands offer which products, but little else. By combining the first table with the second table, we can unlock deeper insights. What are some additional forms of analysis we could do by combining the first and second tables?
*   Determine the average cost of each product. E.g. the average cost of a smartphone.
*   Determine the % difference in cost between the average Brand A smartphone and the average Brand B smartphone.
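
Both analyses listed above can be sketched in a few lines once the tables are combined. The snippet below is an illustration rather than the only way to do it: it rebuilds the two tables with the same values shown earlier (omitting the Category column for brevity), combines them with `pd.merge` (introduced in the next section), and defines the percentage difference relative to the Brand A cost, which is an assumption on our part.

```python
import pandas as pd

# Same values as the product and cost tables above (Category omitted)
product_df = pd.DataFrame({
    'Product ID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Product Name': ['Laptop', 'Smartphone', 'Tablet', 'Smartwatch'] * 2,
    'Brand': ['Brand A'] * 4 + ['Brand B'] * 4,
})
cost_df = pd.DataFrame({
    'Product ID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Cost': [800, 500, 300, 250, 850, 550, 350, 275],
})

# Combine the two tables on their shared 'Product ID' column
combined = pd.merge(product_df, cost_df, on='Product ID')

# Average cost of each product type across brands
avg_cost = combined.groupby('Product Name')['Cost'].mean()
print(avg_cost)

# Percentage difference between Brand B and Brand A smartphones,
# measured relative to the Brand A cost (one of several possible definitions)
phones = combined[combined['Product Name'] == 'Smartphone']
by_brand = phones.groupby('Brand')['Cost'].mean()
pct_diff = (by_brand['Brand B'] - by_brand['Brand A']) / by_brand['Brand A'] * 100
print(f"Brand B smartphones cost {pct_diff:.1f}% more than Brand A")
```

Neither calculation is possible from the product table or the cost table alone, because each one needs both the product attributes and the cost in the same row.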




## Core Technical Tools

Depending on the programming language you use, there are different technical tools for storing tables and combining them. In this lesson, we store data in pandas DataFrames, and we use the pandas merge function to combine them.

Below, we outline the pd.merge function and how it works. The description includes the parameters that let you make sure data is merged properly. It's not necessary to understand everything below immediately. For now, you can just skim over it to get a basic idea of how the function works. Afterward, we will go through some examples to give you a better sense of how to think through merging tables.

pd.merge() is a function in the pandas library used for merging DataFrames in Python. It combines the data from two DataFrames based on a common column or index, allowing you to join and analyze data from different sources. The function has several parameters that control the merging behavior:



*   left: The first (left) DataFrame to be merged.
*   right: The second (right) DataFrame to be merged.
*   how: The type of merge to perform. The default is 'inner'. The options 'left', 'right', 'outer', and 'inner' correspond, respectively, to a left outer join (keys from the left DataFrame), a right outer join (keys from the right DataFrame), a full outer join (the union of keys from both DataFrames), and an inner join (the intersection of keys from both DataFrames).
*   on: The column (or list of columns) used to join the DataFrames. Both DataFrames must have the specified column(s). If no key is specified at all, the function uses the columns that have the same names in both DataFrames.
*   left_on: The column(s) in the left DataFrame to use as the merge key(s). Use it together with right_on (or right_index) when the key columns have different names in the two DataFrames. It is an alternative to the on parameter; pandas raises an error if on is combined with left_on or right_on.
*   right_on: The column(s) in the right DataFrame to use as the merge key(s). Use it together with left_on (or left_index), as an alternative to the on parameter.
*   left_index: If True, use the index from the left DataFrame as the merge key(s). The default is False.
*   right_index: If True, use the index from the right DataFrame as the merge key(s). The default is False.
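
To make the key-related parameters more concrete, here is a small sketch. The `costs` table with a key column named `product_id` is a hypothetical variant of the cost table, invented only to show what happens when the key columns are named differently. The values themselves come from the tables above.

```python
import pandas as pd

products = pd.DataFrame({
    'Product ID': [101, 102, 103],
    'Product Name': ['Laptop', 'Smartphone', 'Tablet'],
})
# Hypothetical variant of the cost table whose key column is named 'product_id'
costs = pd.DataFrame({
    'product_id': [101, 102, 103],
    'Cost': [800, 500, 300],
})

# The key columns have different names, so name each one explicitly
by_columns = pd.merge(products, costs, left_on='Product ID', right_on='product_id')
# Both key columns are kept in the result; drop the redundant one
by_columns = by_columns.drop(columns='product_id')

# Alternatively, match a column on the left against the index on the right
costs_indexed = costs.set_index('product_id')
by_index = pd.merge(products, costs_indexed, left_on='Product ID', right_index=True)

print(by_columns)
print(by_index)
```

Both approaches produce the same three columns here. Which one is more convenient usually depends on whether the key already lives in the index of one of the DataFrames.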










#### Below, we use the pd.merge function to merge the DataFrames above:

In [8]:
pd.merge(product_df, cost_df, on='Product ID')

|   | Product ID | Product Name | Category | Brand | Cost |
|---|---|---|---|---|---|
| 0 | 101 | Laptop | Electronics | Brand A | 800 |
| 1 | 102 | Smartphone | Electronics | Brand A | 500 |
| 2 | 103 | Tablet | Electronics | Brand A | 300 |
| 3 | 104 | Smartwatch | Electronics | Brand A | 250 |
| 4 | 105 | Laptop | Electronics | Brand B | 850 |
| 5 | 106 | Smartphone | Electronics | Brand B | 550 |
| 6 | 107 | Tablet | Electronics | Brand B | 350 |
| 7 | 108 | Smartwatch | Electronics | Brand B | 275 |


Before merging tables, it helps to first carefully envision what kind of table we want to produce. By understanding the information that you want to capture in the resulting table, you can better understand how to perform the merge.

When we perform a merge operation, we use a certain column as a basis for merging. In the above example, we used Product ID. Notice that both tables had all the same product IDs available: both had 101 through 108. Suppose that one table had only 101 through 106. What would we do then? This is why we need to consider the 'how' parameter of the merge function. In the examples below, we illustrate how different kinds of joins can allow us to work with situations where we have missing data in a DataFrame.
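
As a quick preview of why `how` matters, the sketch below plays out the scenario just described. The name `partial_cost_df` is ours, chosen for illustration. It holds the cost table from earlier cut down to IDs 101 through 106.

```python
import pandas as pd

product_df = pd.DataFrame({
    'Product ID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Product Name': ['Laptop', 'Smartphone', 'Tablet', 'Smartwatch'] * 2,
})
# Cost table truncated to IDs 101 through 106, as in the scenario above
partial_cost_df = pd.DataFrame({
    'Product ID': [101, 102, 103, 104, 105, 106],
    'Cost': [800, 500, 300, 250, 850, 550],
})

# Default how='inner': products 107 and 108 silently disappear
inner = pd.merge(product_df, partial_cost_df, on='Product ID')
print(len(inner))  # 6 rows

# how='left': every product is kept; 107 and 108 get NaN in 'Cost'
left = pd.merge(product_df, partial_cost_df, on='Product ID', how='left')
print(left.tail(3))
```

The default inner merge does not warn about the two dropped products, so comparing row counts before and after a merge is a cheap sanity check.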

Imagine that we have a dataset listing all of the cities affected by a recent disease outbreak, along with their city IDs. Suppose we also have a dataset with populations corresponding to several city IDs, but this population DataFrame does not include every city ID affected by the outbreak. Now suppose we want to create a table with each city and its population for a publication, and we want to make sure that every affected city appears in the table. We accomplish this as follows:

In [11]:
iranian_cities = pd.DataFrame({'city_id': [1, 2, 3], 'city_name': ['Tehran', 'Mashhad', 'Isfahan']})
city_populations = pd.DataFrame({'city_id': [1, 2], 'population': [9172195, 3203253]})

merged = pd.merge(left = iranian_cities, right = city_populations, on='city_id', how='left')
merged

|   | city_id | city_name | population |
|---|---|---|---|
| 0 | 1 | Tehran | 9172195.0 |
| 1 | 2 | Mashhad | 3203253.0 |
| 2 | 3 | Isfahan | NaN |


The merge() function takes all the rows from the left DataFrame (iranian_cities) and attempts to match them with the corresponding rows in the right DataFrame (city_populations) based on the city_id column. If a matching row is found in the right DataFrame, its information is included in the resulting DataFrame. If no matching row is found, the columns that come from the right DataFrame are filled with NaN in the result. This is also why the population column is displayed with decimals: NaN is a floating-point value, so pandas converts the whole column to floats.
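
After a left merge, it is often useful to find out which rows went unmatched. The following is a minimal sketch using the same two DataFrames. The conversion to pandas' nullable `Int64` dtype is optional, and is shown only as one way to keep populations as whole numbers.

```python
import pandas as pd

iranian_cities = pd.DataFrame({'city_id': [1, 2, 3], 'city_name': ['Tehran', 'Mashhad', 'Isfahan']})
city_populations = pd.DataFrame({'city_id': [1, 2], 'population': [9172195, 3203253]})
merged = pd.merge(iranian_cities, city_populations, on='city_id', how='left')

# Cities that found no match in city_populations
missing = merged[merged['population'].isna()]
print(missing['city_name'].tolist())  # ['Isfahan']

# NaN forces the column to float; the nullable integer dtype
# keeps whole numbers and shows the gap as <NA>
merged['population'] = merged['population'].astype('Int64')
print(merged)
```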

In short, specifying how='left' ensures that all the cities in the iranian_cities DataFrame are present in the resulting DataFrame, regardless of whether there is a corresponding row in the city_populations DataFrame. If we had specified how='right', then all the city IDs in the city_populations DataFrame would have been included in the resulting DataFrame, and any cities in iranian_cities without a corresponding row in city_populations would have been excluded:

In [16]:
right_merged = pd.merge(iranian_cities, city_populations, on='city_id', how='right')
right_merged

|   | city_id | city_name | population |
|---|---|---|---|
| 0 | 1 | Tehran | 9172195 |
| 1 | 2 | Mashhad | 3203253 |


In the above, we used a left merge and contrasted it with a right merge. For completeness, note that there are other types of merges. We define inner and outer merge here.

Inner Merge:
An inner merge (or inner join) between two dataframes returns only the matching rows between the two dataframes, based on the specified common columns. In other words, only the rows with matching values in both dataframes are included in the merged dataframe.


Outer Merge:
An outer merge (or outer join) returns all the rows from both dataframes. Rows that match on the specified common columns are combined into one row, and rows without a match are still included, with NaN filling the columns that come from the other dataframe. pd.merge itself has no fill-value option, but you can replace the NaN values afterward, for example with fillna().
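
The difference between the two is easiest to see when each table has a row the other lacks. The sketch below uses subsets of the product and cost tables from earlier that only partly overlap. The subsets are our own choice for illustration. It also passes `indicator=True`, a pd.merge option that adds a `_merge` column recording whether each row came from the left table, the right table, or both.

```python
import pandas as pd

# Subsets of the product and cost tables that only partly overlap
products = pd.DataFrame({
    'Product ID': [101, 102, 103],
    'Product Name': ['Laptop', 'Smartphone', 'Tablet'],
})
costs = pd.DataFrame({
    'Product ID': [102, 103, 104],
    'Cost': [500, 300, 250],
})

# Inner merge: only IDs present in both tables (102 and 103)
inner = pd.merge(products, costs, on='Product ID', how='inner')

# Outer merge: every ID from either table (101 through 104);
# indicator=True adds a '_merge' column recording where each row came from
outer = pd.merge(products, costs, on='Product ID', how='outer', indicator=True)

print(inner)
print(outer)
```

In the outer result, product 101 has NaN for Cost and product 104 has NaN for Product Name, while the inner result contains neither of those rows.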

#### Going Deeper

When we combine DataFrames, paying attention to certain details can help ensure that they are merged correctly. In the above, an important detail was that one DataFrame did not contain all the cities included in the other. In the example below, we notice some other important details to pay attention to.

In [10]:
import pandas as pd

#Data not necessarily correct
iran_covid_data = {
    'Province': ['Tehran', 'Isfahan', 'Mazandaran', 'Khorasan Razavi', 'Fars', 'Gilan', 'Kerman', 'Khuzestan', 'Kermanshah', 'East Azerbaijan', 'Qom', 'Semnan', 'Golestan', 'West Azerbaijan', 'Markazi', 'Lorestan', 'Hormozgan', 'Yazd', 'Chaharmahal and Bakhtiari', 'Kohgiluyeh and Boyer-Ahmad', 'Ilam', 'Bushehr', 'North Khorasan', 'South Khorasan', 'Sistan and Baluchestan', 'Ardebil'],
    'Deaths': [22824, 5392, 5302, 4969, 4851, 3496, 3016, 2963, 2823, 2536, 2213, 2059, 2044, 2027, 2004, 1958, 1624, 1607, 1202, 996, 971, 961, 836, 778, 765, 762]
}

iran_covid_df = pd.DataFrame.from_dict(iran_covid_data)

In [None]:


#Data not necessarily correct
gdp_data = {
    'province_name': ['Alborz', 'Ardabil', 'Bushehr', 'Chaharmahal and Bakhtiari', 'East Azerbaijan', 'Fars', 'Gilan', 'Golestan', 'Hamadan', 'Hormozgan', 'Ilam', 'Isfahan', 'Kerman', 'Kermanshah', 'Khuzestan', 'Kohgiluyeh and Boyer-Ahmad', 'Kurdistan', 'Lorestan', 'Markazi', 'Mazandaran', 'North Khorasan', 'Qazvin', 'Qom', 'Razavi Khorasan', 'Semnan', 'Sistan and Baluchestan', 'South Khorasan', 'Tehran', 'West Azerbaijan', 'Yazd', 'Zanjan'],
    'Abbreviation': ['AL', 'AR', 'BU', 'CB', 'EA', 'FA', 'GN', 'GO', 'HA', 'HO', 'IL', 'IS', 'KN', 'KE', 'KH', 'KB', 'KU', 'LO', 'MA', 'MN', 'NK', 'QA', 'QM', 'RK', 'SE', 'SB', 'SK', 'TE', 'WA', 'YA', 'ZA'],
    'Area (km2)': [5833, 17800, 22743, 16332, 45650, 122608, 14042, 20195, 19368, 70669, 20133, 107029, 183285, 24998, 64055, 15504, 29137, 28294, 29130, 23701, 28434, 15549, 11526, 118884, 97491, 180726, 151913, 18814, 37437, 76469, 21773],
    'GDP per capita': [12500, 9000, 10000, 6000, 7500, 7500, 7500, 6000, 6000, 8000, 6000, 7000, 9000, 8000, 6000, 6000, 6000, 8000, 6000, 8000, 6000, 7000, 7000, 6000, 7500, 6000, 5000, 6000, 8000, 8000, 8000],
    'Population (2023)': [2730000, 1284000, 1174000, 973000, 3925000, 4904000, 2546000, 1893000, 1756000, 1806000, 591000, 5136000, 3184000, 2003000, 4725000, 728000, 1614000, 1784000, 1436000, 3302000, 868000, 1284000, 1300000, 6444000, 715000, 2777000, 786000, 13323000, 3278000, 1156000, 1103000]
}

gdp_df = pd.DataFrame(gdp_data)
print(gdp_df)

In the above, we have two DataFrames, iran_covid_df and gdp_df. We want to merge the two DataFrames so that we can observe the relationship between GDP per capita and COVID deaths per capita. To accomplish this, we first note the following:
*   To merge the two DataFrames, we can use the province name column in each. However, that column has a different name in each DataFrame ('Province' in one, 'province_name' in the other). To address this, we can rename the column in one of the DataFrames to match the other, or pass both names to the merge function using left_on and right_on.
*   These data were assembled from a source of mediocre quality, so we cannot be certain that the provinces in one DataFrame match the provinces in the other. To address this, we can use an outer join, which keeps all of the rows from both DataFrames even when a province has no match in the other. Where there is no match, the columns that come from the other DataFrame are filled with NaN (Not a Number), which shows us exactly which provinces lack which kind of data.
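
The two points above can be sketched as follows. To keep the example short and self-contained, it uses only a handful of rows copied from the two tables, under the shortened names `covid` and `gdp`, which are ours. The per-100,000 scaling is also our choice, not part of the original data. Note that the source tables spell one province two ways ('Khorasan Razavi' and 'Razavi Khorasan'), which is the kind of mismatch an outer join brings to the surface.

```python
import pandas as pd

# A few rows copied from the two tables above (figures not necessarily correct)
covid = pd.DataFrame({
    'Province': ['Tehran', 'Isfahan', 'Khorasan Razavi', 'Qom'],
    'Deaths': [22824, 5392, 4969, 2213],
})
gdp = pd.DataFrame({
    'province_name': ['Alborz', 'Isfahan', 'Qom', 'Razavi Khorasan', 'Tehran'],
    'GDP per capita': [12500, 7000, 7000, 6000, 6000],
    'Population (2023)': [2730000, 5136000, 1300000, 6444000, 13323000],
})

# Give the key column the same name in both tables
gdp = gdp.rename(columns={'province_name': 'Province'})

# Outer join keeps unmatched rows; '_merge' shows which side each came from
merged = pd.merge(covid, gdp, on='Province', how='outer', indicator=True)
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['Province', '_merge']])

# For matched rows, compute deaths per 100,000 residents
matched = merged[merged['_merge'] == 'both'].copy()
matched['Deaths per 100k'] = matched['Deaths'] / matched['Population (2023)'] * 1e5
print(matched[['Province', 'GDP per capita', 'Deaths per 100k']])
```

Here the unmatched rows include both spellings of the same province, so the names would need to be reconciled by hand before the merge. The full tables contain at least one more such pair ('Ardebil' and 'Ardabil').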




#### Using ChatGPT to assist with merging tables

To help you get started with merging DataFrames, you can leverage the latest AI technology. If you can clearly state how each of your tables is structured, including the titles of its columns, you can ask ChatGPT to write code to merge the two DataFrames. Below are some examples of prompts that would help you do this.

1.   "I have two DataFrames: one containing information about Iranian cities (CityID, CityName, Province) and another containing population data for those cities (CityID, Population). The first dataframe is called iranian_cities, the second called city_populations. How do I merge these DataFrames using the CityID column? Please ask any necessary clarifying questions."
2.   "Suppose I have a DataFrame with Iranian historical sites and their locations (SiteName, CityName) and another DataFrame with city names and province names (CityName, Province). The first dataframe is called historical_sites, the second called city_populations.  How do I merge these DataFrames to create a single DataFrame containing the historical site, city name, and province? Please ask any necessary clarifying questions."
3.   "I have two DataFrames: one with information about Iranian universities (UniversityID, UniversityName, City) and another with their international rankings (UniversityID, Ranking). The first dataframe is called iranian_universities, the second called international_rankings. How can I merge these DataFrames to create a combined DataFrame with university information and their rankings? Suppose that we do not have the ranking for all universities. Please ask any necessary clarifying questions."
4.   "Let's say I have a DataFrame with the names of Iranian national parks and their area in square kilometers (ParkName, Area) and another DataFrame with the names of national parks and their number of visitors per year (ParkName, VisitorsPerYear). The first dataframe is called national_parks, the second is called park_visitors. How do I merge these DataFrames to get a single DataFrame with park names, area, and number of visitors per year? Please ask any necessary clarifying questions."


You can also ask ChatGPT when you are running into difficulties with merging DataFrames. If you merge two DataFrames and are not getting what you expected, you can tell ChatGPT what the frames consist of, what you did to merge them, and what the output was, and it could provide a diagnosis of what went wrong as well as a possible fix.

#### Parting Notes
I hope you enjoyed this lesson on combining tables. It has introduced the basic concepts: what it means to combine tables, how pd.merge works, and how the different kinds of joins handle missing data. With this foundation, and with tools such as ChatGPT to help you solidify and extend it, you will be able to handle similar tasks in the future.