# **Lecture 8C**
# **Combining DataFrames**


In this part, we combine existing DataFrames to produce new DataFrames.

In [1]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# We also need the pandas module in this lecture
# Import the pandas module
import pandas as pd

---
**Example 1:** Concatenating 2 DataFrames with the same columns but different data. If the two DataFrames have n1 and n2 rows, the result is a new DataFrame with (n1+n2) rows. The function to use is **pd.concat()**.
* The syntax is **newdf = pd.concat(*list_of_DataFrames*, *ignore_index=True*, *axis=0*)**.
* ***list_of_DataFrames*** is a list of DataFrames; it can contain more than 2.
* ***ignore_index=True*** means that we will ignore the row indices from the original DataFrames and recreate new indices starting from 0.
* ***axis=0*** means that we are concatenating DataFrames in the *row* dimension.
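
The effect of ***ignore_index*** is easiest to see by running the same concatenation with and without it. The sketch below uses two small made-up DataFrames (`first` and `second` are illustrative names, not the lecture's data files).

```python
import pandas as pd

# Two small, made-up DataFrames with identical columns
first = pd.DataFrame({"CustNo": [1, 2], "Name": ["Nancy", "Tom"]})
second = pd.DataFrame({"CustNo": [6, 7, 8], "Name": ["Janice", "Jacky", "William"]})

# Without ignore_index the original row labels are kept, so 0 and 1 repeat
kept = pd.concat([first, second], axis=0)

# With ignore_index=True the rows are renumbered from 0
renumbered = pd.concat([first, second], ignore_index=True, axis=0)

print(list(kept.index))        # [0, 1, 0, 1, 2]
print(list(renumbered.index))  # [0, 1, 2, 3, 4]
print(len(renumbered) == len(first) + len(second))  # True
```

Repeated row labels are legal in pandas, but they make label-based selection such as `kept.loc[0]` return more than one row, which is why renumbering is usually preferred here.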

In [3]:
# Read customer1.xlsx data file
# Note that there are 2 worksheets in there, sheet1 and sheet2.
cust1 = pd.read_excel("/content/drive/MyDrive/Data/customer1.xlsx",sheet_name="sheet1")
cust2 = pd.read_excel("/content/drive/MyDrive/Data/customer1.xlsx",sheet_name="sheet2")
print("First DataFrame:")
display(cust1)
print()
print("Second DataFrame:")
display(cust2)

# Concatenate cust1 and cust2 and store the result in the customer DataFrame
customer = pd.concat([cust1,cust2],ignore_index=True ,axis=0)
print()
print("Combined DataFrame:")
display(customer)

First DataFrame:


Unnamed: 0,CustNo,Name,Gender,NumPurchases,IsVIP
0,1,Nancy,F,5,True
1,2,Tom,M,3,False
2,3,Stephen,M,2,True
3,4,David,M,1,False
4,5,Mary,F,7,False



Second DataFrame:


Unnamed: 0,CustNo,Name,Gender,NumPurchases,IsVIP
0,6,Janice,F,3,True
1,7,Jacky,M,4,False
2,8,William,M,1,True



Combined DataFrame:


Unnamed: 0,CustNo,Name,Gender,NumPurchases,IsVIP
0,1,Nancy,F,5,True
1,2,Tom,M,3,False
2,3,Stephen,M,2,True
3,4,David,M,1,False
4,5,Mary,F,7,False
5,6,Janice,F,3,True
6,7,Jacky,M,4,False
7,8,William,M,1,True


---
**Example 2:** Concatenating 2 DataFrames with the same number of rows but different columns. We assume that the row indices in the 2 DataFrames are the same, because **pd.concat()** matches rows by their index when joining side by side.
* The syntax is **newdf = pd.concat(*list_of_DataFrames*, *axis=1*)**.
* ***list_of_DataFrames*** is a list of DataFrames; it can contain more than 2.
* ***ignore_index*** is normally left out here. With ***axis=1***, ***ignore_index=True*** would discard the column names (not the row indices) and relabel the columns 0, 1, 2, and so on.
* ***axis=1*** means that we are concatenating DataFrames in the *column* dimension.
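
As a rough illustration of why ***ignore_index=True*** is usually not wanted with ***axis=1***, the sketch below joins two made-up DataFrames side by side both ways. The names `left_part` and `right_part` are hypothetical.

```python
import pandas as pd

# Two made-up DataFrames with the same row index (0, 1, 2) but different columns
left_part = pd.DataFrame({"CustNo": [1, 2, 3], "Name": ["Nancy", "Tom", "Stephen"]})
right_part = pd.DataFrame({"NumPurchases": [5, 3, 2], "AmtPurchases": [560, 740, 120]})

# Side-by-side join: column names are preserved
side_by_side = pd.concat([left_part, right_part], axis=1)

# With ignore_index=True the labels along the column axis are thrown away
relabelled = pd.concat([left_part, right_part], ignore_index=True, axis=1)

print(list(side_by_side.columns))  # ['CustNo', 'Name', 'NumPurchases', 'AmtPurchases']
print(list(relabelled.columns))    # [0, 1, 2, 3]
```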

In [4]:
# import Pandas module
import pandas as pd

# Read customer2.xlsx data file
# Note that there are 2 worksheets in there, sheet1 and sheet2.
cust1 = pd.read_excel("/content/drive/MyDrive/Data/customer2.xlsx",sheet_name="sheet1")
cust2 = pd.read_excel("/content/drive/MyDrive/Data/customer2.xlsx",sheet_name="sheet2")
print("First DataFrame:")
display(cust1)
print()
print("Second DataFrame:")
display(cust2)

# Joining 2 DataFrames side-by-side
customer = pd.concat([cust1,cust2],axis=1)
print("Combined DataFrame:")
display(customer)

First DataFrame:


Unnamed: 0,CustNo,Name,Gender,IsVIP
0,1.0,Nancy,F,True
1,2.0,Tom,M,False
2,3.0,Stephen,M,True
3,4.0,David,M,False
4,5.0,Mary,F,False



Second DataFrame:


Unnamed: 0,NumPurchases,AmtPurchases
0,5.0,560.0
1,3.0,740.0
2,2.0,120.0
3,1.0,100.0
4,7.0,340.0


Combined DataFrame:


Unnamed: 0,CustNo,Name,Gender,IsVIP,NumPurchases,AmtPurchases
0,1.0,Nancy,F,True,5.0,560.0
1,2.0,Tom,M,False,3.0,740.0
2,3.0,Stephen,M,True,2.0,120.0
3,4.0,David,M,False,1.0,100.0
4,5.0,Mary,F,False,7.0,340.0


---
**Example 3:** Examples 1 and 2 showed the ideal situation, where the 2 DataFrames to be concatenated are "compatible". If the DataFrames are "not compatible", the result may not be what you want.<br>
In this example, the 2 DataFrames being concatenated do not have the same number of rows, and they have one column in common. Let's see what is produced when we concatenate them.

In [5]:
# import Pandas module
import pandas as pd

# Read customer3.xlsx data file
# Note that there are 2 worksheets in there, sheet1 and sheet2.
cust1 = pd.read_excel("/content/drive/MyDrive/Data/customer3.xlsx",sheet_name="sheet1")
cust2 = pd.read_excel("/content/drive/MyDrive/Data/customer3.xlsx",sheet_name="sheet2")
print("First DataFrame:")
display(cust1)
print()
print("Second DataFrame:")
display(cust2)

First DataFrame:


Unnamed: 0,CustNo,Name,Gender,IsVIP
0,1,Nancy,F,True
1,2,Tom,M,False
2,3,Stephen,M,True
3,4,David,M,False
4,5,Mary,F,False
5,6,Bobby,M,True
6,7,Susan,F,False



Second DataFrame:


Unnamed: 0,IsVIP,NumPurchases,AmtPurchases
0,False,5,560
1,True,3,740
2,False,2,120
3,True,1,100
4,False,7,340


In [6]:
# Appending cust2 to the end of cust1
customer = pd.concat([cust1,cust2],ignore_index=True,axis=0)
print("Combined DataFrame:")
#drop=customer.dropna()
display(customer)

Combined DataFrame:


Unnamed: 0,CustNo,Name,Gender,IsVIP,NumPurchases,AmtPurchases
0,1.0,Nancy,F,True,,
1,2.0,Tom,M,False,,
2,3.0,Stephen,M,True,,
3,4.0,David,M,False,,
4,5.0,Mary,F,False,,
5,6.0,Bobby,M,True,,
6,7.0,Susan,F,False,,
7,,,,False,5.0,560.0
8,,,,True,3.0,740.0
9,,,,False,2.0,120.0


In [7]:
# Joining 2 DataFrames side-by-side
customer = pd.concat([cust1,cust2],axis=1)
print("Combined DataFrame:")
display(customer)

Combined DataFrame:


Unnamed: 0,CustNo,Name,Gender,IsVIP,IsVIP.1,NumPurchases,AmtPurchases
0,1,Nancy,F,True,False,5.0,560.0
1,2,Tom,M,False,True,3.0,740.0
2,3,Stephen,M,True,False,2.0,120.0
3,4,David,M,False,True,1.0,100.0
4,5,Mary,F,False,False,7.0,340.0
5,6,Bobby,M,True,,,
6,7,Susan,F,False,,,


---
**Example 4:** Combining DataFrames by ***left join***. We will call the 2 DataFrames to be combined the ***left DataFrame*** and the ***right DataFrame***. Both DataFrames have a common ***key variable***, which is used to match the rows in the 2 DataFrames. For example, Student ID and Customer ID are typical key variables used for joining DataFrames.<br>

A typical situation when left join is used:

* There are 2 DataFrames (Left and Right).
* They have a common *key* column for matching the rows.
* There are no restrictions on the left DataFrame, other than the required key column.
* For the right DataFrame, each value in the key column corresponds to exactly 1 row. (i.e. you cannot have 2 rows with the same key!)
* Typically, the right DataFrame has rows for all possible values of the key variable. (It acts as a lookup table.)
* Left join returns all columns from both DataFrames and all rows from the left DataFrame. A left row whose key has no match in the right DataFrame gets **NaN** in the columns that come from the right. If a non-key column name is used in both DataFrames, the two columns are renamed automatically (by default with the suffixes **_x** and **_y**).

The merge is done by using the syntax **pd.merge(*left*, *right*,how=*merge_type*,left_on=*left_key_column*, right_on=*right_key_column*)**
* **pd.merge()** is the function for performing the merge. Different options produce different types of merge.
* ***left*** is the left DataFrame.
* ***right*** is the right DataFrame.
* **how=*merge_type*** is for us to specify the type of the merge. To use left join, we will need **how="left"**.
* **left_on=*left_key_column*** indicates which column in the left DataFrame is the key.
* **right_on=*right_key_column*** indicates which column in the right DataFrame is the key.
* Usually the key columns in the left and right DataFrames have the same name, but **pd.merge()** allows different key column names in the two DataFrames.
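
The sketch below shows a left join where the key columns have different names, using made-up data (`orders`, `lookup` and the column `CustID` are hypothetical). It also passes the optional ***validate="many_to_one"*** argument of **pd.merge()**, which asks pandas to check that keys in the right DataFrame are unique.

```python
import pandas as pd

# Left DataFrame: keys may repeat, and key 9 has no entry in the lookup table
orders = pd.DataFrame({"CustID": [1, 3, 3, 9], "Item": ["pen", "ink", "pad", "cap"]})

# Right DataFrame: a lookup table with exactly one row per key
lookup = pd.DataFrame({"CustNo": [1, 2, 3], "Name": ["John", "Peter", "Mary"]})

# Key columns have different names, so left_on and right_on differ
result = pd.merge(orders, lookup, how="left",
                  left_on="CustID", right_on="CustNo",
                  validate="many_to_one")

print(list(result.columns))      # ['CustID', 'Item', 'CustNo', 'Name']
print(len(result))               # 4
print(result["Name"].isna().sum())  # 1
```

When the key names differ, both key columns appear in the output. If the lookup table contained a repeated key, the ***validate*** check would raise an error instead of silently duplicating rows.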


In [8]:
# Read joindata1.xlsx data file
# Note that there are 2 worksheets in there, part1 and part2.
df_left = pd.read_excel("/content/drive/MyDrive/Data/joindata1.xlsx",sheet_name="part1")
df_right = pd.read_excel("/content/drive/MyDrive/Data/joindata1.xlsx",sheet_name="part2")
print("Left DataFrame:")
display(df_left)
print()
print("Right DataFrame:")
display(df_right)

# Doing a left join
df_result = pd.merge(df_left,df_right, how="left", left_on="CustNo", right_on="CustNo")
print()
print("After left join:")
display(df_result)

Left DataFrame:


Unnamed: 0,CustNo,Name,Gender
0,1.0,John,M
1,3.0,Mary,F
2,2.0,Peter,M
3,6.0,David,M



Right DataFrame:


Unnamed: 0,CustNo,Purchases,isVIP
0,1.0,1000.0,Yes
1,2.0,2100.0,No
2,3.0,1500.0,No
3,4.0,500.0,Yes
4,5.0,2400.0,No
5,6.0,1300.0,Yes



After left join:


Unnamed: 0,CustNo,Name,Gender,Purchases,isVIP
0,1.0,John,M,1000.0,Yes
1,3.0,Mary,F,1500.0,No
2,2.0,Peter,M,2100.0,No
3,6.0,David,M,1300.0,Yes


---
**Example 5:** Combining DataFrames by ***right join***. Left join and right join are essentially the same: you simply reverse the roles of the two DataFrames. A right join keeps all rows of the right DataFrame, and the only difference from the equivalent left join is the ordering of the columns in the output DataFrame.
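
A quick way to convince yourself of this equivalence is to run both joins on the same data and compare. The sketch below uses made-up DataFrames (`purchases` and `names` are hypothetical names).

```python
import pandas as pd

purchases = pd.DataFrame({"CustNo": [1, 2, 3, 4], "Purchases": [1000, 2100, 1500, 500]})
names = pd.DataFrame({"CustNo": [1, 3, 2], "Name": ["John", "Mary", "Peter"]})

# Right join: keep every row of the right DataFrame (names)
via_right = pd.merge(purchases, names, how="right", left_on="CustNo", right_on="CustNo")

# Left join with the two DataFrames swapped: keep every row of names
via_left = pd.merge(names, purchases, how="left", left_on="CustNo", right_on="CustNo")

print(list(via_right.columns))  # ['CustNo', 'Purchases', 'Name']
print(list(via_left.columns))   # ['CustNo', 'Name', 'Purchases']

# After putting the columns in the same order, the rows are identical
same_rows = via_right[via_left.columns].to_dict("records") == via_left.to_dict("records")
print(same_rows)  # True
```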

In [None]:
# Read joindata1.xlsx data file
# Note that there are 2 worksheets in there, part1 and part2.
# Here part1 plays the role of the right DataFrame and part2 the left.
df_right = pd.read_excel("/content/drive/MyDrive/Data/joindata1.xlsx",sheet_name="part1")
df_left = pd.read_excel("/content/drive/MyDrive/Data/joindata1.xlsx",sheet_name="part2")
print("Left DataFrame:")
display(df_left)
print()
print("Right DataFrame:")
display(df_right)

# Doing a right join
df_result = pd.merge(df_left,df_right, how="right", left_on="CustNo", right_on="CustNo")
print()
print("After right join:")
display(df_result)

Left DataFrame:


Unnamed: 0,CustNo,Purchases,isVIP
0,1.0,1000.0,Yes
1,2.0,2100.0,No
2,3.0,1500.0,No
3,4.0,500.0,Yes
4,5.0,2400.0,No
5,6.0,1300.0,Yes



Right DataFrame:


Unnamed: 0,CustNo,Name,Gender
0,1.0,John,M
1,3.0,Mary,F
2,2.0,Peter,M
3,6.0,David,M



After right join:


Unnamed: 0,CustNo,Purchases,isVIP,Name,Gender
0,1.0,1000.0,Yes,John,M
1,3.0,1500.0,No,Mary,F
2,2.0,2100.0,No,Peter,M
3,6.0,1300.0,Yes,David,M


---
**Example 6**: Combining two DataFrames by ***inner join***. We match the rows in the two DataFrames by using the key column. However, ***inner join*** only returns rows whose key is present in both DataFrames.<br>
Typical situation when inner join is used:
* "Usually" keys in both left and right DataFrames are unique. That is, each key appears in exactly one row.
* After the join, the resulting DataFrame will contain rows with keys present in both DataFrames.
* The syntax is similar to left join and right join. We only need to change the option to **how="inner"**.
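
To see why unique keys are the usual assumption, the sketch below runs an inner join where one key is repeated in the left DataFrame. The data and the names `left` and `right` are made up for illustration.

```python
import pandas as pd

# CustNo 3 appears twice on the left; CustNo 7 appears only on the right
left = pd.DataFrame({"CustNo": [3, 3, 6], "Purchases": [1000, 250, 500]})
right = pd.DataFrame({"CustNo": [3, 6, 7], "Name": ["Mary", "David", "Susan"]})

inner = pd.merge(left, right, how="inner", left_on="CustNo", right_on="CustNo")

# The repeated key produces one output row per matching pair,
# and the unmatched key 7 is dropped
print(sorted(inner["CustNo"].tolist()))  # [3, 3, 6]
print(len(inner))                        # 3
```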

In [None]:
# Read joindata2.xlsx data file
# Note that there are 2 worksheets in there, part1 and part2.
df_right = pd.read_excel("/content/drive/MyDrive/Data/joindata2.xlsx",sheet_name="part1")
df_left = pd.read_excel("/content/drive/MyDrive/Data/joindata2.xlsx",sheet_name="part2")
print("Left DataFrame:")
display(df_left)
print()
print("Right DataFrame:")
display(df_right)

# Doing an inner join
df_result = pd.merge(df_left,df_right, how="inner", left_on="CustNo", right_on="CustNo")
print()
print("After inner join:")
display(df_result)

Left DataFrame:


Unnamed: 0,CustNo,Purchases,isVIP
0,3,1000,Yes
1,4,2100,No
2,5,1500,No
3,6,500,Yes
4,7,2400,No
5,8,820,No



Right DataFrame:


Unnamed: 0,CustNo,Name,Gender
0,1,John,M
1,2,Peter,M
2,3,Mary,F
3,6,David,M
4,7,Susan,F



After inner join:


Unnamed: 0,CustNo,Purchases,isVIP,Name,Gender
0,3,1000,Yes,Mary,F
1,6,500,Yes,David,M
2,7,2400,No,Susan,F


---
**Example 7:** Combining two DataFrames by ***full outer join***. 
* Full outer join will return all keys that exist in either DataFrame.
* To use full outer join, we need to use the **how="outer"** argument in **pd.merge()**.
* If a key exists in only one DataFrame, the columns that come from the other DataFrame will have **NaN** values.
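
When inspecting a full outer join it can help to know which side each row came from. The sketch below uses the optional ***indicator=True*** argument of **pd.merge()**, which adds a `_merge` column; the data is made up and the names `left` and `right` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"CustNo": [3, 4, 6], "Purchases": [1000, 2100, 500]})
right = pd.DataFrame({"CustNo": [1, 3, 6], "Name": ["John", "Mary", "David"]})

outer = pd.merge(left, right, how="outer",
                 left_on="CustNo", right_on="CustNo", indicator=True)

# _merge is "both", "left_only" or "right_only" for each row
n_both = int((outer["_merge"] == "both").sum())
n_left_only = int((outer["_merge"] == "left_only").sum())
n_right_only = int((outer["_merge"] == "right_only").sum())

print(n_both, n_left_only, n_right_only)  # 2 1 1
print(sorted(outer["CustNo"].tolist()))   # [1, 3, 4, 6]
```

The row order of an outer join can differ between pandas versions, so the sketch sorts the keys before printing rather than relying on the order.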


In [None]:
# Read joindata2.xlsx data file
# Note that there are 2 worksheets in there, part1 and part2.
df_right = pd.read_excel("/content/drive/MyDrive/Data/joindata2.xlsx",sheet_name="part1")
df_left = pd.read_excel("/content/drive/MyDrive/Data/joindata2.xlsx",sheet_name="part2")
print("Left DataFrame:")
display(df_left)
print()
print("Right DataFrame:")
display(df_right)

# Doing full outer join
df_result = pd.merge(df_left,df_right, how="outer", left_on="CustNo", right_on="CustNo")
print()
print("After full outer join:")
display(df_result)

Left DataFrame:


Unnamed: 0,CustNo,Purchases,isVIP
0,3,1000,Yes
1,4,2100,No
2,5,1500,No
3,6,500,Yes
4,7,2400,No
5,8,820,No



Right DataFrame:


Unnamed: 0,CustNo,Name,Gender
0,1,John,M
1,2,Peter,M
2,3,Mary,F
3,6,David,M
4,7,Susan,F



After full outer join:


Unnamed: 0,CustNo,Purchases,isVIP,Name,Gender
0,3,1000.0,Yes,Mary,F
1,4,2100.0,No,,
2,5,1500.0,No,,
3,6,500.0,Yes,David,M
4,7,2400.0,No,Susan,F
5,8,820.0,No,,
6,1,,,John,M
7,2,,,Peter,M


---
**Example 8:** Combining two DataFrames by **cross join**.
* **Cross join** does not require a key column; it returns all combinations of rows from the two DataFrames. If the two DataFrames have n1 and n2 rows, the output DataFrame has n1 × n2 rows.
* To use **cross join**, we pass the argument **how="cross"** when calling **pd.merge()**.
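
The row count can be checked directly. The sketch below uses small made-up DataFrames (`left` and `right` are hypothetical names) and assumes a pandas version that supports **how="cross"** (1.2 or later).

```python
import pandas as pd

left = pd.DataFrame({"CustNo": [3, 4], "Purchases": [1000, 2100]})
right = pd.DataFrame({"CustNo": [1, 2, 3], "Name": ["John", "Peter", "Mary"]})

# No key is given: every left row is paired with every right row
pairs = pd.merge(left, right, how="cross")

print(len(pairs) == len(left) * len(right))  # True
# CustNo exists on both sides, so pandas adds the default suffixes
print(list(pairs.columns))  # ['CustNo_x', 'Purchases', 'CustNo_y', 'Name']
```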

In [None]:
# Read joindata2.xlsx data file
# Note that there are 2 worksheets in there, part1 and part2.
df_right = pd.read_excel("/content/drive/MyDrive/Data/joindata2.xlsx",sheet_name="part1")
df_left = pd.read_excel("/content/drive/MyDrive/Data/joindata2.xlsx",sheet_name="part2")
print("Left DataFrame:")
display(df_left)
print()
print("Right DataFrame:")
display(df_right)

# Doing a cross join
df_result = pd.merge(df_left,df_right, how="cross")
print()
print("After cross join:")
display(df_result)

Left DataFrame:


Unnamed: 0,CustNo,Purchases,isVIP
0,3,1000,Yes
1,4,2100,No
2,5,1500,No
3,6,500,Yes
4,7,2400,No
5,8,820,No



Right DataFrame:


Unnamed: 0,CustNo,Name,Gender
0,1,John,M
1,2,Peter,M
2,3,Mary,F
3,6,David,M
4,7,Susan,F



After cross join:


Unnamed: 0,CustNo_x,Purchases,isVIP,CustNo_y,Name,Gender
0,3,1000,Yes,1,John,M
1,3,1000,Yes,2,Peter,M
2,3,1000,Yes,3,Mary,F
3,3,1000,Yes,6,David,M
4,3,1000,Yes,7,Susan,F
5,4,2100,No,1,John,M
6,4,2100,No,2,Peter,M
7,4,2100,No,3,Mary,F
8,4,2100,No,6,David,M
9,4,2100,No,7,Susan,F


In [None]:
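# A minimal cross join: two single-column DataFrames with 5 rows each
values = [1, 2, 3, 4, 5]
df = pd.DataFrame({"1": values})
df2 = pd.DataFrame({"2": values})
display(df)
display(df2)

# Every row of df is paired with every row of df2, giving 5 x 5 = 25 rows
df_result = pd.merge(df, df2, how="cross")
display(df_result)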

Unnamed: 0,1
0,1
1,2
2,3
3,4
4,5


Unnamed: 0,2
0,1
1,2
2,3
3,4
4,5


Unnamed: 0,1,2
0,1,1
1,1,2
2,1,3
3,1,4
4,1,5
5,2,1
6,2,2
7,2,3
8,2,4
9,2,5
