### Concatenating


In [1]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.full((2, 3), "x", dtype=object), columns=["A", "B", "C"])
df1

Unnamed: 0,A,B,C
0,x,x,x
1,x,x,x


In [2]:
df2 = pd.DataFrame(np.full((3, 3), "o", dtype=object), columns=["A", "B", "C"])
df2

Unnamed: 0,A,B,C
0,o,o,o
1,o,o,o
2,o,o,o


In [3]:
df3 = pd.DataFrame(np.full((2, 2), "v", dtype=object), columns=["D", "E"])
df3

Unnamed: 0,D,E
0,v,v
1,v,v


In [4]:
# Along axis 0
pd.concat([df1, df2])

Unnamed: 0,A,B,C
0,x,x,x
1,x,x,x
0,o,o,o
1,o,o,o
2,o,o,o


In [5]:
# Reset the index of the concatenated DataFrame so the row labels are unique
pd.concat([df1, df2]).reset_index(drop=True)

Unnamed: 0,A,B,C
0,x,x,x
1,x,x,x
2,o,o,o
3,o,o,o
4,o,o,o


In [6]:
# DataFrames with different column labels
pd.concat([df1, df3])

Unnamed: 0,A,B,C,D,E
0,x,x,x,,
1,x,x,x,,
0,,,,v,v
1,,,,v,v


#### The keys parameter

Suppose that after concatenating the DataFrames, we still want to keep the data from each DataFrame in a separate group. This is useful for determining later on which DataFrame a given entry came from. We can achieve this with the keys parameter, which creates a hierarchical index on the concatenated DataFrame.

In [7]:
df4 = pd.concat([df1, df2], keys=["df1", "df2"])
df4

Unnamed: 0,Unnamed: 1,A,B,C
df1,0,x,x,x
df1,1,x,x,x
df2,0,o,o,o
df2,1,o,o,o
df2,2,o,o,o
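
Once the keys are in place, the outer level of the index can be used to pull out the rows that came from one source. The following is a minimal sketch of that idea; the level names `source` and `row` passed to `names` are illustrative choices, not part of the example above.

```python
import numpy as np
import pandas as pd

# Rebuild the two frames from the cells above.
df1 = pd.DataFrame(np.full((2, 3), "x", dtype=object), columns=["A", "B", "C"])
df2 = pd.DataFrame(np.full((3, 3), "o", dtype=object), columns=["A", "B", "C"])

# names= is optional; "source" and "row" are illustrative level names.
df4 = pd.concat([df1, df2], keys=["df1", "df2"], names=["source", "row"])

# Select every row that came from the second DataFrame.
from_df2 = df4.loc["df2"]

# Turn the outer index level into an ordinary column.
flat = df4.reset_index(level="source")
```

Selecting with `.loc` on the key drops the outer level, so `from_df2` is indexed by the original row labels only.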


#### Concatenating along axis 1

Now let’s look at the second way of concatenating two DataFrames: side by side. This type of concatenation will first align by row index labels of each DataFrame, and then put in the columns of the first DataFrame followed by the columns of the second.

In [8]:
pd.concat([df1, df3], axis=1)

Unnamed: 0,A,B,C,D,E
0,x,x,x,v,v
1,x,x,x,v,v


In [9]:
# DataFrames with different numbers of rows
pd.concat([df1, df2], axis=1)

Unnamed: 0,A,B,C,A,B,C
0,x,x,x,o,o,o
1,x,x,x,o,o,o
2,,,,o,o,o
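
The examples above all use the default integer index, so the alignment on row labels is easy to miss. Here is a minimal sketch with made-up frames (`left`, `right`, and their string labels are illustrative) where the indices overlap only partially:

```python
import pandas as pd

# Two small frames whose row labels overlap only on "b".
left = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"D": [10, 20]}, index=["b", "c"])

# Rows are matched by label, not by position; labels present in only
# one frame get NaN in the other frame's columns.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Only row `b` has values in both columns; rows `a` and `c` are each filled with NaN on one side.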


#### The join parameter

* join='outer' takes the union of the row indices or column labels, whichever lie along the axis that is not being concatenated. This is the default and what we have seen so far: labels that the DataFrames have in common are not duplicated, and labels that appear in only one DataFrame are kept, with NaN filling the missing entries.
* join='inner' takes the intersection of the row indices or column labels. That is, we keep only the labels that are in common and discard the rest. Here is the previous example, but this time with join='inner':



In [12]:
# The rows that used to have NaN values are discarded completely
pd.concat([df1, df2], axis=1, join="inner")

Unnamed: 0,A,B,C,A,B,C
0,x,x,x,o,o,o
1,x,x,x,o,o,o


In [11]:
# Concatenating along axis 0 with no common column labels: only the row index survives
pd.concat([df1, df3], join="inner")

0
1
0
1
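
The cases above are all-or-nothing: either every label is shared or none is. Here is a small sketch of the in-between case along axis 0, using made-up frames (`first` and `second` are illustrative names) that share only column `B`:

```python
import pandas as pd

# The two frames have exactly one column label in common: "B".
first = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
second = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Inner join keeps only the shared column.
common = pd.concat([first, second], join="inner")

# Outer join (the default) keeps all columns and fills the gaps with NaN.
everything = pd.concat([first, second], join="outer")
```

Here `common` contains only column `B`, while `everything` contains `A`, `B`, and `C`.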


### Merging and joining

The main role of the merge() function is to allow us to combine DataFrames along multiple columns, or along columns other than the index.

#### Merging on a single column

We have two DataFrames that have a column in common and we want to merge along this column. Let’s define the two DataFrames:

In [13]:
import pandas as pd

users = pd.DataFrame(
    {
        "userID": [5672, 3452, 2878, 3234],
        "First Name": ["Christopher", "Johnnie", "Debbie", "Teri"],
        "Last Name": ["Boyd", "Baldwin", "Alvarez", "Gill"],
    }
)
users

Unnamed: 0,userID,First Name,Last Name
0,5672,Christopher,Boyd
1,3452,Johnnie,Baldwin
2,2878,Debbie,Alvarez
3,3234,Teri,Gill


In [14]:
scores = pd.DataFrame(
    {"userID": [2878, 5672, 3234, 5672, 2878], "Score": [84, 56, 72, 77, 88]}
)
scores


Unnamed: 0,userID,Score
0,2878,84
1,5672,56
2,3234,72
3,5672,77
4,2878,88


In [15]:
# We want to merge on userID, the only column label the two DataFrames share
merged_df = pd.merge(users, scores)
merged_df

Unnamed: 0,userID,First Name,Last Name,Score
0,5672,Christopher,Boyd,56
1,5672,Christopher,Boyd,77
2,2878,Debbie,Alvarez,84
3,2878,Debbie,Alvarez,88
4,3234,Teri,Gill,72
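
The call above relies on pandas picking up the shared column label by itself. As a sketch, the same merge can be written with the key named explicitly through the `on` parameter, which makes the intent clearer to a reader:

```python
import pandas as pd

users = pd.DataFrame(
    {
        "userID": [5672, 3452, 2878, 3234],
        "First Name": ["Christopher", "Johnnie", "Debbie", "Teri"],
        "Last Name": ["Boyd", "Baldwin", "Alvarez", "Gill"],
    }
)
scores = pd.DataFrame(
    {"userID": [2878, 5672, 3234, 5672, 2878], "Score": [84, 56, 72, 77, 88]}
)

# Name the merge key explicitly instead of relying on the default.
explicit = pd.merge(users, scores, on="userID")

# The default behaviour: merge on all column labels the frames share.
implicit = pd.merge(users, scores)
```

Both calls give the same result. User 3452 has no entry in `scores`, so that user does not appear in the merged DataFrame.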


In [16]:
# Consider this alternative data, where the key column does not have the same label as in users
scores2 = pd.DataFrame(
    {"studentID": [2878, 5672, 3234, 5672, 2878], "Score": [84, 56, 72, 77, 88]}
)
scores2

Unnamed: 0,studentID,Score
0,2878,84
1,5672,56
2,3234,72
3,5672,77
4,2878,88
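
With these labels, a plain `pd.merge(users, scores2)` has no shared column to work with. The sketch below shows what to expect in that case: pandas raises a `MergeError`. The variable `merge_failed` is only there for illustration.

```python
import pandas as pd

users = pd.DataFrame(
    {
        "userID": [5672, 3452, 2878, 3234],
        "First Name": ["Christopher", "Johnnie", "Debbie", "Teri"],
        "Last Name": ["Boyd", "Baldwin", "Alvarez", "Gill"],
    }
)
scores2 = pd.DataFrame(
    {"studentID": [2878, 5672, 3234, 5672, 2878], "Score": [84, 56, 72, 77, 88]}
)

try:
    # No column label is shared, so pandas cannot infer a merge key.
    pd.merge(users, scores2)
    merge_failed = False
except pd.errors.MergeError as err:
    merge_failed = True
    print(f"Merge failed: {err}")
```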


Because the key columns carry different labels, pandas cannot infer the merge key on its own. However, if we know which columns correspond to each other, we can tell pandas which columns to base the merge on, as follows:

In [18]:
pd.merge(users, scores2, left_on="userID", right_on="studentID")

Unnamed: 0,userID,First Name,Last Name,studentID,Score
0,5672,Christopher,Boyd,5672,56
1,5672,Christopher,Boyd,5672,77
2,2878,Debbie,Alvarez,2878,84
3,2878,Debbie,Alvarez,2878,88
4,3234,Teri,Gill,3234,72


Notice that both columns are retained in the merged DataFrame. We can, of course, drop one of them if we wish.
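
Dropping the redundant key column can be done with `DataFrame.drop`. This is a minimal sketch that reuses the data from the cells above:

```python
import pandas as pd

users = pd.DataFrame(
    {
        "userID": [5672, 3452, 2878, 3234],
        "First Name": ["Christopher", "Johnnie", "Debbie", "Teri"],
        "Last Name": ["Boyd", "Baldwin", "Alvarez", "Gill"],
    }
)
scores2 = pd.DataFrame(
    {"studentID": [2878, 5672, 3234, 5672, 2878], "Score": [84, 56, 72, 77, 88]}
)

merged = pd.merge(users, scores2, left_on="userID", right_on="studentID")

# studentID duplicates userID after the merge, so keep only one of them.
merged = merged.drop(columns="studentID")
```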

#### Merging on multiple columns

Now, let’s consider DataFrames with more than one column in common. 

In [19]:
gold = pd.DataFrame(
    {
        "Code": ["CAN", "GER", "USA", "NOR"],
        "Country": ["Canada", "Germany", "United States", "Norway"],
        "Total": [14, 10, 9, 9],
    }
)
gold

Unnamed: 0,Code,Country,Total
0,CAN,Canada,14
1,GER,Germany,10
2,USA,United States,9
3,NOR,Norway,9


In [20]:
bronze = pd.DataFrame(
    {
        "Code": ["USA", "GER", "NOR", "AUS"],
        "Country": ["United States", "Germany", "Norway", "Austria"],
        "Total": [13, 7, 7, 6],
    }
)
bronze

Unnamed: 0,Code,Country,Total
0,USA,United States,13
1,GER,Germany,7
2,NOR,Norway,7
3,AUS,Austria,6


What we would like is a new DataFrame that is merged along the Code and Country columns but retains the two separate Total columns (ideally with labels that distinguish them). What happens if we just call the merge function as before?



In [21]:
pd.merge(gold, bronze)

Unnamed: 0,Code,Country,Total


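By default, merge() uses every column label the two DataFrames have in common, so the merged DataFrame consists of all rows where the Code, Country, and Total columns are identical in both DataFrames. The result is empty because the entries in the Total column of the two DataFrames never match for the same country. To merge on Code and Country only, we pass them to the on parameter; pandas then appends the suffixes _x and _y to the two overlapping Total columns: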

In [22]:
pd.merge(gold, bronze, on=["Code", "Country"])

Unnamed: 0,Code,Country,Total_x,Total_y
0,GER,Germany,10,7
1,USA,United States,9,13
2,NOR,Norway,9,7


In [23]:
# The default suffixes _x and _y can be replaced with custom ones using the suffixes parameter:
pd.merge(gold, bronze, on=["Code", "Country"], suffixes=["_gold", "_bronze"])

Unnamed: 0,Code,Country,Total_gold,Total_bronze
0,GER,Germany,10,7
1,USA,United States,9,13
2,NOR,Norway,9,7


What we did in the last example is referred to as an inner join: we kept the rows that matched in the Code and Country columns of both DataFrames. Since Canada appears in gold but not in bronze, and Austria appears in bronze but not in gold, these two rows were not included in the merged DataFrame. This corresponds to an intersection.

In contrast to this, we can opt for an outer join where we keep all the rows, corresponding to a union. We can do this using the how parameter.

In [24]:
pd.merge(
    gold, bronze, on=["Code", "Country"], suffixes=["_gold", "_bronze"], how="outer"
)

Unnamed: 0,Code,Country,Total_gold,Total_bronze
0,CAN,Canada,14.0,
1,GER,Germany,10.0,7.0
2,USA,United States,9.0,13.0
3,NOR,Norway,9.0,7.0
4,AUS,Austria,,6.0


We also have:
* left join: keep the matched rows plus the unmatched rows from the left DataFrame only
* right join: keep the matched rows plus the unmatched rows from the right DataFrame only

In [25]:
pd.merge(
    gold, bronze, on=["Code", "Country"], suffixes=["_gold", "_bronze"], how="left"
)

Unnamed: 0,Code,Country,Total_gold,Total_bronze
0,CAN,Canada,14,
1,GER,Germany,10,7.0
2,USA,United States,9,13.0
3,NOR,Norway,9,7.0


In [26]:
pd.merge(
    gold, bronze, on=["Code", "Country"], suffixes=["_gold", "_bronze"], how="right"
)


Unnamed: 0,Code,Country,Total_gold,Total_bronze
0,USA,United States,9.0,13
1,GER,Germany,10.0,7
2,NOR,Norway,9.0,7
3,AUS,Austria,,6
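
To see which side each row of a merge came from, `merge` accepts an `indicator` parameter that adds a `_merge` column. The sketch below applies it to the outer join of the medal tables; the helper dictionary `origin` is an illustrative name.

```python
import pandas as pd

gold = pd.DataFrame(
    {
        "Code": ["CAN", "GER", "USA", "NOR"],
        "Country": ["Canada", "Germany", "United States", "Norway"],
        "Total": [14, 10, 9, 9],
    }
)
bronze = pd.DataFrame(
    {
        "Code": ["USA", "GER", "NOR", "AUS"],
        "Country": ["United States", "Germany", "Norway", "Austria"],
        "Total": [13, 7, 7, 6],
    }
)

# indicator=True adds a "_merge" column whose values are
# "left_only", "right_only", or "both".
tracked = pd.merge(
    gold,
    bronze,
    on=["Code", "Country"],
    suffixes=("_gold", "_bronze"),
    how="outer",
    indicator=True,
)

# Map each country code to the side(s) it was found on.
origin = tracked.set_index("Code")["_merge"].astype(str).to_dict()
```

Canada is marked `left_only`, Austria `right_only`, and the three shared countries `both`, which matches the rows kept by the left, right, and inner joins above.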


Remark: One particular issue can arise when performing an outer merge. Suppose we have two DataFrames containing integer values:

In [27]:
df1 = pd.DataFrame({"key": [1, 2, 3, 4], "val1": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": [1, 2, 3, 5], "val2": [1, 2, 3, 4]})

In [28]:
df_in = df1.merge(df2, how="inner")
df_in

Unnamed: 0,key,val1,val2
0,1,1,1
1,2,2,2
2,3,3,3


In [29]:
# the data types of each column
df_in.dtypes

key     int64
val1    int64
val2    int64
dtype: object

In [30]:
# But now suppose we form an outer join instead

df_out = df1.merge(df2, how="outer")
df_out

Unnamed: 0,key,val1,val2
0,1,1.0,1.0
1,2,2.0,2.0
2,3,3.0,3.0
3,4,4.0,
4,5,,4.0


We can see that there are some NaN values, which is expected, but note that the integer values in val1 and val2 have been converted to floats! And if we check the data types of the columns

In [31]:
df_out.dtypes

key       int64
val1    float64
val2    float64
dtype: object

we notice that val1 and val2 have been changed to float64. This is because NaN is a float, so integer columns that acquire missing values are cast to float, as outlined in the pandas documentation.
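
If the integer type matters downstream, one possible workaround (not part of the example above, and assuming a pandas version that supports the nullable integer dtype `Int64`) is to cast the affected columns after the merge. This is a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3, 4], "val1": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": [1, 2, 3, 5], "val2": [1, 2, 3, 4]})

df_out = df1.merge(df2, how="outer")

# "Int64" (capital I) is the nullable integer dtype: it stores missing
# entries as <NA> instead of forcing the whole column to float.
df_nullable = df_out.astype({"val1": "Int64", "val2": "Int64"})
print(df_nullable.dtypes)
```

The missing entries are still missing after the cast; only the dtype of the present values changes.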