# Combining Multiple Datasets
Another aspect of real-world data is that it often comes in multiple pieces. In this section, you’ll learn how to grab those pieces and combine them into one dataset that’s ready for analysis.

In [1]:
import pandas as pd


In [14]:
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})
city_employee_count

Amsterdam    5
Tokyo        8
dtype: int64

In [15]:
city_revenues = pd.Series(
    [4200, 8000, 6500],
    index=["Amsterdam", "Toronto", "Tokyo"])
city_revenues

Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64

In [16]:
city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count": city_employee_count
})
city_data

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
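
The float values and the missing entry for Toronto come from index alignment: pandas matches the two Series by label rather than by position, and Toronto has no employee count. A minimal sketch that rebuilds the same data and checks this (the variable name `missing` is only illustrative):

```python
import pandas as pd

city_revenues = pd.Series(
    [4200, 8000, 6500],
    index=["Amsterdam", "Toronto", "Tokyo"])
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})

city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count": city_employee_count
})

# Rows are matched by index label, not by position.
# Toronto has no employee count, so that cell is NaN
# and the whole column is stored as float64.
missing = city_data["employee_count"].isna()
print(missing)
print(city_data.dtypes)
```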


The next DataFrame, city_countries, holds further entries indexed by the provinces "Balochistan" and "Sindh":




In [18]:
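city_countries = pd.DataFrame(
    {"countries": ["Hub", "Karachi", "Karsaz", "New karachi"],
     "numbers": [2, 4, 5, 7]},
    # Index labels inferred from the output shown below
    index=["Balochistan", "Sindh", "Sindh", "Sindh"])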
city_countries

               countries  numbers
Balochistan          Hub        2
Sindh            Karachi        4
Sindh             Karsaz        5
Sindh        New karachi        7


# .concat()
Add these rows to city_data using pd.concat():


In [24]:
all_city_data = pd.concat([city_data, city_countries], sort=False)

In [25]:
all_city_data

             revenue  employee_count    countries  numbers
Amsterdam     4200.0             5.0          NaN      NaN
Tokyo         6500.0             8.0          NaN      NaN
Toronto       8000.0             NaN          NaN      NaN
Balochistan      NaN             NaN          Hub      2.0
Sindh            NaN             NaN      Karachi      4.0
Sindh            NaN             NaN       Karsaz      5.0
Sindh            NaN             NaN  New karachi      7.0


Now, the new variable all_city_data contains the values from both DataFrame objects.
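
To make the row-wise behaviour concrete, here is a small sketch with two illustrative frames (`left` and `right` are made-up names, not part of the notebook). It assumes the two inputs share no columns, as in the example above:

```python
import pandas as pd

# Two small illustrative frames with no columns in common.
left = pd.DataFrame({"revenue": [4200, 6500]}, index=["Amsterdam", "Tokyo"])
right = pd.DataFrame({"numbers": [2, 4]}, index=["Balochistan", "Sindh"])

combined = pd.concat([left, right], sort=False)

# Rows are stacked, so the row count is the sum of both inputs.
print(combined.shape)
# Columns are the union of both inputs.
print(combined.columns.tolist())
# Cells that have no source value are filled with NaN.
print(combined.isna().sum())
```

Because each input contributes values to only its own columns, every row of the result has at least one NaN.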
# Note:
By default, concat() combines along axis=0. In other words, it appends rows. You can also use it to append columns by supplying the parameter axis=1:



In [57]:
marks = pd.DataFrame(
    {"std1": [23, 25, 26, 39],
     "std2": [34, 56, 23, 45],
     "std3": [12, 34, 56, 22]},
    index=["paper_1", "paper_2", "paper_3", "paper_4"])
marks


         std1  std2  std3
paper_1    23    34    12
paper_2    25    56    34
paper_3    26    23    56
paper_4    39    45    22


In [58]:
marks_final = pd.DataFrame(
    {"Std4": [24, 34, 44, 52, 4],
     "Std5": [121, 34, 54, 43, 6]},
    index=["paper_1", "paper_2", "paper_3", "paper_4", "extra"])
marks_final

         Std4  Std5
paper_1    24   121
paper_2    34    34
paper_3    44    54
paper_4    52    43
extra       4     6


In [59]:
results = pd.concat([marks, marks_final], axis=1, sort=False)

In [60]:
results

         std1  std2  std3  Std4  Std5
paper_1  23.0  34.0  12.0    24   121
paper_2  25.0  56.0  34.0    34    34
paper_3  26.0  23.0  56.0    44    54
paper_4  39.0  45.0  22.0    52    43
extra     NaN   NaN   NaN     4     6


Note how pandas added NaN for the values missing from the "extra" row, which also turned std1, std2, and std3 into float columns. If you want to keep only the rows whose index labels appear in both DataFrame objects, then you can set the join parameter to "inner":



In [61]:
results = pd.concat([marks, marks_final], axis=1, sort=False, join="inner")

In [62]:
results

         std1  std2  std3  Std4  Std5
paper_1    23    34    12    24   121
paper_2    25    56    34    34    34
paper_3    26    23    56    44    54
paper_4    39    45    22    52    43
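
The difference between the two join modes comes down to which index labels survive. The following sketch uses trimmed-down versions of marks and marks_final (fewer rows and columns than above, purely for brevity) to compare the default outer join with the inner join:

```python
import pandas as pd

# Trimmed-down versions of the frames above, for illustration only.
marks = pd.DataFrame({"std1": [23, 25]}, index=["paper_1", "paper_2"])
marks_final = pd.DataFrame(
    {"Std4": [24, 34, 4]},
    index=["paper_1", "paper_2", "extra"])

# join="outer" is the default: keep the union of index labels.
outer = pd.concat([marks, marks_final], axis=1)
# join="inner": keep only the labels present in both frames.
inner = pd.concat([marks, marks_final], axis=1, join="inner")

print(outer.index.tolist())
print(inner.index.tolist())
# With no NaN to store, the inner result keeps integer columns.
print(inner.dtypes)
```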


While it’s most straightforward to combine data based on the index, it’s not the only possibility. You can use .merge() to implement a join operation similar to the one from SQL:
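
As a hedged sketch of what such a join can look like, the example below merges two small frames on a shared "city" column. The `city_info` table and its contents are invented for this illustration and are not part of the notebook's data:

```python
import pandas as pd

city_revenues = pd.DataFrame({
    "city": ["Amsterdam", "Tokyo", "Toronto"],
    "revenue": [4200, 6500, 8000],
})
# Hypothetical lookup table, invented for this illustration.
city_info = pd.DataFrame({
    "city": ["Amsterdam", "Tokyo", "Barcelona"],
    "country": ["Netherlands", "Japan", "Spain"],
})

# Join on the shared "city" column instead of the index.
# how="inner" is the default and keeps only cities found in both frames.
merged = pd.merge(city_revenues, city_info, on="city", how="inner")
print(merged)
```

Passing how="left", how="right", or how="outer" instead would keep unmatched rows from one or both inputs, much like the corresponding SQL joins.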



In [None]:
cont...