## Merge & Join

The merge() method combines two DataFrames into a new DataFrame by matching rows on one or more key columns or on the index, in the style of a database join. Neither of the original DataFrames is modified.

Use the parameters to control which keys are matched and which rows are kept.

In [37]:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sally','Mary','John'],
                    'Age': [50,40,30]})
df2 = pd.DataFrame({'Name':['Sally','Peter','Milky'],
                   'Age':[77,44,22]})
display(df1,df2)
newdf = df1.merge(df2, how='left')
newdf

Unnamed: 0,Name,Age
0,Sally,50
1,Mary,40
2,John,30


Unnamed: 0,Name,Age
0,Sally,77
1,Peter,44
2,Milky,22


Unnamed: 0,Name,Age
0,Sally,50
1,Mary,40
2,John,30
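
No on argument is passed in the cell above, so merge() joins on every column the two DataFrames share, here both Name and Age. Sally's ages differ (50 and 77), so no row of df2 matches and the result looks the same as df1. A minimal sketch of the difference, reusing the same data and joining on Name only (the names by_all and by_name are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Sally', 'Mary', 'John'],
                    'Age': [50, 40, 30]})
df2 = pd.DataFrame({'Name': ['Sally', 'Peter', 'Milky'],
                    'Age': [77, 44, 22]})

# Default: join on all shared columns (Name and Age), so nothing in df2 matches
by_all = df1.merge(df2, how='left')

# Join on Name only: the overlapping Age column receives the _x / _y suffixes
by_name = df1.merge(df2, on='Name', how='left')
print(by_name)
```

With on='Name', Sally's row carries both ages, while Mary and John have a missing value in Age_y.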


**Syntax**

`dataframe.merge(right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)`

**`1. right`**

The DataFrame or named Series to merge with the calling DataFrame.

In [41]:
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B', 'C'],
    'value': [1, 2, 3]
})

right = pd.DataFrame({
    'key': ['A', 'B', 'D'],
    'value': [4, 5, 6]
})

result = left.merge(right,on='key')
result

Unnamed: 0,key,value_x,value_y
0,A,1,4
1,B,2,5


**`2. how`**

Type of merge to perform: 'left', 'right', 'outer', 'inner' (the default), or 'cross'.

In [44]:
result = left.merge(right, on='key', how='outer')
result

Unnamed: 0,key,value_x,value_y
0,A,1.0,4.0
1,B,2.0,5.0
2,C,3.0,
3,D,,6.0


In [46]:
result = left.merge(right, on='key', how='inner')
result

Unnamed: 0,key,value_x,value_y
0,A,1,4
1,B,2,5


In [48]:
result = left.merge(right, on='key', how='right')
result

Unnamed: 0,key,value_x,value_y
0,A,1.0,4
1,B,2.0,5
2,D,,6


In [50]:
result = left.merge(right, on='key', how='left')
result

Unnamed: 0,key,value_x,value_y
0,A,1,4.0
1,B,2,5.0
2,C,3,
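
how='cross' (available in pandas 1.2 and later) differs from the other modes: it takes no key and pairs every row of the left DataFrame with every row of the right one. A small sketch with made-up frames (sizes and colors are illustrative names, not part of the examples above):

```python
import pandas as pd

sizes = pd.DataFrame({'size': ['S', 'M']})
colors = pd.DataFrame({'color': ['red', 'blue', 'green']})

# Cartesian product: 2 sizes x 3 colors = 6 rows; no `on` argument is allowed
combos = sizes.merge(colors, how='cross')
print(combos)
```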


**`3. on`**

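Column or list of columns to join on. They must be present in both DataFrames. If on is omitted and the merge does not use the indexes, the columns common to both DataFrames are used.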

In [67]:
result = left.merge(right, on='key', how='inner')
result

Unnamed: 0,key,value_x,value_y
0,A,1,4
1,B,2,5
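
on also accepts a list, in which case rows match only when all listed columns agree. A minimal sketch, assuming two hypothetical frames (sales and targets) keyed by region and year:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['East', 'East', 'West'],
    'year': [2022, 2023, 2023],
    'sales': [100, 120, 90]
})

targets = pd.DataFrame({
    'region': ['East', 'West', 'West'],
    'year': [2023, 2022, 2023],
    'target': [110, 80, 95]
})

# A row is kept only if both region and year match
result = sales.merge(targets, on=['region', 'year'], how='inner')
print(result)
```

Only the (East, 2023) and (West, 2023) pairs appear in both frames, so the result has two rows.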


**`4. left_on`**

Columns from the left DataFrame to use as keys.

In [70]:
left = pd.DataFrame({
    'lkey': ['A', 'B', 'C'],
    'value': [1, 2, 3]
})

result = left.merge(right, left_on='lkey', right_on='key', how='inner')
result

Unnamed: 0,lkey,value_x,key,value_y
0,A,1,A,4
1,B,2,B,5


**`5. right_on`**

Columns from the right DataFrame to use as keys.

In [73]:
result = left.merge(right, left_on='lkey', right_on='key', how='inner')
result

Unnamed: 0,lkey,value_x,key,value_y
0,A,1,A,4
1,B,2,B,5


**`6. left_index`**

If True, use the index from the left DataFrame as the join key(s).

In [84]:
import pandas as pd

df1 = pd.DataFrame({
    'value': [10, 20, 30]
}, index=['A', 'B', 'C'])

df2 = pd.DataFrame({
    'key': ['A', 'B', 'D'],
    'value': [1, 2, 4]
})

result = df1.merge(df2, left_index=True, right_on='key', how='inner')
print(result)

   value_x key  value_y
0       10   A        1
1       20   B        2


**`7. right_index`**

If True, use the index from the right DataFrame as the join key(s).

In [52]:
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B', 'C'],
    'value_left': [10, 20, 30]
},index =['2','4','6'])

right = pd.DataFrame({
    'value_right': [100, 200, 300]
}, index=['A', 'B', 'D'])

result = left.merge(right, left_on='key', right_index=True, how='inner')
result

Unnamed: 0,key,value_left,value_right
2,A,10,100
4,B,20,200
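
Setting both left_index=True and right_index=True merges purely on the indexes, which is what the DataFrame.join() method does by default. A short sketch of the two spellings, using illustrative frames indexed by key:

```python
import pandas as pd

left = pd.DataFrame({
    'value_left': [10, 20, 30]
}, index=['A', 'B', 'C'])

right = pd.DataFrame({
    'value_right': [100, 200, 300]
}, index=['A', 'B', 'D'])

# Index-on-index merge
merged = left.merge(right, left_index=True, right_index=True, how='inner')

# join() aligns on the index by default; how='inner' keeps only shared labels
joined = left.join(right, how='inner')

print(merged)
print(joined)
```

join() is mainly a convenience for this index-based case; merge() remains the more general tool when the keys live in columns.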


**`8. sort`**

When sort=True, the result is sorted lexicographically by the join keys. With the default, sort=False, the order of the keys depends on the type of join (how).

In [101]:
import pandas as pd

left = pd.DataFrame({
    'key': ['B', 'A', 'C'],
    'value_left': [2, 1, 3]
})

right = pd.DataFrame({
    'key': ['A', 'B', 'D'],
    'value_right': [4, 5, 6]
})

result = left.merge(right, on='key', how='inner', sort=True)
result

Unnamed: 0,key,value_left,value_right
0,A,1,4
1,B,2,5
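
To see the effect of sort, the same merge can be run with and without it. This is a sketch on the data above; for an inner join the pandas documentation describes the unsorted order as following the left keys, so no output is asserted for that case here:

```python
import pandas as pd

left = pd.DataFrame({
    'key': ['B', 'A', 'C'],
    'value_left': [2, 1, 3]
})

right = pd.DataFrame({
    'key': ['A', 'B', 'D'],
    'value_right': [4, 5, 6]
})

# Default sort=False: row order is determined by the join type
unsorted_result = left.merge(right, on='key', how='inner')

# sort=True: rows are ordered by the join key
sorted_result = left.merge(right, on='key', how='inner', sort=True)

print(unsorted_result)
print(sorted_result)
```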


**`9. suffixes`**

A sequence of two strings appended to overlapping column names from the left and right DataFrames, respectively. The default is ('_x', '_y'). This avoids column name conflicts.

In [104]:
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B'],
    'value': [1, 2]
})

right = pd.DataFrame({
    'key': ['A', 'B'],
    'value': [3, 4]
})

result = left.merge(right, on='key', suffixes=('_left', '_right'))
result

Unnamed: 0,key,value_left,value_right
0,A,1,3
1,B,2,4
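
The two suffixes do not have to be symmetric. As a sketch on the same data, an empty string for the left side keeps the left column name unchanged and renames only the right one:

```python
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B'],
    'value': [1, 2]
})

right = pd.DataFrame({
    'key': ['A', 'B'],
    'value': [3, 4]
})

# Left 'value' keeps its name; right 'value' becomes 'value_right'
result = left.merge(right, on='key', suffixes=('', '_right'))
print(result)
```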


**`10. copy`**

Whether to copy data from the original DataFrames. The default is True. With copy=False, pandas avoids copying where possible, which may save time and memory but should be used with caution.

In [107]:
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B'],
    'value': [1, 2]
})

right = pd.DataFrame({
    'key': ['A', 'B'],
    'value': [3, 4]
})

result = left.merge(right, on='key', how='inner', copy=True)
result

Unnamed: 0,key,value_x,value_y
0,A,1,3
1,B,2,4


In this case, copy is set to True, which is the default, so the result holds its own copy of the data from the original DataFrames.

**`11. indicator`**

Adds a column named _merge to the output DataFrame that shows whether each row's key was found only in the left DataFrame ('left_only'), only in the right DataFrame ('right_only'), or in both ('both').

In [61]:
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B', 'C'],
    'value_left': [1, 2, 3]
})

right = pd.DataFrame({
    'key': ['A', 'B', 'D'],
    'value_right': [4, 5, 6]
})

result = left.merge(right, on='key', how='outer', indicator=True)
result

Unnamed: 0,key,value_left,value_right,_merge
0,A,1.0,4.0,both
1,B,2.0,5.0,both
2,C,3.0,,left_only
3,D,,6.0,right_only


The _merge column shows where each row originated from: both DataFrames, only the left, or only the right DataFrame.
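
A common use of the _merge column is an "anti-join": keeping only the rows of one DataFrame that have no partner in the other. A minimal sketch on the data above (left_only is an illustrative variable name):

```python
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B', 'C'],
    'value_left': [1, 2, 3]
})

right = pd.DataFrame({
    'key': ['A', 'B', 'D'],
    'value_right': [4, 5, 6]
})

result = left.merge(right, on='key', how='outer', indicator=True)

# Keep rows whose key exists only in the left DataFrame, then drop the helper column
left_only = result[result['_merge'] == 'left_only'].drop(columns='_merge')
print(left_only)
```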

**`12. validate`**

Checks that the merge keys have the expected relationship and raises a MergeError if they do not. Possible values are 'one_to_one', 'one_to_many', 'many_to_one', and 'many_to_many' (which is accepted but performs no check).

In [74]:
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'B'],
    'value_left': [1, 2]
})

right = pd.DataFrame({
    'key': ['A', 'B'],
    'value_right': [3, 4]
})

result = left.merge(right, on='key', how='inner', validate='one_to_one')
result

Unnamed: 0,key,value_left,value_right
0,A,1,3
1,B,2,4


The validate='one_to_one' argument checks that the merge keys are unique in both the left and the right DataFrame. If either side contained a repeated key, a MergeError would be raised.
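
To see the check fail, the left DataFrame needs a repeated key. The sketch below uses a modified copy of the data above in which 'A' appears twice; the variable names raised and ok are illustrative:

```python
import pandas as pd

left = pd.DataFrame({
    'key': ['A', 'A', 'B'],       # 'A' is repeated, so keys are not unique
    'value_left': [1, 2, 3]
})

right = pd.DataFrame({
    'key': ['A', 'B'],
    'value_right': [4, 5]
})

try:
    left.merge(right, on='key', validate='one_to_one')
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print("Merge rejected:", exc)

# 'many_to_one' only requires unique keys on the right, so this passes
ok = left.merge(right, on='key', validate='many_to_one')
print(ok)
```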

In [59]:
import pandas as pd
import zipfile
import os

zip_path = "C:\\Users\\winsa\\Downloads\\DataFolder.zip"
extraction_dir = 'C:\\Users\\winsa\\Desktop\\arya'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_dir)
print("extraction completed")

extraction completed


In [13]:
dfs = []

In [15]:
# Read every CSV file found directly inside the extraction directory
for file in os.listdir(extraction_dir):
    file_path = os.path.join(extraction_dir, file)
    if file.endswith('.csv'):
        dfs.append(pd.read_csv(file_path))

In [11]:
concat_df = pd.concat(dfs,ignore_index=True)

ValueError: No objects to concatenate

In [116]:
import pandas as pd
import os

extraction_dir = 'C:\\Users\\winsa\\Desktop\\arya'
dfs = []  # start from an empty list so this cell does not depend on earlier cells

files = os.listdir(extraction_dir)
print("Files found:", files)  
for file in files:
    file_path = os.path.join(extraction_dir, file)
    
    if file.endswith('.csv'):
        try:
            dfs.append(pd.read_csv(file_path))
            print(f"Loaded {file_path}")  
        except Exception as e:
            print(f"Error reading {file_path}: {e}")  

if dfs:
    concatenated_df = pd.concat(dfs, ignore_index=True)
    print(concatenated_df.head())
else:
    print("No CSV files were found or all files could not be read.")

Files found: ['DataFolder', 'env', 'nailart']
No CSV files were found or all files could not be read.
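
The listing above contains only folders, so a search limited to the top level finds no CSV files. One possible alternative, sketched below, is to search subfolders as well with pathlib. The helper name load_csvs and the temporary demo directory are assumptions made for illustration, not part of the original notebook:

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs(root):
    """Read every .csv file under root, including subfolders, into one DataFrame."""
    paths = sorted(Path(root).rglob('*.csv'))
    if not paths:
        # Fail with a clear message instead of pd.concat's "No objects to concatenate"
        raise FileNotFoundError(f"No CSV files found under {root}")
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)


# Demo on a throwaway directory with the CSV files nested one level down
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / 'DataFolder'
    nested.mkdir()
    pd.DataFrame({'a': [1, 2]}).to_csv(nested / 'part1.csv', index=False)
    pd.DataFrame({'a': [3]}).to_csv(nested / 'part2.csv', index=False)
    combined = load_csvs(tmp)

print(combined)
```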


In [166]:
import pandas as pd
import os

folder_path = 'C:\\Users\\winsa\\Desktop\\DataFolder'

dfs = []

# Read each CSV file in the folder into its own DataFrame
for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(folder_path, filename)
        df = pd.read_csv(file_path)
        dfs.append(df)

# Stack all DataFrames and renumber the rows
combined_df = pd.concat(dfs, ignore_index=True)

# Rows that repeat an earlier row in every column
duplicates = combined_df[combined_df.duplicated()]

print("Combined DataFrame:")
display(combined_df)
print("Duplicate Rows:")
display(duplicates)

# Keep only the first occurrence of each row
cleaned_df = combined_df.drop_duplicates()
print('cleaned_df')
display(cleaned_df)

# Use powertrain and category as a two-level index
data = cleaned_df.set_index(['powertrain','category'])
data

Combined DataFrame:


Unnamed: 0,region,category,parameter,mode,powertrain,year,unit,value
0,Australia,Historical,EV stock share,Cars,EV,2011,percent,3.900000e-04
1,Australia,Historical,EV sales share,Cars,EV,2011,percent,6.500000e-03
2,Australia,Historical,EV sales,Cars,BEV,2011,Vehicles,4.900000e+01
3,Australia,Historical,EV stock,Cars,BEV,2011,Vehicles,4.900000e+01
4,Australia,Historical,EV stock,Cars,BEV,2012,Vehicles,2.200000e+02
...,...,...,...,...,...,...,...,...
12763,World,Projection-STEPS,EV sales share,Cars,EV,2035,percent,5.500000e+01
12764,World,Projection-STEPS,EV stock share,Cars,EV,2035,percent,3.100000e+01
12765,World,Projection-APS,EV charging points,EV,Publicly available fast,2035,charging points,9.400000e+06
12766,World,Projection-APS,EV charging points,EV,Publicly available slow,2035,charging points,1.500000e+07


Duplicate Rows:


Unnamed: 0,region,category,parameter,mode,powertrain,year,unit,value
94,Australia,Historical,EV stock share,Cars,EV,2011,percent,0.00039
95,Australia,Historical,EV sales share,Cars,EV,2011,percent,0.00650
96,Australia,Historical,EV sales,Cars,BEV,2011,Vehicles,49.00000
97,Australia,Historical,EV stock,Cars,BEV,2011,Vehicles,49.00000
98,Australia,Historical,EV stock,Cars,BEV,2012,Vehicles,220.00000
...,...,...,...,...,...,...,...,...
1259,Australia,Historical,EV stock,Cars,PHEV,2022,Vehicles,21000.00000
1260,Australia,Historical,EV sales,Cars,PHEV,2022,Vehicles,5900.00000
1261,Australia,Historical,EV stock share,Cars,EV,2022,percent,0.59000
1262,Australia,Historical,EV charging points,EV,Publicly available fast,2022,charging points,470.00000


cleaned_df


Unnamed: 0,region,category,parameter,mode,powertrain,year,unit,value
0,Australia,Historical,EV stock share,Cars,EV,2011,percent,3.900000e-04
1,Australia,Historical,EV sales share,Cars,EV,2011,percent,6.500000e-03
2,Australia,Historical,EV sales,Cars,BEV,2011,Vehicles,4.900000e+01
3,Australia,Historical,EV stock,Cars,BEV,2011,Vehicles,4.900000e+01
4,Australia,Historical,EV stock,Cars,BEV,2012,Vehicles,2.200000e+02
...,...,...,...,...,...,...,...,...
12763,World,Projection-STEPS,EV sales share,Cars,EV,2035,percent,5.500000e+01
12764,World,Projection-STEPS,EV stock share,Cars,EV,2035,percent,3.100000e+01
12765,World,Projection-APS,EV charging points,EV,Publicly available fast,2035,charging points,9.400000e+06
12766,World,Projection-APS,EV charging points,EV,Publicly available slow,2035,charging points,1.500000e+07


powertrain,category,region,parameter,mode,year,unit,value
EV,Historical,Australia,EV stock share,Cars,2011,percent,3.900000e-04
EV,Historical,Australia,EV sales share,Cars,2011,percent,6.500000e-03
BEV,Historical,Australia,EV sales,Cars,2011,Vehicles,4.900000e+01
BEV,Historical,Australia,EV stock,Cars,2011,Vehicles,4.900000e+01
BEV,Historical,Australia,EV stock,Cars,2012,Vehicles,2.200000e+02
...,...,...,...,...,...,...,...
EV,Projection-STEPS,World,EV sales share,Cars,2035,percent,5.500000e+01
EV,Projection-STEPS,World,EV stock share,Cars,2035,percent,3.100000e+01
Publicly available fast,Projection-APS,World,EV charging points,EV,2035,charging points,9.400000e+06
Publicly available slow,Projection-APS,World,EV charging points,EV,2035,charging points,1.500000e+07
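
By default duplicated() flags only the second and later occurrences of a row (keep='first'), which is consistent with the duplicates table above starting at index 94 rather than 0. A toy sketch of the difference between the default and keep=False, using made-up rows rather than the real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['Australia', 'Australia', 'World', 'Australia'],
    'year': [2011, 2012, 2035, 2011],
    'value': [49.0, 220.0, 55.0, 49.0]
})

# Default keep='first': only the repeat (index 3) is flagged
later_copies = df[df.duplicated()]

# keep=False: the original row and its repeat are both flagged
all_copies = df[df.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row
cleaned = df.drop_duplicates()

print(later_copies)
print(all_copies)
print(cleaned)
```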


In [150]:
data.isnull().sum()

region       0
parameter    0
mode         0
year         0
unit         0
value        0
dtype: int64