Readme:


We encourage you to explore more functionalities in 'Python for Data Analysis, 3E' by Wes McKinney, Chapter 8: 'Data Wrangling: Join, Combine, and Reshape'.</br>
Link: https://wesmckinney.com/book/data-wrangling

In [1]:
import pandas as pd
import numpy as np

<h3><b>Task 1 </b></h3>
<p>
Merging.</br>
What is the default merge method in pandas? Run the code below and analyze the result.</br>
What is the default merge key column?  </br>
</p>


In [2]:
df1 = pd.DataFrame({"key": ['a', "b", 'c'],
                     "data1": pd.Series(range(3), dtype="Int64")})

df2 = pd.DataFrame({"key": ["a", "b", "d"],
                     "data2": pd.Series(range(3, 6), dtype="Int64")})

pd.merge(df1, df2)

  key  data1  data2
0   a      0      3
1   b      1      4
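
<p>
The output above suggests the answer: with no arguments, pd.merge() performs an inner join on the column names the two DataFrames share. As a minimal check (the names 'left' and 'right' are illustrative and not part of the exercise), the default call can be compared with an explicit one:
</p>

```python
import pandas as pd

# Illustrative frames mirroring df1 and df2 from the task
left = pd.DataFrame({"key": ["a", "b", "c"], "data1": [0, 1, 2]})
right = pd.DataFrame({"key": ["a", "b", "d"], "data2": [3, 4, 5]})

# Default call: no 'how' and no 'on'
default_result = pd.merge(left, right)

# Explicit call: inner join on the shared column 'key'
explicit_result = pd.merge(left, right, how="inner", on="key")

# True if both calls produce the same DataFrame
same = default_result.equals(explicit_result)
print(same)
```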


<h3><b>Task 2 </b></h3>
<p>
Perform a left merge on the given DataFrames, explicitly specifying the merge key column. </br>
</p>


In [3]:
import pandas as pd

# Original DataFrames
df1 = pd.DataFrame({
    "key": ['a', "b", 'c'],
    "data1": pd.Series(range(3), dtype="Int64")
})

df2 = pd.DataFrame({
    "key": ["a", "b", "d"],
    "data2": pd.Series(range(3, 6), dtype="Int64")
})

# Perform left merge explicitly specifying the key
result = pd.merge(df1, df2, how="left", on="key")
print(result)


  key  data1  data2
0   a      0      3
1   b      1      4
2   c      2   <NA>
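
<p>
To see which rows found a match, one option is the 'indicator' argument of pd.merge(), which adds a '_merge' column. The sketch below reuses the task's data under the illustrative names 'left' and 'right':
</p>

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "data1": [0, 1, 2]})
right = pd.DataFrame({"key": ["a", "b", "d"], "data2": [3, 4, 5]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
result = pd.merge(left, right, how="left", on="key", indicator=True)
print(result)

# Keys from the left frame that had no partner in the right frame
unmatched = result.loc[result["_merge"] == "left_only", "key"].tolist()
print(unmatched)
```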


<h3><b>Task 3 </b></h3>
<p>
What if the merge keys have different names in the two DataFrames? Perform a right merge on the DataFrames below. </br>
</p>


In [5]:
import pandas as pd

# Given DataFrames
df3 = pd.DataFrame({
    "key_l": ['a', "b", 'c'],
    "data1": pd.Series(range(3), dtype="Int64")
})

df4 = pd.DataFrame({
    "key_r": ["a", "b", "d"],
    "data2": pd.Series(range(3, 6), dtype="Int64")
})

# Perform right merge with different key names
result = pd.merge(df3, df4, how="right", left_on="key_l", right_on="key_r")
print(result)


  key_l  data1 key_r  data2
0     a      0     a      3
1     b      1     b      4
2   NaN   <NA>     d      5
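
<p>
Merging with 'left_on' and 'right_on' keeps both key columns in the result. If a single key column is preferred, one possible approach (an alternative, not what the task asks for) is to rename one key first and merge with 'on':
</p>

```python
import pandas as pd

df3 = pd.DataFrame({"key_l": ["a", "b", "c"], "data1": [0, 1, 2]})
df4 = pd.DataFrame({"key_r": ["a", "b", "d"], "data2": [3, 4, 5]})

# Rename the right-hand key so both frames share the column name 'key_l'
df4_renamed = df4.rename(columns={"key_r": "key_l"})

# The right merge now produces one key column instead of two
result = pd.merge(df3, df4_renamed, how="right", on="key_l")
print(result)
```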


<h3><b>Task 4 </b></h3>
<p>
Perform an outer merge on multiple keys, 'key1' and 'key2', using the DataFrames below. </br>
</p>


In [6]:
import pandas as pd

# Given DataFrames
df5 = pd.DataFrame({
    "key1": ['a', "b", 'c'],
    "key2": [1, 2, 3],
    "data1": pd.Series(range(3), dtype="Int64")
})

df6 = pd.DataFrame({
    "key1": ["a", "b", "d"],
    "key2": [1, 4, 5],
    "data2": pd.Series(range(3, 6), dtype="Int64")
})

# Outer merge on key1 and key2
result = pd.merge(df5, df6, how="outer", on=["key1", "key2"])
print(result)


  key1  key2  data1  data2
0    a     1      0      3
1    b     2      1   <NA>
2    b     4   <NA>      4
3    c     3      2   <NA>
4    d     5   <NA>      5


<h3><b>Task 5 </b></h3>
<p>
Now outer merge df5 and df6 on column 'key1' only and see how the overlapping column name 'key2' is displayed. </br>
</p>


In [7]:
import pandas as pd

# Given DataFrames
df5 = pd.DataFrame({
    "key1": ['a', "b", 'c'],
    "key2": [1, 2, 3],
    "data1": pd.Series(range(3), dtype="Int64")
})

df6 = pd.DataFrame({
    "key1": ["a", "b", "d"],
    "key2": [1, 4, 5],
    "data2": pd.Series(range(3, 6), dtype="Int64")
})

# Outer merge on key1 only
result = pd.merge(df5, df6, how="outer", on="key1")
print(result)


  key1  key2_x  data1  key2_y  data2
0    a     1.0      0     1.0      3
1    b     2.0      1     4.0      4
2    c     3.0      2     NaN   <NA>
3    d     NaN   <NA>     5.0      5


<h3><b>Task 6 </b></h3>
<p>
Now repeat the same merge, but provide custom suffixes for the overlapping column name 'key2'.</br>
</p>


In [8]:
import pandas as pd

# DataFrames
df5 = pd.DataFrame({
    "key1": ['a', "b", 'c'],
    "key2": [1, 2, 3],
    "data1": pd.Series(range(3), dtype="Int64")
})

df6 = pd.DataFrame({
    "key1": ["a", "b", "d"],
    "key2": [1, 4, 5],
    "data2": pd.Series(range(3, 6), dtype="Int64")
})

# Outer merge on 'key1' with custom suffixes
result = pd.merge(df5, df6, how="outer", on="key1", suffixes=("_left", "_right"))
print(result)


  key1  key2_left  data1  key2_right  data2
0    a        1.0      0         1.0      3
1    b        2.0      1         4.0      4
2    c        3.0      2         NaN   <NA>
3    d        NaN   <NA>         5.0      5
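
<p>
Suffixes are applied only to overlapping non-key columns. As a small variation, pandas also accepts None for one side of the 'suffixes' tuple, which leaves that side's column name unchanged (at least one of the two values must be a string):
</p>

```python
import pandas as pd

df5 = pd.DataFrame({"key1": ["a", "b", "c"], "key2": [1, 2, 3], "data1": [0, 1, 2]})
df6 = pd.DataFrame({"key1": ["a", "b", "d"], "key2": [1, 4, 5], "data2": [3, 4, 5]})

# Keep the left 'key2' name as-is and suffix only the right one
result = pd.merge(df5, df6, how="outer", on="key1", suffixes=(None, "_right"))
print(result.columns.tolist())
```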


<h3><b>Task 7 </b></h3>
<p>
Merging on Index.</br>
In some cases, the merge key(s) in a DataFrame will be found in its index (row labels).</br>
Can you perform an inner merge of the DataFrames below using df1's column 'key' and df2's index? </br></br>

Note: DataFrame has a 'join' instance method to simplify merging by index. Explore on your own the differences between 'merge' and 'join' in pandas.
</p>


In [11]:
import pandas as pd

df1 = pd.DataFrame({
    "key": ['a', "b", 'c'],
    "value": pd.Series(range(3), dtype="Int64")
})

df2 = pd.DataFrame({
    'group_val': [2.5, 3.5]
}, index=['a', 'b'])

# Perform inner merge on df1['key'] and df2's index
result = pd.merge(df1, df2, left_on='key', right_index=True, how='inner')
print(result)

# Alternative with join (note that join defaults to a left join):
# df1.join(df2, on='key', how='inner')



  key  value  group_val
0   a      0        2.5
1   b      1        3.5
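
<p>
As a starting point for comparing 'merge' and 'join', the sketch below runs DataFrame.join() on the same data. It assumes only the documented 'on' and 'how' parameters. One visible difference is the default: join() performs a left join unless told otherwise.
</p>

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c"], "value": [0, 1, 2]})
df2 = pd.DataFrame({"group_val": [2.5, 3.5]}, index=["a", "b"])

# Default join: a left join of df1's 'key' column against df2's index
left_joined = df1.join(df2, on="key")
print(left_joined)

# Inner join: keeps only the keys found in both frames
inner_joined = df1.join(df2, on="key", how="inner")
print(inner_joined)
```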


<h3><b>Task 8 </b></h3>
<p>
Concatenating Along an Axis. </br>
Run the code below and analyze the result of pandas.concat(). </br>

</p>


In [12]:
s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

pd.concat([s1,s2,s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: Int64
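
<p>
By default, pd.concat() stacks the Series along axis=0 and keeps their original index labels. When the labels carry no meaning, a common variation is 'ignore_index=True', sketched here on the same data:
</p>

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

# Discard the original labels and build a fresh 0..n-1 index
result = pd.concat([s1, s2, s3], ignore_index=True)
print(result)
```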

<h3><b>Task 9 </b></h3>
<p>
Now concatenate the Series along the column axis (axis=1) and analyze the result. </br>
</p>


In [13]:
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

result = pd.concat([s1, s2, s3], axis=1)
print(result)


      0     1     2
a     0  <NA>  <NA>
b     1  <NA>  <NA>
c  <NA>     2  <NA>
d  <NA>     3  <NA>
e  <NA>     4  <NA>
f  <NA>  <NA>     5
g  <NA>  <NA>     6
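
<p>
With axis=1, the result index is the union of the input indexes, which is why every Series above is padded with missing values. When the inputs share labels, 'join="inner"' keeps only the intersection. A short sketch, where 's4' is a helper Series introduced here for illustration:
</p>

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

# s4 overlaps with s1 on the labels 'a' and 'b'
s4 = pd.concat([s1, s3])

# The inner join keeps only labels present in every input
result = pd.concat([s1, s4], axis=1, join="inner")
print(result)
```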


<h3><b>Task 10 </b></h3>
<p>
A potential issue with the previous code is that the concatenated pieces are not identifiable in the result.  </br>
Suppose instead you wanted to create a hierarchical index on the concatenation axis.  </br>
To do this, use the 'keys' argument: run the code below and analyze the result. </br>
Think about how you would use this functionality in real life.  </br>


</p>


In [14]:
result = pd.concat([s1, s2, s3], keys=["one", "two", "three"])
print(result)
print(result.unstack())

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: Int64
          a     b     c     d     e     f     g
one       0     1  <NA>  <NA>  <NA>  <NA>  <NA>
two    <NA>  <NA>     2     3     4  <NA>  <NA>
three  <NA>  <NA>  <NA>  <NA>  <NA>     5     6
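
<p>
The 'keys' argument also works along axis=1, where the keys become column headers rather than an outer index level. A minimal sketch on the same Series:
</p>

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

# Along axis=0 the keys form the outer level of a MultiIndex
stacked = pd.concat([s1, s2, s3], keys=["one", "two", "three"])
print(stacked.loc["two"])  # select one of the original pieces by its key

# Along axis=1 the keys become the column names
wide = pd.concat([s1, s2, s3], axis=1, keys=["one", "two", "three"])
print(wide.columns.tolist())
```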


<h3><b>Task 11 </b></h3>
<p>
Combining Data with Overlap. </br>
Use the numpy.where() function to produce an output array in which NA values in Series 'a' are replaced with values from Series 'b', without checking whether the index labels are aligned.</br>
</p>


In [16]:
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=["f", "e", "d", "c", "b", "a"])

b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=["a", "b", "c", "d", "e", "f"])

# Use numpy.where without aligning on index
result = pd.Series(np.where(pd.isna(a.values), b.values, a.values), index=a.index)

print(result)


f    0.0
e    2.5
d    0.0
c    3.5
b    4.5
a    5.0
dtype: float64
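
<p>
Because numpy.where() works purely by position, the value placed at label 'f' above comes from b's first element, which is labeled 'a'. The sketch below contrasts this with a label-aligned version. The explicit reindex is one way to align and is shown here only for comparison:
</p>

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=["f", "e", "d", "c", "b", "a"])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=["a", "b", "c", "d", "e", "f"])

# Positional: the i-th element of b fills the i-th element of a
positional = pd.Series(np.where(a.isna(), b.to_numpy(), a.to_numpy()),
                       index=a.index)

# Label-aligned: reorder b to a's labels before filling
b_aligned = b.reindex(a.index)
aligned = pd.Series(np.where(a.isna(), b_aligned.to_numpy(), a.to_numpy()),
                    index=a.index)

print(positional["f"], aligned["f"])
```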



<h3><b>Task 12 </b></h3>
<p>
What if you want to line up the values by index?  </br>
Use the Series.combine_first() method and analyze the result. </br>

</p>


In [18]:
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=["f", "e", "d", "c", "b", "a"])

b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=["a", "b", "c", "d", "e", "f"])

# Combine a with b using combine_first (aligns by index)
result = a.combine_first(b)

print(result)


a    0.0
b    4.5
c    3.5
d    0.0
e    2.5
f    5.0
dtype: float64
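
<p>
For comparison, Series.fillna() also aligns on index labels when given another Series. In this example it fills the same values as combine_first(), but it keeps the caller's index order instead of returning a sorted union of the two indexes. A sketch on the same data:
</p>

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=["f", "e", "d", "c", "b", "a"])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=["a", "b", "c", "d", "e", "f"])

combined = a.combine_first(b)  # index is the sorted union of both indexes
filled = a.fillna(b)           # index order of 'a' is preserved

print(filled.index.tolist())
print(combined.index.tolist())
```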


<h3><b>Task 13 </b></h3>
<p>
Now run DataFrame.combine_first() on the DataFrames below, and analyze the result and how this method lines up the values by index. </br>
</p>


In [19]:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "a": [1., np.nan, 5., np.nan],
    "b": [np.nan, 2., np.nan, 6.],
    "c": range(2, 18, 4)
})

df2 = pd.DataFrame({
    "a": [5., 4., np.nan, 3., 7.],
    "b": [np.nan, 3., 4., 6., 8.]
})

# Combine with index and column alignment
result = df1.combine_first(df2)
print(result)


     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
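
<p>
One detail worth noticing in the output: the result covers the union of both row indexes and both column sets. Column 'c' exists only in df1, which has no row 4, so that cell is missing and the integer column is shown as floats. A brief check on the same data:
</p>

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "a": [1., np.nan, 5., np.nan],
    "b": [np.nan, 2., np.nan, 6.],
    "c": range(2, 18, 4)
})
df2 = pd.DataFrame({
    "a": [5., 4., np.nan, 3., 7.],
    "b": [np.nan, 3., 4., 6., 8.]
})

result = df1.combine_first(df2)

# Row 4 exists only in df2, which has no column 'c'
print(result.index.tolist())
print(result["c"].dtype)
print(result.loc[4, "c"])
```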
