# Missing Data

When a value is missing from the data being read, pandas usually represents it as `NaN` or `None`.
- This section covers the main ways to handle such missing values.
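
As a small illustration of the point above, the sketch below reads a CSV snippet that has empty fields and counts the missing cells. The CSV text and the names `csv_text`, `sample`, and `missing_per_column` are made up for this example; `pd.read_csv` and `isna()` are standard pandas.

```python
import io
import pandas as pd

# Hypothetical CSV text with three empty fields (made up for this example)
csv_text = "A,B,C\n1,5,7\n2,,8\n,,3\n"

# Empty fields are read as NaN
sample = pd.read_csv(io.StringIO(csv_text))

# isna() marks every missing cell with True; summing counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```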

In [1]:
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)

In [2]:
d = {'A' : [1, 2, np.nan],
    'B'  : [5, np.nan, np.nan],
    'C'  : [7,8,3]}
d

{'A': [1, 2, nan], 'B': [5, nan, nan], 'C': [7, 8, 3]}

In [3]:
df = pd.DataFrame(d)
df

Unnamed: 0,A,B,C
0,1.0,5.0,7
1,2.0,,8
2,,,3


### Methods related to NaN

## 1. `df.dropna()`
- Drops every row that contains one or more **NaN values**

In [4]:
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,7


To drop the columns that contain any NaN values instead, pass `axis=1`:

In [5]:
df.dropna(axis=1)

Unnamed: 0,C
0,7
1,8
2,3


**Dropping by a threshold:** `thresh=<integer>` sets the minimum number of non-NaN values a row (or a column, with `axis=1`) must have in order to be kept

In [6]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,7
1,2.0,,8
2,,,3


In [7]:
df.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,5.0,7
1,2.0,,8


In [8]:
df.dropna(axis=1, thresh=2)

Unnamed: 0,A,C
0,1.0,7
1,2.0,8
2,,3
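
To make the meaning of `thresh` concrete, the sketch below rebuilds the same small DataFrame and reproduces `dropna(thresh=2)` by hand: count the non-NaN values in each row and keep the rows that reach the threshold. The names `df_demo`, `non_nan_per_row`, and `manual` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan],
                        'B': [5, np.nan, np.nan],
                        'C': [7, 8, 3]})

# Count the non-NaN values in each row
non_nan_per_row = df_demo.notna().sum(axis=1)

# Keep rows with at least 2 non-NaN values (manual equivalent of thresh=2)
manual = df_demo[non_nan_per_row >= 2]

# Compare with the built-in behaviour
print(manual.equals(df_demo.dropna(thresh=2)))
```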


## 2. `df.fillna()`
- Fills NaN cells with a given value and returns a new object (the original is unchanged unless it is reassigned)

- Using zero to fill NaN

In [9]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,7
1,2.0,,8
2,,,3


In [10]:
df.fillna(0)

Unnamed: 0,A,B,C
0,1.0,5.0,7
1,2.0,0.0,8
2,0.0,0.0,3


Filling NaN values with the mean of the column

In [11]:
df.A.fillna(value=df.A.mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
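
The same idea can be applied to every column at once. This is a minimal sketch on a rebuilt copy of the DataFrame (`df_demo` and `filled` are names chosen for this example): passing the Series returned by `mean()` to `fillna()` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan],
                        'B': [5, np.nan, np.nan],
                        'C': [7, 8, 3]})

# df_demo.mean() is a Series indexed by column name;
# fillna() matches those names against the columns
filled = df_demo.fillna(df_demo.mean())
print(filled)

# fillna() returned a new DataFrame; df_demo still contains its NaN values
print(df_demo.isna().sum().sum())
```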

---

# Group By

In [12]:
dat = {'Company': ['Mckinsey', 'Google','Bain & company','Bain & company','Google', 'Google','Bain & company','Bain & company','Mckinsey'],
      'Person'  : ['Anil', 'Shruti', 'Anil', 'Vanya', 'Sarah', 'Jasmine', 'Varsha', 'Storm', 'Anil'],
      'Awards'  : [25, 12, 23, 23, 34, 34, 23, 23, 34]}
print(dat)

{'Company': ['Mckinsey', 'Google', 'Bain & company', 'Bain & company', 'Google', 'Google', 'Bain & company', 'Bain & company', 'Mckinsey'], 'Person': ['Anil', 'Shruti', 'Anil', 'Vanya', 'Sarah', 'Jasmine', 'Varsha', 'Storm', 'Anil'], 'Awards': [25, 12, 23, 23, 34, 34, 23, 23, 34]}


In [13]:
df = pd.DataFrame(dat)
df

Unnamed: 0,Company,Person,Awards
0,Mckinsey,Anil,25
1,Google,Shruti,12
2,Bain & company,Anil,23
3,Bain & company,Vanya,23
4,Google,Sarah,34
5,Google,Jasmine,34
6,Bain & company,Varsha,23
7,Bain & company,Storm,23
8,Mckinsey,Anil,34


#### Group by company

In [14]:
df.groupby('Company')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x00000125AC3E8BE0>
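
The GroupBy object printed above does not show any data by itself. As a rough sketch of what it holds, the example below iterates over a smaller, made-up table (`awards` is a hypothetical subset of the data above) and prints each group.

```python
import pandas as pd

# Hypothetical, shortened version of the awards table
awards = pd.DataFrame({
    'Company': ['Mckinsey', 'Google', 'Bain & company', 'Google'],
    'Awards': [25, 12, 23, 34]})

grouped = awards.groupby('Company')

# Iterating yields (group key, sub-DataFrame) pairs, sorted by key by default
for company, rows in grouped:
    print(company, rows['Awards'].tolist())

# Number of rows in each group
group_sizes = grouped.size()
print(group_sizes)
```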

Calling an aggregate function on the GroupBy object produces the results

In [15]:
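# numeric_only=True skips the string column 'Person'; newer pandas versions raise an error without it
df.groupby('Company').mean(numeric_only=True)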

Company,Awards
Bain & company,23.0
Google,26.666667
Mckinsey,29.5


In [16]:
# numeric_only=True restricts the sum to the numeric 'Awards' column
df.groupby('Company').sum(numeric_only=True)

Company,Awards
Bain & company,92
Google,80
Mckinsey,59


In [17]:
df.groupby('Person').sum(numeric_only=True)

Person,Awards
Anil,82
Jasmine,34
Sarah,34
Shruti,12
Storm,23
Vanya,23
Varsha,23


In [18]:
df.groupby(['Company','Person']).sum()

Company,Person,Awards
Bain & company,Anil,23
Bain & company,Storm,23
Bain & company,Vanya,23
Bain & company,Varsha,23
Google,Jasmine,34
Google,Sarah,34
Google,Shruti,12
Mckinsey,Anil,59


In [19]:
df.shape

(9, 3)

In [20]:
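# numeric_only=True skips the string column 'Person'; newer pandas versions raise an error without it
df.groupby('Company').std(numeric_only=True)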

Company,Awards
Bain & company,0.0
Google,12.701706
Mckinsey,6.363961


In [21]:
df.groupby('Company').sum(numeric_only=True).loc['Google']

Awards    80
Name: Google, dtype: int64

**Using the `count()` aggregate function**

In [22]:
df.groupby('Company').count()

Company,Person,Awards
Bain & company,4,4
Google,3,3
Mckinsey,2,2


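Finding **max() and min()** in the groups
- These give the maximum and minimum value of each column within every group of the groupby; each column is reduced independently, and string columns such as `Person` are compared alphabetically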

In [23]:
df.groupby('Company').max()

Company,Person,Awards
Bain & company,Varsha,23
Google,Shruti,34
Mckinsey,Anil,34


In [24]:
df.groupby('Company').min()

Company,Person,Awards
Bain & company,Anil,23
Google,Jasmine,12
Mckinsey,Anil,25
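
Because `max()` and `min()` reduce each column on its own, a `Person` and an `Awards` value shown in the same output row need not come from the same original row. When only the numeric column matters, one option is to select it first and request both aggregates together. The sketch below uses a made-up four-row table (`awards` and `summary` are hypothetical names).

```python
import pandas as pd

# Hypothetical, shortened version of the awards table
awards = pd.DataFrame({
    'Company': ['Mckinsey', 'Google', 'Google', 'Mckinsey'],
    'Person': ['Anil', 'Shruti', 'Sarah', 'Anil'],
    'Awards': [25, 12, 34, 34]})

# Select the numeric column first, then apply several aggregates at once
summary = awards.groupby('Company')['Awards'].agg(['min', 'max'])
print(summary)
```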


### `describe()` method for each group

In [25]:
df.groupby('Company').describe()

,Awards,Awards,Awards,Awards,Awards,Awards,Awards,Awards
Company,count,mean,std,min,25%,50%,75%,max
Bain & company,4.0,23.0,0.0,23.0,23.0,23.0,23.0,23.0
Google,3.0,26.666667,12.701706,12.0,23.0,34.0,34.0,34.0
Mckinsey,2.0,29.5,6.363961,25.0,27.25,29.5,31.75,34.0


To transpose the result, chain the `transpose()` method:

In [26]:
df.groupby('Company').describe().transpose()

Unnamed: 0,Company,Bain & company,Google,Mckinsey
Awards,count,4.0,3.0,2.0
Awards,mean,23.0,26.666667,29.5
Awards,std,0.0,12.701706,6.363961
Awards,min,23.0,12.0,25.0
Awards,25%,23.0,23.0,27.25
Awards,50%,23.0,34.0,29.5
Awards,75%,23.0,34.0,31.75
Awards,max,23.0,34.0,34.0


Or, to focus on only a few companies, select their rows by position:

In [27]:
df.groupby('Company').describe().iloc[[0,2],:]

,Awards,Awards,Awards,Awards,Awards,Awards,Awards,Awards
Company,count,mean,std,min,25%,50%,75%,max
Bain & company,4.0,23.0,0.0,23.0,23.0,23.0,23.0,23.0
Mckinsey,2.0,29.5,6.363961,25.0,27.25,29.5,31.75,34.0
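
Selecting by position depends on the sorted order of the group keys. A label-based alternative with `.loc` is sketched below on a made-up table (`awards`, `stats`, `by_label`, and `by_position` are names chosen for this example); both selections return the same rows here.

```python
import pandas as pd

# Hypothetical, shortened version of the awards table
awards = pd.DataFrame({
    'Company': ['Mckinsey', 'Google', 'Bain & company',
                'Bain & company', 'Google', 'Mckinsey'],
    'Awards': [25, 12, 23, 23, 34, 34]})

stats = awards.groupby('Company').describe()

# Label-based selection: does not depend on the row order of the result
by_label = stats.loc[['Bain & company', 'Mckinsey']]

# Position-based selection, as in the cell above
by_position = stats.iloc[[0, 2], :]

print(by_label.equals(by_position))
```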


---

# Merging DataFrames

In [28]:
df1 = pd.DataFrame({'A':['A0', 'A1', 'A2', 'A3'],
                   'B' :['B0', 'B1', 'B2', 'B3'],
                   'C' :['C0', 'C1', 'C2', 'C3'],
                   'D' :['D0', 'D1', 'D2', 'D3']})
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [29]:
df2 =pd.DataFrame({'A':['A4', 'A5', 'A6', 'A7'],
                   'B' :['B4', 'B5', 'B6', 'B7'],
                   'C' :['C4', 'C5', 'C6', 'C7'],
                   'D' :['D4', 'D5', 'D6', 'D7']},
                 index = [4, 5, 6, 7])
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [30]:
df3 =pd.DataFrame({'A':['A8', 'A9', 'A10', 'A11'],
                   'B' :['B8', 'B9', 'B10', 'B11'],
                   'C' :['C8', 'C9', 'C10', 'C11'],
                   'D' :['D8', 'D9', 'D10', 'D11']},
                 index = [8, 9, 10, 11])
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


### Concatenation
- DataFrames are stacked one after another
- `pd.concat([df1, df2])` takes a list of DataFrames and joins them along an axis; columns are aligned by name, so the DataFrames should share the same columns to avoid NaN-filled gaps

In [31]:
pd.concat([df1, df2, df3])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [32]:
pd.concat([df1,df1])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
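
Concatenating a DataFrame with itself, as above, leaves duplicate index labels in the result. If a fresh 0..n-1 index is preferred, `pd.concat` accepts `ignore_index=True`. A minimal sketch on a made-up two-row DataFrame (`df_a` is a hypothetical name):

```python
import pandas as pd

df_a = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})

# Default: the original index labels are kept, so they repeat
stacked = pd.concat([df_a, df_a])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([df_a, df_a], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```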


- To concatenate along the columns instead, pass `axis=1`
- Rows are aligned on the index; where a row label is missing from one of the DataFrames, its cells are filled with NaN

In [33]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1
0,A0,B0,C0,D0,,,,
1,A1,B1,C1,D1,,,,
2,A2,B2,C2,D2,,,,
3,A3,B3,C3,D3,,,,
4,,,,,A4,B4,C4,D4
5,,,,,A5,B5,C5,D5
6,,,,,A6,B6,C6,D6
7,,,,,A7,B7,C7,D7


In [34]:
# concatenating same dataframe with itself
pd.concat([df1,df1], axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1
0,A0,B0,C0,D0,A0,B0,C0,D0
1,A1,B1,C1,D1,A1,B1,C1,D1
2,A2,B2,C2,D2,A2,B2,C2,D2
3,A3,B3,C3,D3,A3,B3,C3,D3


## Merging

pandas uses `merge()` to perform SQL-style joins on key columns

In [35]:
left = pd.DataFrame({'key':['K0', 'K1', 'K2', 'K3'],
                    'A'   :['A0', 'A1', 'A2', 'A3'],
                    'B'   :['B0', 'B1', 'B2', 'B3'] })
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [36]:
right = pd.DataFrame({'key':['K0', 'K1', 'K2', 'K3'],
                       'C' :['C0', 'C1', 'C2', 'C3'],
                       'D' :['D0', 'D1', 'D2', 'D3']})
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3


In [37]:
pd.merge(left, right, on='key')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
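
In the example above every key appears in both DataFrames, so the join type makes no difference. The sketch below uses two made-up DataFrames with only partly overlapping keys (`left_demo` and `right_demo` are hypothetical names) to show how the `how` parameter of `merge()` changes the result.

```python
import pandas as pd

left_demo = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                          'A': ['A0', 'A1', 'A2']})
right_demo = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
                           'C': ['C1', 'C2', 'C3']})

# Default how='inner': only keys present in both DataFrames
inner = pd.merge(left_demo, right_demo, on='key')

# how='outer': every key from either side, NaN where a side has no match
outer = pd.merge(left_demo, right_demo, on='key', how='outer')

# how='left': every key from the left DataFrame
left_join = pd.merge(left_demo, right_demo, on='key', how='left')

print(inner)
print(outer)
print(left_join)
```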


## Joins

By default, `join()` combines DataFrames on their index rather than on a column.
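
A minimal sketch of an index-based join follows. The two DataFrames and their index labels are made up for this illustration (`left_idx` and `right_idx` are hypothetical names); `DataFrame.join` and its `how` parameter are standard pandas.

```python
import pandas as pd

left_idx = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                         'B': ['B0', 'B1', 'B2']},
                        index=['K0', 'K1', 'K2'])
right_idx = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                          'D': ['D0', 'D2', 'D3']},
                         index=['K0', 'K2', 'K3'])

# join() aligns rows on the index; the default is a left join,
# so 'K1' is kept and gets NaN in columns C and D
joined_left = left_idx.join(right_idx)

# how='outer' keeps every index label from both DataFrames
joined_outer = left_idx.join(right_idx, how='outer')

print(joined_left)
print(joined_outer)
```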