**<h1><center>Pandas - Missing Data</center></h1>**

Data cleaning is the process of transforming raw data into meaningful data. In a data science project, data cleaning and data analysis are usually the most time-consuming parts. The first step in data cleaning is treating null values.

Suppose we are collecting data for a project. Some people will share all the information requested, while others will leave some of it out, and in that case missing values occur. Treating missing values is an early and important step in any data analysis project.

Pandas DataFrames provide functions for three tasks when treating missing values:
1. Find missing values

2. Drop missing values

3. Replace missing values


**<u>Finding missing values</u>**

To identify missing values, Pandas provides the `isnull()` and `isna()` functions. The two do the same thing (`isnull()` is an alias of `isna()`): each returns a DataFrame of boolean values in which True marks a missing value and False marks a present one.



In [None]:
# Note that NaN (Not a Number) fills the cells where a key is missing.
import pandas as pd

data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)
print(df)

print(df.isna())

print(df.isnull())

   a   b     c
0  1   2   NaN
1  5  10  20.0
       a      b      c
0  False  False   True
1  False  False  False
       a      b      c
0  False  False   True
1  False  False  False
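To make it clearer what counts as missing, here is a minimal sketch. The Series `values` and its contents are hypothetical and chosen only for illustration. It suggests that `None` and `NaN` are both flagged, while an empty string is treated as an ordinary value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
values = pd.Series([1.0, np.nan, None, "", "text"], dtype="object")

# None and NaN are flagged as missing; the empty string is not.
missing_mask = values.isna()
print(missing_mask.tolist())

# notna() is the element-wise inverse of isna().
present_mask = values.notna()
print(present_mask.tolist())
```

If empty strings should also count as missing in your data, they would need to be converted to `NaN` first.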


`df.isnull().sum()` or `df.isna().sum()` returns the count of null values in each column.

`df.isna().any()` or `df.isnull().any()` returns a boolean value for each column: True if there is at least one missing value in that column, otherwise False.

In [None]:

print(df.isna().sum())

print(df.isnull().sum())

a    0
b    0
c    1
dtype: int64
a    0
b    0
c    1
dtype: int64


In [None]:

print(df.isna().any())

print(df.isnull().any())

a    False
b    False
c     True
dtype: bool
a    False
b    False
c     True
dtype: bool
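The same boolean mask can be summarized in other ways. The sketch below rebuilds the two-row DataFrame used above; the variable names `missing_per_row`, `total_missing`, and `missing_fraction` are introduced here for illustration only.

```python
import pandas as pd

data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# Missing values per row (axis=1 sums across the columns of each row).
missing_per_row = df.isna().sum(axis=1)

# Total number of missing cells in the whole DataFrame.
total_missing = int(df.isna().sum().sum())

# Fraction of missing values in each column (mean of True/False).
missing_fraction = df.isna().mean()

print(missing_per_row)
print(total_missing)
print(missing_fraction)
```

The per-column fraction is often a convenient way to decide whether a column is worth keeping or should be dropped.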


**<u>Drop missing values</u>**
The `dropna` function drops rows or columns that contain null values.

```
Syntax : df.dropna(axis = 0,
                  how = "any")
         df.dropna(axis = 0,
                  thresh = 3)

          axis : specifies whether rows (0) or columns (1) should be dropped.
          how :  any drop if there is any missing value
                 all drop if all values are missing
          thresh : keeps only the rows or columns that have at least thresh non-null values.
                   how and thresh cannot be passed together.
```

In [None]:
df.dropna(axis = 0,
          thresh = 2)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [None]:
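# how and thresh cannot be combined; recent pandas versions raise a TypeError.
df.dropna(how = "any",
          axis = 0,
          thresh = 2)

TypeError: You cannot set both the how and thresh arguments at the same time.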

In [None]:
# setting axis = 0 for rows
df.dropna(how = "all",
          axis = 0)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [None]:
# setting axis = 1 for columns
df.dropna(how = "any",
          axis = 1)

   a   b
0  1   2
1  5  10
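Because `thresh` counts non-missing values, a short sketch may help. It rebuilds the same two-row DataFrame and also uses the `subset` parameter of `dropna`, which is not covered above; the variable names are illustrative.

```python
import pandas as pd

data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# thresh=3 keeps only rows with at least 3 non-missing values.
# Row 0 has 2 non-missing values, so it is dropped.
at_least_three = df.dropna(axis=0, thresh=3)

# subset restricts the check to the listed columns.
# No row is missing 'a' or 'b', so nothing is dropped.
checked_ab = df.dropna(subset=['a', 'b'])

print(at_least_three)
print(checked_ab)
```

Note that `dropna` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.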


**<u>Replace missing values</u>**

The `fillna` function replaces null values with a given value, such as a constant or an aggregate of the column (mean, median, or mode).

```
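syntax :
        df.fillna(value) or

        df.fillna(axis, method)

        value  :  the replacement value

        axis : the axis along which to fill: 0 fills down each column (row to row),
               1 fills across each row (column to column)

        method : ffill stands for forward fill and replaces a missing value with the value in the
                  previous row (when axis = 0). bfill stands for backward fill and uses the value
                  in the next row. Recent pandas versions deprecate the method argument in favor of
                  df.ffill() and df.bfill().
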
```

In [None]:
# Note that NaN (Not a Number) fills the cells where a key is missing.
import pandas as pd

data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20},
        {'a': 5, 'b': 10, 'c': 20, 'd': 40}]

df = pd.DataFrame(data)
print(df)

   a   b     c     d
0  1   2   NaN   NaN
1  5  10  20.0   NaN
2  5  10  20.0  40.0


In [None]:
df.fillna(axis = 0, method = "bfill")

   a   b     c     d
0  1   2  20.0  40.0
1  5  10  20.0  40.0
2  5  10  20.0  40.0


In [None]:
df.fillna(axis = 0, method = "bfill", limit = 1)

   a   b     c     d
0  1   2  20.0   NaN
1  5  10  20.0  40.0
2  5  10  20.0  40.0
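In recent pandas versions the `method` argument of `fillna` is deprecated, and the dedicated `df.bfill()` and `df.ffill()` methods do the same job. The sketch below rebuilds the three-row DataFrame and repeats the two fills above with them; the variable names are illustrative.

```python
import pandas as pd

data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20},
        {'a': 5, 'b': 10, 'c': 20, 'd': 40}]
df = pd.DataFrame(data)

# Backward fill: each gap takes the next valid value below it.
filled_back = df.bfill()

# limit=1 fills at most one consecutive gap per column,
# so the top cell of column 'd' stays missing.
filled_back_once = df.bfill(limit=1)

# Forward fill copies values downward; there is nothing above
# the leading gaps here, so they all remain.
filled_forward = df.ffill()

print(filled_back)
print(filled_back_once)
print(filled_forward)
```

Forward fill is usually the natural choice for ordered data such as time series, where the last known value is carried forward.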


In [None]:
df['c'].fillna(20.0)

0    20.0
1    20.0
2    20.0
Name: c, dtype: float64

In [None]:
df['d'].fillna(40.0)

0    40.0
1    40.0
2    40.0
Name: d, dtype: float64
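Filling with an aggregate such as the mean, median, or mode follows the same pattern: compute the statistic, then pass it to `fillna`. This is a minimal sketch on a hypothetical Series named `scores`, with values invented for illustration.

```python
import pandas as pd

# Hypothetical column with one gap, invented for illustration.
scores = pd.Series([10.0, 20.0, 20.0, None, 50.0], name="scores")

# Aggregates skip missing values by default.
filled_mean = scores.fillna(scores.mean())      # mean of 10, 20, 20, 50 is 25
filled_median = scores.fillna(scores.median())  # median of 10, 20, 20, 50 is 20

# mode() returns a Series because ties are possible; take the first value.
filled_mode = scores.fillna(scores.mode()[0])

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

The median tends to be the safer choice when a column has outliers, since a single extreme value (such as 50 here) pulls the mean upward.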