<center>
    <h1 id='handling-with-missing-values' style='color:#7159c1'>🔨 Handling Missing Values 🔨</h1>
    <i>Three ways to deal with Missing Values</i>
</center>

```
- Dropping
- Imputation
- Extended Imputation
```

> **Observation** - `when dealing with missing values, always check whether the value is truly missing and must be replaced, or whether it is legitimately absent. For example, suppose a dataset has a variable that records the number of children each couple has. When we stumble upon a missing value in this variable, we have to check whether the data is really missing or the couple simply doesn't have any children`.

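The check described above can be sketched on a small toy frame. The column names `has_children` and `n_children` and all values are hypothetical, chosen only to illustrate the idea that a blank count is sometimes a legitimate zero rather than a gap to impute.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: both column names and all values are made up
couples_df = pd.DataFrame({
    'has_children': [True, False, True, False],
    'n_children':   [2.0, np.nan, np.nan, np.nan],
})

# A blank count for a couple without children is a legitimate zero
legit_zero = couples_df['n_children'].isnull() & ~couples_df['has_children']
couples_df.loc[legit_zero, 'n_children'] = 0

# Whatever is still null is truly missing and is a candidate for imputation
truly_missing = couples_df['n_children'].isnull()
print(truly_missing.tolist())
```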
---

<h1 id='0-dropping' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Dropping</h1>

`Dropping` missing values is rarely the best choice, because the dataset loses information that may be very useful for training the model. Before dropping anything, make sure the affected data is not important to the problem. If it is important, consider one of the other two options: Imputation or Extended Imputation.

For educational purposes only, let's see how dropping missing values works.

In [1]:
# ---- Reading Dataset ----
import pandas as pd # pip install pandas

houses_df = pd.read_csv('./datasets/melb_data.csv')
houses_df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [6]:
# ---- Dropping Columns with Missing Values ----
cols_with_missing_values = [
    col for col in houses_df.columns
    if houses_df[col].isnull().any()
]
print(f'- Columns with Missing Values: {cols_with_missing_values}')
print('---')

houses_without_missing_null_columns_df = houses_df.copy()
houses_without_missing_null_columns_df.drop(cols_with_missing_values, axis=1, inplace=True)
houses_without_missing_null_columns_df.head()

- Columns with Missing Values: ['Car', 'BuildingArea', 'YearBuilt', 'CouncilArea']
---


Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,202.0,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,156.0,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,134.0,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,94.0,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,120.0,-37.8072,144.9941,Northern Metropolitan,4019.0

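Before dropping every column that has any gap, it may help to look at how much of each column is actually missing. The sketch below uses a hypothetical toy frame in place of `houses_df`, and the 50% cut-off is an arbitrary assumption, not a recommended value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for houses_df
toy_df = pd.DataFrame({
    'Rooms':        [2, 2, 3, 3],
    'Car':          [1.0, 0.0, np.nan, 1.0],
    'BuildingArea': [np.nan, 79.0, np.nan, np.nan],
})

# Fraction of missing values per column (isnull() gives booleans, mean() averages them)
missing_fraction = toy_df.isnull().mean()

# Assumed cut-off: drop a column only when more than half of it is missing
MAX_MISSING_FRACTION = 0.5
columns_to_drop = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()

reduced_df = toy_df.drop(columns=columns_to_drop)
print(columns_to_drop)
```

With this approach a column such as `Car`, which has only a few gaps, is kept and can be imputed later instead of being discarded.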

In [7]:
# ---- Dropping Rows with Missing Values ----
houses_without_null_rows_df = houses_df.copy()
houses_without_null_rows_df.dropna(inplace=True)
houses_without_null_rows_df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0
6,Abbotsford,124 Yarra St,3,h,1876000.0,S,Nelson,7/05/2016,2.5,3067.0,...,2.0,0.0,245.0,210.0,1910.0,Yarra,-37.8024,144.9993,Northern Metropolitan,4019.0
7,Abbotsford,98 Charles St,2,h,1636000.0,S,Nelson,8/10/2016,2.5,3067.0,...,1.0,2.0,256.0,107.0,1890.0,Yarra,-37.806,144.9954,Northern Metropolitan,4019.0

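To see how much information row dropping costs, the number of rows before and after can be compared. The following is a minimal sketch on a hypothetical toy frame; treating `Price` as the only essential column is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for houses_df
toy_df = pd.DataFrame({
    'Price':        [1480000.0, 1035000.0, np.nan, 850000.0],
    'Car':          [1.0, 0.0, 0.0, 1.0],
    'BuildingArea': [np.nan, 79.0, 150.0, np.nan],
})

# dropna() with no arguments keeps only rows where every column is filled
all_complete_df = toy_df.dropna()
rows_lost = len(toy_df) - len(all_complete_df)

# Restricting the check to columns assumed essential keeps more rows
price_known_df = toy_df.dropna(subset=['Price'])

print(rows_lost, len(price_known_df))
```

Here the unrestricted `dropna()` discards three of the four rows, while `subset=['Price']` discards only one, leaving the remaining gaps to be imputed.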

<h1 id='1-imputation' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Imputation</h1>

`Imputation` is one of the best ways to deal with missing values because they are replaced by expected values, such as the mean or median for numerical variables and the mode for categorical variables.

In [12]:
# ---- Imputation ----
#
# - Strategy Parameters:
#    \ mean
#    \ median
#    \ most_frequent (mode)
#    \ constant
#
from sklearn.impute import SimpleImputer # pip install scikit-learn

imputer = SimpleImputer(strategy='mean')
numerical_variables = [
    column for column in houses_df.columns
    if houses_df[column].dtype in ['int64', 'int32', 'float64', 'float32']
]

imputed_houses_df = houses_df[numerical_variables].copy()
imputed_houses_df = pd.DataFrame(imputer.fit_transform(imputed_houses_df))

# fit_transform returns a NumPy array, so the columns' names are lost and we have to get them back
imputed_houses_df.columns = numerical_variables
imputed_houses_df.head()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,2.0,1480000.0,2.5,3067.0,2.0,1.0,1.0,202.0,151.96765,1964.684217,-37.7996,144.9984,4019.0
1,2.0,1035000.0,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0
2,3.0,1465000.0,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0
3,3.0,850000.0,2.5,3067.0,3.0,2.0,1.0,94.0,151.96765,1964.684217,-37.7969,144.9969,4019.0
4,4.0,1600000.0,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0

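The cell above covers only numerical variables. For categorical variables, mode imputation can be sketched with plain pandas, which is a swapped-in alternative to `SimpleImputer(strategy='most_frequent')` rather than the notebook's own method. The toy values below are made up; only the column names `CouncilArea` and `Type` come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy_df = pd.DataFrame({
    'CouncilArea': ['Yarra', np.nan, 'Yarra', 'Moreland'],
    'Type':        ['h', 'u', np.nan, 'h'],
})

filled_df = toy_df.copy()
for column in filled_df.columns:
    # mode() ignores missing values by default; [0] picks the first mode on ties
    most_frequent_value = filled_df[column].mode()[0]
    filled_df[column] = filled_df[column].fillna(most_frequent_value)

print(filled_df)
```

When several values tie for most frequent, `mode()` returns them sorted, so this sketch picks the first one in sort order.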

<h1 id='2-extended-imputation' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>2 | Extended Imputation</h1>

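`Extended Imputation` works like Imputation, replacing the missing values with expected ones, with the addition that a new variable is added to the dataset for each variable that had missing values. This variable tells whether the row got that specific variable imputed (True) or not (False). In the output below, the flags appear as 1.0 and 0.0 because the imputer returns everything as floating-point numbers.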

Take care when the dataset contains a lot of variables with missing values: since one new column is added for each of them, the dataset can become really large.

In [13]:
# ---- Extended Imputation ----
#
# - Strategy Parameters:
#    \ mean
#    \ median
#    \ most_frequent (mode)
#    \ constant
#
from sklearn.impute import SimpleImputer # pip install scikit-learn

imputer = SimpleImputer(strategy='mean')

numerical_variables = [
    column for column in houses_df.columns
    if houses_df[column].dtype in ['int64', 'int32', 'float64', 'float32']
]

extended_imputation_df = houses_df[numerical_variables].copy()

In [16]:
# ---- Extended Imputation ----

# Adding new columns to indicate imputation and setting their values
variables_with_missing_values = [
    column for column in extended_imputation_df.columns
    if extended_imputation_df[column].isnull().any()
]

for variable in variables_with_missing_values:
    extended_imputation_df[f'{variable}_was_missing'] = extended_imputation_df[variable].isnull()

# Imputing Values
new_extended_imputation_df = pd.DataFrame(imputer.fit_transform(extended_imputation_df))
new_extended_imputation_df.columns = extended_imputation_df.columns
new_extended_imputation_df.head()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount,Car_was_missing,BuildingArea_was_missing,YearBuilt_was_missing
0,2.0,1480000.0,2.5,3067.0,2.0,1.0,1.0,202.0,151.96765,1964.684217,-37.7996,144.9984,4019.0,0.0,1.0,1.0
1,2.0,1035000.0,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0,0.0,0.0,0.0
2,3.0,1465000.0,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0,0.0,0.0,0.0
3,3.0,850000.0,2.5,3067.0,3.0,2.0,1.0,94.0,151.96765,1964.684217,-37.7969,144.9969,4019.0,0.0,1.0,1.0
4,4.0,1600000.0,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0,0.0,0.0,0.0

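As an alternative to building the `_was_missing` columns by hand, `SimpleImputer` accepts `add_indicator=True`, which appends one indicator column per feature that had missing values during fitting. The sketch below uses a hypothetical toy frame; the `_was_missing` suffix simply mirrors the naming used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for the numerical part of houses_df
toy_df = pd.DataFrame({
    'Car':          [1.0, np.nan, 2.0, 0.0],
    'Landsize':     [202.0, 156.0, 134.0, 94.0],
    'BuildingArea': [np.nan, 79.0, 150.0, np.nan],
})

# add_indicator=True appends the missing-value flags after the imputed features
imputer = SimpleImputer(strategy='mean', add_indicator=True)
values = imputer.fit_transform(toy_df)

# indicator_.features_ holds the positions of the columns that had missing values
flag_columns = [
    f'{toy_df.columns[position]}_was_missing'
    for position in imputer.indicator_.features_
]

result_df = pd.DataFrame(
    values,
    columns=list(toy_df.columns) + flag_columns,
    index=toy_df.index,
)
print(result_df)
```

Only `Car` and `BuildingArea` get a flag column here, since `Landsize` has no gaps. Passing `columns` and `index` to the constructor restores the labels in one step.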

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).