# Data Cleaning

## Objectives

*   Encode categorical variables and handle missing data
*   Clean data

## Inputs

* outputs/datasets/collection/HeritageHousing.csv

## Outputs

* Generate a pipeline that performs the data cleaning

## Conclusions

  * Data Cleaning Pipeline
  * Two variables have more than 90% missing values and can be dropped. Seven more variables have missing values, which can be replaced with each variable's median.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
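
As a side-effect-free illustration of the calls above, the sketch below applies `os.path.dirname()` to a made-up path without changing the real working directory. The path `/workspace/project/jupyter_notebooks` is hypothetical and not taken from this project.

```python
import os

# Hypothetical notebook folder, used only for illustration
example_dir = "/workspace/project/jupyter_notebooks"

# dirname() strips the last path component, which yields the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```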

---

# Load Collected data

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/HousePricesRecords.csv"))
df.head(3)

## Encode object variables:

#### There are four object variables:

**BsmtExposure:** Refers to walkout or garden level walls. Gd: Good Exposure; Av: Average Exposure; Mn: Minimum Exposure; No: No Exposure; None: No Basement

**BsmtFinType1:** Rating of basement finished area. GLQ: Good Living Quarters; ALQ: Average Living Quarters; BLQ: Below Average Living Quarters; Rec: Average Rec Room; LwQ: Low Quality; Unf: Unfinished; None: No Basement

**GarageFinish:** Interior finish of the garage. Fin: Finished; RFn: Rough Finished; Unf: Unfinished; None: No Garage

**KitchenQual:** Kitchen quality. Ex: Excellent; Gd: Good; TA: Typical/Average; Fa: Fair; Po: Poor

In [None]:
# Dictionary to map the categories of the four object variables to numbers.
# It maps None to zero in contrast to the dictionary used in the HouseSalePrices
# notebook where we discarded None values.
dic = {
    'BsmtExposure': {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'None': 0},
    'BsmtFinType1': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'None': 0},
    'GarageFinish': {'Fin': 3, 'RFn': 2, 'Unf': 1, 'None': 0},
    'KitchenQual': {'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'Po': 0},
}
df2 = df.copy()
for col in df.columns[df.dtypes=='object'].to_list():
    df2[col] = df2[col].replace(dic[col])
df2.head()
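
To make the effect of the encoding on missing values concrete, here is a minimal sketch on a toy frame. The rows are hypothetical, and it uses `Series.map` rather than the `replace` call above. The difference is that `map` turns any label absent from the mapping into NaN, whereas `replace` leaves such labels untouched. In both cases a genuinely missing value stays missing, so it can still be imputed later.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows); np.nan marks a genuinely missing value
toy = pd.DataFrame({
    "KitchenQual": ["Ex", "TA", "Gd"],
    "GarageFinish": ["Fin", np.nan, "None"],
})

# Subset of the mapping used above
toy_dic = {
    "KitchenQual": {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "Po": 0},
    "GarageFinish": {"Fin": 3, "RFn": 2, "Unf": 1, "None": 0},
}

toy_encoded = toy.copy()
for col, mapping in toy_dic.items():
    # map() yields NaN for missing values and for labels not in the mapping
    toy_encoded[col] = toy_encoded[col].map(mapping)

print(toy_encoded)
```

Note that the string label `'None'` (no garage) is encoded as 0, while the missing entry remains NaN.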

### Perform the same encoding as above but with a custom transformer that can be used in a pipeline

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
# Define a custom transformer class with fit and transform methods
class MyCustomEncoder(BaseEstimator, TransformerMixin):

  def __init__(self, variables, dic):
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables
    self.dic = dic

  def fit(self, X, y=None):    
    return self

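  def transform(self, X):
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    for col in self.variables:
      if X[col].dtype == 'object':
        # Use the mapping stored on the instance, not the global `dic`
        X[col] = X[col].replace(self.dic[col])
      else:
        print(f"Warning: {col} data type should be object to use MyCustomEncoder()")

    return X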

# use the custom encoder in a pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('custom_encoder', MyCustomEncoder(variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], dic=dic))])

df2 = df.copy()
df2 = pipeline.fit_transform(df2)
df2.head(3)

## Missing data

* Find out which variables have missing data

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

#### Explore the variables with missing data

In [None]:
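# Note: pandas_profiling has been renamed ydata-profiling; in newer
# environments use `from ydata_profiling import ProfileReport` instead.
from pandas_profiling import ProfileReport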
profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
profile.to_notebook_iframe()

### Drop variables

* The variables 'EnclosedPorch' and 'WoodDeckSF' have about 90% missing data and can be dropped.

In [None]:
from feature_engine.selection import DropFeatures
drop_features = DropFeatures(features_to_drop = ['EnclosedPorch', 'WoodDeckSF'])

df_transformed = drop_features.fit_transform(df)
df_transformed.info()
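
The share of missing values behind a drop decision like the one above can also be computed directly with pandas. The following is a minimal sketch on a toy frame: the values and the 0.8 cut-off are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 10 rows, one column mostly empty
toy = pd.DataFrame({
    "EnclosedPorch": [0.0] + [np.nan] * 9,
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan, 60.0, 75.0, 68.0, 72.0, 90.0],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077],
})

# isna().mean() gives the fraction of missing values per column
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cut-off above which a variable is considered for dropping
threshold = 0.8
cols_to_drop = missing_share[missing_share > threshold].index.to_list()

print(missing_share)
print(cols_to_drop)
```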

### Replace missing data with median

In [None]:
vars_with_missing_data

In [None]:
# Remove 'EnclosedPorch' and 'WoodDeckSF' from vars_with_missing_data since they are dropped
vars_with_missing_data = ['2ndFlrSF', 'BedroomAbvGr', 'BsmtFinType1', 'GarageFinish', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea']
from feature_engine.imputation import MeanMedianImputer

pipeline = Pipeline([
      ('drop_features', DropFeatures(features_to_drop = ['EnclosedPorch', 'WoodDeckSF'])),
      ('custom_encoder', MyCustomEncoder(variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], dic=dic)),
      ('median_imputer',  MeanMedianImputer(imputation_method='median', variables=vars_with_missing_data))
])

df2 = df.copy()
df_transformed = pipeline.fit_transform(df2) 
df_transformed.head(5)   

* Check that there is no missing data

In [None]:
vars_with_missing_data = df_transformed.columns[df_transformed.isna().sum() > 0].to_list()
vars_with_missing_data

* Print the computed median for each feature

In [None]:
pipeline['median_imputer'].imputer_dict_
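
For intuition, the median imputation step can be approximated with plain pandas. This is a sketch of the idea only, not feature_engine's implementation, and the toy values are hypothetical: one median per column is learned from the non-missing values, then used to fill the gaps.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap per column
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [0.0, 100.0, np.nan, 200.0],
})

# "fit": learn one median per column, ignoring missing values
medians = toy.median().to_dict()

# "transform": fill each column's gaps with its learned median
toy_imputed = toy.fillna(value=medians)

print(medians)
print(toy_imputed.isna().sum().sum())
```

In a modelling workflow the medians would be learned on the training data only and then reused on new data, which is what keeping the imputer inside a pipeline makes possible.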

---