# Data Cleaning

## Objectives

* Encode categorical variables and deal with missing data

## Inputs

* outputs/datasets/collection/HousePricesRecords.csv

## Outputs

* A pipeline that carries out the data cleaning process
* outputs/datasets/collection/housing_data_cleaned.csv

## Conclusion

* The two variables with around 90% missing data ('EnclosedPorch' and 'WoodDeckSF') have been dropped.
* The remaining missing values have been replaced with the median of each variable.
* The 4 object variables have been encoded as ordinal numbers.



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/HousePricesRecords.csv")
df.head(25)

# Encode the object variables

* **BsmtExposure:** Gd: Good Exposure; Av: Average Exposure; Mn: Minimum Exposure; No: No Exposure; None: No Basement
* **BsmtFinType1:** GLQ: Good Living Quarters; ALQ: Average Living Quarters; BLQ: Below Average Living Quarters; Rec: Average Rec Room; LwQ: Low Quality; Unf: Unfinished; None: No Basement
* **GarageFinish:** Fin: Finished; RFn: Rough Finished; Unf: Unfinished; None: No Garage
* **KitchenQual:** Ex: Excellent; Gd: Good; TA: Typical/Average; Fa: Fair; Po: Poor

In [None]:
# Map each category to an ordinal integer, with 0 as the lowest level
dic = {
    'BsmtExposure': {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'nan': 0, 'Missing': 0},
    'BsmtFinType1': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'None': 0},
    'GarageFinish': {'Fin': 3, 'RFn': 2, 'Unf': 1, 'None': 0},
    'KitchenQual': {'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'Po': 0},
}
df1 = df.copy()
# Apply the mapping to every object-typed column
for col in df.columns[df.dtypes == 'object'].to_list():
    df1[col] = df1[col].replace(dic[col])
df1.head()
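
As a minimal sketch of what the encoding step does, the toy example below converts the KitchenQual labels into integers. It uses `Series.map` in place of `replace`, because `map` returns NaN for any label that is missing from the mapping. The `toy` DataFrame and the `kitchen_map` name are illustrative assumptions and are not part of the housing data.

```python
import pandas as pd

# Hypothetical toy column; the labels mirror the KitchenQual codes listed above.
toy = pd.DataFrame({"KitchenQual": ["Ex", "TA", "Gd", "Fa", "TA"]})

# Same ordinal mapping as the KitchenQual entry of the notebook's dictionary.
kitchen_map = {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "Po": 0}

# Series.map looks up each label in the mapping.
# A label that is absent from the mapping would become NaN.
toy["KitchenQual_encoded"] = toy["KitchenQual"].map(kitchen_map)

print(toy)
```

Because the integers preserve the order of the categories, sorting or comparing the encoded column reflects the quality ranking.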

In [None]:

from sklearn.base import BaseEstimator, TransformerMixin
# Custom transformer that encodes categorical variables using a mapping dictionary
class MyCustomEncoder(BaseEstimator, TransformerMixin):

  def __init__(self, variables, dic):
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables
    self.dic = dic

  def fit(self, X, y=None):    
    return self

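  def transform(self, X):
    X = X.copy()  # work on a copy so the caller's DataFrame is not modified
    for col in self.variables:
      if X[col].dtype == 'object':
        # use the mapping stored on the instance rather than the global dic
        X[col] = X[col].replace(self.dic[col])
      else:
        print(f"Warning: {col} data type should be object to use MyCustomEncoder()")

    return X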


from sklearn.pipeline import Pipeline
pipeline = Pipeline([('custom_encoder', MyCustomEncoder(variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], dic=dic))])

df1 = df.copy()
df1 = pipeline.fit_transform(df1)
df1.head(3)
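
A label that is not listed in the mapping dictionary is left unchanged by `replace`, so it can be worth checking for such labels before encoding. The sketch below is one possible check, under the assumption that the mapping has the same nested structure as the dictionary above. The `find_unmapped` helper, the toy data and the "Unknown" label are hypothetical.

```python
import pandas as pd


def find_unmapped(df, mapping):
    """Return {column: labels found in df[column] that mapping[column] lacks}."""
    unmapped = {}
    for col, col_map in mapping.items():
        present = set(df[col].dropna().unique())  # ignore real NaN values
        missing = sorted(present - set(col_map))
        if missing:
            unmapped[col] = missing
    return unmapped


# Hypothetical toy data containing one label that the mapping does not cover.
toy = pd.DataFrame({"GarageFinish": ["Fin", "RFn", "Unf", "None", "Unknown"]})
toy_map = {"GarageFinish": {"Fin": 3, "RFn": 2, "Unf": 1, "None": 0}}

result = find_unmapped(toy, toy_map)
print(result)
```

An empty dictionary means every label in the checked columns has an integer code.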

# Missing Data

Collect the variables with missing data into a list to pass into a summary report.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

Pass the variables with missing data into a profile report.

In [None]:
from pandas_profiling import ProfileReport
profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
profile.to_notebook_iframe()

# Drop Variables

After viewing the summary, we can see that 2 variables, 'EnclosedPorch' and 'WoodDeckSF', have around 90% missing data, so they can be dropped.
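
The share of missing values per variable can also be computed directly with pandas, which gives a quick cross-check of the figures in the report. The sketch below uses toy data. The column names and the 80% threshold are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column is 90% missing, one 10%, one complete.
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 9 + [1.0],
    "some_missing": [1.0, np.nan] + [2.0] * 8,
    "complete": list(range(10)),
})

# isna().mean() gives the fraction of missing rows for each column.
missing_pct = toy.isna().mean() * 100

# Assumed cut-off for illustration; pick a value that suits your data.
threshold = 80
to_drop = missing_pct[missing_pct >= threshold].index.tolist()

print(missing_pct.round(1))
print(to_drop)
```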

In [None]:
from feature_engine.selection import DropFeatures
drop_features = DropFeatures(features_to_drop = ['EnclosedPorch', 'WoodDeckSF'])

df_transformed = drop_features.fit_transform(df)
df_transformed.info()

# Replace Missing Data

Next we will replace the missing data with the median values

In [None]:
vars_with_missing_data

We will remove 'EnclosedPorch' and 'WoodDeckSF' from the list, as we have already dropped them.

In [None]:
vars_with_missing_data = ['2ndFlrSF', 'BedroomAbvGr', 'BsmtFinType1', 'GarageFinish', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea']

from feature_engine.imputation import CategoricalImputer, MeanMedianImputer


pipeline = Pipeline([
      ('drop_features', DropFeatures(features_to_drop = ['EnclosedPorch', 'WoodDeckSF'])),
      ('categorical_imputer', CategoricalImputer(imputation_method='missing', variables=['BsmtExposure'])),
      ('custom_encoder', MyCustomEncoder(variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], dic=dic)),
      ('median_imputer',  MeanMedianImputer(imputation_method='median', variables=vars_with_missing_data))
])

df1 = df.copy()
df_transformed = pipeline.fit_transform(df1) 
df_transformed.head(100)

Now we will check that there is no missing data left.

In [None]:
vars_with_missing_data = df_transformed.columns[df_transformed.isna().sum() > 0].to_list()

vars_with_missing_data

In [None]:
pipeline['median_imputer'].imputer_dict_
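
The dictionary shown above holds the median learned for each variable. As a rough pandas-only sketch of the same idea, the example below computes the medians of a toy DataFrame and uses them to fill the gaps. The values are made up, and this is an approximation of the technique rather than the library's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the column names are borrowed from the list above.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [0.0, 100.0, np.nan, 200.0],
})

# The median of each column, ignoring NaN, stored as {column: median}.
medians = toy.median().to_dict()

# fillna accepts a {column: value} dictionary and fills each column separately.
filled = toy.fillna(medians)

print(medians)
print(filled)
```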

In [None]:
import os

# Relative to the project root set earlier, matching the path listed under Outputs
output_dir = "outputs/datasets/collection"
os.makedirs(output_dir, exist_ok=True)

cleaned_file_path = os.path.join(output_dir, "housing_data_cleaned.csv")
df_transformed.to_csv(cleaned_file_path, index=False)

print(f"Cleaned dataset saved at: {cleaned_file_path}")
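
After saving, it can be useful to reload the file and confirm that it round-trips as expected. The sketch below shows one way to do this with toy data written to a temporary folder, so it does not touch the project outputs. The file name `toy_cleaned.csv` and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset.
toy = pd.DataFrame({"LotFrontage": [65, 80], "KitchenQual": [3, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare the column names, shape and missing-value count with the original.
same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
no_missing = int(reloaded.isna().sum().sum()) == 0

print(same_columns, same_shape, no_missing)
```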