# **Data Cleaning Notebook**

## Objectives

* Evaluate missing data
* Clean Data

## Inputs

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv

## Outputs

* Generate cleaned Test and Train sets 
* Data cleaning pipeline

## Conclusions

* Drop ['EnclosedPorch', 'WoodDeckSF']
* Categorical imputer: ['GarageFinish', 'BsmtFinType1']
* Median imputer: ['2ndFlrSF', 'BedroomAbvGr', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea']


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load collected data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()

# Data Exploration

First we identify the variables with missing data, store their names in a list, and create a pandas profiling report for them.
The report shows 9 variables with missing data, along with the number of missing values in each.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

---

# Data Cleaning

# Assessing Missing Data Levels

* Custom function to display missing data levels in a DataFrame. 

In [None]:
def EvaluateMissingData(df):
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute/len(df)*100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                   "PercentageOfDataset": missing_data_percentage,
                                   "DataType": df.dtypes}
                                    )
                          .sort_values(by=['PercentageOfDataset'], ascending=False)
                          .query("PercentageOfDataset > 0")
                          )

    return df_missing_data

We check missing data levels for the collected dataset. For each feature with gaps, this shows the number of rows with missing values, the percentage of the dataset they represent, and the data type.

In [None]:
EvaluateMissingData(df)
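
As a small worked example of what the function reports, the sketch below computes the same absolute and percentage figures by hand. It uses a hypothetical five-row DataFrame whose values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 70.0],      # 2 of 5 missing
    "GarageFinish": ["RFn", "Unf", None, "Fin", "Unf"],     # 1 of 5 missing
    "SalePrice": [208500, 181500, 223500, 140000, 250000],  # complete
})

missing_absolute = toy.isnull().sum()
missing_percentage = (missing_absolute / len(toy) * 100).round(2)

summary = pd.DataFrame({
    "RowsWithMissingData": missing_absolute,
    "PercentageOfDataset": missing_percentage,
})
# Keep only columns that actually have gaps, largest share first
summary = summary[summary["PercentageOfDataset"] > 0].sort_values(
    by="PercentageOfDataset", ascending=False
)
print(summary)
```

Complete columns such as `SalePrice` drop out of the summary, which is why the function's output lists only the features that need attention.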

We observe that EnclosedPorch and WoodDeckSF exhibit high levels of missing data, suggesting that these features may have limited predictive value for estimating the sale price.

# Dealing With Missing Data

We analyse and visualize the effects of data cleaning methods on specified variables.

In [None]:
import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt

def DataCleaningEffect(df_original, df_cleaned, variables_applied_with_method):
    """
    Analyse and visualize the distribution effect of data cleaning on variables.

    Parameters:
    - df_original: Original dataset (DataFrame)
    - df_cleaned: Cleaned dataset (DataFrame)
    - variables_applied_with_method: List of variables to analyse
    """      
    flag_count = 1  

    categorical_vars = [var for var in variables_applied_with_method if df_original[var].dtype == 'object']
    numerical_vars = [var for var in variables_applied_with_method if df_original[var].dtype != 'object']

    print("\n=====================================================================================")
    print("* Distribution Effect Analysis After Data Cleaning Method in the following variables:\n")
    print(f"{variables_applied_with_method}\n")

    for var in categorical_vars:
        df1 = pd.DataFrame({"Type": "Original", "Value": df_original[var]})
        df2 = pd.DataFrame({"Type": "Cleaned", "Value": df_cleaned[var]})
        df_combined = pd.concat([df1, df2], axis=0)

        plt.figure(figsize=(15, 5))
        sns.countplot(data=df_combined, x="Value", hue="Type", palette=['#432371', "#FAAE7B"])
        plt.title(f"Distribution Plot {flag_count}: {var} (Categorical Variable)")
        plt.xticks(rotation=90)
        plt.legend(title="Dataset")
        plt.xlabel(var)
        plt.ylabel("Count")
        plt.show()

        flag_count += 1

    for var in numerical_vars:
        plt.figure(figsize=(10, 5))
        sns.histplot(data=df_original, x=var, color="#432371", label='Original', kde=True, element="step")
        sns.histplot(data=df_cleaned, x=var, color="#FAAE7B", label='Cleaned', kde=True, element="step")
        plt.title(f"Distribution Plot {flag_count}: {var} (Numerical Variable)")
        plt.legend(title="Dataset")
        plt.xlabel(var)
        plt.ylabel("Frequency")
        plt.show()

        flag_count += 1

### Numerical missing values

We define the identified numerical variables with missing data for targeted imputation.

In [None]:
variables_method = ['2ndFlrSF', 'BedroomAbvGr', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea', 'WoodDeckSF', 'EnclosedPorch']
variables_method

We apply the Median Imputation method to handle missing data across the selected numerical variables. This replaces missing values with the median of the respective variable, ensuring robustness against outliers. 

In [None]:
from feature_engine.imputation import MeanMedianImputer

imputer = MeanMedianImputer(imputation_method='median', variables=variables_method)
df_method = imputer.fit_transform(df)
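
To illustrate why the median is the more robust choice, here is a minimal sketch in plain pandas rather than feature_engine. The series is hypothetical (invented values with one deliberate outlier); it compares the value each statistic would use to fill the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one extreme outlier and two gaps
lot_frontage = pd.Series([60.0, 65.0, 70.0, 75.0, 313.0, np.nan, np.nan])

median_value = lot_frontage.median()  # NaN values are skipped by default
mean_value = lot_frontage.mean()

filled_with_median = lot_frontage.fillna(median_value)
filled_with_mean = lot_frontage.fillna(mean_value)

print(f"median used for imputation: {median_value}")
print(f"mean used for imputation:   {mean_value}")
```

In this toy case the single outlier pulls the mean well above the typical values, while the median stays in the middle of them.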

We use the Data Cleaning Effect function to visualize the impact of this method on the distribution of numerical variables, comparing the original and cleaned datasets.

In [None]:
DataCleaningEffect(df_original=df,
                   df_cleaned=df_method,
                   variables_applied_with_method=variables_method)

### Categorical missing values

We define the categorical variables with missing data for targeted imputation. 

In [None]:
variables_method = ['GarageFinish', 'BsmtFinType1']
variables_method

We introduce a new category called 'Missing' to account for variables with missing data. This is distinct from None, as it is unclear whether the variable is missing due to omission or if it genuinely represents the absence of a value.

In [None]:
from feature_engine.imputation import CategoricalImputer

imputer = CategoricalImputer(imputation_method='missing',fill_value='Missing',
                             variables=variables_method)

df_method = imputer.fit_transform(df)
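
The effect of an explicit 'Missing' label can be sketched with plain pandas on a hypothetical series (invented values, not the dataset). Filling with the most frequent category is shown only for contrast; it is not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical garage-finish values; None marks a gap
garage_finish = pd.Series(["Unf", "RFn", None, "Fin", "Unf", None], dtype="object")

# Option A: label the gaps explicitly, as this notebook does
as_missing = garage_finish.fillna("Missing")

# Option B (for contrast only): fill with the most frequent category
as_mode = garage_finish.fillna(garage_finish.mode()[0])

print(as_missing.value_counts())
print(as_mode.value_counts())
```

With option A the counts of the existing categories are unchanged, whereas option B inflates the most frequent category with rows whose true value is unknown.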

In [None]:
DataCleaningEffect(df_original=df,
                   df_cleaned=df_method,
                   variables_applied_with_method=variables_method)

## Data Cleaning Summary

1. Drop Variables: Remove EnclosedPorch and WoodDeckSF from the dataset as they contain over 88% missing values, making them unsuitable for reliable analysis or imputation.

2. Median Imputation: Apply the median imputation method to the following numerical variables with missing data: LotFrontage, GarageYrBlt, MasVnrArea, BedroomAbvGr, and 2ndFlrSF. This approach replaces missing values with the median of each respective variable, ensuring robustness against outliers.

3. Categorical Imputation: Assign the category 'Missing' to fill missing values in the categorical variables GarageFinish and BsmtFinType1. This explicitly captures the absence of data without conflating it with meaningful categories.

# Split Train and Test Sets

We split the dataset into Train and Test sets to evaluate the effectiveness of our imputation methods. Each imputation technique is fitted on the Train set, so the values it learns come from the Train data only. We then apply the same fitted methods to the Test set and assess their impact there to ensure generalisability and robustness.

In [None]:
from sklearn.model_selection import train_test_split

TrainSet, TestSet, _, __ = train_test_split(
                                        df,
                                        df['SalePrice'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

In [None]:
df_missing_data = EvaluateMissingData(TrainSet)
print(f"* There are {df_missing_data.shape[0]} variables with missing data \n")
df_missing_data

### 1. Drop Variables

We are dropping EnclosedPorch and WoodDeckSF as these have over 88% missing values.

In [None]:
variables_method = ['EnclosedPorch', 'WoodDeckSF' ]

print(f"* {len(variables_method)} variables to drop \n\n"
    f"{variables_method}")

We create a separate DataFrame with the selected variables dropped.

In [None]:
from feature_engine.selection import DropFeatures

imputer = DropFeatures(features_to_drop=variables_method)
imputer.fit(TrainSet)
df_method = imputer.transform(TrainSet)

We verify that the selected variables have been successfully dropped by checking that they are no longer present in the transformed DataFrame.

In [None]:
remaining_variables = set(variables_method).intersection(set(df_method.columns))

if not remaining_variables:
    print("Success: All specified variables have been dropped.")
else:
    print(f"Failure: These variables were not dropped: {remaining_variables}")

df_method.shape


We can now apply these changes to both Train and Test sets.

In [None]:
imputer = DropFeatures(features_to_drop=variables_method)
imputer.fit(TrainSet)

TrainSet, TestSet = imputer.transform(TrainSet) , imputer.transform(TestSet)

We re-evaluate the missing data levels in the Train set.

In [None]:
EvaluateMissingData(TrainSet)

In [None]:
missing_features_count = TrainSet.isnull().any().sum()
print(f"There are {missing_features_count} features with missing data.")

### 2. Median Imputing

We now impute the median values for numerical variables with missing data

In [None]:
variables_method = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea', 'BedroomAbvGr', '2ndFlrSF']
print(f"* {len(variables_method)} variables to impute with the median \n\n"
    f"{variables_method}")

In [None]:
imputer = MeanMedianImputer(imputation_method='median', variables=variables_method)
imputer.fit(TrainSet)
df_method = imputer.transform(TrainSet)

DataCleaningEffect(df_original=TrainSet,
                   df_cleaned=df_method,
                   variables_applied_with_method=variables_method)

In [None]:
df_method.head(5)

### Imputation Analysis
General Observation:

The missing values have been replaced with the median value for each feature. While this method is robust against outliers, it has introduced subtle changes in the distribution of some features, which could influence their predictive power.

Feature: LotFrontage:<br>
The imputed median value (~70) now accounts for a disproportionately large number of observations compared to the original distribution.
This concentration of values around the median could reduce the feature's variability and potentially affect its predictive accuracy in capturing the true relationship with the target variable (SalePrice).

Feature: GarageYrBlt:<br>
While imputing the missing values in GarageYrBlt around the median year helps address the missing data, it may significantly impact the predictive power of this feature. Given that the age of the garage and property can strongly influence sale price, this approach might oversimplify the variability and dilute the feature's ability to capture its true relationship with the target variable.
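
The concentration effect described above can be quantified. The sketch below uses a hypothetical feature (invented values, with a deliberately high share of gaps) and compares the share of rows sitting exactly on the median and the standard deviation before and after imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with 4 gaps out of 10 rows (invented values)
original = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0, 70.0,
                      np.nan, np.nan, np.nan, np.nan])

median_value = original.median()
imputed = original.fillna(median_value)

# Share of rows sitting exactly on the median, before and after
share_before = (original == median_value).sum() / original.notna().sum()
share_after = (imputed == median_value).mean()

print(f"median: {median_value}")
print(f"share at median before: {share_before:.2f}, after: {share_after:.2f}")
print(f"std before: {original.std():.2f}, after: {imputed.std():.2f}")
```

The same two measures could be computed on `TrainSet` and `df_method` for LotFrontage and GarageYrBlt to put numbers on what the distribution plots show.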

We fit the imputer on the Train set and apply the transformation to both the Train and Test sets.

In [None]:
imputer = MeanMedianImputer(imputation_method='median', variables=variables_method)
imputer.fit(TrainSet)
TrainSet, TestSet = imputer.transform(TrainSet) , imputer.transform(TestSet)
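
The fit-on-Train, transform-both pattern can be illustrated with a minimal pandas sketch. The two frames are hypothetical (invented values) and are chosen so that the Train and Test medians differ clearly.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test columns (invented values)
train = pd.DataFrame({"LotFrontage": [60.0, 70.0, 80.0, np.nan]})
test = pd.DataFrame({"LotFrontage": [200.0, np.nan, 210.0]})

# "fit": learn the statistic from the Train set only
learned_median = train["LotFrontage"].median()

# "transform": reuse that same value on both sets
train_clean = train.fillna({"LotFrontage": learned_median})
test_clean = test.fillna({"LotFrontage": learned_median})

print(test_clean)
```

The gap in the Test frame is filled with the Train median, not with a value derived from the Test rows, so nothing about the Test set influences the learned statistic.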

We evaluate the current missing data levels

In [None]:
EvaluateMissingData(TrainSet)

### 3. Categorical Imputing

We create a new category, 'Missing', to label the missing data.

In [None]:
variables_method = ['GarageFinish', 'BsmtFinType1']
print(f"* {len(variables_method)} categorical variables to impute \n\n"
    f"{variables_method}")

In [None]:
from feature_engine.imputation import CategoricalImputer

imputer = CategoricalImputer(imputation_method='missing',fill_value='Missing', variables=variables_method)
imputer.fit(TrainSet)
df_method = imputer.transform(TrainSet)
DataCleaningEffect(df_original=TrainSet,
                   df_cleaned=df_method,
                   variables_applied_with_method=variables_method)

In [None]:
df_method.head(5)

A new category, 'Missing', has been added to handle the missing data. This does not affect the categories already present in the feature.
The feature engineering process will encode these values numerically.

We apply the changes to both the Train and Test sets.

In [None]:
imputer = CategoricalImputer(imputation_method='missing',fill_value='Missing', variables=variables_method)
imputer.fit(TrainSet)
TrainSet, TestSet = imputer.transform(TrainSet) , imputer.transform(TestSet)

We finally check once more for missing values.

In [None]:
# A cell only displays its last expression, so print both results explicitly
print(EvaluateMissingData(TrainSet))
print(EvaluateMissingData(TestSet))

---

We save cleaned train and test sets

# Push files to Repo

In [None]:
import os
# exist_ok=True avoids an error if the folder already exists
os.makedirs(name='outputs/datasets/cleaned', exist_ok=True)


## Train Set

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

## Test set

In [None]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
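
As an optional sanity check, the saved files can be reloaded to confirm that no missing values reappear. The sketch below uses a hypothetical stand-in frame and a temporary folder rather than the real outputs. It relies on the fact that the label 'Missing' is an ordinary string, not one of the markers (such as "NA") that `pd.read_csv` converts to NaN by default.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for TrainSet
cleaned = pd.DataFrame({
    "GarageFinish": ["Unf", "Missing", "Fin"],
    "LotFrontage": [65.0, 70.0, 80.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "TrainSetCleaned.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

remaining_gaps = int(reloaded.isnull().sum().sum())
print(f"Missing values after reload: {remaining_gaps}")
```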