# **(DATA CLEANING)**

## Objectives

* Evaluate and handle missing data.
* Clean the data.

## Inputs

* outputs/datasets/collection/HousePrices.csv 

## Outputs

* Generate cleaned data in outputs/datasets/cleaned 
 

## Additional Comments

* Handle missing data and drop variables.


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
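
As a small illustration of what os.path.dirname() returns, the sketch below uses a hypothetical path (not the actual project location) and only computes the parent folder, without changing the working directory.

```python
import os

# Hypothetical notebook folder, used only for illustration.
notebook_dir = "/workspace/project/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder.
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)  # /workspace/project
```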

---

# Load Data

Loading collected data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/HousePrices.csv")
df.head()

# Data Exploration

We want to check the shape, size and distribution of the data before cleaning it.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
len(vars_with_missing_data)

In [None]:
df[vars_with_missing_data].info()

In [None]:
# Code from walkthrough project 02
from pandas_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("No variables have missing data")

---

# Data Cleaning

### Assessing Missing Data Levels

In [None]:
def AssessMissingValues(df):
    """
    This function assesses the presence of missing values in a DataFrame.
    """
    # Calculate the absolute number of missing values per column
    total_missing = df.isna().sum()

    # Calculate the percentage of missing values relative to the total rows
    percent_missing = (total_missing / len(df) * 100).round(2)

    # Create a new DataFrame to display missing data stats
    missing_stats = pd.DataFrame({
        'MissingValuesCount': total_missing,
        'PercentOfTotal': percent_missing,
        'ColumnType': df.dtypes
    }).sort_values(by='PercentOfTotal', ascending=False)

    # Filter to include only columns with missing values
    missing_stats = missing_stats[missing_stats['PercentOfTotal'] > 0]

    return missing_stats



In [None]:
AssessMissingValues(df)
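
To make the count and percentage logic easier to follow, here is a minimal sketch on a toy DataFrame. The column names and values are invented for illustration and are not taken from HousePrices.csv.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": ["x", "y", None, "z"],
    "C": [10, 20, 30, 40],
})

# Absolute number of missing values per column
missing_count = toy.isna().sum()

# Share of missing values relative to the number of rows
missing_percent = (missing_count / len(toy) * 100).round(2)

summary = pd.DataFrame({
    "MissingValuesCount": missing_count,
    "PercentOfTotal": missing_percent,
})

# Keep only the columns that actually have missing values
summary = summary[summary["PercentOfTotal"] > 0]
print(summary)
```

Column "C" has no missing values, so it is filtered out of the summary.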

We can drop the variables whose data won't have a big impact on the predictions.


### Handle Missing Data

Inspired by Code Institute's learning material and https://github.com/Amareteklay/heritage-housing-issues/blob/main/jupyter_notebooks/02%20-%20Data_Cleaning.ipynb

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde
%matplotlib inline

sns.set(style="whitegrid")

def VisualizeDataCleaningImpact(df_original, df_cleaned, applied_methods_vars):
    flag_count = 1  # To keep track of the plot sequence
    
    # Identify categorical variables from the original dataset
    cat_vars = df_original.select_dtypes(include=['object', 'category']).columns
    
    print("\n=====================================================================================")
    print(f"* Distribution Effect Analysis After Data Cleaning Method in the following variables:")
    print(f"{applied_methods_vars} \n\n")

    for var in applied_methods_vars:
        if var in cat_vars:  # For categorical variables, use bar plot
            df1 = pd.DataFrame({"Type": "Original", "Value": df_original[var]})
            df2 = pd.DataFrame({"Type": "Cleaned", "Value": df_cleaned[var]})
            df_combined = pd.concat([df1, df2], axis=0)
            
            plt.figure(figsize=(15, 5))
            sns.countplot(x="Value", hue='Type', data=df_combined, palette=['#432371', '#FAAE7B'])
            plt.title(f"Bar Plot {flag_count}: {var}")
            plt.xticks(rotation=90)
            plt.legend()
            plt.show()  # Render the bar plot, as the numerical branch does

        else:  # For numerical variables, use KDE plot
            plt.figure(figsize=(10, 5))
            sns.histplot(df_original[var].dropna(), color="#432371", label='Original', alpha=0.5, edgecolor='black', zorder=1, kde=True)
            sns.histplot(df_cleaned[var].dropna(), color="#FAAE7B", label='Cleaned', alpha=0.5, edgecolor='black', zorder=1, kde=True)
            #sns.kdeplot(df_original[var].dropna(), color="#432371", label='Original', zorder=2)
            #sns.kdeplot(df_cleaned[var].dropna(), color="#FAAE7B", label='Cleaned', zorder=2)
            plt.title(f"KDE Plot {flag_count}: {var}")
            plt.legend()
            plt.show()

        flag_count += 1


### Data cleaning Summary

* Exclusion of Specific Features: We have decided to remove the columns 'EnclosedPorch' and 'WoodDeckSF'. Despite their potential relevance in augmenting a property's size, their high rate of missing values (exceeding 80%) undermines their utility for predictive modeling. The lack of substantial data variation in these columns across different house price levels supports their exclusion from our analysis.

* Approaches for Imputation:

* a. Mean Imputation Application: For the columns 'LotFrontage' and 'BedroomAbvGr', we will apply mean imputation. This choice is grounded in the observation that their distribution patterns are relatively symmetrical, resembling a normal distribution. Substituting missing values with the mean will maintain the central tendency of these features.

* b. Median Imputation Utilization: The features '2ndFlrSF', 'GarageYrBlt', and 'MasVnrArea' will undergo median imputation. Despite these features being somewhat normally distributed, the presence of skewed data warrants the use of the median. This approach is preferred over the mean, as it is less susceptible to distortion by outliers, providing a more robust measure of central tendency.

* c. Categorical Imputation for Specific Features: For 'GarageFinish', 'BsmtFinType1' and 'BsmtExposure', categorical imputation is the chosen method. As these are categorical variables, mean or median imputation is not applicable. Instead, we will replace missing values with the most frequently occurring category within each respective feature.

* This tailored approach to handling missing data aligns with the specific characteristics of each variable in our dataset, enhancing the integrity and utility of the cleaned data for subsequent analysis and modeling.
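
The difference between mean and median imputation can be sketched on a toy series. The values below are invented for illustration, and plain pandas fillna() is used here as a stand-in for the feature_engine imputers applied later in the notebook.

```python
import pandas as pd

# Toy values with one missing entry and one outlier (100.0).
values = pd.Series([1.0, 2.0, 3.0, None, 100.0])

# Mean imputation: the outlier pulls the fill value upwards.
mean_filled = values.fillna(values.mean())

# Median imputation: the fill value stays close to the bulk of the data.
median_filled = values.fillna(values.median())

print(mean_filled[3])    # 26.5
print(median_filled[3])  # 2.5
```

With a single large outlier, the mean-based fill value lands far from most observations, which is why the median is the safer choice for skewed variables.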

### Train Set And Test Set

In [None]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet, _, __ = train_test_split(
                                        df,
                                        df['SalePrice'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

In [None]:
df_data_missing = AssessMissingValues(TrainSet)
print(f"* Total variables affected by missing data: {df_data_missing.shape[0]} \n")
df_data_missing

### Drop variables 

* Variable Removal: Initially, the code specifies a list of variables, 'EnclosedPorch' and 'WoodDeckSF', which are identified for removal from the dataset. This is done in anticipation that dropping these variables, due to their high rate of missing data, will refine the dataset for better analysis and modeling.

* Feature Dropping and Confirmation: The DropFeatures class from the feature_engine.selection module is then utilized to execute the removal of these specified variables from the training dataset (TrainSet). After this operation, the code conducts a verification check to ensure that each of the intended variables has indeed been dropped from the DataFrame. This verification is presented through a series of print statements that confirm the absence of each variable in the updated DataFrame.



In [None]:
# Applying feature removal process
from feature_engine.selection import DropFeatures

# Specifying the variables to be excluded
drop_columns = ['EnclosedPorch', 'WoodDeckSF']
feature_dropper = DropFeatures(features_to_drop=drop_columns)
df_updated = feature_dropper.fit_transform(TrainSet)

# Checking if the specified columns have been successfully removed
for column in drop_columns:
    print(f"Is '{column}' present in the updated DataFrame? {'Yes' if column in df_updated.columns else 'No'}")

In [None]:
# Applying Mean Imputation to specified variables
from feature_engine.imputation import MeanMedianImputer

# Implementing Mean Imputation
mean_impute_vars = ['LotFrontage', 'BedroomAbvGr']
mean_imputer = MeanMedianImputer(imputation_method='mean', variables=mean_impute_vars)
df_imputed = mean_imputer.fit_transform(TrainSet)

# Visualizing the impact of mean imputation on the data
VisualizeDataCleaningImpact(df_original=TrainSet, df_cleaned=df_imputed, applied_methods_vars=mean_impute_vars)

### Median Imputation

* This code snippet demonstrates the process of median imputation for handling missing values in a dataset. It specifically targets three variables: '2ndFlrSF', 'GarageYrBlt', and 'MasVnrArea'. The MeanMedianImputer from the feature_engine.imputation library is utilized, configured for median imputation. After applying this imputation method to the TrainSet, the impact of this data cleaning step is visualized through the VisualizeDataCleaningImpact function. This function compares the original dataset with the cleaned version, highlighting changes in data distribution for the specified variables.

In [None]:
from feature_engine.imputation import MeanMedianImputer

# Specifying the variables for median imputation
median_impute_vars = ['2ndFlrSF', 'GarageYrBlt', 'MasVnrArea']
median_imputer = MeanMedianImputer(imputation_method='median', variables=median_impute_vars)
df_median_imputed = median_imputer.fit_transform(TrainSet)

# Visualizing the impact of median imputation on the data
VisualizeDataCleaningImpact(df_original=TrainSet, df_cleaned=df_median_imputed, applied_methods_vars=median_impute_vars)

In [None]:
TrainSet[(TrainSet['GarageArea'] ==0)][['GarageYrBlt', 'GarageArea']]

* Shows the rows where 'GarageArea' is 0, together with their 'GarageYrBlt' values

### Categorical Imputation

In [None]:
from feature_engine.imputation import CategoricalImputer

categorical_vars = ['GarageFinish', 'BsmtFinType1', 'BsmtExposure']
imputer = CategoricalImputer(imputation_method='missing', fill_value='None', variables=categorical_vars)
df_cat_imputation = imputer.fit_transform(TrainSet)
VisualizeDataCleaningImpact(df_original=TrainSet,
                            df_cleaned=df_cat_imputation,
                            applied_methods_vars=categorical_vars)

In [None]:
TrainSet[(TrainSet['GarageArea'] ==0)][['GarageFinish', 'GarageArea']]
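
Two common ways of filling a categorical variable are an explicit label for missing values and the most frequent category. The sketch below contrasts them on a toy series using plain pandas rather than CategoricalImputer; the values are invented for illustration.

```python
import pandas as pd

# Toy categorical data with two missing entries.
garage_finish = pd.Series(["Unf", "Fin", None, "Unf", None])

# Option 1: mark missing entries with an explicit label.
labelled = garage_finish.fillna("None")

# Option 2: replace missing entries with the most frequent category.
most_frequent = garage_finish.mode()[0]
mode_filled = garage_finish.fillna(most_frequent)

print(labelled.tolist())
print(mode_filled.tolist())
```

The first option keeps "missing" visible as its own category, while the second increases the count of the dominant category.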

---

### Data Cleaning Pipeline

* Mean Imputation: This step uses the MeanMedianImputer class with imputation_method='mean' to impute missing values in the 'LotFrontage' and 'BedroomAbvGr' columns using their respective means. This is suitable for continuous variables where the mean is a good estimate of central tendency.

* Median Imputation: Also utilizing the MeanMedianImputer class but with imputation_method='median', this step addresses missing values in the '2ndFlrSF' and 'MasVnrArea' columns by replacing them with the median of each column. This method is often used for skewed distributions or when the median is a more robust measure than the mean.

* Categorical Imputation: The CategoricalImputer class, set to imputation_method='frequent', is used for the 'GarageFinish', 'BsmtFinType1' and 'BsmtExposure' columns. It replaces missing values with the most frequent category within each column. This approach is common for categorical variables where the mode (or most frequent category) is a logical choice for imputation.

* Dropping Variables: The final step uses the DropFeatures class to remove specific columns from the dataset. In this case, 'EnclosedPorch', 'GarageYrBlt', and 'WoodDeckSF' are dropped. This step is crucial for eliminating features that are not useful for the analysis or modeling process.

* Overall, this pipeline is a structured approach to clean and prepare the data for further analysis or modeling, enhancing the quality and utility of the dataset.

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.selection import DropFeatures

# Define a structured data preprocessing pipeline
preprocessing_pipeline = Pipeline([
    ('impute_mean', MeanMedianImputer(imputation_method='mean', 
                                      variables=['LotFrontage', 'BedroomAbvGr'])),
    ('impute_median', MeanMedianImputer(imputation_method='median', 
                                        variables=['2ndFlrSF', 'MasVnrArea'])),
    ('impute_categorical', CategoricalImputer(imputation_method='frequent', 
                                              variables=['GarageFinish', 'BsmtFinType1', 'BsmtExposure'])),
    ('drop_features', DropFeatures(features_to_drop=['EnclosedPorch', 'GarageYrBlt', 'WoodDeckSF']))
])

* After this we apply the pipeline to the train set, the test set and the whole dataset to get the cleaned data.

In [None]:
# Fit the pipeline on the train set only, then reuse the fitted pipeline on the test set
TrainSet = preprocessing_pipeline.fit_transform(TrainSet)
TestSet = preprocessing_pipeline.transform(TestSet)

In [None]:
df = preprocessing_pipeline.fit_transform(df)

In [None]:
AssessMissingValues(TestSet)

In [None]:
AssessMissingValues(TrainSet)

In [None]:
AssessMissingValues(df)

* Running these checks shows that there is no missing data left to handle.
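
As a sketch of the general fit-on-train, apply-to-test convention for imputation, the example below learns a fill value from toy training rows and reuses it on toy test rows. The data is invented for illustration, and plain pandas stands in for the pipeline.

```python
import numpy as np
import pandas as pd

# Toy train and test data, invented for illustration only.
train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 90.0]})

# "Fit": learn the fill value from the training rows only.
train_mean = train["LotFrontage"].mean()

# "Transform": reuse the same learned value for both sets,
# so no statistic is computed from the test rows.
train_clean = train.fillna({"LotFrontage": train_mean})
test_clean = test.fillna({"LotFrontage": train_mean})

print(train_mean)  # 70.0
print(test_clean["LotFrontage"].tolist())  # [70.0, 90.0]
```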

---

# Push files to Repo

* We make a directory for the cleaned data files.

In [None]:
import os

try:
    os.makedirs(name='outputs/datasets/cleaned')
except Exception as e:
    print(f"An error occurred: {e}")
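
An alternative sketch for creating the folder: os.makedirs() accepts exist_ok=True, which makes the call safe to repeat when the folder already exists. The example runs inside a temporary directory so it does not touch the project folders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    target = os.path.join(base_dir, "outputs", "datasets", "cleaned")

    # Creates all intermediate folders; repeating the call raises no error.
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)

print(created)  # True
```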

### Train Set

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

### Test Set

In [None]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)

### Cleaned Dataset

In [None]:
df.to_csv("outputs/datasets/cleaned/HousePricesCleaned.csv", index=False)

### Data Cleaning Pipeline

In [None]:
import joblib

file_path = 'outputs/ml_pipeline/data_cleaning'

try:
  os.makedirs(name=file_path)
except Exception as e:
  print(e)

In [None]:
joblib.dump(value=preprocessing_pipeline, filename=f"{file_path}/preprocessing_pipeline.pkl")
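
A file written with joblib.dump() can be read back with joblib.load(). The round trip is sketched below with a plain dictionary standing in for the fitted pipeline and a temporary folder standing in for outputs/ml_pipeline/data_cleaning; both stand-ins are assumptions made for illustration.

```python
import os
import tempfile

import joblib

# Stand-in for a fitted object; the real notebook saves preprocessing_pipeline.
fitted_object = {"LotFrontage": 70.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")

    # Save the object to disk, then load it back.
    joblib.dump(value=fitted_object, filename=path)
    restored = joblib.load(path)

print(restored == fitted_object)  # True
```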
