# **Data Cleaning Notebook**

## Objectives

* Evaluate missing data.
* Clean data.

## Inputs

* outputs/datasets/collection/house-prices.csv

## Outputs

* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned.

## Conclusions 

* Data Cleaning Pipeline.

## Additional Comments

* This file and its contents were inspired by and adapted from the Churnometer Walkthrough Project 2.  

---

## Change working directory

* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

## Load Collected Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house-prices.csv")
df.head(3)

## Data Exploration

* Identifying Columns with Missing Data:

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

## Correlation and PPS Analysis

In [51]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps
%matplotlib inline

# Function to generate a heatmap based on correlation

def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)  # np.bool is deprecated, use bool
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, ax = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=ax,
                    linewidth=0.5)
        ax.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()

# Function to generate a heatmap based on PPS

def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)  # np.bool is deprecated, use bool
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='rocket_r', annot_kws={"size": font_annot}, ax=ax,
                    linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()

# Function to calculate both correlation and PPS (Predictive Power Score)

def CalculateCorrAndPPS(df):
    # Calculate Spearman and Pearson correlations
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    # Calculate PPS matrix
    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    # Calculate PPS score statistics to decide threshold
    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix

# Function to display all three heatmaps: Spearman, Pearson, PPS

def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print("PPS detects linear or non-linear relationships between two columns.\n"
          "The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)



We first list the categorical columns in the dataset with df.select_dtypes(). We then apply pd.get_dummies() to one-hot encode them ('BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual') so that they can be included in the correlation analysis, dropping the first category of each column to avoid the dummy variable trap.

In [55]:
categorical_cols = df.select_dtypes(include=['object']).columns
print(categorical_cols)

df_encoded = pd.get_dummies(df, columns=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], drop_first=True)


Index(['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], dtype='object')
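
To make the effect of drop_first concrete, here is a minimal sketch on a toy DataFrame. The column name `Quality`, its levels and the `Area` values are made up for illustration and are not taken from the house-prices dataset.

```python
import pandas as pd

# Toy frame for illustration only; names and values are hypothetical
toy = pd.DataFrame({
    "Quality": ["Gd", "TA", "Ex", "TA"],
    "Area": [100, 80, 150, 90],
})

# drop_first=True removes the first level in sorted order ("Ex" here),
# so a column with k levels becomes k-1 indicator columns
toy_encoded = pd.get_dummies(toy, columns=["Quality"], drop_first=True)
print(toy_encoded.columns.to_list())
```

A row whose indicators are all zero then stands for the dropped level. Note that, with the default dummy_na=False, a missing value in an encoded column also produces an all-zero row, so it cannot be told apart from the dropped level after encoding.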


Calculate Correlations and Predictive Power Score

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df_encoded)

Display the Heatmaps

In [None]:
# Display the correlation and PPS heatmaps
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman,
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.4,  # The threshold to filter the correlations displayed in the heatmap
                  PPS_Threshold=0.2,   # The threshold for the PPS score displayed in the heatmap
                  figsize=(12, 10),    # Set the figure size for the heatmaps
                  font_annot=10)       # Set the font size for the annotations in the heatmaps
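
The difference between the two correlation heatmaps can be illustrated with a small synthetic example. The sketch below assumes a made-up, strictly increasing but non-linear relationship; it is not derived from the house-prices data.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: y grows exponentially with x,
# so the relationship is monotonic but not linear
x = np.arange(1, 21)
toy = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

# Same DataFrame.corr calls as in CalculateCorrAndPPS, applied to the toy data
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Spearman works on ranks, so it reports a perfect monotonic relationship here, while Pearson comes out lower because the points do not lie on a straight line. A pair of variables that clears the threshold in one heatmap but not in the other may therefore point to a non-linear relationship.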


## Data Cleaning

### Assessing Missing Data Levels

* Custom function to display missing data levels in a DataFrame. For each column with missing values, it shows the absolute level, the relative level and the data type.
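
One possible shape for such a function is sketched below. The name `evaluate_missing_data`, the output column names and the toy data are assumptions made for illustration; they are not taken from the notebook.

```python
import pandas as pd


def evaluate_missing_data(df):
    """Summarise missing values per column (hypothetical helper)."""
    missing_abs = df.isna().sum()                          # absolute count of NaNs per column
    missing_pct = (missing_abs / len(df) * 100).round(2)   # share of rows that are NaN, in percent
    summary = pd.DataFrame({
        "RowsWithMissingData": missing_abs,
        "PercentageOfDataset": missing_pct,
        "DataType": df.dtypes,
    })
    # Keep only columns that actually have missing values, highest share first
    return (summary[summary["RowsWithMissingData"] > 0]
            .sort_values(by="PercentageOfDataset", ascending=False))


# Toy data for illustration only (not the house-prices dataset)
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
missing_summary = evaluate_missing_data(toy)
print(missing_summary)
```

Calling the helper on the loaded `df` would return one row per variable with missing data, which can guide the choice between dropping and imputing each variable.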