# **Notebook 2: Data Cleaning**

## Objectives

* Identify and handle missing values across both datasets.
* Detect and address outliers to ensure data integrity.
* Remove duplicates and inconsistencies within the datasets.
* Standardize formatting and data types for compatibility.
* Save the cleaned datasets for further analysis and modeling.
* Document the cleaning process for reproducibility and transparency.

## Inputs

* **Raw Datasets**:
  * `house_prices_records.csv`: Contains house attribute data and sale prices for properties in Ames, Iowa.
  * `inherited_houses.csv`: Contains attributes of four inherited properties but excludes sale prices.
* **Saved Location**:
  * Raw datasets are located in `outputs/datasets/raw/`.

## Outputs

* **Cleaned Datasets**:
  * `cleaned_house_prices_records.csv`: Cleaned version of the house prices dataset.
  * `cleaned_inherited_houses.csv`: Cleaned version of the inherited houses dataset.
* **Documentation**:
  * A summary of the cleaning process, including handling of missing values, outliers, and duplicates.
  * Cleaned datasets saved in `outputs/datasets/cleaned/`.

## Additional Comments

* This notebook adheres to the CRISP-DM methodology's Data Preparation step.
* Cleaning decisions (e.g., imputation methods, outlier handling) are based on data characteristics and domain knowledge.
* The cleaned datasets will be ready for downstream tasks such as correlation analysis, feature engineering, and model training in subsequent notebooks.


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Import Packages & Set Environment Variables

* First you will need to import the numpy and pandas packages, and set the environment variables by running the following:

In [None]:
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None
from pandas_profiling import ProfileReport
from feature_engine.imputation import ArbitraryNumberImputer, CategoricalImputer
from sklearn.pipeline import Pipeline

# Load Collected Data

* Now that the required packages are imported and the environment variables are set, load the data previously downloaded (see the Data Collection notebook).

In [None]:
df = pd.read_csv("inputs/dataset/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df.shape)

In [None]:
df_inherited = pd.read_csv("inputs/dataset/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited.shape)
df_inherited

# Data Exploration

## Missing Data Exploration

* Next you will explore the dataset: check the variable types and distributions, the levels of missing data, and what value these variables may add in the context of our first business requirement.
* First, list the variables that have missing values using the following:

In [None]:
vars_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_missing_data

* Then run the pandas profiling report on just the columns listed in `vars_missing_data`, as follows:

In [None]:
if vars_missing_data:
    pandas_report = ProfileReport(df=df[vars_missing_data], minimal=True)
    pandas_report.to_notebook_iframe()
else:
    print("Done. There are no variables that are missing data.")

### Assessing Missing Data Levels

* **Purpose**: To gain an understanding of the extent and distribution of the missing data across the dataset.
* **Steps**:
  1. **Identify Variables with Missing Data**:
       * Generate a list of columns that have missing values and their corresponding percentages.
       * Use a profiling report or visualizations to analyze the distribution and patterns of missing data.
  2. **Classify Missing Data**:
       * Categorize missing data as either systematic (e.g., due to a specific condition) or random.
  3. **Visualize Missing Data**:
       * Use heatmaps or bar charts to understand patterns in missing data.

In [None]:
def AssessMissingData(df):
    """Summarise missing values per column: count, percentage, and data type."""
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute / len(df) * 100, 2)
    df_missing_data = (
        pd.DataFrame(
            data={
                "RowsWithMissingData": missing_data_absolute,
                "PercentageOfDataset": missing_data_percentage,
                "DataType": df.dtypes,
            }
        )
        # Keep only the columns that have missing values, worst first
        .sort_values(by=["PercentageOfDataset"], ascending=False)
        .query("PercentageOfDataset > 0")
    )
    return df_missing_data

AssessMissingData(df)

#### Results

* Summarize findings:
  * Variables with high missing percentages (e.g., > 50%).
  * Insights into patterns (e.g., variables missing together or clustering in certain rows).
  * Highlight variables that may need special attention during cleaning.
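
One way to look for variables that are missing together, or for rows where gaps cluster, is to correlate the missingness indicators themselves. The following is a minimal sketch on a toy frame; the column names and values are illustrative assumptions, not results from the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the house prices data (illustrative values only)
toy = pd.DataFrame({
    "GarageFinish": ["Fin", None, "Unf", None, "RFn"],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan, 2001.0],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Boolean mask of missing cells, restricted to columns that have any gaps
missing_mask = toy.isna()
missing_mask = missing_mask.loc[:, missing_mask.any()]

# Correlation between missingness indicators: values near 1 suggest that
# two columns tend to be missing in the same rows
missing_corr = missing_mask.astype(int).corr()

# Number of missing values per row, to spot rows where gaps cluster
missing_per_row = toy.isna().sum(axis=1)

print(missing_corr.round(2))
print(missing_per_row.tolist())
```

Columns whose indicators correlate strongly are candidates for the "systematic" category, since their gaps likely share a cause.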

### Correlation Analysis

* **Purpose**: Identify relationships between variables and the target (SalePrice) to understand their potential impact on the analysis.
* **Steps**:
  1. **Correlation Coefficients**:
        * Compute Pearson and Spearman correlation matrices for numerical variables.
        * Pearson measures linear relationships, whilst Spearman evaluates monotonic relationships.
  
  2. **Predictive Power Score (PPS)**:
        * Generate a PPS Matrix to capture non-linear relationships.
  
  3. **Visualizations**:
        * Display heatmaps for correlation and PPS matrices, with thresholds to highlight significant relationships.
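
The correlation step can be sketched with pandas alone. The example below is a hedged illustration on made-up data: the feature names and values are assumptions, and only `SalePrice` as the target comes from this notebook. The PPS matrix and the heatmaps are left out because they depend on additional packages.

```python
import pandas as pd

# Toy numerical data (illustrative values only)
toy = pd.DataFrame({
    "GrLivArea": [850, 1200, 1500, 1800, 2400, 3000],
    "OverallQual": [4, 5, 6, 7, 8, 9],
    "YearBuilt": [1950, 1995, 1960, 2005, 1980, 2010],
    "SalePrice": [100000, 140000, 180000, 230000, 320000, 450000],
})

numeric = toy.select_dtypes(include="number")

# Correlation of every numerical feature with the target, by both methods
pearson = numeric.corr(method="pearson")["SalePrice"].drop("SalePrice")
spearman = numeric.corr(method="spearman")["SalePrice"].drop("SalePrice")

# Side-by-side summary, strongest absolute Spearman correlation first
summary = (
    pd.DataFrame({"pearson": pearson, "spearman": spearman})
    .sort_values(by="spearman", key=abs, ascending=False)
)
print(summary.round(2))
```

A feature with a high Spearman but noticeably lower Pearson coefficient hints at a monotonic relationship that is not linear.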

#### Insights

* Variables strongly correlated with the target.
* Features that may exhibit multicollinearity.

# Dealing with Missing Data

## Drop Variables

* **Purpose**: Remove columns with excessive missing values or those deemed irrelevant to the analysis.
* **Steps**:
    1. Set a threshold for dropping variables (e.g., > 80% missing).
    2. List Variables to drop based on domain knowledge or exploratory analysis.
    3. Document the rationale for dropping each variable.
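
The steps above can be sketched as follows. The 80% threshold mirrors the example in step 1; the toy columns and values are assumptions for illustration, and the variables actually dropped should come from the exploration of the real dataset.

```python
import numpy as np
import pandas as pd

MISSING_THRESHOLD = 0.80  # assumed cut-off, matching the "> 80% missing" example

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "EnclosedPorch": [np.nan, np.nan, np.nan, np.nan, np.nan, 112.0],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0, 85.0],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Share of missing values per column (0.0 to 1.0)
missing_share = toy.isna().mean()

# Columns whose missing share exceeds the threshold
vars_to_drop = missing_share[missing_share > MISSING_THRESHOLD].index.to_list()

# Record the rationale for each dropped variable
for col in vars_to_drop:
    print(f"Dropping {col}: {missing_share[col]:.0%} missing")

df_dropped = toy.drop(columns=vars_to_drop)
print(df_dropped.columns.to_list())
```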

### Expected Outcome:

* Reduced dimensionality without significant information loss.

## Impute Missing Values

* **Purpose**: Fill missing values to retain as much data as possible while minimizing bias.
* **Strategies**:
  1. **Numerical Variables**:
    * Use mean, median, or mode imputation for numerical variables.
    * Use domain-specific values where a missing entry has a known meaning (e.g., 0 for the area of a feature the house does not have).
  2. **Categorical Variables**:
    * Impute the most frequent category or use placeholders like "Unknown" or "None" for missing values.
  3. **Pipeline**:
    * Use pipelines to streamline imputation and apply consistent transformations.
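
As a rough illustration of these strategies, the sketch below applies median, placeholder, and most-frequent imputation using plain pandas. It is not the notebook's pipeline: the helper `impute_missing`, the column names, and the choice of strategy per column are all hypothetical. The imputers imported earlier from `feature_engine` could fill the same role inside a `Pipeline`.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "GarageFinish": ["Fin", None, "Unf", None, "RFn"],
    "BsmtExposure": ["No", "Gd", None, "No", "Av"],
})

def impute_missing(df, median_cols, placeholder_cols, frequent_cols,
                   placeholder="None"):
    """Return a copy of df with simple per-column imputation applied."""
    out = df.copy()
    for col in median_cols:
        # Numerical: fill with the column median
        out[col] = out[col].fillna(out[col].median())
    for col in placeholder_cols:
        # Categorical: fill with an explicit placeholder label
        out[col] = out[col].fillna(placeholder)
    for col in frequent_cols:
        # Categorical: fill with the most frequent category
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

imputed = impute_missing(
    toy,
    median_cols=["LotFrontage"],
    placeholder_cols=["GarageFinish"],
    frequent_cols=["BsmtExposure"],
)
print(imputed)
```

When the data feeds a model, the fill values would normally be learned from the training set only and then reused on the test set, which is the main reason to prefer pipeline-based imputers over a one-off helper like this.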

# Standardizing Formatting

* **Purpose**: Ensure uniformity in data representation for easier downstream processing.
* **Steps**:
  1. **Rename Columns**:
    * Use consistent naming conventions.
  2. **Data Type Standardization**:
    * Convert float columns with no decimals to integers.
    * Ensure categorical variables are encoded as category dtype.
  3. **Date Formatting**:
    * Standardize date formats if applicable.
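
A minimal sketch of the renaming and data type steps is shown below, assuming a naming convention of "no surrounding or inner spaces". The helper `standardize`, the toy columns, and the list of categorical columns are hypothetical. Whole-number float columns are converted to pandas' nullable `Int64` dtype so that any remaining missing values are preserved; date formatting is not covered here.

```python
import numpy as np
import pandas as pd

# Toy frame with untidy column names (illustrative values only)
toy = pd.DataFrame({
    " Garage Area": [548.0, 460.0, np.nan],
    "1stFlrSF": [856.0, 1262.0, 920.0],
    "KitchenQual ": ["Gd", "TA", "Gd"],
    "LotFrontage": [65.5, 80.0, 68.0],
})

def standardize(df, categorical_cols):
    """Return a copy with tidy column names and standardized dtypes."""
    out = df.copy()
    # Strip surrounding whitespace and remove inner spaces from column names
    out.columns = out.columns.str.strip().str.replace(" ", "", regex=False)
    for col in out.select_dtypes(include="float").columns:
        values = out[col].dropna()
        # Floats that only hold whole numbers become nullable integers
        if not values.empty and (values % 1 == 0).all():
            out[col] = out[col].astype("Int64")
    for col in categorical_cols:
        out[col] = out[col].astype("category")
    return out

clean = standardize(toy, categorical_cols=["KitchenQual"])
print(clean.dtypes)
```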

# Splitting the Data

* **Purpose**: Prepare the data for model training and evaluation by splitting it into training and testing sets.
* **Steps**:
  1. Define the target variable.
  2. Split the dataset into training (80%) and testing (20%) subsets.
  3. Use stratified sampling if necessary to maintain class balance.
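
The split can be sketched with pandas alone, as below. This is an assumption-laden illustration: the helper `split_train_test`, the seed, and the toy data are hypothetical, and scikit-learn's `train_test_split` is the more common choice in practice. Because `SalePrice` is continuous, stratifying would first require binning the target, which is not shown.

```python
import pandas as pd

TARGET = "SalePrice"  # target variable named in this notebook

# Toy frame with ten rows (illustrative values only)
toy = pd.DataFrame({
    "GrLivArea": [850, 1200, 1500, 1800, 2400, 3000, 1100, 1650, 2100, 950],
    "SalePrice": [100000, 140000, 180000, 230000, 320000,
                  450000, 130000, 200000, 280000, 110000],
})

def split_train_test(df, target, test_size=0.2, seed=0):
    """Randomly split df into train/test features and targets.

    Assumes df has a unique index, since test rows are removed by label.
    """
    test = df.sample(frac=test_size, random_state=seed)
    train = df.drop(index=test.index)
    return (
        train.drop(columns=target),
        test.drop(columns=target),
        train[target],
        test[target],
    )

X_train, X_test, y_train, y_test = split_train_test(toy, TARGET)
print(X_train.shape, X_test.shape)
```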

# Save Cleaned Data

* **Purpose**: Save cleaned datasets for use in subsequent stages of the project.
* **Steps**:
  1. Create a directory for cleaned datasets.
  2. Save the following:
     1. cleaned house prices dataset
     2. cleaned inherited houses dataset
     3. training and testing sets.
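
A hedged sketch of the save step follows. The helper `save_dataset` is hypothetical, and the example writes to a temporary directory so it can be run without side effects; in the notebook the target directory would be `outputs/datasets/cleaned/`, as listed in the Outputs section.

```python
import os
import tempfile

import pandas as pd

def save_dataset(df, directory, filename):
    """Create the directory if needed, write df as CSV, and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    df.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    return path

# Toy frame standing in for a cleaned dataset (illustrative values only)
toy = pd.DataFrame({"GrLivArea": [856, 1262], "SalePrice": [208500, 181500]})

with tempfile.TemporaryDirectory() as tmp:
    cleaned_dir = os.path.join(tmp, "outputs", "datasets", "cleaned")
    saved_path = save_dataset(toy, cleaned_dir, "cleaned_house_prices_records.csv")
    # Read the file back to confirm that the round trip preserves the data
    reloaded = pd.read_csv(saved_path)

round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

The same helper could be called once per dataset (cleaned house prices, cleaned inherited houses, and the training and testing sets).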

# Conclusion & Next Steps

## Conclusion

## Next Steps

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until the folder name is filled in
except Exception as e:
  print(e)
