# Data Cleaning Notebook
## Objectives
* Evaluate missing data

## Inputs
* outputs/datasets/collection/house-price-2021.csv
## Outputs
* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned

## Change working directory
We need to change the working directory from the current folder to its parent folder.

  * We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory

  * os.path.dirname() gets the parent directory
  * os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
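The directory change above can be tried out safely on a throwaway folder first. This is a minimal sketch, assuming a hypothetical helper `parent_of` and a hypothetical subfolder name `notebooks`; neither is part of this project.

```python
import os
import tempfile

def parent_of(path):
    """Return the parent directory of `path` (hypothetical helper)."""
    return os.path.dirname(os.path.abspath(path))

start_dir = os.getcwd()  # remember where we started so we can go back

with tempfile.TemporaryDirectory() as root:
    root_real = os.path.realpath(root)
    child = os.path.join(root_real, "notebooks")  # hypothetical folder name
    os.makedirs(child)

    os.chdir(child)                    # pretend the notebook runs from here
    os.chdir(parent_of(os.getcwd()))   # move up one level, as in the cells above
    moved_to = os.path.realpath(os.getcwd())

    os.chdir(start_dir)                # restore before the temp folder is removed

print(moved_to == root_real)
```

Note that re-running the `os.chdir` cell moves up one more level each time, so it is worth confirming the current directory after every run.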

# Import Packages

* Import packages using the 'import' statement followed by the name of the package, for example 'import pandas', which is commonly used for data manipulation and analysis. An optional alias follows the 'as' keyword; 'pd' is the conventional alias for pandas, although any name works.

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load Collected data

In [None]:
# Load the collected dataset, then show its shape and first rows
df = pd.read_csv("/workspace/housing-market-analysis/outputs/datasets/collection/house-price-2021.csv")
print(df.shape)
df.head()

# **Data Exploration** 
Now check the shape and distribution of any variables with missing data. The data was already cleaned in the earlier notebooks, so this is a confirmation step.

In [None]:
# List the columns that contain at least one missing value
vars_with_missing_data = df.columns[df.isna().any()].to_list()
print(vars_with_missing_data)
if not vars_with_missing_data:
    print("\n* There are no Feature variables missing data!!\n\n")

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("\nThere are no Feature variables with missing data\n\n")
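To see what the missing-data check returns when gaps do exist, here is a small sketch on a made-up DataFrame. The column names `A`, `B` and `C` are purely illustrative and are not columns of the house price dataset.

```python
import numpy as np
import pandas as pd

# Toy data: columns A and C contain gaps, column B is complete
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": ["x", "y", "z"],
    "C": [np.nan, np.nan, 1.0],
})

has_missing = toy.isna().any()                       # one True/False per column
cols_with_missing = toy.columns[has_missing].to_list()
print(cols_with_missing)  # ['A', 'C']
```

An empty list means no column has missing values, which is the case the `else` branch above handles.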

In [None]:
def EvaluateMissingData(df):
    """Summarise, per column, the count and percentage of missing values and the dtype."""
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute/len(df)*100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                  "PercentageOfDataset": missing_data_percentage,
                                  "DataType": df.dtypes}
                            )
                       .sort_values(by=['PercentageOfDataset'], ascending=False)
                       .query("PercentageOfDataset > 0")
                       )

    # Only columns with at least one missing value remain in the summary
    return df_missing_data

### Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                    df.drop(['SalePrice'], axis=1),
                                    df['SalePrice'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("Train set:", X_train.shape, y_train.shape, "\n Test set:", X_test.shape, y_test.shape)
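The idea behind the split can be illustrated with pandas alone. This is a rough sketch on made-up data using `DataFrame.sample`; it is not how `train_test_split` is implemented and it will not pick the same rows, it only shows the 80/20 hold-out and the separation of the `SalePrice` target. The `feature` column and the `*_toy` names are hypothetical.

```python
import pandas as pd

# Toy data: 10 rows, one feature and the SalePrice target
toy = pd.DataFrame({"feature": range(10), "SalePrice": range(100, 110)})

# Hold out 20% of the rows as the test part; the rest is the train part
test_part = toy.sample(frac=0.2, random_state=101)
train_part = toy.drop(index=test_part.index)

# Separate the target from the features in each part
X_train_toy, y_train_toy = train_part.drop(columns=["SalePrice"]), train_part["SalePrice"]
X_test_toy, y_test_toy = test_part.drop(columns=["SalePrice"]), test_part["SalePrice"]

print("Train set:", X_train_toy.shape, y_train_toy.shape)
print("Test set:", X_test_toy.shape, y_test_toy.shape)
```

Fixing `random_state` keeps the split reproducible between runs, which is why the cell above sets it as well.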

In [None]:
df_missing_data = EvaluateMissingData(X_train)
print(f"* There are {df_missing_data.shape[0]} variables with missing data \n")
df_missing_data

### Drop Variables
* Drop variables with more than 80% missing data, since these variables will likely not add much value.
* Step 1: Choose the imputation approach: Drop Variables.
* Step 2: Select the variables to apply the imputation approach to.

In [None]:
variables_method = ['WoodDeckSF', 'MasVnrArea']

print(f"* {len(variables_method)} variables to drop \n\n"
    f"{variables_method}")
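The list above is picked by hand. As an alternative, the 80% rule could be applied programmatically. The following is a sketch on made-up data, assuming a hypothetical threshold variable `max_missing_share` and illustrative column names; it is not the selection used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy Train set with 10 rows and different amounts of missing data per column
train_toy = pd.DataFrame({
    "AllNull": [np.nan] * 10,            # 100% missing
    "MostlyNull": [np.nan] * 9 + [1.0],  # 90% missing
    "HalfNull": [np.nan] * 5 + [2.0] * 5,  # 50% missing
    "Complete": list(range(10)),         # 0% missing
})
# Toy Test set with the same columns
test_toy = pd.DataFrame({
    "AllNull": [np.nan] * 3,
    "MostlyNull": [np.nan, 4.0, np.nan],
    "HalfNull": [1.0, np.nan, 3.0],
    "Complete": [10, 11, 12],
})

max_missing_share = 0.8  # hypothetical name for the 80% threshold

# Decide on the Train set only, then drop the same columns from both sets
missing_share = train_toy.isna().mean()
to_drop = missing_share[missing_share > max_missing_share].index.to_list()

train_clean = train_toy.drop(columns=to_drop)
test_clean = test_toy.drop(columns=to_drop)
print(to_drop)  # ['AllNull', 'MostlyNull']
```

Deriving the list from the Train set alone keeps information from the Test set out of the cleaning decisions.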

* Step 3: Create a separate DataFrame applying this imputation approach to the selected variables.

In [None]:
from feature_engine.selection import DropFeatures
imputer = DropFeatures(features_to_drop=variables_method)
imputer.fit(X_train)
df_method = imputer.transform(X_train)

* Dropping **['WoodDeckSF']** because it is null from top to bottom and so provides no value to the dataset.
* Dropping **['MasVnrArea']** because it consists mainly of null values and so provides little to no value to the dataset.

In [None]:
from feature_engine.selection import DropFeatures
# Fit on the Train set only, then apply the same transformation to Train and Test
imputer = DropFeatures(features_to_drop=variables_method)
imputer.fit(X_train)

X_train, X_test = imputer.transform(X_train), imputer.transform(X_test)

In [None]:
EvaluateMissingData(X_train)

I am satisfied with my current data set.

## **Push cleaned data to Repo**

In [None]:
import os
try:
  os.makedirs(name='outputs/datasets/cleaned') # create outputs/datasets/cleaned folder
except Exception as e:
  print(e)
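If the folder already exists, the cell above prints the error and carries on. An alternative is the `exist_ok=True` argument of `os.makedirs`, which makes the call safe to repeat. A minimal sketch, run inside a temporary directory so it does not touch the project folders:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "outputs", "datasets", "cleaned")

    os.makedirs(target, exist_ok=True)  # creates all missing parent folders
    os.makedirs(target, exist_ok=True)  # second call raises no error

    created = os.path.isdir(target)

print(created)  # True
```

With `exist_ok=True`, other problems such as missing permissions still raise an exception instead of being printed and ignored.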

In [None]:
X_train.to_csv("outputs/datasets/cleaned/X_TrainCleaned.csv", index=False)

In [None]:
X_test.to_csv("outputs/datasets/cleaned/X_testCleaned.csv", index=False)
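A quick way to confirm that a saved file reads back as expected is a round trip through `to_csv` and `read_csv`. The sketch below uses made-up data with hypothetical columns `a` and `b` and writes to a temporary directory rather than to outputs/datasets/cleaned.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical columns

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_TrainCleaned.csv")
    toy.to_csv(path, index=False)   # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

print(reloaded.shape, list(reloaded.columns))
```

Because the index is not written, the reloaded DataFrame has no extra unnamed index column.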