# **Data Cleaning**

## Objectives

* Prepare the data sets for further analysis

<br>

* Load and inspect the data prepared during data collection
* Data exploration
* Correlation and PPS study
* Data Cleaning
* Conclusion and next steps

## Inputs

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv
* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv

## Outputs

* outputs/datasets/cleaned/train_set.csv
* outputs/datasets/cleaned/test_set.csv
* outputs/datasets/cleaned/clean_house_price_records.csv
* outputs/datasets/cleaned/clean_inherited_houses.csv

## Additional Comments

* This notebook was written based on the guidelines provided in the Customer Churn walkthrough project, data cleaning lesson.
* This notebook relates to the Data Preparation step of the CRISP-DM methodology

---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
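
As a minimal sketch of what the cells above do, the snippet below computes the parent of a directory with `os.path.dirname()` without actually changing the working directory. The helper name `parent_directory` and the temporary folder layout are assumptions made only for illustration.

```python
import os
import tempfile


def parent_directory(path):
    """Return the parent folder of `path` (hypothetical helper for illustration)."""
    # normpath strips a trailing separator so dirname really returns the parent
    return os.path.dirname(os.path.normpath(path))


# Build a throwaway folder structure: <tmp>/project/jupyter_notebooks
with tempfile.TemporaryDirectory() as tmp:
    notebooks_dir = os.path.join(tmp, "project", "jupyter_notebooks")
    os.makedirs(notebooks_dir)
    project_dir = parent_directory(notebooks_dir)
    parent_name = os.path.basename(project_dir)

print(parent_name)
```

Passing `project_dir` to `os.chdir()` would then make it the working directory, which is what the notebook does with its own parent folder.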

---

# Import Packages and set environment variables

In [None]:
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
from pandas_profiling import ProfileReport
from feature_engine.imputation import ArbitraryNumberImputer, CategoricalImputer
from sklearn.pipeline import Pipeline

---

## Load Data

* Load the data downloaded in the data collection notebook

In [None]:
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df.shape)

In [None]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited.shape)
df_inherited

---

# Data Exploration

Explore the dataset, check variable types and distribution, missing levels and what value these variables may add in the context of the first business requirement.

* First list the variables that are missing values

In [None]:
vars_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_missing_data
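
To make the idiom above concrete, here is a small sketch on a toy DataFrame. The column values are invented for illustration and are not taken from the house price records.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in two of the three columns
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "GarageFinish": ["Unf", None, "Fin"],
    "SalePrice": [208500, 181500, 223500],
})

# isna().sum() counts nulls per column; the boolean mask selects the column labels
toy_missing = toy.columns[toy.isna().sum() > 0].to_list()
print(toy_missing)
```

Only the columns that hold at least one null are returned, in their original column order.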

* Run a pandas profiling report using only the columns listed in `vars_missing_data`

In [None]:
if vars_missing_data:
   pandas_report = ProfileReport(df=df[vars_missing_data], minimal=True)
   pandas_report.to_notebook_iframe()
else:
   print("There are no variables with missing data.")

---

# Correlation and PPS Analysis

* In this section I want to understand how the target variable, SalePrice, correlates with the features.
* I am using the same code from the PPS (Predictive Power Score) lesson to build heatmaps for the Pearson and Spearman correlations, as well as a PPS heatmap.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps

%matplotlib inline

def heatmap_corr(df, threshold, figsize=(20,12), font_annot = 8):
  if len(df.columns) > 1:
    mask = np.zeros_like(df, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    mask[abs(df) < threshold] = True

    fig, axes = plt.subplots(figsize=figsize)
    sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                linewidth=0.5
                     )
    axes.set_yticklabels(df.columns, rotation = 0)
    plt.ylim(len(df.columns),0)
    plt.show()


def heatmap_pps(df, threshold, figsize=(20,12), font_annot = 8):
    if len(df.columns) > 1:

      mask = np.zeros_like(df, dtype=bool)
      mask[abs(df) < threshold] = True

      fig, ax = plt.subplots(figsize=figsize)
      ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                       mask=mask,cmap='rocket_r', annot_kws={"size": font_annot},
                       linewidth=0.05, linecolor='grey')
      
      plt.ylim(len(df.columns),0)
      plt.show()



def CalculateCorrAndPPS(df):
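  # correlations are only defined for numeric columns, so select them explicitly
  df_numeric = df.select_dtypes(include="number")
  df_corr_spearman = df_numeric.corr(method="spearman")
  df_corr_pearson = df_numeric.corr(method="pearson")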

  pps_matrix_raw = pps.matrix(df)
  pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

  pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
  print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
  print(pps_score_stats.round(3))

  return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20,12), font_annot=8 ):

  print("\n")
  print("* Analyze how the target variable for the ML models is correlated with other variables (features and target)")
  print("* Analyze multicollinearity, that is, how the features are correlated among themselves")

  print("\n")
  print("*** Heatmap: Spearman Correlation ***")
  print("It evaluates monotonic relationships between variables \n")
  heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Pearson Correlation ***")
  print("It evaluates the linear relationship between two continuous variables \n")
  heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Predictive Power Score (PPS) ***")
  print(f"PPS detects linear or non-linear relationships between two columns.\n"
        f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
  heatmap_pps(df=pps_matrix,threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)

* Calculate Correlations and Predictive Power Score

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

* The table above shows the most common levels for pps scores in the matrix. The majority are between 0 and 0.066.
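
The reshaping and summary steps inside `CalculateCorrAndPPS` can be sketched with pandas alone. The scores below are hypothetical and only mimic the `x`, `y` and `ppscore` columns that the function filters from the `pps.matrix()` output.

```python
import pandas as pd

# Hypothetical long-format scores, shaped like the x / y / ppscore columns used above
scores_long = pd.DataFrame({
    "x": ["A", "A", "B", "B"],
    "y": ["A", "B", "A", "B"],
    "ppscore": [1.0, 0.30, 0.05, 1.0],
})

# Pivot to a square matrix: columns are predictors (x), rows are targets (y)
scores_wide = scores_long.pivot(columns="x", index="y", values="ppscore")

# Drop the scores equal to 1 (here, each column predicting itself) before the quartiles
stats = scores_long.query("ppscore < 1")["ppscore"].describe()
print(scores_wide)
print(stats[["25%", "50%", "75%"]])
```

Reading the quartiles this way is one reasonable basis for picking the heatmap threshold: a value above the 75% quartile hides most of the weak scores.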

* Display correlation and pps results on Heatmaps

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.6, PPS_Threshold = 0.2,
                  figsize=(12,10), font_annot=10)

---

# Dataset Analysis


### Data Exploration

* The data profiling report shows that some fields contain many zero values. More concerning is the number of variables that contain null values.
  * I will examine these variables and explore whether there are common criteria that could help impute the missing data, or whether in some cases it is viable to drop the feature completely.
  * I will then run a correlation study and compare the before and after results to establish whether this exercise makes a difference to predicting the sale price.

### Correlation and PPS Analysis
* Note that the results show a number of variables to be moderate to strong predictors of other variables, mostly asymmetrically.
* However, I am interested in variables that are predictors of Sale Price.
  * From the results of both the correlation and PPS studies, I see that the strongest predictor of Sale Price (SalePrice) is Overall Quality (OverallQual) of the property.
  * Overall, the correlation study shows 6 features that are strongly and positively correlated with SalePrice, namely:
    * 1stFlrSF, GarageArea, GrLivArea, OverallQual, TotalBsmtSF, YearBuilt
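
A list like the one above can also be read off the correlation matrix programmatically instead of from the heatmap. The sketch below uses synthetic data (not the house price records) and reuses the 0.6 threshold passed to `DisplayCorrAndPPS`. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the house price records): one strong and one weak feature
rng = np.random.default_rng(0)
quality = rng.integers(1, 11, size=200)
toy = pd.DataFrame({
    "OverallQual": quality,
    "Noise": rng.normal(size=200),
    "SalePrice": 20000 * quality + rng.normal(scale=5000, size=200),
})

# Correlation of every feature with the target, strongest first
target_corr = (toy.corr(method="spearman")["SalePrice"]
               .drop("SalePrice")
               .sort_values(key=abs, ascending=False))

# Keep only the features at or above the 0.6 threshold used for the heatmaps
strong_features = target_corr[target_corr.abs() >= 0.6].index.to_list()
print(strong_features)
```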

---

`DataCleaningEffect()` is taken from `ML Feature Engine Unit 9: Custom Functions`
* Function objective: assess the effect of cleaning the data when you
  * impute the mean, median or an arbitrary number into a numerical variable
  * replace missing values with 'Missing' or the most frequent category in a categorical variable
* Parameters: `df_original`: data not cleaned, `df_cleaned`: cleaned data, `variables_applied_with_method`: variables where you applied a given method

  * It is understandable if, at first, you don't understand all the code in the function below. The point is to make sense of the pseudo-code and understand the function parameters.

In [None]:
import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt

def DataCleaningEffect(df_original,df_cleaned,variables_applied_with_method):

  flag_count=1 # Indicate plot number
  
  # distinguish between numerical and categorical variables
  categorical_variables = df_original.select_dtypes(exclude=['number']).columns 

  # scan over variables, 
    # first on variables that you applied the method
    # if the variable is numerical plot a histogram, if categorical plot a barplot
  for set_of_variables in [variables_applied_with_method]:
    print("\n=====================================================================================")
    print(f"* Distribution Effect Analysis After Data Cleaning Method in the following variables:")
    print(f"{set_of_variables} \n\n")
  

    for var in set_of_variables:
      if var in categorical_variables:  # it is categorical variable: barplot
        
        df1 = pd.DataFrame({"Type":"Original","Value":df_original[var]})
        df2 = pd.DataFrame({"Type":"Cleaned","Value":df_cleaned[var]})
        dfAux = pd.concat([df1, df2], axis=0)
        fig , axes = plt.subplots(figsize=(15, 5))
        sns.countplot(hue='Type', data=dfAux, x="Value",palette=['#432371',"#FAAE7B"])
        axes.set(title=f"Distribution Plot {flag_count}: {var}")
        plt.xticks(rotation=90)
        plt.legend() 

      else: # it is numerical variable: histogram

        fig , axes = plt.subplots(figsize=(10, 5))
        sns.histplot(data=df_original, x=var, color="#432371", label='Original', kde=True,element="step", ax=axes)
        sns.histplot(data=df_cleaned, x=var, color="#FAAE7B", label='Cleaned', kde=True,element="step", ax=axes)
        axes.set(title=f"Distribution Plot {flag_count}: {var}")
        plt.legend() 

      plt.show()
      flag_count+= 1

# Data Cleaning

## Assessing Missing Data Levels

* Custom function to display missing data levels in a dataframe. It shows the absolute levels, relative levels and data type.

In [None]:
def EvaluateMissingData(df):
  missing_data_absolute = df.isnull().sum()
  missing_data_percentage = round(missing_data_absolute/len(df)*100 , 2)
  df_missing_data = (pd.DataFrame(
                          data= {"RowsWithMissingData": missing_data_absolute,
                                 "PercentageOfDataset": missing_data_percentage,
                                 "DataType":df.dtypes}
                                  )
                    .sort_values(by=['PercentageOfDataset'],ascending=False)
                    .query("PercentageOfDataset > 0")
                    )

  return df_missing_data

In [None]:
EvaluateMissingData(df)

---

## Create a clean dataset

### Data cleaning approach

* Investigate variables listed with missing data
* Drop EnclosedPorch and WoodDeckSF - more than 80% null values
* Other fields may be imputed with a valid value or the median

Note:
 The 6 features that show a strong positive correlation with SalePrice are not listed among the variables that contain null values.
   * 1stFlrSF, GarageArea, GrLivArea, OverallQual, TotalBsmtSF, YearBuilt

---

Create a copy of the house price records dataframe

In [None]:
df_clean = df.copy()
print(df_clean.shape)

---

## Split the dataset into Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set, _, __ = train_test_split(
                                        df,
                                        df['SalePrice'],
                                        test_size=0.2,
                                        random_state=0)

print(f"train_set shape: {train_set.shape} \ntest_set shape: {test_set.shape}")

Evaluate train_set Missing values

In [None]:
df_missing_data = EvaluateMissingData(train_set)
print(f"* There are {df_missing_data.shape[0]} variables with missing data \n")
df_missing_data

---

## Individual variable analysis

### Variables to consider dropping


* Inspect `WoodDeckSF` and `EnclosedPorch` variables

In [None]:
df_wooddecksf = train_set.loc[train_set['WoodDeckSF'].notnull()]
df_wooddecksf[['WoodDeckSF', 'SalePrice']]

In [None]:
df_wooddecksf['WoodDeckSF'].value_counts().sort_index(ascending=False).head(10)


---

#### EnclosedPorch - Enclosed porch area in square feet
When evaluating missing data we can see that this variable contains more than 90% null values. Therefore, I deduce that this variable will add no value to the sale price analysis. In the inherited dataset the value for this variable is 0 for all 4 properties, meaning the porch is not enclosed. In addition, the `Correlation and PPS Analysis` shows that this field has no predictive power.

#### WoodDeckSF - Wood deck area in square feet
This variable contains approximately 89% null values. In the inherited dataset this field contains valid values, however, due to the lack of comparative data in the train set, this variable may add no value at all. Furthermore, doing a value count shows the data contains diverse sizes and not enough uniqueness, and little to no exact matches to the inherited house dataset. In addition, the `Correlation and PPS Analysis` shows that this field has no predictive power.

#### Conclusion
Drop both `EnclosedPorch` and `WoodDeckSF` using the `DropFeatures` transformer from `feature_engine`

In [None]:
from feature_engine.selection import DropFeatures
variables = ['EnclosedPorch', 'WoodDeckSF']
imputer = DropFeatures(features_to_drop=variables)
imputer.fit(train_set)
train_set, test_set = imputer.transform(train_set), imputer.transform(test_set)
train_set.head()

Drop the same features from the full copy of the house price records (`df_clean`) as well

In [None]:
df_clean = imputer.transform(df_clean)

---

In [None]:
null_variables = train_set.columns[train_set.isnull().any()].tolist()

### Variables to consider transforming or imputing

* Inspect `LotFrontage` and `MasVnrArea` variables

In [None]:
train_set['LotFrontage'].value_counts().sort_index(ascending=False).head(10)

In [None]:
train_set['MasVnrArea'].value_counts().sort_index(ascending=False).head(10)

* The PPS scores for LotFrontage and MasVnrArea show that these fields have no predictive power.
* The correlation study shows they have a moderate correlation with SalePrice.
* On inspecting the dataset, no other variable offers a way to identify or derive valid values for imputing the nulls in these two variables.

#### Conclusion

Use `MeanMedianImputer` to impute the median into the null values

In [None]:
from feature_engine.imputation import MeanMedianImputer
variables = ['LotFrontage', 'MasVnrArea']
imputer = MeanMedianImputer(imputation_method='median', variables=variables)
imputer.fit(train_set)
train_set, test_set = imputer.transform(train_set), imputer.transform(test_set)


In [None]:
df_clean = imputer.transform(df_clean)

In [None]:
EvaluateMissingData(train_set)

* Missing data evaluation shows that `EnclosedPorch`, `WoodDeckSF`, `LotFrontage` and `MasVnrArea` no longer appear on the list.
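
The fit-on-train, transform-on-both pattern used above can be illustrated with plain pandas. This is a simplified sketch with invented values, not a reproduction of `MeanMedianImputer` internals.

```python
import numpy as np
import pandas as pd

# Toy train / test frames (hypothetical values)
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 100.0]})

# "Fit": learn the median from the train set only
train_medians = toy_train[["LotFrontage"]].median()

# "Transform": reuse the train median on both sets, so no test information leaks in
toy_train_imputed = toy_train.fillna(train_medians)
toy_test_imputed = toy_test.fillna(train_medians)

print(train_medians["LotFrontage"])
print(toy_test_imputed["LotFrontage"].to_list())
```

The null in the toy test set is filled with the train median (70), even though the only observed test value is 100.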

---

#### 2ndFlrSF - Second floor square feet

* Inspect `2ndFlrSF` variable

In [None]:
train_set['2ndFlrSF'].value_counts().sort_index()

* `60` of `1168` rows contain null values. When studying the data, it appears that the value is set to `0` when there is no second floor. More than 50% of the values for this variable are `0`, so imputing the null values with `0` is a reasonable choice.
* We prepare the pipeline to use `ArbitraryNumberImputer` to impute `0` into the null values
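
The reasoning above (count the nulls, check how dominant `0` already is, then impute) can be sketched as follows. The series values and the 0.5 cut-off are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): mostly zeros, some nulls, a few real areas
second_floor = pd.Series([0, 0, 0, np.nan, 854, 0, np.nan, 866], name="2ndFlrSF")

null_count = second_floor.isna().sum()
# Share of zeros among the values that are actually present
zero_share = (second_floor.dropna() == 0).mean()

# Impute 0 only if zero is already the dominant value
if zero_share > 0.5:
    second_floor_imputed = second_floor.fillna(0)
else:
    second_floor_imputed = second_floor

print(null_count, round(zero_share, 2))
```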

---

#### BedroomAbvGr - Bedrooms above grade (does NOT include basement bedrooms)

* Inspect `BedroomAbvGr` variable

In [None]:
train_set['BedroomAbvGr'].value_counts().sort_index()

* `80` of `1168` rows contain null values. A value count shows that only 4 records have `0` for this variable. All 4 inherited properties contain values above zero. Imputing the null values with `0` may have no effect on the sale price analysis, but I assume that a `null` value is equivalent to `0`.
* Prepare the pipeline to use `ArbitraryNumberImputer` to impute `0` into the null values

---

In [None]:
pipeline = Pipeline([
      ( 'zero_imputer',  ArbitraryNumberImputer(arbitrary_number=0,
                                                variables=['2ndFlrSF', 'BedroomAbvGr']) )
])
pipeline

In [None]:
pipeline.fit(train_set)
train_set, test_set = pipeline.transform(train_set), pipeline.transform(test_set)

In [None]:

df_clean = pipeline.transform(df_clean)

In [None]:
EvaluateMissingData(train_set)

* Missing data evaluation shows that `2ndFlrSF` and `BedroomAbvGr` no longer appear on the list.

---

#### BsmtFinType1 - Rating of basement finished area

* Inspect `BsmtFinType1` variable

In [None]:
train_set['BsmtFinType1'].value_counts().sort_index()

* Inspect `BsmtExposure` variable

In [None]:
train_set['BsmtExposure'].value_counts().sort_index()

In [None]:
train_set[train_set['BsmtFinType1'].isna()].query('BsmtExposure=="None"').sort_values(by=['BsmtExposure'])

* `89` of `1168` rows contain null values. There are only `25` properties with no basement.
* `BsmtExposure`, however, contains no null values. On comparing the two fields I established that there are `3` rows where `BsmtExposure` is set to `None`, meaning there is no basement, and `BsmtFinType1` is null. For these `3` rows `BsmtFinType1` may be imputed with `None`.

In [None]:
query_condition = (train_set.BsmtExposure == 'None') & (train_set['BsmtFinType1'].isnull())
train_set['BsmtFinType1'] = np.where(query_condition, 'None', train_set['BsmtFinType1'])


In [None]:
query_condition = (test_set.BsmtExposure == 'None') & (test_set['BsmtFinType1'].isnull())
test_set['BsmtFinType1'] = np.where(query_condition, 'None', test_set['BsmtFinType1'])

In [None]:
query_condition = (df_clean.BsmtExposure == 'None') & (df_clean['BsmtFinType1'].isnull())
df_clean['BsmtFinType1'] = np.where(query_condition, 'None', df_clean['BsmtFinType1'])

In [None]:
train_set[train_set['BsmtFinType1'].isna()].query('BsmtExposure=="None"').sort_values(by=['BsmtExposure'])

* The `3` rows have been imputed with `None` and hence no longer appear
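
Because the same conditional imputation is applied to `train_set`, `test_set` and `df_clean`, it could be wrapped in a small helper. The function `impute_where` below is a hypothetical sketch, not part of the original notebook, and the toy rows are invented.

```python
import numpy as np
import pandas as pd


def impute_where(frame, condition_col, condition_value, target_col, fill_value):
    """Fill nulls in target_col where condition_col equals condition_value.

    Hypothetical helper, not part of the original notebook.
    """
    frame = frame.copy()  # leave the caller's DataFrame untouched
    mask = (frame[condition_col] == condition_value) & frame[target_col].isnull()
    frame.loc[mask, target_col] = fill_value
    return frame


# Toy rows (hypothetical values)
toy = pd.DataFrame({
    "BsmtExposure": ["None", "No", "None", "Gd"],
    "BsmtFinType1": [np.nan, np.nan, "None", "GLQ"],
})

toy_imputed = impute_where(toy, "BsmtExposure", "None", "BsmtFinType1", "None")
print(toy_imputed["BsmtFinType1"].to_list())
```

Only the first toy row is filled. The second row stays null because its `BsmtExposure` is not `None`.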

In [None]:
train_set['BsmtFinType1'].isna().sum()

* There are still `86` null values in `BsmtFinType1`

---

* Inspect `BsmtFinSF1` variable. Type 1 finished square feet.

In [None]:
df_temp = train_set[train_set['BsmtFinType1'].isna()].query('BsmtFinSF1==0').sort_values(by=['BsmtFinSF1'])
print(df_temp.shape)
df_temp

* Next we look at `BsmtFinSF1`, which contains no null values. We search for rows where `BsmtFinType1` is null and `BsmtFinSF1` is `0`, meaning `0` finished square feet, that is, unfinished. There are 27 such records, so we can impute `Unf` (unfinished) into `BsmtFinType1` for them.

In [None]:
query_condition = (train_set.BsmtFinSF1 == 0) & (train_set['BsmtFinType1'].isnull())
train_set['BsmtFinType1'] = np.where(query_condition, 'Unf', train_set['BsmtFinType1'])

In [None]:
query_condition = (test_set.BsmtFinSF1 == 0) & (test_set['BsmtFinType1'].isnull())
test_set['BsmtFinType1'] = np.where(query_condition, 'Unf', test_set['BsmtFinType1'])

In [None]:
query_condition = (df_clean.BsmtFinSF1 == 0) & (df_clean['BsmtFinType1'].isnull())
df_clean['BsmtFinType1'] = np.where(query_condition, 'Unf', df_clean['BsmtFinType1'])

In [None]:
train_set[train_set['BsmtFinType1'].isna()].query('BsmtFinSF1==0').sort_values(by=['BsmtFinSF1']).shape

* The `27` rows have been imputed with `Unf` and hence no longer appear

In [None]:
train_set['BsmtFinType1'].isna().sum()

* There are still `59` null values in `BsmtFinType1`. These remaining null values will be imputed with `Unk`, meaning `Unknown`

In [None]:
imputer = CategoricalImputer(imputation_method='missing',fill_value='Unk',
                             variables='BsmtFinType1')

imputer.fit(train_set)
train_set, test_set, df_clean = imputer.transform(train_set), imputer.transform(test_set), imputer.transform(df_clean)

In [None]:
train_set['BsmtFinType1'].isna().sum()

There are no null values in `BsmtFinType1`

---

* Inspect `GarageFinish` variable. Interior finish of the garage.

In [None]:
train_set['GarageFinish'].isna().sum()

* There are `131` null values for `GarageFinish`

In [None]:
train_set['GarageFinish'].value_counts().sort_index()

In [None]:
train_set.loc[train_set.GarageFinish=="None",'GarageArea'].value_counts()


* Note above, where `GarageFinish=="None"`, meaning there is no garage, `GarageArea` is found to be `0`.

In [None]:
train_set.loc[train_set.GarageFinish.isnull(),'GarageArea'].value_counts()

* Therefore, where `GarageFinish` is null we can check if `GarageArea` is `0` and if so we can impute `None` on `GarageFinish`.
* Based on the above query, only `5` rows will be affected.
* For the remaining records we will assume that the garages are unfinished and hence impute `Unf` on `GarageFinish`.
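
The two-step rule described above (first `None` where there is no garage area, then `Unf` for every other null) can be expressed in one pass with `np.select`. This is an alternative sketch on invented rows, not the approach taken in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values)
toy = pd.DataFrame({
    "GarageArea": [0, 480, 0, 300],
    "GarageFinish": [np.nan, np.nan, "None", "Fin"],
})

is_null = toy["GarageFinish"].isnull()
conditions = [
    is_null & (toy["GarageArea"] == 0),  # null and no garage area -> no garage
    is_null,                             # any other null -> assume unfinished
]
choices = ["None", "Unf"]

# np.select uses the first matching condition; existing values are the default
existing = toy["GarageFinish"].to_numpy(dtype=object)
toy["GarageFinish"] = np.select(conditions, choices, default=existing)
print(toy["GarageFinish"].to_list())
```

The order of the conditions matters: the more specific rule has to come first, otherwise every null would be filled with `Unf`.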

In [None]:
query_condition = (train_set.GarageArea == 0) & (train_set['GarageFinish'].isnull())
train_set['GarageFinish'] = np.where(query_condition, 'None', train_set['GarageFinish'])

In [None]:
query_condition = (test_set.GarageArea == 0) & (test_set['GarageFinish'].isnull())
test_set['GarageFinish'] = np.where(query_condition, 'None', test_set['GarageFinish'])

In [None]:
query_condition = (df_clean.GarageArea == 0) & (df_clean['GarageFinish'].isnull())
df_clean['GarageFinish'] = np.where(query_condition, 'None', df_clean['GarageFinish'])

In [None]:
train_set['GarageFinish'].isna().sum()

In [None]:
pipeline = Pipeline([
      ( 'categorical_imputer', CategoricalImputer(imputation_method='missing',
                                                  fill_value='Unf',
                                                  variables=['GarageFinish']) )
])
pipeline

In [None]:
pipeline.fit(train_set)

train_set, test_set = pipeline.transform(train_set), pipeline.transform(test_set)

In [None]:
df_clean = pipeline.transform(df_clean)

In [None]:
train_set['GarageFinish'].isna().sum()

There are no null values in `GarageFinish`

---

* Inspect `GarageYrBlt` variable. Year garage was built.

In [None]:
train_set['GarageYrBlt'].isna().sum()

* Get a row count of where `GarageYrBlt` is null, and return the value of `GarageFinish`.

In [None]:
train_set.loc[train_set.GarageYrBlt.isnull(),'GarageFinish'].value_counts()

In [None]:
train_set[train_set.GarageFinish=='None']

* Note there are `58` null records for `GarageYrBlt`, and wherever this variable is null `GarageFinish` is `None`, meaning there is no garage.
* Prepare a pipeline to use `ArbitraryNumberImputer` to impute `0` into the null values

In [None]:
pipeline = Pipeline([
      ( 'GarageYrBlt',  ArbitraryNumberImputer(arbitrary_number=0,
                                                variables='GarageYrBlt') )
])
pipeline

In [None]:
pipeline.fit(train_set)
train_set, test_set = pipeline.transform(train_set), pipeline.transform(test_set)

In [None]:
df_clean = pipeline.transform(df_clean)

---

Show missing data evaluation

In [None]:
EvaluateMissingData(train_set)

In [None]:
EvaluateMissingData(df_clean)

* There are no variables missing data.

---

## Before and After comparison

Assess the effect on the variable distribution
* The function plots in the same Axes the distribution before and after applying the method. This helps to give you insights into how different your variable would look after cleaning.
* We notice the "peak" in the variable distribution after median imputation.
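
The "peak" can also be checked numerically rather than visually. The sketch below uses a toy series with invented values and compares summary statistics before and after median imputation.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with three nulls
original = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0, np.nan, np.nan, np.nan])
median = original.median()
cleaned = original.fillna(median)

# The median itself is unchanged, but the spread shrinks and a spike appears at the median
comparison = pd.DataFrame({
    "Original": original.describe(),
    "Cleaned": cleaned.describe(),
})
rows_at_median_before = (original == median).sum()
rows_at_median_after = (cleaned == median).sum()

print(comparison.loc[["count", "50%", "std"]].round(2))
print(rows_at_median_before, rows_at_median_after)
```

A jump in the number of rows equal to the median, together with a smaller standard deviation, is the numerical counterpart of the spike seen in the histogram.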

In [None]:
DataCleaningEffect(df_original=df,
                   df_cleaned=df_clean,
                   variables_applied_with_method=null_variables)

---

## Datatype changes - `Float` to `Integer`

On examining the data in the house price records dataset (`df_clean`), we can see that the float columns contain no fractional values, so we will change these to int.

In [None]:
print(df_clean.shape)

In [None]:
df_clean.select_dtypes('float').info()

In [None]:
for col in df_clean.select_dtypes('float').columns:
    df_clean[col] = df_clean[col].astype('int64')

In [None]:
df_clean.select_dtypes('float').info()
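
Casting with `astype('int64')` raises an error if a column still contains nulls and silently truncates fractional values. A more defensive variant is sketched below. The helper `floats_to_int_if_whole` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def floats_to_int_if_whole(frame):
    """Cast float columns to int64 only when that loses no information.

    Hypothetical helper, not part of the original notebook.
    """
    frame = frame.copy()
    for col in frame.select_dtypes("float").columns:
        values = frame[col]
        # Skip columns that still hold NaN or fractional values
        if values.notna().all() and (values % 1 == 0).all():
            frame[col] = values.astype("int64")
    return frame


# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0],      # whole numbers stored as float
    "Ratio": [0.5, 1.25],             # genuinely fractional
    "MasVnrArea": [196.0, np.nan],    # still has a null
})
converted = floats_to_int_if_whole(toy)
print(converted.dtypes)
```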


On examining the data in the inherited houses dataset, we can see that the float columns contain no fractional values, so we will change these to int.

In [None]:
for col in df_inherited.select_dtypes('float').columns:
    df_inherited[col] = df_inherited[col].astype('int64')

In [None]:
df_inherited.select_dtypes('float').info()

In [None]:
df_inherited.info()

---

Change float columns to int for train and test

In [None]:
train_set.info()

In [None]:
for col in train_set.select_dtypes('float').columns:
    train_set[col] = train_set[col].astype('int64')

In [None]:
for col in test_set.select_dtypes('float').columns:
    test_set[col] = test_set[col].astype('int64')

---

# Save Train and Test sets to csv

* Create a cleaned folder.

In [None]:
import os
try:
  os.makedirs(name='outputs/datasets/cleaned') # create outputs/datasets/cleaned folder
except Exception as e:
  print(e)
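
As an alternative to catching the exception, `os.makedirs` accepts `exist_ok=True`, which makes the cell safe to re-run. The sketch below demonstrates this inside a temporary folder, so it does not touch the project tree.

```python
import os
import tempfile

# Use a temporary base folder so the sketch does not touch the project tree
with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "outputs", "datasets", "cleaned")

    # exist_ok=True makes the call safe to repeat on re-runs of the notebook
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)  # second call does not raise

    folder_created = os.path.isdir(target)

print(folder_created)
```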

* Output the clean datasets to csv files in the outputs/datasets/cleaned folder
* outputs/datasets/cleaned/train_set.csv
* outputs/datasets/cleaned/test_set.csv
* outputs/datasets/cleaned/clean_house_price_records.csv
* outputs/datasets/cleaned/clean_inherited_houses.csv

In [None]:
train_set.to_csv("outputs/datasets/cleaned/train_set.csv", index=False)

In [None]:
test_set.to_csv("outputs/datasets/cleaned/test_set.csv", index=False)

In [None]:
df_clean.to_csv("outputs/datasets/cleaned/clean_house_price_records.csv", index=False)

In [None]:
df_inherited.to_csv("outputs/datasets/cleaned/clean_inherited_houses.csv", index=False)

---

# Conclusions and Next Steps


* Created clean version of the housing price dataset and the inherited houses datasets
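* On the inherited dataset the only step taken in this notebook was to convert the float columns to integers. The variables `EnclosedPorch` and `WoodDeckSF` were dropped from the house price datasets only.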
* The housing price dataset was also split into Train and Test set.
* The clean datasets were saved to csv files, in the outputs/datasets/cleaned folder:
  * clean_house_price_records.csv
  * clean_inherited_houses.csv
  * train_set.csv
  * test_set.csv
* Next, we move on to Feature Engineering

---