# **Data Cleaning**

## Objectives

* Import the Dataset
* Run correlation studies to identify the variables most highly correlated with 'Sale Price'

## Inputs

* The dataset is located at: outputs/datasets/collection in the root level directory

## Outputs

* Results of the correlation study 



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [5]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Fabrizio-Project-Five'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [6]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [7]:
current_dir = os.getcwd()
current_dir

'/workspaces'

# Beginning of the correlation studies

In this notebook we will run correlation studies in order to answer the first business requirement which was:
* The client is interested in discovering how the house attributes correlate with the sale price. Therefore, the client expects data visualizations of the correlated variables against the sale price to show that.
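
The cells that follow use a DataFrame named `df_ohe`, which this notebook does not define. As a minimal sketch, assuming `df_ohe` is a one-hot encoded copy of the housing data (the column names `KitchenQual_TA` and `GarageFinish_Unf` suggest this), it could be built with `pd.get_dummies`. The toy rows below are invented for illustration and stand in for the real dataset, whose file name is not given here.

```python
import pandas as pd

# Toy stand-in for the housing data (values are invented for illustration)
df_toy = pd.DataFrame({
    'SalePrice': [208500, 181500, 223500, 140000],
    'OverallQual': [7, 6, 7, 5],
    'KitchenQual': ['Gd', 'TA', 'Gd', 'TA'],
    'GarageFinish': ['RFn', 'RFn', 'Fin', 'Unf'],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged
df_ohe_toy = pd.get_dummies(
    df_toy, columns=['KitchenQual', 'GarageFinish'], dtype=int
)

print(df_ohe_toy.columns.tolist())
```

Each category becomes its own 0/1 column named `<column>_<category>`, which is why a level such as Typical/Average can be correlated with the sale price on its own.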



In [8]:
df_ohe_corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)

df_ohe_corr_spearman

NameError: name 'df_ohe' is not defined
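
The chained expression in the cell above sorts the correlations by absolute value and then skips the first entry, which is expected to be 'SalePrice' itself (its correlation with itself is 1.0). A minimal sketch on a hand-made Series shows the effect, along with a variant that drops the target by name rather than by position. The feature names `A`, `B` and `C` are hypothetical.

```python
import pandas as pd

# Hand-made correlations with the target; 'A', 'B', 'C' are hypothetical features
corr = pd.Series({'SalePrice': 1.0, 'A': 0.8, 'B': -0.9, 'C': 0.1})

# Same idea as the cell above: sort by |value|, skip the first row (the target)
top_by_position = corr.sort_values(key=abs, ascending=False).iloc[1:].head(2)

# Variant: drop the target by name, so the result does not depend on sort order
top_by_name = corr.drop('SalePrice').sort_values(key=abs, ascending=False).head(2)

print(top_by_position)
```

Sorting with `key=abs` keeps strong negative correlations near the top, which a plain descending sort would push to the bottom.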

In the cell above we can see the ten features most correlated with 'SalePrice' using the Spearman method. Now let's do the same with the Pearson method and see whether we get different values.

In [None]:
df_ohe_corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
df_ohe_corr_pearson
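
Spearman and Pearson can give different values because Spearman measures monotonic association (it works on ranks), while Pearson measures linear association. A minimal sketch on synthetic data, not taken from the housing dataset, illustrates the difference for a relationship that always increases but is far from a straight line:

```python
import numpy as np
import pandas as pd

# Synthetic data: y grows exponentially with x (monotonic, not linear)
x = np.arange(1, 11)
toy = pd.DataFrame({'x': x, 'y': np.exp(x)})

spearman = toy.corr(method='spearman').loc['x', 'y']
pearson = toy.corr(method='pearson').loc['x', 'y']

print(f"Spearman: {spearman:.3f}, Pearson: {pearson:.3f}")
```

In this toy case Spearman reaches 1 because the ordering of `x` and `y` is identical, while Pearson stays lower because the points do not lie on a line.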

It is clear that we have stronger correlation levels with the 'Spearman' method. Now that we know our most correlated features we can say that:

- The price of a house increases with the increase of Overall Quality
- The price of a house increases with the increase of the above-grade Ground Living Area (in square feet)
- The price of a house increases when the year of construction is more recent
- The price of a house increases with the increase of Garage Area
- The price of a house increases when the Kitchen Quality is NOT assessed as Typical/Average
- The price of a house increases when the year of construction of the garage is more recent
- The price of a house increases when the basement surface (measured in square feet) increases
- The price of a house increases when the year of remodelling is more recent
- The price of a house increases when the surface area of the 1st floor (measured in square feet) increases
- The price of a house increases when the state of the garage is NOT unfinished

In [None]:
df_ohe.filter(df_ohe_corr_spearman.index) 

To fully answer the first business requirement (correlation levels with our target variable 'SalePrice' and visualizations of them), we need plots that help us understand these correlations visually.

---

As a first plot, a heat map will help us visualize the correlation levels.

In [None]:
df_spearman_corr = df_ohe.corr(method='spearman')
df_spearman_corr

In [9]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
plt.figure(figsize=(14, 12))
# Mask the upper triangle so each pair appears only once (code taken from seaborn docs)
mask = np.triu(np.ones_like(df_spearman_corr, dtype=bool))
sns.heatmap(df_spearman_corr, linewidths=.5, mask=mask)

NameError: name 'df_spearman_corr' is not defined

<Figure size 1400x1200 with 0 Axes>
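
The `mask` in the cell above hides the upper triangle of the matrix, diagonal included, because a correlation matrix is symmetric and each pair would otherwise appear twice. A small sketch on a 3x3 toy correlation matrix (random data, for illustration only) shows which cells are masked:

```python
import numpy as np
import pandas as pd

# Toy data: three random columns, only used to get a small correlation matrix
rng = np.random.default_rng(0)
toy_corr = pd.DataFrame(rng.normal(size=(20, 3)), columns=['a', 'b', 'c']).corr()

# True marks a cell that seaborn will hide: the diagonal and everything above it
mask = np.triu(np.ones_like(toy_corr, dtype=bool))
print(mask)
```

Only the cells below the diagonal remain visible, so every pair of variables is drawn exactly once.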

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


def corr_plots_saleprice(df, columns):
    """
    Plot each of the given columns (here, the 10 variables most
    correlated with 'SalePrice') against 'SalePrice'.
    """
    # Ordinal or binary variables are easier to read as violin plots
    discrete_cols = ['OverallQual', 'KitchenQual_TA', 'GarageFinish_Unf']
    for col in columns:
        plt.figure(figsize=(10, 8))
        if col in discrete_cols:
            sns.violinplot(x=df[col], y=df['SalePrice'])
        else:
            sns.scatterplot(x=df[col], y=df['SalePrice'])
        plt.show()


corr_plots_saleprice(df_ohe, df_ohe_corr_spearman.index.to_list())


The plots above show how the values of the most correlated variables are distributed against 'Sale Price'.
This allows us to draw stronger conclusions and gives a firmer basis for the rationale behind the selling price.

---

Now that we have plots showing the correlation levels, we can generate some reports using the predictive power score library to investigate any potential non-linear relationships.
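
To see why a check for non-linear relationships can be worthwhile, consider a dependence that neither correlation method detects. The sketch below uses synthetic data (not the housing dataset): `y` is fully determined by `x`, yet both coefficients come out at essentially zero because the relationship is symmetric rather than monotonic.

```python
import numpy as np
import pandas as pd

# Synthetic data: y = x**2 on a range symmetric around zero
x = np.arange(-5, 6)
toy = pd.DataFrame({'x': x, 'y': x ** 2})

pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

A score that asks how well one variable predicts another, rather than how linearly or monotonically they move together, is one way to surface this kind of pattern.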

In [None]:
import ppscore as pps
pps.matrix(df=df)

Let's check these scores in a heat map

In [None]:
# code taken from the CI lesson on PPS
pps_matrix_raw = pps.matrix(df)
pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')
plt.figure(figsize=(14, 12))
sns.heatmap(data=pps_matrix, linewidths=0.5, annot=True)


As we can see from the heatmap, we get some confirmation of our previous correlation study.
On the Y axis, when we analyze 'SalePrice', we notice that the variables with the strongest predictive scores are:
* Overall Quality
* Kitchen Quality
* Year Built
* Garage Area
* Year Remodel Add (Year Remodel Date)

Of these, only the first two reach a meaningful predictive power score (above 0.2). This agrees with our earlier correlation analysis: after the feature engineering steps, two of the most highly correlated variables in the dataset were 'Overall Quality' and 'Kitchen Quality TA', where TA stands for Typical/Average.

### Final conclusions

In our correlation studies we identified the variables most correlated with our target, 'Sale Price'. One variable stands out with the strongest correlation on its own: 'Overall Quality'. This is quite obvious from a logical standpoint, but it is also worth considering that other factors, taken together, might play a significant role in determining the appropriate price range for a given house.

Even without an ML pipeline, our friend Lydia Doe would still be able to price her inherited houses in a profitable yet competitive range.

Now let's move to our next notebook where we will proceed with the ML pipeline steps.