# **Notebook 2**

## Objectives
---

* Answer business requirement 1
    * Perform a correlation and/or PPS study to investigate the most relevant variables correlated to the sale price.
* Some background information
    - The client wants to predict the sale prices of homes in Iowa.
    - The dataset consists of previously collected features and the sale price for each house.
    - The target variable is sale price.

## Steps
---

* Perform EDA - This can help us learn generally about the state of the dataset.
* Perform a correlation study - We will use the Pearson and Spearman methods.
* Select highly-correlated features

## Change working directory
---

First we set the working directory for the notebook.

In [None]:
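import os
current_dir = os.getcwd()
# Move up one level so that relative paths resolve from the project root.
# Run this cell only once; each run moves up another level.
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(current_dir)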
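
As a rough illustration of what the cell above does, assume the notebook lives in a hypothetical `/workspace/project/jupyter_notebooks` folder (the real path depends on your environment). Moving to the parent directory means the relative paths used later, such as `outputs/datasets/collection/house_prices.csv`, resolve from the project root.

```python
from pathlib import PurePosixPath

# Hypothetical notebook location; the real path depends on your setup.
notebook_dir = PurePosixPath("/workspace/project/jupyter_notebooks")

# Taking the parent has the same effect as os.path.dirname(...) above.
project_root = notebook_dir.parent

# Relative paths used later in the notebook resolve against the project root.
data_path = project_root / "outputs/datasets/collection/house_prices.csv"
print(project_root)
print(data_path)
```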

## Load Data
---

Load the data from the outputs folder of the previous notebook. SalePrice is the target variable in this study, so we keep it in the DataFrame and correlate the other features against it.

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices.csv")
df.head(3)

## Data Exploration
---

We can use a library called pandas-profiling to explore the dataset with a GUI that will give us insights into the characteristics of each feature and the relationships they share with each other.

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

In [None]:
df.info()

## Correlation Study
---

Next we need to perform a correlation study on the different variables in the dataset. This will allow us to filter out the different variables that are not vital for determining the target variable.

In [None]:
corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(15)
corr_spearman
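
The ranking pattern in the cell above can be illustrated on toy data. This is a minimal sketch with made-up numbers: the `toy` DataFrame and its `Area`, `Age`, and `Noise` columns are hypothetical and not taken from the dataset. Sorting by absolute value keeps strong negative correlations near the top, and the first entry is dropped because it is the target's correlation with itself.

```python
import pandas as pd

# Hypothetical toy data, not taken from the house prices dataset.
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "Area": [50, 70, 95, 120, 160],   # rises with price
    "Age": [40, 30, 25, 10, 5],       # falls as price rises
    "Noise": [3, 1, 4, 1, 5],         # only weakly related
})

# Correlate every column with the target and rank by strength, ignoring sign.
ranked = toy.corr(method="pearson")["SalePrice"].sort_values(key=abs, ascending=False)

# The first entry is SalePrice correlated with itself, so skip it.
top = ranked.iloc[1:]
print(top)
```

The same pattern works for any `method` accepted by `DataFrame.corr`.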

We do the same for the Pearson method.

In [None]:
corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(15)
corr_pearson
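
To see why we look at both methods, here is a small sketch on made-up data (the `toy` DataFrame is hypothetical). Pearson measures linear association, while Spearman is the Pearson correlation of the ranks, so it also captures monotonic relationships that are not linear.

```python
import pandas as pd

# Hypothetical toy data: y always increases with x, but along a curve.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

# Series.corr defaults to the Pearson method.
pearson = toy["x"].corr(toy["y"])

# Spearman computed by hand: the Pearson correlation of the ranks.
spearman = toy["x"].rank().corr(toy["y"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here Spearman comes out at 1 (up to floating-point rounding) because `y` always increases with `x`, while Pearson is lower because the relationship is curved.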

Correlation coefficients with an absolute value of 0.8 or more are considered strong, and those between 0.5 and 0.8 are considered moderate.

We can select the top 6 variables from each list, as all of them score at least 0.6.

In [None]:
n_correlated_values = 6
vars_to_study = set(corr_pearson[:n_correlated_values].index.to_list() + corr_spearman[:n_correlated_values].index.to_list())
vars_to_study
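
The two top-n lists usually overlap, so the set ends up with fewer than twice n entries. A small sketch with hypothetical top-3 lists (the real lists come from `corr_pearson` and `corr_spearman`, and the column names here are only illustrative):

```python
# Hypothetical top-3 lists; the real ones come from the correlation results.
top_pearson = ["OverallQual", "GrLivArea", "GarageArea"]
top_spearman = ["OverallQual", "GrLivArea", "YearBuilt"]

# A set keeps each variable once, even if both methods rank it highly.
combined = set(top_pearson + top_spearman)
print(sorted(combined))
print(len(combined))
```

This is why taking the top 6 from each method can yield fewer than 12 variables to study.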

We can investigate whether a house with a high sale price:
* Has a 1st floor, and how large it typically is.
* Has a garage area, and how large it typically is.
* Has a high quality of finish, and what the most common one is.
* Has an above-ground living area, and how large it typically is.
* Has a high quality kitchen finish.
* Has a high overall quality.
* Has a basement, and how large it typically is.
* Is built in or around a specific year, and what that range might be.

## Correlation and PPS Analysis
---

We can use custom functions taken from the Code Institute modules to display heatmaps for the Pearson and Spearman correlation analyses, as well as a heatmap of the Predictive Power Score (PPS) for the variables.


In [None]:
from src.corr_and_pps import CalculateCorrAndPPS, DisplayCorrAndPPS

We run the functions on the dataset. The figures that are generated will help us better understand the relationships between the variables in the dataset.

In [None]:
%matplotlib inline
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman,
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.6, PPS_Threshold=0.5,
                  figsize=(12, 10), font_annot=10)

These charts show how strongly the variables in the dataset are related to one another. It can help to refer to them when deciding how to handle any missing data in the dataset.

## EDA (Exploratory Data Analysis)
---

* I chose to include the top 6 correlated features from both the Pearson and Spearman methods in the study.
* The result was 8 variables that carried the highest correlation with the SalePrice of a given house.
* Each variable has a moderate to strong positive correlation with SalePrice, as shown above.

In [None]:
df_eda = df.filter(list(vars_to_study) + ['SalePrice'])
df_eda.head(3)

## Variable Distribution by SalePrice
---

We can plot the distribution of the variables, both numerical and categorical.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.regplot(data=df, x=col, y=target_var, scatter_kws={'alpha':0.4})
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()

target_var = 'SalePrice'
for col in vars_to_study:
    plot_numerical(df_eda, col, target_var)


## Conclusions
---

* The plots above show a positive relationship between each of the selected features and the sale price of a given house in Ames, Iowa.
* The Pearson and Spearman correlation heatmaps show that each of the top 8 correlated variables has a score of at least 0.6.
* The PPS heatmap does not show strong predictive power for any of the variables with respect to sale price.

## Answer Business Hypothesis
---

1.  **I suspect that a house with high OverallQual sells for a higher price.**
    
    * A Spearman correlation analysis between OverallQual and SalePrice shows a positive score of 0.81.
    * A scatter plot of SalePrice vs OverallQual shows a positive, somewhat linear relationship between the two variables.

    As a result of this, we fail to reject the hypothesis.

2.  **I suspect that a house with a big garage sells for a higher price.**
    
    * A Spearman correlation analysis between GarageArea and SalePrice shows a positive score of 0.65.
    * This suggests a moderate correlation between GarageArea and a rise in SalePrice.
    * A scatter plot of GarageArea vs SalePrice shows a relatively linear, positive relationship between GarageArea and SalePrice.

    As a result of this, we fail to reject the hypothesis.

## Next Step
---

* The next step in this process is data cleaning, carried out in Jupyter notebook 3.