# Data Analysis
### An exploratory data analysis of our collected data.

### Objectives

1. Answer our first Business requirement. 
    * Perform a correlation and/or PPS study to investigate the most relevant variables correlated to the sale price.
    * Visualize these variables against the sale price, in order to summarize the insights.

### Inputs

1. The house price records that we collected in our DataCollection notebook, found at inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv

### Outputs

1. Code that successfully generates the answers to our first business requirement.
2. Plotted graphs that visualise the results found during correlation testing.
3. Useful and insightful analysis of the found results.

### Additional Comments

* This notebook was designed following the principles set out by Code Institute in the Predictive Analytics lessons and walkthrough projects. The code in this notebook takes influence from these lessons and projects, but I have modified it, and in some cases (such as the graphical design) heavily modified it, to suit the needs of this project.

___

## Change working Directory

* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/CI-Project-5-Predictive-Analytics/jupyter_notebooks'


We want to make the parent of the current directory the new current directory

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/CI-Project-5-Predictive-Analytics'
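
Re-running the `os.chdir` cell moves up one more level each time it is executed, so a small guard can be useful. The sketch below is not part of the original notebook: `resolve_project_root` is a hypothetical helper, and the folder name `jupyter_notebooks` is an assumption taken from the path printed above.

```python
import os


def resolve_project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks folder.

    Hypothetical helper: the folder name is an assumption based on the
    path shown in the notebook output.
    """
    if os.path.basename(cwd) == notebooks_dir:
        return os.path.dirname(cwd)
    # Already at (or outside) the project root: leave the path unchanged
    return cwd


# Calling the helper twice is safe: the second call changes nothing
root = resolve_project_root(
    "/workspace/CI-Project-5-Predictive-Analytics/jupyter_notebooks"
)
root_again = resolve_project_root(root)
print(root)
```

In a notebook, the result could then be passed to `os.chdir()` in place of `os.path.dirname(current_dir)`.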

## Load the data

The first step is simply to load in our raw data that we collected in the previous notebook.

In [4]:
import pandas as pd
df = (pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv"))
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


## Data Exploration

* With our data loaded we can begin to analyse it.
    * To begin with, I have generated a pandas profile report, an automated dashboard used for quick EDA (Exploratory Data Analysis). Simply run the code below to generate the report.

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

The Pandas report we have generated provides a lot of useful insights. We can see straight away a summary of each variable in our data, detailing information such as the mean, min and max values, the number of missing values (if any), and a graph showing the distribution of values.

* An instant takeaway from this report is the number of missing values. The alerts tab shows that 8 columns contain missing values, and they appear in both numerical and categorical variables. There are several ways to deal with missing values; these will be reviewed in Data Cleaning.
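
The missing-value counts can also be checked directly in pandas, without the profile report. The following is a minimal sketch on a toy frame; the column names are borrowed from the dataset, but the values and the `summary` layout are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data (values are illustrative only)
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, 0.0, np.nan, np.nan],
    "GarageFinish": ["RFn", None, "Unf", "RFn"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Count missing values per column and keep only the affected columns
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Express the counts as a percentage of rows for easier comparison
summary = pd.DataFrame({
    "missing_count": missing,
    "missing_pct": (missing / len(toy) * 100).round(1),
})
print(summary)
```

Replacing `toy` with `df` would give the same summary for the full dataset.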

___

## Correlation Study

1. First, to assess the correlation between the sale price and the numerical variables in our dataset, I will use Spearman and Pearson correlation tests on our raw data.
    * Pearson correlation assesses the linear relationship between two variables and produces a value between -1 and +1, denoting a negative or positive correlation respectively. A value of 0 indicates no linear correlation between the variables.
    * Spearman correlation assesses the monotonic relationship between variables. Again, it produces a value between -1 and +1.

2. These two tests only evaluate the correlation of numerical values, so they give us no correlation information for our categorical variables, such as Kitchen Quality.
    * In order to evaluate these variables in conjunction with the rest, I will perform some simple data cleaning and feature engineering.
    * The null values will be filled: categorical variables will have 'Missing' in place of null values, and numerical variables will have the column mean in place of null values.
    * A one-hot encoder will then be used to convert each categorical variable into binary columns, allowing us to evaluate them using the Spearman and Pearson correlation tests.
3. The results of these 4 tests will then be evaluated, and the variables that show the greatest correlation will be plotted.
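
To make the difference between the two coefficients concrete, here is a small sketch on made-up data where `y` rises monotonically with `x` but not in a straight line. The data is purely illustrative and not drawn from the house price records.

```python
import pandas as pd

# Illustrative data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman works on ranks, so a perfectly monotonic relationship scores 1,
# while Pearson falls below 1 because the relationship is curved
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

This is why running both tests is worthwhile: a variable with a curved but consistent relationship to the sale price can rank higher under Spearman than under Pearson.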

In [5]:
corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman


OverallQual     0.809829
GrLivArea       0.731310
YearBuilt       0.652682
GarageArea      0.649379
TotalBsmtSF     0.602725
GarageYrBlt     0.593788
1stFlrSF        0.575408
YearRemodAdd    0.571159
OpenPorchSF     0.477561
LotArea         0.456461
Name: SalePrice, dtype: float64

In [6]:
corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson


OverallQual     0.790982
GrLivArea       0.708624
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
YearBuilt       0.522897
YearRemodAdd    0.507101
GarageYrBlt     0.486362
MasVnrArea      0.477493
BsmtFinSF1      0.386420
Name: SalePrice, dtype: float64
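
The two cells above repeat the same chain of calls, so one possible refactor is a small helper. This is a sketch rather than the notebook's own code: `top_correlations` is a hypothetical name, and it assumes pandas 1.5 or later, where `DataFrame.corr` accepts `numeric_only`. Newer pandas releases expect text columns to be excluded explicitly, which `numeric_only=True` does.

```python
import pandas as pd


def top_correlations(df, target, method="spearman", n=10):
    """Hypothetical helper: rank numeric columns by |correlation| with target."""
    # numeric_only=True skips text columns explicitly (assumes pandas >= 1.5)
    corr = df.corr(method=method, numeric_only=True)[target]
    # Drop the target by name rather than relying on it being sorted first
    corr = corr.drop(target)
    return corr.sort_values(key=abs, ascending=False).head(n)


# Toy data standing in for the house price records (illustrative values)
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7, 7, 8],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "KitchenQual": ["Gd", "TA", "Gd", "Gd", "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
result = top_correlations(toy, "SalePrice", method="pearson", n=2)
print(result)
```

Dropping `SalePrice` by name avoids the `[1:]` slice, which silently assumes the target always sorts to the top.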

In [7]:
from sklearn.pipeline import Pipeline
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import DropMissingData, MeanMedianImputer, CategoricalImputer

pipeline = Pipeline([
    ('mmi', MeanMedianImputer(imputation_method='mean', variables=['2ndFlrSF','EnclosedPorch','GarageYrBlt','LotFrontage','WoodDeckSF'])),
    ('ci', CategoricalImputer(imputation_method='missing', variables=['BsmtExposure','BsmtFinType1','GarageFinish','KitchenQual'])),
    ('ohe', OneHotEncoder(variables=['BsmtExposure','BsmtFinType1','GarageFinish','KitchenQual'], drop_last=False))
])
df_ohe = pipeline.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(300)

(1460, 42)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtFinSF1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageYrBlt,GrLivArea,LotArea,...,BsmtFinType1_LwQ,GarageFinish_RFn,GarageFinish_Unf,GarageFinish_Missing,GarageFinish_Fin,GarageFinish_None,KitchenQual_Gd,KitchenQual_TA,KitchenQual_Ex,KitchenQual_Fa
0,856,854.000000,3.0,706,150,0.000000,548,2003.0,1710,8450,...,0,1,0,0,0,0,1,0,0,0
1,1262,0.000000,3.0,978,284,25.330882,460,1976.0,1262,9600,...,0,1,0,0,0,0,0,1,0,0
2,920,866.000000,3.0,486,434,0.000000,608,2001.0,1786,11250,...,0,1,0,0,0,0,1,0,0,0
3,961,348.524017,,216,540,25.330882,642,1998.0,1717,9550,...,0,0,1,0,0,0,1,0,0,0
4,1145,348.524017,4.0,655,490,0.000000,836,2000.0,2198,14260,...,0,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,1003,0.000000,3.0,819,184,25.330882,588,1984.0,1003,7937,...,0,0,0,1,0,0,0,1,0,0
296,910,648.000000,4.0,420,490,25.330882,282,1950.0,1558,13710,...,0,0,1,0,0,0,0,1,0,0
297,975,975.000000,3.0,649,326,25.330882,576,1997.0,1950,7399,...,0,1,0,0,0,0,1,0,0,0
298,1041,702.000000,3.0,384,143,25.330882,539,1968.0,1743,11700,...,0,0,1,0,0,0,0,1,0,0
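
For readers who want to see what the three pipeline steps do without feature_engine, they can be approximated in plain pandas. This swaps in `fillna` and `pd.get_dummies` for the feature_engine transformers, so it is an approximation under that assumption, shown on a toy frame with illustrative values.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical column (illustrative values)
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, 0.0, 866.0, np.nan],
    "KitchenQual": ["Gd", "TA", None, "Gd"],
})

# Numerical nulls -> column mean; categorical nulls -> the label 'Missing'
toy["2ndFlrSF"] = toy["2ndFlrSF"].fillna(toy["2ndFlrSF"].mean())
toy["KitchenQual"] = toy["KitchenQual"].fillna("Missing")

# One binary column per category, keeping every category (like drop_last=False)
encoded = pd.get_dummies(toy, columns=["KitchenQual"], dtype=int)
print(encoded)
```

Note that the imputed mean is computed from the non-null values only, which is why the filled rows in the output above all share the same value within a column.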


In [8]:
corr_spearman_ohe = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman_ohe

OverallQual       0.809829
GrLivArea         0.731310
YearBuilt         0.652682
GarageArea        0.649379
TotalBsmtSF       0.602725
KitchenQual_TA   -0.581803
1stFlrSF          0.575408
YearRemodAdd      0.571159
GarageYrBlt       0.565392
KitchenQual_Gd    0.478583
Name: SalePrice, dtype: float64

In [9]:
corr_pearson_ohe = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson_ohe

OverallQual       0.790982
GrLivArea         0.708624
GarageArea        0.623431
TotalBsmtSF       0.613581
1stFlrSF          0.605852
YearBuilt         0.522897
KitchenQual_TA   -0.519298
YearRemodAdd      0.507101
KitchenQual_Ex    0.504094
MasVnrArea        0.477493
Name: SalePrice, dtype: float64

Upon completion of our correlation tests, we can take the top 5 variables from the Pearson and Spearman results and combine them, for both the raw and the encoded data, to review whether any differences have arisen.

In [10]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

{'1stFlrSF',
 'GarageArea',
 'GrLivArea',
 'OverallQual',
 'TotalBsmtSF',
 'YearBuilt'}

In [11]:

set(corr_pearson_ohe[:top_n].index.to_list() + corr_spearman_ohe[:top_n].index.to_list())

{'1stFlrSF',
 'GarageArea',
 'GrLivArea',
 'OverallQual',
 'TotalBsmtSF',
 'YearBuilt'}

* We can see that both sets of tests have produced the same 6 most correlated variables. The findings are as follows:
    1. A house with a high sales price typically has a large 1st floor square footage.
    2. A house with a high sales price typically has a large garage square footage.
    3. A house with a high sales price typically has a large above grade (ground) living area square footage.
    4. A house with a high sales price typically has a high overall quality.
    5. A house with a high sales price typically has a large basement square footage.
    6. A house with a high sales price typically was built more recently.

I will use the raw data (df) to plot graphs of the target variables. 

In [12]:
target_variables = ['1stFlrSF', 'GarageArea', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt']
target_variables

['1stFlrSF',
 'GarageArea',
 'GrLivArea',
 'OverallQual',
 'TotalBsmtSF',
 'YearBuilt']

___

## EDA on target_variables

In [24]:
df_eda = df.filter(target_variables + ['SalePrice'])
df_eda.head(5)

Unnamed: 0,1stFlrSF,GarageArea,GrLivArea,OverallQual,TotalBsmtSF,YearBuilt,SalePrice
0,856,548,1710,7,856,2003,208500
1,1262,460,1262,6,1262,1976,181500
2,920,608,1786,7,920,2001,223500
3,961,642,1717,7,756,1915,140000
4,1145,836,2198,8,1145,2000,250000


In [40]:
import plotly.express as px


def plot_numerical(df, col, target_var):
    # Plot from the DataFrame passed in, rather than the global df_eda
    fig = px.scatter(data_frame=df, x=col, y=target_var, marginal_x='box', trendline='ols',
                     trendline_color_override='red', title=f'{col} against {target_var}',
                     width=1400, height=800)
    fig.show()


target_var = 'SalePrice'
for col in target_variables:
    plot_numerical(df_eda, col, target_var)
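    print("\n\n")

[Plot output: six scatter plots of 1stFlrSF, GarageArea, GrLivArea, OverallQual, TotalBsmtSF and YearBuilt against SalePrice, each with an OLS trendline and a marginal box plot.]
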
___

## Conclusions on Data Analysis

* To review, the 6 variables that were highlighted, and the assumptions we made about them, are as follows:
    1. A house with a high sales price typically has a large 1st floor square footage.
    2. A house with a high sales price typically has a large garage square footage.
    3. A house with a high sales price typically has a large above grade (ground) living area square footage.
    4. A house with a high sales price typically has a high overall quality.
    5. A house with a high sales price typically has a large basement square footage.
    6. A house with a high sales price typically was built more recently.


* Upon further analysis of these variables through plotting, we can confirm that each variable does indeed have a strong positive correlation with the sale price, as indicated by our positive trendlines. Therefore, we can assume they may be well suited to predicting the sale price of an unseen house.
    * Our analysis suggests that some of the identified variables have a greater impact than others; for example, GrLivArea and TotalBsmtSF look to have a greater influence on the sale price than YearBuilt. However, all variables look to have an impact, so all will be considered during our modelling.

* The next step is to clean the data prior to feature engineering.
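
As a rough numerical check on the relative influence noted above, the Pearson coefficients reported for the raw data can be squared. For a one-variable OLS fit with an intercept, R² equals the square of the Pearson coefficient, so this approximates how much of the variation in sale price each trendline accounts for. The coefficients below are copied from the correlation output earlier in this notebook; the calculation itself is an added sketch.

```python
# Pearson coefficients copied from the raw-data correlation output
pearson_r = {
    "OverallQual": 0.790982,
    "GrLivArea": 0.708624,
    "GarageArea": 0.623431,
    "TotalBsmtSF": 0.613581,
    "1stFlrSF": 0.605852,
    "YearBuilt": 0.522897,
}

# For simple linear regression with an intercept, R^2 is r squared
r_squared = {name: round(r ** 2, 3) for name, r in pearson_r.items()}

# Print the variables from strongest to weakest single-variable fit
for name, value in sorted(r_squared.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12} R^2 = {value:.3f}")
```

This treats each variable in isolation; it says nothing about how the variables overlap with one another, which is a question for the modelling stage.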

___