# **Data Exploration Notebook**

## Objectives

* Explore the data
* Understand which attributes are most strongly correlated with sale price

## Inputs

* Kaggle data file - `inputs/datasets/raw/house-prices/house-price/house_prices_records.csv`
* Kaggle data file - `inputs/datasets/raw/house-prices/house-price/inherited_house.csv`

---

# Change working directory

Accessing the current directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

Set the working directory to the project folder inside the workspace directory

In [None]:
os.chdir('/workspaces/milestone-project-heritage-housing-issues')
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
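
The absolute path used above only exists in the original workspace. A minimal sketch of a more portable alternative follows, assuming the project root can be recognised by a marker entry such as the `inputs` folder. The helper `find_project_root`, the marker name, and the throwaway folder layout are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def find_project_root(start, marker="inputs"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    path = os.path.abspath(start)
    while True:
        if os.path.exists(os.path.join(path, marker)):
            return path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root without a match
            raise FileNotFoundError(f"No parent of {start!r} contains {marker!r}")
        path = parent

# Demonstration on a throwaway directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "inputs"))
    notebooks_dir = os.path.join(tmp, "jupyter_notebooks")
    os.makedirs(notebooks_dir)
    demo_root = find_project_root(notebooks_dir)
    root_found = os.path.samefile(demo_root, tmp)
```

In the notebook itself this could be used as `os.chdir(find_project_root(os.getcwd()))`, provided the marker folder really sits at the project root.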

---

# Explore the Kaggle Data

* Load Kaggle Data

In [None]:
import pandas as pd
allowed_nans = ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
                '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA',
                'NULL', 'NaN', 'n/a', 'nan', 'null']
df = pd.read_csv("inputs/datasets/raw/house-prices/house-price/house_prices_records.csv",
                 na_values=allowed_nans, keep_default_na=False)
df.head()
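
The custom `allowed_nans` list appears intended to keep the literal string `None` as a real category (it is mapped to 0 later on) while still treating markers such as `NA` as missing. The sketch below illustrates that behaviour on a tiny in-memory CSV. The CSV text and the shortened NaN list are made up for the demonstration.

```python
import io
import pandas as pd

# Shortened version of the list above, for the demo only
demo_nans = ['', 'NA', 'NaN', 'nan', 'null']

# Made-up rows: one 'None' category, one missing marker, one ordinary value
csv_text = "BsmtExposure,SalePrice\nNone,100000\nNA,150000\nGd,200000\n"

demo = pd.read_csv(io.StringIO(csv_text), na_values=demo_nans, keep_default_na=False)

# 'None' survives as a string; only 'NA' is converted to NaN
missing_counts = demo.isna().sum()
print(demo)
print(missing_counts)
```

Running `df.isna().sum()` on the real data in the same way gives a quick count of missing values per column before any encoding is applied.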

* Run Profile Report

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

---

## Correlation and PPS Analysis

* Convert object-type (categorical) columns to ordinal numerical codes

In [None]:
df['BsmtExposure'] = df['BsmtExposure'].replace({'None':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4})
df['BsmtFinType1'] = df['BsmtFinType1'].replace({'None':0, 'Unf':1, 'LwQ':2, 'BLQ':3, 'Rec':4, 'ALQ':5, 'GLQ':6})
df['GarageFinish'] = df['GarageFinish'].replace({'None':0, 'Unf':1, 'RFn':2, 'Fin':3})
df['KitchenQual'] = df['KitchenQual'].replace({'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4})
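
`replace` silently leaves any label that is missing from the dictionary untouched, which would keep the column as object type. A minimal sketch of a stricter variant is shown below: it uses `map` and raises if an unexpected label appears. The helper `encode_ordinals` and the toy rows are hypothetical. The mappings are copied from the cell above.

```python
import pandas as pd

# Ordinal mappings copied from the cell above
ordinal_maps = {
    'BsmtExposure': {'None': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4},
    'BsmtFinType1': {'None': 0, 'Unf': 1, 'LwQ': 2, 'BLQ': 3, 'Rec': 4, 'ALQ': 5, 'GLQ': 6},
    'GarageFinish': {'None': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3},
    'KitchenQual': {'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4},
}

def encode_ordinals(frame, maps):
    """Return a copy of `frame` with the mapped columns encoded; raise on unknown labels."""
    out = frame.copy()
    for col, mapping in maps.items():
        # Missing values are allowed and stay missing after map()
        unknown = set(out[col].dropna().unique()) - set(mapping)
        if unknown:
            raise ValueError(f"{col}: unmapped labels {sorted(map(str, unknown))}")
        out[col] = out[col].map(mapping)
    return out

# Toy rows for illustration only
toy = pd.DataFrame({
    'BsmtExposure': ['None', 'Gd', 'Av'],
    'BsmtFinType1': ['GLQ', 'Unf', 'None'],
    'GarageFinish': ['Fin', 'RFn', 'Unf'],
    'KitchenQual': ['TA', 'Ex', 'Gd'],
})
encoded = encode_ordinals(toy, ordinal_maps)
print(encoded)
```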

* Spearman Correlation

In [None]:
df_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head()
df_spearman

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def heatmap_corr(df, threshold, figsize, font_annot):
    """Plot the lower triangle of a correlation matrix, hiding weak cells."""
    if len(df.columns) > 1:
        # Hide the upper triangle (including the diagonal) ...
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        # ... and every cell whose absolute correlation is below the threshold
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()
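
To see which cells the mask in `heatmap_corr` leaves visible, the same masking steps can be applied to a small hand-written matrix without plotting. This is only a sketch: the 3x3 matrix and its values are made up.

```python
import numpy as np
import pandas as pd

# Made-up symmetric correlation matrix for illustration
corr = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, -0.7],
     [0.2, -0.7, 1.0]],
    index=list("abc"), columns=list("abc"),
)
threshold = 0.6

# Same logic as in heatmap_corr: True means "hide this cell"
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True           # upper triangle and diagonal
mask[corr.abs().to_numpy() < threshold] = True    # weak correlations

# Cells that would still be drawn keep their value; hidden cells become NaN
visible = corr.where(~mask)
print(visible)
```

Only the lower-triangle cells with an absolute correlation of at least the threshold remain, so negative but strong correlations are kept as well.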

In [None]:
df_spearman_heat = df.corr(method='spearman')
heatmap_corr(df=df_spearman_heat, threshold=0.6, figsize=(15, 5), font_annot=8)

* Pearson Correlation

In [None]:
df_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head()
df_pearson

In [None]:
df_pearson_heat = df.corr(method='pearson')
heatmap_corr(df=df_pearson_heat, threshold=0.6, figsize=(15, 5), font_annot=8)
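
Both methods are used because they measure different things: Pearson captures linear association, while Spearman works on ranks and captures any monotonic association. A small sketch on made-up data (a cubic relationship) shows how the two can differ.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({'x': range(1, 11)})
toy['y'] = toy['x'] ** 3

pearson_r = toy.corr(method='pearson').loc['x', 'y']
spearman_r = toy.corr(method='spearman').loc['x', 'y']

# Spearman is 1 for a perfectly monotonic relationship;
# Pearson is lower because the relationship is not a straight line
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```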

* Attributes most correlated with SalePrice (union of the top four from each method)

In [None]:
set(df_pearson[:4].index.to_list() + df_spearman[:4].index.to_list())

Therefore I will investigate how the following attributes affect SalePrice:

* GarageArea
* GrLivArea
* KitchenQual
* OverallQual
* YearBuilt

In [None]:
vars_to_study = ['GarageArea', 'GrLivArea', 'KitchenQual', 'OverallQual', 'YearBuilt']
vars_to_study

In [None]:
for col in vars_to_study:
    ax = sns.regplot(data=df, x=col, y="SalePrice", scatter_kws={"color": "blue"}, line_kws={"color": "red"})
    plt.ylabel('SalePrice')
    plt.xlabel(col)
    plt.title(f"{col}", fontsize=20, y=1.1)
    plt.show()

---

# Conclusions

* Overall quality has the highest impact on SalePrice
* Kitchen quality also impacts SalePrice, with higher quality associated with a higher price
* Newer homes tend to have a higher SalePrice
* Larger garage and living areas are associated with a higher SalePrice; of the two, living area has the stronger effect
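
The conclusions above could be backed by a single table that lists both coefficients for each studied attribute. The sketch below shows one way to build it. The helper `correlation_summary` is hypothetical, and the toy values are made up purely to demonstrate the call. They are not results from the dataset.

```python
import pandas as pd

def correlation_summary(frame, target, variables):
    """Pearson and Spearman correlation of each variable with `target`, strongest first."""
    cols = list(variables) + [target]
    pearson = frame[cols].corr(method='pearson')[target].drop(target)
    spearman = frame[cols].corr(method='spearman')[target].drop(target)
    summary = pd.DataFrame({'pearson': pearson, 'spearman': spearman})
    return summary.sort_values('spearman', key=abs, ascending=False)

# Made-up values for illustration only
toy = pd.DataFrame({
    'OverallQual': [3, 5, 6, 7, 9],
    'YearBuilt': [1950, 2005, 1970, 1990, 1985],
    'SalePrice': [90000, 140000, 165000, 210000, 320000],
})
summary = correlation_summary(toy, 'SalePrice', ['OverallQual', 'YearBuilt'])
print(summary)
```

On the real data the call would be `correlation_summary(df, 'SalePrice', vars_to_study)`, run after the categorical columns have been encoded.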