# **Sale Price Study**

## Objectives

*   Answer business requirement 1: 
    * The client is interested in discovering how the house attributes correlate with the sale price. Therefore, the client expects data visualisations of the correlated variables against the sale price to show that.

## Inputs

* outputs/datasets/collection/house_prices_records.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit app




---

# Change working directory

* We assume the notebooks are stored in a subfolder; therefore, when running a notebook in the editor, you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Data

The collected dataset is loaded into this notebook and the first three rows are displayed

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/house_prices_records.csv"))
df.head(3)


---

# Data Exploration

We want to become more familiar with the dataset: check variable types and distributions, the level of missing data, and what these variables mean in a business context

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

- The data is all numerical, as expected after the categorical types were replaced in the previous notebook.
- A large amount of data is missing (~10%), which will need to be reviewed later.
- There is large variation between houses, as some do not have a 2nd floor, garage or basement.
    - As a result, some variables are skewed by a high number of zeros, which will need consideration later.
- The majority of the variables are not uniformly distributed.
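
The missing and zero levels noted above can also be quantified per column. The following is a minimal sketch, assuming a hypothetical helper `summarise_missing_and_zeros` and toy values that are not taken from the dataset; in this notebook it would be called on `df`.

```python
import numpy as np
import pandas as pd


def summarise_missing_and_zeros(data: pd.DataFrame) -> pd.DataFrame:
    """Return the percentage of missing and zero values for each column."""
    summary = pd.DataFrame({
        "missing_pct": data.isna().mean() * 100,
        # NaN compares as False here, so missing values are not counted as zeros
        "zero_pct": data.eq(0).mean() * 100,
    })
    return summary.sort_values("missing_pct", ascending=False).round(1)


# Toy data for illustration only; not taken from house_prices_records.csv
toy = pd.DataFrame({
    "TotalBsmtSF": [0, 0, 850, np.nan],
    "GarageArea": [0, 480, 520, 610],
})
print(summarise_missing_and_zeros(toy))
```

Columns with a high `zero_pct` are the ones where a feature such as a garage or basement is often absent.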

---

# Correlation Study

We use `.corr()` for `spearman` and `pearson` methods, and investigate the top 10 correlations
* We know this command returns a pandas series and the first item is the correlation between SalePrice and SalePrice, which happens to be 1, so we exclude that with `[1:]`
* We sort values considering the absolute value, by setting `key=abs`
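
The `[1:]` slice relies on SalePrice being the first item after sorting. As a hedged alternative, the sketch below drops the target by name instead; the helper `top_correlations` and the toy columns `area` and `age` are hypothetical and only serve as an illustration.

```python
import pandas as pd


def top_correlations(data, target, method="spearman", n=10):
    """Return the n strongest correlations with `target`, ranked by absolute value."""
    # Drop the target by label, so the result does not depend on sort position
    corr = data.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False).head(n)


# Toy data for illustration only
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "area": [50, 70, 65, 120, 130],
    "age": [40, 35, 30, 10, 5],
})
print(top_correlations(toy, "SalePrice", n=2))
```

Because the ranking uses the absolute value, a strong negative correlation (here `age`) is ranked alongside strong positive ones.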

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/house_prices_records.csv"))   
df.head(3)

In [None]:
corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(24)
corr_spearman

In [None]:
corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(24)
corr_pearson

In [None]:
top_n = 10
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

In [None]:
vars_to_study = ['1stFlrSF','GarageArea', 'GarageFinish', 'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd']
vars_to_study

- For both methods, we notice strong and moderate correlations between SalePrice and selected variables. 
- Top 10 in Spearman and Top 9 in Pearson all show strong correlations.
    * Because strong correlations are available, we focus on them and review the top 10 correlations in more detail.

Therefore, we study the following variables in df. We will investigate whether:
* Larger floor space (GrLivArea, 1stFlrSF, GarageArea and TotalBsmtSF) increases SalePrice
* Newer properties (YearBuilt) are higher in SalePrice
* Recently built garages (GarageYrBlt) increase SalePrice
* Recent refurbishment (YearRemodAdd) increases SalePrice
* Higher quality kitchens, garages and overall houses (GarageFinish, KitchenQual and OverallQual) increase SalePrice


---

# EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['SalePrice'])
df_eda.head(3)

We plot the distribution of each variable against SalePrice.
Quality-related variables (KitchenQual, GarageFinish and OverallQual), although stored as numerical values, are categorical and are therefore represented as box plots.
The remaining variables are represented as scatter plots with linear regression lines.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

def plot_regression(df, col, target_var, alpha_scatter=0.5):
    plt.figure(figsize=(8, 5))
    sns.regplot(data=df, x=col, y=target_var, scatter_kws={'alpha': alpha_scatter}, line_kws={'color': 'red'})
    plt.title(f"{col} vs {target_var}", fontsize=20, y=1.05)
    plt.xlabel(col)
    plt.ylabel(target_var)
    plt.show()

def plot_boxplot(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=df, x=col, y=target_var)
    plt.title(f"{col} vs {target_var}", fontsize=20, y=1.05)
    plt.xlabel(col)
    plt.ylabel(target_var)
    plt.show()

target_var = 'SalePrice'
vars_to_study_numerical = ['1stFlrSF','GarageArea', 'GarageYrBlt', 'GrLivArea',  'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd']
vars_to_study_categorical = ['GarageFinish','KitchenQual', 'OverallQual']

for col in vars_to_study_numerical:
    plot_regression(df_eda, col, target_var)
    print("\n\n")

for col in vars_to_study_categorical:
    plot_boxplot(df_eda, col, target_var)
    print("\n\n")

---

# Total Floor Space

There are four variables that are related to floor space.
- 1stFlrSF is assumed to be included in GrLivArea.
- It is not clear whether GarageArea is included in GrLivArea.
- GrLivArea does not include below-ground space; therefore, we assume TotalBsmtSF is an additional feature of a property.


Therefore, we will combine GrLivArea and TotalBsmtSF to create a TotalLivArea and review if there is a stronger correlation to SalePrice

In [None]:
df_eda['TotalLivArea'] = df_eda['GrLivArea'] + df_eda['TotalBsmtSF']
df_eda.head(10)
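
Since the dataset contains missing values, it is worth noting how plain addition treats them. The sketch below uses toy values (not taken from the dataset) to show that a missing value in either column makes the combined value missing; whether GrLivArea or TotalBsmtSF actually contain missing values here is not confirmed by this notebook.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "GrLivArea": [1500.0, 1200.0, 900.0],
    "TotalBsmtSF": [800.0, np.nan, 0.0],
})

# Plain addition: a missing value in either column makes the sum missing
toy["TotalLivArea"] = toy["GrLivArea"] + toy["TotalBsmtSF"]

# Count the rows where the combined feature could not be computed
n_missing = int(toy["TotalLivArea"].isna().sum())
print(toy)
print(f"Rows with missing TotalLivArea: {n_missing}")
```

`DataFrame.corr()` excludes missing values pairwise, so such rows are simply left out of the correlation for TotalLivArea.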

Both correlation methods (Spearman and Pearson) show a stronger correlation with SalePrice for TotalLivArea than for the individual floor space variables

In [None]:
corr_spearman_eda = df_eda.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(15)
corr_spearman_eda

In [None]:
corr_pearson_eda = df_eda.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(24)
corr_pearson_eda

We plot the new TotalLivArea column. Visually, the points appear less scattered around the regression line, which is consistent with the higher correlation scores.

In [None]:
plot_regression(df_eda, 'TotalLivArea', target_var)
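
The visual impression of a tighter scatter can be checked numerically. The sketch below computes R-squared for a one-variable least-squares line, which for a simple linear fit equals the squared Pearson correlation. The helper `linear_r_squared` and the toy columns are hypothetical; in this notebook it could be called as `linear_r_squared(df_eda, 'TotalLivArea', 'SalePrice')` and compared with the value for `'GrLivArea'`.

```python
import numpy as np
import pandas as pd


def linear_r_squared(data, feature, target):
    """R-squared of a one-variable least-squares line, ignoring rows with missing values."""
    pair = data[[feature, target]].dropna()
    slope, intercept = np.polyfit(pair[feature], pair[target], deg=1)
    residuals = pair[target] - (slope * pair[feature] + intercept)
    ss_res = float((residuals ** 2).sum())
    ss_tot = float(((pair[target] - pair[target].mean()) ** 2).sum())
    return 1 - ss_res / ss_tot


# Toy data for illustration only
toy = pd.DataFrame({
    "x_exact": [1.0, 2.0, 3.0, 4.0],
    "x_noisy": [1.0, 3.0, 2.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
})
r2_exact = linear_r_squared(toy, "x_exact", "y")
r2_noisy = linear_r_squared(toy, "x_noisy", "y")
print(r2_exact, r2_noisy)
```

A higher value means the points lie closer to the regression line.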

---

# Conclusion and Next Steps

The interpretations of the correlations and the plots converge.
* Larger floor space (GrLivArea, 1stFlrSF, GarageArea and TotalBsmtSF) is associated with a higher SalePrice
    - The combined TotalLivArea, which includes above- and below-ground space, correlates more strongly with SalePrice.
    - GrLivArea has the highest individual correlation with SalePrice.
* Newer properties (YearBuilt) are associated with a higher SalePrice
* Recently built garages and refurbishments (GarageYrBlt, YearRemodAdd) are associated with a higher SalePrice
* Higher overall quality houses (OverallQual) are associated with a higher SalePrice
    - Higher quality kitchens and garages (KitchenQual and GarageFinish) are also associated with a higher SalePrice, although the correlation is weaker.