# **Sale Price Study**

## Objectives

Answer business requirement 1:
* The client is interested in understanding how the attributes of a house relate to its market value.
Therefore, this notebook carries out a correlation study and presents data visualisations in the form of graphs.

## Inputs

* outputs/datasets/collection/house_prices_records.csv

## Outputs

* Generate code and plots/graphs that answer business requirement 1 and can be used to build the Streamlit app



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory needs to be changed when running the notebook in the editor.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [14]:
import os
current_dir = os.getcwd()
current_dir

'/'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [15]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [16]:
current_dir = os.getcwd()
current_dir

'/'
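
Both outputs above show `'/'`, which is what `os.path.dirname()` returns when the current directory is already the filesystem root, so the `os.chdir()` call changes nothing in that case. A minimal sketch of that behaviour follows; the helper name `parent_of` and the example paths are hypothetical and only serve the illustration.

```python
import os


def parent_of(path):
    """Return the parent directory of `path` (hypothetical helper)."""
    return os.path.dirname(path)


# A notebook subfolder resolves to the project folder one level up.
print(parent_of("/workspace/project/jupyter_notebooks"))

# The root directory has no parent, so dirname returns it unchanged.
print(parent_of("/"))
```

If the confirmed directory is not the project root, relative paths such as the one used to load the dataset will not resolve.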

# Load Data

In [17]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/house_prices_records.csv"))
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'outputs/datasets/collection/house_prices_records.csv'

---

# Data Exploration

We will now familiarise ourselves with the dataset

In [18]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

NameError: name 'df' is not defined

---

# Correlation study

First we will explore the missing data

In [None]:
vars_with_missing_data = df[df.columns[df.isna().sum() > 0]]
vars_with_missing_data

We will impute the missing values in the categorical variables with the most common value in each column. First we list the categorical variables that have missing data.

In [None]:
catagorical_missing_var = (vars_with_missing_data
                    .columns[vars_with_missing_data.dtypes == 'object']
                    .to_list())
catagorical_missing_var

We will use CategoricalImputer to fill in the missing values

In [None]:
from feature_engine.imputation import CategoricalImputer
categorical_imputer = CategoricalImputer(imputation_method='frequent',
                                         variables=catagorical_missing_var)
df = categorical_imputer.fit_transform(df)

In [None]:
df.filter(catagorical_missing_var).info()
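
To make the idea of "most frequent" imputation concrete, here is a minimal sketch using plain pandas instead of feature_engine. The toy values are made up and are not taken from the house price records; the approach (filling with the column mode) is a stand-in for what the imputer is configured to do, not its actual implementation.

```python
import pandas as pd

# Toy data, not taken from the house price records.
toy = pd.DataFrame({
    "GarageFinish": ["Unf", None, "Unf", "Fin"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Non-numeric columns that contain at least one missing value.
missing_cat = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col]) and toy[col].isna().any()
]

# Fill each of them with its most frequent category (the mode).
imputed = toy.copy()
for col in missing_cat:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(missing_cat)
print(imputed)
```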

We will now use OneHotEncoder to convert the categorical variables into numeric columns, so that they can be included in the correlation study

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)
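
As an illustration of what one-hot encoding does to a categorical column, the sketch below uses `pd.get_dummies` as a stand-in for the feature_engine encoder. The toy values are made up; the point is only that each category becomes its own 0/1 column, which is what allows it to enter a correlation calculation.

```python
import pandas as pd

# Toy data, not taken from the house price records.
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Gd"],
    "SalePrice": [208500, 181500, 223500],
})

# One indicator column per category; numeric columns are left untouched.
encoded = pd.get_dummies(toy, columns=["KitchenQual"], dtype=int)
print(encoded)
```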

We will use .corr() with the Spearman and Pearson methods and investigate the top 10 correlations with 'SalePrice'

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
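
The two methods can rank variables differently because Pearson measures linear association while Spearman measures monotonic association. A minimal sketch on made-up numbers: Spearman is computed here as the Pearson correlation of the ranks, which is its definition, rather than through `method='spearman'`.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3  # strictly increasing, but not linear in x

# Pearson: strength of the linear relationship.
pearson = x.corr(y)

# Spearman: Pearson correlation computed on the ranks of the values.
spearman = x.rank().corr(y.rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Because `y` always increases with `x`, the rank-based score is at its maximum, while the linear score is lower.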

The figures above show which variables correlate most strongly with 'SalePrice'.
We will look further at the top correlations:

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

In [None]:
vars_to_study = ['1stFlrSF', 'GarageArea', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt']
vars_to_study

# EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['SalePrice'])
df_eda.head(3)

# Data Visualisation 

We will create plots below to gain insights into the high-scoring variables and their relationship to 'SalePrice', and save the outputs.
We will use the feature_engine discretiser, in particular the equal frequency method.

In [None]:
from feature_engine.discretisation import EqualFrequencyDiscretiser
discretiser = EqualFrequencyDiscretiser(q=6, variables=['SalePrice'])
discretiser.fit(df_eda)
df_eda = discretiser.transform(df_eda)
df_eda

In [None]:
discretiser.binner_dict_['SalePrice']
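
Equal frequency discretisation chooses bin edges at quantiles so that each bin holds roughly the same number of rows. The sketch below shows the idea with `pd.qcut` on made-up prices; it is a stand-in for the feature_engine discretiser, and its edges span the observed minimum and maximum, which may differ from the edges stored in `binner_dict_`.

```python
import pandas as pd

# Made-up prices, not taken from the house price records.
prices = pd.Series([100, 120, 150, 180, 200, 240, 300, 350, 500])

# Split into 3 quantile-based bins and return the bin edges as well.
codes, edges = pd.qcut(prices, q=3, labels=False, retbins=True)

print(codes.value_counts().sort_index())  # rows per bin
print(edges)                              # bin boundaries
```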

We will now create a map to replace the 'SalePrice' variable with more informative levels

In [None]:
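labels = discretiser.binner_dict_['SalePrice']
n_factor = len(labels)-1
labels_map = {}

# Build one readable label per bin from the bin edges
for n in range(0, n_factor):
    if n == 0:
        # first bin: everything below the first inner edge
        labels_map[n] = f"< {labels[1]}"
    elif n < n_factor - 1:
        # middle bins: lower edge to upper edge
        labels_map[n] = f"{labels[n]} to {labels[n+1]}"
    else:
        # last bin: everything above the last inner edge
        labels_map[n] = f"+{labels[n]}"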

labels_map
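
A minimal sketch of one possible labelling scheme, run on a short list of made-up edges so the result is easy to check by eye. The helper name `make_bin_labels` is hypothetical, and the sketch assumes the first and last edges are open-ended outer limits such as -inf and inf.

```python
def make_bin_labels(edges):
    """Map bin index -> readable range label (hypothetical helper).

    Assumes `edges` is ordered and that its first and last entries are
    the open-ended outer limits (for example -inf and inf).
    """
    n_bins = len(edges) - 1
    labels = {}
    for n in range(n_bins):
        if n == 0:
            labels[n] = f"< {edges[1]}"                    # lowest bin
        elif n == n_bins - 1:
            labels[n] = f"+{edges[n]}"                     # highest bin
        else:
            labels[n] = f"{edges[n]} to {edges[n + 1]}"    # middle bins
    return labels


print(make_bin_labels([float("-inf"), 100000, 200000, float("inf")]))
```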

Now we have the 'SalePrice' ranges, and thus 'bins' to which the properties can be allocated according to their sale price.
Let's view these in a DataFrame format:

In [None]:
df_eda['SalePrice'] = df_eda['SalePrice'].replace(labels_map)
df_eda

We will plot histograms to visualise the above data, which makes it easier to digest.

In [None]:
hue_order = labels_map.values()
list(hue_order)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

def plot_numerical(df, col, target_var, hue_order):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, hue_order=hue_order, kde=True,
                 element="step")
    # name the plotted variable so each figure has a distinct title
    plt.title(f"{col} distribution by {target_var}", fontsize=20, y=1.05)
    plt.show()


target_var = 'SalePrice'
for col in ['1stFlrSF', 'GarageArea', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt']:
    plot_numerical(df_eda, col, target_var, hue_order)
    print("\n\n")

We will delve further by plotting the variables in our hypotheses against the sale price individually.

In [None]:
df_pearson = df.corr(method='pearson')['SalePrice'].filter(['GrLivArea'])
df_pearson
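
Depending on the pandas version, calling `.corr()` on a whole DataFrame that still contains text columns may raise an error or require `numeric_only=True`. A hedged alternative for a single pair of variables is `Series.corr`, sketched below on made-up values (not taken from the house price records).

```python
import pandas as pd

# Toy data, not taken from the house price records.
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "SalePrice": [120000, 160000, 230000, 260000],
    "KitchenQual": ["TA", "Gd", "Gd", "Ex"],  # non-numeric column
})

# Correlate only the two columns of interest; the text column is never touched.
score = toy["SalePrice"].corr(toy["GrLivArea"], method="pearson")
print(round(score, 3))
```

This returns a single number rather than a one-row Series, and avoids computing the full correlation matrix.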

In [None]:
x, y = 'GrLivArea', 'SalePrice'
fig, axes = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=df, x=x, y=y)
plt.show()

The Pearson correlation score indicates that the size of the house has a relatively high correlation with its value.
The scatter plot and the histogram plotted earlier both illustrate that as the property size increases, so does the sale price. This supports our first hypothesis.

We will explore the second hypothesis, regarding house age, against the sale price.

In [None]:
df_pearson = df.corr(method='pearson')['SalePrice'].filter(['YearBuilt'])
df_pearson

In [None]:
x, y = 'YearBuilt', 'SalePrice'
fig, axes = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=df, x=x, y=y)
plt.show()

Here we can see that the correlation does exist, but it is not as strong or clear. For example, the most expensive property may be the most recent, but there are instances where a property built in 1920 is valued the same as a property built much more recently.

We will explore the third hypothesis by examining the overall quality of the property against the sale price.

In [None]:
df_pearson = df.corr(method='pearson')['SalePrice'].filter(['OverallQual'])
df_pearson

In [None]:
x, y = 'OverallQual', 'SalePrice'
fig, axes = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=df, x=x, y=y)
plt.show()

This clearly indicates that house quality is key to increasing the house price, and the correlation score is also very high.

Although the year a property was remodelled is not one of the top correlated indicators of sale price, it is worth exploring the data to see whether more recent refurbishments have a positive impact on the sale price.

In [None]:
df_pearson = df.corr(method='pearson')['SalePrice'].filter(['YearRemodAdd'])
df_pearson

In [None]:
x, y = 'YearRemodAdd', 'SalePrice'
fig, axes = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=df, x=x, y=y)
plt.show()

Although the impact is noticeable, it does not appear to be overwhelming. A property remodelled in 1950 can be valued higher than some properties remodelled in the most recent years, which shows that other factors must be coming into play.


# Conclusions

* House size has a big impact on house price. The histogram plots show that where 'GrLivArea', '1stFlrSF', 'GarageArea' and 'TotalBsmtSF' have greater square footage, the sale price generally tends to be higher.

* The age of the property is also relevant to its value, but I would deduce that quality has to be present as well. I gather this from the fact that most properties from 1920 have a lower sale value, which for the most part is likely due to the overall quality having deteriorated.
A new property is likely to be of higher quality, which may be why there tend to be more instances of a higher sale price. However, the data does not show that these factors alone can increase the value; supporting factors, such as size, are also required.
We explored this by plotting the 'YearRemodAdd' variable, and the data indicates that a recent remodel does not necessarily outweigh the other contributing factors. This may be because the remodels have not all been to a high standard, but this would be an assumption. There is not enough data regarding the remodelling to conclude its effects. Data showing the sale price before and after a remodel would provide a clearer insight into its effects.

* It is clear that no variable alone is powerful enough to dictate the highest sale price, but from the above observations, one can conclude that all other things being equal, the higher the overall quality, the higher the sale price will be. 