# **Sales Price Study Notebook**

## Objectives

* Answer business requirement 1
  * The client is interested in discovering how the house attributes correlate with the sale price.
  * Therefore, the client expects data visualizations of the correlated variables against the sale price.

<br>

* Load and inspect the data prepared during data collection
* Data exploration
* Correlation study
* EDA on selected variables
* Conclusions and next steps

## Inputs

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv

## Additional Comments

* This notebook was written based on the guidelines provided in the Customer Churn walkthrough project, data cleaning lesson.
* In this notebook we explore the data using the CRISP-DM Data Understanding methodology.

---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
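
Note that re-running the `os.chdir()` cell moves up one more level each time. A minimal sketch of a guard against this is shown below. The helper name `go_to_parent_once` and the `marker` argument are hypothetical, and the sketch assumes the `inputs` folder exists only in the project root.

```python
import os


def go_to_parent_once(marker="inputs"):
    """Move to the parent directory only if `marker` is not already visible.

    `marker` is a hypothetical folder name assumed to exist only in the
    project root, so finding it means we are already in the right place.
    """
    if not os.path.isdir(marker):
        os.chdir(os.path.dirname(os.getcwd()))
    return os.getcwd()
```

Calling `go_to_parent_once()` a second time leaves the working directory unchanged, so the cell can be re-run safely.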

---

# Import Packages

In [None]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None
from pandas_profiling import ProfileReport

---

# Load the House Price Records prepared during data collection

Read the house_prices_records dataset csv file into a Pandas dataframe

In [None]:
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df.shape)
df.head()

---

# Data Exploration

Explore the dataset, by checking variable types and distribution, missing levels and what value these variables may add in the context of the first business requirement.  

In [None]:
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()
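
As a lightweight complement to the profile report, the variable types and missing levels can also be summarised with pandas alone. The sketch below is illustrative only: `summarize_columns` is a hypothetical helper, and the toy frame uses made-up values (the column `category_col` is invented for the example).

```python
import pandas as pd


def summarize_columns(df):
    """Return dtype, missing count, missing percentage and unique count per column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),
    })
    # Columns with the most missing data come first
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame with made-up values, only to show the shape of the output
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, None, 1717],
    "category_col": ["a", "b", "a", "b"],
    "SalePrice": [208500, 181500, 223500, 140000],
})
summary = summarize_columns(toy)
print(summary)
```

On the real dataset, the same helper would be called as `summarize_columns(df)`.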

---

# Correlation Study

Assess correlation levels across **numerical** variables using the `pearson` and `spearman` methods.

* We will exclude the first item returned as this will be the correlation between SalePrice and SalePrice
* We will only fetch the 10 most relevant correlations
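
The difference between the two methods can be illustrated on synthetic data. This is a minimal sketch, not part of the study: the column names and the helper `top_correlations` are hypothetical, and the target is dropped by name rather than by position.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target grows monotonically but non-linearly with x
x = np.arange(1, 21)
toy = pd.DataFrame({
    "target": np.exp(x / 3.0),
    "monotonic_feature": x,
    "alternating_feature": np.tile([1, -1], 10),
})


def top_correlations(df, target, method, n=10):
    """Correlations with `target`, sorted by absolute value, target excluded."""
    corr = df.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False).head(n)


pearson_toy = top_correlations(toy, "target", "pearson")
spearman_toy = top_correlations(toy, "target", "spearman")
print(pearson_toy)
print(spearman_toy)
```

Because `monotonic_feature` and `target` always increase together, the Spearman coefficient reaches 1, while the Pearson coefficient is lower because the relationship is not a straight line.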

---

Using the `pearson` method to measure the linear relationship between two features

In [None]:
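# Select numeric columns explicitly; newer pandas versions raise an error on text columns
corr_pearson = df.select_dtypes(include='number').corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson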

---

Using the `spearman` method to measure the monotonic (rank-based) relationship between two features

In [None]:
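# Select numeric columns explicitly; newer pandas versions raise an error on text columns
corr_spearman = df.select_dtypes(include='number').corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman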

#### Observation
* For both methods, we notice strong positive correlations between Sale Price and at least 5 variables.

---

* Now we take the top 4 variables returned for each method, transform them to a list and concatenate the two lists
* The result will be the top (unique) correlated variables from both methods

In [None]:
top_n = 4
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

* The result is 5 variables that correlate with Sale Price
* These variables will be assessed for their strength in predicting Sale Price

In [None]:
corr_var_list = list(set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list()))
corr_var_list
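
A `set` does not guarantee any particular order, so the order of the resulting list may not follow the correlation ranking. If a stable order is wanted, one option is to de-duplicate with `dict.fromkeys`, as in the sketch below. The Series and their values are made up purely for illustration.

```python
import pandas as pd

# Made-up correlation values on hypothetical features, already sorted
pearson_toy = pd.Series({"feat_a": 0.9, "feat_b": 0.8, "feat_c": 0.7})
spearman_toy = pd.Series({"feat_b": 0.85, "feat_d": 0.8, "feat_a": 0.75})

top_n = 2
combined = (pearson_toy.head(top_n).index.to_list()
            + spearman_toy.head(top_n).index.to_list())

# dict keys are unique and keep insertion order, so duplicates are dropped
# while the first-seen order is preserved
unique_ordered = list(dict.fromkeys(combined))
print(unique_ordered)  # ['feat_a', 'feat_b', 'feat_d']
```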

---

# EDA on the Correlated Variable List

---

* Filter the house price dataset on only the correlated variable list and include the sale price

In [None]:
df_eda = df.filter(corr_var_list + ['SalePrice'])
print(df_eda.shape)
df_eda.head(5)

## Visualize variable correlation to Sale Price

Plot each correlated variable against Sale Price as a scatter plot with a fitted regression line

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline

def plot_numerical(df, col, target_var):
  plt.figure(figsize=(15, 8))
  sns.regplot(data=df, x=col, y=target_var)  
  plt.title(f"{col}", fontsize=20)
  plt.show()


target_var = 'SalePrice'
for col in corr_var_list:
  plot_numerical(df_eda, col, target_var)
  print("\n\n")

---

# Conclusions and Next Steps

#### The correlation results and the plot interpretations converge.

* The following are the variables isolated in the correlation study:
  * GarageArea: Size of garage in square feet
  * GrLivArea: Above grade (ground) living area square feet
  * OverallQual: Rates the overall material and finish of the house
  * YearBuilt: Original construction date
  * TotalBsmtSF: Total square feet of basement area

* The correlation analysis shows that the sizes of the above-grade living area, the basement and the garage are strongly associated with house price. The year the house was built and the overall quality of its materials and finish are also strongly associated with house price.

* The plots show that the variables isolated in the correlation study are indeed strongly correlated with Sale Price and may therefore have strong predictive power.

* The next step is Data Cleaning

---