# Objectives
* The client is interested in discovering how the house attributes correlate with the sale price. Therefore, the client expects data visualisations of the correlated variables against the sale price to show that.

* The client is interested in predicting the house sale price for her four inherited houses and for any other house in Ames, Iowa.

## Change working directory
We need to change the working directory from its current folder to its parent folder.

  * We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory

    * os.path.dirname() gets the parent directory
    * os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
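
The parent-directory step can be checked in isolation. The sketch below applies os.path.dirname() to a made-up path (the path is hypothetical and is not the project's actual folder) to show that it simply strips the last path component.

```python
import os

# Hypothetical notebook folder, used only for illustration
example_dir = "/workspace/project/jupyter_notebooks"

# dirname() drops the final component, which gives the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```

Passing this value to os.chdir() is what moves the working directory up one level.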

# Import packages

In [None]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
from feature_engine.imputation import CategoricalImputer
from feature_engine.encoding import OneHotEncoder

# Load Dataset

In [None]:
# Load dataset
df = pd.read_csv("/workspace/housing-market-analysis/outputs/datasets/collection/house-price-2021.csv")
df = df.sample(frac=0.2, random_state=101)
print(df.shape)
df.head(5)

#X = df.drop(columns=['SalePrice'])  # Extract features
#y = df['SalePrice']  # Extract target variable

#print(df.columns)
#df = df.fillna(0)
#df.info()


---

# Data Exploration

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

# Correlation Study

* Used OneHotEncoder to transform categorical variables into a format that can be provided to machine learning algorithms.

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(6)
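
To illustrate what one-hot encoding does to a categorical column, here is a minimal sketch on a toy DataFrame. It uses pandas.get_dummies instead of feature_engine's OneHotEncoder, and the toy values are invented for illustration; only the resulting column naming (variable name, underscore, category) mirrors what appears in df_ohe.

```python
import pandas as pd

# Toy data (assumed values, not taken from the housing dataset)
toy = pd.DataFrame({
    "GarageFinish": ["Unf", "Fin", "RFn", "Unf"],
    "SalePrice": [120000, 250000, 180000, 110000],
})

# Each category becomes its own 0/1 column; numeric columns are left untouched
toy_ohe = pd.get_dummies(toy, columns=["GarageFinish"], dtype=int)
print(toy_ohe)
```

Every row has exactly one of the new GarageFinish_* columns set to 1, which is why the encoded frame can be passed to .corr().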

* Checked the column types and non-null counts of the encoded DataFrame with .info()

In [None]:
df_ohe.info()

Use .corr() for **spearman** and **pearson** methods, and investigate the top n correlations

    * We know this command returns a pandas series and the first item is the correlation between SalePrice and SalePrice, which happens to be 1, so we exclude that with [1:]
    * We sort values considering the absolute value, by setting key=abs

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman
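
The ranking logic can be tried on a small toy DataFrame (all values below are assumptions made up for illustration). This sketch drops the target by label with .drop('SalePrice') rather than by position with [1:], which avoids relying on SalePrice always sorting first; it is an alternative, not the method used in this notebook.

```python
import pandas as pd

# Toy data (assumed values): one rising, one mostly falling, one noisy feature
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "Quality": [1, 2, 3, 4, 5],
    "Age": [50, 45, 20, 25, 5],
    "Noise": [3, 1, 4, 1, 5],
})

corr = toy.corr(method="spearman")["SalePrice"]

# Remove the self-correlation by name, then rank by absolute strength
ranked = corr.drop("SalePrice").sort_values(key=abs, ascending=False)
print(ranked)
```

Because key=abs is used, the negatively correlated Age column ranks above the weakly positive Noise column.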

Now we do the same for **pearson**

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
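
Running both methods is useful because they measure different things: pearson captures linear relationships, while spearman captures monotonic ones. A minimal sketch on invented data (the x and y columns are hypothetical) shows the two diverging for a curved but strictly increasing relationship.

```python
import pandas as pd

# Toy data: y always increases with x, but not along a straight line
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Spearman works on ranks, so it reports a perfect relationship here, whereas pearson is lower because the points do not fall on a line.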

In both methods, the correlations between SalePrice and the individual variables are moderate or strong.

    * OverallQual is the strongest in both methods.

    * GarageYrBlt and GarageFinish_Unf are the weakest links in both methods.

    

* Ideally, we pursue strong correlation levels. The reason is that the available dataset may have limited information or may not capture all relevant factors influencing the variables of interest.


We consider the top 8 correlations found in df_ohe and will study the associated variables in df.

In [None]:
top_n = 8
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

Therefore, in our analysis of the DataFrame (df), we will explore the following variables to determine whether:

* A higher SalePrice is typically associated with a larger First Floor area in square feet **['1stFlrSF']**.

* A higher SalePrice is typically associated with a larger Size of garage in square feet **['GarageArea']**.
* A higher SalePrice is typically associated with the year the garage was built **['GarageYrBlt']**.
* A higher SalePrice is typically associated with a larger Above grade (ground) living area in square feet **['GrLivArea']**.
* A higher SalePrice is typically associated with a higher Rate of the overall material and finish of the house **['OverallQual']**.
* A higher SalePrice is typically associated with the total square feet of the basement area **['TotalBsmtSF']**.
* A higher SalePrice is typically associated with the original construction date **['YearBuilt']**. 
* A higher SalePrice is typically associated with the remodel date **['YearRemodAdd']**.

In [None]:
vars_to_study = ['1stFlrSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd']
vars_to_study

# Exploratory Data Analysis (EDA) on selected variables

In [None]:
df_expda = df.filter(vars_to_study + ['SalePrice']) # Keep the selected feature variables plus the target variable
df_expda.head(5)

# Variables Distribution by SalePrice
A visual distribution of each numerical variable, coloured by SalePrice

In [None]:
import matplotlib.pyplot as plt
from IPython.display import Image, display
sns.set_style('whitegrid')

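def numerical_plot(df, col, target_var):
    plt.figure(figsize=(10, 5))
    sns.scatterplot(data=df, x=col, y=target_var, hue=target_var)
    plt.title(f"{col}", fontsize=15, y=1.5)
    # Save before showing: after plt.show() the figure may already be cleared,
    # in which case savefig would write a blank image.
    # Each call overwrites the same file, so only the last plot stays on disk.
    plt.savefig('numerical_plot.png')
    plt.show()
    plt.close()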


target_var = "SalePrice"
# f_var = "OverallQual"
for col in vars_to_study:
    numerical_plot(df_expda, col, target_var)
    print("\n\n")


plot_path = 'numerical_plot.png'

numerical_image = Image(plot_path)

display(numerical_image)


# Conclusions and next steps

Interpretation of the plots and correlations:

* A higher SalePrice is typically associated with a larger First Floor area in square feet **['1stFlrSF']**.

* A higher SalePrice is typically associated with a larger Size of garage in square feet **['GarageArea']**.
* A higher SalePrice is typically associated with the year the garage was built **['GarageYrBlt']**.
* A higher SalePrice is typically associated with a larger Above grade (ground) living area in square feet **['GrLivArea']**.
* A higher SalePrice is typically associated with a higher Rate of the overall material and finish of the house **['OverallQual']**.
* A higher SalePrice is typically associated with the total square feet of the basement area **['TotalBsmtSF']**.
* A higher SalePrice is typically associated with the original construction date **['YearBuilt']**. 
* A higher SalePrice is typically associated with the remodel date **['YearRemodAdd']**.

In [None]:
import os
try:
  os.makedirs(name='outputs/datasets/cleaned') # create outputs/datasets/cleaned folder
except Exception as e:
  print(e)
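
As an alternative to catching the exception, os.makedirs() accepts exist_ok=True, which makes the call succeed silently when the folder is already there. The sketch below demonstrates this inside a temporary directory (an assumption made so the example does not touch the project's real outputs folder).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    target = os.path.join(base_dir, "outputs", "datasets", "cleaned")

    # First call creates the nested folders
    os.makedirs(target, exist_ok=True)
    # Second call is a no-op instead of raising FileExistsError
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)

print(created)
```

With this option, re-running the notebook does not print an error message for a folder that already exists, while other failures such as permission errors still raise.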