# **House Sales Price Study**

<br>

## Objectives

* Answer business requirement 1:
  * The client is interested in identifying the house variables most strongly correlated with sale price.

## Inputs

* outputs/datasets/collection/house_prices_after_inspection.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App

---

# Change working directory to the parent folder

Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
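
Because `current_dir` is reassigned after the change, running the `os.chdir` cell a second time moves up one more level. A minimal sketch of an alternative that gives the same result however often it runs, assuming the project root is the folder that contains `outputs`. The helper `find_project_root` and the `jupyter_notebooks` folder name are hypothetical and only used for the demo.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Walk upwards from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None  # no folder with the marker was found


# Demo on a throwaway tree: <tmp>/outputs and <tmp>/jupyter_notebooks
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "outputs").mkdir()
    notebooks = root / "jupyter_notebooks"
    notebooks.mkdir()

    found_from_child = find_project_root(notebooks)
    found_from_root = find_project_root(root)
    same_result = found_from_child == found_from_root == root

print(same_result)
```

In the notebook, the returned path could be passed to `os.chdir()` after checking that it is not `None`.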

# Load the Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices_after_inspection.csv")
#df.head()
df.tail()

In [None]:
df_house_price_study = df.copy()
df_house_price_study.tail()

# Create a profile report for quick Exploratory Data Analysis (EDA)

In [None]:
from ydata_profiling import ProfileReport
profile_report = ProfileReport(df=df_house_price_study, minimal=True)
#profile_report
#profile_report.to_notebook_iframe()

## EDA Observations

* This dataset has a predominance of numerical variables.
* Only 4 variables are categorical: BsmtExposure, BsmtFinType1, GarageFinish and KitchenQual.
* The 4 categorical variables are imbalanced.
* Several variables have missing values and zeros.
* Most numerical variables do not appear to be normally distributed.
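
The observations above can also be checked directly with pandas. A minimal sketch on hypothetical toy data (the column names and values are made up, not taken from the house price dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real house price dataset
toy = pd.DataFrame({
    "area": [856.0, 1262.0, np.nan, 0.0, 1145.0],
    "porch": [0, 0, 35, 0, 84],
    "quality": ["Gd", "TA", None, "Gd", "Gd"],
})

# One row per column: data type, missing values and zeros (zeros only for numeric columns)
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isnull().sum(),
    "n_zeros": toy.select_dtypes(include="number").eq(0).sum(),
})
print(overview)

# Share of the most frequent level per non-numeric column (a rough imbalance indicator)
categorical_cols = toy.select_dtypes(exclude="number").columns
top_share = {
    col: toy[col].value_counts(normalize=True).iloc[0] for col in categorical_cols
}
print(top_share)
```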

# Handle Missing Values (NaN)

* Handle missing data before performing correlation analysis

Step 1: Confirm the variables with missing values

In [None]:
df_house_price_study.isnull().sum().to_frame(name="Is Null")

Step 2: Categorical variables -> Handle their missing data

* Retrieve the categorical variables

In [None]:
categorical_variables = df_house_price_study.select_dtypes(include='object').columns.to_list()
categorical_variables

* Assess the frequency of their values

In [None]:
for col in categorical_variables:
    print(df_house_price_study[col].value_counts())

* Perform Categorical Imputation: Replace missing values for the categorical variables with the word "Missing"

In [None]:
from feature_engine.imputation import CategoricalImputer
categorical_imputer = CategoricalImputer(imputation_method='missing',
                                         fill_value='Missing',
                                         variables=categorical_variables)
df_categ_imputed = categorical_imputer.fit_transform(df_house_price_study)
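
For intuition, the effect of this imputation can be sketched with plain pandas. This is an illustrative equivalent on hypothetical toy values, not the notebook's method:

```python
import pandas as pd

# Hypothetical toy values, not the real dataset
toy = pd.DataFrame({
    "BsmtExposure": ["No", None, "Gd", "No"],
    "LotArea": [8450, 9600, None, 9550],
})
categorical_cols = ["BsmtExposure"]

# Fill only the categorical columns; numerical columns are left untouched
imputed = toy.copy()
imputed[categorical_cols] = imputed[categorical_cols].fillna("Missing")

counts = imputed["BsmtExposure"].value_counts()
print(counts)
```

"Missing" becomes a category of its own, so the fact that a value was absent stays visible in later steps.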

* Confirm that the categorical variables do not have missing values

In [None]:
df_categ_imputed[categorical_variables].isnull().sum().to_frame(name="Is Null")

* Assess the frequency of their values now including the "Missing" category

In [None]:
for col in categorical_variables:
    print(df_categ_imputed[col].value_counts())

In [None]:
profile_report_categ = ProfileReport(df=df_categ_imputed, minimal=True)
#profile_report_categ.to_notebook_iframe()

Step 3: Numerical variables -> Handle their missing data as well

* Check normality (based on the results below, the numerical variables are not normally distributed)

In [None]:
import pingouin as pg
pg.normality(data=df_categ_imputed, alpha = 0.05)
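
A normality test gives a pass/fail verdict per variable but not the size of the departure. As a complementary check, skewness can be computed with pandas. A minimal sketch on synthetic data (both columns are simulated, so this is an illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic columns: one roughly symmetric, one strongly right-skewed
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=2000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

# Values near 0 suggest symmetry; large positive values suggest a long right tail
skewness = toy.skew()
print(skewness)
```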

In [None]:
numerical_variables = df_categ_imputed.select_dtypes(include="number").columns
print(len(numerical_variables))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')

def DistributionPlot(n_cols, numerical_variables, df):
    n_rows = (len(numerical_variables) + n_cols - 1) // n_cols

    # Create the figure and subplots grid
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 5, n_rows * 4),
                             squeeze=False)

    # squeeze=False always returns a 2-D array, so flatten() also works for a single subplot
    axes = axes.flatten()

    # Iterate through columns and plot on respective axes
    for i, col in enumerate(numerical_variables):
        sns.histplot(data=df, x=col, kde=True, ax=axes[i]) 
        axes[i].set_title(f'Distribution of {col}')

    # Remove unused subplots
    for j in range(len(numerical_variables), len(axes)):
        fig.delaxes(axes[j])

    # Prevent titles from overlapping
    plt.tight_layout()

    plt.show()


In [None]:
DistributionPlot(n_cols =5, numerical_variables = numerical_variables,
                  df = df_categ_imputed)


* Retrieve the numerical variables as a list

In [None]:
numerical_variables = df_categ_imputed.select_dtypes(include="number").columns.to_list()
numerical_variables

* Perform Median Imputation: Replace missing values for the numerical variables with the median value of the variable

In [None]:
from feature_engine.imputation import MeanMedianImputer
numerical_imputer = MeanMedianImputer(imputation_method='median',
                            variables= numerical_variables)

df_categ_and_numb_imputed = numerical_imputer.fit_transform(df_categ_imputed)
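
The median is less sensitive to extreme values than the mean, which suits variables that are not normally distributed. A minimal pandas sketch with hypothetical values (an illustrative equivalent, not the notebook's method):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one extreme observation
toy = pd.DataFrame({"LotFrontage": [60.0, 65.0, 70.0, np.nan, 400.0]})

median_value = toy["LotFrontage"].median()  # ignores NaN
mean_value = toy["LotFrontage"].mean()      # pulled upwards by the extreme value

# Replace missing values in each numerical column with that column's median
imputed = toy.fillna(toy.median(numeric_only=True))

print(median_value, mean_value)
print(imputed)
```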


* Confirm that no variable has missing values

In [None]:
df_categ_and_numb_imputed.isnull().sum().to_frame(name="Is Null")

* Check their distribution once again (the numerical variables are still not normally distributed)

In [None]:
DistributionPlot(n_cols =5, numerical_variables = numerical_variables,
                  df = df_categ_and_numb_imputed)

pg.normality(data=df_categ_and_numb_imputed, alpha = 0.05)

#### Summary of Handling Missing Values for Correlation analysis

* Both categorical and numerical variables now have no missing values.
* For the categorical variables, the proportion of missing values is not high enough to warrant dropping any of them.
* Only one categorical variable (BsmtExposure) shows a low degree of *missingness*, so imputing missing values with the most frequent category is not being considered for now.


<br>

# Correlation Study: Pearson and Spearman

**Goal:** identify how the variables correlate with the target (SalePrice), and retrieve the top 5 variables most correlated with SalePrice.

Step 1: Transform categorical variables to numerical variables using one hot encoding.

* This step is performed because the Spearman and Pearson methods need numeric variables.

In [None]:
from feature_engine.encoding import OneHotEncoder
one_hot_encoder = OneHotEncoder(variables=df_categ_and_numb_imputed.select_dtypes(include='object').columns.to_list(), drop_last=False)
df_ohe = one_hot_encoder.fit_transform(df_categ_and_numb_imputed)
df_ohe.head()
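
For intuition, one hot encoding can be sketched with `pandas.get_dummies`. This is an illustrative equivalent on hypothetical toy values, not the encoder used above:

```python
import pandas as pd

# Hypothetical toy values, not the real dataset
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", "Gd"],
    "SalePrice": [200000, 150000, 300000, 210000],
})

# Each category becomes its own 0/1 column; no category is dropped
encoded = pd.get_dummies(toy, columns=["KitchenQual"], dtype=int)
print(encoded)
```

Keeping every category (as `drop_last=False` does above) means each level gets its own correlation coefficient with the target.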

Step 2: Evaluate if One Hot Encoding worked

* OHE worked because the code below shows that the dataset no longer has categorical variables

In [None]:
categorical_variables = df_ohe.select_dtypes(include='object').columns.to_list()
categorical_variables

Step 3: Perform Pearson

* Retrieve the top 10 correlated variables/features against the target SalePrice.

In [None]:
corr_pearson_top10 = df_ohe.corr(method='pearson', numeric_only=True)['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10).to_frame(name="Correlation Coefficient")
corr_pearson_top10
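
Pearson measures linear association only, which is why this study also uses the rank-based Spearman coefficient. A minimal sketch on synthetic data shows the difference for a monotonic but non-linear relationship. Spearman is computed here as Pearson on the ranks, which is its definition:

```python
import numpy as np
import pandas as pd

# Synthetic data: y always grows with x, but exponentially rather than linearly
toy = pd.DataFrame({"x": np.arange(1, 11, dtype=float)})
toy["y"] = np.exp(toy["x"])

# Pearson on the raw values (Series.corr defaults to Pearson)
pearson = toy["x"].corr(toy["y"])

# Spearman is Pearson applied to the ranks of the values
spearman_via_ranks = toy["x"].rank().corr(toy["y"].rank())

print(pearson, spearman_via_ranks)
```

Spearman reaches 1 because the ordering of `x` and `y` is identical, while Pearson stays clearly below 1.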

Step 4: Perform Spearman

* Retrieve the top 10 correlated variables/features against the target SalePrice.

In [None]:
corr_spearman_top10 = df_ohe.corr(method='spearman', numeric_only=True)['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10).to_frame(name="Correlation Coefficient")
corr_spearman_top10

Step 5: Take the top 5 features of each method (Pearson and Spearman) and combine them

In [None]:
combined_top_features = set(corr_pearson_top10[:5].index.to_list() + corr_spearman_top10[:5].index.to_list())
combined_top_features

Therefore we will study if:

* The sale price tends to increase as the first floor square footage increases.
* The sale price tends to increase as the size of the garage increases.
* The sale price tends to increase as the above-grade living area increases.
* The sale price tends to increase as the overall house material and finish of the house improves.
* The sale price tends to increase as the total square feet of basement area increases.
* The sale price tends to increase with newer construction dates.

In [None]:
vars_to_study = ['1stFlrSF', 'GarageArea', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt']
vars_to_study
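
Combining two top 5 lists as a set can yield more than five variables, because each method may rank a different feature in fifth place. A minimal sketch with placeholder feature names (hypothetical, not the notebook's results):

```python
# Hypothetical rankings with placeholder names, not the notebook's results
pearson_top5 = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"]
spearman_top5 = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_f"]

# The set removes duplicates; sorting gives a stable order, since sets are unordered
combined = sorted(set(pearson_top5 + spearman_top5))
print(combined)
```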

#### Summary of Pearson and Spearman 

* Both methods show moderate to very strong correlations between SalePrice and each of the variables to study.

<br>

# EDA on variables to study

Step 1: Create a dataframe with the variables to study and the target (SalePrice)

In [None]:
df_eda_subset = df_house_price_study.filter(vars_to_study + ["SalePrice"])
df_eda_subset.head()

Step 2: Plot their variable distribution

* The target variable (SalePrice) is numeric.
* Thus, categorical variables are plotted against the target with a boxplot and numerical variables with a scatter plot.

In [None]:
sns.set_style('whitegrid')

# %matplotlib inline

def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.boxplot(data=df, x=col, y=target_var)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()
    plt.close()

def plot_numerical_vs_continuous(df, col, target_var):
    plt.figure(figsize=(10, 6))

    sns.scatterplot(data=df, x=col, y=target_var, alpha=0.6)
    sns.regplot(data=df, x=col, y=target_var, scatter=False, color='red', line_kws={'linestyle':'--'}) # Adds a regression line
    plt.title(f'{col} vs. {target_var} (Scatter Plot with Regression Line)', fontsize=16)


    plt.xlabel(col, fontsize=12)
    plt.ylabel(target_var, fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()
    plt.close()


print(df_eda_subset.head())
target_var = 'SalePrice'
for col in vars_to_study:
    if df_eda_subset[col].dtype == 'object':
        plot_categorical(df_eda_subset, col, target_var)
        print("\n\n")
    else:
        plot_numerical_vs_continuous(df_eda_subset, col, target_var)
        print("\n\n")

# Parallel Plot with numerical variables

* Get a parallel plot.
* As all variables are numerical, there is no need for transformation step(s).

In [None]:
import plotly.express as px
fig = px.parallel_coordinates(df_eda_subset, color="SalePrice",
                              dimensions = vars_to_study,
                              color_continuous_scale = 'Jet')
fig.show()


<br>

# Conclusions

The correlation results and the interpretation of the plots converge.

* Sale prices are typically higher for homes with larger first-floor square footage.
* Sale prices are typically higher for homes with larger garages.
* Sale prices are typically higher for homes with larger above-grade living areas.
* Sale prices are typically higher when the overall quality of the house's materials and finish improves.
* Sale prices are typically higher for homes with larger total basement area.
* Sale prices are typically higher for homes that were built more recently.