# **House Sales Price Study**

## Objectives

* Answer business requirement 1:
  * The client wants to understand how the most relevant house variables correlate with the sale price.

## Inputs

* outputs/datasets/collection/house_prices_after_inspection.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App

## Additional Comments

* The data originates from Kaggle but was provided by CI.


---

# Change working directory to the parent folder

Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory
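
As a quick illustration of the parent-directory lookup described above, the sketch below applies `os.path.dirname()` to a hypothetical path string. The folder names are made up for this example, and the working directory is not changed.

```python
import os

# Hypothetical notebook location, used only for illustration
notebook_dir = "/workspace/project/jupyter_notebooks"

# os.path.dirname() strips the last path component and returns the parent folder
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)
```

Passing that parent path to `os.chdir()` is what makes relative paths such as `outputs/datasets/...` resolve from the project root.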

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load the Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices_after_inspection.csv")
#df.head()
df.tail()

In [None]:
df_house_price_study = df.copy()
df_house_price_study.tail()

# Create a profile report for quick Exploratory Data Analysis (EDA)

In [None]:
from ydata_profiling import ProfileReport
profile_report= ProfileReport(df=df_house_price_study, minimal=True)
#profile_report
#profile_report.to_notebook_iframe()

## EDA Observations

* The dataset consists predominantly of numerical variables.
* Only 4 variables are categorical: BsmtExposure, BsmtFinType1, GarageFinish, KitchenQual
* The 4 categorical variables are imbalanced.
* Several variables have missing values and zeros.
* Most numerical variables do not appear to be normally distributed.
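
The imbalance and non-normality observations can also be checked numerically. The following is a minimal sketch on invented toy data, not the real dataset: the column names mirror the housing data only for readability, and the use of `value_counts(normalize=True)` and `skew()` is an assumption about how one might quantify what the profile report shows.

```python
import pandas as pd

# Toy data standing in for the real dataset (all values are invented)
toy = pd.DataFrame({
    "KitchenQual": ["TA", "TA", "TA", "TA", "Gd", "Gd", "Ex", "TA"],
    "LotArea": [8000, 8500, 9000, 9500, 10000, 11000, 12000, 60000],
})

# Share of each category: one dominant level signals imbalance
category_share = toy["KitchenQual"].value_counts(normalize=True)
print(category_share)

# Sample skewness: values far from 0 suggest an asymmetric, non-normal distribution
lot_area_skew = toy["LotArea"].skew()
print(f"LotArea skewness: {lot_area_skew:.2f}")
```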

# Handle Missing Values (NaN)

In [None]:
df_house_price_study.isnull().sum().to_frame(name="Is Null")

In [None]:
categorical_variables = df_house_price_study.select_dtypes(include='object').columns.to_list()
categorical_variables

In [None]:
for col in categorical_variables:
    print(df_house_price_study[col].value_counts())

In [None]:
from feature_engine.imputation import CategoricalImputer
categorical_imputer = CategoricalImputer(imputation_method='missing',
                                         fill_value='Missing',
                                         variables=categorical_variables)
df_categ_imputed = categorical_imputer.fit_transform(df_house_price_study)

In [None]:
df_categ_imputed[categorical_variables].isnull().sum().to_frame(name="Is Null")

In [None]:
import pingouin as pg
pg.normality(data=df_categ_imputed, alpha=0.05)  # normality test per numerical column: none are normally distributed


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
# for col in df.select_dtypes(include="number").columns:
  # sns.histplot(data=df, x=col, kde=True)
  # plt.show()
  # print('\n')

numerical_variables = df_categ_imputed.select_dtypes(include="number").columns

n_cols = 3
n_rows = (len(numerical_variables) + n_cols - 1) // n_cols

# Create the figure and subplots grid
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 5, n_rows * 4))

# Flatten the axes array if there's more than one row/column
axes = axes.flatten()

# Iterate through columns and plot on respective axes
for i, col in enumerate(numerical_variables):
    sns.histplot(data=df_categ_imputed, x=col, kde=True, ax=axes[i]) 
    axes[i].set_title(f'Distribution of {col}')

# Remove unused subplots
for j in range(len(numerical_variables), len(axes)):
    fig.delaxes(axes[j])

# Prevent titles/labels from overlapping
plt.tight_layout()

# Display all plots
plt.show()


In [None]:
numerical_variables = df_categ_imputed.select_dtypes(include="number").columns.to_list()
numerical_variables

In [None]:
from feature_engine.imputation import MeanMedianImputer
numerical_imputer = MeanMedianImputer(imputation_method='median',
                                      variables=numerical_variables)

df_categ_and_numb_imputed = numerical_imputer.fit_transform(df_categ_imputed)


In [None]:
df_categ_and_numb_imputed.isnull().sum().to_frame(name="Is Null")

# Correlation Study: Pearson and Spearman

**Goal:** identify how the target (SalePrice) correlates with the other variables, and retrieve the five variables most strongly correlated with SalePrice.
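
To make the difference between the two coefficients concrete, here is a minimal sketch on invented toy data (the `area`, `price`, and `noise` columns are hypothetical). Pearson measures linear association, while Spearman measures monotonic association, so a perfectly monotonic but curved relationship scores 1.0 on Spearman and lower on Pearson. The sketch also ranks features by absolute value and drops the target by label, which is one possible alternative to slicing off the first row.

```python
import pandas as pd

# Toy data (invented): the target grows monotonically but non-linearly with "area"
toy = pd.DataFrame({"area": range(1, 11)})
toy["price"] = toy["area"] ** 3
toy["noise"] = [5, 3, 8, 1, 9, 2, 7, 4, 10, 6]  # arbitrary values

# Correlation of every feature with the target, with the target itself removed
pearson = toy.corr(method="pearson")["price"].drop("price")
spearman = toy.corr(method="spearman")["price"].drop("price")

# Rank by absolute strength so that strong negative correlations are kept
pearson_ranked = pearson.sort_values(key=abs, ascending=False)
spearman_ranked = spearman.sort_values(key=abs, ascending=False)

print(pearson_ranked)
print(spearman_ranked)
```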

* Step 1: Handle missing values (completed in the previous section).

* Step 2: Since Spearman and Pearson correlations require numeric variables, transform the categorical variables into numerical ones using one-hot encoding.
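
The one-hot encoding step can be illustrated on a small scale. This sketch uses `pandas.get_dummies` as a stand-in for the `feature_engine` encoder used in this notebook, on invented toy values, so it shows the idea rather than the exact transformer.

```python
import pandas as pd

# Toy data (invented values) with one categorical and one numeric column
toy = pd.DataFrame({
    "GarageFinish": ["Unf", "RFn", "Fin", "Unf"],
    "SalePrice": [120000, 180000, 250000, 130000],
})

# Each category becomes its own 0/1 column; dtype=int avoids boolean output
toy_ohe = pd.get_dummies(toy, columns=["GarageFinish"], dtype=int)
print(toy_ohe.columns.to_list())

# The encoded columns can now take part in a correlation matrix
print(toy_ohe.corr(method="spearman")["SalePrice"])
```

After encoding, no `object` columns remain, which is the same check the notebook performs on `df_ohe`.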

In [None]:
from feature_engine.encoding import OneHotEncoder
one_hot_encoder = OneHotEncoder(variables=df_categ_and_numb_imputed.select_dtypes(include='object').columns.to_list(), drop_last=False)
#one_hot_encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = one_hot_encoder.fit_transform(df_categ_and_numb_imputed)
df_ohe.head()

### Evaluate if One Hot Encoding Worked

In [None]:
categorical_variables = df_ohe.select_dtypes(include='object').columns.to_list()
categorical_variables

# Correlation Study

In [None]:
corr_pearson_top10 = df_ohe.corr(method='pearson', numeric_only=True)['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10).to_frame(name="Correlation Coefficient")
corr_pearson_top10

---

In [None]:
corr_spearman_top10 = df_ohe.corr(method='spearman', numeric_only=True)['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10).to_frame(name="Correlation Coefficient")
corr_spearman_top10

In [None]:
combined_top_features = set(corr_pearson_top10[:5].index.to_list() + corr_spearman_top10[:5].index.to_list())
combined_top_features

Therefore, we will investigate whether:

* The sale price tends to increase as the first floor square footage increases.
* The sale price tends to increase as the size of the garage increases.
* The sale price tends to increase as the above-grade living area increases.
* The sale price tends to increase as the overall quality of the house's material and finish improves.
* The sale price tends to increase as the total square feet of basement area increases.
* The sale price tends to increase with newer construction dates.

In [None]:
vars_to_study=['1stFlrSF', 'GarageArea', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt']
vars_to_study

In [None]:
df_eda_subset = df_house_price_study.filter(vars_to_study + ["SalePrice"])
df_eda_subset.head()

In [None]:
sns.set_style('whitegrid')

# %matplotlib inline

def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()
    plt.close()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()
    plt.close()

def plot_numerical_vs_continuous(df, col, target_var):
    plt.figure(figsize=(10, 6))

    # Option A: Scatter Plot with Regression Line
    sns.scatterplot(data=df, x=col, y=target_var, alpha=0.6)
    sns.regplot(data=df, x=col, y=target_var, scatter=False, color='red', line_kws={'linestyle':'--'}) # Adds a regression line
    plt.title(f'{col} vs. {target_var} (Scatter Plot with Regression Line)', fontsize=16)

    # Option B: Joint Plot (provides scatter + marginal distributions)
    # Not using within the function, but mentioning as an alternative outside for deeper dives
    # g = sns.jointplot(data=df, x=col, y=target_var, kind='reg', height=8)
    # g.set_axis_labels(col, target_var, fontsize=12)
    # g.fig.suptitle(f'{col} vs. {target_var} (Joint Plot)', y=1.02, fontsize=16) # Title for jointplot

    plt.xlabel(col, fontsize=12)
    plt.ylabel(target_var, fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()
    plt.close()

# Example usage:
# plot_numerical_vs_continuous(df_house_price_study, 'GrLivArea', 'SalePrice')
# plot_numerical_vs_continuous(df_house_price_study, 'YearBuilt', 'SalePrice')

print(df_eda_subset.head())
target_var = 'SalePrice'
for col in vars_to_study:
    if df_eda_subset[col].dtype == 'object':
        plot_categorical(df_eda_subset, col, target_var)
        print("\n\n")
    else:
        plot_numerical_vs_continuous(df_eda_subset, col, target_var)
        print("\n\n")

# Parallel Plot with numerical variables

In [None]:
import plotly.express as px
fig = px.parallel_coordinates(df_eda_subset, color="SalePrice",
                              dimensions = vars_to_study)
fig.show()

# px.colors.sequential.swatches() 

# fig = px.parallel_coordinates(df, color="species", color_continuous_scale='viridis')
# fig.show()






# Conclusions

The correlation results and the plots lead to the same interpretation:

* Sale prices are typically higher for homes with larger first-floor square footage.
* Sale prices are typically higher for homes with larger garages.
* Sale prices are typically higher for homes with larger above-grade living areas.
* Sale prices are typically higher when the overall quality of the house's materials and finish improves.
* Sale prices are typically higher for homes with larger total basement area.
* Sale prices are typically higher for more recently built homes.