# **How house attributes influence the sale price**

## Objectives

**Generate content to satisfy business requirement 1**:
* Determine how features are correlated to the target, and thus their significance in determining the sale price.

**Business requirement 2**:
* Determine which features are significant enough, with regard to predicting the sale price, to be studied in the exploratory data analysis notebook.

## Inputs

* house prices dataset: outputs/datasets/collection/house_prices.csv

## Outputs
**For use on the dashboard**:
* Code for calculating Spearman and phi_k correlation coefficients, significance values and PPS values.
* Code to produce heatmaps for the coefficients and PPS values.
* Code to produce scatter plots for feature-target pairs.
* Cleaned dataset for correlation tests: outputs/datasets/sale_price_study/cleaned/house_prices.csv
* Dataset with encoded categorical features: outputs/datasets/sale_price_study/cleaned/encoded_house_prices.csv
* Pickled list of selected significant features: outputs/selected_significant_features.pkl



---

## Change working directory

Working directory changed to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load house prices dataset

In [None]:
import pandas as pd

house_prices_df = pd.read_csv(filepath_or_buffer='outputs/datasets/collection/house_prices.csv')
house_prices_df.dtypes

---

## Dealing with missing data

Discover columns with missing data

In [None]:
missing_data_df = house_prices_df.loc[:, house_prices_df.isna().any()]
print(missing_data_df.info())
house_prices_numeric_df = house_prices_df.select_dtypes(exclude=['object']).drop(['SalePrice'], axis=1)
print(house_prices_numeric_df.columns)
house_prices_non_numeric_df = house_prices_df.select_dtypes(include='object')
house_prices_non_numeric_df.columns


**Impute missing values in numeric columns**.  
The imputation method should not distort the distributions significantly, and should preserve any existing correlations.
With this in mind the KNNImputer is used, with its default of 5 nearest neighbours.

In [None]:
from sklearn.impute import KNNImputer
import seaborn as sns
import matplotlib.pyplot as plt

# plotting distributions for numeric variables with missing data
counter = 0
imputed_columns = []
while counter < len(house_prices_numeric_df.columns):
    if house_prices_numeric_df.iloc[:, counter].name in missing_data_df.columns:
        fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(10,4))
        sns.histplot(x=house_prices_numeric_df.iloc[:, counter], ax=ax[0])
        imputed_columns.append(house_prices_numeric_df.iloc[:, counter].name)
    counter += 1

# Imputing the missing values for required columns (n_neighbors defaults to 5)
imputer = KNNImputer()
imputer.set_output(transform='pandas')
house_prices_numeric_df = imputer.fit_transform(house_prices_numeric_df)
print(house_prices_numeric_df.isna().sum())

# rounding the discrete-valued columns, since averaging over neighbours can give fractional values
for col in ['BedroomAbvGr','GarageYrBlt','OverallCond','OverallQual','YearBuilt', 'YearRemodAdd']:
    house_prices_numeric_df[col] = house_prices_numeric_df[col].round()

# writing the imputed and rounded values back to the main dataframe
house_prices_df[house_prices_numeric_df.columns.values] = house_prices_numeric_df

# plotting distributions after imputation on same figures, for visual comparison
for fig in plt.get_fignums():
    sns.histplot(x=house_prices_numeric_df[imputed_columns[fig - 1]], ax=plt.figure(fig).get_axes()[1])


Inspection of the above figures, which show the variable distributions before and after imputation, indicates that apart from the 'EnclosedPorchSF' and 'WoodDeckSF' variables,
there is no noticeable change in the distribution shapes. For both of these variables a large quantity of data was missing,
so a somewhat noticeable change is inevitable.
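
As an illustration of how the KNNImputer fills a gap, the following minimal sketch uses a toy dataframe with hypothetical values (the column names `area` and `rooms` are not from the dataset). It uses `n_neighbors=2` only because the toy frame is tiny; the notebook itself relies on the default of 5.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values): the 'rooms' entry in row 1 is missing
toy_df = pd.DataFrame({
    'area': [100.0, 110.0, 120.0, 500.0],
    'rooms': [2.0, np.nan, 3.0, 8.0],
})

# Two neighbours are enough for this tiny example
toy_imputer = KNNImputer(n_neighbors=2)
toy_imputed_df = pd.DataFrame(toy_imputer.fit_transform(toy_df), columns=toy_df.columns)

# Rows 0 and 2 are closest to row 1 by 'area', so the gap becomes the mean of their 'rooms' values
print(toy_imputed_df.loc[1, 'rooms'])  # 2.5
```

Because the filled value is an average over similar rows, it can be fractional, which is why discrete columns such as 'BedroomAbvGr' are rounded after imputation.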

**Impute missing values in non-numeric columns**.  
In the absence of a viable distance metric, and considering that only a small amount of data is missing, the KNNImputer was not used. Instead the
simpler method of filling the missing values by cycling through the possible values was used.

In [None]:
missing_data_non_numeric_df = missing_data_df.select_dtypes(include='object')
print(missing_data_non_numeric_df.columns)

# plotting the variable distributions before imputation
counter = 0
imputed_columns = []
while counter < len(missing_data_non_numeric_df.columns):
    fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(10,4))
    sns.countplot(x=missing_data_non_numeric_df.iloc[:, counter], ax=ax[0])
    imputed_columns.append(missing_data_non_numeric_df.iloc[:, counter].name)
    counter += 1

# imputing the missing values by cycling through the observed categories
for col in imputed_columns:
    unique_values = house_prices_df[col].dropna().unique()
    nan_index = house_prices_df.index[house_prices_df[col].isna()]
    # the modulo wraps back to the first category once all have been used
    fill_values = [unique_values[i % unique_values.size] for i in range(nan_index.size)]
    house_prices_df.loc[nan_index, col] = fill_values

# plotting the distributions on the same figures after imputation for comparison
for fig in plt.get_fignums():
    sns.countplot(x=house_prices_df[imputed_columns[fig - 1]], ax=plt.figure(fig).get_axes()[1])
          
house_prices_df.isna().sum()

The above figures show that there are no significant changes in the distribution shapes.
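
The cycling approach can be sketched in isolation as follows. The helper name `fill_by_cycling` and the toy series are hypothetical, and the sketch assumes the column has at least one observed value.

```python
from itertools import cycle

import numpy as np
import pandas as pd


def fill_by_cycling(series: pd.Series) -> pd.Series:
    """Fill missing entries by cycling through the observed categories (hypothetical helper)."""
    filled = series.copy()
    # Categories in order of first appearance, repeated indefinitely
    observed_values = cycle(filled.dropna().unique())
    for idx in filled.index[filled.isna()]:
        filled.loc[idx] = next(observed_values)
    return filled


toy_series = pd.Series(['Gd', np.nan, 'TA', np.nan, np.nan, 'Ex'], dtype='object')
print(fill_by_cycling(toy_series).tolist())
# ['Gd', 'Gd', 'TA', 'TA', 'Ex', 'Ex']
```

Spreading the fills evenly across categories keeps any single category from absorbing all the missing values, which is reasonable only while the number of missing values is small.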

### Save cleaned dataset

Needed for the dashboard, for the sale price correlation study page code.

In [None]:
path = os.path.join(os.getcwd(), 'outputs/datasets/sale_price_study/cleaned')
# exist_ok avoids an error if the folder has already been created
os.makedirs(path, exist_ok=True)

In [None]:
try:
    house_prices_df.to_csv(os.path.join(path, 'house_prices.csv'), index=False)
except Exception as e:
    print(e)

---

## Correlation matrices for features and target: sale price

### Normality test for sale price

In [None]:
import pingouin as pg

fig, axes = plt.subplots(nrows=2, figsize=(10,10))
sns.histplot(x=house_prices_df['SalePrice'], kde=True, ax=axes[0])
pg.qqplot(x=house_prices_df['SalePrice'], ax=axes[1])

In [None]:
print('skew:', house_prices_df['SalePrice'].skew())
print('kurtosis:', house_prices_df['SalePrice'].kurtosis())
pg.normality(house_prices_df['SalePrice'])

Examining the sale price histogram, QQ-plot, values of the skew and kurtosis, as well as the Shapiro-Wilk normality test, shows that the sale price is not normally distributed.
It is also known that many of the features are not normally distributed and may contain outliers; as such, considering the value-time tradeoff, the Pearson correlation test will not be used. Using it would require attempting to transform the distributions to normalise them (if possible), as well as removing any outliers.

**Instead the Spearman and phi_k correlation coefficients will be calculated**.
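
A minimal sketch of why a rank-based coefficient suits skewed data: for the hypothetical toy data below, the relationship is perfectly monotonic but far from linear, so the Spearman coefficient is 1 while the Pearson coefficient is noticeably lower. `scipy.stats` is used here purely for illustration; the notebook itself computes Spearman with pingouin.

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical): y always increases with x, but exponentially rather than linearly
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 2)

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
print(f'Pearson r: {pearson_r:.2f}, Spearman r: {spearman_r:.2f}')
```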

### Spearman

**Categorical variable encoding**:

In [None]:
house_prices_categorical_df = house_prices_df.select_dtypes(include='object')
house_prices_categorical_df.columns

All categorical variables are ordinal, therefore use ordinal encoding.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

bsmt_fin_type1_cat = np.array(list(reversed(['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'None'])))
bsmt_exposure_cat = np.array(['None', 'No', 'Mn', 'Av', 'Gd'])
garage_finish_cat = np.array(['None', 'Unf', 'RFn', 'Fin'])
kitchen_quality_cat = np.array(['Po', 'Fa', 'TA', 'Gd', 'Ex'])

categories = [bsmt_exposure_cat, bsmt_fin_type1_cat, garage_finish_cat, kitchen_quality_cat]
encoder = OrdinalEncoder(categories=categories, dtype='int64')
encoder.set_output(transform='pandas')


house_prices_df[house_prices_categorical_df.columns] = encoder.fit_transform(X=house_prices_categorical_df)
house_prices_df[house_prices_categorical_df.columns].head()


In [None]:
path = os.path.join(os.getcwd(), 'outputs/datasets/sale_price_study/cleaned')
# exist_ok avoids an error, since this folder was created earlier
os.makedirs(path, exist_ok=True)

In [None]:
try:
    house_prices_df.to_csv(os.path.join(path, 'encoded_house_prices.csv'), index=False)
except Exception as e:
    print(e)

When assessing and describing correlation coefficients, the following definitions, applied to the magnitude of r, will be used:

|r value range|Description|
|:----|:----|
|0-0.34|weak|
|0.35-0.64|moderate|
|0.65-1|strong|
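
The definitions in the table above can be expressed as a small helper. This is a sketch under two assumptions that the table does not state explicitly: the sign of r is ignored, and values falling between the listed ranges (for example 0.345) are assigned to the lower band. The function name `describe_strength` is hypothetical.

```python
def describe_strength(r: float) -> str:
    """Map a correlation coefficient to a strength label (hypothetical helper)."""
    magnitude = abs(r)
    if magnitude >= 0.65:
        return 'strong'
    if magnitude >= 0.35:
        return 'moderate'
    return 'weak'


for example_r in (0.81, -0.42, 0.10):
    print(example_r, describe_strength(example_r))
```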




Chosen significance level value: 0.05

In [None]:
spearman_df = house_prices_df.pairwise_corr(columns=['SalePrice'], alternative='greater', method='spearman')
spearman_df

Saving the spearman_df dataframe

In [None]:
path = os.path.join(os.getcwd(), 'src/sale_price_study/')
# exist_ok avoids an error if the folder has already been created
os.makedirs(path, exist_ok=True)

In [None]:
try:
    spearman_df.to_csv(os.path.join(path, 'spearman_df.csv'), index=False)
except Exception as e:
    print(e)

In [None]:

spearman_heatmap = sns.heatmap(spearman_df.pivot(index='Y', columns=['X'], values=['r']).sort_values(by=('r', 'SalePrice'), ascending=False), annot=True,
                                              vmax=1, vmin=-1, xticklabels=['SalePrice'], linecolor='black', linewidth=0.05)
spearman_heatmap.set(xlabel='', ylabel='Feature', title='SalePrice-Feature pair spearman correlations')
spearman_heatmap

Evaluating the coefficients and their associated p-values against the 0.05 significance level gives the following outcomes for the null hypothesis (r ≤ 0). Rejecting it supports the common alternative hypothesis that the correlation coefficient is greater than zero, or equivalently that the correlation is positive monotonic:

In [None]:
print('|Feature|Null hypothesis (r≤0) outcome|')
# features for which the null hypothesis cannot be rejected at the 0.05 significance level
not_rejected = ['OverallCond', 'EnclosedPorchSF']
for feature_name in house_prices_df.columns.drop('SalePrice'):
    outcome = 'fail to reject' if feature_name in not_rejected else 'reject'
    print(f'{feature_name}:', outcome)

Therefore the alternative hypotheses for the Size group and Quality group can be accepted: namely there exist statistically significant positive monotonic correlations between those features and the sale price. The same is true for the Age/Condition group and feature group 4, except for
the features 'OverallCond' and 'EnclosedPorch', where the coefficient is negative and, as indicated by the 95% confidence intervals, could well be zero.
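
The decision rule behind these outcomes can be sketched on toy data. The values below are hypothetical, and `scipy.stats.spearmanr` (with its `alternative` argument) stands in for the pingouin call used in the notebook.

```python
from scipy import stats

alpha = 0.05  # significance level chosen in this notebook

# Toy data (hypothetical values)
feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rising_target = [1, 3, 2, 5, 4, 7, 6, 9, 8, 10]
falling_target = rising_target[::-1]

outcomes = {}
for label, target in [('rising', rising_target), ('falling', falling_target)]:
    # One-sided test: the alternative hypothesis is r > 0
    r, p_value = stats.spearmanr(feature, target, alternative='greater')
    outcomes[label] = 'reject' if p_value < alpha else 'fail to reject'
    print(f'{label}: r={r:.2f}, p={p_value:.3f} -> {outcomes[label]} the null hypothesis (r <= 0)')
```

With a one-sided alternative, a negative coefficient yields a large p-value, so the null hypothesis is not rejected however strong the negative trend is.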

With regard to strength as defined earlier, the correlations for each feature can be described as follows:

In [None]:
print('|Feature|Strength|')
strong_features = []
moderate_features = []
weak_features = []
# thresholds follow the strength definitions given earlier, applied to |r|
for feature_name, r_value in spearman_df.set_index('Y')['r'].items():
    if abs(r_value) >= 0.65:
        strong_features.append((feature_name, r_value))
    elif abs(r_value) >= 0.35:
        moderate_features.append((feature_name, r_value))
    else:
        weak_features.append((feature_name, r_value))

for strength, features in [('strong', strong_features), ('moderate', moderate_features), ('weak', weak_features)]:
    for feature_name, _ in features:
        print(f'{feature_name}:', strength)

* Only 'GrLivArea' of the size group is as predicted; all the basement related features (except for 'TotalBsmtSF') showed weak correlations with the sale price; the lot based features have moderate correlations. More surprisingly, the '2ndFlrSF' and 'BedroomAbvGr' features are only weakly correlated with the sale price; in the case of '2ndFlrSF', this may be because the large number of instances with the value 0 affects the r coefficient value.
* For the quality group both 'KitchenQual' and 'OverallQual' are strongly correlated as predicted.  
* For the Age/Condition group, only 'YearRemodAdd' is as expected, with 'YearBuilt' being strongly correlated, and 'OverallCond' being weakly correlated with sale price.  
* For feature group 4, somewhat surprisingly the garage related features are all moderately correlated to sale price; also unexpected is the moderate correlation with sale price for the 'MasVnrArea' and 'OpenPorchSF' features. The remaining feature correlation strengths are as predicted.


### Phi_k correlation test

In [None]:

import phik
from phik.phik import phik_matrix
from phik.report import plot_correlation_matrix

phik_matrix_df = pd.DataFrame(phik.phik_matrix(house_prices_df)['SalePrice'].sort_values())
matrix_plot = plot_correlation_matrix(phik_matrix_df.values, x_labels=phik_matrix_df.columns, y_labels=phik_matrix_df.index, figsize=(10,10), vmin=0, vmax=1,
                                      y_label='Feature', title=r'SalePrice-Feature pair $\phi_k$ correlations')
matrix_plot

Saving the phi_k matrix

In [None]:
try:
    phik_matrix_df.to_csv(os.path.join(path, 'phik_matrix_df.csv'), index=True)
except Exception as e:
    print(e)

**Significance**:

In [None]:
from phik.significance import significance_matrix
significance_matrix_df = significance_matrix(house_prices_df)

Saving the significance matrix

In [None]:
try:
    significance_matrix_df.to_csv(os.path.join(path, 'phik_significance_matrix_df.csv'), index=True)
except Exception as e:
    print(e)

In [None]:

# SalePrice column of the significance matrix, sorted once and reused for values and labels
sale_price_significance_df = pd.DataFrame(significance_matrix_df['SalePrice']).sort_values(by='SalePrice')
plot_correlation_matrix(sale_price_significance_df.values,
                        x_labels=sale_price_significance_df.columns,
                        y_labels=sale_price_significance_df.index, y_label='Feature',
                        title=r'Significance of $\phi_k$ coefficients', vmin=0, vmax=5)

The significance values for all SalePrice-feature pairs suggest the phi_k coefficients are statistically significant: nearly all features have significance values greater than 5 standard deviations, and the remaining features have values greater than 3 standard deviations.

With regard to the coefficients themselves, features that have moderate-to-strong Spearman correlations generally exhibit a dependence on sale price of similar strength (with some variation), as would be expected. However the '2ndFlrSF' feature exhibits a strong dependence on the sale price, despite having a weak Spearman (r) value. Additionally 'MasVnrArea' has a stronger dependence on sale price than might be expected from its r-value. Both cases could be a consequence of non-monotonic relationships or the impact of outliers, which an examination of the scatter plots may reveal.
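
To illustrate how a monotonic measure can miss a real dependence, the sketch below uses hypothetical U-shaped toy data in which y is fully determined by x, yet the Spearman coefficient is zero because the trend reverses halfway. It deliberately does not compute phi_k; it only shows the kind of relationship that Spearman cannot capture.

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical): a symmetric U-shape centred on x = 0
x = np.arange(-5, 6)
y = x ** 2

rho, _ = stats.spearmanr(x, y)
print(f'Spearman r for the U-shaped toy data: {abs(rho):.2f}')
```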

### Scatter plots

In [None]:
# creating copy with ordinal encoding reversed
house_prices_df_copy = house_prices_df.copy(deep=True)
house_prices_df_copy[house_prices_categorical_df.columns] = encoder.inverse_transform(X=house_prices_df[house_prices_categorical_df.columns])

Saving this unencoded dataframe for use on the dashboard app

In [None]:
try:
    house_prices_df_copy.to_csv(os.path.join(path, 'scatterplot_data_df.csv'), index=True)
except Exception as e:
    print(e)

In [None]:
column_names = house_prices_df.columns.tolist()[0:-1]
# partitioning the feature names into groups of up to 4, one figure per group
partitioned_names = [column_names[start:start + 4] for start in range(0, len(column_names), 4)]

for group in partitioned_names:
    fig, axes = plt.subplots(ncols=len(group), nrows=1, figsize=(20,5), tight_layout=True)
    for feature in group:
        order = []
        if feature == 'BsmtExposure':
            order = bsmt_exposure_cat
        elif feature == 'BsmtFinType1':
            order = bsmt_fin_type1_cat
        elif feature == 'GarageFinish':
            order = garage_finish_cat
        elif feature == 'KitchenQual':
            order = kitchen_quality_cat

        if len(order) == 0:
            sns.scatterplot(data=house_prices_df_copy[[feature, 'SalePrice']], x=feature, y='SalePrice', ax=axes[group.index(feature)])
        else:
            sns.stripplot(data=house_prices_df_copy[[feature, 'SalePrice']], x=feature, y='SalePrice', ax=axes[group.index(feature)], order=order)
        

* The scatter-type plots largely agree with the calculated spearman coefficients.  

* The r-values suggesting strong correlations are supported for the corresponding feature plots.  

* Likewise for the moderate r-value features, matching trends can be observed, but there is a greater degree of clustering about certain feature values where the variation in sale price is greater, and this disrupts any potential monotonic trend.  

* The same is true for the weak r-value features, where there is less of a clear monotonic trend and instead either static clustering, peaks that fall off, or no real pattern at all. For example, 'OverallCond' shows a somewhat positive monotonic trend that is counteracted by a cluster of values at score 5 displaying a greater variation in sale price. A further example is '2ndFlrSF', where a moderate monotonic trend would likely exist if the zero values were ignored. A final illustrative case is 'BedroomAbvGr', where on average the sale price appears to increase with the number of bedrooms, peaking at 5, before dropping off; this may be because the few values >5 are outliers.

* The phi_k coefficients are also largely supported by the plots. As these coefficients capture non-monotonic relationships as well, they would be expected to suggest a stronger relationship for the features where there are peaks and/or troughs in the plot. The large difference in magnitude between the Spearman and phi_k coefficients for 'OverallCond' can be understood from its aforementioned plot, where the cluster of values with score 5 displays greater sale price variation than the other scores. Similarly for '2ndFlrSF', the very large difference in coefficients can be understood from the cluster with value zero having a greater degree of sale price variation.
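
The effect of a large cluster of zeros on the Spearman coefficient, as suggested for '2ndFlrSF', can be sketched with toy data. The values and column names below are hypothetical and are not taken from the dataset.

```python
import pandas as pd
from scipy import stats

# Toy data (hypothetical values): half the houses have no second floor,
# and their sale prices vary widely
toy_df = pd.DataFrame({
    'second_floor_sf': [0, 0, 0, 0, 0, 0, 400, 500, 600, 700, 800, 900],
    'sale_price': [90, 250, 120, 310, 150, 280, 140, 160, 190, 220, 260, 300],
})

rho_all, _ = stats.spearmanr(toy_df['second_floor_sf'], toy_df['sale_price'])

# Repeating the calculation with the zero rows excluded
non_zero_df = toy_df[toy_df['second_floor_sf'] > 0]
rho_non_zero, _ = stats.spearmanr(non_zero_df['second_floor_sf'], non_zero_df['sale_price'])

print(f'all rows: {rho_all:.2f}, non-zero rows only: {rho_non_zero:.2f}')
```

Applying the same filter to the real data would be one way to check whether the weak r-value for '2ndFlrSF' is driven by the zero cluster.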


---

## Predictive Power Score (PPS)

In [None]:
import ppscore as pps
pps_df = pps.predictors(df=house_prices_df, y='SalePrice', sample=1460, cross_validation=10)
pps_df

In [None]:
pps_heatmap = sns.heatmap(pps_df.pivot(index='x', columns=['y'], values=['ppscore']).sort_values(by=('ppscore', 'SalePrice'), ascending=False), annot=True,
                                              vmax=1, vmin=0, xticklabels=['SalePrice'], linecolor='white', linewidth=0.05)
pps_heatmap.set(xlabel='', ylabel='Feature', title='Feature PPS for target SalePrice')
pps_heatmap

* None of the PPS values are above 0.5; however, this is not surprising since no single feature in this dataset (which has many features) would be expected to account for all variation in the target.
Instead a combination of the most significant features is likely to predict the target more effectively. With this in mind, a PPS similar to that of the
highest-scoring feature, where that feature also demonstrates a strong correlation, can act as a benchmark for a 'good' score.

* Thus the feature 'OverallQual', with a PPS of 0.44 and a strong correlation coefficient, can act as the benchmark.

* The ranking of the features' PPS values (grouped by strength) roughly agrees with the ranking of their correlation coefficients, albeit with small differences.

* However, the ratio of the 'GrLivArea' PPS to the 'OverallQual' PPS is larger than the respective ratios for the correlation coefficients. Also, features such as 'EnclosedPorchSF' and 'WoodDeckSF' have stronger PPS values than features with far stronger correlations, which is unexpected.

* At the same time, features such as '1stFlrSF' and 'TotalBsmtSF', which have at least moderate correlation coefficients, have PPS values of zero or close to zero, which again is unexpected.

* The overall low scores, and the unexpected scores for various features, are probably more a reflection of the model used to predict the target and its poor performance
relative to the naive model, as well as of the aforementioned point that the predictive power of a group of features will be far better than that of any individual feature.
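
The comparison with the naive model can be sketched as follows. This is a simplified illustration of the idea described in the ppscore documentation for numeric targets (a decision tree's mean absolute error compared with that of always predicting the median), not the library's implementation. The helper name `simple_pps`, the fold count and the toy data are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def simple_pps(x: np.ndarray, y: np.ndarray, folds: int = 5) -> float:
    """Simplified PPS-style score for a numeric target (hypothetical helper, not the ppscore code)."""
    model = DecisionTreeRegressor(random_state=0)
    predictions = cross_val_predict(model, x.reshape(-1, 1), y, cv=folds)
    mae_model = mean_absolute_error(y, predictions)
    # Naive baseline: always predict the median of the target
    mae_naive = mean_absolute_error(y, np.full_like(y, np.median(y)))
    # A model that does no better than the baseline scores 0
    return max(0.0, 1.0 - mae_model / mae_naive)


rng = np.random.default_rng(0)
informative = rng.uniform(0, 10, size=200)
noise = rng.uniform(0, 10, size=200)
target = 3 * informative + rng.normal(0, 1, size=200)

pps_informative = simple_pps(informative, target)
pps_noise = simple_pps(noise, target)
print(f'informative feature: {pps_informative:.2f}')
print(f'unrelated feature:   {pps_noise:.2f}')
```

Because the score is floored at zero, a feature with a genuine but modest monotonic correlation can still score zero if a single-feature tree cannot beat the median baseline, which is consistent with the behaviour noted for '1stFlrSF' and 'TotalBsmtSF'.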

---

## Conclusions

**Hypotheses**:  

* The common alternative hypothesis (positive monotonic correlation) for all features can be accepted, except for the 'OverallCond' and 'EnclosedPorch' features.

**With regard to the strength of any relationship to the sale price, as implied by the correlation tests**:  

* The quality feature group features have a strong relationship to sale price.

* Of the size group features, all but the '2ndFlrSF', 'BedroomAbvGr', 'BsmtUnfSF', 'BsmtFinSF1' have at least moderate Spearman correlation; whilst all but 'LotArea', 'LotFrontage', 'BedroomAbvGr' have at least a moderate dependence on sale price.

* The garage related features have at least a moderate correlation/dependence with sale price.

* On the whole the enclosed porch related feature has a weak relationship to the sale price, whilst the open porch feature may have a moderate relationship.

* Age related features, and the 'MasVnrArea' feature have at least a moderate relationship to the sale price.

* The 'OverallCond', 'BedroomAbvGr' and 'WoodDeckSF' features have a weak relationship to the sale price.

**Scatter plots**:

* The scatter plots largely agree with the relationships implied by the correlation tests.

* They also to some extent explain why certain features appear to have a weak monotonic relationship, but at least a moderate dependence on the sale price: namely because of certain feature values having greater variations in sale price; likewise weak relationships are illustrated in the plots as clustering with less variation.

**PPS**:

* The PPS values are not particularly strong for any feature, nor always consistent with the strength of the relationships implied by the correlation coefficients, but this is most likely because
multiple features are necessary to predict the sale price.

* However, some of the most strongly correlated features also have the largest PPS values.

**Selection of the most significant features for business requirement 2 user story task EDA**:

Based on all metrics used, the following features have been deemed significant enough for further exploratory data analysis:
* '1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'GarageArea', 'GarageFinish', 'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd'.


In [None]:
import joblib

selected_significant_features = ['1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'GarageArea', 'GarageFinish',
                                 'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'LotArea', 'LotFrontage',
                                 'MasVnrArea', 'OpenPorchSF', 'OverallQual', 'TotalBsmtSF', 'YearBuilt',
                                 'YearRemodAdd']

path = os.path.join(os.getcwd(), 'outputs/selected_significant_features.pkl')

joblib.dump(selected_significant_features, path)
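
A later notebook or the dashboard could reload the pickled list with `joblib.load`. The sketch below performs a round trip in a temporary folder with a hypothetical two-item list, so that nothing under outputs/ is touched.

```python
import os
import tempfile

import joblib

# Hypothetical round trip in a temporary folder
example_features = ['GrLivArea', 'OverallQual']
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'selected_significant_features.pkl')
    joblib.dump(example_features, example_path)
    reloaded_features = joblib.load(example_path)

print(reloaded_features)  # ['GrLivArea', 'OverallQual']
```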

