# **EDA of the proposed significant features for predicting sale price**

## Objectives

**Perform Business requirement 2 user story task: EDA**
* Analyse the most significant features in predicting the target, as selected during the sale price correlation study: study feature distributions, assess normality, outliers, and check correlations between features.



## Inputs

* cleaned house price sale price correlation study dataset: outputs/datasets/sale_price_study/cleaned/house_prices.csv
* pickled selected significant features list from the correlation study notebook: outputs/ml/selected_significant_features.pkl
* cleaned and encoded (categorical variables) house price sale correlation study dataset: outputs/datasets/sale_price_study/cleaned/encoded_house_prices.csv

## Outputs
* Code that generates information and plots that aid data understanding, and inform how to process/clean/engineer the dataset in the corresponding ML task notebooks. Some of this information, or the code used to produce it, will be needed in those notebooks. It may also feature on the relevant dashboard page.



---

## Change working directory

Working directory changed to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load modified house prices dataset

In [None]:
import pandas as pd

house_prices_df = pd.read_csv(filepath_or_buffer='outputs/datasets/sale_price_study/cleaned/house_prices.csv')
house_prices_df.dtypes

---

## Feature distribution analysis

**Pandas profiling report**

Load list of selected significant features:

In [None]:
import joblib

selected_significant_features = joblib.load('outputs/ml/selected_significant_features.pkl')
selected_significant_features

In [None]:
import numpy as np
from pandas_profiling import ProfileReport

significant_feature_df = house_prices_df[selected_significant_features + ['SalePrice']]


In [None]:
feature_profiles = ProfileReport(significant_feature_df, title='Feature statistics', minimal=True)

In [None]:
feature_profiles.to_notebook_iframe()

### Normality tests for continuous numeric features

**Shapiro-Wilk test**:

In [None]:
import pingouin as pg

continuous_numeric_features = ['1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'GarageArea', 'GrLivArea',
                               'LotArea',
                               'LotFrontage',
                               'MasVnrArea',
                               'OpenPorchSF',
                               'TotalBsmtSF']
# alpha = 0.05
pg.normality(house_prices_df[continuous_numeric_features])
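
As a guide to reading the test output: `pg.normality` applies the Shapiro-Wilk test by default with alpha = 0.05, and a feature is flagged as non-normal when its p-value falls below alpha. The sketch below illustrates that decision rule with `scipy.stats.shapiro` on synthetic samples. The samples, their parameters and the `results` dictionary are illustrative assumptions, not values from the house prices dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Illustrative synthetic samples only (not the house prices data).
samples = {
    'roughly_normal': rng.normal(loc=1100, scale=350, size=500),
    'right_skewed': rng.lognormal(mean=7, sigma=0.8, size=500),
}

alpha = 0.05
results = {}
for name, sample in samples.items():
    w_stat, p_value = stats.shapiro(sample)
    # The null hypothesis of normality is rejected when p < alpha.
    results[name] = {'W': w_stat, 'pval': p_value, 'normal': p_value > alpha}
    print(name, results[name])
```

With large samples the test is sensitive to small departures from normality, so it is best read alongside the QQ plots.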

**QQ plots**:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def create_plot_array(df, kind, features):
    """
    Creates rows of up to 3 box or QQ plots (one figure per row), for a subset of features in a provided dataframe.

    Args:
        df: dataframe.
        kind: either 'box' or 'qqplot'.
        features: valid subset of the columns in df.
    """
    if kind not in ('box', 'qqplot'):
        print("kind must be one of ['box', 'qqplot']")
        return
    # one figure per group of up to three features
    for start in range(0, len(features), 3):
        row_features = features[start:start + 3]
        # squeeze=False keeps axes 2D, even when only one plot is left
        fig, axes = plt.subplots(ncols=len(row_features), figsize=(5*len(row_features), 5),
                                 tight_layout=True, squeeze=False)
        for ax, feature in zip(axes[0], row_features):
            if kind == 'qqplot':
                pg.qqplot(x=df[feature], ax=ax)
            else:
                sns.boxplot(x=df[feature], ax=ax)
            ax.set(title=feature)

In [None]:
create_plot_array(df=house_prices_df, kind='qqplot', features=continuous_numeric_features)

### Outlier assessment for numeric features: Box plots

In [None]:
create_plot_array(df=house_prices_df, kind='box', features=continuous_numeric_features)

---

## Discussion
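
The comments in this section quote summary statistics (mean, median, IQR, CV, skew, kurtosis, percentage of zero data) read from the profile report. As a minimal sketch, assuming only pandas, the same quantities could be computed directly as shown here. The function name `summarise_feature` and the toy sample are hypothetical and used for illustration only.

```python
import pandas as pd

def summarise_feature(sample):
    """Summary statistics of the kind quoted in the comments, for one numeric Series."""
    q05, q1, q2, q3, q95 = sample.quantile([0.05, 0.25, 0.5, 0.75, 0.95])
    return pd.Series({
        'mean': sample.mean(),
        'median': q2,
        'IQR': q3 - q1,
        'CV': sample.std() / sample.mean(),   # coefficient of variation (dimensionless)
        'skew': sample.skew(),
        'kurtosis': sample.kurt(),            # excess kurtosis: 0 for a normal distribution
        'zero_share': (sample == 0).mean(),   # fraction of zero values
        '5%': q05,
        '95%': q95,
    })

# Illustrative data only: a zero-inflated, right-skewed sample.
toy_sample = pd.Series([0, 0, 0, 0, 100, 120, 150, 200, 400, 1030], dtype=float)
print(summarise_feature(toy_sample).round(2))

# Statistics of the non-zero part only, as considered for features such as '2ndFlrSF'.
print(summarise_feature(toy_sample[toy_sample > 0]).round(2))
```

Comparing the two printouts shows how strongly a block of zero values can shift the mean, median and dispersion measures.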

**Comments by variable**:

**1stFlrSF**:

General:

* The distribution has an extended central region with a fairly sharp drop-off, particularly for smaller values: mean/median ~ 1100SF, 50% 882-1400SF, 90% 670-1800SF; very large positive kurtosis.
* The very large range is a consequence of possible outliers/extreme values.
* Consequence of the extended central region is a moderate amount of dispersion about the mean: 33% CV.
* There is a broad peak close to the median/mean.

Normality:

* It has a moderate positive skew, with likely outliers contributing to this: Q3 - median > median - Q1, and max - Q3 >> Q1 - min.
* A very large positive kurtosis as a result of the extended central region and the presence of possible outliers.
* The moderate skew and large kurtosis are evidenced by the histogram plot.
* The QQ plot indicates positive skew.
* The Shapiro-Wilk test indicates the distribution is not normal.

Outliers:

* The box plot suggests multiple outliers (IQR method) outside the main central region. It also illustrates the positive skew.

**2ndFlrSF**:

General:

* Most houses do not have a 2nd floor: >50%.
* The distribution statistics are heavily influenced by these zero data.
* Even if you ignore the zero data, there is significant dispersion in the rest of the distribution, with an extremely broad peak.

Normality:

* With the large number of zero data the distribution is clearly not normal. However even in the absence of these zero data the distribution would be flatter/broader than the normal distribution when looking at the histogram.
* Unsurprisingly the QQ plot shows a significant deviation from normality at the left tail, and the Shapiro-Wilk test indicates non-normality.

Outliers:

* It's difficult to say whether the indicated points on the box plot are actually outliers, or if there are actually more outliers, because of how it is skewed due to the zero data.




**BsmtFinSF1**:

General:

* Similar to '2ndFlrSF', a large portion of zero data: 32%. This again skews the statistics.
* Distribution is quite dispersed: large kurtosis. This is likely impacted by the 32% zero data shifting the mean/median, but also likely because of an extended tail at higher values.
* Large bulk of the data is within two SDs of the mean/median, and the most frequent bins are adjacent to the zero data.

Normality:

* Small positive skew value, also seen in the histogram and box plot.
* Distribution is not visibly normal from the histogram.
* The QQ plot indicates a positive skew, and shows the deviation from normality caused by the very high frequency of the zero value.
* The Shapiro-Wilk test indicates the distribution is not normal.

Outliers:

* The box plot indicates a few potential outliers, one in particular is clearly more extreme. The other possible outliers are not far from the upper whisker limit, and may not be outliers if the portion of zero data in the sample were slightly less.

**GarageArea**:

General:

* Moderate central peak/region: 90% of data between 0-850 ~ ± 2 SD region; IQR=241.5 ~ 1 SD; SD < 0.5*mean (CV < 0.5).
* Many low-frequency values at the higher end. Isolated portion of zero values (5.5%).

Normality:

* Small positive skew value. Histogram shows a longer tail to the right, but left of centre there are more values in a few bins, and also there is the large zero portion that collectively counteract any large positive skew. The box plot also shows that there is more data between Q1 and Q2 than Q2 and Q3, but there are values at more extreme higher values.
* Small-to-moderate kurtosis value.
* The QQ plot indicates more dispersion than the normal distribution, despite reasonable agreement in the central region.
* The Shapiro-Wilk test indicates the distribution is not normal.
* Relative to other features, it approximates a normal distribution to a larger degree.

Outliers:

* The box plot indicates three possible small groups of outliers that differ in their extremity.


**GrLivArea**:

General:

* Somewhat uniform broad central region: Q2 - Q1 ~ Q3 - Q2; IQR=647; CV=0.3.
* Long right tail of lower frequency values: 5% of data between 2466 - 5642 vs Q3 ~ 1776.

Normality:

* Moderate positive skew value, supported by the histogram.
* The QQ plot indicates noticeable positive skew.
* Very large kurtosis value, expected from the histogram.
* The Shapiro-Wilk test indicates the distribution is not normal.

Outliers:

* The box plot indicates multiple outliers.

**LotArea**:

General:

* Narrow central region (one bin wide) containing the largest share of the data (~40%). Also IQR=4048 ~ 0.5 SD, CV=0.9. Also supported by the box plot.
* Extended low frequency right tail.
* Very large range.

Normality:

* Median is similar to mean indicating some symmetry/normality.
* Very large positive skew and kurtosis values.
* The QQ plot indicates large positive skew and large dispersion, relative to a normal distribution.
* The Shapiro-Wilk test indicates the distribution is not normal.


Outliers:

* From the box plot there appear to be numerous outliers, a few much more extreme than the rest.

**LotFrontage**:

General:

* Fairly symmetric distribution centrally (mean ~ median), albeit a moderately long right tail.
* Range is dictated by a single very extreme value.
* Fairly compact distribution: 90% between 35 - 107, IQR=21.6, CV ~ 0.3.

Normality:
* Very large positive kurtosis value. Moderate positive skew value, also evident in the QQ plot.
* The Shapiro-Wilk test indicates the distribution is not normal.

Outliers:

* From the box plot there appear to be numerous outliers, one far more extreme than the rest.

**MasVnrArea**:

General:

* 59% zero data. Dominates stats.
* Still appears to be large dispersion, and extended right tail.

Normality:

* Very large positive skew and kurtosis values.
* The QQ plot suggests positive skew.
* The Shapiro-Wilk test indicates the distribution is not normal.

Outliers:

* From the box plot there appear to be numerous outliers.

**OpenPorchSF**:

General:

* Very similar to the 'MasVnrArea' distribution.
* 45% zero data.

Normality:
* The histogram does not look normal.
* Large positive skew value. Very large positive kurtosis value. The box plot supports this: it clearly shows Q3 - Q2 > Q2 - Q1.
* The QQ plot is similar to that for the 'MasVnrArea': positive skew indicated.
* The Shapiro-Wilk test indicates the distribution is not normal.

Outliers:

* From the box plot there appear to be numerous outliers, three of which are very extreme.

**TotalBsmtSF**:

General:

* Moderate width central region.
* Moderately long right tail.
* Moderate dispersion: CV=0.4.

Normality:

* Moderate positive skew, large positive kurtosis.
* The QQ plot suggests positive skew, and more dispersion relative to a normal distribution.
* The Shapiro-Wilk test indicates the distribution is not normal.

Outliers:

* The box plot shows some outliers, one in particular stands out with regard to how far above the upper limit whisker it is.

**GarageFinish**:

* Fairly even distribution between finished and some sort of unfinished rating.
* The 'None' value (73) may be inconsistent with the number of zeros in the GarageArea (81). This may be a by-product of previous missing value imputation.
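
The possible mismatch between 'None' garage finishes and zero garage areas could be checked by cross-tabulating the two conditions. The following is a minimal sketch on illustrative rows: the category labels and the toy values are assumptions, and only the column names and the 'None' label come from the comments above.

```python
import pandas as pd

# Illustrative rows only; the category labels other than 'None' are assumed.
toy_df = pd.DataFrame({
    'GarageFinish': ['Unf', 'None', 'Fin', 'RFn', 'None', 'Unf'],
    'GarageArea': [480, 0, 576, 0, 0, 0],
})

no_garage_by_finish = toy_df['GarageFinish'] == 'None'
no_garage_by_area = toy_df['GarageArea'] == 0

# Cross-tabulate the two indicators; off-diagonal cells count inconsistent rows.
print(pd.crosstab(no_garage_by_finish, no_garage_by_area,
                  rownames=['GarageFinish is None'], colnames=['GarageArea is 0']))

# Rows where the two indicators disagree.
inconsistent = toy_df[no_garage_by_finish != no_garage_by_area]
print(inconsistent)
```

Listing the inconsistent rows would show whether they trace back to imputed values.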

**GarageYrBlt** and **YearBuilt**:

* Of course these features are related as most garages are built with the house. This is reflected in the similarity of the distributions.
* Each distribution has its highest frequency after the year 2000. The frequency is lowest around 1900, increases smoothly up until 1965, decreases until 1975, and then increases again.
* This is reflected in the medians of 1973/1978, indicating that as many houses/garages were built after 1973/1978 as before, despite the shorter period.
* There is also a slight negative skew and slight negative kurtosis as a result of the trend in building rates.
* In the context of the dataset, there is a fairly large range.

**YearRemodAdd**:

* Again you would expect some overlap with the feature 'YearBuilt', since houses that have not been remodeled nominally have a value equal to their year built. This is reflected in the distribution to some extent.
* That said, the most common remodel year is around 1950; the count decreases afterwards and is somewhat uniform up until the year 2000, where it increases again.
* Again there is a slight increase in average rate post 2000: median 1994 (of course there are more houses over time that could be remodeled).
* Consequently the distribution has a long tail and negative skew.
* There is a risk that the nominal values assigned to houses that have not been remodeled could distort the distribution.
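
One way to gauge that risk is to separate the nominal values from genuine remodel years. This is a minimal sketch on illustrative rows, under the assumption that a house counts as never remodeled when 'YearRemodAdd' equals 'YearBuilt'; the toy values are not taken from the dataset.

```python
import pandas as pd

# Illustrative rows only, not taken from the dataset.
toy_df = pd.DataFrame({
    'YearBuilt': [1950, 1965, 2003, 1920, 1978],
    'YearRemodAdd': [1950, 1998, 2003, 1950, 1978],
})

# Assumption: equal years mean the remodel year is only a nominal value.
never_remodeled = toy_df['YearRemodAdd'] == toy_df['YearBuilt']
nominal_share = never_remodeled.mean()
print('Share with a nominal remodel year:', nominal_share)

# Distribution of the genuine remodel years only.
genuine_remodel_years = toy_df.loc[~never_remodeled, 'YearRemodAdd']
print(genuine_remodel_years.describe())
```

A large nominal share would suggest the 'YearRemodAdd' distribution mostly mirrors 'YearBuilt'.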

**KitchenQual**:

* Vast majority of instances have a rating of typical or good (90%).

**OverallQual**:

* Broad peak falling off quickly and symmetrically: 90% between below-average and very good; median/mean: above average; IQR=2 and so 50% between average and good.

---

### Multivariate outliers

Each feature's outliers have already been assessed with respect to its own distribution. However, an outlier cannot be removed for one feature in isolation: doing so removes the entire instance, and with it the component values, which may be valid, from the other feature distributions. It must therefore be considered whether the same instances are outliers in the whole dataset, as vectors.

There is a trade-off between removing a common outlier from a subset of features (improving their distributions) and simultaneously altering the distributions of other features in which the instance's component is not an outlier.

As such, the choice is made to remove only instances that are outliers in more than 50% of the features; arguably these are also likely to be the most extreme points, due to the moderate degree of correlation between features.
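
The selection rule can be illustrated with a vectorised sketch on toy data. It flags values outside the 1.5*IQR whiskers per feature, counts the flags per instance, and keeps the instances flagged in more than half of the features. The function name `iqr_outlier_mask` and the toy dataframe are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(df, k=1.5):
    """Boolean dataframe: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR] of its column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return (df < q1 - k * iqr) | (df > q3 + k * iqr)

# Illustrative data only: row 9 is extreme in all three features, row 8 in one.
toy_df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    'b': [10, 11, 12, 13, 14, 15, 16, 17, 90, 95],
    'c': [5, 5, 6, 6, 7, 7, 8, 8, 9, 60],
})

mask = iqr_outlier_mask(toy_df)
# Number of features in which each instance is an outlier.
outlier_counts = mask.sum(axis=1)
# Instances that are outliers in more than 50% of the features.
common = outlier_counts[outlier_counts > toy_df.shape[1] / 2]
print(common)

# Dropping only the common outliers leaves row 8 in place.
cleaned_df = toy_df.drop(index=common.index)
print(cleaned_df.shape)
```

Row 8 is an outlier for one feature only, so it is retained and its valid components in the other features are preserved.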

**Create functions to assess commonality of outliers**

In [None]:
def locate_single_feature_outliers(feature, df):
    """
    Locates outliers for a feature in a dataframe (containing only numeric features) using the IQR method.

    Args:
        feature (str): the feature name.
        df: dataframe containing only numeric features.

    Returns a list of indices corresponding to the dataframe indices of the outliers.
    """
    sample = df[feature]
    Q1 = sample.quantile(q=0.25)
    Q3 = sample.quantile(q=0.75)
    IQR = Q3 - Q1
    # an outlier lies more than 1.5*IQR above Q3 or below Q1
    is_outlier = (sample > Q3 + 1.5*IQR) | (sample < Q1 - 1.5*IQR)
    return sample[is_outlier].index.tolist()

In [None]:
def locate_all_feature_outliers(df):
    """
    Amalgamates into a single list the dataframe indices of the outliers of every feature in a dataframe
    (containing only numeric features).

    Args:
        df: dataframe containing numeric features.

    Returns a list of three items: a series whose index holds the indices of the outlying instances and whose
    values give the number of features in which each instance is an outlier; the value_counts series of
    that series; and the number of features in the dataframe (int).
    """
    outlier_indices = []
    for col in df.columns:
        found_outliers = locate_single_feature_outliers(col, df)
        outlier_indices.extend(found_outliers)
    # count how often each index appears, i.e. in how many features the instance is an outlier
    index_count = np.unique(np.array(outlier_indices), return_counts=True)
    index_count_series = pd.Series(data=index_count[1], index=index_count[0]).sort_values(ascending=False)
    return [index_count_series, index_count_series.value_counts().sort_values(), df.columns.size]

**Use functions to find common outliers**

In [None]:
outlier_series, outlier_series_unique_count, total_feature_num = locate_all_feature_outliers(house_prices_df[continuous_numeric_features])
print('Total number of features:', total_feature_num)
print(outlier_series_unique_count)


**Select instances that are outliers in more than 50% of features**

In [None]:
print('\n', 'Instances whose component values correspond to potential outliers in more than 50% of continuous numeric features:')
# threshold derived from the number of features, rather than a hard-coded count
house_prices_df[continuous_numeric_features].loc[outlier_series[outlier_series > total_feature_num / 2].index.tolist()]

From looking at the location of the values for each instance in the respective box plot, it can be seen that:

* The instance with index 1298 is the most extreme, or close to the most extreme, value for 5 of the features, indicating something very unusual about it.
* The instance with index 523 is the 2nd most extreme value for 4 of the features.
* The instance with index 1182 is among the 3 most extreme values for 3 of the features.
* The instance with index 691 is the most extreme value for one feature, and among the 4 most extreme for 2 features.

This supports, to some extent, the earlier suggestion that the most extreme outliers for one feature are likely to be at least close to the most extreme values for other features.
These instances may well be removed in cleaning.

---

### Feature - Feature pair correlations

Loading dataset with encoded categorical features, generated during the sale price correlation study.

In [None]:
encoded_house_prices_df = pd.read_csv(filepath_or_buffer='outputs/datasets/sale_price_study/cleaned/encoded_house_prices.csv')


In [None]:
encoded_significant_feature_df = encoded_house_prices_df[selected_significant_features]

**Spearman**:

In [None]:
spearman_df = encoded_significant_feature_df.pairwise_corr(method='spearman')

Only strong correlations (r > 0.8) are of interest:

In [None]:
spearman_df[spearman_df['r'] > 0.8]

**phi_k**:

In [None]:
import phik

phik_df = phik.phik_matrix(encoded_significant_feature_df)
# unpivot
phik_df = phik_df.melt(value_vars=phik_df.columns, ignore_index=False).reset_index().rename(columns={'index': 'X', 'variable': 'Y'})
# remove diagonals and keep only one of each symmetric pair (X, Y) / (Y, X);
# comparing names avoids dropping distinct pairs that happen to share a value
phik_df = phik_df[phik_df['X'] < phik_df['Y']]
# filter
phik_df[phik_df['value'] > 0.8]

* Both phi_k and spearman agree that strong relationships exist for the pairs (1stFlrSF, TotalBsmtSF), (GarageYrBlt, YearBuilt).

* phi_k also suggests a strong relationship between (YearRemodAdd, GarageYrBlt), (YearRemodAdd, YearBuilt), (BsmtFinSF1, 1stFlrSF), (GrLivArea, 2ndFlrSF) and (GrLivArea, BsmtFinSF1).

* This all makes sense, as the basement, ground and second floor areas should be related since each level covers at least part of the surface area of the level below. The year built, remodeled, and garage year built, are again related as was explained earlier when discussing their distributions.

* The strong relationships between each pair may lead to redundancy in the ML model, and so it may be wise to drop one of each pair.
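
Listing each strongly related pair once could also be sketched with pandas alone, by masking the correlation matrix to its strict upper triangle. This is an illustration on toy data, not the method used in this notebook. The function name `strong_pairs` and the toy columns are hypothetical, and the absolute value is used so that strong negative correlations would be reported too.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.8, method='spearman'):
    """Series of correlation coefficients, one per feature pair, with |r| above the threshold."""
    corr = df.corr(method=method)
    # Strict upper triangle: excludes the diagonal and keeps each pair once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs.abs() > threshold]

# Illustrative data only: 'y' is a monotonic function of 'x', 'z' is unrelated.
toy_df = pd.DataFrame({
    'x': np.arange(10),
    'y': np.arange(10) ** 3,
    'z': [5, 1, 8, 0, 9, 2, 7, 3, 6, 4],
})

result = strong_pairs(toy_df)
print(result)
```

One feature from each reported pair is then a candidate for dropping.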

---

## Conclusions

* Established that none of the continuous numeric significant features are normally distributed according to the Shapiro-Wilk test, although a few approximate a normal distribution.
* Identified instances whose vector components are outliers in greater than 50% of the continuous numeric features. These instances may be dropped.
* Identified significantly related (correlation coefficients > 0.8) feature-feature pairs. To reduce redundancy in the ML model a feature of each pair may need to be dropped from the dataset.