Best feature selection for the subset regression (using visualizations)

In [120]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from scipy.interpolate import griddata



In [121]:
df = pd.read_csv("DecisionTreeRegressor_subset_picu_laura.csv")

In [122]:
df.columns

Index(['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional',
       'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'SalePrice'],
      dtype='object')

In [123]:
df.dtypes

BsmtFullBath      int64
BsmtHalfBath      int64
FullBath          int64
HalfBath          int64
BedroomAbvGr      int64
KitchenAbvGr      int64
KitchenQual      object
TotRmsAbvGrd      int64
Functional       object
Fireplaces        int64
FireplaceQu      object
GarageType       object
GarageYrBlt     float64
GarageFinish     object
GarageCars        int64
GarageArea        int64
GarageQual       object
GarageCond       object
PavedDrive       object
SalePrice         int64
dtype: object

Heatmap for all numerical variables: 

In [124]:
all_numerical_cols = df.select_dtypes(include=['number'])


In [125]:
corr = all_numerical_cols.corr()


In [126]:
fig = px.imshow(
    corr,
    text_auto=".2f",      
    color_continuous_scale='mint',  
    aspect="auto",
    title="All Numerical Features - Correlation Heatmap"
)

fig.show()


Based on the initial interactive heatmap, SalePrice correlates with the other numerical features to varying degrees. Even the strongest positive correlations are only moderate:

SalePrice - FullBath: 0.56
SalePrice - GarageArea: 0.62
SalePrice - GarageCars: 0.64

I also looked for significant negative correlations, but the only ones present are too weak to change the result much, e.g. KitchenAbvGr at -0.14.
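
Reading the strongest correlations off the heatmap can also be done programmatically. The sketch below ranks every numerical feature by the absolute value of its Pearson correlation with the target. The helper name `rank_correlations` and the small `demo` frame are illustrative assumptions (made-up values, not the housing data).

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str = "SalePrice") -> pd.DataFrame:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Sort by magnitude so strong negative correlations are not overlooked.
    order = corr.abs().sort_values(ascending=False).index
    return pd.DataFrame({"corr": corr.loc[order], "abs_corr": corr.abs().loc[order]})


# Hypothetical toy data, only to show the shape of the result.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [5, 3, 4, 2, 1],
    "c": [1, 0, 1, 0, 1],
    "SalePrice": [10, 20, 30, 40, 50],
})
ranking = rank_correlations(demo)
print(ranking)
```

In the notebook, `rank_correlations(df)` would list GarageCars, GarageArea and FullBath near the top if the heatmap values above are right.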



In [127]:


x_feature, y_feature, z_feature = 'FullBath', 'GarageArea', 'SalePrice'

x = df[x_feature]
y = df[y_feature]
z = df[z_feature]

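# Build a regular 50 x 50 grid spanning the observed range of both features.
xi = np.linspace(x.min(), x.max(), 50)
yi = np.linspace(y.min(), y.max(), 50)
X, Y = np.meshgrid(xi, yi)

# Interpolate SalePrice onto the grid; grid points outside the convex hull of the data become NaN.
Z = griddata((x, y), z, (X, Y), method='linear')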

fig = go.Figure(go.Surface(x=X, y=Y, z=Z, colorscale='Peach'))
fig.update_layout(
    title=f'{z_feature} vs {x_feature} & {y_feature}',
    scene=dict(
        xaxis_title=x_feature,
        yaxis_title=y_feature,
        zaxis_title=z_feature
    ),
    width=700, height=700
)
fig.show()


In [128]:


x_feature, y_feature, z_feature = 'GarageCars', 'GarageArea', 'SalePrice'

x = df[x_feature]
y = df[y_feature]
z = df[z_feature]

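# Regular 50 x 50 grid over the observed range of both features.
xi = np.linspace(x.min(), x.max(), 50)
yi = np.linspace(y.min(), y.max(), 50)
X, Y = np.meshgrid(xi, yi)

# Linear interpolation of SalePrice onto the grid (NaN outside the data's convex hull).
Z = griddata((x, y), z, (X, Y), method='linear')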

fig = go.Figure(go.Surface(x=X, y=Y, z=Z, colorscale='Mint'))
fig.update_layout(
    title=f'{z_feature} vs {x_feature} & {y_feature}',
    scene=dict(
        xaxis_title=x_feature,
        yaxis_title=y_feature,
        zaxis_title=z_feature
    ),
    width=700, height=700
)
fig.show()


In [129]:


x_feature, y_feature, z_feature = 'FullBath', 'GarageCars', 'SalePrice'

x = df[x_feature]
y = df[y_feature]
z = df[z_feature]

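# Regular 50 x 50 grid over the observed range of both features.
xi = np.linspace(x.min(), x.max(), 50)
yi = np.linspace(y.min(), y.max(), 50)
X, Y = np.meshgrid(xi, yi)

# Linear interpolation of SalePrice onto the grid (NaN outside the data's convex hull).
Z = griddata((x, y), z, (X, Y), method='linear')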

fig = go.Figure(go.Surface(x=X, y=Y, z=Z, colorscale='Pinkyl'))
fig.update_layout(
    title=f'{z_feature} vs {x_feature} & {y_feature}',
    scene=dict(
        xaxis_title=x_feature,
        yaxis_title=y_feature,
        zaxis_title=z_feature
    ),
    width=700, height=700
)
fig.show()


In [130]:
fig = px.density_contour(all_numerical_cols, x='GarageCars', y='SalePrice', marginal_x='histogram', marginal_y='histogram')
fig.show()

In [131]:
fig = px.density_contour(all_numerical_cols, x='GarageArea', y='SalePrice', marginal_x='histogram', marginal_y='histogram')
fig.show()

In [132]:
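# One facet per GarageCars value. Coloring by the continuous SalePrice column would create one trace per distinct price, so it is left out.
fig = px.density_contour(all_numerical_cols, x='GarageArea', y='SalePrice', facet_col='GarageCars')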

fig.show()

In [133]:
fig = px.density_contour(all_numerical_cols, x='FullBath', y='SalePrice', marginal_x='histogram', marginal_y='histogram')
fig.show()

In [134]:
df_cat_col = df.select_dtypes(include=['object'])


In [135]:
df_cat_col.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   KitchenQual   1460 non-null   object
 1   Functional    1460 non-null   object
 2   FireplaceQu   770 non-null    object
 3   GarageType    1379 non-null   object
 4   GarageFinish  1379 non-null   object
 5   GarageQual    1379 non-null   object
 6   GarageCond    1379 non-null   object
 7   PavedDrive    1460 non-null   object
dtypes: object(8)
memory usage: 91.4+ KB


In [136]:
for col in df_cat_col:
    fig = px.box(df, x=col, y='SalePrice', points='all', title=f'SalePrice by {col}')
    fig.show()
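
The boxplots can be complemented with a numeric summary per category. The `info()` output above shows missing values in FireplaceQu and the Garage* columns, and rows with a missing category may not appear in a boxplot. The sketch below gives them an explicit label before grouping. The helper name `summarize_category`, the label "Missing" and the toy data are assumptions for illustration.

```python
import pandas as pd


def summarize_category(frame, column, target="SalePrice", missing_label="Missing"):
    """Count and median of `target` per category, highest median first."""
    # Give missing categories an explicit label so they are not silently dropped.
    labels = frame[column].fillna(missing_label)
    return (
        frame.groupby(labels)[target]
        .agg(count="count", median="median")
        .sort_values("median", ascending=False)
    )


# Hypothetical toy data, not the housing data.
demo = pd.DataFrame({
    "FireplaceQu": ["Gd", None, "TA", None, "Gd"],
    "SalePrice": [200, 100, 150, 120, 220],
})
summary = summarize_category(demo, "FireplaceQu")
print(summary)
```

In the notebook, looping `summarize_category(df, col)` over `df_cat_col.columns` would give one table per boxplot.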


In [137]:
# plotly.express is already imported as px in the first cell.
fig = px.box(df, x='KitchenQual', y='SalePrice', points='all', title='SalePrice by Kitchen Quality')
fig.show()


Conclusions

Based on the visualization methods used, I found numerical features whose correlation with SalePrice is higher than or similar to that of the feature chosen before (GarageArea, 0.62): GarageCars (0.64) and FullBath (0.56). In addition, the second-to-last density contour, which facets GarageArea vs SalePrice by GarageCars, captures the relationship among these three variables.

In past assignments, I looked at GarageArea vs SalePrice and GarageYrBlt vs SalePrice and chose to focus on GarageArea vs SalePrice. The new results show that GarageCars would have been a slightly better feature than GarageArea (0.64 vs 0.62), and that FullBath, which I omitted at first, was also a good candidate.
In this assignment, I also looked at categorical features. The boxplots of SalePrice by category showed interesting relationships between SalePrice and KitchenQual, and between SalePrice and PavedDrive. It is worth exploring how these features behave as SalePrice rises.

In conclusion, based on past and new results, I should have given more weight to GarageCars vs SalePrice than to GarageArea vs SalePrice, and included some categorical features too (after one-hot encoding them). However, even though GarageCars correlates with SalePrice more strongly than GarageArea does, that does not mean the homoscedasticity and normality assumptions would hold any better for this feature.
