# Ames Housing Dataset - Intro and Data Preparation

> Gianmaria Pizzo - 872966@stud.unive.it

These notebooks represent the project submission for the course [Data and Web Mining](https://www.unive.it/data/course/337525) by Professor [Claudio Lucchese](https://www.unive.it/data/people/5590426) at [Ca' Foscari University of Venice](https://www.unive.it).


---

## Structure of this notebook

This notebook covers the following points
* Domain Research, Context of Data
* Imports and Globals
* Data loading 
* Preliminary Dataset Overview
    * Numerical Features
    * Categorical Features
* Correction of possible errors and coherence check 
* Some preliminary feature creation

---

### Before running this notebook

To avoid issues, before running the following notebook it is best to
* Clean previous cell outputs
* Restart the kernel

---

# Problem Statement 


## Goal - Regression on `Sale_Price`

The project's goal is to build two or more models for a prediction task where the target is the `Sale_Price` feature from the [Ames Housing Dataset](https://www.openml.org/search?type=data&status=active&id=43926&sort=runs).

## Problems - What do we know?
Facing a dataset for the first time can be challenging, as we lack some knowledge:
* We are focusing on a particular geographical area, which we are not familiar with;
* We are focusing on a market we have no prior domain-knowledge of;

Therefore, we do not know how the market is going to behave, nor what dictates a change in `Sale_Price`. While the behavior of the market is hard to track, we can learn about the area and about the price drivers through data exploration and domain research.

As we know, datasets rarely are perfect, so we must perform some magic!

Let us start from the basics: researching information

---

# Domain Research

Domain research should be the very first step of any project of this kind, as it provides general insight into how the domain behaves when it comes to data.

## Context of Data: Ames, Iowa (USA)

As the name of the repository suggests, we are looking at instances of houses located in *Ames, Iowa (USA)*.

From Wikipedia we know that: 

*Ames (/eɪmz/) is a city in Story County, Iowa, United States, located approximately 30 miles (48 km) north of Des Moines in central Iowa. It is best known as the home of Iowa State University (ISU), with leading agriculture, design, engineering, and veterinary medicine colleges. A United States Department of Energy national laboratory, Ames Laboratory, is located on the ISU campus. According to the 2020 census, Ames had a population of 66,427, making it the state's ninth largest city. Iowa State University was home to 27,854 students as of spring 2023, which make up approximately one half of the city's population* 

But this is not sufficient insight for our goal, as the description lacks information about the real estate context. This means we might need a more general rule, such as how house prices are assessed in the USA.

## House Pricing Method (USA) - Predicting `Sale_Price`

After extensive research, I found that the main driving factors for house prices are, generally speaking:
1. Neighborhood comps
2. Location
3. Home size and usable space
4. Age and condition
5. Upgrades and updates
6. The local market and economic change
7. Mortgage interest rate

The importance given to the first three points is underlined by the article [Cracking the Ames Housing Dataset with Linear Regression](https://towardsdatascience.com/wrangling-through-dataland-modeling-house-prices-in-ames-iowa-75b9b4086c96) where [Alvin T. Tan](https://at-tan.medium.com) states that "*[...] no two houses are exactly identical, and the basic idea of hedonic price modeling is that neighborhood-specific and unit-specific characteristics help determine house prices.*".

This is not far from the truth: most people value the neighborhood more than the quality of the house itself, as the latter can always be modified later.

However, under this scenario the dataset is not exhaustive. Some aspects, such as the local market and interest rates, are hard to guess, although adding features and transformations can reinforce the importance of some features or even expose hidden trends when possible.


## Ames Housing Dataset - What we know so far

As the referenced repository lacks a detailed description, I looked for a better-documented version, which you can find [here](https://www.openml.org/search?type=data&sort=runs&id=42165&status=active) thanks to [Thomas Schmitt](https://www.openml.org/search?type=user&id=3422&sort=date).

The dataset presents 2930 instances, representing buildings which were sold in Ames, Iowa. 

The dataset includes exactly 81 features:
* 35 `float64`
* 46 `object`

For each entry, these features include information about the structure of the house, its surroundings, access to the road, location, sale conditions, and more.

The presence of many object features brings some issues to the table:
* Representing quality scales and labels is hard and is usually achieved through heuristics.
* The encoding can get messy for regular categorical variables: the increase in the number of features can lead to overfitting and increased variance in our predictions.
* The encoding of categorical features is strictly related to the model we are trying to build. For example, feeding a one-hot encoding of all the categorical features to a tree-based model is a poor choice, as it would create a sparse decision tree.
* Miscellaneous features and their values might be hard to consider, as they only apply to a few instances, which are usually outliers.
* Longitude and Latitude should not be used to train models, as they can lead to overfitting, but they cannot be excluded as they are needed for edges creation.
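To give a feeling for how quickly one-hot encoding widens a dataset, here is a minimal sketch on a toy frame. The column names mirror the dataset, but the rows are made up for illustration, and `pd.get_dummies` is only one of several possible encoders.

```python
import pandas as pd

# Toy data: the values are invented, only the column names come from the dataset.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl"],
    "Alley": ["No_Alley_Access", "Paved", "Gravel"],
    "Lot_Area": [8450.0, 9600.0, 11250.0],
})

# Every distinct label of a non-numeric column becomes its own 0/1 column.
categorical = toy.select_dtypes(exclude="number")
expected_columns = int(categorical.nunique().sum())

encoded = pd.get_dummies(toy, columns=list(categorical.columns))

print("One-hot columns created:", expected_columns)
print("Shape before:", toy.shape, "after:", encoded.shape)
```

On the real dataset the same count, applied to the 46 object columns, gives an idea of how many features a naive one-hot encoding would produce.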

Nevertheless, there are still plenty of numeric columns, which happen to be some of the most important ones as the domain research suggests.

This is indeed a very complete dataset. In fact, it is presented without any `NaN` values, which is not the case for other versions of the same dataset. This could mean that the missing values were replaced with zeros, or that the information was never provided to begin with. The result is an ambiguous version of the dataset, and it is hard to tell a priori whether a feature should be kept.
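One way to spot columns where zeros may be standing in for missing values is to measure the share of zeros per numeric column. The helper below is a hypothetical sketch (the name `zero_share` and the toy values are assumptions, not part of the project); a high share does not prove the values are missing, it only flags columns worth inspecting.

```python
import pandas as pd

def zero_share(dataset):
    """Return the fraction of zero values of each numeric column, highest first."""
    numeric = dataset.select_dtypes(include="number")
    return (numeric == 0).mean().sort_values(ascending=False)

# Toy data with made-up values.
toy = pd.DataFrame({
    "Pool_Area": [0.0, 0.0, 0.0, 512.0],
    "Gr_Liv_Area": [1656.0, 896.0, 1329.0, 2110.0],
})

shares = zero_share(toy)
print(shares)
```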

Thanks to a notebook for a similar competition (see [Exploratory Data Analysis of Housing in Ames, Iowa](https://www.kaggle.com/code/leeclemmer/exploratory-data-analysis-of-housing-in-ames-iowa)), I found some more interesting information about the nature of the data:
* First of all, some of the features are missing, but they are not fundamental for this task.
* The data was gathered between 2006 and 2010, but there is no information on whether a house appears more than once. This is not relevant from a training point of view, as the state of the house dictates the price. It can be a problem for prediction, though: two instances of the same house (before and after remodeling) can have a very large gap in `Sale_Price` while their features are almost identical.
* Furthermore, there are some warnings about instances which are not residential houses and could distort the regression.
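Since the dataset has no house identifier, one rough heuristic for finding possible repeated sales is to look for rows sharing the same coordinates. This is only a sketch under that assumption: identical coordinates do not prove that two rows describe the same house, and the toy values below are made up.

```python
import pandas as pd

# Toy data with made-up values; the first two rows share the same coordinates.
toy = pd.DataFrame({
    "Latitude": [42.05, 42.05, 42.01],
    "Longitude": [-93.62, -93.62, -93.65],
    "Year_Sold": [2006.0, 2009.0, 2008.0],
    "Sale_Price": [120000.0, 175000.0, 140000.0],
})

# keep=False flags every row of a group, not only the later occurrences.
same_spot = toy.duplicated(subset=["Latitude", "Longitude"], keep=False)
candidates = toy[same_spot].sort_values(["Latitude", "Longitude", "Year_Sold"])

print(candidates)
```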

We can finally start looking at the data now.

---

# Environment, Imports and Global Variables

For this project to run properly, the following conditions must be satisfied:
* The required libraries must be installed
* The notebooks and dataset files must be in the same directory
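The first condition can be checked before running anything else. The snippet below is a small sketch: the package list is an assumption based on the imports used in this notebook, and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed list, taken from the import statements of this notebook.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "scipy", "folium"]

def missing_packages(names):
    """Return the names of the packages that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required libraries are available.")
```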

For this project I use both matplotlib and seaborn, as they offer a wide variety of data visualization techniques and plots.

In [1]:
# Interactive
%matplotlib notebook
# Static
# %matplotlib inline

# Environment for this notebook
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats  # explicit submodule import, needed for scipy.stats.skew
import warnings
import IPython

# Set the style for the plots
sns.set()
plt.style.use('ggplot')
sns.set_style("darkgrid")
# Ignore warnings
warnings.filterwarnings('ignore') 

In [2]:
# Working folder, where this file is at
# It should be in the root directory
WORKING_DIR = os.getcwd()

# Resources folder
# It should be a subpath of root directory
RESOURCES_DIR = os.path.join(WORKING_DIR, 'resources')

# Must be created if not exists
if not os.path.exists(RESOURCES_DIR):
    os.mkdir(RESOURCES_DIR)

In [3]:
# Utils Module

def sort_alphabetically(dataset, last_label = None):
    """
    Sorts the dataset alphabetically 

    :param dataset: a pd.DataFrame
    :param last_label: a str containing an existing column label in the dataset
    :returns: pd.DataFrame
    """
    # Sort
    dataset = dataset.reindex(sorted(dataset.columns), axis=1)
    # Move target column to last index
    if last_label is not None:
        col = dataset.pop(last_label)
        dataset.insert(dataset.shape[1], last_label, col)
    return dataset

def drop_if_exists(dataset, to_drop):
    for f in to_drop:
        if f in list(dataset.columns):
            dataset.drop(columns = f, inplace=True)
    pass

def get_cols(dataset, col_substring):
    """
    Returns the labels of the columns whose name contains the substring
    
    :param dataset: pandas.DataFrame
    :param col_substring: a str to look for in the column labels
    :returns: list of str
    """
    # Use the dataset passed as argument, not the global df
    return [col for col in dataset.columns.tolist() if col_substring in col]

def drop_rows_cond(dataset, condition, inplace = True):
    """
    Drops rows on boolean condition
    :param dataset: pandas.DataFrame
    :param condition: boolean
    :param inplace: boolean
    :returns: pandas.DataFrame if inplace = False, else None
    """
    rows_to_drop = dataset[condition].index.tolist()
    if inplace:
        dataset.drop(rows_to_drop, inplace=inplace)
        dataset.reset_index(drop=True, inplace=True)
    else:
        return dataset.drop(rows_to_drop)

In [4]:
import folium
from folium import Marker

# Folium Function to display Ames

def display_ames_houses(dataset, target='Sale_Price', labels=None, title='Ames, Iowa', map_zoom=None):
    """
    Plots an ad-hoc map of a city through Folium api
    :param map_zoom: a list of float indicating Latitude and Longitude for initial zoom of the map
    :param labels: list of strings indicating the columns names for Latitude, Longitude, Neighborhood
    :param target: string label of a target numeric value to display
    :param title: title of the map
    :param dataset: pandas.DataFrame containing the labels 'Latitude', 'Longitude', and the target column
    :return: folium.Map
    """
    if dataset is None:
        raise Exception("dataset must be a valid pandas.DataFrame")

    if target is None:
        raise Exception("target must be a string column label present inside dataset")

    if labels is None:
        raise Exception("No labels provided")
    
    # Work on a copy, so the caller's list is not modified
    labels = list(labels) + [target]
    subset = dataset[labels]
    city_map = folium.Map(
        location= map_zoom,
        tiles="OpenStreetMap",
        zoom_start=12.45,
        control_scale = True,
        min_lat= subset.Latitude.min(),
        max_lat= subset.Latitude.max(),
        min_lon= subset.Longitude.min(),
        max_lon= subset.Longitude.max()
    )
    title_html = '''<h3 align="center" style="font-size:20px"><b>{0}</b></h3>'''.format(title)
    city_map.get_root().html.add_child(folium.Element(title_html))

    subset = subset.reset_index(drop=True)
    # Add pop-ups, one marker per house
    show_neighborhood = 'Neighborhood' in labels
    for index, row in subset.iterrows():
        coordinates = [row.Latitude, row.Longitude]
        # Use the target column passed as argument
        popup = 'Price: $' + str(row[target])
        if show_neighborhood:
            # Strip the byte-string formatting (b'...') and the underscores
            popup += '\n Neighborhood: ' + str(row.Neighborhood).strip('\'b').replace("_", " ")
        Marker(location=coordinates, popup=popup, icon=folium.Icon()).add_to(city_map)
    return city_map

In [5]:
def plot_frequency_distr_numeric(dataset, exclude = None, include_kde=False, plot_cols=2, notebook_fig_size=None, adjust=None):
    """
    Plots the frequency distributions of its numeric features through histograms from seaborn
    
    :param exclude: a string list representing the columns to exclude
    :param include_kde: boolean representing whether to include kde
    :param dataset: pandas dataframe
    :param plot_cols: count of plots per column
    :param notebook_fig_size: dictionary of integers including keys 'width', 'height' which represent the measures in inches for notebook display purposes
    :param adjust: dictionary of float including keys 'left', 'right', 'top', 'bottom', 'wspace', 'hspace' which are used to space the different plots between them
    """
    # Numeric dataframe
    num_df = dataset.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64'])
    
    if exclude is not None:
        num_df = num_df.drop(exclude, axis=1)
    
    # Set subplot shape
    fig, axes = plt.subplots(nrows = int(np.ceil(num_df.shape[1]/plot_cols)), ncols = plot_cols, figsize=(9,2))
    # Flat 1-D flat iterator over the array.
    axes = axes.flatten()
    
    # Notebook figure dimensions
    if notebook_fig_size is None:
        # Default
        fig.set_size_inches(10, 40)
    else:
        fig.set_size_inches(notebook_fig_size.get('width'), notebook_fig_size.get('height'))
    
    # Plot distribution for each feature
    for ax, col in zip(axes, num_df.columns):
        sns.histplot(data=num_df, y=col, ax = ax, color='cornflowerblue', kde=include_kde, stat='count')
        ax.set_title(col.replace("_", " ")+'\'s Distribution', fontweight='bold')
        ax.set_ylabel('Values')
        ax.set_xlabel('Count')
        
    # Adjust spacing between plots
    if adjust is None:
        # Default
        plt.subplots_adjust(left=0.1, right=0.9, top=0.98, bottom=0.05, wspace=0.4, hspace=0.9)
    else:
        plt.subplots_adjust(left=adjust.get('left'), right=adjust.get('right'), 
                            top=adjust.get('top'), bottom=adjust.get('bottom'),
                            wspace=adjust.get('wspace'), hspace=adjust.get('hspace'))
    pass

# Util function for pretty printing
def print_num_col_skewness(dataset, exclude = None):
    """
    Prints the skewness of the numeric columns in the dataset
    
    : param dataset: pandas.DataFrame
    : param exclude: list of str representing column labels to exclude
    : returns: None
    """
    for x in dataset.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']):
        if exclude is not None and x in exclude:
            continue
        # Use the dataset passed as argument, not the global df
        print("Skewness of " + x + ": \n" + str(scipy.stats.skew(dataset[x])) + "\n")

In [6]:
from itertools import zip_longest

# We use a decoded version of the dataset to display the labels better
def decode_byte_str(dataset):
    """
    Decodes the dataset's object feature values in place and substitutes the byte-code string formatting
    
    : param dataset: a pandas.DataFrame
    : returns: None
    """
    categorical = dataset.select_dtypes(object)
    categorical = categorical.stack().str.decode('utf-8').unstack()
    for col in categorical:
        dataset[col] = categorical[[col]].apply(lambda x: x.str.replace("b'", "").str.replace("'", ""))
    pass

def barplot_categ(dataset):
    """
    Produces a barplot for a dataset, including only the object types
    
    : param dataset: a pandas.DataFrame
    : returns: None
    """
    # Categorical Features Only 
    categorical_data = dataset.select_dtypes(object)
    # Rows
    n = categorical_data.shape[1]
    # Params for subplots
    nrows, ncols = (int(np.ceil(n / 2))+1, 2)

    fig, axs = plt.subplots(ncols=ncols, nrows=nrows, figsize=(9.9, 60))
    
    for feature_name, ax in zip_longest(categorical_data, axs.ravel()):
        if feature_name is None:
            # Avoid showing axis
            ax.axis("off")
            continue

        ax = categorical_data[feature_name].value_counts().plot.barh(ax=ax, color='cornflowerblue')
        ax.set_title(feature_name + '\'s Frequency', fontweight='bold')
        ax.set_xlabel('Count', fontsize = 8)
        ax.tick_params(axis='both', which='major', labelsize=8)
        ax.tick_params(axis='both', which='minor', labelsize=6)
        plt.setp(ax.get_yticklabels(), rotation=40)

    plt.subplots_adjust(left=0.2,
                    bottom=0.02,
                    right=0.9,
                    top=0.98,
                    wspace=0.9,
                    hspace=0.9)
    pass


def boxplot_categ(dataset, target):
    """
    Produces a target related boxplot for a dataset, including only the object types
    
    : param dataset: a pandas.DataFrame
    : param target: a string label for a column
    : returns: None
    """
    # Categorical Features Only 
    categorical_data = dataset.select_dtypes(object)
    # Rows
    n = categorical_data.shape[1]
    # Params for subplots
    nrows, ncols = (n, 1)
    
    fig, axs = plt.subplots(ncols=ncols, nrows=nrows, figsize=(10, n*7))
    
    for feature_name, ax in zip_longest(categorical_data, axs.ravel()):
        if feature_name is None:
            # Avoid showing axis
            ax.axis("off")
            continue   
        ax = dataset[[target, feature_name]].boxplot(ax=ax, rot = 45,
                                                     column=target,
                                                     by=feature_name)
        ax.set_title(feature_name + ' Boxplot', fontweight='bold')
        ax.tick_params(axis='both', which='major', labelsize=8)
        ax.tick_params(axis='both', which='minor', labelsize=6)
        plt.setp(ax.get_yticklabels(), rotation=40)

    plt.subplots_adjust(left=0.2,
                    bottom=0.03,
                    right=0.9,
                    top=0.96,
                    wspace=0.7,
                    hspace=0.7)
    pass

We need to set the right directories to separate the notebooks from the resources

## Dataset Loading

Now, in the current path there should be a `[project]/resources/` folder with the `ames_housing.arff` file

In [7]:
from scipy.io import arff

# Read the data
data = arff.loadarff(os.path.join(RESOURCES_DIR, "ames_housing.arff"))

## Dataset Overview

If the import has not failed we can get our first insights on the dataset. 

In [8]:
# Get the actual dataset
df = pd.DataFrame(data[0])
print('Our dataset has {0} rows and {1} columns.'.format(df.shape[0], df.shape[1]))

Our dataset has 2930 rows and 81 columns.


This matches what we expected: `data[0]` holds the dataset itself, while `data[1]` contains the information about the features' values.
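To make the pair returned by `loadarff` more concrete, here is a sketch on a tiny in-memory ARFF file. The file content is made up for illustration; only the attribute names are borrowed from the dataset.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# A tiny, made-up ARFF file with one numeric and one nominal attribute.
content = """@relation toy_houses
@attribute Lot_Area numeric
@attribute Street {Grvl,Pave}
@data
8450,Pave
9600,Grvl
"""

# loadarff returns the records and a metadata object describing the attributes.
records, meta = arff.loadarff(StringIO(content))
toy = pd.DataFrame(records)

# Nominal values are loaded as byte strings, hence the decoding step.
toy["Street"] = toy["Street"].str.decode("utf-8")

print(meta.names())  # attribute names
print(meta.types())  # 'numeric' or 'nominal' for each attribute
print(toy)
```

The byte strings are the reason why the nominal values of the real dataset show up as `b'...'` until they are decoded.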

In [9]:
data[1]

Dataset: R_data_frame
	MS_SubClass's type is nominal, range is ('One_Story_1946_and_Newer_All_Styles', 'One_Story_1945_and_Older', 'One_Story_with_Finished_Attic_All_Ages', 'One_and_Half_Story_Unfinished_All_Ages', 'One_and_Half_Story_Finished_All_Ages', 'Two_Story_1946_and_Newer', 'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages', 'Split_or_Multilevel', 'Split_Foyer', 'Duplex_All_Styles_and_Ages', 'One_Story_PUD_1946_and_Newer', 'One_and_Half_Story_PUD_All_Ages', 'Two_Story_PUD_1946_and_Newer', 'PUD_Multilevel_Split_Level_Foyer', 'Two_Family_conversion_All_Styles_and_Ages')
	MS_Zoning's type is nominal, range is ('Floating_Village_Residential', 'Residential_High_Density', 'Residential_Low_Density', 'Residential_Medium_Density', 'A_agr', 'C_all', 'I_all')
	Lot_Frontage's type is numeric
	Lot_Area's type is numeric
	Street's type is nominal, range is ('Grvl', 'Pave')
	Alley's type is nominal, range is ('Gravel', 'No_Alley_Access', 'Paved')
	Lot_Shape's type is nominal, range is 

However, we have no need for it, as the description of the features has already been provided.

In [10]:
# Delete it from memory
del data

### Sorting for a better visualization

Before actually getting into the data, we want to make the dataset readable and well organized. Thus, we will sort its columns alphabetically and move `Sale_Price` to the very end.

This is partly a matter of personal taste: I like to have the columns in alphabetical order and the target as the last column, since this will later be the most convenient layout for displaying the correlation matrix.

In [11]:
df = sort_alphabetically(df, 'Sale_Price')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 81 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Alley               2930 non-null   object 
 1   Bedroom_AbvGr       2930 non-null   float64
 2   Bldg_Type           2930 non-null   object 
 3   BsmtFin_SF_1        2930 non-null   float64
 4   BsmtFin_SF_2        2930 non-null   float64
 5   BsmtFin_Type_1      2930 non-null   object 
 6   BsmtFin_Type_2      2930 non-null   object 
 7   Bsmt_Cond           2930 non-null   object 
 8   Bsmt_Exposure       2930 non-null   object 
 9   Bsmt_Full_Bath      2930 non-null   float64
 10  Bsmt_Half_Bath      2930 non-null   float64
 11  Bsmt_Qual           2930 non-null   object 
 12  Bsmt_Unf_SF         2930 non-null   float64
 13  Central_Air         2930 non-null   object 
 14  Condition_1         2930 non-null   object 
 15  Condition_2         2930 non-null   object 
 16  Electr

Again, the description was right about the missing values. We will still need to check them later, to figure out whether some values might be misleading.

For now, we only try to add features, so that we can check their contributions later.

### Keeping more than one dataset

Since the size of the data allows it, I want to keep two datasets instead of one:
* One dataset that is touched only if necessary (obvious errors, or transformations)
* Another one for heavy data preparation

Both of them will undergo almost the same transformations and will be used to train, tune and test the models. This will help us understand how the manipulation of the dataset affects the final output.

In [12]:
df_original= df.copy()
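Keeping two datasets only works if they are truly independent. As a quick sanity check, this sketch (on made-up values) shows that `DataFrame.copy()` makes a deep copy by default, so edits to the working copy do not leak into the original.

```python
import pandas as pd

# Toy data with made-up values.
original = pd.DataFrame({"Sale_Price": [100000.0, 200000.0]})

# copy() is deep by default: data and index are duplicated.
working = original.copy()
working.loc[0, "Sale_Price"] = 0.0

print(original.loc[0, "Sale_Price"])  # still the original value
print(working.loc[0, "Sale_Price"])   # the modified value
```

A plain assignment such as `working = original` would instead make both names point to the same object.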

### Dropping useless features

From the description and some research, I found that it is better to exclude some features from the start, such as the following ones:

In [13]:
df.drop(columns=['Misc_Val', 'Misc_Feature', 'Functional'], inplace=True)

### Some context: Ames, Iowa

As the dataset provides longitude and latitude, we can exploit it to understand the geographical boundaries of the entries. 

Visualizing the geographical distribution of the data is one way to double-check possible errors and to understand the density of the data.

It also helps us understand which data we are considering and which entries have been removed during our exploration. Plus, it can be useful for visualizing train and test instances that might be problematic.

Last but not least, this is an elegant way to get a high-level idea about which predictive approach could be the best.

 **Mind that this step could be very heavy on the overall memory and can be avoided**.


In [14]:
# Uncommenting will take more time to run the whole notebook
# display_ames_houses(df, target='Sale_Price', labels=['Latitude', 'Longitude', 'Neighborhood'], map_zoom=[42.030781, -93.631912])
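If the full map is too heavy, one option is to plot only a random sample of the houses. The helper below is a hypothetical sketch (the name `map_inputs`, the sample size and the toy coordinates are assumptions): it returns a subset of rows together with a center point that could be passed as `map_zoom`.

```python
import pandas as pd

def map_inputs(dataset, n_points=200, seed=0):
    """Return a random subset of rows and a [latitude, longitude] center for the map."""
    n = min(n_points, len(dataset))
    subset = dataset.sample(n=n, random_state=seed)
    center = [subset["Latitude"].mean(), subset["Longitude"].mean()]
    return subset, center

# Toy data with made-up coordinates.
toy = pd.DataFrame({
    "Latitude": [42.0, 42.1, 42.2, 42.3, 42.4],
    "Longitude": [-93.6, -93.6, -93.6, -93.6, -93.6],
})

subset, center = map_inputs(toy, n_points=3)
print(len(subset), center)

# With the real data, the result could then be passed to the plotting function:
# display_ames_houses(subset, labels=['Latitude', 'Longitude', 'Neighborhood'], map_zoom=center)
```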

Now that we have an idea of the context we can proceed with the real deal

## Numeric Features Overview

First of all, let's take a look at the numeric features, which are easier to interpret.

I provided a function that displays the distribution of each numeric feature through binned histograms. This aims at giving a rough idea of the possible issues.

Functions of this kind will be needed again later, to double-check the final output.

In [15]:
plot_frequency_distr_numeric(df, exclude = ['Latitude', 'Longitude'])


The first things we notice are:
* Most of the distributions are skewed and could benefit from a log or Box-Cox transformation.
    * They could also be paired with a sentinel value marking the absence or presence of an attribute.
* There are many attributes which rarely take values other than zero.
* Many features concentrate their values in a very small range.
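As an illustration of the first point, the sketch below pairs a log transformation with a presence indicator on a toy column. The values are made up, the new column names are hypothetical, and `np.log1p` is used instead of a plain logarithm because it is defined at zero.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values: most houses have no screen porch.
toy = pd.DataFrame({"Screen_Porch": [0.0, 0.0, 120.0, 576.0]})

# Indicator: 1 if the attribute is present, 0 otherwise.
toy["Has_Screen_Porch"] = (toy["Screen_Porch"] > 0).astype(int)

# log1p(x) = log(1 + x), so zeros stay at zero and large values are compressed.
toy["Screen_Porch_log"] = np.log1p(toy["Screen_Porch"])

print(toy)
```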

As we said before, this is problematic, but for now we have no clue how to deal with it. Some of these features could simply be used as dummy variables encoding whether an attribute is present, but the evidence for them is thin and they are barely correlated with `Sale_Price` (e.g. `Three_season_porch`).

Let us first check some statistics


In [16]:
df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).describe()

| | Bedroom_AbvGr | BsmtFin_SF_1 | BsmtFin_SF_2 | Bsmt_Full_Bath | Bsmt_Half_Bath | Bsmt_Unf_SF | Enclosed_Porch | Fireplaces | First_Flr_SF | Full_Bath | ... | Screen_Porch | Second_Flr_SF | Three_season_porch | TotRms_AbvGrd | Total_Bsmt_SF | Wood_Deck_SF | Year_Built | Year_Remod_Add | Year_Sold | Sale_Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | ... | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 | 2930.0 |
| mean | 2.854266 | 4.177474 | 49.705461 | 0.431058 | 0.061092 | 559.071672 | 23.011604 | 0.599317 | 1159.557679 | 1.566553 | ... | 16.002048 | 335.455973 | 2.592491 | 6.443003 | 1051.255631 | 93.751877 | 1971.356314 | 1984.266553 | 2007.790444 | 180796.060068 |
| std | 0.827731 | 2.233372 | 169.142089 | 0.524762 | 0.245175 | 439.540571 | 64.139059 | 0.647921 | 391.890885 | 0.552941 | ... | 56.08737 | 428.395715 | 25.141331 | 1.572964 | 440.968018 | 126.361562 | 30.245361 | 20.860286 | 1.316613 | 79886.692357 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 334.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1872.0 | 1950.0 | 2006.0 | 12789.0 |
| 25% | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 219.0 | 0.0 | 0.0 | 876.25 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 5.0 | 793.0 | 0.0 | 1954.0 | 1965.0 | 2007.0 | 129500.0 |
| 50% | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 465.5 | 0.0 | 1.0 | 1084.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 990.0 | 0.0 | 1973.0 | 1993.0 | 2008.0 | 160000.0 |
| 75% | 3.0 | 7.0 | 0.0 | 1.0 | 0.0 | 801.75 | 0.0 | 1.0 | 1384.0 | 2.0 | ... | 0.0 | 703.75 | 0.0 | 7.0 | 1301.5 | 168.0 | 2001.0 | 2004.0 | 2009.0 | 213500.0 |
| max | 8.0 | 7.0 | 1526.0 | 3.0 | 2.0 | 2336.0 | 1012.0 | 4.0 | 5095.0 | 4.0 | ... | 576.0 | 2065.0 | 508.0 | 15.0 | 6110.0 | 1424.0 | 2010.0 | 2010.0 | 2010.0 | 755000.0 |


From the table, it is easy to see that the dataset is problematic from the point of view of the values' distribution. However, we will dig deeper into this in the next notebook.

For what concerns the skewness:

In [17]:
print_num_col_skewness(df, exclude = ['Latitude', 'Longitude'])

Skewness of Bedroom_AbvGr: 
0.30553769035566597

Skewness of BsmtFin_SF_1: 
0.08911009666138348

Skewness of BsmtFin_SF_2: 
4.138673580995156

Skewness of Bsmt_Full_Bath: 
0.617411423419396

Skewness of Bsmt_Half_Bath: 
3.9403706782664814

Skewness of Bsmt_Unf_SF: 
0.922572362659496

Skewness of Enclosed_Porch: 
4.012390205634154

Skewness of Fireplaces: 
0.7388367095522606

Skewness of First_Flr_SF: 
1.4686762661218558

Skewness of Full_Bath: 
0.1718640349156229

Skewness of Garage_Area: 
0.23994167112208759

Skewness of Garage_Cars: 
-0.22104927953966302

Skewness of Gr_Liv_Area: 
1.2734573491164038

Skewness of Half_Bath: 
0.6973558240143646

Skewness of Kitchen_AbvGr: 
4.311615838595926

Skewness of Lot_Area: 
12.814333637733153

Skewness of Lot_Frontage: 
0.025051361901111895

Skewness of Low_Qual_Fin_SF: 
12.111956844115396

Skewness of Mas_Vnr_Area: 
2.617963998324404

Skewness of Mo_Sold: 
0.19249745018212133

Skewness of Open_Porch_SF: 
2.5340877554403782

Skewness of Pool_Are

The skewness is very high for some features (for example `Lot_Area` and `Low_Qual_Fin_SF`, both above 12), which shows that many features are far from normally distributed. This suggests that some kind of transformation is required for certain models (such as ANNs).
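For reference, the value printed above is the sample skewness, i.e. the third central moment divided by the second central moment raised to the power 3/2, which is what `scipy.stats.skew` computes with its default settings. The function below is a minimal re-implementation for illustration only (the name `sample_skewness` is hypothetical).

```python
import numpy as np

def sample_skewness(values):
    """Biased sample skewness: m3 / m2**1.5, with m_k the k-th central moment."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

symmetric = sample_skewness([1, 2, 3, 4, 5])
right_tailed = sample_skewness([0, 0, 0, 0, 100])

print(symmetric)     # close to zero for a symmetric sample
print(right_tailed)  # positive when a few large values stretch the right tail
```

A column that is mostly zero with a few large values, as many of the area features are, falls into the second case.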

## Categorical Features Overview

### Categorical features recap
Before plotting the categorical features, which might be unnecessary, we should look into their meaning. In fact, categorical features have some special subcases:
* **Nominal features**, describe a name, label or category **without natural order**;
* **Ordinal features**, variables whose values are defined by an **order relation** between the different categories;

From another point of view, we can also define **Dichotomous features**, categorical variables with two categories or levels:
* *Discrete Dichotomous* features, categorical variables with two categories or levels and **nothing in between them**. 
    * Binary features, variables assigned either a 0 or a 1.
* *Continuous Dichotomous* features, categorical variables **with possibilities in between** the two (extreme) categories.

### Our case

The problem here is that we have a mix of these features, which can be confusing when it is time for encoding, since the difference between ordinal features and continuous dichotomous features can be subtle.

#### Ordinal Features

The principle that guides us in classifying a feature as ordinal is the usage of the words "Quality" or "Condition", and the presence of levels in between two extremes.

In certain cases, we might choose not to adopt a conventional approach as some variables have most instances in a single category, which then makes that the 'average' category. However, here we trust the levels given by the description of the features.
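A possible way to express such levels in pandas is an ordered categorical, whose integer codes follow the declared order. The level names and the toy values below are assumptions for illustration; the sketch also measures how dominant the most frequent category is.

```python
import pandas as pd

# Assumed order of the levels, from worst to best.
levels = ["Poor", "Fair", "Typical", "Good", "Excellent"]

# Toy data with made-up values.
cond = pd.Series(["Typical", "Typical", "Good", "Typical", "Fair"])

# ordered=True tells pandas that the categories follow the given order.
ordered = pd.Categorical(cond, categories=levels, ordered=True)
codes = ordered.codes.tolist()

# Share of the most frequent category.
dominant_share = cond.value_counts(normalize=True).iloc[0]

print(codes)
print(dominant_share)
```

When the dominant share is very high, the feature carries little ordering information in practice, which is the situation described above.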

In [18]:
%%html --isolated

<style type="text/css">.tg-sort-header::-moz-selection{background:0 0}.tg-sort-header::selection{background:0 0}.tg-sort-header{cursor:pointer}.tg-sort-header:after{content:'';float:right;margin-top:7px;border-width:0 5px 5px;border-style:solid;border-color:#404040 transparent;visibility:hidden}.tg-sort-header:hover:after{visibility:visible}.tg-sort-asc:after,.tg-sort-asc:hover:after,.tg-sort-desc:after{visibility:visible;opacity:.4}.tg-sort-desc:after{border-bottom:none;border-width:5px 5px 0}@media screen and (max-width: 767px) {.tg {width: auto !important;}.tg col {width: auto !important;}.tg-wrap {overflow-x: auto;-webkit-overflow-scrolling: touch;margin: auto 0px;}}</style><div class="tg-wrap"><table id="tg-oyTCE" style="border-collapse:collapse;border-color:#9ABAD9;border-spacing:0;margin:0px auto" class="tg"><thead><tr><th style="background-color:#409cff;border-color:inherit;border-style:solid;border-width:1px;color:#fff;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;position:-webkit-sticky;position:sticky;text-align:left;top:-1px;vertical-align:top;will-change:transform;word-break:normal"><span style="font-weight:bold">Feature</span></th><th style="background-color:#409cff;border-color:inherit;border-style:solid;border-width:1px;color:#fff;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;position:-webkit-sticky;position:sticky;text-align:center;top:-1px;vertical-align:top;will-change:transform;word-break:normal"><span style="font-weight:bold">Meaning</span></th></tr></thead><tbody><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">BsmtFin_Type_1</td><td 
style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Rating of basement finished area</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">BsmtFin_Type_2</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Rating of basement finished area (if multiple types)</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Bsmt_Cond</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Evaluates the general condition of the basement</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Bsmt_Exposure</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Refers to walkout or garden level walls</td></tr><tr><td 
style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Bsmt_Qual</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Evaluates the height of the basement</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Exter_Cond</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Evaluates the present condition of the material on the exterior</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Exter_Qual</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Evaluates the quality of the material on the exterior</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Fence</td><td 
style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Fence quality</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Fireplace_Qu</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Fireplace quality</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Functional</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Home functionality (Assume typical unless deductions are warranted)</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Garage_Cond</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Garage condition</td></tr><tr><td 
style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Garage_Finish</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Interior finish of the garage</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Garage_Qual</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Garage quality</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Heating_QC</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Heating quality and condition</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Kitchen_Qual</td><td 
style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Kitchen quality</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Land_Contour</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Flatness of the property</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Land_Slope</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Slope of property</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Lot_Shape</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">General shape of property</td></tr><tr><td 
style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Overall_Cond</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Rates the overall condition of the house</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Overall_Qual</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Rates the overall material and finish of the house</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Paved_Drive</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Paved driveway</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Pool_QC</td><td 
style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Pool quality</td></tr><tr><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Utilities</td><td style="background-color:#EBF5FF;border-color:inherit;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Type of utilities available</td></tr></tbody></table></div><script charset="utf-8">var TGSort=window.TGSort||function(n){"use strict";function r(n){return n?n.length:0}function t(n,t,e,o=0){for(e=r(n);o<e;++o)t(n[o],o)}function e(n){return n.split("").reverse().join("")}function o(n){var e=n[0];return t(n,function(n){for(;!n.startsWith(e);)e=e.substring(0,r(e)-1)}),r(e)}function u(n,r,e=[]){return t(n,function(n){r(n)&&e.push(n)}),e}var a=parseFloat;function i(n,r){return function(t){var e="";return t.replace(n,function(n,t,o){return e=t.replace(r,"")+"."+(o||"").substring(1)}),a(e)}}var s=i(/^(?:\s*)([+-]?(?:\d+)(?:,\d{3})*)(\.\d*)?$/g,/,/g),c=i(/^(?:\s*)([+-]?(?:\d+)(?:\.\d{3})*)(,\d*)?$/g,/\./g);function f(n){var t=a(n);return!isNaN(t)&&r(""+t)+1>=r(n)?t:NaN}function d(n){var e=[],o=n;return t([f,s,c],function(u){var a=[],i=[];t(n,function(n,r){r=u(n),a.push(r),r||i.push(n)}),r(i)<r(o)&&(o=i,e=a)}),r(u(o,function(n){return n==o[0]}))==r(o)?e:[]}function v(n){if("TABLE"==n.nodeName){for(var a=function(r){var e,o,u=[],a=[];return function 
n(r,e){e(r),t(r.childNodes,function(r){n(r,e)})}(n,function(n){"TR"==(o=n.nodeName)?(e=[],u.push(e),a.push(n)):"TD"!=o&&"TH"!=o||e.push(n)}),[u,a]}(),i=a[0],s=a[1],c=r(i),f=c>1&&r(i[0])<r(i[1])?1:0,v=f+1,p=i[f],h=r(p),l=[],g=[],N=[],m=v;m<c;++m){for(var T=0;T<h;++T){r(g)<h&&g.push([]);var C=i[m][T],L=C.textContent||C.innerText||"";g[T].push(L.trim())}N.push(m-v)}t(p,function(n,t){l[t]=0;var a=n.classList;a.add("tg-sort-header"),n.addEventListener("click",function(){var n=l[t];!function(){for(var n=0;n<h;++n){var r=p[n].classList;r.remove("tg-sort-asc"),r.remove("tg-sort-desc"),l[n]=0}}(),(n=1==n?-1:+!n)&&a.add(n>0?"tg-sort-asc":"tg-sort-desc"),l[t]=n;var i,f=g[t],m=function(r,t){return n*f[r].localeCompare(f[t])||n*(r-t)},T=function(n){var t=d(n);if(!r(t)){var u=o(n),a=o(n.map(e));t=d(n.map(function(n){return n.substring(u,r(n)-a)}))}return t}(f);(r(T)||r(T=r(u(i=f.map(Date.parse),isNaN))?[]:i))&&(m=function(r,t){var e=T[r],o=T[t],u=isNaN(e),a=isNaN(o);return u&&a?0:u?-n:a?n:e>o?n:e<o?-n:n*(r-t)});var C,L=N.slice();L.sort(m);for(var E=v;E<c;++E)(C=s[E].parentNode).removeChild(s[E]);for(E=v;E<c;++E)C.appendChild(s[v+L[E-v]])})})}}n.addEventListener("DOMContentLoaded",function(){for(var t=n.getElementsByClassName("tg"),e=0;e<r(t);++e)try{v(t[e])}catch(n){}})}(document)</script>

| Feature | Meaning |
| --- | --- |
| `BsmtFin_Type_1` | Rating of basement finished area |
| `BsmtFin_Type_2` | Rating of basement finished area (if multiple types) |
| `Bsmt_Cond` | Evaluates the general condition of the basement |
| `Bsmt_Exposure` | Refers to walkout or garden level walls |
| `Bsmt_Qual` | Evaluates the height of the basement |
| `Exter_Cond` | Evaluates the present condition of the material on the exterior |
| `Exter_Qual` | Evaluates the quality of the material on the exterior |
| `Fence` | Fence quality |
| `Fireplace_Qu` | Fireplace quality |
| `Functional` | Home functionality (Assume typical unless deductions are warranted) |


#### Nominal Features

These features are easy to recognize, as their levels give us no hint of an "order".

We prefer to use One-Hot encoding (OHE) for these variables, which yields a series of binary variables (1 or 0) indicating whether or not a category is present in a particular row.

However, this is not a good way to encode features with many levels, since every level adds one more column. This is why we tried to find a better approach for our prediction, by turning the OHE into a binary encoding of the kind $1$ vs all.
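As a minimal sketch of the two encodings mentioned above, the snippet below uses `pandas` on a toy column. The toy values, the variable names and the choice of `"PConc"` as the level singled out in the 1-vs-all column are assumptions made only for illustration; they are not taken from the project's code.

```python
import pandas as pd

# Toy data: the column name follows the dataset, the rows are made up for illustration.
toy = pd.DataFrame({"Foundation": ["PConc", "CBlock", "PConc", "Slab"]})

# One-Hot encoding: one 0/1 column per level of the feature.
one_hot = pd.get_dummies(toy["Foundation"], prefix="Foundation", dtype=int)

# "1 vs all": a single 0/1 column that tells one chosen level apart from all others.
# The chosen level ("PConc") is an arbitrary pick for this example.
is_pconc = (toy["Foundation"] == "PConc").astype(int).rename("Foundation_is_PConc")

print(one_hot)
print(is_pconc)
```

With One-Hot encoding the number of columns grows with the number of levels, while the 1-vs-all variant always produces a single column per feature, at the cost of merging all the remaining levels together.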

In [19]:
%%html --isolated
<style type="text/css">.tg-sort-header::-moz-selection{background:0 0}.tg-sort-header::selection{background:0 0}.tg-sort-header{cursor:pointer}.tg-sort-header:after{content:'';float:right;margin-top:7px;border-width:0 5px 5px;border-style:solid;border-color:#404040 transparent;visibility:hidden}.tg-sort-header:hover:after{visibility:visible}.tg-sort-asc:after,.tg-sort-asc:hover:after,.tg-sort-desc:after{visibility:visible;opacity:.4}.tg-sort-desc:after{border-bottom:none;border-width:5px 5px 0}@media screen and (max-width: 767px) {.tg {width: auto !important;}.tg col {width: auto !important;}.tg-wrap {overflow-x: auto;-webkit-overflow-scrolling: touch;margin: auto 0px;}}</style><div class="tg-wrap"><table id="tg-qpwKN" style="border-collapse:collapse;border-color:#9ABAD9;border-spacing:0;margin:0px auto" class="tg"><thead><tr><th style="background-color:#409cff;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#fff;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;position:-webkit-sticky;position:sticky;text-align:left;top:-1px;vertical-align:top;will-change:transform;word-break:normal"><span style="font-weight:bold">Feature</span></th><th style="background-color:#409cff;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#fff;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;position:-webkit-sticky;position:sticky;text-align:center;top:-1px;vertical-align:top;will-change:transform;word-break:normal"><span style="font-weight:bold">Meaning</span></th></tr></thead><tbody><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Alley</td><td 
style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Type of alley access</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Bldg_Type</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Type of dwelling</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Condition_1</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Proximity to various conditions</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Condition_2</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Proximity to various conditions (if more than one is present)</td></tr><tr><td 
style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Electrical</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Electrical system</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Exterior_1st</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Exterior covering on house</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Exterior_2nd</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Exterior covering on house</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Foundation</td><td 
style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Type of foundation</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Garage_Type</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Garage location</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Heating</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Type of heating</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">House_Style</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Style of dwelling</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, 
sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Lot_Config</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Lot Configuration</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Mas_Vnr_Type</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Masonry veneer type</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">MS_SubClass</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Identifies the type of dwelling involved in the sale.</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">MS_Zoning</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 
5px;text-align:left;vertical-align:top;word-break:normal">Identifies the general zoning classification of the sale.</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Neighborhood</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Physical locations within Ames city limits</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Roof_Matl</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Roof Material</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Roof_Style</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Type of roof</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 
5px;text-align:left;vertical-align:top;word-break:normal">Sale_Condition</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Condition of sale</td></tr><tr><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;font-style:italic;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Sale_Type</td><td style="background-color:#EBF5FF;border-color:#9ABAD9;border-style:solid;border-width:1px;color:#444;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Type of sale</td></tr></tbody></table></div><script charset="utf-8">var TGSort=window.TGSort||function(n){"use strict";function r(n){return n?n.length:0}function t(n,t,e,o=0){for(e=r(n);o<e;++o)t(n[o],o)}function e(n){return n.split("").reverse().join("")}function o(n){var e=n[0];return t(n,function(n){for(;!n.startsWith(e);)e=e.substring(0,r(e)-1)}),r(e)}function u(n,r,e=[]){return t(n,function(n){r(n)&&e.push(n)}),e}var a=parseFloat;function i(n,r){return function(t){var e="";return t.replace(n,function(n,t,o){return e=t.replace(r,"")+"."+(o||"").substring(1)}),a(e)}}var s=i(/^(?:\s*)([+-]?(?:\d+)(?:,\d{3})*)(\.\d*)?$/g,/,/g),c=i(/^(?:\s*)([+-]?(?:\d+)(?:\.\d{3})*)(,\d*)?$/g,/\./g);function f(n){var t=a(n);return!isNaN(t)&&r(""+t)+1>=r(n)?t:NaN}function d(n){var e=[],o=n;return t([f,s,c],function(u){var a=[],i=[];t(n,function(n,r){r=u(n),a.push(r),r||i.push(n)}),r(i)<r(o)&&(o=i,e=a)}),r(u(o,function(n){return n==o[0]}))==r(o)?e:[]}function v(n){if("TABLE"==n.nodeName){for(var a=function(r){var e,o,u=[],a=[];return function 
n(r,e){e(r),t(r.childNodes,function(r){n(r,e)})}(n,function(n){"TR"==(o=n.nodeName)?(e=[],u.push(e),a.push(n)):"TD"!=o&&"TH"!=o||e.push(n)}),[u,a]}(),i=a[0],s=a[1],c=r(i),f=c>1&&r(i[0])<r(i[1])?1:0,v=f+1,p=i[f],h=r(p),l=[],g=[],N=[],m=v;m<c;++m){for(var T=0;T<h;++T){r(g)<h&&g.push([]);var C=i[m][T],L=C.textContent||C.innerText||"";g[T].push(L.trim())}N.push(m-v)}t(p,function(n,t){l[t]=0;var a=n.classList;a.add("tg-sort-header"),n.addEventListener("click",function(){var n=l[t];!function(){for(var n=0;n<h;++n){var r=p[n].classList;r.remove("tg-sort-asc"),r.remove("tg-sort-desc"),l[n]=0}}(),(n=1==n?-1:+!n)&&a.add(n>0?"tg-sort-asc":"tg-sort-desc"),l[t]=n;var i,f=g[t],m=function(r,t){return n*f[r].localeCompare(f[t])||n*(r-t)},T=function(n){var t=d(n);if(!r(t)){var u=o(n),a=o(n.map(e));t=d(n.map(function(n){return n.substring(u,r(n)-a)}))}return t}(f);(r(T)||r(T=r(u(i=f.map(Date.parse),isNaN))?[]:i))&&(m=function(r,t){var e=T[r],o=T[t],u=isNaN(e),a=isNaN(o);return u&&a?0:u?-n:a?n:e>o?n:e<o?-n:n*(r-t)});var C,L=N.slice();L.sort(m);for(var E=v;E<c;++E)(C=s[E].parentNode).removeChild(s[E]);for(E=v;E<c;++E)C.appendChild(s[v+L[E-v]])})})}}n.addEventListener("DOMContentLoaded",function(){for(var t=n.getElementsByClassName("tg"),e=0;e<r(t);++e)try{v(t[e])}catch(n){}})}(document)</script>

| Feature | Meaning |
| --- | --- |
| `Alley` | Type of alley access |
| `Bldg_Type` | Type of dwelling |
| `Condition_1` | Proximity to various conditions |
| `Condition_2` | Proximity to various conditions (if more than one is present) |
| `Electrical` | Electrical system |
| `Exterior_1st` | Exterior covering on house |
| `Exterior_2nd` | Exterior covering on house |
| `Foundation` | Type of foundation |
| `Garage_Type` | Garage location |
| `Heating` | Type of heating |


#### Binary Features

Some features can be interpreted as **binary** categorical features, as they have only two levels, and we can eventually represent each of them with a single dummy column.

We have already manually encoded these features as binary features, indicating the presence ($1$) or absence ($0$) of the alternative value.

In [20]:
%%html --isolated

| Feature | Meaning | Values |
|---|---|---|
| *Central_Air* | Central air conditioning | 'Y', 'N' |
| *Street* | Type of road access to property | 'Grvl', 'Pave' |


In [21]:
temp=df.copy()
decode_byte_str(dataset=temp)
barplot_categ(temp)
df = temp

temp = df_original.copy()
decode_byte_str(dataset=temp)
df_original = temp


Again, we find some features that show little diversity in their observations, which supports the hypothesis that there might be a hidden pattern of missing entries, or that the houses simply lack that specific characteristic.

We can notice two major aspects from this plot:
* First, there are plenty of features where one value is heavily overrepresented.
    * In some cases, there is no need to consider the feature, as its relation with the price might not be useful.
    * In other cases, it might still be too early to decide to drop it. In fact, such features could be combined with other ones and become very important for the final prediction.
* Second, a number of categorical features are really ordinal features, such as the ones including `Qual`, `Cond`, `QC` in their labels (as we said before).

There are two main approaches now, as far as nominal features are concerned:
* The first one would be a quick sweep of the features that are hardly important, and then keep going with the work.
* The second one would be to transform the features through some categorical binning (merging labels of the same feature with similar distribution) to increase the diversity. This can then be used for
    * Lighter One-Hot-Encoding, as we have a smaller number of labels for each nominal feature ($2$-$3$ instead of $8$)
    * Or even creating binary numerical indicators of the type $1$ vs all (or $m$ vs $n-m$), which are easier to handle.
    
However, this can only be determined later on.
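
As a rough illustration of the binning idea, the sketch below merges rare labels into a single bucket before encoding. Note that it bins by label frequency, which is a simpler stand-in for the "similar distribution" criterion mentioned above; the helper `merge_rare_labels`, the `min_share` threshold and the toy `Roof_Style` values are all assumptions made for this example.

```python
import pandas as pd

# Toy data: a nominal feature where one label dominates (values are made up)
toy = pd.DataFrame({
    "Roof_Style": ["Gable"] * 7 + ["Hip"] * 2 + ["Flat"],
})

def merge_rare_labels(series, min_share=0.15, other_label="Other"):
    """Hypothetical helper: replace labels rarer than min_share with other_label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    return series.where(~series.isin(rare), other_label)

# 'Flat' covers 10% of the rows, so it is merged into 'Other'
toy["Roof_Style_Binned"] = merge_rare_labels(toy["Roof_Style"])

# Lighter one-hot encoding on the binned feature
one_hot = pd.get_dummies(toy["Roof_Style_Binned"], prefix="Roof_Style")

# Binary "one vs all" indicator for the dominant label
toy["Roof_Is_Gable"] = (toy["Roof_Style"] == "Gable").astype(int)
```

Whether a frequency threshold or a price-based grouping works better is something only the later analysis can tell.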

I decided to drop the following features because, according to my research and to the comparison with other challenges involving this dataset, they are never really relevant.

In [22]:
df.drop(columns=['Condition_1', 'Condition_2', 
                 'Exterior_1st', 'Exterior_2nd', 
                 'Foundation', 
                 'Utilities'], inplace = True)

I have also provided a boxplot to compare the interquartile range of the sale price across the labels of each categorical feature, and the resulting plots confirm what we just said.

In [23]:
boxplot_categ(temp, "Sale_Price")


In [24]:
del temp

However, the encoding will take place later, at the end of the EDA.

## Data Correction, Coherence Check and some Feature Engineering

At this point, just by looking at the plots above, we notice discrepancies between features. Here we try to correct them and to add some features or indicators that could be of help later.

Most of them concern
* Zoning Class
* Living Area features
* Basement related features
* Garage related features
* Fireplace related features
* Sale Condition

### Zoning Class

We want to remove all the data that does not concern house prices, namely non-residential buildings.

In [25]:
df['MS_Zoning'].unique()

array(['Residential_Low_Density', 'Residential_High_Density',
       'Floating_Village_Residential', 'Residential_Medium_Density',
       'C_all', 'I_all', 'A_agr'], dtype=object)

We notice, as predicted, that some buildings are not residential.

In [26]:
df[(df['MS_Zoning'] == 'C_all') | ((df['MS_Zoning'] == 'I_all') | (df['MS_Zoning'] == 'A_agr'))]

Unnamed: 0,Alley,Bedroom_AbvGr,Bldg_Type,BsmtFin_SF_1,BsmtFin_SF_2,BsmtFin_Type_1,BsmtFin_Type_2,Bsmt_Cond,Bsmt_Exposure,Bsmt_Full_Bath,...,Second_Flr_SF,Street,Three_season_porch,TotRms_AbvGrd,Total_Bsmt_SF,Wood_Deck_SF,Year_Built,Year_Remod_Add,Year_Sold,Sale_Price
213,No_Alley_Access,4.0,OneFam,6.0,0.0,Rec,Unf,Typical,No,0.0,...,942.0,Pave,0.0,7.0,707.0,0.0,1907.0,1950.0,2010.0,93369.0
304,No_Alley_Access,2.0,OneFam,7.0,0.0,Unf,Unf,Typical,No,0.0,...,430.0,Pave,0.0,6.0,698.0,30.0,1920.0,1950.0,2010.0,68400.0
305,Paved,2.0,OneFam,7.0,0.0,Unf,Unf,Typical,Mn,0.0,...,319.0,Pave,0.0,7.0,859.0,68.0,1900.0,1950.0,2010.0,102776.0
306,No_Alley_Access,2.0,OneFam,7.0,0.0,Unf,Unf,Typical,Av,0.0,...,0.0,Grvl,0.0,4.0,540.0,0.0,1952.0,1952.0,2010.0,55993.0
307,No_Alley_Access,3.0,OneFam,7.0,0.0,Unf,Unf,Typical,No,0.0,...,0.0,Grvl,0.0,5.0,756.0,0.0,1896.0,1950.0,2010.0,50138.0
720,No_Alley_Access,2.0,OneFam,7.0,0.0,Unf,Unf,Typical,No,0.0,...,0.0,Pave,0.0,5.0,624.0,0.0,1910.0,1950.0,2009.0,58500.0
726,No_Alley_Access,2.0,OneFam,6.0,0.0,Rec,Unf,Typical,No,0.0,...,0.0,Pave,0.0,4.0,720.0,0.0,1920.0,1950.0,2009.0,34900.0
727,No_Alley_Access,2.0,OneFam,7.0,0.0,Unf,Unf,Typical,No,0.0,...,0.0,Pave,0.0,5.0,245.0,0.0,1900.0,1950.0,2009.0,44000.0
942,No_Alley_Access,2.0,OneFam,7.0,0.0,Unf,Unf,Fair,No,0.0,...,0.0,Pave,0.0,6.0,1013.0,0.0,1915.0,1982.0,2009.0,85000.0
943,Gravel,2.0,OneFam,7.0,0.0,Unf,Unf,Typical,No,0.0,...,0.0,Pave,0.0,4.0,572.0,0.0,1925.0,1950.0,2009.0,75000.0


As this might confuse our predictions, we want to ignore those instances.

In [27]:
# Delete all commercial, agriculture and industrial buildings
df = drop_rows_cond(df, 
          condition = ((df['MS_Zoning'] == 'C_all') | 
                       ((df['MS_Zoning'] == 'I_all') | 
                        (df['MS_Zoning'] == 'A_agr'))), 
          inplace = False)

Now, since the remaining labels divide houses into 4 categories based on density, we can treat the feature as ordinal and encode the category with the lowest density as "the best" (provided that this ordering holds when compared to our target). This will happen later on.
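
Before fixing an order, one way to check it against the target is to rank the zoning classes by their median price. This is a minimal sketch on made-up prices; the column `MS_Zoning_Ord` and the resulting ranking are assumptions for illustration, not the encoding used later in the project.

```python
import pandas as pd

# Toy data with the four residential labels (prices are made up)
toy = pd.DataFrame({
    "MS_Zoning": ["Residential_Low_Density", "Residential_Low_Density",
                  "Residential_Medium_Density", "Residential_Medium_Density",
                  "Residential_High_Density", "Floating_Village_Residential"],
    "Sale_Price": [200000.0, 180000.0, 130000.0, 120000.0, 110000.0, 220000.0],
})

# Median price per zoning class, from cheapest to most expensive
median_by_zone = toy.groupby("MS_Zoning")["Sale_Price"].median().sort_values()

# Rank 0 = lowest median price; the order is derived from the data, not prescribed
zone_rank = {zone: rank for rank, zone in enumerate(median_by_zone.index)}
toy["MS_Zoning_Ord"] = toy["MS_Zoning"].map(zone_rank)
```

If the ranking obtained on the real data does not follow density, the feature is probably better treated as nominal.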

### Subclass


In [28]:
df['MS_SubClass'].unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

As `MS_SubClass` looks like a mixture of building year, building type and other features, it seems to have little relevance here and is safe to ignore.

In [29]:
df.drop(columns='MS_SubClass',inplace=True)

### Living area related features

#### Above ground area

As far as the living space is concerned, we can see that `Gr_Liv_Area` is exactly the sum of `First_Flr_SF`, `Second_Flr_SF` and `Low_Qual_Fin_SF`.

In [30]:
# Dataset partition with just the columns that include those substrings
liv = df[get_cols(df, 'AbvGr') + 
         get_cols(df, 'Flr') + 
         get_cols(df, 'Liv') + 
         get_cols(df, 'Low_Qual_Fin_SF') + 
         get_cols(df, 'Bsmt') + 
         get_cols(df, 'Price')]

# Coherence
cond_total_error = (df['Gr_Liv_Area'] != (df['First_Flr_SF'] + df['Second_Flr_SF'] + df['Low_Qual_Fin_SF']))

liv[cond_total_error]

Unnamed: 0,Bedroom_AbvGr,Kitchen_AbvGr,TotRms_AbvGrd,First_Flr_SF,Second_Flr_SF,Gr_Liv_Area,Low_Qual_Fin_SF,BsmtFin_SF_1,BsmtFin_SF_2,BsmtFin_Type_1,BsmtFin_Type_2,Bsmt_Cond,Bsmt_Exposure,Bsmt_Full_Bath,Bsmt_Half_Bath,Bsmt_Qual,Bsmt_Unf_SF,Total_Bsmt_SF,Sale_Price


The data is coherent here.

We can add a column to represent the ratio `Low_Qual_Fin_SF/Gr_Liv_Area` to give us an idea of how that affects the price.

In [31]:
df['LowQ_Total_Liv_Ratio'] = df.apply(lambda x: 0.0 if (x['Low_Qual_Fin_SF'] <= 0.0) else (x['Low_Qual_Fin_SF']/x['Gr_Liv_Area']), axis=1)
df_original['LowQ_Total_Liv_Ratio'] = df_original.apply(lambda x: 0.0 if (x['Low_Qual_Fin_SF'] <= 0.0) else (x['Low_Qual_Fin_SF']/x['Gr_Liv_Area']), axis=1)
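
The same ratio can also be computed without a row-wise `apply`. The sketch below is an equivalent vectorized version on toy values (the numbers are made up); it only divides where `Low_Qual_Fin_SF` is positive, so a zero `Gr_Liv_Area` never reaches the division.

```python
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    "Low_Qual_Fin_SF": [0.0, 120.0, 0.0],
    "Gr_Liv_Area": [1500.0, 1200.0, 0.0],
})

# Rows with some low-quality finished area
has_low_qual = toy["Low_Qual_Fin_SF"] > 0.0

# Default to 0.0, then fill the ratio only where it is defined
toy["LowQ_Total_Liv_Ratio"] = 0.0
toy.loc[has_low_qual, "LowQ_Total_Liv_Ratio"] = (
    toy.loc[has_low_qual, "Low_Qual_Fin_SF"] / toy.loc[has_low_qual, "Gr_Liv_Area"]
)
```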

#### Bedrooms

There are some houses that lack bedrooms above ground, which should almost never be the case, considering that bathrooms are excluded from the count of the rooms above ground.

In [32]:
# Rows that report no bedrooms above ground
no_bedroom_abvgr = (df['Bedroom_AbvGr'] <=0)

liv[no_bedroom_abvgr]

Unnamed: 0,Bedroom_AbvGr,Kitchen_AbvGr,TotRms_AbvGrd,First_Flr_SF,Second_Flr_SF,Gr_Liv_Area,Low_Qual_Fin_SF,BsmtFin_SF_1,BsmtFin_SF_2,BsmtFin_Type_1,BsmtFin_Type_2,Bsmt_Cond,Bsmt_Exposure,Bsmt_Full_Bath,Bsmt_Half_Bath,Bsmt_Qual,Bsmt_Unf_SF,Total_Bsmt_SF,Sale_Price
158,0.0,2.0,4.0,1056.0,0.0,1056.0,0.0,3.0,0.0,GLQ,Unf,Typical,No,2.0,0.0,Typical,0.0,1056.0,144000.0
232,0.0,1.0,4.0,1332.0,192.0,1524.0,0.0,3.0,0.0,GLQ,Unf,Typical,Gd,2.0,0.0,Good,74.0,1332.0,260000.0
999,0.0,1.0,5.0,1593.0,0.0,1593.0,0.0,3.0,0.0,GLQ,Unf,Typical,Av,1.0,0.0,Excellent,440.0,1593.0,286000.0
1385,0.0,2.0,6.0,1258.0,0.0,1258.0,0.0,3.0,0.0,GLQ,Unf,Typical,Av,2.0,0.0,Good,0.0,1198.0,108959.0
2118,0.0,1.0,5.0,1743.0,0.0,1743.0,0.0,4.0,915.0,LwQ,GLQ,Good,Gd,2.0,0.0,Good,0.0,966.0,279000.0
2279,0.0,1.0,3.0,936.0,0.0,936.0,0.0,6.0,904.0,Rec,GLQ,Typical,Av,2.0,0.0,Excellent,0.0,920.0,140000.0
2522,0.0,1.0,5.0,1842.0,0.0,1842.0,0.0,3.0,0.0,GLQ,Unf,Typical,Gd,2.0,0.0,Excellent,32.0,1842.0,385000.0
2723,0.0,1.0,3.0,960.0,0.0,960.0,0.0,3.0,0.0,GLQ,Unf,Good,Av,1.0,1.0,Typical,0.0,648.0,145000.0


We can see that these rows report 0 bedrooms above ground, but that cannot be true. If we suppose that all the houses have at least one living room, and that bathrooms are not counted as living space, there should be at least one bedroom per house. So we will correct them with the median of each cluster of houses with the same number of rooms above ground.

In [33]:
# describe() includes the median (the '50%' column, at position 5 of each row)
desc = liv.loc[:,['TotRms_AbvGrd', 'Bedroom_AbvGr']].groupby(by='TotRms_AbvGrd').describe()
# The index is the count of rooms above ground
index = desc.index.tolist()
# For each entry
for i in index:
    # Substitute entry if it has no bedroom above ground
    new_val = np.asarray(desc[i:i])[0][5]
    df['Bedroom_AbvGr'] = df.apply(lambda x: new_val if((x['TotRms_AbvGrd']== i) & (x['Bedroom_AbvGr']== 0.0)) else x['Bedroom_AbvGr'] ,axis=1)
    pass
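
A more compact way to express the same idea is a `groupby(...).transform("median")`. The sketch below is a variant, not the cell above: it assumes we also want to exclude the impossible zeros from the median, which the loop above does not do. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    "TotRms_AbvGrd": [4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
    "Bedroom_AbvGr": [2.0, 0.0, 2.0, 3.0, 3.0, 0.0],
})

# Hide the impossible zeros so that they do not influence the group median
bedrooms = toy["Bedroom_AbvGr"].replace(0.0, np.nan)

# Median number of bedrooms among houses with the same room count
group_median = bedrooms.groupby(toy["TotRms_AbvGrd"]).transform("median")

# Fill only the hidden entries with the median of their group
toy["Bedroom_AbvGr"] = bedrooms.fillna(group_median)
```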

#### Bathrooms

In [34]:
df['Total_Bath'] = (df['Full_Bath'] + (df['Half_Bath'] * 0.5))
df_original['Total_Bath'] = (df_original['Full_Bath'] + (df_original['Half_Bath'] * 0.5))

### Basement related features

First of all we want to check if the data is coherent.

In [35]:
df['Bsmt_Err'] = df['Total_Bsmt_SF'] - df['BsmtFin_SF_1'] - df['BsmtFin_SF_2'] - df['Bsmt_Unf_SF']

# This should result in no entry
cond_total_bsmt_error = (df['Total_Bsmt_SF'] != (df['BsmtFin_SF_1'] + df['BsmtFin_SF_2'] + df['Bsmt_Unf_SF']))
# Possible garages (the label found in the data is 'Attchd', so 'Attachd' matches no rows)
possible_garage = ((df['Garage_Type'] == 'Attachd') | (df['Garage_Type'] == 'BuiltIn') | (df['Garage_Type'] ==  'Basment'))

These rows present a mismatch in the data, and their errors are pretty big.

In [36]:
df[cond_total_bsmt_error][get_cols(df, 'Bsmt_Err')+get_cols(df, 'Total_Bsmt') + get_cols(df, 'Bsmt_Unf') + get_cols(df, 'BsmtFin_SF') + get_cols(df, 'Garage_Area') + get_cols(df, 'Garage_Type')]

Unnamed: 0,Bsmt_Err,Total_Bsmt_SF,Bsmt_Unf_SF,BsmtFin_SF_1,BsmtFin_SF_2,Garage_Area,Garage_Type
0,637.0,1080.0,441.0,2.0,0.0,528.0,Attchd
1,462.0,882.0,270.0,6.0,144.0,730.0,Attchd
2,922.0,1329.0,406.0,1.0,0.0,312.0,Attchd
3,1064.0,2110.0,1045.0,1.0,0.0,522.0,Attchd
4,788.0,928.0,137.0,3.0,0.0,482.0,Attchd
...,...,...,...,...,...,...,...
2925,816.0,1003.0,184.0,3.0,0.0,588.0,Detchd
2926,299.0,864.0,239.0,2.0,324.0,484.0,Attchd
2927,334.0,912.0,575.0,3.0,0.0,0.0,No_Garage
2928,1070.0,1389.0,195.0,1.0,123.0,418.0,Attchd


Even if we adjusted the rows whose basement could include the garage area, it would still be hard to tell how to use the remaining ones.

In [37]:
df[cond_total_bsmt_error & possible_garage][get_cols(df, 'Bsmt_Err')+get_cols(df, 'Total_Bsmt') + get_cols(df, 'Bsmt_Unf') + get_cols(df, 'BsmtFin_SF') + get_cols(df, 'Garage_Area') + get_cols(df, 'Garage_Type')]

Unnamed: 0,Bsmt_Err,Total_Bsmt_SF,Bsmt_Unf_SF,BsmtFin_SF_1,BsmtFin_SF_2,Garage_Area,Garage_Type
15,1415.0,1650.0,234.0,1.0,0.0,841.0,BuiltIn
16,424.0,559.0,132.0,3.0,0.0,492.0,Basment
50,-7.0,764.0,764.0,7.0,0.0,474.0,BuiltIn
54,323.0,384.0,58.0,3.0,0.0,400.0,BuiltIn
55,622.0,860.0,235.0,3.0,0.0,440.0,BuiltIn
...,...,...,...,...,...,...,...
2824,-7.0,1150.0,1150.0,7.0,0.0,502.0,BuiltIn
2825,544.0,547.0,0.0,3.0,0.0,525.0,Basment
2826,544.0,547.0,0.0,3.0,0.0,525.0,Basment
2827,544.0,547.0,0.0,3.0,0.0,525.0,Basment


One big issue here is that the data does not match, probably because we do not know how much space the bathrooms and the bedrooms (or the possible basement garages) take up in the basement. This is one of the cases where we cannot do much about it.

On top of that, it is evident that we have some errors, even though the garage area is not part of the computation!

Furthermore, consider `BsmtFin_Type_x` and `BsmtFin_SF_x`, which are strictly related to each other:
* `BsmtFin_Type_x` refers to the *Rating of basement finished area (if multiple types)*
* `BsmtFin_SF_x` refers to the *Type x finished square feet*

For a house with `BsmtFin_SF_x == 0.0`, we should coherently see either `'No_Basement'` or `'Unf'` for `BsmtFin_Type_x`.
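
The rule above can be written as a boolean mask. This is a minimal sketch on toy rows (made-up values), shown for the first finished type only; the names `zero_area`, `allowed` and `incoherent` are introduced here for illustration.

```python
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    "BsmtFin_SF_1": [0.0, 0.0, 350.0, 0.0],
    "BsmtFin_Type_1": ["Unf", "No_Basement", "GLQ", "Rec"],
})

# No finished area of type 1
zero_area = toy["BsmtFin_SF_1"] == 0.0

# Labels that are coherent with a zero finished area
allowed = toy["BsmtFin_Type_1"].isin(["No_Basement", "Unf"])

# Rows that break the rule: zero area but a "finished" rating
incoherent = toy[zero_area & ~allowed]
```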

To double check we insert a new feature which might be of use later.

In [38]:
df['Bsmt'] = df.apply(lambda x: 1 if (x['Total_Bsmt_SF']>0.0) else 0 ,axis=1)
df_original['Bsmt'] = df_original.apply(lambda x: 1 if (x['Total_Bsmt_SF']>0.0) else 0 ,axis=1)

The following rows have an issue: they get an arbitrary value while being labeled as having no basement whatsoever.

In [39]:
df[df['BsmtFin_Type_1']== 'No_Basement'].loc[:,['BsmtFin_Type_1', 'BsmtFin_SF_1', 'Bsmt_Unf_SF', 'Total_Bsmt_SF']].groupby(by = 'BsmtFin_SF_1').describe()

Unnamed: 0_level_0,Bsmt_Unf_SF,Bsmt_Unf_SF,Bsmt_Unf_SF,Bsmt_Unf_SF,Bsmt_Unf_SF,Bsmt_Unf_SF,Bsmt_Unf_SF,Bsmt_Unf_SF,Total_Bsmt_SF,Total_Bsmt_SF,Total_Bsmt_SF,Total_Bsmt_SF,Total_Bsmt_SF,Total_Bsmt_SF,Total_Bsmt_SF,Total_Bsmt_SF
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
BsmtFin_SF_1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
0.0,1.0,0.0,,0.0,0.0,0.0,0.0,0.0,1.0,0.0,,0.0,0.0,0.0,0.0,0.0
5.0,75.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [40]:
# Errors
df[(df['BsmtFin_SF_1']==5.0) & 
     (df['BsmtFin_Type_1'] == 'No_Basement') & 
     (df['Bsmt_Unf_SF'] == 0.0) & 
     (df['Total_Bsmt_SF'] == 0.0) & 
     ((df['Bsmt_Full_Bath'] == 0.0) | (df['Bsmt_Half_Bath'] == 0.0 ))][get_cols(df, 'Bsmt')]

Unnamed: 0,BsmtFin_SF_1,BsmtFin_SF_2,BsmtFin_Type_1,BsmtFin_Type_2,Bsmt_Cond,Bsmt_Exposure,Bsmt_Full_Bath,Bsmt_Half_Bath,Bsmt_Qual,Bsmt_Unf_SF,Total_Bsmt_SF,Bsmt_Err,Bsmt
83,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0
154,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0
206,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0
243,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0
273,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2702,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0
2706,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0
2739,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0
2744,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,0.0,No_Basement,0.0,0.0,-5.0,0


There should not be houses that claim to have `5.0` finished square feet of basement when the basement is not there. I decided to replace their values with `0.0`, since there are five features corroborating my hypothesis.

Probably an error was made upon insertion, and since the primary basement is the first one we swap their values and set the second basement type as unfinished, because there is still a great deal of area left unfinished here.

Let us finish checking the other possible cases.

In [41]:
# These rows should have no basement.
cond_no_bsmt = ((df['Bsmt_Cond'] == 'No_Basement')|(df['Bsmt_Qual'] == 'No_Basement'))

# These rows have some unfinished basements
cond_unf_bsmt = ((df['BsmtFin_Type_1'] == 'Unf')|(df['BsmtFin_Type_2'] == 'Unf'))

# These rows have some finished basements
cond_fin_bsmt = ((df['BsmtFin_SF_1'] > 0.0) | (df['BsmtFin_SF_2'] > 0.0))

In [42]:
# If it is finished, it must exist
df[(cond_no_bsmt & cond_fin_bsmt)]

Unnamed: 0,Alley,Bedroom_AbvGr,Bldg_Type,BsmtFin_SF_1,BsmtFin_SF_2,BsmtFin_Type_1,BsmtFin_Type_2,Bsmt_Cond,Bsmt_Exposure,Bsmt_Full_Bath,...,Total_Bsmt_SF,Wood_Deck_SF,Year_Built,Year_Remod_Add,Year_Sold,Sale_Price,LowQ_Total_Liv_Ratio,Total_Bath,Bsmt_Err,Bsmt
83,No_Alley_Access,4.0,Duplex,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1978.0,1978.0,2010.0,112000.0,0.0,2.0,-5.0,0
154,No_Alley_Access,2.0,OneFam,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1955.0,2007.0,2010.0,107500.0,0.0,1.0,-5.0,0
206,No_Alley_Access,3.0,TwoFmCon,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1930.0,1950.0,2010.0,55000.0,0.0,2.0,-5.0,0
243,No_Alley_Access,2.0,OneFam,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1946.0,2006.0,2010.0,84900.0,0.0,1.0,-5.0,0
273,No_Alley_Access,2.0,OneFam,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1945.0,2007.0,2010.0,84900.0,0.0,1.0,-5.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2702,No_Alley_Access,2.0,OneFam,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1935.0,1950.0,2006.0,125000.0,0.0,1.0,-5.0,0
2706,No_Alley_Access,2.0,Duplex,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1967.0,1967.0,2006.0,90000.0,0.0,2.0,-5.0,0
2739,No_Alley_Access,3.0,OneFam,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1952.0,2002.0,2006.0,135000.0,0.0,1.0,-5.0,0
2744,No_Alley_Access,3.0,OneFam,5.0,0.0,No_Basement,No_Basement,No_Basement,No_Basement,0.0,...,0.0,0.0,1954.0,1954.0,2006.0,93000.0,0.0,1.0,-5.0,0


In [43]:
# If it is unfinished, it still exists
df[(cond_no_bsmt & cond_unf_bsmt)]

Unnamed: 0,Alley,Bedroom_AbvGr,Bldg_Type,BsmtFin_SF_1,BsmtFin_SF_2,BsmtFin_Type_1,BsmtFin_Type_2,Bsmt_Cond,Bsmt_Exposure,Bsmt_Full_Bath,...,Total_Bsmt_SF,Wood_Deck_SF,Year_Built,Year_Remod_Add,Year_Sold,Sale_Price,LowQ_Total_Liv_Ratio,Total_Bath,Bsmt_Err,Bsmt


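We have determined that these four features (`BsmtFin_SF_1`, `BsmtFin_SF_2`, `BsmtFin_Type_1`, `BsmtFin_Type_2`) might confound our prediction, so it is better to discard them, together with the helper column `Bsmt_Err`.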

In [44]:
df.drop(columns=['BsmtFin_SF_1', 'BsmtFin_SF_2', 'BsmtFin_Type_1', 'BsmtFin_Type_2','Bsmt_Err'], inplace=True)

We can create a new variable to take into account all the baths in the basement.

In [45]:
df['Bsmt_Total_Bath'] = (df['Bsmt_Full_Bath']+(df['Bsmt_Half_Bath']*0.5))
df_original['Bsmt_Total_Bath'] = (df_original['Bsmt_Full_Bath']+(df_original['Bsmt_Half_Bath']*0.5))

Now that we have established these features, we can create a new one for the total square footage of the house, including the basement, and one for all the bathrooms.

In [46]:
df['Total_SF'] = df['Total_Bsmt_SF']+df['Gr_Liv_Area']
df_original['Total_SF'] = df_original['Total_Bsmt_SF']+df_original['Gr_Liv_Area']

df['Baths'] = df['Total_Bath']+df['Bsmt_Total_Bath']
df_original['Baths'] = df_original['Total_Bath']+df_original['Bsmt_Total_Bath']

We are done with this part, so let us continue.

### Garage Related Features

Again we cannot have ghost garages and we need to correct those instances!

In [47]:
# No garage
cond_no_garage = ((df['Garage_Qual'] == 'No_Garage')|(df['Garage_Cond'] == 'No_Garage'))
# Garage with some area
cond_has_area = (df['Garage_Area'] > 0.0)

In [48]:
# This should not happen
df[get_cols(df, 'Garage')][(cond_no_garage & cond_has_area)]

Unnamed: 0,Garage_Area,Garage_Cars,Garage_Cond,Garage_Finish,Garage_Qual,Garage_Type
1356,360.0,1.0,No_Garage,No_Garage,No_Garage,Detchd


We want to use some average values, just to be sure that this row does not affect our prediction much.

In [49]:
df.at[1356, 'Garage_Cond'] = 'Typical'
df.at[1356, 'Garage_Finish'] = 'Fin'
df.at[1356, 'Garage_Qual'] = 'Typical'
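
Instead of hard-coding the replacement labels, an "average" value could also be taken from the data. The sketch below is one possible approach, not the one used in the cell above: it fills the ghost garage with the most frequent quality among real garages of the same type. The toy rows are made up.

```python
import pandas as pd

# Toy data: the last row is a "ghost garage" (values are made up)
toy = pd.DataFrame({
    "Garage_Type": ["Detchd", "Detchd", "Detchd", "Attchd", "Detchd"],
    "Garage_Qual": ["Typical", "Typical", "Fair", "Good", "No_Garage"],
    "Garage_Area": [280.0, 300.0, 240.0, 500.0, 360.0],
})

# Labeled as having no garage, yet with a positive area
ghost = (toy["Garage_Qual"] == "No_Garage") & (toy["Garage_Area"] > 0.0)

# Most frequent quality among the real garages of each type
real = toy[~ghost & (toy["Garage_Qual"] != "No_Garage")]
mode_by_type = real.groupby("Garage_Type")["Garage_Qual"].agg(lambda s: s.mode().iloc[0])

# Fill the ghost rows with the mode of their garage type
toy.loc[ghost, "Garage_Qual"] = toy.loc[ghost, "Garage_Type"].map(mode_by_type)
```

The same pattern would apply to `Garage_Cond` and `Garage_Finish`.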

### External related features

As far as the surroundings of the house are concerned, we want to have a way to check the presence of external structures such as a porch.

#### Porch and deck

In [50]:
ext_sub = df[get_cols(df, 'Porch')+get_cols(df, 'porch')+ get_cols(df, 'Deck') + get_cols(df, 'Sale_Price')]
ext_sub

Unnamed: 0,Enclosed_Porch,Open_Porch_SF,Screen_Porch,Three_season_porch,Wood_Deck_SF,Sale_Price
0,0.0,62.0,0.0,0.0,210.0,215000.0
1,0.0,0.0,120.0,0.0,140.0,105000.0
2,0.0,36.0,0.0,0.0,393.0,172000.0
3,0.0,0.0,0.0,0.0,0.0,244000.0
4,0.0,34.0,0.0,0.0,212.0,189900.0
...,...,...,...,...,...,...
2925,0.0,0.0,0.0,0.0,120.0,142500.0
2926,0.0,0.0,0.0,0.0,164.0,131000.0
2927,0.0,32.0,0.0,0.0,80.0,132000.0
2928,0.0,38.0,0.0,0.0,240.0,170000.0


We want to add a new feature to include the area covered by the external structures around the house which might be of use later.

In [51]:
df['External_SF'] = (df['Enclosed_Porch'] + df['Open_Porch_SF'] + 
                     df['Screen_Porch']+ df['Three_season_porch'] + 
                     df['Wood_Deck_SF'])
df_original['External_SF'] = (df_original['Enclosed_Porch'] + df_original['Open_Porch_SF'] + 
                              df_original['Screen_Porch']+ df_original['Three_season_porch'] + 
                              df_original['Wood_Deck_SF'])

And we want to remove the following ones, as most houses show little evidence of them and we have just combined them into a single feature.

In [52]:
df.drop(columns=['Enclosed_Porch', 'Screen_Porch', 'Three_season_porch', 'Wood_Deck_SF', 'Open_Porch_SF'], inplace=True)
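
The sum and the drop can also be driven by a single list of column names, which keeps the two steps in sync. This is a small sketch using the first two rows shown in the table above; the list name `external_cols` is introduced here for illustration. Note that `sum(axis=1)` skips missing values by default, unlike the `+` operator, which should make no difference here since these columns have no nulls.

```python
import pandas as pd

# First two rows of the porch/deck columns shown above
toy = pd.DataFrame({
    "Enclosed_Porch": [0.0, 0.0],
    "Open_Porch_SF": [62.0, 0.0],
    "Screen_Porch": [0.0, 120.0],
    "Three_season_porch": [0.0, 0.0],
    "Wood_Deck_SF": [210.0, 140.0],
})

external_cols = ["Enclosed_Porch", "Open_Porch_SF", "Screen_Porch",
                 "Three_season_porch", "Wood_Deck_SF"]

# Total area of the external structures, row by row
toy["External_SF"] = toy[external_cols].sum(axis=1)

# Drop the components once they have been combined
toy = toy.drop(columns=external_cols)
```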

#### Lot Area

In [53]:
df[['Lot_Area', 'Total_SF', 'External_SF', 'Garage_Area', 'Pool_Area']]

Unnamed: 0,Lot_Area,Total_SF,External_SF,Garage_Area,Pool_Area
0,31770.0,2736.0,272.0,528.0,0.0
1,11622.0,1778.0,260.0,730.0,0.0
2,14267.0,2658.0,429.0,312.0,0.0
3,11160.0,4220.0,0.0,522.0,0.0
4,13830.0,2557.0,246.0,482.0,0.0
...,...,...,...,...,...
2925,7937.0,2006.0,120.0,588.0,0.0
2926,8885.0,1766.0,164.0,484.0,0.0
2927,10441.0,1882.0,112.0,0.0,0.0
2928,10010.0,2778.0,278.0,418.0,0.0


We cannot determine how much free land is left in the lot area, as we do not really have any way to compute the space taken by the ground floor from the given features.

#### Mas Vnr Area

In [54]:
mas_sub = df[get_cols(df, 'Mas')+get_cols(df, 'Sale_Price')]
mas_sub

Unnamed: 0,Mas_Vnr_Area,Mas_Vnr_Type,Sale_Price
0,112.0,Stone,215000.0
1,0.0,,105000.0
2,108.0,BrkFace,172000.0
3,0.0,,244000.0
4,0.0,,189900.0
...,...,...,...
2925,0.0,,142500.0
2926,0.0,,131000.0
2927,0.0,,132000.0
2928,0.0,,170000.0


In [55]:
mas_cond1= ((df['Mas_Vnr_Area'] != 0.0) & (df['Mas_Vnr_Type'] == 'None'))

mas_cond2= ((df['Mas_Vnr_Area'] == 0.0) & (df['Mas_Vnr_Type'] != 'None'))

mas_sub[mas_cond1]

Unnamed: 0,Mas_Vnr_Area,Mas_Vnr_Type,Sale_Price
363,344.0,,225000.0
403,312.0,,125000.0
441,285.0,,324000.0
1861,1.0,,190000.0
1913,1.0,,114500.0
2003,1.0,,104500.0
2528,288.0,,165150.0


In [56]:
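# An area of 1.0 with no veneer type: set the area to 0.0
df.at[1861, 'Mas_Vnr_Area'] = 0.0
df.at[1913, 'Mas_Vnr_Area'] = 0.0
df.at[2003, 'Mas_Vnr_Area'] = 0.0

# Drop the remaining rows that report a veneer area but no veneer type
df.drop(df[(df['Mas_Vnr_Area']>0.0)&(df['Mas_Vnr_Type']=='None')].index.tolist(), inplace=True)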

In [57]:
mas_sub = df[get_cols(df, 'Mas')+get_cols(df, 'Sale_Price')]

In [58]:
mas_sub[mas_cond2]

Unnamed: 0,Mas_Vnr_Area,Mas_Vnr_Type,Sale_Price
1640,0.0,BrkFace,392000.0
1740,0.0,BrkFace,173500.0
1785,0.0,Stone,248328.0


In [59]:
mas_sub.groupby(by = 'Mas_Vnr_Type').describe()

Unnamed: 0_level_0,Mas_Vnr_Area,Mas_Vnr_Area,Mas_Vnr_Area,Mas_Vnr_Area,Mas_Vnr_Area,Mas_Vnr_Area,Mas_Vnr_Area,Mas_Vnr_Area,Sale_Price,Sale_Price,Sale_Price,Sale_Price,Sale_Price,Sale_Price,Sale_Price,Sale_Price
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Mas_Vnr_Type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
BrkCmn,25.0,195.48,160.361082,40.0,67.0,161.0,250.0,621.0,25.0,140199.0,41838.296567,61500.0,118500.0,139000.0,158900.0,277000.0
BrkFace,880.0,261.646591,210.217183,0.0,120.0,203.0,340.0,1600.0,880.0,210798.592045,84861.039301,75000.0,152750.0,186900.0,253073.25,755000.0
CBlock,1.0,198.0,,198.0,198.0,198.0,198.0,198.0,1.0,80000.0,,80000.0,80000.0,80000.0,80000.0,80000.0
,1742.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1742.0,156532.257176,57702.234014,12789.0,120906.25,144000.0,184000.0,745000.0
Stone,249.0,239.550201,180.09155,0.0,120.0,200.0,300.0,1224.0,249.0,260547.297189,104490.683371,97500.0,184750.0,245000.0,315500.0,615000.0


In [60]:
# We can correct these by inferring some values coherent with the price and area for that type
df.at[1640, 'Mas_Vnr_Area'] = 420.0
df.at[1740, 'Mas_Vnr_Area'] = 180.0
df.at[1785, 'Mas_Vnr_Area'] = 200.0
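
A data-driven alternative to choosing these values by hand is to impute the missing area with the median area of the same veneer type. The sketch below is a simplification, since it ignores the price, which the manual correction above takes into account. The toy values are made up.

```python
import pandas as pd

# Toy data: rows 2 and 5 have a veneer type but no area (values are made up)
toy = pd.DataFrame({
    "Mas_Vnr_Type": ["BrkFace", "BrkFace", "BrkFace", "Stone", "Stone", "Stone"],
    "Mas_Vnr_Area": [120.0, 200.0, 0.0, 180.0, 220.0, 0.0],
})

# A veneer type is declared, but the area is zero
missing_area = (toy["Mas_Vnr_Area"] == 0.0) & (toy["Mas_Vnr_Type"] != "None")

# Median area per veneer type, computed on the coherent rows only
median_area = toy[~missing_area].groupby("Mas_Vnr_Type")["Mas_Vnr_Area"].median()

# Fill the incoherent rows with the median of their type
toy.loc[missing_area, "Mas_Vnr_Area"] = toy.loc[missing_area, "Mas_Vnr_Type"].map(median_area)
```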

### Fireplace related features

The data about the fireplaces is accurate, so there is no need for further investigation.

In [61]:
df[get_cols(df, 'Fire')][(df['Fireplace_Qu'] == 'No_Fireplace') & (df['Fireplaces'] != 0.0)]

Unnamed: 0,Fireplace_Qu,Fireplaces


### Sale condition related features

The condition of sale impacts the price very heavily. First of all, we need to consider abnormal sales.

> [Abnormal Sale](https://payrollheaven.com/define/abnormal-sale/) is described as "*A sale that does not represent a market transaction*"

Then, transactions between family members are trivially problematic, along with the adjoining land transactions, whose subject is the land bordering a pre-existing property.

From [Cracking the Ames Housing Dataset with Linear Regression](https://towardsdatascience.com/wrangling-through-dataland-modeling-house-prices-in-ames-iowa-75b9b4086c96) we learn that it might be a good idea to remove certain instances, which might translate into reduced bias for the prediction.

Let us look at the data more closely.

In [62]:
df[get_cols(df, 'Sale')].groupby(by = 'Sale_Condition').describe()

Unnamed: 0_level_0,Sale_Price,Sale_Price,Sale_Price,Sale_Price,Sale_Price,Sale_Price,Sale_Price,Sale_Price
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Sale_Condition,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Abnorml,179.0,144646.608939,80439.202615,12789.0,105250.0,131000.0,160950.0,745000.0
AdjLand,12.0,108916.666667,21988.461161,81000.0,89500.0,110000.0,126375.0,150000.0
Alloca,22.0,171732.636364,66942.18181,89471.0,118884.5,152123.0,204881.0,359100.0
Family,46.0,157488.586957,63376.521175,79275.0,121025.0,144400.0,174000.0,409900.0
Normal,2393.0,176115.81613,70776.708366,35000.0,130000.0,159000.0,207000.0,755000.0
Partial,245.0,273374.371429,100001.455502,113000.0,195400.0,250000.0,336820.0,611657.0


The difference is evident, but we will save further analysis for later. Let us remove both family and abnormal transactions. 

In [63]:
drop_rows_cond(dataset=df, 
               condition = (
                   (df['Sale_Condition'] == 'Abnorml') | 
                   (df['Sale_Condition'] == 'Family')), 
               inplace = True)
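
For readers without the `drop_rows_cond` helper at hand, the same filter can be sketched in plain pandas with `isin`. The toy rows below are made up, and the names `excluded` and `filtered` are introduced for this example.

```python
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    "Sale_Condition": ["Normal", "Abnorml", "Family", "Partial", "Normal"],
    "Sale_Price": [176000.0, 131000.0, 144400.0, 250000.0, 159000.0],
})

# Sale conditions that do not represent a regular market transaction
excluded = ["Abnorml", "Family"]

# Keep every row whose condition is not in the excluded list
filtered = toy[~toy["Sale_Condition"].isin(excluded)].reset_index(drop=True)
```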

---

### To sum up what we did so far, plus some ideas for later

1. Removed these features:
    * `Misc_Val`, `Misc_Feature` are not specific, nor are they valid for every entry. The result of a one-hot-encoding would be confusing and misleading.
    * `Utilities`, `Functional`, `Condition_1`, `Condition_2`, `Exterior_1st`, `Exterior_2nd`, `Foundation`, `MS_SubClass`,
    `Enclosed_Porch`, `Screen_Porch`, `Three_season_porch`, `Wood_Deck_SF`, `Open_Porch_SF`, as there is low evidence, or only few classes have relevant instances, or they
    do not look relevant at all.
    * Some other indicators that might not be needed later on (for later).
    * `BsmtFin_SF_1`, `BsmtFin_SF_2`, `BsmtFin_Type_1`, `BsmtFin_Type_2`, `Bsmt_Err`, as they were incoherent for most entries.
2. Removed some instances
    * In `MS_Zoning` we find buildings that are not residential and can be considered as confounders.
    * In `Sale_Condition` we find abnormal and family transactions, which do not represent regular market sales.
    * Others that were not coherent.
3. Added some features (for later contribution analysis)
    * `LowQ_Total_Liv_Ratio`, `External_SF`, `Total_Bsmt_Fin_SF` (removed), `Bsmt`, `Total_SF`, `Total_Bath`, `Bsmt_Total_Bath`, `Baths`
4. Checked the coherence of the data, modified/substituted/imputed values when needed
5. Binary, Ordinal and Categorical encoding (for later)
    * Many features can be interpreted as numeric values that can contribute to prediction.
    * The quality measures, for example, can be encoded and then multiplied by their related numeric features, in order to obtain one final value to weight the latter better.
6. Transformations (for later)
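
To make point 5 more concrete, here is a minimal sketch of a quality measure encoded as a number and multiplied by its related numeric feature. The `quality_scale` mapping (in particular the `'Poor'` label and the chosen integers) and the `Garage_Score` column are assumptions for illustration; the actual encoding is decided later, during the EDA.

```python
import pandas as pd

# Hypothetical ordinal scale for the quality labels
quality_scale = {"No_Garage": 0, "Poor": 1, "Fair": 2,
                 "Typical": 3, "Good": 4, "Excellent": 5}

# Toy data (values are made up)
toy = pd.DataFrame({
    "Garage_Qual": ["Typical", "Good", "No_Garage"],
    "Garage_Area": [400.0, 500.0, 0.0],
})

# Ordinal encoding of the quality label
toy["Garage_Qual_Ord"] = toy["Garage_Qual"].map(quality_scale)

# Quality-weighted area: a single value that combines the two features
toy["Garage_Score"] = toy["Garage_Qual_Ord"] * toy["Garage_Area"]
```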

## The resulting dataset

The resulting dataset has the advantage of containing less incoherent data and fewer outliers. This can be confirmed through the plots, but for now we are going to skip this part and wait for the final dataset before performing another overview.

In [64]:
df = sort_alphabetically(df, 'Sale_Price')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2672 entries, 0 to 2671
Data columns (total 69 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Alley                 2672 non-null   object 
 1   Baths                 2672 non-null   float64
 2   Bedroom_AbvGr         2672 non-null   float64
 3   Bldg_Type             2672 non-null   object 
 4   Bsmt                  2672 non-null   int64  
 5   Bsmt_Cond             2672 non-null   object 
 6   Bsmt_Exposure         2672 non-null   object 
 7   Bsmt_Full_Bath        2672 non-null   float64
 8   Bsmt_Half_Bath        2672 non-null   float64
 9   Bsmt_Qual             2672 non-null   object 
 10  Bsmt_Total_Bath       2672 non-null   float64
 11  Bsmt_Unf_SF           2672 non-null   float64
 12  Central_Air           2672 non-null   object 
 13  Electrical            2672 non-null   object 
 14  Exter_Cond            2672 non-null   object 
 15  Exter_Qual           

In [65]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 88 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Alley                 2930 non-null   object 
 1   Bedroom_AbvGr         2930 non-null   float64
 2   Bldg_Type             2930 non-null   object 
 3   BsmtFin_SF_1          2930 non-null   float64
 4   BsmtFin_SF_2          2930 non-null   float64
 5   BsmtFin_Type_1        2930 non-null   object 
 6   BsmtFin_Type_2        2930 non-null   object 
 7   Bsmt_Cond             2930 non-null   object 
 8   Bsmt_Exposure         2930 non-null   object 
 9   Bsmt_Full_Bath        2930 non-null   float64
 10  Bsmt_Half_Bath        2930 non-null   float64
 11  Bsmt_Qual             2930 non-null   object 
 12  Bsmt_Unf_SF           2930 non-null   float64
 13  Central_Air           2930 non-null   object 
 14  Condition_1           2930 non-null   object 
 15  Condition_2          
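
The two `info()` outputs above show the change in size (2930 rows and 88 columns before, 2672 rows and 69 columns after). A more direct comparison could be sketched as below; `compare_frames` is a hypothetical helper, not part of the notebook, and it is demonstrated on toy frames only.

```python
import pandas as pd

def compare_frames(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Summarise removed rows and dropped/added columns between two frames."""
    before_cols, after_cols = set(before.columns), set(after.columns)
    return {
        "rows_removed": len(before) - len(after),
        "columns_dropped": sorted(before_cols - after_cols),
        "columns_added": sorted(after_cols - before_cols),
    }

# Toy frames standing in for df_original and df.
toy_before = pd.DataFrame({"Alley": ["a", "b", "c"], "BsmtFin_SF_1": [1.0, 2.0, 3.0]})
toy_after = pd.DataFrame({"Alley": ["a", "b"], "Baths": [1.5, 2.0]})

report = compare_frames(toy_before, toy_after)
print(report)
```

In the notebook this would be called as `compare_frames(df_original, df)`.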

### Save the dataset

We save the dataset locally to continue with the next part, the EDA.

In [66]:
df.to_csv(os.path.join(RESOURCES_DIR, "ames_housing_out_0.csv"))
df_original.to_csv(os.path.join(RESOURCES_DIR, "ames_housing_out_0_orig.csv"))
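
Since `to_csv` is called without `index=False`, the row index is written as an unnamed first column. A minimal sketch of how the file could be read back in the next notebook, assuming the same default settings, is shown here on a toy frame and a temporary directory rather than on `RESOURCES_DIR`.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Alley": ["Paved", "Gravel"], "Baths": [1.5, 2.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_out.csv")
    toy.to_csv(path)  # the index is written as an unnamed first column
    reloaded = pd.read_csv(path, index_col=0)  # read that column back as the index

# The reloaded frame has the same columns and shape as the saved one.
same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)
```

Without `index_col=0`, the reloaded frame would gain an extra `Unnamed: 0` column.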

#### Credits and References
* Thomas Schmitt, [house_prices](https://www.openml.org/search?type=data&sort=runs&id=42165&status=active). OpenML (2019). (accessed April 12, 2023).
* E.J. Martin, [How much is my house worth? A beginner's guide](https://www.bankrate.com/real-estate/how-much-is-my-house-worth/#faq). Bankrate (2023). (accessed April 12, 2023).
* J. Gomez, [8 critical factors that influence a home’s value](https://www.opendoor.com/articles/factors-that-influence-home-value). Opendoor (4 June 2022). (accessed April 12, 2023).
* [Ames (Iowa)](https://it.wikipedia.org/wiki/Ames_(Iowa)), Wikipedia (2020). (accessed April 12, 2023).
* Leeclemmer, [Exploratory Data Analysis of Housing in Ames, Iowa](https://www.kaggle.com/code/leeclemmer/exploratory-data-analysis-of-housing-in-ames-iowa). Kaggle (2017). (accessed April 18, 2023).
* Alvin T. Tan, [Cracking the Ames Housing Dataset with Linear Regression](https://towardsdatascience.com/wrangling-through-dataland-modeling-house-prices-in-ames-iowa-75b9b4086c96). Towards Data Science (2022). (accessed April 19, 2023).
* The Strategic Plan For North Richland Hills, [Land Use Categories](https://www.nrhtx.com/DocumentCenter/View/9222/Draft-Land-Use-Categories?bidId=). North Richland Hills, TX. (accessed April 19, 2023).
* [Abnormal Sale](https://payrollheaven.com/define/abnormal-sale/). PayrollHeaven.com. Payroll & Accounting Heaven Ltd. (accessed April 19, 2023).
