# House Price Prediction - Exploration

This analysis predicts house prices from the properties of each house, based on a sample of houses from Ames, Iowa. The dataset is taken from a [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) competition.

## Extract-Transform-Load (ETL)

In [None]:
import pandas as pd

In [None]:
houses = pd.read_csv("../data/raw/train.csv")
houses.set_index("Id", inplace=True)

In [None]:
houses.head()

We have roughly eighty numerical and categorical input features and one target variable, which contains the price of the respective house.
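To make the feature count concrete, the split into numerical and categorical columns can be derived from the column dtypes. The following is a minimal sketch on a small stand-in frame; the toy values and the helper name `count_feature_types` are illustrative assumptions, not part of the notebook. With the real data, `houses` would be passed instead.

```python
import pandas as pd


def count_feature_types(df: pd.DataFrame, target: str) -> dict:
    """Count numerical and categorical input features, excluding the target."""
    features = df.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns
    categorical = features.select_dtypes(exclude="number").columns
    return {"numerical": len(numerical), "categorical": len(categorical)}


# Toy stand-in for the real data (hypothetical values).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

feature_counts = count_feature_types(toy, target="SalePrice")
print(feature_counts)
```

Note that the dtype alone does not identify numerically coded categories such as `MSSubClass`, so these counts are only a first approximation.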

## Exploratory Data Analysis (EDA)

In [None]:
import matplotlib.pyplot as plt

To get a feel for the data, we look at the distribution of the target variable and at the relationship between the input features and the target variable. We start with the missing values and then go through the different groups of features and their distributions.

In [None]:
houses.isna().sum()[houses.isna().sum() > 0].sort_values(ascending=False)
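Absolute counts are easier to judge next to the share of affected rows. The sketch below shows one possible way to report both; the helper name `missing_summary` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for the real data (hypothetical values).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

summary = missing_summary(toy)
print(summary)
```

With the real data, `missing_summary(houses)` would list the same columns as the cell above, with the percentage added.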

### Building

First, let's take a look at the building itself. This covers the type of house, the year it was built, any renovations, and the condition of the exterior and roof. The following features are included:

- `MSSubClass`:     Identifies the type of dwelling involved in the sale.
- `BldgType`:       Type of dwelling.
- `HouseStyle`:     Style of dwelling.
- `YearBuilt`:      Original construction date.
- `YearRemodAdd`:   Remodel date (same as construction date if no remodeling or additions).
- `RoofStyle`:      Type of roof.
- `RoofMatl`:       Roof material.
- `Exterior1st`:    Exterior covering on house.
- `Exterior2nd`:    Exterior covering on house (if more than one material).
- `ExterQual`:      Exterior material quality.
- `ExterCond`:      Present condition of the material on the exterior.
- `MasVnrType`:     Masonry veneer type.
- `MasVnrArea`:     Masonry veneer area in square feet.
- `Foundation`:     Type of foundation.
- `OverallQual`:    Rates the overall material and finish of the house.
- `OverallCond`:    Rates the overall condition of the house.

In [None]:
houses[["MSSubClass",
        "BldgType",
        "HouseStyle",
        "YearBuilt",
        "YearRemodAdd",
        "RoofStyle",
        "RoofMatl",
        "Exterior1st",
        "Exterior2nd",
        "ExterQual",
        "ExterCond",
        "MasVnrType",
        "MasVnrArea",
        "Foundation"]].info()

In [None]:
houses[["MSSubClass",
        "BldgType",
        "HouseStyle",
        "YearBuilt",
        "YearRemodAdd",
        "RoofStyle",
        "RoofMatl",
        "Exterior1st",
        "Exterior2nd",
        "ExterQual",
        "ExterCond",
        "MasVnrType",
        "MasVnrArea",
        "Foundation"]].isna().sum()

In [None]:
# Assign the result back instead of using inplace=True on a column selection,
# which does not reliably modify the DataFrame in recent pandas versions.
houses["MSSubClass"] = houses["MSSubClass"].replace({
    20: "1-STORY 1946 & NEWER ALL STYLES (20)",
    30: "1-STORY 1945 & OLDER (30)",
    40: "1-STORY W/FINISHED ATTIC ALL AGES (40)",
    45: "1-1/2 STORY - UNFINISHED ALL AGES (45)",
    50: "1-1/2 STORY FINISHED ALL AGES (50)",
    60: "2-STORY 1946 & NEWER (60)",
    70: "2-STORY 1945 & OLDER (70)",
    75: "2-1/2 STORY ALL AGES (75)",
    80: "SPLIT OR MULTI-LEVEL (80)",
    85: "SPLIT FOYER (85)",
    90: "DUPLEX - ALL STYLES AND AGES (90)",
    120: "1-STORY PUD (Planned Unit Development) - 1946 & NEWER (120)",
    150: "1-1/2 STORY PUD - ALL AGES (150)",
    160: "2-STORY PUD - 1946 & NEWER (160)",
    180: "PUD - MULTILEVEL - INCL SPLIT LEV/FOYER (180)",
    190: "2 FAMILY CONVERSION - ALL STYLES AND AGES (190)"})

houses["BldgType"] = houses["BldgType"].replace({
    "1Fam": "Single-family Detached (1Fam)",
    "2FmCon": "Two-family Conversion; originally built as one-family dwelling (2FmCon)",
    "Duplx": "Duplex (Duplx)",
    "TwnhsE": "Townhouse End Unit (TwnhsE)",
    "TwnhsI": "Townhouse Inside Unit (TwnhsI)"})

houses["HouseStyle"] = houses["HouseStyle"].replace({
    "1Story": "One story (1Story)",
    "1.5Fin": "One and one-half story: 2nd level finished (1.5Fin)",
    "1.5Unf": "One and one-half story: 2nd level unfinished (1.5Unf)",
    "2Story": "Two story (2Story)",
    "2.5Fin": "Two and one-half story: 2nd level finished (2.5Fin)",
    "2.5Unf": "Two and one-half story: 2nd level unfinished (2.5Unf)",
    "SFoyer": "Split Foyer (SFoyer)",
    "SLvl": "Split Level (SLvl)"})

# Both rating columns share the same 1-10 scale, so one label mapping serves both.
rating_labels = {
    10: "Very Excellent (10)",
    9: "Excellent (9)",
    8: "Very Good (8)",
    7: "Good (7)",
    6: "Above Average (6)",
    5: "Average (5)",
    4: "Below Average (4)",
    3: "Fair (3)",
    2: "Poor (2)",
    1: "Very Poor (1)"}

houses["OverallCond"] = houses["OverallCond"].replace(rating_labels)
houses["OverallQual"] = houses["OverallQual"].replace(rating_labels)

In [None]:
overall_quality_counts = houses.groupby("OverallQual")["OverallQual"].count().sort_values(ascending=False)
overall_condition_counts = houses.groupby("OverallCond")["OverallCond"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Distribution of overall quality")
ax1.pie(overall_quality_counts)
ax1.legend(labels=overall_quality_counts.index)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("Distribution of overall condition")
ax2.pie(overall_condition_counts)
ax2.legend(labels=overall_condition_counts.index)

plt.show()
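The counting pattern used for these charts, a `groupby` followed by `count()` and `sort_values(ascending=False)`, can also be written with `Series.value_counts()`, which counts non-missing values and sorts them in descending order by default. A small sketch on made-up values shows that both give the same counts:

```python
import pandas as pd

# Toy stand-in for the labelled quality column (hypothetical values).
toy = pd.DataFrame({
    "OverallQual": [
        "Average (5)", "Good (7)", "Average (5)",
        "Above Average (6)", "Average (5)", "Good (7)",
    ]
})

# Pattern used in the notebook.
via_groupby = toy.groupby("OverallQual")["OverallQual"].count().sort_values(ascending=False)

# Shorter equivalent.
via_value_counts = toy["OverallQual"].value_counts()

print(via_value_counts)
```

Either form can be passed to `ax.pie` or `ax.barh` in the same way.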

In [None]:
house_type_counts = houses.groupby("MSSubClass")["MSSubClass"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(10, 5))

ax = fig.add_subplot(1, 1, 1)
ax.set_title("Distribution of the house types")
ax.barh(house_type_counts.index, house_type_counts)
ax.tick_params(axis="x", rotation=90)

plt.show()

In [None]:
building_type_counts = houses.groupby("BldgType")["BldgType"].count().sort_values(ascending=False)
house_style_counts = houses.groupby("HouseStyle")["HouseStyle"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Distribution of building types")
ax1.pie(building_type_counts)
ax1.legend(labels=building_type_counts.index)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("Distribution of house styles")
ax2.pie(house_style_counts)
ax2.legend(labels=house_style_counts.index)

plt.show()

In [None]:
fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Distribution of built years")
ax1.hist(houses["YearBuilt"], bins=20)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("Distribution of remodeled years")
ax2.hist(houses["YearRemodAdd"], bins=20)

plt.show()

According to the dataset documentation, `None` represents a nonexistent masonry veneer. In the data, `NA` appears to be used instead, so we replace `NA` with `None`. In addition, some values of `MasVnrArea` are missing. We drop these rows later.

In [None]:
houses["MasVnrType"] = houses["MasVnrType"].fillna("None")

### Lot/Property

In a further step, we consider the property independently of the house. This includes the total area, shape, configuration and slope of the lot.

- `LotFrontage`:    Linear feet of street connected to property.
- `LotArea`:        Lot size in square feet.
- `LotShape`:       General shape of property. 
- `LotConfig`:      Lot configuration.
- `LandContour`:    Flatness of the property.
- `LandSlope`:      Slope of property.

In [None]:
houses[["LotFrontage", "LotArea", "LotShape", "LotConfig", "LandContour", "LandSlope"]].info()

In [None]:
houses[["LotFrontage", "LotArea", "LotShape", "LotConfig", "LandContour", "LandSlope"]].isna().sum()

In [None]:
houses["LotFrontage"][houses["LotFrontage"] == 0].count()

A missing value in `LotFrontage` may represent a missing connection to the road. These could be houses that can be reached, for example, via a pedestrian path, but not by car. We will fill these missing values with zero.

In [None]:
houses["LotFrontage"] = houses["LotFrontage"].fillna(0)

In [None]:
houses["LotShape"] = houses["LotShape"].replace({
    "Reg": "Regular (Reg)",
    "IR1": "Slightly irregular (IR1)",
    "IR2": "Moderately Irregular (IR2)",
    "IR3": "Irregular (IR3)"})

houses["LotConfig"] = houses["LotConfig"].replace({
    "Inside": "Inside lot (Inside)",
    "Corner": "Corner lot (Corner)",
    "CulDSac": "Cul-de-sac (CulDSac)",
    "FR2": "Frontage on 2 sides of property (FR2)",
    "FR3": "Frontage on 3 sides of property (FR3)"})

houses["LandContour"] = houses["LandContour"].replace({
    "Lvl": "Near Flat/Level (Lvl)",
    "Bnk": "Banked - Quick and significant rise from street grade to building (Bnk)",
    "HLS": "Hillside - Significant slope from side to side (HLS)",
    "Low": "Depression (Low)"})

houses["LandSlope"] = houses["LandSlope"].replace({
    "Gtl": "Gentle slope (Gtl)",
    "Mod": "Moderate Slope (Mod)",
    "Sev": "Severe Slope (Sev)"})

In [None]:
fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Distribution of lot frontage (feet)")
ax1.hist(houses["LotFrontage"].dropna(), bins=50)
ax1.tick_params(axis="x", rotation=90)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("Distribution of lot area (square feet)")
ax2.hist(houses["LotArea"], bins=50)
ax2.tick_params(axis="x", rotation=90)

plt.show()

In [None]:
houses[["LotFrontage", "LotArea"]].dropna().describe()

In [None]:
lot_shape_counts = houses.groupby("LotShape")["LotShape"].count().sort_values(ascending=False)
lot_config_counts = houses.groupby("LotConfig")["LotConfig"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Distribution of lot shapes")
ax1.pie(lot_shape_counts)
ax1.legend(labels=lot_shape_counts.index)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("Distribution of lot configurations")
ax2.pie(lot_config_counts)
ax2.legend(labels=lot_config_counts.index)

plt.show()

In [None]:
land_slope_counts = houses.groupby("LandSlope")["LandSlope"].count().sort_values(ascending=False)
land_contour_counts = houses.groupby("LandContour")["LandContour"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Distribution of land slopes")
ax1.pie(land_slope_counts)
ax1.legend(labels=land_slope_counts.index)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("Distribution of land contours")
ax2.pie(land_contour_counts)
ax2.legend(labels=land_contour_counts.index)

plt.show()

### Utilities

Next, we consider how the house is connected: the driveway, the road and alley access, and the connection to infrastructure such as electricity and gas.

- `Street`:     Type of road access to property.
- `Alley`:      Type of alley access to property.
- `PavedDrive`: Paved driveway.
- `Utilities`:  Type of utilities available.

In [None]:
houses[["Street", "Alley", "Utilities", "PavedDrive"]].info()

In [None]:
houses[["Street", "Alley", "Utilities", "PavedDrive"]].isna().sum()

Missing values in `Alley` represent a missing alley. We will fill these missing values with `None`.

In [None]:
houses["Alley"] = houses["Alley"].fillna("None")

In [None]:
street_type_counts = houses.groupby("Street")["Street"].count().sort_values(ascending=False)
alley_type_counts = houses.groupby("Alley")["Alley"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Distribution of street types")
ax1.pie(street_type_counts)
ax1.legend(labels=street_type_counts.index)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("Distribution of alley types")
ax2.pie(alley_type_counts)
ax2.legend(labels=alley_type_counts.index)

plt.show()

In [None]:
utility_type_counts = houses.groupby("Utilities")["Utilities"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(10, 5))

ax = fig.add_subplot(1, 1, 1)
ax.set_title("Distribution of utility types")
ax.pie(utility_type_counts)
ax.legend(labels=utility_type_counts.index)

plt.show()

### Neighborhood & Location

In this step we take a look at the neighborhood and the social environment of the house.

- `MSZoning`:       Identifies the general zoning classification of the sale.
- `Neighborhood`:   Physical locations within Ames city limits.
- `Condition1`:     Proximity to various conditions.
- `Condition2`:     Proximity to various conditions (if more than one is present).

In [None]:
houses[["MSZoning", "Neighborhood", "Condition1", "Condition2"]].info()

In [None]:
houses[["MSZoning", "Neighborhood", "Condition1", "Condition2"]].isna().sum()

In [None]:
houses["MSZoning"] = houses["MSZoning"].replace({
    "A": "Agriculture (A)",
    "C (all)": "Commercial (C)",
    "FV": "Floating Village Residential (FV)",
    "I": "Industrial (I)",
    "RH": "Residential High Density (RH)",
    "RL": "Residential Low Density (RL)",
    "RP": "Residential Low Density Park (RP)",
    "RM": "Residential Medium Density (RM)"})

In [None]:
zoning_type_counts = houses.groupby("MSZoning")["MSZoning"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(10, 5))

ax = fig.add_subplot(1, 1, 1)
ax.set_title("Distribution of zoning types")
ax.barh(zoning_type_counts.index, zoning_type_counts)
ax.tick_params(axis="x", rotation=90)

plt.show()

In [None]:
neighborhood_counts = houses.groupby("Neighborhood")["Neighborhood"].count().sort_values(ascending=False)

fig = plt.figure(figsize=(10, 5))

ax = fig.add_subplot(1, 1, 1)
ax.set_title("Distribution of neighborhoods")
ax.bar(neighborhood_counts.index, neighborhood_counts)
ax.tick_params(axis="x", rotation=90)

plt.show()

### Garage

Some of the houses have a separate garage for cars and other vehicles. Let's take a look at it.

- `GarageType`:     Garage location.
- `GarageYrBlt`:    Year garage was built.
- `GarageFinish`:   Interior finish of the garage.
- `GarageCars`:     Size of garage in car capacity.
- `GarageArea`:     Size of garage in square feet.
- `GarageQual`:     Garage quality.
- `GarageCond`:     Garage condition.

In [None]:
houses[["GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea", "GarageQual", "GarageCond"]].info()

In [None]:
houses[["GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea", "GarageQual", "GarageCond"]].isna().sum()

As we can see, the garage-specific columns all have the same number of missing values. One explanation is that, in accordance with the data sheet, missing values represent missing garages. A house has several features that describe the garage. To achieve logical consistency, the garage should either exist or be absent for all features of a data point. Hence, we assume that a house has a garage whenever `GarageQual` is not missing.

In [None]:
print("Number of GarageYrBlt is 'NA' and garage exists: {}".format(houses[~pd.isna(houses["GarageQual"])]["GarageYrBlt"].isna().sum()))
print("Number of GarageFinish is 'NA' and garage exists: {}".format(houses[~pd.isna(houses["GarageQual"])]["GarageFinish"].isna().sum()))
print("Number of GarageType is 'NA' and garage exists: {}".format(houses[~pd.isna(houses["GarageQual"])]["GarageType"].isna().sum()))
print("Number of GarageCond is 'NA' and garage exists: {}".format(houses[~pd.isna(houses["GarageQual"])]["GarageCond"].isna().sum()))
print("Number of GarageCars is 'NA' and garage exists: {}".format(houses[~pd.isna(houses["GarageQual"])]["GarageCars"].isna().sum()))
print("Number of GarageArea is 'NA' and garage exists: {}".format(houses[~pd.isna(houses["GarageQual"])]["GarageArea"].isna().sum()))

If `GarageQual` is missing, the remaining garage features are filled with `0` or `None`.

In [None]:
houses.loc[houses["GarageQual"].isna(), "GarageFinish"] = "None"
houses.loc[houses["GarageQual"].isna(), "GarageType"] = "None"
houses.loc[houses["GarageQual"].isna(), "GarageCond"] = "None"
houses.loc[houses["GarageQual"].isna(), "GarageArea"] = 0
houses.loc[houses["GarageQual"].isna(), "GarageCars"] = 0
houses.loc[houses["GarageQual"].isna(), "GarageYrBlt"] = 0
houses["GarageQual"] = houses["GarageQual"].fillna("None")

In [None]:
houses[["GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea", "GarageQual", "GarageCond"]].isna().sum()
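The fill step for the garage follows a pattern that recurs for other feature groups: one indicator column decides whether the component exists, and all related columns are filled accordingly. The sketch below wraps that pattern in a helper. The function name `fill_absent_group` and the toy frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def fill_absent_group(df, indicator, categorical, numerical):
    """Fill a feature group for rows where the indicator column is missing.

    Categorical columns receive "None", numerical columns receive 0.
    Returns a new DataFrame and leaves the input untouched.
    """
    result = df.copy()
    # Compute the mask once, before any column is modified.
    absent = result[indicator].isna()
    for column in categorical:
        result.loc[absent, column] = "None"
    for column in numerical:
        result.loc[absent, column] = 0
    result[indicator] = result[indicator].fillna("None")
    return result


# Toy stand-in: the second house has no garage (hypothetical values).
toy = pd.DataFrame({
    "GarageQual": ["TA", np.nan],
    "GarageType": ["Attchd", np.nan],
    "GarageArea": [548.0, np.nan],
})

filled = fill_absent_group(
    toy, indicator="GarageQual", categorical=["GarageType"], numerical=["GarageArea"]
)
print(filled)
```

Because the mask is computed up front, the order in which the columns are filled does not matter, unlike in the cell above, where `GarageQual` has to be filled last.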

### Supplies

Next, we look at the available energy sources, the type of heating and the energy infrastructure.

- `Heating`:    Type of heating.
- `HeatingQC`:  Heating quality and condition.
- `CentralAir`: Central air conditioning.
- `Electrical`: Electrical system.

In [None]:
houses[["Heating", "HeatingQC", "CentralAir", "Electrical"]].info()

In [None]:
houses[["Heating", "HeatingQC", "CentralAir", "Electrical"]].isna().sum()

`Electrical` contains missing values. This may be because the house has no electrical connection, but the dataset documentation does not provide any information on this. We will drop these rows.

### Basement

Some houses also have a basement.

- `BsmtQual`:       Evaluates the height of the basement.
- `BsmtCond`:       Evaluates the general condition of the basement.
- `BsmtExposure`:   Refers to walkout or garden level walls.
- `BsmtFinType1`:   Rating of basement finished area.
- `BsmtFinSF1`:     Type 1 finished square feet.
- `BsmtFinType2`:   Rating of basement finished area (if multiple types).
- `BsmtFinSF2`:     Type 2 finished square feet.
- `BsmtUnfSF`:      Unfinished square feet of basement area.
- `TotalBsmtSF`:    Total square feet of basement area.
- `BsmtFullBath`:   Basement full bathrooms.
- `BsmtHalfBath`:   Basement half bathrooms.

In [None]:
houses[["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath"]].info()

In [None]:
houses[["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath"]].isna().sum()

According to the documentation, missing values can be interpreted as missing basements. A house has several features that describe the basement. To achieve logical consistency, the basement should either exist or be absent for all features of a data point. Hence, we assume that a house has a basement whenever `BsmtQual` is not missing.

In [None]:
print("Number of BsmtCond is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtCond"].isna().sum()))
print("Number of BsmtExposure is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtExposure"].isna().sum()))
print("Number of BsmtFinType1 is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtFinType1"].isna().sum()))
print("Number of BsmtFinType2 is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtFinType2"].isna().sum()))
print("Number of BsmtFinSF1 is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtFinSF1"].isna().sum()))
print("Number of BsmtFinSF2 is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtFinSF2"].isna().sum()))
print("Number of BsmtUnfSF is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtUnfSF"].isna().sum()))
print("Number of TotalBsmtSF is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["TotalBsmtSF"].isna().sum()))
print("Number of BsmtFullBath is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtFullBath"].isna().sum()))
print("Number of BsmtHalfBath is 'NA' and basement exists: {}".format(houses[~pd.isna(houses["BsmtQual"])]["BsmtHalfBath"].isna().sum()))
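The same check can be expressed for all columns at once by selecting the rows where the indicator is present and counting the missing values per column. This is a sketch only; the helper name `missing_where_present` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd


def missing_where_present(df, indicator, columns):
    """Count missing values per column among rows where the indicator is set."""
    present = df[indicator].notna()
    return df.loc[present, columns].isna().sum()


# Toy stand-in: the second house has a basement but no exposure value,
# the third house has no basement at all (hypothetical values).
toy = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", np.nan],
    "BsmtExposure": ["No", np.nan, np.nan],
    "TotalBsmtSF": [856.0, 920.0, np.nan],
})

contradictions = missing_where_present(
    toy, indicator="BsmtQual", columns=["BsmtExposure", "TotalBsmtSF"]
)
print(contradictions)
```

Any non-zero entry in the result points to a house that has a basement according to the indicator but lacks a value in that column.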

If `BsmtQual` is missing, the remaining basement features are filled with `0` or `None`.

In [None]:
houses.loc[houses["BsmtQual"].isna(), "BsmtCond"] = "None"
houses.loc[houses["BsmtQual"].isna(), "BsmtExposure"] = "None"
houses.loc[houses["BsmtQual"].isna(), "BsmtFinType1"] = "None"
houses.loc[houses["BsmtQual"].isna(), "BsmtFinType2"] = "None"
houses.loc[houses["BsmtQual"].isna(), "BsmtFinSF1"] = 0
houses.loc[houses["BsmtQual"].isna(), "BsmtFinSF2"] = 0
houses.loc[houses["BsmtQual"].isna(), "BsmtUnfSF"] = 0
houses.loc[houses["BsmtQual"].isna(), "TotalBsmtSF"] = 0
houses.loc[houses["BsmtQual"].isna(), "BsmtFullBath"] = 0
houses.loc[houses["BsmtQual"].isna(), "BsmtHalfBath"] = 0
houses["BsmtQual"] = houses["BsmtQual"].fillna("None")

There still seem to be houses that have a basement according to `BsmtQual` but have missing values in other basement-related features. This is a logical contradiction.

In [None]:
houses[["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath"]].isna().sum()

Because the number of these houses is very small, we will drop these rows.
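The row drops announced so far (missing `MasVnrArea`, missing `Electrical`, and the contradictory basement entries) could be applied in one step with `DataFrame.dropna(subset=...)`. The sketch below uses toy data, and the column selection is an assumption about which columns should trigger a drop; `BsmtExposure` stands in for the affected basement columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in: rows 2, 3 and 4 each have one value that cannot be explained
# by an absent component (hypothetical values).
toy = pd.DataFrame(
    {
        "MasVnrArea": [196.0, np.nan, 0.0, 162.0, 350.0],
        "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA", "SBrkr"],
        "BsmtExposure": ["No", "Gd", "Mn", np.nan, "None"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Id"),
)

# Columns in which a remaining missing value leads to dropping the row.
drop_if_missing = ["MasVnrArea", "Electrical", "BsmtExposure"]
cleaned = toy.dropna(subset=drop_if_missing)

print(f"Dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Such a step has to run after the `None` fills, otherwise houses without a basement would be dropped as well.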

### Outdoor area

Some houses have fireplaces, pools and other amenities such as porches, decks and fences. These features are examined next.

- `Fireplaces`:     Number of fireplaces.
- `FireplaceQu`:    Fireplace quality.
- `PoolArea`:       Pool area in square feet.
- `PoolQC`:         Pool quality.
- `Fence`:          Fence quality.
- `MiscFeature`:    Miscellaneous feature not covered in other categories.
- `MiscVal`:        Dollar value of miscellaneous feature.
- `WoodDeckSF`:     Wood deck area in square feet.
- `OpenPorchSF`:    Open porch area in square feet.
- `EnclosedPorch`:  Enclosed porch area in square feet.
- `3SsnPorch`:      Three season porch area in square feet.
- `ScreenPorch`:    Screen porch area in square feet.

In [None]:
houses[["Fireplaces", "FireplaceQu", "PoolArea", "PoolQC", "Fence", "MiscFeature", "MiscVal", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]].info()

In [None]:
houses[["Fireplaces", "FireplaceQu", "PoolArea", "PoolQC", "Fence", "MiscFeature", "MiscVal", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]].isna().sum()

In [None]:
print("Number of FireplaceQu is 'NA' and fireplace exists: {}".format(houses[houses["Fireplaces"] >= 1]["FireplaceQu"].isna().sum()))

Missing values in `FireplaceQu` represent a missing fireplace. We will fill these missing values with `None`.

In [None]:
houses["FireplaceQu"] = houses["FireplaceQu"].fillna("None")

Let's check whether the two pool properties agree. If the quality is missing and the pool area is zero, the house has no pool and the value is missing on purpose. If the quality is missing although the area indicates a pool, there is probably an error in the data. So the question is: are there entries without a pool quality even though there should be a pool based on the surface area?

In [None]:
print("Number of PoolQC is 'NA' and pool exists: {}".format(houses[houses["PoolArea"] > 0]["PoolQC"].isna().sum()))
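A cross-tabulation shows all four combinations of pool area and pool quality at once, not only the contradictory one. The following sketch uses `pd.crosstab` on toy values; the labels and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in: the fourth house has a pool area but no quality (hypothetical values).
toy = pd.DataFrame({
    "PoolArea": [0, 0, 512, 648, 0],
    "PoolQC": [np.nan, np.nan, "Ex", np.nan, np.nan],
})

# Readable labels for the two conditions being compared.
has_area = toy["PoolArea"].gt(0).map({True: "area > 0", False: "area = 0"})
quality = toy["PoolQC"].isna().map({True: "quality missing", False: "quality present"})

consistency = pd.crosstab(has_area, quality, rownames=["pool area"], colnames=["pool quality"])
print(consistency)
```

The cell for a positive area with missing quality corresponds to the count printed above; a positive count for zero area with a quality present would be the opposite kind of contradiction.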

Missing values in `PoolQC` represent a missing pool. We will fill these missing values with `None`.

In [None]:
houses["PoolQC"] = houses["PoolQC"].fillna("None")

Missing values in `Fence` represent a missing fence. We will fill these missing values with `None`.

In [None]:
houses["Fence"] = houses["Fence"].fillna("None")

Missing values in `MiscFeature` represent a missing feature. We will fill these missing values with `None`.

In [None]:
houses["MiscFeature"] = houses["MiscFeature"].fillna("None")

### Kitchens

Next, we take a look at the available kitchens.

- `KitchenAbvGr`:   Kitchens above grade.
- `KitchenQual`:    Kitchen quality.

In [None]:
houses[["KitchenAbvGr", "KitchenQual"]].info()

In [None]:
houses[["KitchenAbvGr", "KitchenQual"]].isna().sum()

### Sale

Finally we take a look at the target variable, the sale price. Moreover, we will have a look at the circumstances of the sale.

- `SaleType`:       Type of sale.
- `SaleCondition`:  Condition of sale.
- `SalePrice`:      The property's sale price in dollars. This is the target variable that we are trying to predict.
- `MoSold`:         Month Sold (MM).
- `YrSold`:         Year Sold (YYYY).

In [None]:
houses[["SaleType", "SaleCondition", "MoSold", "YrSold"]].info()

In [None]:
houses[["SaleType", "SaleCondition", "MoSold", "YrSold"]].isna().sum()

In [None]:
fig = plt.figure(figsize=(15, 5))

ax = fig.add_subplot(1, 1, 1)
ax.set_title("Distribution of the sale price")
ax.hist(houses["SalePrice"], bins=50)

plt.show()

In [None]:
houses["SalePrice"].describe()
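The correlation between the numerical input features and the target variable, mentioned at the start of the analysis, could be computed as sketched below. The toy values are made up, and the sketch assumes a pandas version that supports the `numeric_only` argument of `DataFrame.corr`.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values).
toy = pd.DataFrame({
    "LotArea": [1000, 1500, 2000, 2500],
    "YearBuilt": [1950, 1990, 1970, 2005],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [100000, 150000, 200000, 250000],
})

# Pearson correlation of every numerical feature with the target,
# strongest positive correlation first.
correlations = (
    toy.corr(numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(correlations)
```

Columns that were relabelled with text, such as `OverallQual` after the replacement above, are no longer numerical and would be skipped by this approach.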