# House Prices - Exploratory Data Analysis

## Overview

This notebook explores the data provided by the [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview) competition.  
Feedback is much appreciated, as this is my first EDA.

Importing libraries and loading data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy

%pip install empiricaldist

import empiricaldist

#%matplotlib inline

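# 'seaborn-notebook' was renamed to 'seaborn-v0_8-notebook' in matplotlib 3.6
# and the old name was removed in 3.8, so pick whichever name is available.
style_name = 'seaborn-v0_8-notebook' if 'seaborn-v0_8-notebook' in plt.style.available else 'seaborn-notebook'
plt.style.use(style_name)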

df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

pd.set_option('display.max_columns', 100)

### Summary statistics

In [None]:
df.describe()

In [None]:
df.info()

### Does missing data give us data?

In [None]:
# Keep only the columns where at most 60% of the rows have a value
incomplete_entry_df = df[[col for col in df if (df[col].count() / len(df)) <= 0.6]]
incomplete_entry_df
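
The same completeness check can be read more directly as a missing fraction per column. Below is a minimal sketch on a toy frame (the values are invented for illustration and the variable names are not from the notebook); `isna().mean()` gives the share of missing entries in each column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; values are invented for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "Alley": [np.nan, np.nan, "Grvl", np.nan, np.nan],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Ex"],
})

# Fraction of missing entries per column, largest first.
missing_fraction = toy.isna().mean().sort_values(ascending=False)

# Columns that are at most 60% complete, i.e. at least 40% missing.
sparse_columns = missing_fraction[missing_fraction >= 0.4].index.tolist()

print(missing_fraction)
print(sparse_columns)
```

Sorting the fractions first makes it easy to see how far each column sits from the threshold before deciding what to do with it.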

From data description:  
  
  
Alley: Type of alley access to property
> Grvl -- Gravel  
> Pave -- Paved  
> NA   -- No alley access  
  
FireplaceQu: Fireplace quality  
> Ex -- Excellent - Exceptional Masonry Fireplace  
> Gd -- Good - Masonry Fireplace in main level  
> TA -- Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement  
> Fa -- Fair - Prefabricated Fireplace in basement  
> Po -- Poor - Ben Franklin Stove  
> NA -- No Fireplace  

PoolQC: Pool quality
> Ex -- Excellent  
> Gd -- Good  
> TA -- Average/Typical  
> Fa -- Fair  
> NA -- No Pool  

Fence: Fence quality  
> GdPrv -- Good Privacy  
> MnPrv -- Minimum Privacy  
> GdWo -- Good Wood  
> MnWw -- Minimum Wood/Wire  
> NA -- No Fence  

MiscFeature: Miscellaneous feature not covered in other categories
> Elev -- Elevator  
> Gar2 -- 2nd Garage (if not described in garage section)  
> Othr -- Other  
> Shed -- Shed (over 100 SF)  
> TenC -- Tennis Court  
> NA -- None  

In these columns, NA means the feature is absent rather than unrecorded, so it carries information. We shall replace it with the explicit label "Nope".

In [None]:
categorical_df = df[[col for col in df.columns if (df[col].dtype != "int64") and (df[col].dtype != "float64")]]  # Categorical data

categorical_df = categorical_df.fillna("Nope")
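
The reason these entries show up as missing in the first place is that `pd.read_csv` treats the literal string `NA` as a missing value by default. A small sketch on an invented three-row CSV (not the real file) shows the parsing and the relabelling:

```python
import io
import pandas as pd

# Toy CSV in the same style as train.csv; rows are invented for illustration.
csv_text = "Id,Alley,Fence\n1,NA,MnPrv\n2,Grvl,NA\n3,NA,NA\n"

# By default pandas parses the string "NA" as a missing value.
parsed = pd.read_csv(io.StringIO(csv_text))
print(parsed["Alley"].isna().sum())

# An explicit label keeps "feature absent" as its own category
# instead of leaving a hole in the data.
labelled = parsed.fillna({"Alley": "Nope", "Fence": "Nope"})
print(labelled["Alley"].value_counts())
```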

In [None]:
numerical_df = df[[col for col in df.columns if (df[col].dtype == "int64") or (df[col].dtype == "float64")]]  # Numerical data

numerical_with_null_df = numerical_df.loc[:, numerical_df.isnull().any()]  # Numerical data with missing values
numerical_with_null_df

LotFrontage: Linear feet of street connected to property

MasVnrArea: Masonry veneer area in square feet

GarageYrBlt: Year garage was built  
<hr>
  
From the descriptions of these features, we can conclude that the values are missing for a logical reason  
(e.g. a home with no garage will have a missing GarageYrBlt value).

Thus, we can replace these values with 0.

In [None]:
numerical_df = numerical_df.fillna(0.0)
numerical_df.isnull().any().any() # Making sure there are no missing values
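
One caveat with zero-filling is that 0 is not a plausible year, so `GarageYrBlt` ends up with values far away from the real ones. A possible complement, sketched here on invented values, is to record the missingness in a separate flag before filling. The `HasGarage` column name is hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy column; values are invented for illustration.
toy = pd.DataFrame({"GarageYrBlt": [2003.0, np.nan, 1976.0, np.nan]})

# Hypothetical indicator column: 1 where a garage year was recorded, 0 otherwise.
toy["HasGarage"] = toy["GarageYrBlt"].notna().astype(int)

# Zero-fill afterwards; the flag keeps the two cases apart.
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(0.0)

print(toy)
```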

Updating the original dataframe with filled-in missing values

In [None]:
for column in df.columns:
    if (column in numerical_df.columns):
        df[column] = numerical_df[column]
    if (column in categorical_df.columns):
        df[column] = categorical_df[column]
df

Perfect! Now that data cleaning is done, let's get on with the fun and visual stuff.

## Finding linear correlations

In [None]:
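# Restrict to numeric columns: in pandas >= 2.0, corr() raises on string columns
corr_matrix = df.select_dtypes(include='number').corr()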
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
f, ax = plt.subplots(figsize=(13, 10))
cmap = sns.diverging_palette(255, -255, as_cmap=True)

sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .8})
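
A heatmap gives the overall picture, but for the target it can be easier to read a ranked list. The sketch below uses synthetic data (the numbers are generated, not taken from the competition files) to show one way of sorting features by the strength of their Pearson correlation with `SalePrice`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: price depends strongly on living area, not on the noise column.
area = rng.normal(1500, 400, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, n),
    "SalePrice": 100 * area + rng.normal(0, 20000, n),
})

# Pearson correlation of every column with the target, strongest first
# (sorted by absolute value so strong negative correlations also rank high).
target_corr = (
    toy.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(target_corr)
```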

### Skewed target variable

In [None]:
plt.figure(figsize=(10, 8))
sns.histplot(df['SalePrice'], color='g', bins=100, line_kws={'alpha': 0.4}, kde=True)

### Applying log to normalize the distribution

In [None]:
gaussian_sale_price = np.log1p(df['SalePrice'])  # log(1 + x)

plt.figure(figsize=(10, 8))
sns.histplot(gaussian_sale_price, color='g', bins=100, line_kws={'alpha': 0.4}, kde=True)
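
The effect of the transform can also be checked numerically rather than by eye. As a minimal sketch, assuming right-skewed prices behave roughly like a log-normal sample (the data below is synthetic), the sample skewness should drop towards 0 after applying `np.log1p`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

# Sample skewness before and after the log transform.
skew_raw = prices.skew()
skew_log = np.log1p(prices).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

On the real data the same two lines would be applied to `df['SalePrice']` in place of the synthetic series.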

## Finding non-linear correlations