<a href="https://colab.research.google.com/github/Nachoxt17/Real-Estate-Price-Estimator-for-Tokyo/blob/main/02_Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Import Packages and Dataset

First, I import the necessary packages and load the cleaned dataset.

In [None]:
!pip install sweetviz



In [None]:
import pandas as pd
import numpy as np
import sweetviz as sv
import pickle
from google.colab import drive
drive.mount('/content/drive')

import seaborn as sns

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
plt.style.use('ggplot')
from matplotlib.pyplot import figure

%matplotlib inline

Mounted at /content/drive


In [None]:
with open('Dataset/df_preprocessed.pickle', 'rb') as file:
    df = pickle.load(file)

df.head(5)

<hr>

## 2. Run SweetViz package

Next, I run the SweetViz library on the dataset. It summarizes the important descriptive statistics for every variable and writes them to an HTML report.

In [None]:
analyze_report = sv.analyze(df)
analyze_report.show_html('Statistical_Analysis.html', open_browser=False)

<hr>

## 3. Qualitative Variables

Here we analyze the qualitative variables and look at which label in each variable has the highest frequency. The statistical interpretation of the categorical variables is in the SweetViz report:
<br><br>
Please read the report file.
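
The most frequent label of each categorical variable can also be read directly with pandas. The sketch below is a minimal illustration on a toy frame, not part of the original analysis; the helper `top_labels`, the column names, and the values are hypothetical.

```python
import pandas as pd

def top_labels(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the most frequent label and its share for each non-numeric column."""
    rows = []
    for col in frame.select_dtypes(exclude="number").columns:
        # value_counts sorts by frequency, so the first entry is the top label
        shares = frame[col].value_counts(normalize=True)
        rows.append({"column": col, "top_label": shares.index[0], "share": shares.iloc[0]})
    return pd.DataFrame(rows)

# Toy data with hypothetical column names and labels
toy = pd.DataFrame({
    "Type": ["Residential Land", "Residential Land", "Pre-owned Condominium"],
    "Region": ["Residential Area", "Commercial Area", "Residential Area"],
    "Area": [100, 80, 60],
})
summary = top_labels(toy)
print(summary)
```

On the real data, the same helper could be called as `top_labels(df)`, assuming the categorical columns are stored with a non-numeric dtype.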

<hr>

## 4. Quantitative Variables

Now, let’s analyze the descriptive statistics for the quantitative variables. This section covers measures of central tendency, dispersion, and shape. We then draw each distribution plot to compare it with a normal distribution.

In [None]:
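def kde_plot(x):
    """Plot the histogram and KDE of column x of df, marking the mean and median."""
    plt.figure(figsize=(8, 3))
    # histplot(kde=True) is the maintained alternative to distplot (deprecated since seaborn 0.11);
    # bins=50 keeps the bin count bounded for heavy-tailed columns
    sns.histplot(df[x], bins=50, kde=True, stat='density', alpha=0.25, line_kws={'lw': 5})
    sns.despine(left=True)

    mean = df[x].mean()
    median = df[x].median()

    plt.axvline(mean, color='black', linestyle='dashed')   # mean
    plt.axvline(median, color='green', linestyle='solid')  # median
    plt.xlabel('')
    plt.ylabel('')

    plt.show()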

Now we can look at the distribution of each quantitative variable. In every plot, the dashed black line marks the mean and the solid green line marks the median:

In [None]:
kde_plot('Area')

In [None]:
kde_plot('Frontage')

In [None]:
kde_plot('NearStation')

In [None]:
kde_plot('BuildingYear')

In [None]:
kde_plot('Price')
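
The measures of central tendency, dispersion, and shape mentioned in this section can also be computed numerically to accompany the plots. This is a minimal sketch using standard pandas methods; the helper `describe_shape` and the sample values are hypothetical.

```python
import pandas as pd

def describe_shape(series: pd.Series) -> pd.Series:
    """Summarize central tendency, dispersion, and shape of a numeric series."""
    return pd.Series({
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),        # sample standard deviation (ddof=1)
        "iqr": series.quantile(0.75) - series.quantile(0.25),
        "skew": series.skew(),      # positive values indicate a longer right tail
        "kurtosis": series.kurt(),  # excess kurtosis, 0 for a normal distribution
    })

# Made-up right-skewed sample: one large value pulls the mean above the median
sample = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])
stats = describe_shape(sample)
print(stats)
```

A mean well above the median together with positive skewness is the numerical counterpart of the gap between the dashed and solid lines in the plots.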

<hr>

In [None]:
#minimum Area
df[df['Area'] == df['Area'].min()]

In [None]:
#maximum Area
df[df['Area'] == df['Area'].max()]

<hr>

In [None]:
#minimum Frontage
df[df['Frontage'] == df['Frontage'].min()]

In [None]:
#maximum Frontage
df[df['Frontage'] == df['Frontage'].max()]

<hr>

In [None]:
#minimum NearStation
df[df['NearStation'] == df['NearStation'].min()].head(3)

In [None]:
#maximum NearStation
df[df['NearStation'] == df['NearStation'].max()].head(3)

<hr>

In [None]:
#minimum BuildingYear
df[df['BuildingYear'] == df['BuildingYear'].min()].head(3)

In [None]:
#maximum BuildingYear
df[df['BuildingYear'] == df['BuildingYear'].max()].head(3)

<hr>

In [None]:
#minimum Price
df[df['Price'] == df['Price'].min()]

In [None]:
#maximum Price
df[df['Price'] == df['Price'].max()]
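
The minimum and maximum lookups above all follow the same pattern, so they could be wrapped in one helper. The sketch below is only an illustration of that idea; the function `extremes` and the toy values are hypothetical.

```python
import pandas as pd

def extremes(frame: pd.DataFrame, column: str, n: int = 3) -> pd.DataFrame:
    """Return up to n rows at the minimum and up to n rows at the maximum of a column."""
    lowest = frame[frame[column] == frame[column].min()].head(n)
    highest = frame[frame[column] == frame[column].max()].head(n)
    # keys adds an outer index level that labels each group of rows
    return pd.concat([lowest, highest], keys=["min", "max"])

# Toy data (made-up values); two rows tie for the minimum Area
toy = pd.DataFrame({"Area": [30, 120, 30, 500], "Price": [10, 40, 12, 90]})
result = extremes(toy, "Area")
print(result)
```

With the real dataset, a loop such as `for col in ['Area', 'Frontage', 'NearStation', 'BuildingYear', 'Price']: display(extremes(df, col))` would reproduce the cells above in one place.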

<hr>

## 5. Handle Outliers

In the previous sections, we saw that some variables have outliers, and we should handle them before going further. The largest outliers are in “Price”, and the statistical analysis of Price shows that it is correlated with Area. We can therefore use feature engineering to handle the outliers in Price. To begin, I combine price and area into a new feature, price per area:

In [None]:
df['Price_per_Area'] = df['Price'] / df['Area']
df.head(3)
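
Why dividing by area helps can be seen on a small example. The sketch below uses made-up values, chosen only for illustration, in which Price grows with Area; it is not a result from the dataset.

```python
import pandas as pd

# Made-up listings in which larger properties cost more in total
toy = pd.DataFrame({
    "Area":  [50, 100, 200, 400],
    "Price": [50_000_000, 95_000_000, 210_000_000, 380_000_000],
})
toy["Price_per_Area"] = toy["Price"] / toy["Area"]

# Correlation with Area before and after normalizing by Area
corr_price = toy["Price"].corr(toy["Area"])
corr_unit = toy["Price_per_Area"].corr(toy["Area"])

# Ratio of the largest to the smallest value, as a crude measure of spread
spread_price = toy["Price"].max() / toy["Price"].min()
spread_unit = toy["Price_per_Area"].max() / toy["Price_per_Area"].min()

print(corr_price, corr_unit, spread_price, spread_unit)
```

In this toy data, the total price is driven almost entirely by size, while the per-area price varies within a narrow band. That is what makes the per-area price a more comparable quantity for spotting unusual listings.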

Now, let’s look at the shape of the Price_per_Area distribution:

In [None]:
# reuse the helper from section 4: histogram and KDE with the mean (dashed) and median (solid)
kde_plot('Price_per_Area')

The distribution is extremely right-skewed and contains large outliers. We cannot use the z-score to handle them, because this variable does not follow a normal distribution. Instead, we use the MAD (median absolute deviation) technique, which is robust under these conditions. As the reference value, I searched the internet for the median price per square foot in Tokyo.

In [None]:
median_prices = {"Tokyo": 951000}

Now I filter the dataset (removing outliers) based on the MAD technique.

In [None]:
df.shape

In [None]:
# look up the reference median for Tokyo in the dictionary
Median = median_prices["Tokyo"]

# absolute difference between each Price_per_Area and the reference median
# (vectorized, so it does not depend on the index labels being 0..n-1)
df['Median_Diff'] = (df['Price_per_Area'] - Median).abs()

# MAD: the median of the absolute differences
MAD = df['Median_Diff'].median()

# determine the threshold
threshold = MAD * 3

# drop the rows whose difference exceeds the threshold
df = df[~(df['Median_Diff'] > threshold)]

# remove the helper column
df = df.drop(['Median_Diff'], axis=1)
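
For comparison, the textbook form of the MAD rule centers the deviations on the sample median rather than on an external reference value. The sketch below writes both variants as one helper under that assumption; `mad_outlier_mask`, its `center` argument, and the toy values are hypothetical, and the cutoff of 3 mirrors the threshold used above.

```python
from typing import Optional

import pandas as pd

def mad_outlier_mask(values: pd.Series, center: Optional[float] = None, k: float = 3.0) -> pd.Series:
    """Flag values whose absolute deviation from `center` exceeds k times the MAD."""
    if center is None:
        center = values.median()  # textbook MAD: deviations from the sample median
    abs_dev = (values - center).abs()
    mad = abs_dev.median()        # median of the absolute deviations
    return abs_dev > k * mad

# Made-up per-area prices with one extreme value
toy = pd.Series([900_000, 950_000, 1_000_000, 1_050_000, 1_100_000, 9_000_000])

mask_sample = mad_outlier_mask(toy)                    # centered on the sample median
mask_external = mad_outlier_mask(toy, center=951_000)  # centered on an external reference value
filtered = toy[~mask_sample]
print(filtered)
```

Some references additionally scale the MAD by about 1.4826 so that it estimates the standard deviation under normality. The sketch omits that constant to stay close to the code above.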

Finally, let’s look at the distribution of price per area after removing the outliers.

In [None]:
# same plot as before the filtering, now on the reduced dataset
kde_plot('Price_per_Area')

In [None]:
# Price_per_Area was only needed for the outlier filtering
df = df.drop(['Price_per_Area'], axis=1)

<hr>

## 6. Checkpoint

In [None]:
import pickle
with open('Dataset/final_dataset.pickle', 'wb') as file:
    pickle.dump(df, file)
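
A quick way to confirm that a checkpoint was written correctly is to load it back and compare it with the frame in memory. The sketch below shows that round trip on a toy frame in a temporary directory, so it does not touch the real file; the toy values are made up.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for the final dataset (made-up values)
toy = pd.DataFrame({"Area": [50, 100], "Price": [50_000_000, 95_000_000]})

# Write to a temporary directory so the real checkpoint is left alone
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_dataset.pickle")
    with open(path, "wb") as file:
        pickle.dump(toy, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# equals() compares values, dtypes, and index
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```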