<h1 style="color:#ffc0cb;font-size:70px;font-family:Georgia;text-align:center;"><strong>Predicting House Model</strong></h1>

### <b>Author: Nguyen Dang Huynh Chau</b>

<h1 style="color:#ffc0cb;font-size:40px;font-family:Georgia;text-align:center;"><strong>Table of Contents</strong></h1>

### 1. [Data Preparation](#1)

1.1 [Importing Necessary Libraries and datasets](#1.1)

1.2 [Data Retrieving](#1.2)

1.3 [Rename Column](#1.3)

<br>

### 2. [Data Cleaning](#2)

2.1 [About This Dataset](#2.1)

2.2 [Data Types](#2.2)

2.2.1 [Format Data Features](#2.2.1)

2.2.2 [Remove Unit for Measurement](#2.2.2)

2.2.3 [Remove Prefix & Typo Check](#2.2.3)

2.3 [Translate The Content](#2.3)

2.4 [Uppercase the Content](#2.4)

2.5 [Missing Values](#2.5)

2.6 [Check data types & Make the data homogeneous](#2.6)

2.7 [Extra-whitespaces](#2.7)

2.8 [Sanity Checks](#2.8)

2.9 [Checking for Impossible values & Outliers](#2.9)

2.9.1 [Some domain knowledge](#2.9.1)

2.9.2 [Descriptive Statistics for Central Tendency](#2.9.2)

2.9.3 [Descriptive Statistics for Variability](#2.9.3)

2.9.4 [Remove Impossible Values](#2.9.4)

2.10 [Create Categorical Price](#2.10)

2.11 [Save The Intermediate Data](#2.11)

<br>

### 3. [Data Exploration (EDA)](#3)
3.1 [Frequency of each corresponding Target variable type](#3.1)

3.2 [Determine Location (urban & suburban) influence on price](#3.2)

3.3 [Legal Document Factor](#3.3)

3.4 [The most common Wards](#3.4)

<br>

### 4. [Feature Engineering](#4)
4.1 [Drop Unrelated columns to the target](#4.1)

4.2 [Class imbalances](#4.2)

4.3 [Encoding](#4.3)

<br>

### 5. [Model Building](#5)
5.1 [Train/Test split](#5.1)

5.2 [Simple Logistic Regression as Baseline](#5.2)

5.3 [Random Forest with Pipelines](#5.3)

5.4 [Combining GridSearch + Random Forest with Pipelines](#5.4)

<br>

### 6. [Conclusions](#6)

<br>

### 7. [References](#7)

<br>

### 8. [Appendix](#8)

<hr>

<a id="1"></a>
<h1 style="color:#ffc0cb;font-size:40px;font-family:Georgia;text-align:center;"><strong>1. Data Preparation</strong></h1>

<a id="1.1"></a>
# 1.1 Importing Necessary Libraries and datasets

In [None]:
# Install the required packages with pip into the environment of the current Jupyter kernel
import sys
!{sys.executable} -m pip install missingno
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install xgboost
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install imbalanced-learn
!{sys.executable} -m pip install category_encoders


# Standard-library helpers
from datetime import time
import os

# Work with data in tabular representation
import pandas as pd
# Numerical operations, e.g. rounding the values in the correlation matrix
import numpy as np


# Modules for data visualization
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import plot_confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
# encoding
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# for saving the pipeline
import joblib

# Additional scikit-learn estimators and preprocessing utilities
# (Pipeline is already imported above)
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler, Binarizer

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

plt.rcParams['figure.figsize'] = [6, 6]

# Ensure that our plots are shown and embedded within the Jupyter notebook itself. Without this command, sometimes plots may show up in pop-up windows
%matplotlib inline

# overwrite the style of all the matplotlib graphs
sns.set()

# Suppress all warning messages (including DeprecationWarning)
import warnings
warnings.filterwarnings('ignore')

In [None]:
# check the version of the packages
print("Numpy version: ", np.__version__)
print("Pandas version: ",pd.__version__)
! python --version

<a id="1.2"></a>
# 1.2 Data Retrieving
***
To load the data properly, the CSV file must first be examined carefully. The fields are separated by commas (","), and some values carry extra whitespace right after the delimiter; setting `skipinitialspace=True` strips that leading whitespace while the file is read.
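
As a minimal sketch of what `skipinitialspace=True` changes, the snippet below reads a small in-memory sample twice, once with the default settings and once with the option enabled. The sample rows and the column names (`Area`, `Price`, `Ward`) are hypothetical and only illustrate the effect; they are not taken from the actual dataset.

```python
import io

import pandas as pd

# Hypothetical sample: every field after the first has a space after the comma
raw = "Area, Price, Ward\n50, 3.5, Ward A\n72, 5.1, Ward B\n"

# Default behaviour: the space after each comma is kept as part of the field
df_default = pd.read_csv(io.StringIO(raw), sep=",")

# With skipinitialspace=True the whitespace following each delimiter is dropped
df_clean = pd.read_csv(io.StringIO(raw), sep=",", skipinitialspace=True)

print(list(df_default.columns))  # ['Area', ' Price', ' Ward']
print(list(df_clean.columns))    # ['Area', 'Price', 'Ward']
```

Without the option, a column would have to be addressed as `" Ward"` (with the leading space), which is easy to overlook; stripping the whitespace at load time avoids that class of lookup errors later in the notebook.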