## House Price Prediction Statistics Report

### By: Jerry Zhu

## Abstract

The real estate industry is a rich sector that attracts thousands of new investors, many of whom are looking to profit from buying and selling homes [[1]](#scrollTo=c526B8E9Fol3&line=3&uniqifier=1). At the same time, selling a house at a fair price and finding a suitable place to live are necessities in today's society. <br>
In the wake of the COVID-19 pandemic, the average house price in Toronto has gone up by 19.3% [[2]](#scrollTo=c526B8E9Fol3&line=3&uniqifier=1), and many brokers have struggled to adjust to the sudden change in supply and demand. Being able to predict house prices can help sellers determine an acceptable selling price and help buyers find a residence that fits their budget. It could also strengthen the already popular real estate industry and give buyers and homeowners more safety and security in a heavily fluctuating market. <br>
In this report, we will attempt to solve the fundamental problem of predicting the price of a house from its physical properties, including its condition, location, and features. A large amount of data on house prices and the contextual factors behind them is now available, which makes it possible to improve our understanding of the real estate industry. Notably, this problem has already been studied in Zillow's Zestimate [[3]](#scrollTo=c526B8E9Fol3&line=3&uniqifier=1) and Kaggle's competition on housing prices [[4]](#scrollTo=c526B8E9Fol3&line=3&uniqifier=1). 
By using the comprehensive dataset from Kaggle, we will attempt to predict the price of a house using regression techniques, and then improve the accuracy of our predictions using a machine learning model and advanced regression [[5]](#scrollTo=c526B8E9Fol3&line=3&uniqifier=1). Finally, using a web application [[6]](#scrollTo=c526B8E9Fol3&line=3&uniqifier=1), we will explore the practicality of such a model in real life, and its relevance in the current real estate industry. 

### Initial Analysis

To begin, we have to find a quantitative variable to use as the dependent variable. To do this, we will extract the dataset and perform an **exploratory data analysis**. 

We first import the Kaggle dataset using the Kaggle API: we set up a root folder to hold the zip archive downloaded from the Kaggle competition, and extract the contents of the archive into that folder. Finally, we delete the zip archive and view the description of the dataset (`data_description.txt`). 

In [None]:
!pip install -q kaggle

In [None]:
from google.colab import files

In [None]:
files.upload() # Upload your kaggle.json API key

Saving kaggle.json to kaggle.json



In [None]:
!mkdir -p ~/.kaggle

In [None]:
!cp kaggle.json ~/.kaggle/

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets list


In [None]:
!kaggle competitions download -c house-prices-advanced-regression-techniques

In [None]:
%cd /content

In [None]:
!ls

In [None]:
!unzip house-prices-advanced-regression-techniques.zip

In [None]:
!rm house-prices-advanced-regression-techniques.zip
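
The extract-and-delete steps above can also be done from Python with the standard library, which may be convenient outside of Colab. This is a minimal sketch; the helper name `extract_and_remove` is hypothetical and not part of the original notebook.

```python
import os
import zipfile


def extract_and_remove(zip_path, dest_dir="."):
    """Extract every file in zip_path into dest_dir, then delete the archive.

    Hypothetical helper: mirrors the `!unzip` and `!rm` shell cells.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        names = archive.namelist()  # files that were extracted
    os.remove(zip_path)
    return names
```

For example, `extract_and_remove("house-prices-advanced-regression-techniques.zip")` would return the list of extracted file names, assuming the archive has already been downloaded into the working directory.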

### Description Of Data

In [None]:
!cat data_description.txt

### Dataset Files

`train.csv` - The training set: one row of attributes per house, including its `SalePrice`. <br>
`test.csv` - The test set: the same attributes for a separate set of houses, without `SalePrice`. <br>
`sample_submission.csv` - A sample of house IDs and predicted prices, showing the format used for predictions. <br>
`data_description.txt` - A description of each column in the compiled data. 

### Dependent Variable

From the description of the dataset, all the factors relate to housing price in some way, so `SalePrice` is the most natural dependent variable. 
We can also see that the data is cross-sectional: each entry is identified by an `Id` rather than a timestamp, and each house in the group is recorded once rather than tracked over time. <br>
To find suitable independent variables to compare against the dependent variable, we can construct a correlation matrix to determine the correlation of each variable with `SalePrice`. <br>
To do this, we will convert the training dataset (which contains the aggregate entries) into a `DataFrame` using Python's Pandas module, to help us visualize the data. We will then drop all the non-numeric (qualitative) columns in the dataset, and create a correlation matrix using Pandas' `corr()` method [[7]](#scrollTo=c526B8E9Fol3&line=3&uniqifier=1). Finally, we will sort the variables by the absolute value of their correlation, to find the quantitative variable(s) with the highest correlation with `SalePrice`, to be used as our independent variables. 
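
As a small illustration of this ranking step, the sketch below applies the same idea to a made-up table. The column names other than `Id` and `SalePrice`, the values, and the helper `rank_by_correlation` are all assumptions for the example and do not come from the Kaggle dataset.

```python
import pandas as pd


def rank_by_correlation(df, target, drop=()):
    """Return each numeric column's Pearson correlation with `target`,
    sorted from strongest to weakest by absolute value."""
    numeric = df.select_dtypes(include="number")  # drops qualitative columns
    corr = numeric.corr()[target].drop(labels=[target, *drop])
    return corr.sort_values(key=abs, ascending=False)


# Made-up toy data, used only to show the mechanics
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Area": [50, 60, 80, 100, 120],
    "Age": [40, 35, 20, 10, 5],
    "Street": ["a", "b", "c", "d", "e"],  # non-numeric, ignored
    "SalePrice": [100, 130, 150, 210, 240],
})
ranking = rank_by_correlation(toy, "SalePrice", drop=["Id"])
print(ranking)
```

Sorting by absolute value matters because a strongly negative correlation (such as `Age` in the toy data) is just as informative as a strongly positive one.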

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
df_train = pd.read_csv('../content/train.csv')

In [None]:
df_train.columns

In [None]:
numerical_types = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

df_numeric = df_train.select_dtypes(include=numerical_types)

In [None]:
print("Total number of columns:", len(df_train.columns))
print("Number of numerical (quantitative) columns:", len(df_numeric.columns))

In [None]:
corr_matrix = df_numeric.corr()
corr_matrix

In [None]:
plt.subplots(figsize=(32, 20))
sns.heatmap(corr_matrix, annot=True)
plt.show()

In [None]:
# Keep only the SalePrice column, then drop the SalePrice and Id rows
corr_matrix_saleprice = corr_matrix[['SalePrice']].drop(['SalePrice', 'Id'])
corr_matrix_saleprice

In [None]:
corr_sorted = corr_matrix_saleprice.copy()
corr_sorted = corr_sorted.sort_values(by = ['SalePrice'], ascending = False, kind = 'quicksort', key = abs)
corr_sorted

### Variables with highest correlation

In [None]:
corr_sorted.head()

### Independent Variables

Since overall quality (`OverallQual`) and size of the living area (`GrLivArea`) have the highest correlations with `SalePrice`, we will use these two variables as our independent variables. <br>
This conclusion makes sense, as we know from prior knowledge that the quality of a house and the size of its living area are important considerations when choosing a suitable house. However, we can extend our prior knowledge, and use statistical analysis to explore exactly how crucial these variables are to the sale price of a house. 

### Hypothesis

My hypothesis is that as `OverallQual` and `GrLivArea` increase, `SalePrice` increases. 
To test this hypothesis, I will first isolate the data for each independent variable and the dependent variable, graph the data to look for such a correlation, and then test this correlation statistically. <br>
To do this, I will drop all columns except the ones relating to the correlation I am trying to determine. I will then take a sample of the data, and graph it using `sns.regplot` from Python's seaborn module, which draws a scatter plot together with a fitted regression line (line of best fit). Using this, I can determine any trends in the data. 

In [None]:
# Keep only the identifier, the two independent variables, and the dependent variable
df_dropped = df_train[['Id', 'OverallQual', 'GrLivArea', 'SalePrice']].copy()

In [None]:
df_dropped.columns

In [None]:
df_dropped.head()

In [None]:
df_dropped.shape # (rows, columns)

### Sampling Of Data

Since the dataset is large (over 1000 rows), we will work with a sample of roughly 100 houses. To do this, we will try three common sampling methods: simple random sampling, systematic random sampling, and convenience sampling. After collecting the samples, we will save each one as a `.csv` (comma-separated values) file, for easier access and storage. Finally, we will compare the sampling methods using the correlation matrix (`corr()`) of each sample. 
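
The three methods can be sketched as small functions on a toy table. This is only an illustration under stated assumptions: the helper names and the `seed` parameter are hypothetical, and the systematic variant here picks a random starting row and trims to exactly `n` rows, which is one common convention rather than necessarily what the cells below do.

```python
import numpy as np
import pandas as pd


def simple_random_sample(df, n, seed=None):
    """Every row has the same chance of being chosen."""
    return df.sample(n=n, random_state=seed)


def systematic_sample(df, n, seed=None):
    """Take every k-th row from a random start, where k = len(df) // n."""
    interval = len(df) // n
    start = int(np.random.default_rng(seed).integers(interval))
    return df.iloc[start::interval].head(n)


def convenience_sample(df, n):
    """Take the first n rows, which is easy but may be biased."""
    return df.head(n)


# Toy table with the same number of rows as a 1460-row dataset
toy = pd.DataFrame({"Id": range(1, 1461)})
rand = simple_random_sample(toy, 100, seed=0)
syst = systematic_sample(toy, 100, seed=0)
conv = convenience_sample(toy, 100)
print(len(rand), len(syst), len(conv))
```

Passing a fixed `seed` makes the random samples reproducible, so the exported `.csv` files and the correlations computed from them do not change between runs.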

In [None]:
df_rand = df_dropped.sample(n = 100)
df_rand.head()

In [None]:
df_rand.to_csv(r'simplerand_num.csv', index=False) # Export collected data as csv

In [None]:
# Find size of interval
interval = df_dropped.shape[0] // 100
print(interval)

In [None]:
df_rand2 = df_dropped[df_dropped.index % interval == 0]
df_rand2.head()

In [None]:
df_rand2.to_csv(r'intervalrand_num.csv', index=False) 

In [None]:
df_rand3 = df_dropped.head(100)
df_rand3.head()

In [None]:
df_rand3.to_csv(r'convenience_num.csv', index=False) 

### Correlations of Different Sampling Techniques

In [None]:
df_rand.corr() # Simple random sampling

In [None]:
df_rand2.corr() # Systematic random sampling

In [None]:
df_rand3.corr() # Convenience sampling

In [None]:
# Convert correlation matrices to LaTeX for rendering in Overleaf
# df_rand.corr().to_latex()
# df_rand2.corr().to_latex()
# df_rand3.corr().to_latex()

### Sampling Conclusion

We notice that the sampling method that gives the highest correlations with `SalePrice` is simple random sampling, so we will treat it as the best of the three. This makes sense because simple random sampling gives every house the same chance of being selected, so it should introduce the least bias. <br>
Next, we will graph the data from this sample to determine any trends in the two variables. We will graph a scatter plot and a line of best fit. 
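
To make the "line of best fit" concrete, the sketch below computes an ordinary least-squares line with NumPy, which is roughly the kind of line `sns.regplot` draws by default. The helper name `best_fit_line` and the numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np


def best_fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
    return float(slope), float(intercept)


# Made-up points that lie exactly on y = 50 * x + 20000
area = np.array([1000.0, 1500.0, 2000.0, 2500.0])
price = 50.0 * area + 20000.0
slope, intercept = best_fit_line(area, price)
print(slope, intercept)
```

The slope can be read as the estimated change in price for each additional unit of living area, which is the trend the scatter plot is meant to reveal.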

In [None]:
scatterplot = df_rand.plot.scatter(x = 'GrLivArea', y = 'SalePrice', title='Area of Living Space vs Price Of House')
scatterplot.set_xlabel(r"Area Of Living Space ($ft^2$)")  # GrLivArea is recorded in square feet
scatterplot.set_ylabel("Price Of House ($)")

In [None]:
# No confidence interval
regression = sns.regplot(x = 'GrLivArea', y = 'SalePrice', data=df_rand, fit_reg = True, 
            ci = None)
regression.set(title='Area of Living Space vs Price Of House')
regression.set_xlabel(r"Area Of Living Space ($ft^2$)")  # GrLivArea is recorded in square feet
regression.set_ylabel("Price Of House ($)")

### Trends 

From the overall trend of the graph and the line of best fit, we see that there is a strong positive correlation between the area of living space and the price of the house. The correlation matrices above show a similarly strong positive correlation between the overall quality of the house and its price. <br>
Therefore, the variables we found are useful, and we can continue by testing this correlation using a more in-depth statistical analysis of the trends and relationships between the variables. 
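
One common way to test a correlation, shown here only as a possible next step and not necessarily the analysis this report goes on to use, is to convert the sample correlation $r$ into a t-statistic, $t = r\sqrt{(n-2)/(1-r^2)}$, and compare it against a Student's t distribution with $n-2$ degrees of freedom. The function names and the data below are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t-statistic for the null hypothesis of zero correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Made-up data, not from the housing dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
t = correlation_t_statistic(r, len(xs))
print(r, t)
```

A larger absolute t-statistic means the observed correlation is less likely to have arisen by chance in a sample of that size.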

## Works Cited

[1]:https://www.researchgate.net/publication/320801620_Modeling_House_Price_Prediction_using_Regression_Analysis_and_Particle_Swarm_Optimization_Case_Study_Malang_East_Java_Indonesia <br>

[2]:https://globalnews.ca/news/8400321/canada-housing-prices-central-bank-warning/ <br>

[3]:https://www.zillow.com/blog/zestimate-updates-230614/ <br>

[4]:House Prices Dataset. Retrieved 05/16/22 from 
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data. <br>

[5]:https://m2pi.ca/project/2020/bc-financial-services-authority/BCFSA-final.pdf <br>

[6]:TBD <br>

[7]:https://pandas.pydata.org/docs/reference/index.html <br>


In [None]:
# Converting this Jupyter notebook to PDF and html
# https://stackoverflow.com/questions/15998491/how-to-convert-ipython-notebooks-to-pdf-and-html 
!pip install nbconvert



In [None]:
!jupyter nbconvert --to html Culminating_Statistics_Project.ipynb


In [None]:
!sudo apt-get install texlive-xetex texlive-fonts-recommended texlive-generic-recommended


In [None]:
!jupyter nbconvert --to pdf Culminating_Statistics_Project.ipynb

