# **UNLOCKING REAL ESTATE SUCCESS WITH TIME SERIES MODELLING**

This is a collaborative group project completed at the end of Phase 4 of Moringa School's Data Science program. The team members are:

- [Abdideq Adan](https://github.com/AdanAbdideq)
- [Clara Gatambia](https://github.com/claragatambia)
- [Isaack Odera](https://github.com/derak-isaack)
- [Mwiti Mwongo](https://github.com/M13Mwongo)
- [Wilson Mutungu](https://github.com/mutungu)

## 1. BUSINESS UNDERSTANDING

TODO - Answer a sample question like why prices would be different in different states but one state may be more expensive on average than another

TODO - Analysis into the market trends (e.g. 2008 financial crash) and how the model may account for that in future (market trend analysis)

TODO - More important to show in depth analysis i.e. data prep and eda

TODO - Think as the investor in this area

### Introduction

### Objectives
The main objective of the project is to identify the 5 best zip codes to invest in.

The specific objectives of this project are as follows:
 - Identify the key metrics that would be used to classify profitability.
 - Identify the criteria that would classify a house as a "high-end" house or "affordable" house.

### Potential Challenges

### Conclusion


## PRELIMINARIES

### Loading of Necessary Modules/Packages

All necessary Python packages are loaded in one place to keep the notebook clean and easy to follow.

In [1]:
# Importing necessary libraries
# Basics
import pandas as pd
import numpy as np
import itertools
from io import StringIO

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import matplotlib.patches as mpatches
from matplotlib.pylab import rcParams
import time

# Modeling
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, pacf, adfuller
from sklearn.linear_model import LassoLarsCV

# Warnings
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
warnings.filterwarnings('ignore')

# Custom Options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns',100)


### Definition of Custom Classes & Functions

In line with the concepts of OOP, custom classes were created to speed up the opening and manipulation of data in this project. They are as follows: 

In [2]:
class DataSourcing:
  def __init__(self, dataframe):
    self.original = dataframe
    self.dataframe = dataframe

  def give_info(self):
    with StringIO() as buffer:
      self.dataframe.info(buf = buffer)
      info_string = buffer.getvalue()
    
    message = f"""
    ----------------------------------------------------------------------->
    DESCRIPTION OF THE DATAFRAME IN QUESTION:
    ----------------------------------------------------------------------->

    Dataframe information => \n{info_string}
    ------------------------------------------------------------------------------------------------------------------------->

    Dataframe shape => {self.dataframe.shape[0]} rows, {self.dataframe.shape[1]} columns
    ------------------------------------------------------------------------------------------------------------------------->

    There are {len(self.dataframe.columns)} columns, namely: {self.dataframe.columns}.
    ------------------------------------------------------------------------------------------------------------------------->

    The first 5 records in the dataframe are seen here:
    ------------------------------------------------------------------------------------------------------------------------->
    {self.dataframe.head()}
    ------------------------------------------------------------------------------------------------------------------------->

    The last 5 records in the self.dataframe are as follows:
    ------------------------------------------------------------------------------------------------------------------------->
    {self.dataframe.tail()}
    ------------------------------------------------------------------------------------------------------------------------->

    The descriptive statistics of the dataframe (mean,median, max, min, std) are as follows:
    ------------------------------------------------------------------------------------------------------------------------->
    {self.dataframe.describe()}
    ------------------------------------------------------------------------------------------------------------------------->
    """
    print(message)

  def null_count(self):
    # print() returns None, so there is nothing useful to return here
    print(self.dataframe.isnull().sum())

  def unique_count(self):
    return self.dataframe.nunique()

  def unique_per_column(self):
    print("<----- UNIQUE VALUES IN EACH COLUMN ----->")
    for col in self.dataframe.columns:
      print(f"{col} : \n {self.dataframe[col].unique()}")
      print()
    print("<----- END OF UNIQUE VALUES IN EACH COLUMN ----->")
    return

  def plot_barplot(self, dataframe, x, y, x_title, y_title):
    fig, ax = plt.subplots(figsize=(10, 8))

    sns.barplot(data=dataframe, x=x, y=y, orient='h', errorbar=None)
    ax.set_title(f"{y_title} vs {x_title}")
    ax.set_xlabel(x_title)
    ax.set_ylabel(y_title)
    plt.show()

  def plot_boxplot(self, dataframe, x, y, x_title, y_title):
    fig, ax = plt.subplots(figsize=(10, 8))

    sns.boxplot(data=dataframe, x=x, y=y)
    ax.set_title(f"{y_title} vs {x_title}")
    ax.set_xlabel(x_title)
    ax.set_ylabel(y_title)
    plt.xticks(rotation=90)
    plt.show()

In [3]:
class DataPreProcessing(DataSourcing):
  def __init__(self, dataframe):
    super().__init__(dataframe)

  def dropColumns(self, columns):
    return self.dataframe.drop(columns, axis=1)

  def dropRows(self, rows):
    return self.dataframe.drop(rows, axis=0)

## 2. DATA UNDERSTANDING

### a) Loading & Viewing Data

We start by loading the data as a dataframe.

In [4]:
df = pd.read_csv('./data/zillow_data.csv')

The `df.head()` method is used to get a rough look at the first few records in the dataframe and understand the data better.

In [5]:
df.head(10)

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,1996-11,1996-12,1997-01,1997-02,1997-03,1997-04,1997-05,1997-06,1997-07,1997-08,1997-09,1997-10,1997-11,1997-12,1998-01,1998-02,1998-03,1998-04,1998-05,1998-06,1998-07,1998-08,1998-09,1998-10,1998-11,1998-12,1999-01,1999-02,1999-03,1999-04,1999-05,1999-06,1999-07,1999-08,1999-09,1999-10,...,2014-03,2014-04,2014-05,2014-06,2014-07,2014-08,2014-09,2014-10,2014-11,2014-12,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,2016-07,2016-08,2016-09,2016-10,2016-11,2016-12,2017-01,2017-02,2017-03,2017-04,2017-05,2017-06,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
0,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,337600.0,338500.0,339500.0,340400.0,341300.0,342600.0,344400.0,345700.0,346700.0,347800.0,349000.0,350400.0,352000.0,353900.0,356200.0,358800.0,361800.0,365700.0,370200.0,374700.0,378900.0,383500.0,388300.0,393300.0,398500.0,403800.0,409100.0,414600.0,420100.0,426200.0,432600.0,438600.0,444200.0,450000.0,455900.0,462100.0,468500.0,475300.0,482500.0,490200.0,...,863900.0,872900.0,883300.0,889500.0,892800,893600,891300,889900,891500,893000,893000,895000,901200,909400,915000,916700,917700,919800,925800,937100,948200,951000,952500,958600,966200,970400,973900,974700,972600,974300,980800,988000,994700,998700,997000,993700,991300,989200,991300,999100,1005500,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0,236900.0,236700.0,235400.0,233300.0,230600.0,227300.0,223400.0,219600.0,215800.0,211100.0,205700.0,200900.0,196800.0,193600.0,191400.0,190400.0,190800.0,192700.0,196000.0,201300.0,207400.0,212200.0,214600.0,215100.0,213400.0,210200.0,206100.0,202100.0,198800.0,196100.0,194100.0,193400.0,193400.0,193100.0,192700.0,193000.0,193700.0,194800.0,196100.0,197800.0,199700.0,201900.0,...,234200.0,235400.0,236600.0,238500.0,240500,242600,244700,246300,247600,249600,251400,253000,255200,258000,261200,264700,268400,271400,273600,275200,276400,277000,277900,280000,282600,285400,288400,290800,292000,292800,293700,295200,297000,299000,300800,301800,302800,304400,306200,307000,308000,310000,312500,314100,315000,316600,318100,319600,321100,321800
2,91982,77494,Katy,TX,Houston,Harris,3,210400.0,212200.0,212200.0,210700.0,208300.0,205500.0,202500.0,199800.0,198300.0,197300.0,195400.0,193000.0,191800.0,191800.0,193000.0,195200.0,198400.0,202800.0,208000.0,213800.0,220700.0,227500.0,231800.0,233400.0,233900.0,233500.0,233300.0,234300.0,237400.0,242800.0,250200.0,258600.0,268000.0,277000.0,283600.0,288500.0,293900.0,299200.0,304300.0,308600.0,311400.0,312300.0,311900.0,...,282100.0,284200.0,286000.0,288300.0,290700,293300,295900,298300,300200,301300,301700,302400,303600,306200,309100,311900,314100,316300,319000,322000,324300,326100,327300,327000,327200,328500,329800,330000,329000,327800,326700,325500,324700,324500,323700,322300,320700,320000,320000,320900,321000,320600,320200,320400,320800,321200,321200,323000,326900,329900
3,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0,500900.0,503100.0,504600.0,505500.0,505700.0,505300.0,504200.0,503600.0,503400.0,502200.0,500000.0,497900.0,496300.0,495200.0,494700.0,494900.0,496200.0,498600.0,502000.0,507600.0,514900.0,522200.0,529500.0,537900.0,546900.0,556400.0,566100.0,575600.0,584800.0,593500.0,601600.0,610100.0,618600.0,625600.0,631100.0,636600.0,642100.0,647600.0,653300.0,659300.0,665800.0,672900.0,...,1149900.0,1155200.0,1160100.0,1163300.0,1167700,1173900,1175100,1173500,1175500,1178500,1176400,1174600,1178500,1185700,1192900,1198800,1200400,1198900,1200200,1207400,1218600,1226600,1230700,1235400,1241300,1245700,1247000,1246700,1245700,1246000,1247700,1252900,1260900,1267900,1272600,1276600,1280300,1282500,1286000,1289000,1289800,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000
4,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0,77300.0,77300.0,77300.0,77400.0,77500.0,77600.0,77700.0,77700.0,77800.0,77900.0,77900.0,77800.0,77800.0,77800.0,77800.0,77800.0,77900.0,78100.0,78200.0,78400.0,78600.0,78800.0,79000.0,79100.0,79200.0,79300.0,79300.0,79300.0,79400.0,79500.0,79500.0,79600.0,79700.0,79900.0,80100.0,80300.0,80600.0,80900.0,81200.0,81400.0,81700.0,82100.0,...,112000.0,112500.0,112700.0,113100.0,113900,114400,114500,114400,114300,114400,114700,115000,115000,115200,115600,115900,115600,115400,115400,115500,115800,116300,116200,115600,115000,114500,114200,114000,114000,113900,114100,114900,115700,116300,116900,117300,117600,118000,118600,118900,119100,119400,120000,120300,120300,120300,120300,120500,121000,121500
5,91733,77084,Houston,TX,Houston,Harris,6,95000.0,95200.0,95400.0,95700.0,95900.0,96100.0,96200.0,96100.0,96000.0,95800.0,95500.0,95300.0,95100.0,95100.0,95200.0,95400.0,95600.0,95800.0,96000.0,96200.0,96400.0,96500.0,96600.0,96600.0,96700.0,96900.0,97000.0,97200.0,97500.0,97900.0,98500.0,99400.0,100500.0,101800.0,103100.0,104100.0,104800.0,105100.0,105200.0,105200.0,105000.0,104800.0,104700.0,...,125900.0,127100.0,128200.0,129400.0,130300,130900,131900,133300,134800,136200,137700,138900,139900,141000,142100,142900,143900,145200,146200,146900,147600,148500,149600,150600,151200,151800,153000,154400,155200,155500,155400,155000,155100,155900,156500,156900,157300,157600,157700,157700,157900,158700,160200,161900,162800,162800,162800,162900,163500,164300
6,61807,10467,New York,NY,New York,Bronx,7,152900.0,152700.0,152600.0,152400.0,152300.0,152000.0,151800.0,151600.0,151600.0,151700.0,151800.0,151800.0,151900.0,152000.0,152200.0,152400.0,152500.0,152600.0,152700.0,152900.0,153200.0,153800.0,154300.0,154700.0,155200.0,155700.0,156400.0,157000.0,157600.0,158100.0,158600.0,159200.0,160000.0,160900.0,161800.0,162700.0,163700.0,164900.0,166100.0,167300.0,168400.0,169500.0,170700.0,...,327400.0,326400.0,325100.0,324200.0,323400,322300,320500,320500,324200,331700,339300,344300,346900,349400,352500,356500,359400,361200,362900,364000,367100,370800,370700,367600,365800,367200,372200,377300,378000,378300,381100,385400,386000,385300,387300,391300,394200,394500,392500,391200,394400,400000,407300,411600,413200,414300,413900,411400,413200,417900
7,84640,60640,Chicago,IL,Chicago,Cook,8,216500.0,216700.0,216900.0,217000.0,217100.0,217200.0,217500.0,217900.0,218600.0,219700.0,220900.0,221800.0,223000.0,224200.0,225600.0,227100.0,228800.0,230600.0,232700.0,235000.0,237900.0,241300.0,244600.0,247900.0,251300.0,254800.0,258500.0,262300.0,266100.0,269900.0,273900.0,277900.0,282400.0,287200.0,292000.0,296500.0,301200.0,305900.0,310700.0,315500.0,320500.0,325600.0,330800.0,...,589700.0,593300.0,602900.0,614300.0,624400,634700,644400,651700,657600,662000,662200,663700,668700,673800,677100,682900,688200,692500,698400,708200,717300,726400,734700,739400,736900,732800,731000,730500,730800,735500,740000,740100,741500,746100,749500,754800,765800,776600,785900,795500,798000,787100,776100,774900,777900,777900,778500,780500,782800,782800
8,91940,77449,Katy,TX,Houston,Harris,9,95400.0,95600.0,95800.0,96100.0,96400.0,96700.0,96800.0,96800.0,96700.0,96600.0,96400.0,96200.0,96100.0,96200.0,96300.0,96600.0,97000.0,97500.0,98000.0,98400.0,98800.0,99200.0,99500.0,99700.0,100000.0,100200.0,100400.0,100700.0,101100.0,101800.0,102900.0,104300.0,106200.0,108400.0,110400.0,112100.0,113200.0,113600.0,113500.0,113000.0,112500.0,112200.0,112100.0,...,130000.0,131200.0,132500.0,133600.0,134500,135400,136500,137700,138900,140100,141000,142000,143200,144600,146000,147100,148400,149700,151200,152300,153100,154200,156100,157800,159500,161500,164000,166000,167400,168200,168500,168500,168600,168500,168300,167900,167300,166800,166700,166700,166800,167400,168400,169600,170900,172300,173300,174200,175400,176200
9,97564,94109,San Francisco,CA,San Francisco,San Francisco,10,766000.0,771100.0,776500.0,781900.0,787300.0,793000.0,799100.0,805800.0,814400.0,824300.0,833800.0,842900.0,852900.0,863500.0,874800.0,886500.0,898200.0,910200.0,922800.0,936100.0,951500.0,968400.0,984900.0,1001100.0,1018700.0,1037200.0,1056700.0,1076900.0,1097300.0,1118200.0,1139800.0,1162400.0,1187600.0,1214200.0,1240400.0,1266500.0,1294400.0,1323300.0,1353300.0,1384000.0,1414800.0,1445800.0,1477300.0,...,3215700.0,3243400.0,3277900.0,3309700.0,3341800,3357200,3362600,3384400,3419200,3431800,3426400,3439000,3486600,3534800,3566500,3596400,3625500,3641400,3657900,3666000,3667000,3685600,3731000,3776100,3793100,3766500,3720300,3683100,3664400,3656100,3652900,3635600,3635900,3669900,3717900,3734900,3726800,3717500,3734000,3759300,3767700,3763900,3775000,3799700,3793900,3778700,3770800,3763100,3779800,3813500


`df.info()` and `df.dtypes` are both used to give a rough understanding of the dataframe and the types of data held in each column. Normally, `df.info()` would be sufficient, but with this many columns it only prints a summary instead of a per-column listing. Thus, `df.dtypes` is printed as well.

In [6]:
print(df.info())
print()
print(df.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Columns: 272 entries, RegionID to 2018-04
dtypes: float64(219), int64(49), object(4)
memory usage: 30.6+ MB
None

RegionID        int64
RegionName      int64
City           object
State          object
Metro          object
CountyName     object
SizeRank        int64
1996-04       float64
1996-05       float64
1996-06       float64
1996-07       float64
1996-08       float64
1996-09       float64
1996-10       float64
1996-11       float64
1996-12       float64
1997-01       float64
1997-02       float64
1997-03       float64
1997-04       float64
1997-05       float64
1997-06       float64
1997-07       float64
1997-08       float64
1997-09       float64
1997-10       float64
1997-11       float64
1997-12       float64
1998-01       float64
1998-02       float64
1998-03       float64
1998-04       float64
1998-05       float64
1998-06       float64
1998-07       float64
1998-08       float64
1998-09       floa

As expected, all the columns representing time-series data store either integers or floats. It is also worth noting that `RegionID`, `RegionName` and `SizeRank` are stored as integers, while the remaining 4 columns (`City`, `State`, `Metro` and `CountyName`) contain strings.
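Since the per-column listing is long, a count of columns per data type can serve as a compact cross-check. The following is a minimal sketch on a small made-up frame; the name `toy` and its values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Tiny illustrative frame mixing metadata and monthly price columns (made-up values).
toy = pd.DataFrame({
    "RegionID": [1, 2],
    "City": ["A", "B"],
    "1996-04": [100.0, 200.0],
    "2018-04": [150, 260],  # integer-typed price column
})

# Number of columns per dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Names of the columns that are stored as integers.
int_cols = toy.select_dtypes(include="int64").columns.tolist()
print(int_cols)
```

Applied to `df`, the same two calls would list which time-series columns are stored as integers without scrolling through all 272 entries.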

The shape of the dataframe was observed as follows: 

In [7]:
df.shape

(14723, 272)

The dataframe has 14723 records and 272 columns. However, the majority of those columns (265) hold the time-series values. The first seven columns - `RegionID`, `RegionName`, `City`, `State`, `Metro`, `CountyName` and `SizeRank` - are the columns that give us more information about the dataset. 

Each of these columns is intended to hold a certain type of data as follows: 
- *RegionID*: The unique ID of the region in question.
- *RegionName*: The ZIP code of the region in question, stored as an integer (e.g. 60657 for Chicago, IL).
- *City*: The name of the city within a given region.
- *State*: The state in which the RegionID is found.
- *Metro*: The metropolitan name within which the RegionID is found.
- *CountyName*: The name of the county within a given region.
- *SizeRank*: The region's area ranking vis-a-vis other regions, organised in descending order.

Going forward, these columns shall be referred to as the columns of interest.

In [8]:
# ['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName','SizeRank']

cols_of_interest = df.columns[:7]
cols_of_interest

Index(['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName',
       'SizeRank'],
      dtype='object')

### b) Data Cleaning

#### **i. Handling of Null Values**

The number of null values in the dataframe will now be established. This is necessary to choose an appropriate strategy for handling them.

In [9]:
df.isnull().sum()

RegionID         0
RegionName       0
City             0
State            0
Metro         1043
CountyName       0
SizeRank         0
1996-04       1039
1996-05       1039
1996-06       1039
1996-07       1039
1996-08       1039
1996-09       1039
1996-10       1039
1996-11       1039
1996-12       1039
1997-01       1039
1997-02       1039
1997-03       1039
1997-04       1039
1997-05       1039
1997-06       1039
1997-07       1038
1997-08       1038
1997-09       1038
1997-10       1038
1997-11       1038
1997-12       1038
1998-01       1036
1998-02       1036
1998-03       1036
1998-04       1036
1998-05       1036
1998-06       1036
1998-07       1036
1998-08       1036
1998-09       1036
1998-10       1036
1998-11       1036
1998-12       1036
1999-01       1036
1999-02       1036
1999-03       1036
1999-04       1036
1999-05       1036
1999-06       1036
1999-07       1036
1999-08       1036
1999-09       1036
1999-10       1036
1999-11       1036
1999-12       1036
2000-01     

Aside from the columns that contain time-series values, only the `Metro` column has null values. This can be seen as inconsequential, as there is enough data in the other columns to overlook it. 

Taking a closer look at the missing time-series values, there is a steady trend. The number of null values per month does not appear to be random: it steadily decreases from April 1996 to July 2014, after which it remains 0 through April 2018.

The gradual change indicates that the presence of null values in these columns is anything but random. As such, **all null values in the time-series columns will be left as is**. No replacement or removal of records will be done. It is assumed that not all the houses were built at the same time, so null values are expected for some houses and not others. Furthermore, some of these null values are attributed to the different times at which the houses were put on the market.

The null values in the `Metro` column will also be left as is, since the missing data does not impact the objectives of the project. 
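The claim that the null counts only ever shrink over time can be checked directly. The sketch below uses a small made-up frame; `toy`, `ts_cols` and the values are illustrative assumptions. It counts nulls per month, tests that the count never increases, and finds the first month with a value for each row.

```python
import numpy as np
import pandas as pd

# Illustrative wide frame: two regions start reporting later than the first.
toy = pd.DataFrame({
    "RegionID": [1, 2, 3],
    "1996-04": [100.0, np.nan, np.nan],
    "1996-05": [101.0, 90.0, np.nan],
    "1996-06": [102.0, 91.0, 80.0],
})

# Assumption: every column after the metadata column is a month.
ts_cols = toy.columns[1:]
null_counts = toy[ts_cols].isnull().sum()

# True when the null count never increases from one month to the next.
never_increases = bool((null_counts.diff().dropna() <= 0).all())

# First month with a non-null value for each row.
first_valid = toy[ts_cols].notna().idxmax(axis=1)

print(null_counts.tolist(), never_increases)
print(first_valid.tolist())
```

On the real data, `ts_cols` would be the last 265 columns of `df`; rows whose first valid month is later than April 1996 are the ones contributing the null values.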



#### **ii. Duplicate Values Check**

A check for duplicate values is done. This check applies only to the non-time-series columns, as it is possible for different properties to share the same price values, and flagging such repeats would not help in understanding the dataset.

In [10]:
duplicates = df.duplicated(subset=cols_of_interest,keep=False)

df[duplicates]

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,1996-11,1996-12,1997-01,1997-02,1997-03,1997-04,1997-05,1997-06,1997-07,1997-08,1997-09,1997-10,1997-11,1997-12,1998-01,1998-02,1998-03,1998-04,1998-05,1998-06,1998-07,1998-08,1998-09,1998-10,1998-11,1998-12,1999-01,1999-02,1999-03,1999-04,1999-05,1999-06,1999-07,1999-08,1999-09,1999-10,...,2014-03,2014-04,2014-05,2014-06,2014-07,2014-08,2014-09,2014-10,2014-11,2014-12,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,2016-07,2016-08,2016-09,2016-10,2016-11,2016-12,2017-01,2017-02,2017-03,2017-04,2017-05,2017-06,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04


No duplicates are observed in the dataset.
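For reference, the effect of `keep=False` in `DataFrame.duplicated` can be seen on a small made-up frame (the names `toy` and `key_cols` are illustrative assumptions). It flags every member of a duplicated group, whereas the default `keep='first'` leaves the first occurrence unflagged.

```python
import pandas as pd

# Two rows share the same identifying columns but differ in price.
toy = pd.DataFrame({
    "RegionID": [1, 1, 2],
    "City": ["A", "A", "B"],
    "1996-04": [100.0, 105.0, 200.0],
})
key_cols = ["RegionID", "City"]

# keep=False marks all rows in a duplicated group.
mask_all = toy.duplicated(subset=key_cols, keep=False)

# The default (keep='first') marks only the later occurrences.
mask_first = toy.duplicated(subset=key_cols)

print(mask_all.tolist())
print(mask_first.tolist())
```

Using `keep=False` is convenient for inspection, since all conflicting rows show up side by side.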

#### **iii. Consistent Data Types**

Having run `df.dtypes` earlier, it was noted that some of the price data in the time-series columns was saved as integers while the rest was saved as floats. All the data in the time-series columns will now be converted to floats, so that every price shares one data type and no price value is lost or rounded off. 

In [11]:
# Ensuring all time series columns are converted to float64 i.e. the last 265 columns

for col in df.columns[-265:]:
  # Compare dtypes with `!=`; `is not 'float64'` tests object identity and is always True
  if df[col].dtype != 'float64':
    df[col] = df[col].astype('float64')


df.dtypes

RegionID        int64
RegionName      int64
City           object
State          object
Metro          object
CountyName     object
SizeRank        int64
1996-04       float64
1996-05       float64
1996-06       float64
1996-07       float64
1996-08       float64
1996-09       float64
1996-10       float64
1996-11       float64
1996-12       float64
1997-01       float64
1997-02       float64
1997-03       float64
1997-04       float64
1997-05       float64
1997-06       float64
1997-07       float64
1997-08       float64
1997-09       float64
1997-10       float64
1997-11       float64
1997-12       float64
1998-01       float64
1998-02       float64
1998-03       float64
1998-04       float64
1998-05       float64
1998-06       float64
1998-07       float64
1998-08       float64
1998-09       float64
1998-10       float64
1998-11       float64
1998-12       float64
1999-01       float64
1999-02       float64
1999-03       float64
1999-04       float64
1999-05       float64
1999-06   

### c) Data Handling

#### **i. Aggregation of Data**

As all values in the time-series columns are deemed important, aggregating the data (to a yearly basis, for example) at such an early stage is deemed unnecessary, because a lot of information needed later on would be lost in the process. Thus, aggregation will be done on an as-needed basis.

The data is grouped by state further below, to give a better idea of which states have more property listings than others.
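As an illustration of aggregating only when needed, the sketch below computes yearly means from a monthly series without modifying the monthly data. The series `monthly` and its values are made up for the example.

```python
import pandas as pd

# Illustrative monthly series for one region (made-up values).
monthly = pd.Series(
    [100.0, 102.0, 104.0, 110.0, 112.0, 114.0],
    index=pd.to_datetime(
        ["2016-10", "2016-11", "2016-12", "2017-01", "2017-02", "2017-03"],
        format="%Y-%m",
    ),
)

# Yearly mean computed on demand; the monthly series itself is left untouched.
yearly_mean = monthly.groupby(monthly.index.year).mean()
print(yearly_mean)
```

Keeping the monthly values as the source of truth means any other granularity (quarterly, yearly) can be derived later in the same way.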

### d) Feature Engineering

#### **i. Addition of New Features** 

Relevant features that could affect house prices were noted and their creation was deemed necessary. These features are: 

- *Return On Investment (%)*: Calculated as 
$
\left( \frac{\text{Last Price of Property}}{\text{Initial Price of Property}} - 1 \right) \times 100\% 
$


-
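The ROI formula above could be computed per row along the following lines. This is a minimal sketch under stated assumptions: the frame `toy` is made up, the column name `ROI(%)` is hypothetical, and "initial price" is taken to mean the first non-null price in a row, so that regions whose series start late still get a value.

```python
import numpy as np
import pandas as pd

# Illustrative wide frame (made-up values); the second region starts reporting late.
toy = pd.DataFrame({
    "RegionName": [10001, 10002],
    "1996-04": [100000.0, np.nan],
    "1996-05": [101000.0, 50000.0],
    "2018-04": [150000.0, 100000.0],
})

# Assumption: every column after the metadata column is a month.
ts_cols = toy.columns[1:]

# First and last non-null price in each row.
first_price = toy[ts_cols].bfill(axis=1).iloc[:, 0]
last_price = toy[ts_cols].ffill(axis=1).iloc[:, -1]

# ROI as a percentage: (last / initial - 1) * 100.
toy["ROI(%)"] = (last_price / first_price - 1) * 100
print(toy[["RegionName", "ROI(%)"]])
```

Whether the first available price is the right baseline for late-starting regions is a modelling choice; restricting the comparison to a common start date is an equally reasonable alternative.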

In [None]:
df[['State','RegionID','RegionName','City','Metro','CountyName','SizeRank']].groupby("State").count()

Particular focus will be placed on the `RegionID` as that column does not have any null values, thus an accurate count of properties can be done.

In [None]:
df_state_grouping = df[['State','RegionID']].groupby("State").count().sort_values(by='RegionID', ascending=False)
df_state_grouping

California (CA) is seen to have the most listings with 1224, followed by New York (NY) with 1015 and Texas (TX) with 989. Vermont (VT), Washington DC (DC) and South Dakota (SD) have the fewest listings at 16, 18 and 19 respectively. 

A new dataframe is now created from the melted data. The custom function `melt_data` converts the original dataframe `df` from wide to long format and then averages the values across all rows for each month.

In [None]:
def melt_data(df):
    """
    Takes the zillow_data dataset in wide form or a subset of the zillow_dataset.  
    Returns a long-form datetime dataframe 
    with the datetime column names as the index and the values as the 'value' column.
    
    If more than one row is passed in the wide-form dataset, the 'value' column
    will be the mean of the values from the datetime columns in all of the rows.
    """

    melted = pd.melt(df, id_vars=['RegionName', 'RegionID', 'SizeRank',
                     'City', 'State', 'Metro', 'CountyName'], var_name='time')
    # Column names look like '1996-04', so an explicit format is given;
    # `infer_datetime_format` is deprecated in pandas 2.x
    melted['time'] = pd.to_datetime(melted['time'], format='%Y-%m')
    melted = melted.dropna(subset=['value'])
    return melted.groupby('time').aggregate({'value': 'mean'})

Converting the dataframe to long format:

In [None]:
df_melted = melt_data(df)
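Because `melt_data` averages over every row it receives, calling it on the full dataframe yields a single series for all regions combined. If one series per ZIP code is needed instead, a long-format frame can be pivoted as in the sketch below. The frame `toy` and the names `long_df`, `per_region` and `overall_mean` are illustrative assumptions, and this is not the notebook's own function.

```python
import pandas as pd

# Illustrative wide frame (made-up values) with one missing price.
toy = pd.DataFrame({
    "RegionName": [10001, 10002],
    "State": ["NY", "NY"],
    "1996-04": [100.0, 200.0],
    "1996-05": [110.0, None],
})

# Wide to long: one row per (region, month).
long_df = toy.melt(id_vars=["RegionName", "State"], var_name="time", value_name="value")
long_df["time"] = pd.to_datetime(long_df["time"], format="%Y-%m")
long_df = long_df.dropna(subset=["value"])

# One column per region, indexed by month.
per_region = long_df.pivot(index="time", columns="RegionName", values="value")

# Averaging across regions gives a single series, similar in spirit to `melt_data`.
overall_mean = long_df.groupby("time")["value"].mean()

print(per_region)
print(overall_mean.tolist())
```

The per-region layout is the more natural input when a separate time-series model is fitted for each ZIP code.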

## 3. EDA & VISUALISATION

### Descriptive Statistics

## 4. FEATURE ENGINEERING

### Dealing with Missing Data

### Dealing with Outliers

### Scaling & Normalization

### Extracting Time & Date Features

## 5. ARIMA MODELLING

## 6. FINE TUNING WITH PROPHET

## 7. INTERPRETING RESULTS