## PHASE 4 PROJECT: GROUP 10
## Project Title: Time Series Modelling of Real Estate Prices

## Team Members
1. Celiajoy Omiah
2. Crystal Wanjiru
3. Kelvin Rotich
4. Miriam Nguru
5. Paul Kamau
6. Stephen Butiya



## Introduction
This project aims to construct a time series model leveraging Zillow's dataset to assist real estate investors in making well-informed investment decisions.


## Overview
The project proceeds through the following steps:

1. Loading the dataset.
2. Gaining a comprehensive understanding of the dataset.
3. Identifying and selecting our target variable.
4. Preparing the dataset, which involves tasks such as cleaning, checking for multicollinearity, and ensuring data integrity.
5. Encoding categorical variables to make them compatible with our modeling process.
6. Building several models to explore different approaches.
7. Evaluating the performance of our models using appropriate metrics.
8. Utilizing our trained models for making predictions.
9. Drawing meaningful insights and conclusions based on our findings.

## Business Understanding
Real estate investment is a profitable but fast-changing industry that demands careful analysis and strategic decision-making. A fictitious real estate investment firm wants to identify the top five zip codes that offer the most promising investment opportunities. To answer this question, we use historical data from Zillow Research.

## Objectives
The objectives of this project are:
* To identify the top 5 zip codes and states that offer the best investment potential in terms of real estate value. By analyzing historical trends and patterns, the project aims to provide actionable insights to the investment firm, enabling them to make informed decisions on where to allocate their resources.
* To analyze historical real estate values for monthly, quarterly, semi-annual and annual patterns.
* To create an ARIMA model that can predict future real estate values.
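
As a rough illustration of the second objective, a monthly series can be aggregated to quarterly, semi-annual and annual averages by grouping on calendar keys taken from its date index. The series below is synthetic, and the use of the mean as the aggregate is an assumption; the project may summarise the periods differently.

```python
import numpy as np
import pandas as pd

# Synthetic monthly values for one hypothetical zip code (not Zillow data).
dates = pd.date_range("2016-01-01", periods=24, freq="MS")
monthly = pd.Series(np.arange(100_000, 124_000, 1_000, dtype=float), index=dates)

# Calendar keys derived from the date index.
year = monthly.index.year
quarter = monthly.index.quarter
half = np.where(monthly.index.month <= 6, 1, 2)  # 1 = Jan-Jun, 2 = Jul-Dec

# Average value per period at each granularity.
quarterly = monthly.groupby([year, quarter]).mean()
semi_annual = monthly.groupby([year, half]).mean()
annual = monthly.groupby(year).mean()

print(annual)
```

Grouping on keys derived from the index gives the same period averages as resampling, without relying on frequency aliases.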




## Data Understanding

The Zillow Housing Dataset, obtained from the Zillow Research page, describes each region with the attributes RegionID, RegionName, City, State, Metro, SizeRank and CountyName, followed by a series of date columns that hold the real estate values.

To gain an initial insight into the structure of our dataset, let's load and preview the data.

In [4]:
# Data handling libraries
import numpy as np
import pandas as pd

# Visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
%matplotlib inline

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Modelling libraries
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [6]:
# Function to load and examine the data

def load_and_examine_data(file_path):
    """Load a CSV file, print a summary of its contents and return the DataFrame.

    Returns None if the file is missing or cannot be read.
    """
    try:
        # Load the data from the specified file path
        data = pd.read_csv(file_path)

        # Display the shape, columns and the first few rows of the dataset
        print("----------------------------Details about the data----------------------------\n")
        print("----------------------------Shape of the dataset----------------------------\n")
        display(data.shape)
        print()
        print("----------------------------Columns of the dataset----------------------------\n")
        display(data.columns)
        print()
        print("----------------------------Head of the dataset----------------------------\n")
        display(data.head())
        print()

        # Summarise the identifier columns only; the date columns are left out.
        # info() prints its report and returns None, so display() also shows None.
        print("\n----------------------------Data information ----------------------------\n")
        display(data[['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName',
                      'SizeRank']].info())
        print("\n----------------------------Descriptive Statistics of the dataset ----------------------------\n")
        display(data.describe())

        return data

    except FileNotFoundError:
        print(f"File '{file_path}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

    # Reached only when loading failed
    return None

# Replace with your data file path
file_path = "zillow_data.csv"
data = load_and_examine_data(file_path)

----------------------------Details about the data----------------------------

----------------------------Shape of the dataset----------------------------



(14723, 272)


----------------------------Columns of the dataset----------------------------



Index(['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName',
       'SizeRank', '1996-04', '1996-05', '1996-06',
       ...
       '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
       '2018-01', '2018-02', '2018-03', '2018-04'],
      dtype='object', length=272)


----------------------------Head of the dataset----------------------------



|  | RegionID | RegionName | City | State | Metro | CountyName | SizeRank | 1996-04 | 1996-05 | 1996-06 | ... | 2017-07 | 2017-08 | 2017-09 | 2017-10 | 2017-11 | 2017-12 | 2018-01 | 2018-02 | 2018-03 | 2018-04 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84654 | 60657 | Chicago | IL | Chicago | Cook | 1 | 334200.0 | 335400.0 | 336500.0 | ... | 1005500 | 1007500 | 1007800 | 1009600 | 1013300 | 1018700 | 1024400 | 1030700 | 1033800 | 1030600 |
| 1 | 90668 | 75070 | McKinney | TX | Dallas-Fort Worth | Collin | 2 | 235700.0 | 236900.0 | 236700.0 | ... | 308000 | 310000 | 312500 | 314100 | 315000 | 316600 | 318100 | 319600 | 321100 | 321800 |
| 2 | 91982 | 77494 | Katy | TX | Houston | Harris | 3 | 210400.0 | 212200.0 | 212200.0 | ... | 321000 | 320600 | 320200 | 320400 | 320800 | 321200 | 321200 | 323000 | 326900 | 329900 |
| 3 | 84616 | 60614 | Chicago | IL | Chicago | Cook | 4 | 498100.0 | 500900.0 | 503100.0 | ... | 1289800 | 1287700 | 1287400 | 1291500 | 1296600 | 1299000 | 1302700 | 1306400 | 1308500 | 1307000 |
| 4 | 93144 | 79936 | El Paso | TX | El Paso | El Paso | 5 | 77300.0 | 77300.0 | 77300.0 | ... | 119100 | 119400 | 120000 | 120300 | 120300 | 120300 | 120300 | 120500 | 121000 | 121500 |




----------------------------Data information ----------------------------

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   RegionID    14723 non-null  int64 
 1   RegionName  14723 non-null  int64 
 2   City        14723 non-null  object
 3   State       14723 non-null  object
 4   Metro       13680 non-null  object
 5   CountyName  14723 non-null  object
 6   SizeRank    14723 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 805.3+ KB


None


----------------------------Descriptive Statistics of the dataset ----------------------------



|  | RegionID | RegionName | SizeRank | 1996-04 | 1996-05 | 1996-06 | 1996-07 | 1996-08 | 1996-09 | 1996-10 | ... | 2017-07 | 2017-08 | 2017-09 | 2017-10 | 2017-11 | 2017-12 | 2018-01 | 2018-02 | 2018-03 | 2018-04 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14723.0 | 14723.0 | 14723.0 | 13684.0 | 13684.0 | 13684.0 | 13684.0 | 13684.0 | 13684.0 | 13684.0 | ... | 14723.0 | 14723.0 | 14723.0 | 14723.0 | 14723.0 | 14723.0 | 14723.0 | 14723.0 | 14723.0 | 14723.0 |
| mean | 81075.010052 | 48222.348706 | 7362.0 | 118299.1 | 118419.0 | 118537.4 | 118653.1 | 118780.3 | 118927.5 | 119120.5 | ... | 273335.4 | 274865.8 | 276464.6 | 278033.2 | 279520.9 | 281095.3 | 282657.1 | 284368.7 | 286511.4 | 288039.9 |
| std | 31934.118525 | 29359.325439 | 4250.308342 | 86002.51 | 86155.67 | 86309.23 | 86467.95 | 86650.94 | 86872.08 | 87151.85 | ... | 360398.4 | 361467.8 | 362756.3 | 364461.0 | 365600.3 | 367045.4 | 369572.7 | 371773.9 | 372461.2 | 372054.4 |
| min | 58196.0 | 1001.0 | 1.0 | 11300.0 | 11500.0 | 11600.0 | 11800.0 | 11800.0 | 12000.0 | 12100.0 | ... | 14400.0 | 14500.0 | 14700.0 | 14800.0 | 14500.0 | 14300.0 | 14100.0 | 13900.0 | 13800.0 | 13800.0 |
| 25% | 67174.5 | 22101.5 | 3681.5 | 68800.0 | 68900.0 | 69100.0 | 69200.0 | 69375.0 | 69500.0 | 69600.0 | ... | 126900.0 | 127500.0 | 128200.0 | 128700.0 | 129250.0 | 129900.0 | 130600.0 | 131050.0 | 131950.0 | 132400.0 |
| 50% | 78007.0 | 46106.0 | 7362.0 | 99500.0 | 99500.0 | 99700.0 | 99700.0 | 99800.0 | 99900.0 | 99950.0 | ... | 188400.0 | 189600.0 | 190500.0 | 191400.0 | 192500.0 | 193400.0 | 194100.0 | 195000.0 | 196700.0 | 198100.0 |
| 75% | 90920.5 | 75205.5 | 11042.5 | 143200.0 | 143300.0 | 143225.0 | 143225.0 | 143500.0 | 143700.0 | 143900.0 | ... | 305000.0 | 306650.0 | 308500.0 | 309800.0 | 311700.0 | 313400.0 | 315100.0 | 316850.0 | 318850.0 | 321100.0 |
| max | 753844.0 | 99901.0 | 14723.0 | 3676700.0 | 3704200.0 | 3729600.0 | 3754600.0 | 3781800.0 | 3813500.0 | 3849600.0 | ... | 18889900.0 | 18703500.0 | 18605300.0 | 18569400.0 | 18428800.0 | 18307100.0 | 18365900.0 | 18530400.0 | 18337700.0 | 17894900.0 |
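
The counts above differ between columns: the 1996 date columns have 13,684 values against 14,723 for the most recent ones, and Metro has 13,680 non-null entries. That points to missing values. One way to list the affected columns is sketched below on a small made-up frame; the frame `toy` and its values are purely illustrative and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset's layout; every value here is made up.
toy = pd.DataFrame({
    "RegionName": [11111, 22222, 33333],
    "Metro": ["Metro A", None, "Metro C"],
    "1996-04": [100000.0, np.nan, 150000.0],
    "2018-04": [210000.0, 180000.0, 320000.0],
})

# Count missing entries per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to the full dataset, the same two lines would show which columns need attention during data preparation.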


The dataset contains the following columns:
* RegionID: a unique identifier for each region.

* RegionName: the zip code of the region.

* City: the city the housing data belongs to.

* State: the state in which the region is located.

* Metro: the name of the metropolitan area around the region.

* CountyName: the county in which the region is located.

* SizeRank: a ranking of the region based on its size.

* Date columns (1996-04 to 2018-04): one column per month, each holding the real estate value for that point in time.
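
Because each month is stored as its own column, the data is in wide format. For time series work it is often convenient to reshape it to long format, with one row per region and date. The sketch below does this on a two-row made-up frame; the output column names `Date` and `Value` are assumptions for the example, not names taken from the dataset.

```python
import pandas as pd

# Toy wide-format frame with the dataset's identifier columns; rows are made up.
wide = pd.DataFrame({
    "RegionID": [1, 2],
    "RegionName": [11111, 22222],
    "City": ["City A", "City B"],
    "State": ["AA", "BB"],
    "Metro": ["Metro A", "Metro B"],
    "CountyName": ["County A", "County B"],
    "SizeRank": [1, 2],
    "1996-04": [100000.0, 150000.0],
    "1996-05": [100500.0, 151000.0],
})

id_cols = ["RegionID", "RegionName", "City", "State", "Metro", "CountyName", "SizeRank"]

# Turn every date column into a (Date, Value) pair per region.
long = wide.melt(id_vars=id_cols, var_name="Date", value_name="Value")

# Parse the "YYYY-MM" labels into timestamps and order each region chronologically.
long["Date"] = pd.to_datetime(long["Date"], format="%Y-%m")
long = long.sort_values(["RegionName", "Date"]).reset_index(drop=True)

print(long[["RegionName", "Date", "Value"]])
```

With the dates parsed, a single region's rows can be selected and indexed by `Date` to obtain a series suitable for decomposition or modelling.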

## Data Preparation