# Business Understanding

## Business Problem

Real estate investment firms weigh many factors when deciding which investments to make, and they aim to improve their decision-making processes. Given historical real estate market data, the challenge is to use data science techniques to identify potential investment opportunities and help these firms make informed investment decisions.


## Problem Statement

A real estate investment firm wants to know the five best zip codes to invest in. As a data science consulting group, we have been tasked with identifying them.
The task at hand is to build a model that informs us about real estate investment market trends over the next 10 years.

## Objectives

- To select the 5 best zip codes to invest in
- To predict the value range for the top 5 zip codes

# Data Understanding

# Data Preprocessing

The data used in this project comes from [Zillow Research](https://www.zillow.com/research/data/).

### Loading the data set

In [1]:
import pandas as pd

df = pd.read_csv('data/zillow_data.csv')

df.head(5)

| | RegionID | RegionName | City | State | Metro | CountyName | SizeRank | 1996-04 | 1996-05 | 1996-06 | ... | 2017-07 | 2017-08 | 2017-09 | 2017-10 | 2017-11 | 2017-12 | 2018-01 | 2018-02 | 2018-03 | 2018-04 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84654 | 60657 | Chicago | IL | Chicago | Cook | 1 | 334200.0 | 335400.0 | 336500.0 | ... | 1005500 | 1007500 | 1007800 | 1009600 | 1013300 | 1018700 | 1024400 | 1030700 | 1033800 | 1030600 |
| 1 | 90668 | 75070 | McKinney | TX | Dallas-Fort Worth | Collin | 2 | 235700.0 | 236900.0 | 236700.0 | ... | 308000 | 310000 | 312500 | 314100 | 315000 | 316600 | 318100 | 319600 | 321100 | 321800 |
| 2 | 91982 | 77494 | Katy | TX | Houston | Harris | 3 | 210400.0 | 212200.0 | 212200.0 | ... | 321000 | 320600 | 320200 | 320400 | 320800 | 321200 | 321200 | 323000 | 326900 | 329900 |
| 3 | 84616 | 60614 | Chicago | IL | Chicago | Cook | 4 | 498100.0 | 500900.0 | 503100.0 | ... | 1289800 | 1287700 | 1287400 | 1291500 | 1296600 | 1299000 | 1302700 | 1306400 | 1308500 | 1307000 |
| 4 | 93144 | 79936 | El Paso | TX | El Paso | El Paso | 5 | 77300.0 | 77300.0 | 77300.0 | ... | 119100 | 119400 | 120000 | 120300 | 120300 | 120300 | 120300 | 120500 | 121000 | 121500 |


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Columns: 272 entries, RegionID to 2018-04
dtypes: float64(219), int64(49), object(4)
memory usage: 30.6+ MB


After familiarising ourselves with the data, the next step is to select the features that will be used.
The columns `RegionID` and `CountyName` will be dropped because they only repeat location information already covered by other features (`RegionName`, `City` and `State`), while `Metro` and `SizeRank` are irrelevant to our objectives.

In [3]:
# Selecting relevant columns

irrelevant_features = ['RegionID', 'Metro', 'SizeRank', 'CountyName']

df = df.drop(columns=irrelevant_features)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Columns: 268 entries, RegionName to 2018-04
dtypes: float64(219), int64(47), object(2)
memory usage: 30.1+ MB
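The claim that `RegionID` and `CountyName` carry no extra location information can be checked before dropping them. The snippet below is a minimal sketch of such a check, not part of the original notebook. It runs on a small `toy` frame (a hypothetical name) built from the five rows shown in the `df.head(5)` output, so the result says nothing about the full data set until the same check is run on `df`.

```python
import pandas as pd

# Toy rows copied from the df.head(5) output; the real check would use df itself.
toy = pd.DataFrame({
    "RegionID": [84654, 90668, 91982, 84616, 93144],
    "RegionName": [60657, 75070, 77494, 60614, 79936],
    "City": ["Chicago", "McKinney", "Katy", "Chicago", "El Paso"],
    "CountyName": ["Cook", "Collin", "Harris", "Cook", "El Paso"],
})

# RegionID is redundant if each zip code (RegionName) maps to exactly one
# RegionID and each RegionID maps to exactly one zip code.
ids_per_zip = toy.groupby("RegionName")["RegionID"].nunique()
zips_per_id = toy.groupby("RegionID")["RegionName"].nunique()
region_id_is_redundant = bool((ids_per_zip == 1).all() and (zips_per_id == 1).all())

# CountyName adds no new location detail if every zip code has a single county.
counties_per_zip = toy.groupby("RegionName")["CountyName"].nunique()
county_is_determined_by_zip = bool((counties_per_zip == 1).all())

print(region_id_is_redundant, county_is_determined_by_zip)
```

If either flag comes back `False` on the full data, the corresponding column would hold information that `RegionName` does not, and dropping it would deserve a second look.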


### Dealing with Missing values

In [4]:
columns_missing_values_check = df.isnull().any().sum()
columns_missing_values_check

219

In [8]:
rows_missing_values_check = df.isnull().sum(axis=1)
rows_with_missing_values = (rows_missing_values_check > 0).sum()
rows_with_missing_values

1039

The missing values are confined to the date columns of the data set: 219 columns and 1,039 of the 14,723 rows contain at least one. We chose to leave them as they are, since we do not expect them to affect the analysis.
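One way to confirm where the missing values sit is to count them per column and compare the affected columns with the list of date columns. The snippet below is a rough sketch under stated assumptions: the `toy` frame and its values are made up to mirror the layout of `df`, and `id_columns` is a hypothetical helper list naming the non-date columns that remain after the drop.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the layout of df (values are illustrative only).
toy = pd.DataFrame({
    "RegionName": [60657, 75070, 77494],
    "City": ["Chicago", "McKinney", "Katy"],
    "State": ["IL", "TX", "TX"],
    "1996-04": [334200.0, np.nan, 210400.0],
    "1996-05": [335400.0, np.nan, 212200.0],
    "1996-06": [336500.0, 236700.0, 212200.0],
})

# Assumption: everything that is not an identifier column is a monthly date column.
id_columns = ["RegionName", "City", "State"]
date_columns = [c for c in toy.columns if c not in id_columns]

# Missing values per column, keeping only the columns that have any.
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]

# True when every column with missing values is a date column.
only_dates_affected = set(missing_per_column.index) <= set(date_columns)

# Share of rows with at least one missing value.
share_rows_affected = toy.isnull().any(axis=1).mean()

print(missing_per_column)
print(only_dates_affected, share_rows_affected)
```

Running the same steps on `df` would show which months are incomplete and what share of zip codes is affected, which helps judge whether leaving the gaps in place is safe for the later modelling steps.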

# EDA