# Predicting United States Real Estate Prices
By: Grace Li, Olivia Weisiger and Fionnuala Eastwood

### Outline
1. Dataset Preprocessing and Merging
2. Data Visualization
3. Classic Machine Learning Models
4. Deep Learning Models
5. Analysis of Accuracy and Results

### Context
The United States housing market is...

### Our Goal

This project will explore the current United States real-estate market, investigate what factors influence the price of property, and create multiple machine learning models that predict these housing costs throughout the country. More specifically, this will be accomplished through implementation of (add briefly about what models we end up using....) Being able to infer and understand the trends of real estate is extremely valuable economic knowledge that will provide important insights about our country. 

Furthermore, our project aims to deepen our understanding of how societal biases influence external structures such as the economy. By merging datasets, we will investigate which underlying factors such as (add briefly when choose other data) affect the prices of houses in order to draw deeper conclusions about intangible factors impacting our economic climate.

### Import Packages and Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

This data is from Kaggle's "USA Real Estate Dataset" found here: https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset

In [2]:
df = pd.read_csv('realtor-data.zip.csv')

### Initial Data Processing
Let's first break down what our dataset looks like...

In [3]:
df.shape

(2226382, 12)

We have a dataset with over 2 million rows and 12 columns. Since that is too many samples to process in a reasonable amount of time, we will take a random subset of 50,000 samples to analyze.

In [4]:
df = df.sample(50000)
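As a side note, a random draw like the one above changes on every run unless a seed is fixed. The following is a minimal sketch, assuming a small toy table in place of the full listings data, of how `random_state` makes the sample repeatable. The seed value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy stand-in for the full listings table (values are illustrative only).
toy = pd.DataFrame({"price": range(100), "bed": [i % 5 for i in range(100)]})

# Passing random_state fixes the random draw, so reruns select the same rows.
sample_a = toy.sample(n=10, random_state=0)
sample_b = toy.sample(n=10, random_state=0)

# Both draws contain the same rows in the same order.
print(sample_a.equals(sample_b))
```

Fixing the seed would also keep the row counts reported later in the notebook stable between runs.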

With our refined sample, let's get an idea of what our dataset looks like by outputting a few rows of the table.

In [5]:
df.head()

Unnamed: 0,brokered_by,status,price,bed,bath,acre_lot,street,city,state,zip_code,house_size,prev_sold_date
570890,2861.0,for_sale,303950.0,3.0,2.0,0.17,1405853.0,Beverly Hills,Florida,34465.0,1720.0,
396259,56692.0,for_sale,185000.0,,,9.98,1892098.0,Cleveland,Georgia,30528.0,,
1587017,58652.0,sold,210000.0,3.0,2.0,0.21,1823001.0,Newport News,Virginia,23608.0,1860.0,2021-12-15
1726366,3810.0,sold,190000.0,3.0,2.0,0.15,1102433.0,Zephyrhills,Florida,33542.0,1842.0,2021-11-12
206275,14210.0,for_sale,299000.0,3.0,3.0,0.14,1498070.0,Middletown,Delaware,19709.0,1750.0,2014-06-27


Notice that each sample in the dataset is a real estate listing in the United States (the listings are all from 2022-2024), and each sample has 12 features that provide numerical or categorical information about the listing.

Here is an overview of each feature's meaning and data type:

- brokered_by:

- status:

- price:

- bed:

- bath:

- etc. (fill in later; these descriptions are provided on Kaggle)
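Until the descriptions above are filled in, the data types can be read directly from the DataFrame. This is a small sketch on a toy table that borrows a few column names and values from the preview above. The variable names `numeric_cols` and `other_cols` are our own.

```python
import pandas as pd

# Toy table using a few columns and values from the preview above.
toy = pd.DataFrame({
    "status": ["for_sale", "sold", "sold"],
    "price": [303950.0, 210000.0, 190000.0],
    "bed": [3.0, 3.0, 3.0],
    "city": ["Beverly Hills", "Newport News", "Zephyrhills"],
})

# dtypes lists the storage type of every column.
print(toy.dtypes)

# Split the columns into numerical and non-numerical (categorical/text) groups.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```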

Now that we have an understanding of our dataset, we will perform some processing on the data so that it is cleaner to use. First, we will drop some unnecessary columns that do not contribute to our analysis goals. The brokered_by column, which encodes the real-estate company in charge of the property, is not necessary because we are interested in the qualities of the house itself. Additionally, the status column is not needed because we will treat the listed price the same way regardless of whether the house is sold or for sale. Lastly, the previously sold date can be dropped since we are focused on the current selling price. These modifications trim our dataset from 12 columns to 9.

In [6]:
df = df.drop(['brokered_by', 'status', 'prev_sold_date'], axis=1)

This dataset contains listings from the United States and all its territories. For our purposes, we only want to analyze data from the 50 states (and Washington, DC), so let's trim out samples taken from Puerto Rico and the Virgin Islands.

In [7]:
df = df[(df['state'] != "Puerto Rico") & (df['state'] != "Virgin Islands")]
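If more territories ever need to be excluded, the chained comparisons can get long. A possible alternative, sketched here on a toy table, is to keep the excluded names in a list and filter with `isin`. The prices in the territory rows are made up for illustration.

```python
import pandas as pd

# Toy table; the prices in the two territory rows are made up.
toy = pd.DataFrame({
    "state": ["Florida", "Puerto Rico", "Georgia", "Virgin Islands"],
    "price": [303950.0, 100000.0, 185000.0, 200000.0],
})

# Names to exclude, kept in one place so the list is easy to extend.
excluded = ["Puerto Rico", "Virgin Islands"]

# ~ negates the boolean mask, keeping rows whose state is NOT in the list.
kept = toy[~toy["state"].isin(excluded)]
print(kept)
```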

Our next processing step is making sure we don't have any NaN values in our dataset, since missing values might affect our analysis models.

In [8]:
#sum up all NaN values present in dataset (in any feature column)
print (df.isnull().sum().sum())

42918
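The single total above does not show where the missing values sit. As a rough sketch on a toy table (values borrowed from the preview earlier), the same `isnull` call can be summed per column, and also per row to see how many listings a full `dropna` would remove.

```python
import numpy as np
import pandas as pd

# Toy table with one listing that is missing bed and house_size.
toy = pd.DataFrame({
    "price": [303950.0, 185000.0, 210000.0],
    "bed": [3.0, np.nan, 3.0],
    "house_size": [1720.0, np.nan, 1860.0],
})

# Number of NaN values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows with at least one NaN (the rows dropna() would remove).
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print(rows_with_missing)
```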


We see that we have some data entries with no value, so let's remove all rows that contain any NaN values. We will also check the shape of our data frame after this removal to make sure we still have plenty of samples to work with.

In [9]:
#remove all rows missing data
df = df.dropna()

#verify we now have no NaN values, expect a value of zero
print (f'We now have: {df.isnull().sum().sum()} NaN entries')

#print new shape
print (f'Our new dataset shape is {df.shape}')


We now have: 0 NaN entries
Our new dataset shape is (30293, 9)


We successfully dropped all empty entries and still have a substantial number of samples to analyze. Now we are ready to merge this dataframe with others in order to add more features that may correlate with real estate pricing.

**NOTE: Do we need to turn pricing column into categorical data instead of numerical? Most of our machine learning and nn uses categorical. Could probably do this later on in notebook too**
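If we do decide to treat price as categorical, one possible approach (not something the notebook has settled on) is to bin prices into equally populated bands with `pd.qcut`. The sketch below uses a handful of prices from the preview table. The three band labels are hypothetical.

```python
import pandas as pd

# A few prices taken from the preview table above.
prices = pd.Series(
    [118000.0, 139999.0, 165000.0, 239500.0, 303950.0, 210000.0],
    name="price",
)

# qcut splits on quantiles, so each band holds roughly the same number of rows.
# The labels "low", "mid" and "high" are hypothetical names for the bands.
price_band = pd.qcut(prices, q=3, labels=["low", "mid", "high"])
print(price_band)
```

Quantile-based bins keep the classes balanced, whereas fixed-width bins (`pd.cut`) would follow the dollar scale instead.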

### Dataset Merging

While the effect of features such as the number of rooms or the number of acres on real-estate prices is quite intuitive, this project aims to delve beyond these variables and investigate more abstract influences. This will be done by merging our current dataframe with new datasets in order to add features including the minimum wage of the state, median income by zip code, and even political affiliation, as we are curious whether any of these variables will display a strong correlation with housing prices. One caution to note is that our original real estate data is from the past two years, so we will need to make sure the data we are merging with is taken from the same time period in order to draw accurate conclusions.

The first dataset we will merge with is Kaggle's "US Household Income by Zip Code 2021-2011" found here: https://www.kaggle.com/datasets/claygendron/us-household-income-by-zip-code-2021-2011

In [10]:
df2 = pd.read_csv('us_income_zipcode.csv')
df2.head()

Unnamed: 0,ZIP,Geography,Geographic Area Name,Households,Households Margin of Error,"Households Less Than $10,000","Households Less Than $10,000 Margin of Error","Households $10,000 to $14,999","Households $10,000 to $14,999 Margin of Error","Households $15,000 to $24,999",...,"Nonfamily Households $150,000 to $199,999","Nonfamily Households $150,000 to $199,999 Margin of Error","Nonfamily Households $200,000 or More","Nonfamily Households $200,000 or More Margin of Error",Nonfamily Households Median Income (Dollars),Nonfamily Households Median Income (Dollars) Margin of Error,Nonfamily Households Mean Income (Dollars),Nonfamily Households Mean Income (Dollars) Margin of Error,Nonfamily Households Nonfamily Income in the Past 12 Months,Year
0,601,860Z200US00601,ZCTA5 00601,5397.0,264.0,33.2,4.4,15.7,2.9,23.9,...,0.0,2.8,0.0,2.8,9386.0,1472.0,13044.0,1949.0,15.0,2021.0
1,602,860Z200US00602,ZCTA5 00602,12858.0,448.0,27.1,2.9,12.7,2.1,20.5,...,0.0,1.3,0.0,1.3,11242.0,1993.0,16419.0,2310.0,20.1,2021.0
2,603,860Z200US00603,ZCTA5 00603,19295.0,555.0,32.1,2.5,13.4,1.6,17.2,...,0.6,0.6,0.2,0.4,10639.0,844.0,16824.0,2217.0,34.9,2021.0
3,606,860Z200US00606,ZCTA5 00606,1968.0,171.0,28.4,5.5,13.3,4.4,23.3,...,0.0,7.5,0.0,7.5,15849.0,3067.0,16312.0,2662.0,13.0,2021.0
4,610,860Z200US00610,ZCTA5 00610,8934.0,372.0,20.5,2.5,13.2,2.5,23.3,...,0.0,1.8,0.0,1.8,12832.0,2405.0,16756.0,1740.0,14.5,2021.0


This dataset contains U.S. Census Bureau household income estimates by zip code, covering 2011 to 2021 according to its title. We have chosen it in order to add a median income feature to our real estate pricing dataset. As explained above, we are only interested in the 2021 data since our pricing data comes from recent years, so we will trim down the dataset accordingly. Additionally, the dataset comes with dozens of feature columns, but for our purposes we only need to keep the zip code column (which we will use to merge with our original dataset) and one median income column. Note that the column selected below, "Nonfamily Households Median Income (Dollars)", covers nonfamily households only rather than all households. So let's process our dataset and display the cleaner result.

In [11]:
#select only samples from most recent census
df2 = df2[df2["Year"] == 2021.0]

#select only features we want
#take a copy so later column assignments act on an independent dataframe
df2 = df2[["ZIP", "Nonfamily Households Median Income (Dollars)"]].copy()

df2.head()

Unnamed: 0,ZIP,Nonfamily Households Median Income (Dollars)
0,601,9386.0
1,602,11242.0
2,603,10639.0
3,606,15849.0
4,610,12832.0


Now we are ready to merge with our original dataset. Currently our zip code columns have different names, so we will rename the income dataset's column to match. They also have different types (integer vs. float), so we will convert the integer column to float so that the merge keys compare consistently.

In [12]:
df2["ZIP"] = df2["ZIP"].astype(float)

df2 = df2.rename(columns={'ZIP': 'zip_code'})
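The float conversion above works for matching, but it drops the leading zeros that zip codes such as 00601 carry. As an alternative sketch (not what this notebook does), both key columns could be normalized to five-character strings instead. The frame names `listings` and `income` are stand-ins, and the approach assumes rows with missing zip codes have already been dropped.

```python
import pandas as pd

# Stand-in frames: zip codes stored as float in one and as integer in the other.
listings = pd.DataFrame({"zip_code": [34465.0, 601.0]})
income = pd.DataFrame({"ZIP": [34465, 601]})

# Convert to int first so 601.0 becomes "601", then left-pad with zeros.
listings["zip_code"] = listings["zip_code"].astype(int).astype(str).str.zfill(5)
income["zip_code"] = income["ZIP"].astype(str).str.zfill(5)

print(listings["zip_code"].tolist())
print(income["zip_code"].tolist())
```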

We will use an inner merge (the default for `pd.merge`), which keeps only the listings whose zip code also appears in the income data.
The census data is very thorough (it has very few NaN values), so removing any rows with missing data after the merge should leave our dataset practically the same size. We verify this by outputting our dataset shape after the merge.

In [13]:
#inner merge: keep only listings whose zip_code appears in both dataframes
df = pd.merge(df, df2, on="zip_code", how="inner")

#remove all rows missing data
df = df.dropna()

print (df.shape)
df.head()

(29969, 10)


Unnamed: 0,price,bed,bath,acre_lot,street,city,state,zip_code,house_size,Nonfamily Households Median Income (Dollars)
0,303950.0,3.0,2.0,0.17,1405853.0,Beverly Hills,Florida,34465.0,1720.0,34055.0
1,139999.0,2.0,2.0,0.22,1322125.0,Beverly Hills,Florida,34465.0,908.0,34055.0
2,118000.0,1.0,1.0,0.14,846848.0,Beverly Hills,Florida,34465.0,675.0,34055.0
3,239500.0,3.0,2.0,0.25,625843.0,Beverly Hills,Florida,34465.0,1898.0,34055.0
4,165000.0,2.0,1.0,0.21,986060.0,Beverly Hills,Florida,34465.0,908.0,34055.0
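To see how many listings fail to find a match in the income table, one option is a left merge with `indicator=True`. This is a minimal sketch on toy frames: the column name `median_income`, the zip code 99999.0 and its price, and the second income row are made up for illustration.

```python
import pandas as pd

# Toy listings; the last row uses a made-up zip code with no income record.
listings = pd.DataFrame({
    "zip_code": [34465.0, 34465.0, 99999.0],
    "price": [303950.0, 139999.0, 250000.0],
})
# Toy income table; "median_income" is a shortened, hypothetical column name.
income = pd.DataFrame({
    "zip_code": [34465.0, 30528.0],
    "median_income": [34055.0, 40000.0],
})

# Inner merge keeps only listings whose zip code exists in both tables.
inner = pd.merge(listings, income, on="zip_code", how="inner")

# Left merge with indicator adds a _merge column flagging unmatched listings.
checked = pd.merge(listings, income, on="zip_code", how="left", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())
print(len(inner), unmatched)
```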


This feature looks good, so let's move on to some more merges.

Next, we want to add election results and minimum wage data by state to our dataset (or whatever else anyone wants to add!), which should be slightly simpler than merging by zip code.
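A state-level merge could follow the same pattern as the zip code merge. The sketch below assumes a hypothetical table with one row per state. The column name `min_wage` and its numbers are placeholders, not real minimum wage figures.

```python
import pandas as pd

# Toy listings with values from the preview table above.
listings = pd.DataFrame({
    "state": ["Florida", "Florida", "Georgia"],
    "price": [303950.0, 139999.0, 185000.0],
})

# Hypothetical state-level table; the numbers are placeholders, NOT real data.
state_table = pd.DataFrame({
    "state": ["Florida", "Georgia"],
    "min_wage": [1.0, 2.0],
})

# validate="many_to_one" raises an error if a state appears twice in state_table,
# which guards against accidentally duplicating listings during the merge.
merged = pd.merge(listings, state_table, on="state", how="left", validate="many_to_one")
print(merged)
```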

....

### Data Visualization

(Just writing some notes for us to use later)
- Maybe create a fancy visual heatmap type thing showing our prices by zip code on the US map
- Create bar plots, histograms, correlation plots etc using tangible factors from our og dataset (room number, acres etc) should show clear trend
- Same thing but for some intangible factors using newly merged data, see if we can come to cool conclusions about those correlations
- Writeup analysis about what this shows us about society/housing market
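As a starting point for the correlation plots mentioned in the notes, the sketch below ranks numerical features by their correlation with price. It uses only the five Beverly Hills rows from the merged preview above, so the numbers it prints are far too few to be meaningful. The variable name `corr_with_price` is our own.

```python
import pandas as pd

# The five Beverly Hills rows from the merged preview (subset of columns).
toy = pd.DataFrame({
    "price": [303950.0, 139999.0, 118000.0, 239500.0, 165000.0],
    "bed": [3.0, 2.0, 1.0, 3.0, 2.0],
    "bath": [2.0, 2.0, 1.0, 2.0, 1.0],
    "house_size": [1720.0, 908.0, 675.0, 1898.0, 908.0],
    "city": ["Beverly Hills"] * 5,
})

# Keep numerical columns only, then take each feature's correlation with price.
numeric = toy.select_dtypes(include="number")
corr_with_price = (
    numeric.corr()["price"].drop("price").sort_values(ascending=False)
)
print(corr_with_price)
```

The resulting Series could be passed to a bar plot once the full dataset is used.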