# Predicting United States Real Estate Prices
By: Grace Li, Olivia Weisiger and Fionnuala Eastwood

### Outline
1. Dataset Merging and Preprocessing
2. Data Visualization
3. Classic Machine Learning Models
4. Deep Learning Models
5. Analysis of Accuracy and Results

### Context
The United States housing market is...

### Our Goal

This project explores the current United States real-estate market, investigates which factors influence property prices, and builds multiple machine learning models that predict housing costs across the country. More specifically, this will be accomplished through implementation of (add briefly about what models we end up using....) Understanding real-estate trends is valuable economic knowledge that can offer insight into the country as a whole.

Furthermore, our project aims to deepen our understanding of how societal biases affect external structures such as the economy. By merging datasets, we will investigate which underlying factors such as (add briefly when choose other data) affect house prices, in order to draw deeper conclusions about the intangible factors shaping our economic climate.

### Import Packages and Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

This data is from Kaggle's "USA Real Estate Dataset" found here: https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset

In [2]:
df = pd.read_csv('realtor-data.zip.csv')

### Data Processing
Let's first break down what our dataset looks like...

In [3]:
df.shape

(2226382, 12)

We have a dataset with over 2 million rows and 12 columns. Since this is too many samples to process in a reasonable amount of time, we will take a random subset of 50,000 of these samples to perform our analysis on.

In [4]:
df = df.sample(50000)
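Because `sample` draws rows at random, every rerun of the cell above selects a different subset, so the counts printed later in this notebook will vary between runs. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable here (the toy frame and the seed value 42 are illustrative assumptions, not part of the original analysis):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full listings table (hypothetical data, not the Kaggle file)
toy = pd.DataFrame({"price": np.arange(1000, dtype=float)})

# Passing random_state fixes the random draw, so reruns select the same rows
sample_a = toy.sample(n=100, random_state=42)
sample_b = toy.sample(n=100, random_state=42)

print(sample_a.shape)             # (100, 1)
print(sample_a.equals(sample_b))  # True
```

The same `random_state` argument could be passed to the `df.sample(50000)` call if repeatable results are wanted.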

With our refined sample, let's get an idea of what our dataset looks like by outputting a few rows of the table.

In [5]:
df.head()

| | brokered_by | status | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | prev_sold_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1597875 | 10727.0 | sold | 499900.0 | 4.0 | 4.0 | 0.32 | 1429746.0 | Kernersville | North Carolina | 27284.0 | NaN | 2021-11-19 |
| 1877446 | 2375.0 | sold | 164900.0 | 3.0 | 2.0 | 0.33 | 861652.0 | Cape Girardeau | Missouri | 63701.0 | 1424.0 | 2022-02-28 |
| 1496472 | 102073.0 | sold | 215000.0 | NaN | NaN | 0.09 | 1708296.0 | Rochester | New York | 14620.0 | 2126.0 | 2022-01-18 |
| 1009399 | 22930.0 | for_sale | 160000.0 | 3.0 | 2.0 | 0.29 | 1591110.0 | Skiatook | Oklahoma | 74070.0 | 1464.0 | NaN |
| 461280 | 51266.0 | for_sale | 496990.0 | 3.0 | 3.0 | NaN | 738245.0 | Apopka | Florida | 32712.0 | 1992.0 | NaN |


Notice that each sample in the dataset is a real estate listing in the United States (the listings are all from 2022-2024), and each sample has 12 features that provide numerical or categorical information about the listing.

Here is an overview of each feature's meaning and data type:

- brokered_by:

- status:

- price:

- bed:

- bath:

- etc. (fill in later; these descriptions are provided on Kaggle)

Now that we have an understanding of our dataset, we will process the data so that it is cleaner to use. First, we will drop some columns that do not contribute to our analysis goals. The brokered_by column, which encodes the real-estate company in charge of the property, is not necessary because we are interested in the qualities of the house itself. The status column is not needed either, because we will treat the listed price the same way regardless of whether the house is sold or for sale. Lastly, the previously sold date can be dropped since we are focused on the current selling price. These modifications trim our dataset from 12 columns to 9.

In [6]:
df = df.drop(['brokered_by', 'status', 'prev_sold_date'], axis=1)
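The column drop above can be checked on a small example. The sketch below uses a hypothetical four-column toy frame (not the real dataset) and the `columns=` keyword, which is equivalent to passing `axis=1`:

```python
import pandas as pd

# Hypothetical toy frame with the three columns to drop plus one to keep
toy = pd.DataFrame({
    "brokered_by": [10727.0, 2375.0],
    "status": ["sold", "for_sale"],
    "price": [499900.0, 164900.0],
    "prev_sold_date": ["2021-11-19", None],
})

cols_to_drop = ["brokered_by", "status", "prev_sold_date"]

# drop returns a new frame; the original toy frame is left unchanged
trimmed = toy.drop(columns=cols_to_drop)

print(list(trimmed.columns))  # ['price']
```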

This dataset contains listings from the United States and all its territories. For our purposes, we only want to analyze data from the 50 states (and Washington, DC), so let's remove the samples from Puerto Rico and the Virgin Islands.

In [7]:
df = df[(df['state'] != "Puerto Rico") & (df['state'] != "Virgin Islands")]
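If more territories ever need to be excluded, chaining `!=` comparisons gets unwieldy. One possible alternative, sketched here on a hypothetical toy frame, is to negate `isin` with a list of labels to exclude (the `territories` list name is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy frame mixing states and territories
toy = pd.DataFrame({
    "state": ["Florida", "Puerto Rico", "New York", "Virgin Islands"],
    "price": [496990.0, 100000.0, 215000.0, 250000.0],
})

# Labels to exclude; extend this list if other territories appear in the data
territories = ["Puerto Rico", "Virgin Islands"]

# ~ negates the boolean mask, keeping rows whose state is NOT in the list
states_only = toy[~toy["state"].isin(territories)]

print(states_only["state"].tolist())  # ['Florida', 'New York']
```

Inspecting `df['state'].unique()` beforehand is a quick way to confirm which labels are actually present in the sample.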

Our next processing step is making sure we don't have any NaNs in our dataset, as missing values might affect our analysis models.

In [8]:
# Sum all NaN values present in the dataset (across every feature column)
print(df.isnull().sum().sum())

42516
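The single total above does not show which columns the missing values come from. A minimal sketch of a per-column breakdown, using a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "bed": [4.0, np.nan, 3.0],
    "bath": [4.0, np.nan, 2.0],
    "house_size": [np.nan, 2126.0, 1464.0],
})

# The first sum() counts NaNs per column; summing that Series gives the total
per_column = toy.isnull().sum()
total = per_column.sum()

print(per_column)
print(total)  # 3
```

Running `df.isnull().sum()` on the real frame would show whether a few columns account for most of the missing entries.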


We see that we have some data entries with no value, so let's remove all rows that contain any NaN values. We will also check the shape of our data frame after this removal to make sure we still have plenty of samples to work with.

In [12]:
# Remove all rows that are missing data
df = df.dropna()

# Verify we now have no NaN values (expect a value of zero)
print(f'We now have: {df.isnull().sum().sum()} NaN entries')

# Print the new shape
print(f'Our new dataset shape is {df.shape}')


We now have: 0 NaN entries
Our new dataset shape is (30408, 9)
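Since `dropna()` with default arguments discards a row if any single column is missing, it can remove a large share of the data. A small sketch, on a hypothetical toy frame, of how the fraction of retained rows could be reported:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two of the four rows contain a NaN
toy = pd.DataFrame({
    "price": [499900.0, 215000.0, 160000.0, 496990.0],
    "bed": [4.0, np.nan, 3.0, 3.0],
    "acre_lot": [0.32, 0.09, 0.29, np.nan],
})

rows_before = len(toy)
cleaned = toy.dropna()  # drops any row with at least one NaN
rows_after = len(cleaned)

print(f"Kept {rows_after} of {rows_before} rows ({rows_after / rows_before:.0%})")
# Kept 2 of 4 rows (50%)
```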


We successfully dropped all rows with empty entries and still have a substantial number of samples to analyze. Now we are ready to merge this data frame with other data to add more features.

### Dataset Merging