# Predicting United States Real Estate Prices
By: Grace Li, Olivia Weisiger and Fionnuala Eastwood

### Outline
1. Dataset Preprocessing and Merging
2. Data Visualization
3. Classic Machine Learning Models
4. Deep Learning Models
5. Analysis of accuracy and results

### Context
The United States housing market is...

### Our Goal

This project will explore the current United States real-estate market, investigate what factors influence the price of property, and create multiple machine learning models that predict these housing costs throughout the country. More specifically, this will be accomplished through implementation of (add briefly about what models we end up using....) Being able to infer and understand the trends of real estate is extremely valuable economic knowledge that will provide important insights about our country. 

Furthermore, our project aims to deepen our understanding of how societal biases influence external structures such as the economy. By merging datasets, we will investigate which underlying factors such as (add briefly when choose other data) affect the prices of houses in order to draw deeper conclusions about intangible factors impacting our economic climate.

### Import Packages and Data

In [6]:
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier






This data is from Kaggle's "USA Real Estate Dataset" found here: https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset

In [8]:
df = pd.read_csv('realtor-data.zip.csv')

### Initial Data Processing
Let's first break down what our dataset looks like...

In [10]:
df.shape

(2226382, 12)

We have a dataset with over 2 million rows and 12 columns. Since this is far too many samples to process in a reasonable amount of computation time, we will take a random subset of 50,000 of these samples to perform our analysis on.

In [12]:
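# pandas draws its sample from NumPy's random state, not Python's random module,
# so the seed is passed through random_state to make the subset reproducible
df = df.sample(50000, random_state=10)
print(df.shape)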

(50000, 12)


With our refined sample, let's get an idea of what our dataset looks like by outputting a few rows of the table.

In [14]:
df.head()

|  | brokered_by | status | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | prev_sold_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1201130 | 62531.0 | for_sale | 585407.0 | 4.0 | 3.0 | 0.15 | 735851.0 | Queen Creek | Arizona | 85142.0 | 2186.0 | NaN |
| 387359 | 21986.0 | for_sale | 59500.0 | NaN | NaN | 18.57 | 1285890.0 | Roberta | Georgia | 31078.0 | NaN | 2022-04-29 |
| 2174460 | 59831.0 | sold | 230000.0 | 3.0 | 2.0 | NaN | 1105743.0 | Davis | California | 95618.0 | 1344.0 | 2022-01-04 |
| 776215 | 10595.0 | for_sale | 298000.0 | 3.0 | 2.0 | 1.55 | 961740.0 | Sumner | Iowa | 50674.0 | 1240.0 | NaN |
| 1233886 | 103011.0 | for_sale | 150000.0 | NaN | NaN | 0.75 | 564581.0 | Edgewood | New Mexico | 87015.0 | NaN | NaN |


Notice that each sample in the dataset is a real estate listing in the United States (the listings are all from 2022-2024), and each sample has 12 features that provide numerical or categorical information about the listing.

Here is an overview of each feature's meaning and data type:

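- brokered_by: numeric code identifying the real-estate company in charge of the listing (float)

- status: listing status, such as for_sale or sold (string)

- price: listing price, or sale price if sold, in US dollars (float)

- bed: number of bedrooms (float)

- bath: number of bathrooms (float)

- acre_lot: lot size in acres (float)

- street: numeric code identifying the street address (float)

- city: city name (string)

- state: state or territory name (string)

- zip_code: postal code (float)

- house_size: size of the house, in square feet per the Kaggle description (float)

- prev_sold_date: date the property was previously sold, missing for many listings (string)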

Now that we have an understanding of our dataset, we will perform some processing on the data so that it is cleaner to use. First, we will drop some unnecessary columns that do not contribute to our analysis goals. The brokered_by column, which encodes the real-estate company in charge of the property, is not necessary because we are interested in the qualities of the house itself. Additionally, the status column is not needed because we will treat the listed price the same way regardless of whether the property is sold or for sale. Lastly, the previously sold date can be dropped since we are focused on the current selling price. These modifications trim our dataset from 12 columns to 9.

In [18]:
df = df.drop(['brokered_by', 'status', 'prev_sold_date'], axis=1)

This dataset contains listings from the United States and all its territories. For our purposes, we only want to analyze data from the 50 states (and Washington, DC), so let's trim out samples taken from Puerto Rico and the Virgin Islands.

In [20]:
df = df[(df['state'] != "Puerto Rico") & (df['state'] != "Virgin Islands")]

Our next processing step is making sure we don't have any NaN values in our dataset, as missing data values might interfere with our analysis models.

In [22]:
#sum up all NaN values present in dataset (in any feature column)
print (df.isnull().sum().sum())

42323


We see that we have some data entries with no value, so let's remove all rows that contain any NaN values. We will also check the shape of our data frame after this removal to make sure we still have plenty of samples to work with.

In [24]:
#remove all rows missing data
df = df.dropna()

#verify we now have no NaN values, expect a value of zero
print(f'We now have: {df.isnull().sum().sum()} NaN entries')

#print new shape
print(f'Our new dataset shape is {df.shape}')


We now have: 0 NaN entries
Our new dataset shape is (30425, 9)


We successfully dropped all rows with empty entries and still have a substantially sized data frame to analyze.
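Before dropping rows, it can also help to see which columns account for the missing values, since a single sparse column can be responsible for most of the lost rows. Here is a minimal sketch on a made-up toy frame (the values are illustrative only and `toy` is not part of our notebook's data):

```python
import numpy as np
import pandas as pd

# toy stand-in for the listings table (values are made up)
toy = pd.DataFrame({
    "price": [585407.0, 59500.0, 230000.0, 150000.0],
    "bed": [4.0, np.nan, 3.0, np.nan],
    "acre_lot": [0.15, 18.57, np.nan, 0.75],
})

# count missing entries per column rather than one grand total
nan_per_column = toy.isnull().sum()
print(nan_per_column)

# dropna() keeps only the rows that have no missing entries at all
toy_clean = toy.dropna()
rows_lost = len(toy) - len(toy_clean)
print(f"Dropped {rows_lost} of {len(toy)} rows")
```

If one column turned out to cause most of the loss, dropping that column instead of the rows would be an alternative worth weighing.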

Our last step in data processing is preparing our target price data for our future machine learning models. Predicting the price to an exact number (as the current column requires) is quite specific, so in some cases we will instead want to predict whether a piece of real estate is, more generally, expensive or cheap. The next question is how to quantify "expensive" vs "cheap".

Our natural first thought was to categorize the samples based on whether they fall in the upper or lower half of all prices in our dataset. However, upon further analysis we realized that the state the property is in has an overwhelmingly powerful influence on this categorization. For example, practically all samples from New York would fall in the upper half, while a large majority of samples from rural states would fall in the lower half. This would leave our model with very little to do, so to work around this we decided to create a categorical column that contains a 1 if the property price is above the median housing price **of the state it is in**, and a 0 if it is at or below that median. This takes out the state bias and may lead to more informative conclusions about other features that are no longer overshadowed.

To do this, we first merge our data with the median housing prices by state. This data was taken from the following link: https://www.bankrate.com/real-estate/median-home-price/#how-much, and is up to date as of November 2023 (which falls within the same time period our real-estate data was taken from).

In [26]:
#upload data
df_med = pd.read_csv('median_prices.csv')
df_med.head()

#select state and median price columns we want
df_med = df_med[["state", "med_price"]]

#merge along state column
df = pd.merge(df, df_med, on = ["state"])
df.head()

|  | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | med_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 585407.0 | 4.0 | 3.0 | 0.15 | 735851.0 | Queen Creek | Arizona | 85142.0 | 2186.0 | \$435,300 |
| 1 | 298000.0 | 3.0 | 2.0 | 1.55 | 961740.0 | Sumner | Iowa | 50674.0 | 1240.0 | \$289,900 |
| 2 | 739900.0 | 4.0 | 2.0 | 0.25 | 930049.0 | Spring Valley | California | 91977.0 | 1340.0 | \$793,600 |
| 3 | 315000.0 | 3.0 | 2.0 | 0.42 | 1826673.0 | Maggie Valley | North Carolina | 28751.0 | 1344.0 | \$362,200 |
| 4 | 855000.0 | 3.0 | 2.0 | 0.22 | 872293.0 | Bigfork | Montana | 59911.0 | 2822.0 | \$609,900 |


Notice one more problem exists: we have a price listed in string form with dollar sign and commas. Instead we want it to be numerical in order to compare it with our current price column.

In [28]:
# Remove dollar signs and commas, then convert to integers
# (raw string so that '\$' is passed to the regex engine as a literal dollar sign)
df['med_price'] = df['med_price'].replace({r'\$': '', ',': ''}, regex=True).astype(int)
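The same cleanup can be tried on a few strings in isolation to confirm the conversion behaves as expected. This is a small sketch, assuming the prices are whole-dollar amounts formatted like "$435,300"; `med_price_raw` and `med_price_num` are illustrative names, not variables from our notebook:

```python
import pandas as pd

# toy median-price strings in the same "$435,300" style as the merged column
med_price_raw = pd.Series(["$435,300", "$289,900", "$793,600"])

# remove dollar signs and commas (inside [...] the $ is a literal character),
# then let pandas parse the remaining digits as numbers
med_price_num = pd.to_numeric(med_price_raw.str.replace(r"[$,]", "", regex=True))
print(med_price_num.tolist())
```

Using `pd.to_numeric` here means any string that still contains unexpected characters raises an error rather than silently passing through.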

Lastly, we want to create a new column which we will call above_average. This column will contain a 1 if the price of that sample is above the median price in the state, and a 0 if it is below. We will also remove the median price column afterwards because it served its purpose.

In [30]:
# vectorized comparison: 1 where price is above the state's median price, else 0
df['above_average'] = (df['price'] > df['med_price']).astype('int64')
df = df.drop('med_price', axis = 1)
df.head()

|  | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | above_average |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 585407.0 | 4.0 | 3.0 | 0.15 | 735851.0 | Queen Creek | Arizona | 85142.0 | 2186.0 | 1 |
| 1 | 298000.0 | 3.0 | 2.0 | 1.55 | 961740.0 | Sumner | Iowa | 50674.0 | 1240.0 | 1 |
| 2 | 739900.0 | 4.0 | 2.0 | 0.25 | 930049.0 | Spring Valley | California | 91977.0 | 1340.0 | 0 |
| 3 | 315000.0 | 3.0 | 2.0 | 0.42 | 1826673.0 | Maggie Valley | North Carolina | 28751.0 | 1344.0 | 0 |
| 4 | 855000.0 | 3.0 | 2.0 | 0.22 | 872293.0 | Bigfork | Montana | 59911.0 | 2822.0 | 1 |
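For comparison, the state medians could also be computed from the listings themselves rather than merged in from an external table. This is an alternative to what we did above, not the method we used, and the toy prices below are made up:

```python
import pandas as pd

# toy listings: three per state (values are made up)
toy = pd.DataFrame({
    "state": ["Arizona", "Arizona", "Arizona", "Iowa", "Iowa", "Iowa"],
    "price": [300000.0, 450000.0, 600000.0, 150000.0, 250000.0, 400000.0],
})

# median price of each row's own state, aligned with the original rows
state_median = toy.groupby("state")["price"].transform("median")

# 1 if strictly above the state's median, else 0
toy["above_average"] = (toy["price"] > state_median).astype(int)
print(toy)
```

The trade-off is that within-sample medians reflect only the listings we happened to sample, whereas the external medians describe each state's market as a whole.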


This looks good, now we are ready to merge with other data sets to add more features to analyze.

### Dataset Merging

While the effect of features such as the number of rooms or the number of acres on real-estate prices is quite intuitive, this project aims to delve beyond these variables and investigate more abstract influences. This will be done by merging our current dataframe with new datasets in order to add features including the minimum wage of the state, median income by zip code, and even political affiliation, as we are curious whether any of these variables will display a strong correlation with housing prices. One caution to note is that our original real estate data is from the past two years, so we will need to make sure the data we are merging with is taken from a similar time period in order to obtain accurate conclusions.

The first dataset we will merge with is Kaggle's "US Household Income by Zip Code 2021-2011" found here: https://www.kaggle.com/datasets/claygendron/us-household-income-by-zip-code-2021-2011

In [34]:
df_income = pd.read_csv('us_income_zipcode.csv')
df_income.head()

|  | ZIP | Geography | Geographic Area Name | Households | Households Margin of Error | Households Less Than \$10,000 | Households Less Than \$10,000 Margin of Error | Households \$10,000 to \$14,999 | Households \$10,000 to \$14,999 Margin of Error | Households \$15,000 to \$24,999 | ... | Nonfamily Households \$150,000 to \$199,999 | Nonfamily Households \$150,000 to \$199,999 Margin of Error | Nonfamily Households \$200,000 or More | Nonfamily Households \$200,000 or More Margin of Error | Nonfamily Households Median Income (Dollars) | Nonfamily Households Median Income (Dollars) Margin of Error | Nonfamily Households Mean Income (Dollars) | Nonfamily Households Mean Income (Dollars) Margin of Error | Nonfamily Households Nonfamily Income in the Past 12 Months | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 601 | 860Z200US00601 | ZCTA5 00601 | 5397.0 | 264.0 | 33.2 | 4.4 | 15.7 | 2.9 | 23.9 | ... | 0.0 | 2.8 | 0.0 | 2.8 | 9386.0 | 1472.0 | 13044.0 | 1949.0 | 15.0 | 2021.0 |
| 1 | 602 | 860Z200US00602 | ZCTA5 00602 | 12858.0 | 448.0 | 27.1 | 2.9 | 12.7 | 2.1 | 20.5 | ... | 0.0 | 1.3 | 0.0 | 1.3 | 11242.0 | 1993.0 | 16419.0 | 2310.0 | 20.1 | 2021.0 |
| 2 | 603 | 860Z200US00603 | ZCTA5 00603 | 19295.0 | 555.0 | 32.1 | 2.5 | 13.4 | 1.6 | 17.2 | ... | 0.6 | 0.6 | 0.2 | 0.4 | 10639.0 | 844.0 | 16824.0 | 2217.0 | 34.9 | 2021.0 |
| 3 | 606 | 860Z200US00606 | ZCTA5 00606 | 1968.0 | 171.0 | 28.4 | 5.5 | 13.3 | 4.4 | 23.3 | ... | 0.0 | 7.5 | 0.0 | 7.5 | 15849.0 | 3067.0 | 16312.0 | 2662.0 | 13.0 | 2021.0 |
| 4 | 610 | 860Z200US00610 | ZCTA5 00610 | 8934.0 | 372.0 | 20.5 | 2.5 | 13.2 | 2.5 | 23.3 | ... | 0.0 | 1.8 | 0.0 | 1.8 | 12832.0 | 2405.0 | 16756.0 | 1740.0 | 14.5 | 2021.0 |


This dataset contains household income statistics by zip code for the years 2011 through 2021, and we have chosen it in order to add a median income feature to our real estate pricing dataset. As explained above, we are only interested in the 2021 data since our pricing data comes from recent years, so we will trim down our dataset accordingly. Additionally, the dataset comes with dozens of feature columns, but for our purposes we only need to keep the zip code column (which we will use to merge with our original dataset) and one median income column. Note that the column kept in the cell below is "Nonfamily Households Median Income (Dollars)", which covers nonfamily households only. So let's process our dataset and display the cleaner result.

In [36]:
#select only samples from the most recent year (2021)
df_income = df_income[df_income["Year"] == 2021.0]

#select only features we want
df_income = df_income[["ZIP", "Nonfamily Households Median Income (Dollars)"]]

df_income.head()

|  | ZIP | Nonfamily Households Median Income (Dollars) |
|---|---|---|
| 0 | 601 | 9386.0 |
| 1 | 602 | 11242.0 |
| 2 | 603 | 10639.0 |
| 3 | 606 | 15849.0 |
| 4 | 610 | 12832.0 |


Now we are ready to merge with our original dataset. Currently our zip code columns have different names so we will rename them identically, and they also have different types (integer vs float) so we will convert to a float variable to avoid type error interference.

In [38]:
df_income["ZIP"] = df_income["ZIP"].astype(float)

df_income = df_income.rename(columns={'ZIP': 'zip_code'})
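Matching on float zip codes works for this merge, but zip codes are identifiers rather than quantities, and the numeric form loses leading zeros (ZIP 601 in the income table corresponds to the area named "ZCTA5 00601"). An alternative is to convert both key columns to zero-padded strings. This is only a sketch: `to_zip_string` is a hypothetical helper of our own naming, and it assumes neither column contains missing zip codes:

```python
import pandas as pd

listing_zip = pd.Series([85142.0, 50674.0, 601.0])  # floats, as in the listings
income_zip = pd.Series([85142, 50674, 601])          # integers, as in the income table

def to_zip_string(zips: pd.Series) -> pd.Series:
    """Convert numeric zip codes to zero-padded 5-character strings."""
    # cast to integer first so that 601.0 becomes "601" rather than "601.0"
    return zips.astype("int64").astype(str).str.zfill(5)

listing_key = to_zip_string(listing_zip)
income_key = to_zip_string(income_zip)
print(listing_key.tolist())
```

Both key columns then share one type and one format, so the merge does not depend on how each file happened to store its zip codes.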

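We will use an inner merge, which is the default for `pd.merge`: it keeps only the listings whose zip code also appears in the income table, so every remaining row can be paired with an income row.
The census data was very thorough (it has very few NaN values), so we can simply remove any rows with missing data and our dataset remains practically the same size. We verify this assumption by outputting our dataset shape after the merge.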

In [40]:
df = pd.merge(df, df_income, on = ["zip_code"])

#remove all rows missing data
df = df.dropna()

print (df.shape)
df.head()

(30081, 11)


|  | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | above_average | Nonfamily Households Median Income (Dollars) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 585407.0 | 4.0 | 3.0 | 0.15 | 735851.0 | Queen Creek | Arizona | 85142.0 | 2186.0 | 1 | 54890.0 |
| 1 | 298000.0 | 3.0 | 2.0 | 1.55 | 961740.0 | Sumner | Iowa | 50674.0 | 1240.0 | 1 | 40625.0 |
| 2 | 739900.0 | 4.0 | 2.0 | 0.25 | 930049.0 | Spring Valley | California | 91977.0 | 1340.0 | 0 | 50286.0 |
| 3 | 315000.0 | 3.0 | 2.0 | 0.42 | 1826673.0 | Maggie Valley | North Carolina | 28751.0 | 1344.0 | 0 | 42108.0 |
| 4 | 855000.0 | 3.0 | 2.0 | 0.22 | 872293.0 | Bigfork | Montana | 59911.0 | 2822.0 | 1 | 42860.0 |
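To see how many listings an inner merge discards, one option is to run a left merge with pandas' `indicator` flag first and count the unmatched rows. The sketch below uses made-up rows; the column name `median_income` and the zip code 99999 are placeholders for illustration:

```python
import pandas as pd

listings = pd.DataFrame({
    "zip_code": [85142.0, 50674.0, 99999.0],   # 99999 has no income row
    "price": [585407.0, 298000.0, 123000.0],
})
income = pd.DataFrame({
    "zip_code": [85142.0, 50674.0, 91977.0],
    "median_income": [54890.0, 40625.0, 50286.0],
})

# left merge with an indicator column to see which listings found a match;
# validate="m:1" raises if a zip code appears more than once in the income table
checked = pd.merge(listings, income, on="zip_code", how="left",
                   indicator=True, validate="m:1")
unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{unmatched} listing(s) had no income row")

# the inner merge keeps only the matched listings
merged = pd.merge(listings, income, on="zip_code", how="inner")
print(merged.shape)
```

The same check applies to the state-level merges, where a spelling difference in a state name would otherwise drop every listing from that state without warning.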


This feature looks good, let's move on to some more merges.

Next, we want to add statistics on political affiliation by state and minimum wage by state to our dataset, which should be slightly simpler than merging by zip code.

First, we will use Kaggle's "2020 US Presidential Election Results by State" linked here: https://www.kaggle.com/datasets/callummacpherson14/2020-us-presidential-election-results-by-state. This data is recent enough to match our real-estate data, and it contains voting percentages and win vs loss data for Biden and Trump from the 2020 election.


In [42]:
df_election = pd.read_csv('voting.csv.xls')
df_election.head()

|  | state | state_abr | trump_pct | biden_pct | trump_vote | biden_vote | trump_win | biden_win |
|---|---|---|---|---|---|---|---|---|
| 0 | Alaska | AK | 53.1 | 43.0 | 189543 | 153502 | 1 | 0 |
| 1 | Hawaii | HI | 34.3 | 63.7 | 196864 | 366130 | 0 | 1 |
| 2 | Washington | WA | 39.0 | 58.4 | 1584651 | 2369612 | 0 | 1 |
| 3 | Oregon | OR | 40.7 | 56.9 | 958448 | 1340383 | 0 | 1 |
| 4 | California | CA | 34.3 | 63.5 | 5982194 | 11082293 | 0 | 1 |


Notice this is quite a clean dataset already, so all we need to do is select the columns we are interested in and perform another inner merge along the common column of state. Here, we will keep the state column, which is needed for the merge, as well as the trump_pct and biden_pct columns, since these provide more detailed information than the binary win vs loss columns. Let's do so and check our new dataset.

In [44]:
#select only features we want
df_election = df_election[["state", "biden_pct", "trump_pct"]]

#merge dataframe along the column of state
df = pd.merge(df, df_election, on = ["state"])

#verify there were no null data values added
print(f'We still have: {df.isnull().sum().sum()} NaN entries')

#output dataset summary
print(df.shape)
df.head()

We still have: 0 NaN entries
(30081, 13)


|  | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | above_average | Nonfamily Households Median Income (Dollars) | biden_pct | trump_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 585407.0 | 4.0 | 3.0 | 0.15 | 735851.0 | Queen Creek | Arizona | 85142.0 | 2186.0 | 1 | 54890.0 | 49.4 | 49.1 |
| 1 | 298000.0 | 3.0 | 2.0 | 1.55 | 961740.0 | Sumner | Iowa | 50674.0 | 1240.0 | 1 | 40625.0 | 45.0 | 53.2 |
| 2 | 739900.0 | 4.0 | 2.0 | 0.25 | 930049.0 | Spring Valley | California | 91977.0 | 1340.0 | 0 | 50286.0 | 63.5 | 34.3 |
| 3 | 315000.0 | 3.0 | 2.0 | 0.42 | 1826673.0 | Maggie Valley | North Carolina | 28751.0 | 1344.0 | 0 | 42108.0 | 48.7 | 50.1 |
| 4 | 855000.0 | 3.0 | 2.0 | 0.22 | 872293.0 | Bigfork | Montana | 59911.0 | 2822.0 | 1 | 42860.0 | 40.6 | 56.9 |


Notice we still have no NaN entries and the same number of rows, so the merge added the new columns without introducing any problematic data.

Finally, we will perform this process one more time in order to add each state's minimum wage. This time we will use Kaggle's "Living Wage - State Capitals" found at https://www.kaggle.com/datasets/brandonconrady/living-wage-state-capitals. We again verified this data was taken from the past two years for consistency.

In [46]:
df_minwage = pd.read_csv('LivingWageStateCapitals.csv.xls')
df_minwage.head()

|  | state_territory | city | minimum_wage | one_adult_no_kids_living_wage | one_adult_one_kid_living_wage | one_adult_two_kids_living_wage | one_adult_three_kids_living_wage | two_adults_one_working_no_kids_living_wage | two_adults_one_working_one_kid_living_wage | two_adults_one_working_two_kids_living_wage | ... | one_adult_two_kids_poverty_wage | one_adult_three_kids_poverty_wage | two_adults_one_working_no_kids_poverty_wage | two_adults_one_working_one_kid_poverty_wage | two_adults_one_working_two_kids_poverty_wage | two_adults_one_working_three_kids_poverty_wage | two_adults_both_working_no_kids_poverty_wage | two_adults_both_working_one_kid_poverty_wage | two_adults_both_working_two_kids_poverty_wage | two_adults_both_working_three_kids_poverty_wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | District of Columbia | Washington | 13.25 | 19.97 | 38.95 | 48.99 | 63.96 | 29.61 | 34.55 | 38.32 | ... | 10.44 | 12.6 | 8.29 | 10.44 | 12.6 | 14.75 | 4.14 | 5.22 | 6.3 | 7.38 |
| 1 | Alabama | Montgomery | 7.25 | 13.56 | 27.35 | 33.42 | 42.17 | 22.59 | 26.66 | 30.27 | ... | 10.44 | 12.6 | 8.29 | 10.44 | 12.6 | 14.75 | 4.14 | 5.22 | 6.3 | 7.38 |
| 2 | Alaska | Juneau | 10.19 | 15.48 | 29.99 | 36.0 | 47.42 | 24.48 | 29.46 | 33.01 | ... | 13.05 | 15.75 | 10.36 | 13.05 | 15.75 | 18.44 | 5.18 | 6.53 | 7.87 | 9.22 |
| 3 | Arizona | Phoenix | 12.0 | 15.41 | 29.44 | 35.4 | 46.01 | 24.85 | 29.25 | 32.98 | ... | 10.44 | 12.6 | 8.29 | 10.44 | 12.6 | 14.75 | 4.14 | 5.22 | 6.3 | 7.38 |
| 4 | Arkansas | Little Rock | 10.0 | 13.97 | 28.81 | 35.49 | 45.33 | 23.21 | 27.66 | 31.36 | ... | 10.44 | 12.6 | 8.29 | 10.44 | 12.6 | 14.75 | 4.14 | 5.22 | 6.3 | 7.38 |


Again, we want to select the columns we need which in this case is the state column to merge along and the minimum_wage column which has the minimum wage data we desire (in dollars). Here, we will also rename the "state_territory" column to have the same title "state" as our original dataframe to streamline the merging process. Then after we complete the inner merge we will verify our final dataset.

In [48]:
#select only features we want
df_minwage = df_minwage[["state_territory", "minimum_wage"]]

#rename state_territory column
df_minwage = df_minwage.rename(columns={'state_territory': 'state'})

#merge dataframe along the column of state
df = pd.merge(df, df_minwage, on = ["state"])

#verify there were no null data values added
print(f'We still have: {df.isnull().sum().sum()} NaN entries')

#output dataset summary
print(df.shape)
df.head()

We still have: 0 NaN entries
(30081, 14)


|  | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | above_average | Nonfamily Households Median Income (Dollars) | biden_pct | trump_pct | minimum_wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 585407.0 | 4.0 | 3.0 | 0.15 | 735851.0 | Queen Creek | Arizona | 85142.0 | 2186.0 | 1 | 54890.0 | 49.4 | 49.1 | 12.0 |
| 1 | 298000.0 | 3.0 | 2.0 | 1.55 | 961740.0 | Sumner | Iowa | 50674.0 | 1240.0 | 1 | 40625.0 | 45.0 | 53.2 | 7.25 |
| 2 | 739900.0 | 4.0 | 2.0 | 0.25 | 930049.0 | Spring Valley | California | 91977.0 | 1340.0 | 0 | 50286.0 | 63.5 | 34.3 | 12.0 |
| 3 | 315000.0 | 3.0 | 2.0 | 0.42 | 1826673.0 | Maggie Valley | North Carolina | 28751.0 | 1344.0 | 0 | 42108.0 | 48.7 | 50.1 | 7.25 |
| 4 | 855000.0 | 3.0 | 2.0 | 0.22 | 872293.0 | Bigfork | Montana | 59911.0 | 2822.0 | 1 | 42860.0 | 40.6 | 56.9 | 8.65 |


In [49]:
df[["above_average"]].agg("nunique")

above_average    2
dtype: int64

Now we are officially done with merging our dataset and have plenty of new columns to work with!

### Data Visualization

(Just writing some notes for us to use later)
- maybe create a fancy visual heatmap type thing showing our prices by zipcode on the us map
- Create bar plots, histograms, correlation plots etc using tangible factors from our og dataset (room number, acres etc) should show clear trend
- Same thing but for some intangible factors using newly merged data, see if we can come to cool conclusions about those correlations
- Writeup analysis about what this shows us about society/housing market

## Prep For Machine Learning Models

Our dataset contains categorical, non-numerical columns, which do not work with PCA. To allow us to use dimension reduction techniques such as PCA or kernel PCA, we must convert our categorical values into binary or numeric data.

To do this we first must determine which columns we need to change, meaning we need to check which columns are categorical and which are numerical.

In [55]:
# these two lists hold the columns that are numerical versus categorical
numeric = []
categoric = []

# iterate through the dataframe and sort the column into the numeric list if the type is int or float, otherwise sorting it into the categorical list.
for col in df.columns:
    if df[col].dtype == np.float64 or df[col].dtype == np.int64:
        numeric.append(col)
    else:
        categoric.append(col)

# print
print('Numeric columns:', numeric)
print('Categorical columns:', categoric)

Numeric columns: ['price', 'bed', 'bath', 'acre_lot', 'street', 'zip_code', 'house_size', 'above_average', 'Nonfamily Households Median Income (Dollars)', 'biden_pct', 'trump_pct', 'minimum_wage']
Categorical columns: ['city', 'state']
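The same split can be obtained with pandas' `select_dtypes`, which recognizes every integer and float dtype rather than only `int64` and `float64`. A short sketch on a toy frame (`toy`, `numeric_cols` and `categorical_cols` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [585407.0, 298000.0],
    "above_average": [1, 1],
    "city": ["Queen Creek", "Sumner"],
    "state": ["Arizona", "Iowa"],
})

# "number" matches all integer and float dtypes (for example int32 as well as int64)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
```

This matters mostly for portability, since the default integer width can differ between platforms and library versions.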


#### One-Hot Encoding

After determining which columns are categorical, and therefore need to be changed to numeric values, we applied one-hot encoding as a first attempt to make our data more usable for further processes such as PCA or neural networks.

In [58]:
# create a copy of the data so that we don't affect the original dataframe
data = df.copy()

# one hot encode categorical features that we discovered in the last cell
one_hot_encoded_data = pd.get_dummies(data, columns = ['city', 'state'])

# print our data to understand what we are working with
one_hot_encoded_data.head()


|  | price | bed | bath | acre_lot | street | zip_code | house_size | above_average | Nonfamily Households Median Income (Dollars) | biden_pct | ... | state_South Dakota | state_Tennessee | state_Texas | state_Utah | state_Vermont | state_Virginia | state_Washington | state_West Virginia | state_Wisconsin | state_Wyoming |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 585407.0 | 4.0 | 3.0 | 0.15 | 735851.0 | 85142.0 | 2186.0 | 1 | 54890.0 | 49.4 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 298000.0 | 3.0 | 2.0 | 1.55 | 961740.0 | 50674.0 | 1240.0 | 1 | 40625.0 | 45.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 739900.0 | 4.0 | 2.0 | 0.25 | 930049.0 | 91977.0 | 1340.0 | 0 | 50286.0 | 63.5 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 315000.0 | 3.0 | 2.0 | 0.42 | 1826673.0 | 28751.0 | 1344.0 | 0 | 42108.0 | 48.7 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 855000.0 | 3.0 | 2.0 | 0.22 | 872293.0 | 59911.0 | 2822.0 | 1 | 42860.0 | 40.6 | ... | False | False | False | False | False | False | False | False | False | False |


Above, we can see that one-hot encoding has taken the two categorical columns and separated each distinct value into its own column, filled with a binary True/False that indicates whether a given row has that value. Even though this accomplishes what we wanted (making the categorical columns numerical), there are now far too many columns, and the data is too complex. Instead, below we will use another method: label encoding.

The very high-dimensional data created by one-hot encoding makes these models very computationally expensive, and we are currently unable to run them in our project.

So, we will label encode our categorical variables for Support Vector Machines and Logistic Regression and prep the data in the same way as for the neural network.
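The width of a one-hot encoded table can be estimated before encoding anything, because each distinct value becomes its own column. The sketch below shows this on a toy frame, along with scikit-learn's `OneHotEncoder`, which returns a sparse matrix by default; the sparse route is an option we did not pursue here, shown only for reference:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "city": ["Queen Creek", "Sumner", "Davis", "Sumner"],
    "state": ["Arizona", "Iowa", "California", "Iowa"],
})

# one-hot encoding adds one column per distinct value, so the width
# can be predicted from nunique() ahead of time
expected_columns = int(toy[["city", "state"]].nunique().sum())

# OneHotEncoder returns a SciPy sparse matrix by default, which stores
# only the non-zero entries instead of a full True/False grid
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy[["city", "state"]])

print(expected_columns, encoded.shape)
```

Running the `nunique()` line on the real `city` column would show how many columns the encoding adds for cities alone.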

In [60]:
# # split into test/train data
# X = one_hot_encoded_data.drop(['above_average'], axis = 1)
# y = one_hot_encoded_data['above_average']
# train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.33, random_state=42)

# # Scale the numeric features before converting to arrays
# numeric_features = X.select_dtypes(include=[np.number]).columns
# scaler = StandardScaler()
# train_X[numeric_features] = scaler.fit_transform(train_X[numeric_features])
# test_X[numeric_features] = scaler.transform(test_X[numeric_features])

# # Convert to arrays for the model
# X_train = np.array(train_X, dtype=np.float32)
# X_test = np.array(test_X, dtype=np.float32)
# y_train = tf.keras.utils.to_categorical(train_y).astype(np.int64)
# y_test = tf.keras.utils.to_categorical(test_y).astype(np.int64)

#### Label Encoding

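Label encoding keeps the number of columns unchanged, which suits kernel PCA, whereas one-hot encoding creates too many columns to be used. In label encoding, each unique category value is assigned an integer, making our data numerical and usable in PCA. One caveat is that these integers impose an arbitrary ordering on the cities and states, which distance-based methods will treat as meaningful.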

In [63]:
# create the label encoder
label_encoder = LabelEncoder()

# create the label encoded data
label_encoded_data = df.copy()
label_encoded_data['state'] = label_encoder.fit_transform(label_encoded_data['state'])
label_encoded_data['city'] = label_encoder.fit_transform(label_encoded_data['city'])

# create X and y data
X_le = label_encoded_data.drop(['above_average'], axis=1)
y_le = label_encoded_data['above_average']

# split into train and test sets
#set the random state to be the same for all train test splits so we know our results are from the same data.
train_Xle, test_Xle, train_yle, test_yle = train_test_split(X_le, y_le, test_size=0.33, random_state=42)

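# scale all features (every column is numeric after label encoding);
# the scaler is fit on the training split only and then applied to the test split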
scaler = StandardScaler()
train_Xle = scaler.fit_transform(train_Xle)
test_Xle = scaler.transform(test_Xle)


In [64]:
# Visualise the label encoded data
label_encoded_data

|  | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | above_average | Nonfamily Households Median Income (Dollars) | biden_pct | trump_pct | minimum_wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 585407.0 | 4.0 | 3.0 | 0.15 | 735851.0 | 4451 | 2 | 85142.0 | 2186.0 | 1 | 54890.0 | 49.4 | 49.1 | 12.00 |
| 1 | 298000.0 | 3.0 | 2.0 | 1.55 | 961740.0 | 5260 | 15 | 50674.0 | 1240.0 | 1 | 40625.0 | 45.0 | 53.2 | 7.25 |
| 2 | 739900.0 | 4.0 | 2.0 | 0.25 | 930049.0 | 5145 | 4 | 91977.0 | 1340.0 | 0 | 50286.0 | 63.5 | 34.3 | 12.00 |
| 3 | 315000.0 | 3.0 | 2.0 | 0.42 | 1826673.0 | 3170 | 33 | 28751.0 | 1344.0 | 0 | 42108.0 | 48.7 | 50.1 | 7.25 |
| 4 | 855000.0 | 3.0 | 2.0 | 0.22 | 872293.0 | 439 | 26 | 59911.0 | 2822.0 | 1 | 42860.0 | 40.6 | 56.9 | 8.65 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30076 | 240000.0 | 3.0 | 2.0 | 3.00 | 1560549.0 | 7 | 20 | 21520.0 | 1512.0 | 0 | 34758.0 | 65.8 | 32.4 | 11.00 |
| 30077 | 259999.0 | 2.0 | 2.0 | 0.14 | 1362707.0 | 2864 | 13 | 60156.0 | 2000.0 | 0 | 60107.0 | 57.6 | 40.5 | 10.00 |
| 30078 | 1650000.0 | 4.0 | 3.0 | 21.97 | 589967.0 | 2110 | 4 | 95949.0 | 4934.0 | 1 | 39872.0 | 63.5 | 34.3 | 12.00 |
| 30079 | 445900.0 | 4.0 | 2.0 | 0.14 | 1782368.0 | 4916 | 9 | 33777.0 | 1932.0 | 1 | 52365.0 | 47.9 | 51.2 | 8.56 |


As shown above, the label encoded data is much more condensed than the one-hot encoded data, as each distinct value in the categorical columns has been given a unique integer. This data is now ready to be used by dimension reduction models such as PCA, or by further machine learning models and neural networks.
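One thing worth checking at this stage is whether any remaining feature gives the target away. Because `above_average` is computed directly from `price` and the state's median price, a feature matrix that still contains `price` (as `X_le` does) lets a model reconstruct the label from `price` and `state` alone. The sketch below shows a variant that also drops `price`; this is a suggestion on made-up rows, not what the cells above do, and `X_no_leak` is a hypothetical name:

```python
import pandas as pd

# toy label-encoded rows (values are illustrative)
toy = pd.DataFrame({
    "price": [585407.0, 298000.0, 739900.0, 315000.0],
    "state": [2, 15, 4, 33],
    "house_size": [2186.0, 1240.0, 1340.0, 1344.0],
    "above_average": [1, 1, 0, 0],
})

# drop the target and the column the target was derived from
leak_columns = ["above_average", "price"]
X_no_leak = toy.drop(columns=leak_columns)
y = toy["above_average"]

print(X_no_leak.columns.tolist())
```

Comparing model accuracy with and without `price` in the features would show how much of the score comes from this shortcut.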

## PCA

Because we have so much data and so many features, we thought a dimension reduction might help ensure that our models receive high-quality data and run with high accuracy. To start, we used regular PCA to see how many components were needed for the cumulative explained variance ratio to exceed 80%.

#### Normal PCA

In [69]:
# call PCA command
pca = PCA(n_components = 7) # using 7 components
X_pca_sklearn = pca.fit_transform(train_Xle) # fit transform the training data

# print explained variance ratio and the cumsum explained variance ratio
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))

[0.2417125  0.1835296  0.0933645  0.08111681 0.07776586 0.07473534
 0.0728888 ]
[0.2417125  0.4252421  0.5186066  0.59972341 0.67748927 0.75222461
 0.82511342]


After running PCA and testing a few n_components values, we found that 7 components were needed for the cumulative explained variance ratio to exceed 80%. This seemed very large, so we wanted to see whether Kernel PCA might do better.
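Instead of testing n_components values by hand, scikit-learn's `PCA` also accepts a fraction between 0 and 1 and then keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below uses random synthetic data as a stand-in for our scaled training matrix, so the component count it prints is not comparable to ours:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the scaled training matrix (200 samples, 10 features)
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 10)))

# a float in (0, 1) asks PCA to keep the fewest components whose
# cumulative explained variance exceeds that fraction
pca_auto = PCA(n_components=0.80, svd_solver="full")
pca_auto.fit(X_demo)

cumulative = np.cumsum(pca_auto.explained_variance_ratio_)
print(pca_auto.n_components_, cumulative[-1])
```

The chosen count is available afterwards as `pca_auto.n_components_`.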

#### Kernel PCA

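Continuing our effort to improve our data features and accuracy, we decided to try kernel PCA, as this nonlinear dimensionality reduction method seemed like it would be able to capture the complexities of our large dataset.
When testing our data, though, we found that it still took 6 components to get the cumulative explained variance ratio above 80%. Note that the ratio computed below is normalized over only the 8 components we keep, which is why it reaches exactly 1 at the eighth component: it measures each component's share of those 8, not of the total variance.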

In [73]:
# define the KPCA, fit and transform our data based on new features
kpca = KernelPCA(n_components = 8, kernel = 'rbf')
# kpca_train_Xle = kpca.fit_transform(train_Xle)
# kpca_test_Xle = kpca.transform(test_Xle)
kpca_transform = kpca.fit_transform(train_Xle)
explained_variance = np.var(kpca_transform, axis=0)
explained_variance_ratio = explained_variance/ np.sum(explained_variance)
print(explained_variance_ratio)
print(np.cumsum(explained_variance_ratio))



[0.28672696 0.13112539 0.1232155  0.12012658 0.11488223 0.09467519
 0.07011192 0.05913622]
[0.28672696 0.41785236 0.54106786 0.66119444 0.77607667 0.87075186
 0.94086378 1.        ]
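To illustrate how the normalization affects these ratios, the sketch below compares normalizing over 8 retained components with normalizing over every non-zero component. It runs on a small synthetic matrix (a stand-in, since fitting all components on our full training set is expensive), and estimating variance with `np.var` on the transformed data follows the approach used in the cell above:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))   # small synthetic stand-in

# keep 8 components and normalize over those 8 only
kpca_8 = KernelPCA(n_components=8, kernel="rbf")
var_8 = np.var(kpca_8.fit_transform(X_demo), axis=0)
ratio_within_8 = np.cumsum(var_8 / var_8.sum())

# keep every non-zero component and normalize over all of them
kpca_all = KernelPCA(kernel="rbf")
var_all = np.var(kpca_all.fit_transform(X_demo), axis=0)
ratio_overall = np.cumsum(var_all / var_all.sum())

print(ratio_within_8[-1])   # equals 1 (up to rounding) by construction
print(ratio_overall[7])     # share of all kernel-space variance in the first 8
```

The second number is the one that is comparable to the cumulative ratio reported by regular PCA.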


Because we were determined to find a way to use kernel PCA, we tried tuning the gamma parameter of the RBF kernel to find the value that reduces the dimensions of our data the most. To do this, we set a few test gamma values and a target variance ratio threshold. We then loop through the test gammas and output, for each one, the number of components it takes to reach the target threshold along with the cumulative explained variance. We are aiming for the fewest components needed to reach the threshold.

In [109]:
# Tune the gamma parameter for the RBF kernel
gamma_values = [0.0001, 0.001, .01, .05] # list of gamma values
best_n_components = None # initializing variable
best_explained_variance_ratio = None # initializing variable
best_gamma = None # initializing variable
target_variance_ratio = 0.80  # 80% variance threshold

# for loop to iterate through the gamma values and calculate cumulative explained variance
for gamma in gamma_values:
    kpca2 = KernelPCA(kernel='rbf', gamma=gamma) # create the PCA
    kpca_transform = kpca2.fit_transform(train_Xle) # fit and transform the training data
    explained_variance = np.var(kpca_transform, axis=0) # calculate explained variance
    explained_variance_ratio = explained_variance / np.sum(explained_variance) # calculate the ev ratio
    cumulative_explained_variance = np.cumsum(explained_variance_ratio) # calculate cumulative explained variance
    
    # Find the number of components that meet the target variance ratio
    n_components = np.argmax(cumulative_explained_variance >= target_variance_ratio) + 1
    
    # print the number of components needed for 80% variance using that gamma value
    print(f'Gamma: {gamma}, Number of components for {target_variance_ratio*100}% variance: {n_components}')
    print(f'Cumulative explained variance: {cumulative_explained_variance[:n_components]}') # output the cumulative explained variance

    # fill in best n components variable and corresponding best explained variance ratio
    if best_n_components is None or n_components < best_n_components:
        best_n_components = n_components
        best_explained_variance_ratio = explained_variance_ratio
        best_gamma = gamma 

# Fit KPCA with the best parameters (use best_gamma, not the last gamma from the loop)
kpca2 = KernelPCA(n_components=best_n_components, kernel='rbf', gamma=best_gamma)
kpca_train_Xle = kpca2.fit_transform(train_Xle)
kpca_test_Xle = kpca2.transform(test_Xle)

# print best gamma value and resulting information
print(f'Best gamma: {best_gamma}, Best number of components: {best_n_components}')
print(f'Explained variance ratio for best model: {best_explained_variance_ratio[:best_n_components]}')
print(f'Cumulative explained variance for best model: {np.cumsum(best_explained_variance_ratio[:best_n_components])}')

Gamma: 0.0001, Number of components for 80.0% variance: 7
Cumulative explained variance: [0.26493289 0.41976319 0.50997903 0.59559545 0.67872038 0.75340082
 0.81531486]
Gamma: 0.001, Number of components for 80.0% variance: 7
Cumulative explained variance: [0.28724862 0.43379253 0.53233162 0.62604111 0.71711559 0.79756182
 0.86282608]
Gamma: 0.01, Number of components for 80.0% variance: 7
Cumulative explained variance: [0.27263268 0.38778075 0.48546047 0.5790997  0.67020299 0.74867043
 0.80790238]
Gamma: 0.05, Number of components for 80.0% variance: 22
Cumulative explained variance: [0.18993986 0.26957799 0.34636212 0.42136268 0.493155   0.5536854
 0.59357747 0.63109514 0.65657642 0.67403168 0.69079382 0.70723569
 0.722911   0.73569176 0.74681894 0.75722594 0.766396   0.77411105
 0.78140722 0.78849753 0.7954313  0.80198856]
Best gamma: 0.0001, Best number of components: 7
Explained variance ratio for best model: [0.26493289 0.1548303  0.09021584 0.08561642 0.08312493 0.07468043
 0.06

From the outputs and visualizations, we saw that reaching the target variance ratio threshold consistently required many components (at least 7 in the runs above), so the dimension reduction was not condensing our data by much. This suggested there was little need to perform PCA. We confirmed this as we proceeded with our model building without PCA, using just the label-encoded data, and found that the models still ran with high accuracy.
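One detail of the component count above is worth guarding against: `np.argmax` on a boolean array that is all `False` returns 0, so `np.argmax(...) + 1` would report 1 component even if the threshold were never reached. The following is a minimal sketch of a guarded version; the helper name `components_for_target` is hypothetical and not part of our notebook, and the input values are copied from the gamma=0.0001 output.

```python
import numpy as np

def components_for_target(cumulative_variance, target):
    """Return how many leading components reach `target`, or None if none do."""
    cumulative_variance = np.asarray(cumulative_variance, dtype=float)
    reached = cumulative_variance >= target
    if not reached.any():
        # np.argmax on an all-False array returns 0, which would wrongly report 1 component
        return None
    return int(np.argmax(reached)) + 1

# Cumulative explained variance copied from the gamma=0.0001 run
cumulative = [0.26493289, 0.41976319, 0.50997903, 0.59559545,
              0.67872038, 0.75340082, 0.81531486]

print(components_for_target(cumulative, 0.80))      # 7
print(components_for_target(cumulative[:3], 0.80))  # None
```

In the loop above the ratios are normalized over all components, so the threshold is always reached eventually; the guard matters mainly if the component list is truncated first.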

## Neural Network

## Other Machine Learning Models

In order to see if we can improve on the accuracy of the neural networks, we decided to run our data through various supervised machine learning models that are used to classify data. These include: SVM, Logistic Regression, K Nearest Neighbors, Decision Tree, Random Forest, and Max Voting Classifier.

#### SVM - Classifier, Supervised

To find the SVM parameters that give us the best accuracy, we set up a grid search over several C and gamma values.

In [82]:
# defining parameter range
param_grid = {'C': [1, 10, 100, 500],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf']}

# setting the grid
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)

# fitting the model for grid search
grid.fit(train_Xle, train_yle)

# print best parameter after tuning
print(grid.best_params_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV 1/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.877 total time=  17.4s
[CV 2/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.874 total time=  20.3s
[CV 3/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.875 total time=  17.4s
[CV 4/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.869 total time=  17.3s
[CV 5/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.873 total time=  17.3s
[CV 1/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.924 total time=   6.1s
[CV 2/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.924 total time=   6.1s
[CV 3/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.924 total time=   6.2s
[CV 4/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.918 total time=   6.0s
[CV 5/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.929 total time=   6.1s
[CV 1/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.891 total time=   7.5s
[CV 2/5] END .......C=1, gamma=0.01, kernel=rbf;

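We find that the above parameters ('C': 100, 'gamma': 0.1) are the best for accuracy, so we use them as the basis for our SVM model. Note that the cell below sets C=1000, a value outside the searched grid, so the scores that follow are for C=1000 and gamma=0.1.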
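Rather than retyping the selected values by hand, the refit estimator from the grid search can be reused directly, which keeps the final model in sync with `best_params_`. Below is a minimal sketch on synthetic data; the names `X_demo`, `y_demo`, `demo_grid`, and the smaller grid are assumptions for illustration, not our housing data or our actual search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the housing dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_grid = GridSearchCV(
    SVC(),
    {'C': [1, 10, 100], 'gamma': [0.1, 0.01], 'kernel': ['rbf']},
    cv=3,
    refit=True,  # refit the best candidate on the whole training split
)
demo_grid.fit(X_tr, y_tr)

# best_estimator_ already carries best_params_, so nothing is retyped
best_svc = demo_grid.best_estimator_
demo_params = demo_grid.best_params_
demo_test_score = best_svc.score(X_te, y_te)
print(demo_params, demo_test_score)
```

With `refit=True` (the default), `best_estimator_` is already trained on the full training data passed to `fit`, so a separate `SVC(...).fit(...)` call is not needed.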

In [84]:
# create SVM model and train
svc = SVC(kernel='rbf', C=1000, gamma=0.1)
svc.fit(train_Xle, train_yle)

# score model on training and test data
svm_train_score = svc.score(train_Xle, train_yle)
svm_test_score = svc.score(test_Xle, test_yle)
svm_preds = svc.predict(test_Xle)

# print scores
print(f'The training score is {svm_train_score}.')
print(f'The testing score is {svm_test_score}.')

The training score is 0.9982137540934802.
The testing score is 0.9682683590208522.


The resulting training score is extremely high (about 0.998), and although the testing score is somewhat lower (about 0.968), it is still very high, so we aren't worried about overfitting.

#### Logistic Regression - Classifier, Supervised

To find the optimal parameters for the Logistic Regression model, we created a parameter grid to test various C values, and ran the grid search in order to find the best parameters for the highest accuracy.

In [88]:
# Define the parameter grid
log_param_grid = {
    'C': [0.1, 1, 10, 100, 500],
    'penalty': ['l2'],  # 'newton-cg' supports only the 'l2' penalty or no penalty
    'solver': ['newton-cg']}  # solver fixed to 'newton-cg', which is compatible with 'l2'

# Initialize the Logistic Regression model
log = LogisticRegression()

# Set up the Grid Search
log_grid_search = GridSearchCV(estimator=log, param_grid=log_param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the Grid Search to the training data
log_grid_search.fit(train_Xle, train_yle)

# Get the best parameters and the best score
log_best_params = log_grid_search.best_params_
log_best_score = log_grid_search.best_score_

print(f"Best parameters found: {log_best_params}")
print(f"Best cross-validation accuracy: {log_best_score}")

# Evaluate the model with the best parameters on the test data
best_log = log_grid_search.best_estimator_
log_y_pred = best_log.predict(test_Xle)
log_test_accuracy = accuracy_score(test_yle, log_y_pred)

print(f"Test set accuracy: {log_test_accuracy}")

Best parameters found: {'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
Best cross-validation accuracy: 0.8875661021623362
Test set accuracy: 0.8815352070111816


According to the grid search, the best logistic regression parameter was C=100. This gave us a lower test accuracy (88%) than the SVM model, although the cross-validation and test accuracies were closer to each other.

#### K Nearest Neighbors - Classifier, Supervised

To find the optimal parameters for the KNN model, we created a parameter grid to test a range of neighbor counts, and ran the grid search in order to find the best parameter for the highest accuracy.

In [92]:
# accuracy_neigh = {} # create dictionary to hold the accuracy for each nearest neighbors

# # for loop to cycle through possible number of nearest neighbors
# for i in range (8,36,2):
    
#     # train model and fit data based on current iterations number of nearest neighbors
#     neigh = KNeighborsClassifier(n_neighbors=i).fit(train_Xle, train_yle)

#     # calculate test accuracy and print 
#     y_pred_neigh = neigh.predict(test_Xle)
#     acc_neigh = neigh.score(test_Xle, test_yle)
#     print(f'Test accuracy for {i} nearest neighbors is {acc_neigh:.2f}')
    
#     # append accuracy to dictionary with index (number of nearest neighbors)
#     accuracy_neigh[i] = acc_neigh

# max_accuracy_neigh = max(accuracy_neigh.values()) # find the highest accuracy
# accuracy_index_neighbors = max(accuracy_neigh, key=accuracy_neigh.get) # find the highest accuracy's index (number of nearest neighbor)
# print(f"\nHighest accuracy: {max_accuracy_neigh:.2f}, n_neighbors: {accuracy_index_neighbors}") # print

# # train and fit model based on the highest accuracy
# neighbors = KNeighborsClassifier(n_neighbors=accuracy_index_neighbors).fit(train_Xle, train_yle)


# Define the parameter grid
knn_param_grid = {
    'n_neighbors': range(8, 36, 2)}

# Initialize the KNN model
knn = KNeighborsClassifier()

# Set up the Grid Search
knn_grid_search = GridSearchCV(estimator=knn, param_grid=knn_param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the Grid Search to the training data
knn_grid_search.fit(train_Xle, train_yle)

# Get the best parameters and the best score
knn_best_params = knn_grid_search.best_params_
knn_best_score = knn_grid_search.best_score_

print(f"Best parameters found: {knn_best_params}")
print(f"Best cross-validation accuracy: {knn_best_score}")

# Evaluate the model with the best parameters on the test data
best_knn = knn_grid_search.best_estimator_
knn_y_pred = best_knn.predict(test_Xle)
knn_test_accuracy = accuracy_score(test_yle, knn_y_pred)

print(f"Test set accuracy: {knn_test_accuracy}")


Best parameters found: {'n_neighbors': 20}
Best cross-validation accuracy: 0.819043307665838
Test set accuracy: 0.8242167825123401


According to the grid search, the best KNN parameter was 'n_neighbors': 20. This gave us a lower test accuracy (82%) than the Logistic Regression model.

#### K Nearest Neighbors Again (Kernel PCA Features)

In [95]:

# accuracy_neigh2 = {} # create dictionary to hold the accuracy for each number of nearest neighbors

# # for loop to cycle through possible numbers of nearest neighbors
# for j in range(8, 36, 2):
    
#     # train model on the KPCA-transformed data using the current number of nearest neighbors
#     neigh2 = KNeighborsClassifier(n_neighbors=j).fit(kpca_train_Xle, train_yle)

#     # calculate test accuracy and print 
#     y_pred_neigh2 = neigh2.predict(kpca_test_Xle)
#     acc_neigh2 = neigh2.score(kpca_test_Xle, test_yle)
#     print(f'Test accuracy for {j} nearest neighbors is {acc_neigh2:.2f}')
    
#     # store accuracy in dictionary keyed by number of nearest neighbors
#     accuracy_neigh2[j] = acc_neigh2

# max_accuracy_neigh2 = max(accuracy_neigh2.values()) # find the highest accuracy
# accuracy_index_neighbors2 = max(accuracy_neigh2, key=accuracy_neigh2.get) # find the number of neighbors with the highest accuracy
# print(f"\nHighest accuracy: {max_accuracy_neigh2:.2f}, n_neighbors: {accuracy_index_neighbors2}") # print

# # train and fit model based on the highest accuracy
# neighbors2 = KNeighborsClassifier(n_neighbors=accuracy_index_neighbors2).fit(kpca_train_Xle, train_yle)

# # # plot the decision region
# # plot_regions(neighbors, X_test_, y_test, 'K-Nearest-Neighbors Model')


#### Decision Tree - Classification, Supervised

To find the optimal parameters for the Decision Tree model, we created a parameter grid to test a few max depths, and ran the grid search in order to find the best parameter for the highest accuracy.

In [98]:
# Define the parameter grid
tree_param_grid = {
    'max_depth': range(2, 21, 2),}

# Initialize the grid search
tree_grid_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=tree_param_grid, cv=5)

# Perform the grid search
tree_grid_search.fit(train_Xle, train_yle)

# Get the best parameters
tree_best_params = tree_grid_search.best_params_
tree_best_score = tree_grid_search.best_score_
print("Best Parameters:", tree_best_params)
print("Best train accuracy:", tree_best_score)

# Train the model with the best parameters
best_tree = DecisionTreeClassifier(**tree_best_params)
best_tree.fit(train_Xle, train_yle)

# Calculate test accuracy
acc_tree = best_tree.score(test_Xle, test_yle)
print(f'Test accuracy for the best model: {acc_tree:.2f}')

Best Parameters: {'max_depth': 20}
Best train accuracy: 0.9920612646530333
Test accuracy for the best model: 1.00


According to the grid search, the best Decision Tree parameter was 'max_depth': 20. This gave us a very high test accuracy (1.00 when rounded to two decimals), much higher than any of the previous models. The cross-validation score (about 0.992) and the test score were also very similar, again meaning that it's unlikely that there is overfitting.
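The value printed as "Best train accuracy" comes from `best_score_`, which is the mean cross-validated accuracy of the best candidate rather than accuracy measured on the training data itself. To judge overfitting it can help to look at all three numbers side by side. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `demo_tree` are hypothetical names, and the data is not our housing dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the housing dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_tree = DecisionTreeClassifier(max_depth=20, random_state=0)

# Mean accuracy over 5 held-out folds of the training split
cv_scores = cross_val_score(demo_tree, X_tr, y_tr, cv=5)
cv_acc = cv_scores.mean()

# Accuracy on the data the tree was fit on, and on the untouched test split
demo_tree.fit(X_tr, y_tr)
train_acc = demo_tree.score(X_tr, y_tr)
test_acc = demo_tree.score(X_te, y_te)

print(f'train={train_acc:.3f}  cv={cv_acc:.3f}  test={test_acc:.3f}')
```

A large gap between the training accuracy and the cross-validation or test accuracy would point toward overfitting; similar values across all three are more reassuring.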

#### Random Forest - Classification, Supervised

To find good parameters for the Random Forest model, we defined ranges for the number of estimators and the max depth, and ran a randomized search over them in order to find the combination with the highest accuracy.

In [102]:
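from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Define the parameter grid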
rf_param_grid = {
    'n_estimators': range(100, 111),  # Adjust the range as per your preference
    'max_depth': range(2, 21, 2)}

# Initialize the random search
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=rf_param_grid, n_iter=10, cv=5)

# Perform the random search
random_search.fit(train_Xle, train_yle)

# Get the best parameters
best_params_rf = random_search.best_params_
print("Best Parameters:", best_params_rf)
rf_best_score = random_search.best_score_
print("Best train accuracy:", rf_best_score)

# Train the model with the best parameters
best_rf = RandomForestClassifier(**best_params_rf)
best_rf.fit(train_Xle, train_yle)

# Calculate test accuracy
acc_rf = best_rf.score(test_Xle, test_yle)
print(f'Test accuracy for the best model: {acc_rf:.2f}')

Best Parameters: {'n_estimators': 107, 'max_depth': 14}
Best train accuracy: 0.9876948069336094
Test accuracy for the best model: 0.99


According to the randomized search, the best Random Forest parameters were 'n_estimators': 107 and 'max_depth': 14. This gave us a very high test accuracy (99%), on par with the Decision Tree model and higher than the SVM, Logistic Regression, and KNN models.
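Because the search above is randomized with `n_iter=10`, it evaluates only a sample of the possible parameter combinations, so a different run may select different values. The sketch below counts the candidates in `rf_param_grid` and draws 10 of them using only the standard library; it is an illustration of the coverage, not a reproduction of scikit-learn's own sampler.

```python
import itertools
import random

n_estimators_options = range(100, 111)  # 11 values, as in rf_param_grid
max_depth_options = range(2, 21, 2)     # 10 values, as in rf_param_grid

# Every (n_estimators, max_depth) pair an exhaustive grid search would try
all_candidates = list(itertools.product(n_estimators_options, max_depth_options))

rng = random.Random(0)                    # fixed seed so the draw is repeatable
sampled = rng.sample(all_candidates, 10)  # mirrors n_iter=10, without replacement

coverage = len(sampled) / len(all_candidates)
print(len(all_candidates), f'{coverage:.1%}')  # 110 9.1%
```

Roughly one in eleven combinations is tried, which trades some thoroughness for a much shorter run time compared with fitting all 110 candidates across 5 folds.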

#### Max Voting Classifier

For the Max Voting Classifier, we collected the KNN, Logistic Regression, Decision Tree, and Random Forest models we created above and combined them in a hard-voting ensemble, in order to compare the accuracy of each individual model with that of the ensemble.

In [107]:
from sklearn.ensemble import VotingClassifier

# define the classifiers to combine: knn and log are the base estimators with default
# hyperparameters, while best_tree and best_rf hold the hyperparameters selected above
kn_clf = knn
log_clf = log
tree_clf = best_tree
rf_clf = best_rf

# define max vote classifier - uses estimators which are all the models you want to use
voting_clf = VotingClassifier( estimators=[('kn',kn_clf),('lr',log_clf),('tree',tree_clf), ('rf',rf_clf)], voting='hard')

# train max vote classifier - fit data
voting_clf.fit(train_Xle, train_yle)

# look at and print each classifier's accuracy on the test set:
for clf in (kn_clf, log_clf, tree_clf, rf_clf, voting_clf):
    clf.fit(train_Xle, train_yle) # fit and train 
    acc_vc = clf.score(test_Xle, test_yle) # calculate test accuracy
    print(clf.__class__.__name__, acc_vc) # print

KNeighborsClassifier 0.8224035458849602
LogisticRegression 0.8817366777475572
DecisionTreeClassifier 0.9960713206406769
RandomForestClassifier 0.9895235217084718
VotingClassifier 0.9571874685201974


Comparing the accuracies collected here, the Decision Tree (about 99.6%) and Random Forest (about 99.0%) models have the highest testing accuracy, and both outperform the Voting Classifier (about 95.7%).
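One plausible reason the ensemble scores below its two strongest members is that hard voting counts every model equally, so the two weaker models can cancel out or outvote the two stronger ones. The following is a minimal sketch of per-sample majority voting; the helper `hard_vote` and the example predictions are hypothetical, and the tie rule (smallest label wins) is an assumption modeled on scikit-learn's documented ascending-order tie-break.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over one sample's predicted labels; ties go to the smallest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)

# Hypothetical predictions for three samples from four models (kn, lr, tree, rf)
per_sample_votes = [
    [1, 1, 0, 0],  # tie: two models cancel the other two
    [2, 2, 2, 0],  # clear majority for class 2
    [0, 1, 1, 1],  # clear majority for class 1
]

ensemble = [hard_vote(votes) for votes in per_sample_votes]
print(ensemble)  # [0, 2, 1]
```

With an even number of estimators, as in our four-model ensemble, ties like the first row are possible, and the outcome is then decided by the tie rule rather than by the more accurate models.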

# Discussion of these original models

## Data Reprocessing and Exploration

## Preparing NN

In [None]:
Office Hours Notes:

- feature selection - do the prediction on most relevant features
- lots of models - do hyperparameter selection, if we use simple nn, tune hyperparameters to improve performance
- compare all models - nn and machine learning, must make sure they are all fully trained - okay if the machine learning models are working better than nn
- cross validation - estimates test accuracy - sometimes use this to select parameters - but can also use to estimate test accuracy
--- do a bunch of different splits and see if it keeps performing highest


Presentation
- what's our goal
- data, models, method, what did we do, what did we try