# Housing Price Trends and Factors in Current Day America

Table of Contents:
    
1. Rising Concerns for Housing
2. A Preliminary Look at the Data
3. *Later Thing*
4. *Later Thing*


By Eric Chi and Fox Davenport

# Rising Concerns for Housing

## Goal of Our Project

This project addresses the growing concern among Generation Z and Millennials that homeownership is becoming harder to reach as housing prices and rental costs rise. Our analysis aims to identify and quantify the factors that influence housing prices and rent. The goal is to build a multiple linear regression (MLR) model that can reasonably predict housing and rent prices, so that consumers can understand current housing trends and, with this newfound knowledge, make more informed financial decisions. We will start by fitting an MLR with all potential factors in our data. Using a combination of partial F-tests and ANOVAs, we will determine the significant predictors. Then we will set up a machine learning workflow to fit and train candidate MLRs before performing an F-test to choose the best model. Finally, we will check that the MLR model assumptions are satisfied before drawing conclusions.
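To make the partial F-test idea concrete, here is a minimal sketch on synthetic data (not the Ames data). The helper names `residual_sum_of_squares` and `partial_f_statistic` are our own illustrative choices, and the sketch uses plain NumPy least squares instead of a statistics package. It compares a full model against a reduced model that drops one predictor.

```python
import numpy as np

def residual_sum_of_squares(X, y):
    """Fit ordinary least squares (with an intercept) and return the RSS."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)

def partial_f_statistic(X_full, X_reduced, y):
    """F statistic for H0: the predictors dropped from the full model have zero coefficients."""
    n = len(y)
    p_full = X_full.shape[1] + 1        # +1 for the intercept
    p_reduced = X_reduced.shape[1] + 1
    rss_full = residual_sum_of_squares(X_full, y)
    rss_reduced = residual_sum_of_squares(X_reduced, y)
    numerator = (rss_reduced - rss_full) / (p_full - p_reduced)
    denominator = rss_full / (n - p_full)
    return numerator / denominator

# Synthetic data (an assumption for illustration): y depends on x1 and x2 but not x3
rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([x1, x2, x3])
f_drop_x3 = partial_f_statistic(X_full, np.column_stack([x1, x2]), y)
f_drop_x2 = partial_f_statistic(X_full, np.column_stack([x1, x3]), y)
print(f"F for dropping x3: {f_drop_x3:.2f}")
print(f"F for dropping x2: {f_drop_x2:.2f}")
```

A large F statistic suggests the dropped predictor explains a meaningful share of the variation, so here we would expect a much larger value for x2 than for x3.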

## Why Housing?

Day after day, you hear the news report on the struggles that Generation Z and Millennials face. There is a growing shared sentiment among Generation Z and Millennials that the world is filled with dread and gloom. The world is becoming harder to survive and live in, which causes worries about the future. Housing is one of these issues, with many people exclaiming that rent and housing prices only seem to go up. There are countless stories of people paying outrageous prices for poor living situations that would have been cheaper in the past.

As students looking to work in the data science field, this project allows us to practice our abilities in pattern recognition and data management on a topic that concerns us and our colleagues. I, Eric, am a math/economics major, so a topic such as the housing market and the variables that make the market fluctuate ties directly into my studies in business and mathematical modeling. I, Fox, am a financial actuarial math major, so the analysis of contributing factors to an economic trend will be part of my daily work life in the future. One of our greatest personal fears is not being able to secure a comfortable and proper living after college. Housing is crucial for that lifestyle, and understanding the general trends that affect housing prices and rental costs will give us an advantage in decision making when we enter the housing market.

# A Preliminary Look at the Data

Our data was taken from Kaggle and is extensive housing data for the Ames, Iowa region. We will take precautions to prevent overfitting and to keep the model reasonably generalizable to the rest of the USA. The Kaggle competition already provides us with separate training and testing data.

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
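One common precaution against overfitting is to hold out part of the labeled data and evaluate the model on rows it has never seen. The sketch below only illustrates the idea on made-up numbers. The 80/20 split and the names `labeled`, `train_part`, and `valid_part` are our assumptions, not part of the analysis that follows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for labeled housing data (values are made up)
rng = np.random.default_rng(0)
labeled = pd.DataFrame({
    "GrLivArea": rng.integers(600, 3000, size=100),
    "SalePrice": rng.integers(80_000, 400_000, size=100),
})

# Hold out 20% of the labeled rows for validation
valid_part = labeled.sample(frac=0.2, random_state=0)
# Everything that was not sampled stays in the training part
train_part = labeled.drop(valid_part.index)

print(train_part.shape, valid_part.shape)
```

A model would then be fit on `train_part` only and scored on `valid_part`.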

In [10]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from tensorflow.keras import layers, initializers, utils
from sklearn.model_selection import train_test_split
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, StandardScaler
import seaborn as sns
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import KernelPCA
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

In [11]:
# Read in our data; separate training and test sets are already provided
train_data = pd.read_csv('https://github.com/FoxDavenport/PIC16BFinalProject/blob/main/train.csv?raw=true')

test_data = pd.read_csv('https://github.com/FoxDavenport/PIC16BFinalProject/blob/main/test.csv?raw=true')

## Data Exploration and Processing 

Let's observe what information our dataset contains, how it's structured, and what it looks like.

In [14]:
print(train_data.shape)
print(test_data.shape)

(1460, 81)
(1459, 80)


The training data has 1460 houses and 81 features for each house. We will use all entries from the training data, as the dataset is small enough that memory and runtime should not be a problem.

Note that the test data has 1459 houses and 80 features for each house. The test data has one fewer feature because it does not include the sale price of the houses, which is the output feature. The testing data only has the input features.
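A quick way to confirm which column is missing from the test data is to compare the two column indexes. This is a minimal sketch on toy frames. The names `train_toy` and `test_toy` and the values in `test_toy` are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for the real training and test data
train_toy = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test_toy = pd.DataFrame({"Id": [3, 4], "LotArea": [10000, 12000]})

# Columns present in the training frame but absent from the test frame
missing_from_test = train_toy.columns.difference(test_toy.columns)
print(list(missing_from_test))
```

The same `columns.difference` call should work on `train_data` and `test_data`.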

Let us examine which features are potential factors for our housing price.

In [17]:
# Gets column names
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

We now know what features are in our dataset, but we don't know what sort of information they hold. Let's take a look at the first few rows with every column shown and see what kind of values each feature takes.

In [19]:
# show first 5 rows of the dataframe and all columns
pd.set_option("display.max_rows", None, "display.max_columns", None)
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


We can now see the kinds of values each feature takes: some are categorical and others are numerical.
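The split between numerical and categorical features can also be read off the column dtypes. Here is a small sketch on a toy frame built from a few values in the rows above. The names `toy`, `numeric_cols`, and `categorical_cols` are our own.

```python
import pandas as pd

# A few columns and values copied from the first three rows shown above
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Pave"],
    "OverallQual": [7, 6, 7],
    "LotShape": ["Reg", "Reg", "IR1"],
})

# Numeric dtypes versus everything else (text labels here)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

One caveat: a numeric dtype does not guarantee a quantitative feature. MSSubClass, for instance, is stored as numbers such as 20 and 60, but these are codes for the building class.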

Some features do not apply to every house, so their fields are empty. The Alley, MasVnrType, PoolQC, Fence, and MiscFeature columns, among others, contain NaNs. We will remove these columns because missing values would interfere with fitting our model. Furthermore, my partner and I judged them to be only marginally important. We want to focus on what we see as the big contributing factors of a house: its location, time, size, interior, and materials.
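To find which columns contain NaNs, we can count missing values per column. Here is a minimal sketch on a toy frame whose values mirror the first three rows shown above. The names `toy`, `nan_counts`, and `cols_with_nans` are ours.

```python
import numpy as np
import pandas as pd

# Values mirror the first three rows of the output above
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": [np.nan, np.nan, np.nan],
    "FireplaceQu": [np.nan, "TA", "TA"],
})

# Count missing values per column and keep only columns that have any
nan_counts = toy.isnull().sum()
cols_with_nans = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(cols_with_nans)
```

Running the same two lines on `train_data` would list every column with missing values, not just the ones named above.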

The following are the columns we want to keep for our training and testing data.

In [21]:
# List of columns to keep
columns_to_keep = [
    'SalePrice', 'MSSubClass', 'LotArea', 'Street', 'LotShape', 'LandContour', 'LotConfig', 
    'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 
    'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Foundation', 
    'Heating', 'HeatingQC', 'CentralAir', '1stFlrSF', '2ndFlrSF', 
    'LowQualFinSF', 'GrLivArea', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 
    'TotRmsAbvGrd', 'Fireplaces', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 
    'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'MiscVal', 'MoSold', 'YrSold', 'SaleCondition'
]

# Keep the columns we specified for train_data
train_data = train_data[columns_to_keep]

# Remove 'SalePrice' from columns_to_keep. Need to do this because our test_data only has input features.
# It does not have SalePrice, the output feature
columns_to_keep.remove('SalePrice')
test_data = test_data[columns_to_keep]

Now, let's make sure we don't have any NaNs left in our training and testing data.

In [23]:
# find sums of NaNs in dataset if there are any
print(train_data.isnull().sum().sum())
print(test_data.isnull().sum().sum())

0
0


Our data no longer has any NaNs left. Let's now check the size of our training and testing data.

In [25]:
# Show new training data results
print(train_data.shape)
print(test_data.shape)

(1460, 43)
(1459, 42)


Our new training data after removing the unnecessary columns has 43 features for 1460 houses and our new testing data has 42 features for 1459 houses.

Let us now look at the first few rows of our training data with the reduced set of features.

In [27]:
train_data.head()

Unnamed: 0,SalePrice,MSSubClass,LotArea,Street,LotShape,LandContour,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Foundation,Heating,HeatingQC,CentralAir,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,MiscVal,MoSold,YrSold,SaleCondition
0,208500,60,8450,Pave,Reg,Lvl,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,PConc,GasA,Ex,Y,856,854,0,1710,2,1,3,1,8,0,Y,0,61,0,0,0,0,2,2008,Normal
1,181500,20,9600,Pave,Reg,Lvl,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,CBlock,GasA,Ex,Y,1262,0,0,1262,2,0,3,1,6,1,Y,298,0,0,0,0,0,5,2007,Normal
2,223500,60,11250,Pave,IR1,Lvl,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,PConc,GasA,Ex,Y,920,866,0,1786,2,1,3,1,6,1,Y,0,42,0,0,0,0,9,2008,Normal
3,140000,70,9550,Pave,IR1,Lvl,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,BrkTil,GasA,Gd,Y,961,756,0,1717,1,0,3,1,7,1,Y,0,35,272,0,0,0,2,2006,Abnorml
4,250000,60,14260,Pave,IR1,Lvl,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,PConc,GasA,Ex,Y,1145,1053,0,2198,2,1,4,1,9,1,Y,192,84,0,0,0,0,12,2008,Normal


We still do not know exactly what units some of these numerical features are in. Additionally, some categorical features use odd abbreviations whose meaning we do not yet know. We will now define every feature, so that there is no confusion about what each one represents.

**SalePrice:** the property's sale price in dollars. This is the target variable that we're trying to predict.

**MSSubClass:** The building class

**LotArea:** Lot size in square feet

**Street:** Type of road access

**LotShape:** General shape of property

**LandContour:** Flatness of the property

**LotConfig:** Lot configuration

**LandSlope:** Slope of property

**Neighborhood:** Physical locations within Ames city limits

**Condition1:** Proximity to main road or railroad

**Condition2:** Proximity to main road or railroad (if a second is present)

**BldgType:** Type of dwelling

**HouseStyle:** Style of dwelling

**OverallQual:** Overall material and finish quality

**OverallCond:** Overall condition rating

**YearBuilt:** Original construction date

**YearRemodAdd:** Remodel date

**RoofStyle:** Type of roof

**RoofMatl:** Roof material

**Foundation:** Type of foundation

**Heating:** Type of heating

**HeatingQC:** Heating quality and condition

**CentralAir:** Central air conditioning

**1stFlrSF:** First Floor square feet

**2ndFlrSF:** Second floor square feet

**LowQualFinSF:** Low quality finished square feet (all floors)

**GrLivArea:** Above grade (ground) living area square feet

**FullBath:** Full bathrooms above grade

**HalfBath:** Half baths above grade

**BedroomAbvGr:** Number of bedrooms above basement level

**KitchenAbvGr:** Number of kitchens

**TotRmsAbvGrd:** Total rooms above grade (does not include bathrooms)

**Fireplaces:** Number of fireplaces

**PavedDrive:** Paved driveway

**WoodDeckSF:** Wood deck area in square feet

**OpenPorchSF:** Open porch area in square feet

**EnclosedPorch:** Enclosed porch area in square feet

**3SsnPorch:** Three season porch area in square feet

**ScreenPorch:** Screen porch area in square feet

**MiscVal:** Value of miscellaneous feature

**MoSold:** Month Sold

**YrSold:** Year Sold

**SaleCondition:** Condition of sale
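The definitions above say what each feature measures, but not which abbreviations a categorical feature actually takes. Listing the distinct values per column can help with that. Here is a small sketch on a toy frame using values from the five rows shown earlier. The names `toy` and `levels` are our own.

```python
import pandas as pd

# Values copied from the five rows displayed above
toy = pd.DataFrame({
    "LotShape": ["Reg", "Reg", "IR1", "IR1", "IR1"],
    "LotConfig": ["Inside", "FR2", "Inside", "Corner", "FR2"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Collect the distinct labels of every non-numeric column
levels = {
    col: sorted(toy[col].unique())
    for col in toy.select_dtypes(exclude="number").columns
}

for col, values in levels.items():
    print(col, values)
```

Applied to `train_data`, this would print every abbreviation that needs to be looked up in the dataset's documentation.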