# Housing Affordability Analysis in Massachusetts

This project investigates the affordability of different types of housing in Massachusetts, focusing on buildings with 3-4 apartments. We aim to predict three cost measures for this housing type: 'Monthly Owner Costs With Mortgage', 'Monthly Owner Costs Without Mortgage', and 'Monthly Renter Costs'. Each measure is treated as the percentage of income spent on housing costs.

Understanding these costs can provide insights into housing affordability and help guide decisions around housing policy and development.

## Data Preprocessing

The data we're using contains information about different types of housing across various locations in Massachusetts. The features include housing characteristics and location, while the targets are different measures of housing costs.

Before we can use this data for modeling, we need to preprocess it:

1. **Create Dummy Variables**: The 'Housing Type' and 'MA' (location) columns are categorical. Most machine learning algorithms require numerical input, so we convert these columns into dummy (indicator) variables: each category becomes its own binary column in the dataframe. Both columns have 52 distinct values in this dataset, so each expands into 52 indicator columns.
2. **Scale Numerical Features**: The numerical features in the dataset have very different magnitudes, which can cause problems for some machine learning algorithms. We therefore standardize them to a mean of 0 and a standard deviation of 1. Note that the code below casts every column to float and standardizes all of them, including the dummy variables and the target columns.
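
As a minimal sketch of these two steps on a made-up frame (the column names `region` and `units` and all values are hypothetical, not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; not the project's dataset.
toy = pd.DataFrame({
    "region": ["A", "B", "A", "C"],
    "units": [10.0, 20.0, 30.0, 40.0],
})

# Step 1: one binary indicator column per category.
toy_dummies = pd.get_dummies(toy, columns=["region"], dtype=float)

# Step 2: standardize the numeric column to mean 0 and
# (population) standard deviation 1.
scaler = StandardScaler()
toy_dummies[["units"]] = scaler.fit_transform(toy_dummies[["units"]])

print(toy_dummies)
```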

## Train-Test Split

After preprocessing the data, we split it into a training set and a testing set. The training set is used to train our machine learning models, while the testing set is used to evaluate the performance of these models on new, unseen data.
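
A minimal sketch of an 80/20 split on made-up data (the arrays are hypothetical; `test_size=0.2` and `random_state=42` match the values used later in this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 3 features, 1 target.
X = np.arange(150, dtype=float).reshape(50, 3)
y = np.arange(50, dtype=float)

# Hold out 20% of the rows for testing; fixing the seed makes the
# split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```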

We're creating three separate train-test splits, one for each target variable. This will allow us to train separate models for 'Monthly Owner Costs With Mortgage', 'Monthly Owner Costs Without Mortgage', and 'Monthly Renter Costs'. Each model will use all other columns in the dataset (except for the other target variables) as features.

In the next steps of this project, we'll train models on these datasets and evaluate their performance. This will help us understand how different housing characteristics and locations impact housing costs, and ultimately, housing affordability in Massachusetts.

In [1]:
import pandas as pd

df = pd.read_csv('updated_dataframe.csv')

df.head()

Unnamed: 0,Housing Type,MA,Mobile Home or Trailer,Mobile Home or Trailer.1,Mobile Home or Trailer.2,Mobile Home or Trailer.3,Mobile Home or Trailer.4,Mobile Home or Trailer.5,Mobile Home or Trailer.6,Mobile Home or Trailer.7,...,50 or More Apartments.1,50 or More Apartments.2,50 or More Apartments.3,50 or More Apartments.4,50 or More Apartments.5,50 or More Apartments.6,50 or More Apartments.7,50 or More Apartments.8,50 or More Apartments.9,50 or More Apartments.10
0,Data Type,Selected Geographies,Owned,Rented,Owned %,Rented %,Total,Monthly Owner Costs With Mortgage,Monthly Owner Costs Without Mortgage,Monthly Renter Costs,...,Rented,Owned %,Rented %,Total,Monthly Owner Costs With Mortgage,Monthly Owner Costs Without Mortgage,Monthly Renter Costs,Owned %,Rented %,Total
1,0,"Berkshire County--Pittsfield City PUMA, Massac...",545,52,91.3,8.7,597,43.0,39.0,101.0,...,1020,3.4,96.6,1056,10.0,,45.0,3.4,96.6,1056
2,1,"Franklin & Hampshire (North) Counties PUMA, Ma...",759,265,74.1,25.9,1024,37.0,31.0,20.0,...,1114,4.7,95.3,1169,,17.0,41.0,4.7,95.3,1169
3,2,Worcester County (Central)--Worcester City PUM...,0,113,0.0,100.0,113,,,,...,6714,5.4,94.6,7101,34.0,,42.0,5.4,94.6,7101
4,3,"Worcester County (Northeast)--Leominster, Fitc...",304,0,100.0,0.0,304,32.0,55.0,,...,3114,8.4,91.6,3401,,14.0,30.0,8.4,91.6,3401


In [2]:
# Drop the first row: it holds a second header level
# ('Owned', 'Rented', 'Owned %', ...) rather than data
df = df.drop(0)

# Check for missing values
missing_values = df.isnull().sum()

# Display the number of missing values by column
missing_values

Housing Type                 0
MA                           0
Mobile Home or Trailer       0
Mobile Home or Trailer.1     0
Mobile Home or Trailer.2    15
                            ..
50 or More Apartments.6     19
50 or More Apartments.7      3
50 or More Apartments.8      0
50 or More Apartments.9      0
50 or More Apartments.10     0
Length: 77, dtype: int64

In [3]:
# Fill missing values in the percentage columns (suffixes .2, .3, .8, .9) with the mean of the column
for column in df.columns:
    if column.endswith((".2", ".3", ".8", ".9")):
        df[column] = df[column].astype(float)  # Ensure the column data is float
        # Assign the result back; an in-place fillna on a selected column may not update df
        df[column] = df[column].fillna(df[column].mean())

# Fill missing values in the cost columns (suffixes .5, .6, .7) with the median of the column
for column in df.columns:
    if column.endswith((".5", ".6", ".7")):
        df[column] = df[column].astype(float)  # Ensure the column data is float
        df[column] = df[column].fillna(df[column].median())

# Check for missing values again
missing_values = df.isnull().sum()
missing_values

Housing Type                0
MA                          0
Mobile Home or Trailer      0
Mobile Home or Trailer.1    0
Mobile Home or Trailer.2    0
                           ..
50 or More Apartments.6     0
50 or More Apartments.7     0
50 or More Apartments.8     0
50 or More Apartments.9     0
50 or More Apartments.10    0
Length: 77, dtype: int64
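
The same fill strategy can be tried on a made-up frame, as a rough sketch. The column names and values are hypothetical and only mirror the dataset's suffix convention. The cost column contains one extreme value to show how the median differs from the mean; whether robustness to outliers motivated the choice of median above is an assumption.

```python
import pandas as pd

# Hypothetical toy frame: ".2" marks a percentage column, ".5" a cost column.
toy = pd.DataFrame({
    "Type A.2": [10.0, 20.0, None, 60.0],
    "Type A.5": [20.0, 25.0, None, 300.0],
})

pct_cols = [c for c in toy.columns if c.endswith((".2", ".3", ".8", ".9"))]
cost_cols = [c for c in toy.columns if c.endswith((".5", ".6", ".7"))]

# fillna with a Series fills each column using the value under its own label.
toy[pct_cols] = toy[pct_cols].fillna(toy[pct_cols].mean())
toy[cost_cols] = toy[cost_cols].fillna(toy[cost_cols].median())

print(toy)
```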

In [4]:
# Check the number of unique categories in 'Housing Type' and 'MA'
num_unique_housing_types = df['Housing Type'].nunique()
num_unique_MA = df['MA'].nunique()

num_unique_housing_types, num_unique_MA

(52, 52)

In [5]:
# Create dummy variables for 'Housing Type' and 'MA'
df_dummies = pd.get_dummies(df, columns=['Housing Type', 'MA'])

# Display the first few rows of the new dataframe
df_dummies.head()

Unnamed: 0,Mobile Home or Trailer,Mobile Home or Trailer.1,Mobile Home or Trailer.2,Mobile Home or Trailer.3,Mobile Home or Trailer.4,Mobile Home or Trailer.5,Mobile Home or Trailer.6,Mobile Home or Trailer.7,One-family house detached,One-family house detached.1,...,"MA_Plymouth County (East)--Plymouth, Marshfield, Scituate, Duxbury & Kingston Towns PUMA; Massachusetts","MA_Suffolk County (North)--Revere, Chelsea & Winthrop Town Cities PUMA; Massachusetts","MA_Weymouth Town, Braintree Town Cities, Hingham, Hull & Cohasset Towns PUMA; Massachusetts","MA_Woburn, Melrose Cities, Saugus, Wakefield & Stoneham Towns PUMA; Massachusetts","MA_Worcester & Middlesex Counties (Outside Leominster, Fitchburg & Gardner Cities) PUMA; Massachusetts","MA_Worcester County (Central)--Worcester City PUMA, Massachusetts","MA_Worcester County (East Central) PUMA, Massachusetts","MA_Worcester County (Northeast)--Leominster, Fitchburg & Gardner Cities PUMA; Massachusetts","MA_Worcester County (South) PUMA, Massachusetts","MA_Worcester County (West Central) PUMA, Massachusetts"
1,545,52,91.3,8.7,597,43.0,39.0,101.0,35087,2086,...,0,0,0,0,0,0,0,0,0,0
2,759,265,74.1,25.9,1024,37.0,31.0,20.0,25815,4030,...,0,0,0,0,0,0,0,0,0,0
3,0,113,0.0,100.0,113,36.5,22.0,35.0,22183,2984,...,0,0,0,0,0,1,0,0,0,0
4,304,0,100.0,0.0,304,32.0,55.0,35.0,27257,1887,...,0,0,0,0,0,0,0,1,0,0
5,553,0,100.0,0.0,553,51.0,21.0,35.0,34045,2161,...,0,0,0,0,0,0,0,0,0,1


In [6]:
from sklearn.preprocessing import StandardScaler

# Convert all columns to float
df_dummies = df_dummies.astype(float)

# Create a scaler object
scaler = StandardScaler()

# Fit and transform the data
df_scaled = pd.DataFrame(scaler.fit_transform(df_dummies), columns=df_dummies.columns)

# Display the first few rows of the scaled dataframe
df_scaled.head()

Unnamed: 0,Mobile Home or Trailer,Mobile Home or Trailer.1,Mobile Home or Trailer.2,Mobile Home or Trailer.3,Mobile Home or Trailer.4,Mobile Home or Trailer.5,Mobile Home or Trailer.6,Mobile Home or Trailer.7,One-family house detached,One-family house detached.1,...,"MA_Plymouth County (East)--Plymouth, Marshfield, Scituate, Duxbury & Kingston Towns PUMA; Massachusetts","MA_Suffolk County (North)--Revere, Chelsea & Winthrop Town Cities PUMA; Massachusetts","MA_Weymouth Town, Braintree Town Cities, Hingham, Hull & Cohasset Towns PUMA; Massachusetts","MA_Woburn, Melrose Cities, Saugus, Wakefield & Stoneham Towns PUMA; Massachusetts","MA_Worcester & Middlesex Counties (Outside Leominster, Fitchburg & Gardner Cities) PUMA; Massachusetts","MA_Worcester County (Central)--Worcester City PUMA, Massachusetts","MA_Worcester County (East Central) PUMA, Massachusetts","MA_Worcester County (Northeast)--Leominster, Fitchburg & Gardner Cities PUMA; Massachusetts","MA_Worcester County (South) PUMA, Massachusetts","MA_Worcester County (West Central) PUMA, Massachusetts"
0,0.584913,-0.203924,0.55795,-0.55795,0.456138,0.146246,1.372609,2.970455,0.847298,0.221807,...,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028
1,1.097951,1.425702,-0.004151,0.004151,1.35235,-0.148133,0.5861,-0.949206,0.008192,2.271914,...,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028
2,-0.721656,0.262776,-2.42576,2.42576,-0.559708,-0.172664,-0.298722,-0.223343,-0.3205,1.168822,...,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,7.141428,-0.140028,-0.140028,-0.140028,-0.140028
3,0.007146,-0.601767,0.842268,-0.842268,-0.158827,-0.393448,2.945627,-0.223343,0.138691,0.011945,...,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,7.141428,-0.140028,-0.140028
4,0.604092,-0.601767,0.842268,-0.842268,0.363788,0.53875,-0.397036,-0.223343,0.752998,0.300901,...,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,-0.140028,7.141428
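
The two values that repeat in the dummy columns above can be checked by hand. Assuming the frame has 52 rows, so that each location indicator is 1 in exactly one row (`In [4]` reports 52 unique values), and given that `StandardScaler` divides by the population standard deviation, a 0 maps to -1/sqrt(51) and a 1 maps to sqrt(51). A small standard-library sketch of that arithmetic:

```python
import math

n_rows = 52  # assumption: one row per location, so each indicator has a single 1

# Population mean and standard deviation of a column with one 1
# and n_rows - 1 zeros.
mean = 1 / n_rows
std = math.sqrt(mean * (1 - mean))

z_zero = (0 - mean) / std  # scaled value where the indicator is 0
z_one = (1 - mean) / std   # scaled value where the indicator is 1

print(round(z_zero, 6), round(z_one, 6))  # -0.140028 7.141428
```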


In [7]:
# Save the preprocessed dataframe as a CSV file
df_scaled.to_csv('preprocessed_dataframe.csv', index=False)

In [9]:
from sklearn.model_selection import train_test_split

# Define the target variables for '3-4 Apartments':
# .5 = Monthly Owner Costs With Mortgage, .6 = Monthly Owner Costs Without Mortgage,
# .7 = Monthly Renter Costs
targets = ['3-4 Apartments.5', '3-4 Apartments.6', '3-4 Apartments.7']

# The features are the same for every target: all columns except the three targets
features = df_scaled.drop(columns=targets)

# For each target variable, create a separate train-test split and keep it,
# so that later iterations do not overwrite earlier splits
splits = {}
for target in targets:
    target_data = df_scaled[target]  # select the current target

    # Split the data (80% training, 20% testing)
    features_train, features_test, target_train, target_test = train_test_split(
        features, target_data, test_size=0.2, random_state=42
    )
    splits[target] = (features_train, features_test, target_train, target_test)

    # At this point, you could train a model using features_train and target_train
    # and validate it using features_test and target_test
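
As one possible next step, a model could be fit on each split and scored on the held-out rows. The sketch below uses synthetic data and scikit-learn's `LinearRegression` purely for illustration; the model choice, the data, and the metrics are assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 3 features and a noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

features_train, features_test, target_train, target_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then score on the unseen test rows.
model = LinearRegression().fit(features_train, target_train)
predictions = model.predict(features_test)

mse = mean_squared_error(target_test, predictions)
r2 = r2_score(target_test, predictions)
print(f"MSE: {mse:.4f}, R^2: {r2:.4f}")
```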