The purpose of this project is to use the Ames Housing Dataset to perform predictive analysis on the sale price of homes. The dataset can be found at https://www.kaggle.com/datasets/prevek18/ames-housing-dataset. It describes homes sold in Ames, Iowa between 2006 and 2010 through attributes such as square footage and an overall quality rating, among many others. It also contains the final price each home sold for, which makes price prediction possible.

To begin, we need to know the scope of the data we are working with. Luckily, a column definitions document can be found at https://jse.amstat.org/v19n3/decock/DataDocumentation.txt. According to this documentation, the dataset has 2930 observations and 82 columns. With that many columns relative to the number of observations, reducing dimensionality before modeling lowers the risk of over-fitting. There are a number of routes that could be taken to reduce the number of features, but it is often best to start with the simplest. We will naively select columns based on their correlation with the dependent variable, the sale price.

In [58]:
import pandas as pd
from sklearn import preprocessing

import plotly.express as px

import seaborn as sn
import matplotlib.pyplot as plt

from math import ceil

In [49]:
df = pd.read_csv("AmesHousing.csv")
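
Once the file is loaded, the figures quoted from the documentation can be checked against the data itself. The following is a minimal sketch, not part of the original notebook: `describe_scope` is a hypothetical helper, and the toy frame uses made-up values with column names assumed to match the CSV. In the notebook, the helper would be called on `df` instead.

```python
import pandas as pd


def describe_scope(frame: pd.DataFrame) -> dict:
    """Summarize the size and column types of a DataFrame (hypothetical helper)."""
    numeric_cols = frame.select_dtypes(include="number").columns
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "numeric_columns": len(numeric_cols),
        "non_numeric_columns": frame.shape[1] - len(numeric_cols),
        # Number of columns that contain at least one missing value
        "columns_with_missing": int(frame.isna().any().sum()),
    }


# Toy stand-in with made-up values; in the notebook, use describe_scope(df).
toy = pd.DataFrame({
    "Gr Liv Area": [1656, 896, 1329, 2110],
    "Overall Qual": [6, 5, 6, 7],
    "Kitchen Qual": ["TA", "TA", "Gd", "Ex"],
    "Alley": [None, "Pave", None, None],
    "SalePrice": [215000, 105000, 172000, 244000],
})
scope = describe_scope(toy)
print(scope)
```

If the CSV matches the documentation, the `rows` and `columns` entries for `df` should come out as 2930 and 82.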

In [63]:
# Label-encodes every column so correlation can be computed.
# Note: numeric columns are encoded too, so their values become integer ranks.
binned_df = df.apply(preprocessing.LabelEncoder().fit_transform)

# Creates a correlation matrix
top_columns = binned_df.corr('pearson').abs()

# Sorts the correlations based on their correlation to the sales price
top_columns = top_columns.sort_values('SalePrice', ascending=False)

# Keeps the top 20% of rows: ceil(82 * 0.2) = 17, i.e. SalePrice itself plus the
# 16 most correlated columns, then stores the names of those columns as a list
top_columns = top_columns.head(ceil(len(df.columns) * .2))
top_columns = top_columns.index.tolist()

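# Creates the heatmap of pairwise correlations among the selected columns.
# Uses the encoded frame, since the raw df still contains non-numeric columns.
fig = px.imshow(binned_df[top_columns].corr('pearson'), text_auto=True, aspect="auto")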
fig.show()
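
The cell above label-encodes every column and keeps `SalePrice` in the selected list. As a point of comparison, here is a hedged sketch of one possible variant: it encodes only the non-numeric columns, drops the target from the ranking, and returns predictor names only. The function name `top_correlated_features`, its `fraction` parameter, and the toy values are assumptions made for illustration, not part of the original analysis.

```python
from math import ceil

import pandas as pd


def top_correlated_features(frame, target, fraction=0.2):
    """Rank predictors by absolute Pearson correlation with `target`."""
    encoded = frame.copy()
    # Encode only non-numeric columns as integer category codes
    # (missing values become -1); numeric columns keep their values.
    for col in encoded.select_dtypes(exclude="number").columns:
        encoded[col] = encoded[col].astype("category").cat.codes
    # Absolute correlation of each predictor with the target, target removed
    ranking = (
        encoded.corr("pearson")[target]
        .drop(target)
        .abs()
        .sort_values(ascending=False)
    )
    n_keep = ceil(len(ranking) * fraction)
    return ranking.head(n_keep).index.tolist()


# Toy data with made-up values; column names assumed to match the CSV.
toy = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 2500, 3000],
    "Kitchen Qual": ["Fa", "TA", "TA", "Gd", "Ex"],
    "Yr Sold": [2008, 2006, 2010, 2007, 2009],
    "SalePrice": [100000, 150000, 200000, 250000, 300000],
})
top_one = top_correlated_features(toy, "SalePrice")
top_two = top_correlated_features(toy, "SalePrice", fraction=0.5)
print(top_one, top_two)
```

Integer codes for nominal categories impose an arbitrary order, so correlations computed on them are only a rough screening signal in either version.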