Predict home prices in Ames, Iowa.
A comparison study that aims to show which model works best for this geography, these demographics, and this home type.
Success is evaluated by comparing RMSE for train and test data as well as MSE and
Project 2 begins with the main file, in which I model the data provided in the dataset folder. I used 2051 training rows (samples) to fit the model and 879 test rows to make predictions. EDA, cleaning, general housekeeping, and feature engineering made up the first, labor-intensive part. The second part, the modeling, was much easier but took several iterations before homing in on the right hyperparameters. The model ended up overfit, which is generally speaking a good problem to have. I reduced some of the overfitting by adding a lasso model to the mix.
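To illustrate why a lasso model can ease overfitting, here is a minimal sketch on synthetic data, not the Ames data. The sample sizes, the alpha value, and the coefficients are all assumptions chosen for the demo. It compares train and test RMSE for plain linear regression against a lasso fit when most features are pure noise.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption): 40 features, only 3 of which matter.
rng = np.random.default_rng(0)
n_train, n_test, n_features = 60, 200, 40
X = rng.normal(size=(n_train + n_test, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(size=n_train + n_test)

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]


def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


# Unregularized fit: free to chase noise in the 37 irrelevant features.
ols = LinearRegression().fit(X_train, y_train)
# L1-penalized fit: alpha=0.1 is an illustrative value, not a tuned one.
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

for name, model in [("ols", ols), ("lasso", lasso)]:
    print(name, "train:", rmse(model, X_train, y_train),
          "test:", rmse(model, X_test, y_test))

# The L1 penalty drives some coefficients to exactly zero.
n_kept = int(np.sum(lasso.coef_ != 0))
print("features kept by lasso:", n_kept, "of", n_features)
```

A large gap between train and test RMSE is the usual sign of overfitting. The lasso penalty narrows that gap by dropping features that do not earn their keep.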
Twenty-six features had missing values, ranging from a single missing entry to nearly all entries missing, so the challenge was learning how to impute, or in other words estimate, those missing values. For some features it was simple: NA likely meant the feature was not present, as with pool size. For those features it was clear what to do. It was also clear that something had to be filled in for continuous (floating-point) values that were likely left out through carelessness or for other reasons. I handled these by taking the mean of the column and using that same value for every missing entry. This was crude and not ideal. In a perfect world without such a time crunch, I would fit a model on the non-missing data and use it to predict the missing values, so that the imputed values vary from row to row instead of being one constant guess. Now for the hardest part: what to do with missing categorical data, where taking the average would not make sense because the data is, for lack of a better term, quantized. If a quality rating is integer-based and runs from zero to ten in steps of 1, an imputed value of 4.25 would have no real-world meaning. For this category of data I turned the NAs into dummies; in other words, I converted each NA into a 0/1 feature indicating whether NA was there.
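The three imputation strategies described above can be sketched with pandas as follows. The column names and values are hypothetical stand-ins, not the actual dataset columns. Filling the "absent feature" case with 0, and filling the original ordinal column with 0 after flagging it, are both assumptions, since the text does not state the exact fill values.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names (assumption, not the real schema).
df = pd.DataFrame({
    "pool_area": [np.nan, 512.0, np.nan, np.nan],
    "lot_frontage": [60.0, np.nan, 80.0, 70.0],
    "garage_qual": [3.0, np.nan, 4.0, np.nan],
})

# 1) NA most likely means the feature is absent: fill with 0 (assumed value).
df["pool_area"] = df["pool_area"].fillna(0)

# 2) Continuous value that was simply not recorded: fill with the column mean.
df["lot_frontage"] = df["lot_frontage"].fillna(df["lot_frontage"].mean())

# 3) Ordinal/categorical value: averaging is meaningless, so record the
#    missingness itself as a 0/1 dummy column.
df["garage_qual_na"] = df["garage_qual"].isna().astype(int)
# Sentinel fill for the original column (assumption) so no NaN reaches the model.
df["garage_qual"] = df["garage_qual"].fillna(0)

print(df)
```

The dummy column lets the model learn a separate effect for "value was missing" without inventing an in-between rating such as 4.25.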
I created functions to test for the best hyperparameter and estimator pair. The test RMSE is 22000 while the train RMSE is 10000.
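A search over estimator and hyperparameter pairs might look like the sketch below. This is not the project's actual code: the helper name best_estimator_pair, the candidate estimators, the alpha grids, and the synthetic data are all assumptions made for illustration. It relies on scikit-learn's GridSearchCV with the built-in "neg_root_mean_squared_error" scorer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split


def best_estimator_pair(candidates, X_train, y_train, cv=5):
    """Grid-search every (estimator, param_grid) pair and return the
    fitted search object with the best cross-validated RMSE."""
    best = None
    for estimator, grid in candidates:
        search = GridSearchCV(
            estimator, grid, scoring="neg_root_mean_squared_error", cv=cv
        )
        search.fit(X_train, y_train)
        # Scores are negated RMSE, so a higher score means a lower error.
        if best is None or search.best_score_ > best.best_score_:
            best = search
    return best


def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


# Synthetic stand-in for the housing data (assumption).
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical candidate pairs; the grids are illustrative only.
candidates = [
    (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    (Lasso(max_iter=10_000), {"alpha": [0.1, 1.0, 10.0]}),
]

best = best_estimator_pair(candidates, X_train, y_train)
train_rmse = rmse(best, X_train, y_train)
test_rmse = rmse(best, X_test, y_test)
print(type(best.best_estimator_).__name__, best.best_params_)
print("train RMSE:", train_rmse, "test RMSE:", test_rmse)
```

Reporting train and test RMSE side by side, as in the last line, is what exposes a gap like the one quoted above.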
With an