# Ames Housing - Feature Importance
During Covid, many people left large cities, and many of them found new homes in the suburbs. For those seeking to [find new homes outside of metropolitan cities](https://www.jchs.harvard.edu/blog/are-millennials-leaving-cities-yes-young-adults-are-not), it is therefore helpful to know which house features are most in demand and, ultimately, how those features inform house prices. The housing data analyzed here comes from Ames, Iowa.


## 1. Data
As per the Kaggle description, "Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence... With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa."

- [Kaggle Dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)
 

## 2. Method
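Random Forest Regression was used for this case. It runs efficiently on large datasets and often achieves strong accuracy on tabular data, although no algorithm is the most accurate in every setting. Random forests are also often described as having an effective method for estimating missing data and as maintaining accuracy when a large proportion of the data are missing, though in practice this depends on the implementation used.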
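
As a minimal sketch of the method, assuming scikit-learn is available, the snippet below fits a `RandomForestRegressor` and scores it on a held-out split. The data is synthetic (it is not the Ames dataset), and the values of `n_estimators`, `test_size`, and `random_state` are illustrative choices rather than the settings used in this analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (NOT the Ames dataset):
# two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# n_estimators and random_state are illustrative, not tuned values.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Coefficient of determination (R^2) on the held-out split
r2 = model.score(X_test, y_test)
print(f"held-out R^2: {r2:.3f}")
```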


## 3. Data Cleaning
The columns whose values cluster at the low end are:
- 3SsnPorch
- BsmtFinSF2
- MiscVal
- PoolArea
- EnclosedPorch
- ScreenPorch (all but one value is 0, so it has very little variance)
- LotArea (still mostly 0)
- LowQualFinSF (for the same reason)
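
One way to spot columns like those listed above is to measure the share of zero values per column. The sketch below assumes pandas; the frame is a tiny made-up example that borrows a few column names from the dataset, and the 0.75 cut-off is an arbitrary illustrative value.

```python
import pandas as pd

# Tiny hypothetical frame: column names follow the dataset, values are made up.
df = pd.DataFrame({
    "PoolArea":    [0, 0, 0, 0, 512],
    "ScreenPorch": [0, 0, 0, 120, 0],
    "GrLivArea":   [1710, 1262, 1786, 1717, 2198],
})

# Fraction of rows equal to zero, per column
zero_share = (df == 0).mean()

# 0.75 is an arbitrary illustrative cut-off, not the one used in the analysis
mostly_zero = zero_share[zero_share >= 0.75].index.tolist()
print(zero_share)
print(mostly_zero)
```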

The miscellaneous columns (*MiscFeature*, *MiscVal*) do not appear to add any important house-related information, and they contain NaN values. The *MiscFeature* column alone had upwards of 90% of its data missing!
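
A minimal sketch of this check, assuming pandas: compute the share of missing values per column, then drop the sparse columns. The frame below is made up for illustration, and the 0.5 cut-off is an assumption rather than the rule used in the analysis; *MiscVal* is dropped by name, mirroring the decision described above.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the dataset.
df = pd.DataFrame({
    "MiscFeature": [np.nan, np.nan, np.nan, "Shed", np.nan,
                    np.nan, np.nan, np.nan, np.nan, np.nan],
    "MiscVal":     [0, 0, 0, 700, 0, 0, 0, 0, 0, 0],
    "OverallQual": [7, 6, 7, 7, 8, 5, 8, 7, 7, 5],
})

# Share of missing values per column
missing_share = df.isna().mean()

# Columns missing in more than half of the rows (illustrative threshold)
to_drop = missing_share[missing_share > 0.5].index.tolist()

# Drop the sparse columns plus MiscVal, which is removed by name
cleaned = df.drop(columns=to_drop + ["MiscVal"])
print(missing_share)
print(list(cleaned.columns))
```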



## 4. EDA

![image.png](attachment:image.png)

A heatmap makes it possible to see which housing attributes have a strong, positive linear relationship. Some of the attribute pairs with strong linear relationships are: _1. Sale Price & Overall Quality, 2. Gr Liv Area & Sale Price, 3. Garage Area & Garage Cars, 4. Garage Year Built & Year Built, and 5. Total Rooms Above Ground & Gr Liv Area._ There is an inverse relationship between Year Built and Overall Condition.
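
The numbers behind such a heatmap are a pairwise correlation matrix. As a sketch, assuming pandas and NumPy, the snippet below builds one with `DataFrame.corr()` (Pearson by default) on synthetic data; the column names mirror the dataset, but the values and coefficients are invented for illustration. The resulting matrix could then be passed to a plotting function such as `seaborn.heatmap`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not the Ames dataset.
rng = np.random.default_rng(0)
n = 200
qual = rng.integers(1, 11, size=n)         # quality score from 1 to 10
area = rng.normal(1500, 400, size=n)       # living area in square feet
price = 20000 * qual + 50 * area + rng.normal(0, 10000, size=n)

df = pd.DataFrame({"OverallQual": qual, "GrLivArea": area, "SalePrice": price})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()

# Correlation of each feature with the target, strongest first
price_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(price_corr)
```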

![image.png](attachment:image.png)

The scatterplots show what the high correlations were picking up on. There is a strong positive correlation with Overall Quality. Year Built seems useful. Year Remodeled Add, Ground Living Area, and Basement Sq. Ft. appear quite similar and also useful. The correlation with quality makes sense. The relationship with Overall Condition, while trending positively, is not perfectly linear, but it passes an eye test. What we can infer from this is that the age of the building plays a large part in determining price; newer buildings may be more modern with more amenities, or simply be more secure. It is also no revelation that the area of the house's ground floor, basement, or garage helps determine the price, with a positive trend on that front as well.

## 5. Which Model to Choose?
Random Forest Regression performed better than linear regression here. It still makes sense to compare both for each business case rather than assuming that will always hold.
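
A comparison like this can be sketched with cross-validation, assuming scikit-learn. The target below is synthetic and deliberately nonlinear, so the gap it produces says nothing about the Ames results; the fold count, scoring metric, and model settings are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, deliberately nonlinear target (NOT the Ames dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 3))
y = X[:, 0] ** 2 + np.sin(2 * X[:, 1]) + rng.normal(scale=0.1, size=400)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean R^2 across 5 folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Scoring both models on the same folds keeps the comparison like for like, which is the point of checking each business case separately.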

## 6. Outcome
Encouragingly, the top five features are shared with the linear model:
- Overall quality
- Above grade (ground) living area square feet
- Total square feet of basement area
- Type 1 finished square feet
- Full bathrooms above grade

![image-2.png](attachment:image-2.png)

These features were identified by dropping columns with low feature importance. Features with an importance at or below 2% (itself an extremely small threshold) were removed; this covered the majority of the features, as they all fell well below that threshold.
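
The thresholding step can be sketched as follows, assuming scikit-learn's impurity-based `feature_importances_` is the importance measure (the source does not say which measure was used). The data is synthetic, and `noise_a` and `noise_b` are hypothetical columns added only to show uninformative features falling below the 2% cut-off.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only; not the Ames dataset.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "GrLivArea": rng.normal(1500, 400, size=n),
    "noise_a": rng.normal(size=n),   # hypothetical uninformative column
    "noise_b": rng.normal(size=n),   # hypothetical uninformative column
})
y = 20000 * X["OverallQual"] + 50 * X["GrLivArea"] + rng.normal(0, 3000, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Keep only features whose importance is above 2%
kept = importances[importances > 0.02].index.tolist()
print(importances)
print(kept)
```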

![image.png](attachment:image.png)

## 7. Future Improvements
In the future, I would love to spend more time developing the model further, with the goal of predicting the sale price rather than simply deriving feature importance.