**By:** Alfredo Martinez

**Date:** 2018

---

### Analyzing Weather, Real Estate Factors, Trends and Values through Random Forest



Real estate is one of the oldest types of investment in the world, and to this day land and housing rank among investors' favorite asset classes. Understanding this market and the factors that affect real estate values can be useful to investors and the general public alike. In this project, I try to uncover insights such as how the weather affects home sales and values and which factors have the most influence on home prices. I also build and tune a random forest regression model to serve as a future reference for estimating the price of homes based on previous sales.

---
### Client

Our client is the real estate brokerage and investment company Keller Williams Realty, Inc. The value of this analysis is a data-driven summary of key market trends together with a predictive model for home prices, so that real estate practitioners have a ready-made report at hand and can compete better when negotiating sales and purchases of investment properties.

---

### Data

The following data was used for this research project:

[Housing](https://github.com/AlfredMtz/price-prediction/tree/master/C.-Datasets)

This is “The Ames Housing Dataset”, initially compiled by Dean De Cock. The data is publicly available and contains the main features of each home, along with the year and month of sale and the sale price, for homes sold in Ames, Iowa from 2006 through 2009 and part of 2010.

[Weather](https://github.com/AlfredMtz/price-prediction/tree/master/C.-Datasets)

This consists of daily, monthly, and yearly records from different weather stations in the city of Ames, Iowa, obtained from the National Centers for Environmental Information website.

---
### Early analysis of available data

Understanding, cleaning, and preprocessing the data are early steps in the analysis and model creation. They help with finding erroneous data and potential outliers, and with giving the data a proper structure for extracting insights and building models. To start working with the available data, the needed packages and modules are loaded and both data files are read into data frames. Then an early exploratory analysis is performed to search for errors or abnormalities in the data.


The early exploratory analysis of this data can be found here: [early-data-analysis](https://github.com/AlfredMtz/price-prediction/blob/master/B.-Data_Analysis/EDA-for-bad-data.ipynb). It shows a few of the findings visually, and the same approach can be applied to the other features. After going through these visuals and summaries, nothing stood out besides a couple of features with missing values, which are handled in later steps.

It is also worth mentioning that the model built here is a random forest regression model, which is not very sensitive to outliers: extreme values are averaged locally within a leaf and tend to end up isolated in their own group. Even so, the early exploratory analysis provides extra assurance against erroneous values and other abnormalities in the data.

---

### Data Cleaning, Restructuring and Preprocessing

[Housing Data:](https://github.com/AlfredMtz/price-prediction/blob/master/Code.ipynb)

After this early analysis of the available data, I moved on to restructure and clean the housing data. First, the data is filtered to sales between 2006 and 2009, since this was the best period available for merging with the weather data into a single table in later steps. Next, I researched what percentage of missing values should be allowed before a feature is deleted. There does not seem to be a standard rule; the decision is made case by case, depending on industry experience and the type of feature. With that in mind, I applied a deletion threshold to features that had about 70% or more of their values missing and offered little value, such as Alley, Miscellaneous Feature, and Fence. Finally, the data was arranged as a time series, and the sale date columns were converted into a datetime index.
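The housing steps above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the notebook's actual code: the column names (`YrSold`, `MoSold`, `Alley`, `LotArea`, `SalePrice`), the helper `drop_sparse_columns`, and the choice to place every sale on the first day of its month are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing table; column names are assumptions.
housing = pd.DataFrame({
    "YrSold": [2006, 2007, 2008, 2009, 2010],
    "MoSold": [5, 7, 1, 12, 3],
    "Alley": [np.nan, np.nan, np.nan, "Pave", np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})


def drop_sparse_columns(df, threshold=0.70):
    """Hypothetical helper: drop columns whose share of missing values
    is at or above `threshold`."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share < threshold].copy()


# Keep only sales from 2006 through 2009.
housing = housing[housing["YrSold"].between(2006, 2009)]

# Remove features that are mostly empty.
housing = drop_sparse_columns(housing, threshold=0.70)

# Build a datetime index from the year and month of sale (day fixed to 1).
parts = (
    housing[["YrSold", "MoSold"]]
    .rename(columns={"YrSold": "year", "MoSold": "month"})
    .assign(day=1)
)
housing.index = pd.DatetimeIndex(pd.to_datetime(parts), name="Date")

print(housing)
```

In this toy table `Alley` is missing in three of the four remaining rows, so it falls above the threshold and is dropped.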

[Weather Data:](https://github.com/AlfredMtz/price-prediction/blob/master/Code.ipynb)

The weather data consisted of records from different weather stations. All stations were analyzed, and the station with the best available recordings was extracted from the data file.
Also, since the weather data is to be joined to the housing data, the dates were filtered to match accordingly. This data has many features, and only the most suitable variables were extracted: Date, Precipitation, Snow, Maximum Temperature, and Minimum Temperature. After filtering these features and values, the Date column was converted to a datetime type.

After checking the filtered data, several missing values were found to be encoded as -9999, and these were replaced with Not a Number (NaN). The columns were also renamed to make the meaning of each one clearer.
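As a small sketch of this cleanup step, assuming the raw file uses short column codes such as `DATE`, `PRCP`, `SNOW`, `TMAX`, and `TMIN` (the actual names in the weather file and the new names chosen in the notebook may differ):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw weather file; the column codes are assumptions.
weather = pd.DataFrame({
    "DATE": ["20060101", "20060102", "20060103"],
    "PRCP": [0.0, -9999, 0.25],
    "SNOW": [0.0, 1.2, -9999],
    "TMAX": [30, -9999, 41],
    "TMIN": [12, 20, -9999],
})

# Parse the date column into real datetimes.
weather["DATE"] = pd.to_datetime(weather["DATE"], format="%Y%m%d")

# -9999 is a placeholder for "missing"; turn it into NaN.
weather = weather.replace(-9999, np.nan)

# Give the columns more descriptive names (illustrative names only).
weather = weather.rename(columns={
    "DATE": "Date",
    "PRCP": "Precipitation",
    "SNOW": "Snow",
    "TMAX": "MaxTemp",
    "TMIN": "MinTemp",
})

print(weather)
```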

A key decision when working with this data was to take the average weather values for each month and assign them back to every day of that month. The reasoning is that people are unlikely to decide to purchase a home based on a single day's weather, but rather on conditions over a longer period.
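One way to express this "monthly average written back to each day" idea in pandas is a grouped transform. The sketch below uses a made-up four-day series and an assumed column name `MaxTemp`; it is not necessarily how the notebook does it.

```python
import pandas as pd

# Four daily readings spanning the end of January and the start of February.
dates = pd.date_range("2006-01-30", "2006-02-02", freq="D")
weather = pd.DataFrame({"MaxTemp": [30.0, 34.0, 40.0, 44.0]}, index=dates)

# Group the days by calendar month and replace each daily value
# with the mean of its month, keeping the daily index.
month_key = weather.index.to_period("M")
weather_monthly = weather.groupby(month_key).transform("mean")

print(weather_monthly)
```

Here both January days receive the January mean (32.0) and both February days receive the February mean (42.0).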

Finally, the weather data's index was reformatted to match the housing data's index. The two data frames were then joined on this date index, and the 'Id' column was dropped since it added no value to the new data frame.
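A rough sketch of the join, assuming both tables already share a date index and that the weather table has one row per date (the column names and values below are invented for illustration):

```python
import pandas as pd

# Housing rows indexed by sale date; two homes sold in the same month.
housing = pd.DataFrame(
    {"Id": [1, 2, 3], "SalePrice": [208500, 181500, 223500]},
    index=pd.to_datetime(["2006-05-01", "2006-05-01", "2006-07-01"]),
)

# Weather rows indexed by the same kind of date.
weather = pd.DataFrame(
    {"AvgMaxTemp": [71.0, 84.5]},
    index=pd.to_datetime(["2006-05-01", "2006-07-01"]),
)

# Left join keeps every home and attaches the matching weather values,
# then the identifier column is dropped because it carries no signal.
merged = housing.join(weather, how="left").drop(columns="Id")

print(merged)
```

A left join is used here so that no sale is lost if a date has no weather record; that choice is an assumption about the intended behavior.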

### Exploratory Data Analysis

Analysis was performed on these datasets, and some of the findings can be found in the following notebooks:

[Descriptive Analysis](https://github.com/AlfredMtz/price-prediction/blob/master/B.-Data_Analysis/Data-Analysis.ipynb)

[Inferential Analysis](https://github.com/AlfredMtz/price-prediction/blob/master/B.-Data_Analysis/Data-Analysis-Infer.ipynb)

### Introduction to Predictive Price Modeling

A predictive price model can be built with different types of algorithms. Two popular families of models nowadays are linear models and decision-tree-based models. This analysis concentrates on the latter. Decision trees, and more specifically Random Forest models, have received a lot of attention lately and are preferred in many cases because they often achieve better cross-validated accuracy than linear models.

Some of the advantages that Random Forest models offer over linear models are as follows:

1. First and foremost, one of the primary goals of this analysis is to measure feature importance. Given the broad set of features in the data, Random Forest serves this purpose better; in linear models, as the number of variables grows, it becomes harder to identify which features deserve the most attention.

2. Random Forest may also uncover unknown, nonlinear relationships between home prices and the given features, whereas linear models are mainly suited to data that shows a linear relationship.

3. Linear regression models work well when a small set of variables is considered for a prediction. In the real world, however, data is messy, and the need to measure the effect of many different features on home prices is likely to keep growing.

4. In a way, Random Forest models perform feature selection for you by choosing the best features to split on.

5. Random Forest handles outliers well, which can be expected from homes that are out of the ordinary. In linear models, outliers can strongly affect the fit and need to be found and dealt with carefully.

Random Forest models are certainly not perfect, but they are valuable and practical for many different types of datasets. The next section covers the steps taken to prepare the cleaned dataset for the Random Forest regression model.

### Data Preprocessing for Model

First, I imported all of the packages and modules needed to build and validate the model. The first split then separates the feature data (all the known features used to determine home values) from the target data (the values we are trying to predict, in other words the homes' sale prices).

The data is then split once more, with 70% used to train the model and 30% to validate it. The 30% is treated as unseen data that is run through the trained model to see how well the predicted values match the actual sale prices, given the home features supplied to the model.
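The two splits described above might look like this with scikit-learn's `train_test_split`. The data is randomly generated, and the column names and `random_state` value are assumptions made only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for the merged housing and weather table.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "GrLivArea": rng.integers(800, 3000, size=100),
    "OverallQual": rng.integers(1, 11, size=100),
    "SalePrice": rng.integers(100_000, 400_000, size=100),
})

# First split: features (X) versus the target we want to predict (y).
X = data.drop(columns="SalePrice")
y = data["SalePrice"]

# Second split: 70% of the rows for training, 30% held out for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)
```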

Two significant changes were made after this. The first was filling missing values in the training set with means and modes, and carrying those same fill values over to the test set. The second was encoding the categorical variables into a numeric form that the model can use.

The mean and mode values are transferred from the training set, rather than recalculated on the test set, for the following reasons:

a. The model is most useful under the assumption that the test data comes from the same distribution as the training data, so the training means and modes are the best representation of what is typical.

b. In other words, if a few values in the test data differ significantly from what the model saw during training, means and modes recalculated on the test set would shift as well and would no longer represent the overall data well.

c. In addition, the training set contains 70% of the data, far more than the 30% in the test set, which makes its means and modes a stronger representation of the data overall.

d. Finally, the overall goal of this procedure is to avoid tying the model's performance to the very data it is being evaluated on.
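To make the idea concrete, here is a minimal sketch in which the fill values are computed on the training set only and then reused on the test set. It assumes means are used for numeric columns and modes for categorical ones; the column names and values are invented.

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({
    "LotFrontage": [60.0, 80.0, np.nan, 70.0],
    "GarageType": ["Attchd", "Attchd", "Detchd", np.nan],
})
X_test = pd.DataFrame({
    "LotFrontage": [np.nan, 100.0],
    "GarageType": [np.nan, "Detchd"],
})

num_cols = X_train.select_dtypes(include="number").columns
cat_cols = X_train.select_dtypes(exclude="number").columns

# Fill values come from the training data only:
# the mean for numeric columns and the most frequent value for the rest.
fill_values = {
    **X_train[num_cols].mean().to_dict(),
    **X_train[cat_cols].mode().iloc[0].to_dict(),
}

# The same values are applied to both sets; nothing is recomputed on the test set.
X_train = X_train.fillna(fill_values)
X_test = X_test.fillna(fill_values)

print(fill_values)
print(X_test)
```

The missing test values are filled with 70.0 and "Attchd", the training mean and mode, even though the test set alone would have suggested 100.0 and "Detchd".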

Encoding the categorical features was more straightforward: the training dataset was passed through the `get_dummies` function from the pandas library, which turns each categorical value into its own indicator column that the model can use. Because the test dataset was much smaller than the training dataset, some categorical values did not appear in it, which produced a different number of encoded columns in the two sets. To fix this, the encoded columns from the training dataset were applied to the test dataset, with zeros filled in for any columns missing from the test set.
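One common way to align the encoded columns is `reindex` with a fill value, sketched below on a tiny made-up example. The notebook may do this differently; the column names are assumptions.

```python
import pandas as pd

X_train = pd.DataFrame({
    "GrLivArea": [1500, 2000, 1200],
    "Street": ["Pave", "Grvl", "Pave"],
})
# The test set happens to contain no "Grvl" street.
X_test = pd.DataFrame({"GrLivArea": [1800], "Street": ["Pave"]})

# One indicator column per categorical value.
X_train_enc = pd.get_dummies(X_train)

# Encode the test set, then force it onto the training columns:
# columns missing from the test set are created and filled with 0.
X_test_enc = pd.get_dummies(X_test).reindex(
    columns=X_train_enc.columns, fill_value=0
)

print(X_train_enc.columns.tolist())
print(X_test_enc)
```

Reindexing against the training columns also drops any category that appears only in the test set, which keeps both tables in the shape the model was trained on.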

## Model

[Code](https://github.com/AlfredMtz/price-prediction/blob/master/Code.ipynb)

The target data is continuous, so I decided to use a Random Forest Regressor model. A Random Forest regressor is a strong alternative to a Linear Regression model in that it often offers better predictive accuracy. A linear model is best suited to features that relate linearly to the target, whereas a random forest can uncover relationships in the data whether or not they are linear. Another alternative is a Gradient Boosted Regressor model, which can offer higher accuracy but whose parameters are harder to tune to avoid overfitting than those of a Random Forest. A side-by-side accuracy comparison of these three models is shown in the source code for this report; the rest of the report is based on the Random Forest model.

### GridSearchCV
An early and significant step in creating this model was finding optimal values for its parameters. For this task, `GridSearchCV` is imported from the `sklearn.model_selection` module. GridSearchCV is a powerful tool that exhaustively searches the given options for the best combination of parameter values. I passed in some of the main Random Forest regression parameters that can be used to tune the model. They are as follows:

**n_estimators:** The number of trees in the forest. Usually, more trees are better, since predictions are averaged over more trees and become more stable. The major downside is longer processing time, since there are more trees to build.

**max_features:** Determines how many features are considered when looking for the best split. I set two options to search over for this parameter:
- 'auto': Does not restrict the number of features; for a regressor in the scikit-learn versions available at the time, all features are considered at each split.
- 'sqrt': Uses the square root of the total number of features. For example, with 78 features this gives about 8 features per split (√78 ≈ 8.83, truncated to a whole number).

**max_depth:** The maximum depth allowed for each tree, meaning the maximum number of levels of splits between the root and a leaf.

**bootstrap:** Searches whether it is better to build each tree on a different random sample of the training data, drawn with replacement. Bootstrapping is usually expected, as it brings more variability and randomization to the data each tree sees.

**min_samples_leaf:** The minimum number of data points allowed in a leaf node. This matters because, if a leaf holds too few points, outliers or noisy data can skew its average and produce predictions that are not representative of the majority of the data in that group.
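A grid search over these five parameters could be set up roughly as below. This is a sketch on synthetic data, not the report's actual grid: the candidate values, the 3-fold cross-validation, and the RMSE-based scoring are all assumptions. Note that scikit-learn spells the last parameter `min_samples_leaf`, and that recent scikit-learn releases no longer accept `'auto'` for `max_features`, so `1.0` (all features) stands in for it here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

# Candidate values are illustrative only.
param_grid = {
    "n_estimators": [10, 30],
    "max_features": [1.0, "sqrt"],  # 1.0 = all features (old 'auto' for regressors)
    "max_depth": [None, 5],
    "bootstrap": [True, False],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

# Best combination found, and a model refit on all the data with it.
best_model = search.best_estimator_
print(search.best_params_)
```

Every combination in the grid is trained once per fold, so the cost grows multiplicatively with each extra candidate value; keeping the grid small is a practical trade-off.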

## Model Results

Different values are passed in for these parameters, and GridSearchCV exhaustively tries every combination to find the best one for the model. Once the search completes, the selected parameters can be used to fit the model to our training data.

After building the model with the training data and selected parameters, new data can be brought in to see how the model's predictions compare with the actual selling prices of these homes. Accuracy is measured with the Root Mean Square Error (RMSE), which expresses the difference between predicted and actual sale prices in the same units as the prices themselves.

The Random Forest Regressor model gives a Root Mean Square Error of $30,084.33. This means that the model's predictions typically differ from the actual sale prices by roughly this amount, give or take.
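For reference, the Root Mean Square Error is the square root of the average squared difference between actual and predicted prices. The short example below uses four made-up prices, not values from the report.

```python
import numpy as np

# Made-up actual and predicted sale prices, in dollars.
y_true = np.array([200_000.0, 150_000.0, 300_000.0, 250_000.0])
y_pred = np.array([200_000.0, 150_000.0, 330_000.0, 210_000.0])

# Errors are 0, 0, -30,000 and 40,000 dollars.
errors = y_true - y_pred

# Square, average, then take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

print(f"RMSE: ${rmse:,.2f}")
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.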

## Feature Importance

The last and one of the most important parts of this research was finding out whether the weather plays any significant role in home prices. For this, a feature importance table and a graph were built. They clearly indicate that other factors affect home prices more strongly than the weather does. These include the overall quality of the property, the size of the living area, the total basement size, and the size of the second floor, among others. Average temperatures over the year rank 24th and 25th in importance among the features determining prices. However, it is interesting that these variables rank well above many of the quality ratings of individual rooms. They also rank as more important than whether a home has central air conditioning or a fireplace.
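A feature importance table of this kind can be read from a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data in which the price depends on quality and living area but not on temperature; the column names and the price formula are invented for the example and say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=200),
    "GrLivArea": rng.integers(800, 3000, size=200),
    "AvgMaxTemp": rng.normal(60, 15, size=200),
})

# Synthetic price: driven by quality and size, with no temperature effect.
y = 20_000 * X["OverallQual"] + 50 * X["GrLivArea"] + rng.normal(0, 5_000, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; sorting them gives the ranking table.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

print(importance)
```

Calling `importance.plot.barh()` on the same series is one way to produce the accompanying chart, provided matplotlib is installed.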

## Conclusion

The model and insights presented in this report serve as an analysis and predictive tool for real estate practitioners from different sides of the industry. The primary goals were to find the features that most affect home prices and to discover what role the weather plays in them. Building a model that can help predict the future price of homes was also intended. Further analysis is possible, such as examining correlations between variables or looking more closely at the other two models provided as an accuracy comparison, but for now this should serve as a good starting point for analyzing this data and for predictive modeling.

---
Alfredo M. ☺️