# Machine Learning Engineer Nanodegree
## Capstone Project
**Haleh Dolati, February 2018**

# 1- Introduction
## 1-1- Background (Project Overview)
For most people, buying a home is the most expensive purchase of their lives. For some, their residential property is also their retirement savings and investment, one that they spend most of their lives paying for. Thus the ups and downs of the real estate market can be a “make or break” situation for many people, and being able to monitor the market is incredibly important for them. Prospective buyers, on the other hand, would like to know the approximate price of residential units in different neighborhoods so they can choose a home that meets their needs and that they can afford. For all these reasons, having an estimate of the value of a home can be really useful. Zillow’s Zestimate provides consumers with information about the current value of their property and how it may change in the near future. <br>
Zillow estimates the home value using millions of statistical and machine-learning models based on many data points for each property. As an urban planner with years of experience, I understand the impact of Zestimate on home values, the real estate market, and people’s lives. According to Zillow Research, since the first release of Zestimate 11 years ago, Zillow has improved the median margin of error for Zestimate from 14% to 5%. Since Zestimate can affect more than 110 million homes in the United States, it is really important to improve the margin of error even further.<br>
In order to achieve this goal, Zillow recently started a Kaggle competition (https://www.kaggle.com/c/zillow-prize-1). In this competition, Zillow is asking for a model that can predict the log error between the Zestimate and the actual sale price. The log error is defined as:<br>
**logerror=log(Zestimate)−log(SalePrice)** <br>
As a part of their Kaggle competition, Zillow provided data of more than 95000 houses. This data comes in two parts: the feature file with 58 features and the target file with the continuous labels. The “parcelid” is the common id between these two files and will be used to join them.<br> 
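
To make the sign convention of the log error concrete, here is a minimal sketch of the formula above. The prices are hypothetical, and the use of the natural logarithm is an assumption, since the base is not stated in this report.

```python
import math


def log_error(zestimate, sale_price):
    """Return log(Zestimate) - log(SalePrice), assuming the natural logarithm."""
    return math.log(zestimate) - math.log(sale_price)


# Hypothetical home: Zestimate of 550,000 and a sale price of 500,000.
over = log_error(550_000, 500_000)
# Hypothetical home: Zestimate of 450,000 and a sale price of 500,000.
under = log_error(450_000, 500_000)

print(f"overestimate:  {over:+.4f}")   # positive: Zestimate is above the sale price
print(f"underestimate: {under:+.4f}")  # negative: Zestimate is below the sale price
```
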
## 1-2- Problem Statement
The goal of this project is to build a supervised prediction model that can improve the Zestimate, using a dataset of features related to residential units and a continuous label that shows the log error between the actual price and the Zestimate. Modeling the error is a useful way to find areas or features that need to be explored further in the original prediction model. In this project, finding an accurate model would help to adjust the Zestimate and make it more accurate.<br>
Based on the dataset and the continuous nature of the response variable, supervised learning models such as Linear Regression, Decision Tree Regression, and Gradient Boosting Regression are possible choices for training. After exploring the data, I will train these candidate models and evaluate the results in order to find the best model to optimize. The optimized model can predict the log error and make the Zestimate more accurate. In addition, feature importance analysis will be done in order to find the most important factor(s) that lead to high log error.<br>

## 1-3- Metrics
In the Kaggle competition, Zillow suggested the Mean Absolute Error ($MAE$) to measure the performance of the proposed model. Since this dataset’s response variable is continuous, the metrics that I suggested in my proposal were Mean Absolute Error and Root Mean Squared Error:
 
**1-** Mean Absolute Error ($MAE$) measures the magnitude of the errors while ignoring their direction. MAE ranges from 0 to ∞, and smaller values are more desirable. This is the measure chosen in Zillow's Kaggle competition.  $MAE = \frac{\sum_{i=1}^{n}|\hat y_i - y_i|}{n}$

**2-** Root Mean Squared Error ($RMSE$) is another metric that can be used to measure the difference between the actual response variable and the predicted response variable. A higher RMSE means the model is less accurate and a lower RMSE means the model is performing well.
 $RMSE = \sqrt{\frac{\sum_{i=1}^{n}(\hat y_i - y_i)^2}{n}}$

Both MAE and RMSE are forms of average error, range from 0 to ∞, ignore the direction of the errors, and are negatively oriented scores (lower is better). However, RMSE squares the errors before averaging, so it penalizes larger errors more heavily. Because the two metrics capture different aspects of the error, I decided to use both of them to evaluate the model.
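
The difference between the two metrics can be illustrated with a short sketch. The two prediction sets below are hypothetical: both have the same MAE, but the one that concentrates the error in a single prediction has a larger RMSE.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the average squared error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [0.0, 0.0, 0.0, 0.0]
even_errors = [0.1, 0.1, 0.1, 0.1]    # four small errors
one_big_error = [0.0, 0.0, 0.0, 0.4]  # one large error, same total absolute error

for name, y_pred in [("even errors", even_errors), ("one big error", one_big_error)]:
    print(f"{name}: MAE={mae(y_true, y_pred):.3f}, RMSE={rmse(y_true, y_pred):.3f}")
```
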

----

# 2- Analysis
## 2-1- Data Exploration
### 2-1-1- Dataset: 
As mentioned before, the Zillow data consists of two datasets. The first step in this analysis was to combine these two datasets and create a new one. The first one (properties_2016) has the information of 2985217 homes. The second one (transactions_2016) has the information of 90275 transactions. These two datasets have a common id, ‘parcelid’, that makes joining them possible. <br>
Before creating a new dataset, I looked into the possibility of duplicates by searching for parcelids that appeared more than once in each of those two datasets. There were no duplicates in the properties file, which means each property has only one set of variables associated with it. However, in the transaction data there were 249 duplicated parcelids. This means more than one transaction is associated with a parcelid, i.e. the home was sold more than once in 2016. For example, the home with parcel id 13850164 was sold twice in 2016: once on 2016-01-05 and once on 2016-06-29. Joining the properties and transactions files keeps both transactions as two separate records and assigns the features from the properties file to both of them. The new dataset (the result of the join) has 60 variables and 90275 records. 
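
The duplicate check and the join can be sketched with pandas on a toy example. Only `parcelid`, `logerror`, the parcel id 13850164, and its two sale dates come from this report; the other column names and all other values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for properties_2016 and transactions_2016.
properties = pd.DataFrame({
    "parcelid": [13850164, 11111111, 22222222],
    "bedroomcnt": [3, 4, 2],  # hypothetical feature values
})
transactions = pd.DataFrame({
    "parcelid": [13850164, 13850164, 11111111],
    "logerror": [0.01, -0.02, 0.05],  # hypothetical labels
    "transactiondate": ["2016-01-05", "2016-06-29", "2016-03-15"],
})

# Count parcel ids that appear again after their first occurrence.
repeat_sales = transactions["parcelid"].duplicated().sum()

# A left join keeps one row per transaction and copies the property
# features onto every sale of the same parcel.
merged = transactions.merge(properties, on="parcelid", how="left")

print(repeat_sales)
print(merged)
```
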
### 2-1-2- Variables:
There are 60 variables in this dataset: one target variable (label) and 59 possible predictor variables (features). “logerror” is the continuous label of this dataset. In order to familiarize myself with the dataset, I looked into the features’ types and identified categorical and numerical variables to see whether I needed to convert some of the variables into dummy variables. There are 54 numerical variables (float and integer), 1 date/time variable, and 5 object variables. By looking into the features, I realized that some of the numerical variables are in fact categorical, with the numbers representing different categories. For example, the variable “fips” has three values: 6037, 6059, and 6111. “fips” is the county where the home is located, and those three numbers correspond to Los Angeles, Orange, and Ventura county, respectively. In this case, if I want to use “fips” in my analysis I have to convert it to 3 dummy variables. Appendix 1 shows the complete list of variables, their description, data type, and data category.
### 2-1-3- Missing values:
Among the 60 variables in this dataset, only 15 do not have any missing values. The percentage of missing values for the variables in this dataset ranges from 0 to 99.98%. Appendix 2 shows the number and percentage of missing values, if any, for all 60 variables. I classified variables with missing values into 3 categories and dealt with them differently: <br>
1- Variables with only a few missing values were left alone. For example “yearbuilt”, which represents the year in which the building was built, has only 756 missing values (0.84%).<br>
2- Some variables with a large number of missing values were recovered using different methods. For example, a couple of variables provide information such as the size or the type of the pool on each property. While each of them separately had a high number of missing values, their combination can be used to create a flag variable that tells us whether there is a pool on the property. Handling the missing values in this category mostly resulted in creating new variables. <br>
3- The last group simply had too many missing values. For example “buildingclasstypeid”, which represents the building framing type (steel, wood, or concrete/brick frames), has 90259 missing values, which is 99.98% of the data. I had no choice but to exclude this variable from the analysis. 
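
A per-variable summary of missing values like the one described above can be computed along these lines. This is a sketch on a tiny hypothetical frame; only the column names are taken from this report.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample with the same kind of gaps as the real data.
df = pd.DataFrame({
    "yearbuilt": [1990, 2005, np.nan, 1978],
    "buildingclasstypeid": [np.nan, np.nan, np.nan, 4],
    "logerror": [0.01, -0.02, 0.05, 0.00],
})

# Number and percentage of missing values per variable, worst first.
missing = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing)
```
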

## 2-2- Exploratory visualization
###  2-2-1- Univariate Analysis:
In order to have a better understanding of the data, I looked into the frequency and distribution of some of the variables.
#### Logerror
As mentioned before, logerror is the difference between the logarithm of the Zestimate and the logarithm of the actual price. Logerror in this data ranges from -4.61 to 4.74. A positive logerror means Zestimate is overestimating the sale price, while a negative logerror means that Zestimate is underestimating the sale price. The mean and median of the logerror are nearly the same, which suggests the logerror is symmetrically distributed around its center. Graphs 1 and 2 show the bell-shaped distribution of logerror. Graph 3 shows the distribution of absolute logerror. The absolute values graph shows more clearly that the majority of Zestimate errors are very close to zero, i.e. most predictions are accurate.  <br>
 
!["Picture 1"](Image-report/EDA_Univar_logerrorBar.png) 
!["Picture 2"](Image-report/EDA_Univar_logerrorPoint.png)
!["Picture 3"](Image-report/EDA_Univar_ABSlogerror.png)

#### Transaction Date
The date on which the transaction happened is included in this dataset in YYYY-MM-DD format. I made two new numerical variables for the month and day of the transaction to see how sales changed during the year. The result shows that fewer properties were sold during the last 3 months of the year and that there is a peak in the number of transactions between May and August. The univariate analysis of the day of the transaction does not show a very distinct change. The only noticeable and interesting point is that every 7 days there is a spike in the number of transactions: the 7th, 14th, 21st, and 28th of the month are the days with the highest number of sales. 
!["Picture 4"](Image-report/EDA_Univar_Day.png)
!["Picture 5"](Image-report/EDA_Univar_Month.png)
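
Deriving the month and day variables from the transaction date might look like the following sketch. The column names `transactiondate`, `transaction_month`, and `transaction_day` are assumptions for illustration, and the dates are a toy sample.

```python
import pandas as pd

df = pd.DataFrame({"transactiondate": ["2016-01-05", "2016-06-29", "2016-06-14"]})

# Parse the YYYY-MM-DD strings, then split them into numerical parts.
dates = pd.to_datetime(df["transactiondate"], format="%Y-%m-%d")
df["transaction_month"] = dates.dt.month
df["transaction_day"] = dates.dt.day

# Number of sales per month, ordered by month.
sales_per_month = df["transaction_month"].value_counts().sort_index()
print(sales_per_month)
```
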

#### Geography
Next, I looked at the geographic distribution of the sold properties. When I compared the graph with a map of this area, the darker colors aligned with high-density areas (for example, downtown LA). However, the small island of sales in the northeast of the map is not a very dense area. <br>
!["Picture 6"](Image-report/EDA_Univar_LocationDens.png)

#### Bedrooms
The number of bedrooms is a very important characteristic of a home. Homebuyers usually look at this number even before the size of the home. Therefore, this number plays an important role in listing and sale prices. Due to its importance, the number of bedrooms can cause error when it comes to predicting the price (Zestimate). The homes in this dataset have between 0 and 16 bedrooms, while only 2% of them have more than 5 bedrooms. The picture below shows that 4-bedroom homes are the most common ones in this dataset and that the majority of homes have 3 or more bedrooms.   
!["Picture 7"](Image-report/EDA_Univar_bedroom.png)

#### Bathroom
Like bedrooms, the number of bathrooms in a home is usually a key factor that affects the price and thus can cause error in predicting it. In this dataset, the number of bathrooms ranges from 0 to 20. Since only slightly over 1% of the properties have more than 5 bathrooms, I decided to look into homes with fewer than 6 bathrooms. The following picture shows that 3 bathrooms is the most common count and that the majority of the homes have more than 3 bathrooms. 
!["Picture 8"](Image-report/EDA_Univar_Bathroom.png)

###  2-2-2- Bivariate Analysis:
#### Logerror by Month and Day
I wanted to see whether the Zestimate improved over time. Therefore, I calculated the mean of logerror for each month. The first picture shows that for most of the year the average logerror increased. But then I realized that since logerror has both negative and positive values, the mean cannot capture the magnitude of the error: large positive and large negative errors cancel each other out, so the mean of logerror could be very close to zero even when the individual errors are large, which would be misleading. Therefore, I changed the mean of logerror to the mean of absolute logerror. The second picture shows the mean of absolute logerror over time. This picture shows the opposite trend compared to the mean logerror, and in fact it shows that the logerror improved over the same period of time. <br>
*(The number of sales in November and December is significantly lower than other months. So the mean logerror and mean of absolute logerror might not be a good representation of Zestimate’s performance in those two months)*
!["Picture 9"](Image-report/EDA_Bivar_LogerrorMonth.png)
!["Picture 10"](Image-report/EDA_Bivar_ABSLogerrorMonth.png)
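
The cancellation effect described in this section can be reproduced with a small sketch on hypothetical values. In month 1 the errors are large but of opposite sign, so the mean logerror hides them while the mean absolute logerror does not.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "logerror": [0.5, -0.5, 0.1, 0.1],  # hypothetical values
})
df["abs_logerror"] = df["logerror"].abs()

# Month 1: mean logerror is 0 even though both errors are large.
# Month 2: the errors are small and both metrics agree.
by_month = df.groupby("month")[["logerror", "abs_logerror"]].mean()
print(by_month)
```
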

#### Logerror by Year Built
The next relationship to be examined is between the building’s age and logerror, to show whether the age of the home affects logerror significantly. I chose to use absolute logerror instead of logerror itself since I wanted to see the magnitude of the error by the home’s age. The picture below displays a scatter plot that shows how logerror changes with the age of the building (represented by the year in which the building was built). I found the relationship inconclusive and did not get much insight from it.<br>
!["Picture 11"](Image-report/EDA_Bivar_LogerrorYearbuilt.png)

#### Bathroom by Year Built
The number of bathrooms and the age of the building are very important factors in my opinion. Therefore, I wanted to see whether these two are somehow related. The picture below shows the average number of bathrooms in a home by the year in which the home was built. The graph clearly shows an increase in the number of bathrooms over time: newer homes have more bathrooms. 
!["Picture 11"](Image-report/EDA_Bivar_BathroomYearbuilt.png)


## 2-3- Algorithms and techniques
The choice of machine-learning algorithm is driven by both the nature of the data and the question one is trying to answer. In Zillow’s case, based on the dataset and the continuous nature of the response variable, the following supervised prediction algorithms and techniques were used.
### 2-3-1- Algorithms:
#### Linear Regression
Linear regression is a simple statistical model that attempts to find the relationship between variables by fitting a linear equation that describes the data. In linear regression there is one label variable (also known as the dependent, response, or predicted variable) and one or more features (also known as independent, regressor, or predictor variables). Linear regression attempts to find whether there is a relationship between these two types of variables, and if there is one, how strong it is. Therefore, the result of a linear regression model is an equation that shows the relationship between the independent variables and the dependent variable in the form of coefficients. It is important to note that linear regression explores the correlation between variables, not causation. 
#### Lasso and Ridge regression
One of the main issues of machine learning techniques is overfitting. When a model overfits, it performs very well on the training dataset but poorly on test datasets. The reason a model overfits is that the technique used to build it pays too much attention to details in the training portion that do not generalize to the test portion. There are many ways to avoid overfitting, but one of the most effective methods is regularization, which introduces a penalty term for the size of the weights in the model. There are two common ways to regularize linear regression models. In the first approach, called Ridge regression or L2 regularization, the sum of squares of the coefficients is added to the objective function. Thus, the ridge regression objective function is:<br>
$\text{Objective} = RSS + \alpha \sum_{j} \beta_j^2$<br>

where $RSS$ is the residual sum of squares, $\beta_j$ are the model coefficients, and $\alpha$ (alpha) is the parameter that balances minimizing the RSS against minimizing the sum of squares of the coefficients. The second approach, called Lasso regression, is quite similar to Ridge regression. Lasso is the abbreviation of Least Absolute Shrinkage and Selection Operator and performs L1 regularization. In Lasso, the sum of the absolute values of the coefficients is added to the objective function. Thus, the lasso regression objective function is: <br>
$\text{Objective} = RSS + \alpha \sum_{j} |\beta_j|$<br>

(Alpha plays the same role in both Lasso and Ridge regression.)
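
As a worked instance of the two objective functions, the sketch below evaluates them for a fixed coefficient vector on a tiny hypothetical dataset. It only computes the objectives as written in this section; it does not fit a model, and libraries such as scikit-learn may scale the RSS term differently.

```python
import numpy as np


def rss(X, y, coef):
    """Residual sum of squares for a linear model without intercept."""
    residuals = y - X @ coef
    return float(np.sum(residuals ** 2))


def ridge_objective(X, y, coef, alpha):
    """RSS plus alpha times the sum of squared coefficients (L2 penalty)."""
    return rss(X, y, coef) + alpha * float(np.sum(coef ** 2))


def lasso_objective(X, y, coef, alpha):
    """RSS plus alpha times the sum of absolute coefficients (L1 penalty)."""
    return rss(X, y, coef) + alpha * float(np.sum(np.abs(coef)))


# Hypothetical data: three observations, two features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.5, -1.0, 1.0])
coef = np.array([2.0, -1.0])
alpha = 0.1

print(rss(X, y, coef))                     # only the first residual (0.5) is non-zero
print(ridge_objective(X, y, coef, alpha))  # penalty uses 2^2 + (-1)^2 = 5
print(lasso_objective(X, y, coef, alpha))  # penalty uses |2| + |-1| = 3
```
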
#### Decision Tree Regression: 
Decision tree is a robust and easy-to-interpret machine learning algorithm that maps the features to the label. Features and label can be either numerical or categorical. Decision tree regression is used for a numerical label and decision tree classification is used for a categorical label. In this project, since the target variable is numerical, a regression model will be fitted to the target variable using the features (AKA independent variables). The decision tree regressor tries several split points for each feature, assigns a prediction value (the mean of the label) to each resulting region, and calculates the Sum of Squared Errors (SSE) for each candidate split. The SSE values across all features are compared, and the split with the lowest SSE becomes the root node. This process is repeated on each resulting region until a stopping criterion, such as a maximum depth, is met.
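
The split search can be sketched for a single numerical feature as follows. This is a simplified illustration on hypothetical data, not scikit-learn's implementation: it tries the midpoint between each pair of neighbouring feature values and keeps the threshold with the lowest total SSE.

```python
def sse(values):
    """Sum of squared errors around the mean of the values."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) of the best binary split on one feature."""
    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for feature, target in pairs if feature <= threshold]
        right = [target for feature, target in pairs if feature > threshold]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse


# Hypothetical feature with two clearly separated groups of label values.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(best_split(x, y))  # the split between 3 and 10 separates the groups perfectly
```

With several features, the same search would be run per feature and the overall lowest SSE would define the node.
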
#### Gradient Boosting Regression 
Gradient Boosting, like decision tree, is a machine learning technique that can be used for both classification and regression problems. In Gradient Boosting regression, the idea is to build a sequence of very simple trees, where each tree is built to reduce the error left by the previous trees.
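
The idea of each tree correcting the previous ones can be sketched by hand for squared-error loss, where each new tree is fitted to the current residuals. The data, the learning rate, and the number of rounds below are hypothetical choices for illustration; this is not the tuned model from this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-dimensional regression problem.
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(x).ravel()

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant prediction
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction  # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1, random_state=42)
    stump.fit(x, residuals)     # a very simple tree fitted to the residuals
    prediction += learning_rate * stump.predict(x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```
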
### 2-3-2- Techniques:
These algorithms will first be evaluated with their default parameter values (out of the box). Then those parameters will be tuned using grid search to find the best combination. The following techniques were used in this project.
#### Grid Search 
Grid search is a systematic way to tune a model. Each algorithm can be modified (tuned) by choosing different values for its parameters (like the maximum depth for a decision tree). Different values for these parameters change the behavior of the algorithm and hence result in a different model. Grid search tries every combination of the specified parameter values to find the best one. This is called tuning. For example, if we want to see which of the maximum depth candidates gives the best score, we can use grid search to try each maximum depth in combination with the other parameters. 
#### Cross Validation 
In k-fold cross validation, the training set is divided into k pieces (folds). K-1 folds are used for training and the remaining fold is used for validation. This process is repeated k times, each time holding out a different fold for validation, and the scores are averaged. 
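
Grid search and k-fold cross validation are typically combined, for example with scikit-learn's `GridSearchCV`. The sketch below runs on synthetic data; the grid values mirror the decision tree candidates listed later in this report, while the data, the scoring choice, and `cv=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.RandomState(42)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Every combination of these values is evaluated (2 x 2 = 4 candidates).
param_grid = {"max_depth": [2, 5], "min_samples_split": [2, 4]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=3,                               # 3-fold cross validation per candidate
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best candidate
```
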

## 2-4- Benchmark
For the benchmark, a naïve model that always returns the mean of the absolute logerror values was chosen. This means that instead of predicting a value for each home, the model always returns 0.0696778. The goal is to find a model with a better MAE and RMSE than the benchmark model. The MAE for the benchmark model is 0.100695393513 and the RMSE for the benchmark model is 0.182605106132.
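
A constant-prediction benchmark of this kind could be expressed with scikit-learn's `DummyRegressor`, as in the sketch below. The constant 0.0696778 comes from this report; the test labels are hypothetical, so the printed scores will not match the benchmark scores reported above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical test set; the features are ignored by a constant model.
X_test = np.zeros((4, 1))
y_test = np.array([0.01, -0.02, 0.10, -0.30])

# Always predict the same value, regardless of the input.
benchmark = DummyRegressor(strategy="constant", constant=0.0696778)
benchmark.fit(X_test, y_test)
y_pred = benchmark.predict(X_test)

benchmark_mae = mean_absolute_error(y_test, y_pred)
benchmark_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(benchmark_mae, benchmark_rmse)
```
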

---


# 3- Methodology


## 3-1 Data Preprocessing
Before using the data to train the regression models, I took three data preprocessing steps to make the data ready for training: Missing value handling, one-hot encoding, and feature scaling. 

### 3-1-1- Missing Values:
As mentioned in section 2-1 Data Exploration, I classified features with missing values into three categories and designed a different approach for each of them. After careful examination of the data, I found that some of the features simply cannot be used due to their high number of missing values. On the other hand, there were some features with a high number of missing values that could be used in training after some extra steps. A short explanation of how I handled missing values in those features is provided below: <br>
#### Fireplace
There are two variables for fireplace. One flags the existence of a fireplace (0 and 1) and the other one shows the number of fireplaces. In an ideal situation, for every flag of 1 there should be a number in the fireplace count variable that shows how many fireplaces exist, and if there is a number in the fireplace count variable then the fireplace flag for that home should be 1. Unfortunately, for many homes this is not the case. So I decided to make a new variable and populate it with 0 and 1 based on the other two variables. If either of the two original variables has a value (1 for the flag or a non-zero number for the count), the new variable is 1, meaning there is a fireplace in the house. Otherwise, the new variable is 0, which means there is no fireplace on the property.
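
The rule for the combined fireplace variable can be sketched with pandas. The column names `fireplaceflag`, `fireplacecnt`, and `has_fireplace` are assumptions for illustration, and the rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the inconsistent cases described above.
df = pd.DataFrame({
    "fireplaceflag": [np.nan, 1, np.nan, 1, np.nan],
    "fireplacecnt": [np.nan, np.nan, 2, 1, 0],
})

# Treat missing values as "no information", then combine both sources.
has_flag = df["fireplaceflag"].fillna(0) == 1
has_count = df["fireplacecnt"].fillna(0) > 0
df["has_fireplace"] = (has_flag | has_count).astype(int)

print(df)
```
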
#### Pool
There are 4 variables that have information about the existence and type of the pool on the property: “poolsizesum”, “pooltypeid10”, “pooltypeid2”, and “pooltypeid7”. Using all of them, I created a 0-1 flag that shows the existence of a pool on the property. If any of these variables has a value, then there is a pool.
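
A minimal sketch of the pool flag, assuming that "has a value" means "is not missing". The four source column names come from this report; `has_pool` and the rows are hypothetical.

```python
import numpy as np
import pandas as pd

pool_columns = ["poolsizesum", "pooltypeid10", "pooltypeid2", "pooltypeid7"]

# Hypothetical rows: no pool information, size and type, type only.
df = pd.DataFrame({
    "poolsizesum": [np.nan, 450.0, np.nan],
    "pooltypeid10": [np.nan, np.nan, np.nan],
    "pooltypeid2": [np.nan, np.nan, np.nan],
    "pooltypeid7": [np.nan, 1.0, 1.0],
})

# 1 if any of the four pool variables is present for the home, else 0.
df["has_pool"] = df[pool_columns].notna().any(axis=1).astype(int)

print(df["has_pool"].tolist())
```
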
#### Garage
“garagecarcnt” and “garagetotalsqft” are two variables related to the garage. The first one shows the number of garages on each property and the second one shows the total square footage dedicated to the garage. After inspecting these two variables, I concluded that whenever one variable has a value for a home, the other one does as well. However, there are almost 9000 homes with a garage count greater than 0 and a square footage of 0, which does not make sense. Therefore, I decided not to use the garage square footage. Instead, I filled the NaN values of the garage count with 0, meaning those homes do not have a garage.
#### Air condition
There is one variable that has information about air conditioning in a property. This variable has several categories for the type of air conditioning that exists in a property. One of the categories is None, which means there is no air conditioning available in the unit. I decided to make a new flag variable that only shows the existence of an air conditioning unit (of any type) in the property.

### 3-1-2- One-hot encoding:
In order to use categorical variables in regression, those variables need to be converted into numerical ones. One-hot encoding is a method to transform categorical variables (features) into numerical features. In one-hot encoding, a dummy (Boolean) feature is created for each category of a non-numerical feature. In this dataset, there is one categorical variable that I would like to include in the modeling: 
the feature FIPS, which represents the county in which the property is located. After one-hot encoding, three new dummy variables ('CountyVentura', 'CountyOrange', 'CountyLosAngeles') were created. 
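
One way to produce these dummy variables is pandas' `get_dummies`, as in the sketch below. The FIPS codes and the resulting column names come from this report; the rows and the intermediate mapping are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"fips": [6037, 6059, 6111, 6037]})  # hypothetical rows

# Map the numeric county codes to readable names before encoding.
county_names = {6037: "LosAngeles", 6059: "Orange", 6111: "Ventura"}
counties = df["fips"].map(county_names)

# One dummy column per county; exactly one of them is 1 in every row.
dummies = pd.get_dummies(counties, prefix="County", prefix_sep="").astype(int)
df = pd.concat([df.drop(columns="fips"), dummies], axis=1)

print(df)
```
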

### 3-1-3- Feature scaling:
Feature scaling, or normalization of the numerical features, involves standardizing the range of numerical features. Feature scaling has several benefits, but one of the most important is that it can help keep the model from getting stuck in local optima. Of course, not all machine learning algorithms need feature scaling, but since I’m trying a variety of algorithms, I decided to use 'MinMaxScaler' from the scikit-learn library. 'MinMaxScaler' scales data to a fixed range between 0 and 1 using the following equation: 

$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$

where $X_{min}$ and $X_{max}$ are the minimum and maximum values of the feature.

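
As a quick check of min-max scaling, the sketch below compares a manual computation of (X - min) / (max - min) per feature with scikit-learn's `MinMaxScaler` on a small hypothetical feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales (e.g. floor size, bathrooms).
X = np.array([
    [1000.0, 1.0],
    [1500.0, 3.0],
    [3000.0, 2.0],
])

# Manual min-max scaling, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation with scikit-learn (default range is 0 to 1).
scaled = MinMaxScaler().fit_transform(X)

print(scaled)
```
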
## 3-2- Implementation
The first step in implementing the algorithms was to normalize the features using 'MinMaxScaler'. Since all of the features in this dataset are numerical at this point, scikit-learn’s 'MinMaxScaler' was applied to the whole dataset. Then the five algorithms mentioned earlier (Linear Regression, Lasso Regression, Ridge Regression, Gradient Boosting Regression, and Decision Tree Regression) were trained with their default parameters. Finally, the MAE and RMSE were calculated using the test data. 
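
The implementation steps above can be sketched end to end as follows. The data is synthetic and the 80/20 split and random seeds are assumptions for illustration, so the scores are not comparable to the results in this report.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared feature matrix and logerror label.
rng = np.random.RandomState(42)
X = rng.uniform(0, 100, size=(300, 4))
y = 0.01 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Scale all features to [0, 1], then hold out 20% of the rows for testing.
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# The five algorithms with their default parameters.
models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['MAE']:.4f}, RMSE={scores['RMSE']:.4f}")
```

Fitting the scaler on the training split only is a common variation that keeps the test-set ranges out of the preprocessing step.
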
## 3-3- Refinement
GridSearchCV was used to fine-tune the models. Usually, the models that perform better in their default state are the best candidates for tuning. However, for some algorithms like Decision Tree, tuning can make a huge difference. Considering the size of the dataset and the number of features, I decided to tune all of them except Linear Regression. <br>

In order to choose the best parameters for tuning, I used scikit-learn’s description of the parameters to understand how changing each of them would affect the results. In addition, I took advantage of existing literature about the best tuning practices for each algorithm. Using these two resources, I chose the following parameters and values for each algorithm:
#### Linear Regression
While scikit-learn exposes some parameters for linear regression, none of them is a tuning hyper-parameter in the usual sense: they control CPU usage, whether the intercept is calculated, and whether the regressors are normalized. Linear regression was therefore left at its defaults. 
#### Lasso Regression
Alpha and max_iter were the two parameters of lasso regression that were used in tuning.
- Alpha: This parameter sets the strength of the L1 penalty. The default value in scikit-learn is 1.0; an alpha of 0 is equivalent to ordinary least squares. Four values for alpha were tried [0, 1, 100, 1000].
- Max_iter: This optional parameter is the maximum number of iterations. Two values were used in grid search for tuning [500, 1000].

#### Ridge Regression

Two parameters, alpha and max_iter, were used in tuning the ridge regression algorithm.
- Alpha: This parameter is a non-negative number that sets the regularization strength. Larger values give stronger regularization. Four values were used in tuning [0, 1, 100, 1000].
- Max_iter: Two values for the maximum number of iterations were used in tuning [500, 1000].

#### Gradient Boosting Regression

- N_estimators: This parameter sets the number of boosting stages that the algorithm is going to perform. Larger values usually give better performance since gradient boosting is rather robust to over-fitting. The three values that were tried in tuning are [100, 500, 1000].
- Max_depth: This parameter sets the maximum depth of each individual regression estimator, which limits the number of nodes in the tree. Three values were tried as the maximum depth of the tree [1, 3, 5].
- Loss: This parameter selects the loss function to be optimized. ‘ls’ is least squares regression, ‘lad’ is least absolute deviation, ‘huber’ is a combination of ‘ls’ and ‘lad’, and ‘quantile’ uses quantile regression. I tried all four [‘ls’, ‘lad’, ‘huber’, ‘quantile’].
- Learning_rate: This parameter controls the contribution of each tree. Two learning rate values were tried for tuning [0.05, 0.1].


#### Decision Tree Regression

- Criterion: This parameter selects the function used to measure the quality of a split. Three options were tried: “mse” is the mean squared error, “friedman_mse” is the mean squared error with Friedman’s improvement score, and “mae” is the mean absolute error ['mse', 'friedman_mse', 'mae'].
- Splitter: This parameter sets the strategy used to choose the split at each node. The two accepted values are “best”, which chooses the best split, and “random”, which chooses the best random split ['best', 'random'].
- Max_depth: This parameter sets the maximum depth of the tree. None means the tree will keep growing until all leaves are pure. I tried two values to find the best depth [2, 5]. 
- Min_samples_split: This parameter sets the minimum number of samples required in a node in order to split it [2, 4].
- Random_state: This parameter sets the seed [42].


---



# 4- Results


## 4-1- Model Evaluation and Validation
As mentioned before, I chose to tune all of the algorithms except linear regression. After tuning, I measured the performance of each tuned algorithm using the test data, which was 20% of the entire dataset. Using a part of the dataset that has not been seen by the algorithms as the test set makes the evaluation more reliable. The graphs below show the results of the performance evaluation: MAE and RMSE decreased after tuning for most models.

!["Picture 12"](Image-report/Results_beforeTuning.png)
!["Picture 13"](Image-report/Results_AfterTuning.png)


The graph on top shows the MAE and RMSE of 5 models and the benchmark model before tuning and the one on the bottom shows the same info after tuning. To make the comparison easier, the following two tables show the same info in tabular format.
#### Before Tuning Results
|Model|MAE|RMSE|
|:---|:---|:---|
|Benchmark|0.100695|0.182605|
|Linear|0.0715649|0.172087|
|Lasso|0.0717224|0.172333|
|Ridge|0.0715494|0.172083|
|Gradient Boosting|0.0716858|0.172597|
|Decision Tree|0.118212|0.246391|

#### After Tuning Results



|Model|MAE|RMSE|
|:---|:---|:---|
|Benchmark|0.100695|0.182605|
|Linear|0.0715649|0.172087|
|Lasso|0.07156|0.172084|
|Ridge|0.0715494|0.172083|
|Gradient Boosting|0.0707912|0.172049|
|Decision Tree|0.0711454|0.172429|


As the before-tuning graph and table show, all algorithms except decision tree regression performed better than the benchmark model. The after-tuning graph and table show that while lasso regression and gradient boosting regression yield slightly better results after tuning, the decision tree’s MAE and RMSE improved significantly after tuning the hyper-parameters. Based on the results, gradient boosting is chosen since it gives the lowest MAE and RMSE scores. Decision tree regression is the second best algorithm by MAE, with a slightly higher RMSE. 
This means my initial suspicion that the decision tree would perform much better after tuning was valid. 

---

# 5- Conclusion

## 5-1- Feature importance
The gradient boosting regression provided a model that can predict the logerror. This information can be used to improve the accuracy of Zestimate, which makes home sellers, real-estate agents, and homebuyers more confident in Zestimate. The contribution of each feature can be easily obtained using the feature importance function. The chart below shows the feature importance scores based on the tuned gradient boosting regression model. 
The top 5 features account for more than 78% of the total feature importance. Floor size is the most important feature. Lot size, land tax value, structure tax value, and year built are the other features that highly affect the logerror. 

!["Picture 14"](Image-report/FeatureImportance.png)
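
Feature importance scores of this kind are exposed by scikit-learn's gradient boosting model through the `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical feature names, so its ranking is only an illustration and not the ranking reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data in which only the first feature drives the label.
rng = np.random.RandomState(42)
feature_names = ["floor_size", "lot_size", "year_built"]  # hypothetical names
X = rng.uniform(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.05, size=300)

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Importances are normalized to sum to 1; sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```
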


Knowing which features affect the logerror prediction the most is important in two ways:<br>
First, it can be used to see whether the model makes sense. For example, if the model says that having a fireplace in a warm climate such as southern California plays an important role in predicting the logerror, we may want to re-evaluate the model. In this case, the number of bathrooms and bedrooms indeed plays an important role in predicting the housing price and, as a result, in predicting the error of the price prediction. Second, since these features highly affect the logerror, more attention could be paid to gathering, cleaning, and processing those features.


## 5-2- Reflection
In this project, the goal was to find a suitable machine learning approach that can predict the logerror (the difference between the logarithm of Zillow’s Zestimate and the logarithm of the actual sale price) as accurately as possible. The data set, which was provided by Zillow, contains information about residential properties that were sold in 2016 in southern California. This dataset has several numerical and categorical features as well as the logerror (label). Before using these features in modeling, several steps were taken to clean the data and handle the missing values. Then, the categorical variables were one-hot encoded and the numerical data were normalized using `MinMaxScaler()`. Although the dataset offered a good number of features, after taking all these steps it became clear that only about a dozen of them could be used in the modeling. This was due to either a high number of missing values or the nature of the feature. For example, the home’s basement size was missing for 99.95% of the transactions. Another example is the parcelid, which is the unique identifier for each parcel and does not provide any information about the characteristics of the home. <br>
After all these steps, 16 features were chosen to build the model. Five algorithms were tried and tuned: Linear Regression, Lasso Regression, Ridge Regression, Gradient Boosting Regression, and Decision Tree Regression. To find the best hyper-parameters, `GridSearchCV` was used, which applied 3-fold cross-validation by default in the scikit-learn release available at the time. The performance of the resulting models was measured using MAE and RMSE. While all models outperformed the benchmark model, gradient boosting regression had the lowest MAE and RMSE. Feature importance analysis shows that the most important feature is floor size and that the top five features account for about 78% of the total importance.
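
The workflow summarized above (one-hot encoding, min-max scaling, grid search with 3-fold cross-validation, then MAE and RMSE on held-out data) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, the three columns are only a small sample of names from Appendix 1, the county codes are made up, and the use of a `Pipeline`, the parameter grid, and MAE as the search metric are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the cleaned data (values are made up).
rng = np.random.default_rng(42)
n = 600
data = pd.DataFrame({
    "calculatedfinishedsquarefeet": rng.uniform(500, 5000, n),
    "yearbuilt": rng.integers(1900, 2016, n).astype(float),
    "regionidcounty": rng.choice([1.0, 2.0, 3.0], n),  # hypothetical codes
})
logerror = 1e-5 * data["calculatedfinishedsquarefeet"] + rng.normal(0, 0.05, n)

# Scale numeric columns to [0, 1] and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["calculatedfinishedsquarefeet", "yearbuilt"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["regionidcounty"]),
])
pipeline = Pipeline([
    ("prep", preprocess),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    data, logerror, test_size=0.2, random_state=0)

# Assumed grid; cv=3 is set explicitly because newer scikit-learn
# releases default to 5-fold cross-validation.
param_grid = {"gbr__n_estimators": [50, 100], "gbr__max_depth": [2, 3]}
search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

pred = search.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(search.best_params_, f"MAE={mae:.4f}", f"RMSE={rmse:.4f}")
```

Wrapping the preprocessing in the pipeline means the scaler and encoder are refit on each training fold, which keeps information from the validation fold out of the fitted transformers.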

## 5-3- Improvement
There are different ways to improve this model, including doing more feature engineering, using more data, and using different algorithms. 
#### Feature Engineering 
As mentioned before, the original data set contains many features. Since all of them relate in some way to the characteristics of the home, it might be a good idea to create new features by combining existing ones. 
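
As one possible direction, the sketch below derives a few combined features from columns listed in Appendix 1. The specific combinations (`structure_value_per_sqft`, `floor_to_lot_ratio`, `home_age_2016`) and the helper name `add_ratio_features` are hypothetical suggestions that were not part of this project, and the sample values are made up.

```python
import numpy as np
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few hypothetical combined features."""
    out = df.copy()
    # Treat zero areas as missing so ratios become NaN instead of infinite.
    floor = out["calculatedfinishedsquarefeet"].replace(0, np.nan)
    lot = out["lotsizesquarefeet"].replace(0, np.nan)
    out["structure_value_per_sqft"] = out["structuretaxvaluedollarcnt"] / floor
    out["floor_to_lot_ratio"] = floor / lot
    # Age of the home in the 2016 transaction year.
    out["home_age_2016"] = 2016 - out["yearbuilt"]
    return out

# Made-up sample rows, including one with a zero floor size.
sample = pd.DataFrame({
    "calculatedfinishedsquarefeet": [1500.0, 2000.0, 0.0],
    "lotsizesquarefeet": [6000.0, 8000.0, 5000.0],
    "structuretaxvaluedollarcnt": [150000.0, 300000.0, 100000.0],
    "yearbuilt": [1966.0, 2006.0, 1990.0],
})
engineered = add_ratio_features(sample)
print(engineered[["structure_value_per_sqft", "floor_to_lot_ratio",
                  "home_age_2016"]])
```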
#### More data
After finishing the tuning, I found out that Zillow had released the 2017 data as well. A next step could be to use both the 2016 and 2017 transaction data so that the model generalizes better. This way, the model could be used to predict future logerror more confidently, since it would have been exposed to data from different years. In addition, the effect of different seasons could be captured better, since there would be two sets of data for each season. 
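
A minimal sketch of that idea, assuming each year's transactions are already loaded into a DataFrame with `parcelid`, `logerror`, and `transactiondate` columns. The rows below are made up, and the derived `transaction_year` and `transaction_month` columns are hypothetical additions.

```python
import pandas as pd

# Made-up stand-ins for the 2016 and 2017 transaction tables.
train_2016 = pd.DataFrame({
    "parcelid": [1, 2],
    "logerror": [0.01, -0.02],
    "transactiondate": pd.to_datetime(["2016-01-15", "2016-07-04"]),
})
train_2017 = pd.DataFrame({
    "parcelid": [3, 4],
    "logerror": [0.03, 0.00],
    "transactiondate": pd.to_datetime(["2017-01-20", "2017-07-09"]),
})

# Stack both years and derive calendar features for seasonality.
combined = pd.concat([train_2016, train_2017], ignore_index=True)
combined["transaction_year"] = combined["transactiondate"].dt.year
combined["transaction_month"] = combined["transactiondate"].dt.month

# Number of distinct years observed for each calendar month.
years_per_month = combined.groupby("transaction_month")["transaction_year"].nunique()
print(years_per_month)
```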
#### More complex algorithms
Last but not least, using different algorithms might improve the results as well. A neural network could be a good candidate, especially with the data from both 2016 and 2017. 

---


# 6- References 

1- https://www.kaggle.com/c/zillow-prize-1 <br>
2- https://www.kaggle.com/c/zillow-prize-1/discussion/33899 <br>
3- https://www.zillow.com/research/ <br>
4- Rascoff, Spencer, and Stan Humphries. Zillow Talk: Rewriting the Rules of Real Estate. Grand Central Publishing, 2015. <br>
5- Sangani, Darshan, Kelby Erickson, and Mohammad Al Hasan. "Predicting Zillow Estimation Error Using Linear Regression and Gradient Boosting." In 2017 IEEE 14th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), pp. 530-534. IEEE, 2017.<br>
6- https://www.census.gov/quickfacts/CA<br>
7- https://www.car.org/marketdata/data<br>
8- https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/<br>
9- https://www.scikit-learn.org <br>
10- https://onlinecourses.science.psu.edu/stat857/node/137<br>
11- http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/<br>
12- https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/ <br>
13- https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d<br> 
14- http://www.saedsayad.com/decision_tree_reg.htm<br>
15- https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf<br>
16- https://robinsones.github.io/Better-Plotting-in-Python-with-Seaborn/ <br>

---


# 7- Appendices

## Appendix 1
|Variable|Data Type|Variable|Data Type|
|:---|:---|:---|:---|
|airconditioningtypeid|float64|lotsizesquarefeet|float64|
|architecturalstyletypeid|float64|numberofstories|float64|
|assessmentyear|float64|parcelid|int64|
|basementsqft|float64|poolcnt|float64|
|bathroomcnt|float64|poolsizesum|float64|
|bedroomcnt|float64|pooltypeid10|float64|
|buildingclasstypeid|float64|pooltypeid2|float64|
|buildingqualitytypeid|float64|pooltypeid7|float64|
|calculatedbathnbr|float64|propertycountylandusecode|object|
|calculatedfinishedsquarefeet|float64|propertylandusetypeid|float64|
|censustractandblock|float64|propertyzoningdesc|object|
|decktypeid|float64|rawcensustractandblock|float64|
|finishedfloor1squarefeet|float64|regionidcity|float64|
|finishedsquarefeet12|float64|regionidcounty|float64|
|finishedsquarefeet13|float64|regionidneighborhood|float64|
|finishedsquarefeet15|float64|regionidzip|float64|
|finishedsquarefeet50|float64|roomcnt|float64|
|finishedsquarefeet6|float64|storytypeid|float64|
|fips|float64|structuretaxvaluedollarcnt|float64|
|fireplacecnt|float64|taxamount|float64|
|fireplaceflag|object|taxdelinquencyflag|object|
|fullbathcnt|float64|taxdelinquencyyear|float64|
|garagecarcnt|float64|taxvaluedollarcnt|float64|
|garagetotalsqft|float64|threequarterbathnbr|float64|
|hashottuborspa|object|transactiondate|datetime64|
|heatingorsystemtypeid|float64|typeconstructiontypeid|float64|
|landtaxvaluedollarcnt|float64|unitcnt|float64|
|latitude|float64|yardbuildingsqft17|float64|
|logerror|float64|yardbuildingsqft26|float64|
|longitude|float64|yearbuilt|float64|

## Appendix 2

|Feature|Number of Missing|Percent|Feature|Number of Missing|Percent|
|:---|:---|:---|:---|:---|:---|
|buildingclasstypeid|90259|99.98|pooltypeid7|73578|81.5|
|finishedsquarefeet13|90242|99.96|poolcnt|72374|80.17|
|basementsqft|90232|99.95|numberofstories|69705|77.21|
|storytypeid|90232|99.95|airconditioningtypeid|61494|68.12|
|yardbuildingsqft26|90180|99.89|garagecarcnt|60338|66.84|
|fireplaceflag|90053|99.75|garagetotalsqft|60338|66.84|
|architecturalstyletypeid|90014|99.71|regionidneighborhood|54263|60.11|
|typeconstructiontypeid|89976|99.67|heatingorsystemtypeid|34195|37.88|
|finishedsquarefeet6|89854|99.53|buildingqualitytypeid|32911|36.46|
|decktypeid|89617|99.27|propertyzoningdesc|31962|35.41|
|poolsizesum|89306|98.93|unitcnt|31922|35.36|
|pooltypeid10|89114|98.71|lotsizesquarefeet|10150|11.24|
|pooltypeid2|89071|98.67|finishedsquarefeet12|4679|5.18|
|taxdelinquencyflag|88492|98.02|regionidcity|1803|2|
|taxdelinquencyyear|88492|98.02|calculatedbathnbr|1182|1.31|
|hashottuborspa|87910|97.38|fullbathcnt|1182|1.31|
|yardbuildingsqft17|87629|97.07|yearbuilt|756|0.84|
|finishedsquarefeet15|86711|96.05|calculatedfinishedsquarefeet|661|0.73|
|finishedfloor1squarefeet|83419|92.41|censustractandblock|605|0.67|
|finishedsquarefeet50|83419|92.41|structuretaxvaluedollarcnt|380|0.42|
|fireplacecnt|80668|89.36|regionidzip|35|0.04|
|threequarterbathnbr|78266|86.7|taxamount|6|0.01|
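
A table like the one above can be produced with a few lines of pandas. The sketch below is an assumption about how such a summary might be computed, not the project's actual code; the helper name `missing_summary` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "Number of Missing": n_missing,
        "Percent": (100 * n_missing / len(df)).round(2),
    })
    # Keep only columns that actually have missing values.
    summary = summary[summary["Number of Missing"] > 0]
    return summary.sort_values("Number of Missing", ascending=False)

# Toy rows with made-up values.
toy = pd.DataFrame({
    "basementsqft": [np.nan, np.nan, np.nan, 500.0],
    "yearbuilt": [1990.0, np.nan, 2001.0, 1975.0],
    "parcelid": [1, 2, 3, 4],
})
report = missing_summary(toy)
print(report)
```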