### TECHNICAL APPROACH

**Preliminary Data Analysis**

A preliminary analysis of the descriptive statistics of the provided features showed that a large majority (225 of 256) had zero variance. Because a constant feature carries no information for the modeling process, these features were eliminated. Three further features were eliminated because each was perfectly linearly correlated with another feature, leaving 28 features. Finally, three samples in the training set were eliminated because their 'gap' value was negative, which is theoretically impossible.
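The filtering steps above can be sketched roughly as follows. This is a minimal illustration using NumPy, not the code actually used for the analysis; the function names `select_features` and `drop_negative_gap` are hypothetical, and the feature matrix is assumed to be a dense numeric array.

```python
import numpy as np

def select_features(X):
    """Return indices of columns that survive both filters."""
    X = np.asarray(X, dtype=float)
    # 1. Drop columns whose value never changes (zero variance).
    varying = [j for j in range(X.shape[1]) if X[:, j].max() > X[:, j].min()]
    if not varying:
        return []
    # 2. Among the rest, drop the later column of any pair with |r| = 1
    #    (perfect positive or negative linear correlation).
    corr = np.atleast_2d(np.corrcoef(X[:, varying], rowvar=False))
    kept = []  # positions within `varying`
    for a in range(len(varying)):
        if not any(np.isclose(abs(corr[a, b]), 1.0) for b in kept):
            kept.append(a)
    return [varying[a] for a in kept]

def drop_negative_gap(X, y):
    """Remove samples whose response ('gap') is negative."""
    X, y = np.asarray(X), np.asarray(y)
    mask = y >= 0
    return X[mask], y[mask]
```

Keeping the first column of each perfectly correlated pair is an arbitrary choice made here for illustration; either column carries the same information.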

In regression problems, it is common practice to transform the response so that it is closer to a normal distribution, often by taking its logarithm. In this dataset, the response variable was already approximately normally distributed and required no transformation.

**Ensemble Approach**

Many high-profile machine learning competitions, including the Netflix Prize [2], have been won by participants who used a blending or ensemble approach that combines many models into a single model. We decided to test this strategy as well. To train the ensemble model, we held out 10% of the training data (100,000 samples) as a validation set. The ensemble model's features were the predictions made on this validation set by gradient boosted tree models, linear regression with ridge regularization, neural networks, and MARS models. *(DAVID, DESCRIBE HERE WHICH ALGORITHM WE USED FOR THE ENSEMBLE)*
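The general shape of such a blend can be sketched as below. The blending algorithm is not stated in this section, so ordinary least squares is used purely as a stand-in and should not be read as the method we used; `fit_blender` and `blend` are hypothetical names, and `base_preds` is assumed to be an array with one column of validation-set predictions per base model.

```python
import numpy as np

def fit_blender(base_preds, y_val):
    """Least-squares weights (plus intercept) over base-model predictions.

    NOTE: least squares is an assumed placeholder blender.
    """
    A = np.column_stack([np.ones(len(y_val)), base_preds])
    coef, *_ = np.linalg.lstsq(A, y_val, rcond=None)
    return coef

def blend(coef, base_preds):
    """Combine base-model predictions using the fitted weights."""
    return coef[0] + np.asarray(base_preds) @ coef[1:]

def rmse(pred, y):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(y)) ** 2)))
```

On the data it was fitted to, a least-squares blend with an intercept can never have a higher RMSE than any single base model, since each base model on its own is one possible choice of weights. Whether that advantage carries over to unseen data has to be checked separately.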

**Gradient Boosted Trees**

Gradient boosting with trees is a non-parametric technique that combines many 'weak' decision tree models, trained sequentially so that each new tree fits the residuals left by the trees before it. Gradient boosted trees are particularly attractive because they can model non-linear behavior, feature interactions, and skewed features without transformation. Also, as an ensemble of many 'weak' learners, gradient boosted trees can be quite resistant to overfitting. The Python package XGBoost [1] was used for the gradient boosted trees model due to its speed and parallelization.

As the data set is quite large, hyperparameter tuning was performed on a subset of 200,000 samples from the training set. The following hyperparameters were tuned in a grid search with 5-fold cross-validation:

- `max_depth`: maximum depth of each boosted tree [6, 8, 10]

- `colsample_bytree`: proportion of features used in each boosted tree [0.5, 0.75, 1]

- `subsample`: proportion of total samples considered for each boosted tree [0.5, 0.8, 1]

Increasing any of these hyperparameters gives a model with higher variance and lower bias. The optimal values were chosen to balance variance against bias; for `max_depth`, `colsample_bytree`, and `subsample` they were found to be 10, 0.75, and 0.8 respectively. The RMSE appeared to converge as the number of boosted trees grew large. Training was typically performed with between 1000 and 2000 trees. The results of the grid search cross-validation also showed that the model converged to nearly the same RMSE irrespective of the hyperparameters, which is often the case for datasets with a very large sample size and a large number of trees. For this reason, the same optimal hyperparameters were used when fitting each of the three models with different degrees of feature engineering.
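The search described above can be sketched as follows. This is an illustrative outline rather than our actual tuning script: `expand_grid`, `cv_rmse`, and `grid_search` are hypothetical helpers, `X` and `y` are assumed to be NumPy arrays, and the tree count and fold seed are assumptions. It relies on `xgboost.XGBRegressor` and scikit-learn's `KFold`.

```python
from itertools import product

PARAM_GRID = {
    "max_depth": [6, 8, 10],
    "colsample_bytree": [0.5, 0.75, 1],
    "subsample": [0.5, 0.8, 1],
}

def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def cv_rmse(params, X, y, n_splits=5, n_estimators=1000):
    """Mean RMSE of an XGBoost regressor over K cross-validation folds."""
    # Imported lazily so the grid helper above works without these packages.
    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import KFold

    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X):
        # n_estimators=1000 is an assumed value for illustration.
        model = xgb.XGBRegressor(n_estimators=n_estimators, **params)
        model.fit(X[train_idx], y[train_idx])
        residual = model.predict(X[test_idx]) - y[test_idx]
        scores.append(float(np.sqrt(np.mean(residual ** 2))))
    return sum(scores) / len(scores)

def grid_search(X, y):
    """Return the parameter combination with the lowest mean CV RMSE."""
    return min(expand_grid(PARAM_GRID), key=lambda p: cv_rmse(p, X, y))
```

The grid has 3 × 3 × 3 = 27 combinations, so 5-fold cross-validation requires 135 model fits, which is why tuning on a subset of the data is attractive.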


### RESULTS

**Gradient Boosted Trees**

(TO GO INSIDE TABLE WITH OTHER RESULTS)
- XGBoost, 28 features: test RMSE 0.27207
- XGBoost, 41 features: test RMSE 0.15007
- XGBoost, 66 features: test RMSE 0.13691


### DISCUSSION

**Gradient Boosted Trees**

Analysis using XGBoost was broken down into three iterations. The first iteration used only the relevant provided features, and the model made a slight improvement on the Random Forest baseline. Because this model had already been tuned with a comprehensive grid search cross-validation and trained on 2000 trees, it appeared that this competition would be won by comprehensive feature engineering rather than by more powerful modeling. In the second iteration, thirteen new features were extracted using the rdMolDescriptors module from RDKit, including exact molecular weight and number of rings, among others. Our intuition was correct: adding these 13 features reduced the test RMSE from 0.27207 to 0.15007, a reduction of about 45%. Finally, twenty-five more features were extracted using the same module from RDKit, and the test RMSE improved slightly further, to 0.13691.
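As a rough illustration of the feature extraction step, the sketch below computes the two descriptors named above with `rdMolDescriptors.CalcExactMolWt` and `rdMolDescriptors.CalcNumRings`. It assumes the molecules are available as SMILES strings, which this section does not state; the helper names are hypothetical, and the remaining descriptors we extracted are not listed here.

```python
def descriptor_row(mol, descriptors):
    """Evaluate each named descriptor function on one molecule."""
    return [fn(mol) for fn in descriptors.values()]

def rdkit_features(smiles_list):
    """Return (feature names, rows of descriptor values) for SMILES input."""
    # Imported lazily so descriptor_row stays usable without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import rdMolDescriptors

    descriptors = {
        "exact_mol_wt": rdMolDescriptors.CalcExactMolWt,
        "num_rings": rdMolDescriptors.CalcNumRings,
    }
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)  # returns None if parsing fails
        if mol is None:
            raise ValueError(f"could not parse SMILES: {smiles!r}")
        rows.append(descriptor_row(mol, descriptors))
    return list(descriptors), rows
```

The resulting rows can be appended column-wise to the provided features before refitting the model.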


### REFERENCES


[1] XGBoost. https://github.com/dmlc/xgboost

[2] The BellKor Solution to the Netflix Grand Prize. http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf

### APPENDICES

All code from this practical can be found at the following Google Drive link:

https://drive.google.com/folderview?id=0B0e1_K8CvqynSWJuenRUdFRJN1U&usp=sharing
