## 6. Machine Learning
For this classification problem, I used several machine learning algorithms based on ensemble learning. The following algorithms were used to predict, from the given features, whether a restaurant will fail or remain open.

1. **Random Forest** (Bagging)
2. **AdaBoost** (boosting that increases the weight of misclassified data points)
3. **Gradient Boosting** (boosting that learns from previous mistakes through residual errors)

Based on the results from these three algorithms, I determine the best one by accuracy and computation time. Each algorithm was first evaluated with default parameters and then improved by optimizing model parameters with either RandomizedSearchCV or GridSearchCV.

The data was split with a standard train/test ratio of 7:3: 70% was used to fit each model and 30% was held out to evaluate it. The baseline for every model was at least 100 decision trees (or shallow trees such as stumps for AdaBoost and Gradient Boosting).
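As a minimal sketch of this setup, the split and the 100-tree baseline could look like the following. The synthetic data from `make_classification`, the `stratify` option, and the `random_state` values are assumptions for illustration, not the project's actual data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the restaurant feature matrix (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 70% of the rows are used for fitting, 30% are held out for evaluation.
# Stratifying on y keeps the class ratio similar in both parts (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Baseline: 100 trees, every other setting left at the library default.
baseline = RandomForestClassifier(n_estimators=100, random_state=42)
baseline.fit(X_train, y_train)

accuracy = baseline.score(X_test, y_test)
print(f"baseline accuracy: {accuracy:.3f}")
```

The same split can be reused for AdaBoost and Gradient Boosting so that all three models are compared on identical held-out rows.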

### 6.1 Random Forest (Bagging Approach)
My focus is to minimize overfitting and false positives in the results. Random forest is a natural first check because it reduces overfitting and handles large, high-dimensional datasets.

#### Default approach without hyperparameter tuning
The baseline with 100 decision trees (`n_estimators`) yielded the following result:

##### Initial Precision and Recall

<img src="img/initial_rf_result_img.png" alt="initial_rf_result" style="width: 40%;"/>

According to the initial confusion matrix, the random forest classifier is better at identifying open restaurants than closed ones: **42% false positives** and **8% false negatives**. Since this capstone project is about deciding whether to lend money to a restaurant, or whether an aspiring restaurateur should open one, I need to adjust the algorithm to focus on reducing false positives, because we don't want to lend to or invest in a business that will eventually close.
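To make the two error rates concrete, here is a small sketch of how they can be read off a confusion matrix. It assumes that 1 means open and 0 means closed, and that each rate is a share of its true class (row-normalized); the labels are made up and the project's figures may have been computed differently.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = open, 0 = closed (assumed encoding).
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
# normalize="true" divides each row by its total, giving per-class rates.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1], normalize="true")
tn_rate, fp_rate = cm[0]  # closed restaurants: kept as closed / predicted open
fn_rate, tp_rate = cm[1]  # open restaurants: predicted closed / kept as open

print(f"false positive rate: {fp_rate:.0%}")
print(f"false negative rate: {fn_rate:.0%}")
```

In this toy example, 2 of the 5 closed restaurants are predicted as open, which is the kind of error a lender wants to minimize.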

### Selecting the best tuning parameters (aka 'hyperparameters') for Random Forest
RandomizedSearchCV is used for Random Forest because the model has many parameters, so exhaustively searching for the best combination that does not overfit would take a lot of computation time. These are the parameters I tuned:

- n_estimators = number of trees in the forest
- max_features = max number of features considered for splitting a node
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points placed in a node before the node is split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = method for sampling data points (with or without replacement)
- criterion = function used to measure the quality of a split
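The search over the parameters listed above could be set up roughly as follows. The candidate values, `n_iter`, `cv`, and the synthetic data are all illustrative assumptions; the grid actually used in the project is not shown in this write-up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative candidate values only, kept small so the sketch runs quickly.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Sample 5 random combinations instead of trying all 324 of them.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
```

Because only `n_iter` combinations are evaluated, the cost stays fixed no matter how many candidate values are added, which is the main reason to prefer a randomized search over a full grid here.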

Randomized Search CV yielded the best parameters:
- n_estimators: 1800
- min_samples_split: 2
- min_samples_leaf: 1
- max_features: 'auto'
- max_depth: 50
- criterion: entropy
- bootstrap: False

#### Test result
<img src="img/rf_result_img.png" alt="rf_result" style="width: 40%;"/>

#### Feature Importance
<img src="img/rf_feat_importance_img.png" alt="rf_feat_importance" style="width: 40%;"/>

#### RandomForest Summary

RandomizedSearchCV improved overall accuracy by 1% by identifying more true negatives, but it did not significantly reduce false positives. Random Forest gives equal weight to every decision tree, which resulted in about 80% accuracy; next, Adaptive Boosting will show whether giving different weights to each tree yields a better result.

Lifespan, pos, star rating, and sentiment-related features carried the strongest signal in determining whether a restaurant will fail, with particularly strong emphasis on the lifespan feature. Later sections check whether this holds for the other two algorithms.
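A ranking like the one in the feature importance chart can be read from a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data, and the feature names are hypothetical labels attached to random columns, so the resulting order says nothing about the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names attached to synthetic columns (assumption).
feature_names = ["lifespan", "pos", "stars", "review_count", "price"]
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=1
)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = model.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)

for name, score in ranking:
    print(f"{name:>12}: {score:.3f}")
```

AdaBoost and Gradient Boosting classifiers in scikit-learn expose the same attribute, which makes it straightforward to compare rankings across the three models.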

### 6.2 AdaBoost (Sampling Distribution)
I used the AdaBoost algorithm to inspect how well the model performs when it focuses on what it misclassifies through its **sampling distribution**. In other words, data points that were misclassified receive more weight, so each new stump learns from the mistakes of the previous ones.
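The reweighting idea can be illustrated with a single boosting round. This is a sketch of the classic textbook update for binary AdaBoost, not scikit-learn's exact implementation, and the stump's hits and misses are made up.

```python
import numpy as np

# Whether a hypothetical stump classified each of 10 points correctly.
correct = np.array([True] * 8 + [False] * 2)

# Every point starts with the same weight.
weights = np.full(len(correct), 1 / len(correct))

# Weighted error of the stump and its resulting "say" in the final vote.
error = weights[~correct].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Misclassified points are scaled up, correct ones are scaled down,
# then the weights are renormalized into a distribution.
weights = weights * np.exp(np.where(correct, -alpha, alpha))
weights = weights / weights.sum()

print(weights)
```

After the update, the two misclassified points together carry half of the total weight, so the next stump is pushed toward getting them right.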

#### Default approach without hyperparameter tuning
This baseline uses AdaBoost without tuning, with the exception of `n_estimators` (the number of stumps to create).

#### Initial Recall and Precision
<img src="img/initial_ada_result_img.png" alt="initial_ada_result" style="width: 40%;"/>

AdaBoost took **5.45 s** to create and evaluate 100 decision trees, with **76%** accuracy without hyperparameter tuning. Overall, AdaBoost gave a less desirable result than the random forest classifier, since it had a higher false negative rate at **14%**. However, it did produce fewer false positives at **40.5%**, a decrease from **42%**.

### Selecting the best tuning parameters for AdaBoost
I used GridSearchCV for optimization, tuning the following parameters:

- **n_estimators**: maximum number of estimators (stumps) at which boosting is terminated
- **learning_rate**: weight applied to each stump's contribution at each boosting iteration; a lower rate shrinks each contribution
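With only two parameters, an exhaustive grid is affordable. A minimal sketch of such a search is shown below; the candidate values, `cv`, and the synthetic data are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative grid: 3 x 3 = 9 combinations, each cross-validated.
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
}

grid = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

print(grid.best_params_, f"{grid.best_score_:.3f}")
```

Unlike the randomized search used for Random Forest, every combination in the grid is evaluated, so the result does not depend on which candidates happened to be sampled.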

#### Test result
<img src="img/rf_result_img.png" alt="rf_result" style="width: 40%;"/>

#### Feature Importance
<img src="img/ada_feats_img.png" alt="ada_feats" style="width: 40%;"/>

AdaBoost's feature importance is slightly different: the last-30-day review count and several other restaurant attributes turned out to be more important in identifying closed and open restaurants. Lifespan, revenue, and sentiment-score-related features are consistently ranked high in determining a restaurant's business status.

### AdaBoost Summary
The initial AdaBoost model with default settings took **5.45 s** to create and evaluate 100 decision trees, with **76%** accuracy, which is lower than Random Forest's initial accuracy. However, it did produce fewer false positives at **40.5%**, a decrease from **42%**. I tuned its hyperparameters using GridSearchCV, which selected **n_estimators of 200** with a **learning_rate of 0.5**.

After tuning, it yielded a better overall accuracy of **77%**, identifying more true negatives but at the expense of gaining more false positives.

### 6.3 Gradient Boosting
AdaBoost gave a fair result in identifying closed and open restaurants through its sampling distribution and by giving each stump its own weight in the final vote. I used gradient boosting to see whether it gives a similar or better result than Random Forest (which currently holds the best accuracy) by fitting residual errors directly instead of reweighting data points.
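The residual-fitting idea can be sketched in a few lines. This is a simplified illustration for binary classification with log loss, assuming synthetic data, depth-1 trees, and an arbitrary learning rate; scikit-learn's `GradientBoostingClassifier` adds further refinements and is what the project actually used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)


def log_loss(y_true, prob):
    """Mean binary cross-entropy, clipped for numerical safety."""
    prob = np.clip(prob, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(prob) + (1 - y_true) * np.log(1 - prob))


learning_rate = 0.2  # illustrative value

# Start every point at the log-odds of the positive class.
prior = y.mean()
scores = np.full(len(y), np.log(prior / (1 - prior)))

losses = []
for _ in range(20):
    prob = 1 / (1 + np.exp(-scores))
    losses.append(log_loss(y, prob))
    # Residual = label minus predicted probability (negative gradient of log loss).
    residual = y - prob
    # Each new stump is fitted to the current residuals, not to the labels.
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    scores += learning_rate * stump.predict(X)

print(f"log loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Each round nudges the scores in the direction that reduces the remaining error, which is the contrast with AdaBoost: the data points keep equal weight, and the targets change instead.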

#### Initial Recall and Precision
<img src="img/initial_grad_result_img.png" alt="initial_grad_result" style="width: 40%;"/>

Gradient boosting did better at predicting open and closed restaurants than AdaBoost but worse than Random Forest. Like the other algorithms, it did not do a very good job of predicting closed restaurants; instead, it misclassified closed restaurants as open. It has the highest false positive rate (**43%**) of the three algorithms. Overall, its initial accuracy is **78%**.

### Selecting the best tuning parameters for Gradient Boosting
As with AdaBoost, I used GridSearchCV for optimization, tuning the following parameters:

- **n_estimators**: maximum number of estimators (trees) at which boosting is terminated
- **learning_rate**: factor that shrinks the contribution of each tree; lower values usually need more estimators

#### Test result
<img src="img/grad_result_img.png" alt="grad_result" style="width: 40%;"/>

#### Feature Importance
<img src="img/ada_feats_img.png" alt="ada_feats" style="width: 40%;"/>

### Gradient Boosting Summary
The initial Gradient Boosting model with default settings took **9.24 s** to create and evaluate 100 decision trees, with **78%** accuracy, which is lower than Random Forest's initial accuracy but higher than AdaBoost's. However, it produced the highest false positive rate at **43%**. I tuned its hyperparameters using GridSearchCV, which selected **n_estimators of 200** with a **learning_rate of 0.2**.

After tuning, it yielded the best overall accuracy at **79.9%** and the fewest false positives at **36.9%**. So far, I would recommend gradient boosting, as it has a lower chance of lending to or investing in businesses that are going to fail.

### 6.4 ROC and AUC
The ROC curve and AUROC values were computed for each model and plotted for visualization.

<img src="img/auroc_result_img.png" alt="auroc_result" style="width: 40%;"/>
<img src="img/roc_plot_img.png" alt="roc_plot" style="width: 40%;"/>
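For reference, this is roughly how the curve points and the AUROC value can be obtained with scikit-learn. The labels and scores below are a tiny made-up example; in the project, the scores would come from each model's predicted probability for the positive class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per decision threshold; these are the plotted values.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve: 1.0 is a perfect ranking, 0.5 is chance level.
auc = roc_auc_score(y_true, y_score)
print(f"AUROC: {auc:.2f}")
```

Plotting `fpr` against `tpr` for each of the three models on the same axes gives a chart like the one above.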

### Summary
Random Forest and Gradient Boosting gave the best outcomes in predicting whether restaurants are open or closed based on the given features. Each algorithm had a different feature importance ranking, but a few features appeared repeatedly: lifespan, sentiment score, star rating, revenue, and review count. Those five features played an important role in determining whether a restaurant will thrive or fail in the hospitality industry, with a precision of **84%** and an overall F1-score of **80%**.

## 7. Suggested Improvement
There are several ways to improve performance in identifying a restaurant's business status by using additional data sources.

- Using population demographics and income levels to gauge whether they impact a restaurant's price range or revenue.
- Comparing nearby similar competitors to see whether one restaurant's performance affects that of similar restaurants nearby.
- Using actual restaurant revenue data instead of speculative revenue calculated from the price and review_count columns.

## 8. Project Summary
- The model was built for restaurant lending purposes: it helps decide whether to invest in an independent restaurant based on its data.
- The Yelp dataset was used and analyzed to build a classification model that identifies a restaurant's business status.
- Four major predictive features were identified: lifespan, sentiment analysis scores, star ratings, and revenue.
- Three ML algorithms based on bagging and boosting were used (Random Forest, AdaBoost, and Gradient Boosting), which yielded a highest precision score of **79%** and recall score of **90%**, with an overall F1-score of **80%**.

None of the restaurant attributes, such as cuisine, service type, and venue type, yielded any significant result in identifying a restaurant's status. Price (general dining cost) did not matter in whether a restaurant remains open, as indicated by the algorithms' feature importance. In conclusion, as common as this may sound, a long lifespan and positive sentiment scores were shown to be strong predictors of healthy, stable restaurants across the data analysis and multiple machine learning algorithms.