## Evaluation

### Decision Tree
The Decision Tree model reached **100%** accuracy on the training dataset but only **68%** accuracy on the testing dataset. A gap this large between the two accuracies is a sign of overfitting. This happens because a decision tree is built greedily and, with no limit on its depth, keeps splitting until it fits the training data exactly, including its noise.  

**Solution:** Reduce the max depth or use a model that is less prone to overfitting. In this case, a Random Forest was used as the next model. 
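
The train/test gap described above can be reproduced on toy data. The following is a minimal sketch, assuming scikit-learn is available; the synthetic dataset, the label-noise level, and the depth limit of 4 are illustrative assumptions, not values from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the survey dataset is not reproduced here.
# flip_y adds label noise, which a fully grown tree will memorize.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)


def train_test_accuracy(max_depth):
    """Fit a tree with the given depth limit and return (train, test) accuracy."""
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    return clf.score(X_train, y_train), clf.score(X_test, y_test)


full_train, full_test = train_test_accuracy(None)    # no depth limit
shallow_train, shallow_test = train_test_accuracy(4)  # hypothetical depth limit

print(f"unlimited depth: train={full_train:.3f} test={full_test:.3f}")
print(f"max_depth=4:     train={shallow_train:.3f} test={shallow_test:.3f}")
```

Comparing the two printed lines shows how far the training score sits above the testing score when the tree is allowed to grow without a limit.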

### Random Forest
The Random Forest was set with a max depth and no restriction on the maximum number of features, to keep its parameters similar to the decision tree's. A Random Forest builds upon decision trees: it is an ensemble of trees that vote on the result. The Random Forest model had an **83.45%** accuracy on the training dataset and a **77.19%** accuracy on the testing dataset. Although the training accuracy decreased compared with the decision tree, the testing accuracy increased. This model was not overfitted, but its parameters were chosen to mirror the decision tree's default parameters rather than tuned. 
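
The voting idea can be illustrated with a small pure-Python sketch. The helper `majority_vote` and the example votes are hypothetical and only show the principle; scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree label lists into one label per sample by majority vote.

    tree_predictions: list of lists, one inner list of labels per tree.
    With binary labels, an odd number of trees avoids ties.
    """
    combined = []
    for votes in zip(*tree_predictions):  # one tuple of votes per sample
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Three hypothetical trees voting on four respondents (1 = takes the vaccine).
votes_per_tree = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(votes_per_tree))  # [1, 1, 1, 0]
```

Each individual tree gets one respondent "wrong" relative to the combined answer, which is why an ensemble tends to be more stable than any single tree.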

**Solution:** To improve the model, GridSearch was used to search through different hyperparameters. A dictionary of parameter values was searched to find the combination that gives the random forest model the best accuracy. 
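
A grid search over a parameter dictionary might look like the sketch below. It assumes scikit-learn's `GridSearchCV`; the grid values, the 3-fold cross-validation, and the synthetic data are illustrative assumptions, since the grid actually searched in the notebook is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), not the survey dataset.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hypothetical grid: 2 x 2 x 2 = 8 parameter combinations.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 6],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # every combination is scored with 3-fold cross-validation
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 4))
print("test accuracy:", round(search.score(X_test, y_test), 4))
```

Because every combination is refit once per fold, the cost grows multiplicatively with each added parameter value, which is why only a short list of values is practical.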

### Tuned Random Forest
GridSearch found the best parameters from the list. The testing accuracy improved slightly, which means the tuned parameters were better than the ones given before. 

## Conclusion
The decision tree classifier performed well on the training dataset but not on the testing dataset. To resolve that, a random forest was used instead to keep the model from overfitting the training data. However, given the large number of parameter combinations, GridSearch was used to find the best-performing combination for the Random Forest model. Although the model only had an accuracy of roughly 80%, that should be good enough to allow medical centers to order a reasonable amount of vaccines when the model predicts a respondent's willingness to take the vaccine. 

## Example 

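```python
import pandas as pd


def prediction_df(list_of_clf, X):
    """Collect each classifier's predictions on X as one column per model."""
    columns = ['Tree Pred', 'Forest Pred', 'Tuned Pred']
    combined_df = pd.DataFrame(columns=columns)
    # Classifiers must be passed in the same order as the column names.
    for name, clf in zip(columns, list_of_clf):
        combined_df[name] = clf.predict(X)

    return combined_df
```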

```python
# Align the true labels with each model's predictions on the test set.
predictions = prediction_df([tree_clf, forest_clf, forest_clf_v2], X_test_scaled)
example_df = pd.concat([y_test.reset_index().drop('index', axis=1), predictions], axis=1)
```

```python
number = len(example_df)
# Respondents in the test set who actually took the seasonal vaccine.
true_pred = len(example_df.loc[example_df['seasonal_vaccine'] == 1])
for clf in example_df.columns[1:]:
    # Respondents this model predicts will take the vaccine.
    clf_pred = len(example_df.loc[example_df[clf] == 1])
    # Order 5% more doses than predicted as a safety margin.
    estimated = round(clf_pred * 1.05)

    print(clf)
    print("Number of Patients: ", number)
    print("Number of predicted patients taken vaccines: ", clf_pred)
    print("Number of true patients taken vaccines: ", true_pred)
    print("Estimated vaccines needed ", estimated)
    print("Wasted Vaccines: ", estimated - true_pred)
```

```
Tree Pred
Number of Patients:  5342
Number of predicted patients taken vaccines:  2454
Number of true patients taken vaccines:  2466
Estimated vaccines needed  2577
Wasted Vaccines:  111
Forest Pred
Number of Patients:  5342
Number of predicted patients taken vaccines:  2441
Number of true patients taken vaccines:  2466
Estimated vaccines needed  2563
Wasted Vaccines:  97
Tuned Pred
Number of Patients:  5342
Number of predicted patients taken vaccines:  2416
Number of true patients taken vaccines:  2466
Estimated vaccines needed  2537
Wasted Vaccines:  71
```


| Number of Patients | Predicted Vaccinated | Actually Vaccinated | Supply of Vaccine | Wasted Vaccines |
|:------:|:-------------:|:------------:|:----:|:---:|
| 5342 | 2416 | 2466 | 2537 | 71 |
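
The arithmetic behind these figures is small enough to sketch as a standalone function. The name `vaccine_order` and its `buffer` parameter are illustrative and not part of the notebook; the 5% margin matches the `1.05` factor used in the code above.

```python
def vaccine_order(predicted_takers, actual_takers, buffer=0.05):
    """Return (doses_to_order, surplus) for a predicted number of takers.

    buffer is the safety margin added on top of the prediction.
    A negative surplus means a shortage.
    """
    doses_to_order = round(predicted_takers * (1 + buffer))
    surplus = doses_to_order - actual_takers
    return doses_to_order, surplus


# Tuned Random Forest: 2416 predicted takers, 2466 actual takers.
print(vaccine_order(2416, 2466))  # (2537, 71)
```

Without the margin, `vaccine_order(2416, 2466, buffer=0)` returns `(2416, -50)`, a shortage of 50 doses, which is why the prediction is padded before ordering.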

## Limitations
Unfortunately, the classifier has a very large number of possible parameter combinations, and it is too computationally expensive to fit and compare all of them. That is why only a limited list of parameter values was searched, to save time while narrowing down the best combination. A given model may also perform better on the training set than on the testing set, or vice versa. Given enough time, the best-performing model for a given state could be found, but the search would take too long to be realistic for producing predictions. 
1. The majority of the questions target the H1N1 vaccine, so they do not reflect well whether a respondent would take the seasonal vaccine. <br>
Since the seasonal vaccine is given every year, a patient may be more likely to opt out of it for a year. This is quite different from H1N1, where the vaccine lasts longer and the virus tends to be more dangerous than the seasonal flu. 
2. New generations and events may affect people's willingness to take the vaccines. <br>
COVID-19 has had an impact on people's perception of vaccines, for better or worse. Along with other factors such as religion, people's opinions will change, which can make the model ineffective if it is not updated for the new population. 

## Recommendations
1. Given time, I would recommend exploring different classification models and different combinations of parameters to improve the model. <br>
Ideally, an 80% accuracy would be preferred, but it is not strictly needed since a medical center should keep a range of vaccine supply to account for situations such as new patients. 
2. Modified or targeted questions should be added to the survey. <br>
The survey primarily asked about the H1N1 virus and could be adapted to the current state of the world, such as COVID-19.
3. Determine an adequate amount of spare vaccines to allocate from one center to another. <br>
Not all medical centers will have the same results, so they should be treated regionally to maximize the accuracy of the model. 