## Notebook Goals

The goal of this notebook is to find the best models to carry forward.  
`GridSearchCV` is used for hyper-parameter tuning.

## Import

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

from sklearn.feature_selection import f_classif
from sklearn.metrics import precision_score, recall_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('../data/amzn_final_dataset.csv',index_col='AMZN')
df.dropna(inplace=True)
df['c_four_percent_high'] = df['c_four_percent_high'].map({'Buy': 1, '0': 0})
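
`Series.map` with a dict turns every value that is missing from the dict into `NaN`, so it can be worth checking the mapped target before modeling. Here is a minimal sketch on a made-up three-row frame. The toy values, including the `'Hold'` label, are assumptions for illustration and not rows from `amzn_final_dataset.csv`.

```python
import pandas as pd

# Toy stand-in for the target column; 'Hold' is a hypothetical unmapped label.
toy = pd.DataFrame({'c_four_percent_high': ['Buy', '0', 'Hold']})

mapped = toy['c_four_percent_high'].map({'Buy': 1, '0': 0})

# Values absent from the dict become NaN, so count them before fitting.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print('unmapped rows:', n_unmapped)
```

If the count is above zero, the dict probably needs another key, or those rows need to be dropped explicitly.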

## Best estimator models

### Decision Tree

In [3]:
a = df.iloc[-1500: ]

y = a['c_four_percent_high']

x = a[['SMA', 'ROC', 'ATR', 'ADX',
        'High', 'Low', 'Close'
       ]]

> After running the Decision Tree feature importance in the modeling notebook, only a handful of features turned out to be necessary for the same predictions.
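
The feature-importance step itself lives in the modeling notebook and is not reproduced here. As a rough illustration of the idea, the sketch below fits a tree on synthetic data (the column names `signal`, `noise_a` and `noise_b` are made up) and ranks the columns by `feature_importances_`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: only 'signal' drives the label, the other columns are noise.
X = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise_a': rng.normal(size=500),
    'noise_b': rng.normal(size=500),
})
y = (X['signal'] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# feature_importances_ sums to 1; a higher value means the feature
# contributed more impurity reduction across the tree's splits.
importances = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Columns with importance near zero are candidates to drop, which is the kind of pruning that led to the short feature list used here.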

This prediction model is very sensitive to how the data is split.  
In the loop below, I run the grid search 25 times on different random splits and collect the results in a data frame, to help rule out grid search parameters that cause fitting issues or bad estimates.

In [4]:
dtc = DecisionTreeClassifier()
b = pd.DataFrame(columns = ['train', 'test', 'criterion', 'max_depth', 'min_samples_leaf', 'min_samples_split'])

for i in range(25): 
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

    param_grid = {
                'criterion': ['gini', 'entropy'],
                'max_depth': [2, 3, 4],
                'min_samples_split': [2, 5, 10],
                'min_samples_leaf': [1, 2, 3, 4, 5, 6]}

    grid_search = GridSearchCV(dtc, param_grid, cv=3, return_train_score=True, scoring = 'precision')


    grid_search.fit(x_train, y_train)


    best_model = grid_search.best_estimator_

    train_model = best_model.predict(x_train)

    test_model = best_model.predict(x_test)

    train_score = precision_score(y_train, train_model)
    test_score = precision_score(y_test, test_model)


    b.loc[i] = [train_score,
                test_score,
                grid_search.best_params_['criterion'],
                grid_search.best_params_['max_depth'],
                grid_search.best_params_['min_samples_leaf'],
                grid_search.best_params_['min_samples_split']]
    
b.sort_values(by='test', ascending=False)

| run | train | test | criterion | max_depth | min_samples_leaf | min_samples_split |
|---|---|---|---|---|---|---|
| 13 | 0.884058 | 0.866667 | entropy | 4 | 1 | 10 |
| 22 | 0.586022 | 0.714286 | gini | 2 | 1 | 2 |
| 21 | 0.572973 | 0.714286 | gini | 2 | 1 | 2 |
| 10 | 0.713115 | 0.69697 | entropy | 4 | 6 | 2 |
| 7 | 0.586592 | 0.690476 | gini | 2 | 1 | 2 |
| 24 | 0.821053 | 0.666667 | entropy | 4 | 5 | 2 |
| 12 | 0.582857 | 0.660377 | entropy | 2 | 1 | 2 |
| 11 | 0.882979 | 0.653846 | entropy | 4 | 1 | 2 |
| 18 | 0.803419 | 0.653846 | entropy | 4 | 5 | 5 |
| 1 | 0.586592 | 0.653061 | gini | 2 | 6 | 5 |
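
One possible way to read a results frame like this is to group the runs by a parameter and compare the average test precision and the train/test gap. The sketch below re-enters the ten displayed rows by hand (the other 15 runs are not shown, so this is only a partial view) and groups by `max_depth`. The names `b_top`, `gap` and `summary` are introduced here for illustration.

```python
import pandas as pd

# The ten rows displayed above, copied by hand (the remaining 15 runs are not shown).
b_top = pd.DataFrame(
    [
        (0.884058, 0.866667, 'entropy', 4, 1, 10),
        (0.586022, 0.714286, 'gini', 2, 1, 2),
        (0.572973, 0.714286, 'gini', 2, 1, 2),
        (0.713115, 0.696970, 'entropy', 4, 6, 2),
        (0.586592, 0.690476, 'gini', 2, 1, 2),
        (0.821053, 0.666667, 'entropy', 4, 5, 2),
        (0.582857, 0.660377, 'entropy', 2, 1, 2),
        (0.882979, 0.653846, 'entropy', 4, 1, 2),
        (0.803419, 0.653846, 'entropy', 4, 5, 5),
        (0.586592, 0.653061, 'gini', 2, 6, 5),
    ],
    columns=['train', 'test', 'criterion', 'max_depth', 'min_samples_leaf', 'min_samples_split'],
)

# Gap between train and test precision; large positive gaps hint at overfitting.
b_top['gap'] = b_top['train'] - b_top['test']

# How often each depth was selected, with its average test precision and gap.
summary = (
    b_top.groupby('max_depth')
    .agg(runs=('test', 'count'), mean_test=('test', 'mean'), mean_gap=('gap', 'mean'))
    .round(3)
)
print(summary)
```

In these ten rows the depth-4 trees show a positive average gap while the depth-2 trees score higher on test than on train, which is one sign of how much the split itself moves the numbers.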


Because the results depend so heavily on the split, I tried several `random_state` integers to find the best model.  
I am looking for a test precision above 0.69 (good investments vs. bad investments) that doesn't overfit (train and test precision within 0.05 of each other), with a test recall above 0.2 (to limit missed opportunities). Note that the filter in the loop that follows uses a looser recall cutoff of 0.11.
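
For reference, the two metrics can be worked out on a tiny made-up example. The labels below are invented for illustration, with 1 standing for a "Buy" day.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = 'Buy' day, 0 = no signal.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

# precision = TP / (TP + FP): of the days flagged as buys, how many were right.
# recall    = TP / (TP + FN): of the real buy days, how many were caught.
precision = precision_score(y_true, y_pred)  # 1 TP, 1 FP -> 0.5
recall = recall_score(y_true, y_pred)        # 1 TP, 4 FN -> 0.2
print(precision, recall)
```

A model tuned for precision, as here, accepts a low recall: it flags few days, but most of the flagged days should be good ones.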

In [5]:
for i in range(790, 800): 
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=i)

    param_grid = {
                'criterion': ['gini', 'entropy'],
                'max_depth': [2, 3, 4],
                'min_samples_split': [2, 5, 10],
                'min_samples_leaf': [1, 2, 3, 4, 5, 6]}

    grid_search = GridSearchCV(dtc, param_grid, cv=3, return_train_score=True, scoring = 'precision', n_jobs=-1)


    grid_search.fit(x_train, y_train)


    best_model = grid_search.best_estimator_

    train_model = best_model.predict(x_train)

    test_model = best_model.predict(x_test)

    train_score = precision_score(y_train, train_model)
    test_score = precision_score(y_test, test_model)
    
    recall_test = recall_score(y_test, test_model)


    diff = abs(train_score - test_score)
    
    if test_score > .69 and diff < .05 and recall_test > .11:
        print('random_state :',i)
        print('train precision score :', round(train_score,2))
        print('test precision score :', round(test_score,2))
        print(grid_search.best_params_)
        print('test recall score :', round(recall_test,2))
        print()

random_state : 790
train precision score : 0.86
test precision score : 0.84
{'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 6, 'min_samples_split': 10}
test recall score : 0.21



Best Choices:
_________

random_state : 790  
train precision score : 0.86  
test precision score : 0.84  
{'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 6, 'min_samples_split': 10}  
test recall score : 0.21  

    ------------
    | 93 | 354 |
    ------------
    | 16 | 1040|  
    ------------
      
    
    
random_state : 711  
train precision score : 0.75  
test precision score : 0.74  
{'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2}  
test recall score : 0.27   

    -------------
    | 141 | 306 |
    -------------
    | 47 | 1030 |  
    -------------
        
          
          
random_state : 253  
train precision score : 0.8  
test precision score : 0.76  
{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 2, 'min_samples_split': 10}  
test recall score : 0.24  

    -------------
    | 130 | 317 |
    -------------
    | 33  | 1020|  
    -------------

> These are the top three parameter sets found across the random states I tried.  
> Each has a strength: random state 790 gives the highest precision, and 711 the highest recall.  
> I am moving forward with the random state of 790.
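
The boxes above are unlabeled. Their counts look consistent with a layout of `[[TP, FN], [FP, TN]]`, since 93 / (93 + 16) ≈ 0.85 and 93 / (93 + 354) ≈ 0.21 for random state 790, but that reading is an inference rather than something the notebook states. Note that `sklearn.metrics.confusion_matrix` orders a binary matrix differently, as `[[TN, FP], [FN, TP]]`. The sketch below rebuilds label vectors from the assumed counts to show both.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts from the random_state 790 box, read as TP=93, FN=354, FP=16, TN=1040 (assumed layout).
tp, fn, fp, tn = 93, 354, 16, 1040

# Rebuild label vectors that reproduce those counts.
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# Rows are actual classes, columns are predicted classes, labels sorted 0 then 1.
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(cm)
print('precision:', round(precision, 3))
print('recall   :', round(recall, 3))
```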

## Ensemble Models

The ensemble models take the tuned base model as their starting estimator, to improve overall performance and to make predictions more stable on newer data.

> base model - Decision Tree()  
>{'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 6, 'min_samples_split': 10}  
> random_state: 790

In [6]:
y = a['c_four_percent_high']

# 'Low' is left out here, unlike the feature list used for the grid search above
x = a[['SMA', 'ROC', 'ATR', 'ADX',
        'High', 'Close'
       ]]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=790)

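# The grid search reported min_samples_split=10. With min_samples_leaf=6 a node
# needs at least 12 samples to split, so 2 and 10 produce the same tree.
dtc = DecisionTreeClassifier(criterion='gini',
                             max_depth=4,
                             min_samples_leaf=6,
                             min_samples_split=2)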


### Random Forest

In [7]:
forest = RandomForestClassifier()

param_grid = {
            'criterion': ['gini'],
            'max_depth': [4],
            'min_samples_leaf': [6],
            'min_samples_split': [2],
            'n_estimators': [800]}

grid_search = GridSearchCV(forest, param_grid, cv=3, return_train_score=True, scoring = 'precision', n_jobs=-1)


grid_search.fit(x_train, y_train)
best_model = grid_search.best_estimator_

# Precision on the training set
train_model = best_model.predict(x_train)
train_score = precision_score(y_train, train_model)

# Precision on the held-out test set
test_model = best_model.predict(x_test)
test_score = precision_score(y_test, test_model)

print(f"Precision Training Score: {train_score :.2%}")
print(f"Precision Test Score: {test_score :.2%}")
print("Best Parameter Combination Found During Grid Search:")
print(grid_search.best_params_)

Precision Training Score: 80.34%
Precision Test Score: 63.64%
Best Parameter Combination Found During Grid Search:
{'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 6, 'min_samples_split': 2, 'n_estimators': 800}
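
The fit, predict and score steps above are repeated for each ensemble in this notebook. A small helper could factor them out. The function `precision_report` below is hypothetical (it is not part of the original notebook), and the data comes from `make_classification` rather than the stock dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split


def precision_report(model, x_train, x_test, y_train, y_test):
    """Return (train, test) precision for an already fitted model."""
    train_score = precision_score(y_train, model.predict(x_train))
    test_score = precision_score(y_test, model.predict(x_test))
    return train_score, test_score


# Synthetic stand-in for the stock features.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_depth=4, min_samples_leaf=6, random_state=0)
forest.fit(x_train, y_train)

train_score, test_score = precision_report(forest, x_train, x_test, y_train, y_test)
print(f"Precision Training Score: {train_score:.2%}")
print(f"Precision Test Score: {test_score:.2%}")
```

The same call would work for `grid_search.best_estimator_` from any of the ensemble cells, which keeps the train/test comparison identical across models.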


### AdaBoost

In [8]:
# scikit-learn 1.2 renamed `base_estimator` to `estimator`; use that keyword on newer releases
ada = AdaBoostClassifier(base_estimator=dtc)

param_grid = {
        'n_estimators': [25],
        'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.5]}

grid_search = GridSearchCV(ada, param_grid, cv=3, return_train_score=True, scoring = 'precision', n_jobs=-1)


grid_search.fit(x_train, y_train)
best_model = grid_search.best_estimator_

# Precision on the training set
train_model = best_model.predict(x_train)
train_score = precision_score(y_train, train_model)

# Precision on the held-out test set
test_model = best_model.predict(x_test)
test_score = precision_score(y_test, test_model)

print(f"Precision Training Score: {train_score :.2%}")
print(f"Precision Test Score: {test_score :.2%}")
print("Best Parameter Combination Found During Grid Search:")
print(grid_search.best_params_)

Precision Training Score: 85.71%
Precision Test Score: 84.00%
Best Parameter Combination Found During Grid Search:
{'learning_rate': 0.001, 'n_estimators': 25}
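
These scores match the rounded ones printed for the single tree at random state 790 (0.86 / 0.84). With a learning rate as small as 0.001, each boosting round barely reweights the samples, so the ensemble may end up predicting almost exactly like its base tree. One way to check is to measure how often the two models agree. The sketch below does that on synthetic data; the estimator is passed positionally so the snippet does not depend on whether the keyword is `base_estimator` or `estimator`, and `make_tree` is a helper name introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the stock features.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


def make_tree():
    # Same shape of tree as the tuned base model (depth 4, leaves of at least 6 samples).
    return DecisionTreeClassifier(max_depth=4, min_samples_leaf=6, random_state=0)


tree = make_tree().fit(x_train, y_train)
ada = AdaBoostClassifier(make_tree(), n_estimators=25, learning_rate=0.001, random_state=0)
ada.fit(x_train, y_train)

# Share of test rows where the boosted ensemble and the single tree predict the same class.
agreement = np.mean(tree.predict(x_test) == ada.predict(x_test))
print(f"share of identical test predictions: {agreement:.2%}")
```

If the agreement is close to 100% on the real data as well, the boosted model is adding little beyond the base tree at this learning rate, and larger rates or more estimators would be the settings to revisit.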


### Gradient Boosting

In [9]:
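# `init` supplies the initial predictions that the boosting stages then correct
grad_boost = GradientBoostingClassifier(init=dtc)

# Newer scikit-learn releases use 'log_loss' instead of 'deviance'
# and 'squared_error' instead of 'mse'
param_grid = {
    'n_estimators': [30],
    'learning_rate': [0.001, 0.01, 0.1, 0.2],
    'loss': ['deviance', 'exponential'],
    'criterion': ['friedman_mse', 'mse']
}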


grid_search = GridSearchCV(grad_boost, param_grid, cv=3, return_train_score=True, scoring = 'precision', n_jobs=-1)



grid_search.fit(x_train, y_train)
best_model = grid_search.best_estimator_

# Precision on the training set
train_model = best_model.predict(x_train)
train_score = precision_score(y_train, train_model)

# Precision on the held-out test set
test_model = best_model.predict(x_test)
test_score = precision_score(y_test, test_model)

print(f"Precision Training Score: {train_score :.2%}")
print(f"Precision Test Score: {test_score :.2%}")
print("Best Parameter Combination Found During Grid Search:")
print(grid_search.best_params_)
   

Precision Training Score: 85.71%
Precision Test Score: 84.00%
Best Parameter Combination Found During Grid Search:
{'criterion': 'friedman_mse', 'learning_rate': 0.001, 'loss': 'deviance', 'n_estimators': 30}
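
Every grid search in this notebook passes `return_train_score=True`, but only the single best model is scored afterwards. The cross-validated train and test scores for all candidates are available in `cv_results_`. The sketch below shows one way to tabulate them, using synthetic data and a reduced grid that are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the stock features.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)

param_grid = {'n_estimators': [30], 'learning_rate': [0.01, 0.1, 0.2]}
grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring='precision',
    return_train_score=True,
)
grid_search.fit(X, y)

# One row per parameter combination, averaged over the 3 folds.
results = pd.DataFrame(grid_search.cv_results_)[
    ['param_learning_rate', 'mean_train_score', 'mean_test_score', 'rank_test_score']
]
results['gap'] = results['mean_train_score'] - results['mean_test_score']
print(results.sort_values('rank_test_score'))
```

Comparing `gap` across rows gives a per-candidate view of overfitting, rather than relying on one train/test split of the winning model.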


## Conclusion

### Base Model

I am currently moving forward with the Decision Tree as the base model.  
> The K Nearest Neighbors and SVC models both struggled with extreme overfitting.  
> The XGBoost models seemed to work decently at first and are definitely worth looking into in future work.

### Ensemble Model

I am moving forward with the Gradient Boosting ensemble method to help with the performance of the base model.  
With the Decision Tree Classifier passed as `init`, overfitting is low (85.71% training vs. 84.00% test precision), and Gradient Boosting offers more tunable parameters for adapting the model to future data.