# CLASSIFICATION MODEL:
Building a predictive model is an iterative process: we go back and forth, by trial and error, between cleaning the data and testing different hyperparameter settings. For that reason we decided to:

- Move all our helper functions to a separate file, so we can import them from any notebook while keeping the notebooks clean.
- Use 2 code files to get our final model:
    - File `clf_data_proccessing.ipynb`:<br>
        &emsp;Reads and cleans the data, then saves it to a SQLite database in the ``./database/models.db`` file.<br>
        &emsp;To check how clean the data is, we'll use a random forest model.
    - File `clf_model_selection.ipynb`: tests and compares different models on the cleaned dataset.
- Once the final model version is selected, it is serialized after training and stored in the ``./trained_models`` folder.
- The trained model is then deployed to a website built with Flask/Jinja to make predictions on data entered by users.
***
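
The hand-off between the two notebooks goes through a SQLite table. As a rough illustration of that round trip, here is a minimal sketch: the toy values and the `age` and `chol` column names are assumptions, the in-memory database stands in for ``./database/models.db``, and only the table name `class_clean_data` and the `num` target column come from this notebook.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the cleaned data (values and feature names are made up).
clean_df = pd.DataFrame(
    {"age": [63, 41, 57], "chol": [233, 204, 354], "num": [1, 0, 1]}
)

# An in-memory database keeps the sketch self-contained; the notebooks
# use the file ./database/models.db instead.
con = sqlite3.connect(":memory:")

# Write the cleaned table, replacing any previous version of it.
clean_df.to_sql("class_clean_data", con, if_exists="replace", index=False)

# Read it back with the same query the model-selection notebook uses.
loaded_df = pd.read_sql_query("select * from class_clean_data", con)
con.close()

print(loaded_df)
```

Using `if_exists="replace"` means each cleaning run overwrites the table, which suits the trial-and-error workflow described above.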

### MODEL CREATION
With our data cleaned, we'll try different classification models to pick the one to be deployed on the website.<br>
We'll test:
- Random forest classifier
- KNN classifier
- Logistic regression

We use cross-validation to select the best version of each model, then compare those versions on the held-out test set with each model's `score` method to select the final model.
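
The `cross_val` helper lives in `myFunc` and its source is not shown in this notebook. The sketch below is only a guess at what such a helper could look like, based on the two numbers it prints: `cross_val_sketch`, its `folds` and `seed` parameters, the 25% inner hold-out, and the synthetic data are all assumptions, and the fourth argument (`'c'`) of the real helper is left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split


def cross_val_sketch(model, X, y, folds=5, seed=7):
    """Hypothetical stand-in for the `cross_val` helper in myFunc."""
    # Set aside an inner hold-out set that the folds never see.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # Mean accuracy across the validation folds, in percent.
    val_acc = 100 * cross_val_score(model, X_train, y_train, cv=folds).mean()
    # Refit on the whole training part and score the inner hold-out set.
    model.fit(X_train, y_train)
    hold_acc = 100 * model.score(X_hold, y_hold)
    return val_acc, hold_acc


# Synthetic data so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
val_acc, hold_acc = cross_val_sketch(LogisticRegression(max_iter=1000), X, y)
print(f"Accuracy -val set: {val_acc:.2f}")
print(f"Accuracy -test set: {hold_acc:.2f}")
```

Because the sketch leaves the model fitted, it can be scored afterwards on a separate test set, which is how the models are used later in this notebook.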



In [7]:
from myFunc import *  # import helper functions
# pull the cleaned dataset
con = sqlite3.connect('./../database/models.db')
df = pd.read_sql_query('select * from class_clean_data', con)
# separate the feature matrix from the target
X = df.drop(['num'], axis=1)
y = df['num']
# hold out test data; we'll use it after tweaking hyperparameters in the different models
X1, Xtest, y1, ytest = train_test_split(X, y, test_size=0.1, random_state=7)

## Random Forest Classifier

In [8]:

# Code from https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
# was used to search for the best hyperparameters for our random forest model. The result is the commented-out line below,
# but its scores weren't very different from those we got with simpler settings.
# rf1=RandomForestClassifier(n_estimators=1600,min_samples_split=2,min_samples_leaf=4,max_features='sqrt',max_depth=80,bootstrap=True)

rf1=RandomForestClassifier(n_estimators=55, bootstrap=False)
cross_val(rf1,X1,y1,'c')
rf2=RandomForestClassifier(n_estimators=150)  # other hyperparameters left at their defaults
cross_val(rf2,X1,y1,'c')
rf3=RandomForestClassifier(n_estimators=200)
cross_val(rf3,X1,y1,'c')



-------------Cross Validation-----------------
Accuracy -val set: 81.29120879120877
Accuracy -test set: 80.3030303030303
-------------Cross Validation-----------------
Accuracy -val set: 81.73076923076923
Accuracy -test set: 80.3030303030303
-------------Cross Validation-----------------
Accuracy -val set: 83.1868131868132
Accuracy -test set: 84.84848484848484
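
For reference, one common way to run a hyperparameter search like the one mentioned in the comment above is scikit-learn's `RandomizedSearchCV`. The sketch below is not the search that was actually run: the small grid, the synthetic data, and the `n_iter` and `cv` values are assumptions chosen so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small search space; the values are illustrative only.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=7),
    param_distributions=param_distributions,
    n_iter=5,  # number of random combinations to try
    cv=3,  # folds used to score each combination
    random_state=7,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {100 * search.best_score_:.2f}")
```

A random search samples only `n_iter` combinations, so it scales to large grids at the cost of possibly missing the best one.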


## KNeighbors Classifier

In [9]:
knn1 = KNeighborsClassifier(n_neighbors=3)
cross_val(knn1,X1,y1,'c')
knn2 = KNeighborsClassifier(n_neighbors=35)
cross_val(knn2,X1,y1,'c')
knn3 = KNeighborsClassifier(n_neighbors=17)
cross_val(knn3,X1,y1,'c')

-------------Cross Validation-----------------
Accuracy -val set: 61.37362637362638
Accuracy -test set: 62.121212121212125
-------------Cross Validation-----------------
Accuracy -val set: 62.2252747252747
Accuracy -test set: 68.18181818181817
-------------Cross Validation-----------------
Accuracy -val set: 63.736263736263744
Accuracy -test set: 59.09090909090909
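
The three values of `n_neighbors` above were tried one at a time. A loop over candidate values makes the comparison easier to extend; this is a minimal sketch on synthetic data, using scikit-learn's `cross_val_score` directly rather than the `cross_val` helper, with 5 folds as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Mean cross-validated accuracy (in percent) for each candidate k.
scores = {}
for k in (3, 17, 35):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = 100 * cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
for k, acc in scores.items():
    print(f"k={k:>2}: {acc:.2f}")
print("best k:", best_k)
```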


## Logistic Regression

In [10]:
lr1=LogisticRegression(solver='lbfgs',penalty='l2',C=.6)
cross_val(lr1,X1,y1,'c')
lr2=LogisticRegression(solver='newton-cg',penalty='l2',C=.55)
cross_val(lr2,X1,y1,'c')
lr3=LogisticRegression(solver='sag',penalty=None,C=3)  # with penalty=None there is no regularization, so C has no effect
cross_val(lr3,X1,y1,'c')

-------------Cross Validation-----------------
Accuracy -val set: 82.80219780219781
Accuracy -test set: 80.3030303030303
-------------Cross Validation-----------------
Accuracy -val set: 83.8736263736264
Accuracy -test set: 81.81818181818183
-------------Cross Validation-----------------
Accuracy -val set: 70.27472527472527
Accuracy -test set: 66.66666666666666


In [11]:
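# Accuracy on the held-out test set for the best version of each model.
# For classifiers, `score` returns the same value as accuracy_score on the predictions:
# print(accuracy_score(ytest,rf3.predict(Xtest)))
# print(accuracy_score(ytest,knn2.predict(Xtest)))
# print(accuracy_score(ytest,lr2.predict(Xtest)))
print(rf3.score(Xtest,ytest))
print(knn2.score(Xtest,ytest))
print(lr2.score(Xtest,ytest))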

0.9032258064516129
0.6774193548387096
0.8709677419354839


In [12]:
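# per-class precision, recall and F1 for the two best models:
# random forest first, then logistic regression
print(classification_report(ytest,rf3.predict(Xtest)))
print(classification_report(ytest,lr2.predict(Xtest)))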

              precision    recall  f1-score   support

           0       0.88      0.94      0.91        16
           1       0.93      0.87      0.90        15

    accuracy                           0.90        31
   macro avg       0.91      0.90      0.90        31
weighted avg       0.90      0.90      0.90        31

              precision    recall  f1-score   support

           0       0.83      0.94      0.88        16
           1       0.92      0.80      0.86        15

    accuracy                           0.87        31
   macro avg       0.88      0.87      0.87        31
weighted avg       0.88      0.87      0.87        31
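
To see where the numbers in the first report come from, the per-class counts can be worked back from the supports and recalls. The confusion counts below are inferred from the rounded report values rather than taken from the model's predictions, so treat them as a consistent reconstruction and not as logged output.

```python
# Counts inferred from the first report: outer keys are true classes,
# inner keys are predicted classes.
confusion = {
    0: {0: 15, 1: 1},  # 16 true class-0 samples
    1: {0: 2, 1: 13},  # 15 true class-1 samples
}


def precision(cls):
    # Of everything predicted as `cls`, how much really is `cls`?
    predicted_as_cls = sum(confusion[true][cls] for true in confusion)
    return confusion[cls][cls] / predicted_as_cls


def recall(cls):
    # Of everything that really is `cls`, how much was found?
    return confusion[cls][cls] / sum(confusion[cls].values())


def f1(cls):
    # Harmonic mean of precision and recall.
    p, r = precision(cls), recall(cls)
    return 2 * p * r / (p + r)


total = sum(sum(row.values()) for row in confusion.values())
accuracy = (confusion[0][0] + confusion[1][1]) / total

for cls in confusion:
    print(cls, round(precision(cls), 2), round(recall(cls), 2), round(f1(cls), 2))
print("accuracy", round(accuracy, 2))
```

With these counts, 28 of the 31 test samples are classified correctly, which matches the 0.9032 returned by `rf3.score` above.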



### Random forest is the winner (0.90 accuracy on the held-out test set), although logistic regression was close (0.87)!

In [13]:
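# make sure the output folder exists
jl_filedir = Path("./../trained_models")
jl_filedir.mkdir(parents=True, exist_ok=True)

jl_filepath = jl_filedir / 'class_heart.joblib'

# serialize the selected model (rf3) to disk
joblib.dump(rf3, jl_filepath)

# to load it back later:
# rf3_jl = joblib.load(jl_filepath)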


['..\\trained_models\\class_heart.joblib']
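
Before wiring the saved file into the website, it can be worth checking that a reloaded model predicts exactly like the original. This is a minimal sketch of that check, assuming a small stand-in model trained on synthetic data and a temporary folder instead of ``./trained_models``.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X, y = make_classification(n_samples=100, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=7).fit(X, y)

# Dump to a temporary folder and load the model back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "class_heart.joblib"
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

A joblib file is a pickle under the hood, so it should be loaded with the same scikit-learn version that produced it and only from a trusted source.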