# Notebook Description

## Objectives

*   Fit a classification model on the Iris dataset


## Inputs

* LoadIrisDataset

## Outputs

* Trained classifier (`ClfModel.pkl`)
* Train/test sets used in the training and evaluation process

## Additional Comments | Insights | Conclusions


  * Add any relevant comment



---

# Install and Import packages

* You may need to restart the runtime after installing a package; check the cell output after each install for a restart prompt.

In [None]:
#! pip install xxxx

---

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [37]:
from getpass import getpass
import os
from IPython.display import clear_output 
print("=== Insert your credentials === \nType in and hit Enter")
UserName = getpass('GitHub User Name: ')
UserEmail = getpass('GitHub User E-mail: ')
RepoName = getpass('GitHub Repository Name: ')
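# GitHub no longer accepts account passwords for Git over HTTPS; enter a personal access token.
UserPwd = getpass('GitHub Personal Access Token: ')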
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* Thanks for inserting your credentials!
* You may now Clone your Repo to this Session, then Connect this Session to your Repo.


---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [48]:
! git clone https://github.com/{UserName}/{RepoName}.git

print("\n")
%cd /content/{RepoName}
print(f"\n\n* Current session directory is:  {os.getcwd()}")
print(f"* You may refresh the session folder to access {RepoName} folder.")

Cloning into 'convert-streamlit-to-django'...
remote: Enumerating objects: 646, done.
remote: Counting objects: 100% (646/646), done.
remote: Compressing objects: 100% (489/489), done.
remote: Total 646 (delta 322), reused 398 (delta 123), pack-reused 0
Receiving objects: 100% (646/646), 427.05 KiB | 4.19 MiB/s, done.
Resolving deltas: 100% (322/322), done.


/content/convert-streamlit-to-django


* Current session directory is:  /content/convert-streamlit-to-django
* You may refresh the session folder to access convert-streamlit-to-django folder.


In [49]:
%cd streamlit/

/content/convert-streamlit-to-django/streamlit


In [50]:
pwd

'/content/convert-streamlit-to-django/streamlit'

---

### **Connect** this Colab session to your GitHub Repo

* So that you can push files generated in this session to your Repo when needed.

In [51]:
!git config --global user.email {UserEmail}
!git config --global user.name {UserName}
!git remote rm origin
!git remote add origin https://{UserName}:{UserPwd}@github.com/{UserName}/{RepoName}.git
print(f"\n\n * The current Colab Session is connected to the following GitHub repo: {UserName}/{RepoName}")
print(" * You can now push new files to the repo.")



 * The current Colab Session is connected to the following GitHub repo: FernandoRocha88/convert-streamlit-to-django
 * You can now push new files to the repo.
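
The remote URL above embeds the credentials directly, so a secret containing characters such as `@` or `/` would produce a malformed URL. A minimal sketch of one way to guard against that, assuming a hypothetical helper `build_remote_url` (not part of the notebook) that percent-encodes each component with the standard library:

```python
from urllib.parse import quote


def build_remote_url(user_name: str, secret: str, repo_name: str) -> str:
    """Return an HTTPS remote URL with percent-encoded credentials.

    Hypothetical helper for illustration; the notebook builds the URL inline.
    """
    # safe="" also encodes "/" and "@", which would otherwise split the URL
    # in the wrong place.
    user = quote(user_name, safe="")
    password = quote(secret, safe="")
    repo = quote(repo_name, safe="")
    return f"https://{user}:{password}@github.com/{user}/{repo}.git"


# Made-up values, only to show the encoding.
print(build_remote_url("octo", "p@ss/word", "demo"))
```

The resulting string could then be passed to `git remote add origin` in place of the hand-built URL.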


---

### **Push** generated/new files from this Session to GitHub repo

* Git commit

In [68]:
CommitMsg = "update"
!git add .
!git commit -m "{CommitMsg}"

[main e4fa21b] update
 5 files changed, 304 insertions(+)
 create mode 100644 streamlit/outputs/datasets/ClfModel_Xtest.csv
 create mode 100644 streamlit/outputs/datasets/ClfModel_Xtrain.csv
 create mode 100644 streamlit/outputs/datasets/ClfModel_ytest.csv
 create mode 100644 streamlit/outputs/datasets/ClfModel_ytrain.csv
 create mode 100644 streamlit/outputs/trained_models/ClfModel.pkl


* Git Push

In [69]:
!git push origin main

Counting objects: 11, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (11/11), 3.37 KiB | 3.37 MiB/s, done.
Total 11 (delta 2), reused 5 (delta 0)
remote: Resolving deltas:   0% (0/2)

---

### **Delete** Cloned Repo from current Session

In [47]:
%cd /content
!rm -rf {RepoName}
print(f"\n * Please refresh session folder to validate that {RepoName} folder was removed from this session.")

/content

 * Please refresh session folder to validate that convert-streamlit-to-django folder was removed from this session.


---

# Your first notebook section starts from here

In [54]:
from src.processing.data_management import LoadIrisDataset

In [55]:
df = LoadIrisDataset()
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


* Train test split

In [56]:
from config import config
from src.processing.data_management import TrainTestSplit

In [57]:
X_train, X_test, y_train, y_test = TrainTestSplit(df=df, TARGET=config.ClfIrisSpecies_TARGET)
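
The implementation of `TrainTestSplit` is not shown in this notebook. The sketch below is one plausible way to write such a helper, assuming a stratified 80/20 split (an assumption suggested by the 40/40/40 and 10/10/10 class supports reported in the evaluation cells). The function name `split_features_target`, the `test_size`, and the `random_state` are illustrative choices, not the project's actual code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def split_features_target(df, target, test_size=0.2, random_state=0):
    """Split a DataFrame into stratified train/test features and target.

    Hypothetical stand-in for the project's TrainTestSplit helper.
    """
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class proportions equal in both subsets.
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )


# Demo on scikit-learn's bundled Iris data, renamed to match the notebook's target.
demo_df = load_iris(as_frame=True).frame.rename(columns={"target": "Species"})
X_tr, X_te, y_tr, y_te = split_features_target(demo_df, target="Species")
print(y_tr.value_counts().sort_index().to_dict())
print(y_te.value_counts().sort_index().to_dict())
```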

# Grid Search on 1 pipeline

In [58]:
from config import config
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Pipeline: tree-based feature selection, scaling, then the classifier.
ClfIrisSpecies_DT = Pipeline(
    [
        ("feat_selection", SelectFromModel(DecisionTreeClassifier(random_state=config.RANDOM_STATE))),
        ("feat_scaling", StandardScaler()),
        ("model", DecisionTreeClassifier(random_state=config.RANDOM_STATE)),
    ]
)

# Hyperparameters of the final "model" step to search over.
_parameters = {
    "model__splitter": ["best", "random"],
    "model__max_depth": [None, 3, 5, 10],
    "model__criterion": ["gini", "entropy"],
}

# 5-fold cross-validated grid search; n_jobs=-2 uses all CPU cores but one.
_pipe = GridSearchCV(
    estimator=ClfIrisSpecies_DT,
    param_grid=_parameters,
    cv=5,
    n_jobs=-2,
    verbose=2,
)

In [59]:
_pipe.fit(X_train, y_train)


Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] model__criterion=gini, model__max_depth=None, model__splitter=best 
[CV]  model__criterion=gini, model__max_depth=None, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=None, model__splitter=best 
[CV]  model__criterion=gini, model__max_depth=None, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=None, model__splitter=best 
[CV]  model__criterion=gini, model__max_depth=None, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=None, model__splitter=best 
[CV]  model__criterion=gini, model__max_depth=None, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=None, model__splitter=best 
[CV]  model__criterion=gini, model__max_depth=None, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=None, model__splitter=random 
[CV]  model__criterion=gini, model__max_depth=None, model__

[Parallel(n_jobs=-2)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=-2)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV]  model__criterion=gini, model__max_depth=5, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=5, model__splitter=best .
[CV]  model__criterion=gini, model__max_depth=5, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=5, model__splitter=best .
[CV]  model__criterion=gini, model__max_depth=5, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=5, model__splitter=best .
[CV]  model__criterion=gini, model__max_depth=5, model__splitter=best, total=   0.0s
[CV] model__criterion=gini, model__max_depth=5, model__splitter=random 
[CV]  model__criterion=gini, model__max_depth=5, model__splitter=random, total=   0.0s
[CV] model__criterion=gini, model__max_depth=5, model__splitter=random 
[CV]  model__criterion=gini, model__max_depth=5, model__splitter=random, total=   0.0s
[CV] model__criterion=gini, model__max_depth=5, model__splitter=random 
[CV]  model__criterion=gini, model__max_depth=5, model__s

[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed:    0.8s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('feat_selection',
                                        SelectFromModel(estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                                         class_weight=None,
                                                                                         criterion='gini',
                                                                                         max_depth=None,
                                                                                         max_features=None,
                                                                                         max_leaf_nodes=None,
                                                                                         min_impurity_decrease=0.0,
                                                                                         min_impurity_s
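
The search log reports 16 candidates and 80 fits. Those numbers follow from the grid size: 2 splitters x 4 depths x 2 criteria = 16 combinations, each fitted once per fold. A short sketch that checks the arithmetic with scikit-learn's `ParameterGrid`; the names `param_grid`, `n_candidates`, and `n_fits` are chosen here for illustration:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one passed to GridSearchCV in this notebook.
param_grid = {
    "model__splitter": ["best", "random"],
    "model__max_depth": [None, 3, 5, 10],
    "model__criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(param_grid))  # one entry per combination
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 16 80
```

Counting the grid this way before launching a search is a cheap way to estimate its cost, since the number of fits grows multiplicatively with every added hyperparameter.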

In [60]:
PipelineToDeploy = _pipe.best_estimator_
PipelineToDeploy

Pipeline(memory=None,
         steps=[('feat_selection',
                 SelectFromModel(estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                  class_weight=None,
                                                                  criterion='gini',
                                                                  max_depth=None,
                                                                  max_features=None,
                                                                  max_leaf_nodes=None,
                                                                  min_impurity_decrease=0.0,
                                                                  min_impurity_split=None,
                                                                  min_samples_leaf=1,
                                                                  min_samples_split=2,
                                                                  min_weight_fract

In [62]:
_pipe.best_params_

{'model__criterion': 'gini',
 'model__max_depth': None,
 'model__splitter': 'best'}

In [63]:
X_train.columns[PipelineToDeploy['feat_selection'].get_support()].to_list()

['petal width (cm)']
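
Only one feature survives the selection step because `SelectFromModel`, when no `threshold` is given and the estimator has no L1 penalty, keeps the features whose importance is at least the mean importance. The sketch below illustrates that rule on scikit-learn's bundled Iris data rather than on the notebook's `X_train`, so the selected columns may differ from the output above; `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit the selector with its default threshold.
selector = SelectFromModel(DecisionTreeClassifier(random_state=0)).fit(X, y)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()

# The fitted threshold_ equals the mean importance; features at or above it are kept.
print("importances:", importances.round(3))
print("threshold:  ", round(float(selector.threshold_), 3))
print("kept:       ", mask)
```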

* Evaluation on the train and test sets

In [64]:
from sklearn.metrics import classification_report
print( classification_report(y_train, PipelineToDeploy.predict(X_train)) )

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        40
           1       0.93      0.97      0.95        40
           2       0.97      0.93      0.95        40

    accuracy                           0.97       120
   macro avg       0.97      0.97      0.97       120
weighted avg       0.97      0.97      0.97       120



In [65]:
print( classification_report(y_test, PipelineToDeploy.predict(X_test)) )

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.83      1.00      0.91        10
           2       1.00      0.80      0.89        10

    accuracy                           0.93        30
   macro avg       0.94      0.93      0.93        30
weighted avg       0.94      0.93      0.93        30
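
The test-set report can be read back into a confusion matrix: class 0 is classified perfectly, all ten class 1 samples are recovered, and two class 2 samples are predicted as class 1. The matrix below is inferred from the rounded precision, recall, and support values, not taken from the notebook's output. The sketch recomputes the per-class metrics from it in plain Python.

```python
# Rows are true classes, columns are predicted classes (inferred, see above).
confusion = [
    [10, 0, 0],
    [0, 10, 0],
    [0, 2, 8],
]


def per_class_metrics(matrix, k):
    """Return (precision, recall, f1) for class k of a square confusion matrix."""
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    precision = true_positives / predicted_as_k
    recall = true_positives / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k in range(3):
    p, r, f = per_class_metrics(confusion, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))

correct = sum(confusion[k][k] for k in range(3))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print("accuracy", round(accuracy, 2))
```

Rounded to two decimals, these values match the report: precision 0.83 for class 1 is 10 correct out of 12 predicted, and accuracy 0.93 is 28 correct out of 30.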



* Save the model and the train/test sets

In [66]:
import os
import joblib
model_name = "ClfModel"
save_path = f"/content/convert-streamlit-to-django/streamlit/outputs/trained_models/{model_name}.pkl"
joblib.dump(PipelineToDeploy, save_path)

['/content/convert-streamlit-to-django/streamlit/outputs/trained_models/ClfModel.pkl']
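
A saved pipeline is only useful if it loads back and predicts the same way. The following is a minimal round-trip check, assuming a small stand-in model and a temporary directory instead of the notebook's `PipelineToDeploy` and `outputs/trained_models` path; the file name `demo_model.pkl` is made up for the example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Dump to a throwaway location, then load the object back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored estimator should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Note that pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so the consuming application should pin its dependencies accordingly.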

In [67]:
import pandas as pd
from config import config

# No trailing slash here: the f-strings below add the separator.
save_path = "/content/convert-streamlit-to-django/streamlit/outputs/datasets"
model_name = "ClfModel"

# Save the feature sets.
X_train.to_csv(f"{save_path}/{model_name}_Xtrain.csv", index=False)
X_test.to_csv(f"{save_path}/{model_name}_Xtest.csv", index=False)

# Save the targets as single-column DataFrames named after the target variable.
df_y_train = pd.DataFrame(y_train, columns=[config.ClfIrisSpecies_TARGET])
df_y_train.to_csv(f"{save_path}/{model_name}_ytrain.csv", index=False)

df_y_test = pd.DataFrame(y_test, columns=[config.ClfIrisSpecies_TARGET])
df_y_test.to_csv(f"{save_path}/{model_name}_ytest.csv", index=False)