<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />


#  Worksheet 5.4: Automate it All!
This worksheet covers concepts relating to automating a machine learning model using the techniques we learned.  It should take no more than 20-30 minutes to complete.  Please raise your hand if you get stuck.  

In [1]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
from sklearn.model_selection import train_test_split
import scikitplot as skplt
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import lime
from tpot import TPOTClassifier
%matplotlib inline



## Step One:  Import the Data
In this example, we're going to use the dataset we used in worksheet 5.3.  Run the following code to read in the data and extract the feature matrix and target vector.

In [3]:
df = pd.read_csv('../data/dga_features_final_df.csv')
y = df['isDGA']
X = df.drop(['isDGA'], axis=1)

Next, perform the test/train split in the conventional manner.

In [4]:
# Your code here ...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

## Step Two:  Run the Optimizer
In the next step, use TPOT to create a classification pipeline for the DGA dataset that we have been using.  `TPOTClassifier()` has many configuration options; in the interest of time, please set the following when you instantiate the classifier.

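* `max_eval_time_mins`:  In the interest of time, set this to 15 or 20.  This limits how long TPOT may spend evaluating any single pipeline; the separate `max_time_mins` option limits the total optimization time.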
* `verbosity`: Set to 1 or 2 so you can see what TPOT is doing.


**Note:  This step will take some time, so you might want to get some coffee or a snack while it is running.**  While you wait, take a look at the other configuration options available here: http://epistasislab.github.io/tpot/api/.
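To make the configuration options more concrete, here is a minimal sketch of a settings dictionary together with the evaluation count TPOT's documentation describes (`population_size + generations * offspring_size`). The parameter names follow the classic TPOT API at the link above and may differ in newer releases; the values and the helper `total_pipeline_evaluations` are illustrative assumptions, not recommendations.

```python
# Illustrative settings only; the values are assumptions, not recommendations.
tpot_settings = {
    "generations": 5,         # number of evolutionary iterations
    "population_size": 20,    # pipelines kept in each generation
    "max_time_mins": 20,      # cap on the whole optimization run
    "max_eval_time_mins": 5,  # cap on evaluating any single pipeline
    "cv": 5,                  # folds used for the internal CV score
    "random_state": 42,       # makes a run repeatable
    "verbosity": 2,           # print progress while optimizing
}


def total_pipeline_evaluations(generations, population_size, offspring_size=None):
    """Hypothetical helper: population_size + generations * offspring_size."""
    if offspring_size is None:
        # By default the offspring size equals the population size.
        offspring_size = population_size
    return population_size + generations * offspring_size


# The defaults (100 generations, population of 100) give 10100 evaluations.
print(total_pipeline_evaluations(100, 100))
# The smaller illustrative settings above give 120.
print(total_pipeline_evaluations(tpot_settings["generations"],
                                 tpot_settings["population_size"]))
```

With `TPOTClassifier` imported as in the first cell, these settings could be passed as `TPOTClassifier(**tpot_settings)`.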

In [5]:
# Your code here... 
tpot = TPOTClassifier(max_eval_time_mins=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

Optimization Progress:   0%|          | 0/10100 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.9173333333333333

Generation 2 - Current best internal CV score: 0.9173333333333333

Generation 3 - Current best internal CV score: 0.9173333333333333

Generation 4 - Current best internal CV score: 0.9173333333333333

Generation 5 - Current best internal CV score: 0.9226666666666666

Generation 6 - Current best internal CV score: 0.9226666666666666

Generation 7 - Current best internal CV score: 0.9226666666666666

Generation 8 - Current best internal CV score: 0.9226666666666666

Generation 9 - Current best internal CV score: 0.9226666666666666

Generation 10 - Current best internal CV score: 0.9226666666666666

Generation 11 - Current best internal CV score: 0.9226666666666666

Generation 12 - Current best internal CV score: 0.9226666666666666

Generation 13 - Current best internal CV score: 0.9226666666666666

Generation 14 - Current best internal CV score: 0.9226666666666666

Generation 15 - Current best internal CV score: 0.922666

## Step Three:  Evaluate the Performance
Now that you have a trained model, the next step is to evaluate the performance and see how TPOT did in comparison with earlier models we created.  Use the techniques you've learned to evaluate the performance of your model.  Specifically, print out the `classification report` and a confusion matrix. 

Unfortunately, Yellowbrick will not work in this instance. However, you can generate a similar visual confusion matrix with the following code:

```
import scikitplot as skplt
# True labels go first, predictions second
skplt.metrics.plot_confusion_matrix(target_test, optimized_preds)
```

What is the accuracy of your model?  Is it significantly better than what you did in earlier labs?

In [7]:
# Your code here...
from sklearn.metrics import confusion_matrix

pred = tpot.predict(X_test)
# Output the classification report (true labels first, predictions second)
class_report = classification_report(y_test, pred)
# Compute the confusion matrix (rows: true labels, columns: predictions)
conf_matrix = confusion_matrix(y_test, pred)

print(class_report)
print(conf_matrix)

              precision    recall  f1-score   support

           0       0.86      0.96      0.91       240
           1       0.96      0.85      0.90       260

    accuracy                           0.91       500
   macro avg       0.91      0.91      0.91       500
weighted avg       0.91      0.91      0.91       500

[[231   9]
 [ 38 222]]
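To see how the figures in a classification report follow from a confusion matrix, here is a minimal sketch. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats label 1 as the positive class. The helper `binary_metrics` and the counts are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def binary_metrics(matrix):
    """Derive accuracy, precision and recall from a 2x2 confusion matrix.

    Assumes rows are true labels, columns are predicted labels,
    and label 1 is the positive class.
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all samples classified correctly
        "precision": tp / (tp + fp),     # share of predicted positives that are right
        "recall": tp / (tp + fn),        # share of actual positives that were found
    }


# Hypothetical counts, not taken from any run in this worksheet.
example = [[50, 10],
           [5, 35]]

metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Swapping the order of the true and predicted labels transposes the matrix, which in turn swaps precision and recall, so the argument order matters when reading a report.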


## Step Four:  Export your Pipeline
If you are happy with the results from `TPOT`, you can export the pipeline as Python code. The final step in this lab is to export the pipeline as a file called `automate_ml.py` and examine it.  What model and preprocessing steps t

In [8]:
# Your code here... 
tpot.export('automate_ml.py')
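
One quick way to examine the exported file is to list its import statements, since these should name the estimator and preprocessing classes the pipeline uses. The sketch below uses only the standard library; the helper `import_lines` is hypothetical, and the file is assumed to sit in the current working directory.

```python
from pathlib import Path


def import_lines(path):
    """Return the import statements found in a Python source file."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines
            if line.strip().startswith(("import ", "from "))]


exported = Path("automate_ml.py")  # the file name used in this step
if exported.exists():
    for line in import_lines(exported):
        print(line)
else:
    print("automate_ml.py not found; run tpot.export() first")
```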