<a href="https://colab.research.google.com/github/PurpleDin0/698S_BOG/blob/master/MST698S_CNN_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bad Ozone Grasshoppers - CNN Exercise
# MST 698S - Data Science Tools And Techniques 

**BLUF:**  Train an image classifier to identify tanks and then evaluate the usefulness of the model.

**Problem Statement:** Design an image classifier to streamline the task of identifying Denovian military vehicles in social network photos.  Specifically, construct an image classifier to identify military vehicles.
* [X] Describe other ways that image classifiers could be used to streamline the data sorting, collection, and analysis process.
* [X] How well does the model perform? Would you deploy this model?  Why or why not?

**Summary:** This notebook installs the Python libraries and operating system (OS) programs required to run a Python-based image classifier.  It then evaluates model performance by creating a confusion matrix, calculating both the Matthews Correlation Coefficient and Cohen's kappa, and creating a classification report.

**Usage Details:** This notebook is designed to be run in the [Google Colab environment](https://colab.research.google.com/). However, it should work in ***most*** Linux-based Jupyter Notebook or JupyterLab environments, as long as the appropriate versions of TensorFlow, Keras, and NumPy are installed.  The main purpose of the notebook is to install the relevant Python libraries, execute the code, and save the output to a cloud repository.  <font color=yellow>CAUTION: If executing this notebook on a Windows-based system, the user will need to install Git, TensorFlow, Keras, and NumPy, and update the default file paths to match Windows formatting. </font>

## Initialize the Environment 
1. Clone the github repo [located here](https://github.com/PurpleDin0/CNN-exercise).  
```
!git clone https://github.com/[repo_owner]/[repo].git
```

2. Download the image sets from [here](https://drive.google.com/drive/folders/12H7D5-ipY5hCv2zr-6XQEUz3sXUXS3bE?usp=sharing).  The image sets can be downloaded manually or with the code cell below.
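If the archive is downloaded manually, or the notebook runs somewhere the `unzip` program is not available (for example on Windows), the standard-library `zipfile` module can extract it instead. The following is a minimal sketch; the helper name `extract_archive` is hypothetical, and it assumes the archive was saved as `test_set.zip` in the working directory.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract a zip archive and return its top-level entry names."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Archive member names always use forward slashes.
        return sorted({name.split("/")[0] for name in archive.namelist()})


# Assumed file name; only runs if the archive is present.
if Path("test_set.zip").is_file():
    print(extract_archive("test_set.zip"))
```

This does the same job as `!unzip {file_name}` without depending on an OS-level program.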

In [10]:
### 1. Clone the GitHub repo ###
# Navigate the working directory in colab to "/content" 
%cd /content/
# clone the relevant github repo
!git clone https://github.com/PurpleDin0/CNN-exercise.git
# Navigate to the newly created repo folder
%cd /content/CNN-exercise

### 2. Download the image sets ###
# Import PyDrive and associated libraries.
# This only needs to be done once per notebook.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
#
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
file_id = '1xT3cTVgLBbaSeub2bwoFflPbEBnPM8Uv'
downloaded = drive.CreateFile({'id': file_id})
file_name = 'test_set.zip'
downloaded.GetContentFile(file_name)

# Unzip the file that was just downloaded 
!unzip {file_name}

/content
Cloning into 'CNN-exercise'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 21 (delta 10), reused 3 (delta 2), pack-reused 0
Unpacking objects: 100% (21/21), done.
/content/CNN-exercise
Archive:  test_set.zip
   creating: test_set/not_tank/
  inflating: test_set/not_tank/google_images_2019-10-03_13h_34m_38s_00000890.jpg  
  inflating: test_set/not_tank/google_images_2019-10-03_13h_34m_38s_00000891.jpg  
  inflating: test_set/not_tank/google_images_2019-10-03_13h_34m_38s_00000892.jpg  
  inflating: test_set/not_tank/google_images_2019-10-03_13h_34m_38s_00000893.jpg  
  inflating: test_set/not_tank/google_images_2019-10-03_13h_34m_38s_00000894.jpg  
  inflating: test_set/not_tank/google_images_2019-10-03_13h_34m_38s_00000897.jpg  
  inflating: test_set/not_tank/google_images_2019-10-03_13h_34m_38s_00000898.jpg  
  inflating: test_set/not_tank/google_images_2019-10-0

In [0]:
%ls

classifier_function.py                             README.md
CNN_model.h5                                       test_set/
CNN_trainer.py                                     test_set.zip
MST698S_BOG_CNN_exercise-execution_notebook.ipynb  training_set/
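Before training or evaluating, it can help to confirm that the image folders contain what is expected. The sketch below counts image files per class folder; the function name `count_images_per_class` and the list of accepted file suffixes are assumptions, not part of the cloned repo.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Return {class_folder_name: image_count} for each subfolder of root."""
    counts = {}
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


# Only runs if the folder exists in the working directory.
if Path("test_set").is_dir():
    print(count_images_per_class("test_set"))
```

A large gap between the class counts is an early sign that the dataset is imbalanced, which affects how the evaluation metrics should be read.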


## Train the model
Training takes roughly 30 minutes per epoch on a CPU (about 12 hours in total).
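Because CPU training is slow, it is worth checking whether a GPU is visible before starting. The sketch below uses only the standard library and assumes an NVIDIA GPU whose driver ships the `nvidia-smi` tool; the helper name `gpu_visible` is hypothetical, and a `False` result only means that this particular check found nothing.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

In Colab, a GPU can be requested through the runtime settings before running the training cell.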


In [0]:
!python3 CNN_trainer.py

Loading...
Using TensorFlow backend.
2020-05-26 19:33:37.604166: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
CNN will build your convolutional neural network!
Accessing image data...
Training model...
2020-05-26 19:33:39.438113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-26 19:33:39.488288: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 19:33:39.489262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-05-26 19:33:39.489303: I tensorflow/stream_executor/platform/default/ds

## Run the model against the test data
1. Load the classifier function and run it against the test data that was downloaded earlier into the `test_set` folder.

2. Store the results in a pandas dataframe for easier manipulation.
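The actual label for each image comes from the folder it sits in. A slightly stricter alternative to substring matching is to read the immediate parent folder name and fail loudly on anything unexpected. This is a sketch under the assumption that each result key is a file path whose parent folder is either `tank` or `not_tank`; the helper name `label_from_path` and the example file name are hypothetical.

```python
from pathlib import PurePosixPath

# Folder name -> numeric label (1 = tank, 0 = not a tank).
LABELS = {"not_tank": 0, "tank": 1}


def label_from_path(path):
    """Return the numeric label implied by the image's parent folder."""
    parent = PurePosixPath(path).parent.name
    if parent not in LABELS:
        raise ValueError(f"Unexpected class folder {parent!r} in {path!r}")
    return LABELS[parent]


# Hypothetical file name, for illustration only.
print(label_from_path("/content/CNN-exercise/test_set/tank/example.jpg"))
```

Raising an error instead of printing a message prevents an unlabeled image from silently entering the results as a missing value.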

In [15]:
# Navigate to the code from the cloned python repo
%cd /content/CNN-exercise

# Import the classifier function code
import classifier_function
# Import pandas so we can store the results in a dataframe
import pandas as pd

### 1. call the classifier function against the test data using the CNN Model. ###
test_data_path = '/content/CNN-exercise/test_set/'
model_path = '/content/CNN-exercise/CNN_model.h5'
out_list = classifier_function.image_classifier(test_data_path, model_path)

### 2. Create a dataframe that stores the predicted value and actual value for the tank classification  ###
# Initializes two dictionaries (actual and predicted)
out_list_fixed = {}
out_list_actual = {}

# Step through the output list and build the predicted and actual tank classification dictionaries
for key in out_list:
    out_list_fixed[key] = int(out_list[key][0][0])  # converts the tank prediction to an integer
    if '/not_tank/' in key:
        out_list_actual[key] = 0
    elif '/tank/' in key:
        out_list_actual[key] = 1
    else:
        print('Unrecognized class folder in path:', key)

# Convert the dictionaries that were created into dataframes and then join those two dataframes 
df_predicted = pd.DataFrame.from_dict(out_list_fixed, orient='index', columns=['predicted'])
df_actual = pd.DataFrame.from_dict(out_list_actual, orient='index', columns=['actual'])
df_results = pd.concat([df_predicted, df_actual], axis=1)

/content/CNN-exercise


  "Palette images with Transparency expressed in bytes should be "


## Evaluate the results from the image classifier
Next we will check the accuracy of the image classifier using several different methods.  
* [X] Create a confusion matrix
* [X] Calculate the Matthews Correlation Coefficient
* [X] Calculate the Cohen kappa score
* [X] Create a classification report

Computing all of these is simplified by the `sklearn.metrics` module, which provides the `classification_report`, `confusion_matrix`, `matthews_corrcoef`, and `cohen_kappa_score` functions.

```python
from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef, cohen_kappa_score
```


In [0]:
# Import the required functions
from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef, cohen_kappa_score

# Build the confusion matrix
txt = "Confusion Matrix"
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))
matrix = confusion_matrix(df_results.actual.values.tolist(), df_results.predicted.values.tolist())
# Rows are actual labels and columns are predicted labels, ordered 0 (Not Tank) then 1 (Tank),
# so with Tank as the positive class the layout is [[TN, FP], [FN, TP]].
print("[ TN ",  matrix[0][0], " | FP  ", matrix[0][1], "]\n[ FN  ", matrix[1][0], " | TP ", matrix[1][1], "]")
print("\n  N =", matrix.sum())

# Determine the matthews_corrcoef
txt = "Matthews Correlation Coefficient"
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))
print(matthews_corrcoef(df_results.actual.values.tolist(), df_results.predicted.values.tolist()))

# Determine the cohen_kappa_score
txt = "Cohens Kappa"
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))
print(cohen_kappa_score(df_results.actual.values.tolist(), df_results.predicted.values.tolist()))

# Build the classification_report
txt = "Classification Report"
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))
print(classification_report(df_results.actual.values.tolist(), df_results.predicted.values.tolist(), target_names=['Not Tank', 'Tank'], output_dict=False))  # ADDED TARGET NAMES


 ****************** Confusion Matrix ******************
[ TN  162  | FP   46 ]
[ FN   17  | TP  349 ]

  N = 574

 ********** Matthews Correlation Coefficient **********
0.7599663028252964

 ******************** Cohens Kappa ********************
0.7551227704267508

 **************** Classification Report ****************
              precision    recall  f1-score   support

    Not Tank       0.91      0.78      0.84       208
        Tank       0.88      0.95      0.92       366

    accuracy                           0.89       574
   macro avg       0.89      0.87      0.88       574
weighted avg       0.89      0.89      0.89       574
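When reading the matrix, note that `sklearn.metrics.confusion_matrix` places actual labels on the rows and predicted labels on the columns, sorted by label value. With `0` for Not Tank and `1` for Tank, the first row holds the 208 actual Not Tank images (162 + 46) and the second row holds the 366 actual Tank images (17 + 349). Treating Tank as the positive class, the four cells are therefore TN, FP, FN, TP in reading order. The sketch below tallies the same four counts in plain Python on a toy example; the function name `binary_confusion_counts` is hypothetical.

```python
def binary_confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for 0/1 labels, with 1 as the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Toy labels, not taken from the test set.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(binary_confusion_counts(y_true, y_pred))  # (1, 1, 1, 2)
```

For a binary problem, `confusion_matrix(y_true, y_pred).ravel()` returns the counts in this same `tn, fp, fn, tp` order.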



## Explain the Evaluation Metrics
- [X] **Confusion Matrix `confusion_matrix(y_true, y_pred, *args)`:**  The confusion matrix tabulates the model's predictions against the actual labels, placing each sample in the cell for its actual and predicted class so that correct and incorrect predictions can be counted separately. The cell counts are used to compute additional metrics, such as the Matthews Correlation Coefficient (MCC).  


- [X] **Matthews Correlation Coefficient `matthews_corrcoef(y_true, y_pred, *args)`:**  The Matthews Correlation Coefficient (MCC) measures prediction performance as a floating point value from -1 to +1. An MCC of +1 represents perfect prediction, 0 is no better than random guessing, and -1 means the model predicts the opposite of the actual labels. MCC is a good choice for imbalanced datasets. In this project, the test set had 208 `not tank` and 366 `tank` samples, an imbalanced dataset. 


- [X] **Cohen Kappa Score `cohen_kappa_score(y_true, y_pred, *args)`:**
Cohen's kappa is similar to MCC: it measures how well a classifier sorts `n` items into mutually exclusive categories while correcting for the agreement expected by chance. It ranges from -1 to 1, where 0 is chance-level agreement and negative values indicate agreement worse than chance. Kappa values are commonly grouped into the following categories:  

| Kappa       | Agreement              |
| ---         | ---                    |
|      0      | equivalent to chance   |
| 0.01 – 0.20 | slight agreement       |
| 0.21 – 0.40 | fair agreement         |
| 0.41 – 0.60 | moderate agreement     |
| 0.61 – 0.80 | substantial agreement  |
| 0.81 – 0.99 | near perfect agreement |
|     1       | perfect agreement      |
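Both MCC and Cohen's kappa can be computed directly from the four confusion matrix counts, which makes the reported scores easy to verify by hand. The sketch below uses the standard formulas with the counts from this notebook's test run (TN = 162, FP = 46, FN = 17, TP = 349, with Tank as the positive class). The function names are hypothetical, and the handling of values at or below zero in `agreement_band` is an assumption that goes beyond the table.

```python
import math


def mcc_from_counts(tn, fp, fn, tp):
    """Matthews Correlation Coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def kappa_from_counts(tn, fp, fn, tp):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Raises ZeroDivisionError when chance agreement is exactly 1.
    """
    n = tn + fp + fn + tp
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


def agreement_band(kappa):
    """Map a kappa value to the agreement categories in the table."""
    if kappa <= 0:
        return "equivalent to chance (or worse)"  # assumption for kappa < 0
    if kappa >= 1:
        return "perfect agreement"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return f"{name} agreement"
    return "near perfect agreement"


tn, fp, fn, tp = 162, 46, 17, 349
mcc = mcc_from_counts(tn, fp, fn, tp)
kappa = kappa_from_counts(tn, fp, fn, tp)
print(f"MCC   = {mcc:.4f}")    # MCC   = 0.7600
print(f"kappa = {kappa:.4f}")  # kappa = 0.7551
print(agreement_band(kappa))   # substantial agreement
```

Both values agree with the `matthews_corrcoef` and `cohen_kappa_score` outputs printed earlier in the notebook.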


- [X] **Classification Report `classification_report(y_true, y_pred, *args)`:**
The classification report is a simple digest of several metrics, including:
  - precision - The fraction of samples predicted as a class that actually belong to it; precision falls as false positives increase.
  - accuracy - The fraction of correct predictions over the complete dataset.
  - recall - The fraction of a class's actual samples that the classifier finds.
  - F1 score - The harmonic mean of precision and recall for a particular class.   
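The report values for the Tank class can be reproduced from the confusion matrix counts. The sketch below applies the standard definitions to the counts from this notebook's test run (TN = 162, FP = 46, FN = 17, TP = 349, with Tank as the positive class); the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 (positive class) and overall accuracy."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much is right
    recall = tp / (tp + fn)     # of everything actually positive, how much is found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = classification_metrics(tn=162, fp=46, fn=17, tp=349)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.2f}")
```

The printed values (0.88, 0.95, 0.92, and 0.89) match the Tank row and the accuracy line of the classification report. Swapping the roles of the two classes in the counts gives the Not Tank row.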


## Model Performance and Results Discussion
- **Describe other ways that image classifiers could be used to streamline data sorting, collection, and analysis process:**
  - `REPLACE WITH ANSWER`

- **How well does the model perform?:**
  - `REPLACE WITH ANSWER`

- **Would you deploy this model?:**
  - `REPLACE WITH ANSWER`