<a href="https://colab.research.google.com/github/PurpleDin0/CNN-exercise/blob/master/MST698S_BOG_CNN_exercise-execution_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bad Ozone Grasshoppers - CNN Exercise**
## MST 698S - Data Science Tools And Techniques 

**BLUF:**  Train an image classifier to identify tanks and then evaluate the usefulness of the model.

**Problem Statement:** Design an image classifier to streamline the task of identifying Denovian military vehicles in social network photos.  Specifically, construct a binary classifier that labels each image as `tank` or `not_tank`.
* [X] Describe other ways that image classifiers could be used to streamline data sorting, collection, and analysis process.
* [X] How well does the model perform? Would you deploy this model? Why or why not?

**Summary:** This notebook installs the Python libraries and operating system (OS) programs required to run a Python-based image classifier.  It then evaluates model performance by building a confusion matrix, calculating the Matthews Correlation Coefficient and Cohen's Kappa, and generating a classification report.

**Usage Details:** This notebook is designed to run in the [Google Colab environment](https://colab.research.google.com/). However, it should work in ***most*** Linux-based Jupyter Notebook or JupyterLab environments, as long as appropriate versions of TensorFlow, Keras, and NumPy are installed.  The main purpose of the notebook is to install the relevant Python libraries, execute the code, and save the output to a cloud repository.  <font color=yellow>CAUTION: To run this notebook on a Windows-based system, the user must install Git, TensorFlow, Keras, and NumPy, and update the default file paths to match Windows formatting. </font>

## Initialize the Environment 
1. Clone the GitHub repo [located here](https://github.com/PurpleDin0/CNN-exercise).  
```
!git clone https://github.com/[repo_owner]/[repo].git
```

2. Download the image sets from [here](https://drive.google.com/drive/folders/12H7D5-ipY5hCv2zr-6XQEUz3sXUXS3bE?usp=sharing).  The image set can be downloaded manually or with the code below.

In [0]:
# Navigate the working directory in colab to "/content" 
%cd /content/
### 1. clone the relevant github repo
!git clone https://github.com/PurpleDin0/CNN-exercise.git
# Navigate to the newly created repo folder
%cd /content/CNN-exercise

### 2. Download the image sets ###
## Import PyDrive and associated libraries. ##
# This only needs to be done once per notebook.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

## Download a file based on its file ID. ##
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
# This downloads the test_set.zip file from google drive
file_id = '1xT3cTVgLBbaSeub2bwoFflPbEBnPM8Uv'
downloaded = drive.CreateFile({'id': file_id})
file_name = 'test_set.zip'
downloaded.GetContentFile(file_name)

## Unzip the file that was just downloaded ##
# The below code also prints pretty headers and displays number of unzipped files
# There are easier ways to do this (e.g. using the os library) but this works
# Only line that is actually needed is "!unzip {file_name}"

# Print the unzip start banner
txt = "Unzipping " +  file_name
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))

# determine the quantity of files in the current working folder/subfolders using the ! find command
file_qty_initial = !find -type f | wc -l

#Unzip the files (this is the only required line)
!unzip -q {file_name}

# same as the ! command above but calls get_ipython().getoutput() instead of the ! shorthand
# (just a different way of doing the same thing)
file_qty_after = get_ipython().getoutput('find -type f | wc -l')
file_qty_dif = str(int(file_qty_after[0]) - int(file_qty_initial[0]))

# Print the unzip completion banner
txt = "Unzipped " + file_qty_dif + " items(s)" 
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))

/content
Cloning into 'CNN-exercise'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 27 (delta 14), reused 6 (delta 4), pack-reused 0
Unpacking objects: 100% (27/27), done.
/content/CNN-exercise

 *************** Unzipping test_set.zip ***************

 *************** Unzipped 2872 items(s) ***************
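
The comments in the cell above mention that the unzip-and-count step can be done more simply. A minimal sketch using only the Python standard library is shown below; the helper name `unzip_and_count` is hypothetical and not part of the cloned repo. It counts the file entries in the archive rather than comparing `find` results before and after, so the two counts can differ if the archive overwrites files that already exist.

```python
import zipfile


def unzip_and_count(zip_path, dest_dir="."):
    """Extract zip_path into dest_dir and return the number of files extracted.

    Hypothetical helper for illustration; directory entries are not counted.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Count only real files, skipping entries that represent folders
        file_count = sum(1 for info in archive.infolist() if not info.is_dir())
        archive.extractall(dest_dir)
    return file_count


# Example use in this notebook (assumes test_set.zip is in the working directory):
# print("Unzipped", unzip_and_count("test_set.zip"), "item(s)")
```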


## Train the model
Training will take ~30 minutes per epoch if you train on a CPU (total of 12 hours).


In [0]:
!python3 CNN_trainer.py

Loading...
Using TensorFlow backend.
2020-05-26 19:33:37.604166: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
CNN will build your convolutional neural network!
Accessing image data...
Training model...
2020-05-26 19:33:39.438113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-26 19:33:39.488288: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 19:33:39.489262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-05-26 19:33:39.489303: I tensorflow/stream_executor/platform/default/ds

## Run the model against the test data
1. Load the classifier function and run it against the data downloaded earlier, located in the `test_set` folder.

2. Store the results in a pandas data frame for easier manipulation.
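
For step 2, the actual label of each image can be recovered from the folder it sits in. The sketch below shows one way to do that in isolation, using only the standard library. The names `LABELS`, `actual_label`, and `build_rows` are hypothetical, and it assumes each image lives directly inside a `tank` or `not_tank` folder and that each prediction has already been reduced to a single number.

```python
from pathlib import PurePosixPath

# Assumed folder-to-label mapping, mirroring the tank / not_tank folders in test_set
LABELS = {"not_tank": 0, "tank": 1}


def actual_label(image_path):
    """Return 0 or 1 based on the image's parent folder; raise if it is unknown."""
    folder = PurePosixPath(image_path).parent.name
    if folder not in LABELS:
        raise ValueError(f"Cannot infer a label from path: {image_path}")
    return LABELS[folder]


def build_rows(predictions):
    """Turn a {path: predicted_value} mapping into (path, predicted, actual) rows."""
    return [(path, int(pred), actual_label(path)) for path, pred in predictions.items()]
```

The resulting rows can be handed to pandas, for example `pd.DataFrame(rows, columns=['path', 'predicted', 'actual']).set_index('path')`. Raising an error on an unexpected folder, rather than printing a message, keeps unlabeled rows out of the results.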

In [0]:
# Navigate to the code from the cloned python repo
%cd /content/CNN-exercise

# Import the classifier function code
import classifier_function
# Import pandas so we can store the results in a dataframe
import pandas as pd

### 1. call the classifier function against the test data using the CNN Model. ###
test_data_path = '/content/CNN-exercise/test_set/'
model_path = '/content/CNN-exercise/CNN_model.h5'
out_list = classifier_function.image_classifier(test_data_path, model_path)

### 2. Create a dataframe that stores the predicted value and actual value for the tank classification  ###
# Initializes two dictionaries (actual and predicted)
out_list_fixed = {}
out_list_actual = {}

# Step through the classifier output and build the predicted and actual tank classification dictionaries
for key in out_list:
    out_list_fixed[key] = int(out_list[key][0][0])  # converts the tank prediction to an integer
    if '/not_tank/' in key:
        out_list_actual[key] = 0
    elif '/tank/' in key:
        out_list_actual[key] = 1
    else:
        print('SOMETHING BROKE ... WHAT DID YOU DO!')

# Convert the dictionaries that were created into dataframes and then join those two dataframes 
df_predicted = pd.DataFrame.from_dict(out_list_fixed, orient='index', columns=['predicted'])
df_actual = pd.DataFrame.from_dict(out_list_actual, orient='index', columns=['actual'])
df_results = pd.concat([df_predicted, df_actual], axis=1)

/content/CNN-exercise


  "Palette images with Transparency expressed in bytes should be "


## Evaluate the results from the image classifier
Next we will check the accuracy of the image classifier using several different methods.  
* [X] Create a confusion matrix
* [X] Calculate the Matthews Correlation Coefficient
* [X] Calculate the Cohen kappa score
* [X] Create a classification report

All of these are available from the `sklearn.metrics` module by importing the `classification_report`, `confusion_matrix`, `matthews_corrcoef`, and `cohen_kappa_score` functions.

```python
from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef, cohen_kappa_score
```


In [0]:
# Import the required functions
from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef, cohen_kappa_score

# Build the confusion matrix
txt = "Confusion Matrix"
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))
matrix = confusion_matrix(df_results.actual.values.tolist(), df_results.predicted.values.tolist())
# sklearn puts actual labels in rows and predicted labels in columns (0 = Not Tank, 1 = Tank),
# so matrix[0][0] holds the true negatives and matrix[1][1] holds the true positives
print("[ TN ",  matrix[0][0], " | FP  ", matrix[0][1], "]\n[ FN  ", matrix[1][0], " | TP ", matrix[1][1], "]")
print("\n  N =", matrix.sum())

# Determine the matthews_corrcoef
txt = "Matthews Correlation Coefficient"
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))
print(matthews_corrcoef(df_results.actual.values.tolist(), df_results.predicted.values.tolist()))

# Determine the cohen_kappa_score
txt = "Cohens Kappa"
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))
print(cohen_kappa_score(df_results.actual.values.tolist(), df_results.predicted.values.tolist()))

# Build the classification_report
txt = "Classification Report"
print("\n", "*"*(26-(len(txt)//2)), txt, "*"*(26-(len(txt)//2)))
print(classification_report(df_results.actual.values.tolist(), df_results.predicted.values.tolist(), target_names=['Not Tank', 'Tank'], output_dict=False))  # ADDED TARGET NAMES


 ****************** Confusion Matrix ******************
[ TN  162  | FP   46 ]
[ FN   17  | TP  349 ]

  N = 574

 ********** Matthews Correlation Coefficient **********
0.7599663028252964

 ******************** Cohens Kappa ********************
0.7551227704267508

 **************** Classification Report ****************
              precision    recall  f1-score   support

    Not Tank       0.91      0.78      0.84       208
        Tank       0.88      0.95      0.92       366

    accuracy                           0.89       574
   macro avg       0.89      0.87      0.88       574
weighted avg       0.89      0.89      0.89       574
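
When reading a binary confusion matrix from scikit-learn, it helps to remember that rows are actual labels and columns are predicted labels, with labels sorted in ascending order. For labels 0 and 1, `confusion_matrix(y_true, y_pred).ravel()` therefore yields the counts in the order TN, FP, FN, TP. The pure-Python sketch below reproduces that ordering on a tiny made-up label set; the function name `binary_confusion_counts` is hypothetical and the labels are illustrative, not the notebook's data.

```python
def binary_confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for labels 0 (negative) and 1 (positive)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 1 and predicted == 1:
            tp += 1
        else:
            raise ValueError(f"Labels must be 0 or 1, got {actual!r} and {predicted!r}")
    return tn, fp, fn, tp


# Tiny illustrative labels (not the notebook's data)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]
tn, fp, fn, tp = binary_confusion_counts(y_true, y_pred)
print(f"[ TN {tn} | FP {fp} ]")
print(f"[ FN {fn} | TP {tp} ]")
```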



## Explain the Evaluation Metrics
- [X] **Confusion Matrix `confusion_matrix(y_true, y_pred, *args)`:**  The confusion matrix tabulates the model's predictions against the actual labels, with one cell for each combination of actual and predicted class, so correct and incorrect predictions are counted separately. The cell counts are used to calculate additional metrics, such as the Matthews Correlation Coefficient (MCC).  


- [X] **Matthews Correlation Coefficient `matthews_corrcoef(y_true, y_pred, *args)`:**  The Matthews Correlation Coefficient (MCC) measures model prediction performance as a floating point value from -1 to +1. An MCC of +1 represents a perfect prediction, 0 is no better than chance, and -1 means the model predicts the opposite of the actual label every time. MCC is a good choice for imbalanced data sets. In this project, the test set had 366 `tank` and 208 `not tank` samples, an imbalanced dataset. 


- [X] **Cohen's Kappa Score `cohen_kappa_score(y_true, y_pred, *args)`:**
Cohen's Kappa Score is similar to MCC in that it measures how well a classifier assigns `n` items to mutually exclusive categories, corrected for the agreement expected by chance. It is a floating point number from -1 to 1: 0 indicates agreement equivalent to chance, and negative values indicate agreement worse than chance. The Kappa statistic is commonly grouped into categories ([Source](http://web2.cs.columbia.edu/~julia/courses/CS6998/Interrater_agreement.Kappa_statistic.pdf)):  

| Kappa       | Agreement              |
| ---:        | :---                   |
|      0      | equivalent to chance   |
| 0.01 – 0.20 | slight agreement       |
| 0.21 – 0.40 | fair agreement         |
| 0.41 – 0.60 | moderate agreement     |
| 0.61 – 0.80 | substantial agreement  |
| 0.81 – 0.99 | near perfect agreement |
|     1       | perfect agreement      |



- [X] **Classification Report `classification_report(y_true, y_pred, *args)`:**
The Classification Report is a simple digest of several metrics including:
  - precision - The fraction of samples predicted as a class that actually belong to it, TP / (TP + FP). High precision means few false positives.
  - accuracy - The fraction of correct predictions over the complete dataset.
  - recall - The fraction of a class's actual samples that the classifier finds, TP / (TP + FN).
  - F-1 score - The harmonic mean of precision and recall for a particular category.   
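
The metrics described above can all be derived from the four confusion-matrix counts. The sketch below is a hand calculation under stated assumptions, not scikit-learn's implementation: the function name `binary_metrics` is hypothetical, `Tank` is treated as the positive class, and the counts (TN = 162, FP = 46, FN = 17, TP = 349) are read from the results above using the support values in the classification report (208 `Not Tank`, 366 `Tank`). It also assumes every denominator is non-zero.

```python
import math


def binary_metrics(tn, fp, fn, tp):
    """Compute common binary classification metrics from confusion-matrix counts."""
    n = tn + fp + fn + tp
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n

    # Matthews Correlation Coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    # Cohen's Kappa: observed agreement corrected for chance agreement
    p_observed = accuracy
    p_expected = ((tn + fp) * (tn + fn) + (tp + fn) * (tp + fp)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
        "kappa": kappa,
    }


# Counts taken from this notebook's results, with Tank as the positive class
metrics = binary_metrics(tn=162, fp=46, fn=17, tp=349)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

With these counts the hand calculation agrees with the values reported earlier (MCC of about 0.760, Kappa of about 0.755, and the `Tank` row of the classification report), which is a useful check that the matrix cells are being read in the right order.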


## Model Performance and Results Discussion
- **Describe other ways that image classifiers could be used to streamline data sorting, collection, and analysis process:**
  - **Sorting:**  A multi-label classifier could be used to label/tag images that may contain content of interest.  Then an image search/retrieval system could be designed that allows the analyst to view the machine applied labels.  Finally, the analyst could then validate or correct the machine applied labels when they access any of the data.  This human labeled/validated data could then be used in future training cycles.  
  - **Collection:**  Collection assets with limited onboard storage/bandwidth could incorporate automated image classification and only record/transmit images that match those they are trained to classify.
  - **Analysis:** Binary classifiers could be trained on multi-spectral imagery of assets that human analysts have a hard time identifying.  For example, overhead imagery of similar aircraft or vehicles.  Then this system could be accessed by an analyst when they are reviewing the image to assist them in making the correct determination of specific aircraft/vehicle type.

- **How well does the model perform?:**
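  - The model performs well, with an MCC and a Cohen's Kappa of about 0.76, which falls in the substantial agreement band.  However, the model has a fairly high false positive rate: 46 of the 208 `not tank` images (about 22%) were flagged as tanks.  This, together with the model's low false negative rate (17 of the 366 `tank` images, about 5%, were missed), makes it a good initial image filter to help flag images that contain *potential* tanks.  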

- **Would you deploy this model?:**
  - Yes, this would be of assistance to analysts that need to review vast amounts of images for potential tanks.  However, it cannot be used to conduct targeting or automate any process of the kill chain without human involvement.
