
# Project 6: Image Classification with Deep Learning




This project introduces deep learning. Deep learning is a huge leap forward in data science, and the field is less than 15 years old. The workflow differs significantly from our previous projects, so you will be given a walkthrough document that lays out the steps, much like Project 1. Deep learning is fascinating, and the goal here is simply to go through the process so you can appreciate its power.




Data collection for deep learning projects is a complex task: there is no CSV file that we can load as training data, and constructing a training data set is a large undertaking. We have already imported all the training images (which took hours), and I will show you how to load the training data through the "pickle" process.




There are three files in this data set:

- [the feature set]( https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/X.pickle ), i.e. images of dogs and cats ( mostly )

- [the target set]( https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/y.pickle ), i.e. the indicator if something is a dog or cat ( mostly )

- [a test image]( https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/dog.jpg )








This project will classify new, unseen images of cats and dogs. This task was one of the first big success stories of deep learning, and we will go through the process of building a learning algorithm that performs it. Telling a picture of a cat from a picture of a dog is easy for humans, but it had been notoriously difficult to get a computer to perform well on the task. Deep learning solved that.


The same data set links are repeated below as plain text, so you can copy them directly and bypass Google Colab's "You are leaving Colab" redirect.









https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/X.pickle



https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/y.pickle



https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/dog.jpg


# Images – To Do List




Before starting this problem, connect to a runtime. A CPU runtime is enough to check that the code runs, but a T4 GPU runtime is recommended for training.

Be sure to enable GPU runtime processing in your Jupyter notebook.




## Problem Definition




* Write a concise problem definition for the project. Put it in a text field at the top of your Jupyter notebook.



* Load necessary packages.




In [None]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt



import pickle # standard-library module for serializing (saving) and deserializing (loading) Python objects



import random



import os

import cv2

In [None]:
import tensorflow as tf

import tensorflow.keras as keras # Keras: high-level API for building neural networks



from keras.models import Sequential

from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

from keras.utils import plot_model



from sklearn.model_selection import train_test_split

from sklearn import datasets # bundled sample data sets (not used below)

## Data Collection from AWS S3 bucket.


In [None]:
url_X = "https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/X.pickle"



url_y = "https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/y.pickle"



url_dog = "https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/dog.jpg"

### Confirm the connection to the X and y pickle files on AWS by checking what curl returns for each URL:

In [None]:
# -I requests only the response headers, which is enough to confirm the file is reachable
!curl -I https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/X.pickle

In [None]:
!curl -I https://ddc-datascience.s3.amazonaws.com/Projects/Project.6-Images/Data/y.pickle



## Read the pickled data from X.pickle and y.pickle




In [None]:
X = pd.read_pickle(url_X)

y = np.array(pd.read_pickle(url_y))
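
To make the "pickle" process less of a black box, here is a minimal sketch of a round trip: an array is serialized to bytes and then restored. The toy array and the in-memory buffer are assumptions for illustration; the real X.pickle is read from the S3 URL above.

```python
import io
import pickle

import numpy as np

# Toy stand-in for the image array (assumption: the real X is much larger).
images = np.arange(2 * 4 * 4, dtype=np.uint8).reshape(2, 4, 4, 1)

# Serialize ("pickle") the array into an in-memory bytes buffer.
buffer = io.BytesIO()
pickle.dump(images, buffer)

# Rewind the buffer and deserialize ("unpickle") it again.
buffer.seek(0)
restored = pickle.load(buffer)

print(restored.shape, restored.dtype)
print("Round trip preserved the data:", np.array_equal(images, restored))
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.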

## Look at X and y data shapes and types

In [None]:
# Confirm data shapes

print(f"X shape: {X.shape}")

print(f"y shape: {y.shape}")

In [None]:
# Confirm data types

print(f"X datatype: {type(X)}")

print(f"y datatype: {type(y)}")

## Data Preprocessing




* Scale the values in X so that they fall between 0 and 1 by dividing by 255.




In [None]:
# Scale X only; y holds the class labels (0/1), not pixel values

X_scaled = X/255

X_scaled
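
The effect of dividing by 255 can be checked on a tiny example. This sketch assumes 8-bit pixel values (0 to 255) stored as `uint8`; the real X may use a different dtype. It also shows an optional `float32` variant, which halves the memory used compared with NumPy's default `float64` result.

```python
import numpy as np

# Toy pixel values covering the full 8-bit range (assumption: uint8 input).
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Plain division promotes the result to float64.
scaled = pixels / 255

# Casting first keeps the result in float32, which uses half the memory.
scaled32 = pixels.astype("float32") / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
print(scaled32.dtype, scaled32.nbytes, "bytes vs", scaled.nbytes, "bytes")
```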

## Exploratory Data Analysis




* Look at the shape of X and y. Ensure that X is 4 dimensional.




In [None]:
X_scaled.shape

* Plot a few ( >5 ) of the images in X using plt.imshow().

In [None]:
plt.imshow(X_scaled[0])

In [None]:
plt.imshow(X_scaled[5])

In [None]:
plt.imshow(X_scaled[10])

In [None]:
plt.imshow(X_scaled[15])

In [None]:
plt.imshow(X_scaled[20])

In [None]:
# Loop over the same images shown above instead of repeating plt.imshow() calls

for c in range(0, 21, 5): # indices 0, 5, 10, 15, 20

  plt.imshow(X_scaled[c]) # draw image c

  plt.show() # display it before moving on to the next image



* Look at the response values in y for those images.

In [None]:
y[0:21:5] #dogs 0, cats 1

* Start with a random subset of 10% to get familiar with the process of building a NN before going through the process again with the full set.

In [None]:
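# Take a random 10% sample, using the same indices for X and y so labels stay aligned

sample_size = int(0.1 * len(X_scaled))

rng = np.random.default_rng(42) # fixed seed so the sample is reproducible

sample_idx = rng.choice(len(X_scaled), size=sample_size, replace=False)

X_sample = X_scaled[sample_idx]

y_sample = y[sample_idx]

In [None]:
# The sampled indices are already in random order, so no separate shuffle is needed.

# (Shuffling X on its own would break the pairing between images and labels.)

print(X_sample.shape, y_sample.shape)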
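
When drawing a random subset, each image has to stay paired with its label. The sketch below checks this on toy data, where every "image" is filled with its own index and the label is that index modulo 2. The array names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: image i is a 4x4x1 block filled with the value i; its label is i % 2.
X_toy = np.arange(20).reshape(20, 1, 1, 1) * np.ones((1, 4, 4, 1))
y_toy = np.arange(20) % 2

# Draw 5 distinct indices and apply the SAME indices to both arrays.
idx = rng.choice(len(X_toy), size=5, replace=False)
X_s, y_s = X_toy[idx], y_toy[idx]

# Recover each sampled image's original index from its fill value and
# confirm the label still matches.
recovered = X_s[:, 0, 0, 0].astype(int)
print("Labels still aligned:", np.array_equal(recovered % 2, y_s))
```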

## Data Processing




* Split X and y into training and testing sets.


In [None]:
# Another way to split the data, without randomization ...

# X = X_scaled[:1000]

# y = y[:1000]

In [None]:
X = X_sample

y = y_sample

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
# Reshape training and testing data to (samples, height, width, channels)

X_train_reshaped = X_train.reshape(-1, 100, 100, 1)

X_test_reshaped = X_test.reshape(-1, 100, 100, 1)



print("X_train shape:", X_train_reshaped.shape)

print("X_test shape:", X_test_reshaped.shape)


In [None]:
# Visualize some samples

plt.figure(figsize=(10, 5))

for i in range(5):

    plt.subplot(1, 5, i+1)

    plt.imshow(X_train_reshaped[i].squeeze(), cmap='gray') # drop the channel axis and show in grayscale

    plt.title(f"Label: {y_train[i]}")

plt.show()



*  Build a convolutional neural network with the following:

  * Sequential layers

  * At least two 2D convolutional layers using the 'relu' activation function and a (3,3) kernel size.

  * A MaxPooling2D layer after each 2D convolutional layer that has a pool size of (2,2).

  * A dense output layer using the 'sigmoid' activation function.

  Note: you can play around with the number of layers and nodes to try to get better performance.



* Compile your model. Use the 'adam' optimizer. Determine which loss function and metric is most appropriate for this problem.



* Fit your model using the training set.



* Evaluate your model using the testing set.



* Plot the distribution of probabilities for the testing set.
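
It can help to trace how the image size changes through the network described above. The sketch below does the arithmetic for two Conv2D + MaxPooling2D blocks on a 100x100x1 input, assuming Keras defaults (no padding, convolution stride 1, pooling stride equal to the pool size). The filter counts 32 and 64 are assumptions for illustration, not requirements.

```python
def conv2d_out(size, kernel=3, stride=1):
    """Output width/height of a convolution with no padding ('valid')."""
    return (size - kernel) // stride + 1


def maxpool_out(size, pool=2):
    """Output width/height of max pooling with stride equal to the pool size."""
    return size // pool


def conv2d_params(in_channels, filters, kernel=3):
    """Trainable parameters: one kernel per filter plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


size, channels = 100, 1
for filters in (32, 64):  # assumed filter counts
    params = conv2d_params(channels, filters)
    size = conv2d_out(size)
    print(f"Conv2D({filters}): {size}x{size}x{filters}, {params} parameters")
    size = maxpool_out(size)
    print(f"MaxPooling2D:  {size}x{size}x{filters}")
    channels = filters

flat = size * size * channels
print("Flatten output length:", flat)
```

Comparing these numbers with `model.summary()` is a quick way to confirm that the layers are wired as intended.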




In [None]:
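model = Sequential([

    # First convolution + pooling block

    Conv2D(filters=32, kernel_size=(3, 3), activation='relu', input_shape=(100, 100, 1)),

    MaxPooling2D(pool_size=(2, 2)),

    # Second convolution + pooling block (the task list asks for at least two)

    Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),

    MaxPooling2D(pool_size=(2, 2)),

    Flatten(),

    Dense(64, activation='relu'),

    Dense(1, activation='sigmoid')

])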



model.compile(optimizer='adam',

              loss='binary_crossentropy',

              metrics=['accuracy'])



history = model.fit(X_train_reshaped, y_train, epochs=4, validation_data=(X_test_reshaped, y_test))



# Evaluate the model

test_loss, test_accuracy = model.evaluate(X_test_reshaped, y_test)

print(f"Test accuracy: {test_accuracy:.2f}")
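
For intuition about the `binary_crossentropy` loss used above, here is a minimal NumPy sketch of the same formula, averaged over samples. The clipping constant `eps` is an assumption that mirrors common practice for avoiding `log(0)`; this is not the Keras implementation itself.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype="float64")
    # Clip probabilities away from exactly 0 and 1 so the logs stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype="float64"), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


# Confident and correct predictions give a small loss ...
print(binary_crossentropy([1, 0], [0.9, 0.1]))
# ... while confident and wrong predictions are penalized heavily.
print(binary_crossentropy([1, 0], [0.1, 0.9]))
```

A model that outputs 0.5 for everything scores ln(2), about 0.69, which is a useful baseline when reading the training log.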


In [None]:
# Predict probabilities for the entire test set

predictions = model.predict(X_test_reshaped)



# Plot the distribution of probabilities

plt.figure(figsize=(10, 6))

plt.hist(predictions.flatten(), bins=50, edgecolor='black')

plt.title('Distribution of Probabilities')

plt.xlabel('Probability')

plt.ylabel('Frequency')

plt.axvline(x=0.5, color='r', linestyle='dashed', linewidth=2, label='Decision Boundary')

plt.legend()

plt.show()



# Plot ROC curve

from sklearn.metrics import roc_curve, auc



fpr, tpr, _ = roc_curve(y_test, predictions.flatten())

roc_auc = auc(fpr, tpr)



plt.figure(figsize=(8, 6))

plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('Receiver Operating Characteristic (ROC) Curve')

plt.legend(loc="lower right")

plt.show()



# Print some statistics

print("Mean prediction:", np.mean(predictions))

print("Standard deviation of predictions:", np.std(predictions))
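
The probabilities can also be turned into class labels by applying the 0.5 decision boundary drawn in the histogram. The sketch below counts the four outcomes of a binary classifier on toy values; the function name and the toy data are made up for illustration.

```python
import numpy as np


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a given threshold."""
    y_true = np.asarray(y_true).astype(int).ravel()
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    return {
        "tp": int(np.sum((y_true == 1) & (y_pred == 1))),
        "tn": int(np.sum((y_true == 0) & (y_pred == 0))),
        "fp": int(np.sum((y_true == 0) & (y_pred == 1))),
        "fn": int(np.sum((y_true == 1) & (y_pred == 0))),
    }


# Toy labels and probabilities (assumption: stand-ins for y_test and predictions).
counts = confusion_counts([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
accuracy = (counts["tp"] + counts["tn"]) / sum(counts.values())
print(counts, "accuracy:", accuracy)
```

With the real data, the call would presumably look like `confusion_counts(y_test, predictions)`.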


* Define a function that will read in a new image and convert it to a 4 dimensional array of pixels (ask the instructor for help with this). Hint: [numpy.reshape]( https://numpy.org/doc/stable/reference/generated/numpy.reshape.html )



* Use the function defined above to read in the dog.jpg image that is saved in the AWS S3 bucket.



* Use the neural network you created to predict whether the image is a dog or a cat.
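
One possible shape for such a function is sketched below. It assumes the model expects 100x100 grayscale input scaled to [0, 1], matching `input_shape=(100, 100, 1)` above. The helper names `resize_nearest`, `to_model_input`, and `read_image` are hypothetical, and the nearest-neighbour resize is a simple NumPy stand-in for a library resize.

```python
import urllib.request

import numpy as np

IMG_SIZE = 100  # assumption: matches the model's input_shape=(100, 100, 1)


def resize_nearest(gray, size=IMG_SIZE):
    """Nearest-neighbour resize of a 2-D array to (size, size)."""
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    return gray[rows[:, None], cols]


def to_model_input(gray, size=IMG_SIZE):
    """Turn a 2-D grayscale image into a (1, size, size, 1) float array in [0, 1]."""
    small = resize_nearest(np.asarray(gray), size)
    scaled = small.astype("float32") / 255.0
    return scaled.reshape(1, size, size, 1)


def read_image(url, size=IMG_SIZE):
    """Download an image, decode it as grayscale, and prepare it for the model."""
    import cv2  # imported here so the helpers above work without OpenCV

    with urllib.request.urlopen(url) as response:
        raw = np.frombuffer(response.read(), dtype=np.uint8)
    gray = cv2.imdecode(raw, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise ValueError(f"Could not decode an image from {url}")
    return to_model_input(gray, size)


# Quick check on a synthetic 200x300 image instead of a download.
demo = to_model_input(np.full((200, 300), 255, dtype=np.uint8))
print(demo.shape, demo.dtype, demo.max())
```

With the trained model, `model.predict(read_image(url_dog))` should return a single probability; values near 1 point to whichever class is labeled 1 in y.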

## Communication of Results




* Communicate the results of your analysis.




## **BONUS** (optional)




* Upload an image of your (or your friend's or family's) dog or cat and use your model to predict whether the image is a dog or cat.

* Hint: you'll probably need to convert the image from color to grayscale. OpenCV, Pillow, and other libraries are your friends.
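
As a rough illustration of the grayscale hint, the sketch below converts an RGB array to grayscale with the standard luminance weights (0.299, 0.587, 0.114). It assumes the channels are in RGB order, as Pillow returns them; OpenCV's `cv2.imread` returns BGR, so the channel order would need to be reversed first. The function name is hypothetical.

```python
import numpy as np


def rgb_to_gray(rgb):
    """Convert an (H, W, 3) RGB array to an (H, W) uint8 grayscale array."""
    weights = np.array([0.299, 0.587, 0.114])  # red, green, blue
    gray = np.asarray(rgb, dtype="float64") @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


# Toy 2x2 image: one white pixel, one pure red pixel, two black pixels.
toy = np.zeros((2, 2, 3), dtype=np.uint8)
toy[0, 0] = [255, 255, 255]
toy[0, 1] = [255, 0, 0]

gray = rgb_to_gray(toy)
print(gray)
```

The resulting 2-D array can then be resized to 100x100, scaled by 255, and reshaped to four dimensions before being passed to the model.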