<a href="https://colab.research.google.com/github/Lucy-Moctezuma/SFSU-CodeLab-Work-/blob/main/MARC%20Machine%20Learning%20Project/2_Yeast_Cells_with_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **WELCOME to the coding portion for the Introduction to Deep Learning for Image Classification!**

This notebook was created by Lucy Moctezuma Tan, Florentine van Nouhuijs, Lorena Benitez-Rivera (SFSU master's students and CoDE lab members) and Pleuni Pennings (SFSU bio professor).

Special acknowledgement to Dr. Ilmi Yoon (SFSU CS professor) for providing the base code, and to Dr. Mark Chan and his student lab members, Gabriela Alvarez-Azanedo and Adilene Rodriguez, for sharing the lab images used in the data analysis.

# OBJECTIVE OF THIS EXERCISE:

We are going to work with 104 **yeast cell images** provided by Dr. Chan's lab and use a computer model to distinguish between two types of cells: Wild Type (WT) and Mutant (MT).
- 61 images are already labeled and will be used as training images.
- 20 images are used for validation during the training process.
- The remaining 23 images are used for testing our model's performance.

Below is an example for the two different types of yeast cells:

![WT_vs_Mutant.png](https://drive.google.com/uc?export=view&id=1clVtBYMqxCIoE31uGmYnm5k74p3WZxoT)

### Do they look similar to you? Let's see if the computer can tell them apart!

**The Objective** of Workshop 2 is to use a **Deep Learning** model to predict which of the 23 test images show WT yeast cells and which show Mutant yeast cells.

- ***Wild Type Cells:*** Mother cell transmits normal amount of vacuoles (green colored) to daughter cell.
- ***Mutant Cells:*** Mother cell does not transmit normal amount of vacuoles (green colored) to daughter cell.

In order to teach our model how to distinguish these cells, we will train it on images that have already been labeled by students from Dr. Chan's lab, using a technique called **Transfer Learning**.


# DEEP LEARNING REVIEW:

**Deep Learning** is a method within Machine Learning (ML) in which the computer learns to accomplish a task through trial and error by analyzing training samples. Deep learning models are built from **Artificial Neural Networks**, so the two terms are often used interchangeably.

**Deep Learning** and **Artificial Neural Networks** were initially inspired by how neurons are organized and signal to each other in the brain.  


**Artificial Neural Networks (ANN)** can be pictured as a series of stacked layers, where each layer is composed of a number of ***nodes***. All neural networks are composed of an **input layer**, one or more **hidden layers** and an **output layer**. There are many types of ANN, but all of them share the following features: 

![neuralnet.png](https://drive.google.com/uc?export=view&id=1UdEGkblSb3X__Y7ez6R1hJOQ4czP9RHC)

**Image sources**: 

**Human Neural Networks**: https://upload.wikimedia.org/wikipedia/commons/5/5b/Cajal_cortex_drawings.png

**Artificial Neural Networks**: https://upload.wikimedia.org/wikipedia/commons/d/d2/Neural_network_explain.png

### PARTS OF AN ARTIFICIAL NEURAL NETWORK

**1) Input layer:** the layer through which we feed in our initial data. The data can be data tables, text, images, etc.

**2) Hidden layers:** the layers that further process the information they receive from the input layer. In the example above we have 2 hidden layers, but the number of layers can vary.

**3) Output layer:** the final layer, where we get our predictions.

**4) Nodes:** the components of each layer. Each node is a small unit of computation: it combines the values it receives and determines what information is passed on to the next layer. How nodes connect to the following layer depends on the type of network.

**5) Weights:** values that represent the strength of the connection between two nodes. The general idea is that a neural network starts with a random set of weights and, during training, the weights are updated in a trial-and-error fashion until the network finds a combination of weights that yields a high-performing model.  
**Notice that in the image above, every black arrow has its own weight.**

You can find more information in the link:  [Neural Networks by IBM](https://www.ibm.com/cloud/learn/neural-networks#toc-how-do-neu-vMq6OP-P)
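
To make the roles of nodes and weights more concrete, here is a minimal sketch of how information could flow through a tiny network. It is only an illustration: the network size, the weight values and the helper names (`sigmoid`, `layer_forward`) are made up for this example and are not part of the model used later in this notebook.

```python
import math

def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Compute the output of every node in one layer.

    weights[j][i] is the weight on the arrow from input i to node j.
    """
    outputs = []
    for node_weights, bias in zip(weights, biases):
        # Each node adds up its inputs, each multiplied by the weight of its arrow
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(sigmoid(weighted_sum))
    return outputs

# Hypothetical toy network: 3 inputs -> 2 hidden nodes -> 1 output node
inputs = [0.5, 0.1, 0.9]
hidden_weights = [[0.2, -0.4, 0.7], [-0.5, 0.3, 0.1]]
hidden_biases = [0.0, 0.1]
output_weights = [[0.6, -0.8]]
output_biases = [0.05]

hidden = layer_forward(inputs, hidden_weights, hidden_biases)
prediction = layer_forward(hidden, output_weights, output_biases)
print("Hidden layer values:", hidden)
print("Output layer value:", prediction)
```

Training a network means changing the numbers in `hidden_weights` and `output_weights` until the output matches the known labels as often as possible.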





# TRANSFER LEARNING AND CONVOLUTIONAL NEURAL NETWORKS:

This notebook will use a technique called **TRANSFER LEARNING** and a particular type of Artificial Neural Network, the **CONVOLUTIONAL NEURAL NETWORK (CNN)**.

**- TRANSFER LEARNING** means using a pretrained neural network instead of creating and training your own from scratch. Machine learning practitioners use this method because CNNs are notorious for requiring a lot of images to train. Since we have relatively little data, it is best to start from an existing model that has already been trained on millions of images by companies and research groups.

**Keras** is a Python package that specializes in creating Artificial Neural Networks. Here is a list of the pre-trained models available in Keras: [Keras - Pretrained models](https://keras.io/api/applications/)

**- CONVOLUTIONAL NEURAL NETWORKS (CNN)** are a common type of ANN model used for image classification. The model looks at different local areas in the image, and the hidden layers work together to extract specific image patterns that will ultimately help our model classify our yeast cells.

The pre-trained Convolutional Neural Network we will be using is called **VGG16**.

## **Step 1) Configuring Virtual Interpreter and Uploading image files from Github.**

**A)** The Colab environment runs code cells on a virtual cloud computer owned by Google. To ensure that the code is executed efficiently, a Graphics Processing Unit (GPU) must be enabled:

* Navigate to 'Runtime' > 'Change Runtime Type'
* Set 'Hardware Accelerator' to 'GPU'



In [None]:
#This code shows you the name of the GPU that was assigned to you
!nvidia-smi --list-gpus

**B)** Connect your Colab Notebook to your personal Drive so that there is a place to download your images to.

In [None]:
#Mounting Google Drive into Google Colab
from google.colab import drive
drive.mount('/content/drive')


**C)** Importing a package that allows you to connect to Github and download the images to your own Drive

In [None]:
# This installs a package needed so that images are downloaded from github
!pip install GitPython
from git import Repo

In [None]:
#Copying Github Repo folder structure into your Drive
%cd /content/drive/MyDrive
!git clone https://github.com/MarcMachineLearning/Introduction-to-Deep-Learning


## **Step 2) Importing needed packages to be able to run our neural network and creating a seed**


**A)** Importing all the packages needed to run our code

In [None]:
# imports packages to handle the math and data 
import numpy as np
import random
import pandas as pd

In [None]:
# imports functions from tensorflow and keras that allow the CNN to be loaded, run and modified
import tensorflow as tf
import keras
from tensorflow.keras.models import load_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix


In [None]:
# imports that will allow us to visualize and evaluate our model
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

In [None]:
#imports that let the notebook navigate the files in the drive
import os

**B)** Creating a function that will allow us to set a seed for our results.
A **seed** is a starting number for the computer's random number generator. The numbers the generator produces look random, but the same seed always produces the same sequence. We set a seed so that the results are reproducible and we can all look at the same results.

**NOTE:** People usually do not need to set seeds for their neural networks, but we will for educational purposes.

In [None]:
def random_seed_GPU(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

#fix seed
random_seed_GPU(45)
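
To see what a seed does, here is a small sketch that uses only Python's built-in `random` module. The helper `draw_numbers` is a made-up name for this illustration. Drawing numbers twice with the same seed should give the same list both times.

```python
import random

def draw_numbers(seed, count=3):
    """Reset the random number generator with `seed`, then draw `count` numbers."""
    random.seed(seed)
    return [random.randint(1, 100) for _ in range(count)]

first_run = draw_numbers(45)
second_run = draw_numbers(45)   # same seed as the first run
other_seed = draw_numbers(7)    # a different seed

print("Seed 45, first run: ", first_run)
print("Seed 45, second run:", second_run)
print("Seed 7:             ", other_seed)
print("Same seed gives same numbers:", first_run == second_run)
```

The function `random_seed_GPU` above applies the same idea to every source of randomness the notebook uses (Python, NumPy and TensorFlow).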

## **Step 3) Loading VGG16 Pre-trained CNN**

**VGG16** is a CNN that has already been trained on more than a million images from the ImageNet dataset, so it has learned important ways to distinguish basic image components such as edges, horizontal vs vertical lines, or background versus foreground. This information can also be used to recognize our yeast cells.

![VVG16.png](https://drive.google.com/uc?export=view&id=1c0_Oaq3rRTHwl6KvdztsEk7RpH2rf871)

### PARTS OF VGG16

#### **1) Input layer** 
This layer receives the images produced by our image generator. Each image passes through the input layer first.

#### **2) Hidden layers** 
VGG16's hidden layers are divided into Feature learning layers **(A)** and Classification layers **(B)**.

**A) Feature learning layers**
You can think of these layers as learning the features of the images, such as borders, vertical vs horizontal lines, etc. As you can see in the image, they are grouped in blocks called **convolutional blocks**. VGG16 has 5 blocks.
  - **Convolutional layers**: work as different filters for the image, detecting edges, darker or lighter spots, etc. The end product of a convolutional layer is called a **Feature Map**.
  - **Maxpooling layers**: reduce the dimensions of the filtered images. This is important because every image that goes through our network passes through many filters, which would otherwise produce too much information to handle. Maxpooling tries to preserve the most important features of the image while also making it easier to process.

![Convo_Max.png](https://drive.google.com/uc?export=view&id=1gYJQBAbLsNya9HYLve0epR4hGydU7QAz)

**Image source**:https://picryl.com/media/the-golden-gate-bridge-650ae8
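
The two operations can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions (a tiny 6x6 image, one hand-written 3x3 filter, no padding), not VGG16's actual implementation, and the function names `convolve2d` and `max_pool2d` are made up for this example. As in deep learning libraries, the filter is slid over the image without being flipped.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the feature map."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply the kernel with the image patch underneath it and add everything up
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(total)
        feature_map.append(row)
    return feature_map

def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size block."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        pooled.append(row)
    return pooled

# Toy 6x6 image: dark (0) on the left half, bright (1) on the right half
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# A filter that responds to vertical edges going from dark to bright
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = convolve2d(image, kernel)   # 4x4 feature map
pooled = max_pool2d(feature_map)          # 2x2 after maxpooling

print("Feature map:", feature_map)
print("After maxpooling:", pooled)
```

The feature map has high values only where the filter sits on the dark-to-bright edge, and maxpooling halves the width and height while keeping that strong response.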

**B) Classification layers**
- **Flatten layer**: takes all the filtered images (feature maps) created in the last convolutional block, which are represented as 2D arrays, and turns them into a single 1D vector. This step is necessary because the Dense layers can only take 1D input.

![2d into 1d.png](https://drive.google.com/uc?export=view&id=1SaHoqsniGDI3tpgDO3QlicfsLogbDgwX)

- **Dense layers**: are also called fully connected layers because each neuron is connected to every neuron in the next layer. VGG16 has 2 of them, which start the classification process. 
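
As a rough sketch of these two layer types, the code below flattens two tiny 2x2 feature maps into one vector and passes it through a small fully connected layer. The sizes, weights and function names (`flatten`, `dense`) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def flatten(feature_maps):
    """Turn a list of 2D feature maps into one 1D vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]

def dense(vector, weights, biases):
    """Fully connected layer: every output node sees every input value."""
    return [sum(w * x for w, x in zip(node_weights, vector)) + bias
            for node_weights, bias in zip(weights, biases)]

# Two made-up 2x2 feature maps
feature_maps = [[[1, 2], [3, 4]],
                [[5, 6], [7, 8]]]

vector = flatten(feature_maps)   # 8 values in a row

# Hypothetical dense layer with 2 nodes, each with one weight per input value
weights = [[1, 1, 1, 1, 1, 1, 1, 1],    # node 1 adds up every value
           [0, 1, 0, 1, 0, 1, 0, 1]]    # node 2 only looks at every second value
biases = [0, 10]

outputs = dense(vector, weights, biases)
print("Flattened vector:", vector)
print("Dense layer outputs:", outputs)
```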

#### **3) Output layer**
VGG16 ends with an output layer that gives the predicted classes. This final dense layer has 1000 nodes because it was designed to classify 1000 classes of images. Its output shows the probability that an image belongs to each class.

In [None]:
# load a preset model from keras that has been trained by tons of images and check a summary of it
vgg16_model = tf.keras.applications.VGG16()
vgg16_model.summary()

## **Step 4) Loading Training and Validation Data to look at the Images used for Training**

We divided the labeled images into training and validation sets, stored in separate folders.

  **A) Training:** 61 images that will be used to train only the last layer of our CNN, that is, the output layer.

In [None]:
# Load data folders with images from the Drive
train_path = '/content/drive/MyDrive/Introduction-to-Deep-Learning/Training'
image_size = (224, 224)
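# sort the folder names so the class order is always the same (index 0 = "Mutant", index 1 = "WT")
classes = sorted(os.listdir(train_path))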
print("Training:")
train_batches = ImageDataGenerator() \
                .flow_from_directory(train_path,
                                    image_size,
                                    classes=classes,
                                    class_mode = "categorical",
                                    batch_size=5)

 **B) Validation:** 20 images that are already labeled and that will be used during training to check how the CNN is performing on images it is not training on. Validation runs at the end of every epoch, for the number of batches specified in "validation steps". 

In [None]:
valid_path = '/content/drive/MyDrive/Introduction-to-Deep-Learning/Validation'

print("Validation:")
valid_batches = ImageDataGenerator() \
                .flow_from_directory(valid_path,
                                       image_size,
                                       classes=classes,
                                       class_mode = "categorical",
                                       batch_size=20)

**C)** Let's visualize some of the images we will be training on

In [None]:
# function that plots images with labels
def plots(ims, figsize=(20,20), rows=1, interp=False, titles=None):
    if type(ims[0]) is np.ndarray:
        ims = np.array(ims).astype(np.uint8)
        if (ims.shape[-1] != 3):
            ims = ims.transpose((0,2,3,1))
    # turn each one-hot label (e.g. [1, 0]) into its class name, once, before plotting
    title_list = None
    if titles is not None:
        index_to_class = {v: k for k, v in train_batches.class_indices.items()}
        title_list = [index_to_class[int(np.argmax(t))] for t in titles]
    f = plt.figure(figsize=figsize)
    # number of columns needed to fit all images in the given number of rows
    cols = -(-len(ims) // rows)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, cols, i+1)
        sp.axis('Off')
        if title_list is not None:
            sp.set_title(title_list[i], fontsize=16)
        plt.imshow(ims[i], interpolation=None if interp else 'none')

# take one batch of training images and show the first 5 with their labels
imgs, train_labels = next(train_batches)
plots(imgs[0:5], titles=train_labels)

## **Step 5) Fine-Tuning VGG16 for our dataset**

**Fine-Tuning (Adapting)** means changing the pre-trained network VGG16 into one that can work with our data. 

**A)** The code below copies all the layers and weights of VGG16 except the last layer and freezes them. It then adds a new last layer, which is the only trainable layer. The original output layer classifies images into 1000 classes, but we only have two classes, so the last layer is the only one we need to change. 


In [None]:
# make modifications to the last layer only and check a summary of it

model = Sequential()
for layer in vgg16_model.layers[:-1]:  
    model.add(layer)
for layer in model.layers:
    layer.trainable = False
model.add(Dense(len(classes), kernel_initializer=keras.initializers.glorot_uniform(seed=45) ,activation='softmax'))

model.summary()
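
The model summaries report how many parameters (weights) are trainable. As a worked check, the sketch below counts the parameters of a dense output layer by hand. It assumes that the layer feeding the output layer has 4096 nodes, which is what the VGG16 summary shows for its last hidden dense layer. The function name `dense_parameter_count` is made up for this example.

```python
def dense_parameter_count(n_inputs, n_outputs):
    """One weight per input-output connection, plus one bias per output node."""
    return n_inputs * n_outputs + n_outputs

# Original VGG16 output layer: 4096 inputs -> 1000 classes
original_output_layer = dense_parameter_count(4096, 1000)

# Our new output layer: 4096 inputs -> 2 classes (Mutant and WT)
new_output_layer = dense_parameter_count(4096, 2)

print("Parameters in the original output layer:", original_output_layer)
print("Trainable parameters in our new output layer:", new_output_layer)
```

Because every other layer is frozen, the second number should match the "Trainable params" line in the summary of our modified model.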

## **Step 6) Compiling the model and Training our Data** 

**A)** Compiling allows us to set different parameters, such as how fast we want the model to learn or how we want it to calculate its performance.

- **Learning rate**: a small positive value, usually between 0 and 1, that determines how big a change the model makes to its weights at each training step. The higher the learning rate, the fewer epochs (training cycles) are required. This is one of the most important parameters to tune in a neural network. If it is too large, the model can overshoot and settle on a suboptimal result or fail to converge; if it is too small, training becomes very slow and can get stuck.

- **Loss**: a way to measure how well your model fits your data. The lower, the better. There are many kinds of loss functions, and the right one depends on what you are trying to predict. Ours is *categorical crossentropy*.

- **Metrics**: allow us to choose how we want to evaluate our model. We chose *accuracy*, which measures the percentage of correct predictions out of all the predictions made by our model.
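
The effect of the learning rate can be illustrated with a toy problem that has nothing to do with images: finding the value of a single weight `w` that minimizes the loss `(w - 3)**2`. This sketch uses plain gradient descent rather than the Adam optimizer used in this notebook, and the three learning rates are arbitrary examples.

```python
def gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimize loss(w) = (w - 3)**2, whose gradient is 2 * (w - 3)."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w = w - learning_rate * gradient   # move against the gradient
    return w

# The best possible weight is w = 3
for lr in (0.001, 0.1, 1.1):
    w = gradient_descent(lr)
    print(f"learning rate {lr}: w after 20 steps = {w:.3f}")
```

With the smallest rate the weight barely moves in 20 steps, with the middle rate it gets close to 3, and with the largest rate every step overshoots so far that the weight moves away from 3 instead of towards it.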


In [None]:
# compile model to determine learning rate, loss function and metrics
model.compile(Adam(learning_rate=.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
print("Done")
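
To give a feel for what the loss and the metric measure, here is a hand-computed sketch for three made-up predictions. It follows the usual definition of categorical crossentropy (minus the log of the probability given to the true class) and is an illustration, not the Keras implementation.

```python
import math

def categorical_crossentropy(true_one_hot, predicted_probs):
    """Loss for one image: minus the log of the probability given to the true class."""
    return -sum(t * math.log(p) for t, p in zip(true_one_hot, predicted_probs))

def accuracy(true_labels, predicted):
    """Fraction of images whose most probable class is the true class."""
    correct = 0
    for t, p in zip(true_labels, predicted):
        if t.index(max(t)) == p.index(max(p)):
            correct += 1
    return correct / len(true_labels)

# Made-up example: one-hot true labels and predicted probabilities for 3 images
true_labels = [[1, 0], [0, 1], [1, 0]]
predicted = [[0.9, 0.1],   # confident and correct
             [0.2, 0.8],   # fairly confident and correct
             [0.4, 0.6]]   # wrong

losses = [categorical_crossentropy(t, p) for t, p in zip(true_labels, predicted)]
mean_loss = sum(losses) / len(losses)
acc = accuracy(true_labels, predicted)

print("Loss per image:", [round(l, 3) for l in losses])
print("Mean loss:", round(mean_loss, 3))
print("Accuracy:", round(acc, 3))
```

The wrong prediction has the highest loss, and a confident correct prediction has a loss close to 0.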

We are now going to use the 61 training images and the 20 validation images together for the training. 

**Note:** We also set a seed before training so that the images are fed to the model in the same order. That way the training process is reproducible.

**B)** For our training we can specify the following parameters: 

- **Epoch:** one cycle of training, in which our entire training data goes through all the layers of our neural network forwards and then backwards.

- **Steps per Epoch:** Each epoch is divided into steps, and the number of steps depends on the number of images per training batch. In each step, one batch of training data passes forward through the network, and the resulting error is then sent backwards through the network to update the weights. This backwards pass is called **Backpropagation**. The very first step starts from the initial weights, which are random for our new output layer. 

> One step = 1 backpropagation.

- **Validation Steps:** This is similar to steps per epoch, except that it applies to the validation set of images. It is the total number of steps (batches of samples) to draw when performing validation at the end of every epoch.

In [None]:
#fix seed
random_seed_GPU(45)

STEP_SIZE_TRAIN=train_batches.n//train_batches.batch_size
STEP_SIZE_VALID=valid_batches.n//valid_batches.batch_size

model.fit(train_batches,
          steps_per_epoch=STEP_SIZE_TRAIN,
          validation_data=valid_batches,
          validation_steps=STEP_SIZE_VALID,
          epochs=10,
          verbose=1)
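
The step counts used in the cell above can be worked out by hand. The sketch below repeats the same integer divisions with the image counts and batch sizes from this notebook (61 training images in batches of 5, 20 validation images in one batch of 20, 10 epochs). The variable names are our own.

```python
n_train, train_batch_size = 61, 5
n_valid, valid_batch_size = 20, 20
epochs = 10

# "//" is integer division: it drops any remainder
steps_per_epoch = n_train // train_batch_size     # 12 full batches of 5 images
validation_steps = n_valid // valid_batch_size    # 1 batch of 20 images

images_per_epoch = steps_per_epoch * train_batch_size   # 60 images counted per epoch
total_weight_updates = steps_per_epoch * epochs         # 120 backpropagation steps in total

print("Steps per epoch:", steps_per_epoch)
print("Validation steps:", validation_steps)
print("Training images counted per epoch:", images_per_epoch)
print("Total weight updates over all epochs:", total_weight_updates)
```

Because 61 is not a multiple of 5, each epoch counts 60 images rather than all 61.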

## **Step 7) Predicting labels for the Test Data**

**Testing:** We will load 23 images that our CNN model has never seen. Our model will then attempt to predict labels for this test set. We will then compare the actual labels of the test images with the predictions made by our model to determine how good our model is.

**A)** Loading our Testing Set of images
 

In [None]:
test_path = '/content/drive/MyDrive/Introduction-to-Deep-Learning/Testing'

print("Testing:")
test_batches = ImageDataGenerator() \
                .flow_from_directory(test_path,
                                     image_size,
                                     classes=classes,
                                     class_mode = "categorical",
                                     batch_size=23, 
                                     shuffle=False)


**B)** Making predictions for the Test Data

Each prediction outputs a probability between 0 and 1 for each class. The class with the highest probability becomes the predicted class.




In [None]:
#Makes predictions with our model
pred = model.predict(test_batches,
                     steps=1,
                     verbose=2)

#Observe the probabilities predicted for each class 
pred = np.round(pred, 2)
pred

**C)** Let's convert our predictions into the actual class names predicted by our model, to make the predictions easier to read.

In [None]:
#convert predictions into actual labels
predicted_class_indices=np.argmax(pred,axis=1)
labels = (train_batches.class_indices)
labels = dict((v,k) for k,v in labels.items())
predictions = [labels[k] for k in predicted_class_indices]
#observing the actual labels predicted
predictions
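
The same conversion can be tried on a small scale without the model. In this sketch the probabilities are made up, and the mapping `{"Mutant": 0, "WT": 1}` is an assumption about what `train_batches.class_indices` contains.

```python
# Assumed class mapping, in the same format as train_batches.class_indices
class_indices = {"Mutant": 0, "WT": 1}
index_to_class = {index: name for name, index in class_indices.items()}

# Made-up predicted probabilities for 3 images: [P(Mutant), P(WT)]
probabilities = [[0.93, 0.07],
                 [0.12, 0.88],
                 [0.45, 0.55]]

# For each image, pick the position of the highest probability and look up its class name
predicted = [index_to_class[row.index(max(row))] for row in probabilities]
print(predicted)
```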

## **Step 8) Observe how many predictions were correct using a Confusion Matrix**

**Confusion Matrix**: is a chart that lets you observe the number of accurate predictions and mistakes made by the model.

**A)** Loading Actual data labels to make our confusion matrix.

**Actual Values** are the ones that are already labeled correctly in our test data by a human.



In [None]:
# This code turns the true class of every test image into a list of "Mutant" and "WT" labels
index_to_class = {v: k for k, v in test_batches.class_indices.items()}
Actual = [index_to_class[i] for i in test_batches.classes]

Actual

**B)** Let's take a look at our confusion matrix!

**Predicted Values:** are the labels that our model produced for the test data, without seeing its actual labels.

What the model predicted *correctly*:
- **True WT**: Model predicted "Wild Type" and it was actually "Wild Type"
- **True Mutant**: Model predicted "Mutant" and it was actually "Mutant"

What the model predicted *wrong*:
- **False WT**: Model predicted "Wild Type" but it was actually "Mutant"
- **False Mutant**: Model predicted "Mutant" but it was actually "Wild Type"

In [None]:
#Create Confusion matrix for 2 classes (rows and columns in the order Mutant, WT)
cm = confusion_matrix(Actual, predictions, labels=['Mutant', 'WT'])
cm_df = pd.DataFrame(cm,
                     index = ['Mutant','WT'], 
                     columns = ['Mutant','WT'])
plt.figure(figsize=(5,4))
sns.heatmap(cm_df, annot=True)
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()

On the y axis of the **Confusion Matrix** you see the actual values, and on the x axis you see the values that the model predicted. 

The model used 23 images for testing. From these images: 
- The model incorrectly predicted 2 Wild Types when the actual category was Mutant (upper right corner). 
- The model incorrectly predicted 1 Mutant when the actual category was Wild Type (bottom left corner).
- The model correctly predicted 10 Mutants to be Mutant (upper left corner).
- The model correctly predicted 10 Wild Types to be Wild Type (bottom right corner).
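
The four counts in the confusion matrix are enough to work out the accuracy by hand. The sketch below uses the counts listed above. The dictionary layout, with keys written as (actual, predicted), is just one convenient way to write them down.

```python
# Counts from the confusion matrix, keyed by (actual class, predicted class)
confusion = {
    ("Mutant", "Mutant"): 10,   # upper left: correct
    ("Mutant", "WT"): 2,        # upper right: mistake
    ("WT", "Mutant"): 1,        # bottom left: mistake
    ("WT", "WT"): 10,           # bottom right: correct
}

total = sum(confusion.values())
correct = sum(count for (actual, predicted), count in confusion.items()
              if actual == predicted)
accuracy = correct / total

print("Total test images:", total)
print("Correct predictions:", correct)
print("Accuracy: {:.2f}%".format(accuracy * 100))
```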

**C)** Let's take a look at the images it predicted incorrectly as Wild Type

In [None]:
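# This code allows you to see which images were predicted incorrectly as Wild Type
test_batches.reset()
test_imgs, test_onehot = next(test_batches)

# column of the "Mutant" class in the labels and predictions
mutant_idx = test_batches.class_indices["Mutant"]
actual_mutant = test_onehot[:, mutant_idx] == 1
pred_mutant = np.argmax(pred, axis=1) == mutant_idx

# images that are actually Mutant but were predicted as Wild Type
false_wt = np.where(np.logical_and(actual_mutant, ~pred_mutant))[0]
plots(test_imgs[false_wt], titles=test_onehot[false_wt])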

**D)** Evaluating our model gives us concrete metrics of how it performed. Here we look at the following metric:
- **Accuracy**: the percentage of the test data that was predicted correctly.

In [None]:
#Evaluating the model using test batches
test_acc = model.evaluate(test_batches, 
                          steps=1,
                          verbose=1) 
print(test_acc[1]) 


Our model reached an accuracy of 86.96%. Woohooo!! The computer was able to distinguish Wild Type from Mutant yeast cells!


<img src="https://drive.google.com/uc?export=view&id=16cXtuethRuM8mlJd5qXusKjGaRgbpzEs" width=400 height=400>

### **We Hope you had Fun Doing this Deep Learning Workshop. As you can see it's not that hard!**



