## This assignment is designed for automated pathology detection in medical images in a realistic setup, i.e. each image may have multiple pathologies/disorders.
### The goal, for you as an MLE, is to design models and methods to predictively detect pathological images and explain the pathology sites in the image data.

## Data for this assignment is taken from a Kaggle contest: https://www.kaggle.com/c/vietai-advance-course-retinal-disease-detection/overview
Explanation of the data set:
The training data set contains 3435 retinal images that represent multiple pathological disorders. The pathology classes and corresponding labels are included in the 'train.csv' file, and each image can have more than one class category (multiple pathologies).
The labels for each image are:

```
- opacity (0)
- diabetic retinopathy (1)
- glaucoma (2)
- macular edema (3)
- macular degeneration (4)
- retinal vascular occlusion (5)
- normal (6)
```
The test data set contains 350 unlabelled images.

# For this assignment, you are working with specialists for Diabetic Retinopathy and Glaucoma only, and your client is interested in a predictive learning model along with feature explainability and self-learning for Diabetic Retinopathy and Glaucoma vs. Normal images.
# Design models and methods for the following tasks. Each task should be accompanied by code, plots/images (if applicable), tables (if applicable) and text:
## Task 1: Build a classification model for Diabetic Retinopathy and Glaucoma vs. normal images. You may consider multi-class classification vs. all-vs-one classification. Clearly state your choice and share details of your model, parameters and hyper-parameterization process. (60 points)
```
a. Perform 70/30 data split and report performance scores on the test data set.
b. You can choose to apply any data augmentation strategy. 
Explain your methods and rationale behind parameter selection.
c. Show training-validation curves to demonstrate that overfitting and underfitting are avoided.
```
## Task 2: Visualize the heatmap/saliency/features using any method of your choice to demonstrate what regions of interest contribute to Diabetic Retinopathy and Glaucoma, respectively. (25 points)
```
Submit images/folder of images with heatmaps/features aligned on top of the images, or corresponding bounding boxes, and report what regions of interest in your opinion represent the pathological sites.
```
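
One possible method for Task 2 is Grad-CAM. The sketch below shows only the core arithmetic, assuming the last convolutional layer's activations and the gradients of the class score with respect to them have already been extracted for one image. The function name `grad_cam` and the array shapes are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Core Grad-CAM step for one image and one class.

    activations, gradients: arrays of shape (H, W, K) taken from the last
    convolutional layer (assumed to be extracted beforehand).
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # Channel importance: average gradient over the spatial dimensions
    weights = gradients.mean(axis=(0, 1))                      # shape (K,)
    # Weighted sum of feature maps, keeping only positive evidence (ReLU)
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Synthetic example: a single active location in channel 0
acts = np.zeros((7, 7, 4))
acts[2, 3, 0] = 1.0
grads = np.ones((7, 7, 4))
heatmap = grad_cam(acts, grads)
hot_spot = np.unravel_index(heatmap.argmax(), heatmap.shape)
```

The low-resolution heatmap would still need to be resized to the input image size before it can be overlaid on the retinal image.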

## Task 3: Using the unlabelled data set in the 'test' folder, augment the training data (semi-supervised learning) and report the variation in classification performance on the test data set. (15 points)
[You may use any method of your choice, one possible way is mentioned below.] 

```
Hint: 
a. Train a model using the 'train' split.
b. Pass the unlabelled images through the trained model and retrieve the dense layer feature prior to the classification layer. Using this dense layer as a representation of the image, apply label propagation to retrieve labels corresponding to the unlabelled data.
c. Next, concatenate the train data with the unlabelled data (that has now been self-labelled) and retrain the network.
d. Report classification performance on the test data.
Use the unlabelled test data to improve classification performance by using a semi-supervised label-propagation/self-labelling approach. (20 points)
```

## [Hint: If you are wondering how to use the "dense layer representative of an image" in step b, see this exercise that extracts a [1,2048] dense representation from an image using the InceptionV3 pre-trained model.]
https://colab.research.google.com/drive/14-6qRGARgBSj4isZk86zKQtyIT2f9Wu1#scrollTo=_IqraxtP4Ex3
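
The label-propagation step in the hint can be sketched as follows. This is a minimal NumPy version under stated assumptions: each image is already represented by a dense feature vector, unlabelled rows are marked with `-1`, and similarity is a Gaussian kernel with a hand-picked `sigma`. The function `propagate_labels` and the toy 2-D features are hypothetical. Real 2048-dimensional features would need a different `sigma` and probably a k-nearest-neighbour graph.

```python
import numpy as np

def propagate_labels(features, labels, n_classes, sigma=1.0, n_iter=50):
    """Spread known labels to unlabelled points through a similarity graph.

    features: (N, D) float array of dense-layer features.
    labels:   (N,) int array, -1 marks an unlabelled point.
    """
    # Pairwise squared Euclidean distances turned into Gaussian affinities
    sq_dist = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)
    # Row-normalise so each point averages over its neighbours
    transition = affinity / affinity.sum(axis=1, keepdims=True)

    labelled = labels >= 0
    scores = np.zeros((len(labels), n_classes))
    scores[labelled, labels[labelled]] = 1.0
    clamp = scores.copy()
    for _ in range(n_iter):
        scores = transition @ scores
        scores[labelled] = clamp[labelled]   # known labels stay fixed
    return scores.argmax(axis=1)

# Toy data: two well-separated clusters with one labelled point each
toy_features = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                         [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
toy_labels = np.array([0, -1, -1, 1, -1, -1])
predicted = propagate_labels(toy_features, toy_labels, n_classes=2)
```

The propagated labels for the unlabelled rows are what step c would concatenate with the training labels before retraining.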


## Good Luck!

In [1]:
# General Imports
import os               # importing data
import sys
import numpy as np      # linear algebra
import pandas as pd     # data processing, CSV file I/O (e.g. pd.read_csv)

# Tensorflow
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
print("TensorFlow Version: ", tf.__version__) # In case there are issues with model fitting, I know what version of tensorflow I have

# Image Stuff
import IPython.display as display
import PIL.Image

TensorFlow Version:  2.6.1


## Task 0: Data Wrangling

In [4]:
# Unzip midterm data (note, if you are running this, add Data.zip ~ renamed to midterm_data.zip to ../hidden_files/)
if not os.path.exists('../hidden_files/midterm_data/train'):
    !unzip ../hidden_files/midterm_data.zip

print(os.listdir("../hidden_files/midterm_data/train"))
# should return ['train', 'train.csv'].
# *NOTE: .DS_Store might also appear in this directory if on mac. This file is ignored in the .gitignore.

['.DS_Store', 'train', 'train.csv']


In [7]:
# Load training data
data = pd.read_csv("../hidden_files/midterm_data/train/train.csv")

# Examine first few rows to better understand data
data.head()

Unnamed: 0,filename,opacity,diabetic retinopathy,glaucoma,macular edema,macular degeneration,retinal vascular occlusion,normal
0,c24a1b14d253.jpg,0,0,0,0,0,1,0
1,9ee905a41651.jpg,0,0,0,0,0,1,0
2,3f58d128caf6.jpg,0,0,1,0,0,0,0
3,4ce6599e7b20.jpg,1,0,0,0,1,0,0
4,0def470360e4.jpg,1,0,0,0,1,0,0


In [8]:
# Take a look at the data types and counts
data.info(memory_usage=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3435 entries, 0 to 3434
Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   filename                    3435 non-null   object
 1   opacity                     3435 non-null   int64 
 2   diabetic retinopathy        3435 non-null   int64 
 3   glaucoma                    3435 non-null   int64 
 4   macular edema               3435 non-null   int64 
 5   macular degeneration        3435 non-null   int64 
 6   retinal vascular occlusion  3435 non-null   int64 
 7   normal                      3435 non-null   int64 
dtypes: int64(7), object(1)
memory usage: 214.8+ KB


### Visualizing a single image

## Task 1: Classification of Diabetic Retinopathy and Glaucoma vs Normal Images:

### Options: Multi-class vs. all-vs-one model

**Multi-Class:**
1. Can use a one-hot-encoding approach to generate probabilities for multiple classes.
2. Easier to add additional classification schemes.

**All-vs-one:**
1. Will probably need a separate model for each class we are trying to identify.
2. Can probably share some of the earlier NN layers (transfer learning) across the models.
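
The difference between the two options shows up in the output layer. The sketch below compares a softmax output (one class per image) with independent sigmoid outputs (one yes/no decision per class) on the same scores. The logit values are made up for illustration. This matters here because the label counts later in this notebook show a small number of images carrying both diabetic retinopathy and glaucoma.

```python
import numpy as np

def softmax(z):
    # Probabilities compete and sum to 1, so at most one can exceed 0.5
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    # Each class gets an independent probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for [diabetic retinopathy, glaucoma, other]
logits = np.array([2.0, 1.5, -3.0])
p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

# Number of classes flagged at a 0.5 threshold under each scheme
flagged_softmax = int((p_softmax > 0.5).sum())
flagged_sigmoid = int((p_sigmoid > 0.5).sum())
```

With sigmoid outputs the model can flag both conditions for the same image, while softmax forces a single winner.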

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are expected under "../hidden_files/midterm_data/".
# Running the cell below (Shift+Enter) unzips the archive if needed and lists the files in the train directory

import os

if not os.path.exists('../hidden_files/midterm_data/train'):
    !unzip ../hidden_files/midterm_data.zip

print(os.listdir("../hidden_files/midterm_data/train"))

['.DS_Store', 'train', 'train.csv']


In [3]:
data = pd.read_csv("../hidden_files/midterm_data/train/train.csv")
data.head()

Unnamed: 0,filename,opacity,diabetic retinopathy,glaucoma,macular edema,macular degeneration,retinal vascular occlusion,normal
0,c24a1b14d253.jpg,0,0,0,0,0,1,0
1,9ee905a41651.jpg,0,0,0,0,0,1,0
2,3f58d128caf6.jpg,0,0,1,0,0,0,0
3,4ce6599e7b20.jpg,1,0,0,0,1,0,0
4,0def470360e4.jpg,1,0,0,0,1,0,0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3435 entries, 0 to 3434
Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   filename                    3435 non-null   object
 1   opacity                     3435 non-null   int64 
 2   diabetic retinopathy        3435 non-null   int64 
 3   glaucoma                    3435 non-null   int64 
 4   macular edema               3435 non-null   int64 
 5   macular degeneration        3435 non-null   int64 
 6   retinal vascular occlusion  3435 non-null   int64 
 7   normal                      3435 non-null   int64 
dtypes: int64(7), object(1)
memory usage: 214.8+ KB


In [5]:
for label in data[['diabetic retinopathy', 'glaucoma', 'opacity', 'macular edema', 'macular degeneration', 'retinal vascular occlusion', 'normal']]:
    print("Distribution of", label)
    print(data[label].value_counts())

Distribution of diabetic retinopathy
0    2680
1     755
Name: diabetic retinopathy, dtype: int64
Distribution of glaucoma
0    2838
1     597
Name: glaucoma, dtype: int64
Distribution of opacity
0    1902
1    1533
Name: opacity, dtype: int64
Distribution of macular edema
0    2919
1     516
Name: macular edema, dtype: int64
Distribution of macular degeneration
0    2861
1     574
Name: macular degeneration, dtype: int64
Distribution of retinal vascular occlusion
0    2995
1     440
Name: retinal vascular occlusion, dtype: int64
Distribution of normal
0    2910
1     525
Name: normal, dtype: int64


## Create new labels

We only care about diabetic retinopathy and glaucoma, so we keep those two columns, drop the others, and group every image with neither condition into a single 'other' label.

In [9]:
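# Drop the columns that we don't care about.
# .copy() gives an independent frame, so adding the 'other' column later does not raise a SettingWithCopyWarning.
data = data[['filename','diabetic retinopathy', 'glaucoma']].copy()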

In [10]:
data.head()

Unnamed: 0,filename,diabetic retinopathy,glaucoma
0,c24a1b14d253.jpg,0,0
1,9ee905a41651.jpg,0,0
2,3f58d128caf6.jpg,0,1
3,4ce6599e7b20.jpg,0,0
4,0def470360e4.jpg,0,0


In [12]:
data['other'] = 0
data.head()

Unnamed: 0,filename,diabetic retinopathy,glaucoma,other
0,c24a1b14d253.jpg,0,0,0
1,9ee905a41651.jpg,0,0,0
2,3f58d128caf6.jpg,0,1,0
3,4ce6599e7b20.jpg,0,0,0
4,0def470360e4.jpg,0,0,0


In [21]:
data['other'] = ((data['glaucoma'] == 0) & (data['diabetic retinopathy'] == 0)).astype(int)
data.head()

Unnamed: 0,filename,diabetic retinopathy,glaucoma,other
0,c24a1b14d253.jpg,0,0,1
1,9ee905a41651.jpg,0,0,1
2,3f58d128caf6.jpg,0,1,0
3,4ce6599e7b20.jpg,0,0,1
4,0def470360e4.jpg,0,0,1


In [22]:
data.info()
for label in data.columns[1:]:
    print("Distribution of", label)
    print(data[label].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3435 entries, 0 to 3434
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   filename              3435 non-null   object
 1   diabetic retinopathy  3435 non-null   int64 
 2   glaucoma              3435 non-null   int64 
 3   other                 3435 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 107.5+ KB
Distribution of diabetic retinopathy
0    2680
1     755
Name: diabetic retinopathy, dtype: int64
Distribution of glaucoma
0    2838
1     597
Name: glaucoma, dtype: int64
Distribution of other
1    2102
0    1333
Name: other, dtype: int64


In [23]:
# Check new labels, and that all of the combinations add up to the total number of rows.
LABELS = data.columns[1:]
def build_label(row):
    # Join the names of every label column that is set to 1 for this row
    return ",".join([LABELS[idx] for idx, val in enumerate(row[1:]) if val == 1])

data.apply(build_label, axis=1).value_counts()

other                            2102
diabetic retinopathy              736
glaucoma                          578
diabetic retinopathy,glaucoma      19
dtype: int64

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
train_data, validation_data = train_test_split(data, test_size=0.3, random_state=42)
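
A plain random split does not guarantee that rare label combinations appear in both parts in the same proportion. One option, sketched below on a toy frame with the same column names (the rows are made up), is to stratify on the combination of the two disease labels. The names `toy`, `strata`, `train_part` and `val_part` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's `data` frame (hypothetical rows)
toy = pd.DataFrame({
    "filename": [f"img_{i}.jpg" for i in range(100)],
    "diabetic retinopathy": [1] * 20 + [0] * 80,
    "glaucoma": [0] * 20 + [1] * 20 + [0] * 60,
})
toy["other"] = ((toy["glaucoma"] == 0) & (toy["diabetic retinopathy"] == 0)).astype(int)

# One stratum per label combination, e.g. "1-0" for diabetic retinopathy only
strata = toy["diabetic retinopathy"].astype(str) + "-" + toy["glaucoma"].astype(str)

# Same 70/30 ratio and seed as above, but class shares are preserved
train_part, val_part = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=strata
)
```

Stratification requires at least two rows in every stratum, which holds for the label combinations counted above.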

In [None]:
# TODO: lift the training-generator and data-augmentation code from MLE; reuse the unet-helper functions directly.
# The unet-helper functions print out the metrics we will need later.
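
Until that code is lifted, the training generator could look roughly like the sketch below, which uses Keras `ImageDataGenerator.flow_from_dataframe` with `class_mode="raw"` so the three label columns are returned as a multi-label array. The augmentation values, the 224x224 target size, the helper name `make_generators` and the image directory passed to it are all assumptions for illustration, not tuned or confirmed settings.

```python
# Assumed augmentation settings (illustrative, not tuned for retinal images)
AUGMENTATION_KWARGS = dict(
    rescale=1.0 / 255,
    rotation_range=15,
    width_shift_range=0.05,
    height_shift_range=0.05,
    zoom_range=0.1,
    horizontal_flip=True,
)
LABEL_COLUMNS = ["diabetic retinopathy", "glaucoma", "other"]

def make_generators(train_df, val_df, image_dir, target_size=(224, 224), batch_size=32):
    """Build an augmented training generator and a plain validation generator."""
    # Imported here so the settings above can be inspected without TensorFlow
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_gen = ImageDataGenerator(**AUGMENTATION_KWARGS).flow_from_dataframe(
        train_df, directory=image_dir, x_col="filename", y_col=LABEL_COLUMNS,
        class_mode="raw", target_size=target_size, batch_size=batch_size,
        shuffle=True, seed=42,
    )
    # Validation images are only rescaled, never augmented
    val_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_dataframe(
        val_df, directory=image_dir, x_col="filename", y_col=LABEL_COLUMNS,
        class_mode="raw", target_size=target_size, batch_size=batch_size,
        shuffle=False,
    )
    return train_gen, val_gen
```

It would be called as `make_generators(train_data, validation_data, image_dir)`, where `image_dir` is the folder that holds the `.jpg` files listed in `train.csv`.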