<h1 style="text-align: center;">Midterm - Computer Vision option</h1>

This is one of two options for the midterm.

This option involves automated pathology detection for medical images. 

We're asking you to develop a solution, using computer vision techniques, that can detect when images of a patient's eye contain evidence of specific medical pathologies. The setup is somewhat realistic in the sense that each image may have multiple pathologies.

The goal, for you as an MLE, is to design models and methods to detect pathological images and explain the pathology sites in the image data.

Your submission should be a single file: either a .ipynb file on its own, or a .zip file if you use additional files for helper functions. Please stay organized and follow the outline of the tasks provided below.

## The data

Data is taken from a Kaggle contest: https://www.kaggle.com/c/vietai-advance-course-retinal-disease-detection/overview

The training data set contains 3,435 retinal images that represent multiple pathological disorders. The pathology classes and corresponding labels are included in the `train.csv` file, and each image can have more than one class (multiple pathologies).

The labels for each image are:

- opacity (0)
- diabetic retinopathy (1)
- glaucoma (2)
- macular edema (3)
- macular degeneration (4)
- retinal vascular occlusion (5)
- normal (6)

The test data set contains 350 unlabeled images.

## The tasks

For this assignment, assume you are working with specialists for Diabetic Retinopathy and Glaucoma only, and your client is interested in a predictive model that identifies when patients have Diabetic Retinopathy or Glaucoma. They also care whether the patient has any other pathology, but not which one. That means they want a prediction with 3 (non-exclusive) options:
1. Diabetic Retinopathy
2. Glaucoma
3. Other

An image with none of these pathologies is considered Normal.
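
As an illustration of this mapping, here is a minimal sketch that collapses the seven original labels into the three client targets. It assumes the labels are available as a binary matrix whose columns follow the order of the label list above; the actual layout of `train.csv` is not shown here, and the names `to_client_targets` and `OTHER_COLUMNS` are hypothetical.

```python
import numpy as np

# Assumed column order, taken from the label list above (not confirmed
# against train.csv): 0 opacity, 1 diabetic retinopathy, 2 glaucoma,
# 3 macular edema, 4 macular degeneration, 5 retinal vascular occlusion,
# 6 normal.
DR, GLAUCOMA = 1, 2
OTHER_COLUMNS = [0, 3, 4, 5]  # every pathology that is neither DR nor glaucoma


def to_client_targets(labels7):
    """Map an (n, 7) binary label matrix to (n, 3): [DR, glaucoma, other]."""
    labels7 = np.asarray(labels7).astype(int)
    # "Other" is positive when at least one of the remaining pathologies is.
    other = labels7[:, OTHER_COLUMNS].max(axis=1)
    return np.stack([labels7[:, DR], labels7[:, GLAUCOMA], other], axis=1)


example = np.array([
    [0, 1, 0, 1, 0, 0, 0],  # diabetic retinopathy + macular edema
    [0, 0, 1, 0, 0, 0, 0],  # glaucoma only
    [0, 0, 0, 0, 0, 0, 1],  # normal
])
targets = to_client_targets(example)
print(targets)
```

Under this encoding a Normal image is simply a row of three zeros, so no separate "normal" output is needed.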

Your client is also interested in feature explainability and learning with small amounts of data.

**Design models and methods for the following tasks. Each task should be accompanied by**:
* code
* plots/images (if applicable)
* tables (if applicable)
* text explanations (in markdown cells) of what is being done and why

### Task 1: Build a classification model (50 points)
You should perform a random 70/30 split of the data for training and validation. Report classification metrics (Accuracy, Precision, Recall, F1) on the validation data set. You may apply any data augmentation strategy, but please explain your methods and the rationale behind the types of augmentation you include. We expect you to fit at least **2 different models** and pick the one you think works best. Getting a good fit should include some basic hyperparameter tuning. When you train the models, display training and validation curves so that you can check for overfitting and underfitting.

We expect your solution to have at least the following components (the point breakdown for each component is indicated):
```
1. Performs basic Exploratory Data Analysis (5 points)
2. Sets up a data augmentation pipeline (10 points)
3. Model Fitting (25 points)
```


The remaining 10 points for this task will be given for the following:
```
4. Addresses the multi-label nature of this data (5 points)
5. F1 score is in the top quartile among all students (5 points)
```
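
To make the split and the per-label metrics concrete, the following is a rough sketch rather than a required implementation. It assumes the model outputs one independent probability per target (for example, from sigmoid outputs), and the helper names `split_indices` and `multilabel_report`, the seed, and the 0.5 threshold are all illustrative choices.

```python
import numpy as np


def split_indices(n_samples, val_fraction=0.3, seed=0):
    """Return (train_idx, val_idx) for a seeded random split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_val = int(n_samples * val_fraction)
    return order[n_val:], order[:n_val]


def multilabel_report(y_true, y_prob, threshold=0.5):
    """Per-label accuracy, precision, recall and F1 for binary targets."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    # np.maximum(..., 1) avoids division by zero when a label never occurs.
    return {
        "accuracy": (y_true == y_pred).mean(axis=0),
        "precision": tp / np.maximum(tp + fp, 1),
        "recall": tp / np.maximum(tp + fn, 1),
        "f1": 2 * tp / np.maximum(2 * tp + fp + fn, 1),
    }


train_idx, val_idx = split_indices(3435)

# Toy targets and predicted probabilities, columns = [DR, glaucoma, other].
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
y_prob = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4], [0.3, 0.7, 0.1], [0.6, 0.1, 0.2]]
report = multilabel_report(y_true, y_prob)
for name, values in report.items():
    print(name, values)
```

Reporting each label separately keeps a weak class from hiding behind a strong one; averaging the per-label F1 scores gives a single macro-F1 number if one is needed.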

### Task 2: Visualize regions of interest that contribute to Diabetic Retinopathy and Glaucoma (25 points)
We want you to help the client understand which parts of an image are important for making the pathology classification. To support that, you should visualize the feature activations and generate saliency heatmaps using any method of your choice. For the saliency heatmaps, Grad-CAM is one option, but you are free to use another method. The point breakdown for this task is:

```
1. Implements feature activation, heatmap visualization (15 points)
2. Provides a written and graphical analysis of these visualizations (10 points)
```
The written analysis of your visualizations should be between 50 and 500 words. Describe what you found from the feature activation and heatmap visualizations on some images with different pathologies.
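
If you choose Grad-CAM, the core computation is small once the framework has given you two arrays for one image: the activations of a convolutional layer and the gradients of the target class score with respect to those activations. The sketch below covers only that combination step in NumPy; how you extract the two arrays depends on your framework and is not shown, and the function name `grad_cam` is illustrative.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Combine conv-layer activations and gradients into a heatmap.

    Both inputs have shape (channels, height, width) and belong to one
    image and one target class score. Returns a (height, width) map
    scaled to [0, 1].
    """
    activations = np.asarray(activations, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    # Channel weights: global average of the gradients over space.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the activation maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)
    # Keep only regions that push the class score up.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Tiny hand-made example: channel 0 supports the class, channel 1 opposes it.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 2.0]]])
gradients = np.array([[[1.0, 1.0], [1.0, 1.0]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

The resulting map has the spatial resolution of the chosen layer, so it is usually resized to the input image size before being overlaid on the retinal image.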

### Task 3: Using the unlabeled data set in the `test` folder, augment the training data (semi-supervised learning) and report the change in classification performance on your (labeled) validation data set (15 points)

Here's how we would like you to do this:
1. Pass the unlabeled images through the trained model and retrieve the dense layer features prior to the classification layer. Treating these features as the representation of each image, apply label propagation to obtain labels for the unlabeled data.
2. Next, concatenate the training data with the unlabeled data (now self-labeled) and retrain the network, either from scratch or by fine-tuning; the choice is up to you.
3. Report classification performance on the labeled validation data you generated in Task 1.
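
The label propagation step can be sketched as follows. This is one possible formulation, not the required one: it builds an RBF affinity graph over the dense-layer features, repeatedly averages label scores over neighbors, and resets the labeled rows to their known values after every pass. Each of the three targets is propagated as its own column to respect the multi-label setup. The function name `propagate_labels`, the `gamma` value, and the iteration count are assumptions made for illustration.

```python
import numpy as np


def propagate_labels(features, labels, is_labeled, gamma=1.0, n_iter=200):
    """Iterative label propagation with clamping of the labeled rows.

    features:   (n, d) dense-layer features, labeled and unlabeled together.
    labels:     (n, k) binary targets; rows of unlabeled images are ignored.
    is_labeled: (n,) boolean mask marking the labeled rows.
    Returns an (n, k) array of soft label scores in [0, 1].
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)
    is_labeled = np.asarray(is_labeled, dtype=bool)

    # Pairwise squared distances via |x|^2 + |y|^2 - 2 x.y (memory friendly).
    sq_norm = (features ** 2).sum(axis=1)
    sq_dist = sq_norm[:, None] + sq_norm[None, :] - 2.0 * features @ features.T
    affinity = np.exp(-gamma * np.maximum(sq_dist, 0.0))
    # Row-normalize so each row is a weighted average over neighbors.
    transition = affinity / affinity.sum(axis=1, keepdims=True)

    # Unlabeled rows start at an uninformative 0.5.
    scores = np.where(is_labeled[:, None], labels, 0.5)
    for _ in range(n_iter):
        scores = transition @ scores
        scores[is_labeled] = labels[is_labeled]  # clamp the known labels
    return scores


# Toy example: two tight clusters, one labeled point in each.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
labels = [[1, 0], [0, 0], [0, 1], [0, 0]]
is_labeled = [True, False, True, False]
scores = propagate_labels(features, labels, is_labeled)
pseudo_labels = (scores >= 0.5).astype(int)
print(pseudo_labels)
```

scikit-learn also provides `LabelPropagation` and `LabelSpreading` in `sklearn.semi_supervised`. Those estimators expect a single class label per sample, with `-1` marking unlabeled points, so a multi-label setup would need one fit per target.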

The points breakdown for this task is:
```
1. Correctly grabs the dense layer activations for unlabeled data (2 points)
2. Applies label propagation using the dense layer activations (8 points)
3. Retrains the model using self-labeled data. Reports changes in performance (5 points)
```


### Style and clarity (10 points)
Please attempt to write clear, well-commented code and explanations in markdown cells, where appropriate. The remaining 10 points will be allocated based on how well you do this.



## That's it. Good Luck!

<br>
<br> 
<br>

----