# Chem 277B - Fall 2024 - Homework 6
## CNN - Image Processing Using an Encoder-Decoder Structure
*Submit this notebook to bCourses to receive credit for this assignment.*
<br>
Due: **Nov 18th 2024**
<br>
**Please upload both the .ipynb file and the corresponding .pdf**<br>
<br>

## 120 Points Total

**Problem**

Your task is to build an anomaly detection model that identifies parasitized cells among images of infected and uninfected cells using a deep learning autoencoder structure (see **tutorial 7** at *bCourses*). The goal is to develop a system that detects anomalies based on reconstruction errors and density estimates of the encoded features.<br>
The learning goal of this homework assignment is to work on a real-life scenario with a realistic workflow. In terms of complexity and difficulty, this task also serves as preparation for the Capstone Project.

## Note: Optimize for Computational Efficiency
Autoencoders require significant computational resources, so consider the following strategies to reduce computation time while maintaining accuracy:

**Use Smaller Image Sizes**: Start with 64, 32, or even 16 pixels per side. Smaller images can drastically cut computational cost.<br>

**Reduce Filters and Layers**: Experiment with fewer filters in each layer and avoid building a very dense network. Aim for a trade-off between model complexity and compute efficiency.<br>
<br>

**Skip the Final Conv2D Layer**: Instead of reconstructing at full resolution, consider using a smaller output size, which reduces the computational load without sacrificing much accuracy.<br>


**Consider a Shallow Network**: A simpler encoder-decoder structure with fewer layers may still work effectively for anomaly detection.<br>

**Preprocess Images:** Apply preprocessing techniques such as thresholding to simplify the images before passing them through the autoencoder. Smarter preprocessing may reduce the need for a dense network. With careful preprocessing, you may also be able to reduce each image to a single channel.


These adjustments are intended to balance computational efficiency with detection accuracy. Use these hints to guide your network design and experiment with different configurations for the best trade-off.<br>


Please avoid building a network that takes in **128x128** images without preprocessing. Training such a model will be infeasible with limited compute resources. Although smaller image sizes or simpler networks may not yield the highest possible accuracy, experimenting with these configurations is essential.
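
To make the size argument concrete, the short sketch below counts the number of input values per image for a few configurations. The helper name `input_values` is hypothetical, and the count is only a rough proxy: the real cost of a convolutional network also depends on the number of filters and layers.

```python
def input_values(side, channels):
    """Number of values in one square image with the given side length and channels."""
    return side * side * channels

baseline = input_values(128, 3)  # 128x128 RGB image

for side, channels in [(64, 3), (32, 3), (32, 1), (16, 1)]:
    n = input_values(side, channels)
    print(f"{side}x{side}x{channels}: {n} values, "
          f"{baseline / n:.0f}x fewer than 128x128x3")
```

For example, a 32x32 single-channel image holds 1,024 values, 48 times fewer than the 49,152 values of a 128x128 RGB image.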

Grading will be based on the workflow and application of concepts, not purely on accuracy. Focus on understanding the process and applying techniques for effective anomaly detection under practical constraints. (Hint: OpenCV and KNN)

**Dataset**

The dataset contains images of cell samples, both infected (parasitized) and uninfected (healthy), which can be downloaded from the National Library of Medicine: [Malaria Dataset](https://lhncbc.nlm.nih.gov/LHC-downloads/downloads.html#malaria-datasets)

**Guided Workflow and Hints:**

**1) Preprocessing and Data Loading** <br>
> **Objective**: Load and preprocess images of healthy and infected cells.<br> 
> **Hint**: Apply preprocessing techniques such as thresholding to simplify the images before passing them through the autoencoder. Smarter preprocessing may reduce the need for a dense network. With careful preprocessing, you may also be able to reduce each image to a single channel.
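
One possible shape for such a preprocessing step is sketched below using NumPy only, on a random array standing in for a cell image. The function name `preprocess`, the block size, and the mean-intensity threshold are illustrative assumptions, not a prescribed recipe; OpenCV's `cv2.cvtColor`, `cv2.resize`, and `cv2.threshold` cover the same operations and are likely the more convenient choice on the real dataset.

```python
import numpy as np

def preprocess(image_rgb, block=4):
    """Turn an RGB uint8 image into a downsampled, binary, single-channel image."""
    # Luminance-weighted grayscale, scaled to [0, 1].
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = image_rgb.astype(np.float32) @ weights / 255.0
    # Downsample by averaging non-overlapping block x block tiles.
    h, w = gray.shape
    h, w = h - h % block, w - w % block
    small = gray[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    # Global threshold at the mean intensity (an arbitrary choice for illustration).
    binary = (small > small.mean()).astype(np.float32)
    return binary[..., np.newaxis]  # add the channel axis: (H, W, 1)

# Random stand-in for a 128x128 RGB cell image.
rng = np.random.default_rng(0)
fake_cell = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
x = preprocess(fake_cell)
print(x.shape)
```

Whether a hard threshold keeps enough information about the parasite stain is something to check visually on a few real images before committing to it.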

**2) Autoencoder Model Setup**<br>
> **Objective**: Build an autoencoder with a bottleneck layer (see **tutorial 7**) to capture compressed representations of cell images.<br>
> **Hint**: Use *Conv2D*, *MaxPooling2D*, and *UpSampling2D* layers to build an encoder-decoder structure. Keep the bottleneck layer relatively small to capture key features in the latent space.<br>
> **Question**: Why is it important to use a small bottleneck layer in anomaly detection with autoencoders?<br>

**3) Model Training**<br>
> **Objective:** Train the autoencoder on healthy images to learn the general features of normal cell images.<br>
>**Hint:** Compile the model with *mean_squared_error* loss and train it with only the healthy cell data (no parasitized cells at this stage).<br>
>**Question:** Why should we avoid training on parasitized cells for this model?<br>

**4) Evaluation Using Reconstruction Error**<br>
>**Objective:** Evaluate the reconstruction error on both healthy and parasitized cells.<br>
>**Hint:** Calculate the reconstruction error for each type of image. High errors for parasitized cells should indicate anomalies.<br>
>**Question:** Why might parasitized cells have a higher reconstruction error than healthy cells?<br>
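
A minimal sketch of a per-image reconstruction error follows. It assumes image batches shaped `(N, H, W, C)` and uses noisy copies of random arrays in place of real autoencoder output; in the assignment, the reconstructions would come from the trained model's predictions. The helper name `reconstruction_errors` is hypothetical.

```python
import numpy as np

def reconstruction_errors(originals, reconstructions):
    """Per-image mean squared error for batches shaped (N, H, W, C)."""
    diff = originals.astype(np.float64) - reconstructions.astype(np.float64)
    return np.mean(diff ** 2, axis=(1, 2, 3))

rng = np.random.default_rng(0)
images = rng.random((5, 32, 32, 1))

# Stand-ins for model output: a close reconstruction and a poor one.
good = images + rng.normal(0.0, 0.01, images.shape)
poor = images + rng.normal(0.0, 0.2, images.shape)

err_good = reconstruction_errors(images, good)
err_poor = reconstruction_errors(images, poor)
print("mean error, close reconstruction:", err_good.mean())
print("mean error, poor reconstruction: ", err_poor.mean())
```

Keeping one error value per image, rather than a single batch average, is what makes it possible to compare the error distributions of healthy and parasitized cells later.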

**5) Latent Space Representation and Density Calculation**<br>
>**Objective**: Extract the compressed (latent) representations of the healthy cells from the bottleneck layer and calculate a density threshold for anomaly detection.<br>
>**Hint**: Use *KernelDensity* from *sklearn.neighbors* to fit a density model on the compressed representations of healthy cells, then set a threshold for density values.<br>
>**Question**: What role does density estimation play in enhancing anomaly detection beyond reconstruction error alone?<br>
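
The density step could look roughly like the sketch below. The latent vectors are synthetic Gaussian samples, and the bandwidth, the latent dimension of 8, and the 5th-percentile threshold are illustrative assumptions only. In the assignment, the bottleneck output would need to be flattened to shape `(n_samples, n_features)` before fitting, since `KernelDensity` expects a 2D array.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Synthetic stand-ins for flattened bottleneck activations.
healthy_latent = rng.normal(0.0, 1.0, size=(500, 8))
shifted_latent = rng.normal(4.0, 1.0, size=(50, 8))  # plays the role of anomalies

# Fit the density model on healthy samples only.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(healthy_latent)

# score_samples returns the log-density of each sample under the fitted model.
healthy_scores = kde.score_samples(healthy_latent)
shifted_scores = kde.score_samples(shifted_latent)

# Illustrative rule: flag anything below the 5th percentile of healthy scores.
density_threshold = np.percentile(healthy_scores, 5)
flagged = shifted_scores < density_threshold
print("fraction of shifted samples flagged:", flagged.mean())
```

Note that the scores are log-densities, so they are typically negative and a lower value means a less typical sample.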

**6) Threshold Setting and Anomaly Detection**<br>
>**Objective**: Define thresholds for density and reconstruction error to classify new images as healthy or parasitized.<br>
>**Hint**: Test different threshold values based on the mean and standard deviation of the errors for both healthy and parasitized cells.<br>
>**Question**: How would changing the density threshold or reconstruction error threshold affect the model’s performance?<br>
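
The effect of moving a threshold can be explored with a sketch like the one below, which sets the reconstruction error threshold at the healthy mean plus `k` standard deviations. The error values are synthetic and the helper name `error_threshold` is hypothetical; real error distributions may overlap far more than these.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-image reconstruction errors, for illustration only.
healthy_err = rng.normal(0.010, 0.002, size=1000)
parasitized_err = rng.normal(0.020, 0.005, size=1000)

def error_threshold(errors, k):
    """Threshold placed k standard deviations above the mean error."""
    return errors.mean() + k * errors.std()

for k in (1.0, 2.0, 3.0):
    t = error_threshold(healthy_err, k)
    false_positive_rate = np.mean(healthy_err > t)   # healthy cells flagged
    detection_rate = np.mean(parasitized_err > t)    # parasitized cells flagged
    print(f"k={k}: threshold={t:.4f}, "
          f"false positives={false_positive_rate:.3f}, "
          f"detected={detection_rate:.3f}")
```

Raising `k` can only lower or keep both rates, so the choice is a trade-off between flagging fewer healthy cells and missing more parasitized ones.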

**7) Testing on New Images**<br>
>**Objective**: Test the model’s performance on a mix of new healthy and parasitized images.<br>
>**Hint**: Use the *check_anomaly* function to predict whether new images are anomalies. Adjust threshold values if necessary.<br>
>**Question**: What can you conclude if a healthy cell image is incorrectly classified as parasitized?<br>
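
One way the two criteria might be combined into a single decision is sketched below. The function `classify_sample` and its "either criterion" rule are hypothetical; the *check_anomaly* function from the tutorial may use a different signature or logic.

```python
def classify_sample(density, recon_error, density_threshold, error_threshold):
    """Label one sample from its latent log-density and reconstruction error.

    Hypothetical rule: a sample is an anomaly if it lies in a low-density
    region of the latent space OR reconstructs poorly.
    """
    is_anomaly = density < density_threshold or recon_error > error_threshold
    return "parasitized" if is_anomaly else "healthy"

# Made-up values: thresholds of -10.0 (log-density) and 0.02 (MSE).
print(classify_sample(-5.0, 0.010, -10.0, 0.02))   # typical density, low error
print(classify_sample(-25.0, 0.010, -10.0, 0.02))  # unusually low density
print(classify_sample(-5.0, 0.050, -10.0, 0.02))   # unusually high error
```

Using "or" makes the detector more sensitive at the cost of more false positives; requiring both criteria ("and") would do the opposite.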

![image.png](attachment:49c99e04-9039-4873-8d8d-563cabde28ff.png)