# COGS 118B - Final Project

# Skin Lesions and Clustering Models

## Group members

- Tom Hocquet
- Jesse Sanchez Villegas
- Kian Ekhlassi
- Jiawei Li

# Abstract 

Our goal is to predict which type of skin lesion a patient has. The data we used consist of images showing Melanoma, Melanocytic nevus, Basal cell carcinoma, Actinic keratosis, Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis), Dermatofibroma, Vascular lesion, Squamous cell carcinoma, or none of the above. We will conduct image segmentation to highlight and cluster similar features. Specifically, we will convert the images into vectors and then use several clustering models, such as k-means, to group the images and see whether the models can find a pattern within the vectors. From there, our model will predict which lesion the patient has. Performance will be measured through accuracy, recall, and F1 score.

# Background


Skin lesions are a critical visible symptom of potentially harmful diseases, but studies show that a significant number of patients are unaware that they carry these diseases [1]. This suggests that much of the population is uninformed about skin lesions despite the life-altering consequences that can arise if they are not treated early. Part of the explanation is the rising cost of fully diagnosing and treating skin lesions, which discourages the general public from seeking care. In addition, many diagnosed patients have described their experiences and worry that they are often examined by untrained physicians [2].
Overall, the public's lack of awareness of a fairly common condition, especially among elderly people, calls for change. Our research is intended to create a model so that people can send pictures and ideally get an accurate prediction of what type of skin lesion they may have. This is not meant to replace the role of a trained professional, since we do not want to risk misdiagnosing a possibly life-threatening disease. Instead, our goal is to give the general public the name of the disease they may have so that they can consult a trained professional and get treatment in a timely manner.

# Problem Statement


The problem we are solving is whether we can create a model that accurately predicts which type of skin lesion a patient has (Melanoma, Melanocytic nevus, Basal cell carcinoma, Actinic keratosis, Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis), Dermatofibroma, Vascular lesion, or Squamous cell carcinoma) based on an image of the patient's affected skin. We will conduct image segmentation to classify skin lesions from images using clustering algorithms such as k-means, DBSCAN, hierarchical clustering, and Gaussian mixture models. The problem is quantifiable because visual features such as the color, shape, and size of a skin lesion can be quantified. Specifically, pixel intensities and color-channel values can be used to represent an image of a skin lesion numerically. The problem is measurable because we can evaluate the performance of our classification model through metrics like accuracy, recall, and F1 score. It is also replicable because skin lesions are common symptoms among many patients and the diagnosis of skin lesion types occurs frequently in clinical practice. By creating this classification model, we want to provide the general public with an easily accessible method of skin lesion diagnosis.

# Data

 
Link to the dataset: https://www.kaggle.com/datasets/andrewmvd/isic-2019?resource=download
- Dataset name: ISIC_2019_Training_Metadata
- \# of variables: 5; \# of observations: 25331
- An observation consists of: image (filename), age_approx (approximate age), anatom_site_general (anatomical site of the image), lesion_id (ID of the lesion), sex (sex of the patient)
- The most critical variable is the image, which is represented by its filename and stored as a JPEG. anatom_site_general is also a critical variable; it stores the label of the anatomical site of the image. Some of the labels include anterior torso, lower extremity, upper extremity, and palms/soles.
- For the images of the patients' affected skin, we plan to convert these JPEG images into numeric arrays and normalize their pixel values during preprocessing.

# Proposed Solution


Our goal is to properly classify skin lesions from images. To solve this problem, we will use a mix of techniques. First, we will preprocess the data to make sure it is in a form we can use. Then we will use unsupervised clustering algorithms to see whether the images cluster easily. We will likely try k-means, DBSCAN, hierarchical clustering, and Gaussian mixture models. We might need dimensionality reduction (PCA, maybe convolution) for these methods to work, due to the curse of dimensionality. If a specific clustering algorithm works particularly well, we will likely use it. We will also use a CNN to finalize our predictions with the help of the insights gained from clustering.
To test our solution, we will use cross-validation together with a test sample that is separated at the beginning and only used at the very end. We will use the evaluation metrics below to measure our “success.” We will also use k-means (as it is the simplest model) as a benchmark against which our solution is compared.
Libraries that we will likely use include, but are not limited to:
pandas, numpy, sklearn.cluster (KMeans, DBSCAN, …), sklearn.metrics, pytorch, tensorflow, matplotlib.

# Evaluation Metrics

Since we are dealing with a classification problem, we want metrics that accurately measure our successes and failures. Our main goal is to maximize the true positive rate (recall, TP/P). We will also likely use the positive predictive value (precision, TP/PP) and the F-score (2 TP / (2 TP + FP + FN)), which balances precision and recall and is therefore useful when some classes are significantly larger than others. Together, these three metrics should guide our solution in the right direction. We will compute them on both the validation and test sets, which will tell us whether our model is generalizing well.
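
As a small illustration of the three formulas above, the sketch below computes recall, precision, and F-score from raw counts. The helper name `prf_from_counts` and the example counts are made up for illustration; they are not results from this project.

```python
def prf_from_counts(tp, fp, fn):
    """Return (recall, precision, f_score) from TP, FP and FN counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # TP / P
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # TP / PP
    denom = 2 * tp + fp + fn
    f_score = 2 * tp / denom if denom else 0.0          # 2TP / (2TP + FP + FN)
    return recall, precision, f_score

# Hypothetical counts for a single lesion class
recall, precision, f_score = prf_from_counts(tp=8, fp=2, fn=4)
print(f"recall={recall:.3f} precision={precision:.3f} f_score={f_score:.3f}")
```

For a multi-class problem such as ours, these values would be computed once per class and then averaged.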

# Results



### Set Up

In [None]:
# import the libraries used throughout the notebook
import os
import cv2
import numpy as np
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

%matplotlib inline
import matplotlib.pyplot as plt
from pathlib import Path
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization, Input
import tensorflow as tf

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping

import warnings

# to get rid of warnings during models
warnings.filterwarnings("ignore", category=UserWarning, message=".*threadpoolctl.*")

In [None]:
import zipfile

with zipfile.ZipFile('grayscale_img.csv.zip','r') as zip_ref:
    zip_ref.extractall()

Here we define the functions used to reduce the size of the original images and so reduce the computational power needed. Since shrinking the images discards detail, we compensate by calculating the average pixel value of each image and recreating the image from that average.

In [None]:
# Function to load images from a directory
def load_images(directory):
    images = []
    for filename in os.listdir(directory):
        # images are in jpg format
        if filename.endswith('.jpg'):
            image_path = os.path.join(directory, filename)
            image = cv2.imread(image_path)
            # cv2.imread returns None for unreadable files, so skip those
            if image is not None:
                images.append(image)
    return images

# Function to calculate the average pixel value (per channel) of each image
def calculate_average_pixels(images):
    # the images can differ in size, so average each one separately
    return np.array([image.mean(axis=(0, 1)) for image in images])

# Function to resize images using average pixel values
def resize_images(images, new_size=(30, 30)):
    resized_images = []
    for image in images:
        # fill the new image with the average color of the original
        average_color = calculate_average_pixels([image])[0]
        # cast to uint8 so cv2.imshow displays the values as 0-255 intensities
        resized_image = np.full((new_size[0], new_size[1], 3), average_color).astype(np.uint8)
        resized_images.append(resized_image)
    return resized_images
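
For comparison, here is a sketch of a different way to shrink an image with averages: instead of filling the whole output with one mean color, each output pixel takes the mean of its own block of input pixels, so coarse spatial structure survives. This is not the method used in the cell above. `block_average` is a hypothetical helper, and it assumes the image height and width are exact multiples of the output size.

```python
import numpy as np

def block_average(image, new_size=(30, 30)):
    """Downsample an H x W x C image by averaging non-overlapping blocks.

    Assumes H and W are exact multiples of new_size (an assumption of this sketch).
    """
    h, w, c = image.shape
    block_h, block_w = h // new_size[0], w // new_size[1]
    # Split rows and columns into (n_blocks, block_size), then average each block
    blocks = image.reshape(new_size[0], block_h, new_size[1], block_w, c)
    return blocks.mean(axis=(1, 3))

# Synthetic 60x60 image: left half dark, right half bright
demo = np.zeros((60, 60, 3))
demo[:, 30:, :] = 200.0
small = block_average(demo, new_size=(30, 30))
print(small.shape)
```

In this toy case the dark/bright split is still visible in the 30x30 output, whereas a single mean color would collapse it to one uniform value.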

In [None]:
# Load images from dataset
dataset_directory = os.path.join('archive','ISIC_2019_Training_Input','ISIC_2019_Training_Input')
images = load_images(dataset_directory)

# Calculate average pixel values
average_pixels = calculate_average_pixels(images)

# Resize images using average pixel values
resized_images = resize_images(images)

The following code compares the first 3 images in their resized (averaged) and original forms. This helps us judge whether the average pixel value is a reasonable basis for k-means and for our conclusions, or whether the averaged images differ so drastically from the originals that we need to account for this in our final conclusions.

In [None]:
# Show the original and resized image
for i in range(3):
    cv2.imshow('Original', images[i])
    cv2.imshow('Resized', resized_images[i])
    cv2.waitKey(0)
    cv2.destroyAllWindows()

In [None]:
grayscale = pd.read_csv(os.path.join('grayscale_img.csv', 'grayscale_img.csv'), index_col = 0)

grayscale

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X_gray = grayscale.drop(columns = ['label'])
y_gray = grayscale['label']
clf = LinearDiscriminantAnalysis()
clf.fit_transform(X_gray, y_gray)

In [None]:
color = pd.read_csv(os.path.join('resized_images_color.csv', 'resized_images_color.csv'))

color

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X_color = color.drop(columns = ['label'])
y_color = color['label']
clf = LinearDiscriminantAnalysis()
clf.fit_transform(X_color, y_color)

We will now use the data to run k-means. Note that the dataset we test on has been converted to grayscale and reduced in size (from the original 512x512 to 30x30) for computational reasons. We point this out because it will affect the results: by shrinking the images we discard features that may be important for better predictions. To compensate for the reduction, we used the average of the pixels before reducing the size of the images.

In [None]:
# read the csv that has all the changes applied already
df = pd.read_csv("grayscale_img.csv")

In [None]:
def vis_clust(df, n_clusters=8):
    # Convert DataFrame to numpy array
    data = df.values
    
    # Flatten the data
    flattened_data = data.reshape(data.shape[0], -1)
    
    # Standardize the data
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(flattened_data)
    
    # Apply PCA for dimensionality reduction
    pca = PCA(n_components=2)
    data_reduced = pca.fit_transform(data_scaled)
    
    # Perform KMeans clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(data_reduced)
    
    # Visualize the clusters
    plt.figure(figsize=(8, 6))
    for cluster in range(n_clusters):
        plt.scatter(data_reduced[cluster_labels == cluster, 0], 
                    data_reduced[cluster_labels == cluster, 1], 
                    label=f'Cluster {cluster + 1}')
    plt.title('Clusters Visualization')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend()
    plt.show()

# Filter numeric columns bc some vectors contain string
numeric_df = df.select_dtypes(include=np.number)

# Call vis_clust only num values, because cant run with strings
vis_clust(numeric_df)
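
Projecting onto two principal components makes the clusters easy to plot, but it is worth checking how much variance those two components keep. The sketch below does this with a plain SVD on synthetic data; the data, its two-factor structure, and all variable names are assumptions for illustration, not our image vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for flattened images: 200 samples, 50 features,
# generated from 2 latent factors plus a little noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# PCA via SVD of the centered data
X_centered = X - X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
kept = explained_ratio[:2].sum()
print(f"Variance kept by 2 components: {kept:.3f}")
```

With scikit-learn the same quantity is available after fitting as `pca.explained_variance_ratio_`. A low value would mean the 2-D scatter plot hides much of the structure k-means could otherwise use.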

### K-Means

Now that the images are preprocessed, we know there should be 8 different clusters, since the images contain 8 different skin conditions, but we want to see how k-means clusters them. Since this is the unsupervised portion, we will also use the elbow method to choose the number of clusters and compare it with the number we know to be true.

K-means with 8 clusters does a good job of clustering. Our only concern is the top-right section between clusters 4 and 5, but that is expected due to outliers.

In [None]:
def elbow_method(df, max_clusters=10):
    distortions = []
    for n_clusters in range(1, max_clusters + 1):
        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        kmeans.fit(df)
        # Sum of squared distances to closest centroid
        distortions.append(kmeans.inertia_)  

    # Plotting the elbow method
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, max_clusters + 1), distortions, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Distortion')
    plt.title('Elbow Method')
    plt.grid(True)
    plt.show()

# Select only numeric columns from the DataFrame
numeric_columns = df.select_dtypes(include=[np.number]).columns
numeric_data = df[numeric_columns]

# Prepare data
flattened_data = np.array([image.flatten() for image in numeric_data.values])
scaler = StandardScaler()
data_scaled = scaler.fit_transform(flattened_data)

# Run elbow method
elbow_method(data_scaled, max_clusters=10)

Based on the elbow graph, between 2 and 4 clusters is ideal for our k-means. It makes sense that this is fewer than 8, since using more clusters than the data supports would be overfitting, but another reason may be that skin lesions have minor differences that k-means overlooked. We will compare how 3 clusters look relative to 8 clusters.
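
Reading the elbow by eye is subjective. One common heuristic, sketched below, picks the number of clusters where the inertia curve bends most sharply, measured by the largest second difference. The inertia values are invented for illustration and are not taken from our run; `pick_elbow` is a hypothetical helper.

```python
import numpy as np

def pick_elbow(inertias, k_values):
    """Return the k at which the inertia curve has its largest second difference."""
    inertias = np.asarray(inertias, dtype=float)
    # second_diff[i] = inertias[i] - 2 * inertias[i + 1] + inertias[i + 2],
    # which is centered on k_values[i + 1]
    second_diff = np.diff(inertias, n=2)
    return k_values[int(np.argmax(second_diff)) + 1]

k_values = list(range(1, 11))
# Invented inertia curve that flattens out after k = 3
inertias = [1000, 600, 350, 300, 270, 250, 235, 225, 218, 212]
elbow_k = pick_elbow(inertias, k_values)
print("Suggested number of clusters:", elbow_k)
```

The heuristic only formalizes the visual judgment, so it is best treated as a starting point rather than a definitive answer.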

In [None]:
# vis_clust creates and shows its own figure, so we call it once per setting
# instead of drawing into shared subplots (which would stay empty)

# Plot with 8 clusters
vis_clust(numeric_df, n_clusters=8)

# Plot with 3 clusters
vis_clust(numeric_df, n_clusters=3)

The last column of our df contains the true label of each image. We will now use this column to see how accurate k-means is with respect to what we know is true. We will test this for 3 and 8 clusters.

In [None]:
# write out all the unique strings in the last column
unique_strings = df.iloc[:, -1].unique()
print(unique_strings)

In [None]:
from sklearn.metrics import accuracy_score
#Extract truths from df last column
true_labels = df.iloc[:, -1]
# kmeans 3 clusters
kmeans_3 = KMeans(n_clusters=3, random_state=42)
kmeans_3.fit(data_scaled)
labels_3 = kmeans_3.labels_

#kmeans 8 clusters
kmeans_8 = KMeans(n_clusters=8, random_state=42)
kmeans_8.fit(data_scaled)
labels_8 = kmeans_8.labels_

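# Map cluster ids to label names. K-means ids are arbitrary, so this fixed
# assignment is an assumption rather than a learned correspondence.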
cluster_mapping = {
    0: 'NV',
    1: 'MEL',
    2: 'BKL',
    3: 'DF',
    4: 'SCC',
    5: 'BCC',
    6: 'VASC',
    7: 'AK'
}

# Map cluster labels to true labels for 3 clusters
predicted_labels_3 = [cluster_mapping[label] for label in labels_3]

# Map cluster labels to true labels for 8 clusters
predicted_labels_8 = [cluster_mapping[label] for label in labels_8]

# Compare the accuracy of the clustering results
accuracy_3 = accuracy_score(true_labels, predicted_labels_3)
accuracy_8 = accuracy_score(true_labels, predicted_labels_8)
# Print the first few predicted labels for each clustering
print("\nPredicted Labels for 3 clusters:")
print(predicted_labels_3[:10])  # Print the first 10 predicted labels for 3 clusters
print("\nPredicted Labels for 8 clusters:")
print(predicted_labels_8[:10])  # Print the first 10 predicted labels for 8 clusters
print("Accuracy with 3 clusters:", accuracy_3)
print("Accuracy with 8 clusters:", accuracy_8)
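
K-means cluster ids carry no meaning of their own, so any comparison with the true labels depends on how ids are mapped to label names. One option, sketched here with toy data, assigns each cluster the most frequent true label among its members. `majority_vote_mapping` is a hypothetical helper and the labels below are invented.

```python
from collections import Counter

def majority_vote_mapping(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in members.items()}

# Toy example: 3 clusters, 8 samples
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = ['NV', 'NV', 'MEL', 'MEL', 'MEL', 'BKL', 'NV', 'BKL']

mapping = majority_vote_mapping(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]
accuracy = sum(p == t for p, t in zip(predicted, true_labels)) / len(true_labels)
print(mapping, accuracy)
```

Because the mapping is derived from the same labels it is scored against, the resulting number is best read as cluster purity rather than as held-out accuracy.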

Because of the low accuracy of our image-based models, we turned to the image metadata in an attempt to build a predictive model from those patient statistics and see whether it is better at predicting the skin lesion.

The metadata consists of 5 columns (image, age_approx, anatom_site_general, lesion_id, sex). We needed to clean the data a bit, specifically by removing any rows without a lesion id. We then removed the part of each lesion id after the underscore, because the raw value includes the lesion id with respect to the image; this lets us use the column as our ground truth for validation. Lastly, we converted the sex column from strings to numerical values: female = 0 and male = 1.

In [None]:
# Load the metadata from the CSV file
metadata_df = pd.read_csv("ISIC_2019_Training_Metadata.csv")

# Drop rows where 'lesion_id' is blank or missing
metadata_df.dropna(subset=['lesion_id'], inplace=True)

# Remove characters after the underscore in 'lesion_id' column
metadata_df['lesion_id'] = metadata_df['lesion_id'].apply(lambda x: x.split('_')[0])

# Convert strings in the 'sex' column to numerical values
metadata_df['sex'] = metadata_df['sex'].map({'female': 0, 'male': 1})

### Hierarchical Clustering

We will now create a hierarchical clustering dendrogram.

In [None]:
from sklearn.preprocessing import LabelEncoder
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt

# Encode categorical variables
label_encoder = LabelEncoder()
metadata_df['anatom_site_general'] = label_encoder.fit_transform(metadata_df['anatom_site_general'])

# Select features for clustering
features = metadata_df[['age_approx', 'anatom_site_general','sex']]
# Impute NaN values with mean
features = features.fillna(features.mean())
# Perform hierarchical clustering
Z = linkage(features, method='ward')
# Plot dendrogram
plt.figure(figsize=(12, 8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
dendrogram(Z, truncate_mode='lastp', p=30, leaf_rotation=90., leaf_font_size=8.)
plt.show()

We truncated the dendrogram (`truncate_mode='lastp'`, `p=30`) so that the x-axis shows merged groups of samples; otherwise it would not be legible given the number of samples we used.

In [None]:
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import fcluster

#Remove missing or blank values from metadata
metadata_df = metadata_df.dropna(subset=['lesion_id'])

#Ground Truth for validation
ground_truth_labels = metadata_df['lesion_id'].apply(lambda x: x.split('_')[0])

#Define a range of threshold distances
thresholds = range(10, 500, 5)

best_ari = -1
best_t = None

for t in thresholds:
    # Obtain cluster labels
    cluster_labels = fcluster(Z, t, criterion='distance')
    
    # Calculate ARI
    ari = adjusted_rand_score(ground_truth_labels, cluster_labels)
    
    # Update best ARI and threshold if ARI improves
    if ari > best_ari:
        best_ari = ari
        best_t = t

print("Best threshold distance (t):", best_t)
print("Best Adjusted Rand Index (ARI):", best_ari)

The for loop searches for the threshold t that gives the highest ARI. Unfortunately, the best score is still below 0.1, which means our clustering did a very poor job of matching the ground truths.
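
To make the ARI scale concrete, the sketch below computes it from the standard pair-counting formula using only the standard library. The scores reported in this notebook come from scikit-learn's `adjusted_rand_score`; `adjusted_rand_index` here is a hypothetical re-implementation for illustration, and the labels are toy data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency counts of two labelings of the same samples."""
    n = len(labels_true)
    # Pairs of samples that share both the true label and the predicted cluster
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:                     # both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = ['a', 'a', 'a', 'b', 'b', 'b']
same_partition = [1, 1, 1, 0, 0, 0]   # identical grouping, different ids
one_cluster = [0, 0, 0, 0, 0, 0]      # everything lumped together

ari_same = adjusted_rand_index(truth, same_partition)
ari_one = adjusted_rand_index(truth, one_cluster)
print(ari_same, ari_one)
```

The score ignores what the cluster ids are called and only rewards grouping the same samples together, with 0 as the chance level and 1 as a perfect match.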

Next we run a second hierarchical clustering, this time applying weights to the features to see whether this improves the ARI score.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# 'features' holds the metadata features and 'ground_truth_labels' the labels, both defined above

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_scaled, ground_truth_labels)

# Extract feature importances
feature_importances = rf_classifier.feature_importances_

# Normalize feature importances to create weights
weights = feature_importances / feature_importances.sum()

# Print feature importance weights
print("Feature Importance Weights:")
# the weights follow the column order of 'features', not of metadata_df
for feature, weight in zip(features.columns, weights):
    print(f"{feature}: {weight}")

In [None]:
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np

# Normalize metadata features using weights
weighted_features = X_scaled * weights

# Compute the weighted Euclidean distance matrix
weighted_distance_matrix = pdist(weighted_features)

# Perform hierarchical clustering with weighted distance matrix
Z = linkage(weighted_distance_matrix, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 8))
plt.title('Hierarchical Clustering Dendrogram with Weights')
plt.xlabel('Sample index')
plt.ylabel('Distance')
dendrogram(Z, truncate_mode='lastp', p=30, leaf_rotation=90., leaf_font_size=8.)
plt.show()
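
The cell above multiplies each standardized feature by its weight before computing Euclidean distances. The sketch below, using made-up vectors and weights, shows what that does to the distance: each squared difference ends up scaled by the square of the weight. If the intent were to scale squared differences by the weight itself, the features would be multiplied by the square root of the weight instead. All values here are assumptions for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])
weights = np.array([0.5, 0.3, 0.2])   # hypothetical importances summing to 1

# Scaling the features by w and taking the Euclidean distance ...
scaled_dist = np.linalg.norm(x * weights - y * weights)
# ... is the same as sqrt(sum_i w_i^2 * (x_i - y_i)^2)
explicit_dist = np.sqrt(np.sum(weights**2 * (x - y) ** 2))

# To weight each squared difference by w_i itself, scale by sqrt(w_i) instead
sqrt_scaled_dist = np.linalg.norm((x - y) * np.sqrt(weights))
linear_weighted_dist = np.sqrt(np.sum(weights * (x - y) ** 2))

print(scaled_dist, explicit_dist)
print(sqrt_scaled_dist, linear_weighted_dist)
```

Either choice is defensible; the difference is how strongly a high-importance feature dominates the distance.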

In [None]:
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import fcluster

# Remove missing or blank values from metadata
metadata_df = metadata_df.dropna(subset=['lesion_id'])

# Ground Truth for validation
ground_truth_labels = metadata_df['lesion_id'].apply(lambda x: x.split('_')[0])

# Define a range of threshold distances
thresholds = range(10, 500, 5)

best_ari = -1
best_t = None

for t in thresholds:
    # Obtain cluster labels with weighted hierarchical clustering
    cluster_labels = fcluster(Z, t, criterion='distance')
    
    # Calculate ARI
    ari = adjusted_rand_score(ground_truth_labels, cluster_labels)
    
    # Update best ARI and threshold if ARI improves
    if ari > best_ari:
        best_ari = ari
        best_t = t

print("Best threshold distance (t):", best_t)
print("Best Adjusted Rand Index (ARI):", best_ari)

Our ARI increased significantly when we incorporated weights: it was roughly 0.02 before and is now roughly 0.16. This means our model improved at recovering the ground truth, but the score is still low given that ARI = 1 corresponds to a perfect match. That said, since the score is greater than 0, the clustering performs better than random assignment.

### Gaussian Mixture Model

In [None]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

def vis_clust_gmm(data, n_clusters=9):
    # Flatten each image, standardize, and project onto 2 principal components
    flattened_data = np.array([image.flatten() for image in data])
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(flattened_data)

    pca = PCA(n_components=2)
    data_reduced = pca.fit_transform(data_scaled)

    # Gaussian Mixture Model
    gmm = GaussianMixture(n_components=n_clusters, random_state=42)
    gmm.fit(data_reduced)
    cluster_labels = gmm.predict(data_reduced)

    plt.figure(figsize=(8, 6))
    for cluster in np.unique(cluster_labels):
        plt.scatter(data_reduced[cluster_labels == cluster, 0],
                    data_reduced[cluster_labels == cluster, 1],
                    label=f'Cluster {cluster + 1}')
    plt.title('GMM Clusters Visualization')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend()
    plt.show()
    # Return the projection and labels so they can be scored afterwards
    return data_reduced, cluster_labels

data_reduced, cluster_labels = vis_clust_gmm(resized_images, n_clusters=9)

def evaluate_gmm_silhouette(data_reduced, cluster_labels, n_clusters=9):
    # Calculate silhouette score on the same 2-D projection used for clustering
    silhouette_avg = silhouette_score(data_reduced, cluster_labels)
    print(f'Silhouette Score for {n_clusters} clusters: {silhouette_avg:.3f}')

    plt.figure(figsize=(8, 6))
    for cluster in np.unique(cluster_labels):
        plt.scatter(data_reduced[cluster_labels == cluster, 0],
                    data_reduced[cluster_labels == cluster, 1],
                    label=f'Cluster {cluster + 1}')
    plt.legend()
    plt.show()

### Neural Networks

In [None]:
data = pd.read_csv("/Users/demo/Desktop/COGS118B_WI_24/Project/archive/ISIC_2019_Training_GroundTruth.csv", header = 0)
lesion_type_dict = {
    'NV': 'Melanocytic nevi',
    'MEL': 'Melanoma',
    'BKL': 'Benign keratosis ',
    'BCC': 'Basal cell carcinoma',
    'AK': 'Actinic keratoses',
    'VASC': 'Vascular lesions',
    'DF': 'Dermatofibroma',
    'SCC' : 'Squamous cell carcinoma'
}
data['truth'] = data.drop(columns='image').idxmax(axis=1)
data.head(20)

In [None]:
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)

In [None]:
def load_data(path: str):
    image_dir = Path(path)
    # Key each file path by its image name so labels are matched by name,
    # not by the (unspecified) order in which glob returns the files
    filepaths = pd.Series({p.stem: str(p) for p in image_dir.glob(r'**/*.jpg')}, name='FilePaths')
    labels = data.set_index('image')['truth'].astype(str).rename('Labels')
    df = pd.concat([filepaths, labels], axis=1, join='inner')
    return df.sample(frac=1).reset_index(drop=True)
df = load_data('/Users/demo/Desktop/COGS118B_WI_24/Project/archive/ISIC_2019_Training_Input/ISIC_2019_Training_Input')

In [None]:
df.head(20)

In [None]:
labels_count = df['Labels'].value_counts(ascending=True)

plt.figure(figsize=(15, 6))
plt.bar(labels_count.index, labels_count.values)
plt.show()

In [None]:
df['Labels'] = df['Labels'].apply(lambda x: x if x == 'NV' else 'OTH')
df_binary = df['Labels']
binary = np.array([-1 if x == 'OTH' else 1 for x in df_binary])
binary.shape

We can see that our data is heavily imbalanced toward NV, with far fewer examples of the other diseases.
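
One common response to this kind of imbalance is to weight each class inversely to its frequency during training. The sketch below computes such "balanced" weights, n_samples / (n_classes * class_count), on a toy label list. `balanced_class_weights` is a hypothetical helper and the counts are invented; we did not necessarily train with these weights.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy label list with the same kind of skew as our binary NV / OTH split
toy_labels = ['NV'] * 6 + ['OTH'] * 2
class_weights = balanced_class_weights(toy_labels)
print(class_weights)
```

The rarer class receives the larger weight, so its errors count for more in the loss.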

In [None]:
files = pd.read_csv('archive/ISIC_2019_Training_GroundTruth.csv')
files

In [None]:
filesWithLabels = pd.DataFrame()
filesWithLabels['file'] = files['image']+'.jpg'
filesWithLabels['label'] = ""

In [None]:
filesWithLabels['label'] = data['truth']
filesWithLabels['file'] = 'archive/ISIC_2019_Training_Input/ISIC_2019_Training_Input/'+filesWithLabels['file']

In [None]:
X_train, X_test = train_test_split(filesWithLabels, test_size=0.2, random_state=42)

In [None]:
X_train

In [None]:
epochs = 5
input_shape = (128, 128, 3)
num_classes = 1
def get_model():   

    model = Sequential([
        Input(shape=input_shape),
        Conv2D(16, kernel_size=(3, 3), activation="relu", padding="same"),
        MaxPooling2D(pool_size=(2, 2)),
        BatchNormalization(),
        Dropout(0.2),
        Conv2D(32, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        BatchNormalization(),
        Dropout(0.2),
        Conv2D(64, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        BatchNormalization(),
        Dropout(0.2),
        Conv2D(128, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        BatchNormalization(),
        Dropout(0.2),
        Conv2D(256, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        BatchNormalization(),
        Dropout(0.2),
        Flatten(),
        Dense(1, activation="sigmoid")
    ])
    return model
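
To see how much the image shrinks on its way through this network, the spatial size after each block can be traced by hand. The sketch below assumes the Keras defaults used above (stride-1 convolutions, `"valid"` padding unless stated otherwise, 2x2 max pooling with floor division); `conv_out` and `pool_out` are hypothetical helper names introduced only for this illustration.

```python
def conv_out(size, kernel=3, padding="valid"):
    """Output width/height of a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool=2):
    """Output width/height of a non-overlapping max pool (floor division)."""
    return size // pool

size = 128
filters = [16, 32, 64, 128, 256]
shapes = []
for block, n_filters in enumerate(filters):
    # Only the first convolution uses padding="same"
    padding = "same" if block == 0 else "valid"
    size = pool_out(conv_out(size, padding=padding))
    shapes.append((size, size, n_filters))

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)     # [(64, 64, 16), (31, 31, 32), (14, 14, 64), (6, 6, 128), (2, 2, 256)]
print(flattened)  # 1024
```

Under these assumptions the final `Dense` layer sees a 1024-dimensional vector, so most of the spatial detail of the 128x128 input has been pooled away by the time the prediction is made.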

In [None]:
datagen = ImageDataGenerator(
    #rescale=1./255,
    validation_split=0.2,
    rotation_range=20,  
    width_shift_range=0.2, 
    height_shift_range=0.2,
    horizontal_flip=True, 
    vertical_flip=True,
    fill_mode="nearest",
)
def plot_history(history, title):
    """Plot training and validation loss per epoch."""
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title(title)
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

In [None]:
def conver_models(model,name):
    dest_folder = '/kaggle/working/'
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    with open(dest_folder  + name +".tflite", 'wb') as f:
        f.write(tflite_model)
def create_model(table, name, epoch):
    early_stop = EarlyStopping(monitor='val_loss', patience=5)
    train_generator = datagen.flow_from_dataframe(
        dataframe=table,
        directory=None,
        x_col='file',
        y_col='label',
        subset="training",
        batch_size=64,
        seed=42,
        shuffle=True,
        class_mode="binary",
        target_size=(128, 128))
    validation_generator = datagen.flow_from_dataframe(
        dataframe=table,
        directory=None,
        x_col='file',
        y_col='label',
        subset="validation",
        batch_size=64,
        seed=42,
        shuffle=True,
        class_mode="binary",
        target_size=(128, 128))
    # Build and compile the CNN, then train it with early stopping on the validation loss
    model = get_model()
    model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
    history = model.fit(train_generator, epochs=epoch, validation_data=validation_generator, callbacks=[early_stop])
    return history, model     

In [None]:
NVTable = pd.concat([X_train.loc[X_train['label'] != 'NV'].sample(n=8000), X_train.loc[X_train['label'] == 'NV']])
NVTable['label'] = NVTable['label'].apply(lambda x: 'OTH' if x != 'NV' else x)

n = X_train.loc[X_train['label'] == 'MEL'].shape[0]
MELTable = pd.concat([X_train.loc[X_train['label'] != 'MEL'].sample(n=n), X_train.loc[X_train['label'] == 'MEL']])
MELTable['label'] = MELTable['label'].apply(lambda x: 'OTH' if x != 'MEL' else x)

n = X_train.loc[X_train['label'] == 'BKL'].shape[0]
BKLTable = pd.concat([X_train.loc[X_train['label'] != 'BKL'].sample(n=n), X_train.loc[X_train['label'] == 'BKL']])
BKLTable['label'] = BKLTable['label'].apply(lambda x: 'OTH' if x != 'BKL' else x)

n = X_train.loc[X_train['label'] == 'DF'].shape[0]
DFTable = pd.concat([X_train.loc[X_train['label'] != 'DF'].sample(n=n), X_train.loc[X_train['label'] == 'DF']])
DFTable['label'] = DFTable['label'].apply(lambda x: 'OTH' if x != 'DF' else x)

n = X_train.loc[X_train['label'] == 'SCC'].shape[0]
SCCTable = pd.concat([X_train.loc[X_train['label'] != 'SCC'].sample(n=n), X_train.loc[X_train['label'] == 'SCC']])
SCCTable['label'] = SCCTable['label'].apply(lambda x: 'OTH' if x != 'SCC' else x)

n = X_train.loc[X_train['label'] == 'BCC'].shape[0]
BCCTable = pd.concat([X_train.loc[X_train['label'] != 'BCC'].sample(n=n), X_train.loc[X_train['label'] == 'BCC']])
BCCTable['label'] = BCCTable['label'].apply(lambda x: 'OTH' if x != 'BCC' else x)

n = X_train.loc[X_train['label'] == 'VASC'].shape[0]
VASCTable = pd.concat([X_train.loc[X_train['label'] != 'VASC'].sample(n=n), X_train.loc[X_train['label'] == 'VASC']])
VASCTable['label'] = VASCTable['label'].apply(lambda x: 'OTH' if x != 'VASC' else x)

n = X_train.loc[X_train['label'] == 'AK'].shape[0]
AKTable = pd.concat([X_train.loc[X_train['label'] != 'AK'].sample(n=n), X_train.loc[X_train['label'] == 'AK']])
AKTable['label'] = AKTable['label'].apply(lambda x: 'OTH' if x != 'AK' else x)

tables = {
    "AK": AKTable,
    "NV": NVTable,
    "MEL": MELTable,
    "BKL": BKLTable,
    "DF": DFTable,
    "SCC": SCCTable,
    "BCC": BCCTable,
    "VASC": VASCTable,

}
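
The tables above all follow the same recipe: keep every row of the target class, undersample the remaining rows, and collapse their labels into `'OTH'`. A minimal sketch of that recipe as a single function is shown below. `make_binary_table`, its parameters, and the toy DataFrame are assumptions made for illustration; the sketch also caps the sample size and fixes a random seed, which the cells above do not do.

```python
import pandas as pd

def make_binary_table(frame, positive, n_negative=None, seed=42):
    """Build a one-vs-rest table: all `positive` rows plus a sample of the rest.

    Hypothetical helper; `n_negative` defaults to the number of positive rows.
    """
    pos = frame.loc[frame['label'] == positive]
    neg = frame.loc[frame['label'] != positive]
    if n_negative is None:
        n_negative = len(pos)
    # Never ask for more negatives than exist
    neg = neg.sample(n=min(n_negative, len(neg)), random_state=seed)
    table = pd.concat([neg, pos])
    # Collapse every non-target label into 'OTH'
    table['label'] = table['label'].where(table['label'] == positive, 'OTH')
    return table

# Toy frame for illustration only
toy = pd.DataFrame({
    'file': [f'img_{k}.jpg' for k in range(10)],
    'label': ['NV'] * 6 + ['MEL'] * 3 + ['DF'],
})
mel_table = make_binary_table(toy, 'MEL')
print(mel_table['label'].value_counts())
```

With such a helper, the dictionary could be built in one comprehension over the class names, passing `n_negative=8000` for NV to mirror the fixed sample size used above.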

In [None]:
for i in tables.keys():
    hist_1, mdl_1 = create_model(tables[i], str(i) + " model, validation", 7)
    # Evaluate on the held-out split, relabelled one-vs-rest like the training table
    test_table = X_test.copy()
    test_table['label'] = test_table['label'].apply(lambda x: 'OTH' if x != i else x)
    test_generator = datagen.flow_from_dataframe(
        dataframe=test_table,  # held-out test DataFrame
        directory=None,  # file paths in the 'file' column are used as-is
        x_col='file',  # column that contains the file paths
        y_col='label',  # column that contains the labels
        batch_size=64,
        seed=42,
        shuffle=False,  # keep the order fixed, important for evaluation
        class_mode="binary",
        target_size=(128, 128)
        )
    eval_result = mdl_1.evaluate(test_generator)
    print(f"{i} - Test Loss: {eval_result[0]}, Test Accuracy: {eval_result[1]}")
    plot_history(hist_1, str(i) + " model, validation loss")
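
The loop above reports only loss and accuracy, while recall and F1 are also among our stated metrics. A minimal sketch of how they could be computed from sigmoid outputs is given below. `binary_metrics` is a hypothetical helper and the labels and probabilities are toy values; in the notebook, the true labels and predicted probabilities would presumably come from `test_generator.classes` and `mdl_1.predict(test_generator)`, which is an assumption rather than something run here.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall and F1 for 0/1 labels and sigmoid outputs."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Toy values for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
metrics = binary_metrics(y_true, y_prob)
print(metrics)
```

On an imbalanced test set, recall and F1 for the rare class are usually more informative than accuracy, because a model that always predicts the majority class can still score a high accuracy.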

# Discussion

### Interpreting the result

<!-- OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point. -->

### Limitations

Computational power was a major limitation: ideally we would have trained our models on the images at their original 512x512 resolution. Kaggle users also pointed out that the dataset contains repeated samples, which can distort the reported scores. Another issue was that our images contained a lot of noise, specifically the skin surrounding the lesion. Because skin varies from person to person and the images were not tightly cropped around the lesion, this could explain our low accuracy. Finally, our images were downsized and converted to grayscale during preprocessing, which likely also limited our clustering models.
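
As an illustration of how the repeated samples could be detected, the sketch below groups files by a hash of their raw bytes. It is a minimal example on temporary toy files, not something we ran on the dataset; `find_duplicate_files` is a hypothetical helper, and it only catches byte-identical copies (re-encoded or resized duplicates would need a perceptual hash instead).

```python
import hashlib
import os
import tempfile

def find_duplicate_files(paths):
    """Group paths by the SHA-256 of their bytes; return groups with more than one file."""
    by_hash = {}
    for path in paths:
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        by_hash.setdefault(digest, []).append(path)
    return [group for group in by_hash.values() if len(group) > 1]

# Toy demo with three small temporary files (two share identical bytes)
with tempfile.TemporaryDirectory() as tmp:
    contents = {'a.jpg': b'lesion-1', 'b.jpg': b'lesion-2', 'c.jpg': b'lesion-1'}
    paths = []
    for name, payload in contents.items():
        path = os.path.join(tmp, name)
        with open(path, 'wb') as f:
            f.write(payload)
        paths.append(path)
    duplicates = find_duplicate_files(paths)
    duplicate_names = [sorted(os.path.basename(p) for p in g) for g in duplicates]

print(duplicate_names)  # [['a.jpg', 'c.jpg']]
```

Keeping only one file per group before the train/test split would prevent the same image from appearing on both sides of the split.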

### Ethics & Privacy

The main ethical concern is where the images come from and whether the people photographed consented to researchers studying them. In our case the photos contain no faces; they are high-resolution photos of the lesion site only, apparently taken with some kind of microscope. The data was obtained from a hospital in Barcelona, where patient consent was given. We sourced the dataset from Kaggle, which we consider a reliable source of ethically collected data. Another ethical concern is how the data will be used. In our case the model is intended only to benefit people, by helping them detect melanoma or other cancerous skin diseases at an early stage and seek medical care as soon as possible.

<!-- Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org -->

### Conclusion

<!-- Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to. -->

