Welcome to assignment 1.                                                       

We are using pathology images for our first assignment please download data from this link https://drive.google.com/drive/folders/10dUOzcPR-PQwfFYcHk5gsLjIjSorQ32Q?usp=sharing



Task 1: Feature Generation (15%)
Use and run the following code (a deep network) to generate features from a set of training images. For this assignment, you do not need to know how the deep network is working here to extract features.
This code extracts the features of image T4.tif (in the T folder of dataset). Modify the code so that it iterates over all images of the dataset and extracts their features.
Allocate 10% of the data for validation.

Insert your code here for Task 1





In [54]:
# import the necessary packages

import torch
import numpy as np
import torchvision.transforms as transforms
from torchvision.models import densenet121
from torch.autograd import Variable
from PIL import Image
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import os

# imports for task 2
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.model_selection import cross_val_score
import numpy as np
from collections import Counter

In [58]:
# Load the DenseNet model pre-trained on ImageNet
model = densenet121(weights=True)
# Modify the model to remove the last fully connected layer
model = torch.nn.Sequential(*list(model.children())[:-1])
# Add a global average pooling layer to the model
model.add_module("global_avg_pool", torch.nn.AdaptiveAvgPool2d(1))
# Set the model to evaluation mode
model.eval()

# Define a series of transformations for preprocessing the images
preprocess = transforms.Compose(
    [
        transforms.Resize(256),  # Resize the input images to 256x256
        transforms.CenterCrop(224),  # Crop the images to 224x224
        transforms.ToTensor(),  # Convert the images to PyTorch tensors
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        ),  # Normalize the images
    ]
)

# Specify the directory containing the dataset
dataset_dir = "train"

# Initialize lists to hold image paths and their corresponding labels
image_paths = []
labels = []

# Iterate through each folder in the dataset directory
for folder_name in os.listdir(dataset_dir):
    folder_path = os.path.join(dataset_dir, folder_name)
    # Check if the path is a directory
    if os.path.isdir(folder_path):
        # Iterate through each file in the folder
        for file_name in os.listdir(folder_path):
            # Check if the file is a TIFF image
            if file_name.endswith(".tif"):
                # Append the image path and label to their respective lists
                image_paths.append(os.path.join(folder_path, file_name))
                labels.append(folder_name)

# Convert categorical labels into numeric labels
unique_labels = sorted(set(labels))
label_to_numeric = {label: idx for idx, label in enumerate(unique_labels)}
labels_numeric = [label_to_numeric[label] for label in labels]

# Combine image paths and numeric labels into tuples for easy processing
combined = list(zip(image_paths, labels_numeric))

# Split the combined dataset into training/validation and testing sets
train_val_combined, test_combined = train_test_split(
    combined, test_size=0.1, random_state=42
)

# Further split the training/validation set into separate training and validation sets
train_combined, val_combined = train_test_split(
    train_val_combined, test_size=0.1, random_state=42
)


def extract_features_and_labels(combined_data):
    """
    Extract features and labels from the given dataset.

    Parameters:
    - combined_data: A list of tuples, each containing the path to an image and its numeric label.

    Returns:
    - A tuple containing two numpy arrays: one for the extracted features and one for the corresponding labels.
    """
    features = []
    labels = []
    for path, label in combined_data:
        # Load the image from the specified path
        image = Image.open(path)
        # Preprocess the image
        input_tensor = preprocess(image)
        input_batch = input_tensor.unsqueeze(0)
        # Extract features using the model
        with torch.no_grad():
            output = model(input_batch)
        features.append(output.squeeze().detach().numpy())
        labels.append(label)
    return np.array(features), np.array(labels)


# Extract features and labels for training, testing, and validation sets
train_features, train_labels = extract_features_and_labels(train_combined)
test_features, test_labels = extract_features_and_labels(test_combined)
val_features, val_labels = extract_features_and_labels(val_combined)

# Save the extracted features and labels to disk
np.save("train_features.npy", train_features)
np.save("test_features.npy", test_features)
np.save("val_features.npy", val_features)
np.save("train_labels.npy", train_labels)
np.save("test_labels.npy", test_labels)
np.save("val_labels.npy", val_labels)

print("Features and labels for training, testing, and validation sets have been saved.")

# Note on fixing potential warning with updated model loading approach:
# Uncomment and use the following code to address deprecation warnings related to loading pretrained models:
# from torchvision.models import densenet121, DenseNet121_Weights
# model_weights = DenseNet121_Weights.IMAGENET1K_V1  # Alternatively, use DenseNet121_Weights.DEFAULT for the latest weights
# model = densenet121(weights=model_weights)
# Modify the model similarly as above to prepare for feature extraction
# model.eval()

Features and labels for training, testing, and validation sets have been saved.


 Task 2: High Bias Classification Method (5%)
 Choose a classification method and let is have a high bias.
 Train it on the generated features and discuss why it is underfitting.

 Insert your code here for Task 2




In [59]:
# Use a multi-class logistic regression method to classify data
# initialize a logistic regression model
lr_model = LogisticRegression(max_iter=1000, multi_class="ovr")

# perform k-fold cross validation
k = 10
lr_scores = cross_val_score(lr_model, train_features, train_labels, cv=k)

# Print cross-validation scores for logistic regression
for i, score in enumerate(lr_scores):
    print(f"Logistic Regression Fold {i+1} Accuracy: {score}")

# Print mean scores for logistic regression
mean_lr_accuracy = np.mean(lr_scores)
print(f"Mean Logistic Regression Accuracy: {mean_lr_accuracy}")

# Use a multi-class SVM method to classify data
hb_svm_model = svm.SVC(kernel="linear", C=0.00001, gamma=10000)
hb_svm_model.fit(train_features, train_labels)
hb_svm_train_score = hb_svm_model.score(
    train_features, train_labels, sample_weight=None
)
hb_svm_val_score = hb_svm_model.score(val_features, val_labels, sample_weight=None)

print(hb_svm_train_score, hb_svm_val_score)

Logistic Regression Fold 1 Accuracy: 1.0
Logistic Regression Fold 2 Accuracy: 0.9523809523809523
Logistic Regression Fold 3 Accuracy: 0.9841269841269841
Logistic Regression Fold 4 Accuracy: 0.9523809523809523
Logistic Regression Fold 5 Accuracy: 0.9841269841269841
Logistic Regression Fold 6 Accuracy: 0.9841269841269841
Logistic Regression Fold 7 Accuracy: 0.9682539682539683
Logistic Regression Fold 8 Accuracy: 0.9841269841269841
Logistic Regression Fold 9 Accuracy: 0.9523809523809523
Logistic Regression Fold 10 Accuracy: 1.0
Mean Logistic Regression Accuracy: 0.976190476190476
0.16164817749603805 0.09859154929577464


 Task 3: High Variance Classification Method (5%)
 Use the chosen classification method and let it have a high variance.
 Train it on the generated features and discuss why it is overfitting.

 Insert your code here for Task 3




In [60]:
# Use a multi-class SVM method to classify data
hv_svm_model = svm.SVC(kernel="sigmoid")
hv_svm_scores = cross_val_score(hv_svm_model, train_features, train_labels, cv=k)

# Print the cross-validation scores
for i, hv_score in enumerate(hv_svm_scores):
    print(f"SVM Fold {i+1} Accuracy: {hv_score}")

# Calculate and print the mean accuracy across all folds
hv_mean_accuracy = np.mean(hv_svm_scores)
print(f"Mean SVM Accuracy: {hv_mean_accuracy}")

# Calculate and print the accuracy variance across all folds
hv_accuracy_var = np.var(hv_svm_scores)
print(f"SVM Accuracy Variance: {hv_accuracy_var}")

SVM Fold 1 Accuracy: 0.890625
SVM Fold 2 Accuracy: 0.8888888888888888
SVM Fold 3 Accuracy: 0.9047619047619048
SVM Fold 4 Accuracy: 0.8412698412698413
SVM Fold 5 Accuracy: 0.8888888888888888
SVM Fold 6 Accuracy: 0.9206349206349206
SVM Fold 7 Accuracy: 0.9047619047619048
SVM Fold 8 Accuracy: 0.8888888888888888
SVM Fold 9 Accuracy: 0.873015873015873
SVM Fold 10 Accuracy: 0.9047619047619048
Mean SVM Accuracy: 0.8906498015873016
SVM Accuracy Variance: 0.0004255200705861044


 Task 4: Balanced Classification Method (15%)
 Use the chosen classification method and let it balance the bias and variance.
 Train it on the generated features, possibly adjusting parameters.
 Discuss insights into achieving balance.

 Insert your code here for Task 4




In [61]:
balanced_svm_model = svm.SVC(kernel="linear")
balanced_svm_scores = cross_val_score(
    balanced_svm_model, train_features, train_labels, cv=k
)

# Print the cross-validation scores
for i, balanced_score in enumerate(balanced_svm_scores):
    print(f"SVM Fold {i+1} Accuracy: {balanced_score}")

# Calculate and print the mean accuracy across all folds
balanced_mean_accuracy = np.mean(balanced_svm_scores)
print(f"Balanced SVM Mean Accuracy: {balanced_mean_accuracy}")

# Calculate and print the accuracy variance across all folds
balanced_accuracy_var = np.var(balanced_svm_scores)
print(f"Balanced SVM Accuracy Variance: {balanced_accuracy_var}")

SVM Fold 1 Accuracy: 0.96875
SVM Fold 2 Accuracy: 0.9682539682539683
SVM Fold 3 Accuracy: 0.9841269841269841
SVM Fold 4 Accuracy: 0.9206349206349206
SVM Fold 5 Accuracy: 0.9841269841269841
SVM Fold 6 Accuracy: 0.9523809523809523
SVM Fold 7 Accuracy: 0.9682539682539683
SVM Fold 8 Accuracy: 0.9682539682539683
SVM Fold 9 Accuracy: 0.9682539682539683
SVM Fold 10 Accuracy: 0.9682539682539683
Balanced SVM Mean Accuracy: 0.9651289682539682
Balanced SVM Accuracy Variance: 0.00029260213923532394


 Task 5: K-Means Clustering (20%)
 Apply K-Means clustering on the generated features.
 Test with available labels and report accuracy.
 Experiment with automated K and compare with manually set 20 clusters.

 Insert your code here for Task 5




In [67]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, accuracy_score
import numpy as np
from collections import Counter

# Load features and labels from Numpy files.
train_features = np.load("train_features.npy")
train_labels = np.load("train_labels.npy")
val_features = np.load("val_features.npy")
val_labels = np.load("val_labels.npy")
test_features = np.load("test_features.npy")
test_labels = np.load("test_labels.npy")

# Combine features and labels from training, validation, and test sets for clustering analysis.
features = np.concatenate((train_features, val_features, test_features), axis=0)
labels = np.concatenate((train_labels, val_labels, test_labels), axis=0)


def calculate_custom_accuracy(clusters, true_labels):
    """
    Calculate the custom accuracy for clustering by assigning the most frequent true label
    to each cluster and then comparing these assigned labels to the true labels.

    Parameters:
    - clusters (numpy.ndarray): Cluster assignments for each data point.
    - true_labels (numpy.ndarray): True labels for each data point.

    Returns:
    - float: The accuracy of the clustering based on how well the assigned cluster labels
             match the true labels.
    """
    label_mapping = {}
    for cluster_id in set(clusters):
        # Identify all data points assigned to the current cluster.
        cluster_indices = np.where(clusters == cluster_id)[0]
        # Get the true labels of these data points.
        cluster_labels = true_labels[cluster_indices]
        # Determine the most common label in the cluster.
        most_common_label = Counter(cluster_labels).most_common(1)[0][0]
        # Assign this label to the cluster.
        label_mapping[cluster_id] = most_common_label
    # Create an array of predicted labels based on the cluster assignments.
    predicted_labels = np.array([label_mapping[cluster_id] for cluster_id in clusters])
    # Calculate and return the accuracy.
    return accuracy_score(true_labels, predicted_labels)


# Initialize variables to track the best number of clusters (k) and the highest accuracy found.
best_k = 2
best_accuracy = 0

# Iterate over a range of k values to find the best one based on custom accuracy.
for k in range(2, 20):
    # Perform KMeans clustering with the current value of k.
    kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto").fit(features)
    # Get the cluster assignments for each data point.
    clusters = kmeans.labels_
    # Calculate the custom accuracy for these cluster assignments.
    accuracy = calculate_custom_accuracy(clusters, labels)
    print(f"k = {k}: Custom Accuracy = {accuracy}")
    # Update the best k and accuracy if the current accuracy is higher.
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_k = k

# Report the best custom accuracy found and the corresponding value of k.
print(f"Best custom accuracy: {best_accuracy} found for k = {best_k}")

# Perform clustering again with k set to 20 for comparison.
kmeans_20 = KMeans(n_clusters=31, random_state=42, n_init="auto").fit(features)
clusters_20 = kmeans_20.labels_
# Calculate and print the custom accuracy for k=20.
accuracy_20 = calculate_custom_accuracy(clusters_20, labels)
print(f"Custom Accuracy with k = 20: {accuracy_20}")


# Assuming 'features' contains your data
features = np.concatenate((train_features, val_features, test_features), axis=0)

# Define the range of K to try
k_values = range(2, 31)  # Example: trying values of K from 2 to 30

best_k = None
best_silhouette = -1
print(
    "============================================================================================"
)
# Experiment with different K values
print("Experimenting with different K values using silhouette_score to find accuracy:")
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto").fit(features)
    cluster_labels = kmeans.labels_
    silhouette_avg = silhouette_score(features, cluster_labels)

    print(f"For K={k}, the average silhouette_score is: {silhouette_avg}")

    # Update the best K if the current silhouette score is better
    if silhouette_avg > best_silhouette:
        best_k = k
        best_silhouette = silhouette_avg

# After finding the best_k based on silhouette scores
print(f"The best K is {best_k} with an average silhouette score of {best_silhouette}.")

# Perform KMeans clustering again using the best_k found from silhouette scores
kmeans_best_silhouette = KMeans(n_clusters=best_k, random_state=42, n_init="auto").fit(
    features
)
clusters_best_silhouette = kmeans_best_silhouette.labels_

# Calculate and print the custom accuracy for the clustering with best silhouette score
accuracy_best_silhouette = calculate_custom_accuracy(clusters_best_silhouette, labels)
print(
    f"Custom Accuracy with best silhouette score (k = {best_k}): {accuracy_best_silhouette}"
)

k = 2: Custom Accuracy = 0.1
k = 3: Custom Accuracy = 0.15
k = 4: Custom Accuracy = 0.2
k = 5: Custom Accuracy = 0.24871794871794872
k = 6: Custom Accuracy = 0.2987179487179487
k = 7: Custom Accuracy = 0.3487179487179487
k = 8: Custom Accuracy = 0.3974358974358974
k = 9: Custom Accuracy = 0.4461538461538462
k = 10: Custom Accuracy = 0.4948717948717949
k = 11: Custom Accuracy = 0.5333333333333333
k = 12: Custom Accuracy = 0.5820512820512821
k = 13: Custom Accuracy = 0.6256410256410256
k = 14: Custom Accuracy = 0.6666666666666666
k = 15: Custom Accuracy = 0.7012820512820512
k = 16: Custom Accuracy = 0.717948717948718
k = 17: Custom Accuracy = 0.7474358974358974
k = 18: Custom Accuracy = 0.7884615384615384
k = 19: Custom Accuracy = 0.7987179487179488
Best custom accuracy: 0.7987179487179488 found for k = 19
Custom Accuracy with k = 20: 0.8615384615384616
Experimenting with different K values using silhouette_score to find accuracy:
For K=2, the average silhouette_score is: 0.1650917083024

 Task 6: Additional Clustering Algorithm (10%)
 Choose another clustering algorithm and apply it on the features.
 Test accuracy with available labels.

 Insert your code here for Task 6




In [68]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, accuracy_score
import numpy as np
from collections import Counter

# Load features and labels from Numpy files.
train_features = np.load("train_features.npy")
train_labels = np.load("train_labels.npy")
val_features = np.load("val_features.npy")
val_labels = np.load("val_labels.npy")
test_features = np.load("test_features.npy")
test_labels = np.load("test_labels.npy")

# Combine features and labels from training, validation, and test sets for clustering analysis.
features = np.concatenate((train_features, val_features, test_features), axis=0)
labels = np.concatenate((train_labels, val_labels, test_labels), axis=0)

# Your existing custom accuracy function should work fine here as well.


# Explore a range of values for eps and min_samples to find the best DBSCAN configuration.
eps_values = [
    3,
    5,
    7,
    10,
]  # Example eps values. You should adjust this based on your data.
min_samples_values = [5, 10, 15]  # Example min_samples values.

best_eps = None
best_min_samples = None
best_accuracy = 0


def calculate_custom_accuracy(clusters, true_labels):
    # Filter out noise (-1 labels) from the clusters and corresponding true labels
    valid_indices = clusters != -1
    if not np.any(
        valid_indices
    ):  # If all points are noise, return 0 accuracy or an appropriate value
        return 0
    filtered_clusters = clusters[valid_indices]
    filtered_labels = true_labels[valid_indices]

    label_mapping = {}
    for cluster_id in set(filtered_clusters):
        cluster_indices = np.where(filtered_clusters == cluster_id)[0]
        cluster_labels = filtered_labels[cluster_indices]
        if (
            cluster_labels.size == 0
        ):  # Skip if the cluster is empty (should not happen after filtering noise)
            continue
        most_common_label = Counter(cluster_labels).most_common(1)[0][0]
        label_mapping[cluster_id] = most_common_label

    predicted_labels = np.array(
        [label_mapping.get(cluster_id, -1) for cluster_id in filtered_clusters]
    )
    return accuracy_score(filtered_labels, predicted_labels)


for eps in eps_values:
    for min_samples in min_samples_values:
        # Perform DBSCAN clustering with the current value of eps and min_samples.
        dbscan = DBSCAN(eps=eps, min_samples=min_samples).fit(features)
        # DBSCAN labels_ attribute to get the cluster assignments.
        clusters = dbscan.labels_

        # Ignore noise points (cluster label = -1) in accuracy calculation if necessary
        if np.any(clusters == -1):
            # Optional: Filter out noise points for accuracy calculation
            valid_indices = clusters != -1
            filtered_clusters = clusters[valid_indices]
            filtered_labels = labels[valid_indices]
            accuracy = calculate_custom_accuracy(filtered_clusters, filtered_labels)
        else:
            accuracy = calculate_custom_accuracy(clusters, labels)

        print(f"eps = {eps}, min_samples = {min_samples}: Custom Accuracy = {accuracy}")

        # Update the best eps, min_samples, and accuracy if the current accuracy is higher.
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_eps = eps
            best_min_samples = min_samples

# Report the best custom accuracy found and the corresponding values of eps and min_samples.
print(
    f"Best custom accuracy: {best_accuracy} found for eps = {best_eps} and min_samples = {best_min_samples}"
)

eps = 3, min_samples = 5: Custom Accuracy = 0
eps = 3, min_samples = 10: Custom Accuracy = 0
eps = 3, min_samples = 15: Custom Accuracy = 0
eps = 5, min_samples = 5: Custom Accuracy = 0
eps = 5, min_samples = 10: Custom Accuracy = 0
eps = 5, min_samples = 15: Custom Accuracy = 0
eps = 7, min_samples = 5: Custom Accuracy = 0
eps = 7, min_samples = 10: Custom Accuracy = 0
eps = 7, min_samples = 15: Custom Accuracy = 0
eps = 10, min_samples = 5: Custom Accuracy = 0.6174242424242424
eps = 10, min_samples = 10: Custom Accuracy = 0.7251184834123223
eps = 10, min_samples = 15: Custom Accuracy = 0.7151162790697675
Best custom accuracy: 0.7251184834123223 found for eps = 10 and min_samples = 10


 Task 7: PCA for Classification Improvement (20%)
 Apply PCA on the features and then feed them to the best classification method in the above tasks.
 Assess if PCA improves outcomes and discuss the results.

 Insert your code here for Task 7




In [69]:
Data_To_PCA = [[placeholder]]

# project from X to K dimensions using PCA
k = 2

# using sklearn's implementation
pca = PCA(k)
projected_sklearn = pca.fit_transform(Data_To_PCA.data)
print('reduced dim (sklearn):', projected_sklearn.shape)
plt.figure(1,figsize = (12,4))

plt.scatter(projected_sklearn[:, 0], projected_sklearn[:, 1],
            c=Data_To_PCA.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('tab20', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.title('Dim reduction using PCA (sklearn)')
plt.colorbar()


[[RUN BEST OTHER MODEL]]

SyntaxError: invalid syntax (3008200728.py, line 21)

 Task 8: Visualization and Analysis (10%)
 Plot the features in a lower dimension using dimentinality reduction techniques.
 Analyze the visual representation, identifying patterns or insights.

Insert your code here for Task 8

In [None]:
import umap
Data_To_Reduce = [[placeholder]]
Labels_For_Reduction = [[placeholder]]
mapper = umap.UMAP().fit(Data_To_Reduce)
umap.plot.points(mapper,labels=Labels_For_Reduction,theme='fire')


umap.plot.output_notebook()
hover_data = pd.DataFrame({'index':np.arange(30000),
                           'label':Labels_For_Reduction})
hover_data['item'] = hover_data.label.map(
    {
        '0':'Name1',
        '1':'Name1',
        '2':'Name1',
        '3':'Name1',
        '4':'Name1',
        '5':'Name1',
        '6':'Name1',
        '7':'Name1',
        '8':'Name1',
        '9':'Name1',
    }
)
p = umap.plot.interactive(mapper, labels=Labels_For_Reduction, hover_data=hover_data, point_size=2)
umap.plot.show(p)