Welcome to assignment 1.                                                       

We are using pathology images for our first assignment. Please download the data from this link: https://drive.google.com/drive/folders/10dUOzcPR-PQwfFYcHk5gsLjIjSorQ32Q?usp=sharing



Task 1: Feature Generation (15%)
Use and run the following code (a deep network) to generate features from a set of training images. For this assignment, you do not need to know how the deep network is working here to extract features.
This code extracts the features of image T4.tif (in the T folder of the dataset). Modify the code so that it iterates over all images in the dataset and extracts their features.
Allocate 10% of the data for validation.

Insert your code here for Task 1





In [30]:
# import the necessary packages

import torch
import numpy as np
import torchvision.transforms as transforms
from torchvision.models import densenet121
from torch.autograd import Variable
from PIL import Image
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import os

#imports for task 2
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.model_selection import cross_val_score

In [39]:
# Load the DenseNet model pre-trained on ImageNet
model = densenet121(pretrained=True)
# Modify the model to remove the last fully connected layer
model = torch.nn.Sequential(*list(model.children())[:-1])
# Add a global average pooling layer to the model
model.add_module("global_avg_pool", torch.nn.AdaptiveAvgPool2d(1))
# Set the model to evaluation mode
model.eval()

# Define a series of transformations for preprocessing the images
preprocess = transforms.Compose(
    [
        transforms.Resize(256),  # Resize so the shorter side is 256 pixels
        transforms.CenterCrop(224),  # Center-crop the images to 224x224
        transforms.ToTensor(),  # Convert the images to PyTorch tensors
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        ),  # Normalize the images
    ]
)

# Specify the directory containing the dataset
dataset_dir = "train"

# Initialize lists to hold image paths and their corresponding labels
image_paths = []
labels = []

# Iterate through each folder in the dataset directory
for folder_name in os.listdir(dataset_dir):
    folder_path = os.path.join(dataset_dir, folder_name)
    # Check if the path is a directory
    if os.path.isdir(folder_path):
        # Iterate through each file in the folder
        for file_name in os.listdir(folder_path):
            # Check if the file is a TIFF image
            if file_name.endswith(".tif"):
                # Append the image path and label to their respective lists
                image_paths.append(os.path.join(folder_path, file_name))
                labels.append(folder_name)

# Convert categorical labels into numeric labels
unique_labels = sorted(set(labels))
label_to_numeric = {label: idx for idx, label in enumerate(unique_labels)}
labels_numeric = [label_to_numeric[label] for label in labels]

# Combine image paths and numeric labels into tuples for easy processing
combined = list(zip(image_paths, labels_numeric))

# Split the combined dataset into training/validation and testing sets
train_val_combined, test_combined = train_test_split(
    combined, test_size=0.1, random_state=42
)

# Further split the training/validation set into separate training and validation sets
train_combined, val_combined = train_test_split(
    train_val_combined, test_size=0.1, random_state=42
)


def extract_features_and_labels(combined_data):
    """
    Extract features and labels from the given dataset.

    Parameters:
    - combined_data: A list of tuples, each containing the path to an image and its numeric label.

    Returns:
    - A tuple containing two numpy arrays: one for the extracted features and one for the corresponding labels.
    """
    features = []
    labels = []
    for path, label in combined_data:
        # Load the image and force 3-channel RGB, which Normalize expects
        image = Image.open(path).convert("RGB")
        # Preprocess the image
        input_tensor = preprocess(image)
        input_batch = input_tensor.unsqueeze(0)
        # Extract features using the model
        with torch.no_grad():
            output = model(input_batch)
        features.append(output.squeeze().detach().numpy())
        labels.append(label)
    return np.array(features), np.array(labels)


# Extract features and labels for training, testing, and validation sets
train_features, train_labels = extract_features_and_labels(train_combined)
test_features, test_labels = extract_features_and_labels(test_combined)
val_features, val_labels = extract_features_and_labels(val_combined)

# Save the extracted features and labels to disk
np.save("train_features.npy", train_features)
np.save("test_features.npy", test_features)
np.save("val_features.npy", val_features)
np.save("train_labels.npy", train_labels)
np.save("test_labels.npy", test_labels)
np.save("val_labels.npy", val_labels)

print("Features and labels for training, testing, and validation sets have been saved.")

# Note on fixing potential warning with updated model loading approach:
# Uncomment and use the following code to address deprecation warnings related to loading pretrained models:
# from torchvision.models import densenet121, DenseNet121_Weights
# model_weights = DenseNet121_Weights.IMAGENET1K_V1  # Alternatively, use DenseNet121_Weights.DEFAULT for the latest weights
# model = densenet121(weights=model_weights)
# Modify the model similarly as above to prepare for feature extraction
# model.eval()



Features and labels for training, testing, and validation sets have been saved.


 Task 2: High Bias Classification Method (5%)
 Choose a classification method and let it have a high bias.
 Train it on the generated features and discuss why it is underfitting.

 Insert your code here for Task 2




In [40]:
# Use a multi-class logistic regression method to classify data
# initialize a logistic regression model
lr_model = LogisticRegression(max_iter=1000, multi_class="ovr")

# perform k-fold cross validation
k = 10
lr_scores = cross_val_score(lr_model, train_features, train_labels, cv=k)

# Print cross-validation scores for logistic regression
for i, score in enumerate(lr_scores):
    print(f"Logistic Regression Fold {i+1} Accuracy: {score}")

# Print mean scores for logistic regression
mean_lr_accuracy = np.mean(lr_scores)
print(f"Mean Logistic Regression Accuracy: {mean_lr_accuracy}")

# Use a multi-class SVM with strong regularization (small C) to push it toward high bias.
# gamma is omitted because the linear kernel ignores it.
hb_svm_model = svm.SVC(kernel="linear", C=0.01)
hb_svm_scores = cross_val_score(hb_svm_model, train_features, train_labels, cv=k)

# Print the cross-validation scores
for i, hb_score in enumerate(hb_svm_scores):
    print(f"SVM Fold {i+1} Accuracy: {hb_score}")

# Calculate and print the mean accuracy across all folds
hb_mean_accuracy = np.mean(hb_svm_scores)
print(f"Mean SVM Accuracy: {hb_mean_accuracy}")

# Calculate and print the accuracy variance
hb_accuracy_var = np.var(hb_svm_scores)
print(f"SVM Accuracy Variance: {hb_accuracy_var}")

Logistic Regression Fold 1 Accuracy: 1.0
Logistic Regression Fold 2 Accuracy: 0.9523809523809523
Logistic Regression Fold 3 Accuracy: 0.9841269841269841
Logistic Regression Fold 4 Accuracy: 0.9523809523809523
Logistic Regression Fold 5 Accuracy: 0.9841269841269841
Logistic Regression Fold 6 Accuracy: 0.9841269841269841
Logistic Regression Fold 7 Accuracy: 0.9682539682539683
Logistic Regression Fold 8 Accuracy: 0.9841269841269841
Logistic Regression Fold 9 Accuracy: 0.9523809523809523
Logistic Regression Fold 10 Accuracy: 1.0
Mean Logistic Regression Accuracy: 0.976190476190476
SVM Fold 1 Accuracy: 0.96875
SVM Fold 2 Accuracy: 0.9682539682539683
SVM Fold 3 Accuracy: 0.9841269841269841
SVM Fold 4 Accuracy: 0.9206349206349206
SVM Fold 5 Accuracy: 0.9841269841269841
SVM Fold 6 Accuracy: 0.9682539682539683
SVM Fold 7 Accuracy: 0.9682539682539683
SVM Fold 8 Accuracy: 0.9682539682539683
SVM Fold 9 Accuracy: 0.9682539682539683
SVM Fold 10 Accuracy: 0.9682539682539683
Mean SVM Accuracy: 0.96671

 Task 3: High Variance Classification Method (5%)
 Use the chosen classification method and let it have a high variance.
 Train it on the generated features and discuss why it is overfitting.

 Insert your code here for Task 3




In [None]:
# Use a multi-class SVM method to classify data
hv_svm_model = svm.SVC(kernel='sigmoid')
hv_svm_scores = cross_val_score(hv_svm_model, train_features, train_labels, cv=k)

# Print the cross-validation scores
for i, hv_score in enumerate(hv_svm_scores):
    print(f"SVM Fold {i+1} Accuracy: {hv_score}")

# Calculate and print the mean accuracy across all folds
hv_mean_accuracy = np.mean(hv_svm_scores)
print(f"Mean SVM Accuracy: {hv_mean_accuracy}")

# Calculate and print the accuracy variance across all folds
hv_accuracy_var = np.var(hv_svm_scores)
print(f"SVM Accuracy Variance: {hv_accuracy_var}")

 Task 4: Balanced Classification Method (15%)
 Use the chosen classification method and let it balance the bias and variance.
 Train it on the generated features, possibly adjusting parameters.
 Discuss insights into achieving balance.

 Insert your code here for Task 4




In [None]:
balanced_svm_model = svm.SVC(kernel='linear')
balanced_svm_scores = cross_val_score(balanced_svm_model, train_features, 
                                      train_labels, cv=k)

# Print the cross-validation scores
for i, balanced_score in enumerate(balanced_svm_scores):
    print(f"SVM Fold {i+1} Accuracy: {balanced_score}")

# Calculate and print the mean accuracy across all folds
balanced_mean_accuracy = np.mean(balanced_svm_scores)
print(f"Balanced SVM Mean Accuracy: {balanced_mean_accuracy}")

# Calculate and print the accuracy variance across all folds
balanced_accuracy_var = np.var(balanced_svm_scores)
print(f"Balanced SVM Accuracy Variance: {balanced_accuracy_var}")
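One way to look for the balance point is to search over the regularization strength and compare training accuracy with cross-validated accuracy. The sketch below is only an illustration under stated assumptions: the RBF kernel, the grid values, and the synthetic `demo_features` / `demo_labels` (a stand-in for `train_features` / `train_labels`) are not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in data; replace with train_features / train_labels.
demo_features, demo_labels = make_classification(
    n_samples=300, n_features=64, n_informative=10, n_classes=3, random_state=42
)

# Assumed grid: small C means stronger regularization (more bias),
# large C and large gamma mean a more flexible model (more variance).
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100], "gamma": ["scale", 0.001, 0.01]},
    cv=5,
    return_train_score=True,
)
search.fit(demo_features, demo_labels)

best = search.best_index_
train_acc = search.cv_results_["mean_train_score"][best]
val_acc = search.cv_results_["mean_test_score"][best]
print(f"Best parameters: {search.best_params_}")
print(f"Mean train accuracy: {train_acc:.3f}, mean validation accuracy: {val_acc:.3f}")
```

A large gap between training and validation accuracy suggests overfitting, while low accuracy on both suggests underfitting; a balanced setting keeps validation accuracy high with a small gap.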

 Task 5: K-Means Clustering (20%)
 Apply K-Means clustering on the generated features.
 Test with available labels and report accuracy.
 Experiment with automated K and compare with manually set 20 clusters.

 Insert your code here for Task 5




In [18]:

# Assuming 'train_features' are the features extracted from the images

# Apply K-Means with manually set 20 clusters
kmeans_20 = KMeans(n_clusters=20, random_state=42)
clusters_20 = kmeans_20.fit_predict(train_features)

# Since KMeans does not inherently provide labels matching to original labels,
# a mapping function or strategy is needed to evaluate clustering accuracy.

# Apply K-Means with automated K (e.g., using the Elbow method or other heuristic)
# This step would involve determining the optimal number of clusters K,
# which can be done using methods like the Elbow Method or the Silhouette Score.

# Example for Elbow Method (commented out because it requires plotting)


# wcss = []
# for i in range(1, 21):
#     kmeans = KMeans(n_clusters=i, random_state=42)
#     kmeans.fit(train_features)
#     wcss.append(kmeans.inertia_)
# plt.plot(range(1, 21), wcss)
# plt.title('Elbow Method')
# plt.xlabel('Number of clusters')
# plt.ylabel('WCSS')  # Within cluster sum of squares
# plt.show()

# Assuming optimal K is found, e.g., k_optimal
# kmeans_opt = KMeans(n_clusters=k_optimal, random_state=42)
# clusters_opt = kmeans_opt.fit_predict(train_features)

# Accuracy reporting would require a strategy to match cluster labels to true labels, which is non-trivial
# because clustering is unsupervised and does not directly produce class labels that match with the ground truth.

NameError: name 'train_features' is not defined
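The comments above leave two pieces open: how to turn cluster IDs into an accuracy figure, and how to pick K automatically. A minimal sketch of one possible approach follows. It relabels each cluster with its most frequent true label (majority vote) and picks K by the highest silhouette score. The helper names `cluster_accuracy` and `pick_k_by_silhouette`, the candidate range 2 to 10, and the synthetic data standing in for `train_features` / `train_labels` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, silhouette_score


def cluster_accuracy(true_labels, cluster_ids):
    """Relabel each cluster with its most frequent true label, then score it."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    predicted = np.empty_like(true_labels)
    for cluster in np.unique(cluster_ids):
        members = cluster_ids == cluster
        values, counts = np.unique(true_labels[members], return_counts=True)
        predicted[members] = values[np.argmax(counts)]
    return accuracy_score(true_labels, predicted)


def pick_k_by_silhouette(features, candidate_ks):
    """Return the candidate K with the highest mean silhouette score."""
    scores = {}
    for n_clusters in candidate_ks:
        ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(features)
        scores[n_clusters] = silhouette_score(features, ids)
    return max(scores, key=scores.get), scores


# Hypothetical stand-in data; replace with train_features / train_labels.
demo_features, demo_labels = make_blobs(
    n_samples=200, centers=4, n_features=16, random_state=42
)

k_auto, silhouette_by_k = pick_k_by_silhouette(demo_features, range(2, 11))
for n_clusters in (k_auto, 20):
    ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(demo_features)
    print(f"K={n_clusters}: majority-vote accuracy = {cluster_accuracy(demo_labels, ids):.3f}")
```

Note that majority-vote accuracy can only stay the same or rise as clusters are split further (with one cluster per sample it reaches 100%), so a higher score at K=20 does not by itself mean a better clustering.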

 Task 6: Additional Clustering Algorithm (10%)
 Choose another clustering algorithm and apply it on the features.
 Test accuracy with available labels.

 Insert your code here for Task 6




In [None]:


# Assuming 'train_features' are the features extracted from the images

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_clusters = dbscan.fit_predict(train_features)

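# The silhouette score evaluates clustering quality. It needs at least two distinct
# labels, so guard against DBSCAN putting every point in one cluster or marking all as noise.
if len(set(dbscan_clusters)) > 1:
    silhouette_avg = silhouette_score(train_features, dbscan_clusters)
    print(f"Silhouette Score: {silhouette_avg}")
else:
    print("DBSCAN produced a single label; the silhouette score is undefined.")

# DBSCAN labels outliers as -1, so handle them explicitly in any accuracy assessment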
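To test DBSCAN against the available labels, one option is the same majority-vote idea used for any clustering, with noise points counted as errors. The sketch below is an assumption-laden illustration: the helper `dbscan_accuracy`, the `eps` and `min_samples` values, the feature scaling, and the synthetic data are hypothetical and would need tuning for the real features.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def dbscan_accuracy(true_labels, cluster_ids):
    """Majority-vote accuracy where noise points (cluster id -1) count as errors."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for cluster in np.unique(cluster_ids):
        if cluster == -1:
            continue  # noise is never credited with a correct label
        _, counts = np.unique(true_labels[cluster_ids == cluster], return_counts=True)
        correct += counts.max()
    return correct / len(true_labels)


# Hypothetical stand-in data; replace with train_features / train_labels.
demo_features, demo_labels = make_blobs(
    n_samples=300, centers=3, n_features=2, cluster_std=0.5, random_state=42
)
scaled = StandardScaler().fit_transform(demo_features)

ids = DBSCAN(eps=0.3, min_samples=5).fit_predict(scaled)
noise_fraction = np.mean(ids == -1)
print(f"Clusters found: {len(set(ids) - {-1})}, noise fraction: {noise_fraction:.3f}")
print(f"Majority-vote accuracy (noise as errors): {dbscan_accuracy(demo_labels, ids):.3f}")
```

Reporting the noise fraction alongside the accuracy helps show whether a low score comes from wrong clusters or from too many points being discarded as outliers.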


 Task 7: PCA for Classification Improvement (20%)
 Apply PCA on the features and then feed them to the best classification method in the above tasks.
 Assess if PCA improves outcomes and discuss the results.

 Insert your code here for Task 7
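
A possible starting point for this task is to place PCA inside a pipeline so that it is fitted only on the training folds, and then compare cross-validated accuracy with and without it. This is a sketch, not a reference solution: the linear SVM as the "best" classifier, the 95% explained-variance threshold, and the synthetic data standing in for `train_features` / `train_labels` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in data; replace with train_features / train_labels.
demo_features, demo_labels = make_classification(
    n_samples=300, n_features=64, n_informative=10, n_classes=3, random_state=42
)

# Baseline: the classifier on the raw features.
baseline = SVC(kernel="linear")

# With PCA: keep enough components to explain 95% of the variance (assumed threshold).
with_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    SVC(kernel="linear"),
)

baseline_scores = cross_val_score(baseline, demo_features, demo_labels, cv=5)
pca_scores = cross_val_score(with_pca, demo_features, demo_labels, cv=5)

print(f"Without PCA: mean accuracy = {np.mean(baseline_scores):.3f}")
print(f"With PCA:    mean accuracy = {np.mean(pca_scores):.3f}")
```

Fitting PCA inside the pipeline avoids leaking information from the held-out fold into the projection, which keeps the comparison fair.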




 Task 8: Visualization and Analysis (10%)
 Plot the features in a lower dimension using dimensionality reduction techniques.
 Analyze the visual representation, identifying patterns or insights.

Insert your code here for Task 8
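
As a rough illustration of the plotting step, the sketch below projects features to two dimensions with PCA and colors each point by its class. PCA is only one possible choice of technique here, and the synthetic `demo_features` / `demo_labels` are hypothetical stand-ins for `train_features` / `train_labels`.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Hypothetical stand-in data; replace with train_features / train_labels.
demo_features, demo_labels = make_blobs(
    n_samples=200, centers=5, n_features=32, random_state=42
)

# Project the features onto the first two principal components.
pca = PCA(n_components=2)
projected = pca.fit_transform(demo_features)

# Scatter plot colored by class label, with one legend entry per class.
fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(projected[:, 0], projected[:, 1], c=demo_labels, cmap="tab10", s=12)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("Features projected to 2D with PCA")
ax.legend(*scatter.legend_elements(), title="Class")

print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.3f}")
```

In a notebook the figure is displayed inline when the cell finishes. Classes that form tight, separate groups in the plot are likely easy to classify, while overlapping groups point to classes the features do not separate well, at least in the first two components.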