## CS203 Lab 1

### Team Number: 18
* Name: Paras Prashant Shirvale
* Roll No: 23110232 
---
* Name: Akshat Shah
* Roll No: 23110293
---

### Task 1: Setup the Dataset

1. Load the MNIST dataset using the Hugging Face datasets library.

In [133]:
import numpy as np
from datasets import load_dataset

In [134]:
# Load the MNIST dataset
mnist = load_dataset("mnist")

# Check the structure of the dataset
print(mnist)

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 60000
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 10000
    })
})


2. Convert the image data into NumPy arrays and normalize pixel values to the range [0, 1].

In [135]:
# Get the train and test datasets
train_data = mnist['train']
test_data = mnist['test']

In [136]:
# Convert the images to NumPy arrays (normalization happens in the next cell)
train_images = np.array(train_data['image'])
test_images = np.array(test_data['image'])

In [137]:

# Normalize the pixel values to the range [0, 1]
train_images = train_images.astype(np.float32) / 255.0
test_images = test_images.astype(np.float32) / 255.0

In [138]:
# Check the shapes and the max/min pixel values of the first image to verify normalization
print(f"Train images shape: {train_images.shape}")
print(f"Test images shape: {test_images.shape}")
print(train_images[0].max(), train_images[0].min())

Train images shape: (60000, 28, 28)
Test images shape: (10000, 28, 28)
1.0 0.0


3. Flatten each image into a vector of 784 features.

In [139]:
# Flatten each image into a vector of 784 features (28x28 = 784)
train_images_flattened = train_images.reshape(train_images.shape[0], -1)  # shape: (num_samples, 784)
test_images_flattened = test_images.reshape(test_images.shape[0], -1)  # shape: (num_samples, 784)

In [140]:
# Check the shape of the flattened data
print(f"Train images shape after flattening: {train_images_flattened.shape}")
print(f"Test images shape after flattening: {test_images_flattened.shape}")

Train images shape after flattening: (60000, 784)
Test images shape after flattening: (10000, 784)


4. Split the dataset into training and testing sets.

In [141]:
# Get the labels (digits)
train_labels = np.array(train_data['label'])
test_labels = np.array(test_data['label'])

In [142]:
from sklearn.model_selection import train_test_split

# Split the training data into training and validation sets (80% training, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(
    train_images_flattened, train_labels, test_size=0.2, random_state=42
)

In [143]:
# Check the shape of the new splits
print(f"Training data shape: {X_train.shape}")
print(f"Validation data shape: {X_val.shape}")
print(f"Test data shape: {test_images_flattened.shape}")

Training data shape: (48000, 784)
Validation data shape: (12000, 784)
Test data shape: (10000, 784)


5. Randomly select an initial labeled dataset of 200 samples from the training samples.

In [144]:
# Randomly select 200 samples from the training set
num_samples = 200
random_indices = np.random.choice(train_images_flattened.shape[0], num_samples, replace=False)

In [145]:
# Create the initially labeled dataset
initial_images = train_images_flattened[random_indices]
initial_labels = train_labels[random_indices]


In [146]:
# Check the shape of the selected dataset
print(f"Initial labeled dataset shape (images): {initial_images.shape}")
print(f"Initial labeled dataset shape (labels): {initial_labels.shape}")

Initial labeled dataset shape (images): (200, 784)
Initial labeled dataset shape (labels): (200,)


6. Generate an "Unlabeled Pool": the training samples excluding the 200 initially labeled samples.

In [147]:
# Create the Unlabeled Pool by excluding the 200 selected samples
unlabeled_images = np.delete(train_images_flattened, random_indices, axis=0)
unlabeled_labels = np.delete(train_labels, random_indices, axis=0)

In [148]:
# Check the shape of the Unlabeled Pool
print(f"Unlabeled Pool shape (images): {unlabeled_images.shape}")
print(f"Unlabeled Pool shape (labels): {unlabeled_labels.shape}")

Unlabeled Pool shape (images): (59800, 784)
Unlabeled Pool shape (labels): (59800,)
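
The selection above draws from NumPy's global random state, so the labeled set changes on every run. A minimal sketch of a repeatable, disjoint split is shown below; it assumes a seeded `np.random.default_rng` generator, and the helper name `split_labeled_pool` and the toy arrays are illustrative, not part of the original notebook.

```python
import numpy as np

def split_labeled_pool(features, labels, num_labeled, seed=0):
    """Return (labeled_X, labeled_y, pool_X, pool_y) with disjoint rows."""
    rng = np.random.default_rng(seed)  # seeded generator for repeatable draws
    labeled_idx = rng.choice(features.shape[0], size=num_labeled, replace=False)
    pool_mask = np.ones(features.shape[0], dtype=bool)
    pool_mask[labeled_idx] = False  # everything not drawn stays in the pool
    return (features[labeled_idx], labels[labeled_idx],
            features[pool_mask], labels[pool_mask])

# Toy stand-in for the flattened MNIST arrays (10 samples, 4 features each)
toy_X = np.arange(40, dtype=np.float32).reshape(10, 4)
toy_y = np.arange(10)
lab_X, lab_y, pool_X, pool_y = split_labeled_pool(toy_X, toy_y, num_labeled=3, seed=42)
print(lab_X.shape, pool_X.shape)  # (3, 4) (7, 4)
```

With the real data, the same call would presumably take `train_images_flattened`, `train_labels`, and `num_labeled=200`.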


### Task 2: Implement Random Sampling for Active Learning

1. Train a Random Forest Classifier (you can use `from sklearn.ensemble import RandomForestClassifier`) on the initial dataset of 200 samples.

In [149]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [150]:
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

In [151]:
initial_images.shape, initial_labels.shape

((200, 784), (200,))

In [152]:
# Train the model on the initial labeled dataset (200 samples)
rf_classifier.fit(initial_images, initial_labels)

In [153]:
# Evaluate the model on the test set (10,000 samples)
predicted_labels = rf_classifier.predict(test_images_flattened)

# Compare the predictions against the true test labels
accuracy = accuracy_score(test_labels, predicted_labels)

print(f"Accuracy on Test Pool: {accuracy * 100:.2f}%")

Accuracy on Test Pool: 79.85%


In [154]:
# Evaluate the model on the Unlabeled Pool (remaining 59,800 samples)
predicted_labels = rf_classifier.predict(unlabeled_images)

# The pool's true labels are used here only to measure accuracy
accuracy = accuracy_score(unlabeled_labels, predicted_labels)

print(f"Accuracy on Unlabeled Pool: {accuracy * 100:.2f}%")

Accuracy on Unlabeled Pool: 78.35%


2. Implement an active learning loop for 20 iterations:
* Randomly select a sample from the unlabeled pool.
* Get the selected sample and its true label.
* Add the sample and label to the labeled dataset.
* Remove the selected sample and label from the pool.
* Retrain the model on the updated dataset.
* Check the model's accuracy on the test set.
* Print accuracy after every iteration.


In [155]:
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
random_sampling_accuracies = []

# Active learning loop for 20 iterations
num_iterations = 20
for iteration in range(num_iterations):
    # Step 1: Randomly select a sample from the unlabeled pool
    random_index = np.random.choice(unlabeled_images.shape[0], 1, replace=False)
    selected_image = unlabeled_images[random_index]
    selected_label = unlabeled_labels[random_index]

    # Step 2: Add the selected sample and its label to the labeled dataset
    initial_images = np.vstack([initial_images, selected_image])
    initial_labels = np.append(initial_labels, selected_label)

    # Step 3: Remove the selected sample and label from the unlabeled pool
    unlabeled_images = np.delete(unlabeled_images, random_index, axis=0)
    unlabeled_labels = np.delete(unlabeled_labels, random_index, axis=0)

    # Step 4: Retrain the model on the updated dataset
    rf_classifier.fit(initial_images, initial_labels)

    # Step 5: Evaluate the model on the test set
    predicted_labels_test = rf_classifier.predict(test_images_flattened)
    accuracy_test = accuracy_score(test_labels, predicted_labels_test)
    random_sampling_accuracies.append(accuracy_test)

    # Step 6: Print accuracy after every iteration
    print(f"Iteration {iteration+1}/{num_iterations} - Test Accuracy: {accuracy_test * 100:.2f}%")


Iteration 1/20 - Test Accuracy: 79.25%
Iteration 2/20 - Test Accuracy: 78.29%
Iteration 3/20 - Test Accuracy: 78.90%
Iteration 4/20 - Test Accuracy: 79.10%
Iteration 5/20 - Test Accuracy: 78.04%
Iteration 6/20 - Test Accuracy: 77.95%
Iteration 7/20 - Test Accuracy: 78.67%
Iteration 8/20 - Test Accuracy: 79.07%
Iteration 9/20 - Test Accuracy: 78.89%
Iteration 10/20 - Test Accuracy: 79.36%
Iteration 11/20 - Test Accuracy: 78.95%
Iteration 12/20 - Test Accuracy: 79.28%
Iteration 13/20 - Test Accuracy: 78.90%
Iteration 14/20 - Test Accuracy: 79.28%
Iteration 15/20 - Test Accuracy: 78.13%
Iteration 16/20 - Test Accuracy: 79.11%
Iteration 17/20 - Test Accuracy: 79.02%
Iteration 18/20 - Test Accuracy: 79.70%
Iteration 19/20 - Test Accuracy: 79.15%
Iteration 20/20 - Test Accuracy: 79.67%
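
The bookkeeping in the loop above (append the queried sample to the labeled set, then delete it from the pool) is the same for any query strategy, so it can be factored into one helper. The sketch below is only an illustration under that assumption; `query_sample` and the toy arrays are hypothetical names that do not appear in the notebook.

```python
import numpy as np

def query_sample(labeled_X, labeled_y, pool_X, pool_y, index):
    """Move pool row `index` into the labeled set and return the four updated arrays."""
    labeled_X = np.vstack([labeled_X, pool_X[index:index + 1]])
    labeled_y = np.append(labeled_y, pool_y[index])
    pool_X = np.delete(pool_X, index, axis=0)
    pool_y = np.delete(pool_y, index, axis=0)
    return labeled_X, labeled_y, pool_X, pool_y

rng = np.random.default_rng(0)
labeled_X, labeled_y = np.zeros((2, 3)), np.array([0, 1])
pool_X, pool_y = np.ones((5, 3)), np.array([2, 3, 4, 5, 6])

# Random sampling: pick any pool index uniformly at random
index = int(rng.integers(pool_X.shape[0]))
labeled_X, labeled_y, pool_X, pool_y = query_sample(labeled_X, labeled_y, pool_X, pool_y, index)
print(labeled_X.shape, pool_X.shape)  # (3, 3) (4, 3)
```

Only the line that chooses `index` would need to change to switch from random sampling to another strategy.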


### Task 3: Implement Uncertainty Sampling for Active Learning

1. Train a Random Forest Classifier (you can use `from sklearn.ensemble import RandomForestClassifier`) on the initial dataset of 200 samples.

In [156]:
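# Randomly select a fresh set of 200 samples from the training set
num_samples = 200
random_indices = np.random.choice(train_images_flattened.shape[0], num_samples, replace=False)

# Create the initially labeled dataset
initial_images = train_images_flattened[random_indices]
initial_labels = train_labels[random_indices]

# Rebuild the unlabeled pool so that it excludes the newly selected samples
unlabeled_images = np.delete(train_images_flattened, random_indices, axis=0)
unlabeled_labels = np.delete(train_labels, random_indices, axis=0)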

In [157]:
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the initial labeled dataset (200 samples)
rf_classifier.fit(initial_images, initial_labels)

# Predict the labels on the test dataset
predicted_labels_test = rf_classifier.predict(test_images_flattened)

# Calculate the accuracy on the test dataset
accuracy_test = accuracy_score(test_labels, predicted_labels_test)

# Print the accuracy
print(f"Accuracy on Test Set after training on 200 samples: {accuracy_test * 100:.2f}%")

Accuracy on Test Set after training on 200 samples: 80.38%


2. Implement an active learning loop for 20 iterations:
* Compute the uncertainty (label entropy) of each sample in the unlabeled pool.
* Select the sample with the highest uncertainty and query its true label.
* Add the queried sample to the labeled dataset and remove it from the unlabeled pool.
* Retrain the model and check the model's accuracy on the test set.
* Print accuracy after every iteration.


In [158]:
# Shannon entropy (in bits) of the empirical distribution of the values in an array
def label_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / len(labels)
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
uncertainty_sampling_accuracies = []

# Active learning loop for 20 iterations
num_iterations = 20
for iteration in range(num_iterations):
    
    # Step 1: Compute the entropy of each sample in the unlabeled pool (taken over its pixel values)
    label_entropies = np.array([label_entropy(image) for image in unlabeled_images])
    
    # Step 2: Select the sample with the highest entropy (highest uncertainty)
    uncertain_sample_index = np.argmax(label_entropies) 
    selected_image = unlabeled_images[uncertain_sample_index]
    selected_label = unlabeled_labels[uncertain_sample_index]
    
    # Step 3: Add the selected sample and label to the labeled dataset
    initial_images = np.vstack([initial_images, selected_image])
    initial_labels = np.append(initial_labels, selected_label)
    
    # Step 4: Remove the selected sample and label from the unlabeled pool
    unlabeled_images = np.delete(unlabeled_images, uncertain_sample_index, axis=0)
    unlabeled_labels = np.delete(unlabeled_labels, uncertain_sample_index, axis=0)
    
    # Step 5: Retrain the model on the updated labeled dataset
    rf_classifier.fit(initial_images, initial_labels)
    
    # Step 6: Evaluate the model on the test set
    predicted_labels_test = rf_classifier.predict(test_images_flattened)
    accuracy_test = accuracy_score(test_labels, predicted_labels_test)
    uncertainty_sampling_accuracies.append(accuracy_test)
    
    # Step 7: Print the accuracy after every iteration
    print(f"Test Accuracy after iteration {iteration + 1}: {accuracy_test * 100:.2f}%")


Test Accuracy after iteration 1: 80.23%
Test Accuracy after iteration 2: 80.19%
Test Accuracy after iteration 3: 80.75%
Test Accuracy after iteration 4: 81.03%
Test Accuracy after iteration 5: 80.65%
Test Accuracy after iteration 6: 81.11%
Test Accuracy after iteration 7: 80.96%
Test Accuracy after iteration 8: 80.84%
Test Accuracy after iteration 9: 80.98%
Test Accuracy after iteration 10: 80.51%
Test Accuracy after iteration 11: 80.82%
Test Accuracy after iteration 12: 80.72%
Test Accuracy after iteration 13: 81.10%
Test Accuracy after iteration 14: 80.82%
Test Accuracy after iteration 15: 80.51%
Test Accuracy after iteration 16: 80.80%
Test Accuracy after iteration 17: 80.75%
Test Accuracy after iteration 18: 80.67%
Test Accuracy after iteration 19: 80.79%
Test Accuracy after iteration 20: 80.98%
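
Uncertainty sampling is commonly driven by the entropy of the model's predicted class distribution for each pool sample, rather than by a statistic of the input pixels. A minimal sketch of that scoring step follows. It assumes the probability matrix would come from a fitted forest, for example `rf_classifier.predict_proba(unlabeled_images)`; the function name `predictive_entropy` and the toy probabilities are illustrative only.

```python
import numpy as np

def predictive_entropy(class_probabilities):
    """Shannon entropy (bits) of each row of an (n_samples, n_classes) probability matrix."""
    clipped = np.clip(class_probabilities, 1e-12, 1.0)  # avoid log2(0)
    return -np.sum(class_probabilities * np.log2(clipped), axis=1)

# Hypothetical predicted probabilities for three pool samples over three classes
toy_probabilities = np.array([
    [1.0, 0.0, 0.0],      # confident prediction
    [0.5, 0.5, 0.0],      # torn between two classes
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
])
entropies = predictive_entropy(toy_probabilities)
query_index = int(np.argmax(entropies))
print(query_index)  # 2
```

Because this score depends on the current model, the classifier would need to be fitted on the labeled set before the first query and the probabilities recomputed after every retraining step.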


### Task 4: Implement Query-by-Committee for Active Learning

1. Initialize a committee of 5 Random Forest models, each trained on the initial dataset.


In [159]:
# Number of models in the committee
num_models = 5

# Initialize an empty list to hold the models (the committee)
committee = []

# Initialize and train 5 Random Forest models on the initial dataset
for _ in range(num_models):
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_classifier.fit(initial_images, initial_labels)
    committee.append(rf_classifier)  # Add the trained model to the committee

# Print out the number of models in the committee
print(f"Committee of {len(committee)} Random Forest models initialized and trained.")


Committee of 5 Random Forest models initialized and trained.


2. In the active learning loop for 20 iterations:
* For each unlabeled sample, compute predictions from all committee members.
* Measure disagreement by calculating Vote Entropy.
* Select the sample with the highest disagreement and query its true label.
* Update the labeled dataset and retrain all models.
* Check the model's accuracy on the test set.
* Print accuracy after every iteration.


In [160]:
# Function to compute Vote Entropy for a given set of predictions
def vote_entropy(predictions):
    total_entropy = 0
    for sample_predictions in predictions:
        unique_votes, counts = np.unique(sample_predictions, return_counts=True)
        probabilities = counts / len(sample_predictions)
        sample_entropy = -np.sum(probabilities * np.log2(probabilities))
        total_entropy += sample_entropy
    return total_entropy / len(predictions)
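
Vote entropy is only informative when committee members can disagree. Forests built with the same `random_state` on the same data are expected to produce identical votes, which makes every entropy zero, so a committee would normally give each member a different seed or a different resample of the labeled set. The sketch below is one possible vectorized way to score all pool samples at once; it assumes integer class labels in `0..num_classes-1`, and `vote_entropy_per_sample` and `toy_votes` are hypothetical names.

```python
import numpy as np

def vote_entropy_per_sample(votes, num_classes):
    """Vote entropy (bits) for each row of an (n_samples, n_members) array of integer labels."""
    n_samples, n_members = votes.shape
    counts = np.zeros((n_samples, num_classes))
    # Tally how many members voted for each class, row by row
    np.add.at(counts, (np.arange(n_samples)[:, None], votes), 1)
    fractions = counts / n_members
    safe = np.where(fractions > 0, fractions, 1.0)  # log2(1) = 0 for classes with no votes
    return -np.sum(fractions * np.log2(safe), axis=1)

toy_votes = np.array([
    [3, 3, 3, 3, 3],  # unanimous committee
    [3, 3, 3, 8, 8],  # 3-2 split
    [1, 2, 3, 4, 5],  # every member disagrees
])
entropies = vote_entropy_per_sample(toy_votes, num_classes=10)
print(int(np.argmax(entropies)))  # 2
```

In the notebook's terms, `votes` would correspond to the `(num_samples, num_models)` array of committee predictions on the unlabeled pool.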

In [161]:
# Active learning loop for 20 iterations
num_iterations = 20
committee_sampling_accuracies = []

for iteration in range(num_iterations):
    
    # Step 1: Compute predictions from all committee members for each sample in the unlabeled pool
    committee_predictions = np.array([model.predict(unlabeled_images) for model in committee]).T  # Shape: (num_samples, num_models)
    
    # Step 2: Calculate Vote Entropy for each sample in the unlabeled pool
    vote_entropies = np.array([vote_entropy(committee_predictions[i:i+1]) for i in range(committee_predictions.shape[0])])
    
    # Step 3: Select the sample with the highest Vote Entropy (highest disagreement)
    uncertain_sample_index = np.argmax(vote_entropies)
    selected_image = unlabeled_images[uncertain_sample_index]
    selected_label = unlabeled_labels[uncertain_sample_index]
    
    # Step 4: Add the selected sample and its true label to the labeled dataset
    initial_images = np.vstack([initial_images, selected_image])
    initial_labels = np.append(initial_labels, selected_label)
    
    # Step 5: Remove the selected sample and label from the unlabeled pool
    unlabeled_images = np.delete(unlabeled_images, uncertain_sample_index, axis=0)
    unlabeled_labels = np.delete(unlabeled_labels, uncertain_sample_index, axis=0)
    
    # Step 6: Retrain all models in the committee on the updated labeled dataset
    for model in committee:
        model.fit(initial_images, initial_labels)
    
    # Step 7: Evaluate the committee on the test set using a majority vote over its members
    test_predictions = np.array([model.predict(test_images_flattened) for model in committee]).T
    majority_vote = np.array([np.bincount(votes).argmax() for votes in test_predictions])
    test_accuracy = accuracy_score(test_labels, majority_vote)
    committee_sampling_accuracies.append(test_accuracy)
    
    # Step 8: Print accuracy after every iteration
    print(f"Test Accuracy after iteration {iteration + 1}: {test_accuracy * 100:.2f}%")

Test Accuracy after iteration 1: 80.72%
Test Accuracy after iteration 2: 80.58%
Test Accuracy after iteration 3: 80.86%
Test Accuracy after iteration 4: 80.43%
Test Accuracy after iteration 5: 80.36%
Test Accuracy after iteration 6: 80.19%
Test Accuracy after iteration 7: 80.16%
Test Accuracy after iteration 8: 80.59%
Test Accuracy after iteration 9: 80.73%
Test Accuracy after iteration 10: 79.88%
Test Accuracy after iteration 11: 79.88%
Test Accuracy after iteration 12: 79.96%
Test Accuracy after iteration 13: 80.68%
Test Accuracy after iteration 14: 80.66%
Test Accuracy after iteration 15: 80.78%
Test Accuracy after iteration 16: 80.60%
Test Accuracy after iteration 17: 80.21%
Test Accuracy after iteration 18: 80.36%
Test Accuracy after iteration 19: 80.19%
Test Accuracy after iteration 20: 80.75%


### Task 5: Evaluation & Report

* Compare the final model accuracy across all three strategies.
* Plot the accuracies of all three methods over the 20 iterations.
* Analyze which method leads to the most cost-effective improvement in accuracy.
* Discuss findings and limitations in a brief report.
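
As a starting point for the comparison, each accuracy curve can be reduced to a few summary numbers. The sketch below is a rough illustration: `summarize` is a hypothetical helper, and the short curves hold only the first and last test accuracies printed by the three loops above. In the notebook, the full lists `random_sampling_accuracies`, `uncertainty_sampling_accuracies`, and `committee_sampling_accuracies` would be passed instead.

```python
def summarize(accuracies):
    """Return start, final, best, and net change of an accuracy curve (fractions in [0, 1])."""
    return {
        "start": accuracies[0],
        "final": accuracies[-1],
        "best": max(accuracies),
        "net_change": accuracies[-1] - accuracies[0],
    }

# First and last recorded test accuracies (iterations 1 and 20) for each strategy
curves = {
    "random": [0.7925, 0.7967],
    "uncertainty": [0.8023, 0.8098],
    "committee": [0.8072, 0.8075],
}
for name, curve in curves.items():
    stats = summarize(curve)
    print(f"{name}: final={stats['final']:.4f}, net change={stats['net_change'] * 100:+.2f} pp")
```

Since each strategy here started from a different random labeled set and a single run, differences of this size may reflect run-to-run variation as much as the query strategy itself.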

In [1]:
import matplotlib.pyplot as plt