# Assignment 4

### <span style="color:chocolate"> Submission requirements </span>

Your work will not be graded if your notebook doesn't include output. In other words, <span style="color:red"> make sure to rerun your notebook before submitting to Gradescope </span> (Note: if you are using Google Colab: go to Edit > Notebook Settings  and uncheck Omit code cell output when saving this notebook, otherwise the output is not printed).

Additional points may be deducted if these requirements are not met:
    
* Comment your code;
* Each graph should have a title, labels for each axis, and (if needed) a legend. Each graph should be understandable on its own;
* Try and minimize the use of the global namespace (meaning, keep things inside functions).
---

### Import libraries

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns  # for nicer plots
sns.set(style="darkgrid")  # default style

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import tensorflow as tf
from tensorflow import keras
from keras import metrics
from keras.datasets import fashion_mnist

tf.get_logger().setLevel('INFO')

---
### Step 1: Data ingestion

You'll train a binary classifier using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset. This consists of 70,000 grayscale images (28x28). Each image is associated with 1 of 10 classes. The dataset was split by the creators; there are 60,000 training images and 10,000 test images. Note also that TensorFlow includes a growing [library of datasets](https://www.tensorflow.org/datasets/catalog/overview) and makes it easy to load them into NumPy arrays.

In [None]:
# Load the Fashion MNIST dataset.
(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()

---
### Step 2: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) and Data Preprocessing are often iterative processes that involve going back and forth to refine and improve the quality of data analysis and preparation. However, the specific order can vary depending on the project's requirements. In some cases, starting with EDA, as you see in this assignment, could be more useful, but there is no rigid rule dictating the sequence in all situations.

### <span style="color:chocolate">Exercise 1:</span> Getting to know your data (5 points)

Complete the following tasks:

1. Print the shapes and types of (X_train, Y_train) and (X_test, Y_test). Interpret the shapes (i.e., what do the numbers represent?). Hint: For types use the <span style="color:chocolate">type()</span> function.
2. Define a list of strings of class names corresponding to each class in (Y_train, Y_test). Call this list label_names. Hint: Refer to the Fashion MNIST documentation.

In [None]:
# YOUR CODE HERE
print(f"X_train shape: {X_train.shape}")
print(f"X_train type: {type(X_train)}\n")

print(f"Y_train shape: {Y_train.shape}")
print(f"Y_train type: {type(Y_train)}\n")

print(f"X_test shape: {X_test.shape}")
print(f"X_test type: {type(X_test)}\n")

print(f"Y_test shape: {Y_test.shape}")
print(f"Y_test type: {type(Y_test)}\n")

#define list of strings
label_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

X_train is a NumPy array with 60,000 samples. Each sample is a 28x28 matrix that holds one value per image pixel.

Y_train is a NumPy array with 60,000 target values.

X_test is a NumPy array with 10,000 samples. Each sample is a 28x28 matrix that holds one value per image pixel.

Y_test is a NumPy array with 10,000 target values.

### <span style="color:chocolate">Exercise 2:</span> Getting to know your data - cont'd (5 points)

Fashion MNIST images have one of 10 possible labels (shown above).

Complete the following tasks:

1. Display the first 5 images in X_train for each class in Y_train, arranged in a 10x5 grid. Use the label_names list defined above;
2. Determine the minimum and maximum pixel values for images in the X_train dataset.
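
As a rough illustration of the two tasks, the sketch below collects the first few indices of each class with `np.flatnonzero` and reads off a pixel range. It runs on tiny synthetic arrays; `first_k_indices`, `toy_labels`, and `toy_images` are hypothetical names used only for this example, not part of the assignment.

```python
import numpy as np

def first_k_indices(labels, num_classes, k=5):
    """Return a dict mapping each class id to the first k indices where it occurs."""
    return {c: np.flatnonzero(labels == c)[:k] for c in range(num_classes)}

# Synthetic stand-in for Y_train: 3 classes repeated in a fixed pattern
toy_labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
toy_indices = first_k_indices(toy_labels, num_classes=3, k=2)
print(toy_indices)

# Pixel range of a synthetic uint8 "image" array
toy_images = np.array([[0, 128], [255, 64]], dtype=np.uint8)
print(toy_images.min(), toy_images.max())
```

The same pattern should carry over to the real data by passing `Y_train` with `num_classes=10` and `k=5`.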

In [None]:
# Using list comprehension to make a dictionary of empty lists corresponding to each clothing label
indice_dict = {label: [] for label in label_names}

for i, label in enumerate(Y_train):
  # Get number that corresponds to label, then pass it into our label names list to "translate" it into words then set it equal to our variable
  article_of_clothing = label_names[label]

  # Only record the index while this class has fewer than 5 stored examples
  if len(indice_dict[article_of_clothing]) < 5:

    # add the index to the article of clothing's list
    indice_dict[article_of_clothing].append(i)

#making a plot that has 10 rows (1 for each label) 5 columns (1 for each index)
fig, axes = plt.subplots(nrows=10, ncols=5, figsize=(15, 30))

#plotting
#loop through label_names list and store the name of the clothing in class_name
for i, class_name in enumerate(label_names):
  #loop through indice_dict[class_name] list and store the index of the type of clothing in index
  for j, index in enumerate(indice_dict[class_name]):
    #set ax to the current grid we are on
    ax = axes[i, j]
    #in the ax display the image that is stored in X_train by using the index we stored
    ax.imshow(X_train[index])
    #turn off the gridlines and numbers to make it look neater
    ax.axis('off')
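    # put a title on each plot
    ax.set_title(f"{class_name} {j + 1}")

plt.tight_layout()
plt.show()

# Task 2: report the minimum and maximum pixel values in X_train
print(f"Minimum pixel value in X_train: {X_train.min()}")
print(f"Maximum pixel value in X_train: {X_train.max()}")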

---
### Step 3: Data preprocessing

This step is essential for preparing this image data in a format that is suitable for ML algorithms.

### <span style="color:chocolate">Exercise 3:</span> Feature preprocessing (5 points)

In the previous lab, the input data had just a few features. Here, we treat **every pixel value as a separate feature**, so each input example has 28x28 (784) features!

In this exercise, you'll perform the following tasks:

1. Normalize the pixel values in both X_train and X_test data so they range between 0 and 1;
2. For each image in X_train and X_test, flatten the 2-D 28x28 pixel array to a 1-D array of size 784. Hint: use the <span style="color:chocolate">reshape()</span> method available in NumPy. Note that by doing so you will overwrite the original arrays;
3. Print the shape of X_train and X_test arrays.
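
The normalize-then-flatten steps can be sketched on a small synthetic array. Here `toy_images` is a made-up stand-in for `X_train`, and the sketch assumes pixel values are stored as `uint8` in the range 0 to 255.

```python
import numpy as np

# Synthetic stand-in for X_train: 4 grayscale images of 28x28 uint8 pixels
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Dividing by 255 maps the uint8 range [0, 255] onto floats in [0, 1]
toy_norm = toy_images / 255.0

# Keep the first axis (one row per image); -1 lets NumPy infer 28 * 28 = 784
toy_flat = toy_norm.reshape(toy_norm.shape[0], -1)
print(toy_flat.shape)  # (4, 784)
```

Using `-1` for the second dimension avoids hard-coding 784 and keeps the same line valid for both the training and the test arrays.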

In [None]:
# To normalize both X_train and X_test so that they range between 0 and 1
X_train_norm = X_train / 255.0
X_test_norm = X_test / 255.0

# Use NumPy's reshape to flatten each image: keep the first dimension (one row per image) and let -1 infer the remaining 28*28 = 784 columns
X_train_norm_flat = X_train_norm.reshape(X_train.shape[0], -1)
X_test_norm_flat = X_test_norm.reshape(X_test.shape[0], -1)

# Print the shapes of the flattened arrays
print(f"Shape of X_train_norm_flat: {X_train_norm_flat.shape}")
print(f"Shape of X_test_norm_flat: {X_test_norm_flat.shape}")

### <span style="color:chocolate">Exercise 4:</span> Label preprocessing (5 points)

This assignment involves binary classification. Specifically, the objective is to predict whether an image belongs to the sneaker class (class 7) or not.

Therefore, write code so that for each example in (Y_train, Y_test), the outcome variable is represented as follows:
* $y=1$, for sneaker class (positive examples), and
* $y=0$, for non-sneaker class (negative examples).

Note: To avoid "ValueError: assignment destination is read-only", first create a copy of the (Y_train, Y_test) data and call the resulting arrays (Y_train, Y_test). Then overwrite the (Y_train, Y_test) arrays to create binary outcomes.
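
A minimal sketch of the copy-then-overwrite pattern follows, using a synthetic label array that is deliberately marked read-only (`toy_labels` and `toy_binary` are hypothetical names). One detail worth noting: the sneaker mask is computed before anything is overwritten, because class 1 (Trouser) already exists in the original labels and would otherwise be confused with the new positive label.

```python
import numpy as np

# Synthetic stand-in for Y_train, marked read-only to mimic the error in the note
toy_labels = np.array([7, 0, 3, 7, 9], dtype=np.uint8)
toy_labels.setflags(write=False)

# Copy first; the copy is writable even though the source is not
toy_binary = toy_labels.copy()

# Compute the mask before overwriting so it reflects the original class ids
is_sneaker = toy_binary == 7
toy_binary[is_sneaker] = 1
toy_binary[~is_sneaker] = 0
print(toy_binary)  # [1 0 0 1 0]
```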

In [None]:
# Make copies of the original labels so the originals are never modified.
Y_train_copy = np.copy(Y_train)
Y_test_copy = np.copy(Y_test)

# Build the binary outcome: 1 for sneaker (class 7), 0 for every other class.
Y_train_binary = (Y_train_copy == 7).astype(int)
Y_test_binary = (Y_test_copy == 7).astype(int)

# Printing first 10 values to check work
print(f"Y_train_binary: {Y_train_binary[:10]}")
print(f"Y_test_binary: {Y_test_binary[:10]}")

### <span style="color:chocolate">Exercise 5:</span> Data splits (10 points)

Using the <span style="color:chocolate">train_test_split()</span> method available in scikit-learn:
1. Retain 20% from the training data for validation purposes. Set random state to 1234. Name the resulting dataframes as follows: X_train_mini, X_val, Y_train_mini, Y_val.
2. Print the shape of each array.

In [None]:
# Split the normalized, flattened features and the binary labels, holding out 20% for validation
X_train_mini, X_val, Y_train_mini, Y_val = train_test_split(X_train_norm_flat, Y_train_binary, test_size = 0.2, random_state = 1234)

# Print the shapes of the resulting splits
print(f"Shape of X_train_mini: {X_train_mini.shape}")
print(f"Shape of Y_train_mini: {Y_train_mini.shape}")
print(f"Shape of X_val: {X_val.shape}")
print(f"Shape of Y_val: {Y_val.shape}")

### <span style="color:chocolate">Exercise 6:</span> Data shuffling (10 points)

Since you'll be using Batch Gradient Descent (BGD) for training, it is important that **each batch is a random sample of the data** so that the gradient computed is representative.

1. Use [integer array indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html#integer-array-indexing) to re-order (X_train_mini, Y_train_mini) using a list of shuffled indices. In doing so, you will overwrite the arrays.
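
The key point of integer array indexing here is that one shuffled index array is applied to both the features and the labels, so each example stays paired with its label. The sketch below shows this on synthetic data; it uses `np.random.default_rng` rather than `np.random.seed`, which is a substitution made for this example only.

```python
import numpy as np

# Synthetic features and labels where row i of toy_X is [2*i, 2*i + 1] and toy_y[i] == i
toy_X = np.arange(12).reshape(6, 2)
toy_y = np.arange(6)

# One shuffled index array, applied to both arrays, keeps the pairs aligned
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(toy_X))
toy_X_shuf = toy_X[shuffled]
toy_y_shuf = toy_y[shuffled]

print(shuffled)
print(toy_y_shuf)
```

Shuffling the two arrays with separate calls would break the pairing, which is why a shared index array is used.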

In [None]:
np.random.seed(0)

# generate a list of indices from 0 to the length of X_train_mini - 1.
index_list = np.arange(len(X_train_mini))

#shuffle this list of indices
np.random.shuffle(index_list)
X_train_mini = X_train_mini[index_list]
Y_train_mini = Y_train_mini[index_list]



---
### Step 4: Exploratory Data Analysis (EDA) - cont'd

Before delving into model training, let's further explore the raw feature values by comparing sneaker and non-sneaker training images.

### <span style="color:chocolate">Exercise 7:</span> Pixel distributions (10 points)

1. Identify all sneaker images in X_train_mini and calculate the mean pixel value for each sneaker image. Visualize these pixel values using a histogram. Print the mean pixel value across all sneaker images.
2. Identify all non-sneaker images in X_train_mini and calculate the mean pixel value for each non-sneaker image. Visualize these pixel values using a histogram. Print the mean pixel value across all non-sneaker images.
3. Based on the histogram results, assess whether there is any evidence suggesting that pixel values can be utilized to distinguish between sneaker and non-sneaker images. Justify your response.

Notes: Make sure to provide a descriptive title and axis labels for each histogram. Make sure you utilize Y_train_mini to locate the sneaker and non-sneaker class.
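
The per-image means can be sketched with boolean masks and `mean(axis=1)`. The arrays below are small, invented stand-ins for `X_train_mini` and `Y_train_mini`, so the printed numbers say nothing about the real dataset.

```python
import numpy as np

# Synthetic flattened images (rows) with binary labels; values are made up
toy_X = np.array([[0.0, 0.2, 0.4],
                  [1.0, 1.0, 1.0],
                  [0.5, 0.5, 0.5],
                  [0.0, 0.0, 0.3]])
toy_y = np.array([1, 0, 1, 0])

# Boolean masks select the rows of each class; axis=1 averages within each image
pos_means = toy_X[toy_y == 1].mean(axis=1)
neg_means = toy_X[toy_y == 0].mean(axis=1)

print(pos_means, pos_means.mean())
print(neg_means, neg_means.mean())
```

Each `*_means` array holds one value per image, which is what the histograms are meant to display; the final `.mean()` gives the single class-level average.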

In [None]:
# Use boolean indexing to create separate arrays for sneaker and non-sneaker images
sneaker_indices = Y_train_mini == 1
non_sneaker_indices = Y_train_mini == 0

# Flatten the images to ensure each image is a 1D array
X_train_mini_flat = X_train_mini.reshape(X_train_mini.shape[0], -1)

# Extract images
sneaker_images = X_train_mini_flat[sneaker_indices]
non_sneaker_images = X_train_mini_flat[non_sneaker_indices]

# Calculate mean pixel values for each image (ensure the arrays are 2D)
sneaker_images_mean = sneaker_images.mean(axis=1)
non_sneaker_images_mean = non_sneaker_images.mean(axis=1)

#visualize
plt.figure(figsize=(12,6))

plt.subplot(1,2,1)

plt.hist(sneaker_images_mean, bins = 30, alpha = 0.70, label = 'Sneaker')
plt.title("Histogram of Mean Pixel Values - Sneaker")
plt.xlabel('Mean Pixel Value')
plt.ylabel('Frequency')
plt.legend()

plt.subplot(1,2,2)

plt.hist(non_sneaker_images_mean, bins = 30, alpha = 0.70, label = 'Non-Sneaker')
plt.title("Histogram of Mean Pixel Values - Non-Sneaker")
plt.xlabel('Mean Pixel Value')
plt.ylabel('Frequency')
plt.legend()

plt.tight_layout()
plt.show()

mean_sneaker_pixel_value = sneaker_images_mean.mean()
mean_non_sneaker_pixel_value = non_sneaker_images_mean.mean()

print(f'Mean pixel value for sneaker images: {mean_sneaker_pixel_value}')
print(f'Mean pixel value for non-sneaker images: {mean_non_sneaker_pixel_value}')

#Based on the histogram results, assess whether there is any evidence suggesting that pixel values can be utilized to distinguish between sneaker and non-sneaker images. Justify your response.
# No. Based on the results, the mean pixel values of sneaker and non-sneaker images (approximately 72.93 for sneakers and 73.47 for non-sneakers) are too close to call with certainty. The overlapping histograms suggest that there isn't a clear distinction.



---
### Step 4: Modeling

### <span style="color:chocolate">Exercise 8:</span> Baseline model (10 points)

When dealing with classification problems, a simple baseline is to select the *majority* class (the most common label in the training set) and use it as the prediction for all inputs.

With this information in mind:

1. What is the number of sneaker images in Y_train_mini?
2. What is the number of non-sneaker images in Y_train_mini?
3. What is the majority class in Y_train_mini?
4. What is the accuracy of a majority class classifier for Y_train_mini?
5. Implement a function that computes the Log Loss (binary cross-entropy) metric and use it to evaluate this baseline on both the mini train (Y_train_mini) and validation (Y_val) data. Use 0.1 as the predicted probability for your baseline (reflecting what we know about the original distribution of classes in the mini training data). Hint: for additional help, see the file ``04 Logistic Regression with Tensorflow_helpers.ipynb``.
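
For reference, the Log Loss of labels $y_i$ and predicted probabilities $p_i$ is $-\frac{1}{n}\sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$. The sketch below is one possible vectorized form on synthetic labels with 10% positives; `binary_log_loss` and the `eps` clipping constant are assumptions of this example, not names from the helper notebook.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy; p_pred may be a scalar or an array."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so log() never receives exactly 0 or 1
    p = np.clip(np.broadcast_to(p_pred, y_true.shape), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Synthetic labels with 10% positives, mirroring one positive class out of ten
toy_y = np.array([1] + [0] * 9)

# Majority-class accuracy: the share of the most common label
majority_accuracy = max(toy_y.mean(), 1 - toy_y.mean())

toy_loss = binary_log_loss(toy_y, 0.1)
print(majority_accuracy, toy_loss)
```

With 10% positives and a constant prediction of 0.1, the loss reduces to $-(0.1\log 0.1 + 0.9\log 0.9) \approx 0.325$, which gives a sanity check for any implementation.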

In [None]:
# Assuming sneaker_images and non_sneaker_images are correctly defined
sneaker_num = len(sneaker_images)
non_sneaker_num = len(non_sneaker_images)

print(f"Number of sneaker images in Y_train_mini: {sneaker_num}")
print(f"Number of non-sneaker images in Y_train_mini: {non_sneaker_num}")

# Determine the majority class
if sneaker_num > non_sneaker_num:
    majority_class = 1
    print(f"Majority class is sneakers")
else:
    majority_class = 0
    print(f"Majority class is non-sneakers")

# Calculate the accuracy of a majority class classifier
accuracy = max(sneaker_num, non_sneaker_num) / (sneaker_num + non_sneaker_num)
print(f"The accuracy is: {accuracy}")

# Log loss function: measures the performance of a classification model whose output is a probability value between 0 and 1
# It provides a measure of how well the predicted probabilities align with the actual outcomes.
# Penalization: It heavily penalizes wrong predictions that are confident (i.e., predicting a high probability for the wrong class).
def log_loss(labels, predicted_probabilities):
  """Compute the average Log Loss (binary cross-entropy).

  Args:
    labels: Actual labels (binary, 0 or 1).
    predicted_probabilities: A single predicted probability of the positive
      class, applied to every example.

  Returns:
    The average log loss over all examples.
  """
  total_log_loss = 0

  for i in labels:
    #Check if the label is a sneaker (represented by 1).
    if i == 1:
      total_log_loss += -np.log(predicted_probabilities)
    else:
      total_log_loss += -np.log(1 - predicted_probabilities)

  average_log_loss = total_log_loss / len(labels)

  return average_log_loss

# Calculate log loss for Y_train_mini and Y_val
train_log_loss = log_loss(Y_train_mini, 0.1)
val_log_loss = log_loss(Y_val, 0.1)

print(f"Log Loss on training data: {train_log_loss}")
print(f"Log Loss on validation data: {val_log_loss}")

### <span style="color:chocolate">Exercise 9:</span> Improvement over Baseline with TensorFlow (10 points)

Let's use TensorFlow to train a binary logistic regression model much like you did in the previous assignment. The goal here is to build a ML model to improve over the baseline classifier.

1. Fill in the <span style="color:green">NotImplemented</span> parts of the build_model() function below by following the instructions provided as comments. Hint: the activation function, the loss, and the evaluation metric are different compared to the linear regression model;
2. Build and compile a model using the build_model() function and the (X_train_mini, Y_train_mini) data. Set learning_rate = 0.0001. Call the resulting object *model_tf*.
3. Train *model_tf* using the (X_train_mini, Y_train_mini) data. Set num_epochs = 5 and batch_size=32. Pass the (X_val, Y_val) data for validation. Hint: see the documentation behind the [tf.keras.Model.fit()](https://bcourses.berkeley.edu/courses/1534588/files/88733489?module_item_id=17073646) method.
4. Generate a plot (for the mini training and validation data) with the loss values on the y-axis and the epoch number on the x-axis for visualization. Make sure to include axis names and a title. Hint: check what the [tf.keras.Model.fit()](https://bcourses.berkeley.edu/courses/1534588/files/88733489?module_item_id=17073646) method returns.
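
To make the hint about the activation and loss concrete, here is a NumPy-only sketch of binary logistic regression: a sigmoid on top of a linear score, scored with binary cross-entropy, followed by one full-batch gradient step. It is not the `build_model()` solution and does not use Keras; the data, the learning rate of 0.1, and all names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-12):
    """Average binary cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Synthetic data: 200 examples, 5 features in [0, 1]; the label depends on feature 0
rng = np.random.default_rng(0)
toy_X = rng.random((200, 5))
toy_y = (toy_X[:, 0] > 0.5).astype(float)

# Start from zero weights, so every prediction is 0.5 and the loss is log(2)
w = np.zeros(5)
b = 0.0
loss_before = bce(toy_y, sigmoid(toy_X @ w + b))

# One full-batch gradient descent step on the binary cross-entropy
p = sigmoid(toy_X @ w + b)
grad_w = toy_X.T @ (p - toy_y) / len(toy_y)
grad_b = np.mean(p - toy_y)
lr = 0.1
w -= lr * grad_w
b -= lr * grad_b

loss_after = bce(toy_y, sigmoid(toy_X @ w + b))
print(loss_before, loss_after)
```

In Keras terms, this roughly corresponds to a single dense unit with a sigmoid activation trained with a binary cross-entropy loss.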

In [None]:
def build_model(num_features, learning_rate):
  """Build a TF logistic regression (softmax) model using Keras.

  Args:
    num_features: The number of input features.
    learning_rate: The desired learning rate for SGD.

  Returns:
    model: A tf.keras model (graph).
  """
  # This is not strictly necessary, but each time you build a model, TF adds
  # new nodes (rather than overwriting), so the colab session can end up
  # storing lots of copies of the graph when you only care about the most
  # recent. Also, as there is some randomness built into training with SGD,
  # setting a random seed ensures that results are the same on each identical
  # training run.
  tf.keras.backend.clear_session()
  tf.random.set_seed(0)

  # Build a model using keras.Sequential. While this is intended for neural
  # networks (which may have multiple layers), we want just a single layer for
  # binary logistic regression.

  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Dense(
      units=10,        # output dim
      input_shape=(num_features,),  # input dim
      use_bias=True,               # use a bias (intercept) param
      activation='softmax',
      kernel_initializer=tf.keras.initializers.Ones(),  # initialize params to 1
      bias_initializer=tf.keras.initializers.Ones()    # initialize bias to 1
  ))


  # We need to choose an optimizer. We'll use SGD, which is actually mini-batch SGD
  optimizer = tf.keras.optimizers.SGD(learning_rate = learning_rate)

  # Finally, compile the model. Select the accuracy metric. This finalizes the graph for training.
  # Telling our model how it should learn and evaluate its performance
  # The optimizer tells the model how to adjust its weights during training
  # The loss measures how well the model's predictions match the actual labels
  # The metrics monitor the performance of the model during and after training
  model.compile(optimizer = optimizer, loss = 'categorical_crossentropy', metrics = ['accuracy'])

  return model

In [None]:
tf.random.set_seed(0)

# One-hot encode the labels using TensorFlow's built-in functionality
Y_train_mini_one_hot = tf.keras.utils.to_categorical(Y_train_mini, num_classes=10)
Y_val_one_hot = tf.keras.utils.to_categorical(Y_val, num_classes=10)

# Flatten the input data
X_train_mini_flat = X_train_mini.reshape(X_train_mini.shape[0], -1)
X_val_flat = X_val.reshape(X_val.shape[0], -1)
num_features = X_train_mini_flat.shape[1]

# Build and compile the model
model_tf = build_model(num_features=num_features, learning_rate=0.001)  # Adjusted learning rate

# Fit the model
history = model_tf.fit(X_train_mini_flat, Y_train_mini_one_hot, epochs=5, batch_size=32, validation_data=(X_val_flat, Y_val_one_hot))

# Evaluate the model
train_loss, train_accuracy = model_tf.evaluate(X_train_mini_flat, Y_train_mini_one_hot, verbose=1)
print(f"Aggregate accuracy on the mini train dataset: {train_accuracy:.4f}")

val_eval_loss, val_accuracy = model_tf.evaluate(X_val_flat, Y_val_one_hot, verbose=1)
print(f"Aggregate accuracy on the validation dataset: {val_accuracy:.4f}")

In [None]:
# Extract the loss history
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Plotting the loss values
plt.figure(figsize=(10, 6))
plt.plot(range(1, 6), train_loss, label='Training Loss')
plt.plot(range(1, 6), val_loss, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss Over Epochs')
plt.legend()
plt.show()

print(f"The training loss is: {train_loss[-1]}")
print(f"The validation loss is: {val_loss[-1]}")


---
### Step 5: Hyperparameter tuning

Hyperparameter tuning is a crucial step in optimizing ML models. It involves systematically adjusting hyperparameters such as learning rate, number of epochs, and optimizer to find the model configuration that leads to the best generalization performance.

This tuning process is typically conducted by monitoring the model's performance on the validation vs. training set. It's important to note that using the test set for hyperparameter tuning can compromise the integrity of the evaluation process by violating the assumption of "blindness" of the test data.
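
The selection logic described above can be sketched in a few lines: compare configurations by validation loss, then check how far the validation loss sits from the training loss. The results table below is entirely invented for illustration, and `pick_best` and `percent_gap` are hypothetical helpers; measuring the gap relative to the training loss is only one of several possible conventions.

```python
# Hypothetical (learning_rate, epochs) -> (train_loss, val_loss) results;
# the numbers are invented for illustration, not measured on Fashion MNIST.
toy_results = {
    (0.0001, 5): (0.40, 0.41),
    (0.001, 5): (0.20, 0.22),
    (0.01, 5): (0.10, 0.19),
}

def pick_best(results):
    """Return the configuration with the lowest validation loss."""
    return min(results, key=lambda cfg: results[cfg][1])

def percent_gap(train_loss, val_loss):
    """Validation-vs-training gap as a percentage of the training loss."""
    return (val_loss - train_loss) / train_loss * 100

best_cfg = pick_best(toy_results)
best_gap = percent_gap(*toy_results[best_cfg])
print(best_cfg, round(best_gap, 1))
```

In this made-up table the configuration with the lowest validation loss also has the largest train/validation gap, which is the kind of trade-off the validation set is meant to expose.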

### <span style="color:chocolate">Exercise 10:</span> Hyperparameter tuning (10 points)

1. Fine-tune the hyperparameters of *model_tf* to determine the setup that yields the most optimal performance. Feel free to explore various values for the hyperparameters. Hint: ask your instructors and TAs for help if in doubt.

After identifying your preferred model configuration, print the following information:

2. The first five learned parameters of the model (this should include the bias term);
3. The loss at the final epoch on both the mini training and validation datasets;
4. The percentage difference between the losses observed on the mini training and validation datasets.
5. Compare the training/validation loss of the TensorFlow model (model_tf) with the baseline model's loss. Does the TensorFlow model demonstrate an improvement over the baseline model?


Please note that we will consider 'optimal model configuration' any last-epoch loss that is below 0.08.

In [None]:
tf.random.set_seed(0)
# Rebuild and compile the model with the tuned learning rate
# (num_features and the flattened inputs were computed in Exercise 9)

model_tf = build_model(num_features=num_features, learning_rate = 0.007)

# Fit the model
history = model_tf.fit(X_train_mini_flat, Y_train_mini_one_hot, epochs=50, batch_size=12, validation_data=(X_val_flat, Y_val_one_hot))


In [None]:
# Extract the loss history
train_loss2 = history.history['loss']
val_loss2 = history.history['val_loss']

# Plotting the loss values
plt.figure(figsize=(10, 6))
plt.plot(range(1, 51), train_loss2, label='Training Loss')
plt.plot(range(1, 51), val_loss2, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss Over Epochs')
plt.legend()
plt.show()


In [None]:
# Extract the weights and biases
weight_array = model_tf.get_weights()

weights = weight_array[0]  # Weights between input and first layer
bias = weight_array[1]     # Biases of the first layer

# Print the first five weights and the bias term
print("First five learned parameters of the model (including the bias term):")

# Print the first 5 weights of the first 5 input features for each class
for i in range(5):  # Loop through the first 5 input features
    print(f"Weights for input feature {i+1}:")
    for class_index in range(10):  # Loop through all 10 classes
        print(f"  Class {class_index}: {weights[i, class_index]}")

print("Biases for each class:")
print(bias)

3. The loss at the final epoch on both the mini training and validation datasets;
4. The percentage difference between the losses observed on the mini training and validation datasets.
5. Compare the training/validation loss of the TensorFlow model (model_tf) with the baseline model's loss. Does the TensorFlow model demonstrate an improvement over the baseline model?

In [None]:
# 3. Output the loss at the final epoch on both the mini training and validation datasets;
print(f"The loss at the final epoch on the mini training dataset: {train_loss2[-1]}")
print(f"The loss at the final epoch on the validation dataset: {val_loss2[-1]}")

In [None]:
# Output the percentage difference between the losses observed on the mini training and validation datasets.
# Calculate the absolute difference
final_train_loss = train_loss2[-1]
final_val_loss = val_loss2[-1]

absolute_difference = abs(final_val_loss - final_train_loss)

# Calculate the average loss
average_loss = (final_val_loss + final_train_loss) / 2

# Calculate the percentage difference
percentage_difference = (absolute_difference / average_loss) * 100

# Print the percentage difference
print(f"The percentage difference between the losses observed on the mini training and validation datasets is: {percentage_difference:.2f}%")

In [None]:
# Baseline model losses (computed with log_loss() in Exercise 8)
baseline_train_loss = train_log_loss
baseline_val_loss = val_log_loss

# TensorFlow model losses (final epoch of the tuned model)
final_train_loss = train_loss2[-1]
final_val_loss = val_loss2[-1]

# Output the losses
print(f"The training loss of the baseline is: {baseline_train_loss}")
print(f"The validation loss of the baseline is: {baseline_val_loss}")

# Calculate the absolute difference for the baseline model
baseline_absolute_difference = abs(baseline_val_loss - baseline_train_loss)

# Calculate the average loss for the baseline model
baseline_average_loss = (baseline_val_loss + baseline_train_loss) / 2

# Calculate the percentage difference for the baseline model
baseline_percentage_difference = (baseline_absolute_difference / baseline_average_loss) * 100

# Print the percentage difference for the baseline model
print(f"The percentage difference between the losses observed on the baseline model is: {baseline_percentage_difference:.2f}%")

# Compare the training/validation loss of the TensorFlow model with the baseline model's loss
print(f"The training loss of the TensorFlow model is: {final_train_loss}")
print(f"The validation loss of the TensorFlow model is: {final_val_loss}")

# Check if the TensorFlow model demonstrates an improvement over the baseline model
if final_val_loss < baseline_val_loss:
    print("The TensorFlow model demonstrates an improvement over the baseline model.")
else:
    print("The TensorFlow model does not demonstrate an improvement over the baseline model.")


---
### Step 6: Evaluation and Generalization


Now that you've determined the optimal set of hyperparameters, it's time to evaluate your optimized model on the test data to gauge its performance in real-world scenarios, commonly known as inference.

### <span style="color:chocolate">Exercise 11:</span> Computing accuracy (10 points)

1. Calculate aggregate accuracy on both mini train and test datasets using a probability threshold of 0.5. Hint: You can utilize the <span style="color:chocolate">model.evaluate()</span> method provided by tf.keras. Note: Aggregate accuracy measures the overall correctness of the model across all classes in the dataset;

2. Does the model demonstrate strong aggregate generalization capabilities? Provide an explanation based on your accuracy observations.
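
As a small illustration of what "accuracy at a 0.5 threshold" means, the sketch below thresholds made-up probabilities and compares them with made-up labels. `accuracy_at_threshold` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def accuracy_at_threshold(y_true, p_pred, threshold=0.5):
    """Share of examples whose thresholded prediction matches the label."""
    y_hat = (np.asarray(p_pred) > threshold).astype(int)
    return float(np.mean(y_hat == np.asarray(y_true)))

toy_y = np.array([1, 0, 0, 1, 0])
toy_p = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
toy_acc = accuracy_at_threshold(toy_y, toy_p)
print(toy_acc)  # 0.6

# With 90% negatives, predicting "negative" everywhere already scores 0.9
all_negative_acc = accuracy_at_threshold([0] * 9 + [1], [0.0] * 10)
print(all_negative_acc)
```

The second call is a reminder that aggregate accuracy on imbalanced data should be read against the majority-class baseline.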

In [None]:
from tensorflow.keras.utils import to_categorical

# One-hot encode the binary test labels, matching the encoding used for training
Y_test_one_hot = to_categorical(Y_test_binary, num_classes=10)

# Use the normalized, flattened test data so it matches the model's training inputs
X_test_flat = X_test_norm_flat

# Evaluate the model using model.evaluate()
train_loss, train_accuracy = model_tf.evaluate(X_train_mini_flat, Y_train_mini_one_hot, verbose=1)
print(f"Aggregate accuracy on the mini train dataset (from evaluate): {train_accuracy:.4f}")

test_loss, test_accuracy = model_tf.evaluate(X_test_flat, Y_test_one_hot, verbose=1)
print(f"Aggregate accuracy on the test dataset (from evaluate): {test_accuracy:.4f}")


Does the model demonstrate strong aggregate generalization capabilities? Provide an explanation based on your accuracy observations.

No. Unfortunately, the model does really poorly on the test set, so much so that I thought I had done the whole thing wrong. It generalizes poorly and is overfitted.



### <span style="color:chocolate">Exercise 12:</span> Fairness evaluation (10 points)

1. Generate and visualize the confusion matrix on the test dataset using a probability threshold of 0.5. Additionally, print the True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN). Hint: you can utilize the <span style="color:chocolate">model.predict()</span> method available in tf.keras, and then the <span style="color:chocolate">confusion_matrix()</span>, <span style="color:chocolate">ConfusionMatrixDisplay()</span> methods available in sklearn.metrics;

2. Compute subgroup accuracy, separately for the sneaker and non-sneaker classes, on the test dataset using a probability threshold of 0.5. Reflect on any observed accuracy differences (potential lack of fairness) between the two classes.

3. Does the model demonstrate strong subgroup generalization capabilities? Provide an explanation based on your accuracy observations.
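
The four confusion-matrix counts and the two subgroup accuracies can be sketched directly in NumPy for the binary case. The labels and predictions below are invented, and `binary_confusion_counts` is a hypothetical helper that stands in for the scikit-learn functions named in the hint.

```python
import numpy as np

def binary_confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for 0/1 labels and predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fn, fp, tn

toy_y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
toy_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])
tp, fn, fp, tn = binary_confusion_counts(toy_y, toy_pred)

# Subgroup accuracy: accuracy computed within each true class
positive_accuracy = tp / (tp + fn)  # also called recall or sensitivity
negative_accuracy = tn / (tn + fp)  # also called specificity
print(tp, fn, fp, tn, positive_accuracy, negative_accuracy)
```

A large gap between the two subgroup accuracies is the kind of difference the exercise asks you to reflect on.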

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score

# model_tf, X_test_norm_flat, and Y_test_binary are defined in earlier cells

# Predict class probabilities on the normalized, flattened test data
test_probs = model_tf.predict(X_test_norm_flat)

# Column 1 holds the predicted sneaker probability; apply the 0.5 threshold
test_preds = (test_probs[:, 1] > 0.5).astype(int)

# Generate the 2x2 confusion matrix (rows: true label, columns: predicted label)
cm = confusion_matrix(Y_test_binary, test_preds, labels=[0, 1])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Non-sneaker", "Sneaker"])
disp.plot()
plt.title("Confusion matrix on the test dataset")
plt.show()

# Unpack the counts; the flattened 2x2 matrix is ordered TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print(f"TP: {TP}")
print(f"FN: {FN}")
print(f"FP: {FP}")
print(f"TN: {TN}")

# Sneaker accuracy: share of true sneakers that are predicted as sneakers
sneaker_indices = (Y_test_binary == 1)
sneaker_accuracy = accuracy_score(Y_test_binary[sneaker_indices], test_preds[sneaker_indices])
print(f"Sneaker class accuracy: {sneaker_accuracy:.4f}")

# Non-sneaker accuracy: share of true non-sneakers that are predicted as non-sneakers
non_sneaker_indices = (Y_test_binary == 0)
non_sneaker_accuracy = accuracy_score(Y_test_binary[non_sneaker_indices], test_preds[non_sneaker_indices])
print(f"Non-sneaker class accuracy: {non_sneaker_accuracy:.4f}")

# Reflect on observed accuracy differences
accuracy_difference = sneaker_accuracy - non_sneaker_accuracy
print(f"Accuracy difference between sneaker and non-sneaker classes: {accuracy_difference:.4f}")

# Does the model demonstrate strong subgroup generalization capabilities?
if abs(accuracy_difference) < 0.05:
    print("The model demonstrates strong subgroup generalization capabilities.")
else:
    print("The model does not demonstrate strong subgroup generalization capabilities.")


----
### <span style="color:chocolate">Bonus question</span> (20 points)

Is it possible to enhance the prediction accuracy for the sneaker class by performing the following steps?

1. Implement data balancing techniques, such as oversampling or undersampling, to equalize the representation of both classes.
2. After balancing the data, retrain the model on the balanced dataset.
3. Evaluate the model's performance, particularly focusing on the accuracy achieved for the sneaker class.

Note: provide a separate notebook for the Bonus exercise. Name it ``04 Logistic Regression with Tensorflow_bonus``.
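
Random oversampling for a binary problem can be sketched in plain NumPy as follows. This is only an illustration of the idea on synthetic data: `oversample_minority` is a hypothetical helper, and a library resampler could be used instead. In practice the resampling would be applied only to the training split, so that validation and test data keep the original class ratio.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Duplicate random minority-class rows until both 0/1 classes match in size."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(y, minlength=2)
    minority = int(np.argmin(counts))
    deficit = int(counts.max() - counts.min())
    minority_idx = np.flatnonzero(y == minority)
    # Sample with replacement from the minority rows
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    # Shuffle so the duplicates are not clustered at the end
    keep = rng.permutation(keep)
    return X[keep], y[keep]

# Synthetic data with one positive example out of ten
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
bal_X, bal_y = oversample_minority(toy_X, toy_y)
print(np.bincount(bal_y))  # [9 9]
```

Undersampling would follow the same pattern in reverse, dropping random majority-class rows instead of duplicating minority-class ones.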

In [None]:
from imblearn.over_sampling import RandomOverSampler

# Retrain the Model on the Balanced Dataset

# Assuming X_train and Y_train are already loaded
# Reshape and normalize data
X_train_flat = X_train.reshape(X_train.shape[0], -1) / 255.0

# Balance the dataset using RandomOverSampler
ros = RandomOverSampler(sampling_strategy='auto')
X_train_balanced, Y_train_balanced = ros.fit_resample(X_train_flat, Y_train)

# One-hot encode the balanced labels
Y_train_balanced_one_hot = to_categorical(Y_train_balanced, num_classes=10)

# Print the new class distribution
print("Class distribution after balancing:")
print(np.bincount(Y_train_balanced))

In [None]:
# Define the logistic regression (softmax) model
def build_model(num_features, learning_rate):
    tf.keras.backend.clear_session()
    tf.random.set_seed(0)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='softmax', input_shape=(num_features,))
    ])

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model

# Build and compile the model
num_features = X_train_balanced.shape[1]
model_tf_balanced = build_model(num_features, learning_rate=0.01)

# Train the model
history_balanced = model_tf_balanced.fit(X_train_balanced, Y_train_balanced_one_hot, epochs=50, batch_size=32, validation_split=0.2, verbose=1)

from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay

# Predict on the test set
X_test_flat = X_test.reshape(X_test.shape[0], -1) / 255.0
Y_test_one_hot = to_categorical(Y_test, num_classes=10)
test_probs_balanced = model_tf_balanced.predict(X_test_flat)
test_preds_balanced = np.argmax(test_probs_balanced, axis=1)

# Generate the confusion matrix
cm_balanced = confusion_matrix(Y_test, test_preds_balanced)
disp_balanced = ConfusionMatrixDisplay(confusion_matrix=cm_balanced)
disp_balanced.plot()
plt.show()

# Calculate accuracy for sneaker and non-sneaker classes
sneaker_accuracy_balanced = accuracy_score(Y_test[Y_test == 7], test_preds_balanced[Y_test == 7])
non_sneaker_accuracy_balanced = accuracy_score(Y_test[Y_test != 7], test_preds_balanced[Y_test != 7])

print(f"Sneaker class accuracy: {sneaker_accuracy_balanced:.4f}")
print(f"Non-sneaker class accuracy: {non_sneaker_accuracy_balanced:.4f}")

# Print the overall accuracy
test_loss_balanced, test_accuracy_balanced = model_tf_balanced.evaluate(X_test_flat, Y_test_one_hot, verbose=1)
print(f"Overall test accuracy: {test_accuracy_balanced:.4f}")