<a href="https://colab.research.google.com/github/Carllang448/CS_670/blob/main/Part1_MachineLearning_FinalProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data Selection
Teams will select an image classification dataset for their project from PyTorch’s or TensorFlow’s set of built-in datasets. Students may not choose an MNIST dataset, since it was used for Assignment 3.

Data Analysis & Visualization:
Students should describe the dataset they chose in their presentation, including its source,
size, and the type of data it contains, and conduct a statistical analysis of the dataset to
understand its characteristics.
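
A minimal sketch of the kind of statistical summary described above, assuming the images are held in a NumPy array of shape (n, height, width, channels). The function name `summarize_dataset` and the synthetic demo data are hypothetical stand-ins, not part of the project code.

```python
import numpy as np

def summarize_dataset(images, labels):
    """Return basic statistics for an image classification dataset.

    images: array of shape (n, height, width, channels)
    labels: array with one integer class id per image
    """
    labels = np.asarray(labels).reshape(-1)
    classes, counts = np.unique(labels, return_counts=True)
    return {
        "num_images": int(images.shape[0]),
        "image_shape": tuple(images.shape[1:]),
        # How many images fall in each class (reveals class imbalance)
        "class_counts": {int(c): int(n) for c, n in zip(classes, counts)},
        # Per-channel statistics over all images and pixels
        "channel_mean": images.mean(axis=(0, 1, 2)),
        "channel_std": images.std(axis=(0, 1, 2)),
    }

# Synthetic stand-in data (hypothetical, not the real dataset)
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(8, 96, 96, 3), dtype=np.uint8)
demo_labels = np.array([0, 1, 0, 1, 1, 0, 0, 1])
summary = summarize_dataset(demo_images, demo_labels)
print(summary["class_counts"])
```

For a dataset too large for memory, the same statistics could be accumulated chunk by chunk rather than computed in one call.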

Pre-Processing & Model Selection:
Identify data preprocessing tasks that can be done on the dataset and design at least 3 different
machine learning models (or combinations of models) which can be used. Justify the choice of
preprocessing steps and models based on the data analysis.
The models used should be explained in the phase 1 presentation.

Model Training:
Groups should train each model on the dataset and collect results on the models’ performance.
These results should be included in the phase 1 presentation. A validation set should be used to
prevent overfitting. The models should be exported, as was done in Assignment 3. It is recommended
that each model be kept in a separate .ipynb file, for ease of collaboration.
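
One common way to use a validation set against overfitting is early stopping: stop once validation loss has not improved for a set number of epochs and keep the best epoch. The sketch below is only an illustration of that rule; the function name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops at the first epoch that is `patience` epochs past the
    best validation loss seen so far. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: training ran through every epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses that start rising after epoch 2
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```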

Presentation:
On March 10, groups will present the results of their work from phase 1. There will be an announcement on Blackboard for groups to schedule their presentations.
The group leader will submit the slides, code, and exported models to Blackboard.

In [None]:
#Carl look at getting email address/drive to link and working on portioning the data for use on personal computer
#loading data
#framework for code by wednesday for everything


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.model_selection import train_test_split
import pickle
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import *
import h5py

# Set the seed for reproducibility
np.random.seed(66)
tf.random.set_seed(66)

base_path = '/content/drive/MyDrive/CS638_Project/'
train_x_path = os.path.join(base_path, 'camelyonpatch_level_2_split_train_x.h5')
train_y_path = os.path.join(base_path, 'camelyonpatch_level_2_split_train_y.h5')
valid_x_path = os.path.join(base_path, 'camelyonpatch_level_2_split_valid_x.h5')
valid_y_path = os.path.join(base_path, 'camelyonpatch_level_2_split_valid_y.h5')



In [None]:
from tensorflow.keras.applications.resnet50 import preprocess_input

def normalize_images(images):
    """Preprocess image pixel values for ResNet50."""
    # Convert to float32 for precision
    images = images.astype('float32')
    # Use the preprocess_input function specific to the model
    images = preprocess_input(images)
    return images
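
For reference, a NumPy sketch of what "caffe"-style preprocessing does, which to my understanding is the mode used by the ResNet50 `preprocess_input`: channels are reordered from RGB to BGR and the ImageNet channel means are subtracted, with no scaling to [0, 1]. The function name `caffe_style_preprocess` is hypothetical and this is not the Keras implementation.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by "caffe"-style preprocessing
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(images):
    """Convert RGB images to BGR and zero-center each channel."""
    images = np.asarray(images, dtype=np.float32)
    bgr = images[..., ::-1]          # reverse the channel axis: RGB -> BGR
    return bgr - IMAGENET_BGR_MEAN   # subtract the per-channel means

# A single RGB pixel equal to the ImageNet mean maps to zero
pixel = np.array([[[[123.68, 116.779, 103.939]]]], dtype=np.float32)
print(caffe_style_preprocess(pixel))
```

Because the output is not in [0, 1], a model trained from scratch on these values sees inputs roughly in the range -124 to 152.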



In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define the data augmentation
data_augmentation = ImageDataGenerator(
    preprocessing_function=normalize_images,  # Applied by flow()/standardize(); random_transform() alone does not call it
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,  # randomly flip images
    vertical_flip=True  # randomly flip images
)
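
To make the flip options concrete, here is a small NumPy sketch of random horizontal and vertical flips on a single (height, width, channels) image. It covers only the flips, not the shifts, and the helper name `random_flip` is hypothetical.

```python
import numpy as np

def random_flip(image, rng):
    """Flip an image horizontally and/or vertically, each with probability 0.5."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]   # horizontal flip: reverse the width axis
    if rng.random() < 0.5:
        image = image[::-1, :, :]   # vertical flip: reverse the height axis
    return image

rng = np.random.default_rng(66)
image = np.arange(4 * 4 * 3).reshape(4, 4, 3)
flipped = random_flip(image, rng)
print(flipped.shape)
```

Flips only rearrange pixels, so the image keeps its shape and its set of pixel values.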


In [None]:
def hdf5_image_feature_generator(x_path, y_path, feature_array, batch_size, shuffle=False, augmentor=None):
    """Yield ([images, features], labels) batches from HDF5 files indefinitely."""
    while True:
        with h5py.File(x_path, 'r') as x_h5, h5py.File(y_path, 'r') as y_h5:
            x = x_h5['x']
            y = y_h5['y']
            num_samples = x.shape[0]

            indices = np.arange(num_samples)
            if shuffle:
                np.random.shuffle(indices)

            for start_idx in range(0, num_samples, batch_size):
                end_idx = min(start_idx + batch_size, num_samples)
                # h5py index lists must be in increasing order, so sort within the batch
                batch_indices = np.sort(indices[start_idx:end_idx])

                # Cast to float32 so augmented values are not truncated if images are stored as integers
                batch_x = x[batch_indices].astype('float32')

                if augmentor is not None:
                    # Apply a random augmentation to each image in the batch
                    for i in range(batch_x.shape[0]):
                        batch_x[i] = augmentor.random_transform(batch_x[i])

                batch_y = np.array(y[batch_indices]).astype('float32')
                batch_features = feature_array[batch_indices]

                # Ensure that batch_y is 1D and that all three arrays have the same batch size
                batch_y = batch_y.reshape(-1)
                assert batch_y.shape[0] == batch_x.shape[0], "Labels batch size does not match images batch size"
                assert batch_features.shape[0] == batch_x.shape[0], "Features batch size does not match images batch size"

                yield [batch_x, batch_features], batch_y
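
The batching logic can be checked without any HDF5 files. The sketch below isolates the index handling: shuffle once per epoch, slice into batches, and sort each batch, since h5py (to my knowledge) expects index lists in increasing order. The helper name `sorted_index_batches` is hypothetical.

```python
import numpy as np

def sorted_index_batches(num_samples, batch_size, rng=None):
    """Yield index arrays covering range(num_samples) once, each sorted ascending.

    If rng is given, samples are shuffled across batches before slicing.
    """
    indices = np.arange(num_samples)
    if rng is not None:
        rng.shuffle(indices)
    for start in range(0, num_samples, batch_size):
        # Sorting inside a batch keeps the batch membership random
        # while making the indices valid for ordered fancy indexing
        yield np.sort(indices[start:start + batch_size])

batches = list(sorted_index_batches(10, 4, rng=np.random.default_rng(0)))
for batch in batches:
    print(batch)
```

Using the same sorted indices for images, labels, and features keeps the three arrays aligned row by row.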



In [None]:
def extract_features(images):
    # Calculate the mean and standard deviation of pixel values
    pixel_mean = images.mean(axis=(1, 2, 3))
    pixel_std = images.std(axis=(1, 2, 3))
    pixel_max = images.max(axis=(1, 2, 3))
    pixel_min = images.min(axis=(1, 2, 3))

    # More features can be added here, such as skewness, kurtosis, etc.

    # Combine features into a single array
    features = np.stack([pixel_mean, pixel_std, pixel_max, pixel_min], axis=1)
    return features
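
The comment in `extract_features` mentions skewness and kurtosis as possible extra features. A minimal NumPy sketch of both, computed per image from standardized pixel values, could look like the following. The function name `extract_shape_features` and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def extract_shape_features(images, eps=1e-8):
    """Return per-image skewness and (non-excess) kurtosis of pixel values.

    images: array of shape (n, height, width, channels)
    Returns an array of shape (n, 2).
    """
    flat = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
    mean = flat.mean(axis=1, keepdims=True)
    std = flat.std(axis=1, keepdims=True)
    z = (flat - mean) / (std + eps)      # eps avoids division by zero for flat images
    skewness = (z ** 3).mean(axis=1)     # third standardized moment
    kurtosis = (z ** 4).mean(axis=1)     # fourth standardized moment
    return np.stack([skewness, kurtosis], axis=1)

# Tiny example: half the pixels are 0 and half are 2 (symmetric distribution)
demo = np.zeros((1, 2, 2, 1))
demo[0, 0] = 2.0
print(extract_shape_features(demo))
```

The result could be concatenated with the four existing features along axis 1.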

In [None]:
def extract_features_in_chunks(hdf5_path, batch_size=1000):
    # Initialize an empty list to store the extracted features
    all_features = []

    with h5py.File(hdf5_path, 'r') as hdf5_file:
        # Determine the number of images
        num_images = hdf5_file['x'].shape[0]

        # Process the dataset in batches
        for start_idx in range(0, num_images, batch_size):
            end_idx = min(start_idx + batch_size, num_images)
            images_batch = hdf5_file['x'][start_idx:end_idx]

            # Extract features for the current batch
            batch_features = extract_features(images_batch)
            all_features.append(batch_features)

    # Concatenate all batch features together
    all_features = np.concatenate(all_features, axis=0)
    return all_features

# Call the function to extract features for the entire dataset
# Assuming 'train_x_path' is the path to your HDF5 file containing the images
# Extract features for the training dataset
train_features = extract_features_in_chunks(train_x_path)

# Extract features for the validation dataset
valid_features = extract_features_in_chunks(valid_x_path)

# Calculate mean and std from training features
mean = train_features.mean(axis=0)
std = train_features.std(axis=0)
# Guard against division by zero if a feature is constant across the training set
std = np.where(std == 0, 1.0, std)

# Apply standardization using training data statistics
train_features = (train_features - mean) / std
valid_features = (valid_features - mean) / std

# Verify the sizes
print(f"Training features shape: {train_features.shape}")
print(f"Validation features shape: {valid_features.shape}")


# Now 'train_features' and 'valid_features' should each have as many rows as there are images in the corresponding dataset



FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '/content/drive/MyDrive/CS638_Project/camelyonpatch_level_2_split_train_x.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

In [None]:
# Set batch size
batch_size = 32

# Create the augmented training data generator
train_augmented_generator = hdf5_image_feature_generator(
    train_x_path, train_y_path, train_features, batch_size, shuffle=False, augmentor=data_augmentation)

# Create a non-augmented validation data generator
validation_generator = hdf5_image_feature_generator(
    valid_x_path, valid_y_path, valid_features, batch_size, shuffle=False)



In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Dense, Dropout, Flatten, Multiply, Reshape, GlobalAveragePooling2D, Concatenate
from tensorflow.keras.models import Model


def attention_block(inputs):
    # Perform global average pooling on the inputs to get a vector
    gap = GlobalAveragePooling2D()(inputs)  # Shape: (batch_size, channels)

    # Learn a dense layer to predict attention scores from the GAP vector
    attention_scores = Dense(inputs.shape[-1], activation='softmax', use_bias=False)(gap)  # Shape: (batch_size, channels)

    # Apply attention scores to the original inputs
    attention_scores = Reshape((1, 1, inputs.shape[-1]))(attention_scores)  # Shape: (batch_size, 1, 1, channels)
    attention_output = Multiply()([inputs, attention_scores])

    return attention_output


# Modified model building function that includes an additional input for the extracted features
def build_model(num_features):
    # Input for the image data
    image_input = Input(shape=(96, 96, 3))

    # Convolutional layers
    conv1 = Conv2D(32, (3, 3), activation='relu', padding='same')(image_input)
    conv1 = MaxPooling2D(pool_size=(2, 2))(conv1)
    conv2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv1)
    conv2 = MaxPooling2D(pool_size=(2, 2))(conv2)
    conv3 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv2)
    conv3 = MaxPooling2D(pool_size=(2, 2))(conv3)

    # Apply the attention mechanism
    attention = attention_block(conv3)
    flatten_images = Flatten()(attention)

    # Input for the extracted features
    features_input = Input(shape=(num_features,))

    # Concatenate the image features and the extracted features
    concatenated_features = Concatenate()([flatten_images, features_input])

    # Dense layers
    dense1 = Dense(64, activation='relu')(concatenated_features)
    dropout = Dropout(0.5)(dense1)
    output = Dense(1, activation='sigmoid')(dropout)

    # Build the model
    model = Model(inputs=[image_input, features_input], outputs=output)

    # Compile the model
    opt = tf.keras.optimizers.Adam(learning_rate=0.001)
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])

    return model
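
To see what `attention_block` computes, here is a NumPy sketch of its forward pass: global average pooling, a bias-free dense layer with softmax, and a per-channel rescaling of the input. The function name `channel_attention` and the explicit `weights` argument are hypothetical; in the Keras model the dense kernel is learned.

```python
import numpy as np

def channel_attention(inputs, weights):
    """Rescale each channel of `inputs` by a softmax attention score.

    inputs:  array of shape (batch, height, width, channels)
    weights: dense kernel of shape (channels, channels), no bias
    """
    gap = inputs.mean(axis=(1, 2))                        # (batch, channels)
    logits = gap @ weights                                # (batch, channels)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    scores = np.exp(logits)
    scores /= scores.sum(axis=1, keepdims=True)           # softmax over channels
    return inputs * scores[:, None, None, :]              # broadcast over height and width

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 8))
out = channel_attention(x, np.zeros((8, 8)))
print(out.shape)
```

With an all-zero kernel every score is 1/8, so the output is the input divided by the channel count. Since softmax scores sum to 1 across channels, this block scales activations down as the number of channels grows.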


In [None]:
num_features = train_features.shape[1]  # This should be the number of engineered features per sample
model = build_model(num_features)


# Estimate steps per epoch for training and validation
with h5py.File(train_x_path, 'r') as f:
    train_steps_per_epoch = f['x'].shape[0] // batch_size

with h5py.File(valid_x_path, 'r') as f:
    validation_steps = f['x'].shape[0] // batch_size
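
The floor division above drops any final partial batch. A small sketch of both conventions, using the dataset sizes this notebook prints (262144 training and 32768 validation samples); the helper name `steps_per_epoch` is hypothetical.

```python
import math

def steps_per_epoch(num_samples, batch_size, drop_remainder=True):
    """Number of batches per epoch, with or without the final partial batch."""
    if drop_remainder:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

print(steps_per_epoch(262144, 32))   # training set
print(steps_per_epoch(32768, 32))    # validation set
```

Both sizes are exact multiples of 32, so the two conventions agree here; they differ only when the sample count is not divisible by the batch size.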

In [None]:
import h5py
import numpy as np

# Paths to the HDF5 files in Google Drive
base_path = '/content/drive/MyDrive/CS_638_Term_project_data/'
train_x_path = os.path.join(base_path, 'camelyonpatch_level_2_split_train_x.h5')
train_y_path = os.path.join(base_path, 'camelyonpatch_level_2_split_train_y.h5')
valid_x_path = os.path.join(base_path, 'camelyonpatch_level_2_split_valid_x.h5')
valid_y_path = os.path.join(base_path, 'camelyonpatch_level_2_split_valid_y.h5')
# Function to load data size from HDF5 file
def get_hdf5_size(file_path, key='x'):
    with h5py.File(file_path, 'r') as hf:
        return hf[key].shape[0]  # Returns the size (number of samples) of the dataset

# Get sizes of the HDF5 datasets
train_x_size = get_hdf5_size(train_x_path, key='x')
train_y_size = get_hdf5_size(train_y_path, key='y')
valid_x_size = get_hdf5_size(valid_x_path, key='x')
valid_y_size = get_hdf5_size(valid_y_path, key='y')
# train_features and valid_features are the precomputed feature arrays
# Make sure you've already computed them before running the code
train_features_size = train_features.shape[0]
valid_features_size = valid_features.shape[0]

# Print the sizes
print(f"Size of train_x dataset in HDF5: {train_x_size}")
print(f"Size of train_y dataset in HDF5: {train_y_size}")
print(f"Size of valid_x dataset in HDF5: {valid_x_size}")
print(f"Size of valid_y dataset in HDF5: {valid_y_size}")
print(f"Size of the train features array: {train_features_size}")
print(f"Size of the valid features array: {valid_features_size}")

# Additionally, if you want to compare the sizes to ensure they match
#if train_x_size == train_features_size:
  #  print("The size of the HDF5 image dataset matches the size of the feature array.")
#else:
 #   print("Mismatch! The size of the HDF5 image dataset does not match the size of the feature array.")


Size of train_x dataset in HDF5: 262144
Size of train_y dataset in HDF5: 262144
Size of valid_x dataset in HDF5: 32768
Size of valid_y dataset in HDF5: 32768
Size of the train features array: 262144
Size of the valid features array: 32768


In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau
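# Define the callback to reduce the learning rate when validation loss plateaus
# Note: with patience=5 and epochs=5 below, a reduction cannot take effect during this run;
# lower patience or train for more epochs if the schedule is meant to be used
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.0001)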

# Include this callback in the fit function
history = model.fit(
    train_augmented_generator,
    steps_per_epoch=train_steps_per_epoch,
    epochs=5,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks=[reduce_lr]
)

Epoch 1/5

KeyboardInterrupt: 
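
The plateau rule can be explored without training anything. The sketch below is a simplified re-implementation of the idea behind `ReduceLROnPlateau`, not the Keras code: it ignores `cooldown` and `min_delta`, and the function name `simulate_reduce_lr` and the loss values are hypothetical.

```python
def simulate_reduce_lr(val_losses, lr=0.001, factor=0.2, patience=5, min_lr=0.0001):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` (but never goes below `min_lr`)
    once validation loss has failed to improve for `patience` epochs.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# A flat validation loss: no reduction within 5 epochs, one after the 6th
print(simulate_reduce_lr([1.0] * 5))
print(simulate_reduce_lr([1.0] * 7))
```

Under this simplified rule, the first epoch always counts as an improvement, so at least `patience + 1` epochs are needed before any reduction occurs.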

In [None]:
#data analysis and visualization
#Emilia

#do we want to do NMF, PCA, etc?

In [None]:
#Preprocessing: getting rid of bad data, normalization, data augmentation, splitting into train and test data

In [None]:
#Model 1

In [None]:
#Model 1 Training

In [None]:
#Model 2

In [None]:
#Model 2 Training

In [None]:
#Model 3

In [None]:
#Model 3 Training