For this week's mini-project, you will participate in this Kaggle competition: 
Histopathologic Cancer Detection

This Kaggle competition is a binary image classification problem where you will identify metastatic cancer in small image patches taken from larger digital pathology scans.

You will submit three deliverables: 

Deliverable 1 — A Jupyter notebook with a description of the problem/data, exploratory data analysis (EDA) procedure, analysis (model building and training), result, and discussion/conclusion. 

If your work becomes too large to fit into one notebook (or you think one large notebook would be less readable), you can split it into several notebooks or scripts in a GitHub repository (deliverable 2) and submit a report-style notebook or PDF instead. 

If your project doesn’t fit the Jupyter notebook format (e.g., you built an app that uses ML), write up your approach as a report and submit it as a PDF. 

Deliverable 2 — A public project GitHub repository with your work (please also include the GitHub repo URL in your notebook/report).

Deliverable 3 — A screenshot of your position on the Kaggle competition leaderboard for your top-performing model.

## Project Description
I first tested a simple CNN written using only NumPy arrays. Because memory is limited on my system and I initially had trouble getting CUDA working on Linux, I used a small subset of the training data and validated against a reserved portion of it. A screenshot of the validation accuracy is available, but this model was not used to generate the final submission.

For the next part of the project, I used TensorFlow with the Adam optimizer and a binary cross-entropy loss. The model is a simple CNN; the version shown below has two convolutional layers, each followed by a max-pooling layer, and a sigmoid output. It was trained for 10 epochs and reached an accuracy of about 70% across multiple runs. After saving the model, I used it to generate predictions on the test data and submitted the results to Kaggle.

**This is a report-style notebook, so no data is provided and it is not intended to be run.**

## Data Imports

In [None]:
import os

import numpy as np
import pandas as pd
from PIL import Image

# Set paths
train_data_path = 'data/train'
test_data_path = 'data/test'
labels_path = 'data/train_labels.csv'

# Load training labels (Kaggle file with columns: id, label)
labels_df = pd.read_csv(labels_path)

def load_images(data_path, ids=None):
    """Load .tif images into one array.

    If `ids` is given, load exactly those files in that order;
    otherwise load every .tif file in os.listdir order.
    """
    if ids is None:
        ids = [f[:-4] for f in os.listdir(data_path) if f.endswith('.tif')]
    images = []
    for image_id in ids:
        img = Image.open(os.path.join(data_path, f'{image_id}.tif'))
        images.append(np.array(img))
    return np.array(images)

def process_in_batches(images, batch_size=32):
    for i in range(0, len(images), batch_size):
        yield images[i:i+batch_size]

# Load training data: only the first 20% of the labelled examples, to fit in memory.
# Loading by id keeps every image aligned with its label.
subset_df = labels_df.iloc[:round(len(labels_df) * .20)]
train_images = load_images(train_data_path, subset_df['id'])
train_labels = subset_df['label'].values

# Load test data
test_images = load_images(test_data_path)

train_images = train_images.astype(np.float32)
test_images = test_images.astype(np.float32)

print("Data Characteristics: ")
print(f"Train images shape: {train_images.shape}  |  Number of Samples: {len(train_images)}")
print(f"Train labels shape: {train_labels.shape}")
print(f"Test images shape: {test_images.shape}  |  Number of Samples: {len(test_images)}")
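A rough memory estimate shows why only a subset of the images is loaded. The sketch below assumes 96x96 RGB patches (an assumption, though it is consistent with the `32 * 22 * 22` fully connected layer used later), and the image count `n` is a hypothetical round number rather than the real dataset size.

```python
import numpy as np

def estimate_bytes(n_images, dtype, shape=(96, 96, 3)):
    """Bytes needed to hold n_images arrays of the given shape and dtype."""
    return n_images * int(np.prod(shape)) * np.dtype(dtype).itemsize

n = 200_000  # hypothetical image count, for illustration only
for dtype in (np.uint8, np.float32):
    gib = estimate_bytes(n, dtype) / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.1f} GiB")
```

Converting to `float32` multiplies the footprint by four compared with the original `uint8` pixels, so the cast is the point where memory pressure is highest.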

## Simple CNN
I wanted to start by hand-rolling a CNN using only NumPy. The class below implements the forward pass only.

In [1]:
import numpy as np


class CNN:
    """
    A small convolutional neural network (forward pass only) built from
    2D convolution, ReLU activation, and max pooling.

    Methods:

    - `__init__()`: Initializes the weights and biases of the CNN.
    - `conv2d(input, kernel, bias)`: Performs 2D convolution on the input using the provided kernel and bias.
    - `relu(input)`: Applies the ReLU activation function to the input.
    - `max_pool(input)`: Applies 2x2 max pooling to the input.
    - `forward(input)`: Performs the forward pass of the CNN on the input.
    """

    def __init__(self):
        # Initialize layers, weights, and biases
        self.conv1_weights = np.random.randn(3, 3, 3, 16).astype(np.float32) * 0.01
        self.conv1_bias = np.zeros((16, 1), dtype=np.float32)
        self.conv2_weights = np.random.randn(3, 3, 16, 32).astype(np.float32) * 0.01
        self.conv2_bias = np.zeros((32, 1), dtype=np.float32)
        self.fc_weights = np.random.randn(32 * 22 * 22, 1).astype(np.float32) * 0.01
        self.fc_bias = np.zeros((1, 1), dtype=np.float32)

    def conv2d(self, input, kernel, bias):
        # 'Valid' convolution with stride 1: each side shrinks by (kernel size - 1)
        h_out = input.shape[1] - kernel.shape[0] + 1
        w_out = input.shape[2] - kernel.shape[1] + 1
        # Match the input dtype; np.zeros defaults to float64, which doubles memory use
        output = np.zeros((input.shape[0], h_out, w_out, kernel.shape[3]), dtype=input.dtype)
        
        for i in range(h_out):
            for j in range(w_out):
                # Patch (N, kh, kw, C, 1) * kernel (1, kh, kw, C, F), summed over kh, kw, C -> (N, F)
                output[:, i, j, :] = np.sum(input[:, i:i+kernel.shape[0], j:j+kernel.shape[1], :, np.newaxis] * 
                                            kernel[np.newaxis, :, :, :], axis=(1, 2, 3)) + bias.T
        return output

    def relu(self, input):
        # Implement ReLU activation
        return np.maximum(0, input)

    def max_pool(self, input):
        # Non-overlapping 2x2 max pooling; an odd trailing row or column is dropped
        h_out, w_out = input.shape[1] // 2, input.shape[2] // 2
        output = np.zeros((input.shape[0], h_out, w_out, input.shape[3]), dtype=input.dtype)
        
        for i in range(h_out):
            for j in range(w_out):
                output[:, i, j, :] = np.max(input[:, 2*i:2*i+2, 2*j:2*j+2, :], axis=(1, 2))
        return output

    def forward(self, input):
        # First convolutional layer
        conv1 = self.conv2d(input, self.conv1_weights, self.conv1_bias)
        relu1 = self.relu(conv1)
        pool1 = self.max_pool(relu1)
        
        # Second convolutional layer
        conv2 = self.conv2d(pool1, self.conv2_weights, self.conv2_bias)
        relu2 = self.relu(conv2)
        pool2 = self.max_pool(relu2)
        
        # Flatten and fully connected layer
        flattened = pool2.reshape(pool2.shape[0], -1)
        output = np.dot(flattened, self.fc_weights) + self.fc_bias.T
        
        return output
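The size of the fully connected layer, `32 * 22 * 22`, follows from how each stage of `forward` shrinks the image. The short trace below is a sketch that assumes 96x96 inputs (an assumption, not stated in the notebook), 3x3 convolutions without padding, and 2x2 pooling; the helper names are illustrative.

```python
def conv_out(size, kernel=3):
    """Output width of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Output width of non-overlapping max pooling (floor division)."""
    return size // window

size = 96  # assumed input width and height
trace = [size]
for _ in range(2):  # two conv + pool stages, as in forward()
    size = conv_out(size)
    trace.append(size)
    size = pool_out(size)
    trace.append(size)

print(trace)             # [96, 94, 47, 45, 22]
print(32 * size * size)  # 15488 flattened features
```

Under these assumptions the second pooling layer outputs 22x22 maps with 32 channels, which matches the first dimension of `fc_weights`.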

Since system memory was an issue on my workstation, I ran the model in batches on a subset of the image data and then validated on a reserved portion of the training data.

In [2]:
def process_in_batches(images, batch_size=32):
    for i in range(0, len(images), batch_size):
        yield images[i:i+batch_size]


In [3]:
import numpy as np

import simple_cnn  # project module holding the NumPy CNN class shown above

# Shuffle and split the data (75% train, 25% validation)
indices = np.arange(len(train_images))
np.random.shuffle(indices)
split = int(0.75 * len(train_images))

train_indices = indices[:split]
val_indices = indices[split:]

X_train = train_images[train_indices]
y_train = train_labels[train_indices]
X_val = train_images[val_indices]
y_val = train_labels[val_indices]

# Create CNN instance
cnn = simple_cnn.CNN()

# Forward pass over the training split, one batch at a time
batch_predictions = []
n_batches = -(-len(X_train) // 32)  # ceiling division
for i, batch in enumerate(process_in_batches(X_train), start=1):
    print(f"Batch {i} of {n_batches}")
    batch_predictions.append(cnn.forward(batch))

predictions = np.concatenate(batch_predictions, axis=0)

# Evaluate on the validation set, also in batches to limit memory use
val_scores = np.concatenate([cnn.forward(b) for b in process_in_batches(X_val)], axis=0)
# forward() returns raw scores of shape (N, 1); flatten so the comparison with y_val is elementwise
val_predictions = (val_scores > 0.5).astype(int).ravel()

accuracy = np.mean(val_predictions == y_val)
print(f"Validation accuracy: {accuracy * 100:.2f}%")
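One detail worth checking when computing accuracy this way is array shape. The model returns scores of shape `(N, 1)`, while the labels have shape `(N,)`, and comparing the two directly broadcasts to an `(N, N)` matrix instead of an elementwise result. The toy values below are made up purely to illustrate the effect.

```python
import numpy as np

scores = np.array([[0.9], [0.1], [0.7], [0.2]])  # model output, shape (4, 1)
labels = np.array([1, 0, 0, 0])                  # ground truth, shape (4,)

preds = (scores > 0.5).astype(int)

wrong = preds == labels          # broadcasts to shape (4, 4)
right = preds.ravel() == labels  # elementwise, shape (4,)

print(wrong.shape, wrong.mean())
print(right.shape, right.mean())
```

Flattening the predictions with `ravel()` before the comparison gives the intended per-sample accuracy.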

## TensorFlow Model

In [4]:
import tensorflow as tf

class CNN(tf.keras.Model):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = tf.keras.layers.Conv2D(16, 3, activation='relu')
        self.pool1 = tf.keras.layers.MaxPooling2D()
        self.conv2 = tf.keras.layers.Conv2D(32, 3, activation='relu')
        self.pool2 = tf.keras.layers.MaxPooling2D()
        self.flatten = tf.keras.layers.Flatten()
        self.fc = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.pool2(x)
        x = self.flatten(x)
        return self.fc(x)

2024-09-28 14:58:23.247292: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-28 14:58:23.265238: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-28 14:58:23.270384: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-28 14:58:23.283950: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
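The model ends in a single sigmoid unit and is trained with binary cross-entropy. As a minimal NumPy sketch of what that loss measures (not Keras's own implementation; the clipping value `eps` is an assumption chosen for numerical safety):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])  # mostly correct, high confidence
unsure = np.array([0.5, 0.5, 0.5, 0.5])     # no information

print(binary_cross_entropy(y_true, confident))
print(binary_cross_entropy(y_true, unsure))
```

A model that always predicts 0.5 scores ln 2 (about 0.693), so a training loss near that value suggests the network has not yet learned much beyond chance.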


In [None]:
import tf_cnn  # project module holding the Keras CNN class shown above

# train_images, train_labels, and test_images are loaded in the Data Imports cell

# Create and compile the model
model = tf_cnn.CNN()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model, holding out the last 25% of the samples for validation
model.fit(train_images, train_labels, epochs=10, validation_split=0.25, batch_size=32)

# Save the trained weights; a subclassed model is restored by rebuilding
# the class and loading these weights
model.save_weights('cancer_detection_model.weights.h5')


In [None]:
## Load the model and generate predictions on test data

# Rebuild the model, run one dummy batch to create its variables, then restore the weights
loaded_model = tf_cnn.CNN()
loaded_model(tf.zeros((1, *test_images.shape[1:])))
loaded_model.load_weights('cancer_detection_model.weights.h5')

# Generate predictions
predictions = loaded_model.predict(test_images)
predictions = (predictions > 0.5).astype(int).flatten()

# Create a DataFrame with the predictions (same os.listdir order used to load test_images)
test_filenames = [f.split('.')[0] for f in os.listdir(test_data_path) if f.endswith('.tif')]
results_df = pd.DataFrame({
    'id': test_filenames,
    'label': predictions
})

# Save the results to a CSV file
results_df.to_csv('predictions.csv', index=False)

print("Predictions saved to predictions.csv")

## Results and Refinements
Because of the memory limits on my machine, and because I still need to sort out my GPU setup, I could only train on a subset of the data, yet the model still achieved about 70% accuracy on the test data. The NumPy CNN does not use hardware acceleration, which highlights the importance of parallelization when training models: each batch took about 5 seconds to run, and there were hundreds of batches even with only a subset of the data. With TensorFlow, I could train on a much larger portion of the data and achieve better accuracy.
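One possible refinement for the NumPy model, short of using a GPU, is to replace the per-pixel Python loops in the convolution with a single vectorized call. The sketch below is an illustration and not the code used in this project: it gathers all windows with `numpy.lib.stride_tricks.sliding_window_view` (NumPy 1.20 or later) and contracts them with `np.einsum`, then checks the result against a loop-based reference on small random data. The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_loop(x, kernel, bias):
    """Reference: unpadded convolution with explicit loops over output pixels."""
    kh, kw, _, f = kernel.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((x.shape[0], h_out, w_out, f), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + kh, j:j + kw, :, np.newaxis]
            out[:, i, j, :] = np.sum(patch * kernel[np.newaxis], axis=(1, 2, 3)) + bias
    return out

def conv2d_vectorized(x, kernel, bias):
    """Same result, with all windows gathered at once and contracted by einsum."""
    kh, kw = kernel.shape[:2]
    # windows has shape (N, H_out, W_out, C, kh, kw)
    windows = sliding_window_view(x, (kh, kw), axis=(1, 2))
    return np.einsum('nhwcij,ijcf->nhwf', windows, kernel) + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8, 3)).astype(np.float32)
k = rng.standard_normal((3, 3, 3, 4)).astype(np.float32)
b = np.zeros(4, dtype=np.float32)

assert np.allclose(conv2d_loop(x, k, b), conv2d_vectorized(x, k, b), atol=1e-4)
```

Moving the inner loops into compiled NumPy routines typically reduces the per-batch time, though it does not remove the need for a GPU-backed framework on a dataset of this size.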