# Tumor Detection Project Notebook

## Dependencies
This project requires TensorFlow for building and training the neural network, OpenCV for image processing, matplotlib and seaborn for visualizations, scikit-learn for evaluation metrics, and pandas for summarizing image dimensions.

In [None]:
# Uncomment and run the following line if any of these libraries are not installed
# !pip install tensorflow opencv-python matplotlib seaborn scikit-learn pandas

## Imports and Configuration
The code begins by importing necessary libraries and setting fixed seeds for reproducibility. All relevant modules from the src folder are imported to handle data operations, model construction, training callbacks, and evaluation utilities. The directories for raw and processed data are defined along with the image size, batch size, and split ratios.

In [None]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt
import cv2
from src.data import split_data, load_datasets
from src.model import build_model, compute_class_weights, get_callbacks
from src.evaluate import plot_metrics, evaluate_model
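import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image

# Seed every random source (Python, NumPy, TensorFlow) for reproducibility
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)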

RAW_DIR = 'data_set'
PROC_DIR = 'data_set'
CATEGORIES = ['yes', 'no']
IMG_SIZE = (128, 128)
BATCH_SIZE = 16
TRAIN_RATIO = 0.8
VAL_RATIO = 0.15  # the remaining 0.05 is held out as the test set

## Data Inspection and Visualization
The first step is to inspect the raw images to understand their dimensions and to display representative samples. This check shows whether resizing is needed and surfaces any files that cannot be read.

In [None]:
import pandas as pd
image_dims = {}
for cat in CATEGORIES:
    folder = os.path.join(RAW_DIR, cat)
    for fname in os.listdir(folder):
        img = plt.imread(os.path.join(folder, fname))
        image_dims[f"{cat}/{fname}"] = img.shape
df_dims = pd.DataFrame.from_dict(
    image_dims,
    orient='index',
    columns=['height', 'width', 'channels']
)
display(df_dims.describe())
print(f"Total images: {len(df_dims)}")

The summary statistics show the variation in image height and width and confirm that all images have three color channels. The sample images displayed below illustrate typical examples from each category.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, cat in zip(axes, CATEGORIES):
    # Sort the listing so the same sample is shown on every run
    first_file = sorted(os.listdir(os.path.join(RAW_DIR, cat)))[0]
    path = os.path.join(RAW_DIR, cat, first_file)
    img = plt.imread(path)
    ax.imshow(img)
    ax.set_title(cat)
    ax.axis('off')
plt.show()
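
The inspection loop above stops at the first file that fails to load. As a minimal sketch of how unreadable files could be collected instead, the helper below tries a loader on every path and records the failures. The name `find_unreadable` and its signature are hypothetical and not part of the `src` package.

```python
from typing import Callable, Iterable, List, Tuple


def find_unreadable(
    paths: Iterable[str],
    loader: Callable[[str], object],
) -> List[Tuple[str, str]]:
    """Return (path, error message) for every file the loader cannot read.

    Hypothetical helper for illustration; `loader` is any function that
    raises an exception when a file cannot be decoded.
    """
    failures = []
    for path in paths:
        try:
            loader(path)
        except Exception as exc:  # broad on purpose: decoders raise many error types
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures
```

In this notebook the loader would presumably be `plt.imread`, for example `find_unreadable(all_paths, plt.imread)`, where `all_paths` is a list of image paths built from `RAW_DIR` and `CATEGORIES`.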

## Data Splitting
The dataset is split into training, validation, and test sets in an 80%, 15%, and 5% ratio respectively. The model learns from the training set, the validation set guides hyperparameter choices, and the test set provides a final assessment on unseen data.

In [None]:
split_data(
    source_dir=RAW_DIR,
    dest_dir=PROC_DIR,
    categories=CATEGORIES,
    train_pct=TRAIN_RATIO,
    val_pct=VAL_RATIO,
    seed=42
)
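
The internals of `split_data` are not shown in this notebook. As a rough sketch of the idea, assuming a seeded shuffle followed by slicing, the hypothetical function `split_names` below partitions a list of file names into the three sets. It does not copy or move any files.

```python
import random
from typing import Dict, List, Sequence


def split_names(
    names: Sequence[str],
    train_pct: float,
    val_pct: float,
    seed: int,
) -> Dict[str, List[str]]:
    """Shuffle names deterministically and slice them into train/val/test.

    Hypothetical illustration; the real split_data may work differently.
    """
    if not 0 < train_pct + val_pct <= 1:
        raise ValueError("train_pct + val_pct must be in (0, 1]")
    shuffled = sorted(names)  # sort first so the result ignores listing order
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_pct)
    n_val = int(len(shuffled) * val_pct)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains
    }
```

Applying such a function to each category folder separately would keep the class ratio similar across the three sets.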

## Loading Datasets
The TensorFlow data pipeline then loads images from the processed directories. The training and validation datasets are shuffled so that batches do not follow the on-disk file order, and the batch size sets how many samples are processed before each weight update.

In [None]:
train_ds, val_ds, test_ds, class_names = load_datasets(
    PROC_DIR,
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    seed=42
)
print(f"Classes detected: {class_names}")
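
Because the weights are updated once per batch, the batch size also determines how many updates happen in each epoch. The small sketch below makes that arithmetic explicit. The function name is hypothetical, and it assumes the final partial batch is kept rather than dropped.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches, and therefore weight updates, in one pass over the data.

    Assumes the last partial batch is kept (not dropped).
    """
    if n_samples < 0 or batch_size <= 0:
        raise ValueError("n_samples must be >= 0 and batch_size must be > 0")
    return math.ceil(n_samples / batch_size)
```

With `BATCH_SIZE = 16`, a training set of 200 images (an illustrative figure, not the size of this dataset) would give 13 updates per epoch, the last one on a batch of 8 images.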

## Model Building and Training
The model architecture uses a MobileNetV2 backbone pre-trained on ImageNet to extract features. Data augmentation layers randomly flip, rotate, zoom, translate, and adjust contrast to reduce overfitting. Dropout and L2 regularization are included to limit overfitting further. Training uses the Adam optimizer with binary cross-entropy loss and applies class weights to address any imbalance.

In [None]:
model = build_model(
    input_shape=(*IMG_SIZE, 3),
    dropout_rate=0.6,
    l2_rate=1e-4
)
cw = compute_class_weights(train_ds)
print(f"Class weights: {cw}")
callbacks = get_callbacks(patience=12)
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=20,
    callbacks=callbacks,
    class_weight=cw
)
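
How `compute_class_weights` derives its values is not shown here. A common choice is the "balanced" heuristic, where each class gets the weight `n_samples / (n_classes * class_count)`, so that rarer classes contribute more to the loss. The sketch below implements that heuristic on a plain list of integer labels. It is an assumption about the approach, and `balanced_class_weights` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable


def balanced_class_weights(labels: Iterable[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count).

    Hypothetical illustration of the 'balanced' heuristic; the project's
    compute_class_weights may use a different formula.
    """
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }
```

The returned dictionary maps class indices to weights, which is the shape that the `class_weight` argument of `model.fit` expects.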

## Training Metrics Interpretation
The accuracy curves for both training and validation data show how well the model learns over epochs. A steady increase in both curves with minimal divergence indicates good generalization. The loss curves complement the accuracy information by showing how the prediction error evolves during training.

In [None]:
plot_metrics(history)
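
To put a number on the divergence between the two accuracy curves, the per-epoch gap can be computed directly from the training history. This is a minimal sketch that assumes the history dictionary uses the keys `accuracy` and `val_accuracy`, which depends on how the model was compiled. The helper name is hypothetical.

```python
from typing import Dict, List


def accuracy_gap(
    history: Dict[str, List[float]],
    train_key: str = "accuracy",
    val_key: str = "val_accuracy",
) -> List[float]:
    """Return training accuracy minus validation accuracy for each epoch.

    The key names are assumptions; adjust them to the metrics actually logged.
    """
    return [t - v for t, v in zip(history[train_key], history[val_key])]
```

Calling `accuracy_gap(history.history)` yields one value per epoch. A gap that keeps growing while validation accuracy stalls is a typical sign of overfitting.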

## Detailed Evaluation
The evaluation phase generates classification reports and confusion matrices for the train, validation, and test sets. Precision and recall metrics for each class reveal whether the model tends to miss tumors or produce false alarms. The confusion matrices visually display the counts of correct and incorrect predictions.

In [None]:
evaluate_model(model, train_ds, class_names, title='Train')
evaluate_model(model, val_ds, class_names, title='Validation')
evaluate_model(model, test_ds, class_names, title='Test')
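
The link between the confusion matrix and the precision and recall figures can be made concrete with a few lines of arithmetic. The sketch below treats tumor as the positive class and takes the counts of true positives, false positives, and false negatives as inputs. The function is a hypothetical illustration and is separate from `evaluate_model`.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall for the positive class from raw counts.

    precision = tp / (tp + fp): how many predicted tumors are real.
    recall    = tp / (tp + fn): how many real tumors are found.
    Returns 0.0 for a ratio whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Low recall on the tumor class means tumors are being missed, while low precision means the model raises false alarms.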

## Save and Inference
The final model is saved in the native Keras format (`.keras`) for deployment or further use. A helper function loads and preprocesses a new image, and the model then returns a probability score. A score above 0.5 is interpreted as the presence of a tumor and a score at or below 0.5 as its absence; this threshold can be adjusted for specific use cases.

In [None]:
os.makedirs('models', exist_ok=True)
MODEL_PATH = 'models/best_model.keras'
model.save(MODEL_PATH)
print(f"Model saved at {MODEL_PATH}")

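def load_and_preprocess(img_path, img_size):
    # Resize to the training resolution and scale pixels to [0, 1];
    # this must match the preprocessing applied during training.
    img = image.load_img(img_path, target_size=img_size)
    arr = image.img_to_array(img) / 255.0
    # Add a leading batch dimension: (H, W, C) -> (1, H, W, C)
    return np.expand_dims(arr, axis=0)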

model = load_model(MODEL_PATH)
img_path = 'images.jpg'
x = load_and_preprocess(img_path, IMG_SIZE)
pred = float(model.predict(x)[0][0])
label = 'Tumor' if pred > 0.5 else 'No Tumor'
print(f"Prediction score: {pred:.4f} -> label: {label}")
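
The decision threshold does not have to stay at 0.5. As a small sketch, the hypothetical helper below makes the threshold an explicit parameter, assuming, as in the cell above, that a higher score means a tumor is more likely.

```python
def label_from_score(score: float, threshold: float = 0.5) -> str:
    """Map a probability score to a label using an adjustable threshold.

    Hypothetical helper; assumes higher scores correspond to the tumor class.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return "Tumor" if score > threshold else "No Tumor"
```

Lowering the threshold, for example `label_from_score(pred, threshold=0.3)`, flags more images as tumors, which trades extra false alarms for fewer missed cases.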

## Conclusion
This notebook has walked through each step of the tumor detection pipeline: data inspection, splitting, loading, model training, evaluation, and inference. The design choices, training behavior, and evaluation results are described alongside the code to support understanding and reproducibility.