# Midterm Project

In this project, you will develop a deep learning model to classify lung nodules as benign or malignant from 3D CT scans, utilizing the LUNA16 dataset. This task involves data preprocessing, model design, training, and evaluation, offering hands-on experience with medical image analysis and deep learning in PyTorch.

In [2]:
import pandas as pd
import os
import glob
import copy
import time
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import numpy as np
from tqdm.notebook import tqdm
from collections import namedtuple
import SimpleITK as sitk
import gzip
from cassandra.cqltypes import BytesType
from diskcache import FanoutCache, Disk, core
from diskcache.core import io, MODE_BINARY
from io import BytesIO
import matplotlib.pyplot as plt



In [3]:
# pip install SimpleITK
# pip install cassandra-driver
# pip install diskcache

In [4]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Load Annotation Data

1.1 Download the numerical dataset from Kaggle. Follow the steps in Week07_02 notebook.

First, download `kaggle.json` from your Kaggle account settings under API Token. Then run the cell below and upload the file.

In [5]:
import os

# Upload kaggle.json to Colab environment
from google.colab import files
files.upload()

# Move the kaggle.json file to the required directory
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Change the permissions of the file
!chmod 600 ~/.kaggle/kaggle.json


Saving kaggle.json to kaggle (2).json


In [6]:
# !kaggle datasets download -d avc0706/luna16 -p "/content/drive/My Drive"


In [7]:
# !unzip -q "/content/drive/My Drive/luna16.zip" -d "/content/drive/My Drive/luna16"


Missing Candidates Data

In [8]:
# !kaggle datasets download -d mashruravi/luna16missingcandidates -p "/content/drive/My Drive"


In [9]:
# !unzip -q "/content/drive/My Drive/luna16missingcandidates.zip" -d "/content/drive/My Drive/luna16"


Annotations Data

In [10]:
csv_file_path = "/content/drive/MyDrive/luna16/annotations.csv"

annotations = pd.read_csv(csv_file_path)

annotations.shape


(1186, 5)

1.2 Load the `candidates_V2.csv` file as a data frame. Display the first 5 rows.

In [11]:

csv_file_path = "/content/drive/MyDrive/luna16/candidates_V2/candidates_V2.csv"

df_candidates_V2 = pd.read_csv(csv_file_path)

df_candidates_V2.shape


(754975, 5)

In [12]:
df_candidates_V2.head(5)


|   | seriesuid | coordX | coordY | coordZ | class |
|---|---|---|---|---|---|
| 0 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | 68.42 | -74.48 | -288.7 | 0 |
| 1 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | -95.209361 | -91.809406 | -377.42635 | 0 |
| 2 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | -24.766755 | -120.379294 | -273.361539 | 0 |
| 3 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | -63.08 | -65.74 | -344.24 | 0 |
| 4 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | 52.946688 | -92.688873 | -241.067872 | 0 |


1.3 Display the number of class 0 records and the number of class 1 records.





In [13]:
class_counts = df_candidates_V2['class'].value_counts()

print("Number of class 0 records:", class_counts[0])
print("Number of class 1 records:", class_counts[1])

Number of class 0 records: 753418
Number of class 1 records: 1557
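The two counts above show how imbalanced the candidate list is. As a rough sketch in plain Python (the counts are copied from the output above; the variable names are illustrative), the positive fraction and the accuracy of a trivial classifier that always predicts class 0 can be worked out like this:

```python
# Class counts copied from the output of the cell above
n_negative = 753418
n_positive = 1557
n_total = n_negative + n_positive

# Fraction of candidates that are actual nodules (class 1)
positive_fraction = n_positive / n_total

# Accuracy of a trivial baseline that always predicts class 0
baseline_accuracy = n_negative / n_total

# Number of class 0 records for every class 1 record
imbalance_ratio = n_negative / n_positive

print(f"Positive fraction     : {positive_fraction:.4%}")
print(f"Baseline accuracy     : {baseline_accuracy:.4%}")
print(f"Negatives per positive: {imbalance_ratio:.1f}")
```

Any model evaluated on data with this class distribution needs to beat the baseline accuracy before its accuracy says anything about nodule detection.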


## 2. Extract Nodule Data

2.1 Load the `candidates_processed.csv` file created from Week07_02 notebook as a data frame. Display its first 5 rows.

In [14]:

csv_file_path = "/content/drive/MyDrive/luna16/candidates.csv"

df_candidates = pd.read_csv(csv_file_path)

df_candidates.shape

(551065, 5)

In [15]:
df_candidates.head(5)


|   | seriesuid | coordX | coordY | coordZ | class |
|---|---|---|---|---|---|
| 0 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | -56.08 | -67.85 | -311.92 | 0 |
| 1 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | 53.21 | -244.41 | -245.17 | 0 |
| 2 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | 103.66 | -121.8 | -286.62 | 0 |
| 3 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | -33.66 | -72.75 | -308.41 | 0 |
| 4 | 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... | -32.25 | -85.36 | -362.51 | 0 |


2.2 Create a list of the distinct `seriesuid` values in the candidates_processed data frame. Display the length of the list and the first 5 `seriesuid` values.

In [16]:
distinct_seriesuid=df_candidates['seriesuid'].unique().tolist()
print("Distinct Seriesuid's Length: ",len(distinct_seriesuid))
distinct_seriesuid[:5]

Distinct Seriesuid's Length:  888


['1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860',
 '1.3.6.1.4.1.14519.5.2.1.6279.6001.100332161840553388986847034053',
 '1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793540579077826395208',
 '1.3.6.1.4.1.14519.5.2.1.6279.6001.100530488926682752765845212286',
 '1.3.6.1.4.1.14519.5.2.1.6279.6001.100620385482151095585000946543']

2.3 Load the `subset0.zip` from Google Drive using the file ID '1OFa8UhDvCrcTj1VkFLa7RjifEqMD4TAa'. Extract the zip file to reveal the .mhd and .raw files.

In [17]:
import os

folder_path = '/content/drive/MyDrive/luna16/subset0/subset0'

raw_files = []
mhd_files = []

for file_name in os.listdir(folder_path):
    if file_name.endswith('.raw'):
        raw_files.append(file_name)
    elif file_name.endswith('.mhd'):
        mhd_files.append(file_name)

# Sort the lists
raw_files.sort()
mhd_files.sort()

# Display raw and mhd files side by side
for raw_file, mhd_file in zip(raw_files, mhd_files):
    print(f"Raw File: {raw_file} | MHD File: {mhd_file}")


Raw File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.105756658031515062000744821260.raw | MHD File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.105756658031515062000744821260.mhd
Raw File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.108197895896446896160048741492.raw | MHD File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.108197895896446896160048741492.mhd
Raw File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.109002525524522225658609808059.raw | MHD File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.109002525524522225658609808059.mhd
Raw File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.111172165674661221381920536987.raw | MHD File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.111172165674661221381920536987.mhd
Raw File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.122763913896761494371822656720.raw | MHD File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.122763913896761494371822656720.mhd
Raw File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.124154461048929153767743874565.raw | MHD File: 1.3.6.1.4.1.14519.5.2.1.6279.6001.124154461048929153767743874565.mhd
Raw File: 1.3.6.1.4.1.14519.5.2.1.6279.6

2.4 Write a double for-loop to extract the CT scan data for **the first 15,000** nodules:
- The outer for-loop goes through all the distinct `seriesuid` values.
- For each iteration of the outer loop, load the corresponding CT-scan file and create a torch tensor to represent the scan.
- Create an inner-loop that goes through the nodules corresponding to the seriesuid:
    - Load the (index, row, col) tuple of this nodule from the data frame.
    - Extract a 32x48x48 chunk centered at the (index, row, col). If the nodule is near the edge of the image and there are not enough indices to extract, pad with zeros to keep the overall shape unchanged.
    - Use a 4D tensor to contain all the 32x48x48 chunks. The first dimension of the 4D tensor is the index of nodule.

You may modify the above procedure as you like. Make sure that you are able to obtain a 4D tensor that contains all nodule data. **Display the shape of the 4D tensor.** The shape of the tensor should be (15000, 32, 48, 48).

**Remark** Due to the memory limit, it is impossible to load all nodule images into memory simultaneously. Therefore, the number of nodules required in this section is reduced to 15,000.
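The zero-padded extraction described in the steps above can be sketched with NumPy as follows. This is a minimal illustration, not the implementation used later in this notebook: `extract_padded_chunk` is a hypothetical helper, and a small toy array stands in for a real CT volume so the example runs without the dataset.

```python
import numpy as np

def extract_padded_chunk(volume, center_irc, dims_irc=(32, 48, 48)):
    """Crop a `dims_irc` chunk centered at `center_irc`, zero-padding outside the volume."""
    chunk = np.zeros(dims_irc, dtype=volume.dtype)
    src_slices, dst_slices = [], []
    for axis, center in enumerate(center_irc):
        # Requested window along this axis (may extend past the volume)
        start = int(center) - dims_irc[axis] // 2
        end = start + dims_irc[axis]
        # Part of the window that actually lies inside the volume
        src_start = max(start, 0)
        src_end = max(min(end, volume.shape[axis]), src_start)
        src_slices.append(slice(src_start, src_end))
        # Where that part lands inside the output chunk
        dst_slices.append(slice(src_start - start, src_end - start))
    chunk[tuple(dst_slices)] = volume[tuple(src_slices)]
    return chunk

# Toy volume with strictly positive values, so padded zeros are easy to spot
volume = np.arange(1, 4 * 6 * 6 + 1, dtype=np.float32).reshape(4, 6, 6)
centers = [(0, 0, 0), (2, 3, 3)]

# Stack the chunks into a 4D array whose first dimension is the nodule index
chunks = np.stack([extract_padded_chunk(volume, c, dims_irc=(4, 4, 4)) for c in centers])
print(chunks.shape)
```

With real data the same stacking step would give the required (15000, 32, 48, 48) shape, and `torch.from_numpy` converts the result to a tensor. Padding with 0 follows the task statement, although 0 is not the value of air in a CT scan.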

In [18]:
with open('/content/drive/MyDrive/luna16/missing.txt', 'r') as f:
    # One series UID per line; strip the trailing newline
    missing_ids = {uid.strip() for uid in f}

diameters = {}

# Loop through every annotation
for _, row in annotations.iterrows():

    center_xyz = (row.coordX, row.coordY, row.coordZ)

    diameters.setdefault(row.seriesuid, []).append(
        (center_xyz, row.diameter_mm)
    )

In [19]:
%%time
CandidateInfoTuple = namedtuple(
    'CandidateInfoTuple',
    ['is_nodule', 'diameter_mm', 'series_uid', 'center_xyz']
)

candidates = []
for _, row in df_candidates.iterrows():
    candidate_center_xyz = (row.coordX, row.coordY, row.coordZ)
    candidate_diameter = 0.0
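    for annotation in diameters.get(row.seriesuid, []):

        annotation_center_xyz, annotation_diameter = annotation
        # Match only if the candidate is within a quarter of the diameter on every axis
        for i in range(3):
            delta = abs(candidate_center_xyz[i] - annotation_center_xyz[i])
            if delta > annotation_diameter / 4:
                break
        else:
            # No axis exceeded the tolerance: use this annotation's diameter
            candidate_diameter = annotation_diameter
            break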
    candidates.append(CandidateInfoTuple(
        bool(row['class']),
        candidate_diameter,
        row.seriesuid,
        candidate_center_xyz
    ))
candidates.sort(reverse=True)


CPU times: user 1min 4s, sys: 553 ms, total: 1min 5s
Wall time: 1min 6s
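The matching rule in the cell above (a candidate takes an annotation's diameter only when its center is within a quarter of that diameter on all three axes) can be isolated in a small function. This is an illustrative sketch; `match_diameter` and the sample coordinates are made up for the example.

```python
def match_diameter(candidate_xyz, annotations):
    """Return the diameter of the first annotation close to the candidate, else 0.0.

    `annotations` is a list of (center_xyz, diameter_mm) pairs for one series.
    """
    for annotation_xyz, diameter_mm in annotations:
        # Every axis must lie within a quarter of the annotated diameter
        if all(abs(c - a) <= diameter_mm / 4
               for c, a in zip(candidate_xyz, annotation_xyz)):
            return diameter_mm
    return 0.0

# One hypothetical annotation: center at the origin, diameter 8 mm (tolerance 2 mm)
sample_annotations = [((0.0, 0.0, 0.0), 8.0)]

near = match_diameter((1.0, 1.0, -1.5), sample_annotations)  # inside the tolerance
far = match_diameter((3.0, 0.0, 0.0), sample_annotations)    # x is off by 3 mm
print(near, far)
```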


In [20]:
filtered_candidates = list(filter(lambda x: x.series_uid not in missing_ids, candidates))

print(f'All candidates in dataset: {len(candidates)}')
print(f'Candidates with CT scan  : {len(filtered_candidates)}')

All candidates in dataset: 551065
Candidates with CT scan  : 275358


In [21]:
class LunaDataset(Dataset):
    def __init__(self, is_validation_set=False, validation_stride=0, is_testing_set=False):
        '''Create a PyTorch dataset of CT scan candidates.

        If `is_validation_set` is `True`, only every `validation_stride`-th item is kept.
        Otherwise, if `validation_stride` is nonzero, every `validation_stride`-th item is deleted.
        `is_testing_set` has no effect of its own: with the default arguments, all items are kept.
        '''
        self.candidates = copy.copy(filtered_candidates[::350])

        # If this is the validation set, keep every `validation_stride` item
        if is_validation_set:
            self.candidates = self.candidates[::validation_stride]

        # If this is the training set, delete every `validation_stride` item
        elif validation_stride != 0:
            del self.candidates[::validation_stride]

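        # The testing set keeps all items: with `is_validation_set=False` and
        # `validation_stride=0`, nothing above is removed, so this flag is a no-op
        if is_testing_set:
            pass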

    def __len__(self):
        '''Returns the number of items in the dataset'''
        return len(self.candidates)

    def __getitem__(self, i):
        '''Get the `i`th item in the dataset'''

        # Get the `i`th candidate
        candidate = self.candidates[i]

        # Crop each candidate to a chunk of this (index, row, col) size
        dims_irc = (10, 18, 18)

        # Use the utility function to fetch the CT scan
        ct_scan_np = getCtScanChunk(candidate.series_uid, candidate.center_xyz, dims_irc)

        # Convert the CT scan to a tensor
        ct_scan_tensor = torch.from_numpy(ct_scan_np).to(torch.float32).unsqueeze(0)

        # Convert the target to a tensor
        label_tensor = torch.tensor([
            not candidate.is_nodule,
            candidate.is_nodule
        ], dtype=torch.long)

        return ct_scan_tensor, label_tensor


In [22]:

validation_stride = 10
batch_size = 32

# Create training dataset and dataloader
train_ds = LunaDataset(is_validation_set=False, validation_stride=validation_stride)
train_dl = DataLoader(train_ds, batch_size=batch_size, num_workers=0)

# Create validation dataset and dataloader
val_ds = LunaDataset(is_validation_set=True, validation_stride=validation_stride)
val_dl = DataLoader(val_ds, batch_size=batch_size, num_workers=0)

# Create testing dataset and dataloader
test_ds = LunaDataset(is_validation_set=False, is_testing_set=True)
test_dl = DataLoader(test_ds, batch_size=batch_size, num_workers=0)
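To see what the stride-based split produces, the slicing logic of `LunaDataset.__init__` can be replayed on plain integers. This is a sketch only: `stride_split` is a hypothetical helper, and the integers stand in for the candidates kept by `filtered_candidates[::350]` (275358 candidates, per the output above).

```python
def stride_split(items, validation_stride):
    """Split items the way LunaDataset does: every stride-th item goes to validation."""
    items = list(items)
    validation = items[::validation_stride]
    training = list(items)
    del training[::validation_stride]
    return training, validation

# Stand-ins for the subsampled candidates: one index per kept candidate
subsample = list(range(0, 275358, 350))
train_items, val_items = stride_split(subsample, validation_stride=10)

print(len(subsample), len(train_items), len(val_items))
```

The training and validation parts are disjoint. The test dataset above is built with the default `validation_stride=0`, so it contains the whole subsample, including every training and validation item.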


In [23]:
class GzipDisk(Disk):
    def store(self, value, read, key=None):
        if type(value) is BytesType:
            if read:
                value = value.read()
                read = False

            str_io = BytesIO()
            gz_file = gzip.GzipFile(mode='wb', compresslevel=1, fileobj=str_io)

            for offset in range(0, len(value), 2**30):
                gz_file.write(value[offset:offset+2**30])
            gz_file.close()

            value = str_io.getvalue()

        return super(GzipDisk, self).store(value, read)


    def fetch(self, mode, filename, value, read):
        value = super(GzipDisk, self).fetch(mode, filename, value, read)

        if mode == MODE_BINARY:
            str_io = BytesIO(value)
            gz_file = gzip.GzipFile(mode='rb', fileobj=str_io)
            read_csio = BytesIO()

            while True:
                uncompressed_data = gz_file.read(2**30)
                if uncompressed_data:
                    read_csio.write(uncompressed_data)
                else:
                    break

            value = read_csio.getvalue()

        return value

def getCache(scope_str):
    return FanoutCache('data-unversioned/cache/' + scope_str,
                       disk=GzipDisk,
                       shards=64,
                       timeout=1,
                       size_limit=3e11,
                       )

raw_cache = getCache('ct_scan_raw')

@raw_cache.memoize(typed=True)
def getCtScanChunk(series_uid, center_xyz, dims_irc):
    # Locate the .mhd header file for this series
    filepaths = glob.glob(f'/content/drive/MyDrive/luna16/subset*/*/{series_uid}.mhd')
    assert len(filepaths) != 0, f'CT scan with seriesuid {series_uid} not found!'
    mhd_file_path = filepaths[0]

    # Load the scan and clip the voxel values in place to [-1000, 1000]
    mhd_file = sitk.ReadImage(mhd_file_path)
    ct_scan = np.array(sitk.GetArrayFromImage(mhd_file), dtype=np.float32)
    ct_scan.clip(-1000, 1000, ct_scan)

    origin_xyz = mhd_file.GetOrigin()
    voxel_size_xyz = mhd_file.GetSpacing()
    direction_matrix = np.array(mhd_file.GetDirection()).reshape(3, 3)

    origin_xyz_np = np.array(origin_xyz)
    voxel_size_xyz_np = np.array(voxel_size_xyz)

    # Convert the center from (x, y, z) coordinates to (index, row, col) voxel indices
    cri = ((center_xyz - origin_xyz_np) @ np.linalg.inv(direction_matrix)) / voxel_size_xyz_np
    cri = np.round(cri)
    irc = (int(cri[2]), int(cri[1]), int(cri[0]))

    # Build one slice per axis; near an edge the window is shifted inward
    # (not zero-padded) so the chunk keeps the requested shape
    slice_list = []
    for axis, center_val in enumerate(irc):

        start_index = int(round(center_val - dims_irc[axis]/2))
        end_index = int(start_index + dims_irc[axis])

        if start_index < 0:
            start_index = 0
            end_index = int(dims_irc[axis])

        if end_index > ct_scan.shape[axis]:
            end_index = ct_scan.shape[axis]
            start_index = int(ct_scan.shape[axis] - dims_irc[axis])

        slice_list.append(slice(start_index, end_index))

    ct_scan_chunk = ct_scan[tuple(slice_list)]

    return ct_scan_chunk
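The coordinate conversion inside `getCtScanChunk` can be checked in isolation. The sketch below reuses the same formula in a hypothetical helper, `xyz_to_irc`, with made-up origin and spacing values and an identity direction matrix, so no CT file is needed.

```python
import numpy as np

def xyz_to_irc(center_xyz, origin_xyz, voxel_size_xyz, direction_matrix):
    """Convert an (x, y, z) position to (index, row, col) voxel indices."""
    center = np.asarray(center_xyz, dtype=float)
    origin = np.asarray(origin_xyz, dtype=float)
    spacing = np.asarray(voxel_size_xyz, dtype=float)
    direction = np.asarray(direction_matrix, dtype=float).reshape(3, 3)

    # Offset from the origin, undo the axis orientation, then divide by voxel size
    cri = ((center - origin) @ np.linalg.inv(direction)) / spacing
    cri = np.round(cri).astype(int)

    # The array is ordered (index, row, col), i.e. (z, y, x), so reverse the order
    return int(cri[2]), int(cri[1]), int(cri[0])

# Made-up metadata: 0.5 mm in-plane spacing, 2 mm between slices, axis-aligned scan
irc = xyz_to_irc(
    center_xyz=(-90.0, -80.0, -280.0),
    origin_xyz=(-100.0, -100.0, -300.0),
    voxel_size_xyz=(0.5, 0.5, 2.0),
    direction_matrix=np.eye(3),
)
print(irc)
```

Here the center lies 10 mm, 20 mm, and 20 mm from the origin along x, y, and z, which corresponds to column 20, row 40, and slice index 10.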

## 3. Model Design and Implementation

3.1 Design a neural network model with only a flatten layer and several dense layers for classifying lung nodules.

In [24]:
class LunaModel(nn.Module):

    def __init__(self):

        super().__init__()

        self.conv1 = nn.Conv3d(1, 32, kernel_size=3, padding=1, bias=True)
        self.relu1 = nn.ReLU()
        self.maxpool1 = nn.MaxPool3d(2)

        self.conv2 = nn.Conv3d(32, 64, kernel_size=3, padding=1, bias=True)
        self.relu2 = nn.ReLU()
        self.maxpool2 = nn.MaxPool3d(2)

        self.flatten = nn.Flatten()

        self.fc1 = nn.Linear(2048, 1024)
        self.relu3 = nn.ReLU()

        self.dropout = nn.Dropout(0.2)

        self.fc2 = nn.Linear(1024, 2)

    def forward(self, X):
        X = self.maxpool1(self.relu1(self.conv1(X)))
        X = self.maxpool2(self.relu2(self.conv2(X)))

        X = self.flatten(X)

        X = self.relu3(self.fc1(X))
        X = self.dropout(X)

        return self.fc2(X)


In [25]:
model = LunaModel()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
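The input size of `fc1` (2048) follows from the chunk size used in `LunaDataset` and the two pooling layers. A small pure-Python sketch of that arithmetic (`flattened_features` is an illustrative helper, not part of the model):

```python
def flattened_features(dims_irc, out_channels=64, n_pool_layers=2):
    """Number of features after the conv/pool stack in LunaModel is flattened."""
    size = list(dims_irc)
    for _ in range(n_pool_layers):
        # Conv3d(kernel_size=3, padding=1) keeps each dimension unchanged;
        # MaxPool3d(2) halves it, rounding down
        size = [s // 2 for s in size]
    total = out_channels
    for s in size:
        total *= s
    return total

# Chunk size actually used in LunaDataset.__getitem__
print(flattened_features((10, 18, 18)))
# Chunk size requested in section 2
print(flattened_features((32, 48, 48)))
```

Switching the dataset to 32x48x48 chunks would therefore require changing the first argument of `nn.Linear(2048, 1024)` accordingly.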

3.2 Create an object to represent the loss function.

In [26]:
criterion = nn.CrossEntropyLoss()

3.3 Create an object to represent the optimizer.

In [27]:
optimizer = optim.AdamW(model.parameters(), weight_decay=0.1)

3.4 Create a function to represent the training loop.


In [28]:
def train_loop(model, dataloader, criterion, optimizer, ds_size):
    '''Train the model for one epoch'''

    model.train()

    running_loss = 0.0
    running_corrects = 0
    running_pos = 0
    running_pos_correct = 0
    running_neg = 0
    running_neg_correct = 0

    for inputs, labels in tqdm(dataloader):

        inputs = inputs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()

        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)

        loss = criterion(outputs, labels[:,1])
        loss.backward()

        optimizer.step()
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data[:,1])
        running_pos += labels.data[:,1].sum()
        running_pos_correct += ((preds == labels.data[:,1]) & (labels.data[:,1] == 1)).sum()

        running_neg += labels.data[:,0].sum()
        running_neg_correct += ((preds == labels.data[:,1]) & (labels.data[:,1] == 0)).sum()

    epoch_loss = running_loss / ds_size
    epoch_acc = running_corrects.double() / ds_size

    return epoch_loss, epoch_acc, (running_pos_correct, running_pos), (running_neg_correct, running_neg)


In [29]:
def eval_loop(model, dataloader, criterion, ds_size):
    '''Evaluate the model performance for one epoch'''
    model.eval()
    running_loss = 0.0
    running_corrects = 0
    running_pos = 0
    running_pos_correct = 0
    running_neg = 0
    running_neg_correct = 0

    with torch.no_grad():

        for inputs, labels in tqdm(dataloader):
            inputs = inputs.to(device)
            labels = labels.to(device)

            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels[:,1])

            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data[:,1])

            running_pos += labels.data[:,1].sum()
            running_pos_correct += ((preds == labels.data[:,1]) & (labels.data[:,1] == 1)).sum()

            running_neg += labels.data[:,0].sum()
            running_neg_correct += ((preds == labels.data[:,1]) & (labels.data[:,1] == 0)).sum()

    epoch_loss = running_loss / ds_size
    epoch_acc = running_corrects.double() / ds_size

    return epoch_loss, epoch_acc, (running_pos_correct, running_pos), (running_neg_correct, running_neg)

3.5 Execute the training loop. Display the change of training loss during the training process.

In [30]:
from tqdm import tqdm

epochs = 5

for epoch in range(epochs):
    epoch_start = time.time()

    train_loss, train_acc, train_pos, train_neg = train_loop(
        model, train_dl, criterion, optimizer, len(train_ds)
    )
    val_loss, val_acc, val_pos, val_neg = eval_loop(
        model, val_dl, criterion, len(val_ds)
    )

    time_elapsed = time.time() - epoch_start

    tqdm.write(f"Epoch: {epoch+1:02} | Epoch Time: {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s")
    tqdm.write(f"\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%")
    tqdm.write(f"\tTrain - correct pos: {train_pos[0]}/{train_pos[1]} | correct neg: {train_neg[0]}/{train_neg[1]}")
    tqdm.write(f"\tVal. Loss: {val_loss:.3f} | Val. Acc: {val_acc*100:.2f}%")
    tqdm.write(f"\tVal. - correct pos: {val_pos[0]}/{val_pos[1]} | correct neg: {val_neg[0]}/{val_neg[1]}")


100%|██████████| 23/23 [06:12<00:00, 16.21s/it]
100%|██████████| 3/3 [03:39<00:00, 73.03s/it]


Epoch: 01 | Epoch Time: 9m 52s
	Train Loss: 0.992 | Train Acc: 96.19%
	Train - correct pos: 1/2 | correct neg: 680/706
	Val. Loss: 55.305 | Val. Acc: 98.73%
	Val. - correct pos: 0/1 | correct neg: 78/78


100%|██████████| 23/23 [00:05<00:00,  4.41it/s]
100%|██████████| 3/3 [00:00<00:00, 13.59it/s]


Epoch: 02 | Epoch Time: 0m 5s
	Train Loss: 121.893 | Train Acc: 99.72%
	Train - correct pos: 0/2 | correct neg: 706/706
	Val. Loss: 1.824 | Val. Acc: 98.73%
	Val. - correct pos: 0/1 | correct neg: 78/78


100%|██████████| 23/23 [00:04<00:00,  4.79it/s]
100%|██████████| 3/3 [00:00<00:00, 13.45it/s]


Epoch: 03 | Epoch Time: 0m 5s
	Train Loss: 4.602 | Train Acc: 99.58%
	Train - correct pos: 0/2 | correct neg: 705/706
	Val. Loss: 2.914 | Val. Acc: 98.73%
	Val. - correct pos: 0/1 | correct neg: 78/78


100%|██████████| 23/23 [00:06<00:00,  3.58it/s]
100%|██████████| 3/3 [00:00<00:00, 15.41it/s]


Epoch: 04 | Epoch Time: 0m 7s
	Train Loss: 0.676 | Train Acc: 99.29%
	Train - correct pos: 0/2 | correct neg: 703/706
	Val. Loss: 1.440 | Val. Acc: 98.73%
	Val. - correct pos: 0/1 | correct neg: 78/78


100%|██████████| 23/23 [00:05<00:00,  4.24it/s]
100%|██████████| 3/3 [00:00<00:00, 14.74it/s]

Epoch: 05 | Epoch Time: 0m 6s
	Train Loss: 0.268 | Train Acc: 99.44%
	Train - correct pos: 0/2 | correct neg: 704/706
	Val. Loss: 1.204 | Val. Acc: 98.73%
	Val. - correct pos: 0/1 | correct neg: 78/78





## 4. Model Evaluation and Analysis

4.1 Obtain the model's prediction on the test set.

In [33]:
test_loss, test_acc, test_pos, test_neg = eval_loop(
    model, test_dl, criterion, len(test_ds)
)

test_acc_percentage = test_acc.item() * 100
test_correct_pos = test_pos[0].item()
test_total_pos = test_pos[1].item()
test_correct_neg = test_neg[0].item()
test_total_neg = test_neg[1].item()

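# Precision = TP / (TP + FP), where FP counts negatives predicted as positive
false_pos = test_total_neg - test_correct_neg
predicted_pos = test_correct_pos + false_pos
precision = test_correct_pos / predicted_pos if predicted_pos else 0.0
# Recall = TP / (TP + FN), i.e. the share of actual positives that were found
recall = test_correct_pos / test_total_pos if test_total_pos else 0.0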


100%|██████████| 25/25 [00:03<00:00,  7.57it/s]


In [32]:
model_path = '/content/drive/My Drive/model_luna16.pth'
torch.save(model.state_dict(), model_path)
print('Model saved to', model_path)

Model saved to /content/drive/My Drive/model_luna16.pth


4.2 Calculate and report the following metrics:
- accuracy
- precision
- recall

In [34]:
# Print test loss, accuracy, precision, and recall
print("Test Loss:", test_loss)
print("Test Accuracy:", f"{test_acc_percentage:.2f}%")

print("Precision:", f"{precision:.2f}")
print("Recall:", f"{recall:.2f}")

print("Test Correct Positive:", f"{test_correct_pos}/{test_total_pos} (Correct/Total)")
print("Test Correct Negative:", f"{test_correct_neg}/{test_total_neg} (Correct/Total)")

Test Loss: 0.2104146295763183
Test Accuracy: 99.75%
Precision: 1.00
Recall: 0.33
Test Correct Positive: 1/3 (Correct/Total)
Test Correct Negative: 784/784 (Correct/Total)
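The reported metrics can be reproduced from the four confusion counts implied by the output above (1 true positive, 2 false negatives, 784 true negatives, 0 false positives). The sketch below uses a hypothetical helper, `classification_metrics`, and also computes the F1 score, which the cell above does not report.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1), using 0.0 when a ratio is undefined."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Counts taken from the test output above
accuracy, precision, recall, f1 = classification_metrics(tp=1, fp=0, tn=784, fn=2)
print(f"Accuracy: {accuracy:.4f} | Precision: {precision:.2f} | "
      f"Recall: {recall:.2f} | F1: {f1:.2f}")
```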


4.3 Discuss the model's performance.




The model reaches 99.75% accuracy on the test set and 98.73% on the validation set, but these figures are misleading. On the test set it classifies all 784 negative candidates correctly yet finds only 1 of the 3 positive candidates (recall 0.33), and on the validation set it misses the single positive candidate in every epoch. The underlying problem is class imbalance: the training subsample contains only 2 positive candidates out of 708, so the model sees almost no examples of the positive class and scores well by predicting "negative" nearly every time. Accuracy alone therefore fails to capture the model's usefulness; precision and recall on the positive class are more informative. Note also that the test set here is the full 787-candidate subsample, which includes the training and validation items, so the test metrics are not a held-out estimate.
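One common way to counter this kind of imbalance is to weight the loss by inverse class frequency. The sketch below only computes such weights from the training counts shown in the epoch logs (706 negatives, 2 positives); `inverse_frequency_weights` is an illustrative helper, and whether weighting helps with so few positive samples has not been tested here.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by n_total / (n_classes * count), so rare classes weigh more."""
    n_total = sum(class_counts)
    n_classes = len(class_counts)
    return [n_total / (n_classes * count) for count in class_counts]

# Training counts from the epoch logs: [class 0 (negative), class 1 (positive)]
weights = inverse_frequency_weights([706, 2])
print(weights)
```

Weights like these could be converted to a float tensor and passed to the `weight` argument of `nn.CrossEntropyLoss`, so that an error on a positive sample costs far more than an error on a negative one.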