# Modeling

Here we'll build a computer vision model to detect whether a lung disease is present in an image.

## Data Loading

PyTorch loads data through `Dataset` classes, so let's first define one to serve our images during training. For now we'll use the images from a single folder to keep things simple, since the current focus is on building a proof-of-concept, user-facing app rather than the best possible model.

In [2]:
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models
from PIL import Image
from sklearn.model_selection import train_test_split

# Define the dataset class
class LungDiseaseDataset(Dataset):
    def __init__(self, dataframe, img_dir, transform=None):
        self.dataframe = dataframe
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        img_name = os.path.join(self.img_dir, self.dataframe.iloc[idx]['image_index'])
        image = Image.open(img_name).convert('RGB')
        label = self.dataframe.iloc[idx]['finding_labels']

        if self.transform:
            image = self.transform(image)

        return image, torch.tensor(label, dtype=torch.float32)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load and preprocess data
df = pd.read_csv("../data/lung_disease_labels.csv")
img_dir = r"D:\BigData\images_005\images"  # raw string so the backslashes are not read as escape sequences

The dataframe holds the file names and labels for every image folder, but since we're only using one folder for now, we need just the matching rows, i.e. the relevant subset.

In [3]:
def get_image_files(index_lst: list):
    image_files = []

    for i in index_lst:
        image_dir = rf"D:\BigData\images_00{i}\images"  # raw f-string keeps the backslashes literal
        
        image_files_for_one_folder = os.listdir(image_dir)
        
        image_files += image_files_for_one_folder
        
    return image_files

image_files = get_image_files([5])
len(image_files)

10000

You can double-check that it worked by inspecting the first few elements.

In [4]:
image_files[0:5]

['00009232_004.png',
 '00009232_005.png',
 '00009232_006.png',
 '00009232_007.png',
 '00009233_000.png']
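The folder listing above can also be written with `pathlib`, which avoids hand-built Windows paths. This is only a sketch: `list_image_files` and its `base_dir` and `pattern` parameters are hypothetical names, and it assumes the folders follow the zero-padded `images_005` pattern and contain `.png` files.

```python
from pathlib import Path


def list_image_files(base_dir, folder_indices, pattern="*.png"):
    """Collect image file names from <base_dir>/images_XXX/images folders."""
    names = []
    for i in folder_indices:
        # Assumption: folders are zero-padded to three digits (5 -> images_005)
        folder = Path(base_dir) / f"images_{i:03d}" / "images"
        if not folder.is_dir():
            raise FileNotFoundError(f"Missing image folder: {folder}")
        # Sorting makes the order reproducible; the pattern skips non-image files
        names.extend(sorted(p.name for p in folder.glob(pattern)))
    return names
```

With the layout used in this notebook, the call would be `list_image_files(r"D:\BigData", [5])`.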

Now get the subset.

In [5]:
subset_df = df[df["image_index"].isin(image_files)]
subset_df.shape

(10000, 6)

Recall from the previous notebook that we want a ratio of around 1.16 between the two classes of the binary target variable, to reflect the distribution of the full dataset.

In [8]:
subset_df["finding_labels"].value_counts()[0] / subset_df["finding_labels"].value_counts()[1]

1.2070183182520415
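To make the ratio easier to interpret, here is a small pure-Python sketch that computes the same label-0 to label-1 ratio and converts a ratio into the share of label-1 rows. The helper names `class_ratio` and `positive_fraction` and the toy labels are illustrative, not part of the notebook.

```python
from collections import Counter


def class_ratio(labels):
    """Ratio of label-0 rows to label-1 rows, as computed in the cell above."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no label-1 rows, ratio is undefined")
    return counts[0] / counts[1]


def positive_fraction(ratio):
    """Share of label-1 rows implied by a label-0 : label-1 ratio."""
    return 1.0 / (1.0 + ratio)


toy_labels = [0, 0, 0, 1, 1]  # made-up labels for illustration
print(class_ratio(toy_labels))             # 1.5
print(round(positive_fraction(1.16), 3))   # 0.463 (target from the full dataset)
print(round(positive_fraction(1.207), 3))  # 0.453 (this subset)
```

So a ratio of 1.21 instead of 1.16 corresponds to roughly one percentage point fewer label-1 rows.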

## Data Processing

Split the data into training and validation sets. The held-out portion here serves only as a validation set; since the other folders contain plenty of images, we can use those for testing later.

In [9]:
train_df, val_df = train_test_split(subset_df, test_size=0.2, random_state=42)

Now we need our transformations. The main thing to do here is resize the images so they're more manageable, but we can come back and experiment with other transformations later.

In [10]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Create datasets
train_dataset = LungDiseaseDataset(train_df, img_dir, transform=transform)
val_dataset = LungDiseaseDataset(val_df, img_dir, transform=transform)

Note that, depending on your hardware and platform, `num_workers` may need to be 0 - otherwise data loading can hang indefinitely.

In [11]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=0)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=0)
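As a quick sanity check on the loader sizes, the number of batches per epoch follows from the split and the batch size. The sketch below assumes the 80/20 split of the 10,000 rows yields 8,000 training and 2,000 validation rows; `batches_per_epoch` is a hypothetical helper.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields for one pass over the data."""
    if drop_last:
        # An incomplete final batch is discarded
        return n_samples // batch_size
    # Default behavior: the final, smaller batch is kept
    return math.ceil(n_samples / batch_size)


print(batches_per_epoch(8000, 32))  # 250 training batches
print(batches_per_epoch(2000, 32))  # 63 validation batches
print(2000 % 32)                    # 16 images in the last validation batch
```

The 250 training batches match the `Batch [10/250]` counter in the training log further down.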

The next cell just checks that the DataLoader works as intended by timing how long it takes to load one batch.

In [12]:
import time

start_time = time.time()

# Test loading one batch from train_loader
data_iter = iter(train_loader)
images, labels = next(data_iter)

end_time = time.time()
print(f"Time to load one batch: {end_time - start_time:.2f} seconds")


Time to load one batch: 0.58 seconds


Now we get into model training. We'll use a pretrained ResNet-50 and replace its final layer with a single output for binary classification.
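The training cell below pairs a single-output head with `nn.BCEWithLogitsLoss`, which takes raw logits rather than probabilities. As a rough pure-Python sketch of what that loss computes for one example (the function names here are illustrative, and PyTorch's own implementation works on whole tensors):

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(z, y):
    """Binary cross-entropy for one logit z and one label y in {0, 1}.

    Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
    rearranged into a numerically stable form.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931, i.e. ln(2)
print(round(bce_with_logits(2.0, 1.0), 4))  # 0.1269
print(round(bce_with_logits(2.0, 0.0), 4))  # 2.1269
```

A logit of 0 corresponds to a predicted probability of 0.5, so a loss near ln(2) ≈ 0.693 means the model is not yet doing better than an uninformed guess on that batch. That is a useful yardstick when reading the per-batch losses in the log.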

In [13]:
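# `pretrained=True` is deprecated in torchvision >= 0.13; these are the same ImageNet weights
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)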
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 1)  # Binary classification
model = model.to(device)

# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 2

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    epoch_start_time = time.time()

    for i, (images, labels) in enumerate(train_loader):
        batch_start_time = time.time()
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        # squeeze(1) drops only the output dimension, so a batch of size 1 keeps shape [1]
        loss = criterion(outputs.squeeze(1), labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
        if (i + 1) % 10 == 0:  # Print every 10 batches
            print(f"Epoch [{epoch+1}/{num_epochs}], "
                  f"Batch [{i+1}/{len(train_loader)}], "
                  f"Loss: {loss.item():.4f}, "
                  f"Batch Time: {time.time() - batch_start_time:.2f}s")
            
    epoch_loss = running_loss / len(train_loader)
    epoch_time = time.time() - epoch_start_time
    
    print(f"Epoch [{epoch+1}/{num_epochs}] completed, "
        f"Average Loss: {epoch_loss:.4f}, "
        f"Epoch Time: {epoch_time:.2f}s")

    # Validation
    model.eval()
    val_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for i, (images, labels) in enumerate(val_loader):
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            # squeeze(1) drops only the output dimension, so a batch of size 1 keeps shape [1]
            val_loss += criterion(outputs.squeeze(1), labels).item()
            predicted = torch.round(torch.sigmoid(outputs.squeeze(1)))
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            
            if (i + 1) % 10 == 0:  # Print every 10 batches
                print(f"Validation Batch [{i+1}/{len(val_loader)}] processed")
            

    # Report the epoch-average training loss, not the loss of the last batch
    print(f"Epoch [{epoch+1}/{num_epochs}], "
          f"Train Loss: {epoch_loss:.4f}, "
          f"Val Loss: {val_loss/len(val_loader):.4f}, "
          f"Val Accuracy: {100 * correct / total:.2f}%")




Epoch [1/2], Batch [10/250], Loss: 0.5564, Batch Time: 0.19s
Epoch [1/2], Batch [20/250], Loss: 0.7486, Batch Time: 0.19s
Epoch [1/2], Batch [30/250], Loss: 0.6566, Batch Time: 0.18s
Epoch [1/2], Batch [40/250], Loss: 0.7193, Batch Time: 0.18s
Epoch [1/2], Batch [50/250], Loss: 0.6469, Batch Time: 0.18s
Epoch [1/2], Batch [60/250], Loss: 0.6081, Batch Time: 0.18s
Epoch [1/2], Batch [70/250], Loss: 0.6939, Batch Time: 0.18s
Epoch [1/2], Batch [80/250], Loss: 0.6677, Batch Time: 0.19s
Epoch [1/2], Batch [90/250], Loss: 0.7337, Batch Time: 0.19s
Epoch [1/2], Batch [100/250], Loss: 0.7138, Batch Time: 0.20s
Epoch [1/2], Batch [110/250], Loss: 0.6079, Batch Time: 0.19s
Epoch [1/2], Batch [120/250], Loss: 0.5996, Batch Time: 0.18s
Epoch [1/2], Batch [130/250], Loss: 0.6196, Batch Time: 0.18s
Epoch [1/2], Batch [140/250], Loss: 0.6720, Batch Time: 0.18s
Epoch [1/2], Batch [150/250], Loss: 0.7051, Batch Time: 0.19s
Epoch [1/2], Batch [160/250], Loss: 0.6093, Batch Time: 0.18s
Epoch [1/2], Batc