
For our project, my group looked at a cervical cancer dataset from the UCI Machine Learning Repository.

We framed our research question as a *classification* problem: given a set of risk factors, can an algorithm determine whether or not a patient will develop cervical cancer?

In [1]:
#Installing UCIMLrepo in order to access the data: https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [2]:
#Importing necessary packages
from ucimlrepo import fetch_ucirepo
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Before analyzing the data, we needed to fetch, explore, and clean it. We looked at the summary statistics for the numeric variables and created new response variables based on our understanding of the target variables in the dataset. "yPosCount" turns the binary response variables into a numeric variable: the number of positive tests and diagnoses a patient had. "yPosAny" indicates whether a patient had any positive test or diagnosis, and "yPosTest" indicates whether a patient tested positive on any of the four tests (Hinselmann, Schiller, Citology, Biopsy).
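To make the difference between the count and the indicator concrete, here is a small sketch on made-up rows (the three toy patients are invented for illustration and are not taken from the dataset). It uses the same `sum(axis=1)` and `any(axis=1)` calls as the cell below.

```python
import pandas as pd

# Three invented patients: no positives, three positive tests, one diagnosis only
toy = pd.DataFrame({
    "DxCancer":   [0, 0, 1],
    "DxCIN":      [0, 0, 0],
    "Hinselmann": [0, 1, 0],
    "Schiller":   [0, 1, 0],
    "Citology":   [0, 0, 0],
    "Biopsy":     [0, 1, 0],
})

test_cols = ["Hinselmann", "Schiller", "Citology", "Biopsy"]
all_cols = ["DxCancer", "DxCIN"] + test_cols

# Count of positive tests and diagnoses per patient
toy["yPosCount"] = toy[all_cols].sum(axis=1)
# 1 if any of the four tests was positive, else 0
toy["yPosTest"] = toy[test_cols].any(axis=1).astype(int)

print(toy[["yPosCount", "yPosTest"]])
```

The third toy patient has a diagnosis but no positive test, so `yPosCount` is 1 while `yPosTest` stays 0.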

In [3]:
# fetch dataset
risk_factors = fetch_ucirepo(id=383)
patients = pd.concat([risk_factors.data.features,risk_factors.data.targets], axis=1).fillna(0)
patients.columns = ['Age','PartnerCount','FirstTime','PregCount','Smoker','YrsSmoking',
                  'PacksPerYr','BirthControl','BcYrs','IUD','IudYrs','HasSTDs','StdCount',
                  'StdCondy','StdCervixCondy','StdVaginalCondy','StdVulvaCondy',
                  'StdSyphilis','StdPelvis','StdHerpes','StdMollusc',
                  'StdAIDs','StdHIV','StdHepB', 'StdHPV','StdDxCount',
                  'TimeFirstStdDx','TimeLastStdDx','DxCancer','DxCIN','DxHPV','Dx',
                  'Hinselmann','Schiller','Citology','Biopsy']
feature_info = risk_factors.variables[['name']].copy()
feature_info['abrv'] = patients.columns.copy()
print(feature_info, '\n\n')

# numeric variable info
print(patients[['Age','PartnerCount','FirstTime','PregCount']].describe().drop(['25%','50%','75%'], axis='index'), '\n')
print(patients[['YrsSmoking','PacksPerYr','BcYrs','IudYrs']].describe().drop(['25%','50%','75%'], axis='index'), '\n')
print(patients[['StdCount','StdDxCount','TimeFirstStdDx','TimeLastStdDx']].describe().drop(['25%','50%','75%'], axis='index'), '\n')

#categorical response variable info
print(patients.groupby(['DxCancer']).size(), '\n\n')
print(patients.groupby(['DxCIN']).size(), '\n\n')
print(patients.groupby(['Dx']).size(), '\n\n')

# gives us a column that summarizes the others
patients['yPosCount'] = patients[['DxCancer','DxCIN','Hinselmann','Schiller','Citology','Biopsy']].sum(axis=1)
patients['yPosAny'] = patients[['DxCancer','DxCIN','Hinselmann','Schiller','Citology','Biopsy']].any(axis=1).astype(int)
patients['yPosTest'] = patients[['Hinselmann','Schiller','Citology','Biopsy']].any(axis=1).astype(int)

# Response variables to drop from the X dataset. DxHPV is included because it is
# already reflected by StdHPV and is not a full cancer diagnosis.
y_cols = ['yPosCount','yPosAny','yPosTest','DxCancer','DxCIN','DxHPV','Dx','Hinselmann','Schiller','Citology','Biopsy']

#displaying new response variables
print(patients.groupby(['yPosAny']).size(), '\n\n')
print(patients.groupby(['yPosAny','DxCancer','DxCIN']).size(), '\n\n')
print(patients.groupby(['yPosTest']).size(), '\n\n')
print(patients.groupby(['yPosCount']).size(), '\n\n')

# in X also drop all y variables
X = patients.copy().drop(y_cols, axis='columns')
y_class = patients['yPosTest']

# to give the linear regression methods a chance, we will use the aggregated column as y
y_lin = patients['yPosCount']
print('data shape (X, y_class, y_lin):', X.shape, y_class.shape, y_lin.shape)

                                  name             abrv
0                                  Age              Age
1            Number of sexual partners     PartnerCount
2             First sexual intercourse        FirstTime
3                   Num of pregnancies        PregCount
4                               Smokes           Smoker
5                       Smokes (years)       YrsSmoking
6                  Smokes (packs/year)       PacksPerYr
7              Hormonal Contraceptives     BirthControl
8      Hormonal Contraceptives (years)            BcYrs
9                                  IUD              IUD
10                         IUD (years)           IudYrs
11                                STDs          HasSTDs
12                       STDs (number)         StdCount
13                 STDs:condylomatosis         StdCondy
14        STDs:cervical condylomatosis   StdCervixCondy
15         STDs:vaginal condylomatosis  StdVaginalCondy
16  STDs:vulvo-perineal condylomatosis    StdVul

To perform deep learning for classification, we import the following packages and set the batch size, number of epochs, and learning rate.

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ----- Config -----
batch_size = 128
epochs = 5
lr = 1e-3

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

<torch._C.Generator at 0x7d55a500bcf0>

Our dataset has 858 instances (n = 858), which is hopefully large enough for deep learning.

Now we load our x-values and our y-values for classification into the train_loader.

In [10]:
X

Unnamed: 0,Age,PartnerCount,FirstTime,PregCount,Smoker,YrsSmoking,PacksPerYr,BirthControl,BcYrs,IUD,...,StdPelvis,StdHerpes,StdMollusc,StdAIDs,StdHIV,StdHepB,StdHPV,StdDxCount,TimeFirstStdDx,TimeLastStdDx
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
2,34,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
853,34,3.0,18.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
854,32,2.0,19.0,1.0,0.0,0.0,0.0,1.0,8.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
855,25,2.0,17.0,0.0,0.0,0.0,0.0,1.0,0.08,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0
856,33,2.0,24.0,2.0,0.0,0.0,0.0,1.0,0.08,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0


In [11]:
y_class

Unnamed: 0,yPosTest
0,0
1,0
2,0
3,0
4,0
...,...
853,0
854,0
855,1
856,0


In [12]:
train_loader = DataLoader(risk_factors(X, y_class, batch_size=batch_size, shuffle=True))

TypeError: 'dotdict' object is not callable
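The likely cause of this error is that `risk_factors` is the object returned by `fetch_ucirepo`, not a dataset class, so it cannot be called like a function. `DataLoader` expects a `Dataset` as its first argument, with `batch_size` and `shuffle` passed to `DataLoader` itself. A minimal sketch of one possible fix follows. It has not been run against this notebook, `make_loaders` is a hypothetical helper name, and the 80/20 train/test split is an assumption rather than something the assignment specified.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split


def make_loaders(X, y, batch_size=128, test_frac=0.2, seed=0):
    """Hypothetical helper: wrap a feature DataFrame and label Series in DataLoaders."""
    # float32 features and int64 class labels, which is what CrossEntropyLoss expects
    features = torch.tensor(X.to_numpy(dtype=np.float32))
    labels = torch.tensor(y.to_numpy(dtype=np.int64))
    dataset = TensorDataset(features, labels)

    # Hold out a fraction of the rows for evaluation (assumed split, not from the assignment)
    n_test = int(len(dataset) * test_frac)
    n_train = len(dataset) - n_test
    generator = torch.Generator().manual_seed(seed)
    train_ds, test_ds = random_split(dataset, [n_train, n_test], generator=generator)

    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
    return train_loader, test_loader
```

With the variables defined earlier, the call would be `train_loader, test_loader = make_loaders(X, y_class, batch_size=batch_size)`.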

Due to the above error, I wasn't able to get train_loader working properly. Despite my efforts, I couldn't figure out a viable workaround other than possibly testing each risk factor (each column in X) against the y-values on its own, but I wasn't able to implement that before the due date of the homework. Here is what the rest of the code would have looked like if I had managed to get it working correctly:

In [13]:
# ----- Model (exactly three Linear layers) -----
class MLP(nn.Module):
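    def __init__(self, in_features=X.shape[1], n_classes=2):
        super().__init__()
        # Input width matches the number of columns in X; two output classes for yPosTest
        self.fc1 = nn.Linear(in_features, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, n_classes)  # logits

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten to [B, in_features]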
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)            # no softmax; CrossEntropyLoss expects logits
        return x

In [14]:
model = MLP().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

In [15]:
# ----- Train / Eval loops -----
def train_one_epoch(loader):
    model.train()
    total, correct, total_loss = 0, 0, 0.0
    for x, y in loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        logits = model(x)
        loss = criterion(logits, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * x.size(0)
        preds = logits.argmax(dim=1)
        correct += (preds == y).sum().item()
        total += x.size(0)
    return total_loss/total, correct/total


In [16]:
@torch.no_grad()
def evaluate(loader):
    model.eval()
    total, correct, total_loss = 0, 0, 0.0
    for x, y in loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        logits = model(x)
        loss = criterion(logits, y)
        total_loss += loss.item() * x.size(0)
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += x.size(0)
    return total_loss/total, correct/total

In [17]:
import pandas as pd
import matplotlib.pyplot as plt

history = []

for epoch in range(1, epochs + 1):
    tr_loss, tr_acc = train_one_epoch(train_loader)
    te_loss, te_acc = evaluate(test_loader)

    history.append({
        "epoch": epoch,
        "train_loss": tr_loss,
        "test_loss": te_loss,
        "train_acc": tr_acc,
        "test_acc": te_acc,
    })

    df = pd.DataFrame(history)

    # Print a concise progress line
    print(f"Epoch {epoch:02d} | train loss {tr_loss:.4f} acc {tr_acc:.4f} | "
          f"test loss {te_loss:.4f} acc {te_acc:.4f}")

    # Plot: (1) train/test loss, (2) train/test accuracy
    plotter(df)

NameError: name 'train_loader' is not defined
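Besides `train_loader`, the loop also relies on `plotter`, which is not defined anywhere in this notebook, so the cell would stop there too once the loaders exist. A minimal sketch of what such a helper might look like is below. The function body is an assumption; it only relies on the column names that the loop stores in `history`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plotter(df):
    """Hypothetical helper: plot train/test loss and accuracy per epoch."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: loss curves
    ax_loss.plot(df["epoch"], df["train_loss"], marker="o", label="train")
    ax_loss.plot(df["epoch"], df["test_loss"], marker="o", label="test")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()

    # Right panel: accuracy curves
    ax_acc.plot(df["epoch"], df["train_acc"], marker="o", label="train")
    ax_acc.plot(df["epoch"], df["test_acc"], marker="o", label="test")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()

    fig.tight_layout()
    plt.show()
    return fig
```

Because `yPosTest` is mostly 0, accuracy alone can look high even for a model that never predicts a positive, so the loss panel is worth reading alongside it.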

Again, I wasn't able to get the code working before the due date. But I expect that if it were working, and I repeated the process using n, n/2, n/4, n/8, and n/16 of the data, the model's performance would get steadily worse as the amount of training data decreased. Deep learning models need a lot of data to be accurate, so with less data to train on, the classification model would become less accurate.
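The n, n/2, n/4, n/8, n/16 experiment could be set up roughly as follows. This is a sketch under assumptions, not code that was run for this homework: `nested_subsets` is a hypothetical helper, and it expects a PyTorch `Dataset` holding the training rows.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


def nested_subsets(dataset, divisors=(1, 2, 4, 8, 16), seed=0):
    """Hypothetical helper: map each divisor d to a Subset with len(dataset) // d rows."""
    generator = torch.Generator().manual_seed(seed)
    order = torch.randperm(len(dataset), generator=generator).tolist()
    # Prefixes of a single permutation, so every smaller subset is contained in the larger ones
    return {d: Subset(dataset, order[: len(dataset) // d]) for d in divisors}
```

Each subset would then get its own `DataLoader` and a freshly initialized model, with every run evaluated on the same held-out test set so the accuracies are comparable. With n = 858, the n/16 subset holds only about 53 patients.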