---
title: "Cervical Cancer Risk Prediction"
subtitle: "Proposal"
author: 
  - name: 'Team Okhawere   Team Member: Kennedy'
    affiliations:
      - name: "College of Information Science, University of Arizona"
description: "Info 523 Final Project"
format:
  html:
    code-tools: true
    code-overflow: wrap
    code-line-numbers: true
    embed-resources: true
bibliography: References/references.bib
editor: visual
code-annotations: hover
execute:
  warning: false
jupyter: python3
---

In [None]:
#| label: load-pkgs
#| echo: false
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Research Goal

To build a predictive model that identifies women at risk of cervical cancer using demographic, social and behavioral history as well as clinical factors from the UCI Cervical Cancer Risk Factors dataset.

## Goals and Motivation

Cervical cancer remains a major global health issue, especially in low-resource settings where regular screening is less accessible. Early identification of high-risk individuals can enable timely intervention and reduce mortality [@gopalkrishnan2025cervical]. This project aims to develop a data-driven, machine learning–based predictive model to assess the likelihood of cervical cancer based on known risk factors such as sexual history, contraceptive use, smoking, and STD history.

## Questions

1.  Can we accurately predict cervical cancer biopsy outcomes using only non-invasive risk factors and lifestyle history?
2.  Which factors contribute the most to cervical cancer risk?

The rationale for choosing these questions is that, despite the availability of HPV vaccination for prevention, the occurrence of cervical cancer is still rising. Non-invasive screening approaches other than biopsy are therefore still needed for early detection, even in patients who have been vaccinated.

## Dataset

In [None]:
#| label: load-dataset

def load_data(path):
  """
  Load a CSV file into a DataFrame.

  Parameters:
     path : str
        Path to the CSV file to be loaded

  Returns:
     DataFrame with '?' entries treated as missing
  """
  #load data and store in df
  df = pd.read_csv(path, na_values='?') #consider '?' to be missing
  
  #return dataframe
  return df

data = load_data('data/risk_factors_cervical_cancer.csv')

For this study, we will use the [UCI Cervical Cancer Risk Factors dataset](https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors) from the UCI Machine Learning Repository. The data is well suited to this analysis because it was collected in a limited-resource setting, specifically the Hospital Universitario de Caracas in Caracas, Venezuela. It contains demographic and behavioral data for women, along with results from four cervical cancer screening tests: Hinselmann, Schiller, Cytology, and Biopsy. Our goal is to assess patterns that lead to positive biopsy results, the most definitive screening measure for cervical cancer in this dataset.

This dataset was selected because it reflects real-world medical data with a variety of relevant features, including lifestyle factors (e.g., smoking, contraception), prior medical diagnoses, and social determinants of health. Overall, the dataset contains `{python} data.shape[1]` attributes.
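
Because `load_data` reads `?` as a missing value, a per-column missingness summary is a natural first check on the loaded data. The sketch below is illustrative only: it uses a small made-up CSV (the column names mirror the dataset, the values are invented) rather than the real file, and `toy` and `missing_summary` are hypothetical names.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file; the values are invented for illustration.
toy_csv = io.StringIO(
    "Age,Smokes,STDs,Biopsy\n"
    "18,0,?,0\n"
    "25,?,0,0\n"
    "34,1,?,1\n"
    "41,0,0,0\n"
)

# Same convention as load_data: '?' is read as a missing value.
toy = pd.read_csv(toy_csv, na_values="?")

# Count and percentage of missing entries per column, most-missing first.
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```

Applied to the real `data` frame, the same two expressions would show which covariates are too sparse to impute reliably.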

## Overview of the data:

### Target Variable:

The target variable is the result of the biopsy test, which is the most definitive indicator for cervical cancer in this dataset. It is a binary variable: 1 indicates cancer detected, 0 indicates no cancer. Below is the information about the target variable:

In [None]:
#| label: target-info
#| echo: false

# Information on target
target_col = data.columns[-1]

target_summary = pd.DataFrame({
    'Column Name': [target_col],
    'Non-Null Count': [data[target_col].notnull().sum()],
    'Data Type': [data[target_col].dtype]
})

display(target_summary)

### Covariates:

Possible covariates include age, number of pregnancies, age at first intercourse, smoking history, contraception use, and STD history. Below is information about these variables:

In [None]:
#| label: covariate-info
#| echo: false

# Information on covariates
covariates = pd.DataFrame({
    'Column Name': data.iloc[:, 0:32].columns,
    'Non-Null Count': data.iloc[:, 0:32].notnull().sum().values,
    'Data Type': data.iloc[:, 0:32].dtypes.values
})

display(covariates)

## Study Population

The population for this study consists of `{python} data.shape[0]` female patients from the Hospital Universitario de Caracas in Caracas, Venezuela. The majority of the patients (approximately `{python} int(round((data['Biopsy'] == 0).mean() * 100))`%) have a negative biopsy result, indicating a class imbalance in the outcome [Figure @fig-figure-1]. Patient ages range from `{python} data['Age'].min()` to `{python} data['Age'].max()`, with a notably right-skewed distribution, indicating a larger representation of younger individuals in the sample [Figure @fig-figure-2].

### Distribution of the target variable

In [None]:
#| label: fig-figure-1
#| fig-cap: Distribution of Biopsy Results in the Cohort
#| echo: false

plt.figure(figsize=(8, 6))
sns.countplot(x='Biopsy', data=data)
plt.title('Distribution of Biopsy Results in the Cohort')
plt.xlabel('Biopsy Result (0: No, 1: Yes)')
plt.ylabel('Count')
plt.show()

### Distribution of Age in the Cohort

In [None]:
#| label: fig-figure-2
#| fig-cap: Distribution of Age in the Cohort
#| echo: false

plt.figure(figsize=(8, 6))
sns.violinplot(x='Age', data=data)
plt.title('Distribution of Age in the Cohort')
plt.xlabel('Age')  # Age is plotted on the x-axis, so label that axis
plt.show()

## Analysis Plan

Missing values will be evaluated and handled with strategies such as mean/mode imputation, KNN imputation, multiple imputation, or complete removal, depending on the missingness patterns. Variables will be transformed, normalized, and encoded as appropriate. We will apply several classification algorithms and compare them using cross-validation and performance metrics such as ROC-AUC, F1-score, precision, and recall to determine the best model. Proposed algorithms include Logistic Regression, Random Forest, XGBoost, and Support Vector Machine.
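
As a rough illustration of this plan, the sketch below compares two of the proposed algorithms with stratified cross-validation and ROC-AUC. It runs on synthetic data from scikit-learn's `make_classification`, not on the study dataset. Mean imputation, the 10% missingness rate, the class weighting, and all hyperparameters are assumptions chosen only to keep the example short; `KNNImputer` or another strategy could be swapped in at the same pipeline step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (about 90% negatives).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Blank out roughly 10% of the entries to mimic missing values.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

# Imputation and scaling sit inside each pipeline so they are refit on the
# training folds only, which avoids leaking information from held-out folds.
models = {
    "logistic_regression": make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "random_forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=0
        ),
    ),
}

# Stratified folds keep the class ratio similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    for name, model in models.items()
}

for name, scores in cv_auc.items():
    print(f"{name}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Keeping preprocessing inside the pipeline is the main design choice here: each imputation strategy can then be compared under the same cross-validation folds as the classifiers.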

For feature importance and interpretation, we will use SHAP values and visualize the top contributing risk factors.
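
To make the idea concrete, the sketch below computes SHAP values by hand for the simplest case, a linear model: assuming independent features, the SHAP value of feature $i$ for one sample is $w_i (x_i - \bar{x}_i)$ in log-odds units. This is not the planned implementation. The non-linear models would need a dedicated explainer such as those in the `shap` package, and the data and feature names here are synthetic placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the feature names are hypothetical placeholders.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X, y)
weights = model.coef_.ravel()

# Linear-model SHAP values (independent-feature assumption):
# contribution of feature i = w_i * (x_i - mean(x_i)), in log-odds units.
shap_values = (X - X.mean(axis=0)) * weights

# Base value: the model's average log-odds over the background data.
base_value = model.decision_function(X).mean()

# Global importance: mean absolute SHAP value per feature, largest first.
importance = np.abs(shap_values).mean(axis=0)
ranking = sorted(
    zip(feature_names, importance), key=lambda pair: pair[1], reverse=True
)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For every sample, the base value plus the sum of its SHAP values equals the model's log-odds output, which is the additivity property that makes the per-feature contributions interpretable.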

## Proposed Timeline

Since this is a single-author project, I will be responsible for all aspects of the study, including data acquisition, preprocessing, analysis, model development, evaluation, interpretation, and reporting. The proposed timeline is as follows:

| Time | Overall Goal | Specific Tasks |
|------------------|---------------------|----------------------------------|
| **Week 1:** | Data acquisition, literature review, and initial exploration | Load the dataset and perform a thorough exploratory data analysis, with a special focus on the distribution and patterns of missing values. |
| **Week 2:** | Data cleaning, preprocessing, and feature engineering | Create new variables if necessary - Implement and compare several imputation strategies. - Split the data into training and testing sets |
| **Week 3:** | Model selection, training, and hyperparameter tuning | Train several classification models (e.g., Logistic Regression, Random Forest, Gradient Boosting) - Address class imbalance by applying techniques like SMOTE or class weighting during model training. - Use cross-validation to fine-tune model hyperparameters and select the best model. |
| **Week 4:** | Model evaluation | Evaluate the final models on the test set and compare their performance using AUC-ROC, F1-score, precision, and recall. - Perform a SHAP analysis on the best model to determine the most predictive covariates. |
| **Week 5:** | Interpretation of results, visualization, and report writing | Interpret the results, then write and finalize the project report. |

## Repo Organization

| Path/File | Purpose and Description |
|-----------------|-------------------------------------------------------|
| .github/ | Contains GitHub-specific configurations, including workflows, actions, and issue templates that automate and streamline repository management. |
| \_extra/ | Serves as a flexible storage space for miscellaneous or supplementary files that do not fit into other predefined project categories. |
| \_freeze/ | Stores frozen environment snapshots, capturing the exact package versions and setup used during specific stages of the project for reproducibility. |
| \_analysis/ | Hosts Jupyter notebooks outlining the project's analytical framework, including exploratory data analysis, modeling strategies, and evaluation plans. |
| data/ | Central repository for all raw and processed data files essential to the project, including datasets, input files, and metadata. |
| images/ | Contains visual assets such as diagrams, charts, and screenshots used throughout the project for documentation, presentations, and analysis. |
| .gitignore | Specifies files and directories to exclude from Git tracking, helping maintain a clean and efficient version control history. |
| README.md | Provides a comprehensive overview of the project, including setup instructions, usage guidelines, objectives, and scope. Serves as the project's landing document. |
| \_quarto.yml | Configuration file for Quarto, defining global settings for document rendering, output formats, and styling across all .qmd files. |
| about.qmd | Supplementary Quarto document offering background on the project’s purpose, team member bios, and contextual information. |
| index.qmd | Main Quarto document that will serve as the project's homepage, integrating code, visualizations, narrative, and final results. |
| presentation.qmd | Quarto file designed to generate the final project presentation in slideshow format, summarizing key findings and insights. |
| proposal.qmd | Initial project planning document detailing the dataset, metadata, research questions, and a week-by-week roadmap. Updated regularly to reflect progress. |
| References/ | Contains the bibliography file (references.bib) with the cited references. |
| requirements.txt | Lists all Python dependencies and their versions required to run the project, ensuring consistent environment setup across collaborators. |

## References