## **Cervical Cancer Analysis**  

I am conducting an analysis on cervical cancer, driven by both personal experience and a strong interest in understanding the factors that contribute to its diagnosis and progression.  

As a cervical cancer patient, my goal is to explore data-driven insights that could aid in early detection and raise awareness about the disease. By analyzing relevant variables, I hope to identify patterns and risk factors that may help improve prevention strategies and timely diagnosis.  

### **Dataset Information**  
- **Source**: The dataset was collected at *Hospital Universitario de Caracas* in Caracas, Venezuela.  
- **Size**: It contains information on **858 patients**.  
- **Features**: The dataset includes **demographic details, lifestyle habits, and medical history records**.  
- **Missing Data**: Some patients chose not to answer certain questions due to privacy concerns, leading to missing values in the dataset. In the raw file these appear as `?`.  

This analysis aims to provide meaningful insights that contribute to a better understanding of cervical cancer and its associated risk factors.  


## **General Variables**  
- **`Age`**: *The person's age.*  
- **`Number of sexual partners`**: *The number of sexual partners the person has had.*  
- **`First sexual intercourse`**: *Age at first sexual intercourse.*  
- **`Num of pregnancies`**: *Number of pregnancies the person has had.*  

## **Risk Habits**  
- **`Smokes`**: *Whether the person smokes (Yes/No).*  
- **`Smokes (years)`**: *Number of years the person has been smoking.* **(This should be a numerical variable, but it is currently labeled as boolean.)**  
- **`Smokes (packs/year)`**: *Number of cigarette packs smoked per year.* **(Again, this should be numeric, not boolean.)**  

## **Use of Contraceptives**  
- **`Hormonal Contraceptives`**: *Whether the person uses hormonal contraceptives (Yes/No).*  
- **`Hormonal Contraceptives (years)`**: *Number of years the person has been using hormonal contraceptives.*  
- **`IUD`**: *Whether the person has used an intrauterine device (IUD) (Yes/No).*  
- **`IUD (years)`**: *Number of years with an IUD.*  

## **Sexually Transmitted Diseases (STDs)**  
- **`STDs`**: *Whether the person has had any sexually transmitted disease (Yes/No).*  
- **`STDs (number)`**: *Number of STDs the person has had.*  
- **`STDs:condylomatosis`**, **`STDs:cervical condylomatosis`**, **`STDs:vaginal condylomatosis`**, **`STDs:vulvo-perineal condylomatosis`**: *Different types of condylomas (genital warts caused by HPV).*  
- **`STDs:syphilis`**, **`STDs:pelvic inflammatory disease`**, **`STDs:genital herpes`**, **`STDs:molluscum contagiosum`**, **`STDs:AIDS`**, **`STDs:HIV`**, **`STDs:Hepatitis B`**, **`STDs:HPV`**: *Indicators of whether the person has had these specific diseases.*  
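
One plausible sanity check for this group is to compare `STDs (number)` with the row-wise sum of the individual indicator columns. This is only a sketch: it assumes `STDs (number)` is meant to count those indicators, which the dataset description does not confirm, and the helper name `std_count_mismatches` and the toy values are hypothetical.

```python
import pandas as pd


def std_count_mismatches(df, count_col, indicator_cols):
    """Return the rows where count_col differs from the sum of the indicator columns.

    Missing indicators are skipped by the sum; rows with a missing count are
    also returned, because NaN compares unequal to every number.
    """
    indicator_sum = df[indicator_cols].sum(axis=1)
    return df[df[count_col] != indicator_sum]


# Toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame(
    {
        "STDs (number)": [0.0, 1.0, 2.0],
        "STDs:syphilis": [0.0, 1.0, 1.0],
        "STDs:HIV": [0.0, 0.0, 0.0],
    }
)

mismatches = std_count_mismatches(toy, "STDs (number)", ["STDs:syphilis", "STDs:HIV"])
print(mismatches)
```

Any rows this returns on the real data would be worth inspecting by hand before deciding whether they are errors.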

## **STD Diagnosis History**  
- **`STDs: Number of diagnosis`**: *Total number of STD diagnoses received.*  
- **`STDs: Time since first diagnosis`**: *Time in years since the first STD diagnosis.*  
- **`STDs: Time since last diagnosis`**: *Time in years since the last STD diagnosis.*  

## **Medical Diagnoses**  
- **`Dx:Cancer`**: *Whether the person has been diagnosed with cancer.*  
- **`Dx:CIN`**: *Whether the person has been diagnosed with Cervical Intraepithelial Neoplasia (CIN), a precancerous cervical lesion.*  
- **`Dx:HPV`**: *Whether the person has been diagnosed with human papillomavirus (HPV).*  
- **`Dx`**: *General positive diagnosis of a related medical condition.* **(It is unclear what specific conditions this includes.)**  

## **Target Variables**  
These variables are the likely prediction targets. Each one records the result of a diagnostic test:  
- **`Hinselmann`**: *Result of the Hinselmann test (colposcopy with acetic acid).*  
- **`Schiller`**: *Result of the Schiller test (colposcopy with Lugol's iodine).*  
- **`Citology`** *(spelled this way in the dataset)*: *Result of the cytology test (Pap smear).*  
- **`Biopsy`**: *Result of the biopsy test (definitive confirmation of cancer or precancerous lesions).*  
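
Before choosing a target, it can help to see how often each test is positive. The sketch below assumes the four columns are coded 0/1 and uses the column names from the data preview, where the cytology column is spelled `Citology`. The helper name `target_summary` is hypothetical, and the demo frame only mirrors the five preview rows, in which all four tests are 0.

```python
import pandas as pd

TARGET_COLS = ["Hinselmann", "Schiller", "Citology", "Biopsy"]


def target_summary(df, target_cols=TARGET_COLS):
    """Count positives and compute the positive rate for each 0/1 target column."""
    return pd.DataFrame(
        {
            "positives": df[target_cols].sum(),
            "positive_rate": df[target_cols].mean(),
        }
    )


# Demo frame mirroring the five preview rows (every target is 0 there).
preview_targets = pd.DataFrame(0, index=range(5), columns=TARGET_COLS)

summary = target_summary(preview_targets)
print(summary)
```

On the full dataset, a low positive rate would point to class imbalance, which affects the choice of evaluation metric.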

## **Unclear Variables**  
- **`Smokes (years)` and `Smokes (packs/year)`**: *These are labeled as boolean, but they hold numeric values (for example, 37.0 in the data preview). There may be an issue with how the variables were documented or encoded.*  
- **`Dx`**: *It is unclear what type of diagnosis this refers to.*  
- **The variables `Hinselmann`, `Schiller`, `Citology`, and `Biopsy`**: *These seem to be potential target variables. If predicting cancer, `Biopsy` might be the most reliable outcome.*  
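
One way to settle the type question for the smoking columns is to coerce them to numbers and inspect the result. The sketch below assumes the columns arrive as strings, which happens when `?` placeholders are left in a column. The first five values come from the data preview; the trailing `"?"` is added purely for illustration.

```python
import pandas as pd

# First five values copied from the preview; the final "?" is an illustrative
# missing entry, not a value taken from the dataset.
raw = pd.DataFrame(
    {
        "Smokes (years)": ["0.0", "0.0", "0.0", "37.0", "0.0", "?"],
        "Smokes (packs/year)": ["0.0", "0.0", "0.0", "37.0", "0.0", "?"],
    }
)

numeric_cols = ["Smokes (years)", "Smokes (packs/year)"]

# errors="coerce" turns anything that is not a number (such as "?") into NaN.
converted = raw[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted.describe())
```

If the converted columns contain values other than 0 and 1, as the 37.0 in the preview suggests, they should be treated as numeric rather than boolean.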


In [14]:
# Imports
import zipfile
import os
import pandas as pd

In [4]:
# extract the data
zip_path = "../data/raw/cervical+cancer+risk+factors.zip"

extract_path = "../data/raw/"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)
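
As an alternative to extracting the archive to disk, the CSV could be read directly from the zip. This is a minimal sketch, not the approach used in this notebook: the helper name `read_csv_from_zip` is hypothetical, and the demo runs on a small in-memory archive so that it does not depend on the real file.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_path, member_name, **read_csv_kwargs):
    """Read one CSV member of a zip archive into a DataFrame without extracting it."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        with zip_ref.open(member_name) as member:
            return pd.read_csv(member, **read_csv_kwargs)


# Build a tiny in-memory archive for the demo (made-up contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("demo.csv", "Age,Biopsy\n18,0\n15,0\n")
buffer.seek(0)

demo = read_csv_from_zip(buffer, "demo.csv")
print(demo)
```

For the real archive, the member name would presumably be `risk_factors_cervical_cancer.csv`; `zipfile.ZipFile(zip_path).namelist()` lists the actual members.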

In [9]:
# Load the dataset
csv_path = "../data/raw/risk_factors_cervical_cancer.csv"
df = pd.read_csv(csv_path)

In [13]:
pd.options.display.max_columns = None
df.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,IUD (years),STDs,STDs (number),STDs:condylomatosis,STDs:cervical condylomatosis,STDs:vaginal condylomatosis,STDs:vulvo-perineal condylomatosis,STDs:syphilis,STDs:pelvic inflammatory disease,STDs:genital herpes,STDs:molluscum contagiosum,STDs:AIDS,STDs:HIV,STDs:Hepatitis B,STDs:HPV,STDs: Number of diagnosis,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,0,0,0,0,0,0,0,0
2,34,1.0,?,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,0,0,0,0,0,0,0,0
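
The preview shows missing entries as `?`. Left in place, a single `?` makes pandas read the whole column as text. The sketch below shows one way to handle this: passing `na_values="?"` to `pd.read_csv`. It uses three columns copied from the five preview rows, so it runs without the data file.

```python
import io

import pandas as pd

# Three columns copied from the five preview rows; "?" marks a missing entry.
sample_csv = io.StringIO(
    "Age,First sexual intercourse,STDs: Time since first diagnosis\n"
    "18,15.0,?\n"
    "15,14.0,?\n"
    "34,?,?\n"
    "52,16.0,?\n"
    "46,21.0,?\n"
)

# na_values="?" tells pandas to parse "?" as NaN, so the columns stay numeric.
sample = pd.read_csv(sample_csv, na_values="?")

# Number of missing entries in each column.
missing_per_column = sample.isna().sum()

print(sample.dtypes)
print(missing_per_column)
```

On the full file, the same option could be passed to the existing call, for example `pd.read_csv(csv_path, na_values="?")`, followed by `df.isna().sum()` to see how much of each column is missing.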
