# CERVICAL CANCER RISK PREDICTION IN WOMEN

Every year in India alone, around 122,000 women are diagnosed with cervical cancer, and around 67,000 of them die from the disease. One of the primary reasons the disease is so deadly is that it is hard to detect, so patients are usually diagnosed only in the later stages of the cancer.

The aim of this project is to create a binary classifier that predicts whether a woman is at risk of cervical cancer. Let's begin by importing the necessary project dependencies.

## Importing Project Dependencies

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

Now, let us import the dataset.

In [2]:
data = pd.read_csv('risk_factors_cervical_cancer.csv', na_values='?')
data.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
2,34,1.0,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,...,,,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,...,,,0,0,0,0,0,0,0,0


Now that we have imported the dataset, the next step is to wrangle (preprocess) the dataset.

## Data Wrangling

Let us begin by checking for null values within our dataset.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 858 entries, 0 to 857
Data columns (total 36 columns):
Age                                   858 non-null int64
Number of sexual partners             832 non-null float64
First sexual intercourse              851 non-null float64
Num of pregnancies                    802 non-null float64
Smokes                                845 non-null float64
Smokes (years)                        845 non-null float64
Smokes (packs/year)                   845 non-null float64
Hormonal Contraceptives               750 non-null float64
Hormonal Contraceptives (years)       750 non-null float64
IUD                                   741 non-null float64
IUD (years)                           741 non-null float64
STDs                                  753 non-null float64
STDs (number)                         753 non-null float64
STDs:condylomatosis                   753 non-null float64
STDs:cervical condylomatosis          753 non-null float64
STDs:vaginal
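
`data.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. As a minimal sketch, the missing values can also be counted directly per column. The three-row sample below is made up for illustration (it is not taken from the dataset); it only mimics the way `'?'` marks a missing entry and is turned into NaN by `na_values='?'`.

```python
import io
import pandas as pd

# Hypothetical three-row sample; '?' marks a missing value, as in the real file.
sample_csv = io.StringIO(
    "Age,Smokes,STDs:HPV\n"
    "18,0.0,0.0\n"
    "15,?,?\n"
    "34,1.0,?\n"
)
sample = pd.read_csv(sample_csv, na_values='?')

# Count missing values per column, largest count first.
missing_counts = sample.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

On the real data, the same call would be `data.isna().sum()`.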

According to resources such as Mayo Clinic and WebMD, HPV, a sexually transmitted infection, plays a key role in bringing about the DNA mutations that can turn cervical tissue cancerous. Hence, we will drop all the rows for which the STDs:HPV value is null (NaN).

In [4]:
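# keep only the rows where STDs:HPV is not NaN
# .copy() gives an independent DataFrame, so later edits do not trigger SettingWithCopyWarning
data_cleaned = data[data['STDs:HPV'].notna()].copy()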

Now, let us drop the columns '__STDs: Time since first diagnosis__' and '__STDs: Time since last diagnosis__', since most of the values in these columns are null.

In [5]:
data_cleaned = data_cleaned.drop(columns=['STDs: Time since first diagnosis', 'STDs: Time since last diagnosis'])
data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 753 entries, 0 to 857
Data columns (total 34 columns):
Age                                   753 non-null int64
Number of sexual partners             739 non-null float64
First sexual intercourse              747 non-null float64
Num of pregnancies                    706 non-null float64
Smokes                                743 non-null float64
Smokes (years)                        743 non-null float64
Smokes (packs/year)                   743 non-null float64
Hormonal Contraceptives               740 non-null float64
Hormonal Contraceptives (years)       740 non-null float64
IUD                                   737 non-null float64
IUD (years)                           737 non-null float64
STDs                                  753 non-null float64
STDs (number)                         753 non-null float64
STDs:condylomatosis                   753 non-null float64
STDs:cervical condylomatosis          753 non-null float64
STDs:vaginal
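
The choice to drop the two time-since-diagnosis columns rests on how much of each column is missing. As a rough sketch, the missing fraction per column can be computed and compared against a cutoff. The toy values and the 0.5 threshold below are assumptions made for illustration, not figures from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up toy frame: the two diagnosis-time columns are mostly NaN.
toy = pd.DataFrame({
    'Age': [18, 15, 34, 52],
    'STDs: Time since first diagnosis': [np.nan, np.nan, np.nan, 21.0],
    'STDs: Time since last diagnosis': [np.nan, np.nan, np.nan, 21.0],
})

# Fraction of missing values in each column (mean of a boolean mask).
missing_fraction = toy.isna().mean()

# Assumed cutoff: treat a column as "mostly missing" above 50% NaN.
threshold = 0.5
mostly_missing = missing_fraction[missing_fraction > threshold].index.tolist()

# Drop the mostly-missing columns and keep the rest.
toy_reduced = toy.drop(columns=mostly_missing)
print(mostly_missing)
```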

For the rest of the null values, we will simply replace them with the mode (the most frequent observation) of the respective columns.

In [16]:
# columns that still contain null values
columns_to_fill = [
    'Number of sexual partners', 'First sexual intercourse', 'Num of pregnancies',
    'Smokes', 'Smokes (years)', 'Smokes (packs/year)',
    'Hormonal Contraceptives', 'Hormonal Contraceptives (years)',
    'IUD', 'IUD (years)',
]

# map each column to its mode (most frequent value)
values = {column: data_cleaned[column].mode()[0] for column in columns_to_fill}

data_cleaned = data_cleaned.fillna(value=values)
data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 753 entries, 0 to 857
Data columns (total 34 columns):
Age                                   753 non-null int64
Number of sexual partners             753 non-null float64
First sexual intercourse              753 non-null float64
Num of pregnancies                    753 non-null float64
Smokes                                753 non-null float64
Smokes (years)                        753 non-null float64
Smokes (packs/year)                   753 non-null float64
Hormonal Contraceptives               753 non-null float64
Hormonal Contraceptives (years)       753 non-null float64
IUD                                   753 non-null float64
IUD (years)                           753 non-null float64
STDs                                  753 non-null float64
STDs (number)                         753 non-null float64
STDs:condylomatosis                   753 non-null float64
STDs:cervical condylomatosis          753 non-null float64
STDs:vaginal
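
One detail of the mode-based fill is worth spelling out: `Series.mode()` returns a Series rather than a single number, because a column can have several equally frequent values. Indexing with `[0]` picks the first of them, which is the smallest, since pandas returns the modes sorted. The sketch below uses a made-up toy frame (not the dataset) to show this on a column with a tie.

```python
import numpy as np
import pandas as pd

# Made-up toy frame; 'Num of pregnancies' has a tie between 1.0 and 2.0.
toy = pd.DataFrame({
    'Smokes': [0.0, 0.0, 1.0, np.nan, 0.0],
    'Num of pregnancies': [1.0, 2.0, 2.0, np.nan, 1.0],
})

# mode() ignores NaN by default and returns all tied values, sorted.
tied_modes = toy['Num of pregnancies'].mode().tolist()

# Take the first mode of each column as its fill value.
fill_values = {column: toy[column].mode()[0] for column in toy.columns}
toy_filled = toy.fillna(value=fill_values)

print(tied_modes)
print(fill_values)
```

Whether the smallest of several tied modes is the right fill value depends on the column, so it may be worth checking for ties before imputing.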

With this, we have dealt with all the missing/null values within our dataset. Now, let us have a look at the summary statistics of the data. Since there are 34 columns, we will split the summary into two parts.

In [23]:
data_cleaned.iloc[:, :17].describe()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,IUD (years),STDs,STDs (number),STDs:condylomatosis,STDs:cervical condylomatosis,STDs:vaginal condylomatosis,STDs:vulvo-perineal condylomatosis
count,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0
mean,27.22842,2.519256,17.073041,2.23506,0.142098,1.210974,0.458142,0.64409,2.173831,0.110226,0.5066,0.104914,0.176627,0.058433,0.0,0.005312,0.057105
std,8.68086,1.670286,2.838513,1.459285,0.349383,4.115163,2.286894,0.479106,3.614502,0.313379,1.928602,0.306646,0.561993,0.234716,0.0,0.072739,0.232197
min,13.0,1.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,21.0,2.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,26.0,2.0,17.0,2.0,0.0,0.0,0.0,1.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,33.0,3.0,18.0,3.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,84.0,28.0,32.0,11.0,1.0,37.0,37.0,1.0,22.0,1.0,19.0,1.0,4.0,1.0,0.0,1.0,1.0


In [24]:
data_cleaned.iloc[:, 17:].describe()

Unnamed: 0,STDs:syphilis,STDs:pelvic inflammatory disease,STDs:genital herpes,STDs:molluscum contagiosum,STDs:AIDS,STDs:HIV,STDs:Hepatitis B,STDs:HPV,STDs: Number of diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
count,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0
mean,0.023904,0.001328,0.001328,0.001328,0.0,0.023904,0.001328,0.002656,0.099602,0.023904,0.010624,0.023904,0.030544,0.046481,0.096946,0.054449,0.070385
std,0.152853,0.036442,0.036442,0.036442,0.0,0.152853,0.036442,0.051503,0.321089,0.152853,0.102593,0.152853,0.172194,0.210664,0.29608,0.227052,0.255965
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
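
Splitting the columns with `iloc` is one way to keep a wide summary readable. Another option, sketched here on a made-up toy frame rather than the dataset, is to transpose the `describe()` output so that each column of the data becomes one row of the summary.

```python
import pandas as pd

# Made-up toy frame standing in for the cleaned data.
toy = pd.DataFrame({
    'Age': [18, 15, 34, 52],
    'Smokes': [0.0, 0.0, 0.0, 1.0],
})

# Transpose so each original column is a row and each statistic is a column.
summary = toy.describe().T
print(summary)
```

On the real data, the equivalent call would be `data_cleaned.describe().T`, which yields one row per column in a single table.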
