MAI643 - Artificial Intelligence in Medicine

Project Assignment 1 - Spring Semester 2024

Student Name:    
Christina Ioanna Saroglaki   
Jianlin Ye 

UCY Email:     
saroglaki.christina-ioanna@ucy.ac.cy    
jye00001@ucy.ac.cy 

### Import Libraries

In [90]:
import pandas as pd 
import numpy as np

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

np.set_printoptions(formatter={'float':"{:6.5g}".format})

# Overview

As per the authors, the chosen dataset focuses on indicators associated with the diagnosis of cervical cancer, encompassing various features such as demographic information, habits, and medical records. In more detail, the data was gathered at "Hospital Universitario de Caracas" in Venezuela from a total of 858 patients.

K. Fernandes, J. Cardoso, and J. Fernandes, “Cervical cancer (Risk Factors),” UCI Machine Learning Repository, 2017.

In [91]:
risk_factor_df = pd.read_csv("risk_factors_cervical_cancer.csv", 
            na_values=["?"])

print("----------------------------------- Information -----------------------------------")
risk_factor_df.info()

----------------------------------- Information -----------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 858 entries, 0 to 857
Data columns (total 36 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Age                                 858 non-null    int64  
 1   Number of sexual partners           832 non-null    float64
 2   First sexual intercourse            851 non-null    float64
 3   Num of pregnancies                  802 non-null    float64
 4   Smokes                              845 non-null    float64
 5   Smokes (years)                      845 non-null    float64
 6   Smokes (packs/year)                 845 non-null    float64
 7   Hormonal Contraceptives             750 non-null    float64
 8   Hormonal Contraceptives (years)     750 non-null    float64
 9   IUD                                 741 non-null    float64
 10  IUD (years)               

## Preliminary analysis of the dataset

To gain a better understanding of the dataset, we conducted a preliminary analysis.

### Missing Values

First, we measured how many values were missing across the dataset and identified the features with the most missing values.

In [92]:
print("----------------------------------- Missing Values -----------------------------------")
missing_info = risk_factor_df.isnull().sum()
total_nan = missing_info.sum()
total_entries = risk_factor_df.size

# Print total NaN values
if (total_nan == 0):
    print("\nNo NaN values in the dataset.")
else:
    print("\nNaN values found in the dataset.")

    print("\nTotal NaN values in dataset: {}/{}".format(total_nan, total_entries))

    # Sort columns by the number of missing values
    nan_columns = missing_info.sort_values(ascending=False)

    print("\nTop 15 columns with missing values:\n")
    for i, (col, count) in enumerate(nan_columns.head(15).items(), 1):
        print("{:2}. {:35} : {:}".format(i, col, count))

----------------------------------- Missing Values -----------------------------------

NaN values found in the dataset.

Total NaN values in dataset: 3622/30888

Top 15 columns with missing values:

 1. STDs: Time since last diagnosis     : 787
 2. STDs: Time since first diagnosis    : 787
 3. IUD                                 : 117
 4. IUD (years)                         : 117
 5. Hormonal Contraceptives             : 108
 6. Hormonal Contraceptives (years)     : 108
 7. STDs:pelvic inflammatory disease    : 105
 8. STDs:vulvo-perineal condylomatosis  : 105
 9. STDs:HPV                            : 105
10. STDs:Hepatitis B                    : 105
11. STDs:HIV                            : 105
12. STDs:AIDS                           : 105
13. STDs:molluscum contagiosum          : 105
14. STDs:genital herpes                 : 105
15. STDs:syphilis                       : 105


In [93]:
# Plot
total_figure = px.pie(values=[total_nan, total_entries-total_nan], names=["NaN values", "Valid Values"],
        color_discrete_sequence=px.colors.sequential.Aggrnyl,
        title="Total NaN Values Distribution",
        width=550, height= 350)

total_figure.update_layout(
    margin=dict(l=50, r=50, t=50, b=50),
    title_x=0.5    
)

total_figure.show()

In [94]:
# Rows containing NaN values
total_rows = len(risk_factor_df)
nan_rows = risk_factor_df.isna().any(axis=1).tolist().count(True)
print("\nTotal Rows containing NaN values in dataset: {}/{}".format(nan_rows, total_rows))

rows_fig=go.Figure(data=[go.Pie(labels=["Has NaN Values","Is Filled"],
    values=[nan_rows, total_rows - nan_rows],
    marker_colors=[px.colors.sequential.Agsunset[0], px.colors.sequential.Agsunset[1]])])

rows_fig.update_layout(
    title="NaN Containing Rows Distribution",
    margin=dict(l=50, r=50, t=50, b=50),
    title_x=0.5,
    width=550, height= 350    
)

rows_fig.show()


Total Rows containing NaN values in dataset: 799/858


We found that about 92% of the values (787 of 858) in the features “STDs: Time since first diagnosis” and “STDs: Time since last diagnosis” were NaN. With so few observed values, it was impractical either to delete the affected observations or to fill the gaps with the column mean. Consequently, these features were excluded from the dataset.

In [95]:
risk_factor_df.drop(columns=["STDs: Time since first diagnosis", "STDs: Time since last diagnosis"], inplace=True)
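The decision above rests on the share of missing values per column. A minimal sketch of how that share can be computed and used as a drop rule is shown below; the toy frame, its column names, and the 50% cut-off are assumptions for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy_df = pd.DataFrame({
    "age": [25, 31, 40, 22, 35],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 3.0],
    "partly_missing": [1.0, np.nan, 0.0, 1.0, 0.0],
})

# isna() yields booleans; their column-wise mean is the fraction of NaN values.
missing_fraction = toy_df.isna().mean()

# Hypothetical cut-off: drop any column that is more than 50% empty.
MAX_MISSING_FRACTION = 0.5
cols_to_drop = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()
reduced_df = toy_df.drop(columns=cols_to_drop)

print(missing_fraction)
print("Dropped:", cols_to_drop)
```

Expressing the rule as a fraction rather than a hard-coded column list makes it easier to reuse if the data changes.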

To ensure the optimal performance of future models, we also set a missing value threshold of 10 per row. Any rows that exceeded this threshold were eliminated from the dataset because we determined they were missing significant information.

In [96]:
# Rows containing NaN values
nan_rows = risk_factor_df.isna().any(axis=1).tolist().count(True)
print("\nTotal Rows containing NaN values in dataset: {}/{}".format(nan_rows, total_rows))

# Find rows that contain more than 10 NaN values
rows_to_del = risk_factor_df[risk_factor_df.isna().sum(axis=1) > 10].index

print("\nRows containing >10 NaN values: {}/{}".format(len(rows_to_del), total_rows))

# Remove rows
risk_factor_df.drop(rows_to_del, inplace=True)
risk_factor_df.reset_index(drop=True, inplace=True)


Total Rows containing NaN values in dataset: 190/858

Rows containing >10 NaN values: 105/858
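The row filter can also be written with `DataFrame.dropna(thresh=...)`, where `thresh` is the minimum number of non-NaN values a row needs in order to be kept. The sketch below compares both forms on a toy frame; the data and the threshold of 1 are assumptions chosen to keep the example small (the notebook uses 10).

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, np.nan, 2.0, 4.0],
    "c": [1.0, np.nan, np.nan, np.nan],
})

MAX_NAN_PER_ROW = 1  # hypothetical threshold for the toy data

# Boolean-mask version, mirroring the approach in the cell above.
nan_per_row = toy_df.isna().sum(axis=1)
kept_mask = toy_df[nan_per_row <= MAX_NAN_PER_ROW]

# dropna version: keep rows with at least (n_columns - MAX_NAN_PER_ROW) non-NaN values.
kept_thresh = toy_df.dropna(thresh=toy_df.shape[1] - MAX_NAN_PER_ROW)

print(kept_thresh)
```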


In [97]:
#Plot
color_1 = [px.colors.sequential.Agsunset[0], px.colors.sequential.Agsunset[1]]
color_2 = [px.colors.sequential.Agsunset[2], px.colors.sequential.Agsunset[3]]


row_figure = make_subplots(1, 2, specs=[[{"type":"domain"}, {"type":"domain"}]],
    subplot_titles=["Contain NaN Values", "Contain >10 NaN Values"])

row_figure.add_trace(go.Pie(labels=["Has NaN Values","Is Filled"],
    values=[nan_rows, total_rows - nan_rows],
    marker_colors=color_1,
    pull=[0.1, 0]), 1, 1)

row_figure.add_trace(go.Pie(labels=[">10 NaN", "1-10 NaN"],
    values=[len(rows_to_del), nan_rows - len(rows_to_del)],
    marker_colors=color_2), 1, 2)

row_figure.update_layout(title_text="Rows Containing NaN Values",
    width=650, height= 400,
    title_x=0.5)

row_figure.show()

For the remaining columns, we handled the missing values column by column. If a column had more than three unique values (a non-binary feature), its missing values were replaced with the median of the column. Otherwise (a binary 0/1 feature), the rows containing the missing value were deleted.

In [98]:
print("--------------------------- Handling Missing Values ---------------------------")
print("----------------------------------- BEFORE -----------------------------------")
print("Number of rows before filling missing values: ", len(risk_factor_df))

# Display the number of missing values before filling
print("\nNumber of missing values per column before filling:")
print(risk_factor_df.isnull().sum())

# Fill missing values depending on the column
for col in risk_factor_df.columns:
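    # If the column has more than 3 unique values, fill with the median of the column
    if risk_factor_df[col].nunique() > 3:
        # Assign the result back; fillna(inplace=True) on a column selection may not update the frame
        risk_factor_df[col] = risk_factor_df[col].fillna(risk_factor_df[col].median())

# Drop the remaining rows that still contain NaN values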
risk_factor_df=risk_factor_df.dropna()
risk_factor_df.reset_index(drop=True, inplace=True)

--------------------------- Handling Missing Values ---------------------------
----------------------------------- BEFORE -----------------------------------
Number of rows before filling missing values:  753

Number of missing values per column before filling:
Age                                    0
Number of sexual partners             14
First sexual intercourse               6
Num of pregnancies                    47
Smokes                                10
Smokes (years)                        10
Smokes (packs/year)                   10
Hormonal Contraceptives               13
Hormonal Contraceptives (years)       13
IUD                                   16
IUD (years)                           16
STDs                                   0
STDs (number)                          0
STDs:condylomatosis                    0
STDs:cervical condylomatosis           0
STDs:vaginal condylomatosis            0
STDs:vulvo-perineal condylomatosis     0
STDs:syphilis                          0

In [99]:
print("\n----------------------------------- AFTER -----------------------------------")
print("Number of rows after filling missing values: ", len(risk_factor_df))

# Display the number of missing values after filling
print("\nNumber of missing values per column after filling:")
print(risk_factor_df.isnull().sum())


----------------------------------- AFTER -----------------------------------
Number of rows after filling missing values:  726

Number of missing values per column after filling:
Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum cont
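To make the imputation rule concrete, here is a small sketch on toy data. The frame and its column names are hypothetical; the cut-off of three unique values matches the cell above.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "years": [1.0, 2.0, np.nan, 4.0, 10.0, 3.0],  # many distinct values
    "flag":  [0.0, 1.0, 0.0, np.nan, 1.0, 0.0],   # binary feature
})

UNIQUE_THRESHOLD = 3  # same cut-off as in the notebook cell

for col in toy_df.columns:
    if toy_df[col].nunique() > UNIQUE_THRESHOLD:
        # Non-binary column: fill the gaps with the column median.
        toy_df[col] = toy_df[col].fillna(toy_df[col].median())

# Binary columns are left untouched, so rows with a NaN there are dropped.
toy_df = toy_df.dropna().reset_index(drop=True)

print(toy_df)
```

In this toy case the missing `years` value is filled with the median of the observed values (3.0), while the row with the missing `flag` is removed.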

### Duplicate Rows

Following the missing value analysis, we examined if the dataset contained any duplicate rows and removed them from the dataset.

In [100]:
print("----------------------------------- Duplicate Rows -----------------------------------")
# Check for duplicate rows
duplicate_rows = risk_factor_df.duplicated()

# Count the number of duplicate rows
num_duplicates = duplicate_rows.sum()

if num_duplicates == 0:
    print("No duplicate rows found in the dataset.")
else:
    print(f"Found {num_duplicates} duplicate rows in the dataset.\n")

    # Display the duplicate rows indexes (if any)
    print("Duplicate rows indexes: {}\n".format(risk_factor_df[duplicate_rows].index.values))

    # Removing duplicate rows
    print("----------------------------- Removing Duplicates ----------------------------")
    print("----------------------------------- BEFORE -----------------------------------")
    print("Number of rows before removing duplicates: ", len(risk_factor_df))

    risk_factor_df.drop_duplicates(inplace=True)
    risk_factor_df.reset_index(drop=True, inplace=True)

    print("\n----------------------------------- AFTER -----------------------------------")
    print("Number of rows after removing duplicates: ", len(risk_factor_df))


----------------------------------- Duplicate Rows -----------------------------------
Found 18 duplicate rows in the dataset.

Duplicate rows indexes: [ 63 222 296 332 340 360 364 368 370 377 387 405 441 447 480 484 536 607]

----------------------------- Removing Duplicates ----------------------------
----------------------------------- BEFORE -----------------------------------
Number of rows before removing duplicates:  726

----------------------------------- AFTER -----------------------------------
Number of rows after removing duplicates:  708


This concluded the first phase of the preliminary analysis. After managing all the missing values and duplicate rows, the dataset had 34 features and 708 observations.

In [101]:
print("\nFinal dataset size: {} cols, {} rows".format(risk_factor_df.shape[1], risk_factor_df.shape[0]))


Final dataset size: 34 cols, 708 rows


## Understanding features

Once the first part of the analysis was completed, we moved on to exploring the features and some statistical properties of the dataset. This would allow us to identify possible connections between the features as well as possible imbalances.

#### Unique Values

In [102]:
# Function finding the unique values of each column in the dataframe
def find_unique_values_df(feat: pd.DataFrame):
    return {col: feat[col].unique() for col in feat}

print("----------------------------------- Unique Values -----------------------------------")    
# Unique Values
unique_vals = find_unique_values_df(risk_factor_df)

# Print unique values for each column
for col, col_unique_vals in unique_vals.items():
    print(f"{col}:")
    print(col_unique_vals)
    print(risk_factor_df[col].dtypes)
    print()


----------------------------------- Unique Values -----------------------------------
Age:
[18 15 34 52 46 42 51 26 45 44 27 43 40 41 39 37 38 36 35 33 31 32 30 23
 28 29 25 21 24 22 20 48 19 17 16 14 59 79 84 47 13 70 50 49]
int64

Number of sexual partners:
[     4      1      5      3      2      6      8      7     28      9]
float64

First sexual intercourse:
[    15     14     17     16     21     23     26     20     25     18
     27     19     24     32     29     11     13     22     28     10
     12]
float64

Num of pregnancies:
[     1      4      2      6      3      5      8      7      0     11
     10]
float64

Smokes:
[     0      1]
float64

Smokes (years):
[     0     37     34      3  1.267     12     18      7     19     21
     13     16     15      8      4     10     22     14    0.5     11
      2      6      5      1     32      9     24     28     20   0.16]
float64

Smokes (packs/year):
[     0     37    3.4   0.04 0.5132    2.4      6      9    1.6     19


### Target Values Distribution

First, we analyzed the dataset's balance. As shown in the graph, the dataset has a large imbalance across all four target variables. This imbalance complicates model training and evaluation, and it should be handled during the preprocessing step.

In [103]:
def getCount(col, value):
    return risk_factor_df[col].value_counts()[value]

# Plot occurrences of each class in the dataset
classes_df = pd.DataFrame(
    [["Hinselmann", getCount("Hinselmann", 0), getCount("Hinselmann", 1)],
        ["Schiller", getCount("Schiller", 0), getCount("Schiller", 1)],
        ["Citology", getCount("Citology", 0), getCount("Citology", 1)],
        ["Biopsy", getCount("Biopsy", 0), getCount("Biopsy", 1)]],
    columns =["Exam", "Healthy", "Cervical Cancer"])


balance_fig = px.histogram(classes_df, x="Exam", y=["Healthy", "Cervical Cancer"],
    title="Class Distribution",
    labels={
        "value":"Occurrences",
        "variable": "Result"
    },
    barmode="group",
    text_auto=True,
    color_discrete_sequence=px.colors.qualitative.Bold,
    width=600)

balance_fig.update_layout(
    title_x=0.5    
)

balance_fig.show()

### Statistical Properties

Moving on to the statistical properties of the dataset, we calculated the mean and standard deviation for each column. Columns with a standard deviation of 0 contain the same value for all observations and therefore add no variability to the data, so they were omitted from the dataset.

In [104]:
mean_df = risk_factor_df.mean()
std_df = risk_factor_df.std()

# Print columns that have a standard deviation 0 (contain only one value)
print("Columns containing 1 value: {}\n".format(std_df[std_df==0].index.values))


Columns containing 1 value: ['STDs:cervical condylomatosis' 'STDs:AIDS']



In [105]:
risk_factor_df.drop(columns=["STDs:cervical condylomatosis", "STDs:AIDS"], inplace=True)
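As a cross-check, constant columns can also be found by counting unique values instead of comparing the standard deviation to 0. The sketch below shows both checks agreeing on a toy frame whose column names are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "constant": [0, 0, 0, 0],
    "binary": [0, 1, 0, 1],
    "numeric": [3.5, 2.0, 7.1, 4.4],
})

# Check 1: standard deviation equal to 0, as in the notebook.
std_values = toy_df.std()
constant_by_std = std_values[std_values == 0].index.tolist()

# Check 2: at most one distinct value in the column.
constant_by_nunique = toy_df.columns[toy_df.nunique() <= 1].tolist()

print(constant_by_std, constant_by_nunique)
```

The `nunique` variant also works for non-numeric columns, for which a standard deviation is not defined.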

In [106]:
# Plot
mean_df = risk_factor_df.mean()
std_df = risk_factor_df.std()

statistic_fig = go.Figure(data=[go.Table(
        header=dict(values=["Feature", "Mean", "Standard Deviation"]),
        cells=dict(values=[list(risk_factor_df.columns), mean_df.values, std_df.values],
                    align=['left', 'center'],
                    format=["",".2"])
    )
])

statistic_fig.show()

#### Remove Outliers

The interquartile range (IQR) approach is a widely used method for detecting outliers. We used it to identify and remove outliers: for each non-binary column we computed the fences Q1 - 1.5·IQR and Q3 + 1.5·IQR, and deleted the values outside the fences that occurred only once.

In [107]:
def find_outliers(col, indices):
    obs = risk_factor_df[col].iloc[indices]
    unique_items, counts = np.unique(obs, return_counts=True)
    unique_items, counts = unique_items[::-1], counts[::-1]

    values_to_delete = unique_items[counts < 2 ]
    return values_to_delete

def delete_outliers(col, to_delete):
    if (to_delete.size != 0):
        rows_to_del = risk_factor_df.loc[risk_factor_df[col].isin(to_delete)].index.values.tolist()

        # Remove rows
        risk_factor_df.drop(rows_to_del, inplace=True)
        risk_factor_df.reset_index(drop=True, inplace=True)

# Identify non-binary columns
non_binary_cols = [col for col, vals in unique_vals.items() if len(vals) > 2]

for col in non_binary_cols:

    # IQR cannot be applied to columns with median 0
    if (risk_factor_df[col].median() != 0):

        # Plot values distribution
        out_dist = px.histogram(risk_factor_df, x=col,
            marginal="box",
            color_discrete_sequence= px.colors.sequential.thermal)
        out_dist.update_layout(bargap=0.2,
            width=701)
        out_dist.show()

        Q3, Q1 = np.percentile(risk_factor_df[col], [75 ,25])
        IQR = Q3-Q1

        upper = Q3+(1.5*IQR)
        lower = Q1-(1.5*IQR)

        print(col)
        print("median: {}, upper fence: {}, lower fence: {}".format(risk_factor_df[col].median(), upper, lower))

        #Delete one occurrence observations outside the upper fence as outliers
        upper_to_delete = find_outliers(col, np.where(risk_factor_df[col] > upper)[0])
        delete_outliers(col, upper_to_delete)

        
        #Delete one occurrence observations outside the lower fence as outliers
        lower_to_delete = find_outliers(col, np.where(risk_factor_df[col] < lower)[0])
        delete_outliers(col, lower_to_delete)


Age
median: 26.0, upper fence: 51.0, lower fence: 3.0


Number of sexual partners
median: 2.0, upper fence: 4.5, lower fence: 0.5


First sexual intercourse
median: 17.0, upper fence: 22.5, lower fence: 10.5


Num of pregnancies
median: 2.0, upper fence: 6.0, lower fence: -2.0


Hormonal Contraceptives (years)
median: 0.5, upper fence: 7.5, lower fence: -4.5
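The fence computation and the "single occurrence" rule can be traced on a small array. This is a sketch with made-up values, not the notebook's data; it follows the same steps as the cell above (percentiles, 1.5·IQR fences, then removal of out-of-fence values that occur only once).

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40, 40, 55])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are outlier candidates.
beyond = values[(values < lower) | (values > upper)]

# Only candidates that occur once are treated as outliers.
unique_vals, counts = np.unique(beyond, return_counts=True)
single_outliers = unique_vals[counts < 2]

cleaned = values[~np.isin(values, single_outliers)]

print("fences:", lower, upper)
print("removed:", single_outliers)
```

Here 40 lies above the upper fence but appears twice, so it is kept, while 55 appears once and is removed.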


In [108]:
print("\nFinal dataset size: {} cols, {} rows".format(risk_factor_df.shape[1], risk_factor_df.shape[0]))


Final dataset size: 32 cols, 697 rows


#### Correlation with label

Lastly, we computed the Pearson correlation between each feature and each of the four target variables, and plotted its absolute value.

In [109]:
def find_corr(target, col):
    return risk_factor_df[target].corr(risk_factor_df[col])

# Create dictionaries
target_variables = ["Hinselmann", "Schiller", "Citology", "Biopsy"]
correlations = {target: {} for target in target_variables}

# Calculate correlations
for target in target_variables:
    target_corr = risk_factor_df.iloc[:, :-4].corrwith(risk_factor_df[target])
    correlations[target] = dict(target_corr.abs().sort_values())
    
# Plot graphs
for target in correlations:
    target_df = pd.DataFrame.from_dict(correlations[target], orient="index", columns=["Correlation"])

    target_fig = px.bar(target_df, x="Correlation",
        orientation='h',
        title="Features & {} Correlations".format(target),
        labels={
            "index": "Features"
        },
        width=900, height=700)
    
    target_fig.update_layout(
        title_x=0.5    
    )
    
    target_fig.show()
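Because the plots use absolute values, the direction of each relationship is not visible in them. The sketch below, on a hypothetical toy frame, shows how `corrwith` returns signed Pearson coefficients and how taking `abs()` makes a positive and a negative relationship of equal strength look identical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "feature_up":   [1, 2, 3, 4, 5, 6],
    "feature_down": [6, 5, 4, 3, 2, 1],
    "target":       [0, 0, 0, 1, 1, 1],
})

features = toy_df.drop(columns="target")

# Signed Pearson correlation of every feature with the binary target.
corr = features.corrwith(toy_df["target"])

# Ranking by magnitude, as done for the bar charts.
ranked = corr.abs().sort_values(ascending=False)

print(corr)
print(ranked)
```

Keeping the signed series alongside the ranking makes it possible to report whether a feature increases or decreases with the target.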

# Data pre-processing steps
---
Up to this point, the dataset has 32 columns and 697 rows.

- Re-encoded missing values from "?" to NaN
- Imputed missing data in non-binary columns with the column median
- Dropped near-zero-variance and redundant features
- Assigned the target to its own dataframe and split the data into train/test sets

#### Data Cleaning

Summary of the cleaning steps from the sections above:
- Dropped the columns "STDs: Time since first diagnosis" and "STDs: Time since last diagnosis" (the top two columns by missing values), because about 92% of their values are NaN. **36 columns -> 34 columns**
- Dropped rows with more than 10 missing values. **858 rows -> 753 rows**
- For binary columns, deleted the rows containing a missing value; for all other columns, replaced the missing value with the median of the column. **753 rows -> 726 rows**
- Removed 18 duplicate rows. **726 rows -> 708 rows**
- Dropped "STDs:cervical condylomatosis" and "STDs:AIDS" because they have a standard deviation of 0 (zero variance). **34 columns -> 32 columns**
- Removed outliers. **708 rows -> 697 rows**

#### Split dataset

In [110]:
import plotly.express as px
import pandas as pd
from sklearn.model_selection import train_test_split

In [111]:
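# Set the target explicitly instead of relying on the loop variable from the correlation cell
target = "Biopsy"
risk_factor_df[target].value_counts()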

Biopsy
0    648
1     49
Name: count, dtype: int64
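The counts above can be turned into a few numbers that describe the imbalance. This is plain arithmetic on the printed `Biopsy` counts; the variable names are ours.

```python
# Counts copied from the Biopsy value_counts() output above.
negative_count, positive_count = 648, 49
total = negative_count + positive_count

minority_share = positive_count / total      # fraction of positive cases
imbalance_ratio = negative_count / positive_count
majority_baseline = negative_count / total   # accuracy of always predicting class 0

print(f"Minority share:    {minority_share:.1%}")
print(f"Imbalance ratio:   {imbalance_ratio:.1f} : 1")
print(f"Majority baseline: {majority_baseline:.1%}")
```

A classifier that always predicts the majority class would already reach roughly 93% accuracy, which is why accuracy alone is a weak metric for this target.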

In [112]:
# Show the distribution of the Target Class
fig = px.histogram(risk_factor_df, x=target)

fig.show()

In [113]:
X = risk_factor_df.drop(columns=target)
y = risk_factor_df[target]

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.33, 
                                                    random_state=42,
                                                    stratify = y, 
                                                    shuffle=True)


# Create a count data frame for the test set target variable
y_test_counts = y_test.value_counts().reset_index()
y_test_counts.columns = ['Class', 'Count']
y_test_counts['Class'] = y_test_counts['Class'].map({0: 'Healthy', 1: 'Cervical Cancer'})

# Use Pie chart
fig_pie = px.pie(y_test_counts, names='Class', values='Count', title='Test Set Target Variable Distribution', width=700)

# Bar Chart
fig_bar = px.bar(y_test_counts, x='Class', y='Count', title='Test Set Target Variable Distribution', width=700)

# Show Images
fig_pie.show()
fig_bar.show()
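The effect of `stratify=y` can be checked on a synthetic target. The sketch below uses a made-up 90/10 class split and a 30% test size (not the notebook's data or settings) to show that the class proportions carry over to both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 90 negatives and 10 positives (illustrative only).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification, both partitions keep the 10% positive rate.
print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("test positives: ", int(y_te.sum()), "of", len(y_te))
```

Without `stratify`, a random split of a rare class could leave very few positive cases in the test set.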

#### Dimensionality Reduction (PCA - Principal Component Analysis)

In [114]:
import plotly.express as px
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = risk_factor_df.drop(columns=target)
y = risk_factor_df[target]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.33, 
                                                    random_state=42,
                                                    stratify=y, 
                                                    shuffle=True)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Use PCA for dimensionality reduction
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)

# Combine the reduced-dimensional data and target variable into a DataFrame
df_pca_no_smote = pd.DataFrame(data={'PCA1': X_train_pca[:, 0], 'PCA2': X_train_pca[:, 1], 'Class': y_train})

# Create a scatter plot using Plotly Express
fig_no_smote = px.scatter(df_pca_no_smote, x='PCA1', y='PCA2', color='Class', title='PCA Visualization of Original Training Set (No SMOTE)')

# Show the plot
fig_no_smote.show()
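If the test set is later projected into the same two components, the scaler and the PCA should be fitted on the training data only and then reused. The following sketch shows that pattern on random toy arrays (an assumption; the notebook itself only projects the training set), and also prints how much variance the two components retain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_toy = rng.normal(size=(80, 5))  # stand-in for X_train
X_test_toy = rng.normal(size=(20, 5))   # stand-in for X_test

# Fit the scaler and the PCA on the training data only.
scaler = StandardScaler().fit(X_train_toy)
pca = PCA(n_components=2).fit(scaler.transform(X_train_toy))

# Reuse the fitted objects for both partitions.
train_2d = pca.transform(scaler.transform(X_train_toy))
test_2d = pca.transform(scaler.transform(X_test_toy))

explained = pca.explained_variance_ratio_
print("Variance retained by 2 components:", explained.sum())
```

Reporting `explained_variance_ratio_` next to the scatter plot helps judge how much structure a 2D view can actually show.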


#### Handling Imbalanced Data -- SMOTE

In [115]:
from imblearn.over_sampling import SMOTE

# Fix the seed so the resampling is reproducible
smote = SMOTE(random_state=42)

X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Create a data frame of the counts of the target variables of the training set after SMOTE oversampling
y_train_sm_counts = y_train_sm.value_counts().reset_index()
y_train_sm_counts.columns = ['Class', 'Count']
y_train_sm_counts['Class'] = y_train_sm_counts['Class'].map({0: 'Healthy', 1: 'Cervical Cancer'})

# Pie Chart
fig_pie_sm = px.pie(y_train_sm_counts, names='Class', values='Count', title='SMOTE Resampled Training Set Target Variable Distribution', width=700)

# Bar Chart
fig_bar_sm = px.bar(y_train_sm_counts, x='Class', y='Count', title='SMOTE Resampled Training Set Target Variable Distribution', width=700)

# Show Images
fig_pie_sm.show()
fig_bar_sm.show()
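For intuition about what SMOTE adds to the training set, the sketch below generates synthetic points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy illustration of the core idea, not imbalanced-learn's implementation; the helper `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four toy minority-class points in 2D (illustrative only).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def smote_like_sample(points, k, rng):
    """Return one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                       # uniform in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(6)])
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority samples, so oversampling fills in the region the minority class already occupies rather than creating cases outside it.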

We then used PCA again to visualize the training set after SMOTE oversampling.

In [116]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Perform SMOTE oversampling
smote = SMOTE()
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Standardized data
scaler = StandardScaler()
X_train_sm_scaled = scaler.fit_transform(X_train_sm)

# Dimensionality reduction using PCA
pca = PCA(n_components=2)
X_train_sm_pca = pca.fit_transform(X_train_sm_scaled)

# Combine the downscaled data and the target variable into a data frame
df_pca = pd.DataFrame(data={'PCA1': X_train_sm_pca[:, 0], 'PCA2': X_train_sm_pca[:, 1], 'Class': y_train_sm})

# Create scatter graph
fig = px.scatter(df_pca, x='PCA1', y='PCA2', color='Class', title='PCA Visualization of SMOTE oversampling Training Set')

# show image
fig.show()