MAI643 - Artificial Intelligence in Medicine

Project Assignment 1 - Spring Semester 2024

Student Name:    
Christina Ioanna Saroglaki   
Jianlin Ye 

UCY Email:     
saroglaki.christina-ioanna@ucy.ac.cy    
jye00001@ucy.ac.cy 

### Import Libraries

In [None]:
import pandas as pd 
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Overview

According to the authors, the chosen dataset covers indicators associated with the diagnosis of cervical cancer, including demographic information, habits, and medical records. The data was collected at 'Hospital Universitario de Caracas' in Venezuela from a total of 858 patients.

K. Fernandes, J. Cardoso, and J. Fernandes, “Cervical cancer (Risk Factors),” UCI Machine Learning Repository, 2017.

In [None]:
risk_factor_df = pd.read_csv("risk_factors_cervical_cancer.csv", 
            na_values=['?'])

# Only keep the "Biopsy" column as the target variable
risk_factor_df.drop(columns=["Hinselmann","Schiller","Citology"], inplace=True)

print("----------------------------------- Information -----------------------------------")
risk_factor_df.info()

## Preliminary analysis of the dataset

### Missing Values

To gain a better understanding of the dataset, we conducted a preliminary analysis. First, we measured how many values were missing overall and identified the features with the most missing values.

In [None]:
print("----------------------------------- Missing Values -----------------------------------")
nan_columns = {}
total_nan = 0
total_entries = len(risk_factor_df.axes[0]) * len(risk_factor_df.axes[1])

# Find columns containing NaN values
for col in risk_factor_df.columns:
    if risk_factor_df[col].isnull().any():
        nan_in_column = risk_factor_df[col].isna().sum()
        nan_columns[col] = nan_in_column
        total_nan += nan_in_column
    else:
        nan_columns[col] = 0

# Print total NaN values
if (total_nan == 0):
    print("\nNo NaN values in the dataset.")
else:
    print("\nNaN values found in the dataset.")
    nan_columns = sorted(nan_columns.items(), key=lambda item: item[1], reverse=True)

    print("\nTotal NaN values in dataset: {}/{}".format(total_nan, total_entries))

    print("\nTop 15 columns with missing values:\n")
    for i in range(15): print("{:2}. {:35} : {:}".format(i+1, nan_columns[i][0], nan_columns[i][1]))

In [None]:
# Plot
total_figure = px.pie(values=[total_nan, total_entries-total_nan], names=["NaN values", "Valid Values"],
        color_discrete_sequence=px.colors.sequential.Aggrnyl,
        title="Total NaN Values Distribution",
        width=550, height= 350)
total_figure.update_layout(
    margin=dict(l=50, r=50, t=50, b=50),
    title_x=0.5    
)
total_figure.show()

We found that about 92% of the values in the features “STDs: Time since first diagnosis” and “STDs: Time since last diagnosis” were NaN. With so much data missing, it was impractical either to drop the affected observations or to fill the gaps with the column mean. Consequently, these two features were excluded from the dataset.
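
The share of missing values per feature can be checked directly. The sketch below is a minimal illustration on a small hypothetical frame (`toy_df`, with made-up values) rather than the real dataset, and the 50% cutoff is an assumption chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for risk_factor_df; the values are made up.
toy_df = pd.DataFrame({
    "Age": [18, 25, 31, 40],
    "STDs: Time since first diagnosis": [np.nan, np.nan, np.nan, 2.0],
    "Smokes": [0.0, np.nan, 1.0, 0.0],
})

# isna() yields booleans, so the column-wise mean is the fraction of missing values.
missing_pct = toy_df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(1))

# Columns above an illustrative cutoff (50% is an assumption, not a value from the analysis).
mostly_missing = missing_pct[missing_pct > 50].index.tolist()
print(mostly_missing)
```

Applied to `risk_factor_df`, the same expression would give the per-feature percentages behind the decision to drop the two "time since diagnosis" columns.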

In [None]:
risk_factor_df.drop(columns=["STDs: Time since first diagnosis", "STDs: Time since last diagnosis"], inplace=True)

To avoid degrading the performance of future models, we set a threshold of 10 missing values per row. Rows with more than 10 missing values were removed from the dataset because they carried too little meaningful information.

In [None]:
# Rows containing NaN values
total_rows = len(risk_factor_df)
nan_rows = risk_factor_df.isna().any(axis=1).tolist().count(True)

print("\nTotal Rows containing NaN values in dataset: {}/{}".format(nan_rows, total_rows))

# Remove rows that contain more than 10 NaN values
nan_per_row = risk_factor_df.isna().sum(axis=1)
rows_to_del = []
# Iterate over every row, including the last one
for indx in range(len(nan_per_row)):
    val = nan_per_row[indx]
    if (val > 10):
        rows_to_del.append(indx)

print("\nRows containing >10 NaN values: {}/{}".format(len(rows_to_del), total_rows))

risk_factor_df.drop(rows_to_del, inplace=True)
risk_factor_df.reset_index(drop=True, inplace=True)
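
The row filter can also be written without an explicit loop. The following is a minimal sketch on a hypothetical three-column frame (`toy_df`); the threshold of 1 is used only because the toy data is small, whereas the analysis above uses 10.

```python
import numpy as np
import pandas as pd

MAX_NAN_PER_ROW = 1  # illustrative value; the notebook's threshold is 10

# Hypothetical data, not taken from the real dataset.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# Count the missing values in each row.
nan_per_row = toy_df.isna().sum(axis=1)

# Keep rows at or below the threshold, then rebuild a clean 0..n-1 index.
keep_mask = nan_per_row <= MAX_NAN_PER_ROW
filtered_df = toy_df[keep_mask].reset_index(drop=True)

print("Rows removed:", int((~keep_mask).sum()))
print(filtered_df)
```

A boolean mask covers every row by construction, so there is no loop bound to get wrong.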

In [None]:
#Plot
color_1 = [px.colors.sequential.Agsunset[0], px.colors.sequential.Agsunset[1]]
color_2 = [px.colors.sequential.Agsunset[2], px.colors.sequential.Agsunset[3]]


row_figure = make_subplots(1, 2, specs=[[{'type':'domain'}, {'type':'domain'}]],
    subplot_titles=['Contain NaN Values', 'Contain >10 NaN Values'])

row_figure.add_trace(go.Pie(labels=["Has NaN Values","Is Filled"],
    values=[nan_rows, total_rows - nan_rows],
    marker_colors=color_1,
    pull=[0.1, 0]), 1, 1)

# Labels are ordered to match the values: rows above the threshold first
row_figure.add_trace(go.Pie(labels=[">10 NaN", "1-10 NaN"],
    values=[len(rows_to_del), nan_rows - len(rows_to_del)],
    marker_colors=color_2), 1, 2)

row_figure.update_layout(title_text='Rows Containing NaN Values',
    width=650, height= 400,
    title_x=0.5)

row_figure.show()

For the remaining columns, we replaced each missing value with the mean of the observed values in the same column.

In [None]:
# Fill missing values with mean
print("--------------------------- Filling Missing Values ---------------------------")
print("----------------------------------- BEFORE -----------------------------------")
print("Number of rows before filling missing values: ", len(risk_factor_df))

# Display the number of missing values before filling
print("\nNumber of missing values per column before filling:")
print(risk_factor_df.isnull().sum())

# Fill missing values with mean
risk_factor_df.fillna(risk_factor_df.mean(), inplace=True)

In [None]:
print("\n----------------------------------- AFTER -----------------------------------")
print("Number of rows after filling missing values: ", len(risk_factor_df))

# Display the number of missing values after filling
print("\nNumber of missing values per column after filling:")
print(risk_factor_df.isnull().sum())
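
To make the effect of column-wise mean imputation concrete, here is a small sketch on hypothetical data (`toy_df` and its values are made up for illustration). It also shows that a 0/1 column receives a fractional value when filled this way, which may be worth keeping in mind when interpreting binary features later.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one continuous column and one binary (0/1) column.
toy_df = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Smokes": [0.0, 1.0, 0.0, np.nan],
})

# mean() skips NaN by default and returns one value per column.
col_means = toy_df.mean()

# fillna with a Series aligns on column names, so each column gets its own mean.
imputed_df = toy_df.fillna(col_means)

print(col_means)
print(imputed_df)
```

In this toy example the missing `Age` becomes 30.0, while the missing `Smokes` becomes one third rather than 0 or 1.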

### Duplicate Rows

Following the missing value analysis, we checked whether the dataset contained any duplicate rows and removed those we found.

In [None]:
print("----------------------------------- Duplicate Rows -----------------------------------")
# Check for duplicate rows
duplicate_rows = risk_factor_df.duplicated()

# Count the number of duplicate rows
num_duplicates = duplicate_rows.sum()

if num_duplicates == 0:
    print("No duplicate rows found in the dataset.")
else:
    print(f"Found {num_duplicates} duplicate rows in the dataset.\n")

    # Display the duplicate rows indexes (if any)
    print("Duplicate rows indexes: {}\n".format(risk_factor_df[duplicate_rows].index.values))

    # Removing duplicate rows
    print("----------------------------- Removing Duplicates ----------------------------")
    print("----------------------------------- BEFORE -----------------------------------")
    print("Number of rows before removing duplicates: ", len(risk_factor_df))

    # Drop duplicate rows
    risk_factor_df.drop_duplicates(inplace=True)

    print("\n----------------------------------- AFTER -----------------------------------")
    print("Number of rows after removing duplicates: ", len(risk_factor_df))


That concluded the first part of the analysis.

In [None]:
print("\nFinal dataset size: {} cols, {} rows".format(risk_factor_df.shape[1], risk_factor_df.shape[0]))

## Understanding features

Once the first part of the analysis was completed, we moved on to exploring some statistical properties of the dataset. This allowed us to identify possible relationships between the features as well as possible imbalances.

### Target value imbalance

In [None]:
count_0 = risk_factor_df["Biopsy"].value_counts()[0]
count_1 = risk_factor_df["Biopsy"].value_counts()[1]

classes_df = pd.DataFrame(
    [["Healthy",count_0], ["Cervical Cancer",count_1]],
    columns =['type', 'count'])

balance_fig = px.bar(classes_df, x="type", y="count",
    title="Class Distribution",
    color="type",
    text_auto=True,
    color_discrete_sequence=px.colors.qualitative.Bold,
    width=500, height=400,
    labels={
        "type": "Class",
        "count": "Occurrences"
    })

balance_fig.update_layout(
    title_x=0.5    
)

balance_fig.show()
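
Alongside the bar chart, the imbalance can be summarised numerically. The sketch below uses a hypothetical `Biopsy` series with made-up labels, so the numbers are not the real class counts.

```python
import pandas as pd

# Hypothetical target column: 8 negative and 2 positive biopsies (made-up values).
biopsy = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Biopsy")

# Absolute counts per class.
counts = biopsy.value_counts()

# Fraction of samples belonging to the larger class.
majority_share = counts.max() / counts.sum()

# How many majority-class samples exist for each minority-class sample.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print("Majority share: {:.2f}".format(majority_share))
print("Imbalance ratio: {:.1f}".format(imbalance_ratio))
```

Running the same lines on `risk_factor_df["Biopsy"]` would give the figures that correspond to the plot above.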

In [None]:
# Function finding the unique values of each column in the dataframe
def find_unique_values_df(feat: pd.DataFrame):
    column_unique  = {}

    for col in list(feat):
        column_unique[str(col)] = feat[col].unique()

    return column_unique

#### General characteristics

In [None]:
mean_gc = risk_factor_df.iloc[:,:11].mean()

In [None]:
std_gc = risk_factor_df.iloc[:,:11].std()

In [None]:
print("----------------------------------- Unique Values -----------------------------------")    
# Unique Values
unique_vals = find_unique_values_df(risk_factor_df)

for col in unique_vals:
    print("{} : {}".format(col, list(unique_vals[col])))
