MAI643 - Artificial Intelligence in Medicine

Project Assignment 1 - Spring Semester 2024

Student Name:    
Christina Ioanna Saroglaki   
Jianlin Ye 

UCY Email:     
saroglaki.christina-ioanna@ucy.ac.cy    
jye00001@ucy.ac.cy 

### Import Libraries

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Overview

According to the authors, the chosen dataset covers indicators associated with the diagnosis of cervical cancer, including demographic information, habits, and medical records. The data was gathered at 'Hospital Universitario de Caracas' in Venezuela from a total of 858 patients.

C. J. Fernandes Kelwin and J. Fernandes, “Cervical cancer (Risk Factors),” UCI Machine 
Learning Repository. 2017.

In [None]:
risk_factor_df = pd.read_csv("risk_factors_cervical_cancer.csv")

print("----------------------------------- Information -----------------------------------")
risk_factor_df.info()

Split the dataset into features and target variables. The last four columns are treated as the targets.

In [None]:
feature_df = risk_factor_df.iloc[:,:-4]
dep_df = risk_factor_df.iloc[:,-4:]

## Preliminary analysis of the dataset

To gain a better understanding of the dataset, we conducted a preliminary analysis. First, we converted all values to a numeric type; entries that cannot be parsed as numbers are coerced to NaN.

In [None]:
risk_factor_df = risk_factor_df.apply(pd.to_numeric, errors = "coerce")
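
As a small illustration of what `errors="coerce"` does, the sketch below runs the same conversion on a toy frame. The column values and the use of `"?"` as a placeholder for missing entries are assumptions made for the example, not taken from the real data.

```python
import pandas as pd

# Toy data (hypothetical): "?" stands in for a missing entry, so both
# columns start out as strings rather than numbers.
toy_df = pd.DataFrame({
    "Age": ["18", "25", "?"],
    "Smokes": ["0.0", "?", "1.0"],
})

# Column-wise conversion: anything that cannot be parsed as a number becomes NaN.
toy_numeric = toy_df.apply(pd.to_numeric, errors="coerce")

print(toy_numeric.dtypes)
print(toy_numeric.isna().sum())
```

After the conversion, the placeholder entries are counted by `isna()` like any other missing value, which is what the missing-value analysis relies on.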

### Missing Values

Next, we measured how many values are missing in the dataset and which features have the most missing values.

In [None]:
print("----------------------------------- Missing Values -----------------------------------")
nan_columns = {}
total_nan = 0
total_entries = risk_factor_df.size  # number of rows * number of columns

# Find columns containing NaN values
for col in risk_factor_df.columns:
    if risk_factor_df[col].isnull().any():
        nan_in_column = risk_factor_df[col].isna().sum()
        nan_columns[col] = nan_in_column
        total_nan += nan_in_column
    else:
        nan_columns[col] = 0

# Sort columns by NaN count, largest first (done unconditionally so the
# loop below always iterates over (column, count) pairs)
nan_columns = sorted(nan_columns.items(), key=lambda item: item[1], reverse=True)

# Print total NaN values
if total_nan == 0:
    print("\nNo NaN values in the dataset.")
else:
    print("\nNaN values found in the dataset.")

print("\nTotal NaN values in dataset: {}/{}".format(total_nan, total_entries))

print("\nNumber of NaN values per column:")
for col_name, nan_count in nan_columns:
    print("{} : {}".format(col_name, nan_count))

# Rows containing NaN values
nan_rows = risk_factor_df.iloc[:, :-4].isna().any(axis=1).sum()

print("\nTotal Rows containing NaN values in dataset: {}/{}".format(nan_rows, len(risk_factor_df)))

# Plots
total_labels = ["NaN values", "Valid Values"]
total_size = [total_nan, total_entries-total_nan]

row_labels = ["Contain NaN", "Filled rows"]
# Rows without any NaN are the complement of the rows that contain NaN
row_size = [nan_rows, len(risk_factor_df) - nan_rows]

fig_1, ax_1 = plt.subplots(figsize=(10, 3), subplot_kw=dict(aspect="equal"))
ax_1.pie(total_size, labels=total_labels, autopct='%1.1f%%', textprops=dict(color="w"))
ax_1.legend(loc= "center left", bbox_to_anchor=(1, 0, 0.5, 1))
ax_1.set_title("Total Missing Values")

fig_2, ax_2 = plt.subplots(figsize=(10, 3), subplot_kw=dict(aspect="equal"))
ax_2.pie(row_size, labels=row_labels, autopct='%1.1f%%', textprops=dict(color="w"))
ax_2.legend(loc= "center left", bbox_to_anchor=(1, 0, 0.5, 1))
ax_2.set_title("Total Rows Containing Missing Values")

plt.show()
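
The per-column counts above can also be expressed as fractions, which makes columns that are mostly empty easier to spot. The following is a minimal sketch on toy data; the helper name `missing_fraction` and the toy columns are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of NaN entries per column, largest first."""
    # isna() yields booleans, so the column mean is the share of missing entries.
    return df.isna().mean().sort_values(ascending=False)


# Toy data (hypothetical), not the real dataset
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

fractions = missing_fraction(toy_df)
print(fractions)
```

Applied to `risk_factor_df`, the same call would list the features with the highest share of missing values first.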

We identified that about 92% of the values in the features “STDs: Time since first diagnosis” and “STDs: Time since last diagnosis” are NaN. With such a high percentage of missing values, it is impractical either to eliminate the affected observations or to fill the missing data with the mean of the existing data. Consequently, these features were excluded from the dataset for the development of the models.

For the remaining columns, we can replace the missing values with the mean of the existing data during the pre-processing step.

In [None]:
# drop() returns a new DataFrame, so the result must be assigned back
risk_factor_df = risk_factor_df.drop(columns=["STDs: Time since first diagnosis", "STDs: Time since last diagnosis"])
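
For the remaining columns, mean imputation could look roughly like the sketch below. It uses toy data with hypothetical column names and is only one possible way to do it, not necessarily the pre-processing used later in this project.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), not the real dataset
toy_df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 10.0, 20.0],
})

# Column means are computed over the non-missing entries only (NaN is skipped by default).
column_means = toy_df.mean()

# Passing a Series to fillna fills each column with the value stored under its own label.
imputed_df = toy_df.fillna(column_means)
print(imputed_df)
```

An alternative with the same effect is `sklearn.impute.SimpleImputer(strategy="mean")`, which fits more naturally into a scikit-learn pipeline.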

### Duplicate rows

Following the missing value analysis, we checked whether the dataset contains any duplicate rows.

In [None]:
# Check for duplicate rows
duplicate_rows = risk_factor_df.duplicated()

# Count the number of duplicate rows
num_duplicates = duplicate_rows.sum()

if num_duplicates == 0:
    print("No duplicate rows found in the dataset.")
else:
    print(f"Found {num_duplicates} duplicate rows in the dataset.")

# Display the duplicate rows (if any)
if num_duplicates > 0:
    duplicate_rows_df = risk_factor_df[duplicate_rows]
    print("\nDuplicate rows:")
    print(duplicate_rows_df)


### Understanding features

In [None]:
# Function finding the unique values of each column in the dataframe
def find_unique_values_df(feat: pd.DataFrame):
    column_unique  = {}

    for col in list(feat):
        column_unique[str(col)] = feat[col].unique()

    return column_unique

In [None]:
print("----------------------------------- Unique Values -----------------------------------")    
# Unique Values
unique_vals = find_unique_values_df(risk_factor_df)

for col in unique_vals:
    print("\n{} : {}".format(col, unique_vals[col]))

    # Convert the column to numeric values (unparseable entries become NaN)
    risk_factor_df[col] = pd.to_numeric(risk_factor_df[col], errors="coerce")

In [None]:
# Removing duplicate rows
print("----------------------------- Removing Duplicates ----------------------------")
print("----------------------------------- BEFORE -----------------------------------")
print("Number of rows before removing duplicates: ", len(risk_factor_df))

# Check for duplicate rows
duplicate_rows = risk_factor_df.duplicated()

# Count the number of duplicate rows
num_duplicates = duplicate_rows.sum()

if num_duplicates == 0:
    print("No duplicate rows found in the dataset.")
else:
    print(f"Found {num_duplicates} duplicate rows in the dataset.")

# Display the duplicate rows (if any)
if num_duplicates > 0:
    duplicate_rows_df = risk_factor_df[duplicate_rows]
    print("\nDuplicate rows:")
    print(duplicate_rows_df)

# Drop duplicate rows
risk_factor_df.drop_duplicates(inplace=True)

print("----------------------------------- AFTER -----------------------------------")
print("Number of rows after removing duplicates: ", len(risk_factor_df))
