<table style="background-color:#F5F5F5;" width="100%">
<tr><td style="background-color:#F5F5F5;"><img src="../images/logo.png" width="150" align='right'/></td></tr>     <tr><td>
            <h2><center>Aprendizagem Automática em Engenharia Biomédica</center></h2>
            <h3><center>1st Semester - 2024/2025</center></h3>
            <h4><center>Universidade Nova de Lisboa - Faculdade de Ciências e Tecnologia</center></h4>
</td></tr>
    <tr><td><h2><b><center>Project</center></b></h2>
    <h4><i><b><center>Predicting Cervical Cancer: A Machine Learning Approach Using Risk Factor Analysis 
</center></b></i></h4></td></tr>
</table>


 <h3>Model requirements</h3>
 <li> Comparison of at least 3 models</li>
 <li>Cross-Validation</li>
 <li> Grid Search </li>
 <li> Results evaluation and discussion </li>


<h3>Evaluation</h3>

 <li>Quality of coding (we need to comment and explain everything)</li>
 <li>Creativity</li>
 <li>Comparison with the state of the art</li>



In [150]:
#!pip install ucimlrepo
import pandas as pd  
import numpy as np 
import matplotlib.pyplot as plt 
from typing import Tuple
from sklearn.model_selection import train_test_split
pd.set_option("display.float_format", "{:.4f}".format)



<h3>1. Introduction</h3>

Cervical cancer is the fourth most common cancer worldwide and a significant cause of mortality, particularly in low- and middle-income countries, where 94% of all deaths occur [1]. Although medicine has evolved, implementing cytology-based and other types of screening remains challenging in these countries due to the lack of healthcare infrastructure and trained professionals [2]. 
With that in mind, machine learning can serve as a helpful tool for interpreting complex datasets and supporting clinical decision-making, thanks to its strong data analysis capabilities [3]. 

The initial approach was to use the patient characteristics in the dataset, combined with the risk factors, to predict the presence of the disease. However, upon analyzing the data, it occurred to us that we could reframe the project to predict the results of the four main tests used to detect cervical cancer. This work could be relevant to the management of clinical resources. For example, a person who shows certain risk factors could be called in for only two of the four exams, saving both financial and material resources.

 <h3>2. Data Preparation </h3>

<h4> 2.1 Data import </h4>

In [155]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
cervical_cancer_risk_factors = fetch_ucirepo(id=383) 
  
# data (as pandas dataframes) 
df_cervical_cancer = cervical_cancer_risk_factors.data.features 

#Getting dataset dimensions
n_rows = df_cervical_cancer.shape[0]
n_features = df_cervical_cancer.shape[1]
print('The dataset has {} samples and {} features. \n'.format(n_rows, n_features))

#Showing the first 5 rows of the dataset (head() returns 5 rows by default)
print('The first 5 rows are displayed below. \n\n')
df_cervical_cancer.head()



The dataset has 858 samples and 36 features. 

The first 5 rows are displayed below. 




Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
2,34,1.0,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,...,,,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,...,,,0,0,0,0,0,0,0,0


<p style="font-size:15px;"> The data consists of numeric variables (age, number of sexual partners, etc.) and binary variables (0 as false and 1 as true) that record the results of the tests, the presence of STDs, etc. </p>

In [157]:
print(df_cervical_cancer.dtypes)


Age                                     int64
Number of sexual partners             float64
First sexual intercourse              float64
Num of pregnancies                    float64
Smokes                                float64
Smokes (years)                        float64
Smokes (packs/year)                   float64
Hormonal Contraceptives               float64
Hormonal Contraceptives (years)       float64
IUD                                   float64
IUD (years)                           float64
STDs                                  float64
STDs (number)                         float64
STDs:condylomatosis                   float64
STDs:cervical condylomatosis          float64
STDs:vaginal condylomatosis           float64
STDs:vulvo-perineal condylomatosis    float64
STDs:syphilis                         float64
STDs:pelvic inflammatory disease      float64
STDs:genital herpes                   float64
STDs:molluscum contagiosum            float64
STDs:AIDS                         

In [158]:
#Get all the columns with a binary classification 
binary_columns = df_cervical_cancer.loc[:, (df_cervical_cancer.isin([0, 1]) | df_cervical_cancer.isna()).all()]
binary_columns.describe().iloc[[0]]

Unnamed: 0,Smokes,Hormonal Contraceptives,IUD,STDs,STDs:condylomatosis,STDs:cervical condylomatosis,STDs:vaginal condylomatosis,STDs:vulvo-perineal condylomatosis,STDs:syphilis,STDs:pelvic inflammatory disease,...,STDs:Hepatitis B,STDs:HPV,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
count,845.0,750.0,741.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,...,753.0,753.0,858.0,858.0,858.0,858.0,858.0,858.0,858.0,858.0
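
The column selection above relies on a boolean mask: a cell passes if it is 0, 1, or missing, and a column counts as binary only when every cell passes. A minimal sketch of that logic on a toy frame (the values below are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: a binary column with a gap,
# a count column, and a binary column without gaps.
toy = pd.DataFrame({
    "Smokes": [0.0, 1.0, np.nan, 0.0],
    "Num of pregnancies": [1.0, 4.0, 2.0, np.nan],
    "Biopsy": [0, 0, 1, 0],
})

# True where a cell is 0, 1, or missing
cell_ok = toy.isin([0, 1]) | toy.isna()

# A column is treated as binary only if every one of its cells passes
is_binary = cell_ok.all()
binary_cols = toy.loc[:, is_binary]

print(is_binary)
print(binary_cols.columns.tolist())
```

One caveat of this approach: a count column that happens to contain only 0s and 1s would also be classified as binary, so the resulting list is worth checking by eye.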


In [159]:
#Get the columns with continuous values
continuous_columns = df_cervical_cancer.drop(binary_columns.columns, axis=1)
continuous_columns.describe().iloc[[0,1,2,3,7]]


Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes (years),Smokes (packs/year),Hormonal Contraceptives (years),IUD (years),STDs (number),STDs: Number of diagnosis,STDs: Time since first diagnosis,STDs: Time since last diagnosis
count,858.0,832.0,851.0,802.0,845.0,845.0,750.0,741.0,753.0,858.0,71.0,71.0
mean,26.8205,2.5276,16.9953,2.2756,1.2197,0.4531,2.2564,0.5148,0.1766,0.0874,6.1408,5.8169
std,8.4979,1.6678,2.8034,1.4474,4.089,2.2266,3.7643,1.9431,0.562,0.3025,5.895,5.7553
min,13.0,1.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
max,84.0,28.0,32.0,11.0,37.0,37.0,30.0,19.0,4.0,3.0,22.0,22.0


In [160]:
df_cervical_cancer.isnull().sum()

Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

<p style="font-size:15px;"> The columns 'STDs: Time since first diagnosis' and 'STDs: Time since last diagnosis' have a lot of missing values (only 71 of the 858 samples have a value), so we drop them.</p>

In [162]:
df_cervical_cancer= df_cervical_cancer.drop(['STDs: Time since first diagnosis','STDs: Time since last diagnosis'], axis=1)
df_cervical_cancer.head()


Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs:HPV,STDs: Number of diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
2,34,1.0,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,...,0.0,0,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0


In [163]:
df_cervical_cancer = df_cervical_cancer.dropna()
print(df_cervical_cancer.shape[0])

668
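
The two cleaning steps above (dropping very sparse columns, then dropping rows that still contain missing values) can be sketched on toy data. In the real dataset the two dropped columns are roughly 92% missing (71 non-null values out of 858). The `max_missing` threshold below is a hypothetical choice for illustration; the notebook itself selects the columns by name.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Age": [18, 15, 34, 52, 46],
    "Num of pregnancies": [1.0, 1.0, np.nan, 4.0, 4.0],
    "STDs: Time since first diagnosis": [np.nan, np.nan, np.nan, 3.0, np.nan],
})

max_missing = 0.5  # hypothetical threshold, not taken from the notebook

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# Columns whose missing fraction exceeds the threshold
sparse_cols = missing_fraction[missing_fraction > max_missing].index.tolist()

# Drop the sparse columns first, then the rows that are still incomplete
cleaned = toy.drop(columns=sparse_cols).dropna()

print(missing_fraction)
print(sparse_cols)
print(cleaned.shape)
```

Dropping the sparse columns before calling `dropna()` matters: in the opposite order, almost every row would be removed because of those two columns alone.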


In [164]:
zero_per_columns = (df_cervical_cancer == 0).sum()
print(zero_per_columns)

Age                                     0
Number of sexual partners               0
First sexual intercourse                0
Num of pregnancies                     14
Smokes                                572
Smokes (years)                        572
Smokes (packs/year)                   572
Hormonal Contraceptives               238
Hormonal Contraceptives (years)       238
IUD                                   593
IUD (years)                           593
STDs                                  603
STDs (number)                         603
STDs:condylomatosis                   631
STDs:cervical condylomatosis          668
STDs:vaginal condylomatosis           664
STDs:vulvo-perineal condylomatosis    632
STDs:syphilis                         653
STDs:pelvic inflammatory disease      667
STDs:genital herpes                   667
STDs:molluscum contagiosum            667
STDs:AIDS                             668
STDs:HIV                              655
STDs:Hepatitis B                  

<p style="font-size:15px;">STDs:AIDS and STDs:cervical condylomatosis are equal to 0 in all 668 remaining rows, so they carry no information and we drop them.</p>

In [166]:
df_cervical_cancer= df_cervical_cancer.drop(['STDs:AIDS','STDs:cervical condylomatosis'], axis=1)
df_cervical_cancer.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs:HPV,STDs: Number of diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,...,0.0,0,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
5,42,3.0,23.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
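
Constant columns like the two dropped above can also be detected programmatically instead of by reading the zero counts. A minimal sketch on toy data (made-up values), using the number of distinct values per column:

```python
import pandas as pd

# Toy frame with made-up values: the first column is constant
toy = pd.DataFrame({
    "STDs:AIDS": [0.0, 0.0, 0.0, 0.0],
    "STDs:HIV": [0.0, 1.0, 0.0, 0.0],
    "Age": [18, 15, 52, 46],
})

# Number of distinct (non-missing) values in each column
n_unique = toy.nunique()

# A column with at most one distinct value cannot help a classifier
constant_cols = n_unique[n_unique <= 1].index.tolist()
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```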


 <h3>3. Model Training </h3>

In [168]:
from typing import Any, Tuple

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score

def train_and_evaluate_multi_target(
    model: Tuple[str, Any],
    X_train: pd.DataFrame,
    Y_train: pd.DataFrame,
    X_test: pd.DataFrame,
    Y_test: pd.DataFrame
) -> pd.DataFrame:
    """
    Train and evaluate a machine learning model for multiple target variables.

    The model is fitted separately on each column of Y_train. For every target,
    the function reports the accuracy on the training and test sets and the
    10-fold cross-validated accuracy on the training set.

    :param model: A tuple containing the model name as a string and the model instance.
    :param X_train: Training features as a DataFrame.
    :param Y_train: Training labels as a DataFrame (one column per target).
    :param X_test: Test features as a DataFrame.
    :param Y_test: Test labels as a DataFrame (same columns as Y_train).
    :return: A DataFrame with one row of performance metrics per target variable.
    """
    model_name, model_instance = model
    metrics_list = []

    # Iterate over each target variable (column in Y_train and Y_test)
    for i in range(Y_train.shape[1]):
        y_train = Y_train.iloc[:, i]  # Select column i of the training labels
        y_test = Y_test.iloc[:, i]    # Select column i of the test labels
        target_name = Y_train.columns[i]

        # Train the model on the current target
        model_instance.fit(X_train, y_train)

        # Calculate the accuracy on training and test sets
        train_acc = model_instance.score(X_train, y_train)
        test_acc = model_instance.score(X_test, y_test)

        # Perform 10-fold cross-validation on the training set
        # (cross_val_score fits clones, so model_instance keeps the fit from above)
        acc_cv = cross_val_score(estimator=model_instance, X=X_train, y=y_train, cv=10)

        # Store metrics for this target variable
        metrics_list.append({
            "Model Name": model_name,
            "Target": target_name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "CV Acc Mean": np.mean(acc_cv),
            "CV Acc Std": np.std(acc_cv)
        })

    # One row per target variable
    return pd.DataFrame(metrics_list)
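
When `cross_val_score` receives an integer `cv` and a classifier with a binary target, scikit-learn uses stratified folds without shuffling. The sketch below makes that choice explicit and adds shuffling with a fixed seed, which can be useful when the row order of the data is not random. It runs on synthetic data from `make_classification`; the logistic regression model, the 90/10 class balance, and the names `X_demo` and `y_demo` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (about 10% positives); not the project data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the class proportions similar in every fold;
# shuffle + random_state make the split reproducible
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```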

In [169]:
#Prepare the data (placed here because it is shared by all models)
Y = df_cervical_cancer.loc[:,['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df_cervical_cancer.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'],axis=1)


3.1 - Random Forest

In [171]:
#Split the data into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

#Train a single multi-output random forest on the four targets at once
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
model.fit(X_train, Y_train)

#Predictions (one column per target)
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)

#Evaluation: with a multi-column Y, accuracy_score returns the subset accuracy,
#i.e. the fraction of samples for which all four test results are predicted correctly
from sklearn.metrics import accuracy_score
acc_train = accuracy_score(Y_train, y_pred_train)
acc_test = accuracy_score(Y_test, y_pred_test)
print('Subset accuracy test set:', acc_test)
print('Subset accuracy training set:', acc_train)

Subset accuracy test set: 0.8432835820895522
Subset accuracy training set: 0.9906367041198502
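
Because `Y_test` has four columns, the call `accuracy_score(Y_test, y_pred_test)` above returns the exact-match (subset) accuracy: a sample counts as correct only if all four targets are predicted correctly. Per-target accuracy has to be computed column by column. A minimal sketch of the difference, using small hand-made arrays (not model output):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Four samples, four targets (hand-made values for illustration)
y_true = np.array([[0, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 1],
                   [0, 0, 0, 0]])
y_pred = np.array([[0, 0, 0, 0],
                   [0, 1, 0, 0],   # last target wrong
                   [1, 1, 0, 1],
                   [0, 0, 1, 0]])  # third target wrong

# Subset accuracy: only rows 0 and 2 match on every target
subset_acc = accuracy_score(y_true, y_pred)

# Per-target accuracy: evaluate each column separately
per_target = [accuracy_score(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]

print(subset_acc)   # 0.5
print(per_target)   # [1.0, 1.0, 0.75, 0.75]
```

The subset accuracy is never higher than the lowest per-target accuracy, so a single number for all four tests hides how well each individual test is predicted.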


In [172]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid with valid values
param_grid = {
    'n_estimators': [100, 300],          # Number of trees
    'max_depth': [None, 10, 20],         # Maximum tree depth
    'min_samples_split': [2, 5, 10],    # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4],      # Minimum samples required at a leaf
}

# Tune the hyperparameters for each target
best_models = {}
for target in ['Hinselmann', 'Schiller', 'Citology', 'Biopsy']:
    print(f"Tuning model for: {target}")
    
    # Configure the GridSearchCV
    grid_search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=42, class_weight='balanced'),
        param_grid=param_grid,
        scoring='roc_auc',
        cv=5,  # 5-fold cross-validation
    )
    
    # Fit the grid search on the current target only
    grid_search.fit(X_train, Y_train[target])
    
    # Best model for the current target
    best_models[target] = grid_search.best_estimator_
    print(f"Best hyperparameters for {target}:", grid_search.best_params_)

# Evaluation of the best models
from sklearn.metrics import accuracy_score, roc_auc_score

for target in ['Hinselmann', 'Schiller', 'Citology', 'Biopsy']:
    # Predictions of the tuned model
    y_pred = best_models[target].predict(X_test)
    y_proba = best_models[target].predict_proba(X_test)[:, 1]
    
    # Evaluate the performance
    auc_roc = roc_auc_score(Y_test[target], y_proba)
    accuracy = accuracy_score(Y_test[target], y_pred)
    print(f"\nResults for {target}:")
    print(f"AUC-ROC: {auc_roc:.4f}")
    print(f"Accuracy: {accuracy:.4f}")
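
The cost of the grid search above can be estimated before running it. The sketch below counts the parameter combinations in the same `param_grid` with scikit-learn's `ParameterGrid` and multiplies by the number of folds and targets; the variable names are chosen here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 3 * 3 combinations
n_folds = 5
n_targets = 4

# Cross-validation fits only; GridSearchCV also refits the best
# candidate once per target when refit=True (the default)
cv_fits_per_target = n_candidates * n_folds
cv_fits_total = cv_fits_per_target * n_targets

print(n_candidates)        # 54
print(cv_fits_per_target)  # 270
print(cv_fits_total)       # 1080
```

With forests of up to 300 trees this is over a thousand model fits, so passing `n_jobs=-1` to `GridSearchCV` to use all CPU cores may be worth considering.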

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Generate and display the confusion matrices for the best models
for target in ['Hinselmann', 'Schiller', 'Citology', 'Biopsy']:
    # Predictions on the test set
    y_true = Y_test[target]  # True values of the target variable
    y_pred = best_models[target].predict(X_test)  # Predictions of the tuned model
    
    # Compute the confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    # Display the confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=best_models[target].classes_)
    disp.plot(cmap=plt.cm.Blues)
    plt.title(f'Confusion matrix for {target}')
    plt.show()


In [None]:
#Correlation between the target variables
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df_cervical_cancer[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']].corr(), annot=True)
plt.show()

3.2 - Support Vector Machine (SVM)


In [173]:
from sklearn.svm import SVC
from tabulate import tabulate

model = ('SVC', SVC())

# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Train and evaluate
model_metrics = train_and_evaluate_multi_target(model, X_train, Y_train, X_test, Y_test)

# Show the DataFrame as a formatted table
print(tabulate(model_metrics, headers='keys', tablefmt='fancy_grid'))


╒════╤══════════════╤════════════╤══════════════════╤═════════════════╤═══════════════╤══════════════╕
│    │ Model Name   │ Target     │   Train Accuracy │   Test Accuracy │   CV Acc Mean │   CV Acc Std │
╞════╪══════════════╪════════════╪══════════════════╪═════════════════╪═══════════════╪══════════════╡
│  0 │ SVC          │ Hinselmann │         0.956929 │        0.947761 │      0.956988 │  0.00821431  │
├────┼──────────────┼────────────┼──────────────────┼─────────────────┼───────────────┼──────────────┤
│  1 │ SVC          │ Schiller   │         0.906367 │        0.902985 │      0.906359 │  0.000855866 │
├────┼──────────────┼────────────┼──────────────────┼─────────────────┼───────────────┼──────────────┤
│  2 │ SVC          │ Citology   │         0.947566 │        0.91791  │      0.947589 │  0.00735249  │
├────┼──────────────┼────────────┼──────────────────┼─────────────────┼───────────────┼──────────────┤
│  3 │ SVC          │ Biopsy     │         0.93633  │        0.91791  │  
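
Accuracy figures such as those in the table above are easier to interpret next to a trivial baseline, because a classifier that always predicts the majority class can already score highly when one class is rare. A minimal sketch, assuming synthetic labels with 10% positives (not the cervical cancer data) and scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic data: 200 samples, 20 of them positive (assumed class balance)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.zeros(200, dtype=int)
y_demo[:20] = 1

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

acc = accuracy_score(y_demo, y_hat)           # 180 of 200 correct
bal = balanced_accuracy_score(y_demo, y_hat)  # mean of per-class recall

print(acc)  # 0.9
print(bal)  # 0.5
```

If a model's accuracy is close to the baseline's, metrics such as balanced accuracy, recall, or AUC-ROC give a clearer picture of whether it detects the positive cases at all.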

3.3

3.4 - Ensemble

3.5 - Dx Diagnosis

 <h3>4. Result evaluation</h3>

 <h3>5. Discussion</h3>

 <h3>6. Model test (where we build the questionnaire to demonstrate the application of our model)</h3>