## EDA FOR THE CLEANED DATASET

This brief notebook repeats some of the EDA techniques applied to the raw dataset. The goal is to check that the cleaned dataset is complete and suitable for the subsequent preprocessing and modeling steps, and to see whether the shape of the data has changed in any important way.

In [1]:
file_path = r'C:\unibo-dtm-ml-2526-cervical-cancer-predictor\data\cleaned_data.csv'
with open(file_path, 'r') as f:
    lines = f.readlines()
print('Read {} lines'.format(len(lines)))

Read 836 lines


Importing the necessary libraries. 

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import IsolationForest


Profiling the cleaned dataset.

In [3]:
# repeat the data profiling pipeline for the newly cleaned data
df = pd.read_csv(file_path)

print("\nDataset Info: \n")
df.info()  # info() prints its report directly; wrapping it in print() would also print "None"

#check whether everything went smoothly at the data cleaning stage
print("\nMissing Values: \n")
print(df.isnull().sum()) 

print("\nDuplicate Values: \n")
print(df.duplicated().sum())

print("\nDescriptive Statistics:")
print(df.describe(include='all'))



Dataset Info: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 835 entries, 0 to 834
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Age                              835 non-null    int64  
 1   Number of sexual partners        810 non-null    float64
 2   First sexual intercourse         828 non-null    float64
 3   Num of pregnancies               779 non-null    float64
 4   Smokes (years)                   822 non-null    float64
 5   Smokes (packs/year)              822 non-null    float64
 6   Hormonal Contraceptives (years)  732 non-null    float64
 7   IUD (years)                      723 non-null    float64
 8   STDs (number)                    735 non-null    float64
 9   STDs: Viral group                835 non-null    int64  
 10  STDs: Bacterial group            835 non-null    int64  
 11  STDs:condylomatosis              735 non-null    float64
 12  STDs:
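
As a complement to the raw missing-value counts, here is a minimal sketch of how the share of missing values per column could be summarised. The toy frame and its values are invented for illustration; only the column names are borrowed from the dataset, and `toy` and `missing_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Age": [18, 15, 34, 52],
    "IUD (years)": [0.0, np.nan, 0.0, np.nan],
    "STDs (number)": [0.0, 0.0, np.nan, 0.0],
})

# isnull().mean() gives the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_share = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_share.round(1))
```

Sorting in descending order puts the columns that will need the most imputation at the top.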

## ADVANCED OUTLIER INVESTIGATION VIA ISOLATION FOREST



I now want to investigate outliers further by running Isolation Forest on both processed datasets (median-imputed and KNN-imputed). The aim is to spot multivariate anomalies that simple boxplots cannot identify: combinations of values that look normal on their own but are impossible together. One thing to keep in mind: many of the flagged combinations will likely involve the positively tested cases, which are crucial for training, so those rows will not be dropped.
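
To illustrate what "normal on their own but odd together" can mean, here is a small sketch on invented data, not on the real dataset. It builds two correlated synthetic features, appends one row that contradicts the correlation, checks that the row stays inside the usual 1.5 × IQR boxplot whiskers for each feature, and then asks Isolation Forest how that row ranks. The names `a`, `b`, `odd_row` and `inside_boxplot_fences` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Two strongly correlated synthetic features (invented data, not the real dataset).
a = rng.normal(size=500)
b = a + rng.normal(scale=0.2, size=500)

# One extra row whose values are unremarkable on their own,
# but whose combination (high a, low b) contradicts the correlation.
odd_row = np.array([[2.0, -2.0]])
X = np.vstack([np.column_stack([a, b]), odd_row])

def inside_boxplot_fences(column, value):
    """True if value lies within the usual 1.5 * IQR boxplot whiskers."""
    q1, q3 = np.percentile(column, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr <= value <= q3 + 1.5 * iqr

# Per-feature check: would a boxplot flag the odd row?
print(inside_boxplot_fences(X[:, 0], 2.0), inside_boxplot_fences(X[:, 1], -2.0))

# Multivariate check: score_samples returns lower values for more anomalous rows.
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.score_samples(X)
odd_rank = int((scores < scores[-1]).sum())
print("rows scored as more anomalous than the odd row:", odd_rank)
```

A small rank means the forest isolates the odd row quickly, even though neither of its values would stand out in a per-feature boxplot.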

### ISOLATION FOREST ON MEDIAN-IMPUTED DATASET

In [23]:
file_path = r'C:\unibo-dtm-ml-2526-cervical-cancer-predictor\data\data_after_imputation\median_and_freq_imputed.csv'
df = pd.read_csv(file_path)

targets = ["Hinselmann", "Schiller", "Citology", "Biopsy"]
features = [col for col in df.columns if col not in targets]

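# scale every feature to the [0, 1] range
scaler = MinMaxScaler()
scaled_df = df.copy()
scaled_df[features] = scaler.fit_transform(df[features])

x = scaled_df[features]

# no random_state is set, so the number of flagged rows can vary slightly between runs
isolation_forest = IsolationForest(n_estimators=100, contamination='auto')
model = isolation_forest.fit(x)
# predict() returns labels rather than continuous scores: 1 for inliers, -1 for outliers
df['anomaly_score'] = model.predict(x)
print("Number of anomalies detected: ", df[df['anomaly_score'] == -1].shape[0])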

#show the first 5 outliers detected
print("\nSample of detected outliers:")
display(df[df['anomaly_score'] == -1].head())

display(df)

Number of anomalies detected:  81

Sample of detected outliers:


Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes (years),Smokes (packs/year),Hormonal Contraceptives (years),IUD (years),STDs (number),STDs: Viral group,...,STDs:syphilis,STDs:HIV,Dx:Cancer,Dx:CIN,Dx,Hinselmann,Schiller,Citology,Biopsy,anomaly_score
3,52,1.791759,16.0,4.0,3.637586,3.637586,1.386294,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0,0,0,0,-1
6,51,1.386294,17.0,6.0,3.555348,1.481605,0.0,2.079442,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1,1,0,1,-1
8,45,0.693147,20.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0,0,0,0,-1
18,42,1.098612,20.0,2.0,0.0,0.0,2.079442,1.94591,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-1
21,41,1.386294,17.0,4.0,0.0,0.0,2.397895,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0,0,0,0,-1


Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes (years),Smokes (packs/year),Hormonal Contraceptives (years),IUD (years),STDs (number),STDs: Viral group,...,STDs:syphilis,STDs:HIV,Dx:Cancer,Dx:CIN,Dx,Hinselmann,Schiller,Citology,Biopsy,anomaly_score
0,18,1.609438,15.0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
1,15,0.693147,14.0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
2,34,0.693147,17.0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
3,52,1.791759,16.0,4.0,3.637586,3.637586,1.386294,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0,0,0,0,-1
4,46,1.386294,21.0,4.0,0.000000,0.000000,2.772589,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
830,34,1.386294,18.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
831,32,1.098612,19.0,1.0,0.000000,0.000000,2.197225,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
832,25,1.098612,17.0,0.0,0.000000,0.000000,0.076961,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,1,0,1
833,33,1.098612,24.0,2.0,0.000000,0.000000,0.076961,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
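
Since the flagged rows are meant to be inspected rather than dropped, it may help to see how they are distributed across a target and to rank them by a continuous score. The sketch below is a possible way to do that, shown on a synthetic stand-in frame: the values, the `feature_a`/`feature_b` columns and the `anomaly_label` column are invented for illustration, and `random_state=42` is an arbitrary choice that is not used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for the imputed data: invented values, hypothetical columns.
demo = pd.DataFrame({
    "feature_a": rng.normal(size=300),
    "feature_b": rng.normal(size=300),
    "Biopsy": rng.integers(0, 2, size=300),
})
demo_features = ["feature_a", "feature_b"]

# Fixing random_state makes the set of flagged rows repeatable across runs.
forest = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
forest.fit(demo[demo_features])

# predict() gives labels (1 = inlier, -1 = outlier);
# decision_function() gives a continuous score that is negative for outliers.
demo["anomaly_label"] = forest.predict(demo[demo_features])
demo["anomaly_score"] = forest.decision_function(demo[demo_features])

# How the flagged rows are distributed across the target classes.
print(pd.crosstab(demo["anomaly_label"], demo["Biopsy"]))

# The most anomalous rows first, for manual inspection.
print(demo.sort_values("anomaly_score").head())
```

On the real data, the same cross-tabulation against `Biopsy` (or any of the other targets) would show how many of the flagged rows are positive cases.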
