## DEFINITIVE EXPLORATORY DATA ANALYSIS 

In this notebook I run further EDA steps on the cleaned version of the original dataset in order to extract more meaningful information from it.

In [None]:
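# Path to the cleaned dataset produced by the data cleaning stage
file_path = r'C:\unibo-dtm-ml-2526-cervical-cancer-predictor\data\processed.csv'
# Quick sanity check: count the raw lines in the file (the header row is included in the count)
with open(file_path, 'r') as f:
    lines = f.readlines()
print('Read {} lines'.format(len(lines)))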

Read 831 lines


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
# repeat the data profiling pipeline for the newly cleaned data
df = pd.read_csv(file_path)

print("\nDataset Info: \n")
print(df.info())

# Check that the data cleaning stage left no missing values or duplicate rows
print("\nMissing Values: \n")
print(df.isnull().sum()) 

print("\nDuplicate Values: \n")
print(df.duplicated().sum())

print("\nDescriptive Statistics:")
print(df.describe(include='all'))



Dataset Info: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 34 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Age                                 830 non-null    int64  
 1   Number of sexual partners           830 non-null    float64
 2   First sexual intercourse            830 non-null    float64
 3   Num of pregnancies                  830 non-null    float64
 4   Smokes                              830 non-null    float64
 5   Smokes (years)                      830 non-null    float64
 6   Smokes (packs/year)                 830 non-null    float64
 7   Hormonal Contraceptives             830 non-null    float64
 8   Hormonal Contraceptives (years)     830 non-null    float64
 9   IUD                                 830 non-null    float64
 10  IUD (years)                         830 non-null    float64
 11  STDs                        

### IDENTIFY CORRELATIONS 
This step repeats the correlation analysis, this time on the cleaned version of the dataset.

In [None]:
#Compute the correlation matrix
corr = df.select_dtypes(include=['number']).corr()

#generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr,dtype=bool))

#set up the matplotlib figure
f,ax = plt.subplots(figsize=(11,9))

#generate a custom diverging colormap
cmap = sns.diverging_palette(230,20,as_cmap=True)

# Draw the heatmap with the mask and a square aspect ratio
sns.heatmap(
    corr, mask=mask, cmap=cmap, vmax=1, center=0,
    square=True, linewidths=.5, cbar_kws={"shrink": .5},
)

plt.title("Feature Correlation Heatmap")
plt.show()
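
With 34 columns the heatmap can be hard to read by eye, so it may help to also list the strongest pairwise correlations as a table. The following is a minimal sketch, not part of the original pipeline: the helper name `top_correlated_pairs` and the toy frame `toy` are hypothetical and only illustrate the idea. On the real data the helper would be called on `df`.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation.

    Hypothetical helper, written only for illustration.
    """
    corr = frame.select_dtypes(include=["number"]).corr()
    # Keep the strict lower triangle so that each pair appears exactly once
    # and the diagonal (self-correlation) is excluded
    lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(lower).stack().dropna()
    # Rank by absolute value so strong negative correlations are not missed
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data (an assumption, not the cervical cancer dataset)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top_pairs = top_correlated_pairs(toy, n=2)
print(top_pairs)
```

Ranking by absolute value is a design choice: a strongly negative correlation is as informative for feature selection as a strongly positive one.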

### CLASS IMBALANCE VISUALIZATION
The goal of the project is to predict the outcomes of the four diagnostic tests. It is therefore useful to first visualize the ratio of positive to negative results in the dataset for each test.

In [None]:
tests = ['Hinselmann', 'Schiller', 'Citology', 'Biopsy']
for target in tests:
    # One small count plot per diagnostic test
    plt.figure(figsize=(3, 2))
    sns.countplot(data=df, x=target)
    plt.title(f"{target} Test Positivity (0 = No, 1 = Yes)")
    plt.show()
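
The count plots show the imbalance visually. The same information can be summarized as numbers, which are easier to quote and compare across tests. The following is a sketch that assumes each target column is coded 0/1, as the plot titles suggest. The helper name `positive_rates` and the toy frame `toy` are hypothetical. On the real data it would be called as `positive_rates(df, tests)`.

```python
import pandas as pd


def positive_rates(frame, targets):
    """Count positives and negatives and compute the positive rate per target.

    Hypothetical helper; assumes every target column is coded 0 = No, 1 = Yes.
    """
    rows = []
    for target in targets:
        total = int(frame[target].notna().sum())
        positives = int((frame[target] == 1).sum())
        rows.append({
            "test": target,
            "positives": positives,
            "negatives": total - positives,
            "positive_rate": positives / total,
        })
    return pd.DataFrame(rows).set_index("test")


# Toy data (an assumption, not the cervical cancer dataset)
toy = pd.DataFrame({
    "Hinselmann": [0, 0, 0, 1],
    "Biopsy": [0, 1, 0, 1],
})
rates = positive_rates(toy, ["Hinselmann", "Biopsy"])
print(rates)
```

A low positive rate for a test would suggest that plain accuracy is a misleading metric for that target, and that stratified splits or class weighting may be worth considering at the modeling stage.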