## **Breast Cancer Dataset EDA Notebook**

### **Importing Libraries and Loading Dataset**

In [25]:
# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Load dataset
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
df_full = pd.concat([X, y], axis=1)
df_full.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


The features in this dataset describe the cell nuclei visible in a digitized image of a breast mass sample. They fall into three groups:
1) Mean features: the average of each measurement over the nuclei in an image.
2) Error features: the standard error of each measurement, i.e. how much it varies within the image.
3) Worst features: the mean of the three largest values of each measurement, capturing the most extreme nuclei in the image.

Within each group, the measurements can be further categorized as:
1) Size-related features
2) Texture-related features
3) Shape-related features
4) Complexity-related features
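
The three statistic groups can be recovered directly from the column names. The snippet below is a minimal sketch that assumes the naming used by scikit-learn (`mean ...`, `... error`, `worst ...`); the `groups` dictionary is an illustrative name, not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer

# Load only the feature matrix; the target is not needed here.
X, _ = load_breast_cancer(as_frame=True, return_X_y=True)

# Split the feature names into the three statistic groups by prefix/suffix.
groups = {
    "mean": [c for c in X.columns if c.startswith("mean ")],
    "error": [c for c in X.columns if c.endswith(" error")],
    "worst": [c for c in X.columns if c.startswith("worst ")],
}

for name, cols in groups.items():
    print(f"{name}: {len(cols)} features")
```

Each group should contain the same ten measurements, which is what makes the group-by-group comparison in the rest of the notebook possible.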

### **Dataset Overview**

In [26]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [27]:
# target distribution
df_full.target.value_counts()

target
1    357
0    212
Name: count, dtype: int64
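
The integer labels are easier to read once mapped to class names. The sketch below assumes the `target_names` attribute of the scikit-learn bunch, where label 0 is malignant and label 1 is benign; the variables `labels`, `counts`, and `shares` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)

# target_names is indexed by the integer label: 0 -> malignant, 1 -> benign.
name_by_label = {i: str(name) for i, name in enumerate(data.target_names)}
labels = data.target.map(name_by_label)

counts = labels.value_counts()
shares = labels.value_counts(normalize=True).round(3)
print(counts)
print(shares)
```

The classes are moderately imbalanced, which is worth keeping in mind when choosing evaluation metrics later.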

### **Data Cleaning**

In [28]:
# Check for nulls
df_full.isna().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

In [29]:
# Check for duplicates
df_full.duplicated().sum()

np.int64(0)

We start our analysis with the mean features (size, shape, etc.) to get a first impression of the data and its trends, because mean values fluctuate less than the other statistics. We move on to the error and worst features afterwards.

Hence, we filter the dataframe to keep only the mean features and the target.

In [30]:
df = df_full.filter(regex='mean|target')

### **Data Visualization**

First, we plot a heatmap to check the correlation between various features.

In [31]:
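# An 11x11 annotated matrix needs more room than the default figure size.
plt.figure(figsize=(10, 8))
sns.heatmap(
    df.corr(),
    annot=True,
    cmap='coolwarm',
    fmt='.2f',
)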
plt.xticks(rotation=45, ha="right")
plt.title('Correlation Matrix of Breast Cancer Dataset (Mean Features)')
plt.savefig('images/Corr_Matrix_Mean_Features.png', dpi=200, bbox_inches='tight')
plt.close()
print("Saved images/Corr_Matrix_Mean_Features.png")

Saved images/Corr_Matrix_Mean_Features.png


We observe that several features correlate strongly with the target (e.g. mean radius, mean perimeter, and mean concave points). These correlations are negative because the target encodes malignant as 0 and benign as 1, so larger values point towards malignancy.
Besides, we intuitively expect mean radius to correlate strongly with the corresponding area and perimeter, and the heatmap confirms this. The heatmaps of the full dataframe show the same relationship between radius, area, and perimeter for the error and worst features. We can therefore keep radius as the only size feature for the rest of the analysis.
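
Redundant features can also be listed numerically instead of being read off the heatmap. The following is a rough sketch: the cutoff of 0.9 and the names `threshold` and `strong` are assumptions chosen for illustration, not values taken from the analysis.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(as_frame=True, return_X_y=True)
mean_features = X.filter(regex="^mean ")

corr = mean_features.corr().abs()

# Keep only the upper triangle so each pair appears once and the
# diagonal (self-correlation) is dropped; everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cutoff for "strongly correlated"
pairs = upper.unstack()  # Series indexed by (column, row); NaN fails the filter below
strong = pairs[pairs > threshold].sort_values(ascending=False)
print(strong)
```

Any pair that appears in `strong` is a candidate for keeping just one of its two members.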

Next, we examine the distribution of mean radius by target. 

In [32]:
sns.displot(
    data=df,
    x = 'mean radius',
    hue='target'
)
plt.title('Mean Radius Distribution')
plt.savefig('images/mean_radius_distribution.png', dpi=200, bbox_inches='tight')
plt.close()
print("Saved images/mean_radius_distribution.png")

Saved images/mean_radius_distribution.png


The distributions of the two classes differ clearly, which indicates a strong relationship between the mean radius and the target.

We check this further with a logistic regression plot (`regplot`).
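
The visual impression can be backed with a few numbers. This sketch summarizes the mean radius per class and computes its Pearson correlation with the binary target; the variable names `summary` and `r` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(as_frame=True, return_X_y=True)

# Per-class summary of the mean radius (0 = malignant, 1 = benign).
summary = X["mean radius"].groupby(y).agg(["mean", "median", "std"])
print(summary)

# Pearson correlation between a continuous feature and a 0/1 label.
r = X["mean radius"].corr(y)
print(f"Correlation with target: {r:.3f}")
```

A negative sign here is expected, since benign samples are labeled 1 and tend to have smaller radii.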

In [33]:
sns.regplot(
    data=df,
    x='mean radius',
    y='target',
    logistic=True,
)
plt.title('Mean Radius vs. Target Regression Plot')
plt.savefig('images/mean_radius_regplot.png', dpi=200, bbox_inches='tight')
plt.close()
print("Saved images/mean_radius_regplot.png")

Saved images/mean_radius_regplot.png


We observe a clean sigmoid curve with a very narrow confidence band.
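
The sigmoid can also be fitted explicitly to see where it crosses a probability of 0.5. Note that this is a swapped-in technique: seaborn fits its logistic curve through statsmodels, whereas the sketch below uses scikit-learn's `LogisticRegression` with a large `C` to approximate an unregularized fit. The names `slope`, `intercept`, and `midpoint` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(as_frame=True, return_X_y=True)

# A large C makes the default L2 penalty negligible (assumption: this is
# close enough to the unpenalized fit for illustration purposes).
model = LogisticRegression(C=1e6, max_iter=1000)
model.fit(X[["mean radius"]], y)

slope = model.coef_[0, 0]
intercept = model.intercept_[0]

# The sigmoid equals 0.5 where intercept + slope * radius = 0.
midpoint = -intercept / slope
print(f"slope={slope:.3f}, intercept={intercept:.3f}, midpoint={midpoint:.2f}")
```

The `midpoint` is the mean radius at which the fitted model considers both classes equally likely.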

In [34]:
sns.regplot(
    data=df,
    x='mean fractal dimension',
    y='target',
    logistic=True,
)
plt.title('Mean Fractal Dimension vs. Target Regression Plot')
plt.savefig('images/mean_fractal_dimension_regplot.png', dpi=200, bbox_inches='tight')
plt.close()
print("Saved images/mean_fractal_dimension_regplot.png")

Saved images/mean_fractal_dimension_regplot.png


In contrast to the mean radius, there is no meaningful correlation between the mean fractal dimension and the target, so we might drop this feature.

Next, we examine the correlation between the error features and the target.
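
Ranking all mean features by the strength of their correlation with the target puts both observations side by side. This is a small sketch; `ranked` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(as_frame=True, return_X_y=True)
mean_features = X.filter(regex="^mean ")

# Absolute Pearson correlation of every mean feature with the target,
# sorted from strongest to weakest.
ranked = mean_features.corrwith(y).abs().sort_values(ascending=False)
print(ranked.round(3))
```

Features near the bottom of `ranked` are the first candidates for removal, although a weak linear correlation alone does not rule out a nonlinear relationship.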

In [35]:
corr = df_full.filter(regex='error|target').corr().abs()
sns.heatmap(
    corr,
    annot=True,
    cmap='Reds',
    fmt='.2f',
)
plt.xticks(rotation=45, ha="right")
plt.title('Correlation Matrix of Breast Cancer Dataset (Error Features)')
plt.savefig('images/Corr_Matrix_Error_Features.png', dpi=200, bbox_inches='tight')
plt.close()
print("Saved images/Corr_Matrix_Error_Features.png")

Saved images/Corr_Matrix_Error_Features.png


None of the error features correlates strongly with the target, except for the radius error.

We check whether the radius error is itself correlated with the mean radius:

In [36]:
df_full['mean radius'].corr(df_full['radius error'])

np.float64(0.6790903880020749)

The correlation between the mean radius and the radius error is fairly high (about 0.68), which suggests that part of the radius error's link to the target is already captured by the mean radius. This might be due to the error of the measurement device.

Hence, we drop all error features and proceed to the worst features. We next check the correlation between the target and the worst features.

In [37]:
corr = df_full.filter(regex='worst|target').corr().abs()
sns.heatmap(
    corr,
    annot=True,
    cmap='Reds',
    fmt='.2f',
)
plt.xticks(rotation=45, ha="right")
plt.title('Correlation Matrix of Breast Cancer Dataset (Worst Features)')
plt.savefig('images/Corr_Matrix_Worst_Features.png', dpi=200, bbox_inches='tight')
plt.close()
print("Saved images/Corr_Matrix_Worst_Features.png")

Saved images/Corr_Matrix_Worst_Features.png


Although some of the worst features correlate strongly with the target, nearly the same pattern was observed for the mean features, so we need to check whether the worst features provide additional information. This can be verified by comparing predictive models trained with and without them.
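
One possible way to run such a comparison is sketched below. It is not the notebook's own method: the choice of a scaled logistic regression, 5-fold cross-validation, and ROC AUC as the metric are all assumptions made for illustration, as are the names `feature_sets` and `scores`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(as_frame=True, return_X_y=True)

# Two candidate feature sets: mean features alone, and mean plus worst.
feature_sets = {
    "mean only": X.filter(regex="^mean "),
    "mean + worst": X.filter(regex="^(mean|worst) "),
}

scores = {}
for name, features in feature_sets.items():
    # Scaling inside the pipeline keeps each fold's test data out of the fit.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv_auc = cross_val_score(model, features, y, cv=5, scoring="roc_auc")
    scores[name] = cv_auc.mean()
    print(f"{name}: mean ROC AUC = {scores[name]:.4f}")
```

If the two scores are close, the worst features add little for this model; a clear gap would argue for keeping them. Repeating the comparison with other model families would make the conclusion more robust.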