# Business Understanding

## Background and Context

Breast cancer is one of the most common forms of cancer among women worldwide, and early detection plays a critical role in improving treatment outcomes and survival rates. In clinical practice, fine needle aspirate (FNA) biopsies are commonly used as a minimally invasive method to collect cell samples from breast masses. These samples are examined under a microscope to assess whether the tumor is benign or malignant.

The Wisconsin Breast Cancer Diagnostic dataset contains quantitative features extracted from digitized images of FNA samples. Rather than using raw medical images, the dataset provides numerical measurements that describe the size, shape, and texture of cell nuclei, which are known to differ significantly between benign and malignant tumors. Each observation is labeled as either malignant or benign, making the dataset suitable for a binary classification task.

For each of the ten nucleus characteristics, three summary statistics are provided: the mean value, the standard error, and the worst (largest) value observed. This results in a total of 30 numerical predictors for each sample. These features aim to capture not only the average behavior of cell nuclei, but also their variability and extreme abnormalities, which are clinically relevant indicators of cancer.

## Business Objective

The primary objective of this project is to develop a predictive model that can accurately distinguish between malignant and benign breast tumors based on cell nucleus characteristics. Such a model could potentially assist medical professionals by providing an additional, data-driven tool to support diagnostic decision-making.

Given the medical context, particular emphasis is placed on correctly identifying malignant cases, as failing to detect cancer at an early stage may have serious consequences for patients. Therefore, the model should aim to minimize the number of malignant tumors that are incorrectly classified as benign.

## Data Science Objective

From a data science perspective, the objective is to build, evaluate, and compare multiple interpretable classification models using resampling techniques such as cross-validation. The focus is on assessing how well different models perform in terms of generalization, robustness, and their ability to correctly identify malignant cases.

The analysis will emphasize models commonly discussed in statistical learning literature, such as logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and k-nearest neighbors (KNN). Cross-validation will be used to obtain reliable performance estimates and to reduce the risk of overfitting.

## Success Criteria

The project will be considered successful if the following criteria are met:
- High recall (sensitivity) for malignant cases, ensuring that the majority of cancerous tumors are correctly identified.
- Reasonable overall accuracy, indicating balanced performance across both classes.
- Stable cross-validated performance, demonstrating that the selected model generalizes well to unseen data and is not overly dependent on a specific train-test split.

## Scope and Constraints
This analysis is restricted to the variables provided in the Wisconsin Breast Cancer Diagnostic dataset. No external clinical information, such as patient demographics, genetic data, or medical history, is incorporated into the modeling process. As a result, the conclusions of this study are limited to patterns present within the extracted cell nucleus features.

Additionally, the dataset represents measurements derived from digitized images rather than direct clinical diagnoses. Therefore, the models developed in this project are intended for analytical and educational purposes only and should not be interpreted as standalone diagnostic tools. The primary analytical focus is on reducing classification errors, with particular emphasis on minimizing false negative predictions for malignant cases due to their clinical significance.

## Analytical Approach
This project follows the CRISP-DM framework, beginning with business understanding and data understanding, followed by data preparation, modeling, and evaluation. During the modeling phase, several classification techniques will be implemented and compared, including Logistic Regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and K-Nearest Neighbors (KNN).

Resampling techniques, specifically cross-validation, will be employed to assess model performance and ensure reliable generalization to unseen data. Model evaluation will focus on metrics relevant to medical classification tasks, such as recall (sensitivity), overall accuracy, and stability across validation folds. The final results will be interpreted in the context of both statistical performance and clinical relevance.

### Import required libraries

In [10]:
# Data Manipulation
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots

# Model Selection and resampling
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Preprocessing
from sklearn.preprocessing import StandardScaler

# Evaluation Metrics
from sklearn.metrics import (
     confusion_matrix,
     classification_report,
     accuracy_score,
     recall_score,
     roc_auc_score,
     roc_curve
)

# Pipeline
from sklearn.pipeline import Pipeline

## Data Understanding

### Data Collection

The dataset used in this project was obtained from Kaggle and originates from the Wisconsin Breast Cancer Diagnostic (WDBC) dataset, which was developed at the University of Wisconsin–Madison. The dataset is also publicly available through the UCI Machine Learning Repository and has been widely used in medical and machine learning research.

The data were originally collected from digitized images of fine needle aspirate (FNA) biopsies of breast masses. From these images, quantitative features describing the characteristics of cell nuclei were extracted using image analysis techniques. The Kaggle version of the dataset provides these processed numerical features in a structured tabular format, making it suitable for statistical learning and classification tasks.

Each row in the dataset corresponds to one breast mass sample, summarized by measurements of multiple cell nuclei, and is associated with a diagnostic label indicating whether the tumor is malignant or benign.

In [11]:
df = pd.read_csv('breast_cancer.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [12]:
df.shape

(569, 33)

The dataset consists of 569 observations and 33 columns, including one identifier column, one diagnostic target variable, and 30 numerical predictor variables. The data are structured and complete, with no missing values, which makes the dataset suitable for statistical modeling and comparative analysis.

Although the sample size is sufficient for classical classification methods such as logistic regression, LDA, and QDA, it remains modest relative to the complexity and variability of real-world medical populations. As the data originate from a single source and represent a limited set of samples, the results of this analysis should be interpreted with appropriate caution regarding generalizability to broader clinical settings.

### Target Variable
#### Diagnosis (Categorical)

The target variable in this analysis is diagnosis, a categorical variable indicating whether a breast mass is benign (B) or malignant (M). A malignant diagnosis corresponds to the presence of cancerous cells, while a benign diagnosis indicates a non-cancerous breast mass.

This variable defines a binary classification problem, where the primary goal is to correctly identify malignant cases. In a medical context, misclassifying a malignant tumor as benign (false negative) can have serious consequences, as it may delay further clinical evaluation or treatment. Therefore, particular attention is given to model performance with respect to the malignant class.

In [13]:
df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

### Predictor Variables

The predictor variables in this dataset consist of quantitative measurements extracted from digitized images of fine needle aspirate (FNA) biopsies of breast masses. These measurements describe the morphology and texture of cell nuclei, which are critical indicators in distinguishing benign from malignant tumors in histopathological analysis.

For each breast mass, measurements are obtained from multiple cell nuclei present in the sample. From these measurements, ten fundamental nucleus characteristics are computed. To summarize both typical behavior and abnormal variations among nuclei, three statistical descriptors are calculated for each characteristic: the mean, the standard error, and the worst (largest) value. This results in a total of 30 numerical predictor variables.

### Fundamental Nucleus Characteristics

The ten core characteristics measured for each cell nucleus are described below:
1. Radius:
Represents the mean distance from the center of the nucleus to points on its perimeter. Malignant cells often exhibit enlarged nuclei due to    abnormal cell growth.
2. Texture:
Measures the variation in gray-scale intensity within the nucleus. Higher texture values indicate greater internal heterogeneity, which is commonly observed in malignant cells.
3. Perimeter:
Represents the total length of the nucleus boundary. Irregular and elongated boundaries tend to increase perimeter measurements in malignant nuclei.
4. Area:
Measures the total area enclosed by the nucleus boundary. Malignant tumors typically contain nuclei with larger areas compared to benign cells.
5. Smoothness:
Quantifies local variations in the radius of the nucleus. Lower smoothness indicates irregular or jagged boundaries, which are characteristic of cancerous cells.
6. Compactness:
Defined as perimeter^2/area - 1, this feature measures how closely the nucleus resembles a circular shape. Higher compactness values indicate irregular nuclear shapes.
7. Concavity:
Measures the severity of inward indentations in the nucleus boundary. Malignant nuclei often display pronounced concave regions.
8. Concave Points:
Counts the number of concave portions of the nucleus contour. A higher number of concave points reflects greater structural distortion.
9. Symmetry:
Assesses how symmetrical the nucleus is across its axes. Benign nuclei tend to be more symmetric, while malignant nuclei are often asymmetrical.
10. Fractal Dimension:
Measures the complexity of the nucleus boundary. Higher fractal dimension values indicate more complex and irregular contours, which are associated with malignancy.

### Statistical Descriptors of Nucleus Characteristics
For each of the ten characteristics described above, three statistical summaries are provided:
1. Mean Features (*_mean):
The mean features represent the average value of each nucleus characteristic across all measured nuclei within a breast mass. These variables capture the overall or typical nuclear morphology of the tumor.

Examples include:

- radius_mean
- texture_mean
- perimeter_mean
- area_mean

Clinically, these features describe the general structural differences between benign and malignant tumors, as malignant tumors often exhibit larger and more irregular nuclei on average.

2. Standard Error Features (*_se):
The standard error (SE) features quantify the variability of each nucleus characteristic across the sampled nuclei. These variables reflect the degree of heterogeneity within a tumor.

Examples include:
- radius_se
- texture_se
- perimeter_se
- area_se

In a medical context, increased variability in nuclear morphology is an important indicator of malignancy, as cancerous tumors often consist of cells with diverse and abnormal characteristics.

3. Worst Features (*_worst):
The worst features correspond to the mean of the three largest observed values for each nucleus characteristic. These variables are designed to capture the most extreme nuclear abnormalities present in the sample.

Examples include:
- radius_worst
- texture_worst
- perimeter_worst
- area_worst

These features are particularly important in cancer diagnosis, as malignant behavior may be driven by a small subset of highly abnormal cells rather than the average behavior of all nuclei.


### Additional Variables

These features are particularly important in cancer diagnosis, as malignant behavior may be driven by a small subset of highly abnormal cells rather than the average behavior of all nuclei.

In [14]:
df.dtypes

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst     

columns or features whose data type is stored as object are categorical, therefore there is only 1 categorical feature in this dataset(diagnosis).

### Statistical Summary

#### Numeric Variables

In [15]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,569.0,30371830.0,125020600.0,8670.0,869218.0,906024.0,8813129.0,911320500.0
radius_mean,569.0,14.12729,3.524049,6.981,11.7,13.37,15.78,28.11
texture_mean,569.0,19.28965,4.301036,9.71,16.17,18.84,21.8,39.28
perimeter_mean,569.0,91.96903,24.29898,43.79,75.17,86.24,104.1,188.5
area_mean,569.0,654.8891,351.9141,143.5,420.3,551.1,782.7,2501.0
smoothness_mean,569.0,0.09636028,0.01406413,0.05263,0.08637,0.09587,0.1053,0.1634
compactness_mean,569.0,0.104341,0.05281276,0.01938,0.06492,0.09263,0.1304,0.3454
concavity_mean,569.0,0.08879932,0.07971981,0.0,0.02956,0.06154,0.1307,0.4268
concave points_mean,569.0,0.04891915,0.03880284,0.0,0.02031,0.0335,0.074,0.2012
symmetry_mean,569.0,0.1811619,0.02741428,0.106,0.1619,0.1792,0.1957,0.304


The numeric predictors can be grouped into three categories: mean values, standard errors, and worst values, each computed from ten core tumor characteristics.

Size-related features such as radius, perimeter, and area show substantial variability and wide ranges, suggesting the presence of both small benign tumors and large malignant ones. Worst-case measurements exhibit particularly large maxima, highlighting localized aggressive tumor regions that may be more clinically relevant than average values.

Shape-related features, including compactness, concavity, and concave points, demonstrate high variability and extreme values, consistent with known morphological irregularities in malignant tumors.

Standard error features capture intra-tumor heterogeneity and show large dispersion for some observations, indicating chaotic growth patterns often associated with malignancy.

#### Categorical Variables

In [16]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
diagnosis,569,2,B,357


In [17]:
df['diagnosis'].value_counts()

diagnosis
B    357
M    212
Name: count, dtype: int64

The target variable diagnosis contains two classes: benign (B) and malignant (M). Out of 569 observations, 357 cases (approximately 62.7%) are benign, while 212 cases (approximately 37.3%) are malignant. This indicates a mildly imbalanced class distribution, with benign cases representing the majority class.

Although the dataset contains a sufficient number of malignant samples for training classical classification models, the imbalance highlights the importance of using appropriate evaluation metrics. In particular, relying solely on overall accuracy may lead to misleading conclusions, as a model biased toward predicting the majority class could achieve high accuracy while failing to correctly identify malignant cases. Therefore, special attention will be given to recall (sensitivity) for the malignant class and to confusion matrix–based evaluation.

## EDA

### Data Quality Check

In [18]:
df.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

An inspection of missing values showed that the dataset does not contain null entries except for the column "Unnamed". This indicates that imputation will be required before modeling.

#### Check Outliers

To systematically identify extrem observations, the Interquartile Range(IQR) will be implemented. This method flags values that fall below 
Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

In [21]:
def iqr_outliers(df, column, remove=False):
    """
    Detects outliers in a given column using the IQR method.

    Parameters:
    - df (pd.DataFrame): Input dataframe
    - column (str): Column name to analyze
    - remove (bool): If true, returns dataframe without outliers.

    Returns:
    - outliers (pd.DataFrame): rows identified as outliers(if remove=False)
    - cleaned_df (pd.DataFrame): dataframe without outliers(if remove=True)    
    """

    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5*IQR
    upper_bound = Q1 + 1.5*IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

    if remove:
        cleaned_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
        return cleaned_df
    
    return outliers

In [23]:
# Detect Outliers
outliers = iqr_outliers(df, 'radius_mean')
outliers

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.11840,0.27760,0.3001,0.14710,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.10960,0.15990,0.1974,0.12790,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.2430,0.3613,0.08758,
4,84358402,M,20.29,14.34,135.1,1297.0,0.10030,0.13280,0.1980,0.10430,...,16.67,152.2,1575.0,0.1374,0.2050,0.4000,0.1625,0.2364,0.07678,
6,844359,M,18.25,19.98,119.6,1040.0,0.09463,0.10900,0.1127,0.07400,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
535,919555,M,20.55,20.86,137.8,1308.0,0.10460,0.17390,0.2085,0.13220,...,25.48,160.2,1809.0,0.1268,0.3135,0.4433,0.2148,0.3077,0.07569,
563,926125,M,20.92,25.09,143.0,1347.0,0.10990,0.22360,0.3174,0.14740,...,29.41,179.1,1819.0,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873,
564,926424,M,21.56,22.39,142.0,1479.0,0.11100,0.11590,0.2439,0.13890,...,26.40,166.1,2027.0,0.1410,0.2113,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.2,1261.0,0.09780,0.10340,0.1440,0.09791,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
