# BI Project - Breast Cancer

## **Content of the Notebook**

The procedure follows the CRISP-DM framework.

The steps from Data Understanding to Evaluation are illustrated in this notebook; all other content is covered in the written report.

- Business Understanding
- **Data Understanding**
- **Data Preparation**
- **Modeling**
- **Evaluation**
- Deployment Options & Future Outlook
- Conclusion



*The content and structure of the notebook, as well as definitions and explanations, are strongly based on the “Applied Analytics” course*

In [None]:
# Read csv file


## 1. Data Understanding

**Task:**
- Describe the dataset used
- Explore and describe features
- Discuss data quality (e.g., missing values)
- Highlight patterns or trends
- Include visualizations to support your data understanding


### Descriptive Statistics

Data Structure:

- 1 Table
- CSV format
- Each measured feature is represented by three related columns (except for the target variable and the ID), e.g. the radius feature has the columns mean, standard error and worst radius



Data Content:
- Number of observations & variables
- Data types
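
As a minimal sketch of these two checks, the snippet below reads the number of observations, the number of variables and the data types from a pandas DataFrame. The three-row table and its column names (`id`, `diagnosis`, `radius_mean`) are hypothetical stand-ins for the loaded CSV file.

```python
import pandas as pd

# Hypothetical toy table; in the notebook this would be the DataFrame read from the CSV file.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 11.4, 12.3],
})

# Number of observations (rows) and variables (columns)
n_observations, n_variables = df.shape
print(f"{n_observations} observations, {n_variables} variables")

# Data type of each column
print(df.dtypes)
```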

In [None]:
# Data Content

Numerical/continuous variables
- Distribution
- Min, Max, Mean, Median
- Standard Deviation
- Skewness
- Kurtosis
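
The statistics listed above could be collected in one summary table, for example as sketched below. The values and the column names `radius_mean` and `texture_mean` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy values; replace with the numerical columns of the loaded dataset.
df = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 13.0, 14.0, 25.0],
    "texture_mean": [15.0, 18.0, 20.0, 22.0, 25.0],
})

# One row per feature, one column per statistic.
# pandas reports the sample skewness and the excess kurtosis (normal distribution = 0).
summary = pd.DataFrame({
    "min": df.min(),
    "max": df.max(),
    "mean": df.mean(),
    "median": df.median(),
    "std": df.std(),
    "skewness": df.skew(),
    "kurtosis": df.kurt(),
})
print(summary.round(3))
```

A mean clearly above the median together with a positive skewness, as for the toy `radius_mean` column, points to a right-skewed distribution.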

In [None]:
# Describe numerical data

Visualization
- Plot data (unidimensional: histograms; multidimensional: scatterplots)

In [None]:
# Visualization
# - Plot data (Unidimensional (Histograms) and Multidimensional (Scatterplots))

EDA (Automated Exploratory Data Analysis)
- sweetviz
- ydata-profiling

In [None]:
# EDA 
# choose sweetviz or ydata

## 2. Data Preparation

**Task**
- Detail how you selected and filtered your data
- Explain any transformations or feature engineering steps (including missing


Variable Cleaning
- Incorrect values (e.g. True instead of 1, 1.000 instead of 1000): correct them where possible, otherwise remove the observations
- Consistency of the data (same format, check times and dates, ...)

In [None]:
# Variable Cleaning

Outliers
- Find (1) point (univariate) outliers and/or (2) contextual (multivariate) outliers
- Methods to detect univariate outliers: Tukey's fences and standardization (3 SD method, i.e. values more than three standard deviations from the mean)
- Methods to detect multivariate outliers: Mahalanobis distance, PCA-based techniques, robust covariance estimation
- Handle outliers: remove, transform, ignore
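
The two univariate detection rules can be sketched with NumPy as follows. The helper names `tukey_outliers` and `zscore_outliers` and the sample values are illustrative assumptions; the defaults `k=1.5` and `threshold=3.0` are the conventional choices, not values prescribed by the notebook.

```python
import numpy as np

def tukey_outliers(x, k=1.5):
    """Boolean mask of values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outliers(x, threshold=3.0):
    """Boolean mask of values whose absolute z-score exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > threshold

# Hypothetical measurements with one extreme value
values = np.array([10.0, 11.0, 12.0, 12.5, 13.0, 13.5, 14.0, 15.0, 40.0])

print("Tukey:", values[tukey_outliers(values)])
print("z-score:", values[zscore_outliers(values)])
```

In this toy sample Tukey's fences flag the value 40, while the z-score rule flags nothing: the extreme value inflates the standard deviation, and in a sample of nine values no z-score can exceed about 2.67. The quartile-based rule is therefore the more robust choice for small samples.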

In [None]:
# Outlier Detection

In [None]:
# Outlier Handling

Missing Data
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)

How to deal with missing data:
- Mean/Median Imputation
- Hot-Deck / Distribution
- Model based (e.g. regression model)
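
A minimal sketch of the check for missing values followed by median imputation, assuming pandas and a hypothetical toy table (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "radius_mean": [10.0, np.nan, 14.0, 30.0],
    "texture_mean": [15.0, 18.0, np.nan, 21.0],
})

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Median imputation: replace each missing value with the median of its column
imputed = df.fillna(df.median(numeric_only=True))
print(imputed)
```

The median is less sensitive to extreme values than the mean, which makes it the safer default for skewed features.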

In [None]:
# Check for missing data

Data Wrangling
- Data must be trimmed, reshaped, transformed, aggregated, merged, …
- Data removal of irrelevant variables
- Data transformations:
    - Existing data to new variables (if-else rules, e.g. if m → male, elif f → female, else diverse)
    - Binning data: continuous data to ordinal data (e.g. group heights into classes)
    - Distribution transformation: skewed data can lead to bias → (log-)transform the variables
- Reshaping data: needed when there is more than one data point per unit of observation (wide to long or long to wide)
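
The removal and transformation steps above might look like this in pandas. All column names, bin edges and values are hypothetical and only mirror the examples given in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sex": ["m", "f", "x", "m"],
    "height_cm": [158.0, 171.0, 185.0, 199.0],
    "skewed_value": [1_000.0, 2_000.0, 4_000.0, 100_000.0],
})

# Remove an irrelevant variable
df = df.drop(columns=["id"])

# If-else rule: m -> male, f -> female, everything else -> diverse
df["sex_label"] = df["sex"].map({"m": "male", "f": "female"}).fillna("diverse")

# Binning: continuous heights to ordinal classes (assumed bin edges)
df["height_bin"] = pd.cut(
    df["height_cm"], bins=[0, 165, 180, 250], labels=["short", "medium", "tall"]
)

# Distribution transformation: log-transform a right-skewed variable
df["log_value"] = np.log1p(df["skewed_value"])

print(df)
```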


In [None]:
# Data Wrangling

Dimensionality Reduction
- Goal: remove dimensions → counteract the curse of dimensionality
- Remove redundancy and noise from the dataset
- Techniques:
    - Combining Features
    - Principal component analysis (PCA)
    - Factor analysis (FA)
    - Singular value decomposition (SVD)
    - Linear discriminant analysis (LDA)

PCA in detail:
- Goal: Combine existing, correlated variables into fewer, uncorrelated variables
- Process:
    - Start with N dimensions in data set
    - Normalize Data
        - Min-Max (uniform distr.)
        - Z-score scaling (Gaussian distr.)
        - Log transformation (skewness)
    - Calculate N principal components (PCs) from data set
        - Each PC explains a part of the total variance (→ eigenvalue)
        - Each PC is a weighted combination of the original features (→ eigenvector, i.e. the loadings)
    - Select the PCs with high eigenvalues; ignore the X remaining PCs
- Analyze loadings / eigenvalues of the PCs (Feature Loadings on PCs Heat Map)
- When to use PCA:
    - Too many variables → Curse of Dimensionality
    - The features in the data are highly correlated → Multicollinearity
    - Visualization of high-dimensional data → Visualize relevant PCs
    - Noise reduction → Getting rid of low-variance components
    - In combination with clustering algorithms
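
The PCA process described above can be sketched step by step with NumPy. The synthetic data (two strongly correlated features plus one independent feature) and the 95% variance threshold are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data: features 0 and 1 are strongly correlated, feature 2 is independent
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.1 * rng.normal(size=n),
    2.0 * base + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# 1. Normalize the data (z-score scaling)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized data (equals the correlation matrix)
C = np.cov(Z, rowvar=False)

# 3. Eigen decomposition: eigenvalues = explained variance, eigenvectors = loadings
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]  # sort PCs by descending eigenvalue
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
explained_ratio = eigenvalues / eigenvalues.sum()

# 4. Keep the smallest number of PCs that explains at least 95% of the variance
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.95) + 1)
scores = Z @ eigenvectors[:, :k]  # data projected onto the selected PCs

print("explained variance ratio:", explained_ratio.round(3))
print("selected PCs:", k)
```

Because two of the three features carry almost the same information, two components are enough here; the columns of `eigenvectors` are what a loadings heat map would display.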


In [None]:
# Dimensionality Reduction (1)
# -- Combining features

In [None]:
# Dimensionality Reduction (2)
# -- PCA
# -- Evaluate PCA

Clustering
- K-Means
    - Evaluation: 
        - Sum of squared Errors (SSE)
        - Silhouette Score
    - Optimal number of clusters:
        - Elbow method

- Optional: Kohonen Self-Organizing Maps (SOM) algorithm: run a clustering algorithm (e.g. K-Means) on the weight vectors of the SOM
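
A hedged sketch of the K-Means evaluation, assuming scikit-learn is available. The three synthetic blobs stand in for the prepared feature matrix; `inertia_` is scikit-learn's name for the SSE.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

sse, silhouette = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_  # sum of squared distances to the nearest centroid
    silhouette[k] = silhouette_score(X, model.labels_)

# Elbow method: look for the k after which the SSE stops dropping sharply
for k in sse:
    print(f"k={k}  SSE={sse[k]:.1f}  silhouette={silhouette[k]:.3f}")

best_k = max(silhouette, key=silhouette.get)
print("k with the highest silhouette score:", best_k)
```

The SSE keeps decreasing as k grows, so it is read off a plot (the "elbow") rather than minimized; the silhouette score has a clear maximum and can be compared directly.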

In [4]:
# Clustering

Sampling
- Partition data into training (70% - 80%) and testing data (20% - 30%)
- Methods
    - k-fold cross validation:
        - partition data into k random subsamples (folds)
        - train the model on k-1 of the subsamples
        - evaluate it on the remaining held-out subsample and repeat k times, so that each subsample serves as test data once → the error distribution across the k models indicates how stable the training is across data samples

    - bootstrapping:
        - From a dataset with N observations, sample N times with replacement
        - some of the original observations appear multiple times in the sampled data, others not at all
        - repeat a large number of times
        - used to sample training data
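
Both sampling schemes can be sketched as index generators with NumPy. The function names `k_fold_indices` and `bootstrap_indices` are hypothetical helpers, not part of any library.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def bootstrap_indices(n_samples, seed=0):
    """Draw n_samples indices with replacement."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_samples, size=n_samples)

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))

boot = bootstrap_indices(n_samples=10)
print("bootstrap sample:", np.sort(boot))
```

In the bootstrap sample some indices repeat and others are missing; the observations that were never drawn can serve as an evaluation set for that repetition.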


In [None]:
# Sample Data

## 3. Modeling


**Task**
- Describe two different models selected for analysis
- Describe how each model was implemented

- Taxonomic Overview
- Predictive Modeling vs. Descriptive Modeling

- **Predictive Modeling**
  - Goal: forecast future outcomes
  - Regression:
    - Linear Regression
    - Neural Networks
    - CART
  - Classification:
    - Logistic Regression
    - Nearest Neighbour
    - Boosting Algorithm

- **Descriptive Modeling**
  - Goal: understand data structure & find patterns
  - Segmentation:
    - Hierarchical Clustering
    - K-means
    - DBSCAN
  - Rule/Sequence Mining:
    - Apriori Algorithm
    - FP-growth
    - PrefixSpan

- Machine Learning task
  - Target function: f: Attributes → Class Labels L

- Data Partitioning strategy
  - training using labeled instances (“gold standard”)
  - split data into disjoint sets of training data and test data

- Learning process
  - classifier learns from training data
  - performance assessment: evaluate classifier on separate test data (to avoid data leakage, i.e. the classifier being evaluated on data it has already seen during training)

- Problems
  - Limitation of training data
  - Mapping Challenges: Training Set ≠ Entire Population
  - Overfitting

- Local vs. global models
  - Local models
    - k-Nearest Neighbors (k-NN)
    - Focus: Local neighborhood structures
    - Distance based decision boundary
    - Uses nearby data points and distance-based decision boundaries for localized predictions
  - Global models
    - Linear/Logistic Regression; Neural Networks
    - Single, comprehensive decision boundary
    - Utilizes all data to create a unified decision rule for overall generalization

- Trade-Off
  - Interpretability vs Complexity
  - Computational Speed
  - Data Requirements

- Parametric vs. Non-Parametric Models
  - Parametric Models
    - Logistic Regression
    - Finite parameter set
    - Strong distributional assumptions
  - Non-Parametric Models
    - Decision Trees
    - Random Forests
    - Minimal data distribution assumptions

- Model Complexity Spectrum
  - “Easy” models → High interpretability
    - Linear Regression
    - Logistic Regression
    - Decision Trees
  - “Complex” models → Higher predictive accuracy
    - Random Forests
    - Gradient Boosting
    - Deep Neural Networks
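
As one possible instance of the comparison between an "easy" and a "complex" model, the sketch below trains a logistic regression and a random forest with scikit-learn. The choice of these two models is only an example, and scikit-learn's bundled breast cancer dataset is used as a stand-in for the notebook's CSV file, which may differ in columns and encoding.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset replaces the notebook's own data (0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)

# Disjoint training and test sets, stratified to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Parametric, global, easy to interpret; scaling is fitted on the training data only
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Non-parametric ensemble, harder to interpret
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

print(test_accuracy)
```

Placing the scaler inside the pipeline keeps the test data out of the preprocessing step, which is one simple guard against data leakage.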


In [None]:
# Easy model with high interpretability and lower prediction accuracy
    # - Linear Regression
    # - Logistic Regression
    # - Decision Tree

In [None]:
# Complex model with lower interpretability but higher prediction accuracy
    # - XGBoost
    # - LightGBM
    # - Deep Neural Network
    # - Random Forest
    # - Support Vector Machine
    # - Gradient Boosting

## 4. Evaluation

**Task**
- Present evaluation metrics for both models
- Compare model performance and explain results


Evaluation Strategies:
- Confusion Matrix
    - Accuracy: Overall correctness of the model
    - Precision (Positive Predictive Value): How many predicted positives are actually positive
    - Recall (Sensitivity, True Positive Rate): How many actual positives were correctly predicted
    - Specificity (True Negative Rate): How many actual negatives were correctly predicted
    - F1 Score: Harmonic mean of precision and recall (balance between the two)
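
The metrics above follow directly from the four cells of the confusion matrix. The sketch below computes them in plain Python; the function names and the toy labels are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Compute the standard metrics; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, fp, fn, tn)
print((tp, fp, fn, tn), metrics)
```

In a cancer screening setting a false negative (a missed malignant case) is usually the costlier error, so recall for the malignant class deserves particular attention when comparing the two models.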

In [None]:
# Confusion Matrix Easy model

In [None]:
# Confusion Matrix Complex model

- Precision-Recall (PR) Curve
  - The PR curve plots Precision (y-axis) vs. Recall (x-axis) at different thresholds
  - Focuses on performance in imbalanced datasets, where the positive class is rare

- Area Under the PR Curve (AUC-PR)
  - Measures the average trade-off between precision and recall
  - Value range: 0 to 1 → Higher = Better
  - More informative than ROC AUC for imbalanced classes, since it doesn’t account for true negatives
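
To make the construction of the curve concrete, the sketch below computes the precision-recall points and a step-wise area estimate (average precision) with NumPy. It assumes binary labels with 1 as the positive class and no tied scores; the function names and example values are hypothetical.

```python
import numpy as np

def precision_recall_points(y_true, scores):
    """Precision and recall after each threshold, from the highest score downwards."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")
    hits = y_true[order]           # 1 where the ranked instance is a true positive
    tp = np.cumsum(hits)           # true positives among the top-i predictions
    fp = np.cumsum(1 - hits)       # false positives among the top-i predictions
    precision = tp / (tp + fp)
    recall = tp / hits.sum()
    return precision, recall

def average_precision(y_true, scores):
    """Step-wise area under the PR curve: sum of precision times the recall increase."""
    precision, recall = precision_recall_points(y_true, scores)
    recall_steps = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(recall_steps * precision))

# Hypothetical labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

precision, recall = precision_recall_points(y_true, scores)
ap = average_precision(y_true, scores)
print("precision:", precision.round(3))
print("recall:   ", recall.round(3))
print("average precision:", round(ap, 4))
```

True negatives never enter the calculation, which is why the PR curve stays informative when the negative class dominates.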

In [None]:
# PR & AUC-PR easy model

In [None]:
# PR & AUC-PR complex model

- ROC Curve (Receiver Operating Characteristic)
  - Plots True Positive Rate (Recall) vs. False Positive Rate (FPR)
  - Shows model performance across all classification thresholds
  - Each point on the curve corresponds to a different threshold

- Area Under the ROC Curve (AUC-ROC)
  - AUC = Area under the ROC Curve (value between 0 and 1)
  - Measures the model’s ability to distinguish between classes
  - Interpretation:
    - 1.0 = perfect classifier
    - 0.5 = random guessing
    - < 0.5 = worse than random
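
The ROC curve and its area can be computed in the same way, sketched here with NumPy and the trapezoidal rule. The sketch assumes binary labels with 1 as the positive class and no tied scores; the function names and example values are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """FPR and TPR for every threshold, starting at the point (0, 0)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")
    hits = y_true[order]
    tpr = np.concatenate([[0.0], np.cumsum(hits) / hits.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - hits) / (1 - hits).sum()])
    return fpr, tpr

def roc_auc(y_true, scores):
    """Area under the ROC curve via the trapezoidal rule."""
    fpr, tpr = roc_points(y_true, scores)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

auc = roc_auc(y_true, scores)
print("AUC-ROC:", round(auc, 4))
```

The result equals the share of positive-negative pairs in which the positive instance receives the higher score (7 of 9 pairs in this toy example), which is the ranking interpretation of the AUC.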

In [None]:
# ROC & AUC ROC easy model

In [None]:
# ROC & AUC ROC complex model