# 🧪 Classification pipeline with RDF

### **About This Exercise**

There are no separate tasks today. Instead, work through this challenge from start to finish to practice what we have learned so far.

You’ll work with `data.csv`, which contains clinical and pathological data for cancer diagnoses. Each case includes measurements of three cell nuclei (radius, texture, perimeter), along with patient age, diagnosis date, treatment start date, and cancer type.

### **Context**

The task is to classify tissue samples as cancerous or healthy using features extracted by a pathologist. You'll build a machine learning pipeline that preprocesses this data and applies classification models to make accurate predictions. Finally, you should build an ML model for the production phase.


### **Goal**

Your objectives:

* Preprocess the data with various cleaning and transformation methods.
* Build and tune a Random Forest (RDF) classifier.
* Experiment with different hyperparameter settings.
* Try reducing `n_estimators` to 10. What happens?
* Make sure you understand all RDF options, so spend some time reading the function signature.
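One way to approach the `n_estimators` question is to cross-validate the same model with a small and a larger number of trees and compare both the mean score and its spread. The sketch below is only an illustration: it uses a synthetic dataset from `make_classification` as a stand-in (an assumption, it is not `data.csv`), and the tree counts and fold number are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: replace with your preprocessed features and labels)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

scores = {}
for n_trees in (10, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    # 5-fold cross-validated accuracy for this forest size
    cv_scores = cross_val_score(model, X, y, cv=5)
    scores[n_trees] = (cv_scores.mean(), cv_scores.std())
    print(f"n_estimators={n_trees}: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```

Run the same comparison on the real data and note whether the smaller forest changes the mean score, the fold-to-fold variation, or mainly the training time.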

To ensure reproducibility, document your preprocessing and modeling options early in the notebook. Recommended practices include:

* **Dictionary in Notebook**: Simple and effective for small projects.
* **JSON Config File**: Useful for managing and reusing configurations.
* **MLflow Tracking**: Best for logging experiments, metrics, and comparing results visually.

These practices support consistent and trackable machine learning workflows.
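As a rough illustration of the JSON option, the sketch below writes a small options dictionary to a file and reads it back. The dictionary contents and the file name `config.json` are placeholders chosen for this example, and the file is written to the system temp directory only so the sketch does not touch your project folder.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, trimmed-down options dictionary for illustration
example_config = {
    "scaling_method": "StandardScaler",
    "RDF": {"n_estimators": 100, "max_depth": 6},
}

# Placeholder location; in a project you would store this next to the notebook
config_path = Path(tempfile.gettempdir()) / "config.json"

# Save the options in a human-readable form
config_path.write_text(json.dumps(example_config, indent=2))

# Load them again, for example in a later session or another notebook
loaded_config = json.loads(config_path.read_text())
print(loaded_config == example_config)
```

Keeping the options in a file makes it easy to rerun the notebook with exactly the settings of an earlier experiment.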

#### A dictionary you can use to save the preprocessing and modeling options

In [1]:
# Configuration dictionary to track preprocessing and modeling choices
config = {
    "missing_value_strategy": "replace_by_mean_cancer_type",  # Options: replace_by_med_cancer_type, replace_by_mean_cancer_type
    "scaling_method": "StandardScaler",                       # Options: StandardScaler, RobustScaler, Normalizer, MinMaxScaler
    "RDF": {
        "n_estimators": 1000,         # Options: 5, 10, 100, 1000
        "max_depth": 6,               # Options: None, 2, 6
        "min_samples_split": 2,       # Options: 2, 4, 10
        "min_samples_leaf": 2,        # Options: 2, 4, 10
        "max_features": 0.7,          # Fraction of features considered at each split
        "class_weight": "balanced",   # Weight classes inversely to their frequency
    },
}
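Because the keys under `"RDF"` match scikit-learn parameter names, the dictionary can be unpacked straight into the estimator. The sketch below assumes the model is `sklearn.ensemble.RandomForestClassifier` and that `scaling_method` holds a scikit-learn class name; the `SCALERS` lookup table is a helper invented for this example, and the config is repeated in compact form only to keep the block runnable on its own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# Compact copy of the configuration above, repeated so this block runs on its own
config = {
    "scaling_method": "StandardScaler",
    "RDF": {"n_estimators": 1000, "max_depth": 6, "min_samples_split": 2,
            "min_samples_leaf": 2, "max_features": 0.7, "class_weight": "balanced"},
}

# Hypothetical lookup table from the names used in the config to scikit-learn classes
SCALERS = {
    "StandardScaler": StandardScaler,
    "RobustScaler": RobustScaler,
    "Normalizer": Normalizer,
    "MinMaxScaler": MinMaxScaler,
}

# Instantiate the chosen scaler and the classifier directly from the config
scaler = SCALERS[config["scaling_method"]]()
clf = RandomForestClassifier(**config["RDF"], random_state=42)
print(scaler, clf)
```

With this pattern, changing an option in `config` is the only edit needed to rerun an experiment with different settings.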

#### Import needed packages and set the random seed

In [2]:
# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model and preprocessing tools will be added as needed later


In [3]:
# Set global random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
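Besides seeding NumPy globally, it is usually more robust to pass the seed explicitly to every scikit-learn object that accepts `random_state`. The sketch below checks this on synthetic stand-in data (an assumption, not `data.csv`): two forests trained with the same `random_state` should give identical predicted probabilities. The helper `fit_predict` and the small forest size are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Synthetic stand-in data, split with a fixed seed
X, y = make_classification(n_samples=200, n_features=8, random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE
)

def fit_predict():
    """Train a small forest with a fixed seed and return test-set probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=RANDOM_STATE)
    return model.fit(X_train, y_train).predict_proba(X_test)

# Two independent runs with the same seed should agree exactly
identical = np.array_equal(fit_predict(), fit_predict())
print("Runs identical:", identical)
```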

#### Read and check the data

In [4]:
data = pd.read_csv('./data.csv', header='infer')
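A first check after loading typically covers the shape, missing values, and duplicate rows. The sketch below wraps these checks in a helper, `quick_check`, which is a name invented for this example. It is demonstrated on a tiny made-up table whose column names and values are purely illustrative and are not taken from `data.csv`.

```python
import io
import pandas as pd

def quick_check(df):
    """Return basic sanity information about a DataFrame."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicates": int(df.duplicated().sum()),
    }

# Made-up toy data (hypothetical columns and values, for illustration only)
toy_csv = io.StringIO(
    "radius_0,texture_0,age,cancer_type\n"
    "12.1,18.0,54,1\n"
    "15.3,,61,2\n"
    "12.1,18.0,54,1\n"
)
toy = pd.read_csv(toy_csv)

report = quick_check(toy)
print(report)
```

On the real data, call `quick_check(data)` and follow up with `data.head()` and `data.info()` to inspect the column types.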