## **0. Download dataset**
**Note:** If `gdown` fails because the file has reached its download quota, download the file manually, upload it to your Google Drive, and then copy it from Drive into Colab:
```python
from google.colab import drive

drive.mount('/content/drive')
!cp /path/to/dataset/on/your/drive .
```

In [1]:
# https://drive.google.com/file/d/1T4oy9neutVe2egrEcnjhKwAdHJCyDcJ6/view?usp=drive_link
!gdown --id 1T4oy9neutVe2egrEcnjhKwAdHJCyDcJ6

Downloading...
From: https://drive.google.com/uc?id=1T4oy9neutVe2egrEcnjhKwAdHJCyDcJ6
To: c:\Users\ADMIN\Desktop\DEEP LEARNING MATERIAL\CODE_SVM\breast-cancer.csv

  0%|          | 0.00/24.4k [00:00<?, ?B/s]
100%|██████████| 24.4k/24.4k [00:00<00:00, 572kB/s]


## **1. Import libraries**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from sklearn.preprocessing import (
    StandardScaler,
    LabelEncoder,
    OneHotEncoder,
    OrdinalEncoder
)
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## **2. Load dataset**

In [3]:
dataset_path = './breast-cancer.csv'
df = pd.read_csv(
    dataset_path,
    names=[
        'age',
        'meonpause',
        'tumor-size',
        'inv-nodes',
        'node-caps',
        'deg-malig',
        'breast',
        'breast-quad',
        'irradiat',
        'label'
    ]
)
df

Unnamed: 0,age,meonpause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,label
0,'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
1,'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
2,'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
3,'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
4,'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
...,...,...,...,...,...,...,...,...,...,...
281,'50-59','ge40','30-34','6-8','yes','2','left','left_low','no','no-recurrence-events'
282,'50-59','premeno','25-29','3-5','yes','2','left','left_low','yes','no-recurrence-events'
283,'30-39','premeno','30-34','6-8','yes','2','right','right_up','no','no-recurrence-events'
284,'50-59','premeno','15-19','0-2','no','2','right','left_low','no','no-recurrence-events'


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          286 non-null    object
 1   meonpause    286 non-null    object
 2   tumor-size   286 non-null    object
 3   inv-nodes    286 non-null    object
 4   node-caps    278 non-null    object
 5   deg-malig    286 non-null    object
 6   breast       286 non-null    object
 7   breast-quad  285 non-null    object
 8   irradiat     286 non-null    object
 9   label        286 non-null    object
dtypes: object(10)
memory usage: 22.5+ KB


In [5]:
df.describe()

Unnamed: 0,age,meonpause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,label
count,286,286,286,286,278,286,286,285,286,286
unique,6,3,11,7,2,3,2,5,2,2
top,'50-59','premeno','30-34','0-2','no','2','left','left_low','no','no-recurrence-events'
freq,96,150,60,213,222,130,152,110,218,201


## **3. Preprocess dataset**

### **3.1. Filling missing values**

In [7]:
df['node-caps'] = df['node-caps'].fillna(df['node-caps'].mode()[0])
df['breast-quad'] = df['breast-quad'].fillna(df['breast-quad'].mode()[0])

In [8]:
df.describe()

Unnamed: 0,age,meonpause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,label
count,286,286,286,286,286,286,286,286,286,286
unique,6,3,11,7,2,3,2,5,2,2
top,'50-59','premeno','30-34','0-2','no','2','left','left_low','no','no-recurrence-events'
freq,96,150,60,213,230,130,152,111,218,201


# 📊 Meaning of the columns in the Breast Cancer Recurrence dataset

| Column       | Meaning                                          | Example values                                              |
|--------------|--------------------------------------------------|-------------------------------------------------------------|
| **age**      | Patient's age group                              | `20-29`, `30-39`, …                                        |
| **menopause**| Menopausal status (named `meonpause` in this notebook's code) | `lt40` (menopause before age 40), `ge40` (menopause at age 40 or later), `premeno` (premenopausal) |
| **tumor-size** | Tumor size                                     | `0-4`, `5-9`, `10-14`, … (in mm)                           |
| **inv-nodes**| Number of invaded lymph nodes                    | `0-2`, `3-5`, `6-8`, …                                     |
| **node-caps**| Whether the lymph node capsule has been penetrated | `yes` / `no` / `?` (missing data)                        |
| **deg-malig**| Degree of malignancy of the tumor (grade)        | `1`, `2`, `3` (increasing severity)                        |
| **breast**   | Affected breast                                  | `left`, `right`                                            |
| **breast-quad** | Tumor location within the breast (by quadrant) | `left_up`, `left_low`, `right_up`, `right_low`, `central` |
| **irradiat** | Whether the patient received radiation therapy   | `yes` / `no`                                               |
| **label**    | Outcome label: whether the cancer recurred       | `recurrence-events`, `no-recurrence-events`                |


### **3.2. Encode categorical features**

In [9]:
for col_name in df.columns:
    n_uniques = df[col_name].unique()
    print(f'Unique values in {col_name}: {n_uniques}')

Unique values in age: ["'40-49'" "'50-59'" "'60-69'" "'30-39'" "'70-79'" "'20-29'"]
Unique values in meonpause: ["'premeno'" "'ge40'" "'lt40'"]
Unique values in tumor-size: ["'15-19'" "'35-39'" "'30-34'" "'25-29'" "'40-44'" "'10-14'" "'0-4'"
 "'20-24'" "'45-49'" "'50-54'" "'5-9'"]
Unique values in inv-nodes: ["'0-2'" "'3-5'" "'15-17'" "'6-8'" "'9-11'" "'24-26'" "'12-14'"]
Unique values in node-caps: ["'yes'" "'no'"]
Unique values in deg-malig: ["'3'" "'1'" "'2'"]
Unique values in breast: ["'right'" "'left'"]
Unique values in breast-quad: ["'left_up'" "'central'" "'left_low'" "'right_up'" "'right_low'"]
Unique values in irradiat: ["'no'" "'yes'"]
Unique values in label: ["'recurrence-events'" "'no-recurrence-events'"]
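
The unique values above still contain literal single quotes (for example `"'40-49'"`), because the CSV wraps each field in `'` while `read_csv` treats only `"` as a quote character by default. One possible fix, sketched here on two made-up rows rather than on the real file, is to pass `quotechar="'"` when loading. The names `sample_csv`, `column_names`, and `df_clean` are illustrative and not part of the notebook.

```python
import io
import pandas as pd

# Two made-up rows in the same quoted style as the notebook's CSV (illustrative only).
sample_csv = io.StringIO(
    "'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'\n"
    "'50-59','ge40','5-9','0-2','no','1','left','central','no','no-recurrence-events'\n"
)

column_names = [
    'age', 'meonpause', 'tumor-size', 'inv-nodes', 'node-caps',
    'deg-malig', 'breast', 'breast-quad', 'irradiat', 'label',
]

# quotechar="'" tells the parser that single quotes delimit fields,
# so they are stripped instead of being kept as part of each value.
df_clean = pd.read_csv(sample_csv, names=column_names, quotechar="'")

print(df_clean['age'].unique())
print(df_clean['deg-malig'].tolist())
```

If the data is loaded this way, the category names used later (for example in the one-hot column names) would no longer carry the extra quotes. An alternative for an already loaded frame would be to strip the quotes from each string column with `.str.strip("'")`.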


In [10]:
non_rank_features = ['meonpause', 'node-caps', 'breast', 'breast-quad', 'irradiat']
rank_features = ['age', 'tumor-size', 'inv-nodes', 'deg-malig']

y = df['label']
X = df.drop('label', axis=1)

In [11]:
transformer = ColumnTransformer(
    transformers=[
        ("OneHot", OneHotEncoder(drop='first'), non_rank_features),
        ("Ordinal", OrdinalEncoder(), rank_features)
    ],
    remainder='passthrough'
)
X_transformed = transformer.fit_transform(X)

onehot_features = transformer.named_transformers_['OneHot'].get_feature_names_out(non_rank_features)
all_features = onehot_features.tolist() + rank_features

X_encoded = pd.DataFrame(
    X_transformed,
    columns=all_features
)

In [12]:
X_encoded

Unnamed: 0,meonpause_'lt40',meonpause_'premeno',node-caps_'yes',breast_'right',breast-quad_'left_low',breast-quad_'left_up',breast-quad_'right_low',breast-quad_'right_up',irradiat_'yes',age,tumor-size,inv-nodes,deg-malig
0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,6.0,0.0,1.0
3,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,2.0,6.0,0.0,2.0
4,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,5.0,4.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
281,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,5.0,5.0,1.0
282,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,3.0,4.0,4.0,1.0
283,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,5.0,5.0,1.0
284,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,3.0,2.0,0.0,1.0
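
One thing to watch in the table above: `OrdinalEncoder()` without a `categories` argument sorts the string values lexicographically, so in `inv-nodes` the range `'3-5'` is encoded as 4.0 and `'6-8'` as 5.0, after `'12-14'`, `'15-17'`, and `'24-26'`. The same happens to `'5-9'` in `tumor-size`. A minimal sketch of passing the intended order explicitly is shown below; the order lists are written by hand from the unique values printed earlier, and `demo` is a tiny made-up frame, not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Category orders written out by hand from the unique values printed in the notebook
# (quotes included, because the notebook's values still contain them).
tumor_size_order = ["'0-4'", "'5-9'", "'10-14'", "'15-19'", "'20-24'", "'25-29'",
                    "'30-34'", "'35-39'", "'40-44'", "'45-49'", "'50-54'"]
inv_nodes_order = ["'0-2'", "'3-5'", "'6-8'", "'9-11'", "'12-14'", "'15-17'", "'24-26'"]

# Tiny illustrative frame, not the real dataset.
demo = pd.DataFrame({
    'tumor-size': ["'5-9'", "'45-49'", "'0-4'"],
    'inv-nodes': ["'3-5'", "'12-14'", "'0-2'"],
})

# Default behaviour: categories are sorted as strings.
default_codes = OrdinalEncoder().fit_transform(demo)

# Explicit behaviour: one ordered list per column, in column order.
ordered_codes = OrdinalEncoder(
    categories=[tumor_size_order, inv_nodes_order]
).fit_transform(demo)

print(default_codes)   # '5-9' ranks above '45-49' here
print(ordered_codes)   # '5-9' ranks below '45-49' here
```

In the notebook's `ColumnTransformer`, the same idea would mean giving `OrdinalEncoder` one ordered list for each entry of `rank_features`. The `age` and `deg-malig` values happen to sort correctly as strings, so only `tumor-size` and `inv-nodes` are affected.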


### **3.3. Encode label**

In [16]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform
y_encoded = label_encoder.fit_transform(y)

# Print a few elements
print("👉 Before encoding:", y[:3])         # first 3 elements only
print("👉 After encoding:", y_encoded[:3])  # first 3 elements only

# Mapping label ↔ integer
print("👉 Mapping:", dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))

👉 Before encoding: 0       'recurrence-events'
1    'no-recurrence-events'
2       'recurrence-events'
Name: label, dtype: object
👉 After encoding: [1 0 1]
👉 Mapping: {"'no-recurrence-events'": 0, "'recurrence-events'": 1}


### **3.4. Normalization**

In [15]:
normalizer = StandardScaler()
X_normalized = normalizer.fit_transform(X_encoded)

## **4. Train test split**

In [21]:
test_size = 0.3
random_state = 1
is_shuffle = True
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y_encoded,
    test_size=test_size,
    random_state=random_state,
    shuffle=is_shuffle
)

In [22]:
print(f'Number of training samples: {X_train.shape[0]}')
print(f'Number of test samples: {X_test.shape[0]}')

Number of training samples: 200
Number of test samples: 86
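
Note that in this notebook the `StandardScaler` is fit on all 286 rows before the split, so the test rows influence the scaling statistics. A common way to avoid that is to split first and let a pipeline fit the scaler on the training rows only. The sketch below uses synthetic data with the same shape as the encoded features; `X_demo`, `y_demo`, and `model` are illustrative names, and `stratify` is an added option that the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_encoded / y_encoded (286 rows, 13 columns, roughly 30% positives).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(286, 13))
y_demo = (rng.random(286) < 0.3).astype(int)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, shuffle=True, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows.
model = make_pipeline(StandardScaler(), SVC(random_state=1))
model.fit(X_tr, y_tr)

print(X_tr.shape[0], X_te.shape[0])
print(f'Test accuracy on synthetic data: {model.score(X_te, y_te):.4f}')
```

The accuracy printed here is for random synthetic data and says nothing about the real dataset; the point is only the order of operations.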


## **5. Training**

In [27]:
# Initialize with default parameters (only random_state is set, for reproducibility)
classifier = SVC(random_state=random_state)
classifier.fit(X_train, y_train)

# Predict
y_pred = classifier.predict(X_test)
scores = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)

# Evaluation
print('Evaluation results on test set:')
print(f'Accuracy: {scores:.4f}')

# Print the three main parameters
print("\n👉 Important parameters:")
print(f"C: {classifier.C}")
print(f"Gamma: {classifier.gamma}")
print(f"Kernel: {classifier.kernel}")

Evaluation results on test set:
Accuracy: 0.6860

👉 Important parameters:
C: 1.0
Gamma: scale
Kernel: rbf


## **6. Evaluation**

In [24]:
y_pred = classifier.predict(X_test)
scores = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)

print('Evaluation results on test set:')
print(f'Accuracy: {scores}')

Evaluation results on test set:
Accuracy: 0.686046511627907


## **7. Fine-tune hyperparameters**

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define the parameter grid
param_grid = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [0.1, 1, 10],          # regularization parameter (penalty strength)
    'gamma': ['scale', 'auto']  # kernel coefficient (used by poly, rbf, sigmoid)
}

# Initialize the SVC
svc = SVC(random_state=random_state)

# GridSearchCV
grid_search = GridSearchCV(
    estimator=svc,
    param_grid=param_grid,
    cv=5,              # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

# Train
grid_search.fit(X_train, y_train)

# Print the best result
print("👉 Best parameters:", grid_search.best_params_)
print("👉 Best accuracy:", grid_search.best_score_)

👉 Best parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'poly'}
👉 Best accuracy: 0.74
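
The 0.74 above is `best_score_`, the mean cross-validated accuracy on the training folds, so it is not directly comparable to the test accuracy from the earlier sections. A possible follow-up, sketched here on synthetic data, is to score the refitted best model on the held-out test set. The sketch also wraps scaling and the SVC in a `Pipeline` so each CV fold fits its own scaler, and uses a list of grids so the linear kernel is not repeated for every `gamma` value, which it ignores. The names `pipe`, `search`, and the step names `scale` and `svc` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the breast cancer dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Scaling lives inside the pipeline, so it is refit within every CV fold.
pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(random_state=1))])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = [
    {'svc__kernel': ['linear'], 'svc__C': [0.1, 1, 10]},
    {'svc__kernel': ['poly', 'rbf', 'sigmoid'],
     'svc__C': [0.1, 1, 10],
     'svc__gamma': ['scale', 'auto']},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

# best_score_ is the mean CV accuracy; score() evaluates the refitted best model.
test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(f'CV accuracy: {search.best_score_:.4f}')
print(f'Test accuracy: {test_accuracy:.4f}')
```

Reporting both numbers side by side makes it easier to see whether the tuned model actually improves on the default `SVC` when evaluated on unseen data.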
