# Support Vector Machine Regression 

## Objectives
- Build an SVM regressor to predict average life expectancy
- Use the preprocessed data from `data/processed/`
- Tune hyperparameters with 5-fold cross-validation
- Evaluate the model on the training set
- Save the trained model

## Introduction

SVR (Support Vector Regression) is the regression variant of SVM. With a kernel, it can model nonlinear relationships between the features and the target.
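
For reference, the standard textbook formulation behind this description is sketched below (the notation is conventional and not taken from this notebook). The epsilon-insensitive loss ignores residuals smaller than $\varepsilon$, and the kernelized predictor is a weighted sum of kernel evaluations against the training points $x_i$. The kernel shown is the RBF kernel, which is the one used in Step 5.

```latex
L_\varepsilon\bigl(y, f(x)\bigr) = \max\bigl(0,\; |y - f(x)| - \varepsilon\bigr)

f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\, K(x_i, x) + b,
\qquad
K(x, x') = \exp\bigl(-\gamma \lVert x - x' \rVert^{2}\bigr)
```

Here $C$ controls how strongly residuals outside the tube are penalized, $\varepsilon$ sets the tube width, and $\gamma$ sets how quickly the kernel similarity decays with distance.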

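### **Advantages:**

- Handles nonlinear data well when paired with a suitable kernel.

- Relatively insensitive to outliers: with the epsilon-insensitive loss, errors inside the epsilon tube are ignored and larger errors are penalized linearly rather than quadratically.

- Generalizes well when C, epsilon, and gamma are tuned.

### **Disadvantages:**

- Slow on large datasets: training a kernel SVR scales more than quadratically with the number of samples.

- Requires feature scaling.

- The hyperparameters (C, epsilon, gamma) are hard to choose.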
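
To illustrate why feature scaling matters for SVR, here is a minimal sketch on synthetic data (the data, the variable names, and the hyperparameter values are assumptions for illustration, not part of this project). One feature lies in [0, 1] and the other in [0, 1000], so without scaling the RBF kernel is dominated by the large-scale feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (an assumption for illustration): x1 lies in [0, 1], x2 in [0, 1000]
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, size=400)
x2 = rng.uniform(0.0, 1000.0, size=400)
X = np.column_stack([x1, x2])
y = 5.0 * np.sin(2 * np.pi * x1) + 0.005 * x2 + rng.normal(0.0, 0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Same SVR settings in both cases; only the preprocessing differs
raw_model = SVR(kernel="rbf", C=100.0, epsilon=0.1)
scaled_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1))

# score() returns R2 on the held-out rows
r2_raw = raw_model.fit(X_tr, y_tr).score(X_te, y_te)
r2_scaled = scaled_model.fit(X_tr, y_tr).score(X_te, y_te)

print(f"R2 without scaling: {r2_raw:.3f}")
print(f"R2 with scaling:    {r2_scaled:.3f}")
```

Wrapping the scaler and the regressor in one pipeline also keeps the scaling statistics inside each cross-validation fold when the pipeline is passed to a search object.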

## Step 1 - Import the required libraries

### 1.1. Import libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.svm import SVR, LinearSVR
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import joblib

### 1.2. Configure directories

In [2]:
RANDOM_STATE = 42
os.makedirs("../models/2_SVM_regression", exist_ok=True)

## Step 2 - Load the preprocessed data

In [3]:
# Load the data
train_df = pd.read_csv('../data/processed/train.csv')

print("DATA OVERVIEW")
print("="*60)
print(f"Training set shape: {train_df.shape}")

# Show the first 5 rows of the training set
print("\nFirst 5 rows of the training set (already standardized):")
train_df.head()

DATA OVERVIEW
Training set shape: (3124, 13)

First 5 rows of the training set (already standardized):


| | country_name | country_code | year | population | pop_growth | life_expectancy | gdp_per_capita | gdp_growth | sanitation | electricity | water_access | co2_emissions | labor_force |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Denmark | DNK | 2017 | -0.203264 | -0.418953 | 81.102439 | 1.682694 | -0.047741 | 1.67351 | 0.642797 | 0.754689 | 0.118599 | 0.069223 |
| 1 | Korea, Dem. People's Rep. | PRK | 2017 | -0.056076 | -0.530921 | 73.034 | -0.412271 | 0.028651 | -0.245951 | -1.29473 | 0.439985 | -0.216848 | 2.087276 |
| 2 | Madagascar | MDG | 2008 | -0.091998 | 1.00779 | 61.992 | -0.60717 | 0.547296 | -1.651029 | -2.194047 | -2.540432 | -0.528445 | 2.542268 |
| 3 | Greece | GRC | 2018 | -0.166799 | -0.945727 | 81.787805 | 0.170491 | -0.209157 | 1.356075 | 0.642797 | 0.754689 | 0.161554 | -1.033944 |
| 4 | South Sudan | SSD | 2019 | -0.169071 | 1.000983 | 58.129 | -0.412271 | 0.028651 | -1.37545 | -2.593358 | -2.691194 | -0.278341 | 1.279234 |


## Step 3 - Prepare the data for modeling

Separate the target variable (`life_expectancy`) from the features, and drop the identifier columns `country_name` and `country_code`, which are not used as predictors.

In [4]:
# Define the columns used as predictors
feature_cols = [col for col in train_df.columns 
                if col not in ['life_expectancy', 'country_name', 'country_code']]

# Split the training set into X and y
X_train = train_df[feature_cols]
y_train = train_df['life_expectancy']

print("DATASET SUMMARY")
print("="*60)
print(f"Number of features: {len(feature_cols)}")
print("\nFeatures used:")
for i, col in enumerate(feature_cols, 1):
    print(f"  {i}. {col}")

print(f"\nX_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

DATASET SUMMARY
Number of features: 10

Features used:
  1. year
  2. population
  3. pop_growth
  4. gdp_per_capita
  5. gdp_growth
  6. sanitation
  7. electricity
  8. water_access
  9. co2_emissions
  10. labor_force

X_train shape: (3124, 10)
y_train shape: (3124,)
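
In the preview printed in Step 2, `year` appears as raw values (2017, 2008, ...) while the other features look standardized. If that holds for the whole file, which is an assumption worth checking, `year` would dominate distance-based kernels. The sketch below uses a small frame copied from three columns of that preview to show one way to inspect per-column scale and standardize only `year`; the names `demo` and `scale_year` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Small frame copied from the preview rows (three columns only, for illustration)
demo = pd.DataFrame({
    "year": [2017, 2017, 2008, 2018, 2019],
    "population": [-0.203264, -0.056076, -0.091998, -0.166799, -0.169071],
    "gdp_per_capita": [1.682694, -0.412271, -0.60717, 0.170491, -0.412271],
})

# Per-column mean and spread: a column on a much larger scale stands out here
print(demo.agg(["mean", "std"]).round(3))

# Standardize only `year`; pass the remaining columns through unchanged
scale_year = ColumnTransformer(
    [("year", StandardScaler(), ["year"])],
    remainder="passthrough",
)
scaled = scale_year.fit_transform(demo)  # NumPy array, `year` is column 0
print(scaled[:, 0].round(3))
```

On the real data, the same check would be `X_train.agg(["mean", "std"])`, and the transformer would be fitted on the training set only.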


## Step 4 - Build and train a LinearSVR model with RandomizedSearchCV (k=5)


### 4.1. Find the best LinearSVR configuration with 5-fold CV on the training set

In [5]:
from scipy.stats import uniform, randint

# Parameter distributions for LinearSVR
# Note: scipy's uniform(loc, scale) samples from [loc, loc + scale]
param_distributions = {
    "C": uniform(0.01, 100),              # 0.01 → 100.01
    "epsilon": uniform(0.001, 1.0),       # 0.001 → 1.001
    "loss": ["epsilon_insensitive", "squared_epsilon_insensitive"],
    "max_iter": randint(1000, 10000),     # random iteration cap, 1000 → 9999
}

# Search for the best LinearSVR model
# Note: RandomizedSearchCV gets no random_state here, so the sampled candidates differ between runs
linsvr = LinearSVR(random_state=RANDOM_STATE, dual='auto')
cv_linear_svr = RandomizedSearchCV(
    linsvr, 
    param_distributions=param_distributions, 
    scoring='neg_mean_squared_error', 
    cv=5, 
    n_jobs=-1, 
    verbose=2,
    n_iter=60
)

# Run the search; the best configuration is then refit on the full training set
cv_linear_svr.fit(X_train, y_train)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[CV] END C=99.25694954090761, epsilon=0.5214888035548396, loss=squared_epsilon_insensitive, max_iter=3340; total time=   0.0s
[CV] END C=99.25694954090761, epsilon=0.5214888035548396, loss=squared_epsilon_insensitive, max_iter=3340; total time=   0.0s
[CV] END C=25.51823018898193, epsilon=0.9847978003511819, loss=squared_epsilon_insensitive, max_iter=9272; total time=   0.0s
[CV] END C=9.958140587197162, epsilon=0.02242844577169978, loss=squared_epsilon_insensitive, max_iter=9819; total time=   0.0s
[CV] END C=60.93783823250685, epsilon=0.6989231873084083, loss=squared_epsilon_insensitive, max_iter=8972; total time=   0.0s
[CV] END C=60.93783823250685, epsilon=0.6989231873084083, loss=squared_epsilon_insensitive, max_iter=8972; total time=   0.0s
[CV] END C=60.93783823250685, epsilon=0.6989231873084083, loss=squared_epsilon_insensitive, max_iter=8972; total time=   0.0s
[CV] END C=85.48545902165183, epsilon=0.17111910535565855, loss=epsilon_insensitive, max_iter=6319; total time=   1.1s
[CV] END C=85.48545902165183, epsilon=0.17111910535565855, loss=epsilon_insensitive, max_iter=6319; total time=   1.0s




[CV] END C=61.592447938201374, epsilon=0.8802859881703854, loss=epsilon_insensitive, max_iter=6033; total time=   1.0s
[CV] END C=61.592447938201374, epsilon=0.8802859881703854, loss=epsilon_insensitive, max_iter=6033; total time=   1.0s
[CV] END C=76.05707985308895, epsilon=0.649583604652676, loss=epsilon_insensitive, max_iter=4973; total time=   0.8s
[CV] END C=61.592447938201374, epsilon=0.8802859881703854, loss=epsilon_insensitive, max_iter=6033; total time=   1.0s
[CV] END C=89.53841432999576, epsilon=0.9485288660714386, loss=squared_epsilon_insensitive, max_iter=4427; total time=   0.0s
[CV] END C=89.53841432999576, epsilon=0.9485288660714386, loss=squared_epsilon_insensitive, max_iter=4427; total time=   0.0s
[CV] END C=85.48545902165183, epsilon=0.17111910535565855, loss=epsilon_insensitive, max_iter=6319; total time=   1.2s
[CV] END C=89.53841432999576, epsilon=0.9485288660714386, loss=squared_epsilon_insensitive, max_iter=4427; total time=   0.0s
[CV] END C=76.05707985308895, epsilon=0.649583604652676, loss=epsilon_insensitive, max_iter=4973; total time=   1.1s
[CV] END C=41.54649750242682, epsilon=0.14344057637193852, loss=squared_epsilon_insensitive, max_iter=3979; total time=   0.0s
[CV] END C=61.592447938201374, epsilon=0.8802859881703854, loss=epsilon_insensitive, max_iter=6033; total time=   1.1s
[CV] END C=12.813527604566456, epsilon=0.3636849096097491, loss=squared_epsilon_insensitive, max_iter=3258; total time=   0.0s
[CV] END C=12.813527604566456, epsilon=0.3636849096097491, loss=squared_epsilon_insensitive, max_iter=3258; total time=   0.0s
[CV] END C=12.813527604566456, epsilon=0.3636849096097491, loss=squared_epsilon_insensitive, max_iter=3258; total time=   0.0s
[CV] END C=12.813527604566456, epsilon=0.3636849096097491, loss=squared_epsilon_insensitive, max_iter=3258; total time=   0.0s
[CV] END C=12.813527604566456, epsilon=0.3636849096097491, loss=squared_epsilon_insensitive, max_iter=3258; total time=   0.0s
[CV] END C=33.40699305039087, epsilon=0.48906905237669074, loss=epsilon_insensitive, max_iter=7612; total time=   1.5s
[CV] END C=76.05707985308895, epsilon=0.649583604652676, loss=epsilon_insensitive, max_iter=4973; total time=   0.9s
[CV] END C=76.05707985308895, epsilon=0.649583604652676, loss=epsilon_insensitive, max_iter=4973; total time=   0.9s
[CV] END C=42.374000513062825, epsilon=0.17785553754405092, loss=squared_epsilon_insensitive, max_iter=9943; total time=   0.0s
[CV] END C=42.374000513062825, epsilon=0.17785553754405092, loss=squared_epsilon_insensitive, max_iter=9943; total time=   0.0s
[CV] END C=42.374000513062825, epsilon=0.17785553754405092, loss=squared_epsilon_insensitive, max_iter=9943; total time=   0.0s
[CV] END C=42.374000513062825, epsilon=0.17785553754405092, loss=squared_epsilon_insensitive, max_iter=9943; total time=   0.0s
[CV] END C=42.374000513062825, epsilon=0.17785553754405092, loss=squared_epsilon_insensitive, max_iter=9943; total time=   0.0s
[CV] END C=76.05707985308895, epsilon=0.649583604652676, loss=epsilon_insensitive, max_iter=4973; total time=   1.0s
[CV] END C=61.86784647090847, epsilon=0.7665799271651603, loss=epsilon_insensitive, max_iter=4108; total time=   0.7s
[CV] END C=61.86784647090847, epsilon=0.7665799271651603, loss=epsilon_insensitive, max_iter=4108; total time=   0.7s
[CV] END C=61.86784647090847, epsilon=0.7665799271651603, loss=epsilon_insensitive, max_iter=4108; total time=   0.7s




[CV] END C=61.86784647090847, epsilon=0.7665799271651603, loss=epsilon_insensitive, max_iter=4108; total time=   0.8s
[CV] END C=61.86784647090847, epsilon=0.7665799271651603, loss=epsilon_insensitive, max_iter=4108; total time=   0.9s
[CV] END C=70.3777989594487, epsilon=0.20321032265892602, loss=epsilon_insensitive, max_iter=7477; total time=   1.2s
[CV] END C=70.3777989594487, epsilon=0.20321032265892602, loss=epsilon_insensitive, max_iter=7477; total time=   1.2s




[CV] END C=70.3777989594487, epsilon=0.20321032265892602, loss=epsilon_insensitive, max_iter=7477; total time=   1.2s
[CV] END C=70.3777989594487, epsilon=0.20321032265892602, loss=epsilon_insensitive, max_iter=7477; total time=   1.5s
[CV] END C=96.7543582520618, epsilon=0.9601289225370155, loss=epsilon_insensitive, max_iter=5593; total time=   1.0s
[CV] END C=96.7543582520618, epsilon=0.9601289225370155, loss=epsilon_insensitive, max_iter=5593; total time=   0.9s
[CV] END C=72.63073152765399, epsilon=0.33130277112537354, loss=epsilon_insensitive, max_iter=7629; total time=   1.2s
[CV] END C=96.7543582520618, epsilon=0.9601289225370155, loss=epsilon_insensitive, max_iter=5593; total time=   0.9s
[CV] END C=70.3777989594487, epsilon=0.20321032265892602, loss=epsilon_insensitive, max_iter=7477; total time=   1.5s
[CV] END C=96.7543582520618, epsilon=0.9601289225370155, loss=epsilon_insensitive, max_iter=5593; total time=   1.0s
[CV] END C=96.7543582520618, epsilon=0.9601289225370155, loss=epsilon_insensitive, max_iter=5593; total time=   1.1s
[CV] END C=72.63073152765399, epsilon=0.33130277112537354, loss=epsilon_insensitive, max_iter=7629; total time=   1.4s
[CV] END C=72.63073152765399, epsilon=0.33130277112537354, loss=epsilon_insensitive, max_iter=7629; total time=   1.4s




[CV] END C=51.61550367901669, epsilon=0.5009065475115424, loss=epsilon_insensitive, max_iter=1444; total time=   0.3s
[CV] END C=28.765367863098927, epsilon=0.2801363209999552, loss=epsilon_insensitive, max_iter=6396; total time=   1.2s
[CV] END C=51.61550367901669, epsilon=0.5009065475115424, loss=epsilon_insensitive, max_iter=1444; total time=   0.3s
[CV] END C=51.61550367901669, epsilon=0.5009065475115424, loss=epsilon_insensitive, max_iter=1444; total time=   0.2s
[CV] END C=28.765367863098927, epsilon=0.2801363209999552, loss=epsilon_insensitive, max_iter=6396; total time=   1.2s
[CV] END C=28.765367863098927, epsilon=0.2801363209999552, loss=epsilon_insensitive, max_iter=6396; total time=   1.1s
[CV] END C=98.10542852272393, epsilon=0.12562098784972142, loss=epsilon_insensitive, max_iter=5449; total time=   0.8s
[CV] END C=98.10542852272393, epsilon=0.12562098784972142, loss=epsilon_insensitive, max_iter=5449; total time=   0.9s
[CV] END C=98.10542852272393, epsilon=0.12562098784972142, loss=epsilon_insensitive, max_iter=5449; total time=   0.8s
[CV] END C=28.765367863098927, epsilon=0.2801363209999552, loss=epsilon_insensitive, max_iter=6396; total time=   1.2s
[CV] END C=41.95237256179461, epsilon=0.47725651463361174, loss=squared_epsilon_insensitive, max_iter=5766; total time=   0.0s
[CV] END C=41.95237256179461, epsilon=0.47725651463361174, loss=squared_epsilon_insensitive, max_iter=5766; total time=   0.0s
[CV] END C=41.95237256179461, epsilon=0.47725651463361174, loss=squared_epsilon_insensitive, max_iter=5766; total time=   0.0s
[CV] END C=51.61550367901669, epsilon=0.5009065475115424, loss=epsilon_insensitive, max_iter=1444; total time=   0.5s
[CV] END C=41.95237256179461, epsilon=0.47725651463361174, loss=squared_epsilon_insensitive, max_iter=5766; total time=   0.0s
[CV] END C=41.95237256179461, epsilon=0.47725651463361174, loss=squared_epsilon_insensitive, max_iter=5766; total time=   0.0s
[CV] END C=40.78004278636741, epsilon=0.03673440685977536, loss=epsilon_insensitive, max_iter=5665; total time=   0.9s
[CV] END C=61.13306710442572, epsilon=0.8974108938510479, loss=epsilon_insensitive, max_iter=4589; total time=   0.8s
[CV] END C=40.78004278636741, epsilon=0.03673440685977536, loss=epsilon_insensitive, max_iter=5665; total time=   0.9s
[CV] END C=71.24152107015792, epsilon=0.9138349167928626, loss=squared_epsilon_insensitive, max_iter=7397; total time=   0.0s
[CV] END C=71.24152107015792, epsilon=0.9138349167928626, loss=squared_epsilon_insensitive, max_iter=7397; total time=   0.0s
[CV] END C=71.24152107015792, epsilon=0.9138349167928626, loss=squared_epsilon_insensitive, max_iter=7397; total time=   0.0s
[CV] END C=61.13306710442572, epsilon=0.8974108938510479, loss=epsilon_insensitive, max_iter=4589; total time=   0.8s
[CV] END C=71.24152107015792, epsilon=0.9138349167928626, loss=squared_epsilon_insensitive, max_iter=7397; total time=   0.0s
[CV] END C=92.77427002908566, epsilon=0.30086342421743695, loss=epsilon_insensitive, max_iter=2906; total time=   0.5s
[CV] END C=92.77427002908566, epsilon=0.30086342421743695, loss=epsilon_insensitive, max_iter=2906; total time=   0.5s
[CV] END C=92.77427002908566, epsilon=0.30086342421743695, loss=epsilon_insensitive, max_iter=2906; total time=   0.4s
[CV] END C=92.77427002908566, epsilon=0.30086342421743695, loss=epsilon_insensitive, max_iter=2906; total time=   0.5s
[CV] END C=92.77427002908566, epsilon=0.30086342421743695, loss=epsilon_insensitive, max_iter=2906; total time=   0.5s




[CV] END C=69.23856807573843, epsilon=0.5911470891910978, loss=epsilon_insensitive, max_iter=5944; total time=   0.9s
[CV] END C=80.08567685419474, epsilon=0.4766190984150368, loss=epsilon_insensitive, max_iter=6550; total time=   1.0s
[CV] END C=69.23856807573843, epsilon=0.5911470891910978, loss=epsilon_insensitive, max_iter=5944; total time=   0.9s
[CV] END C=5.315686147098809, epsilon=0.07430146787237313, loss=squared_epsilon_insensitive, max_iter=3049; total time=   0.0s
[CV] END C=80.08567685419474, epsilon=0.4766190984150368, loss=epsilon_insensitive, max_iter=6550; total time=   1.0s
[CV] END C=5.315686147098809, epsilon=0.07430146787237313, loss=squared_epsilon_insensitive, max_iter=3049; total time=   0.0s
[CV] END C=5.315686147098809, epsilon=0.07430146787237313, loss=squared_epsilon_insensitive, max_iter=3049; total time=   0.0s
[CV] END C=5.315686147098809, epsilon=0.07430146787237313, loss=squared_epsilon_insensitive, max_iter=3049; total time=   0.0s
[CV] END C=9.499571481820235, epsilon=0.03347325585843397, loss=epsilon_insensitive, max_iter=3535; total time=   0.7s
[CV] END C=1.843972064035183, epsilon=0.12558285630873578, loss=epsilon_insensitive, max_iter=6805; total time=   1.1s
[CV] END C=69.23856807573843, epsilon=0.5911470891910978, loss=epsilon_insensitive, max_iter=5944; total time=   1.1s
[CV] END C=80.08567685419474, epsilon=0.4766190984150368, loss=epsilon_insensitive, max_iter=6550; total time=   1.2s
[CV] END C=69.23856807573843, epsilon=0.5911470891910978, loss=epsilon_insensitive, max_iter=5944; total time=   1.1s
[CV] END C=9.499571481820235, epsilon=0.03347325585843397, loss=epsilon_insensitive, max_iter=3535; total time=   0.5s
[CV] END C=1.843972064035183, epsilon=0.12558285630873578, loss=epsilon_insensitive, max_iter=6805; total time=   1.1s
[CV] END C=61.2693221661363, epsilon=0.7588537621771768, loss=squared_epsilon_insensitive, max_iter=4489; total time=   0.0s
[CV] END C=1.843972064035183, epsilon=0.12558285630873578, loss=epsilon_insensitive, max_iter=6805; total time=   1.4s
[CV] END C=33.2246675486631, epsilon=0.5896048194040999, loss=epsilon_insensitive, max_iter=3368; total time=   0.5s
[CV] END C=33.2246675486631, epsilon=0.5896048194040999, loss=epsilon_insensitive, max_iter=3368; total time=   0.6s
[CV] END C=49.93497515741046, epsilon=0.610846781007786, loss=squared_epsilon_insensitive, max_iter=8816; total time=   0.0s
[CV] END C=49.93497515741046, epsilon=0.610846781007786, loss=squared_epsilon_insensitive, max_iter=8816; total time=   0.0s
[CV] END C=49.93497515741046, epsilon=0.610846781007786, loss=squared_epsilon_insensitive, max_iter=8816; total time=   0.0s
[CV] END C=49.93497515741046, epsilon=0.610846781007786, loss=squared_epsilon_insensitive, max_iter=8816; total time=   0.0s
[CV] END C=33.2246675486631, epsilon=0.5896048194040999, loss=epsilon_insensitive, max_iter=3368; total time=   0.6s
[CV] END C=33.2246675486631, epsilon=0.5896048194040999, loss=epsilon_insensitive, max_iter=3368; total time=   0.6s
[CV] END C=33.2246675486631, epsilon=0.5896048194040999, loss=epsilon_insensitive, max_iter=3368; total time=   0.7s
[CV] END C=10.561813474752292, epsilon=0.893546968167199, loss=epsilon_insensitive, max_iter=7331; total time=   1.1s
[CV] END C=10.561813474752292, epsilon=0.893546968167199, loss=epsilon_insensitive, max_iter=7331; total time=   1.2s
[CV] END C=73.26419998146875, epsilon=0.23219648872031118, loss=squared_epsilon_insensitive, max_iter=4013; total time=   0.0s
[CV] END C=73.26419998146875, epsilon=0.23219648872031118, loss=squared_epsilon_insensitive, max_iter=4013; total time=   0.0s
[CV] END C=10.561813474752292, epsilon=0.893546968167199, loss=epsilon_insensitive, max_iter=7331; total time=   1.2s
[CV] END C=73.26419998146875, epsilon=0.23219648872031118, loss=squared_epsilon_insensitive, max_iter=4013; total time=   0.0s
[CV] END C=60.444544910040875, epsilon=0.11785975350802613, loss=epsilon_insensitive, max_iter=6846; total time=   1.1s
[CV] END C=60.444544910040875, epsilon=0.11785975350802613, loss=epsilon_insensitive, max_iter=6846; total time=   1.1s
[CV] END C=10.561813474752292, epsilon=0.893546968167199, loss=epsilon_insensitive, max_iter=7331; total time=   1.1s
[CV] END C=60.444544910040875, epsilon=0.11785975350802613, loss=epsilon_insensitive, max_iter=6846; total time=   1.1s
[CV] END C=60.444544910040875, epsilon=0.11785975350802613, loss=epsilon_insensitive, max_iter=6846; total time=   1.1s
[CV] END C=60.444544910040875, epsilon=0.11785975350802613, loss=epsilon_insensitive, max_iter=6846; total time=   1.2s




[CV] END C=29.753366032023866, epsilon=0.8804607984547561, loss=epsilon_insensitive, max_iter=8670; total time=   1.3s
[CV] END C=29.753366032023866, epsilon=0.8804607984547561, loss=epsilon_insensitive, max_iter=8670; total time=   1.3s
[CV] END C=76.14787094492215, epsilon=0.7818425185731692, loss=squared_epsilon_insensitive, max_iter=1731; total time=   0.0s
[CV] END C=76.14787094492215, epsilon=0.7818425185731692, loss=squared_epsilon_insensitive, max_iter=1731; total time=   0.0s
[CV] END C=76.14787094492215, epsilon=0.7818425185731692, loss=squared_epsilon_insensitive, max_iter=1731; total time=   0.0s
[CV] END C=76.14787094492215, epsilon=0.7818425185731692, loss=squared_epsilon_insensitive, max_iter=1731; total time=   0.0s
[CV] END C=76.14787094492215, epsilon=0.7818425185731692, loss=squared_epsilon_insensitive, max_iter=1731; total time=   0.0s
[CV] END C=17.047416470267898, epsilon=0.09903333606473919, loss=squared_epsilon_insensitive, max_iter=5065; total time=   0.0s
[CV] END C=75.07341154266864, epsilon=0.7604810785159524, loss=epsilon_insensitive, max_iter=8990; total time=   1.4s
[CV] END C=75.07341154266864, epsilon=0.7604810785159524, loss=epsilon_insensitive, max_iter=8990; total time=   1.4s
[CV] END C=29.753366032023866, epsilon=0.8804607984547561, loss=epsilon_insensitive, max_iter=8670; total time=   1.5s
[CV] END C=29.753366032023866, epsilon=0.8804607984547561, loss=epsilon_insensitive, max_iter=8670; total time=   1.5s
[CV] END C=56.21655080113022, epsilon=0.33098553313539747, loss=epsilon_insensitive, max_iter=3520; total time=   0.5s
[CV] END C=29.753366032023866, epsilon=0.8804607984547561, loss=epsilon_insensitive, max_iter=8670; total time=   1.6s




[CV] END C=75.07341154266864, epsilon=0.7604810785159524, loss=epsilon_insensitive, max_iter=8990; total time=   1.4s
[CV] END C=56.21655080113022, epsilon=0.33098553313539747, loss=epsilon_insensitive, max_iter=3520; total time=   0.5s
[CV] END C=75.07341154266864, epsilon=0.7604810785159524, loss=epsilon_insensitive, max_iter=8990; total time=   1.6s




[CV] END C=77.111298912572, epsilon=0.5946023245208218, loss=epsilon_insensitive, max_iter=9095; total time=   1.4s
[CV] END C=77.111298912572, epsilon=0.5946023245208218, loss=epsilon_insensitive, max_iter=9095; total time=   1.6s
[CV] END C=77.111298912572, epsilon=0.5946023245208218, loss=epsilon_insensitive, max_iter=9095; total time=   1.6s
[CV] END C=77.111298912572, epsilon=0.5946023245208218, loss=epsilon_insensitive, max_iter=9095; total time=   1.5s
[CV] END C=7.741761958362759, epsilon=0.13578751224095142, loss=epsilon_insensitive, max_iter=9500; total time=   1.3s
[CV] END C=99.29208362979537, epsilon=0.4986582978169771, loss=epsilon_insensitive, max_iter=4472; total time=   0.6s
[CV] END C=7.741761958362759, epsilon=0.13578751224095142, loss=epsilon_insensitive, max_iter=9500; total time=   1.4s
[CV] END C=7.741761958362759, epsilon=0.13578751224095142, loss=epsilon_insensitive, max_iter=9500; total time=   1.3s
[CV] END C=2.3615270894515548, epsilon=0.26965297923205767, loss=epsilon_insensitive, max_iter=9352; total time=   1.0s
[CV] END C=2.3615270894515548, epsilon=0.26965297923205767, loss=epsilon_insensitive, max_iter=9352; total time=   1.0s
[CV] END C=2.3615270894515548, epsilon=0.26965297923205767, loss=epsilon_insensitive, max_iter=9352; total time=   1.0s
[CV] END C=2.3615270894515548, epsilon=0.26965297923205767, loss=epsilon_insensitive, max_iter=9352; total time=   1.0s
[CV] END C=2.3615270894515548, epsilon=0.26965297923205767, loss=epsilon_insensitive, max_iter=9352; total time=   1.0s




RandomizedSearchCV parameters:

| Parameter | Value |
|---|---|
| estimator | `LinearSVR(random_state=42)` |
| param_distributions | `{'C': <scipy.stats....x70a7741efce0>, 'epsilon': <scipy.stats....x70a6f1bf3140>, 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'], 'max_iter': <scipy.stats....x70a6f07b16a0>}` |
| n_iter | `60` |
| scoring | `'neg_mean_squared_error'` |
| n_jobs | `-1` |
| refit | `True` |
| cv | `5` |
| verbose | `2` |
| pre_dispatch | `'2*n_jobs'` |
| random_state | |

LinearSVR parameters:

| Parameter | Value |
|---|---|
| epsilon | `np.float64(0.8802859881703854)` |
| tol | `0.0001` |
| C | `np.float64(61.592447938201374)` |
| loss | `'epsilon_insensitive'` |
| fit_intercept | `True` |
| intercept_scaling | `1.0` |
| dual | `'auto'` |
| verbose | `0` |
| random_state | `42` |
| max_iter | `6033` |
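
The `uniform(loc, scale)` calls in the search space are easy to misread, because the second argument is a width and not an upper bound. The sketch below (an illustration, not part of the original run) prints the support of such a distribution and compares it with `loguniform(a, b)`, which spreads probability evenly across orders of magnitude and is a common alternative for scale-type parameters such as `C`.

```python
from scipy.stats import loguniform, uniform

c_uniform = uniform(0.01, 100)        # loc=0.01, scale=100
c_loguniform = loguniform(1e-2, 1e3)  # lower bound a=0.01, upper bound b=1000

# Lower and upper bound of each distribution
print("uniform support:   ", c_uniform.support())
print("loguniform support:", c_loguniform.support())

# Probability of drawing a value below 1.0 under each distribution
p_uniform = c_uniform.cdf(1.0)
p_loguniform = c_loguniform.cdf(1.0)
print(f"uniform:    P(C < 1) = {p_uniform:.4f}")
print(f"loguniform: P(C < 1) = {p_loguniform:.4f}")

# Reproducible draws, similar to what a randomized search requests internally
samples = c_loguniform.rvs(size=5, random_state=42)
print(samples)
```

With the uniform distribution, small values of `C` are almost never sampled, whereas the log-uniform one devotes a sizeable share of draws to them.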


### 4.2. Show the best hyperparameters

In [6]:
best_linsvr = cv_linear_svr.best_estimator_
best_params_linsvr = cv_linear_svr.best_params_

print("BEST HYPERPARAMETERS")
print("="*60)
for param, value in best_params_linsvr.items():
    print(f"  {param:20s}: {value}")

BEST HYPERPARAMETERS
  C                   : 61.592447938201374
  epsilon             : 0.8802859881703854
  loss                : epsilon_insensitive
  max_iter            : 6033
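
Besides the best parameters, a fitted search object also stores the cross-validated score of the winning candidate in `best_score_`. The sketch below, on synthetic data standing in for `X_train` and `y_train` (an assumption for illustration), shows how that score might be converted to an RMSE and how passing `random_state` to `RandomizedSearchCV` makes the sampled candidates repeatable. The helper `run_search` is hypothetical.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVR

# Synthetic stand-in for X_train / y_train (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)


def run_search(seed):
    """Run a small randomized search whose sampled candidates depend only on `seed`."""
    search = RandomizedSearchCV(
        LinearSVR(loss="squared_epsilon_insensitive", max_iter=10000, random_state=seed),
        param_distributions={"C": uniform(0.01, 100), "epsilon": uniform(0.001, 1.0)},
        n_iter=10,
        scoring="neg_mean_squared_error",
        cv=5,
        random_state=seed,  # fixes which candidates are sampled
    )
    return search.fit(X_demo, y_demo)


first = run_search(42)
second = run_search(42)

# best_score_ is the mean negative MSE over the folds; flip the sign before the square root
cv_rmse = np.sqrt(-first.best_score_)
print(f"Cross-validated RMSE of the best candidate: {cv_rmse:.3f}")
print("Same best parameters on both runs:", first.best_params_ == second.best_params_)
```

In this notebook the equivalent quantity would be `np.sqrt(-cv_linear_svr.best_score_)`, which is an out-of-fold estimate and therefore a useful companion to the training-set metrics.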


### 4.3. Evaluate and save the model

In [7]:
# Evaluate on the training set (in-sample metrics, so they tend to be optimistic)
y_pred_linsvr = best_linsvr.predict(X_train)
linsvr_rmse = np.sqrt(mean_squared_error(y_train, y_pred_linsvr))
linsvr_r2  = r2_score(y_train, y_pred_linsvr)
linsvr_mae = mean_absolute_error(y_train, y_pred_linsvr)

print("BEST MODEL RESULTS")
print("="*60)
print("Model: LinearSVR:")
print(f"RMSE loss: {linsvr_rmse:.3f}")
print(f"MAE loss: {linsvr_mae:.3f}")
print(f"R2 score: {linsvr_r2:.3f}")

# Save the model
joblib.dump(best_linsvr, "../models/2_SVM_regression/linear_svr.pkl")

BEST MODEL RESULTS
Model: LinearSVR:
RMSE loss: 4.305
MAE loss: 3.159
R2 score: 0.761


['../models/2_SVM_regression/linear_svr.pkl']
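
Because the metrics in this step are computed on the same rows the model was fitted on, they can understate the error on unseen data. One hedged way to see the gap is to compare in-sample predictions with out-of-fold predictions from `cross_val_predict`. The sketch below uses synthetic data and a deliberately flexible SVR (both are assumptions for illustration); the same pattern applies to LinearSVR.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

# Synthetic data (an assumption for illustration): linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(0.0, 0.5, size=200)

# A deliberately flexible model, so the gap between the two estimates is visible
model = SVR(kernel="rbf", C=100.0, gamma=1.0, epsilon=0.1)

# In-sample: fit and predict on the same rows
train_pred = model.fit(X_demo, y_demo).predict(X_demo)
rmse_train = np.sqrt(mean_squared_error(y_demo, train_pred))

# Out-of-fold: each row is predicted by a model that never saw it during fitting
oof_pred = cross_val_predict(model, X_demo, y_demo, cv=5)
rmse_oof = np.sqrt(mean_squared_error(y_demo, oof_pred))

print(f"In-sample RMSE:   {rmse_train:.3f}")
print(f"Out-of-fold RMSE: {rmse_oof:.3f}")
```

A large gap between the two numbers suggests the model is fitting noise, which the training-set metrics alone cannot reveal.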

## Step 5 - Build and train an SVR model (kernel=rbf) with RandomizedSearchCV (k=5)

### 5.1. Find the best SVR configuration with 5-fold CV

In [8]:
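from scipy.stats import loguniform, uniform

# Parameter distributions for SVR (RBF)
param_distributions_svr_rbf = {
    "C": loguniform(1e-2, 1e3),            # log-uniform over 0.01 → 1000
    "gamma": loguniform(1e-4, 1e0),        # log-uniform over 0.0001 → 1
    "epsilon": uniform(0.001, 1.0),        # uniform(loc, scale): 0.001 → 1.001
}

# SVR() uses the RBF kernel by default
# Note: RandomizedSearchCV gets no random_state here, so the sampled candidates differ between runs
svr = SVR()
cv_svr = RandomizedSearchCV(
    svr, 
    param_distributions_svr_rbf, 
    scoring='neg_mean_squared_error', 
    cv=5, 
    n_jobs=-1, 
    verbose=2,
    n_iter=60
)

# Run the search; the best configuration is then refit on the full training set
cv_svr.fit(X_train, y_train)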

Fitting 5 folds for each of 60 candidates, totalling 300 fits
[CV] END C=0.18663761350282734, epsilon=0.9965299375363642, gamma=0.077028344310898; total time=   0.7s
[CV] END C=0.18663761350282734, epsilon=0.9965299375363642, gamma=0.077028344310898; total time=   0.8s
[CV] END C=0.14240953289034622, epsilon=0.6382915203306125, gamma=0.028947375205475644; total time=   0.8s
[CV] END C=0.18663761350282734, epsilon=0.9965299375363642, gamma=0.077028344310898; total time=   0.8s
[CV] END C=0.14240953289034622, epsilon=0.6382915203306125, gamma=0.028947375205475644; total time=   0.8s
[CV] END C=0.18663761350282734, epsilon=0.9965299375363642, gamma=0.077028344310898; total time=   0.9s
[CV] END C=0.14240953289034622, epsilon=0.6382915203306125, gamma=0.028947375205475644; total time=   0.9s
[CV] END C=8.579503124419876, epsilon=0.4766654431344527, gamma=0.0005273804830715262; total time=   0.9s

RandomizedSearchCV parameters:

| Parameter | Value |
|---|---|
| estimator | `SVR()` |
| param_distributions | `{'C': <scipy.stats....x70a6f07b36e0>, 'epsilon': <scipy.stats....x70a6f07e0080>, 'gamma': <scipy.stats....x70a6f07b3110>}` |
| n_iter | `60` |
| scoring | `'neg_mean_squared_error'` |
| n_jobs | `-1` |
| refit | `True` |
| cv | `5` |
| verbose | `2` |
| pre_dispatch | `'2*n_jobs'` |
| random_state | |

SVR parameters:

| Parameter | Value |
|---|---|
| kernel | `'rbf'` |
| degree | `3` |
| gamma | `np.float64(0.0630700746029064)` |
| coef0 | `0.0` |
| tol | `0.001` |
| C | `np.float64(134.77248113487076)` |
| epsilon | `np.float64(0.134955301114153)` |
| shrinking | `True` |
| cache_size | `200` |
| verbose | `False` |


### 5.2. Show the best hyperparameters of the model

In [9]:
best_svr = cv_svr.best_estimator_
best_params_svr = cv_svr.best_params_

print("BEST HYPERPARAMETERS")
print("="*60)
for param, value in best_params_svr.items():
    print(f"  {param:20s}: {value}")

BEST HYPERPARAMETERS
  C                   : 134.77248113487076
  epsilon             : 0.134955301114153
  gamma               : 0.0630700746029064

### 5.3. Evaluate and save the model

In [10]:
# Evaluate on the training set (in-sample metrics, so they tend to be optimistic)
y_pred_svr = best_svr.predict(X_train)
svr_rmse = np.sqrt(mean_squared_error(y_train, y_pred_svr))
svr_r2  = r2_score(y_train, y_pred_svr)
svr_mae = mean_absolute_error(y_train, y_pred_svr)

print("BEST MODEL RESULTS")
print("="*60)
print("Model: SVR:")
print(f"RMSE loss: {svr_rmse:.3f}")
print(f"MAE loss: {svr_mae:.3f}")
print(f"R2 score: {svr_r2:.3f}")

# Save the model
joblib.dump(best_svr, "../models/2_SVM_regression/svr_rbf.pkl")

BEST MODEL RESULTS
Model: SVR:
RMSE loss: 2.484
MAE loss: 1.406
R2 score: 0.921


['../models/2_SVM_regression/svr_rbf.pkl']
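
A model saved with `joblib.dump` can later be restored with `joblib.load`. The sketch below shows the round trip on a tiny stand-in model written to a temporary directory; the data, the model settings, and the file name `svr_demo.pkl` are assumptions for illustration, not the files produced by this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVR

# Tiny stand-in model (an assumption for illustration; the notebook saves best_svr instead)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1]
model = SVR(kernel="rbf", C=10.0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svr_demo.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

When reusing a saved model, new data needs the same columns in the same order as `feature_cols`, preprocessed in the same way as the training set.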

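## Conclusion
In this notebook, we:
- Found the best hyperparameters for LinearSVR and SVR-RBF using randomized search with 5-fold cross-validation
- Trained both models on the training set
- Evaluated both models on the training set with MAE, RMSE, and R2; SVR-RBF fit the training data more closely (RMSE 2.484, R2 0.921) than LinearSVR (RMSE 4.305, R2 0.761)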