# 🤖 Scikit-learn Review for ML/DL

**Objective:** Review the ML pipeline with Scikit-learn

**Contents:**
- Data preprocessing
- Train-test split & cross-validation
- ML algorithms (classification, regression)
- Model evaluation metrics
- Hyperparameter tuning
- Pipelines

**Level:** Intermediate

---

In [None]:
import numpy as np
import pandas as pd
from sklearn import __version__ as sklearn_version

print(f"Scikit-learn: {sklearn_version}")

# Sample data
from sklearn.datasets import make_classification, make_regression

X_cls, y_cls = make_classification(n_samples=1000, n_features=20, n_informative=15, 
                                     n_redundant=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)

print(f"Classification: X={X_cls.shape}, y={y_cls.shape}")
print(f"Regression: X={X_reg.shape}, y={y_reg.shape}")

---

## 1. Data Preprocessing

### Scaling & Encoding

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# StandardScaler: (x - mean) / std
scaler_std = StandardScaler()
X_scaled = scaler_std.fit_transform(X_cls)

# MinMaxScaler: (x - min) / (max - min)
scaler_minmax = MinMaxScaler()
X_minmax = scaler_minmax.fit_transform(X_cls)

# Imputation (missing values): blank out ~10% of entries, seeded for reproducibility
rng = np.random.default_rng(42)
X_with_missing = X_cls.copy()
X_with_missing[rng.random(X_with_missing.shape) < 0.1] = np.nan
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X_with_missing)

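# Label encoding (class labels -> integers). LabelEncoder is meant for targets;
# for input features, OrdinalEncoder or OneHotEncoder is the usual choice.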
categories = np.array(['A', 'B', 'C', 'A', 'B'])
le = LabelEncoder()
categories_encoded = le.fit_transform(categories)

# One-hot encoding (sparse_output requires scikit-learn >= 1.2; older releases use sparse=False)
categories_2d = categories.reshape(-1, 1)
ohe = OneHotEncoder(sparse_output=False)
categories_onehot = ohe.fit_transform(categories_2d)

print(f"Scaled mean: {X_scaled.mean(axis=0)[:3]}")
print(f"Scaled std: {X_scaled.std(axis=0)[:3]}")
print(f"MinMax range: [{X_minmax.min()}, {X_minmax.max()}]")
print(f"Label encoded: {categories_encoded}")
print(f"One-hot shape: {categories_onehot.shape}")

### 💡 When to Use

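- **StandardScaler**: Default choice for distance- and gradient-based models (SVM, KNN, Neural Networks, regularized Logistic Regression)
- **MinMaxScaler**: When features must lie in a bounded range such as [0, 1]; note that it is sensitive to outliers
- **No scaling**: Tree-based models (Random Forest, XGBoost) split on thresholds, so monotonic rescaling of a feature does not change the splits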
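
The scaling guidance above can be checked empirically. The sketch below is a rough illustration on synthetic data, not a benchmark: it inflates one feature by a factor of 1000 (an arbitrary choice made here to mimic mismatched units) and measures how often each model predicts the same label with raw versus standardized inputs. The helper `agreement` and all `*_demo` names are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; one feature is inflated to mimic mismatched units
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)
X_demo[:, 0] *= 1000.0
X_demo_scaled = StandardScaler().fit_transform(X_demo)

def agreement(make_model):
    """Fraction of samples predicted identically with raw vs. standardized inputs."""
    pred_raw = make_model().fit(X_demo, y_demo).predict(X_demo)
    pred_scaled = make_model().fit(X_demo_scaled, y_demo).predict(X_demo_scaled)
    return float(np.mean(pred_raw == pred_scaled))

tree_agreement = agreement(lambda: DecisionTreeClassifier(max_depth=4, random_state=0))
knn_agreement = agreement(lambda: KNeighborsClassifier(n_neighbors=5))

print(f"Decision tree agreement (raw vs. scaled): {tree_agreement:.3f}")
print(f"KNN agreement (raw vs. scaled):           {knn_agreement:.3f}")
```

The tree's agreement should be at or very near 1.0, because standardization preserves the ordering of values within each feature. KNN relies on distances, so its predictions typically shift once the inflated feature stops dominating.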

## 2. Train-Test Split & Cross-Validation

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

# Basic split
X_train, X_test, y_train, y_test = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42, stratify=y_cls  # stratify maintains class balance
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train class balance: {np.bincount(y_train) / len(y_train)}")
print(f"Test class balance: {np.bincount(y_test) / len(y_test)}")

# Cross-validation
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

print(f"\nCV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")

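# Stratified K-Fold keeps class proportions in every fold (useful for imbalanced data).
# cv=5 above already stratifies for classifiers; an explicit splitter adds shuffling.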
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
    print(f"Fold {fold+1}: Train={len(train_idx)}, Val={len(val_idx)}")

## 3. Classification Models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

# Train and evaluate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Model Comparison:\n")
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average='binary')
    rec = recall_score(y_test, y_pred, average='binary')
    f1 = f1_score(y_test, y_pred, average='binary')
    
    results.append({'Model': name, 'Accuracy': acc, 'Precision': prec, 'Recall': rec, 'F1': f1})

results_df = pd.DataFrame(results).sort_values('F1', ascending=False)
print(results_df.to_string(index=False))

## 4. Regression Models

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Split regression data
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Models
reg_models = {
    'Linear Regression': LinearRegression(),
    'Ridge (L2)': Ridge(alpha=1.0),
    'Lasso (L1)': Lasso(alpha=1.0),
    'Decision Tree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

print("Regression Model Comparison:\n")
reg_results = []

for name, model in reg_models.items():
    model.fit(X_train_r, y_train_r)
    y_pred = model.predict(X_test_r)
    
    mse = mean_squared_error(y_test_r, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test_r, y_pred)
    r2 = r2_score(y_test_r, y_pred)
    
    reg_results.append({'Model': name, 'RMSE': rmse, 'MAE': mae, 'R²': r2})

reg_results_df = pd.DataFrame(reg_results).sort_values('R²', ascending=False)
print(reg_results_df.to_string(index=False))

## 5. Model Evaluation

### Classification Metrics

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Random Forest as the example model (substitute the top row of results_df if it differs)
best_model = RandomForestClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
print(f"  TN: {cm[0,0]}, FP: {cm[0,1]}")
print(f"  FN: {cm[1,0]}, TP: {cm[1,1]}")

# ROC AUC
auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC AUC: {auc:.3f}")

## 6. Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid Search (exhaustive)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10]
}

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='f1', n_jobs=-1, verbose=0)
grid_search.fit(X_train, y_train)

print("Grid Search Results:")
print(f"  Best params: {grid_search.best_params_}")
print(f"  Best score: {grid_search.best_score_:.3f}")

# Random Search (tries n_iter settings sampled from distributions;
# randint(low, high) excludes high, so randint(50, 200) yields 50..199)
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(rf, param_dist, n_iter=20, cv=3, 
                                   scoring='f1', random_state=42, n_jobs=-1, verbose=0)
random_search.fit(X_train, y_train)

print("\nRandom Search Results:")
print(f"  Best params: {random_search.best_params_}")
print(f"  Best score: {random_search.best_score_:.3f}")

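# Compare both tuned models on the held-out test set
y_pred_tuned = grid_search.best_estimator_.predict(X_test)
y_pred_random = random_search.best_estimator_.predict(X_test)
print(f"\nTest F1 (grid search): {f1_score(y_test, y_pred_tuned):.3f}")
print(f"Test F1 (random search): {f1_score(y_test, y_pred_random):.3f}")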

## 7. Pipelines

### End-to-end workflow

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Simple pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Fit and predict
pipe.fit(X_train, y_train)
y_pred_pipe = pipe.predict(X_test)
print(f"Pipeline Accuracy: {accuracy_score(y_test, y_pred_pipe):.3f}")

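# More complex: different preprocessing for different feature types
# (Demo with synthetic data. A DataFrame keeps the numeric columns numeric;
# np.column_stack would cast every column to strings and break StandardScaler.)
rng = np.random.default_rng(42)
numeric_features = [f"num_{i}" for i in range(5)]
categorical_features = ["cat_0", "cat_1"]

X_mixed = pd.DataFrame(rng.standard_normal((1000, 5)), columns=numeric_features)
for col in categorical_features:
    X_mixed[col] = rng.choice(['A', 'B', 'C'], size=1000)
y_mixed = rng.integers(0, 2, size=1000)  # random labels, for demonstration only

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features)
    ])

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Fit once to confirm that both branches of the ColumnTransformer run
full_pipeline.fit(X_mixed, y_mixed)
print(f"Transformed shape: {full_pipeline[:-1].transform(X_mixed).shape}")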

print("\n✅ Pipelines ensure:")
print("  1. No data leakage (fit only on train)")
print("  2. Reproducible preprocessing")
print("  3. Easy deployment (single object)")
print("  4. Hyperparameter tuning across all steps")
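
The last point, tuning across all steps, relies on scikit-learn's `<step>__<param>` naming convention. The following is a minimal sketch on fresh synthetic data. The grid values and the `demo_*` names are illustrative choices made here, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# '<step>__<param>' targets a parameter inside a step;
# a bare step name swaps out the whole step.
demo_grid = {
    'scaler': [StandardScaler(), MinMaxScaler()],
    'classifier__C': [0.1, 1.0, 10.0],
}

# Each CV fold refits the scaler on that fold's training part only
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3, scoring='accuracy')
demo_search.fit(X_tr, y_tr)

print(f"Best params: {demo_search.best_params_}")
print(f"Best CV accuracy: {demo_search.best_score_:.3f}")
print(f"Test accuracy: {demo_search.score(X_te, y_te):.3f}")
```

Because the scaler lives inside the estimator being searched, the preprocessing choice and the model's regularization strength are selected together under the same cross-validation.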

---

## 🎯 Key Takeaways

### ML Workflow

```python
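from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load data (synthetic stand-in; replace with your own loader)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 2. Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. Preprocess (fit on train only!)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # No fit!

# 4. Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 5. Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# 6. Tune (optional; small example grid)
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)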
```

### Critical Points

1. **Always split BEFORE preprocessing** (avoid data leakage)
2. **Fit on train, transform on test** (no fit on test!)
3. **Use stratify for imbalanced data**
4. **Use pipelines** for reproducibility
5. **Cross-validation** for robust evaluation
6. **Tree-based models don't need scaling**
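
Several of these points combine naturally: wrapping the scaler and model in a pipeline and passing it to `cross_val_score` means the scaler is refit on each training fold, so no statistics leak from validation data. The sketch below assumes an imbalanced synthetic dataset (an 80/20 class split chosen here for illustration) and stratified folds. The `leak_free` and `cv_scores` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 80% class 0, 20% class 1
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)

# Scaler and model travel together, so each fold fits the scaler on its own training part
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly constant across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(leak_free, X_demo, y_demo, cv=cv, scoring='f1')

print(f"F1 per fold: {cv_scores.round(3)}")
print(f"Mean F1: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

Scaling `X_demo` once before cross-validation would instead expose every fold to the mean and variance of its own validation rows, which is the leakage that point 1 warns about.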

### Common Metrics

**Classification:**
- Accuracy: Overall correctness
- Precision: Of predicted positives, how many correct?
- Recall: Of actual positives, how many found?
- F1: Harmonic mean of precision & recall
- ROC AUC: Area under the ROC curve (TPR vs. FPR across all thresholds); 0.5 is random ranking, 1.0 is perfect
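
As a worked example of these definitions, the sketch below derives accuracy, precision, recall and F1 from the four confusion-matrix counts and can be compared against scikit-learn's own functions. The ten labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels: 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)           # overall correctness
precision = tp / (tp + fp)                           # of predicted positives, how many correct
recall = tp / (tp + fn)                              # of actual positives, how many found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Here TP=3, FP=1, FN=1 and TN=5, which gives an accuracy of 0.80 and a precision, recall and F1 of 0.75 each.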

**Regression:**
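- MSE/RMSE: Mean squared error and its square root (penalize large errors; RMSE is in the units of the target)
- MAE: Mean absolute error (less sensitive to outliers than MSE)
- R²: Fraction of variance explained (at most 1, higher is better; can be negative when the model does worse than predicting the mean)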
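
The regression metrics can be computed by hand in the same way. The sketch below uses a deliberately poor prediction (the targets in reverse order, an artificial case chosen here) to show that R² is not bounded below by 0: a model that does worse than always predicting the mean gets a negative score.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([5.0, 4.0, 3.0, 2.0, 1.0])  # deliberately bad: reversed order

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)           # mean squared error
rmse = np.sqrt(mse)                     # same units as the target
mae = np.mean(np.abs(residuals))        # mean absolute error

ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f}, RMSE={rmse:.3f}, MAE={mae:.2f}, R²={r2:.2f}")
```

With these numbers the squared residuals sum to 40 against a total sum of squares of 10, so MSE is 8.0, MAE is 2.4 and R² is 1 - 40/10 = -3.0.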

---

**Next:** OpenCV for computer vision