# Cheatsheet - Regression & Classification How-To (Step-by-step)
This file summarizes how to apply each of the requested algorithms, with short code examples that can be used right away during an exam.
Each topic covers: summary, inputs/outputs, practical steps, hyperparameters worth tuning, and a short code example.

## Before you start (General checklist)
- Split the data: train/test (plus a validation set, or use cross-validation).
- Check missing values and data types (df.info(), df.isnull().sum()).
- Scale features where needed (SVM, KNN, Logistic Regression, NN).
- Use cross-validation (GridSearchCV) to tune hyperparameters.
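
The checklist above can be sketched end to end as follows. This is a minimal sketch, not a prescribed workflow: the synthetic data from `make_classification`, the column names `f0`..`f5`, and the choice of Logistic Regression are all assumptions made for illustration. Putting the scaler inside a `Pipeline` means each cross-validation fold fits the scaler only on its own training part.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with your own DataFrame.
X_arr, y = make_classification(n_samples=300, n_features=6, random_state=42)
df = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Check data types and missing values per column.
print(df.dtypes)
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Hold out a test set; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

# Scale + model in one pipeline, tuned with 5-fold cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print("best params", grid.best_params_)
print("test accuracy", test_accuracy)
```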

## Regression Methods
### 1) Linear Regression (Simple / Multiple)
Summary: the basic model; use it when the relationship between features and target is linear
Inputs: numeric features (1 or more). Output: continuous target.
Steps:
1. Split into train/test
2. (Optional) scale features (not required for OLS, but harmless)
3. Fit LinearRegression()
4. Evaluate with R², RMSE, MAE
Hyperparams: essentially none for OLS; if regularization is needed, use Ridge/Lasso
Code snippet:

In [None]:
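# Linear / Multiple Linear Regression (scikit-learn)
# Assumes X (features) and y (continuous target) are already loaded.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('R2', r2_score(y_test, y_pred))
print('RMSE', np.sqrt(mean_squared_error(y_test, y_pred)))
print('MAE', mean_absolute_error(y_test, y_pred))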

### 2) Polynomial Regression
Summary: expand the features with PolynomialFeatures when the relationship is non-linear, while still fitting a linear model
Steps:
1. Use PolynomialFeatures(degree=k, include_bias=False) -> poly.fit_transform(X)
2. Fit LinearRegression on the transformed features
3. Evaluate and watch for overfitting (a higher degree overfits more easily)
Hyperparams: degree (2, 3, ...), interaction_only (optional)
Code snippet:

In [None]:
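from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # adds squared and pairwise interaction terms
model = LinearRegression()
model.fit(X_poly, y)
# When predicting, apply the same transform first: model.predict(poly.transform(X_new))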
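
Since a higher degree overfits more easily, one way to pick `degree` is to compare cross-validated scores. The sketch below assumes synthetic one-feature data generated from a quadratic with noise (an illustration, not data from this cheatsheet) and an arbitrary candidate list of degrees.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): y is a noisy quadratic function of one feature.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.3, size=200)

# Mean 5-fold cross-validated R² for each candidate degree.
cv_r2 = {}
for degree in [1, 2, 3, 5, 8]:
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    cv_r2[degree] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"degree={degree}  mean CV R2={cv_r2[degree]:.3f}")

best_degree = max(cv_r2, key=cv_r2.get)
print("degree with best CV score:", best_degree)
```

On real data the scores for neighbouring degrees are often close; preferring the smallest degree within noise of the best score is a common, conservative choice.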

### 3) Ridge (L2 regularization) & Lasso (L1)
Summary: regularized linear models that reduce overfitting; Lasso can also do some feature selection by shrinking coefficients to exactly zero
Steps:
1. Standardize features (important, because the penalty depends on coefficient scale)
2. Use Ridge(alpha=...) or Lasso(alpha=...)
3. Tune alpha via cross-validation (GridSearchCV / RidgeCV / LassoCV)
Hyperparams: alpha (regularization strength)
Code snippet:

In [None]:
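from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Scale inside the pipeline so each CV fold fits the scaler on its own training part.
pipe = make_pipeline(StandardScaler(), Ridge())  # for L1: Lasso() with 'lasso__alpha'
params = {'ridge__alpha': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipe, params, cv=5)
grid.fit(X, y)
print('best alpha', grid.best_params_)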
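
To illustrate the feature-selection effect of Lasso, here is a minimal sketch using `LassoCV`, which picks alpha by cross-validation. The data from `make_regression` (20 features, 5 informative) is an assumption for illustration; how many coefficients end up at zero depends on the data and the chosen alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only 5 of the 20 features carry signal.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)

# LassoCV searches a path of alpha values with 5-fold cross-validation.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
n_zero = int(np.sum(lasso.coef_ == 0))
print("chosen alpha:", lasso.alpha_)
print("coefficients set to zero:", n_zero, "of", lasso.coef_.size)
```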

### 4) When to pick which (Regression) — Quick advice
- Small dataset, linear relation: Linear / Ridge (if the data is noisy)
- Non-linear but low-dim: Polynomial (with care)
- Many features / potential multicollinearity: Ridge/Lasso
- Complex patterns / high accuracy target: tree-based ensembles (RandomForest, XGBoost), which are often the most accurate on tabular data

## Classification Methods
### 1) Logistic Regression
Summary: baseline binary classifier; outputs probabilities and has interpretable coefficients
Steps:
1. Split train/test, scale features
2. Fit LogisticRegression(max_iter=1000, C=...) (C is inverse reg strength)
3. Evaluate with accuracy, precision, recall, f1
Hyperparams: C (regularization), penalty ('l2' default)
Code snippet:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)
clf = LogisticRegression(max_iter=1000)
clf.fit(Xtr, y_train)
print(clf.score(Xte, y_test))
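
Step 3 asks for precision, recall, and F1 in addition to accuracy, and the summary mentions probability outputs. A minimal sketch of both follows, assuming synthetic binary data from `make_classification` in place of a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (assumption): replace with your own X, y.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)

clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(Xtr, y_train)

y_pred = clf.predict(Xte)
proba = clf.predict_proba(Xte)  # one column per class, rows sum to 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```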

### 2) Decision Tree
Summary: tree-based model; interpretable (the tree can be visualized); sensitive to depth, so prune or set max_depth
Steps:
1. Split data (no need to scale)
2. Fit DecisionTreeClassifier(max_depth=...)
3. Tune max_depth, min_samples_split via CV
Hyperparams: max_depth, min_samples_split, criterion ('gini'/'entropy')
Code snippet:

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)
print('acc', clf.score(X_test, y_test))
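
Step 3 (tuning max_depth and min_samples_split via CV) and the interpretability point can be sketched together. The Iris dataset and the grid values are assumptions chosen for illustration; `export_text` prints the fitted tree as plain-text rules, which avoids needing Graphviz.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Candidate values (assumption): adjust to the size of your dataset.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

test_acc = grid.score(X_test, y_test)
print("best params", grid.best_params_)
print("test acc", test_acc)

# Plain-text view of the best tree's decision rules.
rules = export_text(grid.best_estimator_, feature_names=list(iris.feature_names))
print(rules)
```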

### 3) Ensemble: Random Forest
Summary: bagging of decision trees; overfits less than a single tree; a good default for tabular data
Steps:
1. Fit RandomForestClassifier(n_estimators=100)
2. Tune n_estimators, max_depth, max_features via CV
3. Check feature_importances_ for interpretability
Hyperparams: n_estimators, max_depth, max_features
Code snippet:

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print('acc', rf.score(X_test, y_test))
print('imp', rf.feature_importances_)
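
The raw `feature_importances_` array is easier to read when paired with feature names and sorted. A minimal sketch, assuming the Wine dataset bundled with scikit-learn as stand-in data:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Sort features from most to least important (impurity-based importances).
order = np.argsort(rf.feature_importances_)[::-1]
ranked = [(wine.feature_names[i], float(rf.feature_importances_[i])) for i in order]
for name, score in ranked[:5]:
    print(f"{name:30s} {score:.3f}")
```

Impurity-based importances can favour features with many distinct values, so treat the ranking as a rough guide rather than a definitive answer.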

### 4) Naïve Bayes
Summary: assumes independence between features; well suited to text (Bag-of-Words) or categorical data
Steps:
1. Vectorize text (CountVectorizer / TfidfVectorizer)
2. Fit MultinomialNB (counts) or GaussianNB (continuous features), depending on data type
Hyperparams: alpha (smoothing) for MultinomialNB
Code snippet:

In [None]:
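from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# texts_train / texts_test are assumed to be lists of raw strings matching y_train / y_test.
cv = CountVectorizer()
X_text_train = cv.fit_transform(texts_train)  # learn the vocabulary on training text only
X_text_test = cv.transform(texts_test)
nb = MultinomialNB(alpha=1.0)
nb.fit(X_text_train, y_train)
print('acc', nb.score(X_text_test, y_test))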
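
A self-contained version of the text workflow is sketched below. The six-sentence corpus and the `spam`/`ham` labels are made up for illustration only. Wrapping the vectorizer and classifier in a pipeline lets raw strings be passed straight to `fit` and `predict`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption): tiny made-up examples, not real data.
texts = [
    "win free money now",
    "free prize claim now",
    "cheap money offer free",
    "meeting at noon tomorrow",
    "project report due tomorrow",
    "lunch meeting with team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag-of-words counts followed by Multinomial Naive Bayes with Laplace smoothing.
text_clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
text_clf.fit(texts, labels)

pred = text_clf.predict(["free money prize", "team meeting tomorrow"])
print(pred)
```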

### 5) Support Vector Machine (SVM)
Summary: margin-based classifier; good for medium-sized datasets and high-dimensional spaces (text). Important: scale features
Steps:
1. Scale features (StandardScaler)
2. For linearly separable data, try LinearSVC or SVC(kernel='linear'); otherwise use kernel='rbf' and tune C and gamma
Hyperparams: C (regularization), kernel, gamma (for rbf)
Code snippet:

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)
svc = SVC(kernel='rbf', C=1.0, gamma='scale')
svc.fit(Xtr, y_train)
print('acc', svc.score(Xte, y_test))
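
Tuning C and gamma for the RBF kernel can be sketched with a pipeline and `GridSearchCV`. The two-moons data and the grid values are assumptions for illustration; in practice a wider, log-spaced grid is common.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic non-linear data (assumption): two interleaving half circles.
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so it is refit on each CV training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

test_acc = grid.score(X_test, y_test)
print("best params", grid.best_params_)
print("test acc", test_acc)
```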

### 6) Dimensionality reduction (PCA) for RandomForest / SVM
Summary: PCA reduces dimensionality, which speeds up SVM and sometimes helps generalization; RandomForest usually does not need PCA, but it may help with noisy or very high-dimensional data
Steps:
1. Standardize data
2. Fit PCA(n_components=k) on train only
3. Transform train/test and feed to model
Code snippet:

In [None]:
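from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)  # fit the scaler on train only
Xte = sc.transform(X_test)
pca = PCA(n_components=10)  # needs at least 10 features
Xtr_p = pca.fit_transform(Xtr)  # fit PCA on train only
Xte_p = pca.transform(Xte)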
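
Instead of fixing the number of components by hand, `PCA` also accepts a float between 0 and 1 and keeps just enough components to explain that fraction of the variance. The sketch below assumes the Wine dataset and a 95% threshold, both chosen only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler and PCA on the training split only, then reuse them on test.
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
Xtr_p = pca.fit_transform(Xtr)
Xte_p = pca.transform(Xte)

explained = pca.explained_variance_ratio_.sum()
print("components kept:", pca.n_components_)
print("variance explained:", round(float(explained), 3))
```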

### 7) Unsupervised: KMeans & Agglomerative Clustering
Summary: for clustering tasks (no labels). KMeans works well for roughly spherical clusters with similar variance; Agglomerative works well for nested clusters and when a dendrogram is wanted
Steps:
1. (Optional) scale data
2. Choose number of clusters (elbow, silhouette)
3. Fit KMeans or AgglomerativeClustering
Code snippet:

In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering
km = KMeans(n_clusters=3, random_state=42)
labels = km.fit_predict(X)
agg = AgglomerativeClustering(n_clusters=3)
labels2 = agg.fit_predict(X)
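
Step 2 (choosing the number of clusters) can be sketched with the silhouette score, where higher is better. The three well-separated blobs below are synthetic (an assumption), so the answer is known in advance; on real data the peak is usually less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): three tight, well-separated clusters.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.7,
    random_state=42,
)

# Silhouette score for each candidate k (it is undefined for k=1).
sil = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    sil[k] = silhouette_score(X, labels)
    print(f"k={k}  silhouette={sil[k]:.3f}  inertia={km.inertia_:.1f}")

best_k = max(sil, key=sil.get)
print("best k by silhouette:", best_k)
```

The printed inertia values are what an elbow plot would show: inertia always decreases as k grows, so look for the point where the drop flattens.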

### 8) Perceptron & Single-Layer Perceptron (SLP)
Summary: Perceptron is a linear classifier (like Logistic Regression, but without probabilistic output); an SLP is a single-layer neural unit (no hidden layers)
Steps:
1. Scale features
2. Fit Perceptron (sklearn.linear_model.Perceptron) or a single Linear layer in PyTorch/Keras
Code snippet:

In [None]:
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
clf = Perceptron(max_iter=1000, tol=1e-3)
clf.fit(Xtr, y_train)
print('acc', clf.score(sc.transform(X_test), y_test))
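
To show what a single-layer perceptron does internally, here is a minimal NumPy sketch of the classic perceptron update rule (update the weights only on misclassified points). The function name `train_perceptron`, the `{-1, +1}` label encoding, and the six toy points are assumptions for illustration; this is not how scikit-learn's `Perceptron` is implemented internally.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Train a single-layer perceptron; y must contain -1 / +1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # A point is misclassified when its signed score is not positive.
            if yi * (xi @ w + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:  # converged: every training point is classified correctly
            break
    return w, b

# Toy linearly separable data (assumption).
X = np.array([[2.0, 1.0], [3.0, 2.0], [2.5, 3.0],
              [-2.0, -1.0], [-3.0, -2.5], [-1.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b = train_perceptron(X, y)
pred = np.where(X @ w + b > 0, 1, -1)
print("weights", w, "bias", b)
print("predictions", pred)
```

The loop only stops early when the data is linearly separable; otherwise it runs for the full number of epochs.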

### 9) Multi-Layer Perceptron (MLP)
Summary: feedforward neural network; a good fit when the dataset is large and contains non-linear patterns
Steps:
1. Scale features
2. Choose architecture (hidden layers, units), activation (ReLU), optimizer (Adam)
3. Tune learning rate, epochs, regularization
Code snippet (sklearn MLPClassifier):

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)
mlp = MLPClassifier(hidden_layer_sizes=(64,32), max_iter=200, random_state=42)
mlp.fit(Xtr, y_train)
print('acc', mlp.score(Xte, y_test))
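
Step 3 (tuning learning rate and regularization) can be sketched with a small grid search. In `MLPClassifier`, `learning_rate_init` is the initial learning rate and `alpha` is the L2 penalty. The two-moons data, the single 16-unit hidden layer, and the grid values are assumptions kept small so the search runs quickly.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic non-linear data (assumption).
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42)),
])
param_grid = {
    "mlp__alpha": [1e-4, 1e-2],              # L2 regularization strength
    "mlp__learning_rate_init": [1e-3, 1e-2],  # initial learning rate for Adam
}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

test_acc = grid.score(X_test, y_test)
print("best params", grid.best_params_)
print("test acc", test_acc)
```

If a ConvergenceWarning appears, raising `max_iter` or the learning rate is the usual first thing to try.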

### 10) XGBoost (eXtreme Gradient Boosting)
Summary: gradient-boosted trees; often among the best performers on tabular data; requires installing xgboost, which can then be used through its scikit-learn-style wrapper
Steps:
1. Install xgboost (pip install xgboost)
2. Tune n_estimators, learning_rate, max_depth, subsample, colsample_bytree
Code snippet:

In [None]:
# XGBoost example (if installed)
# Recent xgboost versions expect class labels encoded as 0..n_classes-1 (use LabelEncoder if needed).
from xgboost import XGBClassifier, XGBRegressor
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)
xgb.fit(X_train, y_train)
print('acc', xgb.score(X_test, y_test))

## Summary: Which is best (quick picks for exams)
- Regression (highest accuracy on tabular): Gradient Boosting (XGBoost / LightGBM) or RandomForest if XGBoost not available.
- For small/simple problems: Ridge (stable) or Linear Regression.
- Classification (highest accuracy on tabular): XGBoost / LightGBM or RandomForest.
- For text: Multinomial Naive Bayes, or a tuned Logistic Regression / linear SVM on TF-IDF.
- For high-dim small-sample: SVM with kernel (or regularized linear models).
Practical exam advice:
1) If allowed any model and accuracy matters -> try XGBoost/LightGBM with basic hyperparameter tuning.
2) If time-limited -> RandomForest with 100-200 trees and a tuned max_depth.
3) Always run cross-validation (5-fold) and report the metric requested.
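
The advice above (5-fold cross-validation, report the requested metric) can be sketched as a small model comparison. The breast cancer dataset, the two candidate models, and F1 as the requested metric are assumptions for illustration; swap in whatever the exam specifies.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Candidate models (assumption): a scaled linear baseline and a tree ensemble.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean and spread of the 5-fold cross-validated F1 score for each model.
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    cv_scores[name] = scores.mean()
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Changing `scoring` to `"accuracy"`, `"roc_auc"`, or (for regression) `"r2"` reports a different metric without touching the rest of the loop.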

## Quick cheat-sheet (Commands to run in PowerShell)
Install common packages if missing:
pip install scikit-learn xgboost lightgbm shap statsmodels