# Cheat Sheet: Key Steps and Decisions for MLP & NLP Mastery Challenge

---

## 1. Initial Data Handling
- **Load the dataset**:
    - Use `pandas.read_csv()` or equivalent.
    - **Check missing values**: `df.isnull().sum()`.

- **Split your data**:
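    - Use `train_test_split` from `sklearn.model_selection` (typically an 80/20 or 70/30 split); set `random_state` for reproducibility and `stratify=y` for classification targets.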
  
- **Explore the dataset**:
    - Use `df.describe()`, `df.info()`.
    - Plot pairwise distributions with `sns.pairplot(df)` and correlations among numeric columns with `sns.heatmap(df.corr(numeric_only=True))`.
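
The steps above can be sketched roughly as follows. The inline CSV text, its column names, and the `target` column are hypothetical stand-ins for a real file passed to `pandas.read_csv()`.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical in-memory CSV standing in for a real file path.
csv_text = """age,income,city,target
25,40000,Paris,0
32,,Berlin,1
47,82000,Paris,1
51,61000,,0
38,58000,Rome,1
29,45000,Berlin,0
44,77000,Rome,1
35,52000,Paris,0
60,90000,Berlin,1
23,38000,Rome,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Summary statistics and a correlation matrix over numeric columns only.
print(df.describe())
corr = df.corr(numeric_only=True)

# 80/20 split; stratify keeps the class ratio similar in both parts.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```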

---

## 2. Data Preprocessing
- **Feature Scaling**:
    - Use `StandardScaler` for roughly normal distributions.
    - Use `MinMaxScaler` when a bounded range (e.g. [0, 1]) is needed; it is sensitive to outliers.
    - Use `RobustScaler` (median and IQR based) for data with outliers.

- **Handle Missing Data**:
    - **Numerical missing data**: `SimpleImputer` (mean or median) or `KNNImputer`.
    - **Categorical missing data**: Use **Most Frequent Imputation** (`SimpleImputer(strategy="most_frequent")`) or add a new "Missing" category.

- **Encoding Categorical Data**:
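    - Use `LabelEncoder` for the target labels only; it is not meant for input features.
    - Use `OrdinalEncoder` for binary or ordered categorical features.
    - Use `OneHotEncoder` for unordered multi-class categorical features.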
    - Use `ColumnTransformer` for applying transformations to specific columns.
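
One way to combine imputation, scaling, and encoding is a `ColumnTransformer` with one small pipeline per column type. This is a minimal sketch; the frame and its column names (`age`, `income`, `city`) are hypothetical, and the median/most-frequent strategies are just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training frame with one missing value per column type.
X_train = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [40000.0, 52000.0, 82000.0, 61000.0],
    "city": pd.Series(["Paris", "Berlin", np.nan, "Paris"], dtype=object),
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Fit on the training data only; reuse preprocess.transform() on the test data.
X_train_prepared = preprocess.fit_transform(X_train)
print(X_train_prepared.shape)
```

Here the output has two scaled numeric columns plus one column per city seen during fitting.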

---

## 3. Feature Engineering
- **Handle Date/Time Variables**:
    - Extract day, month, year, etc.

- **Feature Construction**:
    - Create new features from existing ones (interaction terms, binning).
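
A small illustration of both ideas with pandas. The columns (`order_date`, `price`, `quantity`) and the bin edges are made up for the example.

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-03-02", "2024-07-21"],
    "price": [10.0, 25.0, 40.0],
    "quantity": [3, 1, 2],
})

# Date/time parts extracted through the .dt accessor.
df["order_date"] = pd.to_datetime(df["order_date"])
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day"] = df["order_date"].dt.day
df["day_of_week"] = df["order_date"].dt.dayofweek  # Monday = 0

# Interaction term: product of two existing features.
df["revenue"] = df["price"] * df["quantity"]

# Binning: turn a continuous feature into ordered bands.
df["price_band"] = pd.cut(
    df["price"], bins=[0, 15, 30, 50], labels=["low", "mid", "high"]
)
print(df)
```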

---

## 4. Outlier Detection & Removal
- **Handle Outliers**:
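    - **Z-score** for roughly normal data (a common rule flags |z| > 3).
    - **IQR Method** for non-normal data (flag values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR).
    - **Winsorization** for capping outliers at chosen percentiles without removing rows.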
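
The three approaches can be compared on a single column. This is a rough sketch with made-up values; the thresholds (|z| > 3, 1.5 * IQR, 5th/95th percentile caps) are common conventions rather than fixed rules.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95], dtype=float)

# Z-score rule: distance from the mean in standard deviations.
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 3]

# IQR rule: flag points far outside the middle 50% of the data.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Winsorization: cap extremes at percentiles instead of dropping rows.
low_cap, high_cap = values.quantile([0.05, 0.95])
winsorized = values.clip(lower=low_cap, upper=high_cap)

print(z_outliers.tolist(), iqr_outliers.tolist(), winsorized.max())
```

Note that on small samples a single extreme value inflates the standard deviation, so the z-score rule can miss outliers that the IQR rule catches.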

---

## 5. Model Selection
- **Algorithm choice**:
    - **Classification**: Start with **Logistic Regression**, **KNN**, or **Random Forest**.
    - **Regression**: Use **Linear Regression** or **Gradient Boosting**.
    - **Dimensionality Reduction**: Use **PCA** when there are many correlated features; scale the features first.

- **Consider ensemble methods** like **Bagging**, **Boosting**, or **Stacking**.
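
As an illustration of the PCA bullet above, the sketch below standardizes the features and keeps enough components to explain 95% of the variance. The Iris dataset and the 95% threshold are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is variance-based, so features should be on a comparable scale first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_)
```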

---

## 6. Model Training
- **Fit the model**:
    - `model.fit(X_train, y_train)`.

- **Check for overfitting**:
    - Use **cross-validation** (`cross_val_score`) and compare it with the training score; a large gap suggests overfitting.
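
A minimal sketch of fitting a model and checking it against cross-validation. The breast cancer dataset and the unpruned decision tree are assumptions, picked because an unpruned tree tends to memorize its training data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# An unpruned tree can fit the training set almost perfectly.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Training accuracy versus 5-fold cross-validated accuracy on the same data.
train_accuracy = model.score(X_train, y_train)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

print(f"train accuracy:   {train_accuracy:.3f}")
print(f"CV accuracy mean: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

If the training score is much higher than the cross-validated mean, simplifying the model (for example with `max_depth`) is a reasonable next step.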

---

## 7. Model Evaluation
- **For classification**:
    - Evaluate with **accuracy**, **precision**, **recall**, **F1-score**, **confusion matrix**.
    - For **imbalanced data**, accuracy can be misleading; prefer **F1-score**, the **AUC-ROC curve**, or the precision-recall curve.

- **For regression**:
    - Evaluate using **MSE**, **RMSE**, **R2 Score**.
  
- **Plot performance**:
    - Confusion matrices, ROC curves, and residual plots.
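
The metrics above are all available in `sklearn.metrics`. The labels, predictions, and scores below are made-up numbers, used only to show the calls.

```python
import math

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)

# --- Classification (hypothetical labels, predictions, and probabilities) ---
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.15, 0.3, 0.25, 0.7, 0.9, 0.8, 0.6, 0.4]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # needs scores, not hard labels
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted

print(accuracy, precision, recall, f1, auc)
print(cm)

# --- Regression (hypothetical targets and predictions) ---
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.5, 5.5, 7.0, 8.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = math.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_true_reg, y_pred_reg)

print(mse, rmse, r2)
```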

---

## 8. Hyperparameter Tuning
- **Optimize parameters** using:
    - **GridSearchCV** or **RandomizedSearchCV**.
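
A small grid search might look like the sketch below. The Iris dataset, the KNN model, and the parameter grid are assumptions for illustration; the scaler sits inside the pipeline so it is refit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
print(f"test accuracy:    {search.score(X_test, y_test):.3f}")
```

`RandomizedSearchCV` takes a similar setup but samples a fixed number of combinations (`n_iter`), which helps when the grid is large.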

---

## 9. Post-Model Analysis
- **Check feature importance**:
    - Use `model.feature_importances_` for tree-based models like **Random Forest**.

- **Evaluate model generalization**:
    - Analyze test set performance and cross-validation scores.
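
Both checks can be sketched with a random forest. The Iris dataset and the hyperparameters are assumptions; `feature_importances_` here is the impurity-based importance that tree ensembles in scikit-learn expose after fitting.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, labeled with feature names and sorted.
importances = pd.Series(
    model.feature_importances_, index=data.feature_names
).sort_values(ascending=False)
print(importances)

# Generalization check: cross-validated score versus held-out test score.
cv_mean = cross_val_score(model, X_train, y_train, cv=5).mean()
test_accuracy = model.score(X_test, y_test)
print(f"CV mean: {cv_mean:.3f}, test: {test_accuracy:.3f}")
```

Similar cross-validation and test scores suggest the model generalizes; a test score far below the cross-validated one is worth investigating.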

---

## 10. Model Interpretability
- **Interpret the model**:
    - Use **SHAP** or **LIME** for interpretability.

---

## 11. Model Comparison
- **Compare multiple models**:
    - Train several models and compare their performance.
    - Visualize model performance using bar charts for accuracy or relevant metrics.
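
A comparison loop can be as simple as the sketch below. The three models, the breast cancer dataset, and accuracy as the metric are assumptions; swap in whatever fits the task.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive models get a scaler inside their pipeline.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean 5-fold accuracy per model, using the same folds and metric for all.
results = pd.Series({
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}).sort_values(ascending=False)

print(results)
```

With matplotlib installed, `results.plot.bar()` gives the bar chart mentioned above.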

---

## 12. Final Notes
- **Avoid common pitfalls**:
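    - Don’t forget to **scale your data** (especially for distance-based algorithms such as KNN, SVM, and K-Means).
    - Always split your data into **train/test sets** before fitting scalers, imputers, or encoders, and fit them on the training set only, to avoid data leakage.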
    - **Document your code** for clarity.
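
The first two pitfalls can be shown together. In this sketch (the Wine dataset and default KNN are assumptions), the scaler is fit on the training split only and then applied to the test split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN relies on distances, so unscaled features with large ranges dominate.
raw_accuracy = KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)
scaled_accuracy = (
    KNeighborsClassifier().fit(X_train_scaled, y_train).score(X_test_scaled, y_test)
)

print(f"KNN without scaling: {raw_accuracy:.3f}")
print(f"KNN with scaling:    {scaled_accuracy:.3f}")
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics leak into training.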

---

### Additional Algorithms to Remember:
- **Logistic Regression** for binary classification.
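- **KNN** as a simple non-parametric, distance-based method for classification or regression.
- **SVM** for complex, non-linear decision boundaries (via kernels).
- **Random Forest** as a robust baseline with built-in feature importances.
- **Gradient Boosting** for strong predictive performance on tabular data.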
- **K-Means** for clustering.
- **PCA** for dimensionality reduction.

---
