

## 🗺️ 10-Step EDA & Modelling Map

| #                                                 | Checkpoint                      | What you run                                                                                                                                               | What you’re deciding                                                                              | Typical follow-up                                     |
| ------------------------------------------------- | ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| 0                                                 | **Load & sanity check**         | `df.shape`, `df.head()`, `df.sample(5)`                                                                                                                    | File read correctly? Wrong delimiter? Weird headers?                                              | Fix delimiter/encoding; rename columns                |
| 1                                                 | **Structural overview**         | `df.info()`, `df.dtypes`, `df.nunique()`                                                                                                                   | Numeric vs categorical vs mixed/text/date; very-wide table?                                       | Start a column-type list for later pipelines          |
| 2                                                 | **Target hunt**                 | Do you already know the target? If not:<br>• Search for obvious labels: `price`, `outcome`, `class`, `y`.<br>• If none, you might be in unsupervised land. | Supervised vs unsupervised → splits & metrics                                                     | For supervised: `y = df[target]`                      |
| 3                                                 | **Missing-value audit**         | `df.isna().sum().sort_values(ascending=False)`                                                                                                             | How much is missing & where? MNAR patterns?                                                       | Imputation strategy per column; or drop               |
| 4                                                 | **Basic statistics**            | `df.describe(include='all')`                                                                                                                               | Ranges, zeros, constants, suspicious values                                                       | Unit conversions; winsorise/extreme-value flags       |
| 5                                                 | **Univariate plots**            | Histograms/density for numeric; bar plots for categorical                                                                                                  | Skewed dists → log/Box-Cox? Rare categories → grouping?                                           | Decide on scaling / encoding approach                 |
| 6                                                 | **Bivariate / target relation** | • Numeric vs target: `sns.boxplot(x=y, y=feat)` or corr heatmap<br>• Cat vs target: `pd.crosstab(y, feat, normalize='columns')`                            | Which variables look predictive? Possible leakage?                                                | Reserve promising features; flag leaky ones           |
| 7 | **Multicollinearity** | `df.corr(numeric_only=True)` + VIF check | Highly correlated numeric features → drop, regularise, or switch to a tree model? | Pick a model that tolerates multicollinearity, or apply PCA |
| 8                                                 | **Class-balance / outliers**    | • `y.value_counts(normalize=True)`<br>• `sns.boxplot(data=df[numeric])`                                                                                    | Imbalanced target → SMOTE / class\_weight.<br>Extreme outliers → robust scaler / isolation forest | Pick metrics: ROC-AUC, F1, balanced accuracy          |
| 9 | **Initial modelling guess** | Decision tree of questions:<br>• Target numeric? → Regression<br>• Target ≤ 20 categories? → Classification<br>• No target? → Clustering / Topic modelling<br>• Timestamp index? → Time-series model<br>• Image pixels? → CNN / CV stack<br>• Text strings? → NLP transformers | Algorithm family & baseline metric | Create preprocessing + model pipeline |
| 10 | **Baseline & iterate** | Train/test split → baseline pipeline (e.g. `ColumnTransformer` + model) | How far is baseline from naïve guess? | Hyper-tune, feature engineering, ensemble |
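
Step 7 mentions a VIF check without showing one. The following is a minimal sketch, assuming the numeric columns are not perfectly collinear (otherwise the matrix inverse fails) and that dropping rows with missing values is acceptable. It relies on the identity that each VIF equals the matching diagonal entry of the inverse correlation matrix. `vif_table` is a hypothetical helper name, not a pandas function, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def vif_table(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per numeric column (rows with NaNs are dropped)."""
    numeric = df.select_dtypes("number").dropna()
    corr = numeric.corr().to_numpy()
    # VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of inv(corr).
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=numeric.columns, name="VIF").sort_values(ascending=False)

# Synthetic demo: "a_noisy" is almost a copy of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "a_noisy": a + 0.1 * rng.normal(size=500)})

vifs = vif_table(demo)
print(vifs)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that the column is largely explained by the others; here `a` and `a_noisy` should land far above that range while `b` stays near 1.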

---

## 🔑  How each checkpoint narrows the modelling path

1. **Data types (Step 1)** → decide *encoding* (One-Hot, Ordinal, TF-IDF) and *scaling* (Standard, MinMax, Robust).  
2. **Target nature (Step 2)** → classification vs regression vs clustering → evaluation metric.  
3. **Missingness (Step 3)** → simple imputation (mean / median / mode) vs model-based imputation (KNN / MICE).  
4. **Distribution shape (Step 5)** → log-transform or Yeo-Johnson for skew; this matters for skew-sensitive algorithms (linear models) and much less for trees.  
5. **Correlation (Step 7)** → if high: Ridge/Lasso/PCA or tree/boosting model.  
6. **Imbalance (Step 8)** → SMOTE / ADASYN / focal-loss; precision-recall curve focus.  
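
The first few decisions in the list above can be sketched as a single scikit-learn preprocessor. This is one possible mapping, not the only one: it assumes median and mode imputation are good enough, that the categorical columns are nominal (hence one-hot), and it borrows the column names `fare` and `deck` purely for illustration with made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

# Numeric columns: median imputation (Step 3), then Yeo-Johnson for skew (Step 5).
# PowerTransformer standardises its output by default, so no extra scaler is added.
numeric_steps = make_pipeline(
    SimpleImputer(strategy="median"),
    PowerTransformer(method="yeo-johnson"),
)
# Categorical columns: mode imputation, then one-hot encoding (Step 1).
categorical_steps = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocess = ColumnTransformer([
    ("num", numeric_steps, make_column_selector(dtype_include="number")),
    ("cat", categorical_steps, make_column_selector(dtype_exclude="number")),
])

# Toy frame with illustrative values, only to show the output shape.
toy = pd.DataFrame({
    "fare": [7.25, 71.3, np.nan, 8.05, 512.3],
    "deck": pd.Series(["C", np.nan, "E", "C", "B"], dtype=object),
})
features = preprocess.fit_transform(toy)
print(features.shape)  # 1 numeric column + 3 one-hot columns
```

Swapping in `KNNImputer`, `RobustScaler`, or `OrdinalEncoder` only changes one step of the relevant sub-pipeline, which is the main reason to keep these choices inside a `ColumnTransformer`.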

---

## 🏁  One-screen cheat-sheet code

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("mystery.csv")       # Step 0: load

print(df.shape, df.head(), sep="\n")  # 0: sanity check
df.info()                             # 1: prints directly (returns None)
print(df.nunique().sort_values())     # 1

target = "quality"                    # 2: set manually; use None if unsupervised
y = df[target] if target else None
X = df.drop(columns=target) if target else df

missing = df.isna().sum()
print(missing[missing > 0].sort_values(ascending=False))  # 3

print(df.describe(include="all").T)   # 4
for col in X.select_dtypes("number"):
    sns.histplot(df[col])             # 5: one histogram per numeric column
    plt.show()

sns.heatmap(df.corr(numeric_only=True))  # 7 (and 6 when the target is numeric)
plt.show()
if target:
    sns.countplot(x=y)                # 8: class balance (classification targets)
    plt.show()
````

Fill in the blank (the `target` column name), run the snippet, and follow the 10-step table for next actions.
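
The snippet covers only the numeric side of Step 6. For a categorical feature, the table suggests `pd.crosstab` with `normalize='columns'`; a small sketch on made-up data follows (the `deck` and `survived` columns and their values are purely illustrative).

```python
import pandas as pd

toy = pd.DataFrame({
    "deck":     ["A", "A", "B", "B", "B", "C"],
    "survived": [1, 0, 1, 1, 0, 0],
})

# normalize="columns" makes each column (one per category) sum to 1,
# so every column reads as the target distribution within that category.
rates = pd.crosstab(toy["survived"], toy["deck"], normalize="columns")
print(rates)
```

Categories whose column differs sharply from the overall target rate are candidates for predictive features; a category that predicts the target perfectly is worth checking for leakage.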

---

### 🚦 Quick decision rules for model selection

| Situation                          | Fast baseline model                                          | Why                                    |
| ---------------------------------- | ------------------------------------------------------------ | -------------------------------------- |
| Numeric target, ≤ 1 k rows         | **ElasticNet**                                               | handles multicollinearity, easy interp |
| Numeric, > 10 k rows or non-linear | **LightGBM Regressor**                                       | speed & non-linear power               |
| Binary target, minority class < 5 % | **XGBoost (`scale_pos_weight`) / LightGBM (`class_weight`)** | handles imbalance, sparse dummies |
| > 5 classes, many rare categories  | **CatBoost**                                                 | native categorical handling            |
| No target, mostly numeric          | **K-Means → silhouette score**                               | quick clustering baseline              |
| Timestamp + single numeric         | **Prophet / ARIMA**                                          | off-the-shelf forecasting              |
| Text column as target or feature   | **TF-IDF + LogisticRegression**, then try **BERT fine-tune** | simple vs SOTA                         |
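
As an illustration of the "K-Means → silhouette score" row, here is a minimal sketch that sweeps the number of clusters and keeps the one with the highest silhouette. The data is synthetic (three well-separated blobs), and the range of `k` is an arbitrary choice for the demo; real features would usually need scaling first.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three well-separated synthetic blobs stand in for "no target, mostly numeric" data.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges over [-1, 1]; higher is better

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On messy real data the silhouette curve is often flat, in which case it is a weak guide and domain knowledge about the expected number of groups matters more.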

---

## 🎯  Remember

1. **Split early, preprocess in-pipeline** to avoid leakage.
2. **Start simple**, measure, then iterate; EDA is iterative too.
3. **Document each decision** (why you dropped `deck`, why you log-scaled `fare`) — future you or teammates will thank you.
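
The first point can be sketched as follows: split before fitting anything, keep imputation and scaling inside the pipeline, and compare against a naive guess as Step 10 asks. The data is synthetic and the model choice (logistic regression) is only an assumption for the demo.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic label with a clear signal
X[rng.random(X.shape) < 0.1] = np.nan          # knock out roughly 10 % of the values

# Split first, so test rows never influence the imputation or scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Naive guess: always predict the most frequent training class.
naive = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Baseline: the imputer and scaler are fitted on the training fold only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
).fit(X_train, y_train)

naive_acc = naive.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f"naive: {naive_acc:.2f}  baseline: {model_acc:.2f}")
```

Fitting the imputer or scaler on the full dataset before splitting would leak test-set statistics into training, which is exactly what the in-pipeline arrangement prevents.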

With this flow you can land on a sensible first model quickly, confident you haven’t missed critical data-quality landmines. Happy exploring!
