## All Imports

| Category                        | Purpose                                  | Import Statements                                                                                                                                                       |
| ------------------------------- | ---------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Core Data Handling**          | DataFrames & arrays                      | `import pandas as pd`<br>`import numpy as np`                                                                                                                           |
| **Statsmodels – Formula API**   | OLS & Logit using formulas               | `import statsmodels.formula.api as smf`                                                                                                                                 |
| **Statsmodels – Class API**     | OLS & Logit (X/y)                        | `import statsmodels.api as sm`                                                                                                                                          |
| **Statsmodels Diagnostics**     | VIF, influence plots                     | `from statsmodels.stats.outliers_influence import variance_inflation_factor`<br>`from statsmodels.graphics.regressionplots import influence_plot`                       |
| **Regression Models (sklearn)** | Linear, Logistic, Ridge, Lasso, etc.     | `from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet`                                                                       |
| **Classification Models**       | SVM, Trees, RF, GBM                      | `from sklearn.svm import SVC`<br>`from sklearn.tree import DecisionTreeClassifier`<br>`from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier` |
| **Feature Selection**           | **Recursive Feature Elimination**        | **`from sklearn.feature_selection import RFE`**                                                                                                                         |
| **Dimensionality Reduction**    | **PCA**                                  | **`from sklearn.decomposition import PCA`**                                                                                                                             |
| **Train/Test Split**            | Data splitting                           | `from sklearn.model_selection import train_test_split`                                                                                                                  |
| **Scaling**                     | Standardization, normalization           | `from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler`                                                                                          |
| **Encoding**                    | One-hot, label, multi-label              | `from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MultiLabelBinarizer`                                                                                    |
| **Feature Engineering**         | Polynomial features, function transforms | `from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures`                                                                                             |
| **Imputation**                  | Missing values                           | `from sklearn.impute import SimpleImputer`                                                                                                                              |
| **Pipelines**                   | Workflow pipelines                       | `from sklearn.pipeline import Pipeline`                                                                                                                                 |
| **Column-wise Processing**      | Numeric + categorical pipelines          | `from sklearn.compose import ColumnTransformer`                                                                                                                         |
| **Model Selection**             | CV, grid search                          | `from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score`                                                                                 |
| **Regression Metrics**          | R², RMSE, MAE                            | `from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error`                                                                                         |
| **Classification Metrics**      | Accuracy, Precision, Recall, F1, AUC     | `from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report`                           |
| **Visualization**               | Plots & EDA                              | `import matplotlib.pyplot as plt`<br>`import seaborn as sns`                                                                                                            |
| **Model Persistence**           | Save/load models                         | `import joblib`                                                                                                                                                         |
| **Warnings**                    | Suppress warnings                        | `import warnings`<br>`warnings.filterwarnings('ignore')`                                                                                                                |

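The imports above are listed in isolation, so here is a minimal sketch of how several of them (`Pipeline`, `ColumnTransformer`, `SimpleImputer`, the scalers and encoders) can fit together. The DataFrame, its column names (`age`, `income`, `region`, `bought`) and the random data are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50, 15, n),
    "region": rng.choice(["north", "south", "east"], n),
})
df["bought"] = (df["income"] + rng.normal(0, 5, n) > 50).astype(int)
df.loc[::25, "age"] = np.nan  # inject a few missing values

X = df.drop(columns="bought")
y = df["bought"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode, ignoring categories unseen in training.
categorical = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["region"]),
])

# Preprocessing and model in one estimator, fitted on training data only.
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("Test accuracy:", round(test_accuracy, 3))
```

Wrapping the preprocessing in the pipeline means the imputer, scaler and encoder learn their statistics from the training split only, which is the same discipline as the `fit_transform` / `transform` pairs used elsewhere in this sheet.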

## Statsmodels Code

| Approach                                   | Model Type  | Required Imports                                                                                                                                                       | Code Pattern                                                                     | When to Use                                |
| ------------------------------------------ | ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------ |
| **Formula API (Simple)**                   | OLS         | `import statsmodels.formula.api as smf`                                                                                                                                | `smf.ols("y ~ x1 + x2", data=df).fit()`                                          | Few predictors, readable formulas          |
|                                            | Logit       | `import statsmodels.formula.api as smf`                                                                                                                                | `smf.logit("y ~ x1 + x2", data=df).fit()`                                        | Binary classification, easy syntax         |
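| **Formula API (All columns)**              | OLS / Logit | `import statsmodels.formula.api as smf`                                                                                                                                | `formula = "y ~ " + " + ".join(df.columns.drop("y"))`<br>`smf.ols(formula, data=df).fit()` | Use all predictors; patsy has no R-style `y ~ .` shortcut, so build the formula from the column names |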
| **Formula API (Programmatic formula)**     | OLS / Logit | `import statsmodels.formula.api as smf`                                                                                                                                | `formula = "y ~ " + " + ".join(cols)`<br>`smf.ols(formula, data=df).fit()`       | Many columns, dynamic selection            |
| **Formula API with categorical variables** | OLS / Logit | `import statsmodels.formula.api as smf`                                                                                                                                | `smf.ols("y ~ C(cat_var)", data=df).fit()`                                       | Auto one-hot encoding with baseline        |
| **Class API (Manual X & y)**               | OLS         | `import statsmodels.api as sm`                                                                                                                                         | `y = df["y"]`<br>`X = df.drop("y", axis=1)`<br>`X = sm.add_constant(X)`<br>`sm.OLS(y, X).fit()`   | Production pipelines, large feature sets   |
|                                            | Logit       | `import statsmodels.api as sm`                                                                                                                                         | `y = df["y"]`<br>`X = df.drop("y", axis=1)`<br>`X = sm.add_constant(X)`<br>`sm.Logit(y, X).fit()` | Binary classification with many predictors |
| **Class API with NumPy arrays**            | OLS / Logit | `import statsmodels.api as sm`                                                                                                                                         | `model = sm.OLS(y_arr, X_arr).fit()`                                             | When already using numeric matrices        |
| **Using `Q()` for special column names**   | OLS / Logit | `import statsmodels.formula.api as smf`<br>(`Q()` is built into patsy formulas, so no extra import is needed)                                                          | `formula = "y ~ " + " + ".join([f"Q('{c}')" for c in cols])`<br>`smf.ols(formula, data=df).fit()` | Columns with spaces or special characters  |
| **Automatic categorical handling (Patsy)** | OLS / Logit | `import statsmodels.formula.api as smf`                                                                                                                                | `smf.ols("y ~ C(cat1) + C(cat2)", data=df).fit()`                                | Several categorical predictors in one model |

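The programmatic-formula, `Q()` and `C()` patterns in the table can be combined. The sketch below assumes a hypothetical DataFrame whose column `tv spend` contains a space; all column names and the synthetic data are illustrative, not taken from a real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one column name contains a space, one column is categorical.
rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "tv spend": rng.uniform(0, 100, n),
    "radio": rng.uniform(0, 50, n),
    "region": rng.choice(["north", "south"], n),
})
df["sales"] = 5 + 0.3 * df["tv spend"] + 0.1 * df["radio"] + rng.normal(0, 1, n)

# Quote numeric column names with Q() and mark the categorical one with C().
numeric_cols = ["tv spend", "radio"]
terms = [f"Q('{c}')" for c in numeric_cols] + ["C(region)"]
formula = "sales ~ " + " + ".join(terms)

model = smf.ols(formula, data=df).fit()
print(formula)
print(model.params)
```

The fitted parameter names mirror the formula terms (for example `Q('tv spend')` and `C(region)[T.south]`), which is worth knowing when pulling coefficients out by name.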

## Performance Evaluation Import Statements

| Metric Type                 | Imports Needed                                                                                                                                | What It Does                                   |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| **Regression Metrics**      | `from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error`                                                               | Evaluate linear regression: R², RMSE, MAE      |
| **Classification Metrics**  | `from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report` | Evaluate logistic regression                   |
| **Model Predictions**       | —                                                                                                                                             | Predictions are obtained via `model.predict()`; for a statsmodels `Logit` this returns probabilities, not class labels |
| **Residual Analysis (OLS)** | `import matplotlib.pyplot as plt`<br>`import seaborn as sns`                                                                                  | For residual plots, QQ-plots                   |
| **VIF Check**               | `from statsmodels.stats.outliers_influence import variance_inflation_factor`                                                                  | To detect multicollinearity                    |


## Code Commands

| Category                                            | Syntax (clean, copy-friendly)                                                                                                                                                                                                                                                                                                                                                                                                          |
| --------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Standardization (Z-score)**                       | from sklearn.preprocessing import StandardScaler<br>scaler = StandardScaler()<br>X_train_scaled = scaler.fit_transform(X_train)<br>X_test_scaled = scaler.transform(X_test)                                                                                                                                                                                                                                                            |
| **Min–Max Normalization (0–1)**                     | from sklearn.preprocessing import MinMaxScaler<br>scaler = MinMaxScaler()<br>X_train_scaled = scaler.fit_transform(X_train)<br>X_test_scaled = scaler.transform(X_test)                                                                                                                                                                                                                                                                |
| **Robust Scaling**                                  | from sklearn.preprocessing import RobustScaler<br>scaler = RobustScaler()<br>X_train_scaled = scaler.fit_transform(X_train)<br>X_test_scaled = scaler.transform(X_test)                                                                                                                                                                                                                                                                |
| **One-Hot Encoding (Sklearn)**                      | from sklearn.preprocessing import OneHotEncoder<br>enc = OneHotEncoder(drop='first', handle_unknown='ignore')<br>X_train_enc = enc.fit_transform(X_train_cat)<br>X_test_enc = enc.transform(X_test_cat)                                                                                                                                                                                                                                |
| **One-Hot Encoding (Pandas)**                       | import pandas as pd<br>X_encoded = pd.get_dummies(X, drop_first=True)                                                                                                                                                                                                                                                                                                                                                                  |
| **Label Encoding**                                  | from sklearn.preprocessing import LabelEncoder<br>le = LabelEncoder()<br>y_enc = le.fit_transform(y)                                                                                                                                                                                                                                                                                                                                   |
| **MultiLabelBinarizer**                             | from sklearn.preprocessing import MultiLabelBinarizer<br>mlb = MultiLabelBinarizer()<br>Y_bin = mlb.fit_transform(Y)                                                                                                                                                                                                                                                                                                                   |
| **Train-Test Split (Regression)**                   | from sklearn.model_selection import train_test_split<br>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)                                                                                                                                                                                                                                                                                      |
| **Train-Test Split (Classification with Stratify)** | from sklearn.model_selection import train_test_split<br>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)                                                                                                                                                                                                                                                                          |
| **Linear Regression**                               | from sklearn.linear_model import LinearRegression<br>model = LinearRegression()<br>model.fit(X_train, y_train)                                                                                                                                                                                                                                                                                                                         |
| **Logistic Regression**                             | from sklearn.linear_model import LogisticRegression<br>model = LogisticRegression()<br>model.fit(X_train, y_train)                                                                                                                                                                                                                                                                                                                     |
| **Random Forest Classifier**                        | from sklearn.ensemble import RandomForestClassifier<br>model = RandomForestClassifier()<br>model.fit(X_train, y_train)                                                                                                                                                                                                                                                                                                                 |
| **SVM (probability enabled)**                       | from sklearn.svm import SVC<br>model = SVC(probability=True)<br>model.fit(X_train, y_train)                                                                                                                                                                                                                                                                                                                                            |
| **Predict (class or numeric)**                      | y_pred = model.predict(X_test)                                                                                                                                                                                                                                                                                                                                                                                                         |
| **Predict Probabilities**                           | y_prob = model.predict_proba(X_test)[:, 1]                                                                                                                                                                                                                                                                                                                                                                                             |
| **RFE (Recursive Feature Elimination)**             | from sklearn.feature_selection import RFE<br>from sklearn.linear_model import LogisticRegression<br>estimator = LogisticRegression()<br>selector = RFE(estimator, n_features_to_select=5)<br>selector.fit(X, y)<br>selected = selector.support_<br>ranking = selector.ranking_                                                                                                                                                         |
| **PCA**                                             | from sklearn.decomposition import PCA<br>pca = PCA(n_components=2)<br>X_pca = pca.fit_transform(X)<br>explained = pca.explained_variance_ratio_                                                                                                                                                                                                                                                                                        |
| **VIF (Variance Inflation Factor)**                 | import statsmodels.api as sm<br>from statsmodels.stats.outliers_influence import variance_inflation_factor<br>import pandas as pd<br>X_const = sm.add_constant(X)<br>vif = pd.DataFrame()<br>vif['feature'] = X_const.columns<br>vif['VIF'] = [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])]                                                                                                          |
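| **GridSearchCV (stratified folds)**                 | from sklearn.model_selection import GridSearchCV<br>from sklearn.linear_model import LogisticRegression<br>params = {'C': [0.1, 1, 10]}<br># an integer cv uses StratifiedKFold automatically for classifiers<br>grid = GridSearchCV(LogisticRegression(), params, cv=5, scoring='roc_auc')<br>grid.fit(X_train, y_train)<br>best_params = grid.best_params_ |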
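| **RandomizedSearchCV**                              | from sklearn.model_selection import RandomizedSearchCV<br>from sklearn.ensemble import RandomForestClassifier<br>param_dist = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}<br># n_iter must not exceed the number of parameter combinations (9 here)<br>rand = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=5, random_state=42)<br>rand.fit(X_train, y_train) |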
| **cross_val_score (Regression)**                    | from sklearn.model_selection import cross_val_score<br>from sklearn.linear_model import LinearRegression<br>scores = cross_val_score(LinearRegression(), X, y, cv=5)  # default scoring for regressors is R²<br>mean_score = scores.mean() |
| **cross_val_score (Classification, Stratified)**    | from sklearn.model_selection import cross_val_score<br>from sklearn.linear_model import LogisticRegression<br>scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring='roc_auc') |
| **KFold (Regression)**                              | from sklearn.model_selection import KFold<br>kf = KFold(n_splits=5, shuffle=True, random_state=42)<br>for train_idx, val_idx in kf.split(X):<br>    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]<br>    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]                                                                                                                                                                           |
| **StratifiedKFold (Classification)**                | from sklearn.model_selection import StratifiedKFold<br>skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)<br>for train_idx, val_idx in skf.split(X, y):<br>    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]<br>    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]                                                                                                                                                  |
| **Regression Metrics**                              | import numpy as np<br>from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error<br>r2 = r2_score(y_test, y_pred)<br># `squared=False` was removed in newer scikit-learn releases; take the square root of the MSE instead<br>rmse = np.sqrt(mean_squared_error(y_test, y_pred))<br>mae = mean_absolute_error(y_test, y_pred) |
| **Classification Metrics**                          | from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report<br>acc = accuracy_score(y_test, y_pred)<br>prec = precision_score(y_test, y_pred)<br>rec = recall_score(y_test, y_pred)<br>f1 = f1_score(y_test, y_pred)<br>auc = roc_auc_score(y_test, y_prob)<br>cm = confusion_matrix(y_test, y_pred)<br>report = classification_report(y_test, y_pred) |
| **ROC Curve**                                       | import matplotlib.pyplot as plt<br>from sklearn.metrics import roc_curve<br>fpr, tpr, thresholds = roc_curve(y_test, y_prob)<br>plt.plot(fpr, tpr)<br>plt.xlabel("False Positive Rate")<br>plt.ylabel("True Positive Rate")<br>plt.show() |

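`joblib` appears in the import list but not in the commands table, so the following is a minimal sketch of saving and reloading a fitted model. The file name `model.joblib` and the synthetic data are assumptions made for the example; a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)       # serialize the fitted estimator
    restored = joblib.load(path)   # load it back into memory

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same_predictions)
```

Saved models are generally only safe to reload with the same library versions that produced them, so it can help to record the scikit-learn version alongside the file.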

In [None]:
import statsmodels.api as sm
from sklearn.metrics import r2_score, mean_squared_error
import pandas as pd

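# Assumes `df` is an existing DataFrame of numeric predictors plus a "Sales" target column
X = df.drop(columns=["Sales"])
X = sm.add_constant(X)  # statsmodels does not add an intercept automatically
y = df["Sales"]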

model = sm.OLS(y, X).fit()
print(model.summary())

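# In-sample predictions (X already contains the constant column)
preds = model.predict(X)
print("R2:", r2_score(y, preds))
print("RMSE:", mean_squared_error(y, preds) ** 0.5)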


In [None]:
import statsmodels.formula.api as smf
from sklearn.metrics import accuracy_score, roc_auc_score

model = smf.logit("Outcome ~ Age + BMI + Cholesterol", data=df).fit()
print(model.summary())

# Logit.predict returns probabilities; threshold at 0.5 to get class labels
pred_prob = model.predict(df)
pred_class = (pred_prob >= 0.5).astype(int)

print("Accuracy:", accuracy_score(df["Outcome"], pred_class))
print("AUC:", roc_auc_score(df["Outcome"], pred_prob))

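The cell above scores the model on the same rows it was fitted on, which tends to overstate performance. Below is a sketch of the same formula evaluated on a held-out split. It reuses the column names `Outcome`, `Age`, `BMI` and `Cholesterol` from the cell, but the data itself is synthetic and the coefficients used to generate it are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real DataFrame (values are made up).
rng = np.random.default_rng(4)
n = 400
df = pd.DataFrame({
    "Age": rng.normal(50, 12, n),
    "BMI": rng.normal(27, 4, n),
    "Cholesterol": rng.normal(200, 30, n),
})
log_odds = -9 + 0.06 * df["Age"] + 0.2 * df["BMI"] + 0.003 * df["Cholesterol"]
df["Outcome"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Hold out 25% of the rows, keeping the class balance similar in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df["Outcome"]
)

# Fit on the training rows only; disp=0 silences the optimizer output.
model = smf.logit("Outcome ~ Age + BMI + Cholesterol", data=train_df).fit(disp=0)

# Score on rows the model has not seen.
test_prob = model.predict(test_df)
test_class = (test_prob >= 0.5).astype(int)
test_acc = accuracy_score(test_df["Outcome"], test_class)
test_auc = roc_auc_score(test_df["Outcome"], test_prob)
print("Test accuracy:", round(test_acc, 3))
print("Test AUC:", round(test_auc, 3))
```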

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    "sales": [10, 20, 15, 25, 30],
    "tv": [100, 200, 150, 250, 300],
    "radio": [20, 30, 25, 35, 40]
})

# Split X and y
X = df[["tv", "radio"]]
y = df["sales"]

# Train-test split (test_size=0.4 keeps two test rows; R² is undefined for a single sample)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation (RMSE is the square root of MSE; `squared=False` was removed in newer scikit-learn releases)
print("R²:", r2_score(y_test, y_pred))
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    "bought": [0, 1, 0, 1, 1],
    "age": [22, 45, 30, 35, 50],
    "income": [30, 60, 45, 50, 80]
})

# Split X and y
X = df[["age", "income"]]
y = df["bought"]

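# Train-test split (with only five rows the test set holds a single sample, so the metrics below are illustrative only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)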

# Fit logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions (class and probability)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

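The last cell computes `y_prob` but never uses it, and five rows are too few for a probability-based metric. As a rough sketch of how the predicted probabilities feed into ROC AUC, the example below swaps in a larger synthetic dataset from `make_classification`; the dataset parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (parameters chosen arbitrarily).
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# Stratified split so both classes appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class.
y_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print("AUC:", round(auc, 3))
```

AUC needs both classes present in `y_test`, which is why the split is stratified here.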