# EDA, Feature Engineering and Training: Example Notebook

This notebook provides a compact, runnable example that demonstrates:

- Generating a synthetic tabular dataset
- Performing quick exploratory data analysis (EDA)
- Applying common feature engineering transformations
- Training a LightGBM model and evaluating it
- Optionally logging the run with MLflow and saving a model artifact

Note: this is a minimal, self-contained example that you can run locally.

In [None]:
# Imports
import math
import json
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import lightgbm as lgb
import joblib
import os

# Optional: MLflow logging (if available)
try:
    import mlflow
    MLFLOW_AVAILABLE = True
except ImportError:
    MLFLOW_AVAILABLE = False

print('numpy', np.__version__, 'pandas', pd.__version__, 'lightgbm', lgb.__version__)


In [None]:
# Generate a synthetic classification dataset with a few categorical and datetime-like features
X, y = make_classification(n_samples=5000, n_features=8, n_informative=4, n_redundant=1, random_state=42)
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])
df['target'] = y

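# Add a synthetic datetime (event_time) and a categorical column
# 'min' is the minute alias; the older 'T' alias is deprecated in recent pandas releases
rng = pd.date_range('2022-01-01', periods=len(df), freq='min')
df['event_time'] = rng.to_series().sample(frac=1, random_state=1).reset_index(drop=True)
# Seeded generator so the categorical column is reproducible across runs
tier_rng = np.random.default_rng(0)
df['user_tier'] = tier_rng.choice(['free', 'pro', 'enterprise'], size=len(df), p=[0.7, 0.25, 0.05])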

# Add some missing values deliberately
df.loc[df.sample(frac=0.03, random_state=2).index, 'feat_0'] = np.nan
df.loc[df.sample(frac=0.02, random_state=3).index, 'user_tier'] = None

df.head().T

## Quick EDA
We'll check shape, summary statistics, missingness, distributions, and simple correlations. These quick checks help us form hypotheses about useful features and potential data issues.

In [None]:
print('shape:', df.shape)
print('\nmissing per column:\n', df.isna().mean())
print('\nsummary stats for numeric cols:\n')
display(df.describe().T)

# Correlation matrix (numeric features); the heatmap itself is drawn in the plotting cell
corr = df.select_dtypes(include=[np.number]).corr()
print('\nTop correlations with target:')
print(corr['target'].abs().sort_values(ascending=False).head(10))

In [None]:
# Small visual checks (matplotlib/seaborn) -- inline plots in notebook
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')

plt.figure(figsize=(8,4))
sns.histplot(df['feat_0'].dropna(), kde=True)
plt.title('Distribution of feat_0 (missing values dropped)')
plt.show()

plt.figure(figsize=(10,6))
sns.heatmap(corr, annot=False, cmap='coolwarm', center=0)
plt.title('Correlation matrix (numeric)')
plt.show()

## Feature Engineering
We'll create features commonly useful in production: datetime-derived features, categorical encodings, missing indicators, interaction features, and scaling for models that need it.

In [None]:
df_fe = df.copy()
# datetime features
df_fe['event_hour'] = df_fe['event_time'].dt.hour
df_fe['event_dayofweek'] = df_fe['event_time'].dt.dayofweek
df_fe['is_weekend'] = df_fe['event_dayofweek'].isin([5,6]).astype(int)

# missing indicator
df_fe['feat_0_missing'] = df_fe['feat_0'].isna().astype(int)
# simple imputation for numeric
df_fe['feat_0'] = df_fe['feat_0'].fillna(df_fe['feat_0'].median())

# target (mean) encoding for user_tier; computed on the full dataset here, which leaks the target into the feature (out-of-fold encoding is preferred in production)
tier_means = df_fe.groupby('user_tier')['target'].mean()
df_fe['user_tier_mean_target'] = df_fe['user_tier'].map(tier_means).fillna(df_fe['target'].mean())

# interaction feature example
df_fe['feat_0_x_feat_1'] = df_fe['feat_0'] * df_fe['feat_1']

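# Numeric scaling for models that need it (tree models such as LightGBM do not)
# Exclude the binary missing indicator, which also starts with 'feat_' and should stay 0/1
# Fitting on the full frame before the split is a shortcut; fit on the training split in real use
num_cols = [c for c in df_fe.columns if c.startswith('feat_') and c != 'feat_0_missing']
scaler = StandardScaler()
df_fe[num_cols] = scaler.fit_transform(df_fe[num_cols])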

df_fe.head().T
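
The target encoding in the cell above uses every row's own label, so the encoded column leaks the target. A minimal sketch of an out-of-fold alternative follows, assuming scikit-learn's `KFold`; the helper name `oof_target_encode` and the small `demo` frame are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(frame, cat_col, target_col, n_splits=5, seed=0):
    """Return a mean-target encoding where each row is encoded without its own fold."""
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(frame):
        fit_part = frame.iloc[fit_idx]
        # Category means come only from rows outside the fold being encoded.
        means = fit_part.groupby(cat_col)[target_col].mean()
        fold_values = frame.iloc[enc_idx][cat_col].map(means)
        # Missing or unseen categories fall back to the fit rows' overall mean.
        fold_values = fold_values.fillna(fit_part[target_col].mean())
        encoded.iloc[enc_idx] = fold_values.to_numpy()
    return encoded


# Hypothetical toy data mirroring the notebook's user_tier and target columns.
demo = pd.DataFrame({
    "user_tier": ["free", "pro", "free", None, "pro", "free", "enterprise", "free"],
    "target": [0, 1, 0, 1, 1, 0, 1, 1],
})
demo["user_tier_oof"] = oof_target_encode(demo, "user_tier", "target", n_splits=4)
print(demo)
```

In the notebook this would replace the `tier_means` mapping, and it should run on the training rows only; test rows would then be encoded with means computed from the full training split.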

## Train / Test split and Model Training
For a temporal problem we would split by time; this synthetic example uses a random split. We then train LightGBM and evaluate ROC AUC and accuracy.

In [None]:
# Prepare feature matrix
features = [c for c in df_fe.columns if c not in ('target','event_time','user_tier')]
X = df_fe[features]
y = df_fe['target']

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# LightGBM dataset and params
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
params = {
    'objective': 'binary',
    'metric': ['auc','binary_logloss'],
    'learning_rate': 0.05,
    'num_leaves': 31,
    'verbose': -1
}

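# LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments of lgb.train;
# callbacks are the supported replacement.
# Early stopping on the test split keeps the example short; use a separate validation split in real use.
bst = lgb.train(
    params,
    train_data,
    num_boost_round=200,
    valid_sets=[train_data, valid_data],
    callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)],
)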

# predictions and evaluation
y_pred_proba = bst.predict(X_test, num_iteration=bst.best_iteration)
y_pred = (y_pred_proba >= 0.5).astype(int)
print('ROC AUC:', round(roc_auc_score(y_test, y_pred_proba), 4))
print('Accuracy:', round(accuracy_score(y_test, y_pred), 4))
print('\nClassification report:\n', classification_report(y_test, y_pred))
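
The cell above uses a random split. For temporal data, a time-aware split keeps every test row at or after the latest training row. The sketch below shows one way to do that with pandas alone; the helper name `time_split` and the `demo` frame are hypothetical, and the 80/20 cut simply mirrors the `test_size=0.2` used above.

```python
import numpy as np
import pandas as pd


def time_split(frame, time_col, test_size=0.2):
    """Split so that the most recent `test_size` share of rows forms the test set."""
    ordered = frame.sort_values(time_col)
    n_test = int(round(len(ordered) * test_size))
    cut = len(ordered) - n_test
    # Rows before the cut are older (train); rows from the cut onward are newer (test).
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Hypothetical toy frame with timestamps deliberately stored in reverse order.
demo = pd.DataFrame({
    "event_time": pd.date_range("2022-01-01", periods=10, freq="D")[::-1],
    "feat_0": np.arange(10.0),
})
train_part, test_part = time_split(demo, "event_time", test_size=0.2)
print(len(train_part), len(test_part))
print(train_part["event_time"].max(), "<=", test_part["event_time"].min())
```

Applied to this notebook, the call would be `time_split(df_fe, 'event_time')`, with the feature and target columns selected from each part afterwards.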

In [None]:
# Optional: log to MLflow if available
if MLFLOW_AVAILABLE:
    mlflow.set_experiment('eda_feature_training_example')
    with mlflow.start_run():
        mlflow.log_params({'model': 'lightgbm', 'num_leaves': 31, 'learning_rate': 0.05})
        mlflow.log_metric('roc_auc', float(roc_auc_score(y_test, y_pred_proba)))
        mlflow.log_metric('accuracy', float(accuracy_score(y_test, y_pred)))
        # save model artifact
        model_path = 'models/lightgbm_example.txt'
        os.makedirs('models', exist_ok=True)
        bst.save_model(model_path)
        mlflow.log_artifact(model_path)
        print('Logged run to MLflow')
else:
    print('MLflow not available in this environment; skipping MLflow logging')

## Save model locally and simple inference example
We'll persist the booster, the feature list, and the scaler in a single joblib artifact and load it back for scoring. In production you would typically use a model server or a framework such as BentoML or Seldon.

In [None]:
# Save with joblib (wrapping LightGBM booster for convenience)
model_artifact = 'models/lgb_model.joblib'
os.makedirs('models', exist_ok=True)
joblib.dump({'booster': bst, 'features': features, 'scaler': scaler}, model_artifact)
print('Saved model to', model_artifact)

# Load and run a quick prediction
loaded = joblib.load(model_artifact)
booster = loaded['booster']
# Select columns through the saved feature list so the order matches training
sample_X = X_test[loaded['features']].iloc[:5]
print('Sample predictions:', booster.predict(sample_X))

## Next steps and production notes
- Replace synthetic data with real datasets and prefer time-aware splitting for temporal data.
- Use out-of-fold target encoding and cross-validated feature calculations.
- Materialize features in a feature store such as Feast for parity between training and serving.
- Add unit tests for transformations and data quality checks with Great Expectations.
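
As an illustration of the unit-testing point, the datetime features from the feature engineering cell can be moved into a function and checked against hand-verified dates. This is only a sketch: `add_datetime_features` and `test_add_datetime_features` are hypothetical names, and a plain `assert`-based test stands in for whatever test runner you use.

```python
import pandas as pd


def add_datetime_features(frame, time_col="event_time"):
    """Return a copy of `frame` with hour, day-of-week, and weekend columns added."""
    out = frame.copy()
    out["event_hour"] = out[time_col].dt.hour
    out["event_dayofweek"] = out[time_col].dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = out["event_dayofweek"].isin([5, 6]).astype(int)
    return out


def test_add_datetime_features():
    # 2022-01-01 was a Saturday and 2022-01-03 a Monday.
    frame = pd.DataFrame(
        {"event_time": pd.to_datetime(["2022-01-01 13:30", "2022-01-03 08:00"])}
    )
    result = add_datetime_features(frame)
    assert result["event_hour"].tolist() == [13, 8]
    assert result["event_dayofweek"].tolist() == [5, 0]
    assert result["is_weekend"].tolist() == [1, 0]
    # The input frame must not be modified in place.
    assert "event_hour" not in frame.columns


test_add_datetime_features()
print("datetime feature test passed")
```

Keeping each transformation in a small pure function like this makes it straightforward to reuse the same code at training and serving time.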