# ML Engineer Workflow Demo Notebook (Regression)

This notebook demonstrates a typical ML engineering workflow for a **regression** problem:
1. Data Generation & Importation
2. Exploratory Data Analysis (EDA)
3. Feature Engineering (including encoding)
4. Model Building
5. Model Evaluation

## 1. Data Generation & Importation

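Generate synthetic regression data with `make_regression` (1,000 samples, 8 features, 5 of them informative) and add a randomly assigned categorical feature for the encoding demo.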

In [None]:
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np

# Generate synthetic regression data
X, y = make_regression(n_samples=1000, n_features=8, n_informative=5,
                       noise=10.0, random_state=42)

# Create DataFrame
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df['target'] = y

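# Add a categorical feature for the encoding demo. It is seeded for
# reproducibility and drawn independently of the target, so it carries no signal.
rng = np.random.default_rng(42)
df['category'] = rng.choice(['A', 'B', 'C'], size=df.shape[0])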
df.head()
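
As a sanity check on the synthetic data, `make_regression` can also return the coefficients it used to build the target. The sketch below reuses the settings from the cell above and adds `coef=True`; the names `X_demo`, `y_demo`, and `true_coef` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same settings as the cell above, plus coef=True to also return the
# ground-truth linear coefficients used to generate the target.
X_demo, y_demo, true_coef = make_regression(
    n_samples=1000, n_features=8, n_informative=5,
    noise=10.0, random_state=42, coef=True,
)

# Informative features have non-zero coefficients; the others are zero.
informative = np.flatnonzero(true_coef)
print("True coefficients:", np.round(true_coef, 2))
print("Informative feature indices:", informative)
```

Comparing these values with the fitted model's coefficients later on is one way to confirm that the pipeline recovers the known signal.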

## 2. Exploratory Data Analysis (EDA)

Inspect summary statistics for the numeric columns and count the rows in each category.

In [None]:
# Summary and category counts
print(df.describe())
print(df['category'].value_counts())
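
Beyond summary statistics, it often helps to see how each feature relates to the target. The following is a minimal sketch, assuming a DataFrame built the same way as in Section 1; `eda_df` and `target_corr` are illustrative names, and the data is rebuilt here only so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Rebuild a DataFrame like the one in Section 1 so this cell is self-contained.
X_demo, y_demo = make_regression(n_samples=1000, n_features=8, n_informative=5,
                                 noise=10.0, random_state=42)
eda_df = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(X_demo.shape[1])])
eda_df["target"] = y_demo
eda_df["category"] = np.random.default_rng(42).choice(["A", "B", "C"], size=len(eda_df))

# Missing-value check: synthetic data should contain none.
print(eda_df.isna().sum())

# Pearson correlation of each numeric feature with the target,
# ordered by absolute strength.
corr = eda_df.drop(columns="category").corr()["target"].drop("target")
target_corr = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(target_corr)

# Target mean and spread per category; similar values across groups suggest
# the category carries little information about the target.
print(eda_df.groupby("category")["target"].agg(["count", "mean", "std"]))
```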

## 3. Feature Engineering

### 3.1 Interaction and Scaling

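- Interaction term: the product of `feature_0` and `feature_1`
- Standard scaling: rescale numeric features to zero mean and unit variance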

In [None]:
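from sklearn.preprocessing import StandardScaler

# Interaction term: element-wise product of two features
df['feat0_x_feat1'] = df['feature_0'] * df['feature_1']

# Scale numeric features to zero mean and unit variance.
# Caveat: fitting the scaler on all rows before the train-test split lets
# test-set statistics leak into training. That is tolerable in a demo, but in
# practice the scaler should be fitted on the training split only.
scaler = StandardScaler()
num_features = [col for col in df.columns
                if col.startswith('feature_') or col == 'feat0_x_feat1']
df[num_features] = scaler.fit_transform(df[num_features])
df.head()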
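
Scaling the full dataset before splitting lets information from the test rows influence the training features. A minimal sketch of a leakage-free alternative, assuming the same synthetic data as in Section 1, is to split first and fit the scaler on the training rows only. The names `X_tr`, `X_te`, and `scaler_demo` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=1000, n_features=8, n_informative=5,
                                 noise=10.0, random_state=42)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training statistics

# Training columns are centred at zero; test columns are only approximately so.
print("Train means:", np.round(X_tr_scaled.mean(axis=0), 3))
print("Test means: ", np.round(X_te_scaled.mean(axis=0), 3))
```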

### 3.2 Encoding Categorical Features

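- **Label Encoding**: Map each category to an integer code. This implies an ordering, so it suits ordinal data or tree-based models better than linear ones.
- **One-Hot Encoding**: Create one binary column per category, with no implied ordering.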

In [None]:
from sklearn.preprocessing import LabelEncoder

# Label Encoding: A, B, C -> 0, 1, 2 (the integer order is arbitrary for nominal data)
le = LabelEncoder()
df['category_label'] = le.fit_transform(df['category'])

# One-Hot Encoding: one binary column per category (cat_A, cat_B, cat_C)
df_ohe = pd.get_dummies(df['category'], prefix='cat')
df = pd.concat([df, df_ohe], axis=1)

# Drop the original string column now that it is encoded
df = df.drop(columns='category')
df.head()
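
For linear models, keeping a binary column for every category makes the dummy columns sum to one, which is collinear with the intercept. One common remedy is to drop a reference category. The sketch below uses scikit-learn's `OneHotEncoder` with `drop="first"` on a small hypothetical frame (`toy`), not on the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, used only to illustrate the encoding.
toy = pd.DataFrame({"category": ["A", "B", "C", "A"]})

# drop="first" removes the first category (A), which becomes the baseline.
enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(toy[["category"]]).toarray()

# Name the remaining columns after the categories that were kept.
kept = enc.categories_[0][1:]
encoded_df = pd.DataFrame(encoded, columns=[f"cat_{c}" for c in kept])
print(encoded_df)
```

A row with zeros in both `cat_B` and `cat_C` belongs to the baseline category `A`. Unlike `pd.get_dummies`, a fitted encoder can be reused on new data with the same column layout.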

## 4. Model Building

Split the data 80/20 into training and test sets, then fit a `LinearRegression` model on the training portion.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

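# Features and target. category_label is dropped because it duplicates the
# one-hot columns and would impose an artificial A < B < C ordering.
X = df.drop(columns=['target', 'category_label'])
y = df['target']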

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)
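
A single train-test split gives one estimate of performance. As a rough sketch, assuming the same synthetic data as in Section 1, k-fold cross-validation with a `Pipeline` gives several estimates and refits the scaler inside each fold. The name `pipe` and the choice of 5 folds are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=1000, n_features=8, n_informative=5,
                                 noise=10.0, random_state=42)

# The pipeline fits the scaler on each training fold only, then the model.
pipe = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validated R²: one score per held-out fold.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print("R² per fold:", np.round(scores, 3))
print("Mean R²:", round(float(scores.mean()), 3))
```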

## 5. Model Evaluation

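Evaluate on the held-out test set using mean squared error (MSE) and R², and plot predicted against actual values. Points close to the dashed diagonal indicate accurate predictions.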

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Predictions
y_pred = model.predict(X_test)

# Metrics
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))

# Plot
plt.figure(figsize=(6,6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()
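
MSE is expressed in squared target units, which can be hard to interpret. The sketch below shows two complementary metrics, RMSE and MAE, together with the residuals, on a small set of hypothetical values (`y_true_demo`, `y_pred_demo`) rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical values, used only to illustrate the metrics.
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)  # same units as the target
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# Residuals: positive values mean the model under-predicted.
residuals = y_true_demo - y_pred_demo

print("MSE:", mse)    # 0.375
print("RMSE:", rmse)
print("MAE:", mae)    # 0.5
print("Residuals:", residuals)
```

Applied to the notebook's `y_test` and `y_pred`, the same calls would report errors in the units of `target`, and a residual mean far from zero would point to systematic bias.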