# Step 1 : Checking the Environment & my imports

In [1]:
# üì¶ Environment & Library Check
import sys, platform
import pandas as pd
import numpy as np
import sklearn
import matplotlib
import seaborn as sns
import joblib
import flask

print("python:", sys.version.splitlines()[0])
print("platform:", platform.platform())
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("sklearn:", sklearn.__version__)
print("matplotlib:", matplotlib.__version__)
print("seaborn:", sns.__version__)
print("joblib:", joblib.__version__)
print("flask:", flask.__version__)


python: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
platform: Windows-10-10.0.18362-SP0
pandas: 2.3.3
numpy: 2.3.4
sklearn: 1.7.2
matplotlib: 3.10.7
seaborn: 0.13.2
joblib: 1.5.2
flask: 3.1.2


  print("flask:", flask.__version__)


# Step 2: Loading and Exploring the Dataset


In [2]:
import pandas as pd

data = pd.read_csv("campaign_data.csv")


# Here im just checking for basic info
print("‚úÖ Dataset Loaded Successfully")
print("Shape (rows, columns):\n", data.shape)
print("\nüìã Column Names:\n", list(data.columns))
print("\nüìä Data Types:\n", data.dtypes)

print("\n‚ùå Missing Values per Column:\n\n", data.isnull().sum())


data.head(5)


‚úÖ Dataset Loaded Successfully
Shape (rows, columns):
 (200000, 16)

üìã Column Names:
 ['Campaign_ID', 'Company', 'Campaign_Type', 'Target_Audience', 'Duration', 'Channel_Used', 'Conversion_Rate', 'Acquisition_Cost', 'ROI', 'Location', 'Language', 'Clicks', 'Impressions', 'Engagement_Score', 'Customer_Segment', 'Date']

üìä Data Types:
 Campaign_ID           int64
Company              object
Campaign_Type        object
Target_Audience      object
Duration             object
Channel_Used         object
Conversion_Rate     float64
Acquisition_Cost     object
ROI                 float64
Location             object
Language             object
Clicks              float64
Impressions         float64
Engagement_Score    float64
Customer_Segment     object
Date                 object
dtype: object

‚ùå Missing Values per Column:

 Campaign_ID            0
Company                0
Campaign_Type          0
Target_Audience       35
Duration               0
Channel_Used           0
Conversion_

Unnamed: 0,Campaign_ID,Company,Campaign_Type,Target_Audience,Duration,Channel_Used,Conversion_Rate,Acquisition_Cost,ROI,Location,Language,Clicks,Impressions,Engagement_Score,Customer_Segment,Date
0,1,Innovate Industries,Email,Men 18-24,30 days,Google Ads,0.04,"$16,174.00",6.29,Chicago,Spanish,506.0,1922.0,,Health & Wellness,1/1/2021
1,2,NexGen Systems,Email,Women 35-44,60 days,Google Ads,0.12,"$11,566.00",5.61,New York,German,116.0,7523.0,,Fashionistas,1/2/2021
2,3,Alpha Innovations,Influencer,Men 25-34,30 days,YouTube,0.07,"$10,200.00",7.18,Los Angeles,French,584.0,7698.0,1.0,Outdoor Adventurers,1/3/2021
3,4,DataTech Solutions,Display,All Ages,60 days,YouTube,0.11,"$12,724.00",5.55,Miami,Mandarin,217.0,1820.0,7.0,Health & Wellness,1/4/2021
4,5,NexGen Systems,Email,,15 days,YouTube,0.05,"$16,452.00",6.5,Los Angeles,Mandarin,379.0,4201.0,3.0,Health & Wellness,1/5/2021


# Step 3: Clean missing values and fix types


In [3]:
# Step 3: Clean missing values and fix types

# Here im going to fill missing values with column mean
num_cols = ['Conversion_Rate', 'ROI', 'Clicks', 'Impressions', 'Engagement_Score']
data[num_cols] = data[num_cols].apply(lambda x: x.fillna(x.mean()))

# For Categorical columns im going to fill missing values with mode
cat_cols = ['Target_Audience', 'Customer_Segment']
for col in cat_cols:
    data[col] = data[col].fillna(data[col].mode()[0])

#  Fixing numeric column stored as object
data['Acquisition_Cost'] = (
    data['Acquisition_Cost']
    .replace('[\$,]', '', regex=True)
    .astype(float)
)

# 4Ô∏è‚É£ Confirm cleaning
print("‚úÖ Missing Values After Cleaning:\n", data.isnull().sum().sum(), "remaining total missing values")
print("\nüìä Sample Check:\n", data.head(3))


  .replace('[\$,]', '', regex=True)


‚úÖ Missing Values After Cleaning:
 0 remaining total missing values

üìä Sample Check:
    Campaign_ID              Company Campaign_Type Target_Audience Duration  \
0            1  Innovate Industries         Email       Men 18-24  30 days   
1            2       NexGen Systems         Email     Women 35-44  60 days   
2            3    Alpha Innovations    Influencer       Men 25-34  30 days   

  Channel_Used  Conversion_Rate  Acquisition_Cost   ROI     Location Language  \
0   Google Ads             0.04           16174.0  6.29      Chicago  Spanish   
1   Google Ads             0.12           11566.0  5.61     New York   German   
2      YouTube             0.07           10200.0  7.18  Los Angeles   French   

   Clicks  Impressions  Engagement_Score     Customer_Segment      Date  
0   506.0       1922.0          5.494582    Health & Wellness  1/1/2021  
1   116.0       7523.0          5.494582         Fashionistas  1/2/2021  
2   584.0       7698.0          1.000000  Outdoor 

# Step 4: Encode categorical variables


In [4]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack

# === 1Ô∏è‚É£ Prepare target and selected features ===
target_col = "ROI"

# Categorical features to encode
CATEGORICAL_COLS = ['Campaign_Type', 'Target_Audience', 'Channel_Used', 'Location', 'Customer_Segment']

# Numeric features that actually influence ROI
NUMERIC_COLS = ['Duration', 'Acquisition_Cost', 'Clicks', 'Impressions', 'Conversion_Rate', 'Engagement_Score']

# Subset X and target y
X = data[CATEGORICAL_COLS + NUMERIC_COLS]
y = data[target_col]

# --- Convert Duration to numeric ---
X['Duration'] = X['Duration'].str.replace(' days', '').astype(int)

# --- Optional derived features ---
X['CTR'] = X['Clicks'] / X['Impressions']          # Click-through rate
X['ROI_per_Cost'] = y / X['Acquisition_Cost']      # ROI per unit cost

# === 2Ô∏è‚É£ Encode categorical features efficiently ===
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=True)
X_encoded_sparse = encoder.fit_transform(X[CATEGORICAL_COLS])

# === 3Ô∏è‚É£ Combine numeric + encoded categorical features ===
X_numeric = X[NUMERIC_COLS + ['CTR', 'ROI_per_Cost']].astype(float)
X_combined = hstack([X_numeric, X_encoded_sparse])

print("‚úÖ Minimal Encoding Done.")
print("Encoded matrix shape:", X_combined.shape)

# === 4Ô∏è‚É£ Preview encoded features (first 3 rows) ===
encoded_sample = pd.DataFrame(
    X_encoded_sparse[:3].toarray(),
    columns=encoder.get_feature_names_out(CATEGORICAL_COLS)
)
print(encoded_sample.head(3))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['Duration'] = X['Duration'].str.replace(' days', '').astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['CTR'] = X['Clicks'] / X['Impressions']          # Click-through rate
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['ROI_per_Cost'] = y / X['Acquisition_Cost']      # ROI per unit 

‚úÖ Minimal Encoding Done.
Encoded matrix shape: (200000, 34)
   Campaign_Type_Display  Campaign_Type_Email  Campaign_Type_Influencer  \
0                    0.0                  1.0                       0.0   
1                    0.0                  1.0                       0.0   
2                    0.0                  0.0                       1.0   

   Campaign_Type_Search  Campaign_Type_Social Media  Target_Audience_All Ages  \
0                   0.0                         0.0                       0.0   
1                   0.0                         0.0                       0.0   
2                   0.0                         0.0                       0.0   

   Target_Audience_Men 18-24  Target_Audience_Men 25-34  \
0                        1.0                        0.0   
1                        0.0                        0.0   
2                        0.0                        1.0   

   Target_Audience_Women 25-34  Target_Audience_Women 35-44  ...  \
0      

# Step 5: Split, Scale & Train the Model.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import joblib

# === 1Ô∏è‚É£ Split into training & testing ===
X_train, X_test, y_train, y_test = train_test_split(
    X_combined, y, test_size=0.2, random_state=42
)
print("‚úÖ Data Split Complete")
print("Train shape:", X_train.shape, " Test shape:", X_test.shape)

# === 2Ô∏è‚É£ Scale numeric + sparse matrix safely ===
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# === 3Ô∏è‚É£ Train Linear Regression Model ===
model = LinearRegression()
model.fit(X_train_scaled, y_train)
print("‚úÖ Model Training Complete")

# === 4Ô∏è‚É£ Evaluate Performance ===
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("üìä Evaluation Results:")
print(f"MSE: {mse:.4f}")
print(f"R¬≤: {r2:.4f}")

# === 5Ô∏è‚É£ Save Trained Components ===
joblib.dump(model, "roi_model.pkl")
joblib.dump(scaler, "scaler.pkl")
joblib.dump(encoder, "encoder.pkl")
print("üíæ Model, Scaler, and Encoder Saved Successfully")


‚úÖ Data Split Complete
Train shape: (160000, 34)  Test shape: (40000, 34)
‚úÖ Model Training Complete
üìä Evaluation Results:
MSE: 0.7647
R¬≤: 0.7459
üíæ Model, Scaler, and Encoder Saved Successfully


##### The linear model is already capturing strong linear relationships between features (like Clicks, Impressions, Acquisition Cost, CTR, etc.) and ROI.
##### This suggests your data is well-structured, and feature engineering (like CTR and ROI_per_Cost) made a real impact.

Benchmark

| Step | Model Type  |   MSE   |   R¬≤    | Comment    |
|:----:|:-------------|:--------|:--------|:------------|
| 5 | Linear | 0.7647 | 0.7459 | Baseline |
| 6 | Ridge | ? | ? | Stable |
| 7 | Lasso | ? | ? | Stable |
| 8 | Polynomial | ? | ? | Nonlinear |
| 9 | Tree | ? | ? | Nonlinear |
| 10 | Selected | ? | ? | Pruned |


# Step 6: Linear Optimization ‚Äî Ridge & Lasso Regression

In [8]:
# üß© Step 6: Optimize Linear Model using Ridge and Lasso Regression
from sklearn.linear_model import Ridge, Lasso

# === Ridge Regression ===
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
ridge_pred = ridge.predict(X_test_scaled)
ridge_mse = mean_squared_error(y_test, ridge_pred)
ridge_r2 = r2_score(y_test, ridge_pred)
print(f"üîπ Ridge -> MSE: {ridge_mse:.4f}, R¬≤: {ridge_r2:.4f}")

# === Lasso Regression ===
lasso = Lasso(alpha=0.001, max_iter=10000)
lasso.fit(X_train_scaled, y_train)
lasso_pred = lasso.predict(X_test_scaled)
lasso_mse = mean_squared_error(y_test, lasso_pred)
lasso_r2 = r2_score(y_test, lasso_pred)
print(f"üîπ Lasso -> MSE: {lasso_mse:.4f}, R¬≤: {lasso_r2:.4f}")

# === Compare with baseline ===
print("\nüìà Comparison Summary:")
print(f"Linear: MSE={mse:.4f}, R¬≤={r2:.4f}")
print(f"Ridge : MSE={ridge_mse:.4f}, R¬≤={ridge_r2:.4f}")
print(f"Lasso : MSE={lasso_mse:.4f}, R¬≤={lasso_r2:.4f}")


üîπ Ridge -> MSE: 0.7647, R¬≤: 0.7459
üîπ Lasso -> MSE: 0.7646, R¬≤: 0.7459

üìà Comparison Summary:
Linear: MSE=0.7647, R¬≤=0.7459
Ridge : MSE=0.7647, R¬≤=0.7459
Lasso : MSE=0.7646, R¬≤=0.7459


# Step 7: Hyperparameter Optimization (Grid Search for Ridge/Lasso)

In [9]:
# üß© Step 7: Hyperparameter Optimization (Ridge & Lasso)
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error, r2_score

print("üîç Optimizing Ridge and Lasso with Cross-Validation...")

# === RidgeCV ===
ridge_alphas = [0.001, 0.01, 0.1, 1, 10, 50, 100]
ridge_cv = RidgeCV(alphas=ridge_alphas, cv=5)  # removed deprecated param
ridge_cv.fit(X_train_scaled, y_train)

best_ridge_alpha = ridge_cv.alpha_
ridge_cv_pred = ridge_cv.predict(X_test_scaled)
ridge_cv_mse = mean_squared_error(y_test, ridge_cv_pred)
ridge_cv_r2 = r2_score(y_test, ridge_cv_pred)
print(f"‚úÖ RidgeCV Best Alpha: {best_ridge_alpha}")
print(f"üîπ RidgeCV -> MSE: {ridge_cv_mse:.4f}, R¬≤: {ridge_cv_r2:.4f}")

# === LassoCV ===
lasso_cv = LassoCV(alphas=[0.0001, 0.001, 0.01, 0.1, 1], max_iter=10000, cv=5)
lasso_cv.fit(X_train_scaled, y_train)

best_lasso_alpha = lasso_cv.alpha_
lasso_cv_pred = lasso_cv.predict(X_test_scaled)
lasso_cv_mse = mean_squared_error(y_test, lasso_cv_pred)
lasso_cv_r2 = r2_score(y_test, lasso_cv_pred)
print(f"‚úÖ LassoCV Best Alpha: {best_lasso_alpha}")
print(f"üîπ LassoCV -> MSE: {lasso_cv_mse:.4f}, R¬≤: {lasso_cv_r2:.4f}")

# === Comparison Summary ===
print("\nüìä Optimized Model Comparison:")
print(f"Linear : MSE={mse:.4f}, R¬≤={r2:.4f}")
print(f"RidgeCV: MSE={ridge_cv_mse:.4f}, R¬≤={ridge_cv_r2:.4f} (Œ±={best_ridge_alpha})")
print(f"LassoCV: MSE={lasso_cv_mse:.4f}, R¬≤={lasso_cv_r2:.4f} (Œ±={best_lasso_alpha})")


üîç Optimizing Ridge and Lasso with Cross-Validation...
‚úÖ RidgeCV Best Alpha: 1.0
üîπ RidgeCV -> MSE: 0.7647, R¬≤: 0.7459
‚úÖ LassoCV Best Alpha: 0.001
üîπ LassoCV -> MSE: 0.7646, R¬≤: 0.7459

üìä Optimized Model Comparison:
Linear : MSE=0.7647, R¬≤=0.7459
RidgeCV: MSE=0.7647, R¬≤=0.7459 (Œ±=1.0)
LassoCV: MSE=0.7646, R¬≤=0.7459 (Œ±=0.001)


Benchmark

| Step | Model Type  |   MSE   |   R¬≤    | Comment    |
|:----:|:-------------|:--------|:--------|:------------|
| 5 | Linear | 0.7647 | 0.7459 | Baseline |
| 6 | Ridge | 0.7647 | 0.7459 | Stable |
| 7 | Lasso | 0.7646 | 0.7459 | Stable |
| 8 | Polynomial | ? | ? | Nonlinear |
| 9 | Tree | ? | ? | Nonlinear |
| 10 | Selected | ? | ? | Pruned |


# Step 8: Polynomial Features (Nonlinear Boost for Linear Models)

In [10]:
# üß© Step 8C: Controlled Polynomial Ridge Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

print("‚öôÔ∏è Generating Controlled Polynomial Features (degree=2, high alpha)...")

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

ridge_poly = Ridge(alpha=100.0)  # much stronger regularization
ridge_poly.fit(X_train_poly, y_train)

y_pred_poly_ridge = ridge_poly.predict(X_test_poly)
ridge_poly_mse = mean_squared_error(y_test, y_pred_poly_ridge)
ridge_poly_r2 = r2_score(y_test, y_pred_poly_ridge)

print("\nüìä Controlled Ridge-Polynomial Regression Results:")
print(f"MSE: {ridge_poly_mse:.4f}")
print(f"R¬≤: {ridge_poly_r2:.4f}")

print("\nüìà Comparison vs Previous Models:")
print(f"Linear : MSE={mse:.4f}, R¬≤={r2:.4f}")
print(f"Ridge-Poly: MSE={ridge_poly_mse:.4f}, R¬≤={ridge_poly_r2:.4f}")


‚öôÔ∏è Generating Controlled Polynomial Features (degree=2, high alpha)...

üìä Controlled Ridge-Polynomial Regression Results:
MSE: 0.0000
R¬≤: 1.0000

üìà Comparison vs Previous Models:
Linear : MSE=0.7647, R¬≤=0.7459
Ridge-Poly: MSE=0.0000, R¬≤=1.0000


##### ‚ö†Ô∏è l detected Overfitting : Ridge-Polynomial regression metrics are unrealistically perfect.
#####  For benchmarks, refer to Linear/Ridge/Lasso models instead.

Benchmark

| Step | Model Type  |   MSE   |   R¬≤    | Comment    |
|:----:|:-------------|:--------|:--------|:------------|
| 5 | Linear | 0.7647 | 0.7459 | Baseline |
| 6 | Ridge | 0.7647 | 0.7459 | Stable |
| 7 | Lasso | 0.7646 | 0.7459 | Stable |
| 8 | Polynomial | 0 | 1 | Nonlinear( Its Overfitting skip this one) |
| 9 | Tree | ? | ? | Nonlinear |
| 10 | Selected | ? | ? | Pruned |


# Step 9: Tree-Based Models (Random Forest & Gradient Boosting)


In [11]:
# üß© Stage 9: Tree-based Models (Safe Parameters)
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib

# --- Random Forest (moderate depth, fewer trees) ---
rf_model = RandomForestRegressor(
    n_estimators=50,       # fewer trees
    max_depth=15,          # limit depth
    min_samples_leaf=5,    # reduce overfitting
    max_features='sqrt',   # speed + decorrelate features
    n_jobs=-1,
    random_state=42
)
rf_model.fit(X_train_scaled, y_train)
y_pred_rf = rf_model.predict(X_test_scaled)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("üå≤ Random Forest Results (safe):")
print(f"MSE: {mse_rf:.4f}")
print(f"R¬≤: {r2_rf:.4f}")

# --- Gradient Boosting (smaller trees + learning rate) ---
gb_model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,       # stochastic gradient boosting
    random_state=42
)
gb_model.fit(X_train_scaled, y_train)
y_pred_gb = gb_model.predict(X_test_scaled)
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

print("üî• Gradient Boosting Results (safe):")
print(f"MSE: {mse_gb:.4f}")
print(f"R¬≤: {r2_gb:.4f}")

# --- Save models ---
joblib.dump(rf_model, "rf_model_safe.pkl")
joblib.dump(gb_model, "gb_model_safe.pkl")

print("üíæ Tree-based models saved (safe version).")


üå≤ Random Forest Results (safe):
MSE: 0.5946
R¬≤: 0.8024
üî• Gradient Boosting Results (safe):
MSE: 0.0503
R¬≤: 0.9833
üíæ Tree-based models saved (safe version).


### Model Benchmark Summary

| Step | Model Type  |   MSE   |   R¬≤    | Comment                                      |
|:----:|:-----------|:--------|:--------|:--------------------------------------------|
| 5    | Linear     | 0.7647  | 0.7459 | Baseline                                    |
| 6    | Ridge      | 0.7647  | 0.7459 | Stable                                      |
| 7    | Lasso      | 0.7646  | 0.7459 | Stable                                      |
| 8    | Polynomial | 0       | 1      | Overfitting, skip                            |
| 9    | Tree       | 0.0503  | 0.9833 | Overfitting (unsafe, too high R¬≤)           |
| 10   | Selected   | 0.5946  | 0.8024 | Pruned / safe Random Forest, reasonable generalization |


# üìù Next Steps: Deploying the Model

‚úÖ Based on the most recent training, we are choosing the **minimal-feature Random Forest model (`rf_model_safe.pkl`)** for deployment.  
- This model balances predictive performance with generalization (MSE‚âà0.5946, R¬≤‚âà0.8024).  
- Gradient Boosting, although highly accurate on training data (MSE‚âà0.0503, R¬≤‚âà0.9833), shows signs of overfitting and will **not** be used in production.

‚úÖ Supporting files for deployment:  
- `encoder_minimal.pkl` ‚Üí Encodes categorical inputs consistently with training.  
- `scaler_minimal.pkl` ‚Üí Scales numeric inputs consistently with training.

‚úÖ Selected features for deployment:  
- **Categorical:** `Campaign_Type`, `Target_Audience`, `Channel_Used`, `Location`, `Customer_Segment`  
- **Numeric:** `Duration`, `Acquisition_Cost`

‚úÖ Objective for the next stage:  
- i will create a **frontend UI (`app.py`)** that allows users to input new campaign details.  
- Apply preprocessing (encoding + scaling) on these inputs and generate **ROI predictions** using the selected Random Forest model.  
- Ensure predictions are **reliable, reproducible, and safe for production use**.
