# 🚗 Lead Prediction Model Training Example

This notebook demonstrates how to train the lead prediction model using the updated `LeadsPredictionTrainer` class.

## 📋 Input Data Structure

The model expects data with the following required columns:

### Required Raw Data Columns:
- `leads` (target): Number of leads generated
- `views`: Number of ad views
- `phone_clicks`: Number of phone clicks
- `cd_vehicle_brand`: Vehicle brand code
- `year_model`: Model year
- `zip_2dig`: First 2 digits of ZIP code
- `vl_advertise`: Advertised value
- `n_photos`: Number of photos
- `km_vehicle`: Vehicle mileage
- `vl_market`: Market value
- `transmission_type`: Transmission type
- `flg_leather_seats`: Leather seats flag
- `flg_parking_sensor`: Parking sensor flag
- `city_state`: City/State (format: "City/State")
- `fuel_type`: Fuel type
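- Feature flags: `flg_electric_locks`, `flg_air_conditioning`, `flg_electric_windows`, `flg_rear_defogger`, `flg_heater`, `flg_alarm`, `flg_airbag`, `flg_abs`
- Fuel flags such as `flg_gasolina` and `flg_alcool` are not returned by `get_required_columns()`; the preprocessing log indicates that fuel flags are generated from `fuel_type`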
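
The `"City/State"` format of `city_state` can be illustrated with a small sketch. The preprocessed output later in this notebook contains separate `city` and `state` columns, but how `PreprocessingFeaturesTransformer` performs the split is not shown here, so `split_city_state` below is a hypothetical helper written in plain Python, not the project's implementation.

```python
def split_city_state(value):
    """Split a "City/State" string into a (city, state) tuple.

    Hypothetical helper for illustration only. It splits on the last "/"
    so that everything before the final slash is kept as the city name.
    """
    city, sep, state = value.rpartition("/")
    if not sep:
        # No "/" found: treat the whole value as the city, state unknown
        return value.strip(), None
    return city.strip(), state.strip()


print(split_city_state("São Paulo/SP"))
print(split_city_state("Itumbiara/GO"))
```

On a DataFrame, the same idea is usually expressed with a vectorized string split on the column rather than a row-by-row function.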

In [1]:
# Import necessary libraries
import sys  # System-specific parameters and functions
import os   # Operating system interface
import pandas as pd  # Data manipulation and analysis library
import numpy as np   # Numerical computing library
import warnings      # Warning control
warnings.filterwarnings('ignore')  # Suppress warning messages

# Add project path to system path for module imports
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), '..'))

# Import necessary classes for model training
from src.models.model_trainer import LeadsPredictionTrainer, get_required_columns, create_sample_data
from src.data.data_loader import load_raw_data  # Data loading utility
from src.features.feature_engineering import PreprocessingFeaturesTransformer, FlagClusteringTransformer  # Feature engineering classes
from sklearn.pipeline import Pipeline  # Scikit-learn pipeline for chaining transformers

## 📊 1. Check Required Columns

In [2]:
# List required columns
required_cols = get_required_columns()

print("📋 Required columns for the model:")
for i, col in enumerate(required_cols, 1):
    print(f"  {i:2d}. {col}")

print(f"\nTotal: {len(required_cols)} columns required")

📋 Required columns for the model:
   1. leads
   2. views
   3. phone_clicks
   4. cd_vehicle_brand
   5. year_model
   6. zip_2dig
   7. vl_advertise
   8. n_photos
   9. km_vehicle
  10. vl_market
  11. transmission_type
  12. flg_leather_seats
  13. flg_parking_sensor
  14. city_state
  15. flg_electric_locks
  16. flg_air_conditioning
  17. flg_electric_windows
  18. flg_rear_defogger
  19. flg_heater
  20. flg_alarm
  21. flg_airbag
  22. flg_abs
  23. fuel_type

Total: 23 columns required


## 🎯 2. Create Sample Data

In [3]:
# Optional: uncomment this cell to create sample data when no real dataset is available
# sample_data = create_sample_data()
# print("📊 Sample data structure:")
# print(f"Shape: {sample_data.shape}")
# print("\n📝 First rows:")
# display(sample_data.head())

# print("\n📈 Data info:")
# sample_data.info()  # info() prints directly and returns None

## 🔄 3. Load and Process Real Data (Optional)

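Run this cell to load real data. On success, `sample_data` is replaced by the real dataset. If loading fails, the notebook falls back to the `sample_data` created in step 2, so that variable must already be defined: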

In [4]:
# Load real data
try:
    raw_data = load_raw_data('../../data/raw/raw_data.csv')
    print(f"✅ Real data loaded: {raw_data.shape}")
    
    # Check if all required columns are present
    missing_cols = set(required_cols) - set(raw_data.columns)
    if missing_cols:
        print(f"⚠️ Missing columns: {missing_cols}")
    else:
        print("✅ All required columns are present")
        # Use real data instead of sample
        sample_data = raw_data.copy()
        print("✅ Using real data for training")
except Exception as e:
    print(f"❌ Error loading real data: {e}")
    print("📊 Proceeding with sample data")

2025-08-13 15:15:52,403 - src.data.data_loader - INFO - Successfully loaded 48665 rows from ../../data/raw/raw_data.csv


✅ Real data loaded: (48665, 41)
✅ All required columns are present
✅ Using real data for training


## 🏗️ 4. Configure Pre-processing Pipeline

In [5]:
# Features to be removed (based on analysis)
feat_to_drop = [
    "cd_type_individual", "cd_advertise", "cd_client",
    "flg_rain_sensor", "flg_diesel", "flg_eletrico", "flg_benzina",
    "flg_pcd", "flg_trade_in", "flg_armored", "flg_factory_warranty",
    "flg_all_dealership_schedule_vehicle", "flg_all_dealership_services",
    "flg_single_owner", "priority", "cd_model_vehicle", "cd_version_vehicle",
    "flg_lincese", "flg_tax_paid", "n_doors", "flg_alloy_wheels", "flg_gas_natural"
]

# Preprocessing pipeline
pipeline_pre = Pipeline([
    (
        "preprocessing",
        PreprocessingFeaturesTransformer(
            location_col="city_state",
            fuel_type_column="fuel_type",
            cols_to_drop=feat_to_drop,
            outlier_columns=["vl_advertise", "km_vehicle"],
        ),
    ),
    (
        "feat_engineering",
        FlagClusteringTransformer(
            feature_flag_cols=[
                "flg_gasolina", "flg_electric_locks", "flg_air_conditioning",
                "flg_electric_windows", "flg_rear_defogger", "flg_heater",
                "flg_alarm", "flg_airbag", "flg_abs"
            ]
        ),
    ),
])

print("✅ Preprocessing pipeline configured")

✅ Preprocessing pipeline configured


## 🔄 5. Apply Preprocessing

In [6]:
# Apply preprocessing
df_processed = pipeline_pre.fit_transform(sample_data)

print(f"📊 Data after preprocessing: {df_processed.shape}")
print("\n📋 Final columns:")
print(list(df_processed.columns))

print("\n📝 First rows of processed data:")
display(df_processed.head())

2025-08-13 15:15:55,303 - src.features.feature_engineering - INFO - 📈 Numerical features: 5
2025-08-13 15:15:55,304 - src.features.feature_engineering - INFO - 🏷️ Categorical features: 35
2025-08-13 15:15:55,305 - src.features.feature_engineering - INFO - 🎯 Target variable: ['flg_leads', 'leads']
2025-08-13 15:15:55,305 - src.features.feature_engineering - INFO - Runnig location split...
2025-08-13 15:15:56,683 - src.features.feature_engineering - INFO - Converting flag columns to integer type...
2025-08-13 15:15:56,877 - src.features.feature_engineering - INFO - Flag columns converted to integer type successfully
2025-08-13 15:15:56,879 - src.features.feature_engineering - INFO - Converting fuel type to flag columns...
2025-08-13 15:15:56,881 - src.features.feature_engineering - INFO - Found 7 unique fuel types: ['alcool', 'gasolina', 'gas', 'natural', 'diesel', 'eletrico', 'benzina']
2025-08-13 15:15:56,985 - src.features.feature_engineering - INFO - Removing duplicate columns...

📊 Data after preprocessing: (48547, 17)

📋 Final columns:
['leads', 'views', 'phone_clicks', 'cd_vehicle_brand', 'year_model', 'zip_2dig', 'vl_advertise', 'n_photos', 'km_vehicle', 'vl_market', 'transmission_type', 'flg_leather_seats', 'flg_parking_sensor', 'city', 'state', 'flg_alcool', 'flag_cluster']

📝 First rows of processed data:


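| | leads | views | phone_clicks | cd_vehicle_brand | year_model | zip_2dig | vl_advertise | n_photos | km_vehicle | vl_market | transmission_type | flg_leather_seats | flg_parking_sensor | city | state | flg_alcool | flag_cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 34 | 2018 | 75 | 110990.0 | 7 | 0 | | | 0 | 0 | Itumbiara | GO | 0 | 2 |
| 1 | 1 | 0 | 4 | 2 | 1996 | 6 | 8300.0 | 0 | 689815 | | manual | 0 | 0 | Osasco | SP | 1 | 1 |
| 2 | 4 | 0 | 11 | 30 | 2002 | 2 | 38800.0 | 4 | 33700 | | manual | 0 | 0 | São Paulo | SP | 0 | 2 |
| 3 | 12 | 0 | 14 | 10 | 1995 | 4 | 44000.0 | 8 | 105000 | | automatico | 1 | 0 | São Paulo | SP | 0 | 0 |
| 4 | 8 | 0 | 11 | 12 | 1995 | 5 | 30000.0 | 6 | 71240 | | automatico | 1 | 0 | São Paulo | SP | 0 | 2 |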


## 🤖 6. Initialize and Configure Trainer

In [8]:


# Initialize trainer
trainer = LeadsPredictionTrainer()
X, y = trainer.prepare_data(df_processed)
print(f"📊 Features: {X.shape}")
print(f"🎯 Target: {y.shape}")

X_train, X_test, y_train, y_test = trainer.split_data(X, y)

print(f"\n📈 Training data: {X_train.shape}")
print(f"📊 Test data: {X_test.shape}")

📊 Features: (48547, 16)
🎯 Target: (48547,)

📈 Training data: (38837, 16)
📊 Test data: (9710, 16)
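
The shapes above are consistent with an 80/20 holdout: 20% of 48,547 rows, rounded up, gives 9,710 test rows and leaves 38,837 for training. The sketch below reproduces that arithmetic with a shuffled index split. The `holdout_split` helper, the `test_size=0.2` ratio, and the seed are assumptions for illustration; the actual parameters used by `trainer.split_data` are not shown in this notebook.

```python
import math
import random


def holdout_split(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) for a shuffled holdout split.

    Hypothetical helper: test_size and seed are assumed values.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(48547)
print(len(train_idx), len(test_idx))  # 38837 9710
```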


## 📊 7. Calculate Baseline

In [9]:
# Calculate baseline (mean prediction)
# Note: the trainer logs this value as MSE, so it is labeled MSE here as well
baseline_score = trainer.calculate_baseline(y_train)
print(f"📊 Baseline MSE: {baseline_score:.3f}")

2025-08-13 15:16:34,943 - src.models.model_trainer - INFO - Baseline MSE: 141.3263


📊 Baseline MSE: 141.326
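
A mean-prediction baseline can be sketched in a few lines. MSE and RMSE are on different scales (RMSE is the square root of MSE), so it matters which one is being reported: an MSE of 141.3263, for instance, corresponds to an RMSE of about 11.89. The function name and the toy values below are illustrative assumptions and do not reproduce `trainer.calculate_baseline`.

```python
import math


def mean_baseline_mse(y_train):
    """MSE obtained by predicting the training mean for every row."""
    mean = sum(y_train) / len(y_train)
    return sum((y - mean) ** 2 for y in y_train) / len(y_train)


y_toy = [1, 4, 12, 8]  # toy lead counts, not taken from the real dataset
mse = mean_baseline_mse(y_toy)
rmse = math.sqrt(mse)  # RMSE is on the same scale as the target
print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```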


## 🔧 8. Hyperparameter Optimization 

In [None]:
# Hyperparameter optimization (few trials for demo)
print("🔧 Starting hyperparameter optimization (demo with few trials)...")
best_params = trainer.optimize_hyperparameters(
    X, y,
    X_train, y_train, 
)
print(f"✅ Best parameters found: {best_params}")

[I 2025-08-13 15:16:42,377] A new study created in memory with name: no-name-d1f4d776-6d02-41b9-bb79-5436fe2fda33


🔧 Starting hyperparameter optimization (demo with few trials)...


feature_fraction, val_score: 49.221971:  14%|#4        | 1/7 [00:25<02:30, 25.08s/it]
[I 2025-08-13 15:17:07,474] Trial 0 finished with value: 49.2219707395011 and parameters: {'feature_fraction': 1.0}. Best is trial 0 with value: 49.2219707395011.
feature_fraction, val_score: 48.736742:  29%|##8       | 2/7 [00:48<01:59, 23.83s/it]
[I 2025-08-13 15:17:30,425] Trial 1 finished with value: 48.736742252324476 and parameters: {'feature_fraction': 0.6}. Best is trial 1 with value: 48.736742252324476.
feature_fraction, val_score: 48.736742:  43%|####2     | 3/7 [01:16<01:42, 25.74s/it]
[I 2025-08-13 15:17:58,430] Trial 2 finished with value: 48.77783596085235 and parameters: {'feature_fraction': 0.5}. Best is trial 1 with value: 48.736742252324476.
feature_fraction, val_score: 48.736742:  57%|#####7    | 4/7 [01:39<01:14, 24.68s/it]
[I 2025-08-13 15:18:21,484] Trial 3 finished with value: 48.99739412243519 and parameters: {'feature_fraction': 0.7}. Best is trial 1 with value: 48.736742252324476

## 🎯 9. Train Final Model

In [None]:
# Train final model with best parameters
print("🎯 Training final model...")
final_model = trainer.train_final_model(X_train, y_train)
print("✅ Final model trained successfully!")

## 📊 10. Evaluate Performance

In [None]:
# Evaluate model performance
results = trainer.evaluate_model(X_test, y_test)
print(results)
print("📊 PERFORMANCE METRICS:")
print(f"  RMSE: {results['rmse']:.3f}")
print(f"  R²: {results['r2']:.3f}")
print(f"  MSE: {results['mse']:.3f}")
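
For reference, the three metrics printed above can be computed by hand as sketched below. The dictionary keys mirror the `rmse`, `r2`, and `mse` keys used in the cell above, but `regression_metrics` is a hypothetical stand-in and the toy values are not model output; how `trainer.evaluate_model` computes its results is not shown here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


metrics = regression_metrics([1, 4, 12, 8], [2, 4, 10, 9])
print(metrics)
```

Because R² compares the model against a mean predictor, a value near 0 means the model is no better than the baseline from step 7.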



## 🔮 11. Make Predictions with Categorization

In [None]:
# Make categorized predictions
predictions, categories = trainer.predict_with_categories(X_test)

# Create results DataFrame

results_df = pd.DataFrame({
    'Real_Value': y_test.values,
    'Prediction': predictions,
    'Category': categories,
    'Absolute_Error': np.abs(y_test.values - predictions)
})

print("🔮 CATEGORIZED PREDICTIONS:")
print("\nDistribution by category:")
print(results_df['Category'].value_counts().sort_index())

print("\nSample predictions:")
display(results_df.head(10))

## 💾 12. Save Model

In [None]:
# Save model
model_path = trainer.save_model("../../models/leads_prediction_model_example.joblib")
print(f"💾 Model saved to: {model_path}")

# Also save in pickle format
import pickle
pickle_path = "../../models/leads_prediction_model_example.pkl" 
with open(pickle_path, 'wb') as f:
    pickle.dump(trainer.trained_model, f)
print(f"💾 Model also saved to: {pickle_path}")
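
A quick way to confirm that a saved file is usable is to load it back and compare it with the original object. The sketch below does this with a plain dictionary as a stand-in for the trained model, written to a temporary directory; the object and file name are illustrative assumptions. Note that unpickling can execute arbitrary code, so only load pickle files from sources you trust.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object behaves the same way
toy_model = {"name": "toy_model", "params": {"feature_fraction": 0.6}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)  # only unpickle files from trusted sources

print(restored == toy_model)  # True
```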

## 🎯 13. Usage Example for New Data

In [None]:
# Example usage for new data
print("🎯 USAGE EXAMPLE FOR NEW DATA:")
print("\nTo use the trained model with new data:")

example_code = '''
import joblib
import pandas as pd

# Load the saved model (adjust the path to your working directory)
model = joblib.load("models/leads_prediction_model_example.joblib")

# Prepare new data (must have the same structure as the training data)
new_data = pd.DataFrame({
    'views': [1500],
    'phone_clicks': [45],
    'cd_vehicle_brand': [150],
    'year_model': [2020],
    'zip_2dig': [1],
    'vl_advertise': [45000],
    'n_photos': [12],
    'km_vehicle': [25000],
    'vl_market': [42000],
    'transmission_type': ['manual'],   # same category values as the training data
    'fuel_type': ['gasolina'],
    'city_state': ['São Paulo/SP'],
    # ... add all other required columns
})

# Note: the model was trained on preprocessed features, so raw rows likely
# need the same preprocessing as in step 5 first; this depends on what
# save_model stores.

# Make prediction
prediction = model.predict(new_data)
print(f"Predicted leads: {prediction[0]:.2f}")

# Get business category
def categorize_business(lead_count):
    if lead_count <= 5: return 'Small Business (0-5)'
    elif lead_count <= 15: return 'Growing Business (6-15)'
    elif lead_count <= 30: return 'Established Business (16-30)'
    else: return 'Enterprise Business (31+)'

category = categorize_business(prediction[0])
print(f"Business category: {category}")
'''

print(example_code)