# Model Testing Notebook

This notebook demonstrates how to use the trained Random Forest model to make predictions on new data.

**What you'll learn:**
- Load the saved model
- Make predictions for individual startups
- Test with custom scenarios
- Interpret results

## Step 1: Import Libraries

In [1]:
import pickle
import pandas as pd
import numpy as np
from pathlib import Path

print("Libraries imported successfully")

Libraries imported successfully


## Step 2: Load the Trained Model

In [None]:
# Load model and features
with open('../models/best_regressor.pkl', 'rb') as f:
    model = pickle.load(f)

with open('../models/regression_features.pkl', 'rb') as f:
    features = pickle.load(f)

print("[SUCCESS] Model loaded successfully!")
print(f"\n[INFO] Model type: {type(model).__name__}")
print(f"[INFO] Features required: {features}")
print(f"[INFO] Number of features: {len(features)}")

Model loaded successfully!

Model type: RandomForestRegressor
Features required: ['Year', 'Month', 'Quarter', 'Stage_Order', 'Investor_Count', 'City_Category_Encoded', 'Industry_Category_Encoded', 'Has_Multiple_Investors']
Number of features: 8
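Before predicting, it can help to confirm that the loaded feature list agrees with what the estimator was fitted on. The sketch below assumes a scikit-learn estimator, which exposes `n_features_in_` and, when fitted on a DataFrame with string column names, `feature_names_in_`. The helper name `check_feature_list` is hypothetical, and the `SimpleNamespace` stand-in exists only so the example runs without the pickle files.

```python
from types import SimpleNamespace


def check_feature_list(model, features):
    """Return a list of problems found; an empty list means the lists agree."""
    problems = []
    expected_count = getattr(model, "n_features_in_", None)
    if expected_count is not None and expected_count != len(features):
        problems.append(
            f"model expects {expected_count} features, list has {len(features)}"
        )
    fitted_names = getattr(model, "feature_names_in_", None)
    if fitted_names is not None and list(fitted_names) != list(features):
        problems.append("feature names or their order differ from the fitted model")
    return problems


# Stand-in for the unpickled model (assumption: the real object is a fitted
# scikit-learn estimator that carries these two attributes).
features = ['Year', 'Month', 'Quarter', 'Stage_Order', 'Investor_Count',
            'City_Category_Encoded', 'Industry_Category_Encoded',
            'Has_Multiple_Investors']
fake_model = SimpleNamespace(n_features_in_=8, feature_names_in_=features)

print(check_feature_list(fake_model, features))       # []
print(check_feature_list(fake_model, features[:-1]))  # two problems reported
```

In the notebook itself, `check_feature_list(model, features)` would be called on the objects loaded above.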


## Step 3: Feature Encoding Reference

Use these encodings when creating test data:

In [None]:
# Stage Order Encoding
stage_encoding = {
    0: "Angel/Grant",
    1: "Corporate Round",
    2: "Seed",
    3: "Debt Funding",
    4: "Pre-Series A",
    5: "Series A",
    6: "Series B",
    7: "Series C",
    8: "Series D+",
    9: "Private Equity",
    10: "Undisclosed"
}

# City Category Encoding
city_encoding = {
    0: "Metro (Bengaluru, Mumbai, Delhi, Gurugram, Pune, Hyderabad)",
    1: "Other cities",
    2: "Tier-2 (Ahmedabad, Chandigarh, Jaipur, Kochi)",
    3: "Unknown"
}

# Industry Category Encoding
industry_encoding = {
    0: "Consumer",
    1: "E-commerce",
    2: "Education",
    3: "Fintech",
    4: "Healthcare",
    5: "Logistics",
    6: "Media",
    7: "Other",
    8: "Real Estate",
    9: "Technology"
}

print("[INFO] Encoding Reference Loaded")

Encoding Reference Loaded
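The dictionaries above map codes to labels, but test data is usually written the other way round, from a label to its code. A minimal sketch of a reverse lookup follows; `invert` and `encode` are hypothetical helper names, and the stage and industry tables are copied from the cell above. The city table is left out because its labels are long descriptions rather than short names.

```python
stage_encoding = {
    0: "Angel/Grant", 1: "Corporate Round", 2: "Seed", 3: "Debt Funding",
    4: "Pre-Series A", 5: "Series A", 6: "Series B", 7: "Series C",
    8: "Series D+", 9: "Private Equity", 10: "Undisclosed",
}
industry_encoding = {
    0: "Consumer", 1: "E-commerce", 2: "Education", 3: "Fintech",
    4: "Healthcare", 5: "Logistics", 6: "Media", 7: "Other",
    8: "Real Estate", 9: "Technology",
}


def invert(encoding):
    """Build a case-insensitive label -> code lookup from a code -> label dict."""
    return {label.lower(): code for code, label in encoding.items()}


STAGE_CODES = invert(stage_encoding)
INDUSTRY_CODES = invert(industry_encoding)


def encode(label, codes):
    """Return the integer code for a label, or raise ValueError if unknown."""
    try:
        return codes[label.strip().lower()]
    except KeyError:
        raise ValueError(
            f"Unknown label {label!r}; expected one of {sorted(codes)}"
        ) from None


print(encode("Seed", STAGE_CODES))        # 2
print(encode("Fintech", INDUSTRY_CODES))  # 3
```

Failing loudly on an unknown label is deliberate: a silently wrong code would still produce a prediction, just for the wrong category.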


## Step 4: Test with Example Scenarios

In [None]:
# Example 1: Early-stage tech startup in Bengaluru
example_1 = {
    'Year': 2020,
    'Month': 6,
    'Quarter': 2,
    'Stage_Order': 2,  # Seed
    'Investor_Count': 1,
    'City_Category_Encoded': 0,  # Metro
    'Industry_Category_Encoded': 9,  # Technology
    'Has_Multiple_Investors': 0
}

# Create DataFrame
df_test = pd.DataFrame([example_1])

# Make prediction (log scale)
prediction_log = model.predict(df_test[features])[0]

# Convert to actual amount
prediction_amount = np.exp(prediction_log)

print("="*70)
print("[TEST CASE 1] Seed Stage Tech Startup in Bengaluru")
print("="*70)
print("\nInput Features:")
for key, val in example_1.items():
    print(f"  {key}: {val}")

print("\n[PREDICTION] Funding Amount:")
print(f"  Log Scale: {prediction_log:.2f}")
print(f"  Actual Amount: Rs.{prediction_amount:,.0f} INR")
print(f"  In Lakhs: Rs.{prediction_amount/100000:.2f} L")
print(f"  In Crores: Rs.{prediction_amount/10000000:.2f} Cr")

TEST CASE 1: Seed Stage Tech Startup in Bengaluru

Input Features:
  Year: 2020
  Month: 6
  Quarter: 2
  Stage_Order: 2
  Investor_Count: 1
  City_Category_Encoded: 0
  Industry_Category_Encoded: 9
  Has_Multiple_Investors: 0

PREDICTED FUNDING AMOUNT:
  Log Scale: 15.25
  Actual Amount: ₹4,186,568 INR
  In Lakhs: ₹41.87 L
  In Crores: ₹0.42 Cr
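The conversion from the model's log-scale output to rupees, lakhs and crores can be wrapped in one small function. This is a sketch under the assumption that the training target was the natural log of the amount, which is what the `np.exp()` call above implies; if the target had been built with `np.log1p`, `math.expm1` would be the matching inverse. `log_to_inr` is a hypothetical helper name, and `math.exp` is used instead of `np.exp` only to keep the example dependency-free.

```python
import math

LAKH = 100_000       # 1 lakh  = 10^5 INR
CRORE = 10_000_000   # 1 crore = 10^7 INR


def log_to_inr(prediction_log):
    """Convert a natural-log prediction into INR, lakhs and crores."""
    amount = math.exp(prediction_log)
    return {
        "inr": amount,
        "lakhs": amount / LAKH,
        "crores": amount / CRORE,
    }


# Log value taken from the first row of the CSV results in this notebook.
result = log_to_inr(15.247392)
print(f"Rs.{result['inr']:,.0f}")
print(f"{result['lakhs']:.2f} L, {result['crores']:.2f} Cr")
```

For the Seed-stage example this gives roughly 41.87 lakh, or 0.42 crore, matching the output above.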


## Step 5: Test Multiple Scenarios at Once

In [None]:
# Create multiple test cases
test_scenarios = [
    {
        'Scenario': 'Seed Stage Tech Startup',
        'Year': 2020, 'Month': 6, 'Quarter': 2,
        'Stage_Order': 2, 'Investor_Count': 1,
        'City_Category_Encoded': 0, 'Industry_Category_Encoded': 9,
        'Has_Multiple_Investors': 0
    },
    {
        'Scenario': 'Series C Fintech with Multiple Investors',
        'Year': 2019, 'Month': 9, 'Quarter': 3,
        'Stage_Order': 7, 'Investor_Count': 3,
        'City_Category_Encoded': 0, 'Industry_Category_Encoded': 3,
        'Has_Multiple_Investors': 1
    },
    {
        'Scenario': 'Private Equity E-commerce',
        'Year': 2020, 'Month': 3, 'Quarter': 1,
        'Stage_Order': 9, 'Investor_Count': 2,
        'City_Category_Encoded': 0, 'Industry_Category_Encoded': 1,
        'Has_Multiple_Investors': 1
    },
    {
        'Scenario': 'Series A Healthcare Startup',
        'Year': 2018, 'Month': 4, 'Quarter': 2,
        'Stage_Order': 5, 'Investor_Count': 2,
        'City_Category_Encoded': 0, 'Industry_Category_Encoded': 4,
        'Has_Multiple_Investors': 1
    },
]

# Convert to DataFrame
df_scenarios = pd.DataFrame(test_scenarios)

# Make predictions
df_scenarios['Predicted_Log'] = model.predict(df_scenarios[features])
df_scenarios['Predicted_Amount_INR'] = np.exp(df_scenarios['Predicted_Log'])
df_scenarios['Predicted_Crores'] = df_scenarios['Predicted_Amount_INR'] / 10000000

# Display results
print("="*70)
print("[MULTIPLE SCENARIO PREDICTIONS]")
print("="*70)
print()
display(df_scenarios[['Scenario', 'Stage_Order', 'Predicted_Crores', 'Predicted_Log']].round(2))

print("\n[INSIGHTS]")
print(f"  • Highest predicted funding: {df_scenarios['Predicted_Crores'].max():.2f} Cr")
print(f"  • Lowest predicted funding: {df_scenarios['Predicted_Crores'].min():.2f} Cr")
print(f"  • Average prediction: {df_scenarios['Predicted_Crores'].mean():.2f} Cr")

MULTIPLE SCENARIO PREDICTIONS

| | Scenario | Stage_Order | Predicted_Crores | Predicted_Log |
|---|---|---|---|---|
| 0 | Seed Stage Tech Startup | 2 | 0.42 | 15.25 |
| 1 | Series C Fintech with Multiple Investors | 7 | 3.61 | 17.40 |
| 2 | Private Equity E-commerce | 9 | 1.93 | 16.78 |
| 3 | Series A Healthcare Startup | 5 | 0.73 | 15.80 |

Insights:
  • Highest predicted funding: 3.61 Cr
  • Lowest predicted funding: 0.42 Cr
  • Average prediction: 1.67 Cr


## Step 6: Test with CSV File

In [None]:
# Load test data from CSV
try:
    df_csv_test = pd.read_csv('../data/processed/test_data.csv')
    
    print(f"[SUCCESS] Loaded {len(df_csv_test)} test cases from test_data.csv\n")
    
    # Make predictions
    df_csv_test['Predicted_Log'] = model.predict(df_csv_test[features])
    df_csv_test['Predicted_Amount_INR'] = np.exp(df_csv_test['Predicted_Log'])
    df_csv_test['Predicted_Crores'] = df_csv_test['Predicted_Amount_INR'] / 10000000
    
    print("[PREDICTIONS]")
    display(df_csv_test)
    
    # Save results
    output_path = '../data/processed/test_predictions.csv'
    df_csv_test.to_csv(output_path, index=False)
    print(f"\n[SUCCESS] Results saved to: {output_path}")
    
except FileNotFoundError:
    print("[WARNING] test_data.csv not found in ../data/processed/")
    print(f"[INFO] Create one with these feature columns: {features}")

Loaded 5 test cases from test_data.csv

Predictions:

| | Year | Month | Quarter | Stage_Order | Investor_Count | City_Category_Encoded | Industry_Category_Encoded | Has_Multiple_Investors | Predicted_Log | Predicted_Amount_INR | Predicted_Crores |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | 6 | 2 | 2 | 1 | 0 | 9 | 0 | 15.247392 | 4186568.0 | 0.418657 |
| 1 | 2019 | 9 | 3 | 7 | 3 | 0 | 3 | 1 | 17.40057 | 36055500.0 | 3.60555 |
| 2 | 2020 | 3 | 1 | 9 | 2 | 0 | 1 | 1 | 16.777637 | 19339050.0 | 1.933905 |
| 3 | 2018 | 4 | 2 | 5 | 2 | 0 | 9 | 1 | 15.90011 | 8041371.0 | 0.804137 |
| 4 | 2017 | 11 | 4 | 2 | 1 | 2 | 0 | 0 | 12.687989 | 323834.8 | 0.032383 |

Results saved to: ../data/processed/test_predictions.csv
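The cell above only handles a missing file. A CSV that exists but lacks one of the required columns would fail later with a `KeyError` at `df_csv_test[features]`. A minimal sketch of an upfront check follows; `missing_columns` is a hypothetical helper name, and it accepts any iterable of column names, so in the notebook it could be called as `missing_columns(df_csv_test.columns, features)`.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in required order."""
    present = set(available)
    return [name for name in required if name not in present]


features = ['Year', 'Month', 'Quarter', 'Stage_Order', 'Investor_Count',
            'City_Category_Encoded', 'Industry_Category_Encoded',
            'Has_Multiple_Investors']

# Hypothetical CSV header that forgot the Quarter column.
csv_header = ['Year', 'Month', 'Stage_Order', 'Investor_Count',
              'City_Category_Encoded', 'Industry_Category_Encoded',
              'Has_Multiple_Investors']

missing = missing_columns(csv_header, features)
if missing:
    print(f"[WARNING] CSV is missing required columns: {missing}")
else:
    print("[SUCCESS] All required columns present")
```

Extra columns in the CSV are harmless here, because the prediction call selects only the columns named in `features`.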


## Step 7: Custom Prediction (Enter Your Own Values)

In [None]:
# Create your own test case here
custom_input = {
    'Year': 2020,           # Enter year (2015-2020)
    'Month': 8,             # Enter month (1-12)
    'Quarter': 3,           # Enter quarter (1-4)
    'Stage_Order': 5,       # Enter stage (see encoding above)
    'Investor_Count': 2,    # Enter investor count
    'City_Category_Encoded': 0,         # Enter city category
    'Industry_Category_Encoded': 3,     # Enter industry category
    'Has_Multiple_Investors': 1         # 0 or 1
}

# Make prediction
df_custom = pd.DataFrame([custom_input])
pred_log = model.predict(df_custom[features])[0]
pred_amount = np.exp(pred_log)

print("="*70)
print("[CUSTOM PREDICTION]")
print("="*70)
print("\nYour Input:")
for key, val in custom_input.items():
    print(f"  {key}: {val}")

print("\n[PREDICTION] Funding Amount:")
print(f"  Log Scale: {pred_log:.2f}")
print(f"  Amount: Rs.{pred_amount:,.0f} INR")
print(f"  In Crores: Rs.{pred_amount/10000000:.2f} Cr")

CUSTOM PREDICTION

Your Input:
  Year: 2020
  Month: 8
  Quarter: 3
  Stage_Order: 5
  Investor_Count: 2
  City_Category_Encoded: 0
  Industry_Category_Encoded: 3
  Has_Multiple_Investors: 1

PREDICTED FUNDING:
  Log Scale: 15.83
  Amount: ₹7,468,964 INR
  In Crores: ₹0.75 Cr
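Two of the eight inputs look derivable from the others, so entering them by hand invites inconsistent rows (for example, Month 8 with Quarter 1). The sketch below assumes that `Quarter` is the calendar quarter of `Month` and that `Has_Multiple_Investors` is 1 exactly when `Investor_Count` is greater than 1. Both rules are inferred from the examples in this notebook, not from the training code, and `complete_input` is a hypothetical helper name.

```python
def complete_input(year, month, stage_order, investor_count,
                   city_code, industry_code):
    """Build a full feature dict, deriving Quarter and Has_Multiple_Investors."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    if investor_count < 0:
        raise ValueError(f"investor_count must be >= 0, got {investor_count}")
    return {
        'Year': year,
        'Month': month,
        'Quarter': (month - 1) // 3 + 1,          # assumed: calendar quarter
        'Stage_Order': stage_order,
        'Investor_Count': investor_count,
        'City_Category_Encoded': city_code,
        'Industry_Category_Encoded': industry_code,
        'Has_Multiple_Investors': int(investor_count > 1),  # assumed rule
    }


custom_input = complete_input(year=2020, month=8, stage_order=5,
                              investor_count=2, city_code=0, industry_code=3)
print(custom_input)
```

The resulting dictionary matches the hand-written `custom_input` above and can be passed to `pd.DataFrame([custom_input])` in the same way.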


## Key Insights

### Model Performance:
- **R² Score**: 0.5838 (58.38% of the variance in log funding explained)
- **RMSE**: 1.30 (on log scale)
- **MAE**: 0.83 (on log scale)
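Because RMSE and MAE are measured on the log scale, they translate into multiplicative factors on the rupee scale rather than fixed rupee amounts. The sketch below shows that reading, assuming a natural-log target as implied by the `np.exp()` conversion used in this notebook. The band it prints is a rough illustration of typical error size, not a confidence interval.

```python
import math

rmse_log = 1.30   # from the metrics above
mae_log = 0.83

# An error of d on the natural-log scale is a factor of exp(d) on the INR scale.
rmse_factor = math.exp(rmse_log)
mae_factor = math.exp(mae_log)
print(f"RMSE factor: x{rmse_factor:.2f}")   # x3.67
print(f"MAE factor:  x{mae_factor:.2f}")    # x2.29

# Rough band around the Seed-stage prediction of about 0.42 Cr.
predicted_crores = 0.418657
low = predicted_crores / rmse_factor
high = predicted_crores * rmse_factor
print(f"One-RMSE band: {low:.2f} Cr to {high:.2f} Cr")   # 0.11 Cr to 1.54 Cr
```

In other words, a typical prediction is off by a factor of roughly two to four, which is worth keeping in mind when reading the crore figures above.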

### Most Important Features:
1. **Stage_Order** (81.8%) - Funding stage is the dominant predictor
2. **Year** (7.2%) - Recent years see higher funding
3. **Month** (4.2%) - Seasonal patterns exist
4. **City_Category** (2.5%) - Metro cities attract more funding

### How to Use Predictions:
- Predictions are on a **natural-log scale**; convert them with `np.exp()` to get amounts in INR
- Model works best for **typical funding scenarios** (within training data range)
- Extreme values may have higher prediction error
- Stage_Order has the biggest impact on predictions

### Tips:
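- Metro cities (City_Category_Encoded=0) generally get higher predictions
- Later stages (higher Stage_Order) generally predict larger amounts, though not strictly: in Step 5, Private Equity (9) was predicted at 1.93 Cr versus 3.61 Cr for Series C (7)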
- Multiple investors (Has_Multiple_Investors=1) can increase predictions
- Recent years (2019-2020) tend to have higher predictions