# Notebook 4: Feature Engineering (Simplified)

## Objective
Transform the cleaned dataset into a modeling-ready format with **core features only**:
- City categories (Metro/Tier-2/Other, plus Unknown for rows with no city)
- Industry groups (10 major categories)
- Derived features (investor count, funding per investor)

## Prerequisites
- Requires `startup_funding_clean.csv` from Notebook 2

---

## 1. Import Libraries

In [23]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully")

Libraries imported successfully


## 2. Load Cleaned Data

In [24]:
# Load cleaned data
df = pd.read_csv('../data/startup_funding_clean.csv')
df['Date'] = pd.to_datetime(df['Date'])

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

Dataset shape: (3044, 22)

Columns: ['Sr No', 'Date dd/mm/yyyy', 'Startup Name', 'Industry Vertical', 'SubVertical', 'City  Location', 'Investors Name', 'InvestmentnType', 'Amount in USD', 'Remarks', 'Date', 'Year', 'Month', 'Quarter', 'Amount_INR', 'Amount_Lakhs', 'Amount_Crores', 'Funding_Amount_Log', 'Stage', 'Stage_Order', 'City_Clean', 'Investor_Count']


| | Sr No | Date dd/mm/yyyy | Startup Name | Industry Vertical | SubVertical | City Location | Investors Name | InvestmentnType | Amount in USD | Remarks | ... | Month | Quarter | Amount_INR | Amount_Lakhs | Amount_Crores | Funding_Amount_Log | Stage | Stage_Order | City_Clean | Investor_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 09/01/2020 | BYJU’S | E-Tech | E-learning | Bengaluru | Tiger Global Management | Private Equity Round | 200000000 | | ... | 1.0 | 1.0 | 200000000.0 | 2000.0 | 20.0 | 19.113828 | Private Equity | 9 | Bengaluru | 1 |
| 1 | 2 | 13/01/2020 | Shuttl | Transportation | App based shuttle service | Gurgaon | Susquehanna Growth Equity | Series C | 8048394 | | ... | 1.0 | 1.0 | 8048394.0 | 80.48394 | 0.804839 | 15.900983 | Series C | 7 | Gurugram | 1 |
| 2 | 3 | 09/01/2020 | Mamaearth | E-commerce | Retailer of baby and toddler products | Bengaluru | Sequoia Capital India | Series B | 18358860 | | ... | 1.0 | 1.0 | 18358860.0 | 183.5886 | 1.835886 | 16.725623 | Series B | 6 | Bengaluru | 1 |
| 3 | 4 | 02/01/2020 | https://www.wealthbucket.in/ | FinTech | Online Investment | New Delhi | Vinod Khatumal | Pre-series A | 3000000 | | ... | 1.0 | 1.0 | 3000000.0 | 30.0 | 0.3 | 14.914123 | Pre-Series A | 4 | Delhi | 1 |
| 4 | 5 | 02/01/2020 | Fashor | Fashion and Apparel | Embroiled Clothes For Women | Mumbai | Sprout Venture Partners | Seed Round | 1800000 | | ... | 1.0 | 1.0 | 1800000.0 | 18.0 | 0.18 | 14.403298 | Seed | 2 | Mumbai | 1 |


## 3. City Categorization

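Group cities into three tiers (Metro, Tier-2, Other) based on funding activity. Rows with a missing city are labelled Unknown, so the resulting column has four categories.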

In [25]:
# Define city tiers
metro_cities = ['Bengaluru', 'Mumbai', 'Delhi', 'Gurugram', 'Noida', 'Pune', 'Hyderabad', 'Chennai']
tier2_cities = ['Kolkata', 'Ahmedabad', 'Jaipur', 'Chandigarh', 'Kochi', 'Surat', 'Indore']

def categorize_city(city):
    if pd.isna(city):
        return 'Unknown'
    if city in metro_cities:
        return 'Metro'
    elif city in tier2_cities:
        return 'Tier-2'
    else:
        return 'Other'

# Apply categorization
df['City_Category'] = df['City_Clean'].apply(categorize_city)

print("City Category Distribution:")
print(df['City_Category'].value_counts())
print(f"\nTotal categories: {df['City_Category'].nunique()}")

City Category Distribution:
City_Category
Metro      2593
Unknown     180
Other       154
Tier-2      117
Name: count, dtype: int64

Total categories: 4
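
The same tiering can also be written as a dictionary lookup, which avoids a row-by-row `apply`. This is a minimal sketch, assuming the values in `City_Clean` match the listed names exactly; `CITY_TIERS` and `categorize_cities` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

METRO = ['Bengaluru', 'Mumbai', 'Delhi', 'Gurugram', 'Noida', 'Pune', 'Hyderabad', 'Chennai']
TIER2 = ['Kolkata', 'Ahmedabad', 'Jaipur', 'Chandigarh', 'Kochi', 'Surat', 'Indore']

# Hypothetical lookup table: city name -> tier label
CITY_TIERS = {**{city: 'Metro' for city in METRO},
              **{city: 'Tier-2' for city in TIER2}}

def categorize_cities(cities: pd.Series) -> pd.Series:
    """Vectorized equivalent of categorize_city."""
    # Cities missing from the lookup come back as NaN and become 'Other'
    tiers = cities.map(CITY_TIERS).fillna('Other')
    # Rows whose city was missing to begin with are labelled 'Unknown'
    return tiers.mask(cities.isna(), 'Unknown')

sample = pd.Series(['Mumbai', 'Jaipur', 'Goa', None])
result = categorize_cities(sample)
```

On the four sample values this yields Metro, Tier-2, Other and Unknown, the same labels `categorize_city` returns.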


## 4. Industry Categorization

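Group 100+ industries into 10 major categories. Matching is a case-insensitive substring search, and the first category in `industry_mapping` with a matching keyword wins, so the order of the mapping matters. Anything unmatched, including missing values, falls into Other.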

In [26]:
# Define industry mappings
industry_mapping = {
    'Technology': ['Technology', 'IT Services', 'Software', 'Cloud', 'SaaS', 'Enterprise Software'],
    'E-commerce': ['E-commerce', 'Online Marketplace', 'Retail', 'Fashion', 'Grocery'],
    'Fintech': ['Fintech', 'Financial Services', 'Payments', 'Lending', 'Insurance'],
    'Healthcare': ['Healthcare', 'HealthTech', 'Pharma', 'MedTech', 'Diagnostics'],
    'Consumer': ['Consumer Internet', 'FMCG', 'Food & Beverage', 'Restaurant'],
    'Education': ['Education', 'EdTech', 'E-learning', 'Training'],
    'Logistics': ['Logistics', 'Supply Chain', 'Delivery', 'Transportation'],
    'Media': ['Media', 'Entertainment', 'Content', 'Gaming', 'Advertising'],
    'Real Estate': ['Real Estate', 'PropTech', 'Construction', 'Infrastructure'],
    'Other': []  # Catch-all
}

def categorize_industry(industry):
    if pd.isna(industry):
        return 'Other'
    
    industry = str(industry)
    for category, keywords in industry_mapping.items():
        if any(keyword.lower() in industry.lower() for keyword in keywords):
            return category
    return 'Other'

df['Industry_Category'] = df['Industry Vertical'].apply(categorize_industry)

print("Industry Category Distribution:")
print(df['Industry_Category'].value_counts())
print(f"\nTotal categories: {df['Industry_Category'].nunique()}")

Industry Category Distribution:
Industry_Category
Other          1143
Consumer        984
Technology      519
Logistics       106
Healthcare       88
E-commerce       85
Education        49
Fintech          28
Media            25
Real Estate      17
Name: count, dtype: int64

Total categories: 10
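
Because Other is the largest group, it can help to see which raw labels fall through the keyword list and how first-match ordering affects the result. The sketch below uses an abbreviated copy of the mapping and a hypothetical handful of labels; it illustrates the matching behaviour and is not a replacement for the full mapping.

```python
import pandas as pd

# Abbreviated copy of the notebook's mapping; dict order is the matching order
industry_mapping = {
    'Technology': ['Technology', 'Software', 'SaaS'],
    'Fintech': ['Fintech', 'Financial Services', 'Payments'],
    'Education': ['Education', 'EdTech', 'E-learning'],
}

def categorize_industry(industry):
    if pd.isna(industry):
        return 'Other'
    text = str(industry).lower()
    for category, keywords in industry_mapping.items():
        if any(keyword.lower() in text for keyword in keywords):
            return category
    return 'Other'

# Hypothetical sample of raw labels
verticals = pd.Series(['Education Technology', 'FinTech', 'E-Tech',
                       'E-Tech', 'Agriculture', None])
categories = verticals.apply(categorize_industry)

# Raw labels that ended up in 'Other', most frequent first
unmatched = verticals[categories == 'Other'].value_counts()
```

Here 'Education Technology' is assigned to Technology because that category is checked before Education, and 'E-Tech' lands in Other because no keyword is a substring of it. Inspecting `unmatched` on the real data would show which keywords are worth adding.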


## 5. Derived Features

In [27]:
# Funding per investor (a zero investor count becomes NaN so the division cannot produce inf)
df['Funding_Per_Investor'] = df['Amount_INR'] / df['Investor_Count'].replace(0, np.nan)

# Binary flag for multiple investors
df['Has_Multiple_Investors'] = (df['Investor_Count'] > 1).astype(int)

# High funding flag (>10 Crores); rows with a missing amount compare as False and get 0
df['Is_High_Funding'] = (df['Amount_Crores'] > 10).astype(int)

print("Derived Features Created:")
print("- Funding_Per_Investor")
print("- Has_Multiple_Investors")
print("- Is_High_Funding")
print(f"\nSample values:")
print(df[['Investor_Count', 'Funding_Per_Investor', 'Has_Multiple_Investors', 'Is_High_Funding']].head())

Derived Features Created:
- Funding_Per_Investor
- Has_Multiple_Investors
- Is_High_Funding

Sample values:
   Investor_Count  Funding_Per_Investor  Has_Multiple_Investors  \
0               1           200000000.0                       0   
1               1             8048394.0                       0   
2               1            18358860.0                       0   
3               1             3000000.0                       0   
4               1             1800000.0                       0   

   Is_High_Funding  
0                1  
1                0  
2                0  
3                0  
4                0  
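
Two edge cases in these features are worth handling explicitly: a zero investor count, which would otherwise yield `inf` in the division, and a missing amount, which the threshold comparison silently turns into 0. The following is a sketch on a small made-up frame, assuming a nullable integer flag is acceptable downstream; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows covering the normal case, a missing amount, and a zero investor count
df = pd.DataFrame({
    'Amount_INR': [200_000_000.0, 25_000_000.0, np.nan, 5_000_000.0],
    'Amount_Crores': [20.0, 2.5, np.nan, 0.5],
    'Investor_Count': [1, 2, 3, 0],
})

# Treat a zero investor count as missing so the division gives NaN instead of inf
investors = df['Investor_Count'].replace(0, np.nan)
df['Funding_Per_Investor'] = df['Amount_INR'] / investors

# Nullable integer flag: keeps "amount unknown" distinct from "not high funding"
high = (df['Amount_Crores'] > 10).astype('Int64')
high[df['Amount_Crores'].isna()] = pd.NA
df['Is_High_Funding'] = high
```

The third row keeps a missing flag instead of 0, and the fourth row has a missing `Funding_Per_Investor` instead of `inf`.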


## 6. Label Encoding

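Encode categorical features as integers for modeling. `LabelEncoder` numbers the classes in sorted (alphabetical) order, so the codes are identifiers only and carry no ranking; `Stage_Order` remains the column that reflects funding-stage progression.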

In [28]:
# Initialize encoders
city_encoder = LabelEncoder()
industry_encoder = LabelEncoder()
stage_encoder = LabelEncoder()

# Apply encoding
df['City_Category_Encoded'] = city_encoder.fit_transform(df['City_Category'])
df['Industry_Category_Encoded'] = industry_encoder.fit_transform(df['Industry_Category'])
df['Stage_Encoded'] = stage_encoder.fit_transform(df['Stage'])

print("Encoding Complete:")
print(f"\nCity Categories: {dict(enumerate(city_encoder.classes_))}")
print(f"\nIndustry Categories: {dict(enumerate(industry_encoder.classes_))}")
print(f"\nStage Categories: {dict(enumerate(stage_encoder.classes_))}")

Encoding Complete:

City Categories: {0: 'Metro', 1: 'Other', 2: 'Tier-2', 3: 'Unknown'}

Industry Categories: {0: 'Consumer', 1: 'E-commerce', 2: 'Education', 3: 'Fintech', 4: 'Healthcare', 5: 'Logistics', 6: 'Media', 7: 'Other', 8: 'Real Estate', 9: 'Technology'}

Stage Categories: {0: 'Angel', 1: 'Corporate Round', 2: 'Debt Funding', 3: 'Pre-Series A', 4: 'Private Equity', 5: 'Seed', 6: 'Series A', 7: 'Series B', 8: 'Series C', 9: 'Series D+', 10: 'Undisclosed'}
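
The stage codes above are alphabetical (Angel is 0, Seed is 5, Series A is 6), so they do not follow the funding progression. If an ordered code is wanted, an ordered categorical is one option. The sketch below uses pandas only; `STAGE_PROGRESSION` is an assumed ordering that agrees with the `Stage_Order` values visible in the sample rows and leaves out stages whose position the notebook does not show.

```python
import pandas as pd

stages = pd.Series(['Series B', 'Seed', 'Private Equity', 'Series A', 'Seed'])

# Alphabetical codes: categories are sorted, the same ordering LabelEncoder uses
alphabetical = pd.Categorical(stages).codes

# Assumed progression, consistent with the Stage_Order values in the sample rows
STAGE_PROGRESSION = ['Seed', 'Pre-Series A', 'Series A', 'Series B',
                     'Series C', 'Series D+', 'Private Equity']
ordinal = pd.Categorical(stages, categories=STAGE_PROGRESSION, ordered=True).codes
```

With the alphabetical codes, Private Equity (0) sorts before Seed (1); with the ordered categorical, Seed is 0 and Private Equity is 6. A stage that is not in the list would receive the code -1, so the list needs to cover every stage in the data.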


## 7. Missing Value Imputation

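Use the **stage-wise median** to fill missing values of `Funding_Amount_Log`. `Amount_INR` is left as is, so columns computed from it in Section 5 (such as `Funding_Per_Investor`) keep their missing values.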

In [29]:
# Count missing values before imputation
print("Missing values before imputation:")
print(df[['Amount_INR', 'Funding_Amount_Log']].isnull().sum())

# Stage-wise median imputation
for stage in df['Stage'].unique():
    stage_median = df.loc[df['Stage'] == stage, 'Funding_Amount_Log'].median()
    df.loc[(df['Stage'] == stage) & (df['Funding_Amount_Log'].isnull()), 'Funding_Amount_Log'] = stage_median

print("\nMissing values after imputation:")
print(df[['Amount_INR', 'Funding_Amount_Log']].isnull().sum())

print("\nStage-wise median imputation complete")

Missing values before imputation:
Amount_INR            967
Funding_Amount_Log    967
dtype: int64

Missing values after imputation:
Amount_INR            967
Funding_Amount_Log      0
dtype: int64

Stage-wise median imputation complete
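
The loop can also be expressed with `groupby(...).transform('median')`, which computes every stage median in one pass. This is a sketch on made-up values; the `Funding_Imputed` flag and the fallback to the overall median are additions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Stage': ['Seed', 'Seed', 'Seed', 'Series A', 'Series A', 'Undisclosed'],
    'Funding_Amount_Log': [14.0, 15.0, np.nan, 16.0, np.nan, np.nan],
})

# Record which rows are about to be filled (hypothetical helper column)
df['Funding_Imputed'] = df['Funding_Amount_Log'].isna().astype(int)

# Median of the observed values, used only where a stage has no median of its own
overall_median = df['Funding_Amount_Log'].median()

# Per-row median of the row's stage, aligned to the original index
stage_median = df.groupby('Stage')['Funding_Amount_Log'].transform('median')
df['Funding_Amount_Log'] = (
    df['Funding_Amount_Log'].fillna(stage_median).fillna(overall_median)
)
```

Since `Funding_Amount_Log` is also listed as a target, a flag like `Funding_Imputed` makes it possible to train or evaluate on observed amounts only.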


## 8. Feature Summary

Review all engineered features.

In [30]:
# List all features
engineered_features = [
    'Year', 'Month', 'Quarter',
    'Stage_Order', 'Investor_Count',
    'City_Category', 'City_Category_Encoded',
    'Industry_Category', 'Industry_Category_Encoded',
    'Has_Multiple_Investors', 'Is_High_Funding',
    'Funding_Per_Investor', 'Funding_Amount_Log',
    'Stage_Encoded'
]

print("="*60)
print("ENGINEERED FEATURES SUMMARY")
print("="*60)
print(f"\nTotal features: {len(engineered_features)}")
print(f"\nFeature list:")
for i, feature in enumerate(engineered_features, 1):
    print(f"  {i:2d}. {feature}")
print("\n" + "="*60)

# Display sample
print("\nSample data:")
df[engineered_features].head(10)

ENGINEERED FEATURES SUMMARY

Total features: 14

Feature list:
   1. Year
   2. Month
   3. Quarter
   4. Stage_Order
   5. Investor_Count
   6. City_Category
   7. City_Category_Encoded
   8. Industry_Category
   9. Industry_Category_Encoded
  10. Has_Multiple_Investors
  11. Is_High_Funding
  12. Funding_Per_Investor
  13. Funding_Amount_Log
  14. Stage_Encoded


Sample data:


| | Year | Month | Quarter | Stage_Order | Investor_Count | City_Category | City_Category_Encoded | Industry_Category | Industry_Category_Encoded | Has_Multiple_Investors | Is_High_Funding | Funding_Per_Investor | Funding_Amount_Log | Stage_Encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020.0 | 1.0 | 1.0 | 9 | 1 | Metro | 0 | Other | 7 | 0 | 1 | 200000000.0 | 19.113828 | 4 |
| 1 | 2020.0 | 1.0 | 1.0 | 7 | 1 | Metro | 0 | Logistics | 5 | 0 | 0 | 8048394.0 | 15.900983 | 8 |
| 2 | 2020.0 | 1.0 | 1.0 | 6 | 1 | Metro | 0 | E-commerce | 1 | 0 | 0 | 18358860.0 | 16.725623 | 7 |
| 3 | 2020.0 | 1.0 | 1.0 | 4 | 1 | Metro | 0 | Fintech | 3 | 0 | 0 | 3000000.0 | 14.914123 | 3 |
| 4 | 2020.0 | 1.0 | 1.0 | 2 | 1 | Metro | 0 | E-commerce | 1 | 0 | 0 | 1800000.0 | 14.403298 | 5 |
| 5 | 2020.0 | 1.0 | 1.0 | 5 | 1 | Metro | 0 | Logistics | 5 | 0 | 0 | 9000000.0 | 16.012735 | 6 |
| 6 | 2020.0 | 1.0 | 1.0 | 9 | 1 | Metro | 0 | Other | 7 | 0 | 1 | 150000000.0 | 18.826146 | 4 |
| 7 | 2019.0 | 12.0 | 4.0 | 5 | 1 | Metro | 0 | Technology | 9 | 0 | 0 | 6000000.0 | 15.60727 | 6 |
| 8 | 2019.0 | 12.0 | 4.0 | 8 | 1 | Metro | 0 | E-commerce | 1 | 0 | 0 | 70000000.0 | 18.064006 | 9 |
| 9 | 2019.0 | 12.0 | 4.0 | 2 | 2 | Metro | 0 | Other | 7 | 1 | 0 | 25000000.0 | 17.727534 | 5 |


## 9. Save Processed Dataset

In [31]:
# Save to CSV
output_path = '../data/processed_features.csv'
df.to_csv(output_path, index=False)

print(f"Processed dataset saved: {output_path}")
print(f"   Shape: {df.shape}")
print(f"   In-memory size: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

Processed dataset saved: ../data/processed_features.csv
   Shape: (3044, 30)
   In-memory size: 2.59 MB
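
CSV stores no dtype information, so the `Date` column comes back as text unless it is parsed again on load. A minimal round-trip sketch, using an in-memory buffer as a stand-in for the output file and a made-up two-row frame:

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-09', '2019-12-31']),
    'City_Category': ['Metro', 'Tier-2'],
    'Funding_Amount_Log': [19.113828, 15.60727],
})

buffer = io.StringIO()  # stand-in for '../data/processed_features.csv'
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format does not keep
reloaded = pd.read_csv(buffer, parse_dates=['Date'])
```

The next notebook would need the same `parse_dates` argument (or a `pd.to_datetime` call, as in Section 2) when it reads the processed file.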


## 10. Key Takeaways

### Feature Set (14 total, 8 created in this notebook):
1. **Temporal:** Year, Month, Quarter (carried over from the cleaned data; kept as plain numeric values)
2. **Categorical:** City_Category, Industry_Category
3. **Encoded:** City_Category_Encoded, Industry_Category_Encoded, Stage_Encoded
4. **Derived:** Investor_Count (carried over), Has_Multiple_Investors, Is_High_Funding, Funding_Per_Investor
5. **Target:** Funding_Amount_Log, Stage_Order

### Simplifications Applied:
- No cyclical encoding (Month/Quarter kept as plain numeric values)
- No interaction features (Stage×City, Stage×Industry removed)
- Simple label encoding (no one-hot)
- Stage-wise median imputation (each missing Funding_Amount_Log takes the median of its funding stage)

### Ready for Modeling:
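- Funding_Amount_Log has no missing values after imputation; Amount_INR still has 967, so Funding_Per_Investor is missing for those rows
- Funding_Per_Investor and Is_High_Funding are computed from the funding amount, so they should not be used as predictors when Funding_Amount_Log is the target
- 8 core features for regression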
- Suitable for 2nd-year BTech project scope
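
The notebook does not name the eight core features. One candidate set, sketched below, is the eight numeric columns that are not computed from the funding amount. Both the selection and the helper `split_features` are assumptions for illustration, not the author's stated choice.

```python
import pandas as pd

TARGET = 'Funding_Amount_Log'

# Columns computed from the funding amount itself; as predictors they would leak the target
LEAKY_COLUMNS = ['Funding_Per_Investor', 'Is_High_Funding']

# Assumed predictor set: the notebook does not list its eight core features
CANDIDATE_FEATURES = [
    'Year', 'Month', 'Quarter', 'Stage_Order', 'Investor_Count',
    'City_Category_Encoded', 'Industry_Category_Encoded', 'Has_Multiple_Investors',
]

def split_features(df: pd.DataFrame):
    """Return (X, y) after checking that no leaky column is among the predictors."""
    overlap = set(CANDIDATE_FEATURES) & set(LEAKY_COLUMNS + [TARGET])
    if overlap:
        raise ValueError(f"Leaky predictors selected: {sorted(overlap)}")
    return df[CANDIDATE_FEATURES], df[TARGET]

# Rows 0 and 9 of the sample output in Section 8
demo = pd.DataFrame({
    'Year': [2020.0, 2019.0], 'Month': [1.0, 12.0], 'Quarter': [1.0, 4.0],
    'Stage_Order': [9, 2], 'Investor_Count': [1, 2],
    'City_Category_Encoded': [0, 0], 'Industry_Category_Encoded': [7, 7],
    'Has_Multiple_Investors': [0, 1], 'Is_High_Funding': [1, 0],
    'Funding_Per_Investor': [200000000.0, 25000000.0],
    'Funding_Amount_Log': [19.113828, 17.727534],
})
X, y = split_features(demo)
```

Keeping the leaky columns in an explicit list makes the exclusion visible and causes the check to fail if one of them is later added to the predictors.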