In [1]:
pip install faker

Collecting faker
  Downloading faker-37.1.0-py3-none-any.whl.metadata (15 kB)
Downloading faker-37.1.0-py3-none-any.whl (1.9 MB)

### **Dataset Generation Explanation**  

This dataset simulates **insurance policy records** for three categories: **Life, Health, and Motor Insurance**. It is generated in **Python** with the **Faker** library and the standard-library **random** module to create synthetic yet realistic-looking policyholder data. The data can be used to prototype **insurance-related analyses, fraud detection, customer segmentation, and predictive modeling.**  

---

### **1. Life Insurance Dataset**  
- **Purpose:** Simulates life insurance policies with essential details of policyholders.  
- **Number of Records:** 800  
- **Columns:**  
  - **Policy_Number**: Identifier made of `LIFE-` plus a random six-digit number (e.g., LIFE-123456)  
  - **Customer_Name**: Randomly generated name  
  - **Age**: Between 20 and 70 years  
  - **Policy_Type**: Term Plan, Endowment, or Whole Life  
  - **Company**: Randomly assigned from major life insurance providers  
  - **Company_Product**: Company name followed by a product name drawn for the policy type  
  - **Sum_Assured**: Coverage amount (5L, 10L, 20L, or 50L)  
  - **Term**: Duration (10, 15, 20, 25, or 30 years)  
  - **Nominee**: Randomly generated name  
  - **Smoker**: Boolean (15% chance of being a smoker)  
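
As a quick illustration of the 15% smoker rate, the sketch below draws a weighted Boolean with `random.choices` and checks the observed share over many draws. The helper name `draw_smoker`, the sample size, and the fixed seed are assumptions made for this example only.

```python
import random

random.seed(42)  # fixed seed only so this sketch is repeatable

def draw_smoker(p_smoker=0.15):
    """Return True with probability p_smoker (hypothetical helper)."""
    return random.choices([True, False], weights=[p_smoker, 1 - p_smoker])[0]

# Draw many flags and measure how often True comes up
flags = [draw_smoker() for _ in range(10_000)]
share = sum(flags) / len(flags)
print(f"Observed smoker share: {share:.3f}")
```

With this many draws the observed share should sit close to 0.15, while a dataset of 800 rows will show more variation around that figure.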

---

### **2. Health Insurance Dataset**  
- **Purpose:** Captures key details of health insurance policyholders, including pre-existing conditions and hospitalizations.  
- **Number of Records:** 700  
- **Columns:**  
  - **Policy_Number**: Identifier made of `HLTH-` plus a random six-digit number (e.g., HLTH-987654)  
  - **Customer_Name**: Randomly generated name  
  - **Age**: Between 18 and 80 years  
  - **Policy_Type**: Individual, Family Floater, or Senior Citizen  
  - **Company**: Randomly chosen from major health insurance providers  
  - **Company_Product**: Company name followed by a product name drawn for the policy type  
  - **Sum_Insured**: Coverage amount (3L, 5L, 7.5L, or 10L)  
  - **Medical_History**: Up to two of Hypertension, Diabetes, and Arthritis when age > 50; otherwise the string "None"  
  - **Last_Hospitalization**: Random date within the last 5 years (30% chance); otherwise the string "None"  
  - **Preventive_Checkup**: Boolean (True or False with equal probability)  
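
The age-dependent medical history can be sketched as follows. The function name `draw_medical_history` and the seeded `random.Random` instance are assumptions for illustration; the rule itself (zero to two distinct conditions, only for ages above 50) follows the column description above.

```python
import random

CONDITIONS = ["Hypertension", "Diabetes", "Arthritis"]

def draw_medical_history(age, rng):
    """Return a comma-separated condition string, or "None" when no condition applies."""
    conditions = []
    if age > 50:
        # Sample 0, 1, or 2 distinct conditions
        conditions = rng.sample(CONDITIONS, rng.randint(0, 2))
    return ", ".join(conditions) if conditions else "None"

rng = random.Random(7)  # seeded generator, used only to make the sketch repeatable
for age in (30, 55, 72):
    print(age, draw_medical_history(age, rng))
```

Note that a policyholder over 50 can still end up with "None", because the number of sampled conditions may be zero.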

---

### **3. Motor Insurance Dataset**  
- **Purpose:** Generates motor insurance policy data, including vehicle details and claim history.  
- **Number of Records:** 500  
- **Columns:**  
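  - **Policy_Number**: Identifier made of `MOTR-` plus a random six-digit number (e.g., MOTR-654321)  
  - **Vehicle_Number**: Random registration-style number (state code KA, DL, MH, or TN, then two digits, two letters, and four digits)  
  - **Vehicle_Type**: Hatchback, Sedan, SUV, or Electric  
  - **Company**: Randomly assigned from major motor insurance providers  
  - **Company_Product**: Company name followed by a product name drawn for the policy type  
  - **Policy_Type**: Comprehensive or Third-Party  
  - **Manufacture_Year**: Current year minus a randomly generated car age (0-15 years)  
  - **IDV_Percentage**: Insured Declared Value percentage, computed as 80 minus the car age with a floor of 20 (65 to 80 for the generated ages)  
  - **Accident_History**: None, Minor, or Major (70%, 25%, and 5% chance respectively)  
  - **Current_Value**: Base value (5L, 7.5L, 10L, or 15L) reduced by 5% of the base per year of car age  
  - **Buyback_Eligible**: True if car age < 8 years  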
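
A worked example of the motor valuation rules may help here. The sketch below collects the three age-based formulas in one function; the name `motor_valuation` and the example base value are assumptions for illustration.

```python
def motor_valuation(base_value, car_age):
    """Apply the age-based rules described above to one vehicle (hypothetical helper)."""
    current_value = int(base_value * (1 - car_age / 20))  # linear: 5% of base per year
    idv_percentage = max(80 - car_age, 20)                 # one point per year, floor of 20
    buyback_eligible = car_age < 8
    return current_value, idv_percentage, buyback_eligible

# Example: a vehicle with a base value of 10L at several ages
for car_age in (0, 5, 10, 15):
    print(car_age, motor_valuation(1_000_000, car_age))
```

Because the generated car age never exceeds 15, the floor of 20 in `IDV_Percentage` is never reached in this dataset.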

---

### **Why is This Dataset Useful?**  
- **Data Analysis & Insights:** Provides a sandbox for practicing analyses of policy trends, customer demographics, and risk factors.  
- **Predictive Modeling:** Supports prototyping pipelines for risk assessment, fraud detection, and claim prediction.  
- **Simulation & Testing:** Provides a synthetic dataset for developing and testing insurance-based AI/ML models.  

Each dataset is saved as a **CSV file**, making it easy to analyze and integrate into machine learning workflows. 🚀
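
When one of these CSV files is read back, the placeholder string "None" needs some care: depending on the pandas version, the default reader may interpret that literal as a missing value. The sketch below, which uses two hand-written rows in place of a real file, passes `keep_default_na=False` so that the strings are kept exactly as written. The sample rows are assumptions for illustration.

```python
import io
import pandas as pd

# Two hand-written rows standing in for a saved health insurance CSV
sample_csv = io.StringIO(
    "Policy_Number,Age,Medical_History,Last_Hospitalization\n"
    "HLTH-100001,62,Diabetes,2023-04-11\n"
    "HLTH-100002,34,None,None\n"
)

# keep_default_na=False stops pandas from converting strings such as "None" to NaN
df = pd.read_csv(sample_csv, keep_default_na=False)
print(df["Medical_History"].tolist())
print(df.dtypes)
```

Dropping `keep_default_na=False` is a reasonable choice too if the goal is to treat the placeholder as a genuine missing value.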

In [4]:
import pandas as pd
import random
import numpy as np
from faker import Faker
from datetime import datetime
from itertools import product

fake = Faker('en_IN')

# Configuration
COMPANIES = {
    "Life": ["LIC", "HDFC Life", "SBI Life", "ICICI Pru Life", "Max Life"],
    "Health": ["Star Health", "HDFC ERGO", "Niva Bupa", "Care Health", "ManipalCigna"],
    "Motor": ["ICICI Lombard", "Bajaj Allianz", "New India Assurance", "Tata AIG", "Oriental Insurance"]
}

# Product names mapping
PRODUCT_NAMES = {
    "Life": {
        "Term Plan": ["Term Elite", "Life Protect", "Secure Future", "Shield Plan"],
        "Endowment": ["Money Back Plus", "Savings Plan", "Wealth Builder", "Golden Years"],
        "Whole Life": ["Lifetime Cover", "Forever Secure", "Permanent Shield"]
    },
    "Health": {
        "Individual": ["Health Guard", "MediCare Plus", "Wellness Plan"],
        "Family Floater": ["Family Shield", "Health Umbrella", "Total Care"],
        "Senior Citizen": ["Senior Secure", "ElderCare", "Retirement Health"]
    },
    "Motor": {
        "Comprehensive": ["Total Cover", "Complete Protect", "Shield Plus"],
        "Third-Party": ["Basic Cover", "Legal Shield", "Essential Protect"]
    }
}

def generate_balanced_products(category, num_records):
    """Generate all possible company-product combinations and distribute records evenly"""
    companies = COMPANIES[category]
    policy_types = list(PRODUCT_NAMES[category].keys())
    
    # Create all possible combinations
    combinations = list(product(companies, policy_types))
    random.shuffle(combinations)
    
    # Calculate how many records per combination
    records_per_combo = num_records // len(combinations)
    extra_records = num_records % len(combinations)
    
    products = []
    for i, (company, policy_type) in enumerate(combinations):
        count = records_per_combo + (1 if i < extra_records else 0)
        base_names = PRODUCT_NAMES[category][policy_type]
        for _ in range(count):
            products.append({
                "Company": company,
                "Policy_Type": policy_type,
                "Product_Name": f"{company} {random.choice(base_names)}"
            })
    
    random.shuffle(products)
    return products

# =============================================
# 1. LIFE INSURANCE DATASET (Balanced)
# =============================================
def generate_life_policies(num_records=800):
    products = generate_balanced_products("Life", num_records)
    # Draw policy numbers without replacement so each one is unique
    policy_ids = random.sample(range(100000, 1000000), num_records)
    data = []

    # "item" avoids shadowing itertools.product, which is imported above
    for policy_id, item in zip(policy_ids, products):
        age = random.randint(20, 70)
        record = {
            "Policy_Number": f"LIFE-{policy_id}",
            "Customer_Name": fake.name(),
            "Age": age,
            "Policy_Type": item["Policy_Type"],
            "Company": item["Company"],
            "Company_Product": item["Product_Name"],
            "Sum_Assured": random.choice([500000, 1000000, 2000000, 5000000]),
            "Term": random.choice([10, 15, 20, 25, 30]),
            "Nominee": fake.name(),
            "Smoker": random.choices([True, False], weights=[0.15, 0.85])[0],
        }
        data.append(record)

    return pd.DataFrame(data)

# =============================================
# 2. HEALTH INSURANCE DATASET (Balanced)
# =============================================
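def generate_health_policies(num_records=700):
    products = generate_balanced_products("Health", num_records)
    # Draw policy numbers without replacement so each one is unique
    policy_ids = random.sample(range(100000, 1000000), num_records)
    data = []

    # "item" avoids shadowing itertools.product, which is imported above
    for policy_id, item in zip(policy_ids, products):
        age = random.randint(18, 80)
        conditions = []
        if age > 50:
            # Pick 0, 1, or 2 distinct conditions for policyholders over 50
            conditions = random.sample(["Hypertension", "Diabetes", "Arthritis"], random.randint(0, 2))

        # 30% get a hospitalization date from the last 5 years; the rest get
        # the placeholder string "None" (a string, not a real null)
        if random.random() < 0.3:
            last_hospitalization = fake.date_between(start_date='-5y', end_date='today')
        else:
            last_hospitalization = "None"

        data.append({
            "Policy_Number": f"HLTH-{policy_id}",
            "Customer_Name": fake.name(),
            "Age": age,
            "Policy_Type": item["Policy_Type"],
            "Company": item["Company"],
            "Company_Product": item["Product_Name"],
            "Sum_Insured": random.choice([300000, 500000, 750000, 1000000]),
            "Medical_History": ", ".join(conditions) if conditions else "None",
            "Last_Hospitalization": last_hospitalization,
            "Preventive_Checkup": random.choice([True, False])
        })

    return pd.DataFrame(data)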

# =============================================
# 3. MOTOR INSURANCE DATASET (Balanced)
# =============================================
def generate_motor_policies(num_records=500):
    products = generate_balanced_products("Motor", num_records)
    # Draw policy numbers without replacement so each one is unique
    policy_ids = random.sample(range(100000, 1000000), num_records)
    data = []

    # "item" avoids shadowing itertools.product, which is imported above
    for policy_id, item in zip(policy_ids, products):
        car_age = random.randint(0, 15)
        base_value = random.choice([500000, 750000, 1000000, 1500000])
        # Linear depreciation: 5% of the base value per year of age
        depreciated_value = int(base_value * (1 - car_age/20))

        # Registration-style number: state code, 2 digits, 2 letters, 4 digits
        state = random.choice(['KA', 'DL', 'MH', 'TN'])
        letters = chr(random.randint(65, 90)) + chr(random.randint(65, 90))
        vehicle_number = f"{state}{random.randint(10,99)}{letters}{random.randint(1000,9999)}"

        data.append({
            "Policy_Number": f"MOTR-{policy_id}",
            "Vehicle_Number": vehicle_number,
            "Policy_Type": item["Policy_Type"],
            "Company": item["Company"],
            "Company_Product": item["Product_Name"],
            "Vehicle_Type": random.choice(["Hatchback", "Sedan", "SUV", "Electric"]),
            "Manufacture_Year": datetime.now().year - car_age,
            "IDV_Percentage": max(80 - car_age, 20),
            "Accident_History": random.choices(["None", "Minor", "Major"], weights=[0.7, 0.25, 0.05])[0],
            "Current_Value": depreciated_value,
            "Buyback_Eligible": car_age < 8
        })

    return pd.DataFrame(data)

# =============================================
# GENERATE AND SAVE DATASETS
# =============================================
life_df = generate_life_policies()
health_df = generate_health_policies()
motor_df = generate_motor_policies()

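# Inspect product-name counts. Balance is enforced per (Company, Policy_Type)
# pair; product names are drawn at random within each pair, so these counts vary.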
print("\nLife Insurance Product Distribution:")
print(life_df['Company_Product'].value_counts())

print("\nHealth Insurance Product Distribution:")
print(health_df['Company_Product'].value_counts())

print("\nMotor Insurance Product Distribution:")
print(motor_df['Company_Product'].value_counts())

# Save to CSV
life_df.to_csv("balanced_life_insurance.csv", index=False)
health_df.to_csv("balanced_health_insurance.csv", index=False)
motor_df.to_csv("balanced_motor_insurance.csv", index=False)


Life Insurance Product Distribution:
Company_Product
LIC Forever Secure                 21
ICICI Pru Life Permanent Shield    21
LIC Permanent Shield               21
Max Life Forever Secure            21
SBI Life Lifetime Cover            20
Max Life Permanent Shield          20
HDFC Life Permanent Shield         20
HDFC Life Secure Future            19
SBI Life Life Protect              19
SBI Life Forever Secure            18
ICICI Pru Life Lifetime Cover      18
LIC Term Elite                     17
HDFC Life Wealth Builder           17
HDFC Life Forever Secure           17
Max Life Secure Future             17
Max Life Money Back Plus           17
Max Life Savings Plan              17
LIC Secure Future                  16
ICICI Pru Life Term Elite          16
SBI Life Money Back Plus           16
HDFC Life Lifetime Cover           16
HDFC Life Savings Plan             15
ICICI Pru Life Forever Secure      15
SBI Life Savings Plan              15
HDFC Life Term Elite              
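
The counts above are uneven because product names are chosen at random within each company and policy type. The balancing itself applies to (Company, Policy_Type) pairs, and one way to check it is a cross-tabulation. The sketch below rebuilds the same even-split logic on a small toy frame; the shortened company list, the record count of 20, and the name `toy_df` are assumptions for illustration.

```python
from itertools import product

import pandas as pd

companies = ["LIC", "HDFC Life", "SBI Life"]
policy_types = ["Term Plan", "Endowment", "Whole Life"]
num_records = 20

# Same even-split rule as generate_balanced_products: the first `extra`
# combinations receive one additional record
combos = list(product(companies, policy_types))
base, extra = divmod(num_records, len(combos))
rows = []
for i, (company, policy_type) in enumerate(combos):
    count = base + (1 if i < extra else 0)
    rows.extend({"Company": company, "Policy_Type": policy_type} for _ in range(count))
toy_df = pd.DataFrame(rows)

# Rows are companies, columns are policy types, cells are record counts
counts = pd.crosstab(toy_df["Company"], toy_df["Policy_Type"])
print(counts)

spread = int(counts.to_numpy().max() - counts.to_numpy().min())
print("Largest difference between cells:", spread)
```

Running `pd.crosstab(life_df["Company"], life_df["Policy_Type"])` on the generated frames should show the same pattern, with cell counts differing by at most one.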

In [29]:
life_df.isnull().sum()

Policy_Number    0
Customer_Name    0
Age              0
Policy_Type      0
Sum_Assured      0
Term             0
Company          0
Nominee          0
Smoker           0
dtype: int64

In [13]:
health_df.isnull().sum()

Policy_Number           0
Customer_Name           0
Age                     0
Sum_Insured             0
Company                 0
Policy_Type             0
Medical_History         0
Last_Hospitalization    0
Preventive_Checkup      0
dtype: int64

In [15]:
motor_df.isnull().sum()

Policy_Number       0
Vehicle_Number      0
Vehicle_Type        0
Company             0
Policy_Type         0
Manufacture_Year    0
IDV_Percentage      0
Accident_History    0
Current_Value       0
Buyback_Eligible    0
dtype: int64
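
The zero counts above do not mean that every field carries information: missing medical histories and hospitalizations are stored as the string "None", which `isnull()` does not treat as null. A minimal sketch of counting and converting those placeholders, using a hand-written toy frame (the rows and the name `toy_df` are assumptions for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Medical_History": ["None", "Diabetes", "None"],
    "Last_Hospitalization": ["None", "2023-04-11", "2021-09-30"],
})
cols = ["Medical_History", "Last_Hospitalization"]

# The placeholder strings are ordinary values, so isnull() reports no nulls
print(toy_df[cols].isnull().sum())

# Count the placeholders explicitly
placeholder_counts = toy_df[cols].eq("None").sum()
print(placeholder_counts)

# Replace placeholders with real missing values, then count again
cleaned = toy_df[cols].mask(toy_df[cols].eq("None"))
missing_counts = cleaned.isna().sum()
print(missing_counts)
```

The same two steps can be applied to `health_df`, and to `Accident_History` in `motor_df` if "None" there should also count as missing rather than as a category.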