# Examining Crop Insurance Scheme from India

Rushikesh Jadhav

### What is your current goal? Has it changed since the proposal?
- Explore the data for trends in insurance coverage and uptake rates across states, given rainfall and farmer demographics such as gender and caste. The goal hasn't changed, but I am realizing that I am limited by my explanatory variables.

### Are there data challenges you are facing? Are you currently depending on mock data?
- I am not facing any specific data challenges. I have managed to gather, clean, and add more data as needed. I added a new dataset to get the total number of farmers by state from the following source:
    - RAJYA SABHA SESSION - 265 UNSTARRED QUESTION No 1281. ANSWERED ON, 2nd August 2024. Data Figures are in Number. Source - Department of Agriculture and Farmers Welfare (Agriculture Census 2015-16). https://www.data.gov.in/resource/stateut-wise-total-number-small-and-marginal-operational-holdings-farmers-country-31st
- I am having trouble creating choropleth maps. I have removed the code from this file but I will seek help soon.

### Describe each of the provided images with 2-3 sentences to give the context and how it relates to your goal.

Tip: The markdown syntax ![](image-name.png) will let you embed images directly, or you can number them and describe them by number in this file.

A description of each image appears below it in a markdown box.

### What form do you envision your final narrative taking? (e.g. An article incorporating the images? A poster? An infographic?)

In [1]:
import pandas as pd
import numpy as np
import altair as alt

In [2]:
crop_df = pd.read_csv('/Users/rushikesh/Library/CloudStorage/OneDrive-TheUniversityofChicago/Autumn 2025/Data Vis/static_project/pmfby-district-level.csv')
rain_df = pd.read_csv('/Users/rushikesh/Library/CloudStorage/OneDrive-TheUniversityofChicago/Autumn 2025/Data Vis/static_project/daily-rainfall-data-district-level.csv')

In [3]:
rain_df['date'] = pd.to_datetime(rain_df['date'])
print(rain_df.dtypes)

#rain_df['date'].min()
#rain_df['date'].max()

id                        int64
date             datetime64[ns]
state_code                int64
state_name               object
district_code             int64
district_name            object
actual                  float64
rfs                     float64
normal                  float64
deviation               float64
dtype: object


In [4]:
rain_df = rain_df[(rain_df['date'].dt.year >= 2018) & (rain_df['date'].dt.year <= 2022)]
first_date = rain_df['date'].min()
last_date = rain_df['date'].max()

first_date, last_date

(Timestamp('2018-01-01 00:00:00'), Timestamp('2022-12-31 00:00:00'))

In [5]:
# missing values 
rain_df['actual'] = rain_df['actual'].fillna(0) 

In [6]:
# season and year columns
def get_season_year(d):
    if 5 <= d.month <= 9:
        season = 1
        year = d.year
    else:
        season = 0
        year = d.year + 1 if d.month >= 10 else d.year
    return season, year
rain_df[['season', 'year']] = rain_df['date'].apply(
    lambda x: pd.Series(get_season_year(x))
)
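
The row-wise `apply` above can be slow on daily district-level data. As a sketch of an alternative, the same season and year rule can be written with vectorized `np.where` calls. The helper name `add_season_year` and the toy dates are assumptions for illustration; the rule itself (May to September is season 1, October to December roll into the next year) is copied from the cell above.

```python
import numpy as np
import pandas as pd

def add_season_year(df, date_col="date"):
    """Return a copy of df with vectorized 'season' and 'year' columns."""
    out = df.copy()
    month = out[date_col].dt.month
    year = out[date_col].dt.year
    # Season 1 (Kharif) covers May-September; every other month is season 0 (Rabi).
    out["season"] = np.where(month.between(5, 9), 1, 0)
    # October-December are assigned to the following year's Rabi season.
    out["year"] = np.where(month >= 10, year + 1, year)
    return out

# Toy dates to illustrate the rule.
demo = pd.DataFrame({"date": pd.to_datetime(["2018-06-15", "2018-11-01", "2019-02-10"])})
demo = add_season_year(demo)
```

With this rule, November 2018 and February 2019 both land in the 2019 Rabi season, while June 2018 stays in the 2018 Kharif season.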

In [7]:
# Grouping and aggregating.
rain_df_grouped = rain_df.groupby(['year', 'season', 'state_name', 'district_name']).agg(
    total_rainfall=('actual', 'sum'),
    days_recorded=('actual', 'count'), 
    total_normal_rainfall=('normal', 'sum'),
    net_deviation=('deviation', 'sum'),
    avg_rfs=('rfs', 'mean')
).round(2).reset_index()

In [8]:
rain_df_grouped.head(10)


Unnamed: 0,year,season,state_name,district_name,total_rainfall,days_recorded,total_normal_rainfall,net_deviation,avg_rfs
0,2018,0,Andaman And Nicobar Islands,Nicobars,0.0,120,125.5,0.0,0.0
1,2018,0,Andaman And Nicobar Islands,North And Middle Andaman,0.0,120,339.3,0.0,0.0
2,2018,0,Andaman And Nicobar Islands,South Andamans,0.0,120,0.0,0.0,
3,2018,0,Andhra Pradesh,Anantapur,35.41,120,25.3,380.44,0.2
4,2018,0,Andhra Pradesh,Chittoor,38.79,120,49.3,-1487.65,0.17
5,2018,0,Andhra Pradesh,East Godavari,62.73,120,48.7,-725.4,0.2
6,2018,0,Andhra Pradesh,Guntur,11.5,120,34.6,-2819.5,0.04
7,2018,0,Andhra Pradesh,Krishna,9.02,120,36.4,-7076.35,0.02
8,2018,0,Andhra Pradesh,Kurnool,20.42,120,24.8,-2756.53,0.11
9,2018,0,Andhra Pradesh,Prakasam,45.03,120,36.5,1552.48,0.23


In [9]:
# flag for season
crop_df['season_flag'] = crop_df['season'].map({'Kharif': 1, 'Rabi': 0})

In [10]:
print(crop_df.columns)

Index(['id', 'year', 'season', 'scheme', 'state_name', 'state_code',
       'district_name', 'district_code', 'farmer_count', 'loanee',
       'non_loanee', 'area_insured', 'sum_insured', 'farmer_share',
       'goi_share', 'state_share', 'male', 'female', 'transgender', 'sc', 'st',
       'obc', 'gen', 'marginal', 'small', 'other', 'iu_count', 'gross_premium',
       'season_flag'],
      dtype='object')


In [11]:
# Convert year to datetime, using January 1 as a placeholder day and month
crop_df['year_date'] = pd.to_datetime(crop_df['year'].astype(str) + '-01-01')

count_variables = [
    'farmer_count',        
    'loanee',              
    'non_loanee',          
    'area_insured',        
    'sum_insured',         
    'farmer_share',        
    'goi_share',           
    'state_share',         
    'iu_count',            
    'gross_premium'        
]

percent_variables = [
    'male', 'female', 'transgender',  
    'sc', 'st', 'obc', 'gen',         
    'marginal', 'small', 'other'      
]

In [12]:
# countable variables summed 
grouped_counts = crop_df.groupby([
    'year', 
    'season_flag', 
    'state_name', 
    'district_name',
    'district_code'
]).agg({
    **{var: 'sum' for var in count_variables},
    'scheme': lambda x: x.mode()[0] if len(x.mode()) > 0 else x.iloc[0]
}).reset_index()

# percentage variables weighted by farmer_count
def calculate_weighted_averages(group):
    result = {}
    for var in percent_variables:
        if var in crop_df.columns:
            weights = group['farmer_count']
            values = group[var]
            if weights.sum() == 0:
                result[var] = values.mean()
            else:
                result[var] = np.average(values, weights=weights)
    return pd.Series(result)

# Apply weighted averages to each group
weighted_percentages = crop_df.groupby([
    'year', 
    'season_flag', 
    'state_name', 
    'district_name',
    'district_code'
]).apply(calculate_weighted_averages).reset_index()

# Merge the counts and weighted percentages
grouped_insurance = pd.merge(
    grouped_counts,
    weighted_percentages,
    on=['year', 'season_flag', 'state_name', 'district_name', 'district_code'],
    how='left'
)

grouped_insurance.head(10)


Unnamed: 0,year,season_flag,state_name,district_name,district_code,farmer_count,loanee,non_loanee,area_insured,sum_insured,...,male,female,transgender,sc,st,obc,gen,marginal,small,other
0,2018,0,Andhra Pradesh,Anantapur,502.0,59776.0,7073.0,98303.0,143.99,37195.47,...,71.132408,28.828077,0.039515,3.663936,0.519218,34.332201,61.484646,14.988915,69.902551,15.108534
1,2018,0,Andhra Pradesh,Chittoor,503.0,20504.0,13718.0,9238.0,229.73,10903.9,...,73.919839,26.060334,0.019827,5.794519,1.466364,36.093725,56.645392,24.409827,64.504166,11.086007
2,2018,0,Andhra Pradesh,East Godavari,505.0,9434.0,9843.0,950.0,18.95,7657.08,...,66.676733,33.286693,0.036574,3.017708,2.917062,42.825552,51.239678,26.268086,54.978435,18.753479
3,2018,0,Andhra Pradesh,Guntur,506.0,7352.0,6650.0,2688.0,7.78,5166.87,...,67.306647,32.563512,0.129841,4.72421,0.469425,42.847484,51.958882,22.72215,69.92116,7.356689
4,2018,0,Andhra Pradesh,Krishna,510.0,4782.0,4791.0,1755.0,27.65,4961.06,...,66.594981,33.385709,0.01931,5.186192,0.600088,35.554743,58.658977,28.852309,49.548833,21.598858
5,2018,0,Andhra Pradesh,Kurnool,511.0,117205.0,17674.0,167773.0,188.53,88149.63,...,75.725678,24.244338,0.029984,6.617269,0.873666,44.744963,47.764101,9.290796,81.430959,9.278246
6,2018,0,Andhra Pradesh,Prakasam,517.0,78686.0,7747.0,108299.0,105.11,52071.39,...,69.340314,30.629695,0.029991,5.000735,1.013081,25.356782,68.629402,10.249134,83.470353,6.280514
7,2018,0,Andhra Pradesh,Srikakulam,519.0,12180.0,9957.0,7190.0,20.6,6647.26,...,73.951995,25.955353,0.092652,2.502719,5.653642,71.592897,20.250742,16.94914,78.523057,4.527803
8,2018,0,Andhra Pradesh,Vizianagaram,521.0,14021.0,10472.0,14191.0,135.05,8271.89,...,73.303978,26.686912,0.00911,1.467067,2.867067,76.081585,19.584281,25.234659,68.356122,6.409219
9,2018,0,Andhra Pradesh,West Godavari,523.0,10222.0,11380.0,1038.0,15.85,8904.1,...,65.802491,34.003829,0.19368,4.214517,2.643216,49.304591,43.837677,23.112788,65.291115,11.596097


In [13]:
# Merge the crop insurance data with rainfall data
merged_df = pd.merge(
    grouped_insurance,
    rain_df_grouped,
    left_on=['year', 'season_flag', 'state_name', 'district_name'],
    right_on=['year', 'season', 'state_name', 'district_name'],
    how='left',  
    suffixes=('_insurance', '_rainfall')
)

merged_df.head()

Unnamed: 0,year,season_flag,state_name,district_name,district_code,farmer_count,loanee,non_loanee,area_insured,sum_insured,...,gen,marginal,small,other,season,total_rainfall,days_recorded,total_normal_rainfall,net_deviation,avg_rfs
0,2018,0,Andhra Pradesh,Anantapur,502.0,59776.0,7073.0,98303.0,143.99,37195.47,...,61.484646,14.988915,69.902551,15.108534,0.0,35.41,120.0,25.3,380.44,0.2
1,2018,0,Andhra Pradesh,Chittoor,503.0,20504.0,13718.0,9238.0,229.73,10903.9,...,56.645392,24.409827,64.504166,11.086007,0.0,38.79,120.0,49.3,-1487.65,0.17
2,2018,0,Andhra Pradesh,East Godavari,505.0,9434.0,9843.0,950.0,18.95,7657.08,...,51.239678,26.268086,54.978435,18.753479,0.0,62.73,120.0,48.7,-725.4,0.2
3,2018,0,Andhra Pradesh,Guntur,506.0,7352.0,6650.0,2688.0,7.78,5166.87,...,51.958882,22.72215,69.92116,7.356689,0.0,11.5,120.0,34.6,-2819.5,0.04
4,2018,0,Andhra Pradesh,Krishna,510.0,4782.0,4791.0,1755.0,27.65,4961.06,...,58.658977,28.852309,49.548833,21.598858,0.0,9.02,120.0,36.4,-7076.35,0.02
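
Because this is a left merge on state and district names, any insurance row whose names are spelled differently in the rainfall data silently gets missing rainfall values. One way to check for this is the `indicator` option of `pd.merge`. The sketch below uses made-up toy frames in place of `grouped_insurance` and `rain_df_grouped`; the names `checked`, `unmatched`, and `match_rate` are illustrative.

```python
import pandas as pd

# Toy stand-ins for grouped_insurance and rain_df_grouped (values are made up).
insurance = pd.DataFrame({
    "year": [2018, 2018],
    "season_flag": [0, 0],
    "state_name": ["Andhra Pradesh", "Andhra Pradesh"],
    "district_name": ["Anantapur", "Ananthapuramu"],
    "farmer_count": [100, 50],
})
rainfall = pd.DataFrame({
    "year": [2018],
    "season": [0],
    "state_name": ["Andhra Pradesh"],
    "district_name": ["Anantapur"],
    "total_rainfall": [35.41],
})

checked = pd.merge(
    insurance,
    rainfall,
    left_on=["year", "season_flag", "state_name", "district_name"],
    right_on=["year", "season", "state_name", "district_name"],
    how="left",
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# Insurance rows that found no rainfall record.
unmatched = checked.loc[checked["_merge"] == "left_only", ["state_name", "district_name"]]
# Share of insurance rows with a rainfall match.
match_rate = (checked["_merge"] == "both").mean()
```

Listing `unmatched` shows which district names would need harmonizing before the rainfall columns can be trusted.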


In [14]:
# Load the farmer count dataset
farmer_population = pd.read_csv('/Users/rushikesh/Library/CloudStorage/OneDrive-TheUniversityofChicago/Autumn 2025/Data Vis/static_project/farmer_pop.csv')
# Clean
farmer_population['total_farmers'] = farmer_population['Marginal'] + farmer_population['Small'] + farmer_population.get('Other', 0)
farmer_population['state_name'] = farmer_population['State/UT'].str.strip()

# Merge with your existing data
merged_final = pd.merge(
    merged_df,
    farmer_population[['state_name', 'total_farmers', 'Marginal', 'Small']],
    on='state_name',
    how='left'
)

# Calculate insurance penetration rate
merged_final['insurance_penetration'] = (merged_final['farmer_count'] / merged_final['total_farmers'] * 100)
merged_final = merged_final[merged_final['state_name'] != 'all_state']

merged_final.head()


Unnamed: 0,year,season_flag,state_name,district_name,district_code,farmer_count,loanee,non_loanee,area_insured,sum_insured,...,season,total_rainfall,days_recorded,total_normal_rainfall,net_deviation,avg_rfs,total_farmers,Marginal,Small,insurance_penetration
0,2018,0,Andhra Pradesh,Anantapur,502.0,59776.0,7073.0,98303.0,143.99,37195.47,...,0.0,35.41,120.0,25.3,380.44,0.2,7550285.0,5904039.0,1646246.0,0.791705
1,2018,0,Andhra Pradesh,Chittoor,503.0,20504.0,13718.0,9238.0,229.73,10903.9,...,0.0,38.79,120.0,49.3,-1487.65,0.17,7550285.0,5904039.0,1646246.0,0.271566
2,2018,0,Andhra Pradesh,East Godavari,505.0,9434.0,9843.0,950.0,18.95,7657.08,...,0.0,62.73,120.0,48.7,-725.4,0.2,7550285.0,5904039.0,1646246.0,0.124949
3,2018,0,Andhra Pradesh,Guntur,506.0,7352.0,6650.0,2688.0,7.78,5166.87,...,0.0,11.5,120.0,34.6,-2819.5,0.04,7550285.0,5904039.0,1646246.0,0.097374
4,2018,0,Andhra Pradesh,Krishna,510.0,4782.0,4791.0,1755.0,27.65,4961.06,...,0.0,9.02,120.0,36.4,-7076.35,0.02,7550285.0,5904039.0,1646246.0,0.063335
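
Note that `farmer_count` here is a district total while `total_farmers` is a state total, so each row's `insurance_penetration` is that district's insured farmers as a share of all farmers in the state. If a state-level rate is what the later charts need, one option is to sum insured farmers per state, year, and season before dividing. The sketch below uses a made-up toy frame in place of `merged_final`; the column name `insured_farmers` is an assumption.

```python
import pandas as pd

# Toy stand-in for merged_final (values are made up).
toy = pd.DataFrame({
    "year": [2018, 2018, 2018],
    "season_flag": [0, 0, 0],
    "state_name": ["A", "A", "B"],
    "farmer_count": [100, 300, 50],
    # The state total is repeated on every district row after the merge.
    "total_farmers": [1000, 1000, 500],
})

state_penetration = (
    toy.groupby(["year", "season_flag", "state_name"], as_index=False)
    .agg(
        insured_farmers=("farmer_count", "sum"),   # add up districts
        total_farmers=("total_farmers", "first"),  # state total is constant
    )
)
state_penetration["insurance_penetration"] = (
    state_penetration["insured_farmers"] / state_penetration["total_farmers"] * 100
)
```

In the toy data, state A comes out at 40% even though its two district rows would show 10% and 30% individually (a mean of 20%).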


In [15]:
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

In [16]:
caste_data = merged_final.melt(
    id_vars=['state_name', 'farmer_count'],
    value_vars=['sc', 'st', 'obc', 'gen'],
    var_name='caste_group',
    value_name='percentage'
)

# unweighted mean of district-level percentages within each state
Chart1 = alt.Chart(caste_data).mark_bar().encode(
    x=alt.X('state_name:N', title='State', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('mean(percentage):Q', title='Percentage of Beneficiaries', stack="normalize"),
    color=alt.Color('caste_group:N', title='Caste Group',
                   scale=alt.Scale(domain=['sc', 'st', 'obc', 'gen'],
                                 range=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])),
    tooltip=['state_name', 'caste_group', 'mean(percentage)']
).properties(
    title='Distribution of Insurance Beneficiaries by Caste and State',
    width=700,
    height=400
)

Chart1

Shows which social groups benefit most from crop insurance, revealing whether marginalized communities have equitable access to agricultural safety nets.
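
The chart above takes a simple mean of district percentages, so a small district counts as much as a large one. If the state shares should instead reflect the number of insured farmers, a weighted version could be computed in pandas before plotting. This is a sketch on made-up toy data; the column names `weighted` and `weighted_pct` are assumptions.

```python
import pandas as pd

# Toy stand-in for merged_final (values are made up).
toy = pd.DataFrame({
    "state_name": ["A", "A"],
    "farmer_count": [100, 300],
    "sc": [10.0, 30.0],
    "st": [90.0, 70.0],
})

long = toy.melt(
    id_vars=["state_name", "farmer_count"],
    value_vars=["sc", "st"],
    var_name="caste_group",
    value_name="percentage",
)
# Weight each district percentage by its insured-farmer count.
long["weighted"] = long["percentage"] * long["farmer_count"]

sums = long.groupby(["state_name", "caste_group"], as_index=False)[
    ["weighted", "farmer_count"]
].sum()
# Weighted mean = sum(percentage * weight) / sum(weight).
sums["weighted_pct"] = sums["weighted"] / sums["farmer_count"]
```

Here the weighted `sc` share is 25%, compared with an unweighted mean of 20%. A state whose farmer counts sum to zero would produce a missing value and would need separate handling.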

In [17]:
size_data = merged_final.melt(
    id_vars=['state_name'],
    value_vars=['marginal', 'small', 'other'],
    var_name='farmer_size',
    value_name='percentage'
)

chart2 = alt.Chart(size_data).mark_bar().encode(
    x=alt.X('state_name:N', title='State', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('mean(percentage):Q', title='Percentage of Farmers', stack="normalize"),
    color=alt.Color('farmer_size:N', title='Farmer Size',
                   scale=alt.Scale(domain=['marginal', 'small', 'other'],
                                 range=['#8c564b', "#66e3df", "#5fa76f"])),
    tooltip=['state_name', 'farmer_size', 'mean(percentage)']
).properties(
    title='Distribution of Insurance Beneficiaries by Farm Size and State',
    width=700,
    height=400
)

chart2

Demonstrates if insurance programs primarily serve marginal/small farmers as intended, or if coverage skews toward larger landholders.

In [18]:
# Aggregate and calculate insurance per farmer and area per farmer 
state_agg = (
    merged_final.groupby('state_name', as_index=False)
    .agg({
        'total_rainfall': 'mean',
        'sum_insured': 'sum',
        'farmer_count': 'sum',
        'area_insured': 'sum'
    })
    .assign(
        insurance_per_farmer=lambda d: d['sum_insured'] / d['farmer_count'],
        area_per_farmer=lambda d: d['area_insured'] / d['farmer_count']
    )
)

# scatter plot 
scatter = alt.Chart(state_agg).mark_circle(size=100).encode(
    x=alt.X('total_rainfall:Q', title='Average Rainfall (mm)'),
    y=alt.Y('insurance_per_farmer:Q', title='Insurance per Farmer (₹)'),
    color=alt.Color('state_name:N', title='State'),
    tooltip=['state_name', 'total_rainfall', 'insurance_per_farmer', 'farmer_count']
)

# Vertical line showing the mean of the state-level average rainfall over these years
vline = alt.Chart(
    pd.DataFrame({'x': [state_agg['total_rainfall'].mean()]})).mark_rule(
    color='red', strokeWidth=2, strokeDash=[5, 5]
).encode(x='x:Q')

# put both on top of each other 
chart3 = (scatter + vline).properties(
    title='Average rainfall and Insurance amount per Farmer by State',
    width=600, height=400
)

chart3


Tests the main hypothesis: whether farmers in rain-dependent regions show higher insurance uptake, indicating rational risk management behavior.

In [27]:
# To separate the seasons of each year 
def insurance_heatmap_seasons(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    df['season_name'] = df['season_flag'].map({0: 'Rabi', 1: 'Kharif'})
    df['year_season'] = df['year'].astype(str) + ' ' + df['season_name']
    
    chart4 = alt.Chart(df).mark_rect().encode(
        x=alt.X('year_season:N', title='Year-Season', axis=alt.Axis(labelAngle=-45)),
        y=alt.Y('state_name:N', title='State', sort='-x'),
        color=alt.Color('mean(insurance_penetration):Q', 
                       title='Insurance Penetration %',
                       scale=alt.Scale(scheme='viridis', 
                                     domainMid=5,
                                     type='symlog')),
        tooltip=['state_name', 'year_season', 'mean(insurance_penetration)']
    ).properties(
        width=700,
        height=400,
        title='Insurance Penetration Heatmap by State and Season (8 Seasons)'
    )
    
    return chart4

insurance_heatmap_seasons(merged_final)

Reveals temporal patterns in insurance enrollment, showing whether uptake aligns with monsoon seasons when crop failure risk peaks. Also shows any long-term increase or decrease in uptake.

In [20]:
# state aggregates of male share and sum insured 
state_agg = merged_final.groupby('state_name').agg({
    'male': 'mean',
    'sum_insured': 'sum'
}).reset_index()

chart5 = alt.Chart(state_agg).mark_circle(size=80).encode(
    x=alt.X('male:Q', title='Male Farmers (%)'),
    y=alt.Y('sum_insured:Q', title='Total Sum Insured'),
    color=alt.Color('state_name:N', title='State'),
    tooltip=['state_name', 'sum_insured']
).properties(
    title='Male Ratio and Insurance coverage by state',
    width=600,
    height=400
)

chart5

Examines gender disparities in insurance access, highlighting whether women farmers are adequately covered. I am considering adding the number of women and men farmers per state to provide more context.
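
As a sketch of how those per-state counts might be derived, the percentages can be multiplied by `farmer_count` and summed by state. This assumes `male` and `female` are percentages of `farmer_count` in each row, which the printed values suggest but which should be confirmed against the data documentation. The toy values and the names `male_farmers` and `female_farmers` are made up.

```python
import pandas as pd

# Toy stand-in for merged_final (values are made up).
toy = pd.DataFrame({
    "state_name": ["A", "A", "B"],
    "farmer_count": [100, 300, 200],
    "male": [70.0, 60.0, 50.0],
    "female": [30.0, 40.0, 50.0],
})

# Convert row-level percentages into estimated head counts.
counts = toy.assign(
    male_farmers=toy["farmer_count"] * toy["male"] / 100,
    female_farmers=toy["farmer_count"] * toy["female"] / 100,
)

# Add the estimated counts up to the state level.
gender_by_state = counts.groupby("state_name", as_index=False)[
    ["male_farmers", "female_farmers"]
].sum()
```

The resulting counts could be shown in the tooltip or used to size the points in the scatter plot.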

In [21]:
# comparing marginal and small farmers by state
agg_df = merged_final.groupby('state_name', as_index=False).agg({
    'marginal': 'mean',         
    'small': 'mean',
    'farmer_count': 'sum'       
})
chart6 = alt.Chart(agg_df).mark_circle().encode(
    x=alt.X('marginal:Q', title='Marginal Farmers (%)'),
    y=alt.Y('small:Q', title='Small Farmers (%)'),
    size=alt.Size('farmer_count:Q', scale=alt.Scale(type='log'), title='Total Farmers (log)'),
    color=alt.Color('state_name:N', title='State'),
    tooltip=['state_name', 'farmer_count', 'marginal', 'small']
).properties(
    title='Marginal vs Small Farmers Distribution by State (Aggregated)',
    width=600,
    height=400
)

chart6


Compares the shares of marginal and small farmers among insured farmers in each state, showing whether the program effectively reaches the smallest landholders.

In [22]:
# Aggregate by state 
state_shares = merged_final.groupby('state_name').agg({
    'goi_share': 'sum',
    'state_share': 'sum'
}).reset_index()

state_shares['total_share'] = state_shares['goi_share'] + state_shares['state_share']
state_shares['goi_pct'] = state_shares['goi_share'] / state_shares['total_share']
state_shares['state_pct'] = state_shares['state_share'] / state_shares['total_share']

shares_long = state_shares.melt(
    id_vars='state_name',
    value_vars=['goi_pct', 'state_pct'],
    var_name='share_type',
    value_name='share_value'
)

chart7 = alt.Chart(shares_long).mark_bar().encode(
    x=alt.X('state_name:N', axis=alt.Axis(labelAngle=-45), title='State'),
    y=alt.Y('share_value:Q', stack='normalize', title='Share (%)', axis=alt.Axis(format='%')),
    color=alt.Color('share_type:N', title='Type')
).properties(
    title='GOI vs State share proportions by state',
    width=700,
    height=400
)

chart7

Tracks states' insurance performance through the split between GOI and state shares, identifying states that are taking the lead on agricultural insurance. I am considering adding 3 facet charts for 4 states showing temporal changes.

In [23]:
# Calculate average penetration rate per state per year
state_year_penetration = merged_final.groupby(['year', 'state_name']).agg({
    'insurance_penetration': 'mean'
}).reset_index()


Chart8 = alt.Chart(state_year_penetration).mark_line(point=True).encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('rank:O', title='Rank'),
    color=alt.Color('state_name:N', title='State'),
    tooltip=['state_name', 'insurance_penetration']
).transform_window(
    rank="rank()",
    sort=[alt.SortField('insurance_penetration', order='descending')],
    groupby=['year']
).properties(
    title='Bump Chart: State Rankings by Insurance Penetration Rate',
    width=600,
    height=400
)

Chart8

Shows which states have outperformed others in insurance penetration. This helps us observe the overall trends across the country. 
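
The ranks in the bump chart are computed inside the chart by `transform_window`, so they are not available for inspection in the DataFrame. A sketch of the same ranking done in pandas is shown below on made-up toy data; `method="min"` is chosen as an assumption so that tied states share the lower rank.

```python
import pandas as pd

# Toy stand-in for state_year_penetration (values are made up).
toy = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019, 2019],
    "state_name": ["A", "B", "C", "A", "B", "C"],
    "insurance_penetration": [0.5, 0.9, 0.1, 0.8, 0.3, 0.2],
})

# Rank states within each year; rank 1 is the highest penetration.
toy["rank"] = (
    toy.groupby("year")["insurance_penetration"]
    .rank(method="min", ascending=False)
    .astype(int)
)
```

In the toy data, state A moves from rank 2 in 2018 to rank 1 in 2019. Rows with a missing penetration value would get a missing rank, so they would need to be dropped or filled before the integer conversion.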

In [26]:
# Rainfall and Insurance coverage heatmap 
chart9 = alt.Chart(merged_final).mark_rect().encode(
    alt.X('total_rainfall:Q').bin(maxbins=30),
    alt.Y('insurance_penetration:Q').bin(maxbins=30),
    alt.Color('count():Q', 
             scale=alt.Scale(scheme='viridis', domainMid=5, type='symlog')) 
).properties(
    title='Rainfall and Insurance penetration',
    width=400,
    height=300
)

chart9

Provides a granular view of how rainfall levels correlate with insurance adoption across different penetration rates.

In [25]:
chart10 = alt.Chart(merged_final).mark_area().encode(
    x=alt.X('year:N', title='Year'),
    y=alt.Y('sum(gross_premium):Q', title= "Sum of Gross Premium"),
    color=alt.Color('state_name:N', title='State'),
).properties(
    title='Gross Premium trends by State',
    width=600,
    height=400
)

chart10

This shows which states have the largest gross premiums, which helps us understand which states have been prioritizing agricultural insurance. It would be interesting to split this by state and GOI shares to see whether the GOI or the states were pushing for more insurance plans.
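
A sketch of how that split might be prepared: reshape the share columns into long form and sum them by year and payer. This assumes `farmer_share`, `goi_share`, and `state_share` are the components of the premium, which the column names suggest but the data documentation should confirm. The toy values and the names `payer` and `amount` are made up.

```python
import pandas as pd

# Toy stand-in for merged_final (values are made up).
toy = pd.DataFrame({
    "year": [2018, 2018, 2019],
    "state_name": ["A", "B", "A"],
    "farmer_share": [10.0, 5.0, 12.0],
    "goi_share": [20.0, 10.0, 24.0],
    "state_share": [20.0, 15.0, 24.0],
})

premium_by_payer = (
    # One row per year, state, and payer.
    toy.melt(
        id_vars=["year", "state_name"],
        value_vars=["farmer_share", "goi_share", "state_share"],
        var_name="payer",
        value_name="amount",
    )
    # Total contribution of each payer in each year.
    .groupby(["year", "payer"], as_index=False)["amount"]
    .sum()
)
```

The long table could feed an area chart like the one above, with `payer` on the color channel in place of `state_name`.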