<a href="https://www.kaggle.com/code/aryanprajapati33/exploratory-analysis-of-sih-2025-dataset?scriptVersionId=289976681" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# Standard library
import os
import warnings

# Analysis and plotting libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
warnings.filterwarnings("ignore", category=RuntimeWarning)

# List every file under the read-only Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Introduction

This notebook presents an exploratory data analysis (EDA) of the **Smart India Hackathon (SIH) 2025** dataset.

The objective is to:
- Validate data quality
- Understand distribution of problem statements and teams
- Analyze institutional and geographic participation
- Examine outcome and prize patterns

This analysis is intended to demonstrate dataset usability and provide baseline insights.


## 📂 Dataset Loading
We begin by loading the dataset and previewing its structure.


In [None]:
df = pd.read_csv("/kaggle/input/smart-india-hackathon-2025-team-outcomes/sih_2025_problem_statements_team_outcomes.csv")

df.head()


In [None]:
# A notebook cell only displays its last expression, so print the shape explicitly
print(df.shape)
df.info()
df.isna().sum()
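
The data-quality check above can be condensed into a single table. The following is a minimal sketch, assuming a hypothetical helper `summarize_quality` and a toy frame with made-up rows (not the real dataset); in the notebook, `df` would be passed instead.

```python
import pandas as pd

def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column dtype, missing count, and missing percentage."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(2),
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)

# Toy rows for illustration only; replace `toy` with `df` in the notebook.
toy = pd.DataFrame({
    "institute_state": ["Gujarat", "Kerala", None, "Gujarat"],
    "status": ["Winner", "Shortlisted", "Shortlisted", "Winner"],
    "prize_money": [150000, None, None, 150000],
})

quality = summarize_quality(toy)
duplicate_rows = int(toy.duplicated().sum())  # rows identical to an earlier row

print(quality)
print("Fully duplicated rows:", duplicate_rows)
```

Reporting the missing percentage alongside the raw count makes it easier to judge whether a column is usable, and the duplicate count flags rows that may have been entered twice.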

### 🗺️ Top 10 States by Participation

This analysis shows which states contribute the highest number of submissions,
highlighting regional participation trends.

In [None]:
top_states = (
    df['institute_state']
    .value_counts()
    .head(10)
    .reset_index()
)

top_states.columns = ['State', 'Count']
top_states

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(
    data=top_states,
    y='State',
    x='Count',
    color='#6A5ACD'
)
plt.title("Top 10 States by Number of Entries")
plt.xlabel("Number of Entries")
plt.ylabel("State")
plt.show()

### 📌 Insight:
Participation is unevenly distributed across states,
with a few states dominating the overall dataset.
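
One way to put a number on "a few states dominating" is the share of all entries held by the top few states. This is a minimal sketch, assuming a hypothetical helper `top_k_share` and placeholder state labels; in the notebook it could be called on `df['institute_state']`.

```python
import pandas as pd

def top_k_share(values: pd.Series, k: int = 10) -> float:
    """Fraction of non-missing entries that belong to the k most frequent values."""
    counts = values.value_counts()  # sorted from most to least frequent
    return counts.head(k).sum() / counts.sum()

# Placeholder labels for illustration only, not real participation figures.
toy_states = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 1 + ["D"] * 1)

share = top_k_share(toy_states, k=2)
print(f"Top 2 states hold {share:.0%} of entries")
```

A share close to 1 for a small `k` would support the claim of uneven participation; a share close to `k` divided by the number of states would contradict it.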

### 🏢 Top 10 Institutes by Number of Entries

This section highlights the institutes with the highest participation in
SIH 2025, identifying the institutions that contribute the most projects.


In [None]:
top_institutes = (
    df['institute_name']
    .value_counts()
    .head(10)
    .reset_index()
)

top_institutes.columns = ['Institute', 'Count']
top_institutes


In [None]:
plt.figure(figsize=(9,5))
sns.barplot(
    data=top_institutes,
    y='Institute',
    x='Count',
    color='#2E86C1'
)
plt.title("Top 10 Institutes by Number of Entries")
plt.xlabel("Number of Entries")
plt.ylabel("Institute")
plt.show()


### 📌 Insight:
A small number of institutes contribute a disproportionately high number of entries,
suggesting strong institutional engagement and awareness of the SIH program.


### 🏆 Winner-Only Analysis

This section focuses exclusively on projects that achieved **Winner**,
**Joint Winner**, or **Other Awards** status, helping identify regions with strong performance
rather than just high participation.


In [None]:
# Statuses counted as a winning outcome in this analysis
award_statuses = [
    'Winner', 'Joint Winner', 'Consolation Prize',
    'Future Innovators Award', 'Quantum Frontier Award', 'Girls Achiever Award',
    'First Prize', 'Second Prize', 'Third Prize',
]
winner_df = df[df['status'].isin(award_statuses)]
winner_df.head()


In [None]:
top_winner_states = (
    winner_df['institute_state']
    .value_counts()
    .head(10)
    .reset_index()
)

top_winner_states.columns = ['State', 'Winner Count']
top_winner_states


In [None]:
plt.figure(figsize=(8,5))
sns.barplot(
    data=top_winner_states,
    y='State',
    x='Winner Count',
    color='#F39C12'
)
plt.title("Top 10 States with Winning Teams")
plt.xlabel("Number of Winning Teams")
plt.ylabel("State")
plt.show()


### 📌 Insight:
Winning teams are concentrated in a limited number of states,
suggesting that higher participation does not always translate
into higher success rates.
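
Raw winner counts alone cannot separate success from sheer volume, so a per-state win rate (winners divided by entries) is a more direct check of this insight. The sketch below assumes a hypothetical helper `win_rate_by_state`, reuses the `institute_state` and `status` column names and the award statuses from the winner filter, and runs on made-up rows with placeholder state labels.

```python
import pandas as pd

# Award statuses treated as wins, copied from the notebook's winner filter.
AWARD_STATUSES = [
    "Winner", "Joint Winner", "Consolation Prize",
    "Future Innovators Award", "Quantum Frontier Award", "Girls Achiever Award",
    "First Prize", "Second Prize", "Third Prize",
]

def win_rate_by_state(frame: pd.DataFrame, min_entries: int = 1) -> pd.DataFrame:
    """Entries, winners, and winners/entries for each state."""
    entries = frame.groupby("institute_state").size().rename("entries")
    winners = (
        frame[frame["status"].isin(AWARD_STATUSES)]
        .groupby("institute_state")
        .size()
        .rename("winners")
    )
    # Outer-align on state; states without any winner get a count of 0
    rates = pd.concat([entries, winners], axis=1)
    rates["winners"] = rates["winners"].fillna(0).astype(int)
    rates["win_rate"] = rates["winners"] / rates["entries"]
    # Optionally drop states with too few entries for a meaningful rate
    rates = rates[rates["entries"] >= min_entries]
    return rates.sort_values("win_rate", ascending=False)

# Made-up rows for illustration only; replace `toy` with `df` in the notebook.
toy = pd.DataFrame({
    "institute_state": ["A", "A", "A", "A", "B", "B", "C"],
    "status": ["Winner", "Shortlisted", "Shortlisted", "Shortlisted",
               "Joint Winner", "Shortlisted", "Shortlisted"],
})

rates = win_rate_by_state(toy)
print(rates)
```

Setting `min_entries` to a larger value keeps states with only a handful of entries from topping the ranking on the strength of a single win.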


### 💰 Prize Money Distribution

This section analyzes how prize money is distributed among projects.
It helps understand whether rewards are evenly spread or concentrated
among a small number of teams.


In [None]:
# Convert prize_money to numeric (safe conversion)
df['prize_money'] = pd.to_numeric(df['prize_money'], errors='coerce')

# Consider only rows where prize money exists
prize_df = df[df['prize_money'].notna()]

prize_df[['prize_money']].head()


In [None]:
prize_counts = (
    prize_df['prize_money']
    .astype(int)
    .value_counts()
    .sort_index()
    .reset_index()
)

prize_counts.columns = ['Prize Money', 'Number of Teams']
prize_counts


In [None]:
plt.figure(figsize=(8,5))
sns.barplot(
    data=prize_counts,
    x='Prize Money',
    y='Number of Teams',
    color='#AF7AC5'
)
plt.title("Distribution of Prize Money")
plt.xlabel("Prize Money Amount")
plt.ylabel("Number of Teams")
plt.show()


### 📌 Insight:
Prize money distribution is highly skewed,
with a small number of teams receiving higher rewards,
indicating a strong emphasis on top-performing projects.
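
The skew can be made concrete by comparing, for each prize tier, the share of teams with the share of total prize money. This is a rough sketch, assuming a hypothetical helper `prize_tier_summary`; the amounts below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

def prize_tier_summary(prizes: pd.Series) -> pd.DataFrame:
    """Per prize amount: team count, total money, and shares of both."""
    clean = pd.to_numeric(prizes, errors="coerce").dropna()
    tiers = (
        pd.DataFrame({"prize_money": clean})
        .groupby("prize_money")["prize_money"]
        .agg(teams="count", total="sum")
    )
    tiers["team_share"] = tiers["teams"] / tiers["teams"].sum()
    tiers["money_share"] = tiers["total"] / tiers["total"].sum()
    return tiers

# Invented amounts for illustration only; pass df['prize_money'] in the notebook.
toy_prizes = pd.Series([1000] * 6 + [5000] * 3 + [10000] * 1)

tiers = prize_tier_summary(toy_prizes)
print(tiers)
```

When a tier's `money_share` is much larger than its `team_share`, a small group of teams is receiving a large part of the rewards, which is the pattern the insight describes.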


## 🔑 Key Insights Summary

- Participation is highly concentrated among a small number of states and institutes,
  indicating uneven regional and institutional engagement.

- While some states and institutes submit a large number of projects,
  winning outcomes are more concentrated, showing that high participation
  does not always translate into higher success rates.

- The majority of projects fall under the **Shortlisted** category,
  highlighting the competitive nature of the evaluation process.

- Winner-only analysis reveals regional centers of excellence,
  where fewer entries result in a higher proportion of winning teams.

- Overall, the dataset reflects strong competition with limited winning slots,
  emphasizing quality and innovation over quantity.
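
The point about the **Shortlisted** category can be checked directly from the status column. The following is a minimal sketch, assuming a hypothetical helper `status_shares` and a made-up status series; in the notebook it could be applied to `df['status']`.

```python
import pandas as pd

def status_shares(status: pd.Series) -> pd.Series:
    """Share of rows per status, including missing values, largest first."""
    return status.value_counts(normalize=True, dropna=False)

# Made-up statuses for illustration only, not the real distribution.
toy_status = pd.Series(["Shortlisted"] * 7 + ["Winner"] * 2 + ["Joint Winner"])

shares = status_shares(toy_status)
# A status is a strict majority only if it covers more than half of all rows
majority = shares.index[0] if shares.iloc[0] > 0.5 else None

print(shares)
print("Majority status:", majority)
```

Using `normalize=True` returns proportions rather than counts, so the majority test is a simple comparison against 0.5.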
