### Student Startup Success Prediction

The dataset contains 2,100 student-led startup projects from 40 academic institutions, collected between 2019 and 2023. Each record is one startup initiative, described by structural, strategic, and support-related attributes that may influence its success. Key variables include team size, average team experience, innovation score, funding amount, mentorship and incubation support, market readiness, and business model strength. The target variable, success_label, indicates whether a startup achieved measurable success. The goal is to identify which of these factors contribute most strongly to the success of student-founded startups.

In [2]:
import pandas as pd

Kaggle dataset: [Student Startup Success Dataset](https://www.kaggle.com/datasets/ziya07/student-startup-success-dataset/data)


In [3]:
#load dataset
df = pd.read_csv('/Users/nirvikarajendra/Downloads/student_startup_success_dataset-df.csv')

In [4]:
print("Initial data shape:", df.shape)
df.head()

Initial data shape: (2100, 16)


| | project_id | institution_name | institution_type | project_domain | team_size | avg_team_experience | innovation_score | funding_amount_usd | mentorship_support | incubation_support | market_readiness_level | competition_awards | business_model_score | technology_maturity | year | success_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P0001 | Institution_39 | Non-technical | AgriTech | 3 | 1.35 | 0.75 | 48336.75 | 1 | 1 | 3 | 3 | 0.86 | 4 | 2023 | 1 |
| 1 | P0002 | Institution_29 | Technical | FinTech | 4 | 1.57 | 0.38 | 30601.34 | 0 | 0 | 5 | 0 | 0.39 | 5 | 2023 | 0 |
| 2 | P0003 | Institution_15 | Non-technical | AgriTech | 3 | 2.19 | 0.61 | 37712.58 | 1 | 1 | 1 | 0 | 0.38 | 2 | 2019 | 1 |
| 3 | P0004 | Institution_8 | Private | GreenTech | 7 | 0.72 | 0.98 | 46881.0 | 1 | 1 | 5 | 1 | 0.69 | 5 | 2021 | 1 |
| 4 | P0005 | Institution_21 | Public | HealthTech | 7 | 2.64 | 0.33 | 29988.37 | 0 | 0 | 4 | 2 | 0.85 | 2 | 2020 | 0 |


In [5]:
print("\nColumns and Data Types:")
print(df.dtypes)


Columns and Data Types:
project_id                 object
institution_name           object
institution_type           object
project_domain             object
team_size                   int64
avg_team_experience       float64
innovation_score          float64
funding_amount_usd        float64
mentorship_support          int64
incubation_support          int64
market_readiness_level      int64
competition_awards          int64
business_model_score      float64
technology_maturity         int64
year                        int64
success_label               int64
dtype: object


In [6]:
duplicate_rows = df[df.duplicated()]
print("Number of duplicate rows:", duplicate_rows.shape[0])

Number of duplicate rows: 0


In [7]:
# Check for missing values (count of nulls per column)
df.isnull().sum()

project_id                0
institution_name          0
institution_type          0
project_domain            0
team_size                 0
avg_team_experience       0
innovation_score          0
funding_amount_usd        0
mentorship_support        0
incubation_support        0
market_readiness_level    0
competition_awards        0
business_model_score      0
technology_maturity       0
year                      0
success_label             0
dtype: int64

In [8]:
#Statistical Summary of Numeric Features
df.describe()

| | team_size | avg_team_experience | innovation_score | funding_amount_usd | mentorship_support | incubation_support | market_readiness_level | competition_awards | business_model_score | technology_maturity | year | success_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2100.0 | 2100.0 | 2100.0 | 2100.0 | 2100.0 | 2100.0 | 2100.0 | 2100.0 | 2100.0 | 2100.0 | 2100.0 | 2100.0 |
| mean | 4.472381 | 2.271671 | 0.650124 | 25372.473505 | 0.494762 | 0.50619 | 2.962381 | 1.947143 | 0.598343 | 3.062381 | 2021.062857 | 0.419524 |
| std | 1.709403 | 1.004748 | 0.202946 | 14141.821776 | 0.500092 | 0.500081 | 1.427963 | 1.391654 | 0.233342 | 1.417045 | 1.460567 | 0.493599 |
| min | 2.0 | 0.5 | 0.3 | 1050.09 | 0.0 | 0.0 | 1.0 | 0.0 | 0.2 | 1.0 | 2019.0 | 0.0 |
| 25% | 3.0 | 1.4375 | 0.48 | 13532.4725 | 0.0 | 0.0 | 2.0 | 1.0 | 0.39 | 2.0 | 2020.0 | 0.0 |
| 50% | 4.0 | 2.29 | 0.65 | 25216.43 | 0.0 | 1.0 | 3.0 | 2.0 | 0.6 | 3.0 | 2021.0 | 0.0 |
| 75% | 6.0 | 3.12 | 0.83 | 37403.67 | 1.0 | 1.0 | 4.0 | 3.0 | 0.8 | 4.0 | 2022.0 | 1.0 |
| max | 7.0 | 4.0 | 1.0 | 49982.15 | 1.0 | 1.0 | 5.0 | 4.0 | 1.0 | 5.0 | 2023.0 | 1.0 |
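
The mean of success_label in the summary doubles as the share of successful startups, which gives a quick read on class balance. The sketch below only redoes that arithmetic from the two numbers printed above (2,100 rows, mean 0.419524); the variable names are illustrative and not part of the original notebook.

```python
# Values read from the df.describe() output for success_label.
n_rows = 2100
success_rate = 0.419524  # mean of a 0/1 column = fraction of 1s

# For a binary column, mean * count recovers the number of positives.
n_success = round(n_rows * success_rate)
n_failure = n_rows - n_success

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = max(n_success, n_failure) / n_rows

print(n_success, n_failure, round(baseline_accuracy, 4))
# prints: 881 1219 0.5805
```

The classes are moderately imbalanced, so any model trained later should beat roughly 58% accuracy to be better than always predicting "unsuccessful".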


> Target Variable: success_label (binary: 1 = Successful, 0 = Unsuccessful).

> Features: The remaining columns (e.g., funding_amount_usd, innovation_score, team_size); project_id is a row identifier rather than a predictor.
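
One way to turn the target/feature definition into code is sketched here. It rebuilds the five rows shown by df.head(), restricted to a few columns for brevity, so it runs without the CSV. Dropping project_id as an identifier is an assumption on my part, and `split_features_target`, `TARGET`, and `ID_COLUMNS` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# The five rows from df.head(), limited to a subset of columns for brevity.
sample = pd.DataFrame({
    "project_id": ["P0001", "P0002", "P0003", "P0004", "P0005"],
    "institution_type": ["Non-technical", "Technical", "Non-technical", "Private", "Public"],
    "project_domain": ["AgriTech", "FinTech", "AgriTech", "GreenTech", "HealthTech"],
    "team_size": [3, 4, 3, 7, 7],
    "innovation_score": [0.75, 0.38, 0.61, 0.98, 0.33],
    "funding_amount_usd": [48336.75, 30601.34, 37712.58, 46881.0, 29988.37],
    "success_label": [1, 0, 1, 1, 0],
})

TARGET = "success_label"
ID_COLUMNS = ["project_id"]  # assumption: identifiers carry no predictive signal


def split_features_target(frame):
    """Return (X, y), dropping the target and identifier columns from X."""
    X = frame.drop(columns=[TARGET] + ID_COLUMNS)
    y = frame[TARGET]
    return X, y


X, y = split_features_target(sample)

# Group column names by type so each group can be preprocessed differently later.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

On the full dataset the same function would be called as `split_features_target(df)`; whether institution_name (40 distinct values) should also be dropped is a separate modeling decision.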


#### NEXT STEPS: 
1. Exploratory Data Analysis (EDA):
   - Analyze feature distributions, missing values, and correlations with success.
   - Visualize patterns and trends.
2. Feature Engineering:
   - Create new features (e.g., team dynamics, support systems, innovation metrics).
   - Normalize/scale continuous features.
3. Model Selection & Training:
   - Models: Random Forest, XGBoost, Logistic Regression.
   - Tune models with cross-validation.
4. Model Evaluation:
   - Evaluate using accuracy, precision, recall, F1-score.
   - Compare performance of models.
5. Feature Importance Analysis:
   - Identify most influential features (e.g., funding, mentorship support).
6. Practical Implications:
   - Insights for universities and incubators to enhance startup support systems.
7. Frontend Application:
   - Deploy the model via Streamlit for real-time success predictions and insights.
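
The scaling, training, and evaluation steps above could be wired together roughly as follows. This is a minimal sketch, not the project's final pipeline. It runs on synthetic stand-in data whose column names mirror the real dataset, and the labeling rule is arbitrary, so the printed scores say nothing about the actual data. Logistic Regression is used because it is one of the listed models; the fold count and other settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the Kaggle dataset); only the column names are real.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "institution_type": rng.choice(["Technical", "Non-technical", "Public", "Private"], n),
    "project_domain": rng.choice(["AgriTech", "FinTech", "GreenTech", "HealthTech"], n),
    "team_size": rng.integers(2, 8, n),
    "innovation_score": rng.uniform(0.3, 1.0, n),
    "funding_amount_usd": rng.uniform(1000, 50000, n),
    "mentorship_support": rng.integers(0, 2, n),
})
# Arbitrary labeling rule, used only so the example has a signal to learn.
logit = 4 * (demo["innovation_score"] - 0.65) + 1.5 * (demo["mentorship_support"] - 0.5)
demo["success_label"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = demo.drop(columns="success_label")
y = demo["success_label"]

categorical_cols = ["institution_type", "project_domain"]
numeric_cols = ["team_size", "innovation_score", "funding_amount_usd", "mentorship_support"]

# One-hot encode categoricals and standardize numerics inside the pipeline,
# so each cross-validation fold fits its preprocessing on training rows only.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the success/failure ratio similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

for metric in metrics:
    print(metric, round(scores[f"test_{metric}"].mean(), 3))
```

Replacing the `"clf"` step with another estimator (for example scikit-learn's `RandomForestClassifier`) leaves the rest of the pipeline unchanged, which makes the planned model comparison straightforward.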