### Student Startup Success Prediction 

This dataset contains student-level information used to predict startup success. It includes demographic, academic, extracurricular, and personal development attributes.

The primary objective of this study is to predict student startup success using academic performance, behavioral traits, and entrepreneurial development factors. By analyzing this dataset, the project aims to identify which academic indicators, extracurricular activities, and entrepreneurial traits contribute most significantly to the likelihood of students successfully founding and sustaining startups.

In [6]:
import pandas as pd

Data Source: [Kaggle dataset: Real-Time Dataset on Academic and Entrepreneurial Development](https://www.kaggle.com/datasets/datasetengineer/academic-and-entrepreneurial-development-dataset/data)


In [3]:
# Load the dataset from the CSV file downloaded from Kaggle
df = pd.read_csv('Academic and Entrepreneurial Development.csv')

In [4]:
print("Initial data shape:", df.shape)
df.head()

Initial data shape: (214354, 49)


|  | Student_ID | Age | Gender | Major | Year_of_Study | Educational_Background | Socioeconomic_Status | Location | High_School_Type | Cumulative_GPA | ... | Startup_Founded | Funding_Secured | Business_Plan_Quality_Score | Competitions_Won | Research_Publications | Employment_in_Entrepreneurial_Roles | Innovative_Skill_Score | Entrepreneurial_Talent_Level | Prototype_Completion | Startup_Success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S000001 | 20 | Female | Business | 2 | Medium | Middle | Urban | Public | 1.31 | ... | No | 1439.91 | 71.39 | 0 | 4 | Yes | 37.52 | Low | Not Completed | Failure |
| 1 | S000002 | 22 | Male | Business | 2 | Medium | Middle | Urban | Public | 2.18 | ... | No | 7091.42 | 59.37 | 2 | 0 | No | 50.05 | Low | Not Completed | Success |
| 2 | S000003 | 21 | Male | Sciences | 3 | High | Middle | Urban | Public | 0.26 | ... | No | 7549.31 | 38.01 | 0 | 0 | No | 65.56 | Medium | Not Completed | Success |
| 3 | S000004 | 20 | Male | Engineering | 4 | High | Low | Rural | Public | 0.64 | ... | No | 7455.6 | 24.38 | 0 | 0 | Yes | 44.2 | High | Not Completed | Failure |
| 4 | S000005 | 19 | Female | Engineering | 2 | Medium | Low | Urban | Public | 0.87 | ... | No | 3184.49 | 39.17 | 3 | 1 | No | 39.63 | Low | Not Completed | Failure |


In [5]:
print("\nColumns and Data Types:")
print(df.dtypes)


Columns and Data Types:
Student_ID                                 object
Age                                         int64
Gender                                     object
Major                                      object
Year_of_Study                               int64
Educational_Background                     object
Socioeconomic_Status                       object
Location                                   object
High_School_Type                           object
Cumulative_GPA                            float64
Course_Grades                             float64
Attendance                                float64
Project_Scores                            float64
Internship_Experience                      object
Applied_Courses_Count                       int64
Club_Membership                            object
Workshops_Attended                          int64
Competitions_Participated                   int64
Leadership_Roles                           object
Volunteering_Activities  

In [53]:
# Statistical summary of numeric features
df.describe()

|  | Age | Year_of_Study | Cumulative_GPA | Course_Grades | Attendance | Project_Scores | Applied_Courses_Count | Workshops_Attended | Competitions_Participated | Volunteering_Activities | ... | Entrepreneurial_Mindset | Business_Acumen | Motivation_Level | Resilience_Score | Adaptability | Self_Efficacy_Score | Mentorship_Hours | Institutional_Resources_Used | Faculty_Feedback_Score | Institutional_Support_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | ... | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 | 214354.0 |
| mean | 20.100278 | 2.903505 | 1.143603 | 70.10596 | 80.056058 | 74.976484 | 3.10332 | 1.006359 | 0.519808 | 1.018162 | ... | 42.832868 | 37.507533 | 42.834131 | 28.577646 | 42.804226 | 57.124554 | 10.007619 | 42.911401 | 57.150439 | 37.536041 |
| std | 0.887108 | 0.942126 | 0.638769 | 9.785085 | 11.540885 | 14.450068 | 1.136581 | 1.249484 | 0.726354 | 1.046614 | ... | 17.475501 | 16.145825 | 17.489328 | 15.957298 | 17.477318 | 17.523855 | 10.008776 | 17.527401 | 17.454461 | 16.095479 |
| min | 18.0 | 1.0 | 0.0 | 50.0 | 60.0 | 50.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.48 | 0.23 | 0.6 | 0.04 | 0.64 | 1.76 | 0.0 | 0.74 | 3.06 | 0.54 |
| 25% | 20.0 | 2.0 | 0.65 | 63.29 | 70.07 | 62.48 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 29.68 | 25.29 | 29.72 | 16.11 | 29.67 | 44.64 | 2.89 | 29.68 | 44.71 | 25.4125 |
| 50% | 20.0 | 3.0 | 1.06 | 70.06 | 80.07 | 74.97 | 3.0 | 0.0 | 0.0 | 1.0 | ... | 42.07 | 36.41 | 42.06 | 26.51 | 42.1 | 57.85 | 6.94 | 42.17 | 57.81 | 36.48 |
| 75% | 21.0 | 4.0 | 1.56 | 76.79 | 90.05 | 87.46 | 4.0 | 2.0 | 1.0 | 2.0 | ... | 55.27 | 48.63 | 55.32 | 38.98 | 55.24 | 70.32 | 13.88 | 55.4275 | 70.28 | 48.57 |
| max | 22.0 | 4.0 | 3.76 | 100.0 | 100.0 | 100.0 | 5.0 | 5.0 | 3.0 | 4.0 | ... | 97.89 | 97.38 | 96.98 | 92.66 | 97.35 | 99.19 | 150.75 | 97.92 | 99.35 | 95.59 |


### Target Variable Selection
There are three potential target variables in this dataset:

- `Startup_Success` (binary: Success/Failure)
- `Entrepreneurial_Talent_Level` (categorical level, e.g. Low/Medium/High)
- `Innovative_Skill_Score` (numeric score)

Reason for choosing `Startup_Success` as the target:

- In practice, the question institutions and investors ultimately care about is whether a student's entrepreneurial efforts succeed.
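As a minimal sketch of how this target could be inspected and encoded, the snippet below uses only the five `Startup_Success` labels visible in the `df.head()` preview, not the full dataset. The names `sample`, `target_map`, and `y` are illustrative assumptions and do not appear in the original notebook.

```python
import pandas as pd

# Five labels copied from the df.head() preview; this is NOT the full dataset.
sample = pd.DataFrame(
    {"Startup_Success": ["Failure", "Success", "Success", "Failure", "Failure"]}
)

# Share of each class, useful for spotting class imbalance before modeling.
class_share = sample["Startup_Success"].value_counts(normalize=True)
print(class_share)

# Hypothetical encoding: map the two labels to 0/1 for a binary classifier.
target_map = {"Failure": 0, "Success": 1}
y = sample["Startup_Success"].map(target_map)
print(y.tolist())
```

On the real data, the same two calls would be applied to `df["Startup_Success"]`; any label other than `Success` or `Failure` would map to `NaN`, which makes unexpected values easy to detect.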

### Columns Dropped
The following columns are removed from the dataset:

- `Student_ID` → just an identifier
- `Funding_Secured`, `Startup_Founded`, `Prototypes_Developed`, `Prototype_Completion`, `Competitions_Won`, `Research_Publications`, `Employment_in_Entrepreneurial_Roles`, `Business_Plan_Quality_Score`, `Entrepreneurial_Talent_Level`, `Innovative_Skill_Score`

Reason for dropping these columns:

- Apart from `Student_ID`, these are post-success outcomes. Including them would leak information that is not available before the outcome occurs, which would artificially inflate model performance.
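One way to guard against this kind of leakage is to check that none of the flagged columns survive into the modeling frame. The sketch below is an assumption about how such a check might look: `LEAKY_COLUMNS` and `remaining_leaky_columns` are hypothetical names, and the toy frame reuses a few values from the first two preview rows.

```python
import pandas as pd

# Outcome-related columns copied from the list in this section.
LEAKY_COLUMNS = [
    "Funding_Secured",
    "Startup_Founded",
    "Prototypes_Developed",
    "Prototype_Completion",
    "Competitions_Won",
    "Research_Publications",
    "Employment_in_Entrepreneurial_Roles",
    "Business_Plan_Quality_Score",
    "Entrepreneurial_Talent_Level",
    "Innovative_Skill_Score",
]


def remaining_leaky_columns(frame: pd.DataFrame) -> list:
    """Return the flagged columns that are still present in `frame` (hypothetical helper)."""
    return [col for col in LEAKY_COLUMNS if col in frame.columns]


# Toy frame with values from the first two preview rows, for illustration only.
toy = pd.DataFrame(
    {
        "Age": [20, 22],
        "Funding_Secured": [1439.91, 7091.42],
        "Startup_Success": ["Failure", "Success"],
    }
)

print(remaining_leaky_columns(toy))
print(remaining_leaky_columns(toy.drop(columns=["Funding_Secured"])))
```

An empty list from the helper indicates that no flagged column remains; it does not prove that the remaining features are free of subtler leakage.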

### Features Used
All remaining columns are used as predictive features.
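Assuming the columns listed above have already been dropped, the feature matrix and target could be separated roughly as follows. This is a sketch on a toy frame built from a few columns of the first two preview rows; `TARGET`, `X`, `y`, `numeric_features`, and `categorical_features` are illustrative names, not identifiers from the original notebook.

```python
import pandas as pd

# Toy frame with a handful of columns from the preview; NOT the full dataset.
toy = pd.DataFrame(
    {
        "Age": [20, 22],
        "Gender": ["Female", "Male"],
        "Cumulative_GPA": [1.31, 2.18],
        "Startup_Success": ["Failure", "Success"],
    }
)

TARGET = "Startup_Success"

# Everything except the target is treated as a predictive feature.
X = toy.drop(columns=[TARGET])
y = toy[TARGET]

# Split features by dtype, since numeric and categorical columns
# typically need different preprocessing (scaling vs. encoding).
numeric_features = X.select_dtypes(include="number").columns.tolist()
categorical_features = X.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```

Splitting by dtype is only one possible convention; integer-coded columns such as `Year_of_Study` land in the numeric group here and may deserve separate treatment.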

In [None]:
columns_to_drop = [
    'Innovative_Skill_Score',
    'Entrepreneurial_Talent_Level',
    'Business_Plan_Quality_Score',
    'Prototype_Completion',
    'Research_Publications',
    'Startup_Founded',
    'Funding_Secured',
    'Prototypes_Developed',
    'Competitions_Won',
    'Employment_in_Entrepreneurial_Roles',
    'Student_ID'
]

df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])
print("Data shape after dropping columns:", df.shape)

# Preview the data after dropping the columns
df.head()

Data shape after dropping columns: (214354, 38)


|  | Age | Gender | Major | Year_of_Study | Educational_Background | Socioeconomic_Status | Location | High_School_Type | Cumulative_GPA | Course_Grades | ... | Motivation_Level | Resilience_Score | Adaptability | Self_Efficacy_Score | Mentorship_Hours | Institutional_Resources_Used | Faculty_Feedback_Score | Exposure_to_Entrepreneurial_Curriculum | Institutional_Support_Score | Startup_Success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | Female | Business | 2 | Medium | Middle | Urban | Public | 1.31 | 69.42 | ... | 47.39 | 19.75 | 43.18 | 78.2 | 7.25 | 38.96 | 34.17 | High | 55.12 | Failure |
| 1 | 22 | Male | Business | 2 | Medium | Middle | Urban | Public | 2.18 | 56.33 | ... | 73.37 | 27.23 | 34.19 | 63.82 | 8.12 | 25.04 | 36.96 | Medium | 31.68 | Success |
| 2 | 21 | Male | Sciences | 3 | High | Middle | Urban | Public | 0.26 | 69.65 | ... | 22.82 | 44.34 | 31.73 | 50.82 | 22.1 | 24.51 | 87.2 | Medium | 40.89 | Success |
| 3 | 20 | Male | Engineering | 4 | High | Low | Rural | Public | 0.64 | 88.31 | ... | 60.0 | 65.36 | 18.11 | 56.56 | 21.96 | 63.4 | 46.45 | Medium | 46.38 | Failure |
| 4 | 19 | Female | Engineering | 2 | Medium | Low | Urban | Public | 0.87 | 71.72 | ... | 41.08 | 42.02 | 35.87 | 16.85 | 30.87 | 28.11 | 33.59 | Medium | 39.99 | Failure |
