# Task
Generate a single Jupyter notebook (`startup_health_scoring.ipynb`) designed to run in Google Colab. This notebook should perform the following steps: set up the environment, load and inspect the dataset from "/content/drive/MyDrive/ScaleDux/Startup_Scoring_Dataset.csv", preprocess the data by normalizing numerical features and transforming the monthly burn rate, compute a health score for each startup based on weighted features, rank the startups by their health score, create visualizations including a bar chart of scores, a correlation heatmap, and a histogram of the score distribution (saving all figures to `/content/outputs/`), and document the notebook with clear markdown and inline comments.

## Setup environment

### Subtask:
Install necessary libraries and ensure the dataset is accessible.


**Reasoning**:
Install the required libraries using pip.



In [1]:
%pip install pandas numpy scikit-learn matplotlib seaborn



**Reasoning**:
Mount Google Drive to access the dataset and create the output directory.



In [2]:
from google.colab import drive
import os

drive.mount('/content/drive')

output_dir = '/content/outputs'
os.makedirs(output_dir, exist_ok=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load and inspect data

### Subtask:
Load the dataset into a pandas DataFrame, display basic statistics, and check for missing values.


**Reasoning**:
Load the data into a pandas DataFrame, display basic information and check for missing values as requested by the instructions.



In [3]:
import pandas as pd

# Load the dataset
df_startup = pd.read_csv('/content/drive/MyDrive/ScaleDux/Startup_Scoring_Dataset.csv')

# Display the first 5 rows
print("First 5 rows of the DataFrame:")
display(df_startup.head())

# Display a concise summary of the DataFrame
print("\nDataFrame Info:")
df_startup.info()

# Generate descriptive statistics of numerical columns
print("\nDescriptive Statistics:")
display(df_startup.describe())

# Check for missing values
print("\nMissing values per column:")
print(df_startup.isnull().sum())

First 5 rows of the DataFrame:


| | startup_id | team_experience | market_size_million_usd | monthly_active_users | monthly_burn_rate_inr | funds_raised_inr | valuation_inr |
|---|---|---|---|---|---|---|---|
| 0 | S001 | 7 | 789 | 66299 | 6091495 | 4183839 | 263247468 |
| 1 | S002 | 4 | 939 | 34866 | 9106842 | 666326 | 172927107 |
| 2 | S003 | 8 | 42 | 73630 | 8817088 | 649150 | 313090455 |
| 3 | S004 | 5 | 57 | 93657 | 6374167 | 20153173 | 305887183 |
| 4 | S005 | 7 | 512 | 61187 | 841971 | 118834 | 84815222 |



DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   startup_id               100 non-null    object
 1   team_experience          100 non-null    int64 
 2   market_size_million_usd  100 non-null    int64 
 3   monthly_active_users     100 non-null    int64 
 4   monthly_burn_rate_inr    100 non-null    int64 
 5   funds_raised_inr         100 non-null    int64 
 6   valuation_inr            100 non-null    int64 
dtypes: int64(6), object(1)
memory usage: 5.6+ KB

Descriptive Statistics:


| | team_experience | market_size_million_usd | monthly_active_users | monthly_burn_rate_inr | funds_raised_inr | valuation_inr |
|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 5.91 | 548.5 | 52002.67 | 5271927.0 | 22568080.0 | 231602200.0 |
| std | 2.871296 | 270.025046 | 29193.444748 | 2801648.0 | 16330370.0 | 138882600.0 |
| min | 1.0 | 14.0 | 954.0 | 166381.0 | 118834.0 | 10070640.0 |
| 25% | 3.0 | 346.75 | 26729.25 | 2928578.0 | 5996659.0 | 112172300.0 |
| 50% | 6.0 | 578.0 | 52472.5 | 5277296.0 | 22143050.0 | 243729600.0 |
| 75% | 8.0 | 755.0 | 74684.75 | 7694987.0 | 35499960.0 | 319858800.0 |
| max | 10.0 | 996.0 | 98606.0 | 9931393.0 | 49655440.0 | 497894800.0 |



Missing values per column:
startup_id                 0
team_experience            0
market_size_million_usd    0
monthly_active_users       0
monthly_burn_rate_inr      0
funds_raised_inr           0
valuation_inr              0
dtype: int64
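
The inspection above is visual. A minimal sketch of how the same expectations could be checked programmatically follows, assuming the column names reported by `info()`. The helper `validate_startup_frame` is hypothetical and not part of the notebook, and the two CSV rows are copied from the `head()` output as stand-in data.

```python
import io

import pandas as pd

# Column names as reported by df_startup.info() above.
EXPECTED_COLUMNS = [
    "startup_id",
    "team_experience",
    "market_size_million_usd",
    "monthly_active_users",
    "monthly_burn_rate_inr",
    "funds_raised_inr",
    "valuation_inr",
]


def validate_startup_frame(df):
    """Raise ValueError if the frame breaks a basic expectation (hypothetical helper)."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if df["startup_id"].duplicated().any():
        raise ValueError("Duplicate startup_id values found")
    numeric = df[EXPECTED_COLUMNS[1:]]
    if numeric.isnull().any().any():
        raise ValueError("Missing values found in numeric columns")
    if (numeric < 0).any().any():
        raise ValueError("Negative values found in numeric columns")
    return df


# Two rows copied from the head() output above, used as stand-in data.
csv_text = """startup_id,team_experience,market_size_million_usd,monthly_active_users,monthly_burn_rate_inr,funds_raised_inr,valuation_inr
S001,7,789,66299,6091495,4183839,263247468
S002,4,939,34866,9106842,666326,172927107
"""
df_check = validate_startup_frame(pd.read_csv(io.StringIO(csv_text)))
print(df_check.shape)
```

In the notebook itself, the same helper could be applied to `df_startup` right after `pd.read_csv`, so a malformed file fails early instead of producing misleading scores.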


## Preprocess data

### Subtask:
Normalize numerical features and transform the monthly burn rate.


**Reasoning**:
Import the necessary scaler and apply normalization and transformation to the specified columns.



In [4]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Create a list of numerical column names to be normalized, excluding 'monthly_burn_rate_inr'.
numerical_cols_to_normalize = ['team_experience', 'market_size_million_usd', 'monthly_active_users', 'funds_raised_inr', 'valuation_inr']

# Instantiate a MinMaxScaler.
scaler = MinMaxScaler()

# Apply the scaler to the selected numerical columns in df_startup and update these columns with the scaled values.
df_startup[numerical_cols_to_normalize] = scaler.fit_transform(df_startup[numerical_cols_to_normalize])

# Apply a logarithmic transformation to the 'monthly_burn_rate_inr' column.
# Add a small constant to avoid log(0) if there are zeros.
df_startup['monthly_burn_rate_log'] = np.log(df_startup['monthly_burn_rate_inr'] + 1e-9)

# Display the first few rows to verify the transformations
display(df_startup.head())

| | startup_id | team_experience | market_size_million_usd | monthly_active_users | monthly_burn_rate_inr | funds_raised_inr | valuation_inr | monthly_burn_rate_log |
|---|---|---|---|---|---|---|---|---|
| 0 | S001 | 0.666667 | 0.789206 | 0.669162 | 6091495 | 0.082061 | 0.518992 | 15.622404 |
| 1 | S002 | 0.333333 | 0.941955 | 0.347274 | 9106842 | 0.011052 | 0.333843 | 16.024537 |
| 2 | S003 | 0.777778 | 0.028513 | 0.744235 | 8817088 | 0.010706 | 0.621166 | 15.992202 |
| 3 | S004 | 0.444444 | 0.043788 | 0.94932 | 6374167 | 0.404435 | 0.6064 | 15.667764 |
| 4 | S005 | 0.666667 | 0.507128 | 0.616813 | 841971 | 0.0 | 0.15322 | 13.643501 |
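
As a sanity check on the transformations, the min-max formula `(x - min) / (max - min)` can be applied by hand to the raw S001 values, using the column minima and maxima from the `describe()` output. The sketch below does that and also compares the `np.log(x + 1e-9)` transform with `np.log1p`, which computes `log(1 + x)` and is a common alternative when zeros are possible. The helper name `min_max` is an illustration only.

```python
import numpy as np


def min_max(value, col_min, col_max):
    """Min-max scaling: (x - min) / (max - min)."""
    return (value - col_min) / (col_max - col_min)


# Raw S001 values from head(); column min and max from describe().
team_exp = min_max(7, 1, 10)
market = min_max(789, 14, 996)
users = min_max(66299, 954, 98606)
print(round(team_exp, 6), round(market, 6), round(users, 6))

# Burn rate as transformed in the cell above, next to np.log1p, which
# is defined at x = 0 without adding an epsilon.
burn = 6091495
log_eps = np.log(burn + 1e-9)
log_1p = np.log1p(burn)
print(round(float(log_eps), 6), round(float(log_1p), 6))
```

The hand-computed values agree with the first row of the table above. For burn rates in the millions, the two log variants differ by far less than the displayed precision.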


## Compute health score

### Subtask:
Define feature weights and calculate a composite health score for each startup.


**Reasoning**:
Define the feature weights and calculate the health score based on these weights.



In [5]:
# Define the weights for each feature contributing to the health score.
# 'monthly_burn_rate_log' has a negative weight because a higher burn rate is less favorable.
# The signed weights sum to 1.0: the positive weights total 1.1 and the burn-rate weight is -0.1.
feature_weights = {
    'team_experience': 0.2,
    'market_size_million_usd': 0.2,
    'monthly_active_users': 0.2,
    'monthly_burn_rate_log': -0.1,  # Negative weight
    'funds_raised_inr': 0.2,
    'valuation_inr': 0.3
}

# Calculate the health score for each startup
df_startup['health_score'] = (
    df_startup['team_experience'] * feature_weights['team_experience'] +
    df_startup['market_size_million_usd'] * feature_weights['market_size_million_usd'] +
    df_startup['monthly_active_users'] * feature_weights['monthly_active_users'] +
    df_startup['monthly_burn_rate_log'] * feature_weights['monthly_burn_rate_log'] +
    df_startup['funds_raised_inr'] * feature_weights['funds_raised_inr'] +
    df_startup['valuation_inr'] * feature_weights['valuation_inr']
)

# Display the first few rows including the new 'health_score' column
display(df_startup.head())

| | startup_id | team_experience | market_size_million_usd | monthly_active_users | monthly_burn_rate_inr | funds_raised_inr | valuation_inr | monthly_burn_rate_log | health_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | S001 | 0.666667 | 0.789206 | 0.669162 | 6091495 | 0.082061 | 0.518992 | 15.622404 | -0.965124 |
| 1 | S002 | 0.333333 | 0.941955 | 0.347274 | 9106842 | 0.011052 | 0.333843 | 16.024537 | -1.175578 |
| 2 | S003 | 0.777778 | 0.028513 | 0.744235 | 8817088 | 0.010706 | 0.621166 | 15.992202 | -1.100624 |
| 3 | S004 | 0.444444 | 0.043788 | 0.94932 | 6374167 | 0.404435 | 0.6064 | 15.667764 | -1.016459 |
| 4 | S005 | 0.666667 | 0.507128 | 0.616813 | 841971 | 0.0 | 0.15322 | 13.643501 | -0.960262 |
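
In the output above, `monthly_burn_rate_log` stays on its natural-log scale (roughly 12 to 16), so its term contributes about -1.2 to -1.6 per row. The five scaled features together add at most 1.1, which is why every score is negative and the burn rate dominates the ranking. The sketch below shows one possible alternative, not the notebook's method: rescale the logged burn rate to [0, 1], invert it so that lower burn scores higher, and use positive weights that sum to 1. The weights and the column names `burn_rate_score` and `health_score_0_100` are hypothetical. Feature values are copied from the preprocessing output, and the burn-rate extremes come from `describe()`.

```python
import numpy as np
import pandas as pd

# Normalized feature values copied from the preprocessing output above.
df = pd.DataFrame({
    "startup_id": ["S001", "S002", "S003", "S004", "S005"],
    "team_experience": [0.666667, 0.333333, 0.777778, 0.444444, 0.666667],
    "market_size_million_usd": [0.789206, 0.941955, 0.028513, 0.043788, 0.507128],
    "monthly_active_users": [0.669162, 0.347274, 0.744235, 0.94932, 0.616813],
    "funds_raised_inr": [0.082061, 0.011052, 0.010706, 0.404435, 0.0],
    "valuation_inr": [0.518992, 0.333843, 0.621166, 0.6064, 0.15322],
    "monthly_burn_rate_inr": [6091495, 9106842, 8817088, 6374167, 841971],
})

# Dataset-wide burn-rate extremes taken from the describe() output above.
BURN_MIN, BURN_MAX = 166381, 9931393

# Min-max scale the logged burn rate, then invert: lower burn -> higher score.
log_burn = np.log(df["monthly_burn_rate_inr"])
log_range = np.log(BURN_MAX) - np.log(BURN_MIN)
df["burn_rate_score"] = 1.0 - (log_burn - np.log(BURN_MIN)) / log_range

# Hypothetical weights: all positive, summing to 1.
weights = {
    "team_experience": 0.15,
    "market_size_million_usd": 0.20,
    "monthly_active_users": 0.25,
    "burn_rate_score": 0.10,
    "funds_raised_inr": 0.15,
    "valuation_inr": 0.15,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9

# Weighted sum of [0, 1] features, expressed on a 0-100 scale.
df["health_score_0_100"] = 100 * sum(df[col] * w for col, w in weights.items())
print(df[["startup_id", "burn_rate_score", "health_score_0_100"]])
```

With every input on [0, 1] and weights summing to 1, the score is bounded between 0 and 100, which makes individual values easier to interpret than the negative scores above.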


## Rank startups

### Subtask:
Sort startups by their health score and highlight the top and bottom performers.


**Reasoning**:
Sort the DataFrame by 'health_score' in descending order and display the top and bottom 5 rows.



In [6]:
# Sort the DataFrame by 'health_score' in descending order
df_sorted_by_health = df_startup.sort_values(by='health_score', ascending=False)

# Display the top 5 performing startups
print("Top 5 Performing Startups:")
display(df_sorted_by_health.head())

# Display the bottom 5 performing startups
print("\nBottom 5 Performing Startups:")
display(df_sorted_by_health.tail())

Top 5 Performing Startups:


| | startup_id | team_experience | market_size_million_usd | monthly_active_users | monthly_burn_rate_inr | funds_raised_inr | valuation_inr | monthly_burn_rate_log | health_score |
|---|---|---|---|---|---|---|---|---|---|
| 5 | S006 | 1.0 | 0.409369 | 0.972689 | 551157 | 0.985549 | 0.609569 | 13.219775 | -0.465585 |
| 76 | S077 | 0.888889 | 0.349287 | 0.871738 | 1155534 | 0.93057 | 0.79932 | 13.960073 | -0.548114 |
| 34 | S035 | 0.666667 | 0.903259 | 0.530537 | 166381 | 0.848808 | 0.019745 | 12.022036 | -0.606426 |
| 32 | S033 | 0.444444 | 0.868635 | 0.643919 | 2033509 | 0.992092 | 0.788628 | 14.525273 | -0.626121 |
| 96 | S097 | 0.888889 | 0.684318 | 0.67419 | 1222622 | 0.269524 | 0.880156 | 14.016508 | -0.63422 |



Bottom 5 Performing Startups:


| | startup_id | team_experience | market_size_million_usd | monthly_active_users | monthly_burn_rate_inr | funds_raised_inr | valuation_inr | monthly_burn_rate_log | health_score |
|---|---|---|---|---|---|---|---|---|---|
| 73 | S074 | 0.444444 | 0.331976 | 0.394564 | 8156609 | 0.578306 | 0.0 | 15.914339 | -1.241576 |
| 22 | S023 | 1.0 | 0.0 | 0.0 | 5122148 | 0.414358 | 0.008128 | 15.449084 | -1.259598 |
| 57 | S058 | 0.555556 | 0.242363 | 0.347151 | 8553902 | 0.008291 | 0.344865 | 15.961898 | -1.262058 |
| 90 | S091 | 0.222222 | 0.360489 | 0.735674 | 6562440 | 0.100251 | 0.046099 | 15.696873 | -1.272131 |
| 54 | S055 | 0.777778 | 0.157841 | 0.020041 | 7029451 | 0.018661 | 0.034569 | 15.765619 | -1.371327 |
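
Sorting gives the order but leaves the original row index as the only positional cue. A small sketch of adding an explicit rank column with `Series.rank` follows, using the five health scores from the earlier `head()` output. The ranks it produces are relative to this five-row sample only, not to the full 100-row dataset.

```python
import pandas as pd

# Health scores copied from the head() output of the scoring step.
scores = pd.DataFrame({
    "startup_id": ["S001", "S002", "S003", "S004", "S005"],
    "health_score": [-0.965124, -1.175578, -1.100624, -1.016459, -0.960262],
})

# Rank 1 is the highest score; method="min" gives tied scores the same rank.
scores["rank"] = scores["health_score"].rank(method="min", ascending=False).astype(int)

ranked = scores.sort_values("rank").reset_index(drop=True)
print(ranked)
```

A stored rank column survives later filtering or merging, whereas the position in a sorted frame does not.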


## Create visualizations

### Subtask:
Generate a bar chart of scores, a correlation heatmap, and a score distribution histogram, saving all figures.


**Reasoning**:
Generate and save the bar chart of top 10 health scores, the correlation heatmap, and the health score distribution histogram.



In [7]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a figure for the bar chart of the top 10 health scores.
plt.figure(figsize=(12, 6))
sns.barplot(x='startup_id', y='health_score', data=df_sorted_by_health.head(10))
plt.title('Top 10 Startup Health Scores')
plt.xlabel('Startup ID')
plt.ylabel('Health Score')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('/content/outputs/top_10_health_scores_bar_chart.png')

# Create a figure for the correlation heatmap.
plt.figure(figsize=(10, 8))
numerical_cols = df_startup.select_dtypes(include=np.number).columns
correlation_matrix = df_startup[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.savefig('/content/outputs/feature_correlation_heatmap.png')

# Create a figure for the histogram.
plt.figure(figsize=(8, 6))
sns.histplot(df_startup['health_score'], kde=True)
plt.title('Distribution of Startup Health Scores')
plt.xlabel('Health Score')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('/content/outputs/health_score_distribution_histogram.png')

# Close all plot figures to free up memory.
plt.close('all')
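
The three `savefig` calls above repeat the output path. One way to factor that out is a small helper that builds the path, creates the directory if needed, and closes the figure after saving. This is a sketch under assumptions: `save_figure` is a hypothetical helper, the `Agg` backend is selected so the example also runs without a display, and a temporary directory stands in for `/content/outputs`. The bar values are the top three health scores from the ranking output.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; works without a display
import matplotlib.pyplot as plt


def save_figure(fig, output_dir, filename, dpi=150):
    """Save fig under output_dir, creating the directory if needed (hypothetical helper)."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    fig.savefig(path, dpi=dpi, bbox_inches="tight")
    plt.close(fig)  # release the figure once it is on disk
    return path


# Throwaway directory for the demo; in the notebook this would be '/content/outputs'.
demo_dir = tempfile.mkdtemp()

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(["S006", "S077", "S035"], [-0.465585, -0.548114, -0.606426])
ax.set_xlabel("Startup ID")
ax.set_ylabel("Health Score")

saved_path = save_figure(fig, demo_dir, "demo_bar_chart.png")
print(saved_path)
```

Closing each figure as it is saved avoids the need for a final `plt.close('all')` and keeps memory use flat when many charts are generated.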

## Document the notebook

### Subtask:
Add markdown headers and explanations to the notebook.


**Reasoning**:
Add markdown headers and explanations to the notebook based on the completed subtasks.



**Reasoning**:
The previous command failed because `Markdown` was not imported. Import `Markdown` from `IPython.display` and re-execute the code to add markdown headers and explanations.



In [9]:
from IPython.display import Markdown, display

# Add markdown headers and explanations

# Main Header
display(Markdown("# Startup Health Scoring Analysis"))

# Section: Setup Environment
display(Markdown("## Setup Environment"))
display(Markdown("""
This section sets up the necessary environment for the analysis. It includes installing required Python libraries such as pandas, numpy, scikit-learn, matplotlib, and seaborn. It also mounts Google Drive to access the dataset and creates an output directory to save the generated visualizations.
"""))

# Section: Load and Inspect Data
display(Markdown("## Load and Inspect Data"))
display(Markdown("""
In this section, the startup dataset is loaded from the specified CSV file into a pandas DataFrame. Basic data inspection is performed, including displaying the first few rows, checking the data types and non-null counts using `info()`, generating descriptive statistics using `describe()`, and identifying any missing values. This step ensures the data is loaded correctly and provides initial insights into its structure and content.
"""))

# Section: Preprocess Data
display(Markdown("## Preprocess Data"))
display(Markdown("""
This section focuses on preparing the data for health score calculation. Numerical features, except for the monthly burn rate, are normalized using `MinMaxScaler` to scale them to a consistent range (typically 0 to 1). The monthly burn rate is transformed using a logarithmic function (`np.log`) to handle its potentially skewed distribution and reduce the impact of extreme values. A small constant is added before the log transformation to avoid issues with zero values.
"""))

# Section: Compute Health Score
display(Markdown("## Compute Health Score"))
display(Markdown("""
Here, a composite health score is calculated for each startup. A dictionary `feature_weights` is defined to assign weights to the normalized and transformed features. These weights reflect the perceived importance of each factor in determining a startup's health. The 'monthly_burn_rate_log' is assigned a negative weight because a higher burn rate is considered detrimental to a startup's health. The health score is calculated as a weighted sum of the features and added as a new column to the DataFrame.
"""))

# Section: Rank Startups
display(Markdown("## Rank Startups"))
display(Markdown("""
In this section, the startups are ranked based on their calculated health scores. The DataFrame is sorted in descending order of the 'health_score' column to easily identify the top-performing startups. The top and bottom 5 startups are then displayed to highlight the range of health scores and identify the best and worst performers according to the defined metric.
"""))

# Section: Create Visualizations
display(Markdown("## Create Visualizations"))
display(Markdown("""
This section generates several visualizations to provide insights into the health scores and feature relationships:
- **Bar Chart of Top 10 Health Scores:** Visualizes the health scores of the top 10 performing startups to easily compare their scores.
- **Correlation Heatmap:** Displays the correlation matrix of the numerical features, showing the linear relationships between different variables and the 'health_score'.
- **Health Score Distribution Histogram:** Shows the distribution of the calculated health scores across all startups, indicating the frequency of different score ranges.
All generated figures are saved to the `/content/outputs/` directory.
"""))

# Startup Health Scoring Analysis

## Setup Environment


This section sets up the necessary environment for the analysis. It includes installing required Python libraries such as pandas, numpy, scikit-learn, matplotlib, and seaborn. It also mounts Google Drive to access the dataset and creates an output directory to save the generated visualizations.


## Load and Inspect Data


In this section, the startup dataset is loaded from the specified CSV file into a pandas DataFrame. Basic data inspection is performed, including displaying the first few rows, checking the data types and non-null counts using `info()`, generating descriptive statistics using `describe()`, and identifying any missing values. This step ensures the data is loaded correctly and provides initial insights into its structure and content.


## Preprocess Data


This section focuses on preparing the data for health score calculation. Numerical features, except for the monthly burn rate, are normalized using `MinMaxScaler` to scale them to a consistent range (typically 0 to 1). The monthly burn rate is transformed using a logarithmic function (`np.log`) to handle its potentially skewed distribution and reduce the impact of extreme values. A small constant is added before the log transformation to avoid issues with zero values.


## Compute Health Score


Here, a composite health score is calculated for each startup. A dictionary `feature_weights` is defined to assign weights to the normalized and transformed features. These weights reflect the perceived importance of each factor in determining a startup's health. The 'monthly_burn_rate_log' is assigned a negative weight because a higher burn rate is considered detrimental to a startup's health. The health score is calculated as a weighted sum of the features and added as a new column to the DataFrame.


## Rank Startups


In this section, the startups are ranked based on their calculated health scores. The DataFrame is sorted in descending order of the 'health_score' column to easily identify the top-performing startups. The top and bottom 5 startups are then displayed to highlight the range of health scores and identify the best and worst performers according to the defined metric.


## Create Visualizations


This section generates several visualizations to provide insights into the health scores and feature relationships:
- **Bar Chart of Top 10 Health Scores:** Visualizes the health scores of the top 10 performing startups to easily compare their scores.
- **Correlation Heatmap:** Displays the correlation matrix of the numerical features, showing the linear relationships between different variables and the 'health_score'.
- **Health Score Distribution Histogram:** Shows the distribution of the calculated health scores across all startups, indicating the frequency of different score ranges.
All generated figures are saved to the `/content/outputs/` directory.


## Summary:

### Data Analysis Key Findings

*   The dataset contains 100 entries and 7 columns, including `startup_id` and six numerical features. There are no missing values.
*   Numerical features (excluding `monthly_burn_rate_inr`) were successfully normalized using `MinMaxScaler`.
*   The `monthly_burn_rate_inr` column was transformed using a logarithmic function, creating a new column `monthly_burn_rate_log`.
*   A composite `health_score` was calculated for each startup based on defined weights for the features, with `monthly_burn_rate_log` having a negative weight.
*   The startups were successfully ranked by their health scores, and the top and bottom 5 performers were identified.
*   Three visualizations were successfully generated and saved to `/content/outputs/`: a bar chart of the top 10 health scores, a feature correlation heatmap, and a histogram of the health score distribution.
*   Markdown headers and explanatory text were added to the notebook to document the analysis process.

### Insights or Next Steps

*   Analyze the features contributing most significantly to the health score for top and bottom performers to identify key success/failure factors.
*   Explore different feature weighting schemes or alternative scoring methodologies to see how they impact the ranking and distribution of health scores.
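
As a starting point for the second item, the sketch below scores the same rows under two weighting schemes and compares the resulting rank orders with a Spearman rank correlation, computed here as the Pearson correlation of the ranks. Both weight sets are hypothetical, the burn rate is left out to keep the example short, and the feature values are the five normalized rows shown in the preprocessing output.

```python
import pandas as pd

# Normalized feature values copied from the preprocessing output.
features = pd.DataFrame(
    {
        "team_experience": [0.666667, 0.333333, 0.777778, 0.444444, 0.666667],
        "market_size_million_usd": [0.789206, 0.941955, 0.028513, 0.043788, 0.507128],
        "monthly_active_users": [0.669162, 0.347274, 0.744235, 0.94932, 0.616813],
        "funds_raised_inr": [0.082061, 0.011052, 0.010706, 0.404435, 0.0],
        "valuation_inr": [0.518992, 0.333843, 0.621166, 0.6064, 0.15322],
    },
    index=["S001", "S002", "S003", "S004", "S005"],
)


def weighted_score(frame, weights):
    """Weighted sum of the columns named in weights."""
    return sum(frame[col] * w for col, w in weights.items())


# Two hypothetical schemes, each summing to 1.
equal_weights = {col: 0.2 for col in features.columns}
traction_weights = {
    "team_experience": 0.15,
    "market_size_million_usd": 0.15,
    "monthly_active_users": 0.40,
    "funds_raised_inr": 0.15,
    "valuation_inr": 0.15,
}

comparison = pd.DataFrame({
    "equal": weighted_score(features, equal_weights),
    "traction": weighted_score(features, traction_weights),
})
comparison["rank_equal"] = comparison["equal"].rank(ascending=False)
comparison["rank_traction"] = comparison["traction"].rank(ascending=False)

# Pearson correlation of the ranks is the Spearman rank correlation.
spearman = comparison["rank_equal"].corr(comparison["rank_traction"])
print(comparison)
print(spearman)
```

A correlation close to 1 suggests the ranking is robust to the choice of weights, while a low value indicates that the ordering depends heavily on how the features are prioritized.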
