<a href="https://colab.research.google.com/github/Cflalex/Practicum_StarUp/blob/main/Comprehensive_Report_on_Startup_Closure_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comprehensive Report on Startup Closure Prediction

## 1. Introduction
This report investigates the prediction of startup closures using a classification model. The task is to determine whether a startup will close (label 0) or remain operational (label 1) based on historical and categorical data. The model is evaluated for its strengths, limitations, and potential improvements.

Predicting startup outcomes is critical for investors, entrepreneurs, and policy-makers. By understanding the patterns and drivers of startup success or failure, this analysis aims to provide actionable insights for decision-making and improve the model's predictive capabilities.

---

## 2. Dataset Overview

### 2.1 Key Features
The dataset includes information on startups spanning several decades (1970–2018). Key features include:
- **Category List**: The industry or sector of the startup.
- **City, Region, Country Code**: Geographical information.
- **Funding Rounds**: Number of financing rounds completed.
- **Funding Total (USD)**: Total funding received.
- **Days Since First Funding**: Time since the first funding was received.
- **Lifetime**: Duration in days between founding and closure (or, for startups still operating, the end of the observation period).
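
The derived features above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual preprocessing: the field names (`founded_at`, `first_funding_at`, `closed_at`, `funding_total_usd`), the snapshot date, and the use of `log1p` for the log-funding transform are all assumptions.

```python
import math
from datetime import date

# Assumed reference date for "time since" features; the report does not state it.
SNAPSHOT = date(2018, 1, 1)

def derive_features(record, snapshot=SNAPSHOT):
    """Derive time-based and funding features from one raw record (hypothetical field names)."""
    # Lifetime runs from founding to closure; operating startups are measured to the snapshot.
    end = record["closed_at"] or snapshot
    lifetime_days = (end - record["founded_at"]).days
    days_since_first_funding = (snapshot - record["first_funding_at"]).days
    # log1p compresses the long right tail of funding and stays finite at zero funding.
    log_funding_total_usd = math.log1p(record["funding_total_usd"])
    return {
        "lifetime_days": lifetime_days,
        "days_since_first_funding": days_since_first_funding,
        "log_funding_total_usd": log_funding_total_usd,
    }

# Made-up example record
example = {
    "founded_at": date(2010, 1, 1),
    "first_funding_at": date(2011, 1, 1),
    "closed_at": date(2014, 1, 1),  # None for a startup that is still operating
    "funding_total_usd": 1_000_000,
}
features = derive_features(example)
print(features)
```

Measuring operating startups up to a fixed snapshot date keeps `Lifetime` defined for both classes, which the exploratory findings on lifespan presuppose.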

### 2.2 Key Exploratory Findings
1. **Distribution of Lifespan**:
   - Closed startups typically have a lifespan of less than 2000 days (~5.5 years).
   - Operational startups show a broader range of lifespans.

2. **Funding Trends**:
   - Most startups raise under 10 million USD.
   - High funding is rare but significantly influences outcomes.

3. **Geographic Insights**:
   - USA dominates the dataset, followed by China and India.
   - Certain cities (e.g., San Francisco, New York) have higher startup densities and funding levels.

---

## 3. Model Description

### 3.1 Model Used
The classification model is a **Gradient Boosted Trees (GBT)** algorithm trained to distinguish between closed and operational startups. Key features used in the model:
- **Lifetime**
- **Days Since First Funding**
- **Log Funding Total USD**
- **Category List**
- **Geographical Features (City, Region, Country Code)**

### 3.2 Hyperparameters
- **Number of Trees**: 58
- **Max Depth**: 6
- **Loss Function**: Binomial Log Likelihood
- **Custom Threshold**: Optimized for recall on operational startups.
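
To illustrate what a custom threshold does, the sketch below applies two cut-offs to predicted probabilities of being operational and compares per-class recall. The scores, labels, and the values 0.5 and 0.3 are made up for illustration; the report does not state the threshold actually used.

```python
def predict_with_threshold(probabilities, threshold):
    """Label a startup operational (1) when P(operational) >= threshold, else closed (0)."""
    return [1 if p >= threshold else 0 for p in probabilities]

def recall(y_true, y_pred, label):
    """Share of startups whose true class is `label` that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(1 for p in predictions_for_label if p == label) / len(predictions_for_label)

# Made-up labels and model scores, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
p_operational = [0.9, 0.8, 0.45, 0.35, 0.6, 0.4, 0.2, 0.1]

default_pred = predict_with_threshold(p_operational, 0.5)
lowered_pred = predict_with_threshold(p_operational, 0.3)

for name, pred in [("threshold 0.5", default_pred), ("threshold 0.3", lowered_pred)]:
    print(name,
          "operational recall:", recall(y_true, pred, 1),
          "closed recall:", recall(y_true, pred, 0))
```

Lowering the threshold labels more startups as operational, so recall for Class 1 rises while recall for Class 0 falls.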

---

## 4. Classification Metrics

### 4.1 Performance Summary

| Metric         | Class 0 (Closed) | Class 1 (Operational) | Macro Avg | Weighted Avg |
|----------------|------------------|------------------------|-----------|--------------|
| **Precision**  | 1.00             | 0.84                   | 0.92      | 0.92         |
| **Recall**     | 0.81             | 1.00                   | 0.90      | 0.90         |
| **F1-Score**   | 0.89             | 0.91                   | 0.90      | 0.90         |
| **Support**    | 984              | 983                    | -         | 1967         |
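
For readers who want to check how the table's figures relate to one another, the sketch below derives every metric from a confusion matrix. The counts are an assumption: the report does not give the matrix, so these are illustrative values chosen to match the supports and to reproduce the rounded table entries.

```python
# Rows are true classes, columns are predicted classes, in the order [closed, operational].
# Illustrative counts consistent with the rounded table; not values reported by the notebook.
confusion = [[796, 188],
             [0, 983]]

def per_class_metrics(matrix, k):
    """Return (precision, recall, f1, support) for class index k."""
    tp = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum: everything predicted as k
    support = sum(matrix[k])                     # row sum: everything truly k
    precision = tp / predicted_k
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

rows = [per_class_metrics(confusion, k) for k in range(2)]
total = sum(r[3] for r in rows)

# Macro average weights classes equally; weighted average weights them by support.
macro = [sum(r[i] for r in rows) / len(rows) for i in range(3)]
weighted = [sum(r[i] * r[3] for r in rows) / total for i in range(3)]
accuracy = sum(confusion[k][k] for k in range(2)) / total

for name, r in zip(["closed", "operational"], rows):
    print(f"{name:12s} precision={r[0]:.2f} recall={r[1]:.2f} f1={r[2]:.2f} support={r[3]}")
print("macro   ", [f"{m:.2f}" for m in macro])
print("weighted", [f"{w:.2f}" for w in weighted])
print(f"accuracy {accuracy:.2f}")
```

Because the two classes have nearly equal support (984 and 983), the macro and weighted averages come out almost identical.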

### 4.2 Interpretation
- **Precision**: For closed startups (Class 0), precision is 1.00, indicating no false positives: no operational startup is predicted as closed. For operational startups (Class 1), precision is 0.84, so roughly 16% of the startups predicted as operational are in fact closed.
- **Recall**: The model identifies all operational startups (recall = 1.00) but misses about 19% of closed startups (recall = 0.81).
- **F1-Score**: F1 is similar for the two classes (0.89 for closed, 0.91 for operational), with slightly better results for operational startups.
- **Overall Accuracy**: About 90% of the 1,967 evaluated startups are classified correctly.

---

## 5. Feature Importance

### 5.1 Key Features Driving Predictions
The most significant features driving the model's predictions are:
1. **Lifetime**: The duration a startup has been operational is the strongest predictor of closure.
2. **Days Since First Funding**: Startups that secure funding quickly tend to remain operational longer.
3. **Geographical Factors** (City, Region): Location plays a significant role in startup success, likely due to resource availability and ecosystem support.
4. **Category List**: The industry type impacts outcomes, with some sectors showing higher resilience.
5. **Log Funding Total USD**: While less critical than other factors, the total funding amount influences operational longevity.
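
The report does not say how feature importance was computed. One model-agnostic option is permutation importance: shuffle one feature at a time and measure how much accuracy drops. The sketch below is a minimal illustration under stated assumptions; `toy_predict` is a hypothetical stand-in for the trained model, and the data are made up.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, repeats=20, seed=0):
    """Mean accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with feature j replaced by its shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / repeats)
    return importances

def toy_predict(row):
    """Hypothetical model: operational (1) if lifetime exceeds 2000 days; ignores funding."""
    lifetime_days, log_funding = row
    return 1 if lifetime_days > 2000 else 0

# Made-up rows of (lifetime_days, log_funding_total_usd)
rows = [(500, 12.0), (900, 15.5), (1500, 13.1), (1800, 16.0),
        (2500, 12.4), (3200, 14.8), (4100, 13.9), (5000, 15.2)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

importances = permutation_importance(toy_predict, rows, labels, n_features=2)
print(dict(zip(["lifetime_days", "log_funding_total_usd"], importances)))
```

A feature the model never uses, such as funding in this toy case, gets an importance of exactly zero, while shuffling a feature the model depends on lowers accuracy.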

---

## 6. Strengths of the Model

1. **High Precision for Closed Startups**:
   - Avoids mislabeling operational startups as closed, which is crucial for real-world applications.

2. **Perfect Recall for Operational Startups**:
   - Ensures all viable businesses are correctly identified.

3. **Balanced Metrics**:
   - Macro and weighted averages are nearly identical (0.90 to 0.92), reflecting the almost equal class supports (984 vs. 983) and broadly consistent performance across both classes.

---

## 7. Limitations

1. **Lower Recall for Closed Startups**:
   - Some closed startups are classified as operational, potentially leading to misallocated resources.

2. **Feature Dependence**:
   - Heavy reliance on `Lifetime` and `Days Since First Funding` suggests the need for more diverse features.

3. **Threshold Bias**:
   - The custom threshold prioritizes recall for operational startups, which lowers precision for that class (0.84) and recall for closed startups (0.81).

---

## 8. Recommendations

1. **Enhance Feature Engineering**:
   - Incorporate additional data such as market trends, founder experience, and external economic factors.

2. **Optimize Hyperparameters**:
   - Perform grid search or Bayesian optimization to fine-tune the model's parameters.

3. **Experiment with Alternative Models**:
   - Compare with Random Forest, Neural Networks, or Logistic Regression to validate performance.

4. **Balance Precision and Recall**:
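   - Adjust the decision threshold to improve recall for closed startups while keeping recall for operational ones acceptably high. Catching more closures also raises precision for operational startups; the cost is a few operational startups flagged as closed.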
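
One way to explore the threshold adjustment in recommendation 4 is to search for the cut-off that maximizes recall for closed startups subject to a floor on recall for operational ones. The sketch below is illustrative only: the scores and labels are made up, and `pick_threshold` and the floor values are hypothetical choices, not part of the notebook.

```python
def recall_for(y_true, y_pred, label):
    """Share of startups whose true class is `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / sum(1 for t in y_true if t == label)

def pick_threshold(y_true, p_operational, min_operational_recall):
    """Return (threshold, closed_recall, operational_recall) with the highest closed
    recall among thresholds that keep operational recall at or above the floor."""
    best = None
    # Only the observed scores can change the predictions, so they suffice as candidates.
    for threshold in sorted(set(p_operational)):
        y_pred = [1 if p >= threshold else 0 for p in p_operational]
        rec_operational = recall_for(y_true, y_pred, 1)
        rec_closed = recall_for(y_true, y_pred, 0)
        if rec_operational >= min_operational_recall and (best is None or rec_closed > best[1]):
            best = (threshold, rec_closed, rec_operational)
    return best

# Made-up labels and scores, for illustration only
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_operational = [0.95, 0.85, 0.7, 0.55, 0.3, 0.6, 0.4, 0.25, 0.2, 0.1]

strict = pick_threshold(y_true, p_operational, min_operational_recall=1.0)
relaxed = pick_threshold(y_true, p_operational, min_operational_recall=0.8)
print("floor 1.0:", strict)
print("floor 0.8:", relaxed)
```

On this toy data, relaxing the floor on operational recall from 1.0 to 0.8 raises closed recall from 0.6 to 0.8, which is the kind of trade-off the recommendation asks to evaluate on a held-out set.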

---

## 9. Conclusion

The Gradient Boosted Trees model provides a strong baseline for predicting startup outcomes, achieving 90% accuracy with robust metrics across both classes. Future work should focus on improving recall for closed startups and exploring additional features to enhance prediction quality further.

---

## 10. Appendix

- **Key Visualizations**: Include funding distributions, lifespan histograms, and feature correlations.
- **Detailed EDA Findings**: Summarized insights into startup data trends and patterns.
