# <center><font color='magenta'>**Assignment 2 for DA3**</font></center>
### <center>Central European University 2025</center>
## <center> Technical report (Finding fast growing firms) </center>

#### <center> Created by: Gréta Zsikla & Károly Takács</center>

# Summary Report: Predicting Fast-Growing Firms 2025

## 1. Overview

This technical report details the methodology, decisions, and results of predicting fast-growing firms using the `bisnode-firms` dataset (2010-2015), supporting investment strategies for 2025. The target, `f_growth`, is a binary variable equal to 1 when sales grew by at least 20% between 2012 and 2013. We evaluate four models, three logistic regressions (M1, M2, LASSO) and a Random Forest (RF), across Task 1 (model selection) and Task 2 (industry comparison: Manufacturing vs. Services). RF emerges as the best model by cross-validated AUC and RMSE. All code is available at [GitHub: DA3AS2](https://github.com/Karoly97/DA3AS2/tree/main).


## 2. Data Preparation

### 2.1 Data Source and Initial Processing
- **Source**: `cs_bisnode_panel.csv` (287,829 rows), loaded from [OSF](https://osf.io/mbu3d).
- **Decision**: Filter to 2012-2013 (56,943 rows), focusing on 2013 for prediction (14,689 firms post-cleaning). Rationale: One-year growth balances data availability and relevance, avoiding sparsity in 2014-2015.


<details><summary>Load data</summary>

```python
import pandas as pd

df = pd.read_csv("https://osf.io/mbu3d/download")
clean = df[df["year"].isin([2012, 2013])].copy()  # keep 2012-2013; copy so later column assignments are safe
```

</details>

### 2.2 Cleaning and Feature Engineering

#### Cleaning Decisions:
- Drop high-NA columns (e.g., `COGS`, 94% missing) to reduce noise.
- Set negative asset values (e.g., `curr_assets`) to 0 and record the affected rows in `flag_asset_problem`.
- Impute missing `ceo_age` with the mean (50.5) so that firms without a recorded CEO age stay in the sample.
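
The three cleaning steps above can be sketched on a toy DataFrame. This is a minimal illustration rather than the project's actual code: the 0.9 missingness cutoff and the toy values are assumptions, and only `COGS`, `curr_assets`, `ceo_age`, and `flag_asset_problem` are names taken from the report.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the panel (values are illustrative only)
toy = pd.DataFrame({
    "COGS": [np.nan, np.nan, np.nan, np.nan],   # entirely missing in this toy example
    "curr_assets": [120.0, -15.0, 40.0, 0.0],
    "ceo_age": [45.0, np.nan, 60.0, np.nan],
})

# 1. Drop columns whose share of missing values exceeds a cutoff
#    (0.9 is an assumed cutoff, not a value stated in the report)
na_share = toy.isna().mean()
toy = toy.drop(columns=na_share[na_share > 0.9].index)

# 2. Flag negative asset values first, then set them to 0
toy["flag_asset_problem"] = (toy["curr_assets"] < 0).astype(int)
toy["curr_assets"] = toy["curr_assets"].clip(lower=0)

# 3. Mean-impute missing CEO age
toy["ceo_age"] = toy["ceo_age"].fillna(toy["ceo_age"].mean())
```

Creating the flag before clipping matters: once the negative values are replaced by 0, the information about which rows were problematic is gone.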

#### Feature Engineering:
- **Ratios**: Income-statement and balance-sheet items are normalized by sales or total assets, e.g., `personnel_exp_pl = personnel_exp / sales`.
- **Quadratic terms**: `profit_loss_year_pl_quad` to capture non-linearity.
- **Remove leakage**: Drop `sales12`, `growth`, and `growth2` from the predictors so that no outcome-period information enters the models.

<details><summary>Feature Engineering</summary>

```python
# Total assets as the sum of the three balance-sheet asset components
clean["total_assets_bs"] = clean[["intang_assets", "curr_assets", "fixed_assets"]].sum(axis=1)
# Personnel expenses as a share of sales
clean["personnel_exp_pl"] = clean["personnel_exp"] / clean["sales"]
# Mean-impute missing CEO age
clean["ceo_age"] = clean["ceo_age"].fillna(clean["ceo_age"].mean())
# Remove leakage-prone columns
clean = clean.drop(columns=["sales12", "growth", "growth2"])
```

</details>

**Rationale**: Ratios enhance interpretability, flags add categorical insights, and leakage removal ensures predictions rely on 2012 data only.
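
The quadratic term listed above does not appear in the snippet. A minimal sketch of how such a term could be built is given below. It assumes `profit_loss_year_pl` is `profit_loss_year` divided by `sales`, which the report does not state explicitly, and it uses a toy DataFrame rather than the real panel.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cleaned panel (values are illustrative only)
toy = pd.DataFrame({
    "profit_loss_year": [10.0, -5.0, 2.0],
    "sales": [100.0, 50.0, 0.0],
})

# Assumed definition: profit or loss as a share of sales.
# Division by zero sales yields +/-inf, which is replaced by NaN here.
ratio = toy["profit_loss_year"] / toy["sales"]
toy["profit_loss_year_pl"] = ratio.replace([np.inf, -np.inf], np.nan)

# The squared ratio lets a logit pick up curvature in the relationship
toy["profit_loss_year_pl_quad"] = toy["profit_loss_year_pl"] ** 2
```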

### 2.3 Target Definition

**Decision**: `f_growth = 1` if `(sales_2013 - sales_2012) / sales_2012 ≥ 20%` (3,810 positive, 26%).

**Rationale**: 20% exceeds typical growth (5-10%), aligning with corporate finance’s focus on high-return firms. A two-year growth window was rejected because of data gaps.

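<details><summary>Targeting</summary>

```python
pivoted = clean.pivot(index="comp_id", columns="year", values="sales")
growth_pct = (pivoted[2013] - pivoted[2012]) / pivoted[2012] * 100
# pivoted has one row per firm, so map the firm-level flag back onto the panel rows by comp_id
clean["f_growth"] = clean["comp_id"].map((growth_pct >= 20).astype(int))
```

</details>

---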

## 3. Task 1: Model Development and Selection

### 3.1 Model Specifications

- **M1 (Logit with Splines)**: 15 variables, e.g., `lspline(amort, [125000])` for non-linear effects.
- **M2 (Simple Logit)**: 19 variables, linear terms, e.g., `age`, `foreign`.
- **LASSO Logit**: 23 variables, L1 regularization, 20 non-zero coefficients post-tuning.
- **Random Forest (RF)**: 18 variables, tuned (`max_features=5, min_samples_split=90`).

**Decision**: Fit four models rather than the required three, to compare a wider range of approaches.
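
To make the spline term in M1 concrete, the sketch below builds a piecewise-linear basis with a single knot at 125,000 using plain NumPy. It is an assumption about what a helper such as `lspline` produces (one column for the part of the variable below the knot, one for the part above), not the project's implementation, and `linear_spline_basis` is a hypothetical name.

```python
import numpy as np

def linear_spline_basis(x, knot):
    """Return two columns: min(x, knot) and max(x - knot, 0).

    A regression on these columns fits one slope below the knot and a
    separate slope above it, with the two segments joined at the knot.
    """
    x = np.asarray(x, dtype=float)
    below = np.minimum(x, knot)
    above = np.maximum(x - knot, 0.0)
    return np.column_stack([below, above])

# Illustrative amortization values on either side of the knot
amort = np.array([50_000, 125_000, 200_000])
basis = linear_spline_basis(amort, knot=125_000)
```

The two columns always sum back to the original variable, so the basis only changes how the slope is allowed to vary, not the information it carries.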


<details><summary>RF Tuning</summary>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = {"max_features": range(1, 6), "min_samples_split": range(80, 100, 5)}
rf = RandomForestClassifier(n_estimators=500, random_state=50)
rf_grid = GridSearchCV(rf, grid, cv=5, scoring="roc_auc")
```

</details>




### 3.2 Evaluation Metrics

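- **Metrics**: Cross-validated (CV) AUC, CV RMSE of the predicted probabilities, and expected loss with misclassification costs `FP=1` and `FN=10`.
- **Decision**: Use 5-fold CV for robustness. The loss function prioritizes recall because a false negative (a missed fast-growing firm) is treated as ten times as costly as a false positive.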
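
The expected-loss criterion can be sketched as follows: classify at each candidate threshold, count false positives and false negatives, and weight them by their costs. The function name `expected_loss`, the threshold grid, and the toy probabilities are illustrative assumptions. For well-calibrated probabilities the cost-minimizing threshold is `FP / (FP + FN) = 1/11 ≈ 0.09`, so a low optimal threshold is what these costs imply.

```python
import numpy as np

def expected_loss(y_true, y_prob, threshold, fp_cost=1, fn_cost=10):
    """Average misclassification cost per observation at a given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return (fp * fp_cost + fn * fn_cost) / len(y_true)

# Toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 0, 1, 0])
y_prob = np.array([0.03, 0.22, 0.33, 0.12, 0.61, 0.41])

# Search a grid of thresholds and keep the one with the lowest expected loss
thresholds = np.arange(0.05, 0.80, 0.05)
losses = [expected_loss(y_true, y_prob, t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(losses))]
```

In practice the search would run on out-of-fold predictions within each CV fold, with the chosen threshold averaged across folds.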

#### Results (Table 1):

| Model  | Features | CV AUC | CV RMSE | Expected Loss (Threshold) |
|--------|---------|--------|---------|---------------------------|
| M1     | 26      | 0.637  | 0.436   | 0.68 (0.06) * |
| M2     | 20      | 0.629  | 0.437   | 0.69 (0.06) * |
| LASSO  | 20      | 0.632  | 0.436   | 0.68 (0.06) * |
| RF     | 18      | 0.660  | 0.431   | 0.68 (0.06) |



---

### 3.3 Model Selection

- **Decision**: Choose **RF** for best CV AUC (0.660) and RMSE (0.431).  
- **Rationale**: RF captures non-linear patterns (e.g., in `personnel_exp_pl`) better than the logit models.
- **Note**: A bug in the original selection logic (M1) was corrected during re-analysis.

---

### 3.4 Holdout Performance

**RF Results**  
- AUC = 0.65  
- RMSE = 0.434  
- Expected loss ≈ 0.68 (threshold = 0.06)

<details><summary>Confusion Matrix</summary>

```
TN = 555
FP = 1484
FN = 0
TP = 723

Recall = 1.0, Precision = 0.33
```
</details>
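
The recall and precision reported with the confusion matrix follow directly from the four counts. The short check below reproduces that arithmetic in plain Python; the variable names are illustrative.

```python
# Holdout confusion-matrix counts reported above
tn, fp, fn, tp = 555, 1484, 0, 723

recall = tp / (tp + fn)      # share of fast-growing firms that are caught
precision = tp / (tp + fp)   # share of flagged firms that are truly fast-growing
flagged_share = (tp + fp) / (tn + fp + fn + tp)  # share of all holdout firms flagged

print(f"Recall = {recall:.2f}, Precision = {precision:.2f}, Flagged = {flagged_share:.2f}")
```

Roughly 80% of holdout firms are flagged at this threshold, which is how recall reaches 1.0 while precision stays near one third.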

---

## 4. Task 2: Industry-Specific Analysis

### 4.1 Industry Split

- **Decision**: Split by NACE:  
  - **Manufacturing** (NACE 1000–3399, i.e. `1000 <= nace < 3400`), 4,861 firms  
  - **Services** (≥4500), 9,828 firms  

<details><summary>Code Snippet</summary>

```python
def classify_industry(nace):
    if 1000 <= nace < 3400:
        return "manufacturing"
    if nace >= 4500:
        return "services"
    return "other"

clean["industry"] = clean["nace_main"].apply(classify_industry)
```
</details>

---

### 4.2 RF Application

- **Decision**: Apply RF with the same loss function (FP=1, FN=10) and threshold optimization.

**Results (Table 2)**

| Industry       | Threshold | Expected Loss | AUC   | Precision | Recall | F1 Score |
|----------------|-----------|--------------:|------:|----------:|-------:|---------:|
| **Manufacturing** | 0.060   | 0.680        | 0.691 | 0.301     | 0.998  | 0.462    |
| **Services**      | 0.060   | 0.682        | 0.683 | 0.265     | 0.998  | 0.418    |
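
The F1 scores in Table 2 are the harmonic mean of precision and recall. A quick check in Python using the rounded table values is sketched below. Because the inputs are rounded, the last digit can differ slightly from a score computed on unrounded values, and `f1_from_pr` is a local helper name, not a library function.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in Table 2
f1_manufacturing = f1_from_pr(0.301, 0.998)
f1_services = f1_from_pr(0.265, 0.998)
```

With recall near 1 in both industries, the F1 gap between Manufacturing and Services is driven almost entirely by the difference in precision.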

---

### 4.3 Feature Importance

- **Manufacturing**: `personnel_exp_pl` (0.040), `profit_loss_year` (0.032).  
- **Services**: `profit_loss_year` (0.031), `sales` (0.030).

<details><summary>Graph (Placeholder)</summary>

```python
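import matplotlib.pyplot as plt
import pandas as pd

# rf is the fitted Random Forest and X its training feature matrix
feats = pd.Series(rf.feature_importances_, index=X.columns).nlargest(5)
feats.sort_values().plot(kind="barh", title="Top 5 Features (Manufacturing)")
plt.tight_layout()
plt.savefig("feature_importance.png")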
```
</details>

**Figure 1**: Feature Importance (Manufacturing)

---

## 5. Discussion

### 5.1 Data Decisions

- **20% Threshold**: Balances sensitivity and sample size; a 50% threshold would reduce positives excessively.  
- **Leakage Removal**: Essential for validity (e.g., dropping `sales12`). Note that exploratory analysis with LOWESS was limited because the `scikit-misc` package was not available.

### 5.2 Model Performance

- **RF Advantage**: RF's non-linear modeling outperforms the logit models, with a CV AUC of 0.660 against 0.629-0.637 (Table 1).  
- **Threshold (0.06)**: Driven by high FN cost, ensuring maximal recall (1.0) at the expense of lower precision.

### 5.3 Industry Insights

- **Manufacturing Edge**: Higher AUC (0.691 vs. 0.683) and F1 (0.462 vs. 0.418) suggest more structured, consistent predictors (e.g., labor costs).  
- **Services Noise**: Lower precision (0.265 vs. 0.301), plausibly due to sector diversity.

---

## 6. Conclusion

RF with a `0.06` threshold is recommended for deployment, offering `AUC=0.66-0.69`, `RMSE=0.43`, and `expected loss=0.68` across tasks. Future work could enhance features or adjust cost ratios.

Full code is available at [GitHub: DA3AS2](https://github.com/Karoly97/DA3AS2/tree/main).
