📘 **Finding Fast-Growing Firms**

Data Analysis 3 – Assignment 2
CEU 2025
Prepared by: Farangiz Jurakhonova

**Business Motivation**

High-growth firms drive innovation, employment, and economic expansion.
Identifying such firms early is valuable for:

- Investors allocating capital

- Banks assessing credit growth opportunities

- Policymakers supporting dynamic sectors

The goal of this project is to design and evaluate predictive models that identify firms likely to experience rapid revenue growth.

We approach this as a classification problem using firm-level financial data from 2010–2015.

In [4]:
# Environment Setup; Core packages
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix

# Settings
pd.set_option('display.max_columns', None)
sns.set_style("whitegrid")

print("Environment ready.")


Environment ready.


In [6]:
# Data Loading

url = "https://osf.io/download/3qyut/"

data = pd.read_csv(url)

print("Shape:", data.shape)
print("Years:", data['year'].unique())
data.head()



Shape: (287829, 48)
Years: [2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016]


Unnamed: 0,comp_id,begin,end,COGS,amort,curr_assets,curr_liab,extra_exp,extra_inc,extra_profit_loss,finished_prod,fixed_assets,inc_bef_tax,intang_assets,inventories,liq_assets,material_exp,net_dom_sales,net_exp_sales,personnel_exp,profit_loss_year,sales,share_eq,subscribed_cap,tang_assets,wages,D,balsheet_flag,balsheet_length,balsheet_notfullyear,year,founded_year,exit_year,ceo_count,foreign,female,birth_year,inoffice_days,gender,origin,nace_main,ind2,ind,urban_m,region_m,founded_date,exit_date,labor_avg
0,1001034.0,2005-01-01,2005-12-31,,692.59259,7266.666504,7574.074219,0.0,0.0,0.0,,1229.629639,218.518524,0.0,4355.555664,2911.111084,38222.222656,,,22222.222656,62.962963,62751.851562,881.481506,1388.888916,1229.629639,,,0,364,0,2005,1990.0,,2.0,0.0,0.5,1968.0,5686.5,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
1,1001034.0,2006-01-01,2006-12-31,,603.703674,13122.222656,12211.111328,0.0,0.0,0.0,,725.925903,996.296326,0.0,7225.925781,5896.296387,38140.742188,,,23844.445312,755.555542,64625.925781,1637.036987,1388.888916,725.925903,,,0,364,0,2006,1990.0,,2.0,0.0,0.5,1968.0,5686.5,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
2,1001034.0,2007-01-01,2007-12-31,,425.925934,8196.295898,7800.0,0.0,0.0,0.0,,1322.222168,570.370361,0.0,7951.852051,177.777771,40174.074219,,,22262.962891,0.0,65100.0,1633.333374,1388.888916,1322.222168,,,0,364,0,2007,1990.0,,2.0,0.0,0.5,1968.0,5686.5,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
3,1001034.0,2008-01-01,2008-12-31,,300.0,8485.185547,7781.481445,0.0,0.0,0.0,,1022.222229,714.814819,0.0,5233.333496,1392.592651,54274.074219,,,21107.408203,0.0,78085.1875,1725.925903,1481.481445,1022.222229,,,0,365,0,2008,1990.0,,2.0,0.0,0.5,1968.0,5686.5,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
4,1001034.0,2009-01-01,2009-12-31,,207.40741,5137.037109,15300.0,0.0,0.0,0.0,,814.814819,-11044.444336,0.0,3259.259277,11.111111,41755.554688,,,13237.037109,-11074.074219,45388.890625,-9348.148438,1481.481445,814.814819,,,0,364,0,2009,1990.0,,2.0,0.0,0.5,1968.0,5686.5,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,0.083333


**Restricting the Panel to 2010–2015**

To ensure consistency and avoid structural breaks, we restrict the analysis to the 2010–2015 period. This window:

- Aligns with the seminar framework

- Avoids early crisis volatility

- Allows forward-looking growth measurement

- Preserves sufficient sample size

We keep the firm-year panel structure at this stage rather than collapsing the data to a single cross-section.

In [8]:
# Restrict to 2010–2015
data = data.query("year >= 2010 & year <= 2015")

print("Years included:", data['year'].unique())
print("Shape after restriction:", data.shape)



Years included: [2010 2011 2012 2013 2014 2015]
Shape after restriction: (167606, 48)


The filtered dataset contains 167,606 firm-year observations spanning 2010–2015. This ensures temporal consistency and allows forward growth construction without look-ahead bias.

**Target Variable Design: Defining Fast Growth**

Before constructing the target, we evaluate alternative growth definitions:

1. One-year log revenue growth

2. Two-year cumulative growth

3. Asset growth

Revenue growth is preferred because:

- Firm valuation in corporate finance depends on expected future cash flows.

- Revenue expansion signals market traction and competitive strength.

- Log-differences approximate percentage changes and stabilize variance.

We therefore define:

*Fast growth = Top 25% of one-year log revenue growth (2013 vs 2012)*

Defining the class by percentile fixes the share of fast-growth firms at about 25%, so both classes are well represented, and it avoids committing to an absolute growth cutoff in advance.
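To make the link between log differences and percentage changes concrete, here is a minimal sketch. The sales figures are hypothetical and chosen only for illustration; they are not drawn from the dataset.

```python
import numpy as np

# Hypothetical sales figures for four firms (not taken from the dataset)
sales_2012 = np.array([100.0, 100.0, 100.0, 100.0])
sales_2013 = np.array([102.0, 110.0, 135.0, 300.0])

pct_growth = sales_2013 / sales_2012 - 1              # simple growth rate
log_growth = np.log(sales_2013) - np.log(sales_2012)  # log difference

for pct, lg in zip(pct_growth, log_growth):
    print(f"percentage growth = {pct:7.1%}   log growth = {lg:.3f}")

# A log difference converts back to a growth rate via exp(g) - 1
recovered = np.expm1(log_growth)
```

For small changes the two measures nearly coincide, while for large changes the log difference is noticeably smaller than the percentage change. A log difference of about 0.30, for example, corresponds to roughly 35% growth in sales.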

In [12]:
# Extract Firms Observed in 2012 and 2013
growth_panel = data.query("year == 2012 or year == 2013")

print(growth_panel['year'].value_counts())


year
2013    28474
2012    28469
Name: count, dtype: int64


In [14]:
# Pivot to Wide Format
wide = growth_panel.pivot(index='comp_id', columns='year', values='sales')

wide = wide.dropna(subset=[2012, 2013])

print("Number of firms with both years:", wide.shape)


Number of firms with both years: (24895, 2)


In [15]:
# Compute Log Growth
wide['log_growth'] = np.log(wide[2013]) - np.log(wide[2012])

print(wide['log_growth'].describe())


count    2.198600e+04
mean              NaN
std               NaN
min              -inf
25%     -2.731474e-01
50%      3.187579e-02
75%      3.350403e-01
max               inf
Name: log_growth, dtype: float64


  result = getattr(ufunc, method)(*inputs, **kwargs)
  result = getattr(ufunc, method)(*inputs, **kwargs)
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)


**Handling Log Transformation Issues**

Revenue growth is computed using log differences to approximate percentage changes.
However, log transformations are undefined for zero or negative sales.

Therefore:

- Firms with non-positive sales in either year were removed.

- Infinite and undefined growth values were excluded.

This ensures statistical validity and avoids distortions from extreme accounting cases.
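The rule above can also be written as a single guarded computation that never calls `np.log` on invalid inputs, which avoids the runtime warnings. This is only a sketch: the helper name `safe_log_growth` and the toy values are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def safe_log_growth(prev: pd.Series, curr: pd.Series) -> pd.Series:
    """Log growth where both values are strictly positive, NaN elsewhere.

    Hypothetical helper; assumes `prev` and `curr` share the same index.
    """
    valid = (prev > 0) & (curr > 0)  # comparisons with NaN evaluate to False
    out = pd.Series(np.nan, index=prev.index, dtype=float)
    out[valid] = np.log(curr[valid]) - np.log(prev[valid])
    return out

# Toy data covering zero, missing, and negative sales
toy = pd.DataFrame({
    "s2012": [100.0, 0.0, 50.0, np.nan, -5.0],
    "s2013": [150.0, 20.0, 0.0, 30.0, 10.0],
})
growth = safe_log_growth(toy["s2012"], toy["s2013"])
print(growth)
```

Only the first toy firm has positive sales in both years, so it is the only one that receives a finite growth value; a subsequent `dropna` would remove the rest.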

In [16]:
# Remove non-positive sales before log transformation
wide = wide[(wide[2012] > 0) & (wide[2013] > 0)]

# Compute log growth again
wide['log_growth'] = np.log(wide[2013]) - np.log(wide[2012])

# Remove inf or NaN
wide = wide.replace([np.inf, -np.inf], np.nan)
wide = wide.dropna(subset=['log_growth'])

print("Clean growth sample:", wide.shape)
print(wide['log_growth'].describe())


Clean growth sample: (19900, 3)
count    19900.000000
mean         0.045070
std          1.031769
min         -9.429155
25%         -0.214525
50%          0.036617
75%          0.296897
max         11.486854
Name: log_growth, dtype: float64


Mean log growth between 2012 and 2013 is about 0.045, i.e. modest positive revenue growth of roughly 4–5%, consistent with post-crisis recovery dynamics.

However, the distribution exhibits extreme tails:

- Some firms show log declines below −9 (near-collapse)

- Others show log increases above +11 (explosive growth)

Such extreme values are likely driven by:

- Very small base-year sales

- Accounting adjustments

- Data reporting artifacts

While informative, these outliers may distort model estimation, particularly for logistic regression.

In [18]:
# Winsorized Growth Distribution

lower = wide['log_growth'].quantile(0.01)
upper = wide['log_growth'].quantile(0.99)

wide['log_growth_w'] = wide['log_growth'].clip(lower, upper)

print(wide['log_growth_w'].describe())


count    19900.000000
mean         0.045592
std          0.910640
min         -3.381332
25%         -0.214525
50%          0.036617
75%          0.296897
max          3.447826
Name: log_growth_w, dtype: float64


Winsorization at the 1st and 99th percentiles substantially reduces the influence of extreme outliers while preserving the core distributional structure.

The standard deviation decreased from 1.03 to 0.91, indicating that extreme leverage points have been controlled. Importantly, the median and interquartile range remain unchanged, meaning the central growth dynamics are preserved.

This is intended to provide:

- Greater numerical stability for logistic regression

- More robust classification thresholds

- Reduced risk of overfitting to extreme accounting cases

We proceed using the winsorized growth variable.
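One property worth checking is how winsorization interacts with a percentile-based label. Clipping at the 1st and 99th percentiles is monotone and leaves the 75th percentile untouched, so a top-quartile indicator should be the same whether it is built from the raw or the winsorized series. The sketch below illustrates this on simulated heavy-tailed data; the Student-t draw is an assumption standing in for the real growth distribution.

```python
import numpy as np
import pandas as pd

# Simulated heavy-tailed "growth" values (stand-in for the real series)
rng = np.random.default_rng(0)
g = pd.Series(rng.standard_t(df=2, size=10_000))

# Winsorize at the 1st and 99th percentiles
lower, upper = g.quantile(0.01), g.quantile(0.99)
g_w = g.clip(lower, upper)

# Top-quartile label from the raw and from the winsorized series
label_raw = (g > g.quantile(0.75)).astype(int)
label_w = (g_w > g_w.quantile(0.75)).astype(int)
labels_identical = label_raw.equals(label_w)

print("std raw:       ", round(g.std(), 3))
print("std winsorized:", round(g_w.std(), 3))
print("labels identical:", labels_identical)
```

In this setup winsorization shrinks the dispersion of the continuous variable but does not move any observation across the classification threshold, so its benefit applies to uses of the continuous growth measure rather than to the binary target.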

**Defining Fast Growth**

We now define:

*Fast growth = Top 25% of winsorized log revenue growth*

This percentile-based definition provides:

- A fixed class distribution (about 25% positives)

- Meaningful economic distinction

- No reliance on an arbitrary fixed growth cutoff

In [19]:
threshold = wide['log_growth_w'].quantile(0.75)

wide['fast_growth'] = (wide['log_growth_w'] > threshold).astype(int)

print("Threshold:", threshold)
print(wide['fast_growth'].value_counts())
print(wide['fast_growth'].value_counts(normalize=True))


Threshold: 0.2968969116885578
fast_growth
0    14925
1     4975
Name: count, dtype: int64
fast_growth
0    0.75
1    0.25
Name: proportion, dtype: float64


The percentile-based threshold produces a clean 75–25 class split, which gives:

- Sufficient representation of both classes

- No severe imbalance problem

- Enough positive cases in every cross-validation fold

Economically, firms whose log revenue growth exceeds about 0.30 (roughly 35% growth in sales) represent a distinct expansion group. These firms are likely benefiting from strong demand shocks, innovation, or strategic repositioning.

This definition balances statistical practicality with economic meaning.

In [20]:
# Construct Modeling Dataset

# Take 2012 cross-section
base_2012 = data.query("year == 2012").copy()

# Merge fast growth label
model_data = base_2012.merge(
    wide[['log_growth_w', 'fast_growth']],
    left_on='comp_id',
    right_index=True,
    how='inner'
)

print("Model dataset shape:", model_data.shape)
print("Class balance:")
print(model_data['fast_growth'].value_counts(normalize=True))


Model dataset shape: (19900, 50)
Class balance:
fast_growth
0    0.75
1    0.25
Name: proportion, dtype: float64


The final modeling dataset contains 19,900 firms observed in 2012 with realized growth in 2013. The class distribution is balanced enough to avoid severe imbalance issues, while still reflecting the economic reality that high-growth firms are a minority.

This setup mirrors a realistic forecasting problem: using current financial and structural characteristics to identify firms likely to expand rapidly in the near future.
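The merge that builds this dataset implicitly assumes one 2012 row per firm and one label per firm. A minimal sketch of how that assumption could be checked uses the `validate` argument of `DataFrame.merge`; the toy frames below are hypothetical and use `comp_id` as an ordinary column on both sides for simplicity.

```python
import pandas as pd

# Hypothetical toy inputs: firm 3 has no label, firm 4 has no 2012 features
features_2012 = pd.DataFrame({"comp_id": [1, 2, 3], "sales": [10.0, 20.0, 30.0]})
labels = pd.DataFrame({"comp_id": [1, 2, 4], "fast_growth": [0, 1, 0]})

# validate="one_to_one" raises MergeError if a key repeats on either side
merged = features_2012.merge(labels, on="comp_id", how="inner", validate="one_to_one")
print(merged)

# Introduce a duplicated firm row to show the check firing
duplicated = pd.concat([features_2012, features_2012.iloc[[0]]], ignore_index=True)
try:
    duplicated.merge(labels, on="comp_id", how="inner", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print("Duplicate key detected:", duplicate_detected)
```

With an inner join, firms missing from either side drop out silently, so comparing the row count before and after the merge (19,900 in both cases here) is a useful complementary check.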

**Feature Engineering for Growth Prediction**

In [25]:
# Size
model_data['sales_pos'] = np.where(model_data['sales'] > 0, model_data['sales'], np.nan)
model_data['log_sales'] = np.log(model_data['sales_pos'])

# Age
model_data['age'] = model_data['year'] - model_data['founded_year']
model_data['age'] = np.where(model_data['age'] < 0, 0, model_data['age'])
model_data['age2'] = model_data['age'] ** 2

# Financial ratios
model_data['total_assets'] = (
    model_data['intang_assets'] +
    model_data['curr_assets'] +
    model_data['fixed_assets']
)

model_data['profit_margin'] = model_data['profit_loss_year'] / model_data['sales']
model_data['leverage'] = model_data['curr_liab'] / model_data['total_assets']
model_data['liquidity'] = model_data['liq_assets'] / model_data['total_assets']

# CEO age
model_data['ceo_age'] = model_data['year'] - model_data['birth_year']

# Replace infinities
model_data = model_data.replace([np.inf, -np.inf], np.nan)


In [26]:
model_data[['log_sales','age','profit_margin','leverage','liquidity','ceo_age']].describe()


| statistic | log_sales | age | profit_margin | leverage | liquidity | ceo_age |
|---|---|---|---|---|---|---|
| count | 19900.0 | 18094.0 | 19895.0 | 19879.0 | 19879.0 | 15564.0 |
| mean | 10.772295 | 9.133967 | -0.533459 | 3.696456 | 0.224777 | 46.828258 |
| std | 1.973711 | 6.980155 | 12.944828 | 71.441372 | 0.282243 | 11.316954 |
| min | 1.309333 | 0.0 | -1157.500039 | -28.789271 | -7.096189 | -4.0 |
| 25% | 9.640438 | 3.0 | -0.201025 | 0.163307 | 0.024087 | 38.0 |
| 50% | 10.742137 | 8.0 | 0.002458 | 0.471013 | 0.101567 | 46.0 |
| 75% | 11.888771 | 15.0 | 0.041667 | 0.999737 | 0.328073 | 55.0 |
| max | 18.472229 | 32.0 | 752.109993 | 5773.00035 | 3.484716 | 92.0 |


In [27]:
# Cleaning and Stabilizing Predictors
# Winsorization function
def winsorize(series, lower_q=0.01, upper_q=0.99):
    lower = series.quantile(lower_q)
    upper = series.quantile(upper_q)
    return series.clip(lower, upper)

# Apply winsorization
model_data['profit_margin_w'] = winsorize(model_data['profit_margin'])
model_data['leverage_w'] = winsorize(model_data['leverage'])
model_data['liquidity_w'] = winsorize(model_data['liquidity'])

# Fix CEO age (set implausible values outside 25-75 to missing)
model_data['ceo_age'] = np.where(
    (model_data['ceo_age'] < 25) | (model_data['ceo_age'] > 75),
    np.nan,
    model_data['ceo_age']
)

# Check again
model_data[['profit_margin_w','leverage_w','liquidity_w','ceo_age']].describe()


| statistic | profit_margin_w | leverage_w | liquidity_w | ceo_age |
|---|---|---|---|---|
| count | 19895.0 | 19879.0 | 19879.0 | 15309.0 |
| mean | -0.242165 | 1.597182 | 0.225157 | 46.949921 |
| std | 0.938913 | 4.168879 | 0.275481 | 10.825535 |
| min | -6.886912 | 0.0 | 0.0 | 25.0 |
| 25% | -0.201025 | 0.163307 | 0.024087 | 38.0 |
| 50% | 0.002458 | 0.471013 | 0.101567 | 46.0 |
| 75% | 0.041667 | 0.999737 | 0.328073 | 55.0 |
| max | 0.74156 | 31.082349 | 1.0 | 75.0 |


The mean winsorized margin is negative (about −0.24) while the median is slightly positive, so the average is pulled down by a left tail of loss-making firms, a pattern common in SME panels.

In [28]:
predictors = [
    'log_sales',
    'age',
    'age2',
    'profit_margin_w',
    'leverage_w',
    'liquidity_w',
    'ceo_age',
    'foreign',
    'female'
]

# Drop rows with missing predictor values
model_final = model_data.dropna(subset=predictors)

print("Final modeling sample:", model_final.shape)
print("Class balance:")
print(model_final['fast_growth'].value_counts(normalize=True))


Final modeling sample: (15291, 62)
Class balance:
fast_growth
0    0.763979
1    0.236021
Name: proportion, dtype: float64


The final dataset represents established firms observed in 2012, with financial, structural, and managerial characteristics used to predict whether they will experience top-quartile revenue growth in 2013.
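Dropping incomplete rows reduces the sample from 19,900 to 15,291 firms and shifts the fast-growth share from 25% to about 23.6%. A quick diagnostic for this kind of step is to count missing values per predictor and compare the class share before and after. The sketch below uses a small hypothetical frame (`toy`, `toy_predictors`) rather than the real `model_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two predictors
toy = pd.DataFrame({
    "log_sales": [10.1, 9.5, 11.2, 8.7, 10.9, 9.9],
    "ceo_age": [45.0, np.nan, 52.0, np.nan, 38.0, 61.0],
    "foreign": [0.0, 0.0, 1.0, 0.0, np.nan, 0.0],
    "fast_growth": [0, 1, 0, 1, 0, 0],
})
toy_predictors = ["log_sales", "ceo_age", "foreign"]

# Which predictor costs the most rows?
missing_counts = toy[toy_predictors].isna().sum().sort_values(ascending=False)
print(missing_counts)

# How does the positive-class share change after listwise deletion?
complete = toy.dropna(subset=toy_predictors)
share_before = toy["fast_growth"].mean()
share_after = complete["fast_growth"].mean()
print(f"rows: {len(toy)} -> {len(complete)}")
print(f"fast-growth share: {share_before:.2f} -> {share_after:.2f}")
```

In the toy example the fast-growth firms happen to be the ones with missing CEO age, so listwise deletion removes them all. In the real data the `describe()` counts above suggest that `ceo_age` is the main source of dropped rows.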

The modeling setup satisfies three core principles of sound predictive analysis:

- Temporal ordering: predictors precede outcomes.

- No information leakage: growth itself is excluded from predictors.

- Economically interpretable variables: size, age, leverage, profitability, liquidity, and governance characteristics are grounded in firm-level finance theory.

The dataset is now fully prepared for predictive modeling.
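As a rough illustration of how a prepared dataset of this shape might be evaluated with the tools imported at the top of the notebook, the sketch below runs a cross-validated logistic regression on simulated data. Everything in it is an assumption made for illustration: the three synthetic predictors, the 25% positive share, the five folds, and the use of `make_pipeline` so that scaling is fitted inside each fold. It is not the notebook's actual modeling step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the predictor matrix and the fast-growth label
rng = np.random.default_rng(42)
n = 2_000
X = rng.normal(size=(n, 3))
latent = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
y = (latent > np.quantile(latent, 0.75)).astype(int)  # top quartile = 1

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predicted probabilities for the positive class
proba = cross_val_predict(pipe, X, y, cv=cv, method="predict_proba")[:, 1]
auc = roc_auc_score(y, proba)
print(f"Out-of-fold AUC: {auc:.3f}")
```

Stratified folds keep the roughly 25% positive share in every split, and fitting the scaler inside the pipeline keeps held-out rows from influencing preprocessing.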