# 🤖 **Machine Learning Regression: Startup Profit Prediction**

**Author:** Nadia Rozman  
**Date:** January 2026  

## **Project Overview**

This notebook applies a **structured, classification-style workflow** (inspired by my Drug Classification ML project) to a **regression problem**. The goal is not only to predict startup profit, but to **compare models systematically**, justify each step, and translate results into **business insights**.

**Objective:**
- Predict startup profit based on **R&D**, **Administration**, and **Marketing** spend
- Compare multiple regression algorithms
- Identify the **best-performing model**
- Interpret **feature importance** for decision-making

**Models evaluated:**
- Decision Tree Regressor
- Polynomial Regression
- Random Forest Regressor
- Support Vector Regressor (SVR)

### **Import libraries**

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

### **Load and Explore Data**

In [None]:
# Load the startup profit dataset
dataset = pd.read_csv('ML_Regression_Startup_Profit_Prediction/data/startup_profit_dataset.csv')

In [32]:
# Display first few rows to understand data structure
dataset.head()

# Display dataset information (data types, non-null counts, memory usage)
dataset.info()

# Display statistical summary (mean, std, min, max, quartiles)
dataset.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


| | R&D Spend | Administration | Marketing Spend | Profit |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 73721.6156 | 121344.6396 | 211025.0978 | 112012.6392 |
| std | 45902.256482 | 28017.802755 | 122290.310726 | 40306.180338 |
| min | 0.0 | 51283.14 | 0.0 | 14681.4 |
| 25% | 39936.37 | 103730.875 | 129300.1325 | 90138.9025 |
| 50% | 73051.08 | 122699.795 | 212716.24 | 107978.19 |
| 75% | 101602.8 | 144842.18 | 299469.085 | 139765.9775 |
| max | 165349.2 | 182645.56 | 471784.1 | 192261.83 |


### **Feature Selection & Target Definition**

In [33]:
# Select features: R&D Spend, Administration, Marketing Spend
# Exclude 'State' column (categorical variable not used in this analysis)
X = dataset[['R&D Spend', 'Administration', 'Marketing Spend']].values

# Define target variable: Profit
y = dataset['Profit'].values

### **Train-Test Split**

In [34]:
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
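
With only 50 rows, an 80/20 split leaves a very small test set. The sketch below checks the resulting sizes on a synthetic stand-in for the dataset; the names `X_demo`, `y_demo`, and the random values are illustrative assumptions, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 50-row dataset: 3 feature columns, 1 target
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = rng.random(50)

# Same split settings as the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (40, 3) (10, 3)
```

Every test metric in this notebook is therefore computed on 10 samples, so small differences between models should be read with caution.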

### **Feature Scaling**

In [35]:
# Initialize scalers for features (X) and target (y)
sc_X = StandardScaler()
sc_y = StandardScaler()

# Fit scaler on training data and transform both train and test features
X_train_scaled = sc_X.fit_transform(X_train)
X_test_scaled = sc_X.transform(X_test)

# Scale target variable (critical for SVR performance)
y_train_scaled = sc_y.fit_transform(y_train.reshape(-1,1)).ravel()
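
As a possible alternative to managing `sc_X` and `sc_y` by hand, scikit-learn's `Pipeline` and `TransformedTargetRegressor` can bundle the scaling with the model so that predictions come back in the original units. This is a minimal sketch on synthetic data, not the approach used in this notebook; `X_demo`, `y_demo`, and the coefficients that generate them are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumed, not the startup dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100_000, size=(50, 3))
y_demo = 0.8 * X_demo[:, 0] + 0.05 * X_demo[:, 2] + rng.normal(0, 1_000, 50)

# Features are scaled inside the pipeline; the target is scaled by
# `transformer` before fitting and inverse-transformed after predicting.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    transformer=StandardScaler(),
)

model.fit(X_demo[:40], y_demo[:40])
pred = model.predict(X_demo[40:])  # already in the original target units
print(pred.shape)
```

Because both scalers are fitted inside `fit`, the test rows never influence the scaling statistics, and no manual `inverse_transform` call is needed.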

### **Model Training & Evaluation Template**

In [36]:
# Define helper function to calculate regression metrics consistently
def evaluate_model(y_true, y_pred):
    """
    Calculate R², RMSE, and MAE for regression model evaluation.
    
    Returns:
        dict: Dictionary containing R², RMSE, and MAE metrics
    """
    return {
        'R²': float(r2_score(y_true, y_pred)),
        'RMSE': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'MAE': float(mean_absolute_error(y_true, y_pred))
    }
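
To make the three metrics concrete, here is a small worked example that computes them directly from their definitions with NumPy. The four values in `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np

# Made-up values chosen so the arithmetic is easy to follow
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_pred                      # [-10, 10, -20, 20]
ss_res = np.sum(residuals ** 2)                  # 1000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 50000

r2 = 1 - ss_res / ss_tot                 # share of variance explained
rmse = np.sqrt(np.mean(residuals ** 2))  # penalizes large errors more
mae = np.mean(np.abs(residuals))         # average absolute error

print(r2, rmse, mae)
```

Here R² is 0.98, RMSE is the square root of 250 (about 15.81), and MAE is 15.0. RMSE and MAE are in the same units as the target, which in this notebook is profit.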

### **Model 1: Decision Tree Regression**

In [37]:
# Initialize and train Decision Tree Regressor
dt = DecisionTreeRegressor(random_state=0)
dt.fit(X_train_scaled, y_train_scaled)

# Make predictions on test set
# Inverse transform to convert scaled predictions back to original scale
y_pred_dt = sc_y.inverse_transform(
    dt.predict(X_test_scaled).reshape(-1,1)
).ravel()

# Evaluate model performance
metrics_dt = evaluate_model(y_test, y_pred_dt)
metrics_dt

{'R²': 0.9766026460977932, 'RMSE': 5470.162630493728, 'MAE': 4480.289000000006}

### **Model 2: Polynomial Regression (Degree 2)**

In [38]:
# Create polynomial features (degree 2) from scaled training data
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train_scaled)
X_poly_test = poly.transform(X_test_scaled)

# Train linear regression model on polynomial features
poly_model = LinearRegression()
poly_model.fit(X_poly_train, y_train_scaled)

# Make predictions on test set
# Inverse transform to convert scaled predictions back to original scale
y_pred_poly = sc_y.inverse_transform(
    poly_model.predict(X_poly_test).reshape(-1,1)
).ravel()

# Evaluate model performance
metrics_poly = evaluate_model(y_test, y_pred_poly)
metrics_poly

{'R²': 0.9201823374226858,
 'RMSE': 10103.372587571128,
 'MAE': 8720.931217588139}

### **Model 3: Random Forest Regression**

In [39]:
# Initialize and train Random Forest Regressor with 100 trees
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train_scaled, y_train_scaled)

# Make predictions on test set
# Inverse transform to convert scaled predictions back to original scale
y_pred_rf = sc_y.inverse_transform(
    rf.predict(X_test_scaled).reshape(-1,1)
).ravel()

# Evaluate model performance
metrics_rf = evaluate_model(y_test, y_pred_rf)
metrics_rf

{'R²': 0.9671431761347026, 'RMSE': 6482.307907999423, 'MAE': 5268.487560000003}

**Feature Importance**

In [40]:
# Extract and rank feature importances from Random Forest model
feature_names = ['R&D Spend', 'Administration', 'Marketing Spend']
rf_importance = pd.Series(rf.feature_importances_, index=feature_names)
rf_importance.sort_values(ascending=False)

R&D Spend          0.916366
Marketing Spend    0.077555
Administration     0.006079
dtype: float64
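
The values above are impurity-based importances computed from the training data. As a cross-check, permutation importance measures how much the score on held-out data drops when one feature is shuffled. This is a minimal sketch on synthetic data in which the first column is built to dominate; `X_demo`, `y_demo`, and `rf_demo` are hypothetical names, and nothing here is run on the startup dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends mostly on column 0, slightly on column 2
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = 10.0 * X_demo[:, 0] + 0.5 * X_demo[:, 2] + rng.normal(0, 0.1, 200)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=0)
rf_demo.fit(X_demo[:150], y_demo[:150])

# Shuffle each feature 10 times on the held-out rows and record the score drop
perm = permutation_importance(
    rf_demo, X_demo[150:], y_demo[150:], n_repeats=10, random_state=0
)

ranking = np.argsort(perm.importances_mean)[::-1]  # most important first
print(ranking, perm.importances_mean)
```

If both methods rank the features the same way on the real data, that would add some confidence to the ranking reported above.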

### **Model 4: Support Vector Regression**

In [41]:
# Initialize and train SVR with RBF kernel
svr = SVR(kernel='rbf', gamma='scale', epsilon=0.01)
svr.fit(X_train_scaled, y_train_scaled)

# Make predictions on test set
# Inverse transform to convert scaled predictions back to original scale
y_pred_svr = sc_y.inverse_transform(
    svr.predict(X_test_scaled).reshape(-1,1)
).ravel()

# Evaluate model performance
metrics_svr = evaluate_model(y_test, y_pred_svr)
metrics_svr

{'R²': 0.8733802443784792,
 'RMSE': 12725.287461290127,
 'MAE': 10272.881385722852}

### **Model Comparison**

In [42]:
# Create DataFrame to compare all model performances
results = pd.DataFrame([
    metrics_rf,
    metrics_poly,
    metrics_dt,
    metrics_svr
], index=['Random Forest', 'Polynomial', 'Decision Tree', 'SVR'])

# Sort models by R² score in descending order (best model first)
results.sort_values('R²', ascending=False)

| Model | R² | RMSE | MAE |
|---|---|---|---|
| Decision Tree | 0.976603 | 5470.16263 | 4480.289 |
| Random Forest | 0.967143 | 6482.307908 | 5268.48756 |
| Polynomial | 0.920182 | 10103.372588 | 8720.931218 |
| SVR | 0.87338 | 12725.287461 | 10272.881386 |
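
This ranking rests on a single split with 10 test rows, so it may change with a different `random_state`. One way to check its stability is k-fold cross-validation, sketched below on synthetic data. `X_demo`, `y_demo`, and the choice of 5 folds are assumptions for illustration, and the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumed, not the startup dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100_000, size=(50, 3))
y_demo = 0.8 * X_demo[:, 0] + 0.05 * X_demo[:, 2] + rng.normal(0, 5_000, 50)

# 5 shuffled folds: each model is trained and scored 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'Decision Tree': DecisionTreeRegressor(random_state=0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='r2')
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R² = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Comparing the mean and spread of R² across folds gives a steadier basis for choosing a model than one test-set score.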
