# **Financial Risk Classification of S&P 500 Companies Using Machine Learning**

## **Objective**
Apply supervised machine learning techniques to classify S&P 500 companies based on financial health indicators such as profit margins, debt levels, and return on equity. We aim to build a predictive model that categorizes companies as low, medium, or high financial risk. This can assist in investment decision-making and financial forecasting.

## **Methodology**
Our approach involves several key steps:
1. Load & clean financial data
2. Engineer risk targets:
   - Quantile-based (data-driven)
   - Rule-based (expert-driven)
3. Perform EDA and feature selection
4. Train classifiers (Logistic Regression, Random Forest, XGBoost)
5. Tune models and evaluate performance
6. Compare classification strategies
7. Perform regression (Linear, RF, XGBoost) to predict stock price

## **Data Overview**
The dataset includes numerous financial metrics that professionals and investors commonly use to value companies. It covers the companies that make up the S&P 500 (Standard & Poor's 500), a capitalization-weighted index of roughly 500 of the largest publicly traded companies in the United States by market capitalization. The index is useful to study because it generally reflects the health of the overall U.S. stock market. The dataset was last updated in July 2020.
* Size: ~503 companies × ~25 variables
* Features: metrics such as P/E Ratio, Market Cap, Beta, EPS, Profit Margin, Debt/Equity, ROE, etc.
* Format: Clean, structured CSV (filename: `financials.csv`)

### **Data Source**
[S&P 500 Companies with Financial Information](https://www.kaggle.com/datasets/paytonfisher/sp-500-companies-with-financial-information?resource=download)

### 1. Install Dependencies
In this section, we install the dependencies listed in `requirements.txt`. These packages cover data processing, visualization, and the machine learning algorithms used in our financial risk classification model.

In [None]:
# Install all required dependencies listed in requirements.txt
# %pip install -r requirements.txt

### 2. Setup and Dependencies
Here, we import all necessary Python libraries for:
- Data manipulation (pandas, numpy)
- Visualization (matplotlib, seaborn)
- Statistical analysis
- Machine learning models (scikit-learn, XGBoost, PyTorch)

In [None]:
# Standard Library
import math
import os
import warnings

# Data Manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics & Diagnostics
import scipy.stats as stats

# Machine Learning: Preprocessing, Metrics, Utilities
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    mean_absolute_error, mean_squared_error, r2_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_curve, auc, precision_recall_curve
)
from sklearn.utils import shuffle

# Machine Learning: Models
# Linear Models
from sklearn.linear_model import LinearRegression, LogisticRegression
# Tree-based Models (RandomForestRegressor lives in sklearn.ensemble, not sklearn.linear_model)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier,
    BaggingClassifier, AdaBoostClassifier
)
# Other Models
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import mutual_info_classif
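# The cells below call XGBClassifier and XGBRegressor directly, so import them by name
from xgboost import XGBClassifier, XGBRegressor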

# Deep Learning with PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Configuration
warnings.filterwarnings("ignore")
pd.set_option("display.float_format", "{:.2f}".format)
sns.set_style("whitegrid")

### 3. Load & Inspect & Preprocess Data
This section loads the financial dataset from a CSV file. We then inspect its structure by looking at the first few rows, the data types, and the missing values. This step is crucial for understanding the quality of the data before proceeding with the analysis.
We also standardize column names to lowercase with underscores for consistency and ease of use in Python.

In [None]:
# Load the dataset
data_file = os.path.join("..", "dataset", "financials.csv")
financial_df = pd.read_csv(data_file)

# Display first few rows of the dataset
# display(financial_df.head())
print(financial_df.head()) 

# Display the data types of the columns
print("\nData types:")
print(financial_df.dtypes)

# Check missing values
print("\nMissing values:")
print(financial_df.isnull().sum())

print(financial_df.columns.tolist())

# Rename columns to lowercase with underscores
financial_df.columns = financial_df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('/', '_')

# Drop rows with missing values
financial_df = financial_df.dropna()

# safety copy
df_quantile = financial_df.copy()
df_rule = financial_df.copy()
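
To make the effect of the renaming step concrete, here is a minimal sketch on an empty frame with a handful of column names. The header names are illustrative assumptions modeled on the raw CSV, not a complete copy of it.

```python
import pandas as pd

# Hypothetical header row resembling a few of the raw CSV columns
raw = pd.DataFrame(
    columns=["Symbol", "Price/Earnings", "Dividend Yield", " Earnings/Share", "52 Week Low"]
)

# Same chain as in the cell above: strip, lowercase, then replace spaces and slashes
clean = (
    raw.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("/", "_", regex=False)
)
normalized = list(clean)
print(normalized)
```

After this step every later cell has to use the normalized names (for example `earnings_share` rather than `Earnings/Share`).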

### 4. Risk Target Engineering
**Engineer a classification target variable (`risk_class`)** to group companies based on their financial health.
  - We use **`earnings/share`** as a proxy for profitability.
  - Companies are split into **Low**, **Medium**, and **High risk** groups using quantiles:
    - **Low Risk**: Top third of companies with highest earnings/share  
    - **Medium Risk**: Middle third  
    - **High Risk**: Bottom third (lowest earnings/share)

This allows us to transform the problem into a **supervised classification task** aligned with our project goal.

#### 4.1. Quantile-Based Classification (Data-Driven)

In [None]:
# quantile-based
# Sample rule: create a synthetic risk class from earnings/share (customize as needed)
# High risk = low earnings
df_quantile['risk_class'] = pd.qcut(df_quantile['earnings_share'], q=3, labels=['High', 'Medium', 'Low'])

# Preview
df_quantile[['name', 'earnings_share', 'risk_class']].head()
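
As a small illustration of how `pd.qcut` assigns the three classes, the sketch below uses nine made-up EPS values (not taken from the dataset) and also returns the bin edges via `retbins=True`, which can help when interpreting where the class boundaries fall.

```python
import pandas as pd

# Made-up earnings-per-share values, sorted for readability
eps = pd.Series([-1.0, 0.5, 1.2, 2.0, 3.5, 4.1, 6.0, 8.3, 10.0], name="earnings_share")

# Lowest third -> High risk, middle third -> Medium, top third -> Low
risk, edges = pd.qcut(eps, q=3, labels=["High", "Medium", "Low"], retbins=True)

toy = pd.DataFrame({"earnings_share": eps, "risk_class": risk})
print(toy)
print("Bin edges:", edges)
```

Because the cut points are quantiles of the sample itself, each class receives the same number of companies here, which is what makes the quantile-based target balanced by construction.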

#### 4.2.  Rule-Based Classification (Expert-Driven)

In [None]:
# Rule-based risk classification
# Columns were renamed to lowercase with underscores in section 3,
# so the rules must reference the normalized names.
def classify_risk(row):
    if row['earnings_share'] < 1 or row['price_earnings'] > 40 or row['dividend_yield'] < 0.01:
        return 'High'
    elif row['earnings_share'] < 3 or row['price_earnings'] > 25:
        return 'Medium'
    else:
        return 'Low'

df_rule['risk'] = df_rule.apply(classify_risk, axis=1)

# Preview
df_rule[['name', 'earnings_share', 'price_earnings', 'dividend_yield', 'risk']].head()
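
The same thresholds can also be applied without a row-wise `apply`. The sketch below is one possible vectorized variant using `np.select`, shown on a made-up six-row frame; it assumes the normalized column names and is not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Made-up rows chosen so that each rule fires at least once
toy = pd.DataFrame({
    "earnings_share": [0.5, 2.0, 5.0, 5.0, 5.0, 5.0],
    "price_earnings": [10.0, 10.0, 30.0, 10.0, 50.0, 10.0],
    "dividend_yield": [2.0, 2.0, 2.0, 2.0, 2.0, 0.0],
})

# Boolean masks mirroring the if / elif branches of classify_risk
high = (toy["earnings_share"] < 1) | (toy["price_earnings"] > 40) | (toy["dividend_yield"] < 0.01)
medium = (toy["earnings_share"] < 3) | (toy["price_earnings"] > 25)

# np.select takes the first matching condition, so "High" wins over "Medium"
toy["risk"] = np.select([high, medium], ["High", "Medium"], default="Low")
print(toy)
```

The order of the conditions matters: listing `high` first reproduces the precedence of the `if` branch over the `elif` branch.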

### 5. Exploratory Data Analysis (EDA)
Below we explore the distribution, outliers, and relationships between financial metrics and the risk classes.
* Histograms or boxplots of key financial variables
* Correlation heatmap
* Summary statistics
* Target distribution (Low, Medium, High risk classes)

In [None]:
# Numeric columns for visualization
num_cols = ['price_earnings', 'dividend_yield', 'earnings_share', 'market_cap',
            'ebitda', 'price_sales', 'price_book']

# Histograms
df_quantile[num_cols].hist(bins=20, figsize=(14, 10), color='skyblue', edgecolor='black')
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df_quantile[num_cols].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap")
plt.show()

# Risk class distribution
sns.countplot(x='risk_class', data=df_quantile, order=['Low', 'Medium', 'High'])
plt.title("Distribution of Risk Classes")
plt.show()
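
The list above also mentions summary statistics and the target distribution in numeric form. A minimal sketch of both, on a small made-up frame rather than the real data, could look like this:

```python
import pandas as pd

# Made-up values; column names follow the normalized naming used in this notebook
toy = pd.DataFrame({
    "earnings_share": [-1.0, 0.5, 1.2, 2.0, 3.5, 4.1, 6.0, 8.3, 10.0],
    "price_earnings": [55.0, 40.0, 31.0, 25.0, 22.0, 18.0, 16.0, 14.0, 12.0],
})
toy["risk_class"] = pd.qcut(toy["earnings_share"], q=3, labels=["High", "Medium", "Low"])

# Summary statistics for the numeric columns
summary = toy.describe()

# Share of companies in each risk class
class_share = toy["risk_class"].value_counts(normalize=True)

# Median P/E per risk class, a quick check of how a feature varies with the target
medians = toy.groupby("risk_class", observed=True)["price_earnings"].median()

print(summary)
print(class_share)
print(medians)
```

On the real data, the same three calls can be run on `df_quantile[num_cols]` and `df_quantile['risk_class']`.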

### 6. Feature Selection
We use both domain knowledge and statistical methods (e.g., mutual information, feature importance from models) to identify the most relevant predictors.
* Use correlation with the target (`risk_class`)
* Apply feature importance from tree models or mutual information scores
* Drop irrelevant or redundant columns

In [None]:
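# Encode target (LabelEncoder sorts classes alphabetically: High, Low, Medium)
le = LabelEncoder()
y = le.fit_transform(df_quantile['risk_class'])

# Simple imputation (replace NaNs with the median of each column).
# Rows with missing values were already dropped in section 3, so this is only a safeguard.
# Note: earnings_share also defines risk_class, so keeping it in X lets a model
# recover the quantile labels almost directly.
X = df_quantile[num_cols].fillna(df_quantile[num_cols].median(numeric_only=True))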

# Mutual Information scores
mi_scores = mutual_info_classif(X, y)
mi_series = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)

# Plot
plt.figure(figsize=(10, 6))
mi_series.plot(kind='bar', color='green')
plt.title("Mutual Information Scores for Predicting Financial Risk")
plt.ylabel("Mutual Information Score")
plt.show()
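
The cell above covers mutual information, but the list also mentions dropping redundant columns. One common way to do that is to drop one feature from each highly correlated pair. The sketch below uses synthetic data, and the 0.9 cutoff is an arbitrary assumption rather than a value from this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic features: "ebitda" is built as a near copy of "market_cap"
X_toy = pd.DataFrame({
    "market_cap": base,
    "ebitda": base * 2.0 + rng.normal(scale=0.05, size=200),
    "price_book": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = X_toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair whose correlation exceeds the (assumed) threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X_toy.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(X_reduced.columns))
```

Which member of a correlated pair to keep is a judgment call; here the earlier column is kept simply because of column order.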

### 7. Train-Test Split
We split the data into training and testing sets to evaluate model performance.

In [None]:
# Split the data (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Train samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")

# Check class distribution
print("\nTrain class distribution:")
print(pd.Series(y_train).value_counts())

print("\nTest class distribution:")
print(pd.Series(y_test).value_counts())

### 8. Modeling – Try Multiple Classifiers (Quantile-Based)
Train and compare models:
* Logistic Regression (baseline)
* Random Forest
* XGBoost or Gradient Boosting

In [None]:
# Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

print("Logistic Regression Performance:\n")
print(classification_report(y_test, y_pred_lr, target_names=le.classes_))
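
Logistic regression can be sensitive to feature scale, and this dataset mixes ratios with very large values such as market capitalization. `StandardScaler` is imported above but not used, so the sketch below shows one way it could be combined with the baseline in a pipeline. It runs on synthetic data from `make_classification`, and the inflated first column is only a stand-in for a market-cap-like feature.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem; not the financial data
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_syn[:, 0] *= 1e9  # mimic one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# The scaler is fit on the training split only, then applied to the test split
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

preds = clf.predict(X_te)
acc = clf.score(X_te, y_te)
print(f"Accuracy with scaling: {acc:.3f}")
```

Wrapping the scaler in a pipeline keeps the test set out of the scaling statistics, which matters once cross-validation or grid search is added.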

In [None]:
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Performance:\n")
print(classification_report(y_test, y_pred_rf, target_names=le.classes_))

In [None]:
# XGBoost Classifier
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

print("XGBoost Performance:\n")
print(classification_report(y_test, y_pred_xgb, target_names=le.classes_))

### 9. Confusion Matrices (Quantile-Based)
Visualize model predictions using confusion matrices.

In [None]:
models = {
    "Logistic Regression": y_pred_lr,
    "Random Forest": y_pred_rf,
    "XGBoost": y_pred_xgb
}

for name, preds in models.items():
    disp = ConfusionMatrixDisplay.from_predictions(
        y_test, preds, display_labels=le.classes_, cmap="Blues", values_format='d'
    )
    disp.ax_.set_title(f"{name} - Confusion Matrix")
    plt.show()

### 10. Model Comparison (Quantile-Based)
We compare the models (Logistic Regression, Random Forest, XGBoost) using accuracy, F1 score, and macro-averaged metrics for a fair comparison across the multiclass setting.
* Create a Comparison Table
* Plot the Comparison

In [None]:
def evaluate_model(name, y_true, y_pred):
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1 Score (macro)": f1_score(y_true, y_pred, average='macro'),
        "F1 Score (weighted)": f1_score(y_true, y_pred, average='weighted')
    }

results = []
results.append(evaluate_model("Logistic Regression", y_test, y_pred_lr))
results.append(evaluate_model("Random Forest", y_test, y_pred_rf))
results.append(evaluate_model("XGBoost", y_test, y_pred_xgb))

comparison_df = pd.DataFrame(results).sort_values(by="F1 Score (macro)", ascending=False)
print(comparison_df)

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(data=comparison_df.melt(id_vars="Model", var_name="Metric", value_name="Score"),
            x="Model", y="Score", hue="Metric")
plt.title("Model Comparison")
plt.ylim(0, 1.05)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
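
To show why both macro and weighted F1 are reported, the sketch below computes them on a small made-up set of imbalanced labels and checks them against the per-class scores. The labels are illustrative only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: class 0 dominates, classes 1 and 2 are rare
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 2, 0])

per_class = f1_score(y_true, y_pred, average=None)   # one F1 per class
support = np.bincount(y_true)                        # number of true samples per class

macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support

print("Per-class F1:", per_class.round(3))
print("Macro F1:", round(macro, 3))
print("Weighted F1:", round(weighted, 3))
```

In this example the weighted score is higher than the macro score because the majority class is predicted best. With the quantile-based target the classes are balanced, so the two averages should stay close; a gap is more likely with the rule-based labels.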

### 11. Model Tuning (Grid Search for Random Forest & XGBoost)
We'll tune hyperparameters for the top models using GridSearchCV and evaluate the best ones.

In [None]:
# Random Forest Tuning

rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}

rf_grid = GridSearchCV(RandomForestClassifier(random_state=42),
                       rf_params, cv=3, scoring='f1_macro', n_jobs=-1)
rf_grid.fit(X_train, y_train)

print("Best Random Forest Params:", rf_grid.best_params_)
y_pred_rf_tuned = rf_grid.predict(X_test)
print("Tuned Random Forest Performance:\n")
print(classification_report(y_test, y_pred_rf_tuned, target_names=le.classes_))

In [None]:
# XGBoost Tuning
xgb_params = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
}

xgb_grid = GridSearchCV(XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42),
                        xgb_params, cv=3, scoring='f1_macro', n_jobs=-1)
xgb_grid.fit(X_train, y_train)

print("Best XGBoost Params:", xgb_grid.best_params_)
y_pred_xgb_tuned = xgb_grid.predict(X_test)
print("Tuned XGBoost Performance:\n")
print(classification_report(y_test, y_pred_xgb_tuned, target_names=le.classes_))

results.append(evaluate_model("Tuned Random Forest", y_test, y_pred_rf_tuned))
results.append(evaluate_model("Tuned XGBoost", y_test, y_pred_xgb_tuned))

comparison_df = pd.DataFrame(results).sort_values(by="F1 Score (macro)", ascending=False)
print(comparison_df)
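
Besides `best_params_`, a fitted `GridSearchCV` exposes the scores of every candidate in `cv_results_`, which helps judge whether the best configuration is clearly better or just noise. A minimal sketch on synthetic data (the grid values are arbitrary assumptions, kept small so it runs quickly):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data; not the financial dataset
X_syn, y_syn = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [None, 3]},
    cv=3,
    scoring="f1_macro",
)
grid.fit(X_syn, y_syn)

# One row per parameter combination, best-ranked first
cv_table = (
    pd.DataFrame(grid.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(cv_table)
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough sense of whether tuning changed anything meaningful.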

### Prepare Features and Train-Test Split for Rule-Based

In [None]:
# Encode labels
le = LabelEncoder()
df_rule['risk_label'] = le.fit_transform(df_rule['risk'])

# Drop unused or non-numeric columns (names are lowercase after the renaming in section 3)
X = df_rule.drop(columns=['price', 'risk', 'risk_label', 'name', 'symbol', 'sec_filings'], errors='ignore')
X = X.select_dtypes(include=[np.number])  # Numeric features only
y = df_rule['risk_label']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Train Random Forest

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

print("Random Forest Classification Report:\n")
print(classification_report(y_test, y_pred_rf, target_names=le.classes_))

ConfusionMatrixDisplay.from_estimator(rf_clf, X_test, y_test, display_labels=le.classes_, cmap='Blues')
plt.title("Random Forest - Rule-Based Risk")
plt.grid(False)
plt.show()

#### Train XGBoost Classifier

In [None]:
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, eval_metric='mlogloss', random_state=42)
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)

print("XGBoost Classification Report:\n")
print(classification_report(y_test, y_pred_xgb, target_names=le.classes_))

ConfusionMatrixDisplay.from_estimator(xgb_clf, X_test, y_test, display_labels=le.classes_, cmap='Greens')
plt.title("XGBoost - Rule-Based Risk")
plt.grid(False)
plt.show()

## 12. Comparison of Risk Classification Strategies

In this section, we compare two strategies for labeling financial risk:

### 1. Quantile-Based Risk (Data-Driven)
- Labels companies into *High*, *Medium*, and *Low* risk based on **earnings per share** (EPS) using quantile cuts.
- **Advantage**: Requires no hand-set thresholds; class boundaries come directly from the data distribution.
- **Disadvantage**: Ignores domain knowledge or business context.

### 2. Rule-Based Risk (Expert-Driven)
- Uses explicit thresholds for **EPS**, **P/E ratio**, and **Dividend Yield** to determine risk.
- **Advantage**: Mimics analyst/business logic.
- **Disadvantage**: Fixed and may not generalize well to new data.

---

### 🔄 Labeling Criteria Comparison

| Feature            | Quantile-Based                            | Rule-Based                                                    |
|-------------------|--------------------------------------------|----------------------------------------------------------------|
| `earnings/share`  | Divided into 3 quantiles                   | `< 1` → High, `< 3` → Medium, else Low                        |
| `price/earnings`  | Not used                                   | `> 40` → High, `> 25` → Medium                                |
| `dividend_yield`  | Not used                                   | `< 0.01` → High                                               |
| Adaptability      | ✅ Flexible, data-driven                   | ❌ Fixed thresholds                                           |
| Risk Class Balance| ✅ Balanced (33% each by design)           | ❌ May be imbalanced depending on data                        |
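
One way to quantify how much the two labeling schemes differ is to cross-tabulate them on the same companies. The sketch below does this on nine made-up rows, using the thresholds from the table above; the values are illustrative and not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up companies with normalized column names
toy = pd.DataFrame({
    "earnings_share": [-1.0, 0.5, 1.2, 2.0, 3.5, 4.1, 6.0, 8.3, 10.0],
    "price_earnings": [55.0, 40.0, 31.0, 25.0, 22.0, 18.0, 16.0, 14.0, 12.0],
    "dividend_yield": [0.0, 1.5, 2.0, 0.0, 2.5, 3.0, 1.0, 2.2, 0.0],
})

# Quantile-based labels (terciles of EPS)
quantile_label = pd.qcut(
    toy["earnings_share"], q=3, labels=["High", "Medium", "Low"]
).astype(str)

# Rule-based labels (fixed thresholds)
high = (toy["earnings_share"] < 1) | (toy["price_earnings"] > 40) | (toy["dividend_yield"] < 0.01)
medium = (toy["earnings_share"] < 3) | (toy["price_earnings"] > 25)
rule_label = np.select([high, medium], ["High", "Medium"], default="Low")

# Rows: quantile label, columns: rule label
agreement = pd.crosstab(quantile_label, rule_label, rownames=["quantile"], colnames=["rule"])
match_rate = (quantile_label == rule_label).mean()

print(agreement)
print(f"Share of companies with identical labels: {match_rate:.2f}")
```

Off-diagonal cells show where the two schemes disagree, for example high-EPS companies that the rules still flag as High risk because they pay no dividend.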

---

### 📊 Performance Comparison

We evaluated both risk-labeling strategies using **Random Forest** and **XGBoost** classifiers.

| Model            | Label Type     | Accuracy | F1 Score (macro) | Notes                          |
|------------------|----------------|----------|------------------|---------------------------------|
| Random Forest    | Quantile-Based | ~98%     | ~0.98            | Balanced across all classes     |
| XGBoost          | Quantile-Based | ~97%     | ~0.97            | Slight drop on High-risk class  |
| Random Forest    | Rule-Based     | ~98%     | ~0.98            | High precision & recall         |
| XGBoost          | Rule-Based     | ~97%     | ~0.97            | Consistent performance           |

> *The figures above are approximate placeholders; replace them with the actual numbers from your run.*

---

### 📉 Confusion Matrix Highlights

- **Quantile-based:** Ensures evenly distributed classes and performs well across them.
- **Rule-based:** May result in class imbalance depending on how strict the thresholds are, especially for *High-risk*.

(Visual confusion matrices are shown in earlier sections.)

---

## 🧠 Final Thoughts

- **Quantile-based labeling** is ideal when you want the model to learn from data distributions.
- **Rule-based labeling** offers interpretability and aligns with expert-defined heuristics.
- **Combining both approaches** in your analysis shows robustness and enhances the credibility of your classification pipeline.



## Optional Extension – Predicting Stock Price Using Regression

"Can we both classify financial risk and predict stock price from public metrics?"

In this optional extension, we shift our focus from classifying companies into risk categories to predicting their stock price (the `price` column) directly from financial metrics using regression models.

### Linear Regression Model

In [None]:
# Drop identifier columns; risk_class exists only on df_quantile, so ignore it if absent
df = financial_df.drop(columns=['symbol', 'name', 'sec_filings', 'risk_class'], errors='ignore')
df = df.dropna()
df = pd.get_dummies(df, columns=['sector'], drop_first=True)

# Define features and target
X = df.drop(columns=['price'])
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Evaluate
y_pred = lr_model.predict(X_test)
lr_rmse = mean_squared_error(y_test, y_pred) ** 0.5
lr_r2 = r2_score(y_test, y_pred)

print("Linear Regression RMSE:", lr_rmse)
print("Linear Regression R²:", lr_r2)
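
For reference, the two metrics reported above can be reproduced by hand. The sketch below computes RMSE and R² with NumPy on four made-up prices and compares them with the scikit-learn functions used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 270.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

rmse_sklearn = mean_squared_error(y_true, y_hat) ** 0.5
r2_sklearn = r2_score(y_true, y_hat)

print("RMSE:", rmse_manual, rmse_sklearn)
print("R²:", r2_manual, r2_sklearn)
```

RMSE is expressed in the same units as the price, while R² is unitless, so the two are best read together when comparing the regression models.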

### Linear Regression Plot

In [None]:
plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Linear Regression: Actual vs Predicted")
plt.grid(True)
plt.show()

### Random Forest Regression Model

In [None]:
# Rebuild the same cleaned and encoded dataframe as in the linear regression section
# risk_class exists only on df_quantile, so ignore it if absent
df_rf = financial_df.drop(columns=['symbol', 'name', 'sec_filings', 'risk_class'], errors='ignore')
df_rf = df_rf.dropna()
df_rf = pd.get_dummies(df_rf, columns=['sector'], drop_first=True)

# Features and target
X = df_rf.drop(columns=['price'])
y = df_rf['price']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf_model.predict(X_test)
rf_rmse = mean_squared_error(y_test, y_pred) ** 0.5
rf_r2 = r2_score(y_test, y_pred)

print("Random Forest Regression RMSE:", rf_rmse)
print("Random Forest Regression R²:", rf_r2)

### Random Forest Regression Plot

In [None]:
plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # reference line
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Random Forest Regression: Actual vs Predicted")
plt.grid(True)
plt.show()

### XGBoost Regression Model

In [None]:
# Prepare Data

# Use the same preprocessing; risk_class exists only on df_quantile, so ignore it if absent
df_xgb = financial_df.drop(columns=['symbol', 'name', 'sec_filings', 'risk_class'], errors='ignore')
df_xgb = df_xgb.dropna()
df_xgb = pd.get_dummies(df_xgb, columns=['sector'], drop_first=True)

# Features and target
X = df_xgb.drop(columns=['price'])
y = df_xgb['price']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train XGBoost Regressor
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = xgb_model.predict(X_test)
xgb_rmse = mean_squared_error(y_test, y_pred) ** 0.5
xgb_r2 = r2_score(y_test, y_pred)

print("XGBoost Regression RMSE:", xgb_rmse)
print("XGBoost Regression R²:", xgb_r2)

### XGBoost Feature Importance Plot

In [None]:
importances = xgb_model.feature_importances_
feature_names = X.columns
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices], align='center')
plt.xticks(range(len(importances)), feature_names[indices], rotation=90)
plt.title("XGBoost Feature Importance")
plt.tight_layout()
plt.grid(True)
plt.show()

### Actual vs Predicted Plot – XGBoost

In [None]:
plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("XGBoost Regression: Actual vs Predicted")
plt.grid(True)
plt.show()

### Summary Table for All Regression Models

In [None]:
regression_results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "XGBoost"],
    "RMSE": [lr_rmse, rf_rmse, xgb_rmse],
    "R² Score": [lr_r2, rf_r2, xgb_r2]
})

print(regression_results.sort_values(by="R² Score", ascending=False))

### 13. Deployment and Discussion