# 1. Business Understanding

## 1.2 Problem Statement

Many investors, lenders, and business owners rely on intuition or outdated reports when evaluating a company’s financial position. This lack of real-time, data-driven analysis can lead to poor investment or lending decisions.

Our challenge is to develop a **data-powered tool** that automatically analyzes publicly available financial data (like income statements, balance sheets, and cash flows) to assess a company’s **financial stability, profitability, and risk**.

This project will simplify financial decision-making by transforming raw numbers into actionable insights through **data analysis, visualization, and machine learning**.

---

## 1.3 Business Objectives

### Main Objective

To build a **data analysis and scoring system** that evaluates a company’s financial health using real-world financial data.

### Specific Objectives

1. To collect and preprocess financial data from **Yahoo Finance API**  
2. To analyze key financial metrics such as revenue growth, net income, debt-to-equity ratio, and cash flow trends.  
3. To build a **financial health scoring model** that assigns a score to each company based on performance indicators.  
4. To visualize financial insights using clear dashboards and charts for easier interpretation.  
5. To provide actionable recommendations for investors or business managers.

---

## 1.4 Research Questions

1. What financial indicators most accurately represent a company’s health and stability?  
2. How do profitability, liquidity, and leverage ratios correlate with a company’s risk level?  
3. Can we build a model that classifies companies into categories such as _Healthy_, _Moderate_, and _At Risk_?  
4. How can visualizing financial trends help investors make better decisions?

---

## 1.5 Success Criteria

- The system should accurately collect and clean financial data for multiple companies.  
- It should compute and visualize key financial ratios and trends.  
- The scoring model should produce realistic health scores based on financial fundamentals.  
- The final output should be clear and explainable to both technical and non-technical users.

---

# 2. Data Understanding

We will use **real financial datasets** fetched directly from APIs — not from Kaggle.

---

## Datasets & Sources

| Source | Type of Data | Description |
| --- | --- | --- |
| **Yahoo Finance API (via yfinance)** | Company financials | Income statements, balance sheets, cash flow, and stock history |
---

## Dataset Overview

Each company dataset will include:

- **Revenue**  
- **Gross profit**  
- **Operating income**  
- **Net income**  
- **Total assets & liabilities**  
- **Cash flow from operations**  
- **Debt-to-equity ratio**  
- **Return on assets (ROA)** and **Return on equity (ROE)**  
- **Stock price performance** over time  

These metrics help us assess profitability, liquidity, leverage, and efficiency — the four main pillars of financial health.

---

## Tools and Libraries

We’ll use the following tools for the analysis:

| Category | Libraries |
| --- | --- |
| **Data Collection** | `yfinance`, `requests`, `pandas` |
| **Data Cleaning & Processing** | `numpy`, `pandas` |
| **Visualization** | `matplotlib`, `seaborn`, `plotly` |
| **Modeling & Scoring** | `scikit-learn`, `statsmodels` |
| **Deployment (Optional)** | `joblib` for model serialization |


# 3. Data Preparation

In this section, we will import the necessary Python libraries and load financial data directly from Yahoo Finance using the `yfinance` API. This will form the foundation of our analysis.

The data will include income statements, balance sheets, cash flow statements, and stock price history for a chosen company. We will then explore its structure before cleaning and feature engineering.


In [1]:
!pip install yfinance
!pip install --quiet pandas-datareader openpyxl



In [2]:
# 1. Imports
import yfinance as yf
import pandas as pd
import numpy as np
import difflib
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import os, pickle, time
from pathlib import Path
from sklearn.ensemble import RandomForestRegressor, IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("All libraries imported.")

All libraries imported.


## 2. Build a Large Ticker Universe

In [5]:
# 1. NASDAQ (pandas-datareader – always works)
def get_nasdaq_tickers():
    try:
        nasdaq = web.get_nasdaq_symbols()
        return nasdaq['NASDAQ Symbol'].dropna().tolist()
    except Exception as e:
        print(f"NASDAQ fetch failed: {e}. Using fallback.")
        # Fallback list (still > 100 symbols)
        return ['AAPL','MSFT','GOOGL','AMZN','NVDA','META','TSLA','AVGO','ASML','PEP']


# 2. S&P 500 – static CSV (no Wikipedia, no Excel)

def get_sp500_tickers():
    # Direct, public CSV from a trusted source (updated daily)
    url = "https://raw.githubusercontent.com/datasets/s-and-p-500-companies/main/data/constituents.csv"
    try:
        sp500 = pd.read_csv(url)
        return sp500['Symbol'].str.replace('.', '-').tolist()
    except Exception as e:
        print(f"S&P 500 CSV failed: {e}. Using fallback.")
        return ['AAPL','MSFT','GOOGL','AMZN','NVDA','META','TSLA','BRK-B','LLY','AVGO']


# 3. Combine + market-cap filter

sp500   = get_sp500_tickers()
nasdaq  = get_nasdaq_tickers()
all_raw = list(set(sp500 + nasdaq))

min_cap = 100_000_000          # $100 M
valid   = []

print(f"Filtering {len(all_raw):,} raw tickers by market cap ≥ ${min_cap:,}")
for t in tqdm(all_raw[:15_000], desc="Cap filter"):
    try:
        info = yf.Ticker(t).info
        cap  = info.get('marketCap') or 0
        if cap >= min_cap:
            valid.append(t)
    except:
        continue

ticker_symbols = valid[:12_000]   # safety cap
print(f"Final universe: {len(ticker_symbols):,} tickers")

NASDAQ fetch failed: name 'web' is not defined. Using fallback.
Filtering 504 raw tickers by market cap ≥ $100,000,000


Cap filter: 100%|██████████| 504/504 [09:09<00:00,  1.09s/it]

Final universe: 503 tickers





## 3. Desired Line-Item Mappings

In [6]:
# 3. Desired items (friendly → Yahoo label)
income_items = {
    "Total Revenue": "Total Revenue",
    "Gross Profit": "Gross Profit",
    "Operating Income": "Operating Income",
    "Net Income": "Net Income"
}

balance_items = {
    "Total Assets": "Total Assets",
    "Total Liab": "Total Liabilities",
    "Total Stockholder Equity": "Shareholders Equity",
    "Cash": "Cash and Cash Equivalents"
}

cash_flow_items = {
    "Total Cash From Operating Activities": "Operating Cash Flow",
    "Capital Expenditures": "CapEx",
    "Total Cash From Financing Activities": "Financing Cash Flow",
    "Total Cash From Investing Activities": "Investing Cash Flow"
}

## 4. Helper: Fuzzy-Match Row Names

In [7]:
# 4. resolve_item_names
def resolve_item_names(df, desired_raw_names):
    if df is None or df.empty:
        return {k: None for k in desired_raw_names}

    actual = list(map(str, df.index))
    actual_lower = [a.lower() for a in actual]
    mapping = {}

    for desired in desired_raw_names:
        des_lower = desired.lower()
        if desired in actual:
            mapping[desired] = desired
            continue
        if des_lower in actual_lower:
            mapping[desired] = actual[actual_lower.index(des_lower)]
            continue
        close = difflib.get_close_matches(desired, actual, n=1, cutoff=0.6)
        mapping[desired] = close[0] if close else None
    return mapping

## 5. Helper: Extract & Rename a Statement

In [8]:
# 5. extract_data_resolve
def extract_data_resolve(df, items_dict, statement_name, verbose=False):
    if df is None or df.empty:
        return pd.DataFrame()

    desired_raw = list(items_dict.values())
    resolved = resolve_item_names(df, desired_raw)

    if verbose:
        print(f"\nResolved mapping for {statement_name}:")
        for d, a in resolved.items():
            print(f"  {d} → {a}")

    actual_to_extract = [resolved[d] for d in desired_raw if resolved[d]]
    extracted = df.reindex(actual_to_extract).T.copy()

    col_rename = {v: k for k, v in items_dict.items() if resolved.get(v)}
    extracted = extracted.rename(columns=col_rename)

    extracted.insert(0, "Statement", statement_name)
    extracted = extracted.reset_index().rename(columns={"index": "Report Date"})
    return extracted

## 6. Helper: Clean Financial DataFrame

In [9]:
# 6. clean_financial_df
def clean_financial_df(df):
    if df.empty:
        return pd.DataFrame()
    df = df.copy()
    df["Report Date"] = pd.to_datetime(df["Report Date"], errors="coerce")
    df["Year"] = df["Report Date"].dt.year
    df.drop(columns=["Statement"], inplace=True, errors="ignore")
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols] / 1e9          # billions
    df = df.round(2)
    return df

## 7. Extraction Loop 

In [10]:
# 7. Main extraction (cached + early-stop)
CACHE_DIR = Path("../cache/financials")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

income_list   = []
balance_list  = []
cashflow_list = []

TARGET_ROWS = 10_000

for ticker_symbol in tqdm(ticker_symbols, desc="Processing"):
    cache_file = CACHE_DIR / f"{ticker_symbol}.pkl"
    if cache_file.exists():
        inc, bal, cf = pickle.load(open(cache_file, "rb"))
    else:
        ticker = yf.Ticker(ticker_symbol)
        try:
            raw_income  = ticker.financials
            raw_balance = ticker.balance_sheet
            raw_cf      = ticker.cash_flow
        except Exception:
            continue

        inc = extract_data_resolve(raw_income,  income_items,  "Income Statement")
        bal = extract_data_resolve(raw_balance, balance_items, "Balance Sheet")
        cf  = extract_data_resolve(raw_cf,      cash_flow_items,"Cash Flow")

        pickle.dump((inc, bal, cf), open(cache_file, "wb"))

    for name, df in [("Income", inc), ("Balance", bal), ("CashFlow", cf)]:
        if not df.empty:
            df = df.copy()
            df.insert(0, "Ticker", ticker_symbol)
            if name == "Income":
                income_list.append(df)
            elif name == "Balance":
                balance_list.append(df)
            else:
                cashflow_list.append(df)

    # early-stop
    total = sum(len(lst) for lst in [income_list, balance_list, cashflow_list])
    if total >= TARGET_ROWS:
        print(f"\nReached {total:,} rows – stopping.")
        break

Processing: 100%|██████████| 503/503 [13:52<00:00,  1.66s/it]


## 8. Combine & Clean Master Tables

In [16]:
# 8. Master tables – safe concat with column deduplication
def safe_concat(dfs):
    """Concatenate DataFrames after dropping duplicate columns."""
    if not dfs:
        return pd.DataFrame()
    
    cleaned = []
    for df in dfs:
        # Drop duplicate column names (keep first occurrence)
        df = df.loc[:, ~df.columns.duplicated()]
        cleaned.append(df.reset_index(drop=True))
    
    return pd.concat(cleaned, ignore_index=True)

# Build masters
income_master   = safe_concat(income_list)
balance_master  = safe_concat(balance_list)
cashflow_master = safe_concat(cashflow_list)

# Clean
income_master_clean   = clean_financial_df(income_master)
balance_master_clean  = clean_financial_df(balance_master)
cashflow_master_clean = clean_financial_df(cashflow_master)

print("\nFinal master tables:")
print(f"Income  : {income_master_clean.shape[0]:,} rows")
print(f"Balance : {balance_master_clean.shape[0]:,} rows")
print(f"CashFlow: {cashflow_master_clean.shape[0]:,} rows")
total_rows = (income_master_clean.shape[0] +
              balance_master_clean.shape[0] +
              cashflow_master_clean.shape[0])
print(f"TOTAL   : {total_rows:,} rows")


Final master tables:
Income  : 2,384 rows
Balance : 2,455 rows
CashFlow: 2,459 rows
TOTAL   : 7,298 rows


## 10. Sample Output

In [17]:
# 10. Show a sample
sample_ticker = ticker_symbols[0]
print(f"\nSample – {sample_ticker}")

print("\nINCOME")
display(income_master_clean[income_master_clean["Ticker"] == sample_ticker]
        .sort_values("Year", ascending=False).head())

print("\nBALANCE")
display(balance_master_clean[balance_master_clean["Ticker"] == sample_ticker]
        .sort_values("Year", ascending=False).head())

print("\nCASH FLOW")
display(cashflow_master_clean[cashflow_master_clean["Ticker"] == sample_ticker]
        .sort_values("Year", ascending=False).head())


Sample – MTD

INCOME


Unnamed: 0,Ticker,Report Date,Total Revenue,Gross Profit,Operating Income,Net Income,Year
0,MTD,2024-12-31,3.87,2.33,1.13,0.86,0.0
1,MTD,2023-12-31,3.79,2.24,1.08,0.79,0.0
2,MTD,2022-12-31,3.92,2.31,1.13,0.87,0.0
3,MTD,2021-12-31,3.72,2.17,0.99,0.77,0.0



BALANCE


Unnamed: 0,Ticker,Report Date,Total Assets,Current Liabilities,Stockholders Equity,Cash And Cash Equivalents,Total Liabilities Net Minority Interest,Year
0,MTD,2024-12-31,3.24,1.17,-0.13,0.06,,0.0
1,MTD,2023-12-31,3.36,1.18,-0.15,0.07,,0.0
2,MTD,2022-12-31,3.49,1.15,0.02,0.1,,0.0
3,MTD,2021-12-31,3.33,1.15,0.17,0.1,,0.0



CASH FLOW


Unnamed: 0,Ticker,Report Date,Total Cash From Operating Activities,Total Cash From Financing Activities,Total Cash From Investing Activities,Year
0,MTD,2024-12-31,0.97,-0.86,-0.12,0.0
1,MTD,2023-12-31,0.97,-0.86,-0.13,0.0
2,MTD,2022-12-31,0.86,-0.72,-0.14,0.0
3,MTD,2021-12-31,0.91,-0.59,-0.31,0.0
4,MTD,2020-12-31,,,,0.0
