# 1. Business Understanding

## 1.2 Problem Statement

Many investors, lenders, and business owners rely on intuition or outdated reports when evaluating a company’s financial position. This lack of real-time, data-driven analysis can lead to poor investment or lending decisions.

Our challenge is to develop a **data-powered tool** that automatically analyzes publicly available financial data (like income statements, balance sheets, and cash flows) to assess a company’s **financial stability, profitability, and risk**.

This project will simplify financial decision-making by transforming raw numbers into actionable insights through **data analysis, visualization, and machine learning**.

---

## 1.3 Business Objectives

### Main Objective

To build a **data analysis and scoring system** that evaluates a company’s financial health using real-world financial data.

### Specific Objectives

1. To collect and preprocess financial data from the **Yahoo Finance API** (via `yfinance`) and the **Alpha Vantage API**.  
2. To analyze key financial metrics such as revenue growth, net income, debt-to-equity ratio, and cash flow trends.  
3. To build a **financial health scoring model** that assigns a score to each company based on performance indicators.  
4. To visualize financial insights using clear dashboards and charts for easier interpretation.  
5. To provide actionable recommendations for investors or business managers.

---

## 1.4 Research Questions

1. What financial indicators most accurately represent a company’s health and stability?  
2. How do profitability, liquidity, and leverage ratios correlate with a company’s risk level?  
3. Can we build a model that classifies companies into categories such as _Healthy_, _Moderate_, and _At Risk_?  
4. How can visualizing financial trends help investors make better decisions?
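
Question 3 could start from a simple rule-based baseline before any model is trained. The sketch below maps a numeric health score to the three categories. The 0 to 100 scale, the function name `label_health`, and the cut-offs of 70 and 40 are illustrative assumptions, not values fixed by this project.

```python
def label_health(score: float, healthy_min: float = 70.0, moderate_min: float = 40.0) -> str:
    """Map a 0-100 health score to a category (thresholds are hypothetical)."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score >= healthy_min:
        return "Healthy"
    if score >= moderate_min:
        return "Moderate"
    return "At Risk"


# Example scores, chosen only to exercise each branch
for s in (85, 55, 20):
    print(s, label_health(s))
```

A trained classifier could later be compared against this baseline to check whether it adds anything beyond fixed thresholds.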

---

## 1.5 Success Criteria

- The system should accurately collect and clean financial data for multiple companies.  
- It should compute and visualize key financial ratios and trends.  
- The scoring model should produce realistic health scores based on financial fundamentals.  
- The final output should be clear and explainable to both technical and non-technical users.

---

# 2. Data Understanding

We will use **real financial datasets** fetched directly from APIs rather than pre-packaged Kaggle datasets.

---

## Datasets & Sources

| Source | Type of Data | Description |
| --- | --- | --- |
| **Yahoo Finance API (via yfinance)** | Company financials | Income statements, balance sheets, cash flow, and stock history |
| **Alpha Vantage API** | Company and macro data | Financial statements, ratios, and performance indicators |
| **World Bank Open Data (optional)** | Macroeconomic context | GDP, inflation, interest rates (for broader analysis) |

---

## Dataset Overview

Each company dataset will include:

- **Revenue**  
- **Gross profit**  
- **Operating income**  
- **Net income**  
- **Total assets & liabilities**  
- **Cash flow from operations**  
- **Debt-to-equity ratio**  
- **Return on assets (ROA)** and **Return on equity (ROE)**  
- **Stock price performance** over time  

These metrics help us assess profitability, liquidity, leverage, and efficiency — the four main pillars of financial health.
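
As a rough illustration of how the listed fields map onto those four pillars, the sketch below computes one or two ratios per pillar. The function name `financial_ratios` is hypothetical, and using total liabilities as a stand-in for total debt in the debt-to-equity ratio is an assumption made here because that is the balance sheet field extracted in Section 3. The example figures are the 2024 values (in billions) from the cleaned output in Section 3.

```python
def financial_ratios(revenue, net_income, total_assets, total_liabilities,
                     equity, current_assets, current_liabilities):
    """Return common ratios; all inputs must share the same currency unit."""
    return {
        # Profitability
        "net_margin": net_income / revenue,
        "roa": net_income / total_assets,
        "roe": net_income / equity,
        # Liquidity
        "current_ratio": current_assets / current_liabilities,
        # Leverage (assumption: total liabilities used as a proxy for debt)
        "debt_to_equity": total_liabilities / equity,
        # Efficiency
        "asset_turnover": revenue / total_assets,
    }


ratios = financial_ratios(
    revenue=391.04, net_income=93.74, total_assets=364.98,
    total_liabilities=308.03, equity=56.95,
    current_assets=152.99, current_liabilities=176.39,
)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```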

---

## Tools and Libraries

We’ll use the following tools for the analysis:

| Category | Libraries |
| --- | --- |
| **Data Collection** | `yfinance`, `requests`, `pandas` |
| **Data Cleaning & Processing** | `numpy`, `pandas` |
| **Visualization** | `matplotlib`, `seaborn`, `plotly` |
| **Modeling & Scoring** | `scikit-learn`, `statsmodels` |
| **Deployment (Optional)** | `joblib` for model serialization |


# 3. Data Preparation

In this section, we will import the necessary Python libraries and load financial data directly from Yahoo Finance using the `yfinance` library. This will form the foundation of our analysis.

The data will include income statements, balance sheets, cash flow statements, and stock price history for a chosen company. We will then explore its structure before cleaning and feature engineering.


In [76]:
!pip install yfinance



In [8]:
# 3.1 Importing Libraries
from sklearn.ensemble import RandomForestRegressor, IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For fetching financial data
import yfinance as yf

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("Libraries imported successfully!")

Libraries imported successfully!


In [9]:
!pip install alpha_vantage
!pip install fmp-python




In [69]:


# 1. Define Ticker Symbols 
ticker_symbols = ['AAPL', 'MSFT', 'KO']

# 2. Fetch Historical Data (Data is stored in the 'data' variable)
data = yf.download(ticker_symbols, period='40y')

# 3. Filter and Store the 'Close' prices
Fourty_Years_Close = data['Close']

# 4. Preview and Inspect Data
print("\nFiltered Close Data Head")
print(Fourty_Years_Close.head())

print("\nFiltered Close Data Info")
Fourty_Years_Close.info()  # info() prints its report directly and returns None

print("\nFiltered Close Data Tail")
print(Fourty_Years_Close.tail())

[*********************100%***********************]  3 of 3 completed


Filtered Close Data Head
Ticker      AAPL    KO  MSFT
Date                        
1985-10-28  0.06  0.56   NaN
1985-10-29  0.06  0.57   NaN
1985-10-30  0.07  0.57   NaN
1985-10-31  0.06  0.57   NaN
1985-11-01  0.06  0.57   NaN

Filtered Close Data Info
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10079 entries, 1985-10-28 to 2025-10-28
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    10079 non-null  float64
 1   KO      10079 non-null  float64
 2   MSFT    9985 non-null   float64
dtypes: float64(3)
memory usage: 315.0 KB

Filtered Close Data Tail
Ticker        AAPL     KO    MSFT
Date                             
2025-10-22  258.45  70.81  520.54
2025-10-23  259.58  69.94  520.56
2025-10-24  262.82  69.71  523.61
2025-10-27  268.81  70.06  531.52
2025-10-28  269.19  70.25  543.27
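
The `MSFT` column starts with `NaN` values (9,985 non-null rows against 10,079 for the other two tickers), so the three series do not cover the same period. One way to compare price performance is to measure each ticker from its own first valid price. The sketch below shows the idea on a small synthetic frame. The prices are made up for illustration, and `cumulative_return` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def cumulative_return(close: pd.DataFrame) -> pd.Series:
    """Total return per column, measured from each column's first valid price."""
    first = close.bfill().iloc[0]   # first non-missing price in each column
    last = close.ffill().iloc[-1]   # last non-missing price in each column
    return last / first - 1


# Synthetic prices: ticker "B" starts trading two days late
close = pd.DataFrame(
    {"A": [10.0, 11.0, 12.0, 15.0], "B": [np.nan, np.nan, 50.0, 55.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# fill_method=None leaves missing prices as NaN instead of forward-filling them
daily_returns = close.pct_change(fill_method=None)

print(daily_returns)
print(cumulative_return(close))
```

Applied to the real data, the same helper could be called on `Fourty_Years_Close` directly.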





In [73]:
# Fetch financial statements for one company.
# Assumption: AAPL, which matches the figures in the output below.
ticker_symbol = 'AAPL'
ticker = yf.Ticker(ticker_symbol)
balance_sheet = ticker.balance_sheet
income_statement = ticker.financials
cash_flow = ticker.cash_flow

# Income statement items: display name -> yfinance row label
income_items = {
    'Total Revenue': 'Total Revenue',
    'Cost of Goods Sold': 'Cost Of Revenue',
    'Operational Expenses': 'Operating Expense',
    'Net Income': 'Net Income'
}

# Balance sheet items
balance_items = {
    'Total Assets': 'Total Assets',
    'Total liabilities': 'Total Liabilities Net Minority Interest',
    'Total Current Assets': 'Current Assets',
    'Total Current Liabilities': 'Current Liabilities',
    'Shareholders Equity': 'Stockholders Equity'
}

# Cash flow items
cash_flow_items = {
    'Operating Cash Flow': 'Operating Cash Flow'
}


# Extracting and consolidating data
def extract_data(df, items_dict, statement_name):
    """Extract the specified rows (financial items) from a statement and transpose it."""
    try:
        # Use .loc to select rows by index (financial item names)
        extracted_df = df.loc[list(items_dict.values())].T

        # Rename columns to user-friendly names
        reverse_map = {v: k for k, v in items_dict.items()}
        extracted_df = extracted_df.rename(columns=reverse_map)

        # Add a column for the financial statement name
        extracted_df.insert(0, 'Statement', statement_name)
        return extracted_df

    except KeyError as e:
        print(f"Warning: Could not find one or more keys in {statement_name} for {ticker_symbol}. Missing key(s): {e}")
        return pd.DataFrame()


# Extract data from all three statements
income_extracted = extract_data(income_statement, income_items, 'Income Statement')
balance_extracted = extract_data(balance_sheet, balance_items, 'Balance Sheet')
cashflow_extracted = extract_data(cash_flow, cash_flow_items, 'Cash Flow')

# Print and display statements
print("Income Statement")
display(income_extracted.head())

print("\nBalance Sheet")
display(balance_extracted.head())

print("\nCash Flow")
display(cashflow_extracted.head())



Income Statement


| | Statement | Total Revenue | Cost of Goods Sold | Operational Expenses | Net Income |
| --- | --- | --- | --- | --- | --- |
| 2024-09-30 | Income Statement | 391000000000.0 | 210000000000.0 | 57500000000.0 | 93700000000.0 |
| 2023-09-30 | Income Statement | 383000000000.0 | 214000000000.0 | 54800000000.0 | 97000000000.0 |
| 2022-09-30 | Income Statement | 394000000000.0 | 224000000000.0 | 51300000000.0 | 99800000000.0 |
| 2021-09-30 | Income Statement | 366000000000.0 | 213000000000.0 | 43900000000.0 | 94700000000.0 |



Balance Sheet


| | Statement | Total Assets | Total liabilities | Total Current Assets | Total Current Liabilities | Shareholders Equity |
| --- | --- | --- | --- | --- | --- | --- |
| 2024-09-30 | Balance Sheet | 365000000000.0 | 308000000000.0 | 153000000000.0 | 176000000000.0 | 57000000000.0 |
| 2023-09-30 | Balance Sheet | 353000000000.0 | 290000000000.0 | 144000000000.0 | 145000000000.0 | 62100000000.0 |
| 2022-09-30 | Balance Sheet | 353000000000.0 | 302000000000.0 | 135000000000.0 | 154000000000.0 | 50700000000.0 |
| 2021-09-30 | Balance Sheet | 351000000000.0 | 288000000000.0 | 135000000000.0 | 125000000000.0 | 63100000000.0 |



Cash Flow


| | Statement | Operating Cash Flow |
| --- | --- | --- |
| 2024-09-30 | Cash Flow | 118000000000.0 |
| 2023-09-30 | Cash Flow | 111000000000.0 |
| 2022-09-30 | Cash Flow | 122000000000.0 |
| 2021-09-30 | Cash Flow | 104000000000.0 |
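
`extract_data` returns an empty frame as soon as a single label is missing, and the row labels in a statement may differ from one company to another. A more forgiving variant, sketched below under the assumption that partial results are acceptable, keeps whichever labels exist and reports the rest. The helper name `extract_available` and the toy statement are hypothetical.

```python
import pandas as pd


def extract_available(df: pd.DataFrame, items_dict: dict) -> tuple[pd.DataFrame, list]:
    """Select the rows named in items_dict that exist in df; also list the missing labels."""
    present = {name: label for name, label in items_dict.items() if label in df.index}
    missing = [label for label in items_dict.values() if label not in df.index]

    # Rows are line items, so transpose to get one row per reporting period
    out = df.loc[list(present.values())].T
    out = out.rename(columns={label: name for name, label in present.items()})
    return out, missing


# Toy statement: rows are line items, columns are period end dates
statement = pd.DataFrame(
    {"2024-09-30": [100.0, 60.0], "2023-09-30": [90.0, 55.0]},
    index=["Total Revenue", "Cost Of Revenue"],
)
items = {"Total Revenue": "Total Revenue", "Net Income": "Net Income"}

extracted, missing = extract_available(statement, items)
print(extracted)
print("Missing labels:", missing)
```

Returning the missing labels instead of printing them lets the caller decide whether a gap should stop the pipeline or only be logged.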


In [74]:


def clean_financial_df(df):
    """Cleans extracted financial statement data."""
    df = df.copy()

    # Convert index dates to year
    df.index = pd.to_datetime(df.index, errors='coerce').year
    df.index.name = "Year"

    # Drop 'Statement' column 
    if 'Statement' in df.columns:
        df.drop(columns=['Statement'], inplace=True)

    # Converting all numeric columns to billions
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols] / 1e9

    # Round to two decimal places for readability
    df = df.round(2)

    return df

# Apply cleaning to all three statements
income_extracted = clean_financial_df(income_extracted)
balance_extracted = clean_financial_df(balance_extracted)
cashflow_extracted = clean_financial_df(cashflow_extracted)

# Display clean results
print("Cleaned Income Statement:")
display(income_extracted)

print("\n Cleaned Balance Sheet:")
display(balance_extracted)

print("\n Cleaned Cash Flow Statement:")
display(cashflow_extracted)


Cleaned Income Statement:


| Year | Total Revenue | Cost of Goods Sold | Operational Expenses | Net Income |
| --- | --- | --- | --- | --- |
| 2024 | 391.04 | 210.35 | 57.47 | 93.74 |
| 2023 | 383.28 | 214.14 | 54.85 | 97.0 |
| 2022 | 394.33 | 223.55 | 51.34 | 99.8 |
| 2021 | 365.82 | 212.98 | 43.89 | 94.68 |



 Cleaned Balance Sheet:


| Year | Total Assets | Total liabilities | Total Current Assets | Total Current Liabilities | Shareholders Equity |
| --- | --- | --- | --- | --- | --- |
| 2024 | 364.98 | 308.03 | 152.99 | 176.39 | 56.95 |
| 2023 | 352.58 | 290.44 | 143.57 | 145.31 | 62.15 |
| 2022 | 352.76 | 302.08 | 135.4 | 153.98 | 50.67 |
| 2021 | 351.0 | 287.91 | 134.84 | 125.48 | 63.09 |



 Cleaned Cash Flow Statement:


| Year | Operating Cash Flow |
| --- | --- |
| 2024 | 118.25 |
| 2023 | 110.54 |
| 2022 | 122.15 |
| 2021 | 104.04 |
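
With the statements cleaned and indexed by year, a natural next step toward the revenue growth and cash flow trends named in the objectives is to align them in one frame. The sketch below rebuilds two of the cleaned tables from the values printed above so that it runs on its own. The derived column names `Revenue Growth` and `OCF Margin` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

years = pd.Index([2024, 2023, 2022, 2021], name="Year")

# Values in billions, copied from the cleaned output above
income = pd.DataFrame(
    {"Total Revenue": [391.04, 383.28, 394.33, 365.82],
     "Net Income": [93.74, 97.0, 99.8, 94.68]},
    index=years,
)
cash_flow = pd.DataFrame(
    {"Operating Cash Flow": [118.25, 110.54, 122.15, 104.04]},
    index=years,
)

# Align the statements on Year and order them oldest to newest
combined = pd.concat([income, cash_flow], axis=1).sort_index()

# Year-over-year revenue growth; the earliest year has no prior period, so it is NaN
combined["Revenue Growth"] = combined["Total Revenue"].pct_change(fill_method=None)

# Share of revenue converted into operating cash
combined["OCF Margin"] = combined["Operating Cash Flow"] / combined["Total Revenue"]

print(combined.round(3))
```

Sorting before `pct_change` matters here because the statements arrive newest first. Without it, each growth figure would compare a year with the following one.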
