# Developer Guide

This documentation serves as a guide to help my groupmates understand the revisions made during the process of integrating our work.

In [1]:
!git clone https://github.com/timothyckl/stalking-stocks.git
%cd /content/stalking-stocks
!pip install -r requirements.txt
%cd app/

fatal: destination path 'stalking-stocks' already exists and is not an empty directory.
/content/stalking-stocks
/content/stalking-stocks/app


## Project Structure

To meet the modularity requirement, we apply separation of concerns by dividing the codebase into distinct layers:

1. **Entry Point**

   * `app/main.py`: the starting point of the application that orchestrates initialization and execution.

2. **Domain Constants**

   * `app/constants/`: contains domain-specific constants such as sector definitions that can be reused throughout the application.

3. **Models**

   * `app/models/`: defines core Python objects and base abstractions used to represent entities in the system.

4. **Schemas**

   * `app/schemas/`: defines validation schemas (e.g., for dataframes) to ensure data integrity when handling external data sources.

5. **Services (Business Logic Layer)**

   * `app/services/`: implements the core logic of the system, handling interactions with external APIs such as Yahoo Finance and processing data into usable formats.

     * `core.py`: financial computations for core features.
     * `data.py`: data preprocessing and feature handling.
     * `finance.py`: data retrieval.

6. **Utilities**

   * `app/utils/`: contains helper functions and utility modules that provide generic support across the codebase.

7. **Testing**

   * `tests/`: dedicated space for unit and integration tests to ensure correctness and maintainability.


In [2]:
!sudo apt-get install tree > /dev/null
!tree --gitignore

.
├── constants
│   ├── __init__.py
│   └── sectors.py
├── main.py
├── models
│   ├── base.py
│   └── __init__.py
├── schemas
│   ├── dataframe.py
│   └── __init__.py
├── services
│   ├── core.py
│   ├── data.py
│   ├── finance.py
│   └── __init__.py
└── utils
    ├── helpers.py
    └── __init__.py

5 directories, 13 files


---

## Module Overview

Next, I will go over how each module and function works, along with the considerations I made while integrating our work.

First, I will explain what Pydantic and Pandera are and how they guide the way we design and implement our code.

### 1. `app/models/`

Since the main library we use to access financial data is the `yfinance` Python API, there are several potential issues to be aware of:

1. Information Overload – yfinance provides a large amount of financial data for each request. Without proper filtering, it can be overwhelming and may slow down data processing.

2. Unofficial API – yfinance is a third-party library and is not officially supported by Yahoo Finance. This means it may occasionally break or behave unexpectedly.

3. Dependency on Maintainer Updates – Any changes made by the API’s maintainer can affect our code. We need to manually track and adapt to updates to ensure compatibility.

4. Limited Control and Navigation – The way yfinance structures and delivers data can make it tricky to navigate, manipulate, or integrate into our codebase, so extra care is needed to maintain clarity and reliability.

The solution to these issues is the use of [Pydantic](https://github.com/pydantic/pydantic) and [Pandera](https://github.com/unionai-oss/pandera). At a high level, these libraries enforce [type hints](https://www.geeksforgeeks.org/python/type-hints-in-python/) and leverage them to validate, transform, and ensure the integrity of our data.

#### What does this mean for us?

- We save time on manual data validation because type constraints are enforced automatically.
- Pandera validates and transforms pandas data according to our defined schemas.
- Debugging is easier because errors are caught and flagged automatically.
- We know exactly what data to handle, making it easier to add, remove, or modify datasets as needed.

#### `Ticker`

- Represents an individual stock with its key financial attributes, such as `symbol`, `market_cap`, `price`, `sector`, and dividend information.

- Nullable fields (annotated with `| None`) account for cases where data may be missing.

- Using Pydantic ensures that any `Ticker` object is validated upon creation, so type errors and missing fields are caught early.

In [3]:
from typing import Any

from pandera.typing import DataFrame
from pydantic import BaseModel
from schemas.dataframe import TopGrowing, TopPerforming

# base.py
class Ticker(BaseModel):
    """Represents an individual stock with financial attributes."""

    symbol: str
    display_name: str | None
    long_name: str | None
    short_name: str | None
    market_cap: float | None
    price: float | None
    sector: str | None
    industry: str | None
    description: str | None
    dividend_rate: float | None
    dividend_yield: float | None
    volume: int | None

#### `Industry`

- Represents a market industry and its performance data.

- Contains two Pandera DataFrame types: `top_performing` and `top_growing`, which enforce structured schemas defined in `schemas.dataframe`.

- This allows automatic validation and ensures that tabular data conforms to expected formats.

In [4]:
# base.py
class Industry(BaseModel):
    """Represents an industry and its top-performing companies."""

    top_performing: DataFrame[TopPerforming]
    top_growing: DataFrame[TopGrowing]

#### `Sector`

- Represents a broader market sector, containing multiple industries and financial aggregates.

- Fields include `key` and `name` for identification, `overview` as a flexible dictionary for descriptive info, and lists or dictionaries for top companies, ETFs, mutual funds, and industries.

- Combining Pydantic models (`Ticker`) with Pandera DataFrame schemas ensures consistency across both object-oriented and tabular data.

In [5]:
# base.py
class Sector(BaseModel):
    """Represents a market sector containing multiple industries."""

    key: str
    name: str
    overview: dict[str, Any]
    top_companies: list[Ticker]
    top_etfs: dict[str, str]
    top_mutual_funds: dict[str, str | None]
    industries: list[str]

### 2. `app/schemas/`

This module defines **Pandera schemas** to validate tabular financial data before it is processed in the application. Using **Pandera’s `DataFrameModel`**, we can enforce column types, handle missing data, and ensure data integrity for both historical prices and aggregated company statistics.

---

#### `TopPerforming`

   * Schema for validating **top-performing company data**.
   * Columns include:

     * `name` (str) – company name
     * `ytd_return` (float) – year-to-date return
     * `last_price` (float) – most recent stock price
     * `target_price` (float) – analyst target price
   * `nullable=True` allows missing data in any column.
   * `Config`:

     * `coerce=True` automatically converts data to the correct types.
     * `strict=False` allows extra columns to be added in the future without breaking validation.


In [6]:
# dataframe.py
from pandera.pandas import DataFrameModel, Field
from pandera.typing import Series


class TopPerforming(DataFrameModel):
    """Schema for validating top-performing company data."""

    name: Series[str] = Field(nullable=True)
    ytd_return: Series[float] = Field(nullable=True)
    last_price: Series[float] = Field(nullable=True)
    target_price: Series[float] = Field(nullable=True)

    class Config:
        coerce = True
        strict = False  # future-proof: allow extra columns


#### `TopGrowing`

   * Schema for **top-growing company data**.
   * Columns include:

     * `name` (str) – company name
     * `ytd_return` (float) – year-to-date return
     * `growth_estimate` (float) – projected growth
   * Shares the same configuration as `TopPerforming`, ensuring flexibility and type safety.

In [7]:
# dataframe.py
class TopGrowing(DataFrameModel):
    """Schema for validating top-growing company data."""

    name: Series[str] = Field(nullable=True)
    ytd_return: Series[float] = Field(nullable=True)
    growth_estimate: Series[float] = Field(nullable=True)

    class Config:
        coerce = True
        strict = False  # future-proof: allow extra columns

#### `MarketData`

   * Schema for **historical market price and volume data**, such as what is returned from `yfinance`.
   * Columns include: `Close`, `High`, `Low`, `Open` (all floats), and `Volume` (int).
   * `Config` settings:

     * `coerce=True` ensures data is cast to correct types.
     * `strict=False` allows extra columns if `yfinance` adds new fields in future updates.

In [8]:
# dataframe.py
class MarketData(DataFrameModel):
    """Schema for validating historical market price and volume data."""

    Close: Series[float]
    High: Series[float]
    Low: Series[float]
    Open: Series[float]
    Volume: Series[int]

    class Config:
        coerce = True
        strict = False  # allow extra columns if yfinance adds fields

**Key Benefits:**

* **Automatic Validation:** Pandera ensures that all incoming data matches the expected schema, reducing errors downstream.
* **Data Consistency:** Column types and names are enforced, making it easier to integrate data with business logic.
* **Future-Proofing:** Non-strict mode allows the addition of new columns without breaking existing workflows.
* **Simplified Debugging:** Any schema violations are raised immediately, making it easier to identify problems in data ingestion.

### 3. `app/utils/`

The `helpers.py` module contains general-purpose utility functions that support data ingestion, processing, and transformation across the application. These functions are designed to be reusable, modular, and independent from business logic, helping maintain a clean and maintainable codebase. The functions within this file are straightforward and largely self-explanatory.
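
The `[INFO] ... took N seconds.` lines in the example outputs later in this guide suggest that the service functions are wrapped in a timing decorator. The project's actual helper is not shown here, so the following is only a sketch of how such a decorator might look; the name `timed` and its placement in `helpers.py` are assumptions.

```python
import functools
import time


def timed(func):
    """Hypothetical decorator: print how long each call to func takes."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call returned or raised
            elapsed = time.perf_counter() - start
            print(f"[INFO] {func.__name__} took {elapsed:.6f} seconds.")

    return wrapper


@timed
def add(a, b):
    """Toy function used only to demonstrate the decorator."""
    return a + b


result = add(2, 3)
```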

### 4. `app/services/`

The `services` layer contains the business logic of the application, serving as the bridge between raw data sources (like yfinance) and higher-level application functionality. It's responsible for data retrieval, preprocessing, computations, and real-time querying, while abstracting away raw API calls.

- `finance.py` → data retrieval

- `core.py` → financial computations for core features

- `data.py` → data preprocessing and feature handling

### 5. `app/constants/`

This module provides constant definitions for the sectors recognised by yfinance. The constants serve as a single source of truth for sector names, making it easier to filter, query, or validate financial data consistently across the application.

In [9]:
SECTORS: list[str] = [
    "basic-materials",
    "communication-services",
    "consumer-cyclical",
    "consumer-defensive",
    "energy",
    "financial-services",
    "healthcare",
    "industrials",
    "real-estate",
    "technology",
    "utilities",
]
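
One way the list can act as a single source of truth is to check a user-supplied sector key against it before any network call is made. The sketch below assumes a hypothetical helper named `validate_sector_key`, which is not part of the codebase shown in this guide; the list is repeated only to keep the example self-contained.

```python
SECTORS: list[str] = [
    "basic-materials",
    "communication-services",
    "consumer-cyclical",
    "consumer-defensive",
    "energy",
    "financial-services",
    "healthcare",
    "industrials",
    "real-estate",
    "technology",
    "utilities",
]


def validate_sector_key(sector_key: str) -> str:
    """Hypothetical helper: normalise a key and check it against SECTORS."""
    normalised = sector_key.strip().lower().replace(" ", "-")
    if normalised not in SECTORS:
        raise ValueError(
            f"Unknown sector {sector_key!r}; expected one of {SECTORS}"
        )
    return normalised


key = validate_sector_key(" Real Estate ")
print(key)  # real-estate
```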

---

## Example Usages

In [10]:
from pprint import pprint
from services.finance import *
from services.core import *
from utils.helpers import *

In [11]:
# get info on a single stock
ticker = 'NVDA'
ticker_info = get_ticker_info(ticker_symbol=ticker)
pprint(ticker_info.model_dump())

[INFO] get_ticker_info took 0.532875 seconds.
{'description': 'NVIDIA Corporation, a computing infrastructure company, '
                'provides graphics and compute and networking solutions in the '
                'United States, Singapore, Taiwan, China, Hong Kong, and '
                'internationally. The Compute & Networking segment includes '
                'its Data Centre accelerated computing platforms and '
                'artificial intelligence solutions and software; networking; '
                'automotive platforms and autonomous and electric vehicle '
                'solutions; Jetson for robotics and other embedded platforms; '
                'and DGX Cloud computing services. The Graphics segment offers '
                'GeForce GPUs for gaming and PCs, the GeForce NOW game '
                'streaming service and related infrastructure, and solutions '
                'for gaming platforms; Quadro/NVIDIA RTX GPUs for enterprise '
                'workstatio

In [12]:
# get single ticker data over the last 3 years
start, end = n_year_window(n=3)
df = get_ticker_data(ticker_symbols=ticker, start=start, end=end, auto_adjust=True, progress=False)
df.info()
df.head()

[INFO] get_ticker_data took 0.309596 seconds.
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 751 entries, 2022-09-26 to 2025-09-23
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Close   751 non-null    float64
 1   High    751 non-null    float64
 2   Low     751 non-null    float64
 3   Open    751 non-null    float64
 4   Volume  751 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 35.2 KB


| Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| 2022-09-26 | 12.213326 | 12.64381 | 12.199343 | 12.476011 | 547343000 |
| 2022-09-27 | 12.398106 | 12.720718 | 12.243292 | 12.491993 | 553854000 |
| 2022-09-28 | 12.720718 | 12.807614 | 12.339177 | 12.395109 | 542414000 |
| 2022-09-29 | 12.205338 | 12.485001 | 11.931666 | 12.433064 | 532763000 |
| 2022-09-30 | 12.124434 | 12.617841 | 12.06051 | 12.072496 | 565638000 |


In [13]:
# get multiple ticker data
tickers: list[str] = ['MSFT', 'AAPL', 'NVDA']
df = get_ticker_data(ticker_symbols=tickers, auto_adjust=True, progress=False)
df.info()
df.head()

[INFO] get_ticker_data took 0.601775 seconds.
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 22 entries, 2025-08-25 to 2025-09-24
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   (Close, AAPL)   22 non-null     float64
 1   (Close, MSFT)   22 non-null     float64
 2   (Close, NVDA)   22 non-null     float64
 3   (High, AAPL)    22 non-null     float64
 4   (High, MSFT)    22 non-null     float64
 5   (High, NVDA)    22 non-null     float64
 6   (Low, AAPL)     22 non-null     float64
 7   (Low, MSFT)     22 non-null     float64
 8   (Low, NVDA)     22 non-null     float64
 9   (Open, AAPL)    22 non-null     float64
 10  (Open, MSFT)    22 non-null     float64
 11  (Open, NVDA)    22 non-null     float64
 12  (Volume, AAPL)  22 non-null     int64  
 13  (Volume, MSFT)  22 non-null     int64  
 14  (Volume, NVDA)  22 non-null     int64  
dtypes: float64(12), int64(3)
memory usage: 2.8 KB


| Date | Close AAPL | Close MSFT | Close NVDA | High AAPL | High MSFT | High NVDA | Low AAPL | Low MSFT | Low NVDA | Open AAPL | Open MSFT | Open NVDA | Volume AAPL | Volume MSFT | Volume NVDA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-08-25 | 227.160004 | 504.26001 | 179.799866 | 229.300003 | 508.190002 | 181.899753 | 226.229996 | 504.119995 | 176.560058 | 226.479996 | 506.630005 | 178.339957 | 30983100 | 21638600 | 163012800 |
| 2025-08-26 | 229.309998 | 502.040009 | 181.75975 | 229.490005 | 504.980011 | 182.379711 | 224.690002 | 498.51001 | 178.799911 | 226.869995 | 504.359985 | 180.04984 | 54575100 | 30835700 | 168688200 |
| 2025-08-27 | 230.490005 | 506.73999 | 181.589767 | 230.899994 | 507.290009 | 182.479717 | 228.259995 | 499.899994 | 179.089908 | 228.610001 | 502.0 | 181.969736 | 31259500 | 17277900 | 235518900 |
| 2025-08-28 | 232.559998 | 509.640015 | 180.159836 | 233.410004 | 511.089996 | 184.459596 | 229.339996 | 505.5 | 176.400053 | 230.820007 | 507.089996 | 180.809808 | 38074700 | 18015600 | 281787800 |
| 2025-08-29 | 232.139999 | 506.690002 | 174.170166 | 233.380005 | 509.600006 | 178.139943 | 231.369995 | 504.48999 | 173.140225 | 232.509995 | 508.660004 | 178.099952 | 39418400 | 20961600 | 243257900 |
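
Because the multi-ticker result has two column levels (price field, then ticker), selecting data takes one extra step. The sketch below rebuilds a small frame by hand from the `Close` and `Volume` values of the first two rows in the output above, so it runs without a network call. It relies only on standard pandas indexing.

```python
import pandas as pd

# Two column levels, named as in the output above
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAPL", "MSFT", "NVDA"]],
    names=["Price", "Ticker"],
)
index = pd.DatetimeIndex(["2025-08-25", "2025-08-26"], name="Date")

df = pd.DataFrame(
    [
        [227.160004, 504.26001, 179.799866, 30983100, 21638600, 163012800],
        [229.309998, 502.040009, 181.75975, 54575100, 30835700, 168688200],
    ],
    index=index,
    columns=columns,
)

# One price field for all tickers: the result has one column per ticker
close = df["Close"]

# One ticker across all price fields: cross-section on the Ticker level
nvda = df.xs("NVDA", axis=1, level="Ticker")

# A single series: index with a (field, ticker) tuple
nvda_close = df[("Close", "NVDA")]
```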


In [14]:
# get list of sectors
sectors = get_sectors()
pprint(sectors)

['basic-materials',
 'communication-services',
 'consumer-cyclical',
 'consumer-defensive',
 'energy',
 'financial-services',
 'healthcare',
 'industrials',
 'real-estate',
 'technology',
 'utilities']


In [15]:
# getting a single sector's data
sector_key = 'technology'
sector_data = get_sector_data(sector_key=sector_key)
pprint(sector_data.model_dump().keys())  # printing only keys because industries contain a lot of nested info

dict_keys(['key', 'name', 'overview', 'top_companies', 'top_etfs', 'top_mutual_funds', 'industries'])


In [16]:
# getting industry keys from sector data
pprint(sector_data.industries)

['semiconductors',
 'software-infrastructure',
 'consumer-electronics',
 'software-application',
 'information-technology-services',
 'semiconductor-equipment-materials',
 'computer-hardware',
 'communication-equipment',
 'electronic-components',
 'scientific-technical-instruments',
 'solar',
 'electronics-computer-distribution']


In [17]:
# getting a single industry's data and its top-performing companies
industry_data = get_industry_data(industry_key='software-application')
top_performers = industry_data.top_performing
top_performers.head()

| symbol | name | ytd_return | last_price | target_price |
|---|---|---|---|---|
| MFI | mF International Limited | 6.4888 | 40.26 | 16.0 |
| AZRS | Arculus System Co., Ltd. | 4.4545 | 6.0 | |
| PRCH | Porch Group, Inc. | 2.6748 | 18.08 | 19.75 |
| APPS | Digital Turbine, Inc. | 2.1065 | 5.25 | 6.75 |
| BLBX | Blackboxstocks Inc. | 1.8364 | 6.24 | 6.0 |


In [18]:
top_growers = industry_data.top_growing
top_growers.head()

| symbol | name | ytd_return | growth_estimate |
|---|---|---|---|
| XPER | Xperi Inc. | -0.3749 | 80.0 |
| PRO | PROS Holdings, Inc. | 0.0419 | 6.4 |
| DUOL | Duolingo, Inc. | -0.0788 | 4.714286 |
| PCOR | Procore Technologies, Inc. | -0.0111 | 2.433333 |
| SPT | Sprout Social, Inc. | -0.5448 | 2.285714 |


In [19]:
# calculate sma of a stock's closing price over the past 3 years
nvda = get_ticker_data("NVDA", start=start, end=end, auto_adjust=True, progress=False)
nvda_close = nvda['Close']

sma_5 = compute_sma(nvda_close, window=5)
sma_20 = compute_sma(nvda_close, window=20)
sma_50 = compute_sma(nvda_close, window=50)

[INFO] get_ticker_data took 0.051961 seconds.
[INFO] compute_sma took 0.000477 seconds.
[INFO] compute_sma took 0.000324 seconds.
[INFO] compute_sma took 0.000800 seconds.
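
The implementation of `compute_sma` is not shown in this guide. As a rough sketch, a simple moving average over a pandas Series is commonly computed with `rolling(window).mean()`. The function name `sma_sketch` below is hypothetical, and the real function may differ, for example in how it treats the first `window - 1` rows.

```python
import pandas as pd


def sma_sketch(prices: pd.Series, window: int) -> pd.Series:
    """Hypothetical SMA: mean of the last `window` values at each row."""
    if window < 1:
        raise ValueError("window must be a positive integer")
    # The first window - 1 rows have too few observations and become NaN
    return prices.rolling(window=window).mean()


prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
sma_3 = sma_sketch(prices, window=3)
print(sma_3.tolist())  # [nan, nan, 11.0, 12.0, 13.0]
```

A shorter window tracks the price more closely, while a longer one smooths out short-term moves, which is why the cell above computes 5-, 20-, and 50-day variants.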


In [20]:
# get a stock's longest upward and downward streak over the past 3 years
up_streak, down_streak = compute_streak(nvda_close)
print(f"Longest upward streak: {up_streak}")
print(f"Longest downward streak: {down_streak}")

[INFO] compute_streak took 0.005634 seconds.
Longest upward streak: 10
Longest downward streak: 5
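
How `compute_streak` counts streaks is not shown here. One plausible reading, sketched below in plain Python, is the longest run of consecutive days on which the close rose (or fell) relative to the previous day. The function name `longest_streaks` and the handling of flat days are assumptions.

```python
from collections.abc import Sequence


def longest_streaks(prices: Sequence[float]) -> tuple[int, int]:
    """Hypothetical sketch: longest runs of consecutive rises and falls."""
    best_up = best_down = 0
    up = down = 0
    for previous, current in zip(prices, prices[1:]):
        if current > previous:
            up, down = up + 1, 0
        elif current < previous:
            up, down = 0, down + 1
        else:
            up = down = 0  # assumption: a flat day ends both streaks
        best_up = max(best_up, up)
        best_down = max(best_down, down)
    return best_up, best_down


up_streak, down_streak = longest_streaks([10, 11, 12, 11, 10, 9, 9, 10])
print(up_streak, down_streak)  # 2 3
```

To try this on a pandas Series such as `nvda_close`, passing `nvda_close.tolist()` keeps the pairing purely positional.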


In [21]:
# get a stock's simple daily returns over the past 3 years
sdr = compute_sdr(nvda_close)
print(sdr)

[INFO] compute_sdr took 0.001428 seconds.
Date
2022-09-26         NaN
2022-09-27    0.015129
2022-09-28    0.026021
2022-09-29   -0.040515
2022-09-30   -0.006629
                ...   
2025-09-17   -0.026247
2025-09-18    0.034940
2025-09-19    0.002440
2025-09-22    0.039282
2025-09-23   -0.028212
Name: Close, Length: 751, dtype: float64
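
The leading `NaN` and the values that follow are consistent with the usual definition of a simple daily return, `r_t = P_t / P_{t-1} - 1`. The sketch below recomputes the first rows from the closing prices printed earlier in this guide. Whether `compute_sdr` is implemented exactly this way is an assumption.

```python
import pandas as pd

# First five NVDA closes from the 3-year example output
closes = pd.Series(
    [12.213326, 12.398106, 12.720718, 12.205338, 12.124434],
    index=pd.to_datetime(
        ["2022-09-26", "2022-09-27", "2022-09-28", "2022-09-29", "2022-09-30"]
    ),
    name="Close",
)

# r_t = P_t / P_{t-1} - 1; the first row has no previous price, so it is NaN
returns = closes / closes.shift(1) - 1

# pandas provides the same calculation as pct_change()
same = closes.pct_change()

print(returns.round(6))
```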


In [22]:
# get a stock's potential max profit (assuming multiple buy/sells) over the past 3 years
max_profit = compute_max_profit(nvda_close)
print(max_profit)

[INFO] compute_max_profit took 0.007878 seconds.
799.7827501296997
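
The method behind `compute_max_profit` is not shown in this guide. A common way to compute the best total profit when any number of buy/sell pairs is allowed is to add up every positive day-to-day price change. The sketch below uses that greedy approach; the name `max_profit_sketch` is hypothetical, and it assumes one share held at a time with no fees.

```python
from collections.abc import Sequence


def max_profit_sketch(prices: Sequence[float]) -> float:
    """Hypothetical sketch: best total profit with unlimited buy/sell pairs."""
    profit = 0.0
    for previous, current in zip(prices, prices[1:]):
        # Holding through every rise and sitting out every fall is optimal
        # when trades are free and only one share is held at a time
        profit += max(current - previous, 0.0)
    return profit


total = max_profit_sketch([1.0, 5.0, 3.0, 8.0])
print(total)  # 9.0
```

In the toy series, buying at 1 and selling at 5, then buying at 3 and selling at 8, yields 4 + 5 = 9.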
