## 🏆 Top 10 Tickers in the Dataset

The table below lists the 10 most frequently mentioned ticker symbols in the news dataset. These tickers belong to some of the most actively discussed companies in US capital markets between 2020 and 2024.

| 🥇 Rank | 💹 Ticker Symbol |
|---------|:----------------|
| 1       | 🍏 AAPL         |
| 2       | 💻 MSFT         |
| 3       | 🌐 GOOGL        |
| 4       | 📦 AMZN         |
| 5       | 📱 META         |
| 6       | 🚗 TSLA         |
| 7       | 🤖 NVDA         |
| 8       | 🏦 JPM          |
| 9       | 💳 V            |
| 10      | 🏥 UNH          |
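
A ranking like the one above can be derived by counting how many news items mention each ticker. The sketch below is only an illustration, not necessarily how this table was built: the `tickers` field and the sample records are hypothetical stand-ins for the real news dataset.

```python
from collections import Counter

# Hypothetical sample: each news item lists the tickers it mentions.
news_items = [
    {"headline": "Apple and Microsoft rally", "tickers": ["AAPL", "MSFT"]},
    {"headline": "Apple unveils new chip", "tickers": ["AAPL"]},
    {"headline": "Nvidia earnings beat", "tickers": ["NVDA"]},
    {"headline": "Microsoft cloud growth", "tickers": ["MSFT"]},
    {"headline": "Apple supplier update", "tickers": ["AAPL"]},
]


def top_tickers(items, n=10):
    """Return the n most frequently mentioned tickers as (ticker, count) pairs."""
    counts = Counter()
    for item in items:
        # Count each ticker at most once per news item.
        counts.update(set(item["tickers"]))
    return counts.most_common(n)


ranking = top_tickers(news_items, n=2)
for rank, (ticker, count) in enumerate(ranking, start=1):
    print(f"{rank}. {ticker}: {count} mentions")
```

Counting each ticker once per item keeps a single article that repeats a symbol many times from dominating the ranking.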


In [28]:
import os
import glob
from pathlib import Path
import yfinance
import pandas as pd
from datetime import datetime, timedelta

In [4]:
os.chdir('f:\\Upwork Projects\\Trading Bot\\StockBot')
os.getcwd()

'f:\\Upwork Projects\\Trading Bot\\StockBot'

## 📈 OHLCV (Open, High, Low, Close, Volume) Data Sourcing

This section covers how we fetch and store historical OHLCV data for the top 10 most discussed US tickers using Yahoo Finance.

- 🏦 **Tickers:** AAPL, NVDA, META, TSLA, MSFT, GOOGL, AMZN, JPM, V, UNH
- ⏳ **Time Range:** Last 2 years (hourly interval)
- 💾 **Data Location:** Saved in `data/raw/ohlcv_data/`
- ⚙️ **Automation:** Output directory is created automatically if it doesn't exist
- 🚦 **Error Handling:** Notifies if data is missing or if an error occurs during download
- 📊 **Purpose:** Enables downstream analysis, modeling, and backtesting for trading strategies

⚠️ **Note:** Yahoo Finance only serves intraday data for a limited lookback window: roughly the last 60 days for sub-hourly intervals and up to 730 days for hourly bars. For longer periods, use daily or weekly intervals.
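
One way to stay inside these limits is to clamp the requested start date before calling the download function. The sketch below assumes the lookback limits from the note above; the `MAX_LOOKBACK_DAYS` table and the `clamp_start` helper are hypothetical names, and the limits themselves may change on Yahoo Finance's side.

```python
from datetime import datetime, timedelta

# Assumed lookback limits in days, taken from the note above (illustrative only).
MAX_LOOKBACK_DAYS = {"1h": 730, "30m": 60, "15m": 60, "5m": 60}


def clamp_start(start, end, interval):
    """Return the later of `start` and the earliest date allowed for `interval`."""
    limit = MAX_LOOKBACK_DAYS.get(interval)
    if limit is None:
        # Intervals without a listed limit (e.g. daily) are left unchanged.
        return start
    earliest = end - timedelta(days=limit)
    return max(start, earliest)


end = datetime(2025, 1, 1)
requested = end - timedelta(days=4 * 365)

print(clamp_start(requested, end, "1h"))  # moved forward to 730 days before `end`
print(clamp_start(requested, end, "1d"))  # left as requested
```

With a helper like this, a request for four years of hourly bars is silently shortened to the supported window instead of failing or returning an empty frame.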

In [17]:
# The model specializes its predictions on this fixed set of tickers.
tickers = ['AAPL', 'NVDA', 'META', 'TSLA', 'MSFT', 'GOOGL', 'AMZN', 'JPM', 'V', 'UNH']
end_date = datetime.now()
start_date = end_date - timedelta(days=(2 * 365))  # hourly data is limited to ~730 days

output_dir = 'data/raw/ohlcv_data'
os.makedirs(output_dir, exist_ok=True)  # create the output directory if it is missing

print(f"🚀 Fetching historical OHLCV data for {len(tickers)} tickers...")
for ticker in tickers:
    print(f'⏳ Fetching data for {ticker}')
    try:
        data = yfinance.download(
            ticker,
            start=start_date,
            end=end_date,
            interval='1h',
            ignore_tz=True,
        )

        if not data.empty:
            file_path = os.path.join(output_dir, f'{ticker}_1h.csv')
            data.to_csv(file_path)
            print(f'✅ Successfully fetched and saved data for {ticker} to {file_path}.\n')
        else:
            print(f'⚠️ No data found for {ticker} in the specified date range\n')
    except Exception as e:
        print(f"❌ Error fetching data for {ticker}: {e}\n")

print("\n\n✨ Data fetching complete.")
print(f"📁 Data saved to the '{output_dir}' directory.")


🚀 Fetching historical OHLCV data for 10 tickers...
⏳ Fetching data for AAPL

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for AAPL to data/raw/ohlcv_data\AAPL_1h.csv.

⏳ Fetching data for NVDA

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for NVDA to data/raw/ohlcv_data\NVDA_1h.csv.

⏳ Fetching data for META

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for META to data/raw/ohlcv_data\META_1h.csv.

⏳ Fetching data for TSLA

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for TSLA to data/raw/ohlcv_data\TSLA_1h.csv.

⏳ Fetching data for MSFT

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for MSFT to data/raw/ohlcv_data\MSFT_1h.csv.

⏳ Fetching data for GOOGL

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for GOOGL to data/raw/ohlcv_data\GOOGL_1h.csv.

⏳ Fetching data for AMZN

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for AMZN to data/raw/ohlcv_data\AMZN_1h.csv.

⏳ Fetching data for JPM

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for JPM to data/raw/ohlcv_data\JPM_1h.csv.

⏳ Fetching data for V

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for V to data/raw/ohlcv_data\V_1h.csv.

⏳ Fetching data for UNH

[*********************100%***********************]  1 of 1 completed

✅ Successfully fetched and saved data for UNH to data/raw/ohlcv_data\UNH_1h.csv.



✨ Data fetching complete.
📁 Data saved to the 'data/raw/ohlcv_data' directory.


In [100]:
# Load every ticker's raw CSV for cleaning and processing.
BASE_PATH = Path(__file__).resolve().parents[2] if '__file__' in globals() else Path.cwd().parents[2]
OHLCV_DATA = os.path.join(BASE_PATH, 'data', 'raw', 'ohlcv_data')
csv_files = glob.glob(os.path.join(OHLCV_DATA, '*.csv'))

ohlcv_data = {}
for file in csv_files:
    # File names look like 'AAPL_1h.csv', so the ticker is the part before '_'.
    ticker = Path(file).stem.split('_')[0]
    df = pd.read_csv(file)
    # The first two rows appear to be leftover header rows from the export, not price data.
    df = df.iloc[2:].reset_index(drop=True)
    df = df.rename(columns={'Price': 'Datetime'})
    df['Datetime'] = pd.to_datetime(df['Datetime'])

    # The remaining columns were read as strings; convert them to floats.
    for col in df.columns[1:]:
        df[col] = df[col].astype(float)

    ohlcv_data[ticker] = df
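
The loading cell above drops the first two rows after reading and renames `Price` to `Datetime`, which suggests each CSV carries two extra header rows below the column names. Assuming that layout (the sample data and the `load_ohlcv` helper below are hypothetical), the same result can be sketched by skipping those rows at read time, so pandas infers numeric dtypes on its own:

```python
import io

import pandas as pd

# Hypothetical sample mimicking the assumed three-line header layout.
sample_csv = io.StringIO(
    "Price,Close,High,Low,Open,Volume\n"
    "Ticker,AAPL,AAPL,AAPL,AAPL,AAPL\n"
    "Datetime,,,,,\n"
    "2024-01-02 09:30:00,185.5,186.0,184.9,185.0,1200000\n"
    "2024-01-02 10:30:00,185.9,186.2,185.3,185.5,800000\n"
)


def load_ohlcv(source):
    """Read one CSV with two extra header rows into a typed DataFrame."""
    # Line 0 is kept as the header; lines 1 and 2 are skipped.
    df = pd.read_csv(source, skiprows=[1, 2])
    df = df.rename(columns={"Price": "Datetime"})
    df["Datetime"] = pd.to_datetime(df["Datetime"])
    return df


df = load_ohlcv(sample_csv)
print(df.dtypes)
```

Unlike the cell above, this variant leaves `Volume` as an integer column rather than casting everything to float, so pick whichever matches the downstream code.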


In [114]:
# --- Data Consistency Checks for All Tickers ---

# 1. Check if all DataFrames have the same number of rows (to ensure no data mismatch)
lengths = [len(ohlcv_data[ticker]) for ticker in tickers]
if len(set(lengths)) == 1:
    print(f"✅ All DataFrames have the same number ({lengths[0]}) of rows!")
else:
    print("⚠️ DataFrames have different number of rows:", lengths)

# 2. Check for missing values in each ticker's DataFrame
missing_values = [ohlcv_data[ticker].isna().sum().sum() for ticker in tickers]
if sum(missing_values) == 0:
    print('✅ No missing values found in any tickers. Data is clean!')
else:
    print('⚠️ Some tickers have missing values. Details:')
    for ticker, missing in zip(tickers, missing_values):
        if missing > 0:
            print(f'   - {ticker}: {missing} missing values')

# 3. Check for duplicate timestamps in each ticker's DataFrame
any_dupes = False
for ticker in tickers:
    dupes = ohlcv_data[ticker]['Datetime'].duplicated().sum()
    if dupes > 0:
        print(f'⚠️ {ticker}: {dupes} duplicate datetime entries found!')
        any_dupes = True
        
if not any_dupes:
    print('✅ No duplicate datetime entries found!')
    
# 4. Check for gaps between consecutive rows
print('🔍 Checking for gaps between the rows in each ticker\'s data...')
for ticker in tickers:
    idx = ohlcv_data[ticker]['Datetime']
    gaps = idx.diff().max()
    print(f"    - {ticker}: Largest gap between rows is {gaps}")

# 5. Flag any ticker whose Close moves more than 20% between consecutive rows
print('🚩 Flagging Outliers...')
for ticker in tickers:
    df = ohlcv_data[ticker]
    if 'Close' in df.columns:
        pct_change = df['Close'].pct_change().abs()
        if (pct_change > 0.2).any():
            print(f"    ⚠️ {ticker}: Large price change (>20%) detected! 🚨")
        else:
            print(f"    ✅ {ticker}: No large price changes detected.")

✅ All DataFrames have the same number (3480) of rows!
✅ No missing values found in any tickers. Data is clean!
✅ No duplicate datetime entries found!
🔍 Checking for gaps between the rows in each ticker's data...
    - AAPL: Largest gap between rows is 3 days 18:00:00
    - NVDA: Largest gap between rows is 3 days 18:00:00
    - META: Largest gap between rows is 3 days 18:00:00
    - TSLA: Largest gap between rows is 3 days 18:00:00
    - MSFT: Largest gap between rows is 3 days 18:00:00
    - GOOGL: Largest gap between rows is 3 days 18:00:00
    - AMZN: Largest gap between rows is 3 days 18:00:00
    - JPM: Largest gap between rows is 3 days 18:00:00
    - V: Largest gap between rows is 3 days 18:00:00
    - UNH: Largest gap between rows is 3 days 18:00:00
🚩 Flagging Outliers...
    ✅ AAPL: No large price changes detected.
    ⚠️ NVDA: Large price change (>20%) detected! 🚨
    ✅ META: No large price changes detected.
    ✅ TSLA: No large price changes detected.
    ✅ MSFT: No large pr