<a href="https://colab.research.google.com/github/RohanKnows/RohanKnows/blob/main/01_data_collection_and_setup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection & Environment Setup

## Purpose
This notebook handles:
- Environment setup and dependency installation
- Initial data sourcing for financial news
- Basic validation of collected data

This notebook does **not**:
- Train machine learning models
- Perform sentiment analysis
- Make any financial predictions

## Why This Matters
Reliable data collection is the foundation of any financial AI system.
Errors introduced at this stage propagate to every downstream step and can invalidate results.

## Output
At the end of this notebook, we will have:
- A reproducible environment
- A clean, structured dataset of financial news
- Saved data artifacts for downstream analysis


In [2]:
# Install dependencies before importing them (-q keeps pip output quiet)
!pip install -q transformers datasets yfinance


In [3]:
# Core data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt

# NLP & ML
from transformers import pipeline

# Utilities
import os
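
The "reproducible environment" goal is easier to meet if the installed package versions are recorded alongside the data. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment
            versions[name] = None
    return versions


# Packages installed by this notebook's pip cell
print(installed_versions(["transformers", "datasets", "yfinance"]))
```

Saving this mapping next to the collected data makes it possible to see later which library versions produced a given dataset.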


## Data Source Selection

We use financial news data from Yahoo Finance, accessed through `yfinance`, an open-source Python library that is not an official Yahoo product.

### Why Yahoo Finance?
- Free and accessible
- Ticker-specific news coverage
- Commonly used in financial analysis projects
- Suitable for reproducible academic and portfolio work

### Limitations
- News coverage may be incomplete or delayed
- Article text availability varies
- Not suitable for high-frequency trading systems
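
Because coverage and article text availability vary, it can help to count how often key fields are missing or empty before relying on them. The sketch below assumes each news item is a dict with a nested `content` dict, as in the raw `ticker.news` output shown in this notebook. The helper `count_empty_fields` and the sample records are hypothetical illustrations, not real Yahoo Finance data.

```python
from collections import Counter


def count_empty_fields(items, fields=("title", "summary", "description", "pubDate")):
    """Count, per field, how many items have that field missing or empty."""
    empty = Counter()
    for item in items:
        # Tolerate items with no "content" key at all
        content = item.get("content") or {}
        for field in fields:
            value = content.get(field)
            if value is None or (isinstance(value, str) and not value.strip()):
                empty[field] += 1
    return dict(empty)


# Hypothetical records shaped like the items returned for this run
sample = [
    {"id": "a", "content": {"title": "Example headline", "summary": "Short summary.",
                            "description": "", "pubDate": "2026-01-20T16:29:28Z"}},
    {"id": "b", "content": {"title": "Another headline", "summary": "", "description": ""}},
]
print(count_empty_fields(sample))
```

Fields with a high empty count (such as `description` in this run's output) are poor candidates for downstream text analysis.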


In [4]:
import yfinance as yf

# Select a single stock ticker for initial testing
ticker_symbol = "AAPL"

# Create a ticker object
ticker = yf.Ticker(ticker_symbol)

# Fetch news related to the ticker
news_data = ticker.news

# Inspect the raw output
news_data


[{'id': 'd3292c7c-be07-4408-b272-80868428db8a',
  'content': {'id': 'd3292c7c-be07-4408-b272-80868428db8a',
   'contentType': 'STORY',
   'title': 'Nvidia, Tesla lead tech stocks lower as Trump trade war threats rattle market',
   'description': '',
   'summary': 'Tech stocks led broader market declines as investors grew skittish over geopolitical tensions and fears of an AI bubble continued.',
   'pubDate': '2026-01-20T16:29:28Z',
   'displayTime': '2026-01-20T21:40:54Z',
   'isHosted': True,
   'bypassModal': False,
   'previewUrl': None,
   'thumbnail': {'originalUrl': 'https://s.yimg.com/os/creatr-uploaded-images/2026-01/d9e9f850-f619-11f0-b7f3-82aa17ff9d59',
    'originalWidth': 2808,
    'originalHeight': 1872,
    'caption': '',
    'resolutions': [{'url': 'https://s.yimg.com/uu/api/res/1.2/sAPEep6MH5tJCwff2Y4__g--~B/aD0xODcyO3c9MjgwODthcHBpZD15dGFjaHlvbg--/https://s.yimg.com/os/creatr-uploaded-images/2026-01/d9e9f850-f619-11f0-b7f3-82aa17ff9d59',
      'width': 2808,
      'hei

In [5]:
# Number of news articles retrieved
len(news_data)


10
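
The nested raw output is awkward to analyze directly, so a natural next step is to flatten each item into one row. The sketch below assumes the `id` / `content` layout seen in the raw output above and uses only the standard library. The function name `flatten_news` and the chosen columns are assumptions for illustration, and the structure returned by `yfinance` may differ between versions.

```python
from datetime import datetime, timezone


def flatten_news(items):
    """Turn nested news items into flat dicts, one per article."""
    rows = []
    for item in items:
        content = item.get("content") or {}
        raw_date = content.get("pubDate")
        try:
            # Timestamps in the raw output look like '2026-01-20T16:29:28Z'
            published = datetime.strptime(raw_date, "%Y-%m-%dT%H:%M:%SZ").replace(
                tzinfo=timezone.utc
            )
        except (TypeError, ValueError):
            # Missing or unexpectedly formatted date
            published = None
        rows.append({
            "id": item.get("id"),
            "content_type": content.get("contentType"),
            "title": content.get("title"),
            "summary": content.get("summary"),
            "published": published,
        })
    return rows


# One record copied from the raw output above, reduced to the fields used here
sample = [{
    "id": "d3292c7c-be07-4408-b272-80868428db8a",
    "content": {
        "contentType": "STORY",
        "title": "Nvidia, Tesla lead tech stocks lower as Trump trade war threats rattle market",
        "summary": "Tech stocks led broader market declines as investors grew skittish "
                   "over geopolitical tensions and fears of an AI bubble continued.",
        "pubDate": "2026-01-20T16:29:28Z",
    },
}]
rows = flatten_news(sample)
print(rows[0]["title"], rows[0]["published"])
```

In the notebook, `pd.DataFrame(flatten_news(news_data))` would give a table that can be written out with `to_csv` as a saved artifact for downstream analysis.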