<a href="https://colab.research.google.com/github/RohanKnows/RohanKnows/blob/main/01_data_collection_and_setup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection & Environment Setup

## Purpose
This notebook handles:
- Environment setup and dependency installation
- Initial data sourcing for financial news
- Basic validation of collected data

This notebook does **not**:
- Train machine learning models
- Perform sentiment analysis
- Make any financial predictions

## Why This Matters
Reliable data collection is the foundation of any financial AI system.
Errors introduced at this stage propagate downstream and can invalidate every later result.

## Output
At the end of this notebook, we will have:
- A reproducible environment
- A clean, structured dataset of financial news
- Saved data artifacts for downstream analysis


In [2]:
# Install dependencies first so the imports below succeed on a fresh runtime
!pip install -q transformers datasets yfinance


In [3]:
# Core data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt

# NLP & ML
from transformers import pipeline

# Utilities
import os
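
To support the goal of a reproducible environment, it can help to record which package versions are actually installed. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


# Package names match the pip install command in this notebook.
for name, ver in installed_versions(["transformers", "datasets", "yfinance"]).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Saving these lines next to the data artifacts makes it easier to rebuild the same environment later.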


## Data Source Selection

We use financial news data provided by Yahoo Finance via the `yfinance` Python library.

### Why Yahoo Finance?
- Free and accessible
- Ticker-specific news coverage
- Commonly used in financial analysis projects
- Suitable for reproducible academic and portfolio work

### Limitations
- News coverage may be incomplete or delayed
- Article text availability varies
- Not suitable for high-frequency trading systems


In [4]:
import yfinance as yf

# Select a single stock ticker for initial testing
ticker_symbol = "AAPL"

# Create a ticker object
ticker = yf.Ticker(ticker_symbol)

# Fetch news related to the ticker
news_data = ticker.news

# Inspect the raw output
news_data


[{'id': 'd3292c7c-be07-4408-b272-80868428db8a',
  'content': {'id': 'd3292c7c-be07-4408-b272-80868428db8a',
   'contentType': 'STORY',
   'title': 'Nvidia, Tesla lead tech stocks lower as Trump trade war threats rattle market',
   'description': '',
   'summary': 'Tech stocks led broader market declines as investors grew skittish over geopolitical tensions and fears of an AI bubble continued.',
   'pubDate': '2026-01-20T16:29:28Z',
   'displayTime': '2026-01-20T21:40:54Z',
   'isHosted': True,
   'bypassModal': False,
   'previewUrl': None,
   'thumbnail': {'originalUrl': 'https://s.yimg.com/os/creatr-uploaded-images/2026-01/d9e9f850-f619-11f0-b7f3-82aa17ff9d59',
    'originalWidth': 2808,
    'originalHeight': 1872,
    'caption': '',
    'resolutions': [{'url': 'https://s.yimg.com/uu/api/res/1.2/sAPEep6MH5tJCwff2Y4__g--~B/aD0xODcyO3c9MjgwODthcHBpZD15dGFjaHlvbg--/https://s.yimg.com/os/creatr-uploaded-images/2026-01/d9e9f850-f619-11f0-b7f3-82aa17ff9d59',
      'width': 2808,
      'hei

In [5]:
# Number of news articles retrieved
len(news_data)


10

In [6]:
# Convert raw news data into a structured DataFrame
news_df = pd.DataFrame(news_data)

# Display the first few rows
news_df.head()


| | id | content |
|---|---|---|
| 0 | d3292c7c-be07-4408-b272-80868428db8a | {'id': 'd3292c7c-be07-4408-b272-80868428db8a',... |
| 1 | 86260a35-2724-4758-b856-430c2dfdae58 | {'id': '86260a35-2724-4758-b856-430c2dfdae58',... |
| 2 | 9f43cc1d-70e9-3c42-bede-43e674b4ffb7 | {'id': '9f43cc1d-70e9-3c42-bede-43e674b4ffb7',... |
| 3 | ae610852-a3f9-4ea5-b185-36dffd9d9e79 | {'id': 'ae610852-a3f9-4ea5-b185-36dffd9d9e79',... |
| 4 | a126b847-f93f-40ed-87e4-22e04447f9bb | {'id': 'a126b847-f93f-40ed-87e4-22e04447f9bb',... |


In [7]:
# Inspect available columns
news_df.columns


Index(['id', 'content'], dtype='object')
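
Because the `content` column holds a nested dict per row, `pd.DataFrame(news_data)` leaves the useful fields buried inside it. One way to flatten them is `pd.json_normalize`, sketched here on a hand-made sample that copies only fields visible in the raw output above. The names `sample_news` and `flat_df` are illustrative.

```python
import pandas as pd

# Small stand-in for ticker.news, trimmed to fields shown in the raw output.
sample_news = [
    {
        "id": "d3292c7c-be07-4408-b272-80868428db8a",
        "content": {
            "id": "d3292c7c-be07-4408-b272-80868428db8a",
            "contentType": "STORY",
            "title": "Nvidia, Tesla lead tech stocks lower as Trump trade war threats rattle market",
            "pubDate": "2026-01-20T16:29:28Z",
            "thumbnail": {"originalWidth": 2808, "originalHeight": 1872},
        },
    }
]

# Nested dict keys become dotted column names such as "content.title".
flat_df = pd.json_normalize(sample_news)
print(sorted(flat_df.columns))
```

Columns can then be selected and renamed as needed, for example `flat_df[["content.title", "content.pubDate"]]`.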

In [9]:
# Inspect the structure of a single news item
news_data[0]


{'id': 'd3292c7c-be07-4408-b272-80868428db8a',
 'content': {'id': 'd3292c7c-be07-4408-b272-80868428db8a',
  'contentType': 'STORY',
  'title': 'Nvidia, Tesla lead tech stocks lower as Trump trade war threats rattle market',
  'description': '',
  'summary': 'Tech stocks led broader market declines as investors grew skittish over geopolitical tensions and fears of an AI bubble continued.',
  'pubDate': '2026-01-20T16:29:28Z',
  'displayTime': '2026-01-20T21:40:54Z',
  'isHosted': True,
  'bypassModal': False,
  'previewUrl': None,
  'thumbnail': {'originalUrl': 'https://s.yimg.com/os/creatr-uploaded-images/2026-01/d9e9f850-f619-11f0-b7f3-82aa17ff9d59',
   'originalWidth': 2808,
   'originalHeight': 1872,
   'caption': '',
   'resolutions': [{'url': 'https://s.yimg.com/uu/api/res/1.2/sAPEep6MH5tJCwff2Y4__g--~B/aD0xODcyO3c9MjgwODthcHBpZD15dGFjaHlvbg--/https://s.yimg.com/os/creatr-uploaded-images/2026-01/d9e9f850-f619-11f0-b7f3-82aa17ff9d59',
     'width': 2808,
     'height': 1872,
     '
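
Printing a whole item quickly becomes unreadable, as the truncated output above shows. A compact alternative is to list only the key paths. The recursive helper below (`key_paths`, a hypothetical name) is a sketch that walks nested dicts and, for lists, inspects the first element only.

```python
def key_paths(obj, prefix=""):
    """Return dotted key paths for nested dicts; lists are sampled via their first element."""
    paths = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(obj, list) and obj:
        paths.extend(key_paths(obj[0], f"{prefix}[]"))
    return paths


# Trimmed example with the same shape as the item printed above.
item = {
    "id": "d3292c7c-be07-4408-b272-80868428db8a",
    "content": {
        "title": "Nvidia, Tesla lead tech stocks lower as Trump trade war threats rattle market",
        "pubDate": "2026-01-20T16:29:28Z",
        "thumbnail": {"resolutions": [{"width": 2808, "height": 1872}]},
    },
}
for path in key_paths(item):
    print(path)
```

Running `key_paths(news_data[0])` on the live data would show where fields such as the title sit in the hierarchy.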

In [10]:
# Manually extract relevant fields from raw news data.
# Note: .get() returns None for any key that is absent at the top level of an item.
processed_news = []

for article in news_data:
    processed_news.append({
        "title": article.get("title"),
        "publisher": article.get("publisher"),
        "publish_time": article.get("providerPublishTime")
    })

news_df = pd.DataFrame(processed_news)
news_df.head()


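| | title | publisher | publish_time |
|---|---|---|---|
| 0 | None | None | None |
| 1 | None | None | None |
| 2 | None | None | None |
| 3 | None | None | None |
| 4 | None | None | None |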
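
Every column above is empty because `title` and the other fields are not top-level keys. Each item has only `id` and `content`, and the article fields sit inside `content`. A corrected extraction might look like the sketch below. `title`, `summary`, and `pubDate` appear in the raw output shown earlier. The publisher lookup via `content["provider"]["displayName"]` is an assumption, since that part of the payload is cut off in this notebook. The helper name `extract_fields` is hypothetical.

```python
def extract_fields(news_items):
    """Pull a flat record out of each item's nested 'content' dict."""
    rows = []
    for article in news_items:
        content = article.get("content") or {}
        # ASSUMPTION: publisher is stored under content["provider"]["displayName"];
        # the truncated output in this notebook does not confirm it.
        provider = content.get("provider") or {}
        rows.append({
            "title": content.get("title"),
            "summary": content.get("summary"),
            "publisher": provider.get("displayName"),
            "publish_time": content.get("pubDate"),  # ISO 8601 string, not a Unix timestamp
        })
    return rows


# Sample item built from values shown in the raw output.
sample = [{
    "id": "d3292c7c-be07-4408-b272-80868428db8a",
    "content": {
        "title": "Nvidia, Tesla lead tech stocks lower as Trump trade war threats rattle market",
        "summary": "Tech stocks led broader market declines as investors grew skittish over geopolitical tensions and fears of an AI bubble continued.",
        "pubDate": "2026-01-20T16:29:28Z",
    },
}]
rows = extract_fields(sample)
print(rows[0]["title"])
```

Passing `rows` to `pd.DataFrame` gives the same tabular shape as before, with `publish_time` holding date strings rather than numbers.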


In [11]:
news_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         0 non-null      object
 1   publisher     0 non-null      object
 2   publish_time  0 non-null      object
dtypes: object(3)
memory usage: 372.0+ bytes
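
A non-null count of 0 in every column signals that the extraction failed, not that the news feed was empty. As part of basic validation, a check along these lines could flag the problem immediately. The function name `all_null_columns` is hypothetical, and the sample frame only mimics the result above.

```python
import pandas as pd


def all_null_columns(df):
    """Return the names of columns that contain no non-null values."""
    return [col for col in df.columns if df[col].isna().all()]


# Reproduces the situation above: every extracted field is None.
broken_df = pd.DataFrame([{"title": None, "publisher": None, "publish_time": None}] * 10)
empty = all_null_columns(broken_df)
if empty:
    print(f"Extraction problem, columns with no data: {empty}")
```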


In [12]:
# Convert Unix timestamp to datetime (UTC)
news_df["publish_datetime"] = pd.to_datetime(
    news_df["publish_time"],
    unit="s",
    utc=True
)

news_df.head()


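| | title | publisher | publish_time | publish_datetime |
|---|---|---|---|---|
| 0 | None | None | None | NaT |
| 1 | None | None | None | NaT |
| 2 | None | None | None | NaT |
| 3 | None | None | None | NaT |
| 4 | None | None | None | NaT |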
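
The `unit="s"` argument expects numeric Unix timestamps, which is what the `providerPublishTime` lookup anticipated. In the payload shown in this notebook, the date arrives instead as an ISO 8601 string in `pubDate` (for example `'2026-01-20T16:29:28Z'`), and such strings can be parsed without `unit`. A minimal sketch, using a hand-written sample in place of live data:

```python
import pandas as pd

# Sample values: one date string copied from the raw output, one missing value.
dates = pd.Series(["2026-01-20T16:29:28Z", None])

# utc=True yields timezone-aware timestamps; errors="coerce" turns bad input into NaT.
parsed = pd.to_datetime(dates, utc=True, errors="coerce")
print(parsed)
```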


In [13]:
# Remove rows where publish time is missing
news_df = news_df.dropna(subset=["publish_datetime"])

news_df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   title             0 non-null      object             
 1   publisher         0 non-null      object             
 2   publish_time      0 non-null      object             
 3   publish_datetime  0 non-null      datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), object(3)
memory usage: 0.0+ bytes
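
After the `dropna` step the DataFrame has zero rows, so nothing useful would reach the saved artifacts listed under Output. A guard like the one below fails loudly instead of writing an empty file. It is a sketch only: the function name `save_news` and the output path are hypothetical choices, and CSV is assumed as the artifact format.

```python
import os
import tempfile

import pandas as pd


def save_news(df, path):
    """Write df to CSV, refusing to save an empty dataset."""
    if df.empty:
        raise ValueError("No rows to save; check the extraction step.")
    df.to_csv(path, index=False)
    return path


# Demo frame built from values shown in the raw output.
demo_df = pd.DataFrame({
    "title": ["Nvidia, Tesla lead tech stocks lower as Trump trade war threats rattle market"],
    "publish_datetime": pd.to_datetime(["2026-01-20T16:29:28Z"], utc=True),
})
out_path = save_news(demo_df, os.path.join(tempfile.mkdtemp(), "news_sample.csv"))
print(f"Saved {len(demo_df)} row(s) to {out_path}")
```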
