# 1. Data Collection Stage

## In this stage, we collect raw data from the following service providers and store it in `.csv` files for preprocessing.
---

<details>
  <summary><strong>1. Alpha Vantage </strong></summary>

<br>

**Service Provider Name**: Alpha Vantage  
**Library Name**: `alpha_vantage`  

**Brief Description**:  
An API provider for financial market data such as stock prices, forex rates, and cryptocurrency prices.  

**Services Offered**:  
- `get_intraday()` – Intraday data at 1, 5, 15, 30, or 60-minute intervals  
- `get_daily()` – Daily open/high/low/close (OHLC)  
- `get_weekly()` – Weekly OHLC data  
- `get_monthly()` – Monthly OHLC data  

**Services Availed**:  
We are fetching **daily OHLC data of Reliance Industries** using the `get_daily()` method.

</details>

---

<details>
  <summary><strong>2. Kurt McKee </strong></summary>

<br>

**Service Provider Name**: Kurt McKee  
**Library Name**: `feedparser`  

**Brief Description**:  
A universal feed parser for downloading and parsing syndicated feeds such as RSS, Atom, CDF, and JSON feeds.  

**Services Offered**:  
- `parse()` – Parses an RSS/Atom feed from a URL, file, or string  

**Services Availed**:  
We are fetching **news headlines of Reliance Industries** in RSS format  
and parsing them using the `parse()` method.

</details>

---

## 1.a) Stock Data Fetching & Caching (API Calls)

### In this step we are fetching and caching stock data of Reliance Industries. 

---

<details>
  <summary><strong> 1. Purpose & Overview</strong></summary>

<br>

This cell fetches historical stock data for **Reliance Industries** from the **Alpha Vantage API** and avoids repeating the request when a valid local copy already exists.

**Key Features:**

- Checks if the CSV already exists locally  
- Skips API call if the file is present, non-empty, and contains up-to-date data  
- Automatically filters data between `2020-01-01` and `2024-12-31`  
- Saves the clean data to a CSV file  
- Prints summary statistics and a quick preview of the dataset  

</details>
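
A minimal sketch of the cache check described above, using only the standard library. The function name `stock_cache_is_fresh` and the `required_last_date` threshold are hypothetical, and the sketch assumes the saved CSV has a `date` column holding ISO-formatted dates.

```python
import csv
import os
from datetime import date


def stock_cache_is_fresh(path, required_last_date):
    """Return True if the CSV exists, is non-empty, and its newest row
    is dated on or after `required_last_date` (hypothetical helper)."""
    # Missing or zero-byte file: the data must be fetched
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Assumption: the index was saved under the column name "date"
        dates = [
            date.fromisoformat(row["date"][:10])
            for row in reader
            if row.get("date")
        ]
    # A header-only file has no dates and counts as stale
    return bool(dates) and max(dates) >= required_last_date
```

A caller could skip the API request whenever `stock_cache_is_fresh("../Data/Reliance_Stock_2020_2024.csv", date(2024, 12, 31))` returns `True`, and fetch otherwise.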

---

<details>
  <summary><strong> 2. Why This Matters</strong></summary>

<br>

Fetching data from the web **every time** can lead to:

-  Wasted bandwidth  
-  Slower execution  
-  Unnecessary consumption of limited API credits  

**This logic** reuses saved data when it is valid and re-fetches only when necessary (e.g., if the file is missing, empty, or stale).

</details>

---

<details>
  <summary><strong> 3. Output & Results</strong></summary>

<br>

**Saved CSV**:  
`Reliance_Stock_2020_2024.csv`

**Summary Statistics**:  
- Displayed using `.describe()` on the filtered data

**Preview**:  
- First 5 rows: `.head()`  
- Last 5 rows: `.tail()`  

</details>

---

In [4]:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
import os

# Create the directory if it doesn't exist
os.makedirs("../Data", exist_ok=True)

# Alpha Vantage API key
API_KEY = "L2XYTAMN2JEBGIZM"

# Initialize the Alpha Vantage client
ts = TimeSeries(key=API_KEY, output_format="pandas")


symbol = "RELIANCE.BSE"  # Reliance Industries on the BSE

# Fetching full daily stock data
print(f"Fetching stock data for {symbol}...")
data, meta_data = ts.get_daily(symbol=symbol, outputsize="full")

# Converting index to datetime
data.index = pd.to_datetime(data.index)

# Filter from 2020 to 2024
data_filtered = data[(data.index >= "2020-01-01") & (data.index <= "2024-12-31")]

# Save to CSV for further processing
file_path = "../Data/Reliance_Stock_2020_2024.csv"
data_filtered.to_csv(file_path)

print(f"✅ Reliance 2020–2024 stock data saved to {file_path}")

# sample data shown below
data_filtered.head()

Fetching stock data for RELIANCE.BSE...
✅ Reliance 2020–2024 stock data saved to ../Data/Reliance_Stock_2020_2024.csv


| date | 1. open | 2. high | 3. low | 4. close | 5. volume |
|---|---|---|---|---|---|
| 2024-12-31 | 1210.4 | 1219.05 | 1206.4 | 1215.45 | 444246.0 |
| 2024-12-30 | 1219.65 | 1223.35 | 1208.65 | 1210.9 | 314452.0 |
| 2024-12-27 | 1216.65 | 1227.6 | 1216.65 | 1220.95 | 653086.0 |
| 2024-12-26 | 1224.65 | 1228.0 | 1214.45 | 1216.6 | 556935.0 |
| 2024-12-24 | 1231.9 | 1233.45 | 1221.45 | 1223.5 | 268064.0 |


## 1.b) News Headline Collection (RSS Feed)
### In this step we are collecting monthly news headlines of Reliance Industries 
---

<details> 
  <summary><strong>📰 1. Purpose & Overview</strong></summary> 
  <br>

This section scrapes monthly news headlines related to Reliance Industries between Jan 2020 and Dec 2024 using Google News RSS feeds.

**Key Features:**

- Uses the `feedparser` library to read RSS feeds  
- Constructs URLs that fetch India-specific news in English  
- Iterates month-by-month (due to a 100-article limit per request)  
- Collects article titles and published dates  
- Saves results into a single CSV file  
- Uses basic caching: fresh data is downloaded **only if the file is missing or stale**

</details>
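
The month-by-month URL construction described above can be sketched as follows. The helper names `month_bounds` and `build_rss_url` are hypothetical; the query parameters mirror the ones used in the code cell of this section, and `calendar.monthrange` is used here as one way to find the last day of a month.

```python
import calendar
from urllib.parse import quote_plus


def month_bounds(year, month):
    """Return the first and last day of a month as YYYY-MM-DD strings."""
    last_day = calendar.monthrange(year, month)[1]  # handles leap years
    return f"{year}-{month:02d}-01", f"{year}-{month:02d}-{last_day:02d}"


def build_rss_url(query, year, month):
    """Build a Google News RSS search URL restricted to one month."""
    start, end = month_bounds(year, month)
    # Spaces become "+", colons are kept so "after:" and "before:" stay readable
    search = quote_plus(f"{query} after:{start} before:{end}", safe=":")
    # hl, gl, and ceid localize results to India and English
    return (
        f"https://news.google.com/rss/search?q={search}"
        "&hl=en-IN&gl=IN&ceid=IN:en"
    )


# One (year, month) pair per request, Jan 2020 through Dec 2024
months = [(y, m) for y in range(2020, 2025) for m in range(1, 13)]
urls = [build_rss_url("Reliance Industries", y, m) for y, m in months]
```

Each URL covers a single month, which keeps every request under the 100-article limit mentioned above as long as a month has fewer matching articles than that.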

---

<details> 
  <summary><strong> 2. Why This Matters</strong></summary> 
  <br>

News headlines often contain sentiment, event-driven patterns, and context that can influence stock behavior.

**Advantages of this approach:**

- Avoids duplication and quota issues by limiting each query to one month  
- Reduces redundant data fetching by checking for stale or missing files  
- Enables time-aligned analysis with stock data  
- Easy to preprocess and integrate with natural language models  
- Allows for monthly aggregation, frequency analysis, and sentiment tracking  

</details>
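
As an illustration of the monthly aggregation mentioned above, the sketch below counts headlines per month from their `Published Date` strings. The function names are hypothetical. It assumes dates arrive either in the RSS format seen in the output (e.g. `Fri, 03 Jan 2020 08:00:00 GMT`) or as the `YYYY-MM-DD` fallback that the fetching code uses when an entry has no published date.

```python
from collections import Counter
from datetime import date
from email.utils import parsedate_to_datetime


def to_month_key(raw):
    """Convert a published-date string to a 'YYYY-MM' key."""
    try:
        # RSS dates follow the RFC 2822 style, e.g. "Fri, 03 Jan 2020 08:00:00 GMT"
        return parsedate_to_datetime(raw).strftime("%Y-%m")
    except (TypeError, ValueError):
        # Fallback value written by the fetcher: plain "YYYY-MM-DD"
        return date.fromisoformat(raw).strftime("%Y-%m")


def headlines_per_month(published_dates):
    """Count headlines per calendar month, sorted chronologically."""
    counts = Counter(to_month_key(raw) for raw in published_dates)
    return dict(sorted(counts.items()))


sample = [
    "Fri, 03 Jan 2020 08:00:00 GMT",
    "Fri, 17 Jan 2020 08:00:00 GMT",
    "2020-02-01",
]
print(headlines_per_month(sample))
```

The resulting month keys can be matched against stock data grouped by the same `YYYY-MM` label.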

---

<details> 
  <summary><strong>📎 3. Output & Results</strong></summary> 
  <br>

**Saved CSV**:  
`Reliance_GoogleNews_Monthly_2020_2024.csv`

**Columns**:

- `Headline` — The news headline text  
- `Published Date` — Date the article was published  

**Sample Preview**:
- Displayed using `.head()` on the final DataFrame  
- **Data is not pre-filtered or cleaned yet** — this will be handled in the preprocessing stage  

**Total Headlines Fetched**:
- Printed at the end using `len(df)`

**Note**:
- If a valid CSV already exists with data up to December 2024, it is reused instead of fetching again.

</details>
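
One way the reuse rule in the note above could be checked is sketched below. The function name `news_cache_covers` is hypothetical, and the sketch assumes the CSV has a `Published Date` column holding RSS-style date strings; rows whose date cannot be parsed are skipped.

```python
import csv
import os
from email.utils import parsedate_to_datetime


def news_cache_covers(path, year, month):
    """Return True if the CSV holds at least one headline published
    in or after the given month (hypothetical helper)."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    latest = None
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            try:
                published = parsedate_to_datetime(row["Published Date"])
            except (KeyError, TypeError, ValueError):
                continue  # skip rows without a parseable date
            key = (published.year, published.month)
            if latest is None or key > latest:
                latest = key
    return latest is not None and latest >= (year, month)
```

Calling `news_cache_covers("../Data/Reliance_GoogleNews_Monthly_2020_2024.csv", 2024, 12)` before the download loop would let the notebook reuse the file when it already reaches December 2024.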

---


In [6]:
import feedparser  # Parses RSS feeds
import pandas as pd
from datetime import datetime, timedelta  # Handles date ranges month by month

# Fetch RSS feed headlines for a specific month
def fetch_monthly_news(year, month):
    start_date = f"{year}-{month:02d}-01"  # Each query starts on day 1 of the month
    # Last day of the month: jump into the next month, snap to its first day, step back one day
    end_date = (datetime(year, month, 1) + timedelta(days=31)).replace(day=1) - timedelta(days=1)
    end_date_str = end_date.strftime("%Y-%m-%d")

    rss_url = (
        f"https://news.google.com/rss/search?q=Reliance+Industries"
        f"+after:{start_date}+before:{end_date_str}&hl=en-IN&gl=IN&ceid=IN:en"
        # hl=en-IN, gl=IN, and ceid=IN:en localize results to India and English
    )

    feed = feedparser.parse(rss_url)
    headlines = []
    for entry in feed.entries:
        headlines.append({
            "Headline": entry.title,  # Headline text
            # Published date; falls back to the month's start date if the entry has none
            "Published Date": entry.published if hasattr(entry, 'published') else start_date
        })
    
    return headlines

# Loop over every month from Jan 2020 to Dec 2024, since a single request returns at most 100 articles
all_news = []
for year in range(2020, 2025):
    for month in range(1, 13):
        print(f"📅 Fetching: {year}-{month:02d}")
        headlines = fetch_monthly_news(year, month)
        all_news.extend(headlines) # append all articles into a single list

# Convert to a DataFrame
df = pd.DataFrame(all_news)

# Save to CSV
df.to_csv("../Data/Reliance_GoogleNews_Monthly_2020_2024.csv", index=False)
print(f"\nCollected {len(df)} headlines from 2020 to 2024.")  # Total number of headlines collected
df.head()

📅 Fetching: 2020-01
📅 Fetching: 2020-02
📅 Fetching: 2020-03
📅 Fetching: 2020-04
📅 Fetching: 2020-05
📅 Fetching: 2020-06
📅 Fetching: 2020-07
📅 Fetching: 2020-08
📅 Fetching: 2020-09
📅 Fetching: 2020-10
📅 Fetching: 2020-11
📅 Fetching: 2020-12
📅 Fetching: 2021-01
📅 Fetching: 2021-02
📅 Fetching: 2021-03
📅 Fetching: 2021-04
📅 Fetching: 2021-05
📅 Fetching: 2021-06
📅 Fetching: 2021-07
📅 Fetching: 2021-08
📅 Fetching: 2021-09
📅 Fetching: 2021-10
📅 Fetching: 2021-11
📅 Fetching: 2021-12
📅 Fetching: 2022-01
📅 Fetching: 2022-02
📅 Fetching: 2022-03
📅 Fetching: 2022-04
📅 Fetching: 2022-05
📅 Fetching: 2022-06
📅 Fetching: 2022-07
📅 Fetching: 2022-08
📅 Fetching: 2022-09
📅 Fetching: 2022-10
📅 Fetching: 2022-11
📅 Fetching: 2022-12
📅 Fetching: 2023-01
📅 Fetching: 2023-02
📅 Fetching: 2023-03
📅 Fetching: 2023-04
📅 Fetching: 2023-05
📅 Fetching: 2023-06
📅 Fetching: 2023-07
📅 Fetching: 2023-08
📅 Fetching: 2023-09
📅 Fetching: 2023-10
📅 Fetching: 2023-11
📅 Fetching: 2023-12
📅 Fetching: 2024-01
📅 Fetching: 2024-02


| | Headline | Published Date |
|---|---|---|
| 0 | Building the new Reliance - Fortune India | Fri, 03 Jan 2020 08:00:00 GMT |
| 1 | Reliance Industries posts record Q3 profit at ... | Fri, 17 Jan 2020 08:00:00 GMT |
| 2 | Reliance outpaces industry in petrol, diesel s... | Sun, 19 Jan 2020 08:00:00 GMT |
| 3 | Reliance refers to start-up playbook to grow J... | Tue, 07 Jan 2020 08:00:00 GMT |
| 4 | RIL lays out road with plastic waste - The Hindu | Thu, 30 Jan 2020 08:00:00 GMT |
