# 1. Data Collection Stage

## In this stage, we collect raw data from the following service providers and store it in `.csv` files for preprocessing.

<details>
  <summary><strong>📰 1. Alpha Vantage </strong></summary>

<br>

**Service Provider Name**: Alpha Vantage  
**Library Name**: `alpha_vantage`  

**Brief Description**:  
An API provider for financial market data such as stock prices, forex rates, and crypto prices.  

**Services Offered**:  
- `get_intraday()` – Intraday data at 1, 5, 15, 30, or 60 minute intervals  
- `get_daily()` – Daily open/high/low/close (OHLC)  
- `get_weekly()` – Weekly OHLC data  
- `get_monthly()` – Monthly OHLC data  

**Services Availed**:  
We are fetching **daily OHLC data of Reliance Industries** using the `get_daily()` method.

</details>

---

<details>
  <summary><strong>📰 2. Kurt McKee </strong></summary>

<br>

**Service Provider Name**: Kurt McKee  
**Library Name**: `feedparser`  

**Brief Description**:  
A universal feed parser for downloading and parsing syndicated feeds such as RSS, Atom, CDF, and JSON feeds.  

**Services Offered**:  
- `parse()` – Parses an RSS/Atom feed  

**Services Availed**:  
We are fetching **news headlines of Reliance Industries** in RSS format  
and parsing them using the `parse()` method.

</details>


## 1.a) Importing Stock Price Data

In [1]:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd

# API key issued by Alpha Vantage
API_KEY = "L2XYTAMN2JEBGIZM"

# Initializing the Alpha Vantage API client
ts = TimeSeries(key=API_KEY, output_format="pandas")

# Ticker symbol for Reliance Industries on the BSE
symbol = "RELIANCE.BSE"

# Fetching full daily stock data
print(f"Fetching stock data for {symbol}...")
data, meta_data = ts.get_daily(symbol=symbol, outputsize="full")

# Converting index to datetime
data.index = pd.to_datetime(data.index)

# Filter from 2020 to 2024
data_filtered = data[(data.index >= "2020-01-01") & (data.index <= "2024-12-31")]

# Save to CSV for further processing
file_path = "Reliance_Stock_2020_2024.csv"
data_filtered.to_csv(file_path)

print(f"✅ Reliance 2020–2024 stock data saved to {file_path}")

# sample data shown below
data_filtered.head()

Fetching stock data for RELIANCE.BSE...
✅ Reliance 2020–2024 stock data saved to Reliance_Stock_2020_2024.csv


date,1. open,2. high,3. low,4. close,5. volume
2024-12-31,1210.4,1219.05,1206.4,1215.45,444246.0
2024-12-30,1219.65,1223.35,1208.65,1210.9,314452.0
2024-12-27,1216.65,1227.6,1216.65,1220.95,653086.0
2024-12-26,1224.65,1228.0,1214.45,1216.6,556935.0
2024-12-24,1231.9,1233.45,1221.45,1223.5,268064.0
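
The output above shows two things that tend to matter later: the column labels carry a numeric prefix (`1. open`, `2. high`, ...) and the rows are ordered newest-first. The following is a minimal sketch, not part of the original notebook, of how the frame could be tidied before further use. The helper name `tidy_ohlc` is hypothetical, and the toy frame only copies the first two rows of the output above.

```python
import pandas as pd

# Toy frame mimicking the column labels shown in the output above
# (values copied from the first two rows of that output).
raw = pd.DataFrame(
    {
        "1. open": [1210.4, 1219.65],
        "2. high": [1219.05, 1223.35],
        "3. low": [1206.4, 1208.65],
        "4. close": [1215.45, 1210.9],
        "5. volume": [444246.0, 314452.0],
    },
    index=pd.to_datetime(["2024-12-31", "2024-12-30"]),
)
raw.index.name = "date"


def tidy_ohlc(frame):
    """Hypothetical helper: strip the 'N. ' prefix from labels and sort oldest-first."""
    renamed = frame.rename(columns=lambda label: label.split(". ", 1)[-1])
    return renamed.sort_index()


tidy = tidy_ohlc(raw)
print(tidy.columns.tolist())
print(tidy.index[0].date())
```

Sorting oldest-first is only a convenience for time-based operations; the saved CSV in this notebook keeps the original order.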


## 1.b) Importing News Headlines

In [4]:
import feedparser  # Parses RSS feeds
import pandas as pd
from datetime import datetime, timedelta  # Handle date ranges month by month

# Fetch RSS feed headlines for a specific month
def fetch_monthly_news(year, month):
    start_date = f"{year}-{month:02d}-01"  # Each query window starts on day 1 of the month
    # Last day of the month: jump into the next month, snap to its first day, step back one day
    end_date = (datetime(year, month, 1) + timedelta(days=31)).replace(day=1) - timedelta(days=1)
    end_date_str = end_date.strftime("%Y-%m-%d")

    # hl=en-IN&gl=IN&ceid=IN:en localizes results to India and English
    rss_url = (
        "https://news.google.com/rss/search?q=Reliance+Industries"
        f"+after:{start_date}+before:{end_date_str}&hl=en-IN&gl=IN&ceid=IN:en"
    )

    feed = feedparser.parse(rss_url)
    headlines = []
    for entry in feed.entries:
        headlines.append({
            "Headline": entry.title,  # Headline text
            # Published date; falls back to the month's start date if the entry has none
            "Published Date": entry.published if hasattr(entry, 'published') else start_date
        })
    
    return headlines

# Loop over all months from Jan 2020 to Dec 2024, since a single request returns only 100 articles
all_news = []
for year in range(2020, 2025):
    for month in range(1, 13):
        print(f"📅 Fetching: {year}-{month:02d}")
        headlines = fetch_monthly_news(year, month)
        all_news.extend(headlines) # append all articles into a single list

# Converted to DataFrame
df = pd.DataFrame(all_news)

# Save to CSV
df.to_csv("Reliance_GoogleNews_Monthly_2020_2024.csv", index=False)
print(f"\n✅ Collected {len(df)} headlines from 2020 to 2024.")  # Print the total number of headlines collected from 2020 to 2024
df.head()

📅 Fetching: 2020-01
📅 Fetching: 2020-02
📅 Fetching: 2020-03
📅 Fetching: 2020-04
📅 Fetching: 2020-05
📅 Fetching: 2020-06
📅 Fetching: 2020-07
📅 Fetching: 2020-08
📅 Fetching: 2020-09
📅 Fetching: 2020-10
📅 Fetching: 2020-11
📅 Fetching: 2020-12
📅 Fetching: 2021-01
📅 Fetching: 2021-02
📅 Fetching: 2021-03
📅 Fetching: 2021-04
📅 Fetching: 2021-05
📅 Fetching: 2021-06
📅 Fetching: 2021-07
📅 Fetching: 2021-08
📅 Fetching: 2021-09
📅 Fetching: 2021-10
📅 Fetching: 2021-11
📅 Fetching: 2021-12
📅 Fetching: 2022-01
📅 Fetching: 2022-02
📅 Fetching: 2022-03
📅 Fetching: 2022-04
📅 Fetching: 2022-05
📅 Fetching: 2022-06
📅 Fetching: 2022-07
📅 Fetching: 2022-08
📅 Fetching: 2022-09
📅 Fetching: 2022-10
📅 Fetching: 2022-11
📅 Fetching: 2022-12
📅 Fetching: 2023-01
📅 Fetching: 2023-02
📅 Fetching: 2023-03
📅 Fetching: 2023-04
📅 Fetching: 2023-05
📅 Fetching: 2023-06
📅 Fetching: 2023-07
📅 Fetching: 2023-08
📅 Fetching: 2023-09
📅 Fetching: 2023-10
📅 Fetching: 2023-11
📅 Fetching: 2023-12
📅 Fetching: 2024-01
📅 Fetching: 2024-02


,Headline,Published Date
0,Reliance Industries posts record Q3 profit at ...,"Fri, 17 Jan 2020 08:00:00 GMT"
1,Building the new Reliance - Fortune India,"Fri, 03 Jan 2020 08:00:00 GMT"
2,"Reliance outpaces industry in petrol, diesel s...","Sun, 19 Jan 2020 08:00:00 GMT"
3,Reliance refers to start-up playbook to grow J...,"Tue, 07 Jan 2020 08:00:00 GMT"
4,RIL lays out road with plastic waste - The Hindu,"Thu, 30 Jan 2020 08:00:00 GMT"
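
The monthly query window used by `fetch_monthly_news` can also be derived with the standard library alone. The sketch below is an alternative under stated assumptions, not the notebook's method: `month_window` and `build_rss_url` are hypothetical names, `calendar.monthrange` replaces the `timedelta` arithmetic, and `urlencode` percent-encodes the colons in the query, so the resulting URL is not byte-identical to the hand-built one. Whether the service treats both forms the same way is an assumption.

```python
import calendar
from urllib.parse import urlencode


def month_window(year, month):
    """Hypothetical helper: first and last day of a month as YYYY-MM-DD strings."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    return f"{year}-{month:02d}-01", f"{year}-{month:02d}-{last_day:02d}"


def build_rss_url(query, year, month):
    """Hypothetical helper: build the search URL for one monthly window."""
    start, end = month_window(year, month)
    params = {
        "q": f"{query} after:{start} before:{end}",
        "hl": "en-IN",  # same localization parameters as in the notebook
        "gl": "IN",
        "ceid": "IN:en",
    }
    return "https://news.google.com/rss/search?" + urlencode(params)


print(month_window(2020, 2))
print(build_rss_url("Reliance Industries", 2020, 2))
```

Using `calendar.monthrange` makes the leap-year case explicit instead of relying on the "add 31 days and snap back" trick.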


# 2. Data Preprocessing Stage
## In this stage, we perform the following tasks:

  1. **Reformatting the values of the 'Published Date' column of the news headlines dataset so that they align with the 'date' column of the stock price dataset.**
  2. **Dropping rows with a missing Published Date, if any.**
  3. **Shuffling and dropping duplicate news for the same day.**
  4. **Creating a new feature 'Year_Month' from the existing feature 'Published Date'.**
  5. **Storing the preprocessed data in a .csv file.**

<details>
  <summary><strong>1. Reformatting the values of 'Published Date'</strong></summary>
  Align the format of the `'Published Date'` column in the news headlines dataset  
  with the `'date'` column in the stock price dataset to enable accurate joins and comparisons.
  <br>
</details>

<details>
  <summary><strong>2. Dropping rows with missing 'Published Date'</strong></summary>
  Remove any rows where `'Published Date'` is missing to avoid issues in time-based operations.<br>
</details>

<details>
  <summary><strong>3. Shuffling and dropping duplicate news for the same day</strong></summary>
  Randomly shuffle the dataset and drop duplicates to keep only **one news item per day**,  
  ensuring unbiased daily representation.<br>
</details>

<details>
  <summary><strong>4. Creating a new feature 'Year_Month'</strong></summary>
  Extract the year and month from the `'Published Date'` column to create a new feature  
  called `'Year_Month'` for use in monthly trend analysis or grouping.
</details>

<details>
  <summary><strong>5. Storing the preprocessed data in a .csv file</strong></summary>
  Save the final cleaned and feature-enhanced dataset as a `.csv` file  
  for use in downstream modeling or visualization tasks.
</details>


## 2.a) Preprocessing News Headline Dataset

In [6]:
from datetime import datetime
import pandas as pd
#import torch
#from transformers import AutoTokenizer, AutoModelForSequenceClassification
#from scipy.special import softmax
#from tqdm import tqdm  # For progress bar

In [7]:
df = pd.read_csv("Reliance_GoogleNews_Monthly_2020_2024.csv")  # Load the saved dataset

In [8]:
# Converting 'Published Date' to YYYY/MM/DD; unparseable values become NaN instead of raising
parsed = pd.to_datetime(df["Published Date"], format="%a, %d %b %Y %H:%M:%S GMT", errors="coerce")
df["Published Date"] = parsed.dt.strftime("%Y/%m/%d")

In [9]:
df.head()

,Headline,Published Date
0,Reliance Industries posts record Q3 profit at ...,2020/01/17
1,Building the new Reliance - Fortune India,2020/01/03
2,"Reliance outpaces industry in petrol, diesel s...",2020/01/19
3,Reliance refers to start-up playbook to grow J...,2020/01/07
4,RIL lays out road with plastic waste - The Hindu,2020/01/30
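
The `%a, %d %b %Y %H:%M:%S GMT` pattern used above matches the email-style date strings found in the feed. As a hedged alternative, the standard library's `email.utils.parsedate_to_datetime` parses the same strings without a hand-written pattern. The helper name `rss_date_to_ymd` is hypothetical, and this sketch is not what the notebook itself runs.

```python
from email.utils import parsedate_to_datetime


def rss_date_to_ymd(value):
    """Hypothetical helper: convert an RSS 'published' string to YYYY/MM/DD."""
    return parsedate_to_datetime(value).strftime("%Y/%m/%d")


# Date string taken from the first row of the collected headlines.
print(rss_date_to_ymd("Fri, 17 Jan 2020 08:00:00 GMT"))  # 2020/01/17
```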


In [10]:
# Dropping rows with a missing Published Date, if any
df.dropna(subset=["Published Date"], inplace=True)

# Shuffling and dropping duplicate news for the same day, keyed on Published Date
df = df.sample(frac=1, random_state=42)
df.drop_duplicates(subset=["Published Date"], inplace=True)
df.reset_index(drop=True, inplace=True)

print(f"✅ Randomly selected one article per day. Remaining: {len(df)}")
df.head()

✅ Randomly selected one article per day. Remaining: 1419


,Headline,Published Date
0,"Stocks in Focus: Welspun Enterprises, Reliance...",2024/09/30
1,Green buy: Reliance Industries buys REC Solar ...,2021/10/11
2,Reliance Industries becomes net debt-free comp...,2020/06/19
3,Ram Mandir Inauguration: Reliance announces ho...,2024/01/19
4,India's Reliance swoops on solar capacity as p...,2021/10/10
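
The shuffle-then-deduplicate step works because `drop_duplicates` keeps the first row it sees for each date, so shuffling first makes the surviving headline a random but reproducible pick. The sketch below illustrates this on a toy frame; the headlines `A` to `E` are placeholders invented for illustration, and `one_per_day` is a hypothetical helper name.

```python
import pandas as pd

# Toy data: three placeholder headlines on one day, two on another.
toy = pd.DataFrame(
    {
        "Headline": ["A", "B", "C", "D", "E"],
        "Published Date": [
            "2020/01/17", "2020/01/17", "2020/01/17",
            "2020/01/19", "2020/01/19",
        ],
    }
)


def one_per_day(frame, seed=42):
    """Hypothetical helper: keep one randomly chosen row per 'Published Date'."""
    shuffled = frame.sample(frac=1, random_state=seed)  # reproducible shuffle
    deduped = shuffled.drop_duplicates(subset=["Published Date"])  # keeps first per date
    return deduped.reset_index(drop=True)


picked = one_per_day(toy)
print(len(picked))  # one row per distinct date
```

A fixed `random_state` means rerunning the cell selects the same headline for each day.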


In [11]:
df["Year_Month"] = df["Published Date"].str.slice(0, 7)  # Extract YYYY/MM

# Counted number of articles per month
monthly_counts = df.groupby("Year_Month").size().reset_index(name="News Count")

# Sorted by date
monthly_counts = monthly_counts.sort_values("Year_Month").reset_index(drop=True)

print(monthly_counts)

   Year_Month  News Count
0     2020/01          22
1     2020/02          25
2     2020/03          20
3     2020/04          23
4     2020/05          28
5     2020/06          27
6     2020/07          25
7     2020/08          26
8     2020/09          24
9     2020/10          26
10    2020/11          24
11    2020/12          29
12    2021/01          24
13    2021/02          18
14    2021/03          21
15    2021/04          20
16    2021/05          21
17    2021/06          19
18    2021/07          25
19    2021/08          21
20    2021/09          21
21    2021/10          27
22    2021/11          20
23    2021/12          24
24    2022/01          26
25    2022/02          28
26    2022/03          24
27    2022/04          24
28    2022/05          24
29    2022/06          22
30    2022/07          23
31    2022/08          18
32    2022/09          25
33    2022/10          30
34    2022/11          27
35    2022/12          22
36    2023/01          24
37    2023/0
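
The per-month counts can be cross-checked without pandas, and the same idea can flag months that returned no headlines at all. This is a minimal sketch on made-up sample dates; `monthly_counts` and `missing_months` are hypothetical helper names and are not part of the notebook.

```python
from collections import Counter


def monthly_counts(dates):
    """Hypothetical helper: count 'YYYY/MM/DD' strings per 'YYYY/MM' prefix."""
    return Counter(date[:7] for date in dates)


def missing_months(counts, start_year, end_year):
    """Hypothetical helper: list 'YYYY/MM' keys in the range that have no entries."""
    expected = [
        f"{year}/{month:02d}"
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]
    return [key for key in expected if counts.get(key, 0) == 0]


# Sample dates for illustration only.
sample = ["2020/01/17", "2020/01/03", "2020/03/11"]
counts = monthly_counts(sample)
print(counts["2020/01"])
print(missing_months(counts, 2020, 2020)[:2])
```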

In [12]:
# Save to CSV, as the dataframe has an almost equal number of articles per month
df.to_csv("Reliance_GoogleNews_Monthly_2020_2024_preprocessed.csv", index=False)

In [14]:
df.head()

,Headline,Published Date,Year_Month
0,"Stocks in Focus: Welspun Enterprises, Reliance...",2024/09/30,2024/09
1,Green buy: Reliance Industries buys REC Solar ...,2021/10/11,2021/10
2,Reliance Industries becomes net debt-free comp...,2020/06/19,2020/06
3,Ram Mandir Inauguration: Reliance announces ho...,2024/01/19,2024/01
4,India's Reliance swoops on solar capacity as p...,2021/10/10,2021/10
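
The stated purpose of reformatting `'Published Date'` is to join headlines with prices by date. Note that the stock CSV stores dates as `YYYY-MM-DD` while the headlines use `YYYY/MM/DD`, so a plain string join would not match. The sketch below shows one possible way to join them, parsing both sides to datetimes first. It is an illustration under assumptions, not the notebook's own join: the headline text is made up, and only the two closing prices are copied from the stock output earlier in the notebook.

```python
import pandas as pd

# Two closing prices copied from the stock data output; index is the trading date.
prices = pd.DataFrame(
    {"4. close": [1215.45, 1210.9]},
    index=pd.to_datetime(["2024-12-31", "2024-12-30"]),
)
prices.index.name = "date"

# One illustrative headline in the preprocessed YYYY/MM/DD format (text is made up).
news = pd.DataFrame(
    {
        "Headline": ["Example headline (made up for illustration)"],
        "Published Date": ["2024/12/30"],
    }
)

# Parse the headline dates so both sides share a datetime key.
news["date"] = pd.to_datetime(news["Published Date"], format="%Y/%m/%d")

# Left join keeps every trading day; days without a headline get NaN.
merged = prices.reset_index().merge(news, on="date", how="left")
print(merged[["date", "4. close", "Headline"]])
```

A left join on the price table is one design choice; it preserves trading days with no news, which an inner join would silently drop.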
