In [None]:
!pip install feedparser


Successfully installed feedparser-6.0.12 sgmllib3k-1.0.0


In [None]:
import feedparser
import pandas as pd

# Custom User-Agent to avoid RSS blocking
feedparser.USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# RSS feed sources
rss_feeds = {
    "Yahoo Finance": "https://finance.yahoo.com/news/rssindex",
    "CNBC": "https://www.cnbc.com/id/100003114/device/rss/rss.html",
    "Investing.com": "https://www.investing.com/rss/news.rss"
}

news_data = []

for source, url in rss_feeds.items():
    feed = feedparser.parse(url)
    print(f"{source}: {len(feed.entries)} articles fetched")

    for entry in feed.entries:
        news_data.append({
            "source": source,
            # .get() returns None instead of raising if a feed omits the field
            "headline": entry.get("title"),
            "pubDate": entry.get("published")
        })

df_news = pd.DataFrame(news_data)

df_news.to_csv("news_raw.csv", index=False)

df_news.head()




Yahoo Finance: 42 articles fetched
CNBC: 30 articles fetched
Investing.com: 10 articles fetched


,source,headline,pubDate
0,Yahoo Finance,RAKBANK Wins Approval for AED Stablecoin as UA...,2026-01-07T10:09:33Z
1,Yahoo Finance,The POWER Interview: Investing in Energy Solut...,2026-01-07T09:57:55Z
2,Yahoo Finance,Oil falls after Trump says Venezuela will supp...,2026-01-07T09:50:06Z
3,Yahoo Finance,Is Plug Power (PLUG) One of the Best US Penny ...,2026-01-07T09:45:16Z
4,Yahoo Finance,Rosenblatt and Benchmark Positive on Taboola.c...,2026-01-07T09:45:14Z


In [None]:
df_news['source'].value_counts()


source,count
Yahoo Finance,42
CNBC,30
Investing.com,10


### 1. Which XML tags were used to extract the headlines?

The `<title>` tag was used to extract the news headlines from the RSS feeds.  
In an RSS feed, each news article is represented as an `<item>` element, and the `<title>` tag within the `<item>` contains the headline text of that article.

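During data collection, `feedparser.parse()` parsed each feed, and the value inside the `<title>` tag of each `<item>` (exposed by feedparser as `entry.title`) was saved as the `headline` field in the dataset. The publication timestamp came from the `<pubDate>` tag, which feedparser exposes as `entry.published`.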

---

### 2. What is the role of the `<item>` tag in RSS feeds?

The `<item>` tag represents a single news article or entry in an RSS feed.  
Each `<item>` acts as a self-contained unit that groups together all relevant information about one article.

Typically, an `<item>` includes:
- `<title>` – the headline of the article  
- `<link>` – the URL pointing to the full article  
- `<pubDate>` – the publication date and time  
- `<description>` – a short summary or excerpt  

By structuring data in this way, RSS feeds allow applications to easily iterate over multiple news articles and extract consistent information from each entry. This makes the `<item>` tag the core building block of RSS-based news aggregation.
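
The iteration described above can be sketched with Python's standard-library `xml.etree.ElementTree`. This is only an illustration on a made-up two-item feed (the titles, links, and dates are hypothetical, as is the helper name `parse_items`); the notebook itself relies on `feedparser`, which handles many more edge cases.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet; not taken from any real feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/a</link>
      <pubDate>Wed, 07 Jan 2026 10:00:00 GMT</pubDate>
      <description>Short summary A</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/b</link>
      <pubDate>Wed, 07 Jan 2026 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>, with None for any missing child tag."""
    root = ET.fromstring(rss_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "headline": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "description": item.findtext("description"),
        })
    return records


items = parse_items(SAMPLE_RSS)
for record in items:
    print(record["headline"], "|", record["pubDate"])
```

The channel-level `<title>` is skipped because only `<item>` elements are visited, and the second item shows that optional tags such as `<description>` may be absent.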

---

### 3. How does an RSS feed differ from a normal HTML webpage?

An RSS feed is an XML-based format designed specifically for automated data sharing and machine consumption, whereas a normal HTML webpage is designed primarily for human users and visual presentation.

RSS feeds use a fixed and predictable structure with well-defined tags, making them easy to parse programmatically. In contrast, HTML webpages focus on layout, styling, and user interaction, which makes automated data extraction more complex and less reliable.

Additionally, RSS feeds typically provide only essential content such as headlines, publication dates, and summaries, while HTML pages include additional elements like images, advertisements, scripts, and navigation menus. This difference makes RSS feeds more suitable for data analysis and real-time information retrieval tasks.
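
To illustrate why HTML extraction is more fragile, here is a minimal sketch using the standard-library `html.parser`. The page markup, including the `headline` class name, is invented for the example; a real site would need its own selectors, and those can change whenever the page is redesigned.

```python
from html.parser import HTMLParser

# Hypothetical page markup; real sites each use their own tags and classes.
SAMPLE_HTML = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <div class="ad">Sponsored content</div>
  <h2 class="headline">First headline</h2>
  <script>trackPageView();</script>
  <h2 class="headline">Second headline</h2>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collect text inside <h2 class="headline">, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())


parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
html_headlines = parser.headlines
print(html_headlines)
```

Unlike the RSS case, the extraction rule here depends entirely on page-specific markup, and the parser has to step around navigation, advertising, and script content to reach the headlines.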

In [3]:
import pandas as pd


In [4]:
df = pd.read_csv("news_raw.csv")
df.head()


,source,headline,pubDate
0,Yahoo Finance,RAKBANK Wins Approval for AED Stablecoin as UA...,2026-01-07T10:09:33Z
1,Yahoo Finance,The POWER Interview: Investing in Energy Solut...,2026-01-07T09:57:55Z
2,Yahoo Finance,Oil falls after Trump says Venezuela will supp...,2026-01-07T09:50:06Z
3,Yahoo Finance,Is Plug Power (PLUG) One of the Best US Penny ...,2026-01-07T09:45:16Z
4,Yahoo Finance,Rosenblatt and Benchmark Positive on Taboola.c...,2026-01-07T09:45:14Z


In [6]:

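# Parse publication timestamps as UTC; values that cannot be parsed become NaT
df['pubDate'] = pd.to_datetime(df['pubDate'], utc=True, errors='coerce')
# Calendar date (in UTC), used later to match news against trading days
df['date'] = df['pubDate'].dt.date
# Number of characters in each headline
df['headline_length'] = df['headline'].astype(str).str.len()
df.head()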


,source,headline,pubDate,date,headline_length
0,Yahoo Finance,RAKBANK Wins Approval for AED Stablecoin as UA...,2026-01-07 10:09:33+00:00,2026-01-07,82
1,Yahoo Finance,The POWER Interview: Investing in Energy Solut...,2026-01-07 09:57:55+00:00,2026-01-07,75
2,Yahoo Finance,Oil falls after Trump says Venezuela will supp...,2026-01-07 09:50:06+00:00,2026-01-07,54
3,Yahoo Finance,Is Plug Power (PLUG) One of the Best US Penny ...,2026-01-07 09:45:16+00:00,2026-01-07,60
4,Yahoo Finance,Rosenblatt and Benchmark Positive on Taboola.c...,2026-01-07 09:45:14+00:00,2026-01-07,55
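
The `NaT` that shows up later among the flagged dates indicates that some `pubDate` strings were not parsed. One possible cause, not confirmed by the output shown here, is that the feeds mix date formats: RSS 2.0 specifies RFC 822 dates, while the Yahoo Finance values above are ISO 8601. A minimal standard-library sketch that accepts both (the helper name `parse_pub_date` and the sample strings are hypothetical):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_pub_date(raw):
    """Parse an ISO 8601 or RFC 822 timestamp into an aware UTC datetime.

    Returns None when the value is missing or matches neither format.
    """
    if not raw:
        return None
    text = raw.strip()
    try:
        # fromisoformat() only accepts a trailing "Z" on Python 3.11+,
        # so normalise it to an explicit offset first.
        parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    except ValueError:
        try:
            parsed = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(parse_pub_date("2026-01-07T10:09:33Z"))
print(parse_pub_date("Wed, 07 Jan 2026 10:09:33 GMT"))
```

Applied with `df['pubDate'].apply(parse_pub_date)`, this would leave `None` only for values that match neither format. On pandas 2.x, passing `format="mixed"` to `pd.to_datetime` is another option worth trying.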


In [7]:
df.columns


Index(['source', 'headline', 'pubDate', 'date', 'headline_length'], dtype='object')

In [8]:
df.to_csv("news_cleaned.csv", index=False)

print("Task 3 completed: news_cleaned.csv generated")

Task 3 completed: news_cleaned.csv generated


In [9]:
!pip install yfinance
import yfinance as yf
import pandas as pd
ticker = "AAPL"
stock_df = yf.download(ticker, period="1mo")
stock_df.head()




[*********************100%***********************]  1 of 1 completed


Price,Close,High,Low,Open,Volume
Ticker,AAPL,AAPL,AAPL,AAPL,AAPL
Date,,,,,
2025-12-08,277.890015,279.670013,276.149994,278.130005,38211800
2025-12-09,277.179993,280.029999,276.920013,278.160004,32193300
2025-12-10,278.779999,279.75,276.440002,277.75,33038300
2025-12-11,278.029999,279.589996,273.809998,279.100006,33248000
2025-12-12,278.279999,279.220001,276.820007,277.899994,39532900


In [10]:

# Reorder the columns into the conventional OHLCV layout
stock_df = stock_df[['Open', 'High', 'Low', 'Close', 'Volume']]
len(stock_df)


20

In [12]:
stock_df.to_csv("stock_data.csv")



In [14]:
import pandas as pd

news_df = pd.read_csv("news_cleaned.csv")
stock_df = pd.read_csv("stock_data.csv")
news_df['date'] = pd.to_datetime(news_df['date'])
stock_df.index = pd.to_datetime(stock_df.index)

trading_dates = set(stock_df.index.date)
news_df['is_trading_day'] = news_df['date'].dt.date.isin(trading_dates)
news_df.head()


non_trading_dates = news_df.loc[~news_df['is_trading_day'], 'date'].dt.date.unique()
non_trading_count = (~news_df['is_trading_day']).sum()
non_trading_dates, non_trading_count

print(f"Non-trading days: {non_trading_dates}")
print(f"Count of non-trading days: {non_trading_count}")


Non-trading days: [datetime.date(2026, 1, 7) datetime.date(2026, 1, 6)
 datetime.date(2026, 1, 5) NaT]
Count of non-trading days: 82
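
Every article is flagged here, including ones dated on ordinary weekdays, which suggests the comparison never saw the real trading dates. `pd.read_csv("stock_data.csv")` with default arguments produces a `RangeIndex`, so `pd.to_datetime(stock_df.index)` converts row numbers rather than dates. Below is a hedged sketch of one way to read the file back. It assumes the CSV carries the three header rows (`Price`, `Ticker`, `Date`) that yfinance's two-level columns produce, which the `Price` column in the merged output further down hints at. The inline sample and its prices are made up.

```python
import io
import pandas as pd

# Made-up sample with the header layout that two-level columns produce
# when written by to_csv; the numbers are placeholders, not real prices.
SAMPLE_CSV = """Price,Open,High,Low,Close,Volume
Ticker,AAPL,AAPL,AAPL,AAPL,AAPL
Date,,,,,
2026-01-05,100.0,102.0,99.0,101.0,1000
2026-01-06,101.0,103.0,100.0,102.0,1100
"""

# Keep the first header row, skip the "Ticker" and "Date" rows,
# and use the first column (the dates) as the index.
stock_df = pd.read_csv(io.StringIO(SAMPLE_CSV), skiprows=[1, 2], index_col=0)
stock_df.index = pd.to_datetime(stock_df.index)
stock_df.index.name = "date"

trading_dates = set(stock_df.index.date)

# Two hypothetical news dates: a Tuesday present in the sample and a Saturday.
news_dates = pd.Series(pd.to_datetime(["2026-01-06", "2026-01-10"]))
is_trading_day = news_dates.dt.date.isin(trading_dates)
print(is_trading_day.tolist())
```

In the notebook, `io.StringIO(SAMPLE_CSV)` would be replaced by the path `"stock_data.csv"`. With the dates in the index, only news dates that are absent from the price data are flagged.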


### 1. Which dates in your news data are non-trading days?

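Non-trading days are dates that appear in the news dataset but not in the stock price dataset.  
The comparison between news dates and stock trading dates flagged the following values:

- 2026-01-07 (Wednesday)
- 2026-01-06 (Tuesday)
- 2026-01-05 (Monday)
- `NaT` (articles whose `pubDate` could not be parsed)

All three dates are ordinary weekdays, so this result should be treated with caution. `stock_data.csv` was read back without using its date column as the index, so `stock_df.index` held row numbers rather than trading dates, and no news date could match.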

---

### 2. Why does the stock market not trade on those days?

Stock markets close on weekends (Saturdays and Sundays) and on officially recognized market holidays.  
Financial news can be published on any day, but stock exchanges operate only during scheduled trading sessions. None of the dates flagged above falls on a weekend, so this general explanation does not account for them.
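
Whether a given date falls on a weekend can be checked with the standard library. The sketch below covers weekends only; exchange holidays would need a separate calendar, which is not modelled here. The function name `is_weekend` is illustrative.

```python
from datetime import date


def is_weekend(day):
    """True for Saturday or Sunday (date.weekday() returns 5 or 6)."""
    return day.weekday() >= 5


# The three dates reported by the comparison above.
flagged = [date(2026, 1, 5), date(2026, 1, 6), date(2026, 1, 7)]
for day in flagged:
    print(day.isoformat(), day.strftime("%A"), is_weekend(day))
```

For these three dates the check returns `False` each time, since they fall on a Monday, Tuesday, and Wednesday.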

---

### 3. How many news articles fall on non-trading days?

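The analysis flagged 82 news articles as published on non-trading days. That is every article in the dataset (42 + 30 + 10), including those with a missing date, which indicates that the date comparison matched nothing rather than that the market was closed on all of those days.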

In [15]:

stock_df_reset = stock_df.reset_index()
stock_df_reset.rename(columns={'index': 'date'}, inplace=True)
stock_df_reset['date'] = stock_df_reset['date'].dt.date
stock_df_reset.head()

news_df['date'] = news_df['date'].dt.date
news_df.head()

merged_df = pd.merge(
    news_df,
    stock_df_reset,
    on='date',
    how='left'
)

merged_df.head()


,source,headline,pubDate,date,headline_length,is_trading_day,Price,Open,High,Low,Close,Volume
0,Yahoo Finance,RAKBANK Wins Approval for AED Stablecoin as UA...,2026-01-07 10:09:33+00:00,2026-01-07,82,False,,,,,,
1,Yahoo Finance,The POWER Interview: Investing in Energy Solut...,2026-01-07 09:57:55+00:00,2026-01-07,75,False,,,,,,
2,Yahoo Finance,Oil falls after Trump says Venezuela will supp...,2026-01-07 09:50:06+00:00,2026-01-07,54,False,,,,,,
3,Yahoo Finance,Is Plug Power (PLUG) One of the Best US Penny ...,2026-01-07 09:45:16+00:00,2026-01-07,60,False,,,,,,
4,Yahoo Finance,Rosenblatt and Benchmark Positive on Taboola.c...,2026-01-07 09:45:14+00:00,2026-01-07,55,False,,,,,,


In [18]:
merged_df['is_trading_day'] = ~merged_df['Open'].isna()

merged_df[['date', 'is_trading_day']].head()

merged_df.to_csv("merged_midterm_data.csv", index=False)
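
An alternative way to derive `is_trading_day` is to let `pd.merge` report whether each row found a match, using its `indicator=True` option. The sketch below uses two small made-up frames; the headlines, dates, and prices are placeholders.

```python
from datetime import date

import pandas as pd

# Hypothetical news rows: one on a Tuesday, one on a Saturday.
news = pd.DataFrame({
    "headline": ["A", "B"],
    "date": [date(2026, 1, 6), date(2026, 1, 10)],
})
# Hypothetical price rows for two trading days.
prices = pd.DataFrame({
    "date": [date(2026, 1, 5), date(2026, 1, 6)],
    "Close": [101.0, 102.0],
})

# indicator=True adds a "_merge" column: "both" when the news date
# was found in the price data, "left_only" otherwise.
merged = news.merge(prices, on="date", how="left", indicator=True)
merged["is_trading_day"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")
print(merged)
```

This ties the flag to the merge itself, so it does not depend on any particular price column being non-null.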
