# Market Sage: Financial News Insight Application

Market Sage is a proof-of-concept application that extracts financial news insights by combining traditional NLP techniques with text summarization powered by GPT-Neo. This notebook documents the entire pipeline, including:

- **Section 0:** Overview & Project Setup
- **Section 1:** Merged Data Acquisition from Multiple Free RSS Feeds
- **Section 2:** Model Training and Preprocessing (`model_training.py`)
- **Section 3:** GPT-Neo API Service Setup (`gptneo_service.py`)
- **Section 4:** Flask App Integration (`app.py`) and Minimal HTML Front-End
- **Section 5:** Running the Complete Pipeline
- **Section 6:** Conclusion


## Section 0: Overview & Project Setup

In this project, we build a financial news dataset by aggregating articles from several free RSS feeds:
- Yahoo Finance
- Reuters
- CNBC
- MSN Finance
- Google News (as a proxy for Google Finance)

Then, we train a sentiment classifier using a Naive Bayes model and integrate a GPT-Neo–powered summarization service via FastAPI. Finally, a Flask app with a simple HTML front-end displays the results.

Before proceeding, set up a virtual environment and install the required packages (see Section 5 for details on `requirements.txt`).


## Section 1: Merged Data Acquisition from Multiple Free RSS Feeds

This section combines data acquisition from all five RSS sources into a single script. The script fetches the articles from each feed, tags every article with its source and a placeholder `neutral` label, and writes the merged result to one CSV file (`sample_data.csv`). Save this cell as `data_acquisition_master.py`.

In [9]:
'''
# %% [code] "data_acquisition_master.py"
import feedparser  # Library to parse RSS feeds
import pandas as pd  # For creating and managing DataFrames

# Function to fetch news from Yahoo Finance
def fetch_yahoo_finance():
    # RSS feed URL for Yahoo Finance (example: AAPL headlines)
    rss_url = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US"
    feed = feedparser.parse(rss_url)  # Parse the RSS feed
    news_items = []
    # Loop over each entry in the feed
    for entry in feed.entries:
        title = entry.title  # Get the article title
        # Get the summary if it exists; otherwise, use an empty string
        summary = entry.summary if hasattr(entry, 'summary') else ""
        # Combine title and summary to form full text
        text = title + ". " + summary
        # Append the article as a dictionary; "neutral" is a placeholder label, not a real annotation
        news_items.append({"text": text, "label": "neutral", "source": "Yahoo Finance"})
    return news_items

# Function to fetch news from Reuters Business News RSS feed
def fetch_reuters_finance():
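    # NOTE: Reuters appears to have retired its public RSS feeds, so this URL may
    # return no entries; in that case the function simply returns an empty list.
    rss_url = "http://feeds.reuters.com/reuters/businessNews"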
    feed = feedparser.parse(rss_url)
    news_items = []
    for entry in feed.entries:
        title = entry.title
        summary = entry.summary if hasattr(entry, 'summary') else ""
        text = title + ". " + summary
        # Specify source as Reuters
        news_items.append({"text": text, "label": "neutral", "source": "Reuters"})
    return news_items

# Function to fetch news from CNBC RSS feed
def fetch_cnbc_finance():
    rss_url = "https://www.cnbc.com/id/100003114/device/rss/rss.html"
    feed = feedparser.parse(rss_url)
    news_items = []
    for entry in feed.entries:
        title = entry.title
        summary = entry.summary if hasattr(entry, 'summary') else ""
        text = title + ". " + summary
        news_items.append({"text": text, "label": "neutral", "source": "CNBC"})
    return news_items

# Function to fetch news from MSN Finance RSS feed
def fetch_msn_finance():
    # Example URL for MSN Money RSS feed (adjust URL if needed)
    rss_url = "https://www.msn.com/en-us/money/rss"
    feed = feedparser.parse(rss_url)
    news_items = []
    for entry in feed.entries:
        title = entry.title
        summary = entry.summary if hasattr(entry, 'summary') else ""
        text = title + ". " + summary
        news_items.append({"text": text, "label": "neutral", "source": "MSN Finance"})
    return news_items

# Function to fetch finance-related news from Google News (proxy for Google Finance)
def fetch_google_finance():
    # Google News RSS feed URL for finance search query
    rss_url = "https://news.google.com/rss/search?q=finance&hl=en-US&gl=US&ceid=US:en"
    feed = feedparser.parse(rss_url)
    news_items = []
    for entry in feed.entries:
        title = entry.title
        summary = entry.summary if hasattr(entry, 'summary') else ""
        text = title + ". " + summary
        news_items.append({"text": text, "label": "neutral", "source": "Google News"})
    return news_items

# Function to merge all articles from the different sources
def merge_all_articles():
    # Call each individual fetch function
    yahoo_articles = fetch_yahoo_finance()
    reuters_articles = fetch_reuters_finance()
    cnbc_articles = fetch_cnbc_finance()
    msn_articles = fetch_msn_finance()
    google_articles = fetch_google_finance()
    
    # Combine all article lists into one
    all_articles = yahoo_articles + reuters_articles + cnbc_articles + msn_articles + google_articles
    return all_articles

# Main execution block: merge articles and save as CSV
if __name__ == "__main__":
    articles = merge_all_articles()  # Get merged list of articles
    # Convert the list of dictionaries into a pandas DataFrame
    df = pd.DataFrame(articles)
    # Save DataFrame to CSV file named 'sample_data.csv'
    df.to_csv("sample_data.csv", index=False)
    print("Merged sample_data.csv created with", len(df), "entries")
'''
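
The five fetch functions above share the same loop: read a title, read an optional summary, join them, and attach a source name. As a minimal sketch (not part of the original script), that shared step could be factored into one helper, with a second helper that drops articles whose text appears more than once across feeds. The names `build_records` and `dedupe_records` are hypothetical, and the sample entries are made-up placeholders. Unlike the original, this sketch omits the trailing `". "` when an entry has no summary and skips entries with neither field.

```python
from typing import Dict, Iterable, List, Mapping


def build_records(entries: Iterable[Mapping[str, str]], source: str) -> List[Dict[str, str]]:
    """Convert dict-like feed entries into {"text", "label", "source"} records."""
    records = []
    for entry in entries:
        title = (entry.get("title") or "").strip()
        summary = (entry.get("summary") or "").strip()
        # Keep only the parts that are actually present
        parts = [part for part in (title, summary) if part]
        if not parts:
            continue  # Entry has neither a title nor a summary
        records.append({
            "text": ". ".join(parts),
            "label": "neutral",  # Placeholder label, as in the original script
            "source": source,
        })
    return records


def dedupe_records(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record for each distinct text (case-insensitive)."""
    seen = set()
    unique = []
    for record in records:
        key = record["text"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == "__main__":
    # Made-up entries standing in for a parsed feed
    sample_entries = [
        {"title": "Stocks rise", "summary": "Markets closed higher."},
        {"title": "Stocks rise", "summary": "Markets closed higher."},
        {"title": "Headline without a summary"},
    ]
    for record in dedupe_records(build_records(sample_entries, "Example Feed")):
        print(record)
```

Because `feedparser` entries support dictionary-style `.get()`, the `entries` list of a parsed feed could probably be passed to `build_records` directly, with each `fetch_*` function reduced to a URL and a source name.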
