# 2024 Assignment for Text Mining Course at GSOM

*Requirements*:
- The assignment must be completed in **groups of 3 or 4**.
- The output of the assignment is a **Python notebook**.

*Task*:
- Financial news provides information about the financial status of companies and thus could potentially help to explain (after-the-fact) the movement of stock prices on the stock market.
- Some [Algorithmic Trading](https://en.wikipedia.org/wiki/Algorithmic_trading) and [High-frequency Trading](https://en.wikipedia.org/wiki/High-frequency_trading#News-based_trading) platforms make use of news feeds when deciding to buy and sell stock, which means that news feeds may even be informative of future (intra-day) changes in stock prices.
- The aim of this assignment is to investigate **whether, by training a text classifier on financial news headlines, it is possible to explain** (after-the-fact) the **price movements** on the stock market on that day.
- More information on the sources of information that can be used for the project -- specifically news feed information and stock price information -- is provided in the notebook below.

*Hand-in*:
- The **assignment is due on Sunday the 2nd of June (by midnight)**.  
- Each group should hand in their assignment by **sending a single email** to Mark Carman (mark.carman@polimi.it) with their saved notebook attached.

*Presentations*:
- On **Monday the 3rd of June**, we will hold a presentation session.
- Each group will have **10 minutes to present their notebook**.
- No slides are needed, just make sure that the notebook is clear and self-explanatory, with headings, and explanations of the analysis and graphs generated.



## 1. Gathering news headlines:

There are various sources of news headlines available online.
- Simplest might be to gather them from Google News' RSS feed.
- The code below downloads an RSS feed on news about the company Tesla with ticker symbol 'TSLA' and parses it with the Beautiful Soup library:

In [1]:
import urllib.request
import bs4 as bs
import time

ticker = 'TSLA'
url = 'https://news.google.com/rss/search?hl=en-US&q='+ticker+'&gl=US&ceid=US:en'

time.sleep(15) ## wait 15 seconds between each request. This is SUPER IMPORTANT otherwise your IP-address will be banned for sending too frequent requests.

doc = urllib.request.urlopen(url).read()
parsed_doc = bs.BeautifulSoup(doc,'lxml')
print(parsed_doc.prettify())

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html>
 <body>
  <rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
   <channel>
    <generator>
     NFE/5.0
    </generator>
    <title>
     "TSLA" - Google News
    </title>
    <link/>
    https://news.google.com/search?hl=en-US&amp;q=TSLA&amp;gl=US&amp;ceid=US:en
    <language>
     en-US
    </language>
    <webmaster>
     news-webmaster@google.com
    </webmaster>
    <copyright>
     2024 Google Inc.
    </copyright>
    <lastbuilddate>
     Wed, 01 May 2024 09:09:26 GMT
    </lastbuilddate>
    <description>
     Google News
    </description>
    <item>
     <title>
      Tesla jumps 15% after passing key hurdle to roll out advanced driver-assistance tech in China - CNBC
     </title>
     <link/>
     https://news.google.com/rss/articles/CBMiaGh0dHBzOi8vd3d3LmNuYmMuY29tLzIwMjQvMDQvMjkvdGVzbGEtdHNsYS1zdG9jay11cC1hZnRlci1wYXNzaW5nLWh1cmRsZS10by1jaGluYS1mdWxsLXNlbGYtZHJpdmluZy5odG1s0gFsaHR0cHM6Ly93d3cuY25iYy





Having downloaded the document and parsed it, we could extract the headlines:
- Note that you could also extract the links and try to download the content of the articles. (But make sure to use a timeout if you do!)





In [2]:
titles = parsed_doc.find_all('title')
for title in titles:
  print(title.text)

"TSLA" - Google News
Tesla jumps 15% after passing key hurdle to roll out advanced driver-assistance tech in China - CNBC
TSLA vs. F: Which Automaker Stock Is the Better Buy? - TipRanks.com - TipRanks
Former Exec Drew Baglino Just Dumped $181 Million of Tesla (TSLA) Stock - InvestorPlace
Tesla short sellers lose more than $5 billion in post-earnings rally - Yahoo Finance
Tesla Set to Boost Revenue and Margins with China Breakthrough, Analyst Predicts - Markets Insider
Tesla (TSLA) Sees California Registrations Fall for Second Straight Quarter - Bloomberg
Stocks making the biggest moves midday: TSLA, ROKU, AAPL, DPZ - CNBC
TSLA Stock: Tesla Stock Quotes, Company News And Chart Analysis | Stock News & Stock Market Analysis — IBD - Investor's Business Daily
TSLA Stock Alert: Another Round of Tesla Layoffs Slashes Supercharger Team - InvestorPlace
Tesla Stock Forecast: TSLA accelerates 15% on Monday after FSD approval - FXStreet
Tesla had a miserable quarter. Why is TSLA stock rising? - Fa
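
The note above about extracting links can be sketched as follows. This is only an illustration and not part of the original notebook: it parses a small hand-written RSS snippet (the `sample_rss` string and the `parse_items` helper are hypothetical) with the standard library's `xml.etree.ElementTree` instead of Beautiful Soup. An XML parser keeps the text of each `<link>` element inside the tag, whereas in the prettified output above the `lxml` HTML parser leaves `<link/>` empty. The sketch also reads each item's `<pubDate>`, which would be needed to match a headline to a trading day, assuming the feed provides one.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical, hand-written RSS snippet standing in for a downloaded feed.
sample_rss = """<rss version="2.0">
  <channel>
    <title>"TSLA" - Google News</title>
    <item>
      <title>Example headline one - Example Source</title>
      <link>https://example.com/article-1</link>
      <pubDate>Mon, 29 Apr 2024 14:30:00 GMT</pubDate>
    </item>
    <item>
      <title>Example headline two - Another Source</title>
      <link>https://example.com/article-2</link>
      <pubDate>Tue, 30 Apr 2024 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_items(rss_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter('item'):
        pub_date = item.findtext('pubDate')
        items.append({
            'title': item.findtext('title'),
            'link': item.findtext('link'),
            # Convert the RFC 2822 date string into a calendar date (None if missing).
            'date': parsedate_to_datetime(pub_date).date() if pub_date else None,
        })
    return items

items = parse_items(sample_rss)
for entry in items:
    print(entry['date'], entry['title'], entry['link'])
```

With a real feed, the bytes returned by `urllib.request.urlopen(url).read()` could be passed to `parse_items` in place of `sample_rss`. Iterating over `item` elements also skips the channel-level title without needing the `[1:]` slice.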

We can package the above code in a function so that it can easily be called for different ticker symbols:

In [3]:
import time  # needed for time.sleep() inside get_titles
import urllib.request
import bs4 as bs

def get_titles(ticker):
  url = 'https://news.google.com/rss/search?hl=en-US&q='+ticker+'&gl=US&ceid=US:en'
  time.sleep(15) ## wait 15 seconds between each request. This is SUPER IMPORTANT!
  doc = urllib.request.urlopen(url).read()
  parsed_doc = bs.BeautifulSoup(doc,'lxml')
  titles = parsed_doc.find_all('title')[1:]
  return [title.text for title in titles]

- Let's call the code for a few different ticker symbols below:

In [4]:
tickers = ['aapl','msft','amzn','goog','tsla','nvda','pypl','nflx','csco','avgo','orcl','qcom']
for ticker in tickers:
  print('ticker: ', ticker)
  print(get_titles(ticker))

ticker:  aapl
['Apple stock rises on upgrade from Bernstein - Yahoo Finance', "Bernstein says 'be like Buffett,' buy Apple shares while cheap - CNBC", 'Is Apple Stock Going to $195? 1 Wall Street Analyst Thinks So. - The Motley Fool', 'Play Apple Stock Like Warren Buffett, Advises Bernstein - TipRanks.com - TipRanks', 'After being neutral on Apple’s stock since 2018, one analyst changes his tune - MarketWatch', 'Wall Street Braces For Brutal Apple Earnings, But Top Analyst Gives 6 Reasons To Stay Bullish On iPhone Maker - TradingView', "Here's Our Plan for Apple's Stock - TheStreet", 'Apple (NASDAQ:AAPL) Boosts AI Capabilities Ahead of Q2 Earnings - TipRanks.com - TipRanks', 'Weekly Preview: Earnings to Watch This Week 4-28-24 (AAPL, AMZN, SQ) - Nasdaq', 'Stocks making the biggest moves midday: TSLA, ROKU, AAPL, DPZ - CNBC', 'Apple (AAPL) Earnings: Will Investors Look Past China Woes? - tastylive ', 'Apple Earnings Preview: Is AAPL Stock a Buy Ahead of the May 2 Report? - InvestorPlace
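
The headlines returned above end with the name of the publisher (for example ` - CNBC`), which a classifier might pick up on instead of the headline text. One possible cleaning step, shown here as a minimal sketch with a hypothetical helper `split_headline`, separates the headline from the trailing source name by splitting on the last ` - `. It assumes the publisher name itself does not contain ` - `, which will not always hold (a title ending in `- TipRanks.com - TipRanks` would keep `TipRanks.com` in the headline part).

```python
def split_headline(title):
    """Split 'Headline text - Publisher' into (headline, publisher).

    Splits on the last ' - ' only; returns (title, None) if no separator is found.
    """
    headline, sep, publisher = title.rpartition(' - ')
    if not sep:
        return title.strip(), None
    return headline.strip(), publisher.strip()

examples = [
    'Apple stock rises on upgrade from Bernstein - Yahoo Finance',
    'Stocks making the biggest moves midday: TSLA, ROKU, AAPL, DPZ - CNBC',
    'Headline without a publisher suffix',
]
for title in examples:
    print(split_headline(title))
```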

## 2. Gathering a list of Ticker symbols

You will need to source a list of Ticker symbols from somewhere.
- There are many lists online of companies with high market capitalisations, e.g.: http://www.iweblists.com/us/commerce/MarketCapitalization.html
- Note: you don't need to gather this list programmatically (via an API), you can just copy it into Python.

## 3. Sourcing stock price information

You could source information on share price movements from many sources. For example, you could use the yfinance library.
- NOTE: This library scrapes data from the Yahoo Finance website, so be CAREFUL not to request pages too frequently (wait at least 15 seconds between requests), otherwise you will likely be banned by their web servers.
- Here is a recent blog post showing how to get stock data using Python: https://towardsdatascience.com/how-to-get-stock-data-using-python-c0de1df17e75
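
One way to make sure the waiting time between requests is always respected is to wrap it in a small helper rather than repeating `time.sleep(15)` by hand. The sketch below is only an illustration: the `Throttle` class and its `min_interval` parameter are hypothetical names, not part of yfinance or any other library. Unlike the cells in this notebook, it does not sleep before the very first request, and afterwards it sleeps only for the time still remaining since the previous call.

```python
import time

class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval=15.0):
        self.min_interval = min_interval
        self._last_call = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

# Short interval for demonstration only; use 15 seconds or more for real requests.
throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for i in range(3):
    throttle.wait()  # call this immediately before each web request
    print('request', i, 'sent after', round(time.monotonic() - start, 1), 'seconds')
```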

In [None]:
#!pip3 install yfinance

Having installed the library, you can load it:

In [None]:
import yfinance as yf
import time

ticker = 'MSFT'
tickerData = yf.Ticker(ticker)

Once we have the information on that ticker symbol, we can print out information about its stock price over the last few days:

In [None]:
time.sleep(15) ## wait 15 seconds between each request. This is SUPER IMPORTANT otherwise your IP-address will be banned for sending too frequent requests.
df = tickerData.history()
df

We could use this data to find how the stock price has changed recently. For example, we could find the difference in the closing prices between consecutive days:

In [None]:
df['Change'] = df['Close'].diff()
df

To make use of this stock price change as a label for building a text classifier, we might simply check whether the change in the closing price on that day was positive or negative.
-- i.e. we use the news data to predict whether the change in the stock price was positive or negative.

In [None]:
df['Positive Change'] = (df['Change'] > 0)
df

Better might be to check first **whether there was a substantial change** in the price, e.g. where the change was greater than some threshold -- say 1%:

In [None]:
threshold = 0.01
# Relative change is measured against the previous day's closing price.
df['Substantial Change'] = (abs(df['Change'])/df['Close'].shift(1) > threshold)
df

And then train a classifier to predict the change only if it is substantial -- i.e. to predict **three classes (*positive, no_change, negative*)**.

In [None]:
import numpy as np

threshold = 0.01
df['label'] = np.select(
    [
        # Relative change is measured against the previous day's closing price.
        (abs(df['Change'])/df['Close'].shift(1) > threshold) & (df['Change'] > 0),
        (abs(df['Change'])/df['Close'].shift(1) > threshold) & (df['Change'] < 0)
    ],
    [
        'positive',
        'negative'
    ],
    default='no_change'
)
df

Alternatively, one could train a binary classifier using only the examples where there was a substantial change in the stock price, i.e. remove from the training set the days on which the change was below the threshold.
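
The labelling options above can be tried out without touching the network. The following is a minimal sketch on made-up closing prices (the `prices` frame and its values are invented for illustration). It measures the daily change relative to the previous day's close using `pct_change()`, which is an assumption about how the 1% threshold should be applied, assigns the three classes, and then keeps only the days with a substantial change for the binary variant.

```python
import numpy as np
import pandas as pd

# Invented closing prices for six consecutive business days.
prices = pd.DataFrame(
    {'Close': [100.0, 102.0, 101.5, 99.0, 99.2, 103.0]},
    index=pd.date_range('2024-04-22', periods=6, freq='B'),
)

threshold = 0.01
# Relative change with respect to the previous day's close (NaN on the first day).
prices['Return'] = prices['Close'].pct_change()

# Three-class label: comparisons with NaN are False, so the first day is 'no_change'.
prices['label'] = np.select(
    [prices['Return'] > threshold, prices['Return'] < -threshold],
    ['positive', 'negative'],
    default='no_change',
)

# Binary variant: keep only the days with a substantial change.
binary = prices[prices['label'] != 'no_change'].copy()
binary['Positive Change'] = binary['label'] == 'positive'

print(prices)
print(binary[['Close', 'Return', 'Positive Change']])
```

Note that the first day has no previous close and therefore ends up as `no_change`; in a real training set it may be cleaner to drop that row explicitly.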

More advanced things to investigate:
- You could have a look at the recommendations to see if they can be predicted somehow.

In [None]:
time.sleep(15) ## wait 15 seconds between each request. This is SUPER IMPORTANT!
tickerData.recommendations

Have fun with the assignment!!
