# 2024 Assignment for Text Mining Course at GSOM

*Requirements*:
- The assignment must be completed in **groups of 3 or 4**.
- The output of the assignment is a **Python notebook** (use graphs to make your results clearer).

*Task*:
- Financial news provides information about the financial status of companies and thus could potentially help to explain (after-the-fact) the movement of stock prices on the stock market.
- Some [Algorithmic Trading](https://en.wikipedia.org/wiki/Algorithmic_trading) and [High-frequency Trading](https://en.wikipedia.org/wiki/High-frequency_trading#News-based_trading) platforms make use of news feeds when deciding to buy and sell stock, which means that news feeds may even be informative of future (intra-day) changes in stock prices.
- The aim of this assignment is to investigate **whether through training a text classifier on financial news** headlines, **it is possible to explain** (after-the-fact) the **price movements** on the stock market on that day.
- More information on the sources of information that can be used for the project -- specifically news feed information and stock price information -- is provided in the notebook below.

*Hand-in*:
- The **assignment is due on Sunday the 2nd of June (by midnight)**.  
- Each group should hand in their assignment by **sending a single email** to Mark Carman (mark.carman@polimi.it) with their saved notebook attached.

*Presentations*:
- On **Monday the 3rd of June**, we will hold a presentation session.
- Each group will have **10 minutes to present their notebook**.
- No slides are needed, just make sure that the notebook is clear and self-explanatory, with headings, and explanations of the analysis and graphs generated.

*Evaluation*:
- How far you have gone in the data mining
- How clear the presentation is
- How clear the code is

## 1. Gathering news headlines:

There are various sources of news headlines available online.
- The simplest option might be to gather them from Google News' RSS feed.
- The code below downloads an RSS feed on news about the company Tesla with ticker symbol 'TSLA' and parses it with the Beautiful Soup library:

In [3]:
import urllib.request
import bs4 as bs
import time

ticker = 'TSLA'
url = 'https://news.google.com/rss/search?hl=en-US&q='+ticker+'&gl=US&ceid=US:en'

time.sleep(15) ## wait 15 seconds between each request. This is SUPER IMPORTANT otherwise your IP-address will be banned for sending too frequent requests.

doc = urllib.request.urlopen(url).read()
parsed_doc = bs.BeautifulSoup(doc,'lxml')
print(parsed_doc.prettify())

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html>
 <body>
  <rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
   <channel>
    <generator>
     NFE/5.0
    </generator>
    <title>
     "TSLA" - Google News
    </title>
    <link/>
    https://news.google.com/search?hl=en-US&amp;q=TSLA&amp;gl=US&amp;ceid=US:en
    <language>
     en-US
    </language>
    <webmaster>
     news-webmaster@google.com
    </webmaster>
    <copyright>
     2024 Google Inc.
    </copyright>
    <lastbuilddate>
     Fri, 19 Apr 2024 09:52:19 GMT
    </lastbuilddate>
    <description>
     Google News
    </description>
    <item>
     <title>
      Tesla (TSLA) Sees a More Significant Dip Than Broader Market: Some Facts to Know - Yahoo Finance
     </title>
     <link/>
     https://news.google.com/rss/articles/CBMiTmh0dHBzOi8vZmluYW5jZS55YWhvby5jb20vbmV3cy90ZXNsYS10c2xhLXNlZXMtbW9yZS1zaWduaWZpY2FudC0yMTQ1MjEzNDMuaHRtbNIBAA?oc=5
     <guid ispermalink="false">
      CBMiTmh0dHBzOi8v





Having downloaded the document and parsed it, we can extract the headlines:
- Note that you could also extract the links and try to download the content of the articles. (But make sure to use a timeout if you do!)
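To match headlines with the price movement "on that day", each headline also needs a publication date. The sketch below shows one way to pull the title, link and date out of each feed item. It uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup, because (as the output above shows) the HTML-style `'lxml'` parser leaves the `<link/>` tag empty. It assumes each item carries a standard RSS `pubDate` element. The sample feed and the function name `parse_items` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hand-written sample feed (NOT real data), shaped like an RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>"TSLA" - Google News</title>
    <item>
      <title>Placeholder headline one - Example Source</title>
      <link>https://example.com/article-1</link>
      <pubDate>Fri, 19 Apr 2024 09:52:19 GMT</pubDate>
    </item>
    <item>
      <title>Placeholder headline two - Example Source</title>
      <link>https://example.com/article-2</link>
      <pubDate>Thu, 18 Apr 2024 21:45:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter('item'):
        raw_date = item.findtext('pubDate')
        if raw_date is None:
            continue  # skip items without a date: they cannot be matched to a day
        published = parsedate_to_datetime(raw_date)  # RFC 822 date -> datetime
        items.append({
            'title': item.findtext('title'),
            'link': item.findtext('link'),
            'date': published.date(),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry['date'], entry['title'])
```

Note that the dates in the feed are in GMT, while stock exchanges work in their local time zone, so a headline published late in the evening may belong to the next trading day.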





In [4]:
titles = parsed_doc.find_all('title')
for title in titles:
  print(title.text)

"TSLA" - Google News
Tesla (TSLA) Sees a More Significant Dip Than Broader Market: Some Facts to Know - Yahoo Finance
Tesla (TSLA) gets “Buy” rating and $298 price target from RBC Capital - TESLARATI
Tesla (TSLA) Can't Catch a Break - Nasdaq
Stocks making the biggest moves midday: DUOL, JBLU, TSLA, DHI - CNBC
Tesla (TSLA) Asks Shareholders to Re-Ratify Elon Musk’s $56 Billion Payout - Bloomberg
Tesla Bull Says 'Don't Know…What's Going To Happen To TSLA Stock Next Week' Ahead Of Earnings Call: 'Much ... - Markets Insider
Tesla Stock [TSLA] Is Basically Worth $0 If True Full Self Driving Isn’t Achieved - CleanTechnica
Dear TSLA Stock Fans, Mark Your Calendars for April 23 - InvestorPlace
Tesla's (NASDAQ:TSLA) Job Cuts Spark Investor Concerns: What's Behind the Decline? - TipRanks.com - TipRanks
Tesla (TSLA) to Cut Jobs by 10% to Manage Costs Amid Low EV Sales - Yahoo Finance
TSLA Stock Quote Price and Forecast - CNN
Tesla (TSLA) Model 2 Is Crucial to Investment Thesis, David Baron Says -

We can package the above code in a function so that it can easily be called for different ticker symbols:

In [5]:
import time
import urllib.request
import bs4 as bs

def get_titles(ticker):
  url = 'https://news.google.com/rss/search?hl=en-US&q='+ticker+'&gl=US&ceid=US:en'
  time.sleep(15) ## wait 15 seconds between each request. This is SUPER IMPORTANT!
  doc = urllib.request.urlopen(url).read()
  parsed_doc = bs.BeautifulSoup(doc,'lxml')
  titles = parsed_doc.find_all('title')[1:] ## skip the first title, which is the name of the feed itself
  return [title.text for title in titles]

- Let's call the code for a few different ticker symbols below:

In [6]:
tickers = ['aapl','msft','amzn','goog','tsla','nvda','pypl','nflx','csco','avgo','orcl','qcom']
for ticker in tickers:
  print('ticker: ', ticker)
  print(get_titles(ticker))

ticker:  aapl


['Apple (NASDAQ:AAPL) Dips as Analyst Cuts Estimates - TipRanks.com - TipRanks', '1 Analyst Thinks Apple Stock Will Slide to $162. Is It a Sell? - The Motley Fool', "Apple (AAPL) Stock Sinks As Market Gains: Here's Why - Yahoo Finance", 'Is Apple About To Disappoint Investors With Q2 Print? Analyst Flags 2 Factors Weakening iPhone Maker - Ap - Benzinga', 'Apple eyes Asia expansions after losing China market share - Yahoo Finance', 'Tim Cook offloads nearly 200,000 shares of AAPL stock worth $32 million - 9to5Mac', 'Apple (AAPL) Readies M4 Chip Mac Line, Including New MacBook Air and Mac Pro - Bloomberg', 'DOJ absurdly compares AAPL share buybacks with R&D spend - 9to5Mac', 'Apple (AAPL) iPhone Shipments Fall 10% as Android Smartphones Rise - Bloomberg', "Apple Inc (AAPL) DCF Valuation: Is The Stock Undervalued? - The Acquirer's Multiple", 'Apple Stock Is "Dead Money," According to 1 Wall Street Analyst - The Motley Fool', 'Apple Inc. (NASDAQ:AAPL) Holdings Raised by Bellecapital Intern
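Since every request has to be spaced out and should not hang forever, it can help to put the waiting and the timeout in one place. The following is only a sketch: the helper names `build_feed_url` and `polite_fetch` and the default values are assumptions for illustration, not part of any library. It waits only as long as needed since the previous request, and passes a `timeout` to `urllib.request.urlopen`.

```python
import time
import urllib.parse
import urllib.request

_last_request = None  # time.monotonic() value recorded after the previous request


def build_feed_url(query):
    """Build the Google News RSS search URL, URL-encoding the query safely."""
    params = {'hl': 'en-US', 'q': query, 'gl': 'US', 'ceid': 'US:en'}
    return 'https://news.google.com/rss/search?' + urllib.parse.urlencode(params)


def polite_fetch(url, min_interval=15.0, timeout=10.0):
    """Download url, waiting so that requests are at least min_interval seconds apart."""
    global _last_request
    if _last_request is not None:
        wait = min_interval - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
    try:
        # timeout makes the call raise an error instead of blocking indefinitely
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    finally:
        _last_request = time.monotonic()


print(build_feed_url('TSLA'))
```

With this in place, `polite_fetch(build_feed_url(ticker))` could stand in for the `time.sleep(15)` and `urlopen` lines inside `get_titles`.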

## 2. Gathering a list of Ticker symbols

You will need to source a list of Ticker symbols from somewhere.
- There are many lists online of companies with high market capitalisations, e.g.: http://www.iweblists.com/us/commerce/MarketCapitalization.html
- Note: you don't need to gather this list programmatically (via an API), you can just copy it into Python.

## 3. Sourcing stock price information

You could source information on share price movements from many sources. For example, you could use the yfinance library.
- NOTE: This service scrapes data from the Yahoo Finance website. So be CAREFUL not to request the page too frequently (wait at least 15 seconds between requests), otherwise you will likely be banned by their web servers.
- Here is a recent blog post showing how to use it: https://towardsdatascience.com/how-to-get-stock-data-using-python-c0de1df17e75

In [7]:
#!pip3 install yfinance



Having installed the library, you can load it:

In [8]:
import yfinance as yf
import time

ticker = 'MSFT'
tickerData = yf.Ticker(ticker)

Once we have the information on that ticker symbol, we can print out information about its stock price over the last few days:

In [9]:
time.sleep(15) ## wait 15 seconds between each request. This is SUPER IMPORTANT otherwise your IP-address will be banned for sending too frequent requests.
df = tickerData.history()
df

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2024-03-19 00:00:00-04:00,417.829987,421.670013,415.549988,421.410004,19837900,0.0,0.0
2024-03-20 00:00:00-04:00,422.0,425.959991,420.660004,425.230011,17860100,0.0,0.0
2024-03-21 00:00:00-04:00,429.829987,430.820007,427.160004,429.369995,21296200,0.0,0.0
2024-03-22 00:00:00-04:00,429.700012,429.859985,426.070007,428.73999,17636500,0.0,0.0
2024-03-25 00:00:00-04:00,425.23999,427.410004,421.609985,422.859985,18060500,0.0,0.0
2024-03-26 00:00:00-04:00,425.609985,425.98999,421.350006,421.649994,16725600,0.0,0.0
2024-03-27 00:00:00-04:00,424.440002,424.450012,419.01001,421.429993,16705000,0.0,0.0
2024-03-28 00:00:00-04:00,420.959991,421.869995,419.119995,420.720001,21871200,0.0,0.0
2024-04-01 00:00:00-04:00,423.950012,427.890015,422.220001,424.570007,16316000,0.0,0.0
2024-04-02 00:00:00-04:00,420.109985,422.380005,417.839996,421.440002,17912000,0.0,0.0


We could use this data to find how the stock price has changed recently. For example, we could find the difference in the closing prices between consecutive days:

In [10]:
df['Change'] = df['Close'].diff()
df

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Change
2024-03-19 00:00:00-04:00,417.829987,421.670013,415.549988,421.410004,19837900,0.0,0.0,
2024-03-20 00:00:00-04:00,422.0,425.959991,420.660004,425.230011,17860100,0.0,0.0,3.820007
2024-03-21 00:00:00-04:00,429.829987,430.820007,427.160004,429.369995,21296200,0.0,0.0,4.139984
2024-03-22 00:00:00-04:00,429.700012,429.859985,426.070007,428.73999,17636500,0.0,0.0,-0.630005
2024-03-25 00:00:00-04:00,425.23999,427.410004,421.609985,422.859985,18060500,0.0,0.0,-5.880005
2024-03-26 00:00:00-04:00,425.609985,425.98999,421.350006,421.649994,16725600,0.0,0.0,-1.209991
2024-03-27 00:00:00-04:00,424.440002,424.450012,419.01001,421.429993,16705000,0.0,0.0,-0.220001
2024-03-28 00:00:00-04:00,420.959991,421.869995,419.119995,420.720001,21871200,0.0,0.0,-0.709991
2024-04-01 00:00:00-04:00,423.950012,427.890015,422.220001,424.570007,16316000,0.0,0.0,3.850006
2024-04-02 00:00:00-04:00,420.109985,422.380005,417.839996,421.440002,17912000,0.0,0.0,-3.130005


To make use of this stock price change as a label for building a text classifier, we might simply check whether the change in the closing price on that day was positive or negative,
i.e. we use the news data to predict whether the change in the stock price was positive or negative.

In [11]:
df['Positive Change'] = (df['Change'] > 0)
df

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Change,Positive Change
2024-03-19 00:00:00-04:00,417.829987,421.670013,415.549988,421.410004,19837900,0.0,0.0,,False
2024-03-20 00:00:00-04:00,422.0,425.959991,420.660004,425.230011,17860100,0.0,0.0,3.820007,True
2024-03-21 00:00:00-04:00,429.829987,430.820007,427.160004,429.369995,21296200,0.0,0.0,4.139984,True
2024-03-22 00:00:00-04:00,429.700012,429.859985,426.070007,428.73999,17636500,0.0,0.0,-0.630005,False
2024-03-25 00:00:00-04:00,425.23999,427.410004,421.609985,422.859985,18060500,0.0,0.0,-5.880005,False
2024-03-26 00:00:00-04:00,425.609985,425.98999,421.350006,421.649994,16725600,0.0,0.0,-1.209991,False
2024-03-27 00:00:00-04:00,424.440002,424.450012,419.01001,421.429993,16705000,0.0,0.0,-0.220001,False
2024-03-28 00:00:00-04:00,420.959991,421.869995,419.119995,420.720001,21871200,0.0,0.0,-0.709991,False
2024-04-01 00:00:00-04:00,423.950012,427.890015,422.220001,424.570007,16316000,0.0,0.0,3.850006,True
2024-04-02 00:00:00-04:00,420.109985,422.380005,417.839996,421.440002,17912000,0.0,0.0,-3.130005,False
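To turn such labels into training examples, each headline has to be paired with the label of the trading day on which it was published. A minimal sketch in plain Python follows. The labels are copied from the table above, while the headlines and the function name `build_examples` are placeholders invented for illustration. Headlines published on days without trading (weekends, holidays) are simply dropped here, which is only one of several possible choices.

```python
from datetime import date

# 'Positive Change' labels for three trading days, taken from the table above.
labels_by_day = {
    date(2024, 3, 20): True,
    date(2024, 3, 21): True,
    date(2024, 3, 22): False,
}

# Placeholder headlines (NOT real data), each with its publication date.
headlines = [
    (date(2024, 3, 20), 'Placeholder headline A'),
    (date(2024, 3, 22), 'Placeholder headline B'),
    (date(2024, 3, 23), 'Placeholder headline C'),  # a Saturday: no trading
]


def build_examples(headlines, labels_by_day):
    """Pair each headline with its day's label, skipping days without a label."""
    return [
        (text, labels_by_day[day])
        for day, text in headlines
        if day in labels_by_day
    ]


examples = build_examples(headlines, labels_by_day)
for text, label in examples:
    print(label, text)
```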


Better might be to check first **whether there was a substantial change** in the price, e.g. where the change was greater than some threshold -- say 1%:

In [12]:
threshold = 0.01
df['Substantial Change'] = (abs(df['Change'])/df['Close'] > threshold)
df

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Change,Positive Change,Substantial Change
2024-03-19 00:00:00-04:00,417.829987,421.670013,415.549988,421.410004,19837900,0.0,0.0,,False,False
2024-03-20 00:00:00-04:00,422.0,425.959991,420.660004,425.230011,17860100,0.0,0.0,3.820007,True,False
2024-03-21 00:00:00-04:00,429.829987,430.820007,427.160004,429.369995,21296200,0.0,0.0,4.139984,True,False
2024-03-22 00:00:00-04:00,429.700012,429.859985,426.070007,428.73999,17636500,0.0,0.0,-0.630005,False,False
2024-03-25 00:00:00-04:00,425.23999,427.410004,421.609985,422.859985,18060500,0.0,0.0,-5.880005,False,True
2024-03-26 00:00:00-04:00,425.609985,425.98999,421.350006,421.649994,16725600,0.0,0.0,-1.209991,False,False
2024-03-27 00:00:00-04:00,424.440002,424.450012,419.01001,421.429993,16705000,0.0,0.0,-0.220001,False,False
2024-03-28 00:00:00-04:00,420.959991,421.869995,419.119995,420.720001,21871200,0.0,0.0,-0.709991,False,False
2024-04-01 00:00:00-04:00,423.950012,427.890015,422.220001,424.570007,16316000,0.0,0.0,3.850006,True,False
2024-04-02 00:00:00-04:00,420.109985,422.380005,417.839996,421.440002,17912000,0.0,0.0,-3.130005,False,False


And then train a classifier to predict the change only if it is substantial -- i.e. to predict **three classes (*positive, no_change, negative*)**.

In [13]:
import numpy as np

threshold = 0.01
df['label'] = np.select(
    [
        (abs(df['Change'])/df['Close'] > threshold) & (df['Change'] > 0),
        (abs(df['Change'])/df['Close'] > threshold) & (df['Change'] < 0)
    ],
    [
        'positive',
        'negative'
    ],
    default='no_change'
)
df

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Change,Positive Change,Substantial Change,label
2024-03-19 00:00:00-04:00,417.829987,421.670013,415.549988,421.410004,19837900,0.0,0.0,,False,False,no_change
2024-03-20 00:00:00-04:00,422.0,425.959991,420.660004,425.230011,17860100,0.0,0.0,3.820007,True,False,no_change
2024-03-21 00:00:00-04:00,429.829987,430.820007,427.160004,429.369995,21296200,0.0,0.0,4.139984,True,False,no_change
2024-03-22 00:00:00-04:00,429.700012,429.859985,426.070007,428.73999,17636500,0.0,0.0,-0.630005,False,False,no_change
2024-03-25 00:00:00-04:00,425.23999,427.410004,421.609985,422.859985,18060500,0.0,0.0,-5.880005,False,True,negative
2024-03-26 00:00:00-04:00,425.609985,425.98999,421.350006,421.649994,16725600,0.0,0.0,-1.209991,False,False,no_change
2024-03-27 00:00:00-04:00,424.440002,424.450012,419.01001,421.429993,16705000,0.0,0.0,-0.220001,False,False,no_change
2024-03-28 00:00:00-04:00,420.959991,421.869995,419.119995,420.720001,21871200,0.0,0.0,-0.709991,False,False,no_change
2024-04-01 00:00:00-04:00,423.950012,427.890015,422.220001,424.570007,16316000,0.0,0.0,3.850006,True,False,no_change
2024-04-02 00:00:00-04:00,420.109985,422.380005,417.839996,421.440002,17912000,0.0,0.0,-3.130005,False,False,no_change


Alternatively, one could train a binary classifier using only the examples where there was a substantial change in the stock price, i.e. remove from the training set the days on which the change was below the threshold.
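The binary variant could be sketched as follows. Note one deliberate difference from the cells above: here the relative change is computed with pandas' `pct_change()`, which divides by the previous day's close rather than by the same day's close. This is a small variation, not the method used above. The closing prices are the first five MSFT rows of the table above, and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# First five closing prices from the MSFT table above.
closes = pd.Series(
    [421.410004, 425.230011, 429.369995, 428.739990, 422.859985],
    index=pd.to_datetime(
        ['2024-03-19', '2024-03-20', '2024-03-21', '2024-03-22', '2024-03-25']
    ),
    name='Close',
)

threshold = 0.01
rel_change = closes.pct_change()  # (today - yesterday) / yesterday; NaN on the first day

# Three-class labels; comparisons with NaN are False, so the first day is 'no_change'.
labels = pd.Series(
    np.select(
        [rel_change > threshold, rel_change < -threshold],
        ['positive', 'negative'],
        default='no_change',
    ),
    index=closes.index,
)

# Binary training set: keep only the days with a substantial change.
binary_labels = labels[labels != 'no_change']
print(binary_labels)
```

With a 1% threshold only one of these five days survives the filter, which illustrates how much data the binary variant discards and why many tickers and days are likely to be needed.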

More advanced things to investigate:
- Look at weekly changes in the price, rather than daily (note: one person commented that this may not make much sense, because news tends to influence prices for only a couple of hours)
- Look for other sources of text/news information (or the articles themselves)
- You could have a look at the recommendations to see if they can be predicted somehow.

In [14]:
time.sleep(15) ## wait 15 seconds between each request. This is SUPER IMPORTANT!
tickerData.recommendations

YFNotImplementedError: Have not implemented fetching 'recommendations' from Yahoo API
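For the weekly-change idea in the list above, one possible approach is to resample the daily closing prices to one value per week and compare consecutive weeks. The sketch below uses the ten MSFT closing prices shown earlier. Using Friday as the end of the week (`'W-FRI'`) is an assumption, and other conventions are equally possible.

```python
import pandas as pd

# The ten daily closing prices from the MSFT table earlier in this notebook.
closes = pd.Series(
    [421.410004, 425.230011, 429.369995, 428.739990, 422.859985,
     421.649994, 421.429993, 420.720001, 424.570007, 421.440002],
    index=pd.to_datetime([
        '2024-03-19', '2024-03-20', '2024-03-21', '2024-03-22', '2024-03-25',
        '2024-03-26', '2024-03-27', '2024-03-28', '2024-04-01', '2024-04-02',
    ]),
    name='Close',
)

# Last available close in each week ending on a Friday.
weekly_close = closes.resample('W-FRI').last()

# Relative change from one week's last close to the next.
weekly_change = weekly_close.pct_change()
print(weekly_change)
```

Each weekly label would then be matched with all the headlines published during that week, rather than on a single day.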

Have fun with the assignment!!
