# Web Scraping investing.com

#### By Troy Mazerolle

The following notebook outlines how to scrape articles from the main news page of investing.com. Using the URL https://www.investing.com/news/stock-market-news/X, where X is the page number, we can extract a list of article links from each page and then extract the text of the article behind each link.

For more information about the functions imported from ScrappingFunctions, see the source file on GitHub [here](https://github.com/TJMazerolle/InvestingComWebscrapping/blob/main/ScrappingFunctions.py).

Note that the results seen below are from running the script as of December 21, 2023.

In [1]:
from ScrappingFunctions import *
import numpy as np
import pandas as pd

## Web Scraping Stock News Articles

Note that the following cell uses Selenium, so geckodriver.exe must be in the same folder as the script you run. The functions currently support only Firefox as the browser, so Firefox must also be installed.

First, we use get_training_links(n) to get the list of training links from the first n pages of https://www.investing.com/news/stock-market-news/X, where X ranges from 1 to n. In the case below, we set n = 2 to get the links from the first two pages, and output the first five links in the list.
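To make the URL pattern concrete, the pages visited for a given n can be listed as below. This is only a sketch of the pattern described above: `build_page_urls` is a hypothetical helper written for illustration, not a function from ScrappingFunctions, and it does not fetch anything.

```python
# Base URL taken from the description above; the page number is appended to it.
BASE_URL = "https://www.investing.com/news/stock-market-news/"


def build_page_urls(n):
    """Return the URLs of pages 1 through n (inclusive). Hypothetical helper."""
    return [f"{BASE_URL}{page}" for page in range(1, n + 1)]


for url in build_page_urls(2):
    print(url)
```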

In [2]:
training_links = get_training_links(2)
print("Number of Articles:", len(training_links))
display(training_links[:5])

Number of Articles: 54


['https://www.investing.com/news/stock-market-news/australia-stocks-lower-at-close-of-trade-spasx-200-down-003-3262077',
 'https://www.investing.com/news/stock-market-news/tencent-netease-plummet-as-china-flags-more-online-gaming-curbs-3262075',
 'https://www.investing.com/news/stock-market-news/tesla-launches-shanghai-megapack-battery-project--chinese-state-media-3262074',
 'https://www.investing.com/jp.php?v2=MXFjPWYxYTo-aD0yMmM0NDdlZDkyNDQ1MyQ0ZjsxN35lIz43YDhiJGVtPSNkODRuPk0_YDc_NSNnMTdlNXRmJTF2Yz1mN2E7Pm09NTJ3NHU3a2Q9MjA0IDNyNDo=',
 'https://www.investing.com/news/stock-market-news/boeings-first-dreamliner-delivery-to-china-since-2019-to-land-friday-3262073']
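The fourth link in the output above points to jp.php rather than to an article. One possible way to drop such links before extracting text is to keep only URLs whose path starts with an article section. The sketch below is an assumption for illustration: `is_article_link` and the prefix list are hypothetical and are not part of ScrappingFunctions, and the prefixes are based only on the links that appear in this notebook.

```python
from urllib.parse import urlparse

# Path prefixes treated as articles (an assumption based on links seen in this notebook).
ARTICLE_PREFIXES = ("/news/", "/analysis/")


def is_article_link(url):
    """Return True if the URL looks like an investing.com article. Hypothetical helper."""
    parsed = urlparse(url)
    return (
        parsed.netloc == "www.investing.com"
        and parsed.path.startswith(ARTICLE_PREFIXES)
    )


links = [
    "https://www.investing.com/news/stock-market-news/boeings-first-dreamliner-delivery-to-china-since-2019-to-land-friday-3262073",
    # Placeholder query string; stands in for the long jp.php redirect link above.
    "https://www.investing.com/jp.php?v2=example",
]

article_links = [url for url in links if is_article_link(url)]
print(len(article_links))  # 1
```

Filtering before calling get_article_texts would avoid empty rows, at the cost of silently skipping any article that lives under a path not listed in the prefixes.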

Next, we use get_article_texts(training_links, tracker=False) to take the list of links and extract the article text and the corresponding dates of each article.  We will also output the first five article texts to showcase the results.

In [3]:
training_text = get_article_texts(training_links, tracker=False)
display(training_text[:5])

[['Investing.com – Australia stocks were lower after the close on Friday, as losses in the Consumer Staples, Industrials and Utilities sectors led shares lower.\nAt the close in Sydney, the S&P/ASX 200 declined 0.03%.\nThe best performers of the session on the S&P/ASX 200 were  Austal Ltd  (ASX:ASB), which rose 6.92% or 0.12 points to trade at 1.85 at the close. Meanwhile,  Pointsbet Holdings Ltd  (ASX:PBH) added 5.62% or 0.05 points to end at 0.94 and Lynas Rare Earths Ltd (ASX:LYC) was up 3.53% or 0.24 points to 7.04 in late trade.\nThe worst performers of the session were Omni Bridgeway Ltd (ASX:OBL), which fell 5.13% or 0.07 points to trade at 1.29 at the close. EML Payments Ltd (ASX:EML) declined 3.66% or 0.03 points to end at 0.79 and Abacus Property Group (ASX:ABG) was down 2.90% or 0.04 points to 1.17.\nRising stocks outnumbered declining ones on the Sydney Stock Exchange by 593 to 558 and 367 ended unchanged.\nShares in Omni Bridgeway Ltd (ASX:OBL) fell to 5-year lows; falling

Lastly, the data can be formatted and stored in a dataframe.

In [4]:
training_text = np.array(training_text)
text_data_dict = {
    "Link": training_links,
    "Date": training_text[:,1],
    "Text": training_text[:,0]
}
text_data = pd.DataFrame(text_data_dict)
display(text_data.head())

| | Link | Date | Text |
|---|---|---|---|
| 0 | https://www.investing.com/news/stock-market-ne... | Dec 22, 2023 | Investing.com – Australia stocks were lower af... |
| 1 | https://www.investing.com/news/stock-market-ne... | Dec 21, 2023 | Investing.com-- Shares of Chinese gaming giant... |
| 2 | https://www.investing.com/news/stock-market-ne... | Dec 21, 2023 | BEIJING/SHANGHAI (Reuters) -Tesla on Friday la... |
| 3 | https://www.investing.com/jp.php?v2=MXFjPWYxYT... | | |
| 4 | https://www.investing.com/news/stock-market-ne... | Dec 21, 2023 | (Reuters) - Boeing (NYSE:BA)'s first direct de... |


Once we have the dataframe, we can save the data and use it for our own analysis, whatever that may be.
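Before analysis, it may help to convert the Date strings into real dates and to handle rows where no date was extracted (such as row 3 above). The following is a minimal sketch using only the standard library. It assumes dates keep the "Dec 22, 2023" format shown in the output, and `parse_article_date` is a hypothetical helper, not part of ScrappingFunctions.

```python
from datetime import datetime


def parse_article_date(raw):
    """Parse a string such as 'Dec 22, 2023'; return None if missing or malformed.

    Note: %b matches English month abbreviations under the default (C) locale.
    """
    if not isinstance(raw, str) or not raw.strip():
        return None  # covers empty strings, None, and NaN values
    try:
        return datetime.strptime(raw.strip(), "%b %d, %Y").date()
    except ValueError:
        return None


print(parse_article_date("Dec 22, 2023"))  # 2023-12-22
print(parse_article_date(""))              # None
```

With the dataframe above, the helper could be applied column-wise, for example with `text_data["Date"].map(parse_article_date)`.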

## Web Scraping News Articles for a Specific Stock

To get news articles that pertain to a specific stock, we can use get_stock_links(search_term, limit = 100), which opens the page https://www.investing.com/search/?q={search_term}&tab=news and lists news articles matching the search term. Ideally, search_term will be a stock symbol, such as MSFT in the example below. Since the links are loaded dynamically into the webpage, the limit = 100 argument tells the function to keep scrolling until there are at least 100 (or whatever limit is set to) links to extract from the page. Once we have the links, we can follow the same process as above to extract the text from each link and store all the data in a dataframe.
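The "scroll until at least limit links" idea can be sketched without a browser. The code below is not the implementation in ScrappingFunctions: `collect_links`, `make_fake_page`, and the batch size are all hypothetical, and the fake page only stands in for Selenium scrolling. It illustrates why the final count can exceed the limit, since links arrive in batches.

```python
def collect_links(load_more, limit=100, max_rounds=50):
    """Call load_more() until at least `limit` unique links are collected.

    load_more is any callable returning the links currently visible on the page.
    In a real scraper it would scroll the browser and re-read the page.
    """
    links = []
    seen = set()
    for _ in range(max_rounds):
        count_before = len(links)
        for link in load_more():
            if link not in seen:
                seen.add(link)
                links.append(link)
        if len(links) >= limit:
            break  # enough links collected
        if len(links) == count_before:
            break  # scrolling revealed nothing new, so stop early
    return links


def make_fake_page(batch_size, total):
    """Stand-in for a browser: each call reveals `batch_size` more links."""
    state = {"visible": 0}

    def load_more():
        state["visible"] = min(state["visible"] + batch_size, total)
        return [f"https://example.com/article-{i}" for i in range(state["visible"])]

    return load_more


# The batch size of 53 is arbitrary and chosen only for illustration.
links = collect_links(make_fake_page(batch_size=53, total=500), limit=100)
print(len(links))  # 106
```

The early exit when no new links appear keeps the loop from running forever on search terms with fewer results than the limit.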

In [5]:
# Extract Links
msft_links = get_stock_links("msft", limit = 100)
print("Number of Articles:", len(msft_links))
display(msft_links[:5])

# Extract Text
msft_text = get_article_texts(msft_links, tracker=False)
display(msft_text[:5])

# Store Data in Dataframe
msft_text = np.array(msft_text)
msft_data_dict = {
    "Link": msft_links,
    "Date": msft_text[:,1],
    "Text": msft_text[:,0]
}
msft_data = pd.DataFrame(msft_data_dict)
display(msft_data.head())

Number of Articles: 106


['https://www.investing.com/news/stock-market-news/pulitzerwinning-authors-join-openai-microsoft-copyright-lawsuit-3261058',
 'https://www.investing.com/news/stock-market-news/microsoft-shelves-windows-mixed-reality-feature-3261890',
 'https://www.investing.com/news/stock-market-news/microsoft-ending-support-for-windows-10-could-send-240-million-pcs-to-landfills--report-3261916',
 'https://www.investing.com/analysis/msft-breaking-resistance-137246',
 'https://www.investing.com/analysis/msft-continues-buyback-frenzy-200466332']

[['By Blake Brittain (Reuters) - A group of 11 nonfiction authors have joined a lawsuit in Manhattan federal court that accuses OpenAI and Microsoft (NASDAQ:MSFT) of misusing books the authors have written to train the models behind OpenAI\'s popular chatbot ChatGPT and other artificial-intelligence based software. The writers, including Pulitzer Prize winners Taylor Branch, Stacy Schiff and Kai Bird - who co-wrote the J. Robert Oppenheimer biography "American Prometheus" that was adapted into the hit film "Oppenheimer" this year - told the court on Tuesday that the companies infringed their copyrights by using their work to train OpenAI\'s GPT large language models. Representatives for OpenAI and Microsoft did not immediately respond to requests for comment on Wednesday. "The defendants are raking in billions from their unauthorized use of nonfiction books, and the authors of these books deserve fair compensation and treatment for it," the writers\' attorney Rohit Nath said on Wednesd

| | Link | Date | Text |
|---|---|---|---|
| 0 | https://www.investing.com/news/stock-market-ne... | Dec 20, 2023 | By Blake Brittain (Reuters) - A group of 11 no... |
| 1 | https://www.investing.com/news/stock-market-ne... | Dec 21, 2023 | (Reuters) - Microsoft (NASDAQ:MSFT) on Thursda... |
| 2 | https://www.investing.com/news/stock-market-ne... | Dec 21, 2023 | By Akash Sriram (Reuters) - Microsoft Corp (... |
| 3 | https://www.investing.com/analysis/msft-breaki... | | MSFTIn a previous piece, I posted an analysis ... |
| 4 | https://www.investing.com/analysis/msft-contin... | | Microsoft (NASDAQ:MSFT) is up almost 2% today ... |
