# Web Scraping investing.com

#### By Troy Mazerolle

The following notebook outlines how to scrape articles from the stock market news page of investing.com. Each listing page lives at https://www.investing.com/news/stock-market-news/X, where X is the page number. We extract the list of article links from each page and then extract the text of the article behind each link.
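As a rough illustration of the URL pattern described above, the listing-page URLs for the first few pages could be generated as follows. The names `BASE_URL` and `build_page_urls` are hypothetical and are not taken from ScrappingFunctions; this is only a sketch of the pattern, not the way the library builds its requests.

```python
# Base of the listing pages, as given in the text above.
BASE_URL = "https://www.investing.com/news/stock-market-news"


def build_page_urls(n_pages):
    """Return the listing-page URLs for pages 1 through n_pages (hypothetical helper)."""
    return [f"{BASE_URL}/{page}" for page in range(1, n_pages + 1)]


# Pages 1 and 2, matching the n = 2 used later in this notebook.
page_urls = build_page_urls(2)
print(page_urls)
```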

For more information about the functions imported from ScrappingFunctions, see the source on GitHub [here](https://github.com/TJMazerolle/InvestingComWebscrapping/blob/main/ScrappingFunctions.py).

Note that the results shown below come from running the notebook on December 19, 2023.

In [5]:
from ScrappingFunctions import get_training_links, get_article_texts
import numpy as np
import pandas as pd

Note that the following cells use Selenium. To run them, geckodriver.exe must be in the same folder as the notebook or script you are running. The functions only support Firefox as the browser, so Firefox must also be installed.
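Before running the Selenium cells, it may help to confirm that the driver file is where it is expected. The check below is a minimal sketch using only the standard library; `find_geckodriver` is a hypothetical helper, not part of ScrappingFunctions, and it assumes the file is named exactly geckodriver.exe.

```python
from pathlib import Path


def find_geckodriver(folder=".", filename="geckodriver.exe"):
    """Return the path to the driver inside `folder`, or None if it is missing.

    Hypothetical helper: it only checks that the file exists, not that it runs.
    """
    candidate = Path(folder) / filename
    return str(candidate) if candidate.is_file() else None


# Check the current working directory, where the notebook is assumed to run.
driver_path = find_geckodriver()
if driver_path is None:
    print("geckodriver.exe not found; place it next to the notebook first.")
else:
    print("Found driver at", driver_path)
```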

First, we use get_training_links(n) to collect the article links from the first n pages, that is, from https://www.investing.com/news/stock-market-news/X for every X from 1 to n.  In the case below, we set n = 2 to get the links from the first two pages, and output the first five links in the list.

In [6]:
training_links = get_training_links(2)
print("Number of Articles:", len(training_links))
display(training_links[:5])

Number of Articles: 54


['https://www.investing.com/news/stock-market-news/in-buying-toshiba-a-littleknown-fund-takes-on-japan-incs-toughest-job-3259504',
 'https://www.investing.com/news/stock-market-news/factboxhow-are-the-red-sea-attacks-impacting-shipping-in-the-suez-canal-3259239',
 'https://www.investing.com/news/stock-market-news/not-just-for-christmas-britains-ms-targets-more-regular-food-shoppers-3259483',
 'https://www.investing.com/jp.php?v2=MHA-YGA3MGtmMDo0bjRjZj9vYTI-PDc8MyRlN2Zsbic0cj43YzsxdzA4aXc0aGU_MkFjPGNrZXM9azNhNXQwczB3PmBgMTBqZjk6Pm4rYyI_Y2E4Pjw3IzNyZWs=',
 'https://www.investing.com/news/stock-market-news/activist-investor-cevian-takes-13-stake-in-ubs-3259476']
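The fourth link above points to jp.php rather than to an article path, and its Date and Text come back empty further down in this notebook. If such links are not wanted, one option is to filter the list before extracting text. The sketch below assumes that every article of interest sits under the stock-market-news path, which may not hold for all articles on the site; `ARTICLE_PREFIX` and `keep_article_links` are hypothetical names, and the jp.php URL in the sample is a shortened placeholder.

```python
# Assumed prefix shared by the article links shown in the output above.
ARTICLE_PREFIX = "https://www.investing.com/news/stock-market-news/"


def keep_article_links(links):
    """Keep only links that start with ARTICLE_PREFIX (hypothetical helper)."""
    return [link for link in links if link.startswith(ARTICLE_PREFIX)]


# Small stand-in for training_links; the second entry is a placeholder redirect link.
sample_links = [
    "https://www.investing.com/news/stock-market-news/activist-investor-cevian-takes-13-stake-in-ubs-3259476",
    "https://www.investing.com/jp.php?v2=placeholder",
]

article_links = keep_article_links(sample_links)
print("Kept", len(article_links), "of", len(sample_links), "links")
```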

Next, we use get_article_texts(training_links, tracker=False) to take the list of links and extract the text and date of each article.  Each entry of the result holds the article text followed by its date.  We will also output the first five entries to showcase the results.

In [7]:
training_text = get_article_texts(training_links, tracker=False)
display(training_text[:5])

[['By Anton Bridge and Makiko Yamazaki TOKYO (Reuters) - A little-known private equity firm is set to take on the toughest job in corporate Japan: turning around Toshiba (OTC:TOSYY). Japan Industrial Partners (JIP) is spearheading a $14 billion takeover that will see the troubled conglomerate delisted on Wednesday after 74 years on the Tokyo exchange. While not a global player, JIP has quietly built up a track record by carving out businesses from big manufacturers, such as  Sony  (NYSE:SONY)\'s laptop arm and Olympus\' camera unit. Led by a former banker with a Wharton MBA, it has a reputation for being hands-on with its acquisitions, and for thrift - its executives fly economy. In Toshiba, JIP takes on a sprawling company far bigger and more complex than any it acquired before. The stakes are also higher: Toshiba employs some 106,000 people in businesses including batteries, chips, nuclear power and defence, making it critical to national security. Whether JIP can pull off a turnarou

Lastly, the data can be formatted and stored in a dataframe.

In [8]:
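# Each entry of training_text is [text, date]; a 2-D array allows column slicing
training_text = np.array(training_text)
text_data_dict = {
    "Link": training_links,
    "Date": training_text[:, 1],  # second column: article dates
    "Text": training_text[:, 0],  # first column: article text
}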
text_data = pd.DataFrame(text_data_dict)
display(text_data.head())

| | Link | Date | Text |
|---|---|---|---|
| 0 | https://www.investing.com/news/stock-market-ne... | Dec 19, 2023 | By Anton Bridge and Makiko Yamazaki TOKYO (Reu... |
| 1 | https://www.investing.com/news/stock-market-ne... | Dec 18, 2023 | By Ahmad Ghaddar LONDON (Reuters) -Attacks lau... |
| 2 | https://www.investing.com/news/stock-market-ne... | Dec 19, 2023 | By James Davey LONDON (Reuters) - In Britain, ... |
| 3 | https://www.investing.com/jp.php?v2=MHA-YGA3MG... | | |
| 4 | https://www.investing.com/news/stock-market-ne... | Dec 19, 2023 | ZURICH (Reuters) -Cevian Capital has taken a 1... |


Once we have the dataframe, we can save it and use it for whatever analysis we have in mind.
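One way to save the data is sketched below. It builds a small stand-in dataframe with placeholder values (not scraped results), drops rows with no text, and writes the rest to CSV. The helper `save_articles`, the file name, and the choice to drop empty rows are all assumptions made for illustration, and the sketch assumes missing text is stored as None or NaN.

```python
import os
import tempfile

import pandas as pd

# Stand-in for text_data with the same columns; the values are placeholders.
text_data = pd.DataFrame({
    "Link": ["https://example.com/a", "https://example.com/b"],
    "Date": ["Dec 19, 2023", None],
    "Text": ["First article text", None],
})


def save_articles(df, path):
    """Drop rows without text and write the remainder to CSV (hypothetical helper)."""
    cleaned = df.dropna(subset=["Text"])
    # index=False keeps the row index out of the file.
    cleaned.to_csv(path, index=False)
    return cleaned


# Write to the system temp folder here; any writable path would do.
output_path = os.path.join(tempfile.gettempdir(), "investing_articles.csv")
saved = save_articles(text_data, output_path)

# Read the file back to confirm it round-trips.
reloaded = pd.read_csv(output_path)
print(len(saved), "rows saved;", len(reloaded), "rows reloaded")
```

Writing with index=False means the reloaded frame has only the Link, Date, and Text columns.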