# Yahoo Finance Scraping

## Libraries

In [1]:
import pandas as pd
import requests
import random
from bs4 import BeautifulSoup
import re

## Scraper Function

This function looks up the Yahoo Finance URL for each ticker and uses regular expressions to collect the financial data. It takes a list of tickers and returns the scraped data as a dictionary of dictionaries, keyed by ticker.

In [2]:
def scrapper(ticker_list):

    """
    Take a list of tickers (such as ['AAPL', 'META']) and scrape the related data from Yahoo Finance.
    The data come from the "Financials" tab of Yahoo Finance, so the scraper gathers fundamental data about those companies.
    The results are stored in a dictionary of dictionaries, which can then easily be loaded into a pandas DataFrame (or another structure).
    """

    all_results = dict()

    for ticker in ticker_list:

        # define some user agents to be used to avoid detection
        user_agents = [ 
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36', 
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36', 
        'Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148', 
        'Mozilla/5.0 (Linux; Android 11; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Mobile Safari/537.36'
        ]
        
        # randomly choose an agent
        user_agent = random.choice(user_agents)
        headers = {'User-Agent': user_agent}

        # construct the URL using the ticker passed as argument to the function
        URL = f"https://finance.yahoo.com/quote/{ticker}/financials/"

        # a timeout keeps the loop from hanging indefinitely on an unresponsive request
        page = requests.get(URL, headers=headers, timeout=10)
        soup = BeautifulSoup(page.content, "html.parser")
        soup_str = str(soup)

        # get the time data (the column headers between 'Breakdown' and 'Total Revenue')
        start = soup_str.find('Breakdown')
        end = soup_str.find('Total Revenue')
        substr = soup_str[start:end]
        year = re.findall(r'\d+/\d+/\d+', substr)
        year.insert(0, 'TTM')

        # isolate the table from the rest of the content
        start = soup_str.find('Breakdown')
        end = soup_str.find('Related Tickers')
        substr = soup_str[start:end]

        # each company's data are stored in a dictionary
        results = dict()
        # here we use regular expressions (regex) to target the specific bits we are interested in
        for row in re.findall(r'column sticky(.*?)<div class="row lv-0', substr):
            title = re.findall(r'title="(.*?)"', row)[0]
            cells = re.findall(r'<div(.*?)</div>', row)[1:]
            # extract each cell's text once; skip cells with no match or whose text repeats the row title
            matches = [re.findall(r'>(.*) ', cell) for cell in cells]
            results[title] = [m[0] for m in matches if m and m[0] not in title]
        results['year'] = year

        # finally, we add the data scraped for the current ticker to the outer dictionary
        all_results[ticker] = results
        
    
    return all_results
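
To make the regex step easier to follow, here is a minimal sketch that applies the same patterns to a small hand-written fragment. The fragment (`header`, `rows`) is hypothetical: it is shaped only to match the patterns used in `scrapper`, and the real Yahoo Finance markup is more complex and may change at any time. The figures are the first three Apple revenue values from the output shown in this notebook.

```python
import re

# Hypothetical fragment, shaped only to match the patterns used in scrapper().
# The real Yahoo Finance markup is more complex and may differ.
header = 'Breakdown</div><div>TTM </div><div>9/30/2024 </div><div>9/30/2023 </div>'
rows = (
    '<div class="row lv-0">'
    '<div class="column sticky"><div class="rowTitle" title="Total Revenue">Total Revenue</div></div>'
    '<div class="column">395,760,000 </div>'
    '<div class="column">391,035,000 </div>'
    '<div class="column">383,285,000 </div>'
    '</div>'
    '<div class="row lv-0">'
)

# Dates look like month/day/year; 'TTM' has no digits, so it is added by hand.
years = ['TTM'] + re.findall(r'\d+/\d+/\d+', header)

parsed = {}
# Each row runs from its sticky title column up to the start of the next row.
for row in re.findall(r'column sticky(.*?)<div class="row lv-0', rows):
    title = re.findall(r'title="(.*?)"', row)[0]
    # The first <div> holds the row title, so it is dropped.
    cells = re.findall(r'<div(.*?)</div>', row)[1:]
    # Keep the text between '>' and the trailing space of each cell.
    parsed[title] = [re.findall(r'>(.*) ', cell)[0] for cell in cells]
parsed['year'] = years

print(parsed)
```

Because the patterns depend on literal strings such as `column sticky` and `row lv-0`, any change to the page layout will likely produce empty results rather than an error, so it is worth checking that the returned dictionaries are not empty.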

## Scraping the data for the Magnificent 7

We simply run the scraping function on the Magnificent 7 tickers.

In [3]:
ticker_list = ['AAPL', 'NVDA', 'MSFT', 'AMZN', 'TSLA', 'GOOG', 'META']
res = scrapper(ticker_list)
print(res)

{'AAPL': {'Total Revenue': ['395,760,000', '391,035,000', '383,285,000', '394,328,000', '365,817,000'], 'Cost of Revenue': ['211,657,000', '210,352,000', '214,137,000', '223,546,000', '212,981,000'], 'Gross Profit': ['184,103,000', '180,683,000', '169,148,000', '170,782,000', '152,836,000'], 'Operating Expense': ['58,428,000', '57,467,000', '54,847,000', '51,345,000', '43,887,000'], 'Operating Income': ['125,675,000', '123,216,000', '114,301,000', '119,437,000', '108,949,000'], 'Net Non Operating Interest Income Expense': ['--', '--', '-183,000', '-106,000', '198,000'], 'Other Income Expense': ['71,000', '269,000', '-565,000', '-334,000', '60,000'], 'Pretax Income': ['125,746,000', '123,485,000', '113,736,000', '119,103,000', '109,207,000'], 'Tax Provision': ['29,596,000', '29,749,000', '16,741,000', '19,300,000', '14,527,000'], 'Net Income Common Stockholders': ['96,150,000', '93,736,000', '96,995,000', '99,803,000', '94,680,000'], 'Diluted NI Available to Com Stockholders': ['96,150,
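
When looping over several tickers, a failed or blocked request would otherwise pass silently into the parsing step. The sketch below shows one possible way to isolate the fetch step. The helper name `fetch_financials_html` and its `timeout` and `pause` parameters are assumptions made for illustration, not part of the notebook's code; the URL pattern and user agents are taken from `scrapper`.

```python
import random
import time

import requests

# Two of the user agents already listed in scrapper().
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
]


def fetch_financials_html(ticker, session=None, timeout=10, pause=1.0):
    """Hypothetical helper: return the financials page HTML for one ticker.

    Raises requests.HTTPError on a 4xx/5xx response instead of returning
    an error page that the regex step would silently fail to parse.
    """
    session = session or requests.Session()
    url = f"https://finance.yahoo.com/quote/{ticker}/financials/"
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    # Assumed courtesy delay between consecutive requests.
    time.sleep(pause)
    return response.text
```

A call such as `fetch_financials_html('AAPL')` would return the page as a string, which could then be passed to the same `find` and `re.findall` steps used above. The one-second pause is an arbitrary choice.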

## Storing the data in a pandas DataFrame

Here we load the scraped data into a DataFrame for further analysis.

In [4]:
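all_df = []
for ticker, results in res.items():
    # one DataFrame per ticker, tagged with the ticker symbol
    data = pd.DataFrame(results)
    data['ticker'] = ticker
    all_df.append(data)
# stack the per-ticker frames; each keeps its own 0..n index
financials = pd.concat(all_df, axis=0)
financials.head(10)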

| | Total Revenue | Cost of Revenue | Gross Profit | Operating Expense | Operating Income | Net Non Operating Interest Income Expense | Other Income Expense | Pretax Income | Tax Provision | Net Income Common Stockholders | ... | Net Income from Continuing Operation Net Minority Interest | Normalized EBITDA | year | ticker | Total Unusual Items Excluding Goodwill | Total Unusual Items | Tax Rate for Calcs | Earnings from Equity Interest Net of Tax | Average Dilution Earnings | Rent Expense Supplemental |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 395760000 | 211657000 | 184103000 | 58428000 | 125675000 | -- | 71000 | 125746000 | 29596000 | 96150000 | ... | 96150000 | 137352000 | TTM | AAPL | | | | | | |
| 1 | 391035000 | 210352000 | 180683000 | 57467000 | 123216000 | -- | 269000 | 123485000 | 29749000 | 93736000 | ... | 93736000 | 134661000 | 9/30/2024 | AAPL | | | | | | |
| 2 | 383285000 | 214137000 | 169148000 | 54847000 | 114301000 | -183000 | -565000 | 113736000 | 16741000 | 96995000 | ... | 96995000 | 125820000 | 9/30/2023 | AAPL | | | | | | |
| 3 | 394328000 | 223546000 | 170782000 | 51345000 | 119437000 | -106000 | -334000 | 119103000 | 19300000 | 99803000 | ... | 99803000 | 130541000 | 9/30/2022 | AAPL | | | | | | |
| 4 | 365817000 | 212981000 | 152836000 | 43887000 | 108949000 | 198000 | 60000 | 109207000 | 14527000 | 94680000 | ... | 94680000 | 123136000 | 9/30/2021 | AAPL | | | | | | |
| 0 | 130497000 | 32639000 | 97858000 | 16405000 | 81453000 | 1539000 | 1034000 | 84026000 | 11146000 | 72880000 | ... | 72880000 | 86137000 | TTM | NVDA | -- | -- | 0.0 | | | |
| 1 | 130497000 | 32639000 | 97858000 | 16405000 | 81453000 | 1539000 | 1034000 | 84026000 | 11146000 | 72880000 | ... | 72880000 | 86137000 | 1/31/2025 | NVDA | -- | -- | 0.0 | | | |
| 2 | 60922000 | 16621000 | 44301000 | 11329000 | 32972000 | 609000 | 237000 | 33818000 | 4058000 | 29760000 | ... | 29760000 | 35583000 | 1/31/2024 | NVDA | -- | -- | 0.0 | | | |
| 3 | 26974000 | 11618000 | 15356000 | 9779000 | 5577000 | 5000 | -1401000 | 4181000 | -187000 | 4368000 | ... | 4368000 | 7340000 | 1/31/2023 | NVDA | -1353000 | -1353000 | 0.0 | | | |
| 4 | 26914000 | 9439000 | 17475000 | 7434000 | 10041000 | -207000 | 107000 | 9941000 | 189000 | 9752000 | ... | 9752000 | 11351000 | 1/31/2022 | NVDA | -- | -- | 0.0 | | | |
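
The scraped values arrive as strings (for example `'395,760,000'`, with `'--'` marking a missing figure), so further analysis will usually need a numeric conversion first. Below is a minimal sketch of one way to do that. The helper name `to_numeric_columns` and its `skip` parameter are assumptions for illustration, and the sample frame reuses three Apple rows from the output above.

```python
import pandas as pd

# Small sample in the same shape as the scraped data (values from the AAPL output).
sample = pd.DataFrame({
    'Total Revenue': ['395,760,000', '391,035,000', '383,285,000'],
    'Net Non Operating Interest Income Expense': ['--', '--', '-183,000'],
    'year': ['TTM', '9/30/2024', '9/30/2023'],
    'ticker': ['AAPL', 'AAPL', 'AAPL'],
})


def to_numeric_columns(df, skip=('year', 'ticker')):
    """Hypothetical helper: convert every column except `skip` to numbers."""
    out = df.copy()
    for col in out.columns:
        if col in skip:
            continue
        # Drop thousands separators, then coerce anything unparsable ('--') to NaN.
        cleaned = out[col].astype(str).str.replace(',', '', regex=False)
        out[col] = pd.to_numeric(cleaned, errors='coerce')
    return out


numeric = to_numeric_columns(sample)
print(numeric.dtypes)
```

With `errors='coerce'`, any value that cannot be parsed becomes `NaN` rather than raising, which is convenient here but can also hide unexpected strings, so a quick look at the `NaN` counts per column is a reasonable follow-up.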
