# [MarketWatch Webscraping: Securing Missing Financial Statement Data](#section-title)

- Per the EDA in Section 3, 2,226 data entries are missing a value for Cost of Revenue. Before webscraping can begin, I need to determine which tickers in the merged df (SEC, GICS, and Market Value) are missing ```is_cost_of_revenue_value```.
- Imputing these values is not appropriate: the nulls represent 49.9% of the data (2,226 / 4,463), and Cost of Revenue varies widely by company composition, sector, and growth phase.
- To use these data points, I gathered the Cost of Revenue from another source, __[MarketWatch](https://www.marketwatch.com/investing/stock/aapl/financials/income/quarter)__. MarketWatch is a website providing financial information, business news, analysis, and stock market data. I parsed the page's HTML to read the Cost of Goods Sold (COGS) including Depreciation & Amortization for every company that previously returned a null from the Polygon.io call. All figures are taken from the Q2 2023 quarterly income statements, consistent with the rest of the data set.
- Below, a list of tickers is created to represent all companies in my dataframe for which the Cost of Revenue line item is missing. I developed a webscraper function to iterate through this list, add the values to a dictionary, and export them to a dataframe. This dataframe will later be merged with the information already collected.
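
The 49.9% figure in the bullets is the null count divided by the row count. A minimal sketch of that check, using the two counts quoted above rather than the real dataframe (the helper name `null_share` is illustrative and not part of the notebook):

```python
def null_share(n_null: int, n_total: int) -> float:
    """Return the percentage of rows that are null, rounded to one decimal."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return round(100 * n_null / n_total, 1)


# Counts quoted in the bullets above
share = null_share(2226, 4463)
print(f"{share}% of rows lack Cost of Revenue")
```

On the dataframe itself, `df_sgmf["is_cost_of_revenue_value"].isnull().mean()` should give the same share as a fraction.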

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
 
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Importing the concatenated data frame for SEC, GICS, and market value info
df_sgmf = pd.read_csv("../data/cleaned_csvs_interim_steps/sgmf_cleaned.csv")

In [3]:
df_sgmf.columns

Index(['ticker', 'cik', 'symbol', 'description', 'gics_sector',
       'equity_securities', 'cap_size', 'sector_revenue_total_(trillions)',
       'total_cap_(trillions)', 'company_name', 'fiscal_period', 'fiscal_year',
       'filing_date', 'is_basic_earnings_per_share_order',
       'is_basic_earnings_per_share_unit', 'is_basic_earnings_per_share_value',
       'is_cost_of_revenue_order', 'is_cost_of_revenue_unit',
       'is_cost_of_revenue_value', 'is_gross_profit_order',
       'is_gross_profit_unit', 'is_gross_profit_value',
       'is_operating_expenses_order', 'is_operating_expenses_unit',
       'is_operating_expenses_value', 'is_revenues_order', 'is_revenues_unit',
       'is_revenues_value', 'ci_comprehensive_income_loss_order',
       'ci_comprehensive_income_loss_unit',
       'ci_comprehensive_income_loss_value',
       'ci_comprehensive_income_loss_attributable_to_parent_order',
       'ci_comprehensive_income_loss_attributable_to_parent_unit',
       'ci_comprehensive_i

In [4]:
# Using a boolean mask to collect the tickers of all rows with nulls under "is_cost_of_revenue_value"

ticker_no_cogs = df_sgmf[df_sgmf["is_cost_of_revenue_value"].isnull()]["ticker"].tolist()
len(ticker_no_cogs)

2231

---
## Webscraping the MarketWatch url for COGS values

In [5]:
url_marketwatch_aapl = "https://www.marketwatch.com/investing/stock/aapl/financials/income/quarter"
response = requests.get(url_marketwatch_aapl)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "lxml")
    # Parsing succeeded; the 'soup' object is explored in the cells below
else:
    print("Failed to fetch the URL")

# Help from Emily and Chat GPT

In [6]:
soup.prettify() # Taken from Akul.me Cheatsheet
soup.extract # No parentheses, so this displays the bound method's repr, which embeds the full document

<bound method PageElement.extract of <!DOCTYPE html>
<html class="" lang="en">
<head>
<title>AAPL | Apple Inc. Quarterly Income Statement | MarketWatch</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="user-scalable=no, initial-scale=1.0, maximum-scale=1.0, width=device-width" name="viewport"/>
<meta content="noarchive, noodp" name="robots"/>
<link href="//sts.wsj.net" rel="dns-prefetch"/>
<link href="//s.marketwatch.com" rel="dns-prefetch"/>
<link href="//video-api.wsj.com" rel="dns-prefetch"/>
<link href="//fonts.wsj.net" rel="dns-prefetch"/>
<link href="//m.wsj.net" rel="dns-prefetch"/>
<link href="//mwstream.wsj.net" rel="dns-prefetch"/>
<link href="//tags.tiqcdn.com" rel="dns-prefetch"/>
<link href="//s.ntv.io" rel="dns-prefetch"/>
<link href="//cdn.cxense.com" rel="dns-prefetch"/>
<link href="//a248.e.akamai.net" rel="dns-prefetch"/>
<link href="//om.dowjoneson.com" rel="dns-prefetch"/>
<link href="//bam.nr-data.net" rel="dns-pr

In [7]:
soup.html.title
soup.html.body

<body class="page--quote symbol--stock tab--financials page--Financials" role="document">
<section aria-label="site" class="container container--masthead Expanded masthead--expanded" role="banner">
<nav aria-label="site" class="region region--full fixed" role="navigation">
<nav-hat></nav-hat>
<a class="skip-link screen-reader-text btn btn--primary" href="#maincontent">Skip to main content</a>
<header class="column column--full masthead j-masthead full-width">
<input class="hidden toggle--menu j-toggle" id="main-menu" type="checkbox"/>
<label class="btn btn--menu j-toggle-label" for="main-menu" tabindex="0">
<i class="icon"></i>
<span class="screen-reader-text">Main Menu</span>
</label>
<div aria-label="dropdown navigation" class="nav j-main-menu" role="navigation">
<div class="nav__content">
<div class="element element--ad is-loading">
<div class="j-ad lazyload" data-expand="200" id="ad-navigation" is="mw-ad">
<script>
                        !function() {
                            w

In [8]:
div_results = soup.body.find_all('div', {'class':'cell__content'})
div_results

[<div class="cell__content fixed--cell">Item</div>,
 <div class="cell__content">Item</div>,
 <div class="cell__content">30-Jun-2022</div>,
 <div class="cell__content">30-Sep-2022</div>,
 <div class="cell__content">31-Dec-2022</div>,
 <div class="cell__content">31-Mar-2023</div>,
 <div class="cell__content">30-Jun-2023</div>,
 <div class="cell__content">5- qtr trend</div>,
 <div class="cell__content fixed--cell">Sales/Revenue</div>,
 <div class="cell__content">Sales/Revenue</div>,
 <div class="cell__content"><span class="">82.96B</span></div>,
 <div class="cell__content"><span class="">90.15B</span></div>,
 <div class="cell__content"><span class="">117.15B</span></div>,
 <div class="cell__content"><span class="">94.84B</span></div>,
 <div class="cell__content"><span class="">81.8B</span></div>,
 <div class="cell__content"> <div class="chart--financials js-financial-chart" data-chart-data="82959000000.0,90146000000.0,117154000000.0,94836000000.0,81797000000.0"><div></div></div></div>,
 <

In [9]:
soup.body.find_all("div", {"class": "cell__content"})[24].text

'Cost of Goods Sold (COGS) incl. D&A'

In [10]:
soup.body.find_all("div", {"class": "cell__content"})[30].text

'45.38B'
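
The scraped value is a string such as `'45.38B'`, so it would need converting to a number before it can sit alongside the numeric values already in `is_cost_of_revenue_value`. The following is a sketch only: the function name `parse_abbreviated` is hypothetical, the `K`/`M`/`T` suffixes and the parentheses-for-negatives convention are assumptions about MarketWatch's formatting, and only the `B` suffix is actually confirmed by the output above.

```python
from decimal import Decimal
from typing import Optional

# "B" appears in the scraped output; "K", "M" and "T" are assumed to follow
# the same convention.
_SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_abbreviated(text: str) -> Optional[int]:
    """Convert a string such as '45.38B' to a whole-dollar integer.

    Returns None for empty or placeholder cells. Parentheses are treated as
    a negative sign (an assumption about the page's formatting).
    """
    cleaned = text.strip().replace(",", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    if cleaned in {"", "-", "N/A"}:
        return None
    multiplier = _SUFFIXES.get(cleaned[-1].upper(), 1)
    if multiplier != 1:
        cleaned = cleaned[:-1]
    # Decimal avoids float rounding noise such as 45380000000.00001
    value = int(Decimal(cleaned) * multiplier)
    return -value if negative else value


aapl_cogs = parse_abbreviated("45.38B")
```

Using `Decimal` rather than `float` keeps the result exact for values with two decimal places, which is the precision the page displays.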

---
## Does a different MarketWatch ticker also have COGS at index 30?


In [11]:
url_marketwatch_tsla = "https://www.marketwatch.com/investing/stock/tsla/financials/income/quarter"
response = requests.get(url_marketwatch_tsla)

if response.status_code == 200:
    soup_tsla = BeautifulSoup(response.content, "lxml")
    # Parsing succeeded; the 'soup_tsla' object is checked in the cells below
else:
    print("Failed to fetch the URL")

# Help from Emily and Chat GPT

In [12]:
soup_tsla.extract

<bound method PageElement.extract of <!DOCTYPE html>
<html class="" lang="en">
<head>
<title>TSLA | Tesla Inc. Quarterly Income Statement | MarketWatch</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="user-scalable=no, initial-scale=1.0, maximum-scale=1.0, width=device-width" name="viewport"/>
<meta content="noarchive, noodp" name="robots"/>
<link href="//sts.wsj.net" rel="dns-prefetch"/>
<link href="//s.marketwatch.com" rel="dns-prefetch"/>
<link href="//video-api.wsj.com" rel="dns-prefetch"/>
<link href="//fonts.wsj.net" rel="dns-prefetch"/>
<link href="//m.wsj.net" rel="dns-prefetch"/>
<link href="//mwstream.wsj.net" rel="dns-prefetch"/>
<link href="//tags.tiqcdn.com" rel="dns-prefetch"/>
<link href="//s.ntv.io" rel="dns-prefetch"/>
<link href="//cdn.cxense.com" rel="dns-prefetch"/>
<link href="//a248.e.akamai.net" rel="dns-prefetch"/>
<link href="//om.dowjoneson.com" rel="dns-prefetch"/>
<link href="//bam.nr-data.net" rel="dns-pr

In [13]:
soup_tsla.body.find_all("div", {"class": "cell__content"})[30].text
# Confirmed that this index references the correct line item & quarter.

'20.39B'
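
Index 30 works for AAPL and TSLA because both pages share the same row layout: a label, the label repeated, five quarterly values, and a trend cell. A less position-dependent alternative is to find the COGS label first and step forward from it. The sketch below is not what the notebook does; `value_for_label` is a hypothetical helper, and it works on a plain list of cell texts so that it can be tried without a network call. The placeholder cells stand in for rows whose contents are not shown above.

```python
from typing import List, Optional

COGS_LABEL = "Cost of Goods Sold (COGS) incl. D&A"


def value_for_label(cells: List[str], label: str, offset: int = 6) -> Optional[str]:
    """Return the cell `offset` positions after the first cell equal to `label`.

    With the layout seen above, an offset of 6 lands on the most recent
    quarter (index 24 + 6 = 30 for AAPL). Returns None if the label is
    missing or the page has too few cells.
    """
    stripped = [c.strip() for c in cells]
    try:
        start = stripped.index(label)
    except ValueError:
        return None
    target = start + offset
    return stripped[target] if target < len(stripped) else None


# Toy page: 24 placeholder cells, then a COGS row ending in the known value
toy_cells = ["placeholder"] * 24 + [
    COGS_LABEL, COGS_LABEL, "q1", "q2", "q3", "q4", "45.38B", "trend",
]
latest = value_for_label(toy_cells, COGS_LABEL)
missing = value_for_label(["placeholder"] * 5, COGS_LABEL)
```

In the notebook, the list could be built with `[d.text for d in soup.body.find_all("div", {"class": "cell__content"})]`. Whether this would recover any ticker that fails on the fixed index is untested.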

In [14]:
def scrape_get_df(ticker: str) -> pd.DataFrame:
    """Scrape one ticker's MarketWatch quarterly income statement and return
    a one-row dataframe with its COGS (an empty dataframe if scraping fails)."""

    # Create url based on ticker
    url_beginning = "https://www.marketwatch.com/investing/stock/"
    url_end = "/financials/income/quarter"

    results = []
    try:
        res = requests.get(url_beginning + ticker + url_end)
        soup = BeautifulSoup(res.text, 'html.parser')

        new_submission = {}
        new_submission["ticker"] = ticker

        # Index 30 is the COGS cell for the latest quarter (checked above for AAPL and TSLA)
        div_elements = soup.body.find_all('div', {'class': 'cell__content'})
        new_submission["is_cost_of_revenue_value"] = div_elements[30].text
        results.append(new_submission)

    except Exception as e:
        # Pages with fewer than 31 matching cells raise an IndexError and are skipped
        print(f"Error processing {ticker}: {e}")

    # Creating the data frame (empty when the scrape failed)
    df_cogs_indiv = pd.DataFrame(results)

    return df_cogs_indiv

# Got help from Chat GPT for the div_elements variable only.

In [15]:
# Practicing to call the function on one equity only.
df_cogs_indiv = scrape_get_df("AAPL")
df_cogs_indiv

| | ticker | is_cost_of_revenue_value |
|---|---|---|
| 0 | AAPL | 45.38B |


In [17]:
# Iterate through tickers and call the function
dfs = []
for ticker in ticker_no_cogs:
    df_cogs_indiv = scrape_get_df(ticker)
    dfs.append(df_cogs_indiv)

# Concatenate all dataframes into a single dataframe
df_cogs = pd.concat(dfs, ignore_index=True)
df_cogs

Error processing LIN: list index out of range
Error processing RPRX: list index out of range
Error processing LSXMA: list index out of range
Error processing LSXMB: list index out of range
Error processing LSXMK: list index out of range
Error processing BATRA: list index out of range
Error processing BATRK: list index out of range
Error processing RBA: list index out of range
Error processing OSH: list index out of range
Error processing PDCE: list index out of range
Error processing EMBK: list index out of range
Error processing HR: list index out of range
Error processing BNL: list index out of range
Error processing ROIV: list index out of range
Error processing RXDX: list index out of range
Error processing ISEE: list index out of range
Error processing MAXR: list index out of range
Error processing BGCP: list index out of range
Error processing NG: list index out of range
Error processing HURN: list index out of range
Error processing ACVA: list index out of range
Error processing
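
The errors above are only printed, so the failed tickers are not available afterwards for a retry or a manual lookup. One option is to record them alongside the successful rows. This is a sketch under assumptions: `scrape_all` and `fake_scrape` are hypothetical names, and `scrape_one` is assumed to raise on failure, unlike `scrape_get_df`, which catches its own exceptions.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def scrape_all(
    tickers: Iterable[str],
    scrape_one: Callable[[str], str],
) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """Call `scrape_one` per ticker, returning (rows, failures).

    `failures` maps each failed ticker to a short error description.
    """
    rows: List[Dict[str, str]] = []
    failed: Dict[str, str] = {}
    for ticker in tickers:
        try:
            value = scrape_one(ticker)
            rows.append({"ticker": ticker, "is_cost_of_revenue_value": value})
        except Exception as exc:
            failed[ticker] = f"{type(exc).__name__}: {exc}"
    return rows, failed


def fake_scrape(ticker: str) -> str:
    """Stand-in scraper: one page with a value, one page with no cells."""
    cells = {"AAPL": ["45.38B"], "LIN": []}[ticker]
    return cells[0]


rows, failed = scrape_all(["AAPL", "LIN"], fake_scrape)
print(failed)
```

`pd.DataFrame(rows)` would then build the combined dataframe in one step, and `failed` could be saved for follow-up.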

In [22]:
df_cogs

In [None]:
# Convert dataframe to csv for record and for use in future notebooks:
df_cogs.to_csv("../data/cleaned_csvs_interim_steps/cogs_cleaned.csv", index=False)
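
For the later merge mentioned at the top of this section, one possible approach is to fill only the null entries of `is_cost_of_revenue_value` from the scraped values, matching on `ticker`. The sketch below uses toy data with made-up tickers and a hypothetical helper name, `fill_missing_cogs`, and it assumes the scraped strings have already been converted to numbers.

```python
import pandas as pd


def fill_missing_cogs(
    df_main: pd.DataFrame,
    df_scraped: pd.DataFrame,
    key: str = "ticker",
    col: str = "is_cost_of_revenue_value",
) -> pd.DataFrame:
    """Return a copy of df_main with nulls in `col` filled from df_scraped."""
    # One value per ticker, indexed by ticker, for use with Series.map
    lookup = df_scraped.drop_duplicates(subset=key).set_index(key)[col]
    out = df_main.copy()
    # Existing values are kept; only nulls take the scraped value
    out[col] = out[col].fillna(out[key].map(lookup))
    return out


# Toy data (made-up tickers and values, for illustration only)
main = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "is_cost_of_revenue_value": [1.0, None, None],
})
scraped = pd.DataFrame({
    "ticker": ["BBB"],
    "is_cost_of_revenue_value": [5.0],
})
filled = fill_missing_cogs(main, scraped)
```

Tickers that failed to scrape (here `CCC`) stay null, so they remain easy to identify afterwards.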