# Text Analysis

# Web Scraping
- Let us go to Yahoo Finance and try to get some data for Apple (ticker AAPL)

- The landing page URL is 
    https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch
- The base URL here is 
    https://finance.yahoo.com/quote/AAPL
- The query string is 
    - ?p=AAPL&.tsrc=fin-srch
    - p and .tsrc are the query keys
    - AAPL and fin-srch are the corresponding values
    - ? separates the base URL from the query, and & separates the key-value pairs
    
- Now look at the Statistics tab. The statistics URL is 
https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
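
The URL anatomy described above can be checked with the standard library module urllib.parse. This is only a small sketch to make the terms concrete; the variable names are illustrative and nothing here is needed for the scraping steps that follow.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch"
parts = urlsplit(url)

# Base URL: scheme, host and path, without the query string
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs splits the query on & and maps each key to a list of values
query = parse_qs(parts.query)

print(base_url)
print(parts.query)
print(query)
```

Each key maps to a list because a key may legally appear more than once in a query string.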

- We will use 
    - Requests to fetch the HTML page
    - Beautiful Soup to parse the HTML
    - pandas read_html to read the tables into dataframes


In [17]:
# Import the necessary packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml
from lxml import html

In [18]:
# Extract Data of AAPL from Yahoo
# Use requests to get the page
page=requests.get("https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch")
page

<Response [200]>

- 200 is the HTTP status code; it means the page was successfully retrieved
- Let us use BeautifulSoup to parse the data
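
Before parsing, a short aside on status codes. Requests exposes the code as page.status_code, and page.raise_for_status() raises an exception for 4xx and 5xx responses. The sketch below uses only the standard library to show what a few common codes mean; describe_status and is_success are hypothetical helper names, not part of Requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code with its standard reason phrase, e.g. '200 OK'."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"

def is_success(code):
    """Codes in the 2xx range mean the request succeeded."""
    return 200 <= code < 300

for code in (200, 404, 500):
    outcome = "success" if is_success(code) else "failure"
    print(describe_status(code), "->", outcome)
```

A scraper would typically stop, or retry later, when the code is not in the 2xx range, since the body is then an error page rather than the data.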

In [19]:
soup=BeautifulSoup(page.content,'lxml')

In [20]:
# You can search by tags by using find_all function in Beautiful Soup
title=soup.find_all('title')
title

[<title>Apple Inc. (AAPL) Stock Price, News, Quote &amp; History - Yahoo Finance</title>]

In [21]:
type(title)

bs4.element.ResultSet

In [22]:
len(title)

1

In [23]:
title[0].text

'Apple Inc. (AAPL) Stock Price, News, Quote & History - Yahoo Finance'
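
The difference between find_all and find is easy to see on a tiny document. The sketch below uses a made-up HTML string and Python's built-in html.parser (instead of lxml, so that no extra parser is required); demo_soup and the other names are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up page, used only for illustration
html_doc = """
<html>
  <head><title>Example &amp; Demo</title></head>
  <body>
    <p class="note">first</p>
    <p>second</p>
  </body>
</html>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

all_paragraphs = demo_soup.find_all("p")   # list-like ResultSet of Tag objects
first_paragraph = demo_soup.find("p")      # first matching Tag only
missing = demo_soup.find("table")          # None, because there is no <table>

print(len(all_paragraphs))
print(first_paragraph.get_text())
print(demo_soup.find("title").get_text())
print(missing)
```

Because find returns None when nothing matches, it is worth checking the result before calling methods on it.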

In [24]:
# We need tables from the soup.
# Use find_all function that looks at table tags
tables=soup.find_all('table')
tables

[<table class="W(100%)" data-reactid="92"><tbody data-reactid="93"><tr class="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)" data-reactid="94"><td class="C($primaryColor) W(51%)" data-reactid="95"><span data-reactid="96">Previous Close</span></td><td class="Ta(end) Fw(600) Lh(14px)" data-reactid="97" data-test="PREV_CLOSE-value"><span class="Trsdu(0.3s)" data-reactid="98">119.21</span></td></tr><tr class="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)" data-reactid="99"><td class="C($primaryColor) W(51%)" data-reactid="100"><span data-reactid="101">Open</span></td><td class="Ta(end) Fw(600) Lh(14px)" data-reactid="102" data-test="OPEN-value"><span class="Trsdu(0.3s)" data-reactid="103">119.44</span></td></tr><tr class="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)" data-reactid="104"><td class="C($primaryColor) W(51%)" data-reactid="105"><span data-reactid="106">Bid</span></td><td class="Ta(end) Fw(600) Lh(14px)" data-reactid="107" data-test="BID-value"><span 

In [25]:
# There are two tables
len(tables)

2

In [26]:
# First table
tables[0]

<table class="W(100%)" data-reactid="92"><tbody data-reactid="93"><tr class="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)" data-reactid="94"><td class="C($primaryColor) W(51%)" data-reactid="95"><span data-reactid="96">Previous Close</span></td><td class="Ta(end) Fw(600) Lh(14px)" data-reactid="97" data-test="PREV_CLOSE-value"><span class="Trsdu(0.3s)" data-reactid="98">119.21</span></td></tr><tr class="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)" data-reactid="99"><td class="C($primaryColor) W(51%)" data-reactid="100"><span data-reactid="101">Open</span></td><td class="Ta(end) Fw(600) Lh(14px)" data-reactid="102" data-test="OPEN-value"><span class="Trsdu(0.3s)" data-reactid="103">119.44</span></td></tr><tr class="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)" data-reactid="104"><td class="C($primaryColor) W(51%)" data-reactid="105"><span data-reactid="106">Bid</span></td><td class="Ta(end) Fw(600) Lh(14px)" data-reactid="107" data-test="BID-value"><span c

In [27]:
table1=soup.find('table',{"data-reactid":"92"})
trows=table1.find_all('tr')
for row in trows:
    print(row.get_text())

Previous Close119.21
Open119.44
Bid119.12 x 1800
Ask119.15 x 900
Day's Range117.87 - 119.67
52 Week Range53.15 - 137.98
Volume78,857,203
Avg. Volume158,008,226


In [28]:
td=table1.find_all('td')
for d in td:
    print(d.get_text())

Previous Close
119.21
Open
119.44
Bid
119.12 x 1800
Ask
119.15 x 900
Day's Range
117.87 - 119.67
52 Week Range
53.15 - 137.98
Volume
78,857,203
Avg. Volume
158,008,226
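
Printing whole rows glues the label to the value ("Previous Close119.21"), while printing cells one by one loses the pairing. One possible middle ground, sketched here on a made-up fragment with the same two-cell row layout, is to read the two cells of each row into a dictionary. The fragment and the names demo_table and quote are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the label/value rows of the quote table
html_fragment = """
<table>
  <tr><td><span>Previous Close</span></td><td><span>119.21</span></td></tr>
  <tr><td><span>Bid</span></td><td><span>119.12 x 1800</span></td></tr>
  <tr><td><span>Volume</span></td><td><span>78,857,203</span></td></tr>
</table>
"""
demo_table = BeautifulSoup(html_fragment, "html.parser").find("table")

quote = {}
for row in demo_table.find_all("tr"):
    cells = row.find_all("td")
    # Keep only rows that have exactly one label cell and one value cell
    if len(cells) == 2:
        label = cells[0].get_text(strip=True)
        value = cells[1].get_text(strip=True)
        quote[label] = value

print(quote)
```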


In [29]:
# Read the table into pandas with the read_html function
# read_html expects HTML text, a URL, or a file-like object, not a bs4 Tag
# So convert the table to a string first
t=pd.read_html(str(tables[0]))
t

[                0                1
 0  Previous Close           119.21
 1            Open           119.44
 2             Bid    119.12 x 1800
 3             Ask     119.15 x 900
 4     Day's Range  117.87 - 119.67
 5   52 Week Range   53.15 - 137.98
 6          Volume         78857203
 7     Avg. Volume        158008226]
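
A note for newer installations: since pandas 2.1, read_html issues a FutureWarning when it is handed literal HTML text and suggests wrapping the text in StringIO. The sketch below shows that variant on a made-up fragment; it assumes an HTML parser that pandas can use (for example lxml) is installed.

```python
from io import StringIO
import pandas as pd

# Made-up fragment with the same layout as the quote table
html_fragment = """
<table>
  <tr><td>Previous Close</td><td>119.21</td></tr>
  <tr><td>Open</td><td>119.44</td></tr>
  <tr><td>Bid</td><td>119.12 x 1800</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
frames = pd.read_html(StringIO(html_fragment))
demo = frames[0]

print(len(frames))
print(demo.shape)
print(list(demo[0]))
```

With a BeautifulSoup table the same idea would read pd.read_html(StringIO(str(tables[0]))).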

In [30]:
# The output is a list with one dataframe per table
type(t)

list

In [31]:
# The first element is a dataframe
type(t[0])

pandas.core.frame.DataFrame

In [32]:
# Copy into a dataframe
t1=t[0].copy()

In [33]:
# First row
t1.loc[0]

0    Previous Close
1            119.21
Name: 0, dtype: object

In [34]:
# First column
t1[0]

0    Previous Close
1              Open
2               Bid
3               Ask
4       Day's Range
5     52 Week Range
6            Volume
7       Avg. Volume
Name: 0, dtype: object

In [38]:
t1.index

RangeIndex(start=0, stop=8, step=1)

In [39]:
t1.index=t1[0]
t1

                             0                1
0
Previous Close  Previous Close           119.21
Open                      Open           119.44
Bid                        Bid    119.12 x 1800
Ask                        Ask     119.15 x 900
Day's Range        Day's Range  117.87 - 119.67
52 Week Range    52 Week Range   53.15 - 137.98
Volume                  Volume         78857203
Avg. Volume        Avg. Volume        158008226


In [19]:
# Use the first column as the index, then keep only the value column
t1.index=t1[0]
t2=t1[1]
t2

0
Previous Close             115.97
Open                       117.19
Bid                 119.31 x 1100
Ask                  119.35 x 800
Day's Range       116.44 - 119.63
52 Week Range      53.15 - 137.98
Volume                  112294954
Avg. Volume             162108573
Name: 1, dtype: object

In [20]:
# I want Apple's previous close 
float(t2.loc["Previous Close"])

115.97
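
float works for Previous Close because that value is a plain number. Other entries in the table, such as "78,857,203", "2.028T" or "117.87 - 119.67", need a little cleaning first. The helpers below are a hypothetical sketch of one way to do this; the suffix table (K, M, B, T) is an assumption about how large numbers are abbreviated on the page.

```python
def to_number(text):
    """Convert strings such as '78,857,203', '2.028T' or '0.69%' to a float."""
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    cleaned = text.strip().replace(",", "")
    if cleaned.endswith("%"):
        # Percentages are returned as fractions
        return float(cleaned[:-1]) / 100
    if cleaned and cleaned[-1] in multipliers:
        return float(cleaned[:-1]) * multipliers[cleaned[-1]]
    return float(cleaned)

def parse_range(text):
    """Split a numeric range such as '117.87 - 119.67' into (low, high)."""
    low, high = text.split(" - ")
    return to_number(low), to_number(high)

print(to_number("78,857,203"))
print(to_number("2.028T"))
print(parse_range("1,747.36 - 1,764.05"))
```

parse_range is meant for numeric ranges only; a date range such as the Earnings Date entry would raise a ValueError.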

- The HTML of the page contained two tables
- Let us extract both


In [13]:
# Loop over the tables and stack them into one dataframe
frames=[]
for table in tables:
    # read_html returns a list; its first element is the dataframe for this table
    frames.append(pd.read_html(str(table))[0])
# DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat
df_tables=pd.concat(frames)
df_tables_1=df_tables.copy()
# Use the first column as index
df_tables_1.index=df_tables_1[0]
# Extract the second column as it contains the values
df_tables_1[1]

0
Previous Close                                   119.21
Open                                             119.44
Bid                                       119.12 x 1800
Ask                                        119.15 x 900
Day's Range                             117.87 - 119.67
52 Week Range                            53.15 - 137.98
Volume                                         78857203
Avg. Volume                                   158008226
Market Cap                                       2.028T
Beta (5Y Monthly)                                  1.35
PE Ratio (TTM)                                    36.36
EPS (TTM)                                          3.28
Earnings Date               Jan 26, 2021 - Feb 01, 2021
Forward Dividend & Yield                   0.82 (0.69%)
Ex-Dividend Date                           Nov 06, 2020
1y Target Est                                    123.11
Name: 1, dtype: object
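
The stacking step can be tried without any network access. The sketch below builds two small stand-in frames by hand (left and right are illustrative names), stacks them with pd.concat, which also works on pandas 2.0 and later where DataFrame.append is no longer available, and then turns the label column into the index.

```python
import pandas as pd

# Two stand-ins for the frames that read_html returns for each table
left = pd.DataFrame([["Previous Close", "119.21"], ["Open", "119.44"]])
right = pd.DataFrame([["Market Cap", "2.028T"], ["EPS (TTM)", "3.28"]])

# Stack the frames on top of each other and renumber the rows
combined = pd.concat([left, right], ignore_index=True)

# set_index(0) moves the labels into the index; [1] selects the value column
values = combined.set_index(0)[1]

print(values)
```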

# Task
- Extract all the tables and values from statistics page for AAPL
- The output must be a dataframe 
    - The index of the dataframe must be the variable name
    - The only column must contain the latest values 

In [None]:
# Statistics page URL:
# https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL

In [40]:
page1=requests.get("https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL")
page1

<Response [200]>

In [66]:
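soup=BeautifulSoup(page1.content,'lxml')
# Collect the tables first so that tables1 is defined before it is indexed
tables1=soup.find_all('table')
tables1[0]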

<table class="W(100%) Bdcl(c) M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)" data-reactid="52"><thead data-reactid="53"><tr class="Bdtw(0px) C($primaryColor)" data-reactid="54"><th class="Fw(400) Pend(10px) Pos(st) Start(0) Pend(10px) Bgc($lv2BgColor) Z(1)" data-reactid="55"><!-- react-text: 56 --> <!-- /react-text --><div class="W(3px) Pos(a) Start(100%) T(0) H(100%) Bg($pfColumnFakeShadowGradient) Pe(n) Pend(5px)" data-reactid="57"></div></th><th class="Fw(b) Ta(c) Pstart(6px) Pend(4px) Py(6px) Miw(fc) Miw(fc)--pnclg Bgc($lv1BgColor) Pend(0)" data-reactid="58"><span class="Pos(r) smplTblTooltip" data-reactid="59"><span class="Pos(a) Z(3) Bgc($lv3BgColor) Bd($featurePromoBorder) Bxsh($boxAreaShadow) smplTblTooltip:h_V(v) V(h) W(150px) P(10px) D(ib) Fz(12px) C($tertiaryColor) Fw(500) Mt(25px)" data-reactid="60"><!-- react-text: 61 -->As of Date: 11/14/2020<!-- /react-text --><div class="Pos(a) H(0) W(0) Bdbc($seperatorColor) End(100%) Bds(s) Bdw(10px) Bdstartc(t) Bdendc(t) Bdtc(t)" dat

In [65]:
# We need tables from the soup.
# Use find_all function that looks at table tags
tables1=soup.find_all('table')
tables1
len(tables1)
t2=pd.read_html(str(tables1[0]))[0]
type(t2)
t2.columns

Index(['Unnamed: 0', 'As of Date: 11/14/2020Current', '9/30/2020', '6/30/2020',
       '3/31/2020', '12/31/2019', '9/30/2019'],
      dtype='object')
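
The first statistics table has named columns ('Unnamed: 0', 'As of Date: ...'), unlike the quote-page tables whose columns were simply 0 and 1. Stacking frames whose column labels differ produces extra columns filled with NaN rather than longer columns. The sketch below, on two hand-made stand-in frames, shows the effect and one possible remedy: renumbering the columns before stacking.

```python
import pandas as pd

# Stand-ins: one frame with header names, one with default integer columns
with_header = pd.DataFrame({"Unnamed: 0": ["Trailing P/E"], "Current": ["36.36"]})
without_header = pd.DataFrame([["Beta (5Y Monthly)", "1.35"]])

# Different labels: concat keeps all four columns and pads with NaN
misaligned = pd.concat([with_header, without_header])

# Same labels (0, 1) in every frame: rows stack into two columns
renumbered = []
for frame in (with_header, without_header):
    frame = frame.copy()
    frame.columns = range(len(frame.columns))
    renumbered.append(frame)
aligned = pd.concat(renumbered, ignore_index=True)

print(misaligned.shape)
print(aligned.shape)
```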

In [70]:
page=requests.get("https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL")
soup=BeautifulSoup(page.content,'lxml')
tables=soup.find_all('table')
frames=[]
for table in tables:
    df1=pd.read_html(str(table))[0]
    # Renumber the columns so tables with different headers line up
    df1.columns=range(0,len(df1.columns))
    frames.append(df1)
# DataFrame.append was removed in pandas 2.0, so stack the frames with pd.concat
df_tables=pd.concat(frames)
df_tables_1=df_tables.copy()
df_tables_1.index=df_tables_1[0]
df_tables_1[1]

0
Market Cap (intraday) 5                                  2.03T
Enterprise Value 3                                       2.05T
Trailing P/E                                             36.36
Forward P/E 1                                            30.03
PEG Ratio (5 yr expected) 1                               2.99
Price/Sales (ttm)                                         7.61
Price/Book (mrq)                                         31.03
Enterprise Value/Revenue 3                                7.46
Enterprise Value/EBITDA 6                                25.29
Beta (5Y Monthly)                                         1.35
52-Week Change 3                                        78.60%
S&P500 52-Week Change 3                                 14.83%
52 Week High 3                                          137.98
52 Week Low 3                                            53.15
50-Day Moving Average 3                                 116.12
200-Day Moving Average 3                             

In [58]:
t=[]
for table in tables1:
    t.append(pd.read_html(str(table))[0])

t

[                    Unnamed: 0 As of Date: 11/14/2020Current 9/30/2020  \
 0      Market Cap (intraday) 5                         2.03T     1.97T   
 1           Enterprise Value 3                         2.05T     1.99T   
 2                 Trailing P/E                         36.36     35.12   
 3                Forward P/E 1                         30.03     30.12   
 4  PEG Ratio (5 yr expected) 1                          2.99      2.86   
 5            Price/Sales (ttm)                          7.61      7.50   
 6             Price/Book (mrq)                         31.03     27.20   
 7   Enterprise Value/Revenue 3                          7.46     30.69   
 8    Enterprise Value/EBITDA 6                         25.29    108.89   
 
   6/30/2020 3/31/2020 12/31/2019 9/30/2019  
 0     1.56T     1.10T      1.29T   995.15B  
 1     1.58T     1.10T      1.30T     1.01T  
 2     28.52     20.02      24.70     19.01  
 3     24.33     19.65      22.17     17.27  
 4      2.02      

In [15]:
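def get_data_from_yahoo(ticker):
    # Fetch the quote page for the ticker and return its tables as one Series
    page=requests.get("https://finance.yahoo.com/quote/{0}?p={0}&.tsrc=fin-srch".format(ticker))
    soup=BeautifulSoup(page.content,'lxml')
    tables=soup.find_all('table')
    frames=[]
    for table in tables:
        df1=pd.read_html(str(table))[0]
        # Renumber the columns so every table has columns 0 and 1
        df1.columns=range(0,len(df1.columns))
        frames.append(df1)
    # DataFrame.append was removed in pandas 2.0, so stack with pd.concat
    df_tables=pd.concat(frames)
    # Index by the label column and return the value column
    df_tables_1=df_tables.copy()
    df_tables_1.index=df_tables_1[0]
    return df_tables_1[1]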

In [None]:
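# Loop over the statistics tables and stack them into one dataframe
frames=[]
for table in tables1:
    # read_html returns a list; its first element is the dataframe for this table
    df1=pd.read_html(str(table))[0]
    # Renumber the columns, since the first statistics table has named headers
    df1.columns=range(0,len(df1.columns))
    frames.append(df1)
# DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat
df_tables=pd.concat(frames)
df_tables_1=df_tables.copy()
# Use the first column as index
df_tables_1.index=df_tables_1[0]
# Extract the second column as it contains the values
df_tables_1[1]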

In [45]:
# The statistics page has ten tables
len(tables1)

10

In [47]:
t1=pd.read_html(str(tables[0]))
t1

[                0                1
 0  Previous Close           119.21
 1            Open           119.44
 2             Bid    119.12 x 1800
 3             Ask     119.15 x 900
 4     Day's Range  117.87 - 119.67
 5   52 Week Range   53.15 - 137.98
 6          Volume         78857203
 7     Avg. Volume        158008226]

In [14]:
# Use the format function to build the URL for any ticker
# The argument is substituted wherever {0} appears,
# so the same ticker fills both placeholders
"https://finance.yahoo.com/quote/{0}?p={0}&.tsrc=fin-srch".format("AAPL")

'https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch'
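
An alternative sketch for building the same URL uses an f-string together with urllib.parse, which also takes care of escaping characters that are not allowed in a URL. build_quote_url is a hypothetical helper name; the URL pattern is the one used above.

```python
from urllib.parse import quote, urlencode

def build_quote_url(ticker):
    """Return the Yahoo Finance landing-page URL for a ticker (illustrative helper)."""
    # quote escapes characters that are unsafe in the path part
    path = quote(ticker, safe="")
    # urlencode joins the key-value pairs with & and escapes them
    query = urlencode({"p": ticker, ".tsrc": "fin-srch"})
    return f"https://finance.yahoo.com/quote/{path}?{query}"

print(build_quote_url("AAPL"))
print(build_quote_url("GOOG"))
```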

In [24]:
get_data_from_yahoo('GOOG')

0
Previous Close                          1740.39
Open                                    1750.00
Bid                              1,740.01 x 900
Ask                              1,754.54 x 800
Day's Range                 1,747.36 - 1,764.05
52 Week Range               1,013.54 - 1,818.06
Volume                                  1263966
Avg. Volume                             1891973
Market Cap                               1.185T
Beta (5Y Monthly)                          1.02
PE Ratio (TTM)                            33.87
EPS (TTM)                                 51.75
Earnings Date                               NaN
Forward Dividend & Yield              N/A (N/A)
Ex-Dividend Date                            NaN
1y Target Est                           1839.17
Name: 1, dtype: object

In [25]:
# Get news report
page=requests.get("https://finance.yahoo.com/news/nikola-beats-third-quarter-forecast-212439967.html")
soup=BeautifulSoup(page.content,'lxml')
title=soup.find_all('title')

# Task 2
- Extract the news article content alone from this webpage