## Web scraping 

Web scraping involves more than downloading data: because scraped data is usually unstructured, the workflow also includes turning it into something readable and usable.  
The steps of web scraping are:  
- Locate the URL of the page you want to scrape;
- Inspect the webpage to identify the tags, paths, selectors, or attributes of the content you want to scrape;
- Write the code and make sure it works;
- Generalize your code to scrape more webpages, while taking care not to get blocked (tips are given in later sections).

## Yahoo Finance News

This tutorial follows the steps above to extract news articles and related information from Yahoo Finance.

### 1. Understand HTTP Requests and Responses
It is useful to understand how information is transmitted between servers and clients before we start web scraping. Hypertext Transfer Protocol (HTTP) is a messaging protocol that connects a client with a server.  
The communication occurs in a request-response cycle: the client sends a **_request_** and the server replies with a **_response_**.  
The most common requests a client can send are the `GET` request and the `POST` request.  
With a GET request, the client asks the server to send information back; with a POST request, the client sends information to the server, which is why POST is more widely used in web applications.   
For web scraping, we usually send a GET request and wait for the server to respond.   
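
To make the difference concrete, here is a minimal sketch that builds (but does not send) one request of each kind with `requests`. The `https://example.com/search` endpoint and the `q` parameter are placeholders for illustration only; they are not part of Yahoo Finance.

```python
import requests

# Placeholder endpoint and parameter, used only for illustration.
url = "https://example.com/search"

# A GET request carries its parameters in the URL and has no body.
get_req = requests.Request("GET", url, params={"q": "stocks"}).prepare()

# A POST request carries its data in the request body instead.
post_req = requests.Request("POST", url, data={"q": "stocks"}).prepare()

print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
```

Nothing is transmitted here; `prepare()` only assembles the message that `requests.get()` or `requests.post()` would send.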
                        

### 2. Scrape information from one webpage
How do we scrape information from a webpage? We need to send a GET request to the server and extract the information we want from the response.  
Specifically here are the steps:
- Send the request 
- Parse the HTML response
- Locate and extract the information we need 

#### 2.1 Send the request 
We need two pieces of information to scrape a webpage: its URL and its page source.  
In Chrome, you can inspect the page source either by opening the developer tools or by right-clicking the page and choosing "Inspect".

<img src = 'page_source.JPG' width=800 height=800/>

The code in the top-right panel is the HTML we will explore to locate the information we care about.  
It is the response that the server sends back to a GET request:

In [2]:
import requests

url = 'https://news.yahoo.com/dems-head-toward-house-control-160107166.html'

response = requests.get(url)
response.status_code,response.reason

(200, 'OK')

There are five categories of status code showing different situations during this step:  

<img src = 'categories.png' width=600 height=400 />
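
As a rough sketch, the category can be read off the first digit of the status code. The category names below follow the standard HTTP classes (1xx to 5xx), which I assume is what the figure shows; `describe_status` is a hypothetical helper, not part of `requests`.

```python
from http import HTTPStatus

# First digit of the status code -> category name (standard HTTP classes).
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}

def describe_status(code):
    """Return the category and standard reason phrase for a status code."""
    category = CATEGORIES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown"
    return category, phrase

print(describe_status(200))
print(describe_status(404))
```

In a scraper, you would typically continue only when `response.status_code` falls in the 2xx class.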

#### 2.2 Parse the response 

After getting the response from the server, we need to parse the raw HTML into a structure that Python can navigate, so that we can extract valuable information from it.  
Two common choices are `BeautifulSoup` and `lxml`. `lxml` is much faster but less forgiving of malformed HTML, so we usually use `BeautifulSoup` to parse:

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

The soup variable holds the entire webpage, so printing it produces long, hard-to-read output.  
The next step is to extract the data we need from soup. To do that we need to go back to the webpage and examine it.

#### 2.3 Locate and extract the data we need 
First, we want to extract the news article and the timestamp of when it was released.  
The timestamp is important because we will use it to match each article with the label variable, whatever that is for your analysis. Let's go back to the webpage and find them.  

The trick I always use is to first locate the information I want on the webpage itself, in this case the news body.  
I then select part of the news text, right-click, and choose 'Inspect'. This way, instead of showing the beginning of the HTML code, the top-right panel highlights the element I have selected. 

<img src = 'paragraph.JPG' width = 800 height = 800 />

Taking a closer look:  

<img src = 'closer.png' width = 600 height = 400/>

It shows that the news content is stored under 'p' tags. What we need to do next is grab these pieces of information from soup. Once we know where they are, there are several ways to extract them.  
One way is to use the tag name and its attributes; another is to use a CSS selector.  
For this case, I will use the tag name with the **`find`** function from BeautifulSoup since it is more straightforward. 

<img src = 'find.png' width = 800 height = 800>

Now our goal is to extract all the paragraphs of the news article, not just one. Looking at the HTML, we can see that each paragraph is in its own 'p' tag, and that all of them sit under one division, 'div'.  
The most convenient way of doing this is to find the division and extract all the paragraphs at once: 

In [4]:
soup.find('div', attrs={'class': 'caas-body'})

<div class="caas-body"><p>WASHINGTON (AP) — Disappointed Democrats headed Wednesday toward renewing their control of the House for two more years but with a potentially shrunken majority as they lost at least seven incumbents without ousting a single Republican lawmaker.</p><p>By Wednesday afternoon, Democrats' only gains were two North Carolina seats vacated by GOP incumbents after a court-ordered remapping made the districts more Democratic. Although their majority seemed secure, the results were an unexpected jolt for a party that had envisioned gains of perhaps 15 seats. They were a morale booster for Republicans, who going into Election Day were mostly bracing for losses.</p><p>“They were all wrong," House Minority Leader Kevin McCarthy, R-Calif., told reporters about Democrats' assumptions of adding to their House numbers. Repeating a campaign theme Republicans used repeatedly against Democrats, he said, “The rejection that we saw last night from the Democrats, was that America d

The find function returns the first tag that satisfies the condition we specify. Here the condition is the value of the class attribute.  
We can see that soup has found the right information, but we need to strip the HTML markup and keep just the news article as a string:

In [5]:
news = soup.find('div', attrs={'class': 'caas-body'}).text

The `news` variable now holds the whole news body, along with some extra symbols like '\'. This does not matter much, since we usually remove such symbols when pre-processing text data.  

BeautifulSoup has another function called `find_all`. Instead of returning the first tag, **find_all** returns all tags that satisfy the condition as a list. This is helpful when extracting all information from a common tag, like all links on a webpage.  
For the news paragraphs, if we used **find_all**, we would need the extra step of joining all list elements into one string, which is not as convenient as using **find**.  
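
The difference between the two functions can be sketched on a tiny, made-up HTML snippet (the snippet and the variable names below are illustrative; the real article page is much larger):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the article page, invented for this example.
html = """
<div class="caas-body">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# find returns only the first matching tag (or None if nothing matches).
first = soup_demo.find("p")

# find_all returns every matching tag as a list.
paragraphs = soup_demo.find_all("p")

# Joining the paragraph texts gives one string for the whole article.
article = " ".join(p.text for p in paragraphs)
print(article)
```

Because `find` returns `None` when nothing matches, calling `.text` on its result raises an error for pages that lack the tag, which is worth keeping in mind when scraping many pages.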

We can also get the headlines of the news. Following the same procedure, we find the headline is here:

<img src = 'headline.png'>



The headline is in the only _**h1**_ tag, near the top of the page, so it is very convenient to extract with the find function:


In [6]:
soup.find('h1')

<h1 data-test-locator="headline">Dems head toward House control, but GOP picks off seats</h1>

In [7]:
# the tag carries extra attributes (mostly styling); keep only the text:
headline = soup.find('h1').text

We also want to know the date the news was published. Following the same process:  
<img src = "date.png">

In [8]:
date = soup.find('time').text
date

'November 4, 2020, 6:01 PM'

Unlike the news article, we do not want to keep the date as a string. We need to convert it into a `datetime` object, which we can do with the `strptime` function from `datetime`.  

In [9]:
# Instead of using the code above, we can add strptime here:
from datetime import datetime 

date = datetime.strptime(soup.find('time').text, "%B %d, %Y, %I:%M %p")
date

datetime.datetime(2020, 11, 4, 18, 1)

The function `strptime()` takes two inputs, the string and the format. The string is the date we scraped; based on how it looks, we define a pattern for **strptime()** to recognize it.  
According to the reference, %B is the full month name, %d is the day of the month, %Y is the four-digit year, %I:%M is the time in hours (12-hour clock) and minutes, and %p is AM/PM. Note that %p only changes the parsed hour when it is paired with %I; with the 24-hour %H, '6:01 PM' would be read as 06:01.  
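
A quick check of how `%p` interacts with the hour directive, using the date string scraped above (this assumes an English locale, since `%B` and `%p` are locale-dependent):

```python
from datetime import datetime

raw = "November 4, 2020, 6:01 PM"

# %I (12-hour clock) lets %p shift the hour into the afternoon.
with_12h = datetime.strptime(raw, "%B %d, %Y, %I:%M %p")

# %H (24-hour clock) parses "6" literally; %p is accepted but ignored.
with_24h = datetime.strptime(raw, "%B %d, %Y, %H:%M %p")

print(with_12h.hour, with_24h.hour)
```

The first result falls at 18:01 and the second at 06:01, so the choice of directive matters whenever articles are matched to other data by time of day.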

In [10]:
def get_info(url):
    '''
    Now that we have all the information we need, we can wrap the code in a function
    '''
    #send request 
    response = requests.get(url)
    #parse
    soup = BeautifulSoup(response.text, 'html.parser')
    #get information you need
    news = soup.find('div', attrs={'class': 'caas-body'}).text
    headline = soup.find('h1').text
    date = datetime.strptime(soup.find('time').text, "%B %d, %Y, %I:%M %p")
    
    return news, headline, date

### 3. Fetching subsequent pages
So far we have scraped one news article; next we need to scrape others. As long as the pages share a similar format and we know their URLs, we can use the function we defined in the previous section to extract the news body, headline, and date one by one.  
The only information we are missing is the URLs of the news articles. We can get them by scraping a webpage that lists news articles:
[https://finance.yahoo.com/topic/stock-market-news]

#### 3.1 Get the links
This webpage lists many news articles. We need to locate the link to each news article and store the links in a list. Links are usually located in a tag named **'a'**, in the attribute **'href'**:

<img src = 'links.png' width = 800 height = 800>

This webpage uses many list tags **'li'**, one for each news article's information. The links are in the **'a'** tag, under the *'href'* attribute. We can use the **find_all** function to extract the links inside the list tags:

In [11]:
url = "https://finance.yahoo.com/topic/stock-market-news"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

link = [l.find('a')['href'] for l in soup.find_all('li') if l.find('a')]

However, this code makes the parser find links inside all the ‘li’ tags, including links we do not want, such as links to advertisements. In cases like this, when the pattern is nested deep inside a common tag, there are two convenient mini-languages for such traversals: `CSS selectors` and `XPath`. Both are ways to identify the specific location of the information we want. I will use CSS selectors here.  

We only want links from the _‘li’_ tags for news articles, and each of those has a CSS selector that leads us to it. In the HTML panel, the easiest way to get that selector is to select an li tag, right-click, and choose ‘Copy selector’. For the first news article, the CSS selector is _“#Fin-Stream > ul > li:nth-child(1)”_. We want to go directly to the ‘a’ tag inside this ‘li’ tag, so I write _“#Fin-Stream > ul > li:nth-child(1) a”_ instead. The select function returns a list, so to extract the ‘href’ attribute from the ‘a’ tag, we first take the first element of the list and then specify the attribute name:

In [12]:
soup.select('#Fin-Stream > ul > li:nth-child(1) a')[0]['href']

'/m/013f3a26-06d7-3bb6-8811-ddbc901f9eec/these-are-the-best-robinhood.html'

We have successfully extracted the link for the first news article, but compared to a full article link such as

“https://finance.yahoo.com/news/election-2020-stock-market-news-updates-november-5-2020-232052919.html”,  

we are missing “https://finance.yahoo.com”. We can add this part by concatenating the strings:

In [13]:
'https://finance.yahoo.com' + soup.select('#Fin-Stream > ul > li:nth-child(1) a')[0]['href']

'https://finance.yahoo.com/m/013f3a26-06d7-3bb6-8811-ddbc901f9eec/these-are-the-best-robinhood.html'
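
String concatenation works when every scraped link is relative. As a hedged alternative, `urljoin` from the standard library handles relative and already-absolute links alike; the two links below are the ones shown above.

```python
from urllib.parse import urljoin

base = "https://finance.yahoo.com"

# A relative link like the one scraped above gets the base prepended.
relative = "/m/013f3a26-06d7-3bb6-8811-ddbc901f9eec/these-are-the-best-robinhood.html"
full_from_relative = urljoin(base, relative)

# A link that is already absolute is returned unchanged.
absolute = "https://finance.yahoo.com/news/election-2020-stock-market-news-updates-november-5-2020-232052919.html"
full_from_absolute = urljoin(base, absolute)

print(full_from_relative)
print(full_from_absolute)
```

This avoids producing doubled prefixes if a listing page mixes both kinds of link.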

We can also notice that the CSS selector has ‘1’ inside, indicating the first article. We can manipulate the CSS selector to extract more links. Using list comprehension:

In [14]:
links = ['https://finance.yahoo.com' + soup.select('#Fin-Stream > ul > li:nth-child({}) a'.format(i))[0]['href'] for i in range(1,4)]

The variable `links` contains the links to the first three news articles on this webpage:

In [15]:
links

['https://finance.yahoo.com/m/013f3a26-06d7-3bb6-8811-ddbc901f9eec/these-are-the-best-robinhood.html',
 'https://finance.yahoo.com/m/0aaba121-6543-313e-bb26-dc7e76f84a36/one-800flowers-com-stock.html',
 'https://finance.yahoo.com/m/a49abab5-9171-3baa-8141-e76126c6aaa3/penn-national-gaming-stock.html']
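
Looping over `nth-child` indices raises an `IndexError` as soon as a list item has no link. One possible variation, sketched here on simplified, made-up markup (the real page structure may differ), is to let a single selector return every matching `a` tag:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the listing page.
html = """
<div id="Fin-Stream">
  <ul>
    <li><a href="/m/article-one.html">One</a></li>
    <li><a href="/m/article-two.html">Two</a></li>
    <li><span>No link in this item</span></li>
  </ul>
</div>
<ul><li><a href="/ads/promo.html">Link outside the stream</a></li></ul>
"""
listing = BeautifulSoup(html, "html.parser")

# select returns every 'a' tag with an href under the stream's list items.
anchors = listing.select("#Fin-Stream > ul > li a[href]")
hrefs = ["https://finance.yahoo.com" + a["href"] for a in anchors]
print(hrefs)
```

Note that this collects every link inside each list item, not only the first one, so on a real page some filtering or de-duplication may still be needed.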

#### 3.2 Extract information for each link
After storing the links in a list, we can scrape the information for multiple news articles:

In [16]:
def get_info(url):
    #grab the information
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    news = soup.find('div', attrs = {'class': 'caas-body'}).text
    headline = soup.find('h1').text
    date = datetime.strptime(soup.find('time').text, "%B %d, %Y, %I:%M %p")
    
    #combine all info into a list of columns 
    columns = [news, headline, date]
    
    #give columns names 
    column_names = ['News', 'Headline', 'Date']
    
    return dict(zip(column_names, columns))

Note that the published date on these pages includes the hour, minute, and AM/PM, so the format passed to *strptime* uses %I (12-hour clock) together with %p. We also need to think about how to store the information for a list of news articles. You can store them as a list of tuples, a dictionary, or a DataFrame, depending on the next step. The function above stores the data for each article as a dictionary. With a list comprehension, we can then store a list of articles as a list of dictionaries:

In [17]:
info = [get_info(url) for url in links]

In [18]:
# which we can turn into a dataframe 
import pandas as pd
pd.DataFrame(info)

| | News | Headline | Date |
|---|---|---|---|
| 0 | Buying a stock is easy, but purchasing the rig... | These Are The Best Robinhood Stocks To Buy Or ... | 2021-06-09 00:05:00 |
| 1 | When considering what names to put on your wat... | One-800Flowers.com Stock Clears Key Benchmark,... | 2021-06-08 23:46:00 |
| 2 | A Relative Strength Rating upgrade for Penn Na... | Penn National Gaming Stock Flashes Improved Re... | 2021-06-08 23:26:00 |


#### 3.3 More links
Normally, this method would let us extract as many links as the webpage contains. However, the page that lists the news articles is a **dynamic website**: the server only sends the top three news articles in response to the request, so we cannot extract more than three links using the method above.  

Other websites simply display a limited number of items per page. In that case, we can change the page parameter in the URL to scrape the links from many pages. For example, Yelp only displays ten restaurants per page, so we can scrape the links for the ten restaurants on page 1, then move on to page 2, and so on.  
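
For paginated sites, the page URLs can often be generated up front. The sketch below assumes a hypothetical endpoint and an offset parameter named `start`; real sites name and number this parameter differently, so check the URL in your browser as you move between pages.

```python
from urllib.parse import urlencode

def page_urls(base_url, n_pages, page_size=10, param="start"):
    """Build one URL per results page by advancing an offset parameter."""
    return [
        base_url + "?" + urlencode({param: page * page_size})
        for page in range(n_pages)
    ]

# Hypothetical listing endpoint, used only for illustration.
urls = page_urls("https://example.com/search", n_pages=3)
print(urls)
```

Each generated URL can then be requested and parsed in the same way as a single listing page.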

For a dynamic website like the Yahoo Finance listing, one solution is to use Selenium, which drives a real browser:

In [19]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'https://finance.yahoo.com/topic/stock-market-news'

links = []

for i in range(1, 10):
    
    #define webdriver (Selenium 4 API; the raw string keeps the backslash literal)
    driver = webdriver.Chrome(service=Service(r'C:\chromedriver')) #use your own path
    driver.implicitly_wait(5)
    
    #get request
    driver.get(url)
    
    #click the i-th news article and grab its url 
    elem = driver.find_element(By.CSS_SELECTOR, "#Fin-Stream > ul > li:nth-child({})".format(i))
    elem.click()
    
    #store in a list
    links.append(driver.current_url)
    
    #close the browser so windows do not pile up
    driver.quit()

KeyboardInterrupt: 

This is probably not the best way to deal with the issue: it opens a new browser window and clicks through the page for every article, so it takes a very long time.  

### 4. How to reduce the chances of getting blocked
Before web scraping, it is essential to check whether the website allows it. You can look up the website's scraping rules, for example in its terms of service or its `robots.txt` file.  
Even when a website is scraping-friendly, there is always a chance of being blocked by the server. Most of the time, this happens because you send too many requests in a short period of time. There are several methods you can adopt to reduce the chances of being blocked.  

#### 4.1 Specify a header
When sending the request, you can specify headers in the request function by using the website's `Request Headers`. You can find the request headers in the same panel that contains HTML information, but in the **Network** tab:

<img src = 'request_header.png' width = 800 height = 800>

Choose any item in the **Name** column, and the **Request Headers** will show on the right. We can store the header information in a dictionary named **"headers"**, and pass **"headers = headers"** to the `requests.get()` function. 

In [None]:
url = "https://finance.yahoo.com/topic/stock-market-news"

In [None]:
headers = {'accept-ranges': 'bytes', 
           'access-control-allow-origin': '*', 
           "content-encoding": "gzip", 
          "cache-control": "public, max-age=86400", 
          "control-type": "text/javascript"}

In [None]:
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
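
The headers a browser sends typically include fields such as `User-Agent` and `Accept-Language`. The sketch below assumes those two fields and uses placeholder values (copy the real ones from your own Network tab); it attaches them to a `requests.Session` so that every request made through the session reuses them. The request is only prepared, not sent.

```python
import requests

# Placeholder values; copy the real ones from your browser's Network tab.
headers = {
    "User-Agent": "Mozilla/5.0 (placeholder; replace with your browser's value)",
    "Accept-Language": "en-US,en;q=0.9",
}

# A Session sends the same headers with every request made through it.
session = requests.Session()
session.headers.update(headers)

# Build the request without sending it, to inspect the merged headers.
prepared = session.prepare_request(
    requests.Request("GET", "https://finance.yahoo.com/topic/stock-market-news")
)
print(prepared.headers["User-Agent"])
```

Calling `session.get(url)` afterwards would send the request with these headers included.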

#### 4.2 Use `time.sleep` to avoid requesting too frequently
When scraping the articles from the links above, I used a list comprehension:

In [None]:
info = [get_info(url) for url in links]

The list comprehension saves a few lines of code, but it sends three consecutive requests in a very short time. That is not a big problem with only three requests, but it becomes very problematic if the number grows to, say, three hundred. Here, we need to adjust this step and insert a sleep between requests:

In [None]:
import time

info = []
for url in links:
    #in case some page doesn't work 
    try:
        info.append(get_info(url))
    except Exception as e:
        #skip this page instead of appending stale or undefined data
        print('blocked or failed:', url, e)
    ## adding time sleep will decrease the chance of being blocked, but increase operation time
    time.sleep(1)

`time.sleep` pauses your program for a given number of seconds before it moves on to the next step. You can tune the waiting time to balance the risk of being blocked against the total running time. Note that I also use a try/except block here, in case anything goes wrong during scraping.
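
A common variation, sketched below, is to randomize the pause and to keep track of the URLs that failed so they can be retried later. `scrape_politely` and its parameters are hypothetical names; `fetch` stands for any function that takes a URL, such as the `get_info` function defined earlier.

```python
import random
import time

def scrape_politely(urls, fetch, min_delay=1.0, max_delay=3.0):
    """Call fetch(url) for each URL, pausing a random interval in between.

    Returns (results, failed), where failed lists the URLs that raised.
    """
    results, failed = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            # Record the failure and move on instead of stopping the run.
            print(f"failed: {url} ({exc})")
            failed.append(url)
        # Pause for a random time between min_delay and max_delay seconds.
        time.sleep(random.uniform(min_delay, max_delay))
    return results, failed
```

With the earlier function this would be called as `info, failed = scrape_politely(links, get_info)`.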

#### 4.3 Do not repeat the request for different information
In the previous example, I wanted to extract the news body, date, and headline from one webpage. Instead of requesting the webpage three times, I request it once, parse the response into soup, and extract the three different pieces of information from soup. Thus, we send only one request instead of three. 
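
The same idea extends across a whole run: if a URL may come up more than once, the downloaded page can be cached so the server is only asked once. This is a minimal sketch, assuming an in-memory cache is enough; `make_cached_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the example runs offline.

```python
from functools import lru_cache

def make_cached_fetch(fetch):
    """Wrap fetch(url) so each distinct URL is requested at most once."""
    @lru_cache(maxsize=None)
    def cached(url):
        return fetch(url)
    return cached

# Stand-in for a real download; it records every call it receives.
calls = []
def fake_fetch(url):
    calls.append(url)
    return "<html>" + url + "</html>"

cached_fetch = make_cached_fetch(fake_fetch)
page = cached_fetch("https://example.com/a")
page_again = cached_fetch("https://example.com/a")  # served from the cache
print(len(calls))
```

In a real scraper, `fetch` could be something like `lambda url: requests.get(url).text`, with the cached text then parsed as many times as needed.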