
# Web Scraping with Beautiful Soup

![](https://sixfeetup.com/blog/an-introduction-to-beautifulsoup/@@images/1cc2fc44-5ac8-4378-bef2-95048d5bc5ad.png)

We often don't have access to enough data (or any data at all) to perform an accurate analysis. This is a common issue for new or niche projects. In those cases, we might need to find a way to collect data on our own.

__A Web Scraper__ is a program that extracts data from a website. __BeautifulSoup__ is a Python library that provides many functions to pull data out of HTML and XML files.

For more information, please read the documentation of Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### Problem Statement
- Build a Web Scraper to collect data about articles on [https://vnexpress.net/](https://vnexpress.net/).
- Required information:
  - Title
  - Description
  - Link to the Article
  - Link to Thumbnail Image


## Send GET Request to the Website

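The first step is to request the HTML text of the page we want to scrape. We can achieve that by sending a [GET request](https://www.w3schools.com/tags/ref_httpmethods.asp) to its URL using the Python library `requests`. The problem statement targets [https://vnexpress.net/](https://vnexpress.net/); the cells below use a category page of https://tiki.vn as a worked example.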

In [52]:
import requests
base_url ='https://tiki.vn'
r = requests.get(base_url + '/nha-sach-tiki/c8322?src=c.8322.hamburger_menu_fly_out_banner')

print(r.text)

<!doctype html>
<html class="no-js" lang="">
<head>
    
    <style>html { background: #f4f4f4; } .async-hide body { opacity: 0 !important} </style>
    <script>(function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
    h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
    (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
    })(window,document.documentElement,'async-hide','dataLayer',1500,
    {'GTM-53B3KKW':true});</script>
<script>
!function(){if('PerformanceLongTaskTiming' in window){var g=window.__tti={e:[]};
g.o=new PerformanceObserver(function(l){g.e=g.e.concat(l.getEntries())});
g.o.observe({entryTypes:['longtask']})}}();
</script>

  
    <script>
        (function() {    
            function getCookie(name) {
              var value = "; " + document.cookie;
              var parts = value.split("; " + name + "=");
              if (parts.length == 2) return parts.pop().split(";").shift();
            }
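
Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one possible way to wrap the call, not the notebook's own code: `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice. It also shows that `requests` can build the query string from a dictionary instead of concatenating it by hand.

```python
import requests

BASE_URL = 'https://tiki.vn'
PATH = '/nha-sach-tiki/c8322'
PARAMS = {'src': 'c.8322.hamburger_menu_fly_out_banner'}


def fetch_html(url, params=None, timeout=10):
    """Hypothetical helper: send a GET request and return the HTML text."""
    response = requests.get(url, params=params, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Build the request without sending it, just to inspect the final URL
prepared = requests.Request('GET', BASE_URL + PATH, params=PARAMS).prepare()
print(prepared.url)

# To actually download the page (needs network access):
# html = fetch_html(BASE_URL + PATH, params=PARAMS)
```

Keeping the download in its own function means the timeout and the status check are applied the same way on every page we request.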
          

## Parse the Raw Text with BeautifulSoup
In order to extract the information from the HTML file, we need to put it through a parser. 

A parser builds a data structure from input data (usually text), allowing us to easily find and extract the components of the data. __BeautifulSoup__ is one of the most popular HTML parsing libraries for Python.

In [31]:
from bs4 import BeautifulSoup

# r.text is an HTML document, so we use Python's built-in html.parser
soup = BeautifulSoup(r.text, 'html.parser')

# prettify() returns the HTML as an indented string; print the first 500 characters
print(soup.prettify()[:500])

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <style>
   html { background: #f4f4f4; } .async-hide body { opacity: 0 !important}
  </style>
  <script>
   (function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
    h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
    (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
    })(window,document.documentElement,'async-hide','dataLayer',1500,
    {'GTM-53B3KKW':true});
  </script>
  <
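
To make the idea of a parsed data structure more concrete, here is a minimal sketch on a made-up HTML string (not taken from any real page). It shows that, once parsed, the document can be walked like a tree: tags have children, parents, and attributes.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
html = "<html><body><div id='main'><p class='title'>Hello</p><p>World</p></div></body></html>"
tree = BeautifulSoup(html, 'html.parser')

div = tree.body.div                                  # walk down the tree by tag name
child_names = [child.name for child in div.children]  # direct children of the <div>
parent_name = tree.p.parent.name                     # walk back up from the first <p>

print(div['id'])
print(child_names)
print(parent_name)
```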


## Extract Information
A __soup__ object has many methods to find and extract information.

- `<soup>.find(<tag>, {<attribute1>:<value1>, <attribute2>:<value2>})`: Return the FIRST occurrence of `<tag>` with `<attributes>` equal to `<values>` in the `<soup>` object. Output: Tag object.  

- `<soup>.find_all(<tag>, {<attribute1>:<value1>, <attribute2>:<value2>})`: Return ALL occurrences of `<tag>` with `<attributes>` equal to `<values>` in the `<soup>` object. Output: ResultSet (a list) containing zero or more Tag objects.

- `<soup>.<tag>`: Return the FIRST occurrence of `<tag>` in the `<soup>` object. Output: Tag object.
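
The three lookups above can be tried on a small, made-up snippet without downloading anything. The sketch also shows what happens when nothing matches: `find` returns `None` and `find_all` returns an empty list, which matters later when we chain calls such as `.text` on the result.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates a product listing
html = """
<div class="product-item"><p class="title">Book A</p></div>
<div class="product-item"><p class="title">Book B</p></div>
"""
demo = BeautifulSoup(html, 'html.parser')

first = demo.find('div', {'class': 'product-item'})          # first match only
all_items = demo.find_all('div', {'class': 'product-item'})  # every match
first_p = demo.p                                             # shortcut for demo.find('p')

missing = demo.find('div', {'class': 'no-such-class'})       # no match: None
none_found = demo.find_all('span')                           # no match: empty list

print(first.p.text, len(all_items), first_p.text, missing, len(none_found))
```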

In [42]:
# First occurrence of the tag, without filtering by attribute
first_p = soup.p
print("\nFirst occurrence of the tag w/o attribute:")
# print("Type:", type(first_p))
print(first_p)

# First occurrence of the tag with the given attributes
first_product = soup.find('div', {'class':'product-item'})
print("\nFirst occurrence of the tag:")
# print("Type:", type(first_product))
print(first_product)


# All occurrences of the tag
print("\nAll occurrences of the tag:")
products = soup.find_all('div', {'class':'product-item'})
# print("Type:", type(products))
# print(products)


First occurrence of the tag w/o attribute:
<p><a href="https://chrome.google.com/webstore/detail/tiki-assistant/ncpaceoemnbcjffjpjcgnbaklmkhdmak?hl=en-US&amp;gl=VN&amp;authuser=1" target="_blank">Tiki Assistant</a> là tiện ích chạy trực tiếp trên trình duyệt Chrome -  giúp gợi ý &amp; tìm kiếm nhanh các sản phẩm tốt nhất trên Tiki, phù hợp với nhu cầu tìm kiếm sản phẩm của bạn.</p>

First occurrence of the tag:
<div class="product-item" data-brand="" data-category="Nhà Sách Tiki/English Books/Fiction - Literature/Romance" data-id="2048897" data-price="184800" data-score="" data-seller-product-id="2051203" data-title="Call Me By Your Name" product-sku="3381544740759" rel="">
<a class="" data-id="2048897" href="/call-me-by-your-name-p2048897.html?spid=2051203&amp;src=category-page-8322&amp;2hi=0" title="Call Me By Your Name">
<div class="content">
<span class="image">
<img alt="" class="product-image img-responsive" src="https://salt.tikicdn.com/cache/280x280/ts/product/ff/26/a2/fdf754ec5

In [96]:
print(first_product.prettify())
print("title", first_product.find('p', {'class': 'title'}).text.strip())
print('img_url', first_product.img['src'])
print("url", first_product.find("a")['href'])
print("author", first_product.find("p", {'class': 'author'}).text)
print("price_sale", first_product.find("span", {'class': 'final-price'}).text.strip().split()[0][:-1].strip())
print("sale", first_product.find("span", {'class': 'sale-tag'}).text)
print("price_regular", first_product.find("span", {'class': 'price-regular'}).text[:-1])
print("rating", first_product.find("span", {'class': 'rating-content'}).span['style'].split(':')[-1][:-1])
print("reviewer", first_product.find("p", {'class': 'review'}).text.strip('(').split()[0])


<div class="product-item" data-brand="" data-category="Nhà Sách Tiki/English Books/Fiction - Literature/Romance" data-id="2048897" data-price="184800" data-score="" data-seller-product-id="2051203" data-title="Call Me By Your Name" product-sku="3381544740759" rel="">
 <a class="" data-id="2048897" href="/call-me-by-your-name-p2048897.html?spid=2051203&amp;src=category-page-8322&amp;2hi=0" title="Call Me By Your Name">
  <div class="content">
   <span class="image">
    <img alt="" class="product-image img-responsive" src="https://salt.tikicdn.com/cache/280x280/ts/product/ff/26/a2/fdf754ec5975dd1738775416e26feceb.jpg"/>
    <span class="product-right-icon">
    </span>
   </span>
   <p class="icons">
   </p>
   <p class="title">
    Call Me By Your Name
   </p>
   <p class="author">
    André Aciman
   </p>
   <p class="price-sale">
    <span class="final-price">
     184.800đ
     <span class="sale-tag sale-tag-square">
      -30%
     </span>
    </span>
    <span class="price-regular

With a Tag object, we can also apply the `.find()` and `.find_all()` methods to search for its child tags.

In [5]:
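# Assumption: first_article (not defined above) is the first <article class="item-news">
# tag of the vnexpress home page, the selector scrape_vnexpress() uses later on.
vnexpress_soup = BeautifulSoup(requests.get('https://vnexpress.net/').text, 'html.parser')
first_article = vnexpress_soup.find('article', {'class':'item-news'})

# <p class='description'> is a child tag of each article
description_tag = first_article.find('p', {'class':'description'})

print(description_tag.prettify())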

<p class="description">
 <a data-medium="Item-1" data-thumb="1" href="https://vnexpress.net/hon-21-000-nguoi-ha-noi-da-xet-nghiem-nhanh-ncov-4139226.html" title="Hơn 21.000 người Hà Nội đã xét nghiệm nhanh nCoV">
  Trong 21.000 mẫu xét nghiệm nhanh, có hai trường hợp dương tính, tuy nhiên khi làm xét nghiệm khẳng định bằng phương pháp PCR đều có kết quả âm tính.
 </a>
</p>



Our data is contained inside the tags. Depending on which section of the tag contains the data, we will use a different method. 

1. __Data is the CONTENT of the tag__

_E.g. The text that can be seen on the website_

The content of the tag is the text between the opening tag and the closing tag. To get it, we use the `.text` attribute.

In [None]:
# Extract the content of the tag
# The result is a string
description = description_tag.text
print(repr(description))
print(description)
print(type(description))

# Because the result is a string, we can apply string methods to it
# to clean up any strange characters, such as non-breaking spaces ('\xa0')
# description = description.replace('\xa0',' ')
# print(repr(description))
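
As a small illustration of cleaning tag content, the sketch below uses a made-up `<p>` tag containing a non-breaking space. It compares the raw `.text` value with `get_text(strip=True)`, which strips the whitespace around each piece of text before joining them.

```python
from bs4 import BeautifulSoup

# Made-up tag with surrounding newlines and a non-breaking space (\xa0)
snippet = '<p class="description">\n <a href="#">Price:\xa0100</a>\n</p>'
tag = BeautifulSoup(snippet, 'html.parser').p

raw = tag.text                                          # keeps newlines and \xa0
clean = tag.get_text(strip=True).replace('\xa0', ' ')   # trimmed, \xa0 replaced by a space

print(repr(raw))
print(repr(clean))
```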

2. __Data is the VALUE of an ATTRIBUTE of the tag.__

_E.g. URLs, id, hidden information,..._

Sometimes the information we need is the value of an attribute located inside the opening tag. To get it, we index the tag with the attribute name in square brackets `[ ]`, as we would with a dictionary.

In [6]:
# Select the <a> tag inside description_tag
a_tag = description_tag.a 
print('Content:', a_tag)

# Extract the value of an attribute
title = a_tag['title']
print('Title:', repr(title))

# The result is also a string
print('Type:', type(title))

Content: <a data-medium="Item-1" data-thumb="1" href="https://vnexpress.net/hon-21-000-nguoi-ha-noi-da-xet-nghiem-nhanh-ncov-4139226.html" title="Hơn 21.000 người Hà Nội đã xét nghiệm nhanh nCoV">Trong 21.000 mẫu xét nghiệm nhanh, có hai trường hợp dương tính, tuy nhiên khi làm xét nghiệm khẳng định bằng phương pháp PCR đều có kết quả âm tính.</a>
Title: 'Hơn 21.000 người Hà Nội đã xét nghiệm nhanh nCoV'
Type: <class 'str'>
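
Indexing with `[ ]` raises a `KeyError` when the attribute is absent, just like a dictionary. The sketch below, on a made-up `<a>` tag, shows the dictionary-style `.get()` alternative, which returns `None` (or a default) instead, and the `.attrs` dictionary that holds all attributes. Note that `class` is returned as a list because a tag can carry several classes.

```python
from bs4 import BeautifulSoup

# Made-up link used only for illustration
a_demo = BeautifulSoup(
    '<a class="btn primary" href="/item-1.html" title="Item 1">Item 1</a>',
    'html.parser',
).a

href = a_demo['href']                                   # raises KeyError if missing
missing = a_demo.get('data-src')                        # None when the attribute is absent
fallback = a_demo.get('data-src', a_demo.get('href'))   # use href when data-src is absent
classes = a_demo['class']                               # multi-valued attribute: a list
all_attrs = a_demo.attrs                                # every attribute as a dict

print(href, missing, fallback, classes, sorted(all_attrs))
```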


## Putting it all together!

Now that we have learned how to find and extract information with BeautifulSoup, let's write a program to meet the requirements!

### Main Component of the Scraper

In [114]:
# We want to save information about all products in a list
data = []

# Find all products
products = soup.find_all('div', {'class':'product-item'})
# print(len(products))
# Extract information of each product
for product in products:
    
    # Each product is a dictionary containing the required information
    d = {'title':'', 'author':'', 'price_sale': 0, 'sale': 0, 'price_regular': 0, 'rating': 0, 'reviewers': 0, 'img_url':'', 'url':''}

    # We use try-except blocks so that one missing field does not stop the loop
    try:
        # Use strip() on all text to remove the surrounding whitespace

        # Get the title and strip the whitespace around it
        try:
            d['title'] = product.find('p', {'class': 'title'}).text.strip()
        except:
            print('Wrong with title')
            print(product)
            
        try:
            d['img_url'] = product.img['src'].strip()
        except:
            print('Wrong with img_url')
            print(product)
            
        try:
            d['url'] = product.find("a")['href'].strip()
        except:
            print('Wrong with url')
            print(product)
            
        try:
            author = product.find('p', {'class': 'author'})
            if author:
                d['author'] = author.text.strip()
        except:
            print('Wrong with author')
            print(product)
            
        # Get the sale price: strip the surrounding whitespace, take the first token and drop the trailing 'đ'
        # Remove the '.' thousands separators, otherwise casting to int raises an error
        try:
            d['price_sale'] = int(product.find('span', {'class': 'final-price'}).text.strip().split()[0][:-1].strip().replace('.',''))
        except:
            print('Wrong with price_sale')
            print(product)
            
        try:
            
            sale = product.find('span', {'class': 'sale-tag'})
            if sale:
                d['sale'] = int(sale.text.strip(' -%'))
        except:
            print('Wrong with sale')
            #print(product.find('span', {'class': 'sale-tag'}).text.strip())
            print(product)
            
        try:
            price_regular = product.find('span', {'class': 'price-regular'})
            if price_regular:
                d['price_regular'] = int(price_regular.text.strip()[:-1].replace('.',''))
        except:
            print('Wrong with price_regular')
            print(product)
            
        # Get the rating: split the style string on ':', take the last part and remove the trailing '%'
        try:

            rating = int(product.find('span', {'class': 'rating-content'}).span['style'].split(':')[-1][:-1].strip())
            # The rating is a percentage where 100% means 5 stars, so 1 star is 20%: divide by 20
            d['rating'] = round(rating / 20, 1)
        except:
            print('Wrong with rating')
            print(product)
            
        try:
            d['reviewers'] = int(product.find('p', {'class': 'review'}).text.strip('(').split()[0].strip())
        except:
            print('Wrong with reviewers')
            print(product)
        # Append the dictionary to data list
        data.append(d)
        
    except:
        # Skip if error and print error message
        print(product)
#         print("We got one product error!")
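
The loop above repeats the same find-then-clean pattern for every field. One possible way to shorten it is sketched below; `text_or_default`, `parse_price`, and `parse_rating` are hypothetical helper names, the sample markup is made up to imitate the product printed earlier, and the `'width:84%'` style format is an assumption about the page.

```python
import re
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_name, default=''):
    """Return the stripped text of the first matching child tag, or default."""
    tag = parent.find(name, {'class': class_name})
    return tag.text.strip() if tag is not None else default


def parse_price(text):
    """Turn a price such as '184.800đ' into the integer 184800 (0 if empty)."""
    tokens = text.split()
    digits = re.sub(r'[^0-9]', '', tokens[0]) if tokens else ''
    return int(digits) if digits else 0


def parse_rating(style):
    """Turn a style such as 'width:84%' into a rating between 0 and 5 stars."""
    percent = int(style.split(':')[-1].strip().rstrip('%;'))
    return round(percent / 20, 1)


# Made-up product markup that mimics the structure printed earlier
sample = BeautifulSoup('''
<div class="product-item">
  <p class="title"> Call Me By Your Name </p>
  <p class="price-sale"><span class="final-price">184.800đ
    <span class="sale-tag">-30%</span></span></p>
</div>''', 'html.parser').div

title = text_or_default(sample, 'p', 'title')
author = text_or_default(sample, 'p', 'author')   # the tag is missing, so we get ''
price = parse_price(text_or_default(sample, 'span', 'final-price'))
rating = parse_rating('width:84%')
print(title, repr(author), price, rating)
```

With helpers like these, a missing tag yields a default value instead of an exception, so most of the per-field `try`/`except` blocks become unnecessary.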

In [115]:
# Here is the list of product dictionaries we wanted
data

[{'title': 'Call Me By Your Name',
  'author': 'André Aciman',
  'price_sale': 184800,
  'sale': 30,
  'price_regular': 264000,
  'rating': 4.2,
  'reviewers': 39,
  'img_url': 'https://salt.tikicdn.com/cache/280x280/ts/product/ff/26/a2/fdf754ec5975dd1738775416e26feceb.jpg',
  'url': '/call-me-by-your-name-p2048897.html?spid=2051203&src=category-page-8322&2hi=0'},
 {'title': "Oxford Advanced Learner's Dictionary 8th...",
  'author': 'Oxford University Press',
  'price_sale': 516700,
  'sale': 13,
  'price_regular': 595000,
  'rating': 4.4,
  'reviewers': 84,
  'img_url': 'https://salt.tikicdn.com/cache/280x280/ts/product/55/26/7c/3ad6cf393f130bda73bd9fa0bc3ce5a9.jpg',
  'url': '/oxford-advanced-learner-s-dictionary-8th-edition-with-vietnamese-translation-and-cd-rom-paperback-p11119019.html?spid=11119020&src=category-page-8322&2hi=0'},
 {'title': 'Vui Vẻ Không Quạu Nha - Tản Văn',
  'author': 'Ở Đây Zui Nè',
  'price_sale': 44850,
  'sale': 35,
  'price_regular': 69000,
  'rating': 4.6,
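
The `url` values in the output above are relative paths, so they cannot be opened on their own. A minimal sketch of one way to turn them into absolute URLs with the standard library's `urljoin` follows; the second row is a made-up entry added to show that absolute URLs pass through unchanged.

```python
from urllib.parse import urljoin

BASE_URL = 'https://tiki.vn'

rows = [
    {'title': 'Call Me By Your Name',
     'url': '/call-me-by-your-name-p2048897.html?spid=2051203&src=category-page-8322&2hi=0'},
    {'title': 'Already absolute (made-up row)',
     'url': 'https://tiki.vn/some-product.html'},
]

# urljoin resolves relative paths against the base and leaves absolute URLs alone
for row in rows:
    row['url'] = urljoin(BASE_URL, row['url'])

for row in rows:
    print(row['url'])
```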

### Package into Functions

As good developers, we should package our code into reusable functions.

In [None]:
# Import Library
import requests
from bs4 import BeautifulSoup

def get_url(url):
    """Get parsed HTML from url
      Input: url to the webpage
      Output: Parsed HTML text of the webpage
    """
    # Send GET request
    r = requests.get(url)

    # Parse HTML text
    soup = BeautifulSoup(r.text, 'html.parser')

    return soup

def scrape_vnexpress(url="https://vnexpress.net/"):
    """Scrape the home page of vnexpress
      Input: url to the webpage. Default: https://vnexpress.net/
      Output: A list containing scraped data of all articles
    """

    # Get parsed HTML
    soup = get_url(url)


    # Find all article tags
    articles = soup.find_all('article', {'class':'item-news'})

    # List containing data of all articles
    data = []

    # Extract information of each article
    for article in articles:

        d = {'title':'', 'url':'', 'image_url':'', 'description':''}
        
        try:
            d['title'] = article.a['title']
            d['url'] = article.a['href']
            # It is fine to clean the data inside the scraper,
            # but it is usually better to do it afterwards
            d['description'] = article.p.text.replace('\xa0','').strip('\n')
            if article.img:
                # Assumption: the image URL may be in data-src; fall back to src otherwise
                d['image_url'] = article.img.get('data-src') or article.img.get('src', '')

            # Append the dictionary to data list
            data.append(d)
        except (AttributeError, KeyError, TypeError):
            # Skip the article if a tag or attribute is missing and print an error message
            print("We got one article error!")

    return data
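
One benefit of splitting the work into functions is that the parsing step can be checked without touching the network. The sketch below assumes a hypothetical `parse_articles` function that takes an HTML string instead of a URL; the sample markup is made up and only imitates the `<article class="item-news">` structure used above.

```python
from bs4 import BeautifulSoup


def parse_articles(html):
    """Hypothetical offline variant: extract title and url from an HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for article in soup.find_all('article', {'class': 'item-news'}):
        link = article.find('a')
        # Skip articles that have no usable link instead of raising an error
        if link is None or not link.get('href'):
            continue
        rows.append({'title': link.get('title', ''), 'url': link['href']})
    return rows


# Made-up markup: one valid article, one without a link, one with another class
SAMPLE_HTML = '''
<article class="item-news"><a href="https://example.com/a.html" title="First">First</a></article>
<article class="item-news"><p>No link here</p></article>
<article class="other"><a href="https://example.com/b.html" title="Other">Other</a></article>
'''

rows = parse_articles(SAMPLE_HTML)
print(rows)
```

A function like this can be fed the text returned by the download step, while small fixed snippets serve as quick checks when the site's markup changes.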

### Test the Scraper

In [None]:
# Test the scraper
data = scrape_vnexpress()
data

### Convert your result to a pandas DataFrame

In [None]:
# Save data to a DataFrame
import pandas as pd

# The column names are inferred from the dictionary keys
articles = pd.DataFrame(data)

In [None]:
type(articles)
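
As a small illustration with made-up rows, a list of dictionaries converts directly into a DataFrame whose columns are the dictionary keys. Passing `columns` explicitly is one way to keep the expected columns even when the scraper returns an empty list.

```python
import pandas as pd

# Made-up rows in the same shape as the scraper output
rows = [
    {'title': 'A', 'url': '/a'},
    {'title': 'B', 'url': '/b'},
]

df = pd.DataFrame(rows)                             # columns come from the keys
empty = pd.DataFrame([], columns=['title', 'url'])  # no rows, but the columns exist

print(list(df.columns), len(df))
print(list(empty.columns), len(empty))
```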

## Store your result

### Option 1: Store your pandas DataFrame in a pickle file and load it again




In [None]:
articles.to_pickle("./result.pkl")

In [None]:
unpickled_result = pd.read_pickle("./result.pkl")

In [None]:
unpickled_result

### Option 2: Store your pandas DataFrame in a CSV file




In [None]:
# This writes "result.csv" to your current folder; download it if you want to keep a copy
articles.to_csv("./result.csv", index=False)
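
To check that a CSV export can be read back, here is a minimal round-trip sketch. It uses made-up rows and an in-memory buffer instead of a real file, so nothing is written to disk.

```python
import io

import pandas as pd

# Made-up rows, including Vietnamese text to show that it survives the round trip
df = pd.DataFrame([
    {'title': 'Vui Vẻ Không Quạu Nha', 'price_sale': 44850},
    {'title': 'Call Me By Your Name', 'price_sale': 184800},
])

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # same call as above, but into memory
buffer.seek(0)                   # rewind before reading
restored = pd.read_csv(buffer)

print(restored)
```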