# **Web scraping using BeautifulSoup**

In this tutorial we will develop a web scraper that extracts customer reviews from an online review site.

## Load libraries

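For this scraper we will use the BeautifulSoup and requests Python packages, plus pandas to tabulate and save the results. All of them come pre-installed with Google Colab; if you are using your own environment, you will need to install them yourself (BeautifulSoup is published on PyPI under the name beautifulsoup4).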

In [0]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Design the scraper

Specify the URL you want to scrape data from.
The following commands fetch the HTML of the page at that URL and parse it into the object 'soup', which we can then use to extract data.

In [0]:
url = 'https://www.productreview.com.au/listings/nab-national-australia-bank'

In [0]:
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
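
The fetch above assumes the request succeeds. As a minimal sketch, the same step can be wrapped in a helper that stops with an error when the server returns a 4xx or 5xx status. The name fetch_soup and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper).

    Raises requests.HTTPError if the server answers with a 4xx/5xx status,
    so a failed download is not silently parsed as if it were the real page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With this helper, `soup = fetch_soup(url)` could stand in for the two lines in the cell above.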

## **Scraping data from HTML elements**

We need to extract the reviews from this page. 

First, find the HTML element that contains the reviews. 
Open the website in Google Chrome or Firefox, right-click a review, choose Inspect, and identify the HTML element that contains the review.

The method **findAll** (the older spelling of **find_all**, which BeautifulSoup 4 still accepts) finds all the elements that match the given tag name and attributes. In this case, it fetches all the div elements with the attribute **"itemprop": "review"** and returns them as a list.
We assign this list to a variable called **all_reviews**.  

This is how the element looks in the browser's inspector:

![alt text](https://i.ibb.co/WD6W0xF/item-prop.png)

In [0]:
all_reviews = soup.findAll("div", {"itemprop": "review"})

In [5]:
print('Length of reviews: {}'.format(len(all_reviews)))

Length of reviews: 20
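
To see the attribute filter in isolation, here is a small sketch that runs the same kind of query against a hand-written HTML snippet. The snippet and the names demo_html, demo_soup and demo_reviews are made up for illustration; only the matching behaviour carries over to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two review blocks and one unrelated block.
demo_html = """
<div itemprop="review"><p>First review</p></div>
<div itemprop="review"><p>Second review</p></div>
<div itemprop="author"><p>Not a review</p></div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# Only the divs whose itemprop attribute equals "review" are returned.
demo_reviews = demo_soup.find_all("div", {"itemprop": "review"})

print(len(demo_reviews))
print([div.p.text for div in demo_reviews])
```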


Observe the structure of one element.

In [6]:
print(all_reviews[5])

<div itemprop="review" itemscope="" itemtype="http://schema.org/Review"><div class="mb-3_X8n card_364 card-full_37I card-full-md_3Xd" id="review-e80ec924-5c23-455b-89da-55617100a9e9"><div class="px-4_1Cw pt-4_9Zz pb-0_pDB trim-y_LxH bg-transparent_126 rounded-top_1Rx no-border_2Lm card-header_2E7"><div class=" media_xhR"><a href="/consumer-profiles/230ea4b0-3858-4bfa-808a-5d04d91606ca"><div class=" cursor--pointer_3im relative_2IG d-inline-block_3nd"><svg class=" absolute-bottom-right_1bQ" height="20" viewbox="0 0 20 20" width="20"><use xlink:href="#badge-facebook"></use></svg></div></a><div class="ml-3_Jy- align-self-center_1t7 flex-basis-auto_2WS flex-column_1v6 d-flex_oSG overflow-x-hidden__yA media-body_15t"><h4 class="my-0_27D align-items-baseline_kxl flex-column_1v6 d-inline-f

Now we need to iterate over the **all_reviews** object and process each element.  
Identify the html elements corresponding to:

*   Title
*   Review text
*   Date
*   Star rating


**Note:** Some values are stored as attributes of an HTML element rather than as its text. In those cases you have to access the value with a different syntax (see the code for the date and the star rating below).
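
The difference between reading an element's text and reading one of its attributes can be sketched on a tiny hand-written snippet. The markup and the class name "title" below are assumptions for illustration and do not match the real site's class names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the title is text, the rating is an attribute.
snippet = """
<div>
  <h3 class="title">Great bank</h3>
  <meta itemprop="ratingValue" content="5"/>
</div>
"""
box = BeautifulSoup(snippet, "html.parser")

# Text between the opening and closing tags is read with .text
title = box.find("h3", {"class": "title"}).text

# Attribute values are read with square brackets, like a dictionary
rating = box.find("meta", {"itemprop": "ratingValue"})["content"]

# find() returns None when nothing matches
missing = box.find("meta", {"itemprop": "datePublished"})

print(title, rating, missing)
```

Note that the attribute value comes back as a string, so the rating here is "5" rather than the number 5.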

In [15]:
all_reviews[0].find('p', {"class": "mb-0_2CX"}).text

'Do not bank with the NAB.  They make promises they can’t keep and then they don’t care. And then they tell you that they can’t advise you when the matter will be resolved. Go with another bank with good customer service!'

In [0]:
reviews = []

for reviewBox in all_reviews:

    # get review title
    review_title = reviewBox.find('h3', {"class": "mb-2_3ol"}).text
    
    # get review
    review_text = reviewBox.find('p', {"class": "mb-0_2CX"}).text
    
    # get date
    # the date is stored in the 'content' attribute of a <meta> element, not in its text.
    # attributes are read with square brackets, like a dictionary:
    review_date = reviewBox.find('meta', {'itemprop': 'datePublished'})['content']

    # get star rating
    review_stars = reviewBox.find('meta', {'itemprop': 'ratingValue'})['content']
    
    # append the processed title, text, date and rating to the reviews list
    reviews.append([review_title, review_text, review_date, review_stars])
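
The loop above raises an error if any review lacks one of the expected elements, because find returns None when nothing matches. One possible way to tolerate gaps, sketched here on hand-written markup, is to route every lookup through small helpers. The helper names text_or_none and attr_or_none, the sample HTML and its class names are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the second review has no title element.
sample_html = """
<div itemprop="review">
  <h3 class="title">Good service</h3>
  <p class="body">Quick and helpful.</p>
  <meta itemprop="ratingValue" content="5"/>
</div>
<div itemprop="review">
  <p class="body">Long wait times.</p>
  <meta itemprop="ratingValue" content="2"/>
</div>
"""


def text_or_none(parent, name, attrs):
    """Return the stripped text of the first match, or None if absent."""
    element = parent.find(name, attrs)
    return element.get_text(strip=True) if element is not None else None


def attr_or_none(parent, name, attrs, attribute):
    """Return an attribute of the first match, or None if absent."""
    element = parent.find(name, attrs)
    return element.get(attribute) if element is not None else None


sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for box in sample_soup.find_all("div", {"itemprop": "review"}):
    rows.append([
        text_or_none(box, "h3", {"class": "title"}),
        text_or_none(box, "p", {"class": "body"}),
        attr_or_none(box, "meta", {"itemprop": "ratingValue"}, "content"),
    ])

print(rows)
```

Missing fields end up as None, which pandas later shows as empty values instead of the whole scrape stopping partway.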

Explore the extracted reviews.

In [17]:
len(reviews)

20

In [0]:
reviews

Compose a dataframe using the review data.

In [0]:
output_column_names = ['title', 'review', 'date', 'stars']
data = pd.DataFrame(reviews, columns=output_column_names)
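
Because the date and the star rating were read from HTML attributes, they arrive in the dataframe as strings. A minimal sketch of converting them to proper types follows; the sample rows and their date format are made up, and the real site's date strings may need different handling.

```python
import pandas as pd

# Hypothetical rows in the same [title, review, date, stars] layout.
sample_reviews = [
    ["Good service", "Quick and helpful.", "2019-08-01", "5"],
    ["Poor support", "Long wait times.", "2019-07-15", "2"],
]
sample = pd.DataFrame(sample_reviews, columns=["title", "review", "date", "stars"])

# Convert the string columns so they can be averaged, sorted and filtered.
sample["stars"] = pd.to_numeric(sample["stars"])
sample["date"] = pd.to_datetime(sample["date"])

print(sample.dtypes)
print(sample["stars"].mean())
```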

Write the dataframe to a CSV file. In Google Colab, the file will appear in the Files tab.

In [0]:
data.to_csv('nab_reviews.csv', index=False)
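
A quick way to confirm the export worked is to read the file back. The sketch below does this with a made-up one-row dataframe written to a temporary directory, so it does not touch the real output file; the file name sample_reviews.csv is an assumption for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical single-row dataframe in the same column layout.
sample = pd.DataFrame(
    [["Good service", "Quick and helpful.", "2019-08-01", 5]],
    columns=["title", "review", "date", "stars"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_reviews.csv")
    sample.to_csv(path, index=False)  # index=False omits the row numbers
    loaded = pd.read_csv(path)

print(loaded.shape)
print(list(loaded.columns))
```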