# WEB SCRAPING


---



The following code:

1. Extracts the short reviews (review titles) of a specified product from Amazon.
2. Extracts the star rating of each collected review.
3. Stores the collected data in a JSON object.


**Importing the libraries**

In [0]:
from bs4 import BeautifulSoup as bsoup
import ssl
import json
import requests as rq
import re

## Preparing the list of web pages to be scraped

The code below uses:

*   The requests library to fetch each web page.
*   The BeautifulSoup package to parse the HTML and extract its contents.
*   A regular expression to identify the page-navigation links.

In [0]:
# The extraction starts with the base URL (the first page of customer reviews of the product)
# Product of choice: Vivo-V9Pro
base_url = 'https://www.amazon.in/Vivo-V9Pro-Black-Snapdragon-660AIE/product-reviews/B07HLNGL6R/ref=dpx_acr_txt?showViewpoints=1'

r = rq.get(base_url)                  # Request the first review page
soup = bsoup(r.text, 'html.parser')   # Parse the returned HTML so its tags can be searched

# Use a regular expression to identify the page-navigation links
page_links = soup.find_all("a", href=re.compile(r'.*/Vivo-V9Pro-Black-Snapdragon-660AIE/product-reviews/B07HLNGL6R.*pageNumber=\d'))

# The label of the second-to-last navigation link holds the last page number;
# default to one page when no navigation links are found
try:
    no = re.sub('[^0-9]', '', page_links[-2].get_text())
    num_pages = int(no)
except IndexError:
    num_pages = 1

# List of the web-page URLs to be scraped
url_list = ["{}&pageNumber={}".format(base_url, page) for page in range(1, num_pages + 1)]

#print(len(url_list))  # Prints the number of pages collected

50
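
The page-count step above can be tried without any network access. The following is a minimal sketch under stated assumptions: `last_page_number` and `build_url_list` are hypothetical helper names, the navigation labels and the example.com URL are made-up sample data, and catching `ValueError` (for a label with no digits) is an addition that the original cell does not make.

```python
import re

def last_page_number(link_texts):
    """Read the page count from the second-to-last navigation label.

    Falls back to 1 when there are too few links (IndexError) or when
    the label contains no digits (ValueError, an added safeguard).
    """
    try:
        digits = re.sub(r'[^0-9]', '', link_texts[-2])
        return int(digits)
    except (IndexError, ValueError):
        return 1

def build_url_list(base_url, num_pages):
    """Append a pageNumber query parameter for every page from 1 to num_pages."""
    return ["{}&pageNumber={}".format(base_url, page)
            for page in range(1, num_pages + 1)]

# Made-up navigation labels, in the order the links appear on the page
sample_labels = ["2", "3", "50", "Next"]
sample_base = "https://example.com/product-reviews/X?showViewpoints=1"  # placeholder URL

pages = last_page_number(sample_labels)
urls = build_url_list(sample_base, pages)
print(pages, len(urls))
```

Because `&pageNumber=` is appended with an ampersand, this approach assumes the base URL already contains a query string, as the `base_url` above does.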


## Ignoring certificate errors

In [0]:
#ignoring SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
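
In the cells shown in this notebook, `ctx` is never passed to a request, and `rq.get` has no parameter for an `ssl` context (certificate checks in `requests` are controlled by its `verify` argument). The sketch below shows one way such a context could take effect, using the standard-library `urllib.request.urlopen`, which does accept a `context` argument. `make_unverified_context` and `fetch_text` are hypothetical helpers, not part of the original code, and `fetch_text` is defined but not called here.

```python
import ssl
import urllib.request

def make_unverified_context():
    """Build an SSL context that skips certificate verification."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # must be disabled before relaxing verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch_text(url, context):
    """Hypothetical helper: download a page using the given SSL context."""
    with urllib.request.urlopen(url, context=context) as response:
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')

ctx = make_unverified_context()
print(ctx.verify_mode == ssl.CERT_NONE)
```

Disabling verification removes protection against tampered responses, so it is best limited to experiments like this one.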

## Creating JSON objects

In [0]:
# The code uses JSON objects to store the Data extracted from the web pages.

product_json = {}
product_json['short-reviews'] = []
product_json['short-reviews-stars'] = [] 
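
To illustrate the shape of this object once it is filled, here is a small sketch with made-up sample data. The review texts and the `"N out of 5 stars"` label format are assumptions for illustration only. It shows that the two lists are parallel (index `k` in one belongs to index `k` in the other) and that the object survives a JSON round trip.

```python
import json

# Same layout as product_json above, filled with made-up sample values
sample_json = {
    'short-reviews': ['Good phone', 'Battery drains fast'],
    'short-reviews-stars': ['5.0 out of 5 stars', '2.0 out of 5 stars'],
}

# The lists are parallel, so zip pairs each review with its rating
paired = [
    {'review': review, 'stars': stars}
    for review, stars in zip(sample_json['short-reviews'],
                             sample_json['short-reviews-stars'])
]

# Serialise the same way the notebook does, then read the text back
serialized = json.dumps(sample_json, indent=4)
restored = json.loads(serialized)

print(paired[0])
```

Keeping two parallel lists is simple, but the pairing holds only if every review yields exactly one rating; a list of per-review dictionaries like `paired` avoids that dependency.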

## Extracting short reviews and their star ratings and saving them to a JSON object

In [0]:
leng = 0      # Number of reviews collected so far, used to identify skipped pages
skipped = []  # Page numbers that yielded no new reviews

with open('Vivo_V9_reviews_.json', 'w') as outfile:        # JSON file to save the data

    for page_number, url in enumerate(url_list, start=1):  # Iterate through all web pages

        #print("\n++++++++++++Processing page ", page_number, "/", len(url_list))   # Status of each page

        page = rq.get(url)                       # Request the page
        soup = bsoup(page.text, 'html.parser')   # Parse the HTML so its tags can be searched

        # Extract the short reviews (review titles) of the product
        for a_tag in soup.find_all('a',
                                   attrs={'class': 'a-size-base a-link-normal review-title a-color-base a-text-bold'}):
            short_review = a_tag.text.strip()
            product_json['short-reviews'].append(short_review)

        # Extract the star rating attached to each short review
        for i_tag in soup.find_all('i', attrs={'data-hook': 'review-star-rating'}):
            span = i_tag.find('span', attrs={'class': 'a-icon-alt'})   # Only the first matching span is used
            if span is not None:
                product_json['short-reviews-stars'].append(span.text.strip())

        # A page that adds no new reviews is treated as skipped; record its page number
        if len(product_json['short-reviews']) == leng:
            #print("\n===============Page ", page_number, " skipped--")
            skipped.append(page_number)
        leng = len(product_json['short-reviews'])
        #print("\nCollected--------Reviews/Rating = ", len(product_json['short-reviews']), "/", len(product_json['short-reviews-stars']))

    # Write the extracted reviews to the file
    json.dump(product_json, outfile, indent=4)

print('\n\n----------Extraction completed with ', len(skipped), ' pages Skipped..Check json file.----------')
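
The star ratings are stored as the text of the `a-icon-alt` span rather than as numbers. As a possible follow-up step, the sketch below converts such a label to a float with a regular expression. `parse_stars` is a hypothetical helper, and the `"4.0 out of 5 stars"` label format is an assumption about the page text rather than something shown in the output of this notebook.

```python
import re

# Matches labels such as "4.0 out of 5 stars" (assumed format)
STAR_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s+out of\s+5')

def parse_stars(label):
    """Return the numeric rating from a star label, or None if it does not match."""
    match = STAR_PATTERN.search(label)
    return float(match.group(1)) if match else None

sample_labels = ['4.0 out of 5 stars', '1.0 out of 5 stars', 'no rating']
ratings = [parse_stars(label) for label in sample_labels]
print(ratings)
```

Returning `None` for unrecognised labels keeps the output list the same length as the input, so the ratings stay aligned with the reviews.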


++++++++++++Processing page  1 / 50

Collected--------Reviews/Rating =  10 / 10

++++++++++++Processing page  2 / 50

Collected--------Reviews/Rating =  20 / 20

++++++++++++Processing page  3 / 50

Collected--------Reviews/Rating =  30 / 30

++++++++++++Processing page  4 / 50

Collected--------Reviews/Rating =  40 / 40

++++++++++++Processing page  5 / 50

Collected--------Reviews/Rating =  50 / 50

++++++++++++Processing page  6 / 50

Collected--------Reviews/Rating =  60 / 60

++++++++++++Processing page  7 / 50

Collected--------Reviews/Rating =  70 / 70

++++++++++++Processing page  8 / 50

Collected--------Reviews/Rating =  80 / 80

++++++++++++Processing page  9 / 50

Collected--------Reviews/Rating =  90 / 90

++++++++++++Processing page  10 / 50

Collected--------Reviews/Rating =  100 / 100

++++++++++++Processing page  11 / 50

Collected--------Reviews/Rating =  110 / 110

++++++++++++Processing page  12 / 50

Collected--------Reviews/Rating =  120 / 120

++++++++++++Proces