## Text Classification
### Task 1 - Data Collection:

**Product Review (2016-2021)** <br/>

In this assignment, I scrape a collection of product reviews from a set of web pages, preprocess the data, and evaluate the performance of different classifiers on review sentiment and review helpfulness.

This notebook covers **Task 1 - Data Collection**. The website contains reviews for every year from 2016 to 2021, spread across multiple pages (30 reviews per page). This task requires parsing every page to collect the review information for all six years. <br><br> The information to extract is:
- The star rating of the review.
- The title of the review.
- The main body of the review.
- Review helpfulness information.
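
One way to picture the four fields listed above is as a single record per review. The sketch below is illustrative only: the `Review` class, its field names, and the sample values are hypothetical, and the notebook itself stores each field in a separate global list rather than in records.

```python
from dataclasses import dataclass, asdict


@dataclass
class Review:
    """Hypothetical container for one scraped review (not used by the notebook)."""
    rating: int        # star rating of the review
    title: str         # title of the review
    body: str          # main body of the review
    helpfulness: str   # raw helpfulness text, kept unparsed at this stage


# Made-up sample values, for illustration only
sample = Review(
    rating=5,
    title="Great product",
    body="Works as described.",
    helpfulness="3 out of 4 users found this review helpful",
)

# asdict() turns the record into a plain dict, one key per field
row = asdict(sample)
print(row["rating"], row["title"])
```

Keeping the fields together per review makes it harder for them to drift out of alignment, which is the situation the length check later in this notebook guards against.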

In [1]:
import requests
from pathlib import Path
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import os

Default settings for the data collection.

In [2]:
# Personal Website Address
base_url = 'http://mlg.ucd.ie/modules/python/assign2/'

# Global list declarations
ref_url_list = []
titles = []
body_text = []
ratings = []
helpfulness = []

Create directory for raw data storage, if it does not already exist:

In [3]:
dir_raw = Path("raw")
dir_raw.mkdir(parents=True, exist_ok=True)

Defining a parseHTML function to parse HTML content for different URL requests.

In [4]:
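def parseHTML(url):
    '''
    Fetch a web page and return its parsed HTML.

    :parameter url: provide the url which requires content parsing
    :return: a BeautifulSoup object for the page
    '''
    # the timeout prevents a request from hanging indefinitely
    response = requests.get(url, timeout=30)
    # validating the response code to allow for parsing
    if response.status_code == 200:
        content = response.content.decode('utf-8')
        parsed_content = BeautifulSoup(content, "html.parser")
        return parsed_content
    else:
        raise Exception("Web page not found (status code %d): %s" % (response.status_code, url))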

Defining a parse_webpage_links function to parse base URL content to fetch embedded href links.

In [5]:
def parse_webpage_links(url):
    '''
    Collect the links embedded in the base page.

    :parameter url: provide the url which requires content parsing
    :return: the global list of reference urls
    '''
    # Calling parseHTML function to parse the webpage content
    parsed_content = parseHTML(url)
    
    # Finding all anchor tags within the parsed content to fetch href links
    ref_links = parsed_content.find_all('a')
    
    # Iterating through the anchor tags to fetch href links
    for link in ref_links:
        href = link.get('href')
        # skipping anchors without an href and links back to the index (base) page
        if href is None or 'index' in href:
            continue
        ref_url_list.append(url + str(href))
    
    print("Successfully fetched all the reference links from the base url")
    return ref_url_list
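
The function above builds each link by concatenating the base url and the href, which works when every href is relative to the base page. As a minimal sketch of an alternative, `urllib.parse.urljoin` from the standard library resolves relative and absolute hrefs alike. The helper name `resolve_links` and the sample href values are assumptions for illustration, not taken from the actual page.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping missing hrefs and index pages."""
    resolved = []
    for href in hrefs:
        if href is None or 'index' in href:
            continue
        # urljoin handles relative paths and leaves absolute urls untouched
        resolved.append(urljoin(base_url, href))
    return resolved


base = 'http://mlg.ucd.ie/modules/python/assign2/'
# Hypothetical href values; the real page may use different ones
links = resolve_links(base, ['index.html', None, 'reviews-2016-jan-01.html'])
print(links)
```

With plain concatenation an absolute href would be appended to the base url and produce an invalid address, whereas `urljoin` returns it unchanged.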

Defining a parseReview function to parse reference URL content to fetch review contents.

In [6]:
def parseReview(url):
    '''
    :parameter url: provide the url which requires content parsing
    '''
    # Calling parseHTML function to parse the reference webpage content
    parsed_content = parseHTML(url)
    
    # Parsing review titles from the reference link and storing them in the global list
    rev_title = parsed_content.find_all('h5')
    for title in rev_title:
        titles.append(title.get_text(strip=True))
    
    # Parsing review body text from the reference link and storing it in the global list
    rev_body = parsed_content.find_all('p',{"class":"review-body"})
    for body in rev_body:
        body_text.append(body.get_text(strip=True))
    
    # Parsing review ratings from the reference link and storing them in the global list
    rev_rating = parsed_content.find_all('img',alt=True)
    for rating in rev_rating:
        ratings.append(int(rating.get('alt')[0]))

    # Parsing review helpfulness information from the reference link and storing them in the global list
    rev_help = parsed_content.find_all('p',{"class":"metadata"})
    for helpf in rev_help:
        if 'users found this review helpful' in helpf.text:
            helpfulness.append(helpf.get_text(strip=True))
    
    print("Successfully Parsed %s" % url)
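
Note that `find_all('img', alt=True)` matches every image with an `alt` attribute, and `int(rating.get('alt')[0])` assumes that each of those alt texts begins with a digit. A more defensive version of that step might look like the sketch below. The helper name `parse_star_rating` and the sample alt values are hypothetical, and unlike the original it reads the whole leading integer rather than only the first character.

```python
import re


def parse_star_rating(alt_text):
    """Return the leading integer of an img alt text, or None if there is none."""
    if not alt_text:
        return None
    match = re.match(r'\s*(\d+)', alt_text)
    return int(match.group(1)) if match else None


# Hypothetical alt values; the actual pages may word them differently
print(parse_star_rating('4-star'))
print(parse_star_rating('logo'))
```

A caller using this helper would need to skip `None` results so that the ratings list stays aligned with the other lists.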


Function to generate a csv dump for the parsed review data

In [7]:
def generate_csv():
    fname = "review_data.csv"
    out_path = dir_raw / fname
    print("Writing data to %s" % out_path)
    review_df = pd.DataFrame({'Title':titles,'Review':body_text,'Rating':ratings,'Helpfulness_Info':helpfulness})
    review_df.to_csv(out_path, index=False)

Fetching monthly webpage links for all the six years

In [8]:
ref_url_list = []
yearwise_links = parse_webpage_links(base_url)

Successfully fetched all the reference links from the base url


Collecting the review information using the parser

In [9]:
titles = []
body_text = []
ratings = []
helpfulness = []

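# Iterating through each of the reference links
for yw_link in yearwise_links:
    # Splitting the link into (path without extension, extension)
    link_parts = os.path.splitext(yw_link)
    # Iterating through the pages (page numbers 1 to 7 are tried for each link)
    for i in range(1,8):
        # Recreating the link for each page by replacing the last digit of the page number
        reformed_link = link_parts[0][:-1] + str(i) + link_parts[1]
        try:
            parseReview(reformed_link)
        except Exception:
            # parseHTML raises when a page does not exist, so stop paging for this link
            break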

Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-jan-01.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-jan-02.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-jan-03.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-jan-04.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-jan-05.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-jan-06.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-feb-01.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-feb-02.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-feb-03.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-feb-04.html
Successfully Parsed http://mlg.ucd.ie/modules/pyth

Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-jul-03.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-jul-04.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-aug-01.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-aug-02.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-aug-03.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-aug-04.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-aug-05.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-aug-06.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-sep-01.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2017-sep-02.html
Successfully Parsed http://mlg.ucd.ie/modules/pyth

Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-jan-03.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-jan-04.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-jan-05.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-feb-01.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-feb-02.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-feb-03.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-feb-04.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-mar-01.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-mar-02.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2019-mar-03.html
Successfully Parsed http://mlg.ucd.ie/modules/pyth

Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-jul-02.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-jul-03.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-jul-04.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-aug-01.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-aug-02.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-aug-03.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-aug-04.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-aug-05.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-sep-01.html
Successfully Parsed http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2020-sep-02.html
Successfully Parsed http://mlg.ucd.ie/modules/pyth
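
The page links in the loop above are rebuilt with `os.path.splitext` and string slicing. The sketch below isolates that step so it can be checked on a single url. The helper name `page_url` is hypothetical, and it assumes, as the logged urls suggest, that each link ends in a two-digit page number followed by `.html`.

```python
import os


def page_url(first_page_url, page_number):
    """Rebuild a page url the way the loop does: replace the last character
    before the extension with the page number."""
    stem, ext = os.path.splitext(first_page_url)
    return stem[:-1] + str(page_number) + ext


first = 'http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-jan-01.html'
print(page_url(first, 3))
# → http://mlg.ucd.ie/modules/python/assign2/21200390/reviews-2016-jan-03.html
```

Because only one character is replaced, this approach is correct for single-digit page numbers only (page 10 would yield `-010.html`). That is sufficient here, since the loop tries pages 1 to 7.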

Validating if all the reviews have been collected successfully

In [10]:
# Validating if the collected review information is equal to the total number of reviews (9244)
if len(titles) == len(body_text) == len(ratings) == len(helpfulness) == 9244:
    print("Review collection successful")
else:
    print("Mismatch in the collected numbers")

Review collection successful
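
The check above reports only whether all four lists match the expected total. When it fails, it can help to see which list is off. The following is a minimal sketch of such a check, where the helper name `check_lengths` and the toy lists are made up for illustration.

```python
def check_lengths(expected, **columns):
    """Return a dict of column names whose length differs from expected."""
    return {name: len(values) for name, values in columns.items()
            if len(values) != expected}


# Made-up toy lists, for illustration only
mismatches = check_lengths(2, titles=['a', 'b'], ratings=[5, 4], helpfulness=['x'])
print(mismatches)
# → {'helpfulness': 1}
```

In this notebook the call would be along the lines of `check_lengths(9244, titles=titles, body_text=body_text, ratings=ratings, helpfulness=helpfulness)`, and an empty dict would mean every list has the expected length.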


Calling the generate csv function if review collection is successful

In [11]:
generate_csv()

Writing data to raw/review_data.csv


Note - In this task, I collected the required review data by parsing all the web pages and wrote it to a CSV dump, **review_data.csv**, saved in the "raw" directory. This dump will be used in Task 2 for classification and analysis.