# WebScraper

## Websites
### Notebook Check

Notebook Check is a website that provides performance benchmarks, comparisons, reviews, buyer's guides and more for a
variety of tech-related products. It states that all writers who contribute articles to the site are independent, which
suggests reduced bias in the pieces written. This site was used as the primary data source because, compared with
similar sites, it offers a very large range of laptop reviews, which are the focus of this analysis. Each review follows
a similar layout, making it simple to segment a review into its sections. Inspecting a couple of extracted reviews
revealed the following structure:

•	Title

•	Introduction

•	Case

•	Connectivity

•	Input devices

•	Display

•	Performance

•	Emissions

•	Energy management

•	Verdict

Each of these is a key aspect that a consumer would be interested in when buying a laptop, which makes the later NLP
tasks easier to perform. These pages also contain a section that gives a brief overview of the key statistics; however,
these will be gathered from the second website, which provides a far more detailed specification breakdown.

### Laptopmedia.com

Laptopmedia is a site that offers reviews, specifications and analysis for a very large range of laptops. Its main
strength is the sheer number of specifications available for different devices. This website was not used for the main
review gathering because it only has reviews for a very limited number of the laptops on the site, and because it makes
heavy use of JavaScript, which makes certain pages hard to navigate and extract content from. The specifications were
gathered by going to the laptop series page, which has a complete list of all the site's laptops. This made it easy to
open the page for each laptop and extract the specification information.

The specification was presented in a table format with labels, making it easy to locate the data and create a title for
each entry. Hyperlinks were also provided for certain aspects, and these were extracted as well for a more complete
data set.

## Copyright

As the information extracted from the website is for personal use, there should be no issues regarding copyright. It is
also not being extracted for any form of financial gain; if it were, the authors and the site owner would need to be
contacted and some form of written permission obtained to use the contents. The same would be required if this program
were made publicly available, or if the contents of this analysis were used in a published academic paper.

## Workflow

To extract the HTML for each page of the website, a combination of Selenium and Beautiful Soup was used. Selenium uses
a webdriver that operates the page the same way a user would: it opens the specified webpage natively in the specified
browser [1]. Chrome was selected as the browser because it was the most familiar (this required a separate installation
and path specification). This approach allows information to be extracted from sites whose contents are only populated
when the page is opened (JavaScript-based sites). Selenium was only used to obtain the complete page HTML, from which
the information was then extracted using Beautiful Soup. Beautiful Soup is a Python library for extracting information
from HTML and XML files [2].

The web crawler works by loading the main search page URL into the webdriver, which returns the page HTML. Beautiful
Soup takes in this HTML and extracts the anchor (a) tags that have a specific class, which was determined by manually
inspecting the structure of the page and where each element is contained. This produces a list of all the laptop links,
which is then looped through with the webdriver to obtain the reviews themselves.
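
The link-collection step can be sketched on a static snippet, without a browser. The HTML below is made up for
illustration; only the two class names are taken from the scraper's code, and `extract_review_links` is a hypothetical
helper name.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the search page HTML returned by the webdriver.
SAMPLE_HTML = """
<div>
  <a class="introa_large introa_review" href="https://example.com/review-a">Laptop A</a>
  <a class="introa_small introa_review" href="https://example.com/review-b">Laptop B</a>
  <a class="introa_small introa_news" href="https://example.com/news">News item</a>
  <a class="introa_small introa_review">Entry without a link</a>
</div>
"""

# Class attribute values used for review links on the search page.
REVIEW_CLASSES = ("introa_large introa_review", "introa_small introa_review")


def extract_review_links(page_html):
    """Return the href of every <a> tag whose class attribute matches a review class."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    links = []
    for css_class in REVIEW_CLASSES:
        # href=True skips anchors that carry the class but have no link target.
        for a in page_soup.find_all("a", {"class": css_class}, href=True):
            links.append(a["href"])
    return links


links = extract_review_links(SAMPLE_HTML)
print(links)
```

In the real crawler the same function would be fed `driver.page_source` instead of the sample string.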

For each of the laptop links a similar process was followed: the page HTML was extracted and then analysed based on the
tags and their classes. In this case the information was contained within div tags. The title was easy to extract
because it sat in an isolated tag, and the intro/overview of the review was in a div with a specific class. Another
search extracted all the div tags with the class shared by every text paragraph in the review. This information was
segmented into the different review areas based on the contents of the h2 tags (the heading style used for the section
titles). Some string formatting was performed and the result was then appended to the laptops list.

Once the information for all the laptops had been extracted, it was written to a CSV file using Python's built-in csv
writer. Throughout this process the tqdm package was used, which displays the progress of a loop along with the elapsed
time and the estimated time remaining. The same process was followed for Laptopmedia, the only differences being the
classes used and the depth traversed in the HTML.
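
A minimal sketch of the CSV step, assuming the scraped reviews are held as one list per laptop in column order. The
three column labels are a shortened selection of the scraper's header, the row values are invented, and an in-memory
buffer stands in for the output file. `write_rows` is a hypothetical helper name.

```python
import csv
import io

labels = ["Title", "Intro", "Verdict"]  # shortened header, for illustration only
rows = [
    ["Laptop A review", "Intro text, with a comma", "Good all-rounder"],
    ["Laptop B review", "Another intro", "none"],  # "none" marks a missing section
]


def write_rows(handle, header, data):
    """Write a header row followed by one row per laptop."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(data)


# newline="" lets the csv module control line endings, as with a real file.
buffer = io.StringIO(newline="")
write_rows(buffer, labels, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The csv writer quotes fields that contain commas, so free-form review text survives a round trip through the file.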

## Data Extraction

When the HTML containing the text was analysed, it always followed the same structure: all the text of interest was
contained within a div tag with the id “content”. Each segment of text was placed in a separate div with a class of
“ttcl_number csc-default”, where the number indicates the type of content in that segment: 1 denoted the main caption
of the review/article, while 0 denoted the content of interest. To get the content of interest, the find_all() method
was therefore called on the soup for div tags with the class “ttcl_0 csc-default”. This returned a list of div tags
whose contents could be further organised and analysed to extract the text. For each of these tags a sub-soup was
created, on which simple find() methods could be run to extract the text contents and append them to a list depending
on whether the div had a heading (that is, whether an h2 tag existed).
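
The segmentation described above can be tried on a small, invented page fragment. Only the class names and the id come
from the text; the sample content and the dictionary layout of `segments` are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics the structure described above.
SAMPLE_HTML = """
<div id="content">
  <div class="ttcl_1 csc-default"><h1>Example Laptop Review</h1></div>
  <div class="ttcl_0 csc-default">
    <h2>Display</h2>
    <div class="csc-textpic-text"><p>Bright panel.</p></div>
  </div>
  <div class="ttcl_0 csc-default">
    <div class="csc-textpic-text"><p>Good colours.</p></div>
  </div>
</div>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

segments = []
# Only the "ttcl_0" blocks hold review content; the "ttcl_1" caption is skipped.
for div in page_soup.find_all("div", {"class": "ttcl_0 csc-default"}):
    h2 = div.find("h2")
    text_div = div.find("div", {"class": "csc-textpic-text"})
    segments.append({
        # None means the block continues the previous section.
        "heading": h2.get_text(strip=True) if h2 else None,
        "text": text_div.get_text(" ", strip=True) if text_div else "",
    })

print(segments)
```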

The contents of these heading tags were compared against a set of keywords that were chosen during the initial
inspection of the site HTML. Each match assigned an index specifying the position in the laptop list that the text
would be appended to. The laptop list was organised by section heading, which made it easier to tell what data was
where and simplified the final saving of the data to the CSV file. If no h2 tag was found in the div, the contained
text was treated as a continuation of the current section and the index was left unchanged.
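
As a sketch of this lookup, the keyword table below mirrors the headings checked in the scraper's code, while the
function name `section_index` and the sample heading sequence are hypothetical.

```python
LABELS = ["Title", "Intro", "Specifications", "Case", "Connectivity", "Input devices",
          "Display", "Performance", "Emissions", "Energy management", "Verdict"]

# Keywords looked for in a section heading, paired with the column they map to.
SECTION_KEYWORDS = [
    (("Case", "Chassis"), 3),
    (("Connectivity", "Equipment"), 4),
    (("Input",), 5),
    (("Display",), 6),
    (("Performance",), 7),
    (("Emissions",), 8),
    (("Energy",), 9),
    (("Verdict",), 10),
]


def section_index(heading, current_index):
    """Return the column for a heading, or keep the current one if nothing matches."""
    for keywords, index in SECTION_KEYWORDS:
        if any(keyword in heading for keyword in keywords):
            return index
    return current_index


index = 0
placements = []
# None stands for a block without an h2 tag, i.e. a continuation of the section.
for heading in ["Case", "Connectivity", None, "Display", "Speakers"]:
    if heading is not None:
        index = section_index(heading, index)
    placements.append(LABELS[index])

print(placements)
# → ['Case', 'Connectivity', 'Connectivity', 'Display', 'Display']
```

An unrecognised heading such as "Speakers" leaves the index untouched, so its text is filed under the previous section.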

The raw text was extracted by finding the div tag with the class “csc-textpic-text” inside each block and collecting
all the text nodes within it. Some initial formatting had to be performed on the text blocks, such as removing any
newline characters and making sure no double spaces were present. Also, if no text content was found for a certain
heading (that is, if only an image sits under the heading but the heading still exists), the content was simply set to
the current heading.
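
A minimal sketch of this clean-up, under the assumption that each block arrives as a list of text fragments.
`clean_block` is a hypothetical helper, and it collapses every run of whitespace into a single space, which is a slight
variation on deleting newline characters outright.

```python
def clean_block(fragments, heading):
    """Join text fragments into one tidy string, falling back to the heading if empty."""
    # split() with no arguments drops newlines, tabs and repeated spaces in one pass.
    text = " ".join(" ".join(fragments).split())
    return text if text else heading


print(clean_block(["Bright\npanel. ", "  Good colours."], "Display"))
# → Bright panel. Good colours.
print(clean_block(["\n", "  "], "Display"))
# → Display
```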

## Final Data

Once this whole process was complete, the contents of the laptop array were written to a CSV file. The final data set
comprised 750 entries with 11 columns. One of the columns, labelled “Specifications”, would be substituted with data
from the second crawler, which gathered only specification information. All fields contained textual data organised by
the review sections from which they came, such as “Display”, “Performance” and “Case”.

![Figure 1 - Final dataset](DataTableFigure.png)

The data set was not as large as desired because only one website was used; in a finalised project, however, this
scraper would be run on multiple review sites.

## Running

To run this web crawler, the following packages are required:

•	bs4

•	selenium

•	tqdm (to view the progress)

It also requires the ChromeDriver that matches the user's current version of Chrome to be downloaded and added to the
working directory.

![Figure 2 - Webscraper working](WebScraperWorking.PNG)

The scraper opens the actual web page for each of the laptops and then scrapes the data. As can be seen in the progress
bar, it has scraped 3 out of 786 laptops and has an estimated 1 hour and 31 minutes remaining.

## Code

```python
import csv

from bs4 import BeautifulSoup
from selenium import webdriver
from tqdm import tqdm

# ChromeDriver version 90 was required (it must match the installed Chrome version).

# The two search-result pages that list the laptop reviews.
SEARCH_URLS = [
    "https://www.notebookcheck.net/Reviews.55.0.html?&items_per_page=500&hide_youtube=1"
    "&ns_show_num_normal=1&hide_external_reviews=1&introa_search_title=laptop%20review"
    "&tagArray[]=16&typeArray[]=1",
    "https://www.notebookcheck.net/Reviews.55.0.html?&items_per_page=500&hide_youtube=1"
    "&ns_show_num_normal=1&hide_external_reviews=1&page=1&introa_search_title=laptop%20review"
    "&tagArray[]=16&typeArray[]=1",
]

LABELS = ["Title", "Intro", "Specifications", "Case", "Connectivity", "Input devices", "Display", "Performance",
          "Emissions", "Energy management", "Verdict"]

# Keywords looked for in a section heading, paired with the column index in LABELS.
SECTION_KEYWORDS = [
    (("Case", "Chassis"), 3),
    (("Connectivity", "Equipment"), 4),
    (("Input",), 5),
    (("Display",), 6),
    (("Performance",), 7),
    (("Emissions",), 8),
    (("Energy",), 9),
    (("Verdict",), 10),
]


def clean_text(fragments):
    """Join text fragments, dropping newlines and collapsing repeated spaces."""
    return " ".join(" ".join(str(fragment) for fragment in fragments).split())


def get_soup(driver, url):
    """Load a page in the browser and parse the rendered HTML."""
    driver.get(url)
    return BeautifulSoup(driver.page_source, "html.parser")


def collect_links(driver):
    """Collect the review links from every search-result page."""
    links = []
    for url in SEARCH_URLS:
        page_soup = get_soup(driver, url)
        for css_class in ("introa_large introa_review", "introa_small introa_review"):
            for a in page_soup.find_all("a", {"class": css_class}, href=True):
                links.append(a["href"])
    return links


def section_index(heading, current_index):
    """Return the column for a heading, or keep the current one if nothing matches."""
    for keywords, index in SECTION_KEYWORDS:
        if any(keyword in heading for keyword in keywords):
            return index
    return current_index


def scrape_review(driver, url):
    """Scrape one review page and return a row with one entry per label."""
    page_soup = get_soup(driver, url)
    laptop = [[] for _ in LABELS]
    laptop[0].append(page_soup.find("h1", string=True).text)

    # Intro: the teaser text plus the first text block of the review.
    intro = str(page_soup.find("div", {"class": "intro-text"}).contents[1]).strip(" ")
    first_block = page_soup.find("div", {"class": "csc-textpic-text"}).find_all(string=True)
    laptop[1].append(intro + " " + clean_text(first_block))

    index = 0  # blocks before the first recognised heading land after the title
    heading = ""
    for div in page_soup.find_all("div", {"class": "ttcl_0 csc-default"})[1:]:
        sub_soup = BeautifulSoup(str(div), "html.parser")
        h2 = sub_soup.find("h2")
        if h2 is not None:
            heading = h2.get_text().strip().capitalize()
            index = section_index(heading, index)

        text_div = sub_soup.find("div", {"class": "csc-textpic-text"})
        if text_div is None:
            continue  # no text container in this block
        laptop[index].append(clean_text(text_div.find_all(string=True)))
        if laptop[index][0] == "":
            # Only an image under the heading: store the heading itself.
            laptop[index][0] = heading

    # As in the original script, only the first text block of each section is kept.
    return [section[0] if section else "none" for section in laptop]


def main():
    driver = webdriver.Chrome()
    rows = []
    try:
        links = collect_links(driver)
        for link in tqdm(links):
            try:
                rows.append(scrape_review(driver, link))
            except Exception:
                # Skip pages that fail to load or do not follow the expected layout.
                continue
    finally:
        driver.quit()

    # Saving to a csv file
    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(LABELS)
        writer.writerows(rows)


if __name__ == "__main__":
    main()
```