# The Webscraper

## Websites to be consumed and rationale for extracting web content

For the NLP prototype, a large text corpus (around 8000 elements) relevant to the automotive industry is needed. “Eric the Car Guy” is an automotive website that hosts many forums on automotive topics. The “Service and Repair Questions Answered Here” forum (https://www.ericthecarguy.com/forums/forum/stay-dirty-lounge/service-and-repair-questions-answered-here/) was used to generate a text corpus via a web scraper. This forum was chosen because it has enough posts to create a sufficiently large corpus, and because its natural language aligns with the NLP project. A brief analysis showed that users often mention the make/model of the problematic car and what the problem is (a specific part fault), which is exactly the information needed for this NLP prototype.

There are many other automotive forums within this website, such as news, sales, or video discussions. Even though parts and car models may be mentioned, the context of the natural language will not be suitable for this NLP prototype.

## Website/data copyright considerations 

Eric the Car Guy publishes a copyright notice covering the use of information on the website, which can be found at http://new.ericthecarguy.com/terms-of-service/ in the first paragraph of the “Personal Use” section. Scraping this website is in accordance with the copyright notice, as the information is being scraped for learning purposes, is intended only for personal use, and will not be reproduced in a commercial format. If this prototype is given to a development team or used in a commercial environment, written consent from Eric the Car Guy will be needed.

## Complexity of content layout/Content extractor to export the important aspects of the data and/or metadata

A screenshot of the main page of the forum and a post can be seen below:
<img src= "Report images/Figure1.png">
Figure 1  - Main page of the “Service and Repair Questions Answered Here” forum

A – Number of topics (posts) per page

B – Title of post – This is a link to take the end user to that specific post (Fig. 2)

C – Page number(s)

D – Author of the post

<img src= "Report images/Figure2.png">
Figure 2 – Forum post in the “Service and Repair Questions Answered Here” forum

E – Post title

F – Date and time posted

G – Description of the post

H – Author of the post

The above-mentioned labels were all used to extract data from the selected data source. Two Python packages were used for this: “urllib” and “beautifulsoup”. Urllib is used for opening the desired page, and beautifulsoup is used for parsing the HTML structure of the page and extracting/saving the desired data from the webpage. The general URL for the forum (https://www.ericthecarguy.com/forums/forum/stay-dirty-lounge/service-and-repair-questions-answered-here/) changes depending on which page is being accessed. The general URL directs a user to page 1, but if any other page is being accessed, “/page/n/” is appended to the URL (https://www.ericthecarguy.com/forums/forum/stay-dirty-lounge/service-and-repair-questions-answered-here/page/2/ for page 2). The urllib.request.urlopen function is used to open the correct URL for scraping the page.

<img src= "Report images/Figure3.png">

Figure 3 – URL changing when a different page is accessed
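
The page URL pattern described above can be sketched as a small helper. This is a minimal sketch: the names `BASE_URL` and `page_url` are illustrative and are not taken from the original scraper.

```python
# Base URL of the forum (page 1), as given in the text above.
BASE_URL = (
    "https://www.ericthecarguy.com/forums/forum/stay-dirty-lounge/"
    "service-and-repair-questions-answered-here/"
)


def page_url(page_number: int) -> str:
    """Return the forum URL for a 1-based page number (hypothetical helper)."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    # Page 1 is served from the base URL; later pages append "page/n/".
    if page_number == 1:
        return BASE_URL
    return f"{BASE_URL}page/{page_number}/"


print(page_url(2))
```

Looping `page_url(n)` over the page numbers of interest yields every URL that needs to be opened.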

Once the page has been opened, its contents are read with the response's read method and saved to a variable (page_html), and the connection is closed. The BeautifulSoup constructor is then used to parse page_html into its HTML structure. From this page, the author of the post is extracted and saved. Author names are kept in a “span” tag with a class of “bbp-topic-started-by”; these are found with beautifulsoup’s findAll function. Within this span tag, an “a” tag is embedded, which has an “img” tag embedded within that. Within the “img” tag, the alt attribute holds the author’s name. Fig. 4 shows this construction.


In [4]:
"""Example here"""
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

general_url = "https://www.ericthecarguy.com/forums/forum/stay-dirty-lounge/service-and-repair-questions-answered-here/"
# opening connection, grabbing the page
uClient = uReq(general_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")
# find all span tags with a class of bbp-topic-started-by 
person_posted = page_soup.findAll("span", {"class": "bbp-topic-started-by"}) # this returns 15 results
print(len(person_posted))
# extract author name from the alt value in the img tag
author_name = person_posted[0].a.img["alt"]  # indexed depending on the article, 0 for 1st article, 14 for the 15th article
print(author_name)

15
Kevin Fontenot


<img src= "Report images/Figure4.png">

Figure 4 – The Author name embedded in the html structure

The URL that takes a user from the main forum page to a specific post is embedded in each post title. This URL is needed to extract and save the other desired text items to the corpus. It is embedded in an li tag with a class of “bbp-topic-title”, which is found with the findAll function. Within this li tag, an a tag is embedded whose href attribute contains the URL of that specific post. This URL is extracted using beautifulsoup, and the new page is opened with the urllib.request.urlopen function, as before.

<img src= "Report images/Figure5.png">

Figure 5 – The link to the post embedded in the html structure

In [5]:
"""Example here, uses same page as previous block of code"""

# grabs each topic
# find all li tags where class is bbp-topic-title
containers = page_soup.findAll("li", {"class": "bbp-topic-title"})
# inspect each container
print(containers[1])
# inspect the a tag
container = containers[1]  # this is indexed depending on the article
new_url = container.a['href']
# get the url that references what the user has said
print("The new url is {}".format(new_url))

<li class="bbp-topic-title">
<a class="bbp-topic-permalink" href="https://www.ericthecarguy.com/forums/topic/2006-honda-accord-j30a5-timing/">2006 honda accord J30a5 timing.</a>
<p class="bbp-topic-meta">
<span class="bbp-topic-started-by">Started by: <a class="bbp-author-avatar" href="https://www.ericthecarguy.com/forums/users/kevinakashorty/" rel="nofollow" title="View Kevin Fontenot's profile"><noscript><img alt="Kevin Fontenot" class="gravatar avatar avatar-14 um-avatar um-avatar-default" data-default="https://www.ericthecarguy.com/wp-content/plugins/ultimate-member/assets/img/default_avatar.jpg" height="14" onerror="if ( ! this.getAttribute('data-load-error') ){ this.setAttribute('data-load-error', '1');this.setAttribute('src', this.getAttribute('data-default'));}" src="https://www.ericthecarguy.com/wp-content/plugins/ultimate-member/assets/img/default_avatar.jpg" width="14"/></noscript><img alt="Kevin Fontenot" class="lazyload gravatar avatar avatar-14 um-avatar um-avatar-default

The post title, description and date are extracted from each post page and added to the corpus. The post title is contained within the first h1 tag present on the page. Within this tag, a span tag holds the post title as its text, which is extracted and stored (shown in Fig. 6). The post description is contained in a div tag with class “bbp-topic-content”, which is found with the findAll function. Within this div tag, a list of p tags holds the text of the description (shown in Fig. 7). The text of these p tags is extracted, concatenated and saved. The date of the post is contained as text within a span tag with a class of “bbp-topic-post-date”, which is found with the findAll function. The format of the extracted date is “Month Day, Year at 00:00 pm/am” (shown in Fig. 8).

<img src= "Report images/Figure6.png">

Figure 6 – HTML location of the post title

<img src= "Report images/Figure7.png">

Figure 7 – HTML location of the post description

<img src= "Report images/Figure8.png">

Figure 8 – HTML location of the date of the post

In [6]:
"""Example here, using the same post extracted from the previous block of code"""

# opening connection to the specific post, grabbing the page
uClient = uReq(new_url)
new_page_html = uClient.read()
uClient.close()

new_page_soup = soup(new_page_html, "html.parser")

# extract the title
forum_title = new_page_soup.h1.span.text
print("Title is: {}".format(forum_title))

# extract the article description
new_containers = new_page_soup.findAll("div", {"class": "bbp-topic-content"})

# text is contained in p tags. Concatenate these together
text = new_containers[1].find_all('p')
description = ""

for i in range(len(text)):
    # extracts the raw text from the tag
    current_description = text[i].text
    description = description + current_description
print("Description is: {}".format(description))

date_container = new_page_soup.findAll("span", {"class": "bbp-topic-post-date"})
date_posted = date_container[0].text
print("Date of post is: {}".format(date_posted))

Title is: 2006 honda accord J30a5 timing.
Description is: I had the engine out doing the timing belt and everything.  I was putting the belt on and needed to turn the rear (left side) cam pulley a hair.
But it kicked over alot and wouldn’t go anymore, turned it back and just before its perfect its kicks that way a good bit again and just keeps doing it. So I was able to get it as close as possible,  maybe a 1/4″ off. Is that ok or how do I get it perfect?  
Date of post is: May 28, 2021 at 9:46 pm
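
Because the date is stored as a plain string, it may be convenient to convert it into a `datetime` for later sorting or filtering. The following is a minimal sketch that assumes every date follows the format shown above; the names `DATE_FORMAT` and `parse_post_date` are illustrative and were not part of the original scraper.

```python
from datetime import datetime

# Assumed format of the scraped string, e.g. "May 28, 2021 at 9:46 pm".
DATE_FORMAT = "%B %d, %Y at %I:%M %p"


def parse_post_date(raw: str) -> datetime:
    """Parse a scraped forum date string (hypothetical helper)."""
    # strip() guards against stray whitespace around the tag text.
    return datetime.strptime(raw.strip(), DATE_FORMAT)


parsed = parse_post_date("May 28, 2021 at 9:46 pm")
print(parsed.isoformat())  # 2021-05-28T21:46:00
```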


As there are 15 posts per page, whenever the findAll function is used on the main page, 15 results are returned. A for loop is used to iterate over the returned results and extract the desired information from each post. Four variables were saved from each post: Date Posted, Author, Title and Description. These four variables were written to a csv file named “scraped_info.csv” to be used later (shown in Fig. 9).

The above web scraper was used on roughly 500 pages of the selected forum, and roughly 7500 posts were scraped to generate a corpus for the NLP prototype. The following table shows the data types of the scraped variables:

<img src= "Report images/Table1.png">

## Content Coverage of extracted data

The above table shows that the scraped data does not contain any subtypes: all variables are strings, which is acceptable for the proposed NLP prototype. The data can be left as strings, and no vectorisation is needed because the prototype does not use machine learning. Approximately 34,000 posts have been made to this forum, so the roughly 7500 scraped posts represent about 22% of the available content, which is sound data coverage for the NLP prototype. Some posts caused the scraper to crash because they contained a URL or a form of text that was not accepted; these posts were passed over.
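
Passing over posts that fail could look roughly like the following. This is a sketch under assumptions rather than the original implementation: `scrape_all` and `scrape_post` are hypothetical names, and `fake_scrape_post` stands in for whatever function fetches and parses a single post.

```python
from typing import Callable, Iterable


def scrape_all(
    urls: Iterable[str],
    scrape_post: Callable[[str], dict],
) -> tuple[list[dict], list[str]]:
    """Scrape every URL, skipping the ones that raise (hypothetical helper)."""
    scraped, skipped = [], []
    for url in urls:
        try:
            scraped.append(scrape_post(url))
        except Exception:
            # Record the failing URL and carry on with the next post.
            skipped.append(url)
    return scraped, skipped


def fake_scrape_post(url: str) -> dict:
    """Stand-in parser for the demonstration; fails on one URL."""
    if "bad" in url:
        raise ValueError("unparseable post")
    return {"url": url}


posts, failed = scrape_all(["post-1", "bad-post", "post-2"], fake_scrape_post)
print(len(posts), failed)  # 2 ['bad-post']
```

Keeping the list of skipped URLs makes it possible to report how many posts were dropped from the corpus.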

<img src= "Report images/Figure9.png">

Figure 9 – Overview of the scraped data

## Methodology of processing, cleaning, and storing harvested data for NLP tasking

For storing the data, minor cleaning took place. As the data is saved in csv format, commas were removed from all scraped variables so that they would not corrupt the saved file. Any double spaces were reduced to a single space, and any “\r” or “\n” characters were removed, as these also corrupted the saved file. Posts were written to the file one at a time, with the attributes of each post separated by commas to form one csv row.


In [8]:
date_posted = date_posted.replace(", ", " ")

forum_title = forum_title.replace(",", " ")
forum_title = forum_title.replace("  ", " ")

description = description.replace("\r\n", " ")
description = description.replace(",", " ")
description = description.replace("  ", " ")
description = description.replace("\n", "")

written_variables = date_posted.replace(",", " ") + "," + author_name + "," + forum_title.replace(",", " ") + "," + description + "\n"
print("Variable written to csv file: {}".format(written_variables))

Variable written to csv file: May 28 2021 at 9:46 pm,Kevin Fontenot,2006 honda accord J30a5 timing.,I had the engine out doing the timing belt and everything. I was putting the belt on and needed to turn the rear (left side) cam pulley a hair.But it kicked over alot and wouldn’t go anymore turned it back and just before its perfect its kicks that way a good bit again and just keeps doing it. So I was able to get it as close as possible maybe a 1/4″ off. Is that ok or how do I get it perfect? 
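
As an alternative to removing commas by hand, Python's standard `csv` module quotes any field that contains a comma or a line break, so the original punctuation can be kept. The sketch below is not the method used above; the sample row reuses an excerpt of the post shown earlier and writes to an in-memory buffer instead of “scraped_info.csv”.

```python
import csv
import io

# One scraped post: date, author, title, description (description is an excerpt).
row = [
    "May 28, 2021 at 9:46 pm",
    "Kevin Fontenot",
    "2006 honda accord J30a5 timing.",
    "So I was able to get it as close as possible, maybe a 1/4″ off.",
]

buffer = io.StringIO()  # stands in for an open file handle
writer = csv.writer(buffer)
writer.writerow(row)

# Reading the line back returns the four original fields, commas included.
buffer.seek(0)
restored = next(csv.reader(buffer))
print(restored == row)  # True
```

With this approach the date keeps its comma and the description keeps its punctuation, at the cost of quoted fields in the raw file.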



## Metadata supplementation

For this NLP prototype, conventional supplementation of metadata from another source is not needed. The data that was harvested from the selected source is more than sufficient for this task. Supplementing metadata such as more information on car parts/makes of cars can be a step taken for implementing this project with a development team. 

## Summary and visualisation of the harvested data.

A preliminary exploratory data analysis was carried out using the following graphs. Fig. 10 and Fig. 11 are word clouds of the scraped data from the titles of posts and descriptions of posts, respectively.

<img src= "Report images/Figure10.png">

Figure 10 – Word cloud of the words found in the titles of posts

<img src= "Report images/Figure11.png">

Figure 11 – Word cloud of the words found in the descriptions of posts

The above word clouds were generated after stop word removal to show more meaningful data. It is shown that makes and models of cars are more frequent in titles of posts, rather than in descriptions. Also, car parts are present in both titles and descriptions of posts, but words describing the problem are more frequent in descriptions of posts. These word clouds emphasise that the scraped data is relevant to the overall project, as the words “problem” and “issue” are quite frequent.

Bar graphs of the twenty most frequent words in titles and descriptions can be seen below in Fig. 12 and Fig. 13 respectively, for better understanding of the dataset.  

<img src= "Report images/Figure12.png">

Figure 12 – Most common words in titles of posts

[('Honda', 810), ('Accord', 384), ('start', 359), ('Civic', 275), ('engine', 247), ('2000', 223), ('Ford', 222), ('noise', 216), ('issue', 203), ('2002', 200), ('2003', 191), ('I', 186), ('Engine', 182), ('help', 182), ('Acura', 174), ('oil', 171), ('No', 169), ('Chevy', 167), ('2007', 164), ('2001', 164)]

<img src= "Report images/Figure13.png">

Figure 13 – Most common words in descriptions of posts

[('car', 7486), ('engine', 4902), ('im', 3682), ('would', 3652), ('new', 3336), ('like', 3124), ('back', 2923), ('replaced', 2907), ('one', 2900), ('get', 2822), ('start', 2799), ('fuel', 2714), ('problem', 2633), ('oil', 2419), ('also', 2359), ('could', 2347), ('time', 2196), ('help', 2179), ('sensor', 2160), ('issue', 2102)]
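
Frequency lists such as the two above can be produced with `collections.Counter`. The sketch below is an assumption about the general approach, not the original analysis code: the stop word set is a tiny illustrative sample, the tokenisation is a simple whitespace split, and the two input sentences are taken from the post shown earlier.

```python
from collections import Counter

# Tiny illustrative stop word set; the list used for the figures is not given.
STOP_WORDS = {"i", "the", "and", "a", "to", "was", "on", "had", "out"}


def top_words(texts, n=20):
    """Return the n most common non-stop-words across the given strings."""
    counts = Counter()
    for text in texts:
        for word in text.split():
            # Strip surrounding punctuation; compare lowercased against stop words.
            token = word.strip(".,!?()\"'")
            if token and token.lower() not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)


sample = [
    "I had the engine out doing the timing belt and everything.",
    "I was putting the belt on and needed to turn the rear (left side) cam pulley a hair.",
]
print(top_words(sample, n=1))  # [('belt', 2)]
```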

