![image.png](attachment:image.png)

# Web Scraping

[RedFlagDeals](https://forums.redflagdeals.com/hot-deals-f9/) is a forum where users can post sales or deals that they have come accross. The first part of this project is focused on scraping relevant information from the "All Hot Deals" section, which includes all product categories. In the second and third part I will clean and visualize the data to extract and summarize useful information.  

|Column name|Description|
|---|---|
|'title'| Title of post|
|'votes'| Sum of up-, and down-votes|
|source'| Name of retailer offering the sale|
|'creation_date'| Date of initial post|
|'last_reply'| Date of most recent reply|
|'author'| User name of post author|
|'replies'| Number of replies|
|'views'| Number of views|
|'price'| Price of product on sale|
|'saving'| Associated saving|
|'expiry'| Expiry date of sale|
|'url'| Link to deal|

In [639]:
# Packages
import requests # Scraping
from bs4 import BeautifulSoup # HTML parsing
import pandas as pd
import numpy as np
import datetime

## I. Retrieving data from the "All Hot Deals" sub-forum

URL format for different pages: `root-url/page#/`

In [647]:
# URL base, current page, and total number of pages. Used to iterate over page URLs.
current_page = "" # page number used to format base URL
total_pages = 1 # total number for endpoint of iteration
root_url = "https://forums.redflagdeals.com/hot-deals-f9/" # base url for "Hot Deals"

# URL base to generate links to specific posts
base_url = "https://forums.redflagdeals.com"

# Dataframe to store scraped data
# gloale keyword allows modification in function
table = pd.DataFrame(columns=
    ['title',
    'votes',
    'source',
    'creation_date',
    'last_reply',
    'author',
    'replies',
    'views',
    'price',
    'saving',
    'expiry',
    'url'])

In [641]:
def get_posts(page: str):
    """Returns list of all HTML post elements found on page and sets total_pages variable"""
    
    # Initalize list of posts on page class="row topic"
    posts = []
    
    # Get entire page content
    response = requests.get(page)
    content = response.content

    # URL parser
    parser = BeautifulSoup(content, 'html.parser')
    
    # Find total number of pages
    # Format of text: " {current page #} of {total page #} "
    # Need to strip white space and extract total page #
    pages = parser.select(".pagination_menu_trigger")[0].text.strip().split("of ")[1]
    global total_pages
    total_pages = int(pages) # Set global variable 
    
    # Find and return list of topics
    topics = parser.find_all("li", class_="row topic")
    return topics

In [642]:
def additional_info(post: str) -> dict:
    """Extracts and returns additional information from a specific post:
    url-link to the deal, the price, the discount percentage, the expiry date and 
    the parent/thread categories of the product. Returns NaN for objects that are not found."""
    
    # Additional information about deal
    add = {}
    
    # Get content of post
    response = requests.get(post)
    content = response.content
    
    # Parse URL
    parser = BeautifulSoup(content, 'html.parser')
    
    # Thread-header with information on parent and thread category
    try: # parent category
        parent_category = parser.select(".thread_parent_category")[0].text
        add['Parent:'] = parent_category
    except: add['Parent:'] = np.nan # NaN if category not found
    try: # thread category
        thread_category = parser.select(".thread_category")[0].text
        add['Thread:'] = thread_category
    except: add['Thread:'] = np.nan # NaN if category not found
    
    
    # Offer-summary field: may contain deal link, price, saving, and retailer
    summary = parser.select(".post_offer_fields") # format example: "Price:\n$200\nSaving:\n70%"
    try:
        summary_list = summary[0].text.split("\n") 
    except: summary_list = []
        
    # Go through summary elements and save relevant information
    for i in range(1, (len(summary_list) -1), 2): # index 0 is empty string
        current_element = summary_list[i] # content of current list element
        next_element = summary_list[i+1] # next list element
        
        # Price, saving, and expiry date information contained in the next list element will be saved
        if current_element.startswith("Price") or current_element.startswith("Saving") or current_element.startswith("Expiry"):
            add[current_element]  = next_element # next elements corrsponds to content
            
    # URL to link. Full link not available through .text
    try: 
        url = str(summary[0]).split('href="')[1].split('"')[0] # select link between href=" and "
        add['Link:'] = url
    except: add['Link:'] = np.nan
        
    
    # If any of the elements is not found in the summary-field add None value to dictionary 
    if "Price:" not in add:
        add['Price:'] = np.nan
        
    if "Savings:" not in add:
        add['Savings:'] = np.nan
        
    if "Expiry:" not in add:
        add['Expiry:'] = np.nan
    
    return add # Return dictionary containing with information on price, saving and expiry  

In [643]:
def fill_table(posts: list) -> None:
    '''Fills table with data from elements of the post objects'''
    
    # For appending data 
    tmp_table = pd.DataFrame() # temporary DataFrame that holds all column objects. Will be appended to the global `table`. 
    
    # Initializing columns for tmp_table
    title_col = pd.Series()
    source_col = pd.Series()
    url_col = pd.Series()
    votes_col = pd.Series()
    replies_col = pd.Series()
    views_col = pd.Series()
    creation_date_col = pd.Series()
    last_reply_col = pd.Series()
    author_col = pd.Series()
    price_col = pd.Series()
    saving_col = pd.Series()
    expiry_col = pd.Series()
    parent_col = pd.Series()
    thread_col = pd.Series()
    

    # Iterate through post elements and extract data for table
    for post in posts:
        
        # Retailer corresponding to deal
        try: 
            source = post.select(".topictitle_retailer")[0].text.split("\n")[0] # split and remove line-break characters
            source_series = pd.Series(source) # transform into Series object
        except: source_series = pd.Series(np.nan)
        source_col = source_col.append(source_series, ignore_index=True) # append to column and ignore index to avoid complications when merging with DataFrame object

        # Number of votes
        try: 
            votes = post.select(".post_voting")[0].text.split("\n")[1] # split and remove line-break characters
            votes_series = pd.Series(votes) # transform into Series object
        except: votes_series = pd.Series(0)
        votes_col = votes_col.append(votes_series, ignore_index=True) # append to column
            
        # Title 
        try:
            topic = post.select(".topic_title_link") # contains title and sub-url to post
            title = topic[0].text.split('\n')[1] # extract text and remove line-break characters
            title_series = pd.Series(title)
        except: title_series = pd.Series(np.nan)
        title_col = title_col.append(title_series, ignore_index=True)

        # Date of initial posting
        try: 
            creation = post.select(".first-post-time")[0].text.split("\n")[0] # remove line-breaks
            creation_series = pd.Series(creation)
        except: creation_series = pd.Series(np.nan)
        creation_date_col = creation_date_col.append(creation_series, ignore_index=True) # append to column
        
        # Date of most recent replie
        try: 
            last_replie = post.select(".last-post-time")[0].text.split("\n")[0] # remove line-breaks
            last_replie_series = pd.Series(last_replie)
        except: last_replie_series = pd.Series(np.nan)
        last_reply_col = last_reply_col.append(last_replie_series, ignore_index=True) # append to column
        
        # Author user-name
        try:
            author = post.select(".thread_meta_author")[0].text.split("\n")[0]
            author_series = pd.Series(author)
        except: author_series = pd.Series(np.nan)
        author_col = author_col.append(author_series, ignore_index=True)
        
        
        # Number of replies
        try:
            replies = post.select(".posts")[0].text.split("\n")[0]
            replies = replies.replace(",","") # replace any commas to prepare for data type switch to integer
            replies_series = pd.Series(replies)
        except: replies_series = pd.Series(np.nan)
        replies_col = replies_col.append(replies_series, ignore_index=True)
        
        # Number of views
        try:
            views = post.select(".views")[0].text.split("\n")[0]
            views = views.replace(",","") # replace any commas to prepare for data type switch to integer
            views_series = pd.Series(views)
        except: replies_series = pd.Series(np.nan)
        views_col = views_col.append(views_series, ignore_index=True)
        
        # Link to current post
        try:
            link = str(topic).split('href="')[1] # split at href to extract link
            link_clean = link.split('">')[0] # remove superfluous characters
        except: 
            link_clean = np.nan
        
        # Additional information post
        if link_clean != None: # retrieve information from post, if it exists
            post_url = (base_url + "{}").format(link_clean) # merge base-, and sub-url to generate the complete post-link
            add_info = additional_info(post_url) # get additonal information on price, saving, and expiry-date
            
            # Fill columns with additional information from add_info dictionary
            price_col = price_col.append(pd.Series(add_info['Price:']), ignore_index=True)
            saving_col = saving_col.append(pd.Series(add_info['Savings:']), ignore_index=True)
            expiry_col = expiry_col.append(pd.Series(add_info['Expiry:']), ignore_index=True)
            url_col = url_col.append(pd.Series(add_info['Link:']), ignore_index=True)
            parent_col = parent_col.append(pd.Series(add_info['Parent:']), ignore_index=True)
            thread_col = thread_col.append(pd.Series(add_info['Thread:']), ignore_index=True)
        else:
            price_col = price_col.append(np.nan)
            saving_col = saving_col.append(np.nan)
            expiry_col = expiry_col.append(np.nan)
            url_col = url_col.append(np.nan)
        
            
    # Fill temporary table
    tmp_table['title'] = title_col
    tmp_table['votes'] = votes_col.astype(int)
    tmp_table['source'] = source_col
    tmp_table['creation_date'] = creation_date_col
    tmp_table['last_reply'] = last_reply_col
    tmp_table['author'] = author_col
    tmp_table['replies'] = replies_col.astype(int)
    tmp_table['views'] = views_col.astype(int)
    tmp_table['price'] = price_col
    tmp_table['saving'] = saving_col
    tmp_table['expiry'] = expiry_col
    tmp_table['url'] = url_col
    tmp_table['parent_category'] = parent_col
    tmp_table['thread_category'] = thread_col
        
    # Print result
    global table # gloabal keyword allows modification inside function
    table = table.append(tmp_table)
    print("Current table length: ", table.shape[0])

In [648]:
# First page information, and set total_pages through get_posts()
# Generate list of posts on first page
posts = get_posts(url)  
# Fill table from information on first page and corresponding posts
fill_table(posts)

#Loop through pages and fill table
for page in range(2, total_pages):
    next_url = root_url + str(page) + "/" # URL of next page: base-url + number + "/"
    print(next_url)
    # Generate list of posts on current page
    posts = get_posts(next_url)

    # Fill table from information on current page and posts
    fill_table(posts)

# Print first 10 rows
table.head(10)

Current table length:  30
https://forums.redflagdeals.com/hot-deals-f9/2/
Current table length:  60
https://forums.redflagdeals.com/hot-deals-f9/3/
Current table length:  90
https://forums.redflagdeals.com/hot-deals-f9/4/
Current table length:  120
https://forums.redflagdeals.com/hot-deals-f9/5/
Current table length:  150
https://forums.redflagdeals.com/hot-deals-f9/6/
Current table length:  179
https://forums.redflagdeals.com/hot-deals-f9/7/
Current table length:  209
https://forums.redflagdeals.com/hot-deals-f9/8/
Current table length:  239
https://forums.redflagdeals.com/hot-deals-f9/9/
Current table length:  269
https://forums.redflagdeals.com/hot-deals-f9/10/
Current table length:  299
https://forums.redflagdeals.com/hot-deals-f9/11/
Current table length:  329
https://forums.redflagdeals.com/hot-deals-f9/12/
Current table length:  359
https://forums.redflagdeals.com/hot-deals-f9/13/
Current table length:  389
https://forums.redflagdeals.com/hot-deals-f9/14/
Current table length:  

Unnamed: 0,author,creation_date,expiry,last_reply,parent_category,price,replies,saving,source,thread_category,title,url,views,votes
0,pux420,"Jun 29th, 2020 12:30 pm","July 12, 2020","Jul 12th, 2020 11:37 am",Home & Garden,214.99,121,20%,Costco,Home Improvement & Tools,Garage Epoxy $214.99,https://www.costco.ca/rokrez-epoxy-floor-coati...,35154,22
1,Indubitably,"Jul 12th, 2020 9:06 am",,"Jul 12th, 2020 11:35 am",Computers & Electronics,19.99,20,$23 off,Amazon.ca,Home Theatre & Audio,Dudios True Wireless Earbuds - $19.99 with Prime,http://www.amazon.ca/gp/redirect.html?ie=UTF8&...,2470,19
2,maverikbc,"Jul 8th, 2020 9:31 pm",,"Jul 12th, 2020 11:35 am",Financial Services,,67,,,Credit Cards,"TD, BMO, and CIBC cash back 10% cards welcome ...",,15080,19
3,phoreoneone,"Jun 30th, 2020 10:48 am",,"Jul 12th, 2020 11:32 am",,,5,,Starbucks,Groceries,Aeroplan with TD and Starbucks Offers YMMV CHE...,,1440,0
4,AramisAtos,"Jun 18th, 2020 3:20 pm",,"Jul 12th, 2020 11:32 am",,,96,70%,,Apparel,Levi's Warehouse Sale Event 70% off,http://levis-canada.sjv.io/c/341376/486184/846...,34354,59
5,Mr2828,"Jul 9th, 2020 5:42 pm",,"Jul 12th, 2020 11:31 am",,,171,,Foot Locker,Apparel,Ultraboost/NMD/Infinity React 50% + extra 15/2...,,48508,112
6,enzo85,"Jul 12th, 2020 11:31 am",,"Jul 12th, 2020 11:31 am",Home & Garden,,0,,,Appliances,$499 - Miele Classic C1 Cat and Dog Canister V...,https://www.canadiantire.ca/en/pdp/miele-class...,66,0
7,SAN66,"Jul 11th, 2020 4:09 pm",,"Jul 12th, 2020 11:28 am",Computers & Electronics,$129.97,22,30%,Staples,Peripherals & Accessories,Google Home Wifi Router (1st gen) $129.97 in s...,https://staplescanada.4u8mqw.net/c/341376/7554...,4900,4
8,mangoman,"Jul 10th, 2020 4:40 pm","July 12, 2020","Jul 12th, 2020 11:27 am",Computers & Electronics,$499 + 20x,199,,Shoppers Drug Mart,Computers & Tablets/eReaders,$499.99 Acer Aspire 3 Ryzen 3 3200U / 8GB RAM ...,,22448,21
9,Blackdove77,"Jul 6th, 2020 12:49 pm","July 12, 2020","Jul 12th, 2020 11:27 am",Computers & Electronics,,111,100%,,Video Games,Watchdogs 2 PC version free on July 12th,https://news.ubisoft.com/en-us/article/41nS5f7...,25379,105


In [649]:
# Write data to csv file
table.to_csv('C:/Users/User/Documents/GitHub/Data-Science/rfd_scrape.csv')

In [650]:
df = pd.read_csv('rfd_scrape.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1409 entries, 0 to 1408
Data columns (total 15 columns):
Unnamed: 0         1409 non-null int64
author             1409 non-null object
creation_date      1409 non-null object
expiry             419 non-null object
last_reply         1409 non-null object
parent_category    869 non-null object
price              960 non-null object
replies            1409 non-null int64
saving             561 non-null object
source             1058 non-null object
thread_category    1409 non-null object
title              1409 non-null object
url                1103 non-null object
views              1409 non-null int64
votes              1409 non-null int64
dtypes: int64(4), object(11)
memory usage: 165.2+ KB


## II. Data wrangling