![image](rfd_logo.png)

[RedFlagDeals](https://forums.redflagdeals.com/hot-deals-f9/) is a forum where users can post sales or deals that they have come across. The first part of this project is focused on scraping relevant information from the "All Hot Deals" section that includes all product categories. In the second and third part I will clean and visualize the data to extract and summarize useful information. 

Two tables will be scraped. The main table with a description of all columns is shown below. The second table will store comments that were made on each post. To insure, that comments can be linked back to the original post in the main table, corresponding post titles are stored in a second column. 

Data will first be cleaned by converting columns to the most appropriate data type, removing unwanted characters from strings and by dealing with missing values. Some of the nan records can be substituted through information found in the title or url. 

Main title data will be aggregated and visualized to reveal outstanding deal postings and interesting patterns. Extracted comments can be used for natural language processing and analyzing sentiment more robustly. A possible approach would be to perform PCA to Taking only upvotes and the number of replies into consideration may not be enough to accurately reflect the value of a deal.

|Column name|Description|
|---|---|
|'title'| Title of post|
|'votes'| Sum of up-, and down-votes|
|'source'| Name of retailer offering the sale|
|'creation_date'| Date of initial post|
|'last_reply'| Date of most recent reply|
|'author'| User name of post author|
|'replies'| Number of replies|
|'views'| Number of views|
|'price'| Price of product on sale|
|'saving'| Associated saving|
|'expiry'| Expiry date of sale|
|'url'| Link to deal|


In [27]:
# Packages
import requests # Scraping
from bs4 import BeautifulSoup # HTML parsing
import pandas as pd
import numpy as np
import datetime
import re

## I. Retrieving data from the "All Hot Deals" sub-forum

The "All Hot Deals" section is an overview page of recent posts made on the forum. Information about posts such as title and number of upvotes are summarized and listed. The summaries in the "All Hot Deals" section as well as the comments on individual posts are organized into several pages.

To scrap as much information as possible, we need to iterate through the different pages on "All Hot Deals". Further, we need to access each individual post to obtain additional information not found in the summary. For each post, we will also scrape the comments made by users. For this will need to iterate through the available pages on each post.


URL format for different pages: `root-url/page#/`

In [28]:
# Initialize global variables used to iterate over web-pages.
current_page = "" # page number; used to format root URL
total_pages = 1 # endpoint for iteration; set through get_posts()
root_url = "https://forums.redflagdeals.com/hot-deals-f9/" # base url for "All Hot Deals" sub-forum

# URL base to generate links to specific posts
base_url = "https://forums.redflagdeals.com"

# Dataframe to store scraped data
main_table = pd.DataFrame(columns=
    ['title',
    'votes',
    'source',
    'creation_date',
    'last_reply',
    'author',
    'replies',
    'views',
    'price',
    'saving',
    'expiry',
    'url'])

# Dataframe to to store post comments
comment_table = pd.DataFrame(columns=
                       ['title',
                       'comments'])

In [29]:
def get_posts(page: str) -> list:
    """
    Returns list of parsed object containing all post elements from
    the current 'page' and sets gloabl variable 'total_pages'
    
    Args:
    page - url string of current page
    
    Returns:
    topics - all parsed elements of class 'row topic'
    total_pages - sets global variable
    """
    
    # Initalize list of posts on page class="row topic"
    posts = []
    
    # Get entire page content
    response = requests.get(page)
    content = response.content
    
    # Find total number of pages and set global variable accordingly
    # Format of text: " {current page #} of {total page #} "
    # Need to strip white space and extract total page #
    parser = BeautifulSoup(content, 'html.parser')
    pages = parser.select(".pagination_menu_trigger")[0].text.strip().split("of ")[1]
    global total_pages
    total_pages = int(pages)
    
    # Find and return topics
    topics = parser.find_all("li", class_="row topic")
    return topics

In [60]:
def get_additional_info(post: str) -> dict:
    """
    Extracts and returns additional information from a RedFlagDeals post:
    url-link to the deal, the price of the product, the discount saving, 
    the expiry date, the parent/thread categories of the product and a list of
    comments found on all pages. 
    
    Args:
    post - url string linking to a specific post
    
    Returns:
    additional_info - dictionary containing additional information about the post.
    Stores NaN values in "additional_info" for objects that are not found.
    """
    
    # Additional information found in post
    additional_info = {}
    # Reviews found on all pages of post
    
    # Get content of post
    response = requests.get(post)
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    
    # Thread-header with information on parent and thread category
    try: # parent category
        parent_category = parser.select(".thread_parent_category")[0].text
        additional_info['Parent:'] = parent_category
    except: additional_info['Parent:'] = np.nan # NaN if category not found
    try: # thread category
        thread_category = parser.select(".thread_category")[0].text
        additional_info['Thread:'] = thread_category
    except: additional_info['Thread:'] = np.nan # NaN if category not found
    
    
    # Offer-summary field: may contain deal link, price, saving, and retailer
    summary = parser.select(".post_offer_fields") # format example: "Price:\n$200\nSaving:\n70%"
    try:
        summary_list = summary[0].text.split("\n") 
    except: summary_list = []
        
    # Example format of summary_list: ["", "Price:", "200$", "Saving:", 50%, "Expiry:", "July 23, 2020"]
    for i in range(1, (len(summary_list) -1), 2): # index 0 is empty string
        current_element = summary_list[i] # content of current list element
        next_element = summary_list[i+1] # next list element containing the information
        
        # Price, saving, and expiry date information contained in the next list element will be saved
        if (current_element.startswith("Price") 
            or current_element.startswith("Saving") 
            or current_element.startswith("Expiry")):
            
            additional_info[current_element]  = next_element # next elements corrsponds to content
            
    # URL to link. Full link not available through .text
    try: 
        url = str(summary[0]).split('href="')[1].split('"')[0] # select link between href=" and "
        additional_info['Link:'] = url
    except: additional_info['Link:'] = np.nan
        
    
    # If any of the elements is not found in the summary-field add None value to dictionary 
    if "Price:" not in additional_info:
        additional_info['Price:'] = np.nan
        
    if "Savings:" not in additional_info:
        additional_info['Savings:'] = np.nan
        
    if "Expiry:" not in additional_info:
        additional_info['Expiry:'] = np.nan
    
    return additional_info # Return dictionary containing with information on price, saving and expiry  

In [None]:
def get_post_comments(post:str, title:str) -> pd.DataFrame:
    """
    Retrieves comments from all post pages.
    
    Arg:
    post: url to the first page of a post
    title: the title of the post
    
    Returns:
    df_comments: DataFrame object. Each row corresponds to an indevidual comment.
    A second column indicates the title of original post on which the comment was made
    """
    
    # Data storage
    df_comments = pd.DataFrame()
    comment_list = []
    
    # subsequent pages can be retireved through: root_url + "page#/"
    root_url = post
    
    # Get and parse content from first post page
    response = requests.get(post)
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    
    # Retrive total number of pages for iteration
    # There is only one page, if the "pagination_menu_trigger" class doesn't exist -> except
    try:
        pages = parser.select(".pagination_menu_trigger")[0].text
        last_page = int(re.findall("\d+", pages)[1]) # first element is current page, second is last page
    except:
        last_page = 1
    
    # Iterate through pages and retrieve comments
    for page in range(1,(last_page+1)):
        if page == 1:
            # print(title, "\nPage 1 is being scrapped for comments...")
            
            #Get comments from first page
            comments = parser.select(".post_content .content")
            comment_list.extend(comment.text for comment in comments)
            
            # print("Comments found on page 1:", len(comments))
        else:
            # print("Page {} is being scrapped for comments...".format(page))
            
            # Parse next page
            next_page = root_url + str(page) # next page
            response = requests.get(next_page)
            content = response.content
            parser = BeautifulSoup(content, 'html.parser')
            
            # Get comments
            comments = parser.select(".post_content .content")
            comment_list.extend(comment.text for comment in comments)
            
            # print("Comments found on page {}:".format(page), len(comments))
    
    # Fill dataframe to return
    title_col = pd.Series(title for i in range(len(comment_list)))
    df_comments['title'] = title_col
    df_comments['comments'] = pd.Series(comment_list)
    return df_comments

In [None]:
def fill_table(posts: list) -> None:
    '''
    Extracts parsed data from current page and appends to the global table variable.
    
    Args:
    posts - list of parsed post elements obtained through get_posts()
    '''
    
    # Temporary DataFrame object that will be appended to the global 'table' variable
    tmp_table = pd.DataFrame()
    tmp_comments = pd.DataFrame()
    
    # Initializing columns for tmp_table
    title_col = pd.Series()
    source_col = pd.Series()
    url_col = pd.Series()
    votes_col = pd.Series()
    replies_col = pd.Series()
    views_col = pd.Series()
    creation_date_col = pd.Series()
    last_reply_col = pd.Series()
    author_col = pd.Series()
    price_col = pd.Series()
    saving_col = pd.Series()
    expiry_col = pd.Series()
    parent_col = pd.Series()
    thread_col = pd.Series()
    
    #Initializing columns for tmp_comments
    comment_title = pd.Series()
    comment_order = pd.Series()
    comment_comment = pd.Series()
    

    # Iterate through post elements on current page and extract data for table
    for post in posts:
        
        # Retailer corresponding to deal
        try: 
            source = post.select(".topictitle_retailer")[0].text.split("\n")[0] # split and remove line-break characters
            source_series = pd.Series(source) # transforming into Series object allows use of .append method
        except: source_series = pd.Series(np.nan)
        source_col = source_col.append(source_series, ignore_index=True)

        # Number of votes
        try: 
            votes = post.select(".post_voting")[0].text.split("\n")[1] 
            votes_series = pd.Series(votes) 
        except: votes_series = pd.Series(0)
        votes_col = votes_col.append(votes_series, ignore_index=True)
            
        # Title 
        try:
            topic = post.select(".topic_title_link") 
            title = topic[0].text.split('\n')[1] 
            title_series = pd.Series(title)
        except: title_series = pd.Series(np.nan)
        title_col = title_col.append(title_series, ignore_index=True)

        # Date of initial posting
        try: 
            creation = post.select(".first-post-time")[0].text.split("\n")[0]
            creation_series = pd.Series(creation)
        except: creation_series = pd.Series(np.nan)
        creation_date_col = creation_date_col.append(creation_series, ignore_index=True) 
        
        # Date of most recent replie
        try: 
            last_replie = post.select(".last-post-time")[0].text.split("\n")[0]
            last_replie_series = pd.Series(last_replie)
        except: last_replie_series = pd.Series(np.nan)
        last_reply_col = last_reply_col.append(last_replie_series, ignore_index=True) 
        
        # Author user-name
        try:
            author = post.select(".thread_meta_author")[0].text.split("\n")[0]
            author_series = pd.Series(author)
        except: author_series = pd.Series(np.nan)
        author_col = author_col.append(author_series, ignore_index=True)
        
        
        # Number of replies
        try:
            replies = post.select(".posts")[0].text.split("\n")[0]
            replies = replies.replace(",","") # remove commas to facilitate data type conversion to integer
            replies_series = pd.Series(replies)
        except: replies_series = pd.Series(np.nan)
        replies_col = replies_col.append(replies_series, ignore_index=True)
        
        # Number of views
        try:
            views = post.select(".views")[0].text.split("\n")[0]
            views = views.replace(",","") # remove commas to facilitate integer conversion
            views_series = pd.Series(views)
        except: replies_series = pd.Series(np.nan)
        views_col = views_col.append(views_series, ignore_index=True)
        
        # Link to current post. Used to extract additional information
        try:
            link = str(topic).split('href="')[1] # split at href to extract link
            link_clean = link.split('">')[0] # remove superfluous characters
        except: 
            link_clean = np.nan
        
        # Additional information post
        if link_clean != None: # retrieve information from post, if url exists
            post_url = (base_url + "{}").format(link_clean) # merge base-, and sub-url to generate the complete post-link
            additional_info = get_additional_info(post_url) # get dictionary of additonal information on price, saving, etc.
            
            # Fill columns with additional information from additional_info dictionary
            price_col = price_col.append(pd.Series(additional_info['Price:']), ignore_index=True)
            saving_col = saving_col.append(pd.Series(additional_info['Savings:']), ignore_index=True)
            expiry_col = expiry_col.append(pd.Series(additional_info['Expiry:']), ignore_index=True)
            url_col = url_col.append(pd.Series(additional_info['Link:']), ignore_index=True)
            parent_col = parent_col.append(pd.Series(additional_info['Parent:']), ignore_index=True)
            thread_col = thread_col.append(pd.Series(additional_info['Thread:']), ignore_index=True)
            
            # get comments from post
            comments_tmp = get_post_comments(post_url, title)
        else:
            price_col = price_col.append(np.nan)
            saving_col = saving_col.append(np.nan)
            expiry_col = expiry_col.append(np.nan)
            url_col = url_col.append(np.nan)
        
            
    # Fill temporary table
    tmp_table['title'] = title_col
    tmp_table['votes'] = votes_col.astype(int)
    tmp_table['source'] = source_col
    tmp_table['creation_date'] = creation_date_col
    tmp_table['last_reply'] = last_reply_col
    tmp_table['author'] = author_col
    tmp_table['replies'] = replies_col.astype(int)
    tmp_table['views'] = views_col.astype(int)
    tmp_table['price'] = price_col
    tmp_table['saving'] = saving_col
    tmp_table['expiry'] = expiry_col
    tmp_table['url'] = url_col
    tmp_table['parent_category'] = parent_col
    tmp_table['thread_category'] = thread_col
           
    # Append temporary objects to global variables 
    global main_table # gloabal keyword allows modification inside function
    main_table = main_table.append(tmp_table)
    global comment_table
    comment_table = comment_table.append(comments_tmp)
    
    
    # Logging progress
    print("Current main_table length:", main_table.shape[0])
    print("Current comment_table lenth:", comment_table.shape[0])
    print("="*50)

In [31]:
import time
start_time = time.time()
# Get first page information, and set total_pages through get_posts()
print('Extracting information from page: 1')
print("-"*50)
posts = get_posts(root_url)  
# Extract infomation from first page to fill table with data
fill_table(posts)

#Loop through pages and add data to table
for page in range(2, (total_pages + 1)):
    next_url = root_url + str(page) + "/" # URL of next page: base-url + number + "/"
    print('\nExtracting information from page: ', page, " of ", total_pages)
    print("-"*50)
    # Generate list of posts on current page
    posts = get_posts(next_url)

    # Fill table from information on current page and posts
    fill_table(posts)
    
print('Total time for scraping:', (time.time()-start_time)/60, "min.")

Extracting information from page: 1
--------------------------------------------------
Current main_table length: 29
Current comment_table lenth: 49

Extracting information from page:  2  of  45
--------------------------------------------------
Current main_table length: 59
Current comment_table lenth: 55

Extracting information from page:  3  of  45
--------------------------------------------------
Current main_table length: 89
Current comment_table lenth: 210

Extracting information from page:  4  of  45
--------------------------------------------------
Current main_table length: 119
Current comment_table lenth: 302

Extracting information from page:  5  of  45
--------------------------------------------------
Current main_table length: 149
Current comment_table lenth: 423

Extracting information from page:  6  of  45
--------------------------------------------------
Current main_table length: 179
Current comment_table lenth: 491

Extracting information from page:  7  of  45
---

Current main_table length: 1199
Current comment_table lenth: 1370

Extracting information from page:  41  of  45
--------------------------------------------------
Current main_table length: 1229
Current comment_table lenth: 1378

Extracting information from page:  42  of  45
--------------------------------------------------
Current main_table length: 1259
Current comment_table lenth: 1379

Extracting information from page:  43  of  45
--------------------------------------------------
Current main_table length: 1289
Current comment_table lenth: 1384

Extracting information from page:  44  of  45
--------------------------------------------------
Current main_table length: 1319
Current comment_table lenth: 1385

Extracting information from page:  45  of  45
--------------------------------------------------
Current main_table length: 1326
Current comment_table lenth: 1389
Total time for scraping: 89.95090063810349 min.


In [32]:
# Write data to csv file
main_table.to_csv('rfd_main.csv')
comment_table.to_csv('rfd_comments.csv')

## Load data and explore

### Main table

In [61]:
df_raw = pd.read_csv('rfd_main.csv').iloc[:,1:]
df_raw.head()

Unnamed: 0,author,creation_date,expiry,last_reply,parent_category,price,replies,saving,source,thread_category,title,url,views,votes
0,flora0222,"Jul 16th, 2020 8:29 am",,"Jul 17th, 2020 9:20 am",,6.99,89,,Staples,Home & Garden,"One Step Hand Sanitizer, Fragrance-Free, 473mL...",https://staplescanada.4u8mqw.net/c/341376/7554...,15445,132
1,yellowmp5,"Jul 13th, 2020 1:29 pm",,"Jul 17th, 2020 9:18 am",,,441,,Home Depot,Home & Garden,RYOBI 20% coupon barcode,,59219,159
2,riseagainstthemachine,"Jul 2nd, 2020 10:03 am",,"Jul 17th, 2020 9:17 am",,free,92,,,Apparel,Kits.com Free Pair Prescription Glasses,https://www.kits.com/freeglasses.html,25242,54
3,Googliya,"Jul 17th, 2020 7:23 am",,"Jul 17th, 2020 9:15 am",,469.99,13,,Costco,Sports & Fitness,"Northrock xc00, fat Tire bike, $469.99",https://www.costco.ca/northrock-xc00-fat-tire-...,1769,2
4,Presents,"Jul 16th, 2020 1:36 pm","July 29, 2020","Jul 17th, 2020 9:15 am",,,24,,Canadian Tire,Automotive,60x total CT Money when you pay with your Tria...,,2981,9


In [34]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1326 entries, 0 to 1325
Data columns (total 14 columns):
author             1326 non-null object
creation_date      1326 non-null object
expiry             370 non-null object
last_reply         1326 non-null object
parent_category    825 non-null object
price              898 non-null object
replies            1326 non-null int64
saving             524 non-null object
source             987 non-null object
thread_category    1325 non-null object
title              1326 non-null object
url                1044 non-null object
views              1326 non-null int64
votes              1326 non-null int64
dtypes: int64(3), object(11)
memory usage: 145.2+ KB


In [35]:
df_raw.describe(include='all')

Unnamed: 0,author,creation_date,expiry,last_reply,parent_category,price,replies,saving,source,thread_category,title,url,views,votes
count,1326,1326,370,1326,825,898,1326.0,524,987,1325,1326,1044,1326.0,1326.0
unique,946,1283,73,1253,12,610,,286,148,55,1309,1019,,
top,immad01,"Jul 13th, 2020 12:24 pm","July 2, 2020","Jul 17th, 2020 8:31 am",Computers & Electronics,Free,,50%,Amazon.ca,Computers & Electronics,Book Outlet - Many Low Prices on Books (Free S...,https://staplescanada.4u8mqw.net/c/341376/7554...,,
freq,34,3,30,4,339,13,,31,213,176,2,2,,
mean,,,,,,,52.892911,,,,,,13734.675716,12.815988
std,,,,,,,144.362648,,,,,,33812.256561,28.023994
min,,,,,,,-1.0,,,,,,127.0,-73.0
25%,,,,,,,5.0,,,,,,2191.75,1.0
50%,,,,,,,15.0,,,,,,4208.0,5.0
75%,,,,,,,45.0,,,,,,11146.25,14.0


### Comments table

In [36]:
df_comments = pd.read_csv("rfd_comments.csv").loc[:,"title":]
df_comments.head()

Unnamed: 0,title,comments
0,Rheem 36kW Electric Tankless Water Heater Reg....,Purchased in store by a friend for 289+tax. We...
1,Rheem 36kW Electric Tankless Water Heater Reg....,Which store location? Do you have a picture of...
2,Rheem 36kW Electric Tankless Water Heater Reg....,OOS in BC
3,Rheem 36kW Electric Tankless Water Heater Reg....,"Saint John, NB. Receipt posted!"
4,Rheem 36kW Electric Tankless Water Heater Reg....,Only seems to be in stock (or is it) in Scarbo...


In [37]:
df_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1389 entries, 0 to 1388
Data columns (total 2 columns):
title       1389 non-null object
comments    1387 non-null object
dtypes: object(2)
memory usage: 21.8+ KB


## II. Data wrangling

### Comment table

From the the short exploration above we can see that we need to remove one row with missing values for the comments column. These probably correspond to comments without text. Further, comment strings will need to be cleaned.

In [38]:
# Delete rows with empty comments
df_comments.dropna(axis=0, inplace=True)
df_comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1387 entries, 0 to 1388
Data columns (total 2 columns):
title       1387 non-null object
comments    1387 non-null object
dtypes: object(2)
memory usage: 32.5+ KB


In [40]:
# Print first 100 comments
print([x for x in df_comments['comments'][0:100]])

["Purchased in store by a friend for 289+tax. Went back and got another for me for the same price. Not sure if it's a price error or a real sale.\n", 'Which store location? Do you have a picture of the advertised sticker price or receipt to price match? Hard to get this hot deal without much to go on', 'OOS in BC', 'Saint John, NB. Receipt posted!', 'Only seems to be in stock (or is it) in Scarborough in GTA.', 'You might have to upgrade your electrical service to use this. 36 kw at 240 volts works out to 150 amps. Lots of houses only have 100 amp services. \nBetter to have a gas tankless water heater if you have gas service.', "Kanus wrote: ↑\nYou might have to upgrade your electrical service to use this. 36 kw at 240 volts works out to 150 amps. Lots of houses only have 100 amp services. \nBetter to have a gas tankless water heater if you have gas service.\n\nEven with 200A... Lets say furnace is ON, this is 66A at my house. We are at 200 - 66 - 150 and we are already in the negative

The only way to clean these strings up will be through regular expressions. 

The following should be removed:  
* ↑ symbols
* \n new line characters
* urls

In [41]:
# ↑ symbols
arrow_removed = [re.sub("↑+","", str(string)) for string in df_comments['comments']]
# \n characters
newline_removed = [re.sub("\\n+"," ",string) for string in arrow_removed]
# urls
urls_removed = [re.sub(r"\bhttp.+"," ",string) for string in newline_removed]
# Assign cleaned comments back
df_comments['comments'] = pd.Series(urls_removed)

# first 100 comments of cleaned table
print([x for x in df_comments['comments'][0:100]])

["Purchased in store by a friend for 289+tax. Went back and got another for me for the same price. Not sure if it's a price error or a real sale. ", 'Which store location? Do you have a picture of the advertised sticker price or receipt to price match? Hard to get this hot deal without much to go on', 'OOS in BC', 'Saint John, NB. Receipt posted!', 'Only seems to be in stock (or is it) in Scarborough in GTA.', 'You might have to upgrade your electrical service to use this. 36 kw at 240 volts works out to 150 amps. Lots of houses only have 100 amp services.  Better to have a gas tankless water heater if you have gas service.', "Kanus wrote:  You might have to upgrade your electrical service to use this. 36 kw at 240 volts works out to 150 amps. Lots of houses only have 100 amp services.  Better to have a gas tankless water heater if you have gas service. Even with 200A... Lets say furnace is ON, this is 66A at my house. We are at 200 - 66 - 150 and we are already in the negative. Doesn'

In [62]:
# Save cleaned comment table as file
df_comments.to_csv('rfd_comments.csv')

df_comments.head()

Unnamed: 0,title,comments
0,Rheem 36kW Electric Tankless Water Heater Reg....,Purchased in store by a friend for 289+tax. We...
1,Rheem 36kW Electric Tankless Water Heater Reg....,Which store location? Do you have a picture of...
2,Rheem 36kW Electric Tankless Water Heater Reg....,OOS in BC
3,Rheem 36kW Electric Tankless Water Heater Reg....,"Saint John, NB. Receipt posted!"
4,Rheem 36kW Electric Tankless Water Heater Reg....,Only seems to be in stock (or is it) in Scarbo...


### Main table

In [63]:
# Copy of raw data set
df = df_raw.copy()

# List of tuples: (column name, column dtype)
col_dtypes = [(col, type(x)) for x,col in zip(df.iloc[0], df.columns)]

# Print tuple for columns containing dates
for col in col_dtypes:
    if col[0] in ['creation_date', 'last_reply', 'expiry']:
        print(col[0], ': ', col[1])

creation_date :  <class 'str'>
expiry :  <class 'float'>
last_reply :  <class 'str'>


None of the columns are formatted as datetime. To facilitate working with the dates, we will convert them to datetime. 

### Convert date columns to datetime dtype

In [64]:
def to_datetime(column_name: str) -> pd.Series:
    """
    Converts a column of either format "%b %d, %Y %I:%M %p"
    or format "%B %d, %Y" from string to date-time
    
    Args:
    date_column - name of column with dates encoded as strings
    
    Returns:
    Column elements converted to datetime in a pandas.Series object
    """    
    # Superfluous characters removed
    column_clean = df[column_name].str.replace("st","").str.replace("nd","")\
                        .str.replace("rd","").str.replace("th","").str.strip()
    
    # Check for correct length of cleaned column
    column_len = len(column_clean)
    print("Cleaned and original column are of equal lenght: ", 
          column_len == len(df[column_name]), "\n")
    
    # Convert each entry from format "%b %d, %Y %I:%M %p" to datetime
    date_column = []
    try:
        date_column = column_clean.apply(lambda x :\
                        datetime.datetime.strptime(str(x), "%b %d, %Y %I:%M %p"))
    except: 
        print("\"%b %d, %Y %I:%M %p\" is incorrect format")
        pass
    
    # Convert from format "%B %d, %Y" to datetime
    for date in df[column_name]:
        if date is not np.nan:
            try:
                date_column.append(datetime.datetime.strptime(date, "%B %d, %Y"))
            except: 
                print("\"%B %d, %Y\" is incorrect format for", date)
                break
        else: 
            date_column.append(None)
    
    if len(date_column) != column_len:
        print("\n", "Incorrect column length!\n")
    else:
        print("\n", "Column has expected length!\n")
    
    return pd.Series(date_column)

In [65]:
# creation_date column converted to datetime
creation_date = to_datetime('creation_date')

# Compare random slice of original and converted column
print(creation_date.iloc[99:105], "\n")
print(df.loc[99:104, 'creation_date'])

Cleaned and original column are of equal lenght:  True 

"%B %d, %Y" is incorrect format for Jul 16th, 2020 8:29 am

 Column has expected length!

99    2020-07-16 20:38:00
100   2020-07-13 13:49:00
101   2020-07-15 21:33:00
102   2020-06-13 16:26:00
103   2020-07-09 17:42:00
104   2020-07-06 03:46:00
Name: creation_date, dtype: datetime64[ns] 

99     Jul 16th, 2020 8:38 pm
100    Jul 13th, 2020 1:49 pm
101    Jul 15th, 2020 9:33 pm
102    Jun 13th, 2020 4:26 pm
103     Jul 9th, 2020 5:42 pm
104     Jul 6th, 2020 3:46 am
Name: creation_date, dtype: object


In [66]:
# last_reply column converted to datetime
last_reply = to_datetime('last_reply')

# Print original and new column for comparison
print(last_reply.iloc[208:215], "\n")
print(df.loc[208:214, 'last_reply'])

Cleaned and original column are of equal lenght:  True 

"%B %d, %Y" is incorrect format for Jul 17th, 2020 9:20 am

 Column has expected length!

208   2020-07-16 13:56:00
209   2020-07-16 13:56:00
210   2020-07-16 13:54:00
211   2020-07-16 13:39:00
212   2020-07-16 13:36:00
213   2020-07-16 13:29:00
214   2020-07-16 13:26:00
Name: last_reply, dtype: datetime64[ns] 

208    Jul 16th, 2020 1:56 pm
209    Jul 16th, 2020 1:56 pm
210    Jul 16th, 2020 1:54 pm
211    Jul 16th, 2020 1:39 pm
212    Jul 16th, 2020 1:36 pm
213    Jul 16th, 2020 1:29 pm
214    Jul 16th, 2020 1:26 pm
Name: last_reply, dtype: object


In [67]:
expiry = to_datetime('expiry')
print(expiry.iloc[50:57], "\n")
print(df.loc[50:56, 'expiry'])

Cleaned and original column are of equal lenght:  True 

"%b %d, %Y %I:%M %p" is incorrect format

 Column has expected length!

50   NaT
51   NaT
52   NaT
53   NaT
54   NaT
55   NaT
56   NaT
dtype: datetime64[ns] 

50    NaN
51    NaN
52    NaN
53    NaN
54    NaN
55    NaN
56    NaN
Name: expiry, dtype: object


The to_datetime() function appears to correctly convert each of the columns. The results can now be used in the DataFrame.

In [68]:
# Assign datetime columns to DataFrame
df.expiry = expiry
df.last_reply = last_reply
df.creation_date = creation_date

# Verify dates
df.head()

Unnamed: 0,author,creation_date,expiry,last_reply,parent_category,price,replies,saving,source,thread_category,title,url,views,votes
0,flora0222,2020-07-16 08:29:00,NaT,2020-07-17 09:20:00,,6.99,89,,Staples,Home & Garden,"One Step Hand Sanitizer, Fragrance-Free, 473mL...",https://staplescanada.4u8mqw.net/c/341376/7554...,15445,132
1,yellowmp5,2020-07-13 13:29:00,NaT,2020-07-17 09:18:00,,,441,,Home Depot,Home & Garden,RYOBI 20% coupon barcode,,59219,159
2,riseagainstthemachine,2020-07-02 10:03:00,NaT,2020-07-17 09:17:00,,free,92,,,Apparel,Kits.com Free Pair Prescription Glasses,https://www.kits.com/freeglasses.html,25242,54
3,Googliya,2020-07-17 07:23:00,NaT,2020-07-17 09:15:00,,469.99,13,,Costco,Sports & Fitness,"Northrock xc00, fat Tire bike, $469.99",https://www.costco.ca/northrock-xc00-fat-tire-...,1769,2
4,Presents,2020-07-16 13:36:00,2020-07-29,2020-07-17 09:15:00,,,24,,Canadian Tire,Automotive,60x total CT Money when you pay with your Tria...,,2981,9


### Dealing with missing data: `source`

In [69]:
df.loc[:, ['source', 'title']].head()

Unnamed: 0,source,title
0,Staples,"One Step Hand Sanitizer, Fragrance-Free, 473mL..."
1,Home Depot,RYOBI 20% coupon barcode
2,,Kits.com Free Pair Prescription Glasses
3,Costco,"Northrock xc00, fat Tire bike, $469.99"
4,Canadian Tire,60x total CT Money when you pay with your Tria...


It is possible, that users simply forgot to include the source of the deal. We will check if missing sources are mentioned in the corresponding title.

In [70]:
# Set of entries in 'source' column
retailer_set = set(df['source'].dropna())
print("Number of unique sources: ", len(retailer_set))
print(df.source.isnull().sum(), "missing values in source column")

Number of unique sources:  148
339 missing values in source column


The large number of unique sources is promising! 

Next we will use the set previously created to itterate through the titles an check if any of the unique source names are present. If a source name from the set is found in `title` and no value is found in the corresponding `source` column, then the index as well as the source name are saved in the `replace` dictinoary.

In [71]:
replace_dict = {} # key: index; value: retailer name to replace missing source value at index

# Iterate through set of unique values from source source column
for retailer in retailer_set:
    """Fill replace dictioray with indecies and source names. Entries are made
    when a source name is found in the title column while the corresponding source entry
    is empty."""
    
    # Iterate through 'source' and 'title' columns row-by-row
    # Generate boolean array: True if unique source name (retailer) found in "title" and "source" is np.nan
    source_missing_and_in_title = np.array([retailer in title 
                                     if source is np.nan else False
                                     for title,source in zip(df.title, df.source)])
    
    # Indecies for which source_missing_and_in_title is True
    replacement_indicies = np.where(source_missing_and_in_title == True)[0]
    # Fill "replace" dictionary
    for index in replacement_indicies:
        if index not in replace_dict.keys():
            replace_dict[index] = retailer

print("Replacements found in 'title':", len(replace_dict.values()))

Replacements found in 'title': 37


Some missing sources can be replaced by information found in the title. We will use the indecies and values stored in `replace_dict` to replace the appropriate values.

In [72]:
source_list = list(df.source) # copy of source column 
missing_start = sum([x is np.nan for x in source_list]) # missing values before cleaning
print("Missing source values before replacement:", missing_start)

for replace in replace_dict.items():
    index = replace[0]
    source_replacement = replace[1]
    source_list[index] = source_replacement

missing_end = sum([x is np.nan for x in source_list]) # missing values after cleaning
print("Missing source values after replacement:", missing_end)
replaced_count = missing_start-missing_end # number of replaced values
print(replaced_count, "missing source records have been replaced!")

Missing source values before replacement: 339
Missing source values after replacement: 302
37 missing source records have been replaced!


All identified `source` records have been replaced with apporiate names of retailers found in the corrseponding `title` column. The new `source` column can now replace the old one.  

In [73]:
df.source = source_list
print("Number of missing values as expected:", (df.source.isnull().sum() == missing_end))

Number of missing values as expected: True


Further substitutions for missing `source` values may be found in the `url` column. The objective is to extract company names from the urls and use them to further replace missing values.

In [74]:
# 'url' entries of rows with missing source values
url_replacement = df[df.source.isnull()].url
print(url_replacement.notnull().sum(), "missing source values have corresponding urls")
url_replacement.head()

245 missing source values have corresponding urls


2                 https://www.kits.com/freeglasses.html
6     http://go.redirectingat.com?id=2927x594702&amp...
10                 https://www.ivctel.com/Internet/Plan
25    https://www.awin1.com/awclick.php?gid=340116&a...
27    http://www.amazon.ca/gp/redirect.html?ie=UTF8&...
Name: url, dtype: object

The urls need to be split and cleaned to extract the name of the organisation. The final cleaned values and their corresponding indicies in the DataFrame will be stored in the `clean_urls` disctionary.

In [75]:
clean_urls = {} # key: index in df, value: cleaned url
indicies = url_replacement.index

for url in zip(indicies, url_replacement):
    index = url[0]
    replacement_url = url[1]
    
    # Clean if url value not missing
    if replacement_url is not np.nan:
        url_root = replacement_url.split("//")[1].split("/")[0].split("?")[0].replace("www.", "")
        removed_domain = url_root.split(".")
        clean_urls[index] = removed_domain
    else:
        clean_urls[index] = np.nan
        
print(clean_urls)

{2: ['kits', 'com'], 6: ['go', 'redirectingat', 'com'], 10: ['ivctel', 'com'], 25: ['awin1', 'com'], 27: ['amazon', 'ca'], 31: ['awin1', 'com'], 34: ['oxio', 'ca'], 44: ['twitter', 'com'], 50: ['outilspierreberger', 'com'], 53: ['play', 'google', 'com'], 58: ['outilspierreberger', 'com'], 59: ['outilspierreberger', 'com'], 66: ['slickdeals', 'net'], 69: ['amazon', 'ca'], 81: ['buyapi', 'ca'], 86: ['go', 'redirectingat', 'com'], 97: nan, 98: nan, 101: nan, 109: nan, 112: nan, 113: nan, 115: ['schweser', 'com'], 117: ['tkqlhce', 'com'], 119: ['tkqlhce', 'com'], 134: ['performancehealth', 'ca'], 135: ['poppacorn', 'ca'], 144: ['amazon', 'ca'], 145: ['anrdoezrs', 'net'], 160: nan, 161: ['rover', 'ebay', 'com'], 162: ['ebox', 'ca'], 163: ['altimatel', 'com'], 165: ['detourcoffee', 'com'], 166: ['awin1', 'com'], 169: nan, 173: ['amazon', 'ca'], 177: ['kits', 'com'], 178: ['ninjakitchen', 'com'], 179: ['ninjakitchen', 'com'], 185: ['register', 'ubisoft', 'com'], 186: ['go', 'redirectingat', '

We want to identify the company names from the url splits observed in the print above.
The patterns shown in the table will facilitate this process. This is an oversimplification and will lead to some false extractions but the number of errors should be minimal.

|Condition| Pattern|
|---|---|
|Lists length 2| company name is at index 0|
|Lists length 3 and domain com, ca, or net| name is at index 1|
|List length 3 and domain io| name is at index 0| 
|List length 4| no identifiable name|

In [76]:
clean_url_final = clean_urls.copy()

for item in clean_url_final.items():
    index = item[0]
    url_split = item[1]
    try:
        if len(url_split) == 2:
             # name at index 0
            clean_url_final[index] = url_split[0].title()
        
        elif ((len(url_split) == 3) 
                        and ((url_split[-1] == "com") 
                                 or (url_split[-1] == "ca") 
                                 or (url_split[-1] == "ca"))):
            # name at index 1
            clean_url_final[index] = url_split[1].title()
        
        elif ((len(url_split) == 3) 
                        and (url_split[-1] == "io")):
             # name at index 0
            clean_url_final[index] = url_split[0].title()
        else: 
              clean_url_final[index] = np.nan
    except: value = np.nan

In [77]:
# Add url-derived company names to DataFrame
df.loc[list(clean_url_final.keys()),'source'] = list(clean_url_final.values())
print("Missing source values remaining: ", df.source.isnull().sum())

Missing source values remaining:  63


### Dealing with missing data: `price`

Users may have forgoten to tag prices associated with the deals they posted. We will verify if their are any `$` signs in the title for those rows that have missing price values.

In [78]:
missing_prices_df = df[df.price.isnull()]
price_in_title = ["$" in title for title in missing_prices_df.title]
print(df.price.isnull().sum(), "missing values in 'price' column")
print(sum(price_in_title), "missing prices have '$' signs in the title") 

428 missing values in 'price' column
217 missing prices have '$' signs in the title


In [79]:
# Display first 10 title to evaluate if the missing price could be substituted
replacement_titles = missing_prices_df[price_in_title].title
[title for title in replacement_titles][0:10]

['Old Navy: $15 for 5 Reusable Face Masks (Triple-Layer Cotton) Adults & Children',
 'Book Outlet - Many Low Prices on Books (Free Shipping Over $45 & 16% Off)',
 'Amex Offers Spend $10, get $5 - Up to 10 times (starts June 24th)',
 'Olympic 2" weight plates - 2.5lbs ($4.05), 5lbs ($7.96), 10lbs ($13.08), 25lbs ($35.97), 35lbs ($39.97), 45lbs ($66.97)',
 'Book Outlet - Many Low Prices on Books (Free Shipping Over $45 & 16% Off)',
 '[Lightning Deal] Coppertone Sport Continuous Sunscreen Spray Spf 30 Duo Pack 222mL $10.56',
 'Lenovo Duet Chromebook 2-in-1: $399 - preorder',
 'CIBC Earn $300 and 12 Month Fee Rebate with a CIBC Smart Account',
 'Bit Defender 1 Year Licenses - $14.99 + $19.99',
 'Harman Kardon Astra Bluetooth Speaker - Black - Open Box $49.99 (Brand New $429.99)']

In [80]:
regex = "[$]+[.,]*\d+[.,]*\d+"\
        "|[.,]*\d+[.,]*\d+[$]+"\
        "|[a-zA-Z]+[$]+[.,]*\d+[.,]*\d+"
price_replacements = replacement_titles.str.findall(regex)
print("Number of possible replacements:", len(price_replacements))
price_replacements

Number of possible replacements: 217


13                                               [$15]
25                                               [$45]
26                                               [$10]
27      [$4.05, $7.96, $13.08, $35.97, $39.97, $66.97]
31                                               [$45]
                             ...                      
1312                                    [$4.99, $3.99]
1315                                          [$39.99]
1316                                        [$60, $99]
1317                                            [$187]
1321                                            [$112]
Name: title, Length: 217, dtype: object

We will assume the first element in each list is most relevant and use it to replace missing price values. Some inaccuracies are likely to occure but the estimates should be reasonable for the most part. Most often, user seem to not include pricers in the summary if the product is available at different price categories. Picking one the prices is better than having no information at all.

In [81]:
replacement_dict = {} # key: index; value: price to replace missing value at index

# Iterate through price lists found in price_replacements and corresonding indecies in DataFrame
for replacement in zip(price_replacements, list(price_replacements.index)):
    price_list = replacement[0]
    index = replacement[1]
    if price_list != []:
        price = price_list[0]
        price_clean = (re.search(r"\d+[.,]*\d+", price)).group().replace(",","")
        replacement_dict[index] = price_clean
        
print(len(replacement_dict), "replacements found.")

207 replacements found.


In [82]:
# Replace missing values
df.loc[list(replacement_dict.keys()), 'price'] = list(replacement_dict.values())
print("Remaining missing values:", df.price.isnull().sum())

Remaining missing values: 221


In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1326 entries, 0 to 1325
Data columns (total 14 columns):
author             1326 non-null object
creation_date      1326 non-null datetime64[ns]
expiry             370 non-null datetime64[ns]
last_reply         1326 non-null datetime64[ns]
parent_category    825 non-null object
price              1105 non-null object
replies            1326 non-null int64
saving             524 non-null object
source             1263 non-null object
thread_category    1325 non-null object
title              1326 non-null object
url                1044 non-null object
views              1326 non-null int64
votes              1326 non-null int64
dtypes: datetime64[ns](3), int64(3), object(8)
memory usage: 145.2+ KB


### Dealing with missing data: `saving`

In [98]:
# Titles for which the saving entry is missing
missing_savings_df = df[df.saving.isnull()]
print([title for title in missing_savings_df.title.head(20)])

['One Step Hand Sanitizer, Fragrance-Free, 473mL, $6.99', 'RYOBI 20% coupon barcode', 'Kits.com Free Pair Prescription Glasses', 'Northrock xc00, fat Tire bike, $469.99', '60x total CT Money when you pay with your Triangle WE MC - All Tire & Wheel purchases (YMMV)', 'Wahl deluxe haircutting kit $39.99 YMMV', 'Fluval Flex 9 gallon aquarium- $99.99', 'LU28R550UQNXZA Samsung 28” 4K IPS Monitor $327.99', '100mb cable internet 39$', 'Costco.ca Geometric dome climbing structure $199', 'Old Navy: $15 for 5 Reusable Face Masks (Triple-Layer Cotton) Adults & Children', 'Dell Business 15% Off Email Signup Coupon, *Stacking* w/ 10%OFFMONITOR, +6% Rakuten CB (HOT)', 'Kettle Chips New York Cheddar Chips, 220 Gram for $2.49 (add-on item)', 'Kasa Smart Outdoor Plug with 2 Outlets, Individual Control, IP64 Waterproof', 'HSBC Fixed rates: 2.04% 2-year, 1.99% 5-yr Insured & 2.29% 5-Yr (with cashback!)', 'Hilroy 1-Subject Notebook, 10-1/2" x 8", Assorted, 80 Pages $0.10, 24pk markers $2.49', '[CLEARANCE]

Lastly we will search "%" symbols in rows for which the "saving" column entry is empty, similar to what we have done for prices.

In [192]:
# Titles containing the % symbol may contain information on savings
# "saving_in_title" indicates the indicies for which there is no data
# in the "saving" column and a "%" is found in the title.
saving_in_title = ["%" in title for title in missing_savings_df.title]
print(df.saving.isnull().sum(), "missing values in 'saving' column")
print(sum(saving_in_title), "rows with missing 'saving' data have a '%' symbol in their title") 

802 missing values in 'saving' column
63 rows with missing 'saving' data have a '%' symbol in their title


In [224]:
# Titles containing the % symbol in rows with missing 'saving' entries  
replacement_titles = missing_savings_df[saving_in_title].title

# Extract savings data
regex = "[.,]*\d+[.,]*\d+[%]+"
saving_replacements = replacement_titles.str.findall(regex)
print("Number of possible replacements:", len(replacement_titles))
saving_replacements

Number of possible replacements: 63


1                       [20%]
16                 [15%, 10%]
20      [2.04%, 1.99%, 2.29%]
72                      [20%]
73                      [10%]
                ...          
1273               [10%, 15%]
1276                    [25%]
1307               [25%, 10%]
1319                    [60%]
1320                    [60%]
Name: title, Length: 63, dtype: object

Again, we will assume the first occurance to be most relevant. 

In [221]:
replacements = {}
index_saving_tuples = zip(saving_replacements.index, saving_replacements)
for index, saving in index_saving_tuples:
    try:
        replacements[index] = saving[0]
    except:
        print("Empty list found in 'saving_replacements'")
replacements

Empty list found in 'saving_replacements'
Empty list found in 'saving_replacements'
Empty list found in 'saving_replacements'
Empty list found in 'saving_replacements'
Empty list found in 'saving_replacements'
Empty list found in 'saving_replacements'
Empty list found in 'saving_replacements'
Empty list found in 'saving_replacements'


{1: '20%',
 16: '15%',
 20: '2.04%',
 72: '20%',
 73: '10%',
 103: '50%',
 107: '30%',
 116: '10%',
 170: '50%',
 213: '70%',
 217: '2.5%',
 223: '12%',
 246: '2.70%',
 259: '87%',
 301: '25%',
 304: '50%',
 322: '25%',
 369: '91%',
 432: '90%',
 435: '30%',
 439: '37%',
 449: '75%',
 473: '20%',
 533: '40%',
 597: '40%',
 614: '70%',
 638: '60%',
 646: '60%',
 659: '2.5%',
 661: '10%',
 662: '80%',
 735: '40%',
 756: '75%',
 791: '50%',
 819: '99.9%',
 831: '30%',
 849: '2.8%',
 895: '50%',
 934: '2.14%',
 939: '50%',
 949: '30%',
 960: '25%',
 1070: '20%',
 1096: '50%',
 1141: '25%',
 1160: '50%',
 1181: '57%',
 1221: '25%',
 1235: '2.54%',
 1248: '15%',
 1273: '10%',
 1276: '25%',
 1307: '25%',
 1319: '60%',
 1320: '60%'}

We will expand our search for percent symbols to the comments to see if we can increase the number of replacements.

In [130]:
# df slice with missing "saving" data and no "%" symbol in "titel"
no_title_replacement = missing_savings_df[[(not replaceable) for replaceable in saving_in_title]]

# titles will be used as ids for corresponding comments
comment_ids = set(no_title_replacement.title)

In [169]:
# Convert ids into indecies for comments_df
comment_indecies = [(x in comment_ids) for x in df_comments.title]


The DataFrame includes comments derived from initial posts as well the corresponding responses. To find information on savings, we are mainly interested in the initial posts. This is because we expect the authors to limmit their comments to pertinant information on the sales deal. Percent symbols may be present in the responses but are less likely to correspond to savings associated with the product. 

Each post will have a row with its  title for every response that was made. To retrieve only the initial posts of the authors, we will group by title and then select the `first_valid_index()` from each group object. 

In [190]:
# Indecies for which titles appear for the first time
index_initial_posts = [x for x in df_comments[comment_indecies]\
                       .groupby('title').apply(pd.DataFrame.first_valid_index)]

replacement_comments = df_comments.iloc[index_initial_posts]

In [225]:
# Search for % symbol in comments
saving_found = ["%" in str(comment) for comment in replacement_comments.comments]
print(sum(saving_found), "row(s) with missing 'saving' data have a '%' symbol in their title") 

1 row(s) with missing 'saving' data have a '%' symbol in their title


In [228]:
replacement_comments[saving_found].comments.iloc[0]

'Thanks OP! I ordered 1 from buy buy baby. $251.99 all in with 20% off coupon'

Unfortunately only one replacement was found. Another option that could be explored in the future would be to search the titles for other indicators. For example strings of the format "50$ less" could be explored.

### Homogenizing the "saving" column

In [229]:
df.saving.value_counts()

50%           31
50% off       27
100%          12
40%           12
40% off       10
              ..
81%            1
81% off        1
20% to 30%     1
$120           1
$75 off        1
Name: saving, Length: 286, dtype: int64