# The Smart Local Data Collection


## Description of Dataset
- **url**: url of article
- **timedelta**: number of days between publication date and web scraping date (2 November 2022)
- **title**: title of article
- **category**: main category of article
- **subcategory1**: first subcategory of article
- **subcategory2**: second subcategory of article
- **subcategory3**: third subcategory of article
- **preview**: preview content of article (before clicking into article)
- **content**: full content of article (includes image credits)
- **n_tokens_title**: number of words in title (alphanumerical and including ampersands)
- **title_polarity**: polarity of title, values in the range of [-1, 1], -1: negative, 1: positive
- **title_subjectivity**: subjectivity of title, values in the range of [0, 1], 0: objective, 1: subjective
- **n_tokens_preview**: number of words in preview (alphanumerical and including ampersands)
- **preview_polarity**: polarity of preview, values in the range of [-1, 1], -1: negative, 1: positive
- **preview_subjectivity**: subjectivity of preview, values in the range of [0, 1], 0: objective, 1: subjective
- **n_tokens_content**: number of words in content of article
- **prop_non_stop**: proportion of stop words in content of article
- **prop_unique_non_stop**: proportion of unique stop words in content of article
- **content_polarity**: polarity of content, values in the range of [-1, 1], -1: negative, 1: positive
- **content_subjectivity**: subjectivity of content, values in the range of [0, 1], 0: objective, 1: subjective
- **reading_duration**: number of minutes to read entire article, around 200 words per minute
- **author**: author of article
- **publish_date**: publication date of article
- **day_of_week**: day of publication of article, 0: Monday, 6: Sunday
- **month**: month of publication of article
- **year**: year of publication of article
- **num_imgs**: number of images in article
- **img_links**: list of urls of images
- **num_hrefs**: number of hyperlinks in article
- **num_self_hrefs**: number of hyperlinks in article linked to thesmartlocal.com
- **num_tags**: number of tags at the end of the article
- **num_shares**: number of shares of article

## Import Libraries

In [1]:
import datetime
import re
import requests
import string
import time

import contractions
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns

from bs4 import BeautifulSoup
from functools import reduce
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from textblob import TextBlob
from tqdm import tqdm
from typing import List, Tuple

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pngse\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Utility Functions

In [2]:
def get_non_stop_words_rate(text: str, stopwords: List[str]) -> Tuple[float, float]:
    """
    Returns a tuple of proportion of non stopwords in text and
    proportion of unique non stopwords in text.
    """
    tokens = text.split()
    new_text = text
    for sw in stopwords:
        new_text = re.sub(pattern=f'\s*{sw}\s*', repl=' ', string=new_text)
    non_stop_words_tokens = new_text.split()
    
    # Proportion of non stopwords in text
    rate = len(non_stop_words_tokens)/len(tokens)
    
    # Proportion of unique non stopwords in text
    unique_rate = len(set(non_stop_words_tokens))/len(set(tokens))
    
    return rate, unique_rate

def get_sentiment(text:str) -> Tuple[float, float]:
    """
    Returns a tuple of polarity and sentiment of text, 
    Polarity values are in the range of [-1, 1], 
    -1: negative, 1: positive.
    Sentiment values in the range of [0, 1],
    0: objective, 1: subjective.
    """
    blob = TextBlob(text)
    sentiment = blob.sentiment
    
    return (sentiment.polarity, sentiment.subjectivity)

def lemmatize(text: str) -> str:
    """Converts a text into its lemmatized form."""
    wnl = WordNetLemmatizer()
    return ' '.join([wnl.lemmatize(word) for word in text.split()])

def remove_punctuations(text: str) -> str:
    """Removes punctuations and keeps all alphanumerical text."""
    return re.sub(r'[^\w\s]', '', text)

def remove_tags(text: str) -> str:
    """Removes social media tags."""
    return re.sub('@\w+', '', text.lower())

def remove_stopwords(text: str, stopwords: List[str]) -> str:
    """Removes stopwords from text."""
    new_text = text
    for s in stopwords:
        pattern = ' ' + s + ' '
        new_text = re.sub(pattern, ' ', new_text)
    
    return new_text

# Web Scraping

Data of articles from 24 December 2018 to 2 November 2022 are scraped from [The Smart Local's website](https://thesmartlocal.com/) to create the underlying dataset. This block of code retrieves the url, title, category and preview text of 4080 articles. The robots.txt file was checked before scraping to avoid any violations and scraping was done overnight to reduce disruptions to the company.

In [3]:
overall_list = []

# Information of 10 articles' previews are retrieved per loop
for i in tqdm(range(1, 409)):
    url = 'https://thesmartlocal.com/page/{}/'.format(i)
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, 'lxml')
    
    for preview in soup.select('.col-lg-6+ .col-lg-6'):
        title = preview.h2.a.string
        category = preview.li.get_text(strip=True)
        article_url = preview.h2.a.get('href')
        article_summary = preview.div.get_text(' ', strip=True)

        temp_list = [article_url, title, category, article_summary]
        overall_list.append(temp_list)
    
    # Pause requests for half a second to avoid spamming the website
    # with requests
    time.sleep(0.5)

article_preview = pd.DataFrame(overall_list,
                               columns=['url', 'title', 'subcategory',
                                        'preview'])

article_preview

100%|████████████████████████████████████████████████████████████████████████████████| 408/408 [08:37<00:00,  1.27s/it]


Unnamed: 0,url,title,subcategory,preview
0,https://thesmartlocal.com/read/staytion-marsil...,Staytion Marsiling: Coworking Space In The Nor...,Career,"Hooray for being able to sleep in, plus the ti..."
1,https://thesmartlocal.com/read/things-to-do-es...,"Esplanade Is Having Free Shows, A Theatre BTS ...",Things To Do In Singapore,Do not miss the free entertainment here.
2,https://thesmartlocal.com/read/things-to-do-no...,17 New Things To Do In November 2022 – Bishan ...,Activities,"In the blink of an eye, we're approaching 2023..."
3,https://thesmartlocal.com/read/paypal-welcome-...,You Can Redeem Vouchers For Brands Like foodpa...,Businesses,Vouchers can also be used on Zalora and Agoda.
4,https://thesmartlocal.com/read/things-to-do-ju...,9 Best Things To Do In Jurong For Westies To S...,Things To Do In Singapore,"Hot take: west side, best side."
...,...,...,...,...
4075,https://thesmartlocal.com/read/deals-end-2018/,9 Money-Saving Hacks That Will Expire On 31 De...,Things To Do In Singapore,Deals and hacks before 2019 You know the en...
4076,https://thesmartlocal.com/read/inspiration-sto...,Inspiration Store At Orchard Xchange Teaches Y...,Events,JR East Inspiration Store Here’s a thought –...
4077,https://thesmartlocal.com/read/nascans-sg/,NASCANS Has Ex-MOE Teachers And Coaches Maths ...,Businesses,NASCANS Student Care Centre If you’ve racked...
4078,https://thesmartlocal.com/read/family-spots-no...,6 Hidden Family Spots In The North To Get Your...,Things To Do In Singapore,Family places and activities in Singapore’s No...


The following code uses the urls scraped earlier to further scrape more information of each article, such as reading duration, author, publish date, content, number of images, number of links, number of self-directed links to The Smart Local, number of tags and number of shares.

In [6]:
article_info = []

for url in tqdm(sl.url):
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, "lxml")
    
    after_title = soup.select('.after-title')[0]
    reading_duration = (after_title.span.string if len(after_title) > 1
                        else after_title.string)

    author = soup.select('#meta-author a')[0].string

    # Date format yyyy-mm-dd
    publish_date = soup.select('#meta-date time')[0].get('datetime')

    # Article body
    article_body = soup.select('#wtr-content,.post-content')[0]
    content = article_body.get_text(' ', strip=True)

    # Number of images in article
    img_selector = 'p img, .size-full, .alignnone'
    num_imgs = len(soup.select(img_selector))
    
    # List of links to images in each article
    img_urls = [img.get('src') for img in soup.select(img_selector)]

    # List of links in articles
    href_list = article_body.find_all('a', class_='', href=True)

    # Total number of links in article
    num_hrefs = len(href_list)

    # Number of links self-directed to smartlocal
    num_self_hrefs = len(['smartlocal.com' in l.get('href') for l in href_list])

    # Number of article shares
    num_shares = soup.select('.mashsbcount')[0].string

    # Number of tags at the end of the article
    num_tags = len(soup.select('.post-tags')[0].find_all('a'))

    temp_list = [content, reading_duration, author, publish_date, num_imgs,
                 img_urls, num_hrefs, num_self_hrefs, num_tags, num_shares]

    article_info.append(temp_list)
    
    # Pause requests for half a second to avoid spamming the website
    # with requests
    time.sleep(0.5)

cols = ['content', 'reading_duration', 'author', 'publish_date', 'num_imgs',
        'img_links', 'num_hrefs', 'num_self_hrefs', 'num_tags', 'num_shares']

article_info = pd.DataFrame(article_info, columns=cols)

article_info

100%|████████████████████████████████████████████████████████████████████████████| 4080/4080 [1:50:19<00:00,  1.62s/it]


Unnamed: 0,content,reading_duration,author,publish_date,num_imgs,img_links,num_hrefs,num_self_hrefs,num_tags,num_shares
0,Staytion Marsiling – Coworking space in the No...,4,Renae Cheng,2022-11-02,7,[https://thesmartlocal.com/wp-content/uploads/...,17,17,1,27
1,Things to do at Esplanade So you’ve been to Es...,3,Samantha Nguyen,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,7,7,2,73
2,Things to do in November 2022 Halloween may be...,13,Kezia Tan,2022-11-01,33,[https://thesmartlocal.com/wp-content/uploads/...,63,63,6,244
3,PayPal’s Welcome Pack promotion With Black Fri...,3,Aditi Kashyap,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,4,4,2,25
4,"Things to do in Jurong For too long, residents...",10,Raewyn Koh,2022-11-01,24,[https://thesmartlocal.com/wp-content/uploads/...,18,18,3,31
...,...,...,...,...,...,...,...,...,...,...
4075,Deals and hacks before 2019 You know the end-o...,4,Sammi Kor,2018-12-26,14,[https://thesmartlocal.com/wp-content/uploads/...,17,17,0,0
4076,JR East Inspiration Store Here’s a thought – m...,3,Sammi Kor,2018-12-24,10,[https://thesmartlocal.com/wp-content/uploads/...,3,3,0,0
4077,NASCANS Student Care Centre If you’ve racked y...,7,Sammi Kor,2018-12-24,12,[https://thesmartlocal.com/wp-content/uploads/...,6,6,0,0
4078,Family places and activities in Singapore’s No...,5,Renae Cheng,2018-12-24,24,[https://thesmartlocal.com/wp-content/uploads/...,7,7,0,119


The two scraped dataframes are then concatenated into one single dataframe.

In [6]:
smartlocal = pd.concat(objs=[article_preview, article_info], axis=1)

# Remove share count at the end of content body
remove_share = lambda x: re.sub(pattern='\s*\d+ SHARES Share Tweet',
                                repl='',
                                string=x)

smartlocal.content = (smartlocal.content.apply(remove_share))

# Remove commas in num_shares and convert to int type
convert_int = lambda x: int(re.sub(pattern=',', repl='', string=x))
smartlocal.num_shares = smartlocal.num_shares.apply(convert_int)

smartlocal

Unnamed: 0,url,title,subcategory,preview,content,reading_duration,author,publish_date,num_imgs,img_links,num_hrefs,num_self_hrefs,num_tags,num_shares
0,https://thesmartlocal.com/read/staytion-marsil...,Staytion Marsiling: Coworking Space In The Nor...,Career,"Hooray for being able to sleep in, plus the ti...",Staytion Marsiling – Coworking space in the No...,4,Renae Cheng,2022-11-02,7,[https://thesmartlocal.com/wp-content/uploads/...,17,17,1,0
1,https://thesmartlocal.com/read/things-to-do-es...,"Esplanade Is Having Free Shows, A Theatre BTS ...",Things To Do In Singapore,Do not miss the free entertainment here.,Things to do at Esplanade So you’ve been to Es...,3,Samantha Nguyen,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,7,7,2,20
2,https://thesmartlocal.com/read/things-to-do-no...,17 New Things To Do In November 2022 – Bishan ...,Activities,"In the blink of an eye, we're approaching 2023...",Things to do in November 2022 Halloween may be...,13,Kezia Tan,2022-11-01,33,[https://thesmartlocal.com/wp-content/uploads/...,63,63,5,144
3,https://thesmartlocal.com/read/paypal-welcome-...,You Can Redeem Vouchers For Brands Like foodpa...,Businesses,Vouchers can also be used on Zalora and Agoda.,PayPal’s Welcome Pack promotion With Black Fri...,3,Aditi Kashyap,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,4,4,2,13
4,https://thesmartlocal.com/read/things-to-do-ju...,9 Best Things To Do In Jurong For Westies To S...,Things To Do In Singapore,"Hot take: west side, best side.","Things to do in Jurong For too long, residents...",10,Raewyn Koh,2022-11-01,24,[https://thesmartlocal.com/wp-content/uploads/...,18,18,3,52
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4075,https://thesmartlocal.com/read/deals-end-2018/,9 Money-Saving Hacks That Will Expire On 31 De...,Things To Do In Singapore,Deals and hacks before 2019 You know the en...,Deals and hacks before 2019 You know the end-o...,4,Sammi Kor,2018-12-26,14,[https://thesmartlocal.com/wp-content/uploads/...,17,17,0,0
4076,https://thesmartlocal.com/read/inspiration-sto...,Inspiration Store At Orchard Xchange Teaches Y...,Events,JR East Inspiration Store Here’s a thought –...,JR East Inspiration Store Here’s a thought – m...,3,Sammi Kor,2018-12-24,10,[https://thesmartlocal.com/wp-content/uploads/...,3,3,0,0
4077,https://thesmartlocal.com/read/nascans-sg/,NASCANS Has Ex-MOE Teachers And Coaches Maths ...,Businesses,NASCANS Student Care Centre If you’ve racked...,NASCANS Student Care Centre If you’ve racked y...,7,Sammi Kor,2018-12-24,12,[https://thesmartlocal.com/wp-content/uploads/...,6,6,0,0
4078,https://thesmartlocal.com/read/family-spots-no...,6 Hidden Family Spots In The North To Get Your...,Things To Do In Singapore,Family places and activities in Singapore’s No...,Family places and activities in Singapore’s No...,5,Renae Cheng,2018-12-24,24,[https://thesmartlocal.com/wp-content/uploads/...,7,7,0,119


# Export DataFrame

In [42]:
# Export dataframe to xlsx file
smartlocal.to_excel('./dataset/SmartLocal/smartlocal_raw.xlsx', 
                    index=False, 
                    encoding="utf-16")

# Export dataframe to parquet file
smartlocal.to_parquet('./dataset/SmartLocal/smartlocal_raw.parquet')