<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP (Part 1)

## Problem Statement:

Singapore lifted its Covid-19 pre-departure testing requirement in February 2023 as part of the move to remove its remaining border measures. As data analysts hired by a local travel agency, we are tasked with researching the latest popular travel destinations to support the marketing and operations teams in their campaign to promote tour packages to these countries at the upcoming 2023 Travel Fair. To ensure a successful campaign, we need to gather reviews of the attractions that are currently trending among Singaporeans. Our preliminary research shows that Japan and Thailand are among the top holiday destinations for Singaporeans. In this project, we will attempt to build a binary NLP classifier that correctly assigns each review to one of the two countries.

As a novel approach, text from the subreddits r/JapanTravel and r/ThailandTourism will be scraped using the Pushshift API for Reddit to collect the necessary data and train the model. Further analysis of the text data, and an evaluation of the model's ability to classify the texts correctly, is elaborated throughout the notebook.

Part 1 consists of scraping the text, then compiling and saving the raw dataset to be used in Part 2.

### Sources:
1. https://www.straitstimes.com/life/travel/bangkok-tokyo-and-bali-are-top-year-end-destinations-among-singapore-travellers
2. https://www.travelandleisureasia.com/sg/news/year-end-travel-destinations-among-singapore-travellers/
3. https://www.straitstimes.com/singapore/singapore-will-lift-remaining-covid-19-border-restrictions-from-feb-13

## Import Libraries

In [2]:
import requests
import pandas as pd
import time
import datetime

**Scraped on Wednesday, 13 March 2023 15:25:14**
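
A stamp in this style can be generated with the `datetime` module imported above, and the same formatting helps when reading the epoch values in Pushshift's `created_utc` field. The sketch below is only illustrative: `format_stamp` and `utc_from_epoch` are hypothetical helper names that do not appear elsewhere in this notebook, and English day and month names assume the default locale.

```python
import datetime


def format_stamp(moment):
    """Format a datetime as e.g. 'Sunday, 01 January 2023 09:05:00'."""
    return moment.strftime('%A, %d %B %Y %H:%M:%S')


def utc_from_epoch(seconds):
    """Convert epoch seconds (the unit used by created_utc) to an aware UTC datetime."""
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)


# Stamp for the current local time, e.g. to record when a scrape was run
print(format_stamp(datetime.datetime.now()))
# Epoch 0 is the start of 1 January 1970 in UTC
print(format_stamp(utc_from_epoch(0)))
```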

In [18]:
#function to scrape both subreddits using Pushshift API
def scrape(cat1, num, cat2):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/46.0.2490.80'}
    #Pushshift API URL
    url1 = f'https://api.pushshift.io/reddit/search/submission?subreddit={cat1}'
    #Initiate empty timestamp
    before_timestamp = None
    #Pushshift API params for 1st subreddit forum, defined to scrape 150 posts per request
    #('before' is added after the first request, so the first page starts at the newest posts)
    params1 = {'subreddit': cat1, 'size': 150}
    #create empty dataframe
    df1 = pd.DataFrame([],columns=['subreddit', 'created_utc', 'title', 'selftext'])
    #while loop to request information from API and scrape the posts
    while len(df1.index) < num:
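        res1 = requests.get(url1, params=params1, headers=headers)
        #raise an error if the request failed
        res1.raise_for_status()
        posts1 = res1.json()["data"]
        #stop when the API has no more posts to return
        if not posts1:
            break
        posts1_df = pd.DataFrame(posts1, columns=['subreddit', 'created_utc', 'title', 'selftext'])
        df1 = pd.concat([df1, posts1_df])
        #use the oldest timestamp in this batch as the cursor for the next request
        before_timestamp = posts1[-1]['created_utc']
        params1['before'] = before_timestamp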
    #Pushshift API URL and params for 2nd subreddit forum, defined to scrape 150 posts per request
    url2 = f'https://api.pushshift.io/reddit/search/submission?subreddit={cat2}'
    #start from the newest posts instead of reusing the 1st subreddit's last timestamp
    params2 = {'subreddit': cat2, 'size': 150}
    df2 = pd.DataFrame([],columns=['subreddit', 'created_utc', 'title', 'selftext'])
    while len(df2.index) < num:
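        res2 = requests.get(url2, params=params2, headers=headers)
        #raise an error if the request failed
        res2.raise_for_status()
        posts2 = res2.json()["data"]
        #stop when the API has no more posts to return
        if not posts2:
            break
        posts2_df = pd.DataFrame(posts2, columns=['subreddit', 'created_utc', 'title', 'selftext'])
        df2 = pd.concat([df2, posts2_df])
        #use the oldest timestamp in this batch as the cursor for the next request
        before_timestamp = posts2[-1]['created_utc']
        params2['before'] = before_timestamp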
    
    final_df = pd.concat([df1, df2])
    final_df.drop(columns=['created_utc'], inplace=True)
    return final_df
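
The two loops above share one pattern: request a batch, append it, then use the oldest `created_utc` in the batch as the `before` cursor for the next request. A minimal sketch of that pattern as a reusable helper is shown below. It is not the function used in this notebook. `paginate`, `fetch`, `pause` and `max_retries` are hypothetical names, and `fetch` stands in for any callable that takes the params dict and returns a list of post dicts (for example, a thin wrapper around `requests.get(...).json()["data"]`). The pause between requests and the retry with backoff are assumptions about polite API usage, not documented Pushshift requirements.

```python
import time


def paginate(fetch, subreddit, num, size=150, pause=1.0, max_retries=3):
    """Collect at least `num` posts by walking backwards in time.

    `fetch(params)` is a hypothetical callable returning a list of post dicts,
    newest first, each containing a 'created_utc' key.
    """
    params = {'subreddit': subreddit, 'size': size}
    posts = []
    while len(posts) < num:
        # Retry a failed request a few times, waiting longer each time
        for attempt in range(max_retries):
            try:
                batch = fetch(params)
                break
            except Exception:
                time.sleep(pause * 2 ** attempt)
        else:
            break  # every retry failed; return what was collected so far
        if not batch:
            break  # no older posts are available
        posts.extend(batch)
        # The oldest post in this batch becomes the cursor for the next request
        params['before'] = batch[-1]['created_utc']
        time.sleep(pause)
    return posts
```

Passing the returned list to `pd.DataFrame(posts, columns=[...])` would give the same kind of frame that `scrape` builds, and injecting `fetch` lets the loop be exercised without network access.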

In [19]:
web_df = scrape('japantravel', 1000, 'ThailandTourism')
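
Because the loop in `scrape` appends a whole batch at a time, the result can hold more than `num` rows per subreddit, and reposts may appear more than once. Before saving, it can be useful to check the class balance and drop exact duplicates. The sketch below assumes the three columns returned by `scrape`; `summarise_scrape` is a hypothetical helper, and the toy rows are made up purely for illustration.

```python
import pandas as pd


def summarise_scrape(df):
    """Drop exact duplicate posts and count the remaining rows per subreddit."""
    deduped = (
        df.drop_duplicates(subset=['subreddit', 'title', 'selftext'])
          .reset_index(drop=True)
    )
    counts = deduped['subreddit'].value_counts().to_dict()
    return deduped, counts


# Toy frame standing in for web_df (illustrative rows, not scraped data)
sample = pd.DataFrame({
    'subreddit': ['JapanTravel', 'JapanTravel', 'JapanTravel', 'ThailandTourism'],
    'title': ['Kyoto in spring', 'Kyoto in spring', 'Osaka food', 'Chiang Mai tips'],
    'selftext': ['itinerary', 'itinerary', 'where to eat', 'temples'],
})

deduped, counts = summarise_scrape(sample)
print(counts)
```

On the real data, the call would be `summarise_scrape(web_df)`, and a large gap between the two counts would be worth noting before modelling in Part 2.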

In [20]:
web_df.to_csv('../data/japan_vs_thailand.csv', index=False)