## Contents
* [1. Scrape KrisFlyer & PPS Club Data](#1.-Scrape-KrisFlyer-&-PPS-Club-Data)
* [2. Imports](#2.-Imports)
* [3. Extracting URLs](#3.-Extracting-URLs)
* [4. Extracting Comments](#4.-Extracting-Comments)

---
## 1. Scrape KrisFlyer & PPS Club Data
---
Objective: scrape the comments posted under the 'KrisFlyer & PPS Club' topic on SQTalk.

---
## 2. Imports
---

In [7]:
import numpy as np
import pandas as pd
import requests, time, random, os
from tqdm import tqdm
from bs4 import BeautifulSoup

---
## 3. Extracting URLs
---
- threads and comments share a hierarchical relationship: a thread is the parent of its comments
- thread URLs do not follow a recognisable pattern (e.g. running numbers), so every thread's URL has to be scraped first in order to reach its comments later

- write custom headers to appear less bot-like

In [8]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9,it;q=0.8,es;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'http://www.google.com',
    # no 'Host' entry: requests derives it from the URL, and a hard-coded "httpbin.org" would not match sqtalk.com
}

user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
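
Note that the scraping loop further down builds `headers` from the `User-Agent` alone, so the other fields defined above are not sent. One way to combine both is sketched here; `build_headers` is a hypothetical helper that is not part of the original notebook, and the sample values are shortened stand-ins.

```python
import random

# Hypothetical helper (an assumption, not in the original notebook): keep the
# shared header fields and swap in a randomly chosen User-Agent per request.
def build_headers(base_headers, user_agents):
    merged = dict(base_headers)  # copy so the base dict is not mutated
    merged['User-Agent'] = random.choice(user_agents)
    return merged

# Shortened stand-ins for the `headers` and `user_agent_list` defined above
base = {'Accept-Language': 'en-US,en;q=0.9', 'Referer': 'http://www.google.com'}
agents = ['agent-a', 'agent-b']

request_headers = build_headers(base, agents)
print(request_headers)
```

The returned dict could then be passed as `headers=` to `requests.get`.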

- extract every thread's URL
- data extracted on 17 Jan 2023

In [9]:
url = 'http://www.sqtalk.com/forum/forum/singapore-airlines/krisflyer-pps-club'

results = []
i = 1

try:
    while True:
        user_agent = random.choice(user_agent_list)  # pick a random user agent
        headers = {'User-Agent': user_agent}
        time.sleep(random.uniform(1, 2))  # randomly delay requests to appear less bot-like

        response = requests.get(url + f'/page{i}', headers=headers, timeout=10, verify=False)  # fetch page i of the thread listing
        soup = BeautifulSoup(response.content, 'lxml')
        page_total = int(soup.find_all('span', {'class': "pagetotal"})[1].text)  # there are multiple 'pagetotal' spans; the 2nd one is the one we need

        for result in tqdm(soup.find_all('a', {'class': "topic-title js-topic-title"})):  # extract each thread's title and URL
            thread = {}
            thread['title'] = result.text
            thread['url'] = result['href']
            results.append(thread)

        i += 1  # to 'flip' pages

        if i > page_total:  # stop after the last page
            break

except requests.exceptions.Timeout:
    print("Timeout occurred")  # printed if a request times out

pd.set_option('display.max_rows', None)  # to see all rows
print(len(results))
pd.DataFrame(results)

100%|██████████| 12/12 [00:00<?, ?it/s]

1560





| | title | url |
|---|---|---|
| 0 | Qualifying as EG for the first time | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 1 | Which FFP for me? Master Discussion | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 2 | #SQMelbourneTram | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 3 | Advice sought - Changing redemption bookings | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 4 | First Savers SYD-SIN | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 5 | Waitlist on Redemption booking | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 6 | request to cancel miles (FQTV and FQTS) | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 7 | Redemption requirements up 10% | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 8 | Missing Krisflyer Miles for PPS and PPS Solitaire | http://www.sqtalk.com/forum/forum/singapore-ai... |
| 9 | Mixed SQ/ Star Alliance Award | http://www.sqtalk.com/forum/forum/singapore-ai... |
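
The page-flipping logic used above (request `/page1`, read the page total, keep going until the counter passes it) can be sketched offline as follows. `crawl_pages`, `fetch_page` and `fake_fetch` are hypothetical names introduced for illustration; in the notebook the fetching role is played by `requests.get` plus the BeautifulSoup parsing.

```python
def crawl_pages(base_url, fetch_page):
    """Collect items from base_url/page1, /page2, ... until the last page.

    fetch_page is a hypothetical callable returning (items, page_total)
    for one page URL.
    """
    collected = []
    page = 1
    while True:
        items, page_total = fetch_page(f'{base_url}/page{page}')
        collected.extend(items)
        page += 1
        if page > page_total:  # stop once the last page has been read
            break
    return collected


# Offline stand-in for the forum: three pages with two thread titles each
def fake_fetch(url):
    page = int(url.rsplit('page', 1)[1])
    return [f'thread {page}-{n}' for n in (1, 2)], 3


threads = crawl_pages('http://example.invalid/forum', fake_fetch)
print(len(threads))
```

Passing the fetcher in as an argument keeps the loop testable without touching the network.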


- remove repeated sticky threads
- sticky threads appear on every page, so they are scraped again on each page

In [10]:
results_clean = pd.DataFrame(results).drop_duplicates(subset=['title'])
print(results_clean)

                                                  title  \
0                   Qualifying as EG for the first time   
1                   Which FFP for me? Master Discussion   
2                                      #SQMelbourneTram   
3          Advice sought - Changing redemption bookings   
4                                  First Savers SYD-SIN   
5                        Waitlist on Redemption booking   
6               request to cancel miles (FQTV and FQTS)   
7                        Redemption requirements up 10%   
8     Missing Krisflyer Miles for PPS and PPS Solitaire   
9                         Mixed SQ/ Star Alliance Award   
10    Contact Centre Unable to See Seats That Are Av...   
11                   F Advantage LHR to SYD return help   
14             No saver fares outside a few months time   
15     Downgrading from accidental Solitaire PPS status   
16                    First Class Saver from Australia?   
17          Solitaire Benefits Are Coming Back - Slowly 
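
Deduplicating on `title` would also drop distinct threads that happen to share a title. Since every thread has a unique URL, keying on `url` is a possible alternative. The sketch below works on a raw list of dicts; `dedupe_by_key` is a hypothetical helper and the sample records are made up.

```python
def dedupe_by_key(records, key):
    """Keep the first record seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


# Made-up sample data, not taken from the scraped forum
sample = [
    {'title': 'Sticky thread', 'url': 'http://example.invalid/t/1'},
    {'title': 'Help needed', 'url': 'http://example.invalid/t/2'},
    {'title': 'Sticky thread', 'url': 'http://example.invalid/t/1'},  # sticky repeated on page 2
    {'title': 'Help needed', 'url': 'http://example.invalid/t/3'},    # same title, different thread
]

unique_threads = dedupe_by_key(sample, 'url')
print(len(unique_threads))
```

With pandas, the equivalent would be `drop_duplicates(subset=['url'])`.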

- export URLs

In [11]:
newpath = 'output'
os.makedirs(newpath, exist_ok=True)  # create the output folder if it does not exist yet

results_clean.to_csv('output/KrisFlyer_URLs.csv', index=False)

---
## 4. Extracting Comments
---

- access the comments in every thread, page by page

In [12]:
df_url = pd.read_csv('output/KrisFlyer_URLs.csv')
len(df_url)

1298

In [13]:
results_thread = []

for url in tqdm(df_url['url']):
    try:
        i = 1
        
        while True:
            user_agent = random.choice(user_agent_list)  # Pick a random user agent
            headers = {'User-Agent': user_agent}
            time.sleep(random.uniform(1, 2))  # randomly delay requests to website to appear less bot-like
            
            response_thread = requests.get(url + f'-/page{i}', headers = headers, timeout =10 ,verify=False)  # to scan through each page in each thread
            soup_thread = BeautifulSoup(response_thread.content,'lxml')
            page_total = int(soup_thread.find('span', {'class':"pagetotal"}).text)

            for result in tqdm(soup_thread.find_all('div', {'class':"js-post__content-text restore h-wordwrap"})):  # extract each comment; the loop body is skipped if there are none
                results_thread.append(result.text.replace('\r','').replace('\n','').replace('\t',''))
            
            i+=1  # to 'flip' pages
            
            if i > page_total:  # if i > total pages, break the loop
                break
    
    except requests.exceptions.Timeout:
        print("Timeout occurred")  # printed if a request times out; the rest of this thread is skipped

print(len(results_thread))
pd.DataFrame(results_thread)


Timeout occurred




Timeout occurred




16541


| | 0 |
|---|---|
| 0 | Hi Guys,I'm still DAMN confused after reading ... |
| 1 | ah relax. you can be the perfect test case the... |
| 2 | There's no need to panic. Been in a similar si... |
| 3 | Originally posted by Nick CView PostThere's no... |
| 4 | Firstly, Welcome to SQTalk, Vtac82!Krisflyer m... |
| 5 | Originally posted by SuperJonJonView PostFirst... |
| 6 | Originally posted by SuperJonJonView PostFirst... |
| 7 | BTW, are there no KF G members here who can sh... |
| 8 | Here's a snippet from Singapore Air Website"Be... |
| 9 | There are plenty of KF G members here. but you... |
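
The run above printed "Timeout occurred" more than once. In the loop as written, a timeout ends the `while` loop for the current thread, so that thread's remaining pages are skipped. One possible mitigation is to retry a request a few times before giving up. The sketch below assumes a hypothetical `fetch` callable and uses the built-in `TimeoutError` as a stand-in; with `requests`, the exception to pass would be `requests.exceptions.Timeout`.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.0, exceptions=(TimeoutError,)):
    """Call fetch(url), retrying up to `attempts` times on the given exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except exceptions:
            if attempt == attempts:
                raise  # out of retries: let the caller decide what to do
            time.sleep(delay)  # pause before the next attempt


# Stand-in fetcher that times out twice before succeeding
calls = {'count': 0}


def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return f'content of {url}'


page = fetch_with_retry(flaky_fetch, 'http://example.invalid/thread/page1')
print(page, calls['count'])
```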


- rename the column to 'comments'

In [14]:
results_thread_temp = pd.DataFrame(results_thread)
results_thread_temp.columns = ['comments']
results_thread_temp.head()

| | comments |
|---|---|
| 0 | Hi Guys,I'm still DAMN confused after reading ... |
| 1 | ah relax. you can be the perfect test case the... |
| 2 | There's no need to panic. Been in a similar si... |
| 3 | Originally posted by Nick CView PostThere's no... |
| 4 | Firstly, Welcome to SQTalk, Vtac82!Krisflyer m... |
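
The preview shows words fused together (for example `Guys,I'm`), which is consistent with line breaks being deleted rather than replaced by a space during extraction. A minimal sketch of an alternative, using a hypothetical helper `normalise_whitespace` and a made-up sample string, collapses every run of whitespace into a single space:

```python
def normalise_whitespace(text):
    """Collapse runs of spaces, tabs and line breaks into single spaces."""
    return ' '.join(text.split())


# Made-up sample string, not taken from the scraped comments
raw = 'Hi Guys,\r\n\tI am still confused\n\nafter reading the FAQ.  '
cleaned = normalise_whitespace(raw)
print(cleaned)
```

Calling `str.split()` without arguments splits on any whitespace and discards empty pieces, so leading and trailing whitespace is removed as well.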


- remove duplicate comments

In [15]:
print(len(results_thread_temp['comments']))
results_thread_clean = pd.DataFrame(results_thread_temp).drop_duplicates(subset=['comments'])
print(len(results_thread_clean['comments']))

16541
16476


- export comments

In [17]:
newpath = 'output'
os.makedirs(newpath, exist_ok=True)  # create the output folder if it does not exist yet

results_thread_clean.to_csv('output/KrisFlyer_comments.csv', index=False)