## Contents
* [1. Scrape Other Data](#1.-Scrape-Other-Data)
* [2. Imports](#2.-Imports)
* [3. Extracting URLs](#3.-Extracting-URLs)
* [4. Extracting Comments](#4.-Extracting-Comments)

---
## 1. Scrape Other Data
---
Objective: scrape the comments posted under the 'small-talk' topic on SQTalk

---
## 2. Imports
---

In [1]:
import numpy as np
import pandas as pd
import requests, time, random, os
from tqdm import tqdm
from bs4 import BeautifulSoup

---
## 3. Extracting URLs
---

- write custom headers to avoid appearing bot-like

In [3]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9,it;q=0.8,es;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'http://www.google.com',
    # no 'Host' entry: requests derives it from the target URL, and a hard-coded 'httpbin.org' would not match sqtalk.com
}

user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
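One way to combine the two objects above is to keep the static headers and swap in a randomly chosen User-Agent for each request. The helper below is a minimal sketch of that idea; `build_headers` is a hypothetical name that does not appear in the notebook, and the sample values are shortened placeholders.

```python
import random

def build_headers(base_headers, user_agents, rng=random):
    """Return a copy of base_headers with a randomly chosen User-Agent."""
    headers = dict(base_headers)  # copy so the shared dict is never mutated
    headers['User-Agent'] = rng.choice(user_agents)
    return headers

# Placeholder values for illustration only
base = {'Accept-Language': 'en-US,en;q=0.9', 'Referer': 'http://www.google.com'}
agents = ['agent-a', 'agent-b']
request_headers = build_headers(base, agents)
```

The result can be passed as `headers=request_headers` to `requests.get`, so the other header fields are sent alongside the rotating User-Agent.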

- extract every 'small-talk' thread's URL
- data extracted on 24 Jan 2023

In [4]:
url = 'http://www.sqtalk.com/forum/forum/general/small-talk'

results = []
i = 1

try:
    while True:
        user_agent = random.choice(user_agent_list)  # pick a random user agent
        headers = {'User-Agent': user_agent}
        time.sleep(random.uniform(1, 2))  # randomly delay requests to appear less bot-like

        response = requests.get(url + f'/page{i}', headers=headers, timeout=10, verify=False)  # fetch page i of the thread listing
        soup = BeautifulSoup(response.content, 'lxml')
        page_total = int(soup.find_all('span', {'class': "pagetotal"})[1].text)  # there are multiple 'pagetotal' elements; the 2nd one is the one we need

        for result in tqdm(soup.find_all('a', {'class': "topic-title js-topic-title"})):  # extract each thread's title and URL
            thread = {}
            thread['title'] = result.text
            thread['url'] = result['href']
            results.append(thread)

        i += 1  # to 'flip' pages

        if i > page_total:  # stop once the last page has been scraped
            break

except requests.exceptions.Timeout:
    print("Timeout occurred")  # a timeout ends the scrape early; `results` keeps what was collected so far

pd.set_option('display.max_rows', None)  # to see all rows
print(len(results))
pd.DataFrame(results)

100%|██████████| 10/10 [00:00<?, ?it/s]

1250





| | title | url |
|---|---|---|
| 0 | Boeing 777 white noise | http://www.sqtalk.com/forum/forum/general/smal... |
| 1 | A talk about working for SQ over 30 year | http://www.sqtalk.com/forum/forum/general/smal... |
| 2 | Ultimate 747 Fan? | http://www.sqtalk.com/forum/forum/general/smal... |
| 3 | End of the Road for the 777-200LR | http://www.sqtalk.com/forum/forum/general/smal... |
| 4 | The Comedy Awards Master Thread | http://www.sqtalk.com/forum/forum/general/smal... |
| 5 | change my name | http://www.sqtalk.com/forum/forum/general/smal... |
| 6 | Airbus completes autonomous flight testing (Ai... | http://www.sqtalk.com/forum/forum/general/smal... |
| 7 | Gulfstream introduces the G700 as the new flag... | http://www.sqtalk.com/forum/forum/general/smal... |
| 8 | iOS vs. Android | http://www.sqtalk.com/forum/forum/general/smal... |
| 9 | Carry On Case Recommendations | http://www.sqtalk.com/forum/forum/general/smal... |
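The page-flipping logic in the cell above can be separated from the network call, which makes it easier to check without hitting the site. This is a minimal sketch under that assumption: `collect_pages` and `fake_fetch` are hypothetical names, and the fake data stands in for a real response.

```python
def collect_pages(fetch_page):
    """Call fetch_page(1), fetch_page(2), ... until the reported last page.

    fetch_page(i) must return (items, page_total) for page i.
    """
    items = []
    page = 1
    while True:
        page_items, page_total = fetch_page(page)
        items.extend(page_items)
        page += 1
        if page > page_total:  # same stop condition as the scraping loop
            break
    return items

# Stand-in for the network call: three pages of two fake threads each
def fake_fetch(page):
    return [f'thread-{page}-{n}' for n in range(2)], 3

threads = collect_pages(fake_fetch)
```

In the real scraper, `fetch_page` would wrap the `requests.get` and `BeautifulSoup` calls and return the parsed threads together with the page total.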


- remove repeated sticky threads
- since sticky threads appear on every page, they are scraped again on every page

In [5]:
results_clean = pd.DataFrame(results).drop_duplicates(subset=['title'])
print(results_clean)

                                                  title  \
0                                Boeing 777 white noise   
1              A talk about working for SQ over 30 year   
2                                     Ultimate 747 Fan?   
3                     End of the Road for the 777-200LR   
4                       The Comedy Awards Master Thread   
5                                        change my name   
6     Airbus completes autonomous flight testing (Ai...   
7     Gulfstream introduces the G700 as the new flag...   
8                                       iOS vs. Android   
9                         Carry On Case Recommendations   
10                                        A350-1000 ULR   
11                     Ear/Headphones - What do you use   
12                    HSBC's Elite Travellers Programme   
13                             Airbus launches A321 XLR   
14               Storing my things while I go travlling   
15                     Android App For Boarding Passes? 
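Deduplicating on `title` also merges distinct threads that happen to share a title, whereas deduplicating on `url` would not. Whether SQTalk actually has such threads is not known from this notebook, so the rows below are made up. The sketch uses plain Python rather than pandas, and `drop_repeats` is a hypothetical helper.

```python
def drop_repeats(rows, key):
    """Keep the first row for each distinct value of row[key], preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Made-up rows: one sticky thread seen on two pages, plus two distinct
# threads that happen to share a title
rows = [
    {'title': 'Sticky', 'url': '/sticky'},
    {'title': 'Help', 'url': '/help-1'},
    {'title': 'Sticky', 'url': '/sticky'},
    {'title': 'Help', 'url': '/help-2'},
]
by_title = drop_repeats(rows, 'title')  # drops the repeat and one 'Help' thread
by_url = drop_repeats(rows, 'url')      # drops only the repeated sticky thread
```

The pandas equivalent of the second call would be `drop_duplicates(subset=['url'])`.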

- export URLs

In [6]:
newpath = 'output'
os.makedirs(newpath, exist_ok=True)  # create the output folder if it does not exist yet

results_clean.to_csv(os.path.join(newpath, 'other_URLs.csv'), index=False)

---
## 4. Extracting Comments
---

- accessing comments in every thread, page by page

In [7]:
df_url = pd.read_csv('output/other_URLs.csv')
len(df_url)

1246

In [8]:
results_thread = []

for url in tqdm(df_url['url']):
    try:
        i = 1

        while True:
            user_agent = random.choice(user_agent_list)  # pick a random user agent
            headers = {'User-Agent': user_agent}
            time.sleep(random.uniform(1, 2))  # randomly delay requests to appear less bot-like

            response_thread = requests.get(url + f'-/page{i}', headers=headers, timeout=10, verify=False)  # fetch page i of the current thread
            soup_thread = BeautifulSoup(response_thread.content, 'lxml')
            page_total = int(soup_thread.find('span', {'class': "pagetotal"}).text)

            # find_all returns an empty list when a page has no comments, so the loop body is simply skipped
            for result in tqdm(soup_thread.find_all('div', {'class': "js-post__content-text restore h-wordwrap"})):  # extract each comment
                results_thread.append(result.text.replace('\r', '').replace('\n', '').replace('\t', ''))

            i += 1  # to 'flip' pages

            if i > page_total:  # stop once the last page of the thread has been scraped
                break

    except requests.exceptions.Timeout:
        print("Timeout occurred")  # a timeout skips the remaining pages of this thread

print(len(results_thread))
pd.DataFrame(results_thread)

  0%|          | 0/1246 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 336.84it/s]
  0%|          | 1/1246 [00:04<1:37:19,  4.69s/it]

14493


| | 0 |
|---|---|
| 0 | Does anyone know what the white noise is calle... |
| 1 | SQ have been in the Nordic countries for many ... |
| 2 | Originally posted by HerrOView PostSQ have bee... |
| 3 | I know some of you here love the 747 but I'll ... |
| 4 | "It is unclear whether his wife was informed o... |
| 5 | If he wasn't divorced then, he probably is now ! |
| 6 | After 61 planes, this is the last 777-200LR |
| 7 | Ladies and Gentlemen,It has come to our attent... |
| 8 | I'm known to have a very unique (aka. weird) s... |
| 9 | Originally posted by TanandikaView PostI don't... |
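In the comment-scraping cell, the `try` wraps the whole `while` loop, so a single timeout skips the remaining pages of that thread. One possible alternative is to retry the failed request a few times before giving up. The sketch below illustrates that with a stand-in function instead of a real request; `call_with_retries` and `flaky` are hypothetical names, and the retry count and delay are arbitrary.

```python
import time

def call_with_retries(func, retries=3, delay=1.0, exceptions=(TimeoutError,)):
    """Call func() up to `retries` times, sleeping `delay` seconds between attempts."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay)

# Stand-in for a flaky request: fails twice, then succeeds
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return 'page html'

page = call_with_retries(flaky, retries=3, delay=0)
```

With `requests`, the same helper could be called with `exceptions=(requests.exceptions.Timeout,)` and a `lambda` that wraps the `requests.get` call.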


- rename the column to 'comments'

In [9]:
results_thread_temp = pd.DataFrame(results_thread)
results_thread_temp.columns = ['comments']
results_thread_temp.head()

| | comments |
|---|---|
| 0 | Does anyone know what the white noise is calle... |
| 1 | SQ have been in the Nordic countries for many ... |
| 2 | Originally posted by HerrOView PostSQ have bee... |
| 3 | I know some of you here love the 747 but I'll ... |
| 4 | "It is unclear whether his wife was informed o... |
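Deleting `\r`, `\n` and `\t` outright can fuse the words on either side of a line break, which is a likely cause of strings such as "Ladies and Gentlemen,It has..." in the scraped output. A hedged alternative is to collapse each run of whitespace into a single space. `normalize_whitespace` is a hypothetical helper and the sample string is made up.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return ' '.join(text.split())

raw = 'Ladies and Gentlemen,\r\n\tIt has come to our attention'  # made-up sample
cleaned = normalize_whitespace(raw)
```

Normalizing this way before deduplication also means two comments that differ only in spacing are treated as the same text.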


- remove duplicate comments

In [10]:
print(len(results_thread_temp['comments']))
results_thread_clean = pd.DataFrame(results_thread_temp).drop_duplicates(subset=['comments'])
print(len(results_thread_clean['comments']))

14493
14360


- export comments

In [11]:
newpath = 'output'
os.makedirs(newpath, exist_ok=True)  # create the output folder if it does not exist yet

results_thread_clean.to_csv(os.path.join(newpath, 'other_comments.csv'), index=False)