## Scraping CNA Articles by Topic
Scraping the CNA articles takes two steps. First, we scrape only the article links for each topic. Then we visit each link to extract the information we need.

In [81]:
# ! pip install beautifulsoup4
# ! pip install requests

import urllib.request,sys,time
import os
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

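import regex as re
import nltk
from nltk import FreqDist # used later to count word frequencies
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# nltk.download('stopwords') # run once if the stopword list is not installed
# nltk.download('punkt')     # run once; word_tokenize needs tokenizer data (newer NLTK releases may ask for 'punkt_tab')
stop_words = set(stopwords.words('english'))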


#### Reading csv file of a manually compiled list of links for different topics

In [27]:
cna_links = pd.read_csv('../data/Trending Topics/links.txt')
cna_links

Unnamed: 0,topic,cna_search_link,num_pages
0,health,https://www.channelnewsasia.com/topic/health?s...,52
1,fashion,https://www.channelnewsasia.com/topic/fashion?...,47
2,environment,https://www.channelnewsasia.com/topic/environm...,30
3,food,https://www.channelnewsasia.com/topic/food?sor...,29
4,education,https://www.channelnewsasia.com/topic/educatio...,37
5,technology,https://www.channelnewsasia.com/topic/technolo...,30
6,art,https://www.channelnewsasia.com/topic/art?sort...,8


### Part 1: Creating a function to scrape the URLs

The URL we scrape changes with each topic: every topic has its own listing page showing the articles tagged with it. We iterate through the pages of that listing by changing the page number at the end of the URL.

The function loops from page 0 up to the page number given by the <code>ending_page</code> parameter.
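
As a small illustration of the pagination, the sketch below builds the page URLs for one topic without sending any requests. The helper name `build_page_urls` is hypothetical, and it assumes (as the scraping loop does) that page numbering starts at 0 and that the base link ends in `page=`.

```python
def build_page_urls(link, ending_page):
    """Return one URL per results page, from page 0 up to and including ending_page."""
    return [f"{link}{page_num}" for page_num in range(ending_page + 1)]


# Base link for the health topic; the page number is appended to the end.
base_link = ("https://www.channelnewsasia.com/topic/health"
             "?sort_by=field_release_date_value&sort_order=DESC"
             "&type%5Barticle%5D=article&page=")

page_urls = build_page_urls(base_link, 2)
for page_url in page_urls:
    print(page_url)
```

Listing the URLs first like this can be a quick way to check the link format before starting a long scrape.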

Please update the headers variable with your own user agent. This prevents us from running into error 403: Forbidden when we scrape the URLs. 


To do so:
1. Press F12 in Chrome to open the developer tools, then select the Console tab.
2. Type in <code>navigator.userAgent</code> in the console and execute it by hitting enter.
3. Copy the resulting user agent string and use it as the <code>User-Agent</code> value in the <code>headers</code> dictionary.

For more details, refer to https://stackoverflow.com/questions/38489386/python-requests-403-forbidden
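
The sketch below shows roughly what the updated dictionary could look like. The name `MY_USER_AGENT` is a placeholder introduced here for illustration, and its value is only an example string. To stay offline, the sketch builds a request with the standard library's `urllib.request` and inspects it without sending it; with `requests`, the same dictionary is passed as `requests.get(url, headers=headers)`.

```python
import urllib.request

# Placeholder: paste the string printed by navigator.userAgent here.
MY_USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                 "(KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36")

headers = {"User-Agent": MY_USER_AGENT}

# Build (but do not send) a request so we can check the header that would go out.
request = urllib.request.Request("https://www.channelnewsasia.com/topic/health",
                                 headers=headers)

# urllib stores header names with only the first letter capitalised.
sent_user_agent = request.get_header("User-agent")
print(sent_user_agent)
```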

In [34]:
def get_urls(topic, link, ending_page):
    # replace with your own user agent (see the steps above)
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36"}

    # create a csv named {topic}_cna_urls.csv to hold the scraped urls
    with open(f'../data/Trending Topics/cna_urls/{topic}_cna_urls.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["Page", "URL"]) # header row: "Page" is the page number, "URL" is the article link

        for i in range(0, ending_page+1):
            url = link + str(i) # built outside the try block so the except branch can always print it
            try:
                result = requests.get(url, headers=headers)
                soup = BeautifulSoup(result.text, 'html.parser')

                article_links = soup.find_all('a', attrs={'class': 'h6__link h6__link-- list-object__heading-link'})
                for article_link in article_links:
                    # href only gives us the relative path, so we prepend the domain to get the absolute url
                    writer.writerow([i, 'https://www.channelnewsasia.com' + article_link['href']])

            except Exception:
                print('exception')
                error_type, error_obj, error_info = sys.exc_info()  # get the exception information
                print('ERROR FOR LINK:', url)                        # print the link that caused the problem
                print(error_type, 'Line:', error_info.tb_lineno)     # print the error type and the line that threw it
                continue                                             # skip this page and move on to the next one

            time.sleep(2) # pause between requests to avoid overloading the server

In [33]:
for topic, link, num_pages in zip(cna_links["topic"], cna_links["cna_search_link"], cna_links["num_pages"]):
    get_urls(topic, link, num_pages)
    print(f"{topic} completed")

health https://www.channelnewsasia.com/topic/health?sort_by=field_release_date_value&sort_order=DESC&type%5Barticle%5D=article&page= 52

### Part 2: Getting information from each link

Separate functions extract each piece of information (title, body text and related topics) by locating the corresponding HTML tag on the CNA article page.
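
To make the idea of locating a tag by its class concrete, here is a minimal sketch that runs on a made-up HTML snippet. It uses the standard library's `html.parser` rather than BeautifulSoup so that it has no dependencies. The `TitleParser` class and the sample HTML are hypothetical; only the class names `h1--page-title` and `text-long` come from the functions in this notebook.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside <h1> tags whose class list includes a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.inside_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the class value may be missing.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and self.target_class in classes:
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.inside_title = False

    def handle_data(self, data):
        if self.inside_title:
            self.title_parts.append(data)


# Made-up snippet that imitates the structure of an article page.
sample_html = """
<html><body>
  <h1 class="h1 h1--page-title">Sample headline</h1>
  <div class="text-long"><p>First paragraph.</p></div>
</body></html>
"""

parser = TitleParser("h1--page-title")
parser.feed(sample_html)
title_text = "".join(parser.title_parts).strip()
print(title_text)
```

BeautifulSoup's `find` and `find_all` do the same kind of matching in one call, which is why the notebook uses them.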

In [35]:
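def get_title(soup):
    # returns the whole <h1> tag (not just its text); the markup is stripped later during processing
    title = soup.find("h1", attrs={"class": "h1 h1--page-title"})

    return title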
        
def get_text(soup):
    text = ''
    # the article body is split across one or more <div class="text-long"> blocks
    article_blocks = soup.find_all("div", attrs={"class": "text-long"})

    for block in article_blocks:
        for paragraph in block.find_all("p"):
            text += paragraph.text + '\n'
            
    return text

def get_related_topics(soup):
    other_keywords = []
    
    related_topics = soup.find("section", attrs={"class": "block block-layout-builder block-field-blocknodearticlefield-topics clearfix block--related-topics"})
    
    if related_topics is None: # when article does not have any related topic tags
        other_keywords.append("")
        return other_keywords

    for tag in related_topics.find_all("a"):
        clean_tag = tag.text.replace("\n ", "")
        other_keywords.append(clean_tag)

    return other_keywords

In [56]:
def get_url_data(url):
    url_data = {}

    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36"}
    result = requests.get(url, headers=headers)
    
    if result.status_code != 404:
    
        soup=BeautifulSoup(result.text,'html.parser')

        url_data['title'] = get_title(soup)
        url_data['text'] = get_text(soup)
        url_data['related_topics'] = get_related_topics(soup)

    return url_data

In [57]:
PATH = "../data/Trending Topics/cna_urls"

# Obtaining article title, text and related topic keywords for each article 
for file in os.listdir(PATH):
    topic = file.split("_")[0] 
    url_df = pd.read_csv(os.path.join(PATH, file))
    
    url_df['data'] = url_df.apply(lambda x: get_url_data(x['URL']), axis=1)

    url_df.to_csv(f"../data/Trending Topics/cna_text/{topic}_text.csv")

## Creating dictionary for each topic
After collecting the article text from CNA, we take the 1000 most frequent words for each topic as that topic's dictionary. Each dictionary serves as a word bank for tagging social media text with a topic.
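
As a rough sketch of the "most frequent words" step, the example below counts words across a few made-up article strings. It uses `collections.Counter` from the standard library in place of NLTK's `FreqDist` (which is itself a `Counter` subclass) and splits on whitespace instead of calling `word_tokenize`. The function name `top_words` and the sample texts are hypothetical.

```python
from collections import Counter


def top_words(texts, n):
    """Return the n most frequent whitespace-separated words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return [word for word, _ in counts.most_common(n)]


# Made-up, already-cleaned article texts for one topic.
sample_articles = [
    "vaccine clinic vaccine hospital",
    "hospital vaccine doctor",
]

word_bank = top_words(sample_articles, 2)
print(word_bank)  # ['vaccine', 'hospital']
```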

#### Processing text to remove unnecessary HTML tags and stopwords

In [102]:
def processing(text):
    text = text.replace("{'title': <h1 class=\"h1 h1--page-title\">", " ").replace("\n", " ").replace("</h1>", " ").replace("\'text\':", " ").replace("\\xa0", " ").replace("\\n", " ").replace("'related_topics': ['     ", " ")

    lowercase_text = text.lower()
    punctuations_removed = re.sub('[^a-z0-9]', ' ', lowercase_text)
    tokens = word_tokenize(punctuations_removed)
    stopwords_removed = [token for token in tokens if token not in stop_words]

    return " ".join(stopwords_removed)

#### Creating the dictionary of words for each topic

In [149]:
for file in os.listdir("../data/Trending Topics/cna_text"): # read article texts for each topic
    topic = file.split("_")[0]
    print(f"Reading text from topic: {topic}")

    article_df = pd.read_csv(f"../data/Trending Topics/cna_text/{file}", index_col=0)
    article_df["processed"] = article_df["data"].apply(processing)
    
    # join articles with a space so the last word of one article does not merge with the first word of the next
    text = " ".join(article_df["processed"])

    tokens = word_tokenize(text) # tokenize text and return a list of words
    freq_dist = FreqDist(tokens).most_common(1000) # returns a list of tuples (word, freq)
    top_1000_words = [word for word, freq in freq_dist] # keep only the words

    # create new csv under dictionary folder holding the top 1000 words for the topic
    with open(f"../data/Trending Topics/dictionary/{topic}.csv", 'w', newline='') as dict_file:
        writer = csv.writer(dict_file)
        writer.writerow(top_1000_words)

    print("All words added\n")

Reading text from topic: art
All words added

Reading text from topic: education
All words added

Reading text from topic: environment
All words added

Reading text from topic: fashion
All words added

Reading text from topic: food
All words added

Reading text from topic: health
All words added

Reading text from topic: politics
All words added

Reading text from topic: technology
All words added
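
One possible way to use these word banks for tagging is sketched below. The scoring rule (count how many distinct words in the text appear in each topic's word bank and pick the highest) is an assumption made here for illustration, not a method defined in this notebook. The function name `tag_topic` and the tiny word banks are also hypothetical.

```python
def tag_topic(text, word_banks):
    """Return the topic whose word bank shares the most words with text, or None."""
    words = set(text.lower().split())
    # Score each topic by the number of distinct words it shares with the text.
    scores = {topic: len(words & bank) for topic, bank in word_banks.items()}
    best_topic = max(scores, key=scores.get)
    return best_topic if scores[best_topic] > 0 else None


# Tiny made-up word banks standing in for the 1000-word dictionaries.
word_banks = {
    "health": {"vaccine", "hospital", "doctor"},
    "food": {"restaurant", "hawker", "recipe"},
}

tagged = tag_topic("New hawker stall shares its laksa recipe", word_banks)
untagged = tag_topic("Nothing relevant here", word_banks)
print(tagged, untagged)
```

In practice the social media text would likely go through the same cleaning as the articles before being compared against the dictionaries.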

