# COMP47670 Assignment 2: Multi-Class Text Classification
## Description
### Objective
The objectives of this assignment are to:
- Scrape a corpus of news stories from a set of [web pages](http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html).
- Pre-process the data.
- Evaluate the performance of both binary and multi-class text classification algorithms on the data.

### Task 1. Data Collection
1. Select three of the nine news categories. This notebook uses Business, Politics, and US-News.
2. Retrieve the pages from the link above and parse the HTML to extract the following information for each story:
  - The title of the news story.
  - The short text snippet for the story which represents the start of the complete news article.
  - The category label assigned to the story.
3. Store the parsed data that you have collected in an appropriate format.
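
Step 3 leaves the storage format open. As a minimal sketch, assuming each story is held as a dict with `title`, `category`, and `snippet` keys, one option is JSON Lines (one JSON object per line). The helper names `save_records` and `load_records` and the sample records are hypothetical and used only for illustration.

```python
import json
import os
import tempfile


def save_records(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records for illustration only, not scraped data
records = [
    {"title": "Example headline", "category": "Business", "snippet": "Opening lines of the story."},
    {"title": "Another headline", "category": "Politics", "snippet": None},
]

# Write to a temporary folder and read back to check the round trip
path = os.path.join(tempfile.mkdtemp(), "news.jsonl")
save_records(records, path)
restored = load_records(path)
print(restored == records)
```

A line-oriented file can be appended to page by page and loaded later without parsing one large document, which may suit an incremental scrape.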

In [5]:
import requests
from bs4 import BeautifulSoup
import json
import os.path
import traceback
from termcolor import colored

BASE_URL = "http://mlg.ucd.ie/modules/COMP41680/assignment2/month-"
CATEGORIES = ['Business', 'Politics', 'US-News']
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
PAGE_PER_MONTH = 31
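
The constants above combine into one URL per month and page. The sketch below mirrors the string concatenation the scraper uses (month suffix, a three-digit zero-padded page number, then `.html`). The function name `build_page_url` is hypothetical, and the sketch keeps the notebook's assumption that every month has 31 pages.

```python
BASE_URL = "http://mlg.ucd.ie/modules/COMP41680/assignment2/month-"
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
PAGE_PER_MONTH = 31  # assumption carried over from the notebook


def build_page_url(month, page_num):
    """Return the URL for one page, e.g. page 1 of January."""
    return "{}{}-{}.html".format(BASE_URL, month, str(page_num).zfill(3))


# Every (month, page) combination the scraper would request
all_urls = [build_page_url(m, n) for m in MONTHS for n in range(1, PAGE_PER_MONTH + 1)]

print(build_page_url("jan", 1))
# http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html
print(len(all_urls))
# 372
```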

In [6]:
# Assumes that resuming an interrupted run is not needed and that the website applies no anti-scraping measures


class NewsScraper:
    def __init__(self):
        self.urls = []
        self.pages_stats = {}

    def initializer(self):
        for month in MONTHS:
            self.urls.append(BASE_URL + month)

    def pages_scraper(self):
        # Create the output folders if they are missing so open() does not fail
        os.makedirs("data/raw_data", exist_ok=True)
        os.makedirs("data/stats", exist_ok=True)
        for url_mon in self.urls:
            month = url_mon[-3:]
            for page_num in range(1, PAGE_PER_MONTH + 1):
                # Reset per page so a failure never reports the previous page's stories
                page = []
                try:
                    # page_scraper may return None on failure; treat that as an empty page
                    page = self.page_scraper(url_mon, page_num) or []
                    with open("data/raw_data/{}_{}.json".format(month, page_num), "w", encoding='utf-8') as f:
                        json.dump(
                            {"page": page},
                            f,
                            indent=4,
                            separators=(',', ': '),
                            sort_keys=False,
                            ensure_ascii=False
                        )
                except Exception as e:
                    print("-" * 30)
                    print(colored("AN ERROR OCCURRED: {}\n".format(e), 'red'))

                print(colored("Finished - Month: {} - page {} with {} news".format(
                    month, page_num, len(page)), 'green', attrs=['bold']))
                self.pages_stats["{}_{}".format(month, page_num)] = {"scraped": len(page)}
                self.save_stats()



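    def page_scraper(self, url_mon, page_num):
        """Scrape one page and return a list of title/category/snippet dicts."""
        page_news = []
        page_num = str(page_num).zfill(3)
        url = url_mon + "-" + page_num + ".html"
        try:
            try:
                source = requests.get(url, timeout=20)
            except requests.RequestException:
                # Network failure or timeout: report no stories for this page
                return page_news

            soup = BeautifulSoup(source.content, features="html.parser")
            for article in soup.find_all("div", class_="article"):
                # find() returns None for a missing tag, so .text raises AttributeError
                try:
                    title = article.find("a").text
                except AttributeError:
                    title = None
                if title == "null":
                    title = None

                # The second metadata element holds the label; [10:] drops its
                # 10-character prefix (presumably "Category: ")
                try:
                    category = article.find_all(class_="metadata")[1].text[10:]
                except (AttributeError, IndexError):
                    category = None
                if category == "null":
                    category = None

                try:
                    snippet = article.find("p", class_="snippet").text
                except AttributeError:
                    snippet = None
                if snippet == "null":
                    snippet = None

                news = {
                    "title": title,
                    "category": category,
                    "snippet": snippet
                    # "link": link
                }
                page_news.append(news)

            return page_news

        except Exception as e:
            print("-" * 30)
            print(colored("ERROR: {}\n{}URL: {}".format(e, traceback.format_exc(), url), 'red'))
            # Return an empty list rather than None so callers can safely call len()
            return []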

    def save_stats(self):
        with open("data/stats/pages_stats.json", "w", encoding='utf-8') as f:
            json.dump(
                self.pages_stats,
                f,
                indent=4,
                separators=(',', ': '),
                sort_keys=False,
                ensure_ascii=False
            )

    def pages_mergerer(self):
        # Placeholder: merging the per-page JSON files is not implemented yet
        pass

    def run(self):
        self.initializer()
        self.pages_scraper()
        self.pages_mergerer()



scraper = NewsScraper()
scraper.run()

Finished - Month: jan - page 1 with 50 news
Finished - Month: jan - page 2 with 50 news
Finished - Month: jan - page 3 with 50 news
Finished - Month: jan - page 4 with 50 news
Finished - Month: jan - page 5 with 50 news
Finished - Month: jan - page 6 with 50 news
Finished - Month: jan - page 7 with 50 news
Finished - Month: jan - page 8 with 50 news
Finished - Month: jan - page 9 with 50 news
Finished - Month: jan - page 10 with 50 news
Finished - Month: jan - page 11 with 50 news
Finished - Month: jan - page 12 with 50 news
Finished - Month: jan - page 13 with 50 news
Finished - Month: jan - page 14 with 50 news
Finished - Month: jan - page 15 with 50 news
Finished - Month: jan - page 16 with 50 news
Finished - Month: jan - page 17 with 50 news
Finished - Mon

KeyboardInterrupt: 
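
The `pages_mergerer` step is left as a stub in the class above. One possible way to fill it in is sketched here, assuming the per-page files keep the `{"page": [...]}` layout written by the scraper and that only stories from the three selected categories with a non-empty title and snippet should be kept. The function name `merge_pages` and the demo records are hypothetical, and the demo runs on a temporary folder rather than `data/raw_data`.

```python
import glob
import json
import os
import tempfile

CATEGORIES = ["Business", "Politics", "US-News"]


def merge_pages(raw_dir, categories):
    """Collect stories from every per-page JSON file, keeping the chosen categories."""
    merged = []
    for path in sorted(glob.glob(os.path.join(raw_dir, "*.json"))):
        with open(path, encoding="utf-8") as f:
            # A failed scrape may have stored null instead of a list
            page = json.load(f).get("page") or []
        for news in page:
            complete = news.get("title") and news.get("snippet")
            if complete and news.get("category") in categories:
                merged.append(news)
    return merged


# Demo on a temporary folder with made-up records
raw_dir = tempfile.mkdtemp()
sample_pages = {
    "jan_1.json": {"page": [
        {"title": "A", "category": "Business", "snippet": "text"},
        {"title": "B", "category": "Other-Category", "snippet": "text"},
        {"title": "C", "category": "Politics", "snippet": None},
    ]},
    "jan_2.json": {"page": None},
}
for name, content in sample_pages.items():
    with open(os.path.join(raw_dir, name), "w", encoding="utf-8") as f:
        json.dump(content, f)

merged = merge_pages(raw_dir, CATEGORIES)
print(len(merged))
```

Whether to drop stories with a missing snippet is a design choice; keeping them and handling the gaps during pre-processing would work as well.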

In [3]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline