# Introduction

This notebook explains a pipeline for:     
* Taking a URL as input
* Extracting all URLs found on that page which match specified criteria
* Extracting the text on each of those sub-URLs
* Generating a data table of the URL, the text and the first title located on that page
* Writing this data to a CSV, which is expected to be used in a subsequent pipeline.

This is the abstract task which this pipeline accomplishes. In this scenario, we will configure the pipeline for a specific use case, which is:    
* Input the URL for 'The Daily Mail Australia' news publication
* Extract every news article on the front page of the website
* Extract all text from each article, which will then be used in the 'Part 3 NLP Analysis' notebook.

## Extending and Scaling this Prototype
This pipeline contains functions and a template which can be used to accomplish this abstracted use case. It is intended as a prototype which could later be developed into:
* An application for extracting text from all, or a query-filtered subset of, news articles on a news website. This would require altering the current method, which only evaluates URLs from the inputted webpage. The new implementation would extract every URL present on the domain which meets specified criteria.
* A scaled-up application for many thousands or millions of articles. This example collects around 100-200 articles and takes approximately 2.5 minutes to run. The majority of computation time is spent extracting text from the larger HTML text blobs. Scaling could be accomplished by distributing the processing of each sub-URL across parallel computers.
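
The idea of processing sub-URLs in parallel can be sketched on a single machine with the standard library. This is a minimal sketch under assumptions, not part of the pipeline below: `extract_text` is a hypothetical stand-in for the per-URL work, and `max_workers=8` is an arbitrary choice. Threads mainly help while waiting on network responses; if the text extraction itself is the bottleneck, `ProcessPoolExecutor` from the same module offers the same `map` interface using separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_text(url: str) -> str:
    """Hypothetical stand-in for the per-URL work (fetch page, extract text)."""
    return "text from " + url

def process_urls(urls, worker=extract_text, max_workers=8):
    """Apply `worker` to every URL concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))

# Placeholder URLs, used only to show the call.
sample_urls = ["https://example.com/a", "https://example.com/b"]
results = process_urls(sample_urls)
```

Distributing the work across several computers, as suggested above, would need a dedicated tool and is outside the scope of this sketch.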


## Describing the Web Content

a. Websites to be consumed 

b. A rationale for extracting the web content 

c. Content coverage of the data extracted 

d. Complexity of the content layout 

e. Website/data copyright considerations 

f. Metadata supplementation and rationale for the supplementation 

g. Content extractor to export the important aspects of the data and/or metadata 

h. Relevant python coding 

i. Demonstration of the application of the WebCrawler (i.e. screen shots) 

j. Methodology of processing, cleaning, and storing harvested data for NLP tasking 

k. Summary and visualisation of the harvested data. Preliminary EDA is acceptable in this section as well. 

# Configuration

In the following section, we:      
* Import required modules
* Set the configuration constants, including the target URLs. This pipeline should be capable of performing the same task for different use cases by changing only the configuration. No changes to the code are required.
* Define the functions for use in the 'Execute Pipeline' section.
* Connect this runtime to file storage (Google Drive) to store the output file.

## Modules

In [29]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
from sklearn.pipeline import Pipeline
import pandas as pd
import re
import random
import os
import numpy as np
import time

## Constants

The constants are the runtime variables which can be set here in this one code chunk. Future versions of this application could set these variables from an external config file. By changing the constants here, this pipeline can accomplish its task on a different use case, without any need to change code. 
These constants are:     

* MAIN_URL - The inputted webpage from which we will extract the URLs found there. An example would be the home page of a news website.
* DOMAIN - URLs on pages are often partial URLs; this domain can be prepended to create the full URL.
* SCRAPE_OUTPUT_FILE - File location for the output file.
* ARTICLE_TAG - The HTML tag used to identify elements that contain a URL we are interested in. For example, 'a' tags contain links to articles, while other tags may contain URLs to ad sites.
* URL_HTML_TAG - The attribute which holds the URL; this is commonly 'href'.
* URLS_START_WITH - Keeps only URLs that start with one of these substrings.
* URLS_NOT_END_WITH - Filters out URLs that end with one of these substrings.
* TEXT_TAG - The HTML tag which indicates the body text we want to extract.
* REMOVE_SUBSTRINGS - List of characters and substrings which are cleaned out of the extracted text.
* SEED - The random seed to be set for reproducibility.

The HTML tags and URL substrings to filter with are determined by examining the HTML for the target pages directly. This will be demonstrated in the 'Execute Pipeline' section below.

In [30]:
MAIN_URL = 'https://www.dailymail.co.uk/news/breaking_news/index.html'
DOMAIN = 'https://www.dailymail.co.uk'
SCRAPE_OUTPUT_FILE = '/content/drive/MyDrive/MA5851_A3/scrape_results.csv'
ARTICLE_TAG = 'a'
URL_HTML_TAG = 'href'
URLS_START_WITH = ['https://www.dailymail.co.uk']
URLS_NOT_END_WITH = ['#video']
TEXT_TAG = 'p'
REMOVE_SUBSTRINGS = ["]",'"',"'",".",",","[","/",">","<"]
SEED = 42
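
The external config file mentioned above could be approached as follows. This is a sketch under assumptions rather than the notebook's method: the JSON layout, the `load_config` helper and the `REQUIRED_KEYS` set are all hypothetical, and the config is read from an inline string here so the example runs without a file on disk.

```python
import json

# Hypothetical config file contents; the keys mirror the constants above.
config_text = """
{
  "MAIN_URL": "https://www.dailymail.co.uk/news/breaking_news/index.html",
  "ARTICLE_TAG": "a",
  "URL_HTML_TAG": "href",
  "URLS_START_WITH": ["https://www.dailymail.co.uk"],
  "URLS_NOT_END_WITH": ["#video"],
  "TEXT_TAG": "p",
  "SEED": 42
}
"""

# Assumed minimum set of keys the pipeline cannot run without.
REQUIRED_KEYS = {"MAIN_URL", "ARTICLE_TAG", "URL_HTML_TAG", "TEXT_TAG"}

def load_config(text: str) -> dict:
    """Parse the JSON text and fail early if a required key is missing."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"Missing config keys: {sorted(missing)}")
    return config

config = load_config(config_text)
MAIN_URL = config["MAIN_URL"]
```

Failing early on a missing key keeps a misconfigured run from starting a long scrape before the problem surfaces.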

## File Storage

In [31]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Functions

Where an imported module's functionality isn't precisely suited to our use case, it is appropriate to define some custom functions. These functions are reusable for other use cases.

### Utilities

In [32]:
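def Set_All_Seeds(seed):
  """Aims to set all used random seeds in one place."""
  random.seed(seed)
  # PYTHONHASHSEED is read when the interpreter starts, so setting it here
  # only affects child processes, not hashing in this running session.
  os.environ['PYTHONHASHSEED'] = str(seed)
  np.random.seed(seed)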

### HTML Element Selection

In [33]:
def Get_Soup(url: str):
  """Input a url and return the BeautifulSoup instance (aka: a 'soup')."""
  data = requests.get(url)
  html = BeautifulSoup(data.text, 'html.parser')
  return html

In [34]:
def Get_Links(soup: BeautifulSoup, find_tag: str, get_tag: str):
  """Extracts every URL found in the 'get_tag' of each 'find_tag' in the soup.""" 
  results = []
  for link in soup.find_all(find_tag):
    results.append(link.get(get_tag))
  return results
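
`Get_Links` returns the attribute values exactly as they appear in the HTML, so partial URLs and missing values can be present. One way to resolve partial URLs against the DOMAIN constant is `urljoin` from the standard library. This is a sketch only: `resolve_links` is a hypothetical helper that the pipeline below does not call, and the sample links are made up.

```python
from urllib.parse import urljoin

def resolve_links(links, domain):
    """Turn partial URLs into full ones; skip entries without a usable value."""
    resolved = []
    for link in links:
        if not isinstance(link, str) or not link:
            continue  # e.g. None when a tag has no href attribute
        resolved.append(urljoin(domain, link))
    return resolved

# Made-up sample: one partial URL, one missing value, one full URL.
sample = ["/news/article-1.html", None,
          "https://www.dailymail.co.uk/sport/index.html"]
full_urls = resolve_links(sample, "https://www.dailymail.co.uk")
```

URLs that are already absolute pass through `urljoin` unchanged, so the helper can be applied to a mixed list.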

In [35]:
def Get_Content(soup: BeautifulSoup, tag: str):
  """Input a soup, return a list holding the contents (child nodes)
  found inside each element with the given tag"""
  results = []
  for p in soup.find_all(tag):
    results.append(p.contents)
  return results

### Text Processing

In [37]:
def Remove_Sub_Strings(string: str, remove: list):
  """User can input a list of substrings which will all be 
  removed from the string"""
  for r in remove:
    assert isinstance(string, str)
    string = string.replace(r,"")
  return string

def Clean_String(s: str):
  """Removes HTML tags, then removes the standalone token 'FIRST'
  together with the whitespace around it."""
  s = re.sub(r'<[^>]+>', '', s)
  return re.sub(r'(^|\s+)FIRST($|\s+)', '', s)

def Get_Text_From_Page(url: str):
  """Input a url and returns only the text found on the page.
  Requires runtime variables 'TEXT_TAG' and 'REMOVE_SUBSTRINGS' to be defined."""
  web_text = Get_Content(soup = Get_Soup(url), tag=TEXT_TAG)
  return Remove_Sub_Strings(str(web_text), remove = REMOVE_SUBSTRINGS)

### Link Selection

In [36]:
def Select_Links_Starts_With(links: list, stem: str):
  """Input a list of URLS, returns the URLs which start with the stem"""
  results = []
  for link in links:
    if not isinstance(link, str):
      continue
    if link.startswith(stem):
      results.append(link)
  return results

def Select_Links_Ends_With(links: list, stem: str):
  """Input a list of URLS, returns the URLs which end with the stem"""
  results = []
  for link in links:
    if not isinstance(link, str):
      continue
    if link.endswith(stem):
      results.append(link)
  return results

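def Remove_Links(func, links: list, stems: list):
  """Allows users to filter out URLs with a list of stems.
  Duplicates are dropped and the original link order is kept."""
  for s in stems:
    delta = set(func(links = links, stem=s))
    # dict.fromkeys de-duplicates while keeping order, unlike a plain set.
    links = [l for l in dict.fromkeys(links) if l not in delta]
  return links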

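def Append_Links(func, links: list, stems: list):
  """Allows users to filter in URLs with a list of stems.
  Returns every link matching at least one stem, without repeats."""
  results = []
  for s in stems:
    delta = func(links = links, stem=s)
    # Collect the matches for every stem, not only the first one.
    for d in delta:
      if d not in results:
        results.append(d)
  return results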
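
The same filtering can also be written as a single pass, because `str.startswith` and `str.endswith` accept a tuple of candidates. The following is a hedged alternative sketch, not a replacement for the functions above: `filter_links` and the sample links are hypothetical.

```python
def filter_links(links, start_with, not_end_with):
    """Keep links that start with an allowed stem and do not end with a
    blocked stem. Non-string entries and repeats are skipped."""
    kept = []
    seen = set()
    for link in links:
        if not isinstance(link, str):
            continue
        if not link.startswith(tuple(start_with)):
            continue
        if link.endswith(tuple(not_end_with)):
            continue
        if link in seen:
            continue
        seen.add(link)
        kept.append(link)
    return kept

# Made-up sample covering each case the filter handles.
sample = [
    "https://www.dailymail.co.uk/news/article-1.html",
    None,
    "https://ads.example.com/x",
    "https://www.dailymail.co.uk/news/article-2.html#video",
    "https://www.dailymail.co.uk/news/article-1.html",
]
kept = filter_links(sample,
                    start_with=["https://www.dailymail.co.uk"],
                    not_end_with=["#video"])
```

Here only the first article URL survives: the missing value, the ad link, the `#video` link and the repeated URL are all filtered out.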

# Execute Pipeline

In [38]:
Set_All_Seeds(SEED)
start_time = time.time()

## Extract webpages



In [39]:
# Take the main URL and extract all desired URLs found on that webpage into a list. 
main_soup = Get_Soup(MAIN_URL)
URLs = Get_Links(soup = main_soup, find_tag = ARTICLE_TAG, get_tag = URL_HTML_TAG)
URLs = Append_Links(func = Select_Links_Starts_With,links = URLs, stems = URLS_START_WITH)
URLs = Remove_Links(func = Select_Links_Ends_With,links = URLs, stems = URLS_NOT_END_WITH)

We can examine a sample of the main page HTML code here. From exploring the full HTML document, we can determine the tags and URL stems needed to extract the desired URLs and set those properties in the configuration.

In [40]:
str(main_soup)[0:300]

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "//www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html><head><script type="text/javascript">\ntry {\n  Object.defineProperty(window, \'adverts\', {configurable: false, value:{}});\n}\ncatch(error) {\n  console.error(error);\n}\n</script><lin'

Using the tags and stems we have inputted into the configuration, we can extract a list of URLs as follows. The number of URLs extracted will be detailed in the output section below. 

In [41]:
URLs[0:4]

['https://www.dailymail.co.uk/news/article-10217729/Fired-Baltimore-cop-female-officer-accomplice-carry-SECOND-kidnapping.html#comments',
 'https://www.dailymail.co.uk/news/article-10235429/Biden-ignores-questions-saying-families-rest-easy-shelves-full.html#comments',
 'https://www.dailymail.co.uk/news/article-10225803/Europe-descends-chaos-second-night-protests-continue-Austria-Holland-Denmark.html#comments',
 'https://www.dailymail.co.uk/sport/sportsnews/article-10125861/Ole-Gunnar-Solskjaer-SACKED-Manchester-United-brutal-loss-Watford.html#comments']

We can apply our custom function to extract all body text for each URL in the list. Note in the example below that some HTML code and unwanted substrings are still in the text. These will be removed with stop words in the subsequent pipeline.

In [None]:
Get_Text_From_Page(URLs[1])

'By  a class=author href=homesearchhtml?s=&amp;authornamef=Tommy+Taylor rel=nofollowTommy Taylora  and  a class=author href=homesearchhtml?s=&amp;authornamef=Ronny+Reyes+For+DailymailCom rel=nofollowRonny Reyes For DailymailComa   span class=article-timestamp article-timestamp-published span class=article-timestamp-labelPublished:span time datetime=2021-11-18T19:35:17+0000 19:35 GMT 18 November 2021 time span  |  span class=article-timestamp article-timestamp-updated span class=article-timestamp-labelUpdated:span time datetime=2021-11-19T04:50:56+0000 04:50 GMT 19 November 2021 time span     89 View  br  comments  A terminated Maryland cop his suspended police accomplice and his two daughters who he kidnapped were all found dead inside a crashed vehicle in an apparent murder-suicide after a five-day manhunt on Thursday police said\\xa0\\xa0\\xa0 Robert Vicosa 42 had taken his daughters Aminah 6 and Giana 7 from their Windsor Pennsylvania home on Sunday He was accompanied by\\xa0Sgt Tin
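
The leftover markup appears because the contents of each tag are converted to a string with their nested tags included. One way to keep only the visible text is to collect text nodes while ignoring nested tags. The sketch below uses only the standard library so it runs without extra dependencies; `ParagraphText` and `paragraphs_from_html` are hypothetical names and the sample HTML is made up. In the notebook itself, BeautifulSoup's `get_text()` method on each tag would be the natural equivalent.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collects the visible text inside each <p> element, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # greater than 0 while inside a <p> element
        self.paragraphs = []   # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.paragraphs[-1] += data

def paragraphs_from_html(html: str) -> list:
    """Return the text of every <p> element with whitespace normalised."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return [" ".join(p.split()) for p in parser.paragraphs]

# Made-up HTML resembling an article byline followed by body text.
sample_html = ('<p>By <a class="author" href="/x">Tommy Taylor</a></p>'
               '<div>ad</div><p>Second  paragraph</p>')
paragraphs = paragraphs_from_html(sample_html)
```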

## Extend the metadata with new features

From the list of URLs, we can develop more features and output a data table of:    
* The URL
* The title of the article
* A 'bag of words', which is the text in the article.

In [None]:
bag_of_words = []
for url in URLs:
  bag_of_words.append(Get_Text_From_Page(url))


In [None]:
titles = []
for url in URLs:
  titles.append(Get_Soup(url).find("title").contents[0])
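
The two loops above request every page twice, once for the text and once for the title. A possible refinement is to fetch each page once and derive both values from it. This is a sketch under assumptions: `build_rows` is a hypothetical helper, and the fetch and extraction steps are passed in as functions with simple stand-ins so the example runs without network access. In the notebook, `fetch` could be `Get_Soup`, although `Get_Text_From_Page` would need a variant that accepts a soup instead of a URL.

```python
def build_rows(urls, fetch, get_text, get_title):
    """Fetch each URL once and derive both the text and the title from it."""
    rows = []
    for url in urls:
        page = fetch(url)  # a single request per URL
        rows.append({"URLS": url,
                     "Bag Of Words": get_text(page),
                     "Title": get_title(page)})
    return rows

# Stand-ins so the sketch runs offline; each fake page is (title, body).
fake_pages = {"u1": ("Title one", "body one"), "u2": ("Title two", "body two")}
calls = []

def fake_fetch(url):
    calls.append(url)  # record every fetch to show each URL is requested once
    return fake_pages[url]

rows = build_rows(["u1", "u2"], fake_fetch,
                  get_text=lambda page: page[1],
                  get_title=lambda page: page[0])
```

The resulting list of dictionaries uses the same column names as the data table built in the next cell.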

In [None]:
output = pd.DataFrame({"URLS":URLs,"Bag Of Words":bag_of_words,"Title":titles}).drop_duplicates()
output.to_csv(SCRAPE_OUTPUT_FILE)
execution_time = time.time() - start_time

## Output Profiling

The data table has been written to the output file. We can close the pipeline by providing some descriptive information about the corpus. The following is a preview of the data's first three rows. 

In [None]:
output.head(3)

Exploratory data analysis will be conducted on the corpus in the subsequent pipeline which performs NLP analysis. This pipeline is only responsible for extracting the text from the web pages. However, to ensure quality data is sent to the next pipeline, we can explore:
* The number of articles extracted
* Execution time
* Check for rows missing text

The total number of articles extracted:


In [None]:
len(output)


The time taken to process this many articles (in seconds):     

In [None]:
execution_time 

156.2861065864563

Very short text indicates a problem extracting text from the HTML. We can check how many records contain fewer than 100 characters of text as follows.

In [None]:
ind = output["Bag Of Words"].str.len() < 100
len(output[ind])
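
Note that `.str.len()` counts characters, not words. If a word-based threshold is preferred, the check could be sketched as below. The helper names and the sample texts are hypothetical, and plain Python lists are used so the example runs on its own. On the data table, `.str.split().str.len()` should give the equivalent per-row word count.

```python
def count_words(text) -> int:
    """Number of whitespace-separated tokens; non-strings count as zero."""
    return len(text.split()) if isinstance(text, str) else 0

def short_records(texts, min_words=100):
    """Return the positions of texts with fewer than `min_words` words."""
    return [i for i, t in enumerate(texts) if count_words(t) < min_words]

# Made-up sample: a long text, a short text and a missing value.
sample_texts = ["word " * 150, "too short", None]
flagged = short_records(sample_texts)
```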