> # Pre-Lab Instructions
> <img src="https://github.com/Minyall/sc207_290_public/blob/main/images/attention.webp?raw=true" height=200>

> For this lab you will need:
> - DATA: `farright_dataset.parquet` - Download from Moodle and upload to this Colab session.
> - IF YOU'RE *NOT* USING COLAB - You will need to install `spacy`, `beautifulsoup4` and the spacy model. Use the cell below to do this.

In [None]:
#*
# If you are NOT using Google Colab you'll need to uncomment the lines below and run this cell to install spacy and its model
# import sys
# !{sys.executable} -m pip install spacy beautifulsoup4
# !{sys.executable} -m spacy download en_core_web_sm

# Applying it to Guardian Data

`farright_dataset.parquet` is a dataset of articles from The Guardian API, retrieved and prepped using the processes we used in SC207:
- Retrieving from the API using the simple query `"far-right"` with a limit of 1,500 articles, ordered newest first.
- Retaining only 'articles' from the 'News' pillar.
- Unpacking nested data into its own columns and setting the correct data types.
- Removing outlier articles such as sponsored content.


In [None]:
import pandas as pd
from bs4 import BeautifulSoup

articles = pd.read_parquet('farright_dataset.parquet')
articles.info()

In [None]:
#*
# We turn our pandas column of texts into a simpler list to make it compatible with BeautifulSoup and Spacy
texts = articles['body'].tolist()

In [None]:
#*
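# For teaching purposes only - finds the first article whose body contains an <aside> element
idx = articles[articles['body'].str.contains('<aside')].first_valid_index()
# idx is an index label, so we look the text up with .loc rather than by list position
test_text = articles.loc[idx, 'body']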


# Prints out the URL of the story so we can view it as it's meant to look and compare to the text we have.
print(articles.loc[idx,'webUrl'])
print('----')
print(test_text)


If a text contains more complex elements, these will be wrapped in different tags that help lay it out on the website, change its formatting etc. We simply want the text inside the most basic 'paragraph' `<p>` elements. There may even be `<p>` elements that do extra things. These will have an associated `class` which tells the website to format them differently.

Sometimes there will be other elements *inside* `p` elements, such as sidebar related stories. Generally these are wrapped in `span` or `aside` tags. We will manually `decompose` these from the text - i.e. cut them out, before then identifying all the `p` elements and getting their text.

Generally for text analysis we want the content text rather than headings, web addresses, embedded side content etc. Every website will differ in the best way to extract this material. Though there are general standards for tagging HTML elements, it is usually necessary to customise which elements you decompose, which you keep, and in what order, to maximise the content you want to retain.
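
To make the "cut out `span` and `aside`, then keep class-less `p` elements" rules concrete, here is a minimal sketch on a made-up HTML snippet. It deliberately uses Python's built-in `html.parser` rather than BeautifulSoup so that it runs without extra installs. The class `PlainParagraphs` and the sample HTML are hypothetical and only mirror the rules described above; the BeautifulSoup code in the next cell is what this lab actually uses.

```python
from html.parser import HTMLParser

class PlainParagraphs(HTMLParser):
    """Collect text from class-less <p> elements, skipping span/aside content."""

    def __init__(self, remove_elements=("span", "aside")):
        super().__init__()
        self.remove_elements = remove_elements
        self.skip_depth = 0      # how many removed elements we are currently inside
        self.in_plain_p = False  # are we inside a <p> that has no class?
        self.paras = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_elements:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0 and "class" not in dict(attrs):
            self.in_plain_p = True
            self.paras.append("")

    def handle_endtag(self, tag):
        if tag in self.remove_elements and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_plain_p = False

    def handle_data(self, data):
        # Only keep text that sits in a plain <p> and outside any removed element
        if self.in_plain_p and self.skip_depth == 0:
            self.paras[-1] += data

# Made-up HTML: two plain paragraphs, a sidebar, a captioned paragraph and an inline span
sample_html = (
    "<p>First paragraph of the story.</p>"
    "<aside><p>Related: another story</p></aside>"
    '<p class="caption">Photo credit</p>'
    "<p>Second paragraph <span>advert</span>continues.</p>"
)

parser = PlainParagraphs()
parser.feed(sample_html)
parser.close()
print("\n".join(parser.paras))
```

Only the two plain paragraphs survive: the sidebar, the captioned paragraph and the inline `span` text are all dropped.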

In [None]:
# We'll remove span and aside elements
soup = BeautifulSoup(test_text, 'html.parser')

remove_elements = ('span','aside')
[e.decompose() for e in soup.find_all() if e.name in remove_elements]

# and we'll then retain the text associated with any p element that has no associated class
paras = [p.text for p in soup.find_all('p', class_=None)]
cleaned_item ='\n'.join(paras)
print(cleaned_item)

We can do this for every article in our list. First we'll build a function to do the job of cleaning, then we'll apply it to every item in the list of texts.

In [None]:
def clean_guardian_text(text, remove_elements=('span','aside')):
    soup = BeautifulSoup(text, 'html.parser')
    [e.decompose() for e in soup.find_all() if e.name in remove_elements]
    paras = [p.text for p in soup.find_all('p', class_=None)]
    cleaned_item ='\n'.join(paras)
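    # Normalise the curly apostrophe used by the Guardian (and its mis-encoded form) to a plain one.
    # str.replace returns a new string, so the result has to be assigned back.
    cleaned_item = cleaned_item.replace("â€™", "'").replace("’", "'")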
    return cleaned_item

cleaned_texts = [clean_guardian_text(t) for t in texts]

In [None]:
print(cleaned_texts[0])

In [None]:
articles['cleaned_text'] = cleaned_texts
articles.to_parquet('farright_dataset_cleaned.parquet')

# Tokenising

In [None]:
#*
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(cleaned_texts[0])

In [None]:
# Spacy can tell us how many 'tokens' are in the document - i.e. how many words (but also other things)
len(doc)

In [None]:
# How many sentences in the document?
len(list(doc.sents))

In [None]:
#*
# Tokens are units of text in natural language processing. Exactly how a text is 'tokenised' varies depending on the tool
# and many debates are had about the best way to do it.

# The goal is to render a text down into individual units of information that can be processed by different analysis techniques

# This is how spacy breaks up the document
[w.text for w in doc]

In [None]:

#*
# Spacy uses the context of the surrounding words and grammar to work out if the word is a noun, verb, adjective etc.
# They call this the 'part-of-speech' or POS
[(w.text, w.pos_) for w in doc]

In [None]:
#*
# Spacy tokens have helpful attributes...
# Is it alphabetical (i.e not numerical or punctuation)
[(w.text, w.is_alpha) for w in doc]

In [None]:
#*
# Is it punctuation? 
[(w.text, w.is_punct) for w in doc]

In [None]:
#*
# Is it a stop word?
[(w.text, w.is_stop) for w in doc]

### Stop Words?
Stop words are typically defined as the most common words in a language. Often incredibly common words can make it harder to find patterns in text. For example the most common words in a piece of text might be 'the', 'a', 'and' etc. That doesn't tell us much about the text even though the result is correct.
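
As a quick illustration of why this matters, here is a minimal sketch using only Python's built-in `Counter`. The sentence and the tiny `toy_stop_words` set are made up for this example; the stop word list that comes with the spacy model is much longer.

```python
from collections import Counter

# A made-up sentence and a tiny hand-picked stop list (illustrative only, not spacy's list)
sentence = "the march and the rally in the city and the march in the square"
toy_stop_words = {"the", "a", "and", "in"}

words = sentence.split()

# Counting every word: 'the' comes out on top, which tells us nothing about the topic
all_counts = Counter(words)
print(all_counts.most_common(1))

# Counting only the words that are not in our stop list: now 'march' comes out on top
content_counts = Counter(w for w in words if w not in toy_stop_words)
print(content_counts.most_common(1))
```

Both counts are "correct", but only the second says anything about what the sentence is about.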

In [None]:
#*
# These are the stop words for this model
print(nlp.Defaults.stop_words)


In [None]:
# We can use these token attributes to filter our text based on what type of token it is

# This ensures only alphabetical tokens that aren't stop words are retained.
[w.text for w in doc if w.is_alpha and not w.is_stop]

In [None]:
# This allows numbers as well, but filters out space symbols like \r and \n and punctuation

[w.text for w in doc if not w.is_space and not w.is_punct and not w.is_stop]

### Lemmatization

A word's lemma is the simpler 'root' word that best represents the word's meaning. It reduces the possible range of words whilst still ensuring the words left convey the appropriate meaning.

To make this clearer we can use some examples:

In [None]:
#*
# Here we have essentially the same sentences, just a variation in that one uses a contraction "don't" rather than "do not".
rabbit_1 = nlp("I don't like rabbits in space")
rabbit_2 = nlp("I do not like rabbits in space")
print( [token.lemma_ for token in rabbit_1])
print( [token.lemma_ for token in rabbit_2])


In [None]:
#*
# Even differing text can be brought at least closer in similarity using lemmas, reducing loving to love
rabbit_1 = nlp("I'm loving these rabbits")
rabbit_2 = nlp("I love this rabbit!")

print( [token.lemma_ for token in rabbit_1])
print( [token.lemma_ for token in rabbit_2])

If you are doing any text analysis that counts the frequency of words, relies on word similarity etc, it is usually a good idea to reduce the range of words being used so long as it can retain the same underlying semantic meaning.
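
The effect on word counts can be sketched without spacy at all. In the example below `toy_lemmas` is a hand-written lookup invented for illustration; it is not how spacy computes lemmas, it simply shows what happens to the counts once different forms are mapped to one root word.

```python
from collections import Counter

# Hand-written lookup for illustration only; this is not how spacy computes lemmas
toy_lemmas = {"rabbits": "rabbit", "loving": "love", "loved": "love", "loves": "love"}

words = ["rabbits", "rabbit", "loving", "loved", "loves", "love"]

# Without lemmas every word form is counted separately
raw_counts = Counter(words)

# With lemmas the different forms are pooled under one root word
lemma_counts = Counter(toy_lemmas.get(w, w) for w in words)

print(len(raw_counts), "distinct words before")
print(len(lemma_counts), "distinct words after")
print(lemma_counts)
```

Six separate word forms collapse into two lemmas, so the counts for 'love' and 'rabbit' are no longer split across their variations.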

In [None]:
filtered_tokens = [w.lemma_.lower() for w in doc if not w.is_stop and w.is_alpha]
filtered_tokens

In [None]:
from collections import Counter
counts = Counter(filtered_tokens)
counts.most_common(10)

In [None]:
# If you want to convert your filtered tokens to text you simply join them together again


filtered_text = " ".join(filtered_tokens)
filtered_text

# Tokenising in bulk
Spacy does some pretty heavy lifting, so we should tokenise once and then save the result to avoid having to rerun the process. Spacy also has a method that speeds up tokenising large numbers of documents. Now that we're getting into analysis, we're going to start encountering the actual nuts and bolts of using a computer, because the size of our datasets and the complexity of what we're doing can put a real strain on the hardware.

Depending on what kind of computer we have available we may have to tweak different settings to avoid analysis failing or hardware crashing. Often the things we have to balance are...
- How much information can the computer keep in its memory at one time (RAM), controlled by `batch_size=`
- How many workers can run at the same time (CPUs), controlled by `n_process=`
- How long are things going to take to finish (Your patience) controlled by `how_close_the_deadline_is=`<sup>*</sup>

Spacy's `.pipe` method can help us here. It can take a stack of texts and we can tell it how many workers to start running and how many texts each worker should handle at a time.

Generally, if you're using Google Colab, it takes around 4 minutes to process 500 articles. To avoid the hardware being overloaded and failing to finish, you should set `batch_size` to between 150 and 200 and leave it using just 1 worker.
 
If you have a more powerful laptop with multiple cores you can increase the number of workers and if you have a lot of RAM you can increase the batch size.
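
As a rough picture of what a batch size means, the sketch below splits a list of texts into chunks in plain Python. The helper `make_batches` and the placeholder texts are hypothetical; this only illustrates the idea of batching and is not a description of how spacy implements `.pipe` internally.

```python
def make_batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder texts standing in for 500 cleaned articles
fake_texts = [f"article {i}" for i in range(500)]

batches = list(make_batches(fake_texts, batch_size=150))

# 500 texts in chunks of 150 gives three full batches and one smaller final batch
print([len(b) for b in batches])
```

A smaller batch size means fewer texts are held at once but more rounds are needed to get through them all, which is the trade-off between memory and time described above.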

<sub>* Unfortunately not a real argument</sub>

In [None]:
import pandas as pd
import spacy

articles = pd.read_parquet('farright_dataset_cleaned.parquet')
cleaned_texts = articles['cleaned_text'].tolist()
nlp = spacy.load('en_core_web_sm')

In [None]:
def tokenise_doc(doc):
    tokens = [w.lemma_.lower() for w in doc if not w.is_stop and w.is_alpha]
    return ' '.join(tokens)

BATCH_SIZE = 150
WORKERS = 1


tokens = []
for doc in nlp.pipe(cleaned_texts, batch_size=BATCH_SIZE, n_process=WORKERS):
    tokens.append(tokenise_doc(doc))

articles['tokens'] = tokens
articles.to_parquet('farright_dataset_cleaned.parquet')


In [None]:
from collections import Counter  # re-imported so this cell works in a fresh session

for toks in tokens[:5]:
    print(Counter(toks.split()).most_common(10))