# Text analytics - Natural Language Processing

After reviewing this notebook, you will be able to perform basic NLP steps, using mainly the popular spaCy library. 

NLP steps covered in this tutorial include:
- text pre-processing
- tokenization & lemmatization 
- finding keywords & transforming text to ML-usable numeric vectors
- working with meaning through word embeddings (semantic vectors)
- word tagging for named entities - Named Entity Recognition (NER)

------

### NLP with spaCy
https://spacy.io/ (The version used in this tutorial is spaCy 3.4)

First, let's import necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import spacy #spacy is a popular NLP library
from spacy import displacy #displacy enables visualisation of selected spacy outputs

SpaCy works with pre-trained language models which you have to download locally. For this tutorial, we will use the medium-size English model. For a list of models available, see spaCy documentation: https://spacy.io/usage/models

While we're at it, let's get a couple of other packages that will be needed later on.

In [None]:
!python -m spacy download en_core_web_md
!pip install spacy_langdetect
!pip install wikipedia

Once you have the model downloaded, you have to load it in order to be able to use it. (While you download it once only, you then have to load the model for each session/script.) Note that there is a naming convention to call your loaded model "nlp".

In [None]:
nlp = spacy.load("en_core_web_md")

You can then use the loaded model to process text input. Note that, by convention, the processed document is called 'doc'.

In [None]:
doc = nlp("John is learning about natural language processing. To learn this in Prague is fun for everyone who loves nature and languages.")
doc

At first sight, the doc object looks pretty much the same as the text input, but in fact it is already processed and contains many useful linguistic attributes.

From spaCy documentation:

>When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

The resulting Doc is split into tokens and has attributes coming from the processing steps (the pipeline can of course be customized to include more or fewer steps as necessary). You can access the tokens by standard indexing and slicing, and attributes by their name/type (https://spacy.io/api/token). Note that punctuation symbols are also included as individual tokens.
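
The idea that each component receives the document and hands it on to the next one can be sketched in plain Python. This is a conceptual toy, not spaCy's implementation: `ToyDoc`, `tokenize`, `lowercase_lemmatizer`, `punct_flagger` and `run_pipeline` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a processed document (not spaCy's Doc)."""
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def tokenize(text):
    # Naive tokenizer: split on whitespace and detach a trailing full stop.
    tokens = []
    for chunk in text.split():
        if chunk.endswith(".") and len(chunk) > 1:
            tokens.extend([chunk[:-1], "."])
        else:
            tokens.append(chunk)
    return ToyDoc(text=text, tokens=tokens)


def lowercase_lemmatizer(doc):
    # Toy "lemmatizer": it only lowercases each token.
    doc.annotations["lemma"] = [t.lower() for t in doc.tokens]
    return doc


def punct_flagger(doc):
    # Flags tokens that contain no letters or digits.
    doc.annotations["is_punct"] = [not t.isalnum() for t in doc.tokens]
    return doc


def run_pipeline(text, components):
    # Tokenize first, then pass the document through every component in order.
    doc = tokenize(text)
    for component in components:
        doc = component(doc)
    return doc


toy_doc = run_pipeline("John loves Prague.", [lowercase_lemmatizer, punct_flagger])
print(toy_doc.tokens)
print(toy_doc.annotations)
```

Each component only adds its own annotations, which is why components can be added, removed or reordered without changing the others.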


Let's inspect the processed object in a bit more detail. We can have a look at each token, its lemma, its part of speech tag (POS), whether it is a stopword, whether it consists of alphabetic characters, and its named entity (NE) type - if any.

In [None]:
#Nicer format for the output
print("{:<15} {:<15} {:<10} {:<10} {:<10} {:<10}".format(
    'Token','Lemma','POS','stop','alpha','NE'))
# Iterating over the tokens and accessing some of their attributes.
# (Note: attribute .lemma returns a numerical index of each lemma, .lemma_ returns the actual value)
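# (Note: booleans are wrapped in str() so that the padded format prints True/False rather than 1/0)
for token in doc:
    print("{:<15} {:<15} {:<10} {:<10} {:<10} {:<10}".format(
        token.text, token.lemma_, token.pos_, str(token.is_stop), str(token.is_alpha), token.ent_type_))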

# accessing individual tokens & sentences

print("Last 5 tokens: ", doc[-5:])
print("First sentence: ", list(doc.sents)[0])
print("Number of tokens: ", len(doc))
#doc.sents is a generator, to work with the actual sentences, make it into a list 
print("Number of sentences: ", len(list(doc.sents))) 
print("Number of identified entities: ", len(doc.ents), '\n')

You can see that the tokens still keep their upper/lower case as well as the exact grammatical form in which they appear in the text, while lemmas have both normalized form and case. Punctuation is also represented as tokens and has its own lemma. For further work with the text, it is better to do some pre-processing so that we keep only the most relevant words in their normalized form.

Side note: spaCy pipelines are language dependent. In case you are dealing with documents in unknown or mixed languages (or aim to categorize documents based on their language), you might want to include language detection as a first step of your pipeline. Language detection in general works more reliably on longer examples.

In [None]:
# necessary imports & adding language detection to the pipeline
from spacy_langdetect import LanguageDetector
from spacy.language import Language
def create_lang_detector(nlp, name):
    return LanguageDetector()

Language.factory("language_detector", func=create_lang_detector)
nlp.add_pipe('language_detector')

In [None]:
texts = ["This is an English example.",
        "Zkusíme i jiné jazyky.",
        "Es funkzioniert ganz gut.",
        "What happens if we do mix jazyky dohromady?",
        "Málo slov",
        "je",
        "těžší určit"]

# Explore the detected languages.
for text in texts:
    doc_detect = nlp(text)
    print(text, doc_detect._.language)

### Text pre-processing

Typical pre-processing steps include: removing punctuation, lowercasing, removing stopwords (frequently occurring words that don't significantly contribute to the unique meaning of a text/document), and normalization - either stemming or lemmatization. The resulting normalized text can be further used e.g. as input for ML classification.
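
To make the individual steps concrete, here is a minimal sketch that uses only the Python standard library. The stopword set `TOY_STOPWORDS` and the lemma lookup `TOY_LEMMAS` are tiny hypothetical resources made up for this example; spaCy relies on much larger language-specific data and, unlike this toy, keeps upper case for proper names.

```python
import string

# Hypothetical, tiny resources for illustration only.
TOY_STOPWORDS = {"is", "about", "to", "this", "in", "for", "who", "and"}
TOY_LEMMAS = {"learning": "learn", "loves": "love", "languages": "language"}


def toy_preprocess(text):
    # 1) remove punctuation
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # 2) lowercase and split on whitespace
    words = no_punct.lower().split()
    # 3) remove stopwords
    content_words = [w for w in words if w not in TOY_STOPWORDS]
    # 4) normalize each word through the lemma lookup (unknown words stay as they are)
    return [TOY_LEMMAS.get(w, w) for w in content_words]


toy_result = toy_preprocess("John is learning about languages. Everyone loves languages!")
print(toy_result)
```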

Let's have a look at how a text document differs at different stages of pre-processing.

In [None]:
# list of all tokens in a text
tokens = [token.text for token in doc]

# list of all tokens without punctuation and other characters.
# Note that here we simply go for tokens that contain alphabetic characters.
# You could also be more generic and specify which exact character types you would like to avoid:
# tokens_no_punct = [token for token in doc if not (token.is_punct or token.is_space)]
tokens_no_punct = [token for token in doc if token.is_alpha]

# list of all tokens without punctuation, all lowercase
tokens_no_punct_lower =  [token.text.lower() for token in doc if token.is_alpha]

# list of all lowercased tokens without punctuation + removing stopwords
tokens_no_punct_lower_no_stopwords = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]

# list of all lemmas without punctuation and stopwords (lemmas have unified case automatically)
lemmas = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

# Let's look at some statistics about our sample text.
print("Number of tokens: ", len(tokens))
print("Number of tokens without punctuation: ",len(tokens_no_punct))
print("Number of tokens without punctuation and stopwords: ", len(tokens_no_punct_lower_no_stopwords))
print("Number of lemmas without punctuation and stopwords: ", len(lemmas))
print("Number of unique tokens without punctuation and stopwords: ", len(set(tokens_no_punct_lower_no_stopwords)))
print("Number of unique lemmas without punctuation and stopwords: ", len(set(lemmas)))

You can see that removing stopwords and punctuation substantially reduces the number of items even in such a short text. When processing longer texts, this can lead to large savings in computing time and resources. Also note that there are fewer unique lemmas than unique tokens, because different forms of the same word are united under the same lemma (see the two lists below for a comparison). Finally, while we lowercased all the tokens ourselves, lemmas have their case unified automatically: generic words are lowercased, but proper names keep their upper case.

In [None]:
print(sorted(set(tokens_no_punct_lower_no_stopwords)))
print(sorted(set(lemmas)))

##### Hands-on #1
*Let's try all this out on a Wikipedia article. In this exercise we will demonstrate that using pre-processing and filtering out of stopwords and punctuation is crucial for getting meaningful insights from a text.*

*First we get the content of a selected Wikipedia page (simply run the cell below).*

In [None]:
import wikipedia
turing = wikipedia.page("Turings test").content
#show first 1000 characters of the text
turing[:1000]

*Your task is to process the text using spaCy's default pipeline. Then create two separate lists:*

    - tokens_turing, containing simply all the tokens from the text without any preprocessing
    - lemmas_turing, containing all the lemmas from the text, cleaned from punctuation and stop words
    
*Additional questions:*

1. How many sentences are there in the text?
2. What is the last sentence?
3. What is the number of unique tokens and lemmas in the text?

*Hint: You can use set(x) to create a set of unique values from a list x*

    

In [None]:
# Enter your code here
doc_turing = 
tokens_turing = 
lemmas_turing =



 

*Finally, we can use your prepared lists to see which tokens and lemmas are most frequent in the text. Run the following cell to see that text pre-processing not only saves computing time by reducing the number of analyzed terms, but also leads to a more meaningful text summary.*

In [None]:
from collections import Counter
word_freq_tokens = Counter(tokens_turing)
word_freq_lemmas = Counter(lemmas_turing)
print("Most frequent tokens: ", word_freq_tokens.most_common(10), "\n")
print("Most frequent lemmas: ",word_freq_lemmas.most_common(10))

*Now that you have pre-processed the text, you can use the lemmas to generate a quick summary of the text in the form of a popular graphic - the wordcloud. Just run the cell below to see it.*

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

# Generate word cloud from our lemmas dataset.
# Note that the wordcloud library expects input as a string.
word_cloud = WordCloud(background_color = 'white').generate(" ".join(lemmas_turing))
# Display the wordcloud
plt.imshow(word_cloud)
plt.axis("off")
plt.show()

### Feature extraction from a text

Our generated list of lemmas is already just a few lines of code away from having a bag-of-words representation of the whole text. Let's first summarize the preprocessing steps in one function.

In [None]:
def preprocess_text(input_text):
    doc = nlp(input_text)
    # lemmatization of significant tokens of text
    lemmas_cleaned = [token.lemma_ for token in doc if
                                 token.is_alpha and not token.is_stop]
    # joining tokens into a string
    processed_text = ' '.join(lemmas_cleaned)
    return processed_text

# sample input
texts = ["We love apples and we love cherries.",
       "We love apples and we love bananas.",
        "We love apples and we love pears"]
# pre-process the input
preprocessed = [preprocess_text(t) for t in texts]

Now, with the pre-processed input, we can continue to create the actual Bag-of-Words representation. It collects all the words from all documents in our processed input and uses them as the vocabulary, and it also counts the occurrences of each vocabulary item in each document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize the vectorizer
bow = CountVectorizer()
# transform words into a count vector
features_bow = bow.fit_transform(preprocessed)

# get the full vocabulary
vocab_bow = bow.get_feature_names_out()
# saving the word counts as an array (this can then be used as features for ML models)
features_array_bow = features_bow.toarray()
# inspect the vocab file
print(len(vocab_bow))
print(vocab_bow)
print(features_array_bow)
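
To see what such a count matrix contains, the same idea can be reproduced by hand with the standard library. This is a sketch under an assumption: the strings in `toy_preprocessed` are what the pre-processing step is expected to produce for the three sample sentences, but the exact lemmas depend on the loaded spaCy model. The names `toy_preprocessed`, `toy_vocab` and `toy_counts` are made up for this example.

```python
from collections import Counter

# Assumed pre-processed versions of the three sample sentences
# (stopwords removed, remaining words lemmatized).
toy_preprocessed = ["love apple love cherry",
                    "love apple love banana",
                    "love apple love pear"]

# The vocabulary is the sorted set of all words seen in any document.
toy_vocab = sorted({word for text in toy_preprocessed for word in text.split()})

# One row per document, one column per vocabulary item.
toy_counts = []
for text in toy_preprocessed:
    counts = Counter(text.split())
    toy_counts.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_counts:
    print(row)
```

Note that CountVectorizer with default settings additionally lowercases the input and ignores single-character tokens, which makes no difference for this toy input.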

Let's have a look at the vocabulary sorted by occurrence count.

In [None]:
features_cherries_bow = dict(zip(vocab_bow, features_array_bow[0]))
features_bananas_bow = dict(zip(vocab_bow, features_array_bow[1]))
features_pears_bow = dict(zip(vocab_bow, features_array_bow[2]))

print(dict(sorted(features_cherries_bow.items(), key=lambda x: x[1], reverse=True)[:10]))
print(dict(sorted(features_bananas_bow.items(), key=lambda x: x[1], reverse=True)[:10]))
print(dict(sorted(features_pears_bow.items(), key=lambda x: x[1], reverse=True)[:10]))

Bag of words is one way of vectorizing a text document. It is based solely on the frequency of occurrences of the individual terms, which might conceal some significant information. Therefore, TF-IDF (term frequency - inverse document frequency) is typically used for vectorization, as it adds information about which words are specific to a particular document. Let's demonstrate this on our sample sentences.

We can do the vectorization as we did with BoW, getting to the same vocabulary items. However, the assigned values for each vocabulary item are different now.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
features_tfidf = tfidf.fit_transform(preprocessed)

# get the full vocabulary
vocab_tfidf = tfidf.get_feature_names_out()
# saving the word values as an array (this can then be used as features for ML models)
features_array_tfidf = features_tfidf.toarray()
# inspect the vocab file
print(len(vocab_tfidf))
print(vocab_tfidf)
print(features_array_tfidf)

Let's inspect the sorted items again.

In [None]:
features_cherries_tfidf = dict(zip(vocab_tfidf, features_array_tfidf[0]))
features_bananas_tfidf = dict(zip(vocab_tfidf, features_array_tfidf[1]))
features_pears_tfidf = dict(zip(vocab_tfidf, features_array_tfidf[2]))

print(dict(sorted(features_cherries_tfidf.items(), key=lambda x: x[1], reverse=True)[:10]))
print(dict(sorted(features_bananas_tfidf.items(), key=lambda x: x[1], reverse=True)[:10]))
print(dict(sorted(features_pears_tfidf.items(), key=lambda x: x[1], reverse=True)[:10]))

You can see that while apples and the other types of fruit had the same value in BoW, in TF-IDF the other fruit types have a higher value than apples. This is because apples are mentioned in all three examples, but the other fruit types are specific to each document, thus adding more information for distinguishing between the three.
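
The difference can be traced by hand. The sketch below follows the weighting that scikit-learn documents for TfidfVectorizer with default settings: a smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, multiplied by the raw count, followed by L2 normalization of each document row. The counts in `toy_docs` are an assumption about what the pre-processing produces for the sample sentences, and all names are made up for this example.

```python
import math

# Assumed bag-of-words counts for the three pre-processed sample sentences.
toy_docs = [{"love": 2, "apple": 1, "cherry": 1},
            {"love": 2, "apple": 1, "banana": 1},
            {"love": 2, "apple": 1, "pear": 1}]
n_docs = len(toy_docs)


def smoothed_idf(term):
    # df = number of documents that contain the term
    df = sum(1 for d in toy_docs if term in d)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    # Raw tf-idf values, then scale the row to unit Euclidean length.
    raw = {term: count * smoothed_idf(term) for term, count in doc.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: value / norm for term, value in raw.items()}


first_row = tfidf_row(toy_docs[0])
print({term: round(value, 3) for term, value in first_row.items()})
```

A term that occurs in every document ("apple", "love") gets the minimum idf of 1, while a term found in only one document ("cherry") gets a larger idf, so it ends up with a higher weight than "apple" despite the same raw count.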

##### Hands-on #2

*Now let's practice what we have just learned on another example. Load the following Wikipedia articles about Prague, Brno, Bratislava and Budapest and process them. Get both the bag-of-words and the tfidf representation of these articles. Looking at the top 5 words for each city and each representation type, which words gain importance in the tfidf representation?*

In [None]:
# get the Wikipedia articles
texts = [wikipedia.page("prague city czech").content, 
         wikipedia.page("brno city").content,
         wikipedia.page("bratislava city").content,
         wikipedia.page("budapest city").content
        ]

In [None]:
# Your code goes here






### Word embeddings

The spaCy library also allows you to convert words and entire documents into their vector representation based on their meaning. The library uses an embedding model, which was trained on an extensive corpus.

The created vector representations (embeddings) reflect the semantic relationships between words and can be used, for example, to compare words, sentences or whole documents with each other. This is done by calculating the cosine similarity of the compared vectors.

In [None]:
word1 = nlp("dog")
word2 = nlp("cat")
word3 = nlp("car")
print("word1 to word2: ", word1.similarity(word2))
print("word1 to word3: ", word1.similarity(word3))
print("word2 to word3: ", word2.similarity(word3))

In [None]:
sent1 = nlp("Kids are buying apples and bananas.")
sent2 = nlp("Children are shopping for fruit.")
sent3 = nlp("We read a book.")
print("sent1 to sent2: ", sent1.similarity(sent2))
print("sent1 to sent3: ", sent1.similarity(sent3))
print("sent2 to sent3: ", sent2.similarity(sent3))

This can also be used for categorization of documents (by calculating the similarity between each document and a match phrase/topic), identifying (near) duplicates, or recommendation systems. Note that spaCy embeddings are based on individual words, so embeddings for longer texts will be less accurate, as they can easily get diluted by the generic words they include. If you want to compare longer texts with embeddings, it is better to use sentence-encoder-based models (such as Universal Sentence Encoder).
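
Both points can be illustrated with a small standard-library sketch. By default, spaCy represents a longer text as the average of its word vectors and compares vectors by cosine similarity. The 2-D vectors in `toy_vectors` are hypothetical values invented for this example (real embeddings have hundreds of dimensions).

```python
import math


def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average(vectors):
    # Element-wise mean of a list of equally long vectors.
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


# Hypothetical 2-D word vectors, invented for illustration.
toy_vectors = {
    "fruit": [1.0, 0.1],
    "apple": [0.9, 0.2],
    "the":   [0.1, 1.0],
    "is":    [0.0, 0.9],
    "of":    [0.1, 0.8],
}

topic = toy_vectors["fruit"]
short_text = average([toy_vectors[w] for w in ["apple"]])
long_text = average([toy_vectors[w] for w in ["apple", "the", "is", "of"]])

sim_short = cosine(short_text, topic)
sim_long = cosine(long_text, topic)
print("short text vs topic:", round(sim_short, 3))
print("long text vs topic: ", round(sim_long, 3))
```

The generic words pull the averaged vector away from the topic, so the longer text scores lower even though it contains the same content word.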

##### Hands-on #3
*Process the following newspaper headlines with spaCy. Categorize them into articles about medicine/health and articles about sports. Play around with the terms you calculate the similarity on - you can start with simply "medicine" and "sports", but you can try to extend the terms for example to "medicine health hospital" or include individual sports examples, and then see how that influences the result. If using thresholds for setting a category, you can also experiment with different cut-off values.*


In [None]:
headlines = ["The new drug slows progression of cognitive decline.",
             "Cancer is the #2 cause of death in the US.",
             "The best cookie recipe.",
            "Basketball players apologize after their fight.",
            "Rising Covid-19 cases might be a warning sign.",
             "Player X scored an amazing goal.",
            "Tennis star reaches the finals.",
             "Democrats won the election.",
            "Healthy lifestyle and exercising can lower your risk of early death."]

In [None]:
# Your code goes here





*As a bonus, you can play around with the following cells and find out what the semantically closest words are to your input words or what happens if you do some basic arithmetic with words. Try to come up with a good triad of words for that!*

*We will demonstrate this using a pre-trained GenSim model for a change.*

In [None]:
#necessary imports
import gensim
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")

*This is how you can get a list of words with most similar vectors*

In [None]:
model.most_similar("queen")

*And this is how you can calculate with words*

In [None]:
model.most_similar(positive=['king', 'woman'], negative=['man'])
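
The arithmetic behind such a query can be sketched with hand-made vectors. This is a simplified illustration, not gensim's implementation (whose details, such as vector normalization, differ): the positive vectors are added, the negative ones subtracted, and the remaining words are ranked by cosine similarity to the result. The 2-D values in `toy_embeddings` are hypothetical, with the first axis loosely standing for "royalty" and the second for "gender".

```python
import math

# Hypothetical 2-D embeddings, invented for illustration.
toy_embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def toy_analogy(positive, negative):
    # Add the positive vectors and subtract the negative ones.
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, toy_embeddings[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_embeddings[word])]
    # Rank the words that were not part of the query by similarity to the target.
    candidates = [w for w in toy_embeddings if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, toy_embeddings[w]))


closest = toy_analogy(positive=["king", "woman"], negative=["man"])
print(closest)
```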

In [None]:
# space for your own experiments






### Named entity recognition (NER)
NER is part of the default spaCy pipeline. It uses a pretrained model to identify entities. There are different models for different languages, and they may differ in the number of entity types they can identify and in their success rate. For specific entities, other languages or a higher success rate, you might need to look for other NER packages/models or train a model of your own.

Entities are all grouped under the .ents attribute of your doc, or you can access them directly from a token through the .ent_type_ attribute. Each token also carries a flag showing whether it is the first word of an entity (B - beginning) or a subsequent word of an entity (I - in); this can be checked through the .ent_iob_ attribute (non-entities have O - out - as their value).

In [None]:
# define text document
text = """Robin is watching a robin eating an apple on her Apple laptop. 
          I bought a second jersey in New Jersey yesterday for 20 dollars. 
          We are sitting at Deloitte office in Vinohrady at 10 am."""
doc = nlp(text)

# Check the recognised entities
print("Identified entities: ", [(ent.text, ent.label_) for ent in doc.ents])
# Highlight identified entities directly in text    
displacy.render(doc, style="ent", jupyter=True)

#Explain a tag
print("GPE: ", spacy.explain("GPE"))
print("ORG: ", spacy.explain("ORG"))
print("\n")

# Let's have a look at the tags accessed directly from the tokens
# last 4 tokens as an example
for t in doc[-4:]:
    print(t, t.ent_type_, t.ent_iob_)
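
The relationship between the token-level B/I/O flags and whole entities can be shown with a short standard-library sketch. The tokens, flags and labels in `toy_tokens` are written out by hand for illustration (they are not model output), and `iob_to_spans` is a hypothetical helper, not a spaCy function.

```python
# Hypothetical tokens as (text, IOB flag, entity type) triples.
toy_tokens = [("We", "O", ""), ("flew", "O", ""), ("to", "O", ""),
              ("New", "B", "GPE"), ("Jersey", "I", "GPE"),
              ("with", "O", ""), ("Apple", "B", "ORG"), (".", "O", "")]


def iob_to_spans(tokens):
    spans, current_words, current_label = [], [], None
    for text, iob, label in tokens:
        if iob == "I" and current_words:
            # Continuation of the entity that is currently open.
            current_words.append(text)
            continue
        # A "B" or "O" flag closes any entity that is still open.
        if current_words:
            spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        if iob == "B":
            # Start a new entity with this token.
            current_words, current_label = [text], label
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans


toy_spans = iob_to_spans(toy_tokens)
print(toy_spans)
```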

##### Hands-on #4

*Take the input text and replace any named people with the string "ANONYMIZED".*

*Hint: You can access the entity type of a token using the ent_type_ attribute*

In [None]:
text = """James Martin Brown was arrested for bribing Jane Newman in California.
        Katie had pneumonia in October. Jim, Marie and Bob Adams were accused of assaulting
        their cousin David Lim."""  

In [None]:
# Your code goes here



