<a href="https://colab.research.google.com/github/Santwijck/Jupyter-refactoring-beginner-jul2020/blob/master/Introduction_to_NLP_Pyladies_ellesvanderlaan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP

This notebook introduces several traditional NLP techniques, starting with part-of-speech tagging, dependency parsing, and named entity recognition. We will briefly explore ways to turn text into numbers so that it can be used in machine learning models, and go over some supervised machine learning methods known to be effective for text data.

First, import the necessary libraries:

In [None]:
import nltk
import pandas as pd
import re 
from sklearn.datasets import fetch_20newsgroups
import spacy
from tqdm import tqdm

nltk.download('punkt')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')

# initialize pandas progress bar
tqdm.pandas()

## Analyze

### Split into sentences

In order to work with text, it is often necessary to split a text into its separate sentences, and each sentence into its separate words. I often use the `nltk` library (Natural Language Toolkit) to do that for me. Before showing how to use `nltk` to do that, let's first explore why it's easier to use `nltk` than to write your own functions.

***TASK | 5 min***
Write some code that splits a text into a list of its sentences, and test it on the paragraphs below. If you're done early, try to improve your sentence splitter and test it on paragraph_3.

***Hint***
All strings in Python have a `.split()` method that allows you to split a string into smaller strings based on a separator. Use `.split()` to split up the following paragraph.


In [None]:
paragraph_1 = ''' In the year of 1878 I took my degree of Doctor of Medicine of 
the University of London, and proceeded to Netley to go through the course 
prescribed for surgeons in the Army. Having completed my studies there, I was 
duly attached to the Fifth Northumberland Fusiliers as assistant surgeon. The 
regiment was stationed in India at the time, and before I could join it, the 
second Afghan war had broken out. On landing at Bombay, I learned that my corps 
had advanced through the passes, and was already deep in the enemy's country. 
I followed, however, with many other officers who were in the same situation as 
myself, and succeeded in reaching Candahar in safety, where I found my regiment, 
and at once entered upon my new duties. '''

In [None]:
sentences = paragraph_1.###your code here###
sentences

Now try the same code on the next paragraph. 

In [None]:
paragraph_2 = '''I had called upon my friend, Mr. Sherlock Holmes, one day in the 
autumn of last year and found him in deep conversation with a very stout, 
florid-faced, elderly gentleman with fiery red hair. With an apology for my 
intrusion, I was about to withdraw when Holmes pulled me abruptly into the room 
and closed the door behind me.'''

In [None]:
sentences = paragraph_2.###your code here###
sentences

In [None]:
paragraph_3 = '''Isa Whitney, the brother of the late Elias Whitney, D.D., 
Principal of the Theological College of St. George's, was much addicted to 
opium. The habit grew upon him, as I understand, from some foolish freak when 
he was at college; for having read De Quincy's descriptions of his dreams and 
sensations, he had drenched his tobacco with laudanum in any attempt to produce 
the same effects. He found, as so many more have done, that the practice is 
easier to attain than to get rid of, and for many years he continued to be a 
slave to the drug, an object of mingled horror and pity to his friends and 
relatives.  '''

In [None]:
### your code here ###

As you've seen, splitting a paragraph into sentences is not trivial: some language-specific knowledge is needed to recognize that the period in 'Mr. Sherlock Holmes' does not end a sentence, but is part of the abbreviation of Mister.
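
To make the problem concrete, here is a minimal sketch of the naive approach, which treats every period as a sentence boundary. The excerpt is a shortened, made-up variant of paragraph_2 and the variable names are only for illustration:

```python
# A shortened excerpt in the style of paragraph_2 (for illustration only).
excerpt = "I had called upon my friend, Mr. Sherlock Holmes, one day. He was busy."

# Naive approach: split on every period and drop empty leftovers.
naive_sentences = [part.strip() for part in excerpt.split(".") if part.strip()]

for naive_sentence in naive_sentences:
    print(repr(naive_sentence))
```

The excerpt contains two sentences, but the naive splitter returns three pieces because it also cuts after 'Mr'.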

If the 5 minutes haven't passed yet, offer your help to your fellow Pyladies in the chat.

===========================================



`NLTK` is able to take these language-specific uses of periods etc. into account, and does a great job at splitting the above paragraphs into sentences:

In [None]:
nltk.sent_tokenize(paragraph_1)

In [None]:
nltk.sent_tokenize(paragraph_2)[0]

In [None]:
nltk.sent_tokenize(paragraph_3)[0]

### Split into words

As with sentences, splitting a text into its separate words is not as trivial as using a space character as a separator. Consider for example:

In [None]:
sentence = "It's great that you're joining this workshop!"
words = sentence.split(' ')
words

Again, `nltk` is of great help:

In [None]:
words = nltk.word_tokenize(sentence)
words
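
To get a feeling for what a tokenizer has to do beyond splitting on spaces, here is a rough regex-based sketch. It is not the algorithm `nltk` uses, only an approximation: it separates punctuation from words, but unlike `nltk.word_tokenize` it also breaks contractions at the apostrophe.

```python
import re

sentence = "It's great that you're joining this workshop!"

# A run of word characters becomes one token; every other
# non-whitespace character becomes a token of its own.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```

Note that the exclamation mark is now a separate token instead of being glued to 'workshop'.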

### Tag
Sometimes it is useful to only extract all nouns from a text, or all adjectives plus nouns, or all verbs that combine with the noun 'NLP'. In order to do so, we need to *tag* our text, and assign *part-of-speech (POS)* tags to every word. 

`nltk` has a built-in tagger that can do this for you. Don't forget that the tagger needs a list of words as input!


In [None]:
sentence = "It's great that you're joining this workshop!"
words = nltk.word_tokenize(sentence)
nltk.pos_tag(words)

In [None]:
# If you are interested in learning what all the abbreviations stand for: 
nltk.help.upenn_tagset()

Even though `nltk`'s taggers work well, I often default to the `spaCy` library for POS tagging. This is mostly because `spaCy` is really easy to use and provides a whole range of NLP functionality.

In just a few lines of code, `spaCy` will tokenize and POS tag a text for you. It also does a whole bunch of other things, such as dependency parsing, lemmatization, and Named Entity Recognition (more on that below). The only steps you need to take are:
1. Read in a (statistical) language model that can process your text;
2. Use this model to create a `Doc` object;
3. Take out the information you need from the `Doc` object.

As noted on the spacy.io website:
*"Even though a `Doc` is processed – e.g. split into individual words and annotated – it still holds all information of the original text, like whitespace characters. You can always get the offset of a token into the original string, or reconstruct the original by joining the tokens and their trailing whitespace. This way, you’ll never lose any information when processing text with `spaCy`."*

In [None]:
# load spacy model (usually called 'nlp')
nlp = spacy.load('en_core_web_sm')

# create a Doc object
doc = nlp(sentence)

# inspect contents of doc
[(token.text, token.pos_, token.lemma_, token.dep_) for token in doc]

For more information on what kinds of information you can access in a `Doc` object, check https://spacy.io/api/doc and https://spacy.io/api/token. 

### Parse

To further explore a text's grammatical structure, you might want to know which words are the subjects of a sentence, which adjectives belong to which nouns, etc. For this, you can use a dependency parser. 

In the code snippet above, we already saw some examples of this dependency parser: the `.dep_` attributes showed us which words are used as subjects and which words are used as (direct) objects in the example sentence. However, it would help to actually see how the words are related to each other. Luckily, there's `displaCy`  - a visualizer of syntactic dependencies. Try it out for yourself: 

In [None]:
# text = 'some text here'
# doc = nlp(text)
spacy.displacy.render(doc, style='dep', jupyter=True)

### Named Entities

Last but not least, this notebook will demonstrate how to use `spaCy` to extract named entities from your text, and render a beautiful visualization of them.

Named entities are real-world objects such as persons, locations, organizations, etc. They are often quite important in text mining applications that, for example, explore which persons or locations are mentioned in a text. Again, `spaCy` can help us detect them with just a few simple lines of code. Check https://spacy.io/api/annotation if you want to know what all the abbreviations stand for.

In [None]:
# Remember paragraph 1?
print(paragraph_1)

In [None]:
doc = nlp(paragraph_1)
[(ent.text, ent.label_) for ent in doc.ents]

And again, `displaCy` allows you to visualize the different entities in a running text very easily:

In [None]:
spacy.displacy.render(doc, style='ent', jupyter=True)

Nowadays you can also use state-of-the-art transformer models relatively easily through the pipelines of the `transformers` library. The syntax is straightforward, and you can easily load a different tokenizer and language model. For more information on how to use these pipelines, see https://huggingface.co/transformers/main_classes/pipelines.html.

In [None]:
!pip install transformers

In [None]:
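from transformers import pipeline

# use a separate name so that the spaCy model stored in `nlp` is not
# overwritten; it is needed again in the Normalize section below
ner_pipeline = pipeline("ner")
sequence = paragraph_1

In [None]:
print(ner_pipeline(sequence))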

Note how the text has been tokenized - what stands out?

## Prepare

The techniques shown above are all very useful when you want to understand the linguistic structure of your text better, and/or when you need to extract specific patterns or entities from a text. However, when you want to fit machine learning models on text, there are many other things you can do with it, which are collectively known as *text preprocessing*. In text preprocessing you explore, normalize, and vectorize text. Below, you will see what is meant by these three terms.

### Load

We will work with the 20 newsgroups text dataset. It contains about 18,000 newsgroup posts on 20 different topics. This dataset is often used in NLP workshops, trainings, and tests, so you may have encountered it before.



In [None]:
# Read in the 20 newsgroups dataset (it is converted to a pandas dataframe further below):
dataset = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

In [None]:
# Check which categories are part of the data:
list(dataset.target_names)

In [None]:
# For now, we'll only select a subset of the labels. 
labels = ['comp.graphics', 'rec.autos', 'sci.space']

# read in a subset of the data:
dataset = fetch_20newsgroups(categories=labels, remove=('headers', 'footers', 'quotes'))

# change into a pandas dataframe
df = pd.DataFrame({
    'label':dataset.target,
    'text':dataset.data
})

### Preprocess

#### Explore

Start exploring the data:

In [None]:
df.head()

In [None]:
print(df.shape)
print(df['label'].value_counts())

***TASK | 5 min***
Now start exploring the text data yourself. Some questions to consider when exploring text data:
1. What does a single item of text look like? Does it consist of multiple words, multiple sentences?
2. Are there any special characters in the text that we need to remove or replace? 
3. Do the texts contain 'general' language, or is there a lot of domain-specific jargon used?
4. Do all texts look the same?
5. Are all texts in the same language?
... 

In [None]:
# Your own code here (5 min)

If you're done early, offer your help in the chat to your fellow Pyladies!

==================================

Another way to explore your text is by clustering it based on the distribution of the words in the different texts. *Topic modeling* is a technique that allows you to explore different clusters (or 'topics') that are present in your text.

In [None]:
!pip install pyLDAvis

In [None]:
import gensim
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim  # note: renamed to pyLDAvis.gensim_models in pyLDAvis >= 3.3

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

tokenized_data = df['text'].apply(nltk.word_tokenize)
# Build a Dictionary - association word to numeric id
dictionary = corpora.Dictionary(tokenized_data)
# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=6, id2word=dictionary)
# Export interactive visuals as html
p = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(p)

After considering the above visualization, what stands out the most to you? 

Are these words the most useful to consider for a classification algorithm when learning to distinguish between the three categories we've selected? 
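
The `doc2bow` step used for the topic model above turns every text into a 'bag of words': a list of (word id, count) pairs. The following pure-Python sketch mimics that idea on two tiny made-up documents. The ids are assigned in order of first appearance here, which is an assumption for illustration and not necessarily the order `gensim` uses.

```python
from collections import Counter

# Two tiny, made-up documents that are already tokenized.
toy_docs = [
    ["the", "car", "is", "fast"],
    ["the", "rocket", "is", "the", "fastest"],
]

# Assign an integer id to each distinct word, in order of first appearance.
word2id = {}
for toy_doc in toy_docs:
    for word in toy_doc:
        word2id.setdefault(word, len(word2id))


def to_bow(tokens):
    """Return sorted (word id, count) pairs for one tokenized document."""
    counts = Counter(word2id[word] for word in tokens)
    return sorted(counts.items())


bows = [to_bow(toy_doc) for toy_doc in toy_docs]
print(word2id)
print(bows)
```

Even in this toy corpus, the highest count belongs to 'the', a word that says nothing about the topic of a text.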

#### Normalize

The visualization above has brought an important issue to light: the words that occur most often are not necessarily the ones that best distinguish the classes in a classification model. The text also contains many characters, such as whitespace and punctuation, that are likely irrelevant but take up a lot of 'space'. Therefore, it is common in traditional NLP modeling to remove punctuation and words such as 'the', 'a', or 'it' (commonly referred to as stop words), and to normalize text by replacing each word with its dictionary form (lemmatizing).

In [None]:
text = paragraph_1
print(text)
print('\n')
print('Text after removal of punctuation:')
print('\n')
doc = nlp(text)
tokens = [word.text for word in doc if not word.is_punct]
text = ' '.join(tokens)
print(text)

In [None]:
print(text)
print('\n')
print('Text after removal of stop words:')
print('\n')
doc = nlp(text)
tokens = [word.text for word in doc if not word.is_stop]
text = ' '.join(tokens)
print(text)

In [None]:
print(text)
print('\n')
print('Text after lemmatization:')
print('\n')
doc = nlp(text)
tokens = [word.lemma_ for word in doc]
text = ' '.join(tokens)
print(text)

In [None]:
# putting it all together in some neat functions:
def remove_whitespaces_and_newlines(text):
    text = re.sub('\n', ' ', text)
    text = re.sub(' +', ' ', text)
    return text


def remove_digits(text):
    text = re.sub(r'\d+', '', text)  # raw string avoids an invalid escape sequence warning
    return text


def preprocess_text(text):
    text = str(text)
    text = remove_digits(text)
    text = remove_whitespaces_and_newlines(text)
    doc = nlp(text)
    tokens = [word.lemma_ for word in doc if not word.is_stop and not word.is_punct]
    text = ' '.join(tokens)
    return text
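
To see what the two regex helpers do before `spaCy` comes into play, here is a small self-contained sketch that repeats them and applies them to a made-up string (the sample text is an assumption for illustration, not taken from the dataset):

```python
import re


def remove_digits(text):
    # Delete every run of digits.
    return re.sub(r"\d+", "", text)


def remove_whitespaces_and_newlines(text):
    # Turn newlines into spaces, then collapse repeated spaces into one.
    text = re.sub("\n", " ", text)
    return re.sub(" +", " ", text)


raw = "Posted in 1993:\nthe   shuttle launch\nwas delayed 2 days."
cleaned = remove_whitespaces_and_newlines(remove_digits(raw))
print(cleaned)
```

Removing digits first can leave double spaces behind, which is why the whitespace cleanup runs second.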

In [None]:
# apply preprocess_text to the 20 newsgroup dataset:
df['preprocessed_text'] = df['text'].progress_apply(preprocess_text)

To get an idea of what the functions above have cleaned from the text, consider one of the texts and its preprocessed variant:

In [None]:
# to make sure that the whole text is printed in the notebook, you can use:
pd.set_option('display.max_colwidth', None)

df.text.iloc[0]

In [None]:
df.preprocessed_text.iloc[0]

In [None]:
tokenized_data = df['preprocessed_text'].apply(nltk.word_tokenize)
# Build a Dictionary - association word to numeric id
dictionary = corpora.Dictionary(tokenized_data)
# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=6, id2word=dictionary)
# Export interactive visuals as html
p = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(p)

#### Vectorize

One last step remains before we can feed our data to a machine learning model: the texts must be transformed into numbers, as algorithms cannot compute on letters.

There are several techniques to transform text into numbers. Here, we will explore TF-IDF (Term Frequency - Inverse Document Frequency) and word embeddings.

*"TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general"* (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). 
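
To make the definition tangible, here is a minimal pure-Python sketch on three tiny made-up documents. It assumes one particular variant of the formula: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by scaling each document vector to unit length. To my knowledge this matches the default settings of `sklearn`'s `TfidfVectorizer`, but treat it as an illustration rather than a reference implementation.

```python
import math
from collections import Counter

# Three tiny, made-up documents that are already tokenized.
toy_docs = [
    ["space", "shuttle", "launch"],
    ["car", "engine", "car"],
    ["space", "car"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))


def idf(word):
    """Smoothed inverse document frequency (assumed variant, see above)."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf(tokens):
    """Return a dict of word -> tf-idf weight, scaled to unit length."""
    term_freq = Counter(tokens)
    weights = {word: count * idf(word) for word, count in term_freq.items()}
    norm = math.sqrt(sum(value ** 2 for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}


print(tfidf(toy_docs[0]))
```

In the first document, 'shuttle' receives a higher weight than 'space', because 'space' also occurs in another document and is therefore less distinctive.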

We don't need to write out the calculations ourselves - `sklearn` makes it easy to transform your text into a tf-idf representation.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 

# create the vectorizer
vectorizer=TfidfVectorizer() 
# fit the tf-idf vectorizer on the preprocessed text of your dataframe
X = vectorizer.fit_transform(df['preprocessed_text'])


Now let's inspect some transformed texts:



In [None]:
print(df.shape)
print(X.shape)

In [None]:
print(vectorizer.vocabulary_)

In [None]:
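# create a dictionary that maps each word to its IDF weight
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
word2tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))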

# print first 25 words and their score
for word, score in list(word2tfidf.items())[:25]:
    print(word, score)

In [None]:
# let's look up the texts in which one of these words appears
# (replace 'aangeboden' with any word from the list printed above):
df[df.preprocessed_text.str.contains('aangeboden')]

For word embeddings, you can use the `gensim` library. `gensim` provides access to pretrained word embeddings that you can use to vectorize your texts. Since word embeddings encode semantic information in the vectors, they often perform better in machine learning models than tf-idf.

In [None]:
import gensim.downloader

# download a set of pretrained embeddings
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-50')

In [None]:
# see how powerful word embeddings are:
glove_vectors.most_similar('labrador')
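
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below computes that measure by hand. The three-dimensional vectors are made up for illustration; real GloVe vectors from this model have 50 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings, NOT taken from the GloVe model.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "rocket": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"]))
print(cosine_similarity(toy_vectors["dog"], toy_vectors["rocket"]))
```

Vectors that point in a similar direction score close to 1, while unrelated ones score close to 0.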

In [None]:
# take the text from the first item
text = df['preprocessed_text'].iloc[0]

# loop over the text to collect word embeddings for all words
# (`word in glove_vectors` works in both gensim 3 and gensim 4; `.vocab` was removed in gensim 4)
word_vectors = [glove_vectors[word] for word in nltk.word_tokenize(text) if word in glove_vectors]

# inspect results:
word_vectors

Since texts seldom contain the same number of words, the list of word vectors has a different length for each text. Machine learning algorithms expect the same number of features for every text, so you need a way to bring the vectorized texts to a fixed size. There are several ways to do that:
- taking the mean of vectors (MoV);
- truncating and padding.

Here, I'll show you how to take the mean of vectors:

In [None]:
import numpy as np

# average over all word vectors: one fixed-size vector per text
np.mean(word_vectors, axis=0)
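
The other option mentioned above, truncating and padding, can be sketched as follows. The function name `pad_or_truncate` and its parameters are hypothetical helpers written for this illustration; plain Python lists are used so that the sketch runs without any extra libraries.

```python
def pad_or_truncate(vectors, max_len, dim):
    """Return exactly `max_len` vectors of size `dim`.

    Longer inputs are cut off after `max_len` vectors; shorter inputs
    are filled up with all-zero vectors.
    """
    kept = [list(vector) for vector in vectors[:max_len]]
    padding = [[0.0] * dim for _ in range(max_len - len(kept))]
    return kept + padding


# Made-up two-dimensional 'word vectors' for a short and a long text.
short_text = [[1.0, 2.0], [3.0, 4.0]]
long_text = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

print(pad_or_truncate(short_text, max_len=3, dim=2))
print(pad_or_truncate(long_text, max_len=3, dim=2))
```

Unlike the mean of vectors, this keeps the word order, at the cost of a larger feature space and the loss of everything beyond `max_len` words.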

## Train & Predict

Now it's time to build a classifier that can predict which of the three classes is the best label for any unseen text. First, let's split the data into a train and test set:

In [None]:
from sklearn import metrics, model_selection, pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
df_train, df_test, y_train, y_test = model_selection.train_test_split(
    df, 
    df['label'], 
    test_size=0.2, 
    random_state=42
)

#### TF-IDF features

In [None]:
tf_idf = TfidfVectorizer()

# pipeline
pipe = pipeline.Pipeline([
                          ("vectorizer", tf_idf),
                          ("classifier", LogisticRegression())
                          ])
pipe.fit(
    df_train['preprocessed_text'],
    y_train
)

Now check the performance on the unseen test set:

In [None]:
X_test = df_test["preprocessed_text"].values
predicted = pipe.predict(X_test)
print(metrics.classification_report(y_test, predicted))
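
The classification report lists precision, recall, and F1-score per class. As a minimal sketch of what these numbers mean, the function below computes them by hand for a single class. The toy labels and predictions are made up and are not results from the model above.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up toy labels and predictions for three classes (0, 1, 2).
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

print(precision_recall_f1(y_true_toy, y_pred_toy, label=1))
```

For class 1, every true item was found (recall 1.0), but one of the three predictions for class 1 was wrong (precision 2/3).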

## Zero-shot classification with transformers

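What can you do if there is no labeled data? One option is zero-shot classification: a pretrained model from the `transformers` library scores a text against candidate labels that you provide, without any training on your own data.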

In [None]:
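# sklearn's `pipeline` module was imported above under the same name,
# so import the transformers function under an alias
from transformers import pipeline as hf_pipeline

# downloading the model might take a long time...
classifier = hf_pipeline("zero-shot-classification")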

text = df.text.iloc[25]
print(text)
labels = ['science', 'cars', 'graphics']

classifier(text, labels)
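
The classifier returns a dictionary with the input text under `sequence`, the candidate labels under `labels`, and one score per label under `scores`. The sketch below shows how to pick the best label from such a result. The dictionary is hypothetical: its text and scores are made up for illustration and are not real model output.

```python
# Hypothetical result, shaped like the output of the zero-shot pipeline.
# The sequence and the scores are made up, not real model output.
example_result = {
    "sequence": "The probe will reach Mars orbit next year.",
    "labels": ["science", "graphics", "cars"],
    "scores": [0.81, 0.12, 0.07],
}

# Pair each label with its score and keep the pair with the highest score.
best_label, best_score = max(
    zip(example_result["labels"], example_result["scores"]),
    key=lambda pair: pair[1],
)
print(best_label, best_score)
```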
