# Introduction to NLP with Python

A brief introduction to useful Python packages and functions for NLP and data manipulation, and some initial exploration of some example data.

Written by Sumithra Velupillai June 2019, updated January 2021

Some of the material is from, or inspired by: https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a


## Working with datasets and packages

Importing packages you need for a project is the first step.

pandas is a very useful package for working with datasets.

In [None]:
import pandas as pd
import numpy as np

## Data visualization

There are many packages for visualisation in Python. matplotlib is one of the core packages, and seaborn builds on it with some very nice functionality.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

## NLP

Two main NLP packages are commonly used when working in the Python environment:

spaCy: https://spacy.io/

nltk: https://www.nltk.org/

We will also load some of their functions and resources for English.

spaCy has a default language model for English that we will load into the variable 'nlp'

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

nltk has many functions for language processing tasks. Here we will use their stopword list, a sentence tokenizer, one of their word tokenizers, one of their lemmatizers, and their POS tagger, as well as the data these need.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

## Other useful packages

We will also load some packages from one of the most commonly used machine learning libraries: 

scikit-learn: https://scikit-learn.org/stable/

And some other useful packages

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings 
warnings.filterwarnings('ignore')

%matplotlib inline
#matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)
from collections import Counter

## Data

We'll start by opening some data. This is data we will also use in the machine learning classification practical later this week. The data are medical transcriptions from mtsamples: https://www.mtsamples.com/

We have selected a subset of documents just to explore and try to get an idea of what is in them in terms of texts. The data has been saved in a spreadsheet that we'll open from a URL.

In [None]:
xlds = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/classification/classification_trainingdata.xlsx?raw=true'
df = pd.read_excel(xlds)

How many documents do we have?

In [None]:
len(df)

What does this dataframe look like? 

In [None]:
df.head()

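The texts contain many redundant newline characters, as well as '_x000D_' markers (carriage returns as encoded by the spreadsheet format). We can collapse each run into a single newline, which makes the data more readable.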

In [None]:
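# '_x000D_' is a literal marker, so no regex is needed for the first step
df['txt'] = df['txt'].str.replace('_x000D_\n', '\n', regex=False)
# the remaining patterns use '+', so they must be treated as regular expressions
# (pandas 2.0 and later treat the pattern as a literal string unless regex=True)
df['txt'] = df['txt'].str.replace('\r\n+', '\n', regex=True)
df['txt'] = df['txt'].str.replace('\n +', '\n', regex=True)
df['txt'] = df['txt'].str.replace('\n+', '\n', regex=True)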
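
To see what this clean-up does, here is a minimal sketch that applies the same four patterns to a single made-up string. The toy text is an assumption for illustration and is not taken from the mtsamples data. The `regex=True` argument is spelled out because pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical toy text, not taken from the real dataset
toy = pd.Series(
    ["SUBJECTIVE:_x000D_\n_x000D_\n  The patient is well.\n\n\nPLAN:\r\n\r\nFollow up."]
)

cleaned = (
    toy.str.replace("_x000D_\n", "\n", regex=False)  # literal spreadsheet marker
       .str.replace(r"\r\n+", "\n", regex=True)      # Windows-style line endings
       .str.replace(r"\n +", "\n", regex=True)       # spaces at the start of a line
       .str.replace(r"\n+", "\n", regex=True)        # runs of newlines
)

print(repr(cleaned[0]))
```

Printing with `repr` makes the remaining newline characters visible, which is handy when checking that a pattern did what you expected.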

So there are two main columns: 'label' and 'txt'. Each row in the dataframe is thus a text with a label. What does the label indicate?

Let's look at one of the documents. 

In [None]:
df['txt'][0]

Do you get a sense for what these texts might look like? Try looking at some other examples.
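
One possible way to look at more examples is to count the labels and print a few randomly chosen documents. The sketch below uses a small hypothetical dataframe with the same two column names ('label' and 'txt'); the label values and texts are made up.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe, with the same two columns
toy_df = pd.DataFrame({
    "label": ["A", "B", "A", "A"],
    "txt": ["first note", "second note", "third note", "fourth note"],
})

# How many documents carry each label?
label_counts = toy_df["label"].value_counts()
print(label_counts)

# Print two documents picked at random (random_state makes the pick repeatable)
for idx, row in toy_df.sample(n=2, random_state=0).iterrows():
    print(idx, row["label"], row["txt"][:200])
```

On the real data you would replace `toy_df` with `df`; slicing the text with `[:200]` keeps long documents from flooding the output.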

One way of exploring the data and trying to understand more about this corpus is to tokenize the corpus into its individual words and look at the most frequent ones. Let's see how we can use spaCy and nltk for this.

With spaCy, we can apply the default English language model to all texts in one go. Remember that we loaded this model into the variable 'nlp' above. With pandas, we can apply it to every row in the dataframe and save the result in a new column 'spacynlp':

In [None]:
df['spacynlp'] = df['txt'].apply(nlp)

Let's look at the first rows in the dataframe again.

In [None]:
df.head()

In the 'spacynlp' column, we now have a spaCy representation of each text. From this representation, we can retrieve the tokens, lemmas, part-of-speech (POS) tags, and other units that the default model gives us. Information about the different linguistic features the spaCy model currently handles can be found here: https://spacy.io/usage/linguistic-features

For this exercise, we will look at lemma and POS frequencies. We'll write some functions. We'll start by tokenizing the data into sentences and save as a column, then we'll do the same for lemmas and save in another column.

The first function takes the spaCy document stored in a row of the 'spacynlp' column and returns a list of its sentences.

In [None]:
def get_spacy_sentences(row):
  # each value in 'spacynlp' is a spaCy Doc, so its sentences can be read directly
  return list(row.sents)

We'll now apply this function and save the result in a new column: 'spacysentence'

In [None]:
df['spacysentence'] = df['spacynlp'].apply(get_spacy_sentences)

Let's define a similar function, but this time we'll extract lemmas from the spaCy model.

In [None]:
def get_spacy_lemmas(row):
  # iterating over a spaCy Doc yields its tokens; lemma_ is the lemma as a string
  return [token.lemma_ for token in row]

Let's apply this function on each row in the dataframe and save in a new column: 'spacylemma'. 

In [None]:
df['spacylemma'] = df['spacynlp'].apply(get_spacy_lemmas)

Let's do something similar to get the POS tags:

In [None]:
def get_spacy_pos(row):
  # pos_ is the coarse-grained part-of-speech tag as a string
  return [token.pos_ for token in row]

In [None]:
df['spacypos'] = df['spacynlp'].apply(get_spacy_pos)

What does the dataframe look like now?

In [None]:
df.head()

In [None]:
len(df['spacylemma'][20])

We can look at the list of lemmas for one example document:

In [None]:
df['spacylemma'][10]

Let's do this also with nltk, so that we can compare

With nltk, there are many options for different NLP tasks, including sentence tokenization, word tokenization, POS tagging, etc. The range of functions and options can be found on their webpage: https://www.nltk.org/

Let's use one of their sentence tokenizers, one of their word tokenizers, and one of their lemmatizers. We'll start with sentences.

In [None]:
def get_nltk_sentences(row):
    # sent_tokenize already returns a list of sentence strings, so no extra list is needed
    return sent_tokenize(row)

Let's apply this to the dataframe and save the result in a new column 'nltksentence'

In [None]:
df['nltksentence'] = df['txt'].apply(get_nltk_sentences)

Now let's write a function for lemmas. 

In [None]:

def get_nltk_lemmas(row):
    tk = WordPunctTokenizer() 
    lemmatizer = WordNetLemmatizer()
    tokens = tk.tokenize(row)
    lemmas = [lemmatizer.lemmatize(l) for l in tokens]
    return lemmas

Let's apply this on the dataframe and save in a new column 'nltklemma'

In [None]:
df['nltklemma'] = df['txt'].apply(get_nltk_lemmas)

Let's do the same to get the nltk POS tags and save in a new column 'nltkpos'. We will pass the list of lemmas to the nltk function for POS tagging.

In [None]:
def get_nltk_pos(row):
    tags = nltk.pos_tag(row)
    return [t[1] for t in tags]

In [None]:
df['nltkpos'] = df['nltklemma'].apply(get_nltk_pos)

In [None]:
df.head()

You can now take a look at some examples.

In [None]:
df['nltkpos'][0]

We can also combine these into tuples, to see what POS tag spaCy and nltk have assigned to each lemma in their models. Have a look at some examples from both models.

In [None]:
## remember how many documents there are in the dataset? You can choose any id within this range, starting with 0.
example_id = 43

## the models have been saved with the prefix 'nltk' or 'spacy' - try looking at some examples from both
model = 'spacy'


lemma_example = df[model+'lemma'][example_id]
pos_example = df[model+'pos'][example_id]
list(zip(lemma_example, pos_example))
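
As a small illustration of the pairing step, the sketch below zips two hypothetical lists and counts how often each (lemma, POS) pair occurs. The lists are made up and are not output from either model.

```python
from collections import Counter

# Hypothetical lemma and POS lists for one short document
lemmas = ["the", "patient", "report", "pain", "and", "the", "pain", "persist"]
pos_tags = ["DET", "NOUN", "VERB", "NOUN", "CCONJ", "DET", "NOUN", "VERB"]

# Pair each lemma with the tag at the same position
pairs = list(zip(lemmas, pos_tags))

# Count repeated (lemma, tag) combinations
pair_counts = Counter(pairs)
print(pair_counts.most_common(3))
```

Note that `zip` stops at the end of the shorter list, so it only gives sensible pairs when both lists come from the same tokenization.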

Now you have some basic preprocessed text data from two off-the-shelf NLP packages!

Let's look at some basic descriptives - the most common lemmas for each model.

We'll create a new representation just to count all the lemmas, and save in two new variables: 'counts_nltk' and 'counts_spacy'.

Each entry in this new representation is a single lemma occurrence (token) from one of the texts.

In [None]:
counts_nltk = pd.Series([item for sublist in df.nltklemma for item in sublist])
counts_spacy = pd.Series([item for sublist in df.spacylemma for item in sublist])
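
The nested list comprehension flattens a column of lists into one long Series. A minimal sketch on a hypothetical two-document dataframe (the column name 'lemma' and its contents are assumptions) shows the effect, along with pandas' `explode` as one possible alternative:

```python
import pandas as pd

# Hypothetical dataframe: each cell holds the list of lemmas for one document
toy = pd.DataFrame({"lemma": [["patient", "be", "well"], ["no", "pain"]]})

# Same pattern as above: loop over documents, then over lemmas in each document
flat = pd.Series([item for sublist in toy.lemma for item in sublist])

# Alternative: explode gives one row per list element
same = toy["lemma"].explode().reset_index(drop=True)

print(flat.tolist())
```

Both approaches give one entry per lemma occurrence, so the length of the result is the total number of tokens in the corpus.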

How many lemmas in total are there for each of these models?

In [None]:
len(counts_nltk)

In [None]:
len(counts_spacy)

Why might these numbers differ?
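
As a hint, tokenizers disagree about where one token ends and the next begins. The sketch below uses only the standard library: the regular expression is the pattern that nltk's WordPunctTokenizer is built on, while plain whitespace splitting stands in for a different tokenization strategy. spaCy's tokenizer uses its own rules and is not reproduced here; the sentence is made up.

```python
import re

text = "The patient's b.p. was 120/80."  # hypothetical sentence

# Strategy 1: split on whitespace only
whitespace_tokens = text.split()

# Strategy 2: runs of word characters or runs of punctuation
# (the pattern used by nltk's WordPunctTokenizer)
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(wordpunct_tokens), wordpunct_tokens)
```

The same sentence yields 5 tokens one way and 13 the other, so total counts from two packages can differ even on identical text.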

Let's create a frequency list for each model, counting how often each unique lemma (type) occurs.

In [None]:
counts_nltk = counts_nltk.groupby(counts_nltk).size().rename_axis('nltk_lemma').reset_index(name='count')
counts_spacy = counts_spacy.groupby(counts_spacy).size().rename_axis('spacy_lemma').reset_index(name='count')
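
To make the groupby idiom easier to follow, here is a sketch on a short hypothetical Series, next to `collections.Counter` (imported earlier) as an alternative way to get the same counts. The lemma values are made up.

```python
from collections import Counter

import pandas as pd

# Hypothetical flat list of lemma occurrences
lemmas = pd.Series(["patient", "be", "patient", "pain", "be", "patient"])

# Group the Series by its own values and count the size of each group
freq = lemmas.groupby(lemmas).size().rename_axis("lemma").reset_index(name="count")
print(freq)

# The two most frequent lemmas, using Counter instead
top = Counter(lemmas).most_common(2)
print(top)
```

The groupby version returns a dataframe that plugs straight into seaborn; Counter is often enough when you only need a quick look at the top items.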

Take a look at the new frequency lists.

In [None]:
counts_spacy.head()

In [None]:
counts_nltk.head()

Let's sort these by frequency and look at the similarities and differences between the models.

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
pd_df_nltk = counts_nltk.sort_values(['count'], ascending=False)
pd_df_spacy = counts_spacy.sort_values(['count'], ascending=False)
fig, ax =plt.subplots(1,2)
sns.barplot(x='nltk_lemma', y='count', data=pd_df_nltk[:10], ax=ax[0], palette='colorblind')
sns.barplot(x='spacy_lemma', y='count', data=pd_df_spacy[:10], ax=ax[1], palette='dark')
fig.autofmt_xdate()
fig.show()

What observations do you have? 

We might want to remove punctuation and very common words, so that we can see which content words are most frequent.

There are many ways to filter out words that carry little content. The most common baseline approach is to use a predefined list of stopwords and to keep only tokens made up of alphabetic characters.

!NOTE! This step is probably worth spending some time on in your own use-cases!

We'll use a baseline approach here: remove words in the nltk stopword list, keep only purely alphabetic tokens, and lowercase everything.

In [None]:
# build the stopword set once; calling stopwords.words() for every token is very slow
stop_set = set(stopwords.words('english'))
counts_nltk = pd.Series([item.lower() for sublist in df.nltklemma for item in sublist if item.lower() not in stop_set and item.isalpha()])
counts_spacy = pd.Series([item.lower() for sublist in df.spacylemma for item in sublist if item.lower() not in stop_set and item.isalpha()])
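
The filter can be tried out on a tiny example. In this sketch the stopword set is a hypothetical four-word stand-in for the nltk list, and the documents are made up.

```python
# Hypothetical stand-in for the nltk English stopword list
stop_set = {"the", "be", "of", "and"}

# Hypothetical lemma lists for two documents
docs = [
    ["The", "patient", "be", "45", "year", "-", "old"],
    ["History", "of", "pain", "and", "nausea"],
]

# Keep lowercased tokens that are purely alphabetic and not stopwords
content = [
    tok.lower()
    for doc in docs
    for tok in doc
    if tok.isalpha() and tok.lower() not in stop_set
]

print(content)
```

Numbers and punctuation are dropped by `isalpha()`, which is convenient here but would also discard tokens such as dosages or hyphenated terms; this is one reason the step deserves attention in your own use-cases.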

In [None]:
counts_nltk = counts_nltk.groupby(counts_nltk).size().rename_axis('nltk_lemma').reset_index(name='count')
counts_spacy = counts_spacy.groupby(counts_spacy).size().rename_axis('spacy_lemma').reset_index(name='count')

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
pd_df_nltk = counts_nltk.sort_values(['count'], ascending=False)
pd_df_spacy = counts_spacy.sort_values(['count'], ascending=False)
fig, ax =plt.subplots(1,2)
sns.barplot(x='nltk_lemma', y='count', data=pd_df_nltk[:20], ax=ax[0], palette='colorblind')
sns.barplot(x='spacy_lemma', y='count', data=pd_df_spacy[:20], ax=ax[1], palette='dark')
fig.autofmt_xdate()
fig.show()

What do you observe now? Do you notice differences between the two NLP packages? Which one seems to work better, and why might the differences arise?

Let's also look at POS tag differences.

In [None]:
counts_nltk = pd.Series([item for sublist in df.nltkpos for item in sublist])
counts_spacy = pd.Series([item for sublist in df.spacypos for item in sublist])

In [None]:
counts_nltk = counts_nltk.groupby(counts_nltk).size().rename_axis('nltk_pos').reset_index(name='count')
counts_spacy = counts_spacy.groupby(counts_spacy).size().rename_axis('spacy_pos').reset_index(name='count')

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
pd_df_nltk = counts_nltk.sort_values(['count'], ascending=False)
pd_df_spacy = counts_spacy.sort_values(['count'], ascending=False)
fig, ax =plt.subplots(1,2)
sns.barplot(x='nltk_pos', y='count', data=pd_df_nltk[:10], ax=ax[0], palette='colorblind')
sns.barplot(x='spacy_pos', y='count', data=pd_df_spacy[:10], ax=ax[1], palette='dark')
fig.autofmt_xdate()
fig.show()
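
When reading these two plots, it may help to remember that nltk.pos_tag returns Penn Treebank tags by default, whereas spaCy's pos_ attribute holds coarse Universal POS tags, so the tag names do not line up one-to-one. The sketch below groups a few Penn Treebank tags into coarse categories using a partial, hand-written mapping. The mapping and the tag list are assumptions for illustration only, not an official conversion table.

```python
from collections import Counter

# Partial, hand-written mapping (an assumption, not an official table)
PTB_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET", "IN": "ADP",
}

# Hypothetical Penn Treebank tags for one short sentence
nltk_tags = ["DT", "NN", "VBD", "JJ", "NNS", "IN", "NNP"]

# Tags missing from the mapping fall back to "X"
coarse = Counter(PTB_TO_COARSE.get(tag, "X") for tag in nltk_tags)
print(coarse.most_common())
```

A rough grouping like this makes the two distributions easier to compare, although a real conversion needs more care (for example, auxiliaries and subordinating conjunctions).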

What observations do you make? How can you use this information to inform your future NLP development for new use-cases?

## Ngrams

Another common representation in NLP tasks is the 'ngram': a sequence of n consecutive words in a document. Ngrams can be very useful for capturing common phrases and expressions.
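
To make the idea concrete, here is a minimal plain-Python sketch that builds ngrams from a list of tokens. The helper name `ngrams` and the example sentence are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace
tokens = "the patient was seen in clinic".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A document with k tokens has k - n + 1 ngrams of size n, so longer ngrams are fewer and each individual one tends to be rarer.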

scikit-learn's CountVectorizer can generate ngram counts, which we can use to look at some different frequency distributions. We'll write a function that returns the most common ngrams.

In [None]:
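def get_top_n_ngram(corpus, n=None, ngram_range=1, sw=None):
    # Count ngrams of exactly `ngram_range` words; `sw` is passed on as CountVectorizer's stop_words
    vec = CountVectorizer(ngram_range=(ngram_range, ngram_range), stop_words=sw).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum over all documents to get one total count per ngram
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Sort by count, most frequent first, and keep the top n
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]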
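
To get a feel for what is being counted, the sketch below approximates the same counting with the standard library only. It mirrors two CountVectorizer defaults (lowercasing, and tokens of at least two word characters) but ignores stopword handling; the function name `top_ngrams_plain` and the toy corpus are assumptions for illustration.

```python
import re
from collections import Counter

# Mirrors CountVectorizer's default token pattern: words of two or more characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def top_ngrams_plain(corpus, n=None, size=1):
    """Count ngrams of `size` words within each document and return the top n."""
    counts = Counter()
    for text in corpus:
        tokens = TOKEN_PATTERN.findall(text.lower())
        counts.update(
            " ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
        )
    return counts.most_common(n)

# Hypothetical two-document corpus
toy_corpus = ["The patient was seen today.", "The patient was discharged."]

top_bigrams = top_ngrams_plain(toy_corpus, n=2, size=2)
print(top_bigrams)
```

Ngrams are built within each document, never across document boundaries, and single-character tokens such as "a" are dropped by the token pattern.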


Let's use this function to get the most common uni-, bi-, and trigrams in our corpus, and look at these.

Note that you can pass any ngram range you want to this function.

Also: note that you can choose to include or exclude stopwords for generating these ngrams.

In [None]:
# use a separate name here: assigning to `stopwords` would overwrite the nltk stopwords corpus imported earlier
sw_setting = None
#sw_setting = 'english'

common_words_unigram = get_top_n_ngram(df['txt'], 20, ngram_range=1, sw=sw_setting)
common_words_bigram = get_top_n_ngram(df['txt'], 20, ngram_range=2, sw=sw_setting)
common_words_trigram = get_top_n_ngram(df['txt'], 20, ngram_range=3, sw=sw_setting)

df1 = pd.DataFrame(common_words_unigram, columns = ['unigram' , 'count'])
df2 = pd.DataFrame(common_words_bigram, columns = ['bigram' , 'count'])
df3 = pd.DataFrame(common_words_trigram, columns = ['trigram' , 'count'])

In [None]:
df1.groupby('unigram').sum()['count'].sort_values(ascending=False)
df2.groupby('bigram').sum()['count'].sort_values(ascending=False)
df3.groupby('trigram').sum()['count'].sort_values(ascending=False)

In [None]:
fig, ax =plt.subplots(1,3)
sns.barplot(x='unigram', y='count', data=df1[:10], ax=ax[0], palette='colorblind')
sns.barplot(x='bigram', y='count', data=df2[:10], ax=ax[1], palette='colorblind')
sns.barplot(x='trigram', y='count', data=df3[:10], ax=ax[2], palette='colorblind')
fig.autofmt_xdate()
fig.show()

Try excluding stopwords by uncommenting the 'english' setting above, and run again. How do the most common ngrams change?