### Grading

The final grade for the NLP subject is the mean of 2 grades: the exam grade and the laboratory grade.

The maximum number of laboratory points is 10. Points are obtained in the following activities:
- exercises during class
- homework that consists of the exercises not finished during class

Exercises finished during class are worth double the points.

By the end of the semester, each student will have to present their homework (the teacher will ask random questions about the submitted exercises).

### NLTK

NLTK (Natural Language Toolkit) is a specialized library for natural language processing. It consists of modules with widely used NLP algorithms as well as tools for using data from many corpora.

A corpus is a collection of texts in electronic format, either with raw content (the original text, without annotations) or with various annotations (POS, syntactic, semantic).

In [None]:
#Install NLTK
!pip install nltk



In [None]:
#Import the nltk module
import nltk

In [4]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### Text preprocessing

Tokenization is the process of breaking raw texts into sentences or words.

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize

print(sent_tokenize("I have two dogs and a cat. Do you have pets too? My cat likes to chase mice. My dogs like to chase my cat."))

print(word_tokenize("I have two dogs and a cat. Do you have pets too? My cat likes to chase mice. My dogs like to chase my cat."))

['I have two dogs and a cat.', 'Do you have pets too?', 'My cat likes to chase mice.', 'My dogs like to chase my cat.']
['I', 'have', 'two', 'dogs', 'and', 'a', 'cat', '.', 'Do', 'you', 'have', 'pets', 'too', '?', 'My', 'cat', 'likes', 'to', 'chase', 'mice', '.', 'My', 'dogs', 'like', 'to', 'chase', 'my', 'cat', '.']


We may also want to ignore the distinction between "This" (capitalized because it begins a sentence) and "this". Applying lower() to the whole text removes that distinction, but it also lowercases proper nouns, so they can no longer be told apart from common words.
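One possible compromise, sketched below, is to lowercase only the first token of each sentence. The helper name `lowercase_sentence_starts` and the sample sentences are illustrative assumptions, not part of NLTK. The heuristic keeps proper nouns that occur inside a sentence, but it still lowercases a proper noun that happens to start one.

```python
def lowercase_sentence_starts(sentences):
    """Lowercase only the first token of each tokenized sentence."""
    normalized = []
    for tokens in sentences:
        if tokens:
            # Copy the list so the caller's data is left untouched
            tokens = [tokens[0].lower()] + tokens[1:]
        normalized.append(tokens)
    return normalized

# Hypothetical input: sentences that are already split into tokens
sentences = [["This", "is", "Paris", "."], ["My", "cat", "likes", "Anna", "."]]
result = lowercase_sentence_starts(sentences)
print(result)
```

With real text, the input could be built as `[word_tokenize(s) for s in sent_tokenize(text)]`.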

### Removing stopwords

Stopwords are very common words that carry little information about the topic or meaning of a text (pronouns, prepositions, etc.).
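As a minimal sketch of the filtering step itself, the example below removes stopwords and punctuation from a token list. The set `toy_stopwords` is a small hand-picked stand-in (an assumption made so the example runs without any download); in practice it would be replaced by `set(stopwords.words('english'))`.

```python
# Hand-picked stand-in for the real stopword list (illustrative assumption)
toy_stopwords = {"i", "have", "and", "a", "my", "to", "do", "you", "too"}

tokens = ["i", "have", "two", "dogs", "and", "a", "cat", "."]

# Keep alphabetic tokens that are not stopwords; isalpha() also drops punctuation
content_tokens = [t for t in tokens if t.isalpha() and t not in toy_stopwords]
print(content_tokens)
```

Storing the stopwords in a set rather than a list keeps each membership test cheap, which matters when filtering a whole book.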

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords
print(len(stopwords.words('english')))
stopwords.words('english')[0:30]

198


['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't"]

### Stemming

Suppose we want to count how many times the action of running occurs in a text. The verb can appear in different forms: run, ran, running, runs, etc. Stemming reduces each form to a root by stripping affixes, so that related forms can be counted together; irregular forms such as "ran" are usually left unchanged. There are multiple stemming algorithms. Three examples are:

In [None]:
# Porter stemmer
ps = nltk.PorterStemmer()
print(ps.stem("running"))
print(ps.stem("runs"))
print(ps.stem("ran"))
print(ps.stem("are"))
print(ps.stem("darling"))

run
run
ran
are
darl


In [None]:
# Lancaster stemmer - not recommended as it often results in overstemming
ls = nltk.LancasterStemmer()
print(ls.stem("running"))
print(ls.stem("runs"))
print(ls.stem("ran"))
print(ls.stem("are"))
print(ls.stem("darling"))

run
run
ran
ar
darl


In [None]:
# Snowball stemmer (also known as Porter2)
snb = nltk.SnowballStemmer("english")
print(snb.stem("running"))
print(snb.stem("runs"))
print(snb.stem("ran"))
print(snb.stem("are"))
print(snb.stem("darling"))

run
run
ran
are
darl
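To connect stemming back to the counting task, the sketch below groups tokens by their stem and counts each group. The function `count_by_stem` accepts any stemming function; `toy_stem` is a deliberately tiny suffix stripper written only for this illustration (an assumption, not a real stemming algorithm), so the example runs without NLTK.

```python
from collections import Counter

def toy_stem(word):
    """Very rough stand-in for a stemmer: strips 'ning' or a final 's'."""
    for suffix in ("ning", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def count_by_stem(tokens, stem):
    """Count tokens after mapping each one to its stem."""
    return Counter(stem(token) for token in tokens)

tokens = "he runs while she ran and they are running to run".split()
counts = count_by_stem(tokens, toy_stem)
print(counts["run"], counts["ran"])
```

Passing `nltk.PorterStemmer().stem` instead of `toy_stem` gives the same grouping for these forms, since Porter also maps "running" and "runs" to "run" and leaves "ran" unchanged.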


### Lemmatization

Lemmatization returns the dictionary form of a word (its canonical form, or lemma). We will use WordNetLemmatizer, which requires WordNet, a lexical database of semantic relations between words.

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem import WordNetLemmatizer
lem=WordNetLemmatizer()
print(lem.lemmatize("running"))
print(lem.lemmatize("runs"))
print(lem.lemmatize("ran"))
print(lem.lemmatize("are"))
print(lem.lemmatize("darling"))

running
run
ran
are
darling


Note that lemmatization is slower than stemming, because it performs morphological analysis and looks words up in a dictionary.

By default the lemmatizer treats every word as a noun, which is why "running" and "are" were left unchanged above. If you specify the part of speech of the given word, the results improve greatly:

In [None]:
print(lem.lemmatize("running", pos="v"))
print(lem.lemmatize("ran", pos="v"))
print(lem.lemmatize("are", pos="v"))

run
run
be
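In running text the part of speech usually comes from a tagger rather than being typed by hand. The sketch below assumes Penn Treebank tags, such as those returned by `nltk.pos_tag`, and maps them to the single-letter codes accepted by the `pos` argument. The helper name `treebank_to_wordnet` is an illustrative assumption.

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# Hypothetical (word, tag) pairs in the shape a tagger would return
tagged = [("dogs", "NNS"), ("are", "VBP"), ("running", "VBG"), ("quickly", "RB")]
mapped = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Each pair can then be passed on as `lem.lemmatize(word, pos=letter)`. Producing the tags with `nltk.pos_tag` additionally requires the tagger model to be downloaded.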


### Numeral conversion or removal

Sometimes we want to remove all numerals because they give no information about, for example, the category of the text. At other times we need the numerals in order to programmatically understand the text and save its information in some form of knowledge representation.

In [None]:
!pip install word2number
!pip install num2words

Collecting word2number
  Downloading word2number-1.1.zip (9.7 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: word2number
  Building wheel for word2number (setup.py) ... done
  Created wheel for word2number: filename=word2number-1.1-py3-none-any.whl size=5568 sha256=b76f60d654d0f8427131e0739b76e4bd63fb601d05fb91384f6a61a3bb977ae8
  Stored in directory: /root/.cache/pip/wheels/cd/ef/ae/073b491b14d25e2efafcffca9e16b2ee6d114ec5c643ba4f06
Successfully built word2number
Installing collected packages: word2number
Successfully installed word2number-1.1
Collecting num2words
  Downloading num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting docopt>=0.6.2 (from num2words)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Downloading num2words-0.5.14-py3-none-any.whl (163 kB)

In [None]:
from word2number import w2n
print(w2n.word_to_num("eleven"))
print(w2n.word_to_num("twenty three"))

from num2words import num2words
print(num2words(12))
print(num2words(101))
print(num2words(2025))

11
23
twelve
one hundred and one
two thousand and twenty-five
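The removal case needs no extra package. The sketch below deletes digit-only tokens with a regular expression and then collapses the leftover spaces. The sample sentence is an illustrative assumption, and numerals written out as words ("two") are not touched.

```python
import re

text = "In 2025 we adopted 2 dogs and 1 cat."

# Delete standalone runs of digits, then collapse the doubled spaces left behind
no_numerals = re.sub(r"\b\d+\b", "", text)
no_numerals = re.sub(r"\s+", " ", no_numerals).strip()
print(no_numerals)
```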


### Bigrams

In [1]:
import nltk

In [2]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

In [6]:
text = "I have two dogs and a cat. Do you have pets too? My cat likes to chase mice. My dogs like to chase my cat.".lower()
tokens =  word_tokenize(text)
bigrams_finder = BigramCollocationFinder.from_words(tokens)
bigram_measures = BigramAssocMeasures() # a collection of bigram association measures (i.e. scoring functions)

# score_ngrams returns a sequence of (ngram, score) pairs ordered from highest to lowest score based on the scoring function provided
bigram_scores = bigrams_finder.score_ngrams(bigram_measures.likelihood_ratio)
print(len(bigram_scores), "bigrams found")

# The likelihood ratio tells you whether two words appearing together is more likely due to a real connection between them
# rather than just happening by chance.
n_best = 10
print(f"first {n_best} bigrams:")
bigram_scores[:n_best]

25 bigrams found
first 10 bigrams:


[(('to', 'chase'), 14.555378652741947),
 (('and', 'a'), 8.699705569404067),
 (('do', 'you'), 8.699705569404067),
 (('pets', 'too'), 8.699705569404067),
 (('too', '?'), 8.699705569404067),
 (('cat', '.'), 6.994150686613864),
 (('my', 'cat'), 6.994150686613864),
 (('chase', 'mice'), 5.927116847164285),
 (('dogs', 'and'), 5.927116847164285),
 (('dogs', 'like'), 5.927116847164285)]

In [7]:
bigram_scores

[(('to', 'chase'), 14.555378652741947),
 (('and', 'a'), 8.699705569404067),
 (('do', 'you'), 8.699705569404067),
 (('pets', 'too'), 8.699705569404067),
 (('too', '?'), 8.699705569404067),
 (('cat', '.'), 6.994150686613864),
 (('my', 'cat'), 6.994150686613864),
 (('chase', 'mice'), 5.927116847164285),
 (('dogs', 'and'), 5.927116847164285),
 (('dogs', 'like'), 5.927116847164285),
 (('have', 'pets'), 5.927116847164285),
 (('have', 'two'), 5.927116847164285),
 (('i', 'have'), 5.927116847164285),
 (('like', 'to'), 5.927116847164285),
 (('likes', 'to'), 5.927116847164285),
 (('two', 'dogs'), 5.927116847164285),
 (('you', 'have'), 5.927116847164285),
 (('.', 'do'), 4.880620559635201),
 (('?', 'my'), 4.880620559635201),
 (('a', 'cat'), 4.880620559635201),
 (('cat', 'likes'), 4.880620559635201),
 (('mice', '.'), 4.880620559635201),
 (('chase', 'my'), 2.2590649092660344),
 (('my', 'dogs'), 2.2590649092660344),
 (('.', 'my'), 1.3695320221449936)]

In [None]:
# to score bigrams by their frequency use raw_freq
bigrams_finder.nbest(bigram_measures.raw_freq, n_best)

[('cat', '.'),
 ('my', 'cat'),
 ('to', 'chase'),
 ('.', 'do'),
 ('.', 'my'),
 ('?', 'my'),
 ('a', 'cat'),
 ('and', 'a'),
 ('cat', 'likes'),
 ('chase', 'mice')]

In [None]:
# You can apply different filters. For example, ignore all bigrams which occur less than 2 times.
bigrams_finder.apply_freq_filter(2)
bigrams_finder.nbest(bigram_measures.raw_freq, n_best)

[('cat', '.'), ('my', 'cat'), ('to', 'chase')]
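The counts behind the frequency filter can be reproduced with the standard library alone, which may help in seeing what the finder stores. This is a sketch under one assumption: the regular expression below stands in for `word_tokenize`, and it happens to give the same tokens for this particular sentence.

```python
import re
from collections import Counter

text = "I have two dogs and a cat. Do you have pets too? My cat likes to chase mice. My dogs like to chase my cat.".lower()

# Words and single punctuation marks become separate tokens
tokens = re.findall(r"\w+|[^\w\s]", text)

# Pair each token with its successor and count the pairs
bigram_counts = Counter(zip(tokens, tokens[1:]))
frequent = {bigram: n for bigram, n in bigram_counts.items() if n >= 2}

print(len(bigram_counts), "distinct bigrams")
print(frequent)
```

The number of distinct bigrams and the three bigrams that occur at least twice agree with the NLTK output above.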

### Exercises

For these exercises you must use some raw text. Choose a (short) book from https://www.gutenberg.org. Each exercise is worth 0.1 points. Exercises finished during the laboratory are worth 0.2 points.

1. Download it through Python (inside the code, so you don't have to upload the file as well when you send the solution for this exercise) with urlopen() from the urllib.request module and read the entire text into a single string. If the download takes too long on every run, download the file once and read it locally, but leave the download instructions in a comment (to show that you know how to access an online file).

2. Remove the header (keep only the text starting from the title).

3. Print the number of sentences in the text. Print the average length (number of words) of a sentence.

4. Find the collocations in the text (bigrams and trigrams) that appear at least 5 times. Print them only once, not each time they appear.

5. Create a list of all the words (in lower case) from the text, without the punctuation.

6. Print the first N most frequent words (alphanumeric strings) together with their number of appearances. You can use FreqDist from nltk.

7. Remove stopwords and assign the result to variable lws.

8. Apply stemming (Porter) on the list of words (lws). Print the first 200 words. Do you see any words that don't appear in the dictionary?

9. Print a table of three columns (of size N, where N is the maximum length of the words in the text). The columns will be separated by the character "|". The head of the table will be: "Porter    |Lancaster |Snowball". The table will contain only the words for which the three stemmers do not all give the same result (for example, suppose that we have both "runs" and "being" in the text. The word "runs" should not appear in the table, as all three results are "run"; however, "being" should appear). The result of each stemmer will appear in the column named in the head of the table. The table will cover the first NW words from the text (the number of rows will obviously be less than NW, as not all words match the requirements). For example, NW=500. Try to print only distinct results in the table (for example, if a word occurs twice in the text and matches the requirements for appearing in the table, it should have only one corresponding row).

10. Print a table of two columns, similar to the one above, that compares the results of stemming and lemmatization. The head of the table will contain the values "Snowball" and "WordNetLemmatizer". The table must contain only words that give different results under stemming and lemmatization (for example, the word "running"). The table will cover the first NW words from the text (the number of rows will obviously be less than NW, as not all words match the requirements). For example, NW=500. Try to print only distinct results in the table (for example, if a word occurs twice in the text and matches the requirements for appearing in the table, it should have only one corresponding row).

11. Print the first N most frequent lemmas (after the removal of stopwords) together with their number of appearances.

12. Change all the numbers from lws into words. Print the number of changes, and also the portion of the list that contains the first N changes (for example N=10).

13. Create a function that receives an integer N and a word W as parameters (it can also receive the list of words from the text). We want to print the concordance data for that word. This means printing each window of text (words on consecutive positions) of length N that has the given word W in the middle. For example, for the text "I have two dogs and a cat. Do you have pets too? My cat likes to chase mice. My dogs like to chase my cat." and a window of length 3, the concordance data for the word "cat" would be ["dogs", "cat", "pets"] and ["pets","cat", "likes"] (we consider the text without stopwords and punctuation).

14. In the previous exercise, the window of text may contain words from different sentences. Create a second function that prints windows of text that contain words only from the sentence containing word W. We want to print concordance data for all the inflections of word W.

In [None]:
import re
from urllib.request import urlopen

url = "https://www.gutenberg.org/cache/epub/75518/pg75518.txt"
with urlopen(url) as response:
    text_content = response.read().decode("utf-8")
print(text_content[:500])  # preview only; the full book is long

# The text is already decoded, so we match the real characters,
# not the escaped byte sequences that appear in a bytes repr
text_content = text_content.replace("\r\n", " ")
text_content = text_content.replace("\u2019", "'")           # ’ -> '
text_content = re.sub("[\u201c\u201d]", '"', text_content)   # “ and ” -> "
text_content = text_content.replace("\u2014", " \u2014 ")    # pad the long dash with spaces
text_content = text_content.replace("\u00e7", "c")           # ç -> c
text_content = text_content.lower()

In [None]:
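# Exercise 2: drop everything before the first occurrence of the title.
# Note: find() returns the first match, which may lie inside the
# Project Gutenberg header if the header also names the title.
title_marker = "the seeking"
start_index = text_content.find(title_marker)
if start_index != -1:
    text_content = text_content[start_index:]
print(text_content[:500])  # preview only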

In [None]:
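sentences = sent_tokenize(text_content)
tokens = word_tokenize(text_content)
word_tokens = [t for t in tokens if t.isalnum()]  # leave out punctuation tokens
print(len(sentences))
print(len(word_tokens) / len(sentences))  # average number of words per sentence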


In [None]:
tokens = word_tokenize(text_content)

bigrams = nltk.bigrams(tokens)
trigrams = nltk.trigrams(tokens)

bgram_dist = nltk.FreqDist(bigrams)
trigram_dist = nltk.FreqDist(trigrams)

# FreqDist keys are unique, so each n-gram is printed only once
for k, v in bgram_dist.items():
    if v >= 5:  # "at least 5 times", not exactly 5
        print(k, v)

for l, m in trigram_dist.items():
    if m >= 5:
        print(l, m)

In [None]:
words = re.findall(r"\b[a-zA-Z0-9]+\b", text_content.lower()) 
print(words[:10])

In [None]:
most_freq = nltk.FreqDist(words)
stop = 3
# most_common() sorts by count; items() would follow insertion order instead
for word, count in most_freq.most_common(stop):
    print(word, count)

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

lws = [word for word in words if word not in stop_words]
words = lws  # the cells below keep using the name `words`
most_freq = nltk.FreqDist(lws)
stop = 3
for word, count in most_freq.most_common(stop):
    print(word, count)

In [None]:
# Exercise 8 asks for the Porter stemmer and the first 200 stems
porter = nltk.PorterStemmer()
stemm = [porter.stem(word) for word in words]
print(stemm[:200])

In [None]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

NW = 500
distinct_words = {}

# dict.fromkeys keeps one entry per word, in order of first appearance
for word in dict.fromkeys(words[:NW]):
    stems = (porter.stem(word), lancaster.stem(word), snowball.stem(word))
    if len(set(stems)) > 1:  # at least two stemmers disagree
        distinct_words[word] = stems

# Column width: the longest word in the text, but never narrower than the widest header
max_word_length = max(len("Lancaster"), max((len(word) for word in words), default=0))

print("|".join(f"{name:<{max_word_length}}" for name in ("Porter", "Lancaster", "Snowball")))
for stems in distinct_words.values():
    print("|".join(f"{stem:<{max_word_length}}" for stem in stems))

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer

snowball_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
NW = 500

# One row per distinct (stem, lemma) pair among the first NW words,
# kept only when stemming and lemmatization disagree
comparison_table = []
seen = set()
for word in words[:NW]:
    stemmed = snowball_stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word)
    if stemmed != lemmatized and (stemmed, lemmatized) not in seen:
        seen.add((stemmed, lemmatized))
        comparison_table.append((stemmed, lemmatized))

# Column width: the longest entry, but never narrower than the widest header
max_word_length = max([len("WordNetLemmatizer")] + [len(item) for row in comparison_table for item in row])

print(f"{'Snowball':<{max_word_length}} | {'WordNetLemmatizer':<{max_word_length}}")
for stem, lemma in comparison_table:
    print(f"{stem:<{max_word_length}} | {lemma:<{max_word_length}}")

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
N = 10
words = [word for word in words if word.lower() not in stop_words]
lemmas = [lemmatizer.lemmatize(word) for word in words]
freqWords = nltk.FreqDist(lemmas).most_common()
print(freqWords[:N])

In [None]:
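from num2words import num2words

N = 10
changes = []    # (original token, spelled-out form)
converted = []  # the word list with every number written out
for word in words:
    if word.isdigit():
        spelled = num2words(int(word))
        changes.append((word, spelled))
        converted.append(spelled)
    else:
        converted.append(word)
print(len(changes), "changes")
print(changes[:N])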

Exercises 13 and 14

In [None]:
from nltk.stem import PorterStemmer

def concordance(N: int, W: str, text):
  '''
    Prints and returns every window of N consecutive words from text that has an
    inflection of W in the middle. Words are matched by their Porter stem.
    Stopwords are not removed here; pass an already filtered list if needed.
  '''
  stemmer = PorterStemmer()
  target = stemmer.stem(W)
  indexes = [index for (index, word) in enumerate(text) if stemmer.stem(word) == target]
  if not indexes:
    print(f"'{W}' not found in the text")
    return []
  lists = []
  for c, target_index in enumerate(indexes, start=1):
    start = max(target_index - N // 2, 0)
    finish = min(target_index + N // 2 + 1, len(text))
    window = text[start:finish]
    lists.append(window)
    print(f"Concordance {c}: {' '.join(window)}")
  return lists

print(concordance(3, "word", "Trying to find word in words, but the word repeats".split()))