<a href="https://colab.research.google.com/github/Showcas/NLP/blob/main/01_3_NLTK_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with Deep Learning

In [20]:
# we use this for progress bars
from tqdm.auto import tqdm

In [21]:
# Use the module by importing
import nltk

# Also, we use now a different corpus
from nltk.corpus import brown

In [22]:
# downloading all resources
nltk.download('brown')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [23]:
# Which categories are there?
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']



We will train a classifier that will be able to distinguish 3 (pseudo-randomly) chosen classes:

`science_fiction`, `news`, and `religion`

In [24]:
LABELS = {'fiction', 'news', 'religion'}

# We will create a list of tuples: (<tokens>, <label>)

full_corpus = [
    (brown.words(fileids=[fileid]), label)
    for label in LABELS
    for fileid in brown.fileids(categories=[label])
]

In [25]:
# Count label distribution:

from collections import Counter

lbl_dist = Counter(label for _, label in full_corpus)

print(lbl_dist)

Counter({'news': 44, 'fiction': 29, 'religion': 17})


We see from the **unbalanced** dataset that using (normal) _accuracy_ is not possible and we could not trust the result.

$ N = 44 + 29 + 17 = 90 $

$ P(\text{news}) = \frac{44}{90} = 0.49 $

$ P(\text{fiction}) = \frac{29}{90} = 0.32 $

$ P(\text{religion}) = \frac{17}{90} = 0.19 $

So, by always choosing `news`, the accuracy will automatically be close to 50%.

## Preprocessing

We need a function that does the preprocessing for us. The text is already tokenized, so we do not need to this.

We will do the 2 steps:

1. Lowercase all tokens
2. Remove stopwords (and punctuation)

In [26]:
from string import punctuation
from nltk.corpus import stopwords


STOPWORDS = set(stopwords.words('english'))
STOPWORDS = STOPWORDS.union(set(punctuation))
STOPWORDS = STOPWORDS.union(set(["''", "--", "``"]))


def lowercase_and_filter(tokens):
    return [
        t
        for token in tokens
        if (t := token.lower()) not in STOPWORDS
    ]

In [27]:
# example
print("Unprocessed\n", full_corpus[0][0][:20])

print("Preprocessed\n", lowercase_and_filter(full_corpus[0][0][:20]))

Unprocessed
 ['Thirty-three', 'Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.', 'His', 'parents', 'talked', 'seriously', 'and', 'lengthily', 'to', 'their', 'own', 'doctor', 'and']
Preprocessed
 ['thirty-three', 'scotty', 'go', 'back', 'school', 'parents', 'talked', 'seriously', 'lengthily', 'doctor']


In [29]:
# We have not many words, so to reduce the vocabulary, we **stem** the tokens additionally.

from nltk.stem import SnowballStemmer

STEMMER = SnowballStemmer(language='english')

# example:

STEMMER.stem('investigation'), STEMMER.stem('election'), STEMMER.stem('produced')

('investig', 'elect', 'produc')

In [30]:
def stem(tokens):
    return [STEMMER.stem(token) for token in tokens]

In [31]:
# example
print("Unprocessed \n", full_corpus[0][0][:20], end="\n\n")

print("Preprocessed \n", lowercase_and_filter(full_corpus[0][0][:20]), end="\n\n")

print("Stemmed \n", stem(lowercase_and_filter(full_corpus[0][0][:20])), end="\n\n")

Unprocessed 
 ['Thirty-three', 'Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.', 'His', 'parents', 'talked', 'seriously', 'and', 'lengthily', 'to', 'their', 'own', 'doctor', 'and']

Preprocessed 
 ['thirty-three', 'scotty', 'go', 'back', 'school', 'parents', 'talked', 'seriously', 'lengthily', 'doctor']

Stemmed 
 ['thirty-thre', 'scotti', 'go', 'back', 'school', 'parent', 'talk', 'serious', 'lengthili', 'doctor']



Some words like `said` is not reduced, so we could also use **Lemmatization**.

But we know from the lecture, that Lemmatization needs _POS-Tags_ for good results, so we need to to pos-tagging first.

1. POS-Tagging
2. Lemmatization
3. Stopword Removal
4. Stemming (?)

In [41]:
# Example of POS-Tagging:
nltk.pos_tag(full_corpus[0][0][:20])

[('Thirty-three', 'JJ'),
 ('Scotty', 'NNP'),
 ('did', 'VBD'),
 ('not', 'RB'),
 ('go', 'VB'),
 ('back', 'RB'),
 ('to', 'TO'),
 ('school', 'NN'),
 ('.', '.'),
 ('His', 'PRP$'),
 ('parents', 'NNS'),
 ('talked', 'VBD'),
 ('seriously', 'RB'),
 ('and', 'CC'),
 ('lengthily', 'RB'),
 ('to', 'TO'),
 ('their', 'PRP$'),
 ('own', 'JJ'),
 ('doctor', 'NN'),
 ('and', 'CC')]

In [43]:
nltk.pos_tag(nltk.word_tokenize("Don't treat me badly"))

[('Do', 'VBP'), ("n't", 'RB'), ('treat', 'VB'), ('me', 'PRP'), ('badly', 'RB')]

In [49]:
from nltk import WordNetLemmatizer

LEMMA = WordNetLemmatizer()

# example:

# Without explizit POS tag (default is NOUN)
print('Incorrect:', LEMMA.lemmatize('said'))

# With correct POS tag:
print('Correct:', LEMMA.lemmatize('said', pos='v'))

Incorrect: said
Correct: say


Unfortunately, the `WordNetLemmatizer` needs a specific form for the pos tag, so we have to convert the tag to a compatible format.

#### TASK 1.5
Implement a function to convert the position tag.
1. Utilize wordnet.ADJ, wordnet.VERB and wordnet.ADV
2. Tags starting with J -> wordnet.ADJ
3. Tags starting with V -> wordnet.VERB
4. Tags starting with R -> wordnet.ADV
5. Tags starting with N -> wordnet.NOUN
6. Tags starting with S -> wordnet.ADJ_SAT
7. All other Tags should be defaulted to wordnet.NOUN.
8. Return the converted tag.

*Hint*: You can look up all the possible Tags using: `nltk.help.upenn_tagset()`

In [58]:
from nltk.corpus import wordnet

### IMPLEMENT YOUR SOLUTION HERE ###

def convert_pos_tag(tag):
  if tag.startswith("J"):
    return wordnet.ADJ
  elif tag.startswith("V"):
    return wordnet.VERB
  elif tag.startswith("R"):
    return wordnet.ADV
  elif tag.startswith("N"):
    return wordnet.NOUN
  elif tag.startswith("S"):
    return wordnet.ADJ_SAT
  else:
    return wordnet.NOUN

In [59]:
# example:
[
    LEMMA.lemmatize(
        token, pos=convert_pos_tag(tag)
    )
    for token, tag in nltk.pos_tag(full_corpus[0][0][:20])
]

['Thirty-three',
 'Scotty',
 'do',
 'not',
 'go',
 'back',
 'to',
 'school',
 '.',
 'His',
 'parent',
 'talk',
 'seriously',
 'and',
 'lengthily',
 'to',
 'their',
 'own',
 'doctor',
 'and']

## Combining

We need to combine now all of the methods to do the preprocessing in this order:

1. POS-Tagging
2. Lemmatization
3. Stopword Removal
4. Stemming (?)

In [60]:
def preprocess(tokens):

    # 1. POS-Tagging
    with_pos = nltk.pos_tag(tokens)

    # 2.1 Conversion of pos tags for lemmatizer
    with_converted_pos = [(token, convert_pos_tag(tag)) for token, tag in with_pos]

    # 2.2 Lemmatize
    lemmatized_tokens = [LEMMA.lemmatize(token, pos=tag) for token, tag in with_converted_pos]

    # 3.1 Lowercase everything
    lowercase_tokens = [token.lower() for token in lemmatized_tokens]

    # 3.2 Remove stopwords/unwanted punctuation
    filtered_tokens = [token for token in lowercase_tokens if token not in STOPWORDS]

    # 4. Stemming
    stemmed_tokens = [STEMMER.stem(token) for token in filtered_tokens]

    # Done.
    return stemmed_tokens

In [61]:
# Now, we do the heavy-lifting (most time will be spend in the Lemmatizer—it's slow.)
# (We will use tqdm to see the progress)

preprocessed = [
    (preprocess(tokens), label)
    for tokens, label in tqdm(full_corpus, total=len(full_corpus), desc='Preprocessing')
]

Preprocessing:   0%|          | 0/90 [00:00<?, ?it/s]

In [62]:
# example
preprocessed[0][0][:20]

['thirty-thre',
 'scotti',
 'go',
 'back',
 'school',
 'parent',
 'talk',
 'serious',
 'lengthili',
 'doctor',
 'specialist',
 'univers',
 'hospit',
 'mr.',
 'mckinley',
 'entitl',
 'discount',
 'member',
 'famili',
 'decid']

## Feature Extraction

For feature extraction, we use now the **count** of each of the vocabulary word. The vocabulary will be the 100 most common tokens (=lowercased stemmed lemmata without stopwords).



In [63]:
from collections import Counter


VOCABULARY = sorted(
    token for token, _ in Counter(token for tokens, _ in preprocessed for token in tokens).most_common(100)
)

print(len(VOCABULARY))

print(VOCABULARY)

100
['also', 'anoth', 'around', 'ask', 'back', 'becom', 'begin', 'big', 'call', 'church', 'citi', 'come', 'could', 'day', 'even', 'face', 'find', 'first', 'four', 'get', 'give', 'go', 'god', 'good', 'great', 'hand', 'head', 'hear', 'high', 'hold', 'home', 'hous', 'john', 'know', 'last', 'leav', 'life', 'like', 'littl', 'live', 'long', 'look', 'make', 'man', 'mani', 'may', 'mean', 'meet', 'member', 'men', 'mr.', 'mrs.', 'much', 'must', 'nation', 'need', 'never', 'new', 'night', 'old', 'one', 'open', 'peopl', 'person', 'place', 'plan', 'play', 'presid', 'right', 'room', 'run', 'say', 'school', 'see', 'seem', 'sinc', 'stand', 'state', 'still', 'take', 'tell', 'thing', 'think', 'three', 'time', 'turn', 'two', 'u', 'univers', 'use', 'want', 'way', 'week', 'well', 'white', 'without', 'work', 'world', 'would', 'year']


In [64]:
def feature_set(tokens):

    features = {}

    token_count = Counter(tokens)

    for vocab_token in VOCABULARY:
        features[f"amount({vocab_token})"] = token_count[vocab_token]

    # features['text_length'] = len(tokens)
    # features['average_token_length'] = sum(len(token) for token in tokens) / len(tokens)

    return features

In [65]:
training_data = [
    (feature_set(tokens), label) for tokens, label in preprocessed
]

## Classification

In [66]:
from nltk import NaiveBayesClassifier

nb = NaiveBayesClassifier.train(training_data)

## Evaluation

Now, we want to measure how well the classifier can distinguish the classes.


But we don't have a data set for this. We already used the full `training_data` data set for training. We can't _test_ or _evaluate_ the classifier.

What me **MUST** do then, **BEFOREHAND**, is _splitting_ the dataset into **two parts**: The _training_ set and _testing_ set.

The *test* set is emulated to be fully and totally **UNKNOWN** to the classifier, so we are not allowed to use the full vocabulary: Only the one from the train set.

1. **SPLIT** the data set
1. Define a preprocess function
1. Define a feature extraction function
1. Create Vocabulary from **TRAIN** set
1. Apply preprocessing/feature extraction for **TRAIN** set.
1. Train classifier
1. Apply preprocessing/feature extraction for **TEST** set.
1. Classify **TEST** set
1. Evaluate results!


(We already have the preprocess/feature extraction functions.)

In [67]:
# We use random 80% of the data for training
import random

In [68]:
split_index = int(len(full_corpus) * 0.8)

random.seed(20)

shuffled = random.sample(full_corpus, len(full_corpus))

train_set = shuffled[:split_index]
test_set = shuffled[split_index:]

print(f"The train set has {len(train_set)} items, the test set {len(test_set)}")

Counter(label for _, label in train_set), Counter(label for _, label in test_set)

The train set has 72 items, the test set 18


(Counter({'news': 35, 'fiction': 27, 'religion': 10}),
 Counter({'religion': 7, 'news': 9, 'fiction': 2}))

By randomly selecting sets there might be a problem with the label distribution. The random selection does not check if **ALL** labels are in the train and the test set.

We can ensure that by splitting 80% of each label population.

In [69]:
def train_test_split(l, amount=0.8):
    split_index = int(len(l) * amount)

    shuffled = random.sample(l, len(l))

    train_set = shuffled[:split_index]
    test_set = shuffled[split_index:]

    return train_set, test_set

train_set = []
test_set = []

for label in LABELS:
    train_ids, test_ids = train_test_split(brown.fileids(categories=[label]), amount=0.8)
    train_set.extend([
        (brown.words(fileids=[fileid]), label) for fileid in train_ids
    ])
    test_set.extend([
        (brown.words(fileids=[fileid]), label) for fileid in test_ids
    ])


print(f"The train set has {len(train_set)} items, the test set {len(test_set)}")

Counter(label for _, label in train_set), Counter(label for _, label in test_set)

The train set has 71 items, the test set 19


(Counter({'fiction': 23, 'religion': 13, 'news': 35}),
 Counter({'fiction': 6, 'religion': 4, 'news': 9}))

In [70]:
def preprocess(tokens):
    # 1. POS-Tagging
    with_pos = nltk.pos_tag(tokens)

    # 2.1 Conversion of pos tags for lemmatizer
    with_converted_pos = [(token, convert_pos_tag(tag)) for token, tag in with_pos]

    # 2.2 Lemmatize
    lemmatized_tokens = [LEMMA.lemmatize(token, pos=tag) for token, tag in with_converted_pos]

    # 3.1 Lowercase everything
    lowercase_tokens = [token.lower() for token in lemmatized_tokens]

    # 3.2 Remove stopwords/unwanted punctuation
    filtered_tokens = [token for token in lowercase_tokens if token not in STOPWORDS]

    # 4. Stemming
    stemmed_tokens = [STEMMER.stem(token) for token in filtered_tokens]

    # Done.
    return stemmed_tokens

In [71]:
# We can do the same things: preprocess, feature extraction, classification

train_set_preprocessed = [
    (preprocess(tokens), label) for tokens, label in tqdm(train_set, total=len(train_set), desc='Preprocessing')
]

Preprocessing:   0%|          | 0/71 [00:00<?, ?it/s]

In [72]:
TRAIN_VOCABULARY = sorted(
    token for token, _ in Counter(token for tokens, _ in train_set_preprocessed for token in tokens).most_common(100)
)

print(len(TRAIN_VOCABULARY))

print(TRAIN_VOCABULARY)

100
['also', 'anoth', 'ask', 'back', 'becom', 'begin', 'big', 'call', 'car', 'christian', 'church', 'citi', 'come', 'could', 'day', 'even', 'face', 'find', 'first', 'get', 'give', 'go', 'god', 'good', 'great', 'hand', 'head', 'hear', 'high', 'hold', 'home', 'hous', 'john', 'know', 'last', 'leav', 'life', 'like', 'littl', 'live', 'long', 'look', 'make', 'man', 'mani', 'may', 'mean', 'meet', 'member', 'men', 'mr.', 'mrs.', 'much', 'must', 'nation', 'need', 'never', 'new', 'night', 'number', 'old', 'one', 'open', 'peopl', 'person', 'place', 'plan', 'play', 'presid', 'report', 'room', 'say', 'school', 'see', 'seem', 'sinc', 'spirit', 'stand', 'state', 'still', 'take', 'tell', 'thing', 'think', 'three', 'time', 'turn', 'two', 'univers', 'use', 'want', 'way', 'week', 'well', 'white', 'without', 'work', 'world', 'would', 'year']


In [73]:
def feature_extraction(tokens):
    features = {}

    token_count = Counter(tokens)

    for vocab_token in VOCABULARY:
        features[f"amount({vocab_token})"] = token_count[vocab_token]

    return features

In [74]:
train_data = [
    (feature_extraction(tokens), label) for tokens, label in train_set_preprocessed
]

In [75]:
nb = NaiveBayesClassifier.train(train_data)

In [76]:
test_data = [
    (
        feature_extraction(
            preprocess(
                tokens
            )
        ),
        label
    ) for tokens, label in tqdm(test_set, desc='Preprocessing/Feature Extraction')
]

Preprocessing/Feature Extraction:   0%|          | 0/19 [00:00<?, ?it/s]

In [77]:
# We can get the accuracy directly with NLTK:

nltk.classify.accuracy(nb, test_data)

0.8947368421052632

In [78]:
# BUT we know, that this is skewed, so we need Precision/Recall/F1

In [79]:
predictions = nb.classify_many([features for features, _ in test_data])

In [80]:
gold = [label for _, label in test_data]

In [81]:
# Micro Average

tp, fp, fn = 0, 0, 0

for predicted, correct in zip(predictions, gold):
    for label in LABELS:
        if correct == label:
            if predicted == label:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == label:
                fp += 1
            # We don't care about TN for precision/recall

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_fscore = (2 * micro_precision * micro_recall) / (micro_precision + micro_recall)

print(f"""
Micro-Precision: {micro_precision:.4f}
Micro-Recall   : {micro_recall:.4f}
Micro-FScore   : {micro_fscore:.4f}
""")


Micro-Precision: 0.8947
Micro-Recall   : 0.8947
Micro-FScore   : 0.8947



In [82]:
# Macro Average

precisions, recalls, fscores = {}, {}, {} # as dictionary so, we store it by _label_

for label in LABELS:
    tp, fp, fn = 0, 0, 0
    for predicted, correct in zip(predictions, gold):
        if correct == label:
            if predicted == label:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == label:
                fp += 1
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (2 * p * r) / (p + r)

    precisions[label] = p
    recalls[label] = r
    fscores[label] = f


print(f"Precision per Label:")
print('\n'.join(['\t' + f'{label:<10}: {value:.2f}' for label, value in precisions.items()]))
print()

print(f"Recall per Label:")
print('\n'.join(['\t' + f'{label:<10}: {value:.2f}' for label, value in recalls.items()]))
print()

print(f"F-Score per Label:")
print('\n'.join(['\t' + f'{label:<10}: {value:.2f}' for label, value in fscores.items()]))
print()

macro_precision = sum(precisions.values()) / len(precisions)
macro_recall = sum(recalls.values()) / len(recalls)
macro_fscore = sum(fscores.values()) / len(fscores)


print(f"""
Macro-Precision: {macro_precision:.4f}
Macro-Recall   : {macro_recall:.4f}
Macro-FScore   : {macro_fscore:.4f}
""")

Precision per Label:
	fiction   : 1.00
	religion  : 0.75
	news      : 0.89

Recall per Label:
	fiction   : 1.00
	religion  : 0.75
	news      : 0.89

F-Score per Label:
	fiction   : 1.00
	religion  : 0.75
	news      : 0.89


Macro-Precision: 0.8796
Macro-Recall   : 0.8796
Macro-FScore   : 0.8796

