<a href="https://colab.research.google.com/github/Showcas/NLP/blob/main/01_3_NLTK_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with Deep Learning

In [None]:
# we use this for progress bars
from tqdm.auto import tqdm

In [None]:
# Use the module by importing
import nltk

# We now use a different corpus
from nltk.corpus import brown

In [None]:
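# Download all required resources
# (Note: newer NLTK releases, 3.9 and later, may additionally ask for
# 'punkt_tab' and 'averaged_perceptron_tagger_eng'.)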
nltk.download('brown')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

In [None]:
# Which categories are there?
print(brown.categories())


We will train a classifier that can distinguish three (pseudo-randomly) chosen classes:

`fiction`, `news`, and `religion`

In [None]:
# A list (not a set) keeps the iteration order identical across runs
LABELS = ['fiction', 'news', 'religion']

# We will create a list of tuples: (<tokens>, <label>)

full_corpus = [
    (brown.words(fileids=[fileid]), label)
    for label in LABELS
    for fileid in brown.fileids(categories=[label])
]

In [None]:
# Count label distribution:

from collections import Counter

lbl_dist = Counter(label for _, label in full_corpus)

print(lbl_dist)

We see that the dataset is **unbalanced**, so plain _accuracy_ is a misleading metric here: a classifier can score well simply by favoring the largest class.

$ N = 44 + 29 + 17 = 90 $

$ P(\text{news}) = \frac{44}{90} = 0.49 $

$ P(\text{fiction}) = \frac{29}{90} = 0.32 $

$ P(\text{religion}) = \frac{17}{90} = 0.19 $

So, by always choosing `news`, the accuracy will automatically be close to 50%.
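
The majority-class baseline can be checked with a few lines of code. This is a minimal sketch that hard-codes the label counts from the distribution above; the variable names are chosen for illustration only.

```python
from collections import Counter

# Document counts per label, taken from the label distribution above.
label_counts = Counter({"news": 44, "fiction": 29, "religion": 17})

total = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]

# A classifier that always predicts the majority label is correct
# exactly as often as that label occurs in the data.
baseline_accuracy = majority_count / total

print(f"Always predicting {majority_label!r}: accuracy = {baseline_accuracy:.2f}")
```

Any classifier we train should beat this baseline before we consider it useful.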

## Preprocessing

We need a function that does the preprocessing for us. The text is already tokenized, so we do not need to do this ourselves.

We will do these two steps:

1. Lowercase all tokens
2. Remove stopwords (and punctuation)

In [None]:
from string import punctuation
from nltk.corpus import stopwords


STOPWORDS = set(stopwords.words('english'))
STOPWORDS = STOPWORDS.union(set(punctuation))
STOPWORDS = STOPWORDS.union(set(["''", "--", "``"]))


def lowercase_and_filter(tokens):
    return [
        t
        for token in tokens
        if (t := token.lower()) not in STOPWORDS
    ]

In [None]:
# example
print("Unprocessed\n", full_corpus[0][0][:20])

print("Preprocessed\n", lowercase_and_filter(full_corpus[0][0][:20]))

In [None]:
# We do not have many documents, so to reduce the vocabulary we additionally stem the tokens.

from nltk.stem import SnowballStemmer

STEMMER = SnowballStemmer(language='english')

# example:

STEMMER.stem('investigation'), STEMMER.stem('election'), STEMMER.stem('produced')

In [None]:
def stem(tokens):
    return [STEMMER.stem(token) for token in tokens]

In [None]:
# example
print("Unprocessed \n", full_corpus[0][0][:20], end="\n\n")

print("Preprocessed \n", lowercase_and_filter(full_corpus[0][0][:20]), end="\n\n")

print("Stemmed \n", stem(lowercase_and_filter(full_corpus[0][0][:20])), end="\n\n")

Some words like `said` are not reduced, so we could also use **Lemmatization**.

But we know from the lecture that Lemmatization needs _POS-Tags_ for good results, so we need to do POS-tagging first.

1. POS-Tagging
2. Lemmatization
3. Stopword Removal
4. Stemming (?)

In [None]:
# Example of POS-Tagging:

nltk.pos_tag(full_corpus[0][0][:20])

In [None]:
nltk.pos_tag(nltk.word_tokenize("Don't treat me badly"))

In [None]:
from nltk import WordNetLemmatizer

LEMMA = WordNetLemmatizer()

# example:

# Without explicit POS tag (default is NOUN)
print('Incorrect:', LEMMA.lemmatize('said'))

# With correct POS tag:
print('Correct:', LEMMA.lemmatize('said', pos='v'))

Unfortunately, the `WordNetLemmatizer` needs a specific form for the pos tag, so we have to convert the tag to a compatible format.

#### TASK 1.5
Implement a function to convert the POS (part-of-speech) tag.
1. Utilize wordnet.ADJ, wordnet.VERB, wordnet.ADV, wordnet.NOUN and wordnet.ADJ_SAT
2. Tags starting with J -> wordnet.ADJ
3. Tags starting with V -> wordnet.VERB
4. Tags starting with R -> wordnet.ADV
5. Tags starting with N -> wordnet.NOUN
6. Tags starting with S -> wordnet.ADJ_SAT
7. All other Tags should be defaulted to wordnet.NOUN.
8. Return the converted tag.

*Hint*: You can look up all the possible Tags using: `nltk.help.upenn_tagset()`

In [None]:
from nltk.corpus import wordnet

### IMPLEMENT YOUR SOLUTION HERE ###
# def convert_pos_tag(tag):

    # return converted
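
One possible solution is sketched below; it is not the only way to solve the task. WordNet's POS constants are single-character strings (`wordnet.ADJ` is `'a'`, `wordnet.VERB` is `'v'`, `wordnet.ADV` is `'r'`, `wordnet.NOUN` is `'n'`, `wordnet.ADJ_SAT` is `'s'`). The sketch writes them out as plain strings so that it runs without the corpus download; in the notebook, the `wordnet.*` constants can be used instead. The helper name `_PREFIX_TO_WORDNET` is an assumption made for this illustration.

```python
# WordNet POS constants written out as plain strings (assumption: they
# equal wordnet.ADJ, wordnet.VERB, wordnet.ADV, wordnet.NOUN, wordnet.ADJ_SAT).
ADJ, VERB, ADV, NOUN, ADJ_SAT = "a", "v", "r", "n", "s"

# First letter of the Penn Treebank tag -> WordNet POS (rules 2 to 6 of the task).
_PREFIX_TO_WORDNET = {"J": ADJ, "V": VERB, "R": ADV, "N": NOUN, "S": ADJ_SAT}


def convert_pos_tag(tag):
    """Convert a Penn Treebank tag to a WordNet POS; default to NOUN (rule 7)."""
    return _PREFIX_TO_WORDNET.get(tag[:1], NOUN)


print(convert_pos_tag("VBD"), convert_pos_tag("JJ"), convert_pos_tag("DT"))
```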

In [None]:
# example:
[
    LEMMA.lemmatize(
        token, pos=convert_pos_tag(tag)
    )
    for token, tag in nltk.pos_tag(full_corpus[0][0][:20])
]

## Combining

We now need to combine all of the methods to do the preprocessing in this order:

1. POS-Tagging
2. Lemmatization
3. Stopword Removal
4. Stemming (?)

In [None]:
def preprocess(tokens):

    # 1. POS-Tagging
    with_pos = nltk.pos_tag(tokens)

    # 2.1 Conversion of pos tags for lemmatizer
    with_converted_pos = [(token, convert_pos_tag(tag)) for token, tag in with_pos]

    # 2.2 Lemmatize
    lemmatized_tokens = [LEMMA.lemmatize(token, pos=tag) for token, tag in with_converted_pos]

    # 3.1 Lowercase everything
    lowercase_tokens = [token.lower() for token in lemmatized_tokens]

    # 3.2 Remove stopwords/unwanted punctuation
    filtered_tokens = [token for token in lowercase_tokens if token not in STOPWORDS]

    # 4. Stemming
    stemmed_tokens = [STEMMER.stem(token) for token in filtered_tokens]

    # Done.
    return stemmed_tokens

In [None]:
# Now, we do the heavy lifting (most of the time will be spent in the lemmatizer; it's slow.)
# (We will use tqdm to see the progress)

preprocessed = [
    (preprocess(tokens), label)
    for tokens, label in tqdm(full_corpus, total=len(full_corpus), desc='Preprocessing')
]

In [None]:
# example
preprocessed[0][0][:20]

## Feature Extraction

For feature extraction, we now use the **count** of each vocabulary word. The vocabulary will be the 100 most common tokens (= lowercased, stemmed lemmata without stopwords).
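
To make the idea concrete, here is a minimal sketch with a hypothetical three-word vocabulary and a made-up document (both are illustration-only assumptions, not taken from the corpus).

```python
from collections import Counter

# Hypothetical toy vocabulary and document, for illustration only.
toy_vocabulary = ["god", "said", "state"]
toy_tokens = ["state", "said", "state", "year"]

token_count = Counter(toy_tokens)

# One feature per vocabulary word. Words missing from the document get 0,
# and tokens outside the vocabulary ("year") are ignored.
toy_features = {f"amount({word})": token_count[word] for word in toy_vocabulary}

print(toy_features)
# {'amount(god)': 0, 'amount(said)': 1, 'amount(state)': 2}
```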



In [None]:
from collections import Counter


VOCABULARY = sorted(
    token for token, _ in Counter(token for tokens, _ in preprocessed for token in tokens).most_common(100)
)

print(len(VOCABULARY))

print(VOCABULARY)

In [None]:
def feature_set(tokens):

    features = {}

    token_count = Counter(tokens)

    for vocab_token in VOCABULARY:
        features[f"amount({vocab_token})"] = token_count[vocab_token]

    # features['text_length'] = len(tokens)
    # features['average_token_length'] = sum(len(token) for token in tokens) / len(tokens)

    return features

In [None]:
training_data = [
    (feature_set(tokens), label) for tokens, label in preprocessed
]

## Classification

In [None]:
from nltk import NaiveBayesClassifier

nb = NaiveBayesClassifier.train(training_data)

## Evaluation

Now, we want to measure how well the classifier can distinguish the classes.


But we don't have a data set for this. We already used the full `training_data` data set for training, so we cannot fairly _test_ or _evaluate_ the classifier on it.

What we **MUST** do, **BEFOREHAND**, is _split_ the dataset into **two parts**: the _training_ set and the _test_ set.

The *test* set must be treated as completely **UNKNOWN** to the classifier, so we are not allowed to use the full vocabulary: only the one from the train set.

1. **SPLIT** the data set
1. Define a preprocess function
1. Define a feature extraction function
1. Create Vocabulary from **TRAIN** set
1. Apply preprocessing/feature extraction for **TRAIN** set.
1. Train classifier
1. Apply preprocessing/feature extraction for **TEST** set.
1. Classify **TEST** set
1. Evaluate results!


(We already have the preprocess/feature extraction functions.)
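
Why the vocabulary must come from the train set alone can be seen in a small sketch. The toy documents below are hypothetical and assumed to be already preprocessed; the names `toy_train`, `toy_test`, and `toy_features` are made up for this illustration.

```python
from collections import Counter

# Hypothetical, already preprocessed toy documents.
toy_train = [(["elect", "state", "elect"], "news"), (["god", "church"], "religion")]
toy_test = [(["god", "spaceship"], "religion")]

# The vocabulary is built from the TRAIN documents only.
train_counts = Counter(token for tokens, _ in toy_train for token in tokens)
toy_train_vocabulary = sorted(train_counts)


def toy_features(tokens):
    """Count only the words that were seen during training."""
    counts = Counter(tokens)
    return {f"amount({word})": counts[word] for word in toy_train_vocabulary}


test_features = toy_features(toy_test[0][0])

print(toy_train_vocabulary)
print(test_features)
```

The token `spaceship` occurs only in the test document, so it produces no feature at all. This mirrors the real situation, where a deployed classifier cannot know words it has never seen.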

In [None]:
# We use a random 80% of the data for training
import random

In [None]:
split_index = int(len(full_corpus) * 0.8)

random.seed(20)

shuffled = random.sample(full_corpus, len(full_corpus))

train_set = shuffled[:split_index]
test_set = shuffled[split_index:]

print(f"The train set has {len(train_set)} items, the test set {len(test_set)}")

Counter(label for _, label in train_set), Counter(label for _, label in test_set)

Selecting the sets purely at random can distort the label distribution: nothing guarantees that **ALL** labels appear in both the train and the test set.

We can ensure that by splitting each label's documents separately, taking 80% of each label population for training.

In [None]:
def train_test_split(l, amount=0.8):
    split_index = int(len(l) * amount)

    shuffled = random.sample(l, len(l))

    train_set = shuffled[:split_index]
    test_set = shuffled[split_index:]

    return train_set, test_set

train_set = []
test_set = []

for label in LABELS:
    train_ids, test_ids = train_test_split(brown.fileids(categories=[label]), amount=0.8)
    train_set.extend([
        (brown.words(fileids=[fileid]), label) for fileid in train_ids
    ])
    test_set.extend([
        (brown.words(fileids=[fileid]), label) for fileid in test_ids
    ])


print(f"The train set has {len(train_set)} items, the test set {len(test_set)}")

Counter(label for _, label in train_set), Counter(label for _, label in test_set)

In [None]:
# We can do the same things: preprocess, feature extraction, classification

train_set_preprocessed = [
    (preprocess(tokens), label) for tokens, label in tqdm(train_set, total=len(train_set), desc='Preprocessing')
]

In [None]:
TRAIN_VOCABULARY = sorted(
    token for token, _ in Counter(token for tokens, _ in train_set_preprocessed for token in tokens).most_common(100)
)

print(len(TRAIN_VOCABULARY))

print(TRAIN_VOCABULARY)

In [None]:
def feature_extraction(tokens):
    features = {}

    token_count = Counter(tokens)

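    # Use the vocabulary built from the TRAIN set only, so that no
    # information from the test set leaks into the features.
    for vocab_token in TRAIN_VOCABULARY: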
        features[f"amount({vocab_token})"] = token_count[vocab_token]

    return features

In [None]:
train_data = [
    (feature_extraction(tokens), label) for tokens, label in train_set_preprocessed
]

In [None]:
nb = NaiveBayesClassifier.train(train_data)

In [None]:
test_data = [
    (
        feature_extraction(
            preprocess(
                tokens
            )
        ),
        label
    ) for tokens, label in tqdm(test_set, desc='Preprocessing/Feature Extraction')
]

In [None]:
# We can get the accuracy directly with NLTK:

nltk.classify.accuracy(nb, test_data)

In [None]:
# BUT we know that accuracy is misleading on this unbalanced data, so we also need Precision/Recall/F1

In [None]:
predictions = nb.classify_many([features for features, _ in test_data])

In [None]:
gold = [label for _, label in test_data]

In [None]:
# Micro Average

tp, fp, fn = 0, 0, 0

for predicted, correct in zip(predictions, gold):
    for label in LABELS:
        if correct == label:
            if predicted == label:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == label:
                fp += 1
            # We don't care about TN for precision/recall

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_fscore = (2 * micro_precision * micro_recall) / (micro_precision + micro_recall)

print(f"""
Micro-Precision: {micro_precision:.4f}
Micro-Recall   : {micro_recall:.4f}
Micro-FScore   : {micro_fscore:.4f}
""")

In [None]:
# Macro Average

precisions, recalls, fscores = {}, {}, {}  # dictionaries, keyed by label

for label in LABELS:
    tp, fp, fn = 0, 0, 0
    for predicted, correct in zip(predictions, gold):
        if correct == label:
            if predicted == label:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == label:
                fp += 1
    # Guard against division by zero (e.g. a label that is never predicted);
    # such a score is counted as 0 here.
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f = (2 * p * r) / (p + r) if (p + r) > 0 else 0.0

    precisions[label] = p
    recalls[label] = r
    fscores[label] = f


print(f"Precision per Label:")
print('\n'.join(['\t' + f'{label:<10}: {value:.2f}' for label, value in precisions.items()]))
print()

print(f"Recall per Label:")
print('\n'.join(['\t' + f'{label:<10}: {value:.2f}' for label, value in recalls.items()]))
print()

print(f"F-Score per Label:")
print('\n'.join(['\t' + f'{label:<10}: {value:.2f}' for label, value in fscores.items()]))
print()

macro_precision = sum(precisions.values()) / len(precisions)
macro_recall = sum(recalls.values()) / len(recalls)
macro_fscore = sum(fscores.values()) / len(fscores)


print(f"""
Macro-Precision: {macro_precision:.4f}
Macro-Recall   : {macro_recall:.4f}
Macro-FScore   : {macro_fscore:.4f}
""")