
# Natural Language Processing with Deep Learning

After understanding the process:

**Preprocessing**:

- Tokenization
- Normalization
- Punctuation
- ...

And after **Feature Extraction**:

- Bag-of-Words
- ...


We will **TRAIN** a classifier with the _features_ **and** the _class_.

# NLTK

In [None]:
import nltk

## Dataset

But first, we need a dataset.

We already know `nltk`. There are a few corpora already included, so let's use them for this example.

In [None]:
# Import the movie_reviews submodule which provides Movie Reviews
from nltk.corpus import movie_reviews

In [None]:
# just like with the stopwords, we need to download this corpus
import nltk
nltk.download('movie_reviews')

In [None]:
# Two possible classes:
print(movie_reviews.categories())

In [None]:
# getting all fileids for a class
movie_reviews.fileids('neg')[:10]

In [None]:
# getting raw review text
movie_reviews.raw('neg/cv000_29416.txt')

In [None]:
# getting review text, already split for us
movie_reviews.words('neg/cv000_29416.txt')

In [None]:
# Read all 'neg' reviews
neg_files = movie_reviews.fileids(categories=["neg"])

neg_reviews = [movie_reviews.raw(fileids=fileid) for fileid in neg_files]

In [None]:
# Same for 'pos' reviews
pos_files = movie_reviews.fileids(categories=["pos"])

pos_reviews = [movie_reviews.raw(fileids=fileid) for fileid in pos_files]

In [None]:
# Check sizes:
print(f"#neg: {len(neg_reviews)}")
print(f"#pos: {len(pos_reviews)}")

In [None]:
# Neg Example:
print(f"Neg:\n {neg_reviews[42]}")

In [None]:
# Pos Example:
print(f"Pos:\n {pos_reviews[2]}")

## Tokenization

We can see from the examples that everything is already lowercase, so we don't have to handle casing.

Punctuation is also already separated from the words, so no further normalization is needed.
We can see this, for example, here:

> ( it's basically a complete re-shoot of the shop around the corner , only adding a few modern twists ) .

There are spaces before and after the brackets, the comma and the period at the end.




So, we do the following:

1. Split on whitespace
2. Remove stopwords and punctuation
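
The two steps above can be sketched on the quoted sentence without downloading anything. The tiny stopword set below is a hand-picked assumption for illustration only; it is not the NLTK stopword list used later.

```python
from string import punctuation

# Sentence quoted above; punctuation is already separated by spaces.
sentence = (
    "( it's basically a complete re-shoot of the shop around the corner , "
    "only adding a few modern twists ) ."
)

# Hypothetical mini stopword list (NOT the NLTK list), plus all punctuation marks.
demo_stopwords = {"a", "of", "the", "it's", "only"} | set(punctuation)

# 1. Split on whitespace
tokens = sentence.split()

# 2. Remove stopwords and punctuation
kept = [token for token in tokens if token not in demo_stopwords]

print(kept)
```

Because every punctuation mark is its own token in this corpus, a plain membership test against the set is enough to drop it.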

In [None]:
example = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# List of all tokens:
print(example.split())

In [None]:
# for the tokenization, we will need to download the tokenizer first
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:
example = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Now with nltk.word_tokenize:
print(nltk.word_tokenize(example))

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
# First, we define the constants
from string import punctuation
from nltk.corpus import stopwords

# Use stopwords from NLTK, create a set for faster comparison
STOPWORDS = set(stopwords.words("english"))

# Add punctuation to stopword set (.union() for UNION)
STOPWORDS = STOPWORDS.union(set(punctuation))

# Add custom stopwords that we know appear in the text
STOPWORDS.add('--')

In [None]:
# Define a function for easier access.
def tokenization(text):
  # Return a list of the "important" tokens.
  # This is a list comprehension.
  #   1. nltk.word_tokenize(text) is called
  #   2. Similar to a for loop, the result from 1. is iterated
  #   3. The token is added to the final list if it's not in the stopword set
  #   4. The list is returned
  return [token for token in nltk.word_tokenize(text) if token not in STOPWORDS]

In [None]:
# this function is equivalent to the one above
# it's easier to read due to not using a list comprehension
def tokenization_verbose(text):
  tokenized = nltk.word_tokenize(text)
  out = []
  for token in tokenized:
    if token not in STOPWORDS:
      out.append(token)
  return out

In [None]:
# Example:
tokenization(neg_reviews[0])[:10]

In [None]:
tokenization(neg_reviews[0]) == tokenization_verbose(neg_reviews[0])

## Feature Extraction

We use a simple Bag-of-Words approach here. First, we need to build the full vocabulary so that each word can serve as a feature. We do that by reading the _full_ corpus (
    <strong style='color: #FF6666'>Don't do this later; this is just an example!</strong>
)

In [None]:
vocabulary = set(
    token for text in neg_reviews + pos_reviews for token in tokenization(text)
)

print(f"Size of Vocabulary: {len(vocabulary):_}")

# We need the vocabulary as a list
vocabulary = sorted(vocabulary)

In [None]:
vocabulary[12345]

Our final list of features **for each** text will be a list of 46'289 items: a list of `True` and `False` values, where `True` indicates that the word at this position in the vocabulary list occurs in the text.

Example: `vocabulary[12345]` is the word `"donald"`.
If a text contains exactly this word `"donald"`, its feature list will be `True` *at position* `12345`.
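
On a toy scale, the idea looks like this. The vocabulary and the tokens are made up for illustration; only the principle (one boolean per vocabulary position) is taken from the text.

```python
# Hypothetical toy vocabulary (sorted, like the real one) and a tokenized text.
toy_vocabulary = sorted({"bad", "donald", "film", "great", "plot"})
toy_tokens = {"donald", "likes", "great", "film"}  # a set gives fast membership tests

# One boolean per vocabulary entry: does that word occur in the text?
toy_features = [word in toy_tokens for word in toy_vocabulary]

position = toy_vocabulary.index("donald")
print(toy_vocabulary)
print(toy_features)
print(position, toy_features[position])
```

Words that are not in the vocabulary (here `"likes"`) simply leave no trace in the feature list.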

```python
neg_feature_list = [
    [
        vocab_item in tokenization(text)
        for vocab_item in vocabulary
    ]
    for text in neg_reviews
]
```

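<strong style='color: #FF6666'>Abort! This takes far too long and needs far too much memory.</strong>

It has to store roughly 50'000 boolean values for each of the 2'000 texts, which is about 100 million values in total, and it tokenizes every text again for each vocabulary entry. This is too much.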

Is there a strategy to reduce the number of features?

We could use just 100 words instead of _all of them_. But which ones?

- Random ones (does this make sense?)
- 100 most common words
- ???

Let's focus on the second option, the 100 most common words. But how do we count the number of appearances?

We could do this ourselves with a good old-fashioned dictionary. But there is a module for that.
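
For comparison, a minimal sketch of the manual dictionary approach might look like this (the token list is a toy example):

```python
tokens = ["a", "b", "a", "c", "a", "b", "d"]

counts = {}
for token in tokens:
    # .get() falls back to 0 for tokens we have not seen yet
    counts[token] = counts.get(token, 0) + 1

print(counts)
```

It works, but the bookkeeping has to be written by hand every time.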

In [None]:
# Counter counts whatever we put in:
from collections import Counter

print("Example:\n", Counter(["a", "b", "a", "c", "a", "b", "d"]))

A `Counter` is similar to a dictionary: `c['a']` returns the number of appearances of the string `'a'`. Unlike a plain dictionary, it returns 0 for a key that never appeared instead of raising a `KeyError`.

But wait, there's more:

`.most_common(num)`: Returns the `num` most frequent tokens as a list of `(token, count)` tuples
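
A short sketch of these properties on toy data. It also uses `Counter.update()`, which is not mentioned above but is part of the standard library and adds further counts to an existing `Counter`:

```python
from collections import Counter

counter = Counter(["a", "b", "a", "c", "a", "b", "d"])

print(counter["a"])            # count of a token that was seen
print(counter["z"])            # unseen token: 0, no KeyError
print(counter.most_common(2))  # list of (token, count) tuples

# update() adds the counts of more tokens to the existing ones
counter.update(["d", "d", "d"])
print(counter.most_common(1))
```

Feeding token lists in one after another with `update()` is one possible way to count across many texts.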

#### TASK 1.3
Count all tokens in all of the texts.

In [None]:
### IMPLEMENT YOUR SOLUTION HERE ###
full_count = Counter(

)

In [None]:
# Test the solution and let's see the most common word/token:
print(full_count.most_common(10))

# The output should be:
# [("'s", 18128), ('``', 17625), ('film', 9443), ("n't", 6217), ('movie', 5671), ('one', 5582), ('like', 3547), ('even', 2556), ('good', 2316), ('time', 2282)]

In [None]:
# Each entry is a tuple: [0] is the token itself, [1] is the number of appearances

# Now, we can create the vocabulary out of the 100 most frequent words:

vocabulary = sorted(token for token, _ in full_count.most_common(100))

print(vocabulary[:50])

The **Feature Extraction** is now finished. But the classifier needs a combination of **FEATURES** and the **CLASS**.

We call that the _training data_.

It is usually a list of tuples: (features, class), (features, class), ...

## Classifier

Next, we need a classifier that can work with our data.

We will use `NLTK`'s version first and the one from `scikit-learn` later. Unfortunately, `NLTK`'s classifier needs a specific input format:
the features of each text must be a dictionary.

In [None]:
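def extract_features(text_tokens):
    # A set makes each membership test fast (a list would be scanned every time)
    token_set = set(text_tokens)
    feature = {}
    for word in vocabulary:
        # True if the vocabulary word occurs in the text, False otherwise
        feature[f"contains({word})"] = word in token_set
    return feature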

In [None]:
all_neg_features = [extract_features(tokenization(text)) for text in neg_reviews]

all_pos_features = [extract_features(tokenization(text)) for text in pos_reviews]

In [None]:
# Example:
all_neg_features[0]

In [None]:
# To train, we need to attach the LABEL to each feature set:

training_data = []  # final list to contain training data tuples

# First, we add the features for the 'neg' class:
training_data.extend([(feature, "neg") for feature in all_neg_features])

# Then, we add the features for the 'pos' class:
training_data.extend([(feature, "pos") for feature in all_pos_features])

In [None]:
training_data[0]

`NLTK`'s Naive Bayes classifier can now be trained with this information:

In [None]:
from nltk import NaiveBayesClassifier

In [None]:
# We train by calling the .train() method with the training data we just created

nb = NaiveBayesClassifier.train(training_data)

### Testing

How can we test/run the classifier now?

For a new text, we need to repeat the same steps as above:

1. Tokenization
2. Feature Extraction
3. Classification

In [None]:
?nb.classify

#### TASK 1.4
Write a function to test the classification.
1. Tokenize
2. Extract features
3. Classification with nb.classify()
4. Return Classification

In [None]:
### IMPLEMENT YOUR SOLUTION HERE ###
def test_classify(example):

    return prediction



In [None]:
example = "I really hate this movie. It is the worst movie that I have ever seen."

output = test_classify(example)

print(f"The classifier predicts: {output}")

In [None]:
example = "I liked the perfect character performance so much that I watched this great film almost a hundred times back to back and fell fully in love with the well-designed plot."

output = test_classify(example)

print(f"The classifier predicts: {output}")

In [None]:
nb.show_most_informative_features(10)

From this view, we can see which *features* contributed to which *class*.

Note that this is derived purely (!) from the training data; no world knowledge or semantics is involved.

From the first line, we can see that, according to the training data, a text that _contains_ the word "bad" is 2 times more likely to belong to the _negative_ class. The combination of these probabilities leads to the output.

This also works with **negative** evidence: if the text **DOES NOT** contain the word "bad" (`contains(bad) = False`), it is additionally 1.5 times more likely to belong to the _positive_ class.
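
To make the ratios concrete, here is a small worked sketch. The counts are hypothetical and were chosen only so that the ratios come out as 2.0 and 1.5; they are not the real corpus numbers. The sketch also ignores the smoothing that `NLTK` applies to its probability estimates.

```python
# Hypothetical counts, chosen only to illustrate the ratios (NOT the real corpus numbers).
n_neg, n_pos = 1000, 1000   # number of reviews per class
neg_with_bad = 500          # 'neg' reviews that contain "bad"
pos_with_bad = 250          # 'pos' reviews that contain "bad"

# P(contains(bad) = True | class)
p_true_neg = neg_with_bad / n_neg
p_true_pos = pos_with_bad / n_pos

# P(contains(bad) = False | class)
p_false_neg = 1 - p_true_neg
p_false_pos = 1 - p_true_pos

print(f"contains(bad) = True   neg : pos = {p_true_neg / p_true_pos:.1f} : 1.0")
print(f"contains(bad) = False  pos : neg = {p_false_pos / p_false_neg:.1f} : 1.0")
```

Both the presence and the absence of a word shift the balance, just in opposite directions and by different amounts.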

## Future Work:

Instead of using a simple boolean indicator, we could also use the number of appearances.
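
A minimal sketch of such count-based features, assuming a made-up mini vocabulary and a hypothetical helper name `extract_count_features` (neither is part of the notebook so far):

```python
from collections import Counter

# Hypothetical mini vocabulary and tokenized text, for illustration only.
toy_vocabulary = ["bad", "film", "good"]
toy_tokens = ["good", "film", "good", "plot"]

def extract_count_features(text_tokens, vocab):
    counts = Counter(text_tokens)
    # Counter returns 0 for words that do not occur in the text
    return {f"count({word})": counts[word] for word in vocab}

count_features = extract_count_features(toy_tokens, toy_vocabulary)
print(count_features)
```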

We could introduce other features, for example the length of the full text or the average word length. We could also go through the vocabulary and remove more self-defined stopwords (e.g. "would" is not on the list). It might also make sense to use not the top 100 words but the words ranked 101 to 200, or even further down.
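
As a rough sketch, two of these extra features could be computed as follows. The helper name `extract_extra_features` and the feature keys are assumptions for illustration:

```python
def extract_extra_features(text_tokens):
    n_tokens = len(text_tokens)
    total_chars = sum(len(token) for token in text_tokens)
    return {
        "num_tokens": n_tokens,
        # guard against empty texts to avoid a division by zero
        "avg_word_length": total_chars / n_tokens if n_tokens else 0.0,
    }

extra = extract_extra_features(["great", "film", "bad", "plot"])
print(extra)
```

Such a dictionary could be merged with the `contains(...)` features of the same text before training.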

We'll get to that later.