<a href="https://colab.research.google.com/github/Showcas/NLP/blob/main/01_2_NLTK_and_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with Deep Learning

After understanding the process:

**Preprocessing**:

- Tokenization
- Normalization
- Punctuation
- ...

And after **Feature Extraction**:

- Bag-of-Words
- ...


We will **TRAIN** a classifier with the _features_ **and** the _class_.

# NLTK

In [1]:
import nltk

## Dataset

But first, we need a dataset.

We already know `nltk`. There are a few corpora already included, so let's use them for this example.

In [2]:
# Import the movie_reviews submodule which provides Movie Reviews
from nltk.corpus import movie_reviews

In [3]:
# just like with the stopwords, we need to download this corpus
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [4]:
# Two possible classes:
print(movie_reviews.categories())

['neg', 'pos']


In [5]:
# getting all fileids for a class
movie_reviews.fileids('neg')[:10]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt']

In [6]:
# getting raw review text
movie_reviews.raw('neg/cv000_29416.txt')

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

In [7]:
# getting review text, already split for us
movie_reviews.words('neg/cv000_29416.txt')

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

In [32]:
# Read all 'neg' reviews
neg_files = movie_reviews.fileids(categories=["neg"])

neg_reviews = [movie_reviews.raw(fileids=fileid) for fileid in neg_files]

In [33]:
# Same for 'pos' reviews
pos_files = movie_reviews.fileids(categories=["pos"])

pos_reviews = [movie_reviews.raw(fileids=fileid) for fileid in pos_files]

In [34]:
# Check sizes:
print(f"#neg: {len(neg_reviews)}")
print(f"#pos: {len(pos_reviews)}")

#neg: 1000
#pos: 1000


In [35]:
# Neg Example:
print(f"Neg:\n {neg_reviews[42]}")

Neg:
 a pseudo-intellectual film about the pseudo-intellectual world of art magazines , high art is as wasted as its drug-addled protagonists . 
in the only notable part of the movie , ally sheedy and radha mitchell deliver nice performances in the two leading roles , not that lisa cholodenko's script or direction makes you care much about either character . 
living in a world of heroin induced highs , they float along until they fall in love with each other . 
this uninviting picture , full of pretentious minor characters , has a receptionist that reads dostoevski and a woman in the restroom line who is a certified genius , having recently been awarded a prestigious mcarthur grant . 
24-year-old syd ( radha mitchell ) , who has a rather bland , live-in boyfriend , was just promoted to assistant editor at the artistic photography magazine " frame . " 
although the receptionist is impressed , syd is mainly a gofer for her boss until she meets famous photographer lucy berliner ( ally she

In [14]:
# Pos Example:
print(f"Pos:\n {pos_reviews[2]}")

Pos:
 you've got mail works alot better than it deserves to . 
in order to make the film a success , all they had to do was cast two extremely popular and attractive stars , have them share the screen for about two hours and then collect the profits . 
no real acting was involved and there is not an original or inventive bone in it's body ( it's basically a complete re-shoot of the shop around the corner , only adding a few modern twists ) . 
essentially , it goes against and defies all concepts of good contemporary filmmaking . 
it's overly sentimental and at times terribly mushy , not to mention very manipulative . 
but oh , how enjoyable that manipulation is . 
but there must be something other than the casting and manipulation that makes the movie work as well as it does , because i absolutely hated the previous ryan/hanks teaming , sleepless in seattle . 
it couldn't have been the directing , because both films were helmed by the same woman . 
i haven't quite yet figured out what 

## Tokenization

We can see from the examples that everything is already lowercase, so we don't have to deal with case.

Punctuation is also already separated from the words, so no further normalization is needed.
We can see this for example here:

> ( it's basically a complete re-shoot of the shop around the corner , only adding a few modern twists ) .

There are spaces before and after the parentheses, the comma, and the period at the end.




So, we do the following:

1. Split on whitespace
2. Remove stopwords and punctuation
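The two steps can be sketched on the quoted sentence using only the standard library. The stopword set here is a tiny hand-picked one and purely an assumption for illustration; the notebook itself uses NLTK's English stopword list and `nltk.word_tokenize`.

```python
from string import punctuation

# Assumption: a tiny hand-picked stopword set, only for illustration.
# The notebook uses NLTK's English stopword list instead.
TOY_STOPWORDS = {"it's", "a", "of", "the", "only"} | set(punctuation)

sentence = (
    "( it's basically a complete re-shoot of the shop around the corner , "
    "only adding a few modern twists ) ."
)

# Step 1: split on whitespace. Punctuation is already separated by spaces.
tokens = sentence.split()

# Step 2: drop stopwords and punctuation tokens.
kept = [token for token in tokens if token not in TOY_STOPWORDS]

print(tokens[:4])
print(kept)
```

Keeping the stopwords in a `set` makes each `not in` check a constant-time lookup, which matters once this runs over 2,000 reviews.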

In [48]:
example = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# List of all tokens:
print(example.split())

['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK.', 'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']


In [49]:
# for the tokenization, we will need to download the tokenizer first
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [50]:
example = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Now with nltk.word_tokenize:
print(nltk.word_tokenize(example))

['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']


In [51]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [52]:
# First, we define the constants
from string import punctuation
from nltk.corpus import stopwords

# Use stopwords from NLTK, create a set for faster comparison
STOPWORDS = set(stopwords.words("english"))

# Add punctuation to stopword set (.union() for UNION)
STOPWORDS = STOPWORDS.union(set(punctuation))

# Add custom stopwords that we know appear in the text
STOPWORDS.add('--')

In [53]:
# Define a function for easier access.
def tokenization(text):
  # Return a list of the "important" tokens.
  # This is a list comprehension:
  #   1. nltk.word_tokenize(text) is called
  #   2. Similar to a for loop, the result from 1. is iterated
  #   3. A token is added to the final list if it's not in the stopword set
  #   4. The list is returned
  return [token for token in nltk.word_tokenize(text) if token not in STOPWORDS]

In [54]:
# this function is equivalent to the one above
# it's easier to read due to not using a list comprehension
def tokenization_verbose(text):
  tokenized = nltk.word_tokenize(text)
  out = []
  for token in tokenized:
    if token not in STOPWORDS:
      out.append(token)
  return out

In [55]:
# Example:
tokenization(neg_reviews[0])[:10]

['plot',
 'two',
 'teen',
 'couples',
 'go',
 'church',
 'party',
 'drink',
 'drive',
 'get']

In [56]:
tokenization(neg_reviews[0]) == tokenization_verbose(neg_reviews[0])

True

## Feature Extraction

We use a simple Bag-of-Words approach here. First, we need to build the full vocabulary so that each word can serve as a feature. We do that by reading the _full_ corpus (
    <strong style='color: #FF6666'>Don't do this later; this is just an example!</strong>
)
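One plausible reading of this warning is that, in a real experiment, the vocabulary should come only from the training part of the data, so that nothing from the held-out texts leaks into the features. A minimal sketch of that idea, using a hypothetical toy corpus and an arbitrary two-thirds split (both are assumptions, not part of this notebook):

```python
import random

# Hypothetical toy corpus of already tokenized texts with labels.
labeled_texts = [
    (["great", "film", "loved"], "pos"),
    (["bad", "script", "boring"], "neg"),
    (["great", "cast", "fun"], "pos"),
    (["bad", "plot", "dull"], "neg"),
    (["wonderful", "story"], "pos"),
    (["awful", "acting"], "neg"),
]

# Shuffle with a fixed seed so the split is reproducible, then hold out a third.
rng = random.Random(0)
shuffled = labeled_texts[:]
rng.shuffle(shuffled)
split_at = len(shuffled) * 2 // 3
train, test = shuffled[:split_at], shuffled[split_at:]

# Build the vocabulary from the training part only.
train_vocabulary = sorted({token for tokens, _ in train for token in tokens})

print(len(train), len(test))
print(train_vocabulary)
```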

In [27]:
vocabulary = set(
    token for text in neg_reviews + pos_reviews for token in tokenization(text)
)

print(f"Size of Vocabulary: {len(vocabulary):_}")

# We need the vocabulary as a list
vocabulary = sorted(vocabulary)

Size of Vocabulary: 46_289


In [26]:
vocabulary[12345]

'donald'

Our final list of features **for each** text will be a list of 46,289 items: a list of `True` and `False` values, where `True` indicates that the vocabulary word at that position occurs in the text.

Example: `vocabulary[12345]` is the word `"donald"`.
If a text contains exactly this word `"donald"`, its feature list will be `True` *at position* `12345`.
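To make the idea concrete, here is a small sketch with a hypothetical five-word vocabulary and a made-up text (neither is taken from the corpus):

```python
# Hypothetical mini vocabulary, sorted like the real one.
toy_vocabulary = sorted({"bad", "donald", "film", "great", "plot"})

# Tokens of one (already tokenized) toy text.
text_tokens = ["great", "film", "about", "donald"]

# Convert to a set once so each membership test is a fast lookup.
token_set = set(text_tokens)
feature_list = [word in token_set for word in toy_vocabulary]

position = toy_vocabulary.index("donald")
print(toy_vocabulary)
print(feature_list)
print(position, feature_list[position])
```

The word `"about"` is not in the toy vocabulary, so it leaves no trace in the feature list.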

```python
neg_feature_list = [
    [
        vocab_item in tokenization(text)
        for vocab_item in vocabulary
    ]
    for text in neg_reviews
]
```

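<strong style='color: #FF6666'>Abort! This takes far too long and needs far too much memory.</strong>

It has to store about 46,000 boolean values for each of the 2,000 texts, roughly 92 million values in total. On top of that, the code above re-tokenizes each text once per vocabulary word. This is too much.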
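A quick back-of-the-envelope calculation shows the scale. The 8 bytes per list entry are an assumption about a typical 64-bit CPython build, where a list stores one reference per element:

```python
vocabulary_size = 46_289   # size of the vocabulary printed above
number_of_texts = 2_000    # 1,000 negative + 1,000 positive reviews

# One boolean per (text, vocabulary word) pair.
cells = vocabulary_size * number_of_texts
print(f"{cells:_} boolean values")

# Assumption: each list entry is an 8-byte reference (64-bit CPython).
bytes_per_reference = 8
megabytes = cells * bytes_per_reference / 1e6
print(f"about {megabytes:.0f} MB of list references")

# The nested comprehension also calls tokenization(text) once per pair,
# instead of once per text.
tokenizer_calls_naive = cells
tokenizer_calls_hoisted = number_of_texts
print(f"{tokenizer_calls_naive:_} vs {tokenizer_calls_hoisted:_} tokenizer calls")
```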

Is there a strategy to reduce the number of features?

We could use just 100 words instead of _all of them_. But which ones?

- Random ones (does this make sense?)
- 100 most common words
- ???

Let's focus on the second option: the 100 most common words. But how do we count the number of appearances?

We could do this ourselves with a good old-fashioned dictionary. But there is a module for that.
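For comparison, the manual dictionary version could look roughly like this (a sketch on a toy token list):

```python
tokens = ["a", "b", "a", "c", "a", "b", "d"]

counts = {}
for token in tokens:
    # .get() falls back to 0 the first time a token is seen
    counts[token] = counts.get(token, 0) + 1

print(counts)
```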

In [41]:
# Counter counts whatever we put in:
from collections import Counter

print("Example:\n", Counter(["a", "b", "a", "c", "a", "b", "d"]))

Example:
 Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})


A `Counter` is similar to a dictionary: `c['a']` returns the number of appearances of the string `'a'`. Unlike a plain dictionary, it returns 0 for a key that never appeared instead of raising a `KeyError`.

But wait, there's more:

`.most_common(num)`: Returns the `num` most frequently occurring tokens as a list of `(token, count)` tuples
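Both properties can be checked on the toy example from above:

```python
from collections import Counter

c = Counter(["a", "b", "a", "c", "a", "b", "d"])
plain = dict(c)  # same contents as an ordinary dictionary

print(c["a"])  # count of a token that appeared
print(c["z"])  # a token that never appeared: Counter answers 0

try:
    plain["z"]
except KeyError:
    print("a plain dict raises KeyError for a missing key")

# The two most frequent tokens with their counts
top_two = c.most_common(2)
print(top_two)
```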

#### TASK 1.3
Count all tokens in all of the texts.

In [89]:
### IMPLEMENT YOUR SOLUTION HERE ###

all_reviews = list(map(movie_reviews.raw, movie_reviews.fileids()))

full_count = Counter(
    word for review in all_reviews for word in tokenization(review)
)

In [90]:
# Test the solution and let's see the most common word/token:
print(full_count.most_common(10))

# The output should be:
# [("'s", 18128), ('``', 17625), ('film', 9443), ("n't", 6217), ('movie', 5671), ('one', 5582), ('like', 3547), ('even', 2556), ('good', 2316), ('time', 2282)]

[("'s", 18128), ('``', 17625), ('film', 9443), ("n't", 6217), ('movie', 5671), ('one', 5582), ('like', 3547), ('even', 2556), ('good', 2316), ('time', 2282)]


In [91]:
# It's a tuple! [0] is the token itself, [1] is the number of appearances

# Now, we can create the vocabulary out of the 100 most frequently occurring words:

vocabulary = sorted(token for token, _ in full_count.most_common(100))

print(vocabulary[:50])

["'re", "'s", "'ve", '``', 'action', 'actually', 'almost', 'also', 'although', 'another', 'around', 'audience', 'back', 'bad', 'best', 'better', 'big', 'cast', 'character', 'characters', 'come', 'comedy', 'could', 'director', 'end', 'enough', 'even', 'ever', 'every', 'fact', 'film', 'films', 'find', 'first', 'funny', 'get', 'gets', 'go', 'going', 'good', 'great', 'however', 'john', 'know', 'last', 'life', 'like', 'little', 'long', 'look']


The **Feature Extraction** is now finished. But the classifier needs a combination of **FEATURES** and the **CLASS**.

We call that the _training data_.

This is usually a list of tuples: `(features, class), (features, class), ...`
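As a minimal sketch of that shape, with a hypothetical three-word vocabulary and made-up feature lists:

```python
# Hypothetical three-word vocabulary and two made-up texts, only to show the shape.
toy_vocabulary = ["bad", "film", "great"]

toy_training_data = [
    ([True, True, False], "neg"),   # contains "bad" and "film"
    ([False, True, True], "pos"),   # contains "film" and "great"
]

for features, label in toy_training_data:
    # Translate the boolean flags back into the words they stand for
    present = [word for word, flag in zip(toy_vocabulary, features) if flag]
    print(label, present)
```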

## Classifier

Next, we need a classifier that can work with our data.

We will look at `NLTK`'s version first and later at the one from `scikit-learn`. Unfortunately, `NLTK`'s classifier needs a specific input format:
the features of each text must be a dictionary.

In [92]:
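def extract_features(text_tokens):
    # Convert to a set once so each membership test is a fast lookup
    token_set = set(text_tokens)
    feature = {}
    for word in vocabulary:
        # Feature name -> True if the vocabulary word occurs in the text
        feature[f"contains({word})"] = word in token_set
    return feature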

In [93]:
all_neg_features = [extract_features(tokenization(text)) for text in neg_reviews]

all_pos_features = [extract_features(tokenization(text)) for text in pos_reviews]

In [94]:
# Example:
all_neg_features[0]

{"contains('re)": False,
 "contains('s)": True,
 "contains('ve)": True,
 'contains(``)': True,
 'contains(action)': False,
 'contains(actually)': True,
 'contains(almost)': False,
 'contains(also)': True,
 'contains(although)': True,
 'contains(another)': False,
 'contains(around)': False,
 'contains(audience)': True,
 'contains(back)': True,
 'contains(bad)': True,
 'contains(best)': False,
 'contains(better)': False,
 'contains(big)': True,
 'contains(cast)': False,
 'contains(character)': True,
 'contains(characters)': True,
 'contains(come)': False,
 'contains(comedy)': False,
 'contains(could)': False,
 'contains(director)': True,
 'contains(end)': False,
 'contains(enough)': False,
 'contains(even)': True,
 'contains(ever)': True,
 'contains(every)': True,
 'contains(fact)': False,
 'contains(film)': True,
 'contains(films)': True,
 'contains(find)': True,
 'contains(first)': False,
 'contains(funny)': False,
 'contains(get)': True,
 'contains(gets)': False,
 'contains(go)': True

In [95]:
# To train, we need to attach the LABEL to each feature set:

training_data = []  # final list to contain training data tuples

# First, we add the features for the 'neg' class:
training_data.extend([(feature, "neg") for feature in all_neg_features])

# Then, we add the features for the 'pos' class:
training_data.extend([(feature, "pos") for feature in all_pos_features])

In [96]:
training_data[0]

({"contains('re)": False,
  "contains('s)": True,
  "contains('ve)": True,
  'contains(``)': True,
  'contains(action)': False,
  'contains(actually)': True,
  'contains(almost)': False,
  'contains(also)': True,
  'contains(although)': True,
  'contains(another)': False,
  'contains(around)': False,
  'contains(audience)': True,
  'contains(back)': True,
  'contains(bad)': True,
  'contains(best)': False,
  'contains(better)': False,
  'contains(big)': True,
  'contains(cast)': False,
  'contains(character)': True,
  'contains(characters)': True,
  'contains(come)': False,
  'contains(comedy)': False,
  'contains(could)': False,
  'contains(director)': True,
  'contains(end)': False,
  'contains(enough)': False,
  'contains(even)': True,
  'contains(ever)': True,
  'contains(every)': True,
  'contains(fact)': False,
  'contains(film)': True,
  'contains(films)': True,
  'contains(find)': True,
  'contains(first)': False,
  'contains(funny)': False,
  'contains(get)': True,
  'contains

`NLTK`'s Naive Bayes classifier can now be trained with this information:

In [97]:
from nltk import NaiveBayesClassifier

In [98]:
# We train by calling the .train() method with the just created training data

nb = NaiveBayesClassifier.train(training_data)

### Testing

How can we test/run the classifier now?

For a new text, we need to repeat the same steps as above:

1. Tokenization
2. Feature Extraction
3. Classification

In [100]:
?nb.classify

#### TASK 1.4
Write a function to test the classification.
1. Tokenize
2. Extract features
3. Classification with nb.classify()
4. Return Classification

In [108]:
### IMPLEMENT YOUR SOLUTION HERE ###
def test_classify(example):
    return nb.classify(extract_features(tokenization(example)))



In [109]:
example = "I really hate this movie. It is the worst movie that I have ever seen."

output = test_classify(example)

print(f"The classifier predicts: {output}")

The classifier predicts: neg


In [110]:
example = "I liked the perfect character performance so much that I watched this great film almost a hundred times back to back and fell fully in love with the well-designed plot."

output = test_classify(example)

print(f"The classifier predicts: {output}")

The classifier predicts: pos


In [111]:
nb.show_most_informative_features(10)

Most Informative Features
           contains(bad) = True              neg : pos    =      2.0 : 1.0
        contains(script) = True              neg : pos    =      1.6 : 1.0
          contains(life) = True              pos : neg    =      1.5 : 1.0
       contains(nothing) = True              neg : pos    =      1.5 : 1.0
           contains(bad) = False             pos : neg    =      1.5 : 1.0
         contains(world) = True              pos : neg    =      1.5 : 1.0
           contains(n't) = False             pos : neg    =      1.5 : 1.0
   contains(performance) = True              pos : neg    =      1.4 : 1.0
         contains(great) = True              pos : neg    =      1.4 : 1.0
      contains(although) = True              pos : neg    =      1.4 : 1.0


From this view, we can see which *features* contributed to which *class*.

Note that this comes purely (!) from the training data; no world knowledge or semantics is involved.

From the first line, we can see that, according to the training data, a text that _contains_ the word "bad" is 2 times more likely to belong to the _negative_ class. The combination of these ratios across all features leads to the output.

The classifier also uses **negative** evidence: for example, if the text **DOES NOT** contain the word "bad" (`contains(bad) = False`), it is 1.5 times more likely to belong to the _positive_ class.
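How such ratios combine can be sketched by hand. This is a simplified illustration, not NLTK's internal computation: it picks just three rows from the table above, while the real classifier multiplies in all 100 features. The prior odds are 1 because both classes have 1,000 reviews.

```python
# Likelihood ratios read off the table above, written as neg : pos.
# A ratio below 1 favours the positive class.
ratios_neg_to_pos = {
    "contains(bad) = True": 2.0,
    "contains(script) = True": 1.6,
    "contains(life) = True": 1 / 1.5,
}

# Prior odds neg : pos, from 1,000 negative and 1,000 positive reviews.
odds = 1000 / 1000

# Naive Bayes treats the features as independent given the class,
# so the individual ratios are simply multiplied.
for ratio in ratios_neg_to_pos.values():
    odds *= ratio

# Convert the odds back into a probability for the negative class.
probability_neg = odds / (1 + odds)
print(f"odds neg:pos = {odds:.2f}, P(neg) = {probability_neg:.2f}")
```

In this toy case the two negative cues outweigh the single positive one, so the text would lean towards `neg`.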

## Future Work:

Instead of using a simple boolean indicator, we could also use the number of appearances.

We could introduce other features, such as the length of the full text or the average word length. We could also go through the vocabulary and remove more self-defined stopwords (e.g. "would" is not on the list). It might also make sense to use not the top 100 words but those ranked 101 to 200, or even further down.
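A rough sketch of what such a richer feature dictionary could look like. The function name `extract_count_features`, the feature names, and the toy vocabulary are all hypothetical and not part of this notebook:

```python
from collections import Counter

toy_vocabulary = ["bad", "film", "great"]  # hypothetical mini vocabulary
tokens = ["great", "film", "great", "cast", "bad", "ending"]

def extract_count_features(text_tokens, vocabulary):
    # Count every token once, then look up each vocabulary word
    counts = Counter(text_tokens)
    features = {f"count({word})": counts[word] for word in vocabulary}
    # Two extra features: text length and average token length
    features["num_tokens"] = len(text_tokens)
    features["avg_token_length"] = (
        sum(len(token) for token in text_tokens) / len(text_tokens)
        if text_tokens
        else 0.0
    )
    return features

count_features = extract_count_features(tokens, toy_vocabulary)
print(count_features)
```

Note that `NLTK`'s Naive Bayes classifier treats every distinct feature value as a separate category, so raw counts or lengths would probably need to be grouped into a few bins first.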

We'll get to that later.