<a href="https://colab.research.google.com/github/Showcas/NLP/blob/main/01_1_First_Steps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with Deep Learning

In [2]:
!pip freeze
!pip install nltk


## Tokenization

`str` has the `.split()` method, which **splits** the string on any _whitespace_ and produces a list of **tokens**.
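To make the whitespace behaviour concrete, here is a minimal sketch. The `sample` string is a made-up example (not part of the notebook) with irregular whitespace; it contrasts calling `.split()` without an argument against passing an explicit separator.

```python
# Hypothetical sample with a double space, a tab, a newline and a trailing space.
sample = "I  love\tthis\nmovie "

# No argument: split on any run of whitespace; leading/trailing whitespace is ignored.
default_tokens = sample.split()
print(default_tokens)  # ['I', 'love', 'this', 'movie']

# Explicit separator: every single space is a boundary, so empty strings appear
# and the tab and newline stay inside a token.
space_tokens = sample.split(" ")
print(space_tokens)  # ['I', '', 'love\tthis\nmovie', '']
```

For tokenization, the argument-free form is usually the more convenient one, since it never yields empty tokens.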

In [3]:
EXAMPLE = """
I love Jackson's movie! It's sweet, but with satirical humour. The dialogue is great and the adventure scenes are fun...
It manages to be whimsical and romantic while laughing at the conventions of the fairytale genre.
I would recommend it to just about anyone. I've seen in several times, and I'm always happy to see it again whenever
I have a friend who hasn't seen it yet.
"""

In [4]:
tokens = EXAMPLE.split()

print(f"Type: {type(tokens)}")
print(f"Length: {len(tokens)}")

Type: <class 'list'>
Length: 67


In [5]:
print(tokens[:5], tokens[-5:])

['I', 'love', "Jackson's", 'movie!', "It's"] ['who', "hasn't", 'seen', 'it', 'yet.']


### Remove punctuation
- The module `string` has a predefined string `punctuation` which contains all ASCII punctuation marks.
- The type `str` has the method `.replace()` which takes two arguments: the substring to replace and its replacement.

#### TASK 1.1
Use the following to implement a loop to remove punctuation from `EXAMPLE`.

In [6]:
from string import punctuation

print(f"Predefined Punctuation: {punctuation}")

no_punctuation = EXAMPLE

### IMPLEMENT YOUR SOLUTION HERE ####

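# Alternative without a loop:
# no_punctuation = no_punctuation.translate(str.maketrans("", "", punctuation))

for char in punctuation:
    # Delete each punctuation character (replace it with the empty string)
    no_punctuation = no_punctuation.replace(char, "")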


print(no_punctuation)


Predefined Punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

I love Jacksons movie Its sweet but with satirical humour The dialogue is great and the adventure scenes are fun
It manages to be whimsical and romantic while laughing at the conventions of the fairytale genre
I would recommend it to just about anyone Ive seen in several times and Im always happy to see it again whenever
I have a friend who hasnt seen it yet
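Whether a punctuation character is deleted or replaced by a space changes the resulting tokens. The following sketch compares both options using `str.translate`; the `sample` string and the variable names are assumptions made for illustration.

```python
from string import punctuation

# Hypothetical sample sentence (not part of the notebook).
sample = "It's sweet, but fun..."

# Option 1: a translation table that deletes every punctuation character.
delete_table = str.maketrans("", "", punctuation)
deleted = sample.translate(delete_table)

# Option 2: a translation table that maps every punctuation character to a space.
space_table = str.maketrans(punctuation, " " * len(punctuation))
spaced = sample.translate(space_table)

print(deleted.split())  # ['Its', 'sweet', 'but', 'fun']
print(spaced.split())   # ['It', 's', 'sweet', 'but', 'fun']
```

Deleting merges contractions into a single token (`Its`), while replacing with a space splits them into two (`It`, `s`).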



In [13]:
print("No Punctuation:\n", no_punctuation)

tokens = no_punctuation.split()

print("Tokens:\n", tokens[:5], tokens[-5:])

No Punctuation:
 
I love Jacksons movie Its sweet but with satirical humour The dialogue is great and the adventure scenes are fun
It manages to be whimsical and romantic while laughing at the conventions of the fairytale genre
I would recommend it to just about anyone Ive seen in several times and Im always happy to see it again whenever
I have a friend who hasnt seen it yet

Tokens:
 ['I', 'love', 'Jacksons', 'movie', 'Its'] ['who', 'hasnt', 'seen', 'it', 'yet']


### Combine Negations / Normalize Apostrophe
We could hardcode rules for what to normalize, e.g. "it's" → "it is".

In [8]:
normalize_rules = {"n't": " not", "'m": " am", "'ve": " have", "'s": " is"}

In [9]:
normalized = EXAMPLE

for old, new in normalize_rules.items():
    normalized = normalized.replace(old, new)

for p in punctuation:
    normalized = normalized.replace(p, " ")

In [10]:
print("Normalized:\n", normalized)

tokens = normalized.split()

print("Tokens:\n", tokens[:5], tokens[-5:])

Normalized:
 
I love Jackson is movie  It is sweet  but with satirical humour  The dialogue is great and the adventure scenes are fun   
It manages to be whimsical and romantic while laughing at the conventions of the fairytale genre 
I would recommend it to just about anyone  I have seen in several times  and I am always happy to see it again whenever
I have a friend who has not seen it yet 

Tokens:
 ['I', 'love', 'Jackson', 'is', 'movie'] ['has', 'not', 'seen', 'it', 'yet']


If we look at the output, we notice something:

$\texttt{['I', 'love', 'Jackson', }\underbrace{\texttt{'is'}}_{\text{This is wrong!}} \texttt{, 'movie']}$

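The problem lies with the `'s` replacement: `'s` can be a contraction of *is*, but it can also mark the genitive (possessive) case, as in *Jackson's*. Instead of replacing every `'s`, we add rules only for the pronoun **it's** and its capitalized form **It's** (alternatively, we could convert the full text to lower case).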
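The lowercase alternative mentioned above could be sketched as follows. `LOWERCASE_RULES` and `normalize_lowercase` are hypothetical names introduced here, not part of the notebook; the point is only that a single lowercase rule then covers both **it's** and **It's**.

```python
# Hypothetical rule set: only lowercase keys are needed once the text is lowercased.
LOWERCASE_RULES = {"it's": "it is", "n't": " not", "'m": " am", "'ve": " have"}

def normalize_lowercase(text):
    # Lowercase first, then apply each replacement rule in turn.
    text = text.lower()
    for old, new in LOWERCASE_RULES.items():
        text = text.replace(old, new)
    return text

print(normalize_lowercase("It's sweet. I love Jackson's movie!"))
# it is sweet. i love jackson's movie!
```

The trade-off is that lowercasing also discards capitalization that may carry information, such as the proper noun *Jackson*.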

In [15]:
normalized = EXAMPLE

normalize = {
    "It's": "It is",
    "it's": "it is",
    "n't": " not",
    "'m": " am",
    "'ve": " have",
}

for old, new in normalize.items():
    normalized = normalized.replace(old, new)

for p in punctuation:
    normalized = normalized.replace(p, " ")

In [16]:
print("Normalized:\n", normalized)

tokens = normalized.split()

print("Tokens:\n", tokens[:5], tokens[-5:])

Normalized:
 
I love Jackson s movie  It is sweet  but with satirical humour  The dialogue is great and the adventure scenes are fun   
It manages to be whimsical and romantic while laughing at the conventions of the fairytale genre 
I would recommend it to just about anyone  I have seen in several times  and I am always happy to see it again whenever
I have a friend who has not seen it yet 

Tokens:
 ['I', 'love', 'Jackson', 's', 'movie'] ['has', 'not', 'seen', 'it', 'yet']


This is better than what we had previously:

$\texttt{['I', 'love', 'Jackson', }\underbrace{\texttt{'s'}}_{\text{Used to be } \texttt{'is'}} \texttt{, 'movie']}$

Depending on what we want to accomplish, this is still not correct, though. The **'s** is a possessive suffix, and by splitting the text the way we currently do, we lose this information. We can infer from the sentence that it was a possessive suffix, but a stray `s` could just as well have been a typo in the original text.
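One possible way to keep the suffix visible is to emit `'s` as its own token. The sketch below uses a regular expression for this; `tokenize_keep_possessive` is a hypothetical helper, and the approach is an assumption rather than the notebook's method.

```python
import re

def tokenize_keep_possessive(text):
    # Match either a run of word characters or the literal suffix 's
    # (the \b ensures the s is not the start of a longer quoted word).
    return re.findall(r"\w+|'s\b", text)

print(tokenize_keep_possessive("I love Jackson's movie!"))
# ['I', 'love', 'Jackson', "'s", 'movie']
```

Note that this pattern cannot tell a possessive `'s` from the contraction in *It's*; both end up as the token `'s`, so the normalization rules would still have to run first.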

### Bag-of-Words

Here, we use an *uncounted* version of bag-of-words: we create a `set` out of the tokens.

In [17]:
bow = set(tokens)

print(f"Vocabulary count: {len(bow)}")

print("Vocabulary:\n", bow)

Vocabulary count: 56
Vocabulary:
 {'but', 'not', 'anyone', 's', 'humour', 'are', 'fun', 'always', 'conventions', 'satirical', 'just', 'the', 'great', 'I', 'movie', 'of', 'love', 'would', 'times', 'manages', 'has', 'have', 'be', 'with', 'dialogue', 'yet', 'sweet', 'The', 'about', 'see', 'It', 'am', 'laughing', 'again', 'friend', 'Jackson', 'at', 'who', 'it', 'while', 'whenever', 'happy', 'genre', 'whimsical', 'to', 'seen', 'fairytale', 'in', 'is', 'and', 'several', 'a', 'adventure', 'romantic', 'scenes', 'recommend'}
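For comparison, a *counted* bag-of-words keeps the frequency of each token instead of only its presence. A minimal sketch with `collections.Counter`, using a made-up token list rather than the notebook's `tokens`:

```python
from collections import Counter

# Hypothetical token list (not the notebook's data).
sample_tokens = "the movie is great and the dialogue is great".split()

# Counter maps each distinct token to the number of times it occurs.
counts = Counter(sample_tokens)

print(counts["great"])  # 2
print(counts["movie"])  # 1

# The keys of the counted version are exactly the uncounted bag-of-words.
print(set(counts) == set(sample_tokens))  # True
```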


### Stopword removal

We see from the vocabulary that many words **MIGHT** not influence the sentiment: pronouns, conjunctions, etc.

We call these **stopwords** because they are filtered out (*stopped*) before the analysis.

The module `nltk` has a predefined list of stopwords.

We need to download this list first before we can use it.

In [18]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [19]:
from nltk.corpus import stopwords

english_stopwords = set(stopwords.words("english"))


tokens_without_stopwords = []

for token in tokens:
    if token.lower() not in english_stopwords:
        tokens_without_stopwords.append(token)


# BoW
bow = set(tokens_without_stopwords)

print(f"Vocabulary count: {len(bow)}")

print("Vocabulary:\n", bow)

Vocabulary count: 30
Vocabulary:
 {'anyone', 'humour', 'fun', 'always', 'conventions', 'satirical', 'great', 'movie', 'love', 'would', 'times', 'manages', 'dialogue', 'yet', 'sweet', 'see', 'laughing', 'friend', 'Jackson', 'whenever', 'happy', 'genre', 'whimsical', 'seen', 'fairytale', 'several', 'adventure', 'romantic', 'scenes', 'recommend'}
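The same filtering can be written more compactly as a list comprehension. The sketch below uses a tiny hand-picked stopword set so that it runs without downloading anything; `DEMO_STOPWORDS` and `remove_stopwords_demo` are assumptions for illustration, and NLTK's real list is much longer.

```python
# Tiny hand-picked stopword set for illustration only (an assumption, not NLTK's list).
DEMO_STOPWORDS = {"this", "is", "a", "the", "it"}

def remove_stopwords_demo(tokens, stopword_set=DEMO_STOPWORDS):
    # Keep a token only if its lowercase form is not a stopword.
    return [token for token in tokens if token.lower() not in stopword_set]

print(remove_stopwords_demo(["This", "is", "a", "great", "movie"]))
# ['great', 'movie']
```

Comparing the lowercase form means that `This` and `this` are treated alike, while the surviving tokens keep their original capitalization.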


In [20]:
english_stopwords

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

### Combine everything into functions

To modularize the code and reuse it for any input, we have to create functions.
Here, we create the following functions:
    
- Normalize according to predefined rules
- Remove leftover punctuation
- Remove stopwords

In [21]:
# Manual normalizing rules
NORMALIZE_RULES = {
    "It's": "It is",
    "it's": "it is",
    "n't": " not",
    "'m": " am",
    "'ve": " have",
    "'re": " are",
}

# Punctuation from string module (as set => faster)
PUNCTUATION = set(punctuation)

# Stopwords from NLTK (as set => faster)
STOPWORDS = set(stopwords.words("english"))

In [22]:
def normalize(text):
    for old, new in NORMALIZE_RULES.items():
        text = text.replace(old, new)
    return text

In [23]:
def remove_punctuation(text):
    for p in PUNCTUATION:
        text = text.replace(p, " ")
    return text

In [24]:
def remove_stopwords(tokens):
    clean = []
    for token in tokens:
        if token.lower() not in STOPWORDS:
            clean.append(token)
    return clean

#### TASK 1.2
Write a function for tokenization using the functions defined above. The tokenization function should do the following:
1. Normalize the text
2. Remove punctuation from the text
3. Split the text into tokens
4. Remove the stopwords from the tokens
5. Apply Bag-of-Words and return `bow` as a set of tokens.

In [54]:
def tokenization(text):
    bow = text
    ### IMPLEMENT YOUR SOLUTION HERE ###
    # 1. Normalize
    bow = normalize(bow)
    # 2. Remove Punctuation
    bow = remove_punctuation(bow)
    # 3. Tokenize
    bow = bow.split()
    # 4. Remove Stopwords
    bow = remove_stopwords(bow)
    # 5. Apply Bag-of-Words (set of tokens)
    bow = set(bow)

    return bow

In [55]:
example = "This is a great movie."

print(f"Input : '{example}'")
print(f"Tokens: {tokenization(example)}")

# This should output the following:
# Input : 'This is a great movie.'
# Tokens: {'great', 'movie'}

Input : 'This is a great movie.'
Tokens: {'movie', 'great'}


In [56]:
example = "This is a bad movie."

print(f"Input : '{example}'")
print(f"Tokens: {tokenization(example)}")

# This should output the following:
# Input : 'This is a bad movie.'
# Tokens: {'bad', 'movie'}

Input : 'This is a bad movie.'
Tokens: {'movie', 'bad'}


In [57]:
example = "I'm enjoying this movie so much, you will, too."

print(f"Input : '{example}'")
print(f"Tokens: {tokenization(example)}")

# This should output the following:
# Input : 'I'm enjoying this movie so much, you will, too.'
# Tokens: {'much', 'enjoying', 'movie'}

Input : 'I'm enjoying this movie so much, you will, too.'
Tokens: {'movie', 'enjoying', 'much'}


## Basic Classifier based on hardcoded rules

We can define a list of important words for each sentiment (_good_ or _bad_) and then count how often each appears in the text.

In [58]:
POSITIVE_SENTIMENT = set(["good", "best", "great", "like", "enjoy", "love"])

NEGATIVE_SENTIMENT = set(["bad", "worst", "unlikable", "hate", "trash"])

Now, we need a function that counts how many words of each sentiment appear in the text, compares the two counts, and outputs the class with the higher count.

In [59]:
def classify(text):
    bow = tokenization(text)

    positive_count, negative_count = 0, 0

    for word in bow:
        if word in POSITIVE_SENTIMENT:
            positive_count += 1
        elif word in NEGATIVE_SENTIMENT:
            negative_count += 1

    # A tie (including the case of no matches at all) falls through to NEGATIVE
    if positive_count > negative_count:
        return "POSITIVE"
    else:
        return "NEGATIVE"

In [60]:
classify("This is the worst movie of all time.")

'NEGATIVE'

In [61]:
classify("This is a trash movie. I hate that I love it so much.")

'NEGATIVE'

In [62]:
classify("Other people said that I would like this movie. I don't.")

'POSITIVE'

In [63]:
classify(EXAMPLE)

'POSITIVE'

In [64]:
classify("I LOVE this movie.")  # should lowercase everything, right?

'NEGATIVE'
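The last result suggests lowercasing the text before matching. Here is a self-contained sketch of such a variant; `classify_lowercase` and the `"NEUTRAL"` label for ties are assumptions introduced here, and the stopword step is skipped because none of the sentiment words is a stopword.

```python
from string import punctuation

POSITIVE = {"good", "best", "great", "like", "enjoy", "love"}
NEGATIVE = {"bad", "worst", "unlikable", "hate", "trash"}

def classify_lowercase(text):
    # Lowercase first so "LOVE", "Love" and "love" all match the word lists.
    text = text.lower()
    for p in punctuation:
        text = text.replace(p, " ")
    bow = set(text.split())

    # Count distinct sentiment words via set intersection.
    positive_count = len(bow & POSITIVE)
    negative_count = len(bow & NEGATIVE)

    if positive_count > negative_count:
        return "POSITIVE"
    if negative_count > positive_count:
        return "NEGATIVE"
    return "NEUTRAL"  # hypothetical third label for ties

print(classify_lowercase("I LOVE this movie."))  # POSITIVE
print(classify_lowercase("This is a movie."))    # NEUTRAL
```

This still ignores negation: a sentence such as "I don't like it" would count `like` as positive.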

<p style="color: #749CFF; background: #ccffff; font-size: xx-large">
    <br />
    <strong>
        Now, we have our first Artificial Intelligence.
    </strong>
    <br /><br />
</p>