# IMPORT DEPENDENCIES

In [9]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import re
import string
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

## IMDB Dataset Download

In [10]:
trainDS, valDS, testDS = tfds.load('imdb_reviews',
                                   split=['train', 'test[:50%]', 'test[50%:]'],
                                   as_supervised=True)

In [11]:
trainDS

<_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

In [12]:
for review, label in trainDS.take(2):
    print('Review:', review.numpy())
    print('Label:', label.numpy())
    print('\n')

Review: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label: 0


Review: b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbis



## TEXT STANDARDIZATION

### 1. Convert to Lower Case
### 2. Remove HTML Tags
### 3. Remove Punctuation
### 4. Stemming: Reduce a word to its stem by stripping suffixes (the stem need not be a real word) -> Porter Stemmer

In [13]:
print(PorterStemmer().stem('Coming'))
print(PorterStemmer().stem('Tensed'))

come
tens


### 5. Lemmatization: Like stemming, but uses a dictionary (WordNet) and the word's part of speech to return a valid base word (lemma)

In [19]:
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [15]:
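# WordNetLemmatizer is case-sensitive and WordNet lemmas are lowercase,
# so the capitalized inputs below come back unchanged.
print(WordNetLemmatizer().lemmatize('Coming', pos=wordnet.VERB))
print(WordNetLemmatizer().lemmatize('Tensed', pos=wordnet.ADJ))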

Coming
Tensed


In [17]:
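def standardization(inputData):
    """Lowercase the text, strip HTML tags, and remove punctuation (steps 1-3 above)."""
    lowerCaseOutput = tf.strings.lower(inputData)
    # Replace each HTML tag with a space so the words on either side stay separated.
    noTagOutput = tf.strings.regex_replace(lowerCaseOutput, "<[^>]+>", " ")
    # Delete every character listed in string.punctuation.
    noPunctOutput = tf.strings.regex_replace(noTagOutput, "[%s]" % re.escape(string.punctuation), "")
    return noPunctOutput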

In [18]:
for review, label in trainDS.take(2):
    print('Review:', review.numpy())
    print('Label:', label.numpy())
    print('Standardized Review:', standardization(review).numpy())
    print('\n')

Review: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label: 0
Standardized Review: b'this was an absolutely terrible movie dont be lured in by christopher walken or michael ironside both are great actors but this must simply be their worst role in history even their great acting could not redeem this movies ridiculous storyline th



## TOKENIZATION

### 1. Character Tokenization -> Smaller Vocabulary
#### "i love this movie" -> i, l, o, v, e, t, h, i, ...
### 2. Word Tokenization -> Larger Vocabulary
#### "i love this movie" -> i, love, this, movie
### 3. Subword Tokenization -> Middle Ground
#### "i love this movie" -> i, lov, e, thi, s, mov, ie
### 4. N-Gram Tokenization -> Combines N consecutive words into a single token
#### "i love this movie" -> i love, love this, this movie (2-Gram)
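
The character, word, and N-gram schemes above can be sketched in plain Python. The helper names (`char_tokens`, `word_tokens`, `ngram_tokens`) are hypothetical, and splitting on whitespace is a simplifying assumption. Subword tokenization is left out because it needs a vocabulary learned from data (for example with BPE or WordPiece).

```python
def char_tokens(text):
    """Split text into individual characters, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def word_tokens(text):
    """Split text on whitespace."""
    return text.split()


def ngram_tokens(text, n=2):
    """Join every run of n consecutive words into one token."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "i love this movie"
print(char_tokens(sentence))
print(word_tokens(sentence))
print(ngram_tokens(sentence, n=2))
```

The trade-off is visible even on this one sentence: character tokens draw on a tiny alphabet but produce long sequences, while word and N-gram tokens give shorter sequences at the cost of a much larger vocabulary.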

## Numericalization of Tokens

### 1. One-Hot Encoding -> Returns a matrix of size (number of tokens, vocab size), with a single 1 per row
### 2. Bag-Of-Words -> Returns a single vector of size (vocab size), where each value is the number of times that word occurs in the sentence
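
A minimal NumPy sketch of the two encodings above, assuming a hypothetical five-word vocabulary. It lays out the one-hot matrix with one row per token, which matches the (4, 10000) shape used in the embedding example further down.

```python
import numpy as np

# Hypothetical toy vocabulary: word -> integer index.
vocab = {"i": 0, "love": 1, "this": 2, "movie": 3, "hate": 4}
tokens = "i love this movie".split()
ids = [vocab[t] for t in tokens]

# One-hot: one row per token, with a single 1 in the column of that token's index.
one_hot = np.zeros((len(tokens), len(vocab)), dtype=np.int64)
one_hot[np.arange(len(tokens)), ids] = 1

# Bag-of-words: summing the one-hot rows gives the count of each vocabulary word.
bag_of_words = one_hot.sum(axis=0)

print(one_hot.shape)
print(bag_of_words)
```

Bag-of-words discards word order, so "i love this movie" and "this movie i love" map to the same vector.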
### 3. tf-idf Encoding -> Term Frequency * Inverse Document Frequency
#### Term Frequency = No. of times the word occurs in the sentence / No. of words in the sentence
#### Inverse Document Frequency = log(Total no. of sentences / No. of sentences containing the word)
#### Final Encoding = tf * idf
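
The tf-idf formulas can be sketched in plain Python as follows. This assumes that term frequency is normalized by the length of the sentence, that the logarithm is natural, and that no smoothing is applied; libraries often use smoothed variants, so their numbers will differ. The function name `tf_idf` and the three-sentence corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(sentences):
    """Return one {word: tf * idf} dict per sentence (unsmoothed)."""
    tokenized = [s.split() for s in sentences]
    n_sentences = len(tokenized)
    # Number of sentences that contain each word at least once.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_sentences / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


corpus = ["i love this movie", "i hate this movie", "i love popcorn"]
for sentence, row in zip(corpus, tf_idf(corpus)):
    print(sentence, row)
```

A word that appears in every sentence, such as "i" here, gets idf = log(1) = 0, so its score is zero no matter how often it occurs.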
### 4. Embeddings -> Aims to reduce sparsity and dimensions
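#### Embedded Matrix = One-Hot Matrix * Embedding Matrix
#### => (4, 10000) * (10000, 300) = (4, 300), where 4 is the number of tokens, 10000 is the vocabulary size, and 300 is the embedding dimension
#### **Embeddings can also encode semantic relations between words**
#### ***The embedding matrix is a trainable weight, learned together with the rest of the model***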
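
The shape arithmetic above can be checked with a small NumPy sketch. The sizes (10 and 3) are tiny stand-ins for 10000 and 300, and the token ids and random matrix are made up for illustration. It also shows that multiplying a one-hot matrix by the embedding matrix is the same as picking rows by index.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 3  # tiny stand-ins for 10000 and 300
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([2, 7, 7, 5])       # hypothetical ids for a 4-token sentence
one_hot = np.eye(vocab_size)[token_ids]  # shape (4, vocab_size)

via_matmul = one_hot @ embedding_matrix   # shape (4, embed_dim)
via_lookup = embedding_matrix[token_ids]  # same rows, no multiplication needed

print(via_matmul.shape)
print(np.allclose(via_matmul, via_lookup))
```

Because the two results match, an embedding layer such as `tf.keras.layers.Embedding` can store the matrix as a trainable weight and look rows up by token id, without ever building the sparse one-hot matrix.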