### **Importing Libraries**

In [3]:
# Std Libraries...
import os
import re
import shutil
import string

# Plotting and numerical libraries...
import matplotlib.pyplot as pl
import numpy as np

# Deep-Learning libraries
import tensorflow as tf
import tensorflow.keras as tfk
from tensorflow.keras import layers as lyrs, optimizers as opts, losses


### **Downloading Data**

> **`Don't run this cell: it downloads the data, which has already been done.`**

In [5]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file(
    "aclImdb_v1", url,
    untar=True, cache_dir='.',
    cache_subdir=''
)

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
print(os.listdir(dataset_dir))


['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']


> #### **Checking a sample file**

In [6]:
train_dir = os.path.join(dataset_dir, 'train')
print(os.listdir(train_dir))

sample_file = os.path.join(train_dir, 'pos/1181_9.txt')
with open(sample_file) as f:
    print(f.read())
    
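# Remove the unlabeled reviews so that only the 'neg' and 'pos'
# class folders remain in the training directory.
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)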


['labeledBow.feat', 'neg', 'pos', 'unsup', 'unsupBow.feat', 'urls_neg.txt', 'urls_pos.txt', 'urls_unsup.txt']
Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


> **For preprocessing, the data must be arranged in the directory structure that the loading utility expects:**

* main_directory/
* ...class_a/
* ......a_text_1.txt
* ......a_text_2.txt
* ...class_b/
* ......b_text_1.txt
* ......b_text_2.txt
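
As a rough illustration of how such a layout maps onto labels, the sketch below reads a directory of this shape with the standard library only. It is not what `text_dataset_from_directory` does internally. The helper name `read_labeled_texts` and the file names are hypothetical, and the only behaviour it mirrors is that class folders are taken in alphabetical order, so `neg` becomes label `0` and `pos` becomes label `1`.

```python
import os
import tempfile

def read_labeled_texts(main_directory):
    """Return (class_names, samples), where samples is a list of (text, label)."""
    # Class names are the sub-directory names in alphabetical order;
    # a sample's label is the index of its class in that sorted list.
    class_names = sorted(
        d for d in os.listdir(main_directory)
        if os.path.isdir(os.path.join(main_directory, d))
    )
    samples = []
    for label, name in enumerate(class_names):
        class_dir = os.path.join(main_directory, name)
        for fname in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, fname), encoding="utf-8") as f:
                samples.append((f.read(), label))
    return class_names, samples

# Build a throwaway directory with the expected layout (hypothetical files).
with tempfile.TemporaryDirectory() as root:
    for name, text in [("neg", "dull and slow"), ("pos", "a lovely film")]:
        os.makedirs(os.path.join(root, name))
        path = os.path.join(root, name, "review_1.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    class_names, samples = read_labeled_texts(root)

print(class_names)  # ['neg', 'pos']
print(samples)      # [('dull and slow', 0), ('a lovely film', 1)]
```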

### **Loading Data For training**
> ##### **Dividing train data into training and validation data using `text_dataset_from_directory`**

In [7]:
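# Parameters shared by the dataset-loading calls below
bs = 32  # batch size
s = 42   # seed; must be the same for the training and validation subsets

# this is the training set...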
RawTrainDataset = tfk.utils.text_dataset_from_directory(
    'aclImdb/train', batch_size=bs, seed=s,
    validation_split=0.2, subset='training'
)


Found 25000 files belonging to 2 classes.
Using 20000 files for training.


##### **Label `0` corresponds to `neg`**
##### **Label `1` corresponds to `pos`**


In [8]:
# Looking into data

for txt_b, lbl_b in RawTrainDataset.take(1):
    print(f"Review:: {txt_b.numpy()[7]}")
    print(f"Label:: {lbl_b.numpy()[7]}\n\n")


Review:: b"I'm a Christian who generally believes in the theology taught in Left Behind. That being said, I think Left Behind is one of the worst films I've seen in some time.<br /><br />To have a good movie, you need to have a well-written screenplay. Left Behind fell woefully short on this. For one thing, it radically deviates from the book. Sometimes this is done to condense a 400-page novel down to a two-hour film, but in this film I saw changes that made no sense whatsoever.<br /><br />Another thing, there is zero character development. When characters in the story get saved (I won't say who), the book makes it clear that it's a long, soul-searching process. In the film it's quick and artificial. The book is written decently enough where people like Rayford Steele, Buck Williams and Hattie Durham seem real, but in the movie scenarios are consistently given the quick treatment without anything substantial. In another scene where one character gets angry about being left behind (aga

In [9]:
# this is the validation set...
raw_val_ds = tfk.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=bs,
    validation_split=0.2,
    subset='validation',
    seed=s)


# getting ready the test set...
raw_test_ds = tfk.utils.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=bs,
)


Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
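
The reason both calls use the same `seed` can be sketched in plain Python: if the file order is shuffled identically each time, taking the first 80% for training and the last 20% for validation gives two disjoint subsets. This only illustrates the principle. The function `split_indices` is hypothetical and Keras's own shuffling is implemented differently.

```python
import random

def split_indices(n_files, validation_split, seed, subset):
    """Shuffle file indices with a fixed seed and return one side of the split."""
    indices = list(range(n_files))
    random.Random(seed).shuffle(indices)
    n_val = int(n_files * validation_split)
    if subset == "training":
        return indices[:-n_val]
    return indices[-n_val:]

train = split_indices(25000, 0.2, seed=42, subset="training")
val = split_indices(25000, 0.2, seed=42, subset="validation")

print(len(train), len(val))        # 20000 5000
print(set(train).isdisjoint(val))  # True

# With a different seed the shuffle differs, so this "validation" subset
# can contain files that are also in the training subset above.
val_other_seed = split_indices(25000, 0.2, seed=7, subset="validation")
print(len(set(train) & set(val_other_seed)))
```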


### **Preparing Dataset for training**
> **Next, you will standardize, tokenize, and vectorize the data using the helpful `tf.keras.layers.TextVectorization` layer.**

* **`Standardization` refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset**
* **`Tokenization` refers to splitting strings into tokens (for example, splitting a sentence into individual words)**
* **`Vectorization` refers to converting tokens into numbers so they can be fed into a neural network**
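
The three steps can be sketched in plain Python to make them concrete. This is a simplified stand-in, not the `TextVectorization` implementation. The function names are hypothetical, and the sketch assumes index `0` is reserved for padding and index `1` for unknown tokens. It also replaces `<br />` with a space so that neighbouring words stay separate.

```python
import re
import string
from collections import Counter

def standardize(text):
    # Lowercase, replace "<br />" tags with a space, strip ASCII punctuation.
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def tokenize(text):
    # Split on whitespace.
    return text.split()

def build_vocab(texts, max_tokens):
    # Most frequent tokens first; ties are broken alphabetically.
    counts = Counter(tok for t in texts for tok in tokenize(standardize(t)))
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    vocab = ["", "[UNK]"] + ordered[: max_tokens - 2]
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize_text(text, vocab, seq_len):
    # Map tokens to ids, then truncate or zero-pad to exactly seq_len.
    ids = [vocab.get(tok, 1) for tok in tokenize(standardize(text))]
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))

corpus = ["Great film!<br />Great cast.", "Dull film."]
vocab = build_vocab(corpus, max_tokens=10)
vector = vectorize_text("A great, great film", vocab, seq_len=6)

print(vocab)   # {'': 0, '[UNK]': 1, 'film': 2, 'great': 3, 'cast': 4, 'dull': 5}
print(vector)  # [1, 3, 3, 2, 0, 0]
```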


> ##### **Defining a custom Standardization function**

In [10]:
# The default standardization does not strip HTML tags such as <br />,
# so we define a simple custom one.

def cstm_stdfn(data):
    lc = tf.strings.lower(data)
    # Strip the "<br />" line-break tags. Replacing with '' joins the words
    # on either side of a tag; use ' ' instead to keep them separate.
    formatted = tf.strings.regex_replace(lc, '<br />', '')
    # Remove all ASCII punctuation characters.
    return tf.strings.regex_replace(
        formatted, '[%s]' % re.escape(string.punctuation), ''
    )

> ##### **Creating a text-vectorization layer**

In [11]:
mx_fea = 10000  # maximum vocabulary size (passed as max_tokens)
slen = 250  # pad or truncate every review to exactly 250 tokens

vec_layer = lyrs.TextVectorization(
    standardize= cstm_stdfn,
    max_tokens= mx_fea,
    output_mode= 'int',
    output_sequence_length= slen
)

> **Next, call `adapt` to fit the state of the preprocessing layer to the dataset. This makes the layer build an index that maps strings to integers.**

In [12]:
train_text = RawTrainDataset.map(lambda x, y: x)
print(train_text)
vec_layer.adapt(train_text)


<MapDataset shapes: (None,), types: tf.string>


In [None]:

# Testing the above layer with sample data to get insight on text pre-processing results!!

def vectorize(text, label):
    text = tf.expand_dims(input= text,axis= -1)
    return vec_layer(text), label

# Retrieving a batch (of 32 reviews and labels) from the dataset
(txt_b, lbl_b) = next(iter(RawTrainDataset))
first_txt, first_lbl = txt_b[0], lbl_b[0]
print("Printing out stuff!")
print(f"First Review:: {first_txt}")
print(f"Label(encoding):: {first_lbl}")
print(f"Sentiment:: {RawTrainDataset.class_names[first_lbl]}")
print("Vectorized review", vectorize(first_txt, first_lbl))


> **You can lookup the token (string) that each integer corresponds to by calling `.get_vocabulary()` on the layer.**

In [21]:
print(f"1213 ==> {vec_layer.get_vocabulary()[1213]}")
print(f"2416 ==> {vec_layer.get_vocabulary()[2416]}")
print(f"10 ==> {vec_layer.get_vocabulary()[10]}")


1213 ==> tough
2416 ==> speaks
10 ==> this


> **Applying the TextVectorization layer (created earlier) to the datasets (train, validation, and test)**

In [24]:
train_ds = RawTrainDataset.map(vectorize)
val_ds = raw_val_ds.map(vectorize) 
test_ds = raw_test_ds.map(vectorize)

print(train_ds)

<MapDataset shapes: ((None, 250), (None,)), types: (tf.int64, tf.int32)>


### **Configure Dataset for performance**

In [26]:
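AT = tf.data.AUTOTUNE

# cache() keeps the data in memory after it is first loaded;
# prefetch() overlaps data preprocessing with model execution.
train_ds = train_ds.cache().prefetch(buffer_size=AT)
val_ds = val_ds.cache().prefetch(buffer_size=AT)
test_ds = test_ds.cache().prefetch(buffer_size=AT)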


### **Create a Model**


In [27]:
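emb_dim = 16  # size of each word-embedding vector

model = tfk.Sequential([
    lyrs.Embedding(mx_fea + 1, emb_dim),  # token id -> 16-d vector
    lyrs.Dropout(0.2),
    lyrs.GlobalAveragePooling1D(),  # average over the sequence axis
    lyrs.Dropout(0.2),
    lyrs.Dense(1)  # single output unit, no activation
])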

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          160016    
_________________________________________________________________
dropout (Dropout)            (None, None, 16)          0         
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
_________________________________________________________________
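
The parameter counts in the summary can be checked by hand. The short calculation below uses only the values from the model definition (`mx_fea + 1` embedding rows, `emb_dim = 16`, one dense output unit).

```python
vocab_rows = 10000 + 1  # mx_fea + 1 rows in the embedding table
emb_dim = 16

embedding_params = vocab_rows * emb_dim  # one 16-d vector per row
dense_params = emb_dim * 1 + 1           # 16 weights + 1 bias
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)  # 160016 17 160033
```

The dropout and pooling layers have no weights, which is why they contribute `0` parameters.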


### **Loss function and optimizer**
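
Because the final `Dense(1)` layer has no activation, the model outputs a raw logit rather than a probability. One loss that fits such an output is binary cross-entropy computed directly from logits. The sketch below shows that formula in plain Python as an illustration under this assumption, not as the notebook's actual training setup. The function names are hypothetical.

```python
import math

def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    # Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def predict_label(logit):
    # sigmoid(logit) >= 0.5 exactly when logit >= 0.
    return 1 if logit >= 0.0 else 0

print(bce_from_logit(0.0, 1))  # log(2): the model is maximally unsure
print(bce_from_logit(5.0, 1) < bce_from_logit(5.0, 0))  # True
print(predict_label(-1.3), predict_label(2.0))  # 0 1
```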

### **Model training and evaluation**

### **Plot Model metrics over time (loss, accuracy)**

### **Export Model and Inference on new data**