<a href="https://colab.research.google.com/github/SilahicAmil/NLP-NLTK/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

Sentiment analysis on the IMDB movie review dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import string
import shutil

# TensorFlow imports
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses

## Dataset import

In [None]:


url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1",
                                  url,
                                  untar=True,
                                  cache_dir=".",
                                  cache_subdir="")


dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")

In [None]:
os.listdir(dataset_dir)

['README', 'imdb.vocab', 'train', 'imdbEr.txt', 'test']

In [None]:
train_dir = os.path.join(dataset_dir, "train")
os.listdir(train_dir)

['neg',
 'urls_unsup.txt',
 'pos',
 'unsup',
 'urls_neg.txt',
 'urls_pos.txt',
 'labeledBow.feat',
 'unsupBow.feat']

In [None]:
sample_file = os.path.join(train_dir, "pos/1181_9.txt")
with open(sample_file) as f:
  print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


## Loading the dataset and some preprocessing

In [None]:
# removing irrelevant folder
remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir)

In [None]:
# Creating the training subset (80% of aclImdb/train); the rest is held out for validation
# text_dataset_from_directory creates a labeled tf.data.Dataset

batch_size = 32
seed = 42

train_set = tf.keras.utils.text_dataset_from_directory("aclImdb/train",
                                                       batch_size=batch_size,
                                                       validation_split=0.2,
                                                       subset="training",
                                                       seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


The training folder originally holds 25,000 examples. With validation_split=0.2, 80% of them (20,000) are used for training and the remaining 5,000 for validation.
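The split arithmetic can be checked by hand. The sketch below uses only the numbers reported above; the batch counts assume the final partial batch is kept rather than dropped, which is an assumption about the default batching behavior and not something the output confirms.

```python
import math

total_files = 25_000      # labeled reviews found in aclImdb/train
validation_split = 0.2    # same fraction passed to text_dataset_from_directory
batch_size = 32

# Number of files held out for validation and left for training
val_files = round(total_files * validation_split)
train_files = total_files - val_files

# Batches per epoch, assuming the last partial batch is kept
train_batches = math.ceil(train_files / batch_size)
val_batches = math.ceil(val_files / batch_size)

print(train_files, val_files)
print(train_batches, val_batches)
```

The file counts match the "Using 20000 files for training" message printed by the loader.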

In [None]:
# Printing out examples
for text_batch, label_batch in train_set.take(1):
  for i in range(5):
    print("Review", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 

The reviews contain raw text and occasional HTML line-break tags. Let's see how we can handle these.

Labels 0 and 1 correspond to negative and positive movie reviews, respectively.

0 - neg

1 - pos

This is confirmed below.

In [None]:
print('Label 0 is', train_set.class_names[0])
print('Label 1 is', train_set.class_names[1])

Label 0 is neg
Label 1 is pos


## Creating the validation and test datasets

In [None]:
# Validation set
val_set = tf.keras.utils.text_dataset_from_directory("aclImdb/train",
                                                     batch_size=batch_size,
                                                     validation_split=0.2,
                                                     subset="validation",
                                                     seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [None]:
# Test set

test_set = tf.keras.utils.text_dataset_from_directory("aclImdb/test",
                                                      batch_size=batch_size)

Found 25000 files belonging to 2 classes.


## Preparing dataset for training

Standardizing, tokenizing and vectorizing the datasets with tf.keras.layers.TextVectorization.

Standardization simplifies the text, typically by removing punctuation, HTML elements, and similar noise.

Tokenization splits a string into tokens, for example splitting a sentence into individual words on whitespace.

Vectorization converts tokens into numbers so they can be fed into a neural network.
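To make the three steps concrete, here is a minimal pure-Python sketch. It is an illustration only, not how TextVectorization is implemented. The helper names (standardize, tokenize, build_vocab, vectorize), the toy sentences, and the choice of id 0 for padding and id 1 for unknown tokens are all assumptions made for this example.

```python
import string
from collections import Counter

PAD_ID = 0  # assumed convention: 0 pads short sequences
OOV_ID = 1  # assumed convention: 1 marks out-of-vocabulary tokens


def standardize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def tokenize(text):
    """Standardize the text, then split it on whitespace."""
    return standardize(text).split()


def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to integer ids starting at 2."""
    counts = Counter(token for text in texts for token in tokenize(text))
    kept = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(kept)}


def vectorize(text, vocab, sequence_len):
    """Convert text to a fixed-length list of ids (truncate, then pad)."""
    ids = [vocab.get(token, OOV_ID) for token in tokenize(text)][:sequence_len]
    return ids + [PAD_ID] * (sequence_len - len(ids))


# Toy "training" corpus used to build the vocabulary
train_texts = ["A great film, great acting!", "A dull film."]
vocab = build_vocab(train_texts, max_tokens=10)

# "plot" never appears in the training texts, so it maps to OOV_ID
encoded = vectorize("Great plot!", vocab, sequence_len=5)
print(encoded)
```

The vocabulary is built from the training texts only, so a word seen for the first time at prediction time falls back to the unknown-token id.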

In [None]:
# Standardizing dataset

def standardize_datasets(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')

  return tf.strings.regex_replace(stripped_html,
                                 '[%s]' % re.escape(string.punctuation),
                                 '')
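To see what this standardization does to a review, the same three steps can be mirrored with Python's built-in re module. This is a sketch under the assumption that these particular patterns behave the same way in Python's regex engine as in tf.strings.regex_replace; standardize_plain is a hypothetical helper, and the sample string is taken from the review printed earlier.

```python
import re
import string

# Character class matching any single ASCII punctuation character
PUNCTUATION_PATTERN = "[%s]" % re.escape(string.punctuation)


def standardize_plain(text):
    """Plain-Python mirror of standardize_datasets, for inspection only."""
    lowercase = text.lower()
    stripped_html = re.sub("<br />", " ", lowercase)
    return re.sub(PUNCTUATION_PATTERN, "", stripped_html)


sample = "How bizarre is that?<br /><br />*1/2 (out of four)"
cleaned = standardize_plain(sample)
print(cleaned.split())
# ['how', 'bizarre', 'is', 'that', '12', 'out', 'of', 'four']
```

Note the side effect: the rating "*1/2" collapses to the token "12", because the slash and asterisk are removed as punctuation.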

In [None]:
# The TextVectorization layer standardizes, tokenizes, and vectorizes in one step
MAX_FEATS = 10000
SEQUENCE_LEN = 250

vectorization_layer = layers.TextVectorization(
    standardize=standardize_datasets,
    max_tokens=MAX_FEATS,
    output_mode="int", # creates unique int for each token
    output_sequence_length=SEQUENCE_LEN)

Note: Call .adapt() only on the training data, so that no information from the validation or test sets leaks into the vocabulary.

In [None]:
# Text only dataset, no labels
train_text_set = train_set.map(lambda x, y: x)
vectorization_layer.adapt(train_text_set)

In [None]:
# Function to see results of the layer
def vect_text(text, label):
  text = tf.expand_dims(text, -1)
  
  return vectorization_layer(text), label

In [None]:
# Review batch from the dataset

text_batch, label_batch = next(iter(train_set))
first_review, first_label = text_batch[0], label_batch[0]

print(f"First Review: {first_review}\nFirst Label {train_set.class_names[first_label]}\nVectorized Review: {vect_text(first_review, first_label)}")

First Review: b'Recipe for one of the worst movies of all time: a she-male villain who looks like it escaped from the WWF, has terrible aim with a gun that has inconsistent effects (the first guy she shoots catches on fire but when she shoots anyone else they just disappear) and takes time out to pet a deer. Then you got the unlikable characters, 30 year old college students, a lame attempt at a surprise ending and lots, lots more. Avoid at all costs.'
First Label neg
Vectorized Review: (<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[9257,   15,   28,    5,    2,  241,   91,    5,   30,   58,    4,
           1, 1011,   36,  262,   38,    9, 3891,   35,    2,    1,   43,
         382, 5223,   16,    4, 1113,   12,   43, 5739,  300,    2,   83,
         225,   55, 3209, 3898,   20,  973,   18,   51,   55, 3209,  250,
         320,   34,   40, 4386,    3,  294,   58,   44,    6, 2911,    4,
        6757,   92,   22,  184,    2, 4916,  100, 1221,  336,  161, 1199,
        1484,  

We can see that each token has been replaced by an integer. Let's see which token corresponds to which integer.

In [23]:
vocab = vectorization_layer.get_vocabulary()  # fetch the vocabulary once and reuse it
print(f"1337 -> {vocab[1337]}\n420 -> {vocab[420]}\nVocab Size: {len(vocab)}")

1337 -> sent
420 -> yes
Vocab Size: 10000
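The vocabulary lookup also works in the other direction, turning a vectorized review back into tokens. In the list returned by get_vocabulary() for an int-mode layer, index 0 is the padding token (an empty string) and index 1 is the out-of-vocabulary token "[UNK]", which is why the vectorized review above contains several 1s. The sketch below uses a small hypothetical vocabulary list in place of the real 10,000-entry one, and decode is a helper made up for this example.

```python
def decode(ids, vocabulary):
    """Map integer ids back to tokens, skipping padding (id 0)."""
    return [vocabulary[i] for i in ids if i != 0]


# Hypothetical stand-in for vectorization_layer.get_vocabulary()
toy_vocabulary = ["", "[UNK]", "the", "and", "a", "movie"]

# A padded sequence: two known tokens, one unknown token, two padding slots
encoded = [2, 5, 1, 0, 0]
print(decode(encoded, toy_vocabulary))  # ['the', 'movie', '[UNK]']
```

With the real layer, passing vectorization_layer.get_vocabulary() as the second argument should give a rough reconstruction of the standardized review, with rare words shown as "[UNK]".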
