# This is a Basic Sentiment Analysis Project | Aadi Kulkarni
# Libraries Used
- TensorFlow/Keras: used for defining, building, and training the model
- Pandas: used for loading the dataset
- NumPy: used for working with arrays
- Matplotlib: used for plotting the data
- Scikit-learn: used for splitting the data into train and test sets

# Importing necessary libraries

In [2]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
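# Import from tensorflow.keras so all Keras pieces come from the same package
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences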
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import losses
from sklearn.model_selection import train_test_split

# Load in our dataset
This reads our CSV file, `data.csv`, into a dataframe with **Pandas**.
`X` is the input data, taken from the dataframe's `text` column.
`y` is the output data, taken from the dataframe's `label` column.

In [3]:
df = pd.read_csv('data.csv') # Dataframe

X = df['text'] # Input data
y = df['label'] # Output data
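
Before going further, it can help to confirm that both columns exist and that the labels really are 0/1. The following is a minimal sketch that uses a tiny in-memory CSV as a stand-in for `data.csv`. The sample rows and the names `sample_csv` and `df_sample` are made up for illustration; only the `text` and `label` column names come from the notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv: same column names, made-up rows.
sample_csv = io.StringIO(
    "text,label\n"
    "I loved this movie,1\n"
    "Terrible and boring,0\n"
    "Great acting and story,1\n"
)

df_sample = pd.read_csv(sample_csv)

# Both columns the notebook relies on should be present.
missing_columns = {"text", "label"} - set(df_sample.columns)

# Rows with a missing text or label would cause trouble when tokenizing or training.
rows_with_gaps = int(df_sample[["text", "label"]].isna().any(axis=1).sum())

# For binary sentiment, the labels should only contain 0 and 1.
label_values = sorted(df_sample["label"].unique().tolist())

print("Missing columns:", missing_columns)
print("Rows with gaps:", rows_with_gaps)
print("Label values:", label_values)
```

The same three checks could be run on the real `df` right after `pd.read_csv('data.csv')`.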

# Preprocessing the data
This section of code is going to do the following:
- Tokenize the text (e.g. "Hello, my name is John" -> ['hello', 'my', 'name', 'is', 'john']). By default, the Keras `Tokenizer` lowercases the text and strips punctuation.
- Convert the text to sequences of integer word IDs (e.g. ['hello', 'my', 'name', 'is', 'john'] -> [3, 6, 8, 2, 4]).
- Pad the sequences so they are all the same length (e.g. [[3, 6, 8, 2, 4], [3, 6]] -> [[3, 6, 8, 2, 4], [0, 0, 0, 3, 6]]). By default, `pad_sequences` adds zeros to the front of each shorter sequence until it matches the longest one. The model expects every input sequence to have the same length, so that a batch forms a rectangular array with no "holes".
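
The three steps above can be sketched in plain Python to show what is going on under the hood. This is only a rough approximation of the default behaviour of the Keras `Tokenizer` and `pad_sequences` (lowercasing, dropping punctuation, giving frequent words smaller IDs, padding at the front), not their actual implementation. The helper names `tokenize`, `build_word_index`, `to_sequences`, and `pad` are made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts):
    """Map each word to an integer ID starting at 1; frequent words get smaller IDs."""
    counts = Counter(word for text in texts for word in tokenize(text))
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace every known word with its integer ID."""
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]

def pad(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[value] * (max_len - len(s)) + s for s in sequences]

texts = ["Hello, my name is John", "Hello John"]

word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
padded = pad(sequences)

print(word_index)  # {'hello': 1, 'john': 2, 'my': 3, 'name': 4, 'is': 5}
print(sequences)   # [[1, 3, 4, 5, 2], [1, 2]]
print(padded)      # [[1, 3, 4, 5, 2], [0, 0, 0, 1, 2]]
```

ID 0 is never assigned to a word, which is why it is safe to use as the padding value.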

# Important Note
We aren't preprocessing the labels (`y`, the output data) because, unlike `X`, they are already numeric and all have the same shape: each label is a single 1 or 0.

In [10]:
# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)

# Convert the text to sequences
X_seq = tokenizer.texts_to_sequences(X)

# pad the sequences so they are all the same dimension
X_pad = pad_sequences(X_seq)

We can now "visualize" what the X data looks like to the neural network by printing out the first 5 sequences.

In [11]:
print("Preview of X_pad:", X_pad[:5])
print('Preview of X_seq: ', X_seq[:5])
# This is to show why we pad the sequences
print("Length of X_seq 1 and X_seq 2:", len(X_seq[0]), len(X_seq[1]))
print("Length of X_pad 1 and X_pad 2:", len(X_pad[0]), len(X_pad[1]))

Preview of X_pad: [[   0    0    0 ...  280  243    8]
 [   0    0    0 ...   47  543   93]
 [   0    0    0 ...  162   38  496]
 [   0    0    0 ...    1 1114  455]
 [   0    0    0 ...    3  510  652]]
Preview of X_seq:  [[10, 219, 927, 11, 216, 119, 14, 110, 2, 655, 7567, 2383, 80, 1153, 4243, 13, 619, 8, 9, 3, 2593, 18, 93, 27, 257, 2, 1365, 14878, 3026, 95, 2, 435, 14879, 418, 1023, 10, 1775, 12, 1, 2299, 13, 29, 1, 92, 20, 1, 76, 502, 4, 1, 636, 820, 143, 10, 96, 25, 39, 2015, 53, 3, 2697, 637, 1, 636, 5, 79, 1, 2299, 38, 55, 5, 1, 220, 5, 430, 1, 1169, 18, 93, 132, 21, 39, 79, 53, 3, 1192, 637, 1, 815, 5, 104, 220, 8, 149, 1248, 3551, 5, 12, 10, 311, 34, 10, 948, 5, 39, 3176, 3027, 20, 1, 5845, 3, 180, 902, 1983, 65, 8, 1124, 16, 67, 47, 1983, 65, 8, 1124, 13, 32, 350, 4, 11, 118, 32, 350, 44, 60, 10, 345, 50, 121, 582, 12, 10, 65, 283, 173, 89, 2, 50, 3938, 863, 7040, 18361, 4834, 464, 20, 1240, 4395, 38, 24641, 2127, 15, 153, 1617, 5, 145, 54, 4396, 53, 15, 3, 283, 173, 10, 13

Now we will split our data into train and test sets. This lets us train the model on one part of the data and evaluate its performance on data it has not seen.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_pad, y, test_size=0.2)
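
Conceptually, an 80/20 split shuffles the row indices and cuts them at the 80% mark. The sketch below shows that idea in NumPy on toy arrays. It is not scikit-learn's implementation, and the toy data, the seed, and the `split_indices` helper are assumptions made for illustration.

```python
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=0):
    """Return shuffled (train_idx, test_idx) index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(np.ceil(n_samples * test_size))
    # The first n_test shuffled rows become the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-ins for X_pad and y: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

train_idx, test_idx = split_indices(len(X_toy), test_size=0.2, seed=0)
X_train_toy, X_test_toy = X_toy[train_idx], X_toy[test_idx]
y_train_toy, y_test_toy = y_toy[train_idx], y_toy[test_idx]

print(X_train_toy.shape, X_test_toy.shape)  # (8, 2) (2, 2)
```

Fixing the seed makes the split repeatable. In scikit-learn, `train_test_split` accepts a `random_state` argument for the same purpose, and `stratify=y` keeps the ratio of labels the same in both sets.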

# Define the model
We will now be defining the model with the following architecture and layers.
In a nutshell, here is the job of each layer:
- **Embedding Layer:** This layer takes in an integer matrix of shape (batch_size, sequence_length) and produces an output of shape (batch_size, sequence_length, output_dim). It learns one vector of size `output_dim` for each of the `input_dim` word IDs. Here `input_dim` is the vocabulary size plus 1 because ID 0 is reserved for padding.
- **GlobalAveragePooling1D:** This layer takes in the sequence of word vectors for each input and returns a single vector: the average over the sequence.
- **Dropout:** This layer randomly sets input units to 0 at the given rate (0.2 here) at each step during training, which helps prevent overfitting.
- **Dense:** This layer has 32 units which are used to compute an output and uses the relu activation function.
- **Dense:** This layer has 16 units which are used to compute an output and uses the relu activation function.
- **Dense:** This layer has 1 unit and uses the sigmoid activation function to output a value between 0 and 1, which we interpret as the probability of the input being positive.
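
To make the shapes in the list above concrete, here is a rough NumPy sketch of how data flows through an embedding lookup, average pooling, and a final sigmoid unit. The sizes and random weights are toy values chosen for illustration, not the notebook's real ones, and the Dropout and hidden Dense layers are left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: 10 word IDs (0 is padding), 3-dimensional word vectors.
vocab_size, embed_dim = 10, 3

# Two padded sequences of length 4, shape (batch_size, sequence_length).
batch = np.array([[0, 0, 3, 6],
                  [0, 1, 2, 5]])

# Embedding: a lookup table with one row (vector) per word ID.
embedding_table = rng.normal(size=(vocab_size, embed_dim))
embedded = embedding_table[batch]   # (batch_size, sequence_length, embed_dim)

# GlobalAveragePooling1D: average over the sequence axis.
# This sketch averages every position, including the padded ones.
pooled = embedded.mean(axis=1)      # (batch_size, embed_dim)

# Final Dense(1) with sigmoid: one value between 0 and 1 per input.
weights = rng.normal(size=(embed_dim, 1))
bias = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(pooled @ weights + bias)))  # (batch_size, 1)

print(embedded.shape)  # (2, 4, 3)
print(pooled.shape)    # (2, 3)
print(probs.shape)     # (2, 1)
```

In the real model the embedding table and the Dense weights start out random like this and are adjusted during training.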

In [None]:
model = keras.Sequential([
    layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=32),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])