<a href="https://colab.research.google.com/github/Elzfe09/SentimentAnalysis-DL/blob/main/sentiment_analysis_DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis
Sentiment analysis is an NLP task that extracts the sentiment (emotion) expressed in a given piece of text.

## Steps
1. Install the required libraries
2. Get a dataset and load it
3. Preprocess the dataset
4. Build and train the model
5. Use the model on unseen data

In [1]:
!pip install tensorflow pandas scikit-learn



## Libraries
Keras is a high-level deep learning API (sentiment analysis here is a deep learning task) that makes it easy to build, train, evaluate, and run all sorts of neural networks. It depends on a lower-level backend library that performs the actual computation.

Keras can work with backends such as TensorFlow, Microsoft Cognitive Toolkit (CNTK), Theano, Apache MXNet, and Apple's Core ML.

In [2]:
import pandas as pd

df = pd.read_csv('train.txt', sep = ';', names = ['text', 'emotion'])
df

| | text | emotion |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |
| ... | ... | ... |
| 15995 | i just had a very brief time in the beanbag an... | sadness |
| 15996 | i am now turning and i feel pathetic that i am... | sadness |
| 15997 | i feel strong and good overall | joy |
| 15998 | i feel like this was such a rude comment and i... | anger |


In [3]:
df['emotion'].unique()

array(['sadness', 'anger', 'love', 'surprise', 'fear', 'joy'],
      dtype=object)

# Preprocessing

- DL models work with numerical data, so features such as emotion need to be encoded into a numerical representation.

- **Label encoding** assigns a unique integer to each distinct category of a categorical variable. LabelEncoder numbers the classes in sorted order, so here anger = 0, fear = 1, joy = 2, love = 3, sadness = 4, surprise = 5.

- After encoding, the dataset needs to be split into two segments. The training set is used to train the model, and the testing set is used to evaluate the model once it is trained.
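The two preprocessing ideas above can be sketched in plain Python. This is a simplified stand-in for LabelEncoder and train_test_split, not their actual implementation; the ten sample labels and the seed are made up for illustration.

```python
import random

# Hypothetical sample of labels (the real notebook reads them from train.txt).
emotions = ["sadness", "anger", "love", "surprise", "fear",
            "joy", "sadness", "joy", "anger", "joy"]

# Label encoding: sort the distinct labels and number them from 0.
classes = sorted(set(emotions))
label_to_id = {label: i for i, label in enumerate(classes)}
encoded = [label_to_id[e] for e in emotions]

# Decoding is the same lookup in reverse (what inverse_transform does).
decoded = [classes[i] for i in encoded]

# 80/20 split: shuffle the row indices, then cut off the first 20% as the test set.
rng = random.Random(42)  # fixed seed so the split is reproducible
indices = list(range(len(encoded)))
rng.shuffle(indices)
n_test = round(len(indices) * 0.2)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(label_to_id)
print(encoded)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters for the same reason that passing random_state to train_test_split does: without it, every run produces a different split and slightly different results.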

In [4]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder

#encode label emotion into number
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['emotion'])

#split data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size = 0.2)
#x is independent (text), y is dependent (emotion)

#tokenize the text
tokenizer = Tokenizer(num_words = 5000, oov_token = '<OOV>')
tokenizer.fit_on_texts(X_train)

#convert text to sequences and pad them
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_seq, maxlen = 100, padding = 'post')
X_test_pad = pad_sequences(X_test_seq, maxlen = 100, padding = 'post')

## Explanation of the code above

- The Keras Tokenizer class breaks the text down into smaller units called tokens. In general, tokens can be words, subwords, or character groups; this Tokenizer splits on words by default.
- Each token is then assigned a unique integer for processing.
- **fit_on_texts()** builds the vocabulary, and **texts_to_sequences()** applies it to the text.

*fit_on_texts* >> counts word frequencies and assigns each word an index by rank: the most frequent words get the lowest integer values (1, 2, 3, ...), and rarer words get larger values.

*texts_to_sequences(X_train)* >> transforms each text string into a sequence of integers, where each integer is the index of a word in the vocabulary that the tokenizer learned during fit_on_texts.

The *num_words* parameter of the Keras Tokenizer sets the maximum number of words to keep, based on frequency; rarer words are treated as out of vocabulary. The *oov_token* argument configures how the tokenizer handles words not present in its vocabulary: they are all mapped to the ID of a special token (“<OOV>”) instead of being dropped.

- A neural network requires fixed-size inputs, but text sequences vary in length, so the **pad_sequences()** function is used to standardize their length. The maxlen argument defines the standard length (here 100): shorter sequences are padded with zeros (at the end, because padding = 'post') and longer ones are truncated.
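To make the vocabulary, OOV, and padding steps concrete, here is a minimal pure-Python sketch of the same ideas. It is a simplified imitation, not the Keras code: the sample sentences, the tiny NUM_WORDS value, and the helper names are made up, and the tokenization is a plain whitespace split.

```python
from collections import Counter

train_texts = ["i feel sad", "i feel happy today", "i am happy"]  # made-up data
NUM_WORDS = 5   # keep only indices below this value (the notebook uses 5000)
OOV_ID = 1      # index 0 is reserved for padding, 1 for the "<OOV>" token

# fit_on_texts analogue: count words, then rank them by frequency.
counts = Counter(word for text in train_texts for word in text.split())
ranked = [w for w, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
word_index = {w: i for i, w in enumerate(ranked, start=2)}

def texts_to_sequences(texts):
    """Map each word to its index; unknown or too-rare words become OOV_ID."""
    sequences = []
    for text in texts:
        ids = [word_index.get(w, OOV_ID) for w in text.split()]
        sequences.append([i if i < NUM_WORDS else OOV_ID for i in ids])
    return sequences

def pad_post(sequences, maxlen):
    """Keep the last maxlen tokens, then append zeros up to maxlen."""
    return [seq[-maxlen:] + [0] * (maxlen - len(seq[-maxlen:])) for seq in sequences]

sequences = texts_to_sequences(["i feel sad", "i am so happy"])
padded = pad_post(sequences, maxlen=6)
print(word_index)
print(padded)
```

In this toy run "sad" and "am" are in the vocabulary but rank too low for NUM_WORDS, and "so" was never seen, so all three map to the OOV index.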

## Build and train the model
- Keras has different kinds of API for building models.
- The Sequential API is used to build sequential neural network (DL) models, where layers are stacked one after another.
- Input data flows from the input layer (first layer) to the output layer (last layer) and is processed by each layer in between.
- Each layer takes data from the previous layer and, after processing it, feeds the result to the next layer.

- Another kind of API in Keras is the Functional API, which is used to create more complex neural networks.

- This example defines three kinds of layer (Embedding, GlobalAveragePooling1D, Dense).
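As a rough check on how data flows through this stack, the sketch below works out the output shape and trainable parameter count of each layer by hand. The sizes are taken from the model cell (vocabulary 5000, embedding size 16, input length 100, 16 hidden units); the 6 output classes is an assumption based on the six emotions listed earlier.

```python
VOCAB_SIZE = 5000    # Embedding input_dim
EMBED_DIM = 16       # Embedding output_dim
SEQ_LEN = 100        # maxlen used in pad_sequences
HIDDEN_UNITS = 16    # first Dense layer
NUM_CLASSES = 6      # assumed: one output neuron per emotion

# (layer name, output shape without the batch dimension, trainable parameters)
layers = [
    ("Embedding", (SEQ_LEN, EMBED_DIM), VOCAB_SIZE * EMBED_DIM),
    ("GlobalAveragePooling1D", (EMBED_DIM,), 0),
    # Dense parameters = inputs * units (weights) + units (biases)
    ("Dense (relu)", (HIDDEN_UNITS,), EMBED_DIM * HIDDEN_UNITS + HIDDEN_UNITS),
    ("Dense (softmax)", (NUM_CLASSES,), HIDDEN_UNITS * NUM_CLASSES + NUM_CLASSES),
]

total_params = sum(params for _, _, params in layers)
for name, shape, params in layers:
    print(f"{name:<24} output={('batch',) + shape}  params={params}")
print("total trainable parameters:", total_params)
```

Almost all of the parameters sit in the embedding table, which is typical for small text classifiers of this kind.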

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense
model = Sequential([
    Embedding(input_dim=5000, output_dim=16),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(len(df['label'].unique()), activation='softmax')
])
# the batch size can be None; 100 is the input length
model.summary()



In [6]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train_pad, y_train, epochs=20, validation_data=(X_test_pad, y_test))

Epoch 1/20
[1m400/400[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3ms/step - accuracy: 0.3320 - loss: 1.6054 - val_accuracy: 0.3353 - val_loss: 1.5855
Epoch 2/20
[1m400/400[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.3381 - loss: 1.5705 - val_accuracy: 0.3353 - val_loss: 1.5800
Epoch 3/20
[1m400/400[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.3471 - loss: 1.5588 - val_accuracy: 0.3353 - val_loss: 1.5722
Epoch 4/20
[1m400/400[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.3588 - loss: 1.5483 - val_accuracy: 0.4256 - val_loss: 1.5583
Epoch 5/20
[1m400/400[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.3767 - loss: 1.5351 - val_accuracy: 0.4097 - val_loss: 1.5305
Epoch 6/20
[1m400/400[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.4254 - loss: 1.4884 - val_accuracy: 0.3575 - val_loss: 1.4763
Epoch 7/20
[1m400/400[0m 

In [18]:
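def predict_emotion(text):
    """Return the predicted emotion label for a single text string."""
    # Same preprocessing as in training: tokenize, then pad to length 100
    seq = tokenizer.texts_to_sequences([text])
    pad = pad_sequences(seq, maxlen=100, padding='post')
    # pred has shape (1, number of classes); argmax picks the most likely class
    pred = model.predict(pad)
    emotion = label_encoder.inverse_transform([pred.argmax()])[0]
    return emotion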


print(predict_emotion('I feel weak'))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
sadness


# Explanation of the layers
- ***Embedding*** in NLP is a vector representation of tokens that captures semantic relationships between them. Words with similar meanings tend to have similar embedding vectors. It is the first layer because it takes the integer text sequence and turns it into embeddings.

- **input_dim** defines the size of the vocabulary, i.e. the total number of unique categories or words to be embedded, and **output_dim** defines the size of the dense embedding vector of each word. Here each word is represented by a 16-dimensional vector. For example, after embedding, a 5-word (5-token) text is represented as a 5 x 16 matrix.

- The ***GlobalAveragePooling1D*** layer performs global average pooling over the embedding output: it takes the sequence of 16-dimensional vectors and calculates the average of each dimension across the sequence. It effectively collapses the sequence into a single 16-dimensional vector (dimensionality reduction). The result is a fixed-length output vector that summarizes the whole sequence with one value per feature.

- ***Hidden dense layer***: a fully connected layer. It multiplies the input by a weight matrix, adds a bias, and passes the result through an activation function.

-- ReLU (rectified linear unit) introduces non-linearity, which allows the model to learn complex relationships within the data, and helps mitigate the **vanishing gradient problem**. This problem occurs during backpropagation (computing the gradient and propagating it backwards from layer to layer so that the weights can be updated): the gradient sometimes becomes really small (approaches 0) by the time it arrives at the initial layers (near the input). This causes the following:
1. The weights in the first layers barely change during training.
2. Those layers have difficulty learning important features.
3. The model stops learning effectively, especially in deep networks.

Why?
- By the chain rule, the gradient at an early layer is a product of terms from the activation functions and weights of the layers after it. With repeated multiplication by values smaller than 1, the gradient shrinks exponentially.

If the gradient factor at each layer is 0.1, after 10 layers the gradient is scaled by 0.1^10 = 1e-10.
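The arithmetic behind this can be checked in a few lines. The sketch below only multiplies identical per-layer factors and ignores the weights, so it is an illustration of the exponential effect rather than a real backpropagation pass. The sigmoid comparison is an added assumption for contrast: its slope is never larger than 0.25, while ReLU has slope 1 for positive inputs.

```python
import math

def gradient_scale(per_layer_factor, num_layers):
    """Product of identical per-layer factors, as produced by the chain rule."""
    return math.prod([per_layer_factor] * num_layers)

small = gradient_scale(0.1, 10)           # the example from the text
sigmoid_best = gradient_scale(0.25, 10)   # best case for a sigmoid activation
relu_active = gradient_scale(1.0, 10)     # ReLU on positive inputs

print(f"0.1 per layer:  {small:.1e}")
print(f"0.25 per layer: {sigmoid_best:.1e}")
print(f"1.0 per layer:  {relu_active:.1e}")
```

Because the ReLU factor is exactly 1 on active units, the activation itself does not shrink the gradient, which is why it is said to mitigate the problem.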

- The last layer is also a dense layer, with a softmax activation function. Why? Softmax converts the raw numbers produced by the neurons into probabilities that sum to 1, so that the model can choose the emotion class with the highest probability.
- The number of neurons in the last layer equals the number of emotion classes to predict (6 here).
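The four layers described in this section can be traced end to end with a toy forward pass in plain Python. This is a sketch under stated assumptions, not the Keras implementation: the sizes are shrunk (10-word vocabulary, 4-dimensional embeddings, 3 hidden units), the weights are random rather than trained, and all helper names are made up.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the toy weights are reproducible
VOCAB, EMBED_DIM, HIDDEN, NUM_CLASSES = 10, 4, 3, 6

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(VOCAB, EMBED_DIM)            # one row per token ID
w1, b1 = rand_matrix(EMBED_DIM, HIDDEN), [0.0] * HIDDEN
w2, b2 = rand_matrix(HIDDEN, NUM_CLASSES), [0.0] * NUM_CLASSES

def dense(x, w, b):
    """Fully connected layer: x @ w + b."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shifted = [math.exp(v - max(values)) for v in values]  # subtract max for stability
    total = sum(shifted)
    return [s / total for s in shifted]

tokens = [2, 3, 5, 0, 0]                                   # a padded 5-token sequence
embedded = [embedding[t] for t in tokens]                  # 5 x 4 matrix
pooled = [sum(col) / len(col) for col in zip(*embedded)]   # average each dimension
hidden = relu(dense(pooled, w1, b1))
probs = softmax(dense(hidden, w2, b2))
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)

print(len(embedded), "x", len(embedded[0]), "->", len(pooled), "->", len(probs))
print("probabilities sum to", round(sum(probs), 6), "predicted class", predicted)
```

Note that the padding zeros are looked up and averaged like any other token here; the notebook's model behaves the same way because its Embedding layer is created without masking.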

## Compile Method
----
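The compile() method configures the training parameters of the model. Training requires a loss function and an optimizer. The optimizer adjusts the weights and biases of the network during training to minimize the loss function.
- Loss function: quantifies the error by comparing the model's predictions to the actual target values (labels).

Common loss functions:
- **categorical_crossentropy, sparse_categorical_crossentropy, mse**. This notebook uses sparse_categorical_crossentropy because the labels are integers rather than one-hot vectors.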
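For a single example, the two cross-entropy variants compute the same number and differ only in how the label is supplied. The sketch below shows this in plain Python; the probabilities are made up, and unlike Keras it does not clip probabilities or average over a batch.

```python
import math

def sparse_categorical_crossentropy(true_class, probs):
    """Label given as an integer class index."""
    return -math.log(probs[true_class])

def categorical_crossentropy(one_hot, probs):
    """Label given as a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.05, 0.05, 0.10, 0.10, 0.60, 0.10]   # hypothetical softmax output
true_class = 4                                  # e.g. "sadness" after label encoding
one_hot = [0, 0, 0, 0, 1, 0]

loss_sparse = sparse_categorical_crossentropy(true_class, probs)
loss_one_hot = categorical_crossentropy(one_hot, probs)

# A model that guesses uniformly over 6 classes has loss ln(6), about 1.79.
uniform_loss = sparse_categorical_crossentropy(0, [1 / 6] * 6)

print(round(loss_sparse, 4), round(loss_one_hot, 4), round(uniform_loss, 4))
```

The uniform-guess value is a handy reference point when reading a training log: a loss well below it means the model has learned something beyond chance.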

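The fit() method begins the actual training of the model on the training data. Here training runs for 20 epochs, where one epoch is one full pass over the training data. Over the epochs the optimizer works to reduce the loss and accuracy generally improves, although validation accuracy can fluctuate from one epoch to the next, as it does in the training log.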
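The "400/400" shown in the training log is the number of batches per epoch, and it can be reproduced with a little arithmetic. This is a sketch under two assumptions: that the full dataset has 16,000 rows, and that fit() uses the Keras default batch size of 32 because none was passed.

```python
import math

total_rows = 16000   # assumption: size of the full dataset
test_fraction = 0.2  # test_size used in train_test_split
batch_size = 32      # Keras default when batch_size is not given to fit()
epochs = 20

test_rows = round(total_rows * test_fraction)
train_rows = total_rows - test_rows
steps_per_epoch = math.ceil(train_rows / batch_size)  # batches per epoch
total_updates = steps_per_epoch * epochs              # weight updates overall

print(train_rows, steps_per_epoch, total_updates)
```

Each step is one weight update on one batch, so an epoch is better thought of as 400 small iterations than as a single one.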

## Model predict

The predict_emotion(text) function predicts the emotion for an input text. It follows the same path as model building: first tokenize the text, convert it into a sequence, and pad the sequence before feeding it to the model to determine the sentiment.

-  ***label_encoder.inverse_transform([pred.argmax()])[0]*** is used to convert the model's numerical prediction back to the original emotion label (str).

- The model predicts the probabilities (pred) of each emotion (class) in the dataset for the given text, one per distinct value of the label feature. Since we need a single emotion (class) for the text, argmax() finds the index of the class with the highest probability.

- The label encoder that was used earlier to convert emotions to integers is now used in reverse: **inverse_transform()** takes a list of integers and converts them back to strings (words).

- Since inverse_transform returns an array (even if there is only one prediction), [0] is used to extract the first element, giving us the final emotion.
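The decoding step can be followed by hand with a small plain-Python sketch. The probabilities are made up, and the list lookups stand in for pred.argmax() and label_encoder.inverse_transform(); the class order is the sorted order that LabelEncoder produces for these six labels.

```python
# Classes in the sorted order used by the label encoder.
classes = ["anger", "fear", "joy", "love", "sadness", "surprise"]

# Hypothetical model output for one text: shape (1, 6), one probability per class.
pred = [[0.05, 0.10, 0.05, 0.05, 0.70, 0.05]]

# argmax over the flattened output: index of the highest probability.
flat = [p for row in pred for p in row]
best_index = max(range(len(flat)), key=flat.__getitem__)

# inverse_transform analogue: map a list of integers back to label strings.
labels = [classes[i] for i in [best_index]]

# The result is a one-element sequence, so [0] extracts the label itself.
emotion = labels[0]
print(best_index, emotion)
```

Flattening works here only because a single text is predicted at a time; for a batch of several texts the argmax would need to be taken per row.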