# 1. Select a dataset of social media comments labeled with different sentiment classes (you can find some good datasets on Kaggle - https://www.kaggle.com) and **preprocess the data**. Don't forget to split it into training and test sets. Note that the selection of a suitable dataset is important here.

I chose the dataset: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis

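The dataset already provides separate training and test files. This notebook loads only `twitter_training.csv` and splits it into training and test sets itself in step 3.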

## **Preprocessing**

In [1]:
import numpy as np
import pandas as pd
import re    # regular expressions, used below to strip punctuation
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing import sequence
from keras.callbacks import EarlyStopping

In [2]:
data = pd.read_csv('/content/twitter_training.csv',header=None)
data.head()

Unnamed: 0,0,1,2,3
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


In [3]:
new_columns=["id","type","sentiment","text"]
data.columns = new_columns
data.head()

Unnamed: 0,id,type,sentiment,text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


In [4]:
# Drop columns that are not important
data=data[['text','sentiment']]
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       73996 non-null  object
 1   sentiment  74682 non-null  object
dtypes: object(2)
memory usage: 1.1+ MB


In [5]:
#select positive & negative only
data = data[data.sentiment != "Neutral"]
data = data[data.sentiment != "Irrelevant"]

data

Unnamed: 0,text,sentiment
0,im getting on borderlands and i will murder yo...,Positive
1,I am coming to the borders and I will kill you...,Positive
2,im getting on borderlands and i will kill you ...,Positive
3,im coming on borderlands and i will murder you...,Positive
4,im getting on borderlands 2 and i will murder ...,Positive
...,...,...
74677,Just realized that the Windows partition of my...,Positive
74678,Just realized that my Mac window partition is ...,Positive
74679,Just realized the windows partition of my Mac ...,Positive
74680,Just realized between the windows partition of...,Positive


In [6]:
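data = data.copy()  # work on a copy so the assignments below do not trigger SettingWithCopyWarning
data['sentiment'] = data['sentiment'].astype(str)
data['text'] = data['text'].astype(str)  # note: missing tweets (NaN) become the literal string 'nan'
data['text'] = data['text'].str.lower()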
data

Unnamed: 0,text,sentiment
0,im getting on borderlands and i will murder yo...,Positive
1,i am coming to the borders and i will kill you...,Positive
2,im getting on borderlands and i will kill you ...,Positive
3,im coming on borderlands and i will murder you...,Positive
4,im getting on borderlands 2 and i will murder ...,Positive
...,...,...
74677,just realized that the windows partition of my...,Positive
74678,just realized that my mac window partition is ...,Positive
74679,just realized the windows partition of my mac ...,Positive
74680,just realized between the windows partition of...,Positive


In [7]:
# re.sub(pattern, repl, string): drop everything except lowercase letters, digits and whitespace
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-z0-9\s]', '', x))
data

Unnamed: 0,text,sentiment
0,im getting on borderlands and i will murder yo...,Positive
1,i am coming to the borders and i will kill you...,Positive
2,im getting on borderlands and i will kill you all,Positive
3,im coming on borderlands and i will murder you...,Positive
4,im getting on borderlands 2 and i will murder ...,Positive
...,...,...
74677,just realized that the windows partition of my...,Positive
74678,just realized that my mac window partition is ...,Positive
74679,just realized the windows partition of my mac ...,Positive
74680,just realized between the windows partition of...,Positive
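The two cleaning steps above (lowercasing, then removing every character that is not a lowercase letter, digit, or whitespace) can be sketched as a single helper. This is a minimal illustration, not the notebook's code: the name `clean_text` and the sample sentence are made up for the example.

```python
import re

# Anything that is not a lowercase letter, a digit, or whitespace gets removed.
_NOT_ALLOWED = re.compile(r"[^a-z0-9\s]")

def clean_text(text) -> str:
    """Lowercase the text, then strip punctuation and other symbols."""
    return _NOT_ALLOWED.sub("", str(text).lower())

sample = "I am coming to the Borders, and I WILL kill you all!!!"  # made-up sample
print(clean_text(sample))
# i am coming to the borders and i will kill you all

# A missing value converted with str() survives as the word 'nan'.
print(clean_text(float("nan")))
# nan
```

The second call shows a side effect worth keeping in mind: the `text` column has missing entries (73996 non-null out of 74682), and converting them to strings turns them into the token `nan` rather than dropping them.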


## **Build the tokenizer**

In [8]:
top_words = 10000
tokenizer = Tokenizer(num_words=top_words, split=' ')
tokenizer.fit_on_texts(data['text'].values)

In [9]:
word_index =  tokenizer.word_index
#word_index

In [10]:
len(word_index)

23315

In [11]:
X = tokenizer.texts_to_sequences(data['text'].values)
print(len(X))
print(X[0])

43374
[29, 130, 14, 106, 4, 2, 58, 1448, 13, 26]


In [12]:
# padding
X = pad_sequences(X)
print(X[0])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0   29  130   14  106    4    2   58 1448   13   26]


In [13]:
len(X[0])

166
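To make the tokenizer and padding steps less of a black box, here is a rough pure-Python approximation of what they do: rank words by frequency, keep only indices below `num_words`, and left-pad with zeros. The helper names and the toy corpus are assumptions for illustration; the real Keras `Tokenizer` also applies its own filtering, which this sketch omits.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Map words to indices, skipping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[value] * (max_len - len(s)) + s for s in sequences]

texts = ["i will kill you all", "i will win", "you win"]  # toy corpus, not from the dataset
word_index = fit_word_index(texts)
padded = pad_pre(texts_to_sequences(texts, word_index, num_words=5))
for row in padded:
    print(row)
# [1, 2, 3]
# [1, 2, 4]
# [0, 3, 4]
```

In the toy run the rare words `kill` and `all` receive indices 5 and 6 and are dropped by `num_words=5`. The same thing happens in the notebook: `word_index` holds 23315 words, but only those ranked below `top_words = 10000` appear in `X`. The zeros in front of `X[0]` are the left padding up to the longest tweet (166 tokens).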

# 2. Implement at least **two different configurations** of LSTM models. Explain the differences between the models, all the steps you perform and the decisions you take.

In [14]:
#First model

embed_dim = 128
lstm_units = 100   # number of LSTM units; also used for the parameter count at the end of this cell

model = Sequential()
model.add(Embedding(top_words, embed_dim,input_length = X.shape[1]))

model.add(LSTM(lstm_units,dropout=0.25,recurrent_dropout=0.25))
model.add(Dense(50,activation='relu'))
model.add(Dense(25,activation='relu'))

model.add(Dense(1,activation='sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
#print(model.summary())

parameters_of_Embedding_layer=top_words*embed_dim
#parameters_of_Embedding_layer

parameters_of_lstm_layer=(embed_dim+lstm_units+1)*lstm_units*4
#parameters_of_lstm_layer

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y=le.fit_transform(data['sentiment'])
#print(le.classes_)
#print(y)
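`LabelEncoder` assigns integer ids to the class names in sorted order, so with only the two remaining classes `Negative` becomes 0 and `Positive` becomes 1. The sketch below reproduces that mapping in plain Python on a few made-up labels; `class_to_id` and `y_sketch` are names invented for the example.

```python
labels = ["Positive", "Negative", "Negative", "Positive"]  # toy labels, not the dataset

classes = sorted(set(labels))                          # sorted class names
class_to_id = {name: i for i, name in enumerate(classes)}
y_sketch = [class_to_id[name] for name in labels]

print(classes)    # ['Negative', 'Positive']
print(y_sketch)   # [1, 0, 0, 1]
```

This matters when reading the model output later: a sigmoid value close to 1 means the model predicts `Positive`.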

## **Explanation first model**

### **1. Embedding layer**:

```python
model.add(Embedding(top_words, embed_dim, input_length=X.shape[1]))
```

* **`top_words`**: The size of the vocabulary the layer can index. It is capped at 10,000 here, although the tokenizer found 23,315 unique words, so the rarest words are dropped.
* **`embed_dim`**: The dimension of the dense embedding. Each word is represented by a vector of this length.
* **`input_length`**: The length of the input sequences (number of tokens per sequence, 166 after padding).

The embedding layer converts integer-encoded words into dense vectors of fixed size (**`embed_dim`**)
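Conceptually, the embedding layer is a trainable lookup table with one row per word index. The following sketch shows only the lookup, with toy sizes and random values standing in for learned weights; `embedding_table` and `embed` are illustrative names, not Keras API.

```python
import random

vocab_size, embed_dim = 10, 4   # toy sizes; the notebook uses 10000 and 128
random.seed(0)

# One row per word index; row 0 belongs to the padding index.
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    """Replace each integer word index with its embedding row."""
    return [embedding_table[idx] for idx in sequence]

padded_sequence = [0, 0, 3, 7, 1]        # hypothetical padded input
vectors = embed(padded_sequence)
print(len(vectors), len(vectors[0]))     # 5 4
```

With the notebook's settings, the same lookup turns each padded tweet of 166 indices into a 166 x 128 matrix, which is what the LSTM layer receives.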

### **2. LSTM layer**:

```python
model.add(LSTM(100, dropout=0.25, recurrent_dropout=0.25))
```

* **`100`**: Number of LSTM units.
* **`dropout = 0.25`**: Dropout rate for the input connections (helps prevent overfitting).
* **`recurrent_dropout = 0.25`**: Dropout rate for the recurrent connections.

This LSTM layer processes the embedded sequences, capturing temporal dependencies.

### **3. Dense layers**:

```python
model.add(Dense(50, activation = 'relu'))
model.add(Dense(25, activation = 'relu'))
```

* **`50`** and **`25`**: Number of neurons in each dense layer.
* **`activation = 'relu'`**: ReLU activation function for introducing non-linearity.

These dense layers further process the output from the LSTM layer, learning complex patterns.

### **4. Output Layer**:

```python
model.add(Dense(1, activation = 'sigmoid'))
```

* **`1`**: Single output neuron (for binary classification)
* **`activation = 'sigmoid'`**: Sigmoid activation function, producing a probability between 0 and 1.

### **5. Model Compilation:**

```python
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

* **`loss = 'binary_crossentropy'`**: Loss function for binary classification.
* **`optimizer = 'adam'`**: Adam optimizer for efficient gradient descent.
* **`metrics = ['accuracy']`**: Metric to evaluate during training.
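The output layer and the loss fit together: the sigmoid squashes the last layer's raw score into a probability, and binary cross-entropy penalizes that probability according to the true label. A small sketch with hypothetical raw scores (the clipping constant `eps` is an assumption to avoid `log(0)`):

```python
import math

def sigmoid(z: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true: int, p: float, eps: float = 1e-7) -> float:
    """Loss for one example with label y_true (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)   # clip to keep the logarithm finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

for score in (-2.0, 0.0, 2.0):        # hypothetical raw scores
    p = sigmoid(score)
    print(f"score={score:+.1f}  p={p:.3f}  "
          f"loss if label=1: {binary_crossentropy(1, p):.3f}  "
          f"loss if label=0: {binary_crossentropy(0, p):.3f}")
```

A confident correct prediction gives a loss near 0, an undecided one (p = 0.5) gives about 0.693, and a confident wrong prediction is penalized much more heavily.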





In [15]:
# Second LSTM configuration

# Create the model
model2 = Sequential()

# Embedding layer
model2.add(Embedding(top_words, embed_dim, input_length=X.shape[1]))

# First LSTM layer
model2.add(LSTM(128, return_sequences=True, dropout=0.3, recurrent_dropout=0.3))

# Second LSTM layer
model2.add(LSTM(64, dropout=0.3, recurrent_dropout=0.3))

# Dense layers
model2.add(Dense(32, activation='relu'))
model2.add(Dense(16, activation='relu'))

# Output layer
model2.add(Dense(1, activation='sigmoid'))

# Compile the model
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the model summary
model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 166, 128)          1280000   
                                                                 
 lstm_1 (LSTM)               (None, 166, 128)          131584    
                                                                 
 lstm_2 (LSTM)               (None, 64)                49408     
                                                                 
 dense_3 (Dense)             (None, 32)                2080      
                                                                 
 dense_4 (Dense)             (None, 16)                528       
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 1463617 (5.58 MB)
Trainable params: 1463
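The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas (an LSTM has four gates, each with input weights, recurrent weights, and a bias) and applies them to both configurations. The helper names are made up for this check; the first model's total is computed from its layer sizes, not read from a summary.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    return (input_dim + 1) * units

top_words, embed_dim = 10000, 128

model2_counts = [
    embedding_params(top_words, embed_dim),  # 1280000
    lstm_params(embed_dim, 128),             # 131584
    lstm_params(128, 64),                    # 49408
    dense_params(64, 32),                    # 2080
    dense_params(32, 16),                    # 528
    dense_params(16, 1),                     # 17
]
print(sum(model2_counts))                    # 1463617, the total in the summary

model1_counts = [
    embedding_params(top_words, embed_dim),  # 1280000
    lstm_params(embed_dim, 100),             # 91600
    dense_params(100, 50),                   # 5050
    dense_params(50, 25),                    # 1275
    dense_params(25, 1),                     # 26
]
print(sum(model1_counts))                    # 1377951
```

In both models the embedding layer accounts for the bulk of the parameters, so the second model is only about 6% larger in parameter count even though it stacks two LSTM layers.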

## **Differences and Decisions**

### **1. LSTM Layers**:

* **First Model**: Uses a single LSTM layer with 100 units.
* **Second Model**: Uses two LSTM layers with 128 and 64 units respectively. The **`return_sequences=True`** argument in the first LSTM layer makes it output its hidden state at every timestep, so the second LSTM layer receives a full sequence instead of only the last state.
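The effect of returning the whole sequence versus only the last state can be illustrated with a toy recurrent layer. This is deliberately not an LSTM: the "cell" is just a running sum, chosen so the outputs are easy to follow. Only the shape of what comes out is the point.

```python
def toy_recurrent(inputs, return_sequences=False):
    """Toy recurrent layer whose state is a running sum of the inputs (not a real LSTM)."""
    state, states = 0.0, []
    for x in inputs:            # one step per timestep
        state = state + x       # stand-in for the LSTM cell update
        states.append(state)
    return states if return_sequences else states[-1]

timesteps = [1.0, 2.0, 3.0]
print(toy_recurrent(timesteps, return_sequences=True))   # [1.0, 3.0, 6.0]
print(toy_recurrent(timesteps))                          # 6.0
```

This mirrors the output shapes in the summary: the first LSTM layer yields `(None, 166, 128)`, one vector per timestep, while the second yields `(None, 64)`, a single vector per tweet.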

### **2. Dropout Rates**:

* **First Model**: Dropout rates for the LSTM layer are set to 0.25.
* **Second Model**: A slightly higher dropout rate (0.3) is used on both LSTM layers to counter the extra capacity of the stacked architecture.

### **3. Dense layers**:

* **First Model**: Two dense layers with 50 and 25 neurons.
* **Second Model**: Two dense layers with 32 and 16 neurons.


# **3. Train the models using the corresponding dataset. Take a look at your models and explain them.**

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.33, random_state = 42)
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)

(29060, 166) (29060,)
(14314, 166) (14314,)
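The shapes printed above follow directly from `test_size = 0.33`: the test share is rounded up and the remainder goes to training. The sketch below reproduces that arithmetic and a simple shuffled split in plain Python. The helper names are hypothetical, and the shuffle uses Python's `random` module, so it will not reproduce the exact rows that scikit-learn selects with `random_state = 42`.

```python
import math
import random

def split_sizes(n_samples, test_size):
    """Round the test share up; everything else is training data."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(43374, 0.33))   # (29060, 14314)

def shuffle_split(items, test_size, seed):
    """Shuffle the indices reproducibly, then cut them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(len(items), test_size)
    train = [items[i] for i in indices[:n_train]]
    test = [items[i] for i in indices[n_train:]]
    return train, test

train, test = shuffle_split(list(range(10)), test_size=0.33, seed=42)
print(len(train), len(test))      # 6 4
```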


In [17]:
batch_size = 32
model.fit(X_train, y_train, epochs=7, batch_size=batch_size, validation_data=(X_test, y_test),
          callbacks=[EarlyStopping(patience=3, restore_best_weights=True)])  # callbacks takes a list

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


<keras.src.callbacks.History at 0x7b367be329b0>

In [18]:
model2.fit(X_train, y_train, epochs=7, batch_size=batch_size, validation_data=(X_test, y_test),
           callbacks=[EarlyStopping(patience=3, restore_best_weights=True)])  # callbacks takes a list

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


<keras.src.callbacks.History at 0x7b3674216f80>
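Both runs reach `Epoch 7/7`, so the early-stopping callback never cut training short. To show what `patience=3` would do otherwise, here is a simplified sketch of the patience logic on validation loss. The function name and both loss sequences are hypothetical, and the real Keras callback has more options than this.

```python
def early_stopping_trace(val_losses, patience=3):
    """Return (epochs_run, best_epoch) for per-epoch validation losses (epochs are 1-based)."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:                           # too many epochs without improvement
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical run that stalls after epoch 3: training stops at epoch 6.
print(early_stopping_trace([0.40, 0.30, 0.25, 0.26, 0.27, 0.28, 0.29]))  # (6, 3)

# Hypothetical run that keeps improving: all 7 epochs are used.
print(early_stopping_trace([0.50, 0.40, 0.30, 0.25, 0.22, 0.21, 0.20]))  # (7, 7)
```

With `restore_best_weights=True`, a run like the first one would end with the weights from epoch 3 rather than those from the last epoch trained.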

# **7. Write the conclusions to your findings**


### Analysis and Comparison

1. **Training Time:**
   - **First Model:** Each epoch takes approximately 354-370 seconds.
   - **Second Model:** Each epoch takes significantly longer, around 666-686 seconds. This is expected due to the additional LSTM layer and larger number of LSTM units.

2. **Accuracy:**
   - Both models achieve high training accuracy (~96.5% by epoch 7).
   - Validation accuracy for the first model reaches up to 93.31%, while the second model's validation accuracy reaches up to 92.91%.

3. **Loss:**
   - Both models show decreasing loss values over epochs, indicating good learning progress.
   - The first model's final validation loss is 0.2245, while the second model's final validation loss is 0.2026.

4. **Performance Stability:**
   - The validation accuracy and loss for both models do not show signs of severe overfitting. Although the validation loss for the first model increases slightly in the final epoch, the validation accuracy remains high.
   - Both models maintain similar performance on validation data across epochs, suggesting stable learning.
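The per-epoch timings quoted above translate into total training time as follows. The sketch only does arithmetic on the figures from this analysis and assumes all 7 epochs ran for both models, which matches the training logs.

```python
EPOCHS = 7
seconds_per_epoch = {               # (fastest, slowest) epoch, taken from the analysis above
    "first model": (354, 370),
    "second model": (666, 686),
}

for name, (low, high) in seconds_per_epoch.items():
    total_low, total_high = EPOCHS * low, EPOCHS * high
    print(f"{name}: {total_low / 60:.0f}-{total_high / 60:.0f} min for {EPOCHS} epochs")
# first model: 41-43 min for 7 epochs
# second model: 78-80 min for 7 epochs

slowdown_low = 666 / 370            # best case for the second model
slowdown_high = 686 / 354           # worst case for the second model
print(f"second model is {slowdown_low:.1f}x-{slowdown_high:.1f}x slower per epoch")
# second model is 1.8x-1.9x slower per epoch
```

The second model therefore costs close to twice the training time for a validation accuracy that is about 0.4 percentage points lower.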

### Conclusions

1. **Model Performance:**
   - **First Model:** This model achieves slightly better validation accuracy (93.31%), although its final validation loss (0.2245) is higher than the second model's (0.2026) and rises slightly in the last epoch.
   - **Second Model:** Although it has more complex architecture, it achieves slightly lower validation accuracy (92.91%) and takes significantly longer to train.

2. **Training Efficiency:**
   - The first model is more efficient in terms of training time, which might be a consideration for practical applications where training time is critical.

3. **Complexity and Generalization:**
   - The first model, being simpler with one LSTM layer, generalizes about as well as the more complex second model on this dataset, so the extra LSTM layer brings no clear benefit here.

### Recommendations

- **Choice of Model:** Based on the results, the first model (with a single LSTM layer) is recommended due to its slightly higher validation accuracy (93.31% vs. 92.91%), roughly half the training time per epoch, and sufficient capacity for the given task.
- **Further Improvements:** To further enhance performance, consider experimenting with additional dropout, regularization techniques, or even fine-tuning hyperparameters such as the number of LSTM units or learning rate.

By maintaining consistent training conditions for both models, the comparison remains logical and allows for clear insights into the impact of different architectural choices on the performance and efficiency of LSTM models for sentiment analysis.

# Bibliography

* https://www.kaggle.com/code/yasmeensharaan/twitter-sentiment-using-lstm