<a id="description"></a>

<div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:200%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Project Description</b></div>
<div style="padding: 40px; border-color: #50A20E; border-radius: 10px; box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1); border: 2px solid #9b59b6;"> <p> This project focuses on <b>Tweet Sentiment Classification</b> using deep learning techniques. It classifies tweets related to the COVID-19 pandemic into five sentiment classes: </p> <ul> <li><b>Extremely Positive</b></li>  <li><b>Positive</b></li> <li><b>Neutral</b></li> <li><b>Negative</b></li> <li><b>Extremely Negative</b></li> </ul> <h3 style="color:#2E86C1;">🔍 Objective</h3> <p> To develop a sentiment analysis model that accurately detects the emotional tone of tweets, enabling a better understanding of public opinion during crises. </p> <h3 style="color:#2E86C1;">📊 Dataset</h3> <p> The training set contains 41,157 tweets, each labeled with one of the five sentiment classes. Each tweet is accompanied by metadata: a numeric user ID and screen name, a location (missing for 8,590 tweets), and the tweet date. </p> <h3 style="color:#2E86C1;">🛠️ Technologies Used</h3> <ul> <li><b>Python</b> – Programming language</li> <li><b>Keras & TensorFlow</b> – For building and training the neural network</li> <li><b>GloVe</b> – Pre-trained word embeddings (300D)</li> <li><b>BiLSTM</b> – To capture forward and backward context in text</li> <li><b>NLTK / Regex</b> – For text preprocessing</li> </ul> <h3 style="color:#2E86C1;">🧠 Model Architecture</h3> <ul> <li><b>Embedding Layer</b> – Initialized with GloVe vectors and fine-tuned during training</li> <li><b>Bidirectional LSTM</b> – Captures contextual information in both directions</li> <li><b>Dropout</b> – Reduces overfitting</li> <li><b>Dense Output</b> – Softmax activation over the five sentiment classes</li> </ul> <h3 style="color:#2E86C1;">✅ Evaluation</h3> <p> The model was evaluated using: </p> <ul> <li><b>Accuracy</b></li> <li><b>Precision</b></li> <li><b>Recall</b></li> <li><b>F1-Score</b></li> </ul> <p> The final model achieved a <b>weighted F1-score of 78%</b> and 78% accuracy on a held-out test split of 8,232 tweets. 
</p> <h3 style="color:#2E86C1;">🚀 Impact</h3> <p> This sentiment analysis system can be used for: </p> <ul> <li>Real-time monitoring of public opinion</li> <li>Social media trend analysis</li> <li>Supporting crisis communication strategies</li> </ul> </div>

<a id="About"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>About Author</b></div>

<div style="padding: 20px; border-color: #50A20E; border-radius: 10px; box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1); border: 2px solid #9b59b6;">
    <p>
      I am <b>Amr Ghanem</b>, a 4th-year student at the Faculty of Engineering, Tanta University, with a passion for AI and Data Science. My journey into the world of technology is driven by curiosity and enthusiasm to explore the endless possibilities that data and AI offer. I am continuously learning and honing my skills in data analysis, machine learning, and AI to build a strong foundation for my future career. My goal is to contribute to innovative projects and make a meaningful impact in the tech industry.
    </p>
    <p>
        You can find more about me on:<br>
        <a href="https://www.linkedin.com/in/amr-ghanem-306b392b9/" target="_blank">LinkedIn</a><br>
        <a href="https://www.kaggle.com/amrgghanem" target="_blank">Kaggle</a><br>
        Feel free to connect and reach out for any collaboration or queries!
    </p>
</div>

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Importing Libraries</b></div>

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tqdm import tqdm
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Reading Data</b></div>

In [2]:
df=pd.read_csv('/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv',encoding='cp437')
df.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 | NaN | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41157 entries, 0 to 41156
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       41157 non-null  int64 
 1   ScreenName     41157 non-null  int64 
 2   Location       32567 non-null  object
 3   TweetAt        41157 non-null  object
 4   OriginalTweet  41157 non-null  object
 5   Sentiment      41157 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.9+ MB


<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Dealing With Nulls</b></div>

In [4]:
df.duplicated().sum()

0

In [5]:
df.isna().sum()

UserName            0
ScreenName          0
Location         8590
TweetAt             0
OriginalTweet       0
Sentiment           0
dtype: int64

In [6]:
df = df.dropna(axis=1)  # axis=1 drops columns that contain nulls (only Location); all rows are kept
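
The direction of `dropna` matters here: with `axis=1` the notebook discards the whole `Location` column rather than the 8,590 rows that lack a location. The toy frame below (hypothetical values, not taken from the dataset) is a small sketch of the difference between the two axes.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: one tweet has no Location
toy = pd.DataFrame({
    "Location": ["London", np.nan, "UK"],
    "OriginalTweet": ["a", "b", "c"],
})

by_column = toy.dropna(axis=1)  # removes every column that contains a NaN
by_row = toy.dropna(axis=0)     # removes every row that contains a NaN

print(list(by_column.columns), len(by_column))  # column dropped, all rows kept
print(list(by_row.columns), len(by_row))        # columns kept, one row dropped
```

Dropping the column keeps every labeled tweet, which seems a reasonable choice since `Location` is not used by the model.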

In [7]:
df.isna().sum()

UserName         0
ScreenName       0
TweetAt          0
OriginalTweet    0
Sentiment        0
dtype: int64

In [8]:
df['Sentiment'].value_counts()

Sentiment
Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: count, dtype: int64

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Data Preprocessing</b></div>

In [9]:
def clean_text(text):
    """Lowercase a tweet and strip URLs, @mentions, and every non-letter character."""
    text = re.sub(r"http\S+|@\S+|[^A-Za-z\s]", "", str(text).lower())
    return text.strip()
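
To see what the regular expression actually removes, the sketch below applies the same function to a made-up tweet (the sample string is an assumption for illustration only). URLs and mentions disappear entirely, while hashtags lose the `#` and any digits, so `#COVID19` becomes `covid`.

```python
import re

def clean_text(text):
    # Same pattern as the notebook: drop URLs, @mentions, then any non-letter character
    text = re.sub(r"http\S+|@\S+|[^A-Za-z\s]", "", str(text).lower())
    return text.strip()

sample = "Stay safe! #COVID19 https://t.co/abc @WHO"  # hypothetical tweet
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that the function does not collapse repeated spaces left behind by removed tokens; the Keras `Tokenizer` splits on whitespace, so this should not affect the resulting sequences.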

In [10]:
df['cleaned_text'] = df['OriginalTweet'].apply(clean_text)

In [11]:
sentiment_mapping = {
    "Extremely Negative": 0,
    "Negative": 1,
    "Neutral": 2,
    "Positive": 3,
    "Extremely Positive": 4
}
df['label'] = df['Sentiment'].map(sentiment_mapping)

In [12]:
df.head()

| | UserName | ScreenName | TweetAt | OriginalTweet | Sentiment | cleaned_text | label |
|---|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral | and and | 2 |
| 1 | 3800 | 48752 | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive | advice talk to your neighbours family to excha... | 3 |
| 2 | 3801 | 48753 | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive | coronavirus australia woolworths to give elder... | 3 |
| 3 | 3802 | 48754 | 16-03-2020 | My food stock is not the only one which is emp... | Positive | my food stock is not the only one which is emp... | 3 |
| 4 | 3803 | 48755 | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative | me ready to go at supermarket during the covid... | 0 |


<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Vectorizing & Padding</b></div>

In [13]:
# Tokenize the cleaned tweets and pad every sequence to a fixed length
MAX_VOCAB = 20000
MAX_LEN = 100

tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token="<OOV>")
tokenizer.fit_on_texts(df['cleaned_text'])
sequences = tokenizer.texts_to_sequences(df['cleaned_text'])
padded_sequences = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')
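
The effect of `padding='post'` and `truncating='post'` can be illustrated without TensorFlow. The helper below, `pad_post`, is a hypothetical pure-Python stand-in written for this explanation, not the Keras implementation: short sequences get zeros appended, and long ones keep only their first `maxlen` tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Simplified stand-in for pad_sequences(..., padding='post', truncating='post')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # keep the first maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))  # append zeros at the end
    return padded

toy_sequences = [[5, 8, 2], [7], [4, 9, 3, 6, 1, 2]]  # hypothetical token ids
result = pad_post(toy_sequences, maxlen=4)
for row in result:
    print(row)
```

Index 0 is reserved for padding, which is why the embedding matrix built later leaves row 0 as zeros.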

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Loading GloVe Embeddings</b></div>

In [14]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2025-05-12 21:31:39--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-05-12 21:31:40--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-05-12 21:31:40--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’



In [None]:
embedding_index = {}
embedding_dim = 300
with open("glove.6B.300d.txt", encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = vector
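
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below runs the same parsing loop on two made-up 3-dimensional lines held in memory (the words and values are assumptions for illustration; the real file has 300 components per line).

```python
import io
import numpy as np

# Two hypothetical GloVe-style lines with 3 dimensions instead of 300
toy_glove = io.StringIO(
    "covid 0.1 0.2 0.3\n"
    "hope -0.5 0.0 0.25\n"
)

toy_index = {}
for line in toy_glove:
    values = line.split()
    word = values[0]                                     # first field is the token
    toy_index[word] = np.asarray(values[1:], dtype="float32")  # the rest is the vector

print(sorted(toy_index), toy_index["hope"].shape)
```

The `glove.6B` vectors are uncased, which fits the lowercasing done in `clean_text`.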

In [16]:
word_index = tokenizer.word_index
vocab_size = min(MAX_VOCAB, len(word_index) + 1)
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    if i < MAX_VOCAB:
        vec = embedding_index.get(word)
        if vec is not None:
            embedding_matrix[i] = vec
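
Words without a GloVe vector keep an all-zero row, so it can be useful to check how much of the vocabulary is actually covered. The sketch below repeats the matrix-building loop on a toy vocabulary and reports that coverage; all names and values in it are hypothetical, and the notebook itself does not report this number.

```python
import numpy as np

# Hypothetical tokenizer vocabulary and pre-trained vectors (3-dimensional)
toy_word_index = {"<OOV>": 1, "covid": 2, "hope": 3, "lockdownn": 4}
toy_embeddings = {
    "covid": np.ones(3, dtype="float32"),
    "hope": np.full(3, 2.0, dtype="float32"),
}

max_vocab, dim = 10, 3
vocab_size = min(max_vocab, len(toy_word_index) + 1)  # +1 for the padding index 0
matrix = np.zeros((vocab_size, dim))

hits = 0
for word, i in toy_word_index.items():
    if i < vocab_size and word in toy_embeddings:
        matrix[i] = toy_embeddings[word]
        hits += 1

coverage = hits / (vocab_size - 1)  # exclude the padding row
print(f"{hits} of {vocab_size - 1} words covered ({coverage:.0%})")
```

Because the notebook sets `trainable=True`, rows that start at zero can still be learned during training.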

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    padded_sequences, df['label'], test_size=0.2, random_state=42, stratify=df['label']
)
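
With `stratify=df['label']`, each class should contribute about 20% of its tweets to the test set. A quick arithmetic check, using the class counts printed earlier and the `support` column of the classification report further down, suggests that this is what happened.

```python
# Class counts from df['Sentiment'].value_counts()
class_counts = {
    "Extremely Negative": 5481,
    "Negative": 9917,
    "Neutral": 7713,
    "Positive": 11422,
    "Extremely Positive": 6624,
}
# Test-set support per class, as printed in the classification report
reported_support = {
    "Extremely Negative": 1096,
    "Negative": 1983,
    "Neutral": 1543,
    "Positive": 2285,
    "Extremely Positive": 1325,
}

test_size = 0.2
total = sum(class_counts.values())
for name, count in class_counts.items():
    expected = count * test_size
    print(f"{name:<20} expected ~{expected:7.1f}   reported {reported_support[name]}")
print("total tweets:", total, "| test tweets:", sum(reported_support.values()))
```

Every reported support value is within one tweet of 20% of its class count.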

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Model Building</b></div>

In [18]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embedding_dim,
                    weights=[embedding_matrix],
                    input_length=MAX_LEN,
                    trainable=True))
model.add(Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dropout(0.5))
model.add(Dense(5, activation='softmax'))  # 5 sentiment classes

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

I0000 00:00:1747085811.515274     331 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13942 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
I0000 00:00:1747085811.515956     331 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13942 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
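
The table printed by `model.summary()` is not shown above, but the parameter counts can be estimated by hand. The arithmetic below assumes the vocabulary reaches `MAX_VOCAB` (so `vocab_size` is 20,000) and uses the standard LSTM parameter formula; treat it as a sanity check rather than the actual summary output.

```python
vocab_size = 20000     # assumption: the tokenizer vocabulary reaches MAX_VOCAB
embedding_dim = 300
lstm_units = 128
num_classes = 5

# Embedding: one 300-d vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction  # forward + backward

# Dense: the BiLSTM concatenates both directions (256 features)
dense_params = (2 * lstm_units) * num_classes + num_classes

total_params = embedding_params + bilstm_params + dense_params
print(f"embedding: {embedding_params:,}")
print(f"bilstm:    {bilstm_params:,}")
print(f"dense:     {dense_params:,}")
print(f"total:     {total_params:,}")
```

Under these assumptions the embedding layer holds the large majority of the weights, and since it is trainable, all of them are updated during training.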


In [19]:
# Learning rate scheduler: halve the learning rate after 5 epochs without val_loss improvement
lr_scheduler = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    verbose=1,
    min_lr=1e-6
)

# Early stopping: stop after 10 epochs without val_loss improvement and restore the best weights
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True,
    verbose=1
)
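
The patience logic behind early stopping can be sketched in a few lines. The function below, `early_stop_epoch`, is a simplified illustration written for this notebook; it ignores options of the real Keras callback such as `min_delta` and `start_from_epoch`. It is applied to the five validation losses visible in the training log.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if training never stops early."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0  # improvement resets the counter
        else:
            wait += 1                                # one more epoch without improvement
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Validation losses of epochs 1-5 from the training log
observed = [0.8793, 0.6567, 0.6048, 0.5850, 0.6190]

print(early_stop_epoch(observed, patience=10))  # the notebook's setting
print(early_stop_epoch(observed, patience=1))   # hypothetical stricter setting
```

With `restore_best_weights=True`, the model returned after stopping carries the weights from the best epoch rather than the last one.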

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Model Training</b></div>

In [21]:
model.fit(X_train, y_train, epochs=50, batch_size=64, validation_split=0.1, callbacks=[lr_scheduler, early_stop])

Epoch 1/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 77s 151ms/step - accuracy: 0.3759 - loss: 1.3975 - val_accuracy: 0.6477 - val_loss: 0.8793 - learning_rate: 0.0010
Epoch 2/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 69s 150ms/step - accuracy: 0.6998 - loss: 0.8010 - val_accuracy: 0.7662 - val_loss: 0.6567 - learning_rate: 0.0010
Epoch 3/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 70s 150ms/step - accuracy: 0.7905 - loss: 0.5903 - val_accuracy: 0.7877 - val_loss: 0.6048 - learning_rate: 0.0010
Epoch 4/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 69s 150ms/step - accuracy: 0.8426 - loss: 0.4658 - val_accuracy: 0.7950 - val_loss: 0.5850 - learning_rate: 0.0010
Epoch 5/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 70s 150ms/step - accuracy: 0.8708 - loss: 0.3899 - val_accuracy: 0.7917 - val_loss: 0.6190 - learning_rate: 0.0010
Epoch 6/50
463/463 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x79f106a98410>

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Classification Report</b></div>

In [22]:
y_pred = np.argmax(model.predict(X_test), axis=1)


[1m258/258[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 48ms/step


In [23]:
reverse_mapping = {v: k for k, v in sentiment_mapping.items()}
target_names = [reverse_mapping[i] for i in sorted(reverse_mapping)]

In [24]:
print(classification_report(y_test, y_pred, target_names=target_names))

                    precision    recall  f1-score   support

Extremely Negative       0.82      0.78      0.80      1096
          Negative       0.69      0.79      0.74      1983
           Neutral       0.90      0.79      0.84      1543
          Positive       0.76      0.75      0.76      2285
Extremely Positive       0.82      0.82      0.82      1325

          accuracy                           0.78      8232
         macro avg       0.80      0.78      0.79      8232
      weighted avg       0.79      0.78      0.78      8232
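
The averaged rows of the report can be recomputed from the per-class rows. The sketch below does this with the rounded values printed above, so the results may differ from the report in the last digit: F1 is the harmonic mean of precision and recall, the macro average weights every class equally, and the weighted average weights each class by its support.

```python
# Per-class F1 and support, copied from the report (rounded to two decimals)
f1 = [0.80, 0.74, 0.84, 0.76, 0.82]
support = [1096, 1983, 1543, 2285, 1325]

macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

# F1 for the "Negative" class from its precision and recall
precision, recall = 0.69, 0.79
negative_f1 = 2 * precision * recall / (precision + recall)

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
print(f"Negative F1: {negative_f1:.3f}")
```

The "Negative" class has the lowest F1 of the five, driven by its lower precision.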



In [25]:
# Define sentiment reverse mapping
reverse_mapping = {
    0: "Extremely Negative",
    1: "Negative",
    2: "Neutral",
    3: "Positive",
    4: "Extremely Positive"
}

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Testing on New Tweets</b></div>

In [26]:
# Preprocess a raw tweet and predict its sentiment label
def predict_sentiment(text, tokenizer, model):
    # Reuse the clean_text function defined earlier so inference matches training preprocessing
    cleaned = clean_text(text)

    # Tokenize & pad exactly as for the training data
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=MAX_LEN, padding='post', truncating='post')

    # Predict class probabilities and pick the most likely class
    pred = model.predict(padded)
    pred_class = np.argmax(pred, axis=1)[0]

    return reverse_mapping[pred_class]
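
The last step of the function, turning softmax probabilities into a label, can be shown in isolation. The sketch below uses made-up probability rows in place of `model.predict` output, and the helper `decode_predictions` is a hypothetical name introduced here. It also returns the winning probability, which can help flag low-confidence predictions.

```python
import numpy as np

reverse_mapping = {
    0: "Extremely Negative",
    1: "Negative",
    2: "Neutral",
    3: "Positive",
    4: "Extremely Positive",
}

def decode_predictions(probs):
    """Map each row of class probabilities to (label, probability of that label)."""
    probs = np.asarray(probs)
    classes = np.argmax(probs, axis=1)  # index of the largest probability per row
    return [(reverse_mapping[int(c)], float(probs[i, c])) for i, c in enumerate(classes)]

# Hypothetical softmax outputs for two tweets
fake_probs = [
    [0.05, 0.10, 0.15, 0.60, 0.10],
    [0.70, 0.20, 0.05, 0.03, 0.02],
]
decoded = decode_predictions(fake_probs)
for label, confidence in decoded:
    print(f"{label}: {confidence:.2f}")
```

Decoding a whole batch at once like this would also allow a single `model.predict` call for many tweets instead of one call per tweet.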

In [29]:
test_texts_covid19 = [
    "The vaccine rollout is going smoothly! Finally some hope for the future. #COVID19 #Vaccination",
    "I can’t believe how long this pandemic is lasting. I’m starting to lose hope. #COVID19 #pandemic",
    "We all need to stay strong, we will get through this. Stay safe everyone! #StayAtHome #COVID19",
    "It’s getting worse again. Hospitals are full, and the situation is just terrifying. #COVID19",
    "The lockdown has been really tough on mental health. I miss seeing my friends. #COVID19 #MentalHealth",
    "I’m tired of the endless restrictions. When will things go back to normal? #COVID19",
    "The new variant is spreading rapidly, and it’s concerning. #COVID19 #NewVariant",
    "I just got my second dose of the vaccine! Feeling relieved and hopeful. #Vaccinated #COVID19",
    "I don’t understand why some people still refuse to wear masks. It’s for everyone’s safety. #COVID19",
    "This is never going to end if people don’t start following guidelines. #COVID19 #pandemic"
]

for text in test_texts_covid19:
    sentiment = predict_sentiment(text, tokenizer, model)
    print(f"Text: {text}\nPredicted Sentiment: {sentiment}\n")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
Text: The vaccine rollout is going smoothly! Finally some hope for the future. #COVID19 #Vaccination
Predicted Sentiment: Positive

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
Text: I can’t believe how long this pandemic is lasting. I’m starting to lose hope. #COVID19 #pandemic
Predicted Sentiment: Positive

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
Text: We all need to stay strong, we will get through this. Stay safe everyone! #StayAtHome #COVID19
Predicted Sentiment: Extremely Positive

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 62ms/step
Text: It’s getting worse again. Hospitals are full, and the situation is just terrifying. #COVID19
Predicted Sentiment: Extremely Negative

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
Text: The lockdown has been really tough on mental health. I miss seeing my friends. #COVID1

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ce90f; overflow:hidden"><b>Thank You</b></div>