<a id="description"></a>

<div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:200%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Project Description</b></div>
<div style="padding: 40px; border-color: #50A20E; border-radius: 10px; box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1); border: 2px solid #50A20E;"> <p> This project focuses on <b>Tweet Sentiment Classification</b> using deep learning techniques. It classifies tweets related to the COVID-19 pandemic into two sentiment classes: </p> <ul> <li><b>Positive</b> – tweets labeled Positive or Extremely Positive</li> <li><b>Non-positive</b> – tweets labeled Neutral, Negative, or Extremely Negative</li> </ul> <h3 style="color:#2E86C1;">🔍 Objective</h3> <p> To develop a sentiment analysis model that accurately detects the emotional tone of tweets, enabling a better understanding of public opinion during crises. </p> <h3 style="color:#2E86C1;">📊 Dataset</h3> <p> The training file contains 41,157 tweets, each labeled with one of five sentiment classes (Extremely Negative, Negative, Neutral, Positive, Extremely Positive). Each row also carries a user ID, a screen-name ID, a location, and the tweet date. </p> <h3 style="color:#2E86C1;">🛠️ Technologies Used</h3> <ul> <li><b>Python</b> – Programming language</li> <li><b>Keras & TensorFlow</b> – For building and training the neural network</li> <li><b>GloVe</b> – Pre-trained word embeddings (glove.6B, 300 dimensions)</li> <li><b>BiLSTM</b> – To capture forward and backward context in text</li> <li><b>NLTK / Regex</b> – For text preprocessing (the cleaning step itself uses regular expressions only)</li> </ul> <h3 style="color:#2E86C1;">🧠 Model Architecture</h3> <ul> <li><b>Embedding Layer</b> – Initialized with GloVe vectors and fine-tuned during training</li> <li><b>Bidirectional LSTM</b> – 128 units per direction; captures contextual information in both directions</li> <li><b>Dropout</b> – Rate 0.5 before the output layer, to reduce overfitting</li> <li><b>Dense Output</b> – Sigmoid activation for binary classification</li> </ul> <h3 style="color:#2E86C1;">✅ Evaluation</h3> <p> The model was evaluated using: </p> <ul> <li><b>Accuracy</b></li> <li><b>Precision</b></li> <li><b>Recall</b></li> <li><b>F1-Score</b></li> </ul> <p> The final model achieved a <b>weighted F1-score of 0.90</b> on the held-out test split (8,232 tweets, 20% of the data). 
</p> <h3 style="color:#2E86C1;">🚀 Impact</h3> <p> This sentiment analysis system can be used for: </p> <ul> <li>Real-time monitoring of public opinion</li> <li>Social media trend analysis</li> <li>Supporting crisis communication strategies</li> </ul> </div>

<a id="About"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>About Author</b></div>

<div style="padding: 20px; border-color: #50A20E; border-radius: 10px; box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1); border: 2px solid #50A20E;">
    <p>
      I am <b>Amr Ghanem</b>, a 4th-year student at the Faculty of Engineering, Tanta University, with a passion for AI and Data Science. My journey into the world of technology is driven by curiosity and enthusiasm to explore the endless possibilities that data and AI offer. I am continuously learning and honing my skills in data analysis, machine learning, and AI to build a strong foundation for my future career. My goal is to contribute to innovative projects and make a meaningful impact in the tech industry.
    </p>
    <p>
        You can find more about me on:<br>
        <a href="https://www.linkedin.com/in/amr-ghanem-306b392b9/" target="_blank">LinkedIn</a><br>
        <a href="https://www.kaggle.com/amrgghanem" target="_blank">Kaggle</a><br>
        Feel free to connect and reach out for any collaboration or queries!
    </p>
</div>

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Importing Libraries</b></div>

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tqdm import tqdm
nltk.download('stopwords')

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Reading Data</b></div>

In [2]:
df=pd.read_csv('/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv',encoding='cp437')
df.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41157 entries, 0 to 41156
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       41157 non-null  int64 
 1   ScreenName     41157 non-null  int64 
 2   Location       32567 non-null  object
 3   TweetAt        41157 non-null  object
 4   OriginalTweet  41157 non-null  object
 5   Sentiment      41157 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.9+ MB


<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Dealing With Nulls</b></div>

In [4]:
df.duplicated().sum()

0

In [5]:
df.isna().sum()

UserName            0
ScreenName          0
Location         8590
TweetAt             0
OriginalTweet       0
Sentiment           0
dtype: int64

In [6]:
# Drop every column that contains nulls; here that is only Location
df = df.dropna(axis=1)

In [7]:
df.isna().sum()

UserName         0
ScreenName       0
TweetAt          0
OriginalTweet    0
Sentiment        0
dtype: int64

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Feature Engineering</b></div>

In [10]:
def Bclass(x):
    # 1 = Positive or Extremely Positive; 0 = Neutral, Negative, or Extremely Negative
    if x in ['Positive', 'Extremely Positive']:
        return 1
    else:
        return 0

df['Bclasses'] = df['Sentiment'].apply(Bclass)
df['Bclasses'].value_counts()

Bclasses
0    23111
1    18046
Name: count, dtype: int64
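
The mapping above folds the dataset's five sentiment labels into two, so class 0 covers Neutral tweets as well as negative ones. A minimal sketch of the same rule as a lookup table, together with the majority-class baseline implied by the counts above, might look like this. The names `LABEL_TO_BINARY`, `to_binary`, and `majority_baseline` are illustrative and do not appear in the notebook.

```python
# Illustrative lookup table equivalent to Bclass(); the name is hypothetical.
LABEL_TO_BINARY = {
    "Extremely Negative": 0,
    "Negative": 0,
    "Neutral": 0,
    "Positive": 1,
    "Extremely Positive": 1,
}

def to_binary(label):
    """Return 1 for the two positive labels and 0 for anything else, as Bclass does."""
    return LABEL_TO_BINARY.get(label, 0)

def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())

# Class counts copied from the value_counts() output above.
counts = {0: 23111, 1: 18046}
print(f"majority-class baseline accuracy: {majority_baseline(counts):.4f}")
```

Any accuracy figure for this task is best read against that baseline, since a model that always predicts class 0 is already right more than half the time.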

In [11]:
df.head()

Unnamed: 0,UserName,ScreenName,TweetAt,OriginalTweet,Sentiment,Bclasses
0,3799,48751,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,0
1,3800,48752,16-03-2020,advice Talk to your neighbours family to excha...,Positive,1
2,3801,48753,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive,1
3,3802,48754,16-03-2020,My food stock is not the only one which is emp...,Positive,1
4,3803,48755,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0


<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Loading GloVe</b></div>

In [12]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2025-05-12 20:35:13--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-05-12 20:35:13--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-05-12 20:35:13--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [13]:
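# Keep only the text and the binary label; copy so new columns can be added without a SettingWithCopyWarning
df = df[['OriginalTweet', 'Bclasses']].copy()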

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Text Cleaning</b></div>

In [14]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)     # remove mentions
    text = re.sub(r"#\w+", "", text)     # remove hashtags
    text = re.sub(r"[^a-z\s]", "", text) # remove punctuation and numbers
    text = re.sub(r"\s+", " ", text).strip()
    return text

df['cleaned_text'] = df['OriginalTweet'].apply(clean_text)
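
To see what the cleaning steps do in practice, here is the same function applied to a made-up tweet (the sample string is an assumption for illustration and is not taken from the dataset):

```python
import re

def clean_text(text):
    # Same steps as the cleaning function above.
    text = text.lower()
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove mentions
    text = re.sub(r"#\w+", "", text)          # remove hashtags, including the tag word
    text = re.sub(r"[^a-z\s]", "", text)      # remove punctuation and numbers
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text

# Made-up example tweet.
sample = "Check https://t.co/abc @user #COVID19 Stay safe!!! 100%"
print(clean_text(sample))  # check stay safe
```

Two side effects are worth keeping in mind: a hashtag is dropped together with its word, and apostrophes are deleted, so "don't" becomes "dont".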

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Vectorizing & Padding</b></div>

In [15]:
# Parameters
MAX_VOCAB = 20000
MAX_LEN = 50

# Tokenizer
tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token="<OOV>")
tokenizer.fit_on_texts(df['cleaned_text'])

# Convert to sequences
sequences = tokenizer.texts_to_sequences(df['cleaned_text'])
padded_sequences = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')
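
With `padding='post'` and `truncating='post'`, short sequences are filled with zeros on the right and long ones lose their tail. The pure-Python helper below is a sketch that mimics this behaviour for a single sequence; `pad_post` is a hypothetical name and not part of Keras.

```python
def pad_post(seq, maxlen, value=0):
    """Keep the first `maxlen` ids, then right-pad with `value` up to `maxlen`."""
    trimmed = seq[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([5, 9, 2], maxlen=5))           # [5, 9, 2, 0, 0]
print(pad_post([5, 9, 2, 7, 4, 8], maxlen=5))  # [5, 9, 2, 7, 4]
```

Index 0 is reserved for padding, and because an `oov_token` is set, index 1 stands for any word outside the `MAX_VOCAB` most frequent ones.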

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Loading GloVe Embeddings</b></div>

In [16]:
# Load GloVe
embedding_index = {}
with open("glove.6B.300d.txt", encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = vector

# Create embedding matrix
embedding_dim = 300
word_index = tokenizer.word_index
vocab_size = min(MAX_VOCAB, len(word_index) + 1)
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Rows for words missing from GloVe stay all-zero
for word, i in word_index.items():
    if i < vocab_size:
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
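
Words without a GloVe entry keep an all-zero row, so it can be useful to check how much of the vocabulary is actually covered. The sketch below uses toy stand-ins for `tokenizer.word_index` and `embedding_index`; the helper name `glove_coverage` and the toy values are assumptions for illustration.

```python
def glove_coverage(word_index, embedding_index, max_vocab):
    """Fraction of the kept vocabulary (ids below max_vocab) that has a pretrained vector."""
    kept = [word for word, i in word_index.items() if i < max_vocab]
    found = sum(1 for word in kept if word in embedding_index)
    return found / len(kept) if kept else 0.0

# Toy stand-ins for tokenizer.word_index and the loaded GloVe dictionary.
toy_word_index = {"<OOV>": 1, "covid": 2, "store": 3, "food": 4}
toy_embeddings = {"store": [0.1, 0.2], "food": [0.3, 0.4]}

print(glove_coverage(toy_word_index, toy_embeddings, max_vocab=20000))  # 0.5
```

In the notebook the call would be `glove_coverage(word_index, embedding_index, MAX_VOCAB)`. The glove.6B vectors were trained on text from 2014 and earlier, so pandemic-specific terms may be missing and would start from zero vectors.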

In [17]:
X = padded_sequences
y = df['Bclasses'].values  # binary labels (0 or 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
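
The sizes of the resulting splits can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional test size up, that Keras holds out the last `validation_split` share of the training data (rounded down), and that `model.predict` uses its default batch size of 32. The `validation_split=0.1` and `batch_size=64` values come from the training cell later in this notebook.

```python
import math

n_samples = 41157         # rows reported by df.info()
test_size = 0.2
validation_split = 0.1    # passed to model.fit in the training cell
batch_size = 64           # passed to model.fit in the training cell
predict_batch_size = 32   # assumed Keras default for model.predict

n_test = math.ceil(test_size * n_samples)               # test rows
n_train = n_samples - n_test                            # rows handed to model.fit
n_fit = math.floor(n_train * (1 - validation_split))    # rows actually trained on
n_val = n_train - n_fit                                 # rows held out for validation

steps_per_epoch = math.ceil(n_fit / batch_size)
predict_steps = math.ceil(n_test / predict_batch_size)

print(n_test, n_train, n_fit, n_val)
print(steps_per_epoch, predict_steps)
```

Under these assumptions the figures agree with the logs further down: 8,232 test rows in the classification report, 463 steps per training epoch, and 258 prediction steps.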

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Model Building</b></div>

In [18]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embedding_dim,
                    weights=[embedding_matrix],
                    input_length=MAX_LEN,
                    trainable=True))


model.add(Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)))

model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

I0000 00:00:1747082320.070593      31 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13942 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
I0000 00:00:1747082320.071236      31 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13942 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
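
The summary table itself is not captured in the output above, but the parameter counts can be estimated by hand. The sketch below assumes the tokenizer found at least `MAX_VOCAB` distinct words, so that `vocab_size` is 20000; if the vocabulary is smaller, the embedding count shrinks accordingly. The helper `lstm_params` is a hypothetical name.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * ((input_dim + units) * units + units)

vocab_size = 20000     # assumption: the tokenizer saw at least MAX_VOCAB distinct words
embedding_dim = 300
units = 128

embedding = vocab_size * embedding_dim           # one 300-d row per token id
bilstm = 2 * lstm_params(embedding_dim, units)   # forward and backward LSTM
dense = 2 * units * 1 + 1                        # 256 concatenated features plus a bias

print(embedding, bilstm, dense, embedding + bilstm + dense)
```

Under that assumption the embedding layer holds about 6.0 million of roughly 6.44 million parameters, and all of them are updated during training because the layer is created with `trainable=True`.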


In [19]:
# Learning rate scheduler: halve the learning rate after 3 epochs without val_loss improvement
lr_scheduler = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=3,
    verbose=1,
    min_lr=1e-6
)

# Early stopping: stop after 5 epochs without val_loss improvement and restore the best weights
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True,
    verbose=1
)
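
The plateau rule can be traced with a small simulation. This is a simplified sketch of the idea rather than the Keras implementation: it ignores cooldown and assumes Adam's default starting rate of 0.001 and a minimum improvement of 1e-4. The first four validation losses are copied from the training log in the next section; the fifth is a made-up placeholder because that value is cut off in the log.

```python
def simulate_reduce_on_plateau(val_losses, lr=1e-3, factor=0.5, patience=3,
                               min_lr=1e-6, min_delta=1e-4):
    """Return the learning rate in effect after each epoch's plateau check."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Epochs 1-4 from the training log; the epoch-5 value is a hypothetical placeholder.
val_losses = [0.3598, 0.2616, 0.2640, 0.2821, 0.3000]
lr_history = simulate_reduce_on_plateau(val_losses)
print(lr_history)
```

The best validation loss is reached at epoch 2, so three non-improving epochs later the rate is halved, which is consistent with the reduction reported at epoch 5.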

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Model Training</b></div>

In [20]:
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=64,
    validation_split=0.1,
    callbacks=[lr_scheduler, early_stop]
)

Epoch 1/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 45s 81ms/step - accuracy: 0.6793 - loss: 0.5870 - val_accuracy: 0.8524 - val_loss: 0.3598 - learning_rate: 0.0010
Epoch 2/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 36s 78ms/step - accuracy: 0.8794 - loss: 0.3030 - val_accuracy: 0.8986 - val_loss: 0.2616 - learning_rate: 0.0010
Epoch 3/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 36s 77ms/step - accuracy: 0.9395 - loss: 0.1720 - val_accuracy: 0.9059 - val_loss: 0.2640 - learning_rate: 0.0010
Epoch 4/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 36s 78ms/step - accuracy: 0.9607 - loss: 0.1128 - val_accuracy: 0.8961 - val_loss: 0.2821 - learning_rate: 0.0010
Epoch 5/50
463/463 ━━━━━━━━━━━━━━━━━━━━ 0s 76ms/step - accuracy: 0.9719 - loss: 0.0843
Epoch 5: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
463/463 ━━━━━━━━━━━━━━━━━━━━

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Classification Report</b></div>

In [21]:
# Predict
y_pred_probs = model.predict(X_test)
y_pred = (y_pred_probs > 0.5).astype(int).ravel()  # threshold at 0.5 and flatten (n, 1) to a 1-D label vector

# Report
print(classification_report(y_test, y_pred))

258/258 ━━━━━━━━━━━━━━━━━━━━ 8s 28ms/step
              precision    recall  f1-score   support

           0       0.90      0.92      0.91      4615
           1       0.89      0.87      0.88      3617

    accuracy                           0.90      8232
   macro avg       0.90      0.90      0.90      8232
weighted avg       0.90      0.90      0.90      8232
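
The averaged rows can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1 from the rounded per-class values printed in the report, so the third decimal may differ slightly from what scikit-learn computes internally from unrounded scores.

```python
# Per-class F1 and support copied from the classification report above.
report = {
    0: {"f1": 0.91, "support": 4615},
    1: {"f1": 0.88, "support": 3617},
}

total = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total

print(f"support total: {total}")
print(f"macro F1:      {macro_f1:.3f}")
print(f"weighted F1:   {weighted_f1:.3f}")
```

Both averages round to 0.90. The weighted figure sits slightly closer to the class 0 score because class 0 makes up about 56% of the test set.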



<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#9b59b6; overflow:hidden"><b>Thank You</b></div>