### **Mini Project 3 – Twitter Sentiment Analysis Using NLP and Python**

### **Objective:** 

Use Python libraries such as Pandas for data operations, Seaborn and Matplotlib for data visualization and EDA, NLTK for text processing, and Scikit-learn for data splitting and performance evaluation, together with a TensorFlow/Keras LSTM model, to predict the sentiment category (negative, neutral, or positive) expressed in each tweet.

### **Dataset description:** 

The dataset contains tweets as text together with a sentiment label for each, as described below.

Tweets (column `clean_text`): A sentence written by an individual user.

category: Numeric (-1: Negative, 0: Neutral, 1: Positive). This is the dependent variable.

### **Tasks To Be Performed**

• Read the data from the given Excel file.

• Change the dependent variable to categorical (0 to “Neutral”, -1 to “Negative”, 1 to “Positive”).

• Perform missing-value analysis and drop all null/missing values.

• Clean the text (remove every non-alphanumeric symbol, including punctuation; convert all words to lowercase; and remove stopwords).

• Create a new column holding the length of each sentence (the number of words it contains).

• Split the data into independent (X) and dependent (y) variables.

• Perform operations on the text data

    Hints:
        o Do one-hot encoding for each sentence (use TensorFlow)

        o Add padding at the front of each sequence (use TensorFlow)

        o Build an LSTM model and compile it (specify the embedding features, input length, vocabulary size, dropout layer, and output activation function)

        o Create dummy variables for the dependent variable

        o Split the data into train and test sets
        
• Train the new model

• Normalize the predictions to match the format of the original data (predictions may be decimals, so the class whose value is nearest to 1 is set to 1 and all others to 0)

• Measure performance metrics and accuracy

• Print the classification report

### **Task 1:** *Import libraries and load the dataset*

In [1]:
import re
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from nltk.corpus import stopwords

In [2]:
# Load dataset
twitter_df = pd.read_csv('datasets/Twitter_Data.csv')

In [3]:
twitter_df.head()

| | clean_text | category |
|---|---|---|
| 0 | when modi promised “minimum government maximum... | -1.0 |
| 1 | talk all the nonsense and continue all the dra... | 0.0 |
| 2 | what did just say vote for modi  welcome bjp t... | 1.0 |
| 3 | asking his supporters prefix chowkidar their n... | 1.0 |
| 4 | answer who among these the most powerful world... | 1.0 |


### **Task 2:** *Change our dependent variable to categorical*

In [4]:
twitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162980 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162976 non-null  object 
 1   category    162973 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.5+ MB


In [5]:
# checking for unique values
twitter_df.category.unique()

array([-1.,  0.,  1., nan])

### **Task 3:** *Do Missing value analysis and drop all null/missing values*

In [6]:
# checking for missing values
twitter_df.isna().sum()


clean_text    4
category      7
dtype: int64

In [7]:
#  checking the shape of the dataset before dropping the missing values
twitter_df.shape

(162980, 2)

In [8]:
twitter_df.dropna(inplace=True)

**NB:** *Completing Task 2 (the mapping is applied after the missing values are dropped, so no NaN categories remain)*

In [9]:
# Convert dependent variable to categorical
twitter_df['category'] = twitter_df['category'].map({0.0: 'Neutral', -1.0: 'Negative', 1.0: 'Positive'})
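As a quick illustration of the mapping above, the sketch below applies the same dictionary to a small hand-made frame (the sample values are assumptions, not rows from the dataset). It also shows why dropping missing values first is convenient: `Series.map` turns any value without a dictionary key, including NaN, into NaN.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for twitter_df (values are illustrative only)
toy_df = pd.DataFrame({"category": [-1.0, 0.0, 1.0, np.nan]})

label_map = {0.0: "Neutral", -1.0: "Negative", 1.0: "Positive"}

# Without dropna, the NaN row stays NaN after mapping
mapped_with_nan = toy_df["category"].map(label_map)

# Dropping missing rows first leaves only the three named classes
mapped_clean = toy_df.dropna()["category"].map(label_map)

print(mapped_with_nan.isna().sum())
print(mapped_clean.tolist())
```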


### **Task 4:** *Do text cleaning* 

In [10]:
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Remove non-alphanumeric characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Apply text cleaning
twitter_df['cleaned_text'] = twitter_df['clean_text'].apply(clean_text)
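To see what the cleaning step does, here is a minimal self-contained sketch of the same three operations. It assumes a tiny hand-picked stopword set (`demo_stop_words`) in place of the NLTK list so that it runs without downloading any corpus; the sample sentence is likewise made up.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
demo_stop_words = {"the", "is", "a", "for", "and", "to"}

def clean_text_demo(text):
    # Keep only letters, digits and whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Lowercase everything
    text = text.lower()
    # Drop stopwords; split/join also collapses repeated whitespace
    return " ".join(w for w in text.split() if w not in demo_stop_words)

sample = "The BUDGET is #great for farmers, and a win!!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

The word count used in Task 5 would then be `len(cleaned.split())` for this sample.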


### **Task 5:** *Create a new column and find the length of each sentence* 

In [11]:
# Calculate sentence length
twitter_df['sentence_length'] = twitter_df['cleaned_text'].apply(lambda x: len(x.split()))


### **Task 6:**  *Split data into independent (X) and dependent (y) variables*

In [12]:
# Features and target variable
X = twitter_df['cleaned_text']
y = twitter_df['category']


### **Task 7:** *Do operations on text data*

In [13]:
# Encode each sentence as integer word indices with the Keras Tokenizer (the encoding step from the hints), then pad at the front
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X_sequences = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(X_sequences, padding='pre')
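The effect of `padding='pre'` can be sketched in plain NumPy. The helper below (`pad_pre`, a hypothetical name) is not the Keras implementation; it only mimics the behaviour of left-padding every sequence with zeros up to the longest length, using made-up token ids.

```python
import numpy as np

def pad_pre(sequences, value=0):
    # Left-pad each integer sequence with `value` up to the longest length
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if seq:  # empty sequences stay all-padding
            padded[row, -len(seq):] = seq
    return padded

demo_sequences = [[4, 9], [7, 2, 5, 1], [3]]
demo_padded = pad_pre(demo_sequences)
print(demo_padded)
```

Because the Tokenizer's word indices start at 1, index 0 is free to act as the padding value, which is why the embedding layer later uses `len(tokenizer.word_index)+1` as its vocabulary size.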


In [14]:
X_padded.shape

(162969, 43)

In [15]:
# Dummy variable creation for the dependent variable
label_binarizer = LabelBinarizer()
y_encoded = label_binarizer.fit_transform(y)
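The column order of `y_encoded` matters later when predictions are decoded. `LabelBinarizer` stores its classes in sorted order, so for these labels the columns are Negative, Neutral, Positive (the same order that appears in the classification report). A NumPy-only sketch of the same one-hot construction, on made-up labels:

```python
import numpy as np

demo_labels = np.array(["Positive", "Negative", "Neutral", "Positive"])

# np.unique returns the sorted class names and, for each label, its class index
classes, class_index = np.unique(demo_labels, return_inverse=True)

# Rows of the identity matrix give one indicator column per class
one_hot = np.eye(len(classes), dtype=int)[class_index]

print(classes)
print(one_hot)
```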


In [27]:
y_encoded.shape

(162969, 3)

In [19]:
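# Define the LSTM model
model = Sequential()
# Embedding: vocabulary size (+1 for the padding index 0), 128 features per word, fixed input length
model.add(Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=128, input_length=X_padded.shape[1]))
# First LSTM returns the full sequence so that a second LSTM can be stacked on top
model.add(LSTM(128, return_sequences=True))
# Dropout layer to reduce overfitting
model.add(Dropout(0.5))
model.add(LSTM(64))
# Softmax output over the 3 classes: Negative, Neutral, Positive
model.add(Dense(3, activation='softmax'))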

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
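For a rough sense of model size, the trainable parameters of the recurrent and output layers can be worked out by hand. The sketch assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and one bias vector); the helper names are illustrative. The embedding layer is left out because the vocabulary size is not printed in this notebook; it would add `input_dim * 128` parameters.

```python
def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix + bias
    return input_dim * units + units

lstm1 = lstm_params(input_dim=128, units=128)  # fed by the 128-dim embedding
lstm2 = lstm_params(input_dim=128, units=64)   # fed by the first LSTM's 128 outputs
dense = dense_params(input_dim=64, units=3)    # fed by the second LSTM's 64 outputs

print(lstm1, lstm2, dense)
```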

In [20]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_padded, y_encoded, test_size=0.2, random_state=42)
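The sizes of the two splits follow from the row count after dropping missing values (162,969, as in `X_padded.shape`). The sketch assumes that scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set.

```python
import math

n_samples = 162969   # rows left after dropna
test_size = 0.2

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The resulting test-set size agrees with the total support of 32594 shown in the classification report.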

### **Task 8:** *Train the model*

In [21]:
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.2)

Epoch 1/5
3260/3260 ━━━━━━━━━━━━━━━━━━━━ 2118s 631ms/step - accuracy: 0.7954 - loss: 0.5159 - val_accuracy: 0.9162 - val_loss: 0.2717
Epoch 2/5
3260/3260 ━━━━━━━━━━━━━━━━━━━━ 1525s 468ms/step - accuracy: 0.9313 - loss: 0.2221 - val_accuracy: 0.9074 - val_loss: 0.2935
Epoch 3/5
3260/3260 ━━━━━━━━━━━━━━━━━━━━ 825s 245ms/step - accuracy: 0.9574 - loss: 0.1412 - val_accuracy: 0.9102 - val_loss: 0.3251
Epoch 4/5
3260/3260 ━━━━━━━━━━━━━━━━━━━━ 806s 247ms/step - accuracy: 0.9708 - loss: 0.0913 - val_accuracy: 0.8973 - val_loss: 0.4063
Epoch 5/5
3260/3260 ━━━━━━━━━━━━━━━━━━━━ 816s 250ms/step - accuracy: 0.9807 - loss: 0.0597 - val_accuracy: 0.8866 - val_loss: 0.4719


<keras.src.callbacks.history.History at 0x224bde49b50>
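Two things can be read off the training log. First, the step count: assuming Keras holds out 20% of the training rows for validation (`validation_split=0.2`), the remaining rows at a batch size of 32 give the 3260 steps per epoch shown. Second, validation loss is lowest after the first epoch and rises afterwards while training accuracy keeps climbing, which suggests the model starts to overfit. The sketch below only reproduces that arithmetic; the variable names are illustrative.

```python
import math

n_train = 162969 - math.ceil(0.2 * 162969)  # rows in X_train after the 80/20 split
n_fit = n_train * 8 // 10                   # rows left after validation_split=0.2 (assumed)
steps_per_epoch = math.ceil(n_fit / 32)     # batch_size=32

# val_loss values copied from the five logged epochs
val_loss = [0.2717, 0.2935, 0.3251, 0.4063, 0.4719]
best_epoch = val_loss.index(min(val_loss)) + 1

print(steps_per_epoch, best_epoch)
```

A common response to this pattern is to keep the weights from the epoch with the lowest validation loss, for example through an early-stopping callback, although that is not done in this notebook.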

In [22]:
# Make predictions
y_pred = model.predict(X_test)


[1m1019/1019[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m51s[0m 48ms/step


### **Task 9:** *Normalize the prediction*

In [23]:
# Normalize predictions: convert softmax probabilities and one-hot labels to class indices
y_pred_normalized = np.argmax(y_pred, axis=1)
y_test_labels = np.argmax(y_test, axis=1)
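The task statement asks for predictions in the same 0/1 format as the encoded labels. Taking `argmax` gives the winning class index per row; if the one-hot form is wanted as well, it can be rebuilt from those indices. A small sketch with made-up probabilities (the array `demo_probs` is an assumption, not model output):

```python
import numpy as np

# Each row: softmax-style scores for (Negative, Neutral, Positive)
demo_probs = np.array([
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
])

# Winning class per row
pred_index = np.argmax(demo_probs, axis=1)

# 1 for the winning class, 0 elsewhere
pred_one_hot = np.eye(demo_probs.shape[1], dtype=int)[pred_index]

class_names = np.array(["Negative", "Neutral", "Positive"])
print(pred_index, class_names[pred_index])
print(pred_one_hot)
```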

### **Task 10:** *Measure performance metrics and accuracy*
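
Overall accuracy and a confusion matrix can be computed directly from the class indices produced in Task 9. The following is a minimal sketch using NumPy only, with small made-up arrays (`demo_true`, `demo_pred`) standing in for `y_test_labels` and `y_pred_normalized`.

```python
import numpy as np

# Made-up class indices: 0 = Negative, 1 = Neutral, 2 = Positive
demo_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
demo_pred = np.array([0, 1, 2, 1, 1, 2, 2, 1])

# Accuracy: share of rows where the prediction equals the true class
accuracy = np.mean(demo_true == demo_pred)

# Confusion matrix: rows are true classes, columns are predicted classes
n_classes = 3
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (demo_true, demo_pred), 1)

print(accuracy)
print(confusion)
```

With scikit-learn, `accuracy_score` and `confusion_matrix` from `sklearn.metrics` return the same quantities.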

### **Task 11:** *Classification report*

In [26]:
# Classification report
report = classification_report(y_test_labels, y_pred_normalized, target_names=label_binarizer.classes_)
print(report)

              precision    recall  f1-score   support

    Negative       0.79      0.84      0.81      7152
     Neutral       0.92      0.91      0.91     11067
    Positive       0.91      0.89      0.90     14375

    accuracy                           0.89     32594
   macro avg       0.87      0.88      0.88     32594
weighted avg       0.89      0.89      0.89     32594
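
The two average rows can be checked by hand from the per-class figures. The macro average is the plain mean over the three classes, while the weighted average weights each class by its support. Because the inputs below are the rounded values printed in the report, the results are approximate.

```python
precision = {"Negative": 0.79, "Neutral": 0.92, "Positive": 0.91}
support = {"Negative": 7152, "Neutral": 11067, "Positive": 14375}

total = sum(support.values())

# Unweighted mean across classes
macro_precision = sum(precision.values()) / len(precision)

# Mean weighted by the number of test samples in each class
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print(total, round(macro_precision, 2), round(weighted_precision, 2))
```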

