<a href="https://colab.research.google.com/github/HadiyaArfa/Sentiment_Analysis_for_Twitter_data/blob/main/Sentiment_Analysis_for_Twitter_data_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Title:** **Sentiment Analysis on Twitter Data Using Deep Learning**

**Objective:**
To build a robust Deep Learning-based sentiment analysis model capable of accurately categorizing tweets into different sentiment classes (positive, negative, neutral). The model will leverage Natural Language Processing (NLP) techniques and Deep Learning architectures to understand and predict sentiments from text data.


Data Loading and Exploration:

In [67]:
# Import necessary libraries
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

In [68]:
# Load the Twitter data
data = pd.read_csv('/Twitter- Project 1.csv', encoding='ISO-8859-1')
# Explore the data
print(data.head())
print(data.info())
print(data['Sentiment'].value_counts())
print(data.shape)

                                          Tweet Text Sentiment  \
0  Our team is excited to introduce our latest pr...  Positive   
1  We apologize for the inconvenience caused by o...  Negative   
2  Neutral on this matter, awaiting further updat...   Neutral   
3  We're grateful for the continuous support from...  Positive   
4  Disappointed to announce the delay in the ship...  Negative   

             Timestamp         User  
0   2023-11-18 8:45:00  @CompanyXYZ  
1   2023-11-18 9:30:00  @CompanyXYZ  
2  2023-11-18 10:15:00  @CompanyXYZ  
3  2023-11-18 11:00:00  @CompanyXYZ  
4  2023-11-18 11:45:00  @CompanyXYZ  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Tweet Text  52 non-null     object
 1   Sentiment   52 non-null     object
 2   Timestamp   52 non-null     object
 3   User        52 non-null     object
dtypes: object(4)
memory usage: 1

In [69]:
# Check each column for missing values
data.isnull().sum()

Tweet Text    0
Sentiment     0
Timestamp     0
User          0
dtype: int64

Data Preprocessing

In [70]:
# Encode labels
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(data['Sentiment'])
encoded_labels

array([2, 0, 1, 2, 0, 2, 2, 2, 1, 0, 1, 2, 0, 1, 2, 0, 2, 1, 0, 1, 2, 1,
       2, 0, 1, 2, 0, 2, 1, 0, 1, 2, 0, 1, 2, 2, 1, 1, 2, 1, 0, 1, 2, 0,
       2, 1, 2, 1, 0, 1, 2, 1])

where,

 2 --> Positive

 0 --> Negative

 1 --> Neutral
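
The mapping above follows from `LabelEncoder` assigning codes to the labels in sorted order. A minimal sketch of how it could be read back from the fitted encoder instead of being written by hand; the `labels` list is a hypothetical stand-in for `data['Sentiment']`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, standing in for data['Sentiment']
labels = ['Positive', 'Negative', 'Neutral', 'Positive']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the code
mapping = {code: str(label) for code, label in enumerate(label_encoder.classes_)}
print(mapping)   # {0: 'Negative', 1: 'Neutral', 2: 'Positive'}

# inverse_transform turns codes back into the original labels
decoded = label_encoder.inverse_transform(encoded)
print(list(decoded))
```

Reading the mapping from `classes_` keeps it correct if a label is added or renamed later.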

In [71]:
data['Encoded Sentiment'] = encoded_labels

In [72]:
# checking the distribution of sentiments column
data['Encoded Sentiment'].value_counts()

2    20
1    19
0    13
Name: Encoded Sentiment, dtype: int64

Stemming: Stemming is a natural language processing technique that reduces words to their base (root) form by stripping suffixes. It normalizes text so that different inflections of a word are treated as the same token.

Ex: excited, exciting --> excit

These are forms of the same word, not synonyms, and the resulting stem need not be a dictionary word.
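
A short illustration of what the Porter stemmer does to individual words. The word list is chosen for illustration only; the stemmer itself needs no NLTK data download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same word collapse to a single stem
words = ['excited', 'exciting', 'introduce', 'introduced']
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, '->', stem)
```

Because both "excited" and "exciting" map to one stem, the vectorizer later counts them as one feature rather than two.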

In [73]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [74]:
# Stopwords carry little sentiment information, so removing them from the tweets shrinks the vocabulary and reduces computation
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [75]:
port_stem = PorterStemmer()
# Build the stop-word set once; a set lookup avoids reloading the list for every word
stop_words = set(stopwords.words('english'))

def stemming(content):
    # Remove URLs (http\S+ also covers https links)
    text = re.sub(r'http\S+|www\S+', '', content)

    # Remove mentions
    text = re.sub(r'@\w+', '', text)

    # Remove hashtags (the whole tag, including its word)
    text = re.sub(r'#\w+', '', text)

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Convert to lowercase
    text = text.lower()

    # Split the tweet into a list of words
    words = text.split()

    # Stem every word that is not a stop word, then rejoin into one string
    words = [port_stem.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

In [76]:
# Apply the stemming function to the 'Tweet Text' column
data['Processed Tweet'] = data['Tweet Text'].apply(stemming)

In [77]:
print(data['Processed Tweet'])

0               team excit introduc latest product line
1     apolog inconveni caus recent technic issu team...
2                            neutral matter await updat
3                    grate continu support loyal custom
4               disappoint announc delay shipment order
5     thrill posit feedback receiv recent product la...
6                  strive provid better servic pass day
7                                unexpect sale made day
8            understand concern rais activ work address
9                     team frustrat recur system glitch
10                  thought everyon affect recent event
11                delight see custom enjoy latest offer
12           deepli regret inconveni caus stock shortag
13                  neutral stanc ongo industri discuss
14            look forward share excit news custom soon
15                apolog confus caus recent price updat
16                 posit vibe around celebr team achiev
17                     pleas bear us work improv

In [78]:
# Separate the features (processed tweets) from the labels
X = data['Processed Tweet'].values
Y = data['Encoded Sentiment'].values

In [79]:
print(X)
print(Y)

['team excit introduc latest product line'
 'apolog inconveni caus recent technic issu team work hard resolv'
 'neutral matter await updat' 'grate continu support loyal custom'
 'disappoint announc delay shipment order'
 'thrill posit feedback receiv recent product launch'
 'strive provid better servic pass day' 'unexpect sale made day'
 'understand concern rais activ work address'
 'team frustrat recur system glitch' 'thought everyon affect recent event'
 'delight see custom enjoy latest offer'
 'deepli regret inconveni caus stock shortag'
 'neutral stanc ongo industri discuss'
 'look forward share excit news custom soon'
 'apolog confus caus recent price updat'
 'posit vibe around celebr team achiev' 'pleas bear us work improv servic'
 'disappoint neg rumor circul' 'neutral feel toward recent market trend'
 'eagerli anticip upcom product launch'
 'commit resolv report issu promptli'
 'proud announc involv commun cleanup drive'
 'understand frustrat caus delay respons'
 'neutral stanc

Train-Test Split:

In [80]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [81]:
print(X.shape,X_train.shape,X_test.shape)

(52,) (41,) (11,)
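
The `stratify=Y` argument asks the split to keep the class proportions roughly equal in both parts. A minimal sketch of how that could be checked, using stand-in labels with the class counts reported earlier (13 negative, 19 neutral, 20 positive) and placeholder features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the class counts reported above
Y = np.array([0] * 13 + [1] * 19 + [2] * 20)
X = np.arange(len(Y))  # placeholder features; only the labels matter here

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2
)

# Per-class counts in each split
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)
print('train:', train_counts)
print('test: ', test_counts)
```

With only 11 test tweets, each class contributes just a handful of examples, so the test accuracy moves by about nine percentage points per misclassified tweet.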


In [82]:
print(X_test)

['neutral stanc ongo industri discuss'
 'regret oversight led recent product recal'
 'thrill posit feedback receiv recent product launch'
 'apolog recent deliveri delay' 'neutral feel toward recent market trend'
 'look forward bring innov solut custom'
 'commit resolv report issu promptli' 'commit ensur satisfact servic'
 'neutral stanc recent industri regul'
 'excit introduc latest sustain initi'
 'understand frustrat caus delay respons']


Vectorizing the Tweets

In [83]:
# Converting the tweets to numerical vectors

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [84]:
print(X_train)

  (0, 66)	0.386897389995149
  (0, 43)	0.5311822478475151
  (0, 101)	0.3336461419516689
  (0, 21)	0.4779309998040351
  (0, 29)	0.4779309998040351
  (1, 27)	0.31232936719346716
  (1, 71)	0.407121202763846
  (1, 125)	0.4524827971384037
  (1, 136)	0.3499723525224046
  (1, 133)	0.4524827971384037
  (1, 121)	0.4524827971384037
  (2, 99)	0.5119555968325987
  (2, 105)	0.39597152807493885
  (2, 64)	0.5119555968325987
  (2, 1)	0.42421703585773535
  (2, 66)	0.3728932678202036
  (3, 137)	0.4322507903420552
  (3, 95)	0.48041233262103067
  (3, 12)	0.34991809000724616
  (3, 23)	0.48041233262103067
  (3, 7)	0.37157442292050447
  (3, 101)	0.30175654772827065
  (4, 35)	0.419346659223958
  (4, 115)	0.34747934757123045
  (4, 53)	0.419346659223958
  :	:
  (36, 109)	0.4941328384799485
  (36, 13)	0.44459580960717976
  (36, 92)	0.44459580960717976
  (36, 27)	0.34107859509352234
  (37, 81)	0.43494692215139896
  (37, 42)	0.43494692215139896
  (37, 113)	0.43494692215139896
  (37, 31)	0.43494692215139896
  (37, 6

In [85]:
print(X_test)

  (0, 120)	0.5069818783599955
  (0, 82)	0.5634699859930371
  (0, 79)	0.41041482053423456
  (0, 61)	0.5069818783599955
  (1, 103)	0.6340616221671181
  (1, 101)	0.4426417497136993
  (1, 96)	0.6340616221671181
  (2, 132)	0.44660514448538535
  (2, 101)	0.28052157999845617
  (2, 96)	0.40183278730880934
  (2, 92)	0.40183278730880934
  (2, 68)	0.44660514448538535
  (2, 48)	0.44660514448538535
  (3, 101)	0.4678731736848057
  (3, 30)	0.670204343957491
  (3, 7)	0.576125707364822
  (4, 101)	0.6530675193116966
  (4, 79)	0.7572996865310766
  (5, 70)	0.5950907073049084
  (5, 50)	0.6613959331137744
  (5, 27)	0.4565330981866521
  (6, 107)	0.5248967467892798
  (6, 105)	0.4512153826872965
  (6, 66)	0.42491736554646137
  (6, 19)	0.5833809356616327
  (7, 115)	0.6380403873591676
  (7, 19)	0.7700028987598445
  (8, 120)	0.5640917317905827
  (8, 101)	0.39379540162896653
  (8, 79)	0.45664671016759073
  (8, 61)	0.5640917317905827
  (9, 67)	0.5798476677309216
  (9, 63)	0.6444545084556984
  (9, 46)	0.498452674542

Training the Model using Logistic Regression

In [86]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Model Evaluation

In [87]:
# Accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction)

In [88]:
print('Accuracy score on the training data is :', training_data_accuracy)

Accuracy score on the training data is : 1.0


In [89]:
# Accuracy score on the testing data
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(y_test, X_test_prediction)

In [90]:
print('Accuracy score on the testing data is :', testing_data_accuracy)

Accuracy score on the testing data is : 0.8181818181818182


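Test accuracy = 81.8% (9 of the 11 test tweets classified correctly). Training accuracy is 100%, so with a dataset this small the gap suggests some overfitting.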
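
Accuracy alone does not show which classes get confused. The notebook imports `classification_report` and `confusion_matrix` but never calls them; a minimal sketch of how they could be used follows. The label arrays are hypothetical values for illustration, not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels (0 = Negative, 1 = Neutral, 2 = Positive)
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 1]

accuracy = accuracy_score(y_true, y_pred)
print('accuracy:', accuracy)

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(matrix)

# Per-class precision, recall and F1
print(classification_report(
    y_true, y_pred, labels=[0, 1, 2],
    target_names=['Negative', 'Neutral', 'Positive']
))
```

In the notebook itself, `y_test` and `X_test_prediction` would take the place of the two hypothetical lists.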

Saving the trained model

In [91]:
import pickle
filename = 'trained_model.sav'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(model, f)

Using the saved model for future predictions


In [92]:
# Load the saved model from the same path it was written to
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [93]:
X_new = X_test[2]
print(y_test[2])

# predict_proba returns [[prob_negative, prob_neutral, prob_positive]],
# in the order given by loaded_model.classes_ (0, 1, 2)
prediction_probabilities = loaded_model.predict_proba(X_new.reshape(1, -1))

# Pick the most probable class rather than comparing against fixed thresholds,
# which would label a tweet Negative whenever no class reached 0.5
predicted_class = loaded_model.classes_[np.argmax(prediction_probabilities[0])]

if predicted_class == 2:
    print('Positive Tweet')
elif predicted_class == 1:
    print('Neutral Tweet')
else:
    print('Negative Tweet')


2
Positive Tweet
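
The pickled file holds only the classifier, so scoring a new tweet later also needs the fitted `TfidfVectorizer`. One possible approach, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable, is to bundle both steps and pickle them as one object. The training texts and labels below are hypothetical.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical processed tweets and encoded labels (0 = Negative, 1 = Neutral, 2 = Positive)
texts = ['disappoint delay shipment', 'apolog inconveni caus',
         'neutral matter await updat', 'neutral stanc industri',
         'thrill posit feedback', 'excit introduc latest product']
labels = [0, 0, 1, 1, 2, 2]

# The pipeline keeps the fitted vectorizer and classifier together
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

# Serialize and restore both steps as a single object
restored = pickle.loads(pickle.dumps(pipeline))

# The restored pipeline accepts processed text directly
prediction = restored.predict(['excit latest product'])
print(prediction)
```

New tweets would still need to pass through the same `stemming` function before prediction, since the vocabulary was learned from stemmed text.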
