<a href="https://colab.research.google.com/github/Ayush-rawat7/Twitter_sentiment_analysis_with_NLP/blob/main/Twitter_sentiment_analysis_with_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Twitter Sentiment Analysis Project Workflow**

**1. Data Acquisition:** Collect tweets using the Twitter API or a pre-existing dataset (this project uses Sentiment140).

**2. Data Preprocessing:** Clean and prepare the data:

a. Remove URLs, hashtags, mentions, and special characters.

b. Convert text to lowercase.

c. Remove stop words.

d. Perform stemming using the Porter stemmer.

**3. Feature Extraction:**
Convert text into numerical vectors using TF-IDF.

**4. Model Training:**
Train a machine learning model (Logistic Regression in this case) using the preprocessed data and extracted features.

**5. Model Evaluation:**
Evaluate the model's performance using metrics like accuracy.

**6. Deployment:**
Save and load the trained model to predict the sentiment of new tweets.
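As a rough illustration of steps 2a to 2c, here is a minimal, dependency-free sketch. The regular expressions, the `clean_tweet` helper, and the tiny `STOP_WORDS` set are assumptions made for this example; the notebook itself uses NLTK's full English stop-word list and a simpler "keep letters only" rule, so URL fragments such as `http` survive there as tokens.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list.
STOP_WORDS = {"a", "an", "the", "is", "that", "to", "of", "it", "s"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z]+")


def clean_tweet(text):
    """Apply steps 2a-2c: strip noise, lowercase, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)                 # 2a: URLs
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)  # 2a: @mentions and #hashtags
    text = NON_LETTER_RE.sub(" ", text)          # 2a: digits and special characters
    return [word for word in text.split() if word not in STOP_WORDS]  # 2c


tokens = clean_tweet("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. #sad")
print(tokens)
```

Stemming (step 2d) would then be applied to each remaining token.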

#**1.DATA ACQUISITION**

**Installing Kaggle Library**

In [None]:
!pip install kaggle  # Kaggle CLI, used below to download the Sentiment140 dataset



In [None]:
#Upload the kaggle.json API token file
from google.colab import files
files.upload()


Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"ayush2rawat","key":"<redacted>"}'}

In [None]:
#Configuring the path of Kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
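The three shell commands above create `~/.kaggle`, copy the token there, and restrict it to owner read/write. The same steps can be sketched in plain Python; `install_kaggle_token` is a hypothetical helper written for this example (not part of the Kaggle package), and the demo runs against a temporary directory with a dummy token rather than a real home directory.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home_dir):
    """Copy kaggle.json into <home_dir>/.kaggle and make it owner-only (0600)."""
    target_dir = Path(home_dir) / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)      # mkdir -p
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)                 # cp
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)       # chmod 600
    return target


# Demo with a throwaway directory and a dummy token.
with tempfile.TemporaryDirectory() as tmp:
    token = Path(tmp) / "kaggle.json"
    token.write_text('{"username": "example", "key": "not-a-real-key"}')
    installed = install_kaggle_token(token, Path(tmp) / "home")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(installed.name, oct(mode))
```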


**Downloading the Twitter sentiment dataset with the Kaggle API**


In [None]:
#API to fetch dataset from kaggle
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /content
 95% 77.0M/80.9M [00:00<00:00, 105MB/s]
100% 80.9M/80.9M [00:00<00:00, 85.9MB/s]


In [None]:
#extracting the compressed dataset
from zipfile import ZipFile
dataset = '/content/sentiment140.zip'

with ZipFile(dataset,'r') as zip_file:  # avoid shadowing the built-in zip()
  zip_file.extractall()
  print('The dataset is extracted')

The dataset is extracted


#**2.DATA PREPROCESSING**

**Importing Libraries**

In [None]:
#importing dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import stopwords #natural language tool kit
from nltk.stem.porter import PorterStemmer  #It helps in text preprocessing for NLP tasks by reducing different forms of a word to a common base.
from sklearn.feature_extraction.text import TfidfVectorizer #converting text data into numerical values using TF-IDF (Term Frequency-Inverse Document Frequency).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Printing the English stop words: they contribute little to a tweet's meaning,
# so removing them shrinks the data without losing much context
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [None]:
#loading data from csv file to pandas dataframe
twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv',encoding='ISO-8859-1')

In [None]:
# checking the number of rows and columns
twitter_data.shape

(1599999, 6)

In [None]:
#printing first five rows of dataset
twitter_data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [None]:
# The CSV has no header row, so pandas used the first tweet as column names (and dropped it from the data).
# Supplying the column names and reading the dataset again fixes this.
column_names = ['target','ids','date','flag','user','text']
twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv',encoding='ISO-8859-1',names=column_names)

In [None]:
# checking the number of rows and columns
twitter_data.shape

(1600000, 6)
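The jump from 1,599,999 to 1,600,000 rows comes from how `pd.read_csv` treats the first line. A small sketch with made-up rows (the three-column layout here is an assumption for brevity, not the real six-column file) shows the difference:

```python
import io

import pandas as pd

# Two made-up data rows and no header line.
csv_text = "0,1,first tweet\n4,2,second tweet\n"

# Default behaviour: the first line is consumed as the header.
with_default = pd.read_csv(io.StringIO(csv_text))

# Passing names= makes pandas treat every line as data.
with_names = pd.read_csv(io.StringIO(csv_text), names=["target", "ids", "text"])

print(with_default.shape, list(with_default.columns))
print(with_names.shape, list(with_names.columns))
```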

In [None]:
#printing first five rows of dataset
twitter_data.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [None]:
#counting the number of missing values in the dataset
twitter_data.isnull().sum()

#no value is missing

Unnamed: 0,0
target,0
ids,0
date,0
flag,0
user,0
text,0


In [None]:
#checking the distribution of target column
twitter_data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
4,800000


In [None]:
#convert the target '4' to '1'
twitter_data.replace({'target':{4:1}},inplace=True)



0: Negative tweet

1: Positive tweet

In [None]:
#checking the distribution of target column
twitter_data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
1,800000


In [None]:

port_stem= PorterStemmer()

**Stemming the text column**


In [None]:
#Stemming = reducing a word to its root form
#e.g. running, runs -> run
#using the Porter stemmer

stop_words = set(stopwords.words('english'))  # build the set once; membership tests on a set are fast

def stemming(content):

  stemmed_content = re.sub('[^a-zA-Z]',' ',content)  # replace every character that is not a letter (a-z, A-Z) with a space
  stemmed_content = stemmed_content.lower()   # convert all letters to lower case
  stemmed_content = stemmed_content.split()   # split the tweet into a list of words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]  # drop stop words, stem the rest
  stemmed_content = ' '.join(stemmed_content)  # join the stems back into one string

  return stemmed_content






In [None]:
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming)   # apply the function above to every tweet in the text column
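Applying a stemmer to 1.6 million tweets is slow, and much of the work is repeated because the same words occur again and again. One possible speed-up, sketched below, is to keep the stop words in a set and memoise the stemmer. `make_preprocessor` and `toy_stem` are hypothetical names introduced for this example; `toy_stem` is a stand-in so the sketch runs without NLTK.

```python
import re
from functools import lru_cache


def make_preprocessor(stem_fn, stop_words):
    """Build a tweet-cleaning function around any word-level stemmer."""
    stop_set = frozenset(stop_words)                 # fast membership tests
    cached_stem = lru_cache(maxsize=None)(stem_fn)   # each distinct word is stemmed once

    def preprocess(text):
        words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
        return " ".join(cached_stem(w) for w in words if w not in stop_set)

    return preprocess


# Hypothetical toy stemmer, used only to keep the sketch dependency-free.
def toy_stem(word):
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


preprocess = make_preprocessor(toy_stem, ["is", "the", "and"])
result = preprocess("The dog is barking and jumping!")
print(result)
```

In the notebook, the same factory could be called as `make_preprocessor(port_stem.stem, stopwords.words('english'))` and the returned function passed to `.apply`.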


In [None]:
print(twitter_data['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


#**3. MODEL DEVELOPMENT**

**Stepwise creation of a ML Model - Logistic Regression**

In [None]:
#separating the data and label

X = twitter_data['stemmed_content'].values
Y = twitter_data['target'].values

In [None]:
print(X)

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']


**Splitting the data into training and test sets**

In [None]:
print(Y)


[0 0 0 ... 1 1 1]


In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)

In [None]:
print(X.shape,X_train.shape,X_test.shape)


(1600000,) (1280000,) (320000,)
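The `stratify=Y` argument keeps the class proportions the same in both parts of the split. The notebook's data is already balanced, so the effect is easier to see on a made-up imbalanced label array, as in this sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Share of positive labels in each part of the split.
print(y_tr.mean(), y_te.mean())
```

Both parts keep the original 20% share of positive labels.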


In [None]:
print(X_train)

['watch saw iv drink lil wine' 'hatermagazin'
 'even though favourit drink think vodka coke wipe mind time think im gonna find new drink'
 ... 'eager monday afternoon'
 'hope everyon mother great day wait hear guy store tomorrow'
 'love wake folger bad voic deeper']


In [None]:
print(X_test)

['mmangen fine much time chat twitter hubbi back summer amp tend domin free time'
 'ah may show w ruth kim amp geoffrey sanhueza'
 'ishatara mayb bay area thang dammit' ...
 'destini nevertheless hooray member wonder safe trip' 'feel well'
 'supersandro thank']


#**4.FEATURE EXTRACTION**

**converting the textual data into numerical data**

In [None]:
#converting the textual data into numerical data

vectorizer = TfidfVectorizer()

# Fit the vectorizer on the full corpus and transform it into a sparse TF-IDF matrix.
# Note: fitting before the split lets the test tweets influence the vocabulary and IDF weights.
X = vectorizer.fit_transform(X)

# Split the vectorized data; the same stratify and random_state reproduce the earlier split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

# X_train and X_test are now sparse TF-IDF matrices
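Because the vectorizer in this cell sees every tweet before the split, the test set contributes to the vocabulary and IDF statistics. A common alternative is to split the text first, fit on the training part only, and reuse the fitted vectorizer on the test part. The sketch below uses made-up, already-stemmed tweets; it is not the notebook's original method and would likely change the reported accuracy slightly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up, already-cleaned tweets and labels for illustration.
texts = ["happi day", "sad day", "love sunshin", "hate rain",
         "great fun", "bad luck", "nice walk", "aw traffic"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary and IDF come from training text only
X_test = vectorizer.transform(test_texts)        # reuse them; words unseen in training are ignored

print(X_train.shape, X_test.shape)
```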

In [None]:
print(X_train)

  (0, 504044)	0.2725664180116586
  (0, 126118)	0.37484032460318173
  (0, 409438)	0.35861719248416485
  (0, 213554)	0.5289928671316322
  (0, 271144)	0.42037613603065954
  (0, 511398)	0.4472655127083222
  (1, 185304)	1.0
  (2, 472107)	0.151714231979682
  (2, 470034)	0.18720443014780436
  (2, 126118)	0.458250851237366
  (2, 205312)	0.16175394132247142
  (2, 333015)	0.1678067864881519
  (2, 468994)	0.3209933731315585
  (2, 143635)	0.18906085394224742
  (2, 173898)	0.18807038740989349
  (2, 152669)	0.20261379420870737
  (2, 90030)	0.31341180319788786
  (2, 307827)	0.24104556465616028
  (2, 511810)	0.3361762065862128
  (2, 149331)	0.29041017783193285
  (2, 500375)	0.3297483694639192
  (3, 175126)	0.27847482359309217
  (3, 474893)	0.27096057778571386
  (3, 198842)	0.3743117172596608
  (3, 468994)	0.2901937126965763
  :	:
  (1279996, 367538)	0.21106719280315756
  (1279996, 274436)	0.22225898756809329
  (1279996, 500911)	0.27035064240730977
  (1279996, 430827)	0.3500359054978763
  (1279996, 274

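Each entry `(row, column)  value` in this sparse matrix gives the TF-IDF weight of one vocabulary term (the column index) in one tweet (the row index). The values describe how important a word is to a tweet; they are not sentiment scores.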
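To make the numbers in this output less opaque, here is a hand-rolled sketch of the weighting, assuming `TfidfVectorizer`'s default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row). The `tfidf` function and the three sample documents are written for this example. It also shows why a one-word tweet such as `hatermagazin` gets a weight of exactly 1.0.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenised:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # guard empty documents
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


rows = tfidf(["hatermagazin", "eager monday afternoon", "happi monday"])
print(rows[0])  # single-term document
print(rows[2])  # the rarer term 'happi' outweighs the more common 'monday'
```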

**Training the Machine learning Model Using Logistic Regression**

In [None]:
model= LogisticRegression(max_iter=1000)


In [None]:
model.fit(X_train,Y_train)

#**5.MODEL EVALUATION**

**Model Evaluation accuracy score**

In [None]:
# Model Evaluation - accuracy score
#accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

print('accuracy score of the training data: ', training_data_accuracy)

accuracy score of the training data:  0.8047734375


In [None]:
#accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print('accuracy score of the test data: ', test_data_accuracy)

accuracy score of the test data:  0.777475


**Test accuracy = 77.7%** (training accuracy: 80.5%)
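Accuracy is only one of the metrics mentioned in the workflow. As a small sketch, precision and recall for the positive class can be computed from the same kind of label and prediction arrays; `binary_metrics` is a hypothetical helper and the labels below are made up, not taken from the notebook's results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```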



#**6.MODEL DEPLOYMENT**

**saving the trained model**

In [None]:
import pickle

In [None]:
filename =  '/content/drive/MyDrive/Colab Notebooks/Twitter sentiment analysis with NLP/TSA_model.pkl'
with open(filename,'wb') as f:   # the with-block closes the file after writing
  pickle.dump(model,f)

**using saved model for future prediction**

In [None]:
#loading saved model
with open('/content/drive/MyDrive/Colab Notebooks/Twitter sentiment analysis with NLP/TSA_model.pkl','rb') as f:
  loaded_model = pickle.load(f)

In [None]:
X_new = X_test[200]
print('True Value: ',Y_test[200])

prediction = loaded_model.predict(X_new)  # use the model loaded from disk
print('Predicted value by model: ',prediction)

if (prediction[0]==0):
  print('Negative Tweet')
else:
  print('Positive Tweet')

True Value:  1
Predicted value by model:  [1]
Positive Tweet


In [None]:
X_new = X_test[320]
print('True Value: ',Y_test[320])

prediction = loaded_model.predict(X_new)  # use the model loaded from disk
print('Predicted value by model: ',prediction)

if (prediction[0]==0):
  print('Negative Tweet')
else:
  print('Positive Tweet')

True Value:  0
Predicted value by model:  [0]
Negative Tweet
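The pickle file above stores only the classifier. The fitted `TfidfVectorizer` is not saved, so a fresh session could not turn a new tweet into features. One way to handle this, sketched below with made-up stemmed tweets, is to wrap both steps in a scikit-learn pipeline and pickle that single object. The in-memory `pickle.dumps` call stands in for writing a `.pkl` file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned tweets and labels for illustration.
texts = ["happi day", "sad day", "love sunshin", "hate rain",
         "great fun", "bad luck", "nice walk", "aw traffic"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Vectorizer and classifier travel together as one object.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

blob = pickle.dumps(pipeline)    # in-memory stand-in for saving to disk
restored = pickle.loads(blob)    # what a later session would load

new_tweets = ["happi sunshin", "hate traffic"]
original_pred = pipeline.predict(new_tweets)
restored_pred = restored.predict(new_tweets)
print(restored_pred)
```

New tweets would still need the same cleaning and stemming as the training text before being passed to `predict`.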


#**Practical Applications of Sentiment Analysis**

**1. Monitor brand reputation:** Analyze public sentiment towards a brand or product to gain insights into customer satisfaction and identify areas for improvement.

**2. Track customer feedback:** Automatically categorize customer reviews or tweets to understand customer sentiment and improve products or services.

**3. Identify emerging trends:** Detect trending topics and public opinion towards those topics, providing valuable insights for businesses and decision-makers.

**4. Conduct market research:** Understand consumer sentiment towards various products or services to inform marketing strategies and campaigns.