# Twitter Sentiment Analysis - NLP

The goal is to use NLP to understand and interpret human sentiment on social media platforms (in this case Twitter, now X).

This notebook uses the [sentiment140 dataset](https://www.kaggle.com/datasets/kazanova/sentiment140). It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

**Content**

It contains the following 6 fields:

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive); the training file used here contains only 0 and 4
- ids: the id of the tweet
- date: the date of the tweet
- flag: the query; if there is no query, this value is NO_QUERY
- user: the user that tweeted
- text: the text of the tweet
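To make the record layout concrete, here is a minimal sketch that parses one row into the six fields with the standard-library `csv` module. The sample row is taken from the `df.head()` output in this notebook; the `FIELDS` list and the `record` name are illustrative and not part of the original code.

```python
import csv
import io

# Field names as listed by the dataset description.
FIELDS = ["target", "ids", "date", "flag", "user", "text"]

# One row as it appears in the df.head() output of this notebook.
raw = ("0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,"
       "my whole body feels itchy and like its on fire\n")

# csv.reader handles quoted fields, which matters for tweets containing commas.
row = next(csv.reader(io.StringIO(raw)))
record = dict(zip(FIELDS, row))
record["target"] = int(record["target"])  # polarity is an integer code

print(record["user"], record["target"])
print(record["text"])
```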

In [1]:
import pandas as pd
import numpy as np
import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Print the list of common English stopwords.
# Stopwords are frequently used words that are typically filtered out during text processing to focus on more meaningful words.

import nltk
nltk.download('stopwords')

print(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any'

**Data collection & Processing**

In [3]:
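# First attempt: read the file without column names (pandas then treats the first tweet as the header row)
df = pd.read_csv('/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1')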

df.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [4]:
# The file has no header row, so pass the column names explicitly and read it again

column_names = ['target', 'id', 'date', 'flag', 'user', 'text']

df = pd.read_csv('/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv', names = column_names, encoding='ISO-8859-1')

df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [5]:
df.shape

(1600000, 6)

In [6]:
df.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

In [7]:
# Checking the distribution of target column

df['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

In [8]:
# Convert the target '4' to '1'

df.replace({'target':{4:1}}, inplace=True)

In [9]:
# Checking the distribution of target column
# ** 0 = negative tweet
# ** 1 = positive tweet

df['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64

**Stemming**

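Stemming is a text preprocessing technique in natural language processing (NLP) that reduces words to a base or root form by stripping affixes. The Porter stemmer used here strips suffixes, so the result is not always a dictionary word (for example, "update" becomes "updat").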
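A quick way to see the effect is to stem a few words directly. This is a small sketch using NLTK's `PorterStemmer`; the word list is chosen for illustration from the tweets shown in this notebook, and the `stemmer`, `words` and `stems` names are not part of the original code.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; the resulting stems are often not dictionary words.
words = ["feels", "itchy", "update", "times", "managed"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```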

In [10]:
# Instantiate PorterStemmer

port_stem = PorterStemmer()

In [11]:
# Clean and preprocess a tweet: replace non-alphabetic characters with spaces, convert to lowercase,
# split into words, remove stopwords, stem the remaining words, and join them back into a single string.

# Build the stopword set once; calling stopwords.words('english') for every word is very slow.
stop_words = set(stopwords.words('english'))

def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)

    return stemmed_content
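The cleaning steps inside `stemming` can be traced one at a time on a single tweet. The sketch below is dependency-free, so it makes two simplifying assumptions: a small hand-picked `SAMPLE_STOPWORDS` set stands in for NLTK's English list, and the stemming step is left out.

```python
import re

# Hypothetical subset of stopwords, standing in for stopwords.words('english').
SAMPLE_STOPWORDS = {"no", "it", "s", "not", "at", "all"}

tweet = "@nationwideclass no, it's not behaving at all...."

letters_only = re.sub('[^a-zA-Z]', ' ', tweet)   # punctuation and '@' become spaces
lowered = letters_only.lower()
tokens = lowered.split()                         # split on runs of whitespace
kept = [t for t in tokens if t not in SAMPLE_STOPWORDS]

print(tokens)
print(kept)
```

Note that the apostrophe in "it's" is replaced by a space, which is why a lone "s" appears as a token before stopword removal.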

In [12]:
# Apply the stemming function to the 'text' column. 

df['stemmed content'] = df['text'].apply(stemming)

df.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


**Separating feature and target**

In [13]:
X = df['stemmed content'].values
Y = df['target'].values

In [14]:
print(X)

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']


In [15]:
print(Y)

[0 0 0 ... 1 1 1]


**Split the data into train and test sets**

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, stratify = Y, random_state = 3)

In [17]:
print(X.shape, X_train.shape, X_test.shape)

(1600000,) (1120000,) (480000,)
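The `stratify = Y` argument asks for the same class proportions in both splits. The sketch below shows the idea in plain Python on a toy label list; `stratified_split` is a hypothetical helper written for illustration and is not how scikit-learn implements it internally.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # same fraction per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy balanced labels, mirroring the 50/50 balance of the dataset.
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=3)

print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```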


**Convert textual data to numerical**

In [18]:
# Instantiate TfidfVectorizer, fit it on the training text only, and reuse the fitted vocabulary for the test text
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
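When the resulting sparse matrices are printed, each line has the form `(row, column) weight`: the tweet index, the vocabulary index of a term, and that term's TF-IDF weight. As a rough sketch of where such weights come from, the code below computes them by hand for three toy documents, assuming scikit-learn's documented defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalisation). The documents and the `tfidf_row` helper are illustrative.

```python
import math
from collections import Counter

docs = [
    "upset updat facebook",
    "whole bodi feel itchi like fire",
    "feel happi like birthday",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for term, weight in sorted(rows[2].items()):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer documents ("happi", "birthday") receive a larger weight than terms shared across documents ("feel", "like").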

In [19]:
print(X_train)

  (0, 118512)	0.39409245008142857
  (0, 155006)	0.25179020741046637
  (0, 298925)	0.33392635957197253
  (0, 11693)	0.3120045155341715
  (0, 124608)	0.5129590998684574
  (0, 311804)	0.2777058939190307
  (0, 376483)	0.22855229245575107
  (0, 355764)	0.3780062147170107
  (0, 137062)	0.19252727210351214
  (1, 377435)	0.2642979402726371
  (1, 325394)	0.29205729847806877
  (1, 408008)	0.21508817044822207
  (1, 399197)	0.304401180431251
  (1, 97519)	0.29965760783450796
  (1, 315561)	0.3538996430817855
  (1, 135399)	0.44606144726841673
  (1, 49123)	0.3570417890657717
  (1, 145337)	0.357690033431952
  (1, 137062)	0.19103580883458432
  (2, 141667)	0.429492658644999
  (2, 379796)	0.25244244517639514
  (2, 395869)	0.36085673299877974
  (2, 41624)	0.29711665926313724
  (2, 334934)	0.20512365020881038
  (2, 244754)	0.165895274008558
  :	:
  (1119997, 113877)	0.22493615065967207
  (1119997, 387913)	0.2767496636088945
  (1119997, 409858)	0.23955180054268152
  (1119997, 70554)	0.27383674805939734
  (11

In [20]:
print(X_test)

  (0, 251042)	0.778825506289135
  (0, 137959)	0.6272406481992957
  (1, 376483)	0.2545749091096363
  (1, 372574)	0.3182396002751607
  (1, 326920)	0.542709413572274
  (1, 255066)	0.2923806328943388
  (1, 244754)	0.2658196497078675
  (1, 140239)	0.2986154111672435
  (1, 15138)	0.5422766641111115
  (2, 349121)	0.63094867205752
  (2, 216845)	0.2672271295107595
  (2, 164610)	0.4073517490391374
  (2, 135356)	0.2692742275268316
  (2, 46386)	0.45795788835060747
  (2, 37730)	0.28692150481329914
  (3, 301450)	0.4421123507497218
  (3, 297152)	0.5054666067638668
  (3, 208293)	0.32032482066572304
  (3, 121185)	0.39278664157717896
  (3, 23769)	0.5405097984543579
  (4, 401245)	0.17915877422779894
  (4, 391270)	0.25345686435953474
  (4, 376483)	0.1471063528104932
  (4, 339799)	0.34949702149010586
  (4, 268966)	0.2981938805013646
  :	:
  (479996, 73013)	0.42966685404431837
  (479996, 13682)	0.26359606432413435
  (479997, 391010)	0.252347965663915
  (479997, 374240)	0.18637397676714657
  (479997, 366539)

**Model training: Logistic Regression**

In [21]:
# Instantiate LogisticRegression
lr = LogisticRegression(max_iter = 1000)

In [22]:
# Train the logistic regression model

lr.fit(X_train, y_train)
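Once fitted, a binary logistic regression scores a tweet by taking a weighted sum of its TF-IDF features plus a bias and passing it through the sigmoid function; a probability above 0.5 maps to the positive class. The sketch below illustrates that computation with made-up numbers: the `weights`, `bias` and `tweet` values are hypothetical and are not taken from the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """P(positive) for one tweet given sparse features {column: tfidf weight}."""
    z = bias + sum(weights.get(col, 0.0) * value for col, value in features.items())
    return sigmoid(z)

# Hypothetical coefficients for two vocabulary columns.
weights = {0: 2.0, 1: -1.5}
bias = 0.1
# Hypothetical TF-IDF features of a single tweet.
tweet = {0: 0.6, 1: 0.2}

p = predict_proba(weights, bias, tweet)
label = 1 if p > 0.5 else 0
print(round(p, 3), label)
```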

**Model evaluation: Accuracy score**

In [23]:
# train data accuracy score
X_train_pred = lr.predict(X_train)
train_data_accuracy = accuracy_score(y_train, X_train_pred)

print('Accuracy score on training data:', train_data_accuracy)

Accuracy score on training data: 0.8104598214285714


In [24]:
# test data accuracy score
X_test_pred = lr.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_pred)

print('Accuracy score on testing data:', test_data_accuracy)

Accuracy score on testing data: 0.7777854166666667
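The training accuracy is roughly three percentage points above the test accuracy, which may point to mild overfitting. Accuracy itself is simply the fraction of predictions that match the true labels; the sketch below computes it by hand and also counts each (true, predicted) pair, which shows what kind of errors are made. The label lists are made up for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))

# Hypothetical labels: 0 = negative, 1 = positive.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)
print(acc)
print(counts)
```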


**Saving the trained model**

In [25]:
import pickle

In [26]:
# Save the trained logistic regression model

filename = 'trained_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(lr, f)
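The saved file holds only the classifier. Scoring new raw text later also requires the fitted `vectorizer`, so one option is to pickle both objects together. The sketch below shows the round trip with plain dictionaries standing in for the fitted objects; the `bundle` layout and the `sentiment_bundle.pkl` file name are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects; in the notebook these would be `lr` and `vectorizer`.
bundle = {
    "model": {"coef": [0.4, -1.2]},
    "vectorizer": {"vocabulary": {"happi": 0, "upset": 1}},
}

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")

# Context managers make sure the file is closed after writing and reading.
with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == bundle)
```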

**Using the saved model for predictions**

In [27]:
# Load the saved model

with open('/kaggle/working/trained_model.sav', 'rb') as f:
    loaded = pickle.load(f)

In [28]:
# Predict the sentiment of a single test tweet with the loaded model and compare it with the true label

X_new = X_test[200]
print(y_test[200])

pred = loaded.predict(X_new)
print(pred)

if pred[0] == 0:
    print('Negative Tweet')
else:
    print('Positive Tweet')

0
[0]
Negative Tweet


In [29]:
# Repeat the check for another test tweet, again using the loaded model

X_new = X_test[50]
print(y_test[50])

pred = loaded.predict(X_new)
print(pred)

if pred[0] == 0:
    print('Negative Tweet')
else:
    print('Positive Tweet')

1
[1]
Positive Tweet
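The predictions above use rows of `X_test` that were already vectorized. For a new raw tweet, the text would first need the same cleaning (the `stemming` function) and the same fitted `vectorizer` before it reaches the model. The sketch below shows that flow end to end on a tiny toy corpus so that it runs on its own; the texts, the `toy_vectorizer` and `toy_model` objects, and the `predict_sentiment` helper are all illustrative and separate from the model trained in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, already-cleaned training texts (hypothetical); 1 = positive, 0 = negative.
texts = [
    "love great day", "happi birthday love", "great happi time",
    "upset sad day", "hate bad time", "sad bad cri",
]
labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(texts), labels)

def predict_sentiment(cleaned_text):
    """Vectorize one cleaned tweet with the fitted vocabulary and classify it."""
    features = toy_vectorizer.transform([cleaned_text])  # transform only, never refit
    return int(toy_model.predict(features)[0])

result = predict_sentiment("love happi")
print("Positive Tweet" if result == 1 else "Negative Tweet")
```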
