<a href="https://colab.research.google.com/github/Navyatj/TWITTER-SENTIMENT-ANALYSIS/blob/main/Twitter_Sentiment_Analysis_using_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **TWITTER SENTIMENT ANALYSIS USING MACHINE LEARNING WITH PYTHON**

**PROJECT OVERVIEW**

This project performs sentiment analysis on Twitter data using Natural Language Processing (NLP) and Machine Learning techniques. The primary objective is to build a model capable of classifying tweets as positive or negative based on their textual content.


In [None]:
#installing the kaggle library
!pip install kaggle



**Upload the kaggle.json file**

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
#extracting the compressed dataset
from zipfile import ZipFile

# Path to your zip file
dataset = '/content/sentiment140.zip'

# Destination folder
extract_path = './'

# Open and extract
with ZipFile(dataset, 'r') as zip_ref:
    zip_ref.extractall(extract_path)
    print('The dataset is extracted')


The dataset is extracted
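
Before extracting a large archive, it can help to list its contents first. The sketch below builds a small stand-in archive in a temporary directory (the file name `demo.csv` and its content are hypothetical, not part of the Sentiment140 download) and then inspects and extracts it with the same `ZipFile` API used above.

```python
import os
import tempfile
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, 'demo.zip')

    # Build a small stand-in archive (hypothetical file name and content).
    with ZipFile(archive, 'w') as zf:
        zf.writestr('demo.csv', '0,"a tweet"\n')

    with ZipFile(archive, 'r') as zf:
        names = zf.namelist()   # inspect the contents without extracting
        zf.extractall(tmp)      # then extract next to the archive

    extracted = os.path.exists(os.path.join(tmp, 'demo.csv'))

print(names, extracted)
```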


**Importing the Dependencies**


In [None]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
#printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

**Data Processing**


In [None]:
#Loading the data from csv to pandas dataframe
twitter_data=pd.read_csv('/content/training.1600000.processed.noemoticon.csv',encoding='ISO-8859-1')

In [None]:
#checking the number of rows and columns
twitter_data.shape

(1599999, 6)

In [None]:
#printing the first 5 rows of the dataframe
twitter_data.head(5)

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [None]:
#naming the columns correctly
column_names=['target','ids','date','flag','user','text']
twitter_data=pd.read_csv('/content/training.1600000.processed.noemoticon.csv',names=column_names,encoding='ISO-8859-1')

In [None]:
#checking the number of rows and columns
twitter_data.shape

(1600000, 6)
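
The first load reported 1,599,999 rows because the CSV has no header line, so pandas used the first tweet as column names. The sketch below reproduces the effect on three made-up rows (the data in `raw` is invented for illustration) and shows how passing `names=` recovers the missing row.

```python
import io
import pandas as pd

# Three made-up rows in the same six-column layout, with no header line.
raw = (
    '0,1,Mon Apr 06,NO_QUERY,user_a,"first tweet"\n'
    '0,2,Mon Apr 06,NO_QUERY,user_b,"second tweet"\n'
    '4,3,Mon Apr 06,NO_QUERY,user_c,"third tweet"\n'
)
column_names = ['target', 'ids', 'date', 'flag', 'user', 'text']

# Default behaviour: the first data row is consumed as the header.
without_names = pd.read_csv(io.StringIO(raw))

# With names=, pandas treats every line as data.
with_names = pd.read_csv(io.StringIO(raw), names=column_names)

print(without_names.shape)  # one row short
print(with_names.shape)     # all rows present
```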

In [None]:
#printing the first 5 rows of the dataframe
twitter_data.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [None]:
#checking the null values in the dataset
twitter_data.isnull().sum()

Unnamed: 0,0
target,0
ids,0
date,0
flag,0
user,0
text,0


No null values were found in the dataset.

In [None]:
#checking the distribution of the target column
twitter_data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
4,800000


The two classes are evenly distributed: 800,000 negative tweets and 800,000 positive tweets.

0 => Negative Tweet

4 => Positive Tweet

In [None]:
#converting the label 4 to 1 so that 1 marks a positive tweet
twitter_data.replace({'target':{4:1}}, inplace=True)

In [None]:
twitter_data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
1,800000


The tweets are now labeled as follows:

0 => Negative Tweet

1 => Positive Tweet
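
As an alternative to `replace`, the same relabeling could be written with `Series.map`. This is a minimal sketch on a made-up four-row frame (the name `demo` is hypothetical), not the notebook's own code.

```python
import pandas as pd

# Made-up labels in the original 0/4 encoding.
demo = pd.DataFrame({'target': [0, 4, 0, 4]})

# Map every label explicitly: 0 stays negative, 4 becomes 1 (positive).
demo['target'] = demo['target'].map({0: 0, 4: 1})

print(demo['target'].tolist())
```

Unlike `replace`, `map` turns any label missing from the dictionary into `NaN`, which makes unexpected values easy to spot.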

In [None]:
import string
nltk.download('stopwords')
from nltk.corpus import stopwords
# Cleaning the text by removing URLs, mentions, hashtags, numbers, punctuation, and stopwords
stop_words = set(stopwords.words('english'))
def clean_text(text):
    text = text.lower()                      # normalize case
    text = re.sub(r"http\S+", "", text)      # remove URLs
    text = re.sub(r"@\w+", "", text)         # remove @mentions
    text = re.sub(r"#\w+", "", text)         # remove hashtags
    text = re.sub(r"\d+", "", text)          # remove digits
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    # drop English stopwords
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
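
To see what each cleaning step does, the sketch below runs the same sequence of operations on a made-up tweet. As an assumption, a tiny hand-picked stopword set stands in for NLTK's English list so the example runs without downloading any corpus; the sample tweet is invented.

```python
import re
import string

# Assumption: a small stand-in for NLTK's English stopword list.
stop_words = {'i', 'the', 'is', 'at', 'my', 'so', 'this', 'it'}

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)   # URLs
    text = re.sub(r"@\w+", "", text)      # @mentions
    text = re.sub(r"#\w+", "", text)      # hashtags
    text = re.sub(r"\d+", "", text)       # digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = "@friend I LOVE the new phone!!! 100% worth it http://example.com #happy"
cleaned = clean_text(sample)
print(cleaned)  # love new phone worth
```

Note that the hashtag pattern removes the whole tag, including its word, so a sentiment-bearing tag such as `#happy` is discarded rather than kept as `happy`.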


In [None]:
#applying clean_text to every tweet and displaying the result
twitter_data['clean_text'] = twitter_data['text'].apply(clean_text)
twitter_data[['target', 'clean_text']].head()

Unnamed: 0,target,clean_text
0,0,awww thats bummer shoulda got david carr third...
1,0,upset cant update facebook texting might cry r...
2,0,dived many times ball managed save rest go bounds
3,0,whole body feels itchy like fire
4,0,behaving im mad cant see


In [None]:
#printing the first 5 rows of the dataframe
twitter_data.head()

Unnamed: 0,target,ids,date,flag,user,text,clean_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",awww thats bummer shoulda got david carr third...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset cant update facebook texting might cry r...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,dived many times ball managed save rest go bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole body feels itchy like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",behaving im mad cant see


In [None]:
#printing the column clean_text
print(twitter_data['clean_text'])

0          awww thats bummer shoulda got david carr third...
1          upset cant update facebook texting might cry r...
2          dived many times ball managed save rest go bounds
3                           whole body feels itchy like fire
4                                   behaving im mad cant see
                                 ...                        
1599995                        woke school best feeling ever
1599996          thewdbcom cool hear old walt interviews â«
1599997                      ready mojo makeover ask details
1599998    happy th birthday boo alll time tupac amaru sh...
1599999                                                happy
Name: clean_text, Length: 1600000, dtype: object


In [None]:
#printing the target column
print(twitter_data['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


In [None]:
#separating the data and label
X=twitter_data['clean_text'].values
Y=twitter_data['target'].values

In [None]:
print(X)

['awww thats bummer shoulda got david carr third day'
 'upset cant update facebook texting might cry result school today also blah'
 'dived many times ball managed save rest go bounds' ...
 'ready mojo makeover ask details'
 'happy th birthday boo alll time tupac amaru shakur' 'happy']


In [None]:
print(Y)

[0 0 0 ... 1 1 1]


**Splitting the data into training data and test data**

In [None]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)

In [None]:
print(X.shape,X_train.shape,X_test.shape)

(1600000,) (1280000,) (320000,)
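
The `stratify=Y` argument keeps the class ratio the same in both splits. A minimal sketch on toy data (the arrays `X_toy` and `y_toy` are made up) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 negative (0) and 10 positive (1) labels.
X_toy = np.array([f"tweet {i}" for i in range(20)])
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

print(X_tr.shape, X_te.shape)   # 16 training and 4 test samples
print(np.bincount(y_tr))        # class counts in the training split
print(np.bincount(y_te))        # class counts in the test split
```

Both splits keep the 50/50 balance of the full toy set, which is what the notebook relies on when it later reports plain accuracy.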


In [None]:
print(X_train)

['watch saw iv drink lil wine' 'im'
 'even though favourite drink think vodka coke wipes mind time think im gonna find new drink'
 ... 'eager monday afternoon'
 'hope everyone mother great day cant wait hear guys store tomorrow'
 'love waking folgers bad voice deeper']


In [None]:
print(X_test)

['fine havent much time chat twitter hubby back summer amp tends dominate free time'
 'ahs may show w ruth kim amp geoffrey sanhueza'
 'maybe bay area thang dammit' ...
 'nevertheless hooray members wonderful safe trip' 'feeling well' 'thank']


In [None]:
#converting the text data into numerical data (TF-IDF); the vectorizer is fitted on the training data only
vectorizer = TfidfVectorizer()
X_train=vectorizer.fit_transform(X_train)
X_test=vectorizer.transform(X_test)
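
The vectorizer is fitted on the training text and only applied to the test text, so both matrices share one vocabulary and words unseen during training are ignored. A small sketch with made-up documents illustrates this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good morning", "bad morning", "good day"]
test_docs = ["good evening"]   # "evening" never appears in the training text

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them without refitting

print(sorted(vec.vocabulary_))                # ['bad', 'day', 'good', 'morning']
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print(test_matrix.nnz)                        # only "good" is a known word
```

This is why `X_train` and `X_test` above both have 358,598 columns even though they were transformed separately.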

In [None]:
print(X_train)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 8843121 stored elements and shape (1280000, 358598)>
  Coords	Values
  (0, 334156)	0.31944510477587795
  (0, 265058)	0.34480489261408703
  (0, 148130)	0.5122432618073612
  (0, 82232)	0.4033973184493289
  (0, 168181)	0.40554006245294194
  (0, 341130)	0.4352919859251099
  (1, 141567)	1.0
  (2, 82232)	0.48409877134851337
  (2, 141567)	0.11091704457669742
  (2, 92311)	0.1846995619093264
  (2, 305892)	0.1769332066727175
  (2, 98156)	0.2770865236442473
  (2, 305039)	0.31699602504804786
  (2, 331079)	0.3130947189203445
  (2, 57450)	0.2979810140523102
  (2, 341530)	0.391682958860094
  (2, 187483)	0.23201748486663284
  (2, 307754)	0.1483871852937364
  (2, 116492)	0.17749294846161098
  (2, 100808)	0.19486466640820702
  (2, 201763)	0.158428837345547
  (3, 305039)	0.28559608316546087
  (3, 124944)	0.483603453393562
  (3, 117851)	0.26222786979659324
  (3, 42733)	0.4887632975700404
  :	:
  (1279996, 227807)	0.2646021340807897
  (1279996, 

In [None]:
print(X_test)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2150281 stored elements and shape (320000, 358598)>
  Coords	Values
  (0, 10529)	0.17972306337154578
  (0, 22138)	0.16533834063011607
  (0, 50608)	0.2848761782093938
  (0, 79621)	0.438521510018306
  (0, 100855)	0.2604182012824416
  (0, 106311)	0.24205171123571506
  (0, 126794)	0.22740674036066874
  (0, 137275)	0.2893748603927381
  (0, 195265)	0.18090327378687204
  (0, 293471)	0.22490059067510432
  (0, 301347)	0.42125602791165095
  (0, 307754)	0.3304770608980503
  (0, 319799)	0.18727252095666902
  (1, 6021)	0.5312277328107123
  (1, 10529)	0.18818977725900865
  (1, 112595)	0.5080172875005524
  (1, 158285)	0.36126935603498256
  (1, 181486)	0.2495175443346319
  (1, 261746)	0.4264568504627907
  (1, 274036)	0.2228455549954211
  (2, 15196)	0.4377581692991733
  (2, 25557)	0.4669625509322318
  (2, 68601)	0.44534915634400263
  (2, 181510)	0.3169210600189138
  (2, 302640)	0.5399434525011599
  :	:
  (319994, 305039)	0.1800256449396045
 

**Training the Machine Learning model**

**Logistic Regression**

In [None]:
model = LogisticRegression(max_iter=1000, C=0.1, penalty='l2')


In [None]:
model.fit(X_train,Y_train)

**Model Evaluation**

**Accuracy Score**

In [None]:
#Accuracy score on the training data
X_train_prediction=model.predict(X_train)
training_data_accuracy=accuracy_score(Y_train,X_train_prediction)

In [None]:
print('Accuracy Score on the training data:',training_data_accuracy)

Accuracy Score on the training data: 0.7837609375


In [None]:
#Accuracy score on the testing data
X_test_prediction=model.predict(X_test)
testing_data_accuracy=accuracy_score(Y_test,X_test_prediction)

In [None]:
print('Accuracy Score on the testing data:',testing_data_accuracy)

Accuracy Score on the testing data: 0.7792625


Model accuracy on the test data: 77.9%
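
Accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and compares the result with `accuracy_score`; the confusion matrix is added as one possible way to see which class the errors fall in (it is not part of the original evaluation).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (0 = negative, 1 = positive).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Fraction of positions where the prediction equals the true label.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
library = accuracy_score(y_true, y_pred)
print(manual, library)  # 0.75 0.75

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```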

**Saving the trained model**

In [None]:
import pickle

In [None]:
filename='trained_model.sav'
#the with-block closes the file once the model is written
with open(filename,'wb') as f:
    pickle.dump(model, f)

**Using the saved model for future predictions**

In [None]:
#loading the saved model
with open('/content/trained_model.sav','rb') as f:
    loaded_model=pickle.load(f)

In [None]:
X_new=X_test[200]
print(Y_test[200])

#predicting with the model loaded from disk
prediction = loaded_model.predict(X_new)
print(prediction)

if(prediction[0]==0):
  print('Negative Tweet')

else:
  print('Positive Tweet')

1
[1]
Positive Tweet


In [None]:
X_new=X_test[3]
print(Y_test[3])

#predicting with the model loaded from disk
prediction = loaded_model.predict(X_new)
print(prediction)

if(prediction[0]==0):
  print('Negative Tweet')

else:
  print('Positive Tweet')

0
[0]
Negative Tweet
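
The saved classifier expects TF-IDF vectors, so classifying a brand-new raw tweet also requires the fitted vectorizer. One possible approach, sketched below on a tiny made-up training set, is to pickle both objects together. The dictionary keys and the in-memory `pickle.dumps` round trip are illustrative assumptions, not the notebook's method.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, already cleaned; 1 = positive, 0 = negative.
texts = ["love great day", "happy wonderful time",
         "hate awful day", "sad terrible time"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(texts), labels)

# Serialize both objects together; the classifier alone cannot read raw text.
blob = pickle.dumps({'vectorizer': vec, 'model': clf})
restored = pickle.loads(blob)

new_tweet = ["love wonderful"]
features = restored['vectorizer'].transform(new_tweet)
prediction = restored['model'].predict(features)
print('Positive Tweet' if prediction[0] == 1 else 'Negative Tweet')
```

In the notebook itself, the new text would first pass through `clean_text` before being handed to the vectorizer.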


**Conclusion:**

The Logistic Regression model reaches an accuracy of 78.38% on the training data and 77.93% on the testing data. The small gap between the two scores suggests that the model is not overfitting and generalizes reasonably well to unseen tweets.