<h1><center>Twitter Sentiment Analysis</center></h1>

Link: https://www.kaggle.com/datasets/kazanova/sentiment140

This is the Sentiment140 dataset.
It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and can be used to detect sentiment.
It contains the following 6 fields:

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet `(Sat May 16 23:58:44 UTC 2009)`
- flag: the query (lyx). If there is no query, this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)
- The official link regarding the dataset with resources about how it was generated is here
- The official paper detailing the approach is here

According to the creators of the dataset:

"Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search"

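The emoticon-based labelling described in the quote can be sketched roughly as follows. This is only an illustration of the idea: the emoticon lists and the `label_by_emoticon` helper are assumptions made for this example, not the creators' actual code or lists.

```python
from typing import Optional

# Illustrative emoticon lists (assumed); the creators' real lists are not given here.
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")


def label_by_emoticon(tweet: str) -> Optional[int]:
    """Return 4 (positive), 0 (negative), or None if the tweet is ambiguous."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        # Either no emoticon at all or mixed signals: no label is assigned.
        return None
    return 4 if has_pos else 0


for tweet in ["Lyx is cool :)", "my laptop died :(", "just landed"]:
    print(repr(tweet), "->", label_by_emoticon(tweet))
```

Labels produced this way are noisy, since an emoticon is only a proxy for the sentiment of the whole tweet.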

In [9]:
!pip install kaggle



In [10]:
import os
import shutil

# Create .kaggle directory if it doesn't exist
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

# Move kaggle.json to the correct location
shutil.move("kaggle.json", os.path.expanduser("~/.kaggle/kaggle.json"))

In [12]:
# API to fetch data from Kaggle
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to C:\Users\Ashwin\Documents\End to End Projects\Twitter Sentiment Analysis (NLP)






In [13]:
import zipfile
with zipfile.ZipFile("sentiment140.zip", "r") as zip_ref:
    zip_ref.extractall("sentiment140")
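Before extracting, it can help to list what the archive contains, so the file name passed to `pd.read_csv` later matches what was actually unpacked. The sketch below builds a tiny in-memory archive (the member name `example.csv` is made up) so that it runs without the real download.

```python
import io
import zipfile

# Build a tiny in-memory archive so the example runs without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", '"0","1","text"\n')

# Inspect the archive before extracting anything.
with zipfile.ZipFile(buffer, "r") as archive:
    members = archive.namelist()
    sizes = {info.filename: info.file_size for info in archive.infolist()}

print(members)
print(sizes)
```

With the real download, the same two calls on `zipfile.ZipFile("sentiment140.zip")` would show the name and uncompressed size of the CSV.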

# Import Libraries

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re
import nltk
from nltk.corpus import stopwords,wordnet
nltk.download('stopwords')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer


from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ashwin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [22]:
import pickle

# Data Processing

In [13]:
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']

# The CSV has no header row, so the column names are supplied explicitly.
data = pd.read_csv('C:/Users/Ashwin/Documents/End to End Projects/Twitter Sentiment Analysis (NLP)/sentiment140/Tweet_dataset.csv',
                   names=column_names, encoding='ISO-8859-1')

In [14]:
data.head()

   target          id                          date      flag             user  text
0       0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1       0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY    scotthamilton  is upset that he can't update his Facebook by ...
2       0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY         mattycus  @Kenichan I dived many times for the ball. Man...
3       0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          ElleCTF  my whole body feels itchy and like its on fire
4       0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY           Karoli  @nationwideclass no, it's not behaving at all....


# Missing values verification

In [15]:
data.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

In [16]:
# Check the distribution of the target column

data['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

In [17]:
# Convert target 4 (positive) to 1 so the labels are 0 and 1

data['target'] = data['target'].apply(lambda x: 1 if x == 4 else x)

In [18]:
# Check the distribution of the target column again

data['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64
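The relabelling step can also be written as an explicit mapping. The sketch below uses a plain list as a stand-in for `data['target']`; `LABEL_MAP` and the toy values are assumptions made for the example. One possible advantage is that an unexpected label (for example 2 = neutral) raises a `KeyError` instead of passing through silently.

```python
from collections import Counter

# Negative stays 0, positive (4) becomes 1; any other label is rejected.
LABEL_MAP = {0: 0, 4: 1}

raw_targets = [0, 4, 4, 0, 4]  # toy stand-in for data['target']
binary_targets = [LABEL_MAP[t] for t in raw_targets]

print(Counter(binary_targets))
```

With pandas, `data['target'].map(LABEL_MAP)` expresses the same mapping, except that unmapped values become `NaN` rather than raising.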

# Lemmatization

In [8]:
lem=WordNetLemmatizer()

In [19]:
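class LemmaCleaner:
    """Callable that strips URLs and non-letters, removes stop words, and lemmatizes."""

    def __init__(self):
        self.lem = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))

    def __call__(self, content):
        # Remove URLs, then replace every non-letter character with a space.
        clean_txt = re.sub(r'http\S+|www\S+|https\S+', '', content, flags=re.MULTILINE)
        clean_txt = re.sub(r'[^a-zA-Z\s]', ' ', clean_txt)
        clean_txt = clean_txt.lower().split()
        # Note: pos=wordnet.ADV lemmatizes every token as an adverb, which likely leaves
        # most nouns and verbs unchanged; it is kept because the results below depend on it.
        clean_txt = [self.lem.lemmatize(word, pos=wordnet.ADV) for word in clean_txt if word not in self.stop_words]
        return ' '.join(clean_txt)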

In [20]:
cleaner = LemmaCleaner()
data['Clean_text_1'] = data['text'].apply(cleaner)

In [23]:
# pickle.dump returns None, so there is nothing to assign; the with-block closes the file.
with open('cleaner.pkl', 'wb') as f:
    pickle.dump(cleaner, f)

In [24]:
X = data['Clean_text_1']
y=data['target']

# Splitting the data

In [25]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,stratify=y,random_state=1)

In [26]:
X_train.shape

(1120000,)

In [27]:
y_train.value_counts()

target
1    560000
0    560000
Name: count, dtype: int64
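The counts above follow from the split arithmetic: 70% of 1,600,000 rows is 1,120,000, and stratifying on a balanced target gives 560,000 rows per class. The idea behind stratification can be sketched in plain Python as below. This is a simplified illustration, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=1):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Each class contributes 7 rows to the training set and 3 to the test set, mirroring the 70/30 split used in the notebook.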

# Converting text to vectors

In [28]:
vectorizer = TfidfVectorizer(ngram_range=(1,2))

In [29]:
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
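To make the weighting concrete, here is a small pure-Python sketch of TF-IDF over unigrams and bigrams. It follows the smoothed IDF formula documented for scikit-learn's defaults, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, but it is not the library's code: tokens are split on whitespace here, whereas the default `TfidfVectorizer` tokenizer also drops single-character tokens. The helpers `ngrams` and `tfidf` are hypothetical names.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Yield all word n-grams with n_min <= n <= n_max, joined by spaces."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    counts = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


vectors = tfidf(["good day", "bad day"])
for term, weight in sorted(vectors[0].items()):
    print(f"{term!r}: {weight:.3f}")
```

The term `day` appears in both documents and therefore receives a lower weight than `good` or the bigram `good day`, which appear in only one.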

In [30]:
# Save the fitted vectorizer so the same vocabulary can be reused at prediction time.
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Training the Model

In [31]:
model1=LogisticRegression(max_iter=1000)

In [32]:
model1.fit(X_train,y_train)

# Model Evaluation

In [33]:
train_accuracy = accuracy_score(y_train,model1.predict(X_train))
train_accuracy

0.8467321428571428

In [34]:
test_accuracy = accuracy_score(y_test,model1.predict(X_test))
test_accuracy

0.7912541666666667

In [35]:
# Model Accuracy
print("Accuracy = ",test_accuracy)

Accuracy =  0.7912541666666667
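Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on toy labels and then measures the gap between the train and test accuracies reported above. The `accuracy` helper and the toy labels are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(accuracy(y_true, y_pred))

# Gap between the train and test accuracies reported in the notebook.
train_accuracy = 0.8467321428571428
test_accuracy = 0.7912541666666667
gap = train_accuracy - test_accuracy
print(round(gap, 4))
```

A gap of roughly 5.5 percentage points may indicate some overfitting to the training vocabulary, although accuracy alone does not show which class the errors fall on.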


# Export model

In [36]:
import pickle

In [37]:
filename = 'train_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(model1, f)

# Using the saved model for future prediction

In [38]:
with open('train_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [39]:
with open('cleaner.pkl', 'rb') as f:
    loaded_cleaner = pickle.load(f)

In [40]:
with open('vectorizer.pkl', 'rb') as f:
    loaded_vectorizer = pickle.load(f)
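The save-and-load round trip can be checked in isolation. The sketch below pickles a plain dictionary (a stand-in for the real model, cleaner and vectorizer) into a temporary directory and reads it back. The `artifact` contents and file name are made up for the example.

```python
import os
import pickle
import tempfile

# Stand-in object; the notebook pickles the model, cleaner and vectorizer instead.
artifact = {"ngram_range": (1, 2), "labels": {0: "negative", 1: "positive"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifact.pkl")

    # The context manager closes the file even if pickling raises.
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Two caveats apply to the real artifacts: pickle stores a custom class such as `LemmaCleaner` by reference, so its definition must be available in the session that loads `cleaner.pkl`, and pickle files should only be loaded from trusted sources.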

In [45]:
new_tweet = ["Happy to be part of the course"]

In [46]:
# Step 1: Clean the new tweet
cleaned_tweet = loaded_cleaner(new_tweet[0])

# Step 2: Vectorize the cleaned tweet
vectorized_tweet = loaded_vectorizer.transform([cleaned_tweet])

# Step 3: Make prediction
prediction = loaded_model.predict(vectorized_tweet)

# Step 4: Output result
print("Predicted label:", prediction[0])

Predicted label: 1
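The four prediction steps can be wrapped in one function that also turns the numeric label into a readable name. This is a sketch under assumptions: `predict_sentiment` and `LABEL_NAMES` are hypothetical, and the stub classes only stand in for the pickled vectorizer and model so the example runs on its own.

```python
LABEL_NAMES = {0: "negative", 1: "positive"}


def predict_sentiment(text, cleaner, vectorizer, model):
    """Clean, vectorize and classify one tweet; return a readable label."""
    cleaned = cleaner(text)
    features = vectorizer.transform([cleaned])
    label = int(model.predict(features)[0])
    return LABEL_NAMES[label]


# Minimal stand-ins so the sketch runs without the pickled artifacts.
class StubVectorizer:
    def transform(self, docs):
        return [[1.0 if "happy" in doc else 0.0] for doc in docs]


class StubModel:
    def predict(self, rows):
        return [1 if row[0] > 0.5 else 0 for row in rows]


result = predict_sentiment("Happy to be part of the course",
                           str.lower, StubVectorizer(), StubModel())
print(result)
```

In the notebook itself, the call would presumably be `predict_sentiment(new_tweet[0], loaded_cleaner, loaded_vectorizer, loaded_model)`.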
