In [1]:
! pip install kaggle



In [2]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [3]:
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)


In [4]:
from zipfile import ZipFile
dataset = '/content/sentiment140.zip'

# Use a name other than `zip` so the built-in zip() is not shadowed
with ZipFile(dataset, 'r') as zip_file:
  zip_file.extractall()
  print('The dataset is extracted')

The dataset is extracted


## The dataset being used is the **Sentiment140 dataset**. It contains 1,600,000 tweets extracted using the **Twitter API**. The tweets have been annotated **(0 = Negative, 4 = Positive)** and can be used to detect sentiment.

### It contains the following 6 fields:
1. **sentiment**: the polarity of the tweet (0 = negative, 4 = positive)
2. **ids**: the id of the tweet (2087)
3. **date**: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. **flag**: the query (lyx). If there is no query, then this value is NO_QUERY.
5. **user**: the user that tweeted (robotickilldozr)
6. **text**: the text of the tweet (Lyx is cool)


**Importing The Dependencies**

In [5]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Preprocess Text

### **Text Preprocessing** is traditionally an important step for **Natural Language Processing (NLP)** tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.

### The Preprocessing steps:

#### 1. **Lower Casing**: Each text is converted to lowercase.
#### 2. **Replacing URLs**: Links starting with "http", "https" or "www" are replaced by "URL".
#### 3. **Replacing Emojis**: Emojis are replaced using a pre-defined dictionary that maps each emoji to its meaning. (e.g. ":)" to "EMOJIsmile")
#### 4. **Replacing Usernames**: @Usernames are replaced with the word "USER". (e.g. "@Kaggle" to "USER")
#### 5. **Removing Non-Alphanumerics**: Characters other than digits and letters are replaced with a space.
#### 6. **Removing Short Words**: Words with length less than 2 are removed.
#### 7. **Removing Stopwords**: Stopwords are English words that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. (e.g. "the", "he", "have")
#### 8. **Stemming**: Stemming is the process of reducing a word to its stem by stripping suffixes. (e.g. "update" to "updat", "dived" to "dive")
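Steps 1 to 6 above can be sketched with the standard `re` module alone. This is only an illustrative sketch: the function name `preprocess`, the regular expressions and the two-entry emoticon table are assumptions, not code from this notebook. The `stemming` function defined later in the notebook takes a simpler route (letters only, lowercasing, stopword removal, stemming) and does not replace URLs, emojis or usernames.

```python
import re

# Hypothetical patterns and a tiny emoticon table, for illustration only.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")
USER_PATTERN = re.compile(r"@\w+")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]")
EMOJIS = {":)": "smile", ":(": "sad"}

def preprocess(tweet):
    text = tweet.lower()                               # 1. lower casing
    text = URL_PATTERN.sub(" URL ", text)              # 2. replace URLs
    for emoji, meaning in EMOJIS.items():              # 3. replace emoticons
        text = text.replace(emoji, " EMOJI" + meaning + " ")
    text = USER_PATTERN.sub(" USER ", text)            # 4. replace usernames
    text = NON_ALNUM_PATTERN.sub(" ", text)            # 5. drop non-alphanumerics
    words = [w for w in text.split() if len(w) >= 2]   # 6. drop short words
    return " ".join(words)

print(preprocess("@Kaggle I love this :) see http://kaggle.com"))
# prints: USER love this EMOJIsmile see URL
```

Stopword removal and stemming (steps 7 and 8) are left out here because they need the NLTK resources loaded in the cells below.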




In [6]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

## Data Processing

In [8]:
twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding = 'ISO-8859-1')

In [9]:
twitter_data.shape

(1599999, 6)

In [10]:
twitter_data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [11]:
column_names = ['target','id','date','flag','user','text']
twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv',names=column_names, encoding = 'ISO-8859-1')

In [12]:
twitter_data.shape

(1600000, 6)

In [13]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [14]:
twitter_data.isnull().sum()

Unnamed: 0,0
target,0
id,0
date,0
flag,0
user,0
text,0


In [15]:
twitter_data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
4,800000


In [16]:
twitter_data.replace({'target':{4:1}},inplace=True)

In [17]:
twitter_data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
1,800000


In [18]:
port_stem = PorterStemmer()

In [19]:
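# Build the stopword set once; a set lookup is far faster than calling
# stopwords.words('english') again for every token of 1.6M tweets.
english_stopwords = set(stopwords.words('english'))

def stemming(content):

  stemmed_content = re.sub('[^a-zA-Z]',' ',content)  # keep letters only
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # Drop stopwords, then reduce each remaining word to its stem
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
  stemmed_content = ' '.join(stemmed_content)

  return stemmed_content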

In [20]:
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming)

In [21]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


In [22]:
X = twitter_data['stemmed_content'].values
Y = twitter_data['target'].values

In [23]:
print(X)

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']


In [24]:
print(Y)

[0 0 0 ... 1 1 1]


# Splitting the data

### The preprocessed data is divided into 2 sets:

### Training Data: The dataset on which the model is trained. Contains 80% of the data.
### Test Data: The dataset against which the model is tested. Contains 20% of the data.
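As a small illustration of what an 80/20 stratified split does, the sketch below runs `train_test_split` on a toy array. The toy texts and labels are made up for the example; only the `test_size`, `stratify` and `random_state` arguments mirror the cell that follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 "tweets", half negative (0) and half positive (1).
texts = np.array([f"tweet {i}" for i in range(100)])
labels = np.array([0] * 50 + [1] * 50)

# stratify=labels keeps the 50/50 class ratio in both the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=2
)

print(len(X_tr), len(X_te))    # sizes of the two sets
print(y_tr.sum(), y_te.sum())  # number of positive labels in each set
```

With stratification, each set keeps the same proportion of positive and negative labels as the full data, which matters when accuracy is later compared across the two sets.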

In [25]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, stratify=Y, random_state=2)

In [26]:
print(X.shape, X_train.shape, X_test.shape)

(1600000,) (1280000,) (320000,)


In [27]:
print(X_train)

['watch saw iv drink lil wine' 'hatermagazin'
 'even though favourit drink think vodka coke wipe mind time think im gonna find new drink'
 ... 'eager monday afternoon'
 'hope everyon mother great day wait hear guy store tomorrow'
 'love wake folger bad voic deeper']


In [28]:
print(X_test)

['mmangen fine much time chat twitter hubbi back summer amp tend domin free time'
 'ah may show w ruth kim amp geoffrey sanhueza'
 'ishatara mayb bay area thang dammit' ...
 'destini nevertheless hooray member wonder safe trip' 'feel well'
 'supersandro thank']


# TF-IDF Vectorizer

### **TF-IDF indicates how important a word is for understanding a document within a dataset.** Let us understand with an example. Suppose you have a dataset where students write essays on the topic "My House". In this dataset, the word "a" appears many times; it is a high-frequency word compared to other words in the dataset. The dataset contains other words like "home", "house", "rooms" and so on that appear less often, so their frequencies are lower and they carry more information than the word "a". This is the intuition behind TF-IDF.
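The inverse-document-frequency half of this intuition can be worked through by hand. The sketch below uses three made-up essays and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, which is the variant scikit-learn's `TfidfVectorizer` applies by default; the word "the" stands in for "a" because the default tokenizer of `TfidfVectorizer` ignores one-letter words. The row normalisation that the vectoriser also performs is omitted here.

```python
import math
from collections import Counter

# Toy essays (hypothetical), standing in for the "My House" dataset.
essays = [
    "the house has the rooms",
    "the home is the best",
    "the garden of the house",
]
docs = [essay.split() for essay in essays]
n_docs = len(docs)

# Document frequency: in how many essays does each word occur?
df = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df[word])) + 1

for word in ["the", "house", "home"]:
    print(word, df[word], round(idf(word), 3))
```

The word that occurs in every essay gets the smallest weight, and the rarer a word is across essays, the larger its weight becomes.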

In [29]:
vectorizer = TfidfVectorizer()

# Transforming the dataset

### Transforming the **X_train** and **X_test** datasets into matrices of TF-IDF features using the TF-IDF vectoriser. These datasets will be used to train the model and to test it.

In [30]:
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
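The difference between `fit_transform` on the training text and `transform` on the test text can be seen on a toy example. The sentences and variable names below are made up for illustration; the point is that the vocabulary is learned from the training data only, so words that appear only in the test data are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical)
train_texts = ["love this phone", "hate this phone"]
test_texts = ["love this camera"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_texts)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_texts)        # reuses them without refitting

vocab = sorted(vec.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.nnz)  # non-zero entries; "camera" is not in the vocabulary
```

Both matrices have the same number of columns, which is what allows a model trained on one to make predictions on the other.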

In [31]:
print(X_train)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 9453092 stored elements and shape (1280000, 461488)>
  Coords	Values
  (0, 436713)	0.27259876264838384
  (0, 354543)	0.3588091611460021
  (0, 185193)	0.5277679060576009
  (0, 109306)	0.3753708587402299
  (0, 235045)	0.41996827700291095
  (0, 443066)	0.4484755317023172
  (1, 160636)	1.0
  (2, 109306)	0.4591176413728317
  (2, 124484)	0.1892155960801415
  (2, 407301)	0.18709338684973031
  (2, 129411)	0.29074192727957143
  (2, 406399)	0.32105459490875526
  (2, 433560)	0.3296595898028565
  (2, 77929)	0.31284080750346344
  (2, 443430)	0.3348599670252845
  (2, 266729)	0.24123230668976975
  (2, 409143)	0.15169282335109835
  (2, 178061)	0.1619010109445149
  (2, 150715)	0.18803850583207948
  (2, 132311)	0.2028971570399794
  (2, 288470)	0.16786949597862733
  (3, 406399)	0.29029991238662284
  (3, 158711)	0.4456939372299574
  (3, 151770)	0.278559647704793
  (3, 56476)	0.5200465453608686
  :	:
  (1279996, 318303)	0.21254698865277744
  (12

In [32]:
print(X_test)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2289192 stored elements and shape (320000, 461488)>
  Coords	Values
  (0, 15110)	0.1719352837797837
  (0, 31168)	0.1624772418052177
  (0, 67828)	0.26800375270827315
  (0, 106069)	0.36555450010904555
  (0, 132364)	0.255254889555786
  (0, 138164)	0.23688292264071406
  (0, 171378)	0.2805816206356074
  (0, 271016)	0.45356623916588285
  (0, 279082)	0.17825180109103442
  (0, 388348)	0.2198507607206174
  (0, 398906)	0.34910438732642673
  (0, 409143)	0.3143047059807971
  (0, 420984)	0.17915624523539805
  (1, 6463)	0.30733520460524466
  (1, 15110)	0.211037449588008
  (1, 145393)	0.575262969264869
  (1, 217562)	0.40288153995289894
  (1, 256777)	0.28751585696559306
  (1, 348135)	0.4739279595416274
  (1, 366203)	0.24595562404108307
  (2, 22532)	0.3532582957477176
  (2, 34401)	0.37916255084357414
  (2, 89448)	0.36340369428387626
  (2, 183312)	0.5892069252021465
  (2, 256834)	0.2564939661498776
  :	:
  (319994, 443794)	0.2782185641032538


# Creating and Evaluating the Model

### We're creating 3 different types of model for our sentiment analysis problem:
- Logistic Regression (LR)
- Bernoulli Naive Bayes (BernoulliNB)
- Linear Support Vector Classification (LinearSVC)


### Since our dataset is not **skewed**, i.e. it has an equal number of **Positive and Negative** labels, we're choosing **Accuracy** as our evaluation metric.
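Why the class balance matters for accuracy can be shown with a few lines of plain Python. The label lists and the `accuracy` helper below are made up for illustration: a "model" that always predicts negative looks useless on balanced labels but deceptively good on skewed ones.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels (hypothetical)
balanced = [0] * 50 + [1] * 50   # like this dataset: equal classes
skewed = [0] * 95 + [1] * 5      # a skewed dataset for comparison
always_negative = [0] * 100      # a trivial "model" that never predicts positive

print(accuracy(balanced, always_negative))  # 0.5
print(accuracy(skewed, always_negative))    # 0.95
```

On balanced data a trivial predictor cannot score above 50%, so accuracy is a meaningful summary here.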

# Logistic Regression Model

In [33]:
model = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1, solver='sag', penalty='l2', verbose=1)

In [34]:
model.fit(X_train, Y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.


convergence after 27 epochs took 38 seconds


In [35]:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [36]:
print('Accuracy score on the training data :', training_data_accuracy)
print(classification_report(Y_train, X_train_prediction))

Accuracy score on the training data : 0.825840625
              precision    recall  f1-score   support

           0       0.84      0.81      0.82    640000
           1       0.82      0.84      0.83    640000

    accuracy                           0.83   1280000
   macro avg       0.83      0.83      0.83   1280000
weighted avg       0.83      0.83      0.83   1280000



In [37]:
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [38]:
print('Accuracy score on the test data :', test_data_accuracy)
print(classification_report(Y_test, X_test_prediction))

Accuracy score on the test data : 0.7778875
              precision    recall  f1-score   support

           0       0.79      0.76      0.77    160000
           1       0.77      0.79      0.78    160000

    accuracy                           0.78    320000
   macro avg       0.78      0.78      0.78    320000
weighted avg       0.78      0.78      0.78    320000



In [39]:
import pickle

# Saving the Logistic Regression Model

In [40]:
filename = 'trained_model.sav'
# The with-block closes the file even if pickling fails
with open(filename, 'wb') as f:
  pickle.dump(model, f)

# Loading the model

In [41]:
with open('/content/trained_model.sav', 'rb') as f:
  loaded_model = pickle.load(f)

# Predicting the values

In [42]:
X_new = X_test[200]
print(Y_test[200])

# Predict with the model restored from disk to check that the round trip worked
prediction = loaded_model.predict(X_new)
print(prediction)

if (prediction[0] == 0):
  print('Negative tweet')

else:
  print('positive tweet')

1
[1]
positive tweet


In [43]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB

# BernoulliNB Model

In [44]:
BNBmodel = BernoulliNB(alpha = 8)
BNBmodel.fit(X_train, Y_train)
Y_pred = BNBmodel.predict(X_train)

training_data_accuracy = accuracy_score(Y_train, Y_pred)
print('Accuracy score on the train data :', training_data_accuracy)

print(classification_report(Y_train, Y_pred))

Accuracy score on the train data : 0.78223046875
              precision    recall  f1-score   support

           0       0.77      0.81      0.79    640000
           1       0.80      0.75      0.78    640000

    accuracy                           0.78   1280000
   macro avg       0.78      0.78      0.78   1280000
weighted avg       0.78      0.78      0.78   1280000



In [45]:
# Note: this cell refits with alpha = 2, while the training-accuracy cell above used alpha = 8
BNBmodel = BernoulliNB(alpha = 2)
BNBmodel.fit(X_train, Y_train)
Y_pred = BNBmodel.predict(X_test)

test_data_accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy score on the test data :', test_data_accuracy)

print(classification_report(Y_test, Y_pred))

Accuracy score on the test data : 0.76580625
              precision    recall  f1-score   support

           0       0.75      0.79      0.77    160000
           1       0.78      0.74      0.76    160000

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000



# Linear SVC Model

In [46]:
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, Y_train)
Y_pred = SVCmodel.predict(X_train)
print(classification_report(Y_train, Y_pred))
training_data_accuracy = accuracy_score(Y_train, Y_pred)
print('Accuracy score on the training data :', training_data_accuracy)

              precision    recall  f1-score   support

           0       0.87      0.85      0.86    640000
           1       0.86      0.87      0.86    640000

    accuracy                           0.86   1280000
   macro avg       0.86      0.86      0.86   1280000
weighted avg       0.86      0.86      0.86   1280000

Accuracy score on the training data : 0.8623390625


In [47]:
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, Y_train)
Y_pred = SVCmodel.predict(X_test)
print(classification_report(Y_test, Y_pred))
test_data_accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy score on the test data :', test_data_accuracy)

              precision    recall  f1-score   support

           0       0.78      0.76      0.77    160000
           1       0.76      0.78      0.77    160000

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000

Accuracy score on the test data : 0.769671875


# Saving the models
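### We're using pickle to save the vectoriser, the SVCmodel and the BernoulliNB model for later use.

In [59]:
# File names are placeholders. The fitted vectoriser is saved as well, because
# new text must be transformed with the same vocabulary before prediction.
with open('vectorizer.pickle','wb') as file:
  pickle.dump(vectorizer, file)

with open('Sentiment-SVC.pickle','wb') as file:
  pickle.dump(SVCmodel, file)

with open('Sentiment-BNB.pickle','wb') as file:
  pickle.dump(BNBmodel, file)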

# Using the model
### The vectoriser can be used to transform data into a matrix of TF-IDF features, while the model can be used to predict the sentiment of the transformed data. The text whose sentiment is to be predicted must, however, be preprocessed first.
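A minimal sketch of that order of operations (preprocess, then vectorise, then predict) is given here on toy data. Everything in it is an assumption made for illustration: `clean` is a simplified stand-in for the notebook's `stemming` function (no stopword removal or stemming), and `toy_vectorizer`, `toy_model` and `predict_sentiment` are hypothetical names trained on four made-up sentences rather than on Sentiment140.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def clean(text):
    # Simplified stand-in for stemming(): keep letters only and lowercase
    return " ".join(re.sub("[^a-zA-Z]", " ", text).lower().split())

# Toy training data (hypothetical), only to make the sketch runnable
train_texts = ["I love this", "so happy today", "I hate this", "so sad today"]
train_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform([clean(t) for t in train_texts]), train_labels)

def predict_sentiment(texts):
    # Apply the same cleaning used at training time before vectorising
    features = toy_vectorizer.transform([clean(t) for t in texts])
    return toy_model.predict(features)

result = predict_sentiment(["LOVE!!! So happy :)"])
print(result)
```

In the notebook itself, the same helper would call `stemming`, the fitted `vectorizer` and one of the trained models instead of the toy objects.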

In [49]:
print(SVCmodel.predict(vectorizer.transform(["Rahul drinks urine instead of water"])))
print(SVCmodel.predict(vectorizer.transform(["USA is better than India."])))
print(SVCmodel.predict(vectorizer.transform(["USA is better than India but i like usa more than india"])))



[1]
[1]
[1]


In [50]:
print(BNBmodel.predict(vectorizer.transform(["Rahul drinks urine instead of water"])))
print(BNBmodel.predict(vectorizer.transform(["USA is better than India."])))
print(BNBmodel.predict(vectorizer.transform(["USA is better than India but i like usa more than india"])))

[1]
[1]
[1]


In [57]:
# Each assignment overwrites the previous one, so only the last sentence is classified below
prediction = model.predict(vectorizer.transform(["Rahul drinks urine instead of water"]))
prediction = model.predict(vectorizer.transform(["USA is better than India."]))
prediction = model.predict(vectorizer.transform(["USA is not better than India but i dont like usa more than india"]))

if (prediction[0] == 0):
  print('Negative tweet')

else:
  print('positive tweet')

Negative tweet


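## On the held-out test data, the **Logistic Regression Model** performs best of the three models we tried, with about **77.8% accuracy**, ahead of the Linear SVC Model (about 77.0%) and the BernoulliNB Model (about 76.6%). The Linear SVC Model reaches about 86% accuracy only on the training data, and the gap to its test score suggests that it overfits.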
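To compare the three models on the same footing, the accuracy scores printed in the cells above can be tabulated and ranked by test accuracy. The numbers below are copied from those outputs; the dictionary layout and variable names are just one way to sketch the comparison.

```python
# Accuracies copied from the outputs printed in this notebook.
# BernoulliNB: the training score was obtained with alpha = 8, the test score with alpha = 2.
scores = {
    "Logistic Regression": {"train": 0.825840625, "test": 0.7778875},
    "BernoulliNB": {"train": 0.78223046875, "test": 0.76580625},
    "LinearSVC": {"train": 0.8623390625, "test": 0.769671875},
}

# Rank by test accuracy, since that estimates performance on unseen tweets.
best_on_test = max(scores, key=lambda name: scores[name]["test"])

# A large train/test gap is a sign of overfitting.
largest_gap = max(scores, key=lambda name: scores[name]["train"] - scores[name]["test"])

for name, s in scores.items():
    gap = s["train"] - s["test"]
    print(f"{name:20s} train={s['train']:.4f} test={s['test']:.4f} gap={gap:.4f}")
print("Best on test data:", best_on_test)
print("Largest train/test gap:", largest_gap)
```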