# Sentiment Analysis on Twitter comments
### Performing sentiment analysis on the Sentiment140 dataset from Kaggle
Using NLP (Natural Language Processing) to pre-process the data set, then applying algorithms such as Naive Bayes and Logistic Regression and checking the model accuracy.
### Data set contents:

    1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
    2. ids: The id of the tweet (2087)
    3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    4. flag: The query (lyx). If there is no query, then this value is NO_QUERY.
    5. user: the user that tweeted.
    6. text: the text of the tweet.

## Step-1: Loading, inspection and cleaning of data:
Importing all necessary packages

In [1]:
import pandas as pd
import numpy as np
import string
import re
from tqdm.notebook import tqdm

import warnings
warnings.filterwarnings('ignore')

### Loading the data

In [2]:
# Reading csv data into pandas dataframe.
data_set = pd.read_csv("Datasets//Twitter_Sentiment_Analysis/training_1.6_million.csv", header=None)

In [3]:
# set column names for the data set
data_set.columns = ['target', 'id', 'date', 'flag', 'user', 'comment']

### Inspecting the data
#### Let's have a look at the data.
Checking for null values present in the data, as well as removal of unwanted columns if any.

In [4]:
data_set.head(10)

Unnamed: 0,target,id,date,flag,user,comment
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
5,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
6,0,1467811592,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,mybirch,Need a hug
7,0,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,coZZ,@LOLTrish hey long time no see! Yes.. Rains a...
8,0,1467811795,Mon Apr 06 22:20:05 PDT 2009,NO_QUERY,2Hood4Hollywood,@Tatiana_K nope they didn't have it
9,0,1467812025,Mon Apr 06 22:20:09 PDT 2009,NO_QUERY,mimismo,@twittera que me muera ?


In [5]:
# Checking for null values
data_set.isna().sum()

target     0
id         0
date       0
flag       0
user       0
comment    0
dtype: int64

The data contains no null values, so we can go ahead with further processing.

### Cleaning the data
#### Data preprocessing

In [6]:
# Dropping the columns we do not need
del data_set['id']
del data_set['date']
del data_set['flag']
del data_set['user']

# Let's have a look at the remaining data
data_set.head(10)

Unnamed: 0,target,comment
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
5,0,@Kwesidei not the whole crew
6,0,Need a hug
7,0,@LOLTrish hey long time no see! Yes.. Rains a...
8,0,@Tatiana_K nope they didn't have it
9,0,@twittera que me muera ?


Now our data looks much cleaner.
Let's proceed with the data pre-processing steps.

#### Using NLTK (Natural Language Toolkit) for NLP on the data.

In [7]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

#### Creating a set of stopwords


In [8]:
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

#### Preprocessing on text data
    1. Casing - Convert data to upper/lower case.
    2. Noise Removal - Eliminate unwanted characters like HTML tags, punctuation marks, special characters, white spaces, etc.
    3. Tokenization - Turning tweets into tokens, i.e. the words separated by spaces in the text.
    4. Stopword Removal - Removing the words that do not add meaning to a sentence.
    5. Text Normalization (Stemming and Lemmatization) -
        i. Stemming - Strip affixes (prefix, suffix, infix) from a word.
       ii. Lemmatization - Stemming sometimes loses the actual meaning of the word. Lemmatization reduces the inflected word properly by taking its morphological analysis into account.
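
The first four steps above can be sketched with the standard library alone. This is only an illustration: the tiny `TOY_STOP_WORDS` set and the whitespace tokenizer are stand-ins (assumptions) for NLTK's stopword list and `word_tokenize`, the sample tweet is made up, and stemming/lemmatization are left out.

```python
import re
import string

# Hypothetical miniature stopword set; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"is", "that", "he", "his", "by", "the", "a", "to"}

def toy_pre_process(tweet):
    # 1. Casing
    text = tweet.lower()
    # 2. Noise removal: URLs, @mentions, digits, punctuation
    text = re.sub(r"https?\S+|www\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"[0-9]", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenization (whitespace split as a stand-in for word_tokenize)
    tokens = text.split()
    # 4. Stopword removal
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

sample = "@Kenichan He is upset that he can't update his Facebook! http://example.com/1"
print(toy_pre_process(sample))
```

Note that because punctuation is stripped before the stopword filter runs, `can't` turns into `cant` and survives the filter.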

In [9]:
def pre_process_data(data):

    # Convert text to lower case
    data = data.lower()

    # Remove digits
    data = re.sub(r"[0-9]", "", data)

    # Remove URLs if present
    data = re.sub(r"https\S+|www\S+|http\S+", "", data, flags=re.MULTILINE)

    # Remove user names (@mentions)
    data = re.sub(r"@([A-Za-z0-9_.]+)", "", data)

    # Remove any remaining @mentions and the '#' symbol of hashtags
    data = re.sub(r"\@\w+|\#", "", data)

    # Remove punctuation
    data = data.translate(str.maketrans("", "", string.punctuation))

    # Tokenize and remove stopwords
    data_tokens = word_tokenize(data)
    filtered_data = [word for word in data_tokens if word not in stop_words]

    # Remove single-letter words
    filtered_data = [word for word in filtered_data if len(word) > 1]

    # Stemming
    porter_stem = PorterStemmer()
    stemmed_data = [porter_stem.stem(word) for word in filtered_data]

    # Lemmatization (pos 'a' treats every token as an adjective)
    lemmatize = WordNetLemmatizer()
    lemmatized_data = [lemmatize.lemmatize(word, 'a') for word in stemmed_data]

    data = " ".join(lemmatized_data)

    return data


In [10]:
# Looping through the tweets one by one and calling function pre_process_data for data preprocessing.
data_set['comment'] = [pre_process_data(data) for data in tqdm(data_set['comment'])]

  0%|          | 0/1600000 [00:00<?, ?it/s]

Our data preprocessing is almost done.
Let's check the modified data now.

In [11]:
data_set.head(5)

Unnamed: 0,target,comment
0,0,that bummer shoulda got david carr third day
1,0,upset cant updat facebook text might cri resul...
2,0,dive mani time ball manag save rest go bound
3,0,whole bodi feel itchi like fire
4,0,behav im mad cant see


Wow, our data looks much better than before.
Before proceeding, let's check the target values.

In [12]:
data_set['target'].unique()

array([0, 4], dtype=int64)

Negative tweets are labelled 0, while positive tweets are labelled 4. Let's change the 4 to 1 before we vectorize the data.

In [13]:
# Replacing all the 4's with 1's in the target column
data_set['target'] = data_set['target'].replace(4, 1)
data_set.target.unique()

array([0, 1], dtype=int64)

Wow, everything looks good now!

## Step-2: Conversion of data
### Conversion of textual data into numeric representation, to feed into the model.
Importing the required functions to convert the text data into tokens and vectorize it.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

### Vectorize tokens
Converting the word tokens to numbers

In [15]:
vectorize = TfidfVectorizer(binary=True)
data_new = vectorize.fit_transform(data_set['comment'])

In [16]:
print(data_new[0])

  (0, 70811)	0.17742138934618049
  (0, 310802)	0.4010903980233806
  (0, 48104)	0.506368005287779
  (0, 70634)	0.3521196682769122
  (0, 118785)	0.1946914711431045
  (0, 278813)	0.44135380244808553
  (0, 42919)	0.37482696082312
  (0, 308215)	0.23250403699334615
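
Each line in the output above is a `(row, column) weight` entry of the sparse matrix: row 0 is the first tweet, the column is a vocabulary index, and the value is the TF-IDF weight. As a rough sketch of where such weights come from, the pure-Python function below follows the formula documented for scikit-learn's defaults (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization) with binary term presence. The three-document corpus and the helper name `toy_binary_tfidf` are made up for illustration.

```python
import math

def toy_binary_tfidf(docs):
    """Return one {term: weight} dict per document."""
    token_sets = [set(doc.split()) for doc in docs]  # binary: presence only
    n_docs = len(docs)
    vocab = sorted(set().union(*token_sets))
    # Document frequency and smoothed inverse document frequency
    df = {t: sum(t in s for s in token_sets) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for s in token_sets:
        # L2 norm of the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(idf[t] ** 2 for t in s)) or 1.0
        rows.append({t: idf[t] / norm for t in sorted(s)})
    return rows

corpus = ["whole bodi feel itchi", "feel like fire", "need hug"]
for row in toy_binary_tfidf(corpus):
    print(row)
```

Terms that occur in fewer documents get a larger weight, and every non-empty row has unit length.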


### Data split into train and test set

In [17]:
x = data_new
y = data_set['target'].values

# Splitting of data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=44)

print('Training set: {}'.format(x_train.shape))
print('Testing set: {}'.format(x_test.shape))

Training set: (1280000, 367639)
Testing set: (320000, 367639)
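
With `test_size=0.2`, 320,000 of the 1,600,000 rows are held out, which matches the shapes printed above. The sketch below illustrates the idea of a seeded shuffle-and-cut on row indices using only the standard library. It is not scikit-learn's implementation and will not reproduce the same partition as `random_state=44`; `toy_split` is a hypothetical helper.

```python
import math
import random

def toy_split(n_rows, test_size=0.2, seed=44):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # train, test

train_idx, test_idx = toy_split(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, so both models are trained and evaluated on the same rows.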


## Step-3: Selection and training of model
### Train the model with the prepared data
Importing the selected models and the required functions

In [18]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import ComplementNB
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [19]:
# Bernoulli Naive Bayes
classify_nb = BernoulliNB()
classify_nb.fit(x_train, y_train)

BernoulliNB()

In [20]:
# Logistic Regression
classify_lg = LogisticRegression(fit_intercept=False, max_iter=100)
classify_lg.fit(x_train, y_train)

LogisticRegression(fit_intercept=False)

## Step-4: Testing the model
### Performance of the model on test data

In [21]:
# Performance of Bernoulli Naive Bayes
prediction = classify_nb.predict(x_test)
prediction

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [22]:
print("Bernoulli Naive Bayes:")

print("\nAccuracy score: {}".format(accuracy_score(y_test, prediction)))
print("\nClassification Report:\n{}".format(classification_report(y_test, prediction)))
print("\nConfusion Matrix:\n{}".format(confusion_matrix(y_test, prediction)))

Bernoulli Naive Bayes:

Accuracy score: 0.76821875

Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.76      0.77    159759
           1       0.77      0.77      0.77    160241

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000


Confusion Matrix:
[[122097  37662]
 [ 36508 123733]]
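
The figures in the report can be recomputed by hand from the confusion matrix above. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the diagonal holds the correct predictions. A short worked check using only the printed numbers:

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = [[122097, 37662],
      [36508, 123733]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: correct predictions of a class / everything predicted as that class
# Recall: correct predictions of a class / everything truly in that class
for label in (0, 1):
    predicted_as_label = cm[0][label] + cm[1][label]
    truly_label = sum(cm[label])
    precision = cm[label][label] / predicted_as_label
    recall = cm[label][label] / truly_label
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f}")

print(f"accuracy={accuracy:.8f}")
```

Rounded to two decimals, these values agree with the classification report, and the accuracy matches the printed score.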


In [23]:
# Performance of Logistic Regression
prediction = classify_lg.predict(x_test)
prediction

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [None]:
print("Logistic Regression:")

print("\nAccuracy score: {:.2f}".format(accuracy_score(y_test, prediction)))
print("\nClassification Report:\n{}".format(classification_report(y_test, prediction)))
print("\nConfusion Matrix:\n{}".format(confusion_matrix(y_test, prediction)))

Here, the Logistic Regression model gives a higher accuracy than Bernoulli Naive Bayes.
The accuracy could be increased further by refining the data preprocessing and by selecting a more suitable algorithm.