# DATASET - Twitter tweet sentiment prediction 

### Aim to predict - Sentiment of tweets, whether positive, negative or neutral

### Brief about dataset

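I took this dataset from Kaggle. It has 4 columns and 74681 rows. We will train a model to predict the sentiment of the tweet text column, that is, whether a tweet is positive, negative or neutral. The labels also include a fourth class, Irrelevant, which appears in the classification reports.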

#### Importing the Libraries 

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import string
import gensim

from gensim import models

import seaborn as sns

In [15]:
# Requires the NLTK 'wordnet' and 'stopwords' corpora (e.g. via nltk.download)
wn = nltk.WordNetLemmatizer()
stopwords = nltk.corpus.stopwords.words('english')

# 1. Read in the text data

In [3]:
data = pd.read_csv("twitter_sentiment.csv")
data.head(30)

| | 2401 | Borderlands | Positive | im getting on borderlands and i will murder you all , |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 1 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 2 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 3 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |
| 4 | 2401 | Borderlands | Positive | im getting into borderlands and i can murder y... |
| 5 | 2402 | Borderlands | Positive | So I spent a few hours making something for fu... |
| 6 | 2402 | Borderlands | Positive | So I spent a couple of hours doing something f... |
| 7 | 2402 | Borderlands | Positive | So I spent a few hours doing something for fun... |
| 8 | 2402 | Borderlands | Positive | So I spent a few hours making something for fu... |
| 9 | 2402 | Borderlands | Positive | 2010 So I spent a few hours making something f... |
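
The column names in the output above (`2401`, `Borderlands`, `Positive` and the text of a tweet) suggest that the CSV has no header row, so pandas treated the first record as the header. A minimal sketch, assuming the file really is headerless, passes `header=None` with explicit names. The names `tweet_id`, `entity`, `Sentiment` and `Review` are illustrative assumptions, and the second sample row is made up.

```python
import io

import pandas as pd

# Two sample rows in the same layout as the file; the second row is made up.
sample_csv = io.StringIO(
    '2401,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"\n'
    '2402,Borderlands,Positive,example tweet text\n'
)

# Assumed column names (hypothetical; the file itself carries no header).
column_names = ["tweet_id", "entity", "Sentiment", "Review"]

# header=None tells pandas that the first line is data, not column names.
sample = pd.read_csv(sample_csv, header=None, names=column_names)
print(sample.shape)
print(sample.columns.tolist())
```

With the real file the equivalent call would be `pd.read_csv("twitter_sentiment.csv", header=None, names=column_names)`, which keeps the first tweet as a data row instead of using it as the header.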


### Pre-processing 

In [4]:
data.shape

(74681, 4)

In [5]:
data.isnull().sum()

2401                                                       0
Borderlands                                                0
Positive                                                   0
im getting on borderlands and i will murder you all ,    686
dtype: int64

### Dropping the null values 

In [6]:
clean_data = data.dropna()

In [7]:
clean_data.isnull().sum()

2401                                                     0
Borderlands                                              0
Positive                                                 0
im getting on borderlands and i will murder you all ,    0
dtype: int64

### Dropping the less relevant columns

In [8]:
clean_data = clean_data.drop(columns=['2401', 'Borderlands'])

In [9]:
clean_data.head()

| | Positive | im getting on borderlands and i will murder you all , |
|---|---|---|
| 0 | Positive | I am coming to the borders and I will kill you... |
| 1 | Positive | im getting on borderlands and i will kill you ... |
| 2 | Positive | im coming on borderlands and i will murder you... |
| 3 | Positive | im getting on borderlands 2 and i will murder ... |
| 4 | Positive | im getting into borderlands and i can murder y... |


In [13]:
clean_data.columns = ['Sentiment','Review']

In [14]:
clean_data.head()

| | Sentiment | Review |
|---|---|---|
| 0 | Positive | I am coming to the borders and I will kill you... |
| 1 | Positive | im getting on borderlands and i will kill you ... |
| 2 | Positive | im coming on borderlands and i will murder you... |
| 3 | Positive | im getting on borderlands 2 and i will murder ... |
| 4 | Positive | im getting into borderlands and i can murder y... |


# 2. Format the text using regex and other tools: remove punctuation, tokenize, remove stop words, and lemmatize the data

In [12]:
# Tokenize, convert to lowercase, and drop numbers and punctuation

clean_data['clean_gensim'] = clean_data['Review'].apply(lambda x: gensim.utils.simple_preprocess(x))


def clean_one(text):
    # Rejoin the tokens, skipping any token found in string.punctuation
    text = " ".join([word for word in text if word not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Drop English stop words
    text = [word for word in tokens if word not in stopwords]
    return text

# Remove stop words

clean_data['cleaner'] = clean_data['clean_gensim'].apply(lambda x: clean_one(x))

def lemma(token_text):
    # Reduce each token to its WordNet lemma (e.g. "borders" -> "border")
    text = [wn.lemmatize(word) for word in token_text]
    return text

# Lemmatize

clean_data['lemmatized'] = clean_data['cleaner'].apply(lambda x: lemma(x))

clean_data.head()

| | Sentiment | Review | clean_gensim | cleaner | lemmatized |
|---|---|---|---|---|---|
| 0 | Positive | I am coming to the borders and I will kill you... | [am, coming, to, the, borders, and, will, kill... | [coming, borders, kill] | [coming, border, kill] |
| 1 | Positive | im getting on borderlands and i will kill you ... | [im, getting, on, borderlands, and, will, kill... | [im, getting, borderlands, kill] | [im, getting, borderland, kill] |
| 2 | Positive | im coming on borderlands and i will murder you... | [im, coming, on, borderlands, and, will, murde... | [im, coming, borderlands, murder] | [im, coming, borderland, murder] |
| 3 | Positive | im getting on borderlands 2 and i will murder ... | [im, getting, on, borderlands, and, will, murd... | [im, getting, borderlands, murder] | [im, getting, borderland, murder] |
| 4 | Positive | im getting into borderlands and i can murder y... | [im, getting, into, borderlands, and, can, mur... | [im, getting, borderlands, murder] | [im, getting, borderland, murder] |
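
To make the first two cleaning steps concrete, here is a standard-library sketch that roughly approximates what `gensim.utils.simple_preprocess` does with its default settings (lowercase, letters only, tokens of 2 to 15 characters) followed by stop-word removal. It is not the gensim implementation. The helper names, the tiny stop-word set and the sample tweet (modeled on row 3) are assumptions for illustration.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list.
SAMPLE_STOPWORDS = {"on", "and", "i", "will", "you", "all"}


def rough_tokens(text, min_len=2, max_len=15):
    """Lowercase, keep runs of letters only, and drop very short or very long tokens."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if min_len <= len(w) <= max_len]


def drop_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stopwords]


# Sample tweet modeled on row 3 of the table (hypothetical full text).
tweet = "im getting on borderlands 2 and i will murder you all ,"
tokens = rough_tokens(tweet)
kept = drop_stopwords(tokens)
print(tokens)
print(kept)
```

The digit `2`, the trailing comma and the one-letter word `i` disappear in the tokenizing step, which is consistent with the `clean_gensim` column above.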


### Splitting the data for testing and training 

In [13]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(clean_data['lemmatized'], clean_data['Sentiment'], test_size=0.2, random_state=42)

print('\nX_train size:\t',X_train.shape)
print('\nX_test size:\t',X_test.shape)
print('\ny_train size:\t',y_train.shape)
print('\ny_test size:\t',y_test.shape)


X_train size:	 (59196,)

X_test size:	 (14799,)

y_train size:	 (59196,)

y_test size:	 (14799,)
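
The split above is purely random. Because the classes are not equally frequent, one optional variation is a stratified split, which keeps the class proportions the same in the training and test sets. A minimal sketch on made-up toy labels (the variable names and data are assumptions, not the notebook's):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (made up): 20 placeholder tweets with an uneven label distribution.
toy_texts = [f"tweet {i}" for i in range(20)]
toy_labels = ["Negative"] * 10 + ["Positive"] * 6 + ["Irrelevant"] * 4

# stratify=toy_labels keeps the label proportions equal in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.5, random_state=42, stratify=toy_labels
)

test_counts = Counter(y_te)
print(test_counts)
```

In the notebook this would correspond to adding `stratify=clean_data['Sentiment']` to the existing `train_test_split` call.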


# 3. Vectorize your data
# 4. Create and transform features 

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train.astype('U'))

X_test = vectorizer.transform(X_test.astype('U'))

X_train.shape, X_test.shape

((59196, 25983), (14799, 25983))
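
The `astype('U')` call converts each token list to its string form (for example `"['coming', 'border', 'kill']"`), and `TfidfVectorizer` then extracts the words again with its default tokenizer. A slightly more explicit alternative, sketched here on two token lists taken from the table above, joins the tokens with spaces before vectorizing. The variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two lemmatized token lists copied from the preview table.
token_lists = [
    ["coming", "border", "kill"],
    ["im", "getting", "borderland", "murder"],
]

# Join each list back into a plain string so the vectorizer sees ordinary text.
docs = [" ".join(tokens) for tokens in token_lists]

sketch_vectorizer = TfidfVectorizer()
matrix = sketch_vectorizer.fit_transform(docs)

# vocabulary_ maps each learned term to its column index.
vocab = sorted(sketch_vectorizer.vocabulary_)
print(matrix.shape)
print(vocab)
```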

# 5. Select 2 algorithms and build 2 models 
# 6. Make predictions & evaluate the results

### Logistic Regression

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

# Initialize logistic regression

lr = LogisticRegression()

# Fit the model on the training data

lr.fit(X_train, y_train)

# Predict on the test data

y_pred_test = lr.predict(X_test)

# Evaluate accuracy on the test data

acc_test_lr = accuracy_score(y_test, y_pred_test)

# Print accuracy

print(f'Accuracy calculation results Data Test  : {acc_test_lr}')

Accuracy calculation results Data Test  : 0.7808635718629637


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
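
The warning above means the solver stopped at its iteration limit before converging. As the message suggests, one option is to raise `max_iter`. The sketch below shows the idea on a small synthetic dataset rather than the tweet data; the value `1000` and the `_demo` names are assumptions, and the right limit for the real TF-IDF matrix may differ.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Small synthetic 3-class problem (stand-in for the real features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Record warnings so we can check whether the solver still hits its limit.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lr_demo = LogisticRegression(max_iter=1000)
    lr_demo.fit(X_demo, y_demo)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("Converged without warning:", converged)
print("Iterations used:", lr_demo.n_iter_)
```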


In [16]:
lr_class = classification_report(y_test, y_pred_test)

print(lr_class)

              precision    recall  f1-score   support

  Irrelevant       0.83      0.67      0.74      2624
    Negative       0.79      0.83      0.81      4463
     Neutral       0.75      0.76      0.76      3589
    Positive       0.77      0.81      0.79      4123

    accuracy                           0.78     14799
   macro avg       0.79      0.77      0.78     14799
weighted avg       0.78      0.78      0.78     14799
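
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows: the macro average is the plain mean over the classes, while the weighted average weights each class by its support. A short check using the recall column of the report above (the inputs are the rounded values printed there, so the results are approximate):

```python
# Per-class recall and support copied from the logistic regression report.
recall = {"Irrelevant": 0.67, "Negative": 0.83, "Neutral": 0.76, "Positive": 0.81}
support = {"Irrelevant": 2624, "Negative": 4463, "Neutral": 3589, "Positive": 4123}

total = sum(support.values())

# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"total support: {total}")
print(f"macro recall: {macro_recall:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

Up to rounding, both values agree with the report. The weighted recall is also the same quantity as the overall accuracy, which is why those two rows match.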



### Linear SVM

In [17]:
from sklearn.svm import LinearSVC

# Initialize linear SVM

lsvc = LinearSVC()

# Fit the model on the training data
lsvc.fit(X_train, y_train)

# Predict on the test data

y_pred_test = lsvc.predict(X_test)

# Evaluate accuracy on the test data

acc_test_sv = accuracy_score(y_test, y_pred_test)

# Print accuracy

print(f'Accuracy calculation results Data Test  : {acc_test_sv}')

Accuracy calculation results Data Test  : 0.8575579431042638


In [18]:
sv_class = classification_report(y_test, y_pred_test)
print(sv_class)

              precision    recall  f1-score   support

  Irrelevant       0.90      0.80      0.85      2624
    Negative       0.87      0.88      0.87      4463
     Neutral       0.89      0.83      0.86      3589
    Positive       0.80      0.89      0.84      4123

    accuracy                           0.86     14799
   macro avg       0.86      0.85      0.86     14799
weighted avg       0.86      0.86      0.86     14799



# 7. Select one final model & explain why you want to select this particular model as the final model. 

In [19]:
print("\033[1m"+"\t\t\tResults"+"\033[0m")
print("\n"+"-"*55)
print("\033[1m"+"\tFor Logistic Regression the results are:"+"\033[0m")
print("\nTest Accuracy Score:\t", acc_test_lr)
print("Classification Report\n",lr_class)
print("\n"+"-"*55)
print("\033[1m"+"\tFor SVC the results are:"+"\033[0m")
print("\nTest Accuracy Score:\t",acc_test_sv )
print("Classification Report:\n",sv_class)
print("-"*55)

			Results

-------------------------------------------------------
	For Logistic Regression the results are:

Test Accuracy Score:	 0.7808635718629637
Classification Report
               precision    recall  f1-score   support

  Irrelevant       0.83      0.67      0.74      2624
    Negative       0.79      0.83      0.81      4463
     Neutral       0.75      0.76      0.76      3589
    Positive       0.77      0.81      0.79      4123

    accuracy                           0.78     14799
   macro avg       0.79      0.77      0.78     14799
weighted avg       0.78      0.78      0.78     14799


-------------------------------------------------------
	For SVC the results are:

Test Accuracy Score:	 0.8575579431042638
Classification Report:
               precision    recall  f1-score   support

  Irrelevant       0.90      0.80      0.85      2624
    Negative       0.87      0.88      0.87      4463
     Neutral       0.89      0.83      0.86      3589
    Positive       0.80      0.89      0.84      4123

    accuracy                           0.86     14799
   macro avg       0.86      0.85      0.86     14799
weighted avg       0.86      0.86      0.86     14799


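After comparing the classification reports of both models, the linear Support Vector Classifier (LinearSVC) reaches a test accuracy of about 86%, compared with about 78% for logistic regression, so I select LinearSVC as the final model. Note that the logistic regression fit stopped at its iteration limit, so its score might improve with a higher `max_iter`.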

A few possible reasons why the SVM performs better:
1. SVM looks for the maximum-margin decision boundary, which tends to reduce errors on unseen data.
2. Linear SVMs usually work well on high-dimensional, sparse features such as the TF-IDF vectors built from text.
3. The margin-based objective is regularized, which can lower the risk of overfitting.
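
A single train/test split gives only one accuracy number per model. One way to make the comparison more robust is k-fold cross-validation with the vectorizer and the classifier wrapped in a pipeline, so TF-IDF is refit on each training fold. The sketch below uses a tiny made-up dataset only to show the mechanics; the texts, labels and names are assumptions, and the scores it prints say nothing about the real tweet data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy tweets: six positive and six negative examples.
toy_texts = [
    "love this game", "great fun tonight", "really enjoy playing",
    "best update ever", "awesome new map", "so happy with it",
    "hate this bug", "terrible lag again", "worst patch ever",
    "game keeps crashing", "so annoying today", "really bad servers",
]
toy_labels = ["Positive"] * 6 + ["Negative"] * 6

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
}

mean_scores = {}
for name, model in candidates.items():
    # The pipeline refits TF-IDF inside every fold, avoiding leakage from the held-out fold.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy over 3 folds = {mean_scores[name]:.2f}")
```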

## Thank you