### PROJECT-1 (TWITTER SENTIMENT ANALYSIS)



This project analyzes the sentiment of tweets from the social networking website
Twitter. The dataset was scraped from Twitter and contains 1,600,000 tweets
extracted using the Twitter API. It is a labeled dataset in which each tweet is annotated with its
sentiment (0 = negative, 2 = neutral, 4 = positive).

It contains the following 6 fields:
1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. ids: the id of the tweet
3. date: the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
4. flag: the query. If there is no query, this value is NO_QUERY.
5. user: the user who tweeted
6. text: the text of the tweet

Design a classification model that correctly predicts the polarity of the tweets provided in the
dataset.
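
To make the record layout concrete, the sketch below parses one raw line into the six fields using only the standard library. The sample line is the first record of the file as dumped later in this notebook; the `FIELDS` list and the `POLARITY` mapping are illustrative names, not part of the dataset.

```python
import csv
import io

# Field names as listed in the description above (illustrative constant)
FIELDS = ["target", "ids", "date", "flag", "user", "text"]

# First record of the file, copied from the raw byte dump in this notebook
sample = (
    '"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_",'
    '"@switchfoot http://twitpic.com/2y1zl - Awww, that\'s a bummer.  '
    'You shoulda got David Carr of Third Day to do it. ;D"\n'
)

# csv.reader handles the quoted fields, including the comma inside the tweet text
row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(FIELDS, row))

# Hypothetical lookup from the numeric label to a readable name
POLARITY = {"0": "negative", "2": "neutral", "4": "positive"}
print(record["user"], POLARITY[record["target"]])
```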

#### Importing the libraries

In [59]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Importing the dataset

In [60]:
# ! pip install chardet

In [61]:
import chardet
with open('twitter_new.csv', 'rb') as file:
    print(chardet.detect(file.read(500)))

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}


In [62]:
print(repr(open('twitter_new.csv', 'rb').read(700)))  # dump the first 700 bytes of the file

b'"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that\'s a bummer.  You shoulda got David Carr of Third Day to do it. ;D"\n"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can\'t update his Facebook by texting it... and might cry as a result  School today also. Blah!"\n"0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"\n"0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire "\n"0","1467811193","Mon Apr 06 22:19:'


In [63]:
import dask.dataframe as dd
df = dd.read_csv('twitter_new.csv',delimiter = ",",encoding = 'latin',header = None,
                 names = ['target','id','date','Query','User','text'] ,
                 dtype={'Query':str,'User': str,'text' : str})

In [64]:
df.compute()

,target,id,date,Query,User,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
400385,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
400386,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
400387,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
400388,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [65]:
df

,target,id,date,Query,User,text
npartitions=4,,,,,,
,int64,int64,object,object,object,object
,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [66]:
df.shape[0].compute()

1600000

In [67]:
df['text'].tail()

400385    Just woke up. Having no school is the best fee...
400386    TheWDB.com - Very cool to hear old Walt interv...
400387    Are you ready for your MoJo Makeover? Ask me f...
400388    Happy 38th Birthday to my boo of alll time!!! ...
400389    happy #charitytuesday @theNSPCC @SparksCharity...
Name: text, dtype: object

In [68]:
data=df[['text','target']]

In [69]:
type(data)

dask.dataframe.core.DataFrame

In [70]:
data.tail()

,text,target
400385,Just woke up. Having no school is the best fee...,4
400386,TheWDB.com - Very cool to hear old Walt interv...,4
400387,Are you ready for your MoJo Makeover? Ask me f...,4
400388,Happy 38th Birthday to my boo of alll time!!! ...,4
400389,happy #charitytuesday @theNSPCC @SparksCharity...,4


### Cleaning the text

In [71]:
import nltk
import re
# nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# Converting the tweet text to lower case
data['text']=data['text'].str.lower()
data['text'].tail()

400385    just woke up. having no school is the best fee...
400386    thewdb.com - very cool to hear old walt interv...
400387    are you ready for your mojo makeover? ask me f...
400388    happy 38th birthday to my boo of alll time!!! ...
400389    happy #charitytuesday @thenspcc @sparkscharity...
Name: text, dtype: object

In [72]:
# Removing stopwords from the text
all_stopwords  = stopwords.words('english')
all_stopwords.remove('not')
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in all_stopwords])
data['text'] = data['text'].apply(cleaning_stopwords, meta=pd.Series(dtype='str', name='text'))
#data.tail()
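
Because `cleaning_stopwords` splits on whitespace before any punctuation is removed, a stopword with punctuation attached (such as `up.`) does not match the list and is kept. This would explain why `up` still appears in the token lists printed later. The sketch below reproduces the effect with a small stand-in stopword set (an assumption; NLTK's English list is much longer) and an example sentence modeled on the tweets shown above.

```python
# Small stand-in for NLTK's English stopword list (assumption for illustration)
toy_stopwords = {"just", "up", "having", "no", "is", "the"}

def cleaning_stopwords(text, stopword_set=toy_stopwords):
    # Same logic as the notebook function: whitespace split, exact-match filter
    return " ".join(w for w in str(text).split() if w not in stopword_set)

cleaned = cleaning_stopwords("just woke up. having no school is the best feeling ever")
print(cleaned)  # "up." survives because it is not equal to the stopword "up"
```

Stripping punctuation or tokenizing before the stopword filter would avoid this, at the cost of changing the features the model is trained on.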

In [73]:
# cleaning multiple ... characters
def cleaning_repeating_char(text):
    return re.sub(r'(.)1+', r'1', text)
data['text'] = data['text'].apply(cleaning_repeating_char,meta=pd.Series(dtype='str', name='text'))
#data.tail()
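
The pattern in `cleaning_repeating_char` has no backslash before the `1`, so it matches any character followed by literal `1`s rather than a repeated character; the `alll` that survives in later outputs is consistent with that. The sketch below shows what a backreference version would do. The function names `collapse_repeats` and `limit_repeats` are hypothetical, and the second variant (keep at most two repeats) is an assumption about what might be preferable for English text, not something the notebook uses.

```python
import re

def collapse_repeats(text):
    # \1 refers back to the captured character, so a run like "!!!" becomes "!"
    return re.sub(r'(.)\1+', r'\1', text)

def limit_repeats(text):
    # Gentler variant: runs of three or more are shortened to two,
    # which leaves ordinary double letters ("happy") untouched
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

print(collapse_repeats("alll time!!!"))
print(limit_repeats("happy alll time!!!"))
```

Either version changes the tokens fed to the vectorizer, so the metrics reported later in this notebook would not carry over unchanged.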

In [74]:
# cleaning URL addresses
def cleaning_URLs(text):
    return re.sub(r'((https?://\S+)|(www.\S+))','',text)
data['text'] = data['text'].apply(cleaning_URLs,meta=pd.Series(dtype='str', name='text'))
#data.tail()

In [75]:
# Removing digits (letters are kept)
def cleaning_alpha_numeric(text):
    return re.sub('[0-9]', '', text)
data['text'] = data['text'].apply(cleaning_alpha_numeric,meta=pd.Series(dtype='str', name='text'))
#data.tail()

In [76]:
# tokenizing tweet text
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
func = tokenizer.tokenize
data['text'] = data['text'].apply(func,meta=pd.Series(dtype='str', name='text') )
#data['text'].head()

In [77]:
# Applying Stemming
import nltk
st = nltk.PorterStemmer()
def stemming_on_text(token_list):
    text = [st.stem(token) for token in token_list]
    return token_list
data['text']= data['text'].apply(stemming_on_text,meta=pd.Series(dtype='str', name='text'))
data['text'].tail()

400385              [woke, up, school, best, feeling, ever]
400386    [thewdb, com, cool, hear, old, walt, interview...
400387                [ready, mojo, makeover, ask, details]
400388    [happy, th, birthday, boo, alll, time, tupac, ...
400389    [happy, charitytuesday, thenspcc, sparkscharit...
Name: text, dtype: object

In [78]:
lm = nltk.WordNetLemmatizer()
# Use it once first, to "unlazify" wordnet
lm.lemmatize('cats')

'cat'

In [79]:
import time
time.sleep(5)

In [80]:
# Applying lemmatization
def lemmatizer_on_text(token_list):
    text = [lm.lemmatize(token) for token in token_list]
    return token_list
data['text'] = data['text'].apply(lemmatizer_on_text,meta=pd.Series(dtype='str', name='text') )
data['text'].tail()

400385              [woke, up, school, best, feeling, ever]
400386    [thewdb, com, cool, hear, old, walt, interview...
400387                [ready, mojo, makeover, ask, details]
400388    [happy, th, birthday, boo, alll, time, tupac, ...
400389    [happy, charitytuesday, thenspcc, sparkscharit...
Name: text, dtype: object
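
Note that `stemming_on_text` and `lemmatizer_on_text` each build a transformed list in `text` but return `token_list`, so the tokens pass through unchanged. The recorded outputs agree with this: `feeling` and `details` are not reduced. The sketch below shows the corrected pattern, returning the new list. To keep it self-contained it uses `toy_stem`, a hypothetical stand-in for `st.stem` or `lm.lemmatize`; in the notebook the real NLTK callables would be passed instead.

```python
def toy_stem(token):
    # Hypothetical stand-in for st.stem / lm.lemmatize: strips a trailing "ing" or "s"
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def transform_tokens(token_list, transform=toy_stem):
    # Return the transformed list, not the input list
    return [transform(token) for token in token_list]

tokens = ["woke", "up", "school", "best", "feeling", "ever"]
stemmed = transform_tokens(tokens)
print(stemmed)
```

All metrics reported in this notebook were produced with the pass-through behavior, so applying real stemming or lemmatization would require retraining and re-evaluating the model.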

In [82]:
#data['text'].head()

In [83]:
data['target'] = data['target'].replace(4,1)

### Splitting the dataset into training set and test set 

In [None]:
#! pip install dask-ml

In [84]:
data.shape

(Delayed('int-13b4c0d3-2dad-4773-af0f-7f8a78060ee0'), 2)

In [85]:
X=data.text
y=data.target

In [86]:
X.shape

(dd.Scalar<size-ag..., dtype=int32>,)

In [108]:
from dask_ml.model_selection import train_test_split
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0,shuffle =False)

In [109]:
X_train.shape

(dd.Scalar<size-ag..., dtype=int32>,)

In [110]:
#X_train.compute()[:10]

In [111]:
y_train.shape

(dd.Scalar<size-ag..., dtype=int32>,)

### Model preparation

In [112]:
from dask_ml.feature_extraction.text import HashingVectorizer
vec = HashingVectorizer(analyzer= lambda x: x,n_features =6000)
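
The idea behind a hashing vectorizer can be sketched in a few lines: each token is hashed to one of `n_features` column indices, so no vocabulary needs to be stored and every partition can be transformed independently. The code below is a simplified illustration under stated assumptions, not the library's implementation: it uses MD5 as a stand-in hash (scikit-learn uses MurmurHash3, so real column indices differ) and omits sign alternation and normalization. `bucket` and `hash_tokens` are hypothetical names.

```python
import hashlib

N_FEATURES = 6000  # same width as n_features in the vectorizer above

def bucket(token, n_features=N_FEATURES):
    # Stand-in hash: first 4 bytes of the MD5 digest, reduced modulo the width
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features

def hash_tokens(tokens, n_features=N_FEATURES):
    # Count how many tokens land in each column; collisions share a column
    vector = [0] * n_features
    for token in tokens:
        vector[bucket(token, n_features)] += 1
    return vector

row = hash_tokens(["ready", "mojo", "makeover", "ask", "details"])
print(len(row), sum(row))
```

One detail worth checking in the real pipeline: scikit-learn's `HashingVectorizer` alternates the sign of hashed values by default (`alternate_sign=True`), while `BernoulliNB` binarizes features at a threshold of 0.0 by default, so negatively signed features would be treated as absent.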

In [113]:
vec

In [114]:
X_train = vec.fit_transform(X_train)
X_test  = vec.transform(X_test)

In [115]:
#data.target.unique().compute()

In [116]:
X_train.shape

(nan, 6000)

In [117]:
#X_train.compute()[:10]

In [None]:
#y_train.head(5)

In [None]:
#y_train.tail(5)

In [118]:
type(X_train)

dask.array.core.Array

In [119]:
type(y_train)

dask.dataframe.core.Series

### Training the Bernoulli Naive Bayes on Training set

In [120]:
from mlxtend.preprocessing import DenseTransformer

In [121]:
dt = DenseTransformer()
X_dense = dt.fit_transform(X= X_train)

In [122]:
X_dense.shape

(nan, 6000)

In [123]:
#X_dense.compute()[0]

In [124]:
X_dense

,Array,Chunk
Shape,"(nan, 6000)","(nan, 6000)"
Count,132 Tasks,4 Chunks
Type,float64,scipy.sparse.csr.csr_matrix


In [125]:
#X_dense.compute()[:2]

In [126]:
from sklearn.naive_bayes import BernoulliNB
from dask.diagnostics import ProgressBar
from dask_ml.wrappers import Incremental

classifier = BernoulliNB()
parallel_nb = Incremental(classifier)
with ProgressBar():
    parallel_nb.fit(X_dense, y_train, classes=[0,1])

[########################################] | 100% Completed |  5min 32.1s
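
The `Incremental` wrapper trains by passing the blocks of the Dask array to the estimator's `partial_fit` one after another. The toy sketch below imitates that loop with a hypothetical `ToyCounter` class that only tallies labels; it is meant to show why `classes=[0,1]` is supplied up front, since a single block may contain only one label (the file appears to be ordered by label, judging by its head and tail).

```python
from collections import Counter

class ToyCounter:
    """Hypothetical stand-in for an estimator that supports partial_fit."""

    def __init__(self):
        self.class_count = Counter()

    def partial_fit(self, X_block, y_block, classes=None):
        # Register every possible class, even if this block lacks some of them
        if classes is not None:
            for c in classes:
                self.class_count.setdefault(c, 0)
        self.class_count.update(y_block)
        return self

# Three blocks of (features, labels); the first block contains only label 0
blocks = [
    ([[1, 0], [0, 1]], [0, 0]),
    ([[1, 1]], [1]),
    ([[0, 0], [1, 0]], [1, 0]),
]

model = ToyCounter()
for X_block, y_block in blocks:
    model.partial_fit(X_block, y_block, classes=[0, 1])
print(dict(model.class_count))
```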


### Predicting the test result

In [None]:
#y_test.compute()

In [128]:
y_pred = parallel_nb.predict(X_test)
#print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [129]:
y_pred

,Array,Chunk
Bytes,unknown,unknown
Shape,"(nan,)","(nan,)"
Count,136 Tasks,4 Chunks
Type,int32,numpy.ndarray


In [130]:
#y_pred.compute()

### Making the confusion Matrix

In [132]:
from sklearn.metrics import confusion_matrix
from dask_ml.metrics import accuracy_score
with ProgressBar():
    cm = confusion_matrix(y_test.compute(), y_pred.compute())
print(cm)
accuracy_score(y_test, y_pred)

[########################################] | 100% Completed |  5min 29.3s
[########################################] | 100% Completed |  5min 17.0s
[[131227  69191]
 [ 56303 143720]]


0.6866105119106185

### Precision

In [133]:
from sklearn.metrics import precision_score
with ProgressBar():
    precision = precision_score(y_test.compute(), y_pred.compute(),pos_label = 1 )
print(precision)

[########################################] | 100% Completed |  5min  9.5s
[########################################] | 100% Completed |  5min 10.0s
0.6750238362508278


### Recall

In [134]:
from sklearn.metrics import recall_score
with ProgressBar():
    recall = recall_score(y_test.compute(),y_pred.compute(), average='binary', sample_weight=None, zero_division='warn',pos_label = 1)
print(recall)

[########################################] | 100% Completed |  5min 10.1s
[########################################] | 100% Completed |  5min  5.1s
0.7185173705023922


### F1 Score

In [135]:
from sklearn.metrics import f1_score
with ProgressBar():
    # Store the result under a different name so the imported f1_score function is not overwritten
    f1 = f1_score(y_test.compute(), y_pred.compute(), average='binary', zero_division='warn', pos_label=1)
print(f1)

[########################################] | 100% Completed |  5min  3.1s
[########################################] | 100% Completed |  5min  6.3s
0.6960918694028586
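
Each metric cell above recomputes `y_test` and `y_pred`, which takes about five minutes per call according to the progress bars. All four scores can also be derived directly from the confusion matrix already printed. The sketch below does this arithmetic, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]` for labels 0 and 1).

```python
# Confusion matrix as printed above: rows = true label, columns = predicted label
cm = [[131227, 69191],
      [56303, 143720]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all tweets classified correctly
precision = tp / (tp + fp)            # of predicted positives, how many are positive
recall = tp / (tp + fn)               # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

The values agree with the scores reported by the individual cells, which is a quick consistency check on the results.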


### Predicting whether a single tweet is positive or negative

In [212]:
new_review = ["0","2251250681","Sat Jun 20 02:54:45 PDT 2009","NO_QUERY","rosaliehalerpg","@EmmettCullenRPG Miss you, love. We never see each other anymore /frowns/ The world keeps getting in the way "]
new_review = new_review[-1].lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords  = stopwords.words('english')
all_stopwords.remove('not')
new_review = [ps.stem(word) for word in new_review if  word not in set(all_stopwords)]
new_review = [lemmatizer_on_text(word) for word in new_review]
new_review  = ' '.join([str(elem) for elem in new_review])
# func : tokenize.tokenize
new_review = tokenizer.tokenize(new_review)
new_corpus = ' '.join(new_review)
X_new = vec.transform([new_corpus])
new_y_pred = parallel_nb.predict(X_new)
print(new_y_pred)

[1]


In [215]:
print(new_corpus)

emmettcullenrpg miss you love never see anymor frowns world keep get way


In [213]:
X_new.shape

(1, 6000)
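
One point to check in the single-tweet cell: during training the vectorizer received lists of tokens, but here it receives the joined string `new_corpus`. With the identity analyzer `lambda x: x`, whatever is passed in is iterated as-is, so a string would be split into characters rather than words (and some scikit-learn versions may reject a bare string instead). The sketch below illustrates the difference in plain Python; the token values are taken from the printed `new_corpus`.

```python
analyzer = lambda x: x  # same identity analyzer as passed to the vectorizer

tokens = ["miss", "you", "love"]
joined = " ".join(tokens)

# Iterating a list yields whole tokens; iterating a string yields characters
features_from_list = list(analyzer(tokens))
features_from_string = list(analyzer(joined))

print(features_from_list)
print(features_from_string[:6])
```

If that reading is right, passing the token list (for example `vec.transform([new_review])`) would match the training input more closely. The cell also applies Porter stemming, which the training pipeline effectively skipped, so the two preprocessing paths differ and the prediction above should be treated with caution.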