### PROJECT-1 (TWITTER SENTIMENT ANALYSIS)
The following project is about analyzing the sentiments of tweets on social networking website
‘Twitter’. The dataset for this project is scraped from Twitter. It contains 1,600,000 tweets
extracted using Twitter API. It is a labeled dataset with tweets annotated with the sentiment (0 =
negative, 2 = neutral, 4 = positive).
It contains the following 6 fields:
1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. ids: The id of the tweet .
3. date: The date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. flag: The query. If there is no query, then this value is NO_QUERY.
5. user: The user that tweeted
6. text: The text of the tweet.
Design a classification model that correctly predicts the polarity of the tweets provided in the
dataset.

#### Importing the libraries

In [145]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Importing the dataset

In [146]:
# ! pip install chardet

In [147]:
import chardet
with open('twitter_mini.csv', 'rb') as file:
    print(chardet.detect(file.read(500)))

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}


In [148]:
print(repr(open('twitter_mini.csv', 'rb').read(700)) )# dump 1st 200 bytes of file

b'0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that\'s a bummer.  You shoulda got David Carr of Third Day to do it. ;D"\r\n0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can\'t update his Facebook by texting it... and might cry as a result  School today also. Blah!\r\n0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds\r\n0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire \r\n0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass n'


In [149]:
import dask.dataframe as dd
df = dd.read_csv('twitter_mini.csv',delimiter = ",",encoding = 'latin',header = None,
                 names = ['target','id','date','Query','User','text'] ,
                 dtype={'Query':str,'User': str,'text' : str})

In [150]:
df.compute()

Unnamed: 0,target,id,date,Query,User,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [151]:
df

Unnamed: 0_level_0,target,id,date,Query,User,text
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,int64,int64,object,object,object,object
,...,...,...,...,...,...


In [152]:
df.shape[0].compute()

2000

In [153]:
df['text'].tail()

1995    Just woke up. Having no school is the best fee...
1996    TheWDB.com - Very cool to hear old Walt interv...
1997    Are you ready for your MoJo Makeover? Ask me f...
1998    Happy 38th Birthday to my boo of alll time!!! ...
1999    happy #charitytuesday @theNSPCC @SparksCharity...
Name: text, dtype: object

In [154]:
data=df[['text','target']]

In [155]:
type(data)

dask.dataframe.core.DataFrame

In [156]:
data.tail()

Unnamed: 0,text,target
1995,Just woke up. Having no school is the best fee...,4
1996,TheWDB.com - Very cool to hear old Walt interv...,4
1997,Are you ready for your MoJo Makeover? Ask me f...,4
1998,Happy 38th Birthday to my boo of alll time!!! ...,4
1999,happy #charitytuesday @theNSPCC @SparksCharity...,4


### Cleaning the text

In [157]:
import nltk
import re
# nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# Making statement in lower case
data['text']=data['text'].str.lower()
data['text'].tail()

1995    just woke up. having no school is the best fee...
1996    thewdb.com - very cool to hear old walt interv...
1997    are you ready for your mojo makeover? ask me f...
1998    happy 38th birthday to my boo of alll time!!! ...
1999    happy #charitytuesday @thenspcc @sparkscharity...
Name: text, dtype: object

In [158]:
# Removing stopwords from the text
all_stopwords  = stopwords.words('english')
all_stopwords.remove('not')
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in all_stopwords])
data['text'] = data['text'].apply(cleaning_stopwords, meta=pd.Series(dtype='str', name='text'))
#data.tail()

In [159]:
# cleaning multiple ... characters
def cleaning_repeating_char(text):
    return re.sub(r'(.)1+', r'1', text)
data['text'] = data['text'].apply(cleaning_repeating_char,meta=pd.Series(dtype='str', name='text'))
#data.tail()

In [160]:
# cleaning url addressess 
def cleaning_URLs(text):
    return re.sub(r'((https?://\S+)|(www.\S+))','',text)
data['text'] = data['text'].apply(cleaning_URLs,meta=pd.Series(dtype='str', name='text'))
#data.tail()

In [161]:
# Removing Alpha-numeric character
def cleaning_alpha_numeric(data):
    return re.sub('[0-9]', '', data)
data['text'] = data['text'].apply(cleaning_alpha_numeric,meta=pd.Series(dtype='str', name='text'))
#data.tail()

In [162]:
# tokenizing tweet text
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
func = tokenizer.tokenize
data['text'] = data['text'].apply(func,meta=pd.Series(dtype='str', name='text') )
#data['text'].head()

In [163]:
# Applying Stemming
import nltk
st = nltk.PorterStemmer()
def stemming_on_text(token_list):
    text = [st.stem(token) for token in token_list]
    return token_list
data['text']= data['text'].apply(stemming_on_text,meta=pd.Series(dtype='str', name='text'))
data['text'].tail()

1995              [woke, up, school, best, feeling, ever]
1996    [thewdb, com, cool, hear, old, walt, interview...
1997                [ready, mojo, makeover, ask, details]
1998    [happy, th, birthday, boo, alll, time, tupac, ...
1999    [happy, charitytuesday, thenspcc, sparkscharit...
Name: text, dtype: object

In [164]:
lm = nltk.WordNetLemmatizer()
# Use it once first, to "unlazify" wordnet
lm.lemmatize('cats')

'cat'

In [165]:
import time
time.sleep(5)

In [167]:
#Applying Lemmitization
def lemmatizer_on_text(token_list):
    text = [lm.lemmatize(token) for token in token_list]
    return token_list
data['text'] = data['text'].apply(lemmatizer_on_text,meta=pd.Series(dtype='str', name='text') )
# joining with " "
data["text"]= data["text"].str.join(" ")
data['text'].tail()

1995                     woke up school best feeling ever
1996           thewdb com cool hear old walt interviews â
1997                      ready mojo makeover ask details
1998    happy th birthday boo alll time tupac amaru sh...
1999    happy charitytuesday thenspcc sparkscharity sp...
Name: text, dtype: object

In [168]:
data['text'].head()

0    switchfoot awww that s bummer shoulda got davi...
1    upset can t update facebook texting it might c...
2    kenichan dived many times ball managed save re...
3                     whole body feels itchy like fire
4    nationwideclass no not behaving all i m mad he...
Name: text, dtype: object

In [169]:
# Replacing the target value for positive comments to 1.
data['target'] = data['target'].replace(4,1)

### Splitting the dataset into training set and test set 

In [170]:
#! pip install dask-ml

In [171]:
data.shape

(Delayed('int-d8d7b474-19e1-4806-aa52-83288e4d643d'), 2)

In [172]:
X=data.text
y=data.target

In [173]:
X.shape

(dd.Scalar<size-ag..., dtype=int32>,)

In [216]:
from dask_ml.model_selection import train_test_split
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0,shuffle =False)

In [217]:
X_train.shape

(dd.Scalar<size-ag..., dtype=int32>,)

In [218]:
X_train.compute()[:10]

0     switchfoot awww that s bummer shoulda got davi...
3                      whole body feels itchy like fire
5                               kwesidei not whole crew
6                                              need hug
7     loltrish hey long time see yes rains bit only ...
8                                        tatiana_k nope
10                      spring break plain city snowing
12    caregiving bear watch it thought ua loss embar...
13         octolin counts idk either never talk anymore
14    smarrison would ve first gun not really though...
Name: text, dtype: object

In [220]:
type(X_train.compute()[0])

str

In [207]:
y_train.shape

(dd.Scalar<size-ag..., dtype=int32>,)

### Model preparation 

In [208]:
from dask_ml.feature_extraction.text import HashingVectorizer
vec = HashingVectorizer(analyzer= lambda x: x,n_features =1500)

In [209]:
vec

In [211]:
X_train = vec.fit_transform(X_train)
X_test  = vec.transform(X_test)

In [181]:
data.target.unique().compute()

0    0
1    1
Name: target, dtype: int64

In [182]:
X_train.shape

(nan, 1500)

In [183]:
X_train.compute()[:10]

<10x1500 sparse matrix of type '<class 'numpy.float64'>'
	with 159 stored elements in Compressed Sparse Row format>

In [184]:
y_train.head(5)

0    0
3    0
5    0
6    0
7    0
Name: target, dtype: int64

In [185]:
y_train.tail(5)

1995    1
1996    1
1997    1
1998    1
1999    1
Name: target, dtype: int64

In [186]:
type(X_train)

dask.array.core.Array

In [187]:
type(y_train)

dask.dataframe.core.Series

### Training the Bernoulli Naive Bayes on Training set

In [188]:
from mlxtend.preprocessing import DenseTransformer

In [189]:
dt = DenseTransformer()
X_dense = dt.fit_transform(X= X_train)

In [190]:
X_dense.shape

(nan, 1500)

In [191]:
X_dense.compute()[0]

<1x1500 sparse matrix of type '<class 'numpy.float64'>'
	with 20 stored elements in Compressed Sparse Row format>

In [192]:
X_dense

Unnamed: 0,Array,Chunk
Shape,"(nan, 1500)","(nan, 1500)"
Count,40 Tasks,1 Chunks
Type,float64,scipy.sparse.csr.csr_matrix
"Array Chunk Shape (nan, 1500) (nan, 1500) Count 40 Tasks 1 Chunks Type float64 scipy.sparse.csr.csr_matrix",,

Unnamed: 0,Array,Chunk
Shape,"(nan, 1500)","(nan, 1500)"
Count,40 Tasks,1 Chunks
Type,float64,scipy.sparse.csr.csr_matrix


In [193]:
#X_dense.compute()[:2]

In [194]:
from sklearn.naive_bayes import BernoulliNB
from dask.diagnostics import ProgressBar
from dask_ml.wrappers import Incremental

classifier = BernoulliNB()
parallel_nb = Incremental(classifier)
with ProgressBar():
    parallel_nb.fit(X_dense, y_train, classes=[0,1])

[########################################] | 100% Completed |  0.6s


### Predicting the test result

In [195]:
y_test.compute()

1       0
2       0
4       0
9       0
11      0
       ..
1982    1
1985    1
1989    1
1991    1
1993    1
Name: target, Length: 516, dtype: int64

In [196]:
y_pred

Unnamed: 0,Array,Chunk
Bytes,unknown,unknown
Shape,"(nan,)","(nan,)"
Count,34 Tasks,1 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes unknown unknown Shape (nan,) (nan,) Count 34 Tasks 1 Chunks Type int32 numpy.ndarray",,

Unnamed: 0,Array,Chunk
Bytes,unknown,unknown
Shape,"(nan,)","(nan,)"
Count,34 Tasks,1 Chunks
Type,int32,numpy.ndarray


In [197]:
y_pred = parallel_nb.predict(X_test)
#print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [198]:
y_pred.compute()

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,

### Making the confusion Matrix

In [199]:
from sklearn.metrics import confusion_matrix
from dask_ml.metrics import accuracy_score
cm = confusion_matrix(y_test.compute(), y_pred.compute())
print(cm)
accuracy_score(y_test, y_pred)

[[ 83 188]
 [ 70 175]]


0.5

### Precision

In [200]:
from sklearn.metrics import precision_score
precision = precision_score(y_test.compute(), y_pred.compute(),pos_label = 1 )
print(precision)

0.4820936639118457


### Recall

In [201]:
from sklearn.metrics import recall_score
recall = recall_score(y_test.compute(),y_pred.compute(), average='binary', sample_weight=None, zero_division='warn',pos_label = 1)
print(recall)

0.7142857142857143


### F1 Score

In [202]:
from sklearn.metrics import f1_score
f1_score = f1_score(y_test.compute(), y_pred.compute(), average='binary', zero_division='warn', pos_label =1)
print(f1_score)

0.575657894736842


### Predicting if single review is positive or Negative

In [221]:
new_review = ["0","2251250681","Sat Jun 20 02:54:45 PDT 2009","NO_QUERY","rosaliehalerpg","@EmmettCullenRPG Miss you, love. We never see each other anymore /frowns/ The world keeps getting in the way "]
new_review = new_review[-1].lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords  = stopwords.words('english')
all_stopwords.remove('not')
new_review = [ps.stem(word) for word in new_review if  word not in set(all_stopwords)]
new_review = [lemmatizer_on_text(word) for word in new_review]
new_review  = ' '.join([str(elem) for elem in new_review])
# func : tokenize.tokenize
new_review = tokenizer.tokenize(new_review)
new_corpus = ' '.join(new_review)

In [222]:
new_corpus

'emmettcullenrpg miss you love never see anymor frowns world keep get way'

In [224]:
X_new = vec.transform([new_corpus])

In [225]:
X_new.shape

(1, 1500)

In [227]:
new_y_pred = parallel_nb.predict(X_new)
print(new_y_pred)

[1]
