## Mini Project 3

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Importing libraries

In [None]:
import pandas as pd
import numpy as np
import nltk

import re
import string
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Reading the data from the given CSV file

In [None]:
df = pd.read_csv('Twitter_Data.csv')

### Checking the shape of the data and printing the first 5 rows

In [None]:
df.shape

(162980, 2)

In [None]:
df.head()

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0


### Count of each unique value of the dependent variable

In [None]:
df.category.value_counts()

 1.0    72250
 0.0    55213
-1.0    35510
Name: category, dtype: int64

So, there are 3 distinct values for category. Let's map these numeric values to categorical labels as follows:

-1 to Negative,
0 to Neutral,
1 to Positive.

In [None]:
df['category']=df['category'].map({-1.0:'Negative', 0.0:'Neutral', 1.0:'Positive'})

In [None]:
df.head()

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,Negative
1,talk all the nonsense and continue all the dra...,Neutral
2,what did just say vote for modi welcome bjp t...,Positive
3,asking his supporters prefix chowkidar their n...,Positive
4,answer who among these the most powerful world...,Positive


### Missing value analysis

In [None]:
df.isna().sum()

clean_text    4
category      7
dtype: int64

In [None]:
df = df.dropna()

In [None]:
df.isna().sum()

clean_text    0
category      0
dtype: int64

The usual cleaning process in NLP involves:
<br>Removing missing values, if any.
<br>Removing unwanted characters such as punctuation.
<br>Converting uppercase to lowercase, since the machine treats them as different tokens even though 'cat' and 'CAT' mean the same thing.
<br>Removing words that follow a specific pattern, such as links, email addresses, or usernames; these contribute little to the analysis and can be removed with regular expressions.
<br>Removing stopwords such as pronouns and articles; these occur very frequently in any sentence but contribute little to NLP analysis.
<br>Finally, reducing each word to its root form. For example, the root word for 'Playing' and 'Played' is 'Play'.
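
The pattern-removal step can be sketched with the standard `re` module. This is a minimal illustration, not part of the notebook's pipeline: the three patterns and the helper name `strip_patterns` are assumptions chosen for the example, and real tweets may need stricter or looser patterns.

```python
import re

# Illustrative patterns (assumptions, not exhaustive):
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")          # email addresses
MENTION_RE = re.compile(r"@\w+")                # @usernames


def strip_patterns(text):
    """Remove links, email addresses and @usernames, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    # Emails are removed before mentions so that 'a@b.com' is not half-stripped.
    text = EMAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


cleaned = strip_patterns("Check https://t.co/abc @user mail a@b.com now")
print(cleaned)
```

The order of the substitutions matters here: running the mention pattern first would leave fragments of email addresses behind.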

In [None]:
punct = string.punctuation
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
stopWords = stopwords.words('english')
stopWords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

There are two ways to find the root word.

Stemming: a rule-based algorithm that strips suffixes such as 'ing', 's', and 'es'. The resulting stem may not be a valid English word. It is computationally faster than lemmatizing.

Lemmatizing: this algorithm uses a vocabulary (here WordNet) and the word's morphology to find the dictionary root form (lemma) of the given word. It is a bit slower than stemming.

In [None]:
ps = nltk.PorterStemmer()
wn = nltk.WordNetLemmatizer()

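We know 'goose' and 'geese' denote the same thing; one is singular and the other plural. The stemmer and the lemmatizer, however, treat them differently: the stemmer leaves them as two different stems, while the lemmatizer maps both to 'goose'.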



Lemmatization can be used when the dataset is small, since the extra time is negligible. On a large dataset it can be expensive, and in that case stemming is the preferred choice.
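
To make the difference concrete without depending on NLTK data, here is a toy sketch. `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this illustration; they are far simpler than `PorterStemmer` and `WordNetLemmatizer`, and the suffix list and lookup table are assumptions.

```python
# Toy suffix stripper: applies fixed rules and knows nothing about the language.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip if at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemmatizer: a dictionary lookup standing in for a real vocabulary.
LEMMA_TABLE = {
    "geese": "goose",
    "playing": "play",
    "played": "play",
    "studies": "study",
}


def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)


for w in ["playing", "played", "studies", "geese"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based function handles regular suffixes but cannot relate 'geese' to 'goose' and turns 'studies' into a non-word, while the lookup-based one gets both right at the cost of needing a vocabulary.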

### Writing a function to clean the data

In [None]:
def cleanData(text):

    # Convert all uppercase characters to lowercase.
    text = text.lower()

    # Replace every character that is not a letter or a digit with a space.
    text = re.sub(r"[^A-Za-z0-9]", ' ', text)

    # The regular expression above already removes punctuation; this line is an
    # alternative that removes only punctuation and changes nothing at this point.
    text = ''.join([char for char in text if char not in punct])

    # Drop empty tokens and stopwords, then lemmatize each remaining word.
    # `and` is required here: `&` binds tighter than `!=` and would drop even-length words.
    text = [wn.lemmatize(word) for word in text.split(' ') if word and word not in stopWords]

    return ' '.join(text)
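
A note on the token filter: combining a membership test and a length test needs the boolean `and`, not the bitwise `&`. The sketch below uses a made-up stopword set and token list (both assumptions for illustration) to show how the two forms differ.

```python
STOP = {"the", "and", "for"}                      # illustrative stopword set
tokens = ["vote", "for", "modi", "", "bjp", "the"]

# Intended filter: keep tokens that are non-empty and not stopwords.
kept = [w for w in tokens if w and w not in STOP]

# Pitfall: `&` binds tighter than `!=`, so this condition is evaluated as
# ((w not in STOP) & len(w)) != 0, which keeps only odd-length non-stopwords.
kept_bitwise = [w for w in tokens if ((w not in STOP) & len(w) != 0)]

print(kept)
print(kept_bitwise)
```

With the bitwise form, `True & 4` is `0`, so four-letter words such as 'modi' are silently discarded.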

In [None]:
df['clean_text'] = df['clean_text'].apply(cleanData) 

In [None]:
df

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,Negative
1,talk all the nonsense and continue all the dra...,Neutral
2,what did just say vote for modi welcome bjp t...,Positive
3,asking his supporters prefix chowkidar their n...,Positive
4,answer who among these the most powerful world...,Positive
...,...,...
162975,why these 456 crores paid neerav modi not reco...,Negative
162976,dear rss terrorist payal gawar what about modi...,Negative
162977,did you cover her interaction forum where she ...,Neutral
162978,there big project came into india modi dream p...,Neutral


In [None]:
df['clean_text'][0]

'when modi promised “minimum government maximum governance” expected him begin the difficult job reforming the state why does take years get justice state should and not business and should exit psus and temples'

### Creating a column with the word count of each tweet for analysis

In [None]:
def find_len(txt):
    return len(txt.split())

In [None]:
df['Txt_len'] = [find_len(txt) for txt in df['clean_text']]

In [None]:
df.head()

Unnamed: 0,clean_text,category,Txt_len
0,when modi promised “minimum government maximum...,Negative,33
1,talk all the nonsense and continue all the dra...,Neutral,13
2,what did just say vote for modi welcome bjp t...,Positive,22
3,asking his supporters prefix chowkidar their n...,Positive,34
4,answer who among these the most powerful world...,Positive,14


In [None]:
### Vocabulary size
voc_size=5000

### Splitting the data into independent (X) and dependent (y) variables

In [None]:
X = df.drop(["category","Txt_len"],axis = 1)
y = df.category

### Creating a copy of the independent variables for data operations

In [None]:
messages=X.copy()

In [None]:
#first tweet
messages['clean_text'][0]

'when modi promised “minimum government maximum governance” expected him begin the difficult job reforming the state why does take years get justice state should and not business and should exit psus and temples'

In [None]:
messages.reset_index(inplace=True)

In [None]:
import nltk
import re
from nltk.corpus import stopwords

In [None]:
### Dataset Preprocessing
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Build the stopword set once; calling stopwords.words('english') for every word is very slow.
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(len(messages)):
    # Keep letters only, lowercase the text, and split it into tokens.
    review = re.sub('[^a-zA-Z]', ' ', messages['clean_text'][i])
    review = review.lower()
    review = review.split()

    # Drop stopwords and stem the remaining tokens.
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [None]:
corpus

['modi promis minimum govern maximum govern expect begin difficult job reform state take year get justic state busi exit psu templ',
 'talk nonsens continu drama vote modi',
 'say vote modi welcom bjp told rahul main campaign modi think modi relax',
 'ask support prefix chowkidar name modi great servic confus read crustal clear crass filthi nonsens see abus come chowkidar',
 'answer among power world leader today trump putin modi may',
 'kiya tho refresh maarkefir comment karo',
 'surat women perform yagna seek divin grace narendra modi becom',
 'come cabinet scholar like modi smriti hema time introspect',
 'upcom elect india saga go import pair look current modi lead govt elect deal brexit combin weekli look juici bear imho',
 'gandhi gay modi',
 'thing like demonetis gst good servic tax upper cast would sort either view favour say need give time cast like dalit muslim modi constitu',
 'hope tuthukudi peopl would prefer honest well behav nationalist courag likli minist modi cabinet vo

### One hot encoding for each word

In [None]:
from tensorflow.keras.preprocessing.text import one_hot
onehot_repr=[one_hot(words,voc_size)for words in corpus] 
onehot_repr

[[1425,
  4879,
  1298,
  3527,
  446,
  3527,
  3306,
  4792,
  2235,
  1361,
  909,
  2931,
  19,
  1347,
  1805,
  1793,
  2931,
  3773,
  1032,
  2500,
  3984],
 [3812, 89, 2188, 1791, 4544, 1425],
 [2975, 4544, 1425, 979, 1580, 2652, 648, 2071, 1603, 1425, 1872, 1425, 4172],
 [91,
  2243,
  3738,
  3695,
  3690,
  1425,
  357,
  2846,
  3366,
  4914,
  3034,
  4008,
  4954,
  4865,
  89,
  2805,
  1440,
  2449,
  3695],
 [4328, 2063, 110, 1606, 2692, 2970, 2223, 4143, 1425, 1983],
 [3566, 121, 4804, 4975, 2513, 2779],
 [515, 3755, 2425, 1797, 4695, 337, 4377, 4927, 1425, 116],
 [2449, 2846, 3127, 4312, 1425, 1584, 926, 2213, 4374],
 [654,
  3690,
  1055,
  1352,
  2656,
  2739,
  2174,
  3506,
  3535,
  1425,
  955,
  4943,
  3690,
  4782,
  4250,
  1024,
  4890,
  3506,
  3060,
  1458,
  3954],
 [2054, 2630, 1425],
 [130,
  4312,
  1882,
  3310,
  2780,
  2846,
  919,
  3754,
  3262,
  986,
  1748,
  4220,
  659,
  2495,
  2975,
  2539,
  604,
  2213,
  3262,
  4312,
  4834,
  32
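
Despite its name, Keras `one_hot` does not return one-hot vectors: as far as I understand its documentation, it hashes each word to an integer index between 1 and `voc_size - 1`, so the same word always gets the same index within a run and different words can collide. The sketch below imitates that idea with `hashlib.md5` for reproducibility; `hashed_index` and `encode` are hypothetical helpers, and the indices they produce will not match the Keras output above.

```python
import hashlib


def hashed_index(word, vocab_size):
    """Map a word to an index in [1, vocab_size - 1]; 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1


def encode(text, vocab_size):
    """Encode a whitespace-separated text as a list of hashed word indices."""
    return [hashed_index(word, vocab_size) for word in text.lower().split()]


encoded = encode("modi promis minimum govern maximum govern", 5000)
print(encoded)
```

The repeated word 'govern' receives the same index at both positions, which mirrors the repeated 3527 in the first encoded tweet above.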

### Adding padding from the front 

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
sent_length=20 # sentence length
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length) # pad shorter sequences with zeros at the front
print(embedded_docs)

[[4879 1298 3527 ... 1032 2500 3984]
 [   0    0    0 ... 1791 4544 1425]
 [   0    0    0 ... 1872 1425 4172]
 ...
 [   0    0    0 ... 2392 3144 3814]
 [   0    0    0 ... 3819 2034 4637]
 [4312 1691 3825 ...   19  150 1967]]


In [None]:
embedded_docs[0]

array([4879, 1298, 3527,  446, 3527, 3306, 4792, 2235, 1361,  909, 2931,
         19, 1347, 1805, 1793, 2931, 3773, 1032, 2500, 3984], dtype=int32)
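
Note that the first tweet has 21 tokens, so it was truncated as well as the shorter ones being padded: with the default `truncating='pre'`, `pad_sequences` drops tokens from the front, which is why the leading 1425 is missing above. A minimal pure-Python sketch of that behaviour, where `pad_pre` is a hypothetical helper and `maxlen` is assumed to be positive:

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    seq = list(seq)[-maxlen:]                     # truncate from the front
    return [value] * (maxlen - len(seq)) + seq    # pad at the front


short = [3812, 89, 2188, 1791, 4544, 1425]        # second tweet from the output above
long = [1425, 4879, 1298, 3527, 446, 3527, 3306, 4792, 2235, 1361, 909,
        2931, 19, 1347, 1805, 1793, 2931, 3773, 1032, 2500, 3984]  # first tweet

padded_short = pad_pre(short, 20)
padded_long = pad_pre(long, 20)
print(padded_short)
print(padded_long)
```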

### Model building

In [None]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Dropout


In [None]:
## Creating model
embedding_vector_features=40
model1=Sequential()
model1.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model1.add(Bidirectional(LSTM(100)))
model1.add(Dropout(0.3))
model1.add(Dense(3,activation='softmax'))
model1.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])  # 3-class softmax with one-hot labels
print(model1.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            200000    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              112800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 3)                 603       
                                                                 
Total params: 313,403
Trainable params: 313,403
Non-trainable params: 0
_________________________________________________________________
None
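
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the function names are hypothetical and only the sizes (5000, 40, 100, 3) come from the model above.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    # Weight matrix plus one bias per output.
    return inputs * outputs + outputs


emb = embedding_params(5000, 40)
bilstm = 2 * lstm_params(40, 100)      # forward and backward LSTM
dense = dense_params(200, 3)           # 200 = 2 * 100 concatenated outputs
total = emb + bilstm + dense
print(emb, bilstm, dense, total)
```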


In [None]:
len(embedded_docs),y.shape

(162969, (162969,))

In [None]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [None]:
X_final.shape,y_final.shape

((162969, 20), (162969,))

In [None]:
y_final

array(['Negative', 'Neutral', 'Positive', ..., 'Neutral', 'Neutral',
       'Positive'], dtype=object)

### Dummy variable creation for dependent variable

In [None]:
y_final = pd.get_dummies(y_final)
y_final

Unnamed: 0,Negative,Neutral,Positive
0,1,0,0
1,0,1,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
162964,1,0,0
162965,1,0,0
162966,0,1,0
162967,0,1,0


### Splitting the data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.2, random_state=42)

### Model training

In [None]:
### Finally Training
model1.fit(X_train,y_train, validation_data=(X_test,y_test),epochs=10,batch_size=64)



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f866fd60a10>

### Performance metrics and accuracy

In [None]:

y_pred1 = model1.predict(X_test)



In [None]:
y_pred1[0] #  first prediction

array([0.9712264 , 0.02665931, 0.0021142 ], dtype=float32)

In [None]:
y_pred1[0:10]

array([[9.71226394e-01, 2.66593061e-02, 2.11420469e-03],
       [1.02044112e-04, 9.97128844e-01, 2.76906928e-03],
       [5.57467163e-01, 2.85121854e-02, 4.14020777e-01],
       [1.39543368e-02, 6.09568246e-02, 9.25088823e-01],
       [1.43581163e-02, 6.76990449e-01, 3.08651417e-01],
       [9.99990404e-01, 4.19597063e-06, 5.33859611e-06],
       [5.69216982e-02, 8.46006274e-01, 9.70721096e-02],
       [1.04278736e-01, 8.47533159e-03, 8.87245834e-01],
       [1.56539457e-03, 2.15255001e-04, 9.98219371e-01],
       [1.58440709e-01, 4.15564924e-02, 8.00002813e-01]], dtype=float32)
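
Each row holds the predicted probabilities for the three classes, in the column order produced by `pd.get_dummies` (Negative, Neutral, Positive). Picking the largest value per row gives the predicted label. A small sketch using the first three rows above; `predicted_label` is a hypothetical helper:

```python
LABELS = ["Negative", "Neutral", "Positive"]      # column order of the dummy variables


def predicted_label(probs):
    """Return the label whose probability is highest."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best]


rows = [
    [9.71226394e-01, 2.66593061e-02, 2.11420469e-03],
    [1.02044112e-04, 9.97128844e-01, 2.76906928e-03],
    [5.57467163e-01, 2.85121854e-02, 4.14020777e-01],
]
labels = [predicted_label(row) for row in rows]
print(labels)
```

The third row shows a less certain prediction: Negative wins with about 0.56, with Positive close behind at about 0.41.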

In [None]:
ex = y_pred1.copy()
ex[1]

array([1.0204411e-04, 9.9712884e-01, 2.7690693e-03], dtype=float32)

### Converting the predictions to the same one-hot format as the original labels

In [None]:
# Convert each row of class probabilities into a one-hot row: the most
# probable class is set to 1 and the other two classes to 0.
ex = np.zeros_like(y_pred1)
ex[np.arange(len(y_pred1)), y_pred1.argmax(axis=1)] = 1

In [None]:
y_test

Unnamed: 0,Negative,Neutral,Positive
42228,0,1,0
22034,0,0,1
79981,1,0,0
118492,1,0,0
12814,0,1,0
...,...,...,...
47104,0,0,1
33631,1,0,0
93675,0,1,0
37756,0,1,0


### Accuracy on test data

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,ex)

0.7645272135975947

### Classification report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,ex))

              precision    recall  f1-score   support

           0       0.66      0.66      0.66      7152
           1       0.79      0.78      0.79     11067
           2       0.79      0.80      0.80     14375

   micro avg       0.76      0.76      0.76     32594
   macro avg       0.75      0.75      0.75     32594
weighted avg       0.76      0.76      0.76     32594
 samples avg       0.76      0.76      0.76     32594

