## Sentiment140 dataset

### Context
This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

### Content
It contains the following 6 fields:

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- flag: the query (lyx). If there is no query, this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)

In [85]:
import json
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import seaborn as sns

import os
import zipfile

# show all columns and the full column width in pandas, so the whole tweet text is visible
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)
sns.set(style='whitegrid', palette='Set3', font_scale=1.2)

We start by defining the constants used in the rest of the notebook.

In [87]:
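embedding_dim = 100      # dimension of the GloVe vectors
max_length = 16          # number of tokens kept per tweet
trunc_type = 'post'      # truncate tweets longer than max_length at the end
padding_type = 'post'    # pad shorter tweets with zeros at the end
oov_tok = "<OOV>"        # out-of-vocabulary token (defined here but not passed to the Tokenizer below)
training_size = 160000   # number of tweets sampled from the 1,600,000 available
test_portion = .1        # fraction of the sample held out for testing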

In [88]:
data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1', names=['target', 'id', 'date', 'flag', 'user', 'text'])
# Shuffle before taking a smaller subset; otherwise the first rows would all have target = 0
data = data.sample(frac=1)
# Keep the specified number of samples
data = data[:training_size]
data.head()

| | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 302414 | 0 | 1998997936 | Mon Jun 01 19:28:18 PDT 2009 | NO_QUERY | chelshhh | scream to be heard, like you needed any more attention |
| 327269 | 0 | 2009276560 | Tue Jun 02 15:25:56 PDT 2009 | NO_QUERY | laurax4trees | @jaxel042 I love that I'm on your desktop...&amp; I think we all need to hang out soon because apparently Julie, Kelly, &amp; Cam did lunch today |
| 279870 | 0 | 1991905283 | Mon Jun 01 07:52:09 PDT 2009 | NO_QUERY | amwa9 | took my doggy to the vet this morning.. hope we don't put her down tonight. |
| 1094276 | 4 | 1970133580 | Sat May 30 02:35:29 PDT 2009 | NO_QUERY | neaf | First job in real company 'with an office' and only regression there. Please, don't you need a coder willing to develop his skills? |
| 655594 | 0 | 2240015698 | Fri Jun 19 09:14:14 PDT 2009 | NO_QUERY | bangshesays | $430 later all the animals have their shots, tests, and preventative stuff! Do not like annuals. |


In [89]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = data['text'].values 
labels = data['target'].values

# number of samples held out for testing (test_portion is defined with the constants above)
split = int(test_portion * training_size)

testing_sentences = sentences[0:split]
training_sentences = sentences[split:training_size]
testing_labels = labels[0:split]
training_labels = labels[split:training_size]

print('Size of Sentences ', len(sentences))
print('Size if labels ', len(labels))

# Declare the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index
# len(word_index) does not count index 0, which is reserved for padding,
# so the embedding layer needs vocab_size + 1 rows
vocab_size = len(word_index)

# convert the train and test splits to sequences and pad them
training_sequences = tokenizer.texts_to_sequences(training_sentences)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)

training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print('Size of vocab ',vocab_size)
print('Word index for i {}'.format(word_index['i']))
print('Train and Test length')
print(len(training_sequences), len(testing_sequences))

Size of Sentences  160000
Size if labels  160000
Size of vocab  128237
Word index for i 1
Train and Test length
144000 16000
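
To make the `padding='post'` and `truncating='post'` settings concrete, here is a minimal pure-Python sketch of what post-padding and post-truncation do to a sequence of word indices. The helper `pad_post` is hypothetical and written only for illustration; it is not the Keras implementation, and the toy `maxlen` of 4 is an assumption chosen to keep the output short.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate at the end, then pad with `value` at the end (illustrative helper)."""
    truncated = sequence[:maxlen]  # 'post' truncation drops the trailing tokens
    return truncated + [value] * (maxlen - len(truncated))  # 'post' padding appends zeros

short_seq = pad_post([5, 12, 7], maxlen=4)
long_seq = pad_post([5, 12, 7, 3, 9, 1], maxlen=4)
print(short_seq)  # [5, 12, 7, 0]
print(long_seq)   # [5, 12, 7, 3]
```

With `max_length = 16`, tweets shorter than 16 tokens end in zeros and longer tweets lose everything after the 16th token.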


### GloVe: Global Vectors for Word Representation

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. You can download the pretrained word vectors from the official [GitHub repo](https://github.com/stanfordnlp/GloVe). The zip file contains several versions that differ in the dimension of the embedding. We will use the 100-dimensional version.

#### Loading the vectors
The files are in .txt format. Each line contains a word followed by a string of numbers, one per embedding dimension, in this case 100. We open the file and iterate over every line, taking the first token as the word and the rest as its coefficients, and save the pair in a dictionary.

In [90]:
path = 'glove.6B/glove.6B.100d.txt'

embedding_dim = 100
embeddings_index = {}

with open(path, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print(len(embeddings_index), 'words loaded')

# Build the embedding matrix: row i holds the pretrained vector of the word with index i.
# Row 0 is reserved for padding, hence vocab_size + 1 rows.
embeddings_matrix = np.zeros((vocab_size + 1, embedding_dim))
for word, i in word_index.items():
    # look up the coefficient vector for this word
    embedding_vector = embeddings_index.get(word)
    # words without a pretrained vector keep an all-zero row
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector

print(len(embeddings_matrix))

400001 words loaded
128238
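
Words that have no GloVe vector keep an all-zero row in the embedding matrix, so it can be useful to check how much of the vocabulary is actually covered. The following is a minimal sketch of such a check. The helper `embedding_coverage` and the toy dictionaries are assumptions made for illustration; in the notebook you would pass the real `word_index` and `embeddings_index` instead.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return the fraction of vocabulary words that have a pretrained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in embeddings_index)
    return found / len(word_index)

# Toy stand-ins for the real `word_index` and `embeddings_index` (assumed values)
toy_word_index = {'i': 1, 'love': 2, 'lyx': 3, 'soooo': 4}
toy_embeddings_index = {'i': [0.1, 0.2], 'love': [0.3, 0.4], 'lyx': [0.5, 0.6]}

coverage = embedding_coverage(toy_word_index, toy_embeddings_index)
print(f'{coverage:.0%} of the toy vocabulary has a pretrained vector')  # 75%
```

A low coverage on tweets would not be surprising, since informal spellings and user mentions are unlikely to appear among the pretrained words.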


In [82]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length, weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 16, 100)           12799600  
_________________________________________________________________
dropout_12 (Dropout)         (None, 16, 100)           0         
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 12, 64)            32064     
_________________________________________________________________
max_pooling1d_12 (MaxPooling (None, 3, 64)             0         
_________________________________________________________________
lstm_12 (LSTM)               (None, 64)                33024     
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 65        
Total params: 12,864,753
Trainable params: 65,153
Non-trainable params: 12,799,600
____________________________________
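
The output shapes and parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer settings, assuming the Keras defaults of `'valid'` padding for `Conv1D` and a pooling stride equal to `pool_size`. The variable names are chosen for illustration only.

```python
embedding_dim, max_length = 100, 16
filters, kernel_size, pool_size, lstm_units = 64, 5, 4, 64

conv_len = max_length - kernel_size + 1  # 'valid' convolution: 16 - 5 + 1 = 12
pool_len = conv_len // pool_size         # non-overlapping pooling: 12 // 4 = 3

conv_params = kernel_size * embedding_dim * filters + filters         # kernel weights + biases
lstm_params = 4 * ((filters + lstm_units) * lstm_units + lstm_units)  # 4 gates, each with input, recurrent and bias weights
dense_params = lstm_units + 1                                         # weights + bias

trainable = conv_params + lstm_params + dense_params
print(conv_len, pool_len)                                  # 12 3
print(conv_params, lstm_params, dense_params, trainable)   # 32064 33024 65 65153
```

The embedding layer is frozen, so its 12,799,600 parameters (127,996 rows of 100 values) are the non-trainable ones. That row count differs from the 128,238 printed earlier, presumably because this cell was run on a different shuffle of the data.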

In [80]:
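epochs = 10

# The dataset labels are 0 (negative) and 4 (positive). Map them to 0/1 so they match
# the sigmoid output and the binary cross-entropy loss. The interrupted run logged below
# appears to have used the raw 0/4 labels, which would explain its negative loss.
training_labels = (np.array(training_labels) == 4).astype('float32')
testing_labels = (np.array(testing_labels) == 4).astype('float32')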

history = model.fit(training_padded, training_labels,
                    epochs=epochs,
                    validation_data=(testing_padded, testing_labels),
                    verbose=1)

Epoch 1/10
  59/4500 [..............................] - ETA: 14:29 - loss: -5.3233 - accuracy: 0.0085

KeyboardInterrupt: 
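
The negative loss and near-zero accuracy in the interrupted run above are consistent with training a sigmoid output against the raw labels 0 and 4. The sketch below evaluates the binary cross-entropy formula for a single example to show the effect. The function `binary_crossentropy` and the probability `p = 0.9` are assumptions for illustration; Keras additionally clips the probabilities, which does not change the sign.

```python
import math

def binary_crossentropy(y_true, p):
    """Binary cross-entropy for one example with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = 0.9  # assumed model output for a positive tweet
loss_mapped = binary_crossentropy(1, p)  # label remapped to 1: small positive loss
loss_raw = binary_crossentropy(4, p)     # raw Sentiment140 label: the (1 - y) term flips sign
print(f'label 1: {loss_mapped:.3f}')
print(f'label 4: {loss_raw:.3f}')
```

Remapping the labels to 0/1, for example with `(labels == 4)`, keeps the loss non-negative and makes the accuracy metric meaningful.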