# Dataset Preparation

In this notebook we train a Word2Vec model that turns the tweets into word vectors, which serve as the preprocessed input for the LSTM benchmark neural network.

In [1]:
import pandas as pd
import numpy as np
import json
import os
import sys
import pickle
from tqdm.notebook import tqdm
from nltk.tokenize import word_tokenize 
from sklearn.model_selection import train_test_split
import gc
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

In [2]:
folder_path_it = "C:\\Users\\Imetomi\\Documents\\MEGA\\Uni\\5. Felev\\Deep Learning\\Deep-Learning-Shuffle-Algorithms"
data_path = 'data'
file_name = 'twitter_sentiment.csv'

In [3]:
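# Note: read_csv treats the first line as a header by default; if the file has no
# header row, pass header=None so the first tweet is not consumed as column names.
data = pd.read_csv(os.path.join(data_path, file_name), encoding = "cp1252")
data.columns = ['Label', 'ID', 'Date', 'IDK', 'User', 'Post']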

In [4]:
data.head()

Unnamed: 0,Label,ID,Date,IDK,User,Post
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [5]:
delimiters = '!?/-_\\.’\'' 
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n‘'
posts = data['Post']
labels = data['Label']

This filtering and preprocessing is a deliberately naive approach: it only has to be good enough to give usable representations of the key words for sentiment classification.

In [6]:
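posts = posts.str.lower()
posts = posts.str.rstrip()

# regex=False: treat every character literally ('?', '.', '\\' are regex metacharacters)
for d in tqdm(delimiters):
    posts = posts.str.replace(d, ' ', regex=False)

for f in tqdm(filters):
    posts = posts.str.replace(f, '', regex=False)

# collapses only pairs of spaces; longer runs still leave empty tokens after the split
posts = posts.str.replace('  ', ' ', regex=False)
posts = posts.str.split(' ')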

In [7]:
posts

0          [is, upset, that, he, can, t, update, his, fac...
1          [kenichan, i, dived, many, times, for, the, ba...
2          [my, whole, body, feels, itchy, and, like, its...
3          [nationwideclass, no, it, s, not, behaving, at...
4                          [kwesidei, not, the, whole, crew]
                                 ...                        
1599994    [just, woke, up, having, no, school, is, the, ...
1599995    [thewdb, com, , very, cool, to, hear, old, wal...
1599996    [are, you, ready, for, your, mojo, makeover, a...
1599997    [happy, 38th, birthday, to, my, boo, of, alll,...
1599998    [happy, charitytuesday, thenspcc, sparkscharit...
Name: Post, Length: 1599999, dtype: object
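
The output above still contains empty tokens (for example `com, , very`), because `split(' ')` emits an empty string between adjacent spaces. The following is a minimal sketch of one way to avoid that on a plain Python string, assuming it is acceptable to deviate slightly from the cell above. The `clean_post` helper and the `_TABLE` mapping are hypothetical names, not part of the notebook; the two character sets are copied from it.

```python
DELIMITERS = '!?/-_\\.’\''
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n‘'

# Filter characters are dropped; delimiter characters become spaces.
# Delimiters are applied last so they win for characters in both sets,
# which matches the order of the two loops in the notebook.
_TABLE = {ord(c): None for c in FILTERS}
_TABLE.update({ord(c): ' ' for c in DELIMITERS})

def clean_post(text):
    """Lowercase, map punctuation, and split on runs of whitespace."""
    # split() without arguments never yields empty strings
    return text.lower().translate(_TABLE).split()

tokens = clean_post("thewdb.com - Very cool to hear old Walt interviews!")
print(tokens)
```

On a pandas Series the same idea could probably be applied with `posts.map(clean_post)`, at the cost of losing the vectorised string methods.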

## Training Word2Vec

Before training this model we split the dataset into training and test sets, so that the Word2Vec model never sees the test data.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(posts, labels, shuffle=True, random_state=42)

In [10]:
X_train

66270                                       [working, madly]
428045     [how, can, it, be, this, cold, it, s, june, , ...
1307927    [jojoalexander, ight, i, let, they, lil, white...
1112400    [tweetlater, pro, is, the, way, to, go, for, t...
840793     [quoti, wanna, wake, up, where, you, arequot, ...
                                 ...                        
259178                   [i, didn, t, the, link, was, wrong]
1414414    [tommcfly, yes, , mcfly, twitter, profile, is,...
131932     [sarahftw, i, know, sometimes, i, just, preten...
671155     [cant, believe, you, came, and, asked, me, tha...
121958                                    [back, from, bali]
Name: Post, Length: 1199999, dtype: object

In [32]:
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`
model = Word2Vec(X_train, size=250, window=8, min_count=5, workers=8)

In [33]:
model.wv.most_similar('nice')

[('lovely', 0.6832084655761719),
 ('great', 0.6388952136039734),
 ('cool', 0.5888252854347229),
 ('lush', 0.5723758935928345),
 ('pleasant', 0.5713949203491211),
 ('fantastic', 0.5628350377082825),
 ('fab', 0.562271773815155),
 ('beautiful', 0.5509770512580872),
 ('good', 0.5485318303108215),
 ('wonderful', 0.5428882241249084)]

In [34]:
model.wv.most_similar('bad')

[('terrible', 0.5930500626564026),
 ('horrible', 0.5832663774490356),
 ('shitty', 0.5423679351806641),
 ('good', 0.5274155139923096),
 ('shabby', 0.5255607962608337),
 ('weak', 0.506283700466156),
 ('baaad', 0.49840688705444336),
 ('badly', 0.4821602702140808),
 ('crappy', 0.4775882959365845),
 ('nasty', 0.46078673005104065)]

In [35]:
model.save('models/w2v.model')

In [11]:
model = Word2Vec.load('models/w2v.model')

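Intuitively, I would say this is a reliable representation: the nearest neighbours of both words are mostly near-synonyms, although 'good' also ranks close to 'bad', since antonyms tend to appear in similar contexts. We can go on to translate every post into vectors.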
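
The scores printed by `most_similar` are cosine similarities between word vectors. As a rough illustration of what that ranking does, here is a small sketch with NumPy. The three-dimensional vectors are invented for the example and the `rank_neighbours` helper is hypothetical; the real model uses 250-dimensional learned vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, made up for illustration only.
toy = {
    'nice': np.array([0.9, 0.1, 0.3]),
    'lovely': np.array([0.8, 0.2, 0.4]),
    'bad': np.array([-0.7, 0.6, 0.1]),
}

def rank_neighbours(word, vectors):
    """Return the other words sorted by decreasing cosine similarity."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

neighbours = rank_neighbours('nice', toy)
print(neighbours)
```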

## Converting and Saving Data

We don't know yet how we want to use this dataset: it is not clear whether we will sum the word vectors, average them, or feed them as a sequence to an LSTM network. To keep all options open, I convert every sentence into a matrix (one row per word) and save the matrices as a NumPy array. This is slow, but it lets us reuse the whole dataset later.
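
To make the options concrete, here is a small sketch of how a per-sentence matrix could later be reduced or reshaped. The function names (`sum_pool`, `mean_pool`, `pad_or_truncate`) and the `max_len` parameter are assumptions for illustration; the notebook does not commit to any of them.

```python
import numpy as np

def sum_pool(matrix):
    """Add the word vectors: one fixed-size vector per sentence."""
    return matrix.sum(axis=0)

def mean_pool(matrix):
    """Average the word vectors: one fixed-size vector per sentence."""
    return matrix.mean(axis=0)

def pad_or_truncate(matrix, max_len):
    """Keep the sequence, but force it to max_len rows for batching into an LSTM."""
    out = np.zeros((max_len, matrix.shape[1]), dtype=matrix.dtype)
    n = min(len(matrix), max_len)
    out[:n] = matrix[:n]
    return out

# A stand-in sentence: 3 words, 2 dimensions per word vector.
sentence = np.arange(6, dtype=np.float32).reshape(3, 2)

summed = sum_pool(sentence)
averaged = mean_pool(sentence)
padded = pad_or_truncate(sentence, max_len=5)
```

Because the sum and the mean can both be derived from the stored matrix, saving the full matrices loses nothing compared with saving pooled vectors.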

In [12]:
y_train = (y_train / 4).astype(int)

In [13]:
y_test = (y_test / 4).astype(int)

In [14]:
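def vectorize_sentece(model, sentence, verbose=False):
    # Look up each word's vector; words missing from the vocabulary are skipped.
    errors = 0
    data = []
    for word in sentence:
        try:
            data.append(model.wv[word])
        except KeyError:  # word is not in the Word2Vec vocabulary
            errors += 1
    if verbose:
        print(str(errors) + ' errors occurred during vectorization.')
    return np.array(data)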
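
One detail worth noting: when none of the words in a sentence are in the vocabulary, `np.array([])` has shape `(0,)` rather than `(0, 250)`, which can surprise code that expects two dimensions. Below is a sketch of a variant that always returns a 2-D matrix. The `vectorize_tokens` name and the `dim` parameter are hypothetical, and a plain dict stands in for `model.wv`, on the assumption that the lookup object supports `in` and `[]`.

```python
import numpy as np

def vectorize_tokens(vectors, tokens, dim):
    """Stack the vectors of known tokens; always return shape (n_known, dim)."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        # No known words: keep the second dimension so downstream code can rely on it.
        return np.zeros((0, dim), dtype=np.float32)
    return np.stack(rows).astype(np.float32)

# A dict stands in for the trained word vectors (illustrative values only).
lookup = {
    'back': np.ones(4, dtype=np.float32),
    'from': np.zeros(4, dtype=np.float32),
}

matrix = vectorize_tokens(lookup, ['back', 'from', 'bali'], dim=4)
empty = vectorize_tokens(lookup, ['bali'], dim=4)
```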

In [16]:
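# because of memory problems we only work with a tenth of the data
# dtype=object: sentences differ in length, so this is an array of per-sentence matrices
X_train_vectorized = np.array([vectorize_sentece(model, sentence) for sentence in tqdm(X_train[:int(len(X_train)/10)])], dtype=object)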

In [17]:
np.save(os.path.join(data_path, 'twitter_train_vectors.npy'), X_train_vectorized)

In [19]:
np.save(os.path.join(data_path, 'twitter_train_labels.npy'), y_train[:int(len(y_train)/10)])

In [20]:
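# dtype=object: sentences differ in length, so this is an array of per-sentence matrices
X_test_vectorized = np.array([vectorize_sentece(model, sentence) for sentence in tqdm(X_test[:int(len(X_test)/10)])], dtype=object)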

In [21]:
np.save(os.path.join(data_path, 'twitter_test_vectors.npy'), X_test_vectorized)

In [22]:
np.save(os.path.join(data_path, 'twitter_test_labels.npy'), y_test[:int(len(y_test)/10)])
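
Since the sentences have different lengths, the saved vector files hold object arrays, which NumPy stores by pickling. Reading them back therefore needs `allow_pickle=True`. The sketch below shows the round trip on a tiny made-up array in a temporary directory; the file name `vectors.npy` is illustrative and not one of the files written above.

```python
import os
import tempfile

import numpy as np

# Two "sentences" of different length, 4 dimensions per word (made-up data).
ragged = np.empty(2, dtype=object)
ragged[0] = np.zeros((2, 4), dtype=np.float32)
ragged[1] = np.ones((3, 4), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'vectors.npy')
    np.save(path, ragged)                      # object arrays are pickled inside the .npy file
    loaded = np.load(path, allow_pickle=True)  # pickled content must be allowed explicitly

print(loaded.shape, loaded[0].shape, loaded[1].shape)
```

Only load pickled files from sources you trust, since unpickling can execute arbitrary code.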