# GloVe Embedding

In this notebook we build GloVe embeddings for several preprocessing scenarios. \
Data is stored as text files and pickle files, so the following empty folders must be created before running anything else:
- embeddings 
- vocab 
- pptweets
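
As an alternative to creating the folders by hand, a minimal sketch using only the standard library is shown below. The constant name `REQUIRED_FOLDERS` is introduced here for illustration; the folder names come from the list above.

```python
import os

# Folders expected by the rest of the notebook (names taken from the list above)
REQUIRED_FOLDERS = ["embeddings", "vocab", "pptweets"]

for folder in REQUIRED_FOLDERS:
    # exist_ok=True: do nothing if the folder is already there
    os.makedirs(folder, exist_ok=True)
```

Because of `exist_ok=True`, the cell can be re-run safely.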

In [1]:
# Importing all necessary libraries

import numpy as np
import pandas as pd
import pickle as pkl
from scipy.sparse import *
import random

from pp_utils import *
%load_ext autoreload
%autoreload 2

### Creating base vocabulary

To generate the vocabulary, run the following commands in a terminal from the current folder, in this order:
- cat  twitter-datasets/train_pos_full.txt twitter-datasets/train_neg_full.txt | sed "s/ /\n/g" | grep -v "^\s*$" | sort | uniq -c > vocab/built_voc_pp0.txt
- cat vocab/built_voc_pp0.txt | sed "s/^\s\+//g" | sort -rn | grep -v "^[1234]\s" | cut -d' ' -f2 > vocab/vocab_cut_pp0.txt

Alternatively, double-click the shell files in the folder, in the following order:
- **build_vocab.sh**
- **cut_vocab.sh** 
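
For readers without a Unix shell, the two commands can be approximated in Python. This is a sketch, not the notebook's own code: the function name `build_cut_vocab` and its `min_count` parameter are hypothetical. The threshold of 5 mirrors `grep -v "^[1234]\s"`, which drops words seen one to four times.

```python
from collections import Counter

def build_cut_vocab(lines, min_count=5):
    """Count space-separated tokens and keep those seen at least min_count times."""
    counts = Counter()
    for line in lines:
        # Splitting on spaces mirrors the first sed call; empty tokens are
        # skipped, like the grep that removes blank lines
        for token in line.rstrip("\n").split(" "):
            token = token.strip()
            if token:
                counts[token] += 1
    # most_common() orders by decreasing count, like `sort -rn`
    return [word for word, n in counts.most_common() if n >= min_count]
```

The order of words with equal counts may differ from the output of `sort -rn`, which should not matter as long as the vocabulary is only used as a word-to-index lookup.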

Now that we have generated a vocabulary from the tweets, we can use the cooc and glove functions located in the pp_utils.py file. 

In [2]:
# Relocating the tweets txt files in the right format and folder to be used later

tweets = load_tweets()
with open("pptweets/tweets_pp0.txt", "w", encoding = 'utf8') as txt_file:
    for tweet in np.array(tweets['text']) :
        txt_file.write(tweet)

loaded 200000 tweets in dataframe with columns: Index(['text', 'label'], dtype='object')


In [9]:
# Create a vocab pickle stored in the vocab folder 
pickle_vocab(0)

# Uses the vocabulary pickle to build embeddings 
cooc(0)
glove(0)

creating cooccurrence matrix ...
loading cooccurrence matrix ...
8583351 nonzero entries
using nmax = 100 , cooc.max() = 207302
initializing embeddings
epoch 0
epoch 1
epoch 2
epoch 3
epoch 4
epoch 5
epoch 6
epoch 7
epoch 8
epoch 9


The GloVe word embeddings are stored in the embeddings folder. 
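
Since `pp_utils.py` is not shown here, the sketch below is only an assumption about what functions like `cooc` and `glove` might do: count co-occurrences of vocabulary words within a tweet, then fit embeddings by stochastic gradient descent on the weighted GloVe loss. The names `build_cooc` and `train_glove`, the embedding size and the learning rate are hypothetical. `nmax = 100` and the ten epochs match the log above, and `alpha = 0.75` is the value commonly used for GloVe. Bias terms are omitted for brevity.

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_cooc(tweets, vocab):
    """Count how often two vocabulary words appear in the same tweet."""
    rows, cols, data = [], [], []
    for tweet in tweets:
        ids = [vocab[w] for w in tweet.split() if w in vocab]
        for i in ids:
            for j in ids:
                rows.append(i)
                cols.append(j)
                data.append(1)
    n = len(vocab)
    cooc = coo_matrix((data, (rows, cols)), shape=(n, n))
    cooc.sum_duplicates()  # merge repeated (row, col) pairs into one count
    return cooc

def train_glove(cooc, dim=20, nmax=100, alpha=0.75, eta=0.001, epochs=10, seed=0):
    """SGD on the weighted squared error between x_i . y_j and log(count)."""
    rng = np.random.default_rng(seed)
    xs = rng.normal(size=(cooc.shape[0], dim))
    ys = rng.normal(size=(cooc.shape[1], dim))
    for _ in range(epochs):
        for i, j, n in zip(cooc.row, cooc.col, cooc.data):
            weight = min(1.0, (n / nmax) ** alpha)  # caps the pull of frequent pairs
            err = np.log(n) - xs[i] @ ys[j]
            scale = 2 * eta * weight * err
            x_old = xs[i].copy()  # keep the old value for the second update
            xs[i] += scale * ys[j]
            ys[j] += scale * x_old
    return xs
```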

### Preprocessing 

We implemented three preprocessing tasks:
- (1) The creation of tokens representing numerical and/or textual patterns such as emoticons, word elongation, numbers, repeating punctuation. 
- (2) Hashtag processing, both using a \<hashtag\> token to quantify the use of hashtags and splitting hashtags into known words in the vocabulary. 
- (3) Replacing stopwords by a \<stopword\> token. 

We decided to test four combinations of these preprocessing tasks:
- 0 No preprocessing 
- 1 Tokenization (1)
- 2 Tokenization and Hashtag Split (1) and (2)
- 3 Tokenization and Stop Words (1) and (3)
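
The four scenarios can be summarised as a lookup table. The table `PP_OPTIONS` is hypothetical; its flag names mirror the keyword arguments passed to `preprocess_tweet` in this notebook. It lists the cumulative steps of each scenario and assumes nothing about whether a single call with several flags behaves like the chained calls used in the cells.

```python
# Hypothetical lookup table: scenario number -> cumulative preprocessing steps
PP_OPTIONS = {
    0: dict(tokenize=False, split_hashtags=False, remove_stopwords=False),
    1: dict(tokenize=True, split_hashtags=False, remove_stopwords=False),
    2: dict(tokenize=True, split_hashtags=True, remove_stopwords=False),
    3: dict(tokenize=True, split_hashtags=False, remove_stopwords=True),
}
```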

##### *We assumed that testing our preprocessing on the reduced dataset would be sufficient to evaluate efficiency.*

In [10]:
# Importing preprocessing functions 
from preprocessing import *

# Loading tweets in dataframes 
tweets_pp1 = create_df(100000, 100000)
tweets_pp2 = create_df(100000, 100000)
tweets_pp3 = create_df(100000, 100000)

#### Make tokens 

The following tokens are created : 
- \<elong\> \<repeat\> \<number\> 
- \<heart\> \<smiling\> \<tongue\> \<angrysad\> \<skeptical\> \<kissing\> \<brokenheart\> \<surprised\> 

The textual patterns for emoticons were found on https://en.wikipedia.org/wiki/List_of_emoticons \
These patterns are stored as text files in the Emoticon folder. 
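
To make the token creation concrete, here is a small regex-based sketch. It is not the code in `preprocessing.py`: the function name `make_tokens`, the four-entry `EMOTICONS` table and the exact regular expressions are assumptions, and it expects lowercase text.

```python
import re

# Tiny illustrative subset; the notebook's real patterns live in the Emoticon folder
EMOTICONS = {"<3": "<heart>", ":)": "<smiling>", ":p": "<tongue>", ":(": "<angrysad>"}

def make_tokens(text):
    # 1. Emoticons first, so their punctuation and digits are not touched later
    for pattern, token in EMOTICONS.items():
        text = text.replace(pattern, f" {token} ")
    # 2. Numbers such as 100 or 3.5
    text = re.sub(r"\d+(?:[.,]\d+)*", "<number>", text)
    # 3. Repeated punctuation: "!!!" becomes "! <repeat>"
    text = re.sub(r"([!?.])\1+", r"\1 <repeat>", text)
    # 4. Elongated words: a letter repeated three or more times is collapsed to one
    text = re.sub(r"\b([a-z]*?)([a-z])\2{2,}([a-z]*)\b", r"\1\2\3 <elong>", text)
    # Normalise the whitespace introduced by the replacements
    return " ".join(text.split())
```

Collapsing a run to a single letter is lossy ("coooool" becomes "col"), which is one reason the `<elong>` marker is kept next to the word.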

In [19]:
# First Preprocessing Option 
tweets_pp1.text = tweets.text.apply(lambda x: preprocess_tweet(x,  tokenize=True, split_hashtags=False, remove_stopwords=False))

#### Extracting words from hashtag

Splitting hashtags is difficult and uncertain: there is no way to be sure that a hashtag is split correctly. 

To get better results, we built a list of all the words used more than a hundred times in the reduced dataset. The splitting function tries to extract the longest possible words from each hashtag; restricting the candidates to this list of frequent words keeps the split from returning garbage. In addition, we do not return the split if it contains only one- or two-letter words.

The topword list is created with scikit-learn's CountVectorizer and stored in a txt file for later use. 
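
The greedy longest-match idea can be sketched as follows. The function `split_hashtag` and its `min_len` parameter are hypothetical, and the rejection rule reflects one reading of the text: a split is discarded when every word in it has fewer than three letters.

```python
def split_hashtag(hashtag, topwords, min_len=3):
    """Greedily split a hashtag into known words; return None if it cannot be split."""
    body = hashtag.lstrip("#").lower()
    words, start = [], 0
    while start < len(body):
        # Try the longest candidate first, then shrink it one character at a time
        for end in range(len(body), start, -1):
            if body[start:end] in topwords:
                words.append(body[start:end])
                start = end
                break
        else:
            return None  # no known word starts here: leave the hashtag unsplit
    # Reject splits made only of one- or two-letter words
    if not words or all(len(w) < min_len for w in words):
        return None
    return words
```

This version never backtracks, so an early greedy match can block a split that another segmentation would have found; that trade-off keeps the function simple and fast.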

In [21]:
# Create topwords list from tokenized tweets
create_topwords(tweets, 100)

In [22]:
# Second Preprocessing Option 
tweets_pp2.text = tweets_pp1.text.apply(lambda x: preprocess_tweet(x,  tokenize=False, split_hashtags=True, remove_stopwords=False))

#### Removing stopwords 

We used the stopword list available at https://www.ranks.nl/stopwords, from which we removed all words expressing a negation in order to limit the loss of meaning. It is located in the PpreprocessingFiles folder. 
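
A minimal sketch of this step, assuming each stopword is replaced by a `<stopword>` token while negations are left untouched. The `NEGATIONS` set and the function name `replace_stopwords` are illustrative and are not the notebook's actual list or code.

```python
NEGATIONS = {"no", "not", "nor", "never"}  # illustrative, not the notebook's list

def replace_stopwords(text, stopwords, token="<stopword>"):
    """Replace every stopword except negations by a placeholder token."""
    kept = set(stopwords) - NEGATIONS
    return " ".join(token if word in kept else word for word in text.split())
```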

In [23]:
# Third Preprocessing Option 
tweets_pp3.text = tweets_pp1.text.apply(lambda x: preprocess_tweet(x,  tokenize=False, split_hashtags=False, remove_stopwords=True))

#### Saving the preprocessed tweets for later use 

We save the tweets in a folder we'll use later to train our models. 

In [25]:
with open("pptweets/tweets_pp1.txt", "w", encoding = 'utf8') as txt_file:
    for tweet in np.array(tweets_pp1['text']) :
        txt_file.write(tweet)
        
with open("pptweets/tweets_pp2.txt", "w", encoding = 'utf8') as txt_file:
    for tweet in np.array(tweets_pp2['text']) :
        txt_file.write(tweet)
        
with open("pptweets/tweets_pp3.txt", "w", encoding = 'utf8') as txt_file:
    for tweet in np.array(tweets_pp3['text']) :
        txt_file.write(tweet)
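
The three blocks above differ only in the file name, so they could be folded into one helper. The sketch below uses a hypothetical `save_tweets` function. It also appends a newline when a tweet lacks one, since whether the loaded tweets already end with a newline depends on how they were read, which is not shown here.

```python
def save_tweets(texts, path):
    """Write one tweet per line, adding a newline only when it is missing."""
    with open(path, "w", encoding="utf8") as txt_file:
        for tweet in texts:
            txt_file.write(tweet if tweet.endswith("\n") else tweet + "\n")
```

A call such as `save_tweets(tweets_pp1["text"], "pptweets/tweets_pp1.txt")` would then replace each block.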

### Creating preprocessing vocabulary 

Preprocessing introduces new tokens, and therefore new words that the first vocabulary does not contain. This is why we need to compute a new vocabulary from the preprocessed tweets before creating new embeddings. We first need to apply the preprocessing to the full dataset, which can take quite a long time. We already placed the fully preprocessed tweets in the vocab folder. 

The preprocessing on the full data set is done by executing the following cell :

In [None]:
# WARNING : TAKES 40 min - already computed for convenience 
# Preprocessing on the full dataset 

tweets_full = load_tweets(full = True)
tweets_pp1_full = create_df(1250000, 1250000)
tweets_pp1_full.text = tweets_pp1_full.text.apply(lambda x: preprocess_tweet(x,  tokenize=False, split_hashtags=True, remove_stopwords=False))

with open("pptweets/tweets_pp1_full.txt", "w", encoding = 'utf8') as txt_file:
    for tweet in np.array(tweets_pp1_full['text']) :
        txt_file.write(tweet)

We start by creating a vocabulary from the tokenized tweets using the following shell commands: 
- cat pptweets/tweets_pp1_full.txt | sed "s/ /\n/g" | grep -v "^\s*$" | sort | uniq -c > vocab/built_voc_pp.txt
- cat vocab/built_voc_pp.txt | sed "s/^\s\+//g" | sort -rn | grep -v "^[1234]\s" | cut -d' ' -f2 > vocab/vocab_cut_pp1.txt

Alternatively, double-click the shell files in the folder, in the following order:
- **build_vocab_pp.sh**
- **cut_vocab_pp.sh**

The vocabulary is stored as vocab_cut_pp1.txt in the vocab folder. \
This operation is time consuming as it uses the full dataset, but we don't need to do it all over again for the other datasets. 

In [None]:
# Transform the txt file into a pickle 
pickle_vocab(1)
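
The body of `pickle_vocab` lives in `pp_utils.py` and is not shown. A plausible sketch, under the assumption that it maps each word of the cut vocabulary file to its line index and pickles the resulting dictionary, is given below. The name `pickle_vocab_sketch` and both path parameters are hypothetical.

```python
import pickle

def pickle_vocab_sketch(txt_path, pkl_path):
    """Map each word of a one-word-per-line file to its line index and pickle it."""
    vocab = {}
    with open(txt_path, encoding="utf8") as f:
        for idx, line in enumerate(f):
            vocab[line.strip()] = idx
    with open(pkl_path, "wb") as f:
        pickle.dump(vocab, f, pickle.HIGHEST_PROTOCOL)
    return vocab
```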

#### Hashtag split 

The words inside the hashtags already exist in the previous vocabulary, so we can reuse the vocabulary we just computed and only need to add a \<hashtag\> token. 

#### Stopwords 

Removing stopwords from the tweets makes all stopwords disappear from the vocabulary. In the same way, we can reuse the previous vocabulary: we add a \<stopword\> token and remove all stopwords from the file. 
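
Both derivations can be sketched in a few lines. The function `derive_vocabs` is hypothetical and only illustrates the idea; the notebook's own `create_vocab_pp2` and `create_vocab_pp3` are not shown and may differ, for example in where the new token is placed.

```python
def derive_vocabs(base_words, stopwords):
    """Return (hashtag-split vocabulary, stopword vocabulary) built from the tokenized one."""
    # Hashtag split: same words, plus the <hashtag> token
    vocab_pp2 = list(base_words)
    if "<hashtag>" not in vocab_pp2:
        vocab_pp2.append("<hashtag>")
    # Stopwords: drop every stopword, then add the <stopword> token
    vocab_pp3 = [w for w in base_words if w not in stopwords]
    if "<stopword>" not in vocab_pp3:
        vocab_pp3.append("<stopword>")
    return vocab_pp2, vocab_pp3
```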

In [None]:
# Creating a cut vocabulary for the second and third preprocessing options
create_vocab_pp2()
create_vocab_pp3()

## Create GloVe embeddings 

Now that we have both the vocabularies and the different preprocessed tweet sets, we can compute the word embeddings for each case. 
The embeddings are saved in the embeddings folder. 

In [None]:
# Create embeddings for the unprocessed tweets 
glove(0)

In [3]:
# Creating embeddings for the three preprocessing scenarios 
glove(1)
glove(2)
glove(3)