# Introduction

In this notebook we present our training environment, explain the technical issues we faced and how we tackled them, and finally launch a large-scale training of the word embeddings.

# Dataset

We tested different techniques to fetch a sufficiently large dataset of tweets:
 - Use the Twitter streaming API with the Tweepy Python wrapper
 - Use the twitterscraper Python module to scrape tweets directly from Twitter pages
 - Download tweets from different sources available on the web

Our constraints are the following:
 - We need more than 30 million tweets, so the fetching technique must be fast enough
 - We want to limit the number of truncated tweets (i.e. tweets ending with "..." in the middle of a sentence). This is important because training the word2vec embeddings takes the neighbors of each word into account.
 - The language must be English
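
The truncation constraint can be illustrated with a small filter. This is only a sketch: the `looks_truncated` helper is hypothetical (it is not part of our scripts), and it assumes that a truncated tweet ends with three dots or with the single ellipsis character. A heuristic like this will also reject complete tweets that simply end with an ellipsis.

```python
def looks_truncated(text):
    """Heuristic: True when the tweet text ends with an ellipsis marker."""
    stripped = text.rstrip()
    return stripped.endswith('...') or stripped.endswith('\u2026')


samples = [
    'updating the postmarks project..',
    'This sentence was cut in the mid...',
    'A complete sentence.',
]

# Keep only the tweets that do not look truncated.
kept = [text for text in samples if not looks_truncated(text)]
print(kept)
```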

## Use the Twitter streaming API

This technique proved too slow to be usable in this project. Only one stream can be opened per IP address, and that stream fetches fewer than 5 tweets per second, a rate that depends heavily on the query words used. There is no need to compute the total fetch time: these limits rule the approach out.

## Use a Twitter web scraper

This technique is much faster than the previous one. Our script fetches tweets at a stable rate of 20 tweets per second, in batches of 800 tweets. It could be improved by performing the fetching and the database writes in parallel, which we expect would improve performance by roughly 50% in the best case.
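
The parallel variant mentioned above could look like the following producer/consumer sketch. It is not our actual script: `fetch_batches` is a hypothetical stand-in for the scraper, and a plain list stands in for the database table. The only point is that the next batch can be fetched while the previous one is being written.

```python
import queue
import threading


def fetch_batches(n_batches, batch_size):
    """Hypothetical stand-in for the scraper: yields batches of (id, text)."""
    for b in range(n_batches):
        start = b * batch_size
        yield [(start + i, 'tweet {}'.format(start + i)) for i in range(batch_size)]


def run_parallel(n_batches=3, batch_size=800):
    """Fetch in the main thread while a worker thread performs the writes."""
    pending = queue.Queue(maxsize=2)   # bounded, so fetching cannot run far ahead
    stored = []                        # stand-in for the database table

    def writer():
        while True:
            batch = pending.get()
            if batch is None:          # sentinel: no more batches
                break
            stored.extend(batch)       # a real worker would INSERT here

    worker = threading.Thread(target=writer)
    worker.start()
    for batch in fetch_batches(n_batches, batch_size):
        pending.put(batch)
    pending.put(None)                  # tell the writer to stop
    worker.join()
    return stored


stored = run_parallel()
print(len(stored))
```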

Still, this technique is too slow. It would take 470 hours of compute time for one worker to fetch the entire dataset, which is more than the time we have for the project. Even with the improvement above, and with each of the 3 group members running one worker in parallel, this technique is too expensive in compute time. Moreover, the scraper's CPU usage is high, which rules out a cloud deployment because of the prohibitive cost.
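
The figures above can be checked with a quick calculation. The sketch assumes that the 50% improvement means 1.5 times the throughput and that three workers scale linearly; both are best-case assumptions rather than measurements.

```python
RATE = 20                   # tweets per second, the measured scraper rate
SINGLE_WORKER_HOURS = 470   # compute time quoted in the text

# Number of tweets fetched by one worker in 470 hours at 20 tweets per second.
tweets_needed = RATE * 3600 * SINGLE_WORKER_HOURS

# Best case: 1.5x throughput from parallel writes, shared by three workers.
best_case_hours = SINGLE_WORKER_HOURS / 1.5 / 3

print(tweets_needed)
print(round(best_case_hours, 1))
```

Even under these optimistic assumptions, the fetch would still take more than 100 hours of wall-clock time on three machines.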

## Download a dataset

This is the most efficient way to fetch tweets that we could afford. There are still some issues with the quality of the dataset, and the entire class spent a lot of time looking for a large enough dataset matching our constraints. We found what we needed at: https://archive.org/details/archiveteam-twitter-stream-2017-11

The light preprocessing we performed on the dataset before inserting it into our MySQL database can be found in the file "decompress_dataset.py".

## Ensuring tweet uniqueness

As mentioned before, we used a MySQL database to store our tweets temporarily. This has four main objectives:
 - Low memory usage
 - Low disk usage
 - Low-cost uniqueness check
 - Reasonably fast data access

The uniqueness of each tweet is checked using its unique tweet id.
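
One way to get this check almost for free is to make the tweet id the primary key and let the database reject duplicates. The sketch below uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs anywhere; the table layout and the `insert_tweets` helper are assumptions, not the code in mysql_utils.py. In MySQL, the comparable statement would be `INSERT IGNORE`.

```python
import sqlite3


def insert_tweets(conn, tweets):
    """Insert (id, text) pairs, skipping ids already stored.

    Returns (inserted, stream_size).
    """
    inserted = 0
    stream_size = 0
    for tweet_id, text in tweets:
        stream_size += 1
        cursor = conn.execute(
            'INSERT OR IGNORE INTO tweets (id, text) VALUES (?, ?)',
            (tweet_id, text),
        )
        inserted += cursor.rowcount   # 1 if the row was new, 0 if it was ignored
    conn.commit()
    return inserted, stream_size


conn = sqlite3.connect(':memory:')
# The primary key on the tweet id is what enforces uniqueness.
conn.execute('CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)')
print(insert_tweets(conn, [(9619, 'first copy'), (9626, 'another tweet')]))
print(insert_tweets(conn, [(9619, 'duplicate id')]))
```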

The only drawback of using a local database is that only the group member who hosts it can access it around the clock. We plan to push the data to a large JSON file on S3 when the dataset is ready.

# Streaming from/to the database

The code to use the MySQL database can be found in mysql_utils.py.

By default it uses the MySQL database configured on my personal machine.

Print the first 10 tweets:

In [2]:
from mysql_utils import mysql_reader

for tweet in mysql_reader(max=10):
    print(tweet)

(8951, 'looking for some deep cry for help in the song ease on down the road, but not finding it. it truly is a happy song. damn dorothy and toto ruining my fun.')
(9594, 'updating the postmarks project..')
(9606, "my krissy behind it's fine all of the time.")
(9607, "It would be impossible to surf Linda Mar with the short board, but it won't stop teh Stewie!")
(9618, 'wondering when my conversation will be light hearted again...')
(9619, 'Havin a drink at the 500 club in the mission -- to the sound of... oooo the israelites ya')
(9626, 'At the yacht club, talking to the bartenders about Wes')
(9639, 'Just made up the Deadwood Drinking Game. ')
(9644, "Limon in the Mission is tart, cool and refreshing. Like ceviche? You'll like Limon. Try sauvignon blanc with as an accent :)")
(9645, 'Is that a software architect, enterprise architect, or the real-world kind?')


Try to push the tweet with id 9619 again:

In [4]:
from mysql_utils import Tweet, mysql_sink       

tweets = [Tweet(9619, 'Havin a drink at the 500 club in the mission -- to the sound of... oooo the israelites ya')]

errors, inserted, stream_size = mysql_sink(iter(tweets))

print('Errors during insertion: {}'.format(errors))
print('Number of tweets in the input stream: {}'.format(stream_size))
print('Number of effectively inserted tweets: {}'.format(inserted))

Errors during insertion: 0
Number of tweets in the input stream: 1
Number of effectively inserted tweets: 0


# Preprocessing pipeline

To make the tweet preprocessing pipeline easier to implement and modify, we implemented some classes to model it. The base classes can be found in the file "processing_pipeline.py".
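
The general idea of such a pipeline can be sketched as follows. This is not the code from "processing_pipeline.py": `SketchPipeline` and the two stages are hypothetical, and the sketch only assumes that each stage is a factory taking the previous stream and returning a new iterable.

```python
class SketchPipeline:
    """Chains stage factories: each factory wraps the iterable produced so far."""

    def __init__(self, source, factories):
        stream = source
        for factory in factories:
            stream = factory(stream)   # each stage consumes the previous one lazily
        self._stream = stream

    def __iter__(self):
        return iter(self._stream)


def lowercase(texts):
    """Hypothetical stage: lowercases each text."""
    return (text.lower() for text in texts)


def whitespace_tokenize(texts):
    """Hypothetical stage: splits each text on whitespace."""
    return (text.split() for text in texts)


pipeline = SketchPipeline(['Hello World', 'Deadwood Drinking Game'],
                          [lowercase, whitespace_tokenize])
print(list(pipeline))
```

Because every stage is lazy, tweets flow through the whole chain one at a time and the full dataset never has to be held in memory.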

In this section we'll focus on the standardisation part:
 - Tokenization
 - Lemmatisation
 - Stemming

Our preprocessing wrappers are available in "text_preprocessing.py".

In [5]:
from processing_pipeline import Pipeline
from text_preprocessing import TweetTokenizer, NLTKStemmer, NLTKLemmatizer, CorpusWrapper

preprocessor_factories = [
    TweetTokenizer,
    lambda tokens: CorpusWrapper(NLTKStemmer, tokens),
    lambda tokens: CorpusWrapper(NLTKLemmatizer, tokens),
]

test_raw_tweet_stream = mysql_reader(max=10)
test_tweet_text_stream = map(lambda tweet: tweet[1], test_raw_tweet_stream)

pipeline = Pipeline(test_tweet_text_stream, preprocessor_factories)

for val in pipeline:
    print(val)

Using TensorFlow backend.


['look', 'for', 'some', 'deep', 'cri', 'for', 'help', 'in', 'the', 'song', 'eas', 'on', 'down', 'the', 'road', ',', 'but', 'not', 'find', 'it', '.', 'it', 'truli', 'be', 'a', 'happi', 'song', '.', 'damn', 'dorothi', 'and', 'toto', 'ruin', 'my', 'fun', '.']
['updat', 'the', 'postmark', 'project', '..']
['my', 'krissi', 'behind', "it'", 'fine', 'all', 'of', 'the', 'time', '.']
['it', 'would', 'be', 'imposs', 'to', 'surf', 'linda', 'mar', 'with', 'the', 'short', 'board', ',', 'but', 'it', "won't", 'stop', 'teh', 'stewi', '!']
['wonder', 'when', 'my', 'convers', 'will', 'be', 'light', 'heart', 'again', '...']
['havin', 'a', 'drink', 'at', 'the', '500', 'club', 'in', 'the', 'mission', '-', '-', 'to', 'the', 'sound', 'of', '...', 'oooo', 'the', 'israelit', 'ya']
['at', 'the', 'yacht', 'club', ',', 'talk', 'to', 'the', 'bartend', 'about', 'we']
['just', 'make', 'up', 'the', 'deadwood', 'drink', 'game', '.']
['limon', 'in', 'the', 'mission', 'be', 'tart', ',', 'cool', 'and', 'refresh', '.', 'l

# Large scale training

In order to use the gensim word2vec implementation, we need to provide batches of tokenized sentences. We use batches of 10000 tweets.
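
Batching a stream can be done with a small generator. The `make_batches` function below is a hypothetical sketch of the idea, not the `BatchMaker` class from "processing_pipeline.py". It groups any iterable into lists of at most `batch_size` items.

```python
from itertools import islice


def make_batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:      # the input stream is exhausted
            return
        yield batch


# Seven tiny "tokenized tweets", grouped in batches of three.
tokenized = (['tweet', str(i)] for i in range(7))
print([len(batch) for batch in make_batches(tokenized, 3)])
```

Each batch is a real list, so its length is known and it can be iterated several times, which is what a training call with several epochs needs.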

In [13]:
from functools import partial                  # Nicer than lambdas
from mysql_utils import MysqlTweetTextGetter   # Nicer than maps
from processing_pipeline import BatchMaker

input_stream = mysql_reader(max=30000)

factories = [
    MysqlTweetTextGetter,
    TweetTokenizer,
    partial(CorpusWrapper, NLTKStemmer),
    partial(CorpusWrapper, NLTKLemmatizer),
    BatchMaker,
]

batch_pipeline = Pipeline(input_stream, factories)

In [14]:
from gensim.models import Word2Vec
import datetime

print('Start training at: {}'.format(datetime.datetime.now()))

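model = None
for count, batch in enumerate(batch_pipeline, start=1):
    print('Start batch n°{}'.format(count))
    if model is None:
        # The first batch builds the vocabulary and runs an initial training pass.
        # Note: gensim >= 4.0 renames `size` to `vector_size`.
        model = Word2Vec(batch, size=300, sg=1, window=1, min_count=1)
    else:
        # Add the words first seen in this batch to the vocabulary,
        # otherwise train() silently ignores them.
        model.build_vocab(batch, update=True)
        model.train(batch, total_examples=len(batch), epochs=model.epochs)
    print('End of batch n°{} at {}'.format(count, datetime.datetime.now()))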

print('End of training at: {}'.format(datetime.datetime.now()))

End of batch n°2 at 2018-11-01 15:00:36.784892
Start batch n°3
End of batch n°3 at 2018-11-01 15:00:52.441293
Start batch n°4
End of batch n°4 at 2018-11-01 15:00:57.936902
End of training at: 2018-11-01 15:00:57.937164
