# Introduction

In this notebook, we present our training environments, explain the technical issues we faced and how we tackled them, and finally launch a large-scale training of the word embeddings.

# Dataset

We tested different techniques to fetch a sufficiently large dataset of tweets:
 - Use the Twitter streaming API with the Tweepy Python wrapper
 - Use the twitterscraper Python module to scrape tweets directly from Twitter pages
 - Download tweets from different sources available on the web

Our constraints are the following:
 - We need more than 30 million tweets, so the fetch technique must be fast enough
 - We want to limit the number of truncated tweets (i.e. tweets ending with "...", which cuts a sentence short). This is important because training the word2vec embeddings takes the neighbors of each word into account.
 - The language must be English

## Use the Twitter streaming API

This technique proved too slow to be usable in this project. Only one stream can be opened per IP address, and that stream fetches fewer than 5 tweets per second, a rate that also depends heavily on the query words used. We did not need a detailed compute-time estimate to rule it out: these limits are too restrictive.
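For reference, the estimate is nonetheless quick to write down. The sketch below takes the most favourable figures stated above (5 tweets per second, 30 million tweets), so it gives a lower bound; the function name is ours, for illustration only.

```python
def fetch_hours(n_tweets, tweets_per_second):
    """Hours needed to fetch n_tweets at a constant rate."""
    return n_tweets / tweets_per_second / 3600


hours = fetch_hours(30_000_000, 5)
print("{:.0f} h, i.e. about {:.0f} days".format(hours, hours / 24))
```

With a single stream, this is on the order of ten weeks of uninterrupted fetching.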

## Use a Twitter web scraper

This technique is much faster than the previous one. Our script fetches tweets at a stable rate of 20 tweets per second, in batches of 800 tweets. It could be improved by running the fetch and the database writes in parallel, which we expect would improve performance by roughly 50% in the best case.

Still, this technique is too slow. It would take 470 hours of compute time for one worker to fetch the entire dataset, which is more than the time we have for the project. Even with the improvement above and with each of the 3 group members running one worker in parallel, this technique is too expensive in compute time. Moreover, CPU usage is high when running the scraper, which rules out a cloud deployment because of its prohibitive cost.
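The parallel fetch/write improvement mentioned above could look roughly like the following producer-consumer sketch. `fetch_batch` and `write_batch` are hypothetical stand-ins for the scraper call and the database insert; they are not functions from our scripts, and error handling in the writer thread is left out.

```python
import queue
import threading


def run_overlapped(fetch_batch, write_batch, n_batches, max_pending=4):
    """Fetch batches in the main thread while a second thread writes them."""
    pending = queue.Queue(maxsize=max_pending)  # bounds memory if writes lag
    done = object()                             # sentinel marking the end

    def writer():
        while True:
            batch = pending.get()
            if batch is done:
                break
            write_batch(batch)

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        for i in range(n_batches):
            pending.put(fetch_batch(i))  # blocks while the queue is full
    finally:
        pending.put(done)
        thread.join()


# Usage with stand-in functions: 3 batches of 800 fake tweets each.
stored = []
run_overlapped(
    fetch_batch=lambda i: [(i * 800 + k, "tweet text") for k in range(800)],
    write_batch=stored.extend,
    n_batches=3,
)
print(len(stored))
```

The bounded queue keeps the fetcher from running far ahead of the database when writes are the slower side.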

## Download a dataset

This is the most efficient way to fetch tweets that we could afford, although there are still some issues with the quality of the dataset. The entire class spent a lot of time looking for a large enough dataset matching our constraints. We found what we needed at: https://archive.org/details/archiveteam-twitter-stream-2017-11

The light preprocessing we performed on the dataset before inserting it into our MySQL database can be found in the file "decompress_dataset.py".

## Ensuring tweet uniqueness

As mentioned before, we used a MySQL database to store our tweets temporarily. This has four main objectives:
 - Low memory usage
 - Low disk usage
 - Low-cost uniqueness check
 - Reasonably fast data access

The uniqueness of each tweet is checked using its unique tweet id.
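The idea of letting the database enforce uniqueness through the tweet id can be sketched as follows. This is an illustration only: it uses the standard-library `sqlite3` module instead of MySQL so that it runs anywhere, and the table layout and the `sink` function are assumptions, not the contents of mysql_utils.py. The returned counters mirror the ones printed by the `mysql_sink` example later in this notebook.

```python
import sqlite3


def sink(conn, tweets):
    """Insert (id, text) pairs, skipping ids that are already stored.

    Returns (errors, inserted, stream_size).
    """
    errors = inserted = stream_size = 0
    for tweet_id, text in tweets:
        stream_size += 1
        try:
            cur = conn.execute(
                "INSERT OR IGNORE INTO tweets (id, text) VALUES (?, ?)",
                (tweet_id, text),
            )
            inserted += cur.rowcount  # 0 when the id already exists
        except sqlite3.Error:
            errors += 1
    conn.commit()
    return errors, inserted, stream_size


conn = sqlite3.connect(":memory:")
# The primary key on the tweet id is what makes the uniqueness check cheap.
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")

print(sink(conn, [(9619, "first copy"), (9620, "another tweet")]))
print(sink(conn, [(9619, "duplicate id")]))
```

The second call reports one tweet in the stream and none inserted, because the id is already present. MySQL offers a comparable `INSERT IGNORE` statement.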

The only drawback of using a local database is that only the group member who hosts it can access it around the clock. We plan to push the data to a large JSON file on S3 when the dataset is ready.

# Streaming from/to the database

The code to use the MySQL database can be found in mysql_utils.py.

By default, it uses the MySQL database configured on my personal machine.

Print the first 10 tweets:

In [3]:
from mysql_utils import mysql_reader

for tweet in mysql_reader(max=10):
    print(tweet)

(8951, 'looking for some deep cry for help in the song ease on down the road, but not finding it. it truly is a happy song. damn dorothy and toto ruining my fun.')
(9594, 'updating the postmarks project..')
(9606, "my krissy behind it's fine all of the time.")
(9607, "It would be impossible to surf Linda Mar with the short board, but it won't stop teh Stewie!")
(9618, 'wondering when my conversation will be light hearted again...')
(9619, 'Havin a drink at the 500 club in the mission -- to the sound of... oooo the israelites ya')
(9626, 'At the yacht club, talking to the bartenders about Wes')
(9639, 'Just made up the Deadwood Drinking Game. ')
(9644, "Limon in the Mission is tart, cool and refreshing. Like ceviche? You'll like Limon. Try sauvignon blanc with as an accent :)")
(9645, 'Is that a software architect, enterprise architect, or the real-world kind?')


Try to push the tweet with id 9619 again:

In [3]:
from mysql_utils import Tweet, mysql_sink       

tweets = [Tweet(9619, 'Havin a drink at the 500 club in the mission -- to the sound of... oooo the israelites ya')]

errors, inserted, stream_size = mysql_sink(iter(tweets))

print('Errors during insertion: {}'.format(errors))
print('Number of tweets in the input stream: {}'.format(stream_size))
print('Number of effectively inserted tweets: {}'.format(inserted))

Errors during insertion: 0
Number of tweets in the input stream: 1
Number of effectively inserted tweets: 0


# Preprocessing pipeline

To simplify the implementation and modification of a preprocessing pipeline of tweets, we implemented some classes to model this pipeline. The base classes can be found in the file "processing_pipeline.py".

In this section we'll focus on the standardisation part:
 - Tokenization
 - Lemmatisation
 - Stemming

Our preprocessing wrappers are available in "text_preprocessing.py".
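To make the pipeline idea concrete, here is a minimal sketch of how stages can be chained over a stream. It assumes only what the calls in this notebook show, namely that a pipeline takes an input stream and a list of factories, each of which wraps the previous stream. `MiniPipeline` and the two toy stages are hypothetical; the real classes live in "processing_pipeline.py" and may differ.

```python
class MiniPipeline:
    """Chain stream-processing stages, each built from the previous stream."""

    def __init__(self, stream, factories):
        self.stream = stream
        self.factories = factories

    def __iter__(self):
        stream = self.stream
        for factory in self.factories:
            stream = factory(stream)  # each stage wraps the one before it
        return iter(stream)


def lowercase(texts):
    # Hypothetical stage: normalise case lazily, one text at a time.
    return (text.lower() for text in texts)


def whitespace_tokenize(texts):
    # Hypothetical stage: naive tokenizer, only for this illustration.
    return (text.split() for text in texts)


pipeline = MiniPipeline(
    ["Hello World", "Deadwood Drinking Game"],
    [lowercase, whitespace_tokenize],
)
for tokens in pipeline:
    print(tokens)
```

Because every stage is a generator, tweets flow through one at a time and the whole dataset never has to sit in memory.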

In [1]:
from processing_pipeline import Pipeline
from text_preprocessing import TweetTokenizer, NLTKStemmer, NLTKLemmatizer, CorpusWrapper

Using TensorFlow backend.


In [4]:
preprocessor_factories = [
    TweetTokenizer,
    lambda tokens: CorpusWrapper(NLTKStemmer, tokens),
    lambda tokens: CorpusWrapper(NLTKLemmatizer, tokens),
]

test_raw_tweet_stream = mysql_reader(max=10)
test_tweet_text_stream = map(lambda tweet: tweet[1], test_raw_tweet_stream)

pipeline = Pipeline(test_tweet_text_stream, preprocessor_factories)

for val in pipeline:
    print(val)

['look', 'for', 'some', 'deep', 'cri', 'for', 'help', 'in', 'the', 'song', 'eas', 'on', 'down', 'the', 'road', ',', 'but', 'not', 'find', 'it', '.', 'it', 'truli', 'be', 'a', 'happi', 'song', '.', 'damn', 'dorothi', 'and', 'toto', 'ruin', 'my', 'fun', '.']
['updat', 'the', 'postmark', 'project', '..']
['my', 'krissi', 'behind', "it'", 'fine', 'all', 'of', 'the', 'time', '.']
['it', 'would', 'be', 'imposs', 'to', 'surf', 'linda', 'mar', 'with', 'the', 'short', 'board', ',', 'but', 'it', "won't", 'stop', 'teh', 'stewi', '!']
['wonder', 'when', 'my', 'convers', 'will', 'be', 'light', 'heart', 'again', '...']
['havin', 'a', 'drink', 'at', 'the', '500', 'club', 'in', 'the', 'mission', '-', '-', 'to', 'the', 'sound', 'of', '...', 'oooo', 'the', 'israelit', 'ya']
['at', 'the', 'yacht', 'club', ',', 'talk', 'to', 'the', 'bartend', 'about', 'we']
['just', 'make', 'up', 'the', 'deadwood', 'drink', 'game', '.']
['limon', 'in', 'the', 'mission', 'be', 'tart', ',', 'cool', 'and', 'refresh', '.', 'l

# Large scale training

To use the gensim word2vec implementation, we need to provide batches of tokenized sentences. We use batches of 100,000 tweets (the batch_size set in the pipeline below).

In [5]:
from functools import partial                  # Nicer than lambdas
from database_to_json import read_tweet_json   # Nicer than maps
from processing_pipeline import BatchMaker

In [None]:
input_stream = read_tweet_json(max=23563755)

factories = [
    TweetTokenizer,
    partial(CorpusWrapper, NLTKStemmer),
    partial(CorpusWrapper, NLTKLemmatizer),
    partial(BatchMaker, batch_size=100000),
]

batch_pipeline = Pipeline(input_stream, factories)
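The batching stage can be pictured as follows. This is a minimal sketch of the idea, assuming only that the stage groups consecutive tokenized tweets into lists of a fixed size; `make_batches` is a hypothetical stand-in, not the `BatchMaker` class from "processing_pipeline.py".

```python
from itertools import islice


def make_batches(stream, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


tokenized = (["tweet", str(i)] for i in range(7))
for batch in make_batches(tokenized, batch_size=3):
    print(len(batch))
```

The last batch is simply shorter when the stream length is not a multiple of the batch size.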

## Testing the pipeline for hashtags and twitter users

In [11]:
factories = [
    TweetTokenizer,
    partial(CorpusWrapper, NLTKStemmer),
    partial(CorpusWrapper, NLTKLemmatizer),
    partial(BatchMaker, batch_size=100000),
]

test = ['#tata', '@toto']
test_pip = Pipeline(test, factories)

for b in test_pip:
    for w in b:
        print(w)

['#tata']
['@toto']


Conclusion: hashtags and user mentions pass through the pipeline unchanged, with their # and @ prefixes intact.

## Launch the training

In [60]:
from gensim.models import Word2Vec
import datetime

print('Start training at: {}'.format(datetime.datetime.now()))

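model = None
count = 1
for batch in batch_pipeline:
    print('Start batch n°{} at {}'.format(count, datetime.datetime.now()))
    # Materialise the batch once: gensim iterates over the sentences once per
    # epoch, which a one-shot iterator such as iter(batch) cannot support.
    sentences = list(batch)
    if model is None:
        # The first batch builds the vocabulary and trains the initial model.
        model = Word2Vec(sentences, size=300, sg=1, window=1, min_count=1, workers=16)
    else:
        try:
            # train() does not add new words: the vocabulary stays the one
            # built from the first batch.
            model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
        except RuntimeError:
            # Stop the loop if gensim raises a RuntimeError on this batch.
            break
    print('End of batch n°{} at {}'.format(count, datetime.datetime.now()))
    count += 1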

print('End of training at: {}'.format(datetime.datetime.now()))

Start training at: 2018-11-13 08:02:50.244557
Start batch n°1 at 2018-11-13 08:02:50.247117
End of batch n°1 at 2018-11-13 08:05:54.121536
Start batch n°2 at 2018-11-13 08:05:54.121763
End of batch n°2 at 2018-11-13 08:07:58.304526
Start batch n°3 at 2018-11-13 08:07:58.304669
End of batch n°3 at 2018-11-13 08:09:51.643927
Start batch n°4 at 2018-11-13 08:09:51.644050
End of batch n°4 at 2018-11-13 08:11:37.922917
Start batch n°5 at 2018-11-13 08:11:37.923267
End of batch n°5 at 2018-11-13 08:13:27.317980
Start batch n°6 at 2018-11-13 08:13:27.318124
End of batch n°6 at 2018-11-13 08:15:20.384651
Start batch n°7 at 2018-11-13 08:15:20.384844
End of batch n°7 at 2018-11-13 08:16:54.683148
Start batch n°8 at 2018-11-13 08:16:54.683313
End of batch n°8 at 2018-11-13 08:18:31.939057
Start batch n°9 at 2018-11-13 08:18:31.939227
End of batch n°9 at 2018-11-13 08:20:08.517673
Start batch n°10 at 2018-11-13 08:20:08.517798
End of batch n°10 at 2018-11-13 08:21:40.700399
Start batch n°11 at 20

End of batch n°87 at 2018-11-13 10:13:57.548798
Start batch n°88 at 2018-11-13 10:13:57.548958
End of batch n°88 at 2018-11-13 10:15:37.159494
Start batch n°89 at 2018-11-13 10:15:37.159630
End of batch n°89 at 2018-11-13 10:17:10.605119
Start batch n°90 at 2018-11-13 10:17:10.605262
End of batch n°90 at 2018-11-13 10:18:46.589226
Start batch n°91 at 2018-11-13 10:18:46.589376
End of batch n°91 at 2018-11-13 10:20:26.650993
Start batch n°92 at 2018-11-13 10:20:26.651138
End of batch n°92 at 2018-11-13 10:22:06.190088
Start batch n°93 at 2018-11-13 10:22:06.190263
End of batch n°93 at 2018-11-13 10:23:30.716949
Start batch n°94 at 2018-11-13 10:23:30.717065
End of batch n°94 at 2018-11-13 10:24:57.830386
Start batch n°95 at 2018-11-13 10:24:57.830939
End of batch n°95 at 2018-11-13 10:26:28.399595
Start batch n°96 at 2018-11-13 10:26:28.399727
End of batch n°96 at 2018-11-13 10:27:43.892722
Start batch n°97 at 2018-11-13 10:27:43.892837
End of batch n°97 at 2018-11-13 10:29:01.624618
St

End of batch n°172 at 2018-11-13 12:49:19.755835
Start batch n°173 at 2018-11-13 12:49:19.756001
End of batch n°173 at 2018-11-13 12:51:15.533986
Start batch n°174 at 2018-11-13 12:51:15.534117
End of batch n°174 at 2018-11-13 12:53:09.277625
Start batch n°175 at 2018-11-13 12:53:09.278067
End of batch n°175 at 2018-11-13 12:54:59.693679
Start batch n°176 at 2018-11-13 12:54:59.693822
End of batch n°176 at 2018-11-13 12:56:51.764661
Start batch n°177 at 2018-11-13 12:56:51.764790
End of batch n°177 at 2018-11-13 12:59:00.527433
Start batch n°178 at 2018-11-13 12:59:00.527917
End of batch n°178 at 2018-11-13 13:01:06.107741
Start batch n°179 at 2018-11-13 13:01:06.108249
End of batch n°179 at 2018-11-13 13:03:14.184759
Start batch n°180 at 2018-11-13 13:03:14.184943
End of batch n°180 at 2018-11-13 13:05:09.774381
Start batch n°181 at 2018-11-13 13:05:09.774511
End of batch n°181 at 2018-11-13 13:07:07.591642
Start batch n°182 at 2018-11-13 13:07:07.591774
End of batch n°182 at 2018-11-

## Saving the model

In [64]:
import nltk

nltk.download('wordnet')

In [65]:

# Word2Vec.save() opens and writes the file itself, so no manual open() is needed.
model.save('trained_embeddings_23M.model')


## Model loading

In [9]:
from gensim.models import Word2Vec

model = Word2Vec.load('./trained_embeddings_23M.model')

## Test embeddings

In [88]:
print(model.wv.most_similar('sad'))

[('depress', 0.46674293279647827), ('weird', 0.4597640335559845), ('sick', 0.4553745985031128), ('disappoint', 0.45280617475509644), ('frustrat', 0.45277246832847595), ('upset', 0.4420504570007324), ('funni', 0.43968212604522705), ('happi', 0.4352729916572571), ('annoy', 0.4248059093952179), ('excit', 0.4106306731700897)]
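The scores returned by `most_similar` are cosine similarities between word vectors. As a reminder of what that number measures, here is a small pure-Python sketch on made-up low-dimensional vectors (the trained vectors have 300 dimensions); the helper function is ours, not part of gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Scores close to 1 mean the two words appear in very similar contexts; the values of roughly 0.4 to 0.6 seen here indicate related but clearly distinct usage.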




In [89]:
print(model.wv.most_similar('trust'))

[('believ', 0.4070039987564087), ('tell', 0.3571561276912689), ('respect', 0.356683611869812), ('faith', 0.3467700481414795), ('understand', 0.34412968158721924), ('let', 0.3375264108181), ('promis', 0.3068927526473999), ('control', 0.30342984199523926), ('bother', 0.3027229309082031), ('underestim', 0.30046722292900085)]




In [90]:
print(model.wv.most_similar('respect'))

[('appreci', 0.3969097435474396), ('support', 0.3940708637237549), ('confid', 0.36279332637786865), ('trust', 0.3566836714744568), ('employ', 0.3524959683418274), ('faith', 0.3501768112182617), ('digniti', 0.34686940908432007), ('encourag', 0.3448216915130615), ('love', 0.3382894694805145), ('admir', 0.33313828706741333)]




In [91]:
print(model.wv.most_similar('babi'))

[('daddi', 0.4360034763813019), ('mama', 0.4000909626483917), ('girl', 0.3890559673309326), ('kid', 0.3793148696422577), ('princess', 0.370619535446167), ('boy', 0.3642154037952423), ('preciou', 0.35340815782546997), ('son', 0.3482373356819153), ('puppi', 0.3445145785808563), ('babe', 0.3434704840183258)]




In [92]:
print(model.wv.most_similar('incred'))

[('amaz', 0.5798534154891968), ('awesom', 0.5109658241271973), ('unbeliev', 0.46301302313804626), ('insan', 0.4410589933395386), ('extrem', 0.3975587487220764), ('fantast', 0.394329309463501), ('phenomen', 0.390927791595459), ('outstand', 0.36891746520996094), ('aw', 0.36402973532676697), ('brilliant', 0.35805773735046387)]


In [93]:
print(model.wv.most_similar('man'))

[('dude', 0.566826581954956), ('woman', 0.5109279155731201), ('guy', 0.44352054595947266), ('boy', 0.4050363302230835), ('girl', 0.396494060754776), ('brother', 0.37180647253990173), ("man'", 0.3705722987651825), ('bro', 0.370300829410553), ('mother', 0.3333069980144501), ('son', 0.3330507278442383)]




In [94]:
print(model.wv.most_similar('luv'))

[('love', 0.5302918553352356), ('ma', 0.3786517381668091), ('appreci', 0.3643260598182678), ('arrghhhh', 0.3245513439178467), ('<3', 0.32012438774108887), ('pzizz.com/affiliates.asp?id=1995', 0.31416797637939453), ('bro', 0.31183305382728577), ('e-a-g-l-e-', 0.31055912375450134), ('hate', 0.3082127869129181), ('xx', 0.30289795994758606)]




In [95]:
print(model.wv.most_similar('feel'))

[('felt', 0.5829184055328369), ('feelin', 0.40771132707595825), ('smell', 0.3914710283279419), ('sound', 0.376400887966156), ('tast', 0.3494603633880615), ('think', 0.3464612364768982), ('behav', 0.34026259183883667), ('offkey', 0.3203328847885132), ('suitabili', 0.31692323088645935), ('know', 0.3084157109260559)]




In [96]:
print(model.wv.most_similar('car'))

[('truck', 0.49514272809028625), ('vehicl', 0.4850355088710785), ('bike', 0.4163765609264374), ('motorcycl', 0.3982599377632141), ('garag', 0.39300549030303955), ('bu', 0.37754032015800476), ('merced', 0.37616923451423645), ('boat', 0.36503323912620544), ('phone', 0.36238914728164673), ('bathroom', 0.35686442255973816)]


In [97]:
print(model.wv.most_similar('great'))

[('fantast', 0.590543270111084), ('good', 0.5120762586593628), ('fab', 0.5019850134849548), ('brilliant', 0.4965236783027649), ('terrif', 0.47120505571365356), ('awesom', 0.4628204107284546), ('amaz', 0.4508037269115448), ('nice', 0.44725051522254944), ('fabul', 0.44397595524787903), ('excel', 0.4163700342178345)]




In [98]:
print(model.wv.most_similar('wors'))

[('better', 0.582304060459137), ('easier', 0.42270427942276), ('worst', 0.4013630151748657), ('harder', 0.39631474018096924), ('stronger', 0.39296117424964905), ('bigger', 0.3910757303237915), ('hotter', 0.3763754963874817), ('bad', 0.3730286955833435), ('cooler', 0.36886197328567505), ('colder', 0.36526110768318176)]




In [99]:
print(model.wv.most_similar('terribl'))

[('horribl', 0.650357723236084), ('bad', 0.4911038875579834), ('shitti', 0.421306312084198), ('disgust', 0.41420310735702515), ('great', 0.40791767835617065), ('worst', 0.40693503618240356), ('pathet', 0.37771177291870117), ('tragic', 0.36558565497398376), ('aw', 0.36078372597694397), ('sad', 0.3571690320968628)]




## What about emojis?

This part still needs a little work in order to use the emojis' Unicode encoding. For the moment, emojis are handled as plain strings.
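One possible direction, shown here only as a sketch and not as what our pipeline does, is to map each emoji character to a readable token built from its Unicode name. The function name and the `:name:` token format are assumptions. The Unicode category "So" (Symbol, other) used below is broader than emojis, and sequences built with modifiers or joiners are not handled.

```python
import unicodedata


def emoji_to_tokens(text):
    """Split on whitespace, turning 'Symbol, other' characters into :name: tokens."""
    pieces = []
    for char in text:
        if unicodedata.category(char) == "So":
            name = unicodedata.name(char, "unknown_symbol")
            # Surround the token with spaces so it is split from its neighbours.
            pieces.append(" :" + name.lower().replace(" ", "_") + ": ")
        else:
            pieces.append(char)
    return "".join(pieces).split()


print(emoji_to_tokens("so happy \U0001F600"))
```

Tokens built this way could then be trained and queried like any other word, in the same way as the emoticons below.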

In [10]:
print(model.wv.most_similar(':)'))
print(model.wv.most_similar(':('))

[(':-)', 0.6159348487854004), (';)', 0.5726117491722107), ('<3', 0.5300270318984985), (':D', 0.4842647314071655), ('!', 0.474102258682251), (';-)', 0.43480759859085083), ('xx', 0.4305958151817322), ('lol', 0.42179274559020996), (':(', 0.4141804575920105), ('hehe', 0.40834444761276245)]
[(':-(', 0.5189728736877441), (':/', 0.5026484727859497), ('lol', 0.4185192883014679), (':)', 0.4141804575920105), ('ugh', 0.3916919529438019), ('haha', 0.3840927183628082), ('<3', 0.3624846935272217), (':D', 0.346662312746048), (';)', 0.3448978066444397), ('xx', 0.3353324234485626)]




## Download our trained models

You can find the models we trained with around 23M tweets here:

https://mega.nz/#!oUshxYoZ!pbA40Xzi_1kmZ68UhgDzu1rSytn67h6iYW4MKgoJqJQ

https://mega.nz/#!ZN1h0QYC!JqeIN3DhjoRrBmO75Qmg8fI4w4jqvqZNGeXUiMl3I9M

https://mega.nz/#!IB8BXQzC!K-hrA6r-A99b_t3g-qqBQdRK7RZV4rJHrBvkZtBvo2s

## TODO

- [x] Handle @ and #
- [ ] Handle emojis