In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

Fill in your own keys to access the Twitter API.

In [2]:
import tweepy as tw

# define keys (placeholders: substitute your own credentials and never publish real ones)
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'
# authenticate and create api object
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
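
Rather than typing the keys into the notebook, they could be read from the environment. The following is only a sketch: the environment variable names and the `load_twitter_credentials` helper are made up for illustration and are not part of tweepy.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned values would then be passed to `tw.OAuthHandler` and `auth.set_access_token` in place of string literals.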

To get a first look at the text data and to introduce *tokenization* and *data cleaning*, we pass the [`tweepy.API.user_timeline`](https://docs.tweepy.org/en/v3.5.0/api.html#API.user_timeline) method to a [`tweepy.Cursor`](https://docs.tweepy.org/en/v3.4.0/cursor_tutorial.html#cursor-tutorial) object, which iterates through a specific user's timeline. Here we use one [Twitter user](https://twitter.com/indykaila) who tweets football news, excluding replies and retweets.

A brief cleaning step is applied: it removes *links* and some *punctuation* from each tweet (= `status.full_text`).

Let's iterate through the first page of the user's timeline and, for each status, print the *tweet*, the *tokenized tweet before cleaning*, and the *tokenized tweet after cleaning*:

> **Note**: all texts are lowercased before they are tokenized.

In [3]:
import re

# iterate through the first page of the user's timeline with a cursor
for page in tw.Cursor(api.user_timeline,
                      id="indykaila", exclude_replies=True,
                      include_rts=False, tweet_mode='extended').pages(1):
    for status in page:
        # print the raw tweet, then its tokens before any cleaning
        print(status.full_text)
        print(status.full_text.lower().split())
        # replacing links with an empty string
        link_removed = re.sub(r'\bhttps:\S+', '', status.full_text.lower())
        # replacing a punctuation mark plus the non-digit character after it
        # with a space (so a score like '2-1' is left alone)
        punc_link_removed = re.sub(r'[–,-.!":]\D', ' ', link_removed)
        print(punc_link_removed.split(), '\n---')

Statement: Chelsea FC is disgusted with posts on social media this evening targeting West Bromwich Albion player Callum Robinson. #CFC
['statement:', 'chelsea', 'fc', 'is', 'disgusted', 'with', 'posts', 'on', 'social', 'media', 'this', 'evening', 'targeting', 'west', 'bromwich', 'albion', 'player', 'callum', 'robinson.', '#cfc']
['statement', 'chelsea', 'fc', 'is', 'disgusted', 'with', 'posts', 'on', 'social', 'media', 'this', 'evening', 'targeting', 'west', 'bromwich', 'albion', 'player', 'callum', 'robinson', '#cfc'] 
---
David Luiz could need surgery #AFC
['david', 'luiz', 'could', 'need', 'surgery', '#afc']
['david', 'luiz', 'could', 'need', 'surgery', '#afc'] 
---
Confirmed: Liverpool will meet with Erling Haaland's representatives on Friday #LFC
['confirmed:', 'liverpool', 'will', 'meet', 'with', 'erling', "haaland's", 'representatives', 'on', 'friday', '#lfc']
['confirmed', 'liverpool', 'will', 'meet', 'with', 'erling', "haaland's", 'representatives', 'on', 'friday', '#lfc'] 
--
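
To see exactly what the cleaning step does, here is the same logic wrapped in a function and run on made-up strings, so no API call is needed. The second function is a variant that is *not* the notebook's code: it swaps `\D` for a negative lookahead, which is one possible way to avoid swallowing the character after the punctuation mark and to catch punctuation at the very end of a tweet.

```python
import re


def clean_and_tokenize(text):
    """Lowercase, drop https links, blank out some punctuation, then split."""
    link_removed = re.sub(r'\bhttps:\S+', '', text.lower())
    # Same pattern as in the cell above: the punctuation mark *and* the
    # following non-digit character are replaced by a single space.
    punc_link_removed = re.sub(r'[–,-.!":]\D', ' ', link_removed)
    return punc_link_removed.split()


def clean_and_tokenize_lookahead(text):
    """Variant (an assumption, not the notebook's code): the lookahead only
    inspects the next character, so that character is never consumed."""
    link_removed = re.sub(r'\bhttps:\S+', '', text.lower())
    punc_link_removed = re.sub(r'[–,\-.!":](?!\d)', ' ', link_removed)
    return punc_link_removed.split()


sample = "Confirmed: Liverpool won 2-1, says club. https://t.co/abc"
print(clean_and_tokenize(sample))
print(clean_and_tokenize_lookahead(sample))

# The two versions differ when punctuation is glued to the next word
# or sits at the end of the text.
tricky = "goal,ronaldo scores."
print(clean_and_tokenize(tricky))
print(clean_and_tokenize_lookahead(tricky))
```

On typical tweets, where punctuation is followed by a space, both versions give the same tokens; they only disagree on cases like the second string.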

# Corpora

We define a class to *stream* the corpus, that is, to iterate through it with a generator instead of loading it all at once. Streaming keeps the corpus out of RAM and makes the program memory-friendly. The [**Gensim**](https://radimrehurek.com/gensim/index.html) Python library works with such corpus classes and lets you create documents on the fly.

Below is a simple, special-purpose class. For now it only uses the `tweepy.API.user_timeline` method. Other [API wrapper methods](https://docs.tweepy.org/en/v3.4.0/api.html#tweepy-api-twitter-api-wrapper) could be added for other uses by passing them to the `tweepy.Cursor` object, which also controls the method's parameters (like `id` for `tweepy.API.user_timeline`).

In [4]:
class MyTexts:
    """Stream cleaned, tokenized tweets from a specific user's timeline.

    Iterating over an instance yields each tweet as a list of tokens,
    with links and some punctuation removed."""
    def __init__(self, pagination_num=3):
        self.pagination_num = pagination_num

    def __iter__(self):
        # build the cursor here, so every iteration starts from the first page
        cursor = tw.Cursor(api.user_timeline, id="indykaila",
                           exclude_replies=True, include_rts=False,
                           tweet_mode='extended').pages(self.pagination_num)
        for page in cursor:
            for status in page:
                # cleaning: removing links and some punctuation
                link_removed = re.sub(r'\bhttps:\S+', '', status.full_text.lower())
                punc_link_removed = re.sub(r'[–,-.!":]\D', ' ', link_removed)
                yield punc_link_removed.split()

In [5]:
iter(MyTexts())

<generator object MyTexts.__iter__ at 0x7fe93084ea20>
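
The difference between a streamable corpus class and a bare generator is easy to see offline. In this sketch, `ListCorpus` is a hypothetical stand-in that reads from an in-memory list instead of the Twitter API: a class whose `__iter__` builds a fresh generator can be traversed repeatedly, while a plain generator is exhausted after one pass. That matters because some models make more than one pass over their input.

```python
class ListCorpus:
    """Hypothetical stand-in for a streamed corpus: tokenizes one document
    at a time from an in-memory list instead of calling an API."""

    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        # A fresh generator is created on every call to iter(),
        # so the corpus can be traversed more than once.
        for document in self.documents:
            yield document.lower().split()


corpus = ListCorpus(["David Luiz could need surgery #AFC",
                     "Breaking news #LFC"])
first_pass = list(corpus)
second_pass = list(corpus)      # same documents again

one_shot = (doc.lower().split() for doc in corpus.documents)
first = list(one_shot)
second = list(one_shot)         # empty: the generator is already exhausted

print(first_pass == second_pass, first == second)
```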

> **Collect statistics about corpus and preprocess it**

We can collect statistics about all tokens by building a **dictionary**, an instance of the [`gensim.corpora.Dictionary` class](https://radimrehurek.com/gensim/corpora/dictionary.html#module-gensim.corpora.dictionary). To *preprocess* the corpus, we remove *stop words* and *once words* from the dictionary. Stop words are words that occur frequently in sentences (like for, of, a, and, etc.) and carry no special meaning at this stage. Once words are, in the code below, tokens that appear in only one document of the corpus, so they can be ignored. To filter these tokens out, we look up their ids in the dictionary and remove them.

With a dictionary for the corpus, we can convert tokenized *documents* to vectors in the **BoW (bag-of-words) representation**. A document can be any tokenized text; here, each tweet is a document.

Here we treat the first 100 pages of the user's timeline as the whole corpus.

At the end, we save the dictionary to disk for possible later use.

In [7]:
from gensim import corpora

texts = MyTexts(100)  # define corpus streamable object
# collect statistics about tokens into dictionary
dictionary = corpora.Dictionary(texts)
# preprocess: remove stop words and only once words from dictionary
stop_words = set('for of a an and the to in on at with is are'.split())
stop_word_ids = [dictionary.token2id[stopword]
                 for stopword in stop_words
                 if stopword in dictionary.token2id]
once_word_ids = [tokenid
                 for tokenid, docfreq in dictionary.dfs.items()
                 if docfreq == 1]
dictionary.filter_tokens(stop_word_ids + once_word_ids)
dictionary.compactify()
# store the dictionary
dictionary.save('dictionary.dict')

2021-04-04 04:49:08,726 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-04-04 04:54:59,862 : INFO : built Dictionary(3620 unique tokens: ['#cfc', 'albion', 'bromwich', 'callum', 'chelsea']...) from 1202 documents (total 19581 corpus positions)
2021-04-04 04:54:59,889 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(3620 unique tokens: ['#cfc', 'albion', 'bromwich', 'callum', 'chelsea']...) from 1202 documents (total 19581 corpus positions)", 'datetime': '2021-04-04T04:54:59.863224', 'gensim': '4.0.1', 'python': '3.7.3 (default, Mar 27 2019, 16:54:48) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-19.6.0-x86_64-i386-64bit', 'event': 'created'}
2021-04-04 04:54:59,896 : INFO : Dictionary lifecycle event {'fname_or_handle': 'dictionary.dict', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-04-04T04:54:59.896392', 'gensim': '4.0.1', 'python': '3.7.3 (default, Mar 27 2019, 16:54:48) \n[Clang 4.0.1 (tags/R

In [8]:
print(dictionary)

Dictionary(1613 unique tokens: ['#cfc', 'callum', 'chelsea', 'disgusted', 'evening']...)


With the `dfs` attribute of the `dictionary` object, you can see the *document frequencies*: a mapping from token id to the number of documents that contain that token.

In [9]:
dictionary.dfs

{10: 12,
 2: 17,
 5: 2,
 3: 2,
 8: 4,
 9: 5,
 6: 9,
 12: 119,
 4: 3,
 11: 3,
 13: 13,
 7: 26,
 1: 3,
 0: 43,
 16: 18,
 17: 6,
 15: 29,
 18: 22,
 19: 5,
 14: 132,
 21: 47,
 24: 122,
 26: 151,
 25: 6,
 22: 4,
 23: 6,
 20: 258,
 36: 7,
 31: 2,
 30: 137,
 29: 5,
 32: 48,
 28: 4,
 38: 10,
 37: 10,
 33: 2,
 34: 10,
 35: 2,
 27: 4,
 44: 10,
 43: 3,
 45: 19,
 42: 70,
 41: 7,
 39: 24,
 40: 43,
 47: 7,
 50: 3,
 48: 135,
 51: 2,
 46: 10,
 52: 15,
 49: 10,
 56: 6,
 59: 165,
 57: 75,
 55: 3,
 60: 85,
 64: 3,
 69: 54,
 63: 79,
 61: 19,
 53: 25,
 54: 7,
 65: 4,
 68: 21,
 67: 20,
 62: 31,
 70: 21,
 66: 3,
 58: 3,
 76: 60,
 83: 58,
 82: 58,
 77: 39,
 79: 24,
 73: 9,
 78: 16,
 81: 23,
 75: 19,
 72: 11,
 74: 21,
 80: 64,
 84: 3,
 71: 15,
 89: 66,
 85: 32,
 87: 9,
 88: 18,
 86: 51,
 90: 19,
 96: 10,
 94: 2,
 95: 4,
 91: 48,
 93: 40,
 99: 35,
 102: 37,
 97: 4,
 101: 2,
 98: 71,
 92: 13,
 104: 20,
 100: 12,
 103: 6,
 109: 3,
 111: 2,
 112: 3,
 105: 40,
 120: 46,
 107: 70,
 114: 16,
 116: 3,
 115: 18,
 117: 
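
To make document frequencies and the stop-word / once-word filter concrete, here is a plain-Python sketch on three made-up documents. It is not how gensim implements `Dictionary`; in particular it keys the counts by token string rather than by token id.

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Map each token to the number of documents that contain it."""
    dfs = Counter()
    for tokens in tokenized_docs:
        dfs.update(set(tokens))  # count a token once per document
    return dfs


docs = [
    ['david', 'luiz', 'could', 'need', 'surgery', '#afc'],
    ['breaking', 'news', 'on', 'david', 'luiz', '#afc'],
    ['news', 'on', 'the', 'news', '#lfc'],
]
stop_words = {'on', 'the'}

dfs = document_frequencies(docs)
# keep tokens that occur in more than one document and are not stop words
kept = sorted(token for token, df in dfs.items()
              if df > 1 and token not in stop_words)
print(dfs['news'], kept)
```

Note that 'news' appears twice in the last document but still has a document frequency of 2, because each document counts only once.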

> **Building training corpus**

Now that we have a dictionary, we can select our training corpus and vectorize its documents in the BoW representation. I chose the first 10 pages of the user's timeline as the training corpus. A generator yields `dictionary.doc2bow(text)` for each `text` (= tweet) in `train_texts` (= the training corpus as texts), and the resulting BoW corpus (= `train_corpus_bow`) is stored in a file named 'train_corpus_bow.mm' with the `gensim.corpora.MmCorpus.serialize` method of the [`gensim.corpora.MmCorpus` class](https://radimrehurek.com/gensim/corpora/mmcorpus.html#module-gensim.corpora.mmcorpus). The corpus is serialized in the sparse coordinate [Matrix Market (.mm) format](https://math.nist.gov/MatrixMarket/formats.html).

> **Note**: Any token of a document that is not in the dictionary is ignored by the `gensim.corpora.Dictionary.doc2bow` method and is not counted in the document's BoW representation.
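
The idea behind the conversion can be sketched in a few lines of plain Python. The `doc_to_bow` helper and the toy vocabulary below are made up for illustration; this is not gensim's implementation, only the behaviour described above: count known tokens by id and skip unknown ones.

```python
from collections import Counter


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {'#afc': 0, 'david': 1, 'luiz': 2, 'news': 3}  # toy vocabulary
bow = doc_to_bow(['david', 'luiz', 'news', 'news', 'blah'], token2id)
print(bow)
```

Here 'news' is counted twice and 'blah', which is not in the vocabulary, leaves no trace in the result.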

In [10]:
# choose the first 10 pages of the user's timeline for the training corpus
train_texts = MyTexts(10)
# stream on the training corpus and store it in bow representation
train_corpus_bow = (dictionary.doc2bow(text) for text in train_texts)
corpora.MmCorpus.serialize('train_corpus_bow.mm', train_corpus_bow)

2021-04-04 04:56:15,345 : INFO : storing corpus in Matrix Market format to train_corpus_bow.mm
2021-04-04 04:56:15,372 : INFO : saving sparse matrix to train_corpus_bow.mm
2021-04-04 04:56:18,804 : INFO : PROGRESS: saving document #0
2021-04-04 04:56:50,968 : INFO : saved 104x572 matrix, density=1.952% (1161/59488)
2021-04-04 04:56:50,970 : INFO : saving MmCorpus index to train_corpus_bow.mm.index
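
For a feeling of what a coordinate-format file looks like, here is a small sketch that renders BoW documents as Matrix Market text. It follows the general layout described on the Matrix Market page (a header line, a size line, then one `row column value` entry per non-zero, all 1-based); the exact header and number formatting that gensim writes may differ, and `to_matrix_market` is a made-up helper.

```python
def to_matrix_market(bow_docs, num_terms):
    """Render BoW documents as Matrix Market coordinate text.

    Rows are documents and columns are token ids, both 1-based,
    as the coordinate format requires.
    """
    entries = [(doc_no + 1, token_id + 1, weight)
               for doc_no, bow in enumerate(bow_docs)
               for token_id, weight in bow]
    lines = ["%%MatrixMarket matrix coordinate real general",
             f"{len(bow_docs)} {num_terms} {len(entries)}"]
    lines += [f"{row} {col} {weight}" for row, col, weight in entries]
    return "\n".join(lines)


text = to_matrix_market([[(0, 1), (2, 2)], [(1, 1)]], num_terms=3)
print(text)
```

The size line is where the numbers in the log come from: 104 documents, 572 features and 1161 non-zero entries.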


In [11]:
train_texts = MyTexts(10)
train_corpus_bow = corpora.MmCorpus('train_corpus_bow.mm')
for count, (tweet_bow, tweet_text) in enumerate(zip(train_corpus_bow, train_texts)):
    print(tweet_bow, tweet_text, '\n---')
    if count == 5: break

2021-04-04 04:57:32,765 : INFO : loaded corpus index from train_corpus_bow.mm.index
2021-04-04 04:57:32,774 : INFO : initializing cython corpus reader from train_corpus_bow.mm
2021-04-04 04:57:32,788 : INFO : accepted corpus with 104 documents, 572 features, 1161 non-zero entries
[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0), (9, 1.0), (10, 1.0), (11, 1.0), (12, 1.0), (13, 1.0)] ['statement', 'chelsea', 'fc', 'is', 'disgusted', 'with', 'posts', 'on', 'social', 'media', 'this', 'evening', 'targeting', 'west', 'bromwich', 'albion', 'player', 'callum', 'robinson', '#cfc'] 
---
[(14, 1.0), (15, 1.0), (16, 1.0), (17, 1.0), (18, 1.0), (19, 1.0)] ['david', 'luiz', 'could', 'need', 'surgery', '#afc'] 
---
[(20, 1.0), (21, 1.0), (22, 1.0), (23, 1.0), (24, 1.0), (25, 1.0), (26, 1.0)] ['confirmed', 'liverpool', 'will', 'meet', 'with', 'erling', "haaland's", 'representatives', 'on', 'friday', '#lfc'] 
---
[(27, 1.0), (28, 1.0), (29, 1.0), (30, 1.0), (31,

# Train the Models with Training Corpus

Now that we have the BoW representation of the training corpus (in `train_corpus_bow`), we can *transform* it into other representations. These transformations are obtained by training *models* such as **TF-IDF** and **LSI**. **Gensim** has a module, `gensim.models`, which contains various models, including the [`gensim.models.TfidfModel`](https://radimrehurek.com/gensim/models/tfidfmodel.html#module-gensim.models.tfidfmodel) and [`gensim.models.LsiModel`](https://radimrehurek.com/gensim/models/lsimodel.html#module-gensim.models.lsimodel) classes.

First we reload the training corpus in BoW from disk, to be sure we are working with the serialized version.

In [12]:
# load training corpus in bow from disk
train_corpus_bow = corpora.MmCorpus('train_corpus_bow.mm')

2021-04-04 04:57:45,136 : INFO : loaded corpus index from train_corpus_bow.mm.index
2021-04-04 04:57:45,159 : INFO : initializing cython corpus reader from train_corpus_bow.mm
2021-04-04 04:57:45,181 : INFO : accepted corpus with 104 documents, 572 features, 1161 non-zero entries


Then we can initialize a TF-IDF model with the corpus in BoW. Initializing the model *trains* it on this training corpus.

In [13]:
from gensim import models

# initialize a tfidf model with training corpus in bow: training
tfidf_model = models.TfidfModel(train_corpus_bow)

2021-04-04 04:57:56,113 : INFO : collecting document frequencies
2021-04-04 04:57:56,125 : INFO : PROGRESS: processing document #0
2021-04-04 04:57:56,172 : INFO : TfidfModel lifecycle event {'msg': 'calculated IDF weights for 104 documents and 572 features (1161 matrix non-zeros)', 'datetime': '2021-04-04T04:57:56.172322', 'gensim': '4.0.1', 'python': '3.7.3 (default, Mar 27 2019, 16:54:48) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-19.6.0-x86_64-i386-64bit', 'event': 'initialize'}


Now that you have a TF-IDF model, you can transform any document (= whether a single tweet or the whole corpus, as long as it is in the BoW representation) simply by indexing the `tfidf_model` object with it; the output is the transformed vector (= document). So the TF-IDF model transforms documents from the BoW representation into the Tfidf representation, which can be summarized in the notation *bow->tfidf*.

> **Note**: A transformation can be applied only **after** the model is trained. In fact, training the model is what gives us the transformation.

Here I transform (bow->tfidf) the whole training corpus, because I want to pass it in Tfidf to another model for training.

In [14]:
# transform training corpus bow->tfidf
train_corpus_tfidf = tfidf_model[train_corpus_bow]

In [15]:
for count, (tweet_tfidf, tweet_text) in enumerate(zip(train_corpus_tfidf, MyTexts(10))):
    print(tweet_tfidf, tweet_text, '\n---')
    if count == 5: break

[(0, 0.20305576652385546), (1, 0.2894543924822621), (2, 0.2209851013533081), (3, 0.2894543924822621), (4, 0.2894543924822621), (5, 0.2894543924822621), (6, 0.2894543924822621), (7, 0.2209851013533081), (8, 0.2894543924822621), (9, 0.2894543924822621), (10, 0.2894543924822621), (11, 0.2894543924822621), (12, 0.15251581022435406), (13, 0.2894543924822621)] ['statement', 'chelsea', 'fc', 'is', 'disgusted', 'with', 'posts', 'on', 'social', 'media', 'this', 'evening', 'targeting', 'west', 'bromwich', 'albion', 'player', 'callum', 'robinson', '#cfc'] 
---
[(14, 0.2325356183336032), (15, 0.39234888686124225), (16, 0.4611767152851847), (17, 0.4611767152851847), (18, 0.39234888686124225), (19, 0.4611767152851847)] ['david', 'luiz', 'could', 'need', 'surgery', '#afc'] 
---
[(20, 0.17086465762202846), (21, 0.35364733935246695), (22, 0.4604179485908612), (23, 0.541186796496686), (24, 0.2851556061823202), (25, 0.4604179485908612), (26, 0.2181114048733865)] ['confirmed', 'liverpool', 'will', 'meet',

Now it's time to briefly describe the BoW and Tfidf representations.

Both rely on a dictionary, so one must be created first.

> **BoW representation**: Any document can have a BoW representation. A *document* could be a single word, a sentence, a tweet, a paragraph or even a book: anything that has been tokenized. The `gensim.corpora.Dictionary.doc2bow` method takes the document as input, looks its tokens up in the dictionary, and tells us which words (as word ids) *occur* in the document and how many *times* each one occurs, as tuples. For example, a BoW representation like `[(144, 2), (8, 1)]` tells us that the word with id = 144 occurs twice in this document and the word with id = 8 occurs only once (in **this** document).

> **Tfidf representation**: A TF-IDF model is obtained by training: the training pass goes over the words in the dictionary and computes their inverse document frequencies in the training corpus. The Tfidf representation then gives a statistical measure of *how relevant a word is to a document in a corpus*. It is the product of two factors: the *inverse document frequency* of the word in the whole corpus, and *how many times the word appears in this document*. For example, suppose that for the same document as above the Tfidf representation is `[(144, 0.378), (8, 0.129)]`; this tells us that word ids 144 and 8 *occur* in the document and that their *tf-idf measures* are 0.378 and 0.129, respectively. For more information, see the [tf-idf Wikipedia article](https://en.wikipedia.org/wiki/Tf–idf).
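
As a worked example of the two factors, here is a plain-Python sketch. It assumes the weighting tf · log2(N / df) followed by scaling each document vector to unit length, which is my understanding of gensim's default settings; check the `TfidfModel` documentation before relying on it. The `tfidf_vector` helper and the toy numbers are made up and are not taken from the notebook's corpus.

```python
import math


def tfidf_vector(bow, dfs, num_docs):
    """Weight a BoW document by tf * log2(N / df), then scale to unit length."""
    weighted = [(token_id, tf * math.log2(num_docs / dfs[token_id]))
                for token_id, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(token_id, w / norm) for token_id, w in weighted]


# Toy numbers: token_id -> number of documents containing it, in a 4-document corpus.
dfs = {0: 1, 1: 2, 2: 4}
vec = tfidf_vector([(0, 1), (1, 1), (2, 2)], dfs, num_docs=4)
print(vec)
```

Token 2 appears in every document, so its inverse document frequency is log2(4/4) = 0 and it gets zero weight even though it occurs twice here; this sketch keeps such zero entries in the output. The rarest token (id 0) ends up with the largest weight.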

Now that we've come this far, it is worth training another model. The **Latent Semantic Indexing** (**LSI**) model was first published in [Deerwester et al. (1990): Indexing by Latent Semantic Analysis](https://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf). All I know so far is that this analysis is based on *singular value decomposition*, and I want to learn more about the model!

Let's initialize the `gensim.models.LsiModel` object with the training corpus in the Tfidf representation to train an LSI model. It takes some other parameters, such as the dictionary as `id2word` and `num_topics`, which I simply guessed as 10.
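
To get a first intuition for the singular value decomposition part, here is a toy sketch that finds only the largest singular value of a tiny document-term matrix, together with its right singular vector (one "topic" as weights over terms), using plain power iteration. This is an illustration of the idea only; gensim's actual algorithm is more elaborate and is not reproduced here, and `top_singular_pair` is a made-up helper.

```python
import math


def top_singular_pair(matrix, iterations=100):
    """Approximate the largest singular value of `matrix` (a list of rows)
    and its right singular vector by power iteration on A^T A."""
    num_rows, num_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(num_cols)] * num_cols
    for _ in range(iterations):
        # u = A v, then w = A^T u
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(num_rows))
             for j in range(num_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return 0.0, v
        v = [x / norm for x in w]
    u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))  # |A v| for unit-length v
    return sigma, v


# Toy matrix: rows are documents, columns are terms.
docs_by_terms = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
sigma, topic = top_singular_pair(docs_by_terms)
print(sigma, topic)
```

The first two terms always occur together, so the strongest direction puts equal weight on them and almost none on the third term. Keeping the top few such directions is, roughly, what `num_topics` controls.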

In [16]:
# train a LSI model with training corpus in tfidf
lsi_model = models.LsiModel(train_corpus_tfidf, id2word=dictionary, num_topics=10)
# tfidf->fold-in-lsi
train_corpus_lsi = lsi_model[train_corpus_tfidf]

2021-04-04 05:11:52,695 : INFO : using serial LSI version on this node
2021-04-04 05:11:52,705 : INFO : updating model with new documents
2021-04-04 05:11:52,733 : INFO : preparing a new chunk of documents
2021-04-04 05:11:52,747 : INFO : using 100 extra samples and 2 power iterations
2021-04-04 05:11:52,751 : INFO : 1st phase: constructing (1613, 110) action matrix
2021-04-04 05:11:52,768 : INFO : orthonormalizing (1613, 110) action matrix
2021-04-04 05:11:52,856 : INFO : 2nd phase: running dense svd on (110, 104) matrix
2021-04-04 05:11:52,868 : INFO : computing the final decomposition
2021-04-04 05:11:52,870 : INFO : keeping 10 factors (discarding 80.171% of energy spectrum)
2021-04-04 05:11:52,875 : INFO : processed documents up to #104
2021-04-04 05:11:52,879 : INFO : topic #0(1.774): -0.401*"news" + -0.238*"#lfc" + -0.190*"breaking" + -0.173*"will" + -0.150*"has" + -0.145*"be" + -0.138*"@lfc" + -0.126*"team" + -0.125*"out" + -0.115*"he"
2021-04-04 05:11:52,882 : INFO : topic #1(1

Note that we *pipelined* the models as *bow->tfidf->fold-in-lsi* and wrapped them around the training corpus.

Before showing documents in the fold-in-lsi representation, we print the 10 topics that we asked for. I've read the topics found by the model and, believe me, they reminded me of Liverpool FC's dark days this season :(

Read these topics with *hope in your heart*:

In [17]:
lsi_model.print_topics()

2021-04-04 05:12:28,226 : INFO : topic #0(1.774): -0.401*"news" + -0.238*"#lfc" + -0.190*"breaking" + -0.173*"will" + -0.150*"has" + -0.145*"be" + -0.138*"@lfc" + -0.126*"team" + -0.125*"out" + -0.115*"he"
2021-04-04 05:12:28,238 : INFO : topic #1(1.602): -0.579*"news" + -0.259*"breaking" + 0.162*"has" + -0.158*"10pm" + -0.153*"big" + -0.147*"11pm" + 0.136*"out" + -0.121*"@arsenal" + 0.110*"team" + -0.108*"positive"
2021-04-04 05:12:28,271 : INFO : topic #2(1.470): 0.280*"out" + 0.242*"#nufc" + 0.232*"#afc" + 0.217*"#thfc" + -0.216*"money" + 0.198*"😔" + 0.197*"has" + -0.195*"some" + -0.193*"follow" + -0.169*"#ad"
2021-04-04 05:12:28,287 : INFO : topic #3(1.438): 0.391*"#lfc" + -0.231*"#nufc" + -0.224*"😔" + -0.200*"#afc" + -0.186*"out" + -0.168*"money" + -0.166*"follow" + 0.154*"team" + -0.148*"some" + -0.144*"#ad"
2021-04-04 05:12:28,290 : INFO : topic #4(1.395): 0.563*"😔" + 0.444*"#nufc" + 0.241*"#lfc" + -0.204*"#afc" + 0.161*"season" + -0.155*"#thfc" + -0.138*"has" + -0.122*"kane" + 

[(0,
  '-0.401*"news" + -0.238*"#lfc" + -0.190*"breaking" + -0.173*"will" + -0.150*"has" + -0.145*"be" + -0.138*"@lfc" + -0.126*"team" + -0.125*"out" + -0.115*"he"'),
 (1,
  '-0.579*"news" + -0.259*"breaking" + 0.162*"has" + -0.158*"10pm" + -0.153*"big" + -0.147*"11pm" + 0.136*"out" + -0.121*"@arsenal" + 0.110*"team" + -0.108*"positive"'),
 (2,
  '0.280*"out" + 0.242*"#nufc" + 0.232*"#afc" + 0.217*"#thfc" + -0.216*"money" + 0.198*"😔" + 0.197*"has" + -0.195*"some" + -0.193*"follow" + -0.169*"#ad"'),
 (3,
  '0.391*"#lfc" + -0.231*"#nufc" + -0.224*"😔" + -0.200*"#afc" + -0.186*"out" + -0.168*"money" + -0.166*"follow" + 0.154*"team" + -0.148*"some" + -0.144*"#ad"'),
 (4,
  '0.563*"😔" + 0.444*"#nufc" + 0.241*"#lfc" + -0.204*"#afc" + 0.161*"season" + -0.155*"#thfc" + -0.138*"has" + -0.122*"kane" + -0.122*"harry" + -0.120*"aubameyang"'),
 (5,
  '0.305*"#lfc" + 0.256*"#afc" + 0.208*"aubameyang" + -0.192*"😔" + 0.190*"same" + 0.169*"team" + -0.161*"will" + -0.126*"manchester" + 0.123*"furious" + 

Then we print the first few tweets in the fold-in-lsi representation. Note that we have 10 topics, and the representation shows how strongly each tweet is related to each of the 10 topics.

In [18]:
for count, (tweet_lsi, tweet_text) in enumerate(zip(train_corpus_lsi, MyTexts(10))):
    print(tweet_lsi, tweet_text, '\n---')
    if count == 5: break

[(0, -0.04861963696522687), (1, 0.02614180540190273), (2, -0.013537251848193124), (3, 0.0016044318514834319), (4, -0.015416615212562145), (5, -0.010980581726811846), (6, 0.05679758723532831), (7, -0.007275692716576178), (8, 0.012138058118968094), (9, -0.010483042063101715)] ['statement', 'chelsea', 'fc', 'is', 'disgusted', 'with', 'posts', 'on', 'social', 'media', 'this', 'evening', 'targeting', 'west', 'bromwich', 'albion', 'player', 'callum', 'robinson', '#cfc'] 
---
[(0, -0.04743868296514272), (1, -0.011337340781317722), (2, 0.10406379211482553), (3, -0.08082195369033182), (4, -0.09977622606145477), (5, 0.13535018506832436), (6, -0.062217162449585695), (7, -0.015246985119146397), (8, 0.17476093308953192), (9, -0.008777367833210809)] ['david', 'luiz', 'could', 'need', 'surgery', '#afc'] 
---
[(0, -0.17661357974789188), (1, 0.07747152742326434), (2, -0.016897596975931274), (3, 0.1562200482516766), (4, 0.04223098125836869), (5, 0.0006513178763686359), (6, -0.13132871819521363), (7, -0.
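
One simple way to read such a vector is to pick the topic with the largest absolute weight. Using the absolute value is my own assumption here (the sign of an LSI weight is hard to interpret on its own), and `dominant_topic` is a made-up helper, not a gensim function. The weights are copied, rounded, from the second tweet in the output above.

```python
def dominant_topic(lsi_vector):
    """Return the (topic_id, weight) pair with the largest absolute weight."""
    return max(lsi_vector, key=lambda pair: abs(pair[1]))


# Weights copied (rounded) from the 'david luiz could need surgery #afc' tweet.
david_luiz_tweet = [(0, -0.0474), (1, -0.0113), (2, 0.1041), (3, -0.0808),
                    (4, -0.0998), (5, 0.1354), (6, -0.0622), (7, -0.0152),
                    (8, 0.1748), (9, -0.0088)]
topic_id, weight = dominant_topic(david_luiz_tweet)
print(topic_id, weight)
```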