# Twitter datasets for water events

Author: Fadoua Ghourabi (fadouaghourabi@gmail.com)

Date: July 13, 2019

- **data**: dataframe that stores the extracted tweet texts and metadata (date, location, etc.) before processing
- **clean_tweets**: list that stores the tweet texts after NLP preprocessing (using the function clean_collection)
- **data_clean_tweets**: copy of the dataframe data where the tweet text is replaced by clean_tweets
- **data_clean_tweets_in_vocab**: a copy of data_clean_tweets where tweets that are not in the model vocabulary are dropped

To export for classification:
- **tweet_avg_w2v**: tweet data where the vector representation is the average of the word vectors (gensim.models.word2vec)
- **tweet_avg_w2v_tfidf**: tweet data where the vector representation is the average of the word vectors weighted by the tfidf metric (gensim.models.word2vec and sklearn's TfidfVectorizer)
- **tweet_d2v**: tweet data where the vector representation is computed by gensim.models.doc2vec
- **tweet_avg_ft**: tweet data where the vector representation is the average of pretrained fastText word vectors

In [2]:
import os
import time
import pandas as pd
import numpy as np
import gensim
#fasttext 0.9.1 
import fasttext
# clean_collection is a function that implements nlp preprocessing pipeline
from ipynb.fs.full.fr_twitter_water_nlp import clean_collection 
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# file contains manually labeled tweets
tw_path = "../datasets/twData_clean_labeled.csv"
# folder for corpus generated from collected tweets
corpusdir = 'corpus/'
# pretrained fastText vectors for French (https://fasttext.cc/docs/en/crawl-vectors.html)
fast = '/Users/basho/fadouaproject/SafeWater/model/cc.fr.300.bin'
# folder for tweet datasets
dataset_path = "../datasets/"

In [4]:
# date of the most recent collection and location of extracted tweet
collection_date = "February 10, 2019 ~ July 13, 2019"
geo_loc = "Sfax (center), 400km (radius)"

### Data structure and NLP preprocessing

In [5]:
def load_water_tweets(desc=True):
    
    '''
    - Description: 
    load_water_tweets reads the tweets and other metadata from the file given by tw_path and
    returns a dataframe
    - Input:
    <desc> indicates whether to display the description
    - Output:
    <tw_data> is a dataframe where columns are given by the header of the file tw_path
    - History:
    July 13, 2019 --> implementation, to fix: 11 columns should be 10 (extra "," in tw_path?)
    July 17, 2019 --> fixed: extra column removed in tw_path
    
    '''
    
    # the encoding is set when opening the file: utf-8 for better representation of french text
    file = open(tw_path, "r", encoding="utf-8")
    tw_data = pd.read_csv(file, header=0)
    
    if desc:
        print("=== Tweet datasets for water events ===")
        print("Language: French")
        print("Collection date: {}".format(collection_date))
        print("Location: {}".format(geo_loc))
        print("Data size: {}".format(tw_data.shape))
        print("""Features: \n
              - Timestamp: date and time of collection. \n
              - TwDate: date and time of tweet publication. \n
              - TwLoc: localisation of user \n
              - TwUserName: user name\n
              - TwUserID: user's unique ID\n
              - TwContent: tweet message\n
              - ContentLoc: list of locations that are included in the tweet\n
              - urls: list of urls that are included in the tweet\n
              - Event: label --> water shortage (1) or not (0)""")
        print("=======================================")
    file.close()
    return tw_data

In [6]:
data = load_water_tweets()

=== Tweet datasets for water events ===
Language: French
Collection date: February 10, 2019 ~ July 13, 2019
Location: Sfax (center), 400km (radius)
Data size: (535, 10)
Features: 

              - Timestamp: date and time of collection. 

              - TwDate: date and time of tweet publication. 

              - TwLoc: localisation of user 

              - TwUserName: user name

              - TwUserID: user's unique ID

              - TwContent: tweet message

              - ContentLoc: list of locations that are included in the tweet

              - urls: list of urls that are included in the tweet

              - Event: label --> water shortage (1) or not (0)


In [7]:
data.Event.isnull().values.any() # missing label?

False

In [8]:
data.Event.isna().values.any() # nan label?

False

In [9]:
data.TwContent.head(10) # to check the encoding of French text

0    Les gouvernorats de Siliana, Kasserine et Jend...
1    Perturbations et coupures de l’approvisionneme...
2    L'approvisionnement en eau potable reprendra, ...
3    Perturbations et coupures dans l’approvisionne...
4    Perturbations dans l’approvisionnement en eau ...
5    La reprise sera progressive... https://t.co/6h...
6    #Tunisie : Perturbations et coupures dans l’ap...
7    Tunisie - Tozeur : La SONEDE rassure sur la qu...
8    Nos gouvernants ont l'habitude de prendre de l...
9    Jendouba nord : le vol des équipements de la S...
Name: TwContent, dtype: object

In [10]:
tweets = data.TwContent # variable tweets contains the raw tweet text as collected by the api
tweets.shape 

(535,)

In [11]:
def nlp_preprocessing(tws,lem=False):
    ''' 
    - Description:
    nlp_preprocessing simply calls clean_collection, which applies NLP preprocessing
    (lowercasing, URL removal, stopword removal) to all the tweets
    - Input:
    tws: raw tweet text
    lem: lemmatization is optional, default False
    - Output:
    duration: computation time in seconds
    clean_tweets: preprocessed tweet text
    - History:
    July 13, 2019 --> implementation, clean_collection is slow
    July 16, 2019 --> added assertion to make sure no tweet is lost
    '''
    start = time.time()
    clean_tweets = clean_collection(tws, lem) 
    end = time.time()
    duration = end - start # about 5s !
    
    assert len(clean_tweets) == tws.shape[0]
    
    return duration, clean_tweets

In [12]:
# preprocessing of tweets (see function clean_collection)
# warning: slow computation ~ 5s
start = time.time()
_, clean_tweets = nlp_preprocessing(tweets, True) # 
end = time.time()
len(clean_tweets), end - start

(535, 4.709585189819336)

In [13]:
def replace_column(data,column_index,newcolumn):
    '''
    - Description:
    replace_column replaces an entire column with new values, 
    e.g. the raw tweet text is replaced by its nlp preprocessing
    !! Attention !!
    if column_index is not an existing column then a new column is added
    - Input:
    data: original data
    column_index: a string indicating the column header
    newcolumn: new values for the column
    - Output:
    newdata: new data with new column values
    - History:
    July 13, 2019 --> implementation
    July 16, 2019 --> added assertions
    '''
    newdata = data.copy()
    assert not (id(newdata) == id(data))
    
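    m = newdata.shape[0]
    newcolumn = list(newcolumn)
    nm = len(newcolumn)
    # check the length before assigning, otherwise a mismatch is silently padded or truncated
    assert (m == nm)
    # reuse the dataframe index so that values stay aligned after rows have been dropped
    newdata[column_index] = pd.Series(newcolumn, index=newdata.index)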
    # column is replaced or new column is added
    assert (newdata.shape == data.shape) or (newdata.shape[1] == data.shape[1]+1)
    
    return newdata

In [14]:
data_clean_tweets = replace_column(data,"TwContent",clean_tweets)

In [15]:
id(data_clean_tweets),id(data) # dataframes should not refer to the same object

(112777912160, 112777706296)

In [16]:
data_clean_tweets.shape

(535, 10)

### Make text corpus

We make a corpus out of the collected tweets. The corpus is used to generate the word2vec representation (see next section). We make $n$ text files for $n$ tweets. All are saved in ``corpusdir``. 

For the time being, when adding new tweets, delete folder ``corpusdir`` and run ``make_text(clean_tweets)``. <font color='red'>This feature should be improved so that the existing corpus can be extended with new data.</font>
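
A minimal sketch of how the corpus could be extended without deleting ``corpusdir``, assuming the files keep the ``<number>.txt`` naming used by ``make_text``. The function name ``extend_corpus`` and its arguments are hypothetical and are not part of this notebook.

```python
import os
import re
import tempfile

def extend_corpus(new_tweets, corpus_dir):
    """Write one numbered .txt file per tweet, continuing after the highest existing number."""
    os.makedirs(corpus_dir, exist_ok=True)
    # Collect the numbers already used by files named <number>.txt
    used = []
    for name in os.listdir(corpus_dir):
        match = re.fullmatch(r"(\d+)\.txt", name)
        if match:
            used.append(int(match.group(1)))
    next_id = max(used, default=0) + 1
    written = []
    for offset, text in enumerate(new_tweets):
        path = os.path.join(corpus_dir, "{}.txt".format(next_id + offset))
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        written.append(path)
    return written

# Usage on a throwaway directory (not the real corpus folder)
demo_dir = tempfile.mkdtemp()
first = extend_corpus(["coupure eau sfax", "eau potable"], demo_dir)
second = extend_corpus(["perturbation eau gafsa"], demo_dir)
print(sorted(os.listdir(demo_dir)))  # ['1.txt', '2.txt', '3.txt']
```

New tweets are numbered after the highest existing file, so earlier files are left untouched. Note that a reader that sorts file names as strings would place ``10.txt`` before ``2.txt``; zero-padded names are one way to keep numeric order if it matters downstream.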

In [23]:
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

In [24]:
# one file - !! not used !!
def make_data_from_tweets(tweets):
    filename = 0
    file = open(corpusdir+'data.txt','a')
    for tw in tweets:
        file.write(tw)
        file.write('\n')
    file.close()

In [25]:
# separate files
def make_text(data):
    '''
    - Description:
    make_text creates the corpus text files, one file for each tweet.
    - Input:
    data: tweets, preferably NLP-preprocessed
    - History:
    July 13, 2019 --> implementation, to fix: extending the corpus with new tweets
    '''
    
    filename = 0
    for text in data:
        filename+=1
        # write as utf-8 so that accented French characters are stored consistently
        with open(corpusdir+str(filename)+'.txt','w',encoding='utf-8') as file:
            file.write(text)

In [28]:
make_text(clean_tweets)

### tweet2vector representation (gensim)

According to the NLTK documentation on ``PlaintextCorpusReader``: "Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor."

However, our corpus already contains preprocessed tweets. We made this choice because the language of the tweets (French) requires less common libraries. 

<font color='red'>To validate this choice, two experiments should be performed: 
- ``PlaintextCorpusReader`` on raw corpora, 
- ``PlaintextCorpusReader`` on raw corpora + custom tokenizer</font>

In [29]:
newcorpus = PlaintextCorpusReader(corpusdir, '.*')

In [30]:
(clean_tweets[352],newcorpus.sents()[352])

('comment le conduite   eau jelma sauvagement saboter réparer sou 24 heure vidéo photo tunisie sfax eau leaderstunisie',
 ['comment',
  'le',
  'conduite',
  'eau',
  'jelma',
  'sauvagement',
  'saboter',
  'réparer',
  'sou',
  '24',
  'heure',
  'vidéo',
  'photo',
  'leaderstunisi'])

Gensim’s word2vec expects **a sequence of sentences as its input**. Each sentence is a list of words (utf8 strings). 

``class gensim.models.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)``

- Words that appear only once or twice in a corpus are probably uninteresting typos and garbage. In addition, there is not enough data to train meaningful vectors for those words, so the model ignores them. These out-of-vocabulary (OOV) words are not included in ``w2v_model.wv.vocab``. To change the minimum number of occurrences required for a word to be included in the vocabulary, use parameter ``min_count``.

- Word2Vec trains a shallow neural network whose hidden layer has 100 units by default; this is also the dimension of the resulting word vectors. To change it, use parameter ``size``.

- Word2Vec uses a context window to limit the number of words in each context. To change the window size, use parameter ``window``.

- The default algorithm is CBOW (parameter ``sg=0``). To use skip-gram, set parameter ``sg=1``. 

<font color='red'>We should perform experiments with various values for the hyperparameters.</font>
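
To make the role of ``min_count`` and ``window`` more concrete, here is a small sketch in plain Python. It is a simplified picture of what the two parameters control, not gensim's internal implementation; the helper names ``build_vocab`` and ``context_pairs`` and the toy sentences are made up for the illustration.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only the words that occur at least min_count times in the corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return {w for w, c in counts.items() if c >= min_count}

def context_pairs(sentence, window):
    """List (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        pairs.extend((target, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

toy_sentences = [
    ["coupure", "eau", "sfax"],
    ["eau", "potable", "sfax"],
    ["perturbation", "eau"],
]
vocab_1 = build_vocab(toy_sentences, min_count=1)
vocab_2 = build_vocab(toy_sentences, min_count=2)
print(len(vocab_1), sorted(vocab_2))         # 5 ['eau', 'sfax']
print(context_pairs(toy_sentences[0], window=1))
```

Raising ``min_count`` shrinks the vocabulary, and widening ``window`` produces more (target, context) pairs per sentence. Both effects are worth keeping in mind when comparing hyperparameter settings on a corpus as small as ours.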

In [31]:
# training Word2Vec on newcorpus sentences
# CBOW algorithm, words occurring at least 3 times make the vocab, context window of size 5, vectors of dimension 50
w2v_model = gensim.models.Word2Vec(newcorpus.sents(), sg=0, min_count=3, window=5, size=50)

In [32]:
# size of the vocabulary (276 when min_count = 5)
len(w2v_model.wv.vocab)

561

In [33]:
w2v_model.wv["eau"]

array([-0.00502588, -0.00020879, -0.00313085, -0.04761149,  0.02591472,
        0.0797812 ,  0.02400271, -0.02761331, -0.01668293, -0.00924041,
        0.01550639,  0.02967573,  0.02331197, -0.01905984, -0.02850658,
       -0.06293469,  0.02005616, -0.0118856 ,  0.03819882, -0.00791364,
       -0.11097454, -0.0066397 ,  0.08152932, -0.04701308, -0.07478894,
        0.01865409, -0.05734168,  0.00458778, -0.12077726,  0.02771127,
       -0.09081082, -0.08316717, -0.07378556, -0.0551071 , -0.01240881,
       -0.03893253,  0.00585383,  0.07512907,  0.01821466, -0.02814579,
       -0.04035135,  0.00860792, -0.04398966, -0.06897693,  0.01197071,
        0.03174187,  0.02844218, -0.00801674, -0.04205552,  0.03668069],
      dtype=float32)

In [34]:
w2v_model.wv["coupure"]

array([-0.00387644,  0.00468257,  0.00676616, -0.01473087,  0.01621496,
        0.02416983,  0.01524329, -0.01899278, -0.01286543,  0.0028702 ,
        0.01588697,  0.0147522 ,  0.01496758, -0.0048678 , -0.00419197,
       -0.01462643,  0.00433477, -0.01591946,  0.02273679, -0.01218186,
       -0.03896337, -0.00629161,  0.03724789, -0.0240705 , -0.03266382,
       -0.00021227, -0.02231739,  0.00794812, -0.04975024,  0.01611223,
       -0.02692081, -0.04251795, -0.02203562, -0.02840655, -0.01560746,
       -0.00837522, -0.00395815,  0.03792381,  0.00723598, -0.01338847,
       -0.00716835, -0.00604373, -0.01443875, -0.02419888,  0.00816183,
        0.01295616,  0.00439254,  0.00036245, -0.00648204,  0.02234869],
      dtype=float32)

In [35]:
def similarity_two_words(model,w1, w2):
    sim = model.wv.similarity(w1,w2)
    #print("The similarity between <{}> and <{}>: ".format(w1, w2), sim)
    return sim

In [36]:
similarity_two_words(w2v_model, 'eau', 'potable')

0.9751394

In [37]:
similarity_two_words(w2v_model, 'eau', 'coupure')

0.94240123

In [38]:
similarity_two_words(w2v_model, 'eau', 'gafsa')

0.60116863

In [39]:
similarity_two_words(w2v_model, 'eau', 'tunis')

0.7594028

In [40]:
w2v_model.wv.most_similar('eau')

[('le', 0.9853494763374329),
 ('leau', 0.9834550619125366),
 ('potable', 0.9751394391059875),
 ('avoir', 0.9732195734977722),
 ('projet', 0.9708139896392822),
 ('plus', 0.9697649478912354),
 ('tunisie', 0.9683387279510498),
 ('rt', 0.9657126069068909),
 ('jour', 0.9641616344451904),
 ('deau', 0.9614266753196716)]

In [41]:
w2v_model.wv.most_similar('coupure')

[('eau', 0.9424012303352356),
 ('faire', 0.9310715794563293),
 ('le', 0.928193986415863),
 ('plus', 0.9276940822601318),
 ('potable', 0.9273414611816406),
 ('rt', 0.9245010018348694),
 ('accès', 0.9230503439903259),
 ('leau', 0.9185031652450562),
 ('avoir', 0.9168297052383423),
 ('projet', 0.9137657284736633)]

In [42]:
w2v_model.wv.most_similar('distribution')

[('potable', 0.9339042901992798),
 ('le', 0.9324987530708313),
 ('avoir', 0.9290918111801147),
 ('eau', 0.92876797914505),
 ('tout', 0.923543393611908),
 ('bangui', 0.9212305545806885),
 ('leau', 0.9203723669052124),
 ('projet', 0.9202545285224915),
 ('devenir', 0.9178816676139832),
 ('deau', 0.9155687689781189)]
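
The scores returned by ``similarity`` and ``most_similar`` are cosine similarities between word vectors. The sketch below computes the same quantity by hand on made-up vectors; ``cosine_similarity`` is a helper written for this illustration and the numbers are not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: no shared direction
```

Only the direction of the vectors matters, not their length. Scores above 0.9 for generic words such as ``le`` and ``rt`` in the outputs above may indicate that the vectors have not separated much, which would be plausible for a corpus of a few hundred tweets; this is a hypothesis to check in the hyperparameter experiments, not a measured result.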

#### Strategy 1: Averaging word2vec

How can we compute the vector representation of a tweet from the vector representations of its words? One option is to sum the word vectors and divide by the number of words in the tweet that have a vector.
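
As a toy illustration of this averaging, the sketch below uses a hand-written dictionary of 2-dimensional vectors instead of a trained model. The helper ``average_vector`` and the vectors are assumptions made for the example; words without a vector are skipped.

```python
def average_vector(tokens, word_vectors):
    """Average the vectors of the known tokens; return None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    # zip(*known) groups the values component by component
    return [sum(component) / len(known) for component in zip(*known)]

toy_vectors = {"eau": [1.0, 0.0], "coupure": [0.0, 1.0], "sfax": [1.0, 1.0]}
tweet = "coupure eau sfax inconnu".split()

vec = average_vector(tweet, toy_vectors)
print(vec)  # each component is 2/3: three known words, "inconnu" is ignored
```

The divisor is the number of words that have a vector (3 here), not the raw length of the tweet (4), which matches the way OOV words are ignored in the functions below.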

In [43]:
def check_in_vocabulary(model,tws,vocab):
    '''
    - Description:
    check_in_vocabulary checks which words are not in the model's vocabulary. 
    These words cause errors as they cannot be converted to vectors.
    eg: KeyError "word 'satisfaction' not in vocabulary"
    - Input:
    model: the trained word2vector model
    tws: clean tweets
    vocab: the model's vocabulary. In case of gensim's Word2Vec, model.wv.vocab gives a dictionary of vocabulary.
    - Output:
    not_in_vocabulary: list of (row_index,tweet) where tweet is composed of words 
    that are not in the model's vocabulary.
    - History:
    July 16, 2019 --> implementation, we choose to ignore the words that are not in the vocabulary (OOV).
    '''
    not_in_vocabulary = []
    i = 0
    for tw in tws:
        n_words = 0
        for w in tw.split():
            if w in vocab: # careful! model.wv.vocab is not a complete list of unique words
                n_words += 1
        
        if n_words == 0: # meaning all the words in tw are not in model's vocabulary
            not_in_vocabulary.append((i,tw)) 
        i += 1
    
    return not_in_vocabulary

In [44]:
def drop_not_in_vocabulary(data,indexes):
    '''
    - Description:
    drop_not_in_vocabulary drops from data the tweets that are not in the model's vocabulary
    - Input:
    data: complete data with clean tweet texts, i.e. the dataset data_clean_tweets
    indexes: row indexes of tweets not in vocabulary
    - Output:
    data.drop(indexes) 
    '''
    return data.drop(indexes)

In [45]:
nv = check_in_vocabulary(w2v_model,clean_tweets,w2v_model.wv.vocab)
nv # empty nv means all tweets are composed of words in w2v_model.wv.vocab

[]

In [46]:
indexes = [i for (i, _) in nv]
data_clean_tweets_in_vocab = drop_not_in_vocabulary(data_clean_tweets,indexes)

In [47]:
data_clean_tweets_in_vocab.shape

(535, 10)

In [48]:
def tweets_to_avgvecs(model,tws,num_features):
    '''
    - Description:
    tweets_to_avgvecs computes the vector representation of a tweet. 
    To that end, it sums the vector representations of words, then divides by the number of words.
    Out-of-vocabulary (OOV) words are ignored.
    numpy arrays are used for efficient array arithmetic
    - Input:
    model: the trained model
    tws: clean tweets
    num_features: the size of the vector representation
    - Output:
    vectors: numpy array of stacked vector representations of tweets
    not_in_vocabulary: tweets that are not in the vocabulary
    - History:
    July 16, 2019 --> implementation, 
    to fix: 
    a) one function for different models: e.g. w2v_avg, w2v_avg_tfidf, w2v_avg_ft
    b) what to do with OOV?
    '''
    vocab = model.wv.vocab # assuming gensim's word2vec model
    vectors = [] #np.empty((len(tweets), num_features)) # dtype='float32')
    not_in_vocabulary = []
    i = 0
    for tw in tws:
        tw_vec = np.zeros((num_features,), dtype='float32')
        n_words = 0
        
        for w in tw.split():
            if w in vocab: # careful! model.wv.vocab is not a complete list of unique words
                n_words += 1
                # summation of vectors of words in vocab
                # OOV are ignored
                tw_vec = np.add(tw_vec, model.wv[w]) 
                
        
        if (n_words > 0):
            tw_vec = np.divide(tw_vec, n_words)
            # the averaged vector keeps the dimension of the word vectors
            assert tw_vec.shape == (num_features,)
            vectors.append(tw_vec.tolist())
        else:
            # not_in_vocabulary stores the tweets composed only of words not in model.wv.vocab
            not_in_vocabulary.append((i,tw)) 
        i += 1

    return (np.array(vectors),not_in_vocabulary)

In [49]:
size = w2v_model.vector_size
tws = data_clean_tweets_in_vocab.TwContent
tw_vectors, nv = tweets_to_avgvecs(w2v_model,tws,size)

In [50]:
tw_vectors.shape

(535, 50)

In [51]:
tweet_avg_w2v = replace_column(data_clean_tweets_in_vocab,"TwVec",list(tw_vectors))

In [52]:
tweet_avg_w2v.shape

(535, 11)

#### Strategy 2: averaging word2vec with TF-IDF

In [53]:
def tweets_to_avgvecs_with_tfidf(model,vectorizer,tws,num_features):
    '''
    - Description:
    tweets_to_avgvecs_with_tfidf computes the vector representation of a tweet. 
    To that end, it sums the vector representations of words, each multiplied by the IDF weight
    of the word (vectorizer.idf_), then divides by the number of words.
    Words that are out of the word2vec vocabulary (OOV) or unknown to the vectorizer are ignored.
    numpy arrays are used for efficient array arithmetic
    - Input:
    model: the trained model
    vectorizer: the fitted TfidfVectorizer
    tws: clean tweets
    num_features: the size of the vector representation
    - Output:
    vectors: numpy array of stacked vector representations of tweets
    not_in_vocabulary: tweets that are not in the vocabulary
    - History:
    July 16, 2019 --> implementation, 
    to fix: 
    a) one function for different models: e.g. w2v_avg, w2v_avg_tfidf, w2v_avg_ft
    b) what to do with OOV?
    '''
    tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
    vectors = [] #np.empty((len(tweets), num_features)) # dtype='float32')
    not_in_vocabulary = []
    i = 0
    for tw in tws:
        tw_vec = np.zeros((num_features,), dtype='float32')
        n_words = 0
        
        for w in tw.split():
            # careful! model.wv.vocab is not a complete list of unique words
            if (w in model.wv.vocab) and (w in vectorizer.vocabulary_): 
                n_words += 1
                vec = model.wv[w]*tfidf[w]
                tw_vec = np.add(tw_vec, vec)
                
        
        if (n_words > 0):
            tw_vec = np.divide(tw_vec, n_words)
            vectors.append(tw_vec.tolist()) #, axis=0)
        else:
            # not_in_vocabulary stores the tweets with no word known to both model.wv.vocab and the vectorizer
            not_in_vocabulary.append((i,tw)) 
        i += 1

    return (np.array(vectors),not_in_vocabulary)
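
The weighting used by ``tweets_to_avgvecs_with_tfidf`` can be sketched in plain Python. The sketch assumes the smoothed IDF formula that scikit-learn applies by default, $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and, unlike the cell that follows, it treats each tweet as one document. The helpers ``smooth_idf`` and ``idf_weighted_average`` and the toy data are hypothetical.

```python
import math

def smooth_idf(documents):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1, with one document per tweet."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

def idf_weighted_average(tokens, word_vectors, idf):
    """Sum idf-weighted word vectors and divide by the number of known words."""
    known = [t for t in tokens if t in word_vectors and t in idf]
    if not known:
        return None
    dim = len(word_vectors[known[0]])
    total = [0.0] * dim
    for t in known:
        for k in range(dim):
            total[k] += idf[t] * word_vectors[t][k]
    return [x / len(known) for x in total]

toy_tweets = ["coupure eau sfax", "eau potable", "eau gafsa"]
toy_vectors = {"eau": [1.0, 0.0], "coupure": [0.0, 1.0]}

idf = smooth_idf(toy_tweets)
vec = idf_weighted_average("coupure eau".split(), toy_vectors, idf)
print(idf["eau"], idf["coupure"])
print(vec)
```

``eau`` appears in every toy tweet and gets the minimum weight of 1, while the rarer ``coupure`` gets a larger weight, so it contributes more to the tweet vector. As in the function above, the sum is divided by the number of words rather than by the sum of the weights.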

In [54]:
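vectorizer = TfidfVectorizer()
# note: newcorpus.words() yields single tokens, so each token is treated as a one-word document
# and idf_ reflects how often a token occurs in the whole corpus
vectorizer.fit_transform(newcorpus.words())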

<7465x2375 sparse matrix of type '<class 'numpy.float64'>'
	with 7420 stored elements in Compressed Sparse Row format>

In [55]:
vectorizer.idf_.shape

(2375,)

In [56]:
size = w2v_model.vector_size
tws = data_clean_tweets_in_vocab.TwContent
tw_vectors, nv = tweets_to_avgvecs_with_tfidf(w2v_model,vectorizer,tws,size)

In [57]:
tw_vectors.shape

(535, 50)

In [58]:
tweet_avg_w2v_tfidf = replace_column(data_clean_tweets_in_vocab,"TwVec",list(tw_vectors))
tweet_avg_w2v_tfidf.shape

(535, 11)

In [59]:
#tweet_avg_w2v_tfidf

#### Strategy 3: doc2vec

In [60]:
def tag_tweets(tws):
    '''
    - Description:
    tag_tweets generates tweets along with iterable tags, 
    which are the input document format needed by Doc2Vec
    - Input:
    tws: a list of tweet where each tweet is tokenized (composed of a list of words)
    - Output:
    docs: tagged tweets
    - History:
    July 17, 2019 --> implementation
    '''
    docs = []
    i = 0
    for tw in tws:
        document = TaggedDocument(tw, ["t"+str(i)])
        docs.append(document)
        i += 1
    return docs

In [61]:
# we make tagged tweets
docs = tag_tweets(newcorpus.sents())

In [62]:
# d2v vector size is 300, context window 3, min occurrence is 3, epochs for computation is 20
d2v = gensim.models.Doc2Vec(vector_size=300, window=3, min_count=3, epochs=20)
d2v.build_vocab(docs)

In [63]:
# d2v model is trained over the entire corpus (size is d2v.corpus_count)
# argument epochs is required, otherwise an error is raised
d2v.train(docs, total_examples=d2v.corpus_count, epochs=d2v.epochs)

In [64]:
# to obtain vector representation of a sentence:
# first tokenize the sentence
# pass the list of words to function infer_vector
#d2v.infer_vector(["coupure","eau","gafsa"])

In [65]:
# d2v saves the vector representation in d2v.docvecs
# we transform the vectors to a list for easy manipulation
tw_dic = dict(zip(d2v.docvecs.doctags, d2v.docvecs))
vectors = []
for _,vec in tw_dic.items():
    vectors.append(vec)

In [66]:
tweet_d2v= replace_column(data_clean_tweets_in_vocab,"TwVec", vectors)
tweet_d2v.shape

(535, 11)

In [67]:
#tweet_d2v.head()

### tweet2vector representation (fastText pretrained vectors)

In [68]:
def check_in_vocabulary_ft(model,tws):
    '''
    - Description:
    check_in_vocabulary_ft checks which words are not in the model's vocabulary. 
    These words cause errors as they cannot be converted to vectors.
    - Input:
    model: the trained fasttext model
    tws: clean tweets
    - Output:
    not_in_vocabulary: list of (row_index,tweet) where tweet is composed of words 
    that are not in the model's vocabulary.
    - History:
    July 16, 2019 --> implementation, we choose to ignore the words that are not in the vocabulary (OOV).
    to do: merge with function check_in_vocabulary
    '''
    vocab = model.get_words()
    not_in_vocabulary = []
    i = 0
    for tw in tws:
        n_words = 0
        for w in tw.split():
            if w in vocab: 
                n_words += 1
        
        if n_words == 0:
            not_in_vocabulary.append((i,tw)) 
        i += 1
    
    return not_in_vocabulary

In [69]:
def tweets_to_ft_avgvecs(model,tws,num_features):
    '''
    - Description:
    tweets_to_ft_avgvecs computes the vector representation of a tweet based on the pretrained fastText vectors for French. 
    To that end, it sums the vector representations of words, then divides by the number of words.
    Out-of-vocabulary (OOV) words are ignored.
    numpy arrays are used for efficient array arithmetic
    - Input:
    model: the loaded fastText model
    tws: clean tweets
    num_features: the size of the vector representation
    - Output:
    vectors: numpy array of stacked vector representations of tweets
    not_in_vocabulary: tweets that are not in the vocabulary
    - History:
    July 16, 2019 --> implementation, 
    to fix: 
    a) one function for different models: e.g. w2v_avg, w2v_avg_tfidf, w2v_avg_ft
    b) what to do with OOV?
    '''
    vocab = model.get_words()
    vectors = [] #np.empty((len(tweets), num_features)) # dtype='float32')
    not_in_vocabulary = []
    i = 0
    for tw in tws:
        tw_vec = np.zeros((num_features,), dtype='float32')
        n_words = 0
        
        for w in tw.split():
            if w in vocab: 
                n_words += 1
                tw_vec = np.add(tw_vec, model[w])
                
        
        if (n_words > 0):
            tw_vec = np.divide(tw_vec, n_words)
            vectors.append(tw_vec.tolist()) #, axis=0)
        else:
            # not_in_vocabulary stores the tweets composed only of words not in the fastText vocabulary
            not_in_vocabulary.append((i,tw)) 
        i += 1

    return (np.array(vectors),not_in_vocabulary)

In [70]:
# we load the pretrained fastText model from the file fast
# warning: slow computation ~ 30s (sometimes 19s)
start = time.time()
fasttext_model = fasttext.load_model(fast)
end = time.time()
end - start




19.700091123580933

In [71]:
# size of the vocabulary - obviously larger than water corpus
len(fasttext_model.get_words())

2000000

In [72]:
# dimension of the vector representation
fasttext_model["eau"].shape

(300,)

In [73]:
# tweets that are not in the vocabulary, i.e. composed of OOV only
# warning: slow computation ~ 34s 
start = time.time()
nv = check_in_vocabulary_ft(fasttext_model,clean_tweets)
end = time.time()
end - start

31.85414409637451
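
A likely cause of the slow computation is the test ``w in vocab``: if ``model.get_words()`` returns a Python list (an assumption here), every lookup scans up to 2,000,000 entries. The sketch below compares list and set membership on a made-up vocabulary; ``count_known`` and the generated words are hypothetical, and the timings depend on the machine.

```python
import time

def count_known(tokens, vocab):
    """Count the tokens that belong to vocab (any container supporting `in`)."""
    return sum(1 for t in tokens if t in vocab)

# Hypothetical stand-in for the list returned by model.get_words()
vocab_list = ["w{}".format(i) for i in range(100000)]
vocab_set = set(vocab_list)  # built once, then reused for every tweet
tokens = ["w99999", "inconnu", "w0"] * 20

start = time.time()
n_list = count_known(tokens, vocab_list)  # linear scan for each lookup
t_list = time.time() - start

start = time.time()
n_set = count_known(tokens, vocab_set)    # hash lookup for each token
t_set = time.time() - start

print(n_list, n_set)
print("list: {:.4f}s, set: {:.6f}s".format(t_list, t_set))
```

Both containers give the same count, so converting the vocabulary once with ``set(model.get_words())`` would not change the results of ``check_in_vocabulary_ft`` or ``tweets_to_ft_avgvecs``, only their running time.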

In [74]:
indexes = [i for (i, _) in nv]
data_clean_tweets_in_vocab = drop_not_in_vocabulary(data_clean_tweets,indexes)

In [75]:
data_clean_tweets_in_vocab.shape

(535, 10)

In [76]:
# computing vector representation of tweets (using fasttext model)
# warning: slow computation ~ 32s
tws = data_clean_tweets_in_vocab.TwContent
start = time.time()
tw_vectors, nv = tweets_to_ft_avgvecs(fasttext_model,tws,300)
end = time.time()
end - start

31.29478621482849

In [77]:
tweet_avg_ft = replace_column(data_clean_tweets_in_vocab,"TwVec",list(tw_vectors))
tweet_avg_ft.shape

(535, 11)