# Word Embeddings em um Dataset Aberto

## Terceiro Exercício Prático


Neste TP você deve explorar e criar os embeddings utilizando uma base de dados de sua preferência. Uma sugestão para encontrar bases interessantes seria a partir das competições do [Kaggle](https://www.kaggle.com/).

Para essa prática, solicito que o aluno prepare a base de dados e gere os embeddings utilizando obrigatoriamente o algoritmo Word2Vec. Dependendo do contexto da base, você pode utilizar o Doc2Vec em vez do Word2Vec ou ambos. 

As 3 etapas descritas abaixo devem ser seguidas obrigatoriamente:

1. Preparação da base de dados assim como visto na prática anterior.
2. Execução do Modelo Word2Vec usando o Gensim, ou outra implementação similar.
3. Teste do seu embedding assim como foi realizado na [demo](https://github.com/gesteves91/nlp/blob/master/notebooks/06-word2vec.ipynb).


Para o trabalho foi capturado os tweets sobre o ex-presidente Lula no dia em que recebeu autorização para deixar a cadeia. Especificamos o dia 08/11/2019 (data em que ele saiu), restringindo a localização na região de Belo Horizonte, num raio de 10.000 km.

Documentação:<br> 
https://tweepy.readthedocs.io/en/latest/index.html

In [2]:
# bibliotecas
import numpy as np
import pandas as pd
import tweepy
import nltk
import demoji
import re
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from string import punctuation
from googletrans import Translator

In [3]:
# leitura das chaves 
with open('twitter-tokens.txt', 'r') as tfile:
    consumer_key = tfile.readline().strip('\n')
    consumer_secret = tfile.readline().strip('\n')
    access_token = tfile.readline().strip('\n')
    access_token_secret = tfile.readline().strip('\n')

In [None]:
# variaveis para fazer o login
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)

In [None]:
# Query passando a busca e excluindo os retweets.
query_search = 'Lula' + '-filter:retweets'

#Gerando o cursor de busca.
cursor_tweets = tweepy.Cursor(api.search, q=query_search, tweet_mode='extended',lang="pt", 
                              since= "2019-11-07", until= "2019-11-09",
                              geocode='-19.9026615,-44.1041363,10000km').items(10000)

In [None]:
tw = []
for tweet in cursor_tweets:
    tw.append([tweet.created_at, tweet.full_text])

In [None]:
df = pd.DataFrame(tw, columns=['Data','Tweet'])

In [None]:
# Convertendo a data
df['Data'] = pd.to_datetime(df['Data'])
df['Data'] = df['Data'].dt.strftime('%d-%m-%Y')

In [None]:
# Salvando em um arquivo csv.
df.to_csv('bh.csv')

In [None]:
df.shape

# Preparando a base de dados

In [6]:
df_bh = pd.read_csv('bh3.csv')

In [7]:
df_bh.shape

(6750, 3)

In [82]:
df_bh.head()

Unnamed: 0.1,Unnamed: 0,Data,Tweet
0,0,08-11-2019,Lula virou comediante na cadeia e saiu fazendo...
1,1,08-11-2019,Hoj o Lula transa com o super pênis dele https...
2,2,08-11-2019,@ana_claudiinha até o Lula tá beijando e você ...
3,3,08-11-2019,@ggreenwald Bravo! Bravo! Você tem grande impo...
4,4,08-11-2019,"Vou me ausentar desse site, quando pararem de ..."


In [31]:
df = df_bh[['Data', 'Tweet']].copy()

# Removendo stopwords

In [32]:
stopwords = set(nltk.corpus.stopwords.words('portuguese'))

In [33]:
def remov_stopwords(text):
    text = text.lower()
    palavras = [i for i in text.split() if not i in stopwords]
    return (" ".join(palavras))

In [34]:
df['Tweet'] = df.apply(lambda row: remov_stopwords(row['Tweet']), axis=1)

# Removendo links

In [35]:
df['Tweet'] = df.apply(lambda x: re.sub(r"http\S+", "", x['Tweet']), axis=1)

# Substituindo emojis

In [36]:
def remove_emoji(emoji):
    rep = demoji.findall(emoji)
    re = demoji.replace(emoji)
    if any(rep) == False:
        return re
    else:
        for x in rep:
            text = re + rep[x]
            return text.replace("  ", " ")

In [37]:
df['Tweet'] = df.apply(lambda x: remove_emoji(x['Tweet']), axis=1)

# Tokenização

In [38]:
# Tokenizando as frases.
df['Tokens'] = df.apply(lambda x: word_tokenize(x['Tweet'], language='portuguese'), axis=1)

In [39]:
# Removendo pontuação
pontos = list(punctuation)

def remove_pont(tweets):
    return(x for x in tweets if not x in pontos)

In [40]:
df['Tokens'] = df.apply(lambda x: remove_pont(x['Tokens']), axis=1)

# Lematização

In [41]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [42]:
# Lemmatizando os tweets.
def lemmatize_func(mylist):
    return [lemmatizer.lemmatize(w) for w in mylist]

df['Tokens'] = df.apply(lambda row: lemmatize_func(row['Tokens']), axis=1)

# Stemming

In [43]:
stemming = nltk.stem.RSLPStemmer()

In [44]:
def stemming_func(mylist):
    return [stemming.stem(w) for w in mylist]

df['Tokens'] = df.apply(lambda row: stemming_func(row['Tokens']), axis=1)

# Word2Vec com Gensim

In [45]:
# imports das bibliotecas
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [47]:
documents = df['Tokens']

In [49]:
documents.value_counts()

[lul, livr]                                                                                                                           52
[lul, sai, sext]                                                                                                                      24
[chor, livr, lul]                                                                                                                     20
[lul, tá, livr, babac]                                                                                                                18
[lul, tá, solt, babac]                                                                                                                13
[lul, sai, sext, nad]                                                                                                                 10
[sext, lul, livr]                                                                                                                      7
[lul]                                    

In [48]:
# Treinando o modelo
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
model.train(documents,total_examples=len(documents),epochs=10)

2019-11-12 15:01:24,037 : INFO : collecting all words and their counts
2019-11-12 15:01:24,037 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-11-12 15:01:24,062 : INFO : collected 8225 word types from a corpus of 73055 raw words and 6750 sentences
2019-11-12 15:01:24,064 : INFO : Loading a fresh vocabulary
2019-11-12 15:01:24,091 : INFO : effective_min_count=2 retains 3616 unique words (43% of original 8225, drops 4609)
2019-11-12 15:01:24,092 : INFO : effective_min_count=2 leaves 68446 word corpus (93% of original 73055, drops 4609)
2019-11-12 15:01:24,108 : INFO : deleting the raw counts dictionary of 8225 items
2019-11-12 15:01:24,110 : INFO : sample=0.001 downsamples 49 most-common words
2019-11-12 15:01:24,111 : INFO : downsampling leaves estimated 55245 word corpus (80.7% of prior 68446)
2019-11-12 15:01:24,135 : INFO : estimated required memory for 3616 words and 150 dimensions: 6147200 bytes
2019-11-12 15:01:24,136 : INFO : resetting layer weigh

2019-11-12 15:01:25,299 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-11-12 15:01:25,328 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-11-12 15:01:25,341 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-11-12 15:01:25,356 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-11-12 15:01:25,374 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-12 15:01:25,398 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-12 15:01:25,405 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-12 15:01:25,406 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-12 15:01:25,407 : INFO : EPOCH - 2 : training on 73055 raw words (55335 effective words) took 0.2s, 347001 effective words/s
2019-11-12 15:01:25,465 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-11-12 15:01:25,475 : INFO : worker thread f

2019-11-12 15:01:26,563 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-11-12 15:01:26,600 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-11-12 15:01:26,613 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-11-12 15:01:26,637 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-11-12 15:01:26,642 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-11-12 15:01:26,647 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-12 15:01:26,668 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-12 15:01:26,671 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-12 15:01:26,676 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-12 15:01:26,677 : INFO : EPOCH - 10 : training on 73055 raw words (55271 effective words) took 0.1s, 424116 effective words/s
2019-11-12 15:01:26,678 : INFO : training on a 

(552785, 730550)

In [67]:
# Procurando palavras semelhantes
w1 = ['livr']
model.wv.most_similar(positive=w1)

[('marciolimaf8', 0.9392938017845154),
 ('tbm', 0.9371939897537231),
 ('barretorec', 0.9338074922561646),
 ('bolsominiom', 0.9323339462280273),
 ('ein', 0.9316583871841431),
 ('bord', 0.9245335459709167),
 ('abrahamweint', 0.9155640602111816),
 ('carlux', 0.9146286249160767),
 ('tmb', 0.9114054441452026),
 ('amo', 0.910111665725708)]

In [72]:
# vamos ver as 5 palavras mais similares a 'lul'
w1 = ["lul"]
model.wv.most_similar(positive=w1, topn=5)

[('moh', 0.7590863704681396),
 ('remédi', 0.755330502986908),
 ('fest', 0.7552500367164612),
 ('seri', 0.7528912425041199),
 ('foguet', 0.7526100277900696)]

In [74]:
# vamos ver as 5 palavras mais similares a 'sext'
w1 = ["sext"]
model.wv.most_similar (positive=w1,topn=6)

[('boc', 0.9440861940383911),
 ('feir', 0.9391844868659973),
 ('noil', 0.932273268699646),
 ('hj', 0.9317667484283447),
 ('beij', 0.9312781095504761),
 ('bebemor', 0.9282745122909546)]

In [75]:
# palavras mais relacionadas 
w1 = ['lul', 'sai', 'sext']
model.wv.most_similar (positive=w1,topn=10)

[('noil', 0.9658463001251221),
 ('feir', 0.9630030989646912),
 ('noiv', 0.9605275988578796),
 ('bebemor', 0.9588722586631775),
 ('ate', 0.9584187865257263),
 ('fas', 0.9567884206771851),
 ('não', 0.9537703394889832),
 ('colab', 0.9515069723129272),
 ('sel', 0.9507938027381897),
 ('plenum', 0.9492806792259216)]

# Similaridade entre palavras

In [76]:
# similaridade de duas palavras diferentes
model.wv.similarity(w1="lul", w2="liv")

0.70240027

In [78]:
# similaridades de duas palavras idênticas
model.wv.similarity(w1="lul", w2="lul")

1.0

In [80]:
# similaridade de duas palavras opostas
model.wv.similarity(w1="lul", w2="bolsonar")

0.4332605