# Neural Embedding


In this notebook we train a neural embedding using Word2Vec on the financial news dataset.


In [1]:
import pandas as pd
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
df_original = pd.read_csv('/content/drive/MyDrive/Mentoria/news_dataset.csv')

In [4]:
#df =pd.read_csv("sentPositiveall-data.csv", names=["feeling", "news"], encoding='latin-1')

In [5]:
df_original.head()

Unnamed: 0,id,ticker,title,category,content,release_date,provider,url,article_id
0,221515,NIO,Why Shares of Chinese Electric Car Maker NIO A...,news,What s happening\nShares of Chinese electric c...,2020-01-15,The Motley Fool,https://invst.ly/pigqi,2060327
1,221516,NIO,NIO only consumer gainer Workhorse Group amon...,news,Gainers NIO NYSE NIO 7 \nLosers MGP Ingr...,2020-01-18,Seeking Alpha,https://invst.ly/pje9c,2062196
2,221517,NIO,NIO leads consumer gainers Beyond Meat and Ma...,news,Gainers NIO NYSE NIO 14 Village Farms In...,2020-01-15,Seeking Alpha,https://invst.ly/pifmv,2060249
3,221518,NIO,NIO NVAX among premarket gainers,news,Cemtrex NASDAQ CETX 85 after FY results \n...,2020-01-15,Seeking Alpha,https://invst.ly/picu8,2060039
4,221519,NIO,PLUG NIO among premarket gainers,news,aTyr Pharma NASDAQ LIFE 63 on Kyorin Pharm...,2020-01-06,Seeking Alpha,https://seekingalpha.com/news/3529772-plug-nio...,2053096


In [6]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221513 entries, 0 to 221512
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            221513 non-null  int64 
 1   ticker        221513 non-null  object
 2   title         221513 non-null  object
 3   category      221513 non-null  object
 4   content       221505 non-null  object
 5   release_date  221513 non-null  object
 6   provider      221513 non-null  object
 7   url           221513 non-null  object
 8   article_id    221513 non-null  int64 
dtypes: int64(2), object(7)
memory usage: 15.2+ MB


## Train Word2Vec

Preprocessing

We preprocess the dataset to remove punctuation marks and stop words.

In [7]:
# Keep the title and the content of each news item (missing values become a blank)
X = df_original['title'].fillna(' ') + ' ' + df_original['content'].fillna(' ')
X.shape

(221513,)

In [8]:
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
stop_words=set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We now apply the preprocessing to every news item.

In [9]:
import string
from nltk.tokenize import word_tokenize

news_lines = []
lines = X.tolist()

# Translation table that deletes every punctuation character (built once)
table = str.maketrans('', '', string.punctuation)

for line in lines:
  tokens = word_tokenize(line)
  # convert to lower case
  tokens = [w.lower() for w in tokens]
  # remove punctuation from each token
  stripped = [w.translate(table) for w in tokens]
  # keep only purely alphabetic tokens
  words = [word for word in stripped if word.isalpha()]
  # filter out stop words
  words = [w for w in words if w not in stop_words]
  news_lines.append(words)

In [32]:
len(news_lines)

221513
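
The cleaning steps used in the preprocessing loop can be tried on a single sentence without downloading any NLTK data. The sketch below is only an illustration: it replaces `word_tokenize` with plain whitespace splitting and uses a tiny hypothetical stop-word set (`TOY_STOP_WORDS`) instead of NLTK's English list, so its output will not match the notebook's tokenizer exactly.

```python
import string

# Hypothetical mini stop-word list for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"the", "of", "a", "and", "in", "are", "to"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tokens(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation, keep alphabetic non-stop-word tokens."""
    # Whitespace splitting is a simplified stand-in for nltk.word_tokenize.
    tokens = text.lower().split()
    stripped = (tok.translate(PUNCT_TABLE) for tok in tokens)
    return [tok for tok in stripped if tok.isalpha() and tok not in stop_words]


sample = "Shares of NIO (NYSE: NIO) are up 7% in 2020!"
cleaned = clean_tokens(sample)
print(cleaned)  # ['shares', 'nio', 'nyse', 'nio', 'up']
```

Numbers such as `7` and `2020` disappear because of the `isalpha()` filter, which is why the trained vocabulary contains tickers and words but no figures.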

In [14]:
# Largest number of tokens in a single news item
# (avoids shadowing the built-ins `max` and `list`)
max_len = max(len(tokens) for tokens in news_lines)
print(max_len)

11536


We analyze the word frequencies.

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
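def dummy(doc):
    # Identity function: the documents are already lists of tokens
    return doc

# Bypass scikit-learn's own tokenization and preprocessing
vectorizer = CountVectorizer(
        tokenizer=dummy,
        preprocessor=dummy,
    )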

vectorizer.fit(news_lines)



CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1),
                preprocessor=<function dummy at 0x7f8d99e82200>,
                stop_words=None, strip_accents=None,
                token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function dummy at 0x7f8d99e82200>, vocabulary=None)

In [34]:
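# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use vectorizer.get_feature_names_out()
names = vectorizer.get_feature_names()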

In [35]:
len(names)

261785

In [36]:
cv= vectorizer.transform(news_lines)

In [19]:
cv.shape

(221513, 261785)

In [20]:
counts=((cv.sum(axis=0)))

In [22]:
counts.shape

(1, 261785)

In [27]:
# Map every vocabulary word to its total count across the corpus
count_dict = {}
for i in range(len(names)):
  count_dict[names[i]] = int(counts[0, i])

In [63]:
len(count_dict)

261785

In [29]:
df_counts=pd.DataFrame.from_dict(count_dict,orient='index')

In [64]:
df_counts.head()

Unnamed: 0,0
aa,1661
aaa,874
aaaa,1
aaaai,1
aaae,1


In [70]:
(df_counts[0] > 5).sum()

86769

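Based on the word frequencies, we decided to train the embedding only on words that appear at least 5 times in the corpus (the `min_count=5` setting of Word2Vec). Note that the cell above counts words with strictly more than 5 occurrences, so its result is somewhat smaller than the vocabulary that `min_count=5` retains.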
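
The difference between counting words with more than 5 occurrences and keeping words with at least 5 occurrences can be seen on a toy corpus. The following is a minimal sketch using only the standard library; `toy_docs` and `vocab_size` are hypothetical names that stand in for the tokenized news and for the frequency filter.

```python
from collections import Counter

# Hypothetical tokenized documents, standing in for the tokenized news.
toy_docs = [
    ["nio", "shares", "rally"],
    ["nio", "shares", "fall"],
    ["nio", "shares", "nio"],
    ["nio", "shares", "nio"],
    ["shares", "rally", "nio"],
]

# Total number of occurrences of each token over all documents.
word_counts = Counter(tok for doc in toy_docs for tok in doc)


def vocab_size(counts, min_count):
    """Number of distinct words that occur at least `min_count` times."""
    return sum(1 for c in counts.values() if c >= min_count)


strictly_more_than_5 = sum(1 for c in word_counts.values() if c > 5)
at_least_5 = vocab_size(word_counts, 5)

print(word_counts["nio"], word_counts["shares"])  # 7 5
print(strictly_more_than_5, at_least_5)           # 1 2
```

Here `shares` occurs exactly 5 times, so it is kept by an "at least 5" filter but not counted by a "more than 5" comparison.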

### Word2Vec


In [71]:
# Use 100-dimensional word vectors
EMBEDDING_DIM=100

In [72]:
import gensim

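# Train the Word2Vec model; words with fewer than 5 occurrences are ignored.
# Note: `size` is the gensim 3.x name; gensim >= 4.0 calls it `vector_size`.
modelg = gensim.models.Word2Vec(sentences=news_lines,
                                size=EMBEDDING_DIM,
                                window=5, workers=4, min_count=5)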

In [75]:
# Vocabulary size
# Note: `wv.vocab` is the gensim 3.x attribute; gensim >= 4.0 uses `wv.key_to_index`.
words = modelg.wv.vocab
print('vocabulary size:%d' % len(words))

vocabulary size:94966


In [85]:
modelg.wv.most_similar('msft')

[('microsoft', 0.8490878343582153),
 ('intc', 0.7448422908782959),
 ('googl', 0.6724490523338318),
 ('cmcsa', 0.6469652652740479),
 ('amzn', 0.6334707736968994),
 ('aapl', 0.6233265399932861),
 ('fb', 0.620248019695282),
 ('symc', 0.6192315816879272),
 ('sbux', 0.6154654026031494),
 ('csco', 0.6142740249633789)]
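
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of how such a ranking comes about, the sketch below computes it by hand on made-up 3-dimensional vectors; `toy_vectors` and its values are hypothetical and unrelated to the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example.
toy_vectors = {
    "msft": [1.0, 2.0, 0.0],
    "microsoft": [1.0, 2.0, 0.1],
    "intc": [1.0, 1.0, 1.0],
    "sbux": [-1.0, 0.5, 2.0],
}


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar("msft", toy_vectors)
for other, score in ranking:
    print(f"{other}: {score:.4f}")
```

With these toy values, `microsoft` ranks first because its vector points in almost the same direction as `msft`, while `sbux` is orthogonal to it and scores 0.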

We save the trained embeddings.

In [78]:
# Save the word vectors in the plain-text word2vec format
filename = 'Embedding_FinancialNews_word2vec.txt'
modelg.wv.save_word2vec_format(filename, binary=False)

In [99]:
# Alternative (disabled): save the vectors in gensim's native format
#filename='Embedding_FinancialNews_word2vec_format2.txt'
#modelg.wv.save(filename)
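
The text file written by `save_word2vec_format(..., binary=False)` starts with a header line containing the vocabulary size and the vector dimension, followed by one line per word with its components separated by spaces. The sketch below parses that layout with the standard library only; `load_word2vec_text` is a hypothetical helper and the toy file contents are invented. In practice gensim's `KeyedVectors.load_word2vec_format(filename, binary=False)` would be the usual way to reload the file.

```python
import io


def load_word2vec_text(handle):
    """Parse the plain-text word2vec format into a {word: vector} dict."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if parts == [""]:
            continue  # skip empty lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}")
        vectors[word] = values

    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the file")
    return vectors


# Invented two-word, three-dimensional file for illustration.
toy_file = io.StringIO("2 3\nmsft 0.1 0.2 0.3\naapl -0.4 0.5 0.6\n")
embeddings = load_word2vec_text(toy_file)
print(embeddings["aapl"])  # [-0.4, 0.5, 0.6]
```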