
## Implementing Word2Vec Embedding

## Aim: To convert the given sequences to vectors

## Objective:
### To understand the context of the given sequences.
### To find the similarity between the words in the corpus.

## Dataset: https://www.kaggle.com/datasets/sulphatet/twitter-financial-news - Product News (0) and Stock Market Commentary (1)


#### Word2Vec Embedding using Gensim

## Importing the libraries

In [None]:
import numpy as np
import pandas as pd
import gensim
import nltk
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Loading the data

In [None]:
data=pd.read_csv("train_data.csv")
data.head()

Unnamed: 0,text,label
0,$HOUR flagging here below the squeeze level to...,1
1,$SPY closed just above 2 mo channel &amp; 10d ...,1
2,$VLCN going green.....,1
3,$QQQ - QQQ: It's Make It Or Break It For The S...,1
4,Nike college apparel will have 'faster speed t...,0


## Preprocessing the data using Gensim

In [None]:
processed_text = data["text"].apply(gensim.utils.simple_preprocess)
processed_text

0       [hour, flagging, here, below, the, squeeze, le...
1       [spy, closed, just, above, mo, channel, amp, m...
2                                    [vlcn, going, green]
3       [qqq, qqq, it, make, it, or, break, it, for, t...
4       [nike, college, apparel, will, have, faster, s...
                              ...                        
5658    [codenotary, introduces, first, continuous, ba...
5659    [msc, cruises, has, set, its, sights, on, the,...
5660    [rcm, rcm, long, term, growth, story, intact, ...
5661    [cety, clean, energy, technologies, files, for...
5662    [intu, orcl, the, better, buy, in, business, s...
Name: text, Length: 5663, dtype: object
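
The output above shows that `simple_preprocess` lowercases the text and drops punctuation, digits and one-character tokens. The sketch below is a rough pure-Python approximation of that behaviour, not Gensim's actual implementation. The function name `approx_simple_preprocess` and the regular expression are assumptions made for illustration, and the 2 to 15 character length limits are assumed to match Gensim's defaults.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Hypothetical approximation of gensim.utils.simple_preprocess."""
    # Lowercase, then pull out runs of letters (digits and punctuation act as separators).
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Keep tokens within the length limits; skip tokens that start with an underscore.
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

print(approx_simple_preprocess("$VLCN going green....."))
# ['vlcn', 'going', 'green']
```

This also suggests why numeric tokens such as `2` or `400` are absent from the Gensim-preprocessed text.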

## Model Building

In [None]:
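# window: maximum distance between a target word and its context words
# min_count: words seen fewer than 2 times are dropped from the vocabulary
# All other hyperparameters keep their Gensim defaults
model = gensim.models.Word2Vec(window=10, min_count=2)
model.build_vocab(processed_text)
model.train(processed_text, total_examples=model.corpus_count, epochs=model.epochs)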

## Finding the most similar words to Market

In [None]:
model.wv.most_similar("market")

[('buying', 0.9997498989105225),
 ('potential', 0.9997010231018066),
 ('portfolio', 0.9996929168701172),
 ('don', 0.9996792078018188),
 ('outlook', 0.9996621012687683),
 ('term', 0.999644935131073),
 ('cash', 0.9996039271354675),
 ('risk', 0.9995737075805664),
 ('scanx', 0.9995546936988831),
 ('bullish', 0.9995248317718506)]

## Cosine Similarity between Market and Economy

In [None]:
model.wv.similarity(w1 = "market", w2 = "economy")

0.9242024
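
The score above is the cosine of the angle between the two word vectors: their dot product divided by the product of their lengths. A minimal sketch of that calculation in plain Python follows. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not values taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score close to 1.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
# Perpendicular vectors score 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With a trained model, the same function could be applied to `model.wv["market"]` and `model.wv["economy"]`.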

## Cosine Similarity between Market and Trade

In [None]:
model.wv.similarity(w1 = "market", w2 ="trade")

0.9907693

## Preprocessing using nltk libraries

In [None]:
cleaned_data = []
# Use a new name so the imported `stopwords` module is not overwritten (re-running the cell would otherwise fail)
stop_words = set(stopwords.words("english"))
for text in data["text"]:
  text = re.sub(r"https\S+", "", text) # remove https links
  text = re.sub("[^a-zA-Z0-9]", " ", text) # keep only letters and digits
  text = nltk.word_tokenize(text.lower()) # lowercase and tokenize
  text = [word for word in text if word not in stop_words] # drop stopwords
  # text = " ".join(text)
  cleaned_data.append(text)

In [None]:
for i in range(0,5):
  print(cleaned_data[i], end= "\n\n")

['hour', 'flagging', 'squeeze', 'level', 'accumulate', 'b4', 'spikes', 'still', 'high', 'alert', 'low', 'float', 'short', 'squeeze', 'play']

['spy', 'closed', '2', 'mo', 'channel', 'amp', '10d', 'next', 'res', '400', 'qqq', 'closed', '1', 'mo', 'desc', 'tl', 'see', 'consolidate', 'sideways', 'continue', 'th', 'zs', 'fri', 'core', 'pce', '5', '30', 'us', 'mkt', 'closed', '6', '1', 'fed', 'qt', 'starts', '6', '2', 'crwd']

['vlcn', 'going', 'green']

['qqq', 'qqq', 'make', 'break', 'summer', 'rally', 'markets', 'economy', 'trading']

['nike', 'college', 'apparel', 'faster', 'speed', 'market', 'new', 'deal', 'joshschafer']



## Model Building

In [None]:
model1 = gensim.models.Word2Vec(window =10, min_count = 2)
model1.build_vocab(cleaned_data)
model1.train(cleaned_data, total_examples =model1.corpus_count, epochs = model1.epochs )

(278810, 322915)
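
The pair printed above is, as far as the Gensim documentation describes it, the effective word count followed by the raw word count, both summed over all epochs. The effective count is lower partly because words below `min_count` are skipped. The sketch below illustrates only that effect on a toy corpus. The function `count_words` and the data are hypothetical, and the frequent-word downsampling that Gensim also applies is ignored.

```python
from collections import Counter

def count_words(corpus, min_count=2, epochs=5):
    """Return (kept tokens, raw tokens), each summed over the epochs."""
    freq = Counter(tok for sent in corpus for tok in sent)
    # Vocabulary keeps only words that occur at least min_count times.
    vocab = {w for w, c in freq.items() if c >= min_count}
    raw = sum(len(sent) for sent in corpus)
    kept = sum(1 for sent in corpus for tok in sent if tok in vocab)
    return kept * epochs, raw * epochs

toy_corpus = [["market", "rally"], ["market", "dips"], ["rally", "fades"]]
print(count_words(toy_corpus))
# (20, 30)
```

Here "dips" and "fades" occur once each, so 4 of the 6 tokens survive per epoch.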

## Similar Words to Market

In [None]:
model1.wv.most_similar("market")

[('company', 0.9999471306800842),
 ('stock', 0.9999435544013977),
 ('2022', 0.9999397993087769),
 ('high', 0.9999319314956665),
 ('price', 0.9999316930770874),
 ('time', 0.9999316334724426),
 ('amp', 0.9999293088912964),
 ('could', 0.9999256730079651),
 ('industry', 0.999923825263977),
 ('u', 0.9999204277992249)]

## Cosine Similarity between Market and Economy

In [None]:
model1.wv.similarity(w1 = "market", w2 = "economy")

0.99841577

# Conclusion:
## The dataset consists of Product News and Stock Market Commentary, labelled 0 and 1 respectively. We used the gensim library to implement Word2Vec. The first model was trained on text preprocessed with gensim's built-in `simple_preprocess` method; its ten nearest neighbours of "market" all have a cosine similarity above 0.999. The second model, trained on text preprocessed with nltk, returns a different set of neighbours, again with similarities above 0.999. The cosine similarity between "market" and "economy" is about 0.92 in the first model and about 0.998 in the second. Scores this close to 1 for so many words may indicate that the embeddings have not yet separated well on a corpus of this size, so they should be read with caution.