# Vector Semantics – Part 2

# What is Word Embedding?

Word embedding is a language modeling technique for mapping words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. Word embeddings can be generated using various methods, such as neural networks, co-occurrence matrices, and probabilistic models.
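
As a minimal sketch of what "mapping words to vectors" means, the snippet below uses a hand-written lookup table. The words, the 3-dimensional vectors, and the helper name `embed` are all made up for illustration; real embeddings are learned from data and usually have tens or hundreds of dimensions.

```python
# Toy embedding table: each word maps to a vector of real numbers.
# The values are invented for illustration, not learned from a corpus.
embeddings = {
    "king":  [0.80, 0.65, 0.10],
    "queen": [0.78, 0.70, 0.12],
    "apple": [0.05, 0.10, 0.90],
}

def embed(sentence):
    """Return the vector of every known word in a whitespace-split sentence."""
    return [embeddings[w] for w in sentence.lower().split() if w in embeddings]

vectors = embed("King and queen")
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Words missing from the table ("and" here) are simply skipped in this sketch.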

# What is Word2Vec?

Word2Vec is a family of models for generating word embeddings. These models are shallow two-layer neural networks with one input layer, one hidden layer, and one output layer. Word2Vec offers two architectures:

1. CBOW (Continuous Bag of Words)
2. Skip Gram
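
The two architectures differ mainly in what is used as input and what is predicted. The sketch below builds the training examples each one would see for a toy sentence. It is a simplification under stated assumptions: the function names are hypothetical, the window is fixed, and real implementations add further steps such as subsampling of frequent words.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: the whole context is the input, the centre word is the label.
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-Gram: the centre word is the input, each context word is a label.
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]

sentence = ["the", "senate", "passed", "the", "budget"]
cbow = cbow_examples(sentence, window=1)
skip = skipgram_examples(sentence, window=1)

print(cbow[1])   # (['the', 'passed'], 'senate')
print(skip[:3])  # [('the', 'senate'), ('senate', 'the'), ('senate', 'passed')]
```

CBOW produces one example per word, while Skip-Gram produces one example per (word, context word) pair, so it sees more training pairs from the same text.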

The basic idea behind word embeddings is that words occurring in similar contexts tend to lie closer to each other in vector space. To generate word vectors in Python, we need the nltk and gensim modules.

Run this command in a terminal to install them:
> pip install nltk gensim

We will use the Fake and Real News dataset for our experiment, with a small pre-processing step to review what we covered earlier. You can find the dataset here: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv


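> gensim.models.Word2Vec()

1. sentences: the tokenized sentences (a list of token lists).
2. vector_size: the dimensionality of the word vectors. This is a critical parameter; larger vectors require more training data.
3. window: the maximum distance between the current word and a predicted word, i.e. how many words to the left and right are considered as context while the window moves over the sentence.
4. min_count: words with a total frequency lower than this value are ignored.
5. workers: how many worker threads (CPU cores) to use for training the model.
6. sg: set to 1 to use the Skip-Gram model; 0 (the default) uses CBOW.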
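
To make `min_count` concrete: in gensim it is a frequency threshold, and words seen fewer times than that across the whole corpus are dropped from the vocabulary. The sketch below mimics that filtering with the standard library. `build_vocab` and the toy corpus are hypothetical and not part of gensim.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_count=1):
    """Keep only words whose total corpus frequency is at least min_count."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

corpus = [
    ["tax", "cuts", "pass", "senate"],
    ["senate", "debates", "tax", "bill"],
    ["tax", "bill", "stalls"],
]

vocab_1 = build_vocab(corpus, min_count=1)  # every word survives
vocab_2 = build_vocab(corpus, min_count=2)  # rare words are dropped

print(len(vocab_1))     # 7
print(sorted(vocab_2))  # ['bill', 'senate', 'tax']
```

With `min_count = 1`, as used later in this notebook, even words that occur once get a vector, although such vectors are trained on very little evidence.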

In [62]:
import pandas as pd
import nltk
import numpy as np
import gensim
import re

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

In [63]:
df = pd.read_csv('datasets/True.csv')
df.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


In [64]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[0-9]", '', text)
    text = re.sub(r"[)(,”“.’$-]", '', text)
    return text
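
The cleaning function deletes characters without replacing them, so a pattern like ` - ` leaves a double space behind (visible later in the printed data). The sketch below repeats the function and adds a hypothetical variant, `clean_text_compact`, that also collapses whitespace runs. This is optional here, since both `word_tokenize` and `split()` ignore repeated spaces.

```python
import re

def clean_text(text):
    # Same steps as the cell above: lowercase, drop digits, drop punctuation.
    text = text.lower()
    text = re.sub(r"[0-9]", '', text)
    text = re.sub(r"[)(,”“.’$-]", '', text)
    return text

def clean_text_compact(text):
    # Hypothetical variant: also collapse the whitespace left behind.
    return re.sub(r"\s+", " ", clean_text(text)).strip()

sample = "WASHINGTON (Reuters) - The head of a faction, in 2018."
print(repr(clean_text(sample)))
# 'washington reuters  the head of a faction in '
print(repr(clean_text_compact(sample)))
# 'washington reuters the head of a faction in'
```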

In [65]:
df['cleaned_text'] = df['text'].apply(clean_text)

In [66]:
df[['cleaned_text', 'text']].head()

| | cleaned_text | text |
|---|---|---|
| 0 | washington reuters the head of a conservative... | WASHINGTON (Reuters) - The head of a conservat... |
| 1 | washington reuters transgender people will be... | WASHINGTON (Reuters) - Transgender people will... |
| 2 | washington reuters the special counsel invest... | WASHINGTON (Reuters) - The special counsel inv... |
| 3 | washington reuters trump campaign adviser geo... | WASHINGTON (Reuters) - Trump campaign adviser ... |
| 4 | seattle/washington reuters president donald t... | SEATTLE/WASHINGTON (Reuters) - President Donal... |


In [67]:
data = df['cleaned_text'].tolist()
print(data[0:2])

['washington reuters  the head of a conservative republican faction in the us congress who voted this month for a huge expansion of the national debt to pay for tax cuts called himself a fiscal conservative on sunday and urged budget restraint in  in keeping with a sharp pivot under way among republicans us representative mark meadows speaking on cbs face the nation drew a hard line on federal spending which lawmakers are bracing to do battle over in january when they return from the holidays on wednesday lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues such as immigration policy even as the november congressional election campaigns approach in which republicans will seek to keep control of congress president donald trump and his republicans want a big budget increase in military spending while democrats also want proportional increases for nondefense discretionary spending on programs that support education scientific research infrast

In [68]:
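# word_tokenize needs NLTK's Punkt data; if it is missing, run
# nltk.download('punkt') once ('punkt_tab' on recent NLTK versions)
tokens = []

# tokenize every cleaned document into a list of words
for text in data:
    tokens.append(word_tokenize(text))

tokens[0]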

['washington',
 'reuters',
 'the',
 'head',
 'of',
 'a',
 'conservative',
 'republican',
 'faction',
 'in',
 'the',
 'us',
 'congress',
 'who',
 'voted',
 'this',
 'month',
 'for',
 'a',
 'huge',
 'expansion',
 'of',
 'the',
 'national',
 'debt',
 'to',
 'pay',
 'for',
 'tax',
 'cuts',
 'called',
 'himself',
 'a',
 'fiscal',
 'conservative',
 'on',
 'sunday',
 'and',
 'urged',
 'budget',
 'restraint',
 'in',
 'in',
 'keeping',
 'with',
 'a',
 'sharp',
 'pivot',
 'under',
 'way',
 'among',
 'republicans',
 'us',
 'representative',
 'mark',
 'meadows',
 'speaking',
 'on',
 'cbs',
 'face',
 'the',
 'nation',
 'drew',
 'a',
 'hard',
 'line',
 'on',
 'federal',
 'spending',
 'which',
 'lawmakers',
 'are',
 'bracing',
 'to',
 'do',
 'battle',
 'over',
 'in',
 'january',
 'when',
 'they',
 'return',
 'from',
 'the',
 'holidays',
 'on',
 'wednesday',
 'lawmakers',
 'will',
 'begin',
 'trying',
 'to',
 'pass',
 'a',
 'federal',
 'budget',
 'in',
 'a',
 'fight',
 'likely',
 'to',
 'be',
 'linked

In [69]:
Skip_gram_model = gensim.models.Word2Vec(tokens, min_count=1, vector_size=100, window=5, sg=1)

In [71]:
print("Cosine similarity between 'provide' and 'program' - Skip Gram : ",Skip_gram_model.wv.similarity('provide', 'program'))

Cosine similarity between 'provide' and 'program' - Skip Gram :  0.4147774


In [72]:
print("Words most similar to 'program' - Skip Gram : ",Skip_gram_model.wv.most_similar('program'))

Words most similar to 'program' - Skip Gram :  [('programs', 0.8141056299209595), ('programme', 0.7501056790351868), ('bondbuying', 0.6976277828216553), ('programmes', 0.6915321946144104), ('modernization', 0.6878538131713867), ('curbed', 0.6783848404884338), ('abolishing', 0.6775704026222229), ('entitlement', 0.6759329438209534), ('accelerates', 0.675923228263855), ('curtail', 0.6693112850189209)]
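
Both `similarity` and `most_similar` are based on cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖). The sketch below shows that computation on made-up 2-dimensional vectors. The helper names and `toy_vectors` are hypothetical; this is not gensim's implementation, only the idea behind it.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors, chosen so that 'programs' points almost the same way as 'program'.
toy_vectors = {
    "program":  [1.0, 0.0],
    "programs": [0.9, 0.1],
    "budget":   [0.5, 0.5],
    "holiday":  [0.0, 1.0],
}

ranking = most_similar("program", toy_vectors)
print([word for word, _ in ranking])  # ['programs', 'budget']
```

Because cosine similarity depends only on direction, a vector and any positive multiple of it have similarity 1.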


In [73]:
# Skip_gram_model.save('Word2vec_Skip-Gram')

# Word Embedding Layers for Deep Learning

Keras offers an Embedding layer that can be used for neural networks on text data.

It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
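
Integer encoding can be sketched without Keras. The helpers below are hypothetical stand-ins that follow the same convention as the Keras Tokenizer: more frequent words get smaller indices, numbering starts at 1, and 0 is left free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

texts = ["the senate passed the bill", "the house rejected the bill"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index)

print(word_index)
# {'the': 1, 'bill': 2, 'senate': 3, 'passed': 4, 'house': 5, 'rejected': 6}
print(sequences)
# [[1, 3, 4, 1, 2], [1, 5, 6, 1, 2]]
```

Since index 0 is reserved, the vocabulary size passed to an embedding layer is `len(word_index) + 1`, which is 7 here.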

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a variety of ways, such as:

- It can be used alone to learn a word embedding that can be saved and used in another model later.
- It can be used as part of a deep learning model where the embedding is learned along with the model itself.
- It can be used to load a pre-trained word embedding model, a type of transfer learning.
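
For the last option, the usual preparation step is to build a weight matrix whose row `i` holds the pre-trained vector of the word with index `i`. The sketch below shows that step only. `build_embedding_matrix`, the toy `word_index`, and the `pretrained` vectors are assumptions for illustration; the resulting matrix would then be used to initialise the Embedding layer's weights.

```python
def build_embedding_matrix(word_index, pretrained, dim):
    """Return a (len(word_index) + 1) x dim matrix of pre-trained vectors."""
    # Row 0 stays all zeros because index 0 is reserved for padding.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:      # words without a pre-trained vector stay zero
            matrix[i] = list(vector)
    return matrix

word_index = {"the": 1, "bill": 2, "senate": 3}
pretrained = {"the": [0.1, 0.2], "senate": [0.3, 0.4]}  # invented values

matrix = build_embedding_matrix(word_index, pretrained, dim=2)
for row in matrix:
    print(row)
# [0.0, 0.0]
# [0.1, 0.2]
# [0.0, 0.0]
# [0.3, 0.4]
```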

The Embedding layer is defined as the first hidden layer of a network. It takes three main arguments (input_dim and output_dim are required; input_length is optional):

- input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
- output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
- input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
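
Conceptually, the layer is a lookup table with `input_dim` rows and `output_dim` columns, and each integer in the input selects one row. The sketch below imitates this in plain Python to show the resulting shapes. The helper names are hypothetical and the random initial values are only illustrative.

```python
import random

def make_embedding_table(input_dim, output_dim, seed=0):
    """Random table with one row of size output_dim per vocabulary index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]

def embed_batch(table, batch):
    """Replace every integer in every sequence by its row of the table."""
    return [[table[i] for i in seq] for seq in batch]

# Vocabulary encoded as 0-10 -> input_dim = 11; 4-dimensional vectors.
table = make_embedding_table(input_dim=11, output_dim=4)

# Two sequences of length 3 (input_length = 3).
batch = [[1, 5, 10], [0, 0, 7]]
out = embed_batch(table, batch)

shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (2, 3, 4) = (batch size, input_length, output_dim)
```

The same index always maps to the same row, so both zeros in the second sequence receive identical vectors.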

In [88]:
import pandas as pd
import nltk
import numpy as np
import re
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Flatten
from tensorflow.keras.models import Sequential

In [77]:
df = pd.read_csv('datasets/True.csv')
df.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


In [78]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[0-9]", '', text)
    text = re.sub(r"[)(,”“.’$-]", '', text)
    return text

In [79]:
df['cleaned_text'] = df['text'].apply(clean_text)

In [80]:
df[['cleaned_text', 'text']].head()

| | cleaned_text | text |
|---|---|---|
| 0 | washington reuters the head of a conservative... | WASHINGTON (Reuters) - The head of a conservat... |
| 1 | washington reuters transgender people will be... | WASHINGTON (Reuters) - Transgender people will... |
| 2 | washington reuters the special counsel invest... | WASHINGTON (Reuters) - The special counsel inv... |
| 3 | washington reuters trump campaign adviser geo... | WASHINGTON (Reuters) - Trump campaign adviser ... |
| 4 | seattle/washington reuters president donald t... | SEATTLE/WASHINGTON (Reuters) - President Donal... |


In [82]:
def calculate_length(x):
    return(len(x.split()))

df['length'] = df['cleaned_text'].apply(calculate_length)
max_length = df.length.max()
max_length

5127

In [83]:
# apply tokenization and find the total number of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['cleaned_text']) # IMPORTANT NOTE: fit the tokenizer on the training set only
vocab_size = len(tokenizer.word_index)+1
vocab_size

80024

In [85]:
x_train_seq = tokenizer.texts_to_sequences(df['cleaned_text'])
# x_test_seq = tokenizer.texts_to_sequences(x_test) # FOR TESTING
print(x_train_seq[0])

[65, 30, 1, 337, 3, 4, 296, 55, 4225, 6, 1, 21, 145, 31, 736, 39, 176, 10, 4, 1789, 2193, 3, 1, 102, 614, 2, 496, 10, 116, 784, 159, 719, 4, 770, 296, 7, 244, 5, 598, 270, 5827, 6, 6, 1905, 12, 4, 2588, 6972, 111, 260, 247, 138, 21, 421, 1142, 5411, 583, 7, 3594, 505, 1, 648, 1671, 4, 764, 699, 7, 146, 428, 40, 229, 33, 12120, 2, 142, 1087, 60, 6, 574, 81, 38, 569, 22, 1, 9036, 7, 112, 229, 36, 1249, 497, 2, 740, 4, 146, 270, 6, 4, 466, 284, 2, 26, 1728, 2, 70, 318, 147, 16, 304, 167, 238, 16, 1, 681, 395, 71, 1475, 1053, 6, 40, 138, 36, 759, 2, 507, 291, 3, 145, 35, 82, 20, 5, 23, 138, 252, 4, 541, 270, 661, 6, 109, 428, 130, 179, 53, 252, 8008, 2691, 10, 11148, 8622, 428, 7, 692, 9, 131, 1008, 4054, 1100, 990, 174, 288, 5, 897, 905, 1, 20, 124, 19, 317, 43, 1469, 2, 141, 22248, 213, 2, 661, 11148, 8622, 428, 15, 48, 86, 5411, 374, 3, 1, 641, 32, 2601, 52, 799, 2103, 8, 7, 1, 279, 149, 179, 33, 162, 915, 25, 666, 42, 263, 2, 464, 1, 46, 4, 496, 1051, 3, 2, 86, 10, 4, 770, 296, 58, 510

In [86]:
x_train_seq = pad_sequences(x_train_seq, maxlen=max_length)
# x_test_seq = pad_sequences(x_test_seq, maxlen=max_length)

print(x_train_seq.shape)

(21417, 5127)
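
By default, `pad_sequences` pads at the front with zeros and, when a sequence is longer than `maxlen`, drops items from the front as well. The sketch below reproduces that default behaviour in plain Python. `pad_pre` is a hypothetical helper, not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                      # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded

result = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

Every row ends up with exactly `maxlen` entries, which is why the padded array above has shape (number of documents, max_length).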


In [87]:
output_dim = 30 # Try different values

model = Sequential()
model.add(Embedding(vocab_size, output_dim=output_dim, input_length=max_length))
# You can complete the model with LSTM, CNN, RNN, etc.
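
Before adding more layers, it can help to check the sizes these settings imply. The arithmetic below uses the values printed earlier in this notebook. The 4 bytes per entry assumes the padded array is stored as int32, the default dtype of `pad_sequences`.

```python
vocab_size = 80024   # from the tokenizer cell
output_dim = 30      # embedding size chosen above
max_length = 5127    # longest document, in words
num_docs = 21417     # rows in the padded array

# One output_dim-sized vector per vocabulary index.
embedding_params = vocab_size * output_dim

# Features per document if the embedding output were flattened.
flattened_features = max_length * output_dim

# Memory of the padded integer array, assuming 4-byte int32 entries.
padded_bytes = num_docs * max_length * 4

print(embedding_params)    # 2400720
print(flattened_features)  # 153810
print(padded_bytes)        # 439219836 (about 439 MB)
```

Padding every document to the length of the longest one is the main cost here, so a smaller `maxlen` is worth trying if memory or training time becomes a problem.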