#### If the cosine similarity is 1, the angle between the vectors is 0 degrees, meaning they point in the same direction (very similar).

#### If the cosine similarity is 0, the angle is 90 degrees, meaning the vectors are orthogonal (no similarity).

#### If the cosine similarity is -1, the angle is 180 degrees, meaning the vectors point in opposite directions (very dissimilar).
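The three cases above can be checked numerically. The following is a minimal sketch using NumPy; the helper name `cosine_similarity` and the example vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1, 2], [2, 4])        # parallel vectors, angle 0
orthogonal = cosine_similarity([1, 0], [0, 3])  # perpendicular vectors, angle 90
opposite = cosine_similarity([1, 2], [-1, -2])  # anti-parallel vectors, angle 180
print(same, orthogonal, opposite)
```

Note that the length of the vectors does not matter: `[1, 2]` and `[2, 4]` differ in magnitude but point in the same direction, so their similarity is (up to floating-point error) 1.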

In [1]:
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot

In [2]:
sent = ['the cup of tea',
       'the cup of juice',
       'the glass of milk',
       'i am a good boy',
       'i am a good developer',
       'understand the meaning of word',
       'your bike is very good']

In [3]:
sent

['the cup of tea',
 'the cup of juice',
 'the glass of milk',
 'i am a good boy',
 'i am a good developer',
 'understand the meaning of word',
 'your bike is very good']

In [4]:
len(sent)

7

### voc_size is the size of the vocabulary: an upper bound on the number of unique words in your dataset, and the range of integer indices that words are mapped to.

In [5]:
voc_size = 10000

### Word Embeddings: Representing words as dense vectors, capturing semantic and syntactic relationships.

In [6]:
onehot_repr = [one_hot(sentence, voc_size) for sentence in sent]

In [7]:
onehot_repr

[[8830, 1429, 40, 457],
 [8830, 1429, 40, 2914],
 [8830, 9994, 40, 9506],
 [6193, 1973, 4971, 2776, 2310],
 [6193, 1973, 4971, 2776, 4864],
 [3723, 8830, 3153, 40, 489],
 [6248, 6985, 1821, 3846, 2776]]
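Despite its name, `one_hot` does not build one-hot vectors: it hashes each word to an integer index between 1 and `voc_size - 1`, which is why "the" maps to the same index in every sentence above. The sketch below imitates that idea in plain Python. The helper names `hashed_index` and `encode` are hypothetical, and `md5` is an assumption chosen so the result is stable across runs; Keras uses a different hash by default, so the indices will not match the output above.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] using a stable hash (md5 here)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(sentence, n):
    """Lowercase, split on whitespace, and hash every word."""
    return [hashed_index(w, n) for w in sentence.lower().split()]

voc_size = 10000
a = encode("the cup of tea", voc_size)
b = encode("the cup of juice", voc_size)
print(a)
print(b)  # the first three indices match those in `a`
```

Because this is hashing, two different words can occasionally collide on the same index; a larger `voc_size` makes that less likely.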

In [8]:
sent_len = 8

In [9]:
embedding_docs = pad_sequences(onehot_repr,padding='pre',maxlen=sent_len)

In [10]:
embedding_docs

array([[   0,    0,    0,    0, 8830, 1429,   40,  457],
       [   0,    0,    0,    0, 8830, 1429,   40, 2914],
       [   0,    0,    0,    0, 8830, 9994,   40, 9506],
       [   0,    0,    0, 6193, 1973, 4971, 2776, 2310],
       [   0,    0,    0, 6193, 1973, 4971, 2776, 4864],
       [   0,    0,    0, 3723, 8830, 3153,   40,  489],
       [   0,    0,    0, 6248, 6985, 1821, 3846, 2776]])
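With `padding='pre'`, zeros are added on the left until every row has `maxlen` entries. A rough NumPy equivalent is sketched below; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and, if needed, left-truncate) integer sequences to maxlen."""
    out = np.full((len(sequences), maxlen), value, dtype=int)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]              # keep at most the last maxlen items
        if kept:
            out[row, -len(kept):] = kept  # write them at the right edge
    return out

padded = pad_pre([[8830, 1429, 40, 457],
                  [6193, 1973, 4971, 2776, 2310]], maxlen=8)
print(padded)
```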

In [11]:
model = Sequential()
model.add(Embedding(voc_size,10,input_length=sent_len))

## 10 specifies the dimensionality of the embedding space: each word is represented as a dense vector of size 10.

In [12]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 8, 10)             100000    
                                                                 
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________
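The parameter count and output shape in the summary follow from what an embedding layer is: a lookup table with one row per vocabulary index. The NumPy sketch below illustrates this under stated assumptions: the table is filled with small uniform random values as a stand-in for the layer's initial weights, and the names `table`, `batch`, and `vectors` are illustrative.

```python
import numpy as np

voc_size, embedding_dim = 10000, 10
rng = np.random.default_rng(0)

# One row per vocabulary index: 10000 * 10 = 100000 parameters.
table = rng.uniform(-0.05, 0.05, size=(voc_size, embedding_dim))

batch = np.array([[0, 0, 0, 0, 8830, 1429, 40, 457],
                  [0, 0, 0, 0, 8830, 1429, 40, 2914]])

# Integer-array indexing replaces every index with its row of the table.
vectors = table[batch]
print(table.size)     # number of parameters
print(vectors.shape)  # (batch size, sentence length, embedding dimension)
```

Every position holding the same index receives the same vector, so all padding positions (index 0) share one row of the table.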


In [13]:
model.compile(optimizer='adam',loss='mse')

In [14]:
test = model.predict(embedding_docs)



In [15]:
test

array([[[-0.02738047, -0.01239734, -0.00592859, -0.03856251,
          0.01730131,  0.01123793, -0.04281748, -0.00725784,
          0.01974161,  0.01852438],
        [-0.02738047, -0.01239734, -0.00592859, -0.03856251,
          0.01730131,  0.01123793, -0.04281748, -0.00725784,
          0.01974161,  0.01852438],
        [-0.02738047, -0.01239734, -0.00592859, -0.03856251,
          0.01730131,  0.01123793, -0.04281748, -0.00725784,
          0.01974161,  0.01852438],
        [-0.02738047, -0.01239734, -0.00592859, -0.03856251,
          0.01730131,  0.01123793, -0.04281748, -0.00725784,
          0.01974161,  0.01852438],
        [ 0.02260143,  0.0288301 ,  0.00158245,  0.03234601,
          0.00966799, -0.01907057, -0.04882883,  0.04142496,
         -0.02896094,  0.02413738],
        [ 0.01197611,  0.00743956,  0.03801392,  0.01021146,
         -0.04364231, -0.00971224,  0.02474083, -0.04152535,
         -0.00479716,  0.04601495],
        [ 0.00578063, -0.01016172, -0.0405167 , -0.0

# Word2vec

In [16]:
text = """The climate has continuously changing for centuries. The global warming happens because the natural rotation of the sun that changes the intensity of sunlight and moving closer to the earth. Another cause of global warming is greenhouse gases. Greenhouse gases are carbon monoxide and sulphur dioxide it trap the solar heats rays and prevent it from escaping from the surface of the earth. This has cause the temperature of the earth increase. Volcanic eruptions are another issue that causes global warming. For instance, a single volcanic eruption will release amount of carbon dioxide and ash to the atmosphere. Once carbon dioxide increase, the temperature of earth increase and greenhouse trap the solar radiations in the earth. Finally, methane is another issue that causes global warming. Methane is also a greenhouse gas. Methane is more effective in trapping heat in the atmosphere that carbon dioxide by 20 times. Usually methane gas can release from many areas. For instance, it can be from cattle, landfill, natural gas, petroleum systems, coal mining, mobile explosion, or industrial waste process."""

In [17]:
text

'The climate has continuously changing for centuries. The global warming happens because the natural rotation of the sun that changes the intensity of sunlight and moving closer to the earth. Another cause of global warming is greenhouse gases. Greenhouse gases are carbon monoxide and sulphur dioxide it trap the solar heats rays and prevent it from escaping from the surface of the earth. This has cause the temperature of the earth increase. Volcanic eruptions are another issue that causes global warming. For instance, a single volcanic eruption will release amount of carbon dioxide and ash to the atmosphere. Once carbon dioxide increase, the temperature of earth increase and greenhouse trap the solar radiations in the earth. Finally, methane is another issue that causes global warming. Methane is also a greenhouse gas. Methane is more effective in trapping heat in the atmosphere that carbon dioxide by 20 times. Usually methane gas can release from many areas. For instance, it can be fr

In [18]:
import nltk

## Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible.

#### Gensim is designed to process raw, unstructured digital texts (“plain text”) using unsupervised machine learning algorithms.

In [19]:
#!pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp38-cp38-win_amd64.whl.metadata (8.2 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.1.0-py3-none-any.whl.metadata (24 kB)
Downloading gensim-4.3.3-cp38-cp38-win_amd64.whl (24.0 MB)

In [20]:
from gensim.models import Word2Vec
from nltk.corpus import stopwords
import re

In [21]:
para = re.sub(r'\d', ' ', text)    # replace every digit with a space
para = re.sub(r'\s+', ' ', para)   # collapse runs of whitespace into one space
para = para.lower()                # lowercase the whole paragraph
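To see what this cleaning does, the same steps can be wrapped in a small function and run on one short sentence. The function name `clean_text` and the sample sentence are assumptions made for illustration.

```python
import re

def clean_text(raw):
    """Replace digits with spaces, collapse whitespace, and lowercase."""
    no_digits = re.sub(r"\d", " ", raw)
    collapsed = re.sub(r"\s+", " ", no_digits)
    return collapsed.lower()

cleaned = clean_text("Methane traps heat  by 20 times.")
print(cleaned)  # methane traps heat by times.
```

A side effect worth noticing is that numbers disappear entirely, so "by 20 times" becomes "by times".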

In [22]:
para

'the climate has continuously changing for centuries. the global warming happens because the natural rotation of the sun that changes the intensity of sunlight and moving closer to the earth. another cause of global warming is greenhouse gases. greenhouse gases are carbon monoxide and sulphur dioxide it trap the solar heats rays and prevent it from escaping from the surface of the earth. this has cause the temperature of the earth increase. volcanic eruptions are another issue that causes global warming. for instance, a single volcanic eruption will release amount of carbon dioxide and ash to the atmosphere. once carbon dioxide increase, the temperature of earth increase and greenhouse trap the solar radiations in the earth. finally, methane is another issue that causes global warming. methane is also a greenhouse gas. methane is more effective in trapping heat in the atmosphere that carbon dioxide by times. usually methane gas can release from many areas. for instance, it can be from 

In [23]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [24]:
sentences = nltk.sent_tokenize(para)

In [25]:
sentences

['the climate has continuously changing for centuries.',
 'the global warming happens because the natural rotation of the sun that changes the intensity of sunlight and moving closer to the earth.',
 'another cause of global warming is greenhouse gases.',
 'greenhouse gases are carbon monoxide and sulphur dioxide it trap the solar heats rays and prevent it from escaping from the surface of the earth.',
 'this has cause the temperature of the earth increase.',
 'volcanic eruptions are another issue that causes global warming.',
 'for instance, a single volcanic eruption will release amount of carbon dioxide and ash to the atmosphere.',
 'once carbon dioxide increase, the temperature of earth increase and greenhouse trap the solar radiations in the earth.',
 'finally, methane is another issue that causes global warming.',
 'methane is also a greenhouse gas.',
 'methane is more effective in trapping heat in the atmosphere that carbon dioxide by times.',
 'usually methane gas can release f

In [26]:
len(sentences)

13

In [27]:
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

In [28]:
sentences

[['the',
  'climate',
  'has',
  'continuously',
  'changing',
  'for',
  'centuries',
  '.'],
 ['the',
  'global',
  'warming',
  'happens',
  'because',
  'the',
  'natural',
  'rotation',
  'of',
  'the',
  'sun',
  'that',
  'changes',
  'the',
  'intensity',
  'of',
  'sunlight',
  'and',
  'moving',
  'closer',
  'to',
  'the',
  'earth',
  '.'],
 ['another',
  'cause',
  'of',
  'global',
  'warming',
  'is',
  'greenhouse',
  'gases',
  '.'],
 ['greenhouse',
  'gases',
  'are',
  'carbon',
  'monoxide',
  'and',
  'sulphur',
  'dioxide',
  'it',
  'trap',
  'the',
  'solar',
  'heats',
  'rays',
  'and',
  'prevent',
  'it',
  'from',
  'escaping',
  'from',
  'the',
  'surface',
  'of',
  'the',
  'earth',
  '.'],
 ['this',
  'has',
  'cause',
  'the',
  'temperature',
  'of',
  'the',
  'earth',
  'increase',
  '.'],
 ['volcanic',
  'eruptions',
  'are',
  'another',
  'issue',
  'that',
  'causes',
  'global',
  'warming',
  '.'],
 ['for',
  'instance',
  ',',
  'a',
  'sing

In [29]:
stop_words = set(stopwords.words('english'))  # build the stop-word set once
sentences = [[word for word in sentence if word not in stop_words]
             for sentence in sentences]
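The stop-word filter keeps punctuation tokens such as '.' and ',', so they can end up in the Word2Vec vocabulary. If that is not wanted, one option is to keep only alphabetic tokens as well. The sketch below is a suggestion rather than what the notebook does: `keep_token` is a hypothetical helper, and the small `stop_words` set is a stand-in for the much longer NLTK list.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
stop_words = {"the", "of", "is", "a", "for", "has", "and", "to"}

def keep_token(token):
    """Keep alphabetic tokens that are not stop words."""
    return token.isalpha() and token not in stop_words

tokens = ["the", "climate", "has", "continuously", "changing", "for", "centuries", "."]
filtered = [t for t in tokens if keep_token(t)]
print(filtered)  # ['climate', 'continuously', 'changing', 'centuries']
```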

In [30]:
sentences

[['climate', 'continuously', 'changing', 'centuries', '.'],
 ['global',
  'warming',
  'happens',
  'natural',
  'rotation',
  'sun',
  'changes',
  'intensity',
  'sunlight',
  'moving',
  'closer',
  'earth',
  '.'],
 ['another', 'cause', 'global', 'warming', 'greenhouse', 'gases', '.'],
 ['greenhouse',
  'gases',
  'carbon',
  'monoxide',
  'sulphur',
  'dioxide',
  'trap',
  'solar',
  'heats',
  'rays',
  'prevent',
  'escaping',
  'surface',
  'earth',
  '.'],
 ['cause', 'temperature', 'earth', 'increase', '.'],
 ['volcanic',
  'eruptions',
  'another',
  'issue',
  'causes',
  'global',
  'warming',
  '.'],
 ['instance',
  ',',
  'single',
  'volcanic',
  'eruption',
  'release',
  'amount',
  'carbon',
  'dioxide',
  'ash',
  'atmosphere',
  '.'],
 ['carbon',
  'dioxide',
  'increase',
  ',',
  'temperature',
  'earth',
  'increase',
  'greenhouse',
  'trap',
  'solar',
  'radiations',
  'earth',
  '.'],
 ['finally',
  ',',
  'methane',
  'another',
  'issue',
  'causes',
  'gl

In [31]:
model = Word2Vec(sentences,min_count=2)

### min_count=2: the minimum number of times a word must appear in the training data to be included in the model's vocabulary. Words that occur fewer than 2 times are ignored.
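The effect of `min_count` can be imitated with a word counter. This is a simplified sketch of the filtering idea only, not Gensim's internal code; the function name `vocabulary` and the toy sentences are assumptions.

```python
from collections import Counter

def vocabulary(tokenized_sentences, min_count):
    """Return the words that occur at least min_count times in the corpus."""
    counts = Counter(w for s in tokenized_sentences for w in s)
    return {w for w, c in counts.items() if c >= min_count}

toy = [["global", "warming", "greenhouse", "gases"],
       ["global", "warming", "methane"],
       ["methane", "gas"]]
vocab = vocabulary(toy, min_count=2)
print(sorted(vocab))  # ['global', 'methane', 'warming']
```

Words dropped this way get no vector at all, so looking them up in the trained model raises a `KeyError`.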

# words = model.wv.key_to_index  (Gensim 4.x; model.wv.vocab from Gensim 3.x was removed)

In [32]:
vector = model.wv['global']

In [33]:
vector.shape

(100,)

In [34]:
model.wv.most_similar('global')

[('temperature', 0.25295764207839966),
 ('earth', 0.17037425935268402),
 ('solar', 0.15011754631996155),
 ('warming', 0.13924837112426758),
 ('issue', 0.1084781363606453),
 ('release', 0.09975417703390121),
 ('greenhouse', 0.035267166793346405),
 ('causes', 0.03357555344700813),
 ('gases', 0.016446169465780258),
 ('natural', 0.013856201432645321)]

In [35]:
model.wv.most_similar('warming')

[('increase', 0.16687826812267303),
 ('global', 0.1392483413219452),
 ('methane', 0.13180485367774963),
 ('natural', 0.09753084182739258),
 ('cause', 0.07178264111280441),
 ('earth', 0.06410785764455795),
 ('dioxide', 0.06106419488787651),
 ('issue', 0.04776986315846443),
 ('temperature', 0.04407171905040741),
 ('gas', 0.019936688244342804)]

In [36]:
model.wv.most_similar('gas')

[('solar', 0.12813477218151093),
 ('increase', 0.10928673297166824),
 ('carbon', 0.10865344107151031),
 ('trap', 0.10797619819641113),
 ('atmosphere', 0.09932279586791992),
 ('cause', 0.09611022472381592),
 ('instance', 0.0863659456372261),
 ('.', 0.06253919750452042),
 ('greenhouse', 0.05043398588895798),
 ('dioxide', 0.02675705775618553)]
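The scores returned by `most_similar` are cosine similarities between the query word's vector and every other vector in the vocabulary, sorted from highest to lowest. The ranking can be sketched in NumPy as follows. The two-dimensional `toy_vectors` are invented for illustration and are not taken from the trained model, and this `most_similar` function is a simplified stand-in for the Gensim method.

```python
import numpy as np

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=float)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # unit length
    scores = unit @ unit[words.index(word)]  # dot product of unit vectors
    ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
    return [(w, float(s)) for w, s in ranked if w != word][:topn]

toy_vectors = {
    "global": [1.0, 0.0],
    "warming": [0.9, 0.1],
    "gas": [0.0, 1.0],
    "cold": [-1.0, 0.0],
}
neighbours = most_similar("global", toy_vectors)
print(neighbours)
```

In this toy example "warming" ranks first (nearly parallel to "global"), "gas" scores about 0 (orthogonal), and "cold" scores about -1 (opposite direction).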