### Word embedding:

Word embeddings represent words as dense numeric vectors that a machine learning model can process. They are essential in natural language processing (NLP) tasks because they capture semantic relationships between words: words that are used in similar contexts end up with similar vectors.
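
As a rough illustration of what "capturing semantic relationships" means in practice, relatedness between two embeddings is commonly measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors (illustrative values only, not output from any trained model) and a hypothetical helper named `cosine_similarity`.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors chosen by hand for illustration only (an assumption, not real embeddings).
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "computer": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_computer = cosine_similarity(toy_embeddings["king"], toy_embeddings["computer"])

print(f"king vs queen:    {sim_king_queen:.3f}")
print(f"king vs computer: {sim_king_computer:.3f}")
```

With these toy values the "king"/"queen" pair scores higher than the "king"/"computer" pair, which is the kind of pattern a well-trained embedding model is expected to show.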

### Summary about Word2Vec corpus:

Word2Vec is a technique for generating word embeddings, and the quality of these embeddings depends on the corpus it is trained on. Because the model learns from the contexts in which words appear, the corpus should contain running text that reflects how the language is actually used. To get embeddings tailored to a specific dataset, train Word2Vec on that dataset.
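
To see why the corpus matters, it can help to look at the kind of (center word, context word) pairs a window-based model learns from. The following is a simplified sketch, not gensim's implementation: it omits details such as subsampling and negative sampling, and the function name `context_pairs` and the two-sentence corpus are made up for illustration.

```python
def context_pairs(sentences, window=2):
    """Yield (center, context) pairs from a list of tokenized sentences."""
    for tokens in sentences:
        for i, center in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    yield center, tokens[j]


# Hypothetical two-sentence corpus, already tokenized.
toy_corpus = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "land"],
]

pairs = list(context_pairs(toy_corpus, window=1))

# Collect the distinct context words seen around each target word.
king_contexts = sorted({ctx for center, ctx in pairs if center == "king"})
queen_contexts = sorted({ctx for center, ctx in pairs if center == "queen"})

print(king_contexts)
print(queen_contexts)
```

In this toy corpus "king" and "queen" share the same context words, which is what pushes their vectors together during training. A corpus without meaningful word order offers no such signal.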

### Import libraries

In [166]:
from gensim.models import Word2Vec
import pandas as pd
from nltk.tokenize import word_tokenize
import nltk

In [167]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alire\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Specify the path to your CSV file

In [168]:
csv_file_path = "./words_pos.csv"

Read the CSV file using pandas

In [169]:
data = pd.read_csv(csv_file_path)

In [170]:
data.head()

,Unnamed: 0,word,pos_tag
0,0,aa,NN
1,1,aaa,NN
2,2,aah,NN
3,3,aahed,VBN
4,4,aahing,VBG


In [171]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 370100 entries, 0 to 370099
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  370100 non-null  int64 
 1   word        370100 non-null  object
 2   pos_tag     370100 non-null  object
dtypes: int64(1), object(2)
memory usage: 8.5+ MB


Join the values of the 'word' column into a single text string

In [172]:
english_corpus = " ".join(data["word"].dropna())

Tokenize the text into words

In [173]:
tokenized_corpus = word_tokenize(english_corpus.lower())

Train the Word2Vec model (the whole token list is passed as a single sentence)

In [174]:
word2vec_model = Word2Vec(
    sentences=[tokenized_corpus], vector_size=100, window=5, min_count=1, workers=4
)
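
One caveat with a single very long sentence: gensim appears to cap each training sentence at 10,000 tokens and skip the remainder, so most of a list of roughly 370,100 tokens may never be used for training. Treat that limit as an assumption and check the documentation for your gensim version. A minimal sketch of a workaround is to split the token list into shorter chunks first; `chunk_tokens` and `max_len` are hypothetical names introduced here.

```python
def chunk_tokens(tokens, max_len=10_000):
    """Split one long token list into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# Stand-in for tokenized_corpus, which holds far more tokens in the notebook.
demo_tokens = [f"w{i}" for i in range(25_000)]
demo_sentences = chunk_tokens(demo_tokens)

print(len(demo_sentences), [len(s) for s in demo_sentences])
```

The chunks could then be passed as `sentences=chunk_tokens(tokenized_corpus)` with the same parameters as above. Chunking only makes sure every token is seen during training; it does not add meaningful context to an alphabetical word list.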

Save the model

In [175]:
# This line is commented out so the model file is not pushed to GitHub
# word2vec_model.save("./word2vec_model_from_corpus.bin")

Choose 5 words from the vocabulary

In [176]:
words_to_compare = ["king", "queen", "man", "woman", "computer"]

Find the most similar words for each word in the list

In [177]:
similar_words_results = {}
for word in words_to_compare:
    if word in word2vec_model.wv:
        similar_words = word2vec_model.wv.most_similar(word)
        similar_words_results[word] = similar_words

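Print the results. Because the corpus is an alphabetical word list rather than running text, the neighbors below are not semantically meaningful.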

In [178]:
for word, similar_words in similar_words_results.items():
    print(f"Words most similar to '{word}': {similar_words}")

Words most similar to 'king': [('platystencephalism', 0.4391556978225708), ('sheol', 0.4329252243041992), ('bodilize', 0.42569977045059204), ('censureship', 0.4167605936527252), ('froust', 0.41011297702789307), ('elacolite', 0.40441256761550903), ('whirlpool', 0.39911168813705444), ('overplacement', 0.3965550363063812), ('laetic', 0.3890400826931), ('conehead', 0.3861731290817261)]
Words most similar to 'queen': [('supernumeraries', 0.5076783299446106), ('boxwoods', 0.4479225277900696), ('diminuendoed', 0.4358047544956207), ('cataphysical', 0.43141430616378784), ('dorm', 0.42526260018348694), ('geoethnic', 0.4123079776763916), ('pronely', 0.40865856409072876), ('phoneticohieroglyphic', 0.4074624180793762), ('dauted', 0.40591299533843994), ('emanationism', 0.4045831859111786)]
Words most similar to 'man': [('diverged', 0.43074479699134827), ('hepatolysis', 0.4206509292125702), ('thronos', 0.42058518528938293), ('pyrazine', 0.415852814912796), ('hallmark', 0.4139459431171417), ('pribble'