<a href="https://colab.research.google.com/github/HamzaAnjum15/NLP/blob/main/Word2Vec_FastText.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Word2Vec is a neural-network-based technique for converting words into dense vectors, trained so that words used in similar contexts end up with similar vectors. These vectors are used in many natural language processing (NLP) applications, such as sentiment analysis and machine translation, because they give machine learning models a numerical representation of words to work with.

There are two main types of Word2Vec models:

CBOW (Continuous Bag of Words): Predicts the target word from surrounding context words.
Skip-gram: Predicts surrounding context words given a target word.
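
To make the difference between the two architectures concrete, here is a minimal pure-Python sketch of the training examples each one would draw from a single tokenized sentence. The helper names (`context_windows`, `cbow_examples`, `skipgram_examples`) are illustrative assumptions, not part of gensim, and the sketch leaves out details that real implementations add, such as subsampling and negative sampling.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in a tokenized sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after the target
        yield target, left + right


def cbow_examples(tokens, window):
    # CBOW: the whole context is the input, the target word is the label.
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window):
    # Skip-gram: the target word is the input, each context word is a separate label.
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]


# Hypothetical toy sentence, already lowercased and tokenized
tokens = ["machine", "learning", "is", "fun"]

print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

With `window=1`, CBOW produces one example per word (context in, target out), while Skip-gram produces one example per (target, context word) pair, so it yields more examples from the same sentence.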


In [1]:
!pip install gensim nltk




In [2]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize


In [4]:
import nltk
nltk.download('punkt')  # tokenizer data used by word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead

# Example sentences
sentences = [
    "I love natural language processing and machine learning",
    "Word2Vec helps in understanding word similarities",
    "Machine learning is fun and interesting",
    "Natural language processing is a part of machine learning",
    "I enjoy learning new things in artificial intelligence"
]

# Tokenize each sentence
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
print(tokenized_sentences)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[['i', 'love', 'natural', 'language', 'processing', 'and', 'machine', 'learning'], ['word2vec', 'helps', 'in', 'understanding', 'word', 'similarities'], ['machine', 'learning', 'is', 'fun', 'and', 'interesting'], ['natural', 'language', 'processing', 'is', 'a', 'part', 'of', 'machine', 'learning'], ['i', 'enjoy', 'learning', 'new', 'things', 'in', 'artificial', 'intelligence']]


vector_size=50: Sets the dimensionality of each word vector to 50 (adjustable).
window=3: Uses up to 3 words on each side of the target word as context.
min_count=1: Ignores words that occur fewer than min_count times in the corpus; with a value of 1, every word is kept.
sg=0: Trains with CBOW; setting sg=1 switches to Skip-gram.
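
The effect of min_count can be illustrated without training a model. The sketch below is a simplified stand-in for the vocabulary-building step, assuming a hypothetical helper `build_vocab` and a made-up toy corpus; it only counts tokens and drops the rare ones, which is the part of the behavior that min_count controls.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_count):
    """Count every token and keep only those seen at least `min_count` times."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical toy corpus, already tokenized
toy_corpus = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "is", "useful"],
    ["i", "enjoy", "learning"],
]

vocab_all = build_vocab(toy_corpus, min_count=1)       # nothing is filtered out
vocab_frequent = build_vocab(toy_corpus, min_count=2)  # words seen only once are dropped

print(sorted(vocab_all))
print(sorted(vocab_frequent))
```

On a corpus as small as the one in this notebook, min_count=1 is a practical choice, since a higher threshold would discard most of the vocabulary.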

In [5]:
# Create the Word2Vec model
model = Word2Vec(sentences=tokenized_sentences, vector_size=50, window=3, min_count=1, sg=0)


In [10]:
# Get vector for a word
word_vector = model.wv['machine']
print(word_vector)
# This prints the 50-dimensional vector learned for the word "machine".


[-0.01631348  0.00898898 -0.0082744   0.00164721  0.01700014 -0.00892164
  0.00903623 -0.01357399 -0.00709444  0.01879727 -0.00316053  0.00064218
 -0.0082786  -0.01536866 -0.00301601  0.00494228 -0.00177495  0.01106751
 -0.00548537  0.00452524  0.01090642  0.01668959 -0.00290629 -0.01841395
  0.00873797  0.00114173  0.01487998 -0.00163029 -0.00527568 -0.01750633
 -0.00171392  0.00565271  0.01080288  0.01410596 -0.01140395  0.00371848
  0.01217604 -0.00959917 -0.00621352  0.01359702  0.003265    0.00037837
  0.00695048  0.00043391  0.01923932  0.01012449 -0.01783132 -0.01408467
  0.00180031  0.01278293]


In [7]:
# Find words similar to 'learning'
similar_words = model.wv.most_similar('learning', topn=5)
print(similar_words)


[('a', 0.2707342505455017), ('artificial', 0.21156349778175354), ('things', 0.18648891150951385), ('word2vec', 0.1672801673412323), ('new', 0.16135866940021515)]
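
The scores returned by most_similar are cosine similarities between word vectors. The sketch below recomputes that ranking in plain Python on made-up 2-dimensional vectors; `cosine_similarity`, `most_similar`, and `toy_vectors` are illustrative names and values, not gensim internals or results from the model above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors for illustration only
toy_vectors = {
    "learning": [1.0, 0.0],
    "training": [2.0, 0.0],   # same direction as "learning"
    "fun": [0.0, 1.0],        # orthogonal to "learning"
    "boring": [-1.0, 0.0],    # opposite direction
}

ranked = most_similar("learning", toy_vectors, topn=3)
print(ranked)
```

Cosine similarity ranges from -1 to 1 and ignores vector length. The low scores in the output above (all below 0.3) are likely a consequence of the very small training corpus, so they should not be read as meaningful semantic relationships.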


In [11]:
print(f"Word Vector for 'machine' (first 5 dimensions): {word_vector[:5]}")


Word Vector for 'machine' (first 5 dimensions): [-0.01631348  0.00898898 -0.0082744   0.00164721  0.01700014]


In [12]:
from gensim.models import FastText
from nltk.tokenize import word_tokenize

# Sample corpus (list of sentences)
sentences = [
    "I love natural language processing.",
    "Natural language processing is fun.",
    "I enjoy learning about data science.",
    "Machine learning is a part of data science."
]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train the FastText model
model_fasttext = FastText(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1)

# Get the vector for a specific word
word_vector_fasttext = model_fasttext.wv['data']

# Print the word vector for 'data' (first 5 dimensions)
print(f"FastText Word Vector for 'data': {word_vector_fasttext[:5]}")

# Find similar words to 'data'
similar_words_fasttext = model_fasttext.wv.most_similar('data')
print("\nWords similar to 'data' using FastText:", similar_words_fasttext)

FastText Word Vector for 'data': [-0.00014643  0.00029048  0.00104975 -0.00074337 -0.00244071]

Words similar to 'data' using FastText: [('machine', 0.2867852747440338), ('science', 0.10667353123426437), ('is', 0.09705585241317749), ('fun', 0.09495128691196442), ('a', 0.0866626650094986), ('i', 0.08312514424324036), ('of', 0.04464837163686752), ('natural', 0.028660064563155174), ('about', 0.0029689136426895857), ('language', -0.02337067760527134)]
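
FastText differs from Word2Vec in that it builds each word vector from the word's character n-grams, which lets it produce vectors for words it never saw during training. The following sketch shows only the n-gram extraction step; `char_ngrams` is a hypothetical helper, the 3 to 6 character range is assumed to match gensim's default min_n and max_n, and the hashing of n-grams into buckets that real implementations perform is omitted.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


# The 3-grams of "data" after adding boundary markers
print(char_ngrams("data", min_n=3, max_n=3))

# Related words share n-grams, so they share part of their representation
shared = set(char_ngrams("data")) & set(char_ngrams("database"))
print(sorted(shared))
```

Because "database" shares several n-grams with "data", a FastText model trained only on "data" can still assemble a vector for "database" from the overlapping pieces.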
