<a href="https://colab.research.google.com/github/Ehtisham1053/Natural-Language-Processing/blob/main/Skip_grams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Skip-gram Model
The Skip-gram model is one of the two architectures Word2Vec uses to learn word embeddings. Unlike CBOW (Continuous Bag of Words), which predicts the target word from its context words, Skip-gram predicts the context words from the target word.
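
To make the contrast concrete, here is a minimal sketch in plain Python (no Word2Vec involved) of the training examples each approach would derive from one sentence. The helper names `context_of`, `cbow_examples`, and `skipgram_examples`, the toy sentence, and the window of 1 are illustrative assumptions, not part of gensim.

```python
def context_of(tokens, i, window):
    """Return the words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: one example per position, (all context words) -> target word."""
    return [(context_of(tokens, i, window), target) for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window):
    """Skip-gram: one example per (target word, single context word) pair."""
    return [(target, ctx)
            for i, target in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]


tokens = ["neural", "networks", "learn", "representations"]
print(cbow_examples(tokens, window=1)[1])       # (['neural', 'learn'], 'networks')
print(skipgram_examples(tokens, window=1)[:3])
```

For the same sentence, CBOW yields one example per word, while Skip-gram yields one example per (target, context) pair, so it produces more training examples from the same text.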



---




# 📌 How Skip-gram Works
1. Training Data: A large corpus of text (e.g., Wikipedia articles).

2. Word Pairs: For each target word, the model forms (target, context) pairs and learns to predict each surrounding context word.

3. Window Size: Defines how many words before and after the target word are considered.

4. Neural Network: Uses a simple shallow neural network to learn word vectors.

5. Optimization: Uses negative sampling or hierarchical softmax to avoid computing a full softmax over the entire vocabulary at every training step.

6. Output: A dense vector representation for each word.
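
Step 5 can be illustrated with the negative-sampling objective for a single (target, context) pair: the score of the observed pair is pushed up and the scores of a few randomly drawn "negative" words are pushed down, instead of normalizing over the whole vocabulary. The sketch below uses plain Python lists as vectors; the function names and the toy vectors are assumptions for illustration, not gensim's internals.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one observed (target, context) pair plus k sampled negatives.

    loss = -log sigmoid(context . target) - sum_k log sigmoid(-negative_k . target)
    """
    loss = -math.log(sigmoid(dot(context_vec, target_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg, target_vec)))
    return loss


# Toy vectors, invented for illustration only.
target = [0.5, -0.2, 0.1]
good_context = [0.6, -0.1, 0.2]    # points roughly the same way as the target
bad_context = [-0.6, 0.1, -0.2]    # points the opposite way
negatives = [[-0.3, 0.4, 0.0], [0.1, 0.5, -0.2]]

print(negative_sampling_loss(target, good_context, negatives))
print(negative_sampling_loss(target, bad_context, negatives))
```

The first loss is lower than the second because the well-aligned context vector already scores highly against the target. Training nudges the vectors so that observed pairs look like the first case.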



In [4]:
!pip install --no-cache-dir --force-reinstall gensim nltk matplotlib wikipedia-api

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting matplotlib
  Downloading matplotlib-3.10.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting wikipedia-api
  Downloading wikipedia_api-0.8.1.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)

In [11]:
import gensim
from gensim.models import Word2Vec
import wikipediaapi
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import string
import matplotlib.pyplot as plt
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='MyProjectName/1.0 (MyProjectURL; MyEmail@example.com)'
)


topics = ["Machine Learning", "Artificial Intelligence", "Deep Learning", "Neural Networks",
          "Natural Language Processing", "Big Data", "Computer Vision", "Data Science"]

wiki_texts = []
for topic in topics:
    page = wiki.page(topic)
    if page.exists():
        wiki_texts.append(page.text)

large_wiki_text = " ".join(wiki_texts)

print("Sample Wikipedia Text (First 500 characters):\n", large_wiki_text[:500])

Sample Wikipedia Text (First 500 characters):
 Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.
ML finds application in many fields, 


In [12]:
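sentences = sent_tokenize(large_wiki_text)
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

stop_words = set(stopwords.words("english"))
punctuation = set(string.punctuation)
# Drop stopwords and any token made up entirely of punctuation (e.g. "``", "''", "...").
# Testing `word not in string.punctuation` is a substring check and misses those tokens.
tokenized_sentences = [[word for word in sentence
                        if word not in stop_words and not all(ch in punctuation for ch in word)]
                       for sentence in tokenized_sentences]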

print("Tokenized Sentences:\n", tokenized_sentences)



Tokenized Sentences:


In [14]:
skipgram_model = Word2Vec(sentences=tokenized_sentences,
                          vector_size=100,  # Embedding size
                          window=5,        # Context window size
                          sg=1,            # sg=1 means Skip-gram, sg=0 means CBOW
                          min_count=3,     # Ignore words with frequency <3
                          workers=4)       # Number of CPU threads

skipgram_model.save("skipgram_wikipedia.model")

print("Skip-gram Model Training Completed!")


Skip-gram Model Training Completed!
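
The `min_count=3` setting above drops rare words before training. A rough stand-alone sketch of that filtering with `collections.Counter` follows; `prune_vocabulary` is a hypothetical helper written for this illustration, and gensim's own vocabulary building does more than this (for example, subsampling of very frequent words).

```python
from collections import Counter


def prune_vocabulary(tokenized_sentences, min_count):
    """Keep only words that occur at least `min_count` times across all sentences."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    vocab = {word for word, n in counts.items() if n >= min_count}
    pruned = [[word for word in sentence if word in vocab]
              for sentence in tokenized_sentences]
    return vocab, pruned


# Toy corpus, invented for illustration only.
toy_sentences = [
    ["data", "science", "uses", "data"],
    ["big", "data", "analysis"],
    ["data", "science", "science"],
]
vocab, pruned = prune_vocabulary(toy_sentences, min_count=3)
print(sorted(vocab))   # ['data', 'science']
print(pruned)
```

On a small corpus, a higher `min_count` can shrink the vocabulary sharply, so a word must appear often enough in the downloaded articles before `most_similar` can be queried for it.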


### Find similar words

In [15]:
model = Word2Vec.load("skipgram_wikipedia.model")

print("\nWords similar to 'data':")
for word, similarity in model.wv.most_similar("data", topn=5):
    print(f"{word}: {similarity:.4f}")

print("\nWords similar to 'intelligence':")
for word, similarity in model.wv.most_similar("intelligence", topn=5):
    print(f"{word}: {similarity:.4f}")



Words similar to 'data':
big: 0.9931
analysis: 0.9916
science: 0.9909
sets: 0.9904
size: 0.9894

Words similar to 'intelligence':
called: 0.9867
neurons: 0.9859
use: 0.9853
solve: 0.9852
general: 0.9850
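
The scores printed above are cosine similarities between word vectors, which is what `most_similar` ranks by. As a minimal sketch with made-up 3-dimensional vectors (the vectors in this model have 100 dimensions), the computation looks roughly like this; `cosine_similarity` and `toy_vectors` are names assumed for the illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
toy_vectors = {
    "data": [1.0, 2.0, 0.5],
    "analysis": [0.9, 2.1, 0.4],
    "neurons": [-1.0, 0.3, 2.0],
}

query = "data"
# Rank every other word by its cosine similarity to the query word.
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(f"{word}: {score:.4f}")
```

Scores that sit very close to 1.0 for almost every neighbour, as in the output above, can indicate that the vectors have not separated much, which is common when the training corpus is small.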
