# Word2Vec
**Word2Vec** is not a single algorithm but a family of model architectures and optimizations that can be used to learn **word embeddings** from large datasets. **Embeddings** learned through **word2vec** have proven successful on a variety of downstream **NLP** tasks.

There are two main methods for learning these word representations:
- **Continuous bag-of-words model:** predicts the middle word based on the surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model because the order of words in the context is not important.

- **Continuous skip-gram model:** predicts words within a certain range before and after the current word in the same sentence. A worked example of this is given below.
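
To make the bag-of-words direction concrete, here is a minimal sketch of how `(context_words, middle_word)` training examples could be built. The function name `cbow_examples`, the `window_size` parameter, and the sample sentence are all illustrative assumptions, not part of any library.

```python
def cbow_examples(tokens, window_size):
    """Build (context_words, middle_word) examples for a CBOW-style model."""
    examples = []
    for i, target in enumerate(tokens):
        # Up to window_size words before and after position i form the context.
        before = tokens[max(0, i - window_size):i]
        after = tokens[i + 1:i + 1 + window_size]
        examples.append((before + after, target))
    return examples


# Hypothetical sample sentence, chosen only for illustration.
tokens = "the quick brown fox jumps".split()
examples = cbow_examples(tokens, window_size=1)
for context, target in examples:
    print(context, "->", target)
```

The context is kept as a plain list here, but a bag-of-words model ignores its order; typically the context embeddings are simply averaged or summed.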

### Skip-gram and negative sampling
While a bag-of-words model predicts a word given the neighboring context, a skip-gram model predicts the context (or neighbors) of a word, given the word itself. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped. The context of a word can be represented through a set of skip-gram pairs of `(target_word, context_word)`, where `context_word` appears in the neighboring context of `target_word`. Consider the following sentence:

`The wide road shimmered in the hot sun.`

The context words for each of the 8 words of this sentence are defined by a window size. The window size determines the span of words on either side of a `target_word` that can be considered a `context word`.
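
As a rough illustration of the window idea, the sketch below enumerates the `(target_word, context_word)` pairs for the example sentence with a window size of 2. The helper name `skipgram_pairs` is hypothetical, and lowercasing plus whitespace splitting is a simplifying assumption about tokenization.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target_word, context_word) pairs for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        # The context spans up to window_size tokens on either side of position i.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not paired with itself
                pairs.append((target, tokens[j]))
    return pairs


sentence = "The wide road shimmered in the hot sun"
tokens = sentence.lower().split()
pairs = skipgram_pairs(tokens, window_size=2)

# Show the pairs generated for the first two target words.
for target, context in pairs[:5]:
    print(target, "->", context)
```

Words near the start or end of the sentence get fewer context words because the window is clipped at the sentence boundary.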

The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words *w1, w2, ..., wT*, the objective can be written as the average log probability.
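
For reference, the average log probability in its usual form from the word2vec literature is shown below. The symbols are assumptions introduced here for illustration: $T$ is the number of words in the sequence and $c$ is the window size, so the inner sum runs over the context positions on either side of the target word $w_t$.

$$
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right)
$$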

### Step 1: DataLoader
As with any NLP task (or any data analysis task for that matter), there are two steps:
- Preparing Data (the Loader).
- Processing Data (the Model).

I'll be using the publicly available `tweets_hate_speech_detection` dataset from **Hugging Face** 🤗.

In [21]:
import torch
import datasets

dataset = datasets.load_dataset('tweets_hate_speech_detection')


We need a function to split up the raw tweets into lists of tokens. I'll keep the pre-processing very simple for this demonstration, and apply the following steps:

1. Lowercase everything.
2. Remove all symbols other than a-z@#.
3. Split on spaces.
4. Remove stopwords/empty tokens.
5. Apply snowball stemmer to remainder.
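
The five steps above can be sketched as a single function. This is a minimal, standard-library-only illustration, not the notebook's actual implementation: the name `tokenize`, the tiny `STOPWORDS` set, and the pluggable `stem` argument are assumptions made so the example runs without NLTK.

```python
import re

# Tiny placeholder stopword set (an assumption for this sketch); the
# notebook itself relies on NLTK's English stopword list instead.
STOPWORDS = {"the", "a", "an", "is", "in", "to", "and", "of"}


def tokenize(text, stopwords=STOPWORDS, stem=lambda token: token):
    """Turn one raw tweet into a list of cleaned tokens."""
    text = text.lower()                    # 1. lowercase everything
    text = re.sub(r"[^a-z@# ]", "", text)  # 2. keep only a-z, @, # and spaces
    tokens = text.split(" ")               # 3. split on spaces
    # 4. drop stopwords and the empty tokens left by repeated spaces
    tokens = [t for t in tokens if t and t not in stopwords]
    return [stem(t) for t in tokens]       # 5. stem what remains


print(tokenize("@user The weather is AMAZING today!! #sunny"))
```

By default `stem` returns each token unchanged; passing `SnowballStemmer('english').stem` from NLTK would apply the Snowball stemmer in step 5.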

In [27]:
! pip install --user -U nltk



The stopword list ships separately from the NLTK package. Without it, `stopwords.words('english')` raises a `LookupError`, so download the corpus once before loading it.

In [28]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Fetch the stopword corpus (skipped if it is already up to date locally).
nltk.download('stopwords')

ss = SnowballStemmer('english')  # stemmer applied to each remaining token
sw = stopwords.words('english')  # English stopwords to filter out


