# Word2Vec

“You shall know a word by the company it keeps” (Firth, J. R. 1957:11)

Since the goal of this resource is to teach NLP for low-resource languages, we start by learning word embeddings from scratch with word2vec for several low-resource languages, e.g. Yoruba, Igbo, and Swahili. Word2Vec was introduced in [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf); the overarching idea is that the meaning of a word depends on the contexts in which it is typically used.

In this tutorial, we will learn and visualize word embeddings for different languages using the [mC4](https://huggingface.co/datasets/mc4) dataset and PyTorch.


## Important terminology to note

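- CBOW: Continuous Bag-of-Words, the word2vec architecture that predicts a center word from the words surrounding it within a context window.
- SkipGram: the word2vec architecture that works in the opposite direction, predicting the surrounding context words from a center word.
- Corpus: the collection of text used for training; in this tutorial, a list of sentences.
- Vocabulary: the set of unique words in the corpus for which embeddings are learned.
- Word Subsampling: randomly discarding occurrences of very frequent words (such as "the") during training, which speeds up training and helps the representations of rarer words.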
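
To make the difference between CBOW and SkipGram concrete, here is a minimal sketch of how training examples could be generated from a tokenized sentence. The function names `skipgram_pairs` and `cbow_examples` and the `window` parameter are illustrative assumptions, not part of any library or of the code later in this tutorial.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs: SkipGram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Return (context words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip one-word sentences, which have no context
            examples.append((context, center))
    return examples


tokens = "you shall know a word".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

With `window=1`, the word "shall" yields the SkipGram pairs `("shall", "you")` and `("shall", "know")`, while CBOW produces the single example `(["you", "know"], "shall")`.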


In [None]:
!pip install datasets

## Dataset

- First we need to build a corpus, which here is a list of sentences in the selected language.

In [4]:
#@title String fields

Select_Language = 'English' #@param ["English", "Yoruba", "Igbo", "Swahili", "Hausa"]

LANG_2_CODE = {"English" : "en", "Yoruba" : "yo", "Igbo": "ig", "Swahili": "sw", "Hausa": "ha"} 

In [None]:
import warnings
from tqdm import tqdm
from datasets import load_dataset
warnings.filterwarnings("ignore")

def fetch_igbo_corpus() -> list:
    """
    Build a corpus for the Igbo language: a list of sentences drawn from
    a Hugging Face dataset.
    Dataset: https://huggingface.co/datasets/igbo_monolingual
    """

    # Configurations (sub-corpora) of the igbo_monolingual dataset
    igbo_monolingual = ['eze_goes_to_school', 'bbc-igbo', 'igbo-radio', 'jw-ot-igbo',
                        'jw-nt-igbo', 'jw-books', 'jw-teta', 'jw-ulo_nche', 'jw-ulo_nche_naamu']

    sentences = []
    for conf in tqdm(igbo_monolingual):
        train = load_dataset("igbo_monolingual", conf)['train']
        if conf == 'eze_goes_to_school':
            # Only the first record is used; its chapters hold titles and contents
            chapters = train['chapters'][0]
            sentences.extend(chapters['title'])
            for content in chapters['content']:
                sentences.extend(content.split("."))
        elif conf == 'bbc-igbo':
            sentences.extend(train['title'])
            sentences.extend(train['description'])
            for content in train['content']:
                sentences.extend(content.split("."))
        elif conf == 'igbo-radio':
            for content, description in zip(train['content'], train['description']):
                sentences.extend(content.split("."))
                sentences.extend(description.split("."))
        elif conf in ['jw-ot-igbo', 'jw-nt-igbo']:
            for chapters in train['chapters']:
                for j in range(len(chapters)):
                    try:
                        sentences.extend(chapters['content'][j].split("."))
                    except IndexError:
                        # No content entry at this index; skip it
                        continue
        else:
            for content in train['content']:
                sentences.extend(content.split("."))

    return sentences


def fetch_english_corpus() -> list:
    """
    Build a corpus for the English language: a list of sentences drawn from
    a Hugging Face dataset.
    Dataset: https://huggingface.co/datasets/wikitext
    """
    sentences = []

    dataset = load_dataset("wikitext", 'wikitext-103-v1')
    for split in ["test", "train", "validation"]:
        # Iterate over the text column once instead of re-indexing it on every step
        for text in dataset[split]['text']:
            sentences.extend(text.split("."))

    return sentences

def fetch_yoruba_corpus() -> list:
    """
    Build a corpus for the Yoruba language: a list of sentences drawn from
    a Hugging Face dataset.
    Dataset: https://huggingface.co/datasets/yoruba_text_c3
    """
    sentences = []

    dataset = load_dataset("yoruba_text_c3")
    # Iterate over the text column once instead of re-indexing it on every step
    for text in dataset["train"]["text"]:
        sentences.extend(text.split("."))

    return sentences


def fetch_swahili_corpus() -> list:
    """
    Build a corpus for the Swahili language: a list of sentences drawn from
    a Hugging Face dataset.
    Dataset: https://huggingface.co/datasets/swahili
    """
    sentences = []

    dataset = load_dataset("swahili")
    for split in ["test", "train", "validation"]:
        # Iterate over the text column once instead of re-indexing it on every step
        for text in dataset[split]["text"]:
            sentences.extend(text.split("."))

    return sentences



# Keep the result in a variable so the notebook does not print the whole corpus
corpus = fetch_english_corpus()
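
The `Select_Language` form field defined earlier is not used when choosing which corpus to download. One possible way to connect the two is a dictionary that maps each language name to its fetch function. The sketch below uses a hypothetical helper, `select_fetcher`, and stand-in fetchers with placeholder sentences so that it runs without downloading anything; in the notebook the dictionary values would be `fetch_english_corpus`, `fetch_igbo_corpus`, `fetch_yoruba_corpus`, and `fetch_swahili_corpus`.

```python
from typing import Callable, Dict, List

Fetcher = Callable[[], List[str]]


def select_fetcher(language: str, fetchers: Dict[str, Fetcher]) -> Fetcher:
    """Return the fetch function registered for `language` (hypothetical helper)."""
    try:
        return fetchers[language]
    except KeyError:
        available = ", ".join(sorted(fetchers))
        raise ValueError(
            f"No corpus fetcher for {language!r}; available: {available}"
        ) from None


# Stand-in fetchers for illustration only; they return placeholder sentences.
demo_fetchers: Dict[str, Fetcher] = {
    "English": lambda: ["the cat sat", "on the mat"],
    "Igbo": lambda: ["placeholder igbo sentence"],
}

demo_corpus = select_fetcher("English", demo_fetchers)()
print(len(demo_corpus))
```

A language that appears in the form field but has no fetch function in this section, such as Hausa, raises a `ValueError` that lists the available options.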

In [None]:
import re

def process_text(corpus_list: list) -> list:
    """
    Preprocess the downloaded sentences:
    1. lowercase
    2. remove symbols and numbers
    3. remove extra whitespace
    """

    corpus_list = [x.lower() for x in corpus_list]
    # Remove symbols: anything that is neither a word character nor whitespace
    corpus_list = [re.sub(r"[^\w\s]", "", x) for x in corpus_list]
    corpus_list = [re.sub(r"\d+", "", x) for x in corpus_list]
    # Collapse whitespace to single spaces so the words in a sentence stay separable
    corpus_list = [re.sub(r"\s+", " ", x).strip() for x in corpus_list]

    return corpus_list
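
Once the sentences are cleaned, the vocabulary and the word-subsampling probabilities from the terminology list can both be derived from word counts. The following is a minimal sketch, assuming whitespace tokenization. The names `build_vocab`, `discard_probabilities`, `min_count`, and `t` are illustrative. The discard probability `1 - sqrt(t / f(w))`, where `f(w)` is the relative frequency of word `w`, is the formula from the follow-up word2vec paper (Mikolov et al., 2013, "Distributed Representations of Words and Phrases and their Compositionality"). It is not necessarily the exact scheme used later in this tutorial.

```python
import math
from collections import Counter
from typing import Dict, List


def count_words(sentences: List[str]) -> Counter:
    """Count whitespace-separated words across all sentences."""
    return Counter(word for sentence in sentences for word in sentence.split())


def build_vocab(sentences: List[str], min_count: int = 1) -> Dict[str, int]:
    """Map every word seen at least `min_count` times to an integer id (most frequent first)."""
    counts = count_words(sentences)
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return {word: idx for idx, word in enumerate(kept)}


def discard_probabilities(sentences: List[str], t: float = 1e-3) -> Dict[str, float]:
    """Probability of dropping each word occurrence: max(0, 1 - sqrt(t / f(w)))."""
    counts = count_words(sentences)
    total = sum(counts.values())
    return {
        word: max(0.0, 1.0 - math.sqrt(t / (count / total)))
        for word, count in counts.items()
    }


toy_corpus = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(toy_corpus)
# A large threshold is used here only because the toy corpus is tiny
probs = discard_probabilities(toy_corpus, t=0.1)
print(vocab)
print(probs)
```

In this toy corpus "the" is the most frequent word, so it receives id 0 and the highest discard probability.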