# 2. Word embedding model training

This section follows the approach in the [word2vec tutorial](https://rare-technologies.com/word2vec-tutorial/).

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## 2.1 Preprocessing
Preprocessing uses TextBlob for tokenisation and lemmatisation, lower-cases the tokens, keeps only matches of the regex `(((?![\d])\w)+)` (as in gensim, to remove trailing punctuation), removes stopwords (see `api.preprocess.document_to_tokens`), and applies a length filter.

Null values do not need to be removed beforehand because the tokenisation step handles them.
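The steps described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `api.preprocess`: the function name `document_to_tokens_sketch`, the tiny `STOPWORDS` set, and the `min_len` parameter are all assumptions, lemmatisation is left out to avoid a TextBlob dependency, and the "length filter" is interpreted here as a minimum token length.

```python
import re

# Same pattern as quoted above: runs of word characters that are not digits.
PAT_ALPHABETIC = re.compile(r"(((?![\d])\w)+)", re.UNICODE)

# Tiny illustrative stopword list (assumption; the real list is not shown here).
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "who"}


def document_to_tokens_sketch(document, min_len=2):
    """Lower-case, regex-match, then drop stopwords and short tokens.

    Lemmatisation is deliberately omitted to keep the sketch stdlib-only.
    """
    if not isinstance(document, str):
        # Assumption: null or non-string values simply yield no tokens.
        return []
    tokens = [m.group() for m in PAT_ALPHABETIC.finditer(document.lower())]
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]


print(document_to_tokens_sketch("Born in 1952, he is a player."))
# ['born', 'he', 'player']
```

Digits and punctuation never match the pattern, so "1952," and the final full stop disappear without a separate cleaning step.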

In [2]:
from api.preprocess import TrainingCorpus

2022-10-04 10:21:02,633 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-10-04 10:21:02,634 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2022-10-04 10:21:02,635 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2022-10-04T10:21:02.635001', 'gensim': '4.2.0', 'python': '3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}


In [3]:
corpus = TrainingCorpus("dataset_wikibios_merged", tokenize_before_training=True)
for i, sentence in enumerate(corpus):
    print(sentence)
    if i == 5:
        break

['leonard', 'shenoff', 'randle', 'bear', 'february', 'major', 'league', 'baseball', 'player', 'he', 'first', 'pick', 'washington', 'senator', 'secondary', 'phase', 'june', 'major', 'league', 'baseball', 'draft', 'tenth', 'overall']
['philippe', 'adnot', 'bear', 'august', 'rhèges', 'member', 'senate', 'france', 'he', 'first', 'elect', 'represent', 'aube', 'department', 'farmer', 'profession', 'he', 'serve', 'independent', 'serve', 'head', 'general', 'council', 'aube', 'he', 'elect', 'represent', 'canton', 'méry', 'he', 'senate', 'first', 'round', 'avoid', 'need', 'run', 'vote', 'contribute', 'creation', 'university', 'technology', 'troyes', 'he', 'first', 'vice', 'president', 'university', 'board', 'he', 'currently', 'president', 'he', 'member', 'senate', 'committee', 'law', 'relate', 'freedom', 'responsibility', 'university', 'he', 'serve', 'delegate', 'administrative', 'meeting', 'senator', 'list', 'group', 'he', 'decorate', 'chevalier', 'ordre', 'national', 'mérite', 'agricole']
['mi

## 2.2 Train single embedding model
With `tokenize_before_training=True`, the corpus is tokenised once when it is created. This makes construction slower but saves a lot of time during training.
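The saving comes from the fact that gensim iterates over the corpus several times: one pass to build the vocabulary and then one pass per epoch. The sketch below illustrates that trade-off with a hypothetical `SketchCorpus` class; it is not the real `TrainingCorpus`, whose internals are not shown here, and the whitespace tokenizer is only a stand-in.

```python
class SketchCorpus:
    """Hypothetical re-iterable corpus; not the real TrainingCorpus."""

    def __init__(self, documents, tokenizer, tokenize_before_training=True):
        self.tokenizer = tokenizer
        self.pretokenized = tokenize_before_training
        if tokenize_before_training:
            # Pay the tokenisation cost once, up front.
            self.data = [tokenizer(doc) for doc in documents]
        else:
            # Keep the raw text and tokenise lazily on every pass.
            self.data = list(documents)

    def __iter__(self):
        for item in self.data:
            yield item if self.pretokenized else self.tokenizer(item)


calls = []


def counting_tokenizer(doc):
    """Stand-in tokenizer that records how often it is called."""
    calls.append(doc)
    return doc.split()


docs = ["a b", "c d e"]

# Six passes: one vocabulary scan plus five training epochs.
eager = SketchCorpus(docs, counting_tokenizer, tokenize_before_training=True)
for _ in range(6):
    list(eager)
eager_calls = len(calls)

calls.clear()
lazy = SketchCorpus(docs, counting_tokenizer, tokenize_before_training=False)
for _ in range(6):
    list(lazy)
lazy_calls = len(calls)

print(eager_calls, lazy_calls)
# 2 12
```

With pre-tokenisation, each document is tokenised once; without it, each document is tokenised on every pass.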

In [4]:
from gensim.models import Word2Vec

In [None]:
corpus = TrainingCorpus("dataset_wikibios_merged", tokenize_before_training=True)

The default number of epochs is 5; training takes about 10 minutes.

In [None]:
model = Word2Vec(sentences=corpus, workers=4)

In [None]:
model.save("models/dataset_wikibios_merged.pt")

## 2.3 Train embedding models for all datasets (for ensemble model)

In [None]:
# Train and save one model per dataset; `dataset` is the dataset name, not a corpus object.
for dataset in ["dataset_bug_stereotype", "dataset_bug", "dataset_doughman", "dataset_doughman_stereotype"]:
    print("########", dataset, "#######")
    model = Word2Vec(sentences=TrainingCorpus(dataset, tokenize_before_training=True), workers=4)
    model.save(f"models/{dataset}.pt")

## 2.4 Demo scoring function

In [5]:
from api.gender_bias import GenderBiasScorer, percentage_bias, biased_words

In [6]:
wv = Word2Vec.load("models/dataset_wikibios_merged.pt").wv

2022-10-04 10:21:27,018 : INFO : loading Word2Vec object from models/dataset_wikibios_merged.pt
2022-10-04 10:21:27,076 : INFO : loading wv recursively from models/dataset_wikibios_merged.pt.wv.* with mmap=None
2022-10-04 10:21:27,077 : INFO : loading vectors from models/dataset_wikibios_merged.pt.wv.vectors.npy with mmap=None
2022-10-04 10:21:27,125 : INFO : loading syn1neg from models/dataset_wikibios_merged.pt.syn1neg.npy with mmap=None
2022-10-04 10:21:27,157 : INFO : setting ignored attribute cum_table to None
2022-10-04 10:21:28,003 : INFO : Word2Vec lifecycle event {'fname': 'models/dataset_wikibios_merged.pt', 'datetime': '2022-10-04T10:21:28.003207', 'gensim': '4.2.0', 'python': '3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'loaded'}


In [7]:
scorer = GenderBiasScorer(wv)

The scorer takes a document string and returns the percentage bias and the list of biased words. This function will be called from `api.py`. The threshold will be tuned for each model during evaluation.

In [8]:
test_document = "My guess is she can afford to pay a nutritionist who limits her to a reasonable caloric intake and a personal trainer who keeps her on a workout plan."

In [12]:
tokens, scores = scorer.score_document(test_document, thresh=0.2)

Does not exist in vocab:  caloric


In [13]:
percentage_bias(scores)

0.4

In [14]:
biased_words(tokens, scores)

{'biased_m': ['reasonable'],
 'biased_f': ['guess', 'she', 'nutritionist', 'her', 'her'],
 'unbiased': ['afford',
  'pay',
  'limit',
  'caloric',
  'intake',
  'personal',
  'trainer',
  'workout',
  'plan']}
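The outputs above are consistent with the percentage bias being the share of tokens flagged as biased in either direction: 1 male-biased plus 5 female-biased out of 15 tokens gives 0.4. The sketch below reproduces that arithmetic. It is not the code in `api.gender_bias`: the function names, the `+1`/`-1`/`0` label encoding, and the token order (reconstructed from the test sentence) are assumptions made for illustration.

```python
def percentage_bias_sketch(labels):
    """Fraction of tokens labelled as biased in either direction."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label != 0) / len(labels)


def biased_words_sketch(tokens, labels):
    """Group tokens by label: +1 male-biased, -1 female-biased, 0 unbiased."""
    groups = {"biased_m": [], "biased_f": [], "unbiased": []}
    for token, label in zip(tokens, labels):
        if label > 0:
            groups["biased_m"].append(token)
        elif label < 0:
            groups["biased_f"].append(token)
        else:
            groups["unbiased"].append(token)
    return groups


# Tokens from the output above, in sentence order (order is an assumption).
tokens = ["guess", "she", "afford", "pay", "nutritionist", "limit", "her",
          "reasonable", "caloric", "intake", "personal", "trainer", "her",
          "workout", "plan"]
# Assumed encoding of the thresholded scores for those tokens.
labels = [-1, -1, 0, 0, -1, 0, -1, 1, 0, 0, 0, 0, -1, 0, 0]

print(percentage_bias_sketch(labels))
# 0.4
groups = biased_words_sketch(tokens, labels)
print(groups["biased_m"], groups["biased_f"])
# ['reasonable'] ['guess', 'she', 'nutritionist', 'her', 'her']
```

Note that "caloric" is counted among the unbiased tokens even though it is missing from the vocabulary, which matches the output shown above.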