# Learning NLP Tutorial Series 
## Tutorial 3: More on Word Embeddings + Vectorized Biases

Topics include: 
* Exploring alternatives to Word2Vec, namely GloVe and FastText

* Fairness in word embeddings: spotting biases (WEAT, ...)

---

## **Overview**

* [Kaggle API](#section0)

* [FastText](#section1)
  
* [GloVe](#section2)


* [Word Embedding Fairness Evaluation (WEFE)](#section3)

* [References & Additional Material](#section4)

---

<a id="#section0"></a>
# **Kaggle API**

To download the dataset chosen for this tutorial from Kaggle, run the cells below. A complete description of the procedure is available [here](https://www.kaggle.com/general/74235).

Before executing the first cell, go to your Kaggle account page, scroll to the API section, and click "Create New API Token". This downloads a `kaggle.json` file to your machine.

In [1]:
!pip install -q kaggle

In [2]:
from google.colab import files

In [3]:
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"simoneazeglio","key":"defac3dc7678682e97827fcd04bcd0f2"}'}

In [4]:
!mkdir -p ~/.kaggle

In [5]:
!cp kaggle.json ~/.kaggle/

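What does `chmod 600` do? It gives the file's owner read and write permission and denies all access to everyone else, which keeps the API token private. [This page](https://chmodcommand.com/chmod-600/) explains it in more detail.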

In [6]:
!chmod 600 ~/.kaggle/kaggle.json 
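
To make the effect of `chmod 600` concrete, here is a minimal Python sketch that mirrors the three shell cells above on a throwaway directory. The temporary directory and the placeholder credentials are assumptions for illustration only; the real file lives in `~/.kaggle/`, and the mode check is meaningful on a POSIX system such as the Colab VM.

```python
import json
import os
import shutil
import stat
import tempfile

# Work in a throwaway directory so the real ~/.kaggle is never touched.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "kaggle.json")
with open(source, "w") as f:
    # Placeholder credentials (hypothetical), not a real token.
    json.dump({"username": "your-username", "key": "your-api-key"}, f)

# Equivalent of: mkdir ~/.kaggle ; cp kaggle.json ~/.kaggle/
config_dir = os.path.join(workdir, ".kaggle")
os.makedirs(config_dir, exist_ok=True)
target = shutil.copy(source, config_dir)

# Equivalent of: chmod 600 ~/.kaggle/kaggle.json
# 0o600 = owner may read and write; group and others get no access.
os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)

# Read the permission bits back to confirm what was set.
mode = stat.S_IMODE(os.stat(target).st_mode)
print(oct(mode))
```

The two flags `S_IRUSR` and `S_IWUSR` correspond to the leading `6` in `600`; the two zeros mean that no bits are set for the group or for other users.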

In [9]:
!kaggle datasets list

ref                                                         title                                              size  lastUpdated          downloadCount  
----------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  
gpreda/reddit-vaccine-myths                                 Reddit Vaccine Myths                              227KB  2021-05-05 14:54:57           4311  
crowww/a-large-scale-fish-dataset                           A Large Scale Fish Dataset                          3GB  2021-04-28 17:03:01           2488  
promptcloud/careerbuilder-job-listing-2020                  Careerbuilder Job Listing 2020                     42MB  2021-03-05 06:59:52            577  
mathurinache/twitter-edge-nodes                             Twitter Edge Nodes                                342MB  2021-03-08 06:43:04            261  
dhruvildave/wikibooks-dataset                               Wikibooks Datase

In [10]:
!kaggle datasets download -d simoneazeglio/bookcorpus

Downloading bookcorpus.zip to /content
100% 2.30G/2.30G [00:21<00:00, 126MB/s]
100% 2.30G/2.30G [00:21<00:00, 113MB/s]


In [11]:
!mkdir train

In [12]:
!unzip bookcorpus.zip -d train

Streaming output truncated to the last 5000 lines.
  inflating: train/books1/epubtxt/the-black-joke.epub.txt  
  inflating: train/books1/epubtxt/the-black-knight.epub.txt  
  inflating: train/books1/epubtxt/the-black-parade.epub.txt  
  inflating: train/books1/epubtxt/the-black-room-part-one-in-the-black-room.epub.txt  
  inflating: train/books1/epubtxt/the-black-rose.epub.txt  
  inflating: train/books1/epubtxt/the-black-tide.epub.txt  
  inflating: train/books1/epubtxt/the-blackhearted-saint.epub.txt  
  inflating: train/books1/epubtxt/the-blacksmith-soldier.epub.txt  
  inflating: train/books1/epubtxt/the-blackwater-journal.epub.txt  
  inflating: train/books1/epubtxt/the-blackwood-trilogy.epub.txt  
  inflating: train/books1/epubtxt/the-blade-witch.epub.txt  
  inflating: train/books1/epubtxt/the-blade.epub.txt  
  inflating: train/books1/epubtxt/the-blake-soul-a-supernatural-thriller-and-romance.epub.txt  
  inflating: train/books1/epubtxt/the-blanket-of-blessings.ep

## Preprocessing + Corpus

In [1]:
path_to_books = "train/books1/epubtxt/*.txt"

In [2]:
import glob 

In [3]:
book_names = glob.glob(path_to_books)  # glob.glob already returns a list of paths

In [32]:
#prova = list(filter(lambda k: 'volume' in k, book_names))
#prova

In [4]:
lady_science = list(filter(lambda k: 'lady-science' in k, book_names))
lady_science

['train/books1/epubtxt/lady-science-volume-iii-2016-2017.epub.txt',
 'train/books1/epubtxt/lady-science-volume-i-2014-2015.epub.txt',
 'train/books1/epubtxt/lady-science-volume-ii-2015-2016.epub.txt']

In [35]:
#!cat /proc/cpuinfo # check cores specs

In [61]:
#!pip uninstall gensim
#!pip install gensim==4.0.0

Uninstalling gensim-3.6.0:
  Would remove:
    /usr/local/lib/python3.7/dist-packages/gensim-3.6.0.dist-info/*
    /usr/local/lib/python3.7/dist-packages/gensim/*
Proceed (y/n)? y
  Successfully uninstalled gensim-3.6.0


In [5]:
from smart_open import open
import gensim



In [6]:
gensim.__version__

'4.0.0'

In [55]:
## Class that takes a list of files and extracts the words from each line,
## streaming them on the fly for Word2Vec
class MyCorpus:
    def __init__(self, list_of_names):
        self.list_of_names = list_of_names

    def __iter__(self):
        for filename in self.list_of_names:
            # smart_open's open() is used as a context manager so each file is closed
            with open(filename) as f:
                for line in f:
                    # assume there's one document per line; simple_preprocess
                    # lowercases the line and splits it into tokens
                    yield gensim.utils.simple_preprocess(line)
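
The corpus is wrapped in a class with `__iter__`, rather than a plain generator, because Word2Vec training passes over the data more than once (a vocabulary pass, then the training epochs). The following sketch uses only the standard library to illustrate the difference. `StreamingCorpus`, `tokenize`, and the two sample files are hypothetical, and `tokenize` is only a rough stand-in for `gensim.utils.simple_preprocess`, not its actual implementation.

```python
import os
import re
import tempfile


def tokenize(line):
    # Rough approximation of gensim.utils.simple_preprocess (an assumption):
    # lowercase, keep alphabetic tokens of 2 to 15 characters.
    return [t for t in re.findall(r"[a-z]+", line.lower()) if 2 <= len(t) <= 15]


class StreamingCorpus:
    """Restartable iterable: every call to __iter__ reopens the files."""

    def __init__(self, filenames):
        self.filenames = filenames

    def __iter__(self):
        for filename in self.filenames:
            with open(filename, encoding="utf-8") as handle:
                for line in handle:
                    yield tokenize(line)


# Two tiny sample files standing in for the book texts.
samples = ["Ada wrote the first program.\n", "Science is a process.\nAsk questions.\n"]
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(samples):
    path = os.path.join(tmpdir, f"book{i}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    paths.append(path)

corpus = StreamingCorpus(paths)
first_pass = sum(1 for _ in corpus)    # counts lines without building a list
second_pass = sum(1 for _ in corpus)   # works again: the iterable restarts

one_shot = iter(corpus)                # a single generator, by contrast
gen_first = sum(1 for _ in one_shot)
gen_second = sum(1 for _ in one_shot)  # the generator is already exhausted

print(first_pass, second_pass, gen_first, gen_second)
```

Counting with `sum(1 for _ in corpus)` gives the number of lines without holding every tokenized line in memory, which matters once the corpus grows beyond a few files.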

In [56]:
words = MyCorpus(lady_science)

In [57]:
len(list(words))

3578

In [58]:
model = gensim.models.Word2Vec(words)

## Word2Vec Embedding

## FastText 



In [None]:
## FastText