<a href="https://colab.research.google.com/github/Lanxin-Xiang/is765/blob/main/W3c_fastText_word_representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **W3. fastText Demo code**

In this notebook, we will play around with a pre-trained fastText model and then train our own model.

Ref: https://fasttext.cc/docs/en/crawl-vectors.html#adapt-the-dimension

## **Pre-trained word representation model**

This fastText pre-trained model was trained on Common Crawl and Wikipedia using **CBOW** with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. [Read more](https://fasttext.cc/docs/en/crawl-vectors.html#adapt-the-dimension).
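To make "character n-grams of length 5" concrete, here is a minimal sketch of how a word can be split into such n-grams. It assumes fastText's convention of wrapping each word in `<` and `>` boundary markers. The function name `char_ngrams` is illustrative and not part of the fastText API, and the real implementation additionally works on UTF-8 bytes and hashes the n-grams into buckets.

```python
def char_ngrams(word, minn=5, maxn=5):
    """Return the character n-grams of `word`, including boundary markers.

    Illustrative helper only; this is not a fastText function.
    """
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("business"))
```

Misspellings such as "busines" share most of these n-grams with "business", which is one reason a subword model tends to place them close together.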

In [15]:
from google.colab import drive

drive.mount('/content/drive')

%cd drive/My\ Drive/is765

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[Errno 2] No such file or directory: 'drive/My Drive/is765'
/content/drive/MyDrive/is765


In [16]:
!git clone https://github.com/facebookresearch/fastText.git

fatal: destination path 'fastText' already exists and is not an empty directory.


In [17]:
%cd fastText
!sudo pip install .
%cd ..
# or :
# !sudo python setup.py install

/content/drive/My Drive/is765/fastText
Processing /content/drive/My Drive/is765/fastText
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4291887 sha256=878f2bc5b5b334781cc3ce6d0cddae14656f9082ace08e3995ed95347dbb3eb7
  Stored in directory: /tmp/pip-ephem-wheel-cache-cg47hgng/wheels/0c/be/ee/e24b5d911a7e2b16e5d42b0602bb241b23a42ddd58e26d14f1
Successfully built fasttext
Installing collected packages: fasttext
  Attempting uninstall: fasttext
    Found existing installation: fasttext 0.9.2
    Uninstalling fasttext-0.9.2:
      Successfully uninstalled fasttext-0.9.2
Successfully installed fasttext-0.9.2
/content/drive/My Drive/is765


In [18]:
import fasttext
import fasttext.util

### Download the english model

If downloading with the following code is extremely slow and you have enough free space (4.2GB) on your local machine, download the file locally and then upload it to /content/drive/My Drive/is765.

Click [here](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz) to download.



In [19]:
fasttext.util.download_model('en', if_exists='ignore')

'cc.en.300.bin'

### Load model and check model dimension

In [20]:
ft = fasttext.load_model('cc.en.300.bin')
ft.get_dimension()

300

### View word vector and find the nearest neighbors

This model provides word vectors of dimension 300. If you need a smaller size, you can use the dimension reducer `fasttext.util.reduce_model(ft, [dimension_you_need])` and save the result for later use with `ft.save_model('[model_name].bin')`.
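A minimal sketch of that reduce-and-save workflow is shown here. The helpers `reduced_model_name` and `reduce_and_save` are hypothetical names introduced for illustration; only `load_model`, `get_dimension`, `reduce_model`, and `save_model` are fastText calls. fastText is imported inside the function, so the naming helper can be used without the library installed.

```python
def reduced_model_name(name, old_dim, new_dim):
    """Swap the dimension in a file name, e.g. cc.en.300.bin -> cc.en.100.bin.

    Hypothetical helper; it assumes the dimension appears as '.<dim>.' in the name.
    """
    return name.replace(f".{old_dim}.", f".{new_dim}.")


def reduce_and_save(model_path, new_dim):
    """Load a model, shrink its vectors to `new_dim`, and save it under a new name."""
    import fasttext       # imported lazily; requires the fasttext package
    import fasttext.util

    model = fasttext.load_model(model_path)
    old_dim = model.get_dimension()
    fasttext.util.reduce_model(model, new_dim)  # modifies the model in place
    out_path = reduced_model_name(model_path, old_dim, new_dim)
    model.save_model(out_path)
    return out_path


# Example call (not run here, since it needs the downloaded model file):
# reduce_and_save("cc.en.300.bin", 100)
```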

In [21]:
# view word vector of 'business'
ft.get_word_vector('business')

array([-0.01748701,  0.0205003 ,  0.00045082,  0.05443384, -0.03992728,
        0.06292507,  0.06871703,  0.0110075 ,  0.01524404,  0.01243424,
        0.05515028,  0.04769463, -0.00307405,  0.01624583, -0.01627133,
        0.00505937,  0.01617325,  0.00869957, -0.02411501,  0.01153365,
       -0.04201073, -0.05712525, -0.0291956 ,  0.04447872, -0.02245653,
        0.02838457,  0.0024024 ,  0.02998569, -0.00458549,  0.04846683,
        0.00940121, -0.01559024,  0.03521474, -0.03529881, -0.02251797,
        0.0214111 , -0.0051529 ,  0.01424455,  0.01805655, -0.01387825,
       -0.03698391, -0.02891411, -0.01573465,  0.02866242, -0.07018983,
       -0.02869168,  0.01659216, -0.00428046,  0.0305961 , -0.02691242,
       -0.01994575, -0.00468095,  0.04948655,  0.00340673, -0.04682877,
        0.00863829,  0.00702803, -0.00367357, -0.05073714, -0.01299426,
       -0.01623745, -0.06201141, -0.02648371,  0.01083561, -0.00339216,
       -0.03653119,  0.03942255,  0.03291544,  0.03652163, -0.01

In [22]:
# get dimension of word vector
ft.get_word_vector('business').shape

(300,)

In [23]:
# return the 10 nearest words as (similarity, word) pairs, sorted by descending cosine similarity
ft.get_nearest_neighbors('business')

[(0.7553378343582153, 'busines'),
 (0.7056190967559814, 'buiness'),
 (0.7006047964096069, 'businesss'),
 (0.6821038126945496, 'busine'),
 (0.6689029932022095, 'businee'),
 (0.6610117554664612, 'businss'),
 (0.6595211625099182, 'businees'),
 (0.6502494215965271, 'busienss'),
 (0.6476604342460632, 'businesses'),
 (0.6353780031204224, 'buisness')]
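The first number in each pair is a cosine similarity between word vectors, so higher means closer. The following sketch shows how such a ranking could be computed by hand. The three-dimensional vectors are made up for illustration, and `cosine_similarity` and `nearest_neighbors` are hypothetical helpers, not fastText functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, vectors, k=10):
    """Rank every word except `query` by cosine similarity to it."""
    scored = [
        (cosine_similarity(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Made-up toy vectors, only to exercise the functions.
toy_vectors = {
    "business": [1.0, 0.9, 0.0],
    "businesses": [0.9, 1.0, 0.1],
    "banana": [0.0, 0.1, 1.0],
}

for score, word in nearest_neighbors("business", toy_vectors):
    print(f"{score:.3f} {word}")
```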

## **Train your own word representations model**

In this section, the code shows how to build word vectors with the fastText tool. [Read more](https://fasttext.cc/docs/en/unsupervised-tutorial.html#:~:text=fastText%20provides%20two%20models%20for,word%20according%20to%20its%20context.).

### Get the data

In [24]:
%cd data
!mkdir tiny_en_Wiki
!wget -nc http://mattmahoney.net/dc/enwik9.zip -P tiny_en_Wiki
!unzip -n tiny_en_Wiki/enwik9.zip -d tiny_en_Wiki
%cd ..

/content/drive/MyDrive/is765/data
mkdir: cannot create directory ‘tiny_en_Wiki’: File exists
File ‘tiny_en_Wiki/enwik9.zip’ already there; not retrieving.

Archive:  tiny_en_Wiki/enwik9.zip
/content/drive/MyDrive/is765


In [25]:
!pwd # make sure you are under /content/drive/MyDrive/is765

/content/drive/MyDrive/is765


A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText.

In [26]:
!perl fastText/wikifil.pl data/tiny_en_Wiki/enwik9 > data/tiny_en_Wiki/fil9

We can check the file by running the following command:

In [27]:
!head -c 80 data/tiny_en_Wiki/fil9

 anarchism originated as a term of abuse first used against early working class 
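The output is lowercase text with no punctuation or digits. As a rough illustration of that final normalization step, the sketch below lowercases the text, spells out digits, and collapses everything outside a-z into single spaces. It is a simplified approximation written for this notebook, not a port of wikifil.pl: it does not strip XML or wiki markup, and the name `normalize` is an assumption.

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize(text):
    """Lowercase, spell out digits, and keep only a-z separated by single spaces."""
    text = text.lower()
    for digit, word in DIGIT_WORDS.items():
        text = text.replace(digit, f" {word} ")
    # Any run of characters outside a-z becomes one space.
    return re.sub(r"[^a-z]+", " ", text).strip()


print(normalize("Anarchism originated in 1840!"))
```

Spelling out digits this way is why number words such as "one" and "zero" end up among the most frequent tokens in a corpus prepared like this.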

### Train word vectors

You can customize the following parameters, but this demo uses the defaults.

```
unsupervised_default = {
    'model': "skipgram",
    'lr': 0.05,
    'dim': 100,
    'ws': 5,
    'epoch': 5,
    'minCount': 5,
    'minCountLabel': 0,
    'minn': 3,
    'maxn': 6,
    'neg': 5,
    'wordNgrams': 1,
    'loss': "ns",
    'bucket': 2000000,
    'thread': multiprocessing.cpu_count() - 1,
    'lrUpdateRate': 100,
    't': 1e-4,
    'label': "__label__",
    'verbose': 2,
    'pretrainedVectors': "",
    'seed': 0,
    'autotuneValidationFile': "",
    'autotuneMetric': "f1",
    'autotunePredictions': 1,
    'autotuneDuration': 60 * 5,  # 5 minutes
    'autotuneModelSize': ""
}
```

In [28]:
import fasttext
my_model = fasttext.train_unsupervised('data/tiny_en_Wiki/fil9')
# training time: 2hr, RAM: 9.4GB, Disk: 27.0GB, 4 cores
# how to get more compute resources:
# opt1. use a more powerful local machine / connect Colab to a local runtime
# opt2. connect to a Google Virtual Machine (GVM)

Here is an example of how to set your own parameters:
```
model = fasttext.train_unsupervised('data/tiny_en_Wiki/fil9', epoch=1, lr=0.5)
```
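When trying several configurations, it can help to keep the overrides in one place. The sketch below merges overrides into a subset of the defaults listed above and rejects misspelled parameter names before any training starts. `DEFAULTS`, `build_params`, and `train` are hypothetical names for this notebook; only `fasttext.train_unsupervised` is a fastText call, and it is imported lazily inside `train`.

```python
# Subset of the unsupervised defaults listed above.
DEFAULTS = {
    "model": "skipgram",
    "lr": 0.05,
    "dim": 100,
    "ws": 5,
    "epoch": 5,
    "minCount": 5,
    "minn": 3,
    "maxn": 6,
    "neg": 5,
}


def build_params(**overrides):
    """Return the defaults with `overrides` applied; reject unknown names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def train(corpus_path, **overrides):
    """Train an unsupervised model with the merged parameters."""
    import fasttext  # imported lazily; requires the fasttext package

    return fasttext.train_unsupervised(corpus_path, **build_params(**overrides))


print(build_params(model="cbow", dim=50, epoch=1))
# Example call (not run here, since training takes a long time):
# small_model = train("data/tiny_en_Wiki/fil9", model="cbow", dim=50, epoch=1)
```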

In [32]:
my_model.words[:30]

['the',
 'of',
 'one',
 'zero',
 'and',
 'in',
 'two',
 'a',
 'nine',
 'to',
 'is',
 'eight',
 'three',
 'four',
 'five',
 'six',
 'seven',
 'for',
 'are',
 'as',
 'was',
 's',
 'with',
 'by',
 'from',
 'that',
 'on',
 'or',
 'it',
 'at']

In [33]:
my_model.get_word_vector('business')

array([ 0.27145997,  0.04903308,  0.12494824, -0.09472214, -0.0802509 ,
        0.14166524, -0.35092482, -0.26776573,  0.23604995, -0.42009318,
        0.18008986,  0.4908855 , -0.44141048, -0.0344308 ,  0.1765782 ,
       -0.5394984 , -0.00714607, -0.20844656,  0.13427804, -0.01967354,
       -0.34965718, -0.04027196,  0.28748438, -0.61245716, -0.53191185,
       -0.05089269,  0.17186832, -0.49931455,  0.42541125,  0.1389539 ,
        0.17144738, -0.00709736,  0.15081407, -0.08687492, -0.38695407,
        0.15503404, -0.23869658,  0.6247077 , -0.18065308,  0.45242637,
        0.2869621 ,  0.5218008 ,  0.00551489,  0.23677549,  0.4656466 ,
        0.04830477,  0.43622276, -0.46874878, -0.1636521 ,  0.07875084,
        0.28107744, -0.14362164,  0.38284814, -0.11255915,  0.20197074,
        0.00800322, -0.16106468,  0.2194451 ,  0.26752743, -0.2506735 ,
       -0.17488182,  0.29191926,  0.18379547, -0.19985268,  0.18061788,
        0.20076558,  0.26102576,  0.2653019 , -0.21332307,  0.48

In [34]:
my_model.get_nearest_neighbors('business')

[(0.8689184188842773, 'ebusiness'),
 (0.7938203811645508, 'corporate'),
 (0.779841423034668, 'busines'),
 (0.7747765183448792, 'entrepreneurship'),
 (0.7723420262336731, 'consultancy'),
 (0.7720124125480652, 'businessweek'),
 (0.7691093683242798, 'outsourcing'),
 (0.7689657807350159, 'businesses'),
 (0.7660577297210693, 'banking'),
 (0.7647511959075928, 'firms')]

Save model as a binary file

In [35]:
my_model.save_model("my.ft.model.bin")

Reload later instead of training again

In [36]:
import fasttext  # avoid the alias `ft`, which already names the pre-trained model above
new_model = fasttext.load_model("my.ft.model.bin")