## Guideline on how to run the code

Generally, there are three important steps in this notebook:
- Preprocessing data and preparing the corpus
- Train a word2vec model from scratch on the corpus (with automatic hyperparameter tuning)
- Download a pretrained model and finetune this model on the existing corpus (with automatic hyperparameter tuning)

### 1- Data Preprocessing

The goal here is to take the dataset (in the appropriate JSON format) and transform it into a text file that gensim's `LineSentence` can read and feed to the Word2Vec model. The input is therefore a JSON file and the output is a text file with one sentence per line.

The input should be tokenized, normalized, and passed through all the preprocessing steps. For this part, I mostly used this open-source project: [mat2vec](https://github.com/materialsintelligence/mat2vec)

In [10]:
import json
import pickle
import pandas as pd
import os
from rt_interview_RV import RV_code_snippet
import numpy
from tqdm import tqdm
import gensim.models
from gensim.models.word2vec import LineSentence
import utils
from copy import deepcopy
import optuna

This can be done easily with the `preprocess` function in `utils`. This function receives a JSON file and produces a text file in the following folder: mat2vec/training/data/[your_file_name]

For example, the call below saves the output at: mat2vec/training/data/corpus

In [3]:
utils.preprocess("rt_interview_RV/dataset.json", "corpus")

100%|██████████| 19742/19742 [05:20<00:00, 61.57it/s]


Now the preprocessing is complete and the data has been saved in the directory mentioned above. We can go to the next step.

Note: This preprocessing is appropriate for data in the materials science or chemistry domain.
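
The internals of `utils.preprocess` are not shown in this notebook, so the following is only a minimal sketch of the JSON-to-text idea under stated assumptions: the dataset is assumed to be a list of records with a `description` field (a hypothetical key name), and `simple_tokenize` is a crude stand-in for the mat2vec tokenizer. Note that it lowercases everything, which a chemistry-aware tokenizer would avoid for formulas.

```python
import json
import os
import re
import tempfile


def simple_tokenize(text):
    """Lowercase and keep alphanumeric runs (a crude stand-in tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def json_to_line_corpus(json_path, corpus_path, text_key="description"):
    """Write one space-separated, tokenized record per line; return the line count."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    written = 0
    with open(corpus_path, "w", encoding="utf-8") as out:
        for record in records:
            tokens = simple_tokenize(record.get(text_key, ""))
            if tokens:  # skip records with no usable text
                out.write(" ".join(tokens) + "\n")
                written += 1
    return written


# Tiny demo with made-up records (not taken from the real dataset).
workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "dataset.json")
corpus_path = os.path.join(workdir, "corpus")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(
        [
            {"description": "LiFePO4 is a cathode material."},
            {"description": "Thin films were annealed at 450 C."},
            {"description": ""},
        ],
        f,
    )

n_lines = json_to_line_corpus(json_path, corpus_path)
with open(corpus_path, encoding="utf-8") as f:
    corpus_lines = f.read().splitlines()
print(n_lines, corpus_lines)
```

A file in this shape (one sentence per line, tokens separated by spaces) is what `LineSentence` iterates over.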

### 2- Training From Scratch

Now, in this section I want to learn a word2vec embedding from the existing corpus alone. We want to tune the hyperparameters automatically and find the best ones for the word embedding. During hyperparameter tuning, the goal is to reach the maximum **RV coefficient value**.
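
The score itself comes from `RV_code_snippet`, whose implementation is not shown in this notebook. As a rough illustration only, and assuming it follows the standard definition of the RV coefficient for two column-centered matrices $X$ and $Y$ with the same rows, $RV = \mathrm{tr}(XX^TYY^T) / \sqrt{\mathrm{tr}((XX^T)^2)\,\mathrm{tr}((YY^T)^2)}$, a dependency-free sketch could look like this (the function names are my own):

```python
import math


def center_columns(M):
    """Subtract the column mean from every entry."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]


def cross_product(A, B):
    """Return A^T B for two matrices with the same number of rows."""
    return [
        [sum(a * b for a, b in zip(col_a, col_b)) for col_b in zip(*B)]
        for col_a in zip(*A)
    ]


def frobenius_sq(M):
    """Squared Frobenius norm."""
    return sum(v * v for row in M for v in row)


def rv_coefficient(X, Y):
    """RV coefficient, using tr(XX^T YY^T) = ||X^T Y||_F^2."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = frobenius_sq(cross_product(Xc, Yc))
    denominator = math.sqrt(
        frobenius_sq(cross_product(Xc, Xc)) * frobenius_sq(cross_product(Yc, Yc))
    )
    return numerator / denominator


# Rows are words, columns are embedding dimensions (made-up numbers).
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
X_scaled = [[2 * v for v in row] for row in X]
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

rv_same = rv_coefficient(X, X_scaled)
rv_other = rv_coefficient(X, Z)
print(rv_same, rv_other)
```

The value lies between 0 and 1 and does not change when one embedding is rescaled, which is why it can compare a learned embedding against reference vectors for the same words. With real 100-dimensional embeddings, a NumPy version of the same formula would be the practical choice.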

For hyperparameter tuning, I have used the Optuna library, which is suitable and powerful for this task. Let's first run the algorithm; after that I will explain how it works.


In [4]:
os.chdir('mat2vec/training')

Here, I am not going to run this on the full corpus (because it takes a good amount of time!). I am running it on a subset of the dataset for demonstration. You can use the `--corpus` argument to select the appropriate training dataset (the full corpus).

Here, data/my_file is a file that contains just 200 descriptions. After running the following line, the model is saved in the models/[model_name] file.

Note: I am going to learn a 100-dimensional representation, because the reference vocabs for RV coefficients have 100 dimensions.

In [5]:
!python phrase2vec.py --corpus=data/my_file --model_name=model_v1 --size=100 -notmp

Best value: 0.09811377340754925 (params: {'subsample': 2.849606907773163e-06, 'window': 8, 'min_count': 4, 'negative': 10, 'alpha': 0.005161227754353624})

   number     value  params_alpha  ...  params_subsample  params_window     state
0       0  0.098114      0.005161  ...          0.000003              8  COMPLETE
1       1  0.057459      0.018081  ...          0.003922              8  COMPLETE
2       2  0.031788      0.017352  ...          0.000030              7  COMPLETE

[3 rows x 8 columns]

2020-06-01 16:10:12,063 : INFO : Basic min_count trim rule for formula.
2020-06-01 16:10:12,063 : INFO : Not including extra phrases, option not specified.
2020-06-01 16:10:12,064 : INFO : collecting all words and their counts
2020-06-01 16:10:12,081 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-06-01 16:10:12,150 : INFO : collected 24418 word types from a corpus of 42112 words (unigram + bigrams) and 200 sentences
2020-06-01 16:10:12,150 : INFO : using 24418 counts as vocab in Phrases<0 vocab, min_count=10, threshold=15.0, max_vocab_size=40000000>
2020-06-01 16:10:12,150 : INFO : source_vocab length 24418
2020-06-01 16:10:12,381 : INFO : Phraser built with 61 phrasegrams

  0%|          | 0/61 [00:00<?, ?it/s]
100%|##########| 61/61 [00:00<00:00, 4132.32it/s]
2020-06-01 16:10:12,400 : INFO : collecting all words and their counts
2020-06-01 16:10:12,401 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-06-01 16:10:12,565 : INFO : collected 24796 word types from a corpus of 40513 words (unigram + bigrams) and 200 sentences
2020-06-01 16:10:12,565 : INFO : using 24796 counts as vocab in Phrases<0 vocab, min_count=10, threshold=15.0, max_vocab_size=40000000>
2020-06-01 16:10:12,566 : INFO : source_vocab length 24796
2020-06-01 16:10:12,811 : INFO : Phraser built with 86 phrasegrams

  0%|          | 0/86 [00:00<?, ?it/s]
100%|##########| 86/86 [00:00<00:00, 84992.97it/s]
2020-06-01 16:10:12,815 : INFO : saving Phraser object under models\model_v1_phraser.pkl, separately None
2020-06-01 16:10:12,816 : INFO : saved models\model_v1_phraser.pkl
2020-06-01 16:10:12,817 

2020-06-01 16:10:14,633 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:14,633 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:14,633 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:14,633 : INFO : EPOCH - 4 : training on 40239 raw words (1738 effective words) took 0.2s, 7690 effective words/s
2020-06-01 16:10:14,894 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:14,895 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:14,895 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:14,895 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:14,895 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:14,895 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:14,895 : INFO : worker threa

2020-06-01 16:10:15,837 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:15,837 : INFO : EPOCH - 9 : training on 40239 raw words (1810 effective words) took 0.2s, 8169 effective words/s
2020-06-01 16:10:16,087 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:16,087 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:16,088 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:16,088 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:16,088 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:16,088 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:16,088 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:16,088 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:16,088 : INFO : worker threa

2020-06-01 16:10:17,320 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:17,320 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:17,321 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:17,321 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:17,321 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:17,321 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:17,321 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:17,321 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:17,321 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:17,321 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:17,322 : INFO : worker thread finished; awaiting finish of 5 more t

2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:18,457 : INFO : worker thread finished; awaiting finish of 3 more thr

2020-06-01 16:10:19,730 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:19,730 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:19,730 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:19,730 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:19,730 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:19,730 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:19,730 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:19,730 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:19,731 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:19,731 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:19,731 : INFO : worker thread finished; awaiting finish of 1 more threa

2020-06-01 16:10:20,993 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:20,993 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:20,993 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:20,993 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:20,993 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:20,994 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:20,994 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:20,994 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:20,994 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:20,994 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:20,994 : INFO : EPOCH - 30 : training on 40239 raw words (1799 effective 

2020-06-01 16:10:24,054 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:24,054 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:24,054 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:24,054 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:24,054 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:24,054 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:24,055 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:24,055 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:24,055 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:24,055 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:24,055 : INFO : worker thread finished; awaiting finish of 1 more threa

2020-06-01 16:10:25,377 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:25,377 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:25,378 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:25,378 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:25,378 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:25,378 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:25,378 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:25,378 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:25,378 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:25,399 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:25,399 : INFO : EPOCH - 8 : training on 40239 raw words (25198 effective 

2020-06-01 16:10:26,800 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:26,800 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:26,800 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:26,800 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:26,800 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:26,800 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:26,801 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:26,821 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:26,821 : INFO : EPOCH - 13 : training on 40239 raw words (25169 effective words) took 0.3s, 98121 effective words/s
2020-06-01 16:10:27,059 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:27,059 : INFO : worker thread 

2020-06-01 16:10:28,173 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:28,173 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:28,173 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:28,173 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:28,173 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:28,195 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:28,195 : INFO : EPOCH - 18 : training on 40239 raw words (25167 effective words) took 0.3s, 99879 effective words/s
2020-06-01 16:10:28,428 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:28,428 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:28,428 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:28,428 : INFO : worker threa

2020-06-01 16:10:29,491 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:29,491 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:29,491 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:29,512 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:29,512 : INFO : EPOCH - 23 : training on 40239 raw words (25132 effective words) took 0.2s, 102806 effective words/s
2020-06-01 16:10:29,738 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:29,739 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:29,739 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:29,739 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:29,739 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:29,739 : INFO : worker th

2020-06-01 16:10:30,777 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:30,798 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:30,798 : INFO : EPOCH - 28 : training on 40239 raw words (25105 effective words) took 0.3s, 99916 effective words/s
2020-06-01 16:10:31,063 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:31,063 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:31,063 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:31,063 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:31,063 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:31,064 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:31,064 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:31,064 : INFO : worker th

2020-06-01 16:10:32,928 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:32,928 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:32,928 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:32,929 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:32,929 : INFO : EPOCH - 1 : training on 40239 raw words (4386 effective words) took 0.3s, 16804 effective words/s
2020-06-01 16:10:33,187 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:33,190 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:33,190 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:33,191 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:33,191 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:33,191 : INFO : worker threa

2020-06-01 16:10:34,214 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-06-01 16:10:34,214 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-06-01 16:10:34,214 : INFO : EPOCH - 6 : training on 40239 raw words (4550 effective words) took 0.2s, 20515 effective words/s
2020-06-01 16:10:34,434 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:34,435 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:34,435 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:34,435 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:34,435 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:34,435 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:34,435 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:34,436 : INFO : worker thre

2020-06-01 16:10:35,401 : INFO : EPOCH - 11 : training on 40239 raw words (4388 effective words) took 0.2s, 19765 effective words/s
2020-06-01 16:10:35,625 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:35,626 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:35,626 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:35,626 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:35,626 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:35,626 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:35,626 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:35,626 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:35,627 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:35,627 : INFO : worker thr

2020-06-01 16:10:36,855 : INFO : worker thread finished; awaiting finish of 15 more threads
2020-06-01 16:10:36,855 : INFO : worker thread finished; awaiting finish of 14 more threads
2020-06-01 16:10:36,855 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:36,855 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:36,856 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:36,856 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:36,856 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:36,856 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:36,856 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:36,856 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:36,856 : INFO : worker thread finished; awaiting finish of 5 more t

2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 13 more threads
2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 12 more threads
2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:38,080 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:38,081 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:38,081 : INFO : worker thread finished; awaiting finish of 3 more thr

2020-06-01 16:10:39,261 : INFO : worker thread finished; awaiting finish of 11 more threads
2020-06-01 16:10:39,261 : INFO : worker thread finished; awaiting finish of 10 more threads
2020-06-01 16:10:39,261 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-06-01 16:10:39,261 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-06-01 16:10:39,261 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-01 16:10:39,262 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-01 16:10:39,262 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-01 16:10:39,262 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-06-01 16:10:39,262 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-06-01 16:10:39,262 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-06-01 16:10:39,262 : INFO : worker thread finished; awaiting finish of 1 more threa

After training, you can observe the RV coefficients and the best hyperparameters. Three files are available in the models folder, which show the hyperparameter configuration and the best hyperparameters. Let's explore the content of these files.

In [7]:
# This shows the best hyperparameter configuration.
utils.read_json('models/results.json')

{'subsample': 2.849606907773163e-06,
 'window': 8,
 'min_count': 4,
 'negative': 10,
 'alpha': 0.005161227754353624}

In [13]:
# This file shows all the information relevant to the hyperparameters:
# the range of each of them and the objective value (RV coefficient) of each trial.
with open('models/results.txt', "r", encoding="utf-8") as f:
    data = f.read()
print(data)

[FrozenTrial(number=0, value=0.09811377340754925, datetime_start=datetime.datetime(2020, 6, 1, 16, 10, 11, 996257), datetime_complete=datetime.datetime(2020, 6, 1, 16, 10, 22, 30431), params={'subsample': 2.849606907773163e-06, 'window': 8, 'min_count': 4, 'negative': 10, 'alpha': 0.005161227754353624}, distributions={'subsample': LogUniformDistribution(high=0.01, low=1e-06), 'window': IntUniformDistribution(high=10, low=5, step=1), 'min_count': IntUniformDistribution(high=10, low=3, step=1), 'negative': IntUniformDistribution(high=20, low=10, step=1), 'alpha': LogUniformDistribution(high=0.1, low=0.0001)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=0, state=TrialState.COMPLETE), FrozenTrial(number=1, value=0.057459311455957444, datetime_start=datetime.datetime(2020, 6, 1, 16, 10, 22, 31430), datetime_complete=datetime.datetime(2020, 6, 1, 16, 10, 31, 443276), params={'subsample': 0.003922404262921013, 'window': 8, 'min_count': 9, 'negative': 20, 'alpha': 0.018080

In [15]:
pd.read_csv('models/results.csv', index_col='number')

| number | Unnamed: 0 | value | params_alpha | params_min_count | params_negative | params_subsample | params_window | state |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.098114 | 0.005161 | 4 | 10 | 3e-06 | 8 | COMPLETE |
| 1 | 1 | 0.057459 | 0.018081 | 9 | 20 | 0.003922 | 8 | COMPLETE |
| 2 | 2 | 0.031788 | 0.017352 | 8 | 19 | 3e-05 | 7 | COMPLETE |


Therefore, you can see that all the information about hyperparameters is available in these files. 
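
As a small illustration of using these files, the best trial can also be picked out programmatically. The sketch below uses only the standard library and the three rows shown in the table above, pasted inline so that it runs on its own; for the real file you would open `models/results.csv` instead of the inline string.

```python
import csv
import io

# Rows copied from the results table above (the redundant "Unnamed: 0" column is left out).
results_csv = """number,value,params_alpha,params_min_count,params_negative,params_subsample,params_window,state
0,0.098114,0.005161,4,10,3e-06,8,COMPLETE
1,0.057459,0.018081,9,20,0.003922,8,COMPLETE
2,0.031788,0.017352,8,19,3e-05,7,COMPLETE
"""

# Keep only finished trials, then take the one with the highest RV coefficient.
rows = [r for r in csv.DictReader(io.StringIO(results_csv)) if r["state"] == "COMPLETE"]
best = max(rows, key=lambda r: float(r["value"]))

# Strip the "params_" prefix to recover the hyperparameter names.
best_params = {
    key[len("params_"):]: float(val)
    for key, val in best.items()
    if key.startswith("params_")
}
print(best["number"], best["value"], best_params)
```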

Now, let me point out a few important notes.

First: I have run just three trials here for the purpose of demonstration. In practice, you may need to run somewhere between 100 and 1000 trials. You can easily do that by changing one line in the `phrase2vec.py` file.

When you run the study, you can set `n_trials` (the argument of Optuna's `study.optimize`) to any number you want.

<img src="assets/n_trials.JPG" alt="Drawing" style="width: 600px;"/>


Also, I have selected just 5 hyperparameters for tuning, but there is no limitation and you can add to this list. You only need to add the parameters to the `param` dictionary in `phrase2vec.py`. After that, Optuna automatically searches over the newly added parameter.
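
Since the screenshots may not render everywhere, here is a hedged sketch of what such a search space can look like. The five ranges are the ones reported in `models/results.txt` above; the extra `epochs` entry and its range are purely hypothetical, added to show how a new parameter would be declared. The function name `suggest_params` and the `MidpointTrial` class are my own: the latter is a stand-in that mimics the two `optuna.trial.Trial` methods used here, so the sketch runs without Optuna installed. It is not the actual code in `phrase2vec.py`.

```python
import math


class MidpointTrial:
    """Stand-in for optuna.trial.Trial: returns the midpoint of every range."""

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Midpoint on a log scale, i.e. the geometric mean.
            return math.exp((math.log(low) + math.log(high)) / 2)
        return (low + high) / 2

    def suggest_int(self, name, low, high):
        return (low + high) // 2


def suggest_params(trial):
    """Declare the search space; `trial` is expected to behave like an Optuna trial."""
    params = {
        "subsample": trial.suggest_float("subsample", 1e-6, 1e-2, log=True),
        "window": trial.suggest_int("window", 5, 10),
        "min_count": trial.suggest_int("min_count", 3, 10),
        "negative": trial.suggest_int("negative", 10, 20),
        "alpha": trial.suggest_float("alpha", 1e-4, 1e-1, log=True),
    }
    # Hypothetical extra hyperparameter: one more line is enough to search over it.
    params["epochs"] = trial.suggest_int("epochs", 5, 30)
    return params


params = suggest_params(MidpointTrial())
print(params)
```

With Optuna itself, the same function would be called inside an objective that trains the model and returns the RV coefficient, and the search would be started with `optuna.create_study(direction="maximize")` followed by `study.optimize(objective, n_trials=100)`. `suggest_float(..., log=True)` is the current Optuna spelling; older releases used `suggest_loguniform` for the same purpose.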

<img src="assets/hyperparameters.JPG" alt="Drawing" style="width: 700px;"/>


After this, I ran the model on the full corpus for three trials (each run takes about 30 min, so I could not run more trials). The best RV coefficient I could achieve was about 0.23.

### 3- Finetuning Pre-Trained Model

In this section, I want to explain how we can finetune an existing word2vec model on our corpus. First, I download a pretrained model into the models folder. Then I load this model and finetune it with the corpus. It is important to note that, again, I have used Optuna to find suitable hyperparameters for finetuning.

**Important note: You should download a pre-trained model with 100-dimensional embeddings.**

In [None]:
!python phrase2vec.py --corpus=data/my_file --model_name=model_finetune --size=100 --finetuning=True -notmp

You only need to add the `--finetuning=True` flag to your arguments. The script then first loads the existing model and then finetunes it on the existing corpus.

Here, just for the sake of demonstration, I have finetuned a very small toy model. In practice, you can choose a more sophisticated pretrained model.

<img src="assets/finetune.JPG" alt="Drawing" style="width: 800px;"/>

You can change the finetune model to any appropriate model you want.

That's it. Please let me know if you have any questions!