## Exploratory Data Analysis

In this notebook, I explore the dataset and the reference vocabulary to become familiar with their structure.

In [14]:
import json
import pickle
import os
from rt_interview_RV import RV_code_snippet
import numpy
from tqdm import tqdm
import gensim.models
from gensim.models.word2vec import LineSentence
import utils
from copy import deepcopy
import optuna

In [2]:
def save_obj(obj, name):
    with open(name + ".pkl", "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


def load_obj(name):
    with open(name + ".pkl", "rb") as f:
        return pickle.load(f)

In [3]:
def read_json(filename: str):
    assert isinstance(filename, str)
    with open(filename) as json_data:
        data = json.loads(json_data.read())
    return data

In [4]:
def read_pickle(filename: str):
    assert isinstance(filename, str)
    with open(filename, 'rb') as pickle_file:
        data = pickle.load(pickle_file)
    return data

In [5]:
def write_pickle(x, filename: str):
    assert isinstance(filename, str)
    with open(filename, 'wb') as pickle_file:
        pickle.dump(x, pickle_file)

In [6]:
dataset = read_json("rt_interview_RV/dataset.json")
ref_vocab = read_json("rt_interview_RV/ref_vocab.json")

In [7]:
print(f"Number of available data: {len(dataset)}")
print(f"Number of reference vocabs: {len(ref_vocab)}")
print(f"One sample from dataset to just become familiar with data type:\n{dataset[15000]}")

Number of available data: 19742
Number of reference vocabs: 95
One sample from dataset to just become familiar with data type:
{'description': 'Bainite formation from intercritical austenite is of great practical importance for the production of TRIP-assisted steels. Silicon and aluminium play important roles during this transformation by delaying carbide precipitation, thus favouring the carbon enrichment of untransformed austenite, which makes its stabilisation down to room temperature possible. Previous studies have shown a strong dependence of bainite formation kinetics on both chemical composition and transformation temperature. In the present work, the effect of silicon and aluminium contents on bainite formation kinetics is investigated experimentally using dilatometry combined with microscopical observations. The experimental results are analysed by comparison with thermodynamic parameters, such as the activation energy G* for nucleation of bainite and the carbon content CT0 co

In [8]:
ref_vocab_len = [len(vocab) for vocab in ref_vocab]
print(f"The shortest and longest words length in ref_vocab: {min(ref_vocab_len)} and {max(ref_vocab_len)}")

The shortest and longest words length in ref_vocab: 2 and 17


Now, let's look at the embeddings of the reference vocabulary:

In [9]:
ref_vocab_matrix = read_pickle("rt_interview_RV/ref_vocab_matrix.pkl")

In [10]:
print(f"the dimension of ref_vocab_matrix: {ref_vocab_matrix.shape}")

the dimension of ref_vocab_matrix: (95, 100)


So the embedding dimension of the reference words is 100, with one row for each of the 95 reference words.

## Preparing the corpus

In [11]:
text_corpus = [doc['description'] for doc in dataset]

In [12]:
text_corpus[0]

'The production process of almost all modern steels involves austenitization formation of the austenite phase upon continuous heating. Many of the microstructural features and properties that are obtained upon subsequent cooling are to a large extend determined by the evolution of the microstructure and chemical inhomogeneities during austenitization. In spite of its importance, austenitization so far has received much less attention than the transformations on cooling; however, the interest is continuously increasing, especially for the development of new types of steels (Dual-Phase steel, TRansformation-Induced Plasticity steel etc.). The aim of the thesis is to develop knowledge and to gain better understanding of the formation of the austenite microstructure in steel during heating, e.g. austenite nucleation kinetics, austenite growth modes and morphologies, redistribution of carbon between the phases during the transformatio'

## Tokenization

In [13]:
from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()

clean_corpus = [text_processor.process(abstract) for abstract in tqdm(text_corpus[:200])] 

100%|██████████| 200/200 [00:17<00:00, 11.15it/s]


In [15]:
# One processed abstract per line, tokens separated by single spaces.
tmp_corpus = [' '.join(item[0]) for item in clean_corpus]
corpus = ' \n'.join(tmp_corpus)
with open("mat2vec/training/data/my_file", "w", encoding="utf-8") as f:
    f.write(corpus)

You can build the corpus for training by running the following line:

In [None]:
utils.preprocess("rt_interview_RV/dataset.json", "corpus")
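
`LineSentence`, used further down, reads a plain text file with one document per line and tokens separated by whitespace. A minimal sketch of writing and reading back a file in that layout is shown here. It assumes a lowercase whitespace tokenizer as a stand-in for `MaterialsTextProcessor`; the helpers `simple_tokenize` and `write_line_corpus` and the sample abstracts are hypothetical and are not part of `utils`.

```python
import os
import tempfile


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def write_line_corpus(documents, path):
    """Write one tokenized document per line, tokens separated by spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            f.write(" ".join(simple_tokenize(doc)) + "\n")


# Made-up sample abstracts, used only for illustration.
sample_docs = [
    "Bainite formation in TRIP-assisted steels",
    "Austenite nucleation kinetics during heating",
]

corpus_path = os.path.join(tempfile.mkdtemp(), "sample_corpus")
write_line_corpus(sample_docs, corpus_path)

# Read the file back the way a line-based corpus reader would.
with open(corpus_path, encoding="utf-8") as f:
    sentences = [line.split() for line in f]

print(sentences[0])
```

Keeping one document per line lets the corpus be streamed from disk during training instead of being held in memory.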

## Online Training

In [None]:
from gensim.test.utils import datapath
# Alias gensim's utils so it does not shadow the local `utils` module imported above.
from gensim import utils as gensim_utils

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield gensim_utils.simple_preprocess(line)

In [None]:
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

In [None]:
# Note: wv.vocab is the gensim 3.x API; gensim 4 replaced it with wv.key_to_index.
for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

In [None]:
import tempfile

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

In [None]:
model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']
]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)

# clean up the temporary file (os is already imported at the top of the notebook)
os.remove(temporary_filepath)

In [None]:
model.epochs

In [74]:
model = gensim.models.Word2Vec.load('mat2vec/training/models/model_example')
oldmodel = deepcopy(model)

In [75]:
def update_args_(args, params):
    """Update the attributes of `args` in place from the `params` dict."""
    vars(args).update(params)

# Override the training hyperparameters on the loaded model.
params = {'window': 5, 'negative': 10}

update_args_(model, params)
model.get_latest_training_loss()

10328.8701171875

In [76]:
#preprocess("rt_interview_RV/dataset.json")
#new_data = read_pickle('corpus')
#sentences = LineSentence('mat2vec/training/data/corpus')
sentences = LineSentence('mat2vec/training/data/my_file')

In [77]:
#!cd mat2vec/training
#!ls
model.build_vocab(sentences, update=True)
# Note: train() returns a tuple of word counts (effective, raw), not a model.
my_model = model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
#oldmodel.save('newmodel')
#model = Word2Vec.load('newmodel')

In [79]:
my_model

(580758, 1263360)

In [None]:
try:
    print(oldmodel.wv.most_similar('martensitic'))
except KeyError as e:
    print(e)

In [15]:
utils.preprocess("rt_interview_RV/dataset.json", "my_file")

100%|██████████| 200/200 [00:03<00:00, 63.67it/s]


NameError: name 'mat2vec' is not defined

In [29]:
model.wv.vocab['the'].count

3085

### Computing RV

In [12]:
RV_code_snippet.calculate_rv_coefficient_of_arrays(ref_vocab_matrix, ref_vocab_matrix, normalize=True)

1.0
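
The value of 1.0 is the expected sanity check: the RV coefficient of a matrix with itself is 1. For reference, a minimal NumPy sketch of the textbook RV coefficient follows. `rv_coefficient` is a hypothetical helper and is not necessarily how `RV_code_snippet.calculate_rv_coefficient_of_arrays` is implemented; in particular, the column centering is an assumption, and the matrices are random placeholders rather than the real embeddings.

```python
import numpy as np


def rv_coefficient(x, y, center=True):
    """Textbook RV coefficient between two matrices with the same number of rows.

    RV(X, Y) = tr(XX^T YY^T) / sqrt(tr((XX^T)^2) * tr((YY^T)^2))
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    if center:
        # Assumption: remove column means before comparing configurations.
        x = x - x.mean(axis=0)
        y = y - y.mean(axis=0)
    sx = x @ x.T  # (n, n) row-by-row similarity in the first space
    sy = y @ y.T  # (n, n) row-by-row similarity in the second space
    numerator = np.trace(sx @ sy)
    denominator = np.sqrt(np.trace(sx @ sx) * np.trace(sy @ sy))
    return numerator / denominator


rng = np.random.default_rng(0)
a = rng.normal(size=(95, 100))  # same shape as ref_vocab_matrix, random values
b = rng.normal(size=(95, 50))   # a second embedding space with a different width

rv_self = rv_coefficient(a, a)
rv_scaled = rv_coefficient(a, 3.0 * a)
rv_other = rv_coefficient(a, b)
print(rv_self, rv_scaled, rv_other)
```

Because the coefficient depends only on the row-by-row similarity matrices, the two embeddings may have different dimensions, as long as their rows refer to the same words in the same order.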

In [37]:
new_model = gensim.models.Word2Vec.load('mat2vec/training/models/model_v3')
new_model.get_latest_training_loss()

2910803.5

In [38]:
RV_code_snippet.calculate_rv_coefficient(ref_vocab, ref_vocab_matrix, new_model)

0.03259757669491132

In [39]:
new_model.wv.most_similar("magnetic")

[('metallurgical', 0.9985067248344421),
 ('slab', 0.9976663589477539),
 ('starting', 0.9975600242614746),
 ('additionally', 0.997474193572998),
 ('all', 0.997473418712616),
 ('whereas', 0.9974271059036255),
 ('transverse', 0.9972761273384094),
 ('finer', 0.9971984624862671),
 ('weight', 0.9971099495887756),
 ('element', 0.9971075057983398)]