## Relabelling and Mimicking models
This notebook walks through training a Relabelling model and a Mimicking model. We first train the Relabelling model and test it, then use the trained model to induce noise in the dataset. Finally, we train the Mimicking model on this noise-induced dataset.

### Create ELMo embeddings for the relabelling dataset
Edit the dataset location in get_elmo_vec.py so that it points to the data for the Relabelling model, then run the script.

In [None]:
!python3 get_elmo_vec.py

Reading file: data/16-04-2020_annotated_fold2_persection/train.txt
100%|███████████████████████████████| 139094/139094 [00:00<00:00, 590430.41it/s]
number of sentences: 825
Finishing embedding ELMo sequences, saving the vector files.
Reading file: data/16-04-2020_annotated_fold2_persection/dev.txt
100%|█████████████████████████████████| 17017/17017 [00:00<00:00, 628141.58it/s]
number of sentences: 109
Finishing embedding ELMo sequences, saving the vector files.
Reading file: data/16-04-2020_annotated_fold2_persection/test.txt
100%|█████████████████████████████████| 41511/41511 [00:00<00:00, 614922.44it/s]
number of sentences: 247
Finishing embedding ELMo sequences, saving the vector files.
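
The log above reports far fewer sentences than lines because each file stores one token per line, with blank lines separating sentences. The following is a minimal sketch of that grouping, not the reader used by get_elmo_vec.py. It assumes two whitespace-separated columns (word and label), as the noise-induction cell later in this notebook does; the function name `read_sentences` and the `B-` style labels in the sample are hypothetical.

```python
from typing import List, Tuple


def read_sentences(lines: List[str]) -> List[List[Tuple[str, str]]]:
    """Group `word label` lines into sentences separated by blank lines."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        word, label = line.split()
        current.append((word, label))
    if current:
        # The file may not end with a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample; the label names are illustrative only.
sample = ["InSe B-MATERIAL\n", "is O\n", "\n", "DFT B-METHOD\n"]
print(len(read_sentences(sample)))  # prints 2
```

Running a check like this on train.txt before embedding is a quick way to confirm that the dataset location in get_elmo_vec.py points at the intended files.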


### Train the Relabelling model on its dataset

In [None]:
!python3 trainer.py --device "cuda:1" --dataset "16-04-2020_annotated_fold2_persection" --embedding_file "data/material_science_glove2.txt" --learning_rate "0.001" --optimizer "sgd" --context_emb "elmo" --dropout 0.5 --hidden_dim 200 --num_epochs 120 --model_folder "relabelling_model_fold2" --batch_size 8
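
The same trainer.py flags are reused later for the Mimicking model, with only the dataset and the model folder changed. As a sketch, the command can be assembled from a small helper so the two runs cannot drift apart. The helper name `trainer_command` is hypothetical; every flag and value is copied from the command above.

```python
import shlex
from typing import List


def trainer_command(dataset: str, model_folder: str) -> List[str]:
    """Build the trainer.py argument list; only dataset and model folder vary."""
    return [
        "python3", "trainer.py",
        "--device", "cuda:1",
        "--dataset", dataset,
        "--embedding_file", "data/material_science_glove2.txt",
        "--learning_rate", "0.001",
        "--optimizer", "sgd",
        "--context_emb", "elmo",
        "--dropout", "0.5",
        "--hidden_dim", "200",
        "--num_epochs", "120",
        "--model_folder", model_folder,
        "--batch_size", "8",
    ]


cmd = trainer_command("16-04-2020_annotated_fold2_persection", "relabelling_model_fold2")
# Print the command for inspection; subprocess.run(cmd, check=True) would execute it.
print(" ".join(shlex.quote(part) for part in cmd))
```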

### Reformat the test dataset.

In [None]:
!python3 create_test_final.py -paperDir ../pdfs/data_16-04-2020_fold2/ -persection true -context elmo

Files: 100%|████████████████████████████████████| 86/86 [00:32<00:00,  2.67it/s]


### Predict on this dataset

In [None]:
!python3 predict_instances.py -paperDir ../pdfs/data_16-04-2020_fold2/ -modelPath relabelling_model_fold2/relabelling_model_fold2.tar.gz

Files:   0%|                                             | 0/43 [00:00<?, ?it/s]../pdfs/data_16-04-2020_fold2/29-01-2020-part2-1702.01144v1.Band_Structure_Band_Offsets_Substitutional_Doping_and_Schottky_Barriers_in_InSe.pdf.txt
Files:   2%|▊                                    | 1/43 [00:09<06:37,  9.46s/it]../pdfs/data_16-04-2020_fold2/09-02-2020-1711.04334v1.Stable_Ultra_thin_CdTe_Crystal_A_Robust_Direct_Gap_Semiconductor.pdf.txt
Files:   5%|█▋                                   | 2/43 [00:22<07:08, 10.44s/it]../pdfs/data_16-04-2020_fold2/09-02-2020-1012.4934v1.Modified_graphene_with_small_BN_domain_an_effective_method_for_band_gap_opening.pdf.txt
Files:   7%|██▌                                  | 3/43 [00:33<07:04, 10.62s/it]../pdfs/data_16-04-2020_fold2/09-02-2020-1007.1864v1.The_third_conformer_of_graphane_A_first_principles_DFT_based_study.pdf.txt
Files:   9%|███▍                                 | 4/43 [00:46<07:25, 11.42s/it]../pdfs/data_16-04-2020_fold2/28-09-2019-5a.pdf.txt
File

### Apply filters and get the final predictions

In [None]:
!python3 final_results_final.py -paperDir ../pdfs/data_16-04-2020_fold2/ -filterMaterial true -filterMethod false -MaterialPred false -printOnlyUnmatched false -filterStructure true

Files:   0%|                                            | 0/172 [00:00<?, ?it/s]../pdfs/data_16-04-2020_fold2/26-08-2019-paper-10.pdf.prediction
['layered perovskites Sr2MO4 ( M=Ti , V , Cr , and Mn']
MATERIAL - 1 ,2, 1, 2
METHOD - 4 ,3, 8, 4
CODE - 3 ,1, 3, 1
PARAMETER - 3 ,3, 7, 3
STRUCTURE - 0 ,0, 1, 0
../pdfs/data_16-04-2020_fold2/29-01-2020-part1-1608.01885v1.Density_functional_theory_calculations_of_the_stress_of_oxidised_110_silicon_surfaces.pdf.prediction
['silicon', 'oxidised ( 110 )']
MATERIAL - 2 ,1, 2, 1
METHOD - 5 ,3, 7, 4
PARAMETER - 1 ,1, 10, 3
CODE - 2 ,2, 2, 2
Files:   3%|█                                   | 5/172 [00:00<00:03, 48.29it/s]../pdfs/data_16-04-2020_fold2/29-01-2020-part2-1510.00513v1.Electronic_structure_and_magnetic_properties_of_FeTe_BiFeO__3_SrFe___12_O___19_and_SrCoTiFe___10_O___19_compounds.pdf.prediction
['FeTe BiFeO 3 SrFe 12 O 19']
METHOD - 11 ,5, 22, 5
MATERIAL - 1 ,1, 1, 1
STRUCTURE - 1 ,1, 7, 1
PARAMETER - 0 ,0, 6, 0
../pdfs/data_16-04-2020_fol

## Mimicking model
The Relabelling model is now trained, so we move on to the Mimicking model. We first use the trained Relabelling model to predict labels for the training data; these predictions are then used to induce noise in the dataset.

In [None]:
!python3 make_predictions.py -paperDir data/16-04-2020_alllines_fold2_persection -modelPath relabelling_model_fold2/relabelling_model_fold2.tar.gz
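
The next cell pairs train.txt and train.prediction line by line, so any misalignment between the two files would silently attach predictions to the wrong tokens. The following is a sketch of a sanity check to run first. It assumes only that both files use blank lines as sentence separators; the function name `check_alignment` is hypothetical.

```python
from typing import List


def check_alignment(gold_lines: List[str], pred_lines: List[str]) -> None:
    """Raise ValueError if the two files cannot be paired line by line."""
    if len(gold_lines) != len(pred_lines):
        raise ValueError(
            f"line count differs: {len(gold_lines)} vs {len(pred_lines)}"
        )
    for number, (gold, pred) in enumerate(zip(gold_lines, pred_lines), start=1):
        # Sentence boundaries (blank lines) must fall on the same lines.
        if bool(gold.strip()) != bool(pred.strip()):
            raise ValueError(f"blank-line mismatch at line {number}")


# Tiny in-memory example; in the notebook the lists would come from readlines().
check_alignment(["InSe O\n", "\n"], ["InSe B-MATERIAL\n", "\n"])
print("files are aligned")
```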

In [None]:
import os
import random

paper_dir = "data/16-04-2020_alllines_fold2_persection/"
final_dir = "data/fold2_mimicking_data/"
os.makedirs(final_dir, exist_ok=True)  # make sure the output folder exists

# A predicted label replaces a gold "O" label when random.randint(0, k) == 0,
# i.e. with probability 1 / (k + 1). The value stored per category is k.
randint_upper = {
    "MATERIAL": 1,   # 50% of predicted MATERIAL labels are added
    "STRUCTURE": 1,  # 50% of predicted STRUCTURE labels are added
    "CODE": 2,       # about 33% of predicted CODE labels are added
    "METHOD": 3,     # 25% of predicted METHOD labels are added
    "PARAMETER": 9,  # 10% of predicted PARAMETER labels are added
}

# Read the gold labels and the Relabelling model's predictions.
with open(paper_dir + "train.txt") as file1, open(paper_dir + "train.prediction") as file2:
    lines1 = file1.readlines()
    lines2 = file2.readlines()

with open(final_dir + "train.txt", "w") as new_file:
    for i in range(len(lines1)):
        line1 = lines1[i].strip("\n")
        line2 = lines2[i].strip("\n")
        if line1 == "":
            # Keep blank lines: they separate sentences.
            new_file.write("\n")
            continue
        word, label1 = line1.split()
        _, label2 = line2.split()
        # Only tokens labelled "O" in the gold data can receive a predicted label.
        for category, upper in randint_upper.items():
            if label1 == "O" and category in label2:
                if random.randint(0, upper) == 0:
                    label1 = label2
        new_file.write(word + " " + label1 + "\n")
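
The noise-induction cell accepts a predicted label when `random.randint(0, k)` returns 0. Because `randint` includes both endpoints, this happens with probability 1 / (k + 1): 1/2 for MATERIAL and STRUCTURE, 1/3 for CODE, 1/4 for METHOD, and 1/10 for PARAMETER. The short simulation below is a sketch that checks this arithmetic empirically; the helper name `acceptance_rate`, the trial count, and the seed are illustrative choices, not part of the pipeline.

```python
import random


def acceptance_rate(upper: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate how often random.randint(0, upper) returns 0."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    hits = sum(rng.randint(0, upper) == 0 for _ in range(trials))
    return hits / trials


# Upper bounds copied from the randint calls in the noise-induction cell.
for label, upper in [("MATERIAL", 1), ("STRUCTURE", 1), ("CODE", 2),
                     ("METHOD", 3), ("PARAMETER", 9)]:
    expected = 1 / (upper + 1)
    print(f"{label}: expected {expected:.3f}, simulated {acceptance_rate(upper):.3f}")
```

Since the cell itself draws from the unseeded global generator, each run produces a different noisy training set; calling `random.seed` with a fixed value beforehand would make the result reproducible.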

### Create ELMo embeddings for the mimicking dataset
Run get_elmo_vec.py again, this time with the dataset location set to the Mimicking model data.

In [None]:
!python3 get_elmo_vec.py

Reading file: data/fold2_mimicking_data/train.txt
100%|█████████████████████████████| 1005778/1005778 [00:01<00:00, 612748.67it/s]
number of sentences: 1514
^C
Traceback (most recent call last):
  File "get_elmo_vec.py", line 90, in <module>
    get_vector()
  File "get_elmo_vec.py", line 75, in get_vector
    read_parse_write(elmo, file, outfile, mode)
  File "get_elmo_vec.py", line 57, in read_parse_write
    vec = parse_sentence(elmo, inst.input.words, mode=mode)#Remove pos_tags argument for model without additional embeedding for materials
  File "get_elmo_vec.py", line 20, in parse_sentence
    vectors = elmo.embed_sentence(words)
  File "/home/jatin-pg/.local/lib/python3.6/site-packages/allennlp/commands/elmo.py", line 230, in embed_sentence
    return self.embed_batch([sentence])[0]
  File "/home/jatin-pg/.local/lib/python3.6/site-packages/allennlp/commands/elmo.py", line 255, in embed_batch
    embeddings, mask = self.batch_to_embeddings(batch)
  File "/home/jatin-pg/.local/lib

### Train the mimicking model

In [None]:
!python3 trainer.py --device "cuda:1" --dataset "fold2_mimicking_data" --embedding_file "data/material_science_glove2.txt" --learning_rate "0.001" --optimizer "sgd" --context_emb "elmo" --dropout 0.5 --hidden_dim 200 --num_epochs 120 --model_folder "mimicking_model_fold2" --batch_size 8

### Reformat the test dataset

In [None]:
!python3 create_test_final.py -paperDir ../pdfs/data_16-04-2020_fold2/ -persection true -context elmo

Files: 100%|██████████████████████████████████| 172/172 [00:46<00:00,  3.68it/s]


### Get predictions on the test dataset

In [None]:
!python3 predict_instances.py -paperDir ../pdfs/data_16-04-2020_fold2/ -modelPath mimicking_model_fold2/mimicking_model_fold2.tar.gz

Files:   0%|                                             | 0/43 [00:00<?, ?it/s]../pdfs/data_16-04-2020_fold2/29-01-2020-part2-1702.01144v1.Band_Structure_Band_Offsets_Substitutional_Doping_and_Schottky_Barriers_in_InSe.pdf.txt
Files:   2%|▊                                    | 1/43 [00:13<09:12, 13.15s/it]../pdfs/data_16-04-2020_fold2/09-02-2020-1711.04334v1.Stable_Ultra_thin_CdTe_Crystal_A_Robust_Direct_Gap_Semiconductor.pdf.txt
Files:   5%|█▋                                   | 2/43 [00:30<09:53, 14.46s/it]../pdfs/data_16-04-2020_fold2/09-02-2020-1012.4934v1.Modified_graphene_with_small_BN_domain_an_effective_method_for_band_gap_opening.pdf.txt
Files:   7%|██▌                                  | 3/43 [00:45<09:44, 14.61s/it]../pdfs/data_16-04-2020_fold2/09-02-2020-1007.1864v1.The_third_conformer_of_graphane_A_first_principles_DFT_based_study.pdf.txt
Files:   9%|███▍                                 | 4/43 [01:03<10:08, 15.61s/it]../pdfs/data_16-04-2020_fold2/28-09-2019-5a.pdf.txt
File

### Apply filters to get the final predictions on the test set and calculate the scores.

In [None]:
!python3 final_results_final.py -paperDir ../pdfs/data_16-04-2020_fold2/ -filterMaterial true -filterMethod false -MaterialPred false -printOnlyUnmatched false -filterStructure false