# .JSON/.TXT Tokenizer and Embedding Creator Tutorial <br>
### Hello! This is the tutorial for using the word embedding creator on the Luscombe Group's ChemLP project. To facilitate training a neural network to identify chemicals and chemical properties, we need to generate vector representations (embeddings) of BERT tokens produced from chemistry-related literature. In short, this program does **four** things to that end: 
#### **1.** Reads a local corpus of chemistry-related files in .json and .txt form 
#### **2.** BERT tokenizes each file, sentence by sentence
#### **3.** Using word2vec, creates word embeddings for each token
#### **4.** Writes the token/embedding dictionary into a local .json file
<br>


### Before working through that list, the first step is to download Louisa_w2v_functions from the ChemLP repository and import the required modules. 
Here is what my local folder looks like before running the code; it includes the tutorial, the full_pkg-checkpoint, and Louisa's functions. 
![markdown.JPG](attachment:markdown.JPG) <br>

The imports are listed below. Note the import of Louisa's functions. Many of these imports support additional capabilities of Louisa's functions, such as TSNE representations of word embeddings for visualizing data. If you are continuing work on this project, it would be valuable to spend some time looking through her functions. Two important imports are Word2Vec and the BERT tokenizers and models (**these may need to be installed or downloaded first, so double check this**).


In [3]:
import re, string 
import pandas as pd 
from time import time  
from collections import defaultdict
import spacy
from sklearn.manifold import TSNE
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from nltk import sent_tokenize
import json
%matplotlib inline
import torch
from transformers import BertModel, BertConfig, BertTokenizer, PreTrainedTokenizer
import csv
import glob 
import logging  
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

import Louisa_w2v_functions as w2v_functions
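
Because several of these packages may need installing first, a quick check before running the import cell can save a confusing traceback. The sketch below uses only the standard library's `importlib.util.find_spec`; the helper name `missing_packages` and the package list are illustrative assumptions, not part of the ChemLP code.

```python
from importlib.util import find_spec

# Top-level import names used by the import cell (illustrative list).
REQUIRED = ["pandas", "spacy", "sklearn", "nltk", "gensim",
            "matplotlib", "torch", "transformers"]

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]

print("missing:", missing_packages(REQUIRED))
```

Note that NLTK's stopword list is a separate data download (`nltk.download('stopwords')`), so an installed `nltk` package alone may not be enough for the `STOPWORDS` line.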


## Now the tasks are divided amongst two functions and a final implementation section of code. <br>

### Tasks 1 and 2: Reading and Tokenizing 
The four jobs of this program are performed by two helper functions and a final function that ties them together.
The first of these, super_list_maker, performs the first two tasks listed above.

In [4]:
def super_list_maker(separate = True):
    
    r"""
    This function prompts the user for a local folder/corpus location where the .json and .txt files are located. It then BERT tokenizes them and reads the 
    tokens into a list or a list of lists, depending on the default variable separate (True => list of lists, False => single token list). In either case, the final 
    list is then returned. For users on Windows, the filepath format leading to the corpus would be, for example: C:\Users\bowri\square1\ChemLP\Bowman\textbook_files
    (The docstring is a raw string, r"...", so the backslashes in that path are not read as escape sequences.)
    """
    
    initPath = input("Enter a file location")
    bookType = input("Enter a data type(.json please)")
    token_list = []
    final_list = []
    sep_token_list_of_lists = []
    if bookType:
        path = initPath + "\\*" + bookType  # "\\*" is a literal backslash (Windows separator) plus the glob wildcard
    else:
        path = initPath
    raw_path = r"{}".format(path)
    bookList = glob.glob(raw_path)
    
    yesOrNo = input("Is there an additional file type? (y or n) ")
    if yesOrNo == "y":
        bookType2 = input("Enter a second data type(.txt) ")
        path2 = initPath + "\\*" + bookType2  # same Windows-style pattern as above
        raw_path2 = r"{}".format(path2)
        bookList2 = glob.glob(raw_path2)
        for bookSite in bookList2:
            bookList.append(bookSite)
    
    bookCount = len(bookList)
    print("bookCount is ", bookCount)
    print("book list is ", bookList)

    
    for i in range(bookCount): #(len(bookList)):
        sep_book_list = w2v_functions.feed2vec(bookList[i], tokenize = True)
        sep_token_list_of_lists.append(sep_book_list[0])
        print("token list of book ", i+1, "is ", sep_book_list[0])
        print("tokens in book ", i+1, "is", len(sep_book_list[0]))
        print()
        for word in sep_book_list[0]:
            token_list.append(word)  #use true here
        if sep_book_list[0]:  # guard against a book that produced no tokens
            print("data type of tokens in book ", i+1, "is", type(sep_book_list[0][0]))

    print()
    print("length of vocab ", len(token_list))
    
    if not separate:
        print("final list length/token count ", len(token_list))
        print("all tokens ", token_list)
        return token_list
    else:
        print("list of all tokens (list of list form) ", sep_token_list_of_lists)
        return sep_token_list_of_lists


#local book path = C:\Users\bowri\square1\ChemLP\Bowman\textbook_files   this format is what works, then use .json as type  


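If this cell raises `SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes ...: truncated \UXXXXXXXX escape`, the cause is the Windows path in the docstring: in a normal string, `\U` starts a Unicode escape. Writing the docstring as a raw string (`r"""..."""`) or doubling the backslashes avoids the error.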
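
The path building in super_list_maker joins the folder and the wildcard with a hand-written backslash, which only works on Windows. Below is a minimal sketch of a portable alternative, assuming the same "folder plus file extensions" inputs. `find_corpus_files` is a hypothetical helper name, not a function from the ChemLP repository.

```python
import glob
import os

def find_corpus_files(folder, extensions=(".json", ".txt")):
    """Return a sorted list of files in `folder` matching any extension."""
    matches = []
    for ext in extensions:
        # os.path.join inserts the correct separator for the current OS
        pattern = os.path.join(folder, "*" + ext)
        matches.extend(glob.glob(pattern))
    return sorted(matches)
```

Passing the folder as an argument instead of reading it with `input()` also makes the step easy to rerun on a different corpus.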

### Task 3: Creating Vector Embeddings 
The third job of the program is to turn the BERT tokens into embeddings using Word2Vec. Word2Vec's CBOW or Skip-gram model may not seem suited to a list of tokens, where the context of the words is lost, but here it serves simply to create vectors out of the tokens. The sentence tokenizer used by BERT in super_list_maker() is already utilizing the context of the words to produce tokens. Because of that, the word2vec model is created and trained in such a way as to minimize the number of tokens thrown out. For instance, the minimum count defaults to 1, so a token that appears only once is still kept. 
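
To make the effect of the minimum count concrete, the sketch below counts tokens in plain Python and shows which ones survive a given threshold. It only imitates the vocabulary-pruning rule (keep tokens seen at least `min_count` times); it is not gensim's implementation, and `surviving_vocab` and the sample tokens are made up for illustration.

```python
from collections import Counter

def surviving_vocab(token_lists, min_count=1):
    """Return the set of tokens that occur at least `min_count` times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, n in counts.items() if n >= min_count}

# Two tiny "books" of already-tokenized text (invented tokens).
books = [["poly", "##mer", "chain"], ["poly", "##mer", "sol", "##vent"]]

print(sorted(surviving_vocab(books, min_count=1)))  # all five tokens kept
print(sorted(surviving_vocab(books, min_count=2)))  # only 'poly' and '##mer' remain
```

With a small corpus, raising the threshold even to 2 can discard most of the vocabulary, which is why the default of 1 is used here.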

In [None]:
#add default variables to support variable in w2v_train!
#not sure if it's bad form to have a function that solely exists in another function (ie super_list_maker here)

def embedding_returner(list_of_lists = True, minC = 1):
    
    """
    This function uses super_list_maker, as well as the imported Louisa w2v functions, to return length-500 word embeddings. Through super_list_maker 
    it prompts the user for a local corpus location and for file types, and then for the name of the model to be saved. Currently it only reads 
    in .json and .txt files. It will also print out the tokens generated from BERT tokenization for each book. The list_of_lists default variable is passed 
    into super_list_maker as separate, and the minC default variable is the minimum count passed into Louisa's w2v_train function for word2vec. The 
    function returns a dictionary with keys "embeddings" and "words/tokens", or None if no tokens were read.
    """
    
    list_of_tokens = super_list_maker(separate = list_of_lists)
    if list_of_tokens:
        saveName = input("Enter a name for the model saved: ")
        w2v_functions.tokens = list_of_tokens
        vec_emb_dict = w2v_functions.w2v_train(saveName, min_count = minC)               #issue here changing it to dictionary, bc this fxn doesn't return
        print("Number of unique word embeddings generated: ", len(vec_emb_dict["embeddings"]))
        print("Size of word embeddings: ", len(vec_emb_dict["embeddings"][0]))
        return vec_emb_dict
    else:
        print("Check your corpus location and your data types!")


### Task 4: Saving the Embeddings and Implementing the Program
Now, the final step is to simply call embedding_returner(), which will ultimately return a dictionary of embeddings and words, and write that dictionary to a .json file. The function below does just that: it prompts the user for a name for the saved file and then uses the json module to write the dictionary to a .json file. 

In [None]:
def main_program_runner(sep_lists = True, MC = 1):
    
    """
    This function calls the above functions and saves whichever corpus is read to a single .json file. It will prompt the user for a save name and save the 
    file in the current working folder. Its default variables are the default variables of super_list_maker and embedding_returner. 
    """

    vecs_and_words_dict = embedding_returner(list_of_lists = sep_lists, minC = MC)
    # use a name other than `json` so the json module is not shadowed inside the function
    json_text = json.dumps(vecs_and_words_dict)

    SaveName = input("Enter a file name for the .json file of words and embeddings ")
    SaveName = SaveName + ".json"

    # the with-block closes the file even if the write fails
    with open(SaveName, "w") as f:
        f.write(json_text)
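
One detail worth knowing when writing embeddings to disk: the `json` module only handles plain Python types, so if the vectors are NumPy arrays they would first need converting with `.tolist()`. The sketch below shows a save/load round trip. It assumes the dictionary holds two parallel lists under the keys named in the docstring ("embeddings" and "words/tokens"); the sample values and the helpers `save_embeddings` and `load_embeddings` are made up for illustration.

```python
import json
import os
import tempfile

def save_embeddings(vec_dict, path):
    """Write the token/embedding dictionary to `path` as JSON."""
    with open(path, "w") as f:
        json.dump(vec_dict, f)

def load_embeddings(path):
    """Read a token/embedding dictionary back from a JSON file."""
    with open(path) as f:
        return json.load(f)

# Invented two-token example with 2-dimensional vectors.
example = {"words/tokens": ["poly", "##mer"],
           "embeddings": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "demo_embeddings.json")
    save_embeddings(example, out_path)
    restored = load_embeddings(out_path)

# Pair each token with its vector for direct lookup.
lookup = dict(zip(restored["words/tokens"], restored["embeddings"]))
print(lookup["poly"])
```

Reading the file back right after writing it is a cheap way to confirm that the saved .json is usable by later steps.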

### Practice:
Now, using this program, we are going to practice on a few different files. First, we will try a document that is well suited to BERT tokenization, such as a book written in everyday English. Then, we will try the program on a single chemistry textbook.
<br>
<br>
Try using the program first with the ChemLP path: "C:\Users\bowri\square1\ChemLP\Completed_Word2Vec" and be sure to read only .txt files so that the program reads just Pride and Prejudice. You should notice that fewer tokens are generated; because the book uses more traditional English, the tokenizer does not need to create new tokens as it does when breaking apart complex chemistry-related words.
<br>
<br>
Afterwards, try it with the same path but have the program read only .json files, which will ensure that it reads only the Brewing Science book. 

In [None]:
main_program_runner()
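
To see why chemistry terms yield more tokens than everyday English, here is a toy version of greedy longest-match-first subword splitting, the general idea behind WordPiece tokenizers such as BERT's. The tiny vocabulary is invented for illustration and is nothing like BERT's real vocabulary, and `toy_wordpiece` is a hypothetical helper rather than the tokenizer used in the project.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no piece fits: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

# Invented vocabulary: one common word plus a few chemistry fragments.
VOCAB = {"pride", "poly", "##thi", "##ophene", "##mer"}

print(toy_wordpiece("pride", VOCAB))          # ['pride']
print(toy_wordpiece("polythiophene", VOCAB))  # ['poly', '##thi', '##ophene']
```

A common word maps to a single token, while a long chemical name is covered only by stitching several pieces together.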



### Final Thoughts and Current Bugs:

1) Currently, there is some sort of stored-variable issue with the program: after running it, the kernel sometimes needs to be reset before running it again.

2) Right now, the written .json of words/embeddings is saved in the same folder as the program. It might be nice to have it saved somewhere else.
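
For the second point, one possible approach is to build the output path from a user-chosen folder with `pathlib`, creating the folder if it does not exist yet. This is only a sketch under that assumption; `build_output_path` and the example folder name are hypothetical.

```python
from pathlib import Path

def build_output_path(save_name, out_dir="."):
    """Return out_dir/save_name with a .json suffix, creating out_dir if needed."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # avoid producing names like "vectors.json.json"
    name = save_name if save_name.endswith(".json") else save_name + ".json"
    return folder / name
```

For example, `open(build_output_path("brewing_vectors", "embedding_output"), "w")` would write into an `embedding_output` subfolder instead of next to the notebook.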


Lastly, if you're working on this project, don't forget how cool it is, and how translatable it is to other projects! It's also a great group of people,
and they're a good resource if you're stuck.

