# Lesson 2 - Capturing more of the syntax

In the previous lesson we saw how simple bag-of-words representations could be used to find similar documents. While this works reasonably well for finding relevant documents, the model uses a very simple representation of language, in which all meaning derived from syntax is lost. We'll now look at how we can use pre-trained neural networks to get representations of text which capture some of this syntax.

Today, this is most often done with _Transformer_ neural networks pre-trained with _language modelling_. Essentially, the pre-training task is framed as learning the joint distribution over text by estimating a factorized form of that distribution. This can be done in many ways (e.g. GPT, BERT, XLNet).
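
As a worked equation, the factorization can be sketched in its standard left-to-right form. The notation is our own assumption: $x_1, \dots, x_T$ are the tokens of a text of length $T$.

$$
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
$$

GPT-style models estimate each conditional on the right-hand side directly. BERT instead predicts masked tokens from both left and right context, so it only approximates this picture, and XLNet trains over permutations of the factorization order.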

Models pre-trained this way have been observed to perform well when later fine-tuned on a supervised task. In our case though, we would like to use a representation of the documents for similarity search, without doing any additional fine-tuning.

To do this, we will use _sentence BERT_ (sBERT), a variant of the BERT training procedure which aims to produce sentence-level representations that work better for semantic similarity.
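
To make the idea of "a representation used for similarity search" concrete, here is a minimal sketch in pure Python. It assumes the document vector is obtained by mean pooling over token vectors (a common choice for sBERT-style models, stated here as an assumption rather than a description of any specific model) and that documents are compared with cosine similarity. The helper names and the toy numbers are made up for illustration.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional "token embeddings" (made-up numbers, not model output).
# The third token of doc_a is padding and is masked out.
doc_a = mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], attention_mask=[1, 1, 0])
doc_b = mean_pool([[2.0, 2.0]], attention_mask=[1])
doc_c = mean_pool([[1.0, -1.0]], attention_mask=[1])

sim_ab = cosine_similarity(doc_a, doc_b)  # same direction, so close to 1
sim_ac = cosine_similarity(doc_a, doc_c)  # orthogonal, so close to 0
print(sim_ab, sim_ac)
```

Cosine similarity ignores vector length, so only the direction of the pooled vectors matters when ranking documents.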

## Huggingface Transformers

Much of the community surrounding pre-trained language models has centered on a project named Huggingface Transformers. This started as a library of basic Transformer models (in particular including pretrained BERT and GPT models), but has grown to be a substantial platform for pre-trained models.

Huggingface makes working with these models simple, and hides much of the inner workings behind an easy-to-use interface.

In [2]:
import zipfile
from pathlib import Path

In [3]:
import urllib
data_url = "https://cdn.thingiverse.com/assets/d0/b3/68/63/1e/Gate_Guide_Spacer_v9.stl"
data_root = Path('data')
data_path = data_root / 'sampled_archive.zip'
data_root.mkdir(exist_ok=True)

In [4]:
from collections.abc import Sequence
from collections import defaultdict
import json

class ZipPatentCorpus:
    def __init__(self, *, document_archive: Path, document_parts=('abstract', 'description', 'claims'), lang='en'):
        self.document_archive = document_archive
        self.document_zf = zipfile.ZipFile(self.document_archive)
        self.document_parts = document_parts
        self.lang = lang

        self.documents  = sorted(filename for filename in self.document_zf.namelist())
        self.symbolic_labels = []
        self.labeled_documents = defaultdict(list)
        for document in self.documents:
            label, sep, file = document.rpartition('/')
            self.symbolic_labels.append(label)
            self.labeled_documents[label].append(document)
        self.label_codes = {label: i for i, label in enumerate(sorted(self.labeled_documents.keys()))}
        self.labels = [self.label_codes[label] for label in self.symbolic_labels]
    
    def __len__(self):
        return len(self.documents)

    def load_document(self, document_path):
        with self.document_zf.open(document_path) as fp:
            document = json.load(fp)
            document_str = '\n'.join([document[part][self.lang] for part in self.document_parts])
            return document_str

    def __getitem__(self, item):
        # Lazily load documents here
        if isinstance(item, slice):
            document_paths = self.documents[item]
            document_str = [self.load_document(document_path) for document_path in document_paths]
        elif isinstance(item, Sequence):
            document_str = [self.load_document(self.documents[idx]) for idx in item]
        else:
            document_str = self.load_document(self.documents[item])
        return document_str
    
    def get_label(self, i):
        return self.labels[i]
    
    def get_symbolic_label(self, i):
        return self.symbolic_labels[i]
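
The class above assumes an archive layout of `<label>/<file>.json`, where each JSON file maps a document part to a dictionary of languages. The following sketch builds a tiny in-memory archive with that assumed layout and repeats the label extraction outside the class. The label names and document texts are hypothetical.

```python
import io
import json
import zipfile

# Build a tiny in-memory archive with the assumed layout: <label>/<file>.json
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("A01B/doc1.json", json.dumps({"abstract": {"en": "A plough."}}))
    zf.writestr("H04L/doc2.json", json.dumps({"abstract": {"en": "A network protocol."}}))

with zipfile.ZipFile(buffer) as zf:
    documents = sorted(zf.namelist())
    # Everything before the last '/' is treated as the symbolic label.
    symbolic_labels = [name.rpartition("/")[0] for name in documents]
    # Integer codes are assigned in sorted label order.
    label_codes = {label: i for i, label in enumerate(sorted(set(symbolic_labels)))}
    # Documents are only parsed when they are actually opened.
    with zf.open(documents[0]) as fp:
        first_abstract = json.load(fp)["abstract"]["en"]

print(symbolic_labels, label_codes, first_abstract)
```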

In [14]:
import re
from collections import Counter
from tqdm import tqdm

class Tokenizer:
    def __init__(self, 
                 *, 
                 max_vocab_size, 
                 stoplist=('the', 'of', 'a', 'and', 'to', 'in', 'is', 'or', 'an', 'by', 'as', 'be', 'for'),
                 wordpattern=r"[A-Za-z0-9\-\+='.]*[A-Za-z][A-Za-z0-9\-\+='.]*"
                 ):
        self.max_vocab_size = max_vocab_size
        self.stoplist = stoplist
        self.wordpattern = re.compile(wordpattern)

    def tokenize(self, text):
        return [word.strip('.') for word in re.findall(self.wordpattern, text.lower())]

    def encode(self, tokenized_text):
        try:
            term_to_index = self.term_to_index
        except AttributeError:
            raise RuntimeError("Tokenizer is missing term to index, did you call Tokenizer.fit() or Tokenizer.fit_transform()?")
        return [term_to_index[term] for term in tokenized_text if term in term_to_index]
    
    def decode(self, encoded_text):
        try:
            index_to_term = self.index_to_term
        except AttributeError:
            raise RuntimeError("Tokenizer is missing index to term, did you call Tokenizer.fit() or Tokenizer.fit_transform()?")

        return [index_to_term[idx] for idx in encoded_text]
    
    def make_vocab(self, documents_term_frequencies):
        document_occurance_counts = Counter()
        for document_term_frequency in documents_term_frequencies:
            # Add a count once for each unique term in a document
            document_occurance_counts.update(document_term_frequency.keys()) 
        
        for stopword in self.stoplist:
            del document_occurance_counts[stopword]
        
        self.vocabulary = sorted(term for term, count in document_occurance_counts.most_common(self.max_vocab_size) if count > 1)
        self.term_to_index = {term: i for i, term in enumerate(self.vocabulary)}
        self.index_to_term = {i: term for term, i in self.term_to_index.items()}

    def fit(self, corpus):
        documents_term_frequencies = [Counter(self.tokenize(doc)) for doc in tqdm(corpus, desc="Tokenizing", leave=False)]
        self.make_vocab(documents_term_frequencies)

    def fit_transform(self, corpus):
        tokenized_docs = [self.tokenize(doc) for doc in tqdm(corpus, desc="Tokenizing", leave=False)]
        documents_term_frequencies = [Counter(tokens) for tokens in tokenized_docs]
        self.make_vocab(documents_term_frequencies)
        return [self.encode(tokenized_text) for tokenized_text in tqdm(tokenized_docs, desc="Encoding", leave=False)]

    def transform(self, text):
        tokenized_text = self.tokenize(text)
        encoded_text = self.encode(tokenized_text)
        return encoded_text
    
    def __len__(self):
        return len(self.vocabulary)
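
The vocabulary is built from document frequencies (in how many documents a term occurs), not from raw term counts. The sketch below repeats that logic in isolation on three made-up documents, using the same word pattern as the default above and a shortened, hypothetical stoplist.

```python
import re
from collections import Counter

# Same word pattern as the Tokenizer default.
wordpattern = re.compile(r"[A-Za-z0-9\-\+='.]*[A-Za-z][A-Za-z0-9\-\+='.]*")

def tokenize(text):
    # Lowercase, extract words, and strip trailing sentence periods.
    return [word.strip('.') for word in wordpattern.findall(text.lower())]

docs = ["A valve for a pump.", "The pump has a rotor.", "Rotor blades."]  # made-up documents

document_frequency = Counter()
for doc in docs:
    # set() so a term counts once per document, however often it occurs
    document_frequency.update(set(tokenize(doc)))

# Deleting a missing key from a Counter does not raise, so this is safe.
for stopword in ("a", "the", "for"):
    del document_frequency[stopword]

# Keep only terms seen in more than one document.
vocabulary = sorted(term for term, count in document_frequency.items() if count > 1)
term_to_index = {term: i for i, term in enumerate(vocabulary)}
print(vocabulary, term_to_index)
```

Terms that occur in a single document are dropped because they cannot help match that document to any other.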

In [None]:
import re
from collections import Counter
from tqdm import tqdm

class NGramTokenizer(Tokenizer):
    def __init__(self, 
                 *,
                 n,
                 include_stop_ngrams=False,
                 **kwargs
                 ):
        self.n = n  # largest n-gram order to include (n=2 gives unigrams and bigrams)
        self.include_stop_ngrams = include_stop_ngrams  # if False, stopwords are dropped before n-grams are built
        super().__init__(**kwargs)

    def tokenize_ngrams(self, text):
        # Returns the unigrams of the text followed by all 2-grams up to n-grams
        tokenized = self.tokenize(text)
        if not self.include_stop_ngrams:
            tokenized = [token for token in tokenized if token not in self.stoplist]
        terms = list(tokenized)
        for offset in range(1, self.n):  # offset 1 gives 2-grams, offset 2 gives 3-grams and so on
            terms.extend(' '.join(tokenized[i:i+offset+1]) for i in range(len(tokenized) - offset))
        return terms

    def fit(self, corpus):
        documents_term_frequencies = [Counter(self.tokenize_ngrams(doc)) for doc in tqdm(corpus, desc="Tokenizing", leave=False)]
        self.make_vocab(documents_term_frequencies)

    def fit_transform(self, corpus):
        tokenized_docs = [self.tokenize_ngrams(doc) for doc in tqdm(corpus, desc="Tokenizing", leave=False)]
        documents_term_frequencies = [Counter(terms) for terms in tokenized_docs]
        self.make_vocab(documents_term_frequencies)
        return [self.encode(terms) for terms in tqdm(tokenized_docs, desc="Encoding", leave=False)]

    def transform(self, text):
        # Use the n-gram terms here as well, so that new text is encoded the same way as the fitted corpus
        return self.encode(self.tokenize_ngrams(text))
    
    def __len__(self):
        return len(self.vocabulary)
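
The core of the n-gram idea is sliding a window over the token list. As a standalone sketch (the function name and the example tokens are made up for illustration), extracting all 1-grams up to n-grams can look like this:

```python
def ngrams_up_to(tokens, n):
    """Return all 1-grams up to n-grams of a token list, joined by spaces."""
    terms = []
    for size in range(1, n + 1):
        # A window of length `size` fits at len(tokens) - size + 1 start positions.
        terms.extend(' '.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1))
    return terms

tokens = ["rotor", "blade", "assembly"]
bigram_terms = ngrams_up_to(tokens, 2)
print(bigram_terms)
```

Unlike single words, a term such as `rotor blade` keeps a little of the word order, which is exactly the information a plain bag-of-words model throws away.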

In [7]:
text_corpus = ZipPatentCorpus(document_archive=data_path, document_parts=['abstract'])

In [15]:
tokenizer = Tokenizer(max_vocab_size=100000)
tokenized_docs = tokenizer.fit_transform(text_corpus)


## N-gram models


In [15]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM
device = "cuda" if torch.cuda.is_available() else "cpu"

In [16]:
import importlib
importlib.reload(transformers)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


<module 'transformers' from 'F:\\Anaconda\\envs\\enccs-nlp-workshop\\lib\\site-packages\\transformers\\__init__.py'>

In [17]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa', device=device)
embeddings = model.encode(sentences)
print(embeddings)

ImportError: 
AutoModel requires the PyTorch library but it was not found in your environment. Checkout the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
