# Lesson 1 Notebook
In this notebook we will work with documents as vectors in a high-dimensional space. We will look at a few different ways of doing this using well-established methods.

## Running this file on Google Colab
In this workshop we will be using Google Colab. To open this notebook in Colab, use the following URL: [Lesson 1 Notebook](https://colab.research.google.com/github/ENCCS/contemporary-nlp/blob/main/content/notebooks/lesson_1.ipynb)

You need to save a local copy of this notebook to your own Google Drive if you want the changes to be saved.

In [7]:
import zipfile
from pathlib import Path

## The data

In this notebook we will compare documents. We will work with real-world data in the form of *patent applications*. These are documents which are submitted to patent organizations (in our case the European Patent Office, EPO) when requesting a patent.

Published patent applications are open, so they are a good source of example data for natural language processing.

The documents used in this workshop are downloaded from the EPO [European Publication Server web service](https://www.epo.org/searching-for-patents/data/web-services/publication-server.html). In particular, the patent applications are available as XML documents with clear structure. Below is an example of how these might look and the corresponding PDF version (which we will not be using).

![title](images/patent_application_pdf_xml.png)

We've prepared data for the workshop by extracting relevant fields from the XML documents and saving them as JSON-formatted documents.

**N.b.** In a patent, each paragraph is numbered, and these numbers can be used to cross-reference between parts of the document. While this is essential for understanding e.g. the claims, the methods we will use in this workshop cannot make use of the numbers, so we haven't included them in the processed text.

### Downloading the data
We supply an archive of the data, which we will now download.

In [1]:
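# Import the submodule explicitly; a bare `import urllib` does not guarantee
# that `urllib.request` is available
import urllib.request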

In [12]:
data_url = "https://cdn.thingiverse.com/assets/d0/b3/68/63/1e/Gate_Guide_Spacer_v9.stl"
data_root = Path('data')
data_path = data_root / 'sampled_archive.zip'
data_root.mkdir(exist_ok=True)

In [11]:
urllib.request.urlretrieve(data_url, data_path)

(WindowsPath('data/sampled_archive.zip'),
 <http.client.HTTPMessage at 0x2623a2143c8>)

In [46]:
# This extracts the documents into the data root directory
with zipfile.ZipFile(data_path) as zf:
    zf.extractall(data_root)

Now let's have a look at how these files are structured. The archive contained two subdirectories, each having documents belonging to a specific class. Throughout this workshop we will be working with these two classes. When using NLP for semantic search, we are often tasked with searching some large set of documents for a specific subset. We will call the subset we're searching for the "relevant" set.

Our task is to achieve as good *precision* and *recall* as possible on this relevant set.
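
As a reminder of what these two measures capture, here is a minimal sketch using made-up document identifiers (the sets `relevant` and `retrieved` are hypothetical and not taken from the workshop data). Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved.

```python
# Hypothetical example: the document ids are invented for illustration
relevant = {"doc1", "doc2", "doc3", "doc4"}   # the set we are searching for
retrieved = {"doc2", "doc3", "doc7"}          # what a search method returned

# Documents that are both relevant and retrieved
true_positives = relevant & retrieved

# Precision: how much of what we retrieved is actually relevant
precision = len(true_positives) / len(retrieved)
# Recall: how much of the relevant set we managed to retrieve
recall = len(true_positives) / len(relevant)

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Returning more documents tends to raise recall at the cost of precision, which is why the two are usually reported together.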

In [29]:
import json
document_directories = [d for d in data_root.iterdir() if d.is_dir()]
document_classes = [d.name for d in document_directories]
print(document_classes)

['negative', 'positive']


The data is organized in text files containing JSON data, where each file is a single patent application. When parsed in Python, the documents are represented as dictionaries.

Let's have a look at the keys they contain:

In [47]:
document_sample_path = next(document_directories[0].iterdir())
with open(document_sample_path) as fp:
    document_sample = json.load(fp)
print(document_sample.keys())

dict_keys(['abstract', 'claims', 'description', 'document_number', 'ipc_classes', 'language', 'publication_date'])


Here's a description of the different keys:
  - `abstract`: The text from the abstract of the document. This is often a succinct text describing the contents of the patent.
  - `description`: This is the main body of text of the document, containing all materials which support the claims being made.
  - `claims`: This is a structured description of the claims the patent seeks. In other words, this is the description of what should be protected and is what the patent office should decide on.
  - `document_number`: This is the number assigned to an application when published.
  - `ipc_classes`: This is a list of IPC codes which have been assigned to the patent. The IPC system is a way of "tagging" the contributions of a patent so that it can be searched for in the future.
  - `language`: The language the application is published in. In this workshop we're only working with English patent applications.
  - `publication_date`: The date the application was published.

In this workshop, we'll mainly focus on the `abstract`, `claims` and `description`, since these are the parts containing natural language. Each of these keys in turn maps to a dictionary keyed by language code, like below:

In [48]:
document_sample['abstract'].keys()

dict_keys(['en'])

Let's have a look at the contents. The claims and the description can be long, so for those we only look at the first 1000 characters.

In [49]:
document_sample['abstract']['en']

'Provided is a voltage discharging device. The voltage discharging device includes a battery, an inverter converting a DC power supplied from the battery into an AC power to output the converted AC power, a motor driven by the AC power outputted through the inverter, a main relay disposed between the battery and the inverter to switch the DC power supplied from the battery into the inverter, and a control unit detecting a key-off signal of the vehicle to discharge a DC link voltage of the inverter when the key-off signal is detected. The control unit discharges the DC link voltage by applying one of first and second forced discharging logics different from each other according to a driving state of the vehicle at a time point at which the key-off signal is detected.\n'

In [50]:
document_sample['claims']['en'][:1000]

'\nA voltage discharging device of a vehicle, the voltage discharging device comprising:\na battery(110);\nan inverter(130) for converting a DC power supplied from the battery(110) into an AC power to output the converted AC power;\na motor(140) driven by the AC power outputted through the inverter(130);\na main relay(120) disposed between the battery(110) and the inverter(130) to switch the DC power supplied from the battery(110) into the inverter(130); and\na control unit(150) for detecting a key-off signal of the vehicle and discharging a DC link voltage of the inverter when the key-off signal is detected,\nwherein the control unit(150) discharges the DC link voltage by applying one of first and second forced discharging logics different from each other according to a driving state of the vehicle at a time point at which the key-off signal is detected.\n\nThe voltage discharging device according to claim 1, wherein the control unit(150) determines a time point at which the main rela

In [51]:
document_sample['description']['en'][:1000]

'CROSS-REFERENCE TO RELATED APPLICATIONS\nThe present application claims priority under 35 U.S.C. 119 and 35 U.S.C. 365 to Korean Patent Application No. 10-2012-0109550 (filed on September 28, 2012), which is hereby incorporated by reference in its entirety.\nBACKGROUND\nEmbodiments relates to a vehicle, and more particularly, to a vehicle and a voltage discharging method thereof.\nAn inverter system that is a motor control unit used in eco-friendly vehicles is a main component belonging to an electric motor of a vehicle as an electric/electronic sub assembly (ESA) that converts a high-voltage DC power into an AC or DC power for controlling a motor.\nAs described above, a permanent magnet type motor may be applied to the eco-friendly vehicles. A motor that is applied as a driving unit in the eco-friendly vehicles is driven by phase current transmitted from an inverter for converting a DC voltage into a three-phase voltage by a pulse width modulation (PWM) signal of a control unit throu

### Corpus

We'll implement a class that makes it easy to index and iterate over the documents, loading each one from disk only when it is requested.

In [73]:
from collections.abc import Sequence
from collections import defaultdict

class PatentCorpus:
    """Lazily loads patent documents stored as JSON files under a directory."""
    def __init__(self, *, document_root: Path, document_parts=('abstract', 'description', 'claims'), lang='en'):
        self.document_root = document_root
        self.document_parts = document_parts
        self.lang = lang

        # Sort the paths so that document indices are stable between runs
        self.documents = sorted(self.document_root.glob('**/*.json'))
        # Group the document paths by parent directory, which encodes the class
        self.labeled_documents = defaultdict(list)
        for document in self.documents:
            label = str(document.parent)
            self.labeled_documents[label].append(document)

    def __len__(self):
        return len(self.documents)

    def load_document(self, document_path):
        # Read as UTF-8 explicitly instead of relying on the platform default
        with open(document_path, encoding='utf-8') as fp:
            document = json.load(fp)
        # Join the selected parts into a single string
        return '\n'.join(document[part][self.lang] for part in self.document_parts)

    def __getitem__(self, item):
        # Lazily load documents here
        if isinstance(item, slice):
            document_paths = self.documents[item]
            document_str = [self.load_document(document_path) for document_path in document_paths]
        elif isinstance(item, Sequence):
            document_str = [self.load_document(self.documents[idx]) for idx in item]
        else:
            document_str = self.load_document(self.documents[item])
        return document_str

In [96]:
from collections.abc import Sequence
from collections import defaultdict
class ZipPatentCorpus:
    """Lazily loads patent documents directly from the zip archive."""
    def __init__(self, *, document_archive: Path, document_parts=('abstract', 'description', 'claims'), lang='en'):
        self.document_archive = document_archive
        self.document_zf = zipfile.ZipFile(self.document_archive)
        self.document_parts = document_parts
        self.lang = lang

        # namelist() can also contain directory entries, so keep only the JSON files
        self.documents = sorted(filename for filename in self.document_zf.namelist()
                                if filename.endswith('.json'))
        # Group the archive members by their directory, which encodes the class
        self.labeled_documents = defaultdict(list)
        for document in self.documents:
            label, _, _ = document.rpartition('/')
            self.labeled_documents[label].append(document)

    def __len__(self):
        return len(self.documents)

    def load_document(self, document_path):
        # json.load accepts the binary file object returned by ZipFile.open
        with self.document_zf.open(document_path) as fp:
            document = json.load(fp)
        # Join the selected parts into a single string
        return '\n'.join(document[part][self.lang] for part in self.document_parts)

    def __getitem__(self, item):
        # Lazily load documents here
        if isinstance(item, slice):
            document_paths = self.documents[item]
            document_str = [self.load_document(document_path) for document_path in document_paths]
        elif isinstance(item, Sequence):
            document_str = [self.load_document(self.documents[idx]) for idx in item]
        else:
            document_str = self.load_document(self.documents[item])
        return document_str
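
Neither corpus class defines `__iter__`, yet both can be used in a `for` loop. Python falls back to calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised. The sketch below illustrates this with a toy stand-in; the `TinyCorpus` class and its contents are hypothetical and only serve to show the mechanism.

```python
class TinyCorpus:
    """Hypothetical stand-in for the corpus classes; holds strings in memory."""
    def __init__(self, documents):
        self.documents = list(documents)

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, item):
        # Indexing past the end of the list raises IndexError,
        # which is what ends a for loop over this object
        return self.documents[item]

tiny_corpus = TinyCorpus(["first abstract", "second abstract"])

# No __iter__ is defined: Python calls __getitem__(0), __getitem__(1), ...
collected = [document for document in tiny_corpus]
print(collected)
```

In the corpus classes above, the `IndexError` comes from indexing the `self.documents` list, so each document is loaded only when the loop reaches it.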

In [105]:
text_corpus = ZipPatentCorpus(document_archive=data_path, document_parts=('abstract',))

In [103]:
text_corpus = PatentCorpus(document_root=data_root, document_parts=('abstract',))

In [98]:
text_corpus[3:6]

['The present invention provides genetically modified eukaryotic host cells that produce isoprenoid precursors or isoprenoid compounds. A subject genetically modified host cell comprises increased activity levels of one or more of mevalonate pathway enzymes, increased levels of prenyltransferase activity, and decreased levels of squalene synthase activity. Methods are provided for the production of an isoprenoid compound or an isoprenoid precursor in a subject genetically modified eukaryotic host cell. The methods generally involve culturing a subject genetically modified host cell under conditions that promote production of high levels of an isoprenoid or isoprenoid precursor compound.\n',
 'Object To provide a peripheral structure of a battery for saddle-ride type vehicle which enhances maintenance capabilities while taking into account a case where other wires are connected to the battery.\nSolving Means In a saddle-ride type vehicle including: a battery (30) for supplying power to 

## Bag-of-Words

We'll start by using one of the simplest representations of these documents: the bag-of-words representation. Here, each document $i$ is represented by a vector $\mathbf{d}_i$ of term presence variables $t_{i,j}$, one for each of the $n$ terms:

$$ \mathbf{d}_i = \begin{bmatrix}t_{i,1}& \dots & t_{i,n}\end{bmatrix} $$

The variable $t_{i,j}$ expresses to what degree term $j$ is present in document $i$. It can be binary, indicating only the presence of the word, or it can be a frequency of how often the word occurs in the document. In the latter case, the frequencies are typically normalized so that they sum to $1$.
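
To make the two variants concrete, here is a minimal sketch that builds both a binary vector and a normalized frequency vector. The toy documents and the helper names `binary_vector` and `frequency_vector` are invented for illustration and are not part of the workshop code.

```python
from collections import Counter

# Toy documents, invented for illustration
toy_documents = [
    "battery inverter motor battery",
    "motor relay",
]
tokenized = [document.split() for document in toy_documents]

# The vocabulary fixes which term each vector position j refers to
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def binary_vector(tokens):
    """t_ij is 1 if term j occurs in the document, otherwise 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def frequency_vector(tokens):
    """t_ij is the count of term j divided by the document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [counts[term] / total for term in vocabulary]

print(vocabulary)
print(binary_vector(tokenized[0]))
print(frequency_vector(tokenized[0]))
```

Both vectors have one entry per vocabulary term, so documents of different lengths end up in the same vector space.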

### Tokenization
To be able to construct this bag-of-words representation, we need to decide on what terms are actually valid. We need to take the string of characters that makes up a document and divide it into *tokens*, a process referred to as *tokenization*.

Tokenization is a difficult topic, because we need to make hard decisions which might destroy information. In English, it has been common to base tokenization on white space, since English words are typically separated this way. Furthermore, some characters like `,` and `(` are mostly syntactic, and the methods we'll look at today will not make use of them. We will therefore strip away many of these characters. We will also convert all upper case letters to lower case.

We start by looking at a simple way of doing this, and later progress to prebuilt preprocessing steps with better fidelity.




In [106]:
# Create a small set of frequent words (stop words) to filter out
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
from collections import Counter
frequency = Counter(token for text in texts for token in text)

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
#pprint.pprint(processed_corpus)
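
Splitting on white space alone leaves punctuation attached to the tokens, which is why tokens such as `'production,'` and `'1).'` show up in the output. One possible way to strip such characters is a regular expression that keeps only runs of letters and digits. This is a minimal sketch, and the function name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    # [a-z0-9]+ matches only letters and digits, so punctuation such as
    # ',' '(' ')' and '.' is dropped and acts as a token boundary
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "Solid CO2 particles, especially CO2 pellets, are used (1)."
print(simple_tokenize(sample))
```

Note that this choice also destroys information: `and/or` becomes the two tokens `and` and `or`, and hyphenated words are split apart.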

In [100]:
processed_corpus

[['invention',
  'relates',
  'method',
  'removing',
  'labels',
  'from',
  'containers',
  'used',
  'production,',
  'storage,',
  'transport',
  'and/or',
  'distribution',
  'food',
  'products',
  'or',
  'pharmaceutical',
  'products,',
  'which',
  'is',
  'characterized',
  'that',
  'solid',
  'co2',
  'particles,',
  'especially',
  'co2',
  'pellets',
  'or',
  'co2',
  'snow',
  'particles,',
  'are',
  'onto',
  'said',
  'containers',
  '1).'],
 ['rotating',
  'control',
  'apparatus,',
  'comprising:',
  'an',
  'outer',
  'member;',
  'an',
  'inner',
  'member',
  'having',
  'first',
  'sealing',
  'element',
  'second',
  'sealing',
  'element;',
  'said',
  'inner',
  'member,',
  'said',
  'first',
  'sealing',
  'element',
  'said',
  'second',
  'sealing',
  'element',
  'rotatable',
  'relative',
  'said',
  'outer',
  'member;',
  'first',
  'cavity',
  'defined',
  'by',
  'said',
  'inner',
  'member,',
  'said',
  'first',
  'sealing',
  'element',
  'said