# Generate Word Vectors by Doc2Vec
We use gensim's open-source Doc2Vec implementation to train our model.
This notebook takes the output of `feature-selection.ipynb`, loads the patent claims as text, and generates a vector that describes the claims of each application. It produces the following files:
1. Training corpus and corresponding vocabulary set (both in `.txt`)
2. gensim Doc2Vec model (stored as 4 files in `.model`, `.model.dv.vectors.npy`, `.model.dv.syn1neg.npy`, and `.model.wv.vectors.npy`.)
3. word vector files (in `.csv`)
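
As a minimal sketch of how the word vector files can be read back, the snippet below parses a small in-memory sample in the same column layout (`application_number`, `word_count`, `word_vector`). The sample rows and the `vectors` dictionary are hypothetical and only illustrate the format.

```python
import ast
import csv
import io

# Hypothetical two-row sample in the same layout as the word vector files:
# application_number, word_count, word_vector (a list enclosed in double quotes)
sample_csv = io.StringIO(
    'application_number,word_count,word_vector\n'
    '12345678,3,"[0.1, -0.2, 0.3]"\n'
    '12345679,2,"[0.0, 0.5, -1.0]"\n'
)

vectors = {}
for row in csv.DictReader(sample_csv):
    # literal_eval turns the quoted list back into a Python list of floats
    vectors[int(row['application_number'])] = ast.literal_eval(row['word_vector'])

print(vectors[12345678])
```

Each parsed list can then be passed to `np.asarray` if an ndarray is needed.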

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# The default gensim version on Colab (3.6) seems too old to support document tagging with corpus_file,
# so we upgrade it first
!pip install gensim --upgrade

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
import numpy as np
import pandas as pd
import gensim
import ast

from tqdm import tqdm
from typing import Iterable

## Choosing the parameters

In [None]:
# Whether to train a new Doc2Vec model
# If True, train a new model
# If False, load a pre-trained model and use it directly for vector inference
train_new_model = False

# Number of files with the same name format to load
num_files = 43
# Probability to sample applications when loading
sample_prob = 0.1

# Model training specs
vector_size = 450     # Dimension of each document vector. Only used when training a new model
model_min_count = 5   # Minimum word count. Any word that appears fewer times than this is discarded during training
model_epochs = 10     # Number of training epochs
model_sample = 1e-4   # Threshold for randomly downsampling high-frequency words during training

# The path to the source data. Must contain (1) application number (2) claim number (3) claim texts
source_path = '/content/drive/MyDrive/EPFL/ML/ML-Patent_data/ML_patent_data_filtered/final_dataset/claims/all_together_ordered_by_application_number'

# The path where the model files are located. Same path for saving and loading the model.
model_path = '/content/drive/MyDrive/EPFL/ML/ML-Patent_data/claim_with_vector/sampled/model_sampled.model'

# The path to output the post-processed dataset, where each row contains
# (1) application number (integer)
# (2) word count (integer): the number of words in the claims of this application
# (3) word vector (list): the document vector inferred for the claims of this application (can be converted to a numpy ndarray afterwards)
output_dataset_path = '/content/drive/MyDrive/EPFL/ML/ML-Patent_data/claim_with_vector/word_vector/word_vector.csv'

# The path to our corpus file in the format of TaggedLineDocument as specified by gensim. Each line is all the claims of one application
corpus_path = '/content/drive/MyDrive/EPFL/ML/ML-Patent_data/claim_with_vector/sampled/corpus_sampled.txt'
# The path to the vocabulary file
vocab_path = '/content/drive/MyDrive/EPFL/ML/ML-Patent_data/claim_with_vector/sampled/vocab_sampled.txt'

## Prepare helper classes to handle large texts
Since the texts are too large to fit in the memory of a single machine, we store the preprocessed texts in corpus files and record their vocabulary set. The Doc2Vec model can then train from these files without loading everything into memory, which would not be feasible.
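
To illustrate the idea, here is a minimal sketch of reading such a corpus file lazily, one document per line, with the line number acting as the document tag. The helper `stream_corpus` and the demo file are hypothetical; gensim provides its own reader for this layout (`TaggedLineDocument`), which this sketch only imitates.

```python
import os
import tempfile
from typing import Iterator, List, Tuple

def stream_corpus(path: str) -> Iterator[Tuple[int, List[str]]]:
    '''Yield (line_number, words) one document at a time, never the whole file.'''
    with open(path, 'r') as corpus_file:
        for line_number, line in enumerate(corpus_file):
            yield line_number, line.split()

# Hypothetical three-document corpus written to a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'corpus_demo.txt')
    with open(demo_path, 'w') as demo_file:
        demo_file.write('a method comprising a step\n')
        demo_file.write('the device of claim 1\n')
        demo_file.write('a system with a sensor\n')

    # Materialized here only because the demo corpus is tiny
    docs = list(stream_corpus(demo_path))

print(docs[1])
```

Because the generator holds only one line at a time, memory use stays roughly constant regardless of the corpus size.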

In [None]:
class myVocab():
  '''
  Handle vocabulary for a language, contains all unique words seen in a corpus.
  This class is used as a facilitation for training a Doc2Vec model
  '''

  def __init__(self, words:Iterable=None, path:str=None):
    '''
    Parameters
    ------------
    words: a list of words (in any iterable container) to be assigned as initial vocabulary
    path: the file path indicates a corpus file, where each line is a word
    '''
    # Keep the vocabulary in an immutable frozenset; add_vocab replaces it with a new merged set
    self.vocab = frozenset()

    if words is not None:
      self.add_vocab(words)

    if path is not None:
      self.load(path)
    

  def __len__(self):
    return len(self.vocab)

  def add_vocab(self, words:Iterable) -> None:
    '''Given a new list of words (in any iterable container), merge them into our vocabulary set'''
    self.vocab = self.vocab.union(frozenset(words))
    # Exclude empty string
    self.vocab -= {'', ' '}

  def save(self, path:str, append:bool=False) -> None:
    '''Save the current vocabulary set into an external file'''
    # Decide whether to append at the end or overwrite as new file
    open_mode = 'a' if append else 'w'

    with open(path, open_mode) as vocab_file:
      # Each line is a unique word
      for word in self.vocab:
        vocab_file.write(f'{word}\n')

  def load(self, path:str) -> None:
    '''Load a previously saved vocabulary file to construct the object'''
    with open(path, 'r') as vocab_file:
      # Each line is a unique word; strip the trailing newline before adding it.
      # add_vocab reassigns self.vocab (frozenset.union returns a new set and
      # does not modify the original in place).
      self.add_vocab(line.strip() for line in vocab_file)
  
  def build(self, path:str) -> None:
    '''Build the vocabulary set from a document file'''
    with open(path, 'r') as doc:
      for line in doc:
        # Remove the symbols that often attached with words: '.' ',' '?' '!' '(' ')' ':' ';' '/' '[' ']'
        line = line.replace('.', '')
        line = line.replace(',', '')
        line = line.replace('?', '')
        line = line.replace('!', '')
        line = line.replace('(', '')
        line = line.replace(')', '')
        line = line.replace('[', '')
        line = line.replace(']', '')
        line = line.replace(':', '')
        line = line.replace(';', '')
        line = line.replace('/', ' ')
        # Get each word
        words = line.split()
        # Add new vocabulary
        self.add_vocab(words)
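
The chain of `replace` calls used for symbol removal can also be expressed with a single translation table. The following is a sketch of that alternative, not the notebook's own implementation; `_SYMBOL_TABLE` and `clear_symbols` are hypothetical names.

```python
# Symbols dropped outright, plus '/' mapped to a space so that "a/b" becomes two words
_SYMBOL_TABLE = str.maketrans({**{symbol: None for symbol in '.,?!()[]:;'}, '/': ' '})

def clear_symbols(text: str) -> str:
    '''Remove the symbols that are often attached to words in a single pass.'''
    return text.translate(_SYMBOL_TABLE)

words = clear_symbols('A device (10), comprising: a/b.').split()
print(words)
```

A single `translate` call scans the string once, whereas each `replace` call creates a new intermediate string.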


In [None]:
class Claims():
  '''A class to organize claims with their respective patent case.'''
  def __init__(self):

    # Dictionary using the application number to access the claims (in the respective order)
    # Key: application number (in int64)
    # Value: List of patent sentences (in the order of claim number)
    self.claims = {}
    
    # Dictionary using the application number to access the word vector for claims
    # Key: application number (in int64)
    # Value: numpy array
    self.word_vector = {}


    # Dictionary using the application number to access the number of words used in claims
    # Key: application number (in int64)
    # Value: number of words used
    self.word_count = {}

  def reset(self):
    '''Reset all the containers to empty'''
    self.claims = {}
    self.word_vector = {}
    self.word_count = {}


  def get_app_num(self):
    '''Return the list of application numbers currently stored.'''
    return list(self.claims.keys())

  def get_claims(self, app_number:int) -> list:
    '''Return the claims of the specified application number'''
    return self.claims[app_number]

  def get_word_vector(self, app_number:int) -> np.ndarray:
    '''Return the word_vector of the specified application number'''
    return self.word_vector[app_number]

  def get_word_count(self, app_number:int) -> int:
    '''Return the word_count of the specified application number'''
    return self.word_count[app_number]


  def set_new_claims(self, app_number:int, claims:list, model:gensim.models.doc2vec.Doc2Vec=None, count_word:bool=True) -> None:
    '''Add claims for an application and optionally update its word vector and word count'''
    ## Assign the new claims
    if app_number in self.claims:
      self.claims[app_number] += claims
    else:
      self.claims[app_number] = claims

    # Get the representation of the claims as words.
    # Computed up front because both the vector and the count need it,
    # even when no model is given.
    words = self.get_claim_words(app_number)

    ## Assign the new word vector
    if model is not None:
      # Keep only the words that are in the model's vocabulary
      known_words = [w for w in words if w in model.wv.key_to_index]
      # Infer the vector from its word representation
      self.word_vector[app_number] = model.infer_vector(known_words)

    ## Assign the updated word count (all words, as in calculate_word_count)
    if count_word:
      self.word_count[app_number] = len(words)


 
  def add_claims(self, df:pd.DataFrame, column_names:Iterable=None, vocab:myVocab=None, sample_func:callable=None) -> None:
    '''
    Add the claims into our structure.

    Parameters
    ---------------
    df: a dataframe which contains at least (1) application number (2) claim number (3) claim text
    column_names: any ordered iterable containing the name of application number, claim number, and claim text in the dataframe df. 
                  If not specified, it's assumed that the first 3 columns are already the 3 columns mentioned above.
    vocab: if specified, the words of each processed application are added to this vocab object.
    sample_func: a callable that returns True or False. We process an application only if sample_func returns True
    '''
    # If not assigned, our target columns are assumed to be the first 3 columns
    if column_names is None:
      column_names = df.columns
    # Obtain the correct column label for the dataframe
    app_number_column_name   = column_names[0]
    claim_number_column_name = column_names[1]
    claim_text_column_name   = column_names[2]

    # Get the unique list of application number
    app_number_list = df[app_number_column_name].unique()

    # Add claims to each application
    for app in tqdm(app_number_list):

      # If a sample function is defined, discard this application number if it returns false
      if sample_func is not None:
        if not sample_func():
          continue

      # Extract the sub-dataframe for current application number
      df_current_app = df[ df[app_number_column_name] == app ]
      # Sort the sub-dataframe by the claim number, so that we store the content in the correct order
      df_current_app = df_current_app.sort_values(by=[claim_number_column_name])
      # Extract the claims as list of claims
      current_claims = df_current_app[claim_text_column_name].to_list()

      # Store the extracted claims into the dictionary
      if app in self.claims:
        # If already exist such application, append new claims to the original one
        self.claims[app] += current_claims
      else:
        # If there's no such application, assign the claims
        self.claims[app] = current_claims
      
      # If vocabulary is specified, we add new words into the vocabulary set
      if vocab is not None:
        vocab.add_vocab(self.get_claim_words(app))



  def get_claim_words(self, app_num:int) -> list:
    '''
    Return the claims in a list of words with respect to the assigned patent.
    Return an empty list if such application number does not exist

    Parameter
    -------------
    app_num: the application number
    '''

    # If such application number does not exist, return an empty list (not found)
    if app_num not in self.claims:
      return []

    words = []
    # Call the claims from the specified application
    for sentence in self.claims[app_num]:
      # NaN entries occasionally appear; skip anything that is not a string
      if not isinstance(sentence, str):
        continue

      # Remove the symbols that are often attached to words: '.' ',' '?' '!' '(' ')' ':' ';' '/' '[' ']'
      sentence = self.__clear_special_symbols(sentence)

      # Transform the sentence into a list of words
      new_words = sentence.split()

      # Append the new words to the result
      words.extend(new_words)
    
    return words


  def get_claim_as_sentence(self, app_num:int) -> str:
    '''Collect all the claims and merge them into a single sentence.'''
    # Initialize the sentence structure
    sen = ''

    # Call the claims from the specified application
    for sentence in self.claims[app_num]:
      # NaN entries occasionally appear; skip anything that is not a string
      if not isinstance(sentence, str):
        continue

      # Remove the symbols that are often attached to words: '.' ',' '?' '!' '(' ')' ':' ';' '/' '[' ']'
      sentence = self.__clear_special_symbols(sentence)
      
      # Append the new sentence to the original sentence
      sen += ' ' + sentence

    return sen



  def get_tagged_documents(self) -> list:
    '''Return a list of documents as the training input for gensim.doc2vec'''
    return list(self.tag_claims())


  def tag_claims(self):
    '''Tag all the claims to be trained'''
    print('Tagging words for all the applications')
    for ind, app_num in enumerate(tqdm(self.claims)):
      # Get the representation of claims in list of words
      words = self.get_claim_words(app_num)
      yield gensim.models.doc2vec.TaggedDocument(words, [ind])


  def assign_word_vector(self, model:gensim.models.doc2vec.Doc2Vec) -> None:
    '''Assign word vectors to each application, and store it in self.word_vector as a dictionary'''
    print('Assigning word vectors to claims of each application')
    for app_num in tqdm(self.claims):
      # Get the representation of claims in words
      words = self.get_claim_words(app_num)
      # Filter out the words that are not in the model's vocabulary
      words = list(filter(lambda w: w in model.wv.key_to_index, words))
      # Infer the vector from its word representation
      self.word_vector[app_num] = model.infer_vector(words)


  def calculate_word_count(self) -> None:
    '''Construct the word_counts for the existing claims'''
    print("Calculating word counts...")
    for app in tqdm(self.claims):
      # Get the representation of the claims as individual words
      words = self.get_claim_words(app)
      # Store the respective word number
      self.word_count[app] = len(words)
      


  
  def save_all(self, path:str) -> None:
    '''
    Save the processed dataset as another csv file.

    path: the path where the file is to be stored
    '''

    print("Saving the processed dataset...")

    with open(path, 'w') as ofile:
      # Start a new csv file and initialize the columns
      ofile.write('application_number,claims,word_vector,word_count\n')

      for app_num in tqdm(self.claims):
        # Use 2 single quotes to represent the single quote in the sentence
        claims_list = [ x.replace('\'', '\'\'') for x in self.claims[app_num] if type(x) is str]
        # Since double quotes have special meaning in .csv files, we change all the double quote representations into single quote
        claim_list_in_str = str(claims_list).replace('\"', '\'')
        # Write the result and enclose list with double quotes
        ofile.write(f"{app_num},\"{claim_list_in_str}\",\"{self.word_vector[app_num].tolist()}\",{self.word_count[app_num]}\n")


  def save_vector(self, path:str) -> None:
    '''
    Save the processed dataset as another csv file with (1) application number, (2) word_count (3) word_vector (in a list)

    path: the path where the file should be stored
    '''
    print("Saving word count and word vector...")


    with open(path, 'w') as ofile:
      # Start a new csv file and initialize the columns
      ofile.write('application_number,word_count,word_vector\n')

      for app_num in tqdm(self.claims):
        # Write the result and enclose list with double quotes
        ofile.write(f"{app_num},{self.word_count[app_num]},\"{self.word_vector[app_num].tolist()}\"\n")


  def save_corpus(self, path:str, append:bool=False) -> None:
    '''
    Save the claims into a file with the TaggedLineDocument format specified by gensim Doc2Vec
    
    Parameters
    -------------
    path: the path including file name to save
    append: True if want to append to the end to the existing file on 'path'. 
            False if want to overwrite the result
    '''
    print("Writing corpus to file...")

    # Decide whether to append at the end or overwrite as new file
    open_mode = 'a' if append else 'w'

    with open(path, open_mode) as corpus_file:
      app_num_list = self.get_app_num()
      # Each line represents a document (with unique tag)
      # so in our case it's an application.
      for app in tqdm(app_num_list):
        corpus_file.write(self.get_claim_as_sentence(app)+'\n')
    
  
  def load(self, path:str, num_of_files:int) -> None:
    '''
    Load from the previously saved post-processed dataset
    
    Parameters
    -------------
    path: a path template, don't append the _[index] to the end. Just put [path_to_file]/[name].csv. The function will iterate through all the indices.
    num_of_files: number of files to be loaded. The function automatically loads files with indices 0 to (num_of_files-1).
    '''
    print("Loading from the previously saved dataset")

    # Define the converters for data parsing.
    # All data is imported as strings, so we need some extra handling to convert the strings to lists.
    #
    # Since we use 2 single quotes to represent a single quote in the original text,
    # we turn them into a backslash-escaped quote so that it is parsed as a quote inside the text
    # instead of as a string delimiter in Python syntax
    conv = {'claims': lambda x: ast.literal_eval(x.replace('\'\'', '\\\'')),
            'word_vector': lambda x: ast.literal_eval(x)}

    # Start from index -1 so that the first loop iteration below rewrites the suffix to _0.csv
    path = path.replace('.csv', '_-1.csv')

    for file_index in range(num_of_files):
      # Update the file index for each new file
      path = path.replace(f'_{file_index-1}.csv', f'_{file_index}.csv')
      print(f"Loading processed data file {file_index}")
      # read_csv opens the file itself, so no separate open() is needed
      df = pd.read_csv(path, converters=conv)

      self.add_claims(df)

  
  def __clear_special_symbols(self, sentence:str) -> str:
    '''
    Remove the symbols that are often attached to words: '.' ',' '?' '!' '(' ')' ':' ';' '/' '[' ']'
    '''
    sentence = sentence.replace('.', '')
    sentence = sentence.replace(',', '')
    sentence = sentence.replace('?', '')
    sentence = sentence.replace('!', '')
    sentence = sentence.replace('(', '')
    sentence = sentence.replace(')', '')
    sentence = sentence.replace('[', '')
    sentence = sentence.replace(']', '')
    sentence = sentence.replace(':', '')
    sentence = sentence.replace(';', '')
    sentence = sentence.replace('/', ' ')
    return sentence
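
The core data structure of `Claims` is a dictionary from application number to the list of its claim texts in claim-number order. The sketch below builds that structure from plain tuples in one pass; the `rows` sample and the names `grouped` and `claims_by_app` are hypothetical, and this is an illustration of the grouping idea rather than the class's own code path.

```python
from collections import defaultdict

# Hypothetical rows: (application_number, claim_number, claim_text), in arbitrary order
rows = [
    (100, 2, 'The device of claim 1.'),
    (200, 1, 'A method.'),
    (100, 1, 'A device.'),
]

# Single pass: collect the (claim_number, claim_text) pairs of each application
grouped = defaultdict(list)
for app_number, claim_number, claim_text in rows:
    grouped[app_number].append((claim_number, claim_text))

# Sort each application's claims by claim number and keep only the text
claims_by_app = {
    app_number: [text for _, text in sorted(pairs, key=lambda pair: pair[0])]
    for app_number, pairs in grouped.items()
}

print(claims_by_app[100])
```

Grouping in one pass avoids filtering the full table once per application number, which can matter when a file holds tens of thousands of applications.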



## Load from the data source for training
We construct the corpus and vocabulary from the source data so that we can train a Doc2Vec model.

In [None]:
# The root for the source file name. For example, the file can be like
# "{Path_to_datasets}/{NAME_ROOT}_000000000001.csv"
NAME_ROOT = 'features_claims_'

# Maintain the vocabulary set with the claim data for training Doc2Vec
vocab = myVocab()
claims = Claims()

## The below parts are only for training
if train_new_model:

  def app_sample_func():
    '''Bernoulli r.v. with probability "sample_prob" for random sampling data '''
    return np.random.rand() < sample_prob


  for file_index in range(num_files):
    # Format the file name
    filename = f'{NAME_ROOT}{file_index:012}.csv'
    path = os.path.join(source_path, filename)
    with open(path) as patent_claim_file:
      print(f"Loading claim file \'{filename}\'. Current vocab size: {len(vocab)}")
      df = pd.read_csv(patent_claim_file)

      claims.add_claims(df, vocab=vocab, sample_func=app_sample_func)

    # Save corpus and vocabulary result to file (prevent session problem, allow us to have check points)
    claims.save_corpus(corpus_path, append=True)
    vocab.save(vocab_path)

    # Reset the claim object to free up some RAM.
    # Because claim texts are quite a large burden for memory :)
    claims.reset()
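
The sampling function above keeps each application independently with probability `sample_prob`. If a reproducible sample is wanted, one option is to bind the sampler to its own seeded generator, as in this sketch. `make_sampler` and the seed value are hypothetical, and the standard-library `random` module is used here instead of numpy to keep the example dependency-free.

```python
import random
from typing import Callable

def make_sampler(prob: float, seed: int) -> Callable[[], bool]:
    '''Return a Bernoulli sampler with success probability `prob` and its own seeded generator.'''
    rng = random.Random(seed)

    def sample() -> bool:
        return rng.random() < prob

    return sample

# Two samplers with the same seed make the same keep/discard decisions
sampler_a = make_sampler(0.1, seed=0)
sampler_b = make_sampler(0.1, seed=0)
decisions_a = [sampler_a() for _ in range(10_000)]
decisions_b = [sampler_b() for _ in range(10_000)]

kept = sum(decisions_a)
print(f'Kept {kept} of {len(decisions_a)} applications')
```

A function built this way can be passed as `sample_func` because it takes no arguments and returns a boolean.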

## Obtain the Model
Either train a new model with the previously loaded data; or
load an existing model.
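
When training from a corpus file, gensim's `train` takes the number of documents (`total_examples`) and the number of raw words (`total_words`) in the corpus. As a sanity check, both can be counted directly from the file. The sketch below assumes one document per line and whitespace tokenization; `corpus_stats` and the demo file are hypothetical, and the counts may differ slightly from what gensim records internally.

```python
import os
import tempfile
from typing import Tuple

def corpus_stats(path: str) -> Tuple[int, int]:
    '''Return (number of documents, number of raw words) for a one-document-per-line file.'''
    num_docs = 0
    num_words = 0
    with open(path, 'r') as corpus_file:
        for line in corpus_file:
            num_docs += 1
            num_words += len(line.split())
    return num_docs, num_words

# Hypothetical two-document corpus written to a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'corpus_demo.txt')
    with open(demo_path, 'w') as demo_file:
        demo_file.write('a device comprising a sensor\n')
        demo_file.write('a method\n')
    stats = corpus_stats(demo_path)

print(stats)
```

Note that the raw word count is generally much larger than the vocabulary size, since every repeated word is counted again.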

In [None]:

if train_new_model:
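  # Create the model object for training.
  # `sample` (the downsampling threshold for frequent words) is set here in the constructor
  model = gensim.models.doc2vec.Doc2Vec(min_count=model_min_count, vector_size=vector_size,
                                        sample=model_sample, epochs=model_epochs)

  # Build vocabulary from the corpus file (in the format of TaggedLineDocument)
  model.build_vocab(corpus_file=corpus_path)
  # Train the model from the corpus file.
  # total_words is the number of raw words in the corpus (not the vocabulary size);
  # build_vocab records it on the model together with the document count
  model.train(corpus_file=corpus_path,
              total_examples=model.corpus_count,
              total_words=model.corpus_total_words,
              epochs=model.epochs)

  # Save the model so that we can load it later without retraining it
  model.save(model_path)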

else:
  print(f'Loading from existing model from \'{model_path}\'')
  model = gensim.models.doc2vec.Doc2Vec.load(model_path)

Loading from existing model from '/content/drive/MyDrive/EPFL/ML/ML-Patent_data/claim_with_vector/sampled/model_sampled.model'


## Apply word vector
With a trained Doc2Vec model in hand, we now apply it to infer a vector for the claims of each application.
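
Each source file gets its own output file, named by appending the file index to the base output path. The cell below does this by repeatedly replacing the previous suffix; an equivalent and arguably simpler way is to build each name from the base path directly, as in this sketch. The helper `indexed_path` and the example base path are hypothetical.

```python
import os

def indexed_path(base_path: str, index: int) -> str:
    '''Insert _<index> before the file extension, e.g. out.csv -> out_3.csv.'''
    root, extension = os.path.splitext(base_path)
    return f'{root}_{index}{extension}'

# Hypothetical base path, shortened for the example
paths = [indexed_path('word_vector/word_vector.csv', i) for i in range(3)]
print(paths)
```

Building the name from the base path each time means the result does not depend on the previous iteration, so processing can resume from any file index.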

In [None]:
# Determine the file index to start loading
file_from = 0

# Generate the file path with indices
output_path = output_dataset_path.replace('.csv', f'_{file_from-1}.csv')

claims = Claims()

for file_index in range(file_from, num_files):
  # Format the file name
  filename = f'{NAME_ROOT}{file_index:012}.csv'
  path = os.path.join(source_path, filename)
  with open(path) as patent_claim_file:
    print(f"Loading claim file \'{filename}\'")
    df = pd.read_csv(patent_claim_file)

    claims.add_claims(df)

    # Infer the word vector for the claims
    claims.assign_word_vector(model)

    # Calculate the word count
    claims.calculate_word_count()

    # Update the path index and save the processed file
    output_path = output_path.replace(f'_{file_index-1}.csv', f'_{file_index}.csv')
    claims.save_vector(output_path)

    claims.reset()


Loading claim file 'features_claims_000000000040.csv'


100%|██████████| 46523/46523 [01:48<00:00, 427.23it/s]


Assigning word vectors to claims of each application


100%|██████████| 46523/46523 [19:14<00:00, 40.29it/s]


Calculating word counts...


100%|██████████| 46523/46523 [00:07<00:00, 6277.30it/s]


Saving word count and word vector...


100%|██████████| 46523/46523 [00:21<00:00, 2206.68it/s]


Loading claim file 'features_claims_000000000041.csv'


100%|██████████| 48667/48667 [01:57<00:00, 415.40it/s]


Assigning word vectors to claims of each application


100%|██████████| 48667/48667 [19:30<00:00, 41.58it/s]


Calculating word counts...


100%|██████████| 48667/48667 [00:07<00:00, 6656.76it/s]


Saving word count and word vector...


100%|██████████| 48667/48667 [00:22<00:00, 2177.91it/s]


Loading claim file 'features_claims_000000000042.csv'


100%|██████████| 45525/45525 [01:43<00:00, 438.15it/s]


Assigning word vectors to claims of each application


100%|██████████| 45525/45525 [17:44<00:00, 42.77it/s]


Calculating word counts...


100%|██████████| 45525/45525 [00:06<00:00, 6564.67it/s]


Saving word count and word vector...


100%|██████████| 45525/45525 [00:20<00:00, 2185.52it/s]
