# Model generation using doc2vec

## Introduction

Answering a research question based on information within a corpus of text relies on the ability to extract relevant subsets of the corpus (documents, parts of documents, etc.). The definition of what constitutes a relevant subset may, however, not be immediately implementable in classical information retrieval terms such as keywords (which may also be perceived as limiting). Keywords are also not an intuitive starting point for researchers accustomed to working close to the text, i.e. close-reading documents to identify and retrieve the information that interests them.

As an alternative to classical keyword definition, and inspired by the approach of identifying relevant documents or passages through a more holistic assessment of their content, we present the construction of a model based on a shallow neural network and trained on a user-supplied corpus (which the user has already preprocessed). The model aims to encapsulate the content of a corpus element as a numerical vector. The goal is to enable the user to retrieve elements of similar content by querying the corpus with the content of a seed element, as numerically encoded by the model.

In the following we construct the model using the `gensim` Python package and the doc2vec model framework it provides.

## Setup
The following steps set up our environment.

First, import the standard and required framework packages.

In [1]:
import os
import numpy as np
import scipy as sp

Next, import the packages directly related to model construction and to serialization of the model output.

In [2]:
import gensim
from smart_open import smart_open
import pandas as pd
import import_ipynb
import filepaths as fp
import corpus_reinferral

importing Jupyter notebook from filepaths.ipynb
importing Jupyter notebook from corpus_reinferral.ipynb


Finally, as model construction requires a significant amount of computation, import packages that enable efficient use of the available computational infrastructure, and specify how much of that infrastructure may be used.


In [3]:
import multiprocessing
import dask

cores = multiprocessing.cpu_count()
# np.int was removed in NumPy 1.24; use the built-in int and keep at least one worker
usecores = max(1, int(3*cores/4))

### Get full file paths

In [4]:
corpus_file_path = fp.corpus_file_path
corpus_ids_file_path = fp.corpus_ids_file_path
model_output_file_path = fp.model_output_file_path
model_output_corpus_vectors_path = fp.model_output_corpus_vectors_path

## Utility Functions

Having set up the environment, we next define a small number of utility functions which will enable us to import the (already preprocessed) corpus and subsequently train our model.

#### Function to read in corpus into input format supported by `gensim` `doc2vec`

This assumes that the corpus token file has been produced following the process encoded in the pre-processing notebook (), i.e. each corpus element (a 'document' in the sense of the doc2vec model) to be included in the model has been preprocessed and tokenized and is stored as a single string of tokens, with one element ('document') per line.

Note that in many cases these elements/documents will in fact correspond to, e.g., paragraphs of a larger document.
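The expected file layout can be illustrated with a small sketch. It is not part of the notebook's pipeline: the toy corpus, the file name and the function `read_tagged_lines` are hypothetical, and a `namedtuple` stands in for gensim's `TaggedDocument` (which has the same `words` and `tags` fields) so that the sketch runs without gensim.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (itself a namedtuple
# with the fields `words` and `tags`), used so the sketch needs no gensim.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def read_tagged_lines(path):
    """Yield one tagged document per line of a whitespace-tokenized file."""
    with open(path, "r", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            # split() without arguments also drops the trailing newline
            yield TaggedDocument(words=line.split(), tags=[index])


# Hypothetical two-element corpus, one preprocessed element per line
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_corpus.txt")
    with open(toy_path, "w", encoding="utf-8") as handle:
        handle.write("war memory village\nfood shortage winter city\n")
    toy_docs = list(read_tagged_lines(toy_path))

print(toy_docs[0])
```

Each line becomes one element whose tag is its line number, which is what later allows a vector to be traced back to its position in the corpus file.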

In [5]:
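def read_corpus(corpus_file):
    # Yield one TaggedDocument per line, tagged with its line number
    with smart_open(corpus_file,'r') as tf:
        for i,text_line in enumerate(tf):
            # split() without arguments drops the trailing newline and empty tokens
            tokens = text_line.split()
            yield gensim.models.doc2vec.TaggedDocument(tokens,[i])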

#### Function to read in corpus and list of corpus element file names, creating an object which supports inspecting and linking model output

In [6]:
def read_corpus_lookup(corpus_ids_file, corpus_file) :
    with smart_open(corpus_ids_file, 'r') as fnf, smart_open(corpus_file,'r') as tf:
        i=0
        for (fn_line,tf_line) in zip(fnf,tf):
            yield ([i],[fn_line.rstrip()],[tf_line])
            i+=1

## Load corpus

With the utility functions defined we can now load the corpus

In [7]:
corp = list(read_corpus(corpus_file_path))
#import pickle
#with open('/Users/eslt0101/Data/eScience/Evidence/Data/NR-Teksts/EviDENce_NR_output_clean/TargetSize150/corp_clean.pkl','rb') as f:
#    corp = pickle.load(f)

and the lookup corpus

In [9]:
corp_lookup = list(read_corpus_lookup(corpus_ids_file_path,corpus_file_path))
#with open('/Users/eslt0101/Data/eScience/Evidence/Data/NR-Teksts/EviDENce_NR_output_clean/TargetSize150/corp_lookup_clean.pkl','rb') as f:
#    corp_lookup = pickle.load(f)

Unique identifiers for each corpus element can then be constructed from the lookup corpus.

In [10]:
#This assumes that all corpus file names end with .txt
corp_ids = [entry[1][0].split('.txt')[0] for entry in corp_lookup]
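The identifier construction above cuts each name at the first occurrence of `.txt`, which is fine as long as that string only appears as the extension. As a hedged alternative, the following sketch strips only a trailing `.txt`; the helper `element_id` and the file names are hypothetical and not part of the notebook.

```python
import os


def element_id(file_name):
    """Strip a trailing '.txt' extension only; leave other names untouched."""
    stem, extension = os.path.splitext(file_name)
    return stem if extension == ".txt" else file_name


# Hypothetical file names
print(element_id("interview_012_par_3.txt"))
print(element_id("notes.txt.backup.txt"))
```

For the second name, splitting on `'.txt'` would yield `notes`, whereas the sketch keeps `notes.txt.backup`, so identifiers stay unique when `.txt` appears inside a name.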

## Build Model

Having imported the corpus we can now build a model from it.

First we specify the model we want to build. In this case that is doc2vec with largely default settings, modified as specified in the following:

In [11]:
vector_dimension=50
word_min_count=2
number_of_epochs=30

In [12]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=vector_dimension, min_count=word_min_count, epochs=number_of_epochs, workers=usecores)

Next, using the model, we build the vocabulary that will be used.

In [13]:
model.build_vocab(corp)

Then we train the model, timing the process.

In [14]:
%time model.train(corp, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 4.23 s, sys: 1.3 s, total: 5.53 s
Wall time: 2.89 s


Finally, we save the resulting trained model.

In [15]:
model.save(model_output_file_path)
#model = gensim.utils.SaveLoad.load('/Users/eslt0101/Projects/EviDENce/ML/model_default_v50_mc2_e30_freeze_clean.d2v')

## Reinfer corpus vectors 

The corpus vectors stored in the model were shaped to a significant extent during early epochs of training, i.e. before the model weights had settled. Furthermore, the corpora being modelled will in general be relatively small, so that an individual inferred vector may be unstable and fluctuate in its components from one inference to the next. To address this, we reinfer the vector for each corpus element multiple times using the fully trained and frozen model, and use the component-wise median as the descriptive vector associated with that element in further processing.

We use the imported reinferral engine (`corpus_reinferral.ipynb`).

### Reinferral settings

In [16]:
number_reinferral_instances=100  #this corresponds to the default setting

### Execute reinferral 

In [17]:
reinferred_corpus_vectors = corpus_reinferral.reinfer_corpus_single(corp,model,reinferral_instances=number_reinferral_instances)
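The internals of `corpus_reinferral` are not shown in this notebook. The idea of repeated inference followed by a component-wise median can nevertheless be sketched as follows. The names `median_reinfer` and `noisy_infer` are hypothetical, and a noisy stand-in function replaces the model's inference call (e.g. `model.infer_vector`) so that the sketch runs without a trained model.

```python
import numpy as np


def median_reinfer(infer_fn, tokens, instances=100):
    """Infer a vector `instances` times and return the component-wise median."""
    samples = np.stack([infer_fn(tokens) for _ in range(instances)])
    return np.median(samples, axis=0)


# Stand-in for a trained model's inference call (e.g. model.infer_vector):
# a fixed vector plus random noise, to mimic run-to-run fluctuation.
rng = np.random.default_rng(0)
true_vector = np.array([1.0, -2.0, 0.5])


def noisy_infer(tokens):
    return true_vector + rng.normal(scale=0.1, size=true_vector.shape)


stable_vector = median_reinfer(noisy_infer, ["some", "tokens"], instances=100)
print(stable_vector.shape)
```

The median is less sensitive to occasional outlying inferences than the mean, which may be one reason to prefer it for small corpora.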


Once the corpus vectors have been reinferred, the results are saved in `numpy` binary format for potential later (rapid) use.
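The cell performing this save is not shown here. A minimal sketch of such a save and reload, assuming a hypothetical file name and a placeholder array in place of the real vectors, might look like this:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the reinferred vectors: 4 elements, 50 dimensions
toy_vectors = np.zeros((4, 50), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; np.save appends '.npy' if it is missing
    vectors_path = os.path.join(tmp_dir, "reinferred_vectors.npy")
    np.save(vectors_path, toy_vectors)
    reloaded = np.load(vectors_path)

print(reloaded.shape)
```

The binary file stores only the array, in corpus order, and carries no element names.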

### Create dataframe with mapping

The reinferred vectors saved above require a separate mapping structure to ensure that each vector remains correctly associated with its corpus element in more general use scenarios, which might include reordering. To this end, we save a dataframe containing the reinferred vectors and the associated element names/ids.

In [19]:
data_list = list(zip(corp_ids,reinferred_corpus_vectors))
vectorDF = pd.DataFrame(data_list,columns=['id','vector'])
vectorDF.to_pickle(model_output_corpus_vectors_path)
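As a sketch of why the id/vector pairing is useful, the following example reorders a toy dataframe with the same two columns and then ranks elements by cosine similarity to a seed element. The ids, the 3-dimensional vectors and the function `rank_by_cosine` are hypothetical. Cosine similarity is one common choice of similarity measure, not necessarily the one used downstream, and the sketch assumes that no vector has zero norm.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and 3-dimensional vectors standing in for the real output
toy_ids = ["doc_a", "doc_b", "doc_c"]
toy_vectors = [
    np.array([1.0, 0.0, 0.0]),
    np.array([0.9, 0.1, 0.0]),
    np.array([0.0, 0.0, 1.0]),
]
toy_df = pd.DataFrame(list(zip(toy_ids, toy_vectors)), columns=["id", "vector"])

# Reorder the rows; each vector stays attached to its id
shuffled = toy_df.sample(frac=1.0, random_state=0).reset_index(drop=True)


def rank_by_cosine(frame, seed_id):
    """Return (id, score) pairs sorted by cosine similarity to the seed."""
    matrix = np.vstack(frame["vector"].tolist())
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    seed_position = np.flatnonzero((frame["id"] == seed_id).to_numpy())[0]
    scores = unit @ unit[seed_position]
    order = np.argsort(-scores)
    return [(frame["id"].iloc[i], float(scores[i])) for i in order]


ranking = rank_by_cosine(shuffled, "doc_a")
print([element for element, _ in ranking])
```

Because the lookup goes through the `id` column rather than the row position, the ranking is the same however the rows are ordered.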

## Summary

With the code above, we have built a doc2vec model on the ALREADY PREPROCESSED corpus supplied, and have constructed numerically stabilized vectors (component-wise medians over repeated inferences) in the model space for all corpus elements. Both the model and the vectors have been serialized for further use.