## Model Training Notebook

This notebook trains the models that support the semantic search capability in the Streamlit apps.

In [1]:
%load_ext autoreload
%autoreload 2
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from umap import UMAP
from hdbscan import HDBSCAN
import joblib
import pandas as pd


Read in the dataset for the model you want to produce. For the abstracts model this is the arxiv-metadata-oai-snapshot.json file in the data folder; for the full-papers model it is the corpus_file in the data folder.

In [2]:
meta = pd.read_json('../data/arxiv-metadata-oai-snapshot/arxiv-metadata-oai-snapshot.json', lines=True)

Clean the unwanted characters out of the strings. This reduces the number of words that become one-offs because of an attached dash or other special character, and lets the model learn a better contextual representation of each word. This step applies only to the abstracts model, since the corpus file is already cleaned when it is created.

In [3]:
def string_clean(text):
    text = "".join([x.lower() if x.isalnum() or x.isspace() else " " for x in text])
    return text

Extra spaces are sometimes left over, or were present to begin with, so we strip leading and trailing whitespace as well. Runs of spaces between words are handled later, because split() with no arguments splits on any run of whitespace. This step also applies only to the abstracts model, since the corpus file is already cleaned when it is created.

In [4]:
meta["clean"] = meta.loc[:, "abstract"].apply(string_clean).str.strip()
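As a quick check of the two cleaning steps, here is a minimal sketch run on a single made-up string. The sample text is hypothetical and not taken from the dataset; the cleaner is the same logic as string_clean above, repeated so the snippet runs on its own.

```python
def string_clean(text):
    # Lowercase letters and digits, keep whitespace, replace everything else with a space.
    return "".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text)

# Hypothetical abstract fragment, for illustration only.
sample = "  Graph-based, semi-supervised learning!  "

cleaned = string_clean(sample).strip()  # strip() removes leading/trailing whitespace only
tokens = cleaned.split()                # split() with no arguments collapses runs of whitespace

print(repr(cleaned))
print(tokens)
```

Note that the double space left behind by the comma survives strip() but disappears once the string is split into tokens.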

Documents need to be converted into TaggedDocument objects for Doc2Vec to train on them. For the papers model this is already done when the corpus file is created.

In [5]:
documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(meta["clean"])]

Now we can train the model. I have a machine with 32 cores, so I set workers to 28. Check your core count before setting workers for your machine. If you max out your CPU, your computer may lock up until training is done.

In [6]:
model = Doc2Vec(documents, epochs=200, vector_size=100, window=6, min_count=1, workers=28)
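If you would rather not hard-code the worker count, one possible approach is to derive it from the number of cores the machine reports. This is only a sketch: the helper name pick_workers and the default of 4 reserved cores are assumptions chosen to mirror the 32-core / 28-worker setup described above, not something the notebook itself uses.

```python
import os

def pick_workers(reserve=4):
    """Return a worker count that leaves `reserve` cores free (never fewer than 1)."""
    total = os.cpu_count() or 1  # os.cpu_count() may return None on some platforms
    return max(1, total - reserve)

workers = pick_workers()
print(f"{os.cpu_count()} logical cores detected, using {workers} workers")
```

The resulting value could then be passed as workers=workers to Doc2Vec. Keep in mind that os.cpu_count() reports logical cores, which may be higher than the number of physical cores.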

Once training is complete, we can save the model for use in the Streamlit app.

In [None]:
model.save('../models/abstracts/archive_model')

In [3]:
# model = Doc2Vec.load('../models/archive_model')

After training the model, we need to fit a UMAP lower-dimensional representation of the document vectors, using the command below. This is the first step in the topic analysis. We chose 2 components so that the final reduction can be plotted in 2D for visualization in the app.

In [41]:
umap_args = {'n_neighbors': 50,
            'n_components': 2,
            'metric': 'cosine'}
umap_model = UMAP(**umap_args).fit(model.dv.vectors)

After the UMAP reduction, the final vectors have to be clustered. To do this we use HDBSCAN, because density-based methods have no relationship to the origin: they cluster based on where the points are densest. In embedding layers, the initialization and the final orientation of the vectors can be somewhat randomly dispersed throughout space. A density-based method allows for that, since it does not depend on where a cluster sits relative to the origin. You can read more about this in the HDBSCAN documentation.

In [42]:
hdbscan_args = {'min_cluster_size': 50,
                'metric': 'euclidean',
                'cluster_selection_method': 'eom'}
cluster = HDBSCAN(**hdbscan_args).fit(umap_model.embedding_)
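Before saving, it can be useful to check how many topic clusters were found and how many documents were left unassigned. HDBSCAN marks noise points with the label -1. The sketch below summarizes a list of labels using only the standard library; the helper name summarize_labels and the example labels are hypothetical and stand in for the fitted model's labels.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters and noise points in a sequence of cluster labels."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # -1 is the noise label, not a cluster
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}

# Made-up labels for illustration; not output from the real model.
example_labels = [0, 0, 1, -1, 2, 2, 2, -1, 1, 0]
print(summarize_labels(example_labels))
```

In the notebook, the same helper could be called as summarize_labels(cluster.labels_). A very large noise count may be a sign that min_cluster_size is worth revisiting.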

After fitting both models we now have our topic clusters and our 2D vectors for plotting, so let's save the models and move over to the app for exploration.

In [43]:
with open('../models/UMAP', 'wb') as f:
    joblib.dump(umap_model, f)
with open('../models/clusters', 'wb') as f:
    joblib.dump(cluster, f)

You may notice that we are using joblib to save these models. This is important because joblib uses pickle under the hood, which means that to load the models you should be using the same version of Python that they were trained in, and ideally the same versions of the libraries as well. __We are using Python 3.10.8__
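One way to make a version mismatch easier to spot is to write a small metadata file next to the saved models and compare it at load time. The notebook does not do this; the sketch below is an assumption, and the file name env.json and both helper names are hypothetical.

```python
import json
import platform
import tempfile
from pathlib import Path

def write_env_info(path):
    """Record the running Python version in a JSON file."""
    info = {"python": platform.python_version()}
    Path(path).write_text(json.dumps(info))
    return info

def check_env_info(path):
    """Return True if the recorded Python version matches the running one."""
    saved = json.loads(Path(path).read_text())
    return saved["python"] == platform.python_version()

# Demonstrated in a temporary directory so nothing is written to the models folder.
with tempfile.TemporaryDirectory() as tmp:
    info_path = Path(tmp) / "env.json"
    write_env_info(info_path)
    print(check_env_info(info_path))
```

The same idea could be extended to record library versions, since pickled objects also depend on the packages that define them.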