# Smart Document Retrieval
<p>This notebook introduces the concepts behind smart document retrieval. Its primary focus is to illustrate how a transformer model embeds text into a numerical representation that can be used to calculate a similarity score against a query string. It also explores some of the architectural theory behind a complete application.</p>
<p>Smart Document Retrieval can be divided into the following key components:
    <li>1) Document Management</li>
    <li>2) Source Text Extraction</li>
    <li>3) Source Text Storage</li>
    <li>4) Source Text Embedding
        <ul>
            <li>4a) Model Selection</li>
            <li>4b) Model Evaluation (TODO)</li>
        </ul>
    </li>
    <li>5) Source Embedding Storage and Management</li>
    <li>6) Query String Embedding</li>
    <li>7) Similarity Scoring and Ranking</li>
    <li>8) Relevance Classification (optional / TODO)</li>
    <li>9) User interface (optional / TODO)</li>
    <li>10) Advanced techniques (TODO)
        <ul>
            <li><a href='https://www.sbert.net/examples/applications/retrieve_rerank/README.html'>Retrieval and Re-Ranking - Bi-Encoders(Retrieval) and Cross-Encoders(Re-Ranker)</a></li>
        </ul>
    </li>
</p>

## 1) Document Management
<p>How you manage the documents and resources depends on your use case and is beyond the scope of this notebook; however, it is important to consider several items.
    <li><b>Access</b>
        <ul>
            <li>Will the application have continuous access to source documents?</li>
            <li>Will the application need privileged permissions?</li>
        </ul>
    </li>
    <li><b>Versioning</b>
        <ul>
            <li>Is there a document versioning process?</li>
            <li>Are there duplicates or variations of a document?</li>
        </ul>
    </li>
    <li><b>Document/Resource Types</b>
        <ul>
            <li>What types of document formats will be used? (e.g. MS Word, Excel, Google Docs, websites, emails, etc.)</li>
            <li>Are there different format versions?</li>
        </ul>
    </li>
    </p>
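
<p>The duplicate question above can be explored with a small sketch. It assumes documents are already available as plain text and treats two documents as duplicates only when their text matches after lowercasing and whitespace cleanup. The function names and sample documents are hypothetical, and near-duplicates with real wording changes would need a different technique.</p>

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text):
    """Return a SHA-256 hex digest of the text after light normalization."""
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not produce different fingerprints.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(documents):
    """Group document names whose normalized content is identical."""
    groups = defaultdict(list)
    for name, text in documents.items():
        groups[content_fingerprint(text)].append(name)
    # Keep only fingerprints shared by more than one document.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical in-memory documents standing in for real files.
docs = {
    "report_v1.txt": "Quarterly results  were strong.",
    "report_copy.txt": "quarterly results were strong.",
    "memo.txt": "Team offsite is next week.",
}
duplicate_groups = group_duplicates(docs)
print(duplicate_groups)
```

<p>Embedding only one document per group avoids returning the same content several times in the search results.</p>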

## 2) Source Text Extraction
<p>The process for extracting the source text varies by use case. We offer some things to consider during your design, but your requirements may raise other considerations. The example dataset used in this notebook was extracted from USPTO patent XML files, selecting just the abstract for embedding.
    <li><b>Content Extraction - Technical</b>
        <ul>
            <li>How will you access the source text within the resource/document?</li>
            <li>What libraries / tools will be needed to extract text?</li>
        </ul>
    </li>
    <li><b>Content Selection</b>
        <ul>
            <li>What parts of the document will be selected for extraction? (e.g. Subject Line, Executive Summary, Individual Sections, etc.)</li>
            <li>If you would like the application to identify specific locations within a document that contain the relevant information, you will need to extract source text at the same level of granularity.</li>
        </ul>
    </li>
    <li><b>Content Quality</b>
        <ul>
            <li>Do you need to remove meta-data or file formatting components such as XML tags?</li>
            <li>Are there errors that need to be fixed? (e.g. spelling, formatting)</li>
        </ul>
    </li>
    </p>
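
<p>As a rough illustration of content selection and tag removal, the sketch below pulls an abstract out of an XML string with Python's standard library. The sample XML is a hypothetical, heavily simplified layout and is not taken from the dataset; the real USPTO file structure should be checked before relying on the element names used here.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified patent XML used only for illustration.
SAMPLE_XML = """<us-patent-grant>
  <abstract id="abstract">
    <p id="p-0001">A tire includes a <b>tread</b> portion
    and a pair of sidewalls.</p>
  </abstract>
</us-patent-grant>"""


def extract_abstract(xml_text):
    """Return the abstract as plain text, or an empty string if absent."""
    root = ET.fromstring(xml_text)
    abstract = root.find(".//abstract")
    if abstract is None:
        return ""
    # itertext() yields the text of the element and of all nested tags,
    # which drops inline markup such as <b>...</b>.
    raw = "".join(abstract.itertext())
    # Collapse newlines and repeated spaces left over from the XML layout.
    return " ".join(raw.split())


abstract_text = extract_abstract(SAMPLE_XML)
print(abstract_text)
```

<p>Returning an empty string for a missing abstract keeps the extraction loop running and leaves the decision about such records to a later quality check.</p>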

## 3) Source Text Storage
<p>The example dataset used in this notebook is stored in a simple CSV file. However, if your use case needs to scale to millions, billions, or more items, a database may be beneficial. One option is to use MongoDB running in its own container to store the Source Text data.</p>
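
<p>A minimal sketch of the simple CSV approach follows, using Python's built-in <code>csv</code> module. The column names mirror the example dataset, the records are hypothetical, and an in-memory buffer stands in for a real file so the sketch has no side effects.</p>

```python
import csv
import io

# Hypothetical extracted records: (source file name, abstract text).
records = [
    ("data/xml/example-0001.xml", "A method, system, and device for sorting."),
    ("data/xml/example-0002.xml", 'A "smart" tire with embedded sensors.'),
]

# Write to an in-memory buffer; replace it with
# open("abstracts.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["FileName", "Abstract"])
writer.writerows(records)

# Read the rows back to confirm commas and quotes survive the round trip.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["Abstract"])
```

<p>Using a CSV writer rather than joining strings by hand matters here because abstracts routinely contain commas and quotation marks.</p>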

In [1]:
# Importing the needed libraries & Modules

# Import cudf. cudf is part of the NVIDIA RAPIDS data science SDK and is used to hold
# the dataframes in GPU memory.
import cudf

# Import SentenceTransformer and util from the HuggingFace sentence_transformers library, which has
# been pre-installed in this environment.
from sentence_transformers import SentenceTransformer, util

# Import smart_search_models. This module was created for this example to simplify the management of the 
# various models that can be used for the embedding process.
import smart_search_models

### Loading Example Dataset
<p>The dataset used in this example comprises nearly 7,000 USPTO patent submissions. Note that the dataset has not been cleaned and contains incomplete abstract entries. Although clean data is always preferred, the incomplete entries are useful for understanding how the various models perform on imperfect data.</p>

<p>The source text dataset is stored in a plain-text CSV file containing the source XML file name and the extracted abstract text. Storing the source text is an important step in the document retrieval process, and the storage method may vary based on your use case (e.g. CSV, JSON, MongoDB, etc.). We use CSV in this example for simplicity.</p>

<b>Incomplete Examples</b>
Listed below are a couple of examples of entries with incomplete abstracts.
<li>data/xml/us10882359-20210105.xml - 'In a tire'</li>
<li>data/xml/us10881285-20210105.xml - 'A method ('</li>
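
<p>Entries like these can be flagged before embedding. The sketch below is one possible heuristic, not the method used to build this dataset: it marks an abstract as suspicious if it is very short, has unbalanced parentheses, or lacks closing punctuation. The function name and the <code>min_words</code> threshold are assumptions chosen for illustration.</p>

```python
def looks_incomplete(abstract, min_words=5):
    """Heuristically flag abstracts that appear truncated."""
    text = abstract.strip()
    too_short = len(text.split()) < min_words
    unbalanced = text.count("(") != text.count(")")
    no_final_punctuation = not text.endswith((".", "!", "?"))
    return too_short or unbalanced or no_final_punctuation


samples = [
    "In a tire",
    "A method (",
    "A computing device includes an interface configured to receive data.",
]
flags = [looks_incomplete(s) for s in samples]
print(flags)
```

<p>Flagged entries can be kept for model comparison, as in this notebook, or filtered out when cleaner results are needed.</p>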

In [2]:
# Load in the example dataset
df = cudf.read_csv('abstracts.csv')
df.head()

Unnamed: 0,FileName,Abstract
0,data/xml/us10885437-20210105.xml,Security systems and methods for detecting int...
1,data/xml/us10884005-20210105.xml,The present invention provides biomarkers usef...
2,data/xml/us10887313-20210105.xml,The described technology provides a single sig...
3,data/xml/us10887088-20210105.xml,A computing device includes an interface confi...
4,data/xml/us10887228-20210105.xml,Techniques for enabling peer-to-peer transmiss...


## 4) Source Text Embedding
<p>Historically, search relied on simple <a href='https://en.wikipedia.org/wiki/Lexicography'>lexicographical</a> similarity, that is, pattern matching such as regex. Although lexical search can be useful for some use cases, it has several disadvantages, such as needing to specify the precise terms to search for. To improve search results, it can be advantageous to search based on <a href='https://en.wikipedia.org/wiki/Semantic_similarity#:~:text=Semantic%20similarity%20is%20a%20metric,as%20opposed%20to%20lexicographical%20similarity.'>semantic similarity</a>, comparing concepts rather than words.</p>
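
<p>The limitation of lexical search can be seen in a small sketch. The three documents are hypothetical, and the helper function is written for this illustration only.</p>

```python
import re

# Hypothetical mini-corpus for illustration.
documents = [
    "An automobile with a hybrid drivetrain.",
    "A car seat with adjustable headrest.",
    "A vehicle navigation system.",
]


def lexical_search(term, texts):
    """Return the texts containing the term as a whole word, ignoring case."""
    regex = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [text for text in texts if regex.search(text)]


matches = lexical_search("car", documents)
print(matches)
```

<p>Only the document containing the literal word is returned. The entries about an automobile and a vehicle are missed even though they describe closely related concepts, which is the gap semantic search aims to close.</p>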

<p>To search by concept, we must be able to represent our data in the form of concepts. This is where <a href='https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)'>Transformers</a> come in. <a href='https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)'>Transformers</a> are a family of machine learning models that can be applied to Natural Language Processing (NLP). Trained on extremely large datasets such as Wikipedia, they learn to represent input text as a high-dimensional numerical vector; this process is called <a href='https://vaclavkosar.com/ml/transformer-embeddings-and-tokenization'>embedding</a>. If this sounds complicated, don't worry: the hard parts are abstracted away for us, and we just need to use the sentence transformer library. Although there are benefits to understanding how the models work, it can be just as valuable to show how easy they are to use and how impressive the results can be with off-the-shelf models. If greater accuracy is needed, you can always <a href='https://www.sbert.net/docs/training/overview.html'>train transformers</a> on your own datasets to improve their capabilities.</p>

### 4a) Model Selection
<p>There are a large number of models to choose from on <a href='https://huggingface.co/'>HuggingFace</a>, even for just the task of <a href='https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads'>Sentence Similarity</a> (>800 as of 11/2022). We have included a Python module to help simplify the organization and selection of a smaller subset of models to experiment with (~100). Using <a href='https://huggingface.co/'>HuggingFace</a> simplifies the process of downloading and running the various models. It is not the only way to consume Transformers, but it was chosen because it is one of the easiest ways to get started.</p>
    
There are a number of areas to consider when selecting a model for a given task:
<li><b>Model Size</b> - Larger models need more VRAM and can take longer to run but may be more 'accurate'.</li>
<li><b>Model Architecture</b> - Some models are designed for specific use cases or fine-tuned for a given problem. If your use case is similar, you may get good performance out of the box.</li>

### 4b) Model Evaluation
<p> TODO: Add content on steps to understand model performance</p>

### Loading the Model
<p>Loading the model is as simple as passing the model name as an input argument to create a model object. If the model isn't available locally, it will be downloaded automatically. One of the hardest parts of working with HuggingFace is keeping track of all the models available. You can view all the models available for <a href='https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads'>Sentence Similarity</a> and copy the name into the code, or, to simplify things, use the very basic Python module <a href='smart_search_models.py'>smart_search_models.py</a> that we created to hold model names.</p>

<details>
  <summary>SentenceTransformer Parameters</summary>
<li><b>model_name_or_path</b> – If it is a filepath on disc, it loads the model from that path. If it is not a path, it first tries to download a pre-trained SentenceTransformer model. If that fails, tries to construct a model from Huggingface models repository with that name.</li>
<li><b>modules</b> – This parameter can be used to create custom SentenceTransformer models from scratch.</li>
<li><b>device</b> – Device (like ‘cuda’ / ‘cpu’) that should be used for computation. If None, checks if a GPU can be used.</li>
<li><b>cache_folder</b> – Path to store models</li>
<li><b>use_auth_token</b> – HuggingFace authentication token to download private models.</li>
    </details>

In [3]:
# Check how many models are in the module
len(smart_search_models.sentence_models)

101

In [4]:
# Select and load model.
# Note: If a given model hasn't been used since the container has been loaded it will be downloaded automatically.
model_name = smart_search_models.sentence_models[6]
print("Loading model: '{}'".format(model_name))
model = SentenceTransformer(model_name,cache_folder='./models/')

Loading model: 'all-MiniLM-L12-v2'


### Source Text Embedding
<p>To embed the source text we can pass the entire column of our dataset into the model object in a single line of code as shown in the cell block below.</p>

<p>A couple important items to note here:
    <li>You only need to embed the source text once for a given model. Depending on your use case, you may wish to store the embeddings in a database for later use; just remember to keep track of the model used for embedding and the source document.</li>
    <li>Because each model embeds the input text differently, you need to ensure the source text and query text were embedded using the same model. If you store your embeddings for later, be sure to track which model was used, as comparing embeddings from different models will likely give unexpected results.</li>
    </p>
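
<p>One way to keep track of the model is to store its name next to every embedding and check it before comparing. The sketch below is a hypothetical record layout, with a JSON round trip standing in for whatever database or file format you choose; the function names, the file name, and the tiny 3-value embedding are all assumptions made for illustration.</p>

```python
import json


def make_record(file_name, model_name, embedding):
    """Bundle an embedding with its source document and the model that produced it."""
    return {"file_name": file_name, "model": model_name, "embedding": list(embedding)}


def check_same_model(record, query_model_name):
    """Raise if a stored embedding came from a different model than the query."""
    if record["model"] != query_model_name:
        raise ValueError(
            "Stored embedding uses model '{}' but the query uses '{}'".format(
                record["model"], query_model_name
            )
        )


# Hypothetical 3-value embedding; real embeddings are much longer.
record = make_record("data/xml/example-0001.xml", "all-MiniLM-L12-v2", [0.1, -0.2, 0.3])

# The JSON round trip stands in for saving to and loading from storage.
restored = json.loads(json.dumps(record))

# Passes silently because the model names match.
check_same_model(restored, "all-MiniLM-L12-v2")
```

<p>Failing loudly on a model mismatch is a deliberate choice: scores computed across different models still look like valid numbers, so the error would otherwise go unnoticed.</p>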

<details>
  <summary>encode Parameters</summary>
    <li><b>sentences</b> – the sentences to embed</li>
    <li><b>batch_size</b> – the batch size used for the computation</li>
    <li><b>show_progress_bar</b> – Output a progress bar when encoding sentences</li>
    <li><b>output_value</b> – Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values</li>
    <li><b>convert_to_numpy</b> – If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.</li>
    <li><b>convert_to_tensor</b> – If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy</li>
    <li><b>device</b> – Which torch.device to use for the computation</li>
    <li><b>normalize_embeddings</b> – If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.</li>
    </details>
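
<p>The note on <b>normalize_embeddings</b> can be checked with plain Python. The sketch below uses two hypothetical 2-value vectors in place of real embeddings and shows that, once vectors are scaled to length 1, their dot product matches their cosine similarity. The helper functions are written for this illustration and are not part of the sentence transformer library.</p>

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical toy vectors standing in for embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit, b_unit = normalize(a), normalize(b)
cos_ab = cosine(a, b)         # full cosine similarity
dot_ab = dot(a_unit, b_unit)  # plain dot product of the unit vectors
print(cos_ab, dot_ab)
```

<p>Normalizing once at embedding time moves the division out of every later comparison, which is why the dot product can be the faster option.</p>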

In [5]:
%%time
source_embeddings = model.encode(df.Abstract.to_pandas(), convert_to_tensor=True)

CPU times: user 20.8 s, sys: 1.22 s, total: 22 s
Wall time: 5.65 s


## 6) Query String Embedding
<p>Using the same model, we then embed our query string to be used for comparison.</p>

In [6]:
%%time
# Embed the query string
query_string = 'datascience'
query_embedding = model.encode(query_string,convert_to_tensor=True)

CPU times: user 2.46 ms, sys: 6.47 ms, total: 8.94 ms
Wall time: 8.11 ms


## 7) Similarity Scoring and Ranking
<p>Next we need to calculate the similarity between the query embedding and all the source text embeddings. One of the most common approaches is to calculate the cosine similarity. Again, the complexities and math have been abstracted away, here by the <a href='https://www.sbert.net/docs/package_reference/util.html'>util.cos_sim</a> function.</p>
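
<p>For reference, cosine similarity is the standard measure of the angle between two vectors. The symbols below are notation introduced here for illustration: <b>q</b> is the query embedding and <b>d</b> is one source text embedding.</p>

$$
\mathrm{sim}(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
$$

<p>The result lies between -1 and 1, and a higher value means the two embeddings point in more similar directions. Because the vector lengths are divided out, only direction affects the score.</p>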

In [7]:
%%time
# Create a new dataframe to store the results.
results_df = df.copy()

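# Compute cosine similarities. cos_sim returns a 2-D tensor with one row per source
# embedding, so each stored score is a one-element list.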
results_df['score'] = util.cos_sim(source_embeddings, query_embedding).tolist()

CPU times: user 4.46 ms, sys: 144 µs, total: 4.6 ms
Wall time: 3.89 ms


In [8]:
# Get the top ten results
top_ten = results_df.nlargest(10,'score').reset_index(drop=True)
top_ten.head()

Unnamed: 0,FileName,Abstract,score
0,data/xml/us10885055-20210105.xml,One or more datasets are received by a data wr...,[0.4848591983318329]
1,data/xml/us10885120-20210105.xml,A search request relating to one or more datas...,[0.46172472834587097]
2,data/xml/us10884402-20210105.xml,Sensor data is received characterizing operati...,[0.45607712864875793]
3,data/xml/us10884574-20210105.xml,A computer displays a graphical user interface...,[0.4508618414402008]
4,data/xml/us10887433-20210105.xml,Systems and methods are provided for segmentin...,[0.43512162566185]
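
<p>The scores above appear as one-element lists because <code>util.cos_sim</code> returns a 2-D result with one row per source embedding and one column per query embedding. The sketch below uses hypothetical scores and placeholder file names to show how such nested values can be unwrapped and ranked in plain Python. When working with the tensor itself, flattening it before calling <code>.tolist()</code> should give plain floats directly.</p>

```python
import heapq

# Hypothetical scores shaped like an N x 1 result converted to nested lists:
# one single-element list per source document.
nested_scores = [[0.12], [0.48], [0.31], [0.45]]
file_names = ["a.xml", "b.xml", "c.xml", "d.xml"]  # placeholder names

# Unwrap each single-element list so the scores are plain floats.
flat_scores = [row[0] for row in nested_scores]

# Pair each score with its document and keep the two highest.
top_two = heapq.nlargest(2, zip(flat_scores, file_names))
print(top_two)
```

<p>Plain float scores are easier to sort, threshold, and format than single-element lists.</p>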


In [9]:
# Show the highest scoring Abstract
print('The below Abstract resulted in a similarity score of {} \n'.format(top_ten.score[0]))
print(top_ten.Abstract[0])

The below Abstract resulted in a similarity score of [0.4848591983318329] 

One or more datasets are received by a data wrangling module and wrangled into a form that is computationally actionable by a user. At least some data from the one or more datasets are enriched by one or more data enrichment modules to generate an enriched form of at least some data corresponding to the one or more datasets that is computationally actionable by the user. The one or more datasets and the enriched form of the at least some data are processed by a signal detection module to identify relationships, anomalies, and/or patterns within the one or more datasets.


### Additional Resources

<li><a href='https://huggingface.co/'>HuggingFace</a></li>
<li><a href='https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads'>Sentence Similarity Models</a></li>
<li><a href='https://www.sbert.net/docs/usage/semantic_textual_similarity.html'>Semantic Textual Similarity</a></li>

### Environment
This notebook has been developed and tested on the following:
<li>RAPIDS - rapidsai-core:22.10-cuda11.5-base-ubuntu20.04-py3.9</li>
<li>Pytorch 1.12.1</li>
<li>sentence-transformers</li>