# cat-AI-log. An AI-based product group allocation system

Capstone project.

Sebastian Thomas @ neue fische Bootcamp Data Science<br />
(datascience at sebastianthomas dot de)

# Part 5: The search engine

We develop and illustrate the search engine, which is the core functionality of cat-AI-log.

## Import of modules, classes and functions

In [None]:
# python object persistence
import joblib

# data
import numpy as np
import pandas as pd

# machine learning
from sklearn.feature_extraction.text import CountVectorizer

# custom modules
from modules.spelling_correction import SpellingCorrector, edit_distance
from modules.quotient_extraction import (
    pairwise_damerau_levenshtein_distances,
    pairwise_damerau_levenshtein_similarities,
    symmetric_matrix,
    csgraph,
    quotient_matrix,
)
from modules.search import SearchEngine

# development:
# scientific computations
#from scipy.sparse import load_npz, save_npz

## Import of data

We import our data.

In [None]:
mira = pd.read_pickle('data/mira_processed.pickle')
mira.sample(5, random_state=0)

We define the corpus on which the objects in this notebook are fitted.

In [None]:
corpus = mira['article'].values

## Spelling corrector

To correct spelling mistakes in search queries, we use a simple spelling corrector, implemented as a scikit-learn transformer. When fitted on the corpus, it computes and stores the vocabulary of the documents. Search queries are passed to the `transform` method, which tokenizes each query, replaces every token with its nearest word in the vocabulary, and returns the rejoined tokens as the corrected query.
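
The following is a minimal, standard-library sketch of the fit/transform behaviour described above, not the implementation in `modules.spelling_correction`. The names `ToySpellingCorrector`, `tokenize` and `osa_distance` are hypothetical. It assumes that tokens are lowercased, that "nearest" means smallest optimal string alignment distance (a restricted Damerau-Levenshtein distance), and that ties are broken alphabetically.

```python
import re


def tokenize(text):
    """Split a string into lowercase word tokens (assumed tokenization)."""
    return re.findall(r'\w+', text.lower())


def osa_distance(a, b):
    """Optimal string alignment distance between two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


class ToySpellingCorrector:
    """Hypothetical stand-in illustrating the fit/transform contract."""

    def fit(self, documents):
        # store the sorted vocabulary of all tokens seen in the corpus
        self.vocabulary_ = sorted({t for doc in documents for t in tokenize(doc)})
        return self

    def transform(self, queries):
        # replace every query token by its nearest vocabulary word
        return [' '.join(self._nearest(t) for t in tokenize(q)) for q in queries]

    def _nearest(self, token):
        return min(self.vocabulary_, key=lambda word: osa_distance(token, word))


toy_corrector = ToySpellingCorrector().fit(
    ['aspirin complex granulat', 'paracetamol 500 mg']
)
print(toy_corrector.transform(['Asprin Cmplx', 'Paracetmol']))
# ['aspirin complex', 'paracetamol']
```

A brute-force scan over the vocabulary, as in `_nearest`, is linear in the vocabulary size per token, which is fine for illustration but may be slow for large corpora.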

In [None]:
spelling_corrector = SpellingCorrector()
spelling_corrector.fit(corpus);

In [None]:
spelling_corrector.transform(['Asprin Cmplx', 'Paracetamol', 'Syndikort'])

In [None]:
joblib.dump(spelling_corrector, 'objects/spelling_corrector.joblib');

## Search engine

We initialize an instance of the search engine and fit it on the corpus. Since fitting is computationally expensive, we persist the fitted object and reload it on later runs instead of fitting again.

In [None]:
# fitting is slow, so reuse a persisted search engine if one exists
try:
    search_engine = joblib.load('objects/search_engine.joblib')
except FileNotFoundError:
    search_engine = SearchEngine()
    search_engine.fit(corpus)
    joblib.dump(search_engine, 'objects/search_engine.joblib')

# development: allow import of intermediate steps 

#try:
#    vocabulary = np.load('objects/vocabulary.npy')
#except FileNotFoundError:
#    count_vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
#    count_vectorizer.fit(corpus)
#    # get_feature_names() was removed in scikit-learn 1.2; requires scikit-learn >= 1.0
#    vocabulary = np.array(count_vectorizer.get_feature_names_out()).astype('U')
#    np.save('objects/vocabulary.npy', vocabulary)
    
#try:
#    distances = np.load('objects/distances.npy')
#except FileNotFoundError:
#    distances = pairwise_damerau_levenshtein_distances(vocabulary, dtype=np.uint8)
#    np.save('objects/distances.npy', distances)

#try:
#    q = load_npz('objects/quotient_matrix.npz')
#except IOError:
#    similarities = pairwise_damerau_levenshtein_similarities(vocabulary, distances)
#    strong_similarities = csgraph(symmetric_matrix(similarities), 0.8)
#    q = quotient_matrix(strong_similarities)
#    save_npz('objects/quotient_matrix.npz', q, compressed=True)

#try:
#    search_engine = joblib.load('objects/search_engine.joblib')
#except FileNotFoundError:
#    search_engine = SearchEngine(quotient_matrix=q)
#    search_engine.fit(corpus);
#    joblib.dump(search_engine, 'objects/search_engine.joblib');

By default, the search engine returns only the document in the corpus that best matches the search query.

In [None]:
search_engine.recommend('Aspirin')

The search engine can also return all matching documents by setting `max_count=None`.

In [None]:
search_engine.recommend('Aspirin', max_count=None)

The `output` parameter controls what the search engine returns. Its default value is `'documents'`. For further processing, `'indices'` returns the indices of the matching documents, and `'with_similarities'` includes the similarities.

In [None]:
search_engine.recommend('Aspirin', max_count=None, output='indices')

In [None]:
indices = search_engine.recommend('Aspirin', max_count=None, output='indices')
mira.iloc[indices][['article', 'product group', 'prediction print', 'certainty print']]

In [None]:
search_engine.recommend('Aspirin', max_count=None, output='with_similarities')

The results can be limited to documents whose similarity meets a given `threshold`.

In [None]:
search_engine.recommend('Aspirin', max_count=None, output='with_similarities', threshold=0.5)

The spelling corrector can be placed in front of the search engine. The ordering of the tokens in the search query does not matter.

In [None]:
search_engine.recommend(spelling_corrector.transform(['CMPLX Asprin'])[0], max_count=None)

By default, the search engine may return documents that do not match all tokens of the search query. This behaviour can be controlled with the `include_all` parameter.
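
To make the roles of `max_count`, `output`, `threshold` and the `include_all` argument used in the next notebook cell more concrete, here is a toy ranking function. It is a sketch under stated assumptions and does not reproduce `modules.search.SearchEngine`: the name `toy_recommend` is hypothetical, the similarity is plain Jaccard overlap of token sets (chosen only because it ignores token order), and the return shapes for each `output` mode are guesses.

```python
import re


def token_set(text):
    """Lowercase word tokens as a set, so token order is irrelevant."""
    return set(re.findall(r'\w+', text.lower()))


def toy_recommend(query, documents, max_count=1, output='documents',
                  threshold=0.0, include_all=False):
    """Rank documents by Jaccard similarity to the query (assumed measure)."""
    query_tokens = token_set(query)
    scored = []
    for index, document in enumerate(documents):
        document_tokens = token_set(document)
        shared = query_tokens & document_tokens
        if not shared:
            continue  # no token in common: not a match
        if include_all and shared != query_tokens:
            continue  # require every query token to occur in the document
        similarity = len(shared) / len(query_tokens | document_tokens)
        if similarity >= threshold:
            scored.append((similarity, index))
    # best match first; ties keep corpus order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    if output == 'indices':
        result = [i for _, i in scored]
    elif output == 'with_similarities':
        result = [(documents[i], s) for s, i in scored]
    else:
        result = [documents[i] for _, i in scored]
    return result if max_count is None else result[:max_count]


toy_corpus = ['aspirin complex granulat', 'aspirin effect', 'complex vitamin b']

print(toy_recommend('complex aspirin', toy_corpus))
print(toy_recommend('complex aspirin', toy_corpus, max_count=None, output='indices'))
print(toy_recommend('complex aspirin', toy_corpus, max_count=None, threshold=0.3))
print(toy_recommend('complex aspirin', toy_corpus, max_count=None, include_all=True))
```

In this toy corpus, the unrestricted query matches all three documents, the threshold of 0.3 drops the weakest match, and `include_all=True` keeps only the document that contains both query tokens.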

In [None]:
search_engine.recommend(spelling_corrector.transform(['CMPLX Asprin'])[0], max_count=None, include_all=True)

The search engine may also find documents that themselves contain spelling mistakes.

In [None]:
search_engine.recommend('Aspirin effect', max_count=None, include_all=True)

We illustrate some more examples.

In [None]:
search_engine.recommend('Paracetamol', max_count=None)

In [None]:
search_engine.recommend('Ibuprofen', max_count=None)

In [None]:
search_engine.recommend('Symbicort', max_count=None)

In [None]:
search_engine.recommend('Hydrocortison', max_count=None)

In [None]:
search_engine.recommend('Hüft', max_count=None)