## AutoComplete Search Algorithm for Flask App

The purpose of this notebook is to build an autocomplete algorithm that lets users search for a product even when they do not know its exact name.

The output is the product name most similar to the search input, as measured by cosine similarity.

We treat each product name as a document and use CountVectorizer to count the characters in it, which places every product name in a character vector space.

The user's search input is compared with every product name, and the product with the highest cosine similarity is returned.
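
The idea can be sketched without scikit-learn. The snippet below is a minimal illustration rather than the notebook's implementation: it swaps CountVectorizer for `collections.Counter`, and the helper names `char_counts`, `cosine`, and `best_match`, as well as the short list of sample names, are assumptions made for the example.

```python
import math
import re
from collections import Counter


def char_counts(name):
    # Count every word character in the name, ignoring case, spaces and punctuation
    return Counter(re.findall(r'\w', name.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors stored as Counters
    dot = sum(a[ch] * b[ch] for ch in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, names):
    # Return the name whose character counts are closest to the query's
    query_vec = char_counts(query)
    return max(names, key=lambda name: cosine(query_vec, char_counts(name)))


# Hypothetical sample data and a misspelled query
names = ['Almond Butter', 'Frozen Peaches', 'Plain Mini Bagels']
print(best_match('almond buter', names))
```

Because only character counts are compared, the order of the letters plays no role, which is what makes the match tolerant of small typos.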

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
df = pd.read_pickle('rules.pkl')

**Grab a list of product names from itemA and itemB**

In [4]:
itemA = df.iloc[:,0]
itemB = df.iloc[:,1]

In [5]:
product_names = list(set(list(itemA.values) + list(itemB.values)))

In [6]:
def split(word):
    return [char for char in word]

In [7]:
product_names[:10]

['Original Crescent Rolls',
 'Pitted Prunes',
 'Maple Quinoa Cluster With Chia Seeds',
 'Traditional Refried Beans',
 'Organic Quinoa Dark Chocolate Bar',
 'Chunk Light Tuna In Water',
 'Organic Australian Style Vanilla Lowfat Yogurt',
 'Raspberries Dark Chocolate Bar',
 'Ground Turkey Breast',
 'Organic Chewy Chocolate Banana Bites']

In [44]:
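# Turn each product name into a space-separated string of lowercase characters,
# e.g. 'Pitted Prunes' -> 'p i t t e d p r u n e s'
product_documents = []

for product in product_names:
    split_product = ''
    for word in product.lower().split():
        split_product += ' '.join(split(word)) + ' '
    product_documents.append(split_product[:-1])  # drop the trailing space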

In [46]:
char_vectorizer = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
char_doc = char_vectorizer.fit_transform(product_documents)
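
The explicit `token_pattern` matters because every product name was split into single characters. CountVectorizer's default pattern, `(?u)\b\w\w+\b`, only keeps tokens of two or more word characters, so it would discard every one-character token. The sketch below applies both patterns with Python's `re` module to show the difference; the sample string is an assumption, and `re.findall` stands in for the vectorizer's tokenizer.

```python
import re

default_pattern = r'(?u)\b\w\w+\b'  # scikit-learn's default: two or more word characters
char_pattern = r'(?u)\b\w+\b'       # pattern used above: one or more word characters

document = 'p i t t e d p r u n e s'  # hypothetical preprocessed product name

default_tokens = re.findall(default_pattern, document)
char_tokens = re.findall(char_pattern, document)

print(default_tokens)  # no single-character token survives
print(char_tokens)     # every character becomes its own token
```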

In [47]:
import pickle
with open("char_vectorizer.pkl", "wb") as fp:   #Pickling
    pickle.dump(char_vectorizer, fp)

In [48]:
char_vectorizer = pd.read_pickle('char_vectorizer.pkl')

In [97]:
# Put this into a dataframe
product_vect_df = pd.DataFrame(char_doc.todense(), index = product_names)
product_vect_df.to_pickle('product_vect_df.pkl')

In [50]:
product_vect_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
Original Crescent Rolls,0,0,0,0,0,0,0,0,0,0,...,0,3,2,1,0,0,0,0,0,0
Pitted Prunes,0,0,0,0,0,0,0,0,0,0,...,0,1,1,2,1,0,0,0,0,0
Maple Quinoa Cluster With Chia Seeds,0,0,0,0,0,0,0,0,0,0,...,1,1,3,2,2,0,1,0,0,0
Traditional Refried Beans,0,0,0,0,0,0,0,0,0,0,...,0,3,1,2,0,0,0,0,0,0
Organic Quinoa Dark Chocolate Bar,0,0,0,0,0,0,0,0,0,0,...,1,3,0,1,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Almond Butter,0,0,0,0,0,0,0,0,0,0,...,0,1,0,2,1,0,0,0,0,0
Unbleached All-Purpose Flour,0,0,0,0,0,0,0,0,0,0,...,0,2,1,0,3,0,0,0,0,0
Frozen Peaches,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,1
Plain Mini Bagels,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [132]:
# Count the gaps between words in each product name (word count minus one)
num_spaces = []

for product in product_names:
    num_space = len(product.split()) - 1
    num_spaces.append(num_space)

product_vect_df['num_spaces'] = num_spaces

product_vect_df.to_pickle('product_vect_df.pkl')
product_vect_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,27,28,29,30,31,32,33,34,35,num_spaces
Original Crescent Rolls,0,0,0,0,0,0,0,0,0,0,...,3,2,1,0,0,0,0,0,0,2
Pitted Prunes,0,0,0,0,0,0,0,0,0,0,...,1,1,2,1,0,0,0,0,0,1
Maple Quinoa Cluster With Chia Seeds,0,0,0,0,0,0,0,0,0,0,...,1,3,2,2,0,1,0,0,0,5
Traditional Refried Beans,0,0,0,0,0,0,0,0,0,0,...,3,1,2,0,0,0,0,0,0,2
Organic Quinoa Dark Chocolate Bar,0,0,0,0,0,0,0,0,0,0,...,3,0,1,1,0,0,0,0,0,4
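
Character counts alone cannot distinguish a one-word query from a two-word product name made of the same letters; appending the gap count as an extra feature separates the two. The sketch below illustrates the effect with the standard library only. It is an assumption-laden stand-in for the DataFrame above: `name_vector` and `cosine` are hypothetical helpers, and `Counter` replaces CountVectorizer.

```python
import math
import re
from collections import Counter


def name_vector(name):
    # Character counts plus one extra feature holding the number of word gaps
    counts = Counter(re.findall(r'\w', name.lower()))
    counts['num_spaces'] = len(name.split()) - 1
    return counts


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as Counters
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Same letters, different number of words (hypothetical query)
one_word = cosine(name_vector('almondbutter'), name_vector('Almond Butter'))
two_words = cosine(name_vector('almond butter'), name_vector('Almond Butter'))
print(one_word, two_words)
```

With the extra feature, the one-word query scores slightly below the two-word query, even though both contain exactly the same characters.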


In [4]:
def find_similar_products(product_name, top_num=3):
    # Text preprocessing: lowercase the query and separate its characters with spaces
    product_document = ' '.join(' '.join(word) for word in product_name.lower().split())

    char_vectorizer = pd.read_pickle('char_vectorizer.pkl')
    product_char_array = char_vectorizer.transform([product_document]).toarray()

    # Number of gaps between words in the query (word count minus one)
    num_space = len(product_name.split()) - 1

    # Query vector: character counts followed by the gap count
    product_array = np.append(product_char_array.reshape(-1), num_space)

    product_vect_df = pd.read_pickle('product_vect_df.pkl')

    # Cosine similarity between the query and every product name
    cosine_sim = cosine_similarity(product_vect_df.values,
                                   product_array.reshape(1, -1)).ravel()

    cosine_sim_df = pd.DataFrame(cosine_sim,
                                 index=product_vect_df.index,
                                 columns=['Similarity'])

    # Return the names and scores of the top_num most similar products
    top_products = cosine_sim_df.sort_values(by='Similarity', ascending=False)
    return top_products.index.values[:top_num], top_products['Similarity'].values[:top_num]

In [5]:
def auto_complete(product_name):
    user_input = product_name
    similar_products, sim_values = find_similar_products(user_input)

    while True:
        # A near-exact match is returned without asking the user
        if sim_values[0] >= 0.95:
            return similar_products[0]

        print('Did you mean one of these?')
        for item in similar_products:
            print(item)
        print('Type yes or enter another product name')

        user_reply = input()

        if user_reply.lower() != 'yes':
            # Treat any other reply as a new search term
            user_input = user_reply
            similar_products, sim_values = find_similar_products(user_input)
        else:
            print('Which product? From 1 to 3')
            which_one = int(input()) - 1
            return similar_products[which_one]
    

In [6]:
product_name = 'Double Fiber bread'
auto_complete(product_name)

'Double Fiber Bread'
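
The "Which product?" prompt converts the reply with `int(input())`, so a non-numeric reply raises a `ValueError`, a reply above 3 raises an `IndexError`, and a reply of 0 silently selects the last suggestion. One way to guard against this is a small validation helper. `parse_choice` below is a hypothetical name and not part of the notebook.

```python
def parse_choice(reply, num_options):
    # Convert a 1-based reply such as '2' into a 0-based index,
    # or return None when the reply is not a valid option
    try:
        choice = int(reply.strip())
    except ValueError:
        return None
    if 1 <= choice <= num_options:
        return choice - 1
    return None


print(parse_choice('2', 3))      # valid choice
print(parse_choice('seven', 3))  # not a number
print(parse_choice('4', 3))      # out of range
```

A caller could keep asking until the helper returns something other than `None`.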

In [14]:
def Auto_Complete(product_name, top_num=3):
    # Text preprocessing: lowercase the query and separate its characters with spaces
    product_document = ' '.join(' '.join(word) for word in product_name.lower().split())

    char_vectorizer = pd.read_pickle('char_vectorizer.pkl')
    product_char_array = char_vectorizer.transform([product_document]).toarray()

    # Number of gaps between words in the query (word count minus one)
    num_space = len(product_name.split()) - 1

    # Query vector: character counts followed by the gap count
    product_array = np.append(product_char_array.reshape(-1), num_space)

    product_vect_df = pd.read_pickle('product_vect_df.pkl')

    # Cosine similarity between the query and every product name
    cosine_sim = cosine_similarity(product_vect_df.values,
                                   product_array.reshape(1, -1)).ravel()

    cosine_sim_df = pd.DataFrame(cosine_sim,
                                 index=product_vect_df.index,
                                 columns=['Similarity'])

    top_products = cosine_sim_df.sort_values(by='Similarity', ascending=False)

    # Return a single product for a near-exact match, otherwise the top suggestions
    if top_products.iloc[0, 0] >= 0.95:
        return [top_products.index[0]]
    else:
        print('Did you mean one of these?')
        return list(top_products.index[:top_num])

In [17]:
product_name = 'Animal Crackers'
search = Auto_Complete(product_name)

Did you mean one of these?


In [18]:
len(search)

3

In [19]:
product_name = 'Organic Milks'
search = Auto_Complete(product_name)

In [20]:
len(search)

12
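
Inside a Flask app, text written with `print` goes to the server's standard output rather than to the browser, so it may be more convenient to return the message together with the suggestions. The sketch below shows one possible shape for such a result. `build_response`, its dictionary keys, and the sample scores are all assumptions for illustration, not part of the notebook.

```python
def build_response(ranked, threshold=0.95, top_num=3):
    # ranked: list of (product_name, similarity) pairs, best match first
    if not ranked:
        return {'status': 'no_match', 'products': []}
    best_name, best_score = ranked[0]
    if best_score >= threshold:
        # Near-exact match: return the single best product
        return {'status': 'match', 'products': [best_name]}
    # Otherwise return the top suggestions along with the prompt text
    return {
        'status': 'suggestions',
        'message': 'Did you mean one of these?',
        'products': [name for name, _ in ranked[:top_num]],
    }


# Hypothetical similarity scores
ranked = [('Almond Butter', 0.93), ('Frozen Peaches', 0.91),
          ('Plain Mini Bagels', 0.90), ('Pitted Prunes', 0.88)]
print(build_response(ranked))
```

A route handler could then serialize this dictionary instead of relying on printed output.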