# Keyword Searching using BERT

The purpose of this notebook is to use the BERT NLP framework to build a keyword search engine for the review section of the running shoe reviews.

In [1]:
import pandas as pd
import numpy as np

import torch
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt

import time

import nltk
from nltk.corpus import stopwords
 


In [2]:
review_data=pd.read_pickle('Processed_dataset.pkl')
review_data.head(4)

Unnamed: 0,link,page,date,headline,description,review_body,comments
0,https://www.runningshoesguru.com/2017/09/hoka-...,"[b', html, \n \n, [ \n, [\n, <link as=""style"" ...",2017-09-28T09:49:35-04:00,Hoka One One Stinson ATR 4,The Stinson ATR 4 is built for long days on th...,The Stinson ATR 4 is the newest version of th...,[The bondi 5’s are like boards now-so stiff an...
1,https://www.runningshoesguru.com/2015/08/nike-...,"[b', html, \n \n, [ \n, [\n, <link as=""style"" ...",2017-11-25T13:13:48-05:00,Nike Free Flyknit 4.0,It\xe2\x80\x99s hard to believe that the Nike ...,Nike Free Flyknit 4.0 General Info: The Free ...,[Is this shoe recommended for marathons? Pls s...
2,https://www.runningshoesguru.com/2018/08/new-b...,"[b', html, \n \n, [ \n, [\n, <link as=""style"" ...",2018-08-15T08:54:42-04:00,New Balance Fresh Foam Beacon,The New Balance Fresh Foam Beacon is a lightwe...,"Light, Airy, Comfortable, Responsive, Flexibl...",[I have 60 miles on my Beacons and I can’t pra...
3,https://www.runningshoesguru.com/2012/09/scott...,"[b', html, \n \n, [ \n, [\n, <link as=""style"" ...",2012-09-28T18:38:48-04:00,Scott T2C,The Scott T2C provides an eminently enjoyable ...,Scott T2C General info The T2C is a lightweig...,[I love mine and wish I could still buy them s...


## Linear algebra helper functions

Once we have encoded our articles, we will want to compare the encodings using the inner product of normalised vectors (the cosine similarity). These functions simplify this.

In [3]:
def normalise(vect):
    if np.linalg.norm(vect)!=0:
        return vect/np.linalg.norm(vect)
    else:
        return vect

In [4]:
def similarity(vect1,vect2):
    return np.inner(normalise(vect1),normalise(vect2))
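As a quick sanity check of these helpers, the sketch below restates them in a self-contained form and tries them on a few hand-picked vectors. The vectors are made up for illustration; they are not BERT encodings.

```python
import numpy as np

def normalise(vect):
    # Scale to unit length; the zero vector is returned unchanged.
    norm = np.linalg.norm(vect)
    return vect / norm if norm != 0 else vect

def similarity(vect1, vect2):
    # Inner product of unit vectors = cosine of the angle between them.
    return np.inner(normalise(vect1), normalise(vect2))

a = np.array([1.0, 2.0, 3.0])
same_direction = similarity(a, 10 * a)   # parallel vectors, close to 1
orthogonal = similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # 0
opposite = similarity(a, -a)             # close to -1
with_zero = similarity(a, np.zeros(3))   # zero vector gives 0
print(same_direction, orthogonal, opposite, with_zero)
```

Because both vectors are normalised first, the result depends only on direction, not on length, which is what we want when comparing encodings.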

## Load Bert Model

Load the BERT model with hidden states, which we will use as the basis for the encoding.

In [5]:
bert_model= BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # We need all hidden states for our encoding
                                  )
bert_model = bert_model.to('cuda')
bert_model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

## Encode a sentence 

To encode a sentence the steps are as follows: 

1) Add special tokens and tokenise the text
2) Create a segment index marking every token as part of a single sentence (BERT is trained on sentence pairs)
3) Feed tokens and segments to the model ('cuda' is set so the GPU is used)
4) Use PyTorch to stack all the hidden states (the embedding output plus the 12 encoder layers) into a single tensor for the sentence
5) Create a list of embeddings, one per token, by concatenating the last 4 hidden states

In [6]:
def encode_sentence(sentence,model,tokenizer=BertTokenizer.from_pretrained('bert-base-uncased')):
    #add end tokens 
    sentence = "[CLS] " + sentence + " [SEP]"
    #tokenize text 
    tokenized_text = tokenizer.tokenize(sentence)
    #convert to indices 
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    #create segment index (all 1's)
    segments_ids = [1] * len(tokenized_text)
    #convert to tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

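    #get output from model; no_grad skips gradient tracking, which inference does not need
    #note: the second positional argument of BertModel is attention_mask,
    #so the all-1's tensor is received as the attention mask
    with torch.no_grad():
        output_states=model(tokens_tensor.to('cuda'), segments_tensors.to('cuda'))[2]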
    #stack all hidden layers
    token_embeddings = torch.stack(output_states, dim=0)
    token_embeddings = torch.squeeze(token_embeddings, dim=1)
    token_embeddings = token_embeddings.permute(1,0,2)

    token_vecs_cat = []

    

    # For each token in the sentence...
    for token in token_embeddings:

        # `token` is a [13 x 768] tensor (embedding output + 12 encoder layers)

        # Concatenate the vectors (that is, append them together) from the last 
        # four layers.
        # Each layer vector is 768 values, so `cat_vec` is length 3,072.
        cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
        #convert to numpy 
        cat_vec=cat_vec.cpu()
        cat_vec=cat_vec.detach().numpy()
    
        # Use `cat_vec` to represent `token`.
        token_vecs_cat.append(cat_vec)
    #drop the [CLS] and [SEP] special tokens
    token_vecs_cat=token_vecs_cat[1:-1]
    
    return token_vecs_cat
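The tensor reshaping in this function can be hard to follow, so here is a minimal sketch of the same stack, squeeze, permute and concatenate steps using NumPy. Random arrays stand in for BERT's hidden states (an assumption made so the sketch runs without the model), and the sentence length of 7 positions is arbitrary.

```python
import numpy as np

n_layers, n_tokens, hidden = 13, 7, 768   # 13 = embedding output + 12 encoder layers
rng = np.random.default_rng(0)

# Stand-in for the tuple of hidden states: one [1, n_tokens, 768] array per layer.
hidden_states = tuple(rng.normal(size=(1, n_tokens, hidden)) for _ in range(n_layers))

stacked = np.stack(hidden_states, axis=0)      # [13, 1, 7, 768]
stacked = np.squeeze(stacked, axis=1)          # [13, 7, 768] (batch dimension removed)
per_token = np.transpose(stacked, (1, 0, 2))   # [7, 13, 768] (token first, then layer)

# Concatenate the last four layers for each token: 4 * 768 = 3072 values.
token_vecs = [np.concatenate((tok[-1], tok[-2], tok[-3], tok[-4])) for tok in per_token]

# Drop the first and last positions, which hold [CLS] and [SEP].
token_vecs = token_vecs[1:-1]
print(len(token_vecs), token_vecs[0].shape)
```

The first 768 values of each vector come from the final layer, the next 768 from the layer before it, and so on.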

## Test: Encode sentence 

Here we test that the function is returning a list of numpy arrays with dimension 3072

In [7]:
def unit_test1(sentence, model=bert_model):
    start = time.time()
    test_case=encode_sentence(sentence, model)
    end= time.time()
    print('Return type:', type(test_case))
    if type(test_case)==list:
        print('Element type:', type(test_case[0]))
    print('Return shape:',np.shape(test_case))
    print('Run time for encoding:',end-start)

In [8]:
test_cases=['This is a test sentence','This is a much longer test sentance, containing the word antidisestablishmentarianism',
           'antidisestablishmentarianism']

for case in test_cases:
    print('Test case:', case)
    unit_test1(case)
    print()

Test case: This is a test sentence
Return type: <class 'list'>
Element type: <class 'numpy.ndarray'>
Return shape: (5, 3072)
Run time for encoding: 0.5337522029876709

Test case: This is a much longer test sentance, containing the word antidisestablishmentarianism
Return type: <class 'list'>
Element type: <class 'numpy.ndarray'>
Return shape: (20, 3072)
Run time for encoding: 0.026926040649414062

Test case: antidisestablishmentarianism
Return type: <class 'list'>
Element type: <class 'numpy.ndarray'>
Return shape: (8, 3072)
Run time for encoding: 0.02094864845275879



We can see that not all words are encoded as a single token. This is not an issue; it is simply how BERT tokenises text, splitting words it does not know into smaller word pieces.
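To illustrate why one word can become several tokens, here is a toy sketch of greedy longest-match-first word-piece splitting. The vocabulary is a hypothetical miniature one, not the real `bert-base-uncased` vocabulary, and the function is a simplified illustration rather than the tokeniser's actual implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]                       # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"run", "##ner", "##s", "shoe", "marathon", "##ning"}

print(wordpiece("runners", toy_vocab))   # ['run', '##ner', '##s']
print(wordpiece("shoes", toy_vocab))     # ['shoe', '##s']
print(wordpiece("trail", toy_vocab))     # ['[UNK]']
```

The `##` prefix marks a piece that continues the previous one, which is why tokens such as `##e` can show up in the search results later on.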

## Encode an Article 

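Now we need a function to encode whole articles. This is pretty simple: we split an article into sentences on full stops, encode each sentence, and then join the per-sentence lists of token vectors into one long list. Calling sum on a pandas Series of lists concatenates them, so the result is one vector per token in the article.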

In [9]:
def article_encode(review_text,model):
    review_df=pd.Series(review_text.strip().split('.'))
    review_df=review_df.apply(lambda x: encode_sentence(x,model))
    return review_df.sum()
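The split-and-concatenate pattern can be seen on a toy example. The sketch below uses a whitespace split as a hypothetical stand-in for the BERT tokeniser, and plain Python `sum(..., [])` in place of the pandas Series sum, so it runs without the model.

```python
def toy_tokenize(sentence):
    # Hypothetical stand-in for the BERT tokeniser: lower-case and split on whitespace.
    return sentence.lower().split()

def split_and_concatenate(text):
    sentences = text.strip().split('.')
    per_sentence = [toy_tokenize(s) for s in sentences]
    # sum(..., []) joins the lists end to end into one flat list.
    return sentences, sum(per_sentence, [])

sentences, tokens = split_and_concatenate("The Free 4.0 is light. It fits well.")
print(sentences)   # ['The Free 4', '0 is light', ' It fits well', '']
print(tokens)      # ['the', 'free', '4', '0', 'is', 'light', 'it', 'fits', 'well']
```

Note that splitting on '.' also cuts through decimals such as "4.0" and leaves an empty final piece. The empty piece contributes no tokens, and since the tokeniser later in the notebook splits the text in exactly the same way, the tokens and encodings stay aligned.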

## Test: Encode an Article 

In [10]:
def unit_test2(article):
    start = time.time()
    article_encoded=article_encode(article,bert_model)
    end= time.time()
    
    print('Number of sentences in encoding:',len(article_encoded))
    print('Run time:', end-start)
    print('Total floats stored:', np.shape(article_encoded[0])[0]*len(article_encoded))

In [11]:
# get first 5 example reviews

example_reviews=review_data['review_body'][0:5]

In [12]:
for review in example_reviews:
    print('words in review:', len(review.split()))
    unit_test2(review)
    print()

words in review: 1253
Number of sentences in encoding: 1483
Run time: 0.7297601699829102
Total floats stored: 4555776

words in review: 934
Number of sentences in encoding: 1187
Run time: 0.7282774448394775
Total floats stored: 3646464

words in review: 1025
Number of sentences in encoding: 1242
Run time: 0.5659785270690918
Total floats stored: 3815424

words in review: 526
Number of sentences in encoding: 721
Run time: 0.2979886531829834
Total floats stored: 2214912

words in review: 941
Number of sentences in encoding: 1090
Run time: 0.556598424911499
Total floats stored: 3348480

Note that the value labelled 'Number of sentences in encoding' is the number of token vectors in the article, since the per-sentence lists have been concatenated into one.



## Tokenise an article 

We could have added the tokenisation as a return from our encoding function, but this would have required unnecessary computation when tokenisation is needed on its own. Mirroring the previous sections, we'll write functions to tokenise a single sentence and then apply this to a whole article.

In [13]:
def sentence_tokenize(sentence,tokenizer=BertTokenizer.from_pretrained('bert-base-uncased')):
    sentence = "[CLS] " + sentence + " [SEP]"
    tokenized_text = tokenizer.tokenize(sentence)
    tokenized_text=tokenized_text[1:-1]
    return tokenized_text

In [14]:
def article_tokenizer(review_text,tokenizer=BertTokenizer.from_pretrained('bert-base-uncased')):
    review_df=pd.Series(review_text.strip().split('.'))
    review_df=review_df.apply(lambda x: sentence_tokenize(x,tokenizer))
    return review_df.sum()

## Test: Tokenise an article 

The key thing we need to check here is that the tokenisation is always producing the same length result as the encoder 

In [15]:
def unit_test3(article):
    start = time.time()
    article_tokenised=article_tokenizer(article)
    end= time.time()
    
    article_encoded=article_encode(article,bert_model)
    if len(article_tokenised)==len(article_encoded):
        print('Lengths match!')
    else:
        print('Lengths do NOT match!')
        
    print('number of tokens:',len(article_tokenised))
    print('number of encodings:',len(article_encoded))
    print('Run time:', end-start)
    print()
    

In [16]:
example_reviews=review_data['review_body'][0:5]

for review in example_reviews:
    print('words in review:', len(review.split()))
    unit_test3(review)
    print()

words in review: 1253
Lengths match!
number of tokens: 1483
number of encodings: 1483
Run time: 0.015005350112915039


words in review: 934
Lengths match!
number of tokens: 1187
number of encodings: 1187
Run time: 0.015999555587768555


words in review: 1025
Lengths match!
number of tokens: 1242
number of encodings: 1242
Run time: 0.01100015640258789


words in review: 526
Lengths match!
number of tokens: 721
number of encodings: 721
Run time: 0.016201019287109375


words in review: 941
Lengths match!
number of tokens: 1090
number of encodings: 1090
Run time: 0.010125398635864258




## Encode all the articles 

The plan here is to apply the article_encode and article_tokenizer functions to every article in the dataset. The end result will be a new column in which each entry is a list holding the tokens and the encodings.

In [17]:
def article_create_encoding(review_text,model):
    
    article_encoding=article_encode(review_text,model)
    article_tokens=article_tokenizer(review_text)
    
    return [article_tokens, article_encoding] 

#this is not an ideal way to store this information but it does allow for easy comparison of tokens and encodings 

Running the encoding on all articles takes 8.5 minutes on my laptop's GPU.

In [18]:
start = time.time()
#review_data['encoded_review_body']=review_data['review_body'].apply(lambda x: article_create_encoding(x,bert_model))
end= time.time()
print((end-start)/60)
print(review_data.head(4))

0.0
                                                link  \
0  https://www.runningshoesguru.com/2017/09/hoka-...   
1  https://www.runningshoesguru.com/2015/08/nike-...   
2  https://www.runningshoesguru.com/2018/08/new-b...   
3  https://www.runningshoesguru.com/2012/09/scott...   

                                                page  \
0  [b', html, \n \n, [ \n, [\n, <link as="style" ...   
1  [b', html, \n \n, [ \n, [\n, <link as="style" ...   
2  [b', html, \n \n, [ \n, [\n, <link as="style" ...   
3  [b', html, \n \n, [ \n, [\n, <link as="style" ...   

                        date                       headline  \
0  2017-09-28T09:49:35-04:00     Hoka One One Stinson ATR 4   
1  2017-11-25T13:13:48-05:00          Nike Free Flyknit 4.0   
2  2018-08-15T08:54:42-04:00  New Balance Fresh Foam Beacon   
3  2012-09-28T18:38:48-04:00                      Scott T2C   

                                         description  \
0  The Stinson ATR 4 is built for long days on th...   
1  It\

In [19]:
#review_data.to_pickle('Review_data_BERT.pkl')

review_data=pd.read_pickle('Review_data_BERT.pkl')


## Clear stopwords for each article 

When searching for keywords we are not interested in matching stop words, so let's simply remove them from the articles before searching. Single-character tokens are dropped at the same time.

In [20]:
def clear_stopwords(encoded_article):
    tokens=encoded_article[0]
    vectors=encoded_article[1]
    
    cleared_tokens=[]
    cleared_vectors=[]
    
    if len(tokens)!=len(vectors):
        raise ValueError('tokens and vector lengths do not match')
    
    for i in range(len(tokens)):
        if tokens[i] not in stopwords.words('english') and len(tokens[i])>1:
            cleared_tokens+=[tokens[i]]
            cleared_vectors+=[vectors[i]]
        
    
    return [cleared_tokens,cleared_vectors]
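The function above looks up the stop word list again for every token, which is likely to account for much of its run time. A possible faster variant, sketched here under the assumption that the stop words are passed in as an argument, builds a set once and filters with it. The name `clear_stopwords_fast` and the tiny stop word list are hypothetical; in the notebook the list would be `stopwords.words('english')`.

```python
def clear_stopwords_fast(encoded_article, stop_words):
    tokens, vectors = encoded_article
    if len(tokens) != len(vectors):
        raise ValueError('tokens and vector lengths do not match')
    stop_words = set(stop_words)   # built once; membership tests are then fast
    # Keep a token (and its vector) if it is not a stop word and is longer than 1 character.
    kept = [(t, v) for t, v in zip(tokens, vectors)
            if t not in stop_words and len(t) > 1]
    cleared_tokens = [t for t, _ in kept]
    cleared_vectors = [v for _, v in kept]
    return [cleared_tokens, cleared_vectors]

# Hypothetical miniature stop word list and article, for illustration only.
toy_stop_words = ['the', 'is', 'a', 'for']
toy_article = [['the', 'shoe', 'is', 'a', 'trail', 's', 'shoe'], [0, 1, 2, 3, 4, 5, 6]]
result = clear_stopwords_fast(toy_article, toy_stop_words)
print(result)   # [['shoe', 'trail', 'shoe'], [1, 4, 6]]
```

The integers in the toy article stand in for the 3072-dimensional vectors, which makes it easy to see that tokens and vectors are removed in step.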

In [21]:
start = time.time()
review_data['encoded_review_body']=review_data['encoded_review_body'].apply(lambda x: clear_stopwords(x))
end= time.time()
print((end-start)/60)


2.5759114066759747


## Article keyword count

This section contains two functions. The first, keyword_encode, uses a fixed sentence to help encode our keyword. I found this to be a much more powerful method than encoding just the word on its own. I chose the sentence "look for --- in the text", but to refine the algorithm it might be advantageous to experiment with this.

The second function, article_keyword_count, counts the instances of similar words in an article. I set the threshold for this as a similarity of 0.5 or greater; this is another parameter that could be tuned.

In [22]:
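def keyword_encode(keyword):
    #strip returns a new string, so the result must be assigned
    keyword=keyword.strip()
    key_sentence='look for '+keyword+' in the text'
    
    #index 2 is the keyword ('look'=0, 'for'=1); if the keyword is split
    #into several word pieces, only the first piece is used
    return encode_sentence(key_sentence,bert_model)[2]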

In [23]:
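'''
keyword: the vector for our chosen keyword, as returned by keyword_encode

encoded_article: an entry of the 'encoded_review_body' column, 
                 that is, a list [tokens, vectors]

threshold: minimum similarity for a token to count as a match

'''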

def article_keyword_count(keyword, encoded_article, threshold=0.5):
    article_df=pd.DataFrame(encoded_article[0], columns=['tokens'])
    article_df['encoding']=pd.Series(encoded_article[1])
    
    article_df['similarity']=article_df['encoding'].apply(lambda x: similarity(x,keyword))
    article_df=article_df[article_df['similarity']>=threshold]
    #print(article_df.index)
    return np.shape(article_df)[0], article_df['tokens']
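The counting step can also be written with NumPy alone, which makes the thresholding explicit. This is a sketch of an alternative, not the function used in this notebook: `count_similar` is a hypothetical name, and the 2-dimensional vectors are made up so the similarities are easy to verify by hand.

```python
import numpy as np

def count_similar(keyword_vec, token_vecs, threshold=0.5):
    matrix = np.asarray(token_vecs, dtype=float)            # [n_tokens, dim]
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                                  # leave zero vectors unchanged
    key_norm = np.linalg.norm(keyword_vec)
    key = keyword_vec / key_norm if key_norm != 0 else keyword_vec
    sims = (matrix / norms) @ key                            # cosine similarity per token
    mask = sims >= threshold
    return int(mask.sum()), mask

keyword_vec = np.array([1.0, 0.0])
token_vecs = [np.array([2.0, 0.0]),    # same direction, similarity 1
              np.array([1.0, 1.0]),    # 45 degrees, similarity about 0.707
              np.array([0.0, 3.0]),    # orthogonal, similarity 0
              np.array([-1.0, 0.0])]   # opposite direction, similarity -1
count, mask = count_similar(keyword_vec, token_vecs)
print(count, mask.tolist())   # 2 [True, True, False, False]
```

Raising the threshold towards 1 keeps only near-identical directions, while lowering it admits looser matches, which is the trade-off to explore when tuning this parameter.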

## Test: article keyword count 

The test for the keyword encoding is just there to check that the output is a numpy array with the correct dimension and that the time taken is reasonable 

In [24]:
def unit_test4(keyword):
    start = time.time()
    test_encoding=keyword_encode(keyword)
    end= time.time()
    print('Type of return',type(test_encoding))
    print('Shape of return',np.shape(test_encoding))
    print('Run time:',end-start)

In [25]:
for keyword in ['race','trail','marathon', 'zero']:
    print(keyword)
    unit_test4(keyword)
    print()

race
Type of return <class 'numpy.ndarray'>
Shape of return (3072,)
Run time: 0.1623682975769043

trail
Type of return <class 'numpy.ndarray'>
Shape of return (3072,)
Run time: 0.03900003433227539

marathon
Type of return <class 'numpy.ndarray'>
Shape of return (3072,)
Run time: 0.03762507438659668

zero
Type of return <class 'numpy.ndarray'>
Shape of return (3072,)
Run time: 0.03663778305053711



When testing the keyword count, we want to check that all the outputs have the correct data types, but also that the matching words in the text fit the keyword acceptably well.

In [26]:
def unit_test5(keyword, article):
    start = time.time()
    test_count=article_keyword_count(keyword_encode(keyword),article)
    end= time.time()
    
    print('Type of return:',type(test_count))
    print('Type of first element:',type(test_count[0]))
    print('Type of second element:',type(test_count[1]))
    
    print('Keywords matched to:',test_count[1].to_list())
    print('Run time:',end-start)

In [27]:
for keyword in ['race','trail','marathon', 'zero']:
    print(keyword)
    unit_test5(keyword, review_data['encoded_review_body'].iloc[100])
    print()

race
Type of return: <class 'tuple'>
Type of first element: <class 'int'>
Type of second element: <class 'pandas.core.series.Series'>
Keywords matched to: ['speed', 'rider', 'run', 'run', 'distance', 'speed', 'runs', 'trainer', 'speed', 'miles']
Run time: 0.06927824020385742

trail
Type of return: <class 'tuple'>
Type of first element: <class 'int'>
Type of second element: <class 'pandas.core.series.Series'>
Keywords matched to: ['form', 'cushion', 'experience', 'confidence', 'heel', 'cup', 'cushion', 'tempo', 'abuse']
Run time: 0.056177616119384766

marathon
Type of return: <class 'tuple'>
Type of first element: <class 'int'>
Type of second element: <class 'pandas.core.series.Series'>
Keywords matched to: ['shoes', 'runners', 'runner', 'glide', 'run', 'mile', 'run', 'distance', 'runs', 'marathon', '##e', 'runners', 'runs', 'running', 'runs', 'runs', 'marathon', 'long', 'runs', 'runner', 'tempo', 'runs', 'workout', 'abuse', 'ran', 'running', 'marathon', 'trainer', 'speed', 'tempo', 'da

## Keyword Search 

This is the final section, where everything comes together into a function that can search for articles in our dataset based on how many words in the article are similar to a keyword. 

search_keywords does all the significant work here: first encoding the keyword, then counting matches in all the articles using pandas apply, before finding the top n matches.

print_search just presents this information in a nicely readable format 

In [28]:
def search_keywords(keyword,dataset=None, topn=3):
    #copy at call time, so the search always sees the current review_data and never modifies it
    if dataset is None:
        dataset=review_data.copy()
    encoded_keyword=keyword_encode(keyword)
    dataset['keyword_count']=dataset['encoded_review_body'].apply(lambda x: article_keyword_count(encoded_keyword,x)[0])
    dataset=dataset.nlargest(topn, 'keyword_count')
    return dataset
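The ranking step on its own is just a top-n selection by match count. As a rough illustration, here is the same idea in plain Python with `heapq.nlargest` standing in for the pandas `nlargest` call. The headlines and counts are invented for the example, and `top_n_by_count` is a hypothetical helper.

```python
import heapq

def top_n_by_count(headlines, counts, topn=3):
    # Pair each article with its match count and keep the topn largest counts.
    return heapq.nlargest(topn, zip(counts, headlines), key=lambda pair: pair[0])

# Invented data, for illustration only.
headlines = ['Shoe A', 'Shoe B', 'Shoe C', 'Shoe D']
counts = [4, 9, 1, 7]

results = top_n_by_count(headlines, counts, topn=2)
print(results)   # [(9, 'Shoe B'), (7, 'Shoe D')]
```

If topn is larger than the number of articles, all of them are returned in descending order of count, so code that prints the results should loop over however many rows come back.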

In [29]:
def print_search(keyword):
    start = time.time()
    test_search=search_keywords(keyword)
    
    
    for i in range(len(test_search)):
    
        print(test_search['headline'].iloc[i])
        print()
        print(test_search['description'].iloc[i])
        print()
        print(test_search['link'].iloc[i])
        print('-'*50)
        print()
    end= time.time()
    print(f'Search time: {end-start}')

In [30]:
print_search('marathon')

ON Cloudrunner

The newest Cloudrunner is the best one to date. It's stronger than the last one and the realignment of the ringlets in conjunction with the Speedboard make this a very fast shoe for endurance runners. It handles well on the trails, but is not a true trail shoe. It's fun to run with this shoe and it looks great.

https://www.runningshoesguru.com/2013/07/on-running-cloudrunner-review/
--------------------------------------------------

Adidas Adizero Adios Pro

The Adidas Adizero Adios Pro is the flagship Adidas racing shoe and is a serious competitor to the Vaporfly and Alphafly Next%.  Its new Lightstrike Pro midsole provides a bouncy, fun ride while its extreme rocker results in a propulsive, forward tipping sensation during every toe-off.  The Adios Pro is a force to be reckoned with and at $50 less than the Vaporfly Next%, it's a steal. 

https://www.runningshoesguru.com/2020/11/adidas-adizero-adios-pro-review/
--------------------------------------------------

New 

# Conclusion

In this notebook we have built a search algorithm that uses BERT to find articles in the running dataset based on a keyword that we are interested in. Since this is an unsupervised task, it is difficult to assess how successful it has been, but I welcome you to test out different search terms and decide for yourself how well the results match.