# Seedtag codetest: NLP Researcher

## Part 3. Message-matcher baseline model
This notebook contains a baseline message-matcher model. Given a query message and a corpus of historical messages, the matcher retrieves every historical message whose similarity to the query reaches a given threshold. Your goal is to improve this model.
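
The retrieval idea can be illustrated without any model, as a minimal sketch. The 2-D vectors below are hypothetical stand-ins for real embeddings, and the helper names `cosine_scores` and `retrieve` are assumptions made for illustration, not functions from this notebook.

```python
import numpy as np


def cosine_scores(query_vec, corpus_vecs):
    """Cosine similarity between one query vector and each row of corpus_vecs.

    Assumes no vector has zero norm.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    corpus_vecs = np.asarray(corpus_vecs, dtype=float)
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return c @ q


def retrieve(query_vec, corpus_vecs, threshold):
    """Return (corpus index, score) pairs with score >= threshold, best first."""
    scores = cosine_scores(query_vec, corpus_vecs)
    keep = np.where(scores >= threshold)[0]
    order = keep[np.argsort(-scores[keep])]
    return [(int(i), float(scores[i])) for i in order]


# Toy "embeddings" (hypothetical values, not model output).
corpus_vecs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.2])

hits = retrieve(query_vec, corpus_vecs, threshold=0.5)
print(hits)  # only the first two corpus vectors pass the 0.5 threshold
```

Scoring the query against the corpus directly, as here, needs one similarity per corpus message rather than a full pairwise matrix.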

In [15]:
pip install -r ../requirements.txt



In [16]:
import os
from hashlib import md5
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sentence_transformers import SentenceTransformer
import torch

### 0. Auxiliary Functions

In [17]:
# Load a pre-trained sentence-embedding model (MiniLM, a BERT-style transformer)
model = SentenceTransformer('all-MiniLM-L6-v2')

def create_df(path, tag):
    '''
    Creates a data frame for a given class
    --------------------------------------
    Input:
        path (str): path where all classes folders are stored.
        tag (str): name of the folder containing class "tag".
    Output:
        df (pd.DataFrame): dataframe with file as index and columns=[text, tag]
    '''
    list_of_text = []
    tag_dir = os.path.join(path, tag)
    for file in os.listdir(tag_dir):
        with open(os.path.join(tag_dir, file), encoding="utf-8", errors="ignore") as f:
            list_of_text.append((f.read(), file))
    # Build the frame once, after the loop, so an empty folder still yields a valid (empty) frame
    df = pd.DataFrame(list_of_text, columns=['Text', 'file']).set_index('file')
    df['tag'] = tag
    return df


def get_all_dfs(path, tags):
    '''
    Loops over all classes in path, each in the corresponding folder
    --------------------------------
    Input:
        path (str): path where all classes folders are stored.
        tags (list): list of classes names.
    Output:
        df (pd.DataFrame): pandas dataframe with the dataframes corresponding to all classes concatenated.
    '''
    list_of_dfs = []
    for tag in tags:

        df = create_df(path, tag)
        list_of_dfs.append(df)
    data = pd.concat(list_of_dfs)
    return data


def to_md5(rsc_id: str) -> str:
    """
    Convert an rsc_id string into its md5 hexdigest.
    :param rsc_id: str.
    :return: hexdigest representation of the md5 hash of the input string.
    """
    md5_rsc = bytes(rsc_id, 'utf-8')
    result_1 = md5(md5_rsc)
    return result_1.hexdigest()


def find_similar_rsc(similarity_scores: np.ndarray, threshold: float) -> pd.DataFrame:
    """
    Find all resource pairs whose similarity score lies in [threshold, 0.999).
    The upper bound excludes each resource's match with itself (and any exact duplicates).
    :param similarity_scores: matrix of pairwise similarity scores, of shape
    (number of resources, number of resources).
    :param threshold: minimum similarity score for a pair to be retrieved as similar.
    :return: a pd.DataFrame with columns 'resource_idx', 'similar_res_idx' and 'similarity_score',
    one row per similar pair.
    """
    similar_rsc_idx = np.where((similarity_scores >= threshold) & (similarity_scores < 0.999))
    similar_scores = np.round(similarity_scores[similar_rsc_idx], 3)
    sim_res = pd.DataFrame({'resource_idx': similar_rsc_idx[0],
                            'similar_res_idx': similar_rsc_idx[1],
                            'similarity_score': similar_scores})
    return sim_res


def get_bert_embeddings(texts):
    """
    Get sentence embeddings for a list of texts using the pre-trained model.
    :param texts: list of text strings.
    :return: torch.Tensor of embeddings, one row per text.
    """
    embeddings = model.encode(texts, convert_to_tensor=True)
    return embeddings

def get_similarity_bert(resources: pd.DataFrame) -> np.ndarray:
    """
    Compute pairwise cosine similarity for resources using BERT embeddings.
    :param resources: pd.DataFrame with the resources as rows and at least 'Text' as column.
    :return: symmetric np.array with cosine similarity score for each resource pair.
    """
    embeddings = get_bert_embeddings(resources['Text'].fillna('').tolist())
    sims = cosine_similarity(embeddings.cpu().numpy())
    return sims

def get_similar_rsc_bert(resources: pd.DataFrame, threshold: float = 0.75) -> dict:
    """
    Get similar resources per resource using BERT embeddings.
    :param resources: pd.DataFrame with the resources as rows and at least 'Text' as column.
    :param threshold: the similarity score threshold for retrieving as similar resource.
    :return: a dictionary with resources as keys and similar resources as values.
    """
    sims = get_similarity_bert(resources)
    find_sims = find_similar_rsc(sims, threshold)
    sim_df = find_sims.copy()
    sim_df.reset_index(inplace=True)
    sim_df['resource_id'] = resources['resource_id'].iloc[find_sims.resource_idx].values
    sim_df['similar_res'] = resources['resource_id'].iloc[find_sims.similar_res_idx].values
    sim_df['sim_resources'] = sim_df.apply(lambda x: [[x.similar_res, x.similarity_score]], axis=1)
    grouped_sim_res = sim_df[['resource_id', 'sim_resources']].groupby('resource_id').agg(lambda x: np.sum(x))
    similar_res_dict = grouped_sim_res.T.to_dict('records')[0]
    sim_res = {k: sorted(v, key=lambda x: x[1], reverse=True) for k, v in similar_res_dict.items()}
    return sim_res

def get_similar_bert(input_text: str, corpus: pd.DataFrame, threshold: float=0.75) -> list:
    """
    Retrieve the messages from a corpus that are similar enough to an input message, using BERT embeddings.
    :param input_text: query text.
    :param corpus: pd.DataFrame of historical messages with columns 'Text' and 'resource_id'.
    :param threshold: minimum similarity score for a message to be retrieved as similar.
    :return: a list of [message text, similarity score] pairs sorted by descending score,
    or the flat list [None, 0] if no corpus message passes the threshold.
    """
    input_id = to_md5(input_text)
    input_df = pd.DataFrame({'Text': [input_text], 'resource_id': [input_id]})
    data = pd.concat([input_df, corpus])
    sim_dict = get_similar_rsc_bert(data, threshold)
    result = list()
    if sim_dict.get(input_id):
        for sim_id, sim_score in sim_dict.get(input_id):
            result.append([corpus['Text'][corpus['resource_id'] == sim_id].values[0], sim_score])
    else:
        result = [None, 0]
    return result
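
One detail of `find_similar_rsc` may matter when improving the matcher: the `< 0.999` upper bound removes self-pairs, but it also removes exact duplicates. The sketch below reproduces that filter on a hypothetical 3x3 similarity matrix in which messages 0 and 1 are identical. The helper names `similar_pairs` and `similar_pairs_keep_duplicates` are assumptions for illustration, and whether dropping duplicates is intended is not stated in this notebook.

```python
import numpy as np


def similar_pairs(similarity_scores, threshold, upper=0.999):
    """Same filter as find_similar_rsc: keep pairs with threshold <= score < upper."""
    rows, cols = np.where((similarity_scores >= threshold) & (similarity_scores < upper))
    return [(int(r), int(c), float(similarity_scores[r, c])) for r, c in zip(rows, cols)]


def similar_pairs_keep_duplicates(similarity_scores, threshold):
    """Alternative (an assumption, not the notebook's method): mask only the diagonal."""
    off_diagonal = ~np.eye(similarity_scores.shape[0], dtype=bool)
    rows, cols = np.where((similarity_scores >= threshold) & off_diagonal)
    return [(int(r), int(c), float(similarity_scores[r, c])) for r, c in zip(rows, cols)]


# Hypothetical scores: messages 0 and 1 are exact duplicates of each other.
sims = np.array([
    [1.0, 1.0, 0.6],
    [1.0, 1.0, 0.6],
    [0.6, 0.6, 1.0],
])

pairs = similar_pairs(sims, threshold=0.5)
pairs_with_duplicates = similar_pairs_keep_duplicates(sims, threshold=0.5)

print(pairs)                  # the (0, 1) and (1, 0) duplicate pairs are missing
print(pairs_with_duplicates)  # duplicates are kept, self-pairs are still excluded
```

With the original filter, a query that appears verbatim in the corpus would not be returned as a match for itself.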


### 1. Preparing data

The full set of messages is split into a historical corpus and a held-out set of query messages. A query message is then fed into the matcher, which retrieves every corpus message that is similar to it.
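
A split like this changes on every run unless the sampling is seeded. As a minimal sketch, assuming a fixed seed is acceptable for the experiment, the helper below (a hypothetical `split_corpus_and_queries`, not part of this notebook) makes the corpus and query sets reproducible. The toy frame stands in for the real dataset.

```python
import pandas as pd


def split_corpus_and_queries(data, corpus_fraction=0.9, seed=0):
    """Sample a corpus with a fixed seed; rows whose id is not in the corpus become queries."""
    n_corpus = int(len(data) * corpus_fraction)
    corpus = data.sample(n=n_corpus, random_state=seed)
    queries = data[~data["resource_id"].isin(corpus["resource_id"])]
    return corpus, queries


# Toy data with the same two columns the notebook uses (values are made up).
toy = pd.DataFrame({
    "Text": [f"message {i}" for i in range(10)],
    "resource_id": [f"id{i}" for i in range(10)],
})

corpus, queries = split_corpus_and_queries(toy)
corpus_again, queries_again = split_corpus_and_queries(toy)

print(len(corpus), len(queries))
print(corpus.index.equals(corpus_again.index))  # same rows on every run with the same seed
```

Because the query set is selected by `resource_id`, any message whose text duplicates a corpus message is excluded from the queries.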

In [18]:
path = '../part1/dataset'
tags = os.listdir(path)
data_full = get_all_dfs(path, tags)[['Text']]
data_full['resource_id'] = data_full['Text'].apply(to_md5)

In [19]:
corpus = data_full.sample(int(data_full.shape[0] * 0.9))
test_data = data_full[~data_full.resource_id.isin(corpus.resource_id)]
print(corpus.shape)
corpus.tail()

(3467, 2)


| file | Text | resource_id |
|---|---|---|
| 38642 | \nRay Knight (rknight@stiatl.salestech.com) wr... | 497c82afc95d6122d07d7764fd28d965 |
| 103104 | \n\n\n\nA list of options that would be useful... | eb9388e731d1d975815dc15800ddd9a2 |
| 54535 | <C5JIF8.I4n@boi.hp.com> <1993Apr16.022926.272... | 4f296ea31612f60a5690b4a534358dd4 |
| 61144 | \t<1r6aqr$dnv@access.digex.net> <C5w5zJ.HHq@mu... | 473d7bbcb077575ae7466030ba7105ba |
| 38467 | \nsp1marse@lina (Marco Seirio) writes:\n\n>I h... | 7f39365554abfb58da8ae00af76ee4b6 |


### 2. Getting similar messages

In [20]:
query_text = test_data.iloc[46]['Text']
print(query_text)


Actually I wasn't too surprised, since I bought it with the rust.  Any of you 
got some ideas of getting rid of this CHEAPLY (key word)??  It has eaten all 
the
way through on the door panels.  Can I use Bondo?  
 
Also, is there a good paint that will bond to Aluminum rims?  The paint thati
was on my rims has peeled off, actually, there's some rust looking 'stuff' on
the rims themselves...  but it comes off pretty easily.  
 
One more thing...
Have any of you done self-painting to a car?  How do you start?  What do I need
to do this?
 
Please help me!
Jesse




In [21]:
similar_results = get_similar_bert(query_text, corpus, 0.2)
if similar_results[0]:
    print("Similar Messages:")
    for result in similar_results:
        print("-"*75)
        print(result[0])
        print(f"Similarity score: {result[1]}")
        print("-"*75)

Similar Messages:
---------------------------------------------------------------------------

I just had my 41 Chrysler painted. I was told to refrain from waxing it and
to leave it out in the sun!! Supposedly this let's the volatiles escape from
the paint over a month or so (I can smell it 15 feet away on a hot day) and
lets any slight irregularites in the surface flow out, as the paint remains
a little soft for a while.

Similarity score: 0.4189999997615814
---------------------------------------------------------------------------
---------------------------------------------------------------------------

Sayeth "Joseph D. Mazza" <mazz+@andrew.cmu.edu>:
$I waxed my car a few months ago with a liquid wax and now have whiteish
$smears where I inadvertantly got some wax on the black plastic molding. 
$I've tried repeatedly to remove the smears with no luck.  I'm on the
$verge of replacing the molding altogether (it's a nice car).

   Armor All removes Raindance wax on my Mazda Proteg