# Seedtag codetest: NLP Researcher

## Part 3. Message-matcher baseline model
This notebook contains a baseline message-matcher model. Given a query message and a corpus of historical messages, the matcher retrieves every historical message whose similarity to the query reaches a given threshold. Your goal is to improve this model.
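As a rough illustration of the idea, the sketch below vectorizes a toy corpus and a query with TF-IDF and keeps the messages whose cosine similarity to the query reaches a threshold. The texts, the names `toy_corpus`, `toy_query`, `toy_threshold`, and the threshold value are invented for this example; it uses default vectorizer settings rather than the baseline's `min_df` and `max_df`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and query (made up for illustration, not taken from the dataset).
toy_corpus = [
    "GM plans to build a car in the US for sale in Japan",
    "The Twins game will be broadcast on the radio tonight",
    "Toyota and GM discussed a car for sale in Japan",
]
toy_query = "GM will build a Toyota car for sale in Japan"

# Fit TF-IDF on the corpus and the query together, as the baseline does.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(toy_corpus + [toy_query])

# Cosine similarity between the query (last row) and every corpus message.
query_scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()

toy_threshold = 0.2  # illustrative value only
matches = [
    (text, round(float(score), 3))
    for text, score in zip(toy_corpus, query_scores)
    if score >= toy_threshold
]
```

Here `matches` holds the retrieved messages with their scores; the two car-related messages share far more vocabulary with the query than the baseball one does.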

In [1]:
import os
from hashlib import md5
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

### 0. Auxiliary Functions

In [34]:
def create_df(path, tag):
    '''
    Creates a data frame for a given class
    --------------------------------------
    Input:
        path (str): path where all class folders are stored.
        tag (str): name of the folder containing class "tag".
    Output:
        df (pd.DataFrame): dataframe with file as index and columns=[Text, tag]
    '''
    list_of_text = []
    tag_dir = os.path.join(path, tag)
    for file in os.listdir(tag_dir):
        with open(os.path.join(tag_dir, file), encoding="utf-8", errors="ignore") as f:
            list_of_text.append((f.read(), file))
    # Build the frame once, after the loop; this also covers an empty folder.
    df = pd.DataFrame(list_of_text, columns=['Text', 'file']).set_index('file')
    df['tag'] = tag
    return df


def get_all_dfs(path, tags):
    '''
    Loops over all classes in path, each stored in its own folder
    --------------------------------
    Input:
        path (str): path where all class folders are stored.
        tags (list): list of class names.
    Output:
        df (pd.DataFrame): pandas dataframe with the dataframes of all classes concatenated.
    '''
    list_of_dfs = []
    for tag in tags:

        df = create_df(path, tag)
        list_of_dfs.append(df)
    data = pd.concat(list_of_dfs)
    return data


def to_md5(rsc_id: str) -> str:
    """
    Convert rsc_id string into a hexdigest md5.
    :param rsc_id: str.
    :return: hexdigest representation of the md5 encoding of the input string.
    """
    md5_rsc = bytes(rsc_id, 'utf-8')
    result_1 = md5(md5_rsc)
    return result_1.hexdigest()


def get_similarity(resources: pd.DataFrame, space: str = 'tfidf', max_df: float = .75) -> np.ndarray:
    """
    Compute pairwise cosine similarity for resources in a given vector representation (tf or tfidf).
    :param resources: pd.DataFrame with the resources as rows and at least 'Text' as column.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :param max_df: maximum value for document frequency, as in sklearn Vectorizers.
    :return: symmetric np.ndarray with the cosine similarity score for each resource pair.
    """
    if space == 'tf':
        vec = CountVectorizer(min_df=2, max_df=max_df)
    elif space == 'tfidf':
        vec = TfidfVectorizer(min_df=2, max_df=max_df)
    else:
        print('The "space" input must be either "tf" or "tfidf", using the default "tfidf" option...')
        vec = TfidfVectorizer(min_df=2, max_df=max_df)
    vec_res = vec.fit_transform(resources['Text'].fillna(''))
    sims = cosine_similarity(vec_res, vec_res)
    return sims


def find_similar_rsc(similarity_scores: np.ndarray, threshold: float) -> pd.DataFrame:
    """
    Get a data frame of resource pairs whose similarity score reaches the threshold.
    Pairs scoring 0.999 or more are skipped, which excludes each resource's match with itself.
    :param similarity_scores: matrix of similarity scores per pair of resources, of shape
    (number of resources, number of resources).
    :param threshold: minimum similarity score for a resource to be retrieved as similar.
    :return: a pd.DataFrame with 'resource_idx', 'similar_res_idx' and 'similarity_score' as columns relating resources
    to a given resource.
    """
    similar_rsc_idx = np.where((similarity_scores >= threshold) & (similarity_scores < 0.999))
    similar_scores = np.round(similarity_scores[similar_rsc_idx], 3)
    sim_res = pd.DataFrame({'resource_idx': similar_rsc_idx[0],
                            'similar_res_idx': similar_rsc_idx[1],
                            'similarity_score': similar_scores})
    return sim_res


def get_similar_rsc(resources: pd.DataFrame, threshold: float = 0.75, space: str = 'tfidf') -> dict:
    """
    Get similar resources per resource.
    :param resources: pd.DataFrame with the resources as rows and at least 'Text' as column.
    :param threshold: the similarity score threshold for retrieving as similar resource.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :return: a dictionary with resources as keys and similar resources as values.
    """
    sims = get_similarity(resources, space)
    find_sims = find_similar_rsc(sims, threshold)
    # No pair reached the threshold: return an empty mapping instead of failing.
    if find_sims.empty:
        return {}
    ids = resources['resource_id'].to_numpy()
    similar_res_dict = {}
    # Collect [similar resource id, score] pairs under each resource id.
    for res_idx, sim_idx, score in find_sims.itertuples(index=False):
        similar_res_dict.setdefault(ids[res_idx], []).append([ids[sim_idx], float(score)])
    sim_res = {k: sorted(v, key=lambda x: x[1], reverse=True) for k, v in similar_res_dict.items()}
    return sim_res


def get_similar(input_text: str, corpus: pd.DataFrame, threshold: float = 0.75, space: str = 'tfidf') -> list:
    """
    Retrieves a set of messages from a given corpus that are similar enough to an input message.
    :param input_text: query text.
    :param corpus: pd.DataFrame with historical messages in column 'Text' and their ids in column 'resource_id'.
    :param threshold: minimum similarity score for a message to be retrieved as similar.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :return: a list of [message text, similarity score] pairs sorted by descending score,
    or [None, 0] if no message in the corpus is similar enough.
    """
    input_id = to_md5(input_text)
    input_df = pd.DataFrame({'Text': [input_text], 'resource_id': [input_id]})
    data = pd.concat([input_df, corpus])
    sim_dict = get_similar_rsc(data, threshold, space)
    result = list()
    if sim_dict.get(input_id):
        for sim_id, sim_score in sim_dict.get(input_id):
            result.append([corpus['Text'][corpus['resource_id'] == sim_id].values[0], sim_score])
    else:
        result = [None, 0]
    return result
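The thresholding step in `find_similar_rsc` can be illustrated on a small matrix. The sketch below applies the same mask (scores at or above the threshold and below 0.999) to a hypothetical 3x3 similarity matrix; the matrix values and the names `toy_sims`, `toy_threshold` and `toy_pairs` are invented for this example.

```python
import numpy as np

# Hypothetical 3x3 cosine-similarity matrix (values invented for illustration).
toy_sims = np.array([
    [1.00, 0.82, 0.10],
    [0.82, 1.00, 0.35],
    [0.10, 0.35, 1.00],
])
toy_threshold = 0.3

# Same mask as find_similar_rsc: at or above the threshold, below 0.999.
rows, cols = np.where((toy_sims >= toy_threshold) & (toy_sims < 0.999))

# Each entry is (resource index, similar resource index, score).
toy_pairs = [(int(r), int(c), float(toy_sims[r, c])) for r, c in zip(rows, cols)]
```

The diagonal (each resource compared with itself) is dropped by the 0.999 cap. One consequence worth keeping in mind is that a corpus message identical to the query also scores 1.0 and so appears to be filtered out by the same cap.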

### 1. Preparing data

The full set of messages is split into a historical corpus and a held-out set from which query messages are drawn. A query message is then fed into the message matcher, which retrieves all corpus messages similar to it.

In [50]:
path = '../part1/dataset'
tags = os.listdir(path)
data_full = get_all_dfs(path, tags)[['Text']]
data_full['resource_id'] = data_full['Text'].apply(to_md5)

In [51]:
corpus = data_full.sample(int(data_full.shape[0] * 0.9))
test_data = data_full[~data_full.resource_id.isin(corpus.resource_id)]
print(corpus.shape)
corpus.tail()

(3467, 2)


| file | Text | resource_id |
|---|---|---|
| 104470 | `\nDoes anyone know if the Twins games are broa...` | fe1134865403a852cf05667ff01ba39c |
| 104920 | `\nIn article <1993Apr16.163712.2466@VFL.Parama...` | f823fc9041e5a9f8404350fe1297e9b0 |
| 102943 | `\nIn article <C5r43y.F0D@mentor.cc.purdue.edu>...` | fde34261174f65e47d336f65f0430f77 |
| 60158 | `\nI am looking for a source of orbital element...` | a6a481222792b03153858de37329e4a9 |
| 61049 | `\n ETHER IMPLODES 2 EARTH CORE, IS GRAVITY!!!\...` | 09e2677ec822057eed319ab763298a21 |
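The split in the cell above is sampled without a seed, so the corpus and the held-out messages change between runs. A minimal sketch of a reproducible variant follows, assuming a toy frame in place of `data_full`; the `random_state` argument and the names `toy_full`, `toy_corpus` and `toy_test` are additions for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for data_full (contents invented for illustration).
toy_full = pd.DataFrame({
    "Text": [f"message {i}" for i in range(10)],
    "resource_id": [f"id{i}" for i in range(10)],
})

# random_state is an addition here; the notebook cell samples without a seed.
toy_corpus = toy_full.sample(frac=0.9, random_state=0)

# Hold out every message whose id is not in the corpus, as the notebook does.
toy_test = toy_full[~toy_full.resource_id.isin(toy_corpus.resource_id)]
```

Fixing the seed makes it possible to compare a modified matcher against the baseline on the same queries.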


### 2. Getting similar messages

In [63]:
query_text = test_data.iloc[42]['Text']
print(query_text)


This appeared today in the 

The Japan Economic Journal reported GM plans to build a Toyota-badged car
in the US for sale in Japan.  Bruce MacDonald, VP of GM Corporate
Communications, yesterday confirmed that GM President and CEO Jack Smith
had a meeting recently with Tatsuro Toyoda, President of Toyota.  
this meeting the two discussed business opportunities to increase GM
exports to Japan, including further component sales as well as completed
vehicle sales,
parts sales, the two presidents agreed conceptually to pursue an
arrangement whereby GM would build a Toyota-badged, right-hand drive
vehicle in the US for sale by Toyota in Japan.  A working group has been
formed to finalize model specifications, exact timing and other details.



In [67]:
similar_results = get_similar(query_text, corpus, 0.2)
if similar_results[0]:
    print("Similar Messages:")
    for result in similar_results:
        print("-"*75)
        print(result[0])
        print(f"Similarity score: {result[1]}")
        print("-"*75)

Similar Messages:
---------------------------------------------------------------------------



The Chevrolet brothers were respected racers & test drivers for the
Buick Co. when Durant was there.

When the directors kicked Durant out of GM in 1910 he took Chevrolet and
others with him.  As mentioned before, they founded the successful
Chevrolet company.

A little-known fact is that the Chevrolet Co. actually took over GM!
That was how Durant got back in charge of GM-- legally his new company
Chevrolet Co. did the buying, and GM was a division of Chevrolet!

After 1920 and into the Sloan era, GM shuffled things so that the GM
board was superior, but there was always a degree of autonomy given
the Chevy division, presumably because of the initial structure.
(If you look at the organization chart for GM in Sloan's book, Chevy
division reports directly to 14th floor, not through the "passenger
car division" which covers Buick, Olds, Cadillac, and Oakland/Pontiac)

-Jeff Hagen    (minor d