## Part 2. Message-matcher baseline model
This notebook contains a baseline message-matcher model. Given a query text message and a corpus of historical messages, the matcher retrieves every historical message whose similarity to the query reaches a given threshold. Your goal is to improve this model.

In [1]:
import os
from hashlib import md5
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

### 0. Auxiliary Functions

In [2]:
def create_df(path, tag):
    '''
    Creates a data frame for a given class
    --------------------------------------
    Input:
        path (str): path where all class folders are stored.
        tag (str): name of the folder containing class "tag".
    Output:
        df (pd.DataFrame): dataframe with file as index and columns=[Text, tag]
    '''
    list_of_text = []
    tag_dir = os.path.join(path, tag)
    for file in os.listdir(tag_dir):
        # Undecodable bytes are dropped instead of raising an error.
        with open(os.path.join(tag_dir, file), encoding="utf-8", errors="ignore") as f:
            list_of_text.append((f.read(), file))
    # Build the frame once, after all files are read (also works for an empty folder).
    df = pd.DataFrame(list_of_text, columns=['Text', 'file']).set_index('file')
    df['tag'] = tag
    return df


def get_all_dfs(path, tags):
    '''
    Loops over all classes in path, each stored in its own folder
    --------------------------------
    Input:
        path (str): path where all class folders are stored.
        tags (list): list of class names.
    Output:
        data (pd.DataFrame): concatenation of the per-class dataframes.
    '''
    list_of_dfs = []
    for tag in tags:

        df = create_df(path, tag)
        list_of_dfs.append(df)
    data = pd.concat(list_of_dfs)
    return data


def to_md5(rsc_id: str) -> str:
    """
    Convert an rsc_id string into its MD5 hex digest.
    :param rsc_id: str.
    :return: hexdigest representation of the MD5 hash of the input string.
    """
    md5_rsc = bytes(rsc_id, 'utf-8')
    result_1 = md5(md5_rsc)
    return result_1.hexdigest()


def get_similarity(resources: pd.DataFrame, space: str = 'tfidf', max_df: float = .75) -> np.ndarray:
    """
    Compute pairwise cosine similarity for resources in a given vector representation (tf or tfidf).
    :param resources: pd.DataFrame with the resources as rows and at least 'Text' as column.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :param max_df: maximum value for document frequency, as in the sklearn vectorizers.
    :return: symmetric np.ndarray with the cosine similarity score for each resource pair.
    """
    if space == 'tf':
        vec = CountVectorizer(min_df=2, max_df=max_df)
    elif space == 'tfidf':
        vec = TfidfVectorizer(min_df=2, max_df=max_df)
    else:
        print('The "space" input must be either "tf" or "tfidf", using the default "tfidf" option...')
        vec = TfidfVectorizer(min_df=2, max_df=max_df)
    vec_res = vec.fit_transform(resources['Text'].fillna(''))
    sims = cosine_similarity(vec_res, vec_res)
    return sims


def find_similar_rsc(similarity_scores: np.ndarray, threshold: float) -> pd.DataFrame:
    """
    Get a table of resource pairs whose similarity score reaches the threshold.
    Pairs scoring 0.999 or above are excluded, which drops each resource's match with itself
    (and any exact duplicates).
    :param similarity_scores: matrix of similarity scores per pair of resources, of shape
    (number of resources, number of resources).
    :param threshold: the minimum similarity score for a pair to be retrieved as similar.
    :return: a pd.DataFrame with 'resource_idx', 'similar_res_idx' and 'similarity_score' as columns,
    one row per similar pair.
    """
    similar_rsc_idx = np.where((similarity_scores >= threshold) & (similarity_scores < 0.999))
    similar_scores = np.round(similarity_scores[similar_rsc_idx], 3)
    sim_res = pd.DataFrame({'resource_idx': similar_rsc_idx[0],
                            'similar_res_idx': similar_rsc_idx[1],
                            'similarity_score': similar_scores})
    return sim_res


def get_similar_rsc(resources: pd.DataFrame, threshold: float = 0.75, space: str = 'tfidf') -> dict:
    """
    Get similar resources per resource.
    :param resources: pd.DataFrame with the resources as rows and at least 'Text' as column.
    :param threshold: the similarity score threshold for retrieving as similar resource.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :return: a dictionary with resources as keys and similar resources as values.
    """
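    sims = get_similarity(resources, space)
    find_sims = find_similar_rsc(sims, threshold)
    # No pair reaches the threshold: return early, since the steps below assume at least one row.
    if find_sims.empty:
        return {}
    sim_df = find_sims.copy()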
    sim_df.reset_index(inplace=True)
    sim_df['resource_id'] = resources['resource_id'].iloc[find_sims.resource_idx].values
    sim_df['similar_res'] = resources['resource_id'].iloc[find_sims.similar_res_idx].values
    sim_df['sim_resources'] = sim_df.apply(lambda x: [[x.similar_res, x.similarity_score]], axis=1)
    grouped_sim_res = sim_df[['resource_id', 'sim_resources']].groupby('resource_id').agg(lambda x: np.sum(x))
    similar_res_dict = grouped_sim_res.T.to_dict('records')[0]
    sim_res = {k: sorted(v, key=lambda x: x[1], reverse=True) for k, v in similar_res_dict.items()}
    return sim_res


def get_similar(input_text: str, corpus: pd.DataFrame, threshold: float=0.75, space: str = 'tfidf') -> list:
    """
    Retrieves the messages from a given corpus that are similar enough to an input message.
    :param input_text: query text.
    :param corpus: pd.DataFrame with historical messages in column 'Text' and their ids in 'resource_id'.
    :param threshold: the minimum similarity score for a message to be retrieved as similar.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :return: a list of [message text, score] pairs sorted by descending score,
    or [None, 0] when no corpus message reaches the threshold.
    """
    input_id = to_md5(input_text)
    input_df = pd.DataFrame({'Text': [input_text], 'resource_id': [input_id]})
    data = pd.concat([input_df, corpus])
    sim_dict = get_similar_rsc(data, threshold, space)
    result = list()
    if sim_dict.get(input_id):
        for sim_id, sim_score in sim_dict.get(input_id):
            result.append([corpus['Text'][corpus['resource_id'] == sim_id].values[0], sim_score])
    else:
        result = [None, 0]
    return result
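
The core of the helpers above (vectorize, compute cosine similarity, keep pairs within a score band) can be sketched on a toy example. This is only an illustrative sketch: the texts, the `toy_` names, and the threshold of 0.15 are made up here, and `min_df=1` is used because the toy corpus is too small for the `min_df=2` setting used in `get_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy texts (made up for illustration). The query sits in row 0, mirroring how
# get_similar stacks the input message on top of the corpus before vectorizing.
toy_texts = [
    "power antenna for my car",              # query
    "replacement power antenna for a car",
    "the hockey game last night",
    "car antenna made of plastic tubing",
]

# min_df=1 is an assumption for this tiny corpus; the notebook uses min_df=2.
toy_vec = TfidfVectorizer(min_df=1)
toy_tfidf = toy_vec.fit_transform(toy_texts)
toy_sims = cosine_similarity(toy_tfidf, toy_tfidf)

toy_threshold = 0.15
query_scores = toy_sims[0]

# Same band as find_similar_rsc: at or above the threshold, below 0.999
# (the upper bound removes the query's match with itself).
mask = (query_scores >= toy_threshold) & (query_scores < 0.999)
matches = sorted(
    ((toy_texts[i], round(float(query_scores[i]), 3)) for i in np.where(mask)[0]),
    key=lambda pair: pair[1],
    reverse=True,
)

for text, score in matches:
    print(f"{score:.3f}  {text}")
```

Because the vectorizer is fitted on the query together with the corpus, the query's own words influence the document-frequency statistics, which is also how the baseline behaves.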

### 1. Preparing data

The full message set is split into a historical corpus (a random 90% sample) and a held-out set of test messages. A query message is then taken from the held-out set and fed to the message matcher, which retrieves every corpus message that is similar enough to the query.
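
The corpus/query split can be sketched on a toy frame as follows. The data and the `toy_` names are invented for illustration, and the fixed `random_state` is an added assumption (the notebook's own split is unseeded, so it changes between runs).

```python
import pandas as pd

# Toy stand-in for data_full (texts and ids are made up for illustration).
toy_full = pd.DataFrame({
    "Text": [f"message {i}" for i in range(10)],
    "resource_id": [f"id{i}" for i in range(10)],
})

# 90% of the rows become the historical corpus. random_state is an assumption
# added here so the same split is drawn on every run.
toy_corpus = toy_full.sample(frac=0.9, random_state=0)

# Every message whose id is not in the corpus is held out as a candidate query.
toy_queries = toy_full[~toy_full["resource_id"].isin(toy_corpus["resource_id"])]

print(len(toy_corpus), len(toy_queries))
```

Filtering by `resource_id` rather than by index means that a held-out message with the same text (and therefore the same MD5 id) as a corpus message is also excluded from the query set.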

In [3]:
path = '../part1/dataset'
tags = os.listdir(path)
data_full = get_all_dfs(path, tags)[['Text']]
data_full['resource_id'] = data_full['Text'].apply(to_md5)

In [4]:
corpus = data_full.sample(int(data_full.shape[0] * 0.9))
test_data = data_full[~data_full.resource_id.isin(corpus.resource_id)]
print(corpus.shape)
corpus.tail()

(3467, 2)


file,Text,resource_id
61164,\nIn article <C5t05K.DB6@research.canon.oz.au>...,a5cd714b0c79776cbf25ce0b9faeef47
102597,\nIn article <12718@news.duke.edu> fierkelab@b...,39af238a890c32b5cbcd8f4c05f84e19
176853,Article-I.D.: shelley.1pqi26INNl8j\n\ndreitman...,d0e662fc665382bc408cc2deda86c9d0
104794,"\n\n\nI hate to be rude, but screw the seating...",5d43851150305698729229bdff0e360b
54440,\nFrom article <1993Apr18.001319.2340@gnv.ifas...,36b1552e3d8ce7218f27bc59c650461e


In [10]:
corpus.tail(2).Text.iloc[1]

'\nFrom article <1993Apr18.001319.2340@gnv.ifas.ufl.edu>, by jrm@gnv.ifas.ufl.edu:\n> Yea, there are millions of cases where yoy *say* that firearms\n> \'deter\' criminals. Alas, this is not provable. I think that that\n> there are actually *few* cases where this is so. \n\nIt certainly is provable.  Around a million Americans every year defend\nthemselves with firearms.  In many of these cases the defender doesn\'t even\nhave to fire a shot!  The mere presence of a gun is oftentimes all the\ndeterrent that is needed.\n\nI don\'t like violence anymore than anyone else does.  But, taking away the\nright of Americans to keep and bear arms is not the solution to the violent\ncrime problem in this country.  If honest, law-abiding citizens are unable\nto get firearms then they will be preyed on even more by criminals who will\nbe able to acquire guns through illegal channels.  Expect to start seeing\nthe crime syndicates who smuggle drugs into this country start smuggling\nguns.  Believe me

### 2. Getting similar messages

In [5]:
query_text = test_data.iloc[42]['Text']
print(query_text)



Since this posting, I've received no replies or followups, so I'm posting
here hoping for the feedback I didn't get in rec.audio.car:

article number - 9855

I recently saw a particular third party antenna on a new Camry (not mine,
but it caught my interest) and a new 626.  It seems to replace the
factory power antenna and is about a foot long made of plastic tubing.  I
have seen them on quite a few cars, but I can't find anything more about
them in previous r.a.c articles nor in r.a articles.

I'd like to know all I can, so any feedback is greatly appreciated.

------------------------------------------------------------------
"Mom, we're hungry!" - Bud Bundy        "Why tell me?" - Peg Bundy

Vincent Lai

vinlai@cbnewsb.att.com forwards mail to
vlai@attmail.com which eventually winds up in
wcmnja!lai@somerset.att.com
------------------------------------------------------------------



In [6]:
similar_results = get_similar(query_text, corpus, 0.2)
if similar_results[0]:
    print("Similar Messages:")
    for result in similar_results:
        print("-"*75)
        print(result[0])
        print(f"Similarity score: {result[1]}")
        print("-"*75)
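
One possible direction for improving the baseline, offered only as a sketch: `get_similar` refits the vectorizer and computes the full corpus-by-corpus similarity matrix for every query, although only the query's row is used. The hypothetical helper `match_query` below fits the vectorizer once on the corpus and scores a query against it directly. The helper, the toy texts, and `min_df=1` are assumptions, not part of the original notebook, and the scores will differ from the baseline's because the query no longer contributes to the vocabulary or the IDF weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def match_query(query, corpus_texts, vectorizer, corpus_matrix, threshold=0.2):
    """Hypothetical helper: score one query against a pre-vectorized corpus."""
    query_vec = vectorizer.transform([query])  # reuses the fitted vocabulary
    scores = cosine_similarity(query_vec, corpus_matrix).ravel()
    # Same score band as the baseline: [threshold, 0.999).
    keep = np.where((scores >= threshold) & (scores < 0.999))[0]
    ranked = keep[np.argsort(-scores[keep])]  # highest score first
    return [[corpus_texts[i], round(float(scores[i]), 3)] for i in ranked]


# Toy corpus (made up for illustration); min_df=1 because it is tiny.
toy_corpus_texts = [
    "replacement power antenna for a car",
    "the hockey game last night",
    "car antenna made of plastic tubing",
]
toy_vectorizer = TfidfVectorizer(min_df=1)
toy_matrix = toy_vectorizer.fit_transform(toy_corpus_texts)  # fitted once

toy_results = match_query(
    "power antenna for my car", toy_corpus_texts, toy_vectorizer, toy_matrix
)
for text, score in toy_results:
    print(f"{score:.3f}  {text}")
```

With this layout the cost per query is one sparse matrix-vector product instead of a refit plus an all-pairs similarity computation. Unlike `get_similar`, the helper returns an empty list when nothing reaches the threshold.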