This notebook shows an example of how using multiple tag sets may improve matching performance for an FAQ. Key takeaways:
* It may be a good idea to give commonly used FAQs multiple tag sets, which adds OR logic to the matching
* We may also want to do this for large FAQs that answer many different types of questions
* We still need to choose tags thoughtfully and ensure they are distinct from the tags of other FAQs

Other ideas:
* Our tag testing tool can be improved by adding positive and negative examples and showing a score for how well the tags matched
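
One possible shape for such a score, as a rough sketch only: count the positive examples that rank inside a cut-off and the negative examples that rank outside it. The function name `tag_match_score`, the `top_k` cut-off, and the ranks below are all hypothetical and are not part of the existing tool.

```python
def tag_match_score(positive_ranks, negative_ranks, top_k=5):
    """Fraction of examples that fall on the right side of the top_k cut-off.

    Positive examples should rank within top_k; negative examples should not.
    """
    hits = sum(r <= top_k for r in positive_ranks)
    correct_rejections = sum(r > top_k for r in negative_ranks)
    total = len(positive_ranks) + len(negative_ranks)
    return (hits + correct_rejections) / total


# Made-up ranks of one FAQ for four positive and four negative questions.
example_score = tag_match_score([1, 3, 2, 7], [12, 4, 30, 15])
print(example_score)  # → 0.75
```

A score of 1.0 would mean every positive example lands in the top `top_k` and every negative example lands outside it.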

## Example data

Here are some questions that people may ask that relate to the FAQ under test.

In [None]:
from tests.utils import load_wv_pretrained_bin
from faqt import KeyedVectorsScorer as KVS
from faqt.preprocessing import preprocess_text_for_word_embedding
import pandas as pd
import ast
import numpy as np

### Load the data

In [None]:
w2v_model = load_wv_pretrained_bin(
        "pretrained_wv_models", "GoogleNews-vectors-negative300-prenorm.bin"
)

In [None]:
model = KVS(w2v_model)

In [None]:
pd.set_option('display.max_colwidth', 0)

In [None]:
df = pd.read_csv("/Users/sidravi/Downloads/YAL FAQs - Sheet1.csv")
all_tags = [ast.literal_eval(x) for x in df.Tags.values]
df.shape

In [None]:
ID_TO_MATCH = 8 #17

In [None]:
df.loc[[ID_TO_MATCH], :]

In [None]:
# positive samples
q1 = "i have HIV and am afraid and dont know what to do"
q2 = "what support can i get for coping with hiv"
q3 = "i am so mad about hiv. why did this happen to me"
q4 = "i am feeling angry about my hiv diagnosis"

In [None]:
# negative samples
q5 = "I am embarassed about getting tested"
q6 = "can i smoke if I have hiv"
q7 = "i am sad all the time" 
q8 = "my partner has hiv what do i do"

In [None]:
all_qs = [q1, q2, q3, q4, q5, q6, q7, q8]

### Support function

In [None]:
def get_scores(id_to_update, new_tags, tagset, model):
    # Optionally replace the tags of the FAQ under test
    if new_tags is not None:
        tagset[id_to_update] = new_tags
    model = model.set_tags(tagset)
    outcomes = []
    for q in all_qs:  # all_qs is the notebook-level list of test questions
        tokens = preprocess_text_for_word_embedding(q, {}, 10)
        outcomes.append(model.score(tokens))
    # One score per FAQ: the mean of the values in its score dict
    scores = []
    for o in outcomes:
        scores.append([np.mean(list(x.values())) for x in o[0]])

    # Rank 1 = best match; argsort is ascending, so count back from the end
    rank_of_id = [len(s) - np.where(np.argsort(s) == id_to_update)[0][0] for s in scores]

    return rank_of_id, scores, outcomes

In [None]:
def get_top_n(df, n, scores):
    top_n_dict = {}
    for i, score in enumerate(scores):
        # Indices of the n highest scores, best first (no hard-coded FAQ count)
        top_n_idx = np.argsort(score)[::-1][:n]
        top_n_dict[i] = df.loc[top_n_idx, :]
    return top_n_dict

### Current scores

In [None]:
ranks, scores, outcomes = get_scores(ID_TO_MATCH, None, all_tags.copy(), model)

In [None]:
ranks

The first problem is that the tags are too narrow, so let's improve them by adding tokens from the questions.

### With better tags

"You are not the only person who tells me that you feel overwhelmed by your feelings about your status. In fact lots of people with HIV feel shocked or scared when they find out they have HIV. Some people might feel ashamed. But remember none of this is your fault - you are still the same you! But it does take a little while to get used to. Try to do the same things you usually do - activities, sport, seeing friends. If there is someone you trust, try to open up to them about your feelings. And ask your healthcare worker about a support group near you. If your feelings are very negative and you aren't coping please 🙏 reach out for support at lovelife. They have a free youth line that you can call with any questions (0800 121 900). They even have a ‘call me’ service, where they will call you back if you send them a message."

In [None]:
new_tags = ['dealing', 'HIV', 'fearful', 'help', 'angry']

In [None]:
new_tags = ['dealing', 'HIV', 'fearful', 'help', 'angry']
ranks, scores, outcomes = get_scores(ID_TO_MATCH, new_tags, all_tags.copy(), model)

In [None]:
ranks

In [None]:
new_tags = ['HIV', 'fearful', 'help', 'angry']

In [None]:
ranks, scores, outcomes = get_scores(ID_TO_MATCH, new_tags, all_tags.copy(), model)
ranks

Note that removing the word 'dealing' from the tag list hurts performance on the second question but **improves** it on the others.

This makes sense since an FAQ's score is the average over all of its tags: a tag that matches a question poorly drags that average down.
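
A toy illustration of the averaging effect is sketched below. The per-tag similarity values are made up for illustration and are not output from the word2vec model; `faq_score` is a hypothetical helper that mirrors the `np.mean` step in `get_scores`.

```python
from statistics import mean

# Hypothetical per-tag similarities for two questions (illustrative values only).
q_angry = {"dealing": 0.10, "HIV": 0.80, "fearful": 0.60, "help": 0.30, "angry": 0.20}
q_coping = {"dealing": 0.90, "HIV": 0.80, "fearful": 0.20, "help": 0.60, "angry": 0.10}


def faq_score(tag_scores, tags):
    """Average the per-tag scores over the FAQ's tags."""
    return mean(tag_scores[t] for t in tags)


all_five = ["dealing", "HIV", "fearful", "help", "angry"]
without_dealing = ["HIV", "fearful", "help", "angry"]

# 'dealing' matches q_angry poorly, so dropping it raises that question's average...
angry_before, angry_after = faq_score(q_angry, all_five), faq_score(q_angry, without_dealing)
# ...but it matches q_coping well, so dropping it lowers that one.
coping_before, coping_after = faq_score(q_coping, all_five), faq_score(q_coping, without_dealing)

print(round(angry_before, 3), round(angry_after, 3))
print(round(coping_before, 3), round(coping_after, 3))
```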

## With different tag sets

To simulate this with the existing code, we'll just add the extra tag sets as new entries and return the best rank among them.
For example, if one FAQ has 3 tag sets that rank [3, 12, 14], we return 3, the best of the three ranks.
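
The "best rank among tag sets" rule can be sketched on its own, independent of the model. This is a minimal illustration with made-up scores; `rank_of` and `best_rank` are hypothetical helpers, not functions from `faqt`.

```python
def rank_of(index, scores):
    """1-based rank of scores[index], where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(index) + 1


def best_rank(tagset_indices, scores):
    """Best (lowest) rank achieved by any of an FAQ's tag sets: the OR logic."""
    return min(rank_of(i, scores) for i in tagset_indices)


# Made-up scores for six tag sets; entries 1, 4 and 5 belong to the same FAQ.
toy_scores = [0.42, 0.35, 0.50, 0.61, 0.58, 0.12]
print(best_rank([1, 4, 5], toy_scores))  # → 2
```

Here the three tag sets rank 5th, 2nd and 6th, so the FAQ is credited with rank 2.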

In [None]:
def get_scores_multitagset(id_to_update, new_tagsets, tagset, model):
    # Indices of every tag set that belongs to the FAQ under test
    candidate_ids = [id_to_update]
    if new_tagsets:
        # The first new tag set replaces the FAQ's existing tags ...
        tagset[id_to_update] = new_tagsets[0]
        # ... and the rest are appended as extra entries at the end
        for t in new_tagsets[1:]:
            candidate_ids.append(len(tagset))
            tagset.append(t)

    model = model.set_tags(tagset)

    outcomes = []
    for q in all_qs:
        tokens = preprocess_text_for_word_embedding(q, {}, 10)
        outcomes.append(model.score(tokens))

    scores = []
    for o in outcomes:
        scores.append([np.mean(list(x.values())) for x in o[0]])

    number_of_faqs = len(scores[0])

    best_rank_list = []
    best_rank_idx = []
    for message_scores in scores:
        order = np.argsort(message_scores)
        # Rank 1 is the highest-scoring entry; keep the best rank among the candidates
        ranks = [number_of_faqs - np.where(order == i)[0][0] for i in candidate_ids]
        best = int(np.argmin(ranks))
        best_rank_list.append(ranks[best])
        best_rank_idx.append(candidate_ids[best])

    return np.array(best_rank_list).squeeze(), best_rank_idx, scores, outcomes

In [None]:
all_qs

In [None]:
new_tagsets = [['dealing', 'HIV'], 
               ['fearful', 'HIV'],
               ['help', 'HIV'],
               ['angry', 'HIV']]

In [None]:
ranks, idx, scores, outcomes = get_scores_multitagset(ID_TO_MATCH, new_tagsets.copy(), all_tags.copy(), model)
ranks, idx

In [None]:
ranks

The positive examples are now in the top 5, and the negative examples have ranks of 10 or worse.

Note that their ranks improved as well, since there are more tag sets to match here. I suspect performance will go down once other FAQs can have multiple tag sets too, but that is something we'll have to implement and try out.