# NLP and the Web - WS 21/22 Home Exercise 6 

## Regarding type hints:

We tried to make the description of the parameters as clear as possible. However, if you believe that something is missing, please reach out to us in Moodle and we will try to help you out. 
We provide type hints for function parameters and return values of functions that you have to implement in the tasks. These are suggestions only, and you may use different types if you prefer. As long as you produce the required output in a coherent and understandable way, you can get full points. 
We use the term 'array-like object' to loosely refer to collection types like lists, arrays, maps, dataframes, etc.

## Regarding documentation:

Please use comments where appropriate to help tutors understand your code. This is especially important for the more extensive exercises later on. We also strongly encourage you to use type hints.

## Regarding output of results:

Please pay attention to output results (e.g. with `print()` or `display()`) when we ask you to in a task. It is your choice how you output results, but for dataframes we recommend the use of `display(df)`.


So far, you have learned how to vectorize text, how to train classifiers and how to do IR. In this exercise we look at a more practical use case: Community Question Answering (cQA). 

The web is full of rich resources where humans can post questions and answers on specific topics (e.g. https://stackexchange.com/, https://quora.com or https://answers.yahoo.com/). In many cases the information need of a user is not entirely new, but a similar question has already been asked and answered in some form.


The data is a small sample of SemEval-2015 Task 3 and comes from the Qatar Living forum. This subset comes as a tab-separated (`\t`) file and includes multiple columns:
* **qid** is the unique ID for each question
* **cid** is the unique ID for each comment
* **question_category** is the category of the question (such as "Beauty and Style")
* **question_subject** is the subject associated with a question
* **question** is the textual question
* **comment** is a comment for this question
* **comment_gold** is the label, whether this comment is a "good" or "bad" answer to the question

Each comment is represented as a single row; each question can come with multiple comments.
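To make the row layout concrete, here is a minimal sketch that reads a tiny in-memory TSV with the same columns and counts the comments per question. The three sample rows and the variable names are made up for illustration; only the column names come from the description above.

```python
import io

import pandas as pd

# Hypothetical miniature sample; the real files contain many more rows.
sample_tsv = (
    "qid\tcid\tquestion_category\tquestion_subject\tquestion\tcomment\tcomment_gold\n"
    "Q1\tQ1_C1\tBeauty and Style\tMassage oil\tWhere can I buy oil?\tNo idea.\tBad\n"
    "Q1\tQ1_C2\tBeauty and Style\tMassage oil\tWhere can I buy oil?\tTry the pharmacy.\tGood\n"
    "Q2\tQ2_C1\tDoha Shopping\tWatch\tWho repairs watches?\tGo to the mall.\tGood\n"
)

sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# One row per comment, so a question is repeated once per comment it received.
comments_per_question = sample_df.groupby("qid")["cid"].count()
print(comments_per_question)
```

Grouping by `qid` is also a convenient way to check how unevenly the comments are spread over the questions.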

In [5]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.1.0/en_core_web_md-3.1.0-py3-none-any.whl (45.4 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [2]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import spacy

#download the language model if you haven't already (you may have to restart your Python kernel)
#spacy.cli.download("en_core_web_md")

nlp = spacy.load('en_core_web_md')

## Task 1: Data Preprocessing - 5 Points

**a)** Load the train and dev set. Lowercase the fields `question_category`, `question_subject`, `question` and `comment` of both data splits. After lowercasing, use `display` to output the `head()` of the training split and development split.


In [5]:
train_data = pd.read_csv('500_comments_good_bad_train.tsv', sep='\t')
dev_data = pd.read_csv('100_comments_good_bad_dev.tsv', sep='\t')

lowerlist = ['question_category', 'question_subject', 'question', 'comment']

# Lowercase each text column in one vectorized step. Assigning the whole column
# avoids the chained-assignment pitfall of `df[col].iloc[i] = ...`.
for name in lowerlist:
    train_data[name] = train_data[name].str.lower()
    dev_data[name] = dev_data[name].str.lower()
display(train_data.head())
display(dev_data.head())

Unnamed: 0,qid,cid,question_category,question_subject,question,comment,comment_gold
0,Q1,Q1_C1,beauty and style,massage oil,where i can buy good oil for massage?,i've done it once at the sharq village & spa ....,Bad
1,Q1,Q1_C4,beauty and style,massage oil,where i can buy good oil for massage?,you might be able to find body massage oil in ...,Good
2,Q1,Q1_C5,beauty and style,massage oil,where i can buy good oil for massage?,hi. according to your body nature in your pict...,Bad
3,Q1,Q1_C6,beauty and style,massage oil,where i can buy good oil for massage?,"oh sorry. iam stupied also, your ?where? my an...",Bad
4,Q17,Q17_C1,doha shopping,rolex watch,i have a rolex oyster perpetual watch day-date...,go to 51 east (either in city centre or in al ...,Good


Unnamed: 0,qid,cid,question_category,question_subject,question,comment,comment_gold
0,Q16,Q16_C1,qatar living lounge,bye bye time.. almost,"it's 4:30 pm,.. almost time for me to go home....",lol md.....,Bad
1,Q22,Q22_C1,moving to qatar,where can i buy globe roam sim here in qatar?,can anyone tell me where can i buy globe roami...,"vivo bonito, did you just cut and paste that f...",Bad
2,Q22,Q22_C2,moving to qatar,where can i buy globe roam sim here in qatar?,can anyone tell me where can i buy globe roami...,nkotb... i am not working to any either of the...,Bad
3,Q22,Q22_C3,moving to qatar,where can i buy globe roam sim here in qatar?,can anyone tell me where can i buy globe roami...,"vb, i didnt say you are working with smart or ...",Bad
4,Q22,Q22_C9,moving to qatar,where can i buy globe roam sim here in qatar?,can anyone tell me where can i buy globe roami...,you can go to filipino souq you cab get it for...,Good


**b)** Create four new columns in the training and development splits: `category_vector`, `subject_vectors`,  `question_vectors`, `comment_vectors`. Use spaCy to convert each token of the fields `question_category`, `question_subject`, `question` and `comment` into a dense word vector (`token.vector`) and store these word vectors in the new columns. This may take a few minutes. 

In [6]:
def tokens_to_vectors(texts):
    # One list of token vectors per text; nlp.pipe processes the texts in batches
    return [[token.vector for token in doc] for doc in nlp.pipe(texts)]

vector_name = ['category_vector', 'subject_vectors', 'question_vectors', 'comment_vectors']

# lowerlist and vector_name are aligned: each text column gets one vector column
for source, target in zip(lowerlist, vector_name):
    train_data[target] = tokens_to_vectors(train_data[source])
    dev_data[target] = tokens_to_vectors(dev_data[source])


display(train_data.head())
display(dev_data.head()) 


Unnamed: 0,qid,cid,question_category,question_subject,question,comment,comment_gold,category_vector,subject_vectors,question_vectors,comment_vectors
0,Q1,Q1_C1,beauty and style,massage oil,where i can buy good oil for massage?,i've done it once at the sharq village & spa ....,Bad,"[[0.27805, 0.41261, 0.27956, 0.022429, 0.03270...","[[0.38007, -0.51364, -0.010795, 0.013833, 0.53...","[[0.44746, 0.18366, -0.33118, -0.22959, 0.4543...","[[0.18733, 0.40595, -0.51174, -0.55482, 0.0397..."
1,Q1,Q1_C4,beauty and style,massage oil,where i can buy good oil for massage?,you might be able to find body massage oil in ...,Good,"[[0.27805, 0.41261, 0.27956, 0.022429, 0.03270...","[[0.38007, -0.51364, -0.010795, 0.013833, 0.53...","[[0.44746, 0.18366, -0.33118, -0.22959, 0.4543...","[[-0.11076, 0.30786, -0.5198, 0.035138, 0.1036..."
2,Q1,Q1_C5,beauty and style,massage oil,where i can buy good oil for massage?,hi. according to your body nature in your pict...,Bad,"[[0.27805, 0.41261, 0.27956, 0.022429, 0.03270...","[[0.38007, -0.51364, -0.010795, 0.013833, 0.53...","[[0.44746, 0.18366, -0.33118, -0.22959, 0.4543...","[[0.028796, 0.41306, -0.4669, -0.078175, 0.370..."
3,Q1,Q1_C6,beauty and style,massage oil,where i can buy good oil for massage?,"oh sorry. iam stupied also, your ?where? my an...",Bad,"[[0.27805, 0.41261, 0.27956, 0.022429, 0.03270...","[[0.38007, -0.51364, -0.010795, 0.013833, 0.53...","[[0.44746, 0.18366, -0.33118, -0.22959, 0.4543...","[[-0.44526, 0.26707, -0.64482, -0.22993, 0.142..."
4,Q17,Q17_C1,doha shopping,rolex watch,i have a rolex oyster perpetual watch day-date...,go to 51 east (either in city centre or in al ...,Good,"[[0.34761, 0.033417, 0.33893, 0.13354, 0.63203...","[[-0.18472, -0.046026, 0.12382, 0.067505, 0.03...","[[0.18733, 0.40595, -0.51174, -0.55482, 0.0397...","[[0.13893, -0.019056, -0.33891, 0.12151, 0.365..."


Unnamed: 0,qid,cid,question_category,question_subject,question,comment,comment_gold,category_vector,subject_vectors,question_vectors,comment_vectors
0,Q16,Q16_C1,qatar living lounge,bye bye time.. almost,"it's 4:30 pm,.. almost time for me to go home....",lol md.....,Bad,"[[0.34761, 0.033417, 0.33893, 0.13354, 0.63203...","[[0.082178, -0.2513, -0.7495, 0.19097, 0.01664...","[[0.0013629, 0.35653, -0.055497, -0.16607, 0.0...","[[-0.44817, 0.25119, -0.43844, -0.68699, -0.10..."
1,Q22,Q22_C1,moving to qatar,where can i buy globe roam sim here in qatar?,can anyone tell me where can i buy globe roami...,"vivo bonito, did you just cut and paste that f...",Bad,"[[0.25306, 0.044859, -0.27407, -0.02257, 0.755...","[[0.44746, 0.18366, -0.33118, -0.22959, 0.4543...","[[-0.23857, 0.35457, -0.30219, 0.089559, 0.082...","[[-0.87226, 0.38091, -0.62451, 0.70442, -0.421..."
2,Q22,Q22_C2,moving to qatar,where can i buy globe roam sim here in qatar?,can anyone tell me where can i buy globe roami...,nkotb... i am not working to any either of the...,Bad,"[[0.25306, 0.044859, -0.27407, -0.02257, 0.755...","[[0.44746, 0.18366, -0.33118, -0.22959, 0.4543...","[[-0.23857, 0.35457, -0.30219, 0.089559, 0.082...","[[-0.075288, 0.30298, -0.16125, -0.081987, 0.2..."
3,Q22,Q22_C3,moving to qatar,where can i buy globe roam sim here in qatar?,can anyone tell me where can i buy globe roami...,"vb, i didnt say you are working with smart or ...",Bad,"[[0.25306, 0.044859, -0.27407, -0.02257, 0.755...","[[0.44746, 0.18366, -0.33118, -0.22959, 0.4543...","[[-0.23857, 0.35457, -0.30219, 0.089559, 0.082...","[[0.23403, -0.41307, 0.15722, 0.33713, 0.48818..."
4,Q22,Q22_C9,moving to qatar,where can i buy globe roam sim here in qatar?,can anyone tell me where can i buy globe roami...,you can go to filipino souq you cab get it for...,Good,"[[0.25306, 0.044859, -0.27407, -0.02257, 0.755...","[[0.44746, 0.18366, -0.33118, -0.22959, 0.4543...","[[-0.23857, 0.35457, -0.30219, 0.089559, 0.082...","[[-0.11076, 0.30786, -0.5198, 0.035138, 0.1036..."


## Task 2: Embedding and Feature Functions - 11 Points
You will train a classifier to predict whether a comment is a good answer or not, based on the word vectors. You will explore different strategies:
* how to compute a single fixed-length vector out of the word vectors
* how to combine these single vectors from different fields (such as `question` and `comment`)
 
**a)** Implement the function `embedding_fn_max_pooling(word_vectors)`. It should compute a single vector (of the same dimensionality as a single word vector) via max pooling: In each $i$th dimension, the resulting representation must have the maximum value based on all $i$th dimension values from all words.

Example:

w_0 = [0.1, 0.2, 0.3, 0.4] 

w_1 = [1.0, -1.5, 2.0, -2.5]

The resulting embedding should be e = [max(0.1, 1.0), max(0.2, -1.5), max(0.3, 2.0), max(0.4, -2.5)] = [1.0, 0.2, 2.0, 0.4].

Output the dimensionality of the resulting embedding after applying this function on the word vectors of a single question (you can choose any question).
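The worked example above can be checked with a few lines of NumPy. This is only a sketch of max pooling on the two example vectors, not the required implementation; the variable names simply mirror the example.

```python
import numpy as np

w_0 = np.array([0.1, 0.2, 0.3, 0.4])
w_1 = np.array([1.0, -1.5, 2.0, -2.5])

# Stack the word vectors to shape (num_words, d) and take the maximum
# over the word axis, which leaves one value per dimension.
e = np.max(np.stack([w_0, w_1]), axis=0)

print(e.tolist())  # [1.0, 0.2, 2.0, 0.4]
print(e.shape)     # (4,)
```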

In [16]:
from typing import List, Callable

def embedding_fn_max_pooling(word_vectors: List[np.ndarray]) -> np.ndarray:
    """
    Converts an arbitrary number of d-dimensional word embeddings 
    into a single d-dimensional embedding via max-pooling.
    :param word_vectors
        array-like object containing all dense word vectors
    :returns
        d-dimensional embedding (array-like object)
    """
    # Stack to shape (num_words, d), then take the maximum over the word axis
    return np.max(np.stack(word_vectors), axis=0)

display(embedding_fn_max_pooling(train_data['question_vectors'][0]).shape)


(300,)

**b)** Implement the function `feature_fn_concatenate_question_comment(df, embedding_fn)`. It should
* Use the `embedding_fn` to create fixed-length vectors from the fields `question_vectors` and `comment_vectors` individually (for each sample: one vector for the question, one vector for the comment).
* Concatenate both vectors horizontally.
* Return these concatenated vectors (the features) for all samples in the dataframe `df`.

You will use these vectors to train a new classifier. Execute this function using `embedding_fn=embedding_fn_max_pooling` on the train split and output the shape of the resulting matrix.
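As a rough illustration of the expected shapes, the following sketch concatenates two made-up embedding matrices horizontally. The sizes `n` and `d` and the matrix contents are assumptions chosen for readability; in the task the matrices would come from applying `embedding_fn` to each row.

```python
import numpy as np

n, d = 3, 4  # hypothetical: 3 samples, 4-dimensional embeddings

question_emb = np.ones((n, d))   # stands in for the pooled question vectors
comment_emb = np.zeros((n, d))   # stands in for the pooled comment vectors

# Horizontal concatenation keeps one row per sample and doubles the width.
features = np.concatenate([question_emb, comment_emb], axis=1)

print(features.shape)  # (3, 8)
```

With 300-dimensional word vectors the same operation yields 600 features per sample.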

In [35]:
def to_array(series: pd.Series) -> np.ndarray:
    # Stack a Series of d-dimensional vectors into a matrix of shape (n, d)
    return np.stack(series.to_list())

def feature_fn_concatenate_question_comment(df: pd.DataFrame, embedding_fn: Callable) -> np.ndarray:
    """
    Uses the embedding_fn to create d-dimensional embeddings from the tokens 
    of the questions and comments respectively, and concatenates these embeddings.
    :param df
        Dataframe consisting of multiple samples. Features will be computed for each sample individually
    :param embedding_fn
        As in 2a)
    :returns
        Matrix of shape (n, d*2), where 
        n is the number of samples in df and 
        d is the output dimensionality of the embedding_fn
    """
    question_emb = to_array(df['question_vectors'].map(embedding_fn))
    comment_emb = to_array(df['comment_vectors'].map(embedding_fn))
    return np.concatenate((question_emb, comment_emb), axis=1)
        


fn=feature_fn_concatenate_question_comment(train_data,embedding_fn_max_pooling)
print(fn.shape)
#display(fn)

(2324, 600)


**c)** Execute the (already implemented) function `train_and_evaluate` to train a new classifier based on the two functions you implemented. Use `embedding_fn=embedding_fn_max_pooling`, `feature_fn=feature_fn_concatenate_question_comment` and leave the classifier as the default parameter. The function will normalize the computed features and train and evaluate a support vector machine.

What is the advantage of using F1 macro when we want to weight each label equally, regardless of the label distribution in the dataset? Please answer the question in up to 3 sentences.

In [18]:
def train_and_evaluate(df_train:pd.DataFrame, df_dev:pd.DataFrame, embedding_fn:Callable, feature_fn:Callable, 
                      classifier=make_pipeline(StandardScaler(), SVC()))->None:
    
    # 1) Compute vectors
    X_train = feature_fn(df_train, embedding_fn)
    X_dev = feature_fn(df_dev, embedding_fn)
    
    # 2) Train classifier
    classifier.fit(X_train, df_train['comment_gold'])
    
    # 3) Predict dev data
    predictions = classifier.predict(X_dev)
    
    # 4) Compute and output metrics
    print(classification_report(df_dev['comment_gold'], predictions))

df_train=train_data
df_dev=dev_data
train_and_evaluate(df_train, df_dev, embedding_fn_max_pooling, feature_fn_concatenate_question_comment)


              precision    recall  f1-score   support

         Bad       0.58      0.15      0.24        92
        Good       0.80      0.97      0.88       322

    accuracy                           0.79       414
   macro avg       0.69      0.56      0.56       414
weighted avg       0.75      0.79      0.74       414



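Answer: F1 macro is the unweighted mean of the per-label F1 scores, so every label contributes equally and the score is not affected by an imbalanced label distribution. Here, the weak F1 on the rare label "Bad" (0.24) pulls the macro average down to 0.56, whereas the weighted average (0.74) is dominated by the frequent label "Good".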
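The difference between the macro and the weighted average can be recomputed by hand from the classification report above. The sketch below uses the rounded per-label F1 scores and supports from that report; the dictionary names are only illustrative.

```python
# Per-label F1 and support copied from the classification report above.
f1 = {"Bad": 0.24, "Good": 0.88}
support = {"Bad": 92, "Good": 322}

# Macro average: every label counts the same.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every label counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1[label] * support[label] for label in f1) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.56 0.74
```

Both values match the `macro avg` and `weighted avg` rows of the report.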

**d)** Create
* one new embedding function (as in 2a) that converts multiple word vectors into a single fixed-length vector (e.g. by averaging over all word embeddings)
* two new feature functions (as in 2b), that combine these vectors of different fields to generate the final features (e.g. add different columns, use fewer columns, use average instead of concatenating, ...).

Use self-explanatory function names or add a comment to describe what each function does.

Run a grid search (parts are already implemented) for all combinations of the two embedding functions and the three feature functions. Which combination yields the highest F1 macro? Explain in up to three sentences how the task of *question similarity* comes into play when you want to create a basic cQA system with this model.

In [37]:
def embedding_fn_average_pooling(word_vectors: List[np.ndarray]) -> np.ndarray:
    # Average over the word axis: (num_words, d) -> (d,)
    return np.mean(np.stack(word_vectors), axis=0)

def feature_fn_average_question_comment(df: pd.DataFrame, embedding_fn: Callable) -> np.ndarray:
    # Element-wise mean of the question and comment embeddings: shape (n, d)
    question_emb = to_array(df['question_vectors'].map(embedding_fn))
    comment_emb = to_array(df['comment_vectors'].map(embedding_fn))
    return (question_emb + comment_emb) / 2

def feature_fn_max_question_comment(df: pd.DataFrame, embedding_fn: Callable) -> np.ndarray:
    # Element-wise maximum of the question and comment embeddings: shape (n, d)
    question_emb = to_array(df['question_vectors'].map(embedding_fn))
    comment_emb = to_array(df['comment_vectors'].map(embedding_fn))
    return np.maximum(question_emb, comment_emb)



(300,)

In [38]:
embedding_functions = [
    ('max-pooling', embedding_fn_max_pooling), 
    ('average-pooling',embedding_fn_average_pooling)
]

feature_functions = [
    ('concat-question-comment', feature_fn_concatenate_question_comment),
    ('average-question-comment',feature_fn_average_question_comment),
    ('max-question-comment',feature_fn_max_question_comment)
]

for embedding_name, embedding_fn in embedding_functions:
    for feature_name, feature_fn in feature_functions:
        print('Embeddings:', embedding_name, '; Features:', feature_name)
        
        # Adjust the naming of df_train and df_dev to match your training and dev set:
        train_and_evaluate(df_train, df_dev, embedding_fn, feature_fn)
                

Embeddings: max-pooling ; Features: concat-question-comment
              precision    recall  f1-score   support

         Bad       0.58      0.15      0.24        92
        Good       0.80      0.97      0.88       322

    accuracy                           0.79       414
   macro avg       0.69      0.56      0.56       414
weighted avg       0.75      0.79      0.74       414

Embeddings: max-pooling ; Features: average-question-comment
              precision    recall  f1-score   support

         Bad       0.53      0.23      0.32        92
        Good       0.81      0.94      0.87       322

    accuracy                           0.78       414
   macro avg       0.67      0.58      0.59       414
weighted avg       0.75      0.78      0.75       414

Embeddings: max-pooling ; Features: max-question-comment
              precision    recall  f1-score   support

         Bad       0.57      0.04      0.08        92
        Good       0.78      0.99      0.88       322

    

Answer: 1. The highest F1 macro comes from the combination of average-pooling and max-question-comment. 2. First, compute the embedding of the new input question. Then compute the cosine similarity between this embedding and the embedding of each stored question (from `question_vectors`) and rank the stored questions by similarity. The comments of the most similar questions can then be passed to this model to decide which of them are good answers.
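The cosine-similarity ranking described in this answer could look roughly like the sketch below. The function name `rank_by_cosine` and the two-dimensional toy embeddings are assumptions for illustration; in practice the embeddings would come from one of the pooling functions, and all vectors are assumed to be non-zero.

```python
import numpy as np

def rank_by_cosine(query: np.ndarray, candidates: np.ndarray):
    """Return candidate indices sorted by descending cosine similarity to the query."""
    # Normalise to unit length so that the dot product equals the cosine similarity.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ query
    order = np.argsort(-scores)
    return order, scores[order]

# Hypothetical embeddings of three already answered questions.
kb_question_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_question_embedding = np.array([1.0, 0.1])

order, scores = rank_by_cosine(new_question_embedding, kb_question_embeddings)
print(order.tolist())  # [0, 2, 1]
```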

## Task 3: Question Similarity - 4 Points

After learning techniques for community question answering, you know about the cQA pipeline:
* Find similar questions in cQA forums. 
* Select answers addressing the information need for our question. 
* Summarize the most relevant information. 

Now you are designing a cQA system and are in step 1, question similarity. Often two components are used in this step: an IR system (e.g. BM25/tf-idf) and a question re-ranker (often a neural model). 

Given a new question, these two components extract only the top 10 most similar already answered questions from a KB. A pool of answer candidates is taken from the answers of these extracted questions. Then useful answers from the pool of answer candidates are selected and summarized.

**a)** Explain in which order these components are executed and why usually both are used and not either one of them alone (answers can be less detailed)?

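Answer: The IR system runs first and retrieves a set of candidate questions from the whole KB; the re-ranker (often a neural model) then reorders only these candidates. The IR system is fast and needs no training data, but it mostly captures lexical overlap. The re-ranker can be trained to produce a better ranking, but it requires training data and is too expensive to apply to every question in the KB, so both are combined.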
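The two-stage order can be sketched as follows. This is a toy illustration under strong assumptions: `overlap_score` is a simple token-overlap count standing in for BM25/tf-idf, and `jaccard` stands in for a neural re-ranker; all names and the tiny KB are hypothetical.

```python
from typing import Callable, List

def overlap_score(query: str, doc: str) -> int:
    """Cheap lexical score: number of shared tokens (stand-in for BM25/tf-idf)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve_then_rerank(query: str, kb: List[str],
                         rerank_score: Callable[[str, str], float],
                         k: int = 10) -> List[str]:
    # Stage 1: cheap retrieval over the whole KB, keep only the top k candidates.
    candidates = sorted(kb, key=lambda doc: overlap_score(query, doc), reverse=True)[:k]
    # Stage 2: the more expensive scorer only sees these k candidates.
    return sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)

def jaccard(query: str, doc: str) -> float:
    """Toy re-ranking score (stand-in for a trained neural model)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

kb = [
    "where can i buy massage oil",
    "where can i buy a sim card in doha",
    "best restaurants in doha",
]
ranked = retrieve_then_rerank("where can i buy oil", kb, jaccard, k=2)
print(ranked)
```

The expensive scorer is called only `k` times per query, regardless of the size of the KB, which is the main reason for combining the two components.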

**b)** Imagine you must select a good re-ranker for your cQA pipeline. Select a metric (of those introduced throughout this lecture), that can be used here to determine the best model.

Answer: We choose MRR (Mean Reciprocal Rank) for two reasons: 1. It is simple to compute and easy to interpret. 2. It puts a strong focus on the first relevant element of the list, which makes it well suited for targeted searches like "best item for users".
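For reference, MRR can be computed as in the sketch below. The function name and the relevance lists are made up for illustration: each inner list marks, in ranked order, whether a retrieved question is relevant to the query, and a query without any relevant result contributes 0.

```python
from typing import List

def mean_reciprocal_rank(ranked_relevance: List[List[bool]]) -> float:
    """Average of 1/rank of the first relevant item per query (0 if there is none)."""
    total = 0.0
    for relevance in ranked_relevance:
        reciprocal_rank = 0.0
        for rank, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                reciprocal_rank = 1.0 / rank
                break  # only the first relevant item counts
        total += reciprocal_rank
    return total / len(ranked_relevance)

# Hypothetical rankings for three queries.
rankings = [
    [False, True, False],   # first relevant item at rank 2 -> 1/2
    [True, False, False],   # first relevant item at rank 1 -> 1
    [False, False, False],  # no relevant item             -> 0
]
mrr = mean_reciprocal_rank(rankings)
print(mrr)  # 0.5
```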

Please upload your working Jupyter notebook in Moodle <b>before the next lab session</b> <span style="color:red">(February 10th, 16:14)</span>. Submission format: Group_No_Exercise_No.zip<br>
The submission should contain your filled-out Jupyter notebook template (naming schema: Group_No_Exercise_No.ipynb) and any auxiliary files that are necessary to run your code (e.g. datasets provided by us).<br>
Each submission must only be handed in once per group.