# Babylon Task - Joseph Potashnik

## Thought process

1. Load the dataset.
2. Independent variables: text and agent (Doctor / Patient).
3. Dependent variables: the coarse-grained and fine-grained tags.
4. Build a Pipeline that combines the text and the speaker identity (via FeatureUnion) and feeds them to a classifier (I chose a linear SVM, subject to future experimentation).
5. Tune hyperparameters via GridSearchCV. The hyperparameter space is vast; here I restrict myself to the relative weights of the agent / text features in the classification.
6. Results and analysis.

## Text pipeline
1. Select the 'text' column from the train dataframe.
2. Build a bag-of-words model (unigrams only, ngram_range=(1,1)) with CountVectorizer on each sentence, using a spaCy-based tokenizer that strips punctuation and stop words and lemmatizes the remaining tokens.
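
To make the bag-of-words step concrete, here is a minimal sketch on two made-up sentences. The `simple_tokenizer` function and its tiny stop-word list are hypothetical stand-ins for the spaCy tokenizer, used only so the example runs without a language model; the sentences are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the spaCy tokenizer: split on whitespace,
# strip surrounding punctuation, drop a tiny illustrative stop-word list.
STOP_WORDS = {"do", "you", "a", "i", "have", "the"}

def simple_tokenizer(sentence):
    tokens = [tok.strip(".,?!").lower() for tok in sentence.split()]
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

# Toy sentences, not from the interactions dataset.
sentences = ["Do you have a headache?", "I have a mild headache."]

# token_pattern=None only silences the warning that the default pattern
# is unused when a custom tokenizer is supplied.
cv = CountVectorizer(tokenizer=simple_tokenizer, ngram_range=(1, 1),
                     token_pattern=None)
bow = cv.fit_transform(sentences)

vocabulary = sorted(cv.vocabulary_)   # column order of the count matrix
counts = bow.toarray()                # one row per sentence
print(vocabulary)
print(counts)
```

Each row is one sentence and each column one surviving token, so the classifier sees only which content words occur and how often.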

## Agent pipeline
1. Select the 'agent_type' column from the train dataframe.
2. Encode it as token counts with a default CountVectorizer. Assumption on input: we only encounter the category word (e.g., 'Patient', 'Doctor'), so each row becomes an indicator vector over the agent types.
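
As a small illustration of that assumption, the sketch below runs a default CountVectorizer over a made-up agent column. When every row holds a single category word, the output behaves like a one-hot encoding; the values are toy data, not the dataset's.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy agent column; each entry is a single category word.
agents = ["Patient", "Doctor", "Patient"]

cv = CountVectorizer()  # default settings lowercase the input
encoded = cv.fit_transform(agents).toarray()

# Column names in the order used by the encoded matrix.
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(columns)
print(encoded)
```

If the column ever held free text instead of one category word, the same vectorizer would produce a bag of words rather than an indicator vector, which is why the input assumption matters.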


## Imports

Import libraries and read the train and test files into dataframes.

In [1]:
import pandas as pd
import numpy as np
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

df_train = pd.read_csv('interactions_train.tsv', sep='\t', header=0, index_col=0)
df_test = pd.read_csv('interactions_test.tsv', sep='\t', header=0, index_col=0)

def load_train_test_sets(independentvars, dependentVars):
    X_train = df_train[independentvars]
    y_train = df_train[dependentVars]
    X_test = df_test[independentvars]
    y_test = df_test[dependentVars]
    return (X_train, X_test, y_train, y_test)
    
nlp = spacy.load('en')

In [2]:
class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
        

In [3]:
def spacy_tokenizer(sentence):
    tokens = nlp(sentence)
    tokens = [tok for tok in tokens if not (tok.is_stop or tok.is_punct)]   
    # spaCy returns "-PRON-" as the lemma of every pronoun; keep the lowercased
    # surface form instead. In my mind that behavior is wrong and subject to future change.
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tokens]
    return tokens

### Defining the pipeline for coarse classification: depends on agent and text

In [4]:
def pipelines_for_text_and_agent():
    pipe_for_text = Pipeline([
                    ('selector', TextSelector(key='text')),
                    ('cv', CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1)))
                ])

    pipe_for_agent = Pipeline([
                    ('selector', TextSelector(key='agent_type')),
                    ('cv', CountVectorizer())
                ])
    return pipe_for_text, pipe_for_agent

def pipeline_coarse_classification():
    pipe_for_text, pipe_for_agent = pipelines_for_text_and_agent()

    feats = FeatureUnion([('text', pipe_for_text), 
                          ('agent', pipe_for_agent)],
                        transformer_weights = { 'text': 1, 'agent': 2})

    pipeline = Pipeline([
        ('features', feats),
        ('classifier', LinearSVC(loss='hinge'))
    ])
    return pipeline


### Sanity check for coarse classification

A quick sanity check to review predictions on the test set:

In [5]:
X_train, X_test, y_train, y_test = load_train_test_sets(independentvars = ['text', 'agent_type'], 
                                                       dependentVars = 'gold_label_simple')

pipeline = pipeline_coarse_classification()
pipeline.fit(X_train, y_train.values.ravel())
preds = pipeline.predict(X_test)
print(classification_report(y_test, preds))
confidence = pipeline.decision_function(X_test)

             precision    recall  f1-score   support

   Question       0.80      0.91      0.85        58
   Response       0.81      0.91      0.86        57
  Statement       0.80      0.20      0.32        20

avg / total       0.81      0.81      0.78       135



### Defining the pipeline for fine classification: depends on agent, text, and coarse gold label

In [6]:
def pipeline_fine_classification():
    pipe_for_text, pipe_for_agent = pipelines_for_text_and_agent()
    
    pipe_for_coarse_tag = Pipeline([
                    ('selector', TextSelector(key='gold_label_simple')),
                    ('cv', CountVectorizer())
                ])   
    

    feats = FeatureUnion([('text', pipe_for_text), 
                          ('agent', pipe_for_agent),
                          ('tag', pipe_for_coarse_tag)
                         ])

    pipeline = Pipeline([
        ('features', feats),
        ('classifier', LinearSVC(loss='hinge'))
    ])
    return pipeline

### Sanity check for fine classification


In [7]:
X_train, X_test, y_train, y_test = load_train_test_sets(independentvars = ['text', 'agent_type', 'gold_label_simple'], 
                                                       dependentVars = 'gold_label_extended')

pipeline = pipeline_fine_classification()
pipeline.fit(X_train, y_train.values.ravel())
preds = pipeline.predict(X_test)
print(classification_report(y_test, preds))



                     precision    recall  f1-score   support

 BinaryOpenQuestion       0.44      0.44      0.44        18
 BinaryOpenResponse       0.33      0.19      0.24        16
     BinaryQuestion       0.11      0.12      0.12         8
     BinaryResponse       0.22      0.71      0.33         7
       Confirmation       0.38      0.60      0.46         5
    ContextQuestion       0.35      0.26      0.30        27
    ContextResponse       0.50      0.29      0.37        24
MultiChoiceQuestion       0.00      0.00      0.00         0
MultiChoiceResponse       0.00      0.00      0.00         0
       OpenQuestion       0.43      0.60      0.50         5
              Other       0.50      0.17      0.25         6
          Statement       0.67      0.63      0.65        19

        avg / total       0.42      0.37      0.38       135



### Hyperparameter tuning

We will do a grid search. The parameter space is vast; for instructive purposes we tune only the relative weights between the text feature space and the agent feature space:

(text weight, agent weight) = (1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)

We can also control the number of cross-validation folds (and hence the number and size of the validation sets) through the cv argument. The default here is 4, at the user's discretion.
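
To show what these weights do, here is a minimal sketch on toy numbers. It assumes two identity transformers as stand-ins for the text and agent pipelines (an illustration, not the notebook's pipelines) and shows that FeatureUnion multiplies each block of features by its weight before concatenating them.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# The seven candidate weightings searched in this notebook.
grid = ([{"text": 1, "agent": w} for w in range(1, 5)]
        + [{"text": w, "agent": 1} for w in range(2, 5)])

# Identity transformers standing in for the text and agent pipelines.
X = np.array([[1.0], [2.0]])
union = FeatureUnion(
    [("text", FunctionTransformer()), ("agent", FunctionTransformer())],
    transformer_weights={"text": 1, "agent": 2},
)

# Column 0 is the "text" block (x1), column 1 the "agent" block (x2).
features = union.fit_transform(X)
print(len(grid))
print(features)
```

Because a linear SVM is sensitive to feature scale, scaling one block up makes it cheaper for the regularized model to rely on that block, which is what the grid search is probing.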


In [8]:
def GridSearch(pipeline, cv=4):
    # Candidate weightings: agent weighted 1..4 against text = 1, ...
    weights_1 = [{'text': 1, 'agent': weight} for weight in range(1, 5)]
    # ... then text weighted 2..4 against agent = 1.
    weights_2 = [{'text': weight, 'agent': 1} for weight in range(2, 5)]

    hyperparameters = {'features__transformer_weights': weights_1 + weights_2}
    clf = GridSearchCV(pipeline, hyperparameters, cv=cv)

    # Note: fits on the module-level X_train / y_train set by the calling cell.
    clf.fit(X_train, y_train.values.ravel())
    return clf



### GridSearch for coarse classification:

In [9]:
X_train, X_test, y_train, y_test = load_train_test_sets(independentvars = ['text', 'agent_type'], 
                                                       dependentVars = 'gold_label_simple')

pipeline = pipeline_coarse_classification()
clf = GridSearch(pipeline)

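# No explicit refit needed: GridSearchCV (refit=True by default) has already
# retrained the best pipeline on the full training set.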
preds = clf.predict(X_test)
print(clf.best_params_)
df_test['predicted_label_simple'] = preds
print(classification_report(y_test, preds))

{'features__transformer_weights': {'text': 1, 'agent': 1}}
             precision    recall  f1-score   support

   Question       0.80      0.91      0.85        58
   Response       0.81      0.91      0.86        57
  Statement       0.80      0.20      0.32        20

avg / total       0.81      0.81      0.78       135



### GridSearch for fine classification:

In [10]:
X_train, X_test, y_train, y_test = load_train_test_sets(independentvars = ['text', 'agent_type', 'gold_label_simple'], 
                                                       dependentVars = 'gold_label_extended')

pipeline = pipeline_fine_classification()
clf = GridSearch(pipeline)

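# No explicit refit needed: GridSearchCV (refit=True by default) has already
# retrained the best pipeline on the full training set.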
preds = clf.predict(X_test)
print(clf.best_params_)
df_test['predicted_label_extended'] = preds
print(classification_report(y_test, preds))

{'features__transformer_weights': {'text': 1, 'agent': 1}}




                     precision    recall  f1-score   support

 BinaryOpenQuestion       0.44      0.44      0.44        18
 BinaryOpenResponse       0.33      0.19      0.24        16
     BinaryQuestion       0.11      0.12      0.12         8
     BinaryResponse       0.22      0.71      0.33         7
       Confirmation       0.38      0.60      0.46         5
    ContextQuestion       0.35      0.26      0.30        27
    ContextResponse       0.50      0.29      0.37        24
MultiChoiceQuestion       0.00      0.00      0.00         0
MultiChoiceResponse       0.00      0.00      0.00         0
       OpenQuestion       0.43      0.60      0.50         5
              Other       0.50      0.17      0.25         6
          Statement       0.67      0.63      0.65        19

        avg / total       0.42      0.37      0.38       135



### Writing into File:

In [12]:
df_test.to_csv('interactions_test_predictions.tsv', sep='\t')

### Future work

What does not classify well?

In coarse classification, the Question and Response categories are classified reasonably well (f1-score of 0.85 and 0.86), but recall for the Statement category is only 0.20, on just 20 test examples. We need to understand why: inspect the Statement texts and the classes they are mistaken for, and look for a bias.
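
One possible first step, sketched here on made-up labels rather than the notebook's predictions, is a confusion matrix: its Statement row shows which classes the missed statements are assigned to.

```python
from sklearn.metrics import confusion_matrix

labels = ["Question", "Response", "Statement"]

# Toy gold labels and predictions, for illustration only.
y_true = ["Statement", "Statement", "Statement", "Question", "Response"]
y_pred = ["Response", "Question", "Statement", "Question", "Response"]

# Rows are gold labels, columns are predicted labels, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

In the notebook itself the same call would presumably take `y_test` and `preds` from the coarse classification cell.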

In fine classification, the results are admittedly poor (average f1-score of 0.38).
What can we do to improve? One glaring problem, I think: we have not exploited the prior information that the fine category depends on the gold coarse category in a very strict way. The coarse tag enters the model only as one more count feature, so at the moment nothing bars our classifier from assigning, say, BinaryResponse as a fine tag given Statement as a coarse tag, which is a mistake. We must model this constraint explicitly. A single flat SVC is inappropriate here; we need a hierarchical classification scheme.
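
One simple way to enforce such a constraint, sketched below under stated assumptions, is to mask the classifier's per-class scores so that only fine tags compatible with the coarse tag can win. The `ALLOWED` mapping is hypothetical (the real coarse-to-fine mapping would have to come from the annotation scheme), and `constrained_predict` is an illustrative helper, not part of the notebook. The score matrix is assumed to have one column per fine class, as a multiclass `decision_function` would return.

```python
import numpy as np

# Hypothetical coarse -> fine mapping, for illustration only.
ALLOWED = {
    "Question": ["BinaryQuestion", "OpenQuestion"],
    "Response": ["BinaryResponse"],
    "Statement": ["Statement"],
}

def constrained_predict(scores, fine_classes, coarse_tags, allowed=ALLOWED):
    """Pick, per row, the best-scoring fine class permitted by the coarse tag."""
    scores = np.asarray(scores, dtype=float)
    fine_classes = np.asarray(fine_classes)
    preds = []
    for row, coarse in zip(scores, coarse_tags):
        mask = np.isin(fine_classes, allowed[coarse])
        masked = np.where(mask, row, -np.inf)  # forbid incompatible classes
        preds.append(str(fine_classes[int(np.argmax(masked))]))
    return preds

# Toy scores: columns follow the order of fine_classes.
fine_classes = ["BinaryQuestion", "BinaryResponse", "OpenQuestion", "Statement"]
scores = [[0.1, 0.9, 0.3, 0.2],   # unconstrained argmax would be BinaryResponse
          [0.4, 0.2, 0.1, 0.3]]
coarse_tags = ["Statement", "Question"]

result = constrained_predict(scores, fine_classes, coarse_tags)
print(result)
```

An alternative with the same effect would be to train one fine classifier per coarse tag; the masking variant is shown only because it can reuse a single already-trained model.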



# Thank You
