# Analyzing the Tennis Corpus with Surprise
This demo is based on the [Tie-breaker paper](https://www.cs.cornell.edu/~liye/tennis.html) on gender bias in sports journalism. We measure how much each utterance deviates from a language model using cross entropy, as implemented by the Surprise transformer.
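
To make the scoring concrete, here is a minimal sketch of cross entropy between a target token sequence and a unigram model estimated from context tokens, with optional add-one smoothing. The function name `cross_entropy` and the toy sentences are illustrative assumptions, not ConvoKit's implementation; the logarithm base and the vocabulary definition may differ from what the Surprise transformer uses.

```python
import math
from collections import Counter

def cross_entropy(target, context, smooth=True):
    """Average negative log-probability of target tokens under a unigram
    model estimated from context tokens (illustrative sketch only)."""
    counts = Counter(context)
    # Vocabulary covers both sequences so every target token has a slot
    vocab_size = len(set(context) | set(target))
    k = 1 if smooth else 0  # add-one (Laplace) smoothing when enabled
    denom = len(context) + k * vocab_size
    total = 0.0
    for tok in target:
        prob = (counts[tok] + k) / denom
        if prob == 0:
            return float('inf')  # unseen token and no smoothing
        total -= math.log(prob)
    return total / len(target)

# Toy data, made up for illustration
context = 'she hits a strong serve and wins the point'.split()
on_topic = 'a strong serve'.split()
off_topic = 'her dress today'.split()

print(cross_entropy(on_topic, context))
print(cross_entropy(off_topic, context))
```

In this toy example the off-topic target scores higher than the on-topic one, which is the intuition behind using cross entropy as a "surprise" measure.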

In [1]:
import sys
sys.path.insert(0, '/Users/tushaar/Downloads/Cornell/Research/ConvoKit/')

In [2]:
import json

import numpy as np
from collections import defaultdict
from nltk.tokenize import word_tokenize, sent_tokenize

import convokit
from convokit import Surprise, ConvoKitLanguageModel, Kenlm
from convokit import Corpus, Speaker, Utterance, download

from tqdm.notebook import tqdm
import pprint as pp

### Create corpus using tennis game commentary dataset
This dataset consists of a gender-balanced set of play-by-play commentaries from tennis matches.

In [3]:
PATH = '../../../../examples' # replace with your path to tennis_data directory
data_dir = f'{PATH}/tennis_data/'

In [4]:
corpus_speakers = {'COMMENTATOR': Speaker(id = 'COMMENTATOR', meta = {})}

In [5]:
with open(data_dir + 'text_commentaries.json', 'r') as f:
    commentaries = json.load(f)

In [6]:
utterances = []
count = 0
for c in tqdm(commentaries):
    idx = 'c{}'.format(count)
    meta = {'player_gender': c['gender'], 'scoreline': c['scoreline']}
    utterances.append(Utterance(id=idx, speaker=corpus_speakers['COMMENTATOR'], 
                                conversation_id=idx, text=c['commentary'], meta=meta))
    count += 1

  0%|          | 0/3962 [00:00<?, ?it/s]

In [7]:
game_commentary_corpus = Corpus(utterances=utterances)

### Load interview corpus
This dataset contains transcripts from post-match press conferences.

In [8]:
interview_corpus = Corpus(filename=download('tennis-corpus'))

Dataset already exists at /Users/tushaar/.convokit/downloads/tennis-corpus


In [9]:
interview_corpus.print_summary_stats()

Number of Speakers: 359
Number of Utterances: 163948
Number of Conversations: 81974


To help with the analysis, let's add a `'player_gender'` metadata attribute to each reporter question, recording the gender of the player to whom the question is posed.

In [10]:
for utt in interview_corpus.iter_utterances(selector=lambda u: u.meta['is_question']):
    utt.add_meta('player_gender', 
                 utt.get_conversation().get_utterance(utt.id.replace('q', 'a')).get_speaker().meta['gender'])

## Part 1: How surprising is each interview question based on typical language used to describe tennis?

For this demo, we want to train one model for the entire game language corpus, so we'll make our `model_key_selector` a function that returns the same key for every utterance in a corpus. We define two custom tokenizers: `tokenizer_alnum` converts to lowercase and removes punctuation, while `tokenizer_lower` converts to lowercase and appends a newline token after each sentence. The transformer below uses `tokenizer_lower`. We will set the `context_sample_size` parameter to `None`, so that the entire game commentary corpus is used as the context.
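
The roles of `target_sample_size`, `context_sample_size`, and `n_samples` can be sketched as follows. This illustration rests on assumptions: the helper names `sample_span` and `sampled_score`, the contiguous-span sampling, and the placeholder scoring function are all hypothetical, and ConvoKit's actual sampling may differ.

```python
import random

def sample_span(tokens, size, rng):
    """Return a random contiguous span of `size` tokens, the whole list if
    size is None, or None if the text is too short (hypothetical helper)."""
    if size is None:
        return tokens
    if len(tokens) < size:
        return None
    start = rng.randrange(len(tokens) - size + 1)
    return tokens[start:start + size]

def sampled_score(target, context, score_fn, target_sample_size=10,
                  context_sample_size=None, n_samples=3, seed=0):
    """Average score_fn over n_samples random (target, context) draws."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        t = sample_span(target, target_sample_size, rng)
        c = sample_span(context, context_sample_size, rng)
        if t is None or c is None:
            return float('nan')  # too short to sample: no score
        scores.append(score_fn(t, c))
    return sum(scores) / len(scores)

# Placeholder score: fraction of target tokens absent from the context
def unseen_fraction(t, c):
    seen = set(c)
    return sum(tok not in seen for tok in t) / len(t)

# Toy data, made up for illustration
context = 'the serve was fast and the return was deep'.split()
long_q = 'how did you feel about the serve and the return today out there'.split()
short_q = 'how was it'.split()

print(sampled_score(long_q, context, unseen_fraction))
print(sampled_score(short_q, context, unseen_fraction))  # nan: fewer than 10 tokens
```

Under this sketch, a question shorter than the target sample size receives no score, and averaging over several random draws makes the result vary from run to run unless the seed is fixed.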

In [11]:
from nltk import word_tokenize

def tokenizer_alnum(text):
    return list(filter(lambda w: w.isalnum(), word_tokenize(text.lower())))

def tokenizer_lower(text):
    tokens = []
    for sentence in sent_tokenize(text):
        tokens += (word_tokenize(sentence.lower()) + ['\n'])
    return tokens

surp = Surprise(model_key_selector=lambda utt: 'corpus', target_sample_size=10, tokenizer=tokenizer_lower,
                context_sample_size=None, n_samples=3, n_jobs=8)

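Since we want to measure how surprising reporter questions are relative to typical tennis language, we'll fit the transformer on the game commentary corpus. Because there is a single model key, `text_func` returns the same text for every utterance: all commentaries joined into one string.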

In [12]:
surp = surp.fit(game_commentary_corpus, 
                text_func=lambda utt: [' '.join([u.text for u in game_commentary_corpus.iter_utterances()])])

To speed up the demo, we'll select a random subset of interview questions to compute surprise scores for. To run the demo on the entire interview corpus, set `SAMPLE` to `False`.

In [13]:
import itertools

SAMPLE = True
SAMPLE_SIZE = 500  # edit this to change the number of interview questions to calculate surprise for

if SAMPLE:
    # Draw SAMPLE_SIZE question ids at random from the interview corpus
    question_ids = interview_corpus.get_utterances_dataframe(
        selector=lambda utt: utt.meta['is_question']).sample(SAMPLE_SIZE).index
    subset_utts = [interview_corpus.get_utterance(utt_id) for utt_id in question_ids]
    subset_corpus = Corpus(utterances=subset_utts)
else:
    subset_corpus = interview_corpus

Again, we compute surprise only for utterances that are questions. The scores come from a KenLM language model, which we configure first.

In [14]:
kenlm = Kenlm(kenlm_path='/Users/tushaar/kenlm', models_dir=f'{PATH}/kenlm_models', 
              model_filename='kenlm_surprise', is_persistent=True)
pp.pprint(kenlm.config)

{'is_persistent': True,
 'kenlm_path': '/Users/tushaar/kenlm',
 'model_filename': 'kenlm_surprise',
 'model_type': 'kenlm',
 'models_dir': '../../../../examples/kenlm_models',
 'n_jobs': 1,
 'ngram_order': 2}


In [15]:
subset_corpus = surp.transform(subset_corpus, obj_type='utterance',
                               selector=lambda utt: utt.meta['is_question'], 
                               language_model=kenlm, eval_type='cross_entropy')

### Results
Let's compare the surprise scores for questions posed to female players with those posed to male players, using the median as the summary statistic. Based on results from the Tie-breaker paper, we should expect a higher typical surprise score for questions posed to female players. A higher score would indicate that questions posed to female players tend to differ more from typical tennis language, which may mean that female players are asked questions that are less relevant to tennis.

In [16]:
utterances = subset_corpus.get_utterances_dataframe(selector=lambda utt: utt.meta['is_question'])

In [17]:
import pandas as pd

def get_scores(surprise_col):
    # Extract the score stored under the 'corpus' model key from each surprise dict
    return pd.Series([score['corpus'] for score in surprise_col], index=surprise_col.index)

female_qs = get_scores(utterances[utterances['meta.player_gender'] == 'F']['meta.surprise']).dropna()
female_qs.median()

36.54281806945801

In [18]:
male_qs = get_scores(utterances[utterances['meta.player_gender'] == 'M']['meta.surprise']).dropna()
male_qs.median()

36.73420365651448

When running this demo multiple times, we see that the median surprise for questions posed to female players is sometimes higher than for male players and sometimes lower. This may be due to the random sampling used by the Surprise transformer when selecting targets and contexts. Another possible explanation for the difference in results from the Tie-breaker paper may be that the paper used a bigram language model with modified Kneser-Ney smoothing. Our transformer currently only allows for unigram language models and add-one (Laplace) smoothing. These differences may explain why we do not get the same statistically significant results as the paper.
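
One way to check whether a difference in medians exceeds what random variation alone would produce is a permutation test. The sketch below uses made-up scores, not results from this demo, and the function name `permutation_pvalue` is hypothetical. It is offered as one possible check, not as the method used in the paper.

```python
import random
from statistics import median

def permutation_pvalue(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of medians."""
    rng = random.Random(seed)
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(median(perm_a) - median(perm_b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Made-up surprise scores for illustration only
group_a = [36.1, 36.9, 35.8, 37.2, 36.5, 36.0, 37.0, 36.4]
group_b = [36.3, 36.8, 36.2, 37.1, 36.6, 35.9, 36.7, 36.5]

print(permutation_pvalue(group_a, group_b))
```

With the demo's data, one could pass `female_qs.tolist()` and `male_qs.tolist()` in place of the made-up lists. A large p-value would be consistent with the run-to-run sign flips described above.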

Looking at the most and least surprising questions posed to each gender, we can see that the surprise scores assigned seem to make sense. The least surprising questions seem to relate well to the game of tennis while the most surprising focus on other things such as fashion choices or social lives.

In [19]:
sorted_female_qs = female_qs.sort_values().keys()
sorted_male_qs = male_qs.sort_values().keys()

In [20]:
for utt in sorted_female_qs[:5]:
    print(interview_corpus.get_utterance(utt).text)

Does she give you a program of what you need to do?
How did you feel today? Sleepy? Awake? Energetic?
Congratulations. That seemed like a very strong win for you. How did you feel about your performance?
That could be against Sabine. I don't know the score right now.
That was a highlevel match today. You start off pretty well and fall off. I saw the entire match. What happened then? Because you had more troubles for the serve, I think.


In [21]:
for utt in sorted_male_qs[:5]:
    print(interview_corpus.get_utterance(utt).text)

You had a bit of a slow start. Was that nerves out there?
Is it fair to say that last year you had this opponent who beat you, is that in your mind or is it old or do you want to beat him more?
Rafael Nadal said something about a special tax regime in UK tournaments yesterday. What is your feeling about that issue?
Is that a sign of lack of confidence?
When you're remembering it, are you seeing it, too?


In [22]:
for utt in sorted_female_qs[-1:-6:-1]:
    print(interview_corpus.get_utterance(utt).text)

With this win and also the win at the US Open, wondering if you think that you have an edge if this match goes to three sets against her? Just physically and mentally seems you've been able to outlast her in these tough, grueling matches.
He didn't say Carlos sent an email or anything?
Gulbis recently talked about top players in general?
This is only your third Grand Slam main draw match. Do you think even without all the difficulties you've had this year you would still consider that a reasonably good success rate, to win your third main draw Grand Slam match?
Your ambitions for the year have been enhanced by this tournament?  What had you hoped to do first of the year coming into here?


In [23]:
for utt in sorted_male_qs[-1:-6:-1]:
    print(interview_corpus.get_utterance(utt).text)

Janko Tipsarevic came through after being down two sets. He said he hates the idea of being called a top10 player because you have this expectation that you're going to walk in and crush everybody. He felt that his energy level was down. Do you have to worry at all in these early rounds that you're overconfident going into matches?
Did you take anything from John Millman's performance last night?
One of the TV guys told me you said you had blistered hands.
Based on today's performance, do you think you can repeat your good result here last year?
Can you give us an assessment of what you've been doing differently this year regarding previous years at Wimbledon?


## Part 2: How surprising is a question compared to all questions posed to male players and all questions posed to female players?

Let's see how surprising questions are compared to questions posed to players of each gender. To do this, we'll want to make our `model_key_selector` return a key based on the player's gender. Recall that we added `'player_gender'` as a metadata field to each question earlier.

In [29]:
gender_models_surp = Surprise(model_key_selector=lambda utt: utt.meta['player_gender'],
                              target_sample_size=10, context_sample_size=5000,
                              surprise_attr_name='surprise_gender_model')

In [25]:
gender_models_surp = gender_models_surp.fit(interview_corpus, 
                                            selector=lambda utt: utt.meta['is_question'])

Since we want to compute surprise for each question based on both the male interview questions model and the female interview questions model, we will use the `group_and_models` parameter of the `transform` function. Each utterance should belong to its own group and be compared to both the `'M'` and `'F'` gender models.

Since each utterance belongs to only one group, we want the surprise attribute keys to just correspond to the model. We use the `group_model_attr_key` parameter to define this. This attribute takes in a group name (which will be the utterance id) and a model key (which will be either 'M' or 'F') and returns the corresponding key that should be added to the surprise metadata. For this case, we simply return the model key.
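
The interplay of `group_and_models` and `group_model_attr_key` can be pictured with plain dictionaries. This is a simplified sketch, not ConvoKit's code: the helper `build_surprise_meta`, the stand-in scores, and the utterance id `'q1'` are hypothetical, and the functions here take an id rather than an utterance object.

```python
def build_surprise_meta(utt_id, group_and_models, group_model_attr_key, score_fn):
    """Return the dict of surprise scores that would be stored for one
    utterance, keyed by group_model_attr_key(group, model)."""
    group, models = group_and_models(utt_id)
    return {group_model_attr_key(group, m): score_fn(utt_id, m) for m in models}

# Stand-in scores; a real run would compute cross entropy against each model
fake_scores = {('q1', 'M'): 5.8, ('q1', 'F'): 5.7}

def score_fn(utt_id, model):
    return fake_scores[(utt_id, model)]

# Each utterance is its own group and is compared to both gender models
def group_and_models(utt_id):
    return utt_id, ['M', 'F']

# Keying by model only, as in this demo
by_model = build_surprise_meta('q1', group_and_models, lambda _, m: m, score_fn)
print(by_model)

# Keying by group and model together, for comparison
by_both = build_surprise_meta('q1', group_and_models, lambda g, m: f'{g}:{m}', score_fn)
print(by_both)
```

Keying by model alone is what lets the results cells look up scores with `x['M']` and `x['F']` regardless of the utterance id.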

In [28]:
convokit_lm = ConvoKitLanguageModel(smooth=True)
pp.pprint(convokit_lm.config)

In [None]:
subset_corpus = \
    gender_models_surp.transform(subset_corpus, obj_type='utterance', 
                                 group_and_models=lambda utt: (utt.id, ['M', 'F']), 
                                 group_model_attr_key=lambda _, m: m,
                                 selector=lambda utt: utt.meta['is_question'], 
                                 language_model=convokit_lm, eval_type='cross_entropy')

### Results
Let's take a look at the surprise scores. We see that questions posed to players of one gender are, on average, more surprising when compared against the model of questions posed to the other gender than against the model for their own gender. From this we can surmise that there may be some difference in the types of questions posed to each gender.

In [35]:
utterances = subset_corpus.get_utterances_dataframe(selector=lambda utt: utt.meta['is_question'])

In [36]:
utterances[utterances['meta.player_gender'] == 'F'] \
          ['meta.surprise_gender_model'].map(lambda x: x['M']).dropna().mean()

5.820603795831438

In [37]:
utterances[utterances['meta.player_gender'] == 'F'] \
          ['meta.surprise_gender_model'].map(lambda x: x['F']).dropna().mean()

5.773426120991382

In [38]:
utterances[utterances['meta.player_gender'] == 'M'] \
          ['meta.surprise_gender_model'].map(lambda x: x['M']).dropna().mean()

5.79686635301196

In [39]:
utterances[utterances['meta.player_gender'] == 'M'] \
          ['meta.surprise_gender_model'].map(lambda x: x['F']).dropna().mean()

5.830856237083713
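
The four means reported above can be collected into one table to make the comparison explicit. The sketch below copies the printed values, rounded to four decimals, and computes for each player gender how much more surprising questions are under the other gender's model. The variable names are illustrative.

```python
# Mean surprise copied from the outputs above (rounded),
# keyed by (player gender of the question, gender model)
mean_surprise = {
    ('F', 'M'): 5.8206, ('F', 'F'): 5.7734,
    ('M', 'M'): 5.7969, ('M', 'F'): 5.8309,
}

gaps = {}
for gender, other in (('F', 'M'), ('M', 'F')):
    same = mean_surprise[(gender, gender)]
    cross = mean_surprise[(gender, other)]
    gaps[gender] = cross - same
    print(f'{gender} questions: same-gender model {same:.4f}, '
          f'other-gender model {cross:.4f}, gap {gaps[gender]:+.4f}')
```

Both gaps are positive but small, so whether they reflect a real difference would need a significance test over the per-question scores.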