<a href="https://colab.research.google.com/github/Specril/n-gram-machine-learning-language-model/blob/main/Pset_2_RT_and_surprisal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This executable notebook will guide you through Pset_2 - The Relationship between Surprisal and RTs:

---

Reminder, a few Colab-specific things to note about execution before we get started:

- Google offers free compute (including GPU compute!) on this notebook, but *only for a limited time*. Your session will be automatically closed after 12 hours. That means you'll want to finish within 12 hours of starting, or make sure to save your intermediate work (see the next bullet).
- You can save and write files from this notebook, but they are *not guaranteed to persist*. For this reason, we'll mount a Google Drive account and write to that Drive when any files need to be kept permanently (e.g. model checkpoints, surprisal data, etc.).
- You should keep this tab open until you're completely finished with the notebook. If you close the tab, your session will be marked as "Idle" and may be terminated.

# Getting started

**First**, make a copy of this notebook so you can make your own changes. Click *File -> Save a copy in Drive*.

### What you need to do

Read through this notebook and execute each cell in sequence, making modifications and adding code where necessary. You should execute all of the code as instructed, and make sure to write code or textual responses wherever the text **TODO** shows up in text and code cells.

When you're finished, download the notebook as a PDF file by running the script in the last cell, or alternatively download it as an .ipynb file and locally convert it to PDF.


### Load ngram surprisals


Let's fetch the `ngram` surprisal file:

In [None]:
import pandas as pd
surprisals = pd.read_csv('https://gist.githubusercontent.com/omershubi/f19f77f5157f7ba7ea1adf72a72847da/raw/d5d553b1217ea70fe3261ce5d9a0532f29769817/5gram_surprisals.tsv', index_col=False, sep='\t')
surprisals

Unnamed: 0,sentence_id,token_id,token,surprisal
0,1,1,In,4.57937
1,1,2,<unk>,7.45049
2,1,3,County,12.65410
3,1,4,<unk>,6.11317
4,1,5,near,12.22380
...,...,...,...,...
7693,464,17,a,3.23962
7694,464,18,leader,12.81650
7695,464,19,and,5.90348
7696,464,20,<unk>,4.62292


### Load RT data

Let's also fetch the Brown_RTs dataset and see what it looks like:



In [None]:
sprt = pd.read_csv('https://gist.githubusercontent.com/omershubi/01b55eab89b81dc882055e0d27d61016/raw/046dbb7f0586b5dc1a368ee882f2cb923caad3df/brown-spr-data-for-pset.csv', index_col=0).sort_values(by='code')
sprt

Unnamed: 0,word,code,subject,text_id,text_pos,word_in_exp,time
2286,In,17000,s001,0,0,2285,399.90
109460,In,17000,s028,0,0,2503,290.32
50709,In,17000,s014,0,0,1394,501.59
80486,In,17000,s021,0,0,2525,210.93
35626,In,17000,s010,0,0,579,862.35
...,...,...,...,...,...,...,...
79391,captain.,35763,s021,12,763,1430,425.18
116505,captain.,35763,s030,12,763,1489,383.32
26975,captain.,35763,s007,12,763,3426,506.40
15206,captain.,35763,s004,12,763,3528,669.29


## Harmonize N-gram surprisal and RT data


We have the model-derived surprisal values. To align them with human reading times, complete the following cell. This will give us a data frame containing both metrics in sync.


In `surprisals`, each row represents a word. In `sprt`, each row represents a word that was displayed in a trial. Therefore, `sprt` contains multiple rows for each word, one per subject.

Note that the words appear in the same order in both files (i.e. they both start with 'In', then 'Ireland's'/'\<unk\>', then 'County', and so on).
However, there are differences, such as a special token for end of sentence which appears only in `surprisals`, among others.

See the PDF instructions for more details.
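
Because `sprt` holds one row per subject per word, one plausible first step is to average reading times over subjects. The sketch below does this on a small made-up table, not on the real data. It assumes that `code` identifies a single word position, and the toy values are invented for illustration. Token alignment with `surprisals` (for example, handling the end-of-sentence token) is not covered here.

```python
import pandas as pd

# Toy stand-in for `sprt` (invented values): two subjects read the same two words.
toy_rt = pd.DataFrame({
    "word": ["In", "In", "County", "County"],
    "code": [17000, 17000, 17002, 17002],
    "subject": ["s001", "s002", "s001", "s002"],
    "time": [400.0, 300.0, 350.0, 450.0],
})

# Average over subjects so that each word position (`code`) has exactly one row.
# Assumption: `code` uniquely identifies a word position in the experiment.
mean_rt = (
    toy_rt.groupby("code", sort=True)
    .agg(word=("word", "first"), mean_rt=("time", "mean"))
    .reset_index()
)
print(mean_rt)
```

The resulting `mean_rt` column name matches the one used by the plotting cells later in this notebook.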


In [None]:
def harmonize(rt_data: pd.DataFrame, surprs_data: pd.DataFrame) -> pd.DataFrame:
    # TODO
    return pd.DataFrame()

harmonized_df = harmonize(sprt, surprisals)
harmonized_df

When you are done with this step, save the result using the following code:

In [None]:
harmonized_df.to_csv("harmonized_ngram.csv")

Great, now you're ready to start doing analysis on this output data!

# Analyses

Now that we've obtained our harmonized surprisal-vs-RT files, let's perform some analysis on the data.

## 1. Univariate linear regression
Here is an overview of the analysis we want you to run.

* For each `metric` in `{surprisal, raw_probability}`:

    * Fit a linear regression model to predict RTs from the metric. Report the coefficient for the metric term (slope) with its $t$-score and $p$-value (to determine whether it is significantly different from 0), as well as the $R^2$ score (the coefficient of determination) of the model;
    * Draw a metric-RT scatterplot with best-fit line, **without** binning RT values; and
    * Draw a metric-RT scatterplot with best-fit line, **with** binning RT values.


### `Metric` = `Surprisal`


Fit a linear regression

In [None]:
import numpy as np
import statsmodels.api as sm
import pandas as pd

data = pd.read_csv("harmonized_ngram.csv")

# Fit and summarize OLS model
#TODO

lin_model= #TODO

print(lin_model.summary())

The `.summary()` method outputs a variety of metrics and statistical tests. Here we are interested in the model's parameters (the coefficients), their $t$-scores, and the corresponding $p$-values, as well as in the overall $R^2$ score of the model.

Now let's create a scatterplot of our data accompanied by the best-fit line

Without Binning:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style="white", color_codes=True)

g = sns.jointplot(x="surprisal", y="mean_rt", data=data, kind='reg')
# We're going to make the regression line red so it's easier to see
regline = g.ax_joint.get_lines()[0]
regline.set_color('red')



With Binning:

In [None]:
g = sns.regplot(x="surprisal", y="mean_rt", data=data, x_bins=15)
g.set_ylim([250, 350])


### `Metric` = `Raw_probability`


After running the code cells above, your next task is to reproduce this analysis for `metric=raw_probability`.

Note that you can transform the surprisal values in the data frames by simply applying standard math and `numpy` operators. For example, this code takes each surprisal value to the power of 3 and adds 0.1:

In [None]:
np.power(data.surprisal, 3) + 0.1
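
For reference, surprisal is defined as $s = -\log_b p$, so the probability is recovered as $p = b^{-s}$. The sketch below shows this inversion on made-up values for two candidate bases. Which base applies to the surprisal file used here is an assumption you should check against the PDF instructions.

```python
import numpy as np

# Invented surprisal values, for illustration only.
toy_surprisal = np.array([0.0, 1.0, 2.0, 10.0])

# If surprisal is measured in bits (log base 2): p = 2 ** (-s).
prob_if_bits = np.power(2.0, -toy_surprisal)

# If surprisal is measured in nats (natural log): p = exp(-s).
prob_if_nats = np.exp(-toy_surprisal)

print(prob_if_bits)
print(prob_if_nats)
```

Either way, a surprisal of 0 maps to probability 1, and larger surprisals map to smaller probabilities.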

In [None]:
### TODO: Your code here
# Compute contextual word probabilities from the surprisal data,
# and repeat the analysis of the above section.

### Interpret the results


* Does the univariate analysis support the hypothesis of a linear
relationship between word surprisal and word reading time?
* Is that hypothesis better or
worse than an alternative hypothesis of a linear relationship between raw word probability and word reading time?
* Are there other alternative hypotheses that might be even more
compelling given the data?


**TODO**: Your answer



## 2. Multiple regression analysis: adding control variables

In this stage we want to add two control variables to our linear model and reexamine the effect of surprisal *above and beyond* these variables. The two variables are **word-length** and **word log-frequency**.

First, write code that creates those variables.


Word-length:

In [None]:
### TODO: your code here
# calculate the word-length for each word in the dataset and add this information as a new column in harmonized_ngram.csv

Word log-frequency:

For each word $w_i$ in our `harmonized_ngram.csv` dataset, we want to obtain $\log(\mathrm{frequency}(w_i))$ using a different, large corpus of text. You will first download the *tokenized* version of the **PTB** dataset (no other preprocessing is needed) and then write code to calculate each word's log-frequency.
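
As a rough illustration of the counting step, the sketch below computes log-frequencies from a tiny invented corpus using only the standard library. The helper `log_freq` and the choice to return NaN for unseen words are assumptions made for this example. How to treat words missing from PTB (or mapped to `<unk>`) is a decision left to you.

```python
import math
from collections import Counter

# Tiny invented corpus, already tokenized by whitespace.
toy_corpus = ["the cat sat on the mat .", "the dog sat ."]

# Count every whitespace-separated token across all lines.
counts = Counter(tok for line in toy_corpus for tok in line.split())

def log_freq(word, counts):
    """Hypothetical helper: natural log of the raw count, NaN if unseen."""
    c = counts.get(word, 0)
    return math.log(c) if c > 0 else float("nan")

print(counts["the"], log_freq("the", counts))
print(log_freq("zebra", counts))
```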

In [None]:
# Downloads ptb_tok_train.txt
!wget -qO ptb_tok_train.txt https://gist.githubusercontent.com/omershubi/cdd4231472d6188f03ab21e2b2729fee/raw/e1b4c764561fd038470830534baaa220b0eb4c6d/ptb_tok_train.txt
!head ptb_tok_train.txt

In an Oct. 19 review of `` The Misanthrope '' at Chicago 's Goodman Theatre -LRB- `` <unk> <unk> Take the Stage in <unk> City , '' Leisure & Arts -RRB- , the role of Celimene , played by Kim <unk> , was mistakenly attributed to Christina Haag .
Ms. Haag plays <unk> .
Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990 .
The luxury auto maker last year sold <unk> cars in the U.S.
Howard <unk> , president and chief executive officer , said he anticipates growth for the luxury auto maker in Britain and Europe , and in Far Eastern markets .
<unk> INDUSTRIES Inc. increased its quarterly to 10 cents from seven cents a share .
The new rate will be payable Feb. 15 .
A record date has n't been set .
Bell , based in Los Angeles , makes and distributes electronic , computer and building products .
Investors are appealing to the Securities and Exchange Commission not to limit their access to information about stock purchases and sales by corporat

In [None]:
### TODO: your code here
# calculate log frequencies and add the information as a new column in harmonized_ngram.csv

***Multiple regression analysis:***

Based on the code above (section 1: univariate linear regression), write new code for a multiple regression analysis.
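
One way to express a model with several predictors is the `statsmodels` formula interface, sketched below on synthetic data. The column names `word_length` and `log_freq` are placeholders chosen for this example, and the data are randomly generated. Adapt the names to whatever columns you created above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with known coefficients, for illustration only.
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "surprisal": rng.uniform(0, 15, n),
    "word_length": rng.integers(1, 12, n),   # placeholder column name
    "log_freq": rng.uniform(0, 10, n),       # placeholder column name
})
toy["mean_rt"] = (
    280
    + 1.5 * toy["surprisal"]
    + 3.0 * toy["word_length"]
    - 1.0 * toy["log_freq"]
    + rng.normal(0, 10, n)
)

# The formula interface adds the intercept automatically.
multi = smf.ols("mean_rt ~ surprisal + word_length + log_freq", data=toy).fit()

# Coefficients are indexed by term name, e.g. multi.params["surprisal"].
print(multi.params)
print(multi.pvalues)
```

The surprisal coefficient in such a model estimates the effect of surprisal with the other predictors held fixed.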

In [None]:
###TODO: your code here

### Interpret the results
* How does the surprisal coefficient of this model compare to the
surprisal coefficient in the univariate model?
* Does your conclusion regarding the effect of
surprisal on RTs from the univariate analysis still hold?



**TODO**: Your answer

# Export to PDF

Run the following cell to download the notebook as a nicely formatted PDF file.

In [None]:
# Add this to a new cell at the end of the notebook and run the following code,
# which will save the notebook as a PDF in your Google Drive (allow the permissions) and download it automatically.

!wget -nc https://raw.githubusercontent.com/omershubi/colab-pdf/master/colab_pdf.py

from colab_pdf import colab_pdf

# If you saved the notebook in the default location in your Google Drive,
#  and didn't change the name of the file, the code should work as is. If not, adapt accordingly.
# E.g. in your case the file name may be "Copy of XXXX.ipynb"

colab_pdf(file_name='Pset_2_RT_and_surprisal.ipynb', notebookpath="drive/MyDrive/Colab Notebooks")