# Part 1: Understanding the Pipeline - Input → Output

## 🎯 What Actually Happens?

Let's follow the REAL pipeline step by step:

```
Step 1: INPUT  → JSON file (text with context/target pairs)
Step 2: MODEL  → Reads JSON, calculates surprisal
Step 3: OUTPUT → scores.json (surprisal per word)
Step 4: MERGE  → Combine with eye-tracking CSV
Step 5: STATS  → Regression analysis → PPP score
```

## Let's see each piece of REAL data!

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (14, 6)

## STEP 1: MODEL INPUT - What does calc_surprisal_hf.py receive?

**File**: `data/DC/ngram_2-contextfunc_delete.json`

This is the ACTUAL input to the model. Let's see what's inside:

In [2]:
# Load the ACTUAL model input
json_path = '../data/DC/ngram_2-contextfunc_delete.json'
with open(json_path, 'r') as f:
    model_input = json.load(f)

print("🔵 MODEL INPUT (what calc_surprisal_hf.py reads)")
print("="*70)
print(f"Type: {type(model_input)}")
print(f"Number of articles: {len(model_input)}")
print(f"Article IDs: {list(model_input.keys())[:5]}...")
print(f"\nStructure: dict[article_id] = list of (context, target) tuples")
print("="*70)


article_1 = model_input['1']
print(f"\n Article '1' contains {len(article_1)} word pieces\n")

🔵 MODEL INPUT (what calc_surprisal_hf.py reads)
Type: <class 'dict'>
Number of articles: 20
Article IDs: ['1', '2', '3', '4', '5']...

Structure: dict[article_id] = list of (context, target) tuples

 Article '1' contains 2573 word pieces



In [3]:
# Show first 5 actual (context, target) pairs
print("First 5 (context, target) pairs that the model will process:")
print("-"*70)
for i, (context, target) in enumerate(article_1[:5]):
    context_display = context.replace('▁', ' ').strip() if context.strip() else '<EMPTY>'
    target_display = target.replace('▁', ' ').strip()
    print(f"{i+1}. Context: '{context_display}'")
    print(f"   Target:  '{target_display}'")
    print()

First 5 (context, target) pairs that the model will process:
----------------------------------------------------------------------
1. Context: '<EMPTY>'
   Target:  'Are'

2. Context: 'Are'
   Target:  'tourists'

3. Context: 'tourists'
   Target:  'ent iced'

4. Context: 'ent iced'
   Target:  'by'

5. Context: 'by'
   Target:  'these'



## What does this mean?

**Context**: What the model can see (previous words)  
**Target**: What the model must predict (next word)

For 2-gram: context = only the LAST 1 word (that's why many contexts are short!)
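
To make the truncation concrete, here is a minimal sketch of how (context, target) pairs could be built for an n-gram setting. It is an illustration only: `make_pairs` is a hypothetical helper, it works on whole words rather than the subword pieces stored in the real JSON, and it is not the code that generated `ngram_2-contextfunc_delete.json`.

```python
def make_pairs(words, n):
    """Build (context, target) pairs, keeping at most n-1 previous words as context."""
    pairs = []
    for i, target in enumerate(words):
        start = max(0, i - (n - 1))          # first word still visible to the model
        context = " ".join(words[start:i])   # empty string for the very first word
        pairs.append((context, target))
    return pairs

words = ["Are", "tourists", "enticed", "by", "these"]
for context, target in make_pairs(words, n=2):
    print(f"context={context!r:12} target={target!r}")
```

With `n=2` every context is a single word (or empty for the first word), which mirrors the pairs printed above. Raising `n` widens the window.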

The model's job: given the context, calculate the probability of the target → convert it to surprisal (the negative log of that probability)
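
The probability-to-surprisal conversion can be sketched in a few lines. This is a generic illustration, not the code in `calc_surprisal_hf.py`: the `surprisal` function is hypothetical, and the log base (2, giving bits) is an assumption, since the base the script uses is not shown here.

```python
import math

def surprisal(prob, base=2.0):
    """Surprisal of an event with probability `prob`; base 2 gives bits (assumed base)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log(prob) / math.log(base)

# Less probable targets get larger surprisal values
for p in (0.5, 0.1, 0.001):
    print(f"P(target | context) = {p:<6} -> surprisal = {surprisal(p):.2f}")
```

A target the model considers certain (probability 1) has zero surprisal, and the value grows without bound as the probability approaches zero.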

## STEP 2: MODEL OUTPUT - What does calc_surprisal_hf.py produce?

**File**: `surprisals/quick-test/arch_gpt2-ngram_2-contextfunc_delete/scores.json`

After running the model, we get surprisal scores. Let's look at the ACTUAL output:

In [4]:
# Load the ACTUAL model output from our quick test
scores_path = '../surprisals/quick-test/arch_gpt2-ngram_2-contextfunc_delete/scores.json'
with open(scores_path, 'r') as f:
    model_output = json.load(f)

print("MODEL OUTPUT (what calc_surprisal_hf.py produces)")
print("="*70)
print(f"Type: {type(model_output)}")
print(f"Number of articles: {len(model_output)}")
print(f"Article IDs: {list(model_output.keys())[:5]}...")
print(f"\nStructure: dict[article_id] = list of surprisal values (floats)")
print("="*70)

# Look at Article 1 output
article_1_scores = model_output['1']
print(f"\n Article '1' has {len(article_1_scores)} surprisal scores\n")

# Show first 10 surprisal values
print("First 10 surprisal scores:")
print("-"*70)
for i, score in enumerate(article_1_scores[:10]):
    print(f"Word {i+1}: surprisal = {score:.4f}")

print(f"\n Higher surprisal = model was more 'surprised' (word was unexpected)")

MODEL OUTPUT (what calc_surprisal_hf.py produces)
Type: <class 'dict'>
Number of articles: 20
Article IDs: ['1', '2', '3', '4', '5']...

Structure: dict[article_id] = list of surprisal values (floats)

 Article '1' has 2573 surprisal scores

First 10 surprisal scores:
----------------------------------------------------------------------
Word 1: surprisal = 12.3294
Word 2: surprisal = 11.6519
Word 3: surprisal = 11.4062
Word 4: surprisal = 1.1518
Word 5: surprisal = 6.3439
Word 6: surprisal = 10.7538
Word 7: surprisal = 11.1548
Word 8: surprisal = 5.6977
Word 9: surprisal = 6.3907
Word 10: surprisal = 14.4223

 Higher surprisal = model was more 'surprised' (word was unexpected)
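
Since each (context, target) pair should yield exactly one surprisal value (2573 pairs and 2573 scores for article '1'), a quick alignment check across all articles can catch a truncated or mismatched run. The sketch below uses a hypothetical helper, `check_alignment`, and toy dictionaries in place of the two real JSON files.

```python
def check_alignment(model_input, model_output):
    """Return article IDs whose number of scores differs from the number of input pairs."""
    mismatched = []
    for article_id, pairs in model_input.items():
        scores = model_output.get(article_id)
        if scores is None or len(scores) != len(pairs):
            mismatched.append(article_id)
    return mismatched

# Toy stand-ins for the two JSON files (illustrative values, not real data)
toy_input = {"1": [["", "Are"], ["Are", "tourists"]], "2": [["", "The"]]}
toy_output = {"1": [12.3, 11.7], "2": []}
print(check_alignment(toy_input, toy_output))
```

In the notebook, the same function could be called as `check_alignment(model_input, model_output)`; an empty list means every article lines up.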


## STEP 3: HUMAN DATA - What do we compare against?

**File**: `data/DC/all.txt.annotation.filtered.csv`

This contains REAL human reading times. Let's see the ACTUAL structure:

In [5]:
# Load the ACTUAL human reading data
csv_path = '../data/DC/all.txt.annotation.filtered.csv'
human_data = pd.read_csv(csv_path, sep='\t')

print("🟢 HUMAN DATA (what dundee.py uses for regression)")
print("="*70)
print(f"Shape: {human_data.shape} (rows × columns)")
print(f"Total observations: {len(human_data):,}")
print(f"\nKey columns:")
print(f"  - article: which article (1-20)")
print(f"  - subj_id: which person read it")  
print(f"  - time: gaze duration in milliseconds (THE TARGET VARIABLE)")
print(f"  - length: word length")
print(f"  - log_gmean_freq: word frequency")
print("="*70)

# Show first 10 rows - ONLY relevant columns
key_cols = ['article', 'subj_id', 'wnum', 'time', 'length', 'log_gmean_freq', 'pos']
print(f"\nFirst 10 rows (article 1, subject {human_data['subj_id'].iloc[0]}):")
print(human_data[human_data['article']==1].head(10)[key_cols].to_string(index=False))

🟢 HUMAN DATA (what dundee.py uses for regression)
Shape: (515010, 106) (rows × columns)
Total observations: 515,010

Key columns:
  - article: which article (1-20)
  - subj_id: which person read it
  - time: gaze duration in milliseconds (THE TARGET VARIABLE)
  - length: word length
  - log_gmean_freq: word frequency

First 10 rows (article 1, subject sf):
 article subj_id  wnum  time  length  log_gmean_freq  pos
       1      sf     1     0       3        7.684325  VBP
       1      sf     2   294       8        7.246369  NNS
       1      sf     3   364       7        7.834414  VBN
       1      sf     4     0       2       13.302279   IN
       1      sf     5   234       5       10.752163   DT
       1      sf     6   322      11        6.527959  NNS
       1      sf     7   307      11        7.180071  VBG
       1      sf     8   256       5       12.134303 PRP$
       1      sf     9     0       4       10.273982   JJ
       1      sf    10   312      10        8.638401   NN

## STEP 4: MERGE - How does convert_scores.py combine them?

**Script**: `convert_scores.py`

It takes scores.json and writes 4 columns to scores.csv:
- `surprisals_sum` (current word surprisal)
- `surprisals_sum_prev_1` (previous word surprisal)
- `surprisals_sum_prev_2` (2 words ago surprisal)
- `surprisals_sum_prev_3` (3 words ago surprisal)

Let's see the ACTUAL merged output:

In [6]:
# Load the ACTUAL merged output
scores_csv_path = '../surprisals/quick-test/arch_gpt2-ngram_2-contextfunc_delete/scores.csv'
merged_scores = pd.read_csv(scores_csv_path, sep='\t')

print(" MERGED DATA (convert_scores.py output)")
print("="*70)
print(f"Shape: {merged_scores.shape}")
print(f"Columns: {list(merged_scores.columns)}")
print("="*70)

print("\nFirst 10 rows:")
print(merged_scores.head(10).to_string(index=False))

print("\n💡 These surprisal columns will be added to the human data CSV")

 MERGED DATA (convert_scores.py output)
Shape: (515010, 4)
Columns: ['surprisals_sum', 'surprisals_sum_prev_1', 'surprisals_sum_prev_2', 'surprisals_sum_prev_3']

First 10 rows:
 surprisals_sum  surprisals_sum_prev_1  surprisals_sum_prev_2  surprisals_sum_prev_3
      12.329370               7.502281               7.502281               7.502281
      11.651941              12.329370               7.502281               7.502281
      11.406207              11.651941              12.329370               7.502281
       1.151818              11.406207              11.651941              12.329370
       6.343907               1.151818              11.406207              11.651941
      10.753804               6.343907               1.151818              11.406207
      11.154766              10.753804               6.343907               1.151818
       5.697666              11.154766              10.753804               6.343907
       6.390707               5.697666              11.15
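
The lag columns follow a simple shifting pattern: each `prev_k` column holds the surprisal from k rows earlier. The sketch below reproduces that pattern in plain Python. It is not the code from `convert_scores.py`; `add_lags` is a hypothetical helper, and the fill value 7.502281 is simply copied from the first rows of the output above (how the script derives it is not shown here, so it is passed in as a parameter).

```python
def add_lags(scores, n_lags=3, fill=0.0):
    """Return rows [current, prev_1, ..., prev_n]; positions before the start get `fill`."""
    rows = []
    for i, current in enumerate(scores):
        lags = [scores[i - k] if i - k >= 0 else fill for k in range(1, n_lags + 1)]
        rows.append([current] + lags)
    return rows

# First four surprisal values and the fill value, copied from the output above
scores = [12.329370, 11.651941, 11.406207, 1.151818]
for row in add_lags(scores, n_lags=3, fill=7.502281):
    print(row)
```

Each printed row matches the corresponding row of scores.csv shown above: the current surprisal first, then the values from one, two, and three words back.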

## STEP 5: REGRESSION - What does dundee.py do?

**Script**: `dundee.py`

It combines the human CSV with surprisal scores and runs 2 regression models:

**Baseline**: `time ~ length * freq + controls + random_effects`  
**Test**: `time ~ length * freq + surprisals_sum + controls + random_effects` (adds `surprisals_sum`)

Then compares: Does adding surprisal improve prediction?

Let's see the ACTUAL output:

In [7]:
# Load the ACTUAL regression output
likelihood_path = '../surprisals/quick-test/arch_gpt2-ngram_2-contextfunc_delete/likelihood.txt'
with open(likelihood_path, 'r') as f:
    output_text = f.read()

print("FINAL OUTPUT (dundee.py result)")
print("="*70)
print(output_text)
print("="*70)

# Parse the values
lines = output_text.strip().split('\n')
for line in lines:
    if 'delta_linear_fit_logLik' in line:
        ppp = float(line.split(':')[1].strip())
        print(f"\n✅ PPP (Psychometric Predictive Power) = {ppp:.6f}")
        print(f"   This means: Adding surprisal improved the model!")
        print(f"   Higher PPP = surprisal predicts human reading time better")

FINAL OUTPUT (dundee.py result)
linear_fit_logLik: -5.94250763908901
delta_linear_fit_logLik: 0.007719792489860211
delta_linear_fit_chi_p: 0.0

# Additional metrics:
base_loglik_total: -1265309.9130978151
test_loglik_total: -1263668.306944639
lr_statistic: 3283.2123063523322
n_observations: 212649


✅ PPP (Psychometric Predictive Power) = 0.007720
   This means: Adding surprisal improved the model!
   Higher PPP = surprisal predicts human reading time better
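
The numbers in `likelihood.txt` are internally consistent in a way that helps interpret PPP: the reported `delta_linear_fit_logLik` equals the total log-likelihood gain divided by the number of observations, and `lr_statistic` is twice that gain. The sketch below checks this against the output shown above. It is an observation about this one file, not a description of how `dundee.py` computes the values, and `parse_metrics` is a hypothetical helper.

```python
likelihood_txt = """\
linear_fit_logLik: -5.94250763908901
delta_linear_fit_logLik: 0.007719792489860211
delta_linear_fit_chi_p: 0.0

# Additional metrics:
base_loglik_total: -1265309.9130978151
test_loglik_total: -1263668.306944639
lr_statistic: 3283.2123063523322
n_observations: 212649
"""

def parse_metrics(text):
    """Parse 'key: value' lines into a dict of floats, skipping blanks and comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        metrics[key.strip()] = float(value)
    return metrics

m = parse_metrics(likelihood_txt)
gain = m["test_loglik_total"] - m["base_loglik_total"]  # total log-likelihood gain
ppp = gain / m["n_observations"]                         # gain per observation
lr = 2 * gain                                            # likelihood-ratio statistic

print(f"per-observation gain: {ppp:.6f}")
print(f"2 * gain:             {lr:.4f}")
```

Read this way, a PPP of 0.0077 is the average improvement in log-likelihood per observation that the regression gains from the surprisal term.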


## COMPLETE PIPELINE SUMMARY

```
INPUT:  ngram_2-contextfunc_delete.json
        ↓ (context, target) pairs
        
STEP 1: calc_surprisal_hf.py -m gpt2 -d INPUT -o OUTPUT
        Model reads JSON, calculates surprisal
        ↓
        
OUTPUT: scores.json  
        {article_id: [surprisal1, surprisal2, ...]}
        ↓
        
STEP 2: convert_scores.py --dir OUTPUT
        Add lag features (prev_1, prev_2, prev_3)
        ↓
        
OUTPUT: scores.csv
        [surprisals_sum, surprisals_sum_prev_1, surprisals_sum_prev_2, surprisals_sum_prev_3]
        ↓
        
STEP 3: dundee.py OUTPUT/
        Merge with human CSV (all.txt.annotation.filtered.csv)
        Run regression: time ~ surprisal + length + freq + ...
        ↓
        
OUTPUT: likelihood.txt
        PPP = delta_logLik (how much surprisal helps predict reading time)
```
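
The three commands in the summary can be assembled programmatically, which helps when the same steps are repeated for many input files. The sketch below only builds and prints the commands. The script names and flags are taken from the summary above; the `python` launcher prefix, the `build_commands` helper, and the assumption that `-o`, `--dir`, and the `dundee.py` argument all take the same output directory are illustrative guesses, not confirmed behavior.

```python
import shlex

def build_commands(input_json, output_dir, model="gpt2"):
    """Return the three pipeline commands as argv lists, in execution order."""
    return [
        ["python", "calc_surprisal_hf.py", "-m", model, "-d", input_json, "-o", output_dir],
        ["python", "convert_scores.py", "--dir", output_dir],
        ["python", "dundee.py", output_dir],
    ]

commands = build_commands(
    "data/DC/ngram_2-contextfunc_delete.json",
    "surprisals/quick-test/arch_gpt2-ngram_2-contextfunc_delete",
)
for argv in commands:
    # Print only; to execute, each argv could be passed to subprocess.run(argv, check=True)
    print(shlex.join(argv))
```

Keeping the commands as argv lists avoids shell-quoting problems if a path ever contains spaces.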

### Real Files in Our Test:
✅ Input: `data/DC/ngram_2-contextfunc_delete.json` (20 articles; 2573 word pieces in article 1)  
✅ Output 1: `surprisals/.../scores.json` (20 articles; 2573 surprisal values for article 1)  
✅ Output 2: `surprisals/.../scores.csv` (515,010 rows × 4 columns)  
✅ Human data: `data/DC/all.txt.annotation.filtered.csv` (515,010 observations)  
✅ Final: `surprisals/.../likelihood.txt` (PPP = 0.0077)

## ✅ Now Do You Understand?

### The Pipeline in Simple Terms:

**INPUT**: Text with limited context (JSON)  
**MODEL**: GPT-2 calculates "how surprised am I by each word?"  
**OUTPUT**: Surprisal scores (JSON → CSV)  
**MERGE**: Combine surprisal with human reading times  
**ANALYSIS**: Does surprisal predict reading time? → **YES! PPP = 0.0077**

### Key Files You Just Saw:
1. `ngram_2-contextfunc_delete.json` - What model sees (input)
2. `scores.json` - Model's surprisal values (output)
3. `scores.csv` - Surprisal with lag features
4. `all.txt.annotation.filtered.csv` - Human reading times
5. `likelihood.txt` - Final PPP score

### Questions?
1. What goes INTO calc_surprisal_hf.py? → **JSON with (context, target) pairs**
2. What comes OUT of calc_surprisal_hf.py? → **scores.json with surprisal values**
3. What does dundee.py do? → **Regression: time ~ surprisal + other_features**
4. What is PPP? → **How much surprisal improves reading time prediction**

---

**Ready for Part 2?** Now we'll understand WHY we have 31 different JSON files (different context lengths)!