# Academic Writer Assistant: Notebook

## NLP GROUP PROJECT

**Purpose**: End-to-end notebook for building an autocomplete assistant that predicts next-sentence continuations from paragraph context (100–200 tokens). This notebook contains dataset templates, NSP evaluation approaches, fine-tuning recipe, re-ranking head design, context-window extension ideas, stylistic control methods, vocabulary/token distribution analysis, and evaluation guidance.

**Run notes**: Install `transformers`, `datasets`, `accelerate`, and other libraries in your runtime before executing heavy training cells.

In [None]:
#!pip install -q transformers datasets accelerate evaluate sentencepiece tokenizers faiss-cpu nltk peft

In [None]:
import os
import json, math, random
from pathlib import Path
import pandas as pd
import numpy as np
import torch
from datasets import load_dataset
import re
from typing import List, Optional
from transformers import AutoTokenizer, AutoModelForCausalLM, BertTokenizer, BertForNextSentencePrediction, TrainingArguments, Trainer
import nltk
from peft import LoraConfig, get_peft_model, TaskType

# Separating helper functions to declutter the code
import dataset_utils

# Defining the device that the models run on.
device = 'cpu'

# Building a dataset
Here, we create a dataset of contexts and next sentences from a number of sources. The current implementation only retrieves text from ASAP Essays. The data is first cleaned, then loaded into a dataframe with a `context` column and a `continuation` column (the next sentence).
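
As a rough illustration of the pairing step, the sketch below splits a cleaned document into sentences and takes the first few as context and the following one as the continuation. This is not the code in `dataset_utils`: the function names, the regex-based sentence splitter, and the `n_context` parameter are assumptions made for this example, and the real builder may cut contexts by token count (100–200 tokens) rather than by sentence count.

```python
import re
from typing import List, Optional, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_pair(text: str, n_context: int = 3) -> Optional[Tuple[str, str]]:
    """Return (context, continuation), or None if the document is too short."""
    sentences = split_sentences(text)
    if len(sentences) <= n_context:
        return None
    context = " ".join(sentences[:n_context])
    continuation = sentences[n_context]  # the sentence right after the context
    return context, continuation


doc = "A b. C d! E f? G h. I j."
print(make_pair(doc, n_context=3))
```

Documents shorter than `n_context + 1` sentences yield no pair, which is one plausible reason a pair count can be lower than the raw document count.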

In [5]:
df = dataset_utils.build_academic_dataset(tokenizer=None, limit_each=3000)
contexts = df['context']
print(df.head())
print()
print(contexts[0])

Loading datasets...
Loading ASAP Essays
Total raw documents loaded: 723
Generating (context, continuation) pairs...
Generated 720 training pairs.
                                             context  \
0  A long time ago when I was in third grade I ha...   
1  Softball has to be one of the single most grea...   
2  Some people like making people laugh, I love i...   
3  "LAUGHTER" @CAPS1 I hang out with my friends, ...   
4  Well ima tell a story about the time i got @CA...   

                                        continuation  
0  The next day @PERSON2 and I were eating lunch ...  
1  Many of these girls were like sisters to me th...  
2  For example one time I hit myself in the head ...  
3  @CAPS1 I say trash can I really mean trash can...  
4  Then she said stupid @CAPS2 on the bus and com...  

A long time ago when I was in third grade I had a friend @PERSON2 who's mom was in a bad mood. She never laughed and she never smiled. Every time I saw her I would smile at her and all s

# Testing next text generation
Here, we load a pretrained GPT-2 model and test its text generation before evaluating it on our dataset.

In [6]:
import model_utils

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [7]:
# Generating text with the GPT-2 model
context = "Deep reinforcement learning has been widely adopted in robotic navigation."

# Tokenize the input context
inputs = tokenizer(context, return_tensors="pt").to(device)

# Generate a continuation using the pretrained (not yet fine-tuned) model
# You can adjust max_new_tokens, do_sample, temperature, top_p as needed
output_ids = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask, # Explicitly pass attention_mask
    max_new_tokens=100, # Generate up to 100 new tokens
    num_return_sequences=1,
    do_sample=True,      # Enable sampling for more diverse outputs
    temperature=0.7,     # Control creativity (lower for less, higher for more)
    top_p=0.9,           # Nucleus sampling
    pad_token_id=tokenizer.eos_token_id
)

# Decode the generated tokens, skipping the input context and special tokens
generated_text = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("Context:", context)
print("\nGenerated Continuation:", generated_text.strip())

Context: Deep reinforcement learning has been widely adopted in robotic navigation.

Generated Continuation: However, there is a limit to how well it can be applied to artificial intelligence.

This week, we analyzed data from the world's largest robot park in California, the San Diego Zoo.

Advertisement

In this study, we used a robotic robotic arm to measure the effectiveness of reinforcement learning in a range of tasks. This arm was trained to take a series of steps to complete a task, then trained to move the robotic arm forward. We found that the robot arm's performance


# Model Evaluation
Here we are going to test some model evaluation methods for next sentence prediction (NSP). 

## Language model scoring task

This approach treats NSP as a language-model scoring task.

Process:

- Given a sentence pair (A, candidate continuation B)

- Compute the log-likelihood or perplexity of B given A

- Higher probability = more plausible continuation

- Compare scores between true and false continuations
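
The scoring step above can be sketched in a few lines. The version below is a minimal, dependency-free illustration, not the `model_utils.lm_score` implementation: it assumes you already have the causal LM's logits for the concatenated sequence (context tokens followed by continuation tokens) as nested lists, and the function names are hypothetical. The key detail is the one-position shift: the logits at position `t - 1` give the distribution over token `t`.

```python
import math
from typing import List


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def continuation_log_likelihood(
    logits: List[List[float]], token_ids: List[int], context_len: int
) -> float:
    """Sum of log P(token_t | tokens before t) over continuation tokens only.

    logits[t] is the model output at position t for context + continuation;
    context_len is the number of context tokens and must be at least 1.
    """
    total = 0.0
    for t in range(context_len, len(token_ids)):
        # logits at position t-1 predict the token at position t
        total += log_softmax(logits[t - 1])[token_ids[t]]
    return total


# Toy check: uniform logits over a 4-token vocabulary, 3 continuation tokens.
uniform_logits = [[0.0] * 4 for _ in range(5)]
ids = [0, 1, 2, 3, 0]
print(continuation_log_likelihood(uniform_logits, ids, context_len=2))
```

With a Hugging Face causal LM, the logits would typically come from `model(input_ids).logits[0]`. Because this score is a sum, longer continuations get more negative values; dividing by the number of continuation tokens gives a per-token average that is easier to compare across candidates of different lengths.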

In [9]:
# Run the following code if you want to switch to distilgpt2

# model_name = "distilgpt2"
# lm_tokenizer = AutoTokenizer.from_pretrained(model_name)
# lm_model = AutoModelForCausalLM.from_pretrained(model_name)
# lm_model.to(device)
# lm_model.eval()

score = model_utils.lm_score(
    model=model,
    tokenizer=tokenizer,
    context=contexts[0],
    continuation=df['continuation'][0],
    device=device,
)

print("LM score:", score)

LM score: -574.0275120735168


## Binary classification

This approach treats NSP as a binary classification task.

Process:

- Concatenate sentence pair (A + B)

- Feed into a classifier (e.g., BERT, RoBERTa, DeBERTa)

- Output:

  - 1 → B is a valid continuation

  - 0 → B is an invalid continuation

- Evaluate using: Accuracy, Precision, Recall, F1
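
To make the evaluation step concrete, here is a small sketch that turns classifier probabilities into 0/1 predictions and computes accuracy, precision, recall, and F1. The function name, the 0.5 threshold, and the toy probabilities are assumptions for illustration only. One caveat worth checking in practice: Hugging Face's `BertForNextSentencePrediction` uses label 0 for "B follows A", so its outputs may need to be flipped to match the 1 = valid convention used above.

```python
from typing import Dict, List


def classification_metrics(
    probs: List[float], labels: List[int], threshold: float = 0.5
) -> Dict[str, float]:
    """probs[i] is P(valid continuation); labels[i] is 1 for valid, 0 for invalid."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy inputs (made up): three true continuations, two false ones.
toy_probs = [0.99, 0.80, 0.30, 0.60, 0.10]
toy_labels = [1, 1, 1, 0, 0]
print(classification_metrics(toy_probs, toy_labels))
```

These metrics only become meaningful once the evaluation set contains negative pairs as well, for example contexts paired with continuations drawn from other documents.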

In [10]:
eval_model_name = "bert-base-uncased"
eval_tokenizer = BertTokenizer.from_pretrained(eval_model_name)
eval_model = BertForNextSentencePrediction.from_pretrained(eval_model_name)
eval_model.eval()
eval_model.to(device)

from model_utils import nsp_score

score = nsp_score(
    bert_model=eval_model,
    bert_tokenizer=eval_tokenizer,
    context=contexts[0],
    continuation=df['continuation'][0],
    device=device,
)

print("Classification score:", score)

Classification score: 0.9999972581863403


# Model evaluation with Ground Truth

- BERTScore **Semantic similarity**

- ROUGE (ROUGE-1, ROUGE-2, ROUGE-L) **Lexical overlap**

- BLEU **N-gram precision**
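
To show what "lexical overlap" means in practice, the sketch below computes a simplified ROUGE-1 (unigram overlap) by hand. It is an illustration under simplifying assumptions (lowercasing and whitespace tokenization, no stemming), so its values may differ slightly from those of a full ROUGE library; the function name is hypothetical.

```python
from collections import Counter
from typing import Tuple


def rouge1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of unigram overlap between two strings."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(rouge1("the cat is on the mat", "the cat sat on the mat"))
```

For open-ended continuation tasks, many different sentences can be valid, so low overlap with the single ground-truth sentence does not by itself mean a continuation is poor; this is why a semantic metric such as BERTScore is listed alongside the lexical ones.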

In [11]:
from model_utils import evaluate_predictions, all_model_evaluation

model_score = all_model_evaluation(
    model=model,
    tokenizer=tokenizer,
    bert_model=eval_model,
    bert_tokenizer=eval_tokenizer,
    contexts=[contexts[0]],
    candidates=["He is very good with it"],
)
model_score

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tranh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tranh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


{'avg_lm': -471.00285816192627,
 'avg_nsp': 0.010369982570409775,
 'avg_final': -282.58097493201495,
 'bertscore_precision': np.float64(0.42000794410705566),
 'bertscore_recall': np.float64(0.27861449122428894),
 'bertscore_f1': np.float64(0.33500295877456665),
 'rouge1': np.float64(0.02631578947368421),
 'rouge2': np.float64(0.0),
 'rougeL': np.float64(0.02631578947368421),
 'bleu': 3.169663824442922e-242}

# Comparing models
We test several small language models to see which one is most suitable for our task. These models include:
- gpt2
- EleutherAI/gpt-neo-125M
- facebook/opt-125m
- microsoft/DialoGPT-small


In [12]:

causal_models = [
    "gpt2",
    "EleutherAI/gpt-neo-125M",
    "facebook/opt-125m",
    "microsoft/DialoGPT-small"
]

model.to(device)
eval_model.to(device)

all_model_evaluation_score = []
for model_name in causal_models:
    print(f"Generating continuations for {model_name}...")
    # generate_continuations receives the model name (presumably loading that model itself);
    # the GPT-2 `model` and BERT `eval_model` loaded earlier are used only for scoring.
    generate_continuations_list = model_utils.generate_continuations(model_name, contexts, max_new_tokens=50, device=device)
    print(f"Generated {len(generate_continuations_list)} continuations.")
    print(generate_continuations_list[:5])
    model_score = all_model_evaluation(
        model=model,
        tokenizer=tokenizer,
        bert_model=eval_model,
        bert_tokenizer=eval_tokenizer,
        contexts=contexts,
        candidates=generate_continuations_list,
        device=device
    )
    all_model_evaluation_score.append(model_score)

print(all_model_evaluation_score)

Generating continuations for gpt2...
Generated 720 continuations.
["I was still young then and I was really depressed. I couldn't believe how much he was trying to make me laugh and so bad I couldn't stop laughing. I told my mom I wouldn't laugh at her because she was trying to make me", 'I was very excited to get into softball. I loved playing with my dad, and my brother and I, and the team I grew up with. We were the only team that went to a team game in the 80s (that was then', 'I always wanted to be a spaz, I think I was a kid. I love to play with my toys. I love to play with my friends. I love to watch the world burn. I love to love to make people laugh. So', 'I am also a big fan of the way they talk to each other. We get along pretty well when it comes to their personality. They are all very friendly and easygoing. They talk about the things that they do for a living, and we', 'She was like are ugh. So i went back and said were you at then i said no no i was at then i said i was.

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tranh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tranh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Generating continuations for EleutherAI/gpt-neo-125M...


KeyboardInterrupt: 

In [None]:
import pandas as pd

# Convert the list of dictionaries to a DataFrame
df_scores = pd.DataFrame(all_model_evaluation_score)

# Save the DataFrame to a CSV file
df_scores.to_csv('model_evaluation_scores.csv', index=False)

print('Model evaluation scores saved to model_evaluation_scores.csv')

# Display the DataFrame to show the saved data
display(df_scores)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Assumes `df_scores` and `causal_models` are still in scope and that the rows of
# `df_scores` follow the order of `causal_models`. If the evaluation loop stopped early,
# only the models that finished are labelled.

# Add model names to the DataFrame for easier plotting
model_names = causal_models
df_scores['model_name'] = model_names[:len(df_scores)]  # truncate so lengths match

metrics_to_plot = [
    'bertscore_f1',
    'rougeL',
    'bleu',
    'avg_final'
]

metric_titles = {
    'bertscore_f1': 'BERTScore F1',
    'rougeL': 'ROUGE-L',
    'bleu': 'BLEU Score',
    'avg_final': 'Average Hybrid Score'
}

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
axes = axes.flatten()

for i, metric in enumerate(metrics_to_plot):
    sns.barplot(x='model_name', y=metric, data=df_scores, ax=axes[i], palette='viridis', hue='model_name', legend=False)
    axes[i].set_title(metric_titles[metric])
    axes[i].set_xlabel('Model Name')
    axes[i].set_ylabel('Score')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('model_evaluation_metrics.png')
plt.show()

print("Evaluation metrics plots saved to model_evaluation_metrics.png")

## Choosing the Best Pretrained Model

From the results, we conclude that the best overall model is GPT-Neo 125M (Model 1).

It gives the best balance between likelihood scoring, semantic similarity, lexical overlap, and NSP performance.
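
One possible way to make "best balance" explicit is to min-max normalize each metric across models and rank by the mean normalized score. The sketch below is an assumption about how such a ranking could be done, not the procedure used to reach the conclusion above; the function name is hypothetical, and the model names and numbers are placeholders invented for the example, not results from this notebook.

```python
from typing import Dict, List


def rank_models(scores: Dict[str, Dict[str, float]], metrics: List[str]) -> List[str]:
    """Rank models by mean min-max normalized score (higher is better for every metric)."""
    names = list(scores)
    totals = {name: 0.0 for name in names}
    for metric in metrics:
        values = [scores[name][metric] for name in names]
        low, high = min(values), max(values)
        span = high - low
        for name in names:
            # If all models tie on a metric, it contributes nothing to the ranking.
            totals[name] += (scores[name][metric] - low) / span if span else 0.0
    return sorted(names, key=lambda name: totals[name] / len(metrics), reverse=True)


# Placeholder values for illustration only.
toy_scores = {
    "model_a": {"avg_lm": -400.0, "bertscore_f1": 0.30, "rougeL": 0.05},
    "model_b": {"avg_lm": -300.0, "bertscore_f1": 0.35, "rougeL": 0.04},
    "model_c": {"avg_lm": -350.0, "bertscore_f1": 0.25, "rougeL": 0.02},
}
print(rank_models(toy_scores, ["avg_lm", "bertscore_f1", "rougeL"]))
```

In the notebook, the same idea could be applied to the rows of `df_scores`. Equal weighting is a design choice; metrics that matter more for the task could be given larger weights.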