# Week 5: Sentence Completion Challenge

Make sure you refer to the PDF document `lab4.pdf` as well as this starter notebook.

The cell below will load the language_model class (developed last week) and train it using the files in the training directory.

In [None]:
%load_ext autoreload
%autoreload 2  
#this means that language_model will be reloaded when you run this cell - this is important if you change the language_model class!

import os
from language_model import * ## import language model (developed in previous lab)
parentdir="/Users/juliewe/Dropbox/teaching/AdvancedNLP/2026/2026_content/week3/lab3/lab3resources/sentence-completion" #you may need to update this 

trainingdir=os.path.join(parentdir,"Holmes_Training_Data")
training,testing=get_training_testing(trainingdir)
MAX_FILES=10   #use a small number here whilst developing your solutions
mylm=language_model(trainingdir=trainingdir,files=training[:MAX_FILES])

Note that if you are using my language_model class (from the resources directory) rather than your own, I have added an extra step when finalising the probability distributions.  I count the number of words which were deleted from the vocabulary during the _make_unknowns step and then use this count to adjust the probability of the _UNK token.  Without the adjustment, P(_UNK) is the probability of ANY unknown token occurring.  The adjustment makes it an estimate of the probability of a particular rare word.  This is important for the sentence completion challenge, as otherwise OOV words would seem much more likely as the target words just due to the sheer number of them.
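To make the adjustment concrete, here is a minimal sketch on a toy corpus. It is not the code from my language_model class: the function names `make_unknowns` and `to_probs`, the frequency threshold of 2, and the choice to divide P(_UNK) by the number of deleted word types are all assumptions made for illustration.

```python
from collections import Counter

def make_unknowns(counts, threshold=2, unk="_UNK"):
    """Merge every word seen fewer than `threshold` times into one unk token.

    Returns the new counts and the number of distinct words that were deleted.
    (Hypothetical helper; the threshold is an assumption.)
    """
    kept = Counter()
    n_deleted = 0
    for word, c in counts.items():
        if c < threshold:
            kept[unk] += c
            n_deleted += 1
        else:
            kept[word] += c
    return kept, n_deleted

def to_probs(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

counts = Counter("the cat sat on the mat the dog sat".split())
kept, n_deleted = make_unknowns(counts)
probs = to_probs(kept)

p_any_unknown = probs["_UNK"]                # probability of ANY rare word
p_one_rare_word = p_any_unknown / n_deleted  # estimate for ONE particular rare word
print(n_deleted, p_any_unknown, p_one_rare_word)
```

Note that after this kind of adjustment the unigram distribution no longer sums to 1. That does not matter when we only use the scores to rank candidate answers.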

Let's have a look at the most frequent words in the training data.

In [None]:
vocab=sorted(mylm.unigram.items(),key=lambda x:x[1],reverse =True)

In [None]:
vocab[:10]

How big is the vocabulary?  What kind of words are low frequency?  What kind of words are mid-frequency?

In [None]:
len(vocab)

In [None]:
vocab[-10:]

Without the _adjust_unknowns step, P(_UNK) would be much higher.  It's worth checking what it would be if you turn off self.adjust_unknowns.  I'll leave this for you to do.

Make sure you can:
* look up bigram probabilities
* generate a sentence according to the model
* calculate the perplexity of a test sentence
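As a reminder of what the perplexity calculation involves, here is a small self-contained sketch. It assumes the bigram model is stored as a nested dictionary `{previous_word: {word: probability}}` and that sentences are padded with `__START` and `__END` tokens. Both are assumptions for illustration, so check how your own language_model class stores things. The `floor` value for unseen bigrams is also just a placeholder for whatever smoothing you use.

```python
import math

def perplexity(tokens, bigram, floor=1e-6):
    """Perplexity of a token sequence (at least two tokens) under a bigram model.

    Unseen bigrams get the probability `floor` (an illustrative stand-in
    for proper smoothing).
    """
    log_prob = 0.0
    n = 0
    for prev, cur in zip(tokens, tokens[1:]):
        p = bigram.get(prev, {}).get(cur, floor)
        log_prob += math.log(p)
        n += 1
    # perplexity is the exponential of the average negative log probability
    return math.exp(-log_prob / n)

toy_bigram = {
    "__START": {"the": 0.5},
    "the": {"cat": 0.5},
    "cat": {"__END": 0.5},
}
pp = perplexity(["__START", "the", "cat", "__END"], toy_bigram)
print(pp)
```

Every bigram in the toy sentence has probability 0.5, so the model is as uncertain as if it were choosing between two equally likely words at each step.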

Now let's load in and have a look at the sentence completion challenge questions.

In [None]:
import pandas as pd, csv
questions=os.path.join(parentdir,"testing_data.csv")
answers=os.path.join(parentdir,"test_answer.csv")

with open(questions) as instream:
    csvreader=csv.reader(instream)
    lines=list(csvreader)
qs_df=pd.DataFrame(lines[1:],columns=lines[0])
qs_df.head()

We need to be able to tokenize the questions so that the gaps can be located.

In [None]:
from nltk import word_tokenize as tokenize

tokens=[tokenize(q) for q in qs_df['question']]
print(tokens)

Getting the context of the blank: looking at the preceding words (the number of words is given by window).

In [None]:
def get_left_context(sent_tokens,window,target="_____"):
    found=-1
    for i,token in enumerate(sent_tokens):
        if token==target:
            found=i
            break 
            
    if found>-1:
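        #use found (not the loop variable) and clamp at 0 so a gap near the start does not produce a negative index
        return sent_tokens[max(0,found-window):found]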
    else:
        return []
    

qs_df['tokens']=qs_df['question'].map(tokenize)
qs_df['left_context']=qs_df['tokens'].map(lambda x: get_left_context(x,2))
qs_df.head()    

## Building and evaluating an SCC system
1. always predict the same answer (e.g., "a")


In [None]:
# from scc import *
### you can import this using the line above, but I have included the code here to make it easier to inspect

class question:
    
    def __init__(self,aline):
        self.fields=aline
    
    def get_field(self,field):
        return self.fields[question.colnames[field]]
    
    def add_answer(self,fields):
        self.answer=fields[1]
   
    def chooseA(self):
        return("a")
    
    def predict(self,method="chooseA"):
        #eventually there will be lots of methods to choose from
        if method=="chooseA":
            return self.chooseA()
        
    def predict_and_score(self,method="chooseA"):
        
        #compare prediction according to method with the correct answer
        #return 1 or 0 accordingly
        prediction=self.predict(method=method)
        if prediction ==self.answer:
            return 1
        else:
            return 0

class scc_reader:
    
    def __init__(self,qs=questions,ans=answers):
        self.qs=qs
        self.ans=ans
        self.read_files()
        
    def read_files(self):
        
        #read in the question file
        with open(self.qs) as instream:
            csvreader=csv.reader(instream)
            qlines=list(csvreader)
        
        #store the column names as a reverse index so they can be used to reference parts of the question
        question.colnames={item:i for i,item in enumerate(qlines[0])}
        
        #create a question instance for each line of the file (other than heading line)
        self.questions=[question(qline) for qline in qlines[1:]]
        
        #read in the answer file
        with open(self.ans) as instream:
            csvreader=csv.reader(instream)
            alines=list(csvreader)
            
        #add answers to questions so predictions can be checked    
        for q,aline in zip(self.questions,alines[1:]):
            q.add_answer(aline)
        
    def get_field(self,field):
        return [q.get_field(field) for q in self.questions] 
    
    def predict(self,method="chooseA"):
        return [q.predict(method=method) for q in self.questions]
    
    def predict_and_score(self,method="chooseA"):
        scores=[q.predict_and_score(method=method) for q in self.questions]
        return sum(scores)/len(scores)
    
            

In [None]:
SCC = scc_reader()

In [None]:
SCC.get_field("b)")

In [None]:
SCC.predict()

In [None]:
SCC.predict_and_score()

### Adding a random choice
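A random baseline could look something like the sketch below. The answer keys "a" to "e" are an assumption based on the column names in the question file, and `choose_random` is a hypothetical name. Passing in a seeded `random.Random` makes the baseline repeatable.

```python
import random

CHOICES = ["a", "b", "c", "d", "e"]  # assumed answer keys

def choose_random(rng=random):
    """Pick one of the answer keys uniformly at random."""
    return rng.choice(CHOICES)

rng = random.Random(0)  # fixed seed so repeated runs give the same predictions
predictions = [choose_random(rng) for _ in range(1000)]
print(predictions[:10])
```

To use this in the `question` class, you could add it as a method and add a matching branch to `predict`. With five options, the expected accuracy is about 20%.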

### Using the language model
using unigram probabilities
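One way to do this is sketched below: choose the candidate word with the highest unigram probability, ignoring the rest of the sentence. The `candidates` dictionary (answer key to candidate word), the toy probabilities and the function name `choose_unigram` are made up for illustration. Words missing from the vocabulary are scored with the _UNK probability.

```python
def choose_unigram(candidates, unigram, unk="_UNK"):
    """Return the answer key whose word has the highest unigram probability."""
    def prob(word):
        # out-of-vocabulary words fall back to the unk probability
        return unigram.get(word, unigram.get(unk, 0.0))
    return max(candidates, key=lambda key: prob(candidates[key]))

toy_unigram = {"said": 0.01, "glanced": 0.001, "_UNK": 0.0001}
toy_candidates = {"a": "glanced", "b": "said", "c": "xylophone"}
prediction = choose_unigram(toy_candidates, toy_unigram)
print(prediction)
```

This is where the _UNK adjustment matters: if P(_UNK) were the probability of any unknown word, an OOV candidate such as "xylophone" could easily outscore the in-vocabulary ones.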

### Adding Context
looking up context and bigram probabilities
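A sketch of the idea: score each candidate by P(candidate | previous word) and pick the best. As before, the nested-dictionary layout of `bigram`, the toy numbers and the name `choose_bigram_left` are assumptions for illustration only.

```python
def choose_bigram_left(candidates, left_word, bigram):
    """Return the answer key maximising P(candidate | left_word)."""
    following = bigram.get(left_word, {})
    # candidates never seen after left_word score 0.0
    return max(candidates, key=lambda key: following.get(candidates[key], 0.0))

toy_bigram = {"he": {"said": 0.2, "glanced": 0.05}}
toy_candidates = {"a": "glanced", "b": "said", "c": "table"}
prediction = choose_bigram_left(toy_candidates, "he", toy_bigram)
print(prediction)
```

If the left word has never been seen, every candidate scores 0.0 and `max` simply returns the first key, so this version silently behaves like "always choose a" for such questions.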


### Right context
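The right context can be found in the same way as the left context. The function below is a sketch that mirrors `get_left_context`. The name `get_right_context` is my own suggestion, and the example assumes the tokenizer keeps the gap "_____" as a single token.

```python
def get_right_context(sent_tokens, window, target="_____"):
    """Return up to `window` tokens immediately after the first `target`."""
    if target not in sent_tokens:
        return []
    found = sent_tokens.index(target)
    # slicing past the end of the list is safe: it just returns fewer tokens
    return sent_tokens[found + 1:found + 1 + window]

tokens = ["I", "have", "_____", "the", "letter", "."]
right = get_right_context(tokens, 2)
print(right)
```

With a bigram model, the right context is used by looking up P(next word | candidate) for each candidate rather than P(candidate | previous word).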

### Left and right context
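One simple way to combine the two contexts is to multiply P(candidate | left word) by P(right word | candidate), which is the part of the sentence probability that actually depends on the candidate. The sketch below uses a toy nested-dictionary bigram model. The layout, the numbers and the name `choose_bigram_both` are illustrative assumptions.

```python
def choose_bigram_both(candidates, left_word, right_word, bigram):
    """Return the answer key maximising P(cand | left) * P(right | cand)."""
    def score(word):
        p_left = bigram.get(left_word, {}).get(word, 0.0)
        p_right = bigram.get(word, {}).get(right_word, 0.0)
        return p_left * p_right
    return max(candidates, key=lambda key: score(candidates[key]))

toy_bigram = {
    "he": {"said": 0.2, "opened": 0.1},
    "said": {"the": 0.01},
    "opened": {"the": 0.5},
}
toy_candidates = {"a": "said", "b": "opened"}
prediction = choose_bigram_both(toy_candidates, "he", "the", toy_bigram)
print(prediction)
```

In this toy example the left context alone prefers "said", but "opened the" is far more likely than "said the", so the combined score picks "opened".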

### Backing off to unigram probs

Backing off might not change the decision (the correct answer may not be in the bestchoices returned by the bigram model).
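Here is one possible reading of backing off, as a sketch: collect the candidates that tie for the best bigram score, and only use unigram probabilities to choose between them. The name `choose_with_backoff`, the dictionary layouts and the toy numbers are all assumptions, and your own implementation may back off differently.

```python
def choose_with_backoff(candidates, left_word, bigram, unigram, unk="_UNK"):
    """Use bigram scores first; break ties among the best with unigram scores."""
    following = bigram.get(left_word, {})
    scores = {key: following.get(word, 0.0) for key, word in candidates.items()}
    best = max(scores.values())
    best_choices = [key for key, s in scores.items() if s == best]
    if len(best_choices) == 1:
        return best_choices[0]
    # several candidates tie (often all at 0.0): back off to unigram probabilities
    return max(
        best_choices,
        key=lambda key: unigram.get(candidates[key], unigram.get(unk, 0.0)),
    )

toy_bigram = {"he": {"said": 0.2}}
toy_unigram = {"said": 0.01, "table": 0.02, "_UNK": 0.0001}
toy_candidates = {"a": "said", "b": "table"}

seen_context = choose_with_backoff(toy_candidates, "he", toy_bigram, toy_unigram)
unseen_context = choose_with_backoff(toy_candidates, "purple", toy_bigram, toy_unigram)
print(seen_context, unseen_context)
```

With this design the unigram model can only choose among the tied best candidates, so if the correct answer has already been ruled out by the bigram scores, backing off cannot recover it.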

Investigate: 
* the effect of the amount of training data on each of the strategies
* plot the results on a graph - you should see a cross-over (unigram better than bigram for small amounts of training data, but bigram better than unigram for large amounts)

Extend:
* trigram model
* incorporation of distributional similarity / word2vec vectors
* RNNLM ...?