# Language Modelling Lab (week 3)
This notebook provides the "starter" code for the week 3 lab.  Refer to the pdf document for instructions on what to do in this lab.


## 1 Getting Started
We need to get the names of the files in the training directory and split them 50:50 into a training set and a held-out (testing) set.

In [None]:
import os,random,math
TRAINING_DIR="sentence-completion/Holmes_Training_Data"  #this needs to be the parent directory for the training corpus

def get_training_testing(training_dir=TRAINING_DIR,split=0.5):

    filenames=os.listdir(training_dir)
    n=len(filenames)
    print("There are {} files in the training directory: {}".format(n,training_dir))
    random.seed(53)  #if you want the same random split every time
    random.shuffle(filenames)
    index=int(n*split)
    return filenames[:index],filenames[index:]

trainingfiles,heldoutfiles=get_training_testing()


In [None]:
len(trainingfiles)

## 2  Building a unigram model

The code below implements a simple unigram model.  The class stores the unigram probability distribution and provides methods for training and lookup.

In [None]:
from nltk import word_tokenize as tokenize
import operator

class language_model():
    
    def __init__(self,trainingdir=TRAINING_DIR,files=None):
        #store the names of the files containing training data and run the training method
        self.training_dir=trainingdir
        #avoid a mutable default argument: fall back to a fresh empty list
        self.files=files if files is not None else []
        
        self.train()
        
    def train(self):
        #initialise an empty dictionary which will be the unigram model {w:P(w)} when training is complete
        self.unigram={}
        #process all of the training data, accumulating counts of events
        self._processfiles()
        #convert the accumulated counts to probabilities
        print("Finalising probability distribution")
        self._convert_to_probs()
        
    def _processline(self,line):
        #process each line of a file
        #each line is tokenized and has a special start and end token added
        #counts of tokens are added to the self.unigram count model
        tokens=["__START"]+tokenize(line)+["__END"]
        for token in tokens:
            self.unigram[token]=self.unigram.get(token,0)+1
    
    
    def _processfiles(self):
        #process each file in turn
        for afile in self.files:
            print("Processing {}".format(afile))
            with open(os.path.join(self.training_dir,afile),errors='ignore') as instream:
                for line in instream:
                    line=line.rstrip()
                    if len(line)>0:
                        self._processline(line)
      
            
    def _convert_to_probs(self):
        #self.unigram initially holds counts for each token {token:freq(token)}
        #sum all of the frequencies once and divide each frequency by that sum to get probabilities
        total=sum(self.unigram.values())
        self.unigram={k:v/total for (k,v) in self.unigram.items()}
       
    def get_prob(self,token,method="unigram"):
        #simple look up method
        if method=="unigram":
            return self.unigram.get(token,0)
        else:
            print("Not implemented: {}".format(method))
            return 0
    

    
        
       

In [None]:
MAX_FILES=5
mylm=language_model(files=trainingfiles[:MAX_FILES])

Make sure you look up some probabilities of words in your model.  Pick some words which you would expect to have high probabilities and some words which you would expect to have low probabilities.

As an extension, see how these change if you use a bigger portion of the training data to train your model.


In [None]:
mylm.get_prob("man")
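
One way to compare several words at once is sketched below. The helper `compare_probs` and the toy distribution are hypothetical (they are not part of the starter code) and the probability values are made up for illustration.

```python
def compare_probs(lookup, words):
    """Return (word, probability) pairs sorted from most to least probable.

    `lookup` is any function mapping a token to a probability.
    """
    scored = [(word, lookup(word)) for word in words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy distribution standing in for a trained model (values are made up).
toy_unigram = {"the": 0.05, "man": 0.002, "holmes": 0.001}

def toy_lookup(token):
    # Unseen tokens get probability 0, mirroring an unsmoothed model.
    return toy_unigram.get(token, 0)

ranked = compare_probs(toy_lookup, ["holmes", "the", "zeppelin", "man"])
for word, prob in ranked:
    print("{:>10}  {:.4f}".format(word, prob))
```

With a trained model, the same helper could be called as `compare_probs(mylm.get_prob, ["the", "man", "zeppelin"])`.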

## 2.2 Generation
Add some functionality to your class so that you can generate a string of highly probable words.

Refer to the pdf document for tips on how to do this.
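
As a rough illustration only (the pdf may suggest a different approach), the sketch below samples words from the `k` most probable unigrams in proportion to their probabilities. The function name `generate_unigram`, its parameters and the toy distribution are assumptions made for this example.

```python
import random

def generate_unigram(unigram, max_len=10, k=5, seed=None):
    """Sample max_len tokens from the k most probable unigrams."""
    rng = random.Random(seed)
    # Exclude the boundary markers so they never appear in the output.
    candidates = [(w, p) for w, p in unigram.items()
                  if w not in ("__START", "__END")]
    top = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]
    words = [w for w, _ in top]
    weights = [p for _, p in top]
    # random.choices samples with replacement, weighted by probability.
    return " ".join(rng.choices(words, weights=weights, k=max_len))

# Toy distribution (made-up values) standing in for a trained model.
toy_unigram = {"__START": 0.1, "__END": 0.1, "the": 0.3,
               "of": 0.25, "man": 0.15, "holmes": 0.1}
generated = generate_unigram(toy_unigram, max_len=6, k=2, seed=1)
print(generated)
```

Because a unigram model ignores context, the output is just a bag of frequent words; that limitation is what motivates the bigram model in the next section.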

## 3 Adding Bigrams

Refer to the pdf document.
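
A minimal sketch of one possible bigram representation, assuming a nested dictionary of the form `{previous:{current:P(current|previous)}}`. The function `train_bigram` is hypothetical and works on already tokenised sentences rather than on the training files.

```python
def train_bigram(token_lists):
    """Build {previous: {current: P(current | previous)}} from tokenised sentences."""
    counts = {}
    for tokens in token_lists:
        # Same boundary markers as the unigram starter code.
        padded = ["__START"] + tokens + ["__END"]
        for previous, current in zip(padded, padded[1:]):
            followers = counts.setdefault(previous, {})
            followers[current] = followers.get(current, 0) + 1
    # Normalise each context separately so its probabilities sum to 1.
    bigram = {}
    for previous, followers in counts.items():
        total = sum(followers.values())
        bigram[previous] = {w: c / total for w, c in followers.items()}
    return bigram

sentences = [["the", "man", "sat"], ["the", "dog", "sat"]]
bigram = train_bigram(sentences)
print(bigram["the"])
```

In the class above, the same counting could happen inside `_processline`, with the normalisation done in `_convert_to_probs`.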

## 4 Perplexity

Refer to the pdf document.
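
For orientation, perplexity is commonly defined as the exponential of the negative average log probability per token. The sketch below applies that definition to a unigram-style lookup; the function `perplexity` and the toy distribution are assumptions for illustration, not the lab's reference solution.

```python
import math

def perplexity(token_lists, lookup):
    """Compute exp(-(1/N) * sum(log P(token))) over all N tokens."""
    log_prob_sum = 0.0
    n_tokens = 0
    for tokens in token_lists:
        for token in tokens:
            prob = lookup(token)
            if prob <= 0:
                # A zero-probability token makes the perplexity infinite.
                return math.inf
            log_prob_sum += math.log(prob)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("perplexity is undefined for an empty test set")
    return math.exp(-log_prob_sum / n_tokens)

# A uniform distribution over 4 tokens should give a perplexity of about 4.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

def uniform_lookup(token):
    return uniform.get(token, 0)

pp_seen = perplexity([["a", "b"], ["c"]], uniform_lookup)
pp_unseen = perplexity([["a", "z"]], uniform_lookup)
print(pp_seen, pp_unseen)
```

The infinite result for unseen tokens shows why an unsmoothed model is hard to evaluate on held-out data.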

## 5 Smoothing

Refer to the pdf document.
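
The pdf may ask for a different technique; as one simple example, the sketch below uses add-one (Laplace) smoothing on unigram counts. It assumes all unseen tokens are treated as a single extra vocabulary item, and the function `add_one_probs` is hypothetical.

```python
def add_one_probs(counts):
    """Return a lookup function giving add-one smoothed unigram probabilities."""
    total = sum(counts.values())
    # One extra vocabulary slot is reserved for all unseen tokens together.
    vocab_size = len(counts) + 1
    denominator = total + vocab_size

    def lookup(token):
        # Every token, seen or unseen, gets one extra pseudo-count.
        return (counts.get(token, 0) + 1) / denominator

    return lookup

counts = {"the": 5, "man": 2, "sat": 1}
smoothed = add_one_probs(counts)
print(smoothed("the"), smoothed("zeppelin"))
```

Note that smoothing has to be applied to the raw counts, so in the class above it would need to happen before or during `_convert_to_probs`.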

## 6 Extensions