In this notebook, we will build a recommendation system for academic papers.

We will be using two data files.
- `cb.txt` contains the titles of the papers, together with a label indicating whether a person likes each one. The label is not used by the recommendation system itself, but by the feedback logic. The label (0 or 1) is in column 1, and the title takes up the remainder of the line, separated from the label by a space.
- `vocab2.txt` contains the words in the titles of the academic papers. The first four columns of each line contain a count of how many times the word appears (the counts only approximately match `cb.txt`), and the word takes up the remainder of the line, separated from the count by a space.
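
As a minimal sketch of the two line formats described above, the following parses one line of each kind. The sample lines are invented for illustration and are not taken from the data files, and `parse_cb_line` and `parse_vocab_line` are hypothetical helper names. The sketch also assumes the count is padded to four columns with spaces.

```python
def parse_cb_line(line):
    """Split a cb.txt-style line into (label, title)."""
    line = line.rstrip("\n")
    label = int(line[0])   # column 1 holds the 0/1 label
    title = line[2:]       # the title follows a single space
    return label, title


def parse_vocab_line(line):
    """Split a vocab2.txt-style line into (count, word)."""
    line = line.rstrip("\n")
    count = int(line[0:4])  # the first four columns hold the count
    word = line[5:]         # the word follows a single space
    return count, word


# Invented sample lines, not real data.
print(parse_cb_line("1 learning to rank papers\n"))
print(parse_vocab_line("  12 learning\n"))
```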

Here we write a few functions that will help us in processing the data:

- `readvocab` creates a `vocab_dict` associating a unique ID with each word that occurs in our paper titles.
- `tokenize` turns a title, `string`, into a vector of counts: one entry per vocabulary word, holding the number of times that word occurs in the title.
- `getdat` reads the titles and returns a list of their word-count vectors, `dat`, and a list of labels indicating whether the user found each title interesting, `labs`.

We use these functions to process the data into vectorized titles, `dat`, and their labels, `labs`.
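
To make the vector representation concrete, here is a small self-contained sketch with a toy vocabulary. `toy_vocab` and `toy_tokenize` are hypothetical stand-ins that mirror the layout `readvocab` produces (`@unk` at index 0, `@size` holding the vector length); they are not the notebook's real vocabulary.

```python
# Toy vocabulary: two known words plus the catch-all and the size entry.
toy_vocab = {"@unk": 0, "neural": 1, "networks": 2, "@size": 3}


def toy_tokenize(title, vocab):
    """Count how often each vocabulary word occurs in the title."""
    vec = [0] * vocab["@size"]
    for word in title.split():
        # Words missing from the vocabulary are counted under @unk.
        vec[vocab.get(word, vocab["@unk"])] += 1
    return vec


vec = toy_tokenize("deep neural networks for neural data", toy_vocab)
print(vec)  # [3, 2, 1]
```

Three words ("deep", "for", "data") fall into the unknown bucket, "neural" occurs twice, and "networks" once.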

In [None]:
# read in the vocabulary file
def readvocab():
    # keep track of the number of words
    lexiconsize = 0
    # initialize an empty dictionary
    vocab_dict = {}
    # create a catch-all feature (vector component) for all unknown words
    vocab_dict["@unk"] = lexiconsize
    lexiconsize += 1
    # read in the vocabulary file
    with open("data/vocab2.txt", "r") as f:
        data = f.readlines()
    # Process the file a line at a time.
    for line in data:
        # The count is the first 4 characters
        count = int(line[0:4])
        # The word is the rest of the line, without the trailing newline
        token = line[5:].rstrip("\n")
        # Create a feature if it's appeared at least twice
        if count > 1:
            vocab_dict[token] = lexiconsize
            lexiconsize += 1
    # squirrel away the total size for later reference
    vocab_dict["@size"] = lexiconsize
    return vocab_dict

# vocab_dict is dict[str, int], with the following keys:
# * vocab_dict["@unk"] == 0, representing words unknown to the vocab.
# * vocab_dict["@size"] is the size of the vocab, including @unk, but not including @size itself.
# * All other keys are words in the vocab, with values being their unique IDs. Each ID is an int between 1 and @size-1.
vocab_dict = readvocab()

# Turn the title `string` into a vector of word counts.
def tokenize(string, vocab_dict):
  # initially the vector is all zeros
  vec = [0] * vocab_dict["@size"]
  unk = vocab_dict["@unk"]
  # Split on any whitespace. split(" ") would leave the trailing newline
  # attached to the last word, which would then be counted as unknown.
  for t in string.split():
    # if the word has a feature, add one to the corresponding feature
    # otherwise, count it as an unk
    vec[vocab_dict.get(t, unk)] += 1
  return vec

# read in labeled examples and turn the strings into vectors
def getdat(vocab_dict):
    with open("data/cb.txt", "r") as f:
        data = f.readlines()
    dat = []
    labs = []
    for line in data:
        # the label is the first character; the title starts at column 3
        labs.append(int(line[0]))
        dat.append(tokenize(line[2:], vocab_dict))
    return dat, labs

(dat, labs) = getdat(vocab_dict)

We define two additional helper functions to make our recommendations:

- `playgame` plays `rounds / b` rounds. In each round it asks `chooser` to recommend one of the next `b` titles, and at the end it returns the total `score`: the number of recommended titles whose label is 1. For the first `alpha` rounds the selections are random.

- `argmax` returns the entry of `indices` paired with the highest value in `vals`. Ties go to the earliest entry.
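
The tie-breaking behavior of such an argmax helper can be checked on a toy input. The sketch below uses a hypothetical `first_argmax` with the same contract as described above; the candidate range and the scores are invented.

```python
def first_argmax(indices, vals):
    """Return the entry of `indices` paired with the largest value in `vals`."""
    best = max(vals)
    for idx, val in zip(indices, vals):
        if val == best:
            return idx  # the first match wins on ties


candidates = range(10, 15)            # items 10..14 offered in one round
probs = [0.2, 0.7, 0.1, 0.7, 0.4]     # invented scores, with a tie at 0.7
print(first_argmax(candidates, probs))  # 11
```

Items 11 and 13 share the top score, and the earlier one is returned.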

In [None]:
import random
from sklearn.naive_bayes import MultinomialNB

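def playgame(chooser, rounds, alpha):
  curitem = 0      # index of the first candidate in the current round
  score = 0        # number of recommended titles with label 1
  trainset = []    # vectors of the titles recommended so far
  trainlabs = []   # their labels (the feedback received)
  b = 5            # number of candidates offered per round
  clf = MultinomialNB()

  # walk through the first `rounds` titles, b at a time
  while curitem < rounds:
    chosenitem = chooser(curitem, b, trainset, trainlabs, alpha, clf)
    score = score + labs[chosenitem]
    trainset = trainset + [dat[chosenitem]]
    trainlabs = trainlabs + [labs[chosenitem]]
    curitem += b
  return(score)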

def argmax(indices, vals):
  # return the index paired with the largest value (first one on ties)
  best = max(vals)
  for i in range(len(indices)):
    if vals[i] == best:
      return(indices[i])

Our choosing function, `probachooser`, picks one of `b` candidate titles. `curitem` is the index of the first candidate, and `trainset` and `trainlabs` hold the vectors and labels of the previous selections.

If we have not yet made `alpha` selections, the selection is random.

Once we have made `alpha` selections, we fit our `clf` Naive Bayes model using the training data of academic paper titles, `trainset`, and the training labels that record whether each paper was interesting, `trainlabs`. After we fit our `clf` model, we use it to select the candidate most likely to be labeled as interesting.
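
The two-phase rule above can be sketched independently of scikit-learn. In this hedged example, `score_fn` is a hypothetical stand-in for the fitted classifier's estimated probability of label 1, and `explore_then_exploit` is an illustrative name, not the notebook's function.

```python
import random


def explore_then_exploit(curitem, b, n_seen, alpha, score_fn):
    """Pick one item from the window [curitem, curitem + b)."""
    window = range(curitem, curitem + b)
    if n_seen < alpha:
        # Exploration phase: uniform random pick from the window.
        return random.choice(window)
    # Exploitation phase: pick the item with the highest score.
    return max(window, key=score_fn)


# Invented scores standing in for predicted probabilities.
fake_scores = {0: 0.1, 1: 0.9, 2: 0.3, 3: 0.2, 4: 0.5}
print(explore_then_exploit(0, 5, n_seen=20, alpha=10, score_fn=fake_scores.get))
```

With 20 selections already made and `alpha = 10`, the sketch is past the random phase and returns item 1, the one with the highest invented score.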

In [None]:
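def probachooser(curitem, b, trainset, trainlabs, alpha, clf):
  # Fit the classifier once, at the moment the random phase ends. The same
  # fitted clf object is reused in all later rounds.
  if len(trainset) == alpha:
    clf.fit(trainset, trainlabs)
  if len(trainset) < alpha:
    # still exploring: pick one of the b candidates uniformly at random
    chosenitem = random.randint(curitem,curitem+b-1)
  else:
    # predict_proba returns one (P(label 0), P(label 1)) row per candidate,
    # provided both labels occurred in the training data
    yhat = clf.predict_proba(dat[curitem:(curitem+b)])
    chosenitem = argmax(range(curitem,curitem+b), [p for (c,p) in yhat])
  return(chosenitem)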

We will see how the number of rounds with random choices, `alpha`, affects the final score. We will run our `playgame` function with `alpha` values ranging from 10 to 195 in steps of 5, and plot the scores below.
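
A rough back-of-the-envelope model helps set expectations for this sweep. With `rounds = 1000` and `b = 5` there are 200 recommendations in total; the first `alpha` are random and the rest come from the classifier. The sketch below assumes a fixed hit rate for each phase. Both rates (`p_random`, `p_model`) are hypothetical parameters, and the value used for `p_model` is invented, not measured.

```python
def expected_score(alpha, total_rounds=200, p_random=0.5, p_model=0.7):
    """Expected number of hits if each phase has a fixed hit rate."""
    random_rounds = min(alpha, total_rounds)
    model_rounds = total_rounds - random_rounds
    return random_rounds * p_random + model_rounds * p_model


for alpha in (10, 100, 195):
    print(alpha, expected_score(alpha))
```

This simplification ignores that the classifier's real hit rate depends on how much data it was trained on, so it only illustrates the cost of a long random phase, not the benefit.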

In [None]:
# Exactly half of the entries in labs have a value of 1, so an algorithm that
# always selects a title at random should get about 50% right. The favorable
# titles are not spread completely evenly, but close. The code below only looks
# at the first 1000 titles, of which 49.4% have label 1.

# The test below shows that a small alpha (somewhere around 10 to 20) is good
# enough for training the classifier. A larger alpha *might* produce a better
# model, but it also reduces the number of times the model can be used. In
# particular, when alpha == 195, the algorithm makes 195 random recommendations
# and then uses the classifier for only 5, so the score will be around 50%
# regardless of how good the classifier is.

# One way to get a better score (at the expense of more computation) might be
# to start training after a few rounds and then periodically re-train the model
# as more data is gathered. This is explored in the following cells.

alphas = range(10,200,5)
ress = []
for alpha in alphas:
  res = playgame(probachooser, 1000, alpha)
  print(alpha, res)
  ress += [res]
  
import matplotlib.pyplot as plt

plt.scatter(alphas, ress)
plt.plot(alphas, ress)
plt.show()


We rewrite our `probachooser` function so that `alpha` is now used as the smoothing parameter for our Naive Bayes model, ranging between 0 and 1. The classifier is retrained in every round on all selections made so far, and the `chosenitem` returned is the candidate it considers most likely to be interesting.

We also rewrite our `playgame` function to accommodate the changes we made in `probachooser`.

Given these changes, we will plot how our score changes as we vary `alpha` from 0.00005 to 1. Note: This can take 15-20 minutes to run!
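
For intuition about what the smoothing parameter does: as documented for scikit-learn's `MultinomialNB`, the probability of a word within a class is estimated as `(count + alpha) / (total + alpha * n)`, where `count` is how often the word occurs in that class, `total` is the sum of all word counts in the class, and `n` is the number of features. The sketch below applies that formula in plain Python. `smoothed_word_probs` is a hypothetical helper and the counts are invented.

```python
def smoothed_word_probs(word_counts, alpha):
    """Smoothed estimates (count + alpha) / (total + alpha * n) for one class."""
    n = len(word_counts)
    total = sum(word_counts)
    return [(c + alpha) / (total + alpha * n) for c in word_counts]


counts = [3, 1, 0]  # invented per-word counts for one class
for alpha in (0.00005, 0.01, 1.0):
    print(alpha, smoothed_word_probs(counts, alpha))
```

A word never seen in the class gets a probability close to zero for tiny `alpha` and a noticeably larger share for `alpha = 1`, which is the trade-off the sweep below explores.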

In [None]:
def probachooser(curitem, b, trainset, trainlabs, alpha):
  # Original code just checks for len(trainset) == 0. However, when all
  # labels are the same, the classifier can only predict for that label.
  # This causes clf.predict_proba to return an n*1 array instead of
  # n*2 array, and results in an error later.
  if len(set(trainlabs)) < 2:
    chosenitem = random.randint(curitem,curitem+b-1)
  else:
    clf = MultinomialNB(alpha=alpha)
    clf = clf.fit(trainset, trainlabs)
    yhat = clf.predict_proba(dat[curitem:(curitem+b)])
    chosenitem = argmax(range(curitem,curitem+b), [p for (c,p) in yhat])
  return(chosenitem)

def playgame(chooser, rounds, alpha):
  curitem = 0
  score = 0
  trainset = []
  trainlabs = []
  b = 5

  while curitem < rounds:
    chosenitem = chooser(curitem, b, trainset, trainlabs, alpha)
    score = score + labs[chosenitem]
    trainset = trainset + [dat[chosenitem]]
    trainlabs = trainlabs + [labs[chosenitem]]
    curitem += b
  return(score)

rep = 10
alphas = [0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
ress = []
mins = []
maxs = []
for alpha in alphas:
  print("Processing", alpha)
  res = []
  for i in range(rep):
    res += [playgame(probachooser, 1000, alpha)]
  ress += [sum(res)/rep]
  mins += [min(res)]
  maxs += [max(res)]

We plot the results of varying our smoothing parameter `alpha` below.

In [None]:
# alpha is the smoothing factor. It accounts for features not present in the
# learning samples and prevents zero probabilities. See
# https://scikit-learn.org/0.21/modules/naive_bayes.html#multinomial-naive-bayes

# Original code used linear scale on both axes. This caused small alpha values
# to crowd into each other. I changed the x-axis to log scale. The plot shows
# that as alpha increases, the score first increases, peaks around an alpha of
# 0.01, and then decreases again.

import matplotlib.pyplot as plt

plt.scatter(alphas, ress)
plt.plot(alphas, ress)
plt.xscale('log')
# plt.fill_between(alphas, mins, maxs, alpha=0.6)   
plt.show()