
# Document Filtering

Ch 6 from *Programming Collective Intelligence*, based on code from
* https://github.com/arthur-e/Programming-Collective-Intelligence/tree/master/chapter6
* https://go.oreilly.com/old-dominion-university/library/view/programming-collective-intelligence/9780596529321/

**Goal:** Classify email as spam or not spam.

**Implemented Example:** Classify a given document as "bad" or "good".

## General Functions

In [None]:
import sqlite3 as sqlite   # replaces import stmt from book
import re
import math

`getwords(doc)` - returns the unique words found in the given document, as a dictionary whose keys are the words

* breaks up the text into tokens by splitting on runs of non-word characters (`\W+`), so letters, digits, and underscores stay inside a token
* keeps only tokens that are 3 to 19 characters long, converted to lowercase
* returns each word only once (so a word used multiple times in a document is not counted more than once)

Note that lowercasing reduces the number of features because the text is now case insensitive. However, it also means the classifier cannot use ALL CAPS as a potential feature for spam.


In [None]:
def getwords(doc):
  # Raw string so that \W is read as a regex class, not a string escape
  splitter=re.compile(r'\W+')  # different than book
  # Split on non-word characters; keep lowercase tokens of 3 to 19 characters
  words=[s.lower() for s in splitter.split(doc)
          if len(s)>2 and len(s)<20]

  # Return the unique set of words only (dict keys; every value is 1)
  uniq_words = dict([(w,1) for w in words])

  return uniq_words
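
To make the splitting and length rules concrete, here is a small standalone sketch. The helper name `tokenize` and the sample sentence are made up for illustration; the rule itself is assumed to match `getwords()` (split on `\W+`, keep tokens of 3 to 19 characters, lowercase).

```python
import re

def tokenize(doc):
    # Hypothetical stand-in for getwords(): same split and length filter,
    # but returning a set instead of a dict
    return {s.lower() for s in re.split(r'\W+', doc) if 2 < len(s) < 20}

words = tokenize('Buy NOW!!! Cheap pills at my_site.com, 50% off')
print(sorted(words))
```

The two-character tokens (`at`, `50`) are dropped, `NOW` is folded to `now`, and `my_site` survives as one token because the underscore counts as a word character.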

## Basic Classifier

`class basic_classifier` - holds what the classifier has learned so far
* implemented in pgs. 119-127, no SQL DB involved (the SQL-backed version is `class classifier` below)

Instance variables:
* `fc` - stores counts for different features in the different classifications \\
example: `{'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}`
* `cc` - dictionary of how many times every classification has been used, will be used in later probability calculations
* `getfeatures()` - extracts the features from the items being classified, we use `getwords()`

Helper functions - increment and access the counts (so that we can later store the training data in a file or db)
* `incf()` - increase the count of a feature/category pair
* `incc()` - increase the count of a category
* `fcount()` - num times a feature has appeared in a category
* `catcount()` - number of items in a category
* `totalcount()` - total number of items
* `categories()` - list of all categories

Other functions:
* `train()` - processes the training data, extracts words, and updates counts
* `fprob()` - returns Pr(w|c), probability that a word appears in a category, implements the Multiple Bernoulli method
* `weightedprob()` - returns the weighted probability of Pr(w|c), using assumed probabilities
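
Written out as formulas, the two probability functions look roughly like this. The symbols reuse the names from the code (`fcount`, `catcount`, `weight`, `ap`, `totals`); the notation itself is only a sketch and is not taken from the book.

$$
\Pr(w \mid c) = \frac{\mathrm{fcount}(w, c)}{\mathrm{catcount}(c)}
$$

$$
\Pr\nolimits_{\text{weighted}}(w \mid c) = \frac{\mathrm{weight} \cdot \mathrm{ap} + \mathrm{totals} \cdot \Pr(w \mid c)}{\mathrm{weight} + \mathrm{totals}},
\qquad
\mathrm{totals} = \sum_{c'} \mathrm{fcount}(w, c')
$$

With no evidence (`totals` = 0) the weighted value equals the assumed probability `ap`; as `totals` grows, it moves toward the observed `Pr(w|c)`.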

In [None]:
class basic_classifier:

  def __init__(self,getfeatures,filename=None):
    # Counts of feature/category combinations
    self.fc={}
    # Counts of documents in each category
    self.cc={}
    self.getfeatures=getfeatures
    
  # Increase the count of a feature/category pair  
  def incf(self,f,cat):
    self.fc.setdefault(f, {})
    self.fc[f].setdefault(cat, 0)
    self.fc[f][cat]+=1
  
  # Increase the count of a category  
  def incc(self,cat):
    self.cc.setdefault(cat, 0)
    self.cc[cat]+=1  

  # The number of times a feature has appeared in a category
  def fcount(self,f,cat):
    if f in self.fc and cat in self.fc[f]:
      return float(self.fc[f][cat])
    return 0.0

  # The number of items in a category
  def catcount(self,cat):
    if cat in self.cc:
      return float(self.cc[cat])
    return 0

  # The total number of items
  def totalcount(self):
    return sum(self.cc.values())

  # The list of all categories
  def categories(self):
    return self.cc.keys()

  def train(self,item,cat):
    features=self.getfeatures(item)
    # Increment the count for every feature with this category
    for f in features:
      self.incf(f,cat)

    # Increment the count for this category
    self.incc(cat)

  def fprob(self,f,cat):
    if self.catcount(cat)==0: return 0

    # The total number of times this feature appeared in this 
    # category divided by the total number of items in this category
    return self.fcount(f,cat)/self.catcount(cat)

  def weightedprob(self,f,cat,prf,weight=1.0,ap=0.5):
    # Calculate current probability
    basicprob=prf(f,cat)

    # Count the number of times this feature has appeared in
    # all categories
    totals=sum([self.fcount(f,c) for c in self.categories()])

    # Calculate the weighted average
    bp=((weight*ap)+(totals*basicprob))/(weight+totals)
    return bp

## Training Examples

In [None]:
def sampletrain(cl):
  cl.train('Nobody owns the water.','good')
  cl.train('the quick rabbit jumps fences','good')
  cl.train('buy pharmaceuticals now','bad')
  cl.train('make quick money at the online casino','bad')
  cl.train('the quick brown fox jumps','good')

### Example 1 - simple counts

First, instantiate the basic classifier with `getwords()` as the getfeatures function.

In [None]:
cl = basic_classifier(getwords)

Load sample training data and print out data from the classifier

In [None]:
sampletrain(cl)
print("")
print("Total items:", cl.totalcount())
print("Categories:", cl.categories())
for cat in cl.categories():
  print(cat, cl.catcount(cat))

In [None]:
cl.fcount('quick', 'good')

In [None]:
cl.fcount('quick', 'bad')

### Example 2 (pg. 122) - simple prob

First, reset the classifier by re-instantiating

In [None]:
cl = basic_classifier(getwords)

In [None]:
sampletrain(cl)
cl.fprob('quick', 'good')

### Example 3 (pg. 122) - simple weightedprob

In [None]:
cl = basic_classifier(getwords)
cl.weightedprob('money', 'bad', cl.fprob)

In [None]:
cl.train("This money is bad.", "bad")
cl.weightedprob('money', 'bad', cl.fprob)

### Example 4 - fprob vs. weightedprob

In [None]:
cl = basic_classifier(getwords)
sampletrain(cl)

In [None]:
cl.fprob('money', 'good')

In [None]:
cl.weightedprob('money', 'good', cl.fprob)

### Example 5 (pg. 123) - adding more training data

In [None]:
cl = basic_classifier(getwords)
sampletrain(cl)

In [None]:
cl.weightedprob('money', 'good', cl.fprob)

In [None]:
sampletrain(cl)
cl.weightedprob('money', 'good', cl.fprob)
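
The arithmetic behind this example can be checked by hand. The sketch below is a standalone version of the weighted-average formula, assuming the `basic_classifier` defaults (`weight=1.0`, `ap=0.5`); the function name `weighted` is made up for illustration.

```python
def weighted(basicprob, totals, weight=1.0, ap=0.5):
    # Blend the assumed probability `ap` with the observed `basicprob`,
    # trusting the observation more as `totals` grows
    return (weight * ap + totals * basicprob) / (weight + totals)

# 'money' never appears in a 'good' document, so basicprob stays 0.0
unseen = weighted(0.0, totals=0)  # no evidence at all
once = weighted(0.0, totals=1)    # after one sampletrain(): 'money' seen once
twice = weighted(0.0, totals=2)   # after a second sampletrain()
print(unseen, once, twice)
```

Each extra sighting of 'money' in a 'bad' document pulls the 'good' estimate further from the assumed 0.5 and toward the observed 0.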

## Naive Bayes Classifier

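*`class naivebayes` below inherits from `classifier`, the SQL-backed class defined in the "Classifier w/SQL" section, so that cell must be run first. The Bayesian examples in the next section never call `setdb()`, so they need the in-memory `basic_classifier` instead. To use this with the basic classifier (and to change it back later), make the following changes:*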
* `class naivebayes(classifier)` -> `class naivebayes(basic_classifier)`
* `classifier.__init__(self,getfeatures)` -> `basic_classifier.__init__(self,getfeatures)`

In [None]:
class naivebayes(classifier):   # change for basic_classifier

  def __init__(self,getfeatures):   
    classifier.__init__(self,getfeatures)  # change for basic_classifier
    self.thresholds={}
  
  def docprob(self,item,cat):
    features=self.getfeatures(item)   

    # Multiply the probabilities of all the features together
    p=1
    for f in features: p*=self.weightedprob(f,cat,self.fprob)
    return p

  def prob(self,item,cat):
    catprob=self.catcount(cat)/self.totalcount()
    docprob=self.docprob(item,cat)
    return docprob*catprob
  
  def setthreshold(self,cat,t):
    self.thresholds[cat]=t
    
  def getthreshold(self,cat):
    if cat not in self.thresholds: return 1.0
    return self.thresholds[cat]
  
  def classify(self,item,default=None):
    probs={}
    # Find the category with the highest probability
    # (best stays None if every probability is 0)
    best=None
    maxprob=0.0
    for cat in self.categories():
      probs[cat]=self.prob(item,cat)
      if probs[cat]>maxprob:
        maxprob=probs[cat]
        best=cat

    # No category scored above 0, so fall back to the default
    if best is None: return default

    # Make sure the probability exceeds threshold*next best
    for cat in probs:
      if cat==best: continue
      if probs[cat]*self.getthreshold(best)>probs[best]: return default
    return best

## Bayesian Examples

### Example 1 (pg. 125) - prob

Training dataset: 
```
('Nobody owns the water.','good')
('the quick rabbit jumps fences','good')
('buy pharmaceuticals now','bad')
('make quick money at the online casino','bad')
('the quick brown fox jumps','good')
```

In [None]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.prob('quick rabbit', 'good')

In [None]:
cl.prob('quick rabbit', 'bad')
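
For reference, the two values can be reproduced without the classes. This is a standalone sketch that assumes the `basic_classifier` defaults (`weight=1.0`, `ap=0.5`); the names `words`, `weighted`, and `prob` are illustrative stand-ins, not the notebook's methods.

```python
import re
from collections import defaultdict

docs = [
    ('Nobody owns the water.', 'good'),
    ('the quick rabbit jumps fences', 'good'),
    ('buy pharmaceuticals now', 'bad'),
    ('make quick money at the online casino', 'bad'),
    ('the quick brown fox jumps', 'good'),
]

def words(doc):
    # Unique lowercase tokens of 3 to 19 characters
    return {s.lower() for s in re.split(r'\W+', doc) if 2 < len(s) < 20}

fc = defaultdict(lambda: defaultdict(int))  # feature -> category -> documents
cc = defaultdict(int)                       # category -> documents
for text, cat in docs:
    for w in words(text):
        fc[w][cat] += 1
    cc[cat] += 1

def weighted(f, cat, weight=1.0, ap=0.5):
    basic = fc[f][cat] / cc[cat]
    totals = sum(fc[f].values())
    return (weight * ap + totals * basic) / (weight + totals)

def prob(doc, cat):
    # Pr(category) times the product of the weighted word probabilities
    p = cc[cat] / sum(cc.values())
    for f in words(doc):
        p *= weighted(f, cat)
    return p

good = prob('quick rabbit', 'good')  # 0.6 * 0.625 * 0.41666...
bad = prob('quick rabbit', 'bad')    # 0.4 * 0.5 * 0.25
print(good, bad)
```

Under these assumptions 'good' comes out at about 0.156 and 'bad' at 0.05, so 'good' is roughly three times as likely.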

### Example 2 (pg. 127) - using thresholds

In [None]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify('quick rabbit', default='unknown')

In [None]:
cl.classify('quick money', default='unknown')

In [None]:
cl.setthreshold('bad', 3.0)
cl.classify('quick money', default='unknown')

In [None]:
for i in range(10): sampletrain(cl)
cl.classify('quick money', default='unknown')
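
The threshold rule is easier to see in isolation. The sketch below applies it to a plain dictionary of scores; the function name `pick` is made up, and the scores are the values that 'quick money' works out to after one `sampletrain()`, assuming the `basic_classifier` defaults.

```python
def pick(probs, thresholds, default=None):
    # Choose the top-scoring category, but only if it beats every other
    # category by the factor registered for it in `thresholds`
    if not probs:
        return default
    best = max(probs, key=probs.get)
    if probs[best] == 0:
        return default
    for cat, p in probs.items():
        if cat != best and p * thresholds.get(best, 1.0) > probs[best]:
            return default
    return best

scores = {'good': 0.09375, 'bad': 0.1}
plain = pick(scores, {}, default='unknown')
strict = pick(scores, {'bad': 3.0}, default='unknown')
print(plain, strict)
```

With no threshold the narrow lead is enough to answer 'bad'; with a threshold of 3.0 on 'bad', the 'bad' score would have to be at least three times the 'good' score, so the result falls back to 'unknown'.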

## Classifier w/SQL

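Same interface as `basic_classifier`, but the feature and category counts are stored in a SQLite database (tables `fc` and `cc`), so training persists between sessions. Call `setdb()` before training or classifying.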

In [None]:
class classifier:
  def __init__(self,getfeatures,filename=None):
    # Counts of feature/category combinations
    self.fc={}
    # Counts of documents in each category
    self.cc={}
    self.getfeatures=getfeatures
    
  def setdb(self,dbfile):
    # Open (or create) the SQLite file and make sure both count tables exist
    self.con=sqlite.connect(dbfile)
    self.con.execute('create table if not exists fc(feature,category,count)')
    self.con.execute('create table if not exists cc(category,count)')

  # Queries use ? placeholders so quotes in a feature or category
  # cannot break the SQL
  def incf(self,f,cat):
    count=self.fcount(f,cat)
    if count==0:
      self.con.execute('insert into fc values (?,?,1)',(f,cat))
    else:
      self.con.execute(
        'update fc set count=? where feature=? and category=?',
        (int(count)+1,f,cat))

  def fcount(self,f,cat):
    res=self.con.execute(
      'select count from fc where feature=? and category=?',
      (f,cat)).fetchone()
    if res is None: return 0
    else: return float(res[0])

  def incc(self,cat):
    count=self.catcount(cat)
    if count==0:
      self.con.execute('insert into cc values (?,1)',(cat,))
    else:
      self.con.execute('update cc set count=? where category=?',
                       (int(count)+1,cat))

  def catcount(self,cat):
    res=self.con.execute('select count from cc where category=?',
                         (cat,)).fetchone()
    if res is None: return 0
    else: return float(res[0])

  def categories(self):
    cur=self.con.execute('select category from cc')
    return [d[0] for d in cur]

  def totalcount(self):
    res=self.con.execute('select sum(count) from cc').fetchone()
    # sum() over an empty table yields a row holding NULL (None)
    if res is None or res[0] is None: return 0
    return res[0]

  def train(self,item,cat):
    features=self.getfeatures(item)
    # Increment the count for every feature with this category
    for f in features:
      self.incf(f,cat)

    # Increment the count for this category
    self.incc(cat)
    self.con.commit()

  def fprob(self,f,cat):
    if self.catcount(cat)==0: return 0

    # The total number of times this feature appeared in this 
    # category divided by the total number of items in this category
    return self.fcount(f,cat)/self.catcount(cat)

  def weightedprob(self,f,cat,prf,weight=0.5,ap=0.1):
    # Calculate current probability
    basicprob=prf(f,cat)

    # Count the number of times this feature has appeared in
    # all categories
    totals=sum([self.fcount(f,c) for c in self.categories()])

    # Calculate the weighted average
    bp=((weight*ap)+(totals*basicprob))/(weight+totals)
    return bp
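
Keeping the counts in SQLite means they outlive the Python session. The sketch below shows that idea on its own, using the same `cc` table layout; the temporary file name `counts.db` is hypothetical and the example does not use the class above.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'counts.db')  # hypothetical file name

# First session: create the table and store one category count
con = sqlite3.connect(path)
con.execute('create table if not exists cc(category,count)')
con.execute('insert into cc values (?,?)', ('good', 3))
con.commit()
con.close()

# Second session: a fresh connection still sees the committed count
con = sqlite3.connect(path)
row = con.execute('select count from cc where category=?',
                  ('good',)).fetchone()
missing = con.execute('select count from cc where category=?',
                      ('bad',)).fetchone()
con.close()
print(row[0], missing)
```

A category that was never stored comes back as `None` from `fetchone()`, which is the case `catcount()` maps to 0.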


## Examples - Full Bayesian Classifier w/SQL


In [None]:
# Read a whole text file and close it again afterwards
def readfile(name):
  with open(name, "r") as f:
    return f.read()

# Read in files for training
# emails that are shopping
trainOn = [readfile("trainOn%d.txt" % i) for i in range(1, 21)]
(trainOn1, trainOn2, trainOn3, trainOn4, trainOn5,
 trainOn6, trainOn7, trainOn8, trainOn9, trainOn10,
 trainOn11, trainOn12, trainOn13, trainOn14, trainOn15,
 trainOn16, trainOn17, trainOn18, trainOn19, trainOn20) = trainOn
# emails that are not shopping
trainOff = [readfile("trainOff%d.txt" % i) for i in range(1, 21)]
(trainOff1, trainOff2, trainOff3, trainOff4, trainOff5,
 trainOff6, trainOff7, trainOff8, trainOff9, trainOff10,
 trainOff11, trainOff12, trainOff13, trainOff14, trainOff15,
 trainOff16, trainOff17, trainOff18, trainOff19, trainOff20) = trainOff
# Read in files for testing
# Test emails that should be shopping
testOn = [readfile("testOn%d.txt" % i) for i in range(1, 6)]
testOn1, testOn2, testOn3, testOn4, testOn5 = testOn
# Test emails that should not be shopping
testOff = [readfile("testOff%d.txt" % i) for i in range(1, 6)]
testOff1, testOff2, testOff3, testOff4, testOff5 = testOff

In [None]:
def spamTrain(cl):
  # emails that are shopping
  cl.train(trainOn1, 'shopping')
  cl.train(trainOn2, 'shopping')
  cl.train(trainOn3, 'shopping')
  cl.train(trainOn4, 'shopping')
  cl.train(trainOn5, 'shopping')
  cl.train(trainOn6, 'shopping')
  cl.train(trainOn7, 'shopping')
  cl.train(trainOn8, 'shopping')
  cl.train(trainOn9, 'shopping')
  cl.train(trainOn10, 'shopping')
  cl.train(trainOn11, 'shopping')
  cl.train(trainOn12, 'shopping')
  cl.train(trainOn13, 'shopping')
  cl.train(trainOn14, 'shopping')
  cl.train(trainOn15, 'shopping')
  cl.train(trainOn16, 'shopping')
  cl.train(trainOn17, 'shopping')
  cl.train(trainOn18, 'shopping')
  cl.train(trainOn19, 'shopping')
  cl.train(trainOn20, 'shopping')
  # emails that are not shopping
  cl.train(trainOff1, 'not shopping')
  cl.train(trainOff2, 'not shopping')
  cl.train(trainOff3, 'not shopping')
  cl.train(trainOff4, 'not shopping')
  cl.train(trainOff5, 'not shopping')
  cl.train(trainOff6, 'not shopping')
  cl.train(trainOff7, 'not shopping')
  cl.train(trainOff8, 'not shopping')
  cl.train(trainOff9, 'not shopping')
  cl.train(trainOff10, 'not shopping')
  cl.train(trainOff11, 'not shopping')
  cl.train(trainOff12, 'not shopping')
  cl.train(trainOff13, 'not shopping')
  cl.train(trainOff14, 'not shopping')
  cl.train(trainOff15, 'not shopping')
  cl.train(trainOff16, 'not shopping')
  cl.train(trainOff17, 'not shopping')
  cl.train(trainOff18, 'not shopping')
  cl.train(trainOff19, 'not shopping')
  cl.train(trainOff20, 'not shopping') 

*Don't forget to adjust `class naivebayes` to use `classifier`*

In [None]:
# HW9 test
cl = naivebayes(getwords)
cl.setdb('test.db')
spamTrain(cl)

In [None]:
cl.classify(testOn1, default='unknown')

'shopping'

In [None]:
cl.classify(testOn2, default='unknown')

'shopping'

In [None]:
cl.classify(testOn3, default='unknown')

'shopping'

In [None]:
cl.classify(testOn4, default='unknown')

'shopping'

In [None]:
cl.classify(testOn5, default='unknown')

'shopping'

In [None]:
cl.classify(testOff1, default='unknown')

'not shopping'

In [None]:
cl.classify(testOff2, default='unknown')

'not shopping'

In [None]:
cl.classify(testOff3, default='unknown')

'not shopping'

In [None]:
cl.classify(testOff4, default='unknown')

'not shopping'

In [None]:
cl.classify(testOff5, default='unknown')

'not shopping'
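
The ten results above can be tallied into an accuracy figure and a small confusion table. The sketch below simply hard-codes the expected labels and the outputs shown in the cells above; the variable names are illustrative.

```python
from collections import Counter

# testOn1..testOn5 should be 'shopping', testOff1..testOff5 'not shopping'
expected = ['shopping'] * 5 + ['not shopping'] * 5
# Labels returned by cl.classify() in the ten cells above
predicted = ['shopping'] * 5 + ['not shopping'] * 5

correct = sum(e == p for e, p in zip(expected, predicted))
accuracy = correct / len(expected)
confusion = Counter(zip(expected, predicted))  # (expected, predicted) -> count
print(accuracy, dict(confusion))
```

With only five test emails per class, a perfect score is encouraging but says little about how the classifier would behave on a larger or messier mailbox.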

In [None]:
print(testOff5)

In [None]:
cl = naivebayes(getwords)
cl.setdb('test1.db')
spamTrain(cl)
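# NOTE: 'spam' is not one of the trained categories ('shopping', 'not shopping'),
# so this threshold is never consulted by classify()
cl.setthreshold('spam', 3.0)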
cl.classify('the banking dinner', default='unknown')

In [None]:
cl2 = naivebayes(getwords)
cl2.setdb('test2.db')
sampletrain(cl2)
cl2.setthreshold('bad', 3.0)
cl2.classify('quick money', default='unknown')

In [None]:
cl = naivebayes(getwords)
cl.setdb('test1.db')
cl.classify('cheap money', default='unknown')

In [None]:
cl2.classify('online casino now', default='unknown')

## Multinomial Model

In [None]:
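def getwordsMn(doc):
  # Raw string so that \W is read as a regex class, not a string escape
  splitter=re.compile(r'\W+')  # different than book
  # Split on non-word characters; keep lowercase tokens of 3 to 19 characters
  words=[s.lower() for s in splitter.split(doc)
          if len(s)>2 and len(s)<20]

  # Keep repeated words: the multinomial model counts every occurrence
  return words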
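
The difference between the two feature extractors shows up as soon as a word repeats. This standalone sketch compares a set of tokens (the `getwords()` view) with per-token counts (the `getwordsMn()` view); the helper `tokens` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def tokens(doc):
    # Same split and length filter as getwordsMn(): duplicates are kept
    return [s.lower() for s in re.split(r'\W+', doc) if 2 < len(s) < 20]

doc = 'Free money! Claim your free money now'
bernoulli = set(tokens(doc))        # word present or absent
multinomial = Counter(tokens(doc))  # how many times each word occurs
print(sorted(bernoulli), multinomial)
```

Here 'free' and 'money' each count twice in the multinomial view but only once in the Bernoulli view.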

In [None]:
class classifierMn:
  def __init__(self,getfeatures,filename=None):
    # Counts of feature/category combinations
    self.fc={}
    # Counts of documents in each category
    self.cc={}
    self.getfeatures=getfeatures
    
  def setdb(self,dbfile):
    # Open (or create) the SQLite file and make sure both count tables exist
    self.con=sqlite.connect(dbfile)
    self.con.execute('create table if not exists fc(feature,category,count)')
    self.con.execute('create table if not exists cc(category,count)')

  # Queries use ? placeholders so quotes in a feature or category
  # cannot break the SQL
  def incf(self,f,cat):
    count=self.fcount(f,cat)
    if count==0:
      self.con.execute('insert into fc values (?,?,1)',(f,cat))
    else:
      self.con.execute(
        'update fc set count=? where feature=? and category=?',
        (int(count)+1,f,cat))

  def fcount(self,f,cat):
    res=self.con.execute(
      'select count from fc where feature=? and category=?',
      (f,cat)).fetchone()
    if res is None: return 0
    else: return float(res[0])

  def incc(self,cat):
    count=self.catcount(cat)
    if count==0:
      self.con.execute('insert into cc values (?,1)',(cat,))
    else:
      self.con.execute('update cc set count=? where category=?',
                       (int(count)+1,cat))

  def catcount(self,cat):
    res=self.con.execute('select count from cc where category=?',
                         (cat,)).fetchone()
    if res is None: return 0
    else: return float(res[0])

  def categories(self):
    cur=self.con.execute('select category from cc')
    return [d[0] for d in cur]

  def totalcount(self):
    res=self.con.execute('select sum(count) from cc').fetchone()
    # sum() over an empty table yields a row holding NULL (None)
    if res is None or res[0] is None: return 0
    return res[0]

  def train(self,item,cat):
    features=self.getfeatures(item)
    # Increment the count for every feature with this category
    for f in features:
      self.incf(f,cat)

    # Increment the count for this category
    self.incc(cat)
    self.con.commit()

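  def fprob(self,f,cat):
    if self.catcount(cat)==0: return 0

    # Assumption: the usual multinomial estimate, i.e. occurrences of this
    # feature in the category divided by the total number of word
    # occurrences in the category (fc holds per-occurrence counts here)
    res=self.con.execute('select sum(count) from fc where category=?',
                         (cat,)).fetchone()
    if res is None or res[0] is None or res[0]==0: return 0
    return self.fcount(f,cat)/float(res[0])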

  def weightedprob(self,f,cat,prf,weight=0.5,ap=0.1):
    # Calculate current probability
    basicprob=prf(f,cat)

    # Count the number of times this feature has appeared in
    # all categories
    totals=sum([self.fcount(f,c) for c in self.categories()])

    # Calculate the weighted average
    bp=((weight*ap)+(totals*basicprob))/(weight+totals)
    return bp
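
One common way to estimate Pr(w|c) in a multinomial model is to divide a word's occurrences in a category by all word occurrences in that category. The sketch below shows that estimate on in-memory counts; the function name `multinomial_prob` and the counts themselves are invented for illustration and do not come from the training emails.

```python
from collections import Counter

# Hypothetical term-frequency counts per category (occurrences, not documents)
tf = {
    'shopping': Counter({'sale': 6, 'order': 3, 'meeting': 1}),
    'not shopping': Counter({'meeting': 4, 'order': 1}),
}

def multinomial_prob(word, cat):
    # Occurrences of `word` in `cat` over all word occurrences in `cat`
    total = sum(tf[cat].values())
    return tf[cat][word] / total if total else 0.0

p_sale = multinomial_prob('sale', 'shopping')          # 6 / 10
p_meeting = multinomial_prob('meeting', 'not shopping')  # 4 / 5
p_unseen = multinomial_prob('sale', 'not shopping')    # 0 / 5
print(p_sale, p_meeting, p_unseen)
```

Unlike the document-count version, these estimates sum to 1 across the vocabulary of a category, and an unseen word still gets 0, which is why the weighted probability remains useful on top of it.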
