Consider the data set below for spam detection.

We will use the Naive Bayes classifier to learn from this data and classify new sentences as spam or not spam.

> Sentence | Spam
> --- | ---  
> congrats you are selected | N 
> congrats you won lottery | Y
> travel for free | Y
> selected for credit cards | Y
> very good | N
> good night | N
> lottery | Y 

In this context, each word is treated as an attribute whose value is 1 if the word appears in the sentence and 0 if it does not.

For example, "congrats" will be attribute $a_0$, "you" will be attribute $a_1$, etc. 

> word | attribute | dictionary index (see Python code below)
> --- | --- | ---
> congrats | $a_0$ | 0
> you | $a_1$ | 1
> are | $a_2$ | 2
> selected | $a_3$ | 3
> won | $a_4$ | 4
> lottery | $a_5$ | 5
> travel | $a_6$ | 6
> for | $a_7$ | 7
> free | $a_8$ | 8
> credit | $a_9$ | 9
> cards | $a_{10}$ | 10
> very | $a_{11}$ | 11
> good | $a_{12}$ | 12
> night | $a_{13}$ | 13




So, the sentence "you won free travel" would be 
> $\small (a_0=0, a_1=1, a_2=0, a_3=0, a_4=1, a_5=0, a_6=1, a_7=0, a_8=1, a_9=0, a_{10}=0, a_{11}=0, a_{12}=0, a_{13}=0)$

or simply
> $\small (0,1,0,0,1,0,1,0,1,0,0,0,0,0)$

So, the probability that "you won free travel" is spam can be written as:
> $\small P(Spam=Yes| a_0=0, a_1=1, a_2=0, a_3=0, a_4=1, a_5=0, a_6=1, a_7=0, a_8=1, a_9=0, a_{10}=0, a_{11}=0, a_{12}=0, a_{13}=0)$

or as

> $\small P(Yes|0,1,0,0,1,0,1,0,1,0,0,0,0,0)$.
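
To make the "naive" part concrete, here is a sketch of how Bayes' rule and the conditional-independence assumption break this posterior into per-attribute terms. The symbol $v_i$ (the observed 0/1 value of attribute $a_i$) is introduced here only for illustration:

> $\small P(Yes|a_0=v_0, \ldots, a_{13}=v_{13}) \propto P(Yes) \prod_{i=0}^{13} P(a_i=v_i|Yes)$

The same expression with "No" in place of "Yes" gives the score for the other class, and the classifier picks whichever class scores higher.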

Note that word order does not matter, and multiple occurrences of the same word are simply represented as 1.
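
The encoding described above can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the names `word_index` and `sentence_to_vec` are hypothetical, and silently ignoring words that are not in the dictionary is an assumption.

```python
import numpy as np

# Word list in the same order as the attribute table (a_0 ... a_13).
dictionary = ["congrats", "you", "are", "selected", "won", "lottery", "travel",
              "for", "free", "credit", "cards", "very", "good", "night"]

# Hypothetical lookup table: word -> attribute index.
word_index = {word: i for i, word in enumerate(dictionary)}

def sentence_to_vec(sentence):
    """Return the 0/1 attribute vector for a sentence (hypothetical helper)."""
    vec = np.zeros(len(dictionary), dtype=int)
    for word in sentence.lower().split():
        # Assumption: words outside the dictionary are ignored.
        if word in word_index:
            vec[word_index[word]] = 1  # repeated words still give 1
    return vec

print(sentence_to_vec("you won free travel"))
```

Because the vector only records presence, "travel free you won" and "you you won free travel" produce the same vector.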


In [1]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

#---------------------------------------------------------------------
# dictionary, to look up words from the data vector -- case sensitive! 
#---------------------------------------------------------------------
dictionary = np.array(["congrats","you","are","selected","won","lottery","travel","for","free","credit","cards","very","good","night"])

#---------------------------------------
# vec2word: convert data vector to words
#---------------------------------------
def vec2word(vec):
  """
  arguments: vec = np.array([[0,1,...]]), one 0/1 flag per dictionary word
  prints: list of the words flagged 1, in dictionary order (not sentence order)
  returns: None
  """
  # Select the dictionary entries whose flag is 1; tolist() gives plain Python strings.
  words = dictionary[np.ravel(vec) == 1].tolist()
  print(words)


#--------------------------------
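# spam data: one row per training sentence, in the order of the table above
#--------------------------------
X = np.array([
 [1,1,1,1,0,0,0,0,0,0,0,0,0,0],  # congrats you are selected
 [1,1,0,0,1,1,0,0,0,0,0,0,0,0],  # congrats you won lottery
 [0,0,0,0,0,0,1,1,1,0,0,0,0,0],  # travel for free
 [0,0,0,1,0,0,0,1,0,1,1,0,0,0],  # selected for credit cards
 [0,0,0,0,0,0,0,0,0,0,0,1,1,0],  # very good
 [0,0,0,0,0,0,0,0,0,0,0,0,1,1],  # good night
 [0,0,0,0,0,1,0,0,0,0,0,0,0,0],  # lottery
])

# labels: 1 = spam, 0 = not spam
y = np.array([0,1,1,1,0,0,1])

clf = MultinomialNB()
clf.fit(X, y)

# accuracy on the training data itself, not on held-out sentences
print("Score (accuracy: 1.0 = 100%)= ",end="")
print(clf.score(X,y))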


Score (accuracy: 1.0 = 100%)= 1.0
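
To see what the fitted model computes, the sketch below reproduces the posterior for "you won free travel" by hand and compares it with `predict_proba`. It assumes the default Laplace smoothing of `MultinomialNB` (`alpha=1.0`); the helper name `manual_posterior` is hypothetical. Note that the multinomial model scores only the words that are present in the sentence, so absent words contribute nothing to the product.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Training data copied from the cell above (rows follow the table order).
X = np.array([
    [1,1,1,1,0,0,0,0,0,0,0,0,0,0],
    [1,1,0,0,1,1,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,1,1,1,0,0,0,0,0],
    [0,0,0,1,0,0,0,1,0,1,1,0,0,0],
    [0,0,0,0,0,0,0,0,0,0,0,1,1,0],
    [0,0,0,0,0,0,0,0,0,0,0,0,1,1],
    [0,0,0,0,0,1,0,0,0,0,0,0,0,0],
])
y = np.array([0,1,1,1,0,0,1])

alpha = 1.0  # assumption: same smoothing as MultinomialNB's default

def manual_posterior(x):
    """Return [P(not spam | x), P(spam | x)] computed by hand (hypothetical helper)."""
    log_scores = []
    for c in (0, 1):
        Xc = X[y == c]
        prior = Xc.shape[0] / X.shape[0]          # fraction of sentences in class c
        counts = Xc.sum(axis=0)                   # word counts within class c
        word_prob = (counts + alpha) / (counts.sum() + alpha * X.shape[1])
        log_scores.append(np.log(prior) + x @ np.log(word_prob))
    log_scores = np.array(log_scores)
    scores = np.exp(log_scores - log_scores.max())  # subtract max for stability
    return scores / scores.sum()

x = np.array([0,1,0,0,1,0,1,0,1,0,0,0,0,0])  # "you won free travel"
manual = manual_posterior(x)
sk = MultinomialNB().fit(X, y).predict_proba(x.reshape(1, -1))[0]
print("manual :", manual)
print("sklearn:", sk)
```

Both lines should agree, with the spam probability at roughly 0.85 for this sentence.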


The following 3 test sentences are classified as "Spam" (output = 1).

- None of them appears in the data set above.
- Each sentence is at least 4 words long.

In [2]:
print("\nThis is test sentence 1: You won good free travel cards")
test1 = np.array([[0,1,0,0,1,0,1,0,1,0,1,0,1,0]])
print(vec2word(test1)) #verifying that sentence is correct 
print("The prediction is")
print(clf.predict(test1))

print("\nThis is test sentence 2: You are selected for lottery")
test2 = np.array([[0,1,1,1,0,1,0,1,0,0,0,0,0,0]])
print(vec2word(test2)) #verifying that sentence is correct 
print("The prediction is")
print(clf.predict(test2))

print("\nThis is test sentence 3: You won good credit cards")
test3 = np.array([[0,1,0,0,1,0,0,0,0,1,1,0,1,0]])
print(vec2word(test3)) #verifying that sentence is correct 
print("The prediction is")
print(clf.predict(test3))



This is test sentence 1: You won good free travel cards
['you', 'won', 'travel', 'free', 'cards', 'good']
None
The prediction is
[1]

This is test sentence 2: You are selected for lottery
['you', 'are', 'selected', 'lottery', 'for']
None
The prediction is
[1]

This is test sentence 3: You won good credit cards
['you', 'won', 'credit', 'cards', 'good']
None
The prediction is
[1]


The following 3 test sentences are classified as "Not spam" (output = 0).

- None of them appears in the data set above.
- Each sentence is at least 4 words long.

In [3]:
print("\nThis is test sentence 1: You are selected for night")
test1 = np.array([[0,1,1,1,0,0,0,1,0,0,0,0,0,1]])
print(vec2word(test1)) #verifying that sentence is correct 
print("The prediction is")
print(clf.predict(test1))

print("\nThis is test sentence 2: You are very good")
test2 = np.array([[0,1,1,0,0,0,0,0,0,0,0,1,1,0]])
print(vec2word(test2)) #verifying that sentence is correct 
print("The prediction is")
print(clf.predict(test2))

print("\nThis is test sentence 3: Good for travel night")
test3 = np.array([[0,0,0,0,0,0,1,1,0,0,0,0,1,1]])
print(vec2word(test3)) #verifying that sentence is correct 
print("The prediction is")
print(clf.predict(test3))


This is test sentence 1: You are selected for night
['you', 'are', 'selected', 'for', 'night']
None
The prediction is
[0]

This is test sentence 2: You are very good
['you', 'are', 'very', 'good']
None
The prediction is
[0]

This is test sentence 3: Good for travel night
['travel', 'for', 'good', 'night']
None
The prediction is
[0]
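
Typing the 0/1 vectors by hand is error-prone, so as an alternative sketch the vectors can be built from the raw sentences with scikit-learn's `CountVectorizer`. This swaps in a different technique from the notebook's hand-typed arrays: `binary=True` records presence instead of counts, and passing a fixed `vocabulary` keeps the columns in the same order as the attribute table. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vocabulary = ["congrats", "you", "are", "selected", "won", "lottery", "travel",
              "for", "free", "credit", "cards", "very", "good", "night"]

# Training sentences and labels from the table at the top (1 = spam, 0 = not spam).
train_sentences = [
    "congrats you are selected",
    "congrats you won lottery",
    "travel for free",
    "selected for credit cards",
    "very good",
    "good night",
    "lottery",
]
y = np.array([0, 1, 1, 1, 0, 0, 1])

# Fixed vocabulary -> column i corresponds to attribute a_i; input is lowercased by default.
vectorizer = CountVectorizer(vocabulary=vocabulary, binary=True)
X = vectorizer.fit_transform(train_sentences)

clf = MultinomialNB().fit(X, y)

# Two spam and two non-spam test sentences taken from the cells above.
spam_tests = ["You are selected for lottery", "You won good credit cards"]
ham_tests = ["You are very good", "Good for travel night"]

spam_pred = clf.predict(vectorizer.transform(spam_tests))
ham_pred = clf.predict(vectorizer.transform(ham_tests))
print("spam tests    :", spam_pred)
print("not-spam tests:", ham_pred)
```

The resulting training matrix matches the hand-entered `X`, so the predictions agree with the outputs shown above.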
