Step 1.1: Understanding our dataset
--
- Parse the data
- Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [76]:
import pandas as pd
df = pd.read_table(r'C:\Users\MLUSER\Documents\GitHub\Udacity\Naive Bayes Tutorial/SMSSpamCollection',
                  sep='\t',
                  header=None,
                  names=['label','sms_message'])
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Step 1.2: Data Preprocessing
--
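- Convert the labels (str) to a binary variable (int):
   - 0 -> 'ham' (not spam)
   - 1 -> 'spam'
- Why: scikit-learn works most smoothly with numerical values; in particular, binary metrics such as precision and recall treat the label 1 as the positive class by default.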

-----
```python
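# General forms (pseudocode, not runnable as written):
#   lambda argument: manipulate(argument)
#   map(function_to_apply, list_of_inputs)
# Example: apply int() to every string in a list.
list(map(int, ["12", "37", "999"]))
# [12, 37, 999]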
```

In [280]:
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Step 2.1: Bag of words
--
   - BoW concept: take a piece of text and count the frequency of the words in it. BoW treats each word individually, and the order in which the words occur does not matter.
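
As a small illustration of the point that word order does not matter, the sketch below counts the words in two made-up sentences that contain the same words in a different order. The sentences and the helper name `bag_of_words` are illustrative assumptions, not part of the dataset.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping for a piece of text (hypothetical helper)."""
    return Counter(text.lower().split())


# Two made-up sentences with the same words in a different order.
bow_a = bag_of_words("win money win")
bow_b = bag_of_words("money win win")

print(bow_a["win"])    # how often 'win' occurs in the first sentence
print(bow_a == bow_b)  # the two bags are equal despite the different order
```

Because only the counts are kept, any information carried by word order is lost in this representation.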
    
Step 2.2: Implementing Bag of Words from scratch
--

```python
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```


Step 1: Convert all strings to their lower case form.
- Let's say we have a document set:
```python
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
```

In [281]:
doc = ['Hello, how are you!',
       'Win money, win from home.',
       'Call me now.',
       'Hello, Call hello you tomorrow?']

# Lowercase every document.
doc_lower = [d.lower() for d in doc]

print doc_lower

    
    

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


Step 2: Removing all punctuation

In [282]:
import string
sans_doc_punc = []
for i in doc_lower:
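    # Python 2: replace the four punctuation marks used above with spaces.
    # Python 3 alternative that deletes all punctuation:
    #   i.translate(str.maketrans('', '', string.punctuation))
    sans_doc_punc.append(i.translate(string.maketrans(',!?.','    ')))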
print sans_doc_punc

['hello  how are you ', 'win money  win from home ', 'call me now ', 'hello  call hello you tomorrow ']


Step 3: Tokenization

In [283]:
doc_pre = []
for i in sans_doc_punc:
    doc_pre.append(i.split())
    
print doc_pre

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]


Step 4: Count frequencies
```python
# Counter is a dict subclass for counting hashable objects.
# Tally occurrences of words in a list.
from collections import Counter

cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
cnt
# Counter({'blue': 3, 'red': 2, 'green': 1})
```
https://docs.python.org/2/library/collections.html?highlight=counter

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.

In [284]:
import collections as colt
freq_list = []
for i in doc_pre:
    freq_count = colt.Counter(i)
    freq_list.append(freq_count)
    print freq_count
    print '\n'
    
print freq_list
print '\n'
print type(freq_count)
print type(freq_list)


#freq_count = colt.Counter()
#for i in doc_pre:
#    for j in i:
#        freq_count[j] +=1
#print freq_count


Counter({'how': 1, 'you': 1, 'hello': 1, 'are': 1})


Counter({'win': 2, 'home': 1, 'from': 1, 'money': 1})


Counter({'me': 1, 'now': 1, 'call': 1})


Counter({'hello': 2, 'you': 1, 'call': 1, 'tomorrow': 1})


[Counter({'how': 1, 'you': 1, 'hello': 1, 'are': 1}), Counter({'win': 2, 'home': 1, 'from': 1, 'money': 1}), Counter({'me': 1, 'now': 1, 'call': 1}), Counter({'hello': 2, 'you': 1, 'call': 1, 'tomorrow': 1})]


<class 'collections.Counter'>
<type 'list'>


Step 2.3: Implementing Bag of Words in scikit-learn
--

- Instructions: Import the sklearn.feature_extraction.text.CountVectorizer class and create an instance of it called 'count_vector'.
- Instructions: Fit your document dataset to the CountVectorizer object you have created using fit(), and get the list of words which have been categorized as features using the get_feature_names() method (replaced by get_feature_names_out() in scikit-learn 1.0 and later).

#### Data preprocessing with CountVectorizer()
In Step 2.2, we implemented a version of the CountVectorizer() method from scratch that entailed cleaning our data first. This cleaning involved converting all of our data to lower case and removing all punctuation marks. CountVectorizer() has certain parameters which take care of these steps for us. They are:
##### `lowercase = True`
The `lowercase` parameter has a default value of `True`, which converts all of our text to its lower case form.
##### `token_pattern = (?u)\b\w\w+\b`
The `token_pattern` parameter has a default regular expression value of `(?u)\b\w\w+\b`, which ignores all punctuation marks and treats them as delimiters, while accepting alphanumeric strings of length greater than or equal to 2 as individual tokens or words.
##### `stop_words`
The `stop_words` parameter, if set to `'english'`, will remove all words from our document set that match a list of English stop words defined in scikit-learn. Considering the size of our dataset and the fact that we are dealing with SMS messages and not larger text sources like e-mail, we will not be setting this parameter value.
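
To see what the default token pattern does in isolation, the sketch below applies the same regular expression with Python's `re` module after lowercasing. It only mimics these two preprocessing steps and is not scikit-learn's actual code path; the helper name `tokenize` is an assumption made for illustration.

```python
import re

# Default token pattern quoted above: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(text):
    """Lowercase the text, then extract tokens with the default pattern."""
    return re.findall(TOKEN_PATTERN, text.lower())


tokens = tokenize("Hello, how are you!")
print(tokens)  # ['hello', 'how', 'are', 'you']

# Single-character words such as 'U' and 'c' do not match the pattern.
short_tokens = tokenize("U c already then say")
print(short_tokens)  # ['already', 'then', 'say']
```

The second call uses a fragment of one of the SMS messages shown earlier and illustrates that one-letter abbreviations, which are common in text messages, are silently dropped by the default pattern.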


In [285]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
print count_vector

print type(count_vector)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
<class 'sklearn.feature_extraction.text.CountVectorizer'>


In [286]:
count_vector.fit(documents)
count_vector.get_feature_names()

[u'are',
 u'call',
 u'from',
 u'hello',
 u'home',
 u'how',
 u'me',
 u'money',
 u'now',
 u'tomorrow',
 u'win',
 u'you']

In [287]:
doc_array = count_vector.transform(documents).toarray()
print doc_array
print type(doc_array)

[[1 0 0 1 0 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 1 0 0 2 0]
 [0 1 0 0 0 0 1 0 1 0 0 0]
 [0 1 0 2 0 0 0 0 0 1 0 1]]
<type 'numpy.ndarray'>


In [288]:
frq_matrix = pd.DataFrame(doc_array,index=None,columns = count_vector.get_feature_names())

frq_matrix

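| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |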
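
The frequency matrix above can also be rebuilt by hand, which ties the from-scratch steps of Step 2.2 to the scikit-learn result. The following is a minimal sketch using only the standard library; it assumes that lowercasing, the default token pattern, and an alphabetically sorted vocabulary are enough to reproduce the output for these four documents, and it is not how CountVectorizer is implemented internally.

```python
import re
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# Tokenize: lowercase, then keep runs of two or more word characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", d.lower()) for d in documents]

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted(set(tok for doc in tokenized for tok in doc))

# One row per document, one column per vocabulary word.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a document and each column a vocabulary word, so the value 2 in the 'win' column of the second row reflects 'win' occurring twice in that message.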


Step 3.1: Training and testing sets
--
Now that we have understood how to deal with the Bag of Words problem we can get back to our dataset and proceed with our analysis. Our first step in this regard would be to split our dataset into a training and testing set so we can test our model later.

#### Instructions: Split the dataset into a training and testing set by using the train_test_split function in sklearn. Split the data using the following variables (the code cell below spells the first and third as x_train and x_test):
- X_train is our training data for the 'sms_message' column.
- y_train is our training data for the 'label' column.
- X_test is our testing data for the 'sms_message' column.
- y_test is our testing data for the 'label' column.

Print out the number of rows we have in each of our training and testing data.

In [289]:
# NOTE: sklearn.cross_validation was deprecated in favor of sklearn.model_selection.
from sklearn.model_selection import train_test_split

# Default split: 75% of the rows for training, 25% for testing.
x_train, x_test, y_train, y_test = train_test_split(df['sms_message'], df['label'], random_state=1)

print 'Number of rows in the total set: {}'.format(df.shape[0])
print 'Number of rows in the training set: {}'.format(x_train.shape[0])
print 'Number of rows in the test set: {}'.format(x_test.shape[0])


Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [290]:
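count_vector = CountVectorizer()
count_vector.fit(x_train)
# NOTE: this second fit() discards the vocabulary learned from x_train and
# rebuilds it from x_test, so both matrices below use the test-set vocabulary.
# The solution cell further down fits on the training data only.
count_vector.fit(x_test)
training_data = count_vector.transform(x_train).toarray()
testing_data = count_vector.transform(x_test).toarray()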

print training_data
print testing_data

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]


```python
# Solution

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
# (x_train and x_test are the lowercase names created by train_test_split above)
training_data = count_vector.fit_transform(x_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(x_test)
```
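
To illustrate why the test data is only transformed and never fitted, here is a minimal standard-library sketch. The helpers `fit_vocabulary` and `transform` and the example messages are hypothetical stand-ins, not scikit-learn's API; they only mimic the idea that the vocabulary is learned once from the training messages and then reused unchanged.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def fit_vocabulary(docs):
    """Learn a sorted vocabulary from a list of documents (hypothetical helper)."""
    return sorted({tok for d in docs for tok in re.findall(TOKEN_PATTERN, d.lower())})


def transform(docs, vocabulary):
    """Count vocabulary words per document; words outside the vocabulary are ignored."""
    rows = []
    for d in docs:
        counts = Counter(re.findall(TOKEN_PATTERN, d.lower()))
        rows.append([counts[word] for word in vocabulary])
    return rows


# Made-up messages standing in for the training and test splits.
train_docs = ["win money now", "call me now"]
test_docs = ["win a free prize now"]

vocab = fit_vocabulary(train_docs)        # learned from the training data only
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)   # 'free' and 'prize' are dropped

print(vocab)
print(test_rows)
```

Both matrices end up with the same columns, which is what a classifier trained on one and evaluated on the other requires. Words seen only in the test messages contribute nothing, mirroring what a deployed model faces with unseen vocabulary.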

Step 4.1: Bayes Theorem implementation from scratch
--
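Probability of A after new information arrives = probability of A x the adjustment brought by the new information

Conditional probability is written P(A|B) and read as "the probability of A given B".
Let A and B be two events in the sample space Ω, with P(B) > 0. The conditional probability that A occurs given that B has occurred is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Conditional probability is sometimes also called the posterior probability.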
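
As a short worked step, applying the same definition with the roles of A and B swapped (which additionally assumes P(A) > 0) leads to Bayes' theorem, the form used in the code cell that follows:

$$
\begin{aligned}
P(A \cap B) &= P(B \mid A)\,P(A) \\
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} = P(A)\,\frac{P(B \mid A)}{P(B)}
\end{aligned}
$$

Read this way, P(A) is the probability of A before the new information, and the factor P(B|A)/P(B) is the adjustment that observing B applies to it.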

In [291]:
#P(D) - disease(Known)
p_dis = 0.01

#P(~D) - well(Known)
p_no_dis = 0.99

#P(Pos|D) - disease, Pos (Known)
p_pos_dis = 0.9

#P(Neg|~D) - well, Neg (Known)
p_neg_no_dis = 0.9

#P(Pos) = P(D)P(Pos|D) + P(~D)(1 - P(Neg|~D))
p_pos = (p_dis * p_pos_dis) + (p_no_dis * (1 - p_neg_no_dis))
print 'The probability of getting a positive test result P(Pos) is: '\
,format(p_pos)

#P(D|Pos) = P(D)P(Pos|D)/P(Pos)
p_dis_pos = (p_dis * p_pos_dis) / p_pos
print 'The probability of getting a disease, given that positive result P(D|Pos) is: '\
,format(p_dis_pos)

#P(Pos|~D) = 1 - P(Neg|~D) = 0.1
p_pos_no_dis = 0.1


#P(~D|Pos) = P(~D)P(Pos|~D)/P(Pos)
p_no_dis_pos = (p_no_dis * p_pos_no_dis)/p_pos
print 'The probability of no disease, given that positive result P(~D|Pos) is: '\
,format(p_no_dis_pos)




The probability of getting a positive test result P(Pos) is:  0.108
The probability of getting a disease, given that positive result P(D|Pos) is:  0.0833333333333
The probability of no disease, given that positive result P(~D|Pos) is:  0.916666666667


Congratulations! You have implemented Bayes' theorem from scratch. Your analysis shows that even if you get a positive test result, there is only an 8.33% chance that you actually have diabetes and a 91.67% chance that you do not. This assumes that only 1% of the entire population has diabetes, which is itself only an assumption.
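
The same calculation can be wrapped in a small function to see how strongly the result depends on the assumed prior. This is a minimal sketch: the function name `posterior_given_positive` and the second prior of 0.2 are assumptions introduced here for illustration, while the first call reuses the values from the cell above.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P(D|Pos) from P(D), P(Pos|D) and P(Neg|~D)."""
    false_positive_rate = 1.0 - specificity  # P(Pos|~D)
    p_pos = prior * sensitivity + (1.0 - prior) * false_positive_rate
    return prior * sensitivity / p_pos


# Values used in the cell above.
p = posterior_given_positive(prior=0.01, sensitivity=0.9, specificity=0.9)
print(round(p, 4))  # 0.0833

# A hypothetical, much more common condition tested with the same accuracy.
p_common = posterior_given_positive(prior=0.2, sensitivity=0.9, specificity=0.9)
print(round(p_common, 4))
```

With the rarer condition most positive results come from the large healthy group, which is why the posterior stays low even though the test is right 90% of the time.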

#### What does the term 'Naive' in 'Naive Bayes' mean?
Naive Bayes is an extension of Bayes' theorem that assumes all the features are independent of each other given the class.
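
Written as a formula, the independence assumption lets the likelihood of all features factor into a product of per-feature terms. The symbols y (the class) and x_1, ..., x_n (the features) are notation introduced here for illustration:

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

In the spam setting, y would be 'ham' or 'spam' and each x_i a word in the message. The proportionality sign hides the division by P(x_1, ..., x_n), which is the same for every class.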

Step 4.2: Naive Bayes implementation from scratch
--
+ Probability that Jill Stein says 'freedom': 0.1 ---------> P(F|J)
+ Probability that Jill Stein says 'immigration': 0.1 -----> P(I|J)
+ Probability that Jill Stein says 'environment': 0.8 -----> P(E|J)
+ Probability that Gary Johnson says 'freedom': 0.7 -------> P(F|G)
+ Probability that Gary Johnson says 'immigration': 0.2 ---> P(I|G)
+ Probability that Gary Johnson says 'environment': 0.1 ---> P(E|G)
+ Probability of Jill Stein giving a speech, P(J), is 0.5, and the same for Gary Johnson: P(G) = 0.5.

In [292]:
'''
Step 1
'''
#P(J)
p_j = 0.5

#P(G)
p_g = 0.5

#P(F|J)
p_j_f = 0.1

#P(I|J)
p_j_i = 0.1

#P(E|J)
p_j_e = 0.8

#P(F|G)
p_g_f = 0.7

#P(I|G) 
p_g_i = 0.2

#P(E|G) 
p_g_e = 0.1


'''
Step 2
'''
#P(J,F,I) = P(J)P(I|J)P(F|J)  (joint probability under the naive independence assumption)
p_j_if = p_j * p_j_i * p_j_f

#P(G,F,I) = P(G)P(I|G)P(F|G)
p_g_if = p_g * p_g_i * p_g_f


'''
Step 3
'''
#P(F,I) = P(J,F,I) + P(G,F,I)
p_if = p_j_if + p_g_if

print('Probability of words freedom and immigration being said are: '\
      , format(p_if))

'''
Step 4
'''
#P(J|F,I) = P(J)P(F|J)P(I|J)/P(F,I): probability that the speaker is Jill Stein, given both words
p_if_j = (p_j * p_j_f * p_j_i)/p_if

#P(G|F,I) = P(G)P(F|G)P(I|G)/P(F,I): probability that the speaker is Gary Johnson, given both words
p_if_g = (p_g * p_g_f * p_g_i)/p_if


print('The probability of Jill Stein saying the words Freedom and Immigration: '\
      , format(p_if_j))
print('The probability of Gary Johnson saying the words Freedom and Immigration: '\
      , format(p_if_g))


('Probability of words freedom and immigration being said are: ', '0.075')
('The probability of Jill Stein saying the words Freedom and Immigration: ', '0.0666666666667')
('The probability of Gary Johnson saying the words Freedom and Immigration: ', '0.933333333333')
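
The four steps above generalize to any number of speakers and words. The sketch below is one possible way to write that down; the function name `naive_bayes_posteriors` and the dictionary layout are assumptions made for illustration, and the probabilities are the ones listed at the start of this step.

```python
def naive_bayes_posteriors(priors, likelihoods, observed_words):
    """Return P(speaker | observed_words) for each speaker.

    priors:      speaker -> P(speaker)
    likelihoods: speaker -> {word: P(word | speaker)}
    """
    joint = {}
    for speaker, prior in priors.items():
        score = prior
        for word in observed_words:
            score *= likelihoods[speaker][word]  # naive independence assumption
        joint[speaker] = score
    evidence = sum(joint.values())  # P(observed_words)
    return {speaker: score / evidence for speaker, score in joint.items()}


priors = {'Jill Stein': 0.5, 'Gary Johnson': 0.5}
likelihoods = {
    'Jill Stein': {'freedom': 0.1, 'immigration': 0.1, 'environment': 0.8},
    'Gary Johnson': {'freedom': 0.7, 'immigration': 0.2, 'environment': 0.1},
}

posteriors = naive_bayes_posteriors(priors, likelihoods, ['freedom', 'immigration'])
for speaker in sorted(posteriors):
    print('{}: {:.4f}'.format(speaker, posteriors[speaker]))
```

The posteriors sum to 1 because the evidence term normalizes the joint scores across all speakers.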


Step 5: Naive Bayes implementation using scikit-learn
--
Specifically, we will be using the multinomial Naive Bayes implementation. This classifier is suitable for classification with discrete features (in our case, word counts for text classification) and takes integer word counts as its input. Gaussian Naive Bayes, on the other hand, is better suited to continuous data, as it assumes that the input data has a Gaussian (normal) distribution.
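
To make the idea of a multinomial model over word counts concrete, here is a minimal from-scratch sketch of the standard formulation with additive (Laplace) smoothing. It is not scikit-learn's code: the function names, the toy messages, and their labels are all made up for illustration, and `alpha=1.0` is chosen to match the default shown in the classifier output below.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocabulary = sorted({w for d in docs for w in d.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / float(len(docs))) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed denominator: total words in the class plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                             for w in vocabulary}
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(log_likelihood[c][w]
                                       for w in doc.split()
                                       if w in log_likelihood[c])
    return max(scores, key=scores.get)


# Made-up, already lowercased toy messages: 1 = spam, 0 = ham.
docs = ['win money now', 'win free prize', 'call me now', 'see you tomorrow']
labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict('win free money', log_prior, log_likelihood))
print(predict('call me tomorrow', log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.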


In [296]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [297]:
prediction = naive_bayes.predict(testing_data)

Step 6: Evaluating our model
--
- __Accuracy__ measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions (the number of test data points).
- __Precision__ tells us what proportion of the messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all positives (all messages classified as spam, whether or not that was the correct classification). In other words, it is the ratio
[True Positives/(True Positives + False Positives)]

- __Recall (sensitivity)__ tells us what proportion of the messages that actually were spam we classified as spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all the messages that actually were spam. In other words, it is the ratio
[True Positives/(True Positives + False Negatives)]

- For classification problems with skewed class distributions, as in our case, accuracy by itself is not a very good metric. For example, suppose we had 100 text messages of which only 2 were spam and the other 98 were not. We could classify 90 messages as not spam (including the 2 that were spam, which would be false negatives) and 10 as spam (all 10 false positives) and still get a reasonably good accuracy score of 88%. For such cases, precision and recall come in very handy. These two metrics can be combined into the __F1 score__, which is the harmonic mean of the precision and recall scores. This score ranges from 0 to 1, with 1 being the best possible F1 score.
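
The skewed example above can be checked by computing the four metrics directly from their definitions. The helper `binary_metrics` is a hypothetical name, and returning 0.0 when a ratio is undefined is a convention chosen for this sketch rather than something taken from the text.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / float(len(pairs))
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# The skewed example from the text: 2 spam and 98 ham messages.
y_true = [1] * 2 + [0] * 98
# Both spam messages are missed, and 10 ham messages are flagged as spam.
y_pred = [0] * 2 + [1] * 10 + [0] * 88

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print('accuracy={} precision={} recall={} f1={}'.format(accuracy, precision, recall, f1))
```

Accuracy looks respectable even though not a single spam message was caught, which is exactly the failure that precision, recall, and F1 expose.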

In [298]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, prediction)))
print('Precision score: ', format(precision_score(y_test, prediction)))
print('Recall score: ', format(recall_score(y_test, prediction)))
print('F1 score: ', format(f1_score(y_test, prediction)))

('Accuracy score: ', '0.977027997128')
('Precision score: ', '0.880597014925')
('Recall score: ', '0.956756756757')
('F1 score: ', '0.917098445596')
