Step 1.1: Understanding our dataset
--
- Parse the data
- Dataset source: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [76]:
import pandas as pd
# Raw string (r'...') so the backslashes in the Windows path are never read as escape sequences
df = pd.read_table(r'C:\Users\MLUSER\Documents\GitHub\Udacity\Naive Bayes Tutorial/SMSSpamCollection',
                  sep='\t',
                  header=None,
                  names=['label','sms_message'])
df.head()

Unnamed: 0,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
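
To see what the `sep`, `header` and `names` arguments do without needing the file on disk, here is a small sketch that parses an in-memory string in the same tab-separated layout. The two sample rows are made up for illustration and are not taken from the dataset, and `pd.read_csv` with `sep='\t'` is used here as a stand-in for `read_table`.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the file: label <TAB> message, no header row
raw = "ham\tSee you at lunch\nspam\tYou have won a prize, call now\n"

sample = pd.read_csv(io.StringIO(raw),
                     sep='\t',
                     header=None,                     # the data has no header line
                     names=['label', 'sms_message'])  # so the column names are supplied here

print(sample.shape)
print(list(sample.columns))
```

Because the separator is a tab, the comma inside the second message stays part of the `sms_message` column.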


Step 1.2: Data Preprocessing
--
- Convert the labels from strings to binary integers:
   - 'ham' (not spam) -> 0
   - 'spam' -> 1
- Why: because scikit-learn only deals with numerical values

-----
```python
# General forms:
#   lambda argument: manipulate(argument)
#   map(function_to_apply, list_of_inputs)
map(int, ["12", "37", "999"])
# -> [12, 37, 999] in Python 2; in Python 3, wrap the call in list(...) to get the list
```

In [45]:
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

Unnamed: 0,label,sms_message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
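
One thing worth checking after this step, sketched on a made-up Series rather than the real data: `Series.map` with a dict turns any value that is not a key of the dict into NaN. Running the mapping cell a second time, when the column already holds 0 and 1, would therefore wipe the labels.

```python
import pandas as pd

# Made-up labels for illustration
labels = pd.Series(['ham', 'spam', 'ham'])
mapping = {'ham': 0, 'spam': 1}

encoded = labels.map(mapping)
print(encoded.tolist())

# Mapping again: 0 and 1 are not keys of the dict, so every value becomes NaN
again = encoded.map(mapping)
print(again.isnull().all())
```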


Step 2.1: Bag of words
--
   - BoW concept: take a piece of text and count how often each word occurs in it. BoW treats every word independently, so the order in which the words occur does not matter.
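
A minimal sketch of the "order does not matter" point, using only the standard library. The helper `bag_of_words` is a hypothetical name introduced for this illustration; it lowercases the text, splits on whitespace and counts the words.

```python
from collections import Counter

def bag_of_words(text):
    # Hypothetical helper: lowercase, split on whitespace, count each word
    return Counter(text.lower().split())

a = bag_of_words("call me now")
b = bag_of_words("now call me")

print(a == b)       # compares the two bags, which contain the same words in a different order
print(a["call"])    # how often "call" occurs in the first text
```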
    
Step 2.2: Implementing Bag of Words from scratch
--

```python
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```
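
The `lowercase` and `token_pattern` defaults in this signature correspond to the manual steps below. As a rough sketch using only Python's `re` module (it imitates these two defaults, not the full CountVectorizer pipeline, and `tokenize` is a hypothetical helper name), the default pattern keeps runs of two or more word characters, so punctuation and single-character tokens are dropped:

```python
import re

# Default token_pattern, copied from the signature above
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    # Mimic lowercase=True, then extract every match of the token pattern
    return re.findall(TOKEN_PATTERN, text.lower())

print(tokenize("Hello, how are you!"))
print(tokenize("Free entry in 2 a wkly comp"))
```

In the second call, "2" and "a" do not appear in the result because each is only one character long.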


Step 1: Convert all strings to their lower case form.
- Let's say we have a document set:
```python
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
```

In [93]:
doc = ['Hello, how are you!',
           'Win money, win from home.',
           'Call me now.',
           'Hello, Call hello you tomorrow?']
# Lowercase every document so that 'Hello' and 'hello' count as the same word
doc_lower = [d.lower() for d in doc]

print doc_lower

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


Step 2: Removing all punctuation

In [182]:
import string
sans_doc_punc = []
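for i in doc_lower:
    # Python 2: string.maketrans builds a table that maps each of , ! ? . to a space
    # (Python 3 has no string.maketrans; there, str.maketrans('', '', string.punctuation) builds a table that deletes all punctuation)
    sans_doc_punc.append(i.translate(string.maketrans(',!?.', '    ')))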
print sans_doc_punc

['hello  how are you ', 'win money  win from home ', 'call me now ', 'hello  call hello you tomorrow ']
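
The cell above relies on Python 2's `string.maketrans`. For readers on Python 3, here is a minimal sketch of the same step. It assumes that deleting punctuation is acceptable: `str.maketrans` with a third argument builds a table that removes every character in `string.punctuation`, rather than replacing four hand-picked marks with spaces, so the intermediate strings have no doubled or trailing spaces.

```python
import string

doc_lower = ['hello, how are you!',
             'win money, win from home.',
             'call me now.',
             'hello, call hello you tomorrow?']

# Translation table that maps every punctuation character to None, i.e. deletes it
table = str.maketrans('', '', string.punctuation)
sans_doc_punc = [d.translate(table) for d in doc_lower]

print(sans_doc_punc)
```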


Step 3: Tokenization

In [183]:
doc_pre = []
for i in sans_doc_punc:
    doc_pre.append(i.split())
    
print doc_pre

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
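
Calling `split()` with no argument matters here: it splits on runs of whitespace and ignores leading and trailing whitespace, so the double spaces left behind by the punctuation step do not produce empty tokens. A short comparison with `split(' ')` on the first document:

```python
s = 'hello  how are you '

no_arg = s.split()         # splits on any run of whitespace
single_space = s.split(' ')  # splits on every single space, which leaves empty strings

print(no_arg)
print(single_space)
```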


Step 4: Count frequencies
```python
# Counter is a dict subclass for counting hashable objects
# Tally occurrences of words in a list
from collections import Counter

cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
cnt
# Counter({'blue': 3, 'red': 2, 'green': 1})
```
https://docs.python.org/2/library/collections.html?highlight=counter

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
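
A few of these properties in a short sketch, using the same illustrative word list as above: a missing key reads as 0 instead of raising `KeyError`, `most_common` returns the entries ordered by count, and `subtract` can leave zero or negative counts.

```python
from collections import Counter

cnt = Counter(['red', 'blue', 'red', 'green', 'blue', 'blue'])

print(cnt['yellow'])        # missing key: the count is 0, no KeyError
print(cnt.most_common(2))   # the two most frequent words with their counts

cnt.subtract(['green', 'green'])
print(cnt['green'])         # 1 - 2: counts may go below zero
```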

In [184]:
import collections as colt
freq_list = []
for i in doc_pre:
    # Count how often each token occurs in this document
    freq_count = colt.Counter(i)
    freq_list.append(freq_count)
    print freq_count
    print '\n'
    
print freq_list
print '\n'
print type(freq_count)
print type(freq_list)


# Alternative (disabled): one Counter over the whole corpus instead of one per document
#freq_count = colt.Counter()
#for i in doc_pre:
#    for j in i:
#        freq_count[j] +=1
#print freq_count


Counter({'how': 1, 'you': 1, 'hello': 1, 'are': 1})


Counter({'win': 2, 'home': 1, 'from': 1, 'money': 1})


Counter({'me': 1, 'now': 1, 'call': 1})


Counter({'hello': 2, 'you': 1, 'call': 1, 'tomorrow': 1})


[Counter({'how': 1, 'you': 1, 'hello': 1, 'are': 1}), Counter({'win': 2, 'home': 1, 'from': 1, 'money': 1}), Counter({'me': 1, 'now': 1, 'call': 1}), Counter({'hello': 2, 'you': 1, 'call': 1, 'tomorrow': 1})]


<class 'collections.Counter'>
<type 'list'>
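
The per-document counters are not yet a feature matrix. One possible next step, sketched here with the standard library only: collect a vocabulary over all documents and turn each Counter into a row of counts, with 0 for words the document does not contain. The names `vocab` and `matrix` are illustrative, and the alphabetical column order is a choice made for this sketch.

```python
from collections import Counter

doc_pre = [['hello', 'how', 'are', 'you'],
           ['win', 'money', 'win', 'from', 'home'],
           ['call', 'me', 'now'],
           ['hello', 'call', 'hello', 'you', 'tomorrow']]

freq_list = [Counter(tokens) for tokens in doc_pre]

# Sorted vocabulary: one column per distinct word in the corpus
vocab = sorted(set(word for tokens in doc_pre for word in tokens))

# One row per document; a Counter returns 0 for words it has not seen
matrix = [[freq[word] for word in vocab] for freq in freq_list]

print(vocab)
for row in matrix:
    print(row)
```

Each row has the same length as `vocab`, so documents of different lengths end up as vectors of equal size.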


Step 2.3: Implementing Bag of Words in scikit-learn
--

- Instructions: Import the sklearn.feature_extraction.text.CountVectorizer class and create an instance of it called 'count_vector'.

In [188]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
print count_vector
print type(count_vector)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
<class 'sklearn.feature_extraction.text.CountVectorizer'>
