Word n-grams (Bag of Words - BOW)
------
**What it does**: A text (here a tweet, but equally a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.
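
To make the multiset idea concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are illustrative assumptions, and whitespace splitting stands in for the tokenizer that `CountVectorizer` applies.

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated tokens (hypothetical helper)."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # multiplicity is kept: "the" is counted twice
print(a == b)    # word order is discarded: both sentences give the same bag
```

The two sentences mean different things but map to the same bag, which is exactly the information this representation gives up.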

**Strengths**: A traditional, simple, and solid baseline feature representation.

**Weaknesses**: Loses grammar and word order; bigrams and longer n-grams recover only local order.

**Hyperparameters**:
- `CountVectorizer`:
  - `ngram_range`: the range of n-gram lengths (in words) to extract, given as `(min, max)`. In this notebook, we look at unigrams and bigrams.
  - `min_df`, `max_df`: the minimum and maximum document frequency for an n-gram, respectively. Each can be an absolute count (`3`) or a proportion of documents (`0.95`).
  - `stop_words`: whether to remove stopwords; `'english'` uses the built-in English word list. A custom stopword list can be passed instead.
  - `binary`: whether to record binary (yes/no) occurrence instead of counts. This can also be applied later in the pipeline using `Binarizer`.
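
The effect of these hyperparameters is easiest to see on a tiny corpus. The sketch below uses a hypothetical three-document corpus (not the tweet data) and reads the learned vocabulary from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the tweet data set
docs = ["good movie", "good good movie", "bad movie"]

# ngram_range=(1, 2): extract unigrams and bigrams
full = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(sorted(full.vocabulary_))

# min_df=2 drops n-grams seen in fewer than 2 documents;
# max_df=0.95 drops n-grams seen in more than 95% of documents
pruned = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95).fit(docs)
print(sorted(pruned.vocabulary_))

# binary=True records presence (1) instead of the raw count
raw = CountVectorizer().fit(docs)
flags = CountVectorizer(binary=True).fit(docs)
raw_count = raw.transform(docs)[1, raw.vocabulary_["good"]]
flag = flags.transform(docs)[1, flags.vocabulary_["good"]]
print(raw_count, flag)
```

In this corpus `movie` occurs in every document, so `max_df=0.95` removes it, while `bad`, `bad movie`, and `good good` each occur in a single document and fall below `min_df=2`.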

In [33]:
from collections import OrderedDict
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
sts_gold = pd.read_csv('../data/sts_gold_v03/sts_gold_tweet.csv', index_col='id', sep=';')

In [35]:
sts_gold.head()

id,polarity,tweet
1467933112,0,the angel is going to miss the athlete this we...
2323395086,0,It looks as though Shaq is getting traded to C...
1467968979,0,@clarianne APRIL 9TH ISN'T COMING SOON ENOUGH
1990283756,0,drinking a McDonalds coffee and not understand...
1988884918,0,So dissapointed Taylor Swift doesnt have a Twi...


In [36]:
tweets = sts_gold['tweet']

In [45]:
cv = CountVectorizer(ngram_range=(1,2), min_df=3, max_df=.95, stop_words='english')
bow = cv.fit_transform(tweets)

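# use below if you need a data frame (note: toarray() densifies the sparse matrix)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bow_df = pd.DataFrame(bow.toarray(), index=tweets.index, columns=cv.get_feature_names_out())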

In [46]:
bow_df.head()

id,10,100,101,12,13,14,15,1st,20,24,...,yay,yea,yeah,year,years,yep,yes,yesterday,youtube,youtube channel
1467933112,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2323395086,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1467968979,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1990283756,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1988884918,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Feature Evaluation

In [47]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.preprocessing import Binarizer, StandardScaler
from sklearn.dummy import DummyClassifier

In [48]:
models = [('DUMMY', DummyClassifier(strategy='most_frequent')),
          ('mNB' , MultinomialNB()),
          ('bNB' , BernoulliNB()),
          ('svc' , SVC())]

In [49]:
print('{0}\t{1:<1}\t{2:<4}\t{3:<4}'.format("MODEL", "MEAN CV", "MIN CV", "MAX CV"))

for name, model in models:
    # reset the features each iteration; positive tweets have polarity 4
    X, Y = bow, (sts_gold['polarity'] == 4).to_numpy()
    
    if name == 'bNB':
        binarize = Binarizer()
        X = binarize.fit_transform(X)
    elif name == 'svc':
        ss = StandardScaler()
        X = X.toarray()
        X = ss.fit_transform(X)
        
    # named `scores` so the CountVectorizer `cv` defined earlier is not overwritten
    scores = cross_val_score(model, X, Y, cv=5, scoring='accuracy')

    print('{0}\t{1:<3}\t{2:<4}\t{3:<4}'.format(name, round(scores.mean(), 4), round(scores.min(), 4), round(scores.max(), 4)))

MODEL	MEAN CV	MIN CV	MAX CV
DUMMY	0.6893	0.6887	0.6897
mNB	0.8274	0.8128	0.8529
bNB	0.824	0.8054	0.848
svc	0.7512	0.7402	0.766
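
One caveat about the loop above: the vocabulary and the `StandardScaler` statistics are fit on all tweets before cross-validation, so each held-out fold has already influenced the features. A possible alternative, sketched here under assumptions, is to wrap the steps in a scikit-learn pipeline so they are refit on each training fold. The toy `texts` and `labels` are hypothetical stand-ins for the tweets, and `StandardScaler(with_mean=False)` is swapped in so the sparse matrix does not need densifying, which differs from the centered scaling used in the loop.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the tweets (1 = positive, 0 = negative)
texts = ["love this phone", "great day today", "so happy right now",
         "best coffee ever", "really good movie", "what a nice song",
         "hate this phone", "terrible day today", "so sad right now",
         "worst coffee ever", "really bad movie", "what an awful song"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Each pipeline refits its vectorizer (and scaler) on the training folds only
pipelines = {
    'bNB': make_pipeline(CountVectorizer(), Binarizer(), BernoulliNB()),
    'svc': make_pipeline(CountVectorizer(), StandardScaler(with_mean=False), SVC()),
}

results = {}
for name, pipe in pipelines.items():
    # raw texts go in; feature extraction happens inside every fold
    results[name] = cross_val_score(pipe, texts, labels, cv=3, scoring='accuracy')
    print(name, results[name].round(4))
```

On the real data, the same pattern would take `tweets` and the polarity labels in place of the toy lists, with the `CountVectorizer` arguments from earlier in the notebook.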


