# Count Vectorizer Pair Solution

This morning we'll implement a method for featurizing text, i.e. turning words into numeric values that can be fed to machine learning models. We'll use a small set of Amazon reviews with positive/negative sentiment labels (balanced classes) and a simple multinomial naive Bayes model to test our method.

You can use sklearn for everything except the Count Vectorizer methods :)

**Steps:**

**1.** Read in the data, take a quick look, and do a train/test split. This one's a gift to you:

```
df = pd.read_csv("Data/amazon_cells_labelled.txt", sep='\t', names=['text', 'sentiment'])

df.head()
```

```
# train test split
X_train, X_test, y_train, y_test = train_test_split(df.text, df.sentiment,
                                                    test_size=.2, random_state=2018)
```    

**2.** We want to predict the sentiment of a review using only its text. But how do we turn words into numeric features?

This is where the **Count Vectorizer** method comes into play. It turns out we can do something pretty simple: just count word occurrences across all of our text samples (documents). Each word in the entire corpus (collection of documents) gets its own feature column.

Your major task is to write a class and/or series of functions that accomplish the following:

* Iterate through a corpus and collect all of the distinct words that occur into a global **vocabulary**. Hint: try using `Counter` from the `collections` module.

* To each word in the vocabulary, assign a consistent ordered position. For example, you could sort by the number of occurrences (but any consistent positioning is fine).

* **Transform a corpus into a numeric dataframe**, with one column for each word in the vocabulary. For each document in the corpus (row in the dataframe), count occurrences of each word and fill the corresponding dataframe columns with the appropriate counts. The positioning in bullet 2 allows you to do this consistently. This output is called a **document-term matrix**.
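
As a rough illustration of the three bullets above, here is a minimal sketch on a made-up three-document corpus. The corpus and the helper names `build_vocabulary` and `to_document_term_matrix` are hypothetical, and plain whitespace splitting is assumed as the tokenizer.

```python
from collections import Counter

import pandas as pd

# Toy corpus (made up for illustration, not taken from the review data)
corpus = ["good phone good price", "bad phone", "good case"]


def build_vocabulary(docs):
    """Map each distinct word to a column index, most frequent first."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common())}


def to_document_term_matrix(docs, vocab):
    """Return one row per document and one column per vocabulary word."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in vocab:  # words outside the vocabulary are skipped
                row[vocab[word]] += 1
        rows.append(row)
    return pd.DataFrame(rows, columns=list(vocab))


vocab = build_vocabulary(corpus)
dtm = to_document_term_matrix(corpus, vocab)
print(dtm)
```

Each row is a document and each column a word, so the first row holds a 2 under `good` because that word appears twice in the first document. A document containing a word that was never seen when the vocabulary was built simply gets no count for it.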

**3.** Once you've built your Count Vectorizer, apply it to the review data. Build your vocabulary off of the entire corpus (`df.text`), and convert the train and test corpora (`X_train`, `X_test`) to document-term matrices using your transform function. Congrats, you now have numeric features and targets! Fit a multinomial naive Bayes model to the train data and score it for accuracy on the test data.

In [1]:
import string
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://thisismetis.github.io/datasets/amazon_cells_labelled.txt", sep='\t', names=['text', 'sentiment'])

df.head()

| | text | sentiment |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [2]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(df.text, df.sentiment,
                                                    test_size=.2, random_state=2018)

In [3]:
class My_CountVectorizer:

    def fit_vocab(self, corpus):
        
        # collect vocabulary word counts
        self.counter = Counter()
        for doc in corpus:
            self.counter.update(doc.split(' '))
        
        # map words to columns in descending order of frequency
        self.feature_map = {word : i for i, (word, count) 
                            in enumerate(self.counter.most_common())}
           
    def transform_corpus(self, corpus):
        
        vectors = []
        
        # fill doc rows by iterating through words and 
        # accumulating counts to term columns
        for doc in corpus:
            vector = np.zeros(len(self.feature_map))
            for word in doc.split(' '):
                if word in self.feature_map:
                    vector[self.feature_map[word]] += 1
            vectors.append(vector)
        
        # document-term matrix with word column names
        word_df = pd.DataFrame(vectors, columns=self.feature_map.keys())
        return word_df

In [4]:
cv = My_CountVectorizer()
cv.fit_vocab(X_train)
X_train = cv.transform_corpus(X_train)
X_test = cv.transform_corpus(X_test)

In [5]:
X_train.head()

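| | the | I | and | is | a | to | it | this | my | for | ... | mess | hair. | samsung...crap..... | crappy | E715.. | seeen. | glove | strong, | secure, | durable. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |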


In [6]:
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb.score(X_test, y_test)

0.76
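
The column names in the `X_train.head()` output (`hair.`, `strong,`, `samsung...crap.....`) show that splitting on spaces leaves punctuation and case attached to words, so `great`, `great.` and `Great` would end up in separate columns. One possible fix, sketched below, is to lowercase each document and strip punctuation before splitting. The `tokenize` helper is hypothetical and not part of the solution above; it relies on `string.punctuation`, which only covers ASCII punctuation, and the two example sentences are partly made up.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(doc):
    """Lowercase, drop ASCII punctuation, then split on whitespace."""
    return doc.lower().translate(_PUNCT_TABLE).split()


docs = ["The mic is great.", "Great case, great value!"]

# Counts with the plain split used in the solution vs. the normalized tokens
raw_counts = Counter(word for doc in docs for word in doc.split(" "))
clean_counts = Counter(word for doc in docs for word in tokenize(doc))

print(raw_counts["great"], raw_counts["great."], raw_counts["Great"])
print(clean_counts["great"])
```

With the plain split, the three spellings are counted once each; after normalization they collapse into a single `great` entry with a count of 3. One trade-off of this approach is that contractions lose their apostrophe, so `don't` becomes `dont`.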

## Sklearn

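Sklearn gives us everything we just wrote out of the box! And it's better: it is more efficient (it returns a sparse matrix instead of a dense dataframe), and its default preprocessing lowercases the text and tokenizes on word characters, so punctuation no longer splits one word across several columns (we'll cover a bunch of the core preprocessing steps soon).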
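
To see what that default preprocessing does, one option is to call the analyzer that `CountVectorizer` builds internally. The sketch below assumes default settings (lowercasing on, and a token pattern that keeps only tokens of two or more word characters). The first two sentences come from the review data; the third is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that turns a string into tokens
analyze = CountVectorizer().build_analyzer()

print(analyze("Good case, Excellent value."))
print(analyze("The mic is great."))

# Single-character tokens such as "I" and "a" are dropped by the default pattern
print(analyze("I bought a case"))
```

Compared with splitting on spaces, punctuation and capitalization no longer create extra vocabulary entries, and one-letter words never become columns at all.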

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

# same train/test split
X_train, X_test, y_train, y_test = train_test_split(df.text, df.sentiment,
                                                    test_size=.2, random_state=2018)

X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)

nb = MultinomialNB()
nb.fit(X_train, y_train)
nb.score(X_test, y_test)

0.805

In [8]:
type(X_train)

scipy.sparse.csr.csr_matrix

The result is a sparse matrix. To convert it back into a dataframe:

In [9]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names_out()).head(20)

Unnamed: 0,10,100,11,13,15,20,2000,2005,2160,24,...,would,wow,wrong,wrongly,year,years,yell,yet,you,your
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Additional Resources

- [Python docs about classes](https://docs.python.org/3/tutorial/classes.html)
- [sklearn docs on CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)