# Building an NLP Pipeline

For the pair problem today, we'll build a pipeline which manages the *basic* requirements for an NLP project. The goal is to build a toolbox for converting one or more strings of text into a matrix (retaining textual information along the way).

## Step 1: Read in Data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('https://thisismetis.github.io/datasets/coffee.csv')

## Step 2: Vectorize (part 1)

Using one of the vectorizers below, provided by scikit-learn, **convert the `reviews` pandas Series to a matrix**, where each row represents a document and each column represents a term (a word that appears in some document). The collection of documents is called the "corpus", so the number of rows should match the number of rows in `df`. The number of columns should be the total number of *distinct* terms (i.e., words) in the corpus; this set of terms is called the "vocabulary".

**Build the matrix such that the value at `(i,j)` is the *Count* of term (column) `j` in document (row) `i`.**

**What are the terms in this corpus?** *Hint: When using one of these vectorizers, what is the difference between `.vocabulary_` and `.get_feature_names_out()` (named `.get_feature_names()` before scikit-learn 1.0)?*

*Note: The default behaviour for vectorizers is to output a sparse matrix.*

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [4]:
docs = df.reviews

In [5]:
vec = CountVectorizer()

In [6]:
doc_term = vec.fit_transform(docs)

doc_term.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [7]:
pd.unique(doc_term.toarray().reshape(-1))

array([ 0,  1,  2,  4,  3,  6,  5, 10,  8,  7, 16,  9, 11, 14, 13, 12])

In [8]:
vec.vocabulary_

{'wanted': 2255,
 'to': 2135,
 'love': 1242,
 'this': 2110,
 'was': 2263,
 'even': 740,
 'prepared': 1571,
 'for': 852,
 'it': 1120,
 'be': 197,
 'somewhat': 1917,
 'like': 1207,
 'cheap': 374,
 'circle': 395,
 'cappuccino': 330,
 'unfortunately': 2195,
 'the': 2090,
 'product': 1593,
 'itself': 1126,
 'is': 1115,
 'really': 1648,
 'greasy': 939,
 'you': 2362,
 'actually': 57,
 'see': 1795,
 'grease': 938,
 'in': 1072,
 'cup': 531,
 '80': 40,
 'calories': 309,
 'per': 1502,
 'serving': 1817,
 'and': 113,
 'taste': 2064,
 'powder': 1559,
 'tasting': 2070,
 'powdered': 1560,
 'milk': 1310,
 'wasn': 2265,
 'expecting': 762,
 'starbucks': 1956,
 'cap': 320,
 'out': 1451,
 'of': 1409,
 'but': 291,
 'little': 1220,
 'more': 1345,
 'than': 2084,
 'br': 262,
 'read': 1638,
 'reviews': 1720,
 'they': 2102,
 'were': 2290,
 'sort': 1924,
 'mixed': 1329,
 'so': 1903,
 'chose': 389,
 'try': 2176,
 'won': 2321,
 'buy': 294,
 'these': 2101,
 'again': 78,
 'will': 2311,
 'sit': 1870,
 'on': 1426,
 ...}

In [9]:
doc_term.shape

(542, 2371)

In [10]:
df.shape

(542, 3)

## Vectorize (part 2)

**Build the matrix such that the value at `(i,j)` is a *Boolean* (1 or 0) value, indicating whether term (column) `j` is in document (row) `i`.**

In [11]:
vec = CountVectorizer(binary=True)

In [12]:
doc_term = vec.fit_transform(docs)

doc_term.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [13]:
pd.unique(doc_term.toarray().reshape(-1))

array([0, 1])
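The relationship between the count matrix and the Boolean matrix can be checked on a small example. This is a sketch on a hypothetical two-document corpus (`toy_docs`, `count_matrix`, and `binary_matrix` are illustrative names): with `binary=True`, every nonzero count is clipped to 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "good" appears twice in the first document
toy_docs = ["good good tea", "bad coffee"]

count_matrix = CountVectorizer().fit_transform(toy_docs).toarray()
binary_matrix = CountVectorizer(binary=True).fit_transform(toy_docs).toarray()

print(count_matrix)   # the "good" column holds 2 for the first document
print(binary_matrix)  # the same cell holds 1

# The binary matrix equals "count > 0" cast to integers
same = (binary_matrix == (count_matrix > 0).astype(int)).all()
print(same)
```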

In [21]:
i = 313

doc = pd.Series(name=docs[i],
                data=doc_term.toarray()[i],
                # get_feature_names() was removed in scikit-learn 1.2
                index=vec.get_feature_names_out()) \
        .sort_values(ascending=False)

doc[:20]

flavored      1
cups          1
offers        1
great         1
they          1
always        1
time          1
good          1
coffee        1
price         1
square        1
cappuccino    1
deliver       1
grove         1
very          1
and           1
at            1
on            1
problems      0
problem       0
Name: Grove Square offers great flavored coffee and cappuccino K-cups at a very good price, and they always deliver on time!, dtype: int64

## Vectorize (part 3)

**Build the matrix such that the value at `(i,j)` represents a sort of *normalized frequency*,** which takes into account (a) the density of term `j` in document `i`, as well as (b) the number of documents in which that term occurs.

*Hint: Try `TfidfVectorizer`. What is this?*
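As a rough answer to the hint, the sketch below rebuilds the default `TfidfVectorizer` output by hand on a hypothetical toy corpus. It assumes scikit-learn's default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit Euclidean length. All variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["good coffee", "bad coffee", "good good tea"]

# (a) term counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray()

# (b) number of documents in which each term occurs
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency: rarer terms get larger weights
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
library_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, library_tfidf))
```

A term that appears in many documents gets a low `idf`, so it contributes less to each row than a term concentrated in a few documents.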

In [22]:
vec = TfidfVectorizer()

In [23]:
doc_term = vec.fit_transform(docs.values)

# doc_term.toarray()

In [24]:
doc_term.shape

(542, 2371)

In [25]:
i = 309

doc = pd.Series(name=docs[i],
                data=doc_term.toarray()[i],
                # get_feature_names() was removed in scikit-learn 1.2
                index=vec.get_feature_names_out()) \
        .sort_values(ascending=False)

doc.head()

process      0.318522
describes    0.318522
step         0.298966
black        0.285090
pleasant     0.265534
Name: I'm a black coffee drinker, but once in a while I want cream and sugar with it.  That pretty much describes this K-cup.  Decent, pleasant taste in a convenient one-step process., dtype: float64

## Vectorize (part 4)

Some words in this corpus might not carry interesting information (e.g., "the" and "is"). How can we **remove words like "the" and "is" from our corpus using the vectorizer**?

*Hint: Look up the `stop_words`, and `max_df` arguments.*

In [26]:
vec = CountVectorizer(stop_words='english',
                      max_df=0.8)

In [27]:
doc_term = vec.fit_transform(docs)

doc_term.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [28]:
'the' in vec.get_feature_names_out()

False

In [29]:
'is' in vec.get_feature_names_out()

False

In [30]:
doc_term.shape

(542, 2141)

## Stemming

Maybe we want the option to stem our corpus. How can we do this by adjusting the `preprocessor` argument in our vectorizer? Keep in mind that the vectorizer passes each whole document (not each individual word) to `preprocessor`.

*Hint: You might want to investigate how the `.stem` method works for an instance of one of these stemmers.*

In [31]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [32]:
stemmer = SnowballStemmer("english")

In [33]:
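def prep(text, stemmer=None):
    # Note: the vectorizer calls `preprocessor` once per document, so `text`
    # is a whole review rather than a single word.

    with open('./stop_words_english.txt', 'r') as f:
        stopwords = [s.strip() for s in f.readlines()]

    # This check only matches when the entire review is itself a stop word
    if text.lower() in stopwords:
        return None

    # With the default `stemmer=None`, the review is only lowercased
    elif stemmer is None:
        return text.lower()

    # Otherwise the stemmer is applied to the whole review string at once
    else:
        return stemmer.stem(text)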

In [36]:
vec = CountVectorizer(stop_words='english',
                      min_df=1,  # This is default; this is just a reminder it exists
                      max_df=0.8,
                      preprocessor=prep)

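# This could work, too, but *it will not clear stop words!* Note that `stemmer.stem`
# would receive each whole review as a single string, not one word at a time.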
# vec = CountVectorizer(min_df=1,  # This is default; this is just a reminder it exists
#                       max_df=0.8,
#                       preprocessor=stemmer.stem)

In [37]:
doc_term = vec.fit_transform(docs)

doc_term.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [38]:
doc_term.shape

(542, 2141)

In [44]:
list(vec.get_feature_names_out())[300:310]

['carry',
 'case',
 'caseinate',
 'casey',
 'cash',
 'casual',
 'caught',
 'caution',
 'cavity',
 'ceamy']

## What can you do with this?

Try a few different operations, and try to **interpret their meaning/use case**:

* Calculate the correlation between documents, or between terms
* Consider bigrams or n-grams in your vectorizer
* Determine if there is multicollinearity between documents, or between terms
* Try to incorporate the `user_id` into your analysis
* Build a Python Class to make your work repeatable
* Build a model of the number of `stars`