# Intro to scikit-learn (sklearn)

*This notebook was written in 2019 by Lauren Klein. It was updated in 2020 by Dan Sinykin, and again by Lauren Klein in 2021.*

Thus far, we've examined words in terms of their grammar and syntax. We've looked at text in terms of various units: the word, the line, the document, etc. And we've touched on the idea of ngrams: short sequences of *n* words in a row (e.g. 2-grams or *bigrams*, 3-grams or *trigrams*, and so on).

From here on out, however, we'll be taking a different approach. We'll be turning words into numbers, and then applying statistical measures and models to the numbers that represent the words. Things like tf-idf, topic modeling, BERT, similarity, classification, and clustering--the set of methods we'll be learning in the second half of the course--all rely on this basic transformation.

While the actual transformation from words into numbers may be (relatively) easy to do, thanks to scikit-learn (or sklearn), it represents a major conceptual shift. For this reason, we're going to take some time to sit with it. For today, we'll just introduce ourselves to the concept via sklearn, Python's major machine learning library, which also happens to be crucial to many of the more advanced methods named above.

## What is a token? What is a feature? What is a document-term matrix?

We'll begin by importing sklearn's `CountVectorizer`, which [converts a collection of text documents into a matrix of token counts](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). I've used the word "token" in passing before, but here I'll take a minute to formally define it, along with some related terms:

According to the [Stanford NLP group](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html), a *token* is "an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing." The unit of the token is usually the word, but it can also be the sentence, the subword, or anything else that makes sense for that particular task.

Take this famous phrase, for example:

"To be or not to be"

This line has six tokens: "to", "be", "or", "not", "to", "be".

It has four features: "to", "be", "or", "not"

But wait, what is a feature?

In this case, a *feature* is a unique token in the corpus. (Caveat: features, like tokens, can actually be anything that makes sense for the task, but for the purposes of turning words into numbers, features are most often unique words, or "terms," as they're also sometimes called).
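
To make the distinction concrete, here is a minimal sketch in plain Python. It assumes that lowercasing and splitting on whitespace is a good-enough tokenizer for this one phrase; real tokenizers also have to deal with punctuation and other details.

```python
# a minimal sketch: tokens vs. features for one short phrase
phrase = "To be or not to be"

# tokens: every word in the phrase, in order (lowercased so "To" and "to" match)
tokens = phrase.lower().split()

# features: the unique tokens, kept in order of first appearance
features = list(dict.fromkeys(tokens))

print(len(tokens), tokens)      # 6 tokens
print(len(features), features)  # 4 features
```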

When sklearn's `CountVectorizer` does its thing, it first *tokenizes* all of the documents in the corpus--that is, it breaks up each document into its individual tokens--and then creates a *document-term matrix* that counts up how many times each term, or feature, appears in each document.

For example, the document-term matrix for the line above might look something like this:

|         | to | be | or | not |
|---------|----|----|----|-----|
| line 1  | 2  | 2  | 1  | 1   |


If we add in the second part of that phrase as a new document, we might get something like this:


|         | to | be | or | not | that | is | the | question |
|---------|----|----|----|-----|------|----|-----|----------|
| line 1  | 2  | 2  | 1  | 1   | 0    | 0  | 0   | 0        |
| line 2  | 0  | 0  | 0  | 0   | 1    | 1  | 1   | 1        |
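
A document-term matrix like this one can be built by hand in a few lines of plain Python. This is only a sketch of the idea, not how sklearn implements it. The regular expression is an assumption that approximates `CountVectorizer`'s default tokenizer (lowercase everything, then keep tokens of two or more word characters). The columns stay in order of first appearance, whereas `CountVectorizer` sorts its features alphabetically.

```python
import re
from collections import Counter

# two "documents": the two halves of the famous line
docs = ["To be or not to be", "that is the question"]

def tokenize(text):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]

# the vocabulary: every unique token, in order of first appearance
vocab = list(dict.fromkeys(tok for toks in tokenized for tok in toks))

# one row per document, one column per term
matrix = []
for toks in tokenized:
    counts = Counter(toks)  # a Counter returns 0 for terms it hasn't seen
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row sums to the number of tokens in that line: 6 for the first and 4 for the second.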


But enough of vectorizing by hand; let's try it out using sklearn!


## Importing sklearn's CountVectorizer

In [1]:
# import CountVectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

## Vectorize a teeny corpus

Now let's vectorize a teeny corpus we can see:

In [2]:
# here's our corpus: the first stanza of "Persimmons" in which each line is its own document
corpus = [
    'In sixth grade Mrs. Walker',
    'slapped the back of my head',
    'and made me stand in the corner',
    'for not knowing the difference',
    'between persimmon and precision.',
    'How to choose',
]

# instantiate the CountVectorizer object
# note that this is the same conceptual process we used to instantiate
# the VADER sentiment analysis object, and the spaCy document object
cv=CountVectorizer()

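# this step generates the document-term matrix for the corpus;
# it's required before you do almost anything else
dtm=cv.fit_transform(corpus)

# this method gives us the feature names that the CountVectorizer vectorized
# (in scikit-learn 1.0 and later the method is get_feature_names_out();
# older versions used get_feature_names(). .tolist() gives a plain Python list):
features = cv.get_feature_names_out().tolist()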

# this method turns our doc-term matrix into an array that can be manipulated:
dtm_array = dtm.toarray()

print("All of the features in our corpus:")
print(str(features))

print("\nAnd their counts in each of the \"documents,\" each of which is really just a single line of the poem:")
print(dtm_array)

All of the features in our corpus:
['and', 'back', 'between', 'choose', 'corner', 'difference', 'for', 'grade', 'head', 'how', 'in', 'knowing', 'made', 'me', 'mrs', 'my', 'not', 'of', 'persimmon', 'precision', 'sixth', 'slapped', 'stand', 'the', 'to', 'walker']

And their counts in each of the "documents," each of which is really just a single line of the poem:
[[0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1]
 [0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0]
 [1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0]
 [0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0]
 [1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]]


In [3]:
# here is some code that uses dataframes to make the above slightly more legible
# note that this is now the second or third time I've mentioned Python dataframes
# and said we'll talk about them later--we will, promise!

import pandas as pd

df = pd.DataFrame(data=dtm_array,columns=features)
print(df)

   and  back  between  choose  corner  difference  for  grade  head  how  ...  \
0    0     0        0       0       0           0    0      1     0    0  ...   
1    0     1        0       0       0           0    0      0     1    0  ...   
2    1     0        0       0       1           0    0      0     0    0  ...   
3    0     0        0       0       0           1    1      0     0    0  ...   
4    1     0        1       0       0           0    0      0     0    0  ...   
5    0     0        0       1       0           0    0      0     0    1  ...   

   not  of  persimmon  precision  sixth  slapped  stand  the  to  walker  
0    0   0          0          0      1        0      0    0   0       1  
1    0   1          0          0      0        1      0    1   0       0  
2    0   0          0          0      0        0      1    1   0       0  
3    1   0          0          0      0        0      0    1   0       0  
4    0   0          1          1      0        0      0    0   0       0  
5    0   0          0          0      0        0      0    0   1       0  

Let's take a minute to figure out what we're looking at:

* Each column is a feature, or "term," labeled with the name of the term, which in this case is a unique token
* Each row is a document, labeled in order of being ingested
* The "1" in row 0 of the "grade" column means that the term "grade" appears 1 time in the first document... and so on.  
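
To check a reading like this in code, you can look up a term's column and then index into the matrix. The sketch below rebuilds the same six-line corpus under new variable names (`poem_lines`, `poem_cv`, and `poem_dtm` are just illustrative) so that it runs on its own. `vocabulary_` is the fitted vectorizer's dictionary mapping each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

poem_lines = [
    'In sixth grade Mrs. Walker',
    'slapped the back of my head',
    'and made me stand in the corner',
    'for not knowing the difference',
    'between persimmon and precision.',
    'How to choose',
]

poem_cv = CountVectorizer()
poem_dtm = poem_cv.fit_transform(poem_lines)

# vocabulary_ maps each term to its column index in the doc-term matrix
grade_col = poem_cv.vocabulary_["grade"]

# row 0 is the first document, so this is the count of "grade" in the first line
grade_count = poem_dtm[0, grade_col]

# summing down a column gives a term's total count across all documents
the_total = poem_dtm[:, poem_cv.vocabulary_["the"]].sum()

print("column for 'grade':", grade_col)
print("count of 'grade' in document 0:", grade_count)
print("total count of 'the':", the_total)
```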

## Vectorizing a corpus from a set of files

The reality is that you almost always will be vectorizing a corpus from a set of files, and not a list that you type in by hand. This is how you'd do it with our song lyrics:

In [4]:
# import this library for directory/file manipulation
import os

# set the base directory -- note that this may need to change if you've saved a copy
# of this notebook elsewhere 
base_dir = "../corpora/lyrics/"

# read in a list of all the filenames 
docs = os.listdir(base_dir)

# a list for storing the text of all the docs
all_docs = []

# iterate through each of the docs in the directory
for doc in docs:
    with open(base_dir + doc, "r") as file:     # open the doc file 
        text = file.read()                      # read the contents of the file 
        all_docs.append(text)                   # append the contents of the file to our
                                                # all_docs list for future vectorizing

# just take a look at the first item to be sure it worked
print("Filename: " + str(docs[0]) + "\n") 
print(all_docs[0])

Filename: The-who-baba-oriley.txt

Out here in the fields, I fight for my meals
I get my back into my living
I don't need to fight to prove I'm right
I don't need to be forgiven, yeah, yeah, yeah, yeah, ye-ah


Don't cry, don't raise your eye
It's only teenage wasteland

Sally, take my hand, we'll travel south 'cross land
Put out the fire and don't look past my shoulder
The exodus is here, the happy ones are near
Let's get together before we get much older

Teenage wasteland, it's only teenage wasteland
Teenage wasteland, oh, yeah
Teenage wasteland
They're all wasted!
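
One thing to watch for when reading a whole directory: `os.listdir` returns every entry in no guaranteed order, including hidden files such as `.DS_Store` on a Mac. Here is a sketch of a slightly more defensive loader. The helper name `read_corpus`, the `.txt` filter, and the UTF-8 encoding are all assumptions for illustration. The demo writes a few throwaway files to a temporary directory so that it runs anywhere.

```python
import os
import tempfile

def read_corpus(directory):
    """Return (filenames, texts) for every .txt file in directory, sorted by name."""
    # hypothetical helper: keep only .txt files, in a predictable order
    filenames = sorted(f for f in os.listdir(directory) if f.endswith(".txt"))
    texts = []
    for name in filenames:
        # os.path.join builds the path whether or not directory ends in a slash
        with open(os.path.join(directory, name), "r", encoding="utf-8") as file:
            texts.append(file.read())
    return filenames, texts

# demo with made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    samples = [("b.txt", "second song"), ("a.txt", "first song"), (".DS_Store", "junk")]
    for name, text in samples:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as file:
            file.write(text)
    demo_names, demo_texts = read_corpus(tmp)

print(demo_names)   # ['a.txt', 'b.txt']
print(demo_texts)   # ['first song', 'second song']
```

Because the filenames and the texts come back in the same order, `demo_names[i]` always tells you which file `demo_texts[i]` came from.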





After all that, the process of vectorizing the text of all the documents is the exact same one as before:

In [5]:
# instantiate the vectorizer
cv=CountVectorizer()

# generates document-term matrix for all the docs
dtm=cv.fit_transform(all_docs)

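# get the feature names aka terms
# (in scikit-learn 1.0 and later the method is get_feature_names_out())
features = cv.get_feature_names_out().tolist()

# take a look at the first 9 features (the slice 0:9 stops just before index 9)
print(features[0:9])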

['128', '1956', '1989', '22nd', '41', '441', '45', '57', 'abilities']


In [6]:
# you can also check the overall shape of the doc-term matrix 
dtm.shape

(76, 2849)

**What does this tell you about how many documents there are?** 

76

**What about the number of terms?**

2849
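
The shape is always (number of documents, number of terms), so the two values can be unpacked directly. Here is a small sketch on the six-line poem corpus from earlier, rebuilt under illustrative names so the block runs on its own. `nnz` is the SciPy sparse-matrix attribute that counts the stored nonzero entries.

```python
from sklearn.feature_extraction.text import CountVectorizer

poem_lines = [
    'In sixth grade Mrs. Walker',
    'slapped the back of my head',
    'and made me stand in the corner',
    'for not knowing the difference',
    'between persimmon and precision.',
    'How to choose',
]

poem_dtm = CountVectorizer().fit_transform(poem_lines)

# rows are documents, columns are terms
n_docs, n_terms = poem_dtm.shape
print(n_docs, "documents,", n_terms, "terms")

# most cells of a doc-term matrix are zero, which is why sklearn
# hands back a sparse matrix rather than a regular array
filled = poem_dtm.nnz
print(filled, "of", n_docs * n_terms, "cells are nonzero")
```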

## Helpful CountVectorizer Parameters

Here are a few more helpful CountVectorizer parameters to know about:



In [6]:
# lowercase all words -- this is True by default, but if you want to preserve case,
# you can set lowercase to False
cv_caps = CountVectorizer(lowercase=False)

# generates document-term matrix for all the docs
dtm2=cv_caps.fit_transform(all_docs)

# check the shape
dtm2.shape

(76, 3271)

So, there are more terms since it's not merging the uppercase and the lowercase versions of each word together.

Another parameter to know about has to do with stopwords. These are common words like "and", "not", "or", etc. that usually aren't very informative on their own, so we often filter them out.

In [7]:
# use the built-in English stopwords list
cv_no_stops = CountVectorizer(stop_words='english')

# generates document-term matrix for all the docs
dtm3=cv_no_stops.fit_transform(all_docs)

# check the shape
dtm3.shape

(76, 2628)

In [8]:
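# use a custom stopwords list
# (we give this vectorizer its own name so that it doesn't overwrite
# the English-stopwords vectorizer from the previous cell)
cv_custom_stops = CountVectorizer(stop_words=['love','heart','star'])

# generates document-term matrix for all the docs
dtm4=cv_custom_stops.fit_transform(all_docs)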

# check the shape
dtm4.shape

(76, 2846)

One last helpful feature of CountVectorizer is that you can tell it very easily to tokenize by ngrams as well as words. To wit:

In [9]:
bigram_cv = CountVectorizer(analyzer='word', ngram_range=(2, 2))

# generates document-term matrix for all the docs
dtm5=bigram_cv.fit_transform(all_docs)

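# get the feature names -- bigrams in this case
# (in scikit-learn 1.0 and later the method is get_feature_names_out())
features = bigram_cv.get_feature_names_out().tolist()

# take a look at the first 9 features (the slice 0:9 stops just before index 9)
print(features[0:9])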

['128 when', '1956 patient', '1989 my', '22nd row', '41 rollie', '441 like', '45 king', '57 the', 'abilities there']


## Converting the doc-term matrix to a dictionary

Finally, here is some helpful code for creating a dictionary with the features as keys and the counts as values. Don't worry about being able to parse all the syntax unless you feel like it.

In [10]:
# here's our number of features
num_feats = dtm.shape[1]

# get the feature names and the total count of each feature across all documents
# (we do this once, before the loop, so we aren't recomputing them on every pass)
feature_names = cv.get_feature_names_out()
total_counts = dtm.toarray().sum(axis=0)

# here's a dictionary to store the features and counts key/value pairs
feature_dict = {}

for x in range(num_feats):      # the for x in range() syntax is how you iterate over integers
    key = feature_names[x]      # this gets the feature name at position [x]
    value = total_counts[x]     # this gets the summed count of the feature at
                                # position [x] for all documents

    feature_dict[key] = value # add the new key/value pair to the dictionary
    
# then sort the dictionary in order of counts
sortFeats = sorted(feature_dict.items(), key=lambda x: x[1], reverse=True)

# for more on the sorted function, see: https://www.w3schools.com/python/ref_func_sorted.asp
# for more on lambda functions, see: https://towardsdatascience.com/lambda-functions-with-practical-examples-in-python-45934f3653a8

# then print top 30

for item in sortFeats[0:30]:
    print(str(item[0]) + ": " + str(item[1]))

the: 1013
you: 924
to: 588
and: 510
it: 477
me: 380
on: 336
my: 334
we: 274
that: 257
in: 254
yeah: 233
of: 227
be: 214
your: 210
with: 207
can: 205
no: 196
got: 192
just: 185
is: 181
all: 178
baby: 177
love: 173
for: 171
oh: 171
don: 169
la: 154
day: 151
re: 146


In [11]:
# we can also print from a bit lower in the counts

for item in sortFeats[200:230]:
    print(str(item[0]) + ": " + str(item[1]))

an: 20
everybody: 20
generation: 20
look: 20
nothing: 20
tune: 20
why: 20
check: 19
enough: 19
everywhere: 19
eyes: 19
last: 19
nothin: 19
pain: 19
queen: 19
radio: 19
spark: 19
these: 19
truth: 19
alright: 18
believer: 18
boy: 18
break: 18
hell: 18
huh: 18
keeps: 18
mutha: 18
something: 18
soul: 18
than: 18


**How would you create a list of the top 30 words in our lyrics corpus but with stopwords removed?**

In [13]:
# re-create the vectorizer with the built-in English stopwords list and
# regenerate the doc-term matrix, so that this cell doesn't depend on which
# vectorizer was assigned last in the cells above
cv_english_stops = CountVectorizer(stop_words='english')
dtm_english_stops = cv_english_stops.fit_transform(all_docs)

# here's our number of features
num_feats = dtm_english_stops.shape[1]

# get the feature names and summed counts from the *same* vectorizer and matrix;
# mixing names from one vectorizer with counts from a different matrix would
# pair each word with the wrong count
feature_names = cv_english_stops.get_feature_names_out()
total_counts = dtm_english_stops.toarray().sum(axis=0)

# here's a dictionary to store the features and counts key/value pairs
feature_dict = {}

for x in range(num_feats):      # iterate over the number of features
    key = feature_names[x]      # get the feature name at position [x]
    value = total_counts[x]     # get the summed count of the feature at position [x]

    feature_dict[key] = value # add the new key/value pair to the dictionary

# then sort the dictionary in order of counts
sortFeats = sorted(feature_dict.items(), key=lambda x: x[1], reverse=True)

# then print top 30

for item in sortFeats[0:30]:
    print(str(item[0]) + ": " + str(item[1]))


OK. That's it for today! scikit-learn and CountVectorizer set us up for the rest of the semester...