In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

# Discussion Section Exercise: Week 2
Since there's no homework assigned yet, this notebook is meant to gently introduce some of the Python packages we've been discussing in lecture.

The goal of this exercise isn't to get a "right answer", but to introduce you to `sklearn` and its documentation. I've provided some suggested code at the bottom of this notebook. Refer to these suggestions only after you have put in some time working independently.

## Loading some toy data

Below are some sentences that we'll be working with in the following cells. They are all <I>incipits</I> (you pronounce the "c" like a "k"), which are the first sentence(s) of a written work. I found them [here](https://www.penguin.co.uk/discover/articles/best-first-lines-in-books).

In [2]:
incipits = [
    "Mother died today. Or maybe, yesterday; I can't be sure.", # The Outsider by Camus
    "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",  # Metamorphosis by Franz Kafka
    "124 was spiteful. Full of Baby's venom.", # Beloved by Toni Morrison
    "The story so far: in the beginning, the universe was created. This has made a lot of people very angry and been widely regarded as a bad move.", # The Restaurant at the End of the Universe by Douglas Adams
    "It was a bright cold day in April, and the clocks were striking thirteen.", # 1984 by George Orwell
    "I write this sitting in the kitchen sink." # I Capture the Castle by Dodie Smith
]

## Counting tokens

First, let's vectorize these incipits using a `CountVectorizer`, which we imported in the first cell. Try first without passing any arguments to `CountVectorizer` when initializing it. After running your vectorizer on `incipits`, print the vectorized sequences as a NumPy array (not a CSR sparse matrix).

In [3]:
# 1
# Your code here

Cool. What's the shape of this array? 

In [4]:
# 2
# Your code here

Now inspect the `CountVectorizer`'s vocabulary. 

In [5]:
# 3
# Your code here

Using the vocabulary, print the counts of the word "the" for all sequences in `incipits`.

In [6]:
# 4
# Your code here

## TF-IDF

Now, let's convert the count vectors you computed to TF-IDF vectors using a `TfidfTransformer`. Print out an `array` of vectors, as well as its shape.

In [7]:
# 5
# Your code here

Now, let's try using the `TfidfVectorizer`, which is the same as running `CountVectorizer` followed by `TfidfTransformer`. Use `TfidfVectorizer` to vectorize `incipits`.

This time, though, when initializing the `TfidfVectorizer`, use `sklearn`'s built-in English stopword list. In addition, limit your vocabulary to the top 15 most frequent words across all sequences. Check the docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to find out how to do this.

Again, print the vectors as an array, print its shape, and print the `TfidfVectorizer`'s vocabulary. 

In [8]:
# 6
# Your code here

## Suggested code

Please don't look at the code below until you've attempted the exercises above yourself!

In [None]:
# 1
count_vectorizer = CountVectorizer()
incipit_vectors = count_vectorizer.fit_transform(incipits).toarray() # toarray casts CSR sparse matrix to array
incipit_vectors

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1],
       [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
        1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
        0, 0],
       [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
        1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 3, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1,
        0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,

In [10]:
# 2
incipit_vectors.shape # Gets the shape: (num_sequences, num_features)

(6, 68)
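
A side note on why `.toarray()` was needed in #1: `fit_transform` hands back a SciPy sparse matrix, which stores only the nonzero counts. The shape is available either way, so you only need to densify when you want to look at the numbers. The sketch below uses two made-up sentences (`docs` is not part of the exercise data) to show the difference.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, used only for illustration.
docs = ["the cat sat", "the dog sat on the cat"]

sparse_counts = CountVectorizer().fit_transform(docs)

# The sparse result already knows its shape: (num_sequences, num_features).
print(issparse(sparse_counts), sparse_counts.shape)

# nnz is the number of stored (nonzero) entries, usually far fewer than rows * columns.
print(sparse_counts.nnz)

# Densify only when you want to inspect or print the full matrix.
dense_counts = sparse_counts.toarray()
print(dense_counts)
```

With only six short incipits the dense array is tiny, but on a real corpus the dense version can be far too large to hold in memory, which is why `sklearn` defaults to the sparse format.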

In [11]:
# 3
count_vectorizer.vocabulary_ # Note the underscore. It's a dict of {token: index}, where the index
                             # corresponds to the location of that token's counts in a vector.

{'mother': 39,
 'died': 18,
 'today': 57,
 'or': 43,
 'maybe': 37,
 'yesterday': 67,
 'can': 13,
 'be': 8,
 'sure': 53,
 'as': 4,
 'gregor': 25,
 'samsa': 46,
 'awoke': 5,
 'one': 42,
 'morning': 38,
 'from': 22,
 'uneasy': 59,
 'dreams': 19,
 'he': 27,
 'found': 21,
 'himself': 28,
 'transformed': 58,
 'in': 30,
 'his': 29,
 'bed': 9,
 'into': 32,
 'gigantic': 24,
 'insect': 31,
 '124': 0,
 'was': 63,
 'spiteful': 50,
 'full': 23,
 'of': 41,
 'baby': 6,
 'venom': 61,
 'the': 54,
 'story': 51,
 'so': 49,
 'far': 20,
 'beginning': 11,
 'universe': 60,
 'created': 16,
 'this': 56,
 'has': 26,
 'made': 36,
 'lot': 35,
 'people': 44,
 'very': 62,
 'angry': 2,
 'and': 1,
 'been': 10,
 'widely': 65,
 'regarded': 45,
 'bad': 7,
 'move': 40,
 'it': 33,
 'bright': 12,
 'cold': 15,
 'day': 17,
 'april': 3,
 'clocks': 14,
 'were': 64,
 'striking': 52,
 'thirteen': 55,
 'write': 66,
 'sitting': 48,
 'kitchen': 34,
 'sink': 47}
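
Notice that the vocabulary contains `'can'` but not `"can't"`, and that "I" is missing entirely. With default settings, `CountVectorizer` lowercases the text and keeps only tokens made of two or more word characters, so the apostrophe splits "can't" and the single letters "t" and "I" are dropped. One way to see this for yourself is to call the vectorizer's analyzer directly on a sentence, as in this small sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to turn
# a raw string into a list of tokens (preprocessing + tokenization).
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("Mother died today. Or maybe, yesterday; I can't be sure.")
print(tokens)
# ['mother', 'died', 'today', 'or', 'maybe', 'yesterday', 'can', 'be', 'sure']
```

If you need different behavior, the `token_pattern` and `lowercase` arguments described in the `CountVectorizer` docs are the places to look.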

In [12]:
# 4
incipit_vectors[:, count_vectorizer.vocabulary_["the"]]

array([0, 0, 0, 3, 1, 1])

In [13]:
# 5
tfidf_transformer = TfidfTransformer() 
tfidf_incipit = tfidf_transformer.fit_transform(incipit_vectors).toarray() 

print(f"TF-IDF Vectors:\n{tfidf_incipit}\nTF-IDF Array Shape: {tfidf_incipit.shape}")


TF-IDF Vectors:
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.33333333 0.         0.         0.
  0.         0.33333333 0.         0.         0.         0.
  0.33333333 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.33333333 0.         0.33333333 0.         0.
  0.         0.33333333 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.33333333
  0.         0.         0.         0.33333333 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.33333333]
 [0.         0.         0.         0.         0.19314847 0.2355428
  0.         0.         0.         0.2355428  0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.2355428  0.         0.2355428  0.2355428  0.
  0.2355428  0.2355428  0.         0.2355428
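
Where does the `0.33333333` in the first row come from? The first incipit has nine tokens, each appearing once and in no other incipit, so all nine get the same weight, and after L2 normalization each becomes 1/√9 = 1/3. The sketch below reproduces the calculation by hand on a small made-up count matrix (`counts` is hypothetical) and checks it against `TfidfTransformer`. It assumes the default settings (`smooth_idf=True`, `norm="l2"`, no sublinear scaling) and the smoothed IDF formula given in the `sklearn` docs, so treat it as an illustration of those defaults and not a general definition of TF-IDF.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns).
counts = np.array([
    [1, 1, 0],
    [0, 2, 1],
    [0, 1, 0],
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight raw counts by IDF, then L2-normalize each row.
weighted = counts * idf
row_norms = np.linalg.norm(weighted, axis=1, keepdims=True)
manual_tfidf = weighted / row_norms

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Because each row is normalized to unit length, the values describe the relative importance of terms within a sequence, not absolute counts.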

In [14]:
# 6
tfidf_vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=15,
)

tfidf_incipit_small_vocab = tfidf_vectorizer.fit_transform(incipits).toarray()

print(f"TF-IDF vectors with English stopwords and |V| <= 15:\n{tfidf_incipit_small_vocab}")
print("\nShape:", tfidf_incipit_small_vocab.shape)
print("\nVocab:\n", tfidf_vectorizer.vocabulary_)

TF-IDF vectors with English stopwords and |V| <= 15:
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         1.         0.        ]
 [0.         0.         0.         0.57735027 0.         0.
  0.57735027 0.         0.         0.         0.         0.
  0.         0.         0.57735027]
 [0.70710678 0.         0.         0.         0.70710678 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.        ]
 [0.         0.5        0.         0.         0.         0.5
  0.         0.5        0.         0.         0.         0.5
  0.         0.         0.        ]
 [0.         0.         0.4472136  0.         0.         0.
  0.         0.         0.4472136  0.4472136  0.4472136  0.
  0.4472136  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.        ]]

Shape: 