## Calculating Containment

In this notebook, you'll implement a containment function that compares an answer text against a source text and returns a *normalized* value representing their similarity, based on the n-grams they share.

In [1]:
import numpy as np
import sklearn

### N-gram counts

One of the first things you'll need to do is to count up the occurrences of n-grams in your text data. To convert a set of text data into a matrix of counts, you can use a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Below, you can set a value for `n`, and a `CountVectorizer` is used to count up the n-gram occurrences. In the next cell, we'll see that the `CountVectorizer` constructs a vocabulary, and later, we'll look at the matrix of counts.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

a_text = "This is an answer text"
s_text = "This is a source text"

# set n
n = 1

# instantiate an ngram counter
counts = CountVectorizer(analyzer='word', ngram_range=(n,n))

# create a dictionary of n-grams by calling `.fit`
vocab2int = counts.fit([a_text, s_text]).vocabulary_

# print dictionary of words:index
print(vocab2int)

{'this': 5, 'is': 2, 'an': 0, 'answer': 1, 'text': 4, 'source': 3}


### EXERCISE: Create a vocabulary for 2-grams (aka "bigrams")

Create a `CountVectorizer`, `counts_2grams`, and fit it to our text data. Print out the resultant vocabulary.

In [14]:
# create a vocabulary for 2-grams only: ngram_range=(2,2)
counts_2grams = CountVectorizer(analyzer='word', ngram_range=(2,2))


In [15]:
# fit to both texts and print the bigram:index dictionary
vocab2grams = counts_2grams.fit([a_text, s_text]).vocabulary_
print(vocab2grams)

{'this is': 5, 'is an': 2, 'an answer': 0, 'answer text': 1, 'is source': 3, 'source text': 4}

### What makes up a word?

You'll note that the word "a" does not appear in the vocabulary, and that the words have been converted to lowercase. When `CountVectorizer` is passed `analyzer='word'`, its default token pattern defines a word as *two or more* characters, so it ignores single-character words. In a lot of text analysis, single characters are irrelevant to the meaning of a passage, so leaving them out of a vocabulary is often the desired behavior.

For our purposes, this default behavior will work well; we don't need single-character words to detect cases of plagiarism, but you may still want to experiment with single-character counts.

> If you *do* want to include single characters as words, you can choose to do so by adding one more argument when creating the `CountVectorizer`; pass in the definition of a token, `token_pattern = r"(?u)\b\w+\b"`. 

This regular expression defines a word as one or more characters. If you want to learn more about this vectorizer, I suggest reading through the [source code](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L664), which is well documented.
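As a quick check of the note above, the sketch below fits a second vectorizer with that `token_pattern` on the same two sentences. The names `counts_single` and `vocab_single` are illustrative choices, not part of the exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer

a_text = "This is an answer text"
s_text = "This is a source text"

# same settings as before, plus a token pattern that accepts words of
# one or more characters (the default pattern requires two or more)
counts_single = CountVectorizer(analyzer='word', ngram_range=(1,1),
                                token_pattern=r"(?u)\b\w+\b")

vocab_single = counts_single.fit([a_text, s_text]).vocabulary_

# list the vocabulary terms in column order
print(sorted(vocab_single, key=vocab_single.get))
```

With this pattern, "a" should now appear in the vocabulary, giving seven terms instead of six.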

**Next, let's fit our `CountVectorizer` to all of our text data to make an array of n-gram counts!**

The code below assumes that `counts` is our `CountVectorizer` for the n-gram size we are interested in.

In [16]:
# create array of n-gram counts for the answer and source text
ngrams = counts.fit_transform([a_text, s_text])

# rows = the 2 texts, columns = indexed vocab terms (as mapped above)
# e.g. column 0 = 'an', column 1 = 'answer', ..., column 4 = 'text'
ngram_array = ngrams.toarray()
print(ngram_array)

[[1 1 1 0 1 1]
 [0 0 1 1 1 1]]


In [25]:
for row in ngram_array:
    for ele in row:
        print(ele)

1
1
1
0
1
1
0
0
1
1
1
1


In [28]:
# total number of n-gram occurrences across both texts
sum(sum([ele for ele in row]) for row in ngram_array)

9

In [30]:
# number of columns = number of terms in the vocabulary
len(ngram_array[0])

6

So, the top row holds the n-gram counts for the answer text `a_text`, and the second row holds those for the source text `s_text`. Any n-grams the two texts have in common show up as columns that are nonzero in both rows. For example, both texts contain one "is" (column 2), one "text" (column 4), and one "this" (column 5).

```
[[1 1 1 0 1 1]    =   an  answer  [is]  ______  [text] [this]
 [0 0 1 1 1 1]]   =   __  ______  [is]  source  [text] [this]
```
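To check the column labels programmatically rather than by eye, one option is to sort the vocabulary by index and pick out the columns that are nonzero in both rows. This is a small sketch that refits the vectorizer so it runs on its own; `terms`, `both`, and `shared` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

a_text = "This is an answer text"
s_text = "This is a source text"

counts = CountVectorizer(analyzer='word', ngram_range=(1,1))
ngram_array = counts.fit_transform([a_text, s_text]).toarray()

# vocabulary terms ordered by their column index
terms = sorted(counts.vocabulary_, key=counts.vocabulary_.get)

# columns where both the answer (row 0) and the source (row 1) have a count
both = (ngram_array[0] > 0) & (ngram_array[1] > 0)
shared = [term for term, is_shared in zip(terms, both) if is_shared]
print(shared)
```

For these two sentences, the shared terms should be "is", "text", and "this", matching the bracketed columns above.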

### EXERCISE: Calculate containment values

Assume your function takes in an `ngram_array` just like the one generated above, for an answer text (row 0) and a source text (row 1). Using just this information, calculate the containment between the two texts. As before, it's okay to ignore single-character words.

To calculate the containment:
1. Calculate the n-gram **intersection** between the answer and source text: for each n-gram, take the smaller of its counts in the two texts.
2. Add up the number of common terms.
3. Normalize by dividing the value in step 2 by the number of n-grams in the answer text.

The complete equation is:

$$ \frac{\sum{\text{count}(\text{ngram}_{A}) \cap \text{count}(\text{ngram}_{S})}}{\sum{\text{count}(\text{ngram}_{A})}} $$
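The formula matters most when an n-gram occurs more than once. As a worked illustration, here is a hand-built count array (hypothetical, not produced by the cells above) for an answer text with repeated words. The intersection is taken as the element-wise minimum of the two rows, which is one common way to read the numerator.

```python
import numpy as np

# hypothetical counts for the vocabulary ['cat', 'sat', 'the']
# row 0 (answer): "the cat the cat"  -> cat: 2, sat: 0, the: 2
# row 1 (source): "the cat sat"      -> cat: 1, sat: 1, the: 1
example_array = np.array([[2, 0, 2],
                          [1, 1, 1]])

# step 1: intersection = the smaller count of each n-gram
intersection = np.minimum(example_array[0], example_array[1])

# step 2: add up the common terms
common = intersection.sum()

# step 3: normalize by the number of n-grams in the answer text
answer_total = example_array[0].sum()
print(common, '/', answer_total, '=', common / answer_total)
```

Here two of the answer's four unigrams are matched by the source, so the containment is 2/4, or 0.5.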

In [35]:
def containment(ngram_array):
    ''' Containment is a measure of text similarity. It is the normalized
       intersection of n-gram counts in two texts.
       :param ngram_array: an array of n-gram counts for an answer text (row 0)
           and a source text (row 1).
       :return: a normalized containment value.'''

    answer_counts = np.asarray(ngram_array[0])
    source_counts = np.asarray(ngram_array[1])

    # intersection: for each n-gram, the smaller of its two counts, so
    # n-grams that occur more than once are counted correctly
    intersection = np.minimum(answer_counts, source_counts).sum()

    # normalize by the total number of n-grams in the answer text
    return intersection / answer_counts.sum()


In [36]:
# test out your code
containment_val = containment(ngrams.toarray())

print('Containment: ', containment_val)

# note that for the given texts, and n = 1
# the containment value should be 3/5 or 0.6
assert containment_val==0.6, 'Unexpected containment value for n=1.'
print('Test passed!')

Containment:  0.6
Test passed!


In [37]:
# test for n = 2
counts_2grams = CountVectorizer(analyzer='word', ngram_range=(2,2))
bigram_counts = counts_2grams.fit_transform([a_text, s_text])

# calculate containment
containment_val = containment(bigram_counts.toarray())

print('Containment for n=2 : ', containment_val)

# the containment value should be 1/4 or 0.25
assert containment_val==0.25, 'Unexpected containment value for n=2.'
print('Test passed!')

Containment for n=2 :  0.25
Test passed!


I recommend trying out different phrases and different values of n. What happens if you also count single-character words? What if you make the sentences much longer?

I find that the best way to understand a new concept is to think about how it might be applied in a variety of different ways.
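One way to try different values of n is to wrap the vectorizing and the containment calculation in a single helper and loop over n. The helper below, `containment_for_n`, is a hypothetical name introduced for this sketch; it takes the intersection as the element-wise minimum of the two rows of counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def containment_for_n(answer, source, n):
    """Hypothetical helper: containment of `answer` in `source` for n-grams of size n."""
    counts = CountVectorizer(analyzer='word', ngram_range=(n, n))
    ngram_array = counts.fit_transform([answer, source]).toarray()
    # intersection = element-wise minimum of the two rows of counts
    intersection = np.minimum(ngram_array[0], ngram_array[1]).sum()
    # normalize by the number of n-grams in the answer text
    return intersection / ngram_array[0].sum()

a_text = "This is an answer text"
s_text = "This is a source text"

for n in range(1, 4):
    print('n =', n, 'containment =', containment_for_n(a_text, s_text, n))
```

Containment generally falls as n grows, because longer n-grams are less likely to match exactly. Note that if n is larger than the number of words in both texts, the vectorizer has no n-grams to count, and fitting is expected to fail with an error.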