# Tokenizing Sentences
1. Split the corpus into sentences.
2. Split each sentence into words.

In [2]:
# Why not just tokenize myself?
import nltk
text = "I made two purchases today! I bought a bag of grapes for $4.99, \
but then... realized John Francis already bought some at the Y.M.C.A!"

In [3]:
# trying to write our own tokenizer
text.split(".")

['I made two purchases today! I bought a bag of grapes for $4',
 '99, but then',
 '',
 '',
 ' realized John Francis already bought some at the Y',
 'M',
 'C',
 'A!']

In [4]:
# Using NLTK sent_tokenize()
sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
sent_text

['I made two purchases today!',
 'I bought a bag of grapes for $4.99, but then... realized John Francis already bought some at the Y.M.C.A!']
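Step 2, splitting each sentence into words, runs into the same special cases as the naive `split(".")` above. In practice `nltk.word_tokenize` is the library route. Purely as an illustration of how many cases even one sentence needs, here is a minimal regex sketch; the pattern and the helper name `simple_word_tokenize` are hypothetical and are not NLTK's algorithm.

```python
import re

# Hypothetical pattern for illustration only; alternatives are tried in order.
TOKEN_PATTERN = re.compile(r"""
    \$?\d+(?:\.\d+)?              # prices and numbers such as $4.99
  | (?:[A-Za-z]\.){2,}[A-Za-z]?   # dotted abbreviations such as Y.M.C.A
  | \w+(?:'\w+)?                  # plain words, optionally with an apostrophe
  | \.\.\.                        # an ellipsis kept as one token
  | [^\w\s]                       # any other single punctuation mark
""", re.VERBOSE)

def simple_word_tokenize(sentence):
    """Return the list of tokens matched by TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(sentence)

sentences = [
    "I made two purchases today!",
    "I bought a bag of grapes for $4.99, but then... realized John Francis "
    "already bought some at the Y.M.C.A!",
]
tokens = [simple_word_tokenize(s) for s in sentences]
print(tokens[0])  # ['I', 'made', 'two', 'purchases', 'today', '!']
```

Every new kind of token (URLs, hyphenated words, contractions) needs another alternative in the pattern, which is the main argument for using a maintained tokenizer instead.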

## Stemming

<img src="images/stemming-examples.png" alt="Different Stemming Techniques" style="width:600px;"/>

Stemming reduces inflected words to a common root form (the stem). A group of related words is mapped to the same stem, even if that stem is not itself a valid word in the language ([Source](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)).

In Python, we can use **`nltk.stem.porter.PorterStemmer`** to stem our words:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("caressed"))  # caress
print(stemmer.stem("athlete"))  # athlet
print(stemmer.stem("athletics"))  # athlet
print(stemmer.stem("media"))  # media
print(stemmer.stem("photography"))  # photographi
print(stemmer.stem("sexy"))  # sexi
print(stemmer.stem("journalling"))  # journal
print(stemmer.stem("Slovakia")) # slovakia
print(stemmer.stem("corpora")) # corpora
print(stemmer.stem("thieves")) # thiev
print(stemmer.stem("rocks")) # rock
```
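To make the idea of rule-based suffix stripping concrete, here is a toy sketch. It is not the Porter algorithm: the rule list and the name `toy_stem` are assumptions made up for illustration, and the real stemmer applies several ordered rule phases with extra conditions.

```python
# Hypothetical suffix rules, checked in order; the first match wins.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # protect words that end in a double s
    ("s", ""),       # rocks    -> rock
    ("ing", ""),
    ("ed", ""),      # caressed -> caress
]

def toy_stem(word):
    """Strip one suffix from a lowercased word, keeping at least 3 characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

print(toy_stem("caressed"))  # caress
print(toy_stem("ponies"))    # poni
print(toy_stem("rocks"))     # rock
```

Note that `poni` is not an English word: the stemmer only guarantees that related forms collapse to the same string, not that the string is meaningful.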

## Lemmatization
Lemmatization maps a word to its dictionary form (its lemma), so the result is always a valid word. `WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is given, which is why a verb form such as `caressed` comes back unchanged below.

```python
from nltk.stem import WordNetLemmatizer  # needs the WordNet corpus: nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("caressed"))  # caressed
print(lemmatizer.lemmatize("athlete"))  # athlete
print(lemmatizer.lemmatize("athletics"))  # athletics
print(lemmatizer.lemmatize("media"))
print(lemmatizer.lemmatize("photography"))  # photography
print(lemmatizer.lemmatize("sexy"))  # sexy
print(lemmatizer.lemmatize("journalling"))  # journalling
print(lemmatizer.lemmatize("Slovakia"))  # Slovakia
print(lemmatizer.lemmatize("corpora"))  # corpus
print(lemmatizer.lemmatize("thieves"))  # thief
print(lemmatizer.lemmatize("rocks"))  # rock
```

Why would you ever care to use stemming?
- stemmers are smaller and faster: they apply suffix rules instead of looking words up in a dictionary
- simplicity: the results are often "good enough"
- can often **provide higher recall (coverage)** if you are using it for text searching: an aggressive stemmer may reduce both `drives` and `drivers` to a shared stem such as `driv`, which is useful if your search engine needs to return all relevant documents, even at the cost of surfacing a few irrelevant ones
- could potentially be more useful for predictive models that tend to overfit
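The recall point in the list above can be sketched with a toy search. Everything here is hypothetical: `crude_stem` only strips a trailing `s` as a stand-in for a real stemmer, and the three documents are invented.

```python
def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Invented mini-corpus for illustration.
documents = {
    "doc1": "the driver parked the car",
    "doc2": "two drivers shared the route",
    "doc3": "cats sleep all day",
}

def search(query, stem=False):
    """Return ids of documents sharing at least one (optionally stemmed) term."""
    normalize = crude_stem if stem else str.lower
    query_terms = {normalize(w) for w in query.split()}
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if query_terms & {normalize(w) for w in text.split()}
    )

exact_hits = search("driver")               # only the exact form matches
stemmed_hits = search("driver", stem=True)  # "drivers" now matches too
print(exact_hits)    # ['doc1']
print(stemmed_hits)  # ['doc1', 'doc2']
```

Stemming both the query and the documents surfaces `doc2`, which exact matching misses. The same mechanism is what lets irrelevant documents slip in when unrelated words share a stem.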

## Scoring Metrics

<img src="images/confusion_matrix2.png" alt="Confusion Matrix" style="width:600px;"/>

### Precision/Recall

**Recall:** Of all the actual positive instances, what percentage did the model correctly predict as positive?

**Precision:** Of all the instances the model predicted as positive, what percentage were actually positive?

In terms of NLP / stemming / lemmatization:

**Recall**: After processing (tokenizing, stemming/lemmatizing) the data, what percent of the relevant search results were surfaced? I.e., when a user searches for "blue jeans", did the results returned include all of the relevant items (blue-ish colored denim pants)?

**Precision**: After processing (tokenizing, stemming/lemmatizing) the data, what percent of the results returned were relevant?
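As a rough sketch of the two search-oriented definitions above, the snippet below computes both metrics from a set of retrieved items and a set of relevant items. The item ids and the helper name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved vs. relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)  # relevant items that were returned
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical "blue jeans" search: 4 relevant items exist, 3 items were returned.
relevant = {"jeans_01", "jeans_02", "jeans_03", "jeans_04"}
retrieved = {"jeans_01", "jeans_02", "blue_shirt_07"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}")  # precision = 0.67
print(f"recall = {recall:.2f}")        # recall = 0.50
```

Two of the three returned items are relevant (precision 2/3), but only two of the four relevant items were found (recall 2/4).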

<img src="images/matrix_practice2.png" alt="Confusion Matrix Practice" style="width:600px;"/>

**Precision:** $\frac{?}{?}$

**Recall:** $\frac{?}{?}$

### F1 Score

The F1 score of a model is the harmonic mean of precision ($P$) and recall ($R$), defined as

$$
F_{1} = 2 \cdot \frac{P \cdot R}{P + R}
$$
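A minimal sketch of the formula, assuming precision and recall are already known; `f1_from_pr` is a hypothetical helper name and the input values are invented. The zero guard is an added convention for the case where both inputs are 0.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_from_pr(0.5, 0.5)  # equal inputs give the same value back
lopsided = f1_from_pr(0.9, 0.1)  # pulled toward the smaller input

print(round(balanced, 2))  # 0.5
print(round(lopsided, 2))  # 0.18
```

The second case shows why the harmonic mean is used: the arithmetic mean of 0.9 and 0.1 is 0.5, while F1 is about 0.18, so a model cannot hide poor recall behind high precision.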

## Exercise:

##### 1. Label each of the following statements True or False. If False, briefly explain why:

A. Text typically should be processed via either stemming or lemmatization, but not both.

B. Texts processed using lemmatization will typically have higher recall than stemming.

C. If the **F1 score** of a model is **1.0 (100%)**, then the accuracy of your model must also be **100%**.

##### 2. Calculate precision and recall given the following results from a confusion matrix:

<img src="images/exercise.jpeg" alt="Exercise Confusion Matrix" style="width:600px;"/>


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["It's still early, so box-office disappointments are still among the highest-grossing movies of the year.", 
        "That movie was terrific", "You love cats", 
        "Pay for top executives at big US companies is vastly higher than what everyday workers make, and a new report from The Wall Street Journal has found that CEOs have hit an eye-popping milestone in the size of their monthly paychecks."]
# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# vectorize the corpus
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)

# Notice what type of object this is
print(type(vector))

(4, 59)
<class 'scipy.sparse.csr.csr_matrix'>
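To show roughly what `CountVectorizer` does with its default settings, here is a dependency-free sketch: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count. The helper name `bag_of_words` is hypothetical, and the sketch returns plain lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Tokens of 2+ word characters, mirroring CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def bag_of_words(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["That movie was terrific", "You love cats"])
print(vocabulary)  # ['cats', 'love', 'movie', 'terrific', 'that', 'was', 'you']
print(rows[0])     # [0, 0, 1, 1, 1, 1, 0]
print(rows[1])     # [1, 1, 0, 0, 0, 0, 1]
```

Even in this tiny example roughly half of the entries are zero, and the share grows with vocabulary size. That is why the real vectorizer returns a sparse matrix, which stores only the non-zero counts.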


In [6]:
# see the outputted vectors
print(vector.toarray())
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

[[1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1
  1 0 0 0 0 0 1 2 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
  0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1
  0 1 1 1 1 1 0 0 1 0 1 1 2 1 1 1 1 1 0 1 1 0 0]]
['among', 'an', 'and', 'are', 'at', 'big', 'box', 'cats', 'ceos', 'companies', 'disappointments', 'early', 'everyday', 'executives', 'eye', 'for', 'found', 'from', 'grossing', 'has', 'have', 'higher', 'highest', 'hit', 'in', 'is', 'it', 'journal', 'love', 'make', 'milestone', 'monthly', 'movie', 'movies', 'new', 'of', 'office', 'pay', 'paychecks', 'popping', 'report', 'size', 'so', 'still', 'street', 'terrific', 'than', 'that', 'the', 'their', 'top', 'us', 'vastly', 'wall', 'was', 'what', 'workers', 'y

In [7]:
# load vectorized corpus into Pandas dataframe
import pandas as pd
corpus_df = pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names_out())
corpus_df.describe()

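| | among | an | and | are | at | big | box | cats | ceos | companies | ... | their | top | us | vastly | wall | was | what | workers | year | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | ... | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| mean | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | ... | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 |
| std | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | ... | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | ... | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |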


## Removing Stopwords

Whether to remove stopwords is a judgment call; we have already discussed the advantages and disadvantages of both approaches. To use NLTK's stopword list, first run `nltk.download("stopwords")`:

In [10]:
from nltk.corpus import stopwords
print(set(stopwords.words('english'))) # see the set of words NLTK considers stopwords

{"you'd", 'm', 'below', 'or', "mustn't", 'up', 'has', 't', 'until', 'their', 'most', 'these', 'few', 'no', 'hasn', "aren't", 'as', 'each', 'them', 'hers', 'there', 'whom', 'why', 'under', "don't", 're', 'wasn', "hadn't", 'nor', "doesn't", 'to', "won't", 'our', 'against', 'so', 'themselves', 'some', 'while', "hasn't", 'itself', 'other', 'just', 'its', 's', "that'll", 'through', 'ain', 'haven', 'her', 'off', 'only', 'can', 'we', 'further', 'isn', 'over', 'when', 'i', 'are', 'shan', 'after', "it's", 'him', 'on', 'with', 've', 'own', "she's", "shouldn't", "weren't", 'too', 'your', 'does', 'his', 'myself', "should've", "needn't", "shan't", "isn't", 'then', 'yourselves', 'they', 'about', 'ourselves', 'won', 'but', 'am', 'than', 'those', 'into', 'doing', 'is', 'same', 'doesn', 'before', 'which', 'having', 'herself', 'did', 'shouldn', 'out', 'himself', 'she', 'again', 'here', 'where', 'me', 'y', 'at', 'an', 'yours', "you'll", 'it', "mightn't", 'that', 'both', 'this', 'will', "wouldn't", 'been'

In [14]:
# drop the DataFrame columns whose names are stopwords:

original_columns = corpus_df.columns # get existing columns

# sorted list of the column names that are also NLTK stopwords
to_drop_columns = sorted(set(original_columns) & set(stopwords.words('english')))
print(f"Dataframe shape was {corpus_df.shape}")
corpus_df.drop(columns=to_drop_columns, inplace=True)
print(f"Dataframe shape is now {corpus_df.shape}")

Dataframe shape was (4, 59)
Dataframe shape is now (4, 39)


## Co-Occurrence Matrix

In [16]:
# run a quick correlation analysis to see if any word pairs show rough co-occurrence
corpus_df.corr()

| | among | big | box | cats | ceos | companies | disappointments | early | everyday | executives | ... | size | still | street | terrific | top | us | vastly | wall | workers | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| among | 1.0 | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | ... | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | 1.0 |
| big | -0.333333 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | ... | 1.0 | -0.333333 | 1.0 | -0.333333 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.333333 |
| box | 1.0 | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | ... | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | 1.0 |
| cats | -0.333333 | -0.333333 | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | ... | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 |
| ceos | -0.333333 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | ... | 1.0 | -0.333333 | 1.0 | -0.333333 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.333333 |
| companies | -0.333333 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | ... | 1.0 | -0.333333 | 1.0 | -0.333333 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.333333 |
| disappointments | 1.0 | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | ... | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | 1.0 |
| early | 1.0 | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | ... | -0.333333 | 1.0 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | -0.333333 | 1.0 |
| everyday | -0.333333 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | ... | 1.0 | -0.333333 | 1.0 | -0.333333 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.333333 |
| executives | -0.333333 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | -0.333333 | -0.333333 | 1.0 | 1.0 | ... | 1.0 | -0.333333 | 1.0 | -0.333333 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.333333 |

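Correlation is only a rough proxy for co-occurrence. As a complementary sketch, the snippet below counts how many documents each pair of words appears in together, which is one common definition of a document-level co-occurrence count. The tokenized documents and the helper name `cooccurrence_counts` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tokenized_docs):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for tokens in tokenized_docs:
        # sorted(set(...)) gives each pair once per document, in a stable order
        for first, second in combinations(sorted(set(tokens)), 2):
            counts[(first, second)] += 1
    return counts

# Hypothetical, already tokenized and stopword-filtered documents.
docs = [
    ["box", "office", "movies"],
    ["movie", "terrific"],
    ["love", "cats"],
    ["box", "office", "report"],
]

counts = cooccurrence_counts(docs)
print(counts[("box", "office")])  # 2
print(counts[("cats", "love")])   # 1
print(counts[("box", "cats")])    # 0
```

Pairs are stored with the alphabetically smaller word first, so lookups should use that order. With only four documents, as in the corpus above, any such counts (and the correlations) mostly reflect which words happen to share a single document.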