Note that you're still focusing on the feature-engineering step of the data-preprocessing pipeline as shown below:

![Feature Engineering](assets/feature_engineering.png)



# Semantics

In the BoW approach, with all the information that you were able to pull out of the text, one thing that you didn't really use was semantics—the *meaning* of the words and sentences. The models that you built in the previous checkpoint may know that Jane Austen tends to use the word `lady` a lot in her writing, but they don't know what a lady is. There is nothing in your work on NLP so far that would allow a model to say whether `queen` or `car` is more similar to `lady`. 

In the absence of semantic information, models can get tripped up on things like synonyms (`milady` and `lady`). You could modify the spaCy dictionary to include `lady` as the lemma of `milady`, then use lemmas for all your analyses. But for this to be an effective approach, you would have to go through your entire corpus and identify all synonyms for all words by hand. This approach would also discard subtle differences in the connotations of words, concepts, ideas, or emotions associated with `lady` (which elicits thoughts of formal manners and England) and `milady` (which elicits thoughts of the Middle Ages and Renaissance Faires).

Language is complicated, and explicitly modeling all the information encoded in a language is close to impossible. Fortunately, there are some approaches and methods that you can use to get around this to some extent. In general, these methods work on a corpus of text and learn the rules of the language by identifying recurring patterns within the corpus. As their output, all of these methods produce a numerical representation of the words.

In this checkpoint, you will learn about the *term frequency–inverse document frequency* (TF-IDF) method as a modification of the BoW approach of the previous checkpoint. You'll also learn about *latent semantic analysis*.


# BoW revisited: TF-IDF

The BoW approach rests upon counting the occurrences of the words in the documents (which in this case are the sentences). However, there is other information that you can exploit from the occurrences of the words, apart from their counts in the sentences. In the following, you'll go step by step as you learn about TF-IDF vectorization, which takes some clever steps when counting the words. 

Consider the following sentences:

1. "The best Monty Python sketch is the one about the dead parrot; I laughed so hard."
2. "I laugh when I think about Python's Ministry of Silly Walks sketch; it is funny, funny, funny, the best!"
3. "Chocolate is the best ice cream dessert topping, with a great taste."
4. "The Lumberjack Song is the funniest Monty Python bit; I can't think of it without laughing."
5. "I would rather put strawberries on my ice cream for dessert; they have the best taste."
6. "The taste of caramel is a fantastic accompaniment to tasty mint ice cream."

As a human being, it's easy to see that the sentences involve two topics: comedy and ice cream. One way to represent the sentences is in a *term-document matrix* that has a column for each sentence and a row for each word. Ignoring the stop words `the`, `is`, `and`, `a`, `of`, `I`, and `about`, discarding words that occur only once, and reducing words like `laughing` to their root form (`laugh`), the term-document matrix for these sentences would be as follows:

|           | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|---|---|---|---|---|---|
| Monty     | 1 | 0 | 0 | 1 | 0 | 0 |
| Python    | 1 | 1 | 0 | 1 | 0 | 0 |
| sketch    | 1 | 1 | 0 | 0 | 0 | 0 |
| laugh     | 1 | 1 | 0 | 1 | 0 | 0 |
| funny     | 0 | 3 | 0 | 1 | 0 | 0 |
| best      | 1 | 1 | 1 | 0 | 1 | 0 |
| ice cream | 0 | 0 | 1 | 0 | 1 | 1 |
| dessert   | 0 | 0 | 1 | 0 | 1 | 0 |
| taste     | 0 | 0 | 1 | 0 | 1 | 2 |



Note that the term *document* is used here to refer to the individual text chunks that you are working with. It can sometimes mean sentences, sometimes paragraphs, and sometimes whole text files. In your case, each sentence is a document. Also note that, contrary to how you've been operating, a term-document matrix has words as rows and documents as columns.

Counting the number of sentences in which each word appears, the comedy sentences use the following words: `Python` (3), `laugh` (3), `Monty` (2), `sketch` (2), `funny` (2), and `best` (2).
The ice cream sentences use the following words: `ice cream` (3), `taste` (3), `dessert` (2), and `best` (2).

The word `best` stands out here—it appears in more sentences than any other word (four out of the six sentences). It is used equally to describe Monty Python and ice cream. If you were to use this term-document matrix as is to teach a computer to parse sentences, `best` would end up as a significant identifier for both topics. And every time that you gave the model a new sentence to identify that included `best`, it would bring up both topics. Not very useful. To avoid this, you want to weight the matrix so that words that occur in many different sentences have lower weights than words that occur in fewer sentences. It is important to put a floor on this though—words that only occur once are totally useless for finding associations between sentences. 

Another word that stands out is `funny`, which appears more often in comedy sentences than any other word. This suggests that `funny` is a very important word for defining the *comedy* topic. 

## Quantifying documents: Collection and document frequencies

*Document frequency* (DF) counts how many sentences a word appears in. *Collection frequency* (CF) counts how often a word appears, total, over all sentences. Now, calculate the DF and CF for your sentence set:

|           |DF |CF| 
|-----------|---|---|
| Monty     | 2 | 2 | 
| Python    | 3 | 3 | 
| sketch    | 2 | 2 | 
| laugh     | 3 | 3 | 
| funny     | 2 | 4 | 
| best      | 4 | 4 | 
| ice cream | 3 | 3 | 
| dessert   | 2 | 2 | 
| taste     | 3 | 4 | 
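
As a rough sketch of how these two frequencies could be computed, the snippet below hard-codes the term-document matrix from the earlier table as a NumPy array and derives DF and CF from it. The variable names (`terms`, `counts`, `doc_freq`, `coll_freq`) are illustrative choices, not part of any library.

```python
import numpy as np

terms = ["Monty", "Python", "sketch", "laugh", "funny",
         "best", "ice cream", "dessert", "taste"]

# Rows are terms, columns are sentences 1-6 (copied from the term-document matrix).
counts = np.array([
    [1, 0, 0, 1, 0, 0],  # Monty
    [1, 1, 0, 1, 0, 0],  # Python
    [1, 1, 0, 0, 0, 0],  # sketch
    [1, 1, 0, 1, 0, 0],  # laugh
    [0, 3, 0, 1, 0, 0],  # funny
    [1, 1, 1, 0, 1, 0],  # best
    [0, 0, 1, 0, 1, 1],  # ice cream
    [0, 0, 1, 0, 1, 0],  # dessert
    [0, 0, 1, 0, 1, 2],  # taste
])

# Document frequency: in how many sentences does the term occur at least once?
doc_freq = (counts > 0).sum(axis=1)
# Collection frequency: how many times does the term occur in total?
coll_freq = counts.sum(axis=1)

for term, df, cf in zip(terms, doc_freq, coll_freq):
    print(f"{term:<10} DF={df} CF={cf}")
```

The two numbers differ only for terms that are repeated within a sentence, which is why `funny` and `taste` have a CF larger than their DF.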



## Penalizing indiscriminate words: Inverse document frequency

Now, weight the document frequency so that words that occur in fewer sentences (like `sketch` and `dessert`) are more influential than words that occur in many sentences (like `best`). To do this, calculate the ratio of total documents (*N*) to *DF*, then take the log (base 2) of that ratio to get the *inverse document frequency* (IDF) for each term (*t*):

$$idf_t=\log_2 \dfrac{N}{df_t}$$


|           | DF | CF | IDF |
|-----------|---|---|---|
| Monty     | 2 | 2 | 1.585 |
| Python    | 3 | 3 | 1 |
| sketch    | 2 | 2 | 1.585 |
| laugh     | 3 | 3 | 1 |
| funny     | 2 | 4 | 1.585 |
| best      | 4 | 4 | 0.585 |
| ice cream | 3 | 3 | 1 |
| dessert   | 2 | 2 | 1.585 |
| taste     | 3 | 4 | 1 |

The IDF weights tell the model to consider `best` as less important than other terms.
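
The IDF column can be reproduced with a couple of lines of NumPy. This is only a minimal sketch that starts from the DF values in the table; the names `doc_freq`, `n_docs`, and `idf` are chosen here for illustration.

```python
import numpy as np

terms = ["Monty", "Python", "sketch", "laugh", "funny",
         "best", "ice cream", "dessert", "taste"]

# Document frequencies taken from the DF column of the table
doc_freq = np.array([2, 3, 2, 3, 2, 4, 3, 2, 3])
n_docs = 6  # total number of sentences (N)

# Inverse document frequency with a base-2 logarithm
idf = np.log2(n_docs / doc_freq)

for term, value in zip(terms, idf):
    print(f"{term:<10} {value:.3f}")
```

A term that appeared in all six sentences would get an IDF of `log2(6/6) = 0`, which is the strongest possible penalty under this formula.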

## Term-frequency weights

The next piece of information to consider for your weights is how frequently a term appears within a sentence. The word `funny` appears three times in one sentence, so it would be good if you could weight `funny` so that the model knows that. You can accomplish this by creating unique weights for each sentence that combine the *term frequency* (how often a word appears within an individual document) with the IDF, like so:

$$tfidf_{t,d}=(tf_{t,d})(idf_t)$$

Now, the term `funny` in the second sentence, where it occurs three times, will be weighted more heavily than the term `funny` in the fourth sentence, where it only occurs once. If `best` had appeared multiple times in one sentence, it would also have a higher weight for that sentence, but the weight would be reduced by the IDF term that takes into account that `best` is a pretty common word in your collection of sentences.

The TF-IDF score will be highest for a term that occurs a lot within a small number of sentences, and lowest for a word that occurs in most or all sentences.

Now you can represent each sentence as a vector made up of the TF-IDF scores for each word:

|           | 1 | 2 | 3 | 
|-----------|---|---|---|
| Monty     | 1.585 | 0 | 0 |
| Python    | 1 | 1 | 0 | 
| sketch    | 1.585| 1.585 | 0 | 
| laugh     | 1 | 1 | 0 | 
| funny     | 0 | 4.755 | 0 | 
| best      | 0.585 | 0.585 | 0.585 | 
| ice cream | 0 | 0 | 1 | 
| dessert   | 0 | 0 | 1.585 | 
| taste     | 0 | 0 | 1 |
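
To check the numbers in this table, the sketch below multiplies the raw counts for sentences 1 through 3 by the base-2 IDF values. It assumes the counts and DF values from the earlier tables; the array names are illustrative.

```python
import numpy as np

terms = ["Monty", "Python", "sketch", "laugh", "funny",
         "best", "ice cream", "dessert", "taste"]

# Term counts for sentences 1-3 only (rows are terms, columns are sentences)
counts = np.array([
    [1, 0, 0],  # Monty
    [1, 1, 0],  # Python
    [1, 1, 0],  # sketch
    [1, 1, 0],  # laugh
    [0, 3, 0],  # funny
    [1, 1, 1],  # best
    [0, 0, 1],  # ice cream
    [0, 0, 1],  # dessert
    [0, 0, 1],  # taste
])

# IDF is computed over all six sentences, so DF comes from the full table
doc_freq = np.array([2, 3, 2, 3, 2, 4, 3, 2, 3])
idf = np.log2(6 / doc_freq)

# tfidf[t, d] = tf[t, d] * idf[t]; broadcasting applies each IDF across its row
tfidf = counts * idf[:, np.newaxis]

for term, row in zip(terms, tfidf):
    print(f"{term:<10}", "  ".join(f"{v:.3f}" for v in row))
```

Each column of `tfidf` is the vector for one sentence. The entry for `funny` in the second sentence is three times its IDF, which makes it the largest weight in the table.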


## Things to consider in TF-IDF

As with any feature-generation technique for text data, TF-IDF vectors are a kind of translation from human-readable language to computer-usable numeric form. Some information is inevitably lost in translation, and the usefulness of any model that you build from here on out depends on the decisions that you made during the translation step. Possible decision points include the following:

* **Which stop words to include or exclude**
* **If you should use phrases as terms:** For example, `Monty Python` instead of `Monty` and `Python`.
* **The threshold for infrequent words:** In this example, you excluded words that only occurred once. In longer documents, it may be a good idea to set a higher threshold.
* **How many terms to keep:** You kept all of the terms that fit the specified criteria (not a stop word, occurred more than once). But for bigger document collections or longer documents, this may create unfeasibly long vectors. You may want to decide to only keep the 10,000 words with the highest collection frequency scores, for example.
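
Several of these decisions correspond to constructor arguments of scikit-learn's `TfidfVectorizer`. The following sketch uses a small made-up corpus (shortened versions of the example sentences) to show one possible mapping; the specific values are placeholders for illustration rather than recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only
docs = [
    "The best Monty Python sketch is the one about the dead parrot",
    "The Lumberjack Song is the funniest Monty Python bit",
    "Chocolate is the best ice cream dessert topping",
    "I would rather put strawberries on my ice cream for dessert",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # which stop words to exclude (built-in English list)
    ngram_range=(1, 2),    # use single words and two-word phrases as terms
    min_df=2,              # threshold: keep terms found in at least two documents
)
vectorizer.fit(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)

# Cap the vocabulary at the most frequent terms across the corpus
capped = TfidfVectorizer(
    stop_words="english", ngram_range=(1, 2), min_df=2, max_features=3
)
capped.fit(docs)
print(sorted(capped.vocabulary_))
```

With `ngram_range=(1, 2)`, phrases such as `monty python` and `ice cream` become terms of their own alongside the individual words.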


## Implementing TF-IDF

Now, you're all set to implement TF-IDF vectorization. As you did in the previous checkpoint for the BoW approach, you'll be using the excellent scikit-learn library for generating TF-IDF vectors of Jane Austen's *Persuasion* and Lewis Carroll's *Alice's Adventures in Wonderland*.

Before jumping into the vectorization, apply the same data-cleaning steps as in the previous checkpoint:

In [1]:
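import numpy as np
import pandas as pd
import sklearn
import re
from nltk.corpus import gutenberg
import nltk
import warnings
# The small English model is imported under the alias `spacy`,
# so the later call to spacy.load() returns the en_core_web_sm pipeline.
import en_core_web_sm as spacy
warnings.filterwarnings("ignore")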

#nltk.download('gutenberg')

In [2]:
# Utility function for standard text cleaning
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation that spaCy doesn't
    # recognize: the double dash --. Better get rid of it now!
    text = re.sub(r'--',' ',text)
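    # Remove any text enclosed in square brackets
    text = re.sub(r"\[.*?\]", "", text)
    # Remove standalone numbers (integers and decimals)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())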
    return text

In [3]:
# Load and clean the data
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

In [4]:
# Parse the cleaned novels. This can take some time.
nlp = spacy.load()
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [5]:
# Group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one DataFrame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

|   | text | author |
|---|------|--------|
| 0 | (Alice, was, beginning, to, get, very, tired, ... | Carroll |
| 1 | (So, she, was, considering, in, her, own, mind... | Carroll |
| 2 | (There, was, nothing, so, VERY, remarkable, in... | Carroll |
| 3 | (Oh, dear, !) | Carroll |
| 4 | (Oh, dear, !) | Carroll |


In [6]:
# Get rid of stop words and punctuation,
# and lemmatize the tokens
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = " ".join(
        [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop])

Now, you can start using scikit-learn's `TfidfVectorizer` class. Note that below you'll set some parameters of the `TfidfVectorizer`. Specifically, set the following:

* `max_df=0.5`: This drops words that occur in more than half of the documents.
* `min_df=2`: This makes the vectorizer only use words that appear in at least two documents.
* `use_idf=True`: This makes the vectorizer use inverse document frequencies in weighting.
* `norm=u'l2'`: This scales each document vector to unit length (a Euclidean norm of 1) so that longer and shorter documents are treated equally.
* `smooth_idf=True`: This adds `1` to all document frequencies, as if an extra document existed that used every word once. This prevents divide-by-zero errors.

There are other parameters of `TfidfVectorizer` that you can set. For more information, you can look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 
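
One detail worth keeping in mind is that scikit-learn's IDF is not the base-2 formula used in the hand-worked example. With `smooth_idf=True`, its documentation gives the weight as `ln((1 + n) / (1 + df)) + 1`. The sketch below checks that formula and the effect of `norm='l2'` on a made-up three-document corpus; the corpus and the names `manual_idf` and `row_norms` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus for illustration only
docs = ["funny funny sketch", "funny parrot sketch", "ice cream dessert"]

tfidf = TfidfVectorizer(use_idf=True, norm="l2", smooth_idf=True)
X = tfidf.fit_transform(docs)

# Recompute document frequencies from raw counts, reusing the same vocabulary
counts = CountVectorizer(vocabulary=tfidf.vocabulary_).fit_transform(docs).toarray()
doc_freq = (counts > 0).sum(axis=0)
n_docs = len(docs)

# Smoothed IDF as documented by scikit-learn (natural log, plus 1)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(np.allclose(manual_idf, tfidf.idf_))

# With norm='l2', every document vector has Euclidean length 1
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```

Because of the added 1, every term that survives `max_df` and `min_df` keeps a positive IDF, so a term that is present in a document always receives a nonzero score.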

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)


# Applying the vectorizer
X = vectorizer.fit_transform(sentences["text"])

# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions no longer provide
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

# Keep in mind that a TF-IDF score of 0 means
# that the term does not appear in that sentence.
sentences.head()

|   | abide | ability | able | abominate | abroad | absence | absent | absolute | absolutely | absurd | ... | yer | yes | yesterday | yield | young | youth | zeal | zealous | text | author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Alice begin tired sit sister bank have twice p... | Carroll |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | consider mind hot day feel sleepy stupid pleas... | Carroll |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | remarkable Alice think way hear Rabbit | Carroll |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | oh dear | Carroll |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | oh dear | Carroll |


As you can see, you now have a dataset in an easy-to-use format: the rows hold the observations and the columns hold the numeric features. From now on, you can use this dataset as input to machine-learning models. So, you're moving into the modeling phase of the data-preprocessing pipeline, as shown below:

![Modeling](assets/modeling.png)

## TF-IDF in action

As you did in the previous checkpoint, you'll build some machine-learning models using your dataset. Because your features are all numerical now, you can use them directly in your models. As in the previous checkpoint, your task is to predict the author of a sentence.

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
# Pass the axis by keyword; recent pandas versions reject it as a positional argument
X = np.array(sentences.drop(['text','author'], axis=1))

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

GradientBoostingClassifier()

In [9]:
#Results Table
t = []
name = 'one'
#type indicator for the results dictionary
for i in range(0, 3):
    t.append(name)
#dictionary for these results.
d = {'type': t}


d['model'] = ['logistic']
d['test'] = [lr.score(X_test, y_test)]
d['train'] = [lr.score(X_train, y_train)]


d['model'].append('random')
d['test'].append(rfc.score(X_test, y_test))
d['train'].append(rfc.score(X_train, y_train) )

d['model'].append('gradient')
d['test'].append(gbc.score(X_test, y_test))
d['train'].append(gbc.score(X_train, y_train) )


In [10]:
d = pd.DataFrame(d)
d.set_index(['type', 'model'], inplace=True)
results = d

In [11]:
#Classification Reports
from sklearn.metrics import classification_report
print("----------------------Logistic Regression Scores----------------------")
print(classification_report(y_test, lr.predict(X_test)))
print("----------------------Random Forest Scores----------------------")
print(classification_report(y_test, rfc.predict(X_test)))
print("----------------------Gradient Boosting Scores----------------------")
print(classification_report(y_test, gbc.predict(X_test)))

----------------------Logistic Regression Scores----------------------
              precision    recall  f1-score   support

      Austen       0.85      0.98      0.91      1537
     Carroll       0.95      0.64      0.76       716

    accuracy                           0.87      2253
   macro avg       0.90      0.81      0.84      2253
weighted avg       0.88      0.87      0.87      2253

----------------------Random Forest Scores----------------------
              precision    recall  f1-score   support

      Austen       0.87      0.94      0.91      1537
     Carroll       0.85      0.70      0.77       716

    accuracy                           0.87      2253
   macro avg       0.86      0.82      0.84      2253
weighted avg       0.86      0.87      0.86      2253

----------------------Gradient Boosting Scores----------------------
              precision    recall  f1-score   support

      Austen       0.81      0.98      0.89      1537
     Carroll       0.92      0.5

Compared to the BoW scores of the previous checkpoint, your scores are a little bit lower this time. But at least the TF-IDF features seem to reduce the overfitting of the logistic regression.

## Example of 2-grams

You can also make use of n-grams in `TfidfVectorizer`. To do that, you need to set the `ngram_range` parameter. Below, use 2-grams as your features and apply TF-IDF vectorization. Then apply the same models above to get your predictions:

In [12]:
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm=u'l2', smooth_idf=True, ngram_range=(2,2))


# Applying the vectorizer
X = vectorizer.fit_transform(sentences["text"])

# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions no longer provide
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

# Keep in mind that a TF-IDF score of 0 means
# that the 2-gram does not appear in that sentence.
sentences.head()

|   | able bear | able persuade | absence home | absolute necessity | absolutely hopeless | accident lyme | accommodation man | account louisa | account small | account taste | ... | young friend | young lady | young man | young people | young person | young sister | young woman | youth say | text | author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Alice begin tired sit sister bank have twice p... | Carroll |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | consider mind hot day feel sleepy stupid pleas... | Carroll |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | remarkable Alice think way hear Rabbit | Carroll |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | oh dear | Carroll |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | oh dear | Carroll |


Notice that you have 3,381 features now (excluding the *text* and *author* columns). As you may remember, the 2-grams features of the BoW vectorizer from the previous checkpoint consist of more than 30,000 features! This is a huge reduction in the number of features due to the values that you set for the parameters of the `TfidfVectorizer`, like `max_df` and `min_df`.
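
To get a feel for what 2-gram features look like and how `min_df` prunes them, here is a small sketch on two made-up sentences. The sentences and variable names are hypothetical and are not drawn from the novels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy sentences for illustration only
docs = ["the young lady laughed", "the young man laughed"]

# Every pair of adjacent words becomes a feature
all_bigrams = TfidfVectorizer(ngram_range=(2, 2))
all_bigrams.fit(docs)
print(sorted(all_bigrams.vocabulary_))

# Requiring a 2-gram to occur in at least two documents removes most of them
pruned = TfidfVectorizer(ngram_range=(2, 2), min_df=2)
pruned.fit(docs)
print(sorted(pruned.vocabulary_))
```

Most 2-grams occur in only one sentence, so a document-frequency floor tends to remove a much larger share of features than it does for single words.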

Now, run your models using the 2-grams features:

In [13]:
Y = sentences['author']
# Pass the axis by keyword; recent pandas versions reject it as a positional argument
X = np.array(sentences.drop(['text','author'], axis=1))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

GradientBoostingClassifier()

In [14]:
#Results Table
t = []
name = 'two'
#type indicator for the results dictionary
for i in range(0, 3):
    t.append(name)
#dictionary for these results.
d = {'type': t}


d['model'] = ['logistic']
d['test'] = [lr.score(X_test, y_test)]
d['train'] = [lr.score(X_train, y_train)]


d['model'].append('random')
d['test'].append(rfc.score(X_test, y_test))
d['train'].append(rfc.score(X_train, y_train) )

d['model'].append('gradient')
d['test'].append(gbc.score(X_test, y_test))
d['train'].append(gbc.score(X_train, y_train) )

In [15]:
d = pd.DataFrame(d)
d.set_index(['type', 'model'], inplace=True)
results = pd.concat([results, d])
results

| type | model    | test     | train    |
|------|----------|----------|----------|
| one  | logistic | 0.873058 | 0.9124   |
| one  | random   | 0.865957 | 0.978692 |
| one  | gradient | 0.833111 | 0.849956 |
| two  | logistic | 0.778074 | 0.823321 |
| two  | random   | 0.802042 | 0.882806 |
| two  | gradient | 0.762983 | 0.768275 |


In [16]:
print("----------------------Logistic Regression Scores----------------------")
print(classification_report(y_test, lr.predict(X_test)))
print("----------------------Random Forest Scores----------------------")
print(classification_report(y_test, rfc.predict(X_test)))
print("----------------------Gradient Boosting Scores----------------------")
print(classification_report(y_test, gbc.predict(X_test)))

----------------------Logistic Regression Scores----------------------
              precision    recall  f1-score   support

      Austen       0.76      1.00      0.86      1537
     Carroll       0.99      0.30      0.47       716

    accuracy                           0.78      2253
   macro avg       0.87      0.65      0.66      2253
weighted avg       0.83      0.78      0.73      2253

----------------------Random Forest Scores----------------------
              precision    recall  f1-score   support

      Austen       0.78      0.98      0.87      1537
     Carroll       0.92      0.41      0.57       716

    accuracy                           0.80      2253
   macro avg       0.85      0.70      0.72      2253
weighted avg       0.83      0.80      0.78      2253

----------------------Gradient Boosting Scores----------------------
              precision    recall  f1-score   support

      Austen       0.74      1.00      0.85      1537
     Carroll       0.99      0.2

The results are slightly lower again in comparison to the BoW scores of the previous checkpoint. However, this time, the overfitting of the logistic regression and the random forest models is reduced substantially, mainly because of the reduction in the number of features.

# Some applications of TF-IDF

So far, you've applied classification models using the TF-IDF vectors to predict the authors of the sentences. Here, you'll briefly touch upon some popular NLP applications that make use of TF-IDF vectorization. In the first one, you'll briefly review how you can measure the similarities of the documents. And in the second one, you'll explore *topic modeling*, which refers to deriving the fundamental topics of a collection of documents.

## Vector space model

By now, you've had some practice thinking of data as existing in multidimensional space. Your sentences exist in an n-dimensional space, where *n* is equal to the number of terms in your term-document matrix. The vector representation of the text is referred to as a *vector space model* (VSM). You can use this representation to compute the similarity between the sentences and a new phrase or sentence. This method is often used by search engines to match a query to possible results.

To compute the similarity of sentences to a new sentence, transform the new sentence into a vector and place it in the space. You can then calculate how different the angles are for the original vectors and the new vector, and identify the vector whose angle is closest to the new vector. Typically, this is done by calculating the cosine of the angle between the vectors. If the two vectors are identical, the angle between them will be 0°, and the cosine will be 1. If the two vectors are orthogonal, with an angle of 90°, the cosine will be 0.

If you were running a search query, then you would return sentences that are most similar to the query sentence, ordered from the highest similarity score (cosine) to the lowest. Pretty handy!
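
The search scenario can be sketched in a few lines with scikit-learn's `cosine_similarity`. The three documents and the query below are made up for illustration, and the variable names are not taken from the checkpoint's own code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents and query for illustration only
docs = [
    "The best Monty Python sketch is the one about the dead parrot",
    "Chocolate is the best ice cream dessert topping",
    "The Lumberjack Song is the funniest Monty Python bit",
]
query = "ice cream topping"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)
# The query must be transformed with the vocabulary learned from the documents
query_vector = vectorizer.transform([query])

# Cosine of the angle between the query vector and each document vector
similarities = cosine_similarity(query_vector, doc_vectors).ravel()

# Rank documents from most to least similar
ranking = similarities.argsort()[::-1]
for idx in ranking:
    print(f"{similarities[idx]:.3f}  {docs[idx]}")
```

Documents that share no terms with the query are orthogonal to it and get a cosine of exactly 0, so only the ice cream sentence receives a positive score here.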

As cool as this is, there are limitations to the VSM. Because it treats each word as distinct from every other word, it can run aground on *synonyms* (treating words that mean the same thing as though they are different, like big and large). Also, because it treats all occurrences of a word as the same regardless of context, it can run aground on *polysemy*, where there are different meanings attached to the same word (for example, "I need a break" versus "I break things"). In addition, the vector space model has difficulty with very large documents because the more words a document has, the more opportunities it has to diverge from other documents in the space. This can make it difficult to see similarities.

## Latent semantic analysis

One way to address some of these limitations is to reduce your TF-IDF-weighted term-document matrix into a lower-dimensional space. In other words, you can express the information in the matrix using fewer rows by combining the information from multiple terms into one new row or dimension. You can accomplish this with a matrix decomposition that is closely related to principal components analysis (PCA). *Latent semantic analysis* (LSA), also called *latent semantic indexing*, is the process of applying such a decomposition to a TF-IDF matrix. What you get in the end is clusters of terms that presumably reflect a topic. Each document will get a score for each topic, with higher scores indicating that the document is relevant to the topic. Documents can pertain to more than one topic.

LSA is handy when your corpus is too large to topically annotate by hand, or when you don't know what topics characterize your documents. It is also useful as a way of creating features to be used in other models.

Now, try it out! Once again, use the Gutenberg corpus. This time, focus on comparing paragraphs within *Emma*, another novel by Jane Austen.

In [17]:
# Reading in the data, in the form of paragraphs this time
nltk.download('punkt')

emma=gutenberg.paras('austen-emma.txt')

# Processing
emma_paras=[]
for paragraph in emma:
    # Each paragraph is a list of tokenized sentences; keep only its first sentence
    para=paragraph[0]
    # Removing the double-dash -- from all words
    para=[re.sub(r'--','',word) for word in para]
    # Joining the tokens into a string and adding it to the list of strings
    emma_paras.append(' '.join(para))

print(emma_paras[0:4])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kalik\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['[ Emma by Jane Austen 1816 ]', 'VOLUME I', 'CHAPTER I', 'Emma Woodhouse , handsome , clever , and rich , with a comfortable home and happy disposition , seemed to unite some of the best blessings of existence ; and had lived nearly twenty - one years in the world with very little to distress or vex her .']


In [18]:
X_train, X_test = train_test_split(emma_paras, test_size=0.4, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # Drop words that occur in more than half the paragraphs.
                             min_df=2, # Only use words that appear in at least two paragraphs.
                             stop_words='english', 
                             lowercase=True, # Convert everything to lowercase so that capitalized and lowercase forms of a word are counted together.
                             use_idf=True, # Use inverse document frequencies in the weighting.
                             norm=u'l2', # Scale each paragraph vector to unit length so that longer and shorter paragraphs get treated equally.
                             smooth_idf=True # Add 1 to all document frequencies, as if an extra document existed that used every word once. This prevents divide-by-zero errors.
                            )


# Applying the vectorizer
emma_paras_tfidf=vectorizer.fit_transform(emma_paras)
print("Number of features: %d" % emma_paras_tfidf.shape[1])

# Splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(emma_paras_tfidf, test_size=0.4, random_state=0)

# Reshape the vectorizer output into something that people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

# Number of paragraphs
n = X_train_tfidf_csr.shape[0]
# A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]
# List of features
terms = vectorizer.get_feature_names_out()
# For each paragraph, list the feature words and their TF-IDF scores.
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

# Only nonzero scores are stored here; a term that is missing from the dictionary does not appear in that paragraph.
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_bypara[5])

Number of features: 1948
Original sentence: A very few minutes more , however , completed the present trial .
Tf_idf vector: {'minutes': 0.7127450310382584, 'present': 0.701423210857947}


### Dimension reduction
Okay, now you have your vectors, with one vector per paragraph. It's time to do some dimension reduction. Use the singular value decomposition (SVD) function from scikit-learn rather than PCA, because you don't want to mean-center your variables (and thus lose sparsity).
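
The sparsity argument can be seen on a tiny example. The sketch below, built on four made-up documents, counts the nonzero entries of a TF-IDF matrix before and after mean-centering, then runs `TruncatedSVD` directly on the sparse matrix. The corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only
docs = [
    "monty python sketch",
    "dead parrot sketch",
    "ice cream dessert",
    "chocolate ice cream",
]

X = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix

# Mean-centering (the step that PCA performs) turns the zeros into nonzero values
dense = X.toarray()
centered = dense - dense.mean(axis=0)
nonzero_before = np.count_nonzero(dense)
nonzero_after = np.count_nonzero(centered)
print(nonzero_before, "nonzero entries before centering,", nonzero_after, "after")

# TruncatedSVD skips the centering step, so it can work on the sparse matrix as is
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(X)
print(reduced.shape)
```

On a real corpus with thousands of terms, storing the centered matrix densely can become a practical memory problem, which is one reason to prefer the uncentered decomposition here.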

In [19]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# The SVD data reducer. You are going to reduce the feature space from 1,948 to 130.
svd= TruncatedSVD(130)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

# Look at what sorts of paragraphs your solution considers similar for the first five identified topics.
paras_by_component=pd.DataFrame(X_train_lsa,index=X_train)
for i in range(5):
    print('-------COMPONENT {}:'.format(i))
    print(paras_by_component.loc[:,i].sort_values(ascending=False)[0:10])
    print("-------------------------------------------------------------")

Percent variance captured by all components: 45.21460135692508
-------COMPONENT 0:
" Oh !    0.999287
" Oh !    0.999287
Oh !      0.999287
" Oh !    0.999287
" Oh !    0.999287
" Oh !    0.999287
" Oh !    0.999287
" Oh !    0.999287
" Oh !    0.999287
" Oh !    0.999287
Name: 0, dtype: float64
-------------------------------------------------------------
-------COMPONENT 1:
" You have made her too tall , Emma ," said Mr . Knightley .                                                                                                                0.633376
" You get upon delicate subjects , Emma ," said Mrs . Weston smiling ; " remember that I am here . Mr .                                                                     0.583872
" I do not know what your opinion may be , Mrs . Weston ," said Mr . Knightley , " of this great intimacy between Emma and Harriet Smith , but I think it a bad thing ."    0.566214
" You are right , Mrs . Weston ," said Mr . Knightley warmly , " Miss Fairfax 

From glancing at the most representative sample paragraphs, it appears that component 0 targets the exclamation `Oh!`, component 1 seems to largely involve critical dialogue directed at or about the main character Emma, component 2 is chapter headings, component 3 is exclamations involving `Ah!`, and component 4 involves actions by or directly related to Emma. What fun! 

LSA is one of many unsupervised methods that can be applied to text data. Although it is good for dealing with synonyms, it cannot handle polysemy. For that, you'll need to try out other kinds of approaches; you'll see one of them in the next checkpoint. 

Although LSA has been presented here as an unsupervised method, it can also be used to prepare text data for classification in supervised learning. In that case, the goal would be to use LSA to arrive at a smaller set of features that can be used to build a supervised model that classifies text into prelabeled categories.
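
As a rough sketch of that supervised use, the snippet below chains TF-IDF, a truncated SVD (the LSA step), and a classifier into one scikit-learn pipeline. The six short texts and the `pets`/`finance` labels are made up purely for illustration, and the choice of two components, the `Normalizer` step, and logistic regression are assumptions rather than settings taken from this checkpoint.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy, made-up labeled data (hypothetical; not the novels used in this checkpoint).
texts = [
    "the cat sat on the mat",
    "a cat chased the mouse",
    "the kitten and the cat played",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors sold stocks in the market",
]
labels = ["pets", "pets", "pets", "finance", "finance", "finance"]

# TF-IDF features -> LSA (truncated SVD) -> unit-length rows -> classifier.
lsa_classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),  # assumed number of components
    Normalizer(copy=False),
    LogisticRegression(),
)
lsa_classifier.fit(texts, labels)

# Every step except the classifier yields the reduced LSA feature matrix.
reduced = lsa_classifier[:-1].transform(texts)
print(reduced.shape)
print(lsa_classifier.predict(["the cat and the mouse"]))
```

On a real corpus, the number of components would be far larger than two and is usually tuned alongside the classifier.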

# CHECKPOINT ASSIGNMENT

Please submit your solutions to the following tasks as a link to your Jupyter Notebook on GitHub.

Converting words or sentences into numeric vectors is fundamental when working with text data. 


To make sure that you have a solid handle on how these vectors work, generate the TF-IDF vectors for the last three sentences of the example from the beginning of this checkpoint (from the BoW revisited: TF-IDF section). If you are feeling uncertain, have your mentor walk you through it.

In [20]:
#The last three sentences:
one = "The Lumberjack Song is the funniest Monty Python bit; I can't think of it without laughing."
two = "I would rather put strawberries on my ice cream for dessert; they have the best taste."
three = "The taste of caramel is a fantastic accompaniment to tasty mint ice cream."

# Clean up the documents.
one = text_cleaner(one)
two = text_cleaner(two)
three = text_cleaner(three)

# Parse each cleaned sentence into a spaCy document.
one = nlp(one)
two = nlp(two)
three = nlp(three)

# Confirm that each document contains a single sentence.
for sent in three.sents:
    print("sentence:", sent)

# Replace each document with its lemmas, dropping punctuation and stop words.
one =  " ".join([token.lemma_ for token in one if not token.is_punct and not token.is_stop])
two =  " ".join([token.lemma_ for token in two if not token.is_punct and not token.is_stop])
three =  " ".join([token.lemma_ for token in three if not token.is_punct and not token.is_stop])
print(one)
print(two)
print(three)

sentence: The taste of caramel is a fantastic accompaniment to tasty mint ice cream.
Lumberjack Song funny Monty Python bit think laugh
strawberry ice cream dessert good taste
taste caramel fantastic accompaniment tasty mint ice cream


In [21]:
corpus = [one, two, three]

In [22]:
vectorizer = TfidfVectorizer(use_idf=True, norm=u'l2', smooth_idf=True)

# Applying the vectorizer
X = vectorizer.fit_transform(corpus)

understanding = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
understanding

|   | accompaniment | bit | caramel | cream | dessert | fantastic | funny | good | ice | laugh | lumberjack | mint | monty | python | song | strawberry | taste | tasty | think |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.353553 | 0.0 | 0.0 | 0.0 | 0.0 | 0.353553 | 0.0 | 0.0 | 0.353553 | 0.353553 | 0.0 | 0.353553 | 0.353553 | 0.353553 | 0.0 | 0.0 | 0.0 | 0.353553 |
| 1 | 0.0 | 0.0 | 0.0 | 0.349498 | 0.459548 | 0.0 | 0.0 | 0.459548 | 0.349498 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.459548 | 0.349498 | 0.0 | 0.0 |
| 2 | 0.385323 | 0.0 | 0.385323 | 0.293048 | 0.0 | 0.385323 | 0.0 | 0.0 | 0.293048 | 0.0 | 0.0 | 0.385323 | 0.0 | 0.0 | 0.0 | 0.0 | 0.293048 | 0.385323 | 0.0 |
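
To see where these numbers come from, here is a minimal by-hand sketch in plain Python. It assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper names `smooth_idf` and `tfidf_vector` are made up for this illustration, and the simple lowercase-and-split tokenization is an assumption that happens to be enough for these three cleaned sentences.

```python
import math
from collections import Counter

# The three cleaned, lemmatized sentences printed above.
corpus = [
    "Lumberjack Song funny Monty Python bit think laugh",
    "strawberry ice cream dessert good taste",
    "taste caramel fantastic accompaniment tasty mint ice cream",
]

docs = [doc.lower().split() for doc in corpus]
n_docs = len(docs)

# Document frequency: the number of sentences that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def smooth_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc):
    """Raw term count times idf, then scaled so the vector has unit L2 length."""
    counts = Counter(doc)
    raw = {term: count * smooth_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

vectors = [tfidf_vector(doc) for doc in docs]

# "strawberry" appears in one sentence; "ice" appears in two.
print(round(vectors[1]["strawberry"], 6))
print(round(vectors[1]["ice"], 6))
```

The two printed values should agree with the `strawberry` and `ice` entries in row 1 of the table: the rarer term gets the larger weight. In row 0 no term is shared with another sentence, so all eight terms receive the same weight, 1/√8 ≈ 0.353553.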


In the 2-grams example above, you only used 2-grams as your features. This time, use both 1-grams and 2-grams together as your feature set. Run the same models as in the example and compare the results.
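
Before running the models, it can help to see what "1-grams and 2-grams together" means for a single sentence. The sketch below is a plain-Python illustration of the idea behind `ngram_range=(1, 2)`; the function `word_ngrams` is a hypothetical helper written for this example, not part of scikit-learn's public API.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return every contiguous n-gram of the token list for each n in the range."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "strawberry ice cream dessert good taste".split()
features = word_ngrams(tokens)
print(features)
```

For this six-token sentence, the result holds the six single words plus the five adjacent pairs (such as `ice cream`), so the combined feature set is a superset of both the 1-gram and the 2-gram vocabularies.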

In [23]:
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm=u'l2', smooth_idf=True, ngram_range=(1,2))


# applying the vectorizer
X = vectorizer.fit_transform(sentences["text"])

tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

# Keep in mind that a tf-idf score of 0 means the term does not appear in that sentence.
# With smooth_idf=True, every term that does appear receives a positive score.
sentences.head()

|   | abide | ability | able | able bear | able persuade | abominate | abroad | absence | absence home | absent | ... | young people | young person | young sister | young woman | youth | youth say | zeal | zealous | text | author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Alice begin tired sit sister bank have twice p... | Carroll |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | consider mind hot day feel sleepy stupid pleas... | Carroll |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | remarkable Alice think way hear Rabbit | Carroll |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | oh dear | Carroll |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | oh dear | Carroll |


In [24]:
Y = sentences['author']
X = np.array(sentences.drop(['text', 'author'], axis=1))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

GradientBoostingClassifier()

In [25]:
# Results table: score each fitted model on the test and training sets.
name = 'both'  # type indicator for this feature set
models = [('logistic', lr), ('random', rfc), ('gradient', gbc)]

# Dictionary for these results, one entry per model.
d = {
    'type': [name] * len(models),
    'model': [label for label, _ in models],
    'test': [model.score(X_test, y_test) for _, model in models],
    'train': [model.score(X_train, y_train) for _, model in models],
}

In [26]:
d = pd.DataFrame(d)
d.set_index(['type', 'model'], inplace=True)
results = pd.concat([results, d])
results

| type | model | test | train |
|---|---|---|---|
| one | logistic | 0.873058 | 0.9124 |
| one | random | 0.865957 | 0.978692 |
| one | gradient | 0.833111 | 0.849956 |
| two | logistic | 0.778074 | 0.823321 |
| two | random | 0.802042 | 0.882806 |
| two | gradient | 0.762983 | 0.768275 |
| both | logistic | 0.871727 | 0.912696 |
| both | random | 0.869063 | 0.978692 |
| both | gradient | 0.83178 | 0.850843 |


In [27]:
print("----------------------Logistic Regression Scores----------------------")
print(classification_report(y_test, lr.predict(X_test)))
print("----------------------Random Forest Scores----------------------")
print(classification_report(y_test, rfc.predict(X_test)))
print("----------------------Gradient Boosting Scores----------------------")
print(classification_report(y_test, gbc.predict(X_test)))

----------------------Logistic Regression Scores----------------------
              precision    recall  f1-score   support

      Austen       0.85      0.99      0.91      1537
     Carroll       0.96      0.63      0.76       716

    accuracy                           0.87      2253
   macro avg       0.90      0.81      0.83      2253
weighted avg       0.88      0.87      0.86      2253
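
The f1-score column in a report like the one above is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall values printed for logistic regression. Because the inputs are already rounded to two decimals, the results are approximate, and the function name `f1_from` is a hypothetical helper for this illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the logistic regression report above.
austen_f1 = f1_from(0.85, 0.99)
carroll_f1 = f1_from(0.96, 0.63)

print(round(austen_f1, 2))
print(round(carroll_f1, 2))
```

The harmonic mean is pulled toward the smaller of the two numbers, which is why Carroll's low recall drags its f1-score well below its precision.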

----------------------Random Forest Scores----------------------
              precision    recall  f1-score   support

      Austen       0.87      0.95      0.91      1537
     Carroll       0.87      0.69      0.77       716

    accuracy                           0.87      2253
   macro avg       0.87      0.82      0.84      2253
weighted avg       0.87      0.87      0.86      2253

----------------------Gradient Boosting Scores----------------------
              precision    recall  f1-score   support

      Austen       0.81      0.98      0.89      1537
     Carroll       0.91      0.5