
In [None]:
# Imports
import numpy as np
import pandas as pd
import spacy
import re

import nltk
from nltk.corpus import gutenberg

import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

nltk.download('gutenberg')
!python -m spacy download en_core_web_sm

Converting words or sentences into numeric vectors is fundamental when working with text data. There are many ways to do this; in this notebook I will discuss **TF-IDF** (term frequency-inverse document frequency) vectorization and work through a few examples. Let's start simple with the following three sentences.

1. "The Lumberjack Song is the funniest Monty Python bit; I can't think of it without laughing."
2. "I would rather put strawberries on my ice cream for dessert; they have the best taste."
3. "The taste of caramel is a fantastic accompaniment to tasty mint ice cream."

Before we can do anything with these sentences, we first want to put them in a format that will allow us to preprocess the data: remove stopwords and punctuation, and lemmatize the resulting tokens.

In [5]:
# Creating the text string
sentences = "The Lumberjack Song is the funniest Monty Python bit; I can't think of it without laughing. I would rather put strawberries on my ice cream for dessert; they have the best taste. The taste of caramel is a fantastic accompaniment to tasty mint ice cream."
sentences

"The Lumberjack Song is the funniest Monty Python bit; I can't think of it without laughing. I would rather put strawberries on my ice cream for dessert; they have the best taste. The taste of caramel is a fantastic accompaniment to tasty mint ice cream."

Now we can use spaCy to parse the text and tokenize it for use. This is very easy to do but can take a while with larger data sets.

In [7]:
nlp = spacy.load('en_core_web_sm')

sentences = nlp(sentences)
sentences

The Lumberjack Song is the funniest Monty Python bit; I can't think of it without laughing. I would rather put strawberries on my ice cream for dessert; they have the best taste. The taste of caramel is a fantastic accompaniment to tasty mint ice cream.

Although I have called the parsed and tokenized data `sentences`, I don't actually have sentences yet. Let's group the text by sentence and create a DataFrame. Then we will be able to apply TF-IDF vectorization to the data.

In [10]:
# Group into sentences
sentences = [[sent] for sent in sentences.sents]
sentences

[[The Lumberjack Song is the funniest Monty Python bit; I can't think of it without laughing.],
 [I would rather put strawberries on my ice cream for dessert; they have the best taste.],
 [The taste of caramel is a fantastic accompaniment to tasty mint ice cream.]]

In [11]:
# Creating a DataFrame
sentences = pd.DataFrame(sentences, columns = ['text'])
sentences.head()

Unnamed: 0,text
0,"(The, Lumberjack, Song, is, the, funniest, Mon..."
1,"(I, would, rather, put, strawberries, on, my, ..."
2,"(The, taste, of, caramel, is, a, fantastic, ac..."


As we can see, each row now holds one sentence as a sequence of its tokens. Next we want to remove punctuation and stopwords and lemmatize the remaining tokens.

In [12]:
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = " ".join(
        [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop])
    
sentences

Unnamed: 0,text
0,Lumberjack Song funniest Monty Python bit thin...
1,strawberry ice cream dessert good taste
2,taste caramel fantastic accompaniment tasty mi...


Now we can apply TF-IDF!

In [15]:
vectorizer = TfidfVectorizer(min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)

# Applying the vectorizer
X = vectorizer.fit_transform(sentences['text'])

tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([tfidf_df, sentences[["text"]]], axis=1)

# A TF-IDF score of 0 means the term does not occur in that sentence.
# With smooth_idf=True the IDF is ln((1 + n) / (1 + df)) + 1, which is never 0.
sentences.head()

Unnamed: 0,cream,ice,taste,text
0,0.0,0.0,0.0,Lumberjack Song funniest Monty Python bit thin...
1,0.57735,0.57735,0.57735,strawberry ice cream dessert good taste
2,0.57735,0.57735,0.57735,taste caramel fantastic accompaniment tasty mi...


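We should not be surprised that the first sentence scores 0 everywhere. With `min_df=2`, the vectorizer only keeps words that appear in at least 2 sentences. The last 2 sentences are about ice cream and share "cream", "ice", and "taste", while the first is about Monty Python being funny and shares no words with the others. Each nonzero row holds three equally weighted terms, so L2 normalization gives each of them 1/√3 ≈ 0.57735.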
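
To make those numbers less mysterious, here is a minimal pure-Python sketch that recomputes the table by hand. It assumes the smoothed IDF documented for scikit-learn, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The token lists are an assumption too: the DataFrame output above truncates two rows, so they are completed by hand from the original sentences and lowercased (as `TfidfVectorizer` does by default). `tfidf_row` is an illustrative helper, not part of any library.

```python
import math
from collections import Counter

# Assumed lemmatized sentences (truncated rows completed by hand, lowercased).
docs = [
    "lumberjack song funniest monty python bit think laugh",
    "strawberry ice cream dessert good taste",
    "taste caramel fantastic accompaniment tasty mint ice cream",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many sentences does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

# min_df=2 keeps only the terms found in at least two sentences.
vocab = sorted(term for term, count in df.items() if count >= 2)

# Smoothed inverse document frequency (natural log, never zero).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Weight term counts by IDF, then rescale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    # A row with no surviving terms stays all zeros.
    return [value / norm if norm else 0.0 for value in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]

print(vocab)
for row in rows:
    print([round(value, 5) for value in row])
```

Under these assumptions the surviving vocabulary is `cream`, `ice`, `taste`, the first row is all zeros, and the other two rows come out as 0.57735 in every column, matching the DataFrame above.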

Now that we have worked with some made-up data, let's look at some actual data! This data comes from the Gutenberg corpus that ships with NLTK, a selection of public-domain texts from Project Gutenberg that anyone can download. We will be generating TF-IDF vectors of Jane Austen's *Persuasion* and Lewis Carroll's *Alice's Adventures in Wonderland*.

First, we must load and clean the data.

In [16]:
# First defining a function that will help us clean the data
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation that spaCy doesn't
    # recognize: the double dash '--'. Better get rid of it now!
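    text = re.sub(r'--', ' ', text)
    # Remove anything enclosed in square brackets
    text = re.sub(r"[\[].*?[\]]", "", text)
    # Remove standalone integers and decimals (optionally preceded by '-')
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # Collapse runs of whitespace, including newlines, into single spaces
    text = ' '.join(text.split())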
    return text
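
To see what each substitution contributes, here is a small step-by-step sketch that applies the same four operations to a made-up sample string. The sample sentence and the `step1` to `step4` names are purely illustrative and do not come from either novel.

```python
import re

# Hypothetical sample containing a double dash, a bracketed note and a number.
sample = "She waited--patiently [sic] for 12 days."

# 1. Replace the double dash with a space.
step1 = re.sub(r"--", " ", sample)

# 2. Drop anything enclosed in square brackets.
step2 = re.sub(r"[\[].*?[\]]", "", step1)

# 3. Drop standalone numbers.
step3 = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", step2)

# 4. Collapse the leftover runs of whitespace.
step4 = " ".join(step3.split())

print(step1)  # She waited patiently [sic] for 12 days.
print(step4)  # She waited patiently for days.
```

The intermediate strings contain doubled spaces where text was removed, which is why the final whitespace-collapsing step matters.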

In [17]:
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

In [35]:
# First 500 characters of Alice in Wonderland
alice[0:500]

"Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of gettin"

In [36]:
# Persuasion too
persuasion[0:500]

'Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who, for his own amusement, never took up any book but the Baronetage; there he found occupation for an idle hour, and consolation in a distressed one; there his faculties were roused into admiration and respect, by contemplating the limited remnant of the earliest patents; there any unwelcome sensations, arising from domestic affairs changed naturally into pity and contempt as he turned over the almost endless creations of the las'

As we can see, the text is cleaner, but this was only a first pass. As before, we will want to create a DataFrame with one row per sentence of each book.

In [20]:
nlp = spacy.load('en_core_web_sm')

# Bigger file, so may take a while!
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [37]:
alice_doc[0:500]

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT-POCKET, and looked at it, and the

In [38]:
persuasion_doc[0:500]

Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who, for his own amusement, never took up any book but the Baronetage; there he found occupation for an idle hour, and consolation in a distressed one; there his faculties were roused into admiration and respect, by contemplating the limited remnant of the earliest patents; there any unwelcome sensations, arising from domestic affairs changed naturally into pity and contempt as he turned over the almost endless creations of the last century; and there, if every other leaf were powerless, he could read his own history with an interest which never failed. This was the page at which the favourite volume always opened: "ELLIOT OF KELLYNCH HALL. "Walter Elliot, born March , , married, July , , Elizabeth, daughter of James Stevenson, Esq. of South Park, in the county of Gloucester, by which lady (who died ) he has issue Elizabeth, born June , ; Anne, born August , ; a still-born son, November , ; Mary, born November , ." Precis

By now it should be clear where we are heading. Next, we will group the tokens into sentences and create a DataFrame. This time we will add a column holding the name of the author of each sentence.

In [23]:
# Group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one DataFrame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

Unnamed: 0,text,author
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


We are not there yet: we still need to remove stop words and punctuation and lemmatize the tokens.


In [24]:
# Get rid of stop words and punctuation,
# and lemmatize the tokens
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = " ".join(
        [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop])

Now we are ready to try TF-IDF vectorization. However, let's make it more interesting by using N-grams. In case you haven't heard of them, N-grams are sequences of N consecutive words in the text. As a quick example, take the following sentence.

I enjoy learning Natural Language Processing.

The corresponding 2-grams would look like this.

    (I enjoy), (enjoy learning), (learning Natural), (Natural Language), (Language Processing).

Previously we looked at the TF-IDF vectorization of 1-grams, because each feature was a single word. Now we will look at the TF-IDF vectorization of 2-grams.
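
As a rough sketch of the idea, the pairs listed above can be generated by sliding a window of length N over the token list. The `ngrams` helper below is an illustrative name written for this example; it only shows the concept and is not how `TfidfVectorizer` is implemented internally.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I enjoy learning Natural Language Processing".split()

bigrams = ngrams(tokens, 2)
print(bigrams)
# [('I', 'enjoy'), ('enjoy', 'learning'), ('learning', 'Natural'),
#  ('Natural', 'Language'), ('Language', 'Processing')]
```

A sentence with k tokens yields k - N + 1 N-grams, so short sentences produce very few 2-grams.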

In [25]:
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm='l2',
    smooth_idf=True,
    ngram_range=(2, 2))  # ngram_range=(min_n, max_n) sets which N-grams to include


# Applying the vectorizer
X = vectorizer.fit_transform(sentences["text"])

tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

# A TF-IDF score of 0 means the 2-gram does not occur in that sentence.
# With smooth_idf=True the IDF is ln((1 + n) / (1 + df)) + 1, which is never 0.
sentences.head()

Unnamed: 0,able bear,able persuade,absence home,absolute necessity,absolutely hopeless,accident lyme,accidentally hear,accommodation man,account louisa,account small,acquaint captain,acquaintance ask,acquaintance cease,acquaintance time,acquaintance visit,active service,add certainly,add dormouse,add explanation,add gryphon,add look,admiral baldwin,admiral croft,admiral mrs,admiral think,admire exceedingly,admit doubt,advance twice,advantage see,advice lady,affection confidence,afraid mention,afraid offend,afraid sir,agree captain,agree good,agree have,agree say,agreeable man,agreeable manner,...,word come,word drink,word explain,word listen,word look,word say,word scarcely,world know,world round,worth attend,worth have,write bath,write letter,write slate,writing desk,year ago,year anne,year go,year half,year monkford,year old,year pass,year school,year year,yer honour,yes mr,yes say,yes yes,young child,young fellow,young friend,young lady,young man,young people,young person,young sister,young woman,youth say,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,shall late,Carroll


Now our columns are not single words but pairs of words. We see nothing but zeros for the most part, but that is okay: we are only looking at the first 5 sentences, and any one sentence contains only a handful of the 2-grams.

In [26]:
sentences.describe()

Unnamed: 0,able bear,able persuade,absence home,absolute necessity,absolutely hopeless,accident lyme,accidentally hear,accommodation man,account louisa,account small,acquaint captain,acquaintance ask,acquaintance cease,acquaintance time,acquaintance visit,active service,add certainly,add dormouse,add explanation,add gryphon,add look,admiral baldwin,admiral croft,admiral mrs,admiral think,admire exceedingly,admit doubt,advance twice,advantage see,advice lady,affection confidence,afraid mention,afraid offend,afraid sir,agree captain,agree good,agree have,agree say,agreeable man,agreeable manner,...,wonder happen,wonder shall,word come,word drink,word explain,word listen,word look,word say,word scarcely,world know,world round,worth attend,worth have,write bath,write letter,write slate,writing desk,year ago,year anne,year go,year half,year monkford,year old,year pass,year school,year year,yer honour,yes mr,yes say,yes yes,young child,young fellow,young friend,young lady,young man,young people,young person,young sister,young woman,youth say
count,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,...,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0,5848.0
mean,0.00018,0.000342,0.000193,0.000376,0.000294,0.000246,0.000129,0.000168,0.000208,0.000189,0.000225,0.000204,0.000258,0.000244,0.000186,0.000235,0.00022,0.000342,0.000342,0.000169,0.0002,0.000532,0.001319,0.000526,0.000262,0.000292,0.000493,0.000265,0.000152,0.000176,0.000173,0.000199,0.000271,0.000209,0.0002,0.000221,0.000197,0.000276,0.000536,0.000313,...,0.000392,0.000342,0.000239,0.000217,0.000266,0.000264,0.000356,0.000352,0.000292,0.000248,0.000182,0.000124,0.000174,0.000194,0.000187,0.000363,0.000196,0.000507,0.000151,0.000187,0.000294,0.00011,0.000136,0.000296,0.000158,0.000294,0.000542,0.000231,0.000763,0.000872,0.0002,0.00031,0.000353,0.001603,0.00243,0.000988,0.00022,0.000141,0.000828,0.00027
std,0.009833,0.018492,0.010699,0.017218,0.016105,0.014288,0.007002,0.009129,0.009186,0.010269,0.012612,0.011217,0.01397,0.01423,0.010714,0.012724,0.011936,0.018492,0.018492,0.009323,0.01157,0.021163,0.029451,0.017025,0.014829,0.016014,0.021785,0.014937,0.008473,0.009537,0.009546,0.011046,0.015144,0.011325,0.010933,0.01199,0.011387,0.015348,0.021415,0.014054,...,0.017732,0.018492,0.010676,0.013549,0.014945,0.014901,0.01591,0.013659,0.016014,0.011048,0.009979,0.006789,0.00981,0.010859,0.010121,0.016737,0.011068,0.016256,0.008187,0.010527,0.016105,0.005975,0.007446,0.013402,0.008842,0.016105,0.021508,0.013072,0.020919,0.027771,0.011049,0.013819,0.01587,0.034154,0.039126,0.026827,0.011936,0.007809,0.023224,0.015098
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.592882,1.0,0.693171,1.0,1.0,1.0,0.396804,0.536117,0.425435,0.610022,0.8343,0.707107,0.790181,1.0,0.742853,0.707107,0.707107,1.0,1.0,0.592882,0.808585,1.0,1.0,0.793038,1.0,1.0,1.0,1.0,0.558952,0.555756,0.605278,0.719065,1.0,0.665699,0.663859,0.707107,0.795213,1.0,1.0,0.755399,...,1.0,1.0,0.570672,1.0,1.0,1.0,0.807081,0.585377,1.0,0.570672,0.614563,0.418749,0.658861,0.719065,0.558746,1.0,0.742853,0.64064,0.480775,0.707107,1.0,0.343383,0.455859,0.694943,0.583805,1.0,1.0,0.883101,0.75276,1.0,0.707107,0.716689,0.86872,1.0,1.0,1.0,0.707107,0.5,1.0,1.0


Now, why would we TF-IDF vectorize a real dataset without building some models? We wouldn't. So let's build some. We will see how well we can classify each sentence by author based on the TF-IDF vectorization of the 2-grams. We will look at Logistic Regression, Random Forest Classifier, and Gradient Boosting Classifier.

In [30]:
# Defining X and Y

Y = sentences['author']
X = np.array(sentences.drop(['text', 'author'], axis=1))

# Train, Test, Split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4,
                                                    random_state=123)

# Initializing  and fitting Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

# Results
print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))



----------------------Logistic Regression Scores----------------------
Training set score: 0.8135689851767389

Test set score: 0.7666666666666667
----------------------Random Forest Scores----------------------
Training set score: 0.8554732041049031

Test set score: 0.794017094017094
----------------------Gradient Boosting Scores----------------------
Training set score: 0.749429874572406

Test set score: 0.7525641025641026


Our models are not the best, but this is just a first attempt. Using only 2-grams, the Logistic Regression and Random Forest models score roughly 5 to 6 percentage points higher on the training set than on the test set, which suggests some overfitting. What if we looked at both 1-grams and 2-grams? Let's find out.

In [33]:
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm='l2',
    smooth_idf=True,
    ngram_range=(1, 2))  # Now we change to (1, 2) to include 1-grams as well!


# Applying the vectorizer
X = vectorizer.fit_transform(sentences["text"])

tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

# A TF-IDF score of 0 means the word or 2-gram does not occur in that sentence.
# With smooth_idf=True the IDF is ln((1 + n) / (1 + df)) + 1, which is never 0.
sentences.head()

Unnamed: 0,abide,ability,able,able bear,able persuade,abominate,abroad,absence,absence home,absent,absolute,absolute necessity,absolutely,absolutely hopeless,absurd,abuse,accept,acceptable,acceptance,accession,accident,accident lyme,accidentally,accidentally hear,accommodate,accommodation,accommodation man,accompany,accomplish,accomplishment,accord,accordingly,account,account louisa,account small,accuse,acknowledge,acknowledgement,acquaint,acquaint captain,...,wrong,wrought,yard,yarmouth,yawn,ye,year,year ago,year anne,year go,year half,year monkford,year old,year pass,year school,year year,yer,yer honour,yes,yes mr,yes say,yes yes,yesterday,yield,young,young child,young fellow,young friend,young lady,young man,young people,young person,young sister,young woman,youth,youth say,zeal,zealous,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,shall late,Carroll


As expected, we now see columns for both single words and 2-grams. Does this model perform better?

In [32]:
# Defining X and Y

Y = sentences['author']
X = np.array(sentences.drop(['text', 'author'], axis=1))

# Train, Test, Split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4,
                                                    random_state=123)

# Initializing  and fitting Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

# Results
print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))


----------------------Logistic Regression Scores----------------------
Training set score: 0.9036488027366021

Test set score: 0.8555555555555555
----------------------Random Forest Scores----------------------
Training set score: 0.9694982896237172

Test set score: 0.8444444444444444
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8244013683010262

Test set score: 0.8106837606837607


Overall, all of our models perform better now: every test score improved. However, the Random Forest model overfits much more than before, with a training accuracy of about 0.97 against a test accuracy of about 0.84. Hopefully these examples have helped you better understand how to convert text to numerical features that you can use in machine learning models.
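
One simple way to put a number on the overfitting is the gap between training and test accuracy. The sketch below only does arithmetic on the scores printed above; the `scores` and `gaps` names are illustrative, and a single train/test split is a rough guide at best, so treat the gaps as indicative rather than conclusive.

```python
# (training accuracy, test accuracy) copied from the outputs above.
scores = {
    "2-grams only": {
        "Logistic Regression": (0.8135689851767389, 0.7666666666666667),
        "Random Forest": (0.8554732041049031, 0.794017094017094),
        "Gradient Boosting": (0.749429874572406, 0.7525641025641026),
    },
    "1-grams and 2-grams": {
        "Logistic Regression": (0.9036488027366021, 0.8555555555555555),
        "Random Forest": (0.9694982896237172, 0.8444444444444444),
        "Gradient Boosting": (0.8244013683010262, 0.8106837606837607),
    },
}

# A large positive gap means the model fits the training data
# noticeably better than it generalizes to unseen sentences.
gaps = {
    features: {name: train - test for name, (train, test) in models.items()}
    for features, models in scores.items()
}

for features, model_gaps in gaps.items():
    print(features)
    for name, gap in model_gaps.items():
        print(f"  {name:<20} train-test gap: {gap:+.3f}")
```

With both 1-grams and 2-grams, the Random Forest gap roughly doubles, while Gradient Boosting stays close to zero in both runs.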