# Bag of Words and TF-IDF
With our data cleaned and lemmatized, we can now start analysing it. First, we'll create a Bag of Words (BoW) for each summary. Then we'll run a Term Frequency - Inverse Document Frequency (TF-IDF) analysis on the full set to examine differences in word use.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re
from string import punctuation
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer

from IPython.display import clear_output
from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
path_to_data = "Data/cleaned_summaries_and_genres.csv"
df = pd.read_csv(path_to_data, index_col=0)
df

Unnamed: 0,summary,genre
0,old major old boar manor farm call animal farm...,Children's literature
1,old major old boar manor farm call animal farm...,Speculative fiction
2,old major old boar manor farm call animal farm...,Fiction
3,alex teenager live nearfuture england lead gan...,Science Fiction
4,alex teenager live nearfuture england lead gan...,Speculative fiction
...,...,...
26536,series follow character nick stone exmilitary ...,Fiction
26537,series follow character nick stone exmilitary ...,Suspense
26538,reader first meet rapp covert operation iran d...,Thriller
26539,reader first meet rapp covert operation iran d...,Fiction


### Term Frequency with CountVectorizer
Using sklearn's CountVectorizer, we'll generate a sparse document-term matrix of word counts from our summaries: one row per summary and one column per word in the vocabulary. This may take a while on the full set, so we'll time it, too.

In [13]:
start = datetime.now()
vectorizer = CountVectorizer()

# fit the count vectorizer
vectorizer.fit(df['summary'])

# transform our set
vector = vectorizer.transform(df['summary'])

print('Complete. Total time elapsed: {} seconds.'.format((datetime.now()-start).total_seconds()))

Complete. Total time elapsed: 9.879003 seconds.


In [18]:
vector.shape

(26541, 116360)

### Inverse Document Frequency
I also just realized that we could skip CountVectorizer entirely and use sklearn's TfidfVectorizer, which counts terms and applies the TF-IDF weighting in a single step.

In [20]:
start = datetime.now()
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df['summary'])
print('Complete. Total time elapsed: {} seconds.'.format((datetime.now()-start).total_seconds()))

Complete. Total time elapsed: 5.439611 seconds.


In [21]:
tfidf.shape

(26541, 116360)
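
As a sanity check, the sketch below uses a made-up toy corpus (not our data) to show that, with default settings, TfidfVectorizer gives the same result as CountVectorizer followed by TfidfTransformer. It also checks the default smoothed IDF, which scikit-learn documents as `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, standing in for df['summary']
toy_docs = ["old boar call animal", "animal farm animal", "boar farm"]

# Two-step route: raw counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
toy_tfidf_vectorizer = TfidfVectorizer()
one_step = toy_tfidf_vectorizer.fit_transform(toy_docs)

# Both routes should agree under default parameters
same = np.allclose(two_step.toarray(), one_step.toarray())

# 'animal' appears in 2 of the 3 documents, so df = 2 and n = 3
animal_col = toy_tfidf_vectorizer.vocabulary_["animal"]
animal_idf = toy_tfidf_vectorizer.idf_[animal_col]
expected_idf = np.log((1 + 3) / (1 + 2)) + 1

# With the default norm='l2', every non-empty row has unit length
row_norms = np.linalg.norm(one_step.toarray(), axis=1)

print(same, np.isclose(animal_idf, expected_idf), row_norms)
```

This is also why the shape above matches the CountVectorizer output exactly: the vocabulary is built the same way, and only the cell values change from raw counts to weights.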

In [24]:
# for each summary, list the terms that have a non-zero TF-IDF weight
tfidf_vectorizer.inverse_transform(tfidf)

[array(['propaganda', 'malleable', 'dogma', 'political', 'simply',
        'demonstrate', 'orwell', 'revision', 'habit', 'evil', 'prevent',
        'together', 'uniting', 'within', 'keep', 'suppose', 'purpose',
        'twist', 'ironic', 'maxim', 'eventually', 'cause', 'without',
        'bolded', 'follow', 'add', 'sheet', 'append', 'excess',
        'lawbreaking', 'accusation', 'clear', 'secretly', 'later', 'shall',
        'friend', 'wing', 'legs', 'four', 'enemy', 'leg', 'upon', 'go',
        'whatever', 'original', 'society', 'belief', 'people', 'exercise',
        'order', 'revise', 'government', 'soviet', 'allusion',
        'humanisation', 'account', 'alter', 'employ', 'trading', 'bed',
        'sleep', 'alcohol', 'drink', 'vice', 'indulge', 'soon', 'formally',
        'actual', 'ideas', 'adapt', 'difference', 'tell', 'like', 'look',
        'realise', 'pilkington', 'break', 'argument', 'match', 'poker',
        'face', 'conversation', 'overhear', 'revolution', 'related',
      