# Keyword extraction with `TfidfVectorizer`

Scikit-learn's `CountVectorizer` class creates matrices of word counts and is frequently used in text-classification tasks. The related [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class creates matrices of [term frequency-inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (Tf-Idf) values that reflect not just the presence of individual words, but each word's importance. One use for `TfidfVectorizer` is extracting keywords from documents. Let's use it to extract keywords from a book chapter on machine learning. Begin by loading the chapter from a text file and showing the first few paragraphs.

In [1]:
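import pandas as pd

# Read one paragraph per line, skipping blank lines
# (recent pandas versions reject sep='\n' in read_csv)
with open('Data/chapter-1.txt') as f:
    paragraphs = [line.strip() for line in f if line.strip()]

df = pd.DataFrame(paragraphs)
pd.set_option('display.max_colwidth', None)
df.head()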

Unnamed: 0,0
0,"Software developers are accustomed to solving problems algorithmically. Given a recipe, or algorithm, it's not difficult to write an app that hashes a password or computes a monthly mortgage payment. You code up the algorithm, feed it input, and receive output in return. It's another proposition altogether to write code that determines whether a photo contains a cat or a dog. You can try to do it algorithmically, but the minute you get it working, I'll send you a cat or dog picture that breaks the algorithm."
1,"Machine learning takes a different approach to turning input into output. Rather than rely on you to implement an algorithm, it examines a dataset consisting of inputs and outputs and learns how to generate output of its own. Under the hood, special algorithms called learning algorithms build mathematical models of the data and codify the relationship between data going in and data coming out. Once trained in this manner, a model can accept new inputs and generate outputs consistent with the ones in the training data."
2,"To use machine learning to distinguish between cats and dogs, you don't code a cat-vs-dog algorithm. Instead, you train a machine-learning model with cat and dog photos. Success depends on the learning algorithm used and the quality and volume of the training data. Part of becoming a machine-learning engineer is familiarizing yourself with the various learning algorithms and developing an intuition for when to use one versus another. That intuition begins with an examination of machine learning itself."
3,What is Machine Learning?
4,"At an existential level, machine learning (ML) is a means for finding patterns in numbers and exploiting those patterns to make predictions. Train a model with thousands (or millions) of xs and ys, and let it learn from the data so that given a new x, it can predict what y will be. Learning is the process by which ML finds patterns that can be used to predict future outputs, and it's where the 'learning' in 'machine learning' comes from."


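Vectorize the paragraphs and show the first 10 rows of the resulting word matrix. The vectorizer below extracts two-word phrases (bigrams), removes English stop words, and ignores bigrams that appear in fewer than 2% or more than 50% of the paragraphs.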

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(2, 2), min_df=0.02, max_df=0.5, stop_words='english')
word_matrix = vectorizer.fit_transform(df[0])

feature_names = vectorizer.get_feature_names_out()
wm_df = pd.DataFrame(data=word_matrix.toarray(), columns=feature_names)
wm_df.head(10)

Unnamed: 0,000 000,000 columns,000 rows,10 years,120 rows,1s 0s,20 data,80 20,accurate model,add column,...,use scikit,used regression,using nearest,versicolor virginica,want predict,web site,width cm,world datasets,xs ys,years experience
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.420631,0.0
5,0.0,0.0,0.0,0.0,0.0,0.668611,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32221,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.329341,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.553819,0.0,0.553819,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.484482,0.0,0.0,0.0,0.0


Convert the sparse word matrix into coordinate (COO) format, which lists only the non-zero values (weights) along with the rows and columns in which they appear.

In [3]:
coo_matrix = word_matrix.tocoo()
print(coo_matrix)

  (0, 50)	0.36506424222075046
  (0, 19)	0.6833098830012316
  (0, 36)	0.36506424222075046
  (0, 120)	0.36506424222075046
  (0, 157)	0.36506424222075046
  (1, 171)	0.294834738214157
  (1, 92)	0.37713908497127435
  (1, 16)	0.37713908497127435
  (1, 12)	0.4029796661571696
  (1, 78)	0.31487825015322757
  (1, 70)	0.4029796661571696
  (1, 69)	0.4029796661571696
  (1, 89)	0.20561748925677317
  (2, 77)	0.25926713814851615
  (2, 81)	0.25055595725128127
  (2, 170)	0.2805439761927484
  (2, 20)	0.3318088331784477
  (2, 171)	0.2427635404044369
  (2, 78)	0.25926713814851615
  (2, 89)	0.6772123252964592
  (2, 19)	0.31053199513421553
  (3, 89)	1.0
  (4, 80)	0.42063102930041807
  (4, 185)	0.42063102930041807
  (4, 132)	0.42063102930041807
  :	:
  (94, 143)	0.20842628503675406
  (94, 38)	0.19506120289470322
  (94, 76)	0.3524483552000396
  (94, 164)	0.296128858229556
  (94, 101)	0.20842628503675406
  (94, 174)	0.14402207016328555
  (94, 82)	0.47216145691600925
  (94, 25)	0.1846944397962372
  (94, 90)	0.16
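
If the coordinate format is unfamiliar, the sketch below builds a tiny sparse matrix by hand (the values are made up for illustration) and walks through the `row`, `col`, and `data` arrays that a SciPy COO matrix exposes. Each position in the three arrays describes one non-zero entry.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small dense matrix with three non-zero entries (hypothetical values)
dense = np.array([
    [0.0, 0.5, 0.0],
    [0.7, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

sparse = csr_matrix(dense)   # same storage family that fit_transform returns
coo = sparse.tocoo()         # convert to coordinate format

# row[i], col[i], and data[i] together describe the i-th non-zero value
for row, col, weight in zip(coo.row, coo.col, coo.data):
    print(f'({row}, {col}) -> {weight}')
```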

Create tuples from the column numbers and weights in the coordinate matrix. Then sort the tuples in descending order based on the weights.

In [4]:
tuples = list(zip(coo_matrix.col, coo_matrix.data))
sorted_tuples = sorted(tuples, key=lambda x: x[1], reverse=True)

# Unpack each (column, weight) pair instead of shadowing the built-in `tuple`
for col, weight in sorted_tuples:
    print(f'({col}, {weight}) => {feature_names[col]}')

(89, 1.0) => machine learning
(109, 1.0) => neural networks
(174, 1.0) => unsupervised learning
(87, 1.0) => look like
(44, 1.0) => data points
(40, 1.0) => customer ids
(51, 1.0) => elbow distinct
(15, 1.0) => average age
(164, 1.0) => supervised learning
(107, 1.0) => nearest neighbors
(107, 1.0) => nearest neighbors
(71, 1.0) => iris dataset
(91, 1.0) => making predictions
(114, 1.0) => number neighbors
(37, 0.921075466800783) => coordinate pairs
(169, 0.8907478931991815) => today fact
(28, 0.816496580927726) => cluster centroids
(39, 0.8152319881530452) => customer data
(65, 0.8024943623495188) => image classification
(0, 0.7948939519153183) => 000 000
(186, 0.7906850776766574) => years experience
(107, 0.7850172237596516) => nearest neighbors
(109, 0.7811345963491914) => neural networks
(107, 0.7658695380497491) => nearest neighbors
(116, 0.7484307798581197) => open source
(7, 0.7241459491909964) => 80 20
(93, 0.7168242079334923) => means clustering
(93, 0.7168242079334923) => means clustering
...

Show the top keywords by weight. Because the same bigram can score highly in more than one paragraph, use `set` to eliminate duplicates.

In [5]:
keywords = []
num_keywords = 5

# Look up the bigram for each of the highest-weighted (column, weight) pairs
for col, _ in sorted_tuples[:num_keywords]:
    keywords.append(feature_names[col])

print(set(keywords))

{'data points', 'unsupervised learning', 'neural networks', 'machine learning', 'look like'}
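
One caveat: a `set` is unordered, and slicing before deduplicating can return fewer than `num_keywords` keywords when the top entries repeat (as "nearest neighbors" does further down the sorted list). A possible alternative, sketched here with made-up pairs and placeholder names, is to deduplicate first with `dict.fromkeys`, which preserves the ranking, and then slice.

```python
# Hypothetical (column, weight) pairs, already sorted by weight in descending order
sorted_pairs = [(3, 1.0), (1, 1.0), (3, 0.9), (0, 0.8), (1, 0.7), (2, 0.6)]

# Hypothetical feature names, indexed by column number
names = ['alpha beta', 'gamma delta', 'epsilon zeta', 'eta theta']
num_keywords = 3

# dict.fromkeys keeps only the first occurrence of each key, in insertion order
unique_names = list(dict.fromkeys(names[col] for col, _ in sorted_pairs))

# Slice after deduplicating so that exactly num_keywords keywords are returned
top_keywords = unique_names[:num_keywords]
print(top_keywords)
```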


Keyword extraction sometimes works better when you sum all the values for a given word and select the words yielding the highest sums rather than the words with the highest individual values. Sort keywords based on that criterion:

In [6]:
# Sum each column's Tf-Idf weights across all paragraphs.
# (DataFrame.iteritems and Series.append were removed in pandas 2.0.)
summed_weights = wm_df.sum()

sorted_summed_weights = summed_weights.sort_values(ascending=False)
print(sorted_summed_weights)

machine learning         8.552673
nearest neighbors        5.397634
unsupervised learning    4.608019
means clustering         4.250773
supervised learning      3.597628
                           ...   
class category           0.401409
models fall              0.393562
model predict            0.391588
models purpose           0.387821
models trained           0.383859
Length: 187, dtype: float64


Show the top keywords by summed weights:

In [7]:
# The Series index holds the bigrams, already sorted by summed weight
keywords = sorted_summed_weights.index[:num_keywords].tolist()
print(keywords)

['machine learning', 'nearest neighbors', 'unsupervised learning', 'means clustering', 'supervised learning']


If you read **chapter-1.txt**, you'll see that these keywords highlight some of the most important concepts introduced in the chapter.