You want a bag of words, but with words weighted by their importance to an
observation.

Compare the frequency of the word in a document (a tweet, movie review,
speech transcript, etc.) with the frequency of the word in all other documents
using term frequency-inverse document frequency (tf-idf). scikit-learn makes
this easy with TfidfVectorizer:

In [4]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
# Show tf-idf feature matrix
feature_matrix

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [5]:
# Show tf-idf feature matrix as dense matrix
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

In [6]:
# Show each word and the index of its column in the feature matrix
tfidf.vocabulary_

{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'germany': 4,
 'beats': 0,
 'both': 2}
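
Because vocabulary_ maps each word to its column index, sorting the words by that index gives labels for the columns of the dense matrix. The following is a minimal sketch that reuses the example text; the names feature_names, first_document, and weights are ours, not part of scikit-learn:

```python
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

# Order the words by their column index
feature_names = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

# Pair each nonzero weight in the first document with its word
first_document = feature_matrix.toarray()[0]
weights = {word: weight
           for word, weight in zip(feature_names, first_document)
           if weight > 0}
print(weights)
```

In the first document, brazil appears twice and love once, so brazil receives the larger weight. The single-character word I is missing because the default tokenizer only keeps tokens of two or more characters.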

The more often a word appears in a document, the more likely it is to be
important to that document. For example, if the word economy appears
frequently, it is evidence that the document might be about economics. We call
this term frequency (tf).

In contrast, if a word appears in many documents, it is likely less important to
any individual document. For example, if every document in some text data
contains the word after, then it is probably an unimportant word. We call this
document frequency (df).

By combining these two statistics, we can assign a score to every word
representing how important that word is in a document. Specifically, we multiply
tf by the inverse of document frequency (idf):

![tf-idf(t, d) = tf(t, d) × idf(t)](./frequency.jpg)
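
As a rough check on this idea, the scores can be recomputed by hand. The sketch below assumes scikit-learn's default settings, which to our understanding use raw counts for tf, a smoothed idf of ln((1 + n) / (1 + df)) + 1 where n is the number of documents, and L2 normalization of each row. Other tf-idf variants would give different numbers. The names counts, document_frequency, and tfidf_manual are ours:

```python
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

# Term frequency: raw count of each word in each document
counts = CountVectorizer().fit_transform(text_data).toarray()

# Document frequency: number of documents that contain each word
n_documents = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# Multiply tf by idf, then scale each row to unit L2 norm
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Compare against TfidfVectorizer
tfidf_sklearn = TfidfVectorizer().fit_transform(text_data).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

In this tiny corpus every word appears in exactly one document, so all words share the same idf and the differences between weights come from term frequency alone.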