## Inspired by this amazing three-part blog series by Christian S. Perone


[Machine Learning :: Text feature extraction (tf-idf) – Part I](http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/)

[Machine Learning :: Text feature extraction (tf-idf) – Part II](http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/)

[Machine Learning :: Cosine Similarity for Vector Space Models (Part III)](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/)

## 1. Prepare train/test documents and initialise vectoriser

In [1]:
train_set = ['The sky is blue.', 'The sun is bright.']
test_set = ('The sun in the sky is bright.', 'We can see the shining sun, the bright sun.')

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

print(vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


## 2. Create a vocabulary index from the training documents

In [3]:
vectorizer.fit_transform(train_set)

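# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(vectorizer.get_feature_names_out().tolist())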
print(vectorizer.vocabulary_)

['blue', 'bright', 'is', 'sky', 'sun', 'the']
{'the': 5, 'sky': 3, 'is': 2, 'blue': 0, 'sun': 4, 'bright': 1}
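The index above can be reproduced by hand. The following is a minimal sketch, assuming the default settings printed in step 1 (lowercasing and the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters). It illustrates the idea and is not scikit-learn's actual implementation.

```python
import re

train_set = ['The sky is blue.', 'The sun is bright.']

# Same regular expression as the default token_pattern shown in step 1.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Collect the distinct lowercase tokens across all training documents.
tokens = set()
for doc in train_set:
    tokens.update(token_pattern.findall(doc.lower()))

# Assign column indices in alphabetical order of the terms.
vocabulary = {term: index for index, term in enumerate(sorted(tokens))}
print(vocabulary)
```

The resulting mapping has the same term-to-index pairs as `vectorizer.vocabulary_`; only the order in which the dictionary is printed differs.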


## 3. Create a sparse matrix (term-frequency, or tf, matrix) of the test documents

In [4]:
freq_term_matrix = vectorizer.transform(test_set)

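The resulting matrix is a SciPy sparse matrix in compressed sparse row (CSR) format. Printing it lists each non-zero entry as a (row, column) coordinate followed by its count.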

In [5]:
print(type(freq_term_matrix))
print(freq_term_matrix)

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	2
  (1, 1)	1
  (1, 4)	2
  (1, 5)	2


We can convert it to a dense matrix.

In [6]:
dense_matrix = freq_term_matrix.todense()

print(type(dense_matrix))
print(dense_matrix)

<class 'numpy.matrixlib.defmatrix.matrix'>
[[0 1 1 1 1 2]
 [0 1 0 0 2 2]]
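Note that test-set words missing from the training vocabulary ("in", "we", "can", "see", "shining") are simply ignored, and "blue" gets a count of zero in both rows. A rough sketch of the same counting in plain Python, assuming the vocabulary from step 2 and the default token pattern:

```python
import re
from collections import Counter

vocabulary = {'blue': 0, 'bright': 1, 'is': 2, 'sky': 3, 'sun': 4, 'the': 5}
test_set = ('The sun in the sky is bright.', 'We can see the shining sun, the bright sun.')
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Terms ordered by their column index in the vocabulary.
terms_by_column = sorted(vocabulary, key=vocabulary.get)

rows = []
for doc in test_set:
    counts = Counter(token_pattern.findall(doc.lower()))
    # Only vocabulary terms get a column; unseen tokens are dropped.
    rows.append([counts[term] for term in terms_by_column])

print(rows)
```

The two rows match the dense matrix printed above.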


## 4. Calculate tf-idf weights for the tf matrix

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()

tfidf.fit(freq_term_matrix)

print('IDF: ', tfidf.idf_)

IDF:  [2.09861229 1.         1.40546511 1.40546511 1.         1.        ]
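These values are consistent with the smoothed IDF that `TfidfTransformer` uses by default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A small sketch that recomputes them from the tf matrix of step 3, using only the standard library:

```python
import math

# Term-frequency matrix of the two test documents (from step 3).
tf_matrix = [[0, 1, 1, 1, 1, 2],
             [0, 1, 0, 0, 2, 2]]

n_docs = len(tf_matrix)
n_terms = len(tf_matrix[0])

# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in tf_matrix if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

print([round(value, 8) for value in idf])
```

"blue" never occurs in the test documents (df = 0), so it receives the largest weight, while terms present in both documents get the minimum weight of 1.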


In [8]:
tf_idf_matrix = tfidf.transform(freq_term_matrix)

print(tf_idf_matrix.todense())

[[0.         0.31701073 0.44554752 0.44554752 0.31701073 0.63402146]
 [0.         0.33333333 0.         0.         0.66666667 0.66666667]]
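Each row is the term counts multiplied by the IDF weights and then scaled to unit Euclidean length, which is the default `norm='l2'` behaviour of `TfidfTransformer`. A minimal sketch of that calculation, assuming the smoothed IDF values from the previous cell:

```python
import math

tf_matrix = [[0, 1, 1, 1, 1, 2],
             [0, 1, 0, 0, 2, 2]]

# Smoothed IDF weights for n = 2 documents (see the IDF output above).
idf = [math.log(3) + 1, 1.0, math.log(1.5) + 1, math.log(1.5) + 1, 1.0, 1.0]

tf_idf_rows = []
for row in tf_matrix:
    # Weight each term count by its IDF.
    weighted = [tf * weight for tf, weight in zip(row, idf)]
    # L2-normalise so that every document vector has length 1.
    norm = math.sqrt(sum(value * value for value in weighted))
    tf_idf_rows.append([value / norm for value in weighted])

for row in tf_idf_rows:
    print([round(value, 8) for value in row])
```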


## 5. Calculate cosine similarity between documents

In [9]:
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
    "The pig is beautiful",
    "I am from Japan"
)

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectoriser = TfidfVectorizer()

tfidf_matrix = tfidf_vectoriser.fit_transform(documents)

print(tfidf_matrix.shape)

(6, 16)


Calculate the cosine similarity of the first sentence against every document in the set, including itself.

Score:

    1 - identical documents (vectors point in the same direction)
    0 - no similarity between documents (no shared terms)

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

array([[1.        , 0.32395756, 0.51017391, 0.12721294, 0.2512855 ,
        0.        ]])
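Cosine similarity is the dot product of two vectors divided by the product of their lengths. Because the tf-idf rows are already L2-normalised, it reduces to a plain dot product here. The sketch below is a standard-library illustration rather than scikit-learn's implementation; the `cosine` helper is a hypothetical name, and it is applied to the two tf-idf rows printed in step 4, not to the six-document matrix above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Choice made in this sketch: treat an all-zero vector as dissimilar.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# tf-idf rows of the two test documents from step 4.
doc_a = [0.0, 0.31701073, 0.44554752, 0.44554752, 0.31701073, 0.63402146]
doc_b = [0.0, 0.33333333, 0.0, 0.0, 0.66666667, 0.66666667]

print(cosine(doc_a, doc_a))            # a document compared with itself
print(cosine(doc_a, doc_b))            # two documents sharing some terms
print(cosine([1.0, 0.0], [0.0, 1.0]))  # no shared terms
```

This also explains the final 0 in the output above: "I am from Japan" shares no vocabulary terms with "The sky is blue".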