# 2. Computing TF-IDF Vectors with Scikit-Learn

## About dataset source

Scikit-Learn provides us with data from Usenet, which is a well-established online collection of discussion forums. These Usenet forums are called newsgroups. Each individual newsgroup focuses on some topic of discussion. That discussion topic is briefly outlined within the newsgroup name. Users within a newsgroup converse by posting messages. These user posts are generally not limited in length. Some of the posts can get quite large. Both the diversity and the varying lengths of the posts will give us a chance to expand our NLP skills.

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(remove=('headers','footers'))

The newsgroups object contains posts from 20 different newsgroups. Each newsgroup’s discussion-theme is outlined in its name. We can view these newsgroup names by printing newsgroups.target_names.

In [2]:
print(newsgroups.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


For example, comp.sys.mac.hardware focuses on Mac hardware. Meanwhile, comp.sys.ibm.pc.hardware focuses on PC hardware. Categorically, these 2 newsgroups are exceedingly similar. Their only differentiator is whether the computer hardware belongs to a Mac or a PC. Sometimes, categorical differences are subtle. Boundaries between text topics are fluid, and are not necessarily etched in stone. We’ll need to keep this in mind, later in the section, when we proceed to cluster the newsgroup posts.

### Get the newsgroup of a post

In [3]:
index = 1
print(newsgroups.data[index])
# Look up the name of the newsgroup that this post belongs to
origin = newsgroups.target_names[newsgroups.target[index]]
print(f"\nThe post at index {index} first appeared in the '{origin}' group")

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

The post at index 1 first appeared in the 'comp.sys.mac.hardware' group


### Get newsgroup data size

In [4]:
dataset_size = len(newsgroups.data)
print(f"Our dataset contains {dataset_size} newsgroup posts")

Our dataset contains 11314 newsgroup posts


Our dataset contains over 11,000 posts. **Our goal is to cluster these posts by topic.**
Carrying out text clustering on this scale will require computational efficiency. We’ll need to efficiently compute newsgroup-post similarities by representing our text-data as a matrix. To do so, we’ll need to transform each newsgroup post into a TF vector.

## Vectorizing Documents Using Scikit-Learn

Scikit-Learn provides a built-in class for transforming input texts into TF vectors. That class is called CountVectorizer. Initializing CountVectorizer will create a vectorizer object capable of vectorizing our texts.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

All we need to do is run vectorizer.fit_transform(newsgroups.data). The method-call will return the TF matrix corresponding to the vectorized newsgroup posts. As a reminder, a TF matrix stores the counts of words (columns) across all texts (rows).

##### Remember 

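TF vector - a term-frequency vector stores the count of each vocabulary word within a single text. Stacking these vectors gives the TF matrix, in which columns are words and rows are texts.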
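To make the shape of a TF matrix concrete, here is a minimal sketch on two made-up sentences (they are not part of the newsgroup data). The names toy_texts, toy_vectorizer, and toy_words are hypothetical and used only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences; they are NOT taken from the newsgroup dataset.
toy_texts = ["the car is red", "the bike is blue and the car is fast"]

toy_vectorizer = CountVectorizer()
toy_tf_matrix = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = toy_vectorizer.vocabulary_
toy_words = sorted(vocab, key=vocab.get)

print(toy_words)
print(toy_tf_matrix.toarray())  # one row per sentence, one column per word
```

Each row holds the word counts of one sentence, so the word the, which appears twice in the second sentence, gets a count of 2 in the second row.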

In [6]:
tf_matrix = vectorizer.fit_transform(newsgroups.data)
print(tf_matrix)

  (0, 108644)	4
  (0, 110106)	1
  (0, 57577)	2
  (0, 24398)	2
  (0, 79534)	1
  (0, 100942)	1
  (0, 37154)	1
  (0, 45141)	1
  (0, 70570)	1
  (0, 78701)	2
  (0, 101084)	4
  (0, 32499)	4
  (0, 92157)	1
  (0, 100827)	6
  (0, 79461)	1
  (0, 39275)	1
  (0, 60326)	2
  (0, 42332)	1
  (0, 96432)	1
  (0, 67137)	1
  (0, 101732)	1
  (0, 27703)	1
  (0, 49871)	2
  (0, 65338)	1
  (0, 14106)	1
  :	:
  (11313, 55901)	1
  (11313, 93448)	1
  (11313, 97535)	1
  (11313, 93393)	1
  (11313, 109366)	1
  (11313, 102215)	1
  (11313, 29148)	1
  (11313, 26901)	1
  (11313, 94401)	1
  (11313, 89686)	1
  (11313, 80827)	1
  (11313, 72219)	1
  (11313, 32984)	1
  (11313, 82912)	1
  (11313, 99934)	1
  (11313, 96505)	1
  (11313, 72102)	1
  (11313, 32981)	1
  (11313, 82692)	1
  (11313, 101854)	1
  (11313, 66399)	1
  (11313, 63405)	1
  (11313, 61366)	1
  (11313, 7462)	1
  (11313, 109600)	1


**Our printed tf_matrix does not appear to be a NumPy array.** What sort of data structure is it? We can check, by printing type(tf_matrix).

In [7]:
print(type(tf_matrix))

<class 'scipy.sparse.csr.csr_matrix'>


### About csr matrix 

CSR stands for Compressed Sparse Row, which is a storage format for compressing matrices that are composed mostly of zeros. **These mostly empty matrices are referred to as sparse matrices. They can be made smaller by storing only the non-zero elements. This compression leads to more efficient memory usage, and also faster computation. Large-scale text-based matrices are usually very sparse, since a single document normally contains just a small percentage of the total vocabulary**. Thus, Scikit-Learn automatically converts the vectorized text to the CSR format. The conversion is carried out using a csr_matrix class that’s imported from SciPy.
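The following sketch shows what "storing only the non-zero elements" looks like in practice. It builds a csr_matrix from a tiny made-up array and prints the three internal arrays that SciPy keeps for it.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A toy matrix that is mostly zeros (made up for illustration).
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])
sparse = csr_matrix(dense)

print(sparse.data)     # the stored non-zero values, row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]
print(sparse.nnz)      # the number of stored values
```

Only 2 of the 9 elements are stored. The third row is empty, which is why the last two entries of indptr are equal.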

In order to minimize confusion between csr_matrix and NumPy, we’ll convert the matrix into a NumPy array.

In [8]:
tf_np_matrix = tf_matrix.toarray()
print(tf_np_matrix)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


* Rows - represent individual posts
* Cols - represent individual words
* Vocab_size = number of cols

In [9]:
assert tf_np_matrix.shape == tf_matrix.shape
num_posts, vocabulary_size = tf_np_matrix.shape
print(f"Our collection of {num_posts} newsgroup posts contain a total of "
      f"{vocabulary_size} unique words")

Our collection of 11314 newsgroup posts contain a total of 114751 unique words


Our data contains over 114,000 unique words. However, most posts will hold only a few dozen of these words. **We can measure the unique word-count of a post at index i by counting the number of non-zero elements in row tf_np_matrix[i].** The easiest way to count these non-zero elements is with NumPy. The library allows us to obtain all non-zero indices of the vector at tf_np_matrix[i]. We simply need to input the vector into the np.flatnonzero function. Below, we’ll count and output the non-zero indices of the car post in newsgroups.data[0].
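A dense array stores every zero explicitly, which can take a lot of memory for a matrix of this shape. As an aside, the same per-post counts can be read straight from the CSR structure. The sketch below uses a made-up toy matrix (toy_counts is a hypothetical name) and assumes the matrix holds no explicitly stored zeros, which is the case when it is built from a dense array as shown.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy count matrix: 3 "posts" (rows) over a 5-word vocabulary (columns).
toy_counts = csr_matrix(np.array([[2, 0, 0, 1, 0],
                                  [0, 0, 0, 0, 3],
                                  [1, 1, 0, 1, 0]]))

# Consecutive differences of indptr give the number of stored
# (here: non-zero) entries in each row.
unique_words_per_post = np.diff(toy_counts.indptr)
print(unique_words_per_post)

# Column indices of the non-zero entries in the last row.
print(toy_counts[2].indices)
```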

In [10]:
import numpy as np
row_n = 0
tf_vector = tf_np_matrix[row_n]
non_zero_indices = np.flatnonzero(tf_vector)
num_unique_words = non_zero_indices.size
print(f"The newsgroup in row {row_n} contains {num_unique_words} unique words.")
print("The actual word-counts map to the following column indices:\n")
print(non_zero_indices)


The newsgroup in row 0 contains 64 unique words.
The actual word-counts map to the following column indices:

[ 14106  15549  22088  23323  24398  27703  29357  30093  30629  32194
  32305  32499  37154  39275  42332  42333  43643  45089  45141  49871
  49881  50165  54442  55453  57577  58321  58842  60116  60326  64083
  65338  67137  67140  68931  69080  70570  72915  75280  78264  78701
  79055  79461  79534  82759  84398  87690  89161  92157  93304  95225
  96145  96432 100406 100827 100942 101084 101732 108644 109086 109254
 109294 110106 112936 113262]


#### Okay cool... but what are these words?

In order to find out, we’ll need a mapping between TF vector indices and word-values. That mapping can be generated by calling vectorizer.get_feature_names_out(). (Older Scikit-Learn releases exposed the same mapping as get_feature_names(); that method was deprecated in version 1.0 and removed in version 1.2.) The method-call returns an array of words, which we’ll convert into a list called words. Each index i will correspond to the ith word within the list. Thus, running [words[i] for i in non_zero_indices] will return all unique words within our post.

In [11]:
words = list(vectorizer.get_feature_names_out())
unique_words = [words[i] for i in non_zero_indices]
print(unique_words)

['60s', '70s', 'addition', 'all', 'anyone', 'be', 'body', 'bricklin', 'bumper', 'called', 'can', 'car', 'could', 'day', 'door', 'doors', 'early', 'engine', 'enlighten', 'from', 'front', 'funky', 'have', 'history', 'if', 'in', 'info', 'is', 'it', 'know', 'late', 'looked', 'looking', 'made', 'mail', 'me', 'model', 'name', 'of', 'on', 'or', 'other', 'out', 'please', 'production', 'really', 'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'tellme', 'the', 'there', 'this', 'to', 'was', 'were', 'whatever', 'where', 'wondering', 'years', 'you']


We’ve printed all the words in newsgroups.data[0]. **Of course, not all these words have equal mention-counts. Some words occur more frequently than others. Perhaps these frequent words are more relevant to the topic of cars.** Let’s print the 10 most frequent words within the post, along with their associated counts. We’ll represent this output as a Pandas table, for visualization purposes.

## Extracting Non-Zero Elements of 1D NumPy Arrays

* non_zero_indices = np.flatnonzero(np_vector): Returns the non-zero indices in a 1D NumPy array.
* non_zero_vector = np_vector[non_zero_indices]: Selects the non-zero elements of a 1D NumPy array (assuming non_zero_indices corresponds to non-zero indices of that array).
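The two operations listed above can be tried on a small made-up vector (toy_vector is a hypothetical name, not part of the newsgroup data):

```python
import numpy as np

toy_vector = np.array([0, 2, 0, 0, 5, 1])

# Positions of the non-zero elements.
toy_indices = np.flatnonzero(toy_vector)
# The non-zero elements themselves, selected by those positions.
toy_values = toy_vector[toy_indices]

print(toy_indices)
print(toy_values)
```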

In [12]:
import pandas as pd

In [13]:
data = {
    "Data":unique_words,
    "Counts":tf_vector[non_zero_indices]
}

In [14]:
df = pd.DataFrame(data).sort_values("Counts", ascending=False)
df.head(10)

      Data  Counts
53     the       6
55    this       4
57     was       4
11     car       4
24      if       2
27      is       2
28      it       2
19    from       2
39      on       2
4   anyone       2


### What a noise...

The first three words have nothing to do with cars. These words (the, this, and was) are among the most common words in the English language. They don’t provide a differentiating signal between the car-post and a differently-themed post. Instead, the common words are a source of noise. They increase the likelihood that 2 unrelated documents will cluster together.

NLP practitioners refer to such noisy words as **stop words**, because these words are blocked from appearing in the vectorized results. Stop words are generally deleted from the text prior to vectorization.
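To see stop-word deletion in isolation, the sketch below fits two vectorizers on a single made-up sentence: one with default settings and one with Scikit-Learn's built-in English stop-word list. The names toy_text, plain, and filtered are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_text = ["The car was in the garage"]  # made-up sentence

plain = CountVectorizer().fit(toy_text)
filtered = CountVectorizer(stop_words='english').fit(toy_text)

# vocabulary_ holds the words that each vectorizer decided to keep.
print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

With the stop-word list enabled, only car and garage survive, while the, was, and in are dropped.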

Below, we’ll re-initialize a stop-word aware vectorizer. Afterwards, we’ll rerun fit_transform in order to re-compute the TF matrix. The number of word-columns in that matrix will be less than our previously computed vocabulary size of 114,751. Also, we’ll regenerate our words list. Common stop words such as the, this, of and it will be missing from that list.

In [15]:
vectorizer = CountVectorizer(stop_words='english')
tf_matrix = vectorizer.fit_transform(newsgroups.data)
assert tf_matrix.shape[1] < 114751

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = list(vectorizer.get_feature_names_out())
for common_word in ['the', 'this', 'was', 'if', 'it', 'on']:
    assert common_word not in words

All stop words have been deleted from the recomputed tf_matrix. Now, we can re-generate the 10 most frequent words in newsgroups.data[0]. Please note that in the process, we’ll recompute tf_np_matrix, tf_vector, unique_words, non_zero_indices, and df.

### Top words after stop-word deletion

In [16]:
text_index = 0
tf_np_matrix = tf_matrix.toarray()
tf_vector = tf_np_matrix[text_index]
non_zero_indices = np.flatnonzero(tf_vector)
unique_words = [words[index] for index in non_zero_indices]
data = {'Word': unique_words,
        'Count': tf_vector[non_zero_indices]}

df = pd.DataFrame(data).sort_values('Count', ascending=False)
print(f"After stop-word deletion, {df.shape[0]} unique words remain.")
print("The 10 most frequent words are:\n")
print(df[:10].to_string(index=False))

After stop-word deletion, 34 unique words remain.
The 10 most frequent words are:

       Word  Count
        car      4
        60s      1
        saw      1
    looking      1
       mail      1
      model      1
 production      1
     really      1
       rest      1
   separate      1


However, it’s worth noting that not all words are equal in their relevancy. Some words are more relevant to a car-discussion than others. For instance, the word model refers to a car-model (though of course it could also refer to a supermodel or a machine learning model). Meanwhile, the word really is more general. It doesn’t refer to anything car-related. The word is so irrelevant and common that it could almost be a stop word. In fact, some NLP practitioners keep really on their stop-word list, while others don’t. Unfortunately, there is no consensus on which words are always useless, and which words aren’t.

However, practitioners generally agree that a word becomes less useful if it’s mentioned in too many texts. Thus, really is less relevant than model, because the former is mentioned in more posts. Therefore, when ranking words by relevance, we should leverage both post-frequency and count. If two words share an equal count, then we should rank them by post-frequency instead.

Let’s rerank our 34 words based on both post-frequency and count. Afterwards, we’ll explore how these rankings can be used to improve text-vectorization.

![title](../../img/vectorization.png)

## Ranking Words by Both Post-Frequency and Count


### Filtering matrix columns with non_zero_indices

In [17]:
# Each row of the matrix corresponds to one post (document)
row = 0
sub_matrix = tf_np_matrix[:,non_zero_indices]
print("We obtained a sub-matrix corresponding to the 34 words within post 0. "
      "The first row of the sub-matrix is:")
print(sub_matrix[row])


We obtained a sub-matrix corresponding to the 34 words within post 0. The first row of the sub-matrix is:
[1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


At this moment, we don’t want to extract word-counts. Instead, we just want to know whether each word is present in a post or not.
In order to do this, we’ll transform our data into binary form.

### Converting word counts to binary values

In [18]:
from sklearn.preprocessing import binarize
binary_matrix = binarize(sub_matrix)
print(binary_matrix)

[[1 1 1 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [19]:
unique_post_mentions = binary_matrix.sum(axis=0)
print("This vector counts the unique posts in which each word is "
      f"mentioned:\n {unique_post_mentions}")

This vector counts the unique posts in which each word is mentioned:
 [  18   21  202  314    4   26  802  536  842  154   67  348  184   25
    7  368  469 3093  238  268  780  901  292   95 1493  407  354  158
  574   95   98    2  295 1174]


#### Remember ! 

* sum(axis=0) - sums down the **rows**, giving one total per column
* sum(axis=1) - sums across the **cols**, giving one total per row
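The difference between the two axes is easy to check on a small made-up count matrix. The sketch below uses a plain NumPy comparison as a stand-in for binarize; all names in it are hypothetical.

```python
import numpy as np

# Toy count matrix: 4 "posts" (rows) by 3 "words" (columns), made up here.
toy_counts = np.array([[2, 0, 1],
                       [0, 0, 3],
                       [1, 0, 0],
                       [0, 0, 5]])

# Replace every positive count with 1 (a NumPy stand-in for binarize).
toy_binary = (toy_counts > 0).astype(int)

post_mentions = toy_binary.sum(axis=0)   # one total per column (word)
words_per_post = toy_binary.sum(axis=1)  # one total per row (post)

print(post_mentions)
print(words_per_post)
```

The first result has one entry per word (how many posts mention it), while the second has one entry per post (how many distinct words it contains).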

### Transform counts into document frequencies

In [20]:
document_frequencies = unique_post_mentions / dataset_size
data = {"Word": unique_words,
       "Count": tf_vector[non_zero_indices],
       "Document Frequency": document_frequencies}

df = pd.DataFrame(data)
df_common_words = df[df["Document Frequency"] >= 0.1]
df_common_words

      Word  Count  Document Frequency
17    know      1            0.273378
24  really      1            0.131960
33   years      1            0.103765


Three of the 34 words have a document frequency of at least 0.1. As expected, these words are very general, and not car-specific. We thus can utilize document frequencies for ranking purposes. Let’s rank our words by relevance, in the following manner. First, we’ll sort the words by count, from greatest to smallest. Afterwards, all words with equal count will be sorted by document frequency, from smallest to greatest.

In [21]:
df_sorted = df.sort_values(['Count', 'Document Frequency'],
                           ascending=[False, True])
df_sorted.head(20)

          Word  Count  Document Frequency
7          car      4            0.047375
31      tellme      1            0.000177
4     bricklin      1            0.000354
14       funky      1            0.000619
0          60s      1            0.001591
1          70s      1            0.001856
13   enlighten      1            0.002210
5       bumper      1            0.002298
10       doors      1            0.005922
23  production      1            0.008397


Our sorting was successful. New car-related words, such as bumper, are now present in our list of top-ranked words. However, the actual sorting procedure was rather convoluted. It required us to sort by 2 separate columns. **Perhaps, we can simplify the process by combining the word counts and document frequencies into a single score.** How can we combine these values? One approach is to divide each word-count by its associated document frequency. This resulting value will increase if:

* The word-count goes up.
* The document frequency goes down.

In [22]:
inverse_document_frequencies = 1 / document_frequencies
df['IDF'] = inverse_document_frequencies
df['Combined'] = df.Count * inverse_document_frequencies
df_sorted = df.sort_values('Combined', ascending=False)

In [23]:
df_sorted

          Word  Count  Document Frequency          IDF     Combined
31      tellme      1            0.000177  5657.000000  5657.000000
4     bricklin      1            0.000354  2828.500000  2828.500000
14       funky      1            0.000619  1616.285714  1616.285714
0          60s      1            0.001591   628.555556   628.555556
1          70s      1            0.001856   538.761905   538.761905
13   enlighten      1            0.002210   452.560000   452.560000
5       bumper      1            0.002298   435.153846   435.153846
10       doors      1            0.005922   168.865672   168.865672
29       specs      1            0.008397   119.094737   119.094737
23  production      1            0.008397   119.094737   119.094737


Our new ranking has failed! The word car no longer appears at the top of the list. What happened? Well, let’s take a look at our table. There is a problem with the IDF values: some of them are huge! The printed IDF values range from approximately 100 to over 5000. Meanwhile, our word-count range is very small. The counts vary from 1 to 4. Thus, when we multiply word-counts by IDF values, the IDF will dominate. The counts will then have no impact on the final results. We need to somehow make our IDF values smaller. What should we do?

#### Math solution

Data scientists are commonly confronted with numeric values that are too large. One way to shrink the values down is to apply a logarithmic function. For instance, running np.log10(1000000) will return 6. Essentially, a value of 1,000,000 will be replaced by the count of zeros in that value.
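The effect of the logarithm on our ranking can be checked with two words from this section. The sketch below copies the (count, document frequency) pairs of car and tellme from the ranked table shown earlier; since those frequencies are rounded, the results are approximate. The dictionary names are hypothetical.

```python
import math

# (count, document frequency) pairs, copied from the ranked table earlier
# in this section. The frequencies are rounded, so results are approximate.
stats = {"car": (4, 0.047375), "tellme": (1, 0.000177)}

# Score without the logarithm: count * IDF.
raw = {word: count / df for word, (count, df) in stats.items()}
# Score with the logarithm: count * log10(IDF).
logged = {word: count * math.log10(1 / df)
          for word, (count, df) in stats.items()}

print(max(raw, key=raw.get))        # top-ranked word without the log
print(max(logged, key=logged.get))  # top-ranked word with the log
```

Without the logarithm, the huge IDF of tellme swamps the count of car. After taking the logarithm, the count of 4 is enough to push car back to the top.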

In [24]:
df['Combined'] = df.Count * np.log10(df.IDF)
df_sorted = df.sort_values('Combined', ascending=False)

In [25]:
df_sorted.head(15)

         Word  Count  Document Frequency          IDF  Combined
7         car      4            0.047375    21.108209  5.297806
31     tellme      1            0.000177  5657.000000  3.752586
4    bricklin      1            0.000354  2828.500000  3.451556
14      funky      1            0.000619  1616.285714  3.208518
0         60s      1            0.001591   628.555556  2.798344
1         70s      1            0.001856   538.761905  2.731397
13  enlighten      1            0.002210   452.560000  2.655676
5      bumper      1            0.002298   435.153846  2.638643
10      doors      1            0.005922   168.865672  2.227541
29      specs      1            0.008397   119.094737  2.075893


**TFIDF is a simple but powerful metric for ranking words within a document. Of course, the metric is only relevant if that document is part of a larger document group. Otherwise, the computed TFIDF values all equal zero.** Also, the metric loses its effectiveness when applied to small collections of similar texts. Nonetheless, for most real-world text datasets, TFIDF produces good ranking results. Furthermore, the metric has additional uses. It can be utilized to vectorize words within a document. The numeric content of df.Combined is essentially a vector. It was produced by modifying the TF vector stored in df.Count. In this same manner, we can transform any TF vector into a TFIDF vector. We just need to multiply the TF vector by the log of inverse document frequencies.
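The transformation described above can be sketched as a small NumPy function. The function name tfidf_from_counts and the toy matrix are hypothetical, and the sketch assumes that every column's word occurs in at least one document (otherwise the division by the document frequency would fail).

```python
import numpy as np

def tfidf_from_counts(tf_matrix):
    """Multiply each count by log10(1 / document frequency) of its column."""
    tf_matrix = np.asarray(tf_matrix, dtype=float)
    num_docs = tf_matrix.shape[0]
    # Fraction of documents in which each word (column) is mentioned.
    document_frequencies = (tf_matrix > 0).sum(axis=0) / num_docs
    # Assumes every word occurs in at least one document.
    idf = 1 / document_frequencies
    return tf_matrix * np.log10(idf)

# Toy TF matrix: 4 documents (rows) by 3 words (columns), made up here.
toy_tf = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [0, 1, 0],
                   [0, 1, 3]])
toy_tfidf = tfidf_from_counts(toy_tf)
print(toy_tfidf)

# A collection with a single document yields all zeros, since log10(1) = 0.
print(tfidf_from_counts([[3, 1]]))
```

The middle word appears in every toy document, so its whole column is zeroed out, while the word that appears in just one document keeps the largest weight per mention.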

Is there a benefit to transforming TF vectors into more complicated TFIDF vectors? Yes! Within larger text datasets, TFIDF vectors provide a greater signal of textual similarity and divergence. For example, 2 texts that are both discussing cars are more likely to cluster together if their irrelevant vector elements are penalized. Thus, penalizing common words using the IDF will improve the clustering of large text collections.

**This isn’t necessarily true of smaller datasets, where the number of documents is low, and the document frequency is high. Consequently, the IDF might be too small to meaningfully improve the clustering results.**

# Computing TFIDF Vectors with Scikit-Learn

Scikit-Learn’s TfidfVectorizer class is nearly identical to CountVectorizer, except that it takes IDF into account during the vectorization process. Below, we’ll import TfidfVectorizer from sklearn.feature_extraction.text. Afterwards, we’ll initialize the class by running TfidfVectorizer(stop_words='english'). The constructed tfidf_vectorizer object will be parametrized to ignore all stop-words. Subsequently, executing tfidf_vectorizer.fit_transform(newsgroups.data) will return a matrix of vectorized TFIDF values. The matrix-shape will be identical to tf_matrix.shape.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups.data)
assert tfidf_matrix.shape == tf_matrix.shape

Our tfidf_vectorizer has learned the same vocabulary as the simpler TF vectorizer. In fact, the indices of words in tfidf_matrix are identical to those of tf_matrix. We can confirm this by calling tfidf_vectorizer.get_feature_names_out() (named get_feature_names() in Scikit-Learn releases before version 1.0). The method-call will return an ordered array of words whose contents are identical to our previously computed words list.

In [27]:
assert list(tfidf_vectorizer.get_feature_names_out()) == list(words)

Since word-order is preserved, we should expect the non-zero indices of tfidf_matrix[0] to equal our previously computed non_zero_indices array. We’ll confirm below, after converting tfidf_matrix from a CSR data-structure to a NumPy array.

In [31]:
tfidf_np_matrix = tfidf_matrix.toarray()
tfidf_vector = tfidf_np_matrix[0]
tfidf_non_zero_indices = np.flatnonzero(tfidf_vector)
assert np.array_equal(tfidf_non_zero_indices,
                      non_zero_indices)

The non-zero indices of tf_vector and tfidf_vector are identical. We thus can add the TFIDF vector as a column in our existing df table. Adding a TFIDF column will allow us to compare Scikit-Learn’s output with our manually-computed score.

In [32]:
df['TFIDF'] = tfidf_vector[non_zero_indices]

In [35]:
df.sort_values("TFIDF",ascending=False).head()

        Word  Count  Document Frequency          IDF  Combined     TFIDF
7        car      4            0.047375    21.108209  5.297806  0.459552
31    tellme      1            0.000177  5657.000000  3.752586  0.262118
4   bricklin      1            0.000354  2828.500000  3.451556  0.247619
14     funky      1            0.000619  1616.285714  3.208518  0.234280
0        60s      1            0.001591   628.555556  2.798344  0.209729


Sorting by df.TFIDF should produce a relevance ranking that is consistent with our previous observations. Let’s verify that both df.TFIDF and df.Combined produce the same word-rankings after sorting.

In [36]:
df_sorted_old = df.sort_values('Combined', ascending=False)
df_sorted_new = df.sort_values('TFIDF', ascending=False)
assert np.array_equal(df_sorted_old['Word'].values,
                      df_sorted_new['Word'].values)
print(df_sorted_new[:10].to_string(index=False))

      Word  Count  Document Frequency          IDF  Combined     TFIDF
       car      4            0.047375    21.108209  5.297806  0.459552
    tellme      1            0.000177  5657.000000  3.752586  0.262118
  bricklin      1            0.000354  2828.500000  3.451556  0.247619
     funky      1            0.000619  1616.285714  3.208518  0.234280
       60s      1            0.001591   628.555556  2.798344  0.209729
       70s      1            0.001856   538.761905  2.731397  0.205568
 enlighten      1            0.002210   452.560000  2.655676  0.200827
    bumper      1            0.002298   435.153846  2.638643  0.199756
     doors      1            0.005922   168.865672  2.227541  0.173540
     specs      1            0.008397   119.094737  2.075893  0.163752


Our word-rankings have remained unchanged. However, the values of the TFIDF and Combined columns are not identical. Our top 10 manually-computed Combined values are all greater than 1. Meanwhile, all of Scikit-Learn’s TFIDF values are less than 1. Why is this the case?

**As it turns out, Scikit-Learn automatically normalizes its TFIDF vector results. The magnitude of df.TFIDF has been modified to equal 1. We can confirm by calling norm(df.TFIDF.values).**

In [37]:
from numpy.linalg import norm
# Compare with a tolerance, since floating-point norms are rarely exactly 1
assert np.isclose(norm(df.TFIDF.values), 1)
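Normalization is not the only difference between the two columns. To our understanding of the Scikit-Learn documentation, TfidfVectorizer with default settings (smooth_idf=True, norm='l2') also uses a smoothed natural-log IDF, ln((1 + n) / (1 + df)) + 1, rather than our log10(n / df). The sketch below tries to reproduce the library output by hand on a made-up toy corpus; it is an illustration under those assumed defaults, not a description of our newsgroup computation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_texts = ["the car is red", "the bike is blue", "the car is fast"]

counts = CountVectorizer().fit_transform(toy_texts).toarray()
num_docs = counts.shape[0]
doc_counts = (counts > 0).sum(axis=0)  # documents mentioning each word

# Smoothed natural-log IDF (assumed default: smooth_idf=True).
idf = np.log((1 + num_docs) / (1 + doc_counts)) + 1
manual = counts * idf
# Scale each row to unit length (assumed default: norm='l2').
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sklearn_result = TfidfVectorizer().fit_transform(toy_texts).toarray()
print(np.allclose(manual, sklearn_result))
```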

Why would Scikit-Learn automatically normalize the vectors? For our own benefit! As discussed in Section Thirteen, it’s easier to compute text-vector similarity when all vector magnitudes equal 1. Consequently, our normalized TFIDF matrix is primed for similarity analysis.
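One way to see the benefit, assuming that cosine similarity is the similarity measure in question (an assumption made for this sketch), is that the cosine of two unit vectors reduces to a plain dot product. The vectors a and b below are made up for illustration.

```python
import numpy as np
from numpy.linalg import norm

a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

# Cosine similarity of the raw vectors requires both magnitudes.
cosine = a @ b / (norm(a) * norm(b))

# After scaling each vector to magnitude 1, a dot product is enough.
unit_a, unit_b = a / norm(a), b / norm(b)
dot_of_units = unit_a @ unit_b

print(np.isclose(cosine, dot_of_units))
```

For a matrix whose rows are already normalized, all pairwise similarities can therefore be obtained with a single matrix product.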

![title](../../img/Vectorizer.png)

# Summary 

In order to compute TF-IDF:

1. Use CountVectorizer with stop-word deletion to count the words in each document
2. For code simplicity, convert the CSR matrix to a NumPy array
3. Compute each word's document frequency (the fraction of documents that mention it)
4. Compute the inverse document frequency (1 / document frequency)
5. Take the logarithm of the inverse document frequency, so that it does not dominate the counts
6. Obtain TF-IDF by multiplying the word count by that logarithm (Count * log(IDF))

or...

Use TfidfVectorizer, and that's all!

## Next 

Read about bag of words, because TF-IDF addresses drawbacks of the bag-of-words technique.

## Remember 

* TF - how many times a word is used in a given document
* IDF - how rare the word is across the entire collection of documents (rarer words get a higher IDF)

So by combining TF and IDF, we can tell which words are important within a document, relative to all the other documents.

TF-IDF

https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3

Bag of words

https://medium.com/swlh/bag-of-words-code-the-easiest-explanation-of-nlp-technique-using-a-python-8a4fdfb8598c