# Tf-IDF Vectorizer for NLP
* TF: Term Frequency => How often a term occurs in a document. In this practical, each sentence is treated as one document.
    * Formula: ***TF = no. of times a term "n" appears in a document / total no. of terms in that document***
* IDF: Inverse Document Frequency => How rare a term is across all documents. A term that appears in fewer documents gets a higher IDF.
    * Formula: ***IDF = 1 + log(total no. of documents / no. of documents in which the term "n" appears)***
    
**Note: Stopword removal and other pre-processing are not done in this practical, because its goal is just to give the intuition and a demo of Tf-IDF.**
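
The two formulas above can be checked by hand on a tiny corpus. The snippet below is only a minimal sketch: the three toy sentences and the helper names `tf`, `idf` and `tf_idf` are made up for illustration, tokens are split on whitespace, and the natural logarithm is assumed for `log`.

```python
import math

# Toy corpus (made up for illustration): each sentence is one "document".
docs = [
    "natural language processing is fun",
    "language models process natural language",
    "computers process data",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    # Occurrences of the term divided by the total number of terms in the document.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # 1 + log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document, otherwise this divides by zero.
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return 1 + math.log(len(corpus) / doc_freq)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

for term in ("language", "fun"):
    print(term, "IDF =", round(idf(term, tokenized), 4))
print("TF-IDF of 'language' in document 2:", round(tf_idf("language", tokenized[1], tokenized), 4))
```

"fun" occurs in only one document while "language" occurs in two, so "fun" ends up with the higher IDF.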

In [5]:
# Importing the required libraries

import goose3     # For fetching the text from the URL
from sklearn.feature_extraction.text import TfidfVectorizer
import math
# sent_tokenize needs NLTK's Punkt sentence-tokenizer data (a one-time nltk.download() call)
from nltk import sent_tokenize

In [2]:
goose = goose3.Goose()

# Extracting data from the Wikipedia article on NLP
data = goose.extract("https://en.wikipedia.org/wiki/Natural_language_processing")

In [3]:
data = data.cleaned_text

In [4]:
data

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.\n\nNatural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpretation and generation of natural language, but at the time not articulated as a problem separate from artificial intelligence.\n\nThe premise 

In [6]:
# Splitting the text into sentences (sent_tokenize already returns a list)

sentences = sent_tokenize(data)

In [7]:
sentences

['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.',
 'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.',
 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.',
 'Natural language processing has its roots in the 1950s.',
 'Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpretation and generation of natural language, but at the time not articulated as a problem separate from artificial intelligence.',

In [8]:
# Generating Tf-IDF Values for the sentences

tf_idf_vectorizer = TfidfVectorizer()
transformed_sentences = tf_idf_vectorizer.fit_transform(sentences)
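
With its default settings, `TfidfVectorizer` does not use the textbook formula exactly: it uses the raw term count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The pure-Python sketch below follows those defaults to show where the numbers come from. The function names `tokenize` and `tfidf_rows` and the two sample sentences are assumptions made for illustration, and the sketch is not guaranteed to match scikit-learn in every edge case.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring the vectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))      # features in alphabetical order
    n_docs = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(1 for c in counts if term in c))) + 1
        for term in vocab
    }
    rows = []
    for c in counts:
        raw = [c[term] * idf[term] for term in vocab]      # raw count times IDF
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0   # L2 norm of the row
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical mini-corpus, not taken from the Wikipedia article.
vocab, rows = tfidf_rows(["The cat sat.", "The dog sat down."])
print(vocab)
print([round(v, 3) for v in rows[0]])
```

Because of the L2 scaling, every value in a row lies between 0 and 1, and the squared values of a row sum to 1.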

In [11]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("Total unique features: ", len(tf_idf_vectorizer.get_feature_names_out()))

Total unique features:  861


In [12]:
# Printing the Tf-IDF values for the first sentence (row 0 of the matrix)
transformed_sentences.toarray()[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     
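
Most entries in the row are zero because a single sentence contains only a handful of the 861 features. A dense printout is therefore hard to read, and it can help to list only the non-zero terms with their weights. The helper below is a minimal sketch: `nonzero_terms` is a hypothetical name, and the feature names and weights in the demo are made-up values, not output from this notebook.

```python
def nonzero_terms(feature_names, row, top_k=5):
    # Pair each feature with its weight, drop the zeros,
    # and return the top_k highest-weighted terms.
    pairs = [(name, value) for name, value in zip(feature_names, row) if value > 0]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_k]

# Made-up values for illustration only.
feature_names = ["computer", "language", "natural", "turing"]
row = [0.0, 0.61, 0.42, 0.0]
print(nonzero_terms(feature_names, row))
```

In this notebook the same helper could be called with `tf_idf_vectorizer.get_feature_names_out()` and `transformed_sentences.toarray()[0]` as its two arguments.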

# Congratulations, you have learned the application & output of Tf-IDF!