# NLP: Intro to TF-IDF (Term Frequency-Inverse Document Frequency)
> Description:
* TF-IDF stands for "term frequency-inverse document frequency". It is a statistical measure used in natural language processing to evaluate how important a word is to a document within a corpus.
* TF-IDF is calculated by multiplying how often a word occurs in a document (term frequency) by a weight that shrinks as the word appears in more documents of the corpus (inverse document frequency, usually the logarithm of the total number of documents divided by the number of documents containing the word).
* The intuition behind TF-IDF is that words that appear frequently in a document but rarely in the rest of the corpus are more likely to be important in that document. This measure is widely used in text mining, information retrieval, and other NLP applications to rank documents and extract meaningful features from text data.
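
The calculation described above can be sketched by hand on a toy corpus. This is a minimal illustration of one common textbook variant (term count divided by document length, multiplied by the natural log of N over document frequency). The three short documents and the helper name `tf_idf` are made up for this example. Libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is an already-cleaned string.
docs = [
    "cmb faint temperature",
    "cmb uniform sky temperature",
    "universe expansion galaxy",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: (count / document length) * ln(N / df).

    Only call this for terms that occur somewhere in the corpus,
    otherwise df[term] is 0 and the division fails.
    """
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf

for tokens in tokenized:
    print({term: round(tf_idf(term, tokens), 3) for term in tokens})
```

In the first document, "faint" occurs in only one of the three documents, so it receives a higher weight than "cmb", which occurs in two.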

In [4]:
# Importing libraries:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [6]:
# Creating a paragraph on which we will be performing Tokenization.

paragraph  = '''The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe. It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.

The CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey. They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.

The CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius). However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000. These tiny fluctuations are thought to be the result of slight density variations in the early universe, which were stretched out by cosmic expansion to form the large-scale structures we see today, such as galaxies and clusters of galaxies.

Studying the CMB has been crucial to our understanding of the universe and its evolution. It has provided strong evidence for the Big Bang theory, as well as for the existence of dark matter and dark energy. It has also allowed astronomers to measure the age, size, and composition of the universe with unprecedented accuracy.

In recent years, the study of the CMB has entered a new era, with a number of high-precision experiments, such as the Planck satellite and the Atacama Cosmology Telescope, providing even more detailed maps of the CMB and shedding light on some of the universe's deepest mysteries.
'''

### Cleaning the paragraph.

In [9]:
## FORMING THE SENTENCES AND CLEANING THE PARAGRAPH USING STOPWORDS AND LEMMATIZATION:

# Performing sentence tokenization (each sentence becomes one document in the corpus):
sentences = nltk.sent_tokenize(paragraph)

# Performing lemmatization and making all the characters lowercase:
wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []

for sentence in sentences:
  # Keep letters only, lowercase the text, and split it into words
  review = re.sub("[^a-zA-Z]", ' ', sentence)
  review = review.lower()
  review = review.split()
  # Drop stop words and lemmatize the remaining words
  review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
  review = ' '.join(review)

  corpus.append(review)
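
To see what the regular-expression step in the loop does on its own, here is a small sketch using only the standard library. It applies the same pattern to one sentence from the paragraph. The variable names `letters_only` and `tokens` are chosen for this illustration only.

```python
import re

# One sentence taken from the paragraph above.
sentence = (
    "The CMB is incredibly faint, with a temperature of just 2.7 Kelvin "
    "(-270.45 degrees Celsius)."
)

# Replace every character that is not a letter with a space,
# then lowercase and split on whitespace.
letters_only = re.sub("[^a-zA-Z]", " ", sentence)
tokens = letters_only.lower().split()

print(tokens)
```

Digits, punctuation, and hyphens are all replaced by spaces, which is why numbers such as 2.7 never reach the corpus and "horn-shaped" ends up as two separate words. Stop-word removal and lemmatization then shorten the token list further.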


In [13]:
corpus 

['cosmic microwave background cmb form electromagnetic radiation pervades entire universe',
 'thought afterglow big bang event mark beginning universe know',
 'cmb first discovered two radio astronomer arno penzias robert wilson working bell lab new jersey',
 'using large horn shaped antenna study radio wave emitted milky way kept detecting mysterious signal seemed coming direction sky',
 'ruling number possible explanation realized stumbled upon cmb',
 'cmb incredibly faint temperature kelvin degree celsius',
 'however remarkably uniform across entire sky temperature variation part',
 'tiny fluctuation thought result slight density variation early universe stretched cosmic expansion form large scale structure see today galaxy cluster galaxy',
 'studying cmb crucial understanding universe evolution',
 'provided strong evidence big bang theory well existence dark matter dark energy',
 'also allowed astronomer measure age size composition universe unprecedented accuracy',
 'recent year s

### Performing TF-IDF

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf =  TfidfVectorizer()
x = tfidf.fit_transform(corpus)

In [18]:
import pandas as pd
pd.DataFrame(x.toarray())


,0,1,2,3,4,5,6,7,8,9,...,114,115,116,117,118,119,120,121,122,123
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.364408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.271858,0.233475,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.271858,0.271858,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.236029,0.0,0.0,0.0,...,0.0,0.0,0.236029,0.0,0.236029,0.236029,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.376478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.354658,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.304585,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.187383,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.272417,0.0,0.0,0.0


> By performing TF-IDF, we capture how important each word is within a particular sentence relative to the rest of the corpus. These weights can then serve as features in NLP projects such as sentiment analysis and spam detection.