# NLP: Transforming text into embeddings using TF-IDF

### **TF-IDF**: Term Frequency-Inverse Document Frequency 

### Before discussing TF-IDF, consider the simplest way of transforming words into embeddings, ***the document-term matrix.*** In this technique you build a matrix in which each row is a phrase, each column is a token, and each cell holds the number of times that token appears in the phrase.

![image.png](attachment:image.png)

### To measure the similarity between two phrases, choose a similarity measure (cosine similarity, for example) and apply it to the rows of those phrases. **The major problem with this method is that every word is treated as equally important in the phrase.**
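
### As a minimal sketch of the idea, the cell below builds a document-term matrix by hand and compares rows with cosine similarity. The three phrases, the whitespace tokenization, and the helper name `cosine_similarity` are assumptions chosen for illustration only.

```python
import math
from collections import Counter

# Toy phrases (hypothetical data, not taken from any real corpus)
phrases = [
    "the stock price rose",
    "the stock price fell",
    "the cat sat on the mat",
]

# Naive whitespace tokenization; real tokenizers are more careful
tokenized = [phrase.split() for phrase in phrases]

# One column per distinct token, in alphabetical order
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per phrase; each cell is the count of that token in the phrase
doc_term_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    doc_term_matrix.append([counts[token] for token in vocabulary])


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_01 = cosine_similarity(doc_term_matrix[0], doc_term_matrix[1])
sim_02 = cosine_similarity(doc_term_matrix[0], doc_term_matrix[2])

print(vocabulary)
for row in doc_term_matrix:
    print(row)
print(round(sim_01, 3), round(sim_02, 3))
```

### Note that the first and third phrases still get a non-zero similarity only because both contain the word "the", which illustrates why treating all words as equally important is a problem.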

### To address this problem, TF-IDF emerged as a numeric statistic intended to reflect how important a word is to a document within a collection of documents.
### TF-IDF computes this importance score by multiplying the term’s frequency in the document (TF) by the term’s inverse document frequency (IDF). The more often a term appears in a document and the rarer it is across the whole collection, the higher its TF-IDF score and the higher its importance to that document.

## How to calculate TF-IDF?

## The term frequency is how many times a term appears in a document, divided by the total number of words in that document. For example:
   ## if the word ‘stock’ appears 20 times in a 2000-word document, the TF of ‘stock’ is
   ## 20/2000 = 0.01.

## The IDF is the logarithm of the total number of documents divided by the number of documents that contain the term. For example:
   ## if there are 50,000 documents and the word ‘stock’ appears in 500 of them, the IDF
   ## is ln(50000/500) = ln(100) ≈ 4.6 (using the natural logarithm). So the TF-IDF of ‘stock’ is 4.6 * 0.01 = 0.046.
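
## The arithmetic above can be checked with a few lines of Python. This is only a sketch of the textbook formula described here; the function names `term_frequency` and `inverse_document_frequency` are hypothetical helpers, and the natural logarithm is assumed because it matches the value 4.6.

```python
import math


def term_frequency(term_count, total_words):
    """Share of the document's words that are the given term."""
    return term_count / total_words


def inverse_document_frequency(total_documents, documents_with_term):
    """Natural log of (all documents / documents containing the term)."""
    return math.log(total_documents / documents_with_term)


# Numbers from the 'stock' example
tf = term_frequency(20, 2000)
idf = inverse_document_frequency(50_000, 500)
tf_idf = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.2f}, TF-IDF = {tf_idf:.3f}")
```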




In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# converting the text data to feature vectors (numeric values)
vectorizer = TfidfVectorizer()

# fit_transform expects a list (or any iterable) of documents; here it holds a single document
feature_vectors = vectorizer.fit_transform(["my name is nouhaila and i'm learning nlp"])

print(feature_vectors)

  (0, 5)	0.3779644730092272
  (0, 2)	0.3779644730092272
  (0, 0)	0.3779644730092272
  (0, 6)	0.3779644730092272
  (0, 1)	0.3779644730092272
  (0, 4)	0.3779644730092272
  (0, 3)	0.3779644730092272


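## To understand this output, read each line as follows:

## The general form:  (A, B)  C
### A: Document index (always 0 here, because there is only one document)
### B: Column index of the word in the vectorizer's vocabulary
### C: TF-IDF score for word B in document A

## This is a sparse matrix, so only the non-zero TF-IDF scores of each document are printed. All seven scores are equal here because each word appears once in the single document and scikit-learn normalizes each row to unit length by default, which gives 1/√7 ≈ 0.378.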
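
## To see where 0.3779644730092272 comes from, the sketch below recomputes the scores by hand. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: lowercasing, tokens of at least two word characters (so "i" and "m" from "i'm" are dropped), raw counts as TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The variable names are illustrative, and this is a hand computation, not the library's internal code.

```python
import math
import re

text = "my name is nouhaila and i'm learning nlp"

# Assumed default tokenization: lowercase, keep tokens with 2+ word characters
tokens = re.findall(r"\b\w\w+\b", text.lower())

# Column indices follow the alphabetically sorted vocabulary
vocabulary = sorted(set(tokens))

n_documents = 1  # the corpus holds a single document
raw_scores = {}
for term in vocabulary:
    tf = tokens.count(term)   # raw count of the term in the document
    df = 1                    # every term occurs in the only document
    idf = math.log((1 + n_documents) / (1 + df)) + 1  # smoothed IDF, equals 1 here
    raw_scores[term] = tf * idf

# L2-normalize the row so that its squared scores sum to 1
norm = math.sqrt(sum(score ** 2 for score in raw_scores.values()))
scores = {term: score / norm for term, score in raw_scores.items()}

for index, term in enumerate(vocabulary):
    print(index, term, scores[term])
```

## With a single document every IDF equals 1, so the scores carry no information about word importance yet; they only start to differ once the vectorizer is fitted on several documents.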
