- NLP works with textual data. Before a model can use text, the text must first be converted into numeric data.
- This conversion is called **Feature Extraction**: a technique that turns text into numerical vectors that capture how often, or how importantly, each word (token) appears in the given text.
- A **vectorizer** is a class that performs feature extraction. The examples here use the vectorizers from scikit-learn's `sklearn.feature_extraction.text` module.
- This section covers **2 types of vectorizers**:
    - Count Vectorizer
    - Tfidf Vectorizer

### Count Vectorizer
- Splits each string into tokens and then counts how many times each token occurs in that string.
- This is also known as the 'bag of words' approach, because only the counts are kept and the word order is discarded.
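
To make the idea concrete, here is a minimal hand-rolled sketch of bag-of-words counting in plain Python. The helper `bag_of_words` and its regular expression are illustrative assumptions (tokens are taken to be lowercased runs of two or more word characters); it is not how any particular library implements the technique.

```python
import re
from collections import Counter

# Assumption: a token is a run of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def bag_of_words(text):
    """Return a Counter mapping each lowercased token to its count."""
    return Counter(TOKEN_RE.findall(text.lower()))

counts = bag_of_words("Hey hey hey lets go get lunch today")
print(counts["hey"], counts["lunch"], counts["home"])  # prints: 3 1 0
```

A `Counter` returns 0 for words it has never seen, which mirrors the zeros in a count matrix.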

In [1]:
sentences = ['Hey hey hey lets go get lunch today', 'Did you go home?', 'Hey!!! I need a favour']
print(sentences)

['Hey hey hey lets go get lunch today', 'Did you go home?', 'Hey!!! I need a favour']


In [2]:
# create an instance of the CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [3]:
s = cv.fit_transform(sentences)
print(s)

  (0, 4)	3
  (0, 6)	1
  (0, 3)	1
  (0, 2)	1
  (0, 7)	1
  (0, 9)	1
  (1, 3)	1
  (1, 0)	1
  (1, 10)	1
  (1, 5)	1
  (2, 4)	1
  (2, 8)	1
  (2, 1)	1


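Each line of this output has the form `(x, y) n`: in sentence `x` (x = 0, 1, 2, ...), the word at vocabulary index `y` occurs `n` times. Only non-zero counts are listed. The feature names below make it clearer how this output was produced.

CountVectorizer lowercases the text, strips punctuation and, by default, drops single-character tokens such as 'I' and 'a'. Stop words are not removed by default, which is why 'did' and 'you' are still present. The feature names (unique words) extracted from these sentences are: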

In [4]:
print("Feature Names:", cv.get_feature_names_out())

Feature Names: ['did' 'favour' 'get' 'go' 'hey' 'home' 'lets' 'lunch' 'need' 'today'
 'you']


The extracted feature names are sorted alphabetically, and a word's position in this list is its column index. Calling `s.toarray()` converts the sparse matrix into a dense array, where each row corresponds to a sentence and each column to a feature (word), holding that word's count in the sentence:

In [5]:
print("Sparse Matrix:\n", s.toarray())

Sparse Matrix:
 [[0 0 1 1 3 0 1 1 0 1 0]
 [1 0 0 1 0 1 0 0 0 0 1]
 [0 1 0 0 1 0 0 0 1 0 0]]


Consider the first row of this matrix, i.e. the first sentence: did, favour, home, need and you are 0, which means they do not occur in the first sentence; get, go, lets, lunch and today are 1; and hey is 3.

### Tfidf Vectorizer

- An alternative to Count Vectorizer.
- Creates a table (matrix) from our text or sentences, just as Count Vectorizer does.
- Does not store raw word counts. Instead, each count is reweighted by how rare the word is across all sentences.
- It calculates a term frequency-inverse document frequency (TF-IDF) value for each word. TF-IDF is the product of 2 weights: the term frequency and the inverse document frequency.

**TF or Term Frequency** represents how often a word occurs **in a single sentence**.
- The TF value increases when the same word occurs several times in the same sentence.
- E.g. if a word is repeated many times in a sentence and is not found in other sentences, it is likely to be distinctive of that sentence, so TF-IDF gives it a high score.

**IDF or Inverse Document Frequency** is the second weight, reflecting how rare a word is **across many sentences**.
- If a word is used in many sentences, its IDF is low, so its TF-IDF decreases.
- When there are several sentences, some common words may appear in every sentence. Such words carry little information, so TF-IDF gives them a low score.
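
For reference, the two weights can be written out as follows. This assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`); other libraries and textbooks use slightly different IDF formulas.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Here $n$ is the number of sentences, $\mathrm{df}(t)$ is the number of sentences that contain the word $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in sentence $d$. Each row of the resulting matrix is then divided by its Euclidean (L2) norm.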

In [6]:
sentences = ['Hey lets get lunch', 'Hey!!! i need a favour']
print(sentences)

['Hey lets get lunch', 'Hey!!! i need a favour']


In [7]:
# create an instance of the TfidfVectorizer class
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()

In [8]:
s = tv.fit_transform(sentences)
print(s)

  (0, 4)	0.534046329052269
  (0, 1)	0.534046329052269
  (0, 3)	0.534046329052269
  (0, 2)	0.37997836159100784
  (1, 0)	0.6316672017376245
  (1, 5)	0.6316672017376245
  (1, 2)	0.4494364165239821


In [9]:
# Print the feature (word) names
print("Feature Names:", tv.get_feature_names_out())

Feature Names: ['favour' 'get' 'hey' 'lets' 'lunch' 'need']


In [10]:
print("TF-IDF Matrix:\n", s.toarray())

TF-IDF Matrix:
 [[0.         0.53404633 0.37997836 0.53404633 0.53404633 0.        ]
 [0.6316672  0.         0.44943642 0.         0.         0.6316672 ]]


In this matrix:
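- The first row holds the TF-IDF values for each word in the first sentence. Words not present in the sentence have a TF-IDF score of 0.
- The second row holds the TF-IDF values for the second sentence.
- 'hey' occurs in both sentences, so it scores lower (0.38 and 0.45) than the words that occur in only one sentence (0.53 and 0.63).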
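
The first row can be checked by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and uses only the standard library; the document frequencies are read off the two example sentences.

```python
import math

n_docs = 2  # number of sentences

# Document frequency: in how many sentences each word of sentence 0 occurs.
df = {"hey": 2, "lets": 1, "get": 1, "lunch": 1}

# Smoothed IDF (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}

# Every word occurs once in the first sentence, so tf = 1 and tf-idf = idf.
raw = dict(idf)

# L2-normalise the row so that its Euclidean length is 1.
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

for word, score in tfidf.items():
    print(f"{word}: {score:.8f}")
```

Under these assumptions the printed values should agree with the first row of the TF-IDF matrix above: roughly 0.38 for 'hey' and 0.53 for each of the other three words.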