# Vectorizing

## What is it?


Vectorizing in NLP refers to the process of converting text data into numerical vectors that can be used as input to machine learning models.

Since machine learning algorithms require numerical input, vectorization is a crucial step in preprocessing text data.

Common vectorization techniques:

1. Bag of Words (BoW)
2. Term Frequency-Inverse Document Frequency (TF-IDF)
3. N-Grams

## What for?

Uses of vectorizing:

1. Text Classification

   Converting text into vectors for use in machine learning algorithms to classify text into categories (e.g., spam detection, sentiment analysis).

2. Clustering

   Grouping similar documents together based on their vector representations (e.g., topic clustering).

3. Information Retrieval

   Matching and ranking documents based on the similarity of their vector representations to a query vector (e.g., search engines).

4. Text Similarity

   Measuring how similar two pieces of text are by comparing their vectors (e.g., plagiarism detection).

5. Named Entity Recognition (NER)

   Using word vectors as features in models that identify and classify named entities in text.

6. Machine Translation

   Translating text from one language to another by leveraging vector representations to understand the context and semantics.

7. Recommendation Systems

   Recommending content based on the similarity of vector representations (e.g., recommending articles or products).

8. Question Answering

   Understanding and retrieving relevant answers to questions based on the vectorized representations of text.

## Example

For the sentence “I love natural language processing,” the vectorization process could involve:

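* Bag of Words: [1, 1, 1, 1, 1] (one count for each of the words “I”, “love”, “natural”, “language”, “processing”).
* TF-IDF: the same positions hold weights instead of counts, for example [0.2, 0.3, 0.5, 0.5, 0.5] (illustrative values only; real weights depend on how often each word occurs across the whole corpus).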
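
A bag-of-words vector for this sentence can be built with a few lines of plain Python. This is a minimal sketch, assuming simple lowercase whitespace tokenization; the helper name `bag_of_words` is made up for illustration. Library vectorizers may tokenize differently: scikit-learn's default `CountVectorizer`, for example, ignores one-character tokens such as "I".

```python
from collections import Counter

def bag_of_words(sentence):
    """Return (vocabulary, counts) for one sentence, keeping first-seen word order."""
    # Assumption: lowercase, drop a trailing period, split on whitespace.
    tokens = sentence.lower().rstrip(".").split()
    counts = Counter(tokens)
    vocabulary = list(dict.fromkeys(tokens))  # unique tokens, order preserved
    return vocabulary, [counts[word] for word in vocabulary]

vocabulary, vector = bag_of_words("I love natural language processing")
print(vocabulary)
print(vector)
```

Each position in the vector corresponds to one vocabulary entry, so a word that occurs twice in a sentence would get a count of 2 at its position.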

## How to do it?

### Packages



*   scikit-learn (imported as `sklearn`)



In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Examples

In [3]:
sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']

#### Count Vectorizer

In [4]:
count_vectorizer = CountVectorizer()
x1 = count_vectorizer.fit_transform(sample_data)

In [5]:
x1.toarray()

array([[0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 0, 1, 1, 1],
       [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]])

In [6]:
count_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'paper', 'second', 'the',
       'third', 'this'], dtype=object)
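
Rows of the count matrix can be compared directly, which is the basis of the text-similarity and retrieval uses listed earlier. The sketch below copies three rows from the `x1.toarray()` output and compares them with a hypothetical `cosine_similarity` helper written in plain Python (it assumes neither vector is all zeros).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Rows copied from the count matrix shown above.
doc1 = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # 'This is the first paper.'
doc3 = [1, 0, 0, 1, 1, 0, 0, 1, 1, 1]  # 'And this is the third one.'
doc4 = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # 'Is this the first paper?'

print(cosine_similarity(doc1, doc4))  # identical count vectors
print(cosine_similarity(doc1, doc3))  # the two share only 'is', 'the', 'this'
```

The first and fourth documents get identical vectors even though their word order differs, because plain word counts ignore order.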

#### TF-IDF Vectorizer

In [7]:
tfidf_vectorizer = TfidfVectorizer()
x2 = tfidf_vectorizer.fit_transform(sample_data)

In [8]:
x2.toarray()

array([[0.        , 0.        , 0.58028582, 0.38408524, 0.        ,
        0.46979139, 0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.55690079, 0.        , 0.29061394, 0.        ,
        0.35546256, 0.55690079, 0.29061394, 0.        , 0.29061394],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.        , 0.58028582, 0.38408524, 0.        ,
        0.46979139, 0.        , 0.38408524, 0.        , 0.38408524]])

In [9]:
tfidf_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'paper', 'second', 'the',
       'third', 'this'], dtype=object)
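
To see where these weights come from, the sketch below recomputes the first row in plain Python. It assumes the documented defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names `tokenize` and `tfidf_row` are made up for illustration.

```python
import math
import re

sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(text) for text in sample_data]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(term in doc for doc in docs)  # number of documents containing the term
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """TF-IDF weights for one tokenized document, L2-normalized."""
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

first_row = tfidf_row(docs[0])
for term, weight in zip(vocabulary, first_row):
    print(f"{term:10s} {weight:.8f}")
```

Terms that occur in every document (`is`, `the`, `this`) receive the lowest non-zero weight, while `first`, which occurs in only two documents, receives the highest weight in that row.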

#### N-Grams Vectorizer

In [10]:
n_gram_vectorizer = CountVectorizer(ngram_range=(2, 2))
x3 = n_gram_vectorizer.fit_transform(sample_data)

In [11]:
x3.toarray()

array([[0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]])

In [12]:
n_gram_vectorizer.get_feature_names_out()

array(['and this', 'document is', 'first paper', 'is the', 'is this',
       'second paper', 'the first', 'the second', 'the third',
       'third one', 'this document', 'this is', 'this the'], dtype=object)
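
The bigram features can also be derived by hand. This sketch, in plain Python with a hypothetical `word_ngrams` helper, slides a window of two tokens over a document. It assumes the same tokenization as in the feature names above: lowercase tokens of two or more word characters.

```python
import re

def word_ngrams(text, n=2):
    """Return the word n-grams of `text` as space-joined strings."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = word_ngrams("This is the first paper.")
fourth = word_ngrams("Is this the first paper?")
print(first)
print(fourth)
print(sorted(set(first) & set(fourth)))  # bigrams the two sentences share
```

The first and fourth sentences contain the same words but produce different bigrams, so `ngram_range=(2, 2)` captures part of the word order.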

#### N-Grams Vectorizer with TF-IDF

In [13]:
ng_tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
x4 = ng_tfidf_vectorizer.fit_transform(sample_data)

In [14]:
x4.toarray()

array([[0.        , 0.        , 0.52303503, 0.42344193, 0.        ,
        0.        , 0.52303503, 0.        , 0.        , 0.        ,
        0.        , 0.52303503, 0.        ],
       [0.        , 0.47633035, 0.        , 0.30403549, 0.        ,
        0.47633035, 0.        , 0.47633035, 0.        , 0.        ,
        0.47633035, 0.        , 0.        ],
       [0.49819711, 0.        , 0.        , 0.31799276, 0.        ,
        0.        , 0.        , 0.        , 0.49819711, 0.49819711,
        0.        , 0.39278432, 0.        ],
       [0.        , 0.        , 0.43779123, 0.        , 0.55528266,
        0.        , 0.43779123, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.55528266]])

In [15]:
ng_tfidf_vectorizer.get_feature_names_out()

array(['and this', 'document is', 'first paper', 'is the', 'is this',
       'second paper', 'the first', 'the second', 'the third',
       'third one', 'this document', 'this is', 'this the'], dtype=object)

## Practice

### Quiz 1

*   Read the `spam.csv` file
*   Print the first 5 rows
*   Apply `CountVectorizer` to each document in the file



|   | text | target | text_without_stopwords | text_without_sp | stemmed_text | lemmatized_text |
|---|------|--------|------------------------|-----------------|--------------|-----------------|
| 0 | Go until jurong point, crazy.. Available only ... | ham | ham | ham | ham | ham |
| 1 | Ok lar... Joking wif u oni... | ham | ham | ham | ham | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam | spam | spam | spam | spam |
| 3 | U dun say so early hor... U c already then say... | ham | ham | ham | ham | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham | ham | ham | ham | ham |


### Quiz 2



*  What is the shape (number of rows and columns) of the matrix?
*  Print the vector representation of the first document
*  Print the vector representation of the second word



In [46]:
# write answer here


### Quiz 3

Which 3 terms appear most often (highest number of appearances) in the documents?

|      | count |
|------|-------|
| 7756 | 2242  |
| 8609 | 2240  |
| 7627 | 1328  |
| 1084 | 979   |
| 4087 | 898   |


count    7756
dtype: int64

np.int64(2242)

'to'

### Quiz 4

Which terms appear least often (lowest number of appearances) in the documents?

count    2
dtype: int64

count    1
Name: 2, dtype: int64

1

array(['000pes'], dtype=object)

1

### Quiz 5

Find the document that contains the most words.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8662,8663,8664,8665,8666,8667,8668,8669,8670,8671
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


count    1084
dtype: int64

|      | count |
|------|-------|
| 1084 | 176   |


array([[0, 0, 0, ..., 0, 0, 0]])

### Quiz 6

In how many documents does the term `for` appear?

In [17]:
# write answer here


(array([3308]),)

np.int64(704)

### Quiz 7

What are the 3 most common words across all documents?

Find and print the vector of each of those words.


(array([3308]),)

### Quiz 8

Solve the questions above using:

1. `TF-IDF Vectorizer`
2. `N-Grams Vectorizer`

