# Vectorizing Raw Data
Vectorization converts text into numerical features that a machine learning model can work with. The following techniques are covered:
* **CountVectorizer**: converts a collection of text documents to a matrix of token counts.
* **TfidfVectorizer**: converts a collection of raw documents to a matrix of TF-IDF features.
* **N-grams vectorization**: converts a collection of text documents to a matrix of n-gram counts, where an n-gram is a sequence of n consecutive tokens. It uses `CountVectorizer` with the `ngram_range` parameter.

In [23]:
sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']

## Data Cleaning

* Remove punctuation
* Remove stop words
* Lowercase the text (optional)
* Stemming or Lemmatization (optional)
* Remove numbers (optional)
* Remove URLs (optional)
* Remove HTML tags (optional)

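The sample data is treated as already clean, so it is passed to the vectorizers unchanged. With their default settings, scikit-learn's vectorizers lowercase the text and drop punctuation during tokenization anyway.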

In [24]:
# Data Cleaning
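The cleaning steps listed above could be sketched roughly as follows, using only the standard library. The `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical names introduced for illustration, not part of the original notebook. A real project would more likely use a curated stop-word list and a stemmer or lemmatizer from an NLP library.

```python
import re
import string

sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "is", "the", "this"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip HTML tags, URLs, digits and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"\d+", " ", text)            # remove numbers
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

cleaned = [clean_text(doc) for doc in sample_data]
print(cleaned)
```

Stop-word removal changes the vocabulary considerably on such short sentences, which is one reason the vectorizers below are run on the original `sample_data` rather than on `cleaned`.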

## CountVectorizer

1. Import `CountVectorizer` from `sklearn.feature_extraction.text`
2. Create an instance of `CountVectorizer`
3. Fit and transform the data
4. Convert the resulting sparse matrix to a dense array
5. Get the feature names (the learned vocabulary, in column order)
6. Count the feature names to get the vocabulary size

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
count_vectorizer = CountVectorizer()

In [7]:
x1 = count_vectorizer.fit_transform(sample_data)

In [8]:
x1.toarray()

array([[0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 0, 1, 1, 1],
       [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]])

In [9]:
count_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'paper', 'second', 'the',
       'third', 'this'], dtype=object)

In [10]:
len(count_vectorizer.get_feature_names_out())

10
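To make the matrix above less of a black box, here is a minimal pure-Python sketch of what a count vectorizer does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count occurrences per document. The names `tokenize`, `vocabulary`, and `counts` are illustrative. The sketch mirrors the default behavior only and ignores options such as stop words or custom analyzers.

```python
import re
from collections import Counter

sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']

# Tokens are runs of two or more word characters, so punctuation is dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Lowercase a document and split it into tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

# The vocabulary is the sorted set of all tokens; each term gets one column.
vocabulary = sorted({tok for doc in sample_data for tok in tokenize(doc)})

# One row per document, one count per vocabulary term.
counts = []
for doc in sample_data:
    term_counts = Counter(tokenize(doc))
    counts.append([term_counts[term] for term in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

The first and last rows are identical because the two sentences contain the same words in a different order: a bag-of-words representation discards word order.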

## TfidfVectorizer

1. Import `TfidfVectorizer` from `sklearn.feature_extraction.text`
2. Create an instance of `TfidfVectorizer`
3. Fit and transform the data
4. Convert the resulting sparse matrix to a dense array
5. Get the feature names (the learned vocabulary, in column order)
6. Count the feature names to get the vocabulary size

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
tfidf_vectorizer = TfidfVectorizer()

In [13]:
x2 = tfidf_vectorizer.fit_transform(sample_data)

In [14]:
x2.toarray()

array([[0.        , 0.        , 0.58028582, 0.38408524, 0.        ,
        0.46979139, 0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.55690079, 0.        , 0.29061394, 0.        ,
        0.35546256, 0.55690079, 0.29061394, 0.        , 0.29061394],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.        , 0.58028582, 0.38408524, 0.        ,
        0.46979139, 0.        , 0.38408524, 0.        , 0.38408524]])

In [15]:
tfidf_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'paper', 'second', 'the',
       'third', 'this'], dtype=object)

In [16]:
len(tfidf_vectorizer.get_feature_names_out())

10
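The weights above can be reproduced by hand, which helps show where they come from. The sketch below assumes the `TfidfVectorizer` defaults: raw term counts, a smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and L2 normalization of each row. The helper names (`tokenize`, `tfidf_row`) are illustrative only, and other settings such as `sublinear_tf=True` or `norm=None` would give different numbers.

```python
import math
import re
from collections import Counter

sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']

def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [tokenize(d) for d in sample_data]
vocabulary = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed IDF: terms present in every document still get weight 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Term count times IDF, then scale the row to unit Euclidean length."""
    tf = Counter(doc)
    raw = [tf[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(doc) for doc in docs]
for row in tfidf:
    print([round(v, 8) for v in row])
```

In the first document, "first" (found in two documents) gets a higher weight than "paper" (found in three), and both outweigh "is", "the", and "this", which occur in every document.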

## N-grams vectorization

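1. Import `CountVectorizer` from `sklearn.feature_extraction.text`
2. Create an instance of `CountVectorizer` with `ngram_range=(2, 2)`, so that only bigrams (pairs of consecutive tokens) are extracted
3. Fit and transform the data
4. Convert the resulting sparse matrix to a dense array
5. Get the feature names (the learned bigrams, in column order)
6. Count the feature names to get the vocabulary size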

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
n_gram_vectorizer = CountVectorizer(ngram_range=(2, 2))

In [19]:
x3 = n_gram_vectorizer.fit_transform(sample_data)

In [20]:
x3.toarray()

array([[0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]])

In [21]:
n_gram_vectorizer.get_feature_names_out()

array(['and this', 'document is', 'first paper', 'is the', 'is this',
       'second paper', 'the first', 'the second', 'the third',
       'third one', 'this document', 'this is', 'this the'], dtype=object)

In [22]:
len(n_gram_vectorizer.get_feature_names_out())

13
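The bigram features above can also be derived by hand. The following is a minimal sketch, assuming the same default tokenization as before. The `ngrams` helper and the `bigram_*` variable names are introduced here for illustration and are not part of scikit-learn.

```python
import re

sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']

def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

def ngrams(tokens, n):
    """Slide a window of n tokens over the list and join each window with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigrams (n = 2) for every document.
bigram_docs = [ngrams(tokenize(doc), 2) for doc in sample_data]

# Sorted vocabulary of distinct bigrams, then one count row per document.
bigram_vocabulary = sorted({gram for doc in bigram_docs for gram in doc})
bigram_counts = [[doc.count(gram) for gram in bigram_vocabulary]
                 for doc in bigram_docs]

print(bigram_vocabulary)
for row in bigram_counts:
    print(row)
```

Unlike the unigram counts, the first and last rows now differ: "This is the first paper." and "Is this the first paper?" share the same words but not the same bigrams, so some word-order information is preserved.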