## **BASIC VECTORIZATION TECHNIQUES**

## **1. Bag of Words**

**Advantages of Count Vectorizer**

Simple and straightforward implementation.
Effective for tasks where word frequency is a key feature.

**Disadvantages of Count Vectorizer**

Produces high-dimensional, sparse matrices, as any Bag-of-Words representation does.
Ignores the context and order of words.
Has limited ability to capture semantic meaning.
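
The loss of word order can be seen with a minimal sketch in plain Python, using `collections.Counter` as a stand-in for a count vectorizer. The `bag_of_words` helper and the two sentences are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

a = bag_of_words("The dog bites the man.")
b = bag_of_words("The man bites the dog.")

print(a)
print(a == b)  # both sentences map to the same bag of words
```

The two sentences mean different things, yet their count representations are identical, which is why order-sensitive tasks usually need more than raw counts.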

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus of text documents
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the corpus into a Bag-of-Words representation
X = vectorizer.fit_transform(corpus)

# Get the feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Print the Bag-of-Words matrix
print(X.toarray())

# Print the feature names
feature_names

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
#corpus = ['Text processing is necessary.', 'Text processing is necessary and important.', 'Text processing is easy.']
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
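
To make the mapping from text to matrix concrete, here is a rough pure-Python sketch that reproduces the output above. It assumes `CountVectorizer`'s default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper names `tokenize` and `bow_matrix` are hypothetical.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Assumed default behaviour: lowercase, keep words of at least two characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bow_matrix(docs):
    tokenized = [tokenize(d) for d in docs]
    # One column per distinct token, in alphabetical order
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    # Each row holds the count of every vocabulary term in one document
    matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]
    return vocabulary, matrix

vocabulary, matrix = bow_matrix(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, so the second row has a 2 in the `document` column because that word occurs twice in the second sentence.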


In [5]:
! pip install BagOfWords

Collecting BagOfWords
  Downloading bagofwords-1.0.4.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Discarding https://files.pythonhosted.org/packages/02/40/b2c96601b3a205a3ad511fd3342bbf0f23608a67348c6dcd09e9b44088b9/bagofwords-1.0.4.tar.gz (from https://pypi.org/simple/bagofwords/): Requested BagOfWords from https://files.pythonhosted.org/packages/02/40/b2c96601b3a205a3ad511fd3342bbf0f23608a67348c6dcd09e9b44088b9/bagofwords-1.0.4.tar.gz has inconsistent version: expected '1.0.4', but metadata has '1.0.3'
  Downloading bagofwords-1.0.1.tar.gz (11 kB)
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generati

## **2. Count Vectorizer**

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log mat.",
    "Cats and dogs are pets."
]

# Initialize CountVectorizer
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(documents)
# Convert the sparse matrix to a dense array and print it, followed by the vocabulary
print(X_count.toarray())
print(count_vectorizer.get_feature_names_out())

[[0 0 1 0 0 0 0 1 1 0 1 2]
 [0 0 0 0 1 0 1 1 1 0 1 2]
 [1 1 0 1 0 1 0 0 0 1 0 0]]
['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'log' 'mat' 'on' 'pets' 'sat' 'the']
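
Even in this tiny example more than half of the entries are zero, and the matrix gains a column for every new word (note that `cat` and `cats` are separate features). As a quick illustration, the following sketch counts the zero entries; the `matrix` literal is copied from the output above.

```python
# Count matrix copied from the CountVectorizer output above
matrix = [
    [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 2],
    [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 2],
    [1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
]

total = sum(len(row) for row in matrix)
zeros = sum(value == 0 for row in matrix for value in row)

print(f"{zeros} of {total} entries are zero ({zeros / total:.0%})")
```

With a realistic corpus the vocabulary grows to many thousands of columns while each document still uses only a handful of them, which is why scikit-learn returns a sparse matrix rather than a dense array.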


## **3. TF-IDF**

Term Frequency (TF): Measures the frequency of a word in a document.

$$
\mathrm{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

**Disadvantages of TF-IDF**

Still results in sparse matrices.
Does not capture word order or context.
Computationally more expensive than BoW.
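
As a quick check of the term-frequency definition given earlier in this section, here is a minimal sketch in plain Python. The `term_frequency` helper and its simple `\w+` tokenization are assumptions made for illustration.

```python
import re

def term_frequency(term, document):
    """TF(t, d): occurrences of `term` divided by the number of terms in `document`."""
    tokens = re.findall(r"\w+", document.lower())
    return tokens.count(term.lower()) / len(tokens)

doc = "The cat sat on the mat."
print(term_frequency("the", doc))  # "the" is 2 of the 6 tokens
print(term_frequency("cat", doc))  # "cat" is 1 of the 6 tokens
```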

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets."
]

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the TF-IDF matrix
print(tfidf_matrix.toarray())

# Print the feature names
feature_names

[[0.         0.         0.42755362 0.         0.         0.
  0.         0.42755362 0.32516555 0.         0.32516555 0.6503311 ]
 [0.         0.         0.         0.         0.42755362 0.
  0.42755362 0.         0.32516555 0.         0.32516555 0.6503311 ]
 [0.4472136  0.4472136  0.         0.4472136  0.         0.4472136
  0.         0.         0.         0.4472136  0.         0.        ]]


array(['and', 'are', 'cat', 'cats', 'dog', 'dogs', 'log', 'mat', 'on',
       'pets', 'sat', 'the'], dtype=object)
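
The numbers above can be reproduced by hand, which helps show what the vectorizer is doing. The sketch below assumes `TfidfVectorizer`'s default settings: the raw count is used as the term frequency (not the length-normalized TF defined earlier), the inverse document frequency is smoothed as `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit L2 norm. The variable names are illustrative.

```python
import math
import re

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

# Assumed default tokenization: lowercase, words of at least two characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in documents]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(documents)

# Smoothed inverse document frequency for every term
idf = {}
for term in vocabulary:
    df = sum(term in doc for doc in tokenized)  # documents containing the term
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for doc in tokenized:
    # Raw count times IDF, then L2-normalize the row
    weights = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])

for row in rows:
    print([round(w, 4) for w in row])
```

Terms that occur in only one document, such as `cat` or `mat`, receive a higher IDF than `on` or `sat`, which appear in two. The word `the` still has the largest weight in the first two rows because it occurs twice in each of those sentences.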

## **ADVANCED VECTORIZATION TECHNIQUES**

## **1. Word Embeddings**