## Task: Convert the sentences in the variable documents to vectorized form.

### Import CountVectorizer from sklearn.feature_extraction.text, and import pandas

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [89]:
documents = [
    'The cat sat on the mat',
    'The dog is barking at the cat',
    'The cat and the dog are friends',
    'A quick brown fox jumps over the lazy dog'
]

## Approach 1: Using default parameters

In [90]:
cv = CountVectorizer()

In [91]:
X = cv.fit_transform(documents)
X_array = X.toarray()

In [92]:
df = pd.DataFrame(X_array, columns=cv.get_feature_names_out())

Tokenization is the process of breaking a text into smaller units called tokens. These tokens can be words, phrases, symbols, or other meaningful elements, depending on the context and the task at hand. Tokenization is a fundamental step in many natural language processing (NLP) tasks because models work on discrete units that can be counted and compared, not on raw strings.

Note that here CountVectorizer has tokenized each sentence into individual words, and each column of df counts how often one word occurs in each document.
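
To make the tokenization step concrete, here is a minimal sketch in plain Python of what we understand CountVectorizer to do with its default settings: lowercase the text, then keep runs of two or more word characters (the default token_pattern `(?u)\b\w\w+\b`). The names tokenize, tokenized, vocabulary, and counts are ours, chosen for illustration.

```python
import re
from collections import Counter

documents = [
    'The cat sat on the mat',
    'The dog is barking at the cat',
    'The cat and the dog are friends',
    'A quick brown fox jumps over the lazy dog'
]

# Assumed default token pattern of CountVectorizer: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokenized = [tokenize(doc) for doc in documents]

# The vocabulary is the sorted set of all tokens seen in any document
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row of counts per document, one column per vocabulary term
counts = []
for doc in tokenized:
    term_counts = Counter(doc)
    counts.append([term_counts[term] for term in vocabulary])

print(tokenized[3])  # the single-character token "a" is dropped
print(vocabulary)
```

Because the pattern requires at least two characters, the word "A" in the last sentence never becomes a feature, even without a stop-word list.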

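Now, how do we capture the information in a sequence of words, such as "brown fox" or "quick brown fox"?

We can use n-grams to do that. An n-gram is a run of n consecutive tokens, so "brown fox" is a 2-gram (bigram) and "quick brown fox" is a 3-gram (trigram).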
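
As a rough sketch of the idea, the function below builds every n-gram between a minimum and maximum length from a list of tokens. The function name word_ngrams and the sample tokens are illustrative only; this is not the library's own implementation.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all n-grams of length min_n..max_n, joined with spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = ["quick", "brown", "fox"]
print(word_ngrams(tokens, 1, 2))
# ['quick', 'brown', 'fox', 'quick brown', 'brown fox']
```

A range of (1, 2) therefore keeps the single words and adds every pair of neighbouring words as an extra feature.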

## Approach 2: Add ngram_range to the CountVectorizer

In [94]:
cv = CountVectorizer(ngram_range=(1,2))

In [95]:
X = cv.fit_transform(documents)
X_array = X.toarray()

In [96]:
df = pd.DataFrame(X_array, columns=cv.get_feature_names_out())

To reduce the number of features, we can remove stop words (very common words such as "the" and "on") by passing the stop_words parameter.

## Approach 3: Add stopwords

In [98]:
cv = CountVectorizer(ngram_range=(1,2),stop_words='english')

In [99]:
X = cv.fit_transform(documents)
X_array = X.toarray()

In [100]:
df = pd.DataFrame(X_array, columns=cv.get_feature_names_out())
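
One side effect is worth seeing in isolation: as far as we understand CountVectorizer, stop words are dropped before the n-grams are built, so a bigram can join two words that were not neighbours in the original sentence. The sketch below imitates that order of operations with a tiny, hypothetical stop-word set (STOP); the real 'english' list is much longer.

```python
# Hypothetical two-word stop list, for illustration only
STOP = {"the", "on"}

tokens = "the cat sat on the mat".split()

# Step 1: remove stop words
kept = [token for token in tokens if token not in STOP]

# Step 2: build bigrams from what is left
bigrams = [" ".join(kept[i:i + 2]) for i in range(len(kept) - 1)]

print(kept)     # ['cat', 'sat', 'mat']
print(bigrams)  # ['cat sat', 'sat mat']
```

Here "sat mat" becomes a feature even though the sentence reads "sat on the mat".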

## Approach 4: Add custom pre-processing steps

In [133]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Uncomment on the first run to download the required NLTK data
#nltk.download('stopwords')
#nltk.download('wordnet')

stop = stopwords.words('english')

In [134]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [136]:
# Custom pre-processing function with lemmatization
def custom_preprocessor(text):
    # Lowercase the text
    text = text.lower()

    # Remove everything except letters and whitespace (digits, punctuation)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize the text into words
    words = text.split()

    # Remove stop words
    words = [word for word in words if word not in stop]

    # Alternative: stem each word instead of lemmatizing it
    #stemmed_words = [stemmer.stem(word) for word in words]

    # Lemmatize each word
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Join the lemmatized words back into a single string
    processed_text = ' '.join(lemmatized_words)

    return processed_text
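
To see what the first two cleaning steps do on their own, here is a small sketch that isolates the lowercasing and the regex substitution. The helper name clean and the sample sentence are ours and are not part of the notebook.

```python
import re

def clean(text):
    """Lowercase the text and delete every character that is not a letter or whitespace."""
    text = text.lower()
    return re.sub(r'[^a-zA-Z\s]', '', text)

sample = "The dog's 2 friends!"
print(clean(sample).split())
# ['the', 'dogs', 'friends']
```

Note that the apostrophe is deleted, not replaced by a space, so "dog's" turns into "dogs". Keep in mind as well that a function passed as preprocessor replaces the vectorizer's built-in lowercasing and accent stripping, which is why the function lowercases the text itself.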

In [137]:
cv = CountVectorizer(ngram_range=(1,2),stop_words='english',preprocessor=custom_preprocessor)

In [139]:
X = cv.fit_transform(documents)
X_array = X.toarray()

In [140]:
df = pd.DataFrame(X_array, columns=cv.get_feature_names_out())

In [141]:
df

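| | barking | barking cat | brown | brown fox | cat | cat dog | cat sat | dog | dog barking | dog friend | ... | friend | jump | jump lazy | lazy | lazy dog | mat | quick | quick brown | sat | sat mat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |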


## Now that we understand the basics of vectorization, vectorize the documents in the variable documents using the TF-IDF vectorizer. Use English stop words, the custom preprocessor, and an ngram_range of (1,2)

In [151]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(preprocessor=custom_preprocessor,stop_words='english',ngram_range=(1,2))

In [153]:
X = tfidf.fit_transform(documents)

In [157]:
df = pd.DataFrame(X.toarray(),columns=tfidf.get_feature_names_out())
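
To show where the TF-IDF numbers come from, the sketch below reproduces the weighting by hand for single words only. It assumes TfidfVectorizer's default settings as we understand them (smooth_idf=True and norm='l2'), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and every row is scaled to unit length. The token lists in docs are written out by hand for illustration and are not the output of custom_preprocessor.

```python
import math
from collections import Counter

# Hand-written token lists standing in for the preprocessed documents (assumption)
docs = [
    ["cat", "sat", "mat"],
    ["dog", "barking", "cat"],
    ["cat", "dog", "friend"],
    ["quick", "brown", "fox", "jump", "lazy", "dog"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_row(doc):
    """Weight raw term counts by idf, then L2-normalize the row."""
    term_counts = Counter(doc)
    weights = {term: count * idf[term] for term, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "cat" gets a lower weight than "sat" or "mat" because it appears in three of the four documents, which is exactly the effect TF-IDF is meant to have compared with raw counts.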