# Text analysis and feature extraction
To feed text to a model, we first need to transform it into numerical features. This notebook shows how to build a bag-of-words model from text that can later be used in different applications.

## Bag of words:
The "Bag of Words" (BoW) is a common and straightforward technique used in natural language processing (NLP) for text analysis and feature extraction. It represents a document as a collection, or "bag," of individual words, disregarding grammar, word order, and context, and focusing solely on word frequency.

The example below counts the occurrences of each word in a small two-document corpus.

In [21]:
# Import the pandas library and alias it as 'pd'
import pandas as pd

# Import the CountVectorizer class from scikit-learn's feature_extraction.text module
from sklearn.feature_extraction.text import CountVectorizer

# Define a list of text documents
texts = ['I love natural language processing.', 'NLP is fascinating, and I want to learn more.']

# Create an instance of the CountVectorizer class
vectorizer = CountVectorizer()

# Fit the vectorizer on the text data to build the vocabulary
vectorizer.fit(texts)

# Transform the text data into a document-term matrix (BoW representation)
x = vectorizer.transform(texts)

# Get the feature (word) names from the vocabulary
columns = vectorizer.get_feature_names_out()

# Create a pandas DataFrame using the BoW matrix, with columns representing words and rows representing documents
# This DataFrame visualizes the frequency of each word in each document
pd.DataFrame(x.todense(), columns=columns, index=texts)




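| Document | and | fascinating | is | language | learn | love | more | natural | nlp | processing | to | want |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I love natural language processing. | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| NLP is fascinating, and I want to learn more. | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |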


In [20]:
vectorizer.get_feature_names_out()

array(['and', 'fascinating', 'is', 'language', 'learn', 'love', 'more',
       'natural', 'nlp', 'processing', 'to', 'want'], dtype=object)
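
To make the counting step concrete, here is a minimal pure-Python sketch of the same idea. It assumes a simplified tokenizer (lowercase the text, then keep tokens of two or more word characters), which is meant to mirror `CountVectorizer`'s default behavior and would explain why the one-letter word "I" is missing from the vocabulary above. The names `tokenize`, `vocabulary`, and `bow_matrix` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

texts = ['I love natural language processing.', 'NLP is fascinating, and I want to learn more.']

# Assumption: tokens are runs of two or more word characters, taken from lowercased text
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# The vocabulary is the sorted set of all tokens seen in the corpus
vocabulary = sorted({token for text in texts for token in tokenize(text)})

# One Counter per document, then one row of counts per document
counts = [Counter(tokenize(text)) for text in texts]
bow_matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row of `bow_matrix` lines up with one row of the DataFrame shown earlier.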


## Stop-words

Stop-words are extremely common words that appear frequently in text but carry little meaning on their own and are not significant to the topic at hand. English examples include `[am, is, are, in, at, the, and, of, to, a, an, ...]`; in many applications they can be removed without losing meaning.

Stop-words can also be domain specific. For example, if you are processing chatbot data, phrases such as `[can you please, would you please, can I, may I, ...]` appear in most messages and add little meaning. TF-IDF can help you find such words.
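
One rough way to surface domain-specific stop-word candidates is to look at document frequency, the same quantity that TF-IDF penalizes. The sketch below uses made-up chatbot messages and an arbitrary threshold (a word must occur in every message). Both are assumptions chosen for illustration, not a recommended setting.

```python
import re
from collections import Counter

# Hypothetical chatbot messages, invented for illustration only
messages = [
    'Can you please reset my password?',
    'Can you please cancel my order?',
    'Can you please track my parcel?',
]


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Document frequency: in how many messages each word occurs at least once
doc_freq = Counter(token for message in messages for token in set(tokenize(message)))

# Assumption: a word present in every message is a stop-word candidate
candidates = sorted(word for word, df in doc_freq.items() if df == len(messages))

# Remove the candidates from each message
filtered = [[t for t in tokenize(m) if t not in candidates] for m in messages]

print(candidates)
print(filtered)
```

On a real corpus a softer threshold (for example, words that occur in most but not all documents) is usually more practical, and the resulting list should be reviewed by hand.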

In [23]:
# Define a list of text documents
texts = ['I love natural language processing.', 'NLP is fascinating, and I want to learn more.']

# Create an instance of the CountVectorizer class and specify that you want to remove English stop words
vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer on the text data to build the vocabulary and remove stop words
vectorizer.fit(texts)

# Transform the text data into a document-term matrix (BoW representation)
x = vectorizer.transform(texts)

# Get the feature (word) names from the vocabulary
columns = vectorizer.get_feature_names_out()

# Create a pandas DataFrame using the BoW matrix, with columns representing words and rows representing documents
# This DataFrame visualizes the frequency of each remaining word in each document after stop words removal
df = pd.DataFrame(x.todense(), columns=columns, index=texts)

df


| Document | fascinating | language | learn | love | natural | nlp | processing | want |
|---|---|---|---|---|---|---|---|---|
| I love natural language processing. | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| NLP is fascinating, and I want to learn more. | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |


> Note that the words `and`, `is`, `more`, and `to` were removed here.

## N-Grams

N-grams are a way to capture some of the context in a text. The larger the n-gram range, the more context you capture, but also the more features you generate, so be careful not to run out of memory.

In natural language processing (NLP), N-grams are contiguous sequences of N items, where the items are typically words, characters, or tokens extracted from a text or speech. N-grams are used to capture the local linguistic structure and the relationships between neighboring words in a given piece of text. 

Here are some common types of N-grams:

* Unigrams (1-grams): These are single words in a text. For example, in the sentence "I love NLP," the unigrams are ["I," "love," "NLP"].

* Bigrams (2-grams): Bigrams consist of pairs of consecutive words in a text. For the same sentence, the bigrams would be ["I love," "love NLP"].

* Trigrams (3-grams): Trigrams are sequences of three consecutive words in a text. For example, "I love NLP" would have trigrams ["I love NLP"].

* 4-grams, 5-grams, and so on: These refer to sequences of N consecutive words in a text, where N can be any positive integer. For example, the 4-grams for the sentence "I love natural language processing" would be ["I love natural language," "love natural language processing"].
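
The definitions above can be sketched in a few lines of plain Python. The helper name `ngrams` is hypothetical, and the sketch assumes the text has already been split into word tokens.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a space-joined string."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = 'I love natural language processing'.split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 4))  # 4-grams
```

A sentence with k tokens yields k - n + 1 n-grams, so the function returns an empty list when n is larger than the sentence.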

In [31]:
# Define a list of text documents
texts = ['I love natural language processing.', 'NLP is fascinating, and I want to learn more.']

# Create an instance of the CountVectorizer class and specify:
# - stop_words='english': Remove common English stop words
# - ngram_range=(1, 2): Generate both unigrams and bigrams
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))

# Fit the vectorizer on the text data to build the vocabulary, remove stop words, and create n-grams
vectorizer.fit(texts)

# Transform the text data into a document-term matrix (BoW representation)
x = vectorizer.transform(texts)

# Get the feature (word or n-gram) names from the vocabulary
columns = vectorizer.get_feature_names_out()

# Create a pandas DataFrame using the BoW matrix, with columns representing words or n-grams
# and rows representing documents. This DataFrame visualizes the frequency of each remaining word or n-gram in each document
pd.DataFrame(x.todense(), columns=columns, index=texts)



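| Document | fascinating | fascinating want | language | language processing | learn | love | love natural | natural | natural language | nlp | nlp fascinating | processing | want | want learn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I love natural language processing. | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| NLP is fascinating, and I want to learn more. | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |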


It's important to note that while N-grams provide valuable context information, they can also lead to high-dimensional feature spaces, especially when dealing with a large vocabulary. In practice, N-grams are often used in combination with techniques like feature selection and dimensionality reduction to manage the complexity of text data.

## TF-IDF

TF-IDF stands for "Term Frequency-Inverse Document Frequency." It is a numerical statistic used in natural language processing (NLP) and information retrieval to evaluate the importance of a word within a document relative to a collection of documents (corpus). TF-IDF is a technique for text feature extraction that helps in quantifying the relevance of words in a document to a specific search query or task.

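* Instead of just counting how often each word occurs, TF-IDF weights each count by how informative the word is.

* Words that appear in only a few documents of the corpus receive higher weights; words that appear in almost every document receive weights close to zero.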

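$$W_{x, y} = tf_{x, y} \times \log\left(\frac{N}{df_x}\right)$$

where $tf_{x, y}$ is the number of times term $x$ occurs in document $y$, $df_x$ is the number of documents that contain term $x$, and $N$ is the total number of documents.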
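
As a worked example of the formula above, the following sketch computes the weights by hand for a tiny, made-up corpus that is assumed to be tokenized already. It uses the natural logarithm and applies the formula exactly as written, so the numbers will generally differ from what a library implementation with smoothing or normalization returns.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized (invented for illustration)
docs = [
    ['love', 'natural', 'language', 'processing'],
    ['love', 'learning', 'language'],
    ['nlp', 'nlp', 'fascinating'],
]

N = len(docs)  # total number of documents

# df[x]: number of documents that contain term x
df = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return {term: tf * log(N / df)} for a single tokenized document."""
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}


weights = [tfidf(doc) for doc in docs]

for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

Here `love` occurs in two of the three documents and therefore gets a lower weight than `natural`, which occurs in only one, while `nlp` is boosted because it occurs twice in its document.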

In [34]:
# Import the TfidfVectorizer class from scikit-learn's feature_extraction.text module
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a list of text documents
texts = ['I love natural language processing.', 'NLP is fascinating, and I want to learn more.']

# Create an instance of the TfidfVectorizer class and specify:
# - stop_words='english': Remove common English stop words
# - ngram_range=(1, 2): Generate both unigrams and bigrams
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))

# Fit the vectorizer on the text data to build the TF-IDF vocabulary, remove stop words, and create TF-IDF features
vectorizer.fit(texts)

# Transform the text data into a TF-IDF matrix
x = vectorizer.transform(texts)

# Get the feature (word or n-gram) names from the TF-IDF vocabulary
columns = vectorizer.get_feature_names_out()

# Create a pandas DataFrame using the TF-IDF matrix, with columns representing words or n-grams
# and rows representing documents. This DataFrame visualizes the TF-IDF scores for each term in each document.
pd.DataFrame(x.todense(), columns=columns, index=texts)




| Document | fascinating | fascinating want | language | language processing | learn | love | love natural | natural | natural language | nlp | nlp fascinating | processing | want | want learn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I love natural language processing. | 0.0 | 0.0 | 0.377964 | 0.377964 | 0.0 | 0.377964 | 0.377964 | 0.377964 | 0.377964 | 0.0 | 0.0 | 0.377964 | 0.0 | 0.0 |
| NLP is fascinating, and I want to learn more. | 0.377964 | 0.377964 | 0.0 | 0.0 | 0.377964 | 0.0 | 0.0 | 0.0 | 0.0 | 0.377964 | 0.377964 | 0.0 | 0.377964 | 0.377964 |

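Every non-zero score in this output is 0.377964, which the plain formula does not produce directly. A likely explanation, assuming `TfidfVectorizer` is running with its default settings (`smooth_idf=True`, `norm='l2'`), is that each document has 7 terms, every term occurs once and in exactly one document, and each row is then scaled to unit length. The arithmetic can be checked by hand:

```python
import math

n_docs = 2         # documents in the corpus
df = 1             # every term occurs in exactly one document
tf = 1             # and exactly once in that document
terms_per_doc = 7  # unigrams + bigrams left in each document

# Assumption: smoothed IDF, ln((1 + N) / (1 + df)) + 1
idf = math.log((1 + n_docs) / (1 + df)) + 1
raw = tf * idf

# L2 normalization: divide by the Euclidean length of the row
l2_norm = math.sqrt(terms_per_doc * raw ** 2)
score = raw / l2_norm

print(round(score, 6))  # 0.377964
```

Because all seven raw weights in a row are equal, the IDF value cancels out and the score reduces to 1 / sqrt(7).
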

## Simple Application


In [37]:
# Import necessary libraries
from collections import Counter  # For counting elements
import random  # For generating random values
from termcolor import colored  # For colored console output
from sklearn.datasets import fetch_20newsgroups  # For loading the 20 Newsgroups dataset
import numpy as np  # For numerical operations
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances  # For cosine similarity calculations
from sklearn.model_selection import train_test_split  # For data splitting

# Load the 20 Newsgroups dataset, specifically the 'test' subset,
# while removing 'headers', 'footers', and 'quotes'.
# Select categories related to 'rec.autos', 'comp.windows.x', 'soc.religion.christian', and 'rec.sport.baseball'.
data = fetch_20newsgroups(subset='test', remove=['headers', 'footers', 'quotes'],
                         categories=['rec.autos', 'comp.windows.x', 
                                     'soc.religion.christian', 'rec.sport.baseball'])

# Extract the text data (documents) and labels
x = data.data  # Text data
y = [data.target_names[i] for i in data.target]  # Labels

# Print the first document (data) and its corresponding label
print(f'DATA : {x[0]}')
print(f'LABEL: {y[0]}')


DATA : With all the recent problems the Indians have been having
with their pitching staff I have heard numerous names
thrown around about who could solve their problem.

One name I have not heard is Mike Soper (RP).  As far as
I know, Soper has had pretty good minor league stats.
Why not give the kid a chance?  Anyone know anything about
this guy?

-- 
LABEL: rec.sport.baseball


In [38]:
Counter(y)

Counter({'rec.sport.baseball': 397,
         'soc.religion.christian': 398,
         'comp.windows.x': 395,
         'rec.autos': 396})

In [39]:
# split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, stratify=y, random_state=42)

Let's retrieve the 3 most similar training articles for 5 randomly chosen test articles.

In [42]:
# Create an instance of the CountVectorizer class and specify that you want to remove common English stop words
vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer on the training data (x_train) to build the vocabulary and remove stop words
vectorizer.fit(x_train)

# Transform the training data (x_train) into a document-term matrix (BoW representation)
x_train_v = vectorizer.transform(x_train)

# Transform the test data (x_test) into a document-term matrix (BoW representation) using the same vocabulary
x_test_v = vectorizer.transform(x_test)


In [52]:
# ANSI escape codes used to color the neighbor labels and similarity scores
GREEN = '\033[92m'
RED = '\033[91m'
YELLOW = '\033[93m'
RESET = '\033[0m'

# Loop through 5 random test samples (random.choices samples with replacement)
for i in random.choices(range(len(x_test)), k=5):
    print(f"ID: {i}")

    # Print the true label in green; termcolor expects a color name, not an escape code
    print("True label:", colored(y_test[i], 'green'))

    # Calculate cosine similarity between the current test sample and all training samples
    similarities = cosine_similarity(x_test_v[i], x_train_v).flatten()

    # Sort the training indices by similarity in descending order
    indices = np.argsort(similarities)[::-1]

    # Loop through the top 3 nearest neighbors
    for rank, j in enumerate(indices[:3]):
        # Print the neighbor's label in green if it matches the true label, or in red if it doesn't
        label_color = GREEN if y_train[j] == y_test[i] else RED
        print(f"{rank} nearest label is {label_color}{y_train[j]}{RESET}",
              f"similarity: {YELLOW}{round(similarities[j], 3)}{RESET}")


ID: 192
True label: comp.windows.x
0 nearest label is comp.windows.x similarity: 0.214
1 nearest label is comp.windows.x similarity: 0.204
2 nearest label is rec.autos similarity: 0.164
ID: 176
True label: soc.religion.christian
0 nearest label is soc.religion.christian similarity: 0.339
1 nearest label is soc.religion.christian similarity: 0.333
2 nearest label is soc.religion.christian similarity: 0.312
ID: 131
True label: comp.windows.x
0 nearest label is comp.windows.x similarity: 0.577
1 nearest label is comp.windows.x similarity: 0.469
2 nearest label is comp.windows.x similarity: 0.469
ID: 304
True label: comp.windows.x
0 nearest label is comp.windows.x similarity: 0.356
1 nearest label is comp.windows.x similarity: 0.196
2 nearest label is rec.autos similarity: 0.19
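
The similarity scores above are cosine similarities between bag-of-words vectors. As a rough illustration of what that measure computes, here is a pure-Python sketch on made-up token counts. The function name `cosine_similarity_counts` and the tiny documents are hypothetical, and the sketch is not meant to replace scikit-learn's implementation.

```python
import math
from collections import Counter


def cosine_similarity_counts(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical query and "training" documents, invented for illustration
query = Counter(['pitcher', 'threw', 'ball'])
train_docs = {
    'baseball': Counter(['pitcher', 'ball', 'game']),
    'autos': Counter(['engine', 'car', 'ball']),
    'windows': Counter(['window', 'server']),
}

# Rank the training documents from most to least similar to the query
ranked = sorted(
    ((label, cosine_similarity_counts(query, doc)) for label, doc in train_docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for label, score in ranked:
    print(label, round(score, 3))
```

Documents that share more words with the query, relative to their length, rank higher, which is the same principle the nearest-neighbor lookup above relies on.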