<a href="https://colab.research.google.com/github/RDGopal/IB9LQ0-GenAI/blob/main/Prediction_and_Semantic_Search_with_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prediction and Semantic Search with Embeddings

In [None]:
!pip install gensim

**After installing gensim, select Runtime > Restart session before continuing.**

## Prediction
We will continue with the `sms_spam.csv` dataset to predict whether an SMS message is spam. This time we use embeddings of the message text as features and evaluate how well they predict the label.

In [None]:
import pandas as pd
import nltk

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/RDGopal/IB9LQ0-GenAI/main/Data/sms_spam.csv')
df

## Sentence Embeddings
Sentence embeddings are high-dimensional vector representations of sentences, capturing the semantic properties of the text. Unlike word embeddings that represent individual words, sentence embeddings represent the entire input sentence.

### Sentence Embeddings vs. Word Embeddings

* Granularity: Word embeddings represent individual words, whereas sentence embeddings encapsulate the meaning of full sentences or even larger text chunks.
* Context Sensitivity: Static word embeddings such as Word2Vec assign each word one fixed vector, regardless of its contextual use. Sentence embeddings, on the other hand, consider the context of the entire sentence, so the representation changes with how the words are used together.
* Use Cases: Word embeddings are useful for tasks like word similarity and word analogy, while sentence embeddings are better suited for tasks that require understanding of larger text units, such as document classification, sentiment analysis, and question answering.
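
To make the context-sensitivity point concrete, here is a minimal sketch with a made-up three-word vocabulary. The vectors are invented for illustration and do not come from any trained model, and `average_embedding` is a hypothetical helper. Averaging static word vectors is a common baseline for building a sentence vector, and it gives the same result for two sentences with opposite meanings:

```python
# Toy static word vectors (hypothetical values, for illustration only)
word_vectors = {
    "dog":   [1.0, 0.0, 0.5],
    "bites": [0.0, 1.0, 0.5],
    "man":   [0.5, 0.5, 1.0],
}

def average_embedding(sentence):
    """Average the static vectors of the words in a sentence."""
    vectors = [word_vectors[w] for w in sentence.lower().split()]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

print(a)
print(a == b)  # True: the average cannot tell the two sentences apart
```

Sentence-level models aim to avoid this limitation by encoding how the words relate to each other, not just which words are present.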

## Doc2Vec: Computation of Embeddings
* Mechanism: Developed as an extension of the Word2Vec model, Doc2Vec (also known as Paragraph Vector) embeds words in a vector space and adds a unique vector (document ID) that represents the document (or sentence) itself. It learns to predict words in a document while also maintaining a unique document vector.
* Training: Doc2Vec can be trained using two architectures:
  * Distributed Memory (DM): Similar to Word2Vec’s CBOW model, but adds a paragraph token.
  * Distributed Bag of Words (DBOW): Similar to Word2Vec’s Skip-gram, but predicts words randomly sampled from the paragraph, ignoring context words.

By default, Gensim’s Doc2Vec uses the Distributed Memory (DM) model (`dm=1`). DM slides a window over the document and tries to predict each word from its surrounding context words together with a special vector that represents the document (or sentence). It is analogous to the Continuous Bag of Words (CBOW) model used in Word2Vec, with the paragraph (document) vector added to the context.

**Key Characteristics of DM:**

It attempts to predict a word based on the context words and a unique document identifier. It generally produces more coherent embeddings for larger documents where the order of words contributes more meaningfully to the semantic content.
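
The DM prediction step can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not Gensim's implementation: the vocabulary, the dimensions, and the `dm_predict` helper are all hypothetical, the parameters are random rather than trained, and a full softmax is used for clarity where real implementations typically use cheaper approximations such as negative sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["free", "prize", "call", "now", "meeting"]
dim = 4  # toy embedding size

# Randomly initialised parameters (a trained model would learn these)
word_vecs = rng.normal(size=(len(vocab), dim))       # one vector per word
doc_vecs = rng.normal(size=(2, dim))                 # one vector per document
output_weights = rng.normal(size=(dim, len(vocab)))  # maps hidden state to word scores

def dm_predict(doc_id, context_words):
    """Return a probability for every vocabulary word being the target word."""
    context_ids = [vocab.index(w) for w in context_words]
    # DM combines the document vector with the context word vectors (here: mean)
    hidden = np.vstack([doc_vecs[doc_id], word_vecs[context_ids]]).mean(axis=0)
    scores = hidden @ output_weights
    exp_scores = np.exp(scores - scores.max())       # numerically stable softmax
    return exp_scores / exp_scores.sum()

probs = dm_predict(doc_id=0, context_words=["free", "call", "now"])
for word, p in zip(vocab, probs):
    print(f"{word:8s} {p:.3f}")
```

During training, the document vector is updated along with the word vectors so that the true target word receives a higher probability. That is how the document vector comes to summarise the document's content.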

If you wanted to use the other training method, the Distributed Bag of Words (DBOW), you would need to specify this when initializing the Doc2Vec model with the parameter `dm=0`:


`model = Doc2Vec(vector_size=40, min_count=2, epochs=30, dm=0)`

**Comparison of DM and DBOW:**

* DM (Distributed Memory):
  * Better for understanding semantic similarity.
  * Uses the words surrounding the current position, plus the document vector, to predict the current word.
  * Typically results in higher-quality embeddings where word order matters.
* DBOW (Distributed Bag of Words):
  * Ignores word order, so it is faster to train.
  * Predicts words sampled at random from the paragraph, using only the document vector.
  * Can be less memory intensive, as it does not need to store word vectors during training.


The choice between DM and DBOW often depends on the specific requirements of the application, the nature of the data, and computational resources. DM is usually preferred when the quality of the embeddings is paramount, while DBOW can be favored for its speed and lower resource consumption.
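
One way to see the difference between the two architectures is to look at the training examples each one builds from a document. The sketch below is a simplification with hypothetical helper names (`dm_examples`, `dbow_examples`). It assumes a symmetric context window, mirroring CBOW, and a fixed number of sampled words per document.

```python
import random

def dm_examples(doc_id, words, window=2):
    """DM: (document tag + surrounding words) -> target word."""
    examples = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        examples.append(((doc_id, tuple(context)), target))
    return examples

def dbow_examples(doc_id, words, n_samples=3, seed=0):
    """DBOW: document tag alone -> a word sampled from the document."""
    rng = random.Random(seed)
    return [(doc_id, rng.choice(words)) for _ in range(n_samples)]

words = "win a free prize now".split()
dm = dm_examples("doc_0", words)
dbow = dbow_examples("doc_0", words)

print(dm[2])   # (('doc_0', ('win', 'a', 'prize', 'now')), 'free')
print(dbow)    # pairs of ('doc_0', <sampled word>)
```

DM inputs carry the neighbouring words, so the model needs word vectors as well as the document vector. DBOW inputs carry only the document tag, which is why it is cheaper to train.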

We will create 40-dimensional Doc2Vec embeddings of the text and use these for predicting the outcome.

In [None]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

# Preprocess and tag each message in the dataset
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(df['text'])]


In [None]:
# Define and train the Doc2Vec model
model = Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Create embeddings and expand each element into separate columns
embeddings = [model.infer_vector(word_tokenize(text.lower())) for text in df['text']]
df_embeddings = pd.DataFrame(embeddings, columns=[f'embed_{i}' for i in range(len(embeddings[0]))])
df = pd.concat([df, df_embeddings], axis=1)

In [None]:
df

## Prediction with Doc2Vec Embeddings

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, auc, classification_report
import matplotlib.pyplot as plt
import numpy as np



# Select embedding columns as features. This assumes embedding column names are like 'embed_0', 'embed_1', ..., 'embed_39'
X = df.loc[:, df.columns.str.startswith('embed_')]
y = df['type'].apply(lambda x: 1 if x == 'spam' else 0)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_classifier.predict(X_test)
y_prob = rf_classifier.predict_proba(X_test)[:, 1]  # probabilities for ROC curve

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

# Feature Importance
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature ranking:")

for f in range(X_train.shape[1]):
    print("%d. embedding %d (%f)" % (f + 1, indices[f], importances[indices[f]]))


## Sentence Transformers: Computation of Embeddings

* Mechanism: Sentence-Transformers adapts pre-trained BERT or other transformer models to produce meaningful sentence embeddings. It uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
* Training: Training typically involves fine-tuning a transformer model on a dataset of sentence pairs labeled with a similarity measure. The goal is to learn embeddings such that similar sentences are close in vector space and dissimilar sentences are far apart.
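
The training goal described above can be illustrated with a triplet loss. This is a minimal sketch, not the library's training code. The two-dimensional vectors are hypothetical stand-ins for sentence embeddings, and cosine distance is used for simplicity (other distance measures are also used in practice).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = 1 - cosine_similarity(anchor, positive)
    d_neg = 1 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical embeddings: the positive points in nearly the same direction as the anchor
anchor = [1.0, 0.0]
positive = [0.8, 0.6]
negative = [0.0, 1.0]

print(triplet_loss(anchor, positive, negative))  # already well separated
print(triplet_loss(anchor, negative, positive))  # roles swapped: large loss
```

Fine-tuning adjusts the transformer's weights to push this kind of loss toward zero across many triplets, which is what places similar sentences close together in the embedding space.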

### Advantages and Disadvantages

**Doc2Vec:**

* Advantages:
  * Good at capturing semantic meaning of longer texts.
  * Does not require labeled data, as it uses unsupervised learning.
* Disadvantages:
  * Inferior in capturing nuances compared to more advanced models like BERT.
  * Requires careful hyperparameter tuning and significant training data to perform well.

**Sentence-Transformers:**

* Advantages:
  * Produces high-quality embeddings that are effective for many NLP tasks.
  * Can leverage pre-trained transformer models which have been trained on vast amounts of data.
* Disadvantages:
  * Computationally expensive: fine-tuning in particular calls for powerful hardware (GPUs), and inference on large datasets is much faster with a GPU.
  * Fine-tuning can overfit on smaller or less diverse datasets.

Models in sentence-transformers have fixed embedding sizes determined by their architecture. For example, models based on BERT typically produce embeddings of size 768, whereas smaller models like `all-MiniLM-L6-v2` produce embeddings of size 384.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/RDGopal/IB9LQ0-GenAI/main/Data/sms_spam.csv')

In [None]:
%pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings in batches (faster than encoding one text at a time) and expand into separate columns
hf_embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)
df_hf_embeddings = pd.DataFrame(hf_embeddings, columns=[f'hf_embed_{i}' for i in range(hf_embeddings.shape[1])])
df = pd.concat([df, df_hf_embeddings], axis=1)

# Check the dataframe
print(df.head())


In [None]:
df

### Reduce dimensionality
Sometimes you want to reduce the dimensionality of the embedding vector. In our case, let's say that we want to reduce from 384 dimensions to 40 dimensions. The basic approach is to apply dimensionality reduction techniques like PCA after generating embeddings.
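
To show what PCA does under the hood, here is a minimal NumPy sketch that reduces 384-dimensional vectors to 40 dimensions via the singular value decomposition. The input is random data standing in for real sentence embeddings, so the printed variance figure is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 384))   # stand-in for real sentence embeddings

n_components = 40
centered = embeddings - embeddings.mean(axis=0)   # PCA works on mean-centred data
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Project onto the directions of greatest variance
reduced = centered @ components[:n_components].T

# Share of the total variance kept by the first 40 components
explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()

print(reduced.shape)
print(f"variance retained: {explained:.1%}")
```

Checking the retained variance is a quick way to judge whether 40 dimensions is a reasonable choice. With scikit-learn's `PCA`, the same information is available from the fitted object's `explained_variance_ratio_` attribute.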

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/RDGopal/IB9LQ0-GenAI/main/Data/sms_spam.csv')
df

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Load a pre-trained model from sentence-transformers
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all texts in batches (faster than encoding one text at a time)
hf_embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)

# Convert the embedding matrix into a DataFrame
embeddings_df = pd.DataFrame(hf_embeddings)

# Initialize PCA to reduce to 40 dimensions
pca = PCA(n_components=40)
pca_result = pca.fit_transform(embeddings_df.values)

# Convert the PCA result into a DataFrame and set appropriate column names
df_pca_embeddings = pd.DataFrame(pca_result, columns=[f'pca_embed_{i}' for i in range(40)])

# Drop any existing PCA embedding columns first to avoid duplication
df = df.drop(columns=[col for col in df.columns if col.startswith('pca_embed_')], errors='ignore')

# Concatenate the original DataFrame with the PCA embeddings DataFrame
df = pd.concat([df, df_pca_embeddings], axis=1)

# Check the new DataFrame structure
print(df.head())


## Prediction with Sentence-Transformers

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, auc, classification_report
import matplotlib.pyplot as plt
import numpy as np


# Select the PCA-reduced embedding columns ('pca_embed_0', ..., 'pca_embed_39') as features
# and encode the target 'type' as 1 for 'spam' and 0 for 'ham'
X = df.loc[:, df.columns.str.startswith('pca_embed_')]
y = df['type'].apply(lambda x: 1 if x == 'spam' else 0)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_classifier.predict(X_test)
y_prob = rf_classifier.predict_proba(X_test)[:, 1]  # probabilities for ROC curve

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

# Feature Importance
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature ranking:")

for f in range(X_train.shape[1]):
    print("%d. embedding %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Your Turn
Read the `fakenews1000.csv` file and predict fake news based on the embeddings of the `text`.

## Semantic Search
Semantic search is an approach that aims to understand the meaning behind a search query, instead of just matching keywords, in order to return more relevant results. Embeddings are vector representations of text that capture its meaning, so we can use them to represent both the query and the documents being searched. In this exercise, we will perform a semantic search using embeddings.
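
The contrast between keyword matching and embedding-based matching can be sketched with a toy example. The three-dimensional vectors below are hypothetical and hand-picked for illustration. In the exercise, a trained model produces the embeddings.

```python
import math

def keyword_overlap(query, document):
    """Number of distinct words the query and the document share."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings (a real model would compute these from the text)
embeddings = {
    "car repair":                 [0.9, 0.1, 0.0],
    "automobile maintenance":     [0.8, 0.2, 0.1],
    "repair a broken friendship": [0.1, 0.2, 0.9],
}

query = "car repair"
for doc in ["automobile maintenance", "repair a broken friendship"]:
    overlap = keyword_overlap(query, doc)
    similarity = cosine(embeddings[query], embeddings[doc])
    print(f"{doc!r}: shared words = {overlap}, cosine = {similarity:.2f}")
```

Keyword matching favours the second document because it shares the word "repair", while the embedding comparison favours the first because its meaning is closer to the query.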


### Read the PDF text
* Read the PDF document `docAI.pdf`.
* Split the text into sentences and store in a dataframe.
* Clean and preprocess the text.
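
Before running these steps on the PDF, it helps to see what the splitting and cleaning do to a short string. The sketch below applies the same regular expressions used later in this notebook to a made-up sample text.

```python
import re

sample = "AI is changing business. Firms adopt it fast!  Costs fell 20% in 2023. The end"

# Split wherever a period is followed by whitespace
sentences = re.split(r'\.\s+', sample)

cleaned = []
for s in sentences:
    s = s.lower()
    s = re.sub(r'[^a-z\s]', '', s)   # keep only letters and whitespace
    s = re.sub(r'\s+', ' ', s)       # collapse runs of whitespace
    cleaned.append(s)

for original, clean in zip(sentences, cleaned):
    print(repr(original), '->', repr(clean))
```

Two limitations are visible here. The split only recognises periods, so the text after the exclamation mark stays in the same sentence. The cleaning also removes digits and symbols such as "20%" and "2023", which may matter if numbers carry meaning in your document.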


In [None]:
%pip install PyPDF2

### Choice 1: Read the PDF locally

In [None]:
import pandas as pd
import numpy as np
from PyPDF2 import PdfReader
import re

# Read the PDF file (assumes docAI.pdf is in the working directory)
reader = PdfReader("docAI.pdf")

text = ""
for page in reader.pages:
    text += page.extract_text() + " "



### Choice 2: Read from GitHub

In [None]:
import requests
from PyPDF2 import PdfReader
from io import BytesIO

# URL to the raw PDF file on GitHub
url = 'https://raw.githubusercontent.com/RDGopal/IB9LQ0-GenAI/main/Data/docAI.pdf'

# Use requests to get the content of the PDF file
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Use BytesIO to convert the bytes response to a file-like object
    pdf_file = BytesIO(response.content)

    # Now you can use PdfReader to read the file
    reader = PdfReader(pdf_file)

    # Initialize a variable to hold all extracted text
    full_text = ""

    # Iterate over each page in the PDF file
    for page in reader.pages:
        text = page.extract_text()
        if text:  # Check if text was successfully extracted
            full_text += text + "\n"  # Append the text of each page

    text = full_text

else:
    print("Failed to download the file. Status code:", response.status_code)

### Split text into sentences and clean

In [None]:
import re
import numpy as np
import pandas as pd

# Split the text into sentences
sentences = re.split(r'\.\s+', text)

# Clean and preprocess the text
df = pd.DataFrame({'text': sentences})
df['clean_text'] = df['text'].str.lower()
df['clean_text'] = df['clean_text'].str.replace(r'[^a-z\s]', '', regex=True)  # keep letters and whitespace only
df['clean_text'] = df['clean_text'].str.replace(r'\s+', ' ', regex=True)      # collapse repeated whitespace
df['sentence_id'] = np.arange(len(df))

In [None]:
df

### Creating Sentence Embeddings with Doc2Vec

In [None]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize

# Tokenizing and tagging
tagged_data = [TaggedDocument(words=word_tokenize(doc), tags=[str(i)]) for i, doc in enumerate(df['clean_text'])]

# Training a Doc2Vec model
model = Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Infer an embedding for each sentence and store it in the DataFrame
df['doc2vec_embedding'] = [model.infer_vector(word_tokenize(text)) for text in df['clean_text']]


In [None]:
df

### Semantic Search Functionality
We will use cosine similarity to find the top 5 closest sentences.
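
As a reference for what cosine-similarity ranking does, here is a minimal NumPy sketch on made-up two-dimensional vectors. The `top_k_by_cosine` helper is hypothetical. It normalises every vector to unit length so that a dot product equals the cosine of the angle between them.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = doc_units @ query_unit          # cosine similarity for every row
    order = np.argsort(scores)[::-1][:k]     # highest similarity first
    return order, scores[order]

# Made-up document vectors and query, for illustration only
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.2])

idx, sims = top_k_by_cosine(query, docs, k=2)
print(idx, sims.round(3))
```

Cosine similarity depends only on direction, not on vector length, which makes it a natural choice for comparing embeddings of texts of different sizes.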

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def find_top_5_similar_sentences(df, query, model):
    # Embed the query with the trained Doc2Vec model
    query_embedding = model.infer_vector(word_tokenize(query.lower()))
    embeddings_matrix = np.vstack(df['doc2vec_embedding'])
    similarities = cosine_similarity([query_embedding], embeddings_matrix)
    # Indices of the five most similar sentences, best match first
    top_5_indices = np.argsort(similarities[0])[::-1][:5]
    return df.iloc[top_5_indices]['text']

In [None]:
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity  # Import cosine_similarity

# Set display options
pd.set_option('display.max_colwidth', None)  # Ensure no truncation
pd.set_option('display.max_rows', None)  # Display any number of rows

query = "Change your business" #@param {type:"string"}
top_5_sentences = find_top_5_similar_sentences(df, query, model)
for sentence in top_5_sentences:
    print(sentence)
    print()


### Creating Sentence Embeddings with Sentence Transformer

In [None]:
from sentence_transformers import SentenceTransformer

# Load the model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings in batches and store one vector per row
df['sbert_embedding'] = list(sbert_model.encode(df['clean_text'].tolist()))


In [None]:
df.head()

### Semantic Search Functionality

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query, embeddings, top_k=5):
    query_embedding = sbert_model.encode([query])
    cos_similarities = cosine_similarity(query_embedding, embeddings)[0]
    top_k_indices = np.argsort(cos_similarities)[::-1][:top_k]
    return df.iloc[top_k_indices]['clean_text']



In [None]:
# Set display options
pd.set_option('display.max_colwidth', None)  # Ensure no truncation
pd.set_option('display.max_rows', None)  # Display any number of rows
query = "Change your business" #@param {type:"string"}
top_sentences = semantic_search(query, np.array(list(df['sbert_embedding'])))
for sentence in top_sentences:
    print(sentence)
    print()



Semantic search ranks results by meaning rather than by keyword overlap, so a query can match sentences that share few or none of its words. The quality of the results depends on the embedding model, which is why it is worth comparing the Doc2Vec and Sentence-Transformers outputs for the same query.



# Your Turn
1. Read the document `Machine_stops.pdf` and implement semantic search over it.