# Search Engine Example: From Basic to Advanced Techniques

This notebook demonstrates the implementation of various search engine techniques, from simple to more advanced approaches. It shows how to build a text-based search system for a collection of documents from different courses.

## Contents and Techniques Covered:

1. **Data Loading and Exploration**
   - Loading documents from a JSON source
   - Organizing and exploring the document structure

2. **Basic Text Vectorization**
   - CountVectorizer: Simple word counting
   - TF-IDF Vectorization: Term frequency-inverse document frequency
   - Removing stop words to improve quality

3. **Search Implementation**
   - Computing similarity between queries and documents
   - Using cosine similarity for comparing vector representations
   - Filtering by course and boosting specific fields
   - Creating a reusable `TextSearch` class

4. **Dimensionality Reduction Techniques**
   - TruncatedSVD (Latent Semantic Analysis)
   - Non-negative Matrix Factorization (NMF)

5. **Advanced Embedding with BERT**
   - Using pre-trained BERT model for contextual embeddings
   - Processing text through transformer architecture
   - Batch processing for efficiency
   - Saving embeddings for future use

Each section builds on previous concepts, progressing from basic bag-of-words representations to advanced contextual embeddings using transformers.

In [107]:
import requests 

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

## 1. Data Loading and Exploration

First, we load the document data from a JSON source and organize it into a structured format.
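
The flattening step can be tried without a network call. The sketch below uses a small hand-written sample that mirrors the nested layout the loading code expects (a list of courses, each with its own `documents` list); the course names and texts are hypothetical placeholders, not entries from the real file.

```python
# Hypothetical sample with the same nesting as the loaded JSON:
# a list of courses, each holding its own list of Q&A documents.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1"},
            {"text": "Answer 2", "section": "General", "question": "Question 2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Setup", "question": "Question 3"},
        ],
    },
]

# Flatten to one record per document and copy the course name onto each record.
# Building new dicts leaves the nested input unchanged.
flat = [
    {**doc, "course": course["course"]}
    for course in sample_raw
    for doc in course["documents"]
]

print(len(flat))           # 3
print(flat[2]["course"])   # course-b
```

Each flat record then carries everything needed for both searching (`section`, `question`, `text`) and filtering (`course`).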

In [108]:
documents[2]  # example document

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp'}

In [109]:
import pandas as pd

In [110]:
df = pd.DataFrame(documents, columns=['course', 'section', 'question', 'text'])
df.head()

|   | course | section | question | text |
|---|---|---|---|---|
| 0 | data-engineering-zoomcamp | General course-related questions | Course - When will the course start? | The purpose of this document is to capture fre... |
| 1 | data-engineering-zoomcamp | General course-related questions | Course - What are the prerequisites for this c... | GitHub - DataTalksClub data-engineering-zoomca... |
| 2 | data-engineering-zoomcamp | General course-related questions | Course - Can I still join the course after the... | Yes, even if you don't register, you're still ... |
| 3 | data-engineering-zoomcamp | General course-related questions | Course - I have registered for the Data Engine... | You don't need it. You're accepted. You can al... |
| 4 | data-engineering-zoomcamp | General course-related questions | Course - What can I do before the course starts? | You can start by installing and setting up all... |


### Vector Space Model for Search

In information retrieval, we represent documents as vectors in a mathematical space:

- **Document Vectorization**: Converting text documents into numerical vectors
- **Term-Document Matrix**:
    - Rows: Documents in our collection
    - Columns: Words/terms from our vocabulary
    - Values: Importance or frequency of each term in each document
  
- **Bag-of-Words Approach**:
  - Word order and grammar are disregarded
  - Only the occurrence or frequency of words matters
  - Uses sparse matrices (mostly zeros) for efficient storage
  - Foundation for many information retrieval systems
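
To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. The whitespace tokenizer and the extra vocabulary terms are simplifying assumptions for illustration; they are not what scikit-learn does internally.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())

a = bag_of_words("register for the january course")
b = bag_of_words("the january course register for")

# Word order is discarded, so both sentences map to the same bag
print(a == b)  # True

# A dense row needs one slot per vocabulary term,
# while the Counter stores only the non-zero counts
vocabulary = sorted(set(a) | {"python", "cloud", "homework", "catalog"})
dense_row = [a.get(term, 0) for term in vocabulary]
print(dense_row)  # [0, 0, 1, 1, 0, 1, 0, 1, 1]
print(len(a), "stored counts vs", len(dense_row), "dense slots")
```

With a realistic vocabulary of thousands of terms, the gap between stored counts and dense slots is what makes sparse matrices worthwhile.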

### Understanding the Data Structure

Let's examine the structure of our documents to understand what we're working with. Each document represents a Q&A entry from a course and contains fields like 'course', 'section', 'question', and 'text' (the answer).

In [111]:
df[df.course == 'data-engineering-zoomcamp'].head()

|   | course | section | question | text |
|---|---|---|---|---|
| 0 | data-engineering-zoomcamp | General course-related questions | Course - When will the course start? | The purpose of this document is to capture fre... |
| 1 | data-engineering-zoomcamp | General course-related questions | Course - What are the prerequisites for this c... | GitHub - DataTalksClub data-engineering-zoomca... |
| 2 | data-engineering-zoomcamp | General course-related questions | Course - Can I still join the course after the... | Yes, even if you don't register, you're still ... |
| 3 | data-engineering-zoomcamp | General course-related questions | Course - I have registered for the Data Engine... | You don't need it. You're accepted. You can al... |
| 4 | data-engineering-zoomcamp | General course-related questions | Course - What can I do before the course starts? | You can start by installing and setting up all... |


## 2. Basic Text Vectorization

Here we explore various text vectorization techniques, starting with a simple example dataset.

In [112]:
from sklearn.feature_extraction.text import CountVectorizer

### Basic Text Representation: CountVectorizer

We'll start with the most basic approach: representing documents by counting word occurrences.

`CountVectorizer` from scikit-learn:
- Converts a collection of text documents into a matrix of token counts
- Creates a "vocabulary" of all unique words in the corpus
- Represents each document as a vector of word counts
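
The fit and transform steps can be approximated in a few lines of plain Python. This is a rough sketch of the idea rather than scikit-learn's implementation: it assumes lowercasing plus the default token pattern (tokens of two or more word characters) and ignores every other option. It uses the same five example sentences as this section.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: tokens of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

docs = [
    "January course details, register now",
    "Course prerequisites listed in January catalog",
    "Submit January course homework by end of month",
    "Register for January course, no prerequisites",
    "January course setup: Python and Google Cloud",
]

def tokenize(text):
    """Lowercase the text and extract tokens with the regex above."""
    return TOKEN_PATTERN.findall(text.lower())

# "fit": collect the sorted set of unique tokens across all documents
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def count_vector(text):
    """"transform": one count per vocabulary term, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

matrix = [count_vector(doc) for doc in docs]

print(len(vocabulary))  # 22
print(matrix[0])
```

The vocabulary size and the first row agree with the `get_feature_names_out()` and `X.toarray()` outputs shown in this section.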

In [113]:
cv = CountVectorizer()

In [114]:
# for the purpose of this example, we will use a small set of documents
# in a real-world scenario, you would use a larger set of documents

docs_example = [
    "January course details, register now",
    "Course prerequisites listed in January catalog",
    "Submit January course homework by end of month",
    "Register for January course, no prerequisites",
    "January course setup: Python and Google Cloud"
]

In [115]:
cv.fit(docs_example)

CountVectorizer()


In [116]:
names = cv.get_feature_names_out()
names

array(['and', 'by', 'catalog', 'cloud', 'course', 'details', 'end', 'for',
       'google', 'homework', 'in', 'january', 'listed', 'month', 'no',
       'now', 'of', 'prerequisites', 'python', 'register', 'setup',
       'submit'], dtype=object)

In [117]:
X = cv.transform(docs_example)

In [118]:
X.toarray()

array([[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]])

In [119]:
df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs

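|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| and | 0 | 0 | 0 | 0 | 1 |
| by | 0 | 0 | 1 | 0 | 0 |
| catalog | 0 | 1 | 0 | 0 | 0 |
| cloud | 0 | 0 | 0 | 0 | 1 |
| course | 1 | 1 | 1 | 1 | 1 |
| details | 1 | 0 | 0 | 0 | 0 |
| end | 0 | 0 | 1 | 0 | 0 |
| for | 0 | 0 | 0 | 1 | 0 |
| google | 0 | 0 | 0 | 0 | 1 |
| homework | 0 | 0 | 1 | 0 | 0 |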


In [120]:
cv = CountVectorizer(stop_words='english') # removing common English stop words
X = cv.fit_transform(docs_example)

names = cv.get_feature_names_out()

df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs

# this is a bag of words representation
# each row is a word, each column is a document
# the values are the counts of the words in the documents

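|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| catalog | 0 | 1 | 0 | 0 | 0 |
| cloud | 0 | 0 | 0 | 0 | 1 |
| course | 1 | 1 | 1 | 1 | 1 |
| details | 1 | 0 | 0 | 0 | 0 |
| end | 0 | 0 | 1 | 0 | 0 |
| google | 0 | 0 | 0 | 0 | 1 |
| homework | 0 | 0 | 1 | 0 | 0 |
| january | 1 | 1 | 1 | 1 | 1 |
| listed | 0 | 1 | 0 | 0 | 0 |
| month | 0 | 0 | 1 | 0 | 0 |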


In [121]:
from sklearn.feature_extraction.text import TfidfVectorizer

### Improved Text Representation: TF-IDF Vectorization

**TF-IDF (Term Frequency-Inverse Document Frequency)** improves on simple word counting:

- **Term Frequency (TF)**: How often a word appears in a document
  - More frequent words in a document are more important (higher weight)

- **Inverse Document Frequency (IDF)**: How unique a word is across documents
  - Words that appear in many documents get lower weights
  - Rare, distinctive words get higher weights

This approach gives higher weight to terms that are:
1. Important within a specific document (high TF)
2. Distinctive across the document collection (high IDF)

TF-IDF helps reduce the importance of common words that appear in many documents but aren't very meaningful (like "the", "is", "of").
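
As a worked example, the weights for the first example sentence can be computed by hand. The sketch assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by scaling each document vector to unit Euclidean length). The token lists are written out manually after English stop-word removal, which is a shortcut for illustration.

```python
import math

# Tokens of the five example sentences after English stop-word removal
# (written out by hand here; the vectorizer derives them automatically)
docs = [
    ["january", "course", "details", "register"],
    ["course", "prerequisites", "listed", "january", "catalog"],
    ["submit", "january", "course", "homework", "end", "month"],
    ["register", "january", "course", "prerequisites"],
    ["january", "course", "setup", "python", "google", "cloud"],
]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    raw = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[0])
for term in sorted(weights):
    print(f"{term}: {weights[term]:.2f}")
# course: 0.33
# details: 0.69
# january: 0.33
# register: 0.56
```

"course" and "january" appear in all five documents, so their IDF is 1.0 and they end up with the lowest weights; "details" appears in only one document and gets the highest.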

In [122]:
cv = TfidfVectorizer(stop_words='english')
X = cv.fit_transform(docs_example)

names = cv.get_feature_names_out()

df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs.round(2)

# this is a TF-IDF representation
# each row is a word, each column is a document
# the values are the TF-IDF scores of the words in the documents


# TF-IDF stands for Term Frequency-Inverse Document Frequency
# it is a statistical measure of how important a word is to a document
# relative to a collection of documents (corpus)
# the more often a word occurs in a document, the higher its score in that document
# the more documents in the corpus contain the word, the lower its score
# e.g. 'course' and 'january' occur in all five documents, so they score lower than
# words such as 'details' or 'catalog' that occur in only one


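|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| catalog | 0.0 | 0.57 | 0.0 | 0.0 | 0.0 |
| cloud | 0.0 | 0.0 | 0.0 | 0.0 | 0.47 |
| course | 0.33 | 0.27 | 0.23 | 0.36 | 0.23 |
| details | 0.69 | 0.0 | 0.0 | 0.0 | 0.0 |
| end | 0.0 | 0.0 | 0.47 | 0.0 | 0.0 |
| google | 0.0 | 0.0 | 0.0 | 0.0 | 0.47 |
| homework | 0.0 | 0.0 | 0.47 | 0.0 | 0.0 |
| january | 0.33 | 0.27 | 0.23 | 0.36 | 0.23 |
| listed | 0.0 | 0.57 | 0.0 | 0.0 | 0.0 |
| month | 0.0 | 0.0 | 0.47 | 0.0 | 0.0 |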


## 3. Search Implementation

Now we implement search functionality using the vectorized representations and similarity measures.

In [123]:
query = "Do I need to know python to sign up for the January course?"

In [124]:
q = cv.transform([query])
q.toarray()

array([[0.        , 0.        , 0.39515588, 0.        , 0.        ,
        0.        , 0.        , 0.39515588, 0.        , 0.        ,
        0.        , 0.829279  , 0.        , 0.        , 0.        ]])

In [125]:
query_dict = dict(zip(names, q.toarray()[0]))
query_dict

{'catalog': np.float64(0.0),
 'cloud': np.float64(0.0),
 'course': np.float64(0.39515588491314224),
 'details': np.float64(0.0),
 'end': np.float64(0.0),
 'google': np.float64(0.0),
 'homework': np.float64(0.0),
 'january': np.float64(0.39515588491314224),
 'listed': np.float64(0.0),
 'month': np.float64(0.0),
 'prerequisites': np.float64(0.0),
 'python': np.float64(0.8292789960182417),
 'register': np.float64(0.0),
 'setup': np.float64(0.0),
 'submit': np.float64(0.0)}

In [126]:
doc_dict = dict(zip(names, X.toarray()[1]))
doc_dict

{'catalog': np.float64(0.5675015398728066),
 'cloud': np.float64(0.0),
 'course': np.float64(0.2704175244456293),
 'details': np.float64(0.0),
 'end': np.float64(0.0),
 'google': np.float64(0.0),
 'homework': np.float64(0.0),
 'january': np.float64(0.2704175244456293),
 'listed': np.float64(0.5675015398728066),
 'month': np.float64(0.0),
 'prerequisites': np.float64(0.45785666908911726),
 'python': np.float64(0.0),
 'register': np.float64(0.0),
 'setup': np.float64(0.0),
 'submit': np.float64(0.0)}

In [127]:
df_qd = pd.DataFrame([query_dict, doc_dict], index=['query', 'doc']).T

In [128]:
# dot product of the query and document vectors, computed manually
# TfidfVectorizer scales its vectors to unit length by default, so this equals their cosine similarity
(df_qd['query'] * df_qd['doc']).sum()

np.float64(0.21371415233666782)

In [129]:
X.dot(q.T).toarray()  # the same dot product, computed for all five documents at once

array([[0.25955955],
       [0.21371415],
       [0.17843726],
       [0.28419115],
       [0.57137158]])

### Query-Document Similarity with Cosine Similarity

To find relevant documents for a query, we need to measure how similar they are:

- **Cosine Similarity**: Measures the cosine of the angle between two vectors
  - Value ranges from -1 (opposite direction) to 1 (same direction)
  - For text vectors with non-negative values, range is 0 to 1
  - Higher value = more similar documents
  - Not affected by document length (unlike dot product)
  
- **Formula**: $\text{cosine}(A,B) = \frac{A \cdot B}{||A|| \times ||B||}$
  - Where $A \cdot B$ is the dot product
  - And $||A||$ is the magnitude of vector A
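
A small NumPy sketch of the formula, using made-up three-dimensional vectors, shows the length-invariance property. It also shows why the dot products and the cosine similarities in this notebook coincide: `TfidfVectorizer` scales its vectors to unit length by default, and for unit-length vectors the two measures are equal.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc = np.array([1.0, 2.0, 0.0])
longer_doc = 3 * doc              # same direction, three times the length
query = np.array([1.0, 1.0, 0.0])

# The dot product grows with vector length; cosine similarity does not
print(np.dot(doc, query), np.dot(longer_doc, query))
print(cosine(doc, query), cosine(longer_doc, query))

# For unit-length vectors the two measures coincide
doc_unit = doc / np.linalg.norm(doc)
query_unit = query / np.linalg.norm(query)
print(np.isclose(np.dot(doc_unit, query_unit), cosine(doc, query)))
```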

In [130]:
from sklearn.metrics.pairwise import cosine_similarity

In [131]:
# cosine similarity between the query and every document
# the values match the dot products because the TF-IDF vectors have unit length
cosine_similarity(X, q)

array([[0.25955955],
       [0.21371415],
       [0.17843726],
       [0.28419115],
       [0.57137158]])

In [132]:
df.columns

Index(['course', 'section', 'question', 'text'], dtype='object')

In [133]:
fields = ['section', 'question', 'text']
transformers = {}
matrices = {}

for field in fields:
    cv = TfidfVectorizer(stop_words='english', min_df=3)
    X = cv.fit_transform(df[field])

    transformers[field] = cv
    matrices[field] = X

In [134]:
transformers['text'].get_feature_names_out()

array(['001', '01', '02', ..., 'zones', 'zoom', 'zoomcamp'],
      shape=(2118,), dtype=object)

In [135]:
matrices['text']

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 26463 stored elements and shape (948, 2118)>

In [136]:
query = "I just singned up. Is it too late to join the course?"

In [137]:
q = transformers['text'].transform([query])
score = cosine_similarity(matrices['text'], q).flatten()

In [138]:
mask = (df.course == 'data-engineering-zoomcamp').values
score = score * mask
score[:10]

array([0.3336047 , 0.        , 0.        , 0.1328874 , 0.        ,
       0.        , 0.        , 0.12722114, 0.        , 0.        ])

In [139]:
import numpy as np

In [140]:
idx = np.argsort(-score)[:10]
idx

array([  0,  15,  22,  27,  38, 287,   3,   7, 113,  11])

In [141]:
score[idx]

array([0.3336047 , 0.23530268, 0.22668   , 0.1894954 , 0.16484429,
       0.13921764, 0.1328874 , 0.12722114, 0.1207499 , 0.10830554])

In [142]:
df.iloc[idx].text

0      The purpose of this document is to capture fre...
15     No, late submissions are not allowed. But if t...
22     It's up to you which platform and environment ...
27     You can do most of the course without a cloud....
38     You will have two attempts for a project. If t...
287    This error could result if you are using some ...
3      You don't need it. You're accepted. You can al...
7      Yes, we will keep all the materials after the ...
113    In the join queries, if we mention the column ...
11     No, you can only get a certificate if you fini...
Name: text, dtype: object

In [143]:
fields

['section', 'question', 'text']

In [144]:
query = "I just signed up. Is it too late to join the course?"

### Enhancing Relevance: Field Boosting and Filtering

To improve search results, we can:

1. **Field Boosting**: Give different weights to different fields
   - For example, we might treat a match in the "question" field as more important than a match in the "text" field
   - This is done by multiplying similarity scores by a boost factor

2. **Filtering**: Restrict results by specific criteria
   - We can filter results by metadata like course name, date, etc.
   - Implemented by masking scores (setting non-matching documents to zero)
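
The arithmetic behind boosting and masking is easiest to see on a toy case. The similarity values and the mask below are hypothetical. The last step, dropping zero-score entries, is an optional extra: because masking only sets scores to zero, filtered-out documents can otherwise still appear at the tail of a top-N list when fewer than N documents match.

```python
import numpy as np

# Hypothetical per-field cosine similarities for four documents
similarities = {
    "question": np.array([0.10, 0.60, 0.00, 0.30]),
    "text":     np.array([0.50, 0.20, 0.40, 0.00]),
}
boost = {"question": 3.0}  # fields not listed default to 1.0

# Weighted sum of the per-field similarities
score = np.zeros(4)
for field, sim in similarities.items():
    score += boost.get(field, 1.0) * sim

# Filter: assume only documents 1 and 2 belong to the wanted course
mask = np.array([False, True, True, False])
score = score * mask

# Rank, then drop zero-score entries so filtered-out documents never pad the results
order = np.argsort(-score)
top = [int(i) for i in order if score[i] > 0][:3]

print(score)
print(top)  # [1, 2]
```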

In [145]:
# Define boost factors: we're giving the 'question' field 3x importance
boost = {'question': 3.0}  # Other fields will get default weight of 1.0

# Initialize score array with zeros for each document
score = np.zeros(len(df))

# For each field (section, question, text), calculate similarity and apply boost
for f in fields:
    # Get boost value (default 1.0 if not specified)
    b = boost.get(f, 1.0)
    
    # Transform query into TF-IDF space for this field
    q = transformers[f].transform([query])
    
    # Calculate cosine similarity between query and all documents for this field
    s = cosine_similarity(matrices[f], q).flatten()
    
    # Add the boosted similarity score to the total score
    score = score + b * s

In [146]:
# Define filters to narrow down results by metadata
filters = {
    'course': 'data-engineering-zoomcamp'  # Only show results from this course
}

# Apply each filter by masking out non-matching documents
for field, value in filters.items():
    # Create a boolean mask: True for matching documents, False for non-matching
    mask = (df[field] == value).values
    
    # Multiply scores by mask - this sets scores to 0 for non-matching documents
    # Since any number * 0 = 0, non-matching documents will get a final score of 0
    score = score * mask

In [147]:
idx = np.argsort(-score)[:10]
results = df.iloc[idx]
results.to_dict(orient='records')

[{'course': 'data-engineering-zoomcamp',
  'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?',
  'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites'},
 {'course': 'data-engineering-zoomcamp',
  'section': 'General course-related questions',
  'question': 'How can we contribute to the course?',
  'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.'},
 {'course': 'data-engineering-zoomcamp',
  'section': 'General course-related questions',
  'question': 'Course - Which playlist on YouTube should I refer to?',
  'text': 'All the main videos are stored in the Main “DATA ENGINEERING” playlist (no year specified). The Github repository has also been updated to show each video with a thumbnail, that would bring you directly to the same playlist below.\nBelow is the MAIN PLAYLIST’. And then you refer to the

In [148]:
class TextSearch:
    """A versatile search engine for text documents that supports:
    - Multiple text fields
    - Field boosting
    - Filtering by metadata
    - TF-IDF vectorization
    """

    def __init__(self, text_fields):
        """Initialize the search engine with specified text fields to index.
        
        Args:
            text_fields (list): List of field names to use for searching
        """
        self.text_fields = text_fields
        self.matrices = {}       # Will store TF-IDF matrices for each field
        self.vectorizers = {}    # Will store vectorizers for each field

    def fit(self, records, vectorizer_params=None):
        """Process and index the document collection.

        Args:
            records (list): List of document dictionaries
            vectorizer_params (dict): Optional parameters for the TF-IDF vectorizer
        """
        # Use None instead of a mutable default; fall back to default vectorizer settings
        vectorizer_params = vectorizer_params or {}

        # Build a DataFrame from the records
        self.df = pd.DataFrame(records)

        # For each text field, create a vectorizer and transform the text
        for f in self.text_fields:
            # Initialize a TF-IDF vectorizer with optional parameters
            cv = TfidfVectorizer(**vectorizer_params)
            
            # Transform the text in this field to TF-IDF vectors
            X = cv.fit_transform(self.df[f])
            
            # Store both the matrix and vectorizer for later use
            self.matrices[f] = X
            self.vectorizers[f] = cv

    def search(self, query, n_results=10, boost=None, filters=None):
        """Search for documents matching the query.

        Args:
            query (str): The search query
            n_results (int): Maximum number of results to return
            boost (dict): Field boost factors (e.g., {'title': 3.0})
            filters (dict): Metadata filters (e.g., {'category': 'science'})

        Returns:
            list: Matching documents as dictionaries
        """
        # Use None instead of mutable defaults; treat missing arguments as empty
        boost = boost or {}
        filters = filters or {}

        # Initialize scores array
        score = np.zeros(len(self.df))

        # Calculate similarity for each field and apply boosts
        for f in self.text_fields:
            # Get boost factor (default 1.0)
            b = boost.get(f, 1.0)
            
            # Transform query using this field's vectorizer
            q = self.vectorizers[f].transform([query])
            
            # Calculate cosine similarity
            s = cosine_similarity(self.matrices[f], q).flatten()
            
            # Add weighted similarity to total score
            score = score + b * s

        # Apply any filters to restrict results
        for field, value in filters.items():
            mask = (self.df[field] == value).values
            score = score * mask

        # Get indices of top scoring documents
        idx = np.argsort(-score)[:n_results]
        
        # Return the top results as dictionaries
        results = self.df.iloc[idx]
        return results.to_dict(orient='records')

### Creating a Reusable TextSearch Class

We encapsulate all our search functionality into a reusable class.

In [149]:
fields

['section', 'question', 'text']

In [150]:
index = TextSearch(text_fields=['section', 'question', 'text'])

In [151]:
index.fit(documents)

### Using Our TextSearch Class

Now we'll use our TextSearch class to build a complete search solution. This encapsulates all the concepts we've learned:

1. We initialize the search index with the fields we want to search
2. We fit/train the index on our document collection
3. We execute a search with boosting and filtering

This approach provides a clean, reusable interface for search functionality.

In [152]:
query

'I just signed up. Is it too late to join the course?'

In [153]:
index.search(
    query='I just signed up. Is it too late to join the course?',
    n_results=5,
    boost={'question': 3.0},
    filters={'course': 'data-engineering-zoomcamp'}
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineerin

## 4. Dimensionality Reduction Techniques

As our vocabulary grows, our vectors become very high-dimensional (one dimension per unique word), leading to:
- Increased computational complexity
- The "curse of dimensionality" (sparse, distant vectors)
- Difficulty capturing semantic relationships between words

Dimensionality reduction addresses these issues by:
- Projecting high-dimensional vectors into a lower-dimensional space
- Preserving semantic relationships between documents
- Potentially capturing latent topics or themes
- Improving efficiency and sometimes accuracy of similarity calculations

In [154]:
X = matrices['text']
cv = transformers['text']

In [155]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=16)
X_emb = svd.fit_transform(X)

### Truncated SVD (Latent Semantic Analysis)

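**Singular Value Decomposition (SVD)** is a matrix factorization technique that decomposes our document-term matrix $X$ into a product of three matrices, $X = U \Sigma V^T$. Here $U$ relates documents to latent components, $\Sigma$ holds the singular values (the strength of each component), and $V^T$ relates components to terms.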

- **Truncated SVD** keeps only the top K components (16 in our example)
- Also known as **Latent Semantic Analysis (LSA)** when applied to text data
- **Benefits**:
  - Captures latent semantic relationships between words
  - Words with similar meanings tend to be closer in the reduced space
  - Can discover synonyms and related concepts
  - Addresses the "synonym problem" in information retrieval
  - Significantly reduces dimensionality (from thousands to dozens)
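
The truncation can be sketched with NumPy's exact SVD on a small random matrix. This is a stand-in for illustration: the matrix is hypothetical, and scikit-learn's `TruncatedSVD` uses a randomized solver by default, so its numbers will generally differ slightly. The projection step, multiplying by the top-k right singular vectors, is the part that carries over, and it is how a new query vector is mapped into the reduced space.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))   # hypothetical document-term matrix: 6 documents, 5 terms
k = 2                    # number of components to keep

# Full decomposition X = U @ diag(S) @ Vt, then keep the top-k components
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Document embeddings in the reduced space
X_emb = U_k * S_k

# Projecting onto the top-k right singular vectors gives the same embeddings
print(np.allclose(X_emb, X @ Vt_k.T))  # True

# A new query vector is mapped into the same space by the same projection
q = rng.random((1, 5))
q_emb = q @ Vt_k.T
print(X_emb.shape, q_emb.shape)  # (6, 2) (1, 2)

# Rank-k reconstruction of the original matrix
X_approx = U_k @ np.diag(S_k) @ Vt_k
print(np.linalg.norm(X - X_approx))
```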

In [156]:
X_emb[0]

array([ 0.08800366, -0.07517865, -0.10105712,  0.05230767,  0.05235927,
       -0.05944197,  0.02153147,  0.05269332, -0.20656883,  0.31692401,
        0.0293634 ,  0.10904857, -0.11577076,  0.03276341,  0.00761791,
       -0.00146267])

In [157]:
query = 'I just signed up. Is it too late to join the course?'

Q = cv.transform([query])
Q_emb = svd.transform(Q)

In [158]:
Q_emb[0]

array([ 0.04353771, -0.03066805, -0.04424086,  0.01347335,  0.02497004,
       -0.0518951 ,  0.01203983,  0.03418793, -0.12052568,  0.1735309 ,
        0.02738705,  0.0752527 , -0.06151985,  0.02799265,  0.01159803,
       -0.02185774])

In [159]:
np.dot(X_emb[0], Q_emb[0])

np.float64(0.11482853349220448)

In [160]:
score = cosine_similarity(X_emb, Q_emb).flatten()

In [161]:
idx = np.argsort(-score)[:10]

In [162]:
list(df.loc[idx].text)

['Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
 "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'No, it’s not possible. The form is closed after the due date. But don’t worry, homework is not mandatory for finishing the course.',
 'If you have submitted two projects (and peer-reviewed at least 3 course-mates’ projects for each submission), you will get the certificate for the course. According to the course coordinator, Alexey Grigorev, only two projects are needed to get the course certificate.\n

### Non-negative Matrix Factorization (NMF)

**Non-negative Matrix Factorization (NMF)** is an alternative dimensionality reduction technique with unique properties:

- **Key Characteristics**:
  - All values in the resulting matrices are non-negative
  - Produces an additive, parts-based representation (unlike SVD)
  - Often creates more interpretable components

- **Advantages for Text Analysis**:
  - Components often correspond to topics or themes
  - Non-negative values align well with the intuition that topics add up to form documents
  - Often works well for topic modeling and document clustering
  - May capture different semantic relationships than SVD
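
To illustrate how non-negativity can be maintained during factorization, here is a minimal sketch of the classic multiplicative-update rule on a small random matrix. This is one well-known way to fit NMF, chosen for brevity; it is not necessarily the solver scikit-learn uses by default, and the matrix sizes and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))   # hypothetical non-negative document-term matrix
k = 2                    # number of components (topics)
eps = 1e-10              # guards against division by zero

W = rng.random((6, k))   # document-by-topic weights
H = rng.random((k, 5))   # topic-by-term weights

initial_error = np.linalg.norm(X - W @ H)

for _ in range(200):
    # Multiplicative updates: each factor is rescaled by a ratio of
    # non-negative quantities, so no entry can become negative
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

final_error = np.linalg.norm(X - W @ H)

print(W.min() >= 0, H.min() >= 0)
print(final_error < initial_error)
```

Each row of `W` plays the role of a document embedding such as `X_emb[0]` in the cell that follows, which is why those embeddings contain only zeros and positive values.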

In [163]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=16)
X_emb = nmf.fit_transform(X)
X_emb[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00032308, 0.        , 0.        , 0.31244107,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        ])

In [164]:
Q = cv.transform([query])
Q_emb = nmf.transform(Q)
Q_emb[0]

array([0.        , 0.00098135, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.17668529,
       0.        , 0.        , 0.        , 0.        , 0.00079345,
       0.        ])

In [165]:
score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:10]
list(df.loc[idx].text)

['Please choose the closest one to your answer. Also do not post your answer in the course slack channel.',
 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
 "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 "No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.",
 "The purpose

In [166]:
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # Evaluation mode: disables dropout so inference is deterministic

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

## 5. Advanced Embedding with BERT

**BERT (Bidirectional Encoder Representations from Transformers)** represents a major advancement in text representation:

- **Contextual Embeddings**: Unlike traditional methods, BERT generates word embeddings based on surrounding context
  - The same word can have different embeddings in different contexts (e.g., "bank" in "river bank" vs. "bank account")

- **Advantages over Traditional Methods**:
  - Pre-trained on massive text corpora (Wikipedia, books)
  - Captures deep semantic relationships and linguistic patterns
  - Understands word meaning in context
  - More effectively captures nuanced queries and document meanings

- **Process**:
  - Tokenize text into subwords
  - Process through the transformer architecture
  - Extract contextual embeddings (usually from the last hidden layer)
  - Average or use special [CLS] token embeddings for sentence representation

In [167]:
texts = [
    "Yes, we will keep all the materials after the course finishes.",
    "You can follow the course at your own pace after it finishes"
]
encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')


In [168]:
encoded_input

{'input_ids': tensor([[  101,  2748,  1010,  2057,  2097,  2562,  2035,  1996,  4475,  2044,
          1996,  2607, 12321,  1012,   102],
        [  101,  2017,  2064,  3582,  1996,  2607,  2012,  2115,  2219,  6393,
          2044,  2009, 12321,   102,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

In [169]:
with torch.no_grad():  # Disable gradient calculation for inference
    outputs = model(**encoded_input)
    hidden_states = outputs.last_hidden_state

In [170]:
hidden_states.shape

torch.Size([2, 15, 768])

In [171]:
sentence_embeddings = hidden_states.mean(dim=1)
sentence_embeddings.shape

torch.Size([2, 768])
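
One caveat with a plain mean over the sequence dimension: in a padded batch it also averages the padding positions. A common alternative is to weight the average by the attention mask. The sketch below uses tiny hand-made NumPy arrays instead of real BERT outputs so that it runs without a model; the same arithmetic should carry over to PyTorch tensors.

```python
import numpy as np

# Hypothetical token embeddings: batch of 2 sequences, 3 positions, 2 dimensions
hidden_states = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],   # three real tokens
    [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]],   # last position is padding
])
attention_mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])

# Plain mean: the padding position leaks into the second sentence embedding
plain_mean = hidden_states.mean(axis=1)

# Masked mean: zero out padded positions, then divide by the number of real tokens
mask = attention_mask[:, :, None].astype(float)
masked_mean = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)

print(plain_mean)
print(masked_mean)
```

For the second sequence the plain mean is 5.0 per dimension, while the masked mean is 3.0, the average of its two real tokens. The first sequence has no padding, so both methods agree.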

In [172]:
sentence_embeddings.numpy()

# note: if you use a GPU, first move the tensor to the CPU:
# sentence_embeddings_cpu = sentence_embeddings.cpu()

array([[ 0.3599924 , -0.16072305,  0.35452363, ...,  0.04289253,
         0.03482319, -0.03822242],
       [ 0.17849939, -0.5000251 ,  0.25277585, ..., -0.11413134,
        -0.33608466,  0.4109512 ]], shape=(2, 768), dtype=float32)

In [173]:
def make_batches(seq, n):
    """Split seq into consecutive batches of at most n items each."""
    result = []
    for i in range(0, len(seq), n):
        batch = seq[i:i+n]
        result.append(batch)
    return result
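
To see what the batching does, here is a minimal sketch using a generator variant (`iter_batches` is a hypothetical name, not defined elsewhere in this notebook). It also checks the batch count for the 948 rows reported in the embeddings output further down, with the default batch size of 8.

```python
import math

def iter_batches(seq, n):
    """Yield consecutive slices of seq with at most n items each."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

# The last batch is shorter when len(seq) is not a multiple of n
batches = list(iter_batches(list(range(10)), 4))
print(batches)

# 948 texts in batches of 8 -> 118 full batches plus one batch of 4
num_batches = math.ceil(948 / 8)
print(num_batches)
```

The resulting count of 119 matches the total shown by the progress bars below.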

### Batch Processing for Efficient Embedding Generation

Processing large collections of documents through BERT can be resource-intensive. We implement batch processing to:

- **Manage Memory Usage**: Process smaller chunks of documents at a time
- **Improve Performance**: Batch processing is more computationally efficient than one-by-one processing
- **Track Progress**: Use tqdm to visualize completion progress
- **Scale**: Handle large document collections without running out of memory

The approach:
1. Split the document collection into smaller batches
2. Process each batch through BERT
3. Collect embeddings from each batch
4. Combine all embeddings into a final output matrix

In [174]:
from tqdm.auto import tqdm

In [175]:
def compute_embeddings(texts, batch_size=8):
    """Generate BERT embeddings for a collection of texts efficiently.
    
    Args:
        texts (list): List of text strings to encode
        batch_size (int): Number of texts to process in each batch
        
    Returns:
        numpy.ndarray: Matrix of embeddings, shape (num_texts, embedding_dim)
    """
    # Split texts into batches for efficient processing
    text_batches = make_batches(texts, batch_size)
    
    all_embeddings = []
    
    # Process each batch with a progress bar
    for batch in tqdm(text_batches):
        # Tokenize the batch of texts
        # - padding=True ensures all sequences in the batch have the same length
        # - truncation=True cuts texts that are too long
        # - return_tensors='pt' returns PyTorch tensors
        encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    
        # Disable gradient calculation for inference (saves memory and is faster)
        with torch.no_grad():
            # Get BERT outputs for this batch
            outputs = model(**encoded_input)
            # Extract the hidden states from the last layer
            hidden_states = outputs.last_hidden_state
            
            # Average token embeddings to get sentence embeddings
            # (dimension 1 corresponds to tokens in each sequence).
            # Note: this plain mean also includes the padding positions of shorter texts.
            batch_embeddings = hidden_states.mean(dim=1)
            
            # Convert PyTorch tensors to NumPy arrays
            batch_embeddings_np = batch_embeddings.cpu().numpy()
            all_embeddings.append(batch_embeddings_np)
    
    # Stack all batch embeddings into a single matrix
    final_embeddings = np.vstack(all_embeddings)
    return final_embeddings
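
The notebook stops at computing and saving the embeddings. As a rough sketch of how such a matrix could be used for search, the snippet below ranks documents by cosine similarity to a query vector. It uses tiny hand-made vectors instead of real BERT output, and `cosine_top_k` is a hypothetical helper; in practice the query vector would presumably come from something like `compute_embeddings([query])[0]`.

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=3):
    """Return indices and scores of the k rows most similar to query_vec."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = doc_units @ query_unit          # cosine similarity per document
    top_idx = np.argsort(-scores)[:k]        # highest scores first
    return top_idx, scores[top_idx]

# Toy "embeddings": 4 documents, 3 dimensions
doc_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.1, 0.0])

idx, scores = cosine_top_k(query_vec, doc_matrix, k=2)
print(idx, scores)
```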

In [176]:
embeddings = {}

In [177]:
# fields = ['section', 'question', 'text']

for f in fields:
    print(f'computing embeddings for {f}...')
    embeddings[f] = compute_embeddings(df[f].tolist())

computing embeddings for section...


  0%|          | 0/119 [00:00<?, ?it/s]

computing embeddings for question...


  0%|          | 0/119 [00:00<?, ?it/s]

computing embeddings for text...


  0%|          | 0/119 [00:00<?, ?it/s]

In [178]:
embeddings

{'section': array([[ 0.37748608, -0.16826633, -0.71794635, ...,  0.32759327,
         -0.12342925,  0.18710026],
        [ 0.37748608, -0.16826633, -0.71794635, ...,  0.32759327,
         -0.12342925,  0.18710026],
        [ 0.37748608, -0.16826633, -0.71794635, ...,  0.32759327,
         -0.12342925,  0.18710026],
        ...,
        [-0.1783811 , -0.00579773, -0.19219266, ..., -0.09306458,
          0.06128873, -0.07417933],
        [-0.1783811 , -0.00579773, -0.19219266, ..., -0.09306458,
          0.06128873, -0.07417933],
        [-0.1783811 , -0.00579773, -0.19219266, ..., -0.09306458,
          0.06128873, -0.07417933]], shape=(948, 768), dtype=float32),
 'question': array([[-6.55925050e-02, -3.21504474e-01,  5.13027191e-01, ...,
         -8.19995776e-02, -1.21371210e-01,  3.88520695e-02],
        [ 2.28554845e-01,  4.20735665e-02,  2.02741608e-01, ...,
         -8.88650045e-02,  4.93600965e-04,  8.19936686e-04],
        [ 3.47465314e-02, -2.72454262e-01,  2.28157520e-01, ...,


In [179]:
import pickle

In [180]:
with open('embeddings.bin', 'wb') as f_out:
    pickle.dump(embeddings, f_out)
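
Reading the file back later works the same way in reverse. The sketch below does a round trip with a small stand-in dictionary and a temporary file (both are assumptions made to keep the example self-contained); for the real data you would open `embeddings.bin` in `'rb'` mode. Keep in mind that `pickle.load` should only be used on files you trust.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-in for the real embeddings dictionary
embeddings_demo = {'question': np.ones((3, 4), dtype=np.float32)}

path = os.path.join(tempfile.mkdtemp(), 'embeddings_demo.bin')

with open(path, 'wb') as f_out:
    pickle.dump(embeddings_demo, f_out)

with open(path, 'rb') as f_in:
    loaded = pickle.load(f_in)

print(loaded['question'].shape, loaded['question'].dtype)
```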

In [181]:
# size in MB of the embeddings
import os
embeddings_size = os.path.getsize('embeddings.bin') / (1024 * 1024)  # in MB
print(f'Embeddings size: {embeddings_size:.2f} MB')


Embeddings size: 8.33 MB
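
This size can be sanity-checked with simple arithmetic: three fields, each stored as a `(948, 768)` matrix of `float32` values (4 bytes each). The sketch below ignores the small overhead that pickle adds on top of the raw array data.

```python
num_fields = 3        # 'section', 'question', 'text'
num_docs = 948        # rows per embedding matrix
embedding_dim = 768   # columns per embedding matrix
bytes_per_value = 4   # float32

raw_bytes = num_fields * num_docs * embedding_dim * bytes_per_value
raw_mb = raw_bytes / (1024 * 1024)
print(f'{raw_bytes} bytes = {raw_mb:.2f} MB')
```

The estimate agrees with the measured file size, which suggests the file contains little besides the raw matrices.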


## Summary

In this notebook, we've explored a progression of search techniques:

1. We started with basic bag-of-words representations using CountVectorizer.
2. We improved on this by using TF-IDF to account for term importance.
3. We built a flexible search system with field boosting and filtering.
4. We explored dimensionality reduction techniques (SVD and NMF) to compress sparse, high-dimensional vectors into dense, lower-dimensional ones.
5. Finally, we implemented BERT embeddings to capture contextual meaning.

## Comparison of Search Methods

Here's a comparison of the different approaches we explored:

| Method | Pros | Cons | Best for |
|--------|------|------|----------|
| **CountVectorizer** | Simple, fast, easy to understand | Ignores word importance, sensitive to common words | Quick prototyping, small collections |
| **TF-IDF** | Accounts for term importance, works well in practice | Still bag-of-words (no word order), no semantic understanding | General purpose IR, medium-sized collections |
| **SVD/LSA** | Reduces dimensions, finds latent relationships, handles synonyms | Loses interpretability, needs tuning | Handling synonym problem, reducing computation |
| **NMF** | Topic-like components, more interpretable than SVD | Still no deep semantics, sensitive to initialization | Topic modeling applications |
| **BERT** | Deep semantic understanding, context-aware, state-of-the-art | Computationally expensive, more complex | Understanding nuanced queries, semantic search |

**Search Quality Progression**:
1. Basic word counting (CountVectorizer) → Limited understanding
2. Word importance weighting (TF-IDF) → Better results for distinctive terms
3. Latent semantic analysis (SVD) → Some semantic connections
4. Contextual embeddings (BERT) → Deep semantic understanding

**Computational Cost Progression**:
1. CountVectorizer → Very fast
2. TF-IDF → Fast
3. SVD/NMF → Moderate
4. BERT → Most expensive

When building a search system, the choice of method depends on your specific requirements for accuracy, speed, and available computational resources.