# DEVELOPING A SEMANTIC SEARCH ENGINE USING NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING

University of North Carolina at Charlotte

ITCS 6150 - Intelligent Systems, Spring 2024

Ryan Hull, 
Albert Oh, 
Tim Hillmann, 
Adam Lowder

## Introduction
The University of North Carolina at Charlotte's website hosts a vast array of information, yet finding specific content can be challenging using traditional search methods. Our project aims to introduce a semantic search engine that utilizes machine learning and natural language processing to understand and interpret user queries contextually, aiming for significantly improved accuracy and relevance in search results. 

## Problem Statement 
Our objective is to develop a search engine that enhances traditional keyword matching by employing machine learning algorithms to analyze and understand the context and intent behind user queries. The challenge lies in effectively processing and interpreting the expansive and diverse content on the university's website, requiring a holistic approach to data preprocessing, semantic analysis, and embedding generation in an ML pipeline.

## Table of Contents 
- [Introduction](#introduction)
- [Problem Statement](#problem-statement)
- [Methodology](#methodology)  
- [Pipeline](#pipeline)
    - [Data Collection](#data-collection)
    - [Data Preprocessing](#data-preprocessing)
    - [Summarization](#summarization)
    - [Semantic Analysis/Embedding](#semantic-analysis)
    - [Search Engine](#search-engine)
    - [Evaluation](#evaluation)

## Methodology

Our methodology begins with recursive/deep crawling the University’s website, preprocessing the site data, including the extraction and cleaning of text. Following this, we will generate context aware summaries using a pre-trained LLM model, then create embeddings in a similar manner to capture the semantic meaning of texts. For the search engine's development, we plan to embed user queries and perform similarity searches on our embeddings database, utilizing a small custom front-end web application for communication to the user. Finally, we will evaluate the search engine's performance with A-B testing against the UNC Charlotte website.`


## Pipeline

### Data Collection

Over the course of this project we explored several different ways to traverse and collect website data while utilizing Python libraries such as beautifulsoup and Playwright. 

Using a recursive web crawler approach, we were able to traverse the entire website and collect all the text data from the website. This resulting in an large dataset >4GB of text data alone (later skimmed down to ~47mb), and over 14,000 sites. While this worked, we found that the data was too large to process in a reasonable amount of time, as well as included various sources of outdated information that were not ideal or wanted in our project. In the end we have two datasets, one that is a subset of the entire website with stop-words removed and limits to path count (allowing for filtering by path count or subdomain) called endpoints.csv and another manually picked test set which we will demonstrate in this pipeline, called subdomain_test_set.csv. Note that the text data in these datasets is not the full text data, but rather a subset of the text data that was collected from the website and cleaned. However, no manual cleaning was done to the text data.

#### How we chose our test set

The test we will demonstrate includes pages from the main website, as well as any base path for subdomains that we found. The length of the pages ranged from a few paragraphs to a few pages of text. We chose these pages because they were the most relevant to the main website and would be the most likely to be searched for, as well as show a diverse range of text data and topics.

#### Technical Methods

* The main libraries used here are `requests`, `concurrent.futures`, and `BeautifulSoup`, in order to crawl a domain, extract links and text from the HTML content of each page, and write the results to a CSV file. We also included some error handling and progress reporting.

1. **Parallelization**: Our method uses `concurrent.futures.ThreadPoolExecutor` to fetch multiple URLs at the same time. This can significantly speed up the web scraping, but it also adds complexity. In practice we found that for small datasets it is not necessary to use parallelization, but for larger datasets it increases speed linearlly.

2. **Error handling**: The code includes several `try`/`except` blocks to handle exceptions that might occur when fetching a URL or processing the HTML content.

3. **Progress reporting**: We display the progress of the web scraping, including the most recent URL and status, and the total number of successes and failures, making it useful to see progress in long running tasks.

4. **Link and text extraction**: The code includes logic to extract links and text from the HTML content, resolve relative links to absolute links, and remove certain sections of the page (like JavaScript and CSS blocks, headers, footers, etc.). These were improved upon by trial and error, and may not be perfect, but they work well for the most part and can be improved upon.

5. **URL validation and filtering**: We include logic to validate URLs and filter out certain URLs based on a list of excluded terms. This is necessary to avoid crawling irrelevant or unwanted pages.

#### Limitations

* The code does not handle JavaScript-heavy websites well, and may not be able to extract all the text from such websites. This is because the code only processes the HTML content of a page, and does not execute JavaScript. This is a limitation of the `requests` and `BeautifulSoup` libraries, and would require a more complex solution to solve such as using a headless browser like Playwright or Selenium. This switch would also allow for more complex interactions with the website, such as clicking buttons or filling out forms, which would be useful for more complex websites, however it would also slow down the process significantly.

* This is currently only useful for public sites only, as it does not handle authentication or cookies. 

#### Future Improvements

* We could improve the code to handle JavaScript-heavy websites by using a headless browser like Playwright or Selenium. This would allow us to execute JavaScript and extract the text from the rendered page.

* Currently we store the page data for every visited page, and later process and filter out the data set we want to be able to search over. This limits our calls to the site, however is not ideal. If we define the filters we want during the preprocessing step, we could avoid storing the data we don't want in the first place. Traversal of the entire site would still be necessary, but we could avoid storing the data we don't want.

* Additionally, storing the entirely html could allow for comparing the hash of the html to see if the page has changed, and only storing the new data / sending it through the pipeline. This would allow for a more up to date dataset and search engine, and would allow for the dataset to be updated over time in a more efficient manner. 


In [1]:
# Data Collection

import csv
import os
import re
from urllib.parse import urljoin
from urllib.parse import urlparse
import requests 
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
csv.field_size_limit(2**31 - 1)

RUN = True # Set to False to skip the web scraping, and use the existing CSV file

FILENAME = './data/endpoints.csv' # The file to write the endpoints to
DOMAIN = 'https://charlotte.edu' # The domain to crawl
PATH_LIMIT = 2  # Maximum number of slashes in the URL path

# Url parameters to exclude/skip
EXCLUDE = ['?', 'page', 'gateway', 'illiad', 'news-articles', 'news-events', 'news-media', 'linkedin', 'facebook', 'twitter', 'instagram', 'youtube', 'flickr', 'pinterest', '.com', '.org', '.net', '.gov', '.pdf', '.doc', 'xml', 'php', 'mailto:', '@', 'tel:', 'javascript:', 'tel:', 'sms:', 'mailto:', 'angular', 'react', '.js', 'event', 'corporate', '#', 'image', 'gallery', 'taskstream-student-handbook']    

# Display the progress of the web scraping
current_display = display('Starting...', display_id=True)
progress_display = display('Starting...', display_id=True)
success_count = 0
failure_count = 0

def print_status(url, status, exception=None):
    global success_count
    global failure_count
    global current_display
    global progress_display
    
    if status == 'Success':
        success_count += 1
    if status == 'Failed':
        failure_count += 1
    if status == 'Exception':
        failure_count += 1
        
    # Print the most recent URL and status, and the total counts    
    current_display.update(f'Most recent URL: {url}, Status: {status}, Exception: {exception}')
    progress_display.update(f'Successes: {success_count}, Failures: {failure_count}')

def remove_url_prefix(url):
    url = url.replace('http://', '').replace('https://', '').replace('www.', '')
    return url.lower()

def is_valid_url(url):
    if any(ex in url for ex in EXCLUDE) or len(url) < 8 or len(url) > 100:
        return False
    try:
        split_url = re.split('https?://', url)
        return 'mailto:' not in url and '@' not in url and 'charlotte.edu' in split_url[0] or len(split_url) > 1 and 'charlotte.edu' in split_url[1]
    except Exception as e:
        print_status(url, 'Exception', e)
        return False
    
def write_to_csv(valid_endpoints, failed_endpoints):
    # Export the endpoints to a CSV file
    try:
        with open(FILENAME, 'w', newline='', encoding="utf-8") as file:
            writer = csv.writer(file)
            writer.writerow(['URL', 'Title', 'Text'])  # Write the column labels
            for endpoint in valid_endpoints:  # Write the valid endpoints
                writer.writerow(endpoint)
            # for endpoint in failed_endpoints:  # Write the failed endpoints
            #     writer.writerow(endpoint)
            file.flush()
            os.fsync(file.fileno())
    except Exception as e:
        print_status(None, 'Failed', e)

def fetch_url(url):
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return ('Success', response.content)
        else:
            return ('Failed', None)
    except (requests.exceptions.RequestException, requests.exceptions.Timeout, ValueError):
        print_status(url, 'Failed')
        return ('Failed', None)
    
def crawl_domain(domain):
    # Initialize sets and lists to keep track of visited URLs, URLs to visit, and endpoints
    visited = set()
    to_visit = [domain.rstrip('/')]
    valid_endpoints = []
    failed_endpoints = []

    # Use ThreadPoolExecutor to parallelize the web scraping
    with ThreadPoolExecutor(max_workers=50) as executor:
        
        # Submit tasks to the executor for each URL in to_visit that hasn't been visited yet and is valid
        futures = {executor.submit(fetch_url, url): url for url in to_visit if is_valid_url(url) and url not in visited}
        # Add the URLs that are being visited to the visited set
        visited.update(url for url in to_visit if is_valid_url(url))

        # Continue until all futures are done
        while futures:
            # Wait for the first future to complete
            done, _ = concurrent.futures.wait(futures, return_when=concurrent.futures.FIRST_COMPLETED)

            # Process each completed future
            for future in done:
                url = futures.pop(future)

                try:
                    # Get the result of the future
                    data = future.result()
                except Exception as e:
                    # If an exception occurred while fetching the URL, print the status and continue
                    print_status(url, 'Failed')
                    continue
                
                # Unpack the status and content from the data
                status, content = data
                
                try:
                    if status == 'Success':
                        # Parse the HTML content
                        soup = BeautifulSoup(content, 'html.parser')
                        
                        # Link extraction
                        # Find all links in the HTML content
                        links = soup.find_all('a')
                        for link in links:
                            href = link.get('href')
                            if href is not None:
                                # Resolve relative links to absolute links
                                full_url = urljoin(domain, href).rstrip('/')
                                clean_url = remove_url_prefix(full_url)
                                slash_count = urlparse(clean_url).path.count('/')
                                # If the URL is valid, hasn't been visited yet, and doesn't have too many slashes, add it to the futures
                                if is_valid_url(clean_url) and slash_count <= PATH_LIMIT and clean_url not in visited:
                                    futures[executor.submit(fetch_url, full_url)] = full_url
                                    visited.add(clean_url)
                        
                        # Text extraction
                        
                        # Save title if it exists
                        title_text = soup.title.string if soup.title else ''
                        
                        # Find the "main", "main-content", or "body" element
                        element = None
                        main_element = soup.find(id="main")
                        main_content_element = soup.find(id="main-content")

                        if main_element is not None:
                            element = main_element
                        elif main_content_element is not None:
                            element = main_content_element.parent
                        else:
                            element = soup.find('body')

                        # Extract all visible text in the element and its child elements
                        if element:
                            text = element.get_text(strip=True, separator=' ')
                        else:
                            text = ''

                        text = text.replace('"', "'")  # Replace all double quotes with single quotes
                        text = text.replace('\n', '')  # Remove new lines
                        text = text.replace('\t', '')  # Remove tabs
                        text = ' '.join(text.split())
                        # Save the text along with the URL and status
                        valid_endpoints.append([str(url), str(title_text), str(text)])  
                        print_status(url, status)
                        if len(valid_endpoints) % 100 == 0:
                            write_to_csv(valid_endpoints, failed_endpoints)
                        
                    else:
                        # If the status is not 'Success', add the URL and status to the failed_endpoints list
                        failed_endpoints.append([url, ''])
                        visited.add(url)
                except Exception as e:
                    # If an exception occurred while processing the HTML content, print the status and continue
                    print_status(url, 'Failed: ' + str(e))
                    visited.add(url)
                    continue
        # Write the valid and failed endpoints to a CSV file
        write_to_csv(valid_endpoints, failed_endpoints)


if not RUN: # If RUN is False, skip the web scraping and use the existing CSV file
    print('Skipping web scraping...')
    
else: # If RUN is True, perform the web scraping    
    os.makedirs(os.path.dirname(FILENAME), exist_ok=True) # Create the directory if it doesn't exist
    open(FILENAME, 'w').close() # Clear the csv file
    crawl_domain(DOMAIN) # Crawl the domain

'Most recent URL: https://eng-resources.charlotte.edu/unccengkit/cooling, Status: Success, Exception: None'

'Successes: 14665, Failures: 167'

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
  k = self.parse_starttag(i)


### Data Preprocessing

In the data preprocessing phase of our project, we employed Python's powerful pandas library to refine and transform our raw dataset into a structured format suitable for further analysis and usage in our semantic search engine. The process involved several crucial steps, each aimed at improving the quality and relevance of the data extracted during the collection phase.

#### Initial Data Cleaning
The first step involved importing the raw dataset from a CSV file and removing any rows where essential information was missing. Specifically, we dropped entries without valid 'URL' or 'Text' fields to ensure that only complete records were processed further, as we;; as. This initial cleaning helps in reducing noise and focusing on meaningful data.

#### Filtering URLs
We applied filters to remove URLs containing specific terms that were deemed irrelevant or unwanted for our project's scope. By applying a custom lambda function, we could efficiently exclude URLs based on a defined list of terms, thereby narrowing down our dataset to include only the content relevant to our objectives. In this case, we filtered out URLs containing `?`, since these are often dynamic pages containing search parameters.

#### Sorting and Filtering by Structural Attributes
To better understand the website's architecture and prioritize content, we sorted the data based on the URL path count and text length. Entries were ordered by their path complexity to identify and segregate main pages from deeper, potentially less relevant subpages. Additionally, we filtered out text data that fell below a certain length threshold, focusing on substantial content that is more likely to be of interest to users.

#### Deduplication and Final Arrangement
Duplicate URLs were removed to avoid redundant processing and to ensure the uniqueness of each data point in our dataset. This deduplication step is critical for maintaining a clean and efficient search index. Furthermore, we sorted the remaining entries based on URL length to prioritize shorter, typically more significant URLs.

#### Data Conversion and Export
The final step in our preprocessing involved converting the cleaned DataFrame back into a list format for compatibility with subsequent processing stages and exporting the refined dataset back into CSV format for easy access and use in future tasks. This step marks the transition from raw to processed data, ready for integration into our semantic search pipeline.

#### Limitations
While our preprocessing methods have significantly improved dataset usability, they are primarily designed for structured, text-based content and might not handle non-textual data or highly dynamic content effectively. Moreover, the manual selection of filter terms and length thresholds requires domain knowledge and might not capture all nuances of the dataset.

#### Future Improvements
In future iterations, we aim to enhance our preprocessing pipeline by incorporating natural language processing techniques for better text analysis and by developing more sophisticated criteria for URL filtering and text relevance assessment. Automating the selection of filter terms and thresholds based on data characteristics and user feedback could also enhance the adaptability and effectiveness of our preprocessing steps.


In [2]:
import pandas as pd

def import_and_clean_data(file_path):
    df = pd.read_csv(file_path)
    df = df.dropna(subset=['URL', 'Title', 'Text'])
    return df

def filter_urls(df, filter_out_terms):
    mask = df['URL'].apply(lambda x: any(term in x for term in filter_out_terms))
    df = df[~mask]
    return df

def sort_by_path_count(df):
    df['path_count'] = df['URL'].apply(lambda url: url.count('/') - 2)
    df = df.sort_values(by=['path_count'])
    return df

def filter_by_text_length(df, min_length):
    df.loc[:, 'Text'] = df['Text'].astype(str)
    df = df[df['Text'].str.len() > min_length]
    return df

def sort_by_url_length(df):
    df['URL_length'] = df['URL'].str.len()
    df = df.sort_values(by=['URL_length'])
    df = df.drop(columns=['URL_length'])
    return df

def remove_duplicates(df):
    df = df.drop_duplicates(subset=['URL'])
    return df

def save_to_csv(df, file_path):
    df.to_csv(file_path, index=False)

def convert_to_list(df):
    return list(df.itertuples(index=False, name=None))

# Usage
endpoints = import_and_clean_data('./data/endpoints.csv')
endpoints = filter_urls(endpoints, ['?'])
endpoints = sort_by_path_count(endpoints)
subdomain_list = endpoints[endpoints['path_count'] == 0] # Subdomain test set
subdomain_list = filter_by_text_length(subdomain_list, 50)
subdomain_list = sort_by_url_length(subdomain_list)
subdomain_list = remove_duplicates(subdomain_list)
save_to_csv(subdomain_list, './data/subdomain_test_set.csv')
test_set = convert_to_list(subdomain_list)
subdomain_list.head()

Unnamed: 0,URL,Title,Text,path_count
0,https://charlotte.edu,The University of North Carolina at Charlotte ...,UNC Charlotte Icons 0 percent of new undergrad...,0
3725,http://hi.charlotte.edu,Health Informatics and Analytics,Master of Health Informatics and Analytics Hea...,0
4738,http://lc.charlotte.edu,Learning Communities,Homepage apply online now! UNC Charlotte’s Lea...,0
4239,http://faq.charlotte.edu,\r\n\tKnowledge Base\r\n,Updating... Skip to main content Filter your s...,0
4813,http://bcp.charlotte.edu,Emergency Management,Home The University is under normal operations...,0


### Summarization

In the summarization phase of our project, we utilized a pre-trained language model to generate context-aware summaries of the text data extracted from the University's website. The goal of this phase was to distill the content of each page into concise, informative summaries that capture the essence of the original text and facilitate semantic analysis and embedding generation. An important focus was to ensure the url path and subdomain added context to the summary.

#### Model Selection 
There are many pre-trained language models available, and it is an ever improving and growing list. For our project we wanted an open-source, instruct based, and smaller sized LLM that could run reliably on Google Colab and 8gb GPUs. In the end we chose [Google's Gemma-2b-it model](https://huggingface.co/google/gemma-2b-it), as through our initial compairson and research it provided a good balance of performance and size for our needs.

#### Environment Setup
We initiated our summarization process by setting up the necessary environment and determining whether the process was running on Google Colab to access specific resources and manage authentication securely. This setup included configuring our model and tokenizer with appropriate parameters for optimal performance across local and cloud-based environments.

#### Summary Generation Mechanism
Our approach employed a structured instructive prompt tailored to guide the LLM in producing summaries that encapsulate the main topics, purpose, and relevant keywords of each website. This instructive technique ensures that the model's output is aligned with our project's informational needs, providing a standardized format for the extracted summaries.

#### Technical Enhancements and Innovations
To improve the speed performance of the model We incorporated optimizations such as quantization by tested varying precisions (float16, bfloat16, 8-bit, 4-bit) without seeing any significant accuracy degredation. Quantization allows for adjustments to computational efficiency, allowing our model to run effectively even in resource-constrained environments like personal laptops or less powerful cloud services. We mainly focused on using GPU's for our model, however we understand that in a Production pipeline the summarization process may not have time constraints as a batched process and could be run on a CPU.

#### Challenges and Limitations
Despite our approach, our summarization process is still computationally intensive and requires substantial resources to execute efficiently. We were able to reliably run our process on Google Colab's free T4 tier, however run close to the resource limits during long runs. Additionally, the quality and relevance of the generated summaries can vary based on the size of our input data and the model's inherent biases and limitations, necessitating review of the output for consistency and accuracy. If this were a critical or customer facing application, we would need to thoroughly ensure that the summaries are accurate.

#### Future Directions and Improvements
Looking ahead, we aim to integrate headless browser solutions like Playwright or Selenium, enabling our system to execute and interpret JavaScript, thereby capturing the full range of content available on modern websites. Additionally, refining our model's ability to discern and disregard irrelevant data during preprocessing could further streamline our summarization process, reducing computational overhead and improving the relevacy of our summaries.

Some other ideas for future improvements include:
* Using a more powerful model, or API based inferences from much larger or private models for generated summaries.
* Flagging or adding additional context to prompts based on URL or subdomain, to add context to the summary.
* Re-archetecturing our implementation to allow for easier switching between models, or to allow for multiple models to be run in parallel or subsequentally to compare results programmatically.
* Tree-based summarization/ data structure, where we summarize based on levels of sub-domain and path. This would allow for a more structured and organized summary, and would allow for more complex queries/filters later during our semantic search.

In summary, our project's LLM Summary Generation phase represents a significant step in automating content comprehension and summarization, offering a scalable solution to process and distill vast quantities of web data into actionable insights.

In [3]:
!pip install bitsandbytes accelerate



In [4]:
%pip install torch transformers




In [4]:
# LLM Model 
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from textwrap import dedent
import json
from IPython.display import display
import time
import json
import os

# Environment setup
def is_running_on_colab():
    try:
        from google.colab import userdata
        return True
    except ModuleNotFoundError:
        return False

if is_running_on_colab():
    access_token = userdata.get('HUGGINGFACE_TOKEN')
else:
    access_token = access_token = os.getenv("HUGGINGFACE_TOKEN")

MODEL = "google/gemma-2b-it" # Newer and small model, promising given the size to performance as well as open source

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

quantization_config = BitsAndBytesConfig(load_in_4bit=True, torch_dtype=torch.float16, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", quantization_config=quantization_config)

LLM_INSTRUCT_PROMPT = dedent('''\
Given the URL "{url}", the title of the webpage "{title}", and the following text from the site: "{text}" provide a concise summary that includes:
1. The main topics in the text.
2. The purpose or objective of the website, inferred from the title, text and url (including subdomain and path).
3. Tags or keywords that a user may search to try and find the site in a search engine.

You must utilize information from the URL (such as the specific path and subdomain) to contextualize and add to the understanding of the text.

Format the response in a json like the following:

'summary': 'The description of url and summary of the text goes here.',
'topics': ['topic1', 'topic2', 'topic3'],
'tags': ['tag1', 'tag2', 'tag3']
''')

def gen_summary(url: str, title:str, text:str) -> str:
    chat = [
        { "role": "user", "content": LLM_INSTRUCT_PROMPT.format(url=url, text=text, title=title) },
    ]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer([prompt], add_special_tokens=False, return_tensors="pt").to(device)
    output = model.generate(**input_ids, max_new_tokens=1000)
    return tokenizer.decode(output[0], skip_special_tokens=True)

def extract_summary(llm_output: str):
    model_index = llm_output.find("'tags': ['tag1', 'tag2', 'tag3']\nmodel")
    if model_index != -1:
        json_string = llm_output[model_index + 40:]

        # Replace ",] with "] to fix json formatting, common issue with LLM
        json_string = json_string.replace('",]', '"]')

        # If last character is not } then add it
        if json_string[-1] != '}':
            json_string += '}'

        # If first character is not { then add it
        if json_string[0] != '{':
            json_string = '{' + json_string

        try:
            json_object = json.loads(json_string)
            return json_object
        except:
            return None
    else:
        raise ValueError('Could not find model output in LLM output')

cuda:0


`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
import gc

# Clrear GPU cache as it builds up
def flush_gpu_cache():
    if device == "cuda:0":
        torch.cuda.empty_cache()
        gc.collect()

In [None]:
%pip install pandas

In [6]:
import pandas as pd

# Generate Summaries
current_display = display('Starting...', display_id=True)
progress_display = display('Starting...', display_id=True)
time_display = display('Starting...', display_id=True)
success_count = 0
failure_count = 0
avg_time = 0
last_time = 0
current_time = 0

# Test llm summaries
summary_outputs = []
extract_summaries = []
final_urls = []

test_set = list(pd.read_csv('./data/subdomain_test_set.csv').itertuples(index=False, name=None))

for t in test_set:
    start_time = time.time()
    try:
        sum = gen_summary(t[0],t[1],t[2])
        summary_outputs.append(sum)
        extract = extract_summary(sum)
        extract_summaries.append(extract)
        if extract is not None:
            extract['url']=t[0]
            extract['title']=t[1]
            final_urls.append(extract)
            success_count += 1
            current_display.update(f'Most recent URL: {t[0]}, Extract: {extract["summary"][:60]}...')
            progress_display.update(f'Successes: {success_count}, Failures: {failure_count}')

            # Save the final urls to a json file every 10 summaries, in case of failure or runtime error. This ensures we don't waste progress
            if success_count % 10 == 0:
                flush_gpu_cache()
                with open('./data/sublist_llm_summaries_from_pipeline.json', 'w') as f:
                    json.dump(final_urls, f, indent=4)
        else:
            raise Exception('Error parsing json')
    except Exception as e:
        failure_count += 1
        current_display.update(f'Most recent URL: {t[0]} \nError: {e}')
        progress_display.update(f'Successes: {success_count}, Failures: {failure_count}')
        continue
    end_time = time.time()
    last_time = end_time - start_time
    avg_time = (avg_time * (success_count - 1) + last_time) / success_count
    time_display.update(f'Average time: {avg_time}, Last time: {last_time}')

# Save the final urls to a json file one last time at the end
with open('./data/sublist_llm_summaries_from_pipeline.json', 'w') as f:
    json.dump(final_urls, f, indent=4)

'Most recent URL: https://interdisciplinarystudies.charlotte.edu, Extract: The Office of Interdisciplinary Studies promotes interdiscip...'

'Successes: 450, Failures: 5'

'Average time: 4.584820302327476, Last time: 5.368001222610474'

  attn_output = torch.nn.functional.scaled_dot_product_attention(


In [7]:
# Clear llm model and objects from memory once complete

del model
del tokenizer

### Semantic Analysis

In the semantic analysis phase of our project, we aimed to generate embeddings for the text data extracted from the University's website. These embeddings capture the semantic meaning of the text and enable our search engine to understand and interpret user queries contextually, thereby enhancing the relevance and accuracy of search results.

### Embeddings Creation
We utilized the Txtai library, and specifically the `txtai.Embeddings` module with the pre-trained model `sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco`. This model transforms the textual content from each web page into dense vector representations, capturing the underlying meanings and relationships within the data. The embedding process converts the summaries, topics, and tags extracted from each page into a format suitable for semantic search, thereby enhancing the relevance and precision of our search engine. There are many different models and methods for creating embeddings, and we chose this one because it was easy to use and had good performance for our needs.

### Data Preparation
Prior to embedding, we prepared the text data by combining URLs, summaries, topics, and tags into a single, displayable text format. This consolidation was necessary to ensure that the model could consider all relevant information when generating embeddings. Additionally, we implemented error handling to manage missing files.

### Indexing and Search Capability
Once the embeddings were generated, they were indexed to enable efficient semantic search across our dataset. The `embeddings.index()` function was utilized to associate each web page's URL with its corresponding embedding, laying the groundwork for our search functionality. 

We developed two primary functions for interacting with the indexed data:
- `search(query, max_results=5)`: This function allows users to perform semantic searches within our dataset. By inputting a query, the function retrieves the most semantically relevant web pages, demonstrating the practical application of our embeddings in real-world search scenarios.
- `explain(query, max_results=5)`: In addition to the standard search functionality, we implemented an 'explain' feature. This function provides insight into the reasons behind the search results, offering users transparency and a deeper understanding of the search engine’s behavior.

### Technical Methods
Our implementation relies on the robust capabilities of the txtai library. The underlying technology is not new, but txtai provides a simple and effective interface for generating embeddings and performing semantic searches. During our initial research we explored how the library works by utilizing [Sentence-Transformers](https://www.sbert.net) to create and store our own embeddings in a in-memory database. In a production use case we would likely use a more powerful and scalable database, such as Elasticsearch, to store and search our embeddings and have more control over the configuration.

### Limitations
While our semantic analysis and embedding generation have significantly enhanced our project's capabilities, there are limitations. The process is heavily dependent on the quality and relevance of the initial text data. Moreover, as the embeddings are based on pre-trained models, there may be constraints related to the specificity and domain relevance of these models. Additionally, the embedding model will need to be able to run on command to embed user queries, so the size of the model is a consideration for where and how the server will be hosted. 

### Future Improvements
Looking ahead, we aim to refine our embedding generation by experimenting with different models and tuning parameters to better suit our specific dataset and use cases. Additionally, enhancing our error handling and data preparation methods will improve the robustness and effectiveness of our semantic analysis.

We also believe there exists an oportunity for extremely cost effective semantic search capabilities by using [Transformers.js](https://huggingface.co/docs/transformers.js/en/index), which would allow for the embedding generation and search to be done client side, and would allow for a more scalable and cost effective solution. This would also allow for more complex queries and filters to be run client side, and would allow for a more interactive and dynamic search experience. More testing of the limitations and capabilities of this for large datasets would be necessary.


In [2]:
import txtai
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

# Create embeddings for each web page
embeddings = txtai.Embeddings(path="sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco", content=True)

def to_displayable_text(row):
    return f"{row['url']}\n{row['title']}\n{row['summary']}\n{row['topics']}\n{row['tags']}"

if 'final_urls' not in globals():
    try:
        with open('./data/sublist_llm_summaries_from_pipeline.json', 'r') as f:
            final_urls = json.load(f)
    except FileNotFoundError:
        print('No file found')

if final_urls is not None:
    embeddings.index([(row["url"], to_displayable_text(row)) for row in final_urls])

def search(query, max_results=1):
    # Search index
    results = embeddings.search(query, max_results)
    return results

def explain(query, max_results=1):
    # Explain index
    results = embeddings.explain(query,max_results)
    return results

NameError: name 'json' is not defined

In [9]:
# Python Search Example
while True:
    user_input = input("Enter command ('your search query here' or 'quit' to exit): ").strip()
    
    if user_input.lower() == 'quit':
        print("Exiting the program.")
        break
    elif len(user_input) < 5:
        print("Invalid input. Please input a query longer than 4 characters.")
    else:
        results = search(user_input)
        for result in results:
            print(result)

{'id': 'https://sis.charlotte.edu', 'text': "https://sis.charlotte.edu\nDepartment of Software information Systems - College of Computing and Informatics\nThe Department of Software Information Systems (SIS) is a pioneer in Information Technology research and education. We offer a wide selection of courses in Information Technology and Software Engineering, with an emphasis on designing and deploying IT infrastructures that deliver integrated, secure, reliable, and easy-to-use services. We also partner with the Computer Science and Geography and Earth Sciences departments to offer specific concentrations in those fields to our students.\n['Information Technology', 'Software Engineering', 'Computer Science', 'Geography', 'Earth Sciences']\n['SIS', 'Department of Software information Systems', 'Information Technology', 'Software Engineering', 'Computer Science', 'Geography', 'Earth Sciences']", 'score': 0.7893373370170593}
{'id': 'https://dsi.charlotte.edu', 'text': "https://dsi.charlott

### Search Engine

In the search engine phase of our project, we developed a small custom front-end web application to communicate with the user and demonstrate the capabilities of our semantic search engine. The application allows users to input queries and receive relevant search results, showcasing the practical application of our semantic analysis and embedding generation.

## API Integration and Functionality

In the development phase of our project, we integrated two primary functionalities into our web service: `search` and `explain`, both critical to our semantic analysis system. Utilizing the Flask web framework, renowned for its ease of use and flexibility, we crafted a lightweight, RESTful API service designed to interface seamlessly with our backend logic and data embeddings.

### Implementation Strategy

We developed two endpoints within our Flask application: `/search` and `/explain`, corresponding to our core functionalities. These endpoints accept POST requests, aligning with standard API practices for data transmission and retrieval, ensuring both security and efficiency.

1. **Search Endpoint**:
    - This endpoint facilitates querying the embeddings index to find the most relevant information based on user input. It decodes the JSON payload from client requests to extract the 'query' and 'max_results' parameters, providing a user-friendly interface for data querying.
    - Upon receiving a query, it interacts with the `search` function, which utilizes our pre-processed and indexed data, returning a list of results sorted by relevance. This process encapsulates the essence of semantic search by leveraging natural language understanding to fetch pertinent information.

2. **Explain Endpoint**:
    - Similar to the search functionality but with an added layer of interpretability, this endpoint provides insights into the query results. It helps users understand why certain pieces of information were deemed relevant by detailing the matching process.
    - This transparency is crucial for applications requiring a deeper understanding of AI-driven decisions, enhancing user trust and system accountability.

### API Design and User Experience

Our API design prioritizes simplicity and efficiency, making it accessible to developers and end-users alike. By employing the Flask framework, we benefited from its minimalistic yet powerful features, enabling rapid development and deployment. The choice of POST methods for both endpoints aligns with best practices for APIs, ensuring data encapsulation and enhancing security.

### Testing and Debugging

The Flask application is configured to run in debug mode, facilitating real-time feedback and immediate error logging during the development process. This feature accelerates debugging and streamlines the testing phase, allowing for a more agile development workflow.

### Future Directions

As the project evolves, we anticipate extending the API's capabilities, incorporating additional endpoints, and refining existing ones to accommodate a broader range of queries and use cases. Continuous integration of user feedback and performance metrics will guide these enhancements, ensuring the API remains robust, user-centric, and aligned with evolving project goals.

### Conclusion

The creation of our Flask-based API represents a significant milestone in our project, bridging the gap between complex backend algorithms and user-facing applications. It exemplifies a successful integration of semantic technologies with web services, paving the way for advanced search and analysis tools accessible through straightforward, RESTful interfaces.


In [1]:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/search', methods=['POST'])
def search_endpoint():
    """Search the embeddings index."""
    data = request.get_json()
    query = data.get('query', '')
    max_results = data.get('max_results', 5)
    results = search(query, max_results=max_results)
    return jsonify({'results': results})

@app.route('/explain', methods=['POST'])
def explain_endpoint():
    """Explain the embeddings index."""
    data = request.get_json()
    query = data.get('query', '')
    max_results = data.get('max_results', 5)
    results = explain(query, max_results=max_results)
    return jsonify({'results': results})

if __name__ == '__main__':
    app.run()  # Enable reloader and debugger


 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
[2024-03-20 17:45:26,864] ERROR in app: Exception on /search [POST]
Traceback (most recent call last):
  File "c:\Python312\Lib\site-packages\flask\app.py", line 1463, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python312\Lib\site-packages\flask\app.py", line 872, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python312\Lib\site-packages\flask\app.py", line 870, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Python312\Lib\site-packages\flask\app.py", line 855, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Local\Temp\ipykernel_6248\1726753683.py", line 11, in 

### Evaluation

In the evaluation phase of our project, we aimed to assess the performance and effectiveness of our semantic search engine, focusing on the relevance and accuracy of search results. We utilized a custom evaluation framework to measure the engine's performance against a manually curated test set, comparing the engine's output with the [Google Search Console powered University website search](https://search.charlotte.edu).