# Immigration Law Search Engine Documentation


This notebook implements a search engine for immigration law and case law data. It leverages BERT embeddings to capture the semantic meaning of text and FAISS for efficient similarity search. The system also includes a basic chatbot interface to interact with the search engine.

In [None]:
! pip install pandas transformers faiss-cpu
! pip install -U langchain-community

In [None]:
from google.colab import drive
drive.mount('/content/drive')

This section installs the necessary Python libraries for this notebook and mounts Google Drive to access the data files.

**Libraries:**

*   **pandas:** For data manipulation and working with CSV files.
*   **transformers:** Provides pre-trained models like BERT for text embeddings.
*   **faiss-cpu:** A library for efficient similarity search and clustering of dense vectors.
*   **langchain-community:** Community-maintained integrations for LangChain, a framework for building applications with large language models (LLMs).

## Data Loading and Preprocessing <a name="data-loading-and-preprocessing"></a>

This section loads the dataset from Google Drive, renames columns for clarity, and combines text fields to create more context-rich data for the search engine.

**Steps:**

1.  **Load Data:** The `cleanerData.csv` file, containing immigration law data, is loaded from Google Drive into a pandas DataFrame.
2.  **Rename Columns:**
    *   `Section` is renamed to `Title`.
    *   `Paragraph` is renamed to `Content`.
3.  **Combine Text:** `Title` and `Content` are combined into a new column called `combined` to provide richer context for search queries.
4.  **Create Texts List:** The `combined` column is converted to a list called `texts` for later processing.
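
The steps above can be sketched on a toy DataFrame. The two rows are made up for illustration and are not taken from `cleanerData.csv`. The `fillna('')` call is an added assumption (missing values count as empty text), included because adding a string to `NaN` in pandas yields `NaN` rather than text.

```python
import pandas as pd

# Hypothetical stand-in for cleanerData.csv (illustrative values only).
toy = pd.DataFrame({
    'Section': ['section 1', 'section 2'],
    'Paragraph': ['Example paragraph text.', None],
})

# Step 2: rename both columns in one call.
toy = toy.rename(columns={'Section': 'Title', 'Paragraph': 'Content'})

# Assumption: a missing value is treated as empty text.
toy[['Title', 'Content']] = toy[['Title', 'Content']].fillna('')

# Steps 3 and 4: combine the fields, then convert to a plain list.
toy['combined'] = toy['Title'] + ' ' + toy['Content']
toy_texts = toy['combined'].tolist()
print(toy_texts)
```

Without the `fillna('')` line, the second entry of `toy_texts` would be `NaN`, which a tokenizer cannot process.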

In [None]:
import pandas as pd
# Document ships with langchain-core, which langchain-community depends on.
from langchain_core.documents import Document

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/cleanerData.csv')

print(df.columns)

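# Rename the source columns to the names used in the rest of the notebook.
df.rename(columns={'Section': 'Title', 'Paragraph': 'Content'}, inplace=True)

# Fill missing values so string concatenation never produces NaN.
df[['Title', 'Content']] = df[['Title', 'Content']].fillna('')

df['combined'] = df['Title'] + " " + df['Content']
texts = df['combined'].tolist()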

## Document Preparation <a name="document-preparation"></a>

In this step, we convert the text data into LangChain `Document` objects, which are used to hold text content and its associated metadata.

**Purpose:**

*   **LangChain Documents:** These are used to organize and structure text data for various operations in LangChain, especially when working with large language models.
*   **Metadata:** The `title` of each entry is added as metadata to provide additional context when retrieving documents.

In [None]:
documents = [Document(page_content=text, metadata={'title': title})
            for text, title in zip(df['combined'], df['Title'])]

## Embedding Creation <a name="embedding-creation"></a>

This section uses the pre-trained `bert-base-uncased` model to generate text embeddings. An embedding is a vector representation of a text that captures its semantic meaning.

**Key Components:**

*   **BERT:** A pre-trained transformer model that understands the context of words in sentences.
*   **Tokenizer:** Converts text into tokens that BERT can process.
*   **Embeddings:** Each text is converted into a single 768-dimensional vector by averaging BERT's token vectors (mean pooling). Inputs longer than 512 tokens are truncated.
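
The pooling step in this notebook's `get_embedding` function averages the token vectors of one text at a time, so no padding is involved. The sketch below illustrates the same averaging in plain Python and adds an attention mask, which would only matter if several texts were batched and padded to a common length. The function name `mean_pool` and the 2-dimensional toy vectors are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Three token positions with 2-dimensional vectors; the last one is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

Averaging all three positions instead would pull the result toward zero, which is why padded positions are excluded.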


In [None]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

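def get_embedding(text):
    # Tokenize one text; anything beyond BERT's 512-token limit is truncated.
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    # Mean-pool the token vectors into a single embedding vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Embed every entry one at a time (slow on CPU for large datasets).
embeddings = [get_embedding(text) for text in texts]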

## FAISS Index Creation <a name="faiss-index-creation"></a>

Here, a FAISS index is created to enable fast similarity search on the generated embeddings.

**Key Concepts:**

*   **FAISS:** A library for efficient similarity search and clustering of dense vectors.
*   **`IndexFlatL2`:** A FAISS index that performs exact (brute-force) search using L2 (Euclidean) distance. The distances it returns are squared L2 distances.
*   **Index:** Stores the embeddings, allowing for fast nearest-neighbor searches.
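
To make the idea of a flat L2 index concrete, here is a minimal plain-Python sketch of the computation such an index performs: compare the query with every stored vector and keep the `k` smallest squared distances. It is an illustration only, not how FAISS is implemented, and the names `squared_l2` and `flat_l2_search` are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors nearest to query."""
    ranked = sorted((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = ranked[:k]
    return [d for d, _ in top], [i for _, i in top]

stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
toy_distances, toy_indices = flat_l2_search(stored, query=[2.0, 2.0], k=2)
print(toy_distances, toy_indices)  # [2.0, 5.0] [1, 2]
```

Because every stored vector is compared, the result is exact, at a cost that grows linearly with the number of vectors.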


In [None]:
import faiss
import numpy as np

embedding_dim = embeddings[0].shape[0]
index = faiss.IndexFlatL2(embedding_dim)

embeddings_array = np.array(embeddings, dtype='float32')  # FAISS expects float32 vectors

index.add(embeddings_array)

## Search Function <a name="search-function"></a>

This function performs the core task of searching for relevant texts based on a user's query.

**How it Works:**

1.  **Query Embedding:** The user's query is converted into a BERT embedding using the `get_embedding` function.
2.  **FAISS Search:** The FAISS index is searched to find the embeddings closest to the query embedding.
3.  **Retrieve Results:** The function returns the `top_n` closest texts, along with their titles and distances from the query. A lower distance means a closer match.
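
The last step, mapping the positions returned by FAISS back to texts, can be sketched on its own. FAISS fills the index slots with `-1` when fewer than `top_n` vectors are available, so the sketch skips those. The helper name `collect_results` and the sample values are hypothetical.

```python
def collect_results(distances, indices, titles, contents):
    """Pair each hit with its title and content, skipping -1 placeholders."""
    results = []
    for distance, row in zip(distances, indices):
        if row < 0:  # no vector was found for this slot
            continue
        results.append({
            'title': titles[row],
            'content': contents[row],
            'distance': distance,
        })
    return results

titles = ['section 1', 'section 2']
contents = ['First text.', 'Second text.']

# Three results were requested, but only two vectors exist.
hits = collect_results([0.5, 1.5, 3.4e38], [1, 0, -1], titles, contents)
print([hit['title'] for hit in hits])  # ['section 2', 'section 1']
```

The lookup is by position, so it relies on the titles and contents being in the same order as the vectors added to the index.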

In [None]:
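def search(query, top_n=5):
    # Embed the query and shape it as a (1, dim) batch for FAISS.
    query_embedding = get_embedding(query).reshape(1, -1)

    distances, indices = index.search(query_embedding, top_n)

    results = []
    for distance, row in zip(distances[0], indices[0]):
        if row == -1:  # FAISS returns -1 when fewer than top_n vectors exist
            continue
        # Look rows up by position, matching the order used to build the index.
        results.append({
            'title': df['Title'].iloc[row],
            'content': df['Content'].iloc[row],
            'distance': distance
        })
    return results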

query = "What are the conditions for asylum based on political persecution?"
results = search(query)

for result in results:
    print(f"Title: {result['title']}\nContent: {result['content']}\nDistance: {result['distance']}\n")

Title: schedule 11 paragraph 40
Content: Prospective 40U.K.In Schedule 3 (withholding and withdrawal of support), in paragraph 17(1), for the definition of “asylum-seeker” substitute— ““asylum-seeker” has the meaning given by section 18,”. Prospective 40U.K.In Schedule 3 (withholding and withdrawal of support), in paragraph 17(1), for the definition of “asylum-seeker” substitute— ““asylum-seeker” has the meaning given by section 18,”.
Distance: 50.42012023925781

Title: schedule 11 paragraph 8
Content: Prospective 8U.K.In section 95 (persons for whom support may be provided), the heading becomes“Support for asylum-seekers,etc”. Prospective 8U.K.In section 95 (persons for whom support may be provided), the heading becomes“Support for asylum-seekers,etc”.
Distance: 50.75392532348633

Title: schedule 11 paragraph 6
Content: Prospective 6U.K.The heading of the Part becomes“Support for asylum-seekers,etc”. Prospective 6U.K.The heading of the Part becomes“Support for asylum-seekers,etc”.
Dis

## Chatbot Function <a name="chatbot-function"></a>

This section implements a simple chatbot interface, allowing users to interact with the search engine in a conversational way.

**Functionality:**

*   **Interactive Loop:** The chatbot continuously prompts the user for input.
*   **Exit Keywords:** Typing `exit` or `quit` will end the chatbot session.
*   **Search:** User queries are passed to the `search` function.
*   **Display Results:** For each of the most relevant results, the title and the first 500 characters of the content are shown to the user.
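
One way to exercise this loop without typing at a prompt is to pass the search, input, and output functions in as parameters. The sketch below is a hypothetical variant (`run_chat`, `read`, `write`, and `snippet_len` are names introduced here), driven by a scripted session and a stub search function rather than the real index.

```python
def run_chat(search_fn, read=input, write=print, snippet_len=500):
    """Prompt until the user types 'exit' or 'quit', printing search results."""
    write("Ask me anything about immigration law or case law!")
    while True:
        user_input = read("You: ").strip()
        if user_input.lower() in ('exit', 'quit'):
            break
        results = search_fn(user_input)
        if not results:
            write("Sorry, I couldn't find relevant information.")
            continue
        for result in results:
            snippet = result['content'][:snippet_len]
            write(f"Title: {result['title']}\nContent: {snippet}...")

# Scripted session with a stub search function (hypothetical data).
scripted = iter(["asylum", "quit"])
printed = []
run_chat(
    search_fn=lambda query: [{'title': 'section 1', 'content': 'x' * 600}],
    read=lambda prompt: next(scripted),
    write=printed.append,
)
print(len(printed))  # 2: the greeting and one result
```

With the defaults left in place, `run_chat(search)` would behave like an interactive session.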

In [None]:
def chatbot():
    print("Ask me anything about immigration law or case law!")
    while True:
        user_input = input("You: ")
        if user_input.strip().lower() in ['exit', 'quit']:
            break
        results = search(user_input)
        if results:
            print("Top results:")
            for result in results:
                print(f"\nTitle: {result['title']}\nContent: {result['content'][:500]}...")
        else:
            print("Sorry, I couldn't find relevant information.")
        print()

chatbot()


## Saving the Index <a name="saving-the-index"></a>

This final step saves the FAISS index to the file `faiss_index.index` in the current working directory.

**Purpose:**

*   **Persistence:** The index can be saved to avoid re-creating it each time the notebook is run.
*   **Reusability:** The saved index can be reloaded later for faster access to the search engine.
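
A saved flat index holds vectors only, so a session that reloads it with `faiss.read_index` also needs the titles and contents in the same row order. One possible approach, sketched below as an assumption rather than part of the original notebook, is to store them in a JSON file next to the index. The helper names and the file name `faiss_index_rows.json` are hypothetical.

```python
import json
import os
import tempfile

def save_rows(path, titles, contents):
    """Write titles and contents in index order so row i matches vector i."""
    rows = [{'title': t, 'content': c} for t, c in zip(titles, contents)]
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(rows, f, ensure_ascii=False)

def load_rows(path):
    """Read the rows back as a list of dictionaries."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)

# Round trip in a temporary directory (sample values are made up).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'faiss_index_rows.json')
    save_rows(path, ['section 1', 'section 2'], ['First text.', 'Second text.'])
    restored = load_rows(path)

print(restored[1]['title'])  # section 2
```

In the notebook, the arguments would be `df['Title'].tolist()` and `df['Content'].tolist()`.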

In [None]:
faiss.write_index(index, "faiss_index.index")