# Asylum Seeker Q&A System

In [None]:
!pip install langchain langchain-openai langchain-community faiss-cpu tiktoken langchain_huggingface

*   **Explanation**: This command installs the following:
    *   `langchain`: Core framework for building applications with language models.
    *   `langchain-openai`: Integrations with OpenAI models.
    *   `langchain-community`: Community-maintained integrations, including the document loaders and the FAISS vector store wrapper.
    *   `faiss-cpu`: Library for efficient similarity search.
    *   `tiktoken`: For tokenization of text.
    *   `langchain_huggingface`: Integrations with HuggingFace models.

In [None]:
import os
import pandas as pd
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
# Loaders and vector stores live in langchain_community; the older
# langchain.document_loaders and langchain.vectorstores paths are deprecated.
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.vectorstores import VectorStoreRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

*   **Explanation**: We import libraries for data manipulation (`pandas`), interacting with the operating system (`os`), accessing language models (`langchain_openai`), generating embeddings (`langchain_huggingface`), loading documents (`langchain_community.document_loaders`), structuring documents (`langchain_core.documents`), splitting text (`langchain.text_splitter`), creating vector stores (`langchain_community.vectorstores`), and building retrieval chains (`langchain.chains`). Of these, `os`, `OpenAI`, `OpenAIEmbeddings`, `VectorStoreRetriever`, and `RetrievalQA` are imported but not used in the cells below.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

*   **Explanation**: The cell above mounts Google Drive so that the notebook can read files stored there. The next cell reads the `cleanerData.csv` file into a Pandas DataFrame, renames `Section` to `Title` and `Paragraph` to `Content`, and concatenates the two into a single column `combined`, so that each text carries its section heading along with its body.
*   The `texts` variable then holds the values of the `combined` column as a Python list.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/cleanerData.csv')
df.rename(columns={'Section': 'Title'}, inplace=True)
df.rename(columns={'Paragraph': 'Content'}, inplace=True)
df['combined'] = df['Title'] + " " + df['Content']
texts = df['combined'].tolist()
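
One thing to watch in the combining step: if a row has a missing `Title` or `Content`, pandas string concatenation yields `NaN` for that row, which cannot later serve as `page_content`. The following is a minimal sketch of the same combining logic with missing fields handled, written with the standard `csv` module so that it runs without the Drive file. The sample rows and the `combine_rows` helper are hypothetical and not part of the notebook.

```python
import csv
import io

# Hypothetical two-row sample standing in for cleanerData.csv.
SAMPLE_CSV = """Section,Paragraph
part 1 chapter 2,The Immigration Act 1971 is amended as follows.
body,
"""

def combine_rows(csv_text):
    """Return one 'Title Content' string per row, skipping empty parts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    combined = []
    for row in reader:
        title = (row.get("Section") or "").strip()
        content = (row.get("Paragraph") or "").strip()
        # Join only the non-empty parts so a missing field leaves no stray space.
        combined.append(" ".join(part for part in (title, content) if part))
    return combined

sample_texts = combine_rows(SAMPLE_CSV)
print(sample_texts)
```

In pandas itself, a comparable effect could be had by filling missing values with empty strings before concatenating.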

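*   **Explanation**: `TextLoader` reads a single text file from a path, so it cannot load the in-memory `texts` list, and the `loader` object is never used in later cells. The line is therefore commented out. Instead, we wrap each string directly in a LangChain `Document` object, which will be used in the next step.

In [None]:
# TextLoader expects a file path, not a list of strings, and is not needed here.
# loader = TextLoader(texts)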

In [None]:
documents = [Document(page_content=t) for t in texts]

*   **Explanation**: This splits the `Document` objects into chunks of 1000 characters each, with an overlap of 200 characters between chunks. This overlap helps retain context across chunk boundaries.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

In [None]:
docs = text_splitter.split_documents(documents)
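
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It uses a plain fixed-size sliding window; `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and spaces, so its real chunk boundaries will differ. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

# Hypothetical 2500-character sample text.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
# The last 200 characters of one chunk reappear at the start of the next.
print(chunks[0][-200:] == chunks[1][:200])
```

With these settings each window advances by 800 characters, which is why neighbouring chunks share 200 characters of context.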

*   **Explanation**: This defines a function `save_object` that uses `pickle` to serialize `docs` and write them to a file, here `docs.pkl`.

In [None]:
import pickle
def save_object(docs, filename):
    with open(filename, 'wb') as f:
        pickle.dump(docs, f)
save_object(docs, 'docs.pkl')

*   **Explanation**: This defines a function `load_object` that reads a pickled object back from a file. We use it to load the chunks from `docs.pkl` into `loaded_docs`, then print the first element to check that the content was restored correctly. The function is named differently from the variable so that the assignment does not overwrite it.

In [None]:
def load_object(filename):
    # Read a pickled object back from disk.
    with open(filename, 'rb') as f:
        return pickle.load(f)

loaded_docs = load_object("docs.pkl")
print(loaded_docs[0].page_content)

body (1)The Secretary of State must appoint a person as the Director of Labour Market Enforcement (referred to in this Chapter as “the Director”). (2)The Director is to hold office in accordance with the terms of his or her appointment. (3)The functions of the Director are exercisable on behalf of the Crown. (4)The Secretary of State must provide the Director with such staff, goods, services, accommodation and other resources as the Secretary of State considers the Director needs for the exercise of his or her functions. (5)The Secretary of State must— (a)pay the Director such expenses, remuneration and allowances, and (b)pay or make provision for the payment of such pension to or in respect of the Director, as may be provided for by or under the terms of the Director's appointment. Commencement Information I1S. 1in force at 12.7.2016 byS.I. 2016/603,reg. 3(a) (1)The Director must before the beginning of each financial year prepare a labour market enforcement strategy for that year and

In [None]:
len(docs)

In [None]:
len(loaded_docs)
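
Comparing lengths only confirms that the same number of chunks came back. A stricter check is to compare the restored objects with the originals. The sketch below shows such a round trip using plain dictionaries as a hypothetical stand-in for the `Document` list and a temporary directory instead of the working directory. The helper functions are defined inside the block so that it is self-contained.

```python
import os
import pickle
import tempfile

def save_object(obj, filename):
    # Serialize obj to the given file.
    with open(filename, "wb") as f:
        pickle.dump(obj, f)

def load_object(filename):
    # Read a pickled object back from the given file.
    with open(filename, "rb") as f:
        return pickle.load(f)

# Hypothetical stand-in for the chunk list: plain dicts, not Document objects.
sample_chunks = [
    {"page_content": "first chunk", "metadata": {}},
    {"page_content": "second chunk", "metadata": {}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "docs.pkl")
    save_object(sample_chunks, path)
    restored = load_object(path)

# Equality of contents, not just of lengths.
round_trip_ok = restored == sample_chunks
print(round_trip_ok)
```

Pickle files can execute code when loaded, so this approach is only appropriate for files you created yourself.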

*   **Explanation**: This initializes the embedding model from Hugging Face, using the `sentence-transformers/all-mpnet-base-v2` model. This model is designed to generate high-quality embeddings for sentence-level text.

In [None]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

*   **Explanation**: This step builds a FAISS index from the `docs` and their `embeddings`. FAISS is used for efficient similarity search.

In [None]:
library = FAISS.from_documents(docs, embeddings)

In [None]:
library

*   **Explanation**: A sample query `query1` is defined and used to search for similar documents in the vector store. The `similarity_search` function returns the most similar documents (four by default, controlled by the `k` argument), ordered from most to least similar. Here, we display the first one.

In [None]:
query1 = "Why can I not work in the UK while applying for asylum?"

In [None]:
queryAnswer = library.similarity_search(query1)

In [None]:
print(queryAnswer[0].page_content)

3, paragraph 1(1)(ga) prevents a local authority in England from providing support or assistance under a provision mentioned in paragraph (ga) to a person if— (a)support is being provided to the person by virtue of paragraph 10B or section 95A of the Immigration and Asylum Act 1999, or (b)there are reasonable grounds for believing that support will be provided to the person by virtue of that paragraph or section.” 7U.K.In paragraph 6 (third class of ineligible person: failed asylum-seeker), in sub-paragraph (1), in the words before sub-paragraph (a), after “person” insert“ in Wales, Scotland or Northern Ireland ”. 8U.K.In paragraph 7 (fourth class of ineligible person: person unlawfully in United Kingdom), in the words before sub-paragraph (a), after “person” insert“ in Wales, Scotland or Northern Ireland ”. 9U.K.Before paragraph 8 insert— 7B(1)Paragraph 1 applies to a person in England if— (a)under the Immigration Act 1971, he requires leave to enter or remain in the United Kingdom

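*   **Explanation**: The `similarity_search_with_score` function returns each matching document together with a score. With LangChain's FAISS wrapper and its default settings, this score is an L2 distance rather than a similarity, so a lower value means a closer match.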

In [None]:
docs_and_scores = library.similarity_search_with_score(query1)

In [None]:
docs_and_scores[0]

(Document(id='5790ae72-4165-47bb-9634-bc83255ae133', metadata={}, page_content='3, paragraph 1(1)(ga) prevents a local authority in England from providing support or assistance under a provision mentioned in paragraph (ga) to a person if— (a)support is being provided to the person by virtue of paragraph 10B or section 95A of the Immigration and Asylum Act 1999, or (b)there are reasonable grounds for believing that support will be provided to the person by virtue of that paragraph or section.” 7U.K.In paragraph 6 (third class of ineligible person: failed asylum-seeker), in sub-paragraph (1), in the words before sub-paragraph (a), after “person” insert“ in Wales, Scotland or Northern Ireland ”. 8U.K.In paragraph 7 (fourth class of ineligible person: person unlawfully in United Kingdom), in the words before sub-paragraph (a), after “person” insert“ in Wales, Scotland or Northern Ireland ”. 9U.K.Before paragraph 8 insert— 7B(1)Paragraph 1 applies to a person in England if— (a)under the Immigration Act 1971, he requires leave to enter or remain in the United Kingdom'),
 0.8213681)
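
To illustrate how a distance-based score orders results, here is a brute-force sketch of nearest-neighbour ranking over toy two-dimensional vectors. It assumes the score is an L2-style distance, where smaller means closer. The vectors, chunk names, and helper functions are hypothetical, and a real FAISS index performs this search far more efficiently over 768-dimensional embeddings. Whether the index reports the squared or unsquared distance does not change the ranking.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, distance) pairs closest to query_vec."""
    scored = [(doc_id, l2_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical toy embeddings.
toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.0, 1.0],
    "chunk_c": [0.9, 0.1],
}

top_hits = rank_by_distance([1.0, 0.0], toy_vectors, k=2)
for doc_id, distance in top_hits:
    print(doc_id, round(distance, 4))
```

Read this way, the first element of `docs_and_scores` is the chunk with the smallest distance to the query, which does not by itself guarantee that the chunk answers the question.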

*   **Explanation**: This converts the FAISS vector store into a retriever, which can be used in chains or other LangChain components to retrieve relevant documents.

In [None]:
retriever = library.as_retriever()

*   **Explanation**: Here we import `AutoModelForSeq2SeqLM` and `AutoTokenizer` from the `transformers` library. We then specify the model `google/flan-t5-xl` and load the tokenizer and the model.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

*   **Explanation**: The `query_flan` function takes a question (or a full prompt), a model, and a tokenizer as inputs. It tokenizes the text, generates a response of at most `max_new_tokens` tokens using the model, and decodes the output to a string.

In [None]:
def query_flan(question, model, tokenizer, max_new_tokens=50):
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

*   **Explanation**: This function performs the following steps:
    *   It finds the `k` (in this case 2) most relevant documents from the `library` for the input `question`.
    *   It prints the content of these relevant documents.
    *   It combines the content of these documents into a single `context` string.
    *   It then constructs a `prompt` that includes this context, the original question, and instructions to provide a detailed answer and explain the reasoning.
    *   It uses the `query_flan` function to get an answer from the model, using the constructed prompt.
    *   Finally, it returns the answer.
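
The prompt-assembly steps above can be sketched in isolation, without the model or the vector store. The version below also caps the context length, since long retrieved passages can produce prompts larger than the model handles well. The `build_prompt` name and the `max_context_chars` parameter are hypothetical additions for illustration and do not appear in the notebook's function.

```python
def build_prompt(question, passages, max_context_chars=2000):
    """Join retrieved passages into a context and wrap it in the prompt template."""
    context = " ".join(passages)
    # Hypothetical safeguard: cap the context so the prompt stays a manageable size.
    if len(context) > max_context_chars:
        context = context[:max_context_chars]
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer: Provide a detailed answer and explain the reasoning behind it, "
        "using the provided context."
    )

# Hypothetical passages standing in for the retrieved chunks.
example_passages = ["First retrieved passage.", "Second retrieved passage."]
example_prompt = build_prompt("Why can I not work in the UK?", example_passages)
print(example_prompt)
```

Keeping prompt construction in its own function makes it easy to inspect exactly what text the model receives before any generation happens.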

In [None]:
def get_answer_from_library(question, library, model, tokenizer, max_new_tokens=50):

    relevant_docs = library.similarity_search(question, k=2)

    print("Relevant Documents:")
    for doc in relevant_docs:
        print(doc.page_content)

    context = " ".join([doc.page_content for doc in relevant_docs])


    prompt = f"Context: {context}\nQuestion: {question}\nAnswer: Provide a detailed answer and explain the reasoning behind it, using the provided context."

    answer = query_flan(prompt, model, tokenizer, max_new_tokens)

    return answer

question = "Why can I not work in the UK while applying for asylum?"
answer = get_answer_from_library(question, library, model, tokenizer)
print("Answer:")
print(answer)

Relevant Documents:
3, paragraph 1(1)(ga) prevents a local authority in England from providing support or assistance under a provision mentioned in paragraph (ga) to a person if— (a)support is being provided to the person by virtue of paragraph 10B or section 95A of the Immigration and Asylum Act 1999, or (b)there are reasonable grounds for believing that support will be provided to the person by virtue of that paragraph or section.” 7U.K.In paragraph 6 (third class of ineligible person: failed asylum-seeker), in sub-paragraph (1), in the words before sub-paragraph (a), after “person” insert“ in Wales, Scotland or Northern Ireland ”. 8U.K.In paragraph 7 (fourth class of ineligible person: person unlawfully in United Kingdom), in the words before sub-paragraph (a), after “person” insert“ in Wales, Scotland or Northern Ireland ”. 9U.K.Before paragraph 8 insert— 7B(1)Paragraph 1 applies to a person in England if— (a)under the Immigration Act 1971, he requires leave to enter or remain in the United Kingdom
part 1 chapter 2 (1)The Immigration Act 1971 is amended as follows. (2)In section 3(1)(c)(i) (power to grant limited leave to enter or remain in the United Kingdom subject to condition restricting employment or occupation) for “employment” substitute“ work ”. (3)After section 24A insert— (1)A person (“P”) who is subject to immigration control commits an offence if— (a)P works at a time when P is disqualified from working by reason of P's immigration status, and (b)at that time P knows or has reasonable cause to believe that P is disqualified from working by reason of P's immigration status. (2)For the purposes of subsection (1) a person is disqualified from working by reason of the person's immigration status if— (a)the person has not been granted leave to enter or remain in the United Kingdom, or (b)the person's leave to enter or remain in the United Kingdom— (i)is invalid, (ii)has ceased to have effect (whether by reason of curtailment, revocation, cancellation, passage of time or
Answer:
The person has not been granted leave to enter or remain in the United Kingdom, or (b)the person's leave to enter or remain in the United Kingdom— (i)is invalid, (ii)has ceased to have effect


*   **Explanation**: Here, we call `get_answer_from_library` with our question and print the answer it returns. The retrieved documents are printed first, followed by the answer.

**Conclusion**

This notebook demonstrates an Asylum Seeker Q&A system built with LangChain, Hugging Face sentence-transformer embeddings, FAISS, and Google's Flan-T5 model. The pipeline loads and chunks the source documents, embeds the chunks, retrieves the ones most similar to a question, and passes them to the language model as context for generating an answer.

**Key Accomplishments:**

*   **Data Preprocessing:** The system effectively prepares the raw asylum data for analysis.
*   **Embedding Generation:** High-quality text embeddings are created, enabling accurate semantic comparisons.
*   **Efficient Similarity Search:** FAISS enables rapid identification of relevant document chunks.
*   **Contextualized Question Answering:** The Flan-T5 model generates answers that are contextually grounded in the retrieved documents.
*   **Saving Data:** The split document chunks are pickled to `docs.pkl` so that they can be reloaded later without repeating the preprocessing.

**Potential Improvements:**

While the current system works well, there's room for further enhancement:

*   **Model Fine-tuning:** Experimenting with different embedding models or fine-tuning Flan-T5 could improve answer accuracy.
*   **Parameter Optimization:** Adjusting parameters like chunk size, overlap, and the number of retrieved documents can enhance the overall performance.
*   **Expanding the Dataset:** A larger, more diverse dataset would lead to more comprehensive and accurate responses.
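
As a starting point for the parameter optimization mentioned above, the candidate settings can be enumerated as a grid. This is only a sketch: the value ranges are hypothetical, and each configuration would still need to be scored against a set of questions with known relevant passages, which the notebook does not include.

```python
from itertools import product

# Hypothetical candidate values; the notebook itself uses 1000, 200, and k=2.
chunk_sizes = [500, 1000, 1500]
chunk_overlaps = [0, 200, 500]
top_ks = [2, 4]

def candidate_configs(sizes, overlaps, ks):
    """List every combination in which the overlap is smaller than the chunk size."""
    configs = []
    for size, overlap, k in product(sizes, overlaps, ks):
        if overlap < size:
            configs.append({"chunk_size": size, "chunk_overlap": overlap, "k": k})
    return configs

configs = candidate_configs(chunk_sizes, chunk_overlaps, top_ks)
print(len(configs))
for config in configs[:3]:
    print(config)
```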

**Overall Impact:**

This project demonstrates a practical application of advanced NLP techniques to address real-world information access challenges. The Q&A system helps users find specific information about the asylum process, and the split documents are saved to a file so that the preprocessing does not have to be repeated in future sessions.