### Installations

In [1]:
!pip install sentence_transformers
!pip install faiss-cpu
!pip install faiss-gpu


### Imports

In [31]:
import json
import numpy as np
import faiss
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

#### Data Processing and Preparation

In [3]:
# Load dataset
def load_data(corpus_path, train_path):
    with open(corpus_path, 'r') as f:
        corpus = json.load(f)
    with open(train_path, 'r') as f:
        train_data = json.load(f)
    return corpus, train_data

corpus_path = "corpus.json"
train_path = "train.json"
corpus, train_data = load_data(corpus_path, train_path)

In [16]:
# Preprocess corpus
articles = []
metadata = []

for article in corpus:
    article_text = article['body']
    meta = {
        "title": article['title'],
        "author": article['author'],
        "url": article['url'],
        "source": article['source'],
        "category": article['category'],
        "published_at": article['published_at']
    }
    articles.append(article_text)
    metadata.append(meta)

# Preprocess train
question_types = set()
for query in train_data:
    question_types.add(query['question_type'])

question_types = list(question_types)
print(question_types)

['null_query', 'comparison_query', 'temporal_query', 'inference_query']


# Step 1: Document Retrieval (TF-IDF)


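Next, we convert the body of each of the 609 articles, stored in `articles`, into a TF-IDF vector, giving a matrix with one row per article.
- `stop_words='english'` removes common English words such as "is", "the", "can", and "he".
- `max_features=10000` caps the vocabulary at the 10,000 most frequent terms, which is why the matrix has 10,000 columns.
- The `vectorizer` is fitted on the article data so that a query can later be transformed into a vector over the same vocabulary.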
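
As a minimal sketch of the fit-then-transform pattern described above, the following uses a made-up three-sentence corpus (the `toy_*` names and sentences are illustrative assumptions, not data from `corpus.json`). It shows that the query is vectorized with `transform()` so that it shares the columns learned from the documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, used only for illustration.
toy_articles = [
    "The company reported an increase in revenue.",
    "Revenue fell as the subscription model changed.",
    "The weather was sunny and warm.",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
# fit_transform() learns the vocabulary and vectorizes the documents.
toy_matrix = toy_vectorizer.fit_transform(toy_articles)

# A query is mapped onto the same vocabulary with transform(), not fit_transform().
toy_query_vec = toy_vectorizer.transform(["Did the revenue increase?"])

print(toy_matrix.shape)     # (number of documents, vocabulary size)
print(toy_query_vec.shape)  # (1, vocabulary size)
print("the" in toy_vectorizer.vocabulary_)  # stop words are not in the vocabulary
```

Because both matrices have the same number of columns, the query vector can be compared directly against every article vector.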

In [5]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
tfidf_matrix = vectorizer.fit_transform(articles)
print(tfidf_matrix.shape)

(609, 10000)


FAISS (Facebook AI Similarity Search) is a library designed for efficient similarity search and clustering of dense vectors.

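So, what are we doing here?

- `faiss.IndexFlatL2` creates an empty flat index whose dimension matches the TF-IDF vocabulary size. "Flat" means the vectors are stored as they are, without compression or clustering.
- Next, we add the vectorized articles to the index. FAISS expects dense `float32` arrays, so the sparse TF-IDF matrix is converted first.
- At search time, the index compares the query against every stored vector using L2 (Euclidean) distance. Similar articles lie close together in this vector space and dissimilar ones lie farther apart, just as points do in 3D space.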
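
To make the idea of a flat L2 search concrete, here is a small NumPy sketch of an exhaustive nearest-neighbour search. It is not FAISS code: `flat_l2_search` and the toy vectors are hypothetical names and data, and the sketch only mirrors the kind of computation a flat index performs (squared L2 distance to every stored vector, then keeping the closest ones).

```python
import numpy as np

def flat_l2_search(stored, queries, top_k):
    """Exhaustive nearest-neighbour search using squared L2 distance."""
    # Pairwise differences, shape (n_queries, n_stored, dim).
    diffs = queries[:, None, :] - stored[None, :, :]
    # Squared distances, shape (n_queries, n_stored).
    sq_dists = (diffs ** 2).sum(axis=2)
    # Indices of the top_k smallest distances per query, closest first.
    nearest = np.argsort(sq_dists, axis=1)[:, :top_k]
    return np.take_along_axis(sq_dists, nearest, axis=1), nearest

# Three stored 2D vectors and one query, invented for illustration.
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")

toy_dists, toy_ids = flat_l2_search(toy_vectors, toy_query, top_k=2)
print(toy_ids)    # row 1 is closest, then row 0
print(toy_dists)
```

This brute-force approach is exact but scales linearly with the number of stored vectors, which is acceptable for a corpus of a few hundred articles.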

In [7]:
# Use FAISS for fast similarity search
index = faiss.IndexFlatL2(tfidf_matrix.shape[1])

index.add(tfidf_matrix.toarray().astype('float32'))

- This function retrieves the documents in the corpus that are most relevant to a given query. It first converts the query into a TF-IDF vector with the same fitted `vectorizer` used for the articles.

- Then it searches the flat index, which holds all the article vectors, for the vectors closest to the query vector.

- `top_k=4` means we keep the 4 most similar articles in the index.

- `index.search` returns two arrays: the distances of those 4 articles from the query vector (`Distances` in the code) and their row indices in `articles` (`Indexes`). The function returns only the indices.

In [8]:
def retrieve_documents(query, top_k=4):
    query_vec = vectorizer.transform([query])

    query_vec_dense = query_vec.toarray().astype('float32')

    Distances, Indexes = index.search(query_vec_dense, top_k)
    return Indexes.flatten()

# Step 2: Answer Generation using a Generative Model (T5)

#### This function generates an answer to the user query using T5 (Text-To-Text Transfer Transformer), a pre-trained, open-source language model from Google. [T5 Hugging Face Model](https://huggingface.co/google-t5/t5-small)

#### Here, we pass in the top 4 documents retrieved for the query by the function above.

#### Then we prepare the context by joining the bodies of all 4 articles.

#### Finally, we pass the `input_text` to the model and generate the answer.

In [9]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

def generate_answer(retrieved_docs, query):
    context = " ".join([articles[doc] for doc in retrieved_docs])
    input_text = f"question: {query} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
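
Because `generate_answer` tokenizes with `max_length=512` and `truncation=True`, everything after the first 512 tokens of the prompt is dropped, so a long first article can crowd out the other three. One possible mitigation, sketched below, is to give each retrieved document an equal share of the context. This is not part of the original notebook: `build_context` and `words_per_doc` are hypothetical names, and counting whitespace-separated words is only a rough stand-in for the tokenizer's token count.

```python
def build_context(docs, words_per_doc=100):
    """Keep at most words_per_doc whitespace-separated words from each document."""
    # Whitespace words only approximate T5 tokens; the budget is an assumption.
    trimmed = [" ".join(doc.split()[:words_per_doc]) for doc in docs]
    return " ".join(trimmed)

# Two artificially long documents, invented for illustration.
toy_docs = ["alpha " * 300, "beta " * 300]
toy_context = build_context(toy_docs, words_per_doc=3)
print(toy_context)  # alpha alpha alpha beta beta beta
```

With a helper like this, the context line in `generate_answer` could call `build_context([articles[doc] for doc in retrieved_docs])` in place of the plain join.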


#### The function below finds the `fact` part of the output: it splits the article body into sentences and returns the first sentence that contains the answer string (ignoring case). If no sentence contains it, a fallback message is returned.

In [10]:
def extract_full_sentence_with_answer(article_text, answer_text):
    # Split the article text into sentences
    sentences = re.split(r'(?<=[.!?]) +', article_text)

    # Search for the sentence containing the answer_text
    for sentence in sentences:
        if answer_text.lower() in sentence.lower():
            return sentence.strip()

    return "Answer text not found in the article."
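
The short example below shows both outcomes of the sentence lookup on a made-up article string (`toy_article` is an illustrative assumption). The function body is repeated from the cell above only so that the block runs on its own.

```python
import re

def extract_full_sentence_with_answer(article_text, answer_text):
    # Same logic as the notebook function, repeated so this block is self-contained.
    sentences = re.split(r'(?<=[.!?]) +', article_text)
    for sentence in sentences:
        if answer_text.lower() in sentence.lower():
            return sentence.strip()
    return "Answer text not found in the article."

# Hypothetical article text, used only for illustration.
toy_article = "Revenue grew by 685% in two years. The group filed its tax records in 2021."

# The answer appears verbatim in the first sentence, so that sentence is returned.
found = extract_full_sentence_with_answer(toy_article, "685%")
# A paraphrased answer is not a substring of any sentence, so the fallback is returned.
missing = extract_full_sentence_with_answer(toy_article, "revenue increased sharply")

print(found)
print(missing)
```

Since the match is a plain substring test, a generated answer that paraphrases the article, or that spans more than one sentence, falls through to the fallback message.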

# Step 3: Evidence List Construction
### This function returns the evidence, i.e. the articles from which the answer was drawn.

### It takes the top 4 similar articles returned by the flat index search and the answer from the T5 model.

### Then it returns each article's metadata in the format specified in the PS, and fills in `fact` using the function above.

In [12]:
def construct_evidence_list(retrieved_docs, answer):
    evidence_list = []
    for doc_idx in retrieved_docs:
        meta = metadata[doc_idx]
        fact = extract_full_sentence_with_answer(articles[doc_idx], answer)
        evidence = {
            "title": meta['title'],
            "author": meta['author'],
            "url": meta['url'],
            "source": meta['source'],
            "category": meta['category'],
            "published_at": meta['published_at'],
            "fact": fact
        }
        evidence_list.append(evidence)
    return evidence_list

## Multi Class Logistic Regression

#### We use this classifier to assign each query to one of the 4 possible query types found in the `train.json` data.

1. We prepare the data as `X_train` (the query text) and `y_train` (the question type).
2. We use `LabelEncoder` to encode the target labels as numerical values.
   - `Target Labels = ['null_query', 'comparison_query', 'temporal_query', 'inference_query']`
3. We create a pipeline that includes:
   - TF-IDF Vectorizer: converts the text queries into numerical features.
   - Logistic Regression: our classification model.
4. We train the model using the `fit()` method.
5. We define a function `predict_query_type()` that predicts the type of a new query.

Given a query string, `predict_query_type()` returns the predicted label as text.

In [32]:
X_train = []
y_train = []

# Prepare the data
for i in train_data:
    X_train.append(i['query'])
    y_train.append(i['question_type'])

# Encode the target labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

# Create a pipeline with TF-IDF vectorizer and Logistic Regression
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english', max_features=5000)),
    ('clf', LogisticRegression(multi_class='ovr', max_iter=1000))
])

# Train the model
pipeline.fit(X_train, y_train_encoded)

# Function to predict query type
def predict_query_type(query):
    prediction = pipeline.predict([query])
    return label_encoder.inverse_transform(prediction)[0]
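
As a hedged usage sketch, the block below trains the same kind of pipeline on four made-up queries and predicts the type of a new one. The queries, labels, and `toy_*` names are invented for illustration and are not taken from `train.json`. The sketch also leaves `LogisticRegression` at its default multi-class setting instead of passing `multi_class='ovr'`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical queries and labels, invented for illustration only.
toy_queries = [
    "Do both articles report an increase in revenue?",
    "Do the two reports agree on the merger?",
    "Which event happened first, the launch or the recall?",
    "Was the announcement made before or after the lawsuit?",
]
toy_labels = ["comparison_query", "comparison_query", "temporal_query", "temporal_query"]

toy_encoder = LabelEncoder()
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Labels are encoded to integers before fitting, as in the notebook cell.
toy_pipeline.fit(toy_queries, toy_encoder.fit_transform(toy_labels))

def predict_toy_query_type(query):
    # Predict the integer class, then map it back to its string label.
    prediction = toy_pipeline.predict([query])
    return toy_encoder.inverse_transform(prediction)[0]

toy_prediction = predict_toy_query_type("Did the recall happen before the launch?")
print(toy_prediction)
```

With so few training examples the prediction itself is not meaningful. The point is only the call pattern: a raw query string goes in and a label string comes out.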



# Step 4: Pipeline to answer queries

### This function chains the functions above to produce the answer, the predicted question type, and the evidence list for the input query.

In [33]:
def answer_query(query):
    retrieved_docs = retrieve_documents(query)
    answer = generate_answer(retrieved_docs, query)
    evidence_list = construct_evidence_list(retrieved_docs, answer)
    question_type = predict_query_type(query)
    output = {
        "query": query,
        "answer": answer,
        "question_type": question_type,
        "evidence_list": evidence_list
    }
    return output

### Displaying Result

In [34]:
# Test on a sample query
sample_query = "Do the TechCrunch article on software companies and the Hacker News article on The Epoch Times both report an increase in revenue related to payment and subscription models, respectively?"
output = answer_query(sample_query)
print(json.dumps(output, indent=4))

{
    "query": "Do the TechCrunch article on software companies and the Hacker News article on The Epoch Times both report an increase in revenue related to payment and subscription models, respectively?",
    "answer": "The Epoch Times has amassed a fortune, growing its revenue by a staggering 685% in two years, to $122 million in 2021, according to the group\u2019s most recent tax records",
    "question_type": "comparison_query",
    "evidence_list": [
        {
            "title": "How the conspiracy-fueled Epoch Times went mainstream and made millions",
            "author": null,
            "url": "https://www.nbcnews.com/news/us-news/epoch-times-falun-gong-growth-rcna111373",
            "source": "Hacker News",
            "category": "technology",
            "published_at": "2023-10-16T03:41:24+00:00",
            "fact": "Answer text not found in the article."
        },
        {
            "title": "Here\u2019s how Rainforest, a budding Stripe rival, aims to win over so