<a href="https://colab.research.google.com/github/Namitt/RAG-System-for-Indian-Political-News/blob/main/5568424.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **RAG System for Mumbai Local Travel**

#1. Installation of Libraries

In [1]:
# Installing libraries required for data manipulation, text embedding, and similarity search.
!pip install faiss-gpu # If using a GPU, otherwise install faiss-cpu
!pip install sentence_transformers # For using sentence transformers
!pip install transformers
!pip install torch
!pip install pandas


Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Collecting sentence_transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using

These commands install the required Python libraries. transformers and sentence_transformers provide pre-trained NLP models, faiss provides efficient similarity search over large collections of vectors, and pandas handles data manipulation. Install faiss-gpu on a machine with a GPU and faiss-cpu otherwise; only one of the two is needed.

In [2]:
import torch # PyTorch for handling deep learning tasks
from transformers import AutoTokenizer, AutoModel, GPT2LMHeadModel, GPT2Tokenizer # For generating text
import faiss # For efficient nearest neighbor search
import numpy as np
import pandas as pd
import json # For saving and loading data in JSON format
from sentence_transformers import SentenceTransformer # For text embeddings

#2. Data Loading and Preprocessing

In [3]:
# Load the dataset file
df = pd.read_csv("https://raw.githubusercontent.com/Namitt/RAG-System-for-Mumbai-Local-Travel/main/Mumbai%20Local%20Train%20Dataset.csv")

# Fill missing descriptions with empty string
df['Nearby attractions'] = df['Nearby attractions'].fillna('')
df['About'] = df['About'].fillna('')

# Combine "About" and "Nearby attractions" columns
df['combined_text'] = df['About'] + " " + df['Nearby attractions']

The dataset is loaded from a CSV file and describes Mumbai's local railway stations. Its "About" and "Nearby attractions" columns supply the text that is later chunked and embedded.

pandas handles the loading and preprocessing. Calling fillna('') replaces missing values with empty strings; without it, concatenating the two columns would produce NaN for any row with a missing value, and chunking that row would fail.

#3. Text Chunking

In [4]:
# Function to split text into chunks
def chunk_text(text, chunk_size=512, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

# Chunk the combined text
df['chunked_combined_text'] = df['combined_text'].apply(lambda x: chunk_text(x))
chunks = [item for sublist in df['chunked_combined_text'].tolist() for item in sublist]

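To keep inputs to the embedding model manageable, each description is split into chunks of up to 512 words (not characters), and consecutive chunks share a 50-word overlap so that context is not lost at chunk boundaries.

Chunking long descriptions into smaller pieces makes them easier for the embedding model to handle. Note that a 512-word chunk can still exceed the model's maximum sequence length: by default, `all-MiniLM-L6-v2` truncates input beyond 256 word pieces, so a smaller `chunk_size` may be needed if every word of a chunk should contribute to its embedding.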
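The overlap behaviour is easier to see on a tiny input. The sketch below repeats the notebook's `chunk_text` logic unchanged and runs it on ten placeholder words with a deliberately small `chunk_size` and `overlap`; the sample string and the small parameter values are illustrative assumptions, not values used in the pipeline.

```python
def chunk_text(text, chunk_size=512, overlap=50):
    # Same logic as the notebook's chunk_text: windows of `chunk_size` words,
    # each starting `chunk_size - overlap` words after the previous one.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

# Ten placeholder words (illustrative only)
sample = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
small_chunks = chunk_text(sample, chunk_size=4, overlap=1)

for chunk in small_chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

Each chunk starts with the last word of the previous one. The final chunk can consist only of overlap words, and `overlap` must stay smaller than `chunk_size`, otherwise the step passed to `range` is zero or negative.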

#4. Generating Embeddings and Saving Them to JSON

In [5]:
# Load the SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the text chunks and convert to numpy for compatibility with FAISS
embeddings = np.vstack([model.encode(chunk, convert_to_tensor=True).cpu().detach().numpy() for chunk in chunks])

# Save embeddings to JSON for later retrieval
with open('embeddings.json', 'w') as f:
    json.dump(embeddings.tolist(), f)




The Sentence Transformer model 'all-MiniLM-L6-v2' embeds each text chunk, mapping it to a 384-dimensional vector.

'all-MiniLM-L6-v2' is a compact model (the download above is about 91 MB) that balances computational cost against embedding quality, which makes it a reasonable choice for producing semantically meaningful embeddings in a notebook environment.
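The save-then-reload round trip used for the embeddings can be sketched with the standard library alone. The vectors below are small made-up values rather than real model output, and the temporary directory is an assumption made so the example does not overwrite the notebook's `embeddings.json`.

```python
import json
import os
import tempfile

# Illustrative stand-in for the embedding matrix (2 vectors, 3 dimensions)
embeddings = [[0.25, -0.5, 1.0], [0.0, 0.125, -2.0]]

# Write to a temporary location so the notebook's own file is untouched
path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
with open(path, "w") as f:
    json.dump(embeddings, f)

# Reload: JSON gives back plain nested lists, not arrays
with open(path) as f:
    restored = json.load(f)

n_vectors, dim = len(restored), len(restored[0])
print(n_vectors, dim)
```

Because JSON stores numbers as text and returns nested lists, the reloaded data has to be converted back to a float32 array before it is handed to FAISS, which is what the next cell does.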

#5. Setting Up FAISS for Efficient Similarity Search / Creating and Populating a FAISS Index

In [6]:
# Load embeddings from JSON and ensure they are in the correct format
with open('embeddings.json', 'r') as f:
    embeddings = np.array(json.load(f), dtype='float32')

# Ensure the embeddings array is C-contiguous as expected by FAISS
embeddings = np.ascontiguousarray(embeddings)

# Create a flat (exhaustive) L2 index; this index type needs no training
index = faiss.IndexFlatL2(embeddings.shape[1])

# Normalize embeddings to unit length before adding them to the index
faiss.normalize_L2(embeddings)
index.add(embeddings)  # Add normalized embeddings to the index for searching

Answering a query requires finding the text chunks most relevant to it, which is a nearest-neighbour search over the chunk embeddings. FAISS provides fast nearest-neighbour search, including over large datasets.

**1. embeddings = np.array(json.load(f), dtype='float32')** - This line loads the pre-computed embeddings from the JSON file and converts them to a NumPy array of type float32. Storing the embeddings makes them reusable, so they need not be recomputed every time the script runs. FAISS works with float32 vectors, so the conversion puts the data in the format the library expects and uses half the memory of NumPy's default float64.

**2. embeddings = np.ascontiguousarray(embeddings)** - FAISS expects its input in C-contiguous layout, meaning the array occupies one contiguous block of memory in row-major order. An array created by np.array from nested lists is normally already C-contiguous, so this call acts as a safeguard that prevents runtime errors if the array is later produced some other way (for example, by slicing or transposing).

**3. index = faiss.IndexFlatL2(embeddings.shape[1])** - This step creates a FAISS index that compares vectors by L2 (Euclidean) distance, with the vector dimension given by embeddings.shape[1]. IndexFlatL2 performs an exhaustive search: every query is compared against every stored vector, so results are exact and no training step is needed. That makes it a simple and suitable choice for smaller datasets such as this one, where exhaustive search is feasible.

**4. faiss.normalize_L2(embeddings)** - This call rescales each embedding in place to unit length, so that every vector's norm equals 1. For unit vectors the squared L2 distance equals 2 − 2·cos(θ), where θ is the angle between the vectors, so ranking by L2 distance becomes equivalent to ranking by cosine similarity and vector magnitude no longer affects the result. The equivalence holds only if query vectors are unit length as well.

**5. index.add(embeddings)** - Finally, the normalised embeddings are added to the index. An index must be populated before it can be searched; once the embeddings are added, the system can quickly find the nearest neighbours of any query vector.
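The steps above can be illustrated without FAISS. The following pure-Python sketch normalises a few made-up 2-D vectors, runs an exhaustive search by squared L2 distance (conceptually what a flat index does; `IndexFlatL2` reports squared distances), and checks the identity d² = 2 − 2·cos(θ) for unit vectors. The helper names and the vectors are hypothetical, chosen only for illustration.

```python
import math

def normalize(v):
    # Scale a vector to unit length
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    # Squared Euclidean distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_unit(a, b):
    # Dot product; equals cosine similarity only when both inputs are unit vectors
    return sum(x * y for x, y in zip(a, b))

def flat_search(query, vectors, k):
    # Exhaustive search: score every stored vector, keep the k closest
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

# Made-up vectors standing in for chunk embeddings
stored = [normalize(v) for v in [[3.0, 4.0], [1.0, 0.0], [-1.0, 2.0]]]
query = normalize([6.0, 8.5])

results = flat_search(query, stored, k=2)
for dist, idx in results:
    # For unit vectors, squared L2 distance and cosine similarity carry the same ranking
    print(idx, round(dist, 4), round(2 - 2 * cosine_unit(query, stored[idx]), 4))
```

Since the two printed columns agree for every result, sorting by distance in ascending order gives the same ranking as sorting by cosine similarity in descending order.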

#6. Loading GPT-2 for Text Generation

In [7]:
# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # Set padding token to end of sequence token
gpt_model = GPT2LMHeadModel.from_pretrained('gpt2')


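GPT-2 is a widely used pre-trained language model that produces fluent, human-like text, which is why it serves as the generator here. The base 'gpt2' checkpoint loaded above is the smallest variant, so its ability to follow instructions and stay faithful to the retrieved context is limited.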

#7. Functions for Query Encoding and Text Generation

In [8]:
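# Function to encode queries using the same model as for embedding chunks
def encode_query(query):
    # normalize_embeddings=True keeps the query at unit length, matching the normalised index vectors
    return model.encode([query], convert_to_tensor=True, normalize_embeddings=True).cpu().detach().numpy()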

This function uses the Sentence Transformer model to turn a textual query into a numerical vector: the query's semantic embedding. Because the query is embedded with the same model as the chunks, chunks whose embeddings lie close to this vector are likely to be semantically related to the query.

The query is passed as a one-element list, so the result is a 2-D array with one row, which is the shape index.search expects. The embedding is produced as a tensor, moved to the CPU (in case the model ran on a GPU), detached from the computation graph, and converted to a NumPy array so that FAISS can consume it.

#8.  Retrieving and Generating Text

In [9]:
# Function to perform text retrieval and generate response based on the context
def search_and_generate(query, top_k=1):
    query_embedding = encode_query(query)
    D, I = index.search(query_embedding, k=top_k)
    retrieved_info = " ".join([chunks[i] for i in I[0]])

    prompt = f"Query: {query}\n\nPlease reflect on the following extract and elaborate:\n\n'{retrieved_info}'"
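    # Tokenize the prompt; the returned dict also contains the attention mask
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = gpt_model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],  # avoids the attention-mask warning
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
        max_length=1024,  # counts prompt tokens plus generated tokens
        do_sample=True,  # top_p and temperature only take effect when sampling
        no_repeat_ngram_size=2,
        num_return_sequences=1,
        top_p=0.92,
        temperature=0.85,
        repetition_penalty=1.2
    )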
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

This function contains the core of the RAG system. It first retrieves the text chunks most relevant to the query, based on the query's embedding, and then passes them to GPT-2 as context for generating an answer.
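The retrieval-to-prompt step can be sketched in isolation. The helper below is hypothetical (`build_prompt` is not part of the notebook); it reuses the notebook's prompt template and additionally skips any index of -1, which FAISS returns when fewer than `k` neighbours are available and which would otherwise silently select the last chunk through Python's negative indexing. The two sample chunks are short illustrative strings.

```python
def build_prompt(query, chunks, neighbor_ids):
    # Keep only valid chunk positions; FAISS uses -1 to mark a missing neighbour
    valid_ids = [i for i in neighbor_ids if 0 <= i < len(chunks)]
    retrieved_info = " ".join(chunks[i] for i in valid_ids)
    # Same template as in search_and_generate
    return (
        f"Query: {query}\n\n"
        "Please reflect on the following extract and elaborate:\n\n"
        f"'{retrieved_info}'"
    )

# Illustrative chunks and a search result with one missing neighbour
sample_chunks = ["Dadar is an interchange station.", "Thane has ten platforms."]
prompt = build_prompt(
    "What is the significance of Thane station?",
    sample_chunks,
    neighbor_ids=[1, -1],
)
print(prompt)
```

Separating prompt construction from generation also makes it possible to inspect exactly what the generator receives before any model is loaded.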


**FAISS Search**: index.search returns, for the query embedding, the distances (D) and positions (I) of its top_k nearest chunk embeddings. The positions are then used to look up the corresponding text in chunks.

**Text Generation**: GPT-2 continues the prompt to produce a reply in fluent text. The generation parameters trade variety against relevance and coherence: temperature and top_p control how adventurous sampling is, while no_repeat_ngram_size and repetition_penalty discourage repeated phrases. Note that temperature and top_p take effect only when sampling is enabled with do_sample=True; without it, generate falls back to greedy decoding.

**Prompt Design**: The prompt places the query first and then asks the model to reflect on the retrieved extract, which steers generation towards the retrieved text and helps keep the output relevant to the query.
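To make the roles of `temperature` and `top_p` concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It illustrates the general technique, not the internals of `generate`; the four logits are made-up values and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing logits by a temperature below 1 sharpens the distribution
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.92):
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p, then renormalise over that set
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy logits for a four-token vocabulary (illustrative only)
logits = [4.0, 2.0, 1.0, -1.0]
probs = softmax_with_temperature(logits, temperature=0.85)
nucleus = top_p_filter(probs, top_p=0.92)
print(sorted(nucleus))  # token ids that remain eligible for sampling
```

With these toy values only the two most likely tokens survive the filter, so the next token would be sampled from those two. Low-probability tokens are excluded, which is how top-p limits incoherent continuations while still allowing some variety.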

#9. Testing the System

In [10]:
# Testing the integrated pipeline
test_queries = [
    "Tell me about and what can I visit near Vasai Road?",
    "Tell me about the history and attractions near Dadar",
    "What is the significance of Thane station?",
    "Provide details about the facilities available at Dadar station.",
    "What makes Churchgate station unique?",
    "Give me some information about the development of Borivali station.",
    "What are the key features of Bandra station?",
    "Describe the connectivity of Kasara station.",
    "What express trains stop at Mumbai Central station?",
    "How has the CST station evolved over the years?"
]

# Function to run test queries and print results
def run_test_queries(queries):
    for query in queries:
        result = search_and_generate(query, top_k=1)
        print(f"Query: {query}\nGenerated Response: {result}\n{'-'*60}\n")

# Running the test queries
run_test_queries(test_queries)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: Tell me about and what can I visit near Vasai Road?
Generated Response: Query: Tell me about and what can I visit near Vasai Road?

Please reflect on the following extract and elaborate:

'Vasai Road Junction (station code: BSR) is a railway station on the Western line and Vasai Road–Roha line of the Mumbai Suburban Railway network. Vasai is a historical suburban town north of Mumbai and it is located in Palghar district. It is a much modern part of Vasai Taluka. It is a part of the new Vasai-Virar city. It is a major railway station which bypasses Mumbai and connects the trains coming from Vadodara to Konkan Railway and Pune Junction railway station and further towards cities of Bengaluru and Hyderabad History The station was formerly known as Bassein Road (Bassein being Vasai's Portuguese name). This is the reason why the station code is BSR standing for the original BaSsein Rd. It is an historic station, for it was the terminus of the first local service of the erstwhile BB&C



Query: Tell me about the history and attractions near Dadar
Generated Response: Query: Tell me about the history and attractions near Dadar

Please reflect on the following extract and elaborate:

'and opened on 1968.[13] During the Indo-Pakistan War of 1971 a Jawan Canteen was established in the station to serve Indian soldiers. The Canteen was conducted by Wadala Junior Chambers (Founder- Gangaram Joshi), under the guidance of Nanik Rupani, who was the President at that time.[14][15] After decades, In 2009 The Midtown terminus of Dadar Western side was inaugurated for increasing more trains on the suburban route and long-distance route for decreasing a load of passengers.[16] And the side elevated road which is parallel to Midtown Terminus connects to Tilak Bridge for direct taxi's and another vehicle's movement, was inaugurated in 2014. The Cost for construction was?30 crore (US$3.8 million).[17] Shivaji Park,Siddhivinayak Temple,Dadar Flower Market'


------------------------------



Query: What is the significance of Thane station?
Generated Response: Query: What is the significance of Thane station?

Please reflect on the following extract and elaborate:

'Thane (formerly Thana, station code: TNA) is an A1 category major railway station of the Indian Railways serving the city of Thane, Located in Maharashtra, it is one of the busiest railway stations in India. As of 2013, Thane railway station handles 260000 people daily. More than 1,000 trains visit the station each day, including 330+ long-distance trains.[1] The station has ten platforms. It is the origin and destination station of all the trans-harbour suburban trains. Thane is India's first passenger railway Station along with Bori Bunder Railway Station. History Thane railway station was the terminus for the first ever passenger train in India. On 16 April 1853, the first passenger train service was inaugurated from Bori Bunder (now renamed Chhatrapati Shivaji Maharaj Terminus), Mumbai to Thane.[2] Covering



Query: Provide details about the facilities available at Dadar station.
Generated Response: Query: Provide details about the facilities available at Dadar station.

Please reflect on the following extract and elaborate:

'Dadar railway station is one of the major interchange railway stations of Mumbai Suburban Railway. It serves the Dadar area in Mumbai, India. This railway station lies on both the Central line named as Dadar Central with station code DR and Western line named as Dadar Western with station code DDR. It's also a terminal for Mumbai Suburban Railway as well as Indian Railways.[1] Two roads are passes through parallel in the vicinity of Dadar railway station which is Senapati Bapat Marg on the Westside and Lakhamsi Nappu Road on the Eastside. Structure Dadar railway station has 15 platforms, In that, 7 platforms consist of the Western side which is two platforms of the slow suburban route, three platforms of the fast suburban route and the last two platforms are the termi



Query: What makes Churchgate station unique?
Generated Response: Query: What makes Churchgate station unique?

Please reflect on the following extract and elaborate:

'The Fort area built by the British had three main gates.[1] One of these gates led straight to Saint Thomas Cathedral Church, hence it was named "Church Gate". This gate was demolished in 1860. Later the Churchgate railway station was built in 1870 in close proximity to the position of the demolished gate.[2] Churchgate station is a terminus of Western Railway line of Mumbai suburban railway. It is the southernmost station of the city, though up to the 1931, Colaba was the southernmost station, however the rail line was removed beyond Churchgate, making Churchgate the southernmost station.[3][4] The Bombay, Baroda and Central India Railway (present Western Railway) was inaugurated in 1855 with the construction of rail line (BG) between Ankleshwar and Uttaran (a distance of 29 miles). In 1859 this line was further extende



Query: Give me some information about the development of Borivali station.
Generated Response: Query: Give me some information about the development of Borivali station.

Please reflect on the following extract and elaborate:

'Borivali (station code: BO (suburban)/BVI (mainline)) is a railway station on the Western line of the Mumbai Suburban Railway network and an outbound station. It serves the suburban of Borivali. The Borivali Railway Station[2] is a terminus for all slow, semi-fast and fast trains on the Mumbai Suburban Railway system. It also serves as the final city-limit stop for all mail and express trains on Western Railway before leaving Mumbai. As of Oct 2022, the plans to extend the Harbour Line to Borivali, and expansion plans are in full steam, with the survey for land acquisition being completed. [3] Borivali is used by almost 2.87 lakhs (287,000) passengers every day and is the busiest station on the western suburban line of Mumbai. The number of passengers using Bori



Query: What are the key features of Bandra station?
Generated Response: Query: What are the key features of Bandra station?

Please reflect on the following extract and elaborate:

'Bandra (/bæ?ndra/; station code: B for Suburban services and BA for Indian Railways) is a railway station on the Western Line and Harbour Line of the Mumbai Suburban Railway network. It serves the Bandra suburban area and the commercial area of Bandra-Kurla Complex (BKC). Bandra Terminus is near to Bandra railway station and serves interstate traffic on the Western Railway. The station is a Grade-I heritage structure. The other 4 railway stations on Mumbai's heritage list include Chhatrapati Shivaji Maharaj Terminus, Western Railways Headquarters Building (Churchgate), Byculla railway station and Reay Road railway station.[2] All fast and slow commuter trains have a halt at this station. Bandra is also directly connected to Chhatrapati Shivaji Terminus through Harbour Line via Vadala Road. BEST buses are al



Query: Describe the connectivity of Kasara station.
Generated Response: Query: Describe the connectivity of Kasara station.

Please reflect on the following extract and elaborate:

'Kasara (station code: KSRA/N for North (local)) is a railway station on the Central line of the Mumbai Suburban Railway network. It is the final stop in the north-east sector of the Central Line. Background Kasara was called Kassarah during the British Raj. For outstation trains, it is a technical halt, where bankers (generally WAG-5 or WAG-7) are attached behind the trains to enable it to cross the tough and high gradient Kasara Ghat. Earlier the station was electrified with a DC 1.5 kV traction system, but on 19 February 2006, it was changed to an AC 25 kV 50 Hz system. Services It takes up to 162 minutes for slow local suburban trains to reach Kasara from Chhatrapati Shivaji Terminus whereas the fast locals take 136 mins.[1] Connections MSRTC BUS and Shared taxis to Nasik Road, Jawahar, Mokhada, Rajur, a



Query: What express trains stop at Mumbai Central station?
Generated Response: Query: What express trains stop at Mumbai Central station?

Please reflect on the following extract and elaborate:

'Mumbai Central (formerly Bombay Central, station code: MMCT[1]) is a major railway station on the Western line, situated in Mumbai, Maharashtra in an area known by the same name. It serves as a major stop for both Local and Inter-City/Express trains with separate platforms for them. It is also a terminal for several long-distance trains including the Mumbai Rajdhani Express. It is one of the five major Terminal stations in Mumbai while others being Mumbai CST, Mumbai LTT, Mumbai BDTS and Mumbai Dadar. Trains depart from the station connecting various destinations mostly across states in the northern, western and north-western parts of India. The station was renamed from Bombay Central to Mumbai Central in 1997, following the change of Bombay to Mumbai. In 2018, a resolution was passed to chang

This function evaluates the system by running a predetermined set of test queries, making it possible to observe how the pipeline handles different kinds of questions about the stations.

For each query, the function prints the query and the generated response, separated by a divider line, so that the results are easy to read and compare.
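Reading the printed responses is a manual check. A rough automated complement, sketched below under stated assumptions, is to test whether the retrieved text mentions the station named in the query. The function `retrieval_hit_rate` and the sample cases are hypothetical and not part of the notebook; a keyword match is only a crude proxy for retrieval quality.

```python
def retrieval_hit_rate(cases):
    # cases: list of (expected_keyword, retrieved_text) pairs.
    # Returns the fraction of cases whose retrieved text mentions the keyword.
    if not cases:
        return 0.0
    hits = sum(1 for keyword, text in cases if keyword.lower() in text.lower())
    return hits / len(cases)

# Illustrative cases: station name expected in the chunk retrieved for a query
sample_cases = [
    ("Thane", "Thane is a major railway station with ten platforms."),
    ("Kasara", "Kasara is the final stop on its branch of the Central line."),
    ("Churchgate", "Mumbai Central is a major station on the Western line."),
]
rate = retrieval_hit_rate(sample_cases)
print(f"{rate:.2f}")
```

In the notebook, the pairs could be built from each test query's station name and the chunk returned by the index, which would flag queries where retrieval landed on the wrong station.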