# Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot

### Problem Statement:

### Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)

### bot for a business. Use a vector database like Pinecone DB and a generative model like

### Cohere API (or any other available alternative). The QA bot should be able to retrieve

### relevant information from a dataset and generate coherent answers.


In [1]:
!pip install faiss-cpu cohere PyPDF2 numpy


Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Collecting cohere
  Downloading cohere-5.9.2-py3-none-any.whl.metadata (3.4 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting boto3<2.0.0,>=1.34.0 (from cohere)
  Downloading boto3-1.35.19-py3-none-any.whl.metadata (6.6 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.9.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting httpx>=0.21.2 (from cohere)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx-sse==0.4.0 (from cohere)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting parameterized<0.10.0,>=0.9.0 (from cohere)
  Downloading parameterized-0.9.0-py2.py3-none-any.whl.metadata (18 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere)
  Downloading types_requests-2.32.0.20240914-py3-none-a

In [2]:
import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text


In [3]:
def split_text(text, chunk_size=1000):
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks


In [14]:
import cohere
import numpy as np

co = cohere.Client('SGXUJ2vUDqaNNpJwh1ffmo1PFkGmN50W6ghcW4UA')  # Replace with your Cohere API key

def create_embeddings(texts, batch_size=40):
    embeddings = []
    for i in range(0, len(texts), batch_size):

        batch = texts[i:i+batch_size]
        # Added input_type argument for the embed-english-v3.0 model
        response = co.embed(texts=batch, model="embed-english-v3.0", input_type="search_document")
        embeddings.append(response.embeddings)

    return np.vstack(embeddings)


In [15]:
import faiss
import numpy as np

# Define index parameters
dimension = 1024  # Cohere's embedding model dimensionality
index = faiss.IndexFlatL2(dimension)  # FAISS L2 (cosine) index


In [39]:
# Example PDF path
pdf_path = '/content/Gen AI Engineer _ Machine Learning Engineer Assignment.pdf'
text = extract_text_from_pdf(pdf_path)
chunks = split_text(text)

# Generate embeddings
chunk_embeddings = create_embeddings(chunks)

# Add embeddings to FAISS index
index.add(np.array(chunk_embeddings).astype(np.float32))


In [40]:
def retrieve(query, index, k=3):
    # Create query embedding with Cohere
    query_embed = co.embed(texts=[query], model="embed-english-v3.0", input_type="search_document").embeddings
    # Search in FAISS index
    # Convert query_embed to a 1D array before passing it to index.search
    D, I = index.search(np.array(query_embed).astype(np.float32), k)

    # Fetch relevant documents
    return [chunks[i] for i in I[0]]

In [49]:
query = "What is the main topic of the document? Give a summary."
context = retrieve(query, index)

contexts = ""
for cont in context:
  contexts = contexts + cont



In [55]:
import cohere

stream = co.chat_stream(
  model='command-r-plus-08-2024',
  message=contexts,
  temperature=0.4,
  chat_history=[],
  prompt_truncation='AUTO',
  #connectors=[{"id":"web-search"}],
  max_tokens=4096
)

for event in stream:
  if event.event_type == "text-generation":
    print(event.text, end='')

## I. Introduction:

In the era of rapidly evolving digital manipulation techniques, the need for robust deepfake detection methods has never been more critical. Deepfakes, a portmanteau of "deep learning" and "fake," pose significant challenges to various sectors, including media, politics, and personal privacy. This paper introduces a groundbreaking deepfake detection system that integrates both visual and auditory cues, marking a significant advancement in the field.

## II. Literature Review:

Recent research has made notable strides in the battle against deepfakes. Yu et al. (2023) presented a compelling approach by combining EfficientNet's efficient feature extraction capabilities with torchvision, demonstrating the potential of innovative techniques. However, our proposed method takes this a step further by incorporating both visual and auditory analysis.

## VI. Feature Extraction:

The process of feature extraction is pivotal in machine learning, and this code employs two cutt