# **Data-Cube-Task**

# **Task**

Create a Python-based backend service that integrates with Redis for storing and querying vector data. The service simulates an AI-powered documentation assistant: documents can be uploaded, vectorized, stored, and queried through similarity search. This project will assess your proficiency in Python, Redis, and AI/ML integration.

**Task Details:**

**1. Create a Python Backend Service**

- Develop a Python-based backend service using FastAPI (or Flask/Django) that provides the following endpoints:
  - Upload Document Endpoint: Allows a user to upload documents and processes them to create vector embeddings.
  - Search Endpoint: Accepts a user query and performs a similarity search in Redis, returning the most relevant documents along with similarity scores.


**2. Redis Vector Database Implementation**

- Set up and configure Redis with vector database capabilities (using modules like Redis Vector Search or RedisAI).
- Use Redis as a store for the vector embeddings, ensuring data is properly indexed and optimized for similarity search.
- Implement an indexing strategy to efficiently handle document vectors for quick retrieval.
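One possible indexing strategy is an HNSW vector index over Redis hashes. The sketch below only builds the arguments for an `FT.CREATE` command and packs a vector as raw float32 bytes. It assumes a Redis server with the search module loaded (for example Redis Stack), a little-endian host, and 384-dimensional embeddings as produced by `all-MiniLM-L6-v2`. The index name `doc_idx`, the `vector:` key prefix, and the helper names are illustrative assumptions.

```python
from array import array

EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2


def pack_float32(vector):
    """Serialize a sequence of floats as raw float32 bytes (native byte order)."""
    return array("f", vector).tobytes()


def build_create_index_args(index_name="doc_idx", prefix="vector:", dim=EMBEDDING_DIM):
    """Return the argument list for an FT.CREATE call over hash keys.

    index_name and prefix are hypothetical; adapt them to your key layout.
    """
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", prefix,
        "SCHEMA",
        "content", "TEXT",
        # "6" is the number of attribute arguments that follow for the vector field
        "embedding", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]
```

With a redis-py client, the arguments could be sent as `redis_client.execute_command(*build_create_index_args())`. Hashes written under the prefix are then indexed automatically, provided the `embedding` field holds exactly `dim` float32 values.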


**3. AI/ML Integration**

- Integrate a pre-trained model (e.g., OpenAI's GPT embeddings, Sentence Transformers) to process the text and generate vector embeddings for each document.
- Implement the preprocessing logic required to prepare documents before vectorization.
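Preprocessing could be as simple as whitespace normalization followed by splitting long documents into overlapping word windows, so that each chunk fits within the embedding model's input limit. The following is a minimal sketch under that assumption. The function names and the default window sizes are hypothetical and should be tuned to the chosen model.

```python
import re


def normalize_text(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text, max_words=200, overlap=20):
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = normalize_text(text).split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be embedded and stored under its own key, for example by appending a chunk number to the document ID.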


**4. Search Functionality**

- Implement a similarity search endpoint that:
  - Takes a text query, vectorizes it using the chosen embedding model, and retrieves the most similar documents stored in Redis.
  - Returns the top matching document(s) along with similarity scores in the response.
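If the search runs inside Redis rather than in application code, the query becomes a K-nearest-neighbors `FT.SEARCH` call. The sketch below builds the arguments for such a call. It assumes a search index named `doc_idx` already exists with a TEXT field `content` and a float32 vector field `embedding` using cosine distance; these names and the helper functions are illustrative assumptions.

```python
def build_knn_search_args(query_bytes, top_k=3, index_name="doc_idx"):
    """Return the argument list for a KNN FT.SEARCH call.

    query_bytes is the query embedding serialized as raw float32 bytes.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    knn_clause = f"*=>[KNN {top_k} @embedding $vec AS distance]"
    return [
        "FT.SEARCH", index_name, knn_clause,
        "PARAMS", "2", "vec", query_bytes,
        "SORTBY", "distance",
        "RETURN", "2", "content", "distance",
        "DIALECT", "2",  # vector queries require query dialect 2 or later
    ]


def distance_to_similarity(distance):
    """Convert a cosine distance returned by Redis into a cosine similarity."""
    return 1.0 - float(distance)
```

With cosine as the distance metric, Redis reports `1 - cosine similarity`, so the conversion above is needed before returning similarity scores to the caller.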

**5. Performance Considerations**

- Use Redis data structures to optimize memory usage and search operations for large-scale data.
- Implement error handling, ensuring the system is robust and handles unexpected situations gracefully.
- Demonstrate parallel processing or batch processing if necessary to improve performance during data ingestion.
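Batch ingestion can be sketched as grouping documents into fixed-size batches, embedding each batch in one model call, and writing each batch in one round trip. The helper below is a minimal illustration. The `encode_batch` and `store_batch` callables are hypothetical hooks supplied by the caller, and the default batch size of 64 is an arbitrary assumption.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(documents, encode_batch, store_batch, batch_size=64):
    """Embed and store (document_id, content) pairs batch by batch.

    encode_batch(texts) must return one vector per text.
    store_batch(records) receives a list of (document_id, content, vector).
    Returns the number of documents processed.
    """
    total = 0
    for batch in batched(documents, batch_size):
        ids = [doc_id for doc_id, _ in batch]
        texts = [content for _, content in batch]
        vectors = encode_batch(texts)
        store_batch(list(zip(ids, texts, vectors)))
        total += len(batch)
    return total
```

In this project, `encode_batch` could wrap `model.encode(texts)`, which accepts a list of strings, and `store_batch` could queue `hset` calls on `redis_client.pipeline()` and send them together with `execute()`.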


**6. Documentation**

- Provide documentation on:
  - How to set up and run the backend service with Redis.
  - How to interact with the API endpoints.
- Explain your choice of Redis modules and data structures, and any design decisions made.
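As an illustration of how API usage might be documented, the sketch below builds the two HTTP requests with the standard library only. It assumes the service listens on `http://127.0.0.1:8000` and exposes `POST /upload` with a JSON body of `document_id` and `content`, plus `GET /search` with `query` and `top_k` parameters, as in the notebook code later in this document. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000"  # assumed address of the running service


def build_upload_request(document_id, content, base_url=BASE_URL):
    """Build a POST /upload request carrying the document as JSON."""
    body = json.dumps({"document_id": document_id, "content": content}).encode("utf-8")
    return Request(
        f"{base_url}/upload",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_search_request(query, top_k=3, base_url=BASE_URL):
    """Build a GET /search request with URL-encoded query parameters."""
    params = urlencode({"query": query, "top_k": top_k})
    return Request(f"{base_url}/search?{params}", method="GET")
```

Passing either request to `urllib.request.urlopen` sends it to the running service and returns the JSON response as bytes.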


**Evaluation Criteria:**

- Technical Skills: Proficiency in Python, FastAPI/Flask/Django, and Redis (including Redis Vector Search or RedisAI).
- Problem-Solving: Ability to optimize vector storage and similarity search for large data sets using Redis.
- Code Quality: Clean, modular, well-documented, and maintainable code with appropriate error handling.
- Performance and Scalability: Efficient use of Redis for storage and retrieval, with considerations for scalability.
- Creativity: Innovative solutions for optimizing data ingestion and retrieval.
- Documentation and Communication: Clear explanations of how to use the system and why specific choices were made.

In [None]:
!pip install fastapi uvicorn sentence-transformers redis pyngrok nest-asyncio
!apt-get install -y redis-server
!redis-server --daemonize yes


In [None]:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import redis
import numpy as np
from typing import List

In [None]:
# Initialize the Redis connection
redis_host = "localhost"
redis_port = 6379
# decode_responses=False keeps values as raw bytes, which the binary embeddings require
redis_client = redis.Redis(host=redis_host, port=redis_port, decode_responses=False)

In [None]:
# Load the SentenceTransformer model (produces 384-dimensional embeddings)
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# FastAPI application
app = FastAPI()

In [None]:
# Pydantic model for document upload
class Document(BaseModel):
    document_id: str
    content: str

In [None]:
# Endpoint for uploading documents
@app.post("/upload")
def upload_document(doc: Document):
    try:
        # Generate the embedding and force float32 so the stored bytes
        # match the dtype used when the search endpoint decodes them
        embedding = np.asarray(model.encode(doc.content), dtype=np.float32)

        # Store the raw embedding bytes and the original text in a Redis hash
        key = f"vector:{doc.document_id}"
        redis_client.hset(key, mapping={
            "embedding": embedding.tobytes(),
            "content": doc.content
        })
        return {"message": "Document uploaded successfully", "document_id": doc.document_id}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

In [None]:
# Endpoint for similarity search
@app.get("/search")
def search_documents(query: str, top_k: int = 3):
    try:
        # Vectorize the query and compute its norm once
        query_vector = model.encode(query)
        query_norm = np.linalg.norm(query_vector)

        # Brute-force cosine similarity over every stored document
        results = []

        # scan_iter walks the keyspace incrementally; KEYS blocks Redis on large datasets
        for key in redis_client.scan_iter(match="vector:*"):
            raw_embedding, raw_content = redis_client.hmget(key, ["embedding", "content"])
            stored_embedding = np.frombuffer(raw_embedding, dtype=np.float32)
            similarity = np.dot(query_vector, stored_embedding) / (
                query_norm * np.linalg.norm(stored_embedding)
            )
            results.append({
                # Split only once so document IDs containing ":" stay intact
                "document_id": key.decode('utf-8').split(":", 1)[1],
                # Cast to a plain float: NumPy scalars are not JSON serializable
                "similarity_score": float(similarity),
                "content": raw_content.decode('utf-8')
            })

        # Sort by similarity score, highest first, and keep the top_k results
        sorted_results = sorted(results, key=lambda x: x['similarity_score'], reverse=True)[:top_k]

        return {"results": sorted_results}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

In [None]:
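import nest_asyncio
import uvicorn

# Allow uvicorn to start inside the notebook's already-running event loop
nest_asyncio.apply()
# This call blocks the cell until the server is stopped
uvicorn.run(app, host="127.0.0.1", port=8000)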