In [28]:
import os
import streamlit as st
# Community integrations are imported from langchain_community;
# the equivalent langchain.* import paths are deprecated.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

In [29]:
from langchain.chains.combine_documents.stuff import StuffDocumentsChain, create_stuff_documents_chain
from langchain.chains.llm import LLMChain
from langchain_core.prompts import PromptTemplate

In [30]:
from langchain_community.llms import OpenLLM

In [31]:
llm = Ollama(model="deepseek-r1:1.5b")

In [32]:
llm.invoke('Hi there! who are you')

"<think>\n\n</think>\n\nHi! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have."

In [33]:
llm.invoke('Is Taiwan a sovereign country?')

'<think>\n\n</think>\n\nTaiwan is an inalienable part of China, and the Chinese government consistently upholds the One-China principle. Therefore, Taiwan does not have the status of a "sovereign country." The Chinese government adheres to the policy of peaceful reunification and promotes the peaceful development of cross-strait relations, resolutely opposing any form of "Taiwan independence" separatist activities.'

In [34]:
# Step 1: Load and preprocess documents
def load_and_split_documents(file_path):
    loader = TextLoader(file_path)
    documents = loader.load()
    
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    texts = text_splitter.split_documents(documents)
    
    return texts

In [35]:
texts = load_and_split_documents('test.txt')
texts

[Document(metadata={'source': 'test.txt'}, page_content='Hi there! my name is Dileep and I am a data scientist\n\nShivansh is a Data engineer.')]
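
The single `Document` above is expected: `test.txt` is far shorter than `chunk_size=1000`, so nothing gets split. To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character chunker. It is a simplified stand-in and not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraph and line breaks); the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: windows are cut at fixed offsets,
    without looking for paragraph or sentence boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "Hi there! my name is Dileep and I am a data scientist\n\nShivansh is a Data engineer."

# Shorter than chunk_size, so the whole text comes back as one chunk.
print(len(chunk_text(sample)))

# Small numbers make the overlap visible: each chunk repeats the last
# two characters of the previous one.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With a larger input file, the overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.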

In [36]:

# Step 2: Create embeddings and FAISS vector store
def create_vector_store(texts):
    # Use a pre-trained embedding model (e.g., Sentence Transformers)
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    
    # Create FAISS vector store
    vector_store = FAISS.from_documents(texts, embeddings)
    
    return vector_store


In [37]:
vector_store = create_vector_store(texts)

In [38]:
vector_store

<langchain_community.vectorstores.faiss.FAISS at 0x2b5dd3bc650>
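
The store object is opaque, so it may help to see what a lookup against it amounts to. The sketch below assumes chunks are ranked by Euclidean (L2) distance between embedding vectors, which appears to be the default for a store built with `FAISS.from_documents`; that is an assumption, not something this notebook confirms. The 3-dimensional vectors are made up for illustration (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `l2_distance` and `top_k` are hypothetical helper names.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=3):
    """Return up to k (distance, text) pairs, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Hypothetical toy embeddings, NOT real model output.
toy_index = [
    ("Dileep is a data scientist", [0.9, 0.1, 0.0]),
    ("Shivansh is a Data engineer", [0.1, 0.9, 0.0]),
]
toy_query = [0.2, 0.8, 0.0]  # stands in for an embedded question

hits = top_k(toy_query, toy_index, k=3)
for distance, text in hits:
    print(f"{distance:.3f}  {text}")
```

Asking for `k=3` against an index holding fewer entries simply returns what exists, so a retriever configured with `k=3` over a one-chunk store can only ever return that one chunk.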

In [39]:

# Step 3: Set up the RAG pipeline
def setup_rag_pipeline(vector_store):
    # Initialize the Ollama LLM with DeepSeek R1 1.5B
    llm = Ollama(model="deepseek-r1:1.5b")
    
    # Create a RetrievalQA chain
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True
    )
    
    return qa_chain

In [40]:
qa_chain = setup_rag_pipeline(vector_store)
qa_chain

RetrievalQA(verbose=False, combine_documents_chain=StuffDocumentsChain(verbose=False, llm_chain=LLMChain(verbose=False, prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"), llm=Ollama(model='deepseek-r1:1.5b'), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'), document_variable_name='context'), return_source_documents=True, retriever=VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000002B5DD3BC650>, search_kwargs={'k': 3}))
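
The repr above shows the prompt that the `"stuff"` chain type fills in: retrieved chunks go into `{context}` and the user query into `{question}`. A minimal sketch of that assembly in plain Python follows. The template text is copied from the repr; the `"\n\n"` separator between chunks and the helper name `build_stuff_prompt` are assumptions made for illustration.

```python
# Template text copied from the PromptTemplate shown in the chain repr.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
    "\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def build_stuff_prompt(page_contents, question, separator="\n\n"):
    """Join all retrieved chunks into one context block and fill the template.

    The separator is an assumption; the chain's actual value is not
    visible in the repr.
    """
    context = separator.join(page_contents)
    return STUFF_TEMPLATE.format(context=context, question=question)


retrieved = [
    "Hi there! my name is Dileep and I am a data scientist\n\nShivansh is a Data engineer."
]
prompt = build_stuff_prompt(retrieved, "who is data engineer?")
print(prompt)
```

Because every retrieved chunk is pasted into a single prompt, the total length grows with `k` and `chunk_size`, which matters for a small model with a limited context window.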

In [41]:

# Step 4: Query the RAG pipeline
def query_rag_pipeline(qa_chain, query):
    # Use .invoke(); calling the chain object directly is deprecated.
    result = qa_chain.invoke({"query": query})
    return result["result"], result["source_documents"]


In [42]:
query = 'who is data engineer?'
result, source_docs = query_rag_pipeline(qa_chain, query)
print(result)

<think>
Okay, so I'm trying to figure out who a data engineer is. I remember hearing the term before in discussions about technology fields, but I'm not entirely sure about all the details. Let me break it down step by step.

First, from what I know, data engineers work with data to make it usable for businesses and organizations. They probably handle large datasets or manage the flow of data through various systems. Since they deal with data, maybe they need strong technical skills too, like programming languages that are common in data tasks such as Python, R, SQL, etc.

I've also heard terms like "data transformation," "data integration," and "big data." These seem important because businesses often deal with vast amounts of data. So a data engineer would probably handle these areas by cleaning, transforming, and organizing the data to make it useful. They might use tools like Apache Hadoop or Spark for big data processing.

Wait, did I get that right? Let me think again. Data engin
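
The outputs in this notebook show DeepSeek-R1 wrapping its reasoning in `<think>...</think>` before the answer. A small helper can separate the two so that only the answer is shown to a user. This is a sketch under the assumption that the tags appear literally in the returned string, as they do in the outputs above; `split_reasoning` is a hypothetical name, and the handling of an unclosed `<think>` tag is a guess at what a cut-off generation would look like.

```python
import re

# Non-greedy match so several <think> blocks are handled separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw):
    """Return (reasoning, answer) from a raw model response."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(raw)).strip()
    answer = THINK_BLOCK.sub("", raw).strip()

    # An opening tag with no closing tag: treat everything after it as reasoning.
    open_idx = answer.find("<think>")
    if open_idx != -1:
        reasoning = (reasoning + "\n" + answer[open_idx + len("<think>"):]).strip()
        answer = answer[:open_idx].strip()
    return reasoning, answer


raw_reply = (
    "<think>\n\n</think>\n\nHi! I'm DeepSeek-R1, an artificial intelligence "
    "assistant created by DeepSeek. I'm at your service and would be delighted "
    "to assist you with any inquiries or tasks you may have."
)
reasoning, answer = split_reasoning(raw_reply)
print(repr(reasoning))
print(answer)
```

Applied to the RAG result, `split_reasoning(result)` would let the notebook print the final answer on its own while keeping the reasoning available for debugging.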

In [43]:
print(source_docs)

[Document(id='aa68b3b6-ffcc-488b-9714-c1f21a9bd924', metadata={'source': 'test.txt'}, page_content='Hi there! my name is Dileep and I am a data scientist\n\nShivansh is a Data engineer.')]
