# 🧠 DeepSeek Research Assistant 📄🔍  
## **AI-Powered Research Paper Q&A & Summarization**  

🔍 **Analyze any research paper with AI!**  
✅ **Upload a PDF**  
✅ **Get AI-generated summaries**  
✅ **Ask any questions, and get answers with citations**  
✅ **Retrieve key insights instantly**  

---

### **⚙️ Tech Stack**
🔹 **LLM Engine**: DeepSeek-R1-8B (via Ollama)  
🔹 **AI Framework**: LangChain (retrieval & prompt engineering)  
🔹 **Text Extraction**: pdfminer.six  
🔹 **Semantic Search**: ChromaDB (BM25 + Embeddings)  

---

### **🚀 How to Use**
1️⃣ **Place a research paper (PDF) inside `Research_papers/`**  
2️⃣ **Run all cells**  
3️⃣ **Ask any question**, and the AI retrieves relevant passages and answers  
4️⃣ **All responses are logged to `qna_log.json` for reference**  

---

👨‍💻 **Made with ❤️ using DeepSeek-R1 & LangChain**  


📂 Cell 2: Import Required Packages

In [110]:
import os
import glob
import shutil
import datetime
import json
import re
import pdfminer.high_level
import streamlit as st  # Optional for UI
import time


# LangChain Core
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Embeddings & Vector Storage
from langchain_community.embeddings import OllamaEmbeddings
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

# Retrieval Enhancements
from langchain.schema import Document
from langchain.retrievers import BM25Retriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.document_transformers import LongContextReorder

print("✅ All Packages Imported Successfully")


✅ All Packages Imported Successfully


📂 Cell 3: Extract Text from PDF

In [111]:
pdf_directory = "/Users/pouyapourfarrokh/Desktop/AI&Data science Projects/DeepSeek Research Assistant/-DeepSeek-Research-Assistant-AI-Powered-Paper-Summarizer-Q-A/Research_papers"

def get_latest_pdf(directory):
    """Retrieve the most recently added PDF file from the directory."""
    pdf_files = glob.glob(os.path.join(directory, "*.pdf"))
    return max(pdf_files, key=os.path.getctime) if pdf_files else None

def extract_text_from_pdf(pdf_path):
    """Extracts text from a given PDF file; returns None if nothing usable was extracted."""
    if not pdf_path or not os.path.exists(pdf_path):
        return None
    try:
        text = pdfminer.high_level.extract_text(pdf_path)
    except Exception as e:
        print(f"❌ Error extracting text: {str(e)}")
        return None
    if not text.strip():
        print("⚠️ No extractable text found.")
        return None
    return text

latest_pdf = get_latest_pdf(pdf_directory)
# Always define extracted_text so later cells can test it safely
extracted_text = extract_text_from_pdf(latest_pdf) if latest_pdf else None

if extracted_text:
    print(f"✅ Extracted text from: {os.path.basename(latest_pdf)}")
    print(extracted_text[:1000])
elif latest_pdf:
    print(f"⚠️ Could not extract text from: {os.path.basename(latest_pdf)}")
else:
    print("⚠️ No PDFs found in the directory.")


✅ Extracted text from: DeepSeek_V3.pdf
DeepSeek-V3 Technical Report

DeepSeek-AI

research@deepseek.com

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total
parameters with 37B activated for each token. To achieve efficient inference and cost-effective
training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-
tures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers
an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training
objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and
high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to
fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms
other open-source models and achieves performance comparable to leading closed-source
models. Despite its excellent performance, DeepSeek-V3 requires
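
Text extracted from a PDF usually keeps the page's line breaks, so words can end up split across lines (for example `architec-` / `tures` in the output above). A minimal cleanup sketch that could run before chunking follows. `clean_extracted_text` is a hypothetical helper, not part of the notebook or of pdfminer.six, and its first regex will also merge genuinely hyphenated compounds that happen to break at a line end.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Hypothetical helper: tidy raw PDF text before it is split into chunks."""
    # Re-join words hyphenated across a line break: "architec-\ntures" -> "architectures"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces; keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "MLA and DeepSeekMoE architec-\ntures, which were\nthoroughly validated"
print(clean_extracted_text(sample))
```

Whether this step helps depends on the paper's layout; tables and formulas may still come out garbled.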

📂 Cell 4: Chunk the Text

In [112]:
# Skip chunking when extraction failed (no text, or an error/warning message instead of text)
if extracted_text and "Error extracting text" not in extracted_text and "No extractable text" not in extracted_text:
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    text_chunks = text_splitter.split_text(extracted_text)
    print(f"✅ Successfully split the document into {len(text_chunks)} chunks.")
else:
    text_chunks = []
    print("⚠️ No valid text found for splitting.")


✅ Successfully split the document into 97 chunks.
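
With `chunk_size=2000` and `chunk_overlap=200`, each chunk holds up to 2000 characters and can share up to about 200 of them with the next chunk. The sketch below shows only that windowing idea, using a hypothetical `sliding_window_chunks` function. `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, so its chunks will not match this output exactly.

```python
def sliding_window_chunks(text, chunk_size=2000, chunk_overlap=200):
    """Hypothetical illustration: fixed-size character windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Small numbers make the overlap visible: consecutive chunks share one character
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```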


📂 Cell 5: Store chunks in ChromaDB

In [113]:
chroma_db_path = "/Users/pouyapourfarrokh/Desktop/AI&Data science Projects/DeepSeek Research Assistant/-DeepSeek-Research-Assistant-AI-Powered-Paper-Summarizer-Q-A/db/chroma_db"

if not text_chunks:
    print("⚠️ No valid text chunks found. Skipping vector storage.")
else:
    embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    vector_db = Chroma.from_texts(text_chunks, embedding=embedding_model, persist_directory=chroma_db_path)
    print(f"✅ Indexed {len(text_chunks)} chunks in ChromaDB at {chroma_db_path} (Dimension: 384)")


✅ Indexed 97 chunks in ChromaDB at /Users/pouyapourfarrokh/Desktop/AI&Data science Projects/DeepSeek Research Assistant/-DeepSeek-Research-Assistant-AI-Powered-Paper-Summarizer-Q-A/db/chroma_db (Dimension: 384)
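
Conceptually, a vector store answers a query by embedding it and ranking the stored chunk embeddings by a similarity or distance measure. The sketch below illustrates that ranking with cosine similarity and toy 3-dimensional vectors standing in for the 384-dimensional MiniLM embeddings. The function names and vectors are hypothetical, and the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunk vectors most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy data: chunk 0 points almost the same way as the query, chunk 2 is unrelated
toy_chunk_vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query_vector = [0.9, 0.1, 0.0]
print(top_k_by_similarity(toy_query_vector, toy_chunk_vectors, k=2))
# → [0, 1]
```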


In [114]:
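def retrieve_relevant_text(query, document_chunks, top_k=3):
    """
    Retrieves the text segments that share the most keywords with the query.

    Each chunk is scored by how many distinct query words it contains, and the
    top_k highest-scoring chunks are joined into a single context string.
    """
    query_keywords = set(query.lower().split())
    scored_chunks = []

    for section in document_chunks:
        section_words = set(section.lower().split())
        overlap = len(query_keywords.intersection(section_words))

        if overlap > 0:
            scored_chunks.append((overlap, section))

    # Rank by overlap so the best matches come first (ties keep document order)
    scored_chunks.sort(key=lambda pair: pair[0], reverse=True)
    return " ".join(section for _, section in scored_chunks[:top_k])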
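
The tech stack lists BM25 and the imports include `BM25Retriever`, while the function above relies on plain keyword overlap. For comparison, here is a minimal sketch of BM25 scoring in pure Python. It is not the notebook's method and not LangChain's `BM25Retriever` implementation; `bm25_rank`, the tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics so 'prediction?' matches 'prediction'."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, chunks, top_k=3, k1=1.5, b=0.75):
    """Return up to top_k chunks ordered by BM25 score; chunks scoring zero are dropped."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    if not tokenized:
        return []
    n_chunks = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_chunks

    # Document frequency: in how many chunks each term appears
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_chunks - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by chunk length via b
            denom = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        if score > 0:
            scored.append((score, idx))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunks[idx] for _, idx in scored[:top_k]]


demo_chunks = [
    "DeepSeek-V3 uses multi-token prediction.",
    "The weather is nice today.",
    "Load balancing without auxiliary loss.",
]
print(bm25_rank("What is multi-token prediction?", demo_chunks, top_k=1))
```

Unlike raw overlap counts, this weighting discounts words that appear in most chunks and penalizes very long chunks that match only by volume.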


In [123]:
qa_prompt = PromptTemplate(
    input_variables=["question", "context"],
    template="""
You are an AI research assistant. Answer the user's question based **only on the provided research paper content**.

### **Research Paper Content:**  
{context}

### **Question:**  
{question}

### **Strict Instructions:**  
- Do **NOT** include speculation, `<think>`, or information beyond the document.  
- Your response **MUST** follow a structured format.  

---

## **📌 Answer**  

### **1️⃣ Key Insights**  
- <Summarize the most important information related to the user's question.>  

### **2️⃣ Supporting Evidence**  
- <Provide details, data, or results from the research paper to support the answer.>  

### **3️⃣ Implications or Applications**  
- <Explain what this means in a broader scientific or technical context. If not applicable, state "Not applicable.">  

---

### **📖 Source:**  
<Cite the section, table, or figure from the retrieved research paper. If no citation is found, state "Source not explicitly provided.">  
"""
)
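
A template only passes the question and the retrieved context to the model if it contains matching `{question}` and `{context}` placeholders. LangChain's `PromptTemplate` uses f-string-style placeholders by default; the sketch below imitates that substitution with plain `str.format` so it runs without LangChain. `render_prompt` and `demo_template` are hypothetical names used only for this illustration.

```python
def render_prompt(template: str, question: str, context: str) -> str:
    """Fill the {question} and {context} placeholders of a prompt template."""
    return template.format(question=question, context=context)


# A shortened stand-in for the full prompt; literal braces would need doubling ({{ }})
demo_template = (
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)

rendered = render_prompt(
    demo_template,
    question="How many parameters are activated per token?",
    context="671B total parameters with 37B activated for each token",
)
print(rendered)
```

Printing the rendered prompt once before calling the model is a cheap way to confirm that the context actually reaches it.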


In [126]:
log_file_path = "qna_log.json"

def save_response_to_log(question, response):
    """Saves each question and answer to a JSON log file."""
    try:
        # Load existing log file if available
        if os.path.exists(log_file_path):
            with open(log_file_path, "r", encoding="utf-8") as f:
                log_data = json.load(f)
        else:
            log_data = []

        # Append the new question-response pair
        log_data.append({"question": question, "response": response})

        # Save back to the log file
        with open(log_file_path, "w", encoding="utf-8") as f:
            json.dump(log_data, f, indent=4, ensure_ascii=False)

    except Exception as e:
        print(f"❌ Error logging response: {str(e)}")

def clear_console():
    """Clears the console before displaying the next response."""
    os.system('cls' if os.name == 'nt' else 'clear')

def interactive_qa():
    """Handles the interactive Q&A loop for any research paper."""
    if not retriever:
        print("\n⚠️ No valid retriever found. Please check the document processing pipeline.")
        return

    while True:
        # ✅ Clear previous response BEFORE showing the next input prompt
        clear_console()

        user_question = input("\n❓ **Your Question (type 'end' to exit):** ").strip()

        if user_question.lower() == "end":
            print("\n👋 Exiting Q&A mode. Have a great day!")
            break

        if not user_question:
            print("\n⚠️ Please enter a valid question.")
            continue

        best_context = retrieve_relevant_text(user_question, text_chunks, top_k=3)

        if not best_context:
            print("\n⚠️ No relevant information found. Try rephrasing the question.")
            continue

        raw_response = qa_chain.invoke({"question": user_question, "context": best_context})
        structured_answer = process_and_display_response(raw_response)

        # ✅ Save response to JSON file
        save_response_to_log(user_question, structured_answer)

        # ✅ Display the latest structured response
        print("\n📌 **Final Answer:**\n", structured_answer)

        # ✅ Small delay before allowing the next question
        time.sleep(1)

print("✅ Q&A Logging Enabled. Type 'end' to exit anytime.")

interactive_qa()


✅ Q&A Logging Enabled. Type 'end' to exit anytime.

👋 Exiting Q&A mode. Have a great day!
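
The loop calls `process_and_display_response`, which is defined outside this section. DeepSeek-R1 models typically emit their reasoning inside `<think>...</think>` tags before the answer, so one plausible job for such a helper is stripping that block. The sketch below is an assumption about that step, written under a different, hypothetical name (`clean_model_response`). It also assumes the chain result is either a plain string or a dict with a `text` key.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)


def clean_model_response(raw_response):
    """Hypothetical helper: return the answer text without the model's reasoning block."""
    # Assumption: dict results carry the generated text under the "text" key
    if isinstance(raw_response, dict):
        text = raw_response.get("text", "")
    else:
        text = str(raw_response)
    return THINK_BLOCK.sub("", text).strip()


demo_output = {"text": "<think>The abstract gives the figure.</think>\n## Answer\n37B per token"}
print(clean_model_response(demo_output))
```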
