# 🧠 DeepSeek Research Assistant 📄🔍  
### AI-Powered Research Paper Summarization & Q&A using DeepSeek-R1 & LangChain  

This project allows **students & professors** to:  
✅ Upload a **research paper (PDF)**  
✅ Get an **AI-generated summary**  
✅ Receive **suggested questions** for better understanding  
✅ **Ask custom questions**; the model retrieves relevant passages from across the entire paper before answering  

### ⚙️ Tech Stack:  
- **DeepSeek-R1-8B** (via Ollama) – AI-powered text analysis  
- **LangChain** – Retrieval-based Q&A & prompt engineering  
- **ChromaDB** – Vector database for semantic search  
- **pdfminer.six** – Extract text from PDFs  
- **Streamlit** – User-friendly UI (for deployment)  
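
To make "semantic search" in the stack above concrete, here is a minimal pure-Python sketch of the idea behind a vector database: every chunk is stored with an embedding vector, and a query is answered by returning the chunks whose vectors are closest to the query vector. The three-dimensional vectors, the chunk labels, and the helper names `cosine_similarity` and `top_k` are all hypothetical and chosen for illustration. Cosine similarity is only one common choice of metric; the one ChromaDB actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical toy "embeddings"; real embedding vectors are far longer.
toy_index = [
    ("MoE architecture details", [0.9, 0.1, 0.0]),
    ("Training data and tokens", [0.1, 0.9, 0.1]),
    ("Evaluation benchmarks", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.3, 0.0]

print(top_k(toy_query, toy_index, k=2))
```

In the real pipeline the embedding model produces the vectors and the vector store performs the ranking; the sketch only shows why a question about architecture would surface architecture-related chunks first.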


📂 Cell 2: Import Required Packages

In [17]:
import os
import glob
import pdfminer.high_level
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
import streamlit as st  # For future UI
from langchain_community.embeddings import OllamaEmbeddings


📂 Cell 3: Extract Text from PDF

In [15]:
pdf_directory = "/Users/pouyapourfarrokh/Desktop/AI&Data science Projects/DeepSeek Research Assistant/-DeepSeek-Research-Assistant-AI-Powered-Paper-Summarizer-Q-A/Research_papers"

def get_latest_pdf(directory):
    """Return the PDF in the directory with the newest ctime, or None if there are no PDFs."""
    pdf_files = sorted(glob.glob(os.path.join(directory, "*.pdf")), key=os.path.getctime, reverse=True)
    return pdf_files[0] if pdf_files else None

def extract_text_from_pdf(pdf_path):
    """Extract text from a given PDF file."""
    return pdfminer.high_level.extract_text(pdf_path)

# Get latest uploaded PDF
latest_pdf = get_latest_pdf(pdf_directory)

if latest_pdf:
    extracted_text = extract_text_from_pdf(latest_pdf)
    print(f"✅ Extracted text from: {latest_pdf}")
    print(extracted_text[:1000])  # Preview first 1000 characters
else:
    print("⚠️ No PDFs found in the directory.")


✅ Extracted text from: /Users/pouyapourfarrokh/Desktop/AI&Data science Projects/DeepSeek Research Assistant/-DeepSeek-Research-Assistant-AI-Powered-Paper-Summarizer-Q-A/Research_papers/DeepSeek_V3.pdf
DeepSeek-V3 Technical Report

DeepSeek-AI

research@deepseek.com

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total
parameters with 37B activated for each token. To achieve efficient inference and cost-effective
training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-
tures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers
an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training
objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and
high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to
fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3
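
The preview above shows a typical PDF extraction artifact: "architec-" and "tures" are split across a line break. A minimal cleanup sketch, assuming it is acceptable to join any lowercase letter, hyphen, newline, lowercase letter sequence, could run before chunking. The helper name `join_hyphenated_line_breaks` is hypothetical and not part of pdfminer.six. Note that this heuristic would also drop the hyphen from a genuine compound that happens to break at a line end.

```python
import re

def join_hyphenated_line_breaks(text):
    """Rejoin words that were split across a line break with a trailing hyphen."""
    # Matches only the "-<newline>" between two lowercase letters,
    # so hyphens inside a single line are left untouched.
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)

sample = "MLA and DeepSeekMoE architec-\ntures, which were validated"
print(join_hyphenated_line_breaks(sample))
```

Applied to `extracted_text`, this would give the splitter and the embedding model whole words instead of fragments.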

📂 Cell 4: Chunk the Text & Store in ChromaDB

In [19]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split research paper into chunks
text_chunks = text_splitter.split_text(extracted_text)

# Use Ollama for local embeddings
embedding_model = OllamaEmbeddings(model="mistral")  # Swap in a DeepSeek model name if one is available locally in Ollama

# Store chunks in ChromaDB
vector_db = Chroma.from_texts(text_chunks, embedding=embedding_model)

print(f"✅ Indexed {len(text_chunks)} chunks in ChromaDB for document search.")

✅ Indexed 216 chunks in ChromaDB for document search.
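
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break at separators such as paragraph and line boundaries, so its chunks are at most `chunk_size` characters and their boundaries will differ. The function name `sliding_window_chunks` and the tiny sizes are hypothetical, chosen so the overlap is visible.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo_chunks = sliding_window_chunks("abcdefghijklmnopqrst", chunk_size=8, chunk_overlap=3)
print(demo_chunks)
```

Each window repeats the last three characters of the previous one. In the cell above, the 200-character overlap plays the same role: a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk.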
