# 🧠 DeepSeek Research Assistant 📄🔍  
### AI-Powered Research Paper Summarization & Q&A using DeepSeek-R1 & LangChain  
This project allows **students & professors** to:  
✅ Upload a **research paper (PDF)**  
✅ Get an **AI-generated summary**  
✅ Receive **suggested questions** for better understanding  
✅ **Ask custom questions** for deeper insights  

### ⚙️ Tech Stack:  
- **DeepSeek-R1-8B** (via Ollama) – AI-powered text analysis  
- **LangChain** – Prompt engineering & AI interaction  
- **ChromaDB** – Vector database for semantic search  
- **pdfminer.six** – Extract text from PDFs  
- **Streamlit** – User-friendly UI (for deployment)  


📂 Cell 2: Import Required Packages

In [5]:
import os
import glob
import pdfminer.high_level
import langchain
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
import streamlit as st

# Initialize the DeepSeek model
llm = Ollama(model="huihui_ai/deepseek-r1-abliterated:8b")

print("✅ Packages Imported Successfully")


✅ Packages Imported Successfully



📂 Cell 3: Extract Text from PDF 

In [7]:
pdf_directory = "/Users/pouyapourfarrokh/Desktop/AI&Data science Projects/DeepSeek Research Assistant/-DeepSeek-Research-Assistant-AI-Powered-Paper-Summarizer-Q-A/Research_papers"

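def get_latest_pdf(directory):
    # Return the most recently changed PDF in the directory, or None if there is none.
    # Note: os.path.getctime is creation time on Windows but last metadata change on Unix.
    pdf_files = glob.glob(os.path.join(directory, "*.pdf"))
    return max(pdf_files, key=os.path.getctime, default=None)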

def extract_text_from_pdf(pdf_path):
    return pdfminer.high_level.extract_text(pdf_path)

latest_pdf = get_latest_pdf(pdf_directory)

if latest_pdf:
    extracted_text = extract_text_from_pdf(latest_pdf)
    print(f"✅ Extracted text from: {latest_pdf}")
    print(extracted_text[:1000])
else:
    print("⚠️ No PDFs found in the directory.")


✅ Extracted text from: /Users/pouyapourfarrokh/Desktop/AI&Data science Projects/DeepSeek Research Assistant/-DeepSeek-Research-Assistant-AI-Powered-Paper-Summarizer-Q-A/Research_papers/DeepSeek_V3.pdf
DeepSeek-V3 Technical Report

DeepSeek-AI

research@deepseek.com

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total
parameters with 37B activated for each token. To achieve efficient inference and cost-effective
training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-
tures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers
an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training
objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and
high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to
fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 o
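
The raw text above keeps the PDF's line wrapping, including words split by a hyphen at the end of a line ("architec-" / "tures"). The sketch below shows one possible way to tidy such text before it is sent to the model. `clean_extracted_text` is a hypothetical helper, not part of the notebook, and the regular expressions are assumptions that may also join compounds that were legitimately broken at a hyphen.

```python
import re

def clean_extracted_text(raw_text):
    """Hypothetical helper: tidy raw PDF text before prompting the model."""
    # Rejoin words split by a hyphen at the end of a line ("architec-\ntures").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    # Turn single line breaks inside a paragraph into spaces; keep blank lines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# Small sample modelled on the extracted abstract (not the full paper text).
sample = "DeepSeekMoE architec-\ntures, which were\nvalidated in DeepSeek-V2.\n\nAbstract"
print(clean_extracted_text(sample))
```

In the notebook this could be applied as `extracted_text = clean_extracted_text(extracted_text)` before summarization.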

📂 Cell 4: Integrating LangChain for Summarization


In [8]:
# Initialize LangChain with DeepSeek model
llm = Ollama(model="huihui_ai/deepseek-r1-abliterated:8b")

summary_prompt = PromptTemplate(
    input_variables=["text"],
    template="""Summarize the following research paper text in clear and concise points:\n\n{text}\n\n### Summary:"""
)

# LangChain Summarization Chain
summary_chain = LLMChain(llm=llm, prompt=summary_prompt)

def summarize_text(text):
    text = text[:3000]  # Keep only the first 3000 characters (not tokens) so the prompt stays small
    return summary_chain.run(text)

# Generate summary
summary = summarize_text(extracted_text)
print("📌 Research Paper Summary:\n", summary)
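
`summarize_text` only sees the first 3000 characters of the paper, so most of a long PDF never reaches the model. One possible way around this is to summarize the text in overlapping chunks and then summarize the partial summaries. The sketch below is an assumption, not the notebook's method: `split_into_chunks` and `summarize_long_text` are hypothetical helpers, the chunk size and overlap are arbitrary, and a stand-in function replaces the model call so the example runs without Ollama.

```python
def split_into_chunks(text, chunk_size=3000, overlap=200):
    """Hypothetical helper: split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

def summarize_long_text(text, summarize_fn, chunk_size=3000, overlap=200):
    """Summarize each chunk, then summarize the joined partial summaries."""
    chunks = split_into_chunks(text, chunk_size=chunk_size, overlap=overlap)
    if not chunks:
        return ""
    partial_summaries = [summarize_fn(chunk) for chunk in chunks]
    if len(partial_summaries) == 1:
        return partial_summaries[0]
    # Note: the joined summaries are passed on as-is and may themselves be long.
    return summarize_fn("\n\n".join(partial_summaries))

# Stand-in for the model call: "summarize" by keeping the first 20 characters.
fake_summarize = lambda chunk: chunk[:20]
print(summarize_long_text("x" * 7000, fake_summarize))
```

With the notebook's own function this would read `summarize_long_text(extracted_text, summarize_text)`. Each chunk costs one model call, so long papers take proportionally longer.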



📌 Research Paper Summary:
 <think>
Okay, so I need to help summarize the research paper text for DeepSeek-V3 in clear and concise points. Let me read through the provided content carefully.

The abstract mentions that DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, specifically 37B activated per token. It uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures validated in DeepSeek-V2. They've introduced an auxiliary-loss-free strategy for load balancing and set a multi-token prediction training objective for better performance. The model is pre-trained on 14.8 trillion tokens, then goes through Supervised Fine-Tuning and Reinforcement Learning stages. Evaluations show it outperforms open-source models and matches closed-source ones, all while using only about 2.788M H800 GPU hours for training and being very stable without loss spikes.

Looking at the sections: Introduction talks about the model's strengths; Architecture goes in
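
The output above starts with the model's `<think>` reasoning rather than the summary itself. If only the final answer is wanted, the reasoning can be removed before printing or before passing the summary to the next cell. The sketch below assumes the reasoning is closed by a `</think>` tag, which the truncated output here does not show. `strip_reasoning` is a hypothetical helper, not part of the notebook.

```python
import re

# Match a <think> block up to its closing tag, or to the end of the text
# if the closing tag is missing (for example, in a truncated response).
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", flags=re.DOTALL)

def strip_reasoning(model_output):
    """Hypothetical helper: drop <think>...</think> reasoning from model output."""
    return THINK_BLOCK.sub("", model_output).strip()

# Made-up sample text, not real model output.
sample_output = "<think>\nLet me read the abstract first.\n</think>\n\n- Point one\n- Point two"
print(strip_reasoning(sample_output))
```

In the notebook this could be used as `summary = strip_reasoning(summarize_text(extracted_text))`, so the question-generation prompt receives only the summary.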

📂 Cell 5: LangChain for Generating Suggested Questions & Answers


In [11]:
qa_prompt = PromptTemplate(
    input_variables=["summary"],
    template="""Based on the following research paper summary, generate 5 thought-provoking questions and their corresponding answers:\n\n### Summary:\n{summary}\n\n### Questions & Answers:"""
)

qa_chain = LLMChain(llm=llm, prompt=qa_prompt)

def generate_questions_and_answers(summary):
    return qa_chain.run(summary)

# Generate 5 questions and their answers
questions_answers = generate_questions_and_answers(summary)

print("📌 5 AI-Generated Questions & Answers:\n")
print(questions_answers)


📌 5 AI-Generated Questions & Answers:

<think>
Alright, let's dive into the thought process of generating these questions and answers based on the provided research paper summary of DeepSeek-V3.

First, I need to understand the key points covered in the summary. The model is a Mixture-of-Experts (MoE) language model with significant parameters, utilizing specific architectural innovations like Multi-head Latent Attention (MLA). It's pre-trained on an enormous amount of data and goes through fine-tuning and reinforcement learning stages, achieving impressive performance metrics while being computationally efficient.

To create thought-provoking questions, I should focus on areas that highlight the model's strengths, unique features, and implications. The answers need to be concise yet informative, addressing why these aspects matter in the context of language models.

1. **Why is the choice of 671B parameters significant for DeepSeek-V3?**  
   - The total number of parameters determine