<font face="Times New Roman" size=5>
<div dir=rtl align="center">
<font face="Times New Roman" size=5>
In The Name of God
</font>
<br>
<img src="https://logoyar.com/content/wp-content/uploads/2021/04/sharif-university-logo.png" alt="University Logo" width="150" height="150">
<br>
<font face="Times New Roman" size=4 align=center>
Sharif University of Technology - Department of Electrical Engineering
</font>
<br>
<font color="#008080" size=6>
Foundations of Data Science
</font>
<hr/>
<font color="#800080" size=5>
Phase 1 Report: Research Assistant
<br>
</font>
<font size=5>
Instructor: Dr. Khalaj
<br>
</font>
<font size=4>
Fall 2024
<br>
</font>
<font face="Times New Roman" size=4>
Ali Sadeghiyan 400101464
</font>

</div></font>

# 4 Product:

### **Research Assistant with Retrieval-Augmented Generation (RAG)**  
We implemented a **RAG-based Research Assistant** capable of retrieving and analyzing research papers using a **vector database** and **semantic search**. Our approach includes the following steps:  

1. **Dataset Preparation:**  
   - Loaded the `dblp-v10.csv` dataset in chunks to handle large-scale data efficiently.  
   - Filtered out missing values and retained relevant columns (`title`, `abstract`, `authors`).  

2. **Embedding Generation:**  
   - Used `SentenceTransformers` (`all-MiniLM-L6-v2`) to encode research paper abstracts and titles into **dense vector embeddings**.  
   - Processed embeddings in batches for memory efficiency and stored them in a NumPy array.  

3. **Vector Database with FAISS:**  
   - Indexed the computed embeddings using **FAISS** (Facebook AI Similarity Search) with a flat L2-distance index (`IndexFlatL2`), which performs exact, exhaustive nearest-neighbor search.  
   - Saved the FAISS index for future queries.  

4. **Semantic Search for Papers:**  
   - Implemented a **search function** to find the most relevant research papers based on text queries.  
   - Encoded the user query into an embedding and performed a **nearest-neighbor search** in FAISS.  
   - Retrieved and displayed the **top-K most relevant papers**, showing **titles, abstracts, and authors**.  

5. **Named Entity Recognition (NER) for Author-Based Search:**  
   - Integrated **spaCy NER** to extract author names from input queries.  
   - Allowed users to search for papers by author names, either explicitly provided or extracted using **NER**.  
   - Filtered the dataset to return papers written by the specified author.  
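
To make the retrieval steps above more concrete, the following is a minimal pure-Python sketch of what an exact flat L2 index does conceptually: it compares the query against every stored vector and keeps the `k` smallest squared distances. The function name `flat_l2_search` and the toy 3-dimensional vectors are illustrative assumptions and not part of the project code; FAISS performs the same kind of exhaustive scan with optimized kernels.

```python
import heapq

def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search by squared L2 distance.

    Returns (distances, indices) for the k closest vectors, closest first.
    """
    scored = (
        (sum((v_i - q_i) ** 2 for v_i, q_i in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    )
    top = heapq.nsmallest(k, scored)  # ascending by distance, ties broken by index
    return [d for d, _ in top], [i for _, i in top]

# Toy "embeddings" (hypothetical values, 3 dimensions instead of 384)
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
distances, indices = flat_l2_search(corpus, query=[0.0, 1.0, 0.0], k=2)
print(indices)    # positions of the two nearest vectors
print(distances)  # their squared L2 distances
```

Because every vector is scanned, the results are exact, and query time grows linearly with the number of indexed papers.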

---

### **Research Assistant with Summarization**  
We enhanced the system by integrating **automatic research paper summarization** to generate concise summaries for retrieved papers:  

1. **Summarization Pipeline:**  
   - Used **Hugging Face’s Transformers** and the `facebook/bart-large-cnn` model for summarization.  
   - Generated **concise, structured summaries** highlighting the key contributions of each paper.  

2. **Automated Summarization Workflow:**  
   - After retrieving relevant papers, the system automatically **summarizes** each paper’s abstract.  
   - Displays both the full abstract and a condensed version to help users quickly grasp the key ideas.  

3. **User-Friendly Interaction:**  
   - Users can search for papers using either **semantic search** or **author-based queries**.  
   - Results include **titles, abstracts, authors, and generated summaries**, making the assistant more informative.  
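
The retrieve-then-summarize workflow described above can be sketched as a small composition of two steps. This is only an illustration under stated assumptions: `answer_query`, `fake_search`, and `fake_summarize` are hypothetical names, and the stand-in components exist so the sketch runs without models or data; in the notebook, the roles would be played by the FAISS search and the BART summarizer.

```python
def answer_query(query, search_fn, summarize_fn, top_k=5):
    """Retrieve top_k papers, then attach a generated summary to each one."""
    enriched = []
    for paper in search_fn(query, top_k):
        enriched.append({**paper, "summary": summarize_fn(paper["abstract"])})
    return enriched

# Stand-in components (hypothetical) so the sketch is self-contained.
papers = [
    {"title": "Paper A", "abstract": "Alpha beta gamma delta.", "authors": ["X"]},
    {"title": "Paper B", "abstract": "Epsilon zeta eta theta.", "authors": ["Y"]},
]

def fake_search(query, top_k):
    return papers[:top_k]

def fake_summarize(abstract):
    return " ".join(abstract.split()[:2])

for item in answer_query("anything", fake_search, fake_summarize, top_k=1):
    print(item["title"], "->", item["summary"])
```

Keeping retrieval and summarization behind separate callables makes it possible to swap either component without touching the other.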

---

### **Final Outcome**  
- We developed an **intelligent Research Assistant** that can **retrieve, analyze, and summarize** research papers using **semantic search, FAISS indexing, Named Entity Recognition (NER), and transformer-based summarization**.  
- The system allows researchers to **quickly find and understand** relevant academic literature.  
- This approach **reduces manual effort** in literature reviews and enhances research efficiency.  



## RAG:

In [4]:
!pip install faiss-cpu sentence-transformers psycopg2-binary



In [6]:
!pip install tqdm



In [5]:
import pandas as pd

file_path = "dblp-v10.csv"
chunksize = 10000  # Adjust based on available memory
data_chunks = []

# Read dataset in chunks
for chunk in pd.read_csv(file_path, chunksize=chunksize):
    chunk = chunk[['title', 'abstract', 'authors']].dropna()
    data_chunks.append(chunk)

# Combine all chunks into a single DataFrame
df = pd.concat(data_chunks, ignore_index=True)
print(f"Loaded {len(df)} papers")

Loaded 827532 papers


In [7]:
import torch
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2').to(device)

def compute_embeddings(texts):
    """Compute embeddings in batches with progress tracking."""
    batch_size = 32
    embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Computing embeddings"):
        batch = texts[i:i+batch_size]
        with torch.no_grad():
            emb = model.encode(batch, convert_to_tensor=True, device=device).cpu().numpy()
        embeddings.append(emb)
    return np.vstack(embeddings)

# Generate embeddings with progress bar
df['combined_text'] = df['title'] + " " + df['abstract']
embeddings = compute_embeddings(df['combined_text'].tolist())

print(f"Computed {embeddings.shape[0]} embeddings of dimension {embeddings.shape[1]}")

Computing embeddings: 100%|██████████| 25861/25861 [37:44<00:00, 11.42it/s]


Computed 827532 embeddings of dimension 384
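
As a quick sanity check on the numbers reported above, the batch count shown by the progress bar and the approximate size of the embedding array can be reproduced with a few lines of arithmetic. The 4-bytes-per-value figure is an assumption (float32 storage); the other values come from the cell outputs.

```python
import math

n_papers = 827_532      # from the "Loaded ... papers" output
batch_size = 32         # batch size used in compute_embeddings
dim = 384               # embedding dimension reported above
bytes_per_float = 4     # assumption: embeddings are stored as float32

n_batches = math.ceil(n_papers / batch_size)
array_bytes = n_papers * dim * bytes_per_float

print(n_batches)                            # number of tqdm iterations
print(f"{array_bytes / 1024**3:.2f} GiB")   # in-memory size of the embedding array
```

Under that assumption the array occupies a little under 1.2 GiB, and a flat index stores the vectors uncompressed, so the saved index should be of similar size.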


In [8]:
import faiss

embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)  # L2 (Euclidean) similarity search
index.add(embeddings)  # Add all embeddings to FAISS index

# Save FAISS index
faiss.write_index(index, "faiss_index.bin")
df.to_csv("processed_papers.csv", index=False)

print("FAISS index saved successfully.")


FAISS index saved successfully.


In [9]:
!pip install tabulate



In [11]:
def search_papers(query, top_k=5):
    """Retrieve the top_k most relevant papers for a query and print them."""
    # Encode the query with the same model that produced the corpus embeddings
    query_embedding = model.encode([query], convert_to_tensor=True, device=device).cpu().numpy()
    # FAISS returns the L2-based distances and the row positions of the nearest papers
    distances, indices = index.search(query_embedding, top_k)

    results = df.iloc[indices[0]][['title', 'abstract', 'authors']]

    print("\n" + "="*80)
    print(f"Top {top_k} Results for Query: '{query}'")
    print("="*80 + "\n")

    for i, row in results.iterrows():
        print(f"**Title:** {row['title']}\n")
        print(f"**Abstract:** {row['abstract'][:500]}{'...' if len(row['abstract']) > 500 else ''}\n")
        print(f"**Authors:** {row['authors']}\n")
        print("-" * 80 + "\n")

    return results

query = "Neural networks for NLP"
results = search_papers(query, top_k=5)

query = "GAN for Text"
results = search_papers(query, top_k=5)



Top 5 Results for Query: 'Neural networks for NLP'

**Title:** Are Deep Learning Approaches Suitable for Natural Language Processing

**Abstract:** In recent years, Deep Learning (DL) techniques have gained much at-tention from Artificial Intelligence (AI) and Natural Language Processing (NLP) research communities because these approaches can often learn features from data without the need for human design or engineering interventions. In addition, DL approaches have achieved some remarkable results. In this paper, we have surveyed major recent contributions that use DL techniques for NLP tasks. All these reviewed topics have been limited t...

**Authors:** ['S. Alshahrani', 'Epaminondas Kapetanios']

--------------------------------------------------------------------------------

**Title:** Natural language grammatical inference with recurrent neural networks

**Abstract:** This paper examines the inductive inference of a complex grammar with neural networks and specifically, the ta
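
The `distances` returned by the search are not shown in the results. If the embeddings are unit-normalized (an assumption that should be verified for the model in use, for example by checking vector norms), a squared L2 distance maps directly to a cosine similarity through the identity ||a - b||² = 2 - 2·cos(a, b), which is easier to read as a relevance score. The helper names below are hypothetical.

```python
import math

def l2sq_to_cosine(squared_distance):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - squared_distance / 2.0

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy 2-dimensional vectors (hypothetical values) to check the identity numerically
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))

print(round(l2sq_to_cosine(squared_l2), 6), round(cosine, 6))
```

Both printed values agree, so under the unit-norm assumption ranking by L2 distance and ranking by cosine similarity give the same order.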

In [22]:
import spacy

# Load an NER model for author extraction (use 'en_core_web_sm' or a larger one like 'en_core_web_trf')
nlp = spacy.load("en_core_web_sm")

def extract_author_name(text):
    """Extracts potential author names using Named Entity Recognition (NER)."""
    doc = nlp(text)
    authors = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    return authors

def search_papers(query=None, author=None, top_k=5):
    """Retrieve top_k most relevant papers based on query or author name."""
    if query:
        query_embedding = model.encode([query], convert_to_tensor=True, device=device).cpu().numpy()
        distances, indices = index.search(query_embedding, top_k)
        results = df.iloc[indices[0]][['title', 'abstract', 'authors']]
    elif author:
        # Use NER to extract author name
        extracted_authors = extract_author_name(author)
        if extracted_authors:
            author_name = extracted_authors[0]  # Taking the first detected author name
        else:
            author_name = author  # If NER fails, use input directly

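        # Search the authors column for the name as a literal substring
        # (regex=False so that "." in initials is not treated as a wildcard)
        results = df[df['authors'].str.contains(author_name, case=False, na=False, regex=False)].head(top_k)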
    else:
        print("Please provide either a query or an author name.")
        return None

    print("\n" + "=" * 80)
    if query:
        print(f"Top {top_k} Results for Query: '{query}'")
    else:
        print(f"Top {top_k} Results for Author: '{author_name}'")
    print("=" * 80 + "\n")

    for _, row in results.iterrows():
        print(f"**Title:** {row['title']}\n")
        print(f"**Abstract:** {row['abstract'][:500]}{'...' if len(row['abstract']) > 500 else ''}\n")
        print(f"**Authors:** {row['authors']}\n")
        print("-" * 80 + "\n")

    return results

query_results = search_papers(query="Neural networks for NLP", top_k=5)
author_results = search_papers(author="'S. Lawrence", top_k=5)





Top 5 Results for Query: 'Neural networks for NLP'

**Title:** Are Deep Learning Approaches Suitable for Natural Language Processing

**Abstract:** In recent years, Deep Learning (DL) techniques have gained much at-tention from Artificial Intelligence (AI) and Natural Language Processing (NLP) research communities because these approaches can often learn features from data without the need for human design or engineering interventions. In addition, DL approaches have achieved some remarkable results. In this paper, we have surveyed major recent contributions that use DL techniques for NLP tasks. All these reviewed topics have been limited t...

**Authors:** ['S. Alshahrani', 'Epaminondas Kapetanios']

--------------------------------------------------------------------------------

**Title:** Natural language grammatical inference with recurrent neural networks

**Abstract:** This paper examines the inductive inference of a complex grammar with neural networks and specifically, the ta
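
The author filter above matches a substring of the stringified author list, so a short query can also match unrelated names that merely contain it. A stricter variant, sketched below under the assumption that each `authors` value is a stringified Python list (as the printed results suggest), parses the list and compares whole names. The helper names `parse_authors` and `has_author` are hypothetical.

```python
import ast

def parse_authors(raw):
    """Turn "['A. Author', 'B. Author']" into a list of names; return [] if unparseable."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if isinstance(value, (list, tuple)):
        return [str(name) for name in value]
    return []

def has_author(raw, name):
    """True if `name` equals one of the parsed author names, ignoring case."""
    target = name.strip().casefold()
    return any(a.strip().casefold() == target for a in parse_authors(raw))

row = "['S. Alshahrani', 'Epaminondas Kapetanios']"
print(has_author(row, "s. alshahrani"))  # exact name, different case
print(has_author(row, "Kapetan"))        # partial name is not a match
```

With the notebook's DataFrame, this could be applied as `df[df['authors'].apply(lambda raw: has_author(raw, name))]`, at the cost of parsing every row on each query.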

## Research Assistant

In [16]:
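from functools import lru_cache
from transformers import pipeline

@lru_cache(maxsize=1)
def get_summarizer(model="facebook/bart-large-cnn"):
    """Build the summarization pipeline once and reuse it across calls."""
    return pipeline("summarization", model=model)

def summarize_paper_local(title, abstract, authors, model="facebook/bart-large-cnn"):
    """Summarize the research paper using a local transformer model."""
    summarizer = get_summarizer(model)

    # `authors` may be a list or the stringified list stored in the DataFrame;
    # joining a raw string would insert ", " between every character.
    author_text = authors if isinstance(authors, str) else ", ".join(authors)

    # Note: facebook/bart-large-cnn is a summarization model, not an instruction-following
    # one, so the instruction sentences below are treated as input text and can be echoed.
    prompt = f"""
    Summarize the following research paper concisely:

    **Title:** {title}
    **Abstract:** {abstract}
    **Authors:** {author_text}

    The summary should be clear and highlight the key contributions of the paper.
    """

    # truncation=True keeps long abstracts within the model's maximum input length
    summary = summarizer(prompt, max_length=150, min_length=50, do_sample=False, truncation=True)
    return summary[0]['summary_text']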

title = "Are Deep Learning Approaches Suitable for Natural Language Processing"
abstract = "In recent years, Deep Learning (DL) techniques have gained much attention from AI and NLP research communities..."
authors = ['S. Alshahrani', 'Epaminondas Kapetanios']

summary = summarize_paper_local(title, abstract, authors)
print("**Paper Summary:**\n", summary)


Device set to use cuda:0
Your max_length is set to 150, but your input_length is only 124. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=62)


**Paper Summary:**
 Deep Learning (DL) techniques have gained much attention from AI and NLP research communities. The summary should be clear and highlight the key contributions of the paper. Summarize the following research paper concisely: Are Deep Learning Approaches Suitable for Natural Language Processing (NLP)?
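
The summary above echoes the instruction sentences from the prompt, which is consistent with `facebook/bart-large-cnn` being a summarization model rather than an instruction-following one. One possible adjustment, sketched here as an assumption rather than the project's method, is to pass only the title and abstract and to cap `max_length` relative to the input length, as the warning suggests. `summarize_fn` is a hypothetical stand-in for any callable with the pipeline's call signature, and the word count is only a rough proxy for the token count.

```python
def build_summary_input(title, abstract):
    """Plain text for the summarizer: no instructions, just the paper content."""
    return f"{title.strip().rstrip('.')}. {abstract.strip()}"

def pick_lengths(text, ratio=0.5, floor=20, ceiling=150):
    """Choose (min_length, max_length) from a rough word count of the input."""
    n_words = len(text.split())
    max_length = max(floor, min(ceiling, int(n_words * ratio)))
    min_length = min(max_length // 2, 50)
    return min_length, max_length

def summarize(title, abstract, summarize_fn):
    """Summarize title + abstract with length limits scaled to the input."""
    text = build_summary_input(title, abstract)
    min_length, max_length = pick_lengths(text)
    result = summarize_fn(text, max_length=max_length, min_length=min_length, do_sample=False)
    return result[0]["summary_text"]

# Stand-in summarizer for demonstration only: returns the first sentence.
def fake_summarizer(text, **kwargs):
    return [{"summary_text": text.split(". ")[0] + "."}]

print(summarize("A Title", "First sentence here. Second sentence.", fake_summarizer))
```

With the real model, the pipeline object created by `pipeline("summarization", ...)` could be passed as `summarize_fn` in place of `fake_summarizer`.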


In [20]:
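import ast

for _, row in results.iterrows():
    title = row['title']
    abstract = row['abstract']
    # The dataset stores authors as a stringified list; parse it so names are joined correctly
    authors = row['authors']
    if isinstance(authors, str):
        authors = ast.literal_eval(authors)

    summary = summarize_paper_local(title, abstract, authors)
    print("**Paper Summary:**\n", summary)
    print("-" * 80 + "\n")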

Device set to use cuda:0


**Paper Summary:**
 Synthesizing photo-realistic images from text descriptions is a challenging problem in computer vision and has many practical applications. We propose stacked Generative Adversarial Networks (StackGAN) to generate photos conditioned on text descriptions. The Stage-I GAN sketches the primitive shape and basic colors of the object based on the given text description.
--------------------------------------------------------------------------------



**Paper Summary:**
 This paper describes a method for using Generative Adversarial Networks to learn distributed representations of natural language documents. We propose a model that is based on the recently proposed Energy-Based GAN, but instead uses a Denoising Autoencoder. Document representations are extracted from the hidden layer of the discriminator and evaluated both quantitatively and qualitatively.
--------------------------------------------------------------------------------



**Paper Summary:**
 This report summarizes the tutorial presented by the author at NIPS 2016 on generative adversarial networks (GANs) The tutorial describes: (1) Why generative modeling is a topic worth studying, (2) how generative models work, and (3) how GANs compare to other models. The tutorial contains three exercises for readers to complete, and the solutions.
--------------------------------------------------------------------------------



**Paper Summary:**
 Generative Adversarial Text to Image Synthesis (GAN) is a new type of AI system. It uses deep convolutional generative adversarial networks (GANs) to generate images of specific categories. The model can generate plausible images of birds and flowers from detailed text descriptions.
--------------------------------------------------------------------------------



**Paper Summary:**
 This paper introduces techniques for projecting image samples into the latent space using any pre-trained GAN, provided that the computational graph is available. We evaluate these techniques on both MNIST digits and Omniglot handwritten characters. In the case ofMNIST digits, we show that projections into the. latent space maintain information about the style and the identity of the digit. We show that even characters from alphabets that have not been seen during training may be projected well into. the latent. space.
--------------------------------------------------------------------------------

