RAG System

A Scala-based Retrieval-Augmented Generation (RAG) system that builds semantic search capabilities over PDF documents using Apache Lucene, Ollama embeddings, and Hadoop MapReduce.

AWS Implementation: Video and Output Files

YouTube demo: https://youtu.be/wxwdUuiz1A8
Results (Google Drive): https://drive.google.com/drive/folders/1PGKfexsTVAxPAamOpNQBSRuVVNHcEUVp?usp=sharing

Architecture

  • PDF Processing: Extract text from PDFs using PDFBox
  • Text Chunking: Split documents into overlapping windows
  • Vocabulary Building: Create token vocabulary using MapReduce
  • Embeddings: Generate vector embeddings using Ollama (mxbai-embed-large)
  • Indexing: Build Lucene search index with vector similarity
  • Search: Semantic search with optional RAG pipeline using Ollama (tinyllama)

Prerequisites

Local Development

  • Java 17+
  • SBT (Scala Build Tool)
  • Ollama installed and running
  • Models: mxbai-embed-large, tinyllama

AWS Deployment

  • AWS CLI configured
  • EMR cluster with Java 17
  • Ollama installed on EMR nodes
  • S3 bucket for data storage

Local Setup

1. Install Dependencies

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama
ollama serve

# Pull required models
ollama pull mxbai-embed-large
ollama pull tinyllama
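
You can confirm both models were pulled before moving on; the /api/tags endpoint is the same one the AWS bootstrap script polls later:

# Verify Ollama is serving and both models are present
curl -s http://localhost:11434/api/tags
ollama list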

2. Configure Environment

# Set Ollama host
export OLLAMA_HOST=http://127.0.0.1:11434

# For Windows (Git Bash) users, also set Hadoop paths (quote them so the backslashes survive)
export HADOOP_HOME="C:\hadoop\hadoop-3.3.6"
export JAVA_LIBRARY_PATH="C:\hadoop\hadoop-3.3.6\bin"

3. Prepare Data

# Create input directory
mkdir MSR

# Copy your PDF files into the MSR/ directory

Creating filelist.txt for Local Development

Option 1: Manual Creation

# Create filelist.txt with PDF paths
echo "MSR/document1.pdf" > filelist.txt
echo "MSR/document2.pdf" >> filelist.txt
echo "MSR/document3.pdf" >> filelist.txt

Option 2: Automatic Generation

# Generate filelist.txt from all PDFs in MSR directory
find MSR -name "*.pdf" | sort > filelist.txt

# Or on Windows PowerShell (write UTF-8 so downstream tools can read the file):
Get-ChildItem MSR -Filter "*.pdf" | ForEach-Object { $_.FullName } | Sort-Object | Out-File filelist.txt -Encoding UTF8

Option 3: Using Absolute Paths

# Generate with absolute paths (recommended for Windows)
find "$(pwd)/MSR" -name "*.pdf" | sort > filelist.txt

# Or on Windows PowerShell:
Get-ChildItem MSR -Filter "*.pdf" | ForEach-Object { (Resolve-Path $_.FullName).Path } | Sort-Object | Out-File filelist.txt -Encoding UTF8

Example filelist.txt content:

MSR/paper1.pdf
MSR/paper2.pdf
MSR/paper3.pdf
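
Whichever option you choose, it is worth confirming that every entry in filelist.txt resolves to a real file; a minimal check:

# Report any filelist entries that do not exist on disk
while IFS= read -r f; do
  [ -f "$f" ] || echo "missing: $f"
done < filelist.txt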

4. Build and Run

# Compile
sbt compile

# Run tests
sbt "runMain Main test"

# Build vocabulary
sbt "runMain Main vocab filelist.txt out/vocab-mr 20"

# Build embeddings
sbt "runMain Main tokenEmbed out/vocab-mr/vocab.csv out/token-embeddings.csv"

# Build Lucene index
sbt "runMain Main hadoop filelist.txt out/mr-output"

# Search
sbt "runMain Main search --q \"What are bandits?\" --index out/index_shard_0"

# RAG search
sbt "runMain Main search --q \"What are bandits?\" --index out/index_shard_0 --rag"

AWS Deployment

1. Bootstrap Script

Create aws-bootstrap.sh:

#!/bin/bash
set -x
set -e

log() { echo "[bootstrap] $*"; }

echo " Starting RAG Builder EMR Bootstrap"

# Detect master vs core/task
ROLE=$(jq -r '.instanceRole' /mnt/var/lib/info/instance.json 2>/dev/null || echo unknown)
IS_MASTER=false
if [ "$ROLE" = "MASTER" ] || [ "$ROLE" = "Head" ]; then
  IS_MASTER=true
fi
log "Instance role: $ROLE (is_master=$IS_MASTER)"

# System updates and tools
sudo yum clean all -y || true
sudo yum install -y jq git curl tar gzip || true

# Set UTF-8 defaults
sudo tee /etc/profile.d/utf8.sh >/dev/null <<'EOF'
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
EOF
source /etc/profile.d/utf8.sh || true

# Install Java 17 (Amazon Corretto)
echo "--- Installing Amazon Corretto 17 ---"
sudo rpm --import https://yum.corretto.aws/corretto.key || true
sudo curl -sLo /etc/yum.repos.d/corretto.repo https://yum.corretto.aws/corretto.repo || true
sudo yum install -y java-17-amazon-corretto-devel || sudo yum install -y java-17-amazon-corretto || true

# Set JAVA_HOME
JAVA_17_HOME="/usr/lib/jvm/java-17-amazon-corretto.x86_64"
if [ -d "$JAVA_17_HOME" ]; then
  echo "export JAVA_HOME=$JAVA_17_HOME" | sudo tee -a /etc/profile.d/java.sh
  echo "export PATH=\$JAVA_HOME/bin:\$PATH" | sudo tee -a /etc/profile.d/java.sh
  source /etc/profile.d/java.sh
  log "Java 17 installed at: $JAVA_17_HOME"
fi

# Install Ollama on ALL nodes
echo " Installing Ollama on all nodes..."
curl -fsSL https://ollama.com/install.sh | sh

# Create ollama service user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama || true

# Configure Ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_MODELS=/mnt/ollama"
EOF

# Start Ollama service
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama

# Wait for Ollama to be ready
echo " Waiting for Ollama to start..."
for i in {1..30}; do
    if curl -s http://localhost:11434/api/tags > /dev/null; then
        echo " Ollama is ready!"
        break
    fi
    echo "Waiting for Ollama... ($i/30)"
    sleep 2
done

# Pull required models
echo " Pulling required Ollama models..."
sudo -u ollama ollama pull tinyllama &
sudo -u ollama ollama pull mxbai-embed-large &
wait

echo " Ollama models pulled successfully"

# Set environment variables
sudo tee /etc/profile.d/ollama.sh >/dev/null <<'EOF'
export OLLAMA_HOST=${OLLAMA_HOST:-http://127.0.0.1:11434}
export OLLAMA_MODELS=${OLLAMA_MODELS:-/mnt/ollama}
export LANG=${LANG:-en_US.UTF-8}
export LC_ALL=${LC_ALL:-en_US.UTF-8}
EOF

log "Bootstrap completed successfully!"

2. Upload Data to S3

# Upload PDFs to S3
aws s3 sync MSR/ s3://your-bucket/MSRCorpus/
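
A quick spot-check that the sync completed:

# List a few uploaded objects
aws s3 ls s3://your-bucket/MSRCorpus/ | head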

Creating filelist.txt for EMR/AWS

Option 1: Generate from S3 Contents

# Create filelist with S3 paths (with --recursive, the listed key in $4 already includes the MSRCorpus/ prefix)
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3://your-bucket/" $4}' > filelist-s3.txt

# Upload to S3
aws s3 cp filelist-s3.txt s3://your-bucket/input/filelist.txt

Option 2: Manual S3 filelist Creation

# Create filelist.txt manually with S3 paths
cat > filelist-s3.txt << EOF
s3://your-bucket/MSRCorpus/paper1.pdf
s3://your-bucket/MSRCorpus/paper2.pdf
s3://your-bucket/MSRCorpus/paper3.pdf
EOF

# Upload to S3
aws s3 cp filelist-s3.txt s3://your-bucket/input/filelist.txt

Option 3: Using S3A Paths (Alternative)

# Create filelist with S3A paths (if preferred); as above, $4 already includes the prefix
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3a://your-bucket/" $4}' > filelist-s3a.txt
aws s3 cp filelist-s3a.txt s3://your-bucket/input/filelist.txt

Option 4: Generate on EMR Master

# SSH into EMR master and generate filelist
ssh -i your-key.pem hadoop@your-emr-master-ip

# Generate filelist from S3 ($4 already includes the MSRCorpus/ prefix)
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3://your-bucket/" $4}' > filelist.txt

# Verify contents
head -5 filelist.txt

Example S3 filelist.txt content:

s3://your-bucket/MSRCorpus/1083142.1083143.pdf
s3://your-bucket/MSRCorpus/1083144.1083145.pdf
s3://your-bucket/MSRCorpus/1083146.1083147.pdf
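
Whichever option you use, the filelist line count should match the number of PDF objects under the prefix; a quick comparison:

# Both counts should agree
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | grep -c '\.pdf$'
wc -l < filelist-s3.txt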

Automated filelist Creation Scripts

Local filelist generator (create-local-filelist.sh):

#!/bin/bash
# Create filelist.txt for local development

INPUT_DIR="MSR"
OUTPUT_FILE="filelist.txt"

if [ ! -d "$INPUT_DIR" ]; then
    echo "Error: Directory $INPUT_DIR does not exist"
    exit 1
fi

echo "Creating filelist.txt from $INPUT_DIR directory..."
find "$INPUT_DIR" -name "*.pdf" | sort > "$OUTPUT_FILE"

echo "Generated $OUTPUT_FILE with $(wc -l < "$OUTPUT_FILE") PDF files"
echo "First 5 files:"
head -5 "$OUTPUT_FILE"

S3 filelist generator (create-s3-filelist.sh):

#!/bin/bash
# Create filelist.txt for EMR/AWS

BUCKET_NAME="your-bucket"
S3_PREFIX="MSRCorpus"
OUTPUT_FILE="filelist-s3.txt"

echo "Creating S3 filelist from s3://$BUCKET_NAME/$S3_PREFIX/..."
aws s3 ls "s3://$BUCKET_NAME/$S3_PREFIX/" --recursive | \
    awk '{print "s3://'$BUCKET_NAME'/'$S3_PREFIX'/" $4}' > "$OUTPUT_FILE"

echo "Generated $OUTPUT_FILE with $(wc -l < "$OUTPUT_FILE") PDF files"
echo "First 5 files:"
head -5 "$OUTPUT_FILE"

# Upload to S3
echo "Uploading to S3..."
aws s3 cp "$OUTPUT_FILE" "s3://$BUCKET_NAME/input/filelist.txt"
echo " Uploaded to s3://$BUCKET_NAME/input/filelist.txt"

Windows PowerShell filelist generator (create-filelist.ps1):

# Create filelist.txt for local development on Windows

$InputDir = "MSR"
$OutputFile = "filelist.txt"

if (-not (Test-Path $InputDir)) {
    Write-Error "Directory $InputDir does not exist"
    exit 1
}

Write-Host "Creating filelist.txt from $InputDir directory..."
Get-ChildItem $InputDir -Filter "*.pdf" -Recurse | 
    ForEach-Object { $_.FullName } | 
    Sort-Object | 
    Out-File -FilePath $OutputFile -Encoding UTF8

$fileCount = (Get-Content $OutputFile).Count
Write-Host "Generated $OutputFile with $fileCount PDF files"
Write-Host "First 5 files:"
Get-Content $OutputFile | Select-Object -First 5

Build and upload the assembly JAR:

# Build the fat JAR (assuming the sbt-assembly plugin, as the JAR name suggests)
sbt assembly

# Upload JAR
aws s3 cp target/scala-3.5.1/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar s3://your-bucket/jars/

3. Create EMR Cluster

Using AWS EMR Console:

  1. Go to EMR Console: Navigate to AWS EMR in the AWS Management Console
  2. Create Cluster: Click "Create cluster"
  3. Configure Cluster:
    • Cluster name: RAG-System-Cluster
    • Release: emr-7.0.0
    • Applications: Select Hadoop
  4. Instance Configuration:
    • Instance type: m5.2xlarge
    • Number of instances: 3 (1 master + 2 core nodes)
  5. Bootstrap Actions:
    • Bootstrap action: Custom action
    • Script location: s3://your-bucket/bootstrap/aws-bootstrap.sh
  6. Logging:
    • Log URI: s3://your-bucket/logs/
  7. IAM Roles: Use default EMR roles (or create custom ones)
  8. Create Cluster: Click "Create cluster"

Using AWS CLI (Alternative):

aws emr create-cluster \
  --name "RAG-System-Cluster" \
  --release-label emr-7.0.0 \
  --instance-type m5.2xlarge \
  --instance-count 3 \
  --applications Name=Hadoop \
  --bootstrap-actions Path=s3://your-bucket/bootstrap/aws-bootstrap.sh \
  --log-uri s3://your-bucket/logs/ \
  --use-default-roles
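
Cluster creation is asynchronous; you can capture the cluster ID and block until it is ready with the standard EMR waiter:

# Grab the newest active cluster's ID and wait until it is fully up
CLUSTER_ID=$(aws emr list-clusters --active --query 'Clusters[0].Id' --output text)
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"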

4. Run MapReduce Jobs

Using AWS EMR Console:

  1. Go to EMR Console: Navigate to your cluster
  2. Add Steps: Click "Add step" button
  3. Configure Step:
    • Step type: Custom JAR
    • JAR location: s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar
    • Main class: Leave blank (the JAR manifest already specifies Main)
    • Arguments: vocab s3://your-bucket/input/filelist.txt s3://your-bucket/out/vocab-mr 20
    • Action on failure: Continue
  4. Add Step: Click "Add step"

Build Lucene Index:

  1. Add Another Step: Click "Add step" again
  2. Configure Step:
    • Step type: Custom JAR
    • JAR location: s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar
    • Main class: Leave blank
    • Arguments: hadoop s3://your-bucket/input/filelist.txt s3://your-bucket/out/mr-output s3://your-bucket/out/lucene-shards
    • Action on failure: Continue
  3. Add Step: Click "Add step"

Using AWS CLI (Alternative):

# Get cluster ID
CLUSTER_ID=$(aws emr list-clusters --active --query 'Clusters[0].Id' --output text)

# Build vocabulary
aws emr add-steps \
  --cluster-id $CLUSTER_ID \
  --steps Type=CUSTOM_JAR,Name=BuildVocab,ActionOnFailure=CONTINUE,Jar=s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar,Args=["vocab","s3://your-bucket/input/filelist.txt","s3://your-bucket/out/vocab-mr","20"]

# Build Lucene index
aws emr add-steps \
  --cluster-id $CLUSTER_ID \
  --steps Type=CUSTOM_JAR,Name=BuildIndex,ActionOnFailure=CONTINUE,Jar=s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar,Args=["hadoop","s3://your-bucket/input/filelist.txt","s3://your-bucket/out/mr-output","s3://your-bucket/out/lucene-shards"]
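
Steps also run asynchronously; you can poll their state, or block until the most recently added step finishes, with the standard EMR commands:

# Show step states, then wait for the most recently added step to complete
aws emr list-steps --cluster-id $CLUSTER_ID --query 'Steps[*].[Name,Status.State]' --output table
STEP_ID=$(aws emr list-steps --cluster-id $CLUSTER_ID --query 'Steps[0].Id' --output text)
aws emr wait step-complete --cluster-id $CLUSTER_ID --step-id $STEP_ID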

5. Run Single-Node Tools

SSH into EMR master and run:

# Download files locally
aws s3 cp s3://your-bucket/out/vocab-mr/vocab.csv .
aws s3 cp s3://your-bucket/out/token-embeddings.csv .

# Build token embeddings (skip if you downloaded them above)
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar tokenEmbed vocab.csv token-embeddings.csv

# Find token neighbors
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar tokenNeighbors --in token-embeddings.csv --vocab vocab.csv --top 2000 --k 10

# Download Lucene shards
aws s3 sync s3://your-bucket/out/lucene-shards/ lucene-shards/

# Run search
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar search --q "What are bandits?" --index lucene-shards/

# Run RAG search
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar search --q "What are bandits?" --index lucene-shards/ --rag

# Run tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test

Testing

Local Testing

# Run all tests
sbt "runMain Main test"

# Run specific tests
sbt "runMain Main test logging"
sbt "runMain Main test chunker"
sbt "runMain Main test tokenizer"
sbt "runMain Main test vectors"
sbt "runMain Main test embeddecoder"

AWS Testing

# Run all tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test

# Run specific tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test logging
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test chunker

Commands Reference

Main Commands

  • vocab <filelist> <output> <linesPerSplit> - Build vocabulary using MapReduce
  • hadoop <filelist> <output> [lucene_output_dir] - Build Lucene index and embeddings
  • tokenEmbed <vocab> <output> - Generate token embeddings
  • tokenNeighbors --in <embeddings> --vocab <vocab> [options] - Find similar tokens
  • search --q "question" [options] - Semantic search
  • embedEval <queries> <output> - Evaluate embedding quality
  • test [testName] - Run tests

Search Options

  • --q "question" - Search query
  • --index path - Lucene index path
  • --k number - Number of results (default: 5)
  • --rag - Enable RAG pipeline with chat completion

TokenNeighbors Options

  • --in <file> - Input embeddings file
  • --vocab <file> - Vocabulary file
  • --top N - Top N tokens to process
  • --k N - K nearest neighbors

Configuration

application.conf

app {
  pdf_dir         = "MSR"
  out_dir         = "out"
  out_dir         = ${?OUT_DIR}  # optional override from the OUT_DIR environment variable
  shards          = 4
  window_chars    = 1400
  overlap_chars   = 250
  embedding_model = "mxbai-embed-large"
  similarity      = "cosine"
  topk_neighbors  = 5
  batch_size      = 64
}
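
In HOCON the later assignment wins, so keeping the optional ${?OUT_DIR} substitution after the literal default lets the environment variable override it at launch; for example:

# Redirect all outputs for this run (hypothetical path)
OUT_DIR=/tmp/rag-out sbt "runMain Main test"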

Environment Variables

  • OLLAMA_HOST - Ollama server URL (default: http://127.0.0.1:11434)
  • OUT_DIR - Output directory for results
  • HADOOP_HOME - Hadoop installation path (Windows only)

Troubleshooting

Common Issues

  1. SLF4J Warnings

    • Fixed by excluding conflicting logging dependencies
    • Run sbt "runMain Main test logging" to verify
  2. Hadoop Native Library Error (Windows)

    • Install winutils.exe and hadoop.dll
    • Set HADOOP_HOME environment variable
    • Use provided Windows-specific javaOptions
  3. OLLAMA_HOST Not Set

    • Set environment variable: export OLLAMA_HOST=http://127.0.0.1:11434
    • Ensure Ollama is running: ollama serve
  4. Wrong FS Error (EMR)

    • Use S3A paths: s3a://bucket/path
    • Ensure S3A configuration is set in Hadoop
  5. Java Version Mismatch

    • Ensure Java 17 is installed on EMR
    • Check JAVA_HOME is set correctly

Debug Commands

# Check Java version
java -version

# Check Ollama status
curl http://localhost:11434/api/tags

# Check Hadoop configuration
hadoop version

# Test S3 connectivity
aws s3 ls s3://your-bucket/

Performance Tips

  1. Batch Size: Adjust batch_size in application.conf for embedding requests
  2. Shards: Increase shards for larger datasets
  3. Window Size: Optimize window_chars and overlap_chars for your documents (see the estimate sketch below)
  4. EMR Instance Types: Use larger instances for faster processing
  5. S3 Transfer: Use aws s3 sync for efficient file transfers
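
To see how window_chars and overlap_chars interact, here is a rough chunk-count estimate, assuming a simple fixed-stride sliding window (the actual chunker may differ in detail):

# Estimate chunk count for a 100,000-character document at the default settings
DOC_CHARS=100000; WINDOW=1400; OVERLAP=250
STRIDE=$((WINDOW - OVERLAP))                              # each window advances 1150 chars
echo $(( (DOC_CHARS - OVERLAP + STRIDE - 1) / STRIDE ))   # ceil((100000-250)/1150) = 87 chunks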

Output Files

  • vocab.csv - Token vocabulary
  • token-embeddings.csv - Token embeddings
  • token_neighbors.csv - Similar token pairs
  • index_shard_* - Lucene search indexes
  • embed_similarity.csv - Embedding evaluation results
  • embed_analogy.csv - Word analogy test results

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes
  4. Run tests: sbt "runMain Main test"
  5. Submit a pull request

License

This project is part of CS441 coursework at the University of Illinois at Chicago.
