A Scala-based Retrieval-Augmented Generation (RAG) system that builds semantic search capabilities over PDF documents using Apache Lucene, Ollama embeddings, and Hadoop MapReduce.
YouTube link: https://youtu.be/wxwdUuiz1A8
Results Drive link: https://drive.google.com/drive/folders/1PGKfexsTVAxPAamOpNQBSRuVVNHcEUVp?usp=sharing
- PDF Processing: Extract text from PDFs using PDFBox
- Text Chunking: Split documents into overlapping windows (see the sketch after this list)
- Vocabulary Building: Create token vocabulary using MapReduce
- Embeddings: Generate vector embeddings using Ollama (mxbai-embed-large)
- Indexing: Build Lucene search index with vector similarity
- Search: Semantic search with optional RAG pipeline using Ollama (tinyllama)
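The chunking step slides a fixed-size character window over each document with an overlap, so content near a chunk boundary appears in two chunks. A minimal sketch of the idea (illustrative code, not the project's actual classes; the defaults mirror the `window_chars`/`overlap_chars` settings in application.conf):

```scala
// Overlapping-window chunking sketch: each window starts (window - overlap)
// characters after the previous one, so adjacent chunks share `overlap` chars.
def chunk(text: String, window: Int = 1400, overlap: Int = 250): Seq[String] =
  val step = window - overlap
  (0 until text.length by step)
    .map(start => text.substring(start, math.min(start + window, text.length)))

// e.g. chunk("a" * 3000) yields chunks of 1400, 1400, and 700 characters,
// starting at offsets 0, 1150, and 2300.
```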
- Java 17+
- SBT (Scala Build Tool)
- Ollama installed and running
- Models: mxbai-embed-large, tinyllama
- AWS CLI configured
- EMR cluster with Java 17
- Ollama installed on EMR nodes
- S3 bucket for data storage
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama
ollama serve
# Pull required models
ollama pull mxbai-embed-large
ollama pull tinyllama

# Set Ollama host
export OLLAMA_HOST=http://127.0.0.1:11434
# For Windows users, also set Hadoop paths
export HADOOP_HOME=C:\hadoop\hadoop-3.3.6
export JAVA_LIBRARY_PATH=C:\hadoop\hadoop-3.3.6\bin

# Create input directory
mkdir MSR
# Add PDF files to MSR directory
# Copy your PDF files to MSR/ directory

Option 1: Manual Creation
# Create filelist.txt with PDF paths
echo "MSR/document1.pdf" > filelist.txt
echo "MSR/document2.pdf" >> filelist.txt
echo "MSR/document3.pdf" >> filelist.txtOption 2: Automatic Generation
# Generate filelist.txt from all PDFs in MSR directory
find MSR -name "*.pdf" | sort > filelist.txt
# Or on Windows PowerShell:
Get-ChildItem MSR -Filter "*.pdf" | ForEach-Object { $_.FullName } | Sort-Object > filelist.txt

Option 3: Using Absolute Paths
# Generate with absolute paths (recommended for Windows)
find "$(pwd)/MSR" -name "*.pdf" | sort > filelist.txt
# Or on Windows PowerShell:
Get-ChildItem MSR -Filter "*.pdf" | ForEach-Object { (Resolve-Path $_.FullName).Path } | Sort-Object > filelist.txt

Example filelist.txt content:
MSR/paper1.pdf
MSR/paper2.pdf
MSR/paper3.pdf
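The `<linesPerSplit>` argument of the `vocab` command (the `20` in the examples below) controls how many filelist lines each MapReduce split receives, in the style of Hadoop's NLineInputFormat. A sketch of that grouping (illustrative helper, assuming one PDF path per line):

```scala
import scala.io.Source

// Group filelist lines into fixed-size splits; each group would become
// roughly one map task's worth of PDFs.
def splits(filelist: String, linesPerSplit: Int = 20): Seq[Seq[String]] =
  val src = Source.fromFile(filelist)
  try src.getLines().map(_.trim).filter(_.nonEmpty).toSeq.grouped(linesPerSplit).toSeq
  finally src.close()
```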
# Compile
sbt compile
# Run tests
sbt "runMain Main test"
# Build vocabulary
sbt "runMain Main vocab filelist.txt out/vocab-mr 20"
# Build embeddings
sbt "runMain Main tokenEmbed out/vocab-mr/vocab.csv out/token-embeddings.csv"
# Build Lucene index
sbt "runMain Main hadoop filelist.txt out/mr-output"
# Search
sbt "runMain Main search --q \"What are bandits?\" --index out/index_shard_0"
# RAG search
sbt "runMain Main search --q \"What are bandits?\" --index out/index_shard_0 --rag"Create aws-bootstrap.sh:
#!/bin/bash
set -x
set -e
log() { echo "[bootstrap] $*"; }
echo " Starting RAG Builder EMR Bootstrap"
# System updates and tools (install jq before it is used below)
sudo yum clean all -y || true
sudo yum install -y jq git curl tar gzip || true
# Detect master vs core/task
ROLE=$(jq -r '.instanceRole' /mnt/var/lib/info/instance.json 2>/dev/null || echo unknown)
IS_MASTER=false
if [ "$ROLE" = "MASTER" ] || [ "$ROLE" = "Head" ]; then
  IS_MASTER=true
fi
log "Instance role: $ROLE (is_master=$IS_MASTER)"
# Set UTF-8 defaults
sudo tee /etc/profile.d/utf8.sh >/dev/null <<'EOF'
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
EOF
source /etc/profile.d/utf8.sh || true
# Install Java 17 (Amazon Corretto)
echo "--- Installing Amazon Corretto 17 ---"
sudo rpm --import https://yum.corretto.aws/corretto.key || true
sudo curl -sLo /etc/yum.repos.d/corretto.repo https://yum.corretto.aws/corretto.repo || true
sudo yum install -y java-17-amazon-corretto-devel || sudo yum install -y java-17-amazon-corretto || true
# Set JAVA_HOME
JAVA_17_HOME="/usr/lib/jvm/java-17-amazon-corretto.x86_64"
if [ -d "$JAVA_17_HOME" ]; then
echo "export JAVA_HOME=$JAVA_17_HOME" | sudo tee -a /etc/profile.d/java.sh
echo "export PATH=\$JAVA_HOME/bin:\$PATH" | sudo tee -a /etc/profile.d/java.sh
source /etc/profile.d/java.sh
log "Java 17 installed at: $JAVA_17_HOME"
fi
# Install Ollama on ALL nodes
echo " Installing Ollama on all nodes..."
curl -fsSL https://ollama.com/install.sh | sh
# Create ollama service user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama || true
# Configure Ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_MODELS=/mnt/ollama"
EOF
# Start Ollama service
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
# Wait for Ollama to be ready
echo " Waiting for Ollama to start..."
for i in {1..30}; do
if curl -s http://localhost:11434/api/tags > /dev/null; then
echo " Ollama is ready!"
break
fi
echo "Waiting for Ollama... ($i/30)"
sleep 2
done
# Pull required models
echo " Pulling required Ollama models..."
sudo -u ollama ollama pull tinyllama &
sudo -u ollama ollama pull mxbai-embed-large &
wait
echo " Ollama models pulled successfully"
# Set environment variables
sudo tee /etc/profile.d/ollama.sh >/dev/null <<'EOF'
export OLLAMA_HOST=${OLLAMA_HOST:-http://127.0.0.1:11434}
export OLLAMA_MODELS=${OLLAMA_MODELS:-/mnt/ollama}
export LANG=${LANG:-en_US.UTF-8}
export LC_ALL=${LC_ALL:-en_US.UTF-8}
EOF
log "Bootstrap completed successfully!"# Upload PDFs to S3
aws s3 sync MSR/ s3://your-bucket/MSRCorpus/

Option 1: Generate from S3 Contents
# Create filelist with S3 paths
# --recursive already prints full keys (including MSRCorpus/), so prepend only the bucket
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3://your-bucket/" $4}' > filelist-s3.txt
# Upload to S3
aws s3 cp filelist-s3.txt s3://your-bucket/input/filelist.txt

Option 2: Manual S3 filelist Creation
# Create filelist.txt manually with S3 paths
cat > filelist-s3.txt << EOF
s3://your-bucket/MSRCorpus/paper1.pdf
s3://your-bucket/MSRCorpus/paper2.pdf
s3://your-bucket/MSRCorpus/paper3.pdf
EOF
# Upload to S3
aws s3 cp filelist-s3.txt s3://your-bucket/input/filelist.txt

Option 3: Using S3A Paths (Alternative)
# Create filelist with S3A paths (if preferred)
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3a://your-bucket/" $4}' > filelist-s3a.txt
aws s3 cp filelist-s3a.txt s3://your-bucket/input/filelist.txt

Option 4: Generate on EMR Master
# SSH into EMR master and generate filelist
ssh -i your-key.pem hadoop@your-emr-master-ip
# Generate filelist from S3
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3://your-bucket/" $4}' > filelist.txt
# Verify contents
head -5 filelist.txt

Example S3 filelist.txt content:
s3://your-bucket/MSRCorpus/1083142.1083143.pdf
s3://your-bucket/MSRCorpus/1083144.1083145.pdf
s3://your-bucket/MSRCorpus/1083146.1083147.pdf
Local filelist generator (create-local-filelist.sh):
#!/bin/bash
# Create filelist.txt for local development
INPUT_DIR="MSR"
OUTPUT_FILE="filelist.txt"
if [ ! -d "$INPUT_DIR" ]; then
  echo "Error: Directory $INPUT_DIR does not exist"
  exit 1
fi
echo "Creating filelist.txt from $INPUT_DIR directory..."
find "$INPUT_DIR" -name "*.pdf" | sort > "$OUTPUT_FILE"
echo "Generated $OUTPUT_FILE with $(wc -l < "$OUTPUT_FILE") PDF files"
echo "First 5 files:"
head -5 "$OUTPUT_FILE"S3 filelist generator (create-s3-filelist.sh):
#!/bin/bash
# Create filelist.txt for EMR/AWS
BUCKET_NAME="your-bucket"
S3_PREFIX="MSRCorpus"
OUTPUT_FILE="filelist-s3.txt"
echo "Creating S3 filelist from s3://$BUCKET_NAME/$S3_PREFIX/..."
aws s3 ls "s3://$BUCKET_NAME/$S3_PREFIX/" --recursive | \
awk '{print "s3://'$BUCKET_NAME'/'$S3_PREFIX'/" $4}' > "$OUTPUT_FILE"
echo "Generated $OUTPUT_FILE with $(wc -l < "$OUTPUT_FILE") PDF files"
echo "First 5 files:"
head -5 "$OUTPUT_FILE"
# Upload to S3
echo "Uploading to S3..."
aws s3 cp "$OUTPUT_FILE" "s3://$BUCKET_NAME/input/filelist.txt"
echo " Uploaded to s3://$BUCKET_NAME/input/filelist.txt"Windows PowerShell filelist generator (create-filelist.ps1):
# Create filelist.txt for local development on Windows
$InputDir = "MSR"
$OutputFile = "filelist.txt"
if (-not (Test-Path $InputDir)) {
    Write-Error "Directory $InputDir does not exist"
    exit 1
}
Write-Host "Creating filelist.txt from $InputDir directory..."
Get-ChildItem $InputDir -Filter "*.pdf" -Recurse |
ForEach-Object { $_.FullName } |
Sort-Object |
Out-File -FilePath $OutputFile -Encoding UTF8
$fileCount = (Get-Content $OutputFile).Count
Write-Host "Generated $OutputFile with $fileCount PDF files"
Write-Host "First 5 files:"
Get-Content $OutputFile | Select-Object -First 5

# Upload JAR
aws s3 cp target/scala-3.5.1/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar s3://your-bucket/jars/

- Go to EMR Console: Navigate to AWS EMR in the AWS Management Console
- Create Cluster: Click "Create cluster"
- Configure Cluster:
  - Cluster name: RAG-System-Cluster
  - Release: emr-7.0.0
  - Applications: Select Hadoop
- Instance Configuration:
  - Instance type: m5.2xlarge
  - Number of instances: 3 (1 master + 2 core nodes)
- Bootstrap Actions:
  - Bootstrap action: Custom action
  - Script location: s3://your-bucket/bootstrap/aws-bootstrap.sh
- Logging:
  - Log URI: s3://your-bucket/logs/
- IAM Roles: Use default EMR roles (or create custom ones)
- Create Cluster: Click "Create cluster"
aws emr create-cluster \
--name "RAG-System-Cluster" \
--release-label emr-7.0.0 \
--instance-type m5.2xlarge \
--instance-count 3 \
--applications Name=Hadoop \
--bootstrap-actions Path=s3://your-bucket/bootstrap/aws-bootstrap.sh \
--log-uri s3://your-bucket/logs/ \
--use-default-roles

- Go to EMR Console: Navigate to your cluster
- Add Steps: Click "Add step" button
- Configure Step:
  - Step type: Custom JAR
  - JAR location: s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar
  - Main class: Leave blank (uses Main class)
  - Arguments: vocab s3://your-bucket/input/filelist.txt s3://your-bucket/out/vocab-mr 20
  - Action on failure: Continue
- Add Step: Click "Add step"
- Add Another Step: Click "Add step" again
- Configure Step:
  - Step type: Custom JAR
  - JAR location: s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar
  - Main class: Leave blank
  - Arguments: hadoop s3://your-bucket/input/filelist.txt s3://your-bucket/out/mr-output s3://your-bucket/out/lucene-shards
  - Action on failure: Continue
- Add Step: Click "Add step"
# Get cluster ID
CLUSTER_ID=$(aws emr list-clusters --active --query 'Clusters[0].Id' --output text)
# Build vocabulary
aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps Type=CUSTOM_JAR,Name=BuildVocab,ActionOnFailure=CONTINUE,Jar=s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar,Args=["vocab","s3://your-bucket/input/filelist.txt","s3://your-bucket/out/vocab-mr","20"]
# Build Lucene index
aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps Type=CUSTOM_JAR,Name=BuildIndex,ActionOnFailure=CONTINUE,Jar=s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar,Args=["hadoop","s3://your-bucket/input/filelist.txt","s3://your-bucket/out/mr-output","s3://your-bucket/out/lucene-shards"]

SSH into the EMR master and run:
# Download files locally
aws s3 cp s3://your-bucket/out/vocab-mr/vocab.csv .
aws s3 cp s3://your-bucket/out/token-embeddings.csv .
# Build token embeddings
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar tokenEmbed vocab.csv token-embeddings.csv
# Find token neighbors
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar tokenNeighbors --in token-embeddings.csv --vocab vocab.csv --top 2000 --k 10
# Download Lucene shards
aws s3 sync s3://your-bucket/out/lucene-shards/ lucene-shards/
# Run search
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar search --q "What are bandits?" --index lucene-shards/
# Run RAG search
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar search --q "What are bandits?" --index lucene-shards/ --rag
# Run tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test

# Run all tests
sbt "runMain Main test"
# Run specific tests
sbt "runMain Main test logging"
sbt "runMain Main test chunker"
sbt "runMain Main test tokenizer"
sbt "runMain Main test vectors"
sbt "runMain Main test embeddecoder"# Run all tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test
# Run specific tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test logging
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test chunker

Available commands:
- `vocab <filelist> <output> <linesPerSplit>` - Build vocabulary using MapReduce
- `hadoop <filelist> <output> [lucene_output_dir]` - Build Lucene index and embeddings
- `tokenEmbed <vocab> <output>` - Generate token embeddings
- `tokenNeighbors <embeddings> <vocab> <output> [options]` - Find similar tokens
- `search --q "question" [options]` - Semantic search
- `embedEval <queries> <output>` - Evaluate embedding quality
- `test [testName]` - Run tests
--q "question"- Search query--index path- Lucene index path--k number- Number of results (default: 5)--rag- Enable RAG pipeline with chat completion
tokenNeighbors options:
- `--in <file>` - Input embeddings file
- `--vocab <file>` - Vocabulary file
- `--top N` - Top N tokens to process
- `--k N` - K nearest neighbors
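Conceptually, tokenNeighbors is brute-force cosine KNN over the token-embedding table: score every vocabulary vector against the query vector and keep the top k. A small sketch of that arithmetic (illustrative; the real tool reads the embeddings CSV):

```scala
// Cosine similarity between two dense vectors.
def cosine(a: Array[Float], b: Array[Float]): Double =
  val dot = a.lazyZip(b).map(_ * _).sum
  val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if norm == 0 then 0.0 else dot / norm

// Rank every token by similarity to the query vector and keep the top k.
def neighbors(query: Array[Float], embeddings: Map[String, Array[Float]],
              k: Int): Seq[(String, Double)] =
  embeddings.toSeq
    .map((tok, vec) => tok -> cosine(query, vec))
    .sortBy(-_._2)
    .take(k)
```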
app {
pdf_dir = "MSR"
out_dir = "out"
out_dir = ${?OUT_DIR}
shards = 4
window_chars = 1400
overlap_chars = 250
embedding_model = "mxbai-embed-large"
similarity = "cosine"
topk_neighbors = 5
batch_size = 64
}

Environment variables:
- `OLLAMA_HOST` - Ollama server URL (default: http://127.0.0.1:11434)
- `OUT_DIR` - Output directory for results
- `HADOOP_HOME` - Hadoop installation path (Windows only)
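The `embedding_model` and `batch_size` settings drive requests against Ollama's embeddings endpoint. A minimal sketch of one such request (assuming Ollama's standard `/api/embeddings` JSON API; response parsing is elided and would normally use a JSON library):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// POST one embedding request to Ollama and return the raw JSON response,
// which contains an "embedding" array of floats. Hypothetical helper.
def embed(text: String): String =
  val host = sys.env.getOrElse("OLLAMA_HOST", "http://127.0.0.1:11434")
  val json = s"""{"model":"mxbai-embed-large","prompt":"${text.replace("\"", "\\\"")}"}"""
  val req = HttpRequest.newBuilder(URI.create(s"$host/api/embeddings"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(json))
    .build()
  HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString()).body()
```

The `--rag` path talks to the same server, using the tinyllama model for answer generation.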
- SLF4J Warnings
  - Fixed by excluding conflicting logging dependencies
  - Run `sbt "runMain Main test logging"` to verify
- Hadoop Native Library Error (Windows)
  - Install winutils.exe and hadoop.dll
  - Set HADOOP_HOME environment variable
  - Use provided Windows-specific javaOptions
- OLLAMA_HOST Not Set
  - Set environment variable: `export OLLAMA_HOST=http://127.0.0.1:11434`
  - Ensure Ollama is running: `ollama serve`
- Wrong FS Error (EMR)
  - Use S3A paths: `s3a://bucket/path`
  - Ensure S3A configuration is set in Hadoop
- Java Version Mismatch
  - Ensure Java 17 is installed on EMR
  - Check JAVA_HOME is set correctly
# Check Java version
java -version
# Check Ollama status
curl http://localhost:11434/api/tags
# Check Hadoop configuration
hadoop version
# Test S3 connectivity
aws s3 ls s3://your-bucket/

- Batch Size: Adjust `batch_size` in application.conf for embedding requests
- Shards: Increase `shards` for larger datasets
- Window Size: Optimize `window_chars` and `overlap_chars` for your documents
- EMR Instance Types: Use larger instances for faster processing
- S3 Transfer: Use `aws s3 sync` for efficient file transfers
- `vocab.csv` - Token vocabulary
- `token_embeddings.csv` - Token embeddings
- `token_neighbors.csv` - Similar token pairs
- `index_shard_*` - Lucene search indexes
- `embed_similarity.csv` - Embedding evaluation results
- `embed_analogy.csv` - Word analogy test results
- Fork the repository
- Create a feature branch
- Make changes
- Run tests: `sbt "runMain Main test"`
- Submit a pull request
This project is part of CS441 coursework at the University of Illinois at Chicago.