A Scala-based Retrieval-Augmented Generation (RAG) system that builds semantic search capabilities over PDF documents using Apache Lucene, Ollama embeddings, and Hadoop MapReduce.
YouTube link: https://youtu.be/wxwdUuiz1A8
Results Drive link: https://drive.google.com/drive/folders/1PGKfexsTVAxPAamOpNQBSRuVVNHcEUVp?usp=sharing
- PDF Processing: Extract text from PDFs using PDFBox
- Text Chunking: Split documents into overlapping windows (see the sketch after this list)
- Vocabulary Building: Create token vocabulary using MapReduce
- Embeddings: Generate vector embeddings using Ollama (mxbai-embed-large)
- Indexing: Build Lucene search index with vector similarity
- Search: Semantic search with optional RAG pipeline using Ollama (tinyllama)
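The chunking step slides a fixed-size character window over each document with an overlap, so content near a chunk boundary appears in two chunks. A minimal sketch of the idea (illustrative code, not the project's actual classes; the defaults mirror the `window_chars`/`overlap_chars` settings in application.conf):

```scala
// Overlapping-window chunking sketch: each window starts (window - overlap)
// characters after the previous one, so adjacent chunks share `overlap` chars.
def chunk(text: String, window: Int = 1400, overlap: Int = 250): Seq[String] =
  val step = window - overlap
  (0 until text.length by step)
    .map(start => text.substring(start, math.min(start + window, text.length)))

// e.g. chunk("a" * 3000) yields chunks of 1400, 1400, and 700 characters,
// starting at offsets 0, 1150, and 2300.
```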
- Java 17+
- SBT (Scala Build Tool)
- Ollama installed and running
- Models: mxbai-embed-large, tinyllama
- AWS CLI configured
- EMR cluster with Java 17
- Ollama installed on EMR nodes
- S3 bucket for data storage
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama
ollama serve
# Pull required models
ollama pull mxbai-embed-large
ollama pull tinyllama

# Set Ollama host
export OLLAMA_HOST=http://127.0.0.1:11434
# For Windows users, also set Hadoop paths
export HADOOP_HOME=C:\hadoop\hadoop-3.3.6
export JAVA_LIBRARY_PATH=C:\hadoop\hadoop-3.3.6\bin

# Create input directory
mkdir MSR
# Add PDF files to MSR directory
# Copy your PDF files to MSR/ directory

Option 1: Manual Creation
# Create filelist.txt with PDF paths
echo "MSR/document1.pdf" > filelist.txt
echo "MSR/document2.pdf" >> filelist.txt
echo "MSR/document3.pdf" >> filelist.txtOption 2: Automatic Generation
# Generate filelist.txt from all PDFs in MSR directory
find MSR -name "*.pdf" | sort > filelist.txt
# Or on Windows PowerShell:
Get-ChildItem MSR -Filter "*.pdf" | ForEach-Object { $_.FullName } | Sort-Object > filelist.txt

Option 3: Using Absolute Paths
# Generate with absolute paths (recommended for Windows)
find "$(pwd)/MSR" -name "*.pdf" | sort > filelist.txt
# Or on Windows PowerShell:
Get-ChildItem MSR -Filter "*.pdf" | ForEach-Object { (Resolve-Path $_.FullName).Path } | Sort-Object > filelist.txt

Example filelist.txt content:
MSR/paper1.pdf
MSR/paper2.pdf
MSR/paper3.pdf
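The `<linesPerSplit>` argument of the `vocab` command (the `20` in the examples below) controls how many filelist lines each MapReduce split receives, in the style of Hadoop's NLineInputFormat. A sketch of that grouping (illustrative helper, assuming one PDF path per line):

```scala
import scala.io.Source

// Group filelist lines into fixed-size splits; each group would become
// roughly one map task's worth of PDFs.
def splits(filelist: String, linesPerSplit: Int = 20): Seq[Seq[String]] =
  val src = Source.fromFile(filelist)
  try src.getLines().map(_.trim).filter(_.nonEmpty).toSeq.grouped(linesPerSplit).toSeq
  finally src.close()
```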
# Compile
sbt compile
# Run tests
sbt "runMain Main test"
# Build vocabulary
sbt "runMain Main vocab filelist.txt out/vocab-mr 20"
# Build embeddings
sbt "runMain Main tokenEmbed out/vocab-mr/vocab.csv out/token-embeddings.csv"
# Build Lucene index
sbt "runMain Main hadoop filelist.txt out/mr-output"
# Search
sbt "runMain Main search --q \"What are bandits?\" --index out/index_shard_0"
# RAG search
sbt "runMain Main search --q \"What are bandits?\" --index out/index_shard_0 --rag"Create aws-bootstrap.sh:
#!/bin/bash
set -x
set -e
log() { echo "[bootstrap] $*"; }
echo " Starting RAG Builder EMR Bootstrap"
# System updates and tools (install jq before it is used below)
sudo yum clean all -y || true
sudo yum install -y jq git curl tar gzip || true
# Detect master vs core/task
ROLE=$(jq -r '.instanceRole' /mnt/var/lib/info/instance.json 2>/dev/null || echo unknown)
IS_MASTER=false
if [ "$ROLE" = "MASTER" ] || [ "$ROLE" = "Head" ]; then
  IS_MASTER=true
fi
log "Instance role: $ROLE (is_master=$IS_MASTER)"
# Set UTF-8 defaults
sudo tee /etc/profile.d/utf8.sh >/dev/null <<'EOF'
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
EOF
source /etc/profile.d/utf8.sh || true
# Install Java 17 (Amazon Corretto)
echo "--- Installing Amazon Corretto 17 ---"
sudo rpm --import https://yum.corretto.aws/corretto.key || true
sudo curl -sLo /etc/yum.repos.d/corretto.repo https://yum.corretto.aws/corretto.repo || true
sudo yum install -y java-17-amazon-corretto-devel || sudo yum install -y java-17-amazon-corretto || true
# Set JAVA_HOME
JAVA_17_HOME="/usr/lib/jvm/java-17-amazon-corretto.x86_64"
if [ -d "$JAVA_17_HOME" ]; then
echo "export JAVA_HOME=$JAVA_17_HOME" | sudo tee -a /etc/profile.d/java.sh
echo "export PATH=\$JAVA_HOME/bin:\$PATH" | sudo tee -a /etc/profile.d/java.sh
source /etc/profile.d/java.sh
log "Java 17 installed at: $JAVA_17_HOME"
fi
# Install Ollama on ALL nodes
echo " Installing Ollama on all nodes..."
curl -fsSL https://ollama.com/install.sh | sh
# Create ollama service user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama || true
# Configure Ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_MODELS=/mnt/ollama"
EOF
# Start Ollama service
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
# Wait for Ollama to be ready
echo " Waiting for Ollama to start..."
for i in {1..30}; do
if curl -s http://localhost:11434/api/tags > /dev/null; then
echo " Ollama is ready!"
break
fi
echo "Waiting for Ollama... ($i/30)"
sleep 2
done
# Pull required models
echo " Pulling required Ollama models..."
sudo -u ollama ollama pull tinyllama &
sudo -u ollama ollama pull mxbai-embed-large &
wait
echo " Ollama models pulled successfully"
# Set environment variables
sudo tee /etc/profile.d/ollama.sh >/dev/null <<'EOF'
export OLLAMA_HOST=${OLLAMA_HOST:-http://127.0.0.1:11434}
export OLLAMA_MODELS=${OLLAMA_MODELS:-/mnt/ollama}
export LANG=${LANG:-en_US.UTF-8}
export LC_ALL=${LC_ALL:-en_US.UTF-8}
EOF
log "Bootstrap completed successfully!"# Upload PDFs to S3
aws s3 sync MSR/ s3://your-bucket/MSRCorpus/

Option 1: Generate from S3 Contents
# Create filelist with S3 paths
# --recursive already prints full keys (including MSRCorpus/), so prepend only the bucket
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3://your-bucket/" $4}' > filelist-s3.txt
# Upload to S3
aws s3 cp filelist-s3.txt s3://your-bucket/input/filelist.txt

Option 2: Manual S3 filelist Creation
# Create filelist.txt manually with S3 paths
cat > filelist-s3.txt << EOF
s3://your-bucket/MSRCorpus/paper1.pdf
s3://your-bucket/MSRCorpus/paper2.pdf
s3://your-bucket/MSRCorpus/paper3.pdf
EOF
# Upload to S3
aws s3 cp filelist-s3.txt s3://your-bucket/input/filelist.txt

Option 3: Using S3A Paths (Alternative)
# Create filelist with S3A paths (if preferred)
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3a://your-bucket/" $4}' > filelist-s3a.txt
aws s3 cp filelist-s3a.txt s3://your-bucket/input/filelist.txt

Option 4: Generate on EMR Master
# SSH into EMR master and generate filelist
ssh -i your-key.pem hadoop@your-emr-master-ip
# Generate filelist from S3
aws s3 ls s3://your-bucket/MSRCorpus/ --recursive | awk '{print "s3://your-bucket/" $4}' > filelist.txt
# Verify contents
head -5 filelist.txt

Example S3 filelist.txt content:
s3://your-bucket/MSRCorpus/1083142.1083143.pdf
s3://your-bucket/MSRCorpus/1083144.1083145.pdf
s3://your-bucket/MSRCorpus/1083146.1083147.pdf
Local filelist generator (create-local-filelist.sh):
#!/bin/bash
# Create filelist.txt for local development
INPUT_DIR="MSR"
OUTPUT_FILE="filelist.txt"
if [ ! -d "$INPUT_DIR" ]; then
  echo "Error: Directory $INPUT_DIR does not exist"
  exit 1
fi
echo "Creating filelist.txt from $INPUT_DIR directory..."
find "$INPUT_DIR" -name "*.pdf" | sort > "$OUTPUT_FILE"
echo "Generated $OUTPUT_FILE with $(wc -l < "$OUTPUT_FILE") PDF files"
echo "First 5 files:"
head -5 "$OUTPUT_FILE"S3 filelist generator (create-s3-filelist.sh):
#!/bin/bash
# Create filelist.txt for EMR/AWS
BUCKET_NAME="your-bucket"
S3_PREFIX="MSRCorpus"
OUTPUT_FILE="filelist-s3.txt"
echo "Creating S3 filelist from s3://$BUCKET_NAME/$S3_PREFIX/..."
aws s3 ls "s3://$BUCKET_NAME/$S3_PREFIX/" --recursive | \
awk '{print "s3://'$BUCKET_NAME'/'$S3_PREFIX'/" $4}' > "$OUTPUT_FILE"
echo "Generated $OUTPUT_FILE with $(wc -l < "$OUTPUT_FILE") PDF files"
echo "First 5 files:"
head -5 "$OUTPUT_FILE"
# Upload to S3
echo "Uploading to S3..."
aws s3 cp "$OUTPUT_FILE" "s3://$BUCKET_NAME/input/filelist.txt"
echo " Uploaded to s3://$BUCKET_NAME/input/filelist.txt"Windows PowerShell filelist generator (create-filelist.ps1):
# Create filelist.txt for local development on Windows
$InputDir = "MSR"
$OutputFile = "filelist.txt"
if (-not (Test-Path $InputDir)) {
    Write-Error "Directory $InputDir does not exist"
    exit 1
}
Write-Host "Creating filelist.txt from $InputDir directory..."
Get-ChildItem $InputDir -Filter "*.pdf" -Recurse |
ForEach-Object { $_.FullName } |
Sort-Object |
Out-File -FilePath $OutputFile -Encoding UTF8
$fileCount = (Get-Content $OutputFile).Count
Write-Host "Generated $OutputFile with $fileCount PDF files"
Write-Host "First 5 files:"
Get-Content $OutputFile | Select-Object -First 5

# Upload JAR
aws s3 cp target/scala-3.5.1/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar s3://your-bucket/jars/

- Go to EMR Console: Navigate to AWS EMR in the AWS Management Console
- Create Cluster: Click "Create cluster"
- Configure Cluster:
  - Cluster name: RAG-System-Cluster
  - Release: emr-7.0.0
  - Applications: Select Hadoop
- Instance Configuration:
  - Instance type: m5.2xlarge
  - Number of instances: 3 (1 master + 2 core nodes)
- Bootstrap Actions:
  - Bootstrap action: Custom action
  - Script location: s3://your-bucket/bootstrap/aws-bootstrap.sh
- Logging:
  - Log URI: s3://your-bucket/logs/
- IAM Roles: Use default EMR roles (or create custom ones)
- Create Cluster: Click "Create cluster"
aws emr create-cluster \
--name "RAG-System-Cluster" \
--release-label emr-7.0.0 \
--instance-type m5.2xlarge \
--instance-count 3 \
--applications Name=Hadoop \
--bootstrap-actions Path=s3://your-bucket/bootstrap/aws-bootstrap.sh \
--log-uri s3://your-bucket/logs/ \
--use-default-roles

- Go to EMR Console: Navigate to your cluster
- Add Steps: Click "Add step" button
- Configure Step:
  - Step type: Custom JAR
  - JAR location: s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar
  - Main class: Leave blank (uses Main class)
  - Arguments: vocab s3://your-bucket/input/filelist.txt s3://your-bucket/out/vocab-mr 20
  - Action on failure: Continue
- Add Step: Click "Add step"
- Add Another Step: Click "Add step" again
- Configure Step:
  - Step type: Custom JAR
  - JAR location: s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar
  - Main class: Leave blank
  - Arguments: hadoop s3://your-bucket/input/filelist.txt s3://your-bucket/out/mr-output s3://your-bucket/out/lucene-shards
  - Action on failure: Continue
- Add Step: Click "Add step"
# Get cluster ID
CLUSTER_ID=$(aws emr list-clusters --active --query 'Clusters[0].Id' --output text)
# Build vocabulary
aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps Type=CUSTOM_JAR,Name=BuildVocab,ActionOnFailure=CONTINUE,Jar=s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar,Args=["vocab","s3://your-bucket/input/filelist.txt","s3://your-bucket/out/vocab-mr","20"]
# Build Lucene index
aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps Type=CUSTOM_JAR,Name=BuildIndex,ActionOnFailure=CONTINUE,Jar=s3://your-bucket/jars/cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar,Args=["hadoop","s3://your-bucket/input/filelist.txt","s3://your-bucket/out/mr-output","s3://your-bucket/out/lucene-shards"]

SSH into the EMR master and run:
# Download files locally
aws s3 cp s3://your-bucket/out/vocab-mr/vocab.csv .
aws s3 cp s3://your-bucket/out/token-embeddings.csv .
# Build token embeddings
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar tokenEmbed vocab.csv token-embeddings.csv
# Find token neighbors
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar tokenNeighbors --in token-embeddings.csv --vocab vocab.csv --top 2000 --k 10
# Download Lucene shards
aws s3 sync s3://your-bucket/out/lucene-shards/ lucene-shards/
# Run search
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar search --q "What are bandits?" --index lucene-shards/
# Run RAG search
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar search --q "What are bandits?" --index lucene-shards/ --rag
# Run tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test

# Run all tests
sbt "runMain Main test"
# Run specific tests
sbt "runMain Main test logging"
sbt "runMain Main test chunker"
sbt "runMain Main test tokenizer"
sbt "runMain Main test vectors"
sbt "runMain Main test embeddecoder"# Run all tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test
# Run specific tests
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test logging
java -jar cs441-hw1-rag-assembly-0.1.0-SNAPSHOT.jar test chunker

Available commands:
- `vocab <filelist> <output> <linesPerSplit>` - Build vocabulary using MapReduce
- `hadoop <filelist> <output> [lucene_output_dir]` - Build Lucene index and embeddings
- `tokenEmbed <vocab> <output>` - Generate token embeddings
- `tokenNeighbors <embeddings> <vocab> <output> [options]` - Find similar tokens
- `search --q "question" [options]` - Semantic search
- `embedEval <queries> <output>` - Evaluate embedding quality
- `test [testName]` - Run tests
--q "question"- Search query--index path- Lucene index path--k number- Number of results (default: 5)--rag- Enable RAG pipeline with chat completion
tokenNeighbors options:
- `--in <file>` - Input embeddings file
- `--vocab <file>` - Vocabulary file
- `--top N` - Top N tokens to process
- `--k N` - K nearest neighbors
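Conceptually, tokenNeighbors is brute-force cosine KNN over the token-embedding table: score every vocabulary vector against the query vector and keep the top k. A small sketch of that arithmetic (illustrative; the real tool reads the embeddings CSV):

```scala
// Cosine similarity between two dense vectors.
def cosine(a: Array[Float], b: Array[Float]): Double =
  val dot = a.lazyZip(b).map(_ * _).sum
  val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if norm == 0 then 0.0 else dot / norm

// Rank every token by similarity to the query vector and keep the top k.
def neighbors(query: Array[Float], embeddings: Map[String, Array[Float]],
              k: Int): Seq[(String, Double)] =
  embeddings.toSeq
    .map((tok, vec) => tok -> cosine(query, vec))
    .sortBy(-_._2)
    .take(k)
```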
app {
pdf_dir = "MSR"
out_dir = "out"
out_dir = ${?OUT_DIR}
shards = 4
window_chars = 1400
overlap_chars = 250
embedding_model = "mxbai-embed-large"
similarity = "cosine"
topk_neighbors = 5
batch_size = 64
}

Environment variables:
- `OLLAMA_HOST` - Ollama server URL (default: http://127.0.0.1:11434)
- `OUT_DIR` - Output directory for results
- `HADOOP_HOME` - Hadoop installation path (Windows only)
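The `embedding_model` and `batch_size` settings drive requests against Ollama's embeddings endpoint. A minimal sketch of one such request (assuming Ollama's standard `/api/embeddings` JSON API; response parsing is elided and would normally use a JSON library):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// POST one embedding request to Ollama and return the raw JSON response,
// which contains an "embedding" array of floats. Hypothetical helper.
def embed(text: String): String =
  val host = sys.env.getOrElse("OLLAMA_HOST", "http://127.0.0.1:11434")
  val json = s"""{"model":"mxbai-embed-large","prompt":"${text.replace("\"", "\\\"")}"}"""
  val req = HttpRequest.newBuilder(URI.create(s"$host/api/embeddings"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(json))
    .build()
  HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString()).body()
```

The `--rag` path talks to the same server, using the tinyllama model for answer generation.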
- SLF4J Warnings
  - Fixed by excluding conflicting logging dependencies
  - Run `sbt "runMain Main test logging"` to verify
- Hadoop Native Library Error (Windows)
  - Install winutils.exe and hadoop.dll
  - Set HADOOP_HOME environment variable
  - Use provided Windows-specific javaOptions
- OLLAMA_HOST Not Set
  - Set environment variable: `export OLLAMA_HOST=http://127.0.0.1:11434`
  - Ensure Ollama is running: `ollama serve`
- Wrong FS Error (EMR)
  - Use S3A paths: `s3a://bucket/path`
  - Ensure S3A configuration is set in Hadoop
- Java Version Mismatch
  - Ensure Java 17 is installed on EMR
  - Check JAVA_HOME is set correctly
# Check Java version
java -version
# Check Ollama status
curl http://localhost:11434/api/tags
# Check Hadoop configuration
hadoop version
# Test S3 connectivity
aws s3 ls s3://your-bucket/

- Batch Size: Adjust `batch_size` in application.conf for embedding requests
- Shards: Increase `shards` for larger datasets
- Window Size: Optimize `window_chars` and `overlap_chars` for your documents
- EMR Instance Types: Use larger instances for faster processing
- S3 Transfer: Use `aws s3 sync` for efficient file transfers
- `vocab.csv` - Token vocabulary
- `token_embeddings.csv` - Token embeddings
- `token_neighbors.csv` - Similar token pairs
- `index_shard_*` - Lucene search indexes
- `embed_similarity.csv` - Embedding evaluation results
- `embed_analogy.csv` - Word analogy test results
- Fork the repository
- Create a feature branch
- Make changes
- Run tests: `sbt "runMain Main test"`
- Submit a pull request
This project is part of CS441 coursework at the University of Illinois at Chicago.