This midterm project explores the capabilities and limitations of open-source Large Language Models (LLMs) running locally via Ollama. The project focuses on:
- Basic model exploration across four types of tasks.
- Focused experimentation with prompt engineering techniques.

To ensure reproducibility and avoid environment issues, all experiments were conducted inside Docker containers.
```bash
# Clone the repository
git clone https://github.com/Gitlio11/CS5393-Midterm
cd CS5393-Midterm

# Build and start the Docker containers
docker-compose up
```
In a separate terminal:
```bash
# Run one of the available models
ollama run llama2
ollama run mistral
ollama run tinyllama
```
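The same models can also be queried programmatically through Ollama's HTTP API, which is how prompt runs can be automated from `app/main.py`. Below is a minimal sketch of such a call; it assumes the compose file maps Ollama's default port 11434 to the host and that `requests` is available (e.g., via `app/requirements.txt`). The `generate` helper name is illustrative, not taken from the repo.

```python
import requests

# Ollama's default local endpoint (assumes port 11434 is mapped to the host)
OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(model: str, prompt: str) -> str:
    """Send one prompt to a locally running Ollama model and return the full reply."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # the first call can be slow while the model loads into memory
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(generate("tinyllama", "What is the capital of Sweden?"))
```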
```
CS5393-Midterm/
├── app/
│   ├── main.py
│   ├── requirements.txt
│   └── results/          # Experiment results, organized by technique and model
├── model_outputs/        # Sample model responses
│   ├── llama2/
│   ├── mistral/
│   └── tinyllama/
├── report/               # Final analysis and report
├── docker-compose.yml
├── Dockerfile
└── README.md
```
This project tests four different prompt engineering techniques across three models:
- Llama2: Meta's open-source LLM known for general-purpose capabilities
- Mistral: A newer model with strong reasoning capabilities
- TinyLlama: A smaller, more efficient model
Zero-shot prompting (direct questions without examples):
- "What is the capital of Sweden?"
- "Explain quantum entanglement in simple terms."
- "How do you calculate compound interest?"
Few-shot prompting (questions preceded by example Q&A pairs that guide the model):

```
Question: What is the capital of France?
Answer: Paris

Question: What is the capital of Japan?
Answer: Tokyo

Question: What is the capital of Brazil?
Answer: Brasília

Question: What is the capital of Sweden?
Answer:
```
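A few-shot prompt like the one above can be assembled mechanically from a list of example pairs. The sketch below shows one way to do it; `few_shot_prompt` is an illustrative helper, not necessarily how `app/main.py` implements it.

```python
# Few-shot: worked Q&A pairs are prepended so the model infers the expected format.
FEW_SHOT_EXAMPLES = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
    ("What is the capital of Brazil?", "Brasília"),
]

def few_shot_prompt(question: str) -> str:
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in FEW_SHOT_EXAMPLES]
    blocks.append(f"Question: {question}\nAnswer:")  # trailing "Answer:" cues the reply
    return "\n\n".join(blocks)

print(generate("mistral", few_shot_prompt("What is the capital of Sweden?")))
```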
Chain-of-thought prompting (questions that encourage step-by-step reasoning):
- "If I have 5 apples and give 2 to my friend, then buy 3 more and eat 1, how many apples do I have left? Let's think step-by-step."
- "A train travels at 60 mph. How far will it travel in 2.5 hours? Let's think step-by-step."
- "If a shirt costs $25 and is on sale for 20% off, then there's an additional 10% discount at checkout, what is the final price? Let's think step-by-step."
Self-consistency (running the same reasoning questions multiple times to check whether the answers agree):
- "What is 15 × 27? Think carefully and solve this step-by-step."
- "If today is Tuesday, what day will it be after 19 days? Think carefully and solve this step-by-step."
- "John has twice as many marbles as Tom. Tom has 5 fewer marbles than Sarah. Sarah has 15 marbles. How many marbles does John have? Think carefully and solve this step-by-step."
The experiment results are stored in the app/results/ directory, organized by technique and model. A comprehensive analysis can be found in the report/ollama-report.md file.
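A record-keeping helper along these lines is enough to produce that layout; the JSONL file naming below is illustrative, and the actual files live under `app/results/`.

```python
import json
from pathlib import Path

def save_result(technique: str, model: str, prompt: str, reply: str) -> None:
    """Append one experiment record under app/results/<technique>/<model>.jsonl."""
    out_dir = Path("app/results") / technique
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{model}.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"prompt": prompt, "reply": reply}) + "\n")
```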
Key observations:
- Smaller models answered noticeably faster
- Larger models were generally more accurate
- On some questions, all three models answered incorrectly
- Docker and Docker Compose
- Git
- At least 8GB of RAM for running the models
- Approximately 10GB of disk space for model storage
The project uses models that are downloaded automatically via Ollama on first run. Approximate sizes:
- Llama2: ~3.8GB
- Mistral: ~4.1GB
- TinyLlama: ~1.1GB

Together the three models total roughly 9GB, which is why ~10GB of free disk space is recommended.
- Models running locally are more limited than cloud-based LLMs
- The first inference can be slow while a model loads into memory
Future improvements could include:
- Expanding to include additional open-source models
- Quantitative analysis of response quality