Problem.
Experiment with running and fine-tuning lightweight large language models locally, without depending on cloud-based APIs.
Approach.
- Set up a Conda environment for reproducibility and CUDA compatibility.
- Loaded unsloth.Q4_K_M.gguf in GGUF format using llama.cpp bindings (llama-cpp-python).
- Implemented a text generation pipeline within Jupyter, testing interactive prompts (see the sketch after this list).
- Explored model quantization (Q4_K_M) for efficient inference on limited hardware.
- Experimented with temperature, max_tokens, and context length for response variety.
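A minimal sketch of this pipeline with llama-cpp-python, assuming the model file sits at ./models/unsloth.Q4_K_M.gguf (the location used by the download command at the end of this page); the context length, GPU offload setting, and sampling values are illustrative, not the notebook's exact configuration.

from llama_cpp import Llama

# load the quantized GGUF model; n_ctx sets the context window,
# n_gpu_layers offloads layers to the GPU when CUDA is available
llm = Llama(
    model_path="./models/unsloth.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)

# interactive-style generation with tunable sampling parameters
output = llm(
    "Q: What can a small local LLM be used for? A:",
    max_tokens=128,
    temperature=0.7,
    stop=["Q:"],
)
print(output["choices"][0]["text"])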
Tools & Environment.
This project uses the Llama 3 (Unsloth Quantized) model for local inference.
- Model: unsloth.Q4_K_M.gguf
- Size: ~4.9 GB
- Framework: llama-cpp-python
- Environment management: Conda (environment.yml)
- Libraries: transformers, llama-cpp-python, torch, accelerate, sentencepiece (a quick import check is sketched after this list)
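A short sanity check, run inside the activated Conda environment, confirms that the key libraries resolve and whether CUDA is visible; this is an illustrative snippet, not part of the notebook.

import torch
import llama_cpp

# verify the main inference libraries import inside the tinyllama environment
print("llama-cpp-python:", llama_cpp.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())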
Results (qualitative).
- Successfully ran conversational inference locally with controllable generation parameters.
- Quantized model balanced speed and coherence well on consumer-grade GPU/CPU.
- Demonstrated how small LLMs can power local chatbots or assistive apps.
What I Learned.
- Model loading from GGUF snapshots via Hugging Face Hub.
- Environment isolation using Conda for LLM workloads.
- Balancing performance vs quality through quantization and prompt tuning.
# 1. clone the repository (environment.yml lives inside it)
git clone https://github.com/Joe-Naz01/llm_llama.git
cd llm_llama

# 2. create and activate the conda environment
conda env create -f environment.yml
conda activate tinyllama

# 3. download the quantized model
huggingface-cli download andriJulian/gguf_llama3_classification unsloth.Q4_K_M.gguf --local-dir ./models

# 4. launch the notebook
jupyter notebook llm_llama.ipynb
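As an alternative to the huggingface-cli step, the same file can be fetched from inside Python with the Hugging Face Hub client; this is a minimal sketch that reuses the repo ID, filename, and target directory from the command above.

from huggingface_hub import hf_hub_download

# download the same GGUF snapshot programmatically
model_path = hf_hub_download(
    repo_id="andriJulian/gguf_llama3_classification",
    filename="unsloth.Q4_K_M.gguf",
    local_dir="./models",
)
print(model_path)  # the notebook can then load the model from this path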