LLM Inference with TinyLlama — Local Language Model Deployment

Deployed TinyLlama-1.1B-Chat locally using a Conda environment and llama.cpp bindings. Implemented text generation with quantized GGUF weights (Q4_K_M) for efficient inference, demonstrating environment setup, model handling, and practical lightweight LLM experimentation.

Problem. Experiment with running and fine-tuning lightweight large language models locally, without depending on cloud-based APIs.

Approach.

  • Set up a Conda environment for reproducibility and CUDA compatibility.
  • Loaded the quantized unsloth.Q4_K_M.gguf weights (GGUF format) using llama.cpp bindings.
  • Implemented a text generation pipeline in Jupyter, testing interactive prompts.
  • Explored Q4_K_M quantization for efficient inference on limited hardware.
  • Experimented with temperature, max_tokens, and context length for response variety (see the sketch after this list).
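
A minimal sketch of that generation loop with llama-cpp-python, assuming the weights sit under ./models; the context size, sampling values, and prompt here are illustrative, not the notebook's exact settings:

# load the quantized GGUF weights through the llama.cpp bindings
from llama_cpp import Llama

llm = Llama(
    model_path="./models/unsloth.Q4_K_M.gguf",  # Q4_K_M quantized weights
    n_ctx=2048,        # context length; larger windows cost more memory
    n_gpu_layers=-1,   # offload all layers to the GPU when CUDA is available
)

out = llm(
    "Q: Why quantize an LLM? A:",
    max_tokens=128,    # cap on generated tokens
    temperature=0.7,   # higher values increase response variety
    stop=["Q:"],       # stop before the model starts a new question
)
print(out["choices"][0]["text"])

Lowering temperature toward 0 makes replies more deterministic, which helps when comparing quantization or prompt settings.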

Tools & Environment.

🦙 Llama 3 Integration: this project uses the Llama 3 (Unsloth Quantized) model for local inference.

  • Model: unsloth.Q4_K_M.gguf
  • Size: ~4.9 GB
  • Framework: llama-cpp-python
  • Environment management: Conda (environment.yml)
  • Libraries: transformers, llama-cpp-python, torch, accelerate, sentencepiece
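
For reference, a hypothetical environment.yml along these lines would pin that stack; the actual file in the repository is authoritative:

# illustrative environment.yml (assumed contents, not the repo's exact file)
name: tinyllama
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - llama-cpp-python
      - transformers
      - torch
      - accelerate
      - sentencepiece
      - huggingface_hub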

Results (qualitative).

  • Successfully ran conversational inference locally with controllable generation parameters.
  • The quantized model balanced speed and coherence well on consumer-grade GPU/CPU hardware.
  • Demonstrated how small LLMs can power local chatbots or assistive apps.

What I Learned.

  • Model loading from GGUF snapshots via the Hugging Face Hub (see the sketch after this list).
  • Environment isolation using Conda for LLM workloads.
  • Balancing performance vs quality through quantization and prompt tuning.
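
A short sketch of that download step with huggingface_hub, using the repo id and filename from the Quick Start below (local_dir is an assumption):

from huggingface_hub import hf_hub_download

# fetch the quantized GGUF file from the Hub into ./models
model_path = hf_hub_download(
    repo_id="andriJulian/gguf_llama3_classification",
    filename="unsloth.Q4_K_M.gguf",
    local_dir="./models",
)
print(model_path)  # pass this path to Llama(model_path=...)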

Quick Start

# 1. clone the repository
git clone https://github.com/Joe-Naz01/llm_llama.git
cd llm_llama

# 2. create and activate the conda environment
conda env create -f environment.yml
conda activate tinyllama

# 3. download the quantized model
huggingface-cli download andriJulian/gguf_llama3_classification unsloth.Q4_K_M.gguf --local-dir ./models

# 4. launch the notebook
jupyter notebook llm_llama.ipynb
