Author: Carlos Gutierrez
Email: carlos.gutierrez@carg.dev
License: Dual License - Apache 2.0 (Research) + Commercial License (Commercial Use)
A modern language model implementation from scratch, incorporating insights from recent research papers.
SheepOp LLM is a comprehensive transformer-based language model implementation designed for:
- Research & Education: Understanding how large language models work from the ground up
- Custom Training: Training models on domain-specific data (PDFs, code, text files)
- Production Deployment: Optimized inference with KV caching and efficient attention mechanisms (a generic KV-caching sketch follows below)
- Multi-Format Data Processing: Support for various data types including PDFs, images (OCR), code files, and text
The project provides a complete toolkit for building, training, and deploying transformer language models with modern best practices.
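
To make the KV-caching mention above concrete, here is a minimal, framework-free sketch of cached single-head decoding. It is illustrative only: the names, shapes, and toy loop are assumptions for this example, not SheepOp's actual API (see Optimizations and Attention in docs/ for the real implementation).

```python
# Illustrative sketch of KV-cached autoregressive decoding (single attention head).
# All names and shapes are hypothetical; this is NOT SheepOp's API.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # q: (d,), K/V: (t, d) -> attention-weighted sum over all cached positions
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []             # grow by one row per generated token
x = rng.normal(size=(d,))             # hidden state of the current token
for step in range(4):
    K_cache.append(x @ Wk)            # compute K/V once per token and cache them,
    V_cache.append(x @ Wv)            # instead of re-encoding the whole prefix each step
    x = attend(x @ Wq, np.array(K_cache), np.array(V_cache))  # stand-in for the next state
print(x.shape)                        # (8,)
```

With the cache, each new token only computes its own key and value and attends over the stored prefix, instead of re-encoding the whole prefix at every decoding step.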
All detailed documentation is available in the docs/ folder:
- Complete Guide - Full project documentation with mathematical foundations, architecture, and usage
- Architecture - System architecture and design patterns
- Mathematics - Complete mathematical derivations for all components
- Embeddings - What embeddings are and how they work
- Attention - Attention mechanisms explained step-by-step
- Feed-Forward - Feed-forward networks explained
- Normalization - Layer normalization explained
- Neural Networks - Neural networks, neurons, and weights explained
- Training - What training is, why we need data, why more data is better, and how to interpret training metrics
- Optimization - Optimizers (AdamW, gradient descent) explained
- Scheduling - Learning rate scheduling explained
- Generation - Text generation and sampling strategies
- Data Processing - How data processing works step-by-step
- Multi-Format Data Guide - Working with PDFs, images, code files
- Data Guide - General data handling guide
- Database Extraction Guide - Extracting data from databases
- Repository Download Guide - Automatically downloading GitHub repositories for code training
- Control System Model - Mathematical control system formulation
- Optimizations - Performance optimizations
- Retraining Guide - How to retrain models
Q: How do I get started with this project?
A: See Complete Guide - Quick Start section
Q: What do I need to install?
A: See Complete Guide - Installation section
Q: How do I train my first model?
A: See Complete Guide - Usage section
Q: What are embeddings?
A: See Embeddings Explained
Q: How does attention work?
A: See Attention Explained
Q: What is a feed-forward network?
A: See Feed-Forward Explained
Q: Why do we need normalization?
A: See Normalization Explained
Q: How do neural networks work?
A: See Neural Network Explained
Q: What is a neuron and what are weights?
A: See Neural Network Explained
Q: What is training and why do we need it?
A: See Training Explained
Q: Why do we need data for training?
A: See Training Explained - Why Do We Need Data section
Q: Why is more data better?
A: See Training Explained - Why More Data is Better section
Q: How does the optimizer work?
A: See Optimization Explained
Q: What is learning rate scheduling?
A: See Scheduling Explained
Q: How does data processing work?
A: See Data Processing Explained
Q: Can I train on PDFs?
A: See Multi-Format Data Guide
Q: Can I train on images?
A: See Multi-Format Data Guide
Q: How do I process different file types?
A: See Data Processing Explained
Q: How do I download code repositories automatically?
A: See Repository Download Guide
Q: How does text generation work?
A: See Generation Explained
Q: What is temperature in generation?
A: See Generation Explained - Temperature section
Q: What is top-k and top-p sampling?
A: See Generation Explained - Top-k and Top-p sections
Q: What are the mathematical foundations?
A: See Mathematics or Complete Guide - Mathematical Foundations section
Q: How do I understand the complete mathematical model?
A: See Mathematics for step-by-step derivations
Q: Is there a control system perspective?
A: See Control System Model
Q: How is the architecture designed?
A: See Architecture
Q: What is the complete system flow?
A: See Complete Guide - Architecture Explained section
Q: How do I optimize inference?
A: See Optimizations
Q: How do I retrain a model?
A: See Retraining Guide
Q: How do I extract data from databases?
A: See Database Extraction Guide
Q: How do I download GitHub repositories for code training?
A: See Repository Download Guide
AdamW - Adam-based optimizer combining adaptive learning rates with decoupled weight decay. See Optimization Explained (sketch after this glossary)
Attention - Mechanism that determines how much each word should consider other words. See Attention Explained (sketch after this glossary)
Autoregressive - Generation method where the model uses its own previous outputs as inputs. See Generation Explained
Batch - Small group of examples processed together during training. See Training Explained
Bias - Constant added to weighted sum in neural networks. See Neural Network Explained
Backpropagation - Algorithm for computing gradients through the network. See Training Explained
Causal Masking - Prevents tokens from attending to future tokens. See Complete Guide
Cosine Annealing - Learning rate schedule that follows a cosine curve. See Scheduling Explained
Cross-Entropy Loss - Loss function for classification tasks; here it scores next-token predictions. See Mathematics (worked example after this glossary)
Data Processing - Transformation of raw files into training-ready text. See Data Processing Explained
Dropout - Regularization technique that randomly sets activations to zero. See Complete Guide
Decoder - Part of transformer that generates output. See Architecture
Embedding - Numerical representation of words/tokens. See Embeddings Explained
Epoch - One complete pass through the training data. See Training Explained
Evaluation - Process of measuring model performance. See Training Explained
Feed-Forward Network (FFN) - Two-layer neural network that transforms features. See Feed-Forward Explained
Forward Pass - Computing predictions from inputs through the model. See Neural Network Explained
GELU - Gaussian Error Linear Unit activation function. See Feed-Forward Explained
Generation - Process of creating new text from a trained model. See Generation Explained
Gradient - Derivative of loss with respect to parameters. See Optimization Explained
Gradient Clipping - Technique to prevent exploding gradients. See Complete Guide
Gradient Descent - Basic optimization algorithm. See Optimization Explained
Hidden State - Intermediate representation in the model. See Architecture
Layer Normalization - Normalization applied across the features of each token within a layer. See Normalization Explained
Learning Rate - Step size for weight updates. See Optimization Explained
Logits - Raw scores before applying softmax. See Generation Explained
Loss - Measure of prediction error. See Training Explained
Multi-Head Attention - Attention mechanism with multiple parallel heads. See Attention Explained
Momentum - Technique to accelerate gradient descent. See Optimization Explained
Neural Network - Computational model inspired by biological neurons. See Neural Network Explained
Neuron - Basic processing unit in neural networks. See Neural Network Explained
Normalization - Technique to standardize activations. See Normalization Explained
Nucleus Sampling (Top-p) - Sampling strategy keeping the smallest set of tokens whose cumulative probability reaches p. See Generation Explained (sketch after this glossary)
Optimization - Process of finding optimal weights. See Optimization Explained
Optimizer - Algorithm that updates model weights. See Optimization Explained
Overfitting - Model memorizes training data but doesn't generalize. See Training Explained
Perplexity - Measure of model uncertainty (exp(loss)). See Mathematics
Positional Encoding - Adds position information to embeddings. See Complete Guide
Pre-norm - Architecture where normalization comes before each sublayer. See Architecture (sketch after this glossary)
Probability Distribution - Distribution over possible next tokens. See Generation Explained
Query (Q) - One of three representations in attention (what am I looking for?). See Attention Explained
Residual Connection - Skip connection that adds input to output. See Architecture
Sampling - Process of selecting a token from probability distribution. See Generation Explained
Scheduling - Adjusting learning rate during training. See Scheduling Explained
Self-Attention - Attention mechanism where queries, keys, and values come from same input. See Attention Explained
Softmax - Function that converts logits to probabilities. See Generation Explained
Temperature - Parameter controlling randomness in sampling. See Generation Explained
Token - Basic unit of text (word or character). See Neural Network Explained
Tokenization - Process of converting text to tokens. See Data Processing Explained
Top-k Sampling - Sampling strategy keeping only top k tokens. See Generation Explained
Top-p Sampling - Another name for nucleus sampling. See Generation Explained
Transformer - Neural network architecture based on attention. See Architecture
Training - Process of teaching model to make predictions. See Training Explained
Value (V) - One of three representations in attention (what information do I contain?). See Attention Explained
Vocabulary - Set of all possible tokens. See Embeddings Explained
Weight - Parameter in neural network that controls connection strength. See Neural Network Explained
Weight Decay - Regularization technique that penalizes large weights. See Optimization Explained
Weight Matrix - Matrix containing all weights for a layer. See Neural Network Explained
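
The following sketches are minimal, self-contained NumPy illustrations of several glossary terms above. They are simplified teaching examples with made-up names and shapes, not SheepOp's implementation or API. First, single-head causal self-attention (Attention, Query, Value, Causal Masking, Softmax):

```python
# Illustrative single-head causal self-attention; not the project's code.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Q asks "what am I looking for?",
    # K offers something to match against, V carries the information.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                  # causal masking: no attending to future tokens
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
```

Multi-head attention simply runs several such heads in parallel on lower-dimensional projections and concatenates their outputs.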
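
Second, the pre-norm sublayer pattern x = x + Sublayer(LayerNorm(x)), shown here only for the feed-forward path (Feed-Forward Network, GELU, Layer Normalization, Pre-norm, Residual Connection); the attention sublayer is omitted, and the learnable norm parameters are dropped for brevity:

```python
# Illustrative pre-norm FFN sublayer with a residual connection; not the project's code.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)    # learnable scale/shift omitted

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2      # two-layer feed-forward network

rng = np.random.default_rng(0)
T, d, d_ff = 4, 8, 32
x = rng.normal(size=(T, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
x = x + ffn(layer_norm(x), W1, b1, W2, b2)  # pre-norm, then residual connection
print(x.shape)                              # (4, 8)
```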
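
Third, a tiny numeric example of cross-entropy loss and perplexity = exp(loss) on a toy three-token vocabulary (all values are made up for illustration):

```python
# Illustrative cross-entropy and perplexity on a toy example.
import numpy as np

def cross_entropy(logits, target):
    # logits: raw scores over the vocabulary; target: index of the correct next token
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax
    return -log_probs[target]

logits = np.array([3.0, 1.0, 0.5])
loss = cross_entropy(logits, target=0)
print(loss, np.exp(loss))   # loss and its perplexity (exp of the loss)
```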
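
Fourth, one AdamW update plus a cosine learning-rate schedule with linear warmup (AdamW, Cosine Annealing, Learning Rate, Weight Decay). The hyperparameter values are common example defaults, not the project's training configuration:

```python
# Illustrative AdamW step (decoupled weight decay) and cosine LR schedule with warmup.
import math
import numpy as np

def cosine_lr(step, max_steps, base_lr=3e-4, warmup=100, min_lr=3e-5):
    if step < warmup:                                # linear warmup
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(max_steps - warmup, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

def adamw_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad               # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2            # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                       # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)  # decoupled decay
    return w, m, v

w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 6):
    grad = 2 * w                                     # gradient of the toy loss sum(w**2)
    w, m, v = adamw_step(w, grad, m, v, t, lr=cosine_lr(t, max_steps=1000))
print(w)
```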
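
Finally, the sampling strategies used at generation time: temperature scaling, top-k filtering, and nucleus (top-p) filtering. Again an illustrative sketch, not the project's generation API:

```python
# Illustrative temperature / top-k / top-p (nucleus) sampling from a logit vector.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)  # temperature scaling
    if top_k is not None:                            # keep only the k highest logits
        cutoff = np.sort(logits)[-top_k]
        logits[logits < cutoff] = -np.inf
    probs = softmax(logits)
    if top_p is not None:                            # nucleus: smallest set with cum. prob >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept / kept.sum()
    return rng.choice(len(probs), p=probs)

print(sample([2.0, 1.0, 0.2, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```

Lower temperature sharpens the distribution (more deterministic), higher temperature flattens it; top-k and top-p both cut off the unreliable low-probability tail before sampling.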
- Complete Documentation: docs/COMPLETE_GUIDE.md
- Mathematical Foundations: docs/MATHEMATICS.md
- System Architecture: docs/ARCHITECTURE.md
- Control System Model: docs/CONTROL_SYSTEM_MODEL.md
This project is available under a dual license:
Free for:
- ✅ Academic research
- ✅ Educational purposes
- ✅ Personal projects
- ✅ Open source contributions
- ✅ Non-commercial use
Terms:
- Free to use, modify, and distribute
- Patent grant included
- Must include license and copyright notice
- Must state changes if modifying
Requires a commercial license for:
- ⚠️ Commercial products or services
- ⚠️ SaaS applications
- ⚠️ Revenue-generating applications
- ⚠️ Internal business use (for profit-making entities)
- ⚠️ Any use that generates profit or revenue
To obtain a commercial license:
Contact: carlos.gutierrez@carg.dev
Subject: Commercial License Inquiry - SheepOp
Please include:
- Intended use case
- Expected usage volume
- Company/Organization name
- Contact information
IMPORTANT: If you use this software in academic research or publications, you MUST cite this work. This is a condition of use for academic purposes.
Required Citation Format:
BibTeX:
```bibtex
@software{sheepop2024,
  title   = {SheepOp LLM: Transformer-based Language Model Implementation},
  author  = {Gutierrez, Carlos},
  year    = {2024},
  url     = {https://github.com/[your-username]/sheepOp},
  version = {1.0}
}
```
Text format:
Carlos Gutierrez. (2024). SheepOp LLM: Transformer-based Language Model Implementation. https://github.com/[your-username]/sheepOp
Note: Citation is required for academic use. Failure to cite constitutes a violation of the terms of use.
See LICENSE or LICENSE.txt for the full license text.
Carlos Gutierrez
Email: carlos.gutierrez@carg.dev
This README serves as an index to the comprehensive documentation available in the docs/ folder.