Author: Carlos Gutierrez
Email: carlos.gutierrez@carg.dev
License: Dual License - Apache 2.0 (Research) + Commercial License (Commercial Use)
A modern language model implementation from scratch, incorporating insights from recent research papers.
SheepOp LLM is a comprehensive transformer-based language model implementation designed for:
- Research & Education: Understanding how large language models work from the ground up
- Custom Training: Training models on domain-specific data (PDFs, code, text files)
- Production Deployment: Optimized inference with KV caching and efficient attention mechanisms (a generic KV-caching sketch follows below)
- Multi-Format Data Processing: Support for various data types including PDFs, images (OCR), code files, and text
The project provides a complete toolkit for building, training, and deploying transformer language models with modern best practices.
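
To make the KV-caching mention above concrete, here is a minimal, framework-free sketch of cached single-head decoding. It is illustrative only: the names, shapes, and toy loop are assumptions for this example, not SheepOp's actual API (see Optimizations and Attention in docs/ for the real implementation).

```python
# Illustrative sketch of KV-cached autoregressive decoding (single attention head).
# All names and shapes are hypothetical; this is NOT SheepOp's API.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # q: (d,), K/V: (t, d) -> attention-weighted sum over all cached positions
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []             # grow by one row per generated token
x = rng.normal(size=(d,))             # hidden state of the current token
for step in range(4):
    K_cache.append(x @ Wk)            # compute K/V once per token and cache them,
    V_cache.append(x @ Wv)            # instead of re-encoding the whole prefix each step
    x = attend(x @ Wq, np.array(K_cache), np.array(V_cache))  # stand-in for the next state
print(x.shape)                        # (8,)
```

With the cache, each new token only computes its own key and value and attends over the stored prefix, instead of re-encoding the whole prefix at every decoding step.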
All detailed documentation is available in the docs/ folder:
- Complete Guide - Full project documentation with mathematical foundations, architecture, and usage
- Architecture - System architecture and design patterns
- Mathematics - Complete mathematical derivations for all components
- Embeddings - What embeddings are and how they work
- Attention - Attention mechanisms explained step-by-step
- Feed-Forward - Feed-forward networks explained
- Normalization - Layer normalization explained
- Neural Networks - Neural networks, neurons, and weights explained
- Training - What training is, why we need data, why more data is better, and how to interpret training metrics
- Optimization - Optimizers (AdamW, gradient descent) explained
- Scheduling - Learning rate scheduling explained
- Generation - Text generation and sampling strategies
- Data Processing - How data processing works step-by-step
- Multi-Format Data Guide - Working with PDFs, images, code files
- Data Guide - General data handling guide
- Database Extraction Guide - Extracting data from databases
- Repository Download Guide - Automatically downloading GitHub repositories for code training
- Control System Model - Mathematical control system formulation
- Optimizations - Performance optimizations
- Retraining Guide - How to retrain models
Q: How do I get started with this project?
A: See Complete Guide - Quick Start section
Q: What do I need to install?
A: See Complete Guide - Installation section
Q: How do I train my first model?
A: See Complete Guide - Usage section
Q: What are embeddings?
A: See Embeddings Explained
Q: How does attention work?
A: See Attention Explained
Q: What is a feed-forward network?
A: See Feed-Forward Explained
Q: Why do we need normalization?
A: See Normalization Explained
Q: How do neural networks work?
A: See Neural Network Explained
Q: What is a neuron and what are weights?
A: See Neural Network Explained
Q: What is training and why do we need it?
A: See Training Explained
Q: Why do we need data for training?
A: See Training Explained - Why Do We Need Data section
Q: Why is more data better?
A: See Training Explained - Why More Data is Better section
Q: How does the optimizer work?
A: See Optimization Explained
Q: What is learning rate scheduling?
A: See Scheduling Explained
Q: How does data processing work?
A: See Data Processing Explained
Q: Can I train on PDFs?
A: See Multi-Format Data Guide
Q: Can I train on images?
A: See Multi-Format Data Guide
Q: How do I process different file types?
A: See Data Processing Explained
Q: How do I download code repositories automatically?
A: See Repository Download Guide
Q: How does text generation work?
A: See Generation Explained
Q: What is temperature in generation?
A: See Generation Explained - Temperature section
Q: What is top-k and top-p sampling?
A: See Generation Explained - Top-k and Top-p sections
Q: What are the mathematical foundations?
A: See Mathematics or Complete Guide - Mathematical Foundations section
Q: How do I understand the complete mathematical model?
A: See Mathematics for step-by-step derivations
Q: Is there a control system perspective?
A: See Control System Model
Q: How is the architecture designed?
A: See Architecture
Q: What is the complete system flow?
A: See Complete Guide - Architecture Explained section
Q: How do I optimize inference?
A: See Optimizations
Q: How do I retrain a model?
A: See Retraining Guide
Q: How do I extract data from databases?
A: See Database Extraction Guide
Q: How do I download GitHub repositories for code training?
A: See Repository Download Guide
AdamW - Adam-based optimizer combining adaptive learning rates with decoupled weight decay. See Optimization Explained (sketch after this glossary)
Attention - Mechanism that determines how much each word should consider other words. See Attention Explained (sketch after this glossary)
Autoregressive - Generation method where the model uses its own previous outputs as inputs. See Generation Explained
Batch - Small group of examples processed together during training. See Training Explained
Bias - Constant added to weighted sum in neural networks. See Neural Network Explained
Backpropagation - Algorithm for computing gradients through the network. See Training Explained
Causal Masking - Prevents tokens from attending to future tokens. See Complete Guide
Cosine Annealing - Learning rate schedule that follows a cosine curve. See Scheduling Explained
Cross-Entropy Loss - Loss function for classification tasks; here it scores next-token predictions. See Mathematics (worked example after this glossary)
Data Processing - Transformation of raw files into training-ready text. See Data Processing Explained
Dropout - Regularization technique that randomly sets activations to zero. See Complete Guide
Decoder - Part of transformer that generates output. See Architecture
Embedding - Numerical representation of words/tokens. See Embeddings Explained
Epoch - One complete pass through the training data. See Training Explained
Evaluation - Process of measuring model performance. See Training Explained
Feed-Forward Network (FFN) - Two-layer neural network that transforms features. See Feed-Forward Explained
Forward Pass - Computing predictions from inputs through the model. See Neural Network Explained
GELU - Gaussian Error Linear Unit activation function. See Feed-Forward Explained
Generation - Process of creating new text from a trained model. See Generation Explained
Gradient - Derivative of loss with respect to parameters. See Optimization Explained
Gradient Clipping - Technique to prevent exploding gradients. See Complete Guide
Gradient Descent - Basic optimization algorithm. See Optimization Explained
Hidden State - Intermediate representation in the model. See Architecture
Layer Normalization - Normalization applied across the features of each token within a layer. See Normalization Explained
Learning Rate - Step size for weight updates. See Optimization Explained
Logits - Raw scores before applying softmax. See Generation Explained
Loss - Measure of prediction error. See Training Explained
Multi-Head Attention - Attention mechanism with multiple parallel heads. See Attention Explained
Momentum - Technique to accelerate gradient descent. See Optimization Explained
Neural Network - Computational model inspired by biological neurons. See Neural Network Explained
Neuron - Basic processing unit in neural networks. See Neural Network Explained
Normalization - Technique to standardize activations. See Normalization Explained
Nucleus Sampling (Top-p) - Sampling strategy keeping the smallest set of tokens whose cumulative probability reaches p. See Generation Explained (sketch after this glossary)
Optimization - Process of finding optimal weights. See Optimization Explained
Optimizer - Algorithm that updates model weights. See Optimization Explained
Overfitting - Model memorizes training data but doesn't generalize. See Training Explained
Perplexity - Measure of model uncertainty (exp(loss)). See Mathematics
Positional Encoding - Adds position information to embeddings. See Complete Guide
Pre-norm - Architecture where normalization comes before each sublayer. See Architecture (sketch after this glossary)
Probability Distribution - Distribution over possible next tokens. See Generation Explained
Query (Q) - One of three representations in attention (what am I looking for?). See Attention Explained
Residual Connection - Skip connection that adds input to output. See Architecture
Sampling - Process of selecting a token from probability distribution. See Generation Explained
Scheduling - Adjusting learning rate during training. See Scheduling Explained
Self-Attention - Attention mechanism where queries, keys, and values come from same input. See Attention Explained
Softmax - Function that converts logits to probabilities. See Generation Explained
Temperature - Parameter controlling randomness in sampling. See Generation Explained
Token - Basic unit of text (word or character). See Neural Network Explained
Tokenization - Process of converting text to tokens. See Data Processing Explained
Top-k Sampling - Sampling strategy keeping only top k tokens. See Generation Explained
Top-p Sampling - Another name for nucleus sampling. See Generation Explained
Transformer - Neural network architecture based on attention. See Architecture
Training - Process of teaching model to make predictions. See Training Explained
Value (V) - One of three representations in attention (what information do I contain?). See Attention Explained
Vocabulary - Set of all possible tokens. See Embeddings Explained
Weight - Parameter in neural network that controls connection strength. See Neural Network Explained
Weight Decay - Regularization technique that penalizes large weights. See Optimization Explained
Weight Matrix - Matrix containing all weights for a layer. See Neural Network Explained
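
The following sketches are minimal, self-contained NumPy illustrations of several glossary terms above. They are simplified teaching examples with made-up names and shapes, not SheepOp's implementation or API. First, single-head causal self-attention (Attention, Query, Value, Causal Masking, Softmax):

```python
# Illustrative single-head causal self-attention; not the project's code.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Q asks "what am I looking for?",
    # K offers something to match against, V carries the information.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                  # causal masking: no attending to future tokens
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
```

Multi-head attention simply runs several such heads in parallel on lower-dimensional projections and concatenates their outputs.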
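
Second, the pre-norm sublayer pattern x = x + Sublayer(LayerNorm(x)), shown here only for the feed-forward path (Feed-Forward Network, GELU, Layer Normalization, Pre-norm, Residual Connection); the attention sublayer is omitted, and the learnable norm parameters are dropped for brevity:

```python
# Illustrative pre-norm FFN sublayer with a residual connection; not the project's code.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)    # learnable scale/shift omitted

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2      # two-layer feed-forward network

rng = np.random.default_rng(0)
T, d, d_ff = 4, 8, 32
x = rng.normal(size=(T, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
x = x + ffn(layer_norm(x), W1, b1, W2, b2)  # pre-norm, then residual connection
print(x.shape)                              # (4, 8)
```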
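
Third, a tiny numeric example of cross-entropy loss and perplexity = exp(loss) on a toy three-token vocabulary (all values are made up for illustration):

```python
# Illustrative cross-entropy and perplexity on a toy example.
import numpy as np

def cross_entropy(logits, target):
    # logits: raw scores over the vocabulary; target: index of the correct next token
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax
    return -log_probs[target]

logits = np.array([3.0, 1.0, 0.5])
loss = cross_entropy(logits, target=0)
print(loss, np.exp(loss))   # loss and its perplexity (exp of the loss)
```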
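
Fourth, one AdamW update plus a cosine learning-rate schedule with linear warmup (AdamW, Cosine Annealing, Learning Rate, Weight Decay). The hyperparameter values are common example defaults, not the project's training configuration:

```python
# Illustrative AdamW step (decoupled weight decay) and cosine LR schedule with warmup.
import math
import numpy as np

def cosine_lr(step, max_steps, base_lr=3e-4, warmup=100, min_lr=3e-5):
    if step < warmup:                                # linear warmup
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(max_steps - warmup, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

def adamw_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad               # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2            # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                       # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)  # decoupled decay
    return w, m, v

w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 6):
    grad = 2 * w                                     # gradient of the toy loss sum(w**2)
    w, m, v = adamw_step(w, grad, m, v, t, lr=cosine_lr(t, max_steps=1000))
print(w)
```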
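
Finally, the sampling strategies used at generation time: temperature scaling, top-k filtering, and nucleus (top-p) filtering. Again an illustrative sketch, not the project's generation API:

```python
# Illustrative temperature / top-k / top-p (nucleus) sampling from a logit vector.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)  # temperature scaling
    if top_k is not None:                            # keep only the k highest logits
        cutoff = np.sort(logits)[-top_k]
        logits[logits < cutoff] = -np.inf
    probs = softmax(logits)
    if top_p is not None:                            # nucleus: smallest set with cum. prob >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept / kept.sum()
    return rng.choice(len(probs), p=probs)

print(sample([2.0, 1.0, 0.2, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```

Lower temperature sharpens the distribution (more deterministic), higher temperature flattens it; top-k and top-p both cut off the unreliable low-probability tail before sampling.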
- Complete Documentation: docs/COMPLETE_GUIDE.md
- Mathematical Foundations: docs/MATHEMATICS.md
- System Architecture: docs/ARCHITECTURE.md
- Control System Model: docs/CONTROL_SYSTEM_MODEL.md
This project is available under a dual license:
Free for:
- ✅ Academic research
- ✅ Educational purposes
- ✅ Personal projects
- ✅ Open source contributions
- ✅ Non-commercial use
Terms:
- Free to use, modify, and distribute
- Patent grant included
- Must include license and copyright notice
- Must state changes if modifying
Requires a commercial license for:
- ⚠️ Commercial products or services
- ⚠️ SaaS applications
- ⚠️ Revenue-generating applications
- ⚠️ Internal business use (for profit-making entities)
- ⚠️ Any use that generates profit or revenue
To obtain a commercial license:
Contact: carlos.gutierrez@carg.dev
Subject: Commercial License Inquiry - SheepOp
Please include:
- Intended use case
- Expected usage volume
- Company/Organization name
- Contact information
IMPORTANT: If you use this software in academic research or publications, you MUST cite this work. This is a condition of use for academic purposes.
Required Citation Format:
BibTeX:
```bibtex
@software{sheepop2024,
  title   = {SheepOp LLM: Transformer-based Language Model Implementation},
  author  = {Gutierrez, Carlos},
  year    = {2024},
  url     = {https://github.com/[your-username]/sheepOp},
  version = {1.0}
}
```
Text format:
Carlos Gutierrez. (2024). SheepOp LLM: Transformer-based Language Model Implementation. https://github.com/[your-username]/sheepOp
Note: Citation is required for academic use. Failure to cite constitutes a violation of the terms of use.
See LICENSE or LICENSE.txt for the full license text.
Carlos Gutierrez
Email: carlos.gutierrez@carg.dev
This README serves as an index to the comprehensive documentation available in the docs/ folder.