SheepOp LLM 🐑➡️🤖

Author: Carlos Gutierrez
Email: carlos.gutierrez@carg.dev
License: Dual License - Apache 2.0 (Research) + Commercial License (Commercial Use)

A modern language model implementation from scratch, incorporating insights from recent research papers.


Purpose of the Project

SheepOp LLM is a comprehensive transformer-based language model implementation designed for:

  • Research & Education: Understanding how large language models work from the ground up
  • Custom Training: Training models on domain-specific data (PDFs, code, text files)
  • Production Deployment: Optimized inference with KV caching and efficient attention mechanisms
  • Multi-Format Data Processing: Support for various data types including PDFs, images (OCR), code files, and text

The project provides a complete toolkit for building, training, and deploying transformer language models with modern best practices.


Documentation Index

All detailed documentation is available in the docs/ folder:

Core Concepts

  • Complete Guide - Full project documentation with mathematical foundations, architecture, and usage
  • Architecture - System architecture and design patterns
  • Mathematics - Complete mathematical derivations for all components

Component Explanations

  • Embeddings - How tokens are represented as vectors
  • Attention - How the attention mechanism works
  • Feed-Forward - What feed-forward networks do and why they are needed
  • Normalization - Why normalization is needed and how it works
  • Neural Network - Neurons, weights, and how networks compute

Training & Optimization

  • Training - What is training, why we need data, why more data is better, and how to interpret training metrics
  • Optimization - Optimizers (AdamW, gradient descent) explained
  • Scheduling - Learning rate scheduling explained
  • Generation - Text generation and sampling strategies

Data & Processing

Advanced Topics

  • Optimizations - Inference optimizations (KV caching, efficient attention)
  • Retraining - How to retrain an existing model
  • Database Extraction - Extracting training data from databases
  • Control System Model - A control system perspective on the model


Common Questions

Getting Started

Q: How do I get started with this project?
A: See Complete Guide - Quick Start section

Q: What do I need to install?
A: See Complete Guide - Installation section

Q: How do I train my first model?
A: See Complete Guide - Usage section

Understanding Concepts

Q: What are embeddings?
A: See Embeddings Explained

Q: How does attention work?
A: See Attention Explained

Q: What is a feed-forward network?
A: See Feed-Forward Explained

Q: Why do we need normalization?
A: See Normalization Explained

Q: How do neural networks work?
A: See Neural Network Explained

Q: What is a neuron and what are weights?
A: See Neural Network Explained

Training Questions

Q: What is training and why do we need it?
A: See Training Explained

Q: Why do we need data for training?
A: See Training Explained - Why Do We Need Data section

Q: Why is more data better?
A: See Training Explained - Why More Data is Better section

Q: How does the optimizer work?
A: See Optimization Explained

Q: What is learning rate scheduling?
A: See Scheduling Explained

Data Questions

Q: How does data processing work?
A: See Data Processing Explained

Q: Can I train on PDFs?
A: See Multi-Format Data Guide

Q: Can I train on images?
A: See Multi-Format Data Guide

Q: How do I process different file types?
A: See Data Processing Explained

Q: How do I download code repositories automatically?
A: See Repository Download Guide

Generation Questions

Q: How does text generation work?
A: See Generation Explained

Q: What is temperature in generation?
A: See Generation Explained - Temperature section

Q: What is top-k and top-p sampling?
A: See Generation Explained - Top-k and Top-p sections

Mathematical Questions

Q: What are the mathematical foundations?
A: See Mathematics or Complete Guide - Mathematical Foundations section

Q: How do I understand the complete mathematical model?
A: See Mathematics for step-by-step derivations

Q: Is there a control system perspective?
A: See Control System Model

Architecture Questions

Q: How is the architecture designed?
A: See Architecture

Q: What is the complete system flow?
A: See Complete Guide - Architecture Explained section

Advanced Questions

Q: How do I optimize inference?
A: See Optimizations

Q: How do I retrain a model?
A: See Retraining Guide

Q: How do I extract data from databases?
A: See Database Extraction Guide

Q: How do I download GitHub repositories for code training?
A: See Repository Download Guide


Glossary

A

AdamW - Adam-based optimizer combining adaptive learning rates with decoupled weight decay. See Optimization Explained
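
For concreteness, one step of the standard AdamW update can be sketched as follows (a reference sketch of the published algorithm, not necessarily this repository's exact code; all names here are illustrative):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a parameter tensor w; returns updated (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-correct the estimates (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: the decay term acts directly on w instead of being mixed into the gradient.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```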

Attention - Mechanism that determines how much each word should consider other words. See Attention Explained
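
In code, scaled dot-product attention (with the causal mask used by decoder-only models, see Causal Masking below) reduces to a few lines. This is the standard formulation rather than this project's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Q, K, V have shape (seq_len, d_k); returns the attended values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # how strongly each query matches each key
    if causal:                                # hide future positions from each query
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    return softmax(scores) @ V                # probability-weighted mix of the values
```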

Autoregressive - Generation method where the model uses its own previous outputs as inputs. See Generation Explained

B

Batch - Small group of examples processed together during training. See Training Explained

Bias - Constant added to weighted sum in neural networks. See Neural Network Explained

Backpropagation - Algorithm for computing gradients through the network. See Training Explained

C

Causal Masking - Prevents tokens from attending to future tokens. See Complete Guide

Cosine Annealing - Learning rate schedule that follows a cosine curve. See Scheduling Explained
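
A common concrete form is linear warmup followed by cosine decay (one typical variant; the schedule used in this repository may differ in its details):

```python
import math

def cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```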

Cross-Entropy Loss - Loss function for classification tasks. See Mathematics
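
For next-token prediction this is the average negative log-probability assigned to the correct tokens, which also gives the perplexity listed below:

```latex
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right),
\qquad
\mathrm{Perplexity} = \exp(\mathcal{L})
```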

D

Data Processing - Transformation of raw files into training-ready text. See Data Processing Explained

Dropout - Regularization technique that randomly sets activations to zero. See Complete Guide

Decoder - Part of transformer that generates output. See Architecture

E

Embedding - Numerical representation of words/tokens. See Embeddings Explained

Epoch - One complete pass through the training data. See Training Explained

Evaluation - Process of measuring model performance. See Training Explained

F

Feed-Forward Network (FFN) - Two-layer neural network that transforms features. See Feed-Forward Explained

Forward Pass - Computing predictions from inputs through the model. See Neural Network Explained

G

GELU - Gaussian Error Linear Unit activation function. See Feed-Forward Explained
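
The widely used tanh approximation of GELU (a standard definition, independent of this repository) is:

```python
import numpy as np

def gelu(x):
    """GELU(x) = x * Phi(x); tanh approximation of the standard normal CDF Phi."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```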

Generation - Process of creating new text from a trained model. See Generation Explained

Gradient - Derivative of loss with respect to parameters. See Optimization Explained

Gradient Clipping - Technique to prevent exploding gradients. See Complete Guide

Gradient Descent - Basic optimization algorithm. See Optimization Explained

H

Hidden State - Intermediate representation in the model. See Architecture

L

Layer Normalization - Normalization technique applied per layer. See Normalization Explained
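
Concretely, each feature vector is shifted to zero mean and unit variance and then rescaled by learned parameters (standard layer norm; the gamma and beta names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last dimension of x, then apply learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```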

Learning Rate - Step size for weight updates. See Optimization Explained

Logits - Raw scores before applying softmax. See Generation Explained

Loss - Measure of prediction error. See Training Explained

M

Multi-Head Attention - Attention mechanism with multiple parallel heads. See Attention Explained

Momentum - Technique to accelerate gradient descent. See Optimization Explained

N

Neural Network - Computational model inspired by biological neurons. See Neural Network Explained

Neuron - Basic processing unit in neural networks. See Neural Network Explained

Normalization - Technique to standardize activations. See Normalization Explained

Nucleus Sampling (Top-p) - Sampling strategy that keeps the smallest set of tokens whose cumulative probability is ≥ p. See Generation Explained

O

Optimization - Process of finding optimal weights. See Optimization Explained

Optimizer - Algorithm that updates model weights. See Optimization Explained

Overfitting - Model memorizes training data but doesn't generalize. See Training Explained

P

Perplexity - Measure of model uncertainty (exp(loss)). See Mathematics

Positional Encoding - Adds position information to embeddings. See Complete Guide
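
One common choice is the sinusoidal encoding from the original Transformer paper, shown here as a reference; the repository may use learned position embeddings instead:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...); assumes even d_model."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```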

Pre-norm - Architecture where normalization comes before sublayers. See Architecture

Probability Distribution - Distribution over possible next tokens. See Generation Explained

Q

Query (Q) - One of three representations in attention (what am I looking for?). See Attention Explained

R

Residual Connection - Skip connection that adds input to output. See Architecture
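
Residual connections and the pre-norm layout (see Pre-norm above) combine into a simple wrapper around the two sublayers. This structural sketch uses illustrative callables rather than the repository's actual classes:

```python
def pre_norm_block(x, attention, feed_forward, norm1, norm2):
    """One pre-norm transformer block: normalize, apply the sublayer, add back the input."""
    x = x + attention(norm1(x))     # attention sublayer with residual (skip) connection
    x = x + feed_forward(norm2(x))  # feed-forward sublayer with residual (skip) connection
    return x
```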

S

Sampling - Process of selecting a token from the probability distribution. See Generation Explained

Scheduling - Adjusting learning rate during training. See Scheduling Explained

Self-Attention - Attention mechanism where queries, keys, and values come from same input. See Attention Explained

Softmax - Function that converts logits to probabilities. See Generation Explained

T

Temperature - Parameter controlling randomness in sampling. See Generation Explained

Token - Basic unit of text (word or character). See Neural Network Explained

Tokenization - Process of converting text to tokens. See Data Processing Explained

Top-k Sampling - Sampling strategy keeping only top k tokens. See Generation Explained

Top-p Sampling - Another name for nucleus sampling. See Generation Explained
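
Temperature, top-k, and top-p (all defined above) are typically applied to the logits in sequence before drawing a token. Here is a minimal NumPy sketch of that standard pipeline, not necessarily this repository's exact code:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=np.random):
    """Sample one token id from raw logits after temperature / top-k / top-p filtering."""
    logits = logits / max(temperature, 1e-8)        # <1 sharpens, >1 flattens the distribution
    if top_k > 0:                                   # keep only the k highest-scoring tokens
        cutoff = np.sort(logits)[-min(top_k, len(logits))]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    if top_p < 1.0:                                 # nucleus: smallest set with cumulative prob >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs, dtype=bool)
        mask[keep] = True
        probs = np.where(mask, probs, 0.0)
        probs = probs / probs.sum()
    return rng.choice(len(probs), p=probs)
```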

Transformer - Neural network architecture based on attention. See Architecture

Training - Process of teaching model to make predictions. See Training Explained

V

Value (V) - One of three representations in attention (what information do I contain?). See Attention Explained

Vocabulary - Set of all possible tokens. See Embeddings Explained

W

Weight - Parameter in neural network that controls connection strength. See Neural Network Explained

Weight Decay - Regularization technique that penalizes large weights. See Optimization Explained

Weight Matrix - Matrix containing all weights for a layer. See Neural Network Explained


Quick Links


License

This project is available under a dual license:

Apache 2.0 License (Research & Non-Commercial Use)

Free for:

  • ✅ Academic research
  • ✅ Educational purposes
  • ✅ Personal projects
  • ✅ Open source contributions
  • ✅ Non-commercial use

Terms:

  • Free to use, modify, and distribute
  • Patent grant included
  • Must include license and copyright notice
  • Must state changes if modifying

Commercial License (Commercial Use)

Requires a commercial license for:

  • ⚠️ Commercial products or services
  • ⚠️ SaaS applications
  • ⚠️ Revenue-generating applications
  • ⚠️ Internal business use (for profit-making entities)
  • ⚠️ Any use that generates profit or revenue

To obtain a commercial license, contact: carlos.gutierrez@carg.dev
Subject: Commercial License Inquiry - SheepOp

Please include:

  • Intended use case
  • Expected usage volume
  • Company/Organization name
  • Contact information

Citation Requirement

IMPORTANT: If you use this software in academic research or publications, you MUST cite this work. This is a condition of use for academic purposes.

Required Citation Format:

BibTeX:

@software{sheepop2024,
  title = {SheepOp LLM: Transformer-based Language Model Implementation},
  author = {Gutierrez, Carlos},
  year = {2024},
  url = {https://github.com/CarGDev/sheepOp},
  version = {1.0}
}

Text format:

Carlos Gutierrez. (2024). SheepOp LLM: Transformer-based Language Model
Implementation. https://github.com/CarGDev/sheepOp

Note: Citation is required for academic use. Failure to cite constitutes a violation of the terms of use.

See LICENSE or LICENSE.txt for the full license text.


Contact

Carlos Gutierrez
Email: carlos.gutierrez@carg.dev


This README serves as an index to the comprehensive documentation available in the docs/ folder.
