# EGen-Core — Quick Inference Example

**The Athena Project (2025–2026)** — Developed by [ErebusTN](https://github.com/ErebusTN)

A minimal end-to-end example showing how to:
1. Install EGen-Core
2. Load a model from HuggingFace Hub
3. Tokenize a prompt
4. Generate text
5. Decode and display the output

In [None]:
!pip install -q EGen-Core

In [None]:
import torch
import sys

print(f"Python:  {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA:    {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:     {torch.cuda.get_device_name(0)}")
    # The device-properties attribute is `total_memory` (in bytes), not `total_mem`.
    print(f"VRAM:    {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## Load Model

EGen-Core's `AutoModel` auto-detects the model architecture and loads it
using layer-wise sharded inference, keeping GPU memory usage minimal.
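
To see why processing one layer at a time keeps memory low, here is a rough back-of-the-envelope estimate. It assumes a 70B-parameter model with 80 equally sized transformer layers stored as 16-bit weights (the published Llama 2 70B shape, which Platypus2-70B is built on). It ignores embeddings, activations, and the KV cache, so treat the numbers as illustrative rather than as measurements of EGen-Core.

```python
# Rough estimate only. Assumptions (not taken from EGen-Core itself):
# 70e9 parameters, 80 equally sized layers, 2 bytes per weight (fp16).
# Embeddings, activations, and the KV cache are ignored.
PARAMS = 70e9
NUM_LAYERS = 80
BYTES_PER_PARAM = 2

full_model_gb = PARAMS * BYTES_PER_PARAM / 1e9
per_layer_gb = full_model_gb / NUM_LAYERS

print(f"Whole model in fp16: {full_model_gb:.0f} GB")
print(f"One layer in fp16:   {per_layer_gb:.2f} GB")
```

Under these assumptions a single layer is a small fraction of the full model, which is what makes running on a low-VRAM GPU plausible.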

In [None]:
from egen_core import AutoModel

# Choose a model — EGen-Core can handle 70B+ models on 4GB VRAM!
MODEL_ID = "garage-bAInd/Platypus2-70B-instruct"  # Change to your preferred model
MAX_LENGTH = 128

model = AutoModel.from_pretrained(MODEL_ID)
print(f"\nModel loaded: {MODEL_ID}")

## Tokenize and Generate

In [None]:
prompt = "What is the capital of the United States?"

input_tokens = model.tokenizer(
    [prompt],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False
)

print(f"Prompt:      {prompt}")
print(f"Token count: {input_tokens['input_ids'].shape[-1]}")

In [None]:
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=50,
    use_cache=True,
    return_dict_in_generate=True
)

output_text = model.tokenizer.decode(generation_output.sequences[0])
print(f"\nGenerated output:\n{output_text}")
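
In Hugging Face style `generate` APIs for decoder-only models, the returned sequence normally starts with the prompt tokens, so the decoded text repeats the question. Assuming EGen-Core behaves the same way (not verified here), slicing off the prompt leaves only the continuation. The helper `strip_prompt` below is hypothetical and works on plain lists so it runs without a model.

```python
def strip_prompt(sequence, prompt_len):
    """Return only the tokens generated after the prompt.

    Hypothetical helper: assumes `sequence` begins with the prompt tokens.
    """
    if prompt_len < 0 or prompt_len > len(sequence):
        raise ValueError("prompt_len must be between 0 and len(sequence)")
    return sequence[prompt_len:]


# Toy token ids standing in for real tokenizer output.
prompt_ids = [1, 1724, 338]
full_sequence = prompt_ids + [450, 7483, 2]
new_tokens = strip_prompt(full_sequence, len(prompt_ids))
print(new_tokens)
```

In this notebook the equivalent slice would be `generation_output.sequences[0][input_tokens['input_ids'].shape[-1]:]`, applied before calling `decode`.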

## With Compression (Optional)

Pass `compression='4bit'` to `from_pretrained` to enable 4-bit compression, which trades a small amount of accuracy for roughly 3× faster inference.
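
As an illustration of what 4-bit compression means, the sketch below applies simple absmax quantization to a small block of weights: each value is mapped to an integer in [-7, 7] that shares one scale factor. This is a generic technique chosen for illustration, not necessarily the scheme EGen-Core uses, and the function names are made up for this example.

```python
def quantize_4bit(weights):
    """Absmax-quantize a block of floats to signed 4-bit integer codes."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from 4-bit codes and their shared scale."""
    return [c * scale for c in codes]


block = [0.70, -0.30, 0.10, 0.0]
codes, scale = quantize_4bit(block)
restored = dequantize_4bit(codes, scale)

print(codes)
print(max(abs(a - b) for a, b in zip(block, restored)))
```

Each code needs 4 bits instead of 16, so the weights take about a quarter of the space (plus one scale per block), at the cost of a rounding error of at most half a quantization step.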

In [None]:
# Uncomment to test with compression:
# model_compressed = AutoModel.from_pretrained(MODEL_ID, compression='4bit')
# output = model_compressed.generate(
#     input_tokens['input_ids'].cuda(),
#     max_new_tokens=50,
#     use_cache=True,
#     return_dict_in_generate=True
# )
# print(model_compressed.tokenizer.decode(output.sequences[0]))

---

**Done!** ✅ You've just run inference on a large language model with EGen-Core.
Visit the [README](https://github.com/ErebusTN/EGen-Core) for more examples and documentation.