# About

Examples of how to use the LuminarSequenceDetector, which builds on the LuminarSequenceClassifier.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import torch
import gc

from IPython.display import display, HTML
from luminar.detector import LuminarSequenceDetector
from luminar.utils.cuda import get_best_device
from luminar.sequence_classifier import LuminarSequence
from luminar.utils import LuminarSequenceTrainingConfig, ConvolutionalLayerSpec
from luminar.utils.visualization import visualize_detection

# Free Python garbage first, then release cached GPU memory left over from earlier runs.
gc.collect()
if torch.cuda.is_available():
    with torch.cuda.device(torch.cuda.current_device()):
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()

In [3]:
class Config:
    # Model to load. Commented-out alternative (a local checkpoint path):
    # "/storage/projects/boenisch/PrismAI/models/luminar_sequence/PrismAI_v2-encoded-gpt2/e1s2k2du"
    MODEL_PATH = "TheItCrOw/LuminarSeq"

In [4]:
# tiiuae/falcon-7b
detector = LuminarSequenceDetector(
    model_path=Config.MODEL_PATH,
    feature_agent="gpt2",
    device=get_best_device(),
)

Loading LuminarSequenceDetector from TheItCrOw/LuminarSeq to device cuda:0
LuminarSequenceTrainingConfig(feature_len=512, num_intermediate_likelihoods=13, apply_delta_augmentation=False, apply_product_augmentation=True, weighted_sampling=False, conv_layer_shapes=[[32, 5, 1], [64, 5, 1], [32, 3, 1]], projection_dim=128, lstm_hidden_dim=256, lstm_layers=1, stack_spans=4, hf_dataset='TheItCrOw/MAGE-encoded-gpt2', dataset_root_path='/storage/projects/stoeckel/prismai/encoded/fulltext/', models_root_path='/storage/projects/boenisch/PrismAI/models/luminar_sequence/', domain=None, agent='gpt_4o_mini_gemma2_9b', feature_agent='gpt2', max_epochs=100, batch_size=128, early_stopping_patience=8, rescale_features=False, kfold=3, learning_rate=0.004, seed=42)
Loaded.


In [5]:
document = """
current and upcoming surveys will measure the cosmological parameters with an extremely high accuracy . the primary goal of these observations is to eliminate some of the currently viable cosmological models created to explain the late time accelerated expansion ( either real or only inferred ) . however , most of the statistical tests used in cosmology have a strong requirement : the use of a model to fit the data . these statistical tests are usually based on the normal , or Gaussian , distribution . this distribution is defined by two basic parameters : the mean and the variance ( also referred to as standard deviation or simply "σ") . the hypothesis that we test , which is referred to as the "Model" , is also defined by only two parameters : the "F" and the "ω" values ; defined by : F(ω) = σ(ω) 2 - (σF)2. The observed data in the analysis is given by . where "s" is the square amplitude of the CMB temperature vector and is associated with the "angular power spectrum" while is the amplitude of the CMB temperature vector and is associated with the "angular power spectrum" . the observed data in the analysis is given by : where "s" is the square amplitude of the CMB temperature vector and is associated with the "angular power spectrum" while is the amplitude of the CMB temperature vector and is associated with the "angular power spectrum" . it is not very useful when dealing with a theoretical distribution that is not in a normal case, such as the CMB , which is defined by : where is the scalar function describing the temperature distribution , and is defined by . the inverse cumulative density is defined by . so, the inverse probability is simply equal to : The assumption is basically: the data we are observing is normally distributed and our model is exactly correct. we use the data to construct the probability for our hypothesis: .
when using an iterative method to solve for the initial parameters ( ω , F ) of a statistical model , the method must be based on some form of a "
"""

In [6]:
print("Document length:", len(document))

result = detector.detect(document)
print(result)

Document length: 2015
{'avg': 41.33349061012268, 'probs': [23.08170050382614, 10.568853467702866, 12.825265526771545, 40.45039117336273, 79.97509241104126, 10.568853467702866, 12.825265526771545, 40.45039117336273, 86.18561625480652, 19.017255306243896, 71.19887471199036, 43.53920519351959, 79.09947633743286, 19.017255306243896, 71.19887471199036], 'token_spans': [(0, 17), (17, 51), (51, 78), (78, 94), (94, 121), (121, 177), (177, 187), (187, 232), (232, 255), (256, 273), (273, 307), (307, 334), (334, 350), (350, 377), (377, 416)], 'char_spans': [(0, 103), (103, 297), (297, 426), (426, 512), (512, 650), (650, 824), (824, 874), (874, 1095), (1095, 1203), (1207, 1289), (1289, 1467), (1467, 1602), (1602, 1686), (1686, 1832), (1832, 2011)]}
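
The printed dictionary is easier to read when each character span is paired with its score. The following is a minimal sketch, not part of the library: the values are copied from the output above, `summarize` and the cut-off of 50 are hypothetical, and it assumes that the scores lie on a 0 to 100 scale where a higher value means a span is more likely machine-generated. For this output, `avg` agrees with the arithmetic mean of `probs` up to floating-point rounding.

```python
from statistics import mean

# Values copied from the printed result above; in the notebook, use `result` directly.
result = {
    "avg": 41.33349061012268,
    "probs": [
        23.08170050382614, 10.568853467702866, 12.825265526771545,
        40.45039117336273, 79.97509241104126, 10.568853467702866,
        12.825265526771545, 40.45039117336273, 86.18561625480652,
        19.017255306243896, 71.19887471199036, 43.53920519351959,
        79.09947633743286, 19.017255306243896, 71.19887471199036,
    ],
    "char_spans": [
        (0, 103), (103, 297), (297, 426), (426, 512), (512, 650),
        (650, 824), (824, 874), (874, 1095), (1095, 1203), (1207, 1289),
        (1289, 1467), (1467, 1602), (1602, 1686), (1686, 1832), (1832, 2011),
    ],
}

THRESHOLD = 50.0  # hypothetical cut-off, not defined by the library


def summarize(result, threshold=THRESHOLD):
    """Pair each character span with its score and flag spans above the threshold."""
    rows = []
    for (start, end), score in zip(result["char_spans"], result["probs"]):
        rows.append(
            {"start": start, "end": end, "score": score, "flagged": score > threshold}
        )
    return rows


rows = summarize(result)
for row in rows:
    marker = "*" if row["flagged"] else " "
    print(f"{marker} chars {row['start']:>4}-{row['end']:<4} score {row['score']:6.2f}")

# Compare the reported average with the mean of the per-span scores.
print("mean of probs:", round(mean(result["probs"]), 4), "| reported avg:", round(result["avg"], 4))
```

In the notebook, `document[start:end]` returns the text of a span, so the flagged rows can be printed next to the passages they refer to.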


In [7]:
html_output = visualize_detection(document, result)
HTML(html_output)
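
To make clear what kind of markup such a visualization can produce, here is a self-contained sketch that highlights scored spans with plain Python. It is not the implementation of `visualize_detection`: the function `highlight_spans`, the red colour scale, and the sample text are assumptions for illustration only, and scores are again assumed to lie between 0 and 100.

```python
from html import escape


def highlight_spans(text, char_spans, scores, max_score=100.0):
    """Wrap each scored span in a <span> whose red background opacity tracks the score."""
    parts = []
    cursor = 0
    for (start, end), score in zip(char_spans, scores):
        # Emit any unscored gap between spans as plain escaped text.
        if start > cursor:
            parts.append(escape(text[cursor:start]))
        # Clamp the opacity to the range [0, 1].
        alpha = max(0.0, min(score / max_score, 1.0))
        parts.append(
            f'<span style="background: rgba(255, 0, 0, {alpha:.2f})" '
            f'title="{score:.1f}">{escape(text[start:end])}</span>'
        )
        cursor = end
    # Emit whatever follows the last span.
    parts.append(escape(text[cursor:]))
    return "".join(parts)


# Hypothetical sample input, unrelated to the document analysed above.
sample = "First sentence. <Second> sentence."
html_sketch = highlight_spans(sample, [(0, 15), (16, 34)], [20.0, 80.0])
print(html_sketch)
```

With the real result, the call would be `highlight_spans(document, result["char_spans"], result["probs"])`, and the returned string can be passed to `HTML(...)` in the same way as `html_output`.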