# N-gram Tracing

This notebook tests n-gram tracing for use in author verification methods. The goal is to confirm that the code finds the n-grams common to two texts and can return the text preceding each of those n-grams.

In [47]:
import sys

from from_root import from_root

sys.path.insert(0, str(from_root("src")))

from model_loading import load_model
from read_and_write_docs import read_txt
from n_gram_tracing import (
    common_ngrams,
    tokens_to_text,
    find_all_ngram_positions,
    texts_before_each_ngram,
    find_all_ngram_spans,
    texts_around_each_ngram
)

In [48]:
tokenizer, model = load_model("/Volumes/BCross/models/gpt2")

In [49]:
known_text = read_txt("../data/hodja_nasreddin_text_1.txt")
unknown_text = read_txt("../data/hodja_nasreddin_text_10.txt")

## Get Common N-Grams

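Here we get the n-grams that the two texts have in common. The result below contains n-grams of two to four tokens even though `n=2`, which suggests that `n` acts as a minimum length rather than an exact one.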

In [50]:
common = common_ngrams(
    text1=known_text,
    text2=unknown_text,
    n=2,
    tokenizer=tokenizer,
    include_subgrams=False,
    lowercase=True
)

common

[[',', 'Ġit'],
 ['Ġas', 'Ġwas'],
 ['Ġin', 'Ġthe'],
 ['Ġis', 'Ġa'],
 ['Ġis', 'Ġnot'],
 ['Ġof', 'Ġthe'],
 ['Ġof', 'Ġthem'],
 ['Ġshould', 'Ġbe'],
 ['Ġthey', 'Ġactually'],
 ['Ġto', 'Ġreplace'],
 [',', 'Ġthis', 'Ġis'],
 ['.', '\\', 'n'],
 ['.', '\\', 'ni'],
 ['.', '\\', 'nt'],
 ['Ġso', 'v', 'iet'],
 ['Ġright', 'Ġnow', ',', 'Ġbut']]
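To make the idea concrete, here is a minimal sketch of one way shared n-grams could be found. It is not the `common_ngrams` implementation from `n_gram_tracing`: the function names are hypothetical, the input is a plain token list rather than tokenizer output, and the reading of `n` as a minimum length and of `include_subgrams=False` as "keep only maximal n-grams" is an assumption drawn from the output above.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token tuples in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_subgram(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def common_ngrams_sketch(tokens1, tokens2, n=2, include_subgrams=False):
    """Shared n-grams of length >= n (hypothetical helper, not the notebook's)."""
    shared = []
    max_len = min(len(tokens1), len(tokens2))
    for size in range(n, max_len + 1):
        found = ngrams(tokens1, size) & ngrams(tokens2, size)
        if not found:
            break  # no shared n-gram of this size means none longer either
        shared.extend(found)
    if not include_subgrams:
        # Assumed meaning: drop any n-gram contained in a longer shared one.
        shared = [g for g in shared
                  if not any(len(h) > len(g) and is_subgram(g, h) for h in shared)]
    return sorted(shared, key=lambda g: (len(g), g))


# Whitespace-split toy input; the notebook uses GPT-2 tokens instead.
a = "it is not right now , but fine".split()
b = "this is not so right now , but ok".split()
maximal = common_ngrams_sketch(a, b, n=2)
```

On this toy input `maximal` holds one 2-gram and one 4-gram, because the shorter pieces of the 4-gram are filtered out as subgrams.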

In [51]:
sample_tokens = tokens_to_text(common[-1], tokenizer)
sample_tokens

' right now, but'
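The `Ġ` prefix in the tokens above is how GPT-2's byte-level BPE vocabulary marks a leading space. The sketch below shows a simplified conversion from such tokens back to text. It is not the `tokens_to_text` function used in this notebook: the name is hypothetical and only the space marker is handled, whereas a real tokenizer maps every byte (on a Hugging Face tokenizer, `convert_tokens_to_string` does the full conversion).

```python
def tokens_to_text_sketch(tokens):
    """Join GPT-2 style tokens and turn the 'Ġ' marker back into a space.

    Simplified on purpose: other byte-level markers are left untouched.
    """
    return "".join(tokens).replace("Ġ", " ")


sample = tokens_to_text_sketch(["Ġright", "Ġnow", ",", "Ġbut"])
```

The leading space is kept deliberately, since a later substring search for the n-gram depends on it.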

## Find Starting Positions

There are two options here: find the starting position of each n-gram and return the text before it, or return the text with the n-gram included.

In [52]:
find_all_ngram_positions(known_text, " in the")

[1609, 2008]

In [53]:
example_texts = texts_before_each_ngram(known_text, " in the")
print(example_texts[0])
print(example_texts[1])

If they actually censor anything is another question.\nUnlike others, Medvedev is an internationally recognized historian.\nHe tells that these people are governmental bureaucrats although some of them have degrees.\nMain point this is a Can anyone clarify, please, where the 21 million number for three countries in version 3 comes from?\nI hope you do not suggest to replace three large sections about Kirov assassination by his single paragraph?\nOf course, if there is any sourced information in his text that currently missing, it might be 'added' to current version.\nPerhaps this article should be merged, but this must be properly done.\nYes, that was one of the reasons why she left big sport so early.\nOf course he did not write about his abuse in reports to KGB superiors.\nI like book by Radzinsky, but this source provides much more details with a lot of references Radzinsky works on a bigger 3-volume biography of Stalin right now.\nSo, you are very welcome to improve this and 'other
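For reference, the behaviour tested above can be approximated with plain string search. This is a sketch only, with hypothetical function names. It assumes that positions are character offsets of every occurrence (overlapping ones included) and that "text before" means everything from the start of the document up to that offset. The real `find_all_ngram_positions` and `texts_before_each_ngram` may differ.

```python
def find_positions_sketch(text, ngram):
    """Return the character offset of every occurrence of `ngram` in `text`."""
    positions = []
    start = text.find(ngram)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping matches are also found.
        start = text.find(ngram, start + 1)
    return positions


def texts_before_sketch(text, ngram):
    """Return the text preceding each occurrence of `ngram`."""
    return [text[:pos] for pos in find_positions_sketch(text, ngram)]


demo = "He sat in the sun, then in the shade."
demo_positions = find_positions_sketch(demo, " in the")
demo_before = texts_before_sketch(demo, " in the")
```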

## Include N-Gram in Result

Now we test including the n-gram in the result. Either approach works, since the n-gram will probably be appended back on anyway.

In [54]:
find_all_ngram_spans(known_text, " in the")

[(1609, 1616), (2008, 2015)]

In [55]:
example_texts = texts_around_each_ngram(known_text, " in the")
print(example_texts[0])
print(example_texts[1])

If they actually censor anything is another question.\nUnlike others, Medvedev is an internationally recognized historian.\nHe tells that these people are governmental bureaucrats although some of them have degrees.\nMain point this is a Can anyone clarify, please, where the 21 million number for three countries in version 3 comes from?\nI hope you do not suggest to replace three large sections about Kirov assassination by his single paragraph?\nOf course, if there is any sourced information in his text that currently missing, it might be 'added' to current version.\nPerhaps this article should be merged, but this must be properly done.\nYes, that was one of the reasons why she left big sport so early.\nOf course he did not write about his abuse in reports to KGB superiors.\nI like book by Radzinsky, but this source provides much more details with a lot of references Radzinsky works on a bigger 3-volume biography of Stalin right now.\nSo, you are very welcome to improve this and 'other
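The span variant can be sketched the same way. The names below are hypothetical, and the sketch assumes two things suggested by the output above: spans are `(start, end)` character offsets with an exclusive end (1609 plus the 7 characters of `" in the"` gives 1616), and the returned text runs from the start of the document through the end of the n-gram. The real `find_all_ngram_spans` and `texts_around_each_ngram` may behave differently.

```python
def find_spans_sketch(text, ngram):
    """Return (start, end) offsets of every occurrence, end exclusive."""
    spans = []
    start = text.find(ngram)
    while start != -1:
        spans.append((start, start + len(ngram)))
        start = text.find(ngram, start + 1)
    return spans


def texts_through_sketch(text, ngram):
    """Return the text up to and including each occurrence of `ngram`."""
    return [text[:end] for _, end in find_spans_sketch(text, ngram)]


demo = "He sat in the sun, then in the shade."
demo_spans = find_spans_sketch(demo, " in the")
demo_through = texts_through_sketch(demo, " in the")
```

Slicing with the exclusive end offset means each result ends exactly at the last character of the n-gram, so nothing needs to be appended afterwards.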