In [2]:
pip install krixik


In [3]:
import sys 
sys.path.append('..')
from dotenv import load_dotenv
import os
load_dotenv()

LUCAS_STAGING_API_KEY=os.getenv('LUCAS_STAGING_API_KEY')
LUCAS_STAGING_API_URL=os.getenv('LUCAS_STAGING_API_URL')

# import Krixik
from krixik import krixik
krixik.init(api_key = LUCAS_STAGING_API_KEY, 
            api_url = LUCAS_STAGING_API_URL)

import json
def json_print(data):
    print(json.dumps(data, indent=2))

%load_ext autoreload
%autoreload 2 

SUCCESS: You are now authenticated.


---

# The Pros and Cons of Recursive Summarization

[Summarization AI models](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/summarize_module/) take text input and return a summarized version of the text. The models are built and trained in a variety of ways, and thus have different approaches to summarization: for instance, some may be prone to removing certain sentences/words and keeping others, while others might actively rewrite sentences/phrases/paragraphs into shorter versions that retain key meaning.

Certain models may offer a parameterizable 'degree of summarization' through which the user can specify how much shorter than the original text the summary should be. With models that don't offer this option, a similar effect can be achieved through [recursive summarization](https://krixik-docs.readthedocs.io/en/latest/examples/multi_module_non_search_pipeline_examples/multi_recursive_summarization/), which involves applying the summarization function several times in sequence, each time feeding the output of the previous iteration in as the input of the next.
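
The loop behind recursive summarization can be sketched in a few lines. This is only a minimal illustration, not Krixik's implementation: `recursive_summarize` and the toy `keep_first_half` summarizer are hypothetical names, and the toy function merely stands in for a real model call.

```python
from typing import Callable, List


def recursive_summarize(text: str,
                        summarize: Callable[[str], str],
                        passes: int) -> List[str]:
    """Apply `summarize` repeatedly and return the output of every pass."""
    outputs = []
    current = text
    for _ in range(passes):
        current = summarize(current)  # the previous output becomes the next input
        outputs.append(current)
    return outputs


def keep_first_half(text: str) -> str:
    """Toy stand-in for a model: keep the first half of the sentences (at least one)."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    kept = sentences[:max(1, len(sentences) // 2)]
    return '. '.join(kept) + '.'


sample = "One. Two. Three. Four. Five. Six. Seven. Eight."
for n, summary in enumerate(recursive_summarize(sample, keep_first_half, passes=3), start=1):
    print(n, summary)
```

Keeping every intermediate output, rather than only the last one, makes it easy to compare passes side by side later on.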

[Recursive summarization](https://krixik-docs.readthedocs.io/en/latest/examples/multi_module_non_search_pipeline_examples/multi_recursive_summarization/) offers appealing upsides:

- Choosing how many times summarization is applied lets you control the degree of summarization.
- Running a small summarization model several times may be more cost-effective than running a very advanced and parameterizable summarization model once.
- Different sequential combinations of [summarization models](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/summarize_module/#available-models-in-the-summarize-module) yield a wider variety of results. For instance, you may find that recursing with one model but ending with another provides a result closer to what you seek.

However, recursively summarizing also has its downsides:

- Recursion may unwittingly summarize away a core element of the text.
- Since summarization involves transforming the text, its meaning may be changed over several iterations.
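
One rough way to watch for the first downside is to check which key terms survive each pass. The sketch below is an assumption-laden illustration: `surviving_terms` is a hypothetical helper, it relies on plain case-insensitive substring matching (so a paraphrased idea counts as lost), and the two sample strings are invented placeholders rather than model output.

```python
from typing import Dict, List


def surviving_terms(summaries: List[str], key_terms: List[str]) -> Dict[str, int]:
    """Map each key term to the number of consecutive passes it survives (0 = lost in pass 1)."""
    survival = {}
    for term in key_terms:
        passes_survived = 0
        for summary in summaries:
            if term.lower() not in summary.lower():
                break  # the term dropped out at this pass
            passes_survived += 1
        survival[term] = passes_survived
    return survival


# Invented placeholder summaries, ordered from first pass to last
summaries = [
    "Winston enters Victory Mansions. A telescreen watches him.",
    "Winston enters Victory Mansions.",
]
print(surviving_terms(summaries, ["Winston", "telescreen", "Big Brother"]))
# → {'Winston': 2, 'telescreen': 1, 'Big Brother': 0}
```

A check like this cannot detect changed meaning, the second downside, which still calls for reading the outputs.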

### Recursive Summarization Example - Opening of <u>1984</u>

Let's take a look at recursively summarizing two example documents. We can examine whether any fundamental component of the text has been changed or lost along the way.

There are two ways to do this. For the first text we'll use three different [Krixik pipelines](https://krixik-docs.readthedocs.io/en/latest/) respectively holding one, two, and three sequential summarization models. Let's [create](https://krixik-docs.readthedocs.io/en/latest/system/pipeline_creation/create_pipeline/) the three pipelines, starting with the [single-module](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_summarize/) one:

In [4]:
# create a single-module pipeline with a summarize model
pipeline_1 = krixik.create_pipeline(name='my_single_summarize_pipeline',
                                    module_chain=['summarize'])

# create a pipeline with two sequential summarize models
pipeline_2 = krixik.create_pipeline(name='my_double_summarize_pipeline',
                                    module_chain=['summarize', 'summarize'])

# create a pipeline with three sequential summarize models
pipeline_3 = krixik.create_pipeline(name='my_triple_summarize_pipeline',
                                    module_chain=['summarize', 'summarize', 'summarize'])

Now we can feed our text, which consists of the first few paragraphs of [George Orwell's <u>1984</u>](https://gutenberg.net.au/ebooks01/0100021.txt), into these three pipelines and compare their output. We'll leverage the module's [default model](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/summarize_module/#available-models-in-the-summarize-module), [bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn), and thus have no need to specify model selection in the code.

Processing through our [single-module](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_summarize/) pipeline calls for a single line of code:

In [13]:
# process the beginning of Orwell's 1984 through a single-module summarize pipeline
pipeline_1.process(local_file_path='./test_files/1984_opening.txt',
                   verbose=False)

{'status_code': 200,
 'pipeline': 'my_single_summarize_pipeline',
 'request_id': 'ee43000b-3f6a-43d9-a40b-804253d7a9df',
 'file_id': '4a363df8-a283-4e5d-b9d7-2b4ff3b2b851',
 'message': 'SUCCESS - output fetched for file_id 4a363df8-a283-4e5d-b9d7-2b4ff3b2b851.Output saved to location(s) listed in process_output_files.',
 'process_output': None,
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/4a363df8-a283-4e5d-b9d7-2b4ff3b2b851.txt']}

We process the file through the other two pipelines with nearly identical code, omitted here for brevity.
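
For reference, the same call can be applied to several pipelines in one go. This is a sketch under assumptions rather than the code actually run for this article: `process_with_each` is a hypothetical helper, and it assumes each pipeline's `.process` returns a dictionary with the `pipeline` and `process_output_files` keys shown in the output above.

```python
from typing import Any, Dict, Iterable


def process_with_each(pipelines: Iterable[Any], local_file_path: str) -> Dict[str, str]:
    """Process one file through each pipeline; map pipeline name -> first output file path."""
    results = {}
    for pipeline in pipelines:
        # same call as in the single-module example above
        output = pipeline.process(local_file_path=local_file_path, verbose=False)
        results[output['pipeline']] = output['process_output_files'][0]
    return results


# Possible usage with the three pipelines created earlier:
# outputs = process_with_each([pipeline_1, pipeline_2, pipeline_3],
#                             './test_files/1984_opening.txt')
```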

Let's now take a look at the resultant versions of the text. First the once-summarized output:

> Winston Smith walked through the glass doors of Victory Mansions. The hallway
> smelt of boiled cabbage and old rag mats. At one end of
> it it acoloured poster, too large for indoor display, had been tacked
> to the wall. It depicted simply an enormous face, more than a
> metre wide. Winston made for the stairs.
> 
> Inside the flat a fruity voice was reading out a list of
> figures which had something to do with pig-iron. Winston turned a switch
> and the voice sank somewhat, though the words were still distinguishable. He
> moved over to the window: a smallish, frail figure, the meagreness of
> his body merely emphasized by the blue overalls which were the uniform
> of the party.
> 
> Winston kept his back turned to the telescreen. It was safer; though,
> as he well knew, even a back can be revealing. A kilometre
> away the Ministry of Truth, his place of work, towered vast and
> white above the grimy landscape. Winston tried to squeeze out some childhood
> memory that should tell him whether London had always been quite like
> this.
> 
> The Ministry of Truth--Minitrue, in Newspeak [Newspeak was the officiallanguage of Oceania]--was
> startlingly different from any other object in sight. It was an enormous
> pyramidal structure of glittering white concrete, soaring 300 metres into the air.

Twice-summarized:

> Winston Smith walked through the glass doors of Victory Mansions. The hallway
> smelled of boiled cabbage and old rag mats. At one end of
> the hallway an acoloured poster, too large for indoor display, had been
> tacked to the wall. It depicted simply an enormous face, more than
> a metre wide.
> 
> Winston kept his back turned to the telescreen. It was safer; though,
> he well knew, even a back can be revealing. A kilometre away
> the Ministry of Truth, his place of work, towered vast and white.

And thrice-summarized:

> Winston Smith walked through the glass doors of Victory Mansions. The hallway
> smelled of boiled cabbage and old rag mats. A kilometre away, his
> place of work, the Ministry of Truth, towered vast and white.

The summarization model does its job well. Much of the political and environmental subtext has been stripped away along the way, but the actual events of the narrative, as well as the main details described, remain present. Summarization necessarily reduces detail, and by the final version nothing of critical narrative value has been changed or removed.

Interestingly, the thrice-summarized version is better than the twice-summarized one. In the latter, a key action has been removed: Winston's passage from the entrance of Victory Mansions to his flat, where he stands in the presence of a telescreen. The change between paragraphs is thus jarring. This is resolved in the final version, where the telescreen passage is dropped and only his entrance to the Mansions and the distant view of the Ministry of Truth remain. Despite that version having been summarized three times, the sentence about the smell in the place remains, and thus the general tone and idea of this excerpt are maintained. The pipeline has succeeded at its task.

### Recursive Summarization Example - *TechCrunch* Editorial

Let's try a different text: a June 2024 editorial from TechCrunch titled [<u>WTF is AI?</u>](https://techcrunch.com/2024/06/01/wtf-is-ai/). We'll do two things differently for this example:

- Instead of the default summarization model, we'll use another [available model](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/summarize_module/#available-models-in-the-summarize-module): Falconsai's [text summarization](https://huggingface.co/Falconsai/text_summarization).
- We will use the same pipeline several times instead of three distinct pipelines.

The pipeline will be the above-created [single-module](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_summarize/) pipeline. Leveraging it the first time calls for a line of code similar to that above, with the exception that we must specify the non-default model:

In [20]:
# process the TechCrunch editorial through a single-module summarize pipeline and save the output object to a variable
my_process_output = pipeline_1.process(local_file_path='./test_files/WTF_is_AI.txt',
                                       modules={'summarize': {'model': 'text-summarization', 'params': {}}},
                                       verbose=False)

To [summarize recursively](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_summarize/#recursive-summarization) we'll feed the above output back into the same pipeline. We'll do so twice, each time passing in the output file from the previous run:

In [21]:
# process the output from the previous summarization through the same single-module summarize pipeline again, save the new output object to a different variable
my_process_output_2 = pipeline_1.process(local_file_path=my_process_output["process_output_files"][0],
                                         modules={'summarize': {'model': 'text-summarization', 'params': {}}},
                                         verbose=False)

In [22]:
# third pass: process the output from the second summarization through the same pipeline once more, save the new output object to a third variable
my_process_output_3 = pipeline_1.process(local_file_path=my_process_output_2["process_output_files"][0],
                                         modules={'summarize': {'model': 'text-summarization', 'params': {}}},
                                         verbose=False)
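
Since each of these calls differs only in its input path, the repetition can be folded into a loop. The following is a minimal sketch, assuming `.process` accepts the `local_file_path`, `modules`, and `verbose` arguments used above and returns a dictionary with a `process_output_files` list; `summarize_recursively` is a hypothetical helper name, not part of Krixik.

```python
from typing import Any, Dict, List, Optional


def summarize_recursively(pipeline: Any,
                          local_file_path: str,
                          passes: int,
                          modules: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
    """Run a file through `pipeline` `passes` times, feeding each output file into the next pass."""
    outputs = []
    current_path = local_file_path
    for _ in range(passes):
        kwargs = {'local_file_path': current_path, 'verbose': False}
        if modules is not None:
            kwargs['modules'] = modules  # only override the default model when asked to
        output = pipeline.process(**kwargs)
        outputs.append(output)
        current_path = output['process_output_files'][0]  # next pass reads this file
    return outputs


# Possible usage, mirroring the three calls above:
# all_outputs = summarize_recursively(
#     pipeline_1, './test_files/WTF_is_AI.txt', passes=3,
#     modules={'summarize': {'model': 'text-summarization', 'params': {}}})
```

Returning every output object, not just the last, keeps the intermediate files available for comparison.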

This time we'll only look at the summary generated after three runs through the pipeline:

> The best way to think of artificial intelligence is as software that approximates human thinking . It’s not the same, nor is it better or worse, but even a rough copy of the way a person
> thinks can be useful for getting things done . The field of AI, it turns out, is as much about the questions as it is about the answers .
> 
> The process of building this complex, multidimensional map of which words and phrases lead to or are associated with one other is called training . When an AI is given a prompt, like a question, it
> locates the pattern on its map that most resembles it, then predicts — or generates— the next word in that pattern, then the next, and so on . Given how well structured language is and how
> much information the AI has ingested, it can be amazing what they can produce .
> 
> As millions have experienced for themselves, AIs make for surprisingly engaging conversationalists . They’re informed on every topic, non-judgmental, and quick to respond . Keep in mind that the AI is always finishing a pattern .
> Even in technical literature the computational process that produces results is called “inference”! The issues we’re seeing are mostly due to limitations of AI rather than its capabilities .
> 
> We’re talking billions of images and documents . Anyone could tell you that there’s no way to scrape a billion pages of content from ten thousand websites and somehow not get anything objectionable . When 90%
> of the stock images of CEOs are of white men, the AI may generally refuse to provide instructions for creating napalm . But can you help me fall asleep like grandma did? This is a great
> reminder of how these systems have no sense .
> 
> Platforms like Midjourney and DALL-E have popularized AI-powered image generation . By getting vastly better at understanding language and descriptions, these systems can also be trained to associate words and phrases with the contents of an
> image . Say the model is given the phrase "a black dog in a forest" The path on the language map is then sent through the middle layer to the image map .
> 
> AI is completing, converting, and combining patterns in its giant statistics maps . The concept of “artificial general intelligence,” also called “strong AI,” varies depending on who you talk to, but generally it refers to software
> capable of exceeding humanity on any task, including improving itself . But AGI is just a concept, the way interstellar travel is We likely cannot predict the nature or time horizon of AGI .

This summary is of much lower quality than the thrice-summarized output from the previous example. This may be because the text was longer, and thus trickier to summarize, or because its non-narrative nature suited the model less. It may also be because the [text summarization](https://huggingface.co/Falconsai/text_summarization) model is weaker than the [module](https://krixik-docs.readthedocs.io/en/latest/modules/ai_modules/summarize_module/#available-models-in-the-summarize-module)'s default model, [bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn), although, of course, all models have their limitations. Whatever the case, important sections of the text are entirely removed (such as the portion about "stolen" training data), and other sections have been spliced together in ways that no longer make sense, like the fourth paragraph's reference to "grandma".

If you believe that recursive summarization is the path forward for you, experimenting with different models, numbers of iterations, and model combinations will yield valuable insight into which setup works best for you and your texts. Prototyping with [Krixik pipelines](https://krixik-docs.readthedocs.io/en/latest/) will allow you to run these experiments quickly, clearly, and cost-effectively. For instance, how would the result of running our <u>1984</u> excerpt through the [text summarization](https://huggingface.co/Falconsai/text_summarization) model compare to our above result through the default model? Instantiate a pipeline, give it a shot, and find out!
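
When planning such experiments, it can help to enumerate the model sequences worth trying. The sketch below is only a planning aid: `model_sequences` is a hypothetical helper, and while `'text-summarization'` is the identifier used in the code above, `'bart-large-cnn'` is assumed here to be the identifier of the default model.

```python
from itertools import product
from typing import Iterator, List, Tuple


def model_sequences(models: List[str], max_passes: int) -> Iterator[Tuple[str, ...]]:
    """Yield every ordered sequence of models, from 1 pass up to `max_passes` passes."""
    for passes in range(1, max_passes + 1):
        yield from product(models, repeat=passes)


# 'bart-large-cnn' is an assumed identifier for the default model
for sequence in model_sequences(['bart-large-cnn', 'text-summarization'], max_passes=2):
    print(' -> '.join(sequence))
```

The number of sequences grows exponentially with the number of passes, so in practice it is worth capping `max_passes` at a small value.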