# Summarizing Long Text into User-Defined Segments Using AI21 Models

In this notebook, we will demonstrate how to use the AI21 Studio [Summarize-by-Segment](https://docs.ai21.com/docs/summarize-by-segment-api) Task Specific Model to summarize a long text into shorter segments. 

This notebook specifically shows how to use the Summarize-by-Segment Task Specific Model while customizing the number of segments.

### What is a Task Specific Model (TSM)?

[Task Specific Models](https://docs.ai21.com/docs/task-specific) are Foundation Models that AI21 has trained to perform optimally on specific sets of tasks, such as answering questions based on a context or summarizing text. These Task Specific Models are not just fine-tuned models: they include optimizations, pre-processing, and post-processing of text that reduce hallucinations, latency, and cost while increasing accuracy. The schematic below gives an overview of how they work.

![below](./img/1.png)

As the image above shows, Task Specific Models are actually systems that pre-process and post-process the input data while retaining the advantages of fine-tuned Foundation Models.

The Summarize by Segment API breaks a piece of text into logical segments and returns summarized content for each segment, rather than one overall summary. Some key advantages of this approach are:


* **Reduced Hallucinations:** The Summarize-by-Segment Task-Specific Model is designed to reduce hallucinations, as noted above for Task Specific Models in general.
* **Increased efficiency and productivity:** By summarizing large text documents into smaller segments (rather than just a single short-form summary), users can easily navigate the information and find what they need, reducing the amount of time spent searching for specific data.
* **Improved comprehension and retention:** By breaking down complex information into more manageable chunks, users are more likely to understand and retain the information, leading to better decision-making.

In addition, users can easily customize the output of the TSMs. This notebook also introduces a function that lets users specify the number of segments they want in the final result.

### How to Leverage TSMs?

Many AI21 TSMs are available for easy deployment within [SageMaker JumpStart](https://www.ai21.com/blog/al21-labs-large-language-models-discoverable-in-amazon-sagemaker-studio), but you can also use them directly from AI21 Studio, which provides API access to the models. We will use AI21 Studio API access for this notebook.

## Install Libraries as Needed

In [None]:
!pip install ai21

## Instantiate AI21 Client

In [2]:
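import os

# Enable verbose SDK logging; remove this line for quieter output
os.environ["AI21_LOG_LEVEL"] = "DEBUG"

from ai21 import AI21Client

# With no arguments, the client is expected to read the API key
# from the AI21_API_KEY environment variable
client = AI21Client()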
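
Creating the client fails or behaves unexpectedly if no API key is configured, so it can help to check for the key first. The snippet below is a minimal sketch, assuming the key is supplied through the `AI21_API_KEY` environment variable; `require_api_key` is a hypothetical helper written for this notebook, not part of the AI21 SDK.

```python
import os


def require_api_key(env=os.environ, name="AI21_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    Hypothetical helper: the variable name is an assumption about how
    the client is configured in this notebook.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before creating the client."
        )
    return key


# Example with an explicit mapping, so no real key is read or printed.
print(require_api_key({"AI21_API_KEY": "example-key"}))
```

In the notebook itself, calling `require_api_key()` with no arguments before `AI21Client()` would surface a missing key early.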

Get the article content and summarize it into segments.

In [17]:
response = client.summarize_by_segment.create(
    source="https://edition.cnn.com/2023/05/10/tech/google-io-event-products/index.html",
    source_type="URL",
)
segments = response.segments
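
Before loading the segments into a dataframe, a quick preview shows which segments received a summary. This is a rough sketch that assumes each segment exposes `summary` and `segment_text` (the two fields selected later in this notebook). `ToySegment` and `preview` are hypothetical names, and the toy summary text is invented so that the example runs without calling the API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySegment:
    """Hypothetical stand-in for one item of response.segments."""
    summary: Optional[str]
    segment_text: str


def preview(segments, width=40):
    """Return one short line per segment: index, summary status, text start."""
    lines = []
    for i, seg in enumerate(segments):
        status = "summary" if (seg.summary or "").strip() else "blank"
        # Collapse newlines and repeated spaces before truncating
        start = " ".join(seg.segment_text.split())[:width]
        lines.append(f"{i}: [{status}] {start}")
    return lines


toy_segments = [
    ToySegment(summary=None, segment_text="\n  CNN\n"),
    ToySegment(
        summary="Google unveiled new hardware.",
        segment_text="\n Google on Wednesday unveiled its latest lineup",
    ),
]
for line in preview(toy_segments):
    print(line)
```

If the real segment objects have the same two attributes, `preview(segments)` should work on them unchanged.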

### Put the summaries and raw segments in a dataframe.

Next, we will store both the summaries and the raw segments in a dataframe. Note that some of the summaries are blank; these are segments of text where the summarization model decided that there was no data to summarize.

In [20]:
import pandas as pd
segments_df = pd.DataFrame(segments)
segments_df = segments_df[['summary', 'segment_text']]
segments_df.columns = ['Summary', 'SegmentText']
segments_df['Summary'] = segments_df['Summary'].fillna('')

In [21]:
segments_df.head()

| | Summary | SegmentText |
|---|---|---|
| 0 | | \n \n CNN\n — \n \n |
| 1 | Google unveiled its latest lineup of hardware ... | \n Google on Wednesday unveiled its... |
| 2 | | \n Here’s what Google announced at ... |
| 3 | Google unveiled the Pixel Fold, a foldable sma... | \n Google became the latest tech co... |
| 4 | The Google Fold includes features you'd find o... | \n The Google Fold includes feature... |
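
Because some summaries are blank, it can be useful to count them or filter them out before further processing. The following is a small sketch on a toy dataframe with the same two columns as `segments_df`; the rows are invented for illustration, and whether blank rows should be dropped depends on your use case.

```python
import pandas as pd

# Toy stand-in for segments_df; the rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "Summary": ["", "First point.", "", "Second point."],
        "SegmentText": ["header", "body one", "list intro", "body two"],
    }
)

# A summary counts as blank if it is empty or whitespace-only.
is_blank = toy_df["Summary"].str.strip().eq("")
blank_count = int(is_blank.sum())

# Keep only rows that carry a summary, and renumber the index from 0.
non_blank_df = toy_df[~is_blank].reset_index(drop=True)

print(f"{blank_count} of {len(toy_df)} segments have no summary")
```

Note that dropping blank rows also discards their raw text, whereas merging neighboring segments keeps it.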


### Create custom number of segments
Users may want to provide a custom number of segments/highlights rather than relying on the default number returned by the AI21 API. The following code shows how to do so by repeatedly merging the shortest segment into one of its neighbors.

In [22]:
def get_user_defined_segments(df, desired_segments):
    """
    Merge neighboring segments until at most `desired_segments` rows remain.

    At each step, the shortest segment is merged with its shorter neighbor.
    The input dataframe is left unchanged; a new dataframe is returned.
    """
    if desired_segments < 1:
        raise ValueError("desired_segments must be at least 1")

    # Work on a fresh dataframe with a 0..n-1 index, so that index labels
    # match row positions and the caller's dataframe is not mutated
    df = df.reset_index(drop=True)

    while len(df) > desired_segments:
        # Find the index of the shortest segment
        shortest_idx = df['SegmentText'].str.len().idxmin()

        # Determine the neighbors' indices (clamped at both ends)
        left_idx = max(0, shortest_idx - 1)
        right_idx = min(len(df) - 1, shortest_idx + 1)

        # Find the length of the neighbors
        left_len = len(df.at[left_idx, 'SegmentText'])
        right_len = len(df.at[right_idx, 'SegmentText'])

        # Merge with the shorter neighbor; the first row can only merge right
        if shortest_idx == 0 or (shortest_idx < len(df) - 1 and right_len < left_len):
            # Merge with the right neighbor (shortest segment comes first)
            df.at[shortest_idx, 'Summary'] += " " + df.at[right_idx, 'Summary']
            df.at[shortest_idx, 'SegmentText'] += " " + df.at[right_idx, 'SegmentText']
            df = df.drop(right_idx).reset_index(drop=True)
        else:
            # Merge with the left neighbor (left segment comes first)
            df.at[left_idx, 'Summary'] += " " + df.at[shortest_idx, 'Summary']
            df.at[left_idx, 'SegmentText'] += " " + df.at[shortest_idx, 'SegmentText']
            df = df.drop(shortest_idx).reset_index(drop=True)

    return df

# Merge the segments down to the desired number and preview the result
num_segments_desired = 5
segments_df_merged = get_user_defined_segments(segments_df, num_segments_desired)
segments_df_merged.head()

| | Summary | SegmentText |
|---|---|---|
| 0 | Google unveiled its latest lineup of hardware... | \n \n CNN\n — \n \n \n ... |
| 1 | Google is far from the first to embrace foldab... | \n Google is far from the first to ... |
| 2 | The Pixel 7a looks similar to the Pixel 7 and ... | \n On the surface, the 7a looks sim... |
| 3 | Google sells between eight and 10 million Pi... | \n The 7a also supports many signif... |
| 4 | Google is introducing a new 11-inch tablet, th... | \n Under the hood, the 11-inch tabl... |
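
To make the merging strategy of `get_user_defined_segments` easier to follow, here is a rough pure-Python sketch of the same greedy idea on a tiny hand-made list. `merge_to_n` and the toy data are hypothetical and exist only for illustration. Unlike the dataframe version, this sketch strips stray spaces from merged summaries as it goes.

```python
def merge_to_n(segments, n):
    """Greedily merge the shortest segment into a neighbor until n remain.

    `segments` is a list of (summary, text) tuples; a new list is returned.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    items = [list(pair) for pair in segments]
    while len(items) > n:
        # Position of the shortest text (first one wins on ties)
        i = min(range(len(items)), key=lambda k: len(items[k][1]))
        if i == 0:
            j = 1  # first item can only merge with its right neighbor
        elif i == len(items) - 1:
            j = i - 1  # last item can only merge with its left neighbor
        else:
            # Pick the shorter neighbor; ties go to the left
            right_shorter = len(items[i + 1][1]) < len(items[i - 1][1])
            j = i + 1 if right_shorter else i - 1
        lo, hi = min(i, j), max(i, j)
        # The earlier item always comes first in the merged result
        items[lo][0] = f"{items[lo][0]} {items[hi][0]}".strip()
        items[lo][1] = f"{items[lo][1]} {items[hi][1]}"
        del items[hi]
    return [tuple(item) for item in items]


toy = [("", "hdr"), ("A.", "aaaaaaaa"), ("B.", "bbbb"), ("C.", "cccccccccc")]
print(merge_to_n(toy, 2))
# → [('A.', 'hdr aaaaaaaa'), ('B. C.', 'bbbb cccccccccc')]
```

In the toy run, the blank header row is absorbed into the first real segment, and the short `B.` segment then joins `C.`, its shorter neighbor at that point.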


### Save the final results in Markdown

In [23]:
# Build one Markdown bullet per merged summary
bullet_list = [f"* {summary.strip()}" for summary in segments_df_merged["Summary"]]
bullets_s = "\n\n".join(bullet_list)

# Write the result; the context manager closes the file even if writing fails
with open("Segmented_Summary.md", "w", encoding="utf-8") as f_out:
    print(f"KEY POINTS\n\n{bullets_s}", file=f_out)
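
The formatting step can also be wrapped in a small function that returns the Markdown as a string, which makes it easy to inspect before writing to disk. This is a sketch only: `to_markdown_bullets` is a hypothetical helper, and unlike the cell above it skips blank summaries instead of emitting empty bullets.

```python
def to_markdown_bullets(summaries, title="KEY POINTS"):
    """Format summaries as a Markdown bullet list under a title line."""
    # Drop empty or whitespace-only summaries and trim the rest
    points = [s.strip() for s in summaries if s and s.strip()]
    bullets = "\n\n".join(f"* {p}" for p in points)
    return f"{title}\n\n{bullets}\n"


print(to_markdown_bullets([" First point. ", "", "Second point."]))
```

With the merged dataframe from this notebook, the call would be `to_markdown_bullets(segments_df_merged["Summary"])`.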


The final result should look, in part, like this:

![image.png](attachment:b50d27cc-a88c-4bdd-82d1-9c7a9eed3ff9.png)