In [1]:
import torch
from transformers import pipeline

In [2]:
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())

2.0.1
11.8
True


In [3]:
from transformers import BertTokenizerFast, pipeline

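# Initialize the BERT tokenizer (used later only to count tokens)
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Initialize the summarizer on GPU 0. No model is named, so the pipeline falls back
# to its default checkpoint, sshleifer/distilbart-cnn-12-6 (a DistilBART model, not BERT)
bert_summarizer = pipeline("summarization", device=0)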


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [4]:
import re

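def clean_text(file_path):
    """Read a text file and flatten it into a single line of text."""
    with open(file_path, 'r') as f:
        text = f.read()

    # (pattern, replacement) pairs, applied in order
    patterns = [
        # A line holding only digits (e.g. a page number) is kept inline
        (r'\n([0-9]+)\n', r' \1 '),
        # Remaining newlines become spaces
        (r'\n', ' '),
        # Quotes with a non-word character on at least one side become spaces;
        # apostrophes inside words ("contractor's") are kept
        (r"(?<!\w)'|'(?!\w)", ' ')
    ]

    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)

    return text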

file_paths = ['../data/text_l1.txt', '../data/text_l2.txt', '../data/text_l3.txt']
texts = {}

for file_path in file_paths:
    # Split the file_path to get the base name, then remove the extension to get the variable name
    var_name = file_path.split('/')[-1].split('.')[0]
    # Store the cleaned text in the dictionary with the variable name as the key
    texts[var_name] = clean_text(file_path)

print(texts['text_l3'][:1000])


1 EN Council of the European Union General Conditions General Secretariat June 2016 - EN GENERAL CONDITIONS OF THE CONTRACT The contract consists of a purchase order and these general conditions, including the Annex on security measures. If there is any conflict between different provisions in the contract, the following rules must be applied: (a) the provisions set out in the purchase order take precedence over those set out in the general conditions; (b) the provisions set out in the general conditions take precedence over those set out in the tender specifications; (c) the provisions set out in the tender specifications take precedence over those set out in the contractor's tender. All documents issued by the contractor (end-user agreements, general terms and conditions, etc.), with the exception of its tender, are held inapplicable, unless explicitly mentioned in the special conditions of the contract. In all circumstances, in the event of contradiction between the contract and doc
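
The effect of the three substitutions in clean_text is easier to see on a short string than on a whole file. The sketch below applies the same patterns in memory; the helper name clean_string and the sample text are made up for illustration and are not part of the notebook.

```python
import re

# Same three substitutions as clean_text, applied to an in-memory string
PATTERNS = [
    (r'\n([0-9]+)\n', r' \1 '),   # a line holding only digits -> kept inline
    (r'\n', ' '),                 # remaining newlines -> spaces
    (r"(?<!\w)'|'(?!\w)", ' '),   # quotes lacking a word character on one side -> space
]

def clean_string(text):
    # Hypothetical helper: clean_text without the file read
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text

sample = "the contractor's tender\n12\n'General' conditions\nof the contract"
cleaned = clean_string(sample)
print(cleaned)
```

The apostrophe in "contractor's" survives because it has word characters on both sides, while the quotes around 'General' are replaced by spaces, which leaves double spaces in the result.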

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Measure the length of a text in BERT tokens (used as the splitter's length function)
def bert_len(text):
    tokens = bert_tokenizer.encode(
        text,
        add_special_tokens=True,
        max_length=len(text),
        truncation=True
    )
    return len(tokens)

# Split on sentence boundaries into chunks of at most 500 BERT tokens, with a 30-token overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=30,
    length_function=bert_len,
    separators=[". ", ""]
)

def remove_duplicates(text):
    # Remove duplicate sentences from the text
    seen = set()
    return ". ".join([sentence for sentence in text.split(". ") if sentence not in seen and not seen.add(sentence)])

def generate_summary(text):
    # Summarize the text
    summary = bert_summarizer(text, max_length=500, min_length=25, do_sample=False)

    return summary[0]['summary_text']

def summarize_large_text(text):
    # Split the text into chunks
    chunks = text_splitter.split_text(text)

    # Summarize each chunk
    chunk_summaries = [generate_summary(chunk) for chunk in chunks]

    # Concatenate the chunk summaries
    text = ". ".join(chunk_summaries)

    # Remove duplicates
    text = remove_duplicates(text)

    return text

large_text = texts['text_l3']
summarized_text = summarize_large_text(large_text)
final_text = summarized_text

# Re-summarize until the summary is at most 4000 words long
while len(final_text.split()) > 4000:
    final_text = summarize_large_text(final_text)


Your max_length is set to 500, but you input_length is only 462. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=231)
Your max_length is set to 500, but you input_length is only 366. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=183)
Your max_length is set to 500, but you input_length is only 462. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=231)
Your max_length is set to 500, but you input_length is only 413. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=206)
Your max_length is set to 500, but you input_length is only 418. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=209)
Your max_length is set to 500, but you input_length is only 438. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=219)
Your max_length is set to 500, but you input_length is only 436. You m

 The contract consists of a purchase order and these general conditions, including the Annex on security measures . If there is any conflict between different provisions in the contract, the following rules must be applied . All documents issued by the contractor are held inapplicable, unless explicitly mentioned in the special conditions of the contract ..  The platform may be used to exchange electronic documents (e-documents) such as electronic invoices between the parties . This is done either through web services, with a machine-to-machine connection between parties, or through a web application (the supplier portal).  The e-PRIOR portal, which allows the contractor to exchange electronic business documents, such as invoices, through a graphical user interface, is updated on a regular basis . Pre-existing material : any material, document, technology or know-how which exists prior to the contractor using it for the production of a result in the performance of the contract ..  If a
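
The warnings above appear because max_length=500 is larger than every chunk, and each one suggests roughly half the input length instead (462 gives 231, 366 gives 183). A minimal sketch of choosing the bound per chunk follows; summary_length_bounds is a hypothetical helper, and halving the input is taken from the warning text rather than from any tuning.

```python
def summary_length_bounds(input_length, min_length=25, cap=500):
    """Return (min_length, max_length) for an input of `input_length` tokens."""
    # Half the input length, mirroring the suggestion in the warning text,
    # but never above the original cap of 500 nor below min_length
    max_length = max(min_length, min(cap, input_length // 2))
    return min_length, max_length

print(summary_length_bounds(462))
print(summary_length_bounds(366))
print(summary_length_bounds(2000))
```

The two values would then be passed to the summarizer in place of the fixed max_length=500 and min_length=25. The input length should be counted with the summarizer's own tokenizer, since bert_len counts BERT tokens and the two tokenizers need not agree.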

In [6]:
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer
import torch

# Initialize the tokenizer and model
pegasus_tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-bigpatent")
pegasus_model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-bigpatent")

# Check if a GPU is available and if not, use a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Move the model to the device
pegasus_model.to(device)

BigBirdPegasusForConditionalGeneration(
  (model): BigBirdPegasusModel(
    (shared): Embedding(96103, 1024, padding_idx=0)
    (encoder): BigBirdPegasusEncoder(
      (embed_tokens): Embedding(96103, 1024, padding_idx=0)
      (embed_positions): BigBirdPegasusLearnedPositionalEmbedding(4096, 1024)
      (layers): ModuleList(
        (0-15): 16 x BigBirdPegasusEncoderLayer(
          (self_attn): BigBirdPegasusEncoderAttention(
            (self): BigBirdPegasusBlockSparseAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=False)
              (key): Linear(in_features=1024, out_features=1024, bias=False)
              (value): Linear(in_features=1024, out_features=1024, bias=False)
            )
            (output): Linear(in_features=1024, out_features=1024, bias=False)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=1

In [26]:
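# Ask the model to explain the final text
explanation_input = "explain this " + final_text
# Truncate to the encoder's 4096 positions so a long summary cannot overflow the model
inputs = pegasus_tokenizer(explanation_input, return_tensors='pt', truncation=True, max_length=4096).to(device)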
explanation_ids = pegasus_model.generate(
    **inputs,
    max_length=256, 
    do_sample=True, 
    num_beams=5, 
    length_penalty=1.0,
    repetition_penalty=1.4
    )
explanation = pegasus_tokenizer.decode(explanation_ids[0], skip_special_tokens=True)

# Delete the tensors
del inputs
del explanation_ids

# Empty the cache
torch.cuda.empty_cache()

A platform for the exchange of electronic business documents, such as electronic invoices between the parties. The platform may be used either through web services, with a machine-to-machine connection between parties, or through a web application.


In [53]:
def arrange_text(text):
    # Collapse ".." into "."
    text = re.sub(r'\.\.', '.', text)
    # Remove the space before a "."
    text = re.sub(r' \.', '.', text)
    # Start a new line after each "."
    text = re.sub(r'\.', '.\n', text)
    # Collapse runs of spaces into a single space
    text = re.sub(r' +', ' ', text)
    # Prefix every non-empty line with "• "
    text = re.sub(r'(^|\n)([^\n])', r'\1• \2', text)
    # Drop any line with fewer than 10 alphanumeric characters after the "•"
    text = re.sub(r'\n• [^a-zA-Z0-9]*([a-zA-Z0-9][^a-zA-Z0-9]*){0,9}\n', '\n', text)
    # Drop a final line that holds only a "•" followed by two or more spaces
    text = re.sub(r'\n•  +\n$', '\n', text)
    # Drop a trailing "•" followed by two or more spaces
    text = re.sub(r'\n•  +$', '', text)

    return text


In [54]:
print(arrange_text(final_text), '\n\n', "\033[1m" + explanation + "\033[0m")

•  The contract consists of a purchase order and these general conditions, including the Annex on security measures.
•  If there is any conflict between different provisions in the contract, the following rules must be applied.
•  All documents issued by the contractor are held inapplicable, unless explicitly mentioned in the special conditions of the contract.
•  The platform may be used to exchange electronic documents (e-documents) such as electronic invoices between the parties.
•  This is done either through web services, with a machine-to-machine connection between parties, or through a web application (the supplier portal).
•  The e-PRIOR portal, which allows the contractor to exchange electronic business documents, such as invoices, through a graphical user interface, is updated on a regular basis.
•  Pre-existing material : any material, document, technology or know-how which exists prior to the contractor using it for the production of a result in the performance of the contr
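
Because arrange_text breaks the line after every period, a decimal such as "1.5" would be split in two, and each bullet keeps the leading space of its sentence (hence the double space after "•" above). The sketch below is a split-based alternative, not the notebook's regex chain: it splits only where a period is followed by whitespace and strips each sentence. The name to_bullets, the min_chars parameter and the sample text are assumptions made for illustration.

```python
import re

def to_bullets(text, min_chars=10):
    # Normalise the " ." and ".." artifacts left by joining chunk summaries
    text = re.sub(r'\s+\.', '.', text)
    text = re.sub(r'\.{2,}', '.', text)
    # Split after a period that is followed by whitespace, so "1.5" stays whole
    sentences = re.split(r'(?<=\.)\s+', text)
    bullets = []
    for sentence in sentences:
        sentence = sentence.strip()
        # Skip fragments with fewer than min_chars letters or digits
        if sum(ch.isalnum() for ch in sentence) < min_chars:
            continue
        bullets.append('• ' + sentence)
    return '\n'.join(bullets)

sample = " The contract consists of a purchase order . Fees rise by 1.5 percent per year ..  Ok. "
print(to_bullets(sample))
```

The short fragment "Ok." is dropped by the length filter, which plays the same role as the 10-character rule in arrange_text.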

In [20]:
from transformers import BertTokenizerFast, pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer
import torch
import re

In [21]:
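# Initialize the BERT tokenizer (used only to count tokens) and the summarizer,
# which defaults to sshleifer/distilbart-cnn-12-6 because no model is named
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_summarizer = pipeline("summarization", device=0)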

# Measure the length of a text
def bert_len(text):
    tokens = bert_tokenizer.encode(
        text,
        add_special_tokens=True,
        max_length=len(text),
        truncation=True
    )
    return len(tokens)

# Text splitter initialization
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  
    chunk_overlap=30,
    length_function=bert_len,
    separators=[". ", ""]
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [32]:
# Remove duplicate sentences from the text 
def remove_duplicates(text):
    seen = set()
    return ". ".join([sentence for sentence in text.split(". ") if sentence not in seen and not seen.add(sentence)])

# Summarize a text
def generate_summary(text):  
    summary = bert_summarizer(text, max_length=500, min_length=25, do_sample=False)
    return summary[0]['summary_text']

# Split the text into chunks, summarize each chunk, then join and de-duplicate the summaries
def summarize_chunks(text):
    chunks = text_splitter.split_text(text)
    chunk_summaries = [generate_summary(chunk) for chunk in chunks]
    text = ". ".join(chunk_summaries)
    text = remove_duplicates(text)
    return text

# Arrange the summary
def arrange_text(text):
    # Collapse ".." into "."
    text = re.sub(r'\.\.', '.', text)
    # Remove the space before a "."
    text = re.sub(r' \.', '.', text)
    # Start a new line after each "."
    text = re.sub(r'\.', '.\n', text)
    # Collapse runs of spaces into a single space
    text = re.sub(r' +', ' ', text)
    # Prefix every non-empty line with "• "
    text = re.sub(r'(^|\n)([^\n])', r'\1• \2', text)
    # Drop any line with fewer than 10 alphanumeric characters after the "•"
    text = re.sub(r'\n• [^a-zA-Z0-9]*([a-zA-Z0-9][^a-zA-Z0-9]*){0,9}\n', '\n', text)
    # Drop a final line that holds only a "•" followed by two or more spaces
    text = re.sub(r'\n•  +\n$', '\n', text)
    # Drop a trailing "•" followed by two or more spaces
    text = re.sub(r'\n•  +$', '', text)
    return text

def generate_explanation(final_text, device=None):
    # Initialize the model & tokenizer
    pegasus_tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-bigpatent")
    pegasus_model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-bigpatent")

    # Use the requested device; by default pick a GPU if one is available, else the CPU
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    device = torch.device(device)
    
    # Move the model to the device
    pegasus_model.to(device)

    # Build the prompt and tokenize it, truncating to the encoder's 4096 positions
    explanation_input = "explain this " + final_text
    inputs = pegasus_tokenizer(explanation_input, return_tensors='pt', truncation=True, max_length=4096).to(device)

    explanation_ids = pegasus_model.generate(
        **inputs,
        max_length=256, 
        do_sample=True, 
        num_beams=5, 
        length_penalty=1.0,
        repetition_penalty=1.4
    )

    explanation = pegasus_tokenizer.decode(explanation_ids[0], skip_special_tokens=True)
    
    # Delete the tensors
    del inputs
    del explanation_ids
    
    # Empty the cache
    torch.cuda.empty_cache()

    return arrange_text(final_text), '\n\n', "\033[1m" + explanation + "\033[0m"

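def summarize_doc(text):
    text = summarize_chunks(text)
    # Re-summarize until the summary is at most 4000 words long
    while len(text.split()) > 4000:
        text = summarize_chunks(text)

    # generate_explanation returns a tuple: (bulleted summary, separator, bold explanation)
    result = generate_explanation(text)
    print(*result)
    # Return the tuple so the caller's assignment does not receive None
    return result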

In [33]:
with open('../data/text_l2.txt') as f:
    test = f.read()

In [34]:
test = summarize_doc(test)

Your max_length is set to 500, but you input_length is only 480. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=240)
Your max_length is set to 500, but you input_length is only 487. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=243)
Your max_length is set to 500, but you input_length is only 495. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=247)
Your max_length is set to 500, but you input_length is only 471. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=235)
Your max_length is set to 500, but you input_length is only 485. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=242)
Your max_length is set to 500, but you input_length is only 436. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=218)
Your max_length is set to 500, but you input_length is only 482. You m

•  Nike's privacy policy explains how your data is used, shared and protected.
•  It also explains what choices you have relating to your personal data and how you can contact us.
•  Who is responsible for processing of your data will depend on how you interact with Nike's Platform.
•  Nike does not allow children to register on our Platform when they are under the legal age limit of the country in which they reside.
•  We use your personal data in the following ways:To provide the features of the Platform and Services You Request.
•  To provide features of our Platform, we will use your data to provide the requested product or service.
•  If you use our Platform to track your fitness activity or physical characteristics, we will collect this personal data and store it so that you can review it in the Platform.
•  Your fitness activity data may include data you enter about your activity or data collected by your device during your activity.
•  Nike collects data about your fitness acti
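
The `while len(text.split()) > 4000` loop in summarize_doc assumes that every pass shortens the text; if a pass ever fails to do so, the loop would not end. A minimal sketch of a bounded version follows. The function summarize_until_short, its max_rounds parameter and the stand-in summarizer halve are hypothetical and only illustrate the control flow; in the notebook the `summarize` argument would be summarize_chunks.

```python
def summarize_until_short(text, summarize, max_words=4000, max_rounds=5):
    """Apply `summarize` repeatedly until the text has at most `max_words` words."""
    for _ in range(max_rounds):
        n_words = len(text.split())
        if n_words <= max_words:
            break
        shorter = summarize(text)
        # Stop if a pass no longer shrinks the text; otherwise the loop could spin forever
        if len(shorter.split()) >= n_words:
            break
        text = shorter
    return text

# Stand-in summarizer for demonstration only: keeps every other word
def halve(text):
    return " ".join(text.split()[::2])

doc = " ".join(["word"] * 10000)
result = summarize_until_short(doc, halve)
print(len(result.split()))
```

With the stand-in summarizer the 10000-word input shrinks to 5000 and then 2500 words, at which point the loop stops.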