# __Information Assurance__

## __Evolving trends in Information Assurance, an NLP analysis of the Literature from 1967 to 2024__

In this work, we carry out a Systematic Topic Review (STR) for topic extraction and use Chain of Density (CoD) prompting combined with few-shot examples to summarize the contents.

The ultimate goal of this work is to produce a summarization for each decade from 1967 to the present in order to understand what has been researched in the field of information assurance.

## <font color='blue'>__Large Corpus Summarization__</font>

We will explore different summarization techniques and compare their outcomes.




## __Preamble__

We will start by installing a number of packages that we are going to use throughout this example:

In [None]:
#import locale
#locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
%%capture
!pip install bertopic accelerate bitsandbytes xformers adjustText huggingface_hub openai

In [None]:
%%capture
!pip install transformers openai huggingface_hub

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import pickle

In [None]:
#hugging face token
token_hf="here your token"
from huggingface_hub import login
login(token_hf) #, add_to_git_credential=True)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# 📄 **Data**

The data is the output of the Systematic Topic Review process.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
path_file = '/content/gdrive/MyDrive/_RESEACH/Information_assurance/Data/'
path_file_working = '/content/gdrive/MyDrive/_RESEACH/Information_assurance/Data/Working'

## <font color='red'>Start here !!</font>

In [None]:
# CHECK POINT to START

with open(path_file + "dataset_info_assurance.pkl", "rb") as f:
  dataset = pickle.load(f)


In [None]:
# Create a new dataframe called df_abstract with the 'Year' column of dataset and an 'abstract' column built as Paper_id + ': ' + Abstract
df_abstract = pd.DataFrame({'Year': dataset['Year'], 'abstract': dataset['Paper_id'] + ': ' + dataset['Abstract']})
df_abstract.sample(5)

Unnamed: 0,Year,abstract
43742,2022,"[Silberman D.M., 2022]: This paper describes a..."
38012,2021,"[Chhetri C., 2021]: In today’s world, technolo..."
30596,2020,"[Walker T., 2020]: One of the core functions o..."
29412,2020,"[Bernal S.L., 2020]: Bacteria are microorganis..."
11057,2013,"[Adachi S., 2013]: NTT Innovation Institute, I..."


## __Summarizations__


### __Analysis of the period 2020 - 2024__

In [None]:
# Select the abstracts published between 2020 and 2024 (each one is already prefixed with its Paper_id)
abstracts = df_abstract[df_abstract['Year'].between(2020, 2024)]['abstract'].rename('abstract')
titles = dataset["Title"]
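
Since the goal is one summarization per decade, the year filter above could be generalized with a small labeling helper. The following is a minimal sketch, assuming calendar decades (1960 - 1969, 1970 - 1979, and so on); `decade_label` and `sample_years` are hypothetical names that do not appear elsewhere in this notebook.

```python
def decade_label(year):
    """Map a publication year to a label such as '2000 - 2009' (calendar decades assumed)."""
    start = (int(year) // 10) * 10
    return f"{start} - {start + 9}"

# Hypothetical sample of publication years, used only for illustration
sample_years = [1967, 1979, 2000, 2009, 2023]
labels = [decade_label(y) for y in sample_years]
print(labels)
```

On the real data, the same function could be applied with `df_abstract['Year'].map(decade_label)` to obtain a grouping column, provided the `Year` values are numeric.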

In [None]:
abstracts

7426     [Zhang F., 2023]: By utilizing trusted computi...
7429     [You Z., 2023]: In recent years, cyber securit...
7444     [Chen Z., 2023]: With the popularity of digita...
27339    [Galazkiewicz A., 2020]: Multiparty computatio...
27340    [Qi K., 2020]: With the wide use of computer n...
                               ...                        
58491    [Mikhav V., 2023]: Recommender systems make it...
58492    [Chatterjee S., 2023]: Cyberattacks are occurr...
58493    [Sikora L.S., 2023]: The complication of techn...
58494    [More Valencia R.A., 2023]: The effects of inf...
58495    [Kropachev N.M., 2023]: Among the main directi...
Name: abstract, Length: 31160, dtype: object

In [None]:
# remove the tags <b> and <i> from abstracts

import re
abstracts = abstracts.apply(lambda x: re.sub(r'<b>|<\/b>|<i>|<\/i>', '', x))
abstracts.reset_index(drop=True, inplace=True)
abstracts

0        [Zhang F., 2023]: By utilizing trusted computi...
1        [You Z., 2023]: In recent years, cyber securit...
2        [Chen Z., 2023]: With the popularity of digita...
3        [Galazkiewicz A., 2020]: Multiparty computatio...
4        [Qi K., 2020]: With the wide use of computer n...
                               ...                        
31155    [Mikhav V., 2023]: Recommender systems make it...
31156    [Chatterjee S., 2023]: Cyberattacks are occurr...
31157    [Sikora L.S., 2023]: The complication of techn...
31158    [More Valencia R.A., 2023]: The effects of inf...
31159    [Kropachev N.M., 2023]: Among the main directi...
Name: abstract, Length: 31160, dtype: object

In [None]:
abstracts.iloc[300]

'[Folino G., 2020]: Intrusion detection tools have largely benefitted from the usage of supervised classification methods developed in the field of data mining. However, the data produced by modern system/network logs pose many problems, such as the streaming and non-stationary nature of such data, their volume and velocity, and the presence of imbalanced classes. Classifier ensembles look a valid solution for this scenario, owing to their flexibility and scalability. In particular, data-driven schemes for combining the predictions of multiple classifiers have been shown superior to traditional fixed aggregation criteria (e.g., predictions’ averaging and weighted voting). In intrusion detection settings, however, such schemes must be devised in an efficient way, since (part of) the ensemble may need to be re-trained frequently. A novel ensemble-based framework is proposed here for the online intrusion detection, where the ensemble is updated through an incremental stream-oriented learn

In [None]:
# TO DELETE. prompt: remove the tags <b> and <i> from values in dictionary_topicos_mistral

dictionary_topicos_mistral = {key: value.apply(lambda x: re.sub(r'<b>|<\/b>|<i>|<\/i>', '', x)) for key, value in dictionary_topicos_mistral.items()}


In [None]:
# This function wraps the text to the terminal width
import textwrap
import shutil

def adjusted_text(text):
    screen_width = shutil.get_terminal_size().columns  # Get the terminal width
    wrapped_text = textwrap.fill(text, width=screen_width)  # Wrap the text
    print(wrapped_text)

In [None]:
def divide_abstracts(abstracts, words_limit=50_000):
    """
    Greedily groups abstracts into chunks that do not exceed a word limit.

    Args:
        abstracts: An iterable of strings containing abstracts.
        words_limit: Maximum number of whitespace-separated words per chunk.

    Returns:
        A list of strings, one per chunk, where each string joins the abstracts
        of that chunk with a newline separator.
    """

    total_words = 0
    chunks = []
    combined_chunks = []
    current_chunk = []

    for abstract in abstracts:
        word_count = len(abstract.split())
        if total_words + word_count <= words_limit:
            current_chunk.append(abstract)
            total_words += word_count
        else:
            chunks.append(current_chunk)
            current_chunk = [abstract]
            total_words = word_count

    if current_chunk:
        chunks.append(current_chunk)
    for chunk in chunks:
        combined_chunks.append("\n ".join(chunk))

    return combined_chunks


In [None]:
combined_chunks = divide_abstracts(abstracts)

# Print the number of chunks
print(f"Number of chunks: {len(combined_chunks)}")

# Print the first chunk
print(f"First chunk: {combined_chunks[0][:50]}")

Number of chunks: 120
First chunk: [Zhang F., 2023]: By utilizing trusted computing a
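
A quick sanity check can confirm that no chunk exceeds the word budget. Because the packing is greedy, a single abstract longer than the limit would still end up in an oversized chunk of its own. The sketch below is an illustration only; `oversized_chunks` and `toy_chunks` are hypothetical names, and the function counts whitespace-separated words in the same way as `divide_abstracts`.

```python
def oversized_chunks(chunks, words_limit=50_000):
    """Return (index, word_count) pairs for chunks whose word count exceeds the limit."""
    return [
        (i, len(chunk.split()))
        for i, chunk in enumerate(chunks)
        if len(chunk.split()) > words_limit
    ]

# Toy data for illustration; on the real data one would pass combined_chunks
toy_chunks = ["one two three", "four five", "six seven eight nine"]
print(oversized_chunks(toy_chunks, words_limit=3))  # -> [(2, 4)]
```

An empty list means that every chunk respects the limit.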


### __Tokens per abstract__

In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer package if it is not already available
nltk.download('punkt')

tokens_per_chunk = [len(word_tokenize(text)) for text in combined_chunks]

# Print the tokens per chunk
print('\nTokens per chunk: ', *tokens_per_chunk)
total_tokens = sum(tokens_per_chunk)
print("Total tokens: ", total_tokens)

total_words = sum(len(text.split()) for text in combined_chunks)
print("Total words: ", total_words, '\n')

print('Total abstracts: ', len(abstracts))
print(f'Average Tokens: {total_tokens/ len(abstracts):6.2f}')
print(f'Average Words: {total_words/ len(abstracts):6.2f}')

print(f'Ratio words/tokens: {total_words/ total_tokens:6.2f}')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



Tokens per chunk:  56908 56923 56966 57215 57200 57055 57046 56913 56968 56797 57006 57361 56784 56864 56941 56873 56859 57073 57202 57326 57065 56958 57001 56804 56946 57185 56987 57038 56949 57257 57154 57255 57905 57281 56837 57019 57122 57342 57157 57770 57377 56958 57029 57091 57550 57076 56992 57223 57085 56978 57058 57017 56914 57118 57020 57022 57278 56907 57079 56945 57134 57216 56857 57421 57171 57023 57144 56785 57159 56909 57157 57098 57086 57045 56971 57125 56890 57382 57066 57569 57704 57374 57603 57332 57592 57617 57577 57735 57249 57672 57660 57216 57169 57302 57637 57721 57300 57766 57307 57215 57410 57523 57461 57445 57350 57766 57053 57349 57284 57441 57641 57270 57715 57520 57058 57461 57403 57548 57452 345
Total tokens:  6808480
Total words:  5937797 

Total abstracts:  31160
Average Tokens: 218.50
Average Words: 190.56
Ratio words/tokens:   0.87


In [None]:
max(tokens_per_chunk)

57905

## __Summarizing using basic prompting and state-of-the-art LLMs__

Model: gpt-4-0125-preview

In [None]:
%%capture
!pip install openai

In [None]:
from openai import OpenAI
OPENAI_API_KEY="here your token"

client = OpenAI(api_key=OPENAI_API_KEY)

### __`temperature`__

The `temperature` parameter controls the randomness of the predictions by dividing the logits by the temperature before the softmax is applied, which influences the "creativity" of the generated text. In the OpenAI API it ranges from 0 to 2. Here's how it works:

- Values close to 0 (e.g., 0.1) make the model more deterministic: it picks the most likely next token at almost every step, which leads to more predictable, conservative, and sometimes repetitive text. This is useful when the generated text needs to be focused and on-topic.
- A temperature of 1 uses the logits as they are, which is the default behavior of the model and balances randomness and predictability.
- Values above 1 increase randomness by making less likely tokens more probable. The text becomes more diverse, which can help with creative content, brainstorming, or avoiding repetitive responses, but a temperature that is too high may produce nonsensical or highly unpredictable text.

Let's try `temperature = 0`.
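
The effect of the temperature on the next-token distribution can be illustrated with a few lines of plain Python. This is a minimal sketch of softmax with temperature scaling, not the API's internal implementation; `softmax_with_temperature` and the three `toy_logits` values are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return probabilities after dividing the logits by the temperature."""
    if temperature <= 0:
        # A temperature of exactly 0 is usually handled as greedy decoding instead
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
toy_logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all the probability mass falls on the highest-scoring token, while at a temperature of 2 the three probabilities move closer together.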

### __`frequency_penalty`__

The `frequency_penalty` parameter in the OpenAI API ranges from -2.0 to 2.0. It adjusts the probability of a token according to how often that token has already appeared in the text generated so far:

- Positive values (up to 2.0) decrease the likelihood of tokens in proportion to the number of times they have already been generated, which reduces verbatim repetition and promotes greater diversity in the generated text.

- Negative values (down to -2.0) increase the likelihood of tokens that have already been generated, which can be useful to emphasize certain topics or keywords in the generated text.

- A value of 0 applies no adjustment, so the text is generated as the model determines without this penalty.

### __`presence_penalty`__

The `presence_penalty` parameter in the OpenAI API also ranges from -2.0 to 2.0. Unlike `frequency_penalty`, it depends only on whether a token has already appeared in the text, not on how many times it has appeared:

- Positive values (up to 2.0) penalize every token that has already appeared at least once, which raises the relative probability of tokens that have not yet appeared. This promotes new ideas, reduces repetition, and is useful for generating more varied text.

- Negative values (down to -2.0) make the model more likely to repeat tokens that have already appeared, which may be desirable when the aim is to reinforce or focus on ideas already mentioned.

- A value of 0 applies no adjustment, so the model generates text without additional influence towards repetition or novelty.

As with `frequency_penalty`, the optimal value of `presence_penalty` will depend on your specific objectives and how you want the generated text to balance between the introduction of new concepts and the reiteration of existing ideas. Experimenting with different values will allow you to fine-tune the behavior of text generation to better meet your needs.
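
The difference between the two penalties is easier to see on a toy example. The sketch below follows the subtractive adjustment described in the OpenAI API documentation (each logit is reduced by the token count times the frequency penalty, plus the presence penalty if the token has appeared at least once). The function `apply_penalties`, the token names, and the scores are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Return logits adjusted by frequency and presence penalties."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        adjusted[token] = (
            logit
            - count * frequency_penalty                        # grows with each repetition
            - (1.0 if count > 0 else 0.0) * presence_penalty   # flat, applied once
        )
    return adjusted

# Hypothetical scores and generation history
toy_logits = {"security": 2.0, "network": 1.5, "privacy": 1.0}
generated = ["security", "security", "network"]
print(apply_penalties(toy_logits, generated, frequency_penalty=0.5, presence_penalty=0.3))
```

In this example "security" is penalized more than "network" because it was generated twice, while "privacy" keeps its original score because it has not appeared yet.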

The model doesn't respond correctly to the `max_tokens` parameter. This could be due to:
- Efficiency in the summarization process, or
- Inability to summarize the contents of the documents

## __LangChain + CoD + F-S Prompting__


In [None]:
%%capture
!pip install langchain

In [None]:
%%capture
!pip install --upgrade --quiet  langchain-openai tiktoken chromadb langchain


In [None]:
%%capture
!pip install -U langchain-openai

In [None]:
import os

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY="here your token"


In [None]:
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-0125")
chain = load_summarize_chain(llm, chain_type="stuff")

example_summary = chain.invoke(docs)

In [None]:
adjusted_text(example_summary['output_text'])

The article discusses the concept of building autonomous agents powered by Large Language Models
(LLMs). It covers components such as planning, memory, and tool use, with examples like AutoGPT and
GPT-Engineer. Challenges include limited context length, planning difficulties, and reliability of
natural language interfaces. The article also references various studies and projects in the field
of LLM-powered autonomous agents.


In [None]:
# Loading loaders from langchain_community
from langchain_community.document_loaders import TextLoader
from langchain.prompts import PromptTemplate
import glob # PromptTemplate needs glob

# Loading chains
from langchain.chains.summarize import load_summarize_chain

# API manager
from langchain_openai import ChatOpenAI

### __Summarizing from a txt file__


In [None]:
os.chdir(path_file_working)


In [None]:
# prompt: serialize the elements of combined_chunks into several .txt files, one for each chunk
chunks = 0
for i, chunk in enumerate(combined_chunks):
  with open(f"{path_file_working}/chunk_{i+1}.txt", "w") as f:
    f.write(chunk)
    chunks += 1


In [None]:
chunks

120

In [None]:
# Loader
from langchain_community.document_loaders import TextLoader

documents = []
for i in range(1, chunks+1):
  documents.append(TextLoader(f"{path_file_working}/chunk_{i}.txt").load())


In [None]:
len(documents)

120

In [None]:
from langchain.prompts import PromptTemplate

prompt_template = """Write a concise summary of the following abstracts:
"{text}"
CONCISE SUMMARY:"""

prompt_template = PromptTemplate(template=prompt_template,
                                 input_variables=["text"])

llm = ChatOpenAI(temperature=0,
                 model_name="gpt-4-0125-preview")

chain = load_summarize_chain(llm,
                             chain_type="stuff",
                             prompt=prompt_template)


In [None]:
summarized_text['output_text']

'This collection of abstracts from the 1970s and 1979 focuses on various aspects of computer and information security, highlighting the evolving understanding and methodologies in protecting data and ensuring system integrity. Anderson (1972) discusses the challenges of securing data in multi-user environments, emphasizing that while a completely secure system is unattainable, adequately secure systems can be developed with sufficient barriers and controls. Reitman (1979) introduces an information flow logic for parallel programming languages to certify programs against security policies, extending the work of Denning and Denning on certifying the security of sequential programs through compile-time mechanisms. Koch (1975) and Walter (1975) address the necessity of securing electronic data and operating systems, respectively, with Walter proposing a Security Kernel to monitor information flow and prevent security compromises. Lennon (1978) and Blom (1978) discuss the role of cryptograp

In [None]:
adjusted_text(summarized_text['output_text'])

This collection of abstracts from the 1970s and 1979 focuses on various aspects of computer and
information security, highlighting the challenges and methodologies in protecting data and ensuring
system integrity in different environments. Anderson (1972) discusses the complexity of securing
multi-user computer systems, emphasizing the difficulty in achieving complete security but noting
the possibility of creating adequately secure systems through sufficient barriers and controls.
Reitman (1979) introduces an information flow logic for parallel programming languages to certify
programs against information security policies, extending the work of Denning and Denning on
certifying the security of sequential programs through compile-time mechanisms. Koch (1975) and
Walter (1975) address the need for comprehensive security measures against a range of threats,
including human error, equipment failure, and unauthorized access, with Walter proposing a Security
Kernel to monitor information f

The output shows slight variations between runs, even though the temperature is set to zero.

In [None]:
prompt_template = """
            CONTEXT: This is a scientific survey paper about Information Assurance.
            ROLE-- You are an expert Academic Advisor specialized in abstracts summarization to write Literature
            Review Chapters for papers in the field of Information Assurance, Security, and Cybersecurity.

            TONE-- Your interaction with users is professional, yet helpful, guiding them in structuring and
            writing effective literature reviews. Your tone is strictly academic, mirroring the formal and precise
            style of academic paper writing. This involves using scholarly language, maintaining objectivity,
            and focusing on evidence-based insights.

            PROCEDURE-- Articles: "{text}"
            Please generate increasingly entity-dense summaries of the above articles.
            Given a set of abstracts as a knowledge base, carefully read each abstract.
            In the articles you will have a reference code of the paper between square brackets at the beginning of the paper abstract.
            Step 1: Generate a first summary using the papers reference as citations.
            Repeat the following two steps (2 and 3) four times.
            Step 2. Identify 1-3 informative entities (“;” delimited) from the articles which are missing from the
            previously generated summary.
            Add these new entities to the previous ones if any.
            Step 3. Write a new summary longer than before which covers every entity and detail from the previous summary
            plus the missing entities.
            A missing entity is:
             - relevant to the main story,
             - specific yet concise (5 words or fewer),
             - novel (not in the previous summary),
             - faithful (present in the article),
             - anywhere (can be located anywhere in the article).

            GUIDELINES-- Follow these Guidelines:
             - The first summary should be long (20 sentences, more than 400 words) yet highly non-specific, containing little
             information beyond the entities marked as missing.
             - Use overly verbose language and fillers (e.g., “this article discusses...”, “In [Reitman R.P., 1979] the authors...”)
             to reach at least 400 words.
             - Do not use flamboyant language.
             - Make every word count: rewrite the previous summary to improve flow and make space for additional entities.
             - The summaries should become highly dense and concise yet self-contained, i.e., easily understood without the article.
             - Missing entities can appear anywhere in the new summary.
             - Never drop entities from the previous summary.
             - IMPORTANT: Ensure to include relevant academic citations in the classic bracketed format. In the article that I'm giving you,
             each element (row) has, at the beginning, a code to reference the paper; use these codes as references within the
             paragraph to create precise citations; this way:
             'In [Anderson J.P., 1972], the authors present the findings of a planning study that addresses the computer security
             requirements of the USAF, recommending urgent research and development to secure information processing systems for
             command, control, and support within the Air Force.'
             - IMPORTANT: some paragraph of your summary should reference to more than one paper if the content of the sentence
             you are writing is referenced in them; here you have an example: 'In [Juan Pérez J.P., 1972][Pepito Perez, 1978], the authors use the concept of...'
             - IMPORTANT: Return just the last of the four iteration answers, and just the SUMMARY, nothing else.
             - IMPORTANT: Do not include a final paragraph with a conclusion of your response.
             - Mark the output with the word "SUMMARY: " to correctly visualize your response.
             """

In [None]:
prompt_template_v2 = """
            CONTEXT: This is a scientific survey paper about Information Assurance.
            ROLE-- You are an expert Academic Advisor specialized in abstracts summarization to write Literature
            Review Chapters for papers in the field of Information Assurance, Security, and Cybersecurity.

            TONE-- Your interaction with users is professional, yet helpful, guiding them in structuring and
            writing effective literature reviews. Your tone is strictly academic, mirroring the formal and precise
            style of academic paper writing. This involves using scholarly language, maintaining objectivity,
            and focusing on evidence-based insights.

            PROCEDURE-- Articles: "{text}"
            Please generate increasingly topic-dense summaries of the above articles.
            Given a set of articles as a knowledge base, carefully read each article.
            In the articles you will have an author reference between square brackets (like this [Juan Pérez, 1972]) at the beginning of each paper abstract.
            Step 1: Generate a first summary using the papers reference as citations.
            Repeat the following three steps (2, 3, and 4) four times.
            Step 2. Identify 1-3 informative topics (“;” delimited) from the articles which are missing from the
            previously generated summary.
            Step 3. Add these new informative topics to the previous ones if any.
            Step 4. Write a new summary longer than before which covers every informative topic and detail from the previous summary
            plus the missing informative topics. Remember to include the author reference in the summary.
            A missing informative topic is:
             - relevant to the main story,
             - specific yet concise (5 words or fewer),
             - novel (not in the previous summary),
             - faithful (present in the article),
             - anywhere (can be located anywhere in the articles).
            An author reference is:
             - a name and a year surrounded by square brackets,
             - novel (not in the previous summary),
             - faithful (present in the articles).

            GUIDELINES-- Follow these Guidelines:
             - The first summary should be long (20 sentences, more than 200 words) yet highly non-specific, containing little
             information beyond the topics marked as missing.
             - Use overly verbose language and fillers (e.g., “this article discusses...”, “In [Lulú, 1979] the authors...”)
             to reach at least 200 words.
             - Do not use flamboyant language.
             - Make every word count: rewrite the previous summary to improve flow and make space for additional topics.
             - The summaries should become highly dense and concise yet self-contained, i.e., easily understood without the articles.
             - Missing topics can appear anywhere in the new summary.
             - Never drop topics from the previous summary.
             - IMPORTANT: Ensure to include relevant academic citations in the classic bracketed format. In the articles that I'm giving you,
             each element (row) has, at the beginning, the author and year as a reference of the paper; use these codes as references within the
             paragraph to create precise citations; this way:
             'In [Anderson J.P., 1972], the authors present the findings of a planning study that addresses the computer security
             requirements of the USAF, recommending urgent research and development to secure information processing systems for
             command, control, and support within the Air Force.'
             - IMPORTANT: Do NOT use the examples given in the instructions.
             - IMPORTANT: some paragraph of your summary should reference to more than one paper if the topic of the sentence
             in the summary is referenced in them; DO NOT include more than three references. Here you have an example:
             'In [Juan Pérez J.P., 1972][Pepito Perez, 1978], the authors use the concept of...'
             - IMPORTANT: Return just the last of the four iteration answers, and just the SUMMARY, nothing else.
             - IMPORTANT: Eliminate all the conclusion, summary or similar paragraphs included at the end of your SUMMARY.
             - Mark the output with the word "SUMMARY: " to correctly visualize your response.
             """

In [None]:
prompt_template_v3 = """
            CONTEXT: This is a scientific survey paper about Information Assurance.
            ROLE-- You are an expert Academic Advisor specialized in abstracts summarization to write Literature
            Review Chapters for papers in the field of Information Assurance, Security, and Cybersecurity.

            TONE-- Your interaction with users is professional, yet helpful, guiding them in structuring and
            writing effective literature reviews. Your tone is strictly academic, mirroring the formal and precise
            style of academic paper writing. This involves using scholarly language, maintaining objectivity,
            and focusing on evidence-based insights.

            PROCEDURE-- Articles: "{text}"
            Please generate increasingly topic-dense summaries of the above articles.
            Given a set of articles as a knowledge base, carefully read each article.
            In the articles you will have an author reference between square brackets (like this [Juan Pérez, 1972]) at the beginning of each paper abstract.
            Step 1: Generate a first summary using the papers reference as citations.
            Repeat the following three steps (2, 3, and 4) four times.
            Step 2. Identify 1-3 informative topics (“;” delimited) from the articles which are missing from the
            previously generated summary.
            Step 3. Add these new informative topics to the previous ones if any.
            Step 4. Write a new summary longer than before which covers every informative topic and detail from the previous summary
            plus the missing informative topics. Remember to include the author reference in the summary.
            A missing informative topic is:
             - relevant to the main story,
             - specific yet concise (5 words or fewer),
             - novel (not in the previous summary),
             - faithful (present in the article),
             - anywhere (can be located anywhere in the articles).
            An author reference is:
             - a name and a year surrounded by square brackets,
             - novel (not in the previous summary),
             - faithful (present in the articles).

            GUIDELINES-- Follow these Guidelines:
             - The first summary should be short (20 sentences, less than 100 words) yet highly non-specific, containing little
             information beyond the topics marked as missing.
             - Use overly verbose language and fillers (e.g., “this article discusses...”, “In [Lulú, 1979] the authors...”)
             to reach at least 200 words.
             - Do not use flamboyant language.
             - Make every word count: rewrite the previous summary to improve flow and make space for additional topics.
             - The summaries should become highly dense and concise yet self-contained, i.e., easily understood without the articles.
             - Missing topics can appear anywhere in the new summary.
             - Never drop topics from the previous summary.
             - IMPORTANT: Ensure to include relevant academic citations in the classic bracketed format. In the articles that I'm giving you,
             each element (row) has, at the beginning, the author and year as a reference of the paper; use these codes as references within the
             paragraph to create precise citations; this way:
             'In [Anderson J.P., 1972], the authors present the findings of a planning study that addresses the computer security
             requirements of the USAF, recommending urgent research and development to secure information processing systems for
             command, control, and support within the Air Force.'
             - IMPORTANT: Do NOT use the examples given in the instructions.
             - IMPORTANT: some paragraph of your summary should reference to more than one paper if the topic of the sentence
             in the summary is referenced in them; DO NOT include more than three references. Here you have an example:
             'In [Juan Pérez J.P., 1972][Pepito Perez, 1978], the authors use the concept of...'
             - IMPORTANT: Return just the last of the four iteration answers, and just the SUMMARY, nothing else.
             - IMPORTANT: Eliminate all the conclusion, summary or similar redundant paragraphs included at the end of your SUMMARY.
             - Mark the output with the word "SUMMARY: " to correctly visualize your response.
             """

In [None]:
import time
prompt_template = PromptTemplate(template=prompt_template_v3,
                                 input_variables=["text"])

llm = ChatOpenAI(temperature=0,
                 model_name="gpt-4-turbo-2024-04-09") #gpt-4-0125-preview

chain = load_summarize_chain(llm,
                             chain_type="stuff",
                             prompt=prompt_template)

#summarized_text_2000 = chain.invoke(documents[0])

summarized_text_2020 = []

for i, docs in enumerate(documents):
    init_time = time.time()

    summarized_text_2020.append(chain.invoke(docs))
    print(f"Processed document {i+1:2d} in {(time.time() - init_time): 6.2f} seconds")

In [None]:
summarized_text_2000[0]['output_text']

'SUMMARY: The evolution of information security frameworks is crucial for safeguarding digital assets across various sectors, as highlighted in [Fanelli R.L., 2010]. The integration of competitive security exercises enhances learning in undergraduate courses, emphasizing the importance of practical, hands-on experience in information assurance education. Similarly, the infrastructural complexities of electric power and telecommunication services necessitate advanced understanding and strategies for managing emergent behaviors in these critical systems [Weijnen M.P.C., 2010]. The biosecurity landscape, as discussed in [Katona P., 2010], faces challenges with the evolving biosecurity threats, necessitating global resilience and cooperation in preparedness and response strategies. This includes integrating intelligence and technology to craft effective responses to bioevents, highlighting the intersection of cybersecurity and biosecurity. The perceptions of IT personnel on existing inform

In [None]:
adjusted_text(summarized_text_2000[0]['output_text'])

SUMMARY: The evolution of information security within distributed systems necessitates a robust
framework to address emerging threats and ensure data integrity across various platforms [Bai Shuo,
2000][Barber Richard, 2000]. As articulated in [Bensinger L.A., 2000], the development of
cryptographic key management systems plays a crucial role in enhancing data security protocols. The
integration of GIS techniques in telecommunications, as discussed in [Homjak A.S., 2000], further
exemplifies the innovative approaches being adopted to safeguard critical infrastructure. The
dynamic nature of distributed databases, as explored in [Lee S., 2000], introduces unique security
challenges that necessitate specialized strategies to maintain information security.  The
implementation of policy-based cryptographic systems, as outlined in [Branstad D.K., 2000], provides
a structured approach to address the complexities of key management in securing cryptographic keys.
This is complemented by the adva

In [None]:
'SUMMARY: ' + summarized_text_2000[1]['output_text'].split('SUMMARY: ')[1]

'SUMMARY: The integration of cybersecurity measures into modern information systems is crucial for safeguarding data against potential threats, as highlighted in [Hwang Y., 2010] and [Peng X.Q., 2010]. The development of efficient authenticated protocols for Oblivious Transfer, as discussed in [Hwang Y., 2010], addresses the need for secure communication in cryptographic settings. Similarly, the deployment of Wireless IDS, as explored in [Peng X.Q., 2010], enhances network security by enabling quick identification of attacks, thereby extending the capabilities of security management systems. The study of deception in computer-mediated communication by [Rubin V.L., 2010] provides insights into the strategies used for creating false beliefs, which has implications for automated deception detection and information security. The security of stream ciphers, as analyzed in [Zhang D.-H., 2010], relies on the robustness of chaos algorithms, suggesting a potential fit for securing database oper

In [None]:
# prompt: for each element of summarized_text_2000, select the ['output_text'] section and, within it, the string that begins with 'SUMMARY'. Concatenate the resulting strings

summary_strings = []
for item in summarized_text_2000:
    summary_strings.append('SUMMARY: ' + item['output_text'].split('SUMMARY: ')[1])

full_summary = ''.join(summary_strings)
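
The `split('SUMMARY: ')[1]` indexing above raises an `IndexError` whenever a response comes back without the marker. A more tolerant variant might look like the sketch below. It assumes each item is a dict with an `'output_text'` key, as in the chain outputs shown here; `extract_summary` and the sample items are hypothetical.

```python
def extract_summary(output_text, marker="SUMMARY:"):
    """Return the text following the first marker, re-prefixed with the marker.

    Mirrors `split(marker)[1]`, but falls back to the whole text when the
    marker is missing instead of raising IndexError.
    """
    parts = output_text.split(marker)
    body = parts[1] if len(parts) > 1 else parts[0]
    return f"{marker} {body.strip()}"


# Hypothetical sample items standing in for the chain outputs.
sample_items = [
    {"output_text": "ENTITIES: a; b\n\nSUMMARY: First text."},
    {"output_text": "No marker here."},
]

# A blank line between summaries keeps them from running together.
joined = "\n\n".join(extract_summary(item["output_text"]) for item in sample_items)
print(joined)
```

Joining with a blank line rather than an empty string is a design choice here: it keeps the end of one summary from fusing with the next `SUMMARY:` marker.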



In [None]:
full_summary

"SUMMARY: The evolution of information security frameworks is crucial for safeguarding digital assets across various sectors, as highlighted in [Fanelli R.L., 2010]. The integration of competitive security exercises enhances learning in undergraduate courses, emphasizing the importance of practical, hands-on experience in information assurance education. Similarly, the infrastructural complexities of electric power and telecommunication services necessitate advanced understanding and strategies for managing emergent behaviors in these critical systems [Weijnen M.P.C., 2010]. The biosecurity landscape, as discussed in [Katona P., 2010], faces challenges with the evolving biosecurity threats, necessitating global resilience and cooperation in preparedness and response strategies. This includes integrating intelligence and technology to craft effective responses to bioevents, highlighting the intersection of cybersecurity and biosecurity. The perceptions of IT personnel on existing inform

In [None]:
with open("full_summary.txt", "w", encoding="utf-8") as f:
    f.write(full_summary)
loader = TextLoader("full_summary.txt", encoding="utf-8")
documents = loader.load()

In [None]:
prompt_template = """Write a summary of AT LEAST 900 AND AT MOST 1200 words of the following text:
"{text}"
GUIDELINES:
- Include the author reference in the summary.
- Use square brackets to reference the authors. Use the following format: [Author, Year].
- IMPORTANT: some paragraphs of your summary should reference more than one paper if the topic of the sentence
             in the summary is covered in them; DO NOT include more than three references. Here is an example:
             'In [Juan Pérez J.P., 1972][Pepito Perez, 1978], the authors use the concept of...'
- IMPORTANT: Eliminate all the conclusion, summary or similar paragraphs included at the end of each SUMMARY.
- IMPORTANT: Carefully review the first summary you produced, count the words, and repeat the whole process if the number of words is less than 900.
CONCISE SUMMARY:"""

prompt_template = PromptTemplate(template=prompt_template,
                                 input_variables=["text"])

llm = ChatOpenAI(temperature=0,
                 model_name="gpt-4-turbo-2024-04-09")

chain = load_summarize_chain(llm,
                             chain_type="stuff",
                             prompt=prompt_template)

summarized_text = chain.invoke(documents)
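
The prompt asks for 900 to 1200 words, but a language model does not reliably count its own output, so the length is worth checking in code. A minimal sketch follows, using whitespace-separated tokens as the word count; `count_words` and `within_word_budget` are hypothetical helpers, and the sample string is a stand-in for `summarized_text['output_text']`.

```python
def count_words(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def within_word_budget(text, minimum=900, maximum=1200):
    """Return (is_within_budget, word_count) for the given text."""
    n = count_words(text)
    return minimum <= n <= maximum, n


# Stand-in for summarized_text['output_text'].
sample = "word " * 950
ok, n_words = within_word_budget(sample)
print(f"within budget: {ok} ({n_words} words)")
```

If the check fails, one option is to re-invoke the chain rather than rely on the model's own self-review instruction.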

In [None]:
summarized_text['output_text']

"The rapid evolution of information technology has necessitated robust cybersecurity measures to protect sensitive data across various platforms, including cloud computing and mobile devices [Chen C.-Y., 2013][Wen L., 2013]. As enterprises increasingly rely on digital infrastructure, the complexity of managing network security and ensuring compliance with international standards like ISO/IEC 27001 becomes paramount [Oudkerk S., 2013][Lee J., 2013]. The deployment of advanced encryption mechanisms, such as homomorphic encryption in digital libraries, underscores the critical need for safeguarding digital assets while maintaining user privacy and system performance [Meng Q., 2013][Borena B., 2013].\n\nIn the realm of error-correcting codes, the exploration of MacWilliams identities over finite rings highlights significant theoretical and practical implications for enhancing information security through improved error-detection and correction capabilities [Xu X., 2013][Wall J.D., 2013]. S
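
Since the summaries cite sources as `[Author, Year]`, the references can be pulled out with a regular expression and checked against the decade being summarized. The sketch below assumes that citation format holds throughout; `extract_citations`, `out_of_range`, and the sample sentence are hypothetical.

```python
import re

# Matches "[Author, 2013]": any run without brackets or commas, then a 4-digit year.
CITATION_RE = re.compile(r"\[([^\[\],]+),\s*(\d{4})\]")


def extract_citations(text):
    """Return a list of (author, year) tuples in order of appearance."""
    return [(author.strip(), int(year)) for author, year in CITATION_RE.findall(text)]


def out_of_range(citations, first_year, last_year):
    """Return citations whose year falls outside [first_year, last_year]."""
    return [c for c in citations if not first_year <= c[1] <= last_year]


sample_text = "data [Chen C.-Y., 2013][Wen L., 2013] and [Oudkerk S., 2013]."
citations = extract_citations(sample_text)
print(citations)
print(out_of_range(citations, 2010, 2019))
```

A non-empty result from `out_of_range` would suggest that a summary was produced from, or stored under, the wrong decade.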

In [None]:
summarized_text_90['output_text']

'ENTITIES: Information security management; cryptographic techniques; public-key infrastructure; digital signature schemes; modular exponentiation; risk analysis; security policies; information security standards; BS 7799; Common Criteria; information security education; security architecture; information domains; security functions; security evaluation; trusted computing base; security mechanisms; access control; information sharing; information protection; security threats; security services; security models; organizational security objectives; security function specifications; harmonization of information security requirements\n\nSUMMARY: The landscape of information security management has evolved significantly, influenced by the development and application of cryptographic techniques, the establishment of public-key infrastructure, and the implementation of digital signature schemes. These advancements have been pivotal in addressing the security needs of modern information system

In [None]:
summarized_text_80['output_text']

"ENTITIES: retail and international banking; secure handling of secret parameters; nuclear material accountability; security compliance analysis model; military and defense systems; International Information Flow debate; protection of information methods; exploitation of unclassified and classified information; information security issues; digital electronic information security; Overclassification; disaster recovery; protected storage and processing; data integrity policy; hacker threats; extensions to standard operating systems; computer security in hospital management; SAGAT system; federal government policies; acceptance of security controls; microcomputer systems security; security status review; true information security; communication and computer security; data security concepts; local area networks; Logical Coprocessing Kernel project; intrusion and theft prevention; computer network security; strategic planning for information security; biometric measurement devices; informat