In [2]:
!pip install transformers
!pip install datasets


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.7/6.7 MB[0m [31m32.0 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.2/199.2 KB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m58.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.2 tokenizers-0.13.2 transformers-4.27.1
Looking in indexes: https://pypi.org/simple, https://us

##**Pre-Train Large BART CNN**

In [3]:
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

# Load BART model and tokenizer
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# Define a function to generate a summary
def generate_summary(input_text, model, tokenizer):
    # Tokenize the input text
    input_ids = tokenizer.encode(input_text, return_tensors='pt')

    # Generate the summary using the BART model
    summary_ids = model.generate(input_ids, num_beams=4, max_length=100, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example usage
input_text = 'Over the past few decades, interest in theories and algorithms for face recognition has been growing rapidly. Video surveillance, criminal identification, building access control, and unmanned and autonomous vehicles are just a few examples of concrete applications that are gaining attraction among industries. Various techniques are being developed including local, holistic, and hybrid approaches, which provide a face image description using only a few face image features or the whole facial features. The main contribution of this survey is to review some well-known techniques for each approach and to give the taxonomy of their categories. In the paper, a detailed comparison between these techniques is exposed by listing the advantages and the disadvantages of their schemes in terms of robustness, accuracy, complexity, and discrimination. One interesting feature mentioned in the paper is about the database used for face recognition. An overview of the most commonly used databases, including those of supervised and unsupervised learning, is given. Numerical results of the most interesting techniques are given along with the context of experiments and challenges handled by these techniques. Finally, a solid discussion is given in the paper about future directions in terms of techniques to be used for face recognition.'
summary = generate_summary(input_text, model, tokenizer)
print('Generated Summary:', summary)


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Generated Summary: Interest in theories and algorithms for face recognition has been growing rapidly. Video surveillance, criminal identification, building access control, and unmanned and autonomous vehicles are just a few examples of concrete applications. Various techniques are being developed including local, holistic, and hybrid approaches, which provide a face image description using only a few face image features or the whole facial features.


## **Summary:**
Interest in theories and algorithms for face recognition has been growing rapidly. Video surveillance, criminal identification, building access control, and unmanned and autonomous vehicles are just a few examples of concrete applications. Various techniques are being developed including local, holistic, and hybrid approaches, which provide a face image description using only a few face image features or the whole facial features.

### **Fine Tune Large Bart**

In [None]:
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from torch.utils.data import DataLoader
from datasets import load_dataset

# Load the arxiv dataset from the HuggingFace Datasets library
dataset = load_dataset('scientific_papers', 'arxiv')

# Load the BART model and tokenizer
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# Define a function to generate a summary for a given input text using the BART model
def generate_summary(input_text, max_length=100):
    input_ids = tokenizer.encode(input_text, return_tensors='pt', truncation=True, max_length=512)
    summary_ids = model.generate(input_ids, max_length=max_length, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Define a function to fine-tune the BART model on a dataset of input/target summary pairs
def fine_tune(model, tokenizer, train_dataset, val_dataset, batch_size=4, epochs=2, lr=1e-5):
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for batch in train_loader:
            input_ids = tokenizer.batch_encode_plus(batch['input_text'], padding=True, truncation=True, max_length=512, return_tensors='pt')['input_ids']
            target_ids = tokenizer.batch_encode_plus(batch['target_summary'], padding=True, truncation=True, max_length=128, return_tensors='pt')['input_ids']
            optimizer.zero_grad()
            # Generate the summary using the BART model
            outputs = model(input_ids, labels=target_ids)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                input_ids = tokenizer.batch_encode_plus(batch['input_text'], padding=True, truncation=True, max_length=512, return_tensors='pt')['input_ids']
                target_ids = tokenizer.batch_encode_plus(batch['target_summary'], padding=True, truncation=True, max_length=128, return_tensors='pt')['input_ids']
                # Generate the summary using the BART model
                outputs = model(input_ids, labels=target_ids)
                loss = outputs.loss
                val_loss += loss.item()
        print(f'Epoch {epoch+1}: Train Loss={train_loss:.4f} Val Loss={val_loss:.4f}')
    return model




  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
# Split the dataset into a training and validation set
train_dataset = dataset['train'][:10]
val_dataset = dataset['train'][10:12]

In [None]:
import pandas as pd
tr_data = pd.DataFrame.from_dict(train_dataset)
vd_data = pd.DataFrame.from_dict(val_dataset)
train_dataset = []
val_dataset = []
for i in range(len(tr_data)):
  train_dataset.append({'input_text': tr_data['article'].iloc[i], 'target_summary': tr_data['abstract'].iloc[i]})
for i in range(len(vd_data)):
  val_dataset.append({'input_text': vd_data['article'].iloc[i], 'target_summary': vd_data['abstract'].iloc[i]})

In [None]:
# Fine-tune the BART model on the training dataset
model = fine_tune(model, tokenizer, train_dataset, val_dataset, batch_size=2, epochs=2, lr=1e-5)

Epoch 1: Train Loss=23.0249 Val Loss=4.3987
Epoch 2: Train Loss=17.1408 Val Loss=4.3239


In [None]:
input_text = "Machine learning is enabling a myriad innovations, including new algorithms for cancer diagnosis and self-driving cars. The broad use of machine learning makes it important to understand the extent to which machine-learning algorithms are subject to attack, particularly when used in applications where physical security or safety is at risk. In this paper, we focus on facial biometric systems, which are widely used in surveillance and access control. We define and investigate a novel class of attacks: attacks that are physically realizable and inconspicuous, and allow an attacker to evade recognition or impersonate another individual. We develop a systematic method to automatically generate such attacks, which are realized through printing a pair of eyeglass frames. When worn by the attacker whose image is supplied to a state-of-the-art face-recognition algorithm, the eyeglasses allow her to evade being recognized or to impersonate another individual. Our investigation focuses on white-box face-recognition systems, but we also demonstrate how similar techniques can be used in black-box scenarios, as well as to avoid face detection."
summary = generate_summary(input_text)
print('Summary: ', summary)

Summary:  We develop a systematic method to automatically generate such attacks, which are realized through printing a pair of eyeglass frames. When worn by the attacker whose image is supplied to a state-of-the-art face-recognition algorithm, the eyeglasses allow her to evade being recognized or to impersonate another individual. We also demonstrate how similar techniques can be used in black-box scenarios, as well as to avoid face detection.


##**Actual:**
Machine learning is enabling a myriad innovations, including new algorithms for cancer diagnosis and self-driving cars. The broad use of machine learning makes it important to understand the extent to which machine-learning algorithms are subject to attack, particularly when used in applications where physical security or safety is at risk. In this paper, we focus on facial biometric systems, which are widely used in surveillance and access control. We define and investigate a novel class of attacks: attacks that are physically realizable and inconspicuous, and allow an attacker to evade recognition or impersonate another individual. We develop a systematic method to automatically generate such attacks, which are realized through printing a pair of eyeglass frames. When worn by the attacker whose image is supplied to a state-of-the-art face-recognition algorithm, the eyeglasses allow her to evade being recognized or to impersonate another individual. Our investigation focuses on white-box face-recognition systems, but we also demonstrate how similar techniques can be used in black-box scenarios, as well as to avoid face detection.

##**Summary:** 
We develop a systematic method to automatically generate such attacks, which are realized through printing a pair of eyeglass frames. When worn by the attacker whose image is supplied to a state-of-the-art face-recognition algorithm, the eyeglasses allow her to evade being recognized or to impersonate another individual. We also demonstrate how similar techniques can be used in black-box scenarios, as well as to avoid face detection.


In [None]:
input_text = "Abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern abstractive summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. In this paper, we study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs. To account for data scarcity, we used a modern pretrained abstractive summarizer BART (Lewis et al., 2020), which only achieves 17.9 ROUGE-L as it struggles with long documents. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm based on GPT-2 (Radford et al., 2019) language model perplexity scores, that operates within the low resource regime. On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats several competitive salience detection baselines. Furthermore, the identified salient sentences tend to agree with an independent human labeling by domain experts."
summary = generate_summary(input_text, max_length=200)
print('Summary: ', summary)

Summary:   abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary.


##**Actual:**
Abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern abstractive summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. In this paper, we study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs. To account for data scarcity, we used a modern pretrained abstractive summarizer BART (Lewis et al., 2020), which only achieves 17.9 ROUGE-L as it struggles with long documents. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm based on GPT-2 (Radford et al., 2019) language model perplexity scores, that operates within the low resource regime. On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats several competitive salience detection baselines. Furthermore, the identified salient sentences tend to agree with an independent human labeling by domain experts.

## **Summary:**
abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary.

In [None]:
pip install PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 KB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
pip install GingerIt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Text Rank Algorithm 

In [None]:
import PyPDF2
import gensim
import spacy
from gensim.summarization import summarize
from gensim.utils import simple_preprocess

# Define a custom stopwords list
custom_stopwords = ["research", "paper", "study", "results", "conclusions"]

# Load a spaCy model for named entity recognition
nlp = spacy.load("en_core_web_sm")

# Open the PDF file in read-binary mode
with open('/content/drive/MyDrive/Research_Papers/VideoSumSpringer.pdf', 'rb') as f:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(f)

    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)

    # Loop through each page in the PDF file
    paper_text = ""
    for page_num in range(num_pages):
        # Get the page object for the current page
        page = pdf_reader.pages[page_num]

        # Extract the text from the page object
        text = page.extract_text()

        # Add the text to the paper_text variable
        paper_text += text

# Preprocess the text
sentences = gensim.summarization.textcleaner.split_sentences(paper_text)
tokens = [simple_preprocess(sentence) for sentence in sentences]

# Remove custom stopwords from the tokens
tokens_without_stopwords = [[token for token in sentence_tokens if token not in custom_stopwords] for sentence_tokens in tokens]

# Convert the tokens back to sentences
sentences_without_stopwords = [" ".join(sentence_tokens) for sentence_tokens in tokens_without_stopwords]

# Use spaCy for named entity recognition
ner_sentences = []
for sentence in sentences_without_stopwords:
    doc = nlp(sentence)
    ner_sentence = ""
    for token in doc:
        if token.ent_type_:
            ner_sentence += token.ent_type_ + " "
        else:
            ner_sentence += token.text + " "
    ner_sentences.append(ner_sentence)

# Join the sentences into a single string
paper_text_without_stopwords = ". ".join(ner_sentences)

# Perform text summarization using gensim's TextRank algorithm
sentences = summarize(paper_text_without_stopwords, ratio=0.2, split=True)

# Post-processing: Correct spelling and grammar errors and add missing punctuation
# Here, we use the GingerIt library for automatic spelling and grammar correction
from gingerit.gingerit import GingerIt
corrector = GingerIt()

# Process each sentence in the summary
summary = ""
for sentence in sentences:
    if sentence is not None:
        # Correct spelling and grammar errors and add missing punctuation
        corrections = corrector.parse(sentence)
        if 'Corrections' in corrections:
            sentence = corrections['Corrections']
        else:
            sentence = corrections['result']
        sentence = sentence.strip()
        if not sentence.endswith("."):
            sentence += "."
        summary += sentence + " "

# Print the summary
print(summary)



Video transcript extraction and summarization. Single frame videos require the viewer to watch the entire thing to fully. Algorithms such as text rank and seq seq models. PERSON PERSON algorithms and models the system proposed in this pa. Per converts audio chunks from input videos into transcripts. Marization model based on natural language processing and transfer. The developed model accepts user supplied video links as input. And outputs summary like description of video. Key words NLP transfer learning large ORG ORG pre trained. One can use our system to summarize the transcripts or captions of the video. Propose video transcript summarizes that. Takes YouTube links as input, extracts the transcript if not provided and gives. The summarized transcript using hugging, face transformers. This model only works for LANGUAGE. Input video transcripts. Proposes bertsum model for summarizing. The proposed model for summarization is trained. On combination of ORG DATE mail wikihow and how da

## **Summary:**
Video transcript extraction and summarization. Single frame videos require the viewer to watch the entire thing to fully. Algorithms such as text rank and seq seq models. PERSON PERSON algorithms and models the system proposed in this pa. Per converts audio chunks from input videos into transcripts. Marization model based on natural language processing and transfer. The developed model accepts user supplied video links as input. And outputs summary like description of video. Key words NLP transfer learning large ORG ORG pre trained. One can use our system to summarize the transcripts or captions of the video. Propose video transcript summarizes that. Takes YouTube links as input, extracts the transcript if not provided and gives. The summarized transcript using hugging, face transformers. This model only works for LANGUAGE. Input video transcripts. Proposes bertsum model for summarizing. The proposed model for summarization is trained. On combination of ORG DATE mail wikihow and how datasets. On CARDINAL CARDINAL dataset and evaluated using the rouge score and content. Datasets among training and testing and observed the best CARDINAL. From podcast extract text summarization to summarize the transcript and. Return the audio linked with the text summary. They produced summaries for the audio transcripts of several podcasts to gain. Algorithm to handle podcast summaries explicitly using these text summaries video transcripts. NLP methods to extract and summarize information from audio and video data. Extractive text summarizing is a method of summarizing that draws sum. Transcribed text they attach video strings based on subtitles using the algebraic. It uses fewer computing resources and ORG ORG any prior training data. Pre trained language models for deep neural networks are growing in pop. There are much different transfer learning models available for NLP tasks. An encoder decoder model that transforms all NLP issues into text. ORG model pre trained in the LANGUAGE language and has been refined. Town to train it ORG builds a model to retrieve the original text. Vision and natural language processing frequently build on pre trained models. Tasks hampered by lack of data and inadequate model generalization in the. We could use the model that had already been trained for CARDINAL job too. Strategies will be covered along with how they can be used for video transcript. Text summarization model. Convert video to audio we will be using the moviepy python library which. Video into the corresponding wav audio format. Text out of the video file or more accurately creating a textual version of the. The transcription process begins with an audio. Serves as the raw text converted to the video transcript using the writer's. Text summarization we use automatic obstructive summarization to create. Text summaries from the transcripts, we applied the transformers package for. Using the dataset of summarized YouTube transcripts. Which we described in section we fine tuned Facebook ORG large model to. How dataset, there are videos in this YouTube collection totaling. . Videos make up the entire dataset that was used and the split is shown below. The dataset processing process is followed before the model selection as well. As the model training process the preprocessing steps include. Text data file to video I do. Tabled evaluation of different pre trained models on how dataset. For choosing the best model for our proposed system we tested different trans. For learning pretrained models like small base ORG large ORG PRODUCT. ORG exam using how dataset and predicted the summary using this mod. Ability to calculate the rouge and PERSON score evaluation metrics as discussed. Large ORG gave an overall better score than all the models considered. Selected the ORG large ORG model for fine tuning process. The ORG model has. Already been trained in LANGUAGE language and has been fine tuned using ORG. Tune this ORG large ORG model and train it on our training dataset described. Description of ORG large ORG model. ORG model is explained in detail in section. Summary pairings the ORG large ORG model has been fine tuned. We fined tuned the ORG large ORG model from the section on the dataset from. Fast a package created to make deep learning more approachable video transcript. Algorithm fine tuning and model training. Output trained model. Build the obstructive summarization model of fine tuning. Training the model. Training the model. Book ORG large ORG model and prepare data for training. Fine tuning the model and preprocessing data for training now we will. Your raw data into modelable information. Training the model and evaluate in this phase, we prepare our ORG. Model for training by wrapping it in blurry object and using. The result of the prediction or summary that is given by our trained model. . The fine tuned large ORG can model is evaluated on the testing dataset. Reference summaries are provided by the rouge package which ORG ORDINAL Varun Mehta et al. Embeddings from PERSON that have already been trained Bert score evaluating. Text generation compares words in candidate and reference sentences based on. The rouge scores for all the models before fine tuning are compiled in. Fine tuned large ORG ORG model. PERSON score obtained for each epoch, while training which is shown in fig. We provided YouTube video url and converted. The video speech  transcripts and applied summarization using our model to. Generated and the video url is displayed in tabular form in a table. Table fine tuned ORG large ORG evaluation. PERSON score video transcript. The table predicted summary of the video. Video url extracted transcript predicted summary. Therefore, this proposes a system to extract video, audio and transcripts. Comparative analysis of some transformer models. Was done and the ORG large ORG model gave superior out of all the. The ORG large ORG model and the testing dataset mentioned in the section. Url we used this model to summarize the video. We can expand the data on which our model is trained. Video transcript Summarizer. Video CARDINAL and summarization using. Video summarization. Abstractive text summarization using sequences. 

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from collections import Counter
from gensim.summarization import summarize

# Load the spacy model
nlp = spacy.load("en_core_web_sm")

text = "The MAJ and DMD rules are compared in Figure 4. For each N, four trials of the discrete HMM training were performed to find the mean and standard deviation of DMD OCP. From the 71h order polynomial fitting, the best OCP is 89.7% when N=l4, and the worst OCP is 86.6% when N=6. The MAJ OCP is always 81.7%. For CMD rule, we search for the optimum setting starting from N=G=d=l. Figure 5 shows the CMD OCP on d. The peak OCP is 99.0% when d=8. Away from N=G=l and d=8, the CMD OCP decays monotonically with G and N as shown in Figure 6. Four trials of training were performed to find the means and standard deviations for each setting. Thus the best OCPs of the MAJ, DMD, and CMD rules can be compared to the single-frame face recognition, as summarized"

# Tokenize the text and remove stop words
doc = nlp(text)
tokens = [token.text for token in doc if not token.is_stop and token.is_alpha]

# Get the most frequent words and their counts
word_freq = Counter(tokens)
top_words = word_freq.most_common(10)

# Print the most frequent words and their counts
print("Top 10 most frequent words:")
for word, count in top_words:
    print(f"{word}: {count}")

# Summarize the text using gensim's TextRank algorithm
summary = summarize(text)

# Print the summary
print("Summary:")
print(summary)


Top 10 most frequent words:
OCP: 7
N: 5
CMD: 4
MAJ: 3
DMD: 3
Figure: 3
G: 3
rules: 2
compared: 2
trials: 2
Summary:
For each N, four trials of the discrete HMM training were performed to find the mean and standard deviation of DMD OCP.
Thus the best OCPs of the MAJ, DMD, and CMD rules can be compared to the single-frame face recognition, as summarized


## **Summary:**
We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm based on GPT-2 (Radford et al., 2019) language model perplexity scores, that operates within the low resource regime.

In [None]:
import re
import nltk
import heapq
import gensim
import numpy as np

nltk.download('punkt')

def preprocess_text(text):
    # remove citations
    text = re.sub(r'\[\d+\]', '', text)
    # remove newlines
    text = re.sub(r'\n', ' ', text)
    # split text into sentences
    sentences = nltk.sent_tokenize(text)
    # remove punctuations and convert to lowercase
    preprocessed_sentences = []
    for sentence in sentences:
        sentence = re.sub('[^a-zA-Z0-9]', ' ', sentence).lower()
        preprocessed_sentences.append(sentence)
    return preprocessed_sentences

def summarize_text(text, num_sentences=3):
    # preprocess text
    preprocessed_sentences = preprocess_text(text)
    # create document matrix
    documents = [gensim.utils.simple_preprocess(sentence) for sentence in preprocessed_sentences]
    dictionary = gensim.corpora.Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
    tfidf = gensim.models.TfidfModel(bow_corpus)
    corpus_tfidf = tfidf[bow_corpus]
    # create LSI model and apply it to the corpus
    lsi_model = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=len(documents))
    corpus_lsi = lsi_model[corpus_tfidf]
    # get sentence scores
    sentence_scores = []
    for i, sentence in enumerate(corpus_lsi):
        score = 0
        for _, val in sentence:
            score += val
        sentence_scores.append((i, score))
    # sort sentence scores and get top n sentences
    top_sentences = heapq.nlargest(num_sentences, sentence_scores, key=lambda x: x[1])
    top_sentences.sort(key=lambda x: x[0])
    # generate summary
    summary = ' '.join([preprocessed_sentences[i] for i, _ in top_sentences])
    # extract numeric facts from summary
    numeric_facts = re.findall(r'\d+\.?\d*', summary)
    # remove duplicates and convert to float
    numeric_facts = list(set([float(num) for num in numeric_facts]))
    # create summary with numeric facts
    summary_with_facts = summary + '\n\nNumeric Facts: ' + ', '.join([str(num) for num in numeric_facts])
    return summary_with_facts

# Example usage
# text = """
# The results show that the new algorithm outperforms existing algorithms on several benchmark datasets. For example, on the MNIST dataset, our algorithm achieved a classification accuracy of 98.5%, compared to the state-of-the-art accuracy of 97.8%. On the CIFAR-10 dataset, our algorithm achieved an accuracy of 92.3%, which is 1.5% higher than the state-of-the-art accuracy of 90.8%. Furthermore, our algorithm has a lower computational complexity, which makes it more suitable for real-time applications.
# """

# input text
text = "The MAJ and DMD rules are compared in Figure 4. For each N, four trials of the discrete HMM training were performed to find the mean and standard deviation of DMD OCP. From the 71h order polynomial fitting, the best OCP is 89.7% when N=l4, and the worst OCP is 86.6% when N=6. The MAJ OCP is always 81.7%. For CMD rule, we search for the optimum setting starting from N=G=d=l. Figure 5 shows the CMD OCP on d. The peak OCP is 99.0% when d=8. Away from N=G=l and d=8, the CMD OCP decays monotonically with G and N as shown in Figure 6. Four trials of training were performed to find the means and standard deviations for each setting. Thus the best OCPs of the MAJ, DMD, and CMD rules can be compared to the single-frame face recognition, as summarized"


summary = summarize_text(text, num_sentences=6)
print(summary)



for each n  four trials of the discrete hmm training were performed to find the mean and standard deviation of dmd ocp  the maj ocp is always 81 7   for cmd rule  we search for the optimum setting starting from n g d l  away from n g l and d 8  the cmd ocp decays monotonically with g and n as shown in figure 6  four trials of training were performed to find the means and standard deviations for each setting  thus the best ocps of the maj  dmd  and cmd rules can be compared to the single frame face recognition  as summarized

Numeric Facts: 8.0, 81.0, 6.0, 7.0


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Actual:**
The MAJ and DMD rules are compared in Figure 4. For each N, four trials of the discrete HMM training were performed to find the mean and standard deviation of DMD OCP. From the 71h order polynomial fitting, the best OCP is 89.7% when N=l4, and the worst OCP is 86.6% when N=6. The MAJ OCP is always 81.7%. For CMD rule, we search for the optimum setting starting from N=G=d=l. Figure 5 shows the CMD OCP on d. The peak OCP is 99.0% when d=8. Away from N=G=l and d=8, the CMD OCP decays monotonically with G and N as shown in Figure 6. Four trials of training were performed to find the means and standard deviations for each setting. Thus the best OCPs of the MAJ, DMD, and CMD rules can be compared to the single-frame face recognition, as summarized

## **Summary:**
The maj and dmd rules are compared in figure 4  for each n  four trials of the discrete hmm training were performed to find the mean and standard deviation of dmd ocp  from the 71h order polynomial fitting  the best ocp is 89 7  when n l4  and the worst ocp is 86 6  when n 6  the maj ocp is always 81 7   for cmd rule  we search for the optimum setting starting from n g d l  figure 5 shows the cmd ocp on d  the peak ocp is 99 0  when d 8  away from n g l and d 8  the cmd ocp decays monotonically with g and n as shown in figure 6  four trials of training were performed to find the means and standard deviations for each setting  thus the best ocps of the maj  dmd  and cmd rules can be compared to the single frame face recognition  as summarized