## Text Summarization

Text summarization is important in the field of machine learning and natural language processing for several reasons:

1. **Information Retrieval:** Text summarization helps users quickly grasp the main points or key information from a large document, making it easier to decide whether to read the full document or not. This is particularly valuable in scenarios where individuals are inundated with vast amounts of textual data, such as news articles, research papers, or social media posts.

2. **Time Efficiency:** Summarization algorithms can process and generate summaries much faster than humans can read and summarize large texts. This saves time and allows users to focus their attention on the most relevant content.

3. **Content Extraction:** Extractive summarization selects the most informative sentences from a document, and that output can feed downstream applications such as content recommendation, keyword extraction, and topic modeling.

4. **Content Generation:** Summarization models can be used to generate concise, coherent, and informative summaries for various purposes, such as creating abstracts for research papers, news article headlines, or social media post previews.

5. **Multilingual Support:** Text summarization can be applied to texts in multiple languages, provided the model and tokenizer support the language in question, which makes it a valuable tool for global communication and information retrieval.

6. **Personalization:** Summarization can be personalized to individual preferences. Machine learning models can learn from user feedback to generate summaries that align more closely with a user's interests and priorities.

7. **Scalability:** As the volume of digital content continues to grow, automated summarization becomes crucial for scaling information processing and retrieval. Machine learning-based summarization models can adapt and handle large volumes of text efficiently.

8. **Legal and Compliance:** In legal and regulatory contexts, automated summarization can help organizations review contracts, policies, and legal documents to ensure compliance and identify critical clauses or information.

9. **Search Engine Optimization (SEO):** Summarized content can be used to create concise and engaging snippets for search engine results, improving the discoverability of web content.

10. **Content Creation:** Summarization can be integrated into content creation tools, helping authors and content creators generate concise and informative content more efficiently.

Overall, text summarization is an essential component of machine learning and natural language processing, enabling efficient information retrieval, content extraction, and content generation across a wide range of applications and industries. It plays a critical role in handling the ever-increasing amount of textual data available in the digital age.

---
Exercise:

Now, as a data scientist with expertise in NLP, you are asked to create a model that summarizes text in Spanish. Your stakeholders will give you an article, and your model should return its summary.

In [26]:
!pip install requests beautifulsoup4 sumy langdetect nltk transformers

import requests
from bs4 import BeautifulSoup
import nltk
from langdetect import detect
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from transformers import pipeline

# Sentence tokenizer data used by sumy's Tokenizer
nltk.download('punkt')

# Article URL
url = "https://time.com/collection/time100-ai/6309026/geoffrey-hinton/"

# Send an HTTP request to fetch the page content (the timeout avoids hanging forever)
response = requests.get(url, timeout=30)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [27]:
##### CODE USING THE SUMMARIZATION PIPELINE (HUGGING FACE) #####
##### THE CODE IS INTENDED TO ALSO ACCEPT ARTICLES IN SPANISH #####

# Detect the language of a text
def detectar_idioma(texto):
    return detect(texto)

# Summarize a text chunk by chunk
def resumir_texto(texto):
    # The default checkpoint (sshleifer/distilbart-cnn-12-6) is trained on English news,
    # so warn when the input is in another language
    idioma = detectar_idioma(texto)
    if idioma != 'en':
        print(f"Aviso: idioma detectado '{idioma}'; el modelo por defecto está entrenado en inglés.")

    summarizer = pipeline("summarization")  # summarization pipeline with the default model

    # Split the text into 1024-character chunks. The model limit is 1024 tokens, not
    # characters, so these chunks are typically far below it and can cut sentences in half.
    max_chars = 1024
    fragmentos = [texto[i:i + max_chars] for i in range(0, len(texto), max_chars)]

    resúmenes = []
    try:
        for fragmento in fragmentos:
            resumen = summarizer(fragmento, max_length=150, min_length=30, do_sample=False)
            if resumen:
                resúmenes.append(resumen[0]['summary_text'])

        # Combine the partial summaries
        resumen_final = " ".join(resúmenes)
        print("Resumen:\n", resumen_final)

    except Exception as e:
        print(f"Error al resumir el texto: {e}")

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the page HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")

    # Locate the article body (inspect the page HTML to find the right structure)
    article_content = soup.find("div", {"class": "article-content"})

    if article_content is not None:
        # Extract the article text, one paragraph per line
        article_text = ""
        for paragraph in article_content.find_all("p"):
            article_text += paragraph.get_text() + "\n"

        resumir_texto(article_text)
    else:
        print("No se encontró el contenido del artículo. Verifica la estructura del HTML.")
else:
    print("Error al obtener la página:", response.status_code)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your max_length is set to 150, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)


Resumen:
  Geoffrey Hinton, 76, has spent his career trying to build AI systems that model the human brain . He had always believed that the brain was better than the machines that he and others were building . But in February, he realized “the digital intelligence we’ve got now may be better than . the brain already. It’s just not scaled up quite as big”  Hinton worries about what could happen once AI systems are scaled up to the size of human brains . “This stuff will get smarter than us and take over,” he says . Hinton comes from a long line of luminaries, with relatives including the mathematician Mary Everest Boole and logician George Boole .  In the 1970s, artificial intelligence was going through a period of dampened enthusiasm now referred to as the “AI winter .” In this unfashionable field, Hinton pursued an unpopular idea: neural networks . He published pathbreaking research for which he was awarded the 2018 Turing Award .  In 2012, Hinton and two of his graduate students, Al
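
The cell above splits the article every 1024 characters, which can cut a sentence (or a word) in half before it reaches the model. The following is a minimal sketch of an alternative that packs whole sentences into each chunk. The function name `split_into_chunks` and the regex-based sentence splitter are assumptions made for illustration; they are not part of the original notebook, and a proper sentence tokenizer would handle abbreviations better.

```python
import re


def split_into_chunks(text, max_chars=1024):
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the budget is hard-split as a fallback
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]

        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate      # the sentence still fits in the open chunk
        else:
            chunks.append(current)   # close the chunk and start a new one
            current = sentence

    if current:
        chunks.append(current)
    return chunks
```

In the notebook, something like `fragmentos = split_into_chunks(article_text)` could stand in for the character-based slicing. The budget is still counted in characters, so it remains an approximation of the model's token limit.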

In [28]:
##### CODE USING THE LSA SUMMARIZER #####
##### THE CODE IS INTENDED TO ALSO ACCEPT ARTICLES IN SPANISH #####

if response.status_code == 200:
    # Parse the page HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")

    # Locate the article body (inspect the page HTML to find the right structure)
    article_content = soup.find("div", {"class": "article-content"})

    # Extract the article text; guard against a missing container
    article_text = ""
    if article_content is not None:
        for paragraph in article_content.find_all("p"):
            article_text += paragraph.get_text() + "\n"

    # Detect the language of a text
    def detectar_idioma(texto):
        return detect(texto)

    # Summarize a text with LSA
    def resumir_texto(texto, num_oraciones=10):
        idioma = detectar_idioma(texto)
        if idioma == 'es':
            tokenizador_idioma = "spanish"
        elif idioma == 'en':
            tokenizador_idioma = "english"
        else:
            print("Idioma no soportado.")
            return

        # Parse the text with a tokenizer for the detected language
        parser = PlaintextParser.from_string(texto, Tokenizer(tokenizador_idioma))

        # LSA summarization method
        summarizer = LsaSummarizer()

        # Select the num_oraciones most relevant sentences
        resumen = summarizer(parser.document, num_oraciones)

        # Print the selected sentences
        for sentence in resumen:
            print(sentence)

    # Summarize the extracted text (language detection fails on empty input)
    if article_text.strip():
        resumir_texto(article_text)
    else:
        print("No se encontró el contenido del artículo. Verifica la estructura del HTML.")

else:
    print("Error al obtener la página:", response.status_code)


Alarmed, Hinton left his post as VP and engineering fellow in May and gave a flurry of interviews in which he explained that he had left in order to be able to speak freely on the dangers of AI—and his regrets over helping bring that technology into existence.
Each time he replied, “Give me another six months and I’ll prove to you that it works.” Upon completion of his Ph.D., Hinton moved to the U.S., where more funding was available for his research.
Toronto has become Hinton’s home base; he travels relatively infrequently because back problems prevent him from sitting down.
In 2012, Hinton and two of his graduate students, Alex Krizhevsky and Ilya Sutskever, now chief scientist at OpenAI, entered ImageNet, a once annual competition in which researchers competed to build the most accurate image-recognition AI systems.
He and his two students began receiving lucrative offers from big tech companies.
His work has potentially hastened the future he fears, in which AI becomes superhuman w
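
The LSA output above is extractive: every line is a sentence copied verbatim from the article, chosen by a relevance score. To make that selection idea concrete, here is a minimal standard-library sketch. Note that it scores sentences by average word frequency, which is a much simpler technique than LSA and is used here only for illustration; the function name `extractive_summary` and the regex sentence splitter are assumptions, not part of the notebook or of sumy.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=3):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word frequencies over the whole text (lowercased, letters only)
    frequencies = Counter(re.findall(r"[^\W\d_]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[^\W\d_]+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    best = sorted(ranked[:num_sentences])
    return [sentences[i] for i in best]
```

Unlike this sketch, LSA builds a term-by-sentence matrix and uses a singular value decomposition to find the dominant topics before ranking sentences, so its selections will generally differ.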

In [29]:
##### CODE USING SEQ2SEQ (BART) #####
##### THE CODE IS INTENDED TO ALSO ACCEPT ARTICLES IN SPANISH #####

if response.status_code == 200:
    # Parse the page HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")

    # Locate the article body
    article_content = soup.find("div", {"class": "article-content"})

    # Extract the article text
    article_text = ""
    if article_content:
        for paragraph in article_content.find_all("p"):
            article_text += paragraph.get_text() + "\n"

    # Strip leading and trailing whitespace
    article_text = article_text.strip()

    # Split the text into 1024-character chunks (characters, not tokens)
    max_input_length = 1024
    article_parts = [article_text[i:i + max_input_length] for i in range(0, len(article_text), max_input_length)]

    # Seq2Seq BART; facebook/bart-large-cnn is fine-tuned on English news (CNN/DailyMail)
    resumen_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")

    # Summarize every chunk and print the combined result
    def resumir_texto(partes):
        resúmenes = []
        for part in partes:
            idioma = detect(part)  # detect the language of the chunk
            # Chunks in other languages are skipped. Spanish chunks are still passed
            # to the English-trained model, so their summaries may be poor.
            if idioma in ['es', 'en']:
                # max_length and min_length are measured in tokens
                summary = resumen_pipeline(part, max_length=80, min_length=20, do_sample=False)
                resúmenes.append(summary[0]['summary_text'])

        resumen_completo = "\n".join(resúmenes)
        print("Resumen:\n", resumen_completo)

    if article_parts:
        resumir_texto(article_parts)
    else:
        print("No se encontró contenido para resumir.")
else:
    print("Error al obtener la página:", response.status_code)



Your max_length is set to 80, but your input_length is only 43. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)


Resumen:
 Geoffrey Hinton is one of the most influential AI researchers of the past 50 years. Hinton, 76, has spent his career trying to build AI systems that model the human brain. Given the current rate at which AI companies are increasing the size of models, it could be less than five years until AI systems have 100 trillion connections.
Hinton worries about what could happen once AI systems are scaled up to the size of human brains. He worries about the prospect of humanity being wiped out by the technology he helped create. “This stuff will get smarter than us and take over,” he says.
In the 1970s, artificial intelligence, after failing to live up to its postwar promise, was going through a period of dampened enthusiasm. Hinton pursued an unpopular idea: neural networks, which mimicked the structure of the human brain. He published pathbreaking research, for which he was awarded the 2018 Turing Award.
In 2012, Hinton and two of his graduate students, Alex Krizhevsky and Ilya Sutsk
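
The log above warns that `max_length=80` is larger than the 43-token input of the last chunk and suggests a value such as 21. One way to avoid the warning is to derive the length bounds from the size of each chunk. The sketch below is an assumption made for illustration: the helper name `summary_length_bounds`, the 0.5 ratio, and the defaults are hypothetical and do not come from the notebook or from the Transformers library.

```python
def summary_length_bounds(input_tokens, ratio=0.5, max_cap=80, min_floor=20):
    """Pick (min_length, max_length) in tokens relative to the input size."""
    # Never ask for more than `ratio` of the input, nor more than the cap;
    # keep at least 2 so that min_length can stay strictly below max_length
    max_length = max(2, min(max_cap, int(input_tokens * ratio)))
    # Keep the usual floor for long inputs, but shrink it for short ones
    min_length = min(min_floor, max_length // 2)
    return min_length, max_length


# A long chunk keeps the notebook's settings, a short one gets smaller bounds
print(summary_length_bounds(400))
print(summary_length_bounds(43))
```

In the notebook, the token count of a chunk could presumably be obtained with `len(resumen_pipeline.tokenizer(part)["input_ids"])`, and the resulting pair passed as `min_length` and `max_length` to the pipeline call.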

In [30]:
##### CODE USING SEQ2SEQ (BART) WITH TRANSLATION INTO SPANISH (HELSINKI MODEL) #####
##### THE CODE IS INTENDED TO ALSO ACCEPT ARTICLES IN SPANISH #####

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the page HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")

    # Locate the article body
    article_content = soup.find("div", {"class": "article-content"})

    # Extract the article text
    article_text = ""
    if article_content:
        for paragraph in article_content.find_all("p"):
            article_text += paragraph.get_text() + "\n"

    # Split the text into 1024-character chunks (characters, not tokens)
    max_input_length = 1024
    article_parts = [article_text[i:i + max_input_length] for i in range(0, len(article_text), max_input_length)]

    # Seq2Seq BART; facebook/bart-large-cnn is fine-tuned on English news (CNN/DailyMail)
    resumen_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")

    # Helsinki-NLP model for English-to-Spanish translation
    traduccion_pipeline = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

    # Detect the language of a text
    def detectar_idioma(texto):
        return detect(texto)

    # Summarize each chunk and translate English summaries into Spanish
    def resumir_y_traducir(partes):
        resúmenes = []
        for part in partes:
            # Detect the language of the chunk
            idioma = detectar_idioma(part)
            # Summarize the chunk in its original language
            resumen_en_idioma_original = resumen_pipeline(part, max_length=80, min_length=20, do_sample=False)[0]['summary_text']

            if idioma == 'en':
                # The model translates only from English to Spanish, so no language arguments are needed
                resumen_traducido = traduccion_pipeline(resumen_en_idioma_original)[0]['translation_text']
                resúmenes.append(resumen_traducido)
            else:
                # Any other language is kept as summarized, without translation
                resúmenes.append(resumen_en_idioma_original)

        return "\n".join(resúmenes)

    # Summarize and translate the article
    if article_parts:
        resumen_completo = resumir_y_traducir(article_parts)
        print("Resumen:\n", resumen_completo)
    else:
        print("No se encontró contenido para resumir.")
else:
    print("Error al obtener la página:", response.status_code)

Your max_length is set to 80, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)


Resumen:
 Geoffrey Hinton es uno de los investigadores de IA más influyentes de los últimos 50 años. Hinton, de 76 años, ha pasado su carrera tratando de construir sistemas de IA que modelen el cerebro humano. Dado el ritmo actual al que las compañías de IA están aumentando el tamaño de los modelos, podría ser menos de cinco años hasta que los sistemas de IA tengan 100 billones de conexiones.
Hinton se preocupa por lo que podría pasar una vez que los sistemas de IA se amplíen al tamaño de los cerebros humanos. Se preocupa por la perspectiva de que la humanidad sea aniquilada por la tecnología que ayudó a crear. “Esto se volverá más inteligente que nosotros y se hará cargo”, dice.
En la década de 1970, la inteligencia artificial, después de no estar a la altura de su promesa de posguerra, estaba pasando por un período de entusiasmo amortiguado. Hinton persiguió una idea impopular: redes neuronales, que imitaban la estructura del cerebro humano. Publicó investigaciones pioneras, por las 
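
The cell above translates English summaries into Spanish, but a Spanish article would still be summarized directly by the English-trained BART model. A possible way to cover that case is to translate Spanish chunks into English first, summarize them, and translate the summary back. The sketch below only shows the routing logic and takes the models as plain callables, so all four parameter names are hypothetical. In the notebook they might wrap `detect`, `resumen_pipeline`, `traduccion_pipeline`, and an additional Spanish-to-English translator (for example a pipeline built on `Helsinki-NLP/opus-mt-es-en`, which is an assumption and does not appear in the original).

```python
def summarize_to_spanish(parts, detect_lang, summarize_en, en_to_es, es_to_en):
    """Return one Spanish summary line per chunk, using an English summarizer."""
    summaries = []
    for part in parts:
        lang = detect_lang(part)
        if lang == "es":
            # Bring Spanish text into the summarizer's language first
            part = es_to_en(part)
        elif lang != "en":
            # Unsupported language: skip the chunk rather than guess
            continue
        # Summarize in English, then deliver the result in Spanish
        summaries.append(en_to_es(summarize_en(part)))
    return "\n".join(summaries)


# Usage with stand-in callables (real models would replace these stubs)
demo = summarize_to_spanish(
    ["hello world", "hola mundo"],
    detect_lang=lambda t: "es" if t.startswith("hola") else "en",
    summarize_en=lambda t: t.upper(),
    en_to_es=lambda t: f"[es]{t}",
    es_to_en=lambda t: f"[en]{t}",
)
print(demo)
```

The extra translation step adds latency and can compound translation errors, so a summarization model trained directly on Spanish data may be a better fit when most articles are in Spanish.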