<a href="https://colab.research.google.com/github/sergiomora03/AdvancedTopicsAnalytics/blob/main/exercises/E7-TextSummary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 7 - Text Summary

**Students:**
- Mili Galindo
- Anguie Garcia
- Sonia Ramirez
- Lourdes Rodil

## Instructions:

Text summarization is important in the field of machine learning and natural language processing for several reasons:

1. **Information Retrieval:** Text summarization helps users quickly grasp the main points or key information from a large document, making it easier to decide whether to read the full document or not. This is particularly valuable in scenarios where individuals are inundated with vast amounts of textual data, such as news articles, research papers, or social media posts.

2. **Time Efficiency:** Summarization algorithms can process and generate summaries much faster than humans can read and summarize large texts. This saves time and allows users to focus their attention on the most relevant content.

3. **Content Extraction:** Text summarization can automatically extract essential information from a document, enabling applications like content recommendation, keyword extraction, and topic modeling.

4. **Content Generation:** Summarization models can be used to generate concise, coherent, and informative summaries for various purposes, such as creating abstracts for research papers, news article headlines, or social media post previews.

5. **Multilingual Support:** Text summarization can be applied to texts in multiple languages, making it a valuable tool for global communication and information retrieval.

6. **Personalization:** Summarization can be personalized to individual preferences. Machine learning models can learn from user feedback to generate summaries that align more closely with a user's interests and priorities.

7. **Scalability:** As the volume of digital content continues to grow, automated summarization becomes crucial for scaling information processing and retrieval. Machine learning-based summarization models can adapt and handle large volumes of text efficiently.

8. **Legal and Compliance:** In legal and regulatory contexts, automated summarization can help organizations review contracts, policies, and legal documents to ensure compliance and identify critical clauses or information.

9. **Search Engine Optimization (SEO):** Summarized content can be used to create concise and engaging snippets for search engine results, improving the discoverability of web content.

10. **Content Creation:** Summarization can be integrated into content creation tools, helping authors and content creators generate concise and informative content more efficiently.

Overall, text summarization is an essential component of machine learning and natural language processing, enabling efficient information retrieval, content extraction, and content generation across a wide range of applications and industries. It plays a critical role in handling the ever-increasing amount of textual data available in the digital age.

---
Exercise:

Now, as a data scientist expert in NLP, you are asked to create a model to be able to summarize text in Spanish. Your stakeholders will pass you an article and your model should summarize it.

## **Library installation**

### 1. Libraries - requests beautifulsoup4
The `requests` and `beautifulsoup4` libraries are installed. They are used here to download a web page and parse its HTML.

In [1]:
!pip install requests beautifulsoup4



### 2. Library - transformers
The Hugging Face `transformers` library is installed. It provides the `pipeline` object used later for summarization and translation.

In [2]:
!pip install transformers



### 3. Libraries - bert-extractive-summarizer translate

In [3]:
pip install bert-extractive-summarizer translate

Note: you may need to restart the kernel to use updated packages.


### 4. Library - tf-keras

In [4]:
pip install tf-keras

Note: you may need to restart the kernel to use updated packages.


### 5. Library - sentencepiece

In [5]:
pip install sentencepiece

Note: you may need to restart the kernel to use updated packages.


## **Data acquisition**

Using the `BeautifulSoup` and `requests` libraries, an article introducing Geoffrey Hinton was downloaded from the **TIME100 AI** section of TIME magazine.

![Geoffrey Hinton](Imagen_Geoffrey_Hinton.jpg)

The following code retrieves the article text, which is **1,215** words long:

In [6]:
import requests
from bs4 import BeautifulSoup

# URL of the article
url = "https://time.com/collection/time100-ai/6309026/geoffrey-hinton/"

# Send an HTTP request to fetch the page content
response = requests.get(url)

# Initialized up front so later cells do not hit a NameError if the download fails
article_text = ""

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML content of the page with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")

    # Locate the article body (inspect the page's HTML to find the right structure)
    article_content = soup.find("div", {"class": "article-content"})

    if article_content is None:
        print("Could not find the article content in the page")
    else:
        # Extract the article text, one paragraph per line
        for paragraph in article_content.find_all("p"):
            article_text += paragraph.get_text() + "\n"

        # Print the article text
        print(article_text)
else:
    print("Error fetching the page:", response.status_code)

Over the course of February, Geoffrey Hinton, one of the most influential AI researchers of the past 50 years, had a “slow eureka moment.”
Hinton, 76, has spent his career trying to build AI systems that model the human brain, mostly in academia before joining Google in 2013. He had always believed that the brain was better than the machines that he and others were building, and that by making them more like the brain, they would improve. But in February, he realized “the digital intelligence we’ve got now may be better than the brain already. It’s just not scaled up quite as big.” 
Developers around the world are currently racing to build the biggest AI systems that they can. Given the current rate at which AI companies are increasing the size of models, it could be less than five years until AI systems have 100 trillion connections—roughly as many as there are between neurons in the human brain.
Alarmed, Hinton left his post as VP and engineering fellow in May and gave a flurry of in

In [11]:
import nltk

# word_tokenize requires NLTK's Punkt tokenizer data to have been downloaded beforehand
palabras = nltk.word_tokenize(article_text)
num_palabras = len(palabras)
print("Number of words:", num_palabras)


Number of words: 1215
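
Note that `nltk.word_tokenize` emits punctuation marks as separate tokens, so the figure above is a token count rather than a strict word count. As a rough cross-check, the sketch below counts only alphanumeric runs using the standard library. The helper name `count_words` and the regular expression are assumptions made for illustration and are not part of the original notebook.

```python
import re

def count_words(text: str) -> int:
    """Count word-like runs, ignoring standalone punctuation."""
    # \w+ matches letters, digits and underscore; the optional group keeps
    # forms such as "we've" or "76-year-old" together as a single word.
    return len(re.findall(r"\w+(?:[-'’]\w+)*", text))

sample = "Hinton, 76, has spent his career trying to build AI systems."
print("Words in sample:", count_words(sample))
```

Applying `count_words(article_text)` next to the NLTK count would show how much of the reported length comes from punctuation tokens.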


## **Chosen strategy**
To improve the quality of the Spanish summary, the text is first summarized in English, since that is its original language, and the summary is then translated from English into Spanish.

### **English summary of the text**
The `transformers` library is used with the **"t5-small"** model through a `summarizer` pipeline (a machine learning model designed specifically to summarize text automatically). The minimum and maximum lengths are set to 100 and 200, roughly 8% and 16% of the size of the original document. Note that `min_length` and `max_length` are counted in model tokens rather than words.

In [12]:
from transformers import pipeline

# Specify the model and the revision
model_name = "t5-small"
model_revision = "d769bba"

# Create the summarization pipeline
summarizer = pipeline("summarization", model=model_name, revision=model_revision)

# Generate the summary of the text
outputs = summarizer(article_text, min_length=100, max_length=200, clean_up_tokenization_spaces=True)

# Extract the English summary
resumen_en = outputs[0]['summary_text']

# Print the English summary
print('\n','\n',"\033[1mEnglish summary\033[0m:",'\n',resumen_en)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.



 English summary:
 76-year-old geoffrey Hinton is one of the most influential AI researchers of the past 50 years. he has spent the past few months trying to build AI systems that model the human brain. his work has potentially hastened the future he fears, in which AI becomes superhuman with disastrous results for humans. "this stuff will get smarter than us and take over," he says of his regrets over helping bring that technology into existence.
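
The 8% and 16% figures quoted in the description of this step come from dividing the two length bounds by the 1,215 count reported for the article. A small sketch of that arithmetic follows. The helper `length_ratio` is a hypothetical name, and because the bounds are measured in model tokens while the count comes from NLTK, the percentages are only an approximation.

```python
def length_ratio(limit: int, document_length: int) -> float:
    """Return `limit` as a percentage of `document_length`."""
    if document_length <= 0:
        raise ValueError("document_length must be positive")
    return 100.0 * limit / document_length

doc_len = 1215  # count reported earlier in the notebook
for limit in (100, 200):
    print(f"{limit} -> {length_ratio(limit, doc_len):.1f}% of the original")
```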


### **Spanish translation of the English summary**

In the Hugging Face library, the `pipeline` object can also be used for text translation. Here the **"Helsinki-NLP/opus-mt-en-es"** model is used to translate from English into Spanish, with a minimum output length of 100 tokens (`min_length=100`).

In [13]:
# English-to-Spanish translation pipeline (the task name now matches the en-es model)
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
outputs = translator(resumen_en, clean_up_tokenization_spaces=True, min_length=100)
resumen_ES = outputs[0]['translation_text']

# Print the translated summary
print('\n','\n',"\033[1mSummary translated into Spanish\033[0m:",'\n',resumen_ES)

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-es.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.



 Summary translated into Spanish:
 Geoffrey Hinton, de 76 años, es uno de los investigadores de IA más influyentes de los últimos 50 años.Ha pasado los últimos meses tratando de construir sistemas de IA que modelen el cerebro humano.Su trabajo potencialmente ha acelerado el futuro que teme, en el que IA se vuelve sobrehumana con resultados desastrosos para los humanos. "Esto se hará más inteligente que nosotros y asumirá el control", dice de su pesar por ayudar a crear esa tecnología.
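
The translated text above is missing a space after some sentence-ending periods (for example `años.Ha`). If a cleaner display is wanted, one possible post-processing step is sketched below. The helper name `fix_sentence_spacing` and the regular expression are assumptions for illustration, not part of the original notebook. The rule is deliberately conservative: it only touches a period that sits directly between a lowercase letter or digit and an uppercase letter, so it could still misfire on unusual strings such as mixed-case domain names.

```python
import re

# A period directly between a lowercase letter/digit and an uppercase letter,
# e.g. "años.Ha" -> "años. Ha". Numbers such as "3.5" are left untouched
# because the character after the period is not an uppercase letter.
_MISSING_SPACE = re.compile(r"(?<=[a-záéíóúüñ0-9])\.(?=[A-ZÁÉÍÓÚÜÑ¿¡])")

def fix_sentence_spacing(text: str) -> str:
    """Insert the missing space after sentence-ending periods."""
    return _MISSING_SPACE.sub(". ", text)

print(fix_sentence_spacing("los últimos 50 años.Ha pasado los últimos meses."))
```

In the notebook this could be applied as `fix_sentence_spacing(resumen_ES)` before printing.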


### **Consolidated result**

In [21]:
# Concatenate the results and print them:
print('\n','\n',"\033[1mOriginal text\033[0m:",'\n',article_text,'\n','\n',"\033[1mEnglish summary\033[0m:",'\n',resumen_en,'\n','\n',"\033[1mSummary translated into Spanish\033[0m:",'\n',resumen_ES)



 Original text:
 Over the course of February, Geoffrey Hinton, one of the most influential AI researchers of the past 50 years, had a “slow eureka moment.”
Hinton, 76, has spent his career trying to build AI systems that model the human brain, mostly in academia before joining Google in 2013. He had always believed that the brain was better than the machines that he and others were building, and that by making them more like the brain, they would improve. But in February, he realized “the digital intelligence we’ve got now may be better than the brain already. It’s just not scaled up quite as big.” 
Developers around the world are currently racing to build the biggest AI systems that they can. Given the current rate at which AI companies are increasing the size of models, it could be less than five years until AI systems have 100 trillion connections—roughly as many as there are between neurons in the human brain.
Alarmed, Hinton left his post as VP and engineering fellow 