# TEXT SUMMARIES FOR THE NEWSPACE DATASET

In this notebook we try to generate summaries of the articles in the dataset previously analyzed in R.

This notebook builds on the following one:

- https://www.kaggle.com/code/midouazerty/text-summarizer-using-nlp-advanced



Automatic text summarization is an application of Natural Language Processing (NLP) with great potential to impact our lives. It is also one of the most challenging and interesting problems in the field. It consists of generating a brief, relevant summary from a variety of text sources, such as books, news articles, blog posts, research papers, emails, and tweets.

In our case, following the notebook we based this work on, we use an extractive approach that the notebook presents as TextRank. We also try pretrained models such as BERT and GPT-2. These are not TextRank variants, but through the library we use they also produce extractive summaries, that is, they select existing sentences from the article.

The TextRank algorithm is used to summarize texts automatically. It is an NLP technique that identifies the most important sentences in a text and assembles them into a shorter, more meaningful version. It relies on graph concepts and text analysis to assign an importance score to each sentence according to its relationship with the other sentences in the text. The sentences with the highest scores are selected to form the final summary. Note that the functions defined below score each sentence by the normalized frequencies of its words, without building an explicit sentence graph.
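
To make the graph idea concrete, here is a minimal sketch of a TextRank-style ranking. It is not the code used in the rest of this notebook: the function name `textrank_sentences`, the regex-based sentence split, and the parameter values are assumptions chosen for illustration. Sentences are nodes, edge weights are word overlaps normalized by the log of the sentence lengths, and scores come from a PageRank-style iteration.

```python
import re
import numpy as np

def textrank_sentences(text, top_n=2, damping=0.85, iterations=50):
    """Rank sentences with a PageRank-style iteration over a similarity graph."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    words = [set(re.findall(r'\w+', s.lower())) for s in sentences]
    n = len(sentences)

    # Edge weight: shared words, normalized by the log of both sentence lengths
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and len(words[i]) > 1 and len(words[j]) > 1:
                overlap = len(words[i] & words[j])
                sim[i, j] = overlap / (np.log(len(words[i])) + np.log(len(words[j])))

    # Row-normalize so each sentence spreads its score among its neighbours
    row_sums = sim.sum(axis=1, keepdims=True)
    transition = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums > 0)

    # Power iteration with damping, as in PageRank
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * transition.T @ scores

    # Return the best sentences in their original order
    best = sorted(np.argsort(scores)[::-1][:top_n])
    return [sentences[i] for i in best]

text = ("Satellites need funding. Funding helps satellites launch. "
        "The launch of satellites needs funding. Cats sleep all day.")
result = textrank_sentences(text, top_n=2)
print(result)
```

A sentence that shares no words with the others receives no incoming score, so it only keeps the small baseline term and is unlikely to be selected.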

## Libraries

In [1]:
!pip install pyreadr



In [2]:
import numpy as np
import pandas as pd
import warnings
import re
import nltk
import pyreadr

from nltk import word_tokenize
from nltk.tokenize import sent_tokenize
from textblob import TextBlob
import string
from string import punctuation
from nltk.corpus import stopwords
from statistics import mean
from heapq import nlargest
from wordcloud import WordCloud
import seaborn as sns
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
punctuation = punctuation + '\n' + '—' + '“' + ',' + '”' + '‘' + '-' + '’'
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Preparation

At first we wanted to import the original csv that held all the information, but we ran into problems with the separators. Our csv used commas as separators, and apparently a comma inside the content of some news item was interpreted by Python as a separator, which ended up causing problems with the number of columns. That is why we decided to import the rda instead.
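
For illustration only, and not the route we took: this kind of separator problem usually goes away when the text fields are quoted on export. The sketch below uses Python's standard `csv` module with an invented row to show that a quoted field containing commas is read back as a single column. The row contents and column names are hypothetical.

```python
import csv
import io

# Invented example row; the content field contains commas
rows = [
    {"title": "Example title", "content": "tampa, fla. a sentence, with commas"},
]

# Write with every field quoted so commas inside the text are not separators
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "content"], quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)

# Reading back yields exactly two columns per row
buffer.seek(0)
parsed = list(csv.DictReader(buffer))
print(parsed[0]["content"])
```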

This also turned out to be the better option, because in the earlier work in R we split the data into two datasets depending on whether or not each article had a summary. The data is also already fairly clean.

We import the dataset we previously built in R from the SpaceNews articles that have no summary. To do so we use the `pyreadr` library.

In [3]:
# Path to the RDA file
archivo_rda = "datos_space_sin_extracto.rda"

# Load the file with pyreadr's read_r function
datos = pyreadr.read_r(archivo_rda)


Now we need to convert the dataset into a pandas DataFrame so that we can work with it in Python. To do so we need to know the available keys. In principle there should be a single one, `datos_space_sin_extracto`, which is the object name we used in R. A file could contain several keys, in which case we would have to select one of them.

In [4]:
# Check the available keys
print(datos.keys())

# Select the key that holds the DataFrame; here it is the object name used in R
df = datos["datos_space_sin_extracto"]


odict_keys(['datos_space_sin_extracto'])


To check that everything loaded correctly, we display the first rows and report how many rows and columns the DataFrame has.

In [5]:
df.head()

# Get the number of rows and columns of the DataFrame
filas, columnas = df.shape

print("Number of rows:", filas)
print("Number of columns:", columnas)

Number of rows: 1367
Number of columns: 6


## Summarization algorithm

First we create a dictionary that maps contractions to their expanded forms. For convenience we reuse the one from the reference notebook.

In [6]:
contractions_dict = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"doesn’t": "does not",
"don't": "do not",
"don’t": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y’all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have",
"ain’t": "am not",
"aren’t": "are not",
"can’t": "cannot",
"can’t’ve": "cannot have",
"’cause": "because",
"could’ve": "could have",
"couldn’t": "could not",
"couldn’t’ve": "could not have",
"didn’t": "did not",
"doesn’t": "does not",
"don’t": "do not",
"hadn’t": "had not",
"hadn’t’ve": "had not have",
"hasn’t": "has not",
"haven’t": "have not",
"he’d": "he had",
"he’d’ve": "he would have",
"he’ll": "he will",
"he’ll’ve": "he will have",
"he’s": "he is",
"how’d": "how did",
"how’d’y": "how do you",
"how’ll": "how will",
"how’s": "how is",
"i’d": "i would",
"i’d’ve": "i would have",
"i’ll": "i will",
"i’ll’ve": "i will have",
"i’m": "i am",
"i’ve": "i have",
"isn’t": "is not",
"it’d": "it would",
"it’d’ve": "it would have",
"it’ll": "it will",
"it’ll’ve": "it will have",
"it’s": "it is",
"let’s": "let us",
"ma’am": "madam",
"mayn’t": "may not",
"might’ve": "might have",
"mightn’t": "might not",
"mightn’t’ve": "might not have",
"must’ve": "must have",
"mustn’t": "must not",
"mustn’t’ve": "must not have",
"needn’t": "need not",
"needn’t’ve": "need not have",
"o’clock": "of the clock",
"oughtn’t": "ought not",
"oughtn’t’ve": "ought not have",
"shan’t": "shall not",
"sha’n’t": "shall not",
"shan’t’ve": "shall not have",
"she’d": "she would",
"she’d’ve": "she would have",
"she’ll": "she will",
"she’ll’ve": "she will have",
"she’s": "she is",
"should’ve": "should have",
"shouldn’t": "should not",
"shouldn’t’ve": "should not have",
"so’ve": "so have",
"so’s": "so is",
"that’d": "that would",
"that’d’ve": "that would have",
"that’s": "that is",
"there’d": "there would",
"there’d’ve": "there would have",
"there’s": "there is",
"they’d": "they would",
"they’d’ve": "they would have",
"they’ll": "they will",
"they’ll’ve": "they will have",
"they’re": "they are",
"they’ve": "they have",
"to’ve": "to have",
"wasn’t": "was not",
"we’d": "we would",
"we’d’ve": "we would have",
"we’ll": "we will",
"we’ll’ve": "we will have",
"we’re": "we are",
"we’ve": "we have",
"weren’t": "were not",
"what’ll": "what will",
"what’ll’ve": "what will have",
"what’re": "what are",
"what’s": "what is",
"what’ve": "what have",
"when’s": "when is",
"when’ve": "when have",
"where’d": "where did",
"where’s": "where is",
"where’ve": "where have",
"who’ll": "who will",
"who’ll’ve": "who will have",
"who’s": "who is",
"who’ve": "who have",
"why’s": "why is",
"why’ve": "why have",
"will’ve": "will have",
"won’t": "will not",
"won’t’ve": "will not have",
"would’ve": "would have",
"wouldn’t": "would not",
"wouldn’t’ve": "would not have",
"y’all": "you all",
"y’all’d": "you all would",
"y’all’d’ve": "you all would have",
"y’all’re": "you all are",
"y’all’ve": "you all have",
"you’d": "you would",
"you’d’ve": "you would have",
"you’ll": "you will",
"you’ll’ve": "you will have",
"you’re": "you are",
"you’ve": "you have",
}

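# Try longer keys first so that e.g. "can't've" is matched before "can't"
contractions_re = re.compile('(%s)' % '|'.join(
    re.escape(key) for key in sorted(contractions_dict, key=len, reverse=True)))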

This function expands any contraction it finds, which is why we first built a dictionary of the most common contractions and their expansions.

In [7]:
def expand_contractions(s, contractions_dict = contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)
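
As a small self-contained illustration of the same idea, the sketch below uses a four-entry dictionary (a subset of the entries above). The names `sample_contractions`, `sample_re` and `expand_sample` are hypothetical. It also shows why the order of the alternatives in the pattern matters when one contraction is a prefix of another.

```python
import re

# Tiny illustrative dictionary (a subset of the entries above)
sample_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "won't": "will not",
    "it's": "it is",
}

# Try longer keys first; otherwise "can't" would win over "can't've"
sample_re = re.compile("|".join(
    re.escape(key) for key in sorted(sample_contractions, key=len, reverse=True)
))

def expand_sample(text):
    """Replace every known contraction with its expansion."""
    return sample_re.sub(lambda m: sample_contractions[m.group(0)], text)

print(expand_sample("it's late and we can't've missed it"))
```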

This function finishes preprocessing the articles: although they were already cleaned and preprocessed thoroughly in R, a few finishing touches remain.

In [8]:
def preprocessing(article):
    global article_sent

    # Removing the contractions
    article = article.apply(lambda x: expand_contractions(x))

    # Stripping the possessives (straight and curly apostrophes)
    article = article.apply(lambda x: x.replace("'s", ''))
    article = article.apply(lambda x: x.replace('’s', ''))

    # Collapsing runs of spaces into a single space
    article = article.apply(lambda x: re.sub(' +', ' ',x))

    # Copying the article for the sentence tokenization
    article_sent = article.copy()

    # Removing punctuations from the article
    article = article.apply(lambda x: ''.join(word for word in x if word not in punctuation))

    # Collapsing runs of spaces again, since removing punctuation can leave double spaces
    article = article.apply(lambda x: re.sub(' +', ' ',x))

    # Removing the Stopwords
    article = article.apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

    return article

Here we normalize the word frequencies produced by the `word_frequency` function.

In [9]:
def normalize(li_word):
    global normalized_freq
    normalized_freq = []
    for dictionary in li_word:
        max_frequency = max(dictionary.values())
        for word in dictionary.keys():
            dictionary[word] = dictionary[word]/max_frequency
        normalized_freq.append(dictionary)
    return normalized_freq

We compute the word frequencies. We did this in R as well, but that information is not available here, so we have to compute the frequencies again.

In [10]:
def word_frequency(article_word):
    word_frequency = {}
    li_word = []
    for sentence in article_word:
        for word in word_tokenize(sentence):
            if word not in word_frequency.keys():
                word_frequency[word] = 1
            else:
                word_frequency[word] += 1
        li_word.append(word_frequency)
        word_frequency = {}
    normalize(li_word)
    return normalized_freq
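
The combined effect of `word_frequency` and `normalize` on a single article can be sketched in a few lines. This is a simplified stand-in rather than the functions above: it splits on whitespace instead of using `word_tokenize`, and the name `normalized_word_frequency` is hypothetical. Each count is divided by the count of the most frequent word, so the top word gets a weight of 1.

```python
from collections import Counter

def normalized_word_frequency(text):
    """Return each word's count divided by the count of the most frequent word."""
    counts = Counter(text.split())
    if not counts:
        return {}  # an empty article has no frequencies to normalize
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

freqs = normalized_word_frequency("satellite launch satellite funding satellite launch")
print(freqs)
```

The guard for empty input is a deliberate addition in this sketch, since taking the maximum of an empty collection raises an error.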

To produce a good summary we need to assign a score to each sentence, so that the summary uses the most important sentences and conveys concise, relevant information. The score is computed after tokenizing the sentence.

In [11]:
def sentence_score(li):
    global sentence_score_list
    sentence_score = {}
    sentence_score_list = []
    for list_, dictionary in zip(li, normalized_freq):
        for sent in list_:
            for word in word_tokenize(sent):
                if word in dictionary.keys():
                    if sent not in sentence_score.keys():
                        sentence_score[sent] = dictionary[word]
                    else:
                        sentence_score[sent] += dictionary[word]
        sentence_score_list.append(sentence_score)
        sentence_score = {}
    return sentence_score_list

We tokenize the articles into sentences. As with the word frequencies, the tokens are not available here, so we have to generate them again.

In [12]:
def sent_token(article_sent):
    sentence_list = []
    sent_token = []
    for sent in article_sent:
        token = sent_tokenize(sent)
        for sentence in token:
            token_2 = ''.join(word for word in sentence if word not in punctuation)
            token_2 = re.sub(' +', ' ',token_2)
            sent_token.append(token_2)
        sentence_list.append(sent_token)
        sent_token = []
    sentence_score(sentence_list)
    return sentence_score_list

We build the summary from the highest-scoring sentences; specifically, we keep the top 25% (the `0.25` factor in the code below).

In [13]:
def summary(sentence_score_OwO):
    summary_list = []
    for summ in sentence_score_OwO:
        select_length = int(len(summ)*0.25)
        summary_ = nlargest(select_length, summ, key = summ.get)
        summary_list.append(".".join(summary_))
    return summary_list
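
The scoring and selection steps can be tried in isolation with the sketch below. The names `score_sentences` and `top_fraction`, the word weights, and the sentences are all invented for illustration. It differs from `summary` in two deliberate ways: it always keeps at least one sentence, and it returns the selected sentences in their original order.

```python
from heapq import nlargest

def score_sentences(sentences, word_weights):
    """Sum the weights of the known words in each sentence."""
    return {
        sentence: sum(word_weights.get(word, 0.0) for word in sentence.split())
        for sentence in sentences
    }

def top_fraction(scores, fraction=0.25):
    """Keep the highest-scoring fraction of sentences, in their original order."""
    keep = max(1, int(len(scores) * fraction))
    best = set(nlargest(keep, scores, key=scores.get))
    return [sentence for sentence in scores if sentence in best]

# Hypothetical normalized word weights and sentences
weights = {"satellite": 1.0, "launch": 0.5, "funding": 0.5}
sentences = [
    "the satellite launch was delayed",
    "weather was poor",
    "funding for the satellite launch is secured",
    "officials declined to comment",
]
selected = top_fraction(score_sentences(sentences, weights))
print(selected)
```

With four sentences and a fraction of 0.25, exactly one sentence is kept: the one whose words carry the largest total weight.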

Since we are working with the pandas library, we convert the input to that format.

In [14]:
def make_series(art):
    global dataframe
    data_dict = {'article' : [art]}
    dataframe = pd.DataFrame(data_dict)['article']
    return dataframe

Finally, we specify the order in which the functions above must run for the algorithm to work.

In [15]:
def article_summarize(artefact):

    if not isinstance(artefact, pd.Series):
        artefact = make_series(artefact)

    df = preprocessing(artefact)

    word_normalization = word_frequency(df)

    sentence_score_OwO = sent_token(article_sent)

    summarized_article = summary(sentence_score_OwO)

    return summarized_article

We generate summaries for almost all the articles. We could not generate them for all of them, because trying to do so produced an error saying we had exceeded a limit, so we tested by hand until we found the largest number of articles that could be processed.

In [16]:
summaries = article_summarize(df['content'][0:1348])

In [17]:
print ("The Actual length of the article is : ", len(df['content'][0]))
df['content'][0]

The Actual length of the article is :  4585


'paris — with a recent funding round and growing demand for its radio-frequency geolocation capabilities, hawkeye 360’s chief executive says the company has reached an “inflection point” on the path towards profitability and potentially going public. hawkeye 360 announced july 13  it raised $58 million in a series d-1 round  led by blackrock. the company said then that the funding would support development of new satellites and analytics products. the company has raised $368 million to date, said john serafini, chief executive of hawkeye 360, in an interview during world satellite business week here sept. 13. that round may also be the last private funding the company needs to raise. “provided that we execute against our revenue forecasts, which are conservative and we think we can do, we won’t need to raise additional private capital,” he said. profitability, he added, “is on the horizon for us,” but didn’t offer a specific timeline for achieving it. in a presentation at an investor c

In [18]:
print ("The length of the summarized article is : ", len(summaries[0]))
summaries[0]

The length of the summarized article is :  1723


'the company longterm goal is to have 60 satellites in 20 threesatellite clusters which serafini said the company expects to achieve by 2025 or 2026 those satellites built both by the space flight laboratory at the university of toronto institute for aerospace studies as well as hawkeye 360 own facility in northern virginia will be a mix of both its existing block 2 design and new block 3 design.the company has raised 368 million to date said john serafini chief executive of hawkeye 360 in an interview during world satellite business week here sept 13 that round may also be the last private funding the company needs to raise.one of the areas of growth for hawkeye is understanding the tactical intelligence surveillance reconnaissance requirements of our customers and the networks and systems that they operate in such that our data can flow into their existing systems as seamlessly as possible and produce another layer of valuable intelligence not just drowning them in additional data th

With a for loop we fill in the empty `postexcerpt` column with the generated summaries.

In [19]:
for i in range(len(summaries)):
    # .loc avoids chained assignment, which pandas may apply to a temporary copy
    df.loc[i, 'postexcerpt'] = summaries[i]

And we can check that they have indeed been filled in.

In [20]:
df.head()

| | title | url | content | author | date | postexcerpt |
|---|---|---|---|---|---|---|
| 0 | HawkEye 360 reaches inflection point on path t... | https://spacenews.com/hawkeye-360-reaches-infl... | paris — with a recent funding round and growin... | Jeff Foust | 2023-09-14 | the company longterm goal is to have 60 satell... |
| 1 | SES Q&A \| Leveling up multi-orbit connectivity | https://spacenews.com/ses-qa-leveling-up-multi... | multi-orbit satellite operator ses is on the v... | Jason Rainbow | 2023-09-13 | after recently deploying the geostationary sat... |
| 2 | Rapid Starlink iteration poses challenges for ... | https://spacenews.com/rapid-starlink-iteration... | tampa, fla. — spacex’s ability to quickly chan... | Jason Rainbow | 2023-09-13 | we do have channel conflict hofeller said whic... |
| 3 | Space Force to release guidelines for the use ... | https://spacenews.com/space-force-to-release-g... | washington — u.s. chief of space operations ge... | Sandra Erwin | 2023-09-13 | we will have terms of reference that the space... |
| 4 | ULA has ‘no issues’ with Space Force plan to s... | https://spacenews.com/ula-has-no-issues-with-s... | washington — united launch alliance, one of ju... | Sandra Erwin | 2023-09-13 | washington united launch alliance one of just ... |


Finally, we save the dataset in .rda format for later analysis in R.

In [21]:
# Save the DataFrame to an .rda file
pyreadr.write_rdata("datos_space_resumen.rda", df)

## PRETRAINED MODELS

In [22]:
!pip install transformers
!pip install bert-extractive-summarizer



### Libraries

In [23]:
import os
import tensorflow as tf
import numpy as np
import transformers
from summarizer import Summarizer, TransformerSummarizer

### Models

In [24]:
def summarize(model, text):
    summary = ''.join(model(text, min_length=50))
    print(f'Summmery by {model}:\n\n\n{summary} \n\n\n')
    print(f'---------------------------------------------------*---------------------------------------------------')

We try summarizing the first news item.

In [25]:
text='''
PARIS — With a recent funding round and growing demand for its radio-frequency geolocation capabilities, HawkEye 360’s chief executive says the company has reached an “inflection point” on the path towards profitability and potentially going public.

HawkEye 360 announced July 13 it raised $58 million in a Series D-1 round led by BlackRock. The company said then that the funding would support development of new satellites and analytics products.

The company has raised $368 million to date, said John Serafini, chief executive of HawkEye 360, in an interview during World Satellite Business Week here Sept. 13. That round may also be the last private funding the company needs to raise.

“Provided that we execute against our revenue forecasts, which are conservative and we think we can do, we won’t need to raise additional private capital,” he said. Profitability, he added, “is on the horizon for us,” but didn’t offer a specific timeline for achieving it.

In a presentation at an investor conference a year ago, Serafini said the company was considering going public though an initial public offering (IPO) of stock in two to three years. That is still the plan now, he said, although the timing will depend as much on market conditions as it will the state of the company.

“The market being open or closed has a lot to do with it,” he said of the timing of an IPO, which he said remains likely two to three years out. “Whether we can achieve the requisite milestones is the biggest issue,” such as achieving profitability and the right unit economics. “That’s what we can control and we’re rushing like heck to get to that spot.”

The new funding and the growth of the business have put HawkEye 360 into a good position, he argued. “I would say we’re at an inflection point,” he said, from the funding to plans to launch additional satellites and development of analytics tools that leverage machine learning and artificial intelligence. “All of that in the next 12 to 18 months has got us in a great position.”

Governments remain the largest customers for HawkEye 360, which Serafini said will likely be the case for the foreseeable future. That has included defense and intelligence applications as well as some civil and broader security applications, like tracking illegal fishing or deforestation.

“One of the tenets we set the company up with was to focus on where the money is. The money in remote sensing is, ultimately, in defense and intelligence,” he said. “If you can’t service those customers, you’re not going to exist as a company.”

That work has included work in Ukraine since Russia’s invasion more than a year and a half ago, tracking sources of GPS and other radio-frequency interference. Serafini declined to go into details about the company’s work there, but he said the conflict has highlighted the importance of both commercial remote sensing capabilities in general as well as the need to work closely with the users of those capabilities.

“Throwing remote sensing data over a fence probably doesn’t lead to success,” he said. “One of the areas of growth for HawkEye is understanding the tactical intelligence, surveillance, reconnaissance requirements of our customers, and the networks and systems that they operate in, such that our data can flow into their existing systems as seamlessly as possible and produce another layer of valuable intelligence, not just drowning them in additional data.”

That data comes from 21 satellites currently in orbit. Six more are scheduled to launch later this year on a Rocket Lab Electron from New Zealand. The company’s long-term goal is to have 60 satellites, in 20 three-satellite clusters, which Serafini said the company expects to achieve by 2025 or 2026.

Those satellites, built both by the Space Flight Laboratory at the University of Toronto Institute for Aerospace Studies as well as HawkEye 360’s own facility in Northern Virginia, will be a mix of both its existing Block 2 design and new Block 3 design. The plans for Block 3 are “very fluid,” he said, and could feature two different designs, a smaller one to focus on specific signals and a larger one to do “very advanced” work.

HawkEye 360 also announced Sept. 12 it promoted Rob Rainhart, the company’s chief operating officer since 2019, to president.

“Rob and I have been partners together for eight years,” Serafini said. Rainhart handles internal company operations, responsibilities he will continue as president. “He keeps the company running on time, and so it felt like the right time to make the move to promote him to president.”
'''

### Summary with three pretrained models

In [26]:
model_bert = Summarizer() #Defining bert model

#Defining gpt2 model
model_gpt2 = TransformerSummarizer(transformer_type = "GPT2", transformer_model_key="gpt2-medium")

#Defining xlnet model
model_xlnet = TransformerSummarizer(transformer_type = "XLNet", transformer_model_key = "xlnet-base-cased")

In [27]:
summarize(model_bert, text)
summarize(model_gpt2, text)
summarize(model_xlnet, text)

Summmery by <summarizer.bert.Summarizer object at 0x7d45a55ccb20>:


PARIS — With a recent funding round and growing demand for its radio-frequency geolocation capabilities, HawkEye 360’s chief executive says the company has reached an “inflection point” on the path towards profitability and potentially going public. The company has raised $368 million to date, said John Serafini, chief executive of HawkEye 360, in an interview during World Satellite Business Week here Sept. 13. In a presentation at an investor conference a year ago, Serafini said the company was considering going public though an initial public offering (IPO) of stock in two to three years. That has included defense and intelligence applications as well as some civil and broader security applications, like tracking illegal fishing or deforestation. “Throwing remote sensing data over a fence probably doesn’t lead to success,” he said. “ That data comes from 21 satellites currently in orbit. Six more are scheduled to la

## **EXTRA**

We also found it interesting to try generating our own article with the help of Deep Learning and NLP, building and training our own model. For this we relied on the following notebook:

- https://www.kaggle.com/code/asif00/text-generation-with-tensorflow-nlp-rnn

### Libraries

Some libraries are omitted in this part because they were already imported earlier, in the part where we generated the summaries.

In [28]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyreadr

from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, GlobalAveragePooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

In [29]:
strategy = tf.distribute.MirroredStrategy()

Building the DataFrame from the R dataset works the same way as in the first part on summaries. The only difference is that here we are only interested in the content of each news item.

In [30]:
df = df[['content']]

We build a corpus. As mentioned earlier, although this was already done in R, the corpus is not available here.

In [31]:
corpus = []
with strategy.scope():
    for key, value in df.items():  # Iterate over the (column name, column) pairs of the DataFrame
        if isinstance(value, pd.Series):  # Check whether the value is a pandas Series
            lowercase_values = value.str.lower()  # Lowercase every string in the column
            corpus.extend(lowercase_values)  # Extend the list with the processed values
        elif isinstance(value, pd.DataFrame):  # Check whether the value is a pandas DataFrame
            lowercase_values = value.applymap(lambda x: x.lower() if isinstance(x, str) else x)  # Lowercase the whole DataFrame
            corpus.append(lowercase_values)  # Append the processed DataFrame to the list
        else:
            corpus.append(value)  # Append other data types as they are

corpus[:10]


['paris — with a recent funding round and growing demand for its radio-frequency geolocation capabilities, hawkeye 360’s chief executive says the company has reached an “inflection point” on the path towards profitability and potentially going public. hawkeye 360 announced july 13  it raised $58 million in a series d-1 round  led by blackrock. the company said then that the funding would support development of new satellites and analytics products. the company has raised $368 million to date, said john serafini, chief executive of hawkeye 360, in an interview during world satellite business week here sept. 13. that round may also be the last private funding the company needs to raise. “provided that we execute against our revenue forecasts, which are conservative and we think we can do, we won’t need to raise additional private capital,” he said. profitability, he added, “is on the horizon for us,” but didn’t offer a specific timeline for achieving it. in a presentation at an investor 

We tokenize the corpus.

In [32]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

word_to_token = tokenizer.word_index

Here we can see the first entries of the word-to-index mapping produced by the tokenizer.

In [33]:
def key_pair(num):
    count=0
    for key, value in word_to_token.items():
        if count>=num: break
        print(f''''{key:}': {value},''')
        count +=1
key_pair(10)

'the': 1,
'to': 2,
'of': 3,
'and': 4,
'a': 5,
'in': 6,
'for': 7,
'that': 8,
'space': 9,
'said': 10,


For reference, we can check how many word-index pairs we have (plus one, since index 0 is reserved for padding).

In [34]:
total_words = len(word_to_token)+1
print(total_words)

23971


Next we build the sequences, that is, we convert the text into numeric values based on the word-index pairs obtained above.

In [35]:
input_sequences = []
with strategy.scope():
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

len(input_sequences)

815610
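
The loop above turns each tokenized article into all of its prefixes of length two or more, which is why the number of sequences is so large: an article with n tokens yields n - 1 sequences. A minimal sketch with made-up token ids (the helper name `prefix_sequences` is hypothetical):

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2 of a list of token ids."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

# Hypothetical token ids for a five-word sentence
example = [9, 4, 17, 2, 31]
prefixes = prefix_sequences(example)
for seq in prefixes:
    print(seq)
```

In each prefix, the last id is the word to predict and the preceding ids are the context.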

In [37]:
input_sequences[:5]
before = input_sequences[1]

We need to know the length of the longest sequence.

In [38]:
max_seq_len = max(len(x) for x in input_sequences)
print(max_seq_len)

3126


At this point the sequences do not all have the same length, so we pad them with leading zeros until they all share the same length.

In [None]:
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_seq_len, padding = 'pre'))
after = input_sequences[1]

In [None]:
print(f'Before: {before}')
print(f'After: {after}')

In [None]:
features, labels = input_sequences[:, :-1], input_sequences[:, -1]
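
To see what pre-padding and the feature/label split do without loading TensorFlow, here is a NumPy-only sketch. The helper `pre_pad` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`, and the token ids are made up.

```python
import numpy as np

def pre_pad(sequences, maxlen):
    """Left-pad integer sequences with zeros to a common length."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]              # keep the last maxlen tokens if too long
        if seq:
            padded[row, -len(seq):] = seq
    return padded

sequences = [[9, 4], [9, 4, 17], [9, 4, 17, 2]]
padded = pre_pad(sequences, maxlen=4)

# Every column except the last is the context; the last column is the next word
features, labels = padded[:, :-1], padded[:, -1]
print(padded)
print(features)
print(labels)
```

Padding on the left keeps the word to predict in the last column of every row, so a single slice separates inputs from targets.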

We one-hot encode the labels.

In [None]:
labels = tf.keras.utils.to_categorical(labels, num_classes=total_words)

We keep only a small portion of the data, specifically just 5% of the sequences.

In [None]:
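with strategy.scope():
    n = 0.05 # We are only taking a chunk of this huge dataset to fit it on the RAM
    slice_size = int(len(features)*n)
    # Keep only the first slice_size samples
    features, labels = features[:slice_size], labels[:slice_size]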

### Model

We create the model.

In [None]:
def generator_model():
    tf.random.set_seed(42)
    model = Sequential()
    model.add(Embedding(total_words, 100, input_length = max_seq_len-1))
    model.add(Bidirectional(LSTM(64, return_sequences = True)))
    model.add(Bidirectional(LSTM(32)))
    model.add(Dense(64, activation = 'relu'))
    model.add(Dense(total_words, activation = 'softmax'))
    return model

In [None]:
with strategy.scope():
    model = generator_model()
    model.compile(loss = 'categorical_crossentropy',
                 optimizer = tf.keras.optimizers.Adam(learning_rate = 0.002),
                 metrics = ['accuracy'])

In [None]:
model.summary()

In [None]:
EPOCHS = 10
history = model.fit(features, labels, epochs = EPOCHS)

#### Loss Accuracy Curve

We can plot the accuracy and loss curves.

In [None]:
def plot_graph(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

In [None]:
plot_graph(history, 'accuracy')
plot_graph(history, 'loss')

In [None]:
model.save('test_generator.h5')

#### Text generation

We test the model.

In [None]:
def test_generator(string, num):
    if len(string)==0:
        print("Error: No word found")
        return
    for _ in range(num):
        token_list = tokenizer.texts_to_sequences([string])[0]
        token_list = pad_sequences([token_list], maxlen=max_seq_len-1, padding = "pre")
        probabilities = model.predict(token_list)
        choice = np.random.choice([1,2,3])
        predicted = np.argsort(probabilities, axis = -1)[0][-choice]
        if predicted !=0:
            generated_word = tokenizer.index_word[predicted]
            string += " " + generated_word
    print(string)
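
The word selection inside `test_generator` picks at random among the three most probable next words instead of always taking the single best one. The same idea, isolated from the model and with a made-up probability vector, can be sketched as follows (the name `sample_from_top_k` is hypothetical):

```python
import numpy as np

def sample_from_top_k(probabilities, k=3, rng=None):
    """Pick one index uniformly at random among the k most probable ones."""
    if rng is None:
        rng = np.random.default_rng()
    top_k = np.argsort(probabilities)[-k:]   # indices of the k largest values
    return int(rng.choice(top_k))

# Hypothetical next-word probabilities over a five-word vocabulary
probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
choice = sample_from_top_k(probs, k=3, rng=np.random.default_rng(0))
print(choice)  # one of 1, 3 or 4
```

Sampling among the top candidates adds variety to the generated text, at the cost of sometimes choosing a less likely word.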

In [None]:
test_generator("HawkEye 360 announced", 20)

We ran the notebook several times and do not understand why it sometimes runs to completion, letting us see how our model behaves, while at other times this last part never finishes because it uses up all the RAM that the free tier of Colab provides (12 GB).
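
A rough back-of-the-envelope estimate may help explain the memory pressure. It uses the sizes printed earlier in this notebook and assumes 4 bytes per entry for both arrays (int32 for the padded sequences, float32 for the one-hot labels). These element sizes are assumptions about library defaults, and some Keras versions may use 8-byte floats for the one-hot array, which would double that figure.

```python
n_sequences = 815_610   # len(input_sequences) reported above
max_seq_len = 3_126     # longest sequence reported above
total_words = 23_971    # vocabulary size reported above

# Assumption: the padded matrix stores int32 values (4 bytes per entry)
padded_gb = n_sequences * max_seq_len * 4 / 1e9

# Assumption: the one-hot labels store float32 values (4 bytes per entry)
one_hot_gb = n_sequences * total_words * 4 / 1e9

print(f"padded sequences: {padded_gb:.1f} GB")
print(f"one-hot labels:   {one_hot_gb:.1f} GB")
```

Under these assumptions the padded matrix alone comes to roughly 10 GB and the one-hot labels to roughly 78 GB, so the full arrays would not fit in 12 GB. Reducing the number of sequences, or capping the sequence length, before padding and one-hot encoding would likely be needed for the cell to run reliably.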