# Problem 1 - Text Generation with RNNs or Transformers
## Description: Build a text generator for a specific purpose (for example, poems, short stories, or code snippets).

### Objectives:

- Implement an RNN, LSTM, or Transformer model for text generation.
- Train the model on a dataset of your choice, such as poems, classic literature, or a programming dataset.

### Group:
- Miguel Cabral de Carvalho
- Luiz Fernando Paes de Barros Presta
- Ana Clara Didier
- Pedro Dantas Leite

# Data collection

In [1]:
# import requests
# from bs4 import BeautifulSoup
# import re

# def get_poems():
#     # URL of the "Animals" section
#     url = "https://www.poetryfoundation.org/poems/browse/animals"

#     # HTTP request
#     response = requests.get(url)
#     poems = []

#     if response.status_code == 200:
#         # Parse the page content
#         soup = BeautifulSoup(response.content, "html.parser")

#         # Find all poem links
#         links = soup.find_all("a", class_="link-underline-on link-red")

#         for link in links:
#             poem_url = "https://www.poetryfoundation.org" + link.get("href")
#             poem_response = requests.get(poem_url)

#             if poem_response.status_code == 200:
#                 poem_soup = BeautifulSoup(poem_response.content, "html.parser")
#                 poem_content = poem_soup.find("div", class_="poem-body")

#                 if poem_content:
#                     # Remove HTML tags and unnecessary characters
#                     poem_text = poem_content.get_text(strip=True)
#                     poem_text = re.sub(r'\s+', ' ', poem_text)  # Collapse runs of whitespace into a single space
#                     poems.append(poem_text)

#     return poems

# poems = get_poems()

# print(f"Number of poems found: {len(poems)}")

# if poems:
#     print("First poem found:")
#     print(poems[0][:500])  # Print the first 500 characters of the first poem found
# else:
#     print("No poems found.")


In [2]:
import re
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas as pd

file_path = "PoetryFoundationData.csv"

data_completa = pd.read_csv(file_path, index_col=0)
data = data_completa.sample(frac=0.03, random_state=42)

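# Strip surrounding whitespace and drop carriage returns and newlines from every string cell.
# DataFrame.map replaces the deprecated applymap (requires pandas >= 2.1).
data = data.map(lambda x: x.strip().replace("\r", "").replace("\n", "") if isinstance(x, str) else x)
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['Tags'] = data['Tags'].fillna('No Tags')
data['Poem'] = data['Poem'].fillna('')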


print(data.head())


ImportError: DLL load failed while importing _multiarray_umath: The specified module could not be found.

                              Title  \
77  At Eighty-three She Lives Alone   
6              And the Gauchos Sing   
29                from Saying Grace   
78             Leaving the Hospital   
13                            Relic   

                                                 Poem               Poet  \
77  Enclosure, steam-heated; a trial casket.You ar...         Ruth Stone   
6   For Barry SileskyCatalpas blooming up and down...        Mike Puican   
29  for my motherthe living     1.After Independen...        Kevin Young   
78  As the doors glide shut behind me, the world f...        Anya Silver   
13  The first time I touched it, cloth fell under ...  Rachel Richardson   

                                                 Tags  
77                 Living,Growing Old,Arts & Sciences  
6   Arts & Sciences,Poetry & Poets,Social Commenta...  
29  Activities,Jobs & Working,Social Commentaries,...  
78                                            No Tags  
13                          



# Preprocessing

In [3]:
def preprocess_text(poems):
    # Join all poems into a single string
    text = ' '.join(poems)

    # Keep only word characters, whitespace, and basic punctuation, then lowercase
    text = re.sub(r'[^\w\s.,!?;:]', '', text)
    text = text.lower()
    return text

poems = data['Poem'].tolist()
cleaned_text = preprocess_text(poems)

tokenizer = Tokenizer()
tokenizer.fit_on_texts([cleaned_text])  # Build the vocabulary from the text

sequences = tokenizer.texts_to_sequences([cleaned_text])[0]  # Convert the text to a sequence of token ids

word_index = tokenizer.word_index
reverse_word_index = {index: word for word, index in word_index.items()}

print("\nFirst words of the poems:")
first_10_words = [reverse_word_index[token] for token in sequences[:10]]
print(first_10_words)

seq_length = 50
input_sequences = []

# Slide a window of seq_length + 1 tokens over the corpus:
# the first seq_length tokens are the input, the last one is the target
for i in range(seq_length, len(sequences)):
    input_sequences.append(sequences[i-seq_length:i+1])

input_sequences = np.array(input_sequences)
X, y = input_sequences[:,:-1], input_sequences[:,-1]
X = pad_sequences(X, maxlen=seq_length, padding='pre')

print(f"\nVocabulary size (unique words): {len(tokenizer.word_index)}")
print(f"Example sequence X[0]: {X[0]}")
print(f"Example sequence y[0]: {y[0]}")



First words of the poems:
['enclosure', 'steamheated', 'a', 'trial', 'casket', 'you', 'are', 'here', 'your', 'name']

Vocabulary size (unique words): 18347
Example sequence X[0]: [3240 5342    3 5343 3241   11   25  107   36  211   15    3 5344  362
 1882   53  198  160   17 5345   40   38  401   11   40   38 2382    5
   11   26    4   37 5346  992   61   37 5347 5348    1  345   37 5349
   25  636 5350  139   11   74    1 5351]
Example sequence y[0]: 266
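
To make the cleaning and windowing steps easier to follow, here is a minimal sketch on a toy sentence. It uses only the standard library; the whitespace split and the `Counter`-based vocabulary are a rough stand-in for the Keras `Tokenizer`, not its actual implementation, and `clean`, `build_windows`, and `toy_text` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

def clean(text):
    # Same regex as preprocess_text: drop everything except word characters,
    # whitespace, and . , ! ? ; : and then lowercase
    return re.sub(r'[^\w\s.,!?;:]', '', text).lower()

def build_windows(token_ids, seq_length):
    """Return (X, y): each X row holds seq_length ids, y is the id that follows."""
    X, y = [], []
    for i in range(seq_length, len(token_ids)):
        X.append(token_ids[i - seq_length:i])
        y.append(token_ids[i])
    return X, y

toy_text = "Enclosure, steam-heated; a trial casket. You are here"
cleaned = clean(toy_text)

# Stand-in for the tokenizer (assumption): strip the remaining punctuation,
# split on whitespace, and number words from 1, most frequent first
words = re.sub(r'[.,!?;:]', '', cleaned).split()
vocab = {w: i + 1 for i, (w, _) in enumerate(Counter(words).most_common())}
ids = [vocab[w] for w in words]

X, y = build_windows(ids, seq_length=3)
print(cleaned)
print(X[0], "->", y[0])
```

The hyphen in "steam-heated" is removed by the regex, which is why the notebook output above shows the token `steamheated`. With 8 tokens and `seq_length=3`, the sketch yields 5 input windows, one for each position that has a full window before it.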


# LSTM model


In [4]:
import tensorflow as tf
# tf.test.is_built_with_cuda()
if tf.config.list_physical_devices('GPU'):
    print("GPU available")
else:
    print("GPU not detected")


GPU available


In [5]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from tensorflow.keras.optimizers import Adam

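# Embedding -> two stacked LSTM layers with dropout -> softmax over the vocabulary
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=100, input_length=seq_length))
model.add(LSTM(512, return_sequences=True))  # Return the full sequence so the next LSTM can consume it
model.add(Dropout(0.3))
model.add(LSTM(512))  # Return only the output of the last time step
model.add(Dropout(0.3))
model.add(Dense(vocab_size, activation='softmax'))  # One probability per vocabulary entry

# Targets are integer word ids, hence the sparse variant of categorical cross-entropy
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

model.summary()

model.fit(X, y, epochs=50, batch_size=64)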

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 100)           1834800   
                                                                 
 lstm (LSTM)                 (None, 50, 512)           1255424   
                                                                 
 dropout (Dropout)           (None, 50, 512)           0         
                                                                 
 lstm_1 (LSTM)               (None, 512)               2099200   
                                                                 
 dropout_1 (Dropout)         (None, 512)               0         
                                                                 
 dense (Dense)               (None, 18348)             9412524   
                                                                 
Total params: 14,601,948
Trainable params: 14,601,948
No

<keras.callbacks.History at 0x26beb8762e0>
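
The parameter counts in the summary above can be checked by hand. The sketch below recomputes them from the layer sizes using the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical and the vocabulary size is taken from the notebook output (18347 words plus the padding index).

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size = 18347 + 1  # tokenizer vocabulary plus the reserved padding index 0

counts = {
    "embedding": embedding_params(vocab_size, 100),
    "lstm": lstm_params(100, 512),
    "lstm_1": lstm_params(512, 512),
    "dense": dense_params(512, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:10s} {n:>12,}")
print(f"{'total':10s} {total:>12,}")
```

The output layer accounts for roughly two thirds of the total, because its size scales with the vocabulary.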

# Generating text with our trained model

In [26]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

def generate_text(model, tokenizer, seq_length, seed_text, num_words=100):
    # Encode the seed and left-pad (or truncate) it to seq_length token ids
    input_text = tokenizer.texts_to_sequences([seed_text])[0]
    input_text = pad_sequences([input_text], maxlen=seq_length, padding='pre')[0]

    output_text = seed_text
    for _ in range(num_words):
        input_sequence = np.array([input_text[-seq_length:]])
        pred = model.predict(input_sequence, verbose=0)
        # Greedy decoding: always pick the most probable next word
        predicted_word_index = np.argmax(pred)
        predicted_word = tokenizer.index_word.get(predicted_word_index, '')
        output_text += ' ' + predicted_word
        # Slide the window: append the prediction and keep the last seq_length ids
        input_text = np.append(input_text, predicted_word_index)
        input_text = input_text[-seq_length:]

    return output_text

def get_user_input():
    seed_text = input("Enter the seed text for the poem: ")
    num_words = int(input("How many words do you want to generate? "))
    return seed_text, num_words

seed_text, num_words = get_user_input()

generated_text = generate_text(model, tokenizer, seq_length=seq_length, seed_text=seed_text, num_words=num_words)
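
The generation loop can be tried without a trained network. The following is a minimal sketch of the same greedy, sliding-window procedure in plain Python; `greedy_generate` and `toy_predict` are hypothetical names, and `toy_predict` is a made-up stand-in for `model.predict` that simply favors the id after the last one in the window.

```python
def greedy_generate(predict_fn, seed_ids, seq_length, num_words):
    # Left-pad with 0 (or truncate from the left) so the window has seq_length
    # ids, mirroring pad_sequences(..., padding='pre')
    window = ([0] * seq_length + list(seed_ids))[-seq_length:]
    generated = []
    for _ in range(num_words):
        probs = predict_fn(window)
        # Greedy choice: index of the highest probability (argmax)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        generated.append(next_id)
        # Slide the window: drop the oldest id, append the new one
        window = window[1:] + [next_id]
    return generated

def toy_predict(window, vocab_size=5):
    # Hypothetical model: puts all probability on (last id + 1),
    # wrapping back to 1 so the padding id 0 is never predicted
    nxt = window[-1] % (vocab_size - 1) + 1
    return [1.0 if i == nxt else 0.0 for i in range(vocab_size)]

out = greedy_generate(toy_predict, seed_ids=[2], seq_length=4, num_words=6)
print(out)
```

Because the choice is always the single most probable id, the same seed produces the same continuation on every run.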


In [27]:
from googletrans import Translator
translator = Translator()
traducao = translator.translate(generated_text, src='en', dest='pt')


print("\nGenerated poem:")
print(generated_text)

print("\nGenerated poem (translated to Portuguese):")
print(traducao.text)


Generated poem:
life there is a word in the lexicon of love my

Generated poem (translated to Portuguese):
Vida há uma palavra no léxico do amor meu
