# Daily Challenge: Creating A Text Generator

## Preprocess Text Data:
1. Access the text.
2. Create a preprocess function that uses regular expressions to remove non-word characters that are not relevant to the context.
3. Split the text. How should you split it, considering that the final objective of this project is to generate a sentence?
   - Return the split text as the corpus.

4. Print the first 200 characters of the corpus.
Are there parts of the text that are not relevant to the analysis? If so, the function should remove them as well.

Hint: you can use slicing to start and stop the text where you need (ignoring the author credits at the beginning and end) by looking for the following phrases:

*** START
*** END

5. Using Tokenizer(), create the vocabulary and a variable called total_words, which will be the length of tokenizer.word_index + 1.
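
The slicing hint and the splitting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's solution: `strip_gutenberg_boilerplate`, `split_and_clean`, and the sample string are hypothetical, and a simple regular expression stands in for a real sentence tokenizer. The key point is that sentences are split while the punctuation is still present, and only then cleaned.

```python
import re

def strip_gutenberg_boilerplate(raw_text):
    """Keep only the text between the '*** START' and '*** END' marker lines."""
    start = raw_text.find("*** START")
    end = raw_text.find("*** END")
    # The body begins on the line after the START marker
    body_start = raw_text.find("\n", start) + 1 if start != -1 else 0
    if start == -1 or end == -1 or body_start == 0 or body_start > end:
        # Markers missing or malformed: return the text unchanged (assumed fallback)
        return raw_text
    return raw_text[body_start:end]

def split_and_clean(text):
    """Split into sentences first, then remove non-alphabetic characters."""
    raw_sentences = re.split(r"(?<=[.!?])\s+", text)
    cleaned = []
    for sentence in raw_sentences:
        sentence = re.sub(r"[^a-zA-Z\s]", "", sentence)
        sentence = re.sub(r"\s+", " ", sentence).strip()
        if sentence:
            cleaned.append(sentence)
    return cleaned

# Made-up sample that imitates the layout of a Project Gutenberg file
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Alice was beginning to get very tired. She had nothing to do!\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)

corpus = split_and_clean(strip_gutenberg_boilerplate(sample))
print(corpus)
```

Each element of `corpus` is one cleaned sentence, which is the unit a sentence generator would learn from.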

In [None]:
!pip install numpy tensorflow



In [None]:
!pip install nltk



In [None]:
from tensorflow import keras

In [None]:
from tensorflow.keras.preprocessing import text
from tensorflow.keras.preprocessing import sequence

In [None]:
import re
from collections import Counter

import gensim.downloader as api
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import requests
import spacy
from gensim.models import Word2Vec
from nltk import ne_chunk, pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('maxent_ne_chunker')
#nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Load data
url = "https://www.gutenberg.org/cache/epub/11/pg11.txt"
response = requests.get(url)
text1 = response.text

In [None]:
def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove non-alphabetic characters and extra whitespaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()

    return text

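# Split the text into sentences
def split_text_into_sentences(text):
    sentences = nltk.sent_tokenize(text)
    return sentences

# Split the raw text first: sent_tokenize relies on punctuation,
# which preprocess_text removes
raw_sentences = split_text_into_sentences(text1)

# Clean each sentence and drop any that end up empty
sentences = [preprocess_text(s) for s in raw_sentences]
sentences = [s for s in sentences if s]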

# Print the first 200 characters of the corpus
corpus = " ".join(sentences)
print("First 200 characters of the corpus:")
print(corpus[:200])

# Create vocabulary and calculate total_words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
total_words = len(tokenizer.word_index) + 1

print("\nTotal number of words in the vocabulary:", total_words)

First 200 characters of the corpus:
The Project Gutenberg eBook of Alices Adventures in Wonderland This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restric

Total number of words in the vocabulary: 3193
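
To make the `+ 1` in total_words concrete, the sketch below mimics in plain Python how such a word index is laid out: words are ranked by frequency and numbered from 1, which leaves index 0 free for padding. `build_word_index` and the three sample sentences are hypothetical, and the helper only approximates what Tokenizer does rather than reproducing its implementation.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Index 0 is deliberately left unused so it can serve as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ["the cat sat", "the cat ran", "the dog sat"]
word_index = build_word_index(sentences)

# One extra slot accounts for the reserved padding index 0
total_words = len(word_index) + 1

print(word_index)
print(total_words)
```

With five distinct words, the output layer of the network needs six units so that every index from 0 to 5 has a position.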


## Build The Neural Network Model:
- Define a Sequential model in Keras.
- Add an Embedding layer for text representation.
- Include appropriate RNN layers for processing the sequences.
- Add a Dense layer for output prediction.
- Utilize Dropout for regularization.

In [None]:
# Convert text to sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Create n-gram sequences for input data
input_sequences = []
for sequence in sequences:
    for i in range(1, len(sequence)):
        n_gram_sequence = sequence[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences
max_sequence_length = max(len(seq) for seq in input_sequences)
padded_input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

# Separate into input and output
X, y = padded_input_sequences[:, :-1], padded_input_sequences[:, -1]
y = to_categorical(y, num_classes=total_words)
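
The data preparation above can be traced by hand on a single short sentence. The sketch below uses plain Python lists instead of `pad_sequences`, and the word indices are made up; `make_ngram_prefixes` and `pre_pad` are hypothetical helpers written only for this illustration.

```python
def make_ngram_prefixes(sequence):
    """Return every prefix of length >= 2: [w1, w2], [w1, w2, w3], ..."""
    return [sequence[: i + 1] for i in range(1, len(sequence))]

def pre_pad(sequence, length):
    """Left-pad with zeros so the target word stays in the last position."""
    return [0] * (length - len(sequence)) + sequence

encoded_sentence = [4, 7, 2, 9]  # hypothetical word indices for a 4-word sentence

prefixes = make_ngram_prefixes(encoded_sentence)
max_len = max(len(p) for p in prefixes)
padded = [pre_pad(p, max_len) for p in prefixes]

# Everything except the last column is the input; the last column is the target
inputs = [row[:-1] for row in padded]
targets = [row[-1] for row in padded]

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Padding on the left is what makes the `[:, :-1]` and `[:, -1]` slices work: the word to predict always sits in the final column, whatever the length of the prefix.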

## Compile And Train The Model:
- Compile the model with an appropriate optimizer and loss function.
- Train the model on the prepared data.
- Implement EarlyStopping to prevent overfitting.
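
The stopping rule itself can be illustrated without Keras. The sketch below is a simplified stand-in for a callback that monitors the validation loss with a patience of 3, not Keras's implementation: `simulate_early_stopping` and the loss values are hypothetical. Training stops once the validation loss has failed to improve for `patience` consecutive epochs, and the best epoch is the one whose weights would be kept.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after the third epoch
val_losses = [6.1, 5.8, 5.6, 5.7, 5.65, 5.9, 6.0]
stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print("stopped at epoch", stop_epoch, "best epoch", best_epoch)
```

In this run the loss bottoms out at epoch 2, three epochs pass without improvement, and training halts at epoch 5 even though more epochs were requested.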

In [None]:
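from tensorflow.keras.layers import Dropout

# Build the network described above. The layer sizes (100, 150) and the
# dropout rate (0.2) are illustrative starting values, not tuned settings.
model = Sequential([
    Embedding(total_words, 100),
    LSTM(150),
    Dropout(0.2),
    Dense(total_words, activation='softmax'),
])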

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Early stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train the model
history = model.fit(X, y, epochs=20, validation_split=0.2, callbacks=[early_stopping])

# Evaluate the model if needed
# evaluation = model.evaluate(X_test, y_test)

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

## Evaluate The Model’s Performance On Test Data:
Create a generate_text() function with the appropriate arguments. The function should preprocess the seed_text, predict next_words words one at a time, and append them after the seed text in an output string.

Experiment with different model architectures, hyperparameters, and preprocessing techniques to improve performance. (Try using LSTM, and then try GRU.)

In [None]:
def generate_text_lstm(model, tokenizer, seed_text, next_words, max_sequence_length):
    output_text = seed_text
    for _ in range(next_words):
        # Tokenize the seed text
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        # Pad the tokenized seed text
        token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')
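        # Predict the next word (predict_classes is not available in recent TensorFlow releases)
        probabilities = model.predict(token_list, verbose=0)
        predicted_index = int(np.argmax(probabilities, axis=-1)[0])
        # Convert index to word
        predicted_word = tokenizer.index_word.get(predicted_index, '')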
        # Update the seed text for the next iteration
        seed_text += " " + predicted_word
        # Add the predicted word to the output text
        output_text += " " + predicted_word
    return output_text

# Example usage for LSTM
seed_text_example = "Alice"
generated_text_lstm = generate_text_lstm(model, tokenizer, seed_text_example, next_words=20, max_sequence_length=max_sequence_length)
print("Generated Text (LSTM):", generated_text_lstm)

In [None]:
def generate_text_gru(model, tokenizer, seed_text, next_words, max_sequence_length):
    output_text = seed_text
    for _ in range(next_words):
        # Tokenize the seed text
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        # Pad the tokenized seed text
        token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')
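        # Predict the next word (predict_classes is not available in recent TensorFlow releases)
        probabilities = model.predict(token_list, verbose=0)
        predicted_index = int(np.argmax(probabilities, axis=-1)[0])
        # Convert index to word
        predicted_word = tokenizer.index_word.get(predicted_index, '')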
        # Update the seed text for the next iteration
        seed_text += " " + predicted_word
        # Add the predicted word to the output text
        output_text += " " + predicted_word
    return output_text

# Example usage for GRU
generated_text_gru = generate_text_gru(model, tokenizer, seed_text_example, next_words=20, max_sequence_length=max_sequence_length)
print("Generated Text (GRU):", generated_text_gru)
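
Note that the GRU example above is passed the same `model` object as the LSTM example, so a real comparison needs a separately trained GRU network. One possible way to build both variants is sketched below. `build_model` is a hypothetical helper, and the embedding size, unit count, and dropout rate are illustrative assumptions rather than tuned values; the vocabulary size and input shape in the smoke test are made up.

```python
import numpy as np
from tensorflow.keras.layers import GRU, LSTM, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

def build_model(vocab_size, cell="lstm", embedding_dim=100, units=150, dropout=0.2):
    """Build a next-word model; `cell` selects the recurrent layer type."""
    recurrent_layer = {"lstm": LSTM, "gru": GRU}[cell]
    model = Sequential([
        Embedding(vocab_size, embedding_dim),
        recurrent_layer(units),
        Dropout(dropout),
        Dense(vocab_size, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Smoke test on an untrained model: 2 padded sequences of 5 word indices each
gru_model = build_model(vocab_size=50, cell="gru")
dummy_batch = np.zeros((2, 5), dtype="int32")
probabilities = gru_model.predict(dummy_batch, verbose=0)
print(probabilities.shape)
```

In the notebook, a model created with `build_model(total_words, cell="gru")` would be fit on `X` and `y` in the same way as the LSTM model and then passed to `generate_text_gru`.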