<a href="https://colab.research.google.com/github/LoQiseaking69/BrownLLM/blob/main/SephsVoice1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Conversational Model Using the Brown Corpus

**Objective**: This notebook builds a conversational model trained on the Brown Corpus, a foundational text dataset in natural language processing. The corpus covers a diverse range of text, which makes it a good resource for training a versatile and robust model.

## Key Steps:
1. **Library Importation**:
    - TensorFlow for building the neural network model.
    - NLTK for advanced text processing.

2. **Data Loading and Preprocessing**:
    - Utilizing the Brown Corpus.
    - Implementing NLTK's tokenization and stopwords filtering for data refinement.

3. **Model Construction**:
    - Designing a sequential neural network.
    - Optimizing layers for natural language understanding.

4. **Model Training and Evaluation**:
    - Compiling and training the model.
    - Assessing its conversational capabilities.

**Goal**: To develop a well-trained model capable of engaging in a broad range of conversational contexts, demonstrating the Brown Corpus's versatility in NLP applications.

---

In [12]:
%%capture
!pip install keras


## Import Libraries
Here we import necessary libraries such as TensorFlow and its submodules, as well as other essential Python libraries for data handling and modeling.

In [13]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam  # Replaced AdamW with Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
import numpy as np
import requests
import nltk
from nltk.corpus import brown, stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd

## Load and Process the Brown Corpus
In this section, we load the Brown Corpus, a comprehensive text dataset, leveraging the Natural Language Toolkit (NLTK) for its rich linguistic content. The Brown Corpus provides a diverse range of text data, making it ideal for training conversational models.

Additionally, we download NLTK's Punkt resource, which `word_tokenize` depends on. Because `brown.sents()` already returns the corpus as lists of tokens grouped by sentence, each sentence is joined back into a string and re-tokenized into words before filtering.

Furthermore, to refine our dataset, we incorporate the use of NLTK's English stopwords list. Stopwords are commonly used words (such as "the", "is", "in") that are often omitted in language processing tasks to reduce noise and focus on the meaningful content. By filtering out these stopwords, we enhance the quality of our input data, ensuring that our model learns from the most relevant linguistic elements.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [14]:
# Download NLTK resources
nltk.download('brown')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')

# Load the Brown Corpus
data = brown.sents()

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Utility function to tokenize sentences and remove stopwords
def process_sentences(data):
    processed_text = []
    for sentence in data:
        words = word_tokenize(' '.join(sentence))
        words_filtered = [word for word in words if word.lower() not in stop_words]
        processed_text.append(' '.join(words_filtered))
    return processed_text

# Process the loaded data
text_data = process_sentences(data)

# Convert to DataFrame and save as CSV
df = pd.DataFrame({'text_data': text_data})
df.to_csv('/content/text_data.csv', index=False)
print("Text data saved as 'text_data.csv'")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Text data saved as 'text_data.csv'


## Adding Sentiment Analysis to Processed Text
Having preprocessed our text data by tokenizing and filtering out stopwords, the next crucial step is to incorporate sentiment analysis. This process will enrich our data with sentiment scores, providing a deeper understanding of the emotional context of each sentence.

We use NLTK's SentimentIntensityAnalyzer to assign sentiment scores to each sentence in our processed text. The scores are stored alongside each sentence so that later stages of data analysis and modeling can take sentiment into account.

The resulting data structure, `text_data_with_sentiment`, is a list of tuples. Each tuple consists of a processed sentence and the dictionary returned by `polarity_scores`, which holds the `neg`, `neu`, `pos`, and `compound` values for that sentence.
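
As a rough sketch of how this structure can be consumed, the snippet below builds a small list in the same `(sentence, scores)` shape and filters it by the `compound` value. The sentences and numbers are made-up placeholders, not real analyzer output, and `filter_by_compound` is a hypothetical helper; only the four dictionary keys mirror what `polarity_scores` returns.

```python
# Placeholder data in the same shape as text_data_with_sentiment.
# The scores are invented for illustration only.
sample = [
    ("wonderful day park", {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.6}),
    ("terrible accident highway", {"neg": 0.7, "neu": 0.3, "pos": 0.0, "compound": -0.7}),
    ("committee met tuesday", {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}),
]

def filter_by_compound(pairs, threshold):
    """Return sentences whose compound score is at least `threshold` (hypothetical helper)."""
    return [sentence for sentence, scores in pairs if scores["compound"] >= threshold]

positive = filter_by_compound(sample, 0.05)
print(positive)
```

The threshold of `0.05` is an arbitrary choice for this sketch; any cut-off would need to be tuned against the real scores.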

In [15]:
from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize the sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Function to add sentiment analysis to processed text
def add_sentiment_to_text(processed_text):
    text_data_with_sentiment = []
    for sentence in processed_text:
        sentiment_score = sia.polarity_scores(sentence)
        text_data_with_sentiment.append((sentence, sentiment_score))
    return text_data_with_sentiment

# Create the variable with both text data and sentiment scores
text_data_with_sentiment = add_sentiment_to_text(text_data)


## Data Preprocessing
This part involves tokenizing the text data and converting it into sequences to be fed into the model.
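
The prefix idea behind this step can be sketched without Keras. The example below is a simplified stand-in: `build_word_index`, `prefix_sequences`, and `pad_pre` are hypothetical helpers, the sentence is a toy input rather than corpus text, and ids are assigned in order of first appearance (the Keras `Tokenizer` orders them by frequency instead).

```python
def build_word_index(sentences):
    """Assign each distinct word an integer id starting at 1 (0 is reserved for padding)."""
    index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def prefix_sequences(token_ids):
    """Return every prefix of length >= 2; the last id of each prefix is the label."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

def pad_pre(seq, length):
    """Left-pad a sequence with zeros to a fixed length."""
    return [0] * (length - len(seq)) + seq

sentences = ["history shows patterns repeat"]  # toy input, not from the corpus
word_index = build_word_index(sentences)
ids = [word_index[w] for w in sentences[0].split()]

sequences = prefix_sequences(ids)
max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]

# Each row splits into model inputs (all but the last id) and a label (the last id)
for row in padded:
    print("inputs:", row[:-1], "label:", row[-1])
```

A four-word sentence yields three training examples, each asking the model to predict the next word from the words seen so far.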

In [16]:
# 'text_data_with_sentiment' is a list of (sentence, sentiment_scores) tuples
# The backslash in the filter string is escaped explicitly to avoid an invalid-escape warning
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True)

# Extract only the text part and fit the tokenizer on it
texts = [item[0] for item in text_data_with_sentiment]
tokenizer.fit_on_texts(texts)

total_words = len(tokenizer.word_index) + 1

# Generate n-gram prefix sequences: every prefix of length >= 2 of each tokenized sentence
token_lists = tokenizer.texts_to_sequences(texts)
input_sequences = [token_list[:i + 1]
                   for token_list in token_lists
                   for i in range(1, len(token_list))]

# Pad sequences on the left so they all share the same length (pad_sequences returns a NumPy array)
max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# Creating predictors and labels for model training
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
label = to_categorical(label, num_classes=total_words)
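
One-hot labels have shape `(number_of_sequences, total_words)`, so their memory footprint grows with both the number of training sequences and the vocabulary size. The sketch below estimates that footprint. The counts passed in are hypothetical round numbers rather than measurements from this notebook, and 4 bytes per value is an assumption about the array dtype.

```python
def one_hot_label_bytes(num_sequences, vocab_size, bytes_per_value=4):
    """Estimate memory for a dense one-hot label matrix."""
    return num_sequences * vocab_size * bytes_per_value

def integer_label_bytes(num_sequences, bytes_per_value=4):
    """Estimate memory for one integer class id per sequence."""
    return num_sequences * bytes_per_value

# Hypothetical sizes chosen only to illustrate the scaling.
n_seq, vocab = 100_000, 20_000
dense_gib = one_hot_label_bytes(n_seq, vocab) / 2**30
sparse_mib = integer_label_bytes(n_seq) / 2**20
print(f"one-hot labels: {dense_gib:.2f} GiB")
print(f"integer labels: {sparse_mib:.2f} MiB")
```

If the estimate exceeds available memory, one option is to keep the integer labels and train with Keras's `sparse_categorical_crossentropy` loss, which accepts class ids directly.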


## Model Building
Construct the sequential model from an Embedding layer, a Bidirectional LSTM, a second LSTM, and two Dense layers (the first followed by a LeakyReLU activation), ending in a softmax over the vocabulary for next-word prediction.

In [None]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout, LeakyReLU
from tensorflow.keras import Sequential

# Define model parameters
embedding_dim = 50  # Reduced embedding dimension
lstm_units_1 = 75  # Reduced LSTM units
lstm_units_2 = 50  # Reduced LSTM units
dropout_rate = 0.2
leaky_alpha = 0.01

# Define the Sequential model
model = Sequential()

# Add layers to the model
model.add(Embedding(total_words, embedding_dim, input_length=max_sequence_len - 1))
model.add(Bidirectional(LSTM(lstm_units_1, return_sequences=True)))
model.add(Dropout(dropout_rate))
model.add(LSTM(lstm_units_2))
model.add(Dense(total_words // 2))
model.add(LeakyReLU(alpha=leaky_alpha))
model.add(Dense(total_words, activation='softmax'))

# Print the model summary
model.summary()
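
The parameter counts that `model.summary()` reports can be approximated by hand, which helps show where the model's size comes from. The sketch below uses the standard per-layer formulas with the layer sizes from this cell; the vocabulary size of `10_000` is a hypothetical placeholder, since the real `total_words` depends on the fitted tokenizer.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

def count_params(total_words, embedding_dim=50, lstm_units_1=75, lstm_units_2=50):
    hidden = total_words // 2
    return {
        "embedding": embedding_params(total_words, embedding_dim),
        # Bidirectional wraps two LSTMs, so the count doubles.
        "bidirectional_lstm": 2 * lstm_params(embedding_dim, lstm_units_1),
        # The second LSTM receives the concatenated forward and backward outputs.
        "lstm": lstm_params(2 * lstm_units_1, lstm_units_2),
        "dense_hidden": dense_params(lstm_units_2, hidden),
        "dense_output": dense_params(hidden, total_words),
    }

counts = count_params(total_words=10_000)  # hypothetical vocabulary size
for name, n in counts.items():
    print(f"{name:>20}: {n:,}")
print(f"{'total':>20}: {sum(counts.values()):,}")
```

Under these assumptions the final Dense layer dominates the total, because its size grows roughly with the square of the vocabulary.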



## Model Compilation
Here, the model is compiled with the Adam optimizer and a categorical cross-entropy loss for training.

In [None]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Compile the model
model.compile(
    loss='categorical_crossentropy',  # Loss function for multi-class classification
    optimizer=Adam(learning_rate=0.001),  # Adam optimizer with specified learning rate
    metrics=['accuracy']  # Metric to monitor for performance
)

# Optional: learning rate scheduler callback (defined here for reference; the training cell builds its own callback list)
lr_scheduler = ReduceLROnPlateau(
    monitor='val_loss',  # Monitor the validation loss
    factor=0.1,  # Factor by which the learning rate will be reduced
    patience=5,  # Number of epochs with no improvement after which learning rate will be reduced
    min_lr=0.0001  # Lower bound on the learning rate
)

# Print the model summary to check the architecture
model.summary()
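
To make the scheduler's effect concrete, here is a simplified simulation of the plateau rule: when the validation loss has not improved for `patience` epochs, the learning rate is multiplied by `factor`, but never drops below `min_lr`. This is a sketch, not the Keras implementation (it ignores options such as cooldown and minimum delta), and the loss values are made up.

```python
def simulate_reduce_on_plateau(val_losses, lr=0.001, factor=0.1, patience=5, min_lr=0.0001):
    """Return the learning rate in effect after each epoch (simplified plateau rule)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Made-up validation losses: two improvements, then a long plateau.
losses = [2.0, 1.8] + [1.9] * 10
schedule = simulate_reduce_on_plateau(losses)
print(schedule)
```

With the values used in this notebook (a starting rate of `0.001`, `factor=0.1`, `min_lr=0.0001`), a single reduction already reaches the floor, so later plateaus leave the rate unchanged.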


## Training the Model
The model is trained on the preprocessed data with the EarlyStopping, ModelCheckpoint, and ReduceLROnPlateau callbacks, holding out 10% of the data for validation.

In [None]:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Callbacks
callbacks = [
    EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True),
    ModelCheckpoint(filepath='model_best.h5', monitor='val_accuracy', save_best_only=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, min_lr=0.0001)
]

# Model fitting
model.fit(
    predictors,
    label,
    epochs=50,
    batch_size=64,
    verbose=1,
    callbacks=callbacks,
    validation_split=0.1  # Using 10% of data for validation
)
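
The early-stopping behaviour configured above can be illustrated with a small simulation. This is a simplified sketch of the patience rule rather than the Keras implementation; `early_stopping_epoch` is a hypothetical helper and the accuracy values are made up.

```python
def early_stopping_epoch(val_accuracies, patience=5):
    """Return (stop_epoch, best_epoch), 0-based; stop_epoch is None if training runs to the end."""
    best = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation accuracies that peak at epoch 3.
accs = [0.10, 0.12, 0.13, 0.15, 0.14, 0.15, 0.14, 0.13, 0.14, 0.12]
stop_epoch, best_epoch = early_stopping_epoch(accs)
print(f"stopped after epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

Because `restore_best_weights=True` is set in the notebook, the model kept after stopping corresponds to the best epoch rather than the last one trained.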


## Model Inference
Finally, use the trained model to generate text from a seed phrase by repeatedly sampling the next word. A temperature parameter controls how random the sampling is.
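
The effect of temperature can be seen in isolation on a tiny distribution. The sketch below applies the same log, divide, and renormalize steps used in this notebook's sampling code, written with the standard library only. `apply_temperature` is a hypothetical helper, the probabilities are made up, and subtracting the maximum before exponentiating is an added numerical-stability step that is not in the notebook code.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: roughly p_i ** (1 / T), renormalized."""
    logits = [math.log(p + 1e-7) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]  # made-up next-word probabilities
for t in (0.5, 1.0, 2.0):
    scaled = apply_temperature(probs, t)
    print(t, [round(p, 3) for p in scaled])
```

Temperatures below 1 concentrate probability on the most likely word, while temperatures above 1 flatten the distribution and make rarer words more likely to be sampled.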

In [None]:
def generate_text(seed_text, next_words, model, max_sequence_len, temperature=1.0):
    # Mapping from index to word
    index_word = {index: word for word, index in tokenizer.word_index.items()}

    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predictions = model.predict(token_list, verbose=0)[0]  # verbose=0 suppresses the per-call progress bar

        # Temperature-based sampling
        predictions = np.asarray(predictions).astype('float64')
        predictions = np.log(predictions + 1e-7) / temperature  # Adding a small number to avoid log(0)
        exp_preds = np.exp(predictions)
        predictions = exp_preds / np.sum(exp_preds)
        predicted = np.random.choice(range(len(predictions)), p=predictions)

        # Get the predicted word; index 0 (padding) has no word, so skip it
        output_word = index_word.get(predicted, '')
        if output_word:
            seed_text += ' ' + output_word

    return seed_text

print(generate_text("History shows that", 20, model, max_sequence_len))

NameError: name 'model' is not defined