# ***Mission Statement***

The goal of this project is to apply recent advances in deep learning and NLP to build a responsive, intelligent chatbot. The chatbot should understand and answer a variety of questions by extracting and learning from structured conversational data. Through iterative development, fine-tuning, and rigorous evaluation, we aim to create a tool that demonstrates the potential of AI to automate and enhance user engagement and information retrieval.

# **1. Import and Setup**

This section imports all necessary libraries and modules used throughout the script.

In [None]:
# Importing essential libraries and modules for handling various data types, making HTTP requests,
# and performing both basic and advanced natural language processing tasks.
import json
import os
from pprint import pprint
import requests
from google.colab import userdata
import pandas as pd
from datetime import datetime, date
import numpy as np
import spacy
from transformers import pipeline
import re

# Importing NLTK for natural language tasks such as tokenization and stopword removal,
# and downloading necessary resources for tokenization and lemmatization.
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score  # used in the evaluation step
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from keras.models import load_model

# Conversational pairs will be stored in this list
conversations = []


# **2. Data Acquisition via API**

This section sets up the credentials and API endpoint used to fetch data from the Bing News API.

Fetching real-time or recent data from an external source like Bing News API enables the analysis of current trends and information. This section sets up the necessary parameters to access this data securely and efficiently.

In [None]:
# Configure API access with a subscription key and endpoint URL for Bing News API.
# This setup is required to fetch news data for analysis.
subscription_key = userdata.get('Bing')
endpoint = 'https://api.bing.microsoft.com/v7.0/news/search'

# Define the specific query for fetching news, focusing on "Personal Finance" to
# align with the project's scope.
query = "Personal Finance"


# **3. API Query Parameters**

Defining parameters for the API request to specify the details of the news articles to fetch.

Parameters such as the market, freshness, and count determine the scope and volume of the data fetched. They are crucial for ensuring that the data retrieved is relevant to the specific needs of the project.

In [None]:
# Set the parameters for the API request, specifying the market as 'en-US' and looking
# for the most recent articles within the past week. A higher count could be specified
# to fetch more articles in a single request.
mkt = 'en-US'
params = {
    'q': query,
    'mkt': mkt,
    'freshness': 'week',
    'count': 100
}
headers = { 'Ocp-Apim-Subscription-Key': subscription_key }


# **4. Pagination and Data Collection**

This section implements pagination to handle large result sets returned by the API. Most APIs limit the number of records returned in a single request, so pagination is used to request the data iteratively, adjusting the offset parameter on each call until all required records have been fetched.

In [None]:
# Initialize variables for pagination and data storage
data = []
offset = 0  # Start at the beginning for pagination
records_needed = 200
total_fetched_records = 0

while total_fetched_records < records_needed:
    params = {
        'q': query,
        'mkt': mkt,
        'freshness': 'week',
        'count': 100,  # Adjust based on API's max allowed per request or your preference
        'offset': offset  # Use offset for pagination
    }
    headers = {'Ocp-Apim-Subscription-Key': subscription_key}

    try:
        response = requests.get(endpoint, headers=headers, params=params)
        response.raise_for_status()  # This will raise an exception for HTTP error responses
        json_data = response.json()

        articles = json_data.get('value', [])
        if not articles:
            print("No more articles found.")
            break

        for article in articles:
            name = article.get('name', '')
            description = article.get('description', '')
            url = article.get('url', '')
            provider_list = article.get('provider', [])
            provider_name = next((provider.get('name', '') for provider in provider_list if provider.get('name')), '')
            date_published = article.get('datePublished', '')
            try:
                formatted_date = datetime.strptime(date_published[:10], "%Y-%m-%d")
            except ValueError:
                formatted_date = None

            category = article.get('category', '')
            data.append([name, description, url, provider_name, formatted_date, category])

        total_fetched_records += len(articles)
        offset += 100  # Adjust offset for the next batch

    except Exception as ex:
        print(f"An error occurred: {ex}")
        break

# Define the column names and create a DataFrame
columns = ['Name', 'Description', 'URL', 'Provider', 'Date Published', 'Category']
df = pd.DataFrame(data, columns=columns)

# Display the number of records fetched
print(f"Total records fetched: {len(df)}")


Total records fetched: 247
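
The offset logic can be tried without network access or an API key. The following is a minimal sketch, not the notebook's implementation: `collect_records`, `fake_fetch_page`, and `FAKE_SOURCE` are hypothetical names, and the stub simply slices an in-memory list where the real code would call `requests.get`. Unlike the cell above, this sketch advances the offset by the number of items actually received and trims the result to `records_needed`; whether that suits a given API depends on how it defines its offset.

```python
from typing import Callable, Dict, List


def collect_records(fetch_page: Callable[[int, int], List[Dict]],
                    records_needed: int,
                    page_size: int = 100) -> List[Dict]:
    """Collect up to records_needed items using offset-based pagination."""
    collected: List[Dict] = []
    offset = 0
    while len(collected) < records_needed:
        page = fetch_page(offset, page_size)
        if not page:  # an empty page means the source is exhausted
            break
        collected.extend(page)
        offset += len(page)  # advance by what was actually returned
    return collected[:records_needed]


# Hypothetical stand-in for the HTTP call: serves slices of 247 fake records.
FAKE_SOURCE = [{"name": f"article-{i}"} for i in range(247)]


def fake_fetch_page(offset: int, count: int) -> List[Dict]:
    return FAKE_SOURCE[offset:offset + count]


records = collect_records(fake_fetch_page, records_needed=200)
print(len(records))  # 200
```

Passing the fetch function as a parameter keeps the loop testable: the same `collect_records` could wrap a real HTTP call later without changing the pagination logic.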


In [None]:
df.head()

| | Name | Description | URL | Provider | Date Published | Category |
|---|---|---|---|---|---|---|
| 0 | 39 Personal Finance Lessons In Honor Of My 39t... | In honor of my 39 years, I’ve compiled a list ... | https://www.forbes.com/sites/jonathanshenkman/... | Forbes | 2024-04-17 | Business |
| 1 | Survey Says: Personal Finance Knowledge Gaps A... | Lack of financial knowledge is leading to cost... | https://finance.yahoo.com/news/survey-says-per... | YAHOO!Finance | 2024-04-17 | ScienceAndTechnology |
| 2 | MoneyHero Group Hosts Singapore’s Largest Pers... | The iconic annual event, which was held on Apr... | https://markets.businessinsider.com/news/stock... | Business Insider | 2024-04-11 | ScienceAndTechnology |
| 3 | Study list of these must-know financial litera... | It probably doesn’t come as a surprise that sa... | https://www.msn.com/en-us/money/other/study-li... | Kansas City Star on MSN.com | 2024-04-17 | Business |
| 4 | Number of U.S. Public High School Students Gua... | The number of states guaranteeing a Personal F... | https://finance.yahoo.com/news/number-u-public... | YAHOO!Finance | 2024-04-16 | |


# **5. Data Preparation**

In [None]:
def clean_text(text):
  text = text.lower() # set all words to lowercase

  text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text) # substitute special characters with white space

  stop_words = set(stopwords.words('english')) # bring in a stopwords dict
  tokens = nltk.word_tokenize(text) # split up text to word tokens
  filtered_text = [word for word in tokens if word not in stop_words] # filter out stopwords

  lemmatizer = WordNetLemmatizer()
  lemmatized_text = [lemmatizer.lemmatize(word) for word in filtered_text] # reduce words down to their roots through lemmatization

  cleaned_text = ' '.join(lemmatized_text) # rejoin words into continuous strings

  return cleaned_text

# Apply the cleaning function to the "Description" column
df['Cleaned_Description'] = df['Description'].apply(clean_text)
df.head()

| | Name | Description | URL | Provider | Date Published | Category | Cleaned_Description |
|---|---|---|---|---|---|---|---|
| 0 | 39 Personal Finance Lessons In Honor Of My 39t... | In honor of my 39 years, I’ve compiled a list ... | https://www.forbes.com/sites/jonathanshenkman/... | Forbes | 2024-04-17 | Business | honor 39 year compiled list 39 personal financ... |
| 1 | Survey Says: Personal Finance Knowledge Gaps A... | Lack of financial knowledge is leading to cost... | https://finance.yahoo.com/news/survey-says-per... | YAHOO!Finance | 2024-04-17 | ScienceAndTechnology | lack financial knowledge leading costly financ... |
| 2 | MoneyHero Group Hosts Singapore’s Largest Pers... | The iconic annual event, which was held on Apr... | https://markets.businessinsider.com/news/stock... | Business Insider | 2024-04-11 | ScienceAndTechnology | iconic annual event held april 6th produced se... |
| 3 | Study list of these must-know financial litera... | It probably doesn’t come as a surprise that sa... | https://www.msn.com/en-us/money/other/study-li... | Kansas City Star on MSN.com | 2024-04-17 | Business | probably come surprise saving one important co... |
| 4 | Number of U.S. Public High School Students Gua... | The number of states guaranteeing a Personal F... | https://finance.yahoo.com/news/number-u-public... | YAHOO!Finance | 2024-04-16 | | number state guaranteeing personal finance cou... |
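
To see the effect of the first steps of `clean_text` without downloading any NLTK resources, here is a simplified stand-in. It is only a sketch: `simple_clean` and `SAMPLE_STOP_WORDS` are hypothetical names, the stop-word set is a tiny illustrative sample rather than NLTK's English list, tokenization is a plain whitespace split, and lemmatization is omitted entirely. The input sentence is made up for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word sample (NLTK's English list is much longer).
SAMPLE_STOP_WORDS = {"the", "is", "to", "of", "a", "in", "and"}


def simple_clean(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation becomes whitespace
    tokens = text.split()                     # whitespace tokenization (not NLTK's tokenizer)
    kept = [token for token in tokens if token not in SAMPLE_STOP_WORDS]
    return " ".join(kept)


print(simple_clean("Saving is one of the most important habits in Personal Finance!"))
# saving one most important habits personal finance
```

With the real lemmatizer, plural nouns such as "habits" would typically also be reduced to their singular form, which this sketch does not attempt.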


In [None]:
# Extracting questions and answers from the dataframe
def prepare_data(df):
    for index, row in df.iterrows():
        question = "What can you tell me about " + row['Name'] + "?"
        answer = row['Description']
        conversations.append([question, answer])

    sentences = [conv[0] for conv in conversations]
    responses = [conv[1] for conv in conversations]
    return sentences, responses

# Vectorization of sentences and creation of categorical encoding for responses
def vectorize_data(sentences, responses):
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(sentences).toarray()
    vocab = vectorizer.get_feature_names_out()
    response_index = {resp: idx for idx, resp in enumerate(set(responses))}
    y = np.array([response_index[resp] for resp in responses])
    return X, y, vectorizer, response_index, vocab
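
The encoding produced by `vectorize_data` can be illustrated on a toy input without scikit-learn. This is a simplified sketch under stated assumptions: `binary_bag_of_words` and `index_responses` are hypothetical helpers, and the whitespace tokenizer differs from `CountVectorizer`'s default tokenization (which, for example, ignores single-character tokens). The sketch also assigns labels in first-seen order via `dict.fromkeys` instead of `set`, so the label numbering is the same on every run.

```python
from typing import Dict, List, Tuple


def binary_bag_of_words(sentences: List[str]) -> Tuple[List[List[int]], List[str]]:
    """Return one 0/1 row per sentence plus the sorted vocabulary."""
    vocab = sorted({word for s in sentences for word in s.lower().split()})
    rows = []
    for s in sentences:
        present = set(s.lower().split())
        rows.append([1 if word in present else 0 for word in vocab])
    return rows, vocab


def index_responses(responses: List[str]) -> Dict[str, int]:
    """Map each distinct response to an integer label in first-seen order."""
    return {resp: idx for idx, resp in enumerate(dict.fromkeys(responses))}


sentences = [
    "tell me about mortgage rates",
    "tell me about savings",
    "tell me about mortgage rates",
]
responses = ["Rates rose.", "Save early.", "Rates rose."]

X, vocab = binary_bag_of_words(sentences)
response_index = index_responses(responses)
y = [response_index[r] for r in responses]

print(vocab)  # ['about', 'me', 'mortgage', 'rates', 'savings', 'tell']
print(X)      # [[1, 1, 1, 1, 0, 1], [1, 1, 0, 0, 1, 1], [1, 1, 1, 1, 0, 1]]
print(y)      # [0, 1, 0]
```

Each column corresponds to one vocabulary word, and each label identifies one distinct response, which is the shape of input and target the network in the next section expects.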


# **6. Neural Network Model Preparation**

Building and training a neural network model using the Keras library. The model is essential for learning patterns in the data, which can be used for classification, prediction, or other tasks. This section sets up the architecture, compiles the model and trains it using the prepared data.

In [None]:
# Define a function to build and train a neural network for classifying or processing text.
# This includes setting up the architecture, compiling the model, and training it.
def build_and_train_model(X, y, vocab, response_index):
    # Splitting the dataset into training and testing sets for validation purposes.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Creating a sequential model with dense layers, appropriate for many types of classification tasks.
    model = Sequential([
        Dense(16, input_dim=len(vocab), activation='relu'),
        Dense(len(response_index), activation='softmax')
    ])

    # Compile the model with a suitable loss function and optimizer for categorical data.
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Train the model with the training set, iterating through the data multiple times to improve accuracy.
    model.fit(X_train, y_train, epochs=100, verbose=1)
    return model, X_test, y_test
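
For readers unfamiliar with the output layer and loss used above, the following sketch works through both on a single example in plain Python. It is an illustration of the formulas only, not of how Keras implements them: the function names and the logit values are made up, and batching and other numerical safeguards are left out.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs: List[float], true_index: int) -> float:
    """Negative log-probability assigned to the true class (an integer label)."""
    return -math.log(probs[true_index])


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three response classes
loss = sparse_categorical_crossentropy(probs, true_index=0)
print([round(p, 3) for p in probs], round(loss, 3))
```

The "sparse" variant takes the target as a class index, which is why `y` can stay an integer array rather than being one-hot encoded.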

# **7. Model Evaluation and Saving**

Evaluating the model's performance and saving it for later use. After training, it's important to assess how well the model performs on unseen data. Saving the trained model allows it to be reused without needing to retrain from scratch.

In [None]:
def evaluate_and_save_model(model, X_test, y_test):
    # Evaluate the model using the test dataset to get the loss and accuracy.
    loss, acc = model.evaluate(X_test, y_test, verbose=0)
    print("Test Loss: {:.2f}".format(loss))
    print("Test Accuracy: {:.2f}%".format(acc * 100))

    # Predict classes using the testing data
    y_pred = model.predict(X_test)
    if y_pred.shape[-1] > 1:
        y_pred = np.argmax(y_pred, axis=1)  # Convert probabilities to class labels
    y_test = np.argmax(y_test, axis=1) if y_test.ndim > 1 else y_test  # Adjust for one-hot encoding

    # Calculate precision, recall, and F1 score
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')

    print("Precision: {:.2f}".format(precision))
    print("Recall: {:.2f}".format(recall))
    print("F1 Score: {:.2f}".format(f1))

    # Save the model
    model.save('model_final_project.h5')
    print("Model saved to model_final_project.h5")
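
Macro averaging, as requested with `average='macro'`, computes each metric per class and then takes the unweighted mean. The sketch below reproduces that idea in plain Python on made-up labels; `macro_scores` is a hypothetical helper, not scikit-learn's implementation. It assumes that a class with an undefined precision or recall (a zero denominator) is scored as 0.

```python
from typing import List, Tuple


def macro_scores(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Return macro-averaged (precision, recall, F1) over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0  # undefined -> 0 (assumption)
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


precision, recall, f1 = macro_scores([0, 0, 1, 2], [0, 1, 1, 1])
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```

Because every class counts equally, a class that is never predicted pulls the macro averages down no matter how few samples it has, which matters when there are many response classes with only one or two examples each.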



# **8. Interactive Chatbot Function**

Creating functions for the chatbot to interact with users and provide responses based on the trained model. This allows the model to be used in a practical application, demonstrating its utility in processing and responding to user inputs in real time.

In [None]:
# Define a function to get responses from the chatbot based on user input.
def get_bot_response(user_input, vectorizer, model, response_index):
    # Transform the user's input using the same vectorizer used for training data.
    # This ensures the input is in the correct format for the model to process.
    input_vector = vectorizer.transform([user_input]).toarray()

    # Predict the response using the pre-trained model. This function returns
    # a probability distribution over all possible classes (responses).
    prediction = model.predict(input_vector)[0]

    # Find the index of the highest probability in the prediction list, which
    # corresponds to the most likely response category.
    predicted_index = np.argmax(prediction)

    # Retrieve the actual response text corresponding to the predicted index.
    # This loop searches for the response that matches the predicted index.
    return next(key for key, value in response_index.items() if value == predicted_index)

# Define a function to handle the interaction loop between the chatbot and the user.
def chatbot_interaction(vectorizer, model, response_index):
    # Initial greeting from the chatbot.
    print("Chatbot: Hi there! Ask me anything.")

    # Start an infinite loop to continuously interact with the user.
    while True:
        # Prompt the user for input.
        user_input = input("You: ")

        # Check if the user wants to quit the chat. If so, break out of the loop.
        if user_input.lower() == "quit":
            break

        # Get the response from the chatbot based on the user's input.
        bot_response = get_bot_response(user_input, vectorizer, model, response_index)

        # Print out the chatbot's response to the user's input.
        print("Chatbot:", bot_response)
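
`get_bot_response` scans `response_index` on every call to find the text for the predicted index. One alternative, sketched here under stated assumptions, is to invert the mapping once and look responses up directly. The helper names, the `min_confidence` threshold, and the fallback message are all hypothetical additions for illustration; the threshold is just one possible way to avoid returning an unrelated article when no class is clearly favored, and its value would need tuning.

```python
from typing import Dict, List


def build_index_to_response(response_index: Dict[str, int]) -> Dict[int, str]:
    """Invert {response text: class index} into {class index: response text}."""
    return {idx: resp for resp, idx in response_index.items()}


def pick_response(probabilities: List[float],
                  index_to_response: Dict[int, str],
                  min_confidence: float = 0.5,
                  fallback: str = "Sorry, I don't have an answer for that.") -> str:
    """Return the most probable response, or a fallback if confidence is low."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < min_confidence:
        return fallback
    return index_to_response[best]


# Toy mapping standing in for the one built during vectorization.
index_to_response = build_index_to_response({"Rates rose.": 0, "Save early.": 1})

print(pick_response([0.9, 0.1], index_to_response))    # Rates rose.
print(pick_response([0.45, 0.55], index_to_response))  # Save early.
print(pick_response([0.5, 0.5], index_to_response, min_confidence=0.6))
# Sorry, I don't have an answer for that.
```

In the notebook's setting, `probabilities` would be the row returned by `model.predict(input_vector)[0]`.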


In [None]:
# This is the main entry point of the Python script. It checks if the script is being run
# directly (not imported as a module in another script) and then executes the code block.
if __name__ == '__main__':
    # Call the prepare_data function to extract and format the questions and answers
    # from the DataFrame. This function processes text data, preparing it for vectorization.
    sentences, responses = prepare_data(df)

    # Vectorize the prepared sentences using the vectorize_data function. This function
    # transforms the text into a numerical format that the machine learning model can
    # understand and also encodes the responses for the model's output.
    X, y, vectorizer, response_index, vocab = vectorize_data(sentences, responses)

    # Build and train the neural network model using the vectorized data. This step involves
    # defining the model architecture and using the data to train the model, adjusting weights
    # to minimize prediction errors based on the provided input (X) and output (y) training data.
    model, X_test, y_test = build_and_train_model(X, y, vocab, response_index)

    # Evaluate the trained model's performance on the test dataset and save the trained model
    # to a file for future use. This function prints out the model's accuracy and saves it,
    # allowing the model to be reused without retraining from scratch.
    evaluate_and_save_model(model, X_test, y_test)

    # Begin interacting with the user using the trained model. This function implements
    # a loop that continuously accepts user inputs, processes them through the model, and
    # returns responses, simulating a chatbot interaction.
    chatbot_interaction(vectorizer, model, response_index)


Epoch 1/100
...
Epoch 78

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  saving_api.save_model(


Precision: 0.90
Recall: 0.91
F1 Score: 0.90
Model saved to model_filename.h5
Chatbot: Hi there! Ask me anything.
You: mortgage rates
Chatbot: She covers mortgage rates, refinance rates, mortgage lender reviews, and homebuying for Personal Finance Insider.Before joining the Insider team, Molly was a blog writer for Rocket Companies ...
You: recession
Chatbot: Independent Online, popularly known as IOL, is one of South Africa’s leading news and information websites bringing millions of readers breaking news and updates on Politics, Current Affairs ...
You: stocks
Chatbot: Independent Online, popularly known as IOL, is one of South Africa’s leading news and information websites bringing millions of readers breaking news and updates on Politics, Current Affairs ...
You: quit
