This notebook trains a chatbot model and exports it to JSON. ChatGPT generated 99% of this code 😄.

In [1]:
import pandas as pd

df = pd.read_csv('training.csv', quotechar='"', escapechar='\\', on_bad_lines='skip')
df.head()

Unnamed: 0,input,response
0,Hi,Hello!
1,How are you?,I'm good thank you!
2,What is your name?,I am a chatbot.
3,Goodbye,Goodbye!
4,Tell me a joke,"Sure, why don't scientists trust atoms? Becaus..."


This cell prepares the data and trains a text classifier, which is the core of the chatbot: each distinct response is treated as a class, and the model predicts which response best fits a given input. TfidfVectorizer from sklearn converts each input into numerical features that weight words by how informative they are across the dataset, and a LogisticRegression model classifies the input from those features.

The code first downloads the NLTK 'punkt' tokenizer data, although the pipeline itself does not call NLTK (TfidfVectorizer does its own tokenization). It then checks that the dataset contains at least two unique responses, since a classifier needs at least two classes to train. Next it holds out 20% of the rows as a test set for evaluating the model. Finally, it builds a pipeline that chains the vectorizer and the classifier and fits it on the training data, ready to make predictions on new text inputs.
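
To make the TF-IDF step more concrete, here is a minimal standard-library sketch of how the IDF weights can be computed. It assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, and the smoothed formula ln((1 + n) / (1 + df)) + 1). The helper names `tokenize` and `fit_idf` and the three example sentences are hypothetical, not part of the notebook.

```python
import math
import re
from collections import Counter

# Same pattern as scikit-learn's default: words of two or more characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def fit_idf(documents):
    """Return (vocabulary, idf) using the smoothed IDF ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))  # count each term once per document
    vocabulary = {term: i for i, term in enumerate(sorted(df))}
    idf = [0.0] * len(vocabulary)
    for term, i in vocabulary.items():
        idf[i] = math.log((1 + n) / (1 + df[term])) + 1
    return vocabulary, idf

# Hypothetical example inputs, not taken from training.csv.
docs = ["How are you?", "What is your name?", "How is your day?"]
vocabulary, idf = fit_idf(docs)
for term, i in vocabulary.items():
    print(f"{term:>5}  idf={idf[i]:.3f}")
```

Words that appear in fewer inputs (such as "name") get a higher weight than words shared across inputs (such as "how"), which is what lets the classifier focus on the distinguishing words.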

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
import nltk

# Download NLTK tokenizer data (not used by the pipeline below,
# which relies on TfidfVectorizer's own tokenization)
nltk.download('punkt')


# Prepare data
X = df['input']
y = df['response']

# Check the number of unique responses
unique_responses = y.nunique()
if unique_responses < 2:
    raise ValueError(f"The dataset contains only {unique_responses} unique class(es). Please provide more diverse data.")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline that vectorizes the text and then applies a logistic regression model
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Train the model
model.fit(X_train, y_train)
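
The cell above holds out a test split but does not score the model on it. A minimal sketch of such a check follows, using plain Python so it runs on its own. The helpers `accuracy` and `unseen_responses` and the demo lists are hypothetical. The second helper is included because a classifier can only predict responses it saw during training, so any test response missing from the training split is counted as an error.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the expected response."""
    pairs = list(zip(y_true, y_pred))
    return sum(t == p for t, p in pairs) / len(pairs)

def unseen_responses(y_train, y_test):
    """Responses in the test split that never appear in the training split."""
    return sorted(set(y_test) - set(y_train))

# Hypothetical stand-ins for y_train, y_test and model.predict(X_test).
y_train_demo = ["Hello!", "Goodbye!", "I am a chatbot.", "Hello!"]
y_test_demo = ["Hello!", "I'm good thank you!"]
y_pred_demo = ["Hello!", "Goodbye!"]

print(accuracy(y_test_demo, y_pred_demo))           # 0.5
print(unseen_responses(y_train_demo, y_test_demo))  # ["I'm good thank you!"]
```

In the notebook itself, `model.score(X_test, y_test)` reports the same kind of accuracy for the fitted pipeline. If most responses in training.csv occur only once, expect a low score, because the held-out responses cannot be predicted.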



You can test out the model below:

In [None]:
# Function to get a response from the chatbot
def get_response(user_input):
    return model.predict([user_input])[0]

# Chat loop
print("Start chatting with the bot (type 'quit' to stop)!")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    response = get_response(user_input)
    print(f"Bot: {response}")

Start chatting with the bot (type 'quit' to stop)!
You: hi
Bot: Hello!


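The next cell exports the fitted model to JSON: the vectorizer's vocabulary and IDF weights go to vectorizer.json, and the classifier's coefficients, intercepts, and class labels go to model.json.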

In [16]:
import json

# Serialize the TF-IDF vectorizer
vectorizer_dict = {
    'vocabulary': model.named_steps['tfidfvectorizer'].vocabulary_,
    'idf': model.named_steps['tfidfvectorizer'].idf_.tolist()
}

with open('vectorizer.json', 'w') as f:
    json.dump(vectorizer_dict, f)

# Serialize the logistic regression model
model_coeffs = model.named_steps['logisticregression'].coef_.tolist()
model_intercept = model.named_steps['logisticregression'].intercept_.tolist()
classes = model.named_steps['logisticregression'].classes_.tolist()

model_data = {
    'coefficients': model_coeffs,
    'intercept': model_intercept,
    'classes': classes
}

with open('model.json', 'w') as f:
    json.dump(model_data, f)
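
One way to check the export is to reload the JSON and reproduce a prediction without scikit-learn. The sketch below assumes the default TfidfVectorizer settings used above (lowercasing, the default token pattern, raw term counts, L2 normalisation) and picks the class with the highest linear score. `tfidf_vector` and `predict_from_json` are hypothetical helper names, and the small dictionaries are made-up stand-ins for the contents of vectorizer.json and model.json.

```python
import json
import math
import re

# Default token pattern of TfidfVectorizer: words of two or more characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tfidf_vector(text, vocabulary, idf):
    """Rebuild the L2-normalised TF-IDF vector for one input string."""
    vec = [0.0] * len(idf)
    for token in TOKEN_RE.findall(text.lower()):
        index = vocabulary.get(token)
        if index is not None:  # words unseen during training are ignored
            vec[index] += idf[index]  # term count times IDF
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def predict_from_json(text, vectorizer, model):
    """Score every class with coefficients . x + intercept and return the best."""
    x = tfidf_vector(text, vectorizer["vocabulary"], vectorizer["idf"])
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(model["coefficients"], model["intercept"])
    ]
    classes = model["classes"]
    if len(scores) == 1:  # binary case: the single row scores classes[1]
        return classes[1] if scores[0] > 0 else classes[0]
    return classes[scores.index(max(scores))]

# Made-up data, round-tripped through JSON to mimic json.load() on the
# exported files; real values would come from vectorizer.json and model.json.
vectorizer = json.loads(json.dumps({
    "vocabulary": {"hello": 0, "bye": 1, "joke": 2},
    "idf": [1.0, 1.0, 1.0],
}))
model = json.loads(json.dumps({
    "coefficients": [[-1.0, 2.0, -1.0], [2.0, -1.0, -1.0], [-1.0, -1.0, 2.0]],
    "intercept": [0.0, 0.0, 0.0],
    "classes": ["Goodbye!", "Hello!", "Here is a joke."],
}))

print(predict_from_json("Tell me a joke", vectorizer, model))  # Here is a joke.
```

With the real files, the two dictionaries would be loaded with `json.load`, and the result can be compared against `model.predict([text])[0]` from the notebook to confirm that the export is complete.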