# Language Learning Hybrid Chatbot



## NLP
Natural language processing (NLP) lets computers analyze, understand, and derive meaning from human language. With NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

## Import necessary libraries

In [None]:
import io
import openai # type: ignore
import random
import string # to process standard python strings
import warnings
import pandas as pd # type: ignore
import numpy as np # type: ignore
import tensorflow as tf # type: ignore
from sklearn.model_selection import train_test_split
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.metrics.pairwise import cosine_similarity
from tensorflow.keras import Sequential # type: ignore
from tensorflow.keras.layers import Dense, Dropout # type: ignore
from japanese_dataset import japanese_dataset, explanations
from openai import OpenAI # type: ignore
client = OpenAI(api_key='')
warnings.filterwarnings('ignore')
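
The cell above passes an empty string as `api_key`. One common alternative is to read the key from an environment variable so that it never appears in the notebook. The helper below is a minimal sketch: the name `load_api_key` is an assumption made for illustration, while `OPENAI_API_KEY` is the variable the `openai` client looks up by default when no key is passed.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    `load_api_key` is a hypothetical helper, not part of the openai package.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before creating the client.")
    return key


# Possible usage in the notebook (not executed here):
# client = OpenAI(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error raised on the first request.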

## Downloading and installing NLTK
NLTK (Natural Language Toolkit) is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

[Natural Language Processing with Python](http://www.nltk.org/book/) provides a practical introduction to programming for language processing.

For platform-specific instructions, see the [NLTK installation guide](https://www.nltk.org/install.html).



In [3]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


### Installing NLTK Packages




In [4]:
import nltk # type: ignore
from nltk.stem import WordNetLemmatizer # type: ignore
nltk.download('popular', quiet=True) # for downloading packages
#nltk.download('punkt') # first-time use only
#nltk.download('wordnet') # first-time use only

True

## Data loading

For this example, we use a JSON file called intents_exercise.json as the dataset. It contains multiple intents, each with its corresponding patterns and suitable replies.
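
The later cells read the keys `intents`, `tag`, `patterns`, and `responses`, which suggests a file layout like the one sketched below. The two tags appear in the dataset preview, but the pattern and response texts are made-up placeholders, not the contents of the real file.

```python
import json

# Hypothetical miniature of the intents file; the real file has many more entries.
SAMPLE_INTENTS = """
{
  "intents": [
    {
      "tag": "katakanaDefinition",
      "patterns": ["What is katakana?"],
      "responses": ["Katakana is one of the Japanese syllabaries."]
    },
    {
      "tag": "hiraganaDefinition",
      "patterns": ["What is hiragana?"],
      "responses": ["Hiragana is one of the Japanese syllabaries."]
    }
  ]
}
"""

sample = json.loads(SAMPLE_INTENTS)

# Collect (pattern, tag) pairs, the form the preprocessing step works with.
pairs = [(pattern, intent["tag"])
         for intent in sample["intents"]
         for pattern in intent["patterns"]]
print(pairs)
```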

In [5]:
data = pd.read_json("./intents_exercise.json")
data

Unnamed: 0,intents
0,"{'tag': 'katakanaDefinition', 'patterns': ['Wh..."
1,"{'tag': 'hiraganaDefinition', 'patterns': ['Wh..."
2,"{'tag': 'katakanaExamples', 'patterns': ['Show..."
3,"{'tag': 'hiraganaExamples', 'patterns': ['Show..."
4,"{'tag': 'katakanaUsage', 'patterns': ['When to..."
...,...
58,"{'tag': 'katakanaComprehensionChallenges', 'pa..."
59,"{'tag': 'katakanaAndLanguageDevelopment', 'pat..."
60,"{'tag': 'katakanaProcessingEfficiency', 'patte..."
61,"{'tag': 'katakanaAndMentalMapping', 'patterns'..."


## Preprocessing

During preprocessing, the code iterates over each intent in the dataset and tokenizes each of its patterns. The patterns (sentences) need to be broken down into individual words, or tokens. This is done with NLTK's word_tokenize() function, which splits sentences into words.

After that, lemmatization reduces words to their base (dictionary) form. For example, "characters" becomes "character". NLTK's WordNetLemmatizer is used here with its default settings, which treat every word as a noun, so a verb form such as "running" is reduced to "run" only when pos="v" is passed. Lemmatization reduces the vocabulary size and ensures that the model treats variants of a word the same way.

Punctuation marks don't carry significant meaning in the context of natural language understanding, so they are often removed. Additionally, converting all words to lowercase ensures that the model treats words like "hello" and "Hello" as the same token.
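
The steps described above can be sketched without NLTK as follows. This is only an illustration: `simple_tokenize` and `toy_lemmatize` are hypothetical stand-ins for `nltk.word_tokenize` and `WordNetLemmatizer`, and they are far cruder than the real tools.

```python
import re
import string


def simple_tokenize(sentence):
    """Split into word tokens and single punctuation marks (stand-in for nltk.word_tokenize)."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def toy_lemmatize(word):
    """Very rough noun lemmatizer: strips a plural 's' (stand-in for WordNetLemmatizer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_vocabulary(patterns):
    """Tokenize, drop punctuation, lowercase, lemmatize, then de-duplicate and sort."""
    vocab = set()
    for pattern in patterns:
        for token in simple_tokenize(pattern):
            if token in string.punctuation:
                continue
            vocab.add(toy_lemmatize(token.lower()))
    return sorted(vocab)


print(build_vocabulary(["What is Katakana?", "Show me hiragana examples!"]))
# ['example', 'hiragana', 'is', 'katakana', 'me', 'show', 'what']
```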

In [6]:
words = []
classes = []
data_X = []
data_Y = []

for intent in data["intents"]:
    for pattern in intent["patterns"]:
        tokens = nltk.word_tokenize(pattern)
        words.extend(tokens)
        data_X.append(pattern)
        data_Y.append(intent["tag"])

    if intent["tag"] not in classes:
        classes.append(intent["tag"])

lemmatizer = WordNetLemmatizer()

words = [lemmatizer.lemmatize(word.lower()) for word in words if word not in string.punctuation]

words = sorted(set(words))
classes = sorted(set(classes))
words

["'s",
 'and',
 'are',
 'art',
 'between',
 'book',
 'calligraphy',
 'challenge',
 'character',
 'child',
 'cognitive',
 'combination',
 'compared',
 'comprehension',
 'conjugation',
 'contribute',
 'cultural',
 'culture',
 'custom',
 'development',
 'difference',
 'difficulty',
 'do',
 'doe',
 'education',
 'efficiency',
 'efficiently',
 'emphasis',
 'emphasized',
 'emphasizing',
 'enhance',
 'evolution',
 'example',
 'face',
 'fast',
 'for',
 'foreign',
 'game',
 'genre',
 'hiragana',
 'hiragana-only',
 'historical',
 'how',
 'impact',
 'in',
 'influence',
 'is',
 'japanese',
 'kanji',
 'katakana',
 'language',
 'list',
 'literary',
 'literature',
 'load',
 'loanword',
 'map',
 'mapping',
 'me',
 'meaning',
 'medium',
 'mental',
 'mentally',
 'modern',
 'much',
 'name',
 'native',
 'no',
 'of',
 'on',
 'onomatopoeia',
 'origin',
 'other',
 'particle',
 'perceive',
 'perception',
 'poetry',
 'popular',
 'practice',
 'process',
 'processing',
 'read',
 'reading',
 'role',
 'script',
 ...

In [7]:
training = []
out_empty = [0] * len(classes)

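for idx, doc in enumerate(data_X):
    # Tokenize and lemmatize the pattern the same way the vocabulary was built
    tokens = {lemmatizer.lemmatize(tok.lower()) for tok in nltk.word_tokenize(doc)}
    # Bag of words: 1 if the vocabulary word occurs in the pattern, else 0
    bow = [1 if word in tokens else 0 for word in words]

    # One-hot encode the intent tag (once per pattern, not once per vocabulary word)
    output_row = list(out_empty)
    output_row[classes.index(data_Y[idx])] = 1
    training.append([bow, output_row])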

random.shuffle(training)
training = np.array(training, dtype=object)

train_X = np.array(list(training[:, 0]))
train_Y = np.array(list(training[:, 1]))
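
To make the encoding concrete, here is a small self-contained sketch of the bag-of-words and one-hot steps on a made-up four-word vocabulary. The vocabulary, the class list, and the `encode` helper are illustrative assumptions, and whitespace splitting stands in for the tokenizer and lemmatizer.

```python
def encode(pattern, tag, vocab, classes):
    """Return (bag-of-words vector, one-hot label) for one training pattern."""
    tokens = set(pattern.lower().split())
    # 1 where the vocabulary word occurs in the pattern, 0 elsewhere
    bow = [1 if word in tokens else 0 for word in vocab]
    # One-hot label: a single 1 at the index of the pattern's intent tag
    label = [0] * len(classes)
    label[classes.index(tag)] = 1
    return bow, label


vocab = ["hiragana", "is", "katakana", "what"]
classes = ["hiraganaDefinition", "katakanaDefinition"]

bow, label = encode("What is katakana", "katakanaDefinition", vocab, classes)
print(bow)    # [0, 1, 1, 1]
print(label)  # [0, 1]
```

Each pattern yields exactly one row, so the number of training rows equals the number of patterns.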

In [8]:
# Split data into training and testing sets
train_X, test_X, train_Y, test_Y = train_test_split(train_X, train_Y, test_size=0.2, random_state=42)

model = Sequential()
model.add(Dense(128, input_shape=(len(train_X[0]),), activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(64, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(len(train_Y[0]), activation = "softmax"))
adam = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy',
              optimizer=adam,
              metrics=["accuracy"])
print(model.summary())
model.fit(x=train_X, y=train_Y, validation_data=(test_X, test_Y), epochs=10, verbose=1)
# Evaluate on training data
train_loss, train_accuracy = model.evaluate(train_X, train_Y, verbose=0)
print("Training Accuracy:", train_accuracy)

# Evaluate on testing data
test_loss, test_accuracy = model.evaluate(test_X, test_Y, verbose=0)
print("Testing Accuracy:", test_accuracy)

# Compare accuracies
if test_accuracy < train_accuracy:
    print("The model might be overfitting.")
else:
    print("The model's performance is consistent between training and testing data.")


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               14848     
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 63)                4095      
                                                                 
Total params: 27199 (106.25 KB)
Trainable params: 27199 (106.25 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/10
...
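
The comparison in the cell above reports overfitting whenever testing accuracy is even slightly below training accuracy, which can happen with a well-behaved model too. A hedged alternative is to flag only gaps larger than a tolerance. The function name and the 0.05 default are assumptions chosen for illustration, not values from this notebook.

```python
def accuracy_gap_report(train_accuracy, test_accuracy, tolerance=0.05):
    """Describe the train/test accuracy gap, flagging it only above `tolerance`."""
    gap = train_accuracy - test_accuracy
    if gap > tolerance:
        return f"Possible overfitting: training accuracy exceeds testing accuracy by {gap:.2f}."
    return "Training and testing accuracy are within the chosen tolerance."


print(accuracy_gap_report(0.95, 0.70))
print(accuracy_gap_report(0.90, 0.88))
```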

In [9]:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey",)
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
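def greeting(sentence):
    """Return a random greeting if the sentence contains a known greeting word, else None."""
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None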

In [10]:
def get_random_question():
    """Select a random question from the Japanese dataset"""
    return random.choice(list(japanese_dataset.keys()))

def get_explanation(question):
    """Get the explanation for the given question"""
    return explanations.get(japanese_dataset.get(question, ""), "")

def check_answer(question, user_answer):
    """Check if the user's answer matches the correct answer"""
    correct_answer = japanese_dataset.get(question, "")
    return user_answer.strip().lower() == correct_answer.lower()

def clean_text(text):
    """Tokenize and lemmatize user input, lowercased to match the training vocabulary"""
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word.lower()) for word in tokens]
    return tokens

def bag_of_words(text, vocab):
    tokens = clean_text(text)
    bow = [0] * len(vocab)
    for w in tokens:
        for idx, word in enumerate(vocab):
            if word == w:
                bow[idx] = 1
    return np.array(bow)

def pred_class(text, vocab, labels):
    bow = bag_of_words(text, vocab)
    result = model.predict(np.array([bow]))[0]
    thresh = 0.5
    y_pred = [[indx, res] for indx, res in enumerate(result) if res > thresh]
    y_pred.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in y_pred:
        return_list.append(labels[r[0]])
    return return_list

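def get_response(intents_list, intents_json):
    """Pick a reply for the predicted intent, or ask for clarification"""
    if len(intents_list) == 0:
        result = "Sorry, I didn't understand that. Can you please provide more information?"
    elif len(intents_list) > 1:
        result = "I'm not sure which response to provide. Can you please clarify?"
    else:
        tag = intents_list[0]
        list_of_intents = intents_json["intents"]
        # Fallback so that `result` is always defined, even if the tag is not found
        result = "Sorry, I didn't understand that. Can you please provide more information?"
        for i in list_of_intents:
            if i["tag"] == tag:
                result = random.choice(i["responses"])
                break
    return result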
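
Because the output layer uses softmax, the scores sum to 1, so at most one class can exceed the 0.5 threshold used in `pred_class`, and the "not sure which response" branch is not reached with that threshold. The sketch below illustrates this with hand-written scores. The `softmax` and `labels_above` helpers and the logits are assumptions for demonstration; in the notebook the scores come from `model.predict`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def labels_above(scores, labels, thresh=0.5):
    """Return the labels whose score exceeds `thresh`, highest score first."""
    picked = [(s, l) for s, l in zip(scores, labels) if s > thresh]
    picked.sort(reverse=True)
    return [l for _, l in picked]


labels = ["katakanaUsage", "katakanaDefinition", "hiraganaDefinition"]
confident = softmax([0.1, 3.0, 0.2])  # one class clearly dominates
unsure = softmax([1.0, 1.1, 0.9])     # scores are nearly uniform

print(labels_above(confident, labels))  # ['katakanaDefinition']
print(labels_above(unsure, labels))     # []
```

An empty list corresponds to the "didn't understand" reply in `get_response`.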


Finally, we define the lines the bot says when starting and ending a conversation, depending on the user's input.

In [None]:
# Requests go through the `client` created in the imports cell, so no module-level key is set here.
def send_to_chatGPT(user_suggestion):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_suggestion}],
        max_tokens=100,
        n=1,
        stop=None,
        temperature=0.5,
    )
    content = response.choices[0].message.content.strip()
    print(content)


def main():
    while True:
        print("\nZeta: My name is Vestia Zeta. I will answer your queries about Hiraganas and Katakanas. If you want to exit, type 'Bye!'")
        print("\nPlease choose an action:")
        print("\n1. Japanese practice")
        print("\n2. Ask me a question regarding hiragana and katakana")
        choice = input("Enter your choice: ")

        if choice == "1":
            flag = True
            question = None  # no question has been asked yet
            while flag:
                print("\nZeta: Welcome! I will help you learn some Japanese. Type 'Give me a question' to start or type 'return' to return back to the main menu.")
                user_input = input().strip().lower()

                if user_input == 'give me a question':
                    question = get_random_question()
                    print(f"\nZeta: What's the meaning of '{question}'?")
                    user_answer = input().strip()
                    print("\nYou:", user_answer)  # Displaying user input

                    if check_answer(question, user_answer):
                        print("\nZeta: Correct! Well done!")
                        print("\nZeta: Type 'how is it implemented' if you want to know more about the answer.")
                    else:
                        print("\nZeta: Sorry, that's not correct. The answer is:", japanese_dataset[question])
                        print("\nZeta: Type 'suggest me' if you would like me to provide you with some suggestion?")
                        user_suggestion_choice = input().lower()
                        if user_suggestion_choice == 'suggest me':
                            user_suggestion = "You're a Japanese teacher and this is your student's response: " + user_answer + " to the question of " + question + ". Please suggest where the mistake is and how to improve."
                            send_to_chatGPT(user_suggestion)

                elif user_input == 'how is it implemented':
                    if question:
                        explanation = get_explanation(question)
                        if explanation:
                            print("\nZeta:", explanation)
                        else:
                            print("\nZeta: Sorry, I don't have an explanation for that question yet.")
                    else:
                        print("\nZeta: You haven't answered any question yet.")

                elif user_input == 'return':
                    flag = False
                    print("\nZeta: Returning to main menu!")

                else:
                    print("\nZeta: I'm sorry, I didn't understand that. Type 'Give me a question' to start.")

        elif choice == "2":
            while True:
                print("\nZeta: Welcome! I will help you answer some questions regarding Katakanas and Hiraganas. Type a question to start or type 'return' to return to the main menu.")
                message = input("")
                if message.strip().lower() == "return":
                    break
                print("\nYou:", message)  # Displaying user input
                intents = pred_class(message, words, classes)
                result = get_response(intents, data)
                print("Zeta:", result)

        elif choice.strip().lower().rstrip('!') == 'bye':  # accept both 'Bye' and 'Bye!'
            print("Zeta: Goodbye! Have a great day!")
            break

        else:
            print("Zeta: Invalid choice. Please enter '1' or '2'. Or type 'Bye' to exit.")


if __name__ == "__main__":
    main()



Zeta: My name is Vestia Zeta. I will answer your queries about Hiraganas and Katakanas. If you want to exit, type 'Bye!'

Please choose an action:

1. Japanese practice

2. Ask me a question regarding hiragana and katakana
