# Intelligent Chatbot Development Project

## Introduction
This project presents the development of an **Intelligent Learning Chatbot**, designed to simulate human-like dialogue and assist users in exploring topics related to artificial intelligence, data science, and machine learning. The chatbot leverages **Natural Language Processing (NLP)** and **Machine Learning (ML)** techniques to interpret user intent, extract key information, and provide accurate, context-aware responses.

Built using **Python**, **spaCy**, **Sentence Transformers**, and **Gradio**, the chatbot follows a **retrieval-based approach**: every answer is drawn from a curated set of responses rather than generated freely. This keeps replies predictable and easy to audit, although answer quality depends on how well the stored patterns cover what users actually ask.

The project emphasizes clarity, reproducibility, and modular design, with each phase, from dataset creation and embedding generation to model evaluation and deployment, structured for ease of understanding and future scalability. The chatbot is deployed on **Hugging Face Spaces**, enabling real-time interaction through a user-friendly interface.

### Goals
The primary objectives of this project are to:

1. **Develop a reproducible NLP workflow** for loading, cleaning, and structuring conversational data.

2. **Build a retrieval-based chatbot** capable of accurately identifying user intent and delivering the most relevant response.

3. **Implement entity recognition and sentiment awareness** to enrich and personalize responses.

4. **Maintain version control and collaboration** using Git and GitHub throughout the development process.

5. **Deploy a web-based chatbot interface** using Gradio for live, interactive demonstrations.

### Milestones
**M1: Data Loading & Cleaning** – Prepare and preprocess raw conversational and intent data.

**M2: Embeddings & Retrieval** – Generate text embeddings using Sentence Transformers and set up semantic search.

**M3: Intent Classification (Baseline)** – Establish a foundation for intent detection and response mapping.

**M4: Entity & Sentiment Enhancement** – Integrate named entity recognition and sentiment awareness for smarter replies.

**M5: Interactive Chat UI (Gradio)** – Develop and deploy an engaging user interface for real-time testing.


### Step 1 - Environment Setup

In [1]:
%pip install --quiet numpy pandas scikit-learn nltk spacy "sentence-transformers>=3.0.0" transformers gradio

Note: you may need to restart the kernel to use updated packages.


In [2]:
import nltk, sys
# Downloading of small NLTK packs
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

import spacy
try:
    spacy.load("en_core_web_sm")
except OSError:
    !python -m spacy download en_core_web_sm

In [3]:
!pip install tf-keras



In [4]:
import tf_keras as keras
import sys, platform
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import nltk, spacy

print("Python:", sys.version.split()[0], "| OS:", platform.system())
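# SentenceTransformer and AutoTokenizer are classes rather than modules, so
# they carry no __version__ attribute and are reported as "OK" below.
for name, mod in [
    ("numpy", np), ("pandas", pd),
    ("sentence-transformers", SentenceTransformer),
    ("transformers", AutoTokenizer),
    ("spacy", spacy), ("nltk", nltk),
]:
    # Fall back to "OK" when the object exposes no version string
    v = getattr(mod, "__version__", "OK")
    print(f"{name:22s} -> {v}")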

# Quick sanity check: load a tiny embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
sample = ["hello world", "career in data science"]
emb = embedder.encode(sample, normalize_embeddings=True)
print("Embeddings shape:", emb.shape)





Python: 3.10.16 | OS: Windows
numpy                  -> 2.0.2
pandas                 -> 2.2.3
sentence-transformers  -> OK
transformers           -> OK
spacy                  -> 3.8.7
nltk                   -> 3.9.2
Embeddings shape: (2, 384)


In Step 1, the project sets up the environment by installing and importing the libraries used for natural language processing, data handling, and model building: NumPy, Pandas, scikit-learn, NLTK, spaCy, Transformers, and Sentence Transformers. The setup downloads the NLTK tokenizer and stopword resources and checks that spaCy's lightweight English model (en_core_web_sm) can be loaded, downloading it if it is missing. A small test then uses the Sentence Transformer model all-MiniLM-L6-v2 to generate embeddings, numerical representations of text that capture meaning. The resulting shape of (2, 384) shows that each of the two sample sentences is mapped to a 384-dimensional vector, which confirms that the libraries are installed and the embedding model works before moving on to data processing.
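
The version check above prints "OK" for sentence-transformers and transformers because it inspects imported classes, which have no version attribute. As a minimal sketch of an alternative, the standard library's `importlib.metadata` can look up the installed distribution directly. The helper name `installed_version` is hypothetical, and the distribution names are assumed to match those used in the install cell.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


# Distribution names as published on PyPI (assumed to match the install cell).
for dist in ["numpy", "pandas", "sentence-transformers", "transformers", "spacy", "nltk"]:
    print(f"{dist:22s} -> {installed_version(dist)}")
```

This reads package metadata instead of importing each library, so it also works for packages that do not define `__version__`.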

### Step 2 - Data Setup (Test dataset)

In [5]:
import pandas as pd

data = {
    "intent": [
        "ask_skills_for_data_science",
        "ask_skills_for_ai",
        "ask_job_recommendation",
        "ask_learning_path",
        "greeting",
        "goodbye",
    ],
    "patterns": [
        ["What skills do I need for data science?", "How to become a data scientist?"],
        ["What do I need to study for AI?", "What are AI skills?"],
        ["What job is good for me if I like numbers?", "I enjoy problem-solving, any job ideas?"],
        ["Where should I start learning data analysis?", "Best way to learn machine learning?"],
        ["Hi", "Hello", "Hey"],
        ["Bye", "See you later", "Goodbye"],
    ],
    "responses": [
        "Data Science needs skills like Python, SQL, Statistics, and Machine Learning.",
        "AI requires knowledge of Python, Neural Networks, and Data Modeling.",
        "You might enjoy careers like Data Analyst, Statistician, or Research Scientist.",
        "Start with Python, then learn data visualization and basic machine learning.",
        "Hello! How can I help you today?",
        "Goodbye! Keep learning and stay curious!",
    ],
}

df = pd.DataFrame(data)
df


|   | intent | patterns | responses |
|---|--------|----------|-----------|
| 0 | ask_skills_for_data_science | [What skills do I need for data science?, How ... | Data Science needs skills like Python, SQL, St... |
| 1 | ask_skills_for_ai | [What do I need to study for AI?, What are AI ... | AI requires knowledge of Python, Neural Networ... |
| 2 | ask_job_recommendation | [What job is good for me if I like numbers?, I... | You might enjoy careers like Data Analyst, Sta... |
| 3 | ask_learning_path | [Where should I start learning data analysis?,... | Start with Python, then learn data visualizati... |
| 4 | greeting | [Hi, Hello, Hey] | Hello! How can I help you today? |
| 5 | goodbye | [Bye, See you later, Goodbye] | Goodbye! Keep learning and stay curious! |


In [6]:
rows = []
for _, row in df.iterrows():
    for pattern in row["patterns"]:
        rows.append({"text": pattern, "intent": row["intent"], "response": row["responses"]})

chatbot_df = pd.DataFrame(rows)
chatbot_df.head(10)


|   | text | intent | response |
|---|------|--------|----------|
| 0 | What skills do I need for data science? | ask_skills_for_data_science | Data Science needs skills like Python, SQL, St... |
| 1 | How to become a data scientist? | ask_skills_for_data_science | Data Science needs skills like Python, SQL, St... |
| 2 | What do I need to study for AI? | ask_skills_for_ai | AI requires knowledge of Python, Neural Networ... |
| 3 | What are AI skills? | ask_skills_for_ai | AI requires knowledge of Python, Neural Networ... |
| 4 | What job is good for me if I like numbers? | ask_job_recommendation | You might enjoy careers like Data Analyst, Sta... |
| 5 | I enjoy problem-solving, any job ideas? | ask_job_recommendation | You might enjoy careers like Data Analyst, Sta... |
| 6 | Where should I start learning data analysis? | ask_learning_path | Start with Python, then learn data visualizati... |
| 7 | Best way to learn machine learning? | ask_learning_path | Start with Python, then learn data visualizati... |
| 8 | Hi | greeting | Hello! How can I help you today? |
| 9 | Hello | greeting | Hello! How can I help you today? |


In [7]:
chatbot_df.to_csv("chatbot_dataset.csv", index=False)
print("Saved: chatbot_dataset.csv")


Saved: chatbot_dataset.csv


In [8]:
!pip install kaggle



In [9]:
import os
os.environ["KAGGLE_CONFIG_DIR"] = os.path.join(os.getcwd(), ".kaggle")
# Optional check:
print("Kaggle config dir:", os.environ["KAGGLE_CONFIG_DIR"])
print("Has kaggle.json:", os.path.exists(os.path.join(os.environ["KAGGLE_CONFIG_DIR"], "kaggle.json")))


Kaggle config dir: C:\Users\hp\projects\intelligent-chatbot-notebook\.kaggle
Has kaggle.json: True


In [10]:
!kaggle datasets download -d elvinagammed/chatbots-intent-recognition-dataset -p data

Dataset URL: https://www.kaggle.com/datasets/elvinagammed/chatbots-intent-recognition-dataset
License(s): copyright-authors
chatbots-intent-recognition-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [11]:
import zipfile, glob

zip_path = sorted(glob.glob("data/*.zip"))[0]
with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall("data")

print("Extracted files:")
for p in glob.glob("data/**/*", recursive=True):
    print(p)

Extracted files:
data\chatbots-intent-recognition-dataset.zip
data\Intent.json


In [12]:
import json
from pprint import pprint

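# Step 1: Load and preview the JSON
# Note: this opens Intent.json from the working directory, whereas the previous
# cell extracted the Kaggle archive to data/Intent.json. Use that path instead
# to load the downloaded copy.
with open("Intent.json", "r", encoding="utf-8") as f:
    data = json.load(f)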

# Step 2: Show the first few items so we can inspect structure
if isinstance(data, dict):
    # If it's a dictionary, show keys and sample content
    pprint(list(data.keys()))
    if "intents" in data:
        pprint(data["intents"][:2])
    else:
        pprint(data)
else:
    # If it's a list, show first 2 items
    pprint(data[:2])


['intents']
[{'intent': 'Greeting',
  'responses': ['Hey there 👋 How are you doing today?',
                'Hi! Great to see you 😄',
                'Hello, friend! How’s your day going?',
                'Hey hey! What brings you here today?',
                'Hi there! 😊 What can I do for you?',
                'Yo! 👋 Ready to chat?',
                'Hey! I’m all ears — what’s on your mind?',
                'Hello sunshine ☀️ How can I help?',
                'Hi there! I’m your friendly AI assistant — what do you need?',
                'Howdy partner 🤠 What’s up?'],
  'text': ['Hi',
           'Hey',
           'Hey there',
           'Hello',
           'Hiya',
           'Good morning',
           'Good afternoon',
           'Good evening',
           "What's up",
           'Yo',
           'Howdy',
           'Hello there',
           'Hey bot',
           'Hi friend',
           'Greetings']},
 {'intent': 'Goodbye',
  'responses': ['Goodbye 👋 Take care of yourself!',
     

In [13]:
# Step 3: Convert JSON into a DataFrame
import pandas as pd

intents = data["intents"]

rows = []
for intent in intents:
    for pattern in intent["text"]:
        rows.append({
            "intent": intent["intent"],
            "text": pattern,
            "responses": intent["responses"]
        })

df = pd.DataFrame(rows)
print(f"✅ Loaded {len(df)} text patterns from {len(intents)} intents.")
df.head()


✅ Loaded 114 text patterns from 15 intents.


|   | intent | text | responses |
|---|--------|------|-----------|
| 0 | Greeting | Hi | [Hey there 👋 How are you doing today?, Hi! Gre... |
| 1 | Greeting | Hey | [Hey there 👋 How are you doing today?, Hi! Gre... |
| 2 | Greeting | Hey there | [Hey there 👋 How are you doing today?, Hi! Gre... |
| 3 | Greeting | Hello | [Hey there 👋 How are you doing today?, Hi! Gre... |
| 4 | Greeting | Hiya | [Hey there 👋 How are you doing today?, Hi! Gre... |


In [14]:
# Step 4: Create embeddings and save chatbot data

from sentence_transformers import SentenceTransformer
import pickle

print("🔄 Generating embeddings... please wait.")

model = SentenceTransformer("all-MiniLM-L6-v2")

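# Encode every pattern in the 'text' column in one batched call, then store
# one unit-length vector (NumPy array) per row
embeddings = model.encode(df["text"].tolist(), normalize_embeddings=True)
df["embedding"] = list(embeddings)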

# Save updated data
with open("chatbot_with_embeddings.pkl", "wb") as f:
    pickle.dump(df, f)

print("✅ Embeddings created and saved successfully as chatbot_with_embeddings.pkl")

🔄 Generating embeddings... please wait.
✅ Embeddings created and saved successfully as chatbot_with_embeddings.pkl


In [15]:
import numpy as np

# model.encode already returns NumPy arrays; np.asarray keeps this cell safe
# to re-run if the column was stored as plain lists instead
df["embedding"] = df["embedding"].apply(np.asarray)

print("✅ Embeddings converted to NumPy arrays!")
print(type(df["embedding"].iloc[0]))


✅ Embeddings converted to NumPy arrays!
<class 'numpy.ndarray'>


In [16]:
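# Note: CSV stores the list-valued 'responses' and array-valued 'embedding'
# columns as plain strings; the pickle file is the copy to reload for retrieval
df.to_csv("chatbot_intents_dataset.csv", index=False)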
print("Saved cleaned dataset as chatbot_intents_dataset.csv")


Saved cleaned dataset as chatbot_intents_dataset.csv


Next, a small sample dataset is created to exercise the data pipeline before switching to a larger real-world dataset. It contains example user intents such as greetings, learning paths, and job recommendations, each paired with sample user messages (patterns) and a chatbot reply (response). The patterns are expanded into one row per message and saved as a CSV file (chatbot_dataset.csv). The project then uses the Kaggle API to download a real dataset called “Chatbots Intent Recognition Dataset.” The data, stored in a JSON file, is extracted and previewed to understand its structure and contents. Finally, it is flattened into a Pandas DataFrame with three main columns (intent, text, and responses), and embeddings are generated with a Sentence Transformer model so that the dataset is ready for retrieval.
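
The flattening step, and the reason list-valued columns need care when written to CSV, can be sketched with the standard library alone. The `sample` dictionary below is a made-up miniature of the intents structure, and `flatten_intents` is a hypothetical helper name; encoding the list as JSON is one possible way to get it back intact, not necessarily what the notebook does.

```python
import csv
import io
import json

# Made-up miniature of the intents structure, for illustration only.
sample = {
    "intents": [
        {"intent": "Greeting", "text": ["Hi", "Hello"], "responses": ["Hello!", "Hi there!"]},
        {"intent": "Goodbye", "text": ["Bye"], "responses": ["Goodbye!"]},
    ]
}


def flatten_intents(data):
    """Produce one row per (intent, pattern) pair; responses stay a list."""
    return [
        {"intent": item["intent"], "text": pattern, "responses": item["responses"]}
        for item in data["intents"]
        for pattern in item["text"]
    ]


rows = flatten_intents(sample)

# CSV cells are strings, so encode the list as JSON to recover it later.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["intent", "text", "responses"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "responses": json.dumps(row["responses"])})

# Read the CSV back and decode the JSON column into a list again.
buffer.seek(0)
restored = [
    {**row, "responses": json.loads(row["responses"])}
    for row in csv.DictReader(buffer)
]
print(len(restored), restored[0]["responses"])
```

Without the JSON step, the responses would come back as a single string that looks like a list, so picking one reply at random would no longer work as intended.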

### Step 3 - Embeddings and Retrieval

We’ll use a pre-trained SentenceTransformer model to create text embeddings for each pattern.
These embeddings represent the meaning of text as numeric vectors.
When a user sends a query, the chatbot will:
1. Convert the query into an embedding,
2. Compare it with all existing pattern embeddings using cosine similarity,
3. Retrieve the most relevant response.
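
The three steps above can be sketched with NumPy on toy data. The 3-dimensional vectors are made up for illustration (real embeddings from all-MiniLM-L6-v2 have 384 dimensions), and `normalize` is a hypothetical helper standing in for `normalize_embeddings=True`.

```python
import numpy as np


def normalize(vectors):
    """Scale each row to unit length."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


# Made-up "embeddings" for three stored patterns and one query.
pattern_texts = ["Hi", "Bye", "What can you do"]
pattern_embs = normalize([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]])
query_emb = normalize([[0.9, 0.2, 0.1]])[0]

# For unit-length vectors, the dot product equals the cosine similarity.
similarities = pattern_embs @ query_emb
best_idx = int(np.argmax(similarities))
print(pattern_texts[best_idx], round(float(similarities[best_idx]), 3))
```

Because both sides are normalized, a single matrix-vector product scores the query against every stored pattern at once.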


In [17]:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(df["text"].tolist(), normalize_embeddings=True)

df["embedding"] = embeddings.tolist()

# Sample
df.head(3)


|   | intent | text | responses | embedding |
|---|--------|------|-----------|-----------|
| 0 | Greeting | Hi | [Hey there 👋 How are you doing today?, Hi! Gre... | [-0.09047622978687286, 0.04043959826231003, 0.... |
| 1 | Greeting | Hey | [Hey there 👋 How are you doing today?, Hi! Gre... | [-0.11423862725496292, 0.013737436383962631, 0... |
| 2 | Greeting | Hey there | [Hey there 👋 How are you doing today?, Hi! Gre... | [-0.10038460791110992, 0.018592018634080887, 0... |


In [18]:
from sklearn.metrics.pairwise import cosine_similarity

def get_responses(user_input, df, model):
    # Step 1: Encode user query
    query_emb = model.encode([user_input], normalize_embeddings=True)
    
    # Step 2: Compute cosine similarity with existing embeddings
    similarities = cosine_similarity(query_emb, np.vstack(df["embedding"].values))[0]
    
    # Step 3: Pick the highest similarity
    best_idx = np.argmax(similarities)
    
    # Step 4: Return intent, response, and similarity score
    return {
        "user_input": user_input,
        "predicted_intent": df.iloc[best_idx]["intent"],
        "response": df.iloc[best_idx]["responses"],
        "similarity": round(similarities[best_idx], 3)
    }


In [19]:
queries = [
    "hello there",
    "who are you",
    "bye",
    "what can you do",
    "hi human"
]

for q in queries:
    result = get_responses(q, df, model)
    print(f"\n🗣️ User: {result['user_input']}")
    print(f"🤖 Bot ({result['predicted_intent']}): {result['response']}")
    print(f"Similarity: {result['similarity']}")



🗣️ User: hello there
🤖 Bot (Greeting): ['Hey there 👋 How are you doing today?', 'Hi! Great to see you 😄', 'Hello, friend! How’s your day going?', 'Hey hey! What brings you here today?', 'Hi there! 😊 What can I do for you?', 'Yo! 👋 Ready to chat?', 'Hey! I’m all ears — what’s on your mind?', 'Hello sunshine ☀️ How can I help?', 'Hi there! I’m your friendly AI assistant — what do you need?', 'Howdy partner 🤠 What’s up?']
Similarity: 1.0

🗣️ User: who are you
🤖 Bot (Identity): ['I’m your intelligent chatbot assistant — created by Linet Lydia, the brilliant mind behind my design.', 'You can call me your digital companion 🤖. I was built by Linet Lydia!', 'I’m a blend of data, empathy, and code — created by Linet Lydia 💫.', 'Linet Lydia built me with the goal of making tech feel more human.', 'I’m an AI assistant designed to chat, help, and make you think!', 'Not human, but I’m learning from you every day 😄', 'You could say I’m Linet Lydia’s digital creation — made with purpose and curiosit

In [20]:
df.to_pickle("chatbot_with_embeddings.pkl")
print("Saved dataset with embeddings.")


Saved dataset with embeddings.


In Step 3, the chatbot gains the ability to match user input against its stored patterns through text embeddings. Using the pre-trained SentenceTransformer model all-MiniLM-L6-v2, each pattern in the dataset is converted into a numerical vector that captures the meaning of the sentence. When a user enters a message, the chatbot encodes it the same way and compares it with all stored embeddings using cosine similarity. The responses linked to the most similar pattern are then returned, together with the predicted intent and the similarity score. Because the comparison is semantic rather than literal, the chatbot can handle queries whose exact phrasing does not appear in the dataset. No model is trained in this step: the encoder is used as-is, and the best match is always returned, even when its similarity is low. The dataset, now containing the generated embeddings, is saved as a .pkl file for later use in the final chatbot interface.
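
Since the retrieval step always returns the closest pattern, an off-topic query still receives an answer. One possible guard, which is not part of the notebook, is a minimum-similarity cutoff. In the sketch below, `pick_intent`, the 0.5 threshold, the fallback text, and the score lists are all hypothetical and untuned.

```python
def pick_intent(similarities, intents, threshold=0.5):
    """Return (intent, score) for the best match, or (None, score) below the cutoff."""
    best_idx = max(range(len(similarities)), key=similarities.__getitem__)
    score = similarities[best_idx]
    if score < threshold:
        return None, score
    return intents[best_idx], score


# Hypothetical fallback reply for queries that match nothing well.
FALLBACK = "Sorry, I did not understand that. Could you rephrase?"

intents = ["Greeting", "Goodbye", "Identity"]
# Made-up similarity scores: one confident match, one weak match.
for scores in ([0.91, 0.12, 0.30], [0.21, 0.18, 0.25]):
    intent, score = pick_intent(scores, intents)
    print(intent if intent is not None else FALLBACK, score)
```

A suitable threshold would have to be chosen by inspecting similarity scores on real queries.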

### Step 4 - Import Gradio and Build the Chat UI

In [21]:
!pip install gradio



In [None]:
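import gradio as gr
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import random

# Stack the stored embeddings once, rather than on every incoming message
embedding_matrix = np.vstack(df["embedding"].values)

def chatbot_reply(user_input):
    # Ignore empty or whitespace-only messages
    if not user_input.strip():
        return "🤖 Please say something!"

    # Encode the query and score it against every stored pattern
    query_emb = model.encode([user_input], normalize_embeddings=True)
    similarities = cosine_similarity(query_emb, embedding_matrix)[0]
    best_idx = np.argmax(similarities)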

    intent = df.iloc[best_idx]["intent"]
    responses = df.iloc[best_idx]["responses"]

    # ✅ Choose one random response instead of printing the full list
    if isinstance(responses, (list, tuple)) and responses:
        response = random.choice(responses)
    else:
        response = str(responses)

    score = round(similarities[best_idx], 3)
    return f"({intent}) — {response}\n\nConfidence: {score}"

demo = gr.ChatInterface(
    fn=lambda message, history: chatbot_reply(message),
    title="🧠 Intelligent Chatbot",
    description="A chatbot that understands user intent using embeddings and retrieval.",
)

demo.launch(share=True, debug=True)


* Running on local URL:  http://127.0.0.1:7861
* Running on public URL: https://5894c53facce2b05e2.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


In [23]:
df.to_pickle("chatbot_with_embeddings.pkl")
print("Final chatbot saved successfully!")


Final chatbot saved successfully!


In Step 4, the project integrates Gradio, a Python library for creating interactive web-based interfaces for machine learning models. After installing and importing Gradio, a simple chat interface is built to connect the backend logic with a user-friendly frontend. The chatbot function (chatbot_reply) processes the user’s input by converting it into an embedding, calculating similarity scores, and selecting one response at random from those linked to the best-matching pattern. gr.ChatInterface() then displays the conversation in real time; each reply shows the predicted intent and a confidence value, which is the cosine similarity of the best match. Launching with share=True serves the app locally and also creates a temporary public gradio.live link, while the permanent deployment is hosted on Hugging Face Spaces, allowing anyone to chat with the assistant directly from a browser. The chatbot file with all embeddings and updates is then saved to finalize the build.
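
gr.ChatInterface calls its function with the new message and the chat history. The sketch below isolates the reply-building step so it can be tried without launching a UI. `build_reply` and `respond` are hypothetical names, and `STUB_MATCHES` is a made-up lookup table standing in for the embedding search.

```python
import random

# Made-up stand-in for the embedding lookup: message -> (intent, responses, score).
STUB_MATCHES = {
    "hi": ("Greeting", ["Hello!", "Hi there!"], 0.97),
    "bye": ("Goodbye", ["Goodbye!"], 0.97),
}


def build_reply(intent, responses, score, rng=random):
    """Pick one response at random and format it with the intent and score."""
    if isinstance(responses, (list, tuple)) and responses:
        response = rng.choice(responses)
    else:
        response = str(responses)
    return f"({intent}) {response}\n\nConfidence: {round(score, 3)}"


def respond(message, history):
    """Handler with the (message, history) signature used by gr.ChatInterface."""
    if not message.strip():
        return "Please say something!"
    intent, responses, score = STUB_MATCHES.get(
        message.strip().lower(), ("Unknown", ["I am not sure yet."], 0.0)
    )
    return build_reply(intent, responses, score)


print(respond("bye", history=[]))
```

Passing `respond` as the `fn` argument of `gr.ChatInterface` would wire it into the UI. Keeping the formatting in a separate function makes it easy to check in isolation.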

### Conclusion

This project developed a retrieval-based chatbot that interprets and responds to user input using natural language processing and pre-trained sentence embeddings. Through a step-by-step approach, the notebook covers data preparation, embedding generation, intent matching, and interactive deployment. Using SentenceTransformers for semantic understanding and cosine similarity for response retrieval, the chatbot can interpret queries even when they are phrased differently from its stored patterns. Gradio provides an intuitive, web-based interface for real-time interaction, while deployment on Hugging Face Spaces makes the chatbot accessible online.

Overall, this project demonstrates the complete lifecycle of building an AI-powered conversational assistant, from dataset creation and text processing to deployment. Named entity recognition and sentiment awareness, listed under the goals and milestone M4, are not yet part of the notebook shown here; adding them, along with context tracking, would make conversations more human-like and dynamic. The outcome highlights how modern NLP tools can be combined to create practical, intelligent systems for education, customer support, and beyond.