# Intelligent Chatbot Development Project

## Introduction
This project builds a Smart Career Guidance Chatbot that:
- Understands user intent (e.g., “skills for data analyst”).
- Extracts entities (job titles, tools, skills).
- Retrieves or generates helpful responses.
- Adapts tone with basic sentiment awareness.

### Goals
1. Create a clean, reproducible notebook pipeline.
2. Start with a **retrieval-first** chatbot (fast, controllable).
3. Add NER + sentiment to personalize replies.
4. Keep everything version-controlled with Git.

### Milestones
- **M1:** Data loading + text cleaning
- **M2:** Embeddings + retrieval
- **M3:** Intent classifier (baseline)
- **M4:** NER + sentiment integration
- **M5:** Simple UI (Gradio) for demo


In [1]:
%pip install --quiet numpy pandas scikit-learn nltk spacy "sentence-transformers>=3.0.0" transformers gradio

Note: you may need to restart the kernel to use updated packages.


In [2]:
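import nltk

# Download the small NLTK resources used later.
# NLTK 3.9+ tokenizers read "punkt_tab"; "punkt" is kept for older releases.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)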

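import spacy
from spacy.cli import download as spacy_download

try:
    spacy.load("en_core_web_sm")
except OSError:
    # Model missing: install it into the environment this kernel runs in.
    spacy_download("en_core_web_sm")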

In [3]:
%pip install --quiet tf-keras



In [4]:
import tf_keras as keras
import sys, platform
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import nltk, spacy

print("Python:", sys.version.split()[0], "| OS:", platform.system())
for name, mod in [
    ("numpy", np), ("pandas", pd),
    ("sentence-transformers", SentenceTransformer),
    ("transformers", AutoTokenizer),
    ("spacy", spacy), ("nltk", nltk),
]:
    # Classes such as SentenceTransformer carry no __version__, so they report "OK".
    v = getattr(mod, "__version__", "OK")
    print(f"{name:22s} -> {v}")

# Quick sanity check: load a tiny embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
sample = ["hello world", "career in data science"]
emb = embedder.encode(sample, normalize_embeddings=True)
print("Embeddings shape:", emb.shape)







Python: 3.10.16 | OS: Windows
numpy                  -> 2.0.2
pandas                 -> 2.2.3
sentence-transformers  -> OK
transformers           -> OK
spacy                  -> 3.8.7
nltk                   -> 3.9.2
Embeddings shape: (2, 384)


## Step 2 — Data Setup (2A: toy dataset)
We’ll create a tiny in-notebook dataset to verify our flow before using a real Kaggle dataset.


In [5]:
import pandas as pd

data = {
    "intent": [
        "ask_skills_for_data_science",
        "ask_skills_for_ai",
        "ask_job_recommendation",
        "ask_learning_path",
        "greeting",
        "goodbye",
    ],
    "patterns": [
        ["What skills do I need for data science?", "How to become a data scientist?"],
        ["What do I need to study for AI?", "What are AI skills?"],
        ["What job is good for me if I like numbers?", "I enjoy problem-solving, any job ideas?"],
        ["Where should I start learning data analysis?", "Best way to learn machine learning?"],
        ["Hi", "Hello", "Hey"],
        ["Bye", "See you later", "Goodbye"],
    ],
    "responses": [
        "Data Science needs skills like Python, SQL, Statistics, and Machine Learning.",
        "AI requires knowledge of Python, Neural Networks, and Data Modeling.",
        "You might enjoy careers like Data Analyst, Statistician, or Research Scientist.",
        "Start with Python, then learn data visualization and basic machine learning.",
        "Hello! How can I help you today?",
        "Goodbye! Keep learning and stay curious!",
    ],
}

df = pd.DataFrame(data)
df


| | intent | patterns | responses |
|---|---|---|---|
| 0 | ask_skills_for_data_science | [What skills do I need for data science?, How ... | Data Science needs skills like Python, SQL, St... |
| 1 | ask_skills_for_ai | [What do I need to study for AI?, What are AI ... | AI requires knowledge of Python, Neural Networ... |
| 2 | ask_job_recommendation | [What job is good for me if I like numbers?, I... | You might enjoy careers like Data Analyst, Sta... |
| 3 | ask_learning_path | [Where should I start learning data analysis?,... | Start with Python, then learn data visualizati... |
| 4 | greeting | [Hi, Hello, Hey] | Hello! How can I help you today? |
| 5 | goodbye | [Bye, See you later, Goodbye] | Goodbye! Keep learning and stay curious! |


In [6]:
rows = []
for _, row in df.iterrows():
    for pattern in row["patterns"]:
        rows.append({"text": pattern, "intent": row["intent"], "response": row["responses"]})

chatbot_df = pd.DataFrame(rows)
chatbot_df.head(10)


| | text | intent | response |
|---|---|---|---|
| 0 | What skills do I need for data science? | ask_skills_for_data_science | Data Science needs skills like Python, SQL, St... |
| 1 | How to become a data scientist? | ask_skills_for_data_science | Data Science needs skills like Python, SQL, St... |
| 2 | What do I need to study for AI? | ask_skills_for_ai | AI requires knowledge of Python, Neural Networ... |
| 3 | What are AI skills? | ask_skills_for_ai | AI requires knowledge of Python, Neural Networ... |
| 4 | What job is good for me if I like numbers? | ask_job_recommendation | You might enjoy careers like Data Analyst, Sta... |
| 5 | I enjoy problem-solving, any job ideas? | ask_job_recommendation | You might enjoy careers like Data Analyst, Sta... |
| 6 | Where should I start learning data analysis? | ask_learning_path | Start with Python, then learn data visualizati... |
| 7 | Best way to learn machine learning? | ask_learning_path | Start with Python, then learn data visualizati... |
| 8 | Hi | greeting | Hello! How can I help you today? |
| 9 | Hello | greeting | Hello! How can I help you today? |
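
The same flattening can also be written with pandas' built-in `DataFrame.explode`, which emits one row per list element. A minimal sketch on a two-intent stand-in for the toy data (the `toy` frame and its values are illustrative, not the full dataset above):

```python
import pandas as pd

# Two-intent stand-in for the toy dataset (illustrative values only).
toy = pd.DataFrame({
    "intent": ["greeting", "goodbye"],
    "patterns": [["Hi", "Hello", "Hey"], ["Bye", "Goodbye"]],
    "responses": [
        "Hello! How can I help you today?",
        "Goodbye! Keep learning and stay curious!",
    ],
})

# explode() repeats the other columns for every element of the list column.
flat = (
    toy.explode("patterns")
       .rename(columns={"patterns": "text", "responses": "response"})
       .reset_index(drop=True)
)[["text", "intent", "response"]]

print(flat)
```

Whether this reads better than the explicit loop is a matter of taste; the loop makes the row construction more visible, while `explode` is shorter.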


In [7]:
chatbot_df.to_csv("chatbot_dataset.csv", index=False)
print("Saved: chatbot_dataset.csv")


Saved: chatbot_dataset.csv


In [8]:
%pip install --quiet kaggle



In [9]:
import os
os.environ["KAGGLE_CONFIG_DIR"] = os.path.join(os.getcwd(), ".kaggle")
# Optional check:
print("Kaggle config dir:", os.environ["KAGGLE_CONFIG_DIR"])
print("Has kaggle.json:", os.path.exists(os.path.join(os.environ["KAGGLE_CONFIG_DIR"], "kaggle.json")))


Kaggle config dir: C:\Users\hp\projects\intelligent-chatbot-notebook\.kaggle
Has kaggle.json: True


In [10]:
!kaggle datasets download -d elvinagammed/chatbots-intent-recognition-dataset -p data

Dataset URL: https://www.kaggle.com/datasets/elvinagammed/chatbots-intent-recognition-dataset
License(s): copyright-authors
chatbots-intent-recognition-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [11]:
import zipfile, glob

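zip_paths = sorted(glob.glob("data/*.zip"))
if not zip_paths:
    raise FileNotFoundError("No .zip archive found in data/; run the download cell first.")
zip_path = zip_paths[0]
with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall("data")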

print("Extracted files:")
for p in glob.glob("data/**/*", recursive=True):
    print(p)

Extracted files:
data\chatbots-intent-recognition-dataset.zip
data\Intent.json


In [12]:
import json
from pprint import pprint

# Step 1: Load and preview the JSON
with open("data/Intent.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Step 2: Show the first few items so we can inspect structure
if isinstance(data, dict):
    # If it's a dictionary, show keys and sample content
    pprint(list(data.keys()))
    if "intents" in data:
        pprint(data["intents"][:2])
    else:
        pprint(data)
else:
    # If it's a list, show first 2 items
    pprint(data[:2])


['intents']
[{'context': {'clear': False, 'in': '', 'out': 'GreetingUserRequest'},
  'entities': [],
  'entityType': 'NA',
  'extension': {'entities': False, 'function': '', 'responses': []},
  'intent': 'Greeting',
  'responses': ['Hi human, please tell me your GeniSys user',
                'Hello human, please tell me your GeniSys user',
                'Hola human, please tell me your GeniSys user'],
  'text': ['Hi',
           'Hi there',
           'Hola',
           'Hello',
           'Hello there',
           'Hya',
           'Hya there']},
 {'context': {'clear': True, 'in': 'GreetingUserRequest', 'out': ''},
  'entities': [{'entity': 'HUMAN', 'rangeFrom': 3, 'rangeTo': 4},
               {'entity': 'HUMAN', 'rangeFrom': 2, 'rangeTo': 3},
               {'entity': 'HUMAN', 'rangeFrom': 1, 'rangeTo': 2},
               {'entity': 'HUMAN', 'rangeFrom': 2, 'rangeTo': 3},
               {'entity': 'HUMAN', 'rangeFrom': 3, 'rangeTo': 4},
               {'entity': 'HUMAN', 'rangeFr

In [13]:
import pandas as pd

rows = []
for intent in data["intents"]:
    intent_name = intent.get("intent", "unknown")
    patterns = intent.get("text", [])
    responses = intent.get("responses", [])
    
    # Make one row per text pattern
    for pattern in patterns:
        rows.append({
            "intent": intent_name,
            "pattern": pattern,
            "response": responses[0] if responses else None
        })

df = pd.DataFrame(rows)
df.head(10)



| | intent | pattern | response |
|---|---|---|---|
| 0 | Greeting | Hi | Hi human, please tell me your GeniSys user |
| 1 | Greeting | Hi there | Hi human, please tell me your GeniSys user |
| 2 | Greeting | Hola | Hi human, please tell me your GeniSys user |
| 3 | Greeting | Hello | Hi human, please tell me your GeniSys user |
| 4 | Greeting | Hello there | Hi human, please tell me your GeniSys user |
| 5 | Greeting | Hya | Hi human, please tell me your GeniSys user |
| 6 | Greeting | Hya there | Hi human, please tell me your GeniSys user |
| 7 | GreetingResponse | My user is Adam | Great! Hi <HUMAN>! How can I help? |
| 8 | GreetingResponse | This is Adam | Great! Hi <HUMAN>! How can I help? |
| 9 | GreetingResponse | I am Adam | Great! Hi <HUMAN>! How can I help? |


In [14]:
df.to_csv("chatbot_intents_dataset.csv", index=False)
print("Saved cleaned dataset as chatbot_intents_dataset.csv")


Saved cleaned dataset as chatbot_intents_dataset.csv


## Step 3 — Embeddings and Retrieval

We’ll use the pre-trained `all-MiniLM-L6-v2` SentenceTransformer model to embed each pattern.
Each embedding is a 384-dimensional numeric vector that represents the meaning of the text.
When a user sends a query, the chatbot will:
1. Convert the query into an embedding,
2. Compare it with all existing pattern embeddings using cosine similarity,
3. Retrieve the most relevant response.
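
The three steps above can be sketched with plain NumPy. The 3-dimensional vectors below are hand-made stand-ins for real sentence embeddings (an assumption made so the example runs without downloading a model), and `retrieve` is a hypothetical helper name. Because the vectors are L2-normalized, cosine similarity reduces to a dot product.

```python
import numpy as np

def normalize(v):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hand-made stand-ins for pattern embeddings (not real model output).
pattern_vecs = normalize([
    [1.0, 0.1, 0.0],   # "Hi"
    [0.0, 1.0, 0.2],   # "Bye"
    [0.1, 0.0, 1.0],   # "What skills do I need?"
])
responses = ["Hello!", "Goodbye!", "Python and SQL are a good start."]

def retrieve(query_vec):
    """Return (response, similarity) for the pattern closest to the query."""
    q = normalize(query_vec)          # step 1: embed (here: just normalize)
    sims = pattern_vecs @ q           # step 2: cosine similarity per pattern
    best = int(np.argmax(sims))       # step 3: most similar pattern wins
    return responses[best], float(sims[best])

reply, score = retrieve([0.9, 0.2, 0.0])
print(reply, round(score, 3))
```

In the notebook, a real model call would replace the stand-in vectors, but the similarity and arg-max logic stays the same.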


In [15]:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(df["pattern"].tolist(), normalize_embeddings=True)

df["embedding"] = embeddings.tolist()

# Sample
df.head(3)


| | intent | pattern | response | embedding |
|---|---|---|---|---|
| 0 | Greeting | Hi | Hi human, please tell me your GeniSys user | [-0.09047622978687286, 0.04043959826231003, 0.... |
| 1 | Greeting | Hi there | Hi human, please tell me your GeniSys user | [-0.10393176972866058, 0.020045580342411995, 0... |
| 2 | Greeting | Hola | Hi human, please tell me your GeniSys user | [-0.06515946239233017, 0.09274395555257797, -0... |


In [16]:
from sklearn.metrics.pairwise import cosine_similarity

def get_response(user_input, df, model):
    # Step 1: Encode user query
    query_emb = model.encode([user_input], normalize_embeddings=True)
    
    # Step 2: Compute cosine similarity with existing embeddings
    similarities = cosine_similarity(query_emb, np.vstack(df["embedding"].values))[0]
    
    # Step 3: Pick the highest similarity
    best_idx = np.argmax(similarities)
    
    # Step 4: Return intent, response, and similarity score
    return {
        "user_input": user_input,
        "predicted_intent": df.iloc[best_idx]["intent"],
        "response": df.iloc[best_idx]["response"],
        "similarity": round(float(similarities[best_idx]), 3),  # plain Python float
    }


In [17]:
queries = [
    "hello there",
    "who are you",
    "bye",
    "what can you do",
    "hi human"
]

for q in queries:
    result = get_response(q, df, model)
    print(f"\n🗣️ User: {result['user_input']}")
    print(f"🤖 Bot ({result['predicted_intent']}): {result['response']}")
    print(f"Similarity: {result['similarity']}")



🗣️ User: hello there
🤖 Bot (Greeting): Hi human, please tell me your GeniSys user
Similarity: 1.0

🗣️ User: who are you
🤖 Bot (NameQuery): You can call me Geni
Similarity: 0.821

🗣️ User: bye
🤖 Bot (GoodBye): See you later
Similarity: 1.0

🗣️ User: what can you do
🤖 Bot (PodBayDoorResponse): It is classified, I could tell you but I would have to kill you!
Similarity: 0.455

🗣️ User: hi human
🤖 Bot (Greeting): Hi human, please tell me your GeniSys user
Similarity: 0.659
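
The `what can you do` query above matched `PodBayDoorResponse` at a similarity of only 0.455, which suggests that low-scoring matches may be unreliable. One possible guard is a minimum-similarity threshold with a fallback reply. A minimal sketch, where the 0.6 cutoff, the fallback text, and the `answer_or_fallback` name are all assumptions to be tuned, not part of the notebook:

```python
def answer_or_fallback(result, threshold=0.6,
                       fallback="Sorry, I didn't understand that. Could you rephrase?"):
    """Return the retrieved response only when its similarity clears the threshold.

    `result` is a dict shaped like the output of get_response():
    it needs the keys "response" and "similarity".
    """
    if result["similarity"] >= threshold:
        return result["response"]
    return fallback

# Dicts mirroring two of the results printed above.
confident = {"response": "You can call me Geni", "similarity": 0.821}
uncertain = {
    "response": "It is classified, I could tell you but I would have to kill you!",
    "similarity": 0.455,
}

print(answer_or_fallback(confident))
print(answer_or_fallback(uncertain))
```

A fixed cutoff is a blunt tool: `hi human` scored 0.659 and was still classified correctly, so the threshold would need checking against held-out queries before being relied on.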


In [19]:
df.to_pickle("chatbot_with_embeddings.pkl")
print("Saved dataset with embeddings.")


Saved dataset with embeddings.
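
When the saved file is loaded in a later session, stacking the per-row lists into one matrix up front avoids rebuilding it on every query (as `np.vstack` inside `get_response` does). A minimal sketch using a tiny stand-in frame and a temporary file; the vectors and file path here are illustrative, not the real `chatbot_with_embeddings.pkl`:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in for the saved DataFrame (2-d vectors instead of 384-d).
demo = pd.DataFrame({
    "pattern": ["Hi", "Bye"],
    "embedding": [[1.0, 0.0], [0.0, 1.0]],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_embeddings.pkl")
    demo.to_pickle(path)              # same call the notebook uses
    restored = pd.read_pickle(path)   # load it back, as a new session would

# Build the (n_patterns, dim) matrix once and reuse it for every query.
matrix = np.asarray(restored["embedding"].tolist(), dtype=float)
print(matrix.shape)
```

Pickle files can execute code when loaded, so only read back files you created yourself.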
