Dataset Preparation


Generating email queries locally with Llama 3.2 and Ollama
Why Llama 3.2?
Llama 3.2 is an open-weight language model that runs locally through Ollama and is well suited to:

Capturing natural phrasing

Generating realistic queries from minimal prompts

Avoiding overly formal or robotic language

This matters for this task because real user queries are often short, casual, and diverse.

In [1]:
import requests
import csv
import time

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.2"

SEED_TOPICS = [
    "Sender or Recipient: Find emails sent by or received from specific people or addresses.",
    "Unread or Read: Filter emails based on whether they’ve been read or not.",
    "Emails with Attachments: Identify emails that contain attachments of any type or specific formats (e.g., PDF, image).",
    "Important or Starred: Find emails marked as important, flagged, or starred by the user.",
    "Specific Words or Phrases: Search for emails using keywords from the subject or body.",
    "Emails by Date or Time: Locate emails sent or received on specific dates or within time ranges.",
    "Replies or Forwards: Identify emails that are replies to or forwarded versions of other emails.",
    "Thread Length: Find email conversations with many replies (long threads).",
    "Emails with Links: Search for emails that contain hyperlinks or URLs.",
    "Emails with Calendar Invites: Identify emails that include event invitations or calendar information.",
    "Bulk or Promotional Emails: Filter emails classified under promotions, newsletters, or bulk senders.",
    "Emails by Size: Find large or small emails based on total size or attachment size.",
    "CC/BCC Recipients: Search for emails where someone was CC’d or BCC’d.",
    "Archived Emails: Find emails that were archived but not deleted.",
    "Spam or Junk: Locate emails classified as spam or moved to the junk folder.",
    "Action-based Queries: Find emails you might want to delete, archive, forward, or reply to.",
    "Folders, Tabs, or Labels: Search emails by Gmail folders, custom labels, or inbox tabs like Primary/Promotions.",
    "Meeting or Travel Details: Find emails related to scheduled meetings, travel bookings, or itineraries.",
    "Email Status Updates: Search for shipping, order, or service status update emails.",
    "Auto-generated Emails: Filter emails sent from bots, automated systems, or no-reply addresses."
]

def CreatePrompt(seed_topic):
    return f"""You are simulating a user searching their email based on the following category:

{seed_topic}

Generate 25 realistic and concise natural language queries that a user might type into an email search bar.

Guidelines:
- Keep queries short, natural, and varied.
- Avoid assistant-like phrases (e.g., "Can you", "Please", "Hey assistant").
- Make them sound like something a user would quickly type or say when searching.
- Cover different ways of asking about the same thing (diversity matters).
- Use informal but clear phrasing where appropriate.

Return only a **numbered list of the 25 queries**. No extra commentary or explanation."""


def query_ollama(prompt, temperature=0.3):
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "system": "You simulate real user email queries for a virtual assistant.",
        # Ollama reads sampling parameters from "options", not the top level
        "options": {"temperature": temperature},
        "stream": False
    }
    # Local generation can be slow, so allow a generous timeout
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["response"]

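import re

def parse_response(text):
    # Keep only lines that start with a list number such as "1." or "2)",
    # so commentary lines or queries that merely contain a period are not split
    queries = []
    for line in text.strip().split("\n"):
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            query = match.group(1).strip()
            if query:
                queries.append(query)
    return queries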

def generate_queries(seed_topics, output_csv):
    all_queries = []
    for seed in seed_topics:
        print(f"\n[+] Generating queries for:\n{seed}")
        prompt = CreatePrompt(seed)
        try:
            response = query_ollama(prompt)
            queries = parse_response(response)
            all_queries.extend([(query, seed.split(":")[0].strip()) for query in queries])
            time.sleep(1)  # avoid API flooding
        except Exception as e:
            print(f"[!] Error for seed '{seed}': {e}")

    # Write to CSV
    print(f"\n[✓] Saving {len(all_queries)} queries to '{output_csv}'")
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["query", "seed"])
        writer.writerows(all_queries)

# Run the script
generate_queries(SEED_TOPICS, "email_queries_dataset.csv")



[+] Generating queries for:
Sender or Recipient: Find emails sent by or received from specific people or addresses.

[+] Generating queries for:
Unread or Read: Filter emails based on whether they’ve been read or not.

[+] Generating queries for:
Emails with Attachments: Identify emails that contain attachments of any type or specific formats (e.g., PDF, image).

[+] Generating queries for:
Important or Starred: Find emails marked as important, flagged, or starred by the user.

[+] Generating queries for:
Specific Words or Phrases: Search for emails using keywords from the subject or body.

[+] Generating queries for:
Emails by Date or Time: Locate emails sent or received on specific dates or within time ranges.

[+] Generating queries for:
Replies or Forwards: Identify emails that are replies to or forwarded versions of other emails.

[+] Generating queries for:
Thread Length: Find email conversations with many replies (long threads).

[+] Generating queries for:
Emails with Links: S
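
With a low temperature and 20 related seed topics, the model can return the same query more than once with only case or punctuation differences. The following is a minimal sketch of a de-duplication pass that could run on the `(query, seed)` pairs before they are written to CSV. The helpers `normalize` and `dedupe_queries` are hypothetical and not part of the original notebook, and the sample rows are made up for illustration.

```python
import string

def normalize(query):
    """Lowercase, drop punctuation, and collapse whitespace for comparison."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(query.lower().translate(table).split())

def dedupe_queries(rows):
    """Keep the first (query, seed) pair for each normalized query text."""
    seen = set()
    unique = []
    for query, seed in rows:
        key = normalize(query)
        if key and key not in seen:
            seen.add(key)
            unique.append((query, seed))
    return unique

# Made-up rows in the same (query, seed) shape the notebook collects
rows = [
    ("Emails from john", "Sender or Recipient"),
    ("emails from John.", "Sender or Recipient"),
    ("Unread from my bank", "Unread or Read"),
]
print(dedupe_queries(rows))
```

Comparing on a normalized key while keeping the original text preserves the casual phrasing in the dataset and still removes near-identical rows.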

Generating calendar queries locally with Llama 3.2 and Ollama

In [2]:

# Expanded and shuffled Calendar seed topics
CALENDAR_SEED_TOPICS = [
    "Meetings with a specific person",
    "Events scheduled for today or tomorrow",
    "Recurring events like weekly standups or classes",
    "Canceled or rescheduled events",
    "Events with Zoom or Google Meet links",
    "Personal events (e.g., birthdays, gym sessions, doctor appointments)",
    "Events at a specific location or city",
    "Free time or availability on a specific day",
    "Overlapping or conflicting events",
    "Events created by me",
    "Reminders or tasks scheduled",
    "All-day events",
    "Events during weekends or holidays",
    "Work-related events or meetings",
    "Events from shared calendars",
    "Events tagged as high priority or urgent",
    "Upcoming events in the next 7 days",
    "Events that were declined or not responded to",
    "One-on-one vs group meetings",
    "Events with certain keywords in title or description",
    "Back-to-back meetings",
    "Past events within a specific date range",
    "Morning or evening events",
    "Events that last more than 1 hour",
    "Recurring events that were skipped or modified",
    "Team meetings involving a specific department",
    "Calendar events that include attachments or documents",
    "Virtual events vs in-person events",
    "Events where I'm marked as 'Maybe'",
    "Events outside my working hours"
]


def create_calendar_prompt(seed_topic):
    return f"""You are simulating a user searching their calendar or scheduling assistant based on the following category:

{seed_topic}

Generate 25 realistic and concise natural language queries that a user might type into a calendar app or say to a virtual assistant.

Guidelines:
- Keep queries short, direct, and natural.
- Avoid assistant-style phrasing like "Can you", "Please", or "Hey assistant".
- Make them sound like something a person would quickly type when trying to find or manage events.
- Include a variety of phrasings for similar intents to ensure diversity.

Return only a **numbered list of the 25 distinct queries**. No explanation or extra text."""

# Send prompt to Ollama and get response
def Query_ollama(prompt, temperature=0.3):
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "system": "You simulate real user calendar queries for a virtual assistant.",
        # Ollama reads sampling parameters from "options", not the top level
        "options": {"temperature": temperature},
        "stream": False
    }
    # Local generation can be slow, so allow a generous timeout
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["response"]

# Extract queries from model response
import re

def parse_Queries(text):
    # Keep only lines that start with a list number such as "1." or "2)"
    queries = []
    for line in text.strip().split("\n"):
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            query = match.group(1).strip()
            if query:
                queries.append(query)
    return queries

# Generate and save queries
def generate_Queries(output_csv):
    all_queries = []

    for seed in CALENDAR_SEED_TOPICS:
        print(f"[+] Generating queries for:\n{seed}")
        prompt = create_calendar_prompt(seed)
        try:
            response = Query_ollama(prompt)
            queries = parse_Queries(response)
            all_queries.extend([(query, seed) for query in queries])
            time.sleep(1)  # polite delay
        except Exception as e:
            print(f"[!] Error for seed '{seed}': {e}")

    print(f"[✓] Saving {len(all_queries)} queries to '{output_csv}'")
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["query", "seed_topic"])
        writer.writerows(all_queries)


generate_Queries("calendar_queries_dataset.csv")


[+] Generating queries for:
Meetings with a specific person
[+] Generating queries for:
Events scheduled for today or tomorrow
[+] Generating queries for:
Recurring events like weekly standups or classes
[+] Generating queries for:
Canceled or rescheduled events
[+] Generating queries for:
Events with Zoom or Google Meet links
[+] Generating queries for:
Personal events (e.g., birthdays, gym sessions, doctor appointments)
[+] Generating queries for:
Events at a specific location or city
[+] Generating queries for:
Free time or availability on a specific day
[+] Generating queries for:
Overlapping or conflicting events
[+] Generating queries for:
Events created by me
[+] Generating queries for:
Reminders or tasks scheduled
[+] Generating queries for:
All-day events
[+] Generating queries for:
Events during weekends or holidays
[+] Generating queries for:
Work-related events or meetings
[+] Generating queries for:
Events from shared calendars
[+] Generating queries for:
Events tagged as 

Loading both CSV files, labeling their queries, and merging them into a single dataset

In [5]:
import pandas as pd
# Load both CSV files
email_df = pd.read_csv("email_queries_dataset.csv")
calendar_df = pd.read_csv("calendar_queries_dataset.csv")

# Assign labels
email_df["label"] = 0
calendar_df["label"] = 1

# Keep only required columns
email_df = email_df[["query", "label"]]
calendar_df = calendar_df[["query", "label"]]

# Concatenate both datasets
combined_df = pd.concat([email_df, calendar_df], ignore_index=True)

# Save to dataset.csv
combined_df.to_csv("dataset.csv", index=False)

In [6]:

combined_df.head()


Unnamed: 0,query,label
0,Who sent me that email?,0
1,Emails from john,0
2,Sent to my dad,0
3,From my bank,0
4,Email from sarah,0


In [25]:
import pandas as pd
#replace the dataset
df=pd.read_csv("dataset.csv")
df.head(10)

Unnamed: 0,query,label
0,Who sent me that email?,0
1,Emails from john,0
2,Sent to my dad,0
3,From my bank,0
4,Email from sarah,0
5,Recent emails from friends,0
6,Who's been emailing me lately,0
7,All emails from company email,0
8,Sarah's emails,0
9,Email address of customer service,0


In [7]:
import csv
gmail_queries = [
    "Find emails with PDF attachments",
    "Show me unread messages in my inbox",
    "Search for emails from Sarah about the project proposal",
    "Find messages with subject line 'quarterly report'",
    "Show all emails I've received from marketing@company.com",
    "Find emails with large attachments",
    "Search for emails I starred last week",
    "Show me all emails in my Promotions tab",
    "Find messages with the label 'Urgent'"
]

calendar_queries = [
    "When is my next meeting with the design team?",
    "Show me all events scheduled for next Tuesday",
    "Find appointments with Dr. Johnson",
    "When did I schedule the quarterly review?",
    "Show me all recurring meetings",
    "Find events where I'm marked as optional",
    "When is the marketing presentation scheduled?",
    "Show me all-day events in May",
    "Find meetings I haven't responded to yet",
    "When is the team lunch scheduled for?"
]
# Save Gmail (label 0)
with open('dataset.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    for q in gmail_queries:
        writer.writerow([q,0])

# Save Calendar (label 1)
with open('dataset.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    for q in calendar_queries:
        writer.writerow([q,1])
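
Rows appended to `dataset.csv` in this way exist only on disk. A DataFrame that was built earlier, such as `combined_df`, does not see them until the file is read again. The sketch below illustrates this with the standard library only, using a temporary file as a stand-in for `dataset.csv`. The helpers `append_rows` and `label_counts` are hypothetical names introduced for the example.

```python
import csv
import os
import tempfile
from collections import Counter

def append_rows(path, rows):
    """Append (query, label) rows to an existing CSV file."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

def label_counts(path):
    """Re-read the CSV from disk and count rows per label."""
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(row["label"] for row in csv.DictReader(f))

# Temporary stand-in for dataset.csv so the sketch has no side effects
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([["query", "label"], ["Emails from john", 0]])
    append_rows(path, [["Find emails with PDF attachments", 0],
                       ["Show me all recurring meetings", 1]])
    counts = label_counts(path)
    print(dict(counts))
```

Counting labels after the reload is a quick way to confirm that the hand-written queries actually reached the data used for training.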

# Model Development
# Importing required packages

In [8]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from transformers import BertTokenizer, TFBertForSequenceClassification


In [None]:
pip install --upgrade transformers tensorflow

# 3. Train/Val/Test Split
The dataset is split 80/10/10: 20% is held out first, and that portion is then halved into validation and test sets. Both splits are stratified by label.

In [9]:

train_df, test_df = train_test_split(combined_df, test_size=0.2, stratify=combined_df['label'], random_state=42)
val_df, test_df = train_test_split(test_df, test_size=0.5, stratify=test_df['label'], random_state=42)
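
To make the effect of stratification concrete, here is a minimal pure-Python sketch of an 80/10/10 split that keeps the label ratio the same in every part. It is an illustration of the idea, not a reimplementation of scikit-learn's `train_test_split`. The function `stratified_split` and the toy label list are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return train/val/test index lists with roughly equal label ratios."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels: 50 email queries (0) and 50 calendar queries (1)
labels = [0] * 50 + [1] * 50
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting each label group separately is what guarantees that a small validation or test set does not end up dominated by one class.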

Tokenization

In [None]:
# Load tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)

# Tokenize function
def tokenize_data(df, tokenizer, max_length=128):
    return tokenizer(
        df["query"].tolist(),
        max_length=max_length,
        truncation=True,
        padding="max_length",
        return_tensors="np"
    )

# Tokenize all splits
train_encodings = tokenize_data(train_df, tokenizer)
val_encodings = tokenize_data(val_df, tokenizer)
test_encodings = tokenize_data(test_df, tokenizer)

# Convert to TensorFlow Datasets manually (use Keras model.fit)
def convert_to_dataset(encodings, labels):
    inputs = {
        "input_ids": tf.convert_to_tensor(encodings["input_ids"], dtype=tf.int32),
        "attention_mask": tf.convert_to_tensor(encodings["attention_mask"], dtype=tf.int32),
    }
    targets = tf.convert_to_tensor(labels, dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((inputs, targets)).batch(16)

train_dataset = convert_to_dataset(train_encodings, train_df["label"].values)
val_dataset = convert_to_dataset(val_encodings, val_df["label"].values)
test_dataset = convert_to_dataset(test_encodings, test_df["label"].values)
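
Because every query is padded to `max_length=128`, most positions in a short query are padding, and the attention mask tells the model which positions to ignore. The toy sketch below shows that relationship on plain integer lists. It simply slices when truncating and does not reproduce the real tokenizer's handling of special tokens. The function `pad_and_mask` and the placeholder ids are hypothetical.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token ids and build the matching attention mask."""
    ids = list(token_ids[:max_length])
    padding = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * padding
    return ids + [pad_id] * padding, attention_mask

# Placeholder ids standing in for a short tokenized query (not real vocabulary lookups)
short_query = [11, 22, 33]
input_ids, attention_mask = pad_and_mask(short_query, max_length=8)
print(input_ids)
print(attention_mask)
```

For queries this short, a smaller `max_length` would reduce wasted computation, although the right value depends on the longest query in the dataset.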

Model Development

Choosing bert-base-uncased because:

It is simple, robust, battle-tested, and works well for English classification tasks.

It lowercases its input, so inconsistent casing in informal user queries does not matter, and it can be fine-tuned effectively on a moderately sized dataset like this one.


In [12]:


# Load model
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Wrap the model in a Keras functional Model (not a custom training loop) to avoid the transformers fit() bug
input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
outputs = model(input_ids, attention_mask=attention_mask)[0]
model_keras = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)

# Compile and train
model_keras.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

model_keras.fit(train_dataset, validation_data=val_dataset, epochs=3)

# Evaluate




All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3


AttributeError: 'numpy.ndarray' object has no attribute 'logits'

Model Evaluation


In [14]:
logits = model_keras.predict(test_dataset)
pred_labels = tf.argmax(logits, axis=1).numpy()
true_labels = test_df["label"].values




Model report

In [20]:
print("\nClassification Report:\n")
print(classification_report(true_labels, pred_labels))


Classification Report:

              precision    recall  f1-score   support

           0       1.00      0.87      0.93        45
           1       0.89      1.00      0.94        50

    accuracy                           0.94        95
   macro avg       0.95      0.93      0.94        95
weighted avg       0.94      0.94      0.94        95
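
The figures in the report can be reproduced by hand from confusion-matrix counts. The notebook does not print the confusion matrix, so the counts below are inferred from the reported support, precision, and recall: 39 of 45 email queries classified correctly, 6 email queries predicted as calendar, and all 50 calendar queries classified correctly. The helper `prf` is a hypothetical name used only in this sketch.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts (assumption: consistent with the report, not printed by the notebook)
email_correct, email_as_calendar = 39, 6
calendar_correct, calendar_as_email = 50, 0

# Class 0 (email): its misses are emails predicted as calendar
email_scores = prf(tp=email_correct, fp=calendar_as_email, fn=email_as_calendar)
# Class 1 (calendar): its false positives are those same 6 emails
calendar_scores = prf(tp=calendar_correct, fp=email_as_calendar, fn=calendar_as_email)
accuracy = (email_correct + calendar_correct) / 95

print([round(x, 2) for x in email_scores])
print([round(x, 2) for x in calendar_scores])
print(round(accuracy, 2))
```

Read this way, every error in the test set is an email query routed to the calendar class, which is why email recall and calendar precision are the two numbers below 1.00.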



In [None]:
# Inference Example
user input queries

In [21]:
def predict_query(query):
    inputs = tokenizer(query, return_tensors="tf", truncation=True, padding=True, max_length=128)
    logits = model(inputs).logits
    prediction = np.argmax(logits, axis=1).item()
    return "Gmail" if prediction == 0 else "Calendar"
sample_queries = [
    "Show me unread messages in my inbox",
    "When is my next meeting?",
    "Search for emails from boss",
    "What events are scheduled next Friday?"
]

for q in sample_queries:
    print(f"Query: {q} => Predicted Class: {predict_query(q)}")


Query: Show me unread messages in my inbox => Predicted Class: Gmail
Query: When is my next meeting? => Predicted Class: Calendar
Query: Search for emails from boss => Predicted Class: Gmail
Query: What events are scheduled next Friday? => Predicted Class: Calendar


In [23]:
query = input("Input your query: ")

print(f"Query: {query} => Predicted Class: {predict_query(query)}")

Input your query: welcome to this seminar meeting


Query: welcome to this seminar meeting => Predicted Class: Calendar
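
A two-class model always returns one of its two labels, even for input that is not really a search query. One possible refinement is to turn the logits into probabilities and report low-confidence predictions separately. The sketch below is a pure-Python illustration under stated assumptions: the logit values are made up, the 0.8 threshold is arbitrary, and `softmax` and `label_with_confidence` are hypothetical helpers that are not part of the notebook.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits, labels=("Gmail", "Calendar"), threshold=0.8):
    """Return (label, probability), or ("Uncertain", probability) below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "Uncertain", probs[best]
    return labels[best], probs[best]

# Made-up logits: one near-tie and one clear calendar prediction
print(label_with_confidence([0.2, 0.5]))
print(label_with_confidence([-2.0, 3.0]))
```

A suitable threshold would have to be chosen on the validation set, since a freshly fine-tuned classifier can be overconfident on out-of-domain text.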
