<div style="background-color: #F5F5F5; padding: 20px; font-family: 'Arial';">
<h1 style="font-size: 24px; color: #333333;">Query Domain Classification</h1>

<h4 style="font-size: 18px; color: #666666;">2003-02, Consulting and Professional Communications, Assignment 1</h4>

<p style="font-size: 16px; color: #333333;">Shahabuddin Syed - 100895169</p>

<hr style="border: none; border-top: 1px solid #CCCCCC; margin: 20px 0;">

<h3 style="font-size: 20px; color: #333333;">Problem Statement:</h3>

<p style="font-size: 16px; color: #333333;">
Build a system to categorize queries based on their domain using Natural Language Processing.
</p>

<p style="font-size: 16px; color: #333333;">
DSForum is a community-based portal where users can post queries related to data science topics such as machine learning, statistical analysis, data visualization, etc. The company aims to optimize the response time of queries by answering them promptly. However, the community can also provide answers through discussion forums. The queries can belong to various domains, including Techniques, Tools, Careers, etc.
</p>

<p style="font-size: 16px; color: #333333;">
Currently, users manually tag their queries when posting, selecting one of these categories: Techniques, Tools, Career, Hackathons, Resources, Misc, or Other. The query is then forwarded to the relevant team. However, this manual process is prone to errors and affects the query response time.
</p>

<p style="font-size: 16px; color: #333333;">
Can we design and develop a model that accurately classifies queries based on their domain? This will help improve the platform's response time by accurately identifying the domain of each query and redirecting it to the appropriate team for timely resolution.
</p>
</div>


In [2]:
# XGBoost classifier for gradient boosting
from xgboost import XGBClassifier

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split

# Evaluation metric for measuring accuracy
from sklearn.metrics import accuracy_score

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Text preprocessing
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder

# Tokenization utility for text preprocessing
import tiktoken

# Utility function to obtain text embeddings
from openai.embeddings_utils import get_embedding


In [6]:
# Read the training data from 'train.csv' file
train = pd.read_csv('train.csv').dropna(subset = ['Title'])

# Read the test data from 'test.csv' file
test = pd.read_csv('test.csv')

# Read the sample submission file for reference
sample = pd.read_csv('sample_submission.csv')


<div style="background-color: #F9F9F9; padding: 20px;">
<h3 style="font-size: 20px; color: #333333;">Text Preprocessing</h3>
<p style="font-size: 16px; color: #666666;">
This function performs basic text preprocessing on a pandas Series of texts. The following steps are applied:
</p>
<ol style="font-size: 16px; color: #333333;">
    <li>Convert text to lowercase</li>
    <li>Remove punctuation</li>
    <li>Remove numbers</li>
    <li>Tokenize text into words</li>
    <li>Remove stopwords</li>
    <li>Lemmatize words</li>
    <li>Join preprocessed words back into sentences</li>
</ol>
<p style="font-size: 16px; color: #666666;">
The function takes a pandas Series containing the texts to be preprocessed and returns a new Series with the preprocessed texts.
</p>
</div>


In [None]:

def preprocess_text(text_series):
    """
    Function to perform basic text preprocessing on a pandas Series of texts.
    
    Parameters:
    - text_series (pandas Series): Series containing the texts to be preprocessed.
    
    Returns:
    - pandas Series: Series containing the preprocessed texts.
    """
    # Convert text to lowercase
    text_series = text_series.str.lower()
    
    # Remove punctuation
    text_series = text_series.str.replace('[{}]'.format(string.punctuation), '')
    
    # Remove numbers
    text_series = text_series.str.replace('\d+', '')
    
    # Tokenize text into words
    text_series = text_series.apply(word_tokenize)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text_series = text_series.apply(lambda x: [word for word in x if word not in stop_words])
    
    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    text_series = text_series.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
    
    # Join the preprocessed words back into sentences
    text_series = text_series.apply(lambda x: ' '.join(x))
    
    return text_series

In [None]:
train["Title"] = preprocess_text(train["Title"])
test["Title"] = preprocess_text(test["Title"])

<div style="background-color: #F9F9F9; padding: 20px;">
<h3 style="font-size: 20px; color: #333333;">Text Embedding</h3>
<p style="font-size: 16px; color: #666666;">
This function obtains embeddings for a pandas Series of texts. The following steps are performed:
</p>
<ol style="font-size: 16px; color: #333333;">
    <li>Specify the embedding model parameters</li>
    <li>Obtain embeddings for each text using the get_embedding() function</li>
</ol>
<p style="font-size: 16px; color: #666666;">
The function takes a pandas Series containing the texts for which embeddings are to be obtained and returns a new Series with lists of embeddings corresponding to each text.
</p>
<div style="background-color: #D6ECFF; padding: 10px;">
    <p style="font-size: 16px; color: #333333;">
    <b>Note:</b> The embedding model parameters used in this function are as follows:
    </p>
    <ul style="font-size: 16px; color: #333333;">
        <li>Embedding Model: text-embedding-ada-002</li>
        <li>Encoding: cl100k_base</li>
        <li>Maximum Tokens: 8000</li>
    </ul>
</div>
</div>


In [7]:
def get_text_embeddings(text_series):
    """
    Function to obtain embeddings for a pandas Series of texts.
    
    Parameters:
    - text_series (pandas Series): Series containing the texts for which embeddings are to be obtained.
    
    Returns:
    - pandas Series: Series containing lists of embeddings corresponding to each text.
    """
    # Embedding model parameters
    embedding_model = "text-embedding-ada-002"
    embedding_encoding = "cl100k_base"  # Encoding for text-embedding-ada-002
    max_tokens = 8000  # Maximum tokens for text-embedding-ada-002 is 8191
    
    # Obtain embeddings for each text using the get_embedding() function
    embeddings = text_series.apply(lambda x: get_embedding(x, engine=embedding_model))
    
    return embeddings


In [8]:
%time
# Convert "Title" column in the train DataFrame to embeddings
train["embedding"] = get_text_embeddings(train["Title"])

# Convert "Title" column in the test DataFrame to embeddings
test["embedding"] = get_text_embeddings(test["Title"])


CPU times: total: 0 ns
Wall time: 0 ns


In [9]:
# Save train DataFrame with embeddings
train.to_pickle('train_embeddings.pkl')

# Save test DataFrame with embeddings
test.to_pickle('test_embeddings.pkl')


In [10]:
# Load train DataFrame with embeddings
train = pd.read_pickle('train_embeddings.pkl')

# Load test DataFrame with embeddings
test = pd.read_pickle('test_embeddings.pkl')


<div style="background-color: #F9F9F9; padding: 20px;">
<h3 style="font-size: 20px; color: #333333;">Model Building</h3>
<p style="font-size: 16px; color: #666666;">
    This code performs the following steps to build a machine learning model for query domain classification:
</p>
<ol style="font-size: 16px; color: #333333;">
    <li>Extract embeddings from the 'embedding' column in the train DataFrame</li>
    <li>Perform label encoding on the target variable ('Domain')</li>
    <li>Split the dataset into training and validation sets</li>
    <li>Define the XGBoost classifier with the specified parameters</li>
    <li>Train the model using the training data and target variable</li>
    <li>Make predictions on the validation set</li>
    <li>Evaluate the model by calculating the accuracy score</li>
</ol>
<div style="background-color: #D6ECFF; padding: 10px;">
    <p style="font-size: 16px; color: #333333;">
        <b>Note:</b> The XGBoost classifier is configured as follows:
    </p>
    <ul style="font-size: 16px; color: #333333;">
        <li>Objective: multi:softmax</li>
        <li>Number of estimators: 500</li>
        <li>Maximum depth: 9</li>
        <li>Number of parallel threads: -1 (using all available cores)</li>
        <li>Learning rate: 0.02</li>
        <li>Evaluation metric: mean absolute error (mae)</li>
    </ul>
</div>
</div>


In [13]:
X = train[['embedding']].copy()
X = X.embedding.apply(np.array)

# Extract the target variable from the 'Domain' column and perform label encoding
y = train['Domain'].copy()
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(list(X.values), y, test_size=0.2, random_state=42)

# Define the XGBoost classifier
model = XGBClassifier(objective='multi:softmax', n_estimators=500, max_depth=9, n_jobs=-1, learning_rate=0.02, eval_metric='mae')

# Train the model
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print("Accuracy:", accuracy)



Accuracy: 0.6883963494132985


<div style="background-color: #F9F9F9; padding: 20px;">
<h3 style="font-size: 20px; color: #333333;">Testing the Model on Unseen Data</h3>
<p style="font-size: 16px; color: #666666;">
    This code snippet performs the following steps to test the trained model on an unseen dataset:
</p>
<ol style="font-size: 16px; color: #333333;">
    <li>Prepare the test data by extracting embeddings from the 'embedding' column</li>
    <li>Make predictions on the test data using the trained model</li>
    <li>Convert the predicted labels back to their original domain categories</li>
    <li>Assign the converted domain labels to the 'Domain' column in the sample DataFrame</li>
</ol>
<div style="background-color: #D6ECFF; padding: 10px;">
    <p style="font-size: 16px; color: #333333;">
        <b>Note:</b> The test data is assumed to have an 'embedding' column containing the embeddings for the text data. The trained model (`model`) and label encoder (`label_encoder`) are used in this process.
    </p>
</div>
</div>


In [14]:
# Prepare the test data
test_x = test['embedding'].apply(np.array)

# Make predictions on the test data
test_preds = model.predict(list(test_x.values))

# Convert the predicted labels back to their original domain categories
sample['Domain'] = label_encoder.inverse_transform(test_preds)


<div style="background-color: #F9F9F9; padding: 20px;">
<h3 style="font-size: 20px; color: #333333;">Predicting Domain for a Random Query</h3>
<p style="font-size: 16px; color: #666666;">
    The following code demonstrates how to predict the domain for a random query using the trained model:
</p>
<p style="background-color: #D6ECFF; padding: 10px; font-size: 16px; color: #333333;">
    - Select a random query from the test dataset<br>
    - Convert the query to an embedding<br>
    - Reshape the embedding to match the input format expected by the model<br>
    - Predict the domain for the random query using the model<br>
    - Print the predicted domain
</p>
<p style="font-size: 16px; color: #666666;">
    This code selects a random query from the test dataset and converts it to an embedding. The embedding is then reshaped to match the input format expected by the model. Finally, the model predicts the domain for the random query, and the predicted domain is printed.
</p>
</div>


In [28]:
import random

# Random query
random_query = random.choice(test['Title'])

# Convert the query to embedding
embedding_model = "text-embedding-ada-002"
random_query_embedding = get_embedding(random_query, engine=embedding_model)

# Reshape the embedding to match the input format expected by the model
random_query_embedding = np.array(random_query_embedding).reshape(1, -1)

# Predict the domain for the random query
predicted_domain = label_encoder.inverse_transform(model.predict(random_query_embedding))[0]

# Print the predicted domain
print("Random Query:", random_query)
print("Predicted Domain:", predicted_domain)


Random Query: Help needed for sequence problem
Predicted Domain: Techniques
