# DTC Zoomcamp Q&A Baseline

This is a starter notebook for the [DTC Zoomcamp Q&A challenge on Kaggle](https://www.kaggle.com/competitions/dtc-zoomcamp-qa-challenge) that can help you build a solution to submit. It was created by Timur Kamaliev and [published on Kaggle](https://www.kaggle.com/code/svizor/dtc-zoomcamp-q-a-bert/notebook).

To get started:

* Join the competition and accept the rules
* Download your Kaggle credentials file
* If you're running in Saturn Cloud, configure your instance so that it has access to the Kaggle credentials
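
Before downloading anything, it can help to check that the credentials are visible from the notebook. The sketch below assumes the usual setup, where the Kaggle CLI reads either the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or a `kaggle.json` file in `~/.kaggle`. The helper name `kaggle_credentials_available` is hypothetical, and it only checks that the credentials exist, not that they are valid.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_dir=None):
    """Return True if the Kaggle CLI should be able to find credentials.

    Hypothetical helper: it only checks the two usual locations and does not
    validate the credentials against the Kaggle API.
    """
    env = os.environ if env is None else env
    # Option 1: credentials passed through environment variables
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: the kaggle.json file downloaded from the Kaggle account page
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return (config_dir / "kaggle.json").is_file()


print("Kaggle credentials found:", kaggle_credentials_available())
```

If this prints `False`, the download cell is likely to fail with an authentication error.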

When this is done, we start by downloading the data. The following cell only needs to be executed once:

In [1]:
!kaggle competitions download -c dtc-zoomcamp-qa-challenge
!mkdir data
!unzip dtc-zoomcamp-qa-challenge.zip -d data > /dev/null
!rm dtc-zoomcamp-qa-challenge.zip

In [2]:
!ls data

attachments	  test_questions.csv  train_questions.csv
test_answers.csv  train_answers.csv


Now we can execute the rest of the notebook.

In [3]:
import os
import pandas as pd
import numpy as np
import torch

from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore")

In [4]:
data_path = './data'
train_questions_df = pd.read_csv(f'{data_path}/train_questions.csv')
train_answers_df = pd.read_csv(f'{data_path}/train_answers.csv')
test_questions_df = pd.read_csv(f'{data_path}/test_questions.csv')
test_answers_df = pd.read_csv(f'{data_path}/test_answers.csv')

In [5]:
train_questions_df.head()

Unnamed: 0,question_id,question,course,year,candidate_answers,answer_id
0,79062,"For categorical target set, where the distribu...",Machine Learning Zoomcamp,2021,156400754877105368643810912439,156400
1,468946,Is there anything that we are not allowed to u...,Machine Learning Zoomcamp,2021,641330634887912439425941642829,634887
2,968800,I have been catching up and have been doing ho...,Data Engineering Zoomcamp,2022,9540161678567591936798838013,954016
3,688404,Could you please explain what code we should l...,Data Engineering Zoomcamp,2022,1986616298986865773699141765,3699
4,63921,Is it just me or does the model have really ba...,Machine Learning Zoomcamp,2021,754877604487912439858915425941,858915


In [6]:
# Let's merge the train questions and answers into one dataframe
train_merged_df = pd.merge(
    train_questions_df, train_answers_df, on='answer_id', how='inner', suffixes=('_question', '_answer')
)

In [7]:
# Drop duplicates
train_merged_df = train_merged_df.drop_duplicates()
train_merged_df.shape

(397, 10)

In [8]:
train_merged_df.head()

Unnamed: 0,question_id,question,course_question,year_question,candidate_answers,answer_id,answer,course_answer,year_answer,attachments_files
0,79062,"For categorical target set, where the distribu...",Machine Learning Zoomcamp,2021,156400754877105368643810912439,156400,Alexey\nShould we use something non-standard t...,Machine Learning Zoomcamp,2021,
1,468946,Is there anything that we are not allowed to u...,Machine Learning Zoomcamp,2021,641330634887912439425941642829,634887,"No, I don't think there is anything you cannot...",Machine Learning Zoomcamp,2021,
2,968800,I have been catching up and have been doing ho...,Data Engineering Zoomcamp,2022,9540161678567591936798838013,954016,"Alexey\nYes, you will be. You can submit the p...",Data Engineering Zoomcamp,2022,
3,688404,Could you please explain what code we should l...,Data Engineering Zoomcamp,2022,1986616298986865773699141765,3699,Alexey\nI think the question refers to the hom...,Data Engineering Zoomcamp,2022,
4,63921,Is it just me or does the model have really ba...,Machine Learning Zoomcamp,2021,754877604487912439858915425941,858915,"Dmitry\nIt's fine, because this is the showcas...",Machine Learning Zoomcamp,2021,


We can use a pre-trained BERT model for tokenization and for building text embeddings.

In [9]:
# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Available device: {device}")

Available device: cuda


In [10]:
# Getting pre-trained BERT model
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.to(device) # Move model to GPU

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [11]:
def get_bert_embeddings(text):
    """
    Get a text embedding using BERT.
    Returns a single vector: the average of the token embeddings of the text.
    """
    tokens = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    tokens = {key: value.to(device) for key, value in tokens.items()}
    with torch.no_grad():
        outputs = model(**tokens)
    return outputs['last_hidden_state'][0].mean(dim=0).cpu().numpy()
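
Because `get_bert_embeddings` processes one text at a time, no padding tokens are added and a plain mean over the tokens is fine. If the function were extended to embed several texts per call (an assumption, the notebook does not do this), padded positions would have to be excluded from the average. A minimal NumPy sketch of such mask-aware mean pooling, with the hypothetical helper `masked_mean_pool` and toy arrays instead of real BERT outputs:

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings per text, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 texts, 3 token positions, hidden size 2.
# The second text has only 2 real tokens; its last position is padding.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
])
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)  # rows: [3. 4.] and [2. 2.]
```

The large values in the padded position do not affect the second row, which is the point of applying the mask.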

In [12]:
# Get questions and answers embeddings for the train part
train_question_embeddings = train_merged_df['question'].apply(get_bert_embeddings)
train_answer_embeddings = train_merged_df['answer'].apply(get_bert_embeddings)

In [13]:
# Standardization: fit the scaler on the question embeddings and reuse it for the answers
scaler = StandardScaler()
train_question_embeddings_standardized = scaler.fit_transform(np.array(train_question_embeddings.tolist()))
train_answer_embeddings_standardized = scaler.transform(np.array(train_answer_embeddings.tolist()))
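
To make the scaling step concrete, here is a small sketch of what standardization does to each feature: subtract the mean and divide by the standard deviation, both estimated on the data the scaler was fit on. The toy arrays `X_fit` and `X_new` are made up for illustration; in the cell above, the statistics come from the question embeddings and are then reused for the answers.

```python
import numpy as np

# Toy "embeddings": 3 samples with 2 features each
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

mean = X_fit.mean(axis=0)
std = X_fit.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

X_fit_standardized = (X_fit - mean) / std
X_new_standardized = (X_new - mean) / std  # reuses the statistics from X_fit

print(X_fit_standardized.mean(axis=0), X_fit_standardized.std(axis=0))
print(X_new_standardized)
```

After this transformation every feature of `X_fit` has zero mean and unit variance, while `X_new` is only guaranteed to be on the same scale.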

Now we can calculate the similarity between question and answer embeddings. Let's start with the training part.
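
As a small illustration of the ranking idea, the sketch below computes cosine similarity between one question vector and a few candidate vectors and picks the highest-scoring candidate. The vectors are toy values, and `cosine_scores` is a hypothetical helper written with plain NumPy rather than the scikit-learn function the notebook uses.

```python
import numpy as np


def cosine_scores(query, candidates):
    """Cosine similarity between one query vector and each candidate row."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    dots = candidates @ query
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    return dots / norms


question = [1.0, 0.0]
answers = [
    [0.0, 1.0],   # orthogonal to the question
    [2.0, 0.0],   # same direction, different length
    [-1.0, 0.0],  # opposite direction
]

scores = cosine_scores(question, answers)
best = int(scores.argmax())
print(scores, best)
```

Cosine similarity depends only on the direction of the vectors, so the second candidate gets the top score even though its length differs from the question's.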

In [14]:
def get_predictions(df_questions, df_answers):
    """
    Function that finds the best answer to each question according to their similarity.
    """
    predicted_answers = []
    predicted_answer_ids = []

    for index, row in df_questions.iterrows():
        question_text = row['question']
        candidate_answer_ids = [int(answer_id) for answer_id in row['candidate_answers'].split(",")]

        # Getting the question embedding
        question_embedding = get_bert_embeddings(question_text)
        question_embedding_standardized = scaler.transform(question_embedding.reshape(1, -1))

        # Getting answer candidate embeddings
        candidate_answers_df = df_answers[df_answers['answer_id'].isin(candidate_answer_ids)]
        candidate_answer_embeddings = candidate_answers_df['answer'].apply(get_bert_embeddings)
        candidate_answer_embeddings_standardized = scaler.transform(np.array(candidate_answer_embeddings.tolist()))

        # Calculating similarity between question and answers embeddings
        similarities = cosine_similarity(question_embedding_standardized, candidate_answer_embeddings_standardized).flatten()

        # Taking index of the best answer candidate
        best_answer_index = similarities.argmax()

        predicted_answer = candidate_answers_df.iloc[best_answer_index]['answer']
        predicted_answer_id = candidate_answers_df.iloc[best_answer_index]['answer_id']
        predicted_answers.append(predicted_answer)
        predicted_answer_ids.append(predicted_answer_id)
        
    return predicted_answer_ids, predicted_answers
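
In the previews above, some answer IDs appear among the candidates of more than one question, so `get_predictions` may run BERT on the same answer text several times. One possible speed-up is to cache embeddings by text. This is a minimal sketch: `make_cached_embedder` is a hypothetical wrapper, and `fake_embed` stands in for `get_bert_embeddings` so the example runs without a model.

```python
def make_cached_embedder(embed_fn):
    """Wrap an embedding function so each distinct text is embedded once."""
    cache = {}

    def cached(text):
        if text not in cache:
            cache[text] = embed_fn(text)
        return cache[text]

    return cached


# Stand-in for get_bert_embeddings so the sketch runs without a model
calls = []


def fake_embed(text):
    calls.append(text)
    return [float(len(text))]


embed = make_cached_embedder(fake_embed)
for text in ["answer A", "answer B", "answer A"]:
    embed(text)

print(len(calls))  # the underlying function ran only for the distinct texts
```

In the notebook, the wrapper would be applied to `get_bert_embeddings` and the wrapped function used inside `get_predictions`.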

In [15]:
train_predictions_df = pd.DataFrame({
    'question_id': train_questions_df['question_id'],     
    'question': train_questions_df['question'],
    'candidate_answers': train_questions_df['candidate_answers'],
    'answer_id': train_questions_df['answer_id'],
})

In [16]:
train_predictions_df['predicted_answer_id'], train_predictions_df['predicted_answer'] = \
    get_predictions(train_questions_df, train_answers_df)

In [17]:
train_predictions_df.head()

Unnamed: 0,question_id,question,candidate_answers,answer_id,predicted_answer_id,predicted_answer
0,79062,"For categorical target set, where the distribu...",156400754877105368643810912439,156400,156400,Alexey\nShould we use something non-standard t...
1,468946,Is there anything that we are not allowed to u...,641330634887912439425941642829,634887,634887,"No, I don't think there is anything you cannot..."
2,968800,I have been catching up and have been doing ho...,9540161678567591936798838013,954016,954016,"Alexey\nYes, you will be. You can submit the p..."
3,688404,Could you please explain what code we should l...,1986616298986865773699141765,3699,3699,Alexey\nI think the question refers to the hom...
4,63921,Is it just me or does the model have really ba...,754877604487912439858915425941,858915,858915,"Dmitry\nIt's fine, because this is the showcas..."


In [18]:
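# Accuracy calculation: compare predicted and true answer IDs within the same dataframe,
# so the result does not depend on row alignment with train_answers_df
accuracy = (train_predictions_df['predicted_answer_id'] == train_predictions_df['answer_id']).mean()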
print(f'Accuracy: {accuracy:.4f}')

Accuracy: 0.7330


Now we can use the same approach to get answers for the test part.

In [19]:
test_questions_df = test_questions_df.drop_duplicates(subset='question_id')
test_questions_df.shape

(514, 5)

In [20]:
# Creating the dataframe for the test part
test_predictions_df = pd.DataFrame({
    'question_id': test_questions_df['question_id'], 
})
test_predictions_df['predicted_answer_id'], test_predictions_df['predicted_answer'] = \
    get_predictions(test_questions_df, test_answers_df)

In [21]:
test_predictions_df.head()

Unnamed: 0,question_id,predicted_answer_id,predicted_answer
0,707,767296,Alexey\nProbably more than you want to put in....
1,534450,231208,"Yes… and no? Sometimes, yeah. I wouldn't say o..."
2,996163,816559,Alexey\nYou can create a Python path variable ...
3,860215,988549,"Again, you’ll probably hate me soon for saying..."
4,980124,384381,Alexey\nThe first thing about the dataset – wh...


In [22]:
test_predictions_df[['question_id', 'predicted_answer_id']].to_csv('BERT_sample_submission.csv', index=False)

Now let's submit the predictions.

In [23]:
!kaggle competitions submit dtc-zoomcamp-qa-challenge -f BERT_sample_submission.csv -m 'validation: 0.7330'

100%|██████████████████████████████████████| 6.96k/6.96k [00:00<00:00, 7.69kB/s]
Successfully submitted to DTC Zoomcamp Q&A Challenge

What can help improve the solution:
* using some extra data
* using other available features or generating new ones
* using another approach or model (e.g. to build embeddings and compare them)
* fine-tuning the model
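
As one example of "another approach", here is a sketch of a purely lexical baseline that ranks candidates by TF-IDF cosine similarity instead of BERT embeddings. It is not part of the original notebook: the functions `tokenize`, `tfidf_vectors`, `sparse_cosine` and `best_candidate` are hypothetical, the example texts are made up, and the IDF weights are computed only over the question and its candidates.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) for a small list of texts."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    # Smoothed IDF, so tokens present in every text still get a positive weight
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    return [
        {tok: count * idf[tok] for tok, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_candidate(question, candidates):
    """Return the index of the candidate most similar to the question."""
    vectors = tfidf_vectors([question] + candidates)
    scores = [sparse_cosine(vectors[0], v) for v in vectors[1:]]
    return max(range(len(candidates)), key=scores.__getitem__)


question = "How do I install docker on windows?"
candidates = [
    "The deadline for the project is next week.",
    "You can install docker desktop on windows from the official site.",
]
print(best_candidate(question, candidates))  # index of the docker answer
```

A baseline like this is cheap to run and gives a reference point for judging whether the embedding-based approach is worth its cost.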