# Description of the dataset:


This project is devoted to the question-answering task. It uses the **BoolQ** dataset from SuperGLUE.

BoolQ is a question-answering dataset of yes/no questions.

Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. I used two `.jsonl` files (`train`, `val`), where each line is a JSON dictionary with the following format:

    Example:
    
    {
      "question": "is france the same timezone as the uk",
      "passage": "At the Liberation of France in the summer of 1944, Metropolitan France kept GMT+2 as it was the time then used by the Allies (British Double Summer Time). In the winter of 1944--1945, Metropolitan France switched to GMT+1, same as in the United Kingdom, and switched again to GMT+2 in April 1945 like its British ally. In September 1945, Metropolitan France returned to GMT+1 (pre-war summer time), which the British had already done in July 1945. Metropolitan France was officially scheduled to return to GMT+0 on November 18, 1945 (the British returned to GMT+0 in on October 7, 1945), but the French government canceled the decision on November 5, 1945, and GMT+1 has since then remained the official time of Metropolitan France.",
      "label": false,
      "idx": 123
    }
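
To make the file layout concrete, here is a minimal sketch of reading such lines with the standard `json` module. The two sample lines are shortened stand-ins for real records, and `parse_boolq_line` and `REQUIRED_KEYS` are hypothetical names used only for this illustration.

```python
import json

# Two shortened lines in the BoolQ .jsonl layout (passages abbreviated for the sketch).
sample_lines = [
    '{"question": "is france the same timezone as the uk", "passage": "At the Liberation of France ...", "label": false, "idx": 123}',
    '{"question": "is house tax and property tax are same", "passage": "Property tax -- Property tax or house tax ...", "label": true, "idx": 1}',
]

REQUIRED_KEYS = {"question", "passage", "label", "idx"}

def parse_boolq_line(line):
    """Parse one .jsonl line and check that the expected keys are present."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return record

records = [parse_boolq_line(line) for line in sample_lines]
print(records[0]["label"], records[1]["idx"])
```

Note that JSON `false`/`true` become Python `False`/`True`, which is why the `label` column later appears as booleans.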




## Data analysis



In [1]:
import json
import os
import random
from collections import defaultdict

import nltk
import numpy as np
import pandas as pd
import torch
from sklearn.svm import SVC

# Tokenizer data required by nltk.word_tokenize.
nltk.download('punkt_tab')
nltk.download('punkt')


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\BCS.DESKTOP-
[nltk_data]     732EA67\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\BCS.DESKTOP-
[nltk_data]     732EA67\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
path = os.path.dirname(os.path.abspath('train.jsonl'))
path

'C:\\Users\\BCS.DESKTOP-732EA67\\My assignments Evgeniya\\LSML2'

In [3]:


# Load the training split: each line of the .jsonl file is one JSON object.
database = []
with open(os.path.join(path, 'train.jsonl'), encoding='utf-8') as f:
    for line in f:
        i_json = json.loads(line)
        database.append([i_json['idx'], i_json['question'], i_json['passage'], i_json['label']])

df_train = pd.DataFrame(database, columns=['idx', 'question', 'passage', 'label']).set_index('idx')
df_train.head(5)

| idx | question | passage | label |
|---|---|---|---|
| 0 | do iran and afghanistan speak the same language | Persian language -- Persian (/ˈpɜːrʒən, -ʃən/)... | True |
| 1 | do good samaritan laws protect those who help ... | Good Samaritan law -- Good Samaritan laws offe... | True |
| 2 | is windows movie maker part of windows essentials | Windows Movie Maker -- Windows Movie Maker (fo... | True |
| 3 | is confectionary sugar the same as powdered sugar | Powdered sugar -- Powdered sugar, also called ... | True |
| 4 | is elder scrolls online the same as skyrim | The Elder Scrolls Online -- As with other game... | False |


In [4]:
# Load the validation split the same way.
database = []
with open(os.path.join(path, 'val.jsonl'), encoding='utf-8') as f:
    for line in f:
        i_json = json.loads(line)
        database.append([i_json['idx'], i_json['question'], i_json['passage'], i_json['label']])

df_val = pd.DataFrame(database, columns=['idx', 'question', 'passage', 'label']).set_index('idx')
df_val.head(5)

| idx | question | passage | label |
|---|---|---|---|
| 0 | does ethanol take more energy make that produces | Ethanol fuel -- All biomass goes through at le... | False |
| 1 | is house tax and property tax are same | Property tax -- Property tax or 'house tax' is... | True |
| 2 | is pain experienced in a missing body part or ... | Phantom pain -- Phantom pain sensations are de... | True |
| 3 | is harry potter and the escape from gringotts ... | Harry Potter and the Escape from Gringotts -- ... | True |
| 4 | is there a difference between hydroxyzine hcl ... | Hydroxyzine -- Hydroxyzine preparations requir... | True |


I chose to use the pre-trained Word2vec model to vectorize words and then sentences. It's fast and does OK on some simple (broadly topical) tasks.

In this case, it is better not to remove stop words, since the algorithm relies on the broader context of the sentence to obtain high-quality word vectors (the regular expression in the next cell does drop digits and most punctuation, keeping only letters and hyphens). In addition, we will get a result that can be compared with the results of BERT. Let's break the Question and Passage columns down into words, strip non-English characters, stem the words and calculate some statistics. I stem the tokens because not every form of a word is present in the Word2vec vocabulary.

In [5]:
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
import re

# Compile the pattern and build the stemmer once instead of on every call.
WORD_RE = re.compile("[A-Za-z-]+")
snowball = SnowballStemmer("english")

def words(text):
    # Keep only runs of ASCII letters and hyphens, then lowercase, tokenize and stem.
    text = " ".join(WORD_RE.findall(text))
    tokens = word_tokenize(text.lower())
    return [snowball.stem(token) for token in tokens]

df_train['question_token'] = df_train.question.apply(words)
df_train['passage_token'] = df_train.passage.apply(words)
df_val['question_token'] = df_val.question.apply(words)
df_val['passage_token'] = df_val.passage.apply(words)
df_train['question_token'][0]

['do', 'iran', 'and', 'afghanistan', 'speak', 'the', 'same', 'languag']

In [6]:

example_number = df_train.shape[0]
class_distribution = df_train.label.value_counts()

# Mean number of tokens per passage and per question.
mean_sentence_length_for_passage = df_train.passage_token.apply(len).mean()
mean_sentence_length_for_question = df_train.question_token.apply(len).mean()

# Vocabulary size over questions and passages together.
unique_words = set()
for column in ('question_token', 'passage_token'):
    for tokens in df_train[column]:
        unique_words.update(tokens)
number_unique_words = len(unique_words)



print('Checking the number of missing values : ' )
print(f'Number of missing values  in Question column: {df_train.question.isnull().sum()}',f'Number of missing values  in Passage column: {df_train.passage.isnull().sum()}',f'Number of missing values  in Label column: {df_train.label.isnull().sum()}','',sep='\n')
print(f'Number of examples in dataset: {example_number}','',sep='\n' )
print(f'Class_distribution: {class_distribution}','',sep='\n' )

print(f'Mean sentence length of a passage: {round(mean_sentence_length_for_passage,0)}',f'Mean length of a question: {round(mean_sentence_length_for_question,0)}','',sep='\n')

print(f'Number of unique words : {number_unique_words}')

Checking the number of missing values : 
Number of missing values  in Question column: 0
Number of missing values  in Passage column: 0
Number of missing values  in Label column: 0

Number of examples in dataset: 9427

Class_distribution: True     5874
False    3553
Name: label, dtype: int64

Mean sentence length of a passage: 96.0
Mean length of a question: 9.0

Number of unique words : 32137


As we can see, the dataset has no missing values, but it is imbalanced: there are about 1.65 times as many "true" answers as "false" answers.
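
As a quick check of these figures, the sketch below recomputes the imbalance ratio from the label counts printed above. It also derives the share of the majority class, which is the accuracy a classifier would reach on the training split by always answering "true"; that baseline is an addition for context, not something computed in the notebook.

```python
# Label counts reported above for the training split.
n_true, n_false = 5874, 3553
total = n_true + n_false

imbalance_ratio = n_true / n_false   # how many "true" answers per "false" answer
majority_share = n_true / total      # accuracy of always answering "true" on this split

print(f"ratio: {imbalance_ratio:.2f}, majority-class share: {majority_share:.3f}")
# prints: ratio: 1.65, majority-class share: 0.623
```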

There's no single, official way to use word2vec to represent sentences. One quick and crude approach is to create a sentence vector by averaging the vectors of all its words.
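
The averaging idea can be illustrated without the real model. The sketch below uses made-up 3-dimensional vectors (the actual embeddings have 300 dimensions) and a hypothetical helper `average_vector`; tokens without a vector are simply skipped.

```python
# Toy 3-dimensional "embeddings"; the values are invented for the illustration.
toy_vectors = {
    "same": [1.0, 0.0, 2.0],
    "language": [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# "the" has no vector here, so only "same" and "language" contribute.
sentence_vector = average_vector(["the", "same", "language"], toy_vectors)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

Averaging discards word order, which is one reason this representation is only a rough baseline.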

In [7]:
# taking embeddings of the words from word2vec

import gensim
import gensim.downloader

word2vec = gensim.downloader.load('word2vec-google-news-300')




In [8]:
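def get_embeddings_sentence_w2v(sentence, model):
    # Average the vectors of the tokens that exist in the Word2vec vocabulary.
    vectors = [model.get_vector(word) for word in sentence if word in model]
    if not vectors:
        # No known token: fall back to a zero vector instead of returning NaN.
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)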

df_train['question_w2v_emb']=df_train.question_token.apply(lambda x: get_embeddings_sentence_w2v(x,word2vec))
df_train['passage_w2v_emb']=df_train.passage_token.apply(lambda x: get_embeddings_sentence_w2v(x,word2vec))
df_val['question_w2v_emb']=df_val.question_token.apply(lambda x: get_embeddings_sentence_w2v(x,word2vec))
df_val['passage_w2v_emb']=df_val.passage_token.apply(lambda x: get_embeddings_sentence_w2v(x,word2vec))
df_train.head(5)

| idx | question | passage | label | question_token | passage_token | question_w2v_emb | passage_w2v_emb |
|---|---|---|---|---|---|---|---|
| 0 | do iran and afghanistan speak the same language | Persian language -- Persian (/ˈpɜːrʒən, -ʃən/)... | True | [do, iran, and, afghanistan, speak, the, same,... | [persian, languag, --, persian, p, r, n, -, n,... | [0.021993002, 0.015218099, 0.12709554, 0.23776... | [-0.0054717134, 0.03689535, 0.08077174, 0.1108... |
| 1 | do good samaritan laws protect those who help ... | Good Samaritan law -- Good Samaritan laws offe... | True | [do, good, samaritan, law, protect, those, who... | [good, samaritan, law, --, good, samaritan, la... | [0.035839844, 0.041729737, 0.02211914, 0.09774... | [-0.0034797825, 0.019353127, 0.029191853, 0.08... |
| 2 | is windows movie maker part of windows essentials | Windows Movie Maker -- Windows Movie Maker (fo... | True | [is, window, movi, maker, part, of, window, es... | [window, movi, maker, --, window, movi, maker,... | [0.05419922, 0.017130533, 0.015370687, 0.03619... | [-0.015647124, 0.011743498, -0.022090148, 0.07... |
| 3 | is confectionary sugar the same as powdered sugar | Powdered sugar -- Powdered sugar, also called ... | True | [is, confectionari, sugar, the, same, as, powd... | [powder, sugar, --, powder, sugar, also, call,... | [-0.023633685, -0.054600306, 0.0531529, 0.0678... | [-0.009095291, -0.032276616, 0.035875518, 0.08... |
| 4 | is elder scrolls online the same as skyrim | The Elder Scrolls Online -- As with other game... | False | [is, elder, scroll, onlin, the, same, as, skyrim] | [the, elder, scroll, onlin, --, as, with, othe... | [0.06457084, 0.031964984, -0.006452288, 0.0318... | [0.061226837, 0.04547988, 0.04089217, 0.051684... |


There are many strategies for feeding the two texts to the model. I chose to concatenate the question and passage embeddings and pass the result as the model input.

In [9]:
def X_creation(dataset):
    # Concatenate the question and passage embeddings into one feature row per example.
    return [
        np.concatenate([question_emb, passage_emb]).tolist()
        for question_emb, passage_emb in zip(dataset['question_w2v_emb'], dataset['passage_w2v_emb'])
    ]

In [10]:
X_train= X_creation(df_train)
y_train=df_train['label']
X_test= X_creation(df_val)
y_test=df_val['label']
len(X_train[0])

600

Let's use an SVM as the classification model. Before training, let's handle the class imbalance with ADASYN, an oversampling method that generates synthetic samples for the minority class, which balances the dataset and can improve classification quality. As the performance metric, let's use accuracy.
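
To get a feel for how much oversampling this implies, the sketch below works through the arithmetic for the `sampling_strategy=0.8` value that the experiment cell passes to ADASYN. It assumes the float is interpreted as the desired minority-to-majority ratio after resampling, as described in the imbalanced-learn documentation; ADASYN itself only hits the target approximately, so the numbers are estimates rather than exact counts.

```python
# Training label counts reported earlier in the notebook.
n_majority, n_minority = 5874, 3553   # "true" and "false" answers
sampling_strategy = 0.8               # desired minority / majority ratio after resampling

# Minority size that would give the requested ratio.
target_minority = int(n_majority * sampling_strategy)
# Approximate number of synthetic "false" examples to generate.
n_synthetic = target_minority - n_minority

print(target_minority, n_synthetic)
```

So the "false" class grows by roughly a third, while the "true" class is left unchanged.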

## Create and run experiments

In [11]:
import os
import sys
import warnings
import pprint

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

import mlflow
import mlflow.sklearn

In [12]:
from imblearn.over_sampling import ADASYN
from sklearn.metrics import accuracy_score

In [18]:
%%writefile MLproject
name: tutorial

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      C: float
    command: "python train.py {C}"

Overwriting MLproject
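
The `command` line above refers to a `train.py` script that is not shown in this notebook. As a rough sketch of how such a script could receive the parameter, assuming `C` arrives as a single positional argument exactly as in `python train.py {C}`, the hypothetical `parse_args` helper below reads it with `argparse`.

```python
import argparse

def parse_args(argv):
    """Read the positional C value that `python train.py {C}` would pass."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("C", type=float, help="SVM regularization strength")
    return parser.parse_args(argv)

# Equivalent to running: python train.py 5.0
args = parse_args(["5.0"])
print(args.C)
```

In a real script the list would come from `sys.argv[1:]`, and the parsed value would be handed to `SVC(C=args.C, ...)`.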


In [19]:
%%writefile conda.yaml
name: tutorial
channels:
  - defaults
dependencies:
  - numpy>=1.14.3
  - pandas>=1.0.0
  - scikit-learn=1.2.1  # same version that MLflow records for the logged model
  - pip
  - pip:
    - mlflow

Overwriting conda.yaml


In [22]:
MLFLOW_SERVER_URL = 'http://127.0.0.1:5000/'
experiment_name = 'experiment-for-svm'

warnings.filterwarnings("ignore")
np.random.seed(40)

client = mlflow.tracking.MlflowClient(MLFLOW_SERVER_URL)
mlflow.set_tracking_uri(MLFLOW_SERVER_URL)
mlflow.set_experiment(experiment_name)

# Oversample the minority class once: the resampled data does not depend on C.
adasyn = ADASYN(sampling_strategy=0.8, n_neighbors=5, random_state=13)
X_resampled_train, y_resampled_train = adasyn.fit_resample(X_train, y_train)

# One MLflow run per value of the regularization parameter C.
for C in (0.1, 1.0, 5.0, 10, 20):
    with mlflow.start_run():
        SVM = SVC(C=C, kernel='rbf', degree=3, gamma='scale', decision_function_shape='ovr', random_state=42)
        SVM.fit(X_resampled_train, y_resampled_train)
        y_pred = SVM.predict(X_test)

        quality = accuracy_score(y_test, y_pred)

        print("SVM with ADASYN (C=%f):" % (C))
        print("  Accuracy: %s" % quality)

        # Record the hyperparameter, the metric and the fitted model for this run.
        mlflow.log_param("C", C)
        mlflow.log_metric("Accuracy", quality)
        mlflow.sklearn.log_model(SVM, "model")


SVM with ADASYN (C=0.100000):
  Accuracy: 0.6217125382262997
🏃 View run treasured-dove-995 at: http://127.0.0.1:5000/#/experiments/1/runs/f61d24d1b9344e9a998f2a2f03e82389
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
SVM with ADASYN (C=1.000000):
  Accuracy: 0.653822629969419
🏃 View run dapper-whale-878 at: http://127.0.0.1:5000/#/experiments/1/runs/36d6aed1ec044a8db6c1feee5227a7dd
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
SVM with ADASYN (C=5.000000):
  Accuracy: 0.6724770642201835
🏃 View run invincible-newt-149 at: http://127.0.0.1:5000/#/experiments/1/runs/8d7bcce8539e468e96622b24402fe0ac
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
SVM with ADASYN (C=10.000000):
  Accuracy: 0.6730886850152905
🏃 View run charming-grub-730 at: http://127.0.0.1:5000/#/experiments/1/runs/6502cfb885d04cdabc769735bbab9616
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
SVM with ADASYN (C=20.000000):
  Accuracy: 0.6770642201834862
🏃 View run inquisitive-mink-862 at: http://127.0.0.1:5000/#/experiments/1/runs/1ccfafbe7c4c4670be48e098135bc792
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
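
To compare the five runs at a glance, the sketch below copies the accuracies from the output above into a dictionary, picks the best `C`, and computes the gain from each step up in `C`. The variable names are illustrative only; the same comparison can of course be done in the MLflow UI.

```python
# Validation accuracies copied from the run output above, keyed by C.
accuracy_by_c = {
    0.1: 0.6217125382262997,
    1.0: 0.653822629969419,
    5.0: 0.6724770642201835,
    10: 0.6730886850152905,
    20: 0.6770642201834862,
}

# C with the highest accuracy.
best_c = max(accuracy_by_c, key=accuracy_by_c.get)

# Accuracy gained by each step up in C, to see where the curve flattens.
cs = sorted(accuracy_by_c)
gains = {b: accuracy_by_c[b] - accuracy_by_c[a] for a, b in zip(cs, cs[1:])}

print(best_c, round(accuracy_by_c[best_c], 4))
for c, gain in gains.items():
    print(f"up to C={c}: +{gain:.4f}")
```

Most of the improvement comes from the first two steps; beyond `C=5` the gains are well under one percentage point.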


In [57]:
client.delete_registered_model("svm-learn-model")

In [35]:
client = mlflow.tracking.MlflowClient(MLFLOW_SERVER_URL)
experiment = client.get_experiment_by_name(experiment_name)
client.search_runs(experiment.experiment_id)[0]

<Run: data=<RunData: metrics={'Accuracy': 0.6770642201834862}, params={'C': '20'}, tags={'mlflow.log-model.history': '[{"run_id": "1ccfafbe7c4c4670be48e098135bc792", '
                             '"artifact_path": "model", "utc_time_created": '
                             '"2024-12-18 19:01:11.310054", "model_uuid": '
                             '"fabeef7b16ca4d94b0fd16c03c02635e", "flavors": '
                             '{"python_function": {"model_path": "model.pkl", '
                             '"predict_fn": "predict", "loader_module": '
                             '"mlflow.sklearn", "python_version": "3.10.9", '
                             '"env": {"conda": "conda.yaml", "virtualenv": '
                             '"python_env.yaml"}}, "sklearn": '
                             '{"pickled_model": "model.pkl", '
                             '"sklearn_version": "1.2.1", '
                             '"serialization_format": "cloudpickle", "code": '
                        

In [53]:
# checking how the model works in test environment
reg_model_name = "svm-learn-model"


'1'

In [58]:

# Register the model
client.create_registered_model(reg_model_name)
# Create a new version from the artifacts of the best run (C=20)
result = client.create_model_version(
    name=reg_model_name,
    source="file:///C:/Users/BCS.DESKTOP-732EA67/artifacts/1/1ccfafbe7c4c4670be48e098135bc792/artifacts/model",
    run_id='1ccfafbe7c4c4670be48e098135bc792'
)

# Move the new version to the Staging stage
client.transition_model_version_stage(
    name=reg_model_name,
    version=result.version,
    stage="Staging"
)


model = mlflow.sklearn.load_model(model_uri=f"models:/{reg_model_name}/Staging")
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))

2024/12/18 22:18:25 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: svm-learn-model, version 1


0.6770642201834862


In [59]:
# checking how the model works in production environment
client.transition_model_version_stage(
    name=reg_model_name,
    version=result.version,
    stage="Production"
)

model = mlflow.sklearn.load_model(model_uri=f"models:/{reg_model_name}/Production")
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))


0.6770642201834862


## Python API

In [60]:
from imblearn.over_sampling import ADASYN
from sklearn.metrics import accuracy_score

adsin=ADASYN(sampling_strategy=0.8, n_neighbors=5, random_state=13)
X_resampled_train,y_resampled_train = adsin.fit_resample(X_train,y_train )

SVM=SVC(C=20, kernel='rbf', degree=3, gamma='scale', decision_function_shape='ovr', random_state=42)
SVM.fit(X_resampled_train,y_resampled_train)
y_pred=SVM.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, y_pred))

Accuracy:  0.6770642201834862


In [61]:
import pickle

raw_data = pickle.dumps(SVM)

with open('SVM.pickle', 'wb') as f:
    f.write(raw_data)

In [145]:
%%writefile server.py
from flask import Flask, request
import json
import pickle
import re


app = Flask(__name__)


def load_model(pickle_path):
    with open(pickle_path, 'rb') as f:
        raw_data = f.read()
        model = pickle.loads(raw_data)
    return model

model = load_model('SVM.pickle')

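def predict_result(data_list):
    # Use the model loaded above; the name SVM is not defined in this module.
    result = model.predict(data_list)
    # Convert the NumPy array of predictions to a plain list so it can be returned as JSON.
    return result.tolist()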


@app.route('/')
def hello():
    return "Hello, from Flask"

@app.route('/predict', methods=["GET", "POST"])
def predict_q():
    if request.method == "POST":
        data = request.get_json(force=True)
        data_list = data['data_list']

        result = predict_result(data_list)

        response = {
            "result": result
        }
        return response
    else:
        return "You should use only POST query"

if __name__ == '__main__':
    app.run("127.0.0.1", 8000)

Overwriting server.py


In [147]:
import requests

data = {
    'data_list': X_test[:5]
}

In [140]:
r = requests.get("http://127.0.0.1:8000/predict", json=data)

print(r.text)

You should use only POST query


In [141]:
r = requests.get("http://127.0.0.1:8000/", json=data)

print(r.text)

Hello, from Flask


In [144]:
r = requests.post("http://127.0.0.1:8000/predict", json=data)

print(r)

<Response [500]>
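
A 500 status means the handler raised an exception on the server side. In a server of this shape there are two usual suspects. One is a name that exists in the notebook but not in `server.py`: the module loads the pickle into `model`, so any other name for the classifier would be undefined there. The other is a prediction array that cannot be serialized to JSON as-is. The sketch below shows the request-handling logic on its own, without Flask, so it can be run and checked in isolation. `StubModel` and `handle_predict` are hypothetical stand-ins, not part of the notebook.

```python
import json

class StubModel:
    """Stand-in for the pickled SVM; always answers True."""
    def predict(self, rows):
        return tuple(True for _ in rows)

def handle_predict(body, model):
    """Parse a JSON request body, run the model, return a JSON response body."""
    data = json.loads(body)
    predictions = model.predict(data["data_list"])
    # Convert to plain Python bools so json.dumps can serialize the result.
    return json.dumps({"result": [bool(p) for p in predictions]})

# Two fake feature rows; the real rows have 600 values each.
request_body = json.dumps({"data_list": [[0.1, 0.2], [0.3, 0.4]]})
response_body = handle_predict(request_body, StubModel())
print(response_body)
```

With a real scikit-learn model, `predict` returns a NumPy array, for which calling `.tolist()` performs the same conversion.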
