# Project 2: Non-Generative Model
Authors: Zechen Wu, Elena Franchini, Erifeoluwa Jamgbadi

# Investigate dataset

## Dataset selection
The dataset we use is SQuAD2.0 (the Stanford Question Answering Dataset). The website provides the training and validation (i.e. development) sets as JSON files.


## Dataset analysis
The training data consists of strings representing questions and answers (drawn from Wikipedia articles), found as values under the 'data' key. Each 'title' key is associated with a 'paragraphs' key, an array containing the questions and answers for that title (the title acts as a category). Each question consists of the question text, an id, an array of answers, and a flag indicating whether the question is impossible to answer: if the flag is true, the array of answers is empty. In addition, each answer has an 'answer_start' key whose value is the starting position of the answer (a character offset in the paragraph's context).
Some questions also have plausible answers, which should be other possible answers in addition to the correct ones (if any).
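
The nesting described above can be walked with a few loops. The record below is a hand-made toy example in the SQuAD 2.0 layout (its title, text, and id values are illustrative assumptions, not taken from the dataset), and `iter_questions` is a hypothetical helper name.

```python
# Toy record in the SQuAD 2.0 layout; all values are made up for illustration.
toy = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Topic",
            "paragraphs": [
                {
                    "context": "The sky is blue during the day.",
                    "qas": [
                        {
                            "question": "What colour is the sky?",
                            "id": "q1",
                            "answers": [{"text": "blue", "answer_start": 11}],
                            "is_impossible": False,
                        },
                        {
                            "question": "What colour is the sea?",
                            "id": "q2",
                            "answers": [],
                            "plausible_answers": [{"text": "blue", "answer_start": 11}],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ],
}


def iter_questions(dataset):
    """Yield (title, question, answer_texts, is_impossible) for every question."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                texts = [a["text"] for a in qa["answers"]]
                yield article["title"], qa["question"], texts, qa["is_impossible"]


rows = list(iter_questions(toy))
for row in rows:
    print(row)
```

The second toy question shows the unanswerable case: `is_impossible` is true, `answers` is empty, and only `plausible_answers` is filled.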

In [1]:
!git clone https://github.com/Ele975/AML_project2.git
!pip install --upgrade gensim
!pip install -U sentence-transformers
!pip install torchinfo
!pip install datasets
!pip install -U accelerate

fatal: destination path 'AML_project2' already exists and is not an empty directory.


In [2]:
#!pip install --upgrade transformers

In [3]:
import pandas as pd
import json
import math
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec

##model parts imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from statistics import mean, stdev
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn import linear_model


##for calculating the accuracy of the model
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize
from sklearn import metrics

import os
import pickle


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# import data from github repository
train = pd.read_json('AML_project2/train-v2.0.json')
val = pd.read_json('AML_project2/dev-v2.0.json')
print(train)
print(val)

    version                                               data
0      v2.0  {'title': 'Beyoncé', 'paragraphs': [{'qas': [{...
1      v2.0  {'title': 'Frédéric_Chopin', 'paragraphs': [{'...
2      v2.0  {'title': 'Sino-Tibetan_relations_during_the_M...
3      v2.0  {'title': 'IPod', 'paragraphs': [{'qas': [{'qu...
4      v2.0  {'title': 'The_Legend_of_Zelda:_Twilight_Princ...
..      ...                                                ...
437    v2.0  {'title': 'Infection', 'paragraphs': [{'qas': ...
438    v2.0  {'title': 'Hunting', 'paragraphs': [{'qas': [{...
439    v2.0  {'title': 'Kathmandu', 'paragraphs': [{'qas': ...
440    v2.0  {'title': 'Myocardial_infarction', 'paragraphs...
441    v2.0  {'title': 'Matter', 'paragraphs': [{'qas': [{'...

[442 rows x 2 columns]
   version                                               data
0     v2.0  {'title': 'Normans', 'paragraphs': [{'qas': [{...
1     v2.0  {'title': 'Computational_complexity_theory', '...
2     v2.0  {'title': 'Southern_Ca

This shows the structure of the JSON file: each entry has a title, which is the topic, followed by the questions, the answers to those questions, and the question ids. In the datasets, the questions and answers are organised under different topics.

In [5]:
trainingCol = train.columns
trainingCol
train.iloc[0, 1]["paragraphs"][:1]


[{'qas': [{'question': 'When did Beyonce start becoming popular?',
    'id': '56be85543aeaaa14008c9063',
    'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
    'is_impossible': False},
   {'question': 'What areas did Beyonce compete in when she was growing up?',
    'id': '56be85543aeaaa14008c9065',
    'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
    'is_impossible': False},
   {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
    'id': '56be85543aeaaa14008c9066',
    'answers': [{'text': '2003', 'answer_start': 526}],
    'is_impossible': False},
   {'question': 'In what city and state did Beyonce  grow up? ',
    'id': '56bf6b0f3aeaaa14008c9601',
    'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
    'is_impossible': False},
   {'question': 'In which decade did Beyonce become famous?',
    'id': '56bf6b0f3aeaaa14008c9602',
    'answers': [{'text': 'late 1990s', 'answer_start': 276}],
    'is_imposs

In [6]:
##title, paragraphs and context
train.iloc[1]["data"]["paragraphs"][:1]

[{'qas': [{'question': "What was Frédéric's nationalities?",
    'id': '56cbd2356d243a140015ed66',
    'answers': [{'text': 'Polish and French', 'answer_start': 182}],
    'is_impossible': False},
   {'question': 'In what era was Frédéric active in?',
    'id': '56cbd2356d243a140015ed67',
    'answers': [{'text': 'Romantic era', 'answer_start': 276}],
    'is_impossible': False},
   {'question': 'For what instrument did Frédéric write primarily for?',
    'id': '56cbd2356d243a140015ed68',
    'answers': [{'text': 'solo piano', 'answer_start': 318}],
    'is_impossible': False},
   {'question': 'In what area was Frédéric born in?',
    'id': '56cbd2356d243a140015ed69',
    'answers': [{'text': 'Duchy of Warsaw', 'answer_start': 559}],
    'is_impossible': False},
   {'question': 'At what age did Frédéric depart from Poland?',
    'id': '56cbd2356d243a140015ed6a',
    'answers': [{'text': '20', 'answer_start': 777}],
    'is_impossible': False},
   {'question': 'What year was Chopin born

In [7]:
for i in trainingCol[0:10]:
  train.iloc[0, 1]["title"]

### Count the number of data points in the training and validation sets
The dataset is quite small, but the partition between the training and validation sets is good, since the training set should always be much larger than the validation set. Usually a dataset is first split into a training and a test set, and the validation set is then obtained by further splitting the training set. Here only the training and validation sets are provided, so the test set has to be taken from the training set.
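
Taking a test set out of the training data can be sketched as follows. This is a minimal example using only the standard library and assuming a 90/10 split; the fraction, the seed, and the helper name `split_off_test` are illustrative choices, not the settings used in this notebook.

```python
import random


def split_off_test(items, test_fraction=0.1, seed=7):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-ins for article-level records. Splitting by article keeps all
# questions about one topic on the same side of the split.
articles = [f"article_{i}" for i in range(20)]
train_part, test_part = split_off_test(articles)
print(len(train_part), len(test_part))
```

Splitting at article level rather than question level is a design choice: it avoids questions about the same paragraph ending up in both the training and the test portion.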

In [8]:
def count_data(series):
  nr_categories = 0
  nr_questions = 0
  nr_answers = 0

  for category in series:
    nr_categories += 1
    paragraphs = category.get('paragraphs', [])
    for para in paragraphs:
      qas_list = para.get('qas', [])
      nr_questions += len(qas_list)
      for qas in qas_list:
        answers = qas.get('answers', [])
        nr_answers += len(answers)
  return nr_categories, nr_questions, nr_answers

count_train = count_data(train['data'])
count_val = count_data(val['data'])

print("Categories in the training set:", count_train[0])
print("Categories in the validation set:", count_val[0], "\n")
print("Questions in the training set:", count_train[1])
print("Questions in the validation set:", count_val[1], "\n")
print("Answers in the training set:", count_train[2])
print("Answers in the validation set:", count_val[2], "\n")

print('Total data in training set (Q + A):', count_train[1] + count_train[2])
print('Total data in validation set (Q + A):', count_val[1] + count_val[2], "\n")

print('Total data in dataset (Q + A):', count_train[1] + count_train[2] + count_val[1] + count_val[2], "\n")

print('Partition dataset:')
print('\t Training set:',round((count_train[1] + count_train[2])/(count_train[1] + count_train[2] + count_val[1] + count_val[2])*100) , '%.')
print('\t Validation set:',round((count_val[1] + count_val[2])/(count_train[1] + count_train[2] + count_val[1] + count_val[2])*100) , '%.')


Categories in the training set: 442
Categories in the validation set: 35 

Questions in the training set: 130319
Questions in the validation set: 11873 

Answers in the training set: 86821
Answers in the validation set: 20302 

Total data in training set (Q + A): 217140
Total data in validation set (Q + A): 32175 

Total data in dataset (Q + A): 249315 

Partition dataset:
	 Training set: 87 %.
	 Validation set: 13 %.


We made some edits to the dataset handling from the previous files to account for data that was missing in some columns, so that all columns have the same length and no answer is assigned to the wrong question.
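
The alignment idea can be sketched on toy data: every question contributes exactly one entry to each list, with `None` standing in for a missing answer. The records and the helper name `align_questions_and_answers` are illustrative assumptions; unlike the notebook's own loop, this sketch keeps only the first answer of each question.

```python
def align_questions_and_answers(qas_list):
    """Return parallel lists with one entry per question; None marks no answer."""
    questions, answers, starts = [], [], []
    for qa in qas_list:
        questions.append(qa["question"])
        first = qa["answers"][0] if qa["answers"] else None
        answers.append(first["text"] if first else None)
        starts.append(first["answer_start"] if first else None)
    return questions, answers, starts


# Made-up records: one answerable question and one without any answer.
toy_qas = [
    {"question": "what colour is the sky",
     "answers": [{"text": "blue", "answer_start": 11}]},
    {"question": "what colour is the sea", "answers": []},
]

questions, answers, starts = align_questions_and_answers(toy_qas)
print(len(questions), len(answers), len(starts))
print(answers, starts)
```

Because the three lists always grow together, row `i` of a dataframe built from them refers to the same question, answer, and start position.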

In [9]:
#through this i want to separate the question and answer text

def input_vocabulary(data):
  questionsTrainList = []
  answersTrainList = []
  contextTrainList = []
  answer_startList = []

  for category in data:
      paragraphs = category.get('paragraphs', [])
      #print(paragraphs)
      for para in paragraphs:
        context_list = para.get('context', [])
        qas_list = para.get('qas', [])
        for qa in qas_list:
          # set lower case
          question = qa.get('question', '').lower()
          # remove punctuation, split '/' and numbers and words with numbers
          clean_question = re.sub(r'[^\w\s/]', '', question)
          clean_question = re.sub(r'/', ' ', clean_question)
          clean_question = re.sub(r'\b(?:\w*\d\w*|\d+)\b', '', clean_question)
          clean_question = re.sub(r'_', '', clean_question)
          questionsTrainList.append(clean_question)
          contextTrainList.append(context_list) ##through this i get the context of each topic
          answers = qa.get('answers',  []) ### because there were empty answer, this was taken care of by putting it as None as the length of the question and answer has to equal each other

          if answers == []:
            text = None
            ansStart = None
            answersTrainList.append(text)
            answer_startList.append(ansStart)
          else:
            for ans in answers:
              text = ans.get('text',  [])
              #print(answers)
              answersTrainList.append(text)
              ansStart = ans.get('answer_start', [])
              answer_startList.append(ansStart)
  return questionsTrainList, answersTrainList, contextTrainList, answer_startList

In [10]:
##calls the training set function
questionsTrainList, answersTrainList, contextTrainList, answerStartTrainList = input_vocabulary(train['data'])

In [11]:
##checking that the right question, answer, context is assigned correctly by viewing the last index in the dataframe
print(f"{questionsTrainList[130318]}")
print(f"{answersTrainList[130318]}")
print(f"{contextTrainList[130318]}")
print(f"{answerStartTrainList[130318]}")

print()
##checking that they are all the same lengths so that there is no issue with tokenising
print(f"{len(questionsTrainList)}")
print(f"{len(answersTrainList)}")
print(f"{len(contextTrainList)}")
print(f"{len(answerStartTrainList)}")

what field of study has a variety of unusual contexts
None
The term "matter" is used throughout physics in a bewildering variety of contexts: for example, one refers to "condensed matter physics", "elementary matter", "partonic" matter, "dark" matter, "anti"-matter, "strange" matter, and "nuclear" matter. In discussions of matter and antimatter, normal matter has been referred to by Alfvén as koinomatter (Gk. common matter). It is fair to say that in physics, there is no broad consensus as to a general definition of matter, and the term "matter" usually is used in conjunction with a specifying modifier.
None

130319
130319
130319
130319


Put the lists in a dataframe for easier analysis and visualisation later on.

In [12]:

data = {'Question': questionsTrainList, 'Answer': answersTrainList, 'Answer_Start': answerStartTrainList, 'Context': contextTrainList}
trainingData = pd.DataFrame(data)
trainingData.head(5)

| | Question | Answer | Answer_Start | Context |
|---|---|---|---|---|
| 0 | when did beyonce start becoming popular | in the late 1990s | 269.0 | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... |
| 1 | what areas did beyonce compete in when she was... | singing and dancing | 207.0 | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... |
| 2 | when did beyonce leave destinys child and beco... | 2003 | 526.0 | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... |
| 3 | in what city and state did beyonce grow up | Houston, Texas | 166.0 | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... |
| 4 | in which decade did beyonce become famous | late 1990s | 276.0 | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... |


<br><br>
Repeating the same process for the validation set, because the column lengths differed when the training function above was called on it. The main difference is where the context list is appended inside the loop.
<br><br>
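
In the development set a question can carry several reference answers, so emitting one row per answer repeats the question (20302 rows for 11873 questions). One possible alternative, sketched here on made-up records, keeps one entry per question with a de-duplicated answer list. `collapse_answers` is a hypothetical helper and is not part of the notebook; the second answer of the second toy question is invented for illustration.

```python
def collapse_answers(qas_list):
    """Map each question to its unique answer texts, preserving first-seen order."""
    collapsed = {}
    for qa in qas_list:
        # dict.fromkeys drops repeats while keeping the original order.
        unique = list(dict.fromkeys(a["text"] for a in qa["answers"]))
        collapsed[qa["question"]] = unique
    return collapsed


toy_val_qas = [
    {"question": "in what country is normandy located",
     "answers": [{"text": "France"}, {"text": "France"},
                 {"text": "France"}, {"text": "France"}]},
    {"question": "when were the normans in normandy",
     "answers": [{"text": "10th and 11th centuries"},
                 {"text": "in the 10th and 11th centuries"}]},
]

collapsed = collapse_answers(toy_val_qas)
for question, texts in collapsed.items():
    print(question, "->", texts)
```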

In [13]:
#through this i want to separate the question and answer text
questionsValList = []
answersValList = []
contextValList = []
answerStartValList = []

for category in val['data']:
    paragraphs = category.get('paragraphs', [])
    #print(paragraphs)
    for para in paragraphs:
      context_list = para.get('context', [])
      qas_list = para.get('qas', [])
      for qa in qas_list:
        # set lower case
        question = qa.get('question', '').lower()
        if question is None:
          text = "None"
          questionsValList.append(text)
        else:
          # remove punctuation, split '/' and numbers and words with numbers
          clean_question = re.sub(r'[^\w\s/]', '', question)
          clean_question = re.sub(r'/', ' ', clean_question)
          clean_question = re.sub(r'\b(?:\w*\d\w*|\d+)\b', '', clean_question)
          clean_question = re.sub(r'_', '', clean_question)
          answers = qa.get('answers',  []) ### because there were empty answer, this was taken care of by putting it as None as the length of the question and answer has to equal each other
          for ans in answers:
            text = ans.get('text',  [])
            #print(answers)
            answersValList.append(text)
            contextValList.append(context_list) ##through this i get the context of each topic
            ansStart = ans.get('answer_start', [])
            answerStartValList.append(ansStart)
            questionsValList.append(clean_question)

In [14]:
##checking that they are all the same lengths so that there is no issue with tokenising
print(f"{len(questionsValList)}")
print(f"{len(answersValList)}")
print(f"{len(contextValList)}")
print(f"{len(answerStartValList)}")

20302
20302
20302
20302


In [15]:
##put it in a dataframe for easier analysis and visualisation later on
data = {'Question': questionsValList, 'Answer': answersValList, 'Answer_Start': answerStartValList, 'Context': contextValList}
valData = pd.DataFrame(data)
valData.head(5)

| | Question | Answer | Answer_Start | Context |
|---|---|---|---|---|
| 0 | in what country is normandy located | France | 159 | The Normans (Norman: Nourmands; French: Norman... |
| 1 | in what country is normandy located | France | 159 | The Normans (Norman: Nourmands; French: Norman... |
| 2 | in what country is normandy located | France | 159 | The Normans (Norman: Nourmands; French: Norman... |
| 3 | in what country is normandy located | France | 159 | The Normans (Norman: Nourmands; French: Norman... |
| 4 | when were the normans in normandy | 10th and 11th centuries | 94 | The Normans (Norman: Nourmands; French: Norman... |


added the training set and the validation set into one dataframe

In [44]:
trainingData = trainingData.iloc[:10]

In [95]:
##assigning a label to each of the answers
ansToLabel = {answer: idx for idx, answer in enumerate(trainingData['Answer'].unique())}
trainingData['Label'] = trainingData['Answer'].map(ansToLabel)
trainingData['Label']

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    4
9    8
Name: Label, dtype: int64

In [46]:
##dropping none type since the tokeniser won't accept nonetype
trainingData.dropna(inplace =True)


In [96]:
X_train, X_val, y_train, y_val = train_test_split(trainingData['Question'], trainingData['Label'], test_size = 0.3, random_state = 7)
print(f'\nShape checks:\nX_train: {X_train.shape} X_val: {X_val.shape}\ny_train: {y_train.shape} y_val: {y_val.shape}')


Shape checks:
X_train: (7,) X_val: (3,)
y_train: (7,) y_val: (3,)


In [97]:
##had to reset the index because after the split, they are not in order
X_train.reset_index(drop=True, inplace = True)
X_val.reset_index(drop=True, inplace = True)
y_train.reset_index(drop=True, inplace = True)
y_val.reset_index(drop=True, inplace = True)

In [98]:
X_train.index
y_train.index

RangeIndex(start=0, stop=7, step=1)

In [99]:
##adding the df back together because i want to combine the question and ans, since the training is going to be done with classification
data = {'Question': X_train, 'Answer': y_train}
trainDF = pd.DataFrame(data)
trainDF

| | Question | Answer |
|---|---|---|
| 0 | when did beyonce leave destinys child and beco... | 2 |
| 1 | what areas did beyonce compete in when she was... | 1 |
| 2 | what role did beyoncé have in destinys child | 8 |
| 3 | who managed the destinys child group | 7 |
| 4 | in what city and state did beyonce grow up | 3 |
| 5 | what album made her a worldwide known artist | 6 |
| 6 | in which decade did beyonce become famous | 4 |


In [100]:
data = {'Question': X_val, 'Answer': y_val}
testDF = pd.DataFrame(data)
testDF

| | Question | Answer |
|---|---|---|
| 0 | when did beyoncé rise to fame | 4 |
| 1 | in what rb group was she the lead singer | 5 |
| 2 | when did beyonce start becoming popular | 0 |


Tokenisation

In [101]:
from tokenizers.processors import BertProcessing
from tokenizers import ByteLevelBPETokenizer
from pathlib import Path

# Initialize a tokenizer
#tokenizer = ByteLevelBPETokenizer()
# Customize training
#tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=["","","",""])

In [129]:
from transformers import BertTokenizer, BertForMaskedLM
from transformers import AutoTokenizer
#tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
# tokenizer._tokenizer.post_processor = BertProcessing(
#     ("", tokenizer.token_to_id("")),
#     ("", tokenizer.token_to_id("")),
# )
# tokenizer.enable_truncation(max_length=512)

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [130]:
print("vocabulary size: ", len(tokenizer.vocab))

vocabulary size:  250002


In [103]:
tokenized_texts = tokenizer(questionsTrainList, answersTrainList, truncation="only_second", return_tensors='pt', padding=True )#, truncation=True
tokenized_texts

{'input_ids': tensor([[ 101, 2043, 2106,  ...,    0,    0,    0],
        [ 101, 2054, 2752,  ...,    0,    0,    0],
        [ 101, 2043, 2106,  ...,    0,    0,    0],
        ...,
        [ 101, 2054, 2003,  ...,    0,    0,    0],
        [ 101, 3043, 2788,  ...,    0,    0,    0],
        [ 101, 2054, 2492,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

# 2. Training Models

Traditional question answering systems rely on techniques such as keyword matching, TF-IDF, or word embeddings to retrieve and select appropriate responses based on matching patterns or keywords. The model below is a logistic regression model trained on our training data.

## Train the model to perform the question and answering task

Training from scratch, so only the configuration is taken from the RoBERTa model.

In [131]:
from transformers import RobertaConfig, RobertaForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# ##creating tokeniser
# import torch

#tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

#Initializing a RoBERTa configuration
config = RobertaConfig(
 vocab_size=len(tokenizer.vocab),
 max_position_embeddings=514,
 num_attention_heads=12,
 num_hidden_layers=12,
 type_vocab_size=1,
)

In [132]:

model = RobertaForMaskedLM(config=config)

In [133]:
model.num_parameters() #110651648


278295186

In [107]:

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

In [32]:
    # inputs = tokenizer(
    #     questions,
    #     examples["Context"],
    #     max_length=384,
    #     truncation="only_second",
    #     return_offsets_mapping=True,
    #     padding="max_length",
    # )

Inserting model built from scratch here

In [33]:
!pip install "transformers[torch]"
# quoted so the shell does not treat ">=0.20.1" as an output redirection
!pip install "accelerate>=0.20.1"
#!pip install -U accelerate



In [140]:
train_data = [{'text':txt, 'context':lbl} for txt, lbl in zip(X_train, y_train)]
test_data = [{'text':txt, 'context':lbl} for txt, lbl in zip(X_val, y_val)]

In [141]:
# Create Huggingface datasets
from datasets import Dataset, DatasetDict

train_data = Dataset.from_list(train_data)
test_data = Dataset.from_list(test_data)

In [142]:
data = DatasetDict()
data['train'] = train_data
data['test'] = test_data

In [143]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True, return_tensors='pt')

tokenized_data = data.map(tokenize_function, batched=True, batch_size=len(train_data)) # Process all the data together
tokenized_data

Map:   0%|          | 0/7 [00:00<?, ? examples/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'context', 'input_ids', 'attention_mask'],
        num_rows: 7
    })
    test: Dataset({
        features: ['text', 'context', 'input_ids', 'attention_mask'],
        num_rows: 3
    })
})

In [167]:
RUN_NAME = '0'

##change the epochs and logging when running final version
training_args = TrainingArguments(
    output_dir="content/model",
    logging_dir="content/logs"+RUN_NAME,
    #overwrite_output_dir=True,
    num_train_epochs=8,
    learning_rate=6e-3,
    lr_scheduler_type='constant',
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    #save_steps=10_000,
    #save_total_limit=2,
    evaluation_strategy="steps", # Log also evaluation
    #prediction_loss_only=True,
    logging_steps=2,
    eval_steps=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['test']
    #train_dataset=tokenized_texts['input_ids'],
)


In [168]:
##training the model
trainer.train()

| Step | Training Loss | Validation Loss |
|---|---|---|
| 2 | 4.0993 | 13.78831 |
| 4 | 6.6102 | 9.165077 |
| 6 | 4.1476 | 11.528354 |
| 8 | 5.0543 | 4.950547 |


TrainOutput(global_step=8, training_loss=4.977859139442444, metrics={'train_runtime': 100.3822, 'train_samples_per_second': 0.558, 'train_steps_per_second': 0.08, 'total_flos': 490650118560.0, 'train_loss': 4.977859139442444, 'epoch': 8.0})

In [173]:
##reducing the size of the dataset as it crashes the session training with the whole dataset
# newdf = newdf.iloc[:2000]
# newdf.head(2)

In [174]:
# data = {'Question': newdf['Question'], 'Answer': newdf['Answer']}
# df = pd.DataFrame(data )
#newdf.dropna(inplace = True)
# ###changing the text data to numeric, so that it will be able to use the svm model
# df['Question'] = tfidf_vectorizer.fit_transform(df['Question']).toarray()
# df['Answer'] = tfidf_vectorizer.fit_transform(df['Answer']).toarray()

using TF-IDF and cosine similarity for the QA system
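
A retrieval-style QA baseline based on TF-IDF and cosine similarity could look like the sketch below. It is a minimal standard-library version written for illustration, not the implementation used in this notebook (the cells that follow use `CountVectorizer` with logistic regression instead). The helper names `tfidf_vectors`, `cosine`, and `answer` are hypothetical, and the three question/answer pairs are the first rows of the training dataframe shown earlier.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vectors, idf) where each vector is a {term: tf * idf} dict."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenised]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def answer(query, questions, answers):
    """Return the stored answer whose question is most similar to the query."""
    vectors, idf = tfidf_vectors(questions)
    # Terms never seen in the stored questions are ignored.
    q_vec = {t: c * idf[t]
             for t, c in Counter(query.lower().split()).items() if t in idf}
    scores = [cosine(q_vec, vec) for vec in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best] if scores[best] > 0 else None


known_questions = [
    "when did beyonce start becoming popular",
    "what areas did beyonce compete in when she was growing up",
    "in what city and state did beyonce grow up",
]
known_answers = ["in the late 1990s", "singing and dancing", "Houston, Texas"]

print(answer("what city did beyonce grow up in", known_questions, known_answers))
print(answer("hello there", known_questions, known_answers))
```

Returning `None` when no term overlaps gives the caller a natural place for an "I don't know" response, instead of forcing a prediction for inputs such as greetings.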

## Logistic Regression Model

Using a logistic regression model for the QA system

In [175]:
##assigning a label to each of the answers
ansToLabel = {answer: idx for idx, answer in enumerate(trainingData['Answer'].unique())}
trainingData['Label'] = trainingData['Answer'].map(ansToLabel)
trainingData['Label']

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    4
9    8
Name: Label, dtype: int64

In [176]:
#dataset is split into training and test dataset
X_train, X_val, y_train, y_val = train_test_split(trainingData['Question'], trainingData['Label'], test_size=0.3, random_state=42)


In [177]:
##had to reset the index because after the split, they are not in order
X_train.reset_index(drop=True, inplace = True)
X_val.reset_index(drop=True, inplace = True)
y_train.reset_index(drop=True, inplace = True)
y_val.reset_index(drop=True, inplace = True)

In [178]:
# Vectorize the training data using CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,2))
X_train_vectorized = vectorizer.fit_transform(X_train)


In [179]:
# Train a logistic regression classifier
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train_vectorized, y_train)

# Evaluation of Model

In [180]:
##turn into numeric representation
X_test_vectorized = vectorizer.transform(X_val)

#getting the predictions
predictions = lr.predict(X_test_vectorized)

the model accuracy.

In [181]:
##accuracy is low for now because the training data has only 10 rows
accuracy = accuracy_score(y_val, predictions)

print(f"Accuracy: {accuracy * 100}")

Accuracy: 0.0


In [182]:
##gives the f1 score, recall, precision
print(classification_report(y_val, predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       1.0
           4       0.00      0.00      0.00       1.0
           5       0.00      0.00      0.00       1.0
           7       0.00      0.00      0.00       0.0

    accuracy                           0.00       3.0
   macro avg       0.00      0.00      0.00       3.0
weighted avg       0.00      0.00      0.00       3.0





In [183]:
# # save the model trained
# with open('/content/lr.pkl','wb') as f:
#     pickle.dump(lr,f)

In [184]:
# #using logistic regression model
# import os
# import pickle

# ##checks if the model already exist
# if os.path.exists("lr.pkl"):
#   with open('lr.pkl', 'rb') as f: ##If it exist it uses it
#       lr = pickle.load(f)
# else: ##if the model doesn't exist
#   lr = LogisticRegression(max_iter=1000)
#   lr.fit(X_train_vectorized, y_train)

#   # save the model trained
#   with open('/content/lr.pkl','wb') as f:
#       pickle.dump(lr,f)


Testing the model on questions asked by the user

**Note:** the training set is small, so the first few training runs used only the Beyoncé questions.

In [185]:
# getting user input and performing the question answering
closeChatWords = ['bye', 'quit']

print("Hi")
print("Write quit or bye when you have finished asking your questions")
while True:
  user_input = input("Question: ")
  if user_input.lower() in closeChatWords:
    print("End of QA session")
    break
  else:
    user_input_vectorized = vectorizer.transform([user_input])
    predicted_label = lr.predict(user_input_vectorized)[0] ###getting the prediction
    predAns = [answer for answer, label in ansToLabel.items() if label == predicted_label][0] ##finding the prediction in the ansLabel
    ##doing a response if the predicted is None
    if predAns is None:
      print("Response: I don't know how to answer this question, ask me another question")
    else:
      ##give the answer
      print("Response:", predAns)

Hi
Write quit or bye when you have finished asking your questions
Question: hi
Response: Mathew Knowles
Question: beyonce
Response: in the late 1990s
Question: what group is beyonce in
Response: Mathew Knowles
Question: bye
End of QA session


As can be seen from the chat above, the bottom half shows the questions I took from the training data, which were all answered correctly, while there were wrong predictions in the top half. **NB - more training is needed**

## Potential extensions

**Improve the performance of the model on the task by adding training data from similar/related datasets such as CoQA and QuAC [1].**

CoQA:
https://downloads.cs.stanford.edu/nlp/data/coqa/coqa-dev-v1.0.json

QuAC: https://stanfordnlp.github.io/coqa/

**Turn the QA system into a chatbot by implementing a dialog manager, and perform an evaluation of the resulting system [3].**
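
To make the chatbot idea more concrete, here is one very small way a dialog manager could wrap a QA function. It is only a rule-based sketch under stated assumptions: the class name `DialogManager`, its methods, and the stand-in `toy_qa` lookup are hypothetical and would be replaced by the trained model.

```python
class DialogManager:
    """Tiny rule-based dialog manager wrapped around a QA function."""

    CLOSE_WORDS = {"bye", "quit"}

    def __init__(self, qa_function):
        self.qa_function = qa_function  # callable: question -> answer or None
        self.history = []               # list of (user_turn, system_turn)
        self.finished = False

    def respond(self, user_input):
        text = user_input.strip()
        if text.lower() in self.CLOSE_WORDS:
            self.finished = True
            reply = "End of QA session"
        elif not text:
            reply = "Please type a question."
        else:
            result = self.qa_function(text)
            reply = result if result is not None else "I don't know how to answer this question."
        # Keeping the history lets later rules depend on earlier turns.
        self.history.append((user_input, reply))
        return reply


def toy_qa(question):
    # Stand-in for the trained model: a fixed lookup used only for illustration.
    known = {"in what city and state did beyonce grow up": "Houston, Texas"}
    return known.get(question.lower().rstrip("?"))


manager = DialogManager(toy_qa)
print(manager.respond("In what city and state did Beyonce grow up?"))
print(manager.respond("hi"))
print(manager.respond("bye"))
```

Separating the dialog state (history, session end) from the answering function means the same manager could be evaluated with different QA models behind it.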

## References


1. https://arxiv.org/pdf/1809.10735.pdf
2. https://www.sciencedirect.com/science/article/pii/S2772442523000655?ref=pdf_download&fr=RR-2&rr=84248bf61c1c9a24
3. https://en.wikipedia.org/wiki/Dialog_manager




