# Project 2: Natural Language Processing
Authors: Zechen Wu, Elena Franchini

# Investigate dataset

## Dataset selection
We use SQuAD2.0, the Stanford Question Answering Dataset. The website provides the training and validation (i.e. development) sets as JSON files.


## Dataset analysis
The training set consists of questions and answers about Wikipedia articles, stored as a list under the 'data' key. Each entry has a 'title' key and a 'paragraphs' key; 'paragraphs' is an array holding the questions and answers associated with that title (the title acts as a category). Each question consists of the question text, an id, an array of answers and a flag ('is_impossible') indicating whether the question cannot be answered: if the flag is true, the array of answers is empty. In addition, each answer has an 'answer_start' key whose value is the starting character position of the answer in the paragraph's 'context' text.
Some questions also have plausible answers ('plausible_answers'): in SQuAD2.0 these accompany questions flagged as impossible and hold answers that look plausible but are not supported by the paragraph.
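
The nesting described above can be sketched on a tiny hand-written sample. The sample content, the `iter_questions` helper and the variable names are illustrative assumptions, not part of the dataset or of the notebook's code; only the key names follow the published SQuAD2.0 layout.

```python
# Hand-written sample that mimics the SQuAD2.0 layout; it is NOT real dataset content.
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "The sky is blue during the day.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What colour is the sky during the day?",
                            "is_impossible": False,
                            "answers": [{"text": "blue", "answer_start": 11}],
                        },
                        {
                            "id": "q2",
                            "question": "What colour is the sea at night?",
                            "is_impossible": True,
                            "answers": [],
                            "plausible_answers": [{"text": "blue", "answer_start": 11}],
                        },
                    ],
                }
            ],
        }
    ],
}


def iter_questions(dataset):
    """Yield (title, context, qa) for every question in a SQuAD-style dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield article["title"], paragraph["context"], qa


rows = list(iter_questions(sample))
for title, context, qa in rows:
    for answer in qa["answers"]:
        start = answer["answer_start"]
        # answer_start is a character offset into the paragraph's context
        assert context[start:start + len(answer["text"])] == answer["text"]
    print(title, qa["id"], qa["is_impossible"], len(qa["answers"]))
```

The inner check shows how 'answer_start' can be used to recover the answer span from the paragraph text.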

In [65]:
!git clone https://github.com/Ele975/AML_project2.git

fatal: destination path 'AML_project2' already exists and is not an empty directory.


In [112]:
import pandas as pd
import json
import math
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [113]:
# import data from github repository
train = pd.read_json('AML_project2/train-v2.0.json')
val = pd.read_json('AML_project2/dev-v2.0.json')
print(train)
print(val)

    version                                               data
0      v2.0  {'title': 'Beyoncé', 'paragraphs': [{'qas': [{...
1      v2.0  {'title': 'Frédéric_Chopin', 'paragraphs': [{'...
2      v2.0  {'title': 'Sino-Tibetan_relations_during_the_M...
3      v2.0  {'title': 'IPod', 'paragraphs': [{'qas': [{'qu...
4      v2.0  {'title': 'The_Legend_of_Zelda:_Twilight_Princ...
..      ...                                                ...
437    v2.0  {'title': 'Infection', 'paragraphs': [{'qas': ...
438    v2.0  {'title': 'Hunting', 'paragraphs': [{'qas': [{...
439    v2.0  {'title': 'Kathmandu', 'paragraphs': [{'qas': ...
440    v2.0  {'title': 'Myocardial_infarction', 'paragraphs...
441    v2.0  {'title': 'Matter', 'paragraphs': [{'qas': [{'...

[442 rows x 2 columns]


### Count the number of items in both the training and validation sets
The dataset is fairly small, but the split between the training and validation sets is reasonable, since the training set is much larger than the validation set. Datasets are often first split into a training and a test set, with the validation set then obtained by further splitting the training set. Here only the training and validation sets are provided, so the test set has to be taken from the training set.

In [114]:
def count_data(series):
  nr_categories = 0
  nr_questions = 0
  nr_answers = 0

  for category in series:
    nr_categories += 1
    paragraphs = category.get('paragraphs', [])
    for para in paragraphs:
      qas_list = para.get('qas', [])
      nr_questions += len(qas_list)
      for qas in qas_list:
        answers = qas.get('answers', [])
        nr_answers += len(answers)
  return nr_categories, nr_questions, nr_answers

count_train = count_data(train['data'])
count_val = count_data(val['data'])

print("Categories in the training set:", count_train[0])
print("Categories in the validation set:", count_val[0], "\n")
print("Questions in the training set:", count_train[1])
print("Questions in the validation set:", count_val[1], "\n")
print("Answers in the training set:", count_train[2])
print("Answers in the validation set:", count_val[2], "\n")

# total number of items (questions + answers) per split and overall
total_train = count_train[1] + count_train[2]
total_val = count_val[1] + count_val[2]
total_all = total_train + total_val

print('Total data in training set (Q + A):', total_train)
print('Total data in validation set (Q + A):', total_val, "\n")

print('Total data in dataset (Q + A):', total_all, "\n")

print('Partition dataset:')
print('\t Training set:', round(total_train / total_all * 100), '%.')
print('\t Validation set:', round(total_val / total_all * 100), '%.')


Categories in the training set: 442
Categories in the validation set: 35 

Questions in the training set: 130319
Questions in the validation set: 11873 

Answers in the training set: 86821
Answers in the validation set: 20302 

Total data in training set (Q + A): 217140
Total data in validation set (Q + A): 32175 

Total data in dataset (Q + A): 249315 

Partition dataset:
	 Training set: 87 %.
	 Validation set: 13 %.
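
Since a test set has to be carved out of the training set, one possible approach is to hold out whole articles, so that no paragraph is shared between the two parts. The following is a minimal sketch under that assumption; the `split_by_article` helper, the 10% fraction and the toy articles are hypothetical and not taken from the notebook.

```python
import random


def split_by_article(articles, test_fraction=0.1, seed=0):
    """Split a list of SQuAD-style articles into (train, test) lists."""
    shuffled = list(articles)  # copy so the input list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for train['data']; each dict mimics one article.
toy_articles = [{"title": f"article_{i}", "paragraphs": []} for i in range(20)]
train_part, test_part = split_by_article(toy_articles, test_fraction=0.1, seed=0)
print(len(train_part), len(test_part))
```

Splitting at the article level rather than at the question level avoids questions about the same paragraph ending up on both sides of the split.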


### Define the vocabulary size, given by the number of distinct words in the collection
To compute the vocabulary size of the input collection, we first split the question strings into words (the answers are skipped, since we analyse the input only) and insert them into a set, which does not allow duplicates, so only distinct words are counted. Stop words (the most common words) are removed since they add no value to our statistics. The vocabulary still includes a few misspelled words (e.g. 'aan' instead of 'an'); there are only a few of them, and removing them manually is not worthwhile.

In [130]:

def input_vocabulary(series):
  # build the stop-word set once; calling stopwords.words() per question is slow
  stop_words = set(stopwords.words('english'))
  words_set = set()
  words_total = []
  for category in series:
    paragraphs = category.get('paragraphs', [])
    for para in paragraphs:
      qas_list = para.get('qas', [])
      for qa in qas_list:
        question = qa.get('question', '').lower()
        # remove punctuation, keeping '/' so that it can be replaced by a space
        clean_question = re.sub(r'[^\w\s/]', '', question)
        clean_question = re.sub(r'/', ' ', clean_question)
        # remove numbers and words containing digits
        clean_question = re.sub(r'\b(?:\w*\d\w*|\d+)\b', '', clean_question)
        # remove underscores, which \w treats as word characters
        clean_question = re.sub(r'_', '', clean_question)
        words = clean_question.split()
        # remove stopwords
        words_nostopwords = [w for w in words if w not in stop_words]
        for word in words_nostopwords:
          words_set.add(word)
          words_total.append(word)
  return words_set, words_total

words_set_train, words_total_train = input_vocabulary(train['data'])
words_set_val, words_total_val = input_vocabulary(val['data'])

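print("vocabulary size of training set:", len(words_set_train))
print("vocabulary size of validation set:", len(words_set_val))
# this sum counts words shared by both sets twice; the number of distinct words
# in the whole collection would be len(words_set_train | words_set_val)
print("sum of training and validation vocabulary sizes:",  len(words_set_train) + len(words_set_val))

# term frequencies, using the standard-library Counter
from collections import Counter
counts_train = Counter(words_total_train)
counts_val = Counter(words_total_val)
print(counts_train)
print(counts_val)
# ... continue


vocabulary size of training set: 44919
vocabulary size of validation set: 9835
sum of training and validation vocabulary sizes: 54754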
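
Because the training and validation vocabularies can share words, adding their sizes gives an upper bound on the number of distinct words in the whole collection; the exact figure is the size of the set union. A minimal sketch with two made-up vocabularies (the words and variable names are illustrative only):

```python
# Two toy vocabularies that share one word ("capital").
vocab_a = {"capital", "france", "river"}
vocab_b = {"capital", "germany"}

sum_sizes = len(vocab_a) + len(vocab_b)   # shared words are counted twice
union_size = len(vocab_a | vocab_b)       # each distinct word is counted once

print(sum_sizes, union_size)
```

Applied to the notebook's variables, the same idea would read `len(words_set_train | words_set_val)`.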


### Distributions over document length
We counted the frequency of each term within the documents (here, the questions).
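
As a rough illustration, both the term frequencies and the distribution of document lengths can be obtained with `collections.Counter`. The tokenised toy questions below are made up and stand in for the cleaned questions; this is a sketch, not the notebook's final analysis.

```python
from collections import Counter

# Toy tokenised questions (stop words already removed); illustrative only.
toy_questions = [
    ["capital", "france"],
    ["long", "river", "nile"],
    ["capital", "germany"],
]

# How often each term occurs across all questions.
term_freq = Counter(word for question in toy_questions for word in question)

# How many questions have a given length (in words).
length_dist = Counter(len(question) for question in toy_questions)

print(term_freq.most_common(1))
print(sorted(length_dist.items()))
```

The `length_dist` counter maps each length to the number of questions with that length, which can then be plotted as a histogram.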

## Word2Vec embedding (or index documents)

# Train and evaluate model

## Train the model to perform the specific task

## Test pre-trained models on the task if they already exist

## Investigate the effectiveness of Large Language Models (LLMs) together with zero-shot and/or few-shot learning on the task

## Evaluate the different methods and compare their performance across a representative test set

# Add voice interactivity

## Investigate how effective and reliable the voice interactive components are

## If they are not particularly reliable, how might you change them to make them more robust?