## Transformers Model Solution

This is more of an experimental test: it assesses whether a transformer model for Table Question Answering that loads quickly can accurately answer this type of question.

For this example, I will use Microsoft's tapex-large-finetuned-wtq model, available on Hugging Face. This is a relatively small model, with 406 million parameters, which suggests that the results might suffer significantly from hallucination.

https://huggingface.co/microsoft/tapex-large-finetuned-wtq









In [1]:
import pandas as pd
from datetime import date

from transformers import TapexTokenizer, BartForConditionalGeneration



In [2]:
file_path = 'AddressBook.txt'

data = []

with open(file_path, 'r') as file:
    for line in file:
        data.append(line.strip().split(', '))

columns = ['Name', 'Sex', 'Date of birth']
df = pd.DataFrame(data, columns=columns)

In [3]:
df

| | Name | Sex | Date of birth |
|---|---|---|---|
| 0 | Bill McKnight | Male | 16/03/77 |
| 1 | Paul Robinson | Male | 15/01/85 |
| 2 | Gemma Lane | Female | 20/11/91 |
| 3 | Sarah Stone | Female | 20/09/80 |
| 4 | Wes Jackson | Male | 14/08/74 |


In [4]:
# No need to convert the date of birth to a datetime column, as the model doesn't accept datetime columns.

In [5]:
# Reading the model
tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large-finetuned-wtq")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-large-finetuned-wtq")

In [6]:
# Question function
def ask_question(df: pd.DataFrame, question: str):
    # Encode the question together with the table
    encoding = tokenizer(table=df, query=question, return_tensors="pt")

    # Generate model answer
    outputs = model.generate(**encoding)

    # Decode model answer
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

## Question 1: How many males are in the address book ?

In [7]:
question = 'How many males are in the address book ?'
print(ask_question(df, question))

 3


In [8]:
question = 'How many females are in the address book ?'
print(ask_question(df, question))

 2
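As a sanity check, the two counts above can be compared with a direct count over the table. The snippet below is a minimal sketch that copies the five rows shown earlier into a plain list (the `rows` and `sex_counts` names are only illustrative), so it runs without the model or the input file.

```python
from collections import Counter

# Rows copied from the address book table displayed above.
rows = [
    ("Bill McKnight", "Male", "16/03/77"),
    ("Paul Robinson", "Male", "15/01/85"),
    ("Gemma Lane", "Female", "20/11/91"),
    ("Sarah Stone", "Female", "20/09/80"),
    ("Wes Jackson", "Male", "14/08/74"),
]

# Count how often each value of the Sex column occurs.
sex_counts = Counter(sex for _, sex, _ in rows)
print(sex_counts["Male"], sex_counts["Female"])  # prints: 3 2
```

Both model answers match this direct count.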


## Question 2: Who is the oldest person in the address book ?

In [9]:
question = 'Who is the oldest person in the address book ?'
print(ask_question(df, question))

 wes jackson


In [10]:
question = 'Who is the youngest person in the address book ?'
print(ask_question(df, question))

 wes jackson


In [11]:
question = 'Who has the lowest age ?'
print(ask_question(df, question))

 wes jackson


The model answers the "oldest" question correctly, but it returns the same name for the "youngest" and "lowest age" questions, which is wrong. This is likely due to its inability to relate a date of birth to the corresponding age.

To tackle this issue, I will add an age column.

In [12]:
today = date.today()

df['Age'] = pd.to_datetime(df['Date of birth'], format='%d/%m/%y').apply(
    lambda x: today.year - x.year - ((today.month, today.day) < (x.month, x.day)))
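The age expression above relies on tuple comparison: `(today.month, today.day) < (x.month, x.day)` is `True` when the birthday has not yet occurred this year, and subtracting that boolean removes one year. The sketch below isolates the same logic in a hypothetical helper, `age_on`, using only the standard library. The reference date 2024-03-13 is an assumption, chosen because it reproduces the values shown in this notebook; the actual run date is not stated.

```python
from datetime import date, datetime


def age_on(birth: date, ref: date) -> int:
    """Age in whole years on `ref` (hypothetical helper, same logic as the lambda)."""
    # True (counts as 1) when the birthday has not happened yet in ref's year.
    birthday_pending = (ref.month, ref.day) < (birth.month, birth.day)
    return ref.year - birth.year - birthday_pending


# Assumed reference date; not taken from the original run.
reference = date(2024, 3, 13)

bill = datetime.strptime("16/03/77", "%d/%m/%y").date()
paul = datetime.strptime("15/01/85", "%d/%m/%y").date()

print(age_on(bill, reference))  # prints: 46 (birthday on 16 March still pending)
print(age_on(paul, reference))  # prints: 39 (birthday on 15 January already passed)
```

Note that `%y` maps two-digit years 69 to 99 onto 1969 to 1999, which is why `77` is read as 1977 here.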

In [13]:
df

| | Name | Sex | Date of birth | Age |
|---|---|---|---|---|
| 0 | Bill McKnight | Male | 16/03/77 | 46 |
| 1 | Paul Robinson | Male | 15/01/85 | 39 |
| 2 | Gemma Lane | Female | 20/11/91 | 32 |
| 3 | Sarah Stone | Female | 20/09/80 | 43 |
| 4 | Wes Jackson | Male | 14/08/74 | 49 |


In [14]:
# Let's try again
question = 'Who is the oldest person in the address book ?'
print(ask_question(df, question))

 wes jackson


In [15]:
question = 'Who is the youngest person in the address book ?'
print(ask_question(df, question))

 wes jackson


In [16]:
question = 'Who has the lowest age ?'
print(ask_question(df, question))

 gemma lane


The questions are still too complex for this model and need to be simpler: even with the Age column, only the direct question about the lowest age is answered correctly. A larger model, such as ChatGPT-4, could answer them. It would be interesting to try a model like Mistral 7B, but I do not have it installed right now.

![ChatGPT_age_question image](ChatGPT_age_question.png)
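For reference, the expected answers to the oldest and youngest questions can be computed directly from the dates of birth. This is a minimal standard-library sketch, not part of the model pipeline; the `people` and `births` names are illustrative, and the data is copied from the table above.

```python
from datetime import datetime

# Name -> date of birth (DD/MM/YY), copied from the address book table.
people = {
    "Bill McKnight": "16/03/77",
    "Paul Robinson": "15/01/85",
    "Gemma Lane": "20/11/91",
    "Sarah Stone": "20/09/80",
    "Wes Jackson": "14/08/74",
}

# Parse each date of birth into a date object.
births = {
    name: datetime.strptime(dob, "%d/%m/%y").date()
    for name, dob in people.items()
}

oldest = min(births, key=births.get)    # earliest date of birth
youngest = max(births, key=births.get)  # latest date of birth
print(oldest)    # prints: Wes Jackson
print(youngest)  # prints: Gemma Lane
```

This confirms that "wes jackson" is correct for the oldest person and that "gemma lane" is the right answer for the youngest.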

## Question 3: How many days older is Bill than Paul ?

In [17]:
question = 'How many days separate Bill and Paul date of birth?'
print(ask_question(df, question))

 5


Unsurprisingly, it can't answer the last question, which is much harder. Let's add a column that indicates the number of days from each date of birth to today.

In [18]:
# Pass the format explicitly so DD/MM/YY dates are parsed without a warning
df['Days to today'] = (date.today() - pd.to_datetime(df['Date of birth'], format='%d/%m/%y').dt.date).apply(lambda x: x.days)


In [19]:
df

| | Name | Sex | Date of birth | Age | Days to today |
|---|---|---|---|---|---|
| 0 | Bill McKnight | Male | 16/03/77 | 46 | 17164 |
| 1 | Paul Robinson | Male | 15/01/85 | 39 | 14302 |
| 2 | Gemma Lane | Female | 20/11/91 | 32 | 11802 |
| 3 | Sarah Stone | Female | 20/09/80 | 43 | 15880 |
| 4 | Wes Jackson | Male | 14/08/74 | 49 | 18109 |


In [20]:
question = 'What is the difference between the Days to today between Paul and Bill?'
print(ask_question(df, question))

 667


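Same as the second question: the answer is wrong (the expected difference is 17164 - 14302 = 2862 days), so the question is too complex for this model.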

![ChatGPT_days_difference image](ChatGPT_days_difference.png)
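The expected answer to this question does not depend on today's date, because the "days to today" terms cancel when subtracted. A minimal standard-library sketch of the direct calculation follows; the variable names are illustrative.

```python
from datetime import datetime

# Dates of birth copied from the address book (DD/MM/YY).
bill = datetime.strptime("16/03/77", "%d/%m/%y").date()
paul = datetime.strptime("15/01/85", "%d/%m/%y").date()

# Bill was born first, so this is how many days older Bill is than Paul.
days_older = (paul - bill).days
print(days_older)  # prints: 2862

# The same value follows from the 'Days to today' column shown above.
print(17164 - 14302)  # prints: 2862
```

Neither of the model's answers (5 and 667) matches this value.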