# Question Answering System




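This notebook builds a question answering system with the cdQA pipeline. It downloads a pre-trained reader model and the BNP Paribas newsroom dataset, shows how to use the PDF converter to create an input DataFrame from a directory of PDF files, fits the retriever, and answers a query.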


In [1]:
!pwd

/content


In [2]:
!git clone https://github.com/cdqa-suite/cdqa.git
%cd /content/cdqa

Cloning into 'cdqa'...
remote: Enumerating objects: 1548, done.
remote: Total 1548 (delta 0), reused 0 (delta 0), pack-reused 1548
Receiving objects: 100% (1548/1548), 560.46 KiB | 22.42 MiB/s, done.
Resolving deltas: 100% (955/955), done.
/content/cdqa


In [3]:
!pip install -e .

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/cdqa
Collecting Flask==1.1.1
  Downloading Flask-1.1.1-py2.py3-none-any.whl (94 kB)
Collecting flask_cors==3.0.8
  Downloading Flask_Cors-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting joblib==0.13.2
  Downloading joblib-0.13.2-py2.py3-none-any.whl (278 kB)
Collecting pandas==0.25.0
  Downloading pandas-0.25.0-cp37-cp37m-manylinux1_x86_64.whl (10.4 MB)
Collecting prettytable==0.7.2
  Downloading prettytable-0.7.2.zip (28 kB)
Collecting transformers==2.1.1
  Downloading transformers-2.1.1-py3-none-any.whl (311 kB)
Collecting scikit_learn==0.21.2
  Downloading scikit_learn-0.21.2-cp37-cp37m-manylinux1_x8

In [4]:

import os
import pandas as pd
from ast import literal_eval

from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model, download_bnpp_data

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.int):


### Download pre-trained reader model and PDF files

In [5]:
# Download model
download_model(model='bert-squad_1.1', dir='./models')


Downloading trained model...


In [6]:

# Download the BNP Paribas newsroom dataset
download_bnpp_data(dir='./data/bnpp_newsroom_v1.1/')



Downloading BNP data...


In [7]:
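# The 'paragraphs' column is stored as text; literal_eval turns each cell back into a list
df = pd.read_csv('./data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})
# Filter the paragraphs before they are passed to the retriever
df = filter_paragraphs(df)
df.head()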

| Unnamed: 0 | date | title | category | link | abstract | paragraphs |
|---|---|---|---|---|---|---|
| 0 | 13.05.2019 | The banking jobs : Assistant Vice President – ... | Careers | https://group.bnpparibas/en/news/banking-jobs-... | Within the Group’s Corporate and Institutional... | [I manage a team in charge of designing and im... |
| 1 | 13.05.2019 | BNP Paribas at #VivaTech : discover the progra... | Innovation | https://group.bnpparibas/en/news/bnp-paribas-v... | From Thursday 16 to Saturday 18 May 2019, join... | [With François Hollande, Chairman of French fo... |
| 2 | 13.05.2019 | "The bank with an IT budget of more than EUR6 ... | Group | https://group.bnpparibas/en/news/the-bank-budg... | Interview with Jean-Laurent Bonnafé, Director ... | [We did the groundwork between 2012 and 2016, ... |
| 3 | 10.05.2019 | BNP Paribas at #VivaTech : discover the progra... | Innovation | https://group.bnpparibas/en/news/bnp-paribas-v... | From Thursday 16 to Saturday 18 May 2019, join... | [As part of the ‘United Tech of Europe’ theme,... |
| 4 | 10.05.2019 | When Artificial Intelligence participates in r... | Careers | https://group.bnpparibas/en/news/artificial-in... | As the competition to attract talent intensifi... | [Online recruitment is already the norm. Accor... |
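The `converters={'paragraphs': literal_eval}` argument matters because a CSV file can only store the list of paragraphs as text. The sketch below illustrates the same per-cell parsing using only the standard library. The two-row CSV and its contents are made up for illustration and are not taken from the BNP Paribas dataset.

```python
import csv
import io
from ast import literal_eval

# Hypothetical two-row CSV; each 'paragraphs' cell is a stringified Python list.
raw_csv = io.StringIO(
    'title,paragraphs\n'
    'Doc A,"[\'First paragraph.\', \'Second paragraph.\']"\n'
    'Doc B,"[\'Only paragraph.\']"\n'
)

rows = []
for row in csv.DictReader(raw_csv):
    # Before this step the cell is a plain str such as "['First paragraph.', ...]".
    row['paragraphs'] = literal_eval(row['paragraphs'])
    rows.append(row)

for row in rows:
    print(row['title'], type(row['paragraphs']).__name__, len(row['paragraphs']))
```

Passing the function through `converters` lets pandas apply it to every cell of that column while the file is read, so no separate loop is needed.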


### Convert the PDF files into a DataFrame for cdQA pipeline

In [13]:
df = pdf_converter(directory_path='./data/pdf/')
df.head()

AttributeError: ignored
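The traceback above is truncated, so the cause of the error cannot be read from this output. One precondition worth checking before the conversion is that `./data/pdf/` exists and actually contains PDF files. The sketch below does that check. The helper names `list_pdfs` and `convert_if_possible` are hypothetical and not part of cdQA, and the check is not claimed to be the fix for this particular error.

```python
from pathlib import Path


def list_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by name."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == '.pdf'
    )


def convert_if_possible(directory):
    """Call cdQA's pdf_converter only when the directory holds at least one PDF."""
    pdfs = list_pdfs(directory)
    if not pdfs:
        print('No PDF files found in {}; skipping conversion.'.format(directory))
        return None
    # Imported lazily so the check above also runs where cdQA is not installed.
    from cdqa.utils.converters import pdf_converter
    return pdf_converter(directory_path=directory)
```

With this in place, `df = convert_if_possible('./data/pdf/')` returns `None` instead of raising when the directory is missing or empty.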

In [8]:
cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)

# Fit Retriever to documents
cdqa_pipeline.fit_retriever(df=df)

100%|██████████| 231508/231508 [00:00<00:00, 248698.18B/s]


QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po..._size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=1.0, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   t
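The pipeline output above shows a `BM25Retriever` with `k1=2.0` and `b=0.75`. To give a feel for what these two parameters control, here is a minimal sketch of a textbook Okapi BM25 scorer. It is not cdQA's implementation: it uses whitespace tokenisation and single words only, applies no stop-word or `min_df` filtering, and the IDF smoothing is one common variant chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=2.0, b=0.75):
    """Score each document against the query with a textbook Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 caps the gain from repeated terms; b scales the length penalty.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents for illustration only.
docs = [
    'the excellence program started in january',
    'the bank has an it budget',
    'online recruitment is the norm',
]
scores = bm25_scores('excellence program', docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

A larger `k1` lets repeated query terms keep adding to the score for longer, and `b` closer to 1 penalises documents that are longer than average more strongly.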

### Execute a query

In [9]:
query = input("")
prediction = cdqa_pipeline.predict(query)

Since when does the Excellence Program of BNP Paribas exist


### Explore predictions

In [10]:
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))

query: Since when does the Excellence Program of BNP Paribas exist
answer: January 2016
title: BNP Paribas’ commitment to universities and schools
paragraph: Since January 2016, BNP Paribas has offered an Excellence Program targeting new Master’s level graduates (BAC+5) who show high potential. The aid program lasts 18 months and comprises three assignments of six months each. It serves as a strong career accelerator that enables participants to access high-level management positions at a faster rate. The program allows participants to discover the BNP Paribas Group and its various entities in France and abroad, build an internal and external network by working on different assignments and receive personalized assistance from a mentor and coaching firm at every step along the way.
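The cell above reads the prediction by position. A small helper can give the fields names, which is handy when collecting answers for several queries. This is a sketch that assumes the first three elements are the answer, title and paragraph, as the indexing in the cell above suggests. `format_prediction` is a hypothetical name, and the tuple below is a stand-in copied from the output rather than a live call to the pipeline.

```python
def format_prediction(query, prediction):
    """Map the leading (answer, title, paragraph) elements to a labelled dict."""
    # Slice to three elements so that any extra element would be ignored.
    answer, title, paragraph = prediction[:3]
    return {'query': query, 'answer': answer, 'title': title, 'paragraph': paragraph}


# Stand-in values copied from the output above (paragraph shortened).
example = (
    'January 2016',
    'BNP Paribas’ commitment to universities and schools',
    'Since January 2016, BNP Paribas has offered an Excellence Program ...',
)
result = format_prediction(
    'Since when does the Excellence Program of BNP Paribas exist', example
)
for key, value in result.items():
    print('{}: {}'.format(key, value))
```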
