<a href="https://colab.research.google.com/github/PrakharU08/QK-internship/blob/main/CDQA_from_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook [2]: Using the PDF converter



This notebook shows how to use the PDF converter to build the input DataFrame for the cdQA pipeline from a directory of PDF files.


***Note:*** *Running this notebook requires access to a GPU. On Colab, first install `cdQA` by running `!pip install cdqa` in a cell.*

In [None]:
!pip install cdqa
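
Since the note above asks for a GPU, a quick check before going further can save time. The following is a minimal sketch that assumes an NVIDIA GPU exposed through the `nvidia-smi` tool, as on Colab GPU runtimes; the helper name `gpu_visible` is made up for this example and is not part of cdQA.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    # Assumption: an NVIDIA GPU with its driver tools installed.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per detected GPU
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False` on Colab, the runtime type is probably set to CPU and can be changed in the runtime settings.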

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os
import pandas as pd
from ast import literal_eval

from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model



### Download the pre-trained reader model and switch to the PDF directory

In [3]:
# Download model
download_model(model='bert-squad_1.1', dir='./models')


Downloading trained model...


In [4]:
%cd /content/drive/MyDrive/QK_PROJECT

/content/drive/MyDrive/QK_PROJECT
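
Before converting, it can help to confirm that the directory really contains PDF files. The sketch below uses only the standard library; `list_pdfs` is a hypothetical helper, and it looks only at files directly inside the directory because this notebook does not say whether the converter descends into subfolders.

```python
from pathlib import Path


def list_pdfs(directory: str) -> list[str]:
    """Return the sorted names of the PDF files directly inside `directory`."""
    folder = Path(directory)
    if not folder.is_dir():
        raise NotADirectoryError(f"{directory!r} is not a directory")
    # Compare the extension case-insensitively so 'REPORT.PDF' is also found.
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

In this notebook the call would be `list_pdfs('/content/drive/MyDrive/QK_PROJECT')`; an empty list suggests that the Drive mount or the path needs checking.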


### Convert the PDF files into a DataFrame for cdQA pipeline

In [5]:
df = pdf_converter(directory_path='/content/drive/MyDrive/QK_PROJECT')
df.head()

2020-11-24 09:17:23,801 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to /tmp/tika-server.jar.
2020-11-24 09:17:25,149 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to /tmp/tika-server.jar.md5.
2020-11-24 09:17:25,581 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


| | title | paragraphs |
|---|---|---|
| 0 | infosec_policy | [Information Security:  Everyone is responsib... |
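
The output suggests one row per PDF, with `title` holding the file name and `paragraphs` holding a list of strings. A minimal sketch for checking how much text was extracted per document follows; it works on plain dictionaries (for example from `df.to_dict('records')`), and both `paragraph_stats` and the toy rows are invented for illustration.

```python
def paragraph_stats(records):
    """Map each document title to (paragraph count, total characters)."""
    stats = {}
    for row in records:
        # Ignore empty or whitespace-only paragraphs left over from extraction.
        paragraphs = [p for p in row["paragraphs"] if p and p.strip()]
        stats[row["title"]] = (len(paragraphs), sum(len(p) for p in paragraphs))
    return stats


# Toy rows in the same layout as the converter output (contents are made up).
sample = [
    {"title": "doc_a", "paragraphs": ["First paragraph.", "  ", "Second one."]},
    {"title": "doc_b", "paragraphs": []},
]
print(paragraph_stats(sample))
```

A document with zero paragraphs may indicate a PDF from which no text could be extracted, such as a scanned file.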


### Instantiate the cdQA pipeline from a pre-trained reader model

In [6]:
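# The reader path is relative, so it is resolved against the current working
# directory (changed by the %cd cell above); bert_qa.joblib must exist there.
cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)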

# Fit Retriever to documents
cdqa_pipeline.fit_retriever(df=df)

100%|██████████| 231508/231508 [00:00<00:00, 2615542.52B/s]


QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po..._size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=1.0, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   t
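
The repr above shows a `BM25Retriever` with `k1=2.0` and `b=0.75`. To give a feel for what those two parameters control, here is a textbook Okapi BM25 scorer in plain Python. It is a sketch, not cdQA's implementation: it splits on whitespace only and skips the n-grams, stop words and `min_df`/`max_df` filtering listed in the repr, and the function name and sample documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=2.0, b=0.75):
    """Score each document against the query with textbook Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF that stays non-negative even for very common terms.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Length normalisation: b=0 disables it, b=1 applies it fully.
            norm = k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + norm)
        scores.append(score)
    return scores


docs = [
    "ISO 27001 helps achieve information security",
    "The cafeteria opens at noon",
    "Security audits are performed regularly",
]
scores = bm25_scores("information security", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

A larger `k1` lets repeated terms keep adding to the score for longer before saturating, while `b` sets how strongly long paragraphs are penalised.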

### Execute a query

In [7]:
query = 'Goals of security?'
prediction = cdqa_pipeline.predict(query)
prediction

('to achieve Information & physical security and business continuity',
 'infosec_policy',
 'ISO 27001: \uf0d8 ISO 27001 is internationally accepted, certifiable, Information Security Management Standard. ISMS stand for Information Security Management System. \uf0d8  ISO 27001 ensures that all the possible threats to a business are accessed and managed by enforcing various security processes and to perform audits that these process are being performed on a required basis. It helps to achieve Information & physical security and business continuity. \uf0d8  ISO 27001 has 7 clauses and 14 domains.   ',
 8.135943793742195)
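
The prediction comes back as a plain 4-tuple. Judging from the output, the elements are the answer, the document title, the source paragraph and a numeric value that is presumably the retriever/reader score; that last reading is an assumption. A small sketch that gives the fields names, using a hypothetical `Prediction` type that is not part of cdQA:

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    answer: str
    title: str
    paragraph: str
    score: float  # assumed meaning of the fourth element


raw = (
    "to achieve Information & physical security and business continuity",
    "infosec_policy",
    "ISO 27001: ...",  # paragraph text shortened for this example
    8.135943793742195,
)
named = Prediction(*raw)
print(named.answer)
print(named.title, named.score)
```

Because a `NamedTuple` is still a tuple, index access such as `named[0]` keeps working alongside `named.answer`.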

### Explore predictions

In [8]:
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))

query: Goals of security?
answer: to achieve Information & physical security and business continuity
title: infosec_policy
paragraph: ISO 27001:  ISO 27001 is internationally accepted, certifiable, Information Security Management Standard. ISMS stand for Information Security Management System.   ISO 27001 ensures that all the possible threats to a business are accessed and managed by enforcing various security processes and to perform audits that these process are being performed on a required basis. It helps to achieve Information & physical security and business continuity.   ISO 27001 has 7 clauses and 14 domains.   
