<a href="https://colab.research.google.com/github/Sapna714/ESC-50/blob/master/SearchEngine_CodeChallenge_Sapna_Sinha.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Document Retrieval

* Information retrieval using semantic search
* A user supplies a query; the model returns the most relevant answer from within the documents

Given a set of documents, the task is to review semantic search as a way of retrieving answers from them.

Reference: [Amazon Kendra](https://aws.amazon.com/kendra/)

In [None]:
!pip install tika

In [None]:
from tika import parser # pip install tika

raw = parser.from_file('/content/RheovisFRCAnimalTesting.pdf')
print(raw['content'])

In [None]:
type(raw)

dict

In [None]:
#Installing required packages
#Ref: https://textract.readthedocs.io/en/stable/installation.html
!apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
!pip install textract

#Ref: https://pypi.org/project/sentence-transformers/
!pip install sentence-transformers

In [None]:
#Importing required libraries
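import os
import re

import nltk
import scipy.spatial.distance  # makes scipy.spatial.distance.cdist available
import textract  # used below to extract text from the PDFs
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

nltk.download('punkt')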

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
#Reading all the PDF files

files_path = ['/content/RheovisFRC/RheovisFRCAnimalTesting.pdf',
              '/content/RheovisFRC/RheovisFRCNanoStatement.pdf',
              '/content/RheovisFRC/RheovisFRCProductSpecification.pdf',
              '/content/RheovisFRC/RheovisFRCSVHC.pdf',
              '/content/RheovisFRC/RheovisFRCTechnicalInformation.pdf',
              '/content/SokalanCP5/SokalanCP5AnimalTesting.pdf',
              '/content/SokalanCP5/SokalanCP5FoodEU.pdf',
              '/content/SokalanCP5/SokalanCP5NanoStatementFR.pdf',
              '/content/SokalanCP5/SokalanCP5ProductSpecification.pdf',
              '/content/SokalanCP5/SokalanCP5REACHpolymers.pdf',
              '/content/SokalanCP5/SokalanCP5SVHC.pdf',
              '/content/SokalanCP5/SokalanCP5TechnicalInformation.pdf',
              '/content/TexaponN70/TexaponN70SVHC.pdf',
              '/content/TexaponN70/TexaponN70TechnicalInformation.pdf',
              ]

# Keep only the .pdf files
files_path = [pdf for pdf in files_path if pdf.endswith('.pdf')]

# Extract the text of each PDF with OCR (Tesseract); textract returns bytes
pdfs = []
for file in files_path:
    text = textract.process(file,
                            method='tesseract',
                            language='eng')

    pdfs.append(text)

In [None]:
pdfs

[b'Ol- BASF\n\nWe create chemistry\n\nStatement\n\nAnimal Testing\n\n \n\ni Valid since 01/2020\nRheovis\xc2\xae FRC Revision 3.0\nPRD 30478270 WE-no.: 6808\nPage 1 of 1\n\n \n\n@\xc2\xae = Registered trademark of BASF in many countries \xe2\x80\x94_\xe2\x84\xa2 = Trademark of BASF Care Chemicals\n\nDue to legal requirements and regulations of the European Union, BASF SE is obliged to perform animal\nstudies on chemical substances. Within this document animals are regarded as vertebrates. The objective of\nthese studies is to minimize the risk to humans, animals and the environment. Animal studies are centrally\nmonitored by BASF SE to guarantee that the studies commissioned by BASF worldwide are carried out in\naccordance with the same ethical aspects and animal welfare considerations as those performed within BASF.\n\nBASF\xe2\x80\x99s company policy is to eliminate all unnecessary animal testing and to support the development of\nalternative test methods that involve in vitro testin

In [None]:
#Text preprocessing
#raw['content'] is already a single string holding the extracted text
str1 = raw['content']

#Converting to lower case
str1 = str1.lower()

#Replacing non-word characters and underscores with spaces, line by line
final = [re.sub(r"\W+|_", ' ', k) for k in str1.split("\n")]

#Removing empty strings
sentences = list(filter(None, final))

In [None]:
sentences

['statement',
 'animal testing',
 'this document and any information provided herein is for your guidance only all information is given in good faith and is based on sources ',
 'believed to be reliable and accurate at the date of publication of this document this document shall be valid until superseded by a later ',
 'version basf makes no warranty of any kind either express or implied by fact or law including ',
 'warranties of merchantability or fitness for a particular purpose',
 'this is a computer generated document it is valid without signature',
 'due to legal requirements and regulations of the european union basf se is obliged to perform animal',
 'studies on chemical substances within this document animals are regarded as vertebrates the objective of',
 'these studies is to minimize the risk to humans animals and the environment animal studies are centrally',
 'monitored by basf se to guarantee that the studies commissioned by basf worldwide are carried out in',
 'accordanc

In [None]:
#Reference :https://www.sbert.net/, https://github.com/evergreenllc2020/ , https://www.aclweb.org/anthology/D19-1410.pdf
model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')
sentence_embeddings = model.encode(sentences)



In [None]:
#Forming test input queries
query = "Did they perform animal testing? on the product?"
#query = "When was latest animal test conducted for sokalan?"
#query = "What does data say about animals?"
#query = "What is the appearance of texapon?"
#query = " what is the shelf life of sokalan?"
#query = "what is the physical form of rheovis?"

In [None]:
queries = [query]
query_embeddings = model.encode(queries)

#Number of corpus sentences to return for each query, ranked by cosine similarity
number_top_matches = 5

print("Semantic Search Results")

for query, query_embedding in zip(queries, query_embeddings):
    #Cosine distance between the query and every corpus sentence (smaller = more similar)
    distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings, "cosine")[0]

    #Pair each sentence index with its distance and sort, closest first
    results = sorted(enumerate(distances), key=lambda x: x[1])

    print("Query:", query)
    print("\nTop %d most similar sentences in corpus:" % number_top_matches)

    #Skip repeated sentences: identical text yields an identical distance
    values = set()
    for idx, distance in results[0:number_top_matches]:
        if distance not in values:
            #Cosine score (similarity) is 1 minus the cosine distance
            print(sentences[idx].strip(), "(Cosine Score: %.4f)" % (1-distance))
            values.add(distance)

Semantic Search Results
Query: Did they perform animal testing? on the product?

Top 5 most similar sentences in corpus:
animal testing (Cosine Score: 0.7900)
test methods again requires the use of animals (Cosine Score: 0.7413)
alternative test methods that involve in vitro testing or that require the minimum possible number of animals (Cosine Score: 0.6750)
the product has not been tested on animals by basf the assessment has been derived from the components (Cosine Score: 0.6233)
these studies is to minimize the risk to humans animals and the environment animal studies are centrally (Cosine Score: 0.5693)
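
The ranking step in the cell above can be reduced to a few lines. The following is a minimal sketch in plain Python, assuming tiny hand-written 3-dimensional vectors in place of real Sentence-BERT embeddings; `cosine_similarity` and `top_matches` are hypothetical helper names, not part of any library. Unlike the notebook cell, it deduplicates on the sentence text rather than on the floating-point distance, and it keeps scanning until `k` unique sentences have been found.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, sentences, sentence_vecs, k=5):
    """Return up to k (sentence, score) pairs, best first, skipping repeated text."""
    scores = (cosine_similarity(query_vec, vec) for vec in sentence_vecs)
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    seen, matches = set(), []
    for sentence, score in ranked:
        text = sentence.strip()
        if text in seen:
            continue  # same sentence already reported
        seen.add(text)
        matches.append((text, score))
        if len(matches) == k:
            break
    return matches

# Toy data, assumed for illustration only
toy_sentences = ["animal testing", "shelf life", "animal testing", "physical form"]
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
toy_query = [1.0, 0.0, 0.0]

for text, score in top_matches(toy_query, toy_sentences, toy_vectors, k=3):
    print(text, "(Cosine Score: %.4f)" % score)
```

Deduplicating on text avoids relying on exact equality of floats, and scanning past duplicates means a query still gets `k` results when the corpus repeats a line such as a page header.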


# Results

Semantic Search Results

======================


Query:  what is the shelf life of sokalan?

Top 5 most similar sentences in corpus:

shelf life sokalan xc2 xae cp 5 has a shelf life of at least 24 months in its original packaging (Cosine Score: 0.5398)

shelf life (Cosine Score: 0.4536)

for sokalan xc2 xae cp 5 (Cosine Score: 0.4448)

sokalan xc2 xae cp 5 (Cosine Score: 0.4334)

scenario provided in the annex of the sds (Cosine Score: 0.4142)

======================

Query: What does data say about animals?

Top 5 most similar sentences in corpus:

animal testing (Cosine Score: 0.7604)

test methods again requires the use of animals (Cosine Score: 0.7446)

studies on chemical substances within this document animals are regarded as vertebrates the objective of (Cosine Score: 0.7113)

alternative test methods that involve in vitro testing or that require the minimum possible number of animals (Cosine Score: 0.5833)

due to legal requirements and regulations of the european union basf se is obliged to perform animal (Cosine Score: 0.5065)

======================


Query: what is the physical form of rheovis?

Top 5 most similar sentences in corpus:

physical form aque (Cosine Score: 0.5074)

physical form white (Cosine Score: 0.4334)

physical form dispersion (Cosine Score: 0.4303)

sodium laureth sulfate 68 73 (Cosine Score: 0.4196)

if samples of rheovis xc2 xae frc are required for analytical testing they must be (Cosine Score: 0.4077)



======================


Query: When was latest animal test conducted for sokalan?

Top 5 most similar sentences in corpus:

latest tests on animals for this product were conducted by basf in 1985 (Cosine Score: 0.6070)

animal testing (Cosine Score: 0.4989)

test methods again requires the use of animals (Cosine Score: 0.4711)


**Scope for enhancement**

*In view of the use case*


*   Streamline document structure and parsing
*   Make text preprocessing scalable
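
Both points could start from a single helper that turns the extracted text of every file into one searchable corpus. The following is a minimal sketch, assuming the extracted text arrives as UTF-8 `bytes` (as in the `pdfs` list shown earlier); `clean_lines` and `build_corpus` are hypothetical names, and the sample bytes are made up for illustration. Decoding the bytes before cleaning, rather than converting them with `str()`, means a character such as ® is dropped cleanly instead of leaving fragments like `xc2 xae`, and recording the file name with each line makes it possible to report which document a match came from.

```python
import re

def clean_lines(raw_bytes):
    """Decode extracted bytes and return lower-cased, cleaned, non-empty lines."""
    text = raw_bytes.decode("utf-8", errors="replace").lower()
    # Collapse runs of non-word characters and underscores into one space
    cleaned = (re.sub(r"[\W_]+", " ", line).strip() for line in text.split("\n"))
    return [line for line in cleaned if line]

def build_corpus(documents):
    """Map {file name: extracted bytes} to a list of (file name, line) pairs."""
    corpus = []
    for name, raw_bytes in documents.items():
        corpus.extend((name, line) for line in clean_lines(raw_bytes))
    return corpus

# Made-up sample input standing in for real extraction output
sample = {
    "a.pdf": b"Rheovis\xc2\xae FRC\n\nAnimal Testing\n",
    "b.pdf": b"Shelf life: 24 months\n \n",
}

for name, line in build_corpus(sample):
    print(name, "->", line)
```

The list of lines can then be passed to `model.encode`, while the file names stay aligned with the embeddings by index.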

*In view of the techniques used*

* Sentence-BERT was used for this task (Ref: https://www.aclweb.org/anthology/D19-1410.pdf, https://www.sbert.net/docs/quickstart.html)
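
The model name used above ends in `mean-tokens`, which refers to the pooling step described in the Sentence-BERT paper: the sentence embedding is the average of the token embeddings. The sketch below illustrates that idea in plain Python with made-up 2-dimensional token vectors and a padding mask; `mean_pool` is a hypothetical helper, not the library's implementation, and real models work on tensors with hundreds of dimensions.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by one padding position (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Because every sentence is reduced to one fixed-size vector this way, sentences of different lengths can be compared directly with cosine similarity.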





All the works referenced have been cited. The implementation on the given dataset and the conclusions drawn belong to the owner of the notebook. 