# Material Science on Documents - Quick start

## Getting started

The [Deep Search Toolkit](https://ds4sd.github.io/deepsearch-toolkit/) converts documents in a few lines of code. For more information or a step-by-step guide:
- Visit https://ds4sd.github.io/deepsearch-toolkit/guide/convert_doc/
- Follow this example notebook

### Set notebook parameters

In [1]:
from dsnotebooks.settings import ProjectNotebookSettings

# notebook settings auto-loaded from .env / env vars
notebook_settings = ProjectNotebookSettings()

PROFILE_NAME = notebook_settings.profile  # the profile to use
PROJ_KEY = notebook_settings.proj_key     # the project to use

# default project_key = 1234567890abcdefghijklmnopqrstvwyz123456

Project key:  1234567890abcdefghijklmnopqrstvwyz123456


### Import example dependencies

In [2]:
import os
import json

import pandas as pd

import deepsearch as ds

from pathlib import Path
from zipfile import ZipFile

from deepsearch.documents.core.export import export_to_markdown
from IPython.display import display, Markdown, HTML, display_html

### Connect to Deep Search

In [3]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

## Convert Document

In [4]:
output_dir = Path("./converted_docs")

fname = "20140197356.pdf"

documents = ds.convert_documents(
    api=api,
    proj_key=PROJ_KEY,
    source_path=f"../../data/samples/{fname}",
    progress_bar=True
)
documents.download_all(result_dir=output_dir)
info = documents.generate_report(result_dir=output_dir)
print(info)

Processing input:     : 100%|██████████████████████████████| 1/1 [00:00<00:00, 117.21it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:02<00:00,  2.67s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:14<00:00, 14.27s/it]


{'Total documents': 1, 'Successfully converted documents': 1}


In [5]:
# Iterate over the output files and visualize the output
for output_file in output_dir.rglob("json*.zip"):
    with ZipFile(output_file) as archive:
        all_files = archive.namelist()
        for name in all_files:
            if not name.endswith(".json"):
                continue
            
            #basename = name.rstrip('.json')
            doc_json = json.loads(archive.read(name))
            
            ofile = output_dir / name
            print(f"writing {ofile}")
            with ofile.open("w") as fw:
                fw.write(json.dumps(doc_json, indent=2))
                
            doc_md = export_to_markdown(doc_json)

            ofile = output_dir / name.replace(".json", ".md")
            print(f"writing {ofile}")
            with ofile.open("w") as fw:
                fw.write(doc_md)

            

writing converted_docs/20140197356.json
writing converted_docs/20140197356.md
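
The cell above derives the Markdown filename with `name.replace(".json", ".md")`, and its commented-out line uses `rstrip('.json')`. Both can misbehave on unusual names. The following is a minimal sketch of a more robust alternative that relies only on `pathlib`; the helper name `markdown_name` is hypothetical and not part of the toolkit.

```python
from pathlib import PurePosixPath


def markdown_name(json_name: str) -> str:
    """Return the archive member name with its final suffix swapped to .md."""
    # with_suffix only touches the last extension, unlike str.replace,
    # which would rewrite every ".json" occurrence in the name.
    # Zip member names use forward slashes, hence PurePosixPath.
    return str(PurePosixPath(json_name).with_suffix(".md"))


# str.rstrip strips a *set of characters*, not a suffix, so it can also
# eat trailing letters of the stem.
print("season.json".rstrip(".json"))      # prints "sea"
print(markdown_name("season.json"))       # prints "season.md"
print(markdown_name("20140197356.json"))  # prints "20140197356.md"
```

With such a helper, the second output path in the loop could be written as `output_dir / markdown_name(name)`.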


In [49]:
# display last document
# display(Markdown(doc_md))

## Find materials in a local PDF Document

### Load models

In [4]:
from deepsearch_glm.utils.load_pretrained_models import load_pretrained_nlp_models

from deepsearch_glm.nlp_utils import (
    extract_references_from_doc,
    init_nlp_model,
    list_nlp_model_configs,
)

from deepsearch_glm.glm_utils import (
    create_glm_config_from_docs,
    create_glm_dir,
    create_glm_from_docs,
    expand_terms,
    load_glm,
    read_edges_in_dataframe,
    read_nodes_in_dataframe,
    show_query_result,
)

from tabulate import tabulate

models = load_pretrained_nlp_models(verbose=True, force=False)

downloading part-of-speech ... done!
downloading reference ... done!
downloading material ... done!
downloading language ... done!
downloading name ... done!
downloading semantic ... done!
downloading geoloc ... done!


### Run the model

In [5]:
ifile = "./converted_docs/20140197356.json"

with open(ifile) as fr:
    doc = json.load(fr)

model = init_nlp_model("language;term;material")
model.set_loglevel("INFO")

res = model.apply_on_doc(doc)

insts = pd.DataFrame(res["instances"]["data"], columns=res["instances"]["headers"])


In [6]:
#print(insts.columns)

materials = insts[insts["type"]=="material"][["type", "subtype", "name", "subj_path"]]
print(materials.to_string())

          type           subtype                                      name   subj_path
152   material   simple_chemical                silicon nitride-containing  #/texts/17
161   material   simple_chemical                                     abra›  #/texts/17
168   material   simple_chemical                               alkyne-diol  #/texts/17
170   material   simple_chemical                    alkyne dial ethoxylate  #/texts/17
275   material   simple_chemical                                      sec›  #/texts/23
304   material   simple_chemical                                     posi›  #/texts/24
320   material   simple_chemical                        attached substrate  #/texts/24
325   material   simple_chemical                               sub› strate  #/texts/24
340   material   simple_chemical                                      pol›  #/texts/24
356   material   simple_chemical                         silicon diox› ide  #/texts/24
361   material   simple_chemical           
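
`res["instances"]` holds a `headers` list and a `data` list of rows, which the cell above loads into a DataFrame. When only a count per material name is needed, the same filter can be sketched in plain Python. The rows below are made up for illustration and only loosely modeled on the output above; the column order is an assumption, and the real result carries more columns.

```python
from collections import Counter

# Illustrative stand-in for res["instances"]; rows and column order are
# assumptions, not actual model output.
instances = {
    "headers": ["type", "subtype", "name", "subj_path"],
    "data": [
        ["material", "simple_chemical", "alkyne-diol", "#/texts/17"],
        ["term", "single-term", "polishing pad", "#/texts/17"],
        ["material", "simple_chemical", "attached substrate", "#/texts/24"],
        ["material", "simple_chemical", "alkyne-diol", "#/texts/24"],
    ],
}


def material_rows(instances):
    """Return the instance rows of type 'material' as dictionaries."""
    headers = instances["headers"]
    rows = (dict(zip(headers, row)) for row in instances["data"])
    return [r for r in rows if r["type"] == "material"]


materials = material_rows(instances)
counts = Counter(r["name"] for r in materials)
for name, n in counts.most_common():
    print(f"{n:3d}  {name}")
```

Looking up columns through `headers` rather than by fixed position keeps the sketch independent of the exact column layout.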

In [7]:
print(res["texts"][59]["text"])

[0030] The compositions and methods of the invention pro› vide useful silicon nitride removal rates over a wide range of pH, abrasive concentration, and surfactant concentration, while unexpectedly suppressing polysilicon removal. In some particularly preferred embodiments, the silicon nitride removal rate is about 250 Angstroms per minute (A/min) or greater when polishing a silicon nitride blanket wafer with a Epicfi Dl00 polishing pad (Cabot Microelectronics Corpo› ration, Aurora, Ill.) on as table-top CMP polisher at a down force of about 2 pounds per square inch (psi), a platen speed of about 115 revolutions per minute (rpm), a carrier speed of about 60 rpm, and a polishing slurry flow rate of about 125 milliliters per minute (mL/min), in the presence of hydrogen peroxide (about 1 wt %). Surprisingly, the polysilicon removal rate obtained by polishing a polysilicon wafer under the same conditions generally is not more than about 80% of the silicon nitride removal rate, often not mo

## Extracting materials from Document collections

In [8]:
from numerize.numerize import numerize

# Fetch list of all data collections
collections = api.elastic.list()
collections.sort(key=lambda c: c.name.lower())

# Visualize summary table
results = [
    {
        "Name": c.name,
        "Type": c.metadata.type,
        "Num entries": numerize(c.documents),
        "Date": c.metadata.created.strftime("%Y-%m-%d"),
        "Coords": f"{c.source.elastic_id}/{c.source.index_key}",
    }
    for c in collections
]
display(pd.DataFrame(results[0:10]))

Unnamed: 0,Name,Type,Num entries,Date,Coords
0,AAAI,Document,16.02K,2023-08-29,default/aaai
1,ACL Anthology,Document,55.28K,2023-08-22,default/acl
2,Annual Reports,Document,107.38K,2024-01-12,default/annual-report
3,arXiv abstracts,Document,2.37M,2023-12-07,default/arxiv-abstract
4,arXiv category taxonomy,Record,155,2023-12-05,default/arxiv-category
5,arXiv full documents,Document,2.29M,2023-10-29,default/arxiv
6,BioRxiv,Document,357.76K,2023-11-09,default/biorxiv
7,Brenda,Record,7.12K,2023-01-03,default/brenda
8,ChEMBL,Record,2.11M,2023-01-03,default/chembl
9,ChemRxiv,Document,8.82K,2023-11-23,default/chemrxiv


In [9]:
from tqdm import tqdm
from copy import deepcopy

from deepsearch.cps.client.components.elastic import ElasticDataCollectionSource
from deepsearch.cps.queries import DataQuery


# Input query
search_query = "\"SUBSTITUTED 6-PHENYLNICOTINIC ACIDS AND THEIR USE\""
data_collection = ElasticDataCollectionSource(elastic_id="default", index_key="patent-uspto")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query, # The search query to be executed
    #source=["description.title", "description.authors", "identifiers"], # Which fields of documents we want to fetch
    limit=page_size, # The size of each request page
    coordinates=data_collection # The data collection to be queried
)


# [Optional] Compute the total number of matched results. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (expected_total + page_size - 1) // page_size # this is simply a ceiling formula

print("#-found documents: ", count_results)

# Iterate through all results by fetching `page_size` results at a time
documents = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    # Iterate through the results of a single page, and add to the total list
    for row in result_page.outputs["data_outputs"]:
        documents.append(row["_source"])

print(f'Finished fetching all data. Total is {len(documents)} records.')

#-found documents:  RunQueryResult(outputs={'data_outputs': [], 'data_count': 2, 'data_aggs': {'deepsearch_total_size': {'value': 713388.0}}}, next_pages={}, timings=RunQueryResult.QueryTimings(overall=0.171320807072334, tasks={'0_ElasticQuery': RunQueryResult.QueryTimings.TaskTimings(overall=0.16991104697808623, details={})}))


3it [00:01,  2.26it/s]

Finished fetching all data. Total is 2 records.
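
The `expected_pages` value in the cell above comes from an integer ceiling division. A small sketch of that formula, checked against `math.ceil`, is shown here; the function wrapper and the sample totals are illustrative only.

```python
import math


def expected_pages(total: int, page_size: int) -> int:
    """Number of pages needed to hold `total` results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Adding page_size - 1 before floor division rounds up without floats.
    return (total + page_size - 1) // page_size


for total in (0, 2, 50, 51, 713):
    pages = expected_pages(total, page_size=50)
    assert pages == math.ceil(total / 50)
    print(f"{total:4d} results -> {pages} page(s)")
```

Staying in integer arithmetic avoids rounding surprises for very large result counts.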





In [18]:
import textwrap

# Create a TextWrapper object
wrapper = textwrap.TextWrapper(width=100)  # Set the desired width

model = init_nlp_model("language;term;material")
model.set_loglevel("INFO")

for doc in documents:

    for item in doc["main-text"]:

        if "text" not in item:
            continue
    
        res = model.apply_on_text(item["text"])

        insts = pd.DataFrame(res["instances"]["data"], columns=res["instances"]["headers"])

        materials = insts[insts["type"]=="material"][["type", "subtype", "name", "subj_path"]]

        if len(materials)>0:
            lines = wrapper.wrap(item["text"])
            print("\n---------------------\n")
            print("\n".join(lines))
            print(materials.to_string())


---------------------

The present application relates to novel substituted 6-phenylnicotinic acid derivatives, to
processes for their preparation, to their use for the treatment and/or prophylaxis of diseases and
to their use for preparing medicaments for the treatment and/or prophylaxis of diseases, preferably
for the treatment and/or prophylaxis of cardiovascular disorders, in particular dyslipidaemias,
arteriosclerosis and heart failure.
       type           subtype                    name subj_path
2  material  complex_chemical  6-phenylnicotinic acid         #

---------------------

The present invention relates to novel substituted 6-phenylnicotinic acid derivatives, to processes
for their preparation, to their use for the treatment and/or prophylaxis of diseases and to their
use for preparing medicaments for the treatment and/or prophylaxis of diseases, preferably for the
treatment and/or prophylaxis of cardiovascular diseases, in particular dyslipidaemias,
arteriosclerosis 

2024-03-22 16:20:06.430 ( 610.121s) [          25B517]          crf_model.cpp:2096   ERR| sequence is too long: 1000 > 1212
2024-03-22 16:20:06.432 ( 610.122s) [          25B517]       base_crf_model.h:142   WARN| encountered tokens-array exceeding max-len of 1000



---------------------

5.00 g (14.4 mmol) of Example 2A were initially charged in 100 ml abs. DMF. 57.5 ml (28.8 mmol) of
isobutylzinc bromide as a 0.5M solution in TI-IF were then quickly added dropwise, and 0.831 g
(0.719 mmol) of tetrakis(triphenylphosphine)palladium(0) were added. After the start of the reaction
(slightly exothermal reaction), the mixture was stirred at room temperature for another two hours
and then taken up in water and ethyl acetate. The reaction mixture was filtered through Celite. The
organic phase was separated off and washed with water and then with saturated aqueous sodium
chloride solution. After drying with magnesium sulfate, the solvent was removed by distillation
under reduced pressure. The residue was purified by column chromatography (silica gel:
cyclohexane−>cyclohexane/ethyl acetate=10/1). The slightly contaminated product fractions obtained
in this manner were combined and, after removal of the volatile components on a rotary evaporator,
purified 

2024-03-22 16:20:13.315 ( 617.006s) [          25B517]          crf_model.cpp:2096   ERR| sequence is too long: 1000 > 1212
2024-03-22 16:20:13.319 ( 617.010s) [          25B517]       base_crf_model.h:142   WARN| encountered tokens-array exceeding max-len of 1000



---------------------

Suitable for oral administration are administration forms which work in accordance with the prior
art and release the compounds according to the invention rapidly and/or in modified form and which
comprise the compounds according to the invention in crystalline and/or amorphicized and/or
dissolved form, such as, for example, tablets (uncoated or coated tablets, for example with enteric
coats or coats which dissolve in a delayed manner or are insoluble and which control the release of
the compounds according to the invention), films/wafers or tablets which dissolve rapidly in the
oral cavity, films/lyophilizates, capsules (for example hard or soft gelatin capsules), sugar-coated
tablets, granules, pellets, powders, emulsions, suspensions, aerosols or solutions.
        type          subtype                   name subj_path
44  material  simple_chemical       gelatin capsules         #
45  material  simple_chemical  sugar-coated tablets,         #

---------------
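
The log lines above show the CRF model rejecting inputs that exceed its 1000-token limit. One possible workaround is to split long paragraphs before calling `model.apply_on_text`. The sketch below uses whitespace-separated words as a rough proxy for tokens, which is an assumption: the model's own tokenizer will likely count differently, so the default limit is set conservatively. The helper `chunk_text` is hypothetical.

```python
def chunk_text(text: str, max_tokens: int = 500):
    """Yield pieces of `text` with at most `max_tokens` whitespace tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    for start in range(0, len(words), max_tokens):
        yield " ".join(words[start:start + max_tokens])


# Synthetic input with as many words as the rejected sequence had tokens.
long_text = "word " * 1212
chunks = list(chunk_text(long_text, max_tokens=500))
print([len(c.split()) for c in chunks])  # prints [500, 500, 212]
```

Each chunk could then be passed to `model.apply_on_text` in turn. Entities that straddle a chunk boundary would be missed, so splitting at sentence boundaries may be preferable where that matters.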