# SuperalloyDigger API Document

SuperalloyDigger covers several scenarios that you can test and run: corpus preprocessing, sentence classification, named entity recognition, relation extraction, table parsing, dependency parsing, and an automated pipeline that chains these steps together. The test corpus used in these demos is available in the GitHub repo. The code below shows how to use SuperalloyDigger in several common situations.

Before running the code, install the dependencies with 'pip install -r requirements.txt'.

## 0. Article download

In [1]:
import os
from Elsevier_articles_archive.main import File_Download

Download articles in XML or TXT format

In [4]:
PATH = os.getcwd()  # current working directory
dois = [
    "10.1016/j.msea.2014.09.074",
]
api_path = os.path.join(PATH,"related_files/apikeys.txt")
arformat = "text/xml"  # "text/xml" for XML files; "text/plain" for plain-text TXT files
corpus_type = "article" # article/abstract
output_path = os.path.join(PATH,"input_xml")
# output_path = os.path.join(PATH,"input_txt")
fd = File_Download(api_path, dois, arformat, corpus_type, output_path)
count = len(dois)
# if the run is interrupted, replace the 0 below with the index (start_id) in dois of the most recently processed DOI
for i in range(0, count):
    doi = fd.run(0,dois,i)
    print(doi)



10.1016/j.msea.2014.09.074
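
The comment in the loop above says that an interrupted run should be restarted from the index of the last DOI processed. A minimal sketch of that restart logic is shown here. `download_from` and `fetch_one` are hypothetical names introduced for illustration; `fetch_one` stands in for a call such as `fd.run(0, dois, i)`.

```python
from typing import Callable, List, Tuple


def download_from(dois: List[str],
                  fetch_one: Callable[[int], str],
                  start_id: int = 0) -> Tuple[List[str], int]:
    """Fetch dois[start_id:] one by one.

    Returns the results fetched in this run and the index to restart from
    (len(dois) when every download succeeded).
    """
    done = []
    for i in range(start_id, len(dois)):
        try:
            done.append(fetch_one(i))
        except Exception as exc:  # e.g. network errors or API quota limits
            print(f"Stopped at index {i} ({dois[i]}): {exc}")
            return done, i
    return done, len(dois)


demo_dois = ["10.1016/j.msea.2014.09.074"]
# Stand-in fetcher that only echoes the DOI instead of calling the API.
fetched, next_id = download_from(demo_dois, lambda i: demo_dois[i])
print(fetched, next_id)
```

If the function returns an index smaller than `len(dois)`, pass that index as `start_id` on the next call to continue where the run stopped.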


Download articles in HTML format

In [5]:
from other_articles_archive.html_download import doi_info, getHtml, saveHtml

In [6]:
PATH = os.getcwd()  # current working directory
# DOIs of the articles to download
dois = [
    "10.1007/s11837-014-1181-y",
]
# adjacent string literals are joined, so no stray whitespace ends up in the header
User_Agent = ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
              'Gecko/20100101 Firefox/23.0')
# the folder in which to store the downloaded HTML files
output_path = os.path.join(PATH, "input_html")
for i in range(len(dois)):
    url_te = doi_info(dois[i])
    if url_te[0] != "other URL":
        html = getHtml(url_te, User_Agent)
        name = dois[i].replace("/", "-")
        saveHtml(os.path.join(output_path, str(name)), html)
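
The loop above derives the file name from the DOI by replacing "/" with "-". The sketch below wraps that rule in a helper and adds a best-effort inverse, which can be handy when mapping saved files back to their DOIs. Both function names are hypothetical, and the inverse assumes that the DOI prefix (such as `10.1007`) contains no hyphen and that the suffix contains no further "/".

```python
def doi_to_filename(doi: str) -> str:
    """Mirror the naming used above: replace every '/' with '-'."""
    return doi.replace("/", "-")


def filename_to_doi(name: str) -> str:
    """Best-effort inverse: turn the first '-' back into '/'.

    Assumption: the DOI prefix has no hyphen and the suffix has no '/'.
    """
    prefix, sep, suffix = name.partition("-")
    if not sep:  # no hyphen at all, nothing to restore
        return name
    return f"{prefix}/{suffix}"


name = doi_to_filename("10.1007/s11837-014-1181-y")
print(name)
print(filename_to_doi(name))
```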

## 1. Corpus preprocessing

The folders 'input_txt', 'input_html', and 'input_xml' each contain a sample corpus.

In [None]:
from text_extractor.T_pre_processor import TPreProcessor
from text_extractor.get_full_text import Filter_text
import os

Here we use the plain-text corpus ('input_txt') as an example. The corpus in the folder 'input_txt' was obtained automatically through the Elsevier Dev API.

Before each run, empty the folder 'output_files' so that results are not written into the same output file twice.
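
Emptying the folder by hand is easy to forget, so it can be scripted. The following is a minimal sketch using only the standard library; `clear_folder` is a hypothetical helper and not part of SuperalloyDigger. It deletes the contents of a folder but keeps the folder itself.

```python
import os
import shutil
import tempfile
from pathlib import Path


def clear_folder(folder: str) -> int:
    """Delete everything inside `folder` but keep the folder itself.

    Creates the folder if it does not exist yet.
    Returns the number of top-level entries removed.
    """
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    removed = 0
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


# Demonstration on a throwaway directory rather than the real output folder.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "old_result.txt").write_text("stale", encoding="utf-8")
print(clear_folder(demo_dir), os.listdir(demo_dir))
```

In the notebook, the call would be `clear_folder("output_files")`, run from the repository root.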

In [9]:
# the path of txt files to input
txt_path = r".\input_txt"
# the path of configuration file
c_path = r".\pipeline\dictionary.ini"
# target property
prop_name = "solvus"
# the output path of the corpus after preprocessing
text_path = r".\output_files"
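
The paths above use the Windows separator. On Linux or macOS a backslash is not a path separator, so the same settings can be built with `os.path.join` instead. This is a sketch under the assumption that the notebook runs from the repository root and that the extractor classes accept absolute paths, which the source does not state.

```python
import os

BASE = os.getcwd()  # assumed to be the repository root

# the folder of txt files to input
txt_path = os.path.join(BASE, "input_txt")
# the configuration file
c_path = os.path.join(BASE, "pipeline", "dictionary.ini")
# target property
prop_name = "solvus"
# the output folder for the preprocessed corpus
text_path = os.path.join(BASE, "output_files")

print(txt_path, c_path, text_path, sep="\n")
```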

In [42]:
FT = Filter_text(txt_path,text_path)
txt_name = FT.process()
out_files = os.listdir(text_path)  # names of the files in text_path

with open(os.path.join(text_path, out_files[0]), 'r', encoding='utf-8') as file:
    data = file.read()
processor = TPreProcessor(data, prop_name, c_path)
filter_txt = processor.processor()
filter_txt

'Keyword Co-Al-W-base alloy Tensile behavior Dislocation structures Stacking fault 1 Introduction Nickel-base superalloys , possessing exceptional mechanical properties due to the well known strengthening of γ′ type γ′ precipitates , are widely used for manufacturing aircraft and power-generation engine turbines . Recently , Co-base superalloys strengthened by γ′ ( γ′ ) with γ′ structure have gained substantial interest . A series of experimental [ 1-7 ] and computational [ 8,9 ] efforts have been done to study the effects of alloying elements on the microstructure and mechanical property of the new Co-base alloys , suggesting that γ′ has some similarities with that of Ni3Al and can be practically used as the strengthening phase of Co-base superalloys . However , in the Co-Al-W-base alloys , large amount of W is added to stabilize γ′ , leading to a high density . The γ′ solvus temperature is relatively lower compared with that of Ni-base superalloys , a big restriction on high temperat

## 2. Sentence classification

Sentence classification identifies the sentences that contain solvus temperature information. Run the following code.

In [43]:
from text_extractor.sentence_positioner import Sentence_Positioner

In [44]:
positioner = Sentence_Positioner(filter_txt,prop_name,c_path)
target_sents = positioner.target_sent()
target_sents

{1: 'The γ′ solvus temperature of Co-9Al-8W-2Ta-2Cr ( at % ) alloy is above 1050°C which is slightly lower than that of Co-9Al-8W-2Ta ( at % ) alloy [ 3 ] , while the γ′ solvus temperature of Co-7.8Al-7.8W-2Ta-4.5Cr ( at % ) alloy is only 960°C [ 4 ] .',
 2: 'By comparing Co-9Al-9W ( at % ) alloy and Co-7.3Al-6.8W ( at % ) alloy , the γ′ solvus temperature is decreased from 985°C to 854°C , indicating a lower W content depresses the stability of γ′ .'}
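
To give a feel for what this step produces, here is a simplified stand-in. It is not how `Sentence_Positioner` works internally (the source does not describe that). It keeps the sentences that mention the property keyword and also contain a temperature value, numbered from 1 as in the output above. The function name `select_sentences` and both regular expressions are assumptions made for this sketch.

```python
import re
from typing import Dict

# A number followed by a temperature unit, e.g. "1050°C" or "1323 K".
VALUE_RE = re.compile(r"\d+(?:\.\d+)?\s*(?:°C|K\b)")
# The preprocessed text is tokenized, so a sentence ends with " ." and
# decimal points such as "7.8" are never surrounded by spaces.
SENTENCE_END_RE = re.compile(r"\s\.(?:\s|$)")


def select_sentences(text: str, prop_name: str) -> Dict[int, str]:
    """Return {1: sentence, 2: sentence, ...} for sentences that mention
    `prop_name` and contain at least one temperature value."""
    selected = {}
    for piece in SENTENCE_END_RE.split(text):
        sentence = piece.strip()
        if not sentence:
            continue
        if prop_name.lower() in sentence.lower() and VALUE_RE.search(sentence):
            selected[len(selected) + 1] = sentence + " ."
    return selected


demo = ("Co-base superalloys have gained substantial interest . "
        "The γ′ solvus temperature of Co-9Al-9W alloy is 985°C . "
        "The γ′ solvus temperature is relatively low .")
print(select_sentences(demo, "solvus"))
```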

## 3. Named entity recognition & relation extraction

Named entity recognition identifies the alloy names and property values in the superalloy corpus. Relation extraction then builds property tuples (alloy_named_entity, property_specifier, property_value) from the recognized entities.

In [49]:
from text_extractor.Phrase_parse import Phrase_parse
from text_extractor.Relation_extraciton import Relation_extraciton

In [52]:
for n,sent in target_sents.items():
    parse = Phrase_parse(sent, prop_name, c_path)
    sub_order, sub_id, object_list = parse.alloy_sub_search()
    print("Sentence:",sent)
    print("Alloy name:",sub_order)
    print("Property number:",object_list)
    RE = Relation_extraciton(prop_name, sent, sub_order, sub_id, object_list, c_path)
    all_outcome = RE.triple_extraction()
    print("Relation extraction results:",all_outcome)
    print("\n")

Sentence: The γ′ solvus temperature of Co-9Al-8W-2Ta-2Cr ( at % ) alloy is above 1050°C which is slightly lower than that of Co-9Al-8W-2Ta ( at % ) alloy [ 3 ] , while the γ′ solvus temperature of Co-7.8Al-7.8W-2Ta-4.5Cr ( at % ) alloy is only 960°C [ 4 ] .
Alloy name: ['Co-9Al-8W-2Ta-2Cr', 'Co-9Al-8W-2Ta', 'Co-7.8Al-7.8W-2Ta-4.5Cr']
Property number: ['1050°C', '960°C']
Relation extraction results: {1: ('Co-9Al-8W-2Ta-2Cr', 'solvus', '1050°C'), 2: ('Co-7.8Al-7.8W-2Ta-4.5Cr', 'solvus', '960°C')}


Sentence: By comparing Co-9Al-9W ( at % ) alloy and Co-7.3Al-6.8W ( at % ) alloy , the γ′ solvus temperature is decreased from 985°C to 854°C , indicating a lower W content depresses the stability of γ′ .
Alloy name: ['Co-9Al-9W', 'Co-7.3Al-6.8W']
Property number: ['985°C', '854°C']
Relation extraction results: {1: ('Co-9Al-9W', 'solvus', '985°C'), 2: ('Co-7.3Al-6.8W', 'solvus', '854°C')}
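
The entity lists printed above can be approximated with two regular expressions, which helps when checking the results by hand. This is only an illustrative sketch: the patterns and function names are assumptions, they cover composition-style names such as Co-9Al-8W-2Ta-2Cr and values in °C, and they do not reproduce the dictionary-driven logic of `Phrase_parse`. Pairing each alloy with its value is left to `Relation_extraciton`.

```python
import re
from typing import List, Tuple

# Composition-style alloy names: a base element followed by one or more
# "-<amount><Element>" groups, e.g. "Co-7.8Al-7.8W-2Ta-4.5Cr".
ALLOY_RE = re.compile(r"\b[A-Z][a-z]?(?:-\d+(?:\.\d+)?[A-Z][a-z]?)+\b")
# Temperature values such as "1050°C": the number and the unit are captured
# separately so they can be stored in different columns.
VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(°C)")


def find_alloys(sentence: str) -> List[str]:
    """Return all composition-style alloy names in order of appearance."""
    return ALLOY_RE.findall(sentence)


def find_values(sentence: str) -> List[Tuple[float, str]]:
    """Return (number, unit) pairs for every Celsius value in the sentence."""
    return [(float(num), unit) for num, unit in VALUE_RE.findall(sentence)]


sent = ("By comparing Co-9Al-9W ( at % ) alloy and Co-7.3Al-6.8W ( at % ) alloy , "
        "the γ′ solvus temperature is decreased from 985°C to 854°C .")
print(find_alloys(sent))
print(find_values(sent))
```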




## 4. Table parsing (from XML file)

Here we use the XML corpus ('input_xml') as an example. The corpus in the folder 'input_xml' was obtained automatically through the Elsevier Dev API.

In [4]:
from table_extractor.elsevier_xml.class_modified import TableExtractorToAlloy,get_extraction_outcome
from table_extractor.elsevier_xml.dictionary import Dictionary

In [5]:
# the path of the configuration file
config_path = r".\pipeline\dictionary.ini"
# the folder that contains the XML files
xml_path = r'.\input_xml'
# the folder in which to store the output Excel files
save_path = r'.\output_tables'
all_error_file, length = get_extraction_outcome(xml_path, save_path, config_path)

10.1016/j.msea.2014.09.074


View the table information extracted from the XML file

In [6]:
import os
import xlrd
import pandas as pd

In [10]:
xlsx_files = os.listdir(save_path)
xlsx_files
xlsx_file = xlsx_files[0]
xlsx_feature = pd.read_excel(os.path.join(save_path,xlsx_file), usecols=[0,1,2,3,4]) 
feature = pd.DataFrame(xlsx_feature)
feature

| Unnamed: 0 | 10.1016/j.msea.2014.09.074 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | Liquidus, solidus, g'-solvus temperatures and ... | | | | |
| 1 | Alloy | Transformation temperature (degC) | Transformation temperature (degC) | Transformation temperature (degC) | Density (gcm-3) |
| 2 | | Solidus | Liquidus | g' solvus | Density (gcm-3) |
| 3 | 5W | 1395 | 1426 | 1100 | 9.32 |
| 4 | Co-7.3Al-6.8W (at%) [5] | - | - | 854 | 9.18 |
| 5 | Co-9.2Al-9W (at%) [1] | 1441 | 1466 | 985 | 9.54 |
| 6 | Co-8.8Al-9.8W-2Ta (at%) [4] | 1407 | 1451 | 1079 | >9.54 |
| 7 | Co-7.3Al-7.2W-20.2Ni (at%) [5] | - | - | 881 | 9.29 |
| 8 | Co-7.8Al-7.8W-4.5Cr-2Ta (at%) [4] | 1412 | 1453 | 960 | - |
| 9 | Co-9.9Al-4.8W-1.8Ta (at%) [5] | - | - | 983 | 9.09 |
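
The extracted table has a two-row header (a group label and a sub-label) followed by one row per alloy. The sketch below shows one way to turn such a table into (alloy, column, value) records for a target property. It is a hand-written illustration on rows copied from the table above, not the method used by `TableExtractorToAlloy`; the function names are hypothetical.

```python
from typing import List, Tuple

# Two header rows followed by data rows, copied from the table above.
ROWS = [
    ["Alloy", "Transformation temperature (degC)", "Transformation temperature (degC)",
     "Transformation temperature (degC)", "Density (gcm-3)"],
    ["", "Solidus", "Liquidus", "g' solvus", "Density (gcm-3)"],
    ["5W", "1395", "1426", "1100", "9.32"],
    ["Co-7.3Al-6.8W (at%) [5]", "-", "-", "854", "9.18"],
]


def merge_headers(top: List[str], sub: List[str]) -> List[str]:
    """Combine a two-row header into one label per column."""
    merged = []
    for t, s in zip(top, sub):
        # Keep the top label alone when the sub-label is empty or identical.
        merged.append(t if (not s or s == t) else f"{t} / {s}")
    return merged


def property_records(rows: List[List[str]], keyword: str) -> List[Tuple[str, str, str]]:
    """Collect (alloy, column label, value) for columns matching `keyword`."""
    header = merge_headers(rows[0], rows[1])
    wanted = [i for i, label in enumerate(header) if keyword.lower() in label.lower()]
    records = []
    for row in rows[2:]:
        for i in wanted:
            if row[i] not in ("", "-"):  # skip cells without a reported value
                records.append((row[0], header[i], row[i]))
    return records


print(property_records(ROWS, "solvus"))
```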


## 5. Table parsing (from HTML file)

Here we use the DOI '10.1115/1.2836743' as an example. The tables of the article are fetched automatically by a crawler.

In [16]:
from table_extractor.web_other_journals_html.get_tifo_from_html import GetTableHtml
import os
import pandas as pd

In [11]:
# doi of article
doi = "10.1115/1.2836743"
# the folder in which to store the output Excel files
output_path = r".\output_tables"
g_t = GetTableHtml(doi, output_path)
g_t.run()

Start crawling the page
complete!
complete!
****************************************************************************************************
[]


In [13]:
xlsx_files = os.listdir(output_path)
xlsx_files

['10.1115-1.2836743.xlsx']

View the generated information

In [19]:
xlsx_file = xlsx_files[0]
xlsx_feature = pd.read_excel(os.path.join(output_path,xlsx_file), usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12,13]) 
feature = pd.DataFrame(xlsx_feature)
feature

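| Unnamed: 0 | 10.1115/1.2836743 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Table 1Base alloy and filler metal nominal com... | | | | | | | | | | | | | |
| 1 | Alloy | Condition | Al | C | Cr | Co | Fe | Mo | Nb | Ta | Ti | W | Zr | Ni |
| 2 | X-40 | As cast | — | 0.5 | 25 | Bal. | 1.5 | — | — | — | — | 7.5 | — | 10 |
| 3 | IN738 | Cast | 3.4 | 0.17 | 16 | 8.5 | — | 1.75 | 0.9 | 1.75 | 3.4 | 2.6 | 0.1 | Bal. |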
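
Composition rows like the ones above mix numbers with placeholders: a dash for an element that is not reported and "Bal." for the balance element. A small sketch of how such a row could be converted into numeric amounts is given below. `parse_composition` is a hypothetical helper written for this example, and treating any cell that starts with "bal" as the balance element is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Cell contents that mean "no value reported".
MISSING = {"", "-", "—"}


def parse_composition(elements: List[str],
                      cells: List[str]) -> Tuple[Dict[str, float], Optional[str]]:
    """Return ({element: amount}, balance_element) for one table row."""
    amounts: Dict[str, float] = {}
    balance = None
    for element, cell in zip(elements, cells):
        cell = cell.strip()
        if cell in MISSING:
            continue
        if cell.lower().startswith("bal"):
            balance = element  # the remainder of the alloy is this element
            continue
        amounts[element] = float(cell)
    return amounts, balance


ELEMENTS = ["Al", "C", "Cr", "Co", "Fe", "Mo", "Nb", "Ta", "Ti", "W", "Zr", "Ni"]
X40 = ["—", "0.5", "25", "Bal.", "1.5", "—", "—", "—", "—", "7.5", "—", "10"]
print(parse_composition(ELEMENTS, X40))
```

If the table is known to be given in percent, the balance amount can then be computed as 100 minus the sum of the other amounts.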


## 6. Pipeline (input: XML and TXT files; output: alloy composition and property information)

This section combines corpus preprocessing, sentence classification, named entity recognition, relation extraction, table extraction and dependency parsing to obtain <article doi, target sentence, alloy name, property, parameters, unit> tuples automatically from XML and TXT files.

Before each run, empty the folders 'output_files' and "output_tables", and make sure the folder "m_output" contains only one folder, "full_text", which itself contains no files.
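
These preconditions can be verified before the pipeline starts. The sketch below reports what is wrong instead of deleting anything. `check_pipeline_dirs` is a hypothetical helper; it treats an output folder that does not exist yet as acceptable, which is an assumption about how the pipeline behaves.

```python
import os
import tempfile
from typing import List


def check_pipeline_dirs(base: str = ".") -> List[str]:
    """Return a list of problems; an empty list means the folders are ready."""
    problems = []
    for name in ("output_files", "output_tables"):
        path = os.path.join(base, name)
        if os.path.isdir(path) and os.listdir(path):
            problems.append(f"{name} is not empty")
    m_output = os.path.join(base, "m_output")
    full_text = os.path.join(m_output, "full_text")
    if not os.path.isdir(full_text):
        problems.append("m_output/full_text is missing")
    else:
        if os.listdir(full_text):
            problems.append("m_output/full_text is not empty")
        if sorted(os.listdir(m_output)) != ["full_text"]:
            problems.append("m_output contains entries other than full_text")
    return problems


# Demonstration on a throwaway directory laid out the way the text requires.
demo_base = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_base, "m_output", "full_text"))
os.makedirs(os.path.join(demo_base, "output_files"))
print(check_pipeline_dirs(demo_base))
```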

In [2]:
from pipeline.class_modified import get_extraction_outcome
from pipeline.html_parser import Html_parser
from pipeline.other_journals import OtherJ
from pipeline.table_info_html import GetTInfoFromHtml
from pipeline.text_with_table import AcquireAllTargetInfo
from pipeline.main import AllCase

In [2]:
# the path of the configuration file
config_path = r".\pipeline\dictionary.ini"
# target property
prop_name = "solvus"
# the folder for intermediate files; before each run, make sure it contains only an empty folder named "full_text"
m_path = r".\m_output"
# the folder in which to store the table information
table_save_path = r".\output_tables"
# paths of the final results
xml_out_path = r".\output_files\all-attributes.xls"
dependency_out_path = r".\output_files"
# paths of the input files
xml_path = r".\input_xml"
origin_text_path = r".\input_txt"

ac = AllCase(config_path, prop_name, dependency_out_path, table_save_path)
ac.case_2(xml_path, origin_text_path, xml_out_path, m_path)

Success: Extracted Tables from 10.1016/j.msea.2014.09.074
gather number :8
all_text number :4


View the final generated information

In [3]:
import os
import pandas as pd
xlsx_files = os.listdir(dependency_out_path)
xlsx_files

['all-attributes.xls', 'solvus.xlsx']

The extracted information is stored mainly in the file in xlsx format.

In [5]:
xlsx_file = xlsx_files[1]
xlsx_feature = pd.read_excel(os.path.join(dependency_out_path,xlsx_file), usecols=[0,1,2,3,4,5,6,7]) 
feature = pd.DataFrame(xlsx_feature)
feature

| Unnamed: 0 | Source | DOIs | table_topic | material | Property_name | Property_value | Unit | other_element_info |
|---|---|---|---|---|---|---|---|---|
| 0 | table | 10.1016/j.msea.2014.09.074 | Liquidus, solidus, g'-solvus temperatures and ... | 5W | g' solvus | 1100 | °C | |
| 1 | table | 10.1016/j.msea.2014.09.074 | Liquidus, solidus, g'-solvus temperatures and ... | Co-7.3Al-6.8W (at%) | g' solvus | 854 | °C | |
| 2 | table | 10.1016/j.msea.2014.09.074 | Liquidus, solidus, g'-solvus temperatures and ... | Co-9.2Al-9W (at%) | g' solvus | 985 | °C | |
| 3 | table | 10.1016/j.msea.2014.09.074 | Liquidus, solidus, g'-solvus temperatures and ... | Co-8.8Al-9.8W-2Ta (at%) | g' solvus | 1079 | °C | |
| 4 | table | 10.1016/j.msea.2014.09.074 | Liquidus, solidus, g'-solvus temperatures and ... | Co-7.3Al-7.2W-20.2Ni (at%) | g' solvus | 881 | °C | |
| 5 | table | 10.1016/j.msea.2014.09.074 | Liquidus, solidus, g'-solvus temperatures and ... | Co-7.8Al-7.8W-4.5Cr-2Ta (at%) | g' solvus | 960 | °C | |
| 6 | table | 10.1016/j.msea.2014.09.074 | Liquidus, solidus, g'-solvus temperatures and ... | Co-9.9Al-4.8W-1.8Ta (at%) | g' solvus | 983 | °C | |
| 7 | table | 10.1016/j.msea.2014.09.074 | Liquidus, solidus, g'-solvus temperatures and ... | CMSX-4 | g' solvus | 1309 | °C | |
| 8 | text | 10.1016/j.msea.2014.09.074 | | Co-9Al-8W-2Ta-2Cr | solvus | 1050 | °C | |
| 9 | text | 10.1016/j.msea.2014.09.074 | | Co-7.8Al-7.8W-2Ta-4.5Cr | solvus | 960 | °C | |
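
The cell above picks the result file with `xlsx_files[1]`, which depends on the order returned by `os.listdir`; Python documents that order as arbitrary. Selecting the file by name is more robust. The sketch below assumes that the result file is named after the target property, as the listing `['all-attributes.xls', 'solvus.xlsx']` suggests; `find_result_file` is a hypothetical helper.

```python
import os
import tempfile


def find_result_file(folder: str, prop_name: str) -> str:
    """Return the path of '<prop_name>.xlsx' inside `folder`.

    Raises FileNotFoundError when the file is not there.
    """
    expected = f"{prop_name}.xlsx"
    if expected not in os.listdir(folder):
        raise FileNotFoundError(f"{expected} not found in {folder}")
    return os.path.join(folder, expected)


# In the notebook this would be used as:
#   xlsx_feature = pd.read_excel(find_result_file(dependency_out_path, prop_name))
```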


## 7. Pipeline (input: HTML files and DOI; output: alloy composition and property information)

This section combines corpus preprocessing, sentence classification, named entity recognition, relation extraction, table extraction and dependency parsing to obtain <article doi, target sentence, alloy name, property, parameters, unit> tuples automatically from HTML files.

Before each run, empty the folders 'output_files' and "output_tables", and make sure the folder "m_output" contains only one folder, "full_text", which itself contains no files.

In [1]:
from table_extractor.web_other_journals_html.get_tifo_from_html import GetTableHtml
import os
import pandas as pd
from pipeline.class_modified import get_extraction_outcome
from pipeline.html_parser import Html_parser
from pipeline.other_journals import OtherJ
from pipeline.table_info_html import GetTInfoFromHtml
from pipeline.text_with_table import AcquireAllTargetInfo
from pipeline.main import AllCase

In [3]:
# the path of the configuration file
config_path = r".\pipeline\dictionary.ini"
# target property
prop_name = "solvus"
# the folder for intermediate files; before each run, make sure it contains only an empty folder named "full_text"
m_path = r".\m_output"
# the folder in which to store the table information
table_save_path = r".\output_tables"
# paths of the final results
dependency_out_path = r".\output_files"
out_path = r".\output_files\all-attributes.xls"
# DOI of the article
doi = "10.1007/s11837-014-1181-y"
# path of the input HTML files
html_path = r'.\input_html'
# name of the journal source; it depends on where the HTML files were downloaded from
journal = "Springer"
# the folder in which to store the full text of the article
out_path_txt = r".\txt_from_html"

ac = AllCase(config_path, prop_name, dependency_out_path, table_save_path)
ac.case_3(doi, html_path, journal, out_path_txt, out_path, m_path)

/article/10.1007/s11837-014-1181-y/tables/1
/article/10.1007/s11837-014-1181-y/tables/2
/article/10.1007/s11837-014-1181-y/tables/3
['https://link.springer.com/article/10.1007/s11837-014-1181-y/tables/1', 'https://link.springer.com/article/10.1007/s11837-014-1181-y/tables/2', 'https://link.springer.com/article/10.1007/s11837-014-1181-y/tables/3']
Start crawling the page
complete!
Start crawling the page
complete!
Start crawling the page
complete!
****************************************************************************************************
[]
gather number :9
all_text number :8


View the generated information

In [4]:
import os
import pandas as pd
xlsx_files = os.listdir(dependency_out_path)
xlsx_files

['all-attributes.xls', 'solvus.xlsx']

The extracted information is stored mainly in the file in xlsx format.

In [6]:
xlsx_file = xlsx_files[1]
xlsx_feature = pd.read_excel(os.path.join(dependency_out_path,xlsx_file), usecols=[0,1,2,3,4,5,6,7,8,9]) 
feature = pd.DataFrame(xlsx_feature)
feature

| Unnamed: 0 | Source | DOIs | table_topic | material | Property_name | Property_value | Unit | other_element_info | other_property_info | child_tag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Base alloy | γ′ solvus temperature (°C) | 919.0 | °C | | {'Group': 'I', 'Alloy': 'Base alloy', 'Co': 'B... | |
| 1 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Alloy Ta | γ′ solvus temperature (°C) | 998.0 | °C | | {'Group': 'I', 'Alloy': 'Alloy Ta', 'Co': 'Bal... | |
| 2 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Alloy Ti | γ′ solvus temperature (°C) | 1070.0 | °C | | {'Group': 'I', 'Alloy': 'Alloy Ti', 'Co': 'Bal... | |
| 3 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Alloy TaTi | γ′ solvus temperature (°C) | 113116.0 | °C | | {'Group': 'I', 'Alloy': 'Alloy TaTi', 'Co': 'B... | |
| 4 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Alloy TaTi-A | γ′ solvus temperature (°C) | 1097.0 | °C | | {'Group': 'II', 'Alloy': 'Alloy TaTi-A', 'Co':... | |
| 5 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Alloy TaTi-B | γ′ solvus temperature (°C) | 1131.0 | °C | | {'Group': 'II', 'Alloy': 'Alloy TaTi-B', 'Co':... | |
| 6 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Alloy TaTi-C | γ′ solvus temperature (°C) | 115716.0 | °C | | {'Group': 'II', 'Alloy': 'Alloy TaTi-C', 'Co':... | |
| 7 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Alloy TaTi-D | γ′ solvus temperature (°C) | 1184.0 | °C | | {'Group': 'II', 'Alloy': 'Alloy TaTi-D', 'Co':... | |
| 8 | table | 10.1007/s11837-014-1181-y | Table I Nominal compositions in at.% and <i>γ<... | Alloy TaTi-E | γ′ solvus temperature (°C) | 1146.0 | °C | | {'Group': 'II', 'Alloy': 'Alloy TaTi-E', 'Co':... | |
| 9 | text | 10.1007/s11837-014-1181-y | | Co-Al-W-base | solvus | 1100.0 | °C | | | |
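
Two values in the table above, 113116.0 and 115716.0, are roughly a hundred times larger than every other row and are implausible as temperatures in °C. That suggests an extraction artifact, although the source does not say what caused it. A simple range check can flag such rows for manual review. The sketch below is illustrative only: `flag_implausible` is a hypothetical helper, and the bounds of 500 to 1500 °C are an assumption that should be adjusted to the property being extracted.

```python
from typing import Iterable, List, Tuple


def flag_implausible(rows: Iterable[Tuple[str, float]],
                     low: float = 500.0,
                     high: float = 1500.0) -> List[Tuple[str, float]]:
    """Return the (material, value) pairs whose value lies outside [low, high]."""
    return [(material, value) for material, value in rows
            if not (low <= value <= high)]


# (material, Property_value) pairs copied from the table above.
extracted = [
    ("Base alloy", 919.0),
    ("Alloy Ti", 1070.0),
    ("Alloy TaTi", 113116.0),
    ("Alloy TaTi-C", 115716.0),
    ("Alloy TaTi-D", 1184.0),
]
print(flag_implausible(extracted))
```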
