Merge pull request #52 from ContentMine/dev
Update to 0.2.24
tarrow committed Apr 22, 2016
2 parents 5f62864 + 4b5add0 commit dc65381
Showing 3,378 changed files with 2,785,251 additions and 5,668 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
13 changes: 8 additions & 5 deletions README.md
@@ -1,12 +1,15 @@
# AMI

[2015-09-23: This is the current active repository, but will move to https://github.com/ContentMine/ami/ in the near future ].

AMI provides a generic infrastructure where plugins can search, index or transform structured documents on a high-throughput basis. The typical input is structured, normalized, tagged XHTML, possibly containing (or linked to) SVG and PNG files. The plugins are designed to analyse text or graphics or a combination according to the discipline.

## Historical note and obsoletion
## Running
There are binaries provided with releases which enable running of ami's plugins; shell and batch scripts are provided to make running these easier. Detailed tutorials of these can be found in the ContentMine software tutorials.

Run on a CTree which contains scholarly.html files of the papers you are analysing. This can be made using Norma.

AMI has been through 2 major revisions, and most recently has been split into two parts (a) ``Norma`` which processes legacy documents and normalizes HTML (NHTML) and (b) AMI which runs plugins over the NHTML. AMI currently processes PDF, XML, HTML, etc but these will be obsoleted in favour of the output from ``Norma``.
Schema
```
ami2-<pluginname> --project <foldername> <plugin option> <options relating to plugin>
ami2-gene --project zika --g.gene --g.type human
```

## Building
```
2 changes: 1 addition & 1 deletion docs/RESULTS.md
@@ -1,6 +1,6 @@
# `contentMine` directories and `results` files

The `ami` plugins all use a conventional directory structure, modelled on theoutput from `quickscrape`. This used to be called a `quickscrape` directory, then `quickscrapeNorma` and now simply `contemtmine` or `cmdir` . (These names are still in flux, sorry). The flags `-q` and `qsN`, etc will become obsolete.
The `ami` plugins all use a conventional directory structure, modelled on the output from `quickscrape`. This used to be called a `quickscrape` directory, then `quickscrapeNorma` and now simply `contentMine` or `cmdir`. (These names are still in flux, sorry.) The flags `-q` and `qsN`, etc. will become obsolete.

## contentMine directory

327 changes: 327 additions & 0 deletions docs/SEARCH_ANALYSIS.md
@@ -0,0 +1,327 @@
# Search and Analysis Tutorial

Updated 20160125

Read in conjunction with
https://github.com/petermr/ami-plugin/blob/master/src/test/java/org/xmlcml/ami2/TutorialTest.java
especially `testSpeciesSequencesGeneWordsCMine()`

This demonstrates the analysis of several fairly diverse `CTree`s containing a mixture of
* genes
* species
* sequences
* words

## files

There are 7 `CTree`s which have already been `norma`lized. `mixed` represents the `cProject`:

```
mixed
├── file0
│   ├── fulltext.html
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   └── scholarly.html
├── file1
│   ├── fulltext.html
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   └── scholarly.html
... snipped
└── file6
    ├── fulltext.html
    ├── fulltext.pdf
    ├── fulltext.xml
    ├── results.json
    └── scholarly.html
```
To avoid cluttering the test material we copy them to `target`. (If you trust your operation, you can write directly to the `cProject`.)
```
File targetDir = new File("target/tutorial/mixed");
CMineTestFixtures.cleanAndCopyDir(new File("src/test/resources/org/xmlcml/ami2/mixed"), targetDir);
```
All results are therefore written under `target`.
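
The copy step can be pictured with plain `java.nio`. The sketch below is an assumption about what a clean-and-copy helper does (delete the target, then copy the source tree); it is not the `CMineTestFixtures` implementation, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for a clean-and-copy helper; not the CMineTestFixtures code.
class CleanAndCopySketch {

    /** Deletes target if it exists, then copies the whole source tree into it. */
    static void cleanAndCopy(Path source, Path target) throws IOException {
        if (Files.exists(target)) {
            List<Path> stale;
            try (Stream<Path> walk = Files.walk(target)) {
                // Reverse order lists children before their parent directories.
                stale = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path p : stale) {
                Files.delete(p);
            }
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(source)) {
            // Files.walk visits a directory before its contents.
            entries = walk.collect(Collectors.toList());
        }
        for (Path p : entries) {
            Path dest = target.resolve(source.relativize(p).toString());
            if (Files.isDirectory(p)) {
                Files.createDirectories(dest);
            } else {
                Files.copy(p, dest);
            }
        }
    }
}
```

With this sketch, `CleanAndCopySketch.cleanAndCopy(Paths.get("src/test/resources/org/xmlcml/ami2/mixed"), Paths.get("target/tutorial/mixed"))` would leave a fresh copy of the seven `CTree`s under `target`.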

## Search

The first operation is to use the specific `ArgProcessor` for these concepts, create `results.xml` and then
analyze these. Note that if there are no results, `empty.xml` is created to represent the fact it has been
searched.
```
LOG.debug("search for DNA Primers");
args = "--project "+targetDir+" -i scholarly.html --sq.sequence --context 35 --sq.type dnaprimer";
new SequenceArgProcessor(args).runAndOutput();
LOG.debug("wordFrequencies");
args = "--project "+targetDir+" -i scholarly.html"
+ " --w.words wordFrequencies --w.stopwords /org/xmlcml/ami2/plugins/word/stopwords.txt ";
new WordArgProcessor(args).runAndOutput();
LOG.debug("species");
args = "--project "+targetDir+" -i scholarly.html --sp.species --context 35 --sp.type binomial genus";
new SpeciesArgProcessor(args).runAndOutput();
LOG.debug("genes");
args = "--project "+targetDir+" -i scholarly.html --g.gene --context 100 --g.type human";
new GeneArgProcessor(args).runAndOutput();
```

These take about 500 ms for 7 articles - so ca 70 secs for 5 operations (DNA, words, species *2, genes) or 15 secs
each.

## results.xml

The operations all write to `results.xml`. Part of the output tree (with fulltext.* omitted):
```
.
├── file0
│   ├── results
│   │   ├── gene
│   │   │   └── human
│   │   │       └── results.xml
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml
│   │   ├── species
│   │   │   ├── binomial
│   │   │   │   └── empty.xml
│   │   │   └── genus
│   │   │       └── results.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   ├── scholarly.html
├── file1
│   ├── results
│   │   ├── gene
│   │   │   └── human
│   │   │       └── empty.xml
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml
│   │   ├── species
│   │   │   ├── binomial
│   │   │   │   └── results.xml
│   │   │   └── genus
│   │   │       └── empty.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   ├── scholarly.html
├── file2
│   ├── results
│   │   ├── gene
│   │   │   └── human
│   │   │       └── empty.xml
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml
│   │   ├── species
│   │   │   ├── binomial
│   │   │   │   └── results.xml
│   │   │   └── genus
│   │   │       └── empty.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   ├── scholarly.html
```
Notice that trees have some tips with `results.xml` and others with `empty.xml` (of course `word` always has `results.xml`).
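
The `results.xml` / `empty.xml` convention lets a later step tell "searched, nothing found" apart from "never searched". A minimal sketch of that check follows; the class, enum and method names are hypothetical and not part of `ami`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for reading the results.xml / empty.xml convention.
class ResultStatusSketch {

    enum Status { HITS, SEARCHED_NO_HITS, NOT_SEARCHED }

    /** Classifies one leaf directory, e.g. file0/results/species/genus. */
    static Status statusOf(Path leafDir) {
        if (Files.exists(leafDir.resolve("results.xml"))) {
            return Status.HITS;
        }
        if (Files.exists(leafDir.resolve("empty.xml"))) {
            return Status.SEARCHED_NO_HITS;
        }
        return Status.NOT_SEARCHED;
    }

    public static void main(String[] args) {
        // Default path assumes the tutorial output described above.
        Path leaf = Paths.get(args.length > 0
                ? args[0]
                : "target/tutorial/mixed/file0/results/species/genus");
        System.out.println(leaf + " -> " + statusOf(leaf));
    }
}
```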

## analysis

The `--analyze` flag operates on both `ctree`s and the containing `cproject`. For each `ctree` it:

* lists all the `results.xml` files within the `ctree`
* lists all the `snippets` extracted within the `results.xml`

```
├── file0
│   ├── geneFiles.xml
│   ├── geneSnippets.xml
│   ├── genegeneSnippets.xml
│   ├── results
│   │   ├── gene
│   │   │   └── human
│   │   │       └── results.xml
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml
│   │   ├── species
│   │   │   ├── binomial
│   │   │   │   └── empty.xml
│   │   │   └── genus
│   │   │       └── results.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   ├── scholarly.html
│   ├── sequenceFiles.xml
│   ├── sequenceSnippets.xml
│   ├── speciesFiles.xml
│   ├── speciesSnippets.xml
│   ├── wordFiles.xml
│   └── wordSnippets.xml
├── file1
│   ├── geneFiles.xml
│   ├── geneSnippets.xml
│   ├── genegeneSnippets.xml
│   ├── results
│   │   ├── gene
│   │   │   └── human
│   │   │       └── empty.xml
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml
│   │   ├── species
│   │   │   ├── binomial
│   │   │   │   └── results.xml
│   │   │   └── genus
│   │   │       └── empty.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   ├── scholarly.html
│   ├── sequenceFiles.xml
│   ├── sequenceSnippets.xml
│   ├── speciesFiles.xml
│   ├── speciesSnippets.xml
│   ├── wordFiles.xml
│   └── wordSnippets.xml
├── file2
```
There are four `ami-plugin`s,

* `gene`
* `sequence`
* `species`
* `word`

and each can have a `*Files` and a `*Snippets` file. The snippets can come from any XML file, but most usually from either the
input (`fulltext.xml` or equivalent) or `results.xml`. Thus `geneFiles` lists the files under `results/gene`.
In our example there is only `human` as a gene category but in principle there might be many different genes.
`species` has `binomial` and `genus` so could have up to two `results.xml` files.


### `*Files`

As an example `speciesFiles` can contain 0, 1 or 2 components:

```
file0/speciesFiles.xml
<cTreeFiles cTree="target/tutorial/mixed/file0">
  <file name="target/tutorial/mixed/file0/results/species/genus/results.xml"/>
</cTreeFiles>
file3/speciesFiles.xml
<cTreeFiles cTree="target/tutorial/mixed/file3">
  <file name="target/tutorial/mixed/file3/results/species/binomial/results.xml"/>
  <file name="target/tutorial/mixed/file3/results/species/genus/results.xml"/>
</cTreeFiles>
```

With no `results.xml` there is a stub:

```
file5/sequenceFiles.xml
<cTreeFiles cTree="target/tutorial/mixed/file5"/>
```
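
A `*Files.xml` document is small enough to read with the JDK's DOM parser. The following is a sketch only: the class and method names are hypothetical, and the element and attribute names (`cTreeFiles`, `file`, `name`) are taken from the listings above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for a *Files.xml document; not part of ami.
class CTreeFilesSketch {

    /** Returns the name attribute of every <file> element; empty for a stub. */
    static List<String> listFiles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList files = doc.getElementsByTagName("file");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < files.getLength(); i++) {
            names.add(((Element) files.item(i)).getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String stub = "<cTreeFiles cTree='target/tutorial/mixed/file5'/>";
        System.out.println(listFiles(stub).size() + " result file(s)");
    }
}
```

A stub therefore yields an empty list, so callers can treat "0, 1 or 2 components" uniformly.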

### `*Snippets`

The `<result>` elements in `results.xml` files are "snippets" of the document, usually with
the extracted entity and some characters/words of context. These can be transferred to a `*Snippets.xml`
file. Typical file `file5/speciesSnippets.xml` (aggregated):
```
<projectSnippetsTree>
<snippetsTree>
<snippets file="target/tutorial/mixed/file0/results/species/genus/results.xml">
<result pre="euronal progenitor) lineage in the " exact="Drosophila" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][11]/*[local-name()='div'][3]/*[local-name()='p'][3]" match="Drosophila" post=" brain. Loss of orthodenticle&lt;/i" name="genus"/>
<result pre="vely simple brain of the fruit fly " exact="Drosophila" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][11]/*[local-name()='div'][3]/*[local-name()='p'][3]" match="Drosophila" post=" have been identified. Furthermore," name="genus"/>
<result pre="identity of each neuroblast in the " exact="Drosophila" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][11]/*[local-name()='div'][3]/*[local-name()='p'][3]" match="Drosophila" post=" brain are known. These genes may a" name="genus"/>
<result pre="xt to each other in the developing " exact="Drosophila" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][11]/*[local-name()='div'][3]/*[local-name()='p'][3]" match="Drosophila" post=" brain, produces neurons for differ" name="genus"/>
...
</snippets>
</snippetsTree>
<snippetsTree>
<snippets file="target/tutorial/mixed/file1/results/species/binomial/results.xml">
<result pre="Longevity of " exact="Rhizoprionodon terraenovae" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][11]/*[local-name()='div'][4]/*[local-name()='p'][1]" match="Rhizoprionodon terraenovae" post=" and Carcharhinus acronotus " name="binomial"/>
<result pre="Rhizoprionodon terraenovae and " exact="Carcharhinus acronotus" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][11]/*[local-name()='div'][4]/*[local-name()='p'][1]" match="Carcharhinus acronotus" post=" in the western North Atlantic Ocea" name="binomial"/>
<result pre="om 7.7-14.0 years (mean =10.1) for " exact="R. terraenovae" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][11]/*[local-name()='div'][4]/*[local-name()='p'][1]" match="Rhizoprionodon terraenovae" post=" and 10.9-12.8 years (mean =11.9) f" name="binomial"/>
...
</snippets>
</snippetsTree>
<snippetsTree>
<snippets file="target/tutorial/mixed/file2/results/species/binomial/results.xml">
<result pre=" in cooperative tasks (marmosets ( " exact="Callithrix jacchus" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][3]/*[local-name()='p'][3]" match="Callithrix jacchus" post="): Werdenich and Huber, 2002; chimp" name="binomial"/>
<result pre="1993; Melis et al., 2006b; rooks ( " exact="Corvus frugilegus" match="Corvus frugilegus" post="): Seed et al., 2008; Scheid and No" name="binomial"/>
</snippets>
</snippetsTree>
<snippetsTree>
<snippets file="target/tutorial/mixed/file3/results/species/binomial/results.xml">
<result pre="vailable for Fusarium poae. " exact="F. poae" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][5]/*[local-name()='div'][2]/*[local-name()='p'][1]" match="F. poae" post=" is one of the species complexes in" name="binomial"/>
<result pre=" organic compounds associated with " exact="F. poae" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][5]/*[local-name()='div'][2]/*[local-name()='p'][1]" match="F. poae" post=" metabolism could provide good mark" name="binomial"/>
<result pre="he volatile profile of healthy and " exact="F. poae" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][5]/*[local-name()='div'][2]/*[local-name()='p'][1]" match="F. poae" post="-infected durum wheat kernels by SP" name="binomial"/>
<result pre="mpounds, could be used to identify " exact="F. poae" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][5]/*[local-name()='div'][2]/*[local-name()='p'][1]" match="F. poae" post=" contamination of durum wheat grain" name="binomial"/>
<result pre="m species responsible for FHB, " exact="F. graminearum" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][5]/*[local-name()='div'][2]/*[local-name()='p'][1]" match="F. graminearum" post=" is considered the most important, " name="binomial"/>
...
</snippets>
```
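
One simple use of a `*Snippets.xml` file is to count how often each entity occurs. The sketch below groups `<result>` elements by their `match` attribute, so that an abbreviated `exact` form such as `R. terraenovae` is counted under its full name. The class and method names are hypothetical, and results without a `match` attribute are counted under the empty string.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical aggregation over a *Snippets.xml document; not part of ami.
class SnippetCountSketch {

    /** Counts <result> elements per value of their match attribute. */
    static Map<String, Integer> countMatches(String snippetsXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(snippetsXml)));
        NodeList results = doc.getElementsByTagName("result");
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (int i = 0; i < results.getLength(); i++) {
            Element result = (Element) results.item(i);
            counts.merge(result.getAttribute("match"), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Shortened copy of the file1 snippets shown above.
        String xml = "<snippets>"
                + "<result exact='Rhizoprionodon terraenovae' match='Rhizoprionodon terraenovae' name='binomial'/>"
                + "<result exact='Carcharhinus acronotus' match='Carcharhinus acronotus' name='binomial'/>"
                + "<result exact='R. terraenovae' match='Rhizoprionodon terraenovae' name='binomial'/>"
                + "</snippets>";
        System.out.println(countMatches(xml));
    }
}
```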

## project aggregation

The project has 7 `ctree`s, and the `*Files` and `*Snippets` files can be combined into summary files as
direct children of the `cProject`.

## `file` and `xpath` of `--analyze`

The `--analyze` flag currently takes one argument, `searchExpression`. This has the forms:

* `file(glob-expression)`
* `file(glob-expression)xpath(xpath-expression)`

### glob syntax

The `glob` and `xpath` expressions follow the Java NIO file glob syntax and the W3C XPath 1.0 syntax
(currently spaces can't be included). `file` selects files through wildcards. Thus:
```
**/species/**/results.{xml,html}
```
looks for a file path which contains a `species` directory and ends in `results.xml` or `results.html`.

The full syntax is in: https://docs.oracle.com/javase/7/docs/api/java/nio/file/FileSystem.html
(section `getPathMatcher`)
### xpath
XPath navigates an XML document in a syntax derived from a filesystem. Example:
```
//result[contains(@pre,'genotype')]
````
finds all `<result>` elements anywhere in the document that have a `pre` attribute containing the string
"genotype":
```
<result pre="ly, the adult ALad1 lineage is labelled by the Cha7.4- Gal4 (a cholinergic neuron label) and " exact="GH146" post=" -Gal4 lines, while the adult LALv1 lineage is not ( Figure 1P , Table 1 ). " />
<result pre="te otd −/− MARCM clones, females of the genotype FRT19A,otd " exact="YH13" post=" /FM7c or FRT19A,oc 2 /FM7c or FRT19A, otd &lt;s" />
````
The first line is not matched; the second is (its `pre` attribute contains "genotype").
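
The same selection can be reproduced with the JDK's built-in XPath 1.0 engine. This is a sketch for experimenting with expressions, not the code path `ami` uses; the class and method names are hypothetical and the `pre` values are shortened copies of the two results above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for trying out XPath expressions on result XML.
class XPathSketch {

    /** Evaluates an XPath 1.0 expression and returns the selected nodes. */
    static NodeList select(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<results>"
                + "<result pre='labelled by the Cha7.4- Gal4 (a cholinergic neuron label) and ' exact='GH146'/>"
                + "<result pre='females of the genotype FRT19A,otd ' exact='YH13'/>"
                + "</results>";
        NodeList hits = select(xml, "//result[contains(@pre,'genotype')]");
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println(((Element) hits.item(i)).getAttribute("exact"));
        }
    }
}
```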

Another example, from patents:
```
file(**/fulltext.xml)xpath(//description[heading[.='BACKGROUND']]/p[contains(.,'polymer')])
```
finds every `<p>` that contains "polymer" within the BACKGROUND `<description>` sections of `fulltext.xml`.
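
To make the two forms of `searchExpression` concrete, here is a sketch that splits an expression into its glob and XPath parts and tests a path against the glob with `java.nio`. The regular expression is an assumption inferred from the examples in this tutorial, not the grammar `ami` implements, and the class name is hypothetical. Path separators are assumed to be `/`.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for file(glob) and file(glob)xpath(xpath); not ami's own.
class SearchExpressionSketch {

    // Assumed grammar: file(<glob>) optionally followed by xpath(<xpath>).
    private static final Pattern FORM =
            Pattern.compile("file\\((.+?)\\)(?:xpath\\((.+)\\))?");

    final String glob;
    final String xpath; // null when only file(...) is given

    private SearchExpressionSketch(String glob, String xpath) {
        this.glob = glob;
        this.xpath = xpath;
    }

    static SearchExpressionSketch parse(String expression) {
        Matcher m = FORM.matcher(expression);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a search expression: " + expression);
        }
        return new SearchExpressionSketch(m.group(1), m.group(2));
    }

    /** True if the path matches the glob under Java NIO glob rules. */
    boolean matchesFile(String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        SearchExpressionSketch s = parse("file(**/species/**/results.{xml,html})");
        System.out.println(s.matchesFile("file0/results/species/genus/results.xml"));
    }
}
```

The XPath part, when present, can then be evaluated against each matching file with any XPath 1.0 engine.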

2 changes: 2 additions & 0 deletions notes.txt
@@ -0,0 +1,2 @@
https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html
https://lucene.apache.org/core/features.html