Skip to content

AndresCalimero/corpus-search-auto

Repository files navigation

CorpusSearchAuto

Automatization tools for CorpusSearch

Summary

CorpusSearchAuto is a Java program that automates large researches in corpus linguistics, it uses CorpusSearch for searching syntactically annotated (parsed) corpora.

The software is a set of tools (currently only have 2 tools, described below).

Features

  • Classify YCOE and PENN corpus by genre (tool: genre-finder)
  • Run bulk searches of lexical and syntactic configurations by genre in several corpora (tool: statistics-by-genres)
  • Export results to HTML and Excel formats
  • Resource optimization

Usage

  CospusSearchAuto -tool [tool] [parameters]

The "data" folder

The "data" folder is a folder in the same directory as the CorpusSearchAuto executable that should store all the corpora files and the CS.jar executable, CorpusSearchAuto will search for both in this directory.

An example of its structure can be:

  data
    corporas
      YCOE
        info
          YcoeTextInfo.htm
          ...
        pos
          coadrian.o34.pos
          ...
        psd
          coadrian.o34.psd
          ...
      PPCEME
        info
          description.html
          ...
        pos
          abott-e1-p1.pos
          ...
        psd
          abott-e1-p1.psd
          ...
    output
      13-04-2017 18.46.36.333 YCOE
        prepositions
          homilies_and_sermons
            result.out
          ...
        ...
      ...
    searches
      queries
        YCOE
          prepositions.q
          ...
        ...
        PENN
          prepositions.q
          ...
        ...
      search.xml
      ...
    CS.jar

NOTE: The only thing that is mandatory to have into the data folder is the corpora and the CS.jar executable, the rest is just a suggestion to organize more efficiently the .q search files and the output files.

The XML search file

CorpusSearchAuto requires an XML file that contains all the information necessary to perform a massive search in corpora, in that file you must specify the variables to be searched (associated with their corresponding .q) and all corpus belonging to each corpora, classified by genre.

An example of a XML search file would be:

<?xml version="1.0" encoding="UTF-8"?>
<search xmlns="http://corpus.search.auto" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <corpora name="YCOE">
        <variables>
            <variable q-file="data\searches\queries\YCOE\adverbial_subordinators.q">Adverbial subordinators</variable>
            <variable q-file="data\searches\queries\YCOE\agentless_passives.q">Agentless passives</variable>
            <variable q-file="data\searches\queries\YCOE\attributive_adjetives.q">Attributive adjetives</variable>
            <variable q-file="data\searches\queries\YCOE\by_passives.q">By passives</variable>
            ...
        </variables>
        <genres>
            <genre name="Philosophy">
                <corpus>coboeth.o2</corpus>
                <corpus>codicts.o34</corpus>
                ...
            </genre>
            <genre name="History">
                <corpus>cobede.o2</corpus>
                <corpus>cochronA.o23</corpus>
                <corpus>cochronC</corpus>
                ...
            </genre>
            ...
        </genres>
    </corpora>
    <corpora name="PPCEME">
        <variables>
            <variable q-file="data\searches\queries\PENN\adverbial_subordinators.q">Adverbial subordinators</variable>
            <variable q-file="data\searches\queries\PENN\agentless_passives.q">Agentless passives</variable>
            <variable q-file="data\searches\queries\PENN\attributive_adjetives.q">Attributive adjetives</variable>
            <variable q-file="data\searches\queries\PENN\by_passives.q">By passives</variable>
            ...
        </variables>
        <genres>
            <genre name="Philosophy">
                <corpus>boethco-e1-h</corpus>
                <corpus>boethco-e1-p1</corpus>
                ...
            </genre>
            <genre name="History">
                <corpus>burnetcha-e3-h</corpus>
                <corpus>burnetcha-e3-p1</corpus>
                <corpus>burnetcha-e3-p2</corpus>
                ...
            </genre>
            ...
        </genres>
    </corpora>
    ...
</search>

Tools

genre-finder

Generates an XML file that lists all the corpus of the specified corpora and classifies them by genres, currently only supports the YCOE and PENN corpora.

The resulting XML file is useful to create the search file that the tool statistics-by-genres requires.

An example of the resulting XML would be:

<?xml version="1.0" encoding="UTF-8"?>
<genres xmlns="corpus.search.auto" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="corpus.search.auto genres-schema.xsd" corpora="YCOE">
  <genre name="religious treatise">
    <corpus>coadrian.o34</corpus>
    <corpus>coalcuin</corpus>
    <corpus>cocura.o2</corpus>
    ...
  </genre>
  <genre name="homilies">
    <corpus>coaelhom.o3</corpus>
    <corpus>coaugust</corpus>
    ...
  </genre>
  ...
</genres>
Usage
  CospusSearchAuto -tool genre-finder -corpora [CORPORA] -in [PATH_OF_INFO_FILE] (-out [PATH_OF_OUT_FILE])
Parameters
Name Mandatory Default value Description
corpora Yes - The corpora to be used, for example: YCOE
in Yes - The corpora html file specifying the genres to which each corpus belongs, for example: "data\corporas\YCOE\info\YcoeTextInfo.htm"
out No in path Path of the resulting XML file
Examples
  CorpusSearchAuto -tool genre-finder -corpora YCOE -in "data\corporas\YCOE\info\YcoeTextInfo.htm"
  CorpusSearchAuto -tool genre-finder -corpora PPCEME -in "data\corporas\PPCEME\info\description.html"
  CorpusSearchAuto -tool genre-finder -corpora PPCME2 -in "data\corporas\PPCME2\info\description.html" -out "C:\genres.xml"

statistics-by-genres

Processes the XML search file (described above) and generates one or more output files with the results in the format(s) specified.

Usage
  CospusSearchAuto -tool statistics-by-genres -in [PATH_OF_SEARCH_FILE] (-show-only [all|hits|tokens|total] -out-format [all|html|excel] -out [PATH_OF_OUT_FOLDER])
Parameters
Name Mandatory Default value Description
in Yes - Path of the XML search file, for example: "data/searches/search.xml"
show-only No all What should be displayed in the output file(s) (hits, tokens, total or all)
out-format No all What format the output file(s) should have (html, excel or all)
out No in path Path to the folder where the output files will be saved
Examples
  CorpusSearchAuto -tool statistics-by-genres -in "data\searches\search.xml"
  CorpusSearchAuto -tool statistics-by-genres -in "data\searches\search.xml" -show-only hits
  CorpusSearchAuto -tool statistics-by-genres -in "data\searches\search.xml" -show-only hits -out-format excel -out "C:\search_output"

License

MIT