A corpus expansion toolkit written in Python

GITOC-DigitalTool/cascadePy

# cascadePy

cascadePy (CaPy) is a corpus expansion toolkit written in Python.

cascadePy is a Python toolkit developed by the Centre for the Analysis of Social Media (CASM) Technology LLP in collaboration with the Global Initiative against Transnational Organised Crime (GITOC).

cascadePy combines a number of NLP, information-extraction and web-collection methods to provide a set of tools primarily for use in open-source intelligence (OSINT) efforts against the illicit online wildlife trade.

The intended use of cascadePy is to discover, characterise and expand the vernacular used by those complicit in the illicit wildlife trade, and identify the places they advertise on the web.

This work expands on the original work, which can be found here.

The installation instructions for the toolkit can be found below and a brief summary of each module can be found in the accompanying Wiki.

Citing this work

If you intend to use this toolkit, please use the following citation:

Pay, Jack Frederick, 2020. The Corpus Expansion Toolkit: finding what we want on the web (Doctoral thesis, University of Sussex).

In bibtex:

@phdthesis{pay2020corpusexpansion,
           title = {The Corpus Expansion Toolkit: finding what we want on the web},
          author = {Jack Frederick Pay},
            year = {2020},
          school = {University of Sussex},
             url = {http://sro.sussex.ac.uk/id/eprint/93062/},
}

Installation instructions

Prerequisites

  1. Python 3.8 or later is recommended.
  2. A data-science-focused Python environment, such as Anaconda, is also recommended.
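The version recommendation above can be checked before installing anything. The snippet below is a minimal sketch, not part of cascadePy; the helper name meets_recommended is made up for illustration.

```python
import sys

# Minimum interpreter version recommended in the prerequisites.
RECOMMENDED = (3, 8)


def meets_recommended(version_info=sys.version_info, minimum=RECOMMENDED):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_recommended() else "older than recommended"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Run it with the interpreter from the environment you plan to install into (for example, the active Anaconda environment), since that is the one the toolkit will use.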

Installation

  1. Clone the repository to your local machine
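  2. Run python setup.py install
  3. Install any spaCy models you require. For example, for English run the following command: python -m spacy download en (on spaCy 3 and later, which no longer accepts shortcut names such as en, use the full model name, for example en_core_web_sm)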
  4. Follow the instructions below to install the Surprisingly Frequent Phrase Detector (SFPD)
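After working through the steps above, it can help to confirm that the dependencies are importable from the environment you installed into. The following is a minimal sketch using only the standard library; the helper missing_packages is hypothetical, and the model name en_core_web_sm is only an example of a spaCy model package, so substitute whichever models you actually installed.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "en_core_web_sm" is an example model name (an assumption);
    # list the spaCy models your own pipeline needs.
    missing = missing_packages(["spacy", "en_core_web_sm"])
    print("Missing:", ", ".join(missing) if missing else "none")
```

The check works for spaCy models because they are installed as ordinary Python packages, so a model that downloaded correctly can be located by name without loading it.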

Installing the SFPD

Robertson, Andrew David, 2019. Characterising semantically coherent classes of text through feature discovery (Doctoral thesis, University of Sussex).

  1. Clone the repository found here (citing where necessary).
  2. Follow the necessary installation instructions.

Usage

The toolkit is primarily a library or programming API for others to develop their own corpus expansion pipelines and methodologies. However, a brief breakdown of each module can be found in the accompanying Wiki.

How to contribute

Please feel free to raise any issues found when using this toolkit, create pull requests or create discussion threads.

Liability

Neither CASM LLP nor GITOC accepts any liability for the misuse of any of the tools provided in this library.
