MaximumEntropy/UPSITE
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
UPSITE =============================== UPSITE is a large scale bioNLP classification system created by researchers from the University of Pittsburgh and Carnegie Mellon University. At the time of writing this documentation, the corresponding publication had been accepted to GLBio2015 conference and IEEE Transactions on Computational Biology and bioinformatics under the title “Text Mining for Validating Protein Interactions”. An exhaustive description of the system can be found within that paper. UPSITE has gained many functions over the course of its development and therefore can be used for a wide variety of ML related entity classification problems. The default and primary use of UPSITE is its ability to automatically synthesize and collate large portions of the ~24million document PubMed corpus and automatically classify entity-entity interactions (primarily PPIs). UPSITE was designed using only the highest performing modules available to the BioNLP community in an effort to minimize pipeline error propagation. This is great for optimizing performance, but makes its installation quite difficult. For this reason, I highly recommend accessing the publicly available Amazon Machine Instance (AMI) at the following URL: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;search=UPSITE;sort=name The AMI contains a fully pre-configured Ubuntu 14.04 environment and running version of UPSITE. We highly recommend that you run UPSITE using this AMI since all TEES models, corpora and NLP tools are already configured for you. If however you need to run UPSITE on your local skip the following instructions and proceed to the section that describes how to setup UPSITE on your local machine. The following are the steps to use the AMI to run UPSITE Step 1: Create an Amazon EC2 Instance with the UPSITE AMI Note: These instructions are written for Ubuntu 14.04.2 but should work for most popular linux distributions as well as the mac operating system. Windows users will find this document helpful but will have to make minor adjustments. 1. Create an Amazon Web Services Account at https://aws.amazon.com 2. Navigate to the AWS console > EC2 > IMAGES > AMIs 3. In the drop down menu located in the search bar, select Public images 4. Type “UPSITE” into the search bar and press enter ◦ The original UPSITE file can be identified by AMI ID:ami-2f5a9144 and owner:503952387608. Alternatively, you may find the AMI at this web address: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;search=upsite;sort=name 5. When the AMI for UPSITE is located, make sure it is selected and click the Launch button. You will be redirected to the EC2 configuration tutorial. Follow steps 1-7. Below are the recommended settings: 1. AMI: UPSITE 2. Instance Type: Use default settings (t2.micro) 3. Instance Configuration: Use default settings 4. Storage: 30GB 5. Instance Tag: key:myUPSITE Value:UPSITE1 6. Security Group: Type: SSH, Source:My IP 6. When you are finished setting up your EC2 instance, press Launch 7. You will be instructed to assign a key pair. UPSITE is a Linux AMI. For Linus AMIs, the private key file allows you to securely SSH into your UPSITE instance from your local computer. Select “Create a new key pair” and name it myUPSITEkey. 8. Select “Download Key Pair” and save the file (myUPSITEkey.pem) to your computer. 9. To verify your UPSITE instance is now launched, navigate to the EC2 dashboard and click on Instances > Instances. You should see UPSITE listed with Instance State:”running”. Note the public IP address listed for your instance on this screen, you will need it later. It should be something like: “54.86.228.154” Step 2: SSH into your remote UPSITE EC2 instance 1. Open a command terminal on your local system and navigate to the folder containing myUPSITEkey.pem 2. Enter the following commands to SSH into your UPSITE instance: ◦ chmod 400 myUPSITEkey.pem ◦ ssh -i myUPSITEkey.pem ubuntu@”Public_IP_Address_From_Step_9” ▪ ex. ssh -i myUPSITEkey.pem ubuntu@54.86.228.154 3. You will then be presented with a welcome screen displaying information pertaining to your remote UPSITE linux instance. Congratulations, you have successfully gained access to a fully working version of UPSITE. If you configured UPSITE using the AMI, you can skip the subsequent instructions and proceed to the end of this README that describes the command line usage of this tool. Running UPSITE on your local machine: In the event that you need to run TEES on your local Ubuntu machine, clone this repository with the following command git clone --recursive https://github.com/MaximumEntropy/UPSITE.git Run the installation script INSTALL.sh as follows: 1. chmod +x INSTALL.sh 2. ./INSTALL.sh This script fetches the necessary dependencies required to run UPSITE and TEES and also configures the TEES event extraction system. Once the dependencies are installed, TEES can now be configured. The TEES configuration is an interactive command line system to download related datasets, models and NLP tools required to run the system. You should see a prompt that looks like [X] 1) Install classifier (SVM Multiclass) [X] 2) Install models (TEES models for BioNLP'09-13 and DDI'11-13) [X] 3) Install corpora (BioNLP'09-13 and DDI'11-13) [X] 4) Install preprocessing tools (BANNER, BLLIP parser etc) * c) Continue and install selected items q) Quit * indicates the default option at , press c to continue ============================== Install Directory ============================== 1. By default, all data and tools will be installed to one directory, the DATAPATH. You can later set the installation directory individually for each component, or you can change the default path now. 2. TEES reads its configuration from a file defined by the environment variable "TEES_SETTINGS". This environment variable must be set, and point to a configuration file for TEES to work. By editing this configuration file you can configure TEES in addition (or instead of) using this configuration program. The "TEES_SETTINGS" environment variable is not set, but a configuration file has been found in the default location. This installation program will use the existing file, and by default install only missing components. -------------------------------------------------------------------------------- 1) Change DATAPATH (/home/ubuntu/.tees) 2) Change TEES_SETTINGS (/home/ubuntu/.tees_local_settings.py) * c) Continue ================================================================================ press c to continue again. This step allows you to change the path where your TEES settings and data are stored. We recommend sticking to the default settings. ================================== Classifier ================================== TEES uses the SVM Multiclass classifer by Thorsten Joachims for all classification tasks. You can optionally choose to compile it from source if the precompiled Linux-binary does not work on your system. The SVM_MULTICLASS_DIR setting is already configured, so the default option is to skip installing. SVM_MULTICLASS_DIR=/home/ubuntu/.tees/tools/SVMMultiClass -------------------------------------------------------------------------------- [ ] 1) Compile from source * i) Install s) Skip ================================================================================ This installs LibSVM for classification. Press i install. ==================================== Models ==================================== TEES models are used for predicting events or relations using classify.py. Models are provided for all tasks in the BioNLP'11, BioNLP'09 and DDI'11 shared tasks, for all BioNLP'13 tasks except BB task 1, and for task 9.2 of the DDI'13 shared task. For a list of models and instructions for using them see https://github.com/jbjorne/TEES/wiki/Classifying. -------------------------------------------------------------------------------- [ ] 1) Redownload already downloaded files * i) Install s) Skip ================================================================================ TEES uses various event models for its information extraction. Press i to install them. =================================== Corpora =================================== The corpora are used for training new models and testing existing models. The corpora installable here are from the three BioNLP Shared Tasks (2009, 2011 and 2013) on Event Extraction (organized by University of Tokyo), and the two Drug- Drug Interaction Extraction tasks (DDI'11 and 13, organized by Universidad Carlos III de Madrid). The 2009 and 2011 corpora are downloaded as interaction XML files, generated from the original Shared Task files. If you need to convert the corpora from the original files, you can use the convertBioNLP.py, convertDDI.py and convertDDI13.py programs located at Utils/Convert. The 2013 corpora will be converted to interaction XML from the official corpus files, downloaded automatically from the task websites. Installing the BioNLP'13 corpora will take about 10 minutes. It is also recommended to download the official BioNLP Shared Task evaluator programs, which will be used by TEES when training or testing on those corpora. -------------------------------------------------------------------------------- [ ] 1) Redownload already downloaded files [X] 2) Install BioNLP'11 corpora [x] 3) Install BioNLP'09 (GENIA) corpus [x] 4) Install DDI'11 (Drug-Drug Interactions) corpus [X] 5) Install BioNLP'13 corpora [x] 6) Install DDI'13 (Drug-Drug Interactions) corpus [x] 7) Install BioNLP evaluators * i) Install s) Skip ================================================================================ TEES trains its algorithms on various biomedical coropora that can be downloaded in this step. We recommend downloading all corpora for best performance. ==================================== Tools ==================================== The tools are required for processing unannotated text and can be used as part of TEES, or independently through their wrappers. For information and usage conditions, see https://github.com/jbjorne/TEES/wiki/Licenses. Some of the tools need to be compiled from source, this will take a while. The external tools used by TEES are: The GENIA Sentence Splitter of Tokyo University (Tsuruoka Y. et. al.) The BANNER named entity recognizer by Robert Leaman et. al. The BLLIP parser of Brown University (Charniak E., Johnson M. et. al.) The Stanford Parser of the Stanford Natural Language Processing Group The GENIA_SENTENCE_SPLITTER_DIR setting is already configured, so the default option is to skip installing. GENIA_SENTENCE_SPLITTER_DIR=/home/ubuntu/.tees/tools/geniassThe BANNER_DIR setting is already configured, so the default option is to skip installing. BANNER_DIR=/home/ubuntu/.tees/tools/BANNERThe BLLIP_PARSER_DIR setting is already configured, so the default option is to skip installing. BLLIP_PARSER_DIR=/home/ubuntu/.tees/tools/BLLIP/dmcc-bllip-parser-cb43c6cThe STANFORD_PARSER_DIR setting is already configured, so the default option is to skip installing. STANFORD_PARSER_DIR=/home/ubuntu/.tees/tools/stanford-parser-2012-03-09 -------------------------------------------------------------------------------- [x] 1) Redownload already downloaded files [x] 2) Install GENIA Sentence Splitter [x] 3) Install BANNER named entity recognizer [x] 4) Install BLLIP parser [x] 5) Install Stanford Parser * i) Install s) Skip ================================================================================ This step installs all the NLP tools. Once again, we recommend that you install all the NLP tools. WARNING: This step involves the download of several resources not developed by us. In the even that one of the models, corpora or NLP tools are unable to download or install, please try again after a while and if the problem persists, please use our .ami file to run our system on Amazon EC2. A full list of dependencies required to run this system are listed below: 1. Ruby (sudo apt-get install ruby) 2. Make (sudo apt-get install make) 3. g++ (sudo apt-get install g++-multilib) 4. flex (sudo apt-get install flex) 5. boost (sudo apt-get install libboost-all-dev) 6. Java JRE (sudo apt-get install default-jre) 7. Java JDK (sudo apt-get install default-jdk) 8. Numpy (sudo apt-get install python-numpy) 9. Scipy (sudo apt-get install python-scipy) 10. Sklearn (sudo apt-get install python-sklearn) 11. NLTK (sudo apt-get install python-nltk) Example usage of UPSITE: The system is intended to be used a command line tool. An example usage of this system is: sudo python UPTEES/UPSITE.py -q MDM2 -w TERT -i single -n 60 -o mdm2_tert.tsv This will run experiments using the REL11, EPI11 and ID11 TEES models, to run it on a specific model, sudo python UPTEES/UPSITE.py -q MDM2 -w TERT -i single -n 60 -m REL11 -o mdm2_tert.tsv For a detailed description of the command line arguments, run the following command python UPTEES/UPSITE.py -help The runtime of this system is heavily dependent on the number of papers parsed. For best performance, we recommend setting n to atleast 60 papers to replicate the results in our paper. This entails several hours of runtime. UPSITE was written by Adam G. Roth. The author can be contacted for questions or collorborations at Roth.AdamG@gmail.com Disclaimer: The Turku Event Extraction System (TEES) although distributed as a submodule with this repository, is not developed by us.
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published