
Updated manifest #58

Merged
merged 1 commit into cidgoh:master on Jul 4, 2019

Conversation


@ivansg44 ivansg44 commented Jul 4, 2019

To include new resources used in bucket classification

@ivansg44 ivansg44 merged commit b9f93b7 into cidgoh:master Jul 4, 2019
ivansg44 pushed a commit that referenced this pull request Sep 21, 2019
* Prepare for LexMapr-0.1.2 (#32)

* Python 2.7 compatibility (#6)

* Clear list in backwards-compatible way

* Add python2.7 to setup.py

* Convert sets to lists before printing

* Update TravisCI and Coveralls badges for development branch

* Ivansg44/development/testing (#8)

* Abstracted lexmapr.pipeline to pipeline. Split lines exceeding PEP 8 line-length limits. Made note to make further PEP 8 changes.

* Increased test coverage for functions being tested outside TestPipeline class. Added docstrings to new and existing functions outside TestPipeline class.

* Added tests for preProcess, find_between_r, find_left_r and addSuffix. Changed some single quotes to double quotes, and added note to increase TestPipelineMethods class abstraction.

* Merged methods in TestPipelineMethods, so that there is a single method corresponding to each helper method in lexmapr.pipeline outside lexmapr.pipeline.run. See GitHub commit comment for justification. Also added class docstring for TestPipelineMethods class.

* Added tests for lexmapr.pipeline.allPermutations.

* Added tests for lexmapr.pipeline.combi. Added test_addSuffix and test_combi to list of public methods in TestPipelineMethods class docstring.

* Added tests for lexmapr.pipeline.punctuationTreatment. Added note of potentially unintended function of lexmapr.pipeline.punctuationTreatment.

* Added tests for lexmapr.pipeline.retainedPhrase. Added notes of potentially unintended functions and bugs of lexmapr.pipeline.retainedPhrase.

* Added note of how to approach lexmapr.pipeline.run testing.

* Added tests for lexmapr.pipeline.run with and without "full" format argument. Also added class docstring for TestPipeline, with notes on potential bugs in lexmapr.pipeline.run, along with notes to abstract methods and utilize parallel programming in the future.

* Abstracted tests for lexmapr.pipeline.run into single function. Added constructor to TestPipeline class with instance variable test_files to track all input and output file test case combinations. Extended TestPipelineMethods class docstring to include mention of parent class. Added note to show all assertions that fail in TestPipeline (if there are multiple failures) in the future.

* Updated list of public methods in TestPipeline class docstring.

* Removed TestPipeline constructor, and converted TestPipeline.test_files from instance variable to class variable. See GitHub comments for justification.

* Added tests for lexmapr.pipeline.run when punctuation treatment must be applied to, or extra inner spaces removed from, the raw sample.

* Added test for lexmapr.pipeline.run when samples are broken into varying numbers of tokens.

* Added test for lexmapr.pipeline.run when samples require varying amounts of preprocessing.

* Renamed some test files for lexmapr.pipeline.run to more uniform pattern.

* Added test for lexmapr.pipeline.run when samples require varying amounts of inflection treatment.

* Added test for lexmapr.pipeline.run when samples require varying amounts of spelling corrections.

* Fixed forgotten hash tag behind intended comment

* Added comments to lexmapr.pipeline.run input test files.

* Made note of critical bug in TestPipeline class.

* TestPipeline now lists all failed comparisons in a single assertion error, rather than only the first failed comparison. Updated function docstring to document this. Also replaced some single quotes with double quotes.
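The collect-all-failures pattern described above can be sketched roughly as follows; `compare_all` is a hypothetical helper, not the actual test code:

```python
def compare_all(pairs):
    # Collect every mismatch instead of stopping at the first one,
    # then raise a single AssertionError listing all of them.
    failures = ["%r != %r" % (actual, expected)
                for actual, expected in pairs if actual != expected]
    if failures:
        raise AssertionError("\n".join(failures))
```

This way a single test run surfaces every mismatched output file, rather than requiring one run per failure.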

* Fixed bug that made asserting curly bracketed content outputted by LexMapr.pipeline.run difficult, by implementing helper functions TestPipeline._order_contents and TestPipeline._get_curly_bracket_indices to sort all curly bracketed content. Also re-commented a line that I accidently uncommented in commit f494b30e5d0032a9bf6f78ef0b33ee1bf83136f7, and removed parantheses around an if-function.

* Revert "Fixed bug that made asserting curly bracketed content outputted by LexMapr.pipeline.run difficult, by implementing helper functions TestPipeline._order_contents and TestPipeline._get_curly_bracket_indices to sort all curly bracketed content. Also re-commented a line that I accidently uncommented in commit f494b30e5d0032a9bf6f78ef0b33ee1bf83136f7, and removed parantheses around an if-function."

This reverts commit 1c917de5cc03358027cf9ab644f89d9d5a5a4864.

* Edited test output files to be consistent with python3 ordering of sets when PYTHONHASHSEED==0. Also fixed forgotten hash tag behind intended comment.
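For context, string hashing (and therefore set iteration order) is randomized per Python 3 process unless PYTHONHASHSEED is fixed; a minimal demonstration, assuming nothing about the actual test files:

```python
import os
import subprocess
import sys

def set_order(seed):
    # Print a set of strings from a child interpreter with a fixed
    # hash seed; for a given seed the iteration order is reproducible
    # across processes, which is what makes test output comparable.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "print(list({'apple', 'banana', 'cherry'}))"
    return subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, env=env).stdout
```

Fixing PYTHONHASHSEED=0 in the test environment makes the set-derived portions of the output files deterministic.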

* Made test files for lexmapr/pipeline.run spelling corrections more robust. Removed note of potential bug made in commit d7223d83cf75bcce23135098165ceb9a58f3b993, as it was a misunderstanding (the word being checked for spelling mistakes does not have its case changed, but rather the words in the resource containing all spelling mistakes).

* Removed outdated note of bug.

* Added test for lexmapr.pipeline.run when samples require varying amounts of abbreviation and acronym translation.

* Added test for lexmapr.pipeline.run when samples require varying amounts of non-English to English translation.

* Added test for lexmapr.pipeline.run when samples contain varying amounts of stop words.

* Made note of difficult-to-test scenario.

* Added test for lexmapr.pipeline.run examining the varying paths taken to create a candidate phrase.

* Added test for lexmapr.pipeline.run when some rows have a SampleId, but no sample.

* Added test for lexmapr.pipeline.run when some samples are full-term direct matches.

* Added test for lexmapr.pipeline.run when some samples are full-term matches, provided a change-of-case in input or resource data.

* Added test for lexmapr.pipeline.run when some samples are full-term matches, provided a permutation of bracketed or non-bracketed terms.

* Added test for lexmapr.pipeline.run when some samples are full-term matches that require the addition of a suffix.

* Extended test for lexmapr.pipeline.run (when some samples are full-term direct matches) to include the case where some clean samples are full-term direct matches.

* Extended test for lexmapr.pipeline.run (when some samples are full-term matches, provided a change-of-case in input or resource data) to include the case where some clean samples are full-term matches, provided a change of case in input or resource data.

* Extended test for lexmapr.pipeline.run (when some samples are full-term matches, provided a permutation of bracketed or non-bracketed terms) to include the case where some clean samples are full-term matches when permutated.

* Extended test for lexmapr.pipeline.run (when some samples are full-term matches that require the addition of a suffix) to include the case where some clean samples are full-term matches that require the addition of a suffix.

* Added notes on potential bugs in lexmapr.pipeline.run.

* Added test for lexmapr.pipeline.run when some clean samples are full-term matches with a Wikipedia-based collocation resource.

* Made note of potential bug, and about testing component matching.

* Added class docstring and alphabetized imports.

* Removed completed TODO.

* Replaced relative paths with absolute paths when opening files for comparison in LexMapr/lexmapr/tests/test_pipeline.py.

* Commented out tests that match function specification but not current function implementation. Corrected docstring documentation to reflect this.

* Fixed incorrect declaration of expected_output_path variable.

* Adjusted output files used in test_pipeline.py to match changes to output incurred by commit be2b5018d2cbb5401457f5b44d30ab285cac41d1.

* Commented out print statement.

* Commented out TestPipeline class for the purposes of finding code responsible for failed Travis CI builds.

* Replaced assertCountEqual list comparisons with assertSetEqual set comparisons in LexMapr/lexmapr/tests/test_pipeline.py test_combi to maintain Python 2.7 compatibility.

* Uncommented out TestPipeline class. Commented out all but one test file. This was for the purposes of finding code responsible for failed Travis CI builds.

* Uncommented out more test files for the purposes of finding code responsible for failed Travis CI builds.

* Uncommented out more test files for the purposes of finding code responsible for failed Travis CI builds.

* Uncommented out more test files for the purposes of finding code responsible for failed Travis CI builds.

* Replaced use of sets with use of ordered dictionaries when creating statusAddendumSetFinal to maintain order.

* Revert "Replaced use of sets with use of ordered dictionaries when creating statusAddendumSetFinal to maintain order."

This reverts commit 4151d7a20ca3e03f9409c3ff7e79ee5115cc109a.

* Uncommented out more test files for the purposes of finding code responsible for failed Travis CI builds.

* Disabled logger.

* Added PYTHONHASHSEED=0 environment variable to Travis CI build.

* Allow python 2.7 travis CI build failures.

* Uncommented out more test files for the purposes of finding code responsible for failed Travis CI builds.

* Abstract production of dictionaries and lists from LexMapr resource files. (#12)

* Update shebang line to use either Python 2 or 3.

* Wrote pipeline.get_resource_dict stub.

* Wrote pipeline.get_resource_dict function docstring.

* Began implementation of pipeline.get_resource_dict.

* Added requirement of file_name location to pipeline.get_resource_dict function docstring.

* Replaced generation of synonymsDict in pipeline with call to get_resource_dict.

* Replaced stub return value in pipeline.get_resource_dict.

* Removed old, commented-out code.

* Replaced generation of collocationDict and processDict in pipeline.py with calls to get_resource_dict.

* Removed old, commented-out code.

* Add lowercase functionality to pipeline.get_resource_dict.

* Replaced generation of abbreviationDict and abbreviationLowerDict in pipeline.py with calls to get_resource_dict.

* Corrected unintentional declaration of sets.

* Removed old, commented-out code.

* Replaced generation of nonEnglishwordsDict and nonenglishWordsLowerDict in pipeline.py with calls to get_resource_dict.

* Replaced generation of spellingDict and spellingLowerdict in pipeline.py with calls to get_resource_dict.

* Fixed typo in 'Neflex'.

* Replaced generation of qualityDict and qualityLowerDict in pipeline.py with calls to get_resource_dict.

* Update TODO list for pipeline.get_resource_dict

* Update pipeline.get_resource_dict specification.

* Handle case in pipeline.get_resource_dict when values are empty.

* Replaced generation of inflectionExceptionList in pipeline.py with call to get_resource_dict.

* Changed variable name: inflectionExceptionList to inflectionExceptionDict.

* Replaced generation of stopWordsList in pipeline.py with call to get_resource_dict.

* Changed variable name: stopWordsList to stopWordsDict.

* Update TODO list for pipeline.get_resource_dict.

* Replaced generation of resourceTermsIDBasedDict in pipeline.py with call to get_resource_dict.

* Replaced generation of resourceTermsDict with modification of resourceTermsIDBasedDict.

* Renamed some variables to fit PEP style guidelines.

* Replaced generation of resourceRevisedTermsDict with modification of resourceTermsDict.

* Removed old, commented-out code.

* Replaced use of flag to skip first row in csv file in pipeline.get_resource_dict with use of the next() function.
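The next()-based header skip looks roughly like this. A simplified sketch: the real get_resource_dict reads resource files from disk and also handles lowercasing and empty values, per the surrounding commits.

```python
import csv
import io

def get_resource_dict(csv_text, lowercase=False):
    # Skip the header row with next() rather than carrying a
    # first-iteration flag through the loop, then build a key/value
    # dictionary from the remaining rows.
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # header row
    ret = {}
    for row in reader:
        key = row[0].lower() if lowercase else row[0]
        ret[key] = row[1]
    return ret
```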

* Renamed multiple variables.

* Update resource files (#14)

* Update resource files

* Used `iconv -f ISO-8859-1 -t UTF-8` to convert resource files to UTF-8

* Add .extension versions of test output files to compare with current test output files

* Overwrite output files to make it easier to compare with GitHub online diff

* Temporarily enable printing expected/actual output during testing

* Revert printing expected/actual output during testing

* Re-generate test output files, this time setting PYTHONHASHSEED=0

* Update suffixList, modify test output files to accommodate new suffixList (#17)

- Purposely bypassing failed tests. Will address non-deterministic output in a future update.

* Modified output test file for suffix term matching. (#19)

* Add MANIFEST.in file to include resource files in package (#21)

* Version bump

* Added TODO at start of 'Component Matches Section' that we must abstract.

* Added stub inner function find_component_match to pipeline.run.

* Abstracted generation of newChunk1-5Grams through use of get_gram_chunks in pipeline.run.find_component_match.

* Renamed newChunk to cleaned_chunk, newChunkTokens to cleaned_chunk_tokens, and newChunk1-5grams to cleaned_chunk_1-5_grams.

* Created dictionary cleaned_chunk_grams in find_component_matches to avoid repeated declarations of get_gram_chunks.
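The gram chunking referred to here can be sketched as follows; the behavior of get_gram_chunks (every contiguous run of n tokens) is an assumption based on these commit messages:

```python
def get_gram_chunks(tokens, n):
    # Return every contiguous n-token chunk of a tokenized sample,
    # e.g. the 2-grams of ["chicken", "breast", "meat"] are
    # "chicken breast" and "breast meat".
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

A cleaned_chunk_grams dictionary then just maps each n from 1 to 5 to get_gram_chunks(tokens, n), avoiding repeated calls.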

* Fixed outdated code.

* [WIP] Abstract application of rules (#18)

* Removed unnecessary blank line.

* Added stub for function get_annotations in LexMapr/lexmapr/pipeline.py.

* Renamed get_annotations to find_full_term_match. Implemented handling of empty samples through find_full_term_match.

* Updated handling of empty samples in find_full_term_match to include matched_term and all_match_terms_with_resource_ids annotations.

* Created new Exception class MatchNotFoundError.

* Completed docstrings for MatchNotFoundError class.

* Moved find_full_term_match inside pipeline.run. Made a note in find_full_term_match docstring why I did this, as well as notes on future abstraction.

* Commented out all uses of statusAddendum string, which do not seem to serve any purpose. Delete all instances in the future if truly unnecessary.

* Replaced individual modification of ret keys in find_full_term_match with use of dictionary.update function.

* Modified empty output test file to make tab pattern more consistent with other output scenarios. This will allow us to better abstract the process of writing outputs to files.

* Implemented handling of full-term direct matches through find_full_term_match.

* Renamed retSet.

* Renamed remSet to remaining_tokens.

* Renamed statusAddendumSetFinal to final_status.

* Renamed statusAddendumSet to status_addendum.

* Implemented handling of full-term change of case in input data matches through find_full_term_match.

* Handled write to output file when args.format != full.

* Renamed coveredAllTokensSet to covered_tokens.

* Altered output files with direct matches to make use of apostrophe more similar to other matches. This will make abstraction of writing to outputs easier.

* Updated handling of full term direct matches in find_full_term_match to make all_match_terms_with_resource_ids more similar to other matches.

* Deleted unnecessary duplication of code common to all full-term matches in find_full_term_match.

* Deleted more unnecessary duplicate code in find_full_term_match, and adhered some line lengths to PEP style guidelines.

* Implemented handling of full-term change in resource data matches through find_full_term_match.

* Renamed resourcePermutationTermsDict to resource_permutation_terms.

* Implemented handling of full-term matches with permutation of resource term through find_full_term_match.

* Deleted old code.

* Renamed resourceBracketedPermutationTermsDict to resource_bracketed_permutation_terms.

* Implemented handling of full-term matches with permutation of bracketed resource term through find_full_term_match.

* Renamed suffixList to suffixes.

* Implemented handling of full-term matches with change of resource and addition of suffix through find_full_term_match.

* Renamed newPhrase to cleaned_sample.

* Implemented handling of full-term matches with cleaned sample through find_full_term_match.

* Implemented handling of full-term matches with cleaned sample and change of resource through find_full_term_match.

* Implemented handling of full_term matches with cleaned sample and permutation of resource term through find_full_term_match.

* Implemented handling of full-term matches with cleaned sample and permutation of bracketed resource term through find_full_term_match.

* Implemented handling of full-term matches with cleaned sample and addition of suffix through find_full_term_match.

* Implemented handling of full-term matches of cleaned sample with multi-word collocations through find_full_term_match.

* Completed find_full_term_match docstring.

* Modified find_full_term_match docstring and modified ret dictionary initialization.

* Removed pipeline.addSuffix, and all calls to it.

* Modified comments around new code.

* Updated find_full_term_match docstring.

* Updated out-of-date comments.

* Move retained_token to local scope of find_full_term_match.

* Removed outdated TODO.

* Removed all commented-out instances of statusAddendum.

* Moved handling of 5 gram chunks to find_component_match.

* Removed unnecessary code.

* Resolved merge conflict.

* Resolved merge conflicts.

* Moved handling of 4 gram chunks to find_component_match.

* Moved handling of 3 gram chunks to find_component_match.

* Moved handling of 2 gram chunks to find_component_match.

* Changed order of iteration in find_component_match from 1-5 to 5-1. Moved handling of 1 gram chunks to find_component_match.

* Got rid of dictionary return value from find_component_match.

* Combined 4 and 5-gram component matching code in find_component_match.

* Simplified calls to get_gram_chunks to one instance.

* Combined 3, 4 and 5-gram component matching code in find_component_match.

* Fixed potential indentation typo.

* Version bump

* Combined 2-5 gram component matching code in find_component_match.

* Combined 1-5 gram component matching code in find_component_match.

* Updated find_component_match TODO list.

* Renamed partialMatchedList, partialMatchedResourceList and partialMatchedSet to partial_matches, partial_matches_with_ids and partial_matches_final. Moved declaration of partial_matches_with_ids to a more relevant area in the code. Removed unnecessary declaration of partial_matches_final.

* Removed unnecessary if-statement, and outdated TODOs.

* Renamed grm1 and grmTokens to concatenated_gram_chunk and gram_tokens respectively.

* Added missing localTrigger check and indentation that were erroneously removed.

* Renamed local_trigger to match_found in find_component_match.

* Renamed allPermutations to all_permutations. Renamed setPerm to permutations in find_component_match.

* Renamed allPermutations to all_permutations in test_pipeline.py.

* Renamed grm to concatenated_gram_chunk, and adhered some code to PEP style guidelines in find_component_match.

* Adhering code to PEP style guidelines in find_component_match.

* Abstracted code in find_component_match with implementation of inner function handle_component_match.

* Added parameter to find_component_match.handle_component_match, which allowed abstraction of suffix-addition code in find_component_match.

* Abstracted code in semantic tagging code in find_component_match.

* Abstracted candidate processes match code in find_component_match.

* Added find_component_match function docstring.

* Refactor preProcess (#28)

* Renamed preProcess to preprocess.

* Refactored preprocess, and wrote function docstring.

* [WIP] Cache resource dictionaries (#27)

* Wrote stub for function cache_resource_dict.

* Imported json, and changed planned use of single cache_resource_dict function with dual use of load_lookup_table and update_lookup_table. Wrote pseudocode for both functions.

* Implemented writing lookup_table to lookup_table.json in pipeline.update_lookup_table.

* Renamed punctuationsList to punctuations.

* Implemented reading of lookup_table.json into global lookup_table variable. Added synonyms to lookup_table.json.

* Replaced all calls to synonyms with lookup_table[synonyms].

* Implemented creation of lookup_table.json if file does not exist in pipeline.load_lookup_table.

* Implemented update of lookup_table.json if resources folder was modified more recently.
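The mtime-based update check can be sketched like this; is_lookup_table_outdated matches the name introduced in a later refactor below, but the body here is an assumption:

```python
import os

def is_lookup_table_outdated(lookup_table_path, resources_dir):
    # The cached lookup table is outdated if it does not exist yet, or
    # if any file in the resources folder was modified more recently
    # than the cache itself.
    if not os.path.exists(lookup_table_path):
        return True
    cache_mtime = os.path.getmtime(lookup_table_path)
    return any(
        os.path.getmtime(os.path.join(resources_dir, name)) > cache_mtime
        for name in os.listdir(resources_dir)
    )
```

When this returns True, the pipeline regenerates the resource dictionaries and rewrites lookup_table.json; otherwise it loads the cached JSON directly.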

* Removed print testing code that was erroneously committed.

* Attempt to prevent resources folder not found by Travis CI with use of os.path.join.

* Attempt to prevent resources folder not found by Travis CI with use of os.path.join.

* Created and implemented get_path helper function to tidy up load_lookup_table.

* Removed unnecessary calls to get_path.

* Removed unnecessary calls to get_path.

* Added abbreviations to lookup_table.

* Replaced leftover calls to abbreviation_lower with calls to lookup_table[abbreviation_lower].

* Added non_english_words and non_english_words_lower to lookup_table.

* Added spelling_mistakes and spelling_mistakes_lower to lookup_table.

* Replaced some leftover calls to non_english_words with calls to lookup_table[non_english_words].

* Added processes to lookup_table.

* Added qualities and qualities_lower to lookup_table.

* Added collocations to lookup_table.

* Added inflection_exceptions to lookup_table.

* Added stop_words to lookup_table.

* Added resource_terms_ID_based to lookup_table.

* Added resource_terms to lookup_table.

* Added resource_terms_revised to lookup_table.

* Added resource_permutation_terms to lookup_table.

* Added resource_bracketed_permutation_terms to lookup_table.

* Push lookup_table.json cache to repository.

* Revert "Push lookup_table.json cache to repository."

This reverts commit c34290ad9a622770f4dcbba29aa87ccfe8b7df3a.

* Abstracted load_lookup_table with use of newly implemented function is_lookup_table_outdated.

* Abstracted update_lookup_table through use of newly implemented function get_all_resource_dicts.

* Renamed update_lookup_table to add_lookup_table_to_cache.

* Renamed load_lookup_table to get_lookup_table_from_cache, and eliminated use of global variable lookup_table.

* Reordered functions.

* Completed is_lookup_table_outdated docstring.

* Added calls to get_path.

* Wrote get_path docstring.

* Wrote get_all_resource_dicts docstring.

* Wrote add_lookup_table_to_cache docstring.

* Wrote get_lookup_table_from_cache docstring.

* Added meaningful comments to addition of dictionaries to ret in get_all_resource_dicts.

* Cleaned up code used to get resource_permutation_terms and resource_bracketed_permutation_terms in get_all_resource_dicts.

* Updated outdated docstring.

* Let lookup_table be loaded from cache in utf-8 (as opposed to unicode) when running python 2, through use of new helper function str_hook.

* Renamed str_hook to unicode_to_utf_8, and did some cleaning up.

* [WIP] Pull find_full_term_match out of pipeline.run. (#29)

* Pulled find_full_term_match out of run. It has not yet been modified to run out of run. Rearranged MatchNotFoundError to below find_full_term_match.

* Added TODO to MatchNotFoundError.

* Added parameters to find_full_term_match to make it work out of run.

* Removed accidentally committed lookup_table.json file.

* Made suffixes into a global variable, and removed it as a parameter to find_full_term_match.

* Made lookup_table into a global variable, and removed it as a parameter to find_full_term_match.

* Made suffixes variable local to run again.

* Modified line lengths of find_full_term_match.

* Made lookup_table local again.

* Updated TODO list with notes on how to reduce number of parameters in find_full_term_match.

* Removed covered_tokens and remaining_tokens from find_full_term_match to reduce parameters, and also make function follow single responsibility principle better.

* Converted adding and removing of tokens to and from covered_tokens and remaining_tokens into list comprehension format after full-term match is found.

* Updated find_full_term_match docstring.

* Updated TODO list of find_full_term_match.

* Added suffixes to lookup_table, and removed it as a parameter from find_full_term_match.

* Added suffixes file in resources folder.

* Renamed outdated variable.

* Pulled find_component_match out of run.

* Modified line lengths of find_component_match to meet new limit of 99 characters per line, and also account for new indentations due to find_component_match now being outside of run.

* Replaced cleaned_chunk and cleaned_chunk_tokens parameters in find_component_match with cleaned_sample.

* Replaced cleaned_chunk with cleaned_sample, as they are the same value due to recent changes. Renamed cleaned_chunk_tokens to cleaned_sample_tokens.

* Pulled get_gram_chunks out of find_component_match.

* Updated TODO list of find_component_match.

* Updated find_component_match return value to return partial matches and covered tokens. This allowed the removal of partial_matches, covered_tokens and remaining_tokens as parameters from find_component_match, and the updating of these variables in run instead.

* Added get_gram_chunks docstring.

* Cleaned up find_component_match code.

* Updated find_component_match docstring.

* Some cleanup.

* [WIP] Fetch terms and synonyms from ontologies (#31)

* Copy-pasted ontofetch.py into lexmapr folder from https://github.com/GenEpiO/geem/blob/master/scripts/ontofetch.py.

* Import ontofetch into pipeline.

* Made print statements in ontofetch.py Python 3 compatible.

* Copy-pasted ontohelper.py from https://github.com/GenEpiO/geem/blob/master/scripts/python/ontohelper.py.

* Made print statements in ontohelper.py compatible with python 3.

* Added rdflib and rdfextras dependencies. Made note of it in README.md.

* Attempting to exclude ontofetch.py and ontohelper.py from coveralls analysis.

* Rename coverage.py to .coveragerc.

* Replaced ontofetch and ontohelper with updated versions of each from geem repository.

* Removed rdfextras dependency.

* Corrected ontofetch and ontohelper to work with python3, and directory tree of lexmapr.

* Replaced calls to iteritems with calls to items in ontohelper.py to ensure Python 3 compatibility.

* Eliminated encoding to utf-8 for variables meant for JSON serialization, when executing Python 3, in ontohelper.py.

* Implemented framework for fetching and storing ontology terms.

* Add WebOntologies.csv to resources folder.

* Create fetched_ontologies folder if one does not exist.

* Made temporary print statement to determine why integration testing is failing.

* Made temporary print statement to determine why integration testing is failing.

* Made temporary print statement to determine why integration testing is failing.

* Made temporary print statement to determine why integration testing is failing.

* Made temporary print statement to determine why integration testing is failing.

* Made temporary print statement to determine why integration testing is failing.

* do_output_json should now create a file in lexmapr.fetched_ontologies during integration testing.

* do_output_tsv should now create a file in lexmapr.fetched_ontologies during integration testing.

* Got rid of temporary print statements used to debug failed integration testing.

* Update ontology_table when WebOntologies.csv has been updated.

* Abstracted reading of JSON for both Python 2 and 3 into function read_json(path).

* Change synonym value in ontology_table to a list of labels, in case there is more than one.

* Fit line to 99 characters.

* End python-2.7 Support (#33)

* pass partialMatchedResourceListSet as list to retainedPhrase()

* Updated tests for retainedPhrase()

* Updated small_simple expected test output

* Clean up test output files

* Sort outputs before printing, GComponent -> Component, add newline at end

* Show more detail when test output doesn't match expected output

* Print ontofetch/ontohelper to stderr. Update test output

* Use print function in py2.7

* Remove unused pretty printer

* Stop testing py27 start testing py37

* Drop py37 testing

* Make basic output the default (#34)

* pass partialMatchedResourceListSet as list to retainedPhrase()

* Updated tests for retainedPhrase()

* Updated small_simple expected test output

* Clean up test output files

* Sort outputs before printing, GComponent -> Component, add newline at end

* Show more detail when test output doesn't match expected output

* Print ontofetch/ontohelper to stderr. Update test output

* Use print function in py2.7

* Remove unused pretty printer

* Change default behavior to give basic output

* Update MANIFEST.in (#36)

* Fetch resources from online ontologies (#37)

* Update lexmapr.ontofetch and lexmapr.ontohelper.

* In response to https://git.io/fhhn1

* Began developing new ontology fetching mechanism.

* Goal: allow users to supply ontologies to fetch from the command line,
  as opposed to requiring users to specify ontologies to fetch in
  lexmapr/resources/WebOntologies.csv

  * Add -w or --web parameter to specify ontology url

  * Call ontofetch.py with url, if -w is specified

    * Implemented tests for this

  * Stub test for future implementation: allow user to specify root from
    command line as well

  * Comment out current call to get_ontology_table for now

    * The existing infrastructure for generating an ontology table is
      messy and elaborate; we may attempt to develop a cleaner mechanism
      based on the new command-line implementation

* Specify root term when fetching ontologies.

* Allow user to specify -r flag in ontofetch.py when fetching ontologies
  through LexMapr

  * Added -r/--root command-line argument

    * Conditional: can only be specified when -w/--web is specified

  * Adjusted call to lexmapr/ontofetch.py accordingly

  * Added new tests and adjusted old tests accordingly

* Create stub ontology lookup tables.

* Begin process of creating "ontology lookup tables" when fetching
  ontologies

  * Will contain resources needed for mapping terms

  * Created as JSON files in lexmapr/ontology_lookup_tables

    * Currently contain empty JSON objects

* Modified .gitignore to include cached resources

* Abstract calls to pipeline in
  lexmapr.tests.test_pipeline.TestOntologyMapping with helper function
  run_pipeline_with_args

* Implement testing for ontology lookup table creation

* Implement structure of ontology lookup tables.

* Ontology lookup tables now contain the keys needed for mapping

  * Values are currently stubs (empty dictionaries)

  * Implemented test to check this

* Main pipeline change:

  * Renamed key in lookup_table.json abbreviation_lower to
    abbreviations_lower

* resource_terms_ID_based in ontology lookup table.

* Implement generation of resource_terms_ID_based in ontology lookup
  table when fetching ontologies

* Moved generation of ontology lookup table structure to new function
  create_ontology_lookup_table_skeleton in lexmapr/pipeline.py

* More work in lexmapr/tests

  * Test generation of resource_terms_ID_based in ontology lookup table

  * Created tests/ontologies to store ontology files

    * This allows us to change test_pipeline.TestOntologyMapping tests
      (in the future), so they can fetch the ontologies in this file
      from a GitHub IRI in our repository

      * Important because the ontologies currently being fetched are not
        under our control and may change in the future

        * This could break our tests

  * Abstractions in test.pipeline.TestOntologyMapping

    * class attribute to store ontologies fetched during tests

    * Replace redundant tearDown with call to setUp

    * Wrap loading of fetched_ontologies/ and ontology_lookup_tables/ to
      dictionaries in static methods

* Use tests/ontologies in TestOntologyMapping.

* Replace calls to online ontologies in
  lexmapr.tests.test_pipeline.TestOntologyMapping with the IRIs of files
  in lexmapr/tests/ontologies on GitHub

* Fetch resource_terms and resource_terms_revised.

* Implemented generation of resource_terms and resource_terms_revised in
  ontology lookup table when fetching ontologies

  * Wrote tests for this

* Edit bfo test ontology to include more synonyms.

* This will make testing the generation of synonyms in the ontology
  lookup table easier

* Fetch synonyms.

* Implemented generation of synonyms in ontology lookup table when
  fetching ontologies

  * Wrote tests for this

* Check if fetched ontology tsv and ontology table json files exist
  before un-caching them during testing in TestOntologyMapping

* Functions for getting permutations.

* Replaced blocks of code in pipeline.get_all_resource_dicts with calls
  to two new functions: get_resource_permutation_terms and
  get_resource_bracketed_permutation_terms

  * Wrote tests for these functions

  * Removed unnecessary functions pipeline.find_left_r and
    pipeline.find_between_r, which were replaced with a simple line of
    code in get_resource_bracketed_permutation_terms

    * Also removed their tests
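
The permutation helpers named above can be sketched as follows. This is a minimal, illustrative reconstruction under stated assumptions, not the actual LexMapr implementation; the behaviour of returning an empty list for unbracketed terms matches a correction described later in this log.

```python
import itertools
import re

def get_resource_permutation_terms(resource_label):
    # Generate every token-order permutation of a resource label,
    # e.g. "pepper bell" from "bell pepper".
    tokens = resource_label.split()
    return [" ".join(p) for p in itertools.permutations(tokens)]

def get_resource_bracketed_permutation_terms(resource_label):
    # Unbracketed terms yield no bracketed permutations.
    if "(" not in resource_label:
        return []
    # Strip the brackets, then permute the remaining tokens.
    unbracketed = re.sub(r"[()]", "", resource_label)
    return get_resource_permutation_terms(unbracketed)
```

For example, ``get_resource_bracketed_permutation_terms("apple (raw)")`` would include ``"raw apple"`` among its results, while plain ``"apple"`` yields an empty list.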

* Fetch resource_permutation_terms.

* Implemented generation of resource_permutation_terms in ontology
  lookup table when fetching ontologies

  * Wrote tests for this

* Edit bfo test ontology to produce entity with two synonyms

  * We must test the ability to fetch multiple synonyms from one entity

* Renamed several incorrectly-named variables in TestOntologyMapping

* Handle multiple synonyms per entity.

* When generating synonyms in ontology lookup table, we will now
  recognize entities with multiple synonyms, and add each one
  accordingly

  * Modified test_ontology_table_synonyms to test this

* Correct get_resource_bracketed_permutation_terms.

* Modify return value of get_resource_bracketed_permutation_terms when
  there is no bracket in the term

  * Should be an empty list, because
    resource_bracketed_permutation_terms is empty in these cases

  * Modified tests accordingly

* Remove conditional checks in get_all_resource_dicts when getting
  permutations

  * Conditional check for a bracket is already in
    get_resource_bracketed_permutation_terms, so the number of
    resource_bracketed_permutation_terms for non-bracketed terms will
    remain at 0

  * We removed the else clause, so bracketed terms will now have
    resource_permutation_terms as well

    * I do not think this is something to worry about, and I do not
      think it will lead to false matches

      * Tests still pass

* Some renaming of variables

* Edit bfo test ontology to include bracketed terms.

* This will make testing the generation of
  resource_bracketed_permutation_terms in the ontology lookup table
  easier

* Fetch resource_bracketed_permutation_terms.

* Implemented generation of resource_bracketed_permutation_terms in
  ontology lookup table when fetching ontologies

  * Wrote test for this

* Restrict generation of resource_permutation_terms and
  resource_bracketed_permutation_terms in ontology lookup table to
  resources comprising fewer than seven tokens

  * As in pipeline.get_all_resource_dicts

  * Reduces performance overhead

* Docstrings, style and removing stale code.

* Wrote several docstrings

  * lexmapr/pipeline.py:

    * get_resource_permutation_terms

    * get_resource_bracketed_permutation_terms

    * create_ontology_lookup_table_skeleton

    * create_ontology_lookup_table

  * lexmapr/tests/test_pipeline.py

    * TestOntologyMapping

* Added blank lines between methods in lexmapr/pipeline.py to follow
  PEP8 style guidelines

* Removed stale code in lexmapr/pipeline.py corresponding to the old,
  flawed way of fetching online ontologies

  * Call to get_ontology_table_from_cache in run

  * Several functions:

    * fetch_ontology

    * get_ontology_terms

    * is_ontology_table_outdated

    * add_ontology_table_to_cache

    * get_ontology_table_from_cache

* Fill all fields in ontology lookup tables.

* Filled in the remaining fields of ontology lookup tables when fetching
  online ontologies

  * Just used the content from Gurinder's pre-defined csv resources

  * Wrote test that simply checks the content of each field is non-empty

* Map terms to resources from fetched ontologies.

* When the user specifies the -w flag, they will now map their
  input_file to resources from the corresponding ontology lookup table

* Ontology lookup tables are now generated from cache

  * Minimal implementation

    * If the -w argument matches an already existing ontology lookup
      table, **that table will be fetched from cache regardless of
      whether it is out of date or if it uses a different root**

    * To fetch a new version of a table, you must delete it from
      ontology_lookup_tables/

    * I think this is acceptable at this stage of development

* Had to split test_fetch_ontology_specify_root into two functions

  * Due to caching, we could no longer make a second call to pipeline
    and retrieve a new pizza_json value

* Replace command-line interface with JSON.

* Removed command-line arguments -w and -r in favour of a single -c
  argument

  * Path to JSON file with keys == IRI of ontologies to be fetched, and
    values == IRI of root terms in said ontologies

  * Enables fetching multiple ontologies at once

  * Modified tests to fit these new changes

    * Altering existing tests

    * Adding new tests

    * Unrelated re-ordering of some code

      * Nit-picking, but I think expected values should come before
        actual values in assertion statements

* Fetch terms from multiple ontologies at once.

* Replace generation of one lookup table corresponding to one ontology
  with generation of one lookup table corresponding to multiple
  ontologies

  * create_ontology_lookup_table_skeleton renamed to
    create_online_ontology_lookup_table_skeleton

    * Same function

  * create_online_ontology_lookup_table renamed to
    add_to_online_ontology_lookup_table

    * Slightly modified function

      * Instead of generating a lookup_table from scratch, modifies
        and returns existing lookup_table

  * Modified flow of operations

    * Name of lookup_table now corresponds to name of config file

    * Call create_online_ontology_lookup_table in run

    * Iterate over ontologies in config file

    * Add terms from each ontology to the lookup_table

    * Cache concatenated lookup_table

  * Implemented some tests for this new handling of multiple
    ontologies

* Update resources to Gurinder's latest version.

* Edited several files in lexmapr/resources to reflect changes made by
  Gurinder on his local machine

* Edited tests to reflect these changes

* Renamed cached online ontology lookup tables.

* When an online ontology lookup table is cached, the file name has a
  "lookup_" prefix attached to it

  * This is to prevent confusion between the config file and lookup
    table file, which had the same file name prior to this commit

* Adjusted tests accordingly

* Combined pre-defined and fetched resources.

* When utilizing online ontology fetching, samples will now be matched
  against both terms fetched from online ontologies and terms from
  Gurinder's pre-defined resource files

  * If there are overlapping resources, priority is given to resources
    from fetched ontologies

  * Changes in code

    * Leave non-fetched fields in online ontology lookup tables empty,
      as they are no longer needed

      * Pipeline gets those terms from the pre-defined resources which
        are now used alongside the fetched online resources

      * Removed corresponding test

    * New function: merge_lookup_tables

      * Used to combine pre-defined and fetched resources into one
        lookup table

      * Wrote tests

    * Modified run function to accommodate new strategy for matching
      samples

    * Renamed some variables in other tests to better distinguish
      between the lookup table containing only the fetched resources,
      and the lookup table containing both fetched and pre-defined
      resources
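
The merging step can be sketched as below. The function name ``merge_lookup_tables`` comes from this log, but the body and the key names are assumptions based on the lookup-table structure described in these notes, not the actual implementation.

```python
import copy

def merge_lookup_tables(predefined_table, fetched_table):
    # Start from a copy of the pre-defined table, then overlay fetched
    # entries so fetched resources win on key collisions.
    merged = copy.deepcopy(predefined_table)
    for section, entries in fetched_table.items():
        merged.setdefault(section, {})
        merged[section].update(entries)  # fetched values overwrite
    return merged

# Hypothetical sample data: "apple" appears in both tables, so the
# fetched ID takes priority in the merged result.
predefined = {"resource_terms": {"apple": "FOODON:1", "pear": "FOODON:2"}}
fetched = {"resource_terms": {"apple": "FOODON:9"}}
merged = merge_lookup_tables(predefined, fetched)
```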

* Brought in some of Gurinder's code.

* Gurinder wrote several functions on his machine to further modularize
  pipeline.run

* Ordered iteration over config files when fetching online ontologies (#39)

* Add new test ontology file for future extension.

* Added new file to lexmapr/tests/ontologies called pizza_two.owl

  * Same as pizza.owl, with three differences

    * Picante -> Picante_two

    * Media -> Media_two

    * Neopicante -> Neopicante_two

  * Will be used to test future extension: prioritising certain
    ontologies over others when fetching conflicting online terms

* Wrote stub test_ontology_table_resource_terms_prioritisation function

* Fixed some outdated variable names and error messages

* Edited test/ontologies/pizza_two.owl.

* We do not want to test the prioritisation of entities with different
  labels, but identical IRI values

  * It is assumed that different entities have different IRI values

  * Instead, we want to test the prioritisation of entities with the
    same labels, but different IRI values

* Change the labels in pizza_two.owl to match pizza.owl

* Change the IRI values of Hot, Medium and Mild entities in
  pizza_two.owl

* Edited test/ontologies/pizza_two.owl.

* Removed underscore in spiciness IRI values, to prevent ontohelper
  from using a different separator for pizza and pizza_two

  * Allows for cleaner tests

* Ordered iteration of online ontology config files.

* Implemented iteration over user-specified ontology IRI values in the
  order they are listed (from top to bottom) in the config file

  * Wrote tests for this

* typo

* Multiple root terms per ontology (#41)

* Cleaned up imports.

* Removed unused imports

* Prioritized import of modules over functions

* Separated standard-library, third-party and local imports with blank
  lines

* Wrote a TODO for a script-level docstring

* Modified config file structure.

* By using an outer JSON array and inner JSON objects, we can now map
  to multiple root terms from the same ontology

* Modified code in `pipeline.py`

* Wrote a test

* Modified test config files
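
A sketch of the revised config structure and its ordered traversal, assuming hypothetical IRIs: an outer JSON array preserves ontology order top to bottom, and each inner object can map one ontology IRI to several root terms.

```python
import json

# Hypothetical config contents (the real files live alongside the tests).
config_json = """
[
    {"http://example.com/pizza.owl": ["Spiciness", "Topping"]},
    {"http://example.com/bfo.owl": ["entity"]}
]
"""

config = json.loads(config_json)
fetch_order = []
for ontology_entry in config:  # iteration follows the file's listed order
    for ontology_iri, root_iris in ontology_entry.items():
        for root_iri in root_iris:
            fetch_order.append((ontology_iri, root_iri))
```

Using an array rather than a single object makes the iteration order explicit in the file itself, which matters for the prioritisation behaviour described elsewhere in this log.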

* Revert "Cleaned up imports."

This reverts commit 146c55e8983f88f6f2038fc4b73ef89ee5d5ee74.

I meant to start this branch from master, not general_cleanup.
This gets rid of the only commit in general_cleanup.

* Add dependency versions to setup.py

* Cache relative to caller (#42)

* Fixed retainedPhrase bug

* retainedPhrase was not removing repetitive elements from retainedSet
  due to an accumulator variable located in the incorrect scope

* Adjusted tests accordingly

* Also: small stylistic change in a docstring

* Simplify caching of lookup table

* Added resources

* Component matching treatment order (#43)

* Fixed Retained_Terms_with_Resource_IDs bug.

* Ontology IDs with a ":" were being split into multiple tokens, and
  only the first token was kept

* Fixed test to accommodate last commit.

* Fixed retainedPhrase bug

* retainedPhrase was not removing repetitive elements from retainedSet
  due to an accumulator variable located in the incorrect scope

* Adjusted tests accordingly

* Also: small stylistic change in a docstring

* Implement secondary method for component matching

* ``find_component_match`` becomes ``find_component_matches``

* New method introduced as ``find_component_match``

  * Contains code previously found in ``find_component_matches``

* Other minor changes in documentation

* Moved more code out of ``find_component_matches``

* Also deleted inner method ``handle_component_match``, and changed
  some documentation

* Simplify basic format (#45)

* Added resources

* Simplified basic output

* Removed "matched term" column

* Added ``Matched_Components`` column (with label in first row)

* Replaced output of component match with list of matched components
  with resource_id

* Adjusted tests accordingly

* Revert "Added resources"

This reverts commit c6474e15f9083569720cda6d168f705de3cb3879.

* Revert "Merge branch 'master' of github.com:Public-Health-Bioinformatics/LexMapr"

This reverts commit 471092de99aef1c5f3118129c60deaf49e47851f, reversing
changes made to 4c8556aef803fbcce96edb375889665af1643dd1.

* Revert "Added resources"

This reverts commit c6474e15f9083569720cda6d168f705de3cb3879.

* Simplified basic output

* Removed "matched term" column

* Added ``Matched_Components`` column (with label in first row)

* Replaced output of component match with list of matched components
  with resource_id

* Adjusted tests accordingly

* Added and updated resources (#47)

* Added and updated resources

* Modified tests to accommodate last commit

* Modify ontology resource_id values (#48)

* Now consistent with resource_id values from pre-defined resources

* Updated tests to reflect this change

* Fetch parents from online ontologies and add them to lookup tables (#49)

* Structural changes in lookup table creation

* Renamed ``create_online_ontology_lookup_table_skeleton`` to
  ``create_lookup_table_skeleton``

  * Will be used to create online ontology and pre-defined resource
    lookup tables

    * Now only have to define keys for both in one place

* Renamed ``get_all_resource_dicts`` to
  ``add_predefined_resources_to_lookup_table``

  * More accurate name

  * Takes advantage of ``create_lookup_table_skeleton``

* Renamed ``add_to_online_ontology_lookup_table`` to
  ``add_fetched_ontology_to_lookup_table``

  * More accurate name

* Change to test_pipeline.TestPipeline

  * Recreate lookup_table.json each time the test class is invoked

    * Allows for more up-to-date testing

* Improved testing environments

* Use pkg_resources to get the path of files, instead of os.path

* TestPipeline and TestOntologyMapping are now run in temporary
  directories

* Much cleaner code

* Other changes

  * Add ``parents`` key to lookup tables

    * Adjust tests accordingly

  * Changed some stale documentation

    * ``add_predefined_resources_to_lookup_table`` docstring

    * ``add_fetched_ontology_to_lookup_table`` docstring

* Update test config files and add __init__ files

* Add ``parent_id`` from ontologies to lookup table.

* Wrote additional tests for parent fetching

* Also replaced the ``:`` in fetched parents with a ``_``

* Also submitting a new test ontology ``envo.owl``

* Forgot to commit new test files last commit

* Modified test ontologies

* Modified test ontologies

* Implemented fetching of ``other_parents`` too

* Wrote test too

* Concatenate parents from different fetches

* Implemented and tested this functionality

* Renamed ``test_ontology_table_parents_multiple_parents_per_resource``
  to ``test_ontology_table_multiple_parents_per_resource``

* Prevent duplicate parents

* Implemented and tested

* Also committing test config files that I forgot to submit in previous
  commits

* Forgot to commit test config files again

* [WIP] Increased modularization (#50)

* Cleaned up imports in prep for modularization

* Also removed use of logger, and any references to it

* Also removed unused function ``assign_confidence_level``

* Moved functions to ``pipeline_helper.py``

* Moved functions in ``pipeline.py`` outside ``run`` to
  ``pipeline_helper.py``

* Renamed ``TestPipelineMethods`` to ``TestPipelineHelpers``, and
  modified tests accordingly

* Reduced number of output columns (#51)

* Modified test files accordingly

* Removed code that was no longer needed, due to it only being relevant
  in now-removed columns

* --version now independent of input file (#53)

* Remove case variants from lookup table (#54)

* Renamed key in lookup table

* ``resource_terms_ID_based`` now ``resource_terms_id_based``

* Removed case-variants from lookup tables

* Modified ``get_resource_dict``

  * Always converts data to lowercase now

  * Renamed parameter and updated docstring for clarity

* Removed ``resource_terms_revised`` and ``*_lower`` keys from lookup
  tables, and removed code responsible for the population of their
  values

* Adjusted code in many places accordingly

  * Removed several calls to lower(), and several comparisons with no
    longer existing dictionaries

  * Standardized pre-defined and fetched samples, along with other fetched
    values, to lower-case where appropriate

  * Adjusted tests

* Removed more calls to lower()

* Removed some stale code/comments about case

* [WIP] Bucket classification (#52)

* Add flag for optional bucket classification

* Flag is ``-b``, or ``--bucket``

  * Adjusted tests accordingly

* Appended ``test_`` prefix to folders inside ``tests/``

  * Modified references to said folders accordingly

* Generalized ``TestPipeline``

* Will make it easier to test bucket classifications, and any future
  extensions

* Also added a default value for ``-b``, ``--bucket`` flag

  * So no value needs to be specified

* Changed default of -b to False

* Output bucket headers when specified

* Implemented and tested

* Also changed some stale ``bucket`` values in test from ``None`` to
  ``False``

* Created stub ``classification_lookup_table``

* Implemented and tested

  * New test class ``TestClassification``

  * More keys added to lookup table skeletons

* Add resources to classification table

* Functionality in ``add_classification_resources_to_lookup_table``

* Other small changes

  * Renamed incorrectly-named variable

  * Eliminated unnecessary calls to ``os.path.abspath``

* Created new ``pipeline_classification.py`` file

* Moved code for creating a classification lookup table into this file

* Removed some stale imports

* Fixed one last merge conflict

* Changed action to intended value

* Began implementing ``classify_sample``

* Get default classification for sample

  * Rest of function is WIP

* Function called in ``pipeline.run``

* Modified ``find_full_term_match``

  * Cleaned up function docstring

  * Lists in return value no longer converted to strings

    * Provides more flexibility to the caller, who can convert it to a
      string if they wish, but may also require the list

      * ``classify_sample`` requires the list form of
        ``retained_terms_with_resource_ids``

* Changed parameter name in ``classify_sample``

* Also WIP implementation of ``classify_sample``

* Implemented method for retrieving parent hierarchy

* ``pipeline_helpers.get_term_parent_hierarchy``

* Tested in ``TestPipelineHelpers``
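
The hierarchy retrieval can be sketched as a walk up a ``parents`` mapping. The dictionary shape and the term IDs below are assumptions based on the lookup-table keys described in this log; this simple version follows the first listed parent (a later commit generalises it to return all possible hierarchies).

```python
def get_term_parent_hierarchy(parents, term_id):
    # Return the hierarchy from the term up to the root, term-id inclusive.
    hierarchy = [term_id]
    while term_id in parents:
        term_id = parents[term_id][0]  # follow the first listed parent
        hierarchy.append(term_id)
    return hierarchy

# Hypothetical parent mapping: Hot -> Spiciness -> entity.
parents = {"PIZZA_Hot": ["PIZZA_Spiciness"], "PIZZA_Spiciness": ["BFO_entity"]}
```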

* Added ``classification_lookup_table.json`` to ``.gitignore``

* WIP classify_sample

* Implement ``classify_sample_helper``

* Maps samples to pre-defined buckets using the samples' hierarchies

* Called in ``classify_sample``

* wip

* Implement ``refine_ifsac_micro_labels``

* Used to refine final ifsac classification based on pre-defined rules

* Called in ``classify_sample``

* Fixed stale code bug: ``resource_terms_ID_based`` is now
  ``resource_terms_id_based``

* Better documentation and variable names

* Output classification for non-component matches

* Renamed some variables and functions too

* New function ``get_resource_id``

* Has logic from now-removed ``get_component_match_withids``

* Iterative and string construction logic moved to ``pipeline.run``

* Refine ``partial_matches_with_ids``

* Now excludes elements that are ancestral to other elements

* Other changes:

  * Moved the scope of default classification in ``classify_samples``

  * Improved docstring for ``get_term_parent_hierarchy``
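
The ancestor-exclusion refinement can be sketched like this; the function name and the ancestor-lookup shape are stand-in assumptions, not the actual code.

```python
def remove_ancestors(matched_ids, hierarchies):
    # hierarchies: term id -> list of ancestor ids (term itself excluded)
    ancestors = set()
    for term_id in matched_ids:
        ancestors.update(hierarchies.get(term_id, []))
    # Keep only matches that are not an ancestor of another match.
    return [term_id for term_id in matched_ids if term_id not in ancestors]

# Hypothetical hierarchy: apple is a fruit, which is a food.
hierarchies = {"FOODON_apple": ["FOODON_fruit", "FOODON_food"]}
```

With both ``FOODON_apple`` and ``FOODON_fruit`` matched, only the more specific ``FOODON_apple`` survives, since ``FOODON_fruit`` is one of its ancestors.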

* Output for component classification

* Simpler output for non-full format

* Make component matching more accurate

* Attempt to match components without ``additional_processing`` first,
  and if a match is not found, then attempt with
  ``additional_processing``

  * Makes matches, and classification, more accurate

* Updated genomeTrackerMaster.csv

* Improved classification

* Map ``label_refinements`` keys to ``sample`` tokens in
  ``refine_ifsac_final_labels``, rather than ``sample`` substrings

* Make the return value of ``get_term_parent_hierarchy``
  ``term_id``-inclusive

  * Because some buckets map to ``term_id``

  * Modified tests accordingly

* ``get_term_parent_hierarchy`` renamed and modified

* Now called ``get_term_parent_hierarchies``

  * Returns nested list of all possible hierarchies

    * Clarified in function docstring

* Modified tests accordingly

* Modified calls to function accordingly

* Modified resources

* Remove duplicates from final buckets

* Updated ontofetch and ontohelper

* Updated resources

* wip

* wip

* wip

* wip

* Updated ``ontofetch.py`` and ``ontohelper.py``

* Modified test ontology and config files accordingly

* Modified ``pipeline.py`` to accommodate new ``ontofetch.py`` output

* wip

* wip

* Fetch exact, narrow and broad synonyms

* Implemented and tested

* Modify component matching

* Remove component matches that are subsets of larger component matches
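
A minimal sketch of the subset filter, assuming token-level containment (the actual matching logic may differ): a match whose tokens are a strict subset of another match's tokens is discarded.

```python
def remove_subset_matches(matches):
    kept = []
    for match in matches:
        tokens = set(match.split())
        # Drop this match if another match's tokens strictly contain its own.
        if not any(tokens < set(other.split()) for other in matches):
            kept.append(match)
    return kept
```

For example, given ``["bell pepper", "pepper"]``, only ``"bell pepper"`` is kept, while unrelated matches such as ``["apple", "pear"]`` pass through unchanged.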

* Updated resources

* Updated rules

* Updated resources

* Improved sample mapping to defaults/refinement

* Update resources and fix typos

* Updated resources

* Update resources

* Update rules and sample mapping

* Update resources and rules

* Update resources and rules

* Update rules

* Update resources

* Updated resources

* Updated resources and rules

* Update resources

* Updated resources and rules

* Updated resources and rules

* Updated resources

* Update resources and rules

* Update resources

* Updated rules

* Updated rules

* Made output more consistent

* Output same number of columns per row, no matter the circumstances

  * Previously, if a match was not found, the row would have fewer
    columns

    * Now, it has the same number of columns, but those columns are
      empty

* This will make it easier to parse the results
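
The fixed-width output can be sketched as padding unmatched rows with empty columns; the header names below are illustrative.

```python
HEADERS = ["Sample_Id", "Sample_Desc", "Matched_Components"]

def build_row(sample_id, sample_desc, matched_components=None):
    # Unmatched samples get an empty column instead of a shorter row,
    # so every row has exactly len(HEADERS) fields.
    return [sample_id, sample_desc, matched_components or ""]

matched = build_row("s1", "bell pepper", "bell pepper:FOODON_1")
unmatched = build_row("s2", "mystery sample")
```

Downstream parsers can then index columns by position without first checking whether a match was found.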

* Updated resources

* Update resources

* Updated resources

* Updated resources

* Updated resources

* Modified tests to accommodate new resources

* Remove qualities from lookup tables

* Modify tests and delete ``SemLex.csv`` accordingly

* Updated resources too

* Update to 0.1.3 (#56)

* Update lexmapr version

* Update Manifest

* Make urls in test config files up to date

* Updated readme (#57)

* Updated manifest (#58)

* Fix TravisCI & Coveralls Badges (#59)

* Some bug fixes (#60)

* Some bug fixes

* Read input file row by row, instead of loading all rows to memory and
  then reading them

* Catch both types of errors possible from ``dateutil.parser.parse``

* Accommodate bug in ``ontofetch.py`` that sometimes declares an entity
  as parent to itself

* Removed some stale code
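
The streaming change can be sketched with ``csv.reader``, which yields one row at a time instead of materialising the whole file in memory first (the sample data here is a hypothetical in-memory stand-in for an input file). The error-handling change amounts to catching both ``ValueError`` and ``OverflowError``, the two exception types ``dateutil.parser.parse`` documents for unparseable or out-of-range input.

```python
import csv
import io

# Hypothetical stand-in for an input file on disk.
input_file = io.StringIO("Sample_Id,Sample_Desc\ns1,bell pepper\ns2,chicken breast\n")

rows_seen = []
reader = csv.reader(input_file)
next(reader)  # skip the header row
for row in reader:  # rows are read lazily, one at a time
    rows_seen.append(row)
```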

* Re-implemented some code taken out last commit

* Remove dates from cleaned samples

* Adjusted tests accordingly

* Update to 0.1.4

* Substitute erroneous break with continue

* Updated buckets

* Removed all candidate terms

* Remove terms with weird IDs

* Added buckets and modified rules

* Stop mapping to processes and collocations

* Update tests

* Fix ``punctuationTreatment`` bug

* " " being returned instead of ""

* Update tests

* Updated the different resource lookup tables (#61)

* Updated the different resource lookup tables with additional curated entries

* Update tests to reflect changes in resources

* Removed abbreviation and updated test output

* Changed pipeline_helpers.py and updated the different resource lookup tables  (#62)

* Updated the different resource lookup tables with additional curated entries

* Update tests to reflect changes in resources

* Removed abbreviation and updated test output

* Changed pipeline_helpers.py to look for synonyms in full-term match and component match cases before the suffix addition rule is applied. Updated the different resource lookup tables and deleted some previously used resource files.

* Update pipeline_helpers.py

Incorporated the reviewed and suggested changes in pipeline_helpers.py

* Update tests

* Updated and sorted resource (#64)

* Updated the different resource lookup tables with additional curated entries

* Update tests to reflect changes in resources

* Removed abbreviation and updated test output

* Changed pipeline_helpers.py to look for synonyms in full-term match and component match cases before the suffix addition rule is applied. Updated the different resource lookup tables and deleted some previously used resource files.

* Update pipeline_helpers.py

Incorporated the reviewed and suggested changes in pipeline_helpers.py

* Update tests

* Updated and sorted resource files to reflect latest changes in resources

* Fix encoding

* Merge update_resources to master (#68)

* Updated the different resource lookup tables with additional curated entries

* Update tests to reflect changes in resources

* Removed abbreviation and updated test output

* Changed pipeline_helpers.py to look for synonyms in full-term match and component match cases before the suffix addition rule is applied. Updated the different resource lookup tables and deleted some previously used resource files.

* Update pipeline_helpers.py

Incorporated the reviewed and suggested changes in pipeline_helpers.py

* Update tests

* Updated and sorted resource files to reflect latest changes in resources

* Fix encoding

* Update resources and tests (#66)

* Update resources (#67)

* Updated and sorted resource (#64)

* Updated the different resource lookup tables with additional curated entries

* Update tests to reflect changes in resources

* Removed abbreviation and updated test output

* Changed pipeline_helpers.py to look for synonyms in full-term match and component match cases before the suffix addition rule is applied. Updated the different resource lookup tables and deleted some previously used resource files.

* Update pipeline_helpers.py

Incorporated the reviewed and suggested changes in pipeline_helpers.py

* Update tests

* Updated and sorted resource files to reflect latest changes in resources

* Fix encoding

* Update resources and tests

* Cache files independent of cwd (#69)

* Files now cached in ``lexmapr/cache``

  * Regardless of cwd

* Simplify, and update ``MANIFEST.in``

* New file ``definitions.py`` to store global variables

  * Only stores absolute path of ``lexmapr/`` right now

    * Shortens many lines of code

* Update tests to accommodate changes

* Other minor changes
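
A hedged sketch of what such a ``definitions.py`` might contain: paths are pinned to the package directory, so cache lookups resolve identically regardless of the current working directory. Names here are assumptions modelled on the description above.

```python
import os

try:
    # Absolute path of the directory containing this file
    # (stand-in for the lexmapr/ package directory).
    ROOT = os.path.dirname(os.path.abspath(__file__))
except NameError:  # e.g. an interactive session without __file__
    ROOT = os.getcwd()

CACHE_DIR = os.path.join(ROOT, "cache")

def cache_path(file_name):
    # Resolve a cache file relative to the package, not the cwd.
    return os.path.join(CACHE_DIR, file_name)
```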

* Prepare for 0.2.0

* ``--no-cache`` flag (#70)

* Cache files independent of cwd

* Files now cached in ``lexmapr/cache``

  * Regardless of cwd

* Simplify, and update ``MANIFEST.in``

* New file ``definitions.py`` to store global variables

  * Only stores absolute path of ``lexmapr/`` right now

    * Shortens many lines of code

* Update tests to accommodate changes

* Other minor changes

* Add --no-cache flag

* Update resources

* 0.3.0

* Improved documentation (#71)

* Tutorial slides

* For people with little to no experience working with command line

* Improved README

* Improved README

* Fixed logo

* Recognize tsv input files (#72)

* Recognize tsv input files

* Tested

* Also added type checking of input files at argparse level

* Forgot new test files last commit

* 0.4.0

* Delete file that resulted from merge conflict
@ivansg44 ivansg44 deleted the update_manifest branch October 15, 2019 21:37