A set of tools for collecting text data on medical topics.
We are working on a medical project that needs a large amount of training data. Medical records are usually private and not accessible to us, so we collect less private data from social media instead.
This program generates a list of medical terms and collects text files related to each term.
For example, the folders in this repo contain text collected for "breast cancer" (the term list was generated by searching for "breast cancer" in the detail files). The dictionary file shows which terms are related to "breast cancer", and the twi_collection files contain text collected from Twitter.
mediterm.py: Generate a local medical dictionary by running a scraper.
related.py: For a particular topic, find all related terms and generate a wordlist.
twi_collection.py: Collect text from Twitter.
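
The related-term lookup can be pictured with a minimal sketch. It is not the code in related.py: it assumes the dictionary is already loaded as an in-memory mapping from term to detail text, and the function name find_related_terms and the sample entries are hypothetical, made up for illustration.

```python
def find_related_terms(details, topic):
    """Return the sorted terms whose detail text mentions the topic.

    `details` is assumed to map each term to its detail text; the match
    is a plain case-insensitive substring search.
    """
    needle = topic.lower()
    return sorted(term for term, text in details.items() if needle in text.lower())


# Hypothetical dictionary entries, not taken from the real dictionary file.
sample_details = {
    "mammogram": "An X-ray picture of the breast, used to screen for breast cancer.",
    "lumpectomy": "Surgery to remove a tumor, often used to treat early breast cancer.",
    "insulin": "A hormone that regulates blood sugar.",
}

wordlist = find_related_terms(sample_details, "breast cancer")
print(wordlist)  # ['lumpectomy', 'mammogram']
```

A substring search is the simplest possible matcher. It misses spelling variants and synonyms, so the real wordlist may be built differently.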
A classifier demo is in the classify folder; it runs with Python 3.
The idea is to use a neural network to classify a sentence (in this demo, whether it is about "state fair" or "breast cancer"). With this kind of classification, we can tag the text files and make better decisions in later stages, such as picking the best language model to use.
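
To give a feel for the two-topic classification, here is a minimal sketch in plain Python. It is not the demo in the classify folder: it swaps in a single-neuron model (logistic regression over bag-of-words features) as a stand-in for the neural network. The training sentences, function names, and hyperparameters are all assumptions made up for illustration.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=200, lr=0.5):
    """Fit one weight per word plus a bias with online gradient descent.

    `samples` is a list of (text, label) pairs, where label 1 means
    "breast cancer" and label 0 means "state fair".
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in samples:
            tokens = tokenize(text)
            pred = sigmoid(bias + sum(weights.get(t, 0.0) for t in tokens))
            err = label - pred  # gradient of the log-likelihood
            bias += lr * err
            for t in tokens:
                weights[t] = weights.get(t, 0.0) + lr * err
    return weights, bias


def classify(model, text):
    """Return the predicted topic name for a sentence."""
    weights, bias = model
    score = sigmoid(bias + sum(weights.get(t, 0.0) for t in tokenize(text)))
    return "breast cancer" if score > 0.5 else "state fair"


# Made-up training sentences, not taken from the collected data.
samples = [
    ("she starts chemotherapy for breast cancer next week", 1),
    ("a mammogram can detect a tumor early", 1),
    ("the doctor found a lump in her breast", 1),
    ("we ate fried food at the state fair", 0),
    ("the state fair has rides and a ferris wheel", 0),
    ("tickets for the fair go on sale today", 0),
]

model = train(samples)
print(classify(model, "my aunt was diagnosed with breast cancer"))
print(classify(model, "let's go to the state fair this weekend"))
```

With only six training sentences the model simply memorizes topic words. A real classifier would need far more data and a held-out test set.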