Written by Maximillian von Briesen (mobyvb), Anna Laura Davenport (annalaura), Raphael Gontijo Lopes (iRapha), and Juliet K. Benjamin (julietkb) at MHacks V in 2015
##Concept
This Text Simplifier is a beta version of a larger idea. The idea is simple: to help improve global literacy rates. But it is hard to implement. Half of children not attending school come from conflict-afflicted areas. So the problem becomes not bringing them to education, but bringing education to them. The obstacle is the lack of stability and access to resources. Textbooks and internet-dependent programs can be expensive or impossible to obtain.
Our idea is a tool that simplifies text. The essential concept is that a child, teen, or adult can type in a subject, choose a level, and have access to an article on their level to read. Then once they understand, they can choose a harder level. The program could be universal and distributed on tablets to households.
Since MHacks V is just 36 hours, what we have created is a beta version that uses the internet. Our text simplifier is a Chrome extension that edits a page and makes it easier to read by grade level. There is currently only one setting, and since our team is very new to natural language processing, it's not an incredibly drastic change. But it works.
##How it works
The tool currently uses a list of the 10,000 most commonly used English words, ordered by frequency of use. Whenever it encounters a word that is less common than a set frequency threshold, it replaces that word with the best-fit synonym from the list of frequently used words. If it cannot find a suitable synonym and the word is an adjective or an adverb, it removes the word entirely, since in many cases sentences keep their meaning without these types of words.
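A minimal sketch of the replacement rule described above, not the extension's actual code. The tiny `COMMON_WORDS` list, the `SYNONYMS` table, the coarse part-of-speech labels, and the choice of "most frequent common synonym" as the best fit are all assumptions made for illustration.

```python
# Hypothetical stand-ins for the real 10,000-word frequency list and a thesaurus.
COMMON_WORDS = ["the", "of", "and", "big", "use", "help", "fast", "very", "dog", "ran"]
SYNONYMS = {"utilize": ["employ", "use"], "enormous": ["huge", "big"], "assist": ["help"]}
REMOVABLE_POS = {"ADJ", "ADV"}  # assumed coarse tags for adjectives and adverbs


def simplify(tagged_words, max_rank=10_000):
    """Replace or drop words that are not among the `max_rank` most common words.

    `tagged_words` is a list of (word, part_of_speech) pairs.
    """
    # Lower rank means more frequent, because the list is ordered by frequency.
    rank = {word: i for i, word in enumerate(COMMON_WORDS[:max_rank])}
    result = []
    for word, pos in tagged_words:
        key = word.lower()
        if key in rank:
            result.append(word)  # already common enough, keep as-is
            continue
        # Only synonyms that are themselves common words are acceptable.
        candidates = [s for s in SYNONYMS.get(key, []) if s in rank]
        if candidates:
            # Assumption: "best fit" is taken to mean the most frequent candidate.
            result.append(min(candidates, key=rank.__getitem__))
        elif pos in REMOVABLE_POS:
            continue  # uncommon adjective/adverb with no common synonym: drop it
        else:
            result.append(word)  # nothing better available, keep the original
    return result
```

For example, given `[("the", "DET"), ("enormous", "ADJ"), ("dog", "NOUN"), ("ran", "VERB"), ("swiftly", "ADV")]`, this sketch swaps "enormous" for "big" and drops "swiftly".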
This tool also has a method for finding the grammatical structure of a body of text by using the Natural Language Toolkit (NLTK). The NLTK has a tool that splits a sentence into word tokens and assigns a part of speech (POS) to each one. We then created a program that recognizes certain combinations of POS tags and groups them into phrases. These phrases are then represented as subtrees in a tree that represents a sentence. This way, in the future, we can find subtree "chunks" that can be removed or reworded, which should be a much more reliable way of simplifying. The problem now is efficiency, as navigating and manipulating the tree can be very slow. Here's an example of a sentence being parsed into a tree based on grammatical structures using Stanford's POS tagger.
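As a rough illustration of the phrase-grouping step described above (not the project's actual parser), the sketch below takes tokens that have already been tagged and groups determiner, adjective, and noun runs into noun-phrase subtrees. It assumes Penn Treebank style tags such as `DT`, `JJ`, and `NN`, which is what `nltk.pos_tag` produces by default; the function name and the single noun-phrase pattern are hypothetical.

```python
def chunk_noun_phrases(tagged):
    """Group [DT] JJ* NN+ runs into ("NP", [...]) subtrees under an "S" root.

    `tagged` is a list of (word, tag) pairs. Leaves stay as (word, tag) pairs;
    subtrees are (label, list_of_children) pairs.
    """
    children, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":  # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1].startswith("JJ"):  # adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:  # at least one noun was found, so this run is a noun phrase
            children.append(("NP", tagged[i:k]))
            i = k
        else:  # no noun phrase starts here, keep the token as a plain leaf
            children.append(tagged[i])
            i += 1
    return ("S", children)


sentence = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
            ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
tree = chunk_noun_phrases(sentence)
```

Once phrases are isolated like this, a later pass could drop or reword a whole subtree instead of a single word. NLTK's `RegexpParser` offers grammar-based chunking along the same lines.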
##Examples
###Readability changes of different articles:
"Readability" below refers to the Flesch-Kincaid Reading Ease score. A higher number means it is more readable by the formula detailed here.
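For reference, the standard Flesch Reading Ease formula is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The sketch below applies it with a crude vowel-group syllable counter and simple regex-based sentence splitting; both are assumptions made for illustration, so its scores will differ somewhat from those of dedicated readability tools.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per run of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Return the Flesch Reading Ease score; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Short sentences made of short words score high, while long sentences full of polysyllabic words score low and can even go negative.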
This does not in any way mean our application works 100% of the time. Sometimes information can be lost or changed due to the fallibility of our method.
|Link|Readability before|Readability after|
|---|---|---|
These are relatively small changes in scores, because most of what we return is the same as the original. This is in large part due to how many words we are using to determine whether words are "too difficult". We're currently using the 10k most common words in English, reduced to about 7k common word stems. If this is reduced, the readability score improves, but once again, we run a higher risk of losing information.
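To make the role of the word-count cutoff concrete, here is a small sketch of reducing a frequency-ordered word list to a set of stems and testing words against it. The `naive_stem` suffix stripper is a deliberately crude, hypothetical stand-in for a real stemmer (for example NLTK's `PorterStemmer`), and the sample word list is made up.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_set(frequency_ordered_words, limit):
    """Stems of the `limit` most common words; many words share one stem."""
    return {naive_stem(w) for w in frequency_ordered_words[:limit]}


def is_difficult(word, stem_set):
    """A word counts as "too difficult" when its stem is not in the set."""
    return naive_stem(word) not in stem_set


words = ["walk", "walks", "walked", "walking", "run", "runs", "unicorn"]
small = build_stem_set(words, 6)  # cutoff excludes "unicorn"
large = build_stem_set(words, 7)
```

Here six words collapse to two stems, and lowering the cutoff from 7 to 6 is enough to make "unicorns" count as difficult. That is the same tradeoff as above: a smaller list flags more words and changes more of the text.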
###Readability changes based on number of "non-difficult" words, using the article at http://en.wikipedia.org/wiki/Unicorn:
Once again, even though the readability score improves, the article does not necessarily retain all of its meaning. In fact, using something as low as the 1000-word setting below makes the article unreadable. It is a tradeoff that we haven't fully explored yet.
|Number of "non-difficult" stem words|Readability after|
|---|---|
As an example of what our extension changes, here is a sentence before and after going through our program:
Phonetics studies acoustic and articulatory properties of the production and perception of speech sounds and non-speech sounds.
Phonetics studies and properties of the production and perception of speech sounds and non-speech sounds.
Note that "acoustic" and "articulatory" are removed because they are uncommon adjectives. Also note that although these are removed, the sentence still makes sense and the difference in meaning is negligible. Of course, because this version does not analyze grammatical structure, the leftover "and" was not removed, but this extension is a prototype, and these are things that can be fixed in the future.
- Clone this repository
- Go to chrome://extensions in Google Chrome
- Check "Developer mode" in the upper right and then click "Load unpacked extension" in the upper left
- Navigate to the chrome folder of this repository then click "Select" in the popup window
- Follow the steps here for installing NLTK, a python library we make heavy use of
- Follow the steps here for installing the NLTK data. We selected the "all" option described there
- Make sure pip is installed on your machine
- Go to the server folder in the terminal
- Run `pip install flask` and `pip install beautifulsoup4`. If either doesn't work, add `sudo` to the beginning of the command
- In the terminal, navigate to the server folder of this repository
- In Chrome, go to a page you want to simplify. We have tested primarily with Wikipedia pages
- Click the extension icon and wait a little bit after it's highlighted for the page text to be replaced by simpler text
- You can compare the readability of the two blocks of text returned (our simplified version and the original version) at Readability-Score.com