1. Project rationale (Social Science Chapter)
This Rationale attempts to explain why this program exists, why you should care, and where its focus lies. Since there are all sorts of people who might find the 2.-Open-Table-Explorer useful, I will pitch to three groups:
- environmentalists (who will be interested first in 3.-Home-Energy-Explorer, then in collecting data on the environment),
- environmental skeptics (who would be interested in collecting data to refute the claims of the environmentalists),
- corporate types (who will be focused on data that helps them make money, and who will see this project as a data-mining tool).
My hope is that all of the above can cooperate in this project. It is further my hope that all of the above may actually share in making various data sources more available and useful, although their interpretations may vary.
Examples of the type of analysis this project hopes to support include Why Cities Keep Growing, Corporations and People Always Die, and Life Gets Faster.
Environmentalists would like to save the world, but don't know exactly how. A key to this is moving toward more sustainable energy production and consumption. Another key is building on the brain power of many distributed activities across the Internet. While advocacy is important, to be credible, this advocacy must be built on a solid foundation of transparent engineering and data analysis. How can a tiny open source project hope to address such a large, seemingly intractable goal? As an individual I can make my small contribution by providing the right glue code to connect the existing pieces of the solution, and a framework in which to concretely address the larger problems. If we hope to see far we must stand on the shoulders of giants. I hope to see a better world by standing on the backs of the Internet, SQL databases, statistical analysis, and the brain power of engineers. Your vision may vary.
Another human frailty this project seeks to address is source amnesia, where we forget or choose not to remember where our beliefs come from. Can we shame ourselves into linking our beliefs to the evidence they are based on?
In some sense, the powerful self-delusion of the narrative fallacy dooms a project like this; people will find data analysis work boring and ever so unappealing. But the few that understand can potentially have outsized influence, since in the land of the blind the one-eyed man is king (but may be considered a madman or a nerd). The narrative fallacy is often propagated by journalists who love data but are math phobic (e.g. daily business newscasts full of numbers but devoid of understanding of the risks the economy faces).
Most of the Internet today consists of narrative. If the narrative fallacy is a useful warning, we should be asking: where's the data behind the tremendous number of claims on the Internet today? Ideally I'd like to see data to support each empirical claim on Wikipedia. Wikipedia asks its contributors to provide citations to existing literature to justify their claims. This is a laudable goal that helps improve the quality of Wikipedia, but I would like to further ask contributors to cite the data upon which their claims are based.
Some have described the Internet as an attention economy.
Presumably you have heard of the benefits of Free and Open Source Software [FOSS] to humanity. I will not repeat them here but recommend the classic anthropological study "The Cathedral and the Bazaar". As software consists of both source and data, I do not believe enough attention has been paid to Open Data.
As a motivation I will discuss my experience with Quicken. For years (from some of the earliest versions) I used Quicken to record my financial data. I abandoned Quicken when I realized that, after years of exhaustive work entering my data, I did not own that data. I could not use my data in the ways I believed I had every right to. Many handy data structures could not be imported from or exported to text files. When I ran out of disk space, my Quicken data file became corrupted, and since I had forgotten to back up my data for a month or so (this was in the days when disk space was quite a bit more expensive), all of my changes since the last backup were lost. Also, reading forums on the web, it seems that Intuit was not just doing the understandable thing of storing data in binary format for improved speed, but was also using a secret password and possibly encryption to prevent any binary access.
The other major problem is that programs (even open source ones) create data silos, where data cannot be easily shared between programs. Anyone with a little familiarity with database systems knows that separate copies of the same data are inevitably inconsistent and incomplete.
Access to binary files is trivial to add (and will probably be done on an unknown schedule); doing it successfully is much harder. I expect the evolution of this project to proceed through the following steps:
- text-only support,
- proper escaping so that binary data does not break text tools,
- dual-use analysis functions such as diff and character and string frequency analysis,
- binary data type statistical analysis to discriminate different word sizes and float versus integer encodings.
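The first two steps above hinge on telling text apart from binary data. As a minimal sketch of what the character-frequency analysis step might look like, the following plain Ruby estimates how much of a byte string falls in the printable ASCII range; the function names and the 95% threshold are illustrative assumptions, not part of any existing Open Table Explorer API.

```ruby
# Hypothetical sketch: classify raw bytes as text or binary by the
# fraction of printable ASCII (plus tab, newline, carriage return).
def printable_fraction(data)
  return 0.0 if data.empty?
  printable = data.each_byte.count do |b|
    [9, 10, 13].include?(b) || (32..126).cover?(b)
  end
  printable.to_f / data.bytesize
end

# Threshold of 0.95 is an arbitrary illustrative choice.
def likely_text?(data, threshold = 0.95)
  printable_fraction(data) >= threshold
end

csv_sample = "city,year,kwh\nSpringfield,2001,150\n"
blob       = "\x00\x01\x02\xFF".b
```

A real implementation would go further (byte histograms, word-size periodicity, float-versus-integer bit patterns), but even this crude test is enough to route a file toward text tools or toward binary analysis.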
This project, for legal reasons, will probably avoid any serious attempt on encrypted data. But since there are legal reasons to decrypt one's own data, this project will not be antagonistic to other projects that do address encrypted data. Since this program supports shell access, encryption could even exist as unsupported plug-ins. Perhaps the Debian model of putting software with different legal restrictions into separate repositories could apply here. (2.3.4.-Please insert brain power here.)
While the previous section on openness is quite expansive and all-inclusive, mere humans need to focus on more limited goals to produce any useful software on a reasonable schedule. So this project focuses only on table data structures at this time. A table consists of rows and columns, and is regarded by database usability researchers as one of the best-understood data structures for ordinary users. SQL databases have given tables rigorous definitions, and the commercial data processing of the world has proven the practicality of tables in numerous fields demanding efficiency, consistency, error control, transparency, and security. Thus it would be foolish not to build on one of the towering achievements of computer science. Ruby on Rails version 3 even goes one better than SQL by providing a relational algebra for SQL. While relational SQL databases provide a mathematically sound approach to the analysis of digital data, statistical analysis packages bring clarity to analog data. Fortunately, statistical analysis programs support tables quite well. The first extension beyond simple tables is likely to be object-oriented single inheritance as supported by 2.2.2.-ActiveRecord. This extension is complicated by the fact that I fear it would break the R interface. (2.3.4.-Please insert brain power here.)
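To make the single-inheritance extension concrete: ActiveRecord models it by storing all subclasses in one table, with a `type` column recording which subclass each row belongs to. The following is a minimal plain-Ruby sketch of that idea, without Rails; the class names and columns are invented for illustration and are not the project's actual schema.

```ruby
# Hypothetical sketch of single-table inheritance: one row format,
# with the `type` column naming the subclass to instantiate.
class Reading
  attr_reader :value

  def initialize(value)
    @value = value
  end

  # Instantiate the subclass named in the row's `type` column,
  # roughly as ActiveRecord does when loading an STI table.
  def self.from_row(row)
    Object.const_get(row[:type]).new(row[:value])
  end
end

class EnergyReading < Reading; end
class TemperatureReading < Reading; end

rows = [
  { type: "EnergyReading",      value: 42.0 },
  { type: "TemperatureReading", value: 21.5 }
]
records = rows.map { |r| Reading.from_row(r) }
```

The complication noted above is visible even here: a flat statistical tool such as R sees only the shared columns, so the subclass distinction carried by `type` is easily lost on export.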
This is an attempt to provide useful software by focusing on a narrow range of functionality first. An explorer focuses on data navigation and simple, basic data analysis.
This is an attempt to focus on what a single person can comprehend and afford. The same view applies equally to a small business, or even to a single person's view of a corporate monstrosity.
Energy is a concept from physics with huge applicability and yet the average person has a workable understanding. Energy is often highly correlated with economics. Our energy savings could conceivably fund this project.
Ruby was chosen as an ideal tool to parse Internet text data. I would have used Perl, but my Perl code has proven unmaintainable. As a dynamically typed scripting language, it is well suited to the project goal of interactively discovering the data types of tables from unknown sources.
I want to interact with my programs and data from any computer in my house or elsewhere. I wanted a rich graphical interface that people would not need to be trained to use. These requirements led to the selection of a browser interface. Ruby on Rails was the interesting open source package that supported Ruby, database, and web interfaces.
The Internet currently consists of vast stores of data, but the human brain is the bottleneck in its understanding. Some of this complexity is inherent in the problems we are trying to solve, but much of the complexity is artificial and can be hidden by the right glue software. One obvious problem of the Internet is that this vast amount of data is not in a form where it can be computationally combined. In contrast, databases provide an algebra of selection, projection, aggregation, and grouping for any digital data. Since most large web servers on the Internet are in fact built on databases, translating this data into your own database is fairly straightforward.
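The relational operations named above are simple enough to demonstrate without a database at all. This plain-Ruby sketch applies selection, projection, and grouping with aggregation to rows held in memory as an array of hashes; the table and column names are invented for illustration.

```ruby
# An in-memory "table": each row is a hash keyed by column name.
rows = [
  { city: "Springfield", year: 2000, kwh: 120 },
  { city: "Springfield", year: 2001, kwh: 150 },
  { city: "Shelbyville", year: 2000, kwh: 90 }
]

# Selection: keep rows matching a predicate (SQL WHERE).
recent = rows.select { |r| r[:year] >= 2001 }

# Projection: keep only some columns (SQL SELECT city, kwh).
projected = rows.map { |r| r.slice(:city, :kwh) }

# Grouping and aggregation: total kwh per city (SQL GROUP BY + SUM).
totals = rows.group_by { |r| r[:city] }
             .transform_values { |rs| rs.sum { |r| r[:kwh] } }
```

A real SQL engine does the same algebra with indexes and query planning, which is why pulling web data into your own database makes it computationally combinable.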
As powerful as database management programs are in the digital domain, they are rather feeble in the analog domain. The type of software that provides powerful analog analysis is the statistical analysis program. Such packages can uncover the hidden structure in our data. While it is often said that "correlation is not causation", it is hard to imagine prediction and control without correlation.
The most inspiring approach to statistics is Exploratory Data Analysis as developed by John Tukey and Paul Velleman. Exploratory Data Analysis seeks to provide interactive direct manipulation of the visual display of quantitative information, and seeks to free the data analyst from the tedium of statistical mathematics. The program Data Desk implemented these principles well over a decade ago, but has not been updated substantially since. I know of no open source equivalent, though there have been many attempts which could be supported. In the Open Table Explorer, I hope to support a nice feature of Data Desk: suggested data analyses based on the types of the columns in your table.
One feature I have not seen is a response time control, since statistical databases can easily be huge, with millions of records. It would let you express your delay tolerance with choices like: interactive, near real time, coffee break, overnight, and whatever it takes. The program would subsample your data sets so that your analysis takes no longer than you can stand. Typically quick response is required in the early exploratory phases, and long responses are tolerable for confirming models already developed. Subsampling is also required when the combinatoric explosion of RAM memory limits multi-factor analysis.
Retrieving data from other people's databases and attempting to assert copyright to the result involves the copyright concept of derivative works. The copyright of a derivative work belongs to the original document owner. How much processing must be done to the inputs to create a non-derivative work? Traditionally, if the processing was done by brains, the new work was not derivative if originality had been added. The constitutional purpose of copyright is to promote science and learning. Copyright law (particularly fair use) has not kept pace with the growth of the Internet. Professional computer scientists, such as those from the ACM, have been warning about this for decades, but current political lobbying trends have led to a stalemate between de facto fair use and de jure copyright law. My only advice here is the same as my boss used to give me after she had given contradictory instructions: "Do the right thing." So you are not all off the hook for respecting copyright law, but must weigh the good you do against the harm you can reasonably cause and the legal risks you can afford to take.
Eric Raymond's essay on potlatch economics suggests that if I write cool software, I will be thought of as altruistic and useful and qualify as a big chief. Actually I'm hoping for something more like: I scratch your back, you scratch mine. That is, if I write software useful to you, you will help write software useful to me. Fortunately in software, unlike the physical world, one bit of software can scratch my back and your back simultaneously. More abstractly, economics is the allocation of value-added between producers, suppliers, users, etc. Currently the value added is more potential than actual, so I, as a producer, am funding (out of my retirement savings) future value added to myself as a user. I purchase energy-measuring electronic devices from suppliers. I have quit buying such devices until I've gotten the ones I already have to work. In the future, other producers who believe in this vision could contribute code. As the code matures and produces value-added for users, they can contribute back code, documentation, or cash. Electronic device manufacturers may want to contribute equipment for review, or support, or technical data to counter bad reviews. In the attention economy, this project lacks sex appeal, but should attract the support of thoughtful and talented people.