forked from LinkedInAttic/datacl
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.html
95 lines (87 loc) · 5.16 KB
/
README.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
<HTML>
<div style="background-image: url("");" id="preview" class="shadowbox"><h1 id="datacl">datacl</h1>
<p><strong>datacl</strong> is a collection of data-wrangling utilities. While Hadoop has
emerged as the de-facto standard for large data analysis tasks, there are
often a large number of simple data manipulation operations that need to
be performed in interactive mode as a post-processing step. The data for
post-processing is often small enough to fit on a user's desktop.
<strong>datacl</strong> is collection of efficient utilities for a data scientist.</p>
<p>In our experience, the life of a data scientist is often less glamorous
than the public perception. Accelerating the speed of inquiry --- both
in terms of run time and scripting time --- significantly decreases some
of the pains that come with the territory, opening up time to do
higher-level thinking.</p>
<p>While <strong>datacl</strong> was designed primarily as a useful toolkit for data
scientists, we believe it can be used in other contexts as well. A case
in point is our paper entitled "In Data Veritas --- Data Driven Testing for Distributed
Systems", which was presented at DBTest2013, a SIGMOD workshop in New York in 2013.</p>
<p>We believe by open sourcing, we can both contribute to the community
while benefiting from their feedback and contributions.</p>
<p>The name is inspired by Tcl. Ousterhout writes</p>
<blockquote>
<p>The motive behind this model is that standard compiled languages are good for writing low-level algorithms to perform the basic computational tasks required, but tedious for less computational, more descriptive tasks like defining interfaces. Tcl's implementation of the dichotomy, for example, basically involves writing functions in C, exposing them as Tcl procedures using the fairly simple Tcl c API, and then writing scripts to tie the functions together and create a working application out of it."</p>
</blockquote>
<hr>
<h1 id="installation">Installation</h1>
<ul>
<li>Copy the file datacl.tgz into a directory and cd to that directory</li>
<li>tar -zcvf datacl.tgz</li>
<li>cd src </li>
<li>make</li>
<li>You should get a single file called <strong>q</strong></li>
<li>cd ../test/</li>
<li>You will see a number of examples and unit tests over here. Each sub-directory has a file called <em>test.sh</em> which can be executed</li>
</ul>
<h2 id="you-need-to-set-2-environment-variables-">You need to set 2 environment variables.</h2>
<ul>
<li><strong>Q_DOCROOT</strong> This is the directory where the meta data will be stored</li>
<li><strong>Q_DATA_DIR</strong> This is the directory where the data will be stored</li>
<li><strong>Q_LOGFILE</strong> Name of log file </li>
</ul>
<hr>
<h1 id="hello-world">Hello World</h1>
<pre><code>mkdir /tmp/datacl_test
export Q_DOCROOT=/tmp/datacl_test
export Q_DATA_DIR=/tmp/datacl_test
q init # This initializes the meta data
q add_tbl t1 10 # Creates a table with 10 rows
# Create a 4 byte integer field f1 in t1 whose values are 1, 2, ...
q s_to_f t1 f1 'op=[seq]:start=[1]:incr=[1]:fldtype=[I4]'
q list_tbls # list tables
q describe t1 # describes table t1
q describe t1 f1 # describes fld f1 in table t1
q pr_fld t1 f1 # Prints the values of f1 in t1
q fop t1 f1 sortD # Sorts f1 in t1 in descending order
q pr_fld t1 f1 # Notice that values have switched order
q delete t1 # Clean up</code></pre>
<hr>
<h1 id="organization-of-code">Organization of Code</h1>
<p>The code is organized as follows</p>
<ul>
<li>AUX -- contains (usually) tiny functions where the bulk of compute time should be spent e.g., adding 2 columns to create a third, etc.</li>
<li>dir_utils -- contains utility functions of generic utility e.g., hashing functions, parsing strings, etc.</li>
<li>dir_dbutils --- contains utility functions that are specific to datacl * e.g., creating temporary files, etc.</li>
<li>dir_ro_verbs --- contains functions that are <em>read-only verbs</em> e.g., the MySQL equivalent of <em>show tables</em> or <em>describe table</em></li>
<li>dir_ro_verbs --- contains functions that are <em>read-only verbs</em> e.g., the MySQL equivalent of <em>show tables</em> or <em>describe table</em></li>
<li>dir_wr_verbs --- contains functions that are <em>writable verbs</em> e.g., these cause changes in the data stored e.g., adding 2 columns to create a third is effected by the verb <em>f1f2opf3</em></li>
</ul>
<hr>
<h1 id="todo">TODO</h1>
<p>There are several way in which the code can be improved. We list the
most important ones.</p>
<ul>
<li>Functional testing in terms of unit tests</li>
<li>Performance tests and benchmarks</li>
<li>Reducing the size of the code by using templates</li>
</ul>
<h1 id="license">License</h1>
<p>© [2013] LinkedIn Corp. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may
not use this file except in compliance with the License.
You may obtain
a copy of the License at <a href="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</a></p>
<p>Unless required by applicable law or agreed to in writing,
software
distributed under the License is distributed on an "AS IS"
BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.</p>
</div>
</HTML>