repeat some changes that were done by mistake on the md file

1 parent 2fabd46 commit 4a2e6de617a3784cf9bd2b42f8b5be116f14f17e @piccolbo piccolbo committed Sep 26, 2012
@@ -9,7 +9,7 @@ library(rmr2)
* Still more a collection of snippets than anything organized
* Thanks Damien and @ryangarner for the examples and Koert for conversations on the subject
-Internally `rmr` uses R's own serialization in most cases and typedbytes serialization when in vectorized mode. The goal is to make you forget about representation issues most of the time. But what happens at the boundary of the
+Internally `rmr` uses R's own serialization in most cases and its own typedbytes extension for some atomic vectors. The goal is to make you forget about representation issues most of the time. But what happens at the boundary of the
system, when you need to get non-rmr data in and out of it? Of course `rmr` has to be able to read and write a variety of formats to be of any use. This is what is available and how to extend it.
## Built in formats
@@ -19,11 +19,11 @@ The complete list is:
```
1. `text`: for english text. key is `NULL` and value is a string, one per line. Please don't use it for anything else.
-1. `json`-ish: it is actually <JSON\tJSON\n> so that streaming can tell key and value. This implies you have to escape all newlines and tabs in the JSON part. Your data may not be in this form, but almost any
+1. `json`-ish: it is actually `<JSON\tJSON\n>` so that streaming can tell key and value. This implies you have to escape all newlines and tabs in the JSON part. Your data may not be in this form, but almost any
language has decent JSON libraries. It was the default in `rmr` 1.0, but we'll keep because it is almost standard. Parsed in C for efficiency, should handle large objects.
1. `csv`: A family of concrete formats modeled after R's own `read.table`. See examples below.
1. `native`: based on R's own serialization, it is the default and supports everything that R's `serialize` supports. If you want to know the gory details, it is implemented as an application specific type for the typedbytes format, which is further encapsulated in the sequence file format when writing to HDFS, which ... Dont't worry about it, it just works. Unfortunately, it is written and read by only one package, `rmr` itself.
-1. `sequence.typedbytes`: based on specs in HADOOP-1722 it has emerged as the standard for non Java hadoop application talking to the rest of Hadoop.
+1. `sequence.typedbytes`: based on specs in HADOOP-1722 it has emerged as the standard for non Java hadoop application talking to the rest of Hadoop. Also implemented in C for efficiency, its underlying data model is different from R's and we tried to map the two systems the best we could.
## Custom formats
@@ -185,7 +185,7 @@
<li>Thanks Damien and @ryangarner for the examples and Koert for conversations on the subject</li>
</ul>
-<p>Internally <code>rmr</code> uses R&#39;s own serialization in most cases and typedbytes serialization when in vectorized mode. The goal is to make you forget about representation issues most of the time. But what happens at the boundary of the<br/>
+<p>Internally <code>rmr</code> uses R&#39;s own serialization in most cases and its own typedbytes extension for some atomic vectors. The goal is to make you forget about representation issues most of the time. But what happens at the boundary of the<br/>
system, when you need to get non-rmr data in and out of it? Of course <code>rmr</code> has to be able to read and write a variety of formats to be of any use. This is what is available and how to extend it.</p>
<h2>Built in formats</h2>
@@ -198,11 +198,11 @@
<ol>
<li><code>text</code>: for english text. key is <code>NULL</code> and value is a string, one per line. Please don&#39;t use it for anything else.</li>
-<li><code>json</code>-ish: it is actually <JSON\tJSON\n> so that streaming can tell key and value. This implies you have to escape all newlines and tabs in the JSON part. Your data may not be in this form, but almost any
+<li><code>json</code>-ish: it is actually <code>&lt;JSON\tJSON\n&gt;</code> so that streaming can tell key and value. This implies you have to escape all newlines and tabs in the JSON part. Your data may not be in this form, but almost any
language has decent JSON libraries. It was the default in <code>rmr</code> 1.0, but we&#39;ll keep because it is almost standard. Parsed in C for efficiency, should handle large objects.</li>
<li><code>csv</code>: A family of concrete formats modeled after R&#39;s own <code>read.table</code>. See examples below.</li>
<li><code>native</code>: based on R&#39;s own serialization, it is the default and supports everything that R&#39;s <code>serialize</code> supports. If you want to know the gory details, it is implemented as an application specific type for the typedbytes format, which is further encapsulated in the sequence file format when writing to HDFS, which &hellip; Dont&#39;t worry about it, it just works. Unfortunately, it is written and read by only one package, <code>rmr</code> itself.</li>
-<li><code>sequence.typedbytes</code>: based on specs in HADOOP-1722 it has emerged as the standard for non Java hadoop application talking to the rest of Hadoop.</li>
+<li><code>sequence.typedbytes</code>: based on specs in HADOOP-1722 it has emerged as the standard for non Java hadoop application talking to the rest of Hadoop. Also implemented in C for efficiency, its underlying data model is different from R&#39;s and we tried to map the two systems the best we could.</li>
</ol>
<h2>Custom formats</h2>
@@ -224,7 +224,7 @@
NULL
else keyval(NULL, df)
}
-&lt;environment: 0x106921f08&gt;
+&lt;environment: 0x105e1a870&gt;
$streaming.format
NULL
@@ -249,7 +249,7 @@
v
else cbind(k, v), ..., row.names = FALSE, col.names = FALSE)
}
-&lt;environment: 0x10349c550&gt;
+&lt;environment: 0x101bbb718&gt;
$streaming.format
NULL
@@ -1,6 +1,9 @@
+
+
+
* This document responds to several inquiries on data formats and how to get data in and out of the rmr system
* Still more a collection of snippets than anything organized
* Thanks Damien and @ryangarner for the examples and Koert for conversations on the subject
@@ -47,7 +50,7 @@ function (con, keyval.length)
NULL
else keyval(NULL, df)
}
-<environment: 0x106921f08>
+<environment: 0x105e1a870>
$streaming.format
NULL
@@ -76,7 +79,7 @@ function (kv, con)
v
else cbind(k, v), ..., row.names = FALSE, col.names = FALSE)
}
-<environment: 0x10349c550>
+<environment: 0x101bbb718>
$streaming.format
NULL
