Pausing a stream #135
Hi,

In Node.js, piping a readstream switches it back into flowing mode, so a `pause()` issued before the pipe is set up gets voided. You will need to change your code into the following, pausing on the next tick, after `fromStream()` has set everything up:

```js
const rs = fs.createReadStream(this.csvPath);
let count = 0;
csv()
  .fromStream(rs)
  .on("json", (json) => {
    count++;
    console.log(count);
  })
  .on("done", () => {
    cb(null, count);
  })
  .on("error", (err) => {
    cb(err);
  });

process.nextTick(() => {
  rs.pause();
});
```

Again, if you don't want to parse the readstream at that time, you just don't need to call the parser.

~Keyang

---
I've got a use case where I need to process a huge (>2 million rows) CSV and insert it into a DB. To do this without running into memory issues, I intend to process the CSV as a stream, pausing the stream every 10,000 rows, inserting the rows into my DB, and then resuming the stream.

---
Apologies, the example I gave above was a bit too cut down. This:

```js
process.nextTick(() => {
  rs.pause();
});
```

would pause the stream right before it starts. What I want to do is something like the following, where I pause and resume from within the `json` handler:

```js
.on("json", (json) => {
  rows.push(json);
  // for every 10,000 rows - pause the stream,
  // asynchronously save to DB, and then resume the stream
  if (rows.length % 10000 === 0) {
    rs.pause();
    this.saveToDb(rows, () => {
      rs.resume();
      rows = [];
    });
  }
})
```

---
My first thought on your case is that you should write your own `stream.Writable` class so you can pipe the result into it, rather than doing the buffering/throttling yourself, e.g.
```js
rs.pipe(csv()).pipe(yourWritable)
```
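
For reference, here is a minimal sketch of what `yourWritable` could look like, using the simplified construction API; `saveRow` is a hypothetical async persistence call (a fuller, batched version appears later in this thread):

```js
const { Writable } = require("stream");

const yourWritable = new Writable({
  objectMode: true,
  write(row, _encoding, callback) {
    // backpressure for free: the upstream stays paused
    // until callback() is invoked
    saveRow(row, callback); // hypothetical async DB call
  }
});
```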
The whole point of Node.js streams is to keep these details transparent to developers, so that you won't need to worry about when to pause and resume the upstream.

Anyway, if you want to do it the way you mentioned, that is possible as well: you can pause the readable stream every 10,000 rows and resume it once they are processed.
---
Thanks a million - calling `rs.pause()` from within the `json` event listener doesn't seem to have an effect.

---
Ok, I see what you mean.

Again, this is not the correct way of using Node.js streams. The reason it is not working is that the downstream will `resume` the upstream once it has "drained"; thus a `pause()` on the upstream will be voided while the downstream is draining data. Node.js hides these underlying details from developers.
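
To see this in action, here is a small sketch (hypothetical `big.csv`): pausing a piped readable only holds until the destination drains, at which point `pipe()` resumes the source.

```js
const fs = require("fs");

const rs = fs.createReadStream("big.csv"); // hypothetical input file
rs.pipe(process.stdout);

rs.pause();
console.log(rs.isPaused()); // true for the moment...
// ...but once process.stdout drains, pipe() calls rs.resume(),
// so the data keeps flowing despite the pause() call
```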
You can, however, call `rs.unpipe()` to stop `rs` populating data, and pipe it back once processing has finished. So you can do something like below:
```js
const rs = fs.createReadStream(this.csvPath);
let count = 0;
let rows = [];
const csvParser = csv()
  .fromStream(rs)
  .on("json", (json) => {
    count++;
    rows.push(json);
    // for every 10,000 rows - unpipe the stream,
    // asynchronously save to DB, and then pipe it back
    if (rows.length % 10000 === 0) {
      rs.unpipe();
      this.saveToDb(rows, () => {
        rs.pipe(csvParser);
        rows = [];
      });
    }
  })
  .on("done", () => {
    cb(null, count);
  })
  .on("error", (err) => {
    cb(err);
  });
```
~Keyang
---
Thanks. It looks like even after I call `rs.unpipe()`, `json` events keep firing, so rows keep getting pushed:

```js
.on("json", (json) => {
  rows.push(json);
  console.log(rows.length);
  if (rows.length % 1000 === 0) {
    console.log("unpiping");
    rs.unpipe();
    this._insertEntries(db, rows, () => {
      rs.pipe(csvParser);
      rows = [];
    });
  }
})
```

---
However, this works:

```js
.on("json", (json) => {
  rows.push(json);
  if (rows.length % 1000 === 0) {
    rs.unpipe();
    // clear `rows` right after `unpipe`
    const entries = rows;
    rows = [];
    this._insertEntries(db, entries, () => {
      rs.pipe(csvParser);
    });
  }
})
```

---
Hi,

`unpipe` only stops the file from being read further; a `json` event is emitted whenever a row of CSV has been parsed. Since each read pulls in multiple CSV lines, `json` can still be emitted even when no more of the file is being read.

~Keyang
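
In other words, each chunk read from the file can hold many CSV lines, and the parser will emit a `json` event for every row it has already buffered, even after `unpipe()` stops further reads. A rough illustration of the chunk sizes involved (hypothetical `big.csv`):

```js
const fs = require("fs");

// fs read streams default to 64 KB chunks, each of which
// typically contains hundreds of CSV rows
fs.createReadStream("big.csv").once("data", (chunk) => {
  console.log("lines in first chunk:", chunk.toString().split("\n").length);
});
```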
---
Got it - thanks. Are there any plans to make "pausing" a bit easier? I'm guessing that what I'm trying to do here is a common task. I've got it working with your suggestions, but it feels a tad hackish.

---
:D As mentioned before, this is not the correct way of doing things. Node.js has encapsulated this in its stream implementation, and we should not need to do it ourselves. Implementing your own `Writable` is the correct way; see https://nodejs.org/api/stream.html#stream_simplified_construction

Here is an example of what you want to achieve:
```js
const fs = require("fs");
const { Writable } = require("stream");
const csv = require("csvtojson");

const rs = fs.createReadStream(this.csvPath);
let tmpArr = [];

rs.pipe(csv({}, { objectMode: true })).pipe(new Writable({
  objectMode: true,
  write: function (json, encoding, callback) {
    tmpArr.push(json);
    if (tmpArr.length === 10000) {
      // a full batch: persist it before signalling for more rows,
      // which gives you backpressure for free
      myDb.save(tmpArr, function () {
        tmpArr = [];
        callback();
      });
    } else {
      callback();
    }
  }
}))
.on("finish", function () {
  // flush the final, partial batch
  if (tmpArr.length > 0) {
    myDb.save(tmpArr, function () {
      tmpArr = [];
    });
  }
});
```
~Keyang
---
Nice - that looks much cleaner. Thanks again.

---
Pausing a stream doesn't seem to have an effect: the parser proceeds reading the CSV as usual. Take, for example, code along these lines (a minimal sketch; `this.csvPath` and the shape of the counting handler are assumed from the corrected snippet above):
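
```js
const fs = require("fs");
const csv = require("csvtojson");

const rs = fs.createReadStream(this.csvPath); // hypothetical path
rs.pause(); // intended to stop the parser from reading anything

let count = 0;
csv()
  .fromStream(rs)
  .on("json", () => {
    count++;
    console.log(count);
  });
```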
`count` is logged 200 times (equal to the number of rows in the CSV). I was expecting it not to log anything, since the stream is paused before being passed to `fromStream()`.