Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

command line script for node #84

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open

Conversation

iloveitaly
Copy link
Contributor

  • build: adding rehype-parse for node dom parsing
  • feat: use rehype-node when running on the cli
  • feat: remove top-level element to fix node dom parsing
  • feat: cli script

Copy link
Owner

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I don’t have a lot of time to dive through this right now, but added a few quick questions inline. I will try to get back with a more detailed review later.

cmd.js Outdated Show resolved Hide resolved
cmd.js Outdated
Comment on lines 44 to 53
const inputHTML = stdinReadSync()

if(!inputHTML) {
console.error('no HTML provided over stdin');
process.exit(1);
}

convertDocsHtmlToMarkdown(inputHTML).then(markdown => {
process.stdout.write(markdown);
});
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all needs exception handling. A CLI tool meant to general usage should avoid crashing with a stack trace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think it would be better to expose the raw exception until we know what type of exceptions can occur. IMHO it's frusterating when a generic error is given from a CLI tool, it requires that I go and edit out the catch block to actually understand what went wrong and if there's anything simple I can do to fix it.

Will add this in though!

Comment on lines 125 to 142
// this seems to only occur when running via nodejs (not the browser): there's a top-level <b> tag
// this is only called once with the parent node
function removeRootBoldWrapper(node) {
// there are some cases, like translating <b>something something</b> where we don't want to remove the root node
if(node.children.length === 1 && node.children[0].tagName === 'b') {
return
}

for(let i = 0; i < node.children.length; i++) {
const child = node.children[i];

if(child.tagName === 'b') {
node.children.splice(i, 1, ...child.children);
}
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more about what the purpose here is? Why are you concerned particularly with <b> elements here, and what circumstances are you encountering? If this is a CLI-specific issue, it should probably be handled in the CLI portion of the code, not here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more in the comments! Let me know if this still isn't clear.

README.md Outdated
Comment on lines 48 to 64
However, what you really want it run this after copying text from Google Docs. To do this, you'll need to extract the HTML
on the clipboard. Here's a script to do this:

```shell
swift - <<EOF | npx google-docs-to-markdown | pbcopy
import Cocoa

let type = NSPasteboard.PasteboardType.html

guard let string = NSPasteboard.general.string(forType:type) else {
fputs("Could not find string data of type '\(type)' on the system pasteboard\n", stderr)
exit(1)
}

print(string)
EOF
```
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it needs to be made more clear that this is a macOS-specific example, and will not work on other platforms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted the comment to make this clear

Copy link
Owner

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more detailed notes inline in addition to my previous question.

This also needs to at least run the unit tests in Node. That should hopefully be as simple as:

  1. Add a new script in package.json named test-unit-node (or something like that) that just runs mocha (maybe needs some other options/settings).
  2. Add that script to the main test script.
  3. Add a CI job to run it as well:
    unit-tests:
    # Use MacOS so we can test in Safari.
    runs-on: macos-latest
    steps:
    - uses: actions/checkout@v3
    - name: Install Node.js
    uses: actions/setup-node@v3
    with:
    node-version: 16
    cache: npm
    - name: Install dependencies
    run: npm ci
    - name: Unit Tests (in Browser)
    run: npm run test-unit
    e2e-tests:
    # Use MacOS so we can test in Safari.
    runs-on: macos-latest
    steps:
    - uses: actions/checkout@v3
    - name: Install Node.js
    uses: actions/setup-node@v3
    with:
    node-version: 16
    cache: npm
    - name: Install dependencies
    run: npm ci
    - name: End-to-End Tests
    run: npm run test-e2e

Ideally, there should be some basic end-to-end tests for the CLI, too (e.g. run node cmd.js with some input), but I think we can iterate towards that.

Could you also please add an engines section to package.json that specifies Node.js >=16.20.0? I think we might be able to do an earlier 16.x version, but 16.20+ is definitely supported. (Or change #83 to do this instead and we’ll merge that.)

cmd.js Outdated
Comment on lines 9 to 11
if (process.stdin.isTTY) {
return '';
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why give up if the session is interactive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is stop accepting new input and send it through for processing. ^D didn't work for me by default. Maybe there's a better trick here or fixing up the stdin processing would be better.

cmd.js Outdated Show resolved Hide resolved
cmd.js Outdated
}

if (!n) break
data += b.toString('utf8', 0, n)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think this is critical for a first pass on the CLI feature, but it might be good to either add a CLI option to specify the encoding or add sniffing via jschardet (or both).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware of node:stream/consumers! Thanks for pointing that out.

Comment on lines +4 to +5
import {default as rehypeDom} from 'rehype-dom-parse';
import {default as rehypeNode} from 'rehype-parse';
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we’d be better off here to require just rehype-parse, and substitute rehype-dom-parse during Webpack compilation. (Then rehype-dom-parse can also just be a dev-time dependency.)

Copy link
Contributor Author

@iloveitaly iloveitaly Oct 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with webpack—what's the best way to do this? Can you submit a change for this PR for it?

lib/convert.js Outdated Show resolved Hide resolved
@@ -35,10 +37,14 @@ function doubleBlankLinesBeforeHeadings (previous, next, _parent, _state) {
return undefined;
}

const isNode = typeof process !== 'undefined' && process.versions && process.versions.node;
const rehypeParse = isNode ? rehypeNode : rehypeDom;
const rehypeParseOptions = isNode ? {fragment: true, verbose: true, emitParseErrors: true} : {}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an important reason to have different parser options on Node vs. browsers? I know the default for fragment is different in the two parsers, but I think it’s probably clearer and less error-prone to just specify it explicitly in both cases.

Why set verbose? This adds line/character information to the nodes AST nodes the parser outputs, which we don’t use.

Why set emitParseErrors here if we are not showing the errors we get? Is this a debugging flag that should have been removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember exactly, but I believe the existing parser just didn't work at all/well when I was testing.

@iloveitaly
Copy link
Contributor Author

Thanks for all of the great comments! I won't have much more time to devote here until this script breaks for me (which could happen soon with weird gdocs formatting). I'll reply then! Feel free to commit to/improve this PR if you have the time in the meantime.

@iloveitaly
Copy link
Contributor Author

Fixed most/all of the comments outside the mocha/test comment. Will handle that another time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants