Automatically fetch and parse complete feed list from wiki #59

Closed
humphd opened this issue Nov 6, 2019 · 10 comments

@humphd humphd commented Nov 6, 2019

At the moment we're using a hack to read feed URLs from a text file. This is fine for our initial efforts, but we should really be pulling those feed URLs from https://wiki.cdot.senecacollege.ca/wiki/Planet_CDOT_Feed_List.

Looking at the markup in that page, we need to grab the contents of the single pre element on the page:

<pre>
################# Failing Feeds Commented Out [Start] #################

#Feed excluded due to getaddrinfo ENOTFOUND s-aleinikov.blog.ca s-aleinikov.blog.ca:80
#[http://s-aleinikov.blog.ca/feed/atom/posts/]
#name=Sergey Aleinikov
...
</pre>

We then need to lightly process this:

  • ignore lines beginning with # (comments)
  • read URLs out of the [url] format (surrounded in square brackets)
  • read the person's name from name=User Name
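The three rules above can be sketched as a small line-by-line parser. This is only an illustration, assuming the text of the pre element has already been extracted into a string. The function name `parseFeedList` and the `{ name, url }` record shape are hypothetical and not taken from the project.

```javascript
// Hypothetical helper: turns the wiki's feed list text into { name, url } records.
function parseFeedList(text) {
  const feeds = [];
  let currentUrl = null;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();

    // Skip blank lines and comments (lines beginning with #)
    if (line === '' || line.startsWith('#')) continue;

    // A line such as [http://example.com/feed] starts a new entry
    const urlMatch = line.match(/^\[(.+)\]$/);
    if (urlMatch) {
      currentUrl = urlMatch[1].trim();
      continue;
    }

    // A line such as name=User Name completes the current entry
    // (spaces around = are tolerated, which is an assumption about the data)
    const nameMatch = line.match(/^name\s*=\s*(.+)$/);
    if (nameMatch && currentUrl) {
      feeds.push({ name: nameMatch[1].trim(), url: currentUrl });
      currentUrl = null;
    }
  }

  return feeds;
}
```

Because commented-out entries are dropped before any matching happens, the failing feeds at the top of the list never reach the output.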

Later we might decide to store this in a db or something, but let's begin by pulling this down from the cdot wiki.

@cryolis cryolis commented Nov 6, 2019

I would like to work on this.

@humphd humphd commented Nov 6, 2019

Go for it, Igor. You might be able to get away with a regex for this, or you might decide to use a Node-based DOM parser to extract the <pre>'s innerText.
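For the regex option, a rough sketch might look like the following. It assumes the page contains a single pre element with no nested markup, and it decodes only a handful of common HTML entities. The name `extractPreText` is hypothetical. A real DOM parser would handle entities and edge cases more reliably.

```javascript
// Hypothetical helper: returns the text inside the first <pre> element, or null.
function extractPreText(html) {
  // [\s\S] matches across newlines; the lazy quantifier stops at the first </pre>
  const match = html.match(/<pre\b[^>]*>([\s\S]*?)<\/pre>/i);
  if (!match) return null;

  // Decode a few common entities; &amp; goes last to avoid double-decoding
  return match[1]
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&');
}
```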

@cryolis cryolis commented Nov 6, 2019

I was thinking of using the request and cheerio libraries to fetch the page and traverse the DOM. It might also be a good idea to start using express.js at this point.

@humphd humphd commented Nov 6, 2019

Don't use request, as it's a dead project. Use bent, see https://github.com/mikeal/bent and look at my code in the worker for downloading feeds.

For parsing the DOM, I'd use https://www.npmjs.com/package/jsdom vs. cheerio, since it's also better maintained.

And you won't need express for this. We're just going to feed this data into the queue, like I do now in the src/index.js file.

@cryolis cryolis commented Nov 10, 2019

I created a PR for the parser. Currently it prints the data to the console, but later we can save it to JSON or a db. I commented out the code that saves the data to the txt file; it can be removed later or used for testing.

@cryolis cryolis commented Nov 10, 2019

Also, some of the feeds were written a bit differently from the rest, so I added a check to remove unnecessary tags. This one, for example:

[href="https://amddeeb.wordpress.com/feed/]
name = Ahmed Deeb

It has "href=" inside the square brackets. I'm not sure how many more mistakes like that are in the feed list, because it's huge and I didn't go through the whole thing. If anyone notices one, you're welcome to add another check, or let me know and I'll add one.
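A check of this kind could be sketched as follows. This is not the code from the PR. The name `cleanFeedUrl` is hypothetical, and the function only covers the `href=` and stray-quote pattern shown in the entry above.

```javascript
// Hypothetical helper: cleans up the text found between [ and ] in the feed list.
function cleanFeedUrl(raw) {
  return raw
    .trim()
    // Drop a stray leading href= as seen in the malformed entry
    .replace(/^href\s*=\s*/i, '')
    // Drop any quotes left at either end of the URL
    .replace(/^["']+|["']+$/g, '');
}
```

Well-formed URLs pass through unchanged, so the function can be applied to every entry rather than only to the ones known to be malformed.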

@mskuybeda mskuybeda commented Nov 13, 2019

@humphd I will write the functionality that receives the data parsed by @cryolis and sends it to the Redis db.

@cryolis cryolis commented Nov 14, 2019

I managed to push my commits to the upstream/issue#59 branch, but they are not shown in the PR that was created. Should I create another PR?

@c3ho c3ho commented Nov 14, 2019

@cryolis try pushing your branch to origin instead

@cryolis cryolis commented Nov 14, 2019

I think I did it, but it pushed about 20 commits along with it.

@humphd humphd closed this in 674ec97 Nov 15, 2019
Main automation moved this from In progress/Review to Closed Nov 15, 2019