Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made download_blogs requests async #19

Merged
merged 20 commits into from Oct 12, 2022
Merged

Made download_blogs requests async #19

merged 20 commits into from Oct 12, 2022

Conversation

cantudo
Copy link
Contributor

@cantudo cantudo commented Oct 8, 2022

Hello, I saw your message about issue #18 on Hacktoberfest's Discord. I have tried to address it, I have set tokio as a non optional dependency.

Using this feeds.txt:

https://antoniosbarotsis.github.io/index.xml
https://blog.rust-lang.org/feed.xml
https://blog.rust-lang.org/inside-rust/feed.xml
https://github.blog/feed

This is the output of hyperfine:

  • With the synchronous requests:

image

  • New asynchronous requests:

image

Please feel free to correct anything, this is my first time working with async and tokio in rust.

It would be helpful if you posted your feeds.txt, so i could test it more thoroughly. Thanks.

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 9, 2022

Oh wow that was quick, here you go

https://antoniosbarotsis.github.io/index.xml

# Youtubers
https://www.youtube.com/feeds/videos.xml?channel_id=UCiSIL42pQRpc-8JNiYDFyzQ # Chris Biscardi
https://www.youtube.com/feeds/videos.xml?channel_id=UCUMwY9iS8oMyWDYIe6_RmoA # NoBoilerplate
https://www.youtube.com/feeds/videos.xml?channel_id=UC8ENHE5xdFSwx71u3fDH5Xw # ThePrimeagen
https://www.youtube.com/feeds/videos.xml?channel_id=UCsBjURrPoezykLs9EqgamOA # Fireship
https://www.youtube.com/feeds/videos.xml?channel_id=UC2Xd-TjJByJyK2w1zNwY0zQ # Beyond Fireship

# Rust stuff
https://blog.rust-lang.org/feed.xml
https://blog.rust-lang.org/inside-rust/feed.xml
https://this-week-in-rust.org/rss.xml
https://rust.libhunt.com/newsletter/feed
https://rustsec.org/feed.xml

# Console.dev 
https://console.dev/tools/rss.xml
https://console.dev/betas/rss.xml
https://console.dev/articles/rss.xml
https://console.dev/interviews/rss.xml
https://console.dev/profiles/rss.xml
https://blog.console.dev/rss/

https://blog.jetbrains.com/feed/
https://github.blog/feed/
https://stackoverflow.blog/feed/

# Random
https://dusted.codes/feed/rss
https://vladmihalcea.com/blog/feed/
https://arraying.de/index.xml
https://ossinsight.io/blog/rss.xml
https://www.smartercode.io/feed
https://fasterthanli.me/index.xml
https://raphlinus.github.io/feed.xml
https://www.huy.rocks/rss.xml
https://gideonwolfe.com/index.xml
https://blog.m-ou.se/index.xml
https://liam.rs/index.xml
https://qristin.wordpress.com/feed/
https://wakatime.com/blog/all.atom
https://developerlife.com/feed.xml

# Other
https://grafana.com/blog/index.xml
https://www.elastic.co/blog/feed
https://redis.com/blog/feed
https://www.stephanboyer.com/rss
https://medium.com/feed/@clockwork-labs

Unrelated but let me know if you think the getting started docs could be improved in some way :)

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 9, 2022

It seems to be a bit slower than the master branch (for me at least)

# Yours
Benchmark 1: cargo run
  Time (mean ± σ):     20.506 s ±  3.317 s    [User: 3.331 s, System: 0.333 s]
  Range (min … max):   16.724 s … 22.921 s    3 runs

# Master
Benchmark 1: cargo run
  Time (mean ± σ):     14.180 s ±  0.784 s    [User: 4.654 s, System: 0.145 s]
  Range (min … max):   13.630 s … 15.078 s    3 runs

If I use the 4 feeds that you used I get very comparable results

# Yours
Benchmark 1: cargo run
  Time (mean ± σ):     747.2 ms ±  47.1 ms    [User: 445.7 ms, System: 170.0 ms]
  Range (min … max):   715.5 ms … 801.4 ms    3 runs

# Master
Benchmark 1: cargo run
  Time (mean ± σ):     922.0 ms ±  75.0 ms    [User: 507.6 ms, System: 115.7 ms]
  Range (min … max):   869.8 ms … 1008.0 ms    3 runs

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

Oh, thank you for your feedback, I am testing it also, but I get varying results, not sure why, maybe ratelimiting or something.

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 9, 2022

I ran mine a few times and I seem to be getting roughly the same as the ones I posted (for the big input).

It seems that the pipeline is failing, this seems to be an issue with OpenSSL and can likely be solved by adding

[target.'cfg(unix)'.dependencies]
openssl = "0.10.42"

to cargo.toml. See here.

Edit: Actually not sure since the Build & Test job is passing, maybe it's just that OpenSSL is installed in the Github Runners and not in the Docker image. In any case, you can ignore this for now, I'll deal with it later.

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

Without hyperfine I can see that there are some problems with the code, I am trying to fix it.

image

@AntoniosBarotsis
Copy link
Owner

This is definitely something introduced with the new HTTP library you added, I didn't get these in master.

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 9, 2022

I didn't try this but it seems the issue is here. The inputs include say application/xml but the error is thrown because of application/xml; *charset UTF-8*. Maybe it would make more sense to do an includes check instead of checking for equality (I mean includes for each item in the array).

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

Yest, it seems that reqwest shows the content type header in this format, application/rss+xml; charset=UTF-8 when converting to a &str, maybe we could parse it so only the first part (before the semicolon) is passed into is_supported_content function. I am open to trying other libraries, i chose this one because it seemed well supported.

@AntoniosBarotsis
Copy link
Owner

I guess we could also try splitting, either works so up to you. The library itself is probably fine, it's just a matter of getting the string to cooperate 😄

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

Now I am getting good results:

# Mine
Benchmark 1: cargo run
  Time (mean ± σ):      5.279 s ±  2.544 s    [User: 4.914 s, System: 0.285 s]
  Range (min … max):    3.593 s …  8.205 s    3 runs

# Master:
Benchmark 1: cargo run
  Time (mean ± σ):     15.963 s ±  1.296 s    [User: 3.662 s, System: 0.039 s]
  Range (min … max):   14.820 s … 17.370 s    3 runs

But one request is still failing due to bad Content Type header, this one https://blog.jetbrains.com/feed/, which is spitting this
Custom("Invalid content text/html for https://blog.jetbrains.com/feed/"). Not sure how to fix it, any ideas?

Is it fast for you now?

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 9, 2022

I am getting all sorts of errors.

The requests seem to be too quick for my laptop to keep up

Reqwest(reqwest::Error { kind: Request, url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("grafana.com")), port: None, path: "/blog/index.xml", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 10060, kind: TimedOut, message: "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." })) })

I am also getting this?

Error in https://blog.jetbrains.com/feed/
    ["Syntax: 4:7 Unexpected closing tag: HEAD, expected META"]

I got that by adding "text/html" to the supported methods.

I am not sure if these issues are because of my internet connection, OS or laptop but considering you have been getting better results, it's probably not the code.

The extra dependency doesn't seem to be working, as I mentioned above;

maybe it's just that OpenSSL is installed in the Github Runners and not in the Docker image

Maybe adding openssl here would fix it but I can just experiment with this later as it's not necessarily important for the scope of the PR.

Considering that this seems to have external issues that we can't control, would it make sense to make this a feature? Similar to how the aws-lambda feature works.

I seem to be still getting worse performance + some connection errors but this could be potentially faster than the current master branch on say AWS Lambda.

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

Ahhh, I still think I can fix this. It is possible to limmit the number of concurrent connections, see the first answer here,

I went with
let contents = rt.block_on(futures::future::join_all(...
which makes all requests on the feeds.txt file concurrent, I used this because it was faster but I think we can do better and get it working on all machines. I will try again tomorrow and let you know

The jetbrains site has something weird, I think we could add a request header for the accepted content types like this, maybe it will fix it, again I will check tomorrow.

Regarding openssl, in my opinion the best we can do is move to another async http client implementation, see hyper, using unlike reqwests can use rust-tls, I think this is better because of the reduced number of external dependencies.

Thanks for your help, I am sorry you could not see the results I was getting.

@AntoniosBarotsis
Copy link
Owner

Great to hear! I was also about to go as it is getting late here as well.

I am sorry you could not see the results I was getting.

If it works, it works. I am hosting this on Linux anyway, Windows is weird sometimes. Even if we can't make this better on Windows, it will still be faster for 2 other major platforms and that's still useful.

Thanks a lot for your contributions!

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

Hello again, I have fixed the jetbrains page issue, they were rejecting requests without a UserAgent header, so i just added a mock one. I also added hyper-tls as a feature for the reqwest crate, I am hoping this fixes de failing pipeline. I also limited the number of concurrent requests to 10, hopefully it is working on your machine this time. This is the last benchmark I ran with all the new changes and your feeds.txt:

# New code
Benchmark 1: cargo run
  Time (mean ± σ):      5.367 s ±  0.453 s    [User: 3.769 s, System: 0.034 s]
  Range (min … max):    5.079 s …  5.889 s    3 runs

# Master
Benchmark 1: cargo run
  Time (mean ± σ):     20.572 s ±  3.694 s    [User: 3.922 s, System: 0.026 s]
  Range (min … max):   17.919 s … 24.792 s    3 runs

Edit: Removed unwrap to fix the linting error. It seems that openssl is still being build despite using hyper-tls.

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

Sorry for all these commits, I did not know clippy was that strict!

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 9, 2022

Don't worry about it, I just defined an absurd amount of rules. I'll review this later today, thanks for the commits!

Also the build seems to be passing which is great.

@AntoniosBarotsis
Copy link
Owner

I am getting this now

Benchmark 1: cargo run
  Time (mean ± σ):      5.037 s ±  0.052 s    [User: 4.535 s, System: 0.293 s]
  Range (min … max):    4.977 s …  5.069 s    3 runs

so I guess you did manage to make this work on windows as well after all, good job!

I'll try this on AWS real quick and if that works fine (which it should since the pipeline is passing) I'll merge.

@AntoniosBarotsis
Copy link
Owner

Looking at the code, I think I actually managed to make this work yesterday when I was trying stuff out on my own before opening the issue.

I had tried this general pattern of block_on(stream()) but I didn't specify buffer_unordered(CONCURRENT_REQUESTS) and I was getting times similar to what I was getting with your implementation earlier. I was just unaware that this was indeed faster and my laptop was just choking.

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

Also what I don't understand, did the logger make this crash??? Or was it the other stuff you changed in cargo.toml

I think it was the logger but we could try to revert that line and see if it crashes

Turned into a draft so I don't click big green merge button accidentally

Good idea haha

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

I think the problem is with my computer, I am getting this error now on your master 😮‍💨

2022-10-09T20:38:04Z WARN  rss2email] Error in https://antoniosbarotsis.github.io/index.xml
    Ureq(Transport(Transport { kind: ConnectionFailed, message: Some("tls connection init failed"), url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("antoniosbarotsis.github.io")), port: None, path: "/index.xml", query: None, fragment: None }), source: Some(Kind(UnexpectedEof)) }))

Dont really know what can I have done to mess up even the currently working one!

@AntoniosBarotsis
Copy link
Owner

Why did this become so cursed. Both branches work fine for me now.

??????????????????

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

I do not now hahaha I am removing all cargo packages and installing again, just to be sure. I think we both deserve the tshirt after this is working.

@cantudo
Copy link
Contributor Author

cantudo commented Oct 9, 2022

I have tested this on another machine and it also works for me, I believe I messed something up in my rust installation while testing. If you still want to merge I would recommend further testing, maybe a friend or something because I cannot confidently say this will not fail.

@AntoniosBarotsis
Copy link
Owner

Appreciate the concern ♥

I will force my friends to run this tomorrow after running it myself a few times.

@AntoniosBarotsis
Copy link
Owner

I'll try to get at least one person to run the code on each platform, are you using Mac or Linux?

@cantudo
Copy link
Contributor Author

cantudo commented Oct 10, 2022

Great! I am using linux.

@AntoniosBarotsis AntoniosBarotsis linked an issue Oct 10, 2022 that may be closed by this pull request
@AntoniosBarotsis
Copy link
Owner

The pipeline passing is definitely reassuring here, made it build, test and run the code in the same feeds.txt that we are both using in this issue on all major platforms.

I think I will wait until #20 is resolved at least partially (just the extra Arm image part) and if that seems to also be working without issues then I'd say it is safe to assume this is working fine and I'll finally merge.

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 10, 2022

The GitHub Runners have pretty fast internet connections but probably some weird hardware resources so the speedup is smaller (from 10 to 7 seconds) but that is expected and looks fine to me. It also seems to work fine even if ran multiple times (the benchmark runs the code 5 times) which is good to know.

Are you still getting those errors locally?

@cantudo
Copy link
Contributor Author

cantudo commented Oct 10, 2022

It also seems to work fine even if ran multiple times (the benchmark runs the code 5 times) which is good to know.

That is great to hear!

Are you still getting those errors locally?

Unfortunately, yes, at this point I think my internet connection is just bad, I tested with my uni connection this morning and everything ran well, I also tested it in a uni server and it ran well. Running the current master version (still with ureq and no async) it gives me the same errors at home, but if I activate my uni vpn it works without a problem. This is two runs of the current master, one without and one with VPN.

# First run (current master no async and no vpn)
2022-10-10T20:14:08Z INFO  rss2email] Days set to 7
[2022-10-10T20:15:11Z WARN  rss2email] Error in https://blog.rust-lang.org/feed.xml
    Ureq(Transport(Transport { kind: ConnectionFailed, message: Some("tls connection init failed"), url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("blog.rust-lang.org")), port: None, path: "/feed.xml", query: None, fragment: None }), source: Some(Kind(UnexpectedEof)) }))
^C⏎

# Second run (current master no async and using my university's vpn)
cantu@cantu-l ~/U/T/Rss2Email (master) [SIGINT]> cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/rss2email`
[2022-10-10T20:15:50Z INFO  rss2email] Days set to 7
[2022-10-10T20:16:31Z INFO  rss2email] Elapsed time for download_blogs was 40561ms
[2022-10-10T20:16:31Z INFO  rss2email] Downloaded 17 blogs with 97 posts total.
[2022-10-10T20:16:31Z INFO  rss2email] <h1>Rss2Email - 2022-10-10UTC</h1><h2>Tony's Blog</h2><ul

Sometimes it runs without errors and sometimes I get those, not always on the same domains. I have also tested this at home with a Raspberry Pi 3 and the same thing happens. I just do not know what is the problem, but whatever it is I think is beyond my control and not related to tokio / async / reqwest.

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 10, 2022

Yes seems like it's your wifi then.

I didn't find much regarding the error with one Google search but I'll ask around in case anyone has an idea about this out of curiosity. Only thing I can think of is packet loss.

This again means that the PR is very likely fine but I'll wait as I mentioned above regardless. Worst case scenario, if it is indeed buggy then I'll have more time to encounter the issue.

@cantudo
Copy link
Contributor Author

cantudo commented Oct 10, 2022

I didn't find much regarding the error with one Google search but I'll ask around in case anyone has an idea about this out of curiosity.

Great, let me know if you find something out!

This again means that the PR is very likely fine but I'll wait as I mentioned above regardless. Worst case scenario, if it is indeed buggy then I'll have more time to encounter the issue.

Sure, sounds good also I am sorry this took so long, in my mind it was a simple change. I do not study / work in the software engineering field so this was definitely a learning experience for me, I have loved seeing the CI workflow you have implemented, it works really well. This is also my first hacktoberfest PR so there is that.

Only thing I can think of is packet loss.

To be fair I do not know, TCP protocol should handle packet retransmissions and timeouts, but it could be something like that.

@AntoniosBarotsis
Copy link
Owner

AntoniosBarotsis commented Oct 10, 2022

I am sorry this took so long

I really do not mind especially seeing as how I would likely not do this myself, at least not now.

this was definitely a learning experience for me

That is great to hear :)
Being new to Rust I have also learned a lot of stuff through this project

This is also my first hacktoberfest PR so there is that

This is definitely way above whatever I did for my first Hacktoberfest pr 🗿

src/lib.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
@AntoniosBarotsis
Copy link
Owner

I have tested the code manually as well and it seems fine, performance went from 20 to 12 seconds!

@AntoniosBarotsis AntoniosBarotsis marked this pull request as ready for review October 12, 2022 16:00
@AntoniosBarotsis AntoniosBarotsis merged commit 8640680 into AntoniosBarotsis:master Oct 12, 2022
@AntoniosBarotsis
Copy link
Owner

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make HTTP requests asynchronous
2 participants