Bug: Not archiving Twitter correctly #1086

Closed
m-primo opened this issue Jan 19, 2023 · 5 comments

m-primo commented Jan 19, 2023

Describe the bug

The screenshot, SingleFile, and output.html outputs are not saved correctly.
They don't contain the tweet itself, only Twitter's error page: "Hmm...this page doesn’t exist. Try searching for something else."
See the screenshot below.

Steps to reproduce

  1. Open your ArchiveBox instance.
  2. Archive any Twitter tweet.
  3. Check for yourself.

Even in your own demo instance it doesn't work!

Screenshots or log output

[Screenshot attached to the original issue]

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.15.0-58-generic-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.10.6         valid     /usr/bin/python3.10
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/dist-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.81.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.21.2         valid     /usr/bin/wget
 √  NODE_BINARY           v18.12.1        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v1.0.25         valid     ./node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.4          valid     ./node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.34.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/bin/chromium
 X  RIPGREP_BINARY        ?               invalid   rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /usr/local/lib/python3.10/dist-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /usr/local/lib/python3.10/dist-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            6 files         valid     /home/<USERNAME_REDACTED>/archivebox
 √  SOURCES_DIR           1 files         valid     ./sources
 √  LOGS_DIR              2 files         valid     ./logs
 √  ARCHIVE_DIR           2 files         valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             216.0 KB        valid     ./index.sqlite3

[!] Warning: Missing 1 recommended dependencies
    ! RIPGREP_BINARY: rg (unable to detect version)
Author

m-primo commented Jan 19, 2023

Btw, I tried to save tweets with headless Chromium and I got the same result.

Member

pirate commented Jan 21, 2023

Yup, you should archive the equivalent Nitter URLs (or use another alternative frontend instead of Twitter). Archiving Twitter has always been very broken. The same is true for Reddit -> Teddit, Instagram -> Bibliogram, and a couple of other big sites that implement advanced bot detection and blocking. See a longer list of alternative front-ends here: https://hackmd.io/MCpUlTbLThyF6cw_fywT_g?view. It's not ideal, but it's better than not having any solution.

Follow here for updates: #345
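
The Nitter workaround above can be scripted by rewriting tweet URLs before they are handed to ArchiveBox. The following is only a rough sketch, not part of ArchiveBox: the function name `to_alternative_frontend`, the `FRONTENDS` mapping, and the `nitter.net` hostname are assumptions made for illustration, so substitute a Nitter instance you actually trust. It relies on Nitter using the same `/user/status/<id>` paths as Twitter.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: illustrative mapping only. The Nitter hostname is a placeholder
# for whichever instance you choose to use.
FRONTENDS = {
    "twitter.com": "nitter.net",
    "www.twitter.com": "nitter.net",
    "mobile.twitter.com": "nitter.net",
}


def to_alternative_frontend(url: str, frontends: dict = FRONTENDS) -> str:
    """Swap the host of a known site for its alternative frontend.

    URLs whose host is not in the mapping are returned unchanged.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    replacement = frontends.get(host)
    if replacement is None:
        return url
    # Keep scheme, path, query, and fragment; only the host changes.
    return urlunsplit(
        (parts.scheme, replacement, parts.path, parts.query, parts.fragment)
    )
```

The rewritten URLs can then be added as usual, for example with `archivebox add`. Other frontends from the linked list would need their own entries, and some of them may not mirror the original site's paths, in which case a plain host swap like this one is not enough.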

pirate closed this as completed Jan 21, 2023

Author

m-primo commented Jan 22, 2023

That's what I thought at first, but I opened the issue in case anyone can help or find a solution. I've tried many archiving tools and some workarounds, and I guess the only one that worked was pywb. But thanks, I'll take a look at the link in your reply.

Member

pirate commented Jan 22, 2023

Yeah, if you're doing a lot of Twitter/FB/Insta/etc. archiving, I highly recommend https://github.com/webrecorder/browsertrix-crawler. It uses the same engine as pywb and is written by the same team.

Check out their whole suite here: https://webrecorder.net/

Author

m-primo commented Jan 23, 2023

Okay, thank you so much.
