
Adding HTML Scraping functionality #75

Merged
merged 3 commits into SathyaBhat:dev on May 2, 2020

Conversation

lucanello
Contributor

I created a fix for the problem regarding the Google YouTube Search/Data API. It simply uses the normal search by requesting the HTML page, then parsing out and filtering the video IDs. It is also possible to fetch all the videos contained on the first search page, along with more information such as the video title.

The problem with PR #73 is that it uses hardcoded, client-side authentication and removes the ability to still use the official YouTube Data API. The authentication in that PR will lead to more frequent rate limiting because we would all be using the same footprint.

Issues regarding this problem:
#60

Todos:

  • Add correct exception handling
  • Test how often these requests work and possibly add a rate-limiting handler

The pros are:

  • No need to specify a Google API key (please check whether my removal of the boundaries works)
  • Possibility to download much bigger playlists

The cons are:

  • Data is not nicely structured
  • You could be blocked by Google (a VPN may help)

PS: I really like this repo! Thanks for making Spotify more open for everybody.
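
For illustration, a minimal sketch of the scraping idea described above could look like the following. It assumes the requests library, the standard YouTube search URL, and a simplified watch?v= regex; the names and details are illustrative only and are not the PR's actual code.

# Hedged sketch of the HTML-scraping approach (illustrative, not the PR's code).
import re
import requests

YOUTUBE_SEARCH_URL = "https://www.youtube.com/results"
YOUTUBE_VIDEO_URL = "https://www.youtube.com/watch?v="


def scrape_first_video_url(query):
    """Return the watch URL for the first video ID found in the search HTML."""
    response = requests.get(YOUTUBE_SEARCH_URL, params={"search_query": query})
    response.raise_for_status()
    # Video IDs appear in the page markup as "watch?v=<11-character id>".
    match = re.search(r"watch\?v=([a-zA-Z0-9_-]{11})", response.text)
    if match is None:
        return None
    return YOUTUBE_VIDEO_URL + match.group(1)


if __name__ == "__main__":
    print(scrape_first_video_url("Rick Astley Never Gonna Give You Up"))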

SathyaBhat and others added 3 commits March 29, 2020 11:02
…#71)

* add test, handle Youtube quota expiry better (#68)

* add sigint handler to handle Ctrl+C better

* major version bump due to Python version change (now needs 3.6+)

* move sentry around so that github action doesn't complain when trying to fetch version for publishing

* move sentry around

* oops, moved signal to wrong place
@SathyaBhat
Owner

Wow, @lucanello - this is fantastic, thank you very much. At first glance it looks great, and I love that your PR gives the option to scrape as well as to go via the API route. I'll take a deeper look this week.

Once again, much thanks for the PR

@opiumozor left a comment

Was about to jump on this since it bothered me as well, great implementation 👍

@SathyaBhat added this to the 4.0 milestone on Apr 25, 2020
@SathyaBhat added this to In progress in Spotify DL via automation on Apr 25, 2020
@SathyaBhat self-assigned this on Apr 25, 2020
@SathyaBhat modified the milestones: 4.0, 5.0 on Apr 25, 2020
Owner

@SathyaBhat left a comment

Hey, @lucanello - thanks for your contributions. Overall it looks good, except for some minor changes. I'm working on incorporating these changes; I just wanted to pass the feedback on to you.

video_id = re.search("((\?v=)[a-zA-Z0-9_-]{4,15})", video[0]).group(0)[3:]
print(video_id)
return YOUTUBE_VIDEO_URL + video_id
except:

Bare excepts are bad, as they will capture any and all exceptions, including the ones raised when Ctrl+C is pressed, which makes it impossible to exit while the search is happening.
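
A hedged sketch of the kind of change being asked for, with names assumed from the snippet above rather than taken from the PR: catch only the failures expected from the regex lookup, so that KeyboardInterrupt from Ctrl+C still propagates (a bare except catches BaseException, which includes KeyboardInterrupt).

# Illustration only: narrow the bare except to the errors we actually expect.
import re

YOUTUBE_VIDEO_URL = "https://www.youtube.com/watch?v="


def extract_video_url(video):
    try:
        video_id = re.search(r"((\?v=)[a-zA-Z0-9_-]{4,15})", video[0]).group(0)[3:]
        return YOUTUBE_VIDEO_URL + video_id
    except (AttributeError, IndexError):
        # AttributeError: re.search found no match; IndexError: video was empty.
        return None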

@@ -45,6 +45,10 @@ def check_for_tokens():

Generate the key from

If --scrape or -s has been provided, then we don't need to print this.

''')
if args.scrape:

We can move this check before printing "get dev key..".
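
Roughly what that reordering could look like, as a sketch only (the args object, environment variable name, and hint text are assumptions, not the project's actual code): return early when --scrape is set, so the dev-key instructions are never printed in that case.

# Sketch only: check the scrape flag before printing the API-key instructions.
import os


def check_for_tokens(args):
    if args.scrape:
        # Scraping does not need a YouTube Data API key, so skip the hint.
        return True
    if not os.environ.get("YOUTUBE_DEV_KEY"):
        print("YOUTUBE_DEV_KEY is missing. Generate the key from the Google "
              "developer console and export it before running.")
        return False
    return True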

else:
secho(f"\t Search failed due to {error_domain}:{error_reason}, message: {error_message}")
return None
if error_reason == 'quotaExceeded' or error_reason == 'dailyLimitExceeded':

Was this intended? If there are multiple errors, it will repeatedly print the error message.
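
One possible shape of a fix, sketched with assumed names and click's secho for the colored output (not the PR's code): collect the error reasons first, exit once on a quota error, and otherwise print the failure message a single time instead of once per error entry.

# Sketch only: report the failure once rather than once per error entry.
from click import secho


def report_search_errors(errors):
    reasons = {error.get("reason") for error in errors}
    if reasons & {"quotaExceeded", "dailyLimitExceeded"}:
        secho("\t YouTube API quota exhausted, cannot continue the search", fg="red")
        raise SystemExit(1)
    if errors:
        first = errors[0]
        # Only the first error is printed; the rest would repeat the same message.
        secho(f"\t Search failed due to {first.get('domain')}:{first.get('reason')}, "
              f"message: {first.get('message')}")
    return None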

@SathyaBhat changed the base branch from master to dev on May 2, 2020 at 19:00
@SathyaBhat merged commit f19386a into SathyaBhat:dev on May 2, 2020
Spotify DL automation moved this from In progress to Done May 2, 2020
@SathyaBhat
Owner

SathyaBhat commented May 2, 2020

@lucanello I have merged this into the dev branch; it will be out when the next release is cut. Thanks for your contributions!
