YouTube Scraper Bug #140

Open
evamaxfield opened this issue Jun 13, 2023 · 5 comments

@evamaxfield
Member

WARNING: [youtube] gC9sITpsrtU: nsig extraction failed: You may experience throttling for some formats
         Install PhantomJS to workaround the issue. Please download it from https://phantomjs.org/download.html
         n = bpvIBN4MSHwUszM_nK ; player = https://www.youtube.com/s/player/8c7583ff/player_ias.vflset/en_US/base.js
[download] Finished downloading playlist: Adams 12 Five Star Schools - ESC - Search - Adams 12 Board of Education Meeting after:2023-05-05 before:2023-06-07
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13880/4276127022.py in <module>
     11 )
     12 
---> 13 events = scraper.get_events(
     14     begin=datetime(2023, 5, 5),  # jan 1 2023
     15     end=datetime(2023, 6, 7),  # jan 10 2023

~\Documents\school\anaconda\lib\site-packages\cdp_scrapers\youtube_utils.py in get_events(self, begin, end)
    264         end = pytz.utc.localize(end)
    265         events = self.iter_events(begin=begin, end=end)
--> 266         events = reduced_list(events, collapse=False)
    267         return events

~\Documents\school\anaconda\lib\site-packages\cdp_scrapers\scraper_utils.py in reduced_list(input_list, collapse)
     49         None if all items were None and collapse is True.
     50     """
---> 51     filtered = [item for item in input_list if item is not None]
     52     if collapse and len(filtered) == 0:
     53         filtered = None

~\Documents\school\anaconda\lib\site-packages\cdp_scrapers\scraper_utils.py in <listcomp>(.0)
     49         None if all items were None and collapse is True.
     50     """
---> 51     filtered = [item for item in input_list if item is not None]
     52     if collapse and len(filtered) == 0:
     53         filtered = None

~\Documents\school\anaconda\lib\site-packages\cdp_scrapers\youtube_utils.py in iter_events(self, begin, end)
    217 
    218             sessions = map(self.get_session, video_info_list)
--> 219             sessions = reduced_list(sessions, collapse=False)
    220             sessions = filter(
    221                 lambda s: s.session_datetime >= begin and s.session_datetime <= end,

~\Documents\school\anaconda\lib\site-packages\cdp_scrapers\scraper_utils.py in reduced_list(input_list, collapse)
     49         None if all items were None and collapse is True.
     50     """
---> 51     filtered = [item for item in input_list if item is not None]
     52     if collapse and len(filtered) == 0:
     53         filtered = None

~\Documents\school\anaconda\lib\site-packages\cdp_scrapers\scraper_utils.py in <listcomp>(.0)
     49         None if all items were None and collapse is True.
     50     """
---> 51     filtered = [item for item in input_list if item is not None]
     52     if collapse and len(filtered) == 0:
     53         filtered = None

~\Documents\school\anaconda\lib\site-packages\cdp_scrapers\youtube_utils.py in get_session(self, video_info)
    158 
    159         video_title = video_info["title"]
--> 160         video_datetime = self.parse_datetime(video_title)
    161         video_uri = video_info["webpage_url"]
    162 

~\Documents\school\anaconda\lib\site-packages\cdp_scrapers\youtube_utils.py in parse_datetime(self, title)
    136         """
    137         date_match = re.search(r"[a-z]+ \d{1,2}, \d{4}", title, re.I)
--> 138         date_time = datetime.strptime(date_match.group(), "%B %d, %Y")
    139         date_time = self.localize_datetime(date_time)
    140         return date_time

AttributeError: 'NoneType' object has no attribute 'group'
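
The traceback suggests that `re.search` found no date in one of the video titles, so `date_match` was `None` when `.group()` was called. A minimal sketch of a guarded parser follows, assuming it is acceptable to skip titles that carry no date. `parse_title_date` and `DATE_PATTERN` are hypothetical names, not part of cdp-scrapers; the regex and format string are copied from the traceback above.

```python
import re
from datetime import datetime
from typing import Optional

# Same pattern as in the traceback: "<word> <1-2 digits>, <4 digits>"
DATE_PATTERN = re.compile(r"[a-z]+ \d{1,2}, \d{4}", re.I)


def parse_title_date(title: str) -> Optional[datetime]:
    """Return the date embedded in a video title, or None if there is none.

    Hypothetical helper for illustration only.
    """
    date_match = DATE_PATTERN.search(title)
    if date_match is None:
        # No date-like text in the title: report "no date" instead of raising
        return None
    try:
        return datetime.strptime(date_match.group(), "%B %d, %Y")
    except ValueError:
        # The text matched the pattern but the word is not a month name
        return None


print(parse_title_date("Board of Education Meeting - May 17, 2023"))
print(parse_title_date("Adams 12 Board of Education Meeting"))
```

Since `reduced_list` already drops `None` items, a caller could return `None` for such videos and let them be filtered out, though whether skipping is the right behavior here is a judgment call for the maintainers.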
@evamaxfield evamaxfield added the bug Something isn't working label Jun 13, 2023
@evamaxfield
Member Author

Failed during:

from cdp_scrapers.youtube_utils import YoutubeIngestionScraper
from datetime import datetime
scraper = YoutubeIngestionScraper(
    channel_name="adams12fivestarschools-esc86",
    body_search_terms={
        "Adams 12": "Adams 12 Board of Education Meeting"
    },
    timezone="MST"
)
events = scraper.get_events(
    begin=datetime(2023, 5, 5),  # May 5, 2023
    end=datetime(2023, 6, 7),  # June 7, 2023
)

@evamaxfield
Member Author

Also, a totally separate problem: the scraper (correctly) returns distinct events, but it increments the session index for each one. That is, event 0 has session 0, event 1 has session 1, and event 2 has session 2. These should all be session 0.
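
To illustrate the expected numbering, here is a rough sketch that restarts the session index at 0 for every event. It assumes one event per calendar day, which may not match how the scraper actually groups sessions. `index_sessions_per_event` is a hypothetical helper and works on plain datetimes, not on the scraper's own session objects.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, List, Tuple


def index_sessions_per_event(
    session_datetimes: Iterable[datetime],
) -> Dict[date, List[Tuple[int, datetime]]]:
    """Group sessions by calendar day and number them from 0 within each day.

    Assumption for illustration: each calendar day is one event.
    """
    by_day: Dict[date, List[datetime]] = defaultdict(list)
    for session_dt in sorted(session_datetimes):
        by_day[session_dt.date()].append(session_dt)
    # enumerate restarts at 0 for every day, so indices never carry over
    return {day: list(enumerate(dts)) for day, dts in by_day.items()}


sessions = [
    datetime(2023, 5, 10, 18),
    datetime(2023, 5, 24, 18),
    datetime(2023, 6, 7, 18),
]
for day, indexed in index_sessions_per_event(sessions).items():
    print(day, [index for index, _ in indexed])
```

With three meetings on three different days, every event ends up with a single session whose index is 0.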

@Shak2000
Contributor

Which scraper or URL is this referring to? I want to try to reproduce it.

@evamaxfield
Member Author

The YoutubeIngestionScraper currently in the cdp-scrapers repo. The last snippet I posted should reproduce it.

@dphoria
Collaborator

dphoria commented Sep 6, 2023

Forgot about this!

@dphoria dphoria self-assigned this Sep 6, 2023