Conversation

Collaborator

@phi-line phi-line commented Jun 11, 2018

New Functionality:

  • Advanced scraper that retrieves all archival data back to 2010 (100k+ courses, 64 quarters, 2 campuses)
  • Selenium cookie scraper that allows for authenticated requests to MyPortal

New Modules:

  • scrape_advanced.py Archival scraper
  • selenium_login.py Logs you into MyPortal and gathers the session cookie

To run this PR:
Set a couple of environment variables, either in PyCharm or in your .bashrc. They should contain your MyPortal login details.

export MP_USER="INSERT CWID"
export MP_PASS="INSERT PASSWORD"

pip install pipenv

pipenv install

pipenv run python scrape_advanced.py

Watch as the magic happens (takes ~30 mins)
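For context, a minimal sketch of how the scraper would presumably read these credentials. Only the variable names MP_USER and MP_PASS come from the instructions above; the lookup style and the stand-in values are assumptions for illustration.

```python
# Hedged sketch: reading the MyPortal credentials set in the
# environment above. The stand-in values are for illustration only.
import os

os.environ['MP_USER'] = 'A1234567'  # stand-in CWID, illustration only
os.environ['MP_PASS'] = 'hunter2'   # stand-in password, illustration only

username = os.environ['MP_USER']          # raises KeyError if unset
password = os.environ.get('MP_PASS', '')  # or fall back to empty
```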

[screenshot of the scraper running]

@phi-line phi-line requested a review from TryExceptElse June 11, 2018 03:02

phi-line commented Jun 11, 2018

Fixes #33

@phi-line phi-line requested a review from fractalbach June 11, 2018 07:48
@phi-line phi-line mentioned this pull request Jun 12, 2018
Contributor

@TryExceptElse TryExceptElse left a comment


Looks good overall, but there are a few things that could probably be improved before this is pulled.

.gitignore Outdated
*.DS_Store
*.json
schedule.html
*.html
Contributor


Are we sure we want to exclude all the template and example .html files?

return res.content


def advanced_parse(content, db, term=''):
Contributor


This function could be broken up into smaller functions.

Collaborator Author


This has been addressed with commit 363b07b

:return: (list(str)) list of term codes
"""
codes = []
for i in range(YEAR_RANGE[0], YEAR_RANGE[1] + 1):
Contributor


This can be condensed using a single itertools.product call.
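A hedged sketch of that suggestion: the year/quarter nesting collapses into one itertools.product call. YEAR_RANGE mirrors the loop in the diff above; the quarter suffixes are illustrative stand-ins, not the module's actual term codes.

```python
# Sketch of the itertools.product suggestion. YEAR_RANGE matches the
# loop in the diff; the quarter suffixes are illustrative stand-ins.
import itertools

YEAR_RANGE = (2010, 2018)
QUARTER_SUFFIXES = ('1', '2', '3', '4')  # assumed quarter codes

codes = [
    f'{year}{suffix}'
    for year, suffix in itertools.product(
        range(YEAR_RANGE[0], YEAR_RANGE[1] + 1), QUARTER_SUFFIXES)
]
```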

settings.py Outdated
'room', 'campus', 'units', 'instructor', 'seats', 'wait_seats', 'wait_cap')
CURRENT_TERM_CODES = CAMPUS_LIST = {'fh': '201911', 'da': '201912', 'test': 'test'}

ADVANCED_FORM_DATA = [
Contributor


If this is only used in scrape_advanced.py, it can be declared and used there to reduce clutter in the settings.py module. The same applies to any other constants used only within a single module.

db.table(f'{subject}').insert(j)
except BlankRow:
continue
except AttributeError as e:
Contributor


Catching all AttributeErrors produced in such a large area of code could easily swallow errors that we do not want caught, and can be very hard to debug. Is there a way to avoid catching these across such a large area, or to avoid it entirely?
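One hedged way to narrow the except, as a sketch: wrap only the single attribute access that is expected to fail and re-raise it as the domain-specific BlankRow error. parse_row is an illustrative stand-in, not the module's actual helper.

```python
# Illustrative sketch only: narrow the try/except to the one lookup
# that can legitimately fail, converting it to BlankRow instead of
# swallowing every AttributeError raised anywhere in the loop body.
class BlankRow(Exception):
    """Raised when a table row has no usable course data."""

def parse_row(cells):
    if not cells:
        raise BlankRow
    try:
        crn = cells[0].text.strip()  # only this access may raise
    except AttributeError:
        raise BlankRow from None     # surface it as a known condition
    return {'CRN': crn}
```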


DanielNghiem commented Jun 13, 2018

After a `pipenv install` and running `scrape_advanced.py`, I get the following:

Traceback (most recent call last):
  File "scrape_advanced.py", line 12, in <module>
    from selenium_login import scrape_cookies, kill_driver
  File "/Users/danielnghiem/Projects/OwlAPI/selenium_login.py", line 13, in <module>
    driver = webdriver.Chrome(chrome_options=chrome_options)
  File "/Users/danielnghiem/.local/share/virtualenvs/OwlAPI-TkGlEiua/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 68, in __init__
    self.service.start()
  File "/Users/danielnghiem/.local/share/virtualenvs/OwlAPI-TkGlEiua/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

It looks like I need to download chromedriver and add it to my PATH.
@phi-line : Can you add the setup instructions?
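In the meantime, a hedged sketch of a fail-fast check that selenium_login.py could run before constructing the driver, replacing the deep traceback above with a clear message. shutil.which is the stdlib PATH lookup; placing the check in that module is an assumption.

```python
# Sketch of a pre-flight check: fail fast with a readable message when
# chromedriver is missing, instead of a deep Selenium traceback.
import shutil

def chromedriver_available():
    """Return True if a 'chromedriver' executable is found on PATH."""
    return shutil.which('chromedriver') is not None

if not chromedriver_available():
    print('chromedriver not found on PATH; see the ChromeDriver site '
          'linked in the traceback for install instructions.')
```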

Collaborator Author

phi-line commented Jun 14, 2018

@DanielNghiem I have addressed your comment in this wiki page. Please feel free to correct anything that you think is wrong in the setup instructions I have provided.

@phi-line
Collaborator Author

@TryExceptElse please merge this PR when you get a chance so I can move on with the new data format (all CRN in one table). I'd like to separate out these two PRs instead of lumping everything into one.

@phi-line phi-line merged commit 5ff7e5d into master Jun 29, 2018
@madhavarshney madhavarshney deleted the 33-archive-scraper branch July 28, 2020 04:57
5 participants