Archive scraper #59
Conversation
…l be hard to parse
…ise with data that is inconsistent from year to year

Fixes #33
TryExceptElse
left a comment
Looks good overall, but there are a few things that could be improved before this is merged.
.gitignore
Outdated

    *.DS_Store
    *.json
    schedule.html
    *.html
Are we sure we want to exclude all the template and example .html files?
        return res.content


    def advanced_parse(content, db, term=''):
This function could be broken up into smaller functions.
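As a rough sketch of what such a split could look like (the helper names and bodies here are hypothetical illustrations, not the PR's actual parsing logic):

```python
def advanced_parse(content, db, term=''):
    """Thin coordinator: extraction and persistence live in helpers."""
    for row in extract_rows(content):
        store_row(db, row, term)


def extract_rows(content):
    # Placeholder extraction step for illustration only.
    return [{'crn': line} for line in content.splitlines() if line]


def store_row(db, row, term):
    # Placeholder persistence step; the real code inserts into a db table.
    db.append((term, row))
```

Each helper can then be tested and reasoned about on its own.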
This has been addressed in commit 363b07b.
scrape_advanced.py
Outdated
        :return: (list(str)) list of term codes
        """
        codes = []
        for i in range(YEAR_RANGE[0], YEAR_RANGE[1] + 1):
This can be condensed using a single itertools.product call.
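A sketch of the condensed version; the YEAR_RANGE bounds and the quarter suffixes below are assumptions for illustration, since the actual constants live in the PR's settings module:

```python
import itertools

# Hypothetical values for illustration; the real constants are defined
# in the PR's settings module.
YEAR_RANGE = (2010, 2018)
QUARTER_SUFFIXES = ('1', '2', '3', '4')  # assumed quarter digits of a term code

# One itertools.product call replaces the nested year/quarter loops.
codes = [f'{year}{suffix}'
         for year, suffix in itertools.product(
             range(YEAR_RANGE[0], YEAR_RANGE[1] + 1), QUARTER_SUFFIXES)]
```

`itertools.product` yields every (year, suffix) pair in order, so the list comprehension produces the same codes as the nested loops in a single expression.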
settings.py
Outdated
        'room', 'campus', 'units', 'instructor', 'seats', 'wait_seats', 'wait_cap')
    CURRENT_TERM_CODES = CAMPUS_LIST = {'fh': '201911', 'da': '201912', 'test': 'test'}


    ADVANCED_FORM_DATA = [
If this is only used in scrape_advanced.py, it can be declared and used there to reduce clutter in the settings.py module. The same applies to any other constants used only within a single module.
scrape_advanced.py
Outdated
                db.table(f'{subject}').insert(j)
            except BlankRow:
                continue
            except AttributeError as e:
Catching all AttributeErrors produced in such a large area of code could easily swallow errors that we do not want caught, and can be very hard to debug. Is there a way to avoid catching these from such a large area, or to avoid catching them entirely?
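One way to shrink the catch area, sketched under assumptions (the `Cell` class and `parse_cell` helper are hypothetical stand-ins for the scraper's row objects):

```python
class Cell:
    """Hypothetical stand-in for a parsed HTML table cell."""
    def __init__(self, text):
        self.text = text


def parse_cell(cell):
    # The try block wraps only the one attribute access expected to fail,
    # so unrelated AttributeErrors elsewhere still surface normally.
    try:
        text = cell.text
    except AttributeError:
        return ''  # missing cell: treat as blank instead of crashing
    return text.strip()
```

Keeping the `try` around a single expression means a typo like `row.isntructor` in surrounding code still raises loudly instead of being silently skipped.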
After a `pipenv install` and running scrape_advanced.py, I get the following error. It looks like I need to download chromedriver and add it to my PATH.
@DanielNghiem I have addressed your comment in this wiki page. Please feel free to correct anything you think is wrong in the setup instructions I have provided.
@TryExceptElse please merge this PR when you get a chance so I can move on to the new data format (all CRNs in one table). I'd like to keep these two PRs separate instead of lumping everything into one.
New Functionality:

New Modules:
- scrape_advanced.py: archival scraper
- selenium_login.py: logs you into MyPortal and gathers cookies

To run this PR:
1. Set a couple of environment variables, either in PyCharm or in your .bashrc. They should contain your MyPortal login details.
2. Watch as the magic happens (takes ~30 mins).
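A minimal sketch of how the scraper might read those variables; the names `MYPORTAL_USER` and `MYPORTAL_PASS` are assumptions here, so check the wiki setup page for the actual ones:

```python
import os


def load_credentials():
    """Read MyPortal login details from environment variables.

    The variable names below are hypothetical; substitute whatever the
    setup wiki specifies.
    """
    user = os.environ.get('MYPORTAL_USER')
    password = os.environ.get('MYPORTAL_PASS')
    if not user or not password:
        raise SystemExit('Set MYPORTAL_USER and MYPORTAL_PASS before '
                         'running scrape_advanced.py.')
    return user, password
```

Failing fast with a clear message avoids a cryptic login error deep inside the selenium session.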