Conversation

Collaborator

@phi-line phi-line commented Jun 11, 2018

New Functionality:

  • Advanced scraper that retrieves all archival data back to 2010 (100k+ courses, 64 quarters, 2 campuses)
  • Selenium cookie scraper that allows for authenticated requests to MyPortal

New Modules:

  • scrape_advanced.py Archival scraper
  • selenium_login.py Logs you into MyPortal and gathers the session cookie

To run this PR:
Set a couple of environment variables, either in PyCharm or in your .bashrc. They should contain your MyPortal login details.

export MP_USER="INSERT CWID"
export MP_PASS="INSERT PASSWORD"

pip install pipenv

pipenv install

pipenv run python scrape_advanced.py

Watch as the magic happens (takes ~30 mins)
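For context, a minimal sketch of how the scraper would presumably read these credentials. Only the variable names MP_USER and MP_PASS come from the instructions above; the lookup style and the stand-in values are assumptions for illustration.

```python
# Hedged sketch: reading the MyPortal credentials set in the
# environment above. The stand-in values are for illustration only.
import os

os.environ['MP_USER'] = 'A1234567'  # stand-in CWID, illustration only
os.environ['MP_PASS'] = 'hunter2'   # stand-in password, illustration only

username = os.environ['MP_USER']          # raises KeyError if unset
password = os.environ.get('MP_PASS', '')  # or fall back to empty
```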

[screenshot of the scraper running]

@phi-line phi-line requested a review from TryExceptElse June 11, 2018 03:02

phi-line commented Jun 11, 2018

Fixes #33

@phi-line phi-line requested a review from fractalbach June 11, 2018 07:48
@phi-line phi-line mentioned this pull request Jun 12, 2018
Contributor

@TryExceptElse TryExceptElse left a comment


Looks good overall, but there are a few things that could probably be improved before this is pulled.

.gitignore Outdated
*.DS_Store
*.json
schedule.html
*.html
Contributor


Are we sure we want to exclude all the template and example .html files?

return res.content


def advanced_parse(content, db, term=''):
Contributor


This function could be broken up into smaller functions.

Collaborator Author


This has been addressed with commit 363b07b

:return: (list(str)) list of term codes
"""
codes = []
for i in range(YEAR_RANGE[0], YEAR_RANGE[1] + 1):
Contributor


This can be condensed using a single itertools.product call.
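A hedged sketch of that suggestion: the year/quarter nesting collapses into one itertools.product call. YEAR_RANGE mirrors the loop in the diff above; the quarter suffixes are illustrative stand-ins, not the module's actual term codes.

```python
# Sketch of the itertools.product suggestion. YEAR_RANGE matches the
# loop in the diff; the quarter suffixes are illustrative stand-ins.
import itertools

YEAR_RANGE = (2010, 2018)
QUARTER_SUFFIXES = ('1', '2', '3', '4')  # assumed quarter codes

codes = [
    f'{year}{suffix}'
    for year, suffix in itertools.product(
        range(YEAR_RANGE[0], YEAR_RANGE[1] + 1), QUARTER_SUFFIXES)
]
```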

settings.py Outdated
'room', 'campus', 'units', 'instructor', 'seats', 'wait_seats', 'wait_cap')
CURRENT_TERM_CODES = CAMPUS_LIST = {'fh': '201911', 'da': '201912', 'test': 'test'}

ADVANCED_FORM_DATA = [
Contributor


If this is only used in scrape_advanced.py, it can be declared and used there to reduce clutter in the settings.py module. The same applies to any other constants used only within a single module.

db.table(f'{subject}').insert(j)
except BlankRow:
continue
except AttributeError as e:
Contributor


Catching all AttributeErrors produced in such a large area of code could easily swallow errors that we do not want caught, and can be very hard to debug. Is there a way to avoid catching these across such a large area, or to avoid it entirely?
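One hedged way to narrow the except, as a sketch: wrap only the single attribute access that is expected to fail and re-raise it as the domain-specific BlankRow error. parse_row is an illustrative stand-in, not the module's actual helper.

```python
# Illustrative sketch only: narrow the try/except to the one lookup
# that can legitimately fail, converting it to BlankRow instead of
# swallowing every AttributeError raised anywhere in the loop body.
class BlankRow(Exception):
    """Raised when a table row has no usable course data."""

def parse_row(cells):
    if not cells:
        raise BlankRow
    try:
        crn = cells[0].text.strip()  # only this access may raise
    except AttributeError:
        raise BlankRow from None     # surface it as a known condition
    return {'CRN': crn}
```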


DanielNghiem commented Jun 13, 2018

After a `pipenv install` and running `scrape_advanced.py`, I get the following:

Traceback (most recent call last):
  File "scrape_advanced.py", line 12, in <module>
    from selenium_login import scrape_cookies, kill_driver
  File "/Users/danielnghiem/Projects/OwlAPI/selenium_login.py", line 13, in <module>
    driver = webdriver.Chrome(chrome_options=chrome_options)
  File "/Users/danielnghiem/.local/share/virtualenvs/OwlAPI-TkGlEiua/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 68, in __init__
    self.service.start()
  File "/Users/danielnghiem/.local/share/virtualenvs/OwlAPI-TkGlEiua/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

It looks like I need to download chromedriver and add it to my PATH.
@phi-line : Can you add the setup instructions?
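In the meantime, a hedged sketch of a fail-fast check that selenium_login.py could run before constructing the driver, replacing the deep traceback above with a clear message. shutil.which is the stdlib PATH lookup; placing the check in that module is an assumption.

```python
# Sketch of a pre-flight check: fail fast with a readable message when
# chromedriver is missing, instead of a deep Selenium traceback.
import shutil

def chromedriver_available():
    """Return True if a 'chromedriver' executable is found on PATH."""
    return shutil.which('chromedriver') is not None

if not chromedriver_available():
    print('chromedriver not found on PATH; see the ChromeDriver site '
          'linked in the traceback for install instructions.')
```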

Collaborator Author

phi-line commented Jun 14, 2018

@DanielNghiem I have addressed your comment in this wiki page. Please feel free to correct anything that you think is wrong in the setup instructions I have provided.

@phi-line
Collaborator Author

@TryExceptElse please merge this PR when you get a chance so I can move on with the new data format (all CRN in one table). I'd like to separate out these two PRs instead of lumping everything into one.

@phi-line phi-line merged commit 5ff7e5d into master Jun 29, 2018
@madhavarshney madhavarshney deleted the 33-archive-scraper branch July 28, 2020 04:57
5 participants