## Why I made this
For the first project, my learning goal is to practice crawling websites. 
Initially, I crawled Indeed.com for jobs and related information, but I hit a snag half way into the project when my bot failed CAPTCHA enforced by the website. After looking to Robots.txt on Indeed.com, I realized that it prohibits me crawling information in large quantity. 

I found it interesting that (almost) every site I visit has specific rules in regard to crawling. Parsons course catalogue is the first I encountered that allows all crawling behaviors. This is the primary reason why I pivoted from crawling indeed.com to Parson's course catalogue. Another reason is because I want to aggreagte course information in a way that conforms to my habit better:

First, I wish to get more accurate results when I search by instructor name. If I search for Sven Travis on the current website, it would return all courses taught by instructors whose full name contains either "Sven" or "Travis", instead of just whom I am looking for (The best Sven there is.) 
Second, I want to see more information on the instructor for every course. I would usually look into the intructor if I am interested in any class, it would be great if I can see this information at the same place as the course. Since ratemyprofessors.com is heavily biased and prohibits crawling, I looked for course evals on Parsons. My efforts were to no avail because such information was only available to faculty. Luckily, I discovered that every faculty has a page in Parson's faculty directory, and the URL follows the same pattern that can be uniquely identified by the instructor name. 
Third, I wish to shorten the distance to get CRN for every class. On the course website, crn is not shown until I click open a specific course. Since this information comes handy for registration, I want to see it right away. 

In [1]:
import requests
import json 
import csv
import time 
from datetime import datetime 
from bs4 import BeautifulSoup

## Why selenium library
One of the biggest challenge I encoutered was dealing with the static url on Parsons course catalogue. To filter the courses that I can choose from, I would need to click on the filters to reveal those courses. However, the URL remains the same. This makes crawling tricky because requests relies on the URL to get the web page. 

To address this, I found the Selenium library. It enabled me to emulate a robot that can perform clicking and scrolling actions like a human user on a website. This way, even though the url remains static, the page I am crawling will contain the right information. 

In [2]:
!pip install selenium



In [3]:
import os
import sys
#find my current root directory, program will look for some sort of web driver .exe there. 
os.path.dirname(sys.executable)

'C:\\Users\\frech\\Anaconda3'

In [4]:
from selenium import webdriver

In [5]:
driver = webdriver.Chrome()
driver.maximize_window()

In [6]:
driver.get('https://courses.newschool.edu/')

In [7]:
# automate clicking the fields to narrow down class selection
from selenium.webdriver.support.ui import Select

In [8]:
#  emulate a human user and click on specific filters on the course catalogue page. 
#  Therer are 3 filters I would click on while browsing classes: "Art, Media, and Technolgy", "Graduate" and "Spring 2o21"
time.sleep(3)
select = Select(driver.find_element_by_name('college[]'))
select.select_by_value('art_media_technology')
#driver.find_element_by_id("submit").click()
#print(driver.find_element_by_id("gradlevel001").is_selected())
level = driver.find_element_by_id('gradlevel001')
driver.execute_script("arguments[0].click();", level)
term = driver.find_element_by_id('term000')
driver.execute_script("arguments[0].click();", term)
#time.sleep delays execution of the next block. Much needed here because it takes a while for the page to actually update after bot selects fields. 
time.sleep(3)


In [9]:
# get all the course urls from the inital page
allCourseLinks = driver.find_elements_by_css_selector("div.crse_page p a")
links = []
for link in allCourseLinks:
    url = link.get_attribute("href")
    links.append(url)
print("Done, we got all the course links.")

Done, we got all the course links.


In [10]:
#iterate through the course links, use Beautiful Soup from now on, because no user interaction needs to happen on the 
# specific course page
# save all the data into a dictionary. One course can have many sections, use course number as the keys. 
scrapeData = []

for i in range(len(links)):
    courseLink = links[i]
    response = requests.get(courseLink)
    soupObject = BeautifulSoup(response.text, "html.parser")
    sectionInfoArr = []
    #get the course ID, same for all sections
    courseID = soupObject.find('p','dept').text + " " + soupObject.find('p','crse').text
    print(courseID)

    #get all sections under the same course ID
    allSections = soupObject.find_all('div', 'section_details')
    
    #TODO, so far we are only getting 1 section from each course, how to get all? what data structure?
    for j in range(len(allSections)):
        sectionTitle = allSections[j].find('div','title').h1.text
        instructor = allSections[j].find('div','instructor').text.partition(':')[2].strip()
        crn = allSections[j].find('div','crn').text.partition(':')[2].strip()
        description = allSections[j].find('div','description').text.strip()
        checkSeats = allSections[j].find('div','seats').span.text
        checkStatus = allSections[j].find('div','status').span.text
        classTime = allSections[j].find('div','days').text.partition(':')[2].strip() + allSections[j].find('div','times').text.partition(':')[2]
        dateRange = allSections[j].find('div','dates').text.partition(':')[2].strip()
        print(instructor)
    
    # get every instructor's info page from https://www.newschool.edu/parsons/faculty/
    # this link follows the same format, insert - between names of the instructor.
    # TODO: if a course is co-taught by two or more professors, need to parse further
        instructorDash = instructor.replace(' ', '-');
        instructorInfo = '';
        insUrl = 'https://www.newschool.edu/parsons/faculty/' + instructorDash
        response = requests.get(insUrl)
        time.sleep(1)
    
        if (response.status_code == 200):
            instructorInfo = insUrl
        else:
            instructorInfo = "Not available"
        
        sectionInfo = {'course ID': courseID, 'section title': sectionTitle, 'CRN':crn, 'instructor':instructor, 'instructorInfo':instructorInfo, 'classTime':classTime, 'dateRange':dateRange, 'checkSeats':checkSeats, 'checkStatus':checkStatus, 'description':description, 'courseLink':courseLink}
        #sectionInfoArr.append(sectionInfo)
    #scrapeData[courseID] = sectionInfoArr
    # make every class a dictionary:
        scrapeData.append(sectionInfo)
print("Done, all courses were scrapped.")

PGFA 5000
Andrea Geyer
Mira Schor
Shane Aslan Selzer
Kamrooz Aram
Jessica Rankin
Faculty TBA
Jennifer Woolfalk
Simone Douglas
Shoshana Dentz
Peter Rostovsky
Ester Partegas
PGFA 5005
Faculty TBA
Faculty TBA
Faculty TBA
PGFA 5020
Andrea Geyer and Lydia Matthews
Andrea Geyer and Lydia Matthews
PGFA 5050
Faculty TBA
Lydia Matthews
Lenore Malen
Neil Goldberg
Faculty TBA
PGFA 5051
Faculty TBA
Lydia Matthews
Lenore Malen
Neil Goldberg
Lydia Matthews
Shane Aslan Selzer
LJ Roberts
Faculty TBA
PGFA 5127
Andrea Geyer
PGFA 5145
Mira Schor
PGFA 5151
Sharmistha Ray
PGFA 5300
Phoenix Lindsey-Hall
PGFA 5301
Sammy Cucher
Carrie Hawks
PGFA 5302
Sara Jimenez
Andrea Geyer
PGFA 5303
Yve Laris Cohen
PGFA 5900
Kevin Bukreev
Kevin Bukreev
Kevin Bukreev
PGPH 5001
William Lamson
William Lamson and MarieVic Vic
Sandra Erbacher and William Lamson
PGPH 5006
Simone Douglas and Arthur Ou
PGPH 5101
Mike Crane and Keisha Scarville
Anthony Aziz and Keren Moscovitch
PGPH 5113
Laura Parnes and James Ramer
PGPH 5302
Stacy

AttributeError: 'NoneType' object has no attribute 'text'

In [None]:
# create a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 60)
    
#'section title': sectionTitle, 'CRN':crn, 'instructor':instructor, 'instructorInfo':instructorInfo, 'classTime':classTime, 'dateRange':dateRange, 'checkSeats':checkSeats, 'checkStatus':checkStatus, 'description':description, 'courseLink':courseLink
df = pd.DataFrame.from_dict(scrapeData)
#df.style.set_properties(subset=['description'], **{'width': '10px'})
#group by course ID. 
#df.groupby(df['course ID'])
#df.set_index('course ID')

df.columns = ['ID', 'Name', 'CRN', 'Instructor', 'Insturctor Website', 'Time', 'Duration', 'Open Seats', 'Status', 'Description', 'Course Link']
df
# output_file = 'SP21Courses.xlsx'
# # saving the excel 
# df.to_excel(output_file) 
# print('DataFrame is written to Excel File successfully.')

In [None]:
# search for course by instructor:
def searchByInstructor(name):
    # take user input, format it so that first letter is capitalized. 
    print(name)
    name_formatted = name.lower().title()
    print(name_formatted)
    taught_by = df[df['Instructor'].str.contains(name_formatted)]
    display(taught_by)

In [None]:
#searach by class ID
def searchCourseID(id):
    course_by_id = df[df['ID'] == id]
    display(course_by_id)

In [None]:
#search for course by CRN
def searchByCrn(crnNum):
    course_by_crn = df[df['CRN'] == crnNum]
    display(course_by_crn)

In [None]:
# prompt user for further action after every search. 
def searchAgain():
    userInput = input('''Do you wish to search again? y for Yes and n for No''')
    if userInput == 'y':
        search()
    elif userInput == 'n':
        print('Coolio. Have a great semester!')
    else:
        print('Please enter y or n')
        searchAgain()

In [None]:
def search():
    search_by = input('''How do you wish to search for a class:
a. by instructor   b. by CRN   c. by course ID    ''')
    if search_by == 'a' or 'A':
        instructor_name = input('What is the full name of the instructor? i.e. Sven Travis    ')
        searchByInstructor(instructor_name)   
    elif search_by == 'b' or 'B':
        print("I chose b!!!!!")
        crn = input('What is CRN number for the course? i.e. 3011    ')
        searchByCrn(crn)
    elif search_by == 'c' or 'C':
        course_id = input('What is couse ID? i.e. PGTE 5300    ')
        searchCourseID(course_id)
    else:
        print("Sorry Please enter a valid string")   
    searchAgain()

In [None]:
search()