# CSC 522 Group 8 Project

## Professor Insights: Clustering Quality and Difficulty Across NCSU Colleges

This project is a data science project that uses the [RateMyProfessors](https://www.ratemyprofessors.com/) website to scrape data on professors and their ratings. The data is then used to cluster the professors based on their quality and difficulty ratings, and identify similarities and differences of the rating distributions across colleges.

### 1. Data Aquisition
This cell scrapes the most up-to-date data on all NCSU professors on RateMyProfessors and saves the data to a JSON file. This takes less than 30 seconds to execute, and does not need to be re-run if the file already exists.

In [None]:
import json
from ratemyprofessors import RateMyProfessorsAPI


api = RateMyProfessorsAPI()
school = api.search_school("NCSU")

professors = []

# Get all NCSU professors on RateMyProfessors
cursor = ""
while True:
    result = api.search_teachers(school['id'], "", limit=1000, cursor=cursor)
    if not result['teachers']:
        break
    cursor = result['end_cursor']
    for professor in result['teachers']:
        del professor['school']
    professors.extend(result['teachers'])

print(len(professors), 'professors fetched')

# Filter out professors with no ratings
professors = list(filter(lambda professor: professor['num_ratings'] > 0, professors))

print(len(professors), 'professors with ratings')

# Sort by most ratings in descending order    
professors.sort(key=lambda professor: professor['num_ratings'], reverse=True)

with open("data/professors.json", "w") as file:
    json.dump(professors, file, indent=2)

This cell scrapes distributions for every course from the gradient database, calculating the average GPA for each section. It is designed to be able to be computed in multiple executions in case the authorization tokens expire before the scraping is complete. This process takes multiple hours to complete.

To execute this cell, you must have valid authentication headers in a file named `gradient-headers.txt`.

In [7]:
import json
from gradient import GradientAPI


with open("data/colleges.json") as file:
    colleges = json.load(file)

try:
    with open("data/distributions.json", "r") as file:
        distributions = json.load(file)
        last_subject, last_course = distributions[-1]["courseName"].split()[:2]
except:
    distributions = []
    last_subject = last_course = None


api = GradientAPI(request_delay=2)

found = False
try:
    for college in colleges:
        for subject in colleges[college]["subjects"]:
            # Check if we have already fetched this subject
            if last_subject and not found:
                if subject != last_subject:
                    continue
                found = True
            
            for course_distributions in api.get_subject_distrubutions(subject, last_course if subject == last_subject else None):
                if "individual" not in course_distributions:
                    continue
                
                for section in course_distributions["individual"]:
                    a, b, c, d, f, s, u, w = (section["grades"][grade]["raw"] for grade in ["A", "B", "C", "D", "F", "S", "U", "W"])
                    total = a + b + c + d + f + s + u + w
                    gpa = round((a * 4 + b * 3 + c * 2 + d + s * 3) / total, 2) if total else 0
                    
                    section["gpa"] = gpa
                    section["total"] = total
                    section["college"] = college
                    del section["grades"]
                    del section["googleChart"]
                
                distributions.extend(course_distributions["individual"])
except:
    # gracefully handle exceptions or interruptions
    pass
finally:
    with open("data/distributions.json", "w") as file:
        json.dump(distributions, file, indent=2)

### 2. Data Preprocessing

This cell processes the data and tags each professor with their college using a file generated by an LLM and manually validated that maps college names to department names. It then combines duplicate professors (same name and same department) into a single entry with a weighted average of their ratings.

In [None]:
import json
import difflib


with open("data/professors.json", "r") as file:
    professors = json.load(file)
    
with open("data/colleges.json", "r") as file:
    colleges = json.load(file)


# Construct a map from department to college
department_map = {department: college for college, departments in colleges.items() for department in departments}

# Fuzzy search for the best department match and assign the college
for professor in professors:
    department = professor['department']
    result = difflib.get_close_matches(department, department_map.keys(), cutoff=0.75)
    professor['college'] = department_map[result[0]] if result else None


# Combine duplicate entries of the same professor that belong to the same college
names = set()
duplicates_names = []

for professor in professors:
    name = professor['name']
    if name in names and name not in duplicates_names:
        duplicates_names.append(name)
    names.add(name)

duplicates = {}
for professor in filter(lambda professor: professor['name'] in duplicates_names, professors):
    key = (professor['name'], professor['college'])
    if key not in duplicates:
        duplicates[key] = professor
    else:
        # Take weighted average of two entries
        n1, n2 = duplicates[key]['num_ratings'], professor['num_ratings']
        total = n1 + n2
        avg1, avg2 = duplicates[key]['avg_rating'], professor['avg_rating']
        take1, take2 = duplicates[key]['would_take_again'], professor['would_take_again']
        diff1, diff2 = duplicates[key]['avg_difficulty'], professor['avg_difficulty']
        
        duplicates[key]['num_ratings'] = total
        duplicates[key]['avg_rating'] = round((avg1 * n1 + avg2 * n2) / total, 1)
        duplicates[key]['would_take_again'] = round((take1 * n1 + take2 * n2) / total, 1)
        duplicates[key]['avg_difficulty'] = round((diff1 * n1 + diff2 * n2) / total, 1)
    

professors = list(filter(lambda professor: professor['name'] not in duplicates_names, professors)) + list(duplicates.values())


with open("data/professors.json", "w") as file:
    json.dump(professors, file, indent=2)


This cell processes and aggregates the gradient data for all professors that exist on ratemyprofessor.

In [36]:
import json
import difflib

with open("data/professors.json", "r") as file:
    professors = {professor["id"]: professor for professor in json.load(file)}  # {id: {professor info}}
    professor_names = [professor["name"].lower() for professor in professors.values()]
    
with open("data/filtered_distributions.json", "r") as file:
    distributions = json.load(file)

# To be filled in manually and then converted to dataframe (and then stored as .csv)
professors_info = {"Name": [], "College": [], "Quality Score": [], "Difficulty Score": [], "GPA": [], "Would Take Again": [], "Number of Ratings": []}

name_map = {}
not_found = set()
collisions = 0
for section in distributions:
    full_name = " ".join(section["instructorName"].split(",", 1)[::-1])

    # drop middle name and roman numeral and phd
    split_name = full_name.split()
    if all(c in 'IV' for c in split_name[-1].upper()) or split_name[-1].lower() == 'phd':
        name = ' '.join([split_name[0], split_name[-2]])
    else:
        name = ' '.join([split_name[0], split_name[-1]])
    result = difflib.get_close_matches(name.lower(), professor_names, n=1, cutoff=0.89)
    if not result:
        not_found.add(full_name)
        continue
    
    result = result[0]
    if result in name_map.keys() and full_name not in name_map[result]:
        name_map[result].append(full_name)
        collisions += 1
    else:
        name_map[result] = [full_name]

print(f'Collisions: {collisions}\n')
print(f'Not found: {len(not_found)} items')
print(not_found)
print()
print([(key, value) for (key, value) in name_map.items() if len(value) > 1])


    # TODO: join the ratemyprofessors and gradient data into a single json/csv, drop professors that don't exist in both


Collisions: 35

Not found: 557 items
{'Maria Davis', 'David Oh', 'Lindsey Hubbard', 'Jagannadham Kasichainula', 'Christopher Salerno', 'Wayne Place', 'Elizabeth Riley', 'Min   Yang', 'Guido van der Hoeven', 'John S King', 'Christopher Cummings', 'Billy M Williams Jr', 'Michael P Lewis', 'Michael A Stanko', 'Sabrina Spencer', 'Aileen Rodriguez', 'Hatice Orun', 'Daniel Quintanilla Lopez', 'Melissa Brandon', 'Mally   Dietrich', 'Robert P. Patterson', 'Lirong Xiang', 'Arthur R Rice', 'Katherine Greder', 'Benjamin Shane Underwood', 'Paul Tesar', 'Allison Medlin', 'Zixuan Cang', 'Robert H Martin Jr', 'Matthew   Burnette', 'Matt Reynolds', 'Andrew Weaver', 'Marty Martin', 'Alexander Betz', 'Eiko Tai', 'Medwick V Byrd Jr', 'Rachel N Harris', 'Chris Halweg', 'Richard Winters', 'Joe Johnson', 'Elizabeth Anne Wilson', 'Ragan Glover-Rijkse', 'Rong Liu', 'Michelle R Jones', 'William Fortney', 'Katherine Medlin', 'Ian   Sullivan', 'John Elliott Forbes', 'William F Tolbert', 'Charles Ernest Knowles',

In [44]:
# special_cases = ["Longerbeam II,Nick Alexander", "Johnson,Joseph A", "Ro,Paul I", "Webster,Zo"]
# special_cases = ['Webster,Zo']
# for i in special_cases:
#     not_found.remove(i)
relevant_sections = [sect for sect in distributions if " ".join(sect["instructorName"].split(",", 1)[::-1]) not in not_found]
print(f'Number of found items: {len(relevant_sections)}')

with open("further_filtered_distributions.json", "w") as file:
    json.dump(relevant_sections, file, indent=2)

Number of found items: 61211


### 3. Model Processing
This stage involves the application of a tuned clusternig algorithm to the data. The clustering algorithm will group similar data points together, and evaulate the homogeniety of the classes of the clusters (the similiarity of professors in the same departments). This step allows us to identify patterns and relationships within the data.

In [None]:
# TODO: Decide on clustering aglorithm and tuning approach
# TODO: Perform tuned clustering on professor data points
# TODO: Calculate homogeniety of clusters

### 4. Visualization
This stage involves generating visuals to analyze and find patterns in the data.

In [None]:
# TODO: Visualize data points and rating distributions for each college individually
# TODO: Visualize aggregation + average of data points and rating distributions for all colleges
# TODO: Visualize 

### 5. Interpretation
This stage involves computing metrics to drive insights about the data.

In [None]:
# TODO: Calculate and compare weighted averages of ratings for each college, compare to global averages
# TODO: Calculate metrics like standard deviation, median, outliers, for each college and globally
# TODO: Compare expected salary with average ratings for each college