# Presentation: Analysis of PolyRatings Reviews

 <center> <h4> Robert Hensley - DATA 301 Final Project </h4> </center>

<img src="img/website.jpg" />

## I. Purpose / Analysis

**What is Polyratings? (for the uninitiated)**

Polyratings is a professor review platform for students attending Cal Poly San Luis Obispo. It’s similar to the website [RateMyProfessors](http://www.ratemyprofessors.com) in that professors are given an overall score based on the class’s difficulty and how the material is presented, but it is more minimalistic (with fewer rating categories and no advertisements). The site has reviews dating back to June 1999 (a month after the creation of RateMyProfessors) and has been generating a consistent number of reviews since the mid-2000s.


**Is Polyratings Dead?**

Within the last year, Polyratings has appeared dormant in terms of moderator support. The last significant change to the website occurred in 2016, when a new, polished design was introduced. However, there haven’t been any significant updates to the website since this redesign. In fact, although users are still able to write reviews about existing professors, the site is no longer accepting new professor submissions. There is no way to contact the current moderators of the site, as the email for reporting bugs (errors@polyratings.com) appears to be down. These factors could have contributed to a lower number of reviews around 2017.
- [Polyratings revamped site (September 2016)](https://www.reddit.com/r/CalPoly/comments/53j9uu/polyratings_got_a_massive_design_update_check_it/)
- [Discussion about the lack of moderator support (One Month Ago)](https://www.reddit.com/r/CalPoly/comments/8gu309/scouting_profsno_info_on_polyratings/)
- [Discussion on the state of Polyratings (Last Week)](https://www.reddit.com/r/CalPoly/comments/8pn0t0/is_there_any_way_to_fix_polyratings/)

Polyratings is far from dead despite the lack of moderator support. People still add new reviews to existing teacher profiles. There are even new tools, such as the Chrome extension [PassThePlebs](https://chrome.google.com/webstore/detail/pass-the-plebs/mhglgbabaleaegjhdcmfffkaaklpmjog?hl=en), that take advantage of Polyratings' reviews to help students register for the classes they want.


To ensure that the site doesn't become completely dormant, there needs to be greater support from the moderators. However, Polyratings is a student-run website, so moderation can be especially time-consuming for a site serving an undergraduate population of roughly 20,000. Moderation includes checking new professor requests as well as removing reviews that are spam or inappropriate. So, for this data science project, I sought to create a machine learning model that would detect and flag bad reviews in order to streamline the moderating process. This model would use existing reviews as a basis for detecting reviews in a similar category.

**Site Use Over Time**

\* Notice the spike in reviews written around late 2016. This was around the time the website was redesigned (September 2016), which probably revitalized use of the site.


<img src="img/pr_usage.png" > <img src="img/pr_usage_bar.png" >

## II. Collection

Polyratings is a relatively small website in terms of popularity, given that it's a review site specific to one school (a Google search for "Polyratings" yields only 11,300 results). Therefore, it is understandable that the website does not come with a developer API.

In order to collect reviews, I created a web scraping function with Beautiful Soup, iterating through the Polyratings directory of teacher profiles and scraping each profile individually. The HTML was a bit messy, but fortunately its structure was consistent across profiles. So I read each page's HTML line by line into an array and used offsets (indexes) into that array to collect the following information:

- Class Information:
    - class (full name)
    - subject abbreviation
    - class number (or level, stored as an int)


- Review Information:
    - the review contents (as a large string)
    - the month of the review (string)
    - the year of the review (as an int)


- Student Information:
    - grade in the class
        - stored as a string (to account for credit/no credit/withdrawn classes)
        - also stored as a float (gpa) with the corresponding grade points:
            - A: 4.0
            - B: 3.0
            - C: 2.0
            - D: 1.0
            - F: 0.0
            - Withdrawn/Credit/No Credit: NaN
    - the reviewer’s academic standing (as an integer and string)
        - Freshman: 0
        - Sophomore: 1
        - Junior: 2
        - Senior: 3
        - 5th/6th Year Senior: 4
        - Graduate Student: 5
    - major of student (whether the class is required for the student's major, a G.E., or an elective)


- Teacher Information:
    - Name
    - Field of Study
    - Material Presentation Score (float)
    - Overall Rating (float)
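
The grade and standing encodings listed above can be sketched as two lookup tables. This is a minimal illustration rather than the scraper's actual code: the function names are hypothetical, and the exact label strings used on the site are assumed to match the list above.

```python
import math

# Grade points as listed above; any other grade (withdrawn, credit,
# no credit) has no grade-point value and maps to NaN.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# Academic standing encoded as an ordinal integer.
# The label strings are assumed to match the site's wording.
STANDING_RANK = {
    "Freshman": 0,
    "Sophomore": 1,
    "Junior": 2,
    "Senior": 3,
    "5th/6th Year Senior": 4,
    "Graduate Student": 5,
}


def grade_to_gpa(grade: str) -> float:
    """Return the grade points for a letter grade, or NaN if it has none."""
    return GRADE_POINTS.get(grade.strip().upper(), math.nan)


def standing_to_rank(standing: str) -> int:
    """Return the integer rank for an academic standing label."""
    return STANDING_RANK[standing.strip()]


print(grade_to_gpa("B"), standing_to_rank("Sophomore"))
```

Keeping both the string and the numeric form means the credit/no credit cases stay readable while the numeric columns remain usable as model features.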


Some of the offsets were fixed, such as those for the teacher’s name and score (they remained the same across every HTML file). Others varied depending on the number of classes the teacher offered and the lengths of the reviews. To collect this data from the page, I used delimiters to separate the content. For example, when reading in a review, I could tell that I had reached its end if the next line was the name of a class or the school standing of the next reviewer.
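
The delimiter idea can be sketched in a few lines. This is only an illustration under assumptions: it presumes the page text has already been split into a list of stripped lines, the helper names are hypothetical, and the toy page layout below is invented rather than taken from a real Polyratings page.

```python
# Standing labels that mark the start of the next reviewer's entry
# (assumed spellings).
STANDINGS = {
    "Freshman", "Sophomore", "Junior", "Senior",
    "5th/6th Year Senior", "Graduate Student",
}


def is_delimiter(line, class_names):
    """A review ends when the next line names a class or a standing."""
    return line in class_names or line in STANDINGS


def collect_review(lines, start, class_names):
    """Join lines from `start` up to the next delimiter.

    Returns the review text and the index of the delimiter line.
    """
    parts = []
    i = start
    while i < len(lines) and not is_delimiter(lines[i], class_names):
        parts.append(lines[i])
        i += 1
    return " ".join(parts), i


# Toy page: the teacher name sits at a fixed offset (index 0),
# while the review length is variable.
page = [
    "Snape, Severus",
    "CHEM 101",
    "Freshman",
    "Great lectures,",
    "hard exams.",
    "Sophomore",
    "I miss him.",
]
teacher_name = page[0]
review, next_index = collect_review(page, 3, {"CHEM 101"})
print(teacher_name, "|", review, "|", next_index)
```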

Here’s an example of what one of my DataFrames looked like. This is the profile contents of the revered Chemistry teacher [Professor Snape](http://polyratings.com/eval.php?profid=3485):

<center> <img src="img/snape.jpg" /> </center>

```python
import pandas as pd

# Load the scraped profile for one professor
pd.read_csv("polyratings_profiles/snape_severus.csv")
```

| Unnamed: 0.1 | Unnamed: 0 | class | class_abrv | class_number | review_content | review_month | review_year | student_gpa | student_grade | student_major | student_rank | student_standing | teacher_difficulties | teacher_field | teacher_name | teacher_presentaion | teacher_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | CHEM 101 | CHEM | 101 | Snape, Snape, Severus Snape. | May | 2014 | 3.0 | B | General Ed | 0 | Freshman | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 1 | 1 | CHEM 110 | CHEM | 110 | This professor is the most amazing teacher in ... | Mar | 2014 | 4.0 | A | Elective | 0 | Freshman | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 2 | 2 | CHEM 124 | CHEM | 124 | I tried really hard in his class but he seemed... | May | 2015 | 1.0 | D | Required (Major) | 0 | Freshman | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 3 | 3 | CHEM 124 | CHEM | 124 | Very cramped handwriting, difficult to read on... | Jan | 2018 | 2.0 | C | Required (Support) | 0 | Freshman | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 4 | 4 | CHEM 125 | CHEM | 125 | Coming in to the class I had heard some things... | Jan | 2015 | 2.0 | C | Elective | 1 | Sophomore | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 5 | 5 | CHEM 125 | CHEM | 125 | I really wanted to get him for defense against... | May | 2016 | 3.0 | B | Required (Major) | 1 | Sophomore | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 6 | 6 | CHEM 129 | CHEM | 129 | Prof. Snape knows his chemistry, but seems per... | Jun | 2015 | 2.0 | C | Required (Support) | 0 | Freshman | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 7 | 7 | CHEM 202 | CHEM | 202 | Professor Snape is a fantastic professor. Bare... | Jan | 2018 | 4.0 | A | General Ed | 1 | Sophomore | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 8 | 8 | CHEM 211 | CHEM | 211 | He can teach you how to bewitch the mind, and ... | Mar | 2014 | 3.0 | B | General Ed | 0 | Freshman | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |
| 9 | 9 | CHEM 216 | CHEM | 216 | I miss him. Always. | Jan | 2016 | 3.0 | B | Required (Major) | 1 | Sophomore | 2.33 | Chemistry and Biochemistry | Snape, Severus | 3.33 | 3.17 |


**Error Handling**

I wrote a CSV for each professor when scraping the content before concatenating the CSVs into one large CSV. Although this made the collection process somewhat longer, it allowed me to isolate web-scraping errors. Problems I encountered included:
- invalid profile pages (pages that initially existed but were taken down)
- teachers that had N/A ratings (there were few of these, I filtered them out because they would not be useful for my machine learning model)
- ‘\r’, the carriage return character
    - this caused some of my DataFrame rows to become fragmented and unreadable
    - I replaced these characters with a blank character before adding the string to the DataFrame
- pages with only one review had slightly different offsets
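
Two of the fixes above are simple enough to sketch directly. The helper names here are hypothetical, and the exact placeholder text for a missing rating (`"N/A"`) is an assumption based on the description above.

```python
def clean_review(text: str) -> str:
    """Replace carriage returns with a space so a review stays on one row."""
    return text.replace("\r", " ")


def has_rating(raw_rating: str) -> bool:
    """Return False for placeholder ratings such as 'N/A' (assumed spelling)."""
    return raw_rating.strip().upper() != "N/A"


print(repr(clean_review("Great class.\rTough exams.")))
print(has_rating("3.17"), has_rating("N/A"))
```

Cleaning the string before it is added to the DataFrame keeps the stray character out of the written CSV, so the file can be read back without fragmented rows.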

**Statistics**

- the process of collecting and merging all of the reviews took approximately 30 minutes
    - I had to sleep for 0.1 seconds between profile scrapes because, without this delay, I initially flooded the website with requests
    
    
- there are 2,478 visible profiles
- I was able to collect 65,020 reviews from Polyratings (creating a 44.1 MB CSV file)
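
A throttled scraping loop along the lines described above might look like the following. This is a sketch under assumptions: `scrape_all` and the `scrape_profile` callback are hypothetical names, and the real scraper's error handling may have differed.

```python
import time
from typing import Callable, Dict, Iterable


def scrape_all(
    profile_ids: Iterable[int],
    scrape_profile: Callable[[int], object],
    delay: float = 0.1,
) -> Dict[int, object]:
    """Scrape each profile, pausing `delay` seconds between requests."""
    results = {}
    for pid in profile_ids:
        try:
            results[pid] = scrape_profile(pid)
        except Exception as exc:  # e.g. a profile page that was taken down
            print(f"skipping profile {pid}: {exc}")
        time.sleep(delay)  # avoid flooding the site with requests
    return results


# Stand-in for the real scraper: profile 2 simulates an invalid page.
def fake_scrape(pid: int) -> str:
    if pid == 2:
        raise ValueError("invalid profile page")
    return f"profile-{pid}"


print(scrape_all([1, 2, 3], fake_scrape, delay=0.0))
```

At 0.1 seconds per profile, the pause itself accounts for only about four minutes across 2,478 profiles, so most of the roughly 30 minutes was presumably spent on the requests and parsing.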

## III. Machine Learning  

**KMeans Models**


To create a model to flag material that was potentially spam or unproductively toxic, I needed a training set that defined different types of reviews. Given that there are over 65,000 reviews, it would be nearly impossible (and impractical) to read and filter each review. Flagging tools could be implemented on the site to allow users to define reviews they feel are spam/un-useful, but in lieu of this information, I used KMeans clusters to determine review categories.


To create KMeans clusters, I collected term frequencies of the reviews, weighted with an **IDF** (inverse document frequency) metric. Because there’s a significant vocabulary of words across all of the reviews, I could not create a bag of words covering every review due to system constraints. So, through trial and error, I was able to create a TF-IDF data frame from a random sample of 1/20th of the reviews from Polyratings (roughly 3,250 reviews with 6,349 unique words). Of the sample, these were the top 20 words by term frequency (TF) count and by TF-IDF score:
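
To make the weighting concrete, here is a small hand-rolled sketch of sampling followed by TF-IDF scoring. It uses the textbook formula tf × log(N / df) with whitespace tokenization; the exact TF-IDF variant and tokenizer used in the project are not stated, so treat both (and the function names) as assumptions.

```python
import math
import random
from collections import Counter


def sample_reviews(reviews, fraction=1 / 20, seed=0):
    """Draw a reproducible random sample containing `fraction` of the reviews."""
    rng = random.Random(seed)
    k = max(1, round(len(reviews) * fraction))
    return rng.sample(reviews, k)


def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {word: count * math.log(n_docs / df[word]) for word, count in tf.items()}
        )
    return weights


docs = ["great teacher great class", "hard class", "great lectures"]
for doc, scores in zip(docs, tf_idf(docs)):
    print(doc, "->", {word: round(score, 3) for word, score in scores.items()})
```

Words that appear in many reviews (here "great" and "class") are weighted down relative to rarer words, which is why the top TF and top TF-IDF lists differ.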


<img src="img/TF_top.png" /> <img src="img/TF_IDF_top.png" />


I then trained KMeans cluster models on the TF-IDF data of the sample. I did not have an explicit number of groups in mind for categorizing the reviews, so I created five different scikit-learn KMeans models with cluster counts from n = 2 to n = 6. These are the distributions of clusters for each KMeans model:


<img src="img/model_2.png" /> <img src="img/model_3.png" /> <img src="img/model_4.png" /> <img src="img/model_5.png" /> <img src="img/model_6.png" />
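
The model sweep described above can be sketched with scikit-learn as follows. The eight reviews are invented stand-ins for the real sample, and the `n_init` and `random_state` settings are assumptions chosen for reproducibility, not the project's actual parameters.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in reviews; the real input was the sampled review set.
sample = [
    "great lectures and fair exams",
    "lectures were clear and exams were fair",
    "worst class ever avoid",
    "avoid this class the worst",
    "easy grader fun projects",
    "fun projects and an easy grader",
    "homework is long but helpful",
    "long homework but helpful office hours",
]

# Rows are reviews, columns are words, values are TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform(sample)

cluster_sizes = {}
for n in range(2, 7):  # n = 2 through n = 6
    model = KMeans(n_clusters=n, n_init=10, random_state=0).fit(tfidf)
    # Count how many reviews landed in each cluster.
    cluster_sizes[n] = Counter(model.labels_.tolist())
    print(n, sorted(cluster_sizes[n].values(), reverse=True))
```

Comparing the size distributions across values of n is a quick way to see whether one cluster absorbs most reviews or whether distinct groups emerge.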


### Auto-Moderator Learning Model (K-nearest neighbors)

Unfortunately, after examining some of the groups generated by the KMeans models, I could not find groups that stood out as definitively spam, inappropriate, or uselessly toxic. Therefore, I did not have a labeled target variable for predicting types of reviews. However, with proper spam flagging from users, I would have a basis for creating a model to predict categories of reviews. It would have been a K-nearest neighbors machine learning model, and it is summarized below. I planned to use TF-IDF words, classes (by their abbreviation and level), student standing, and the student's grade (as GPA) as factors to predict the category of a review.

Factors Used in the Model (X):
- review_content (in the form of TF-IDF words)
- class_abrv (varies by major)
- class_number (difficulty generally increases with course level)
- student_standing 
- student_gpa

Predicting (Y):
- category of message (spam / un-useful / regular)
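
A rough sketch of how such a model could be wired together with scikit-learn is shown below. The model was never built, so everything here is hypothetical: the labeled rows are invented (no such labels exist in the scraped data), `student_rank` stands in for the integer form of student standing, and the preprocessing choices are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical labeled rows, invented purely for illustration.
train = pd.DataFrame({
    "review_content": [
        "clear lectures and fair exams",
        "buy cheap watches at this link",
        "lol",
        "tough grader but I learned a lot",
        "click here for free followers",
        "meh",
    ],
    "class_abrv": ["CHEM", "CHEM", "MATH", "MATH", "CSC", "CSC"],
    "class_number": [124, 125, 141, 142, 101, 202],
    "student_rank": [0, 1, 0, 1, 2, 3],
    "student_gpa": [3.0, 4.0, 2.0, 3.0, 4.0, 1.0],
    "category": ["regular", "spam", "un-useful", "regular", "spam", "un-useful"],
})

features = ColumnTransformer([
    # Review text becomes TF-IDF word weights.
    ("text", TfidfVectorizer(), "review_content"),
    # Subject abbreviation becomes one indicator column per subject.
    ("subject", OneHotEncoder(handle_unknown="ignore"), ["class_abrv"]),
    # Numeric columns are scaled so no single one dominates the distance.
    ("numeric", StandardScaler(), ["class_number", "student_rank", "student_gpa"]),
])

model = make_pipeline(features, KNeighborsClassifier(n_neighbors=1))
model.fit(train.drop(columns="category"), train["category"])

new_review = pd.DataFrame([{
    "review_content": "free followers click this link",
    "class_abrv": "CSC",
    "class_number": 101,
    "student_rank": 2,
    "student_gpa": 4.0,
}])
print(model.predict(new_review))
```

Rows with a NaN GPA (withdrawn, credit, or no credit) would need to be dropped or imputed before fitting, since the scaler and the distance computation cannot handle missing values.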


<center> <img src="img/logo_p_small.png" /> </center>