# Cheddar News Senior Software Developer
### Data Analysis and Insight
### Greg Salmon

![image](cheddarnews.jpeg)

# $500,000

## What does Cheddar do?
- News services
- Acquired RateMyProfessors in 2018

## What would I do for Cheddar?
- Senior Software Developer
- Build and grow RateMyProfessors' professor recommendation service
- Improve architecture, process and workflows

- Keywords: MySQL, AWS, Data query and manipulation

## Data Collection

## What data should I look at?
- **People Data**

## What people data do I have access to?
- **Rate My Professor**
- **LMU PROWL**

### Installing Selenium

In [1]:
!pip install selenium



### Importing Elements to use

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
from bs4 import BeautifulSoup
import re
import time
import pandas as pd
from sqlalchemy import create_engine
import random
import requests

### Setting the Driver

In [4]:
# Specify the path to the chromedriver executable
chromedriver_path = '/Users/gsalmon/Documents/chromedriver_mac64/chromedriver'

# Create a Service object for the chromedriver
service = Service(executable_path=chromedriver_path)

# Create a new instance of the Chrome browser
driver = webdriver.Chrome(service=service)

## Rate My Professor Scrape

In [None]:
# Navigate to the web page
url = 'https://www.ratemyprofessors.com/search/teachers?query=*&sid=538'
driver.get(url)

In [None]:
# Creating Soup Element from the page currently loaded in the driver
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

In [None]:
# Creating Empty Dictionary containing teacher details
teacher_details = {
    'name':[],
    'department':[],
    'rating':[],
    'count_ratings':[],
    'difficulty':[],
    'would_retake':[]
}

In [None]:
# Generating all of the HTML for the webpage
button = driver.find_element(By.XPATH, '//*[@id="root"]/div/div/div[4]/div[1]/div[1]/div[4]/button')
for i in range(273):
    ActionChains(driver)\
        .click(button)\
        .perform()
    time.sleep(3)
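
The loop above clicks a fixed 273 times, a count that only fits the number of instructors listed when the scrape was run. The following is a minimal sketch of a loop that stops once the button can no longer be found. `click_until_gone`, `find_button`, and `FakePager` are hypothetical names introduced for illustration. With Selenium, `find_button` could wrap `driver.find_element` and return `None` when `NoSuchElementException` is raised, but that wiring is an assumption and is not shown here.

```python
import time


def click_until_gone(find_button, max_clicks=500, pause=0.0):
    """Click the button returned by find_button() until it is no longer found.

    find_button: callable returning an object with .click(), or None when
    there is nothing left to load. Returns the number of clicks made.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()  # re-locate each time to avoid stale references
        if button is None:
            break
        button.click()
        clicks += 1
        time.sleep(pause)  # give the page time to load the next batch
    return clicks


class FakePager:
    """Stand-in for a page with a fixed number of batches left to load."""

    def __init__(self, batches):
        self.remaining = batches

    def click(self):
        self.remaining -= 1

    def find(self):
        return self if self.remaining > 0 else None


pager = FakePager(batches=5)
total_clicks = click_until_gone(pager.find)
print(total_clicks)
```

The `max_clicks` cap keeps the loop from running forever if the button never disappears.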

In [None]:
# Setting the HTML and soup variable to scrape
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

In [None]:
# Scraping each attribute for each teacher
teachers = soup.find_all('a', {'class':'TeacherCard__StyledTeacherCard-syjs0d-0 dLJIlx'})
x = 1
for teacher in teachers:
    print(x)
    name = teacher.find('div', {'class':'CardName__StyledCardName-sc-1gyrgim-0 cJdVEK'})
    teacher_details['name'].append(name.text)
    print('Name:', name.text)

    department = teacher.find('div', {'class':'CardSchool__Department-sc-19lmz2k-0 haUIRO'})
    teacher_details['department'].append(department.text)
    print('Department:', department.text)

    rating = teacher.find('div', {'class':re.compile('^CardNumRating__CardNumRatingNumber-sc-17t4b9u-2')})
    teacher_details['rating'].append(rating.text)
    print('Rating:', rating.text)
    
    count_ratings = teacher.find('div', {'class':re.compile('^CardNumRating__CardNumRatingCount-sc-17t4b9u-3')})
    teacher_details['count_ratings'].append(count_ratings.text[:-8])
    print('Rating Count:', count_ratings.text[:-8])

    difficulty_elements = teacher.find_all('div', {'class':re.compile('^CardFeedback__CardFeedbackNumber-lq6nix-2')})
    difficulty = difficulty_elements[1]
    teacher_details['difficulty'].append(difficulty.text)
    print('Difficulty:', difficulty.text)

    would_retake = difficulty_elements[0]
    teacher_details['would_retake'].append(would_retake.text)
    print('Would Retake:', would_retake.text)
    
    x = x + 1

    print('-' * 50)
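
The `[:-8]` slice above assumes every count ends in exactly eight characters (`" ratings"`). As a hedged alternative, the sketch below pulls the number out with a regular expression and converts a percentage string to a fraction. `parse_rating_count` and `parse_percent` are hypothetical helpers, and the input formats (`"124 ratings"`, `"1 rating"`, `"74%"`, `"N/A"`) are assumptions about what the cards display.

```python
import re


def parse_rating_count(text):
    """Return the leading integer from text such as '124 ratings', else None."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


def parse_percent(text):
    """Convert '74%' to 0.74; return None for values such as 'N/A'."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*%\s*", text)
    return float(match.group(1)) / 100 if match else None


print(parse_rating_count("124 ratings"))
print(parse_rating_count("1 rating"))
print(parse_percent("74%"))
print(parse_percent("N/A"))
```

Returning `None` for unparseable values keeps missing data distinct from zero when the CSV is later loaded into a database.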

In [None]:
# Assigning teacher_details to a dataframe
teacher_df = pd.DataFrame(teacher_details)

In [None]:
# Checking the data for completeness
teacher_df.head()

In [None]:
# Checking the data for completeness
teacher_df.tail()

In [None]:
# Turning the dataframe into a CSV file
teacher_df.to_csv('teacher_details.csv', index=False)

## Prowl Scrape

In [None]:
# Resetting url and driver
url = 'https://bannerxe.lmu.edu/StudentRegistrationSsb/ssb/classSearch/classSearch'
driver.get(url)

In [None]:
# Setting soup variable from the newly loaded page
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

In [None]:
# Finding the table in the html
table = soup.find('table', id='table1')

In [None]:
# Creating field titles by scraping each of the headers
headers = []
for i in table.find_all('th'):
    title = i.text
    print(title)
    headers.append(title)

In [None]:
# Create a dataframe
mydata = pd.DataFrame(columns = headers)

In [None]:
# Create a for loop to fill mydata and paginate
# Filling with all data in table
for page in range(1,60):
    time.sleep(10)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', id='table1')
    for j in table.find_all('tr')[1:]:
        row_data = j.find_all('td')
        row = [i.text for i in row_data]
        length = len(mydata)
        mydata.loc[length] = row
    button = driver.find_element(By.CSS_SELECTOR, '#searchResultsTable > div.bottom.ui-widget-header > div > button.paging-control.next.ltr.enabled')
    button.click()

In [None]:
# Checking length of data scraped
len(mydata)

In [None]:
# Viewing data for completeness
mydata

In [None]:
# Converting data to CSV
mydata.to_csv('prowl_data.csv', index=False)

In [6]:
# Quitting driver
driver.quit()

# SQL Analysis
Greg Salmon

### Importing pandas and create_engine for SQL Analysis

In [1]:
import pandas as pd
from sqlalchemy import create_engine

In [14]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### MySQL Login

In [16]:
%sql mysql://USERNAME:PASSWORD@HOST/DATABASE

## Queries

### Descriptive Analytics

The first thing I want to look into is how many instructors there are on Rate My Professors, along with their average rating, difficulty, count of ratings, and would-retake probability.

## Count of Instructors on RMP with Averages

In [6]:
%%sql
SELECT 
    COUNT(Instructor) AS CountInstructors,
    ROUND(AVG(Rating), 2) AS AvgRating,
    ROUND(AVG(Difficulty), 2) AS AvgDifficulty,
    ROUND(AVG(CountRatings), 2) AS AvgCountRatings,
    ROUND(AVG(WouldRetake), 2) AS AvgWouldRetake
FROM rmp_ratings rr;

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
1 rows affected.


CountInstructors,AvgRating,AvgDifficulty,AvgCountRatings,AvgWouldRetake
1396,3.33,2.52,12.79,0.74


Something I am interested in is the number of instructors shown. I wonder whether Prowl will have a similar number. I am worried that not all of the data was carried over correctly.

## Unique Instructors and Subjects on Prowl

In [7]:
%%sql
SELECT 
    COUNT(DISTINCT(InstructorID)) AS CountInstructors,
    COUNT(DISTINCT(Subject)) AS CountSubjects
FROM classes c;

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
1 rows affected.


CountInstructors,CountSubjects
765,97


Interesting. Comparing the number of instructors actually teaching a class in Prowl with the professors in RMP shows that RMP has significantly more (1396 versus 765). Two possible explanations come to mind:
1) RMP does not remove teachers who no longer teach at the school.

2) Teachers may appear under different names or have changed names.

It may be useful to look into this aspect of the data.
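
One way to probe the name-mismatch explanation is to normalize names before comparing the two sources. This is only a sketch: `normalize_name` is a hypothetical helper, the names are made up, and the rule of keeping just the first and last token is a simplifying assumption that can merge distinct people.

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and keep first and last tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = "".join(
        ch if ch.isalpha() or ch.isspace() else " " for ch in without_accents
    )
    tokens = cleaned.lower().split()
    if len(tokens) <= 2:
        return " ".join(tokens)
    return f"{tokens[0]} {tokens[-1]}"  # drop middle names and initials


# Made-up names, purely for illustration.
prowl_names = ["José A. Martínez", "Ann Lee"]
rmp_names = ["Jose Martinez", "ann  lee", "Pat Doe"]

prowl_keys = {normalize_name(n) for n in prowl_names}
rmp_keys = {normalize_name(n) for n in rmp_names}

matched = sorted(prowl_keys & rmp_keys)
only_rmp = sorted(rmp_keys - prowl_keys)
print(matched)
print(only_rmp)
```

Names left in `only_rmp` after normalization are better candidates for instructors who no longer teach than raw string mismatches would be.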

I think it could be interesting to see whether any specific meeting days have a significantly higher number of classes. SQL features used: aggregate functions, CASE statement

In [8]:
%%sql
SELECT
    SUM(CASE WHEN MeetingDays = "MWF" THEN 1 ELSE 0 END) AS CountMWF,
    SUM(CASE WHEN MeetingDays = "MW" THEN 1 ELSE 0 END) AS CountMW,
    SUM(CASE WHEN MeetingDays = "TT" THEN 1 ELSE 0 END) AS CountTT,
    SUM(CASE WHEN MeetingDays = "M" THEN 1 ELSE 0 END) AS CountM,
    SUM(CASE WHEN MeetingDays = "T" THEN 1 ELSE 0 END) AS CountT,
    SUM(CASE WHEN MeetingDays = "W" THEN 1 ELSE 0 END) AS CountW,
    SUM(CASE WHEN MeetingDays = "Th" THEN 1 ELSE 0 END) AS CountTh,
    SUM(CASE WHEN MeetingDays = "F" THEN 1 ELSE 0 END) AS CountF
FROM classes c;

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
1 rows affected.


CountMWF,CountMW,CountTT,CountM,CountT,CountW,CountTh,CountF
243,418,614,238,280,234,196,48


A large number of classes meet on Tuesday and Thursday (614, more than any other pattern). This may affect many things on campus. Although this project focuses specifically on Rate My Professors, it could be interesting to see whether moving some of the Tuesday/Thursday classes to other days would ease parking congestion. I digress. It is also interesting that very few classes meet only on Friday (48). I wonder whether classes that meet just one day a week get worse reviews because of the long class periods.
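
The conditional-aggregation pattern used above can be compared with a plain `GROUP BY` on toy data. The sketch below uses Python's built-in `sqlite3` rather than MySQL, and the rows are made up, so it illustrates the technique only and not the real `classes` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE classes (Title TEXT, MeetingDays TEXT)")
# Made-up rows, purely for illustration.
conn.executemany(
    "INSERT INTO classes VALUES (?, ?)",
    [("A", "TT"), ("B", "TT"), ("C", "MWF"), ("D", "F"), ("E", "MW")],
)

# One row with one column per pattern (the shape used in the query above).
wide = conn.execute(
    """
    SELECT
        SUM(CASE WHEN MeetingDays = 'TT' THEN 1 ELSE 0 END) AS CountTT,
        SUM(CASE WHEN MeetingDays = 'F' THEN 1 ELSE 0 END) AS CountF
    FROM classes
    """
).fetchone()

# One row per pattern; this also picks up patterns not listed by hand.
tall = dict(
    conn.execute(
        "SELECT MeetingDays, COUNT(*) FROM classes GROUP BY MeetingDays"
    ).fetchall()
)
conn.close()

print(wide)
print(tall)
```

The wide form reads well on a slide, while the grouped form is safer when the set of meeting patterns is not known in advance.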

I also want to look at which professors have the most ratings. There may be trends in which college or department gets the most reviews.

In [9]:
%%sql
SELECT
    Instructor,
    Department,
    CountRatings
FROM rmp_ratings rr
ORDER BY CountRatings DESC;

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
1396 rows affected.


Instructor,Department,CountRatings
Megan Granich,Mathematics,124
Jodi Finkel,Political Science,119
Michel van Biezen,Physics,104
Amir Hussain,Theology,99
Karen Ellis,Mathematics,93
Michael Foy,Psychology,90
Evan Gerstmann,Political Science,87
Ralph Quinones,Business,86
Cara Anzilotti,History,82
Laurie Pintar,History,80


It looks like Megan Granich, in Mathematics, has the most reviews. I am guessing she teaches a large lower-division class. I am also noticing that many of the ten most-rated teachers are in BCLA and Seaver. This may be due to how many students attend those colleges. Another reason may be how they teach: I would expect a controversial teacher to garner more reviews than one who does not stand out.
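
To make the department trend easier to see, the ten rows displayed above can be summed by department. This sketch covers only those ten rows, not all 1396 instructors, so it is a quick look rather than a full analysis.

```python
from collections import defaultdict

# The ten rows shown in the output above: (instructor, department, ratings).
top_ten = [
    ("Megan Granich", "Mathematics", 124),
    ("Jodi Finkel", "Political Science", 119),
    ("Michel van Biezen", "Physics", 104),
    ("Amir Hussain", "Theology", 99),
    ("Karen Ellis", "Mathematics", 93),
    ("Michael Foy", "Psychology", 90),
    ("Evan Gerstmann", "Political Science", 87),
    ("Ralph Quinones", "Business", 86),
    ("Cara Anzilotti", "History", 82),
    ("Laurie Pintar", "History", 80),
]

ratings_by_department = defaultdict(int)
for _instructor, department, count in top_ten:
    ratings_by_department[department] += count

# Departments ordered from most to fewest ratings among these ten rows.
ranked = sorted(
    ratings_by_department.items(), key=lambda item: item[1], reverse=True
)
for department, total in ranked:
    print(f"{department}: {total}")
```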

## Teachers with No Reviews on Rate My Professors

In [10]:
%%sql
SELECT 
    COUNT(Instructor) AS CountInstructors
FROM rmp_ratings rr
WHERE CountRatings = 0;

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
1 rows affected.


CountInstructors
169


### 169 Teachers with no reviews

**Why?**

- Teachers uploaded by students
    - Student deleting review

## Primary Question

### Are there professors on Prowl, but not on Rate My Professor?

In [11]:
%%sql
CREATE OR REPLACE VIEW prowl_rmp AS 
(
    SELECT 
        t.Instructor AS ProwlInstructor,
        rr.Instructor AS RMPInstructor,
        rr.Rating,
        rr.Difficulty
    FROM teachers t
    LEFT JOIN rmp_ratings rr 
        ON t.InstructorID = rr.InstructorID
    UNION
    SELECT 
        t.Instructor AS ProwlInstructor,
        rr.Instructor AS RMPInstructor,
        rr.Rating,
        rr.Difficulty
    FROM teachers t
    RIGHT JOIN rmp_ratings rr 
        ON t.InstructorID = rr.InstructorID
);

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
0 rows affected.


[]

Using this view, the first query looks at how many professors are in RMP, how many are in Prowl, and how many Prowl professors also appear in RMP. I would also like to include the average rating and difficulty.

SQL Features used: Subquery

In [12]:
%%sql
SELECT
    COUNT(RMPInstructor) AS CountRMPInstructors,
    COUNT(ProwlInstructor) AS CountProwlInstructors,
    (
        SELECT
            SUM(CASE WHEN ProwlInstructor = RMPInstructor THEN 1 ELSE 0 END) AS ProwlInRMP
        FROM prowl_rmp
    ) AS CountProwlInRMP,
    ROUND(AVG(Rating), 2) AS AvgRating,
    ROUND(AVG(Difficulty), 2) AS AvgDifficulty
FROM prowl_rmp;

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
1 rows affected.


CountRMPInstructors,CountProwlInstructors,CountProwlInRMP,AvgRating,AvgDifficulty
1396,771,368,3.33,2.52


### Insight
- Out of 771 Prowl professors, only 368 are in RMP
- RMP still has 1396 professors...
- **Rate My Professors needs effective teacher data**
- Which departments are receiving the most ratings?
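
The counts above imply that 771 - 368 = 403 Prowl professors have no match in RMP, which is the figure behind the "403 Missing Teachers" headline later on. The sketch below reproduces the view's full-outer-join idea on toy data with Python's built-in `sqlite3`. The rows are made up, and the second half of the `UNION` swaps the table order rather than using `RIGHT JOIN`, because older SQLite versions do not support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE teachers (InstructorID INTEGER, Instructor TEXT);
    CREATE TABLE rmp_ratings (InstructorID INTEGER, Instructor TEXT, Rating REAL);
    -- Made-up rows, purely for illustration.
    INSERT INTO teachers VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO rmp_ratings VALUES
        (2, 'B', 4.0), (3, 'C', 3.0), (4, 'D', 5.0), (5, 'E', 2.0);
    """
)

# Full outer join emulated as a LEFT JOIN in both directions plus UNION.
rows = conn.execute(
    """
    SELECT t.Instructor AS ProwlInstructor, rr.Instructor AS RMPInstructor
    FROM teachers t
    LEFT JOIN rmp_ratings rr ON t.InstructorID = rr.InstructorID
    UNION
    SELECT t.Instructor, rr.Instructor
    FROM rmp_ratings rr
    LEFT JOIN teachers t ON t.InstructorID = rr.InstructorID
    """
).fetchall()
conn.close()

in_both = sum(1 for prowl, rmp in rows if prowl is not None and rmp is not None)
prowl_only = sum(1 for prowl, rmp in rows if prowl is not None and rmp is None)
rmp_only = sum(1 for prowl, rmp in rows if prowl is None and rmp is not None)
print(in_both, prowl_only, rmp_only)

# The same subtraction applied to the counts reported above.
missing_from_rmp = 771 - 368
print(missing_from_rmp)
```

Because `UNION` removes duplicate rows, two instructors with identical selected values would collapse into one row, which is worth keeping in mind when reading counts taken from a view built this way.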

## Secondary Questions

### Do departments with more classes receive more ratings?
- Is there a correlation between certain departments and better ratings?

In [53]:
%%sql
WITH InstructorClasses AS (
    SELECT
        t.InstructorID,
        t.Instructor,
        rr.Department,
        rr.Difficulty,
        rr.Rating,
        COUNT(c.Title) AS CountClasses,
        rr.CountRatings
    FROM rmp_ratings rr
    RIGHT JOIN teachers t 
        ON rr.InstructorID = t.InstructorID
    RIGHT JOIN classes c
        ON t.InstructorID = c.InstructorID
    GROUP BY Instructor
    ORDER BY rr.CountRatings DESC, rr.Rating DESC, Difficulty ASC
)
SELECT
    Department,
    Difficulty,
    Rating,
    SUM(CountClasses) AS SumOfClasses,
    SUM(CountRatings) AS SumOfRatings
FROM InstructorClasses
WHERE Department != 'None'
GROUP BY Department
ORDER BY SumOfRatings DESC;

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
63 rows affected.


Department,Difficulty,Rating,SumOfClasses,SumOfRatings
Communication Studies,3.4,3.8,55,476
Philosophy,3.5,3.8,33,424
Mathematics,2.5,4.1,36,397
Psychology,3.5,3.6,30,387
History,3.3,3.6,32,345
Economics,4.4,3.3,36,311
Chemistry,3.5,4.6,67,284
Political Science,4.3,3.6,32,283
Business,4.2,3.1,48,273
Theology,2.3,4.9,19,257


- Fewer classes tend to go with fewer ratings
- Market to majors with many students to increase views and clicks
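
One way to check how strongly classes and ratings move together is a Pearson correlation coefficient. The sketch below computes it for the ten departments displayed above only, so it says nothing firm about all 63 departments returned by the query. `pearson` is a hypothetical helper written out by hand to avoid depending on a particular Python version.

```python
import math

# (department, SumOfClasses, SumOfRatings) copied from the ten rows shown above.
rows = [
    ("Communication Studies", 55, 476),
    ("Philosophy", 33, 424),
    ("Mathematics", 36, 397),
    ("Psychology", 30, 387),
    ("History", 32, 345),
    ("Economics", 36, 311),
    ("Chemistry", 67, 284),
    ("Political Science", 32, 283),
    ("Business", 48, 273),
    ("Theology", 19, 257),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


classes = [row[1] for row in rows]
ratings = [row[2] for row in rows]
coefficient = pearson(classes, ratings)
print(round(coefficient, 2))
```

A value near 1 would support the bullet above, while a value near 0 would suggest that, at least among these ten departments, class count alone explains little.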

### What percentage of WouldRetake is available for each department?

In [109]:
%%sql
SELECT
    Department,
    SUM(CASE WHEN WouldRetake IS NULL THEN 0 ELSE 1 END) AS TotalNotNull,
    SUM(CASE WHEN WouldRetake IS NOT NULL THEN 0 ELSE 1 END) AS TotalNull,
    ROUND((SUM(CASE WHEN WouldRetake IS NULL THEN 0 ELSE 1 END) / 
         (SUM(CASE WHEN WouldRetake IS NULL THEN 0 ELSE 1 END) + 
          SUM(CASE WHEN WouldRetake IS NOT NULL THEN 0 ELSE 1 END)) * 100
    ), 2) AS PercentNotNull,
    RANK() OVER (
        ORDER BY ROUND((SUM(CASE WHEN WouldRetake IS NULL THEN 0 ELSE 1 END) / 
         (SUM(CASE WHEN WouldRetake IS NULL THEN 0 ELSE 1 END) + 
          SUM(CASE WHEN WouldRetake IS NOT NULL THEN 0 ELSE 1 END)) * 100
    ), 2) DESC) AS RankNotNull
FROM
    rmp_ratings rr
GROUP BY Department
ORDER BY SUM(CASE WHEN WouldRetake IS NULL THEN 0 ELSE 1 END) DESC;

 * mysql://admin:***@isba-dev-01.cj5u8akyr9fw.us-east-1.rds.amazonaws.com/RMPproject
80 rows affected.


Department,TotalNotNull,TotalNull,PercentNotNull,RankNotNull
Communication Studies,43,27,61.43,44
English,41,62,39.81,66
Business,40,15,72.73,23
Film & Television,40,21,65.57,38
Philosophy,40,28,58.82,48
History,32,27,54.24,57
Psychology,30,18,62.5,41
Theater,30,14,68.18,31
Economics,30,12,71.43,26
Theology,30,12,71.43,26


- WouldRetake is missing for a large share of instructors in the departments shown, which suggests many students do not fill in that attribute
- Rating scale may be difficult to decide on for students
- RMP should create teacher pages using data from PROWL
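
The percentage in the query above can also be written more compactly, because `COUNT(column)` skips NULLs while `COUNT(*)` counts every row. The sketch below shows this on made-up rows with Python's built-in `sqlite3`. The department names and values are placeholders, and the `RANK()` column is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rmp_ratings (Department TEXT, WouldRetake REAL)")
# Made-up rows, purely for illustration; None is stored as NULL.
conn.executemany(
    "INSERT INTO rmp_ratings VALUES (?, ?)",
    [
        ("Dept A", 0.8),
        ("Dept A", None),
        ("Dept A", 0.6),
        ("Dept A", 0.9),
        ("Dept B", None),
        ("Dept B", 0.5),
    ],
)

rows = conn.execute(
    """
    SELECT
        Department,
        COUNT(WouldRetake) AS TotalNotNull,
        COUNT(*) - COUNT(WouldRetake) AS TotalNull,
        ROUND(100.0 * COUNT(WouldRetake) / COUNT(*), 2) AS PercentNotNull
    FROM rmp_ratings
    GROUP BY Department
    ORDER BY TotalNotNull DESC
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Multiplying by `100.0` before dividing forces floating-point division in SQLite, which matters there because integer division would otherwise truncate.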

# Okay, so what's the point?

# 403 Missing Teachers

Cheddar News' Rate My Professors software is in an interesting position:
- Large user base
- Significant lack of teachers
- Outdated teacher ratings
- Competition with Coursicle


# Recommendation

## Student Ambassadors
- Legally acquire data from schools
- Add all current teachers to Rate My Professors
    - Allows students to easily add ratings