# Student Opinions on Courses in Relation with Piazza.com Engagement
## INFO 2950 Final Project
#### Amina Shikhalieva, Gauri Pidatala

## Introduction

The purpose of this report is to use data publicly available on piazza.com and ratemyprofessors.com to answer the question: "What makes a class good?" We use professor ratings as a general indicator of how positive or negative the average student's experience was with a professor for a particular course in a particular semester, and look for patterns between good ratings and other observations in our data. We analyze class size (reflected by Piazza enrollment), the average percentage of the class engaging on Piazza, course level, and average instructor response time on Piazza.
Maximizing learning and improving the student experience in a class should always be a priority for instructors. If this investigation shows that behaviors like encouraging student engagement on piazza.com, responding to Piazza questions in a timely manner, and limiting class sizes correlate significantly with higher student ratings of a class, this information can be used to improve student satisfaction and learning for a wide range of courses, both at Cornell and elsewhere. With the shift to online learning, resources such as piazza.com are becoming one of the only ways some students have of getting help and engaging in a course, so understanding their impact on learning is more important than ever.

[] insert findings summary []

## Data Description


#### Dataset description
- all_data:
    - observations: each row represents one review on ratemyprofessors.com
    - attributes:
        - Course#: 
        - Total_Q: 
        - %Response: 
        - #Contributions: 
        - Avg_Response_Time: 
        - Enrolled: 
        - instructor_response: 
        - %Active: 
        - semester: 
        - department: 
        - Quality: 
        - Difficulty: 
        - Would Take Again: 
        - Date: 
        - avg_quality: 
        - avg_difficulty: 
      

#### Data acquisition
- We asked some of our peers from a broad variety of majors, from chemical engineering to biology to computer science, for the links to the stats pages of their classes that used piazza.com. We compiled all the links, along with the semester in which each peer took the class, into [ piazza_links.txt ].

#### Potential drawbacks
- 

#### Preprocessing and cleaning

#### Raw data







#### Data acquisition and cleaning: 

**piazza.com** 
* See "piazza_data_collect.ipynb"
* Every course on Piazza has a stats report accessible to anyone enrolled in the class (under the statistics tab). A few courses do not have this report because there was not enough piazza activity. 
* We compiled a list of courses (with no duplicates) from various departments, taught in various semesters, that used piazza.com as an interactive platform. For each of these courses, we recorded the URL of its stats page in the file "piazza_stats.txt", where we also manually added the semester corresponding to each stats report. We then iterated through the URLs in the file and recorded various attributes for each course in "piazza.csv".
* As of now, some of our piazza stats course names are irregular. Because there is only a small number of these irregularities and they don't follow any particular pattern, we chose to manually fix each one
* We also need to get rid of the '%' following the values in columns "% response" and "instructor_response"
* We also want to be able to group our courses by departments
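
The last two cleaning steps above could be sketched as follows. This is a minimal illustration on a made-up two-row frame, not the code from "piazza_data_collect.ipynb". The column names follow the attribute list in this report, and the assumption that a course name looks like "DEPT 1234" is ours.

```python
import pandas as pd

# Toy rows standing in for piazza.csv (all values are made up for illustration)
toy = pd.DataFrame({
    "Course#": ["PHYS 2213", "CS 2110"],
    "%Response": ["98%", "87%"],
    "instructor_response": ["63%", "71%"],
})

# Strip the trailing '%' and store the percentages as floats
for col in ["%Response", "instructor_response"]:
    toy[col] = toy[col].str.rstrip("%").astype(float)

# Split "DEPT 1234" into a department string and an integer course number
# (assumed format: letters, optional whitespace, then digits)
parts = toy["Course#"].str.extract(r"^([A-Za-z]+)\s*(\d+)$")
toy["department"] = parts[0]
toy["course_num"] = parts[1].astype(int)
```

With a `department` column in place, courses can be grouped with `toy.groupby("department")`.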

**ratemyprofessors.com** 
* See "data_collect.ipynb"
* We manually associate each course in our piazza.csv dataset with the name of the professor who taught it that semester. We then manually acquire the URL of each professor's ratemyprofessors.com page and record it in the file "Rate_my_prof.txt". Finally, we iterate through each URL in this file and scrape the necessary attributes of each review for the course we are looking for.
* Because a review page for a professor includes many 
* Some ratemyprofessors.com reviews are void because when they were scraped the ads on the page corrupted some of the observations
    - So we need to remove these observations. Every review is guaranteed to have a quality rating
    - Because the course names are listed as a single string, we need to split the department name and course number and delete reviews with typos in the course name
    - We also need to convert columns that hold number values from string to int (eg, Difficulty)
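
The three review-cleaning steps above might look roughly like this. It is a sketch on invented rows, not the code from the data collection notebook. The regular expression (letters followed by a four-digit number) is our assumption about what a well-formed course name looks like, and anything that fails it is treated as a typo.

```python
import pandas as pd

# Toy reviews standing in for the scraped data (all values are made up)
reviews = pd.DataFrame({
    "Class taught": ["CS4700", "PHYS2213", "C4700S", "CS2110"],
    "Quality": ["5", "4", "3", None],
    "Difficulty": ["3", "4", "2", "5"],
})

# Every valid review has a quality rating, so rows without one are dropped
reviews = reviews.dropna(subset=["Quality"])

# Split names like "CS4700" into department and number; names that do not
# match the assumed pattern yield NaN in both parts and are dropped
parts = reviews["Class taught"].str.extract(r"^([A-Za-z]+)\s*(\d{4})$")
reviews = reviews.assign(department=parts[0], course_number=parts[1])
reviews = reviews.dropna(subset=["department", "course_number"])

# Convert the numeric columns from strings to integers
for col in ["Quality", "Difficulty", "course_number"]:
    reviews[col] = reviews[col].astype(int)
```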
    
#### Tables: 
* piazza_data
    - Stores data collected from the statistics reports of each piazza course
    - Observations: a course in a particular semester (a course can appear more than once, but never twice for the same semester; for example, CS2110 appears three times, each time with data from a different semester)
    - Attributes: 
        + Course#: A string value of the department abbreviation and course number (eg, "PHYS 2213")
        + Total_Q: 
        + %Response: percentage of questions that received a response
        + #Contributions: total number of questions, answers, notes, and posts by all members
        + Avg_Response_Time: average number of minutes it takes for a question to be answered
        + Enrolled: number of students signed up to the course piazza
        + instructor_response: percentage of questions answered by an instructor (the professor/lecturer or a TA)
        + %Active: percentage of students enrolled who contributed on piazza
        + Semester: A string containing the semester abbreviation and year abbreviation (eg, "SP19")
        + department: a string value of the department of the course (eg, "PHYS")
        + course_num: an integer value of the course number (eg, 2213)
        

* ratings_data
    - Stores data collected from the review pages of professors who have taught the courses that we collected piazza data for
    - Observations: a professor 
    - Attributes: 
        + Name: the name of the professor as a string
        + Quality: an integer rating from 1-5 of the quality of the professor's teaching
        + Difficulty: an integer rating from 1-5 of the difficulty of the professor's course
        + Would take again: a string value "Yes" or "No" based on whether the reviewer would take the course from that professor again
        + Date: String timestamp of the review
        + Class taught: a string value of the course the reviewer is basing their experience on with said professor (eg, "CS4700")
        + course_number: integer value of the course number (eg, 4700)
        + department: a string value of the department of the course (eg, "CS")
        + Course #: string course name consistent with 'Course #' column in piazza_data
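
One way the two tables could be combined into a single review-level table with per-course averages is sketched below. The shared key name "Course #", the toy values, and the choice to ignore semester when matching are all simplifying assumptions made for illustration, not a description of how all_data was actually built.

```python
import pandas as pd

# Toy tables; the shared column name "Course #" and all values are assumptions
piazza = pd.DataFrame({
    "Course #": ["CS 2110", "PHYS 2213"],
    "Enrolled": [543, 259],
})
ratings = pd.DataFrame({
    "Course #": ["CS 2110", "CS 2110", "PHYS 2213"],
    "Quality": [5, 3, 4],
    "Difficulty": [3, 4, 4],
})

# Average the review scores per course
avg = (
    ratings.groupby("Course #")[["Quality", "Difficulty"]]
    .mean()
    .rename(columns={"Quality": "avg_quality", "Difficulty": "avg_difficulty"})
    .reset_index()
)

# Attach the Piazza statistics and the course averages to each review
combined = ratings.merge(piazza, on="Course #", how="inner").merge(avg, on="Course #")
```

Because a course can appear in several semesters, a real join would likely also need a semester key derived from each review's date.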

## Preregistration Statement

#### Analyses:

- Use multi-variable linear regression to try to predict the rating of the professor teaching the course (with training values gathered from ratemyprofessors.com)
    - Predicting variables:
        - class size (approximated by Piazza enrollment)
        - department (use dummy values for each group)
        - Piazza activity (as a percent)
    - The results will help us understand if there is a strong linear relationship between the combination of these three variables with the ratings of students' experiences with the professor teaching the class, which we will interpret as strongly correlated with the quality of the students' experience in the class 
    
- Use k-means clustering to plot class size by professor rating, color coded by clusters
    - Clusters: departments
    - This may show us if certain departments correspond with larger class sizes or higher ratings
    - The results may be useful if such a relationship exists: if some departments consistently have lower ratings than others, the heads of those departments could check whether a change in class sizes or Piazza engagement might give students a better learning experience in their department.
    
- Use linear regression and correlation coefficient to determine if there is a correlation between the variables class size and piazza activity for a course.
    - A strong correlation here may suggest that students feel more confident asking questions in smaller classes than bigger classes and vice versa. 
    
- Use linear regression and correlation coefficient to determine if there is a correlation between the variables instructor response time and piazza activity for a course.
    - This may help us determine whether classes with higher instructor involvement result in higher student involvement for a course.
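
As a rough sketch of the first planned analysis, a multi-variable regression with department dummy variables could be set up as below. Every value in the toy frame is invented so that the example runs, and the column name `avg_quality` is borrowed from the all_data attribute list. None of this is a result from our data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data; every value here is invented purely to make the sketch runnable
toy = pd.DataFrame({
    "Enrolled": [543, 259, 417, 162, 135, 618],
    "%Active": [40.0, 55.0, 48.0, 70.0, 66.0, 35.0],
    "department": ["CS", "PHYS", "CS", "BIO", "BIO", "PHYS"],
    "avg_quality": [3.8, 4.1, 3.5, 4.4, 4.6, 3.2],
})

# One-hot encode department; drop_first removes one redundant dummy column
X = pd.get_dummies(
    toy[["Enrolled", "%Active", "department"]],
    columns=["department"],
    drop_first=True,
    dtype=float,
)
y = toy["avg_quality"]

model = LinearRegression().fit(X, y)
r_squared = model.score(X, y)  # in-sample R^2, not a held-out estimate
```

The fitted `model.coef_` values line up with `X.columns`, so each department coefficient is read relative to the dropped baseline department.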


## Data Analysis

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy import stats

In [3]:
piazza_data = pd.read_csv('piazza.csv')
ratings_data = pd.read_csv('rmp.csv')

- Use multi-variable linear regression to try to predict the rating of the professor teaching the course (with training values gathered from ratemyprofessors.com)
    - Predicting variables:
        - class size (approximated by Piazza enrollment)
        - department (use dummy values for each group)       
        - Piazza activity (as a percent)


- Use k-means clustering to plot class size by professor rating, color coded by clusters
    - Clusters: departments

- Use linear regression and correlation coefficient to determine if there is a correlation between the variables class size and piazza activity for a course.

- Use linear regression and correlation coefficient to determine if there is a correlation between the variables instructor response time and piazza activity for a course.

In [6]:
# scikit-learn expects a 2D feature matrix; passing the 1D Series raised
# "ValueError: Expected 2D array, got 1D array instead", so select the
# column with double brackets to get a one-column DataFrame
X = piazza_data[["Enrolled"]]
y = piazza_data["%Active"].values
model = LinearRegression()
model.fit(X, y)

In [5]:
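# instructor_response holds strings such as '63%', which raised
# "ValueError: could not convert string to float: '63%'"; strip the '%',
# convert to float, and reshape to the 2D array scikit-learn expects
X = piazza_data["instructor_response"].str.rstrip("%").astype(float).values.reshape(-1, 1)
y = piazza_data["%Active"].values
model = LinearRegression()
model.fit(X, y)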

## Evaluation of Significance

- Repeat linear regression of piazza activity vs class size, but randomized, to see if we get similar results

- Repeat linear regression of piazza activity vs instructor response time, but randomized, to see if we get similar results
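
The randomized regressions described above amount to a permutation test, which could be sketched as follows. The data here is synthetic (generated with a fixed seed, not taken from piazza.csv), and the names `enrolled`, `active`, and `slope` are hypothetical stand-ins chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for class size and percent active (not the project's data)
enrolled = rng.integers(30, 700, size=50).astype(float)
active = 80.0 - 0.05 * enrolled + rng.normal(0.0, 5.0, size=50)

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.polyfit(x, y, 1)[0]

observed = slope(enrolled, active)

# Shuffle y to break any real x-y link, then refit; repeat many times
n_perm = 1000
permuted = np.array(
    [slope(enrolled, rng.permutation(active)) for _ in range(n_perm)]
)

# Two-sided p-value: share of shuffled slopes at least as extreme as observed
p_value = np.mean(np.abs(permuted) >= abs(observed))
```

If the observed slope is rarely matched by the shuffled slopes, the relationship is unlikely to be due to chance alone. The same procedure would apply to the instructor response regression by swapping in that column.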

## Interpretation and Conclusions

## Limitations

## Source Code


All source code, raw data sets, and previous project phases are available publicly on the project's Github page:
- https://github.com/Amina-S/2950_project

## Acknowledgements

We would like to recognize ratemyprofessors.com and piazza.com, from which we gathered our raw data through web scraping.


We would also like to thank our peers who helped us gather our data by sending us the links to their past piazza.com courses:
- Amulya Khurana
- Anissa Dallmann
- Dakota Thomas
- Destiny Nwafor
- Elena Peot
- Enya Zimecka
- Luis Enriquez
- Rayshard Thompson
- Rizo Rakhmanov
- Sanjana Namreen


We would also like to thank Prof. Mimno and the INFO 2950 course staff for helping this report come to fruition.