**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Jiaqi Wu
- Ella Wen
- Zihao Yang
- Yunqi Zhang
- Zhining Zhang

# Research Question

#### How do factors such as course difficulty, instructor quality, and course level (lower vs. upper division) correlate with student satisfaction scores in UCSD STEM courses, based on CAPE data from the 2020-2022 academic years?

The factors that may influence the likelihood of students recommending a class include:

- Instructor Quality: Indicated by the percentage of students recommending the instructor.

- Course Difficulty: Implied by the average number of hours spent studying per week.

- Grade Expectations vs. Actual Grades: Measured by the difference between expected and received GPA.

- Number of Evaluations: A higher number provides a more reliable recommendation rate.

- Enrollment Size: Larger class sizes may impact instruction quality and student engagement.

- Term and Course Context: Course content, teaching methods, and term-specific factors may influence recommendations.
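
Two of the factors above are not stored directly in CAPE and have to be derived. The sketch below shows one minimal way to do that, assuming the column names `enroll`, `evals_made`, `avg_grade_exp`, and `avg_grade_rec` from the CAPE file loaded in the Data section. The names `add_derived_features`, `grade_gap`, and `response_rate` are hypothetical and chosen only for illustration. The two rows are copied from the CAPE output shown in the Data section.

```python
import pandas as pd

# Two rows copied from the CAPE output in the Data section (illustration only).
toy = pd.DataFrame({
    "sub_course": ["ANBI 141", "ANBI 145"],
    "enroll": [186, 20],
    "evals_made": [69, 4],
    "avg_grade_exp": [3.70, 3.33],
    "avg_grade_rec": [3.66, 2.84],
})


def add_derived_features(df):
    """Return a copy of df with two derived columns (hypothetical names)."""
    out = df.copy()
    # Received minus expected GPA: negative means students did worse than expected.
    out["grade_gap"] = out["avg_grade_rec"] - out["avg_grade_exp"]
    # Share of enrolled students who submitted an evaluation.
    out["response_rate"] = out["evals_made"] / out["enroll"]
    return out


features = add_derived_features(toy)
print(features[["sub_course", "grade_gap", "response_rate"]].round(3))
```

A low `response_rate` is one possible signal that a recommendation percentage rests on few responses and should be read with caution.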

## Background and Prior Work


When UCSD students register for courses on WebReg each quarter, we often rely on Course and Professor Evaluations (CAPEs) to make our decisions. These online evaluations give insight into various aspects of a course, such as the instructor, the workload, and expected vs. received grades. One of the key CAPE categories is the class recommendation rate, which is the percentage of students who would recommend a course to others. However, it remains unclear which factors most strongly affect a student’s likelihood of recommending a course.

For this project, we aim to analyze the correlation between CAPE recommendation rates and other CAPE factors, such as instructor ratings, expected vs. received grades, study hours per week, and overall course enrollment. Understanding these relationships can help students make better-informed course selections.
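
As a rough illustration of the kind of correlation analysis described here, the sketch below ranks each CAPE factor by its Spearman correlation with the class recommendation rate. It assumes the CAPE column names used in the Data section; `sample`, `rank_corr`, and `corr_with_rcmd` are hypothetical names. The five rows are copied from the CHEM rows of the STEM output in the Data section and are far too few to support any conclusion, so the example shows only the mechanics.

```python
import pandas as pd

# Five CHEM rows copied from the STEM output in the Data section (illustration only).
sample = pd.DataFrame({
    "rcmd_class":    [84.7, 81.8, 72.7, 88.2, 100.0],
    "rcmd_instr":    [93.2, 81.8, 81.8, 47.1, 88.9],
    "study_hr_wk":   [8.81, 7.41, 9.77, 5.91, 6.75],
    "avg_grade_exp": [3.17, 3.18, 3.50, 3.24, 3.50],
    "avg_grade_rec": [3.20, 2.82, 3.15, 3.25, 3.04],
})

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rank_corr = sample.rank().corr(method="pearson")

# Correlation of every other factor with the class recommendation rate.
corr_with_rcmd = rank_corr["rcmd_class"].drop("rcmd_class")
print(corr_with_rcmd.round(2))
```

Rank correlation is used in this sketch because recommendation percentages are bounded at 100 and often skewed. Whether Pearson or Spearman is the better choice for the full dataset is left open.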

In a past COGS 108 project (Reference #1), a group of UCSD students analyzed past CAPE data to identify the key factors that influence the average GPA of a class at UCSD. Their study explored correlations between expected and actual GPAs, study time, class/professor evaluations, and GPA trends across different fields and course levels. One of their notable findings was that upper-division classes tend to have higher average GPAs than lower-division classes, challenging common assumptions about course difficulty.

While their project primarily aimed to understand GPA predictors and their implications for university policies, our research takes a different approach by focusing on the factors that influence students' likelihood of recommending a class in CAPEs. Instead of analyzing the average GPA received by students, we examine the recommendation rate of a class to understand what aspects contribute to a positive course evaluation. However, like their study, our research also relies on CAPE survey data, making their findings a relevant reference for understanding broader academic trends at UCSD.

Prior work in Reference #2 has explored various aspects of student course selection. Dahl et al. (2022) examined how student attitudes towards the class, class recommendation rate, and perceived behavioral control impact their decision to enroll in a course. Their approach is similar to our question of which factors influence the class recommendation rate, and it aligns with our hypothesis that instructor quality, expected workload, and grading play a major role in student satisfaction.



- 1) https://github.com/COGS108/FinalProjects-Sp23/blob/main/FinalProjectGroup_Sp23_DigitSapiens.ipynb
- 2) https://nactajournal.org/index.php/nactaj/article/view/138
- 3) https://github.com/UCSD-Historical-Enrollment-Data/UCSDHistEnrollData.git/ 

# Hypothesis



Instructor recommendation ratings, course difficulty, grade expectations, and enrollment size are significant predictors of class recommendation rates in UCSD STEM courses. Specifically:

- Courses with higher instructor recommendation ratings will have higher class recommendation rates.
- Courses with lower perceived difficulty (fewer study hours per week) will have higher class recommendation rates.
- Courses where students’ actual grades meet or exceed their expectations will have higher class recommendation rates.
- Smaller enrollment sizes will correlate with higher class recommendation rates compared to larger enrollment sizes.
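
One way these hypotheses could be tested jointly is a linear regression of the class recommendation rate on the four predictors, with the hypothesized signs being positive for instructor rating, negative for study hours, positive for the received-minus-expected grade gap, and negative for enrollment. The sketch below is an assumption about a possible approach, not our final method. It fits ordinary least squares with NumPy on purely SYNTHETIC data that was generated with those signs, so the output demonstrates the procedure only and says nothing about CAPE. All variable names, ranges, and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# SYNTHETIC predictors (not CAPE data); the ranges are arbitrary assumptions.
rcmd_instr = rng.uniform(50, 100, n)
study_hr_wk = rng.uniform(2, 12, n)
grade_gap = rng.normal(0.0, 0.3, n)  # received minus expected GPA
enroll = rng.uniform(20, 400, n)

# SYNTHETIC outcome built with the hypothesized signs (+, -, +, -) plus noise.
true_beta = np.array([50.0, 0.4, -1.0, 5.0, -0.01])
X = np.column_stack([np.ones(n), rcmd_instr, study_hr_wk, grade_gap, enroll])
rcmd_class = X @ true_beta + rng.normal(0.0, 1.0, n)

# Ordinary least squares: choose beta_hat to minimize ||X @ beta - rcmd_class||^2.
beta_hat, *_ = np.linalg.lstsq(X, rcmd_class, rcond=None)

names = ["intercept", "rcmd_instr", "study_hr_wk", "grade_gap", "enroll"]
for name, estimate in zip(names, beta_hat):
    print(f"{name:12s} {estimate: .3f}")
```

On real CAPE rows the sign and size of each estimated coefficient would then be compared against the corresponding hypothesis, ideally with standard errors from a statistics package.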


# Data

## Data overview


## CAPE Student Course Evaluations

In [35]:
# Load the raw CAPE export (tab-separated values) into a DataFrame
import pandas as pd

file_path = "data/CAPEs.tsv"

df = pd.read_csv(file_path, sep="\t")

In [40]:
# Filter dataset for terms from 2020, 2021, and 2022
df = df[df['term'].str[-2:].isin(['20', '21', '22'])]

# Drop rows where avg_grade_rec is -1 (N/A values)
df_cleaned = df[df['avg_grade_rec'] != -1]

# Display the cleaned dataframe
df_cleaned

| Unnamed: 0 | instructor | sub_course | course | term | enroll | evals_made | rcmd_class | rcmd_instr | study_hr_wk | avg_grade_exp | avg_grade_rec |
|-----------:|:-----------|:-----------|:-------|:-----|-------:|-----------:|-----------:|-----------:|------------:|--------------:|--------------:|
| 33 | Fortier, Jana | ANAR 100 | Spec Topics/Anth Archaeology | FA22 | 20 | 3 | 100.0 | 100.0 | 3.17 | 3.33 | 3.30 |
| 36 | Goldstein, Paul S | ANAR 143 | Biblical Arch | FA22 | 25 | 8 | 100.0 | 100.0 | 1.36 | 3.86 | 3.85 |
| 37 | Marchetto, Maria Carolina | ANBI 100 | Special Topic/Biological Anth | FA22 | 30 | 14 | 100.0 | 100.0 | 1.93 | 4.00 | 3.99 |
| 41 | Gagneux, Pascal | ANBI 141 | The Evolution of Human Diet | FA22 | 186 | 69 | 98.5 | 100.0 | 3.76 | 3.70 | 3.66 |
| 42 | Hrvoj Mihic, Branka | ANBI 145 | Bioarchaeology | FA22 | 20 | 4 | 75.0 | 100.0 | 4.50 | 3.33 | 2.84 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61786 | Gladstein, Jill M | SYN 1 | Perspectives/Changing Planet | WI22 | 404 | 243 | 90.9 | 93.4 | 3.95 | 3.94 | 3.90 |
| 61787 | Gladstein, Jill M | SYN 2 | Explorations/Changing Planet | WI22 | 191 | 93 | 77.0 | 88.5 | 4.18 | 3.79 | 3.89 |
| 61788 | Gladstein, Jill M | SYN 2 | Explorations/Changing Planet | FA21 | 319 | 135 | 68.0 | 74.4 | 3.89 | 3.93 | 3.88 |
| 61789 | Gladstein, Jill M | SYN 1 | Perspectives/Changing Planet | SP21 | 235 | 107 | 90.6 | 95.3 | 5.15 | 3.85 | 3.88 |
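
The year filter above keeps a row when the last two characters of `term` are `20`, `21`, or `22`, which selects by calendar year (so `FA22` is kept even though it belongs to the 2022-23 academic year). The sketch below shows a slightly more explicit alternative that parses the term code first. It assumes every code is a two-character quarter prefix followed by a two-digit year, as in `FA22` or `WI20`; a summer code such as `S122` is an assumption about the raw file and should be checked against it. `parse_term` and `in_study_window` are hypothetical helper names.

```python
import re

# Two-character quarter prefix (e.g. FA, WI, SP, or an assumed S1) + two-digit year.
TERM_PATTERN = re.compile(r"^([A-Z][A-Z0-9])(\d{2})$")


def parse_term(term):
    """Split a term code such as 'FA22' into ('FA', 2022)."""
    match = TERM_PATTERN.match(term.strip())
    if match is None:
        raise ValueError(f"Unrecognized term code: {term!r}")
    quarter, two_digit_year = match.groups()
    # Assumes all terms fall in the 2000s.
    return quarter, 2000 + int(two_digit_year)


def in_study_window(term, years=(2020, 2021, 2022)):
    """True when the term's calendar year is one of the study years."""
    return parse_term(term)[1] in years


for code in ["FA22", "WI20", "SP21", "FA19"]:
    print(code, parse_term(code), in_study_window(code))
```

Parsing the code explicitly makes malformed terms fail loudly instead of being silently filtered out.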


In [None]:
# STEM subject prefixes to keep; matching is exact on the prefix of sub_course
STEM = ['MATH', 'CHEM', 'BIO', 'COGS', 'CSE', 'ECON', 'ECE', 'MAE', 'PHYS', 'DSC']

# Extract the subject prefix (e.g. 'CHEM' from 'CHEM 100A'). assign() returns a new
# DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered frame.
df_cleaned = df_cleaned.assign(
    department=df_cleaned['sub_course'].astype(str).str.split().str[0]
)

# Keep only rows whose subject prefix is in the STEM list
cleaned_df_stem = df_cleaned[df_cleaned['department'].isin(STEM)]

cleaned_df_stem


| Unnamed: 0 | instructor | sub_course | course | term | enroll | evals_made | rcmd_class | rcmd_instr | study_hr_wk | avg_grade_exp | avg_grade_rec | department |
|-----------:|:-----------|:-----------|:-------|:-----|-------:|-----------:|-----------:|-----------:|------------:|--------------:|--------------:|:-----------|
| 6851 | Pomeroy, Robert S. | CHEM 100A | Analytical Chemistry Lab | FA22 | 140 | 59 | 84.7 | 93.2 | 8.81 | 3.17 | 3.20 | CHEM |
| 6852 | Young, Mark | CHEM 105A | Physical Chemistry Laboratory | FA22 | 42 | 12 | 81.8 | 81.8 | 7.41 | 3.18 | 2.82 | CHEM |
| 6853 | Young, Mark | CHEM 105A | Physical Chemistry Laboratory | FA22 | 39 | 11 | 72.7 | 81.8 | 9.77 | 3.50 | 3.15 | CHEM |
| 6854 | Ghosh, Gourisankar | CHEM 108 | Protein Biochemistry Lab | FA22 | 46 | 17 | 88.2 | 47.1 | 5.91 | 3.24 | 3.25 | CHEM |
| 6855 | Ghosh, Gourisankar | CHEM 108 | Protein Biochemistry Lab | FA22 | 47 | 10 | 100.0 | 88.9 | 6.75 | 3.50 | 3.04 | CHEM |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61250 | Langlois, Marina | DSC 30 | DataStrc/Algrthms for Data Sci | WI20 | 98 | 30 | 85.2 | 88.5 | 11.69 | 3.00 | 2.67 | DSC |
| 61251 | Eldridge, Justin Matthew | DSC 40A | Theor Fndtns of Data Sci I | WI20 | 99 | 40 | 89.5 | 100.0 | 5.58 | 3.29 | 3.53 | DSC |
| 61252 | Tiefenbruck, Janine LoBue | DSC 40B | Theor Fndtns of Data Sci II | WI20 | 57 | 18 | 100.0 | 100.0 | 6.06 | 3.61 | 2.94 | DSC |
| 61253 | Tiefenbruck, Janine LoBue | DSC 40B | Theor Fndtns of Data Sci II | WI20 | 37 | 12 | 80.0 | 90.0 | 5.10 | 3.25 | 3.01 | DSC |
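
Our research question also compares lower- and upper-division courses, which requires a course-level label that the CAPE columns do not provide. The sketch below derives one from `sub_course`, assuming the usual UCSD numbering convention (1-99 lower division, 100-199 upper division, 200 and above graduate) holds for every code in the file. `course_level` is a hypothetical helper name, and the example codes are taken from the STEM output above.

```python
import re


def course_level(sub_course):
    """Classify a code such as 'CHEM 100A' by its course number.

    Returns 'lower' (1-99), 'upper' (100-199), 'graduate' (200+),
    or None when no course number can be parsed.
    """
    # Subject prefix, whitespace, then the leading digits of the course number.
    match = re.match(r"\s*[A-Za-z]+\s+(\d+)", sub_course)
    if match is None:
        return None
    number = int(match.group(1))
    if number < 100:
        return "lower"
    if number < 200:
        return "upper"
    return "graduate"


for code in ["DSC 30", "DSC 40A", "CHEM 100A", "CHEM 108"]:
    print(code, "->", course_level(code))
```

With pandas, the same function could be applied as `cleaned_df_stem['sub_course'].map(course_level)` to add the label as a new column.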


Here is the dataset summary:

### **Dataset #1**
- **Dataset Name:** CAPE Student Course Evaluations (Filtered for 2020-2022, **`df_cleaned`**)
- **Link to the dataset:** *(Local TSV file, not publicly hosted)*
- **Number of observations:** **10,831** rows (after filtering and removing missing values)
- **Number of variables:** **11** columns (including instructor name, course details, term, enrollment, evaluation metrics, study hours, and GPA expectations vs. received)

The **CAPE dataset** contains student evaluations of university courses from **2020, 2021, and 2022**, capturing key insights into instructor effectiveness, course difficulty, and student workload. Each row represents a specific course section taught by an instructor in a given term (e.g., FA22 for Fall 2022) and includes metrics such as enrollment numbers, evaluation counts, and recommendation percentages for both the course and the instructor. It also records students' reported study hours per week and their expected vs. received GPA. Rows with missing values in the **`avg_grade_rec`** column, originally represented as `-1`, have been removed to ensure data accuracy. With proper analysis, this dataset can be used to study student satisfaction scores for different courses over time.
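
Because dropping rows changes which courses are represented, it can help to report how many rows the `-1` filter removes. The sketch below is one minimal way to do that, assuming `-1` is the only missing-value sentinel in `avg_grade_rec`. `drop_missing_received_grade` is a hypothetical helper name, and the two `XXXX` rows in the toy frame are made up to show the mechanics.

```python
import numpy as np
import pandas as pd

# Toy frame: the first two rows come from the STEM output above,
# the two XXXX rows are made up to demonstrate the sentinel.
toy = pd.DataFrame({
    "sub_course": ["CHEM 108", "DSC 30", "XXXX 1", "XXXX 2"],
    "avg_grade_exp": [3.24, 3.00, 3.50, 3.10],
    "avg_grade_rec": [3.25, 2.67, -1.0, -1.0],
})


def drop_missing_received_grade(df, sentinel=-1):
    """Treat the sentinel as missing, drop those rows, and count them."""
    marked = df.assign(avg_grade_rec=df["avg_grade_rec"].replace(sentinel, np.nan))
    cleaned = marked.dropna(subset=["avg_grade_rec"])
    n_dropped = len(df) - len(cleaned)
    return cleaned, n_dropped


cleaned, n_dropped = drop_missing_received_grade(toy)
print(f"dropped {n_dropped} of {len(toy)} rows")  # prints "dropped 2 of 4 rows"
```

Converting the sentinel to `NaN` before dropping also keeps a stray `-1` from being averaged in as if it were a real GPA.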


# Ethics & Privacy

In our data science project, ethical considerations and privacy protection are integrated into every phase of our process, from the initial formulation of unbiased and inclusive research questions to the transparent communication of our findings. We verify the privacy policies and usage terms of our data sources and check that the datasets do not favor or exclude specific populations. During analysis, we apply statistical tests and fairness metrics to detect bias, and we are prepared to mitigate it through methods such as re-sampling or incorporating additional data sources. We protect sensitive information by anonymizing or aggregating data, in compliance with legal and usage terms. Following established guidelines such as UCSD’s Ethics Checklist, our team maintains ongoing oversight and open communication so that emerging ethical issues are addressed promptly, with the goal of producing research that is both equitable and transparent.
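
Since the CAPE rows include instructor names, one possible way to carry out the anonymization mentioned above is to replace each name with a keyed hash before sharing intermediate results. The sketch below is an assumption about how this could be done, not a description of what our pipeline currently does. `pseudonymize` and the key value are hypothetical, and the names are placeholders. Strictly speaking this is pseudonymization: anyone holding the key can recompute the mapping, so the key must stay private.

```python
import hashlib
import hmac


def pseudonymize(name, secret_key):
    """Map an instructor name to a stable, non-reversible identifier."""
    # Collapse whitespace and case so trivial variants map to the same ID.
    normalized = " ".join(name.split()).lower()
    digest = hmac.new(
        secret_key.encode("utf-8"),
        normalized.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    return f"instr_{digest[:12]}"


key = "replace-with-a-private-key"  # hypothetical; never commit a real key
names = ["Doe, Jane", "doe,  jane", "Roe, Richard"]  # placeholder names
ids = [pseudonymize(n, key) for n in names]
print(ids)
```

The same instructor always receives the same identifier, so grouping by instructor still works after the names are removed.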

# Team Expectations 

* Show up to meetings on time and participate actively in discussion.
* Communicate and collaborate respectfully to ensure a productive and efficient working environment.
* Work as hard as possible to ensure every assignment is completed before the deadline.
* Split work equally and fairly.
* Openly give and receive feedback from group mates so that we can improve.
* Be nice 😄

# Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting                                                                                         | Discuss at Meeting                                                                                                                                                   |
|-------------:|:-----------:|:------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **1/22**     | 5 PM        | 1. Read & review COGS 108 expectations <br> 2. Brainstorm topics/questions on course evaluations                                                     | 1. Finalize communication method <br> 2. Decide on final project focus (CAPE & Enrollment) <br> 3. Draft initial hypothesis <br> 4. Begin background research on CAPE data |
| **2/1**      | 5 PM        | 1. Conduct background research on course evaluation studies <br> 2. Look at how other universities analyze eval data                                   | 1. Narrow down research question <br> 2. Discuss ethical/data privacy considerations <br> 3. Outline ideal dataset requirements |
| **2/8**      | 5 PM        | 1. Draft project proposal & integrate feedback <br> 2. Search for CAPE & Enrollment datasets (format, coverage)                                        | 1. Review/finalize project proposal <br> 2. Plan data-wrangling strategy (merging CAPE & enrollment) <br> 3. Assign roles for data collection, cleaning, and analysis         |
| **2/22**     | 5 PM        | 1. Import & wrangle combined CAPE + Enrollment data <br> 2. Perform basic EDA (missing data, summary stats)                                            | 1. Review initial wrangling/EDA findings <br> 2. Draft advanced analysis plan (correlation, regression) <br> 3. Confirm approach for missing evals or outliers                |
| **3/4**      | 5 PM        | 1. Finalize data cleaning & EDA <br> 2. Begin preliminary analysis (e.g., correlation between eval scores & grades)                                    | 1. Discuss results & refine analysis methods (control for course level/instructor) <br> 2. Complete project check-in or milestone report                                   |
| **3/15**     | 5 PM        | 1. Wrap up statistical analysis (regressions, hypothesis tests) <br> 2. Draft discussion/conclusions                                                   | 1. Review/edit full analysis & results <br> 2. Plan final data visualization (e.g., CAPE vs. grades, trends over time) <br> 3. Assign final write-up tasks                       |
| **3/19**     | Before 11:59 PM | *No new tasks before meeting; final submission deadline*                                                                                         | 1. Finalize project report <br> 2. Submit final project and group surveys <br> 3. Ensure code/data/documentation are well-organized                                  |