# DSC 80: Project 01

### Due Date: Thursday, April 18, 11:59:59 PM

---
# Instructions

This Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems.  
* Like the lab, your coding work will be developed in the accompanying `project01.py` file, which will be imported into the current notebook. This code will be autograded.
* The project also has free response questions. To answer the free response questions, edit the markdown cell where specified (as in DSC 10). Submission of the project includes uploading a PDF of this notebook to Gradescope for manual grading.

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Do not change the free response cells outside the horizontal lines**
- The format of the cells will be used in grading the free response questions.


**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the HW! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `project01.py` (much like we do in the notebook).
- Always document your code!

**Tips for writing the free response questions**:
- You should treat the notebook as a final report for the assignment, containing conclusions and answers to open ended questions that are graded.
- Upon submitting the notebook, there should not be extraneous code in the notebook (e.g. any debugging code). You should only have your answers to the questions, and the necessary code and corresponding output data that serve as evidence for your responses.
- Generally, the free response questions will involve you *using* the functions defined in your `.py` file to justify portions of your argument.
- They should not be long, verbose answers! Typically a short paragraph will do.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import project01 as proj

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np

import os

# The other side of Gradescope

The file contains the grade-book from a fictional data science course with 535 students. 

**Note: this dataset is synthetically generated; it does not contain real student grades.**

In this project, you will:
1. clean and process the data to compute total course grades according to the fictional syllabus (below),
2. qualitatively understand how students did in the course,
3. create a curve and assess its effect.

---

The course syllabus is as follows:

* The course consists of HW assignments, projects, 1 midterm, and a final exam.
* The weights of the course components are HW (20%), projects (30%), midterm (20%), final (30%).
* For the HW assignments, students can revise an assignment for one week after submission for a 10% penalty, for two weeks after submission for a 20% penalty, and beyond that for a 50% penalty. Such revisions are reflected in the `Lateness` columns in the gradebook.
* The lowest HW assignment is dropped.
* Students can earn extra-credit through the `extra-credit` assignment, as well as turning in project checkpoints. All of the extra-credit should amount to the equivalent of *one HW assignment*.

### A note on generalization

You may assume that your code will only need to work on a gradebook for a class with the syllabus given above. That is, you may assume that the dataframe `grades` looks like the given one (in `data/grades.csv`), but 
1. may have more/fewer HW and projects,
2. may have more/fewer students.

You may assume the course components and the naming conventions are as given in the data file. 

In [None]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)

### Computing homework grades

First, you will clean and process the HW grades. To do this, you will develop functions that normalize the grades, adjust for lateness, drop the lowest grade, and total the HW grades for each student.

*Note:* You should adapt the questions in this section to process the project assignments as well, as you will need to compute the project grades for a later question. The two are similar (but not identical).

**Question 1**

Create a function `normalize_hw` that takes in a dataframe like `grades` and outputs a dataframe of normalized HW grades (see doctest for the format of the output). The output should **not** take the late penalty into account.
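
As a rough illustration of what "normalized" means here, the sketch below divides each raw score by the points possible on a toy gradebook. The column names (`hw01`, `hw01 - Max Points`) and the helper `normalize_scores` are hypothetical stand-ins; check `data/grades.csv` and the doctest for the real naming convention and output format.

```python
import pandas as pd

# Toy gradebook; the column names here are hypothetical stand-ins,
# so check the real file for the actual naming convention.
toy = pd.DataFrame({
    'hw01': [8.0, 10.0, None],
    'hw01 - Max Points': [10, 10, 10],
    'hw02': [15.0, 20.0, 5.0],
    'hw02 - Max Points': [20, 20, 20],
})

def normalize_scores(df, names):
    """Divide each raw score column by its matching max-points column."""
    out = pd.DataFrame(index=df.index)
    for name in names:
        out[name] = df[name] / df[name + ' - Max Points']
    return out

normalized = normalize_scores(toy, ['hw01', 'hw02'])
```

Note that a missing submission stays missing (`NaN`) in this sketch; deciding how to treat it is left to the later questions.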


**Question 2**

Unfortunately, Gradescope sometimes experiences a delay in registering when an assignment is submitted during "periods of heavy usage" (i.e. near a submission deadline). You need to assess when a student's assignment was actually turned in on time, even if Gradescope did not process it in time. To do this, it is helpful to know:
* Every late submission has to be submitted by a TA (late submissions are turned off).
* TAs never submitted a late assignment "just after" the deadline. 
* The deadlines were at midnight and students had to come to staff hours to late-submit their assignment.

Create a function `last_minute_submissions` that takes in the dataframe `grades` and outputs the number of submissions that were turned in on time by the student and marked 'late' by Gradescope (for each homework assignment). See the doctest for more details.

*Note:* You have to figure out what's truly a late submission by looking at the data and understanding the facts about the data generating process above. There is some ambiguity in finding which submissions are truly late; your answer will be specific to this dataset.

**Question 3**

Now you need to adjust the HW grades for late submissions. Create a function `adjust_lateness` that takes in the dataframe `grades` and returns a dataframe of HW grades adjusted for lateness according to the syllabus. Only *truly* late submissions should be counted as late (as in question 2). The adjusted HW grades should be proportions between 0 and 1.

*Note:* You should use your work from question 1 here!
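
One possible way to express the syllabus penalties is as a multiplier applied to the normalized score. The sketch below assumes lateness has already been converted to a number of hours, and it takes a hypothetical `grace_hours` threshold for submissions that Gradescope merely processed late; finding a defensible value for that threshold is the point of question 2.

```python
import numpy as np
import pandas as pd

HOURS_PER_WEEK = 7 * 24

def lateness_multiplier(hours_late, grace_hours):
    """Map hours late to a grade multiplier under the syllabus penalties.

    grace_hours is a hypothetical threshold below which a 'late' stamp is
    treated as a processing delay; the right value comes from the data.
    """
    hours_late = pd.Series(hours_late, dtype=float)
    conditions = [
        hours_late <= grace_hours,          # effectively on time
        hours_late <= HOURS_PER_WEEK,       # within one week: 10% penalty
        hours_late <= 2 * HOURS_PER_WEEK,   # within two weeks: 20% penalty
    ]
    choices = [1.0, 0.9, 0.8]
    # Anything later than two weeks falls through to the 50% penalty.
    return pd.Series(np.select(conditions, choices, default=0.5),
                     index=hours_late.index)

multipliers = lateness_multiplier([0, 1, 30, 200, 400], grace_hours=2)
```

Multiplying a column of normalized grades by the matching multipliers keeps the result between 0 and 1.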

**Question 4**

Create a function `hw_total` that takes in a dataframe of lateness-adjusted HW grades, and computes the total HW grade for each student according to the syllabus. All homework assignments should be equally weighted. Your answer should be a proportion between 0 and 1. (Don't forget to drop the lowest score!)

*Note*: Don't forget to properly handle students who didn't turn in assignments! (Use your experience and common sense)
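
A minimal sketch of dropping the lowest score on toy data follows. It assumes that a missing assignment counts as a zero, which is one reading of the note above rather than a stated rule; the helper name `drop_lowest_mean` is hypothetical.

```python
import pandas as pd

def drop_lowest_mean(adjusted):
    """Average each row after filling missing work with 0 and dropping the minimum."""
    filled = adjusted.fillna(0)  # assumption: missing work earns no credit
    n = filled.shape[1]
    return (filled.sum(axis=1) - filled.min(axis=1)) / (n - 1)

toy_adjusted = pd.DataFrame({
    'hw01': [1.0, 0.5, None],
    'hw02': [0.5, 0.5, 1.0],
    'hw03': [1.0, 1.0, 1.0],
})
toy_totals = drop_lowest_mean(toy_adjusted)
```

Subtracting the row minimum from the row sum avoids sorting each row and removes exactly one assignment even when several scores tie for lowest.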

**Question 5** 

Now, you want to understand the effect that "missing assignments" have on the HW grade distribution.

* Create a function `average_student` that takes in a dataframe like `grades` and outputs the overall HW grade of a student who hypothetically received the average grade on each HW assignment. When computing the 'average of each assignment' you *shouldn't* include people who didn't turn in the assignment.

* Is this value lower or higher than the average total HW grades given by the function `hw_total`? Write your answer in the function `higher_or_lower`.

### Computing extra-credit grades

**Question 6**

Compute the extra credit grades. To do this, you need to identify which assignments are extra-credit, total them up, *then* normalize them (the extra-credit assignments should *not* all have equal weight). To find the extra-credit assignments **read the syllabus**.

Create a function `extra_credit_total` that takes in a dataframe like `grades` and returns the total extra-credit grade as a proportion between 0 and 1.
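
"Total, *then* normalize" can be read as pooling points earned and points possible before dividing, so that larger assignments carry more weight. The sketch below shows that reading on toy data; the column names and the helper `pooled_proportion` are hypothetical, and treating missing work as zero is an assumption.

```python
import pandas as pd

# Hypothetical extra-credit columns with different point values.
toy_ec = pd.DataFrame({
    'ec_a': [2.0, None],
    'ec_a - Max Points': [2, 2],
    'ec_b': [6.0, 3.0],
    'ec_b - Max Points': [8, 8],
})

def pooled_proportion(df, names):
    """Sum points earned across columns, then divide by total points possible."""
    earned = df[names].fillna(0).sum(axis=1)
    possible = df[[name + ' - Max Points' for name in names]].sum(axis=1)
    return earned / possible

ec = pooled_proportion(toy_ec, ['ec_a', 'ec_b'])
```

Averaging the per-assignment proportions instead would give the first student (1.0 + 0.75) / 2 = 0.875 rather than 0.8, which is the equal-weight result the question says to avoid.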

### Putting it together

**Question 7**

Finally, you need to create the final course grades. To do this, you will add up the total of each course component according to the weights given in the syllabus. 

* Create a function `total_points` that takes in `grades` and returns the final course grades according to the syllabus. Course grades should be proportions between zero and one.
* Create a function `final_grades` that takes in the final course grades as above and returns a Series of letter grades given by the standard cutoffs (`A >= .90`, `.90 > B >= .80`, `.80 > C >= .70`, `.70 > D >= .60`, `.60 > F`). You should not use rounding to determine the letter grades.
* Create a function `letter_proportions` which takes in the dataframe `grades` and outputs a Series that contains the proportion of the class that received each grade. (This question requires you to put everything together).
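
The cutoffs above are closed on the left, so a score of exactly `.90` is an `A`. One way to sketch this without rounding is `pd.cut` with `right=False`; the helper name `to_letters` and the toy scores are hypothetical.

```python
import numpy as np
import pandas as pd

def to_letters(totals):
    """Bin course totals into letter grades using left-closed intervals."""
    bins = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
    labels = ['F', 'D', 'C', 'B', 'A']
    # right=False makes each bin closed on the left, e.g. [0.9, inf) -> 'A'
    return pd.cut(totals, bins=bins, labels=labels, right=False).astype(str)

toy_course = pd.Series([0.95, 0.90, 0.8999, 0.70, 0.59])
letters = to_letters(toy_course)

# Share of the toy class receiving each letter.
proportions = letters.value_counts(normalize=True)
```

Here `0.8999` stays a `B` because no rounding is applied before binning.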

*Note*: You can and should use your functions from previous questions in this problem!

*Note*: You need to create a helper function that is an analogue to question 1 for the projects. Be aware that projects may consist of both autograded (final) and free-response portions. The checkpoints are part of the extra-credit.

Verify for yourself the course grade distribution and relevant statistics!

### Do Sophomores get better grades?

**Question 8**

You notice that students who are sophomores on average did better in the class (if you can't verify this, you should go back and check your work!). Is this difference significant, or just due to noise?

Perform a hypothesis test, assessing likelihood of the null hypothesis: 
> "sophomores earn grades that are roughly equal on average to the rest of the class."


Create a function `simulate_pval` which takes in the number of simulations `N` and `grades` and returns the likelihood that the grade of sophomores was no better on average than the class as a whole (i.e. calculate the p-value).
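
As a sketch of the simulation idea, under the null hypothesis a group the size of the sophomore group is just a random draw from the whole class. The code below repeatedly draws such groups from toy scores and reports how often their mean is at least as large as the observed group mean. The helper `simulate_group_pval`, its arguments, and the choice to sample without replacement are all assumptions for illustration, not the required signature.

```python
import numpy as np

def simulate_group_pval(values, in_group, n_sims, seed=0):
    """Share of simulated group means at least as large as the observed one."""
    values = np.asarray(values, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    observed = values[in_group].mean()
    size = int(in_group.sum())
    rng = np.random.default_rng(seed)
    # Each simulation draws a random group of the same size from the class.
    sims = np.array([
        rng.choice(values, size=size, replace=False).mean()
        for _ in range(n_sims)
    ])
    return float((sims >= observed).mean())

toy_scores = [0.9, 0.8, 0.85, 0.6, 0.7, 0.65, 0.75, 0.95]
toy_is_soph = [True, True, True, False, False, False, False, False]
p = simulate_group_pval(toy_scores, toy_is_soph, n_sims=1000)
```

A small value of `p` suggests the observed difference is unlikely to be due to noise alone.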

### Creating a curve

You realize that certain assignments in the course were harder than other assignments and you would like to take this into account. You feel that if someone did very well on a difficult assignment, it should have more effect than doing well on an easy one. You decide to try out a curve as follows:

1. Convert *every* assignment to [Standard Units](https://www.inferentialthinking.com/chapters/14/2/Variability.html#standard-units).
2. Calculate the proportion of the course grade that every assignment represents.
3. Calculate the weighted sum of the standardized assignment scores and their weights.
4. Now that you have a sorted list of total scores, assign the same number of each letter grade as in the un-curved distribution (this allows for an entire class to get `A`s for example, if the class is easy).
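
Steps 1 through 3 can be sketched on toy data as follows. The column names and weights are hypothetical, and the sketch uses the population standard deviation (`ddof=0`), which is an assumption about the intended convention for standard units.

```python
import pandas as pd

def standard_units(col):
    """Center a column at 0 and scale by its population SD (ddof=0)."""
    return (col - col.mean()) / col.std(ddof=0)

toy_scores = pd.DataFrame({
    'easy': [0.9, 1.0, 0.95, 0.85],
    'hard': [0.2, 0.9, 0.4, 0.5],
})
# Hypothetical weights: the share of the course grade each column represents.
toy_weights = pd.Series({'easy': 0.5, 'hard': 0.5})

z = toy_scores.apply(standard_units)      # step 1: standard units per column
curved = (z * toy_weights).sum(axis=1)    # step 3: weighted sum per student
```

Because `hard` has a wider spread of raw scores, a gap of the same raw size on `hard` moves the curved total less than it would on `easy`; what the curve rewards is standing relative to the class on each assignment.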

**Question 9**

Create a function `get_assignment_proportions` that takes in `grades` and returns a dictionary 
* keyed by assignment name 
* with values given by the proportion of the final grade that assignment makes up. 

*Note*: Every column in `grades` that represents a student score should be a key.

**Question 10**

Create a function `curved_total_points` which takes in `grades` and outputs the curved total scores for each student. For the HW questions, grade adjustments should *still* be made for late-submissions, however, for simplicity, **do not** drop the lowest HW assignment. 

*Note*: When standardizing scores, the mean/std that you are standardizing to should *not* incorporate missing values. However, a missing assignment *should* be set to zero *before* standardizing (otherwise, you could earn an average score by skipping all the work!).
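
One reading of this note, sketched on a toy column: compute the mean and standard deviation from the submitted scores only, then fill missing scores with zero and standardize against those statistics. The helper name is hypothetical and `ddof=0` is an assumption.

```python
import pandas as pd

def standardize_with_missing(col):
    """Standardize against submitted scores only, scoring missing work as 0."""
    mu = col.mean()            # pandas skips NaN by default
    sd = col.std(ddof=0)       # also computed from non-missing values only
    return (col.fillna(0) - mu) / sd

toy_col = pd.Series([0.5, 0.7, 0.9, None])
z_col = standardize_with_missing(toy_col)
```

The student with the missing score ends up far below the mean instead of sitting at it.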

Create a function `curved_letter_grades` which takes in:
1. a Series of curved course grades (as above),
2. a Series of letter grade distributions (e.g. the output of `letter_proportions`)

and returns a Series containing the letter grade of each student according to the curve.    

*Note:* You may find the `np.percentile` function useful here!
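
A minimal sketch of the percentile idea, assuming the letter proportions are indexed by `'A'` through `'F'`: the cumulative share of the class below each letter gives the percentile at which that letter's cutoff sits. The helper `letters_from_curve` and the toy inputs are hypothetical, and tie handling at the cutoffs is not addressed here.

```python
import numpy as np
import pandas as pd

def letters_from_curve(curved, letter_props):
    """Assign letters so each letter's share matches letter_props."""
    order = ['F', 'D', 'C', 'B', 'A']
    props = letter_props.reindex(order).fillna(0)
    # Cumulative share of the class below D, C, B and A, as percentiles.
    cut_percentiles = props.cumsum().iloc[:-1].to_numpy() * 100
    cutoffs = np.percentile(curved.to_numpy(), cut_percentiles)
    # Count how many cutoffs each score meets or exceeds (0 -> F, 4 -> A).
    idx = np.searchsorted(cutoffs, curved.to_numpy(), side='right')
    return pd.Series(np.array(order)[idx], index=curved.index)

toy_curved = pd.Series(np.arange(10, dtype=float))
toy_props = pd.Series({'A': 0.2, 'B': 0.3, 'C': 0.3, 'D': 0.1, 'F': 0.1})
toy_letters = letters_from_curve(toy_curved, toy_props)
```

With ten distinct toy scores, the result has two `A`s, three `B`s, three `C`s, one `D`, and one `F`, matching the requested proportions.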

### Assessing the curve

**Question 11**

Do data analysis to understand the effect the curve has on students' grades in the given course. Write a summary of your analysis in the free response section below. You should address:
1.  Was there a change in the median letter grade in the course between the not-curved/curved grades?
2. How many students saw a grade increase due to the curve? Why did their grades increase?
3. How many students saw a grade decrease due to the curve? Why did their grades decrease?
4. Describe a hypothetical class where a student's grade might decrease due to implementing such a curve.
5. Discuss the advantages and disadvantages of using the curve over grading on a straight-scale.

**Free Response Cell**

---

**Response to Question 11 here**

---

# Congratulations, you finished the project!

### Before you submit:
* Be sure you run the doctests on all your code in project01.py
* Be sure your free response questions are all answered, readable, and that you haven't changed the cells outside the horizontal lines!

### To submit:
* **Convert the notebook to PDF and upload to gradescope for grading the free response.**
* **Upload the .py file to gradescope**