# DSC 80: Project 01

### Checkpoint Due Date: Thursday April 8, 11:59 PM (Questions 1-4)
### Due Date: Thursday April 15, 11:59 PM

---
# Instructions

This Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems.  
* Like the lab, your coding work will be developed in the accompanying `project01.py` file, which will be imported into the current notebook. This code will be autograded.
* **For the checkpoint, turn in questions 1-4**

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are **encouraged to write your own additional functions** to solve the questions! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `project01.py` -- however, be sure to upload these to gradescope as well!
- Always document your code!

**Tips for testing the correctness of your answers!**
Once you have your work saved in the .py file, you should import `project01` to test your functions in the notebook. In the notebook, inspect and analyze the output to assess its correctness!
* Run your functions on the main dataset (`grades`) and ask yourself if the output *looks correct.*
* Run your functions on very small datasets (e.g. 1-5 row table), calculate the expected response by hand, and see if the function output matches (this *is* unit-testing your code with data).
* Run your functions on (large and small) samples of the dataset `grades` (with and without replacement). Does your code break, or does it still run as expected?
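
One way to follow these tips, as a minimal sketch: build a tiny table by hand, compute the expected answer yourself, and compare it with the function output. The `normalize_score` helper and the `hw01` column names below are hypothetical stand-ins and are not part of the starter code.

```python
import pandas as pd

def normalize_score(df, name):
    """Hypothetical helper: raw score over max points; missing scores count as 0."""
    return df[name].fillna(0) / df[name + ' - Max Points']

# A 3-row table small enough to check by hand.
tiny = pd.DataFrame({
    'hw01': [50.0, None, 100.0],
    'hw01 - Max Points': [100.0, 100.0, 100.0],
})
expected = [0.5, 0.0, 1.0]  # computed by hand
result = normalize_score(tiny, 'hw01')
assert result.tolist() == expected

# Sampling with replacement repeats index labels; the function should still run.
resampled = tiny.sample(n=5, replace=True, random_state=0)
resampled_result = normalize_score(resampled, 'hw01')
assert len(resampled_result) == 5
```

The same pattern applies to the real functions: swap in a function from `project01.py` and a few rows whose answer you can work out on paper.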

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import project01 as proj

In [3]:
import pandas as pd
import numpy as np
import os

# The Other Side of Gradescope

The file `data/grades.csv` contains the grade-book from a fictional data science course with 535 students. 

**Note: this dataset is synthetically generated; it does not contain real student grades. The course syllabus below is also not the same as the course syllabus for this class!**

In this project, you will:
1. clean and process the data to compute total course grades according to a fictional syllabus (below),
2. qualitatively understand how students did in the course,
3. understand how student grades vary with small changes in performance on each assignment.

---

The course syllabus is as follows:

* Lab assignments 
    - Each is worth the same amount, regardless of each lab's raw point total.
    - The lowest lab is dropped.
    - Each lab may be revised for one week after submission for a 10% penalty, for two weeks after submission for a 30% penalty, and beyond that for a 60% penalty. Such revisions are reflected in the `Lateness` columns in the gradebook.
    - Labs are 20% of the total grade.
* Projects 
    - Each project consists of an autograded portion, and *possibly* a free response portion.
    - The total points for a single project are the sum of the raw scores of the two portions.
    - Each is worth the same amount, regardless of each project's raw point total.
    - Projects are 30% of the total grade.
* Checkpoints
    - Project checkpoints are worth 2.5% of the total grade.
* Discussion
    - Discussion notebooks are worth 2.5% of the total grade.
* Exams
    - The midterm is worth 15% of the total grade.
    - The final is worth 30% of the total grade.
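
As a rough illustration of how these weights combine, the sketch below turns per-component proportions into a course grade. The `WEIGHTS` dictionary and `weighted_total` function are hypothetical names, and the student scores are invented; only the weights come from the syllabus.

```python
# Component weights taken from the syllabus above.
WEIGHTS = {
    'lab': 0.20,
    'project': 0.30,
    'checkpoint': 0.025,
    'disc': 0.025,
    'midterm': 0.15,
    'final': 0.30,
}

def weighted_total(component_scores):
    """Combine per-component proportions (each in [0, 1]) into a course grade."""
    return sum(WEIGHTS[name] * score for name, score in component_scores.items())

# The weights should cover the whole grade.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-12

# Hypothetical student; these scores are invented for illustration.
student = {'lab': 0.9, 'project': 0.8, 'checkpoint': 1.0,
           'disc': 1.0, 'midterm': 0.7, 'final': 0.85}
total = weighted_total(student)
```

Checking that the weights sum to 1 is a cheap guard against mistyping a percentage.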


### A note on generalization

You may assume that your code will only need to work on a gradebook for a class with the syllabus given above. That is, you may assume that the dataframe `grades` looks like the given one in `data/grades.csv`.

However, such a class:
1. may have different numbers of labs, projects, discussions, and project checkpoints.
2. may have a different number of students.

You may assume the course components and the naming conventions are as given in the data file.

The dataset was generated by Gradescope; you must attempt to reason about the data as given using what you know as a student who uses Gradescope.

### A note on 'putting everything together'

The goal of this project is to create and assess final grades for a fictional course; if anything, the process is broken down into functions for your convenience and guidance. Here are a few remarks and tips for approaching the projects:
1. If you are having trouble figuring out what a question is asking you to do, look at the big picture and try to understand what the current step is doing to contribute to this big picture. This may clarify what's being asked!
1. These questions intentionally build off of each other and the final result matters! In fact, you can 'get a question correct', but only receive partial credit on it because a previous answer was wrong.
    - A question will typically receive partial credit based on *how close* your answer is to correct (as well as some credit for a solution in the correct form). 
    - You should try to assess your answer to each question based on what you understand of the data. This might involve writing extensive code (that isn't turned in) just to check your work! Suggestions on checking your work are given in the assignment, but you should also think of your own ways of checking your work.
    - As you do this project, think about the data from the perspective of the student (which should be easy to do!)

In [4]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)

### Getting started: enumerating the assignments

First, you will list all the 'assignment names' and the part of the syllabus to which each belongs.

**Question 1:**

Create a function `get_assignment_names` that takes in a dataframe like `grades` and returns a dictionary with the following structure:
- The keys are the general areas of the syllabus: `lab, project, midterm, final, disc, checkpoint`
- The values are lists that contain the assignment names of that type. For example the lab assignments all have names of the form `labXX` where `XX` is a zero-padded two digit number. See the doctests for more details.

In [5]:
def get_assignment_names(grades):
    '''
    get_assignment_names takes in a dataframe like grades and returns 
    a dictionary with the following structure:

    The keys are the general areas of the syllabus: lab, project, 
    midterm, final, disc, checkpoint

    The values are lists that contain the assignment names of that type. 
    For example the lab assignments all have names of the form labXX where XX 
    is a zero-padded two digit number. See the doctests for more details.    

    :Example:
    >>> grades_fp = os.path.join('data', 'grades.csv')
    >>> grades = pd.read_csv(grades_fp)
    >>> names = get_assignment_names(grades)
    >>> set(names.keys()) == {'lab', 'project', 'midterm', 'final', 'disc', 'checkpoint'}
    True
    >>> names['final'] == ['Final']
    True
    >>> 'project02' in names['project']
    True
    '''
    class_dict = {'lab':[], 'project':[], 'midterm':[], 'final':[], 'disc':[], 'checkpoint':[]}
   
    for col in grades.columns:
        if 'lab' in col and '-' not in col:
            class_dict['lab'].append(col)
        elif 'project' in col and '-' not in col and '_' not in col:
            class_dict['project'].append(col)
        elif 'midterm' in col and '-' not in col:
            class_dict['midterm'].append(col)
        elif 'Final' in col and '-' not in col:
            class_dict['final'].append(col)
        elif 'discussion' in col and '-' not in col:
            class_dict['disc'].append(col)
        elif 'checkpoint' in col and '-' not in col:
            class_dict['checkpoint'].append(col)
    
    return class_dict

In [6]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)
names = get_assignment_names(grades)

In [7]:
set(names.keys()) == {'lab', 'project', 'midterm', 'final', 'disc', 'checkpoint'}

True

In [8]:
names['final'] == ['Final']

True

In [9]:
'project02' in names['project']

True

### Computing project grades

**Question 2**

Compute the total score for the project portion of the course according to the syllabus. Create a function `projects_total` that takes in `grades` and computes the total project grade for the quarter according to the syllabus. The output Series should contain values between 0 and 1.

*Note*: Don't forget to properly handle students who didn't turn in assignments! (Use your experience and common sense).

*Note:* To check your work, try (1) calculating the score for a few types of students by hand, and (2) calculate the statistics for the class performance on each individual course project, making sure they look reasonable.
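
A hand check of this kind might look like the sketch below. It assumes that a free response portion lives in a column named like `project01_free_response` with its own `- Max Points` column; that naming is an assumption, so confirm it against the real columns. `project_score` is a hypothetical helper, not part of the starter code.

```python
import pandas as pd

# Tiny hand-made gradebook. The `_free_response` column name is an assumption
# about the naming convention; check it against the real data.
tiny = pd.DataFrame({
    'project01': [80.0, None],
    'project01 - Max Points': [100.0, 100.0],
    'project01_free_response': [10.0, 5.0],
    'project01_free_response - Max Points': [20.0, 20.0],
    'project02': [45.0, 50.0],
    'project02 - Max Points': [50.0, 50.0],
})

def project_score(df, name):
    """Score for one project: summed raw points over summed max points."""
    max_cols = [c for c in df.columns
                if c.startswith(name) and c.endswith(' - Max Points')
                and 'checkpoint' not in c]
    raw_cols = [c[:-len(' - Max Points')] for c in max_cols]
    raw = df[raw_cols].fillna(0).sum(axis=1)  # unsubmitted work scores 0
    return raw / df[max_cols].sum(axis=1)

p1 = project_score(tiny, 'project01')  # student 0: (80 + 10) / 120
p2 = project_score(tiny, 'project02')  # student 0: 45 / 50
projects = (p1 + p2) / 2               # each project is weighted equally
```

Student 1 did not submit the autograded part of `project01`, so only the 5 free response points count toward that project.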

In [10]:
names['project']

['project01', 'project02', 'project03', 'project04', 'project05']

In [11]:
def projects_total(grades):
    '''
    projects_total that takes in grades and computes the total project grade
    for the quarter according to the syllabus. 
    The output Series should contain values between 0 and 1.
    
    :Example:
    >>> grades_fp = os.path.join('data', 'grades.csv')
    >>> grades = pd.read_csv(grades_fp)
    >>> out = projects_total(grades)
    >>> np.all((0 <= out) & (out <= 1))
    True
    >>> 0.7 < out.mean() < 0.9
    True
    '''
    grades = grades.fillna(0)
    
    overall_grades = []
    
    project_count = len(get_assignment_names(grades)['project'])
    
    max_grades = grades[list(filter(lambda x:'Max' in x and 'project' in x and 'checkpoint' not in x, grades.columns))]
    project_grades = grades[list(filter(lambda x:'project' in x and 'checkpoint' not in x and '-' not in x, grades.columns))]
    
    for n in range(1, project_count+1):
        total_score = project_grades[list(filter(lambda x: str(n) in x, project_grades.columns))].sum(axis=1)
        max_score = max_grades[list(filter(lambda x: str(n) in x, max_grades.columns))].sum(axis=1)
        overall_grades.append(total_score / max_score)
    
    return pd.Series(sum(overall_grades) / project_count)

In [12]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)
out = projects_total(grades)
out

0      0.916234
1      0.765932
2      0.681279
3      0.962581
4      0.737446
         ...   
530    0.949434
531    0.866795
532    0.862050
533    0.813468
534    0.939433
Length: 535, dtype: float64

In [13]:
np.all((0 <= out) & (out <= 1))

True

In [14]:
0.7 < out.mean() < 0.9

True

### Computing lab grades

Now, you will clean and process the lab grades, which is a little more complicated. To do this, you will develop functions that:
- 'normalize' the grades, 
- adjust for late submissions, 
- drop the lowest lab grade, and 
- create a total lab score for each student.

**Question 3**

Unfortunately, Gradescope sometimes experiences a delay in registering when an assignment is submitted during "periods of heavy usage" (i.e. near a submission deadline). You need to assess when a student's assignment was actually turned in on time, even if Gradescope did not process it in time. To do this, it is helpful to know:
* Every late submission has to be submitted by a TA (late submissions are turned off).
* TAs never submitted a late assignment "just after" the deadline. 
* The deadlines were at midnight and students had to come to staff hours to late-submit their assignment.

Create a function `last_minute_submissions` that takes in the dataframe `grades` and outputs the number of submissions on each assignment that were turned in on time by the student, yet marked 'late' by Gradescope. See the doctest for more details.

*Note:* You have to figure out what truly is a late submission by looking at the data and understanding the facts about the data generating process above. There is some ambiguity in finding which submissions are truly late; you will *make a best guess for a threshold* by looking at this dataset. This question is about 'cleaning' a messy 'data recording process'.
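
To explore a threshold, it can help to convert the lateness strings to seconds and count how many fall just past the deadline. The sketch below uses a handful of values in the style of the gradebook; `lateness_to_seconds` is a hypothetical helper and the 3-hour cutoff is an assumed best guess, not a given.

```python
import pandas as pd

def lateness_to_seconds(col):
    """Convert 'H:M:S' strings (hours may exceed 24) to seconds."""
    parts = col.str.split(':', expand=True).astype(int)
    return parts[0] * 3600 + parts[1] * 60 + parts[2]

# A few lateness values for illustration only.
lateness = pd.Series(['00:00:00', '00:04:51', '02:30:00', '47:26:10', '669:12:21'])
seconds = lateness_to_seconds(lateness)

# Assumed cutoff: a submission marked late by at most 3 hours is treated as
# a Gradescope processing delay rather than a real late submission.
THRESHOLD = 3 * 3600
falsely_late = ((seconds > 0) & (seconds <= THRESHOLD)).sum()
```

On the real data, looking at the sorted nonzero lateness values (or a histogram of them in hours) is one way to see where the gap between "just after the deadline" and "submitted by a TA" falls.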

In [15]:
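def time_converter(time):
    # `time` is a lateness string already split on ':' into [hours, minutes, seconds]
    h = int(time[0])
    m = int(time[1])
    s = int(time[2])
    # total lateness in seconds
    return h*3600 + m*60 + s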

In [16]:
def last_minute_submissions(grades):
    """
    last_minute_submissions takes in the dataframe 
    grades and returns a Series indexed by lab assignment that 
    contains the number of submissions that were turned 
    in on time by the student, yet marked 'late' by Gradescope.

    :Example:
    >>> fp = os.path.join('data', 'grades.csv')
    >>> grades = pd.read_csv(fp)
    >>> out = last_minute_submissions(grades)
    >>> isinstance(out, pd.Series)
    True
    >>> np.all(out.index == ['lab0%d' % d for d in range(1,10)])
    True
    >>> (out > 0).sum()
    8
    """
    late_count = []
    lab_count = len(get_assignment_names(grades)['lab'])
    all_labs = grades[list(filter(lambda x: 'lab' in x and 'Lateness' in x, grades.columns))]
    
    for n in range(1, lab_count+1):
        late_times = all_labs[list(filter(lambda x: str(n) in x, all_labs.columns))[0]].str.split(":")
        seconds_late = late_times.apply(time_converter)
        # count submissions marked late by at most 3 hours (10800 seconds)
        late_count.append(((seconds_late > 0) & (seconds_late <= 10800)).sum())
    return pd.Series(late_count,get_assignment_names(grades)['lab'])

In [17]:
out = last_minute_submissions(grades)
out

lab01     2
lab02     0
lab03     2
lab04     6
lab05     7
lab06     8
lab07    16
lab08    11
lab09    26
dtype: int64

In [18]:
isinstance(out, pd.Series)

True

In [19]:
np.all(out.index == ['lab0%d' % d for d in range(1,10)])

True

In [20]:
(out > 0).sum()

8

**Question 4**

Now you need to adjust the lab grades for late submissions -- however, you need to take into account your investigation in the previous question, since students shouldn't be penalized by a bug in Gradescope!

Create a function `lateness_penalty` that takes in a 'Lateness' column and returns a column of penalties (represented by the values `1.0,0.9,0.7,0.4` according to the syllabus). Only *truly* late submissions should be counted as late.

*Note*: For the purpose of this project, we will only be calculating lateness for labs. There is no penalty for lateness for projects, discussions, or checkpoints.
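
The mapping from lateness to penalty can also be written with `np.select`, which evaluates the conditions in order and keeps the first match. This is a sketch under stated assumptions: `penalty_from_seconds` is a hypothetical helper that expects lateness already converted to seconds, and the 3-hour grace period is the assumed threshold from Question 3.

```python
import numpy as np
import pandas as pd

HOUR = 3600
WEEK = 7 * 24 * HOUR

def penalty_from_seconds(seconds, grace=3 * HOUR):
    """Map seconds late to a multiplier, keeping the input's index."""
    secs = seconds.to_numpy()
    # Conditions are checked in order; the first True one decides the penalty.
    conditions = [secs <= grace, secs <= WEEK, secs <= 2 * WEEK]
    choices = [1.0, 0.9, 0.7]
    return pd.Series(np.select(conditions, choices, default=0.4), index=seconds.index)

# On time, 2 hours, 2 days, 10 days, and 30 days late.
seconds = pd.Series([0, 2 * HOUR, 2 * 24 * HOUR, 10 * 24 * HOUR, 30 * 24 * HOUR],
                    index=[10, 11, 12, 13, 14])
penalties = penalty_from_seconds(seconds)
```

Returning a Series with the input's index means the penalties still line up with the right students when the gradebook is filtered or sampled.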

In [21]:
def lateness_penalty(col):
    """
    lateness_penalty takes in a 'lateness' column and returns 
    a column of penalties according to the syllabus.

    :Example:
    >>> fp = os.path.join('data', 'grades.csv')
    >>> col = pd.read_csv(fp)['lab01 - Lateness (H:M:S)']
    >>> out = lateness_penalty(col)
    >>> isinstance(out, pd.Series)
    True
    >>> set(out.unique()) <= {1.0, 0.9, 0.7, 0.4}
    True
    """
    reformatted = col.str.split(":")
    seconds_late = reformatted.apply(time_converter)
    scores = [1 if time <= 10800 else 0.9 if time <=604800 else 0.7 if time <= 1209600 else 0.4 for time in seconds_late]
    # keep the input's index so penalties stay aligned with the students
    return pd.Series(scores, index=col.index)

In [22]:
fp = os.path.join('data', 'grades.csv')
col = pd.read_csv(fp)['lab01 - Lateness (H:M:S)']
out = lateness_penalty(col)
out

0      1.0
1      1.0
2      1.0
3      1.0
4      1.0
      ... 
530    0.9
531    1.0
532    1.0
533    1.0
534    1.0
Length: 535, dtype: float64

In [23]:
set(out.unique()) <= {1.0, 0.9, 0.7, 0.4}

True

**Question 5**

Create a function `process_labs` that takes in a dataframe like `grades` and returns a dataframe of processed lab scores. The output should:
* share the same index as `grades`,
* have columns given by the lab assignment names (e.g. `lab01,...lab10`)
* have values representing the lab grades for each assignment, adjusted for Lateness and scaled to a score between 0 and 1.
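
As a small illustration of these three requirements, the sketch below scales each lab by its max points and multiplies by a penalty, using a hand-made table with a non-default index. The `penalties` frame is assumed to be computed already (for example with `lateness_penalty`); its values here are invented.

```python
import pandas as pd

# Hand-made example with a non-default index to check that it is preserved.
tiny = pd.DataFrame({
    'lab01': [90.0, 50.0],
    'lab01 - Max Points': [100.0, 100.0],
    'lab02': [20.0, 25.0],
    'lab02 - Max Points': [25.0, 25.0],
}, index=[7, 42])

# Penalty multipliers per lab, assumed already computed; values are invented.
penalties = pd.DataFrame({'lab01': [1.0, 0.9], 'lab02': [0.7, 1.0]},
                         index=tiny.index)

lab_names = ['lab01', 'lab02']
processed = pd.DataFrame({
    name: tiny[name] / tiny[name + ' - Max Points'] * penalties[name]
    for name in lab_names
})
```

Because every operation works on Series that share the index of `tiny`, the result keeps that index without any extra bookkeeping.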

In [24]:
grades

Unnamed: 0,PID,College,Level,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),project01,...,discussion07 - Lateness (H:M:S),discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S)
0,A14721419,SI,JR,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,75.282632,...,00:00:00,8.895294,10,00:00:00,10.000000,10,780:01:28,10.000000,10,00:00:00
1,A14883274,TH,JR,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,52.929482,...,669:12:21,9.022407,10,00:00:00,9.020283,10,00:00:00,9.437368,10,00:00:00
2,A14164800,SI,SR,86.513369,100.0,00:00:00,47.802820,100.0,00:00:00,46.122801,...,00:00:00,3.030538,10,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00
3,A14847419,TH,JR,100.000000,100.0,00:00:00,100.000000,100.0,00:00:00,79.121806,...,00:00:00,10.000000,10,00:00:00,9.249126,10,00:00:00,10.000000,10,00:00:00
4,A14162943,SI,JR,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,41.823703,...,00:00:00,4.439606,10,00:00:00,4.485291,10,00:00:00,6.282712,10,00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A14490387,SI,JR,100.000000,100.0,47:26:10,82.022753,100.0,00:00:00,78.936816,...,00:00:00,10.000000,10,12:08:58,9.169447,10,00:00:00,10.000000,10,00:00:00
531,A14088257,SI,SO,100.000000,100.0,00:00:00,87.498073,100.0,00:00:00,72.076801,...,00:00:00,10.000000,10,00:00:00,10.000000,10,00:00:00,10.000000,10,00:00:00
532,A14847419,WA,JR,88.656641,100.0,00:00:00,90.326041,100.0,00:00:00,66.273252,...,00:00:00,9.878661,10,00:00:00,8.878946,10,00:00:00,10.000000,10,00:00:00
533,A14513929,TH,SR,83.799719,100.0,00:00:00,85.636947,100.0,00:00:00,63.965217,...,00:00:00,7.759434,10,00:00:00,8.655478,10,419:06:41,8.102277,10,00:00:00


In [25]:
def process_labs(grades):
    """
    process_labs that takes in a dataframe like grades and returns
    a dataframe of processed lab scores. The output should:
      * share the same index as grades,
      * have columns given by the lab assignment names (e.g. lab01,...lab10)
      * have values representing the lab grades for each assignment, 
        adjusted for Lateness and scaled to a score between 0 and 1.

    :Example:
    >>> fp = os.path.join('data', 'grades.csv')
    >>> grades = pd.read_csv(fp)
    >>> out = process_labs(grades)
    >>> out.columns.tolist() == ['lab%02d' % x for x in range(1,10)]
    True
    >>> np.all((0.65 <= out.mean()) & (out.mean() <= 0.90))
    True
    """
    
    all_labs = grades[list(filter(lambda x: 'lab' in x and '-'not in x, grades.columns))]
    lab_cols = all_labs.columns
    max_points = grades[list(filter(lambda x: 'lab' in x and 'Max'in x, grades.columns))]
    lateness = grades[list(filter(lambda x: 'lab' in x and 'Lateness' in x, grades.columns))]
    
    late_dict = {}
    for col in lateness.columns:
        late_dict[col] = lateness_penalty(lateness[col])
        #late_dict[col] = lateness_penalty(lateness[col]) - 1
    lateness_df = pd.DataFrame(late_dict)
    
    penalty = pd.DataFrame(all_labs.values / max_points.values * lateness_df.values)
    #penalty = all_labs.values / max_points.values + lateness_df.values
    
    final_grades = pd.DataFrame(penalty)
    final_grades.columns = lab_cols
    return final_grades

In [26]:
out = process_labs(grades)
out

Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09
0,0.997353,0.849902,0.637744,1.000000,1.000000,0.994518,0.389141,0.887917,0.874913
1,0.988295,0.507842,0.714477,0.783672,1.000000,0.393887,0.914061,0.944378,0.902977
2,0.865134,0.478028,0.433667,0.738875,0.927838,0.345076,0.734070,0.718204,0.757840
3,1.000000,1.000000,0.925903,0.950614,0.891614,0.688403,0.985371,0.963307,0.777880
4,0.665070,0.334224,0.706932,0.747915,0.659720,0.731345,0.607859,0.370186,1.000000
...,...,...,...,...,...,...,...,...,...
530,0.900000,0.820228,1.000000,0.792935,1.000000,0.284106,0.770281,0.931245,1.000000
531,1.000000,0.874981,0.809945,0.592866,0.987597,0.759688,0.856178,0.849694,0.582645
532,0.886566,0.903260,1.000000,1.000000,0.941425,0.768909,0.967282,0.877898,1.000000
533,0.837997,0.856369,0.909363,0.955287,0.737854,0.382781,0.769093,0.947450,0.867373


In [27]:
out.columns.tolist() == ['lab%02d' % x for x in range(1,10)]

True

In [28]:
np.all((0.65 <= out.mean()) & (out.mean() <= 0.90))

True

In [29]:
out.mean()

lab01    0.893585
lab02    0.784569
lab03    0.850241
lab04    0.806796
lab05    0.848179
lab06    0.660588
lab07    0.840851
lab08    0.746800
lab09    0.857812
dtype: float64

**Question 6**

Create a function `lab_total` that takes in a dataframe of processed assignments (like the output of Question 5) and computes the total lab grade for each student according to the syllabus (returning a Series). Your answers should be proportions between 0 and 1. For example, if there are only 3 labs, and a student received scores of {80%,90%,100%}, then the total score would be 0.95.

*Note*: Don't forget to properly handle students who didn't turn in assignments! (Use your experience and common sense).
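
The example above can be worked through in code. This sketch drops the lowest score by subtracting the row minimum from the row sum; `drop_lowest_mean` is a hypothetical name, and treating a missing lab as a 0 is one reasonable reading of the note above.

```python
import pandas as pd

def drop_lowest_mean(processed):
    """Average each row after removing its single lowest score."""
    filled = processed.fillna(0)  # missing labs count as 0
    n_labs = filled.shape[1]
    return (filled.sum(axis=1) - filled.min(axis=1)) / (n_labs - 1)

labs = pd.DataFrame(
    [[0.8, 0.9, 1.0],    # drop 0.8, leaving the mean of 0.9 and 1.0
     [1.0, None, 1.0]],  # the missing lab is the one dropped
    columns=['lab01', 'lab02', 'lab03'],
)
totals = drop_lowest_mean(labs)
```

Subtracting the minimum removes exactly one score even when the lowest value is tied, which filtering on `row != row.min()` would not.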

In [30]:
def find_highest(row):
    smallest = row.min()
    #return row.loc[row != smallest].mean()
    return (row.sum() - smallest) / (row.shape[0] - 1)

In [31]:
find_highest(pd.Series([0.2, 0.90, 1.0]))

0.9500000000000001

In [32]:
def lab_total(processed):
    """
    lab_total takes in dataframe of processed assignments (like the output of 
    Question 5) and computes the total lab grade for each student according to
    the syllabus (returning a Series). 
     
    Your answers should be proportions between 0 and 1.

    :Example:
    >>> cols = 'lab01 lab02 lab03'.split()
    >>> processed = pd.DataFrame([[0.2, 0.90, 1.0]], index=[0], columns=cols)
    >>> np.isclose(lab_total(processed), 0.95).all()
    True
    """
    processed = processed.fillna(0)
    highest_scores = processed.apply(find_highest, axis=1)
    #print(highest_scores.isnull().values.any())
    return highest_scores

In [33]:
cols = 'lab01 lab02 lab03'.split()
processed = pd.DataFrame([[0.2, 0.90, 1.0]], index=[0], columns=cols)
processed

Unnamed: 0,lab01,lab02,lab03
0,0.2,0.9,1.0


In [34]:
lab_total(processed)

0    0.95
dtype: float64

In [35]:
np.isclose(lab_total(processed), 0.95).all()

True

### Putting it together

**Question 7**

Finally, you need to create the final course grades. To do this, you will add up the total of each course component according to the weights given in the syllabus. 

* Create a function `total_points` that takes in `grades` and returns the final course grades according to the syllabus. Course grades should be proportions between zero and one.
* Create a function `final_grades` that takes in the final course grades as above and returns a Series of letter grades given by the standard cutoffs (`A >= .90`, `.90 > B >= .80`, `.80 > C >= .70`, `.70 > D >= .60`, `.60 > F`). You should not use rounding to determine the letter grades.
* Create a function `letter_proportions` which takes in the dataframe `grades` and outputs a Series that contains the proportion of the class that received each grade. (This question requires you to put everything together).
* The indices should be ordered by the proportion of the class that receives that grade, from largest to smallest.

*Note 1*: Don't repeat yourself when computing the checkpoint and discussion portions of the course.

*Note 2*: Only the lab portion of the course accounts for late assignments; you may assume all assignments in other portions are turned in without penalty.

*Note 3*: These values should add up to exactly 1.0. If you are getting something close such as 0.99999, that means there is a slight issue with your code from above. 

To check your work, verify the course grade distribution and relevant statistics! Do the work by hand for a few students.
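
One possible way to apply the cutoffs without rounding is `pd.cut` with left-closed bins, followed by `value_counts(normalize=True)` for the proportions. This is a sketch, not the required approach; `letters_from_totals` is a hypothetical name and the totals are invented.

```python
import numpy as np
import pandas as pd

def letters_from_totals(total):
    """Assign letters using half-open bins [low, high), with no rounding."""
    bins = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
    labels = ['F', 'D', 'C', 'B', 'A']
    return pd.cut(total, bins=bins, labels=labels, right=False).astype(str)

# Invented totals, including values at and just below the A cutoff.
totals = pd.Series([0.92, 0.90, 0.899, 0.81, 0.41])
letters = letters_from_totals(totals)

# value_counts(normalize=True) sorts by frequency, largest first.
proportions = letters.value_counts(normalize=True)
```

Testing values exactly at a cutoff and just below it is a quick way to confirm that the boundaries behave as the syllabus states.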

In [36]:
grades

Unnamed: 0,PID,College,Level,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),project01,...,discussion07 - Lateness (H:M:S),discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S)
0,A14721419,SI,JR,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,75.282632,...,00:00:00,8.895294,10,00:00:00,10.000000,10,780:01:28,10.000000,10,00:00:00
1,A14883274,TH,JR,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,52.929482,...,669:12:21,9.022407,10,00:00:00,9.020283,10,00:00:00,9.437368,10,00:00:00
2,A14164800,SI,SR,86.513369,100.0,00:00:00,47.802820,100.0,00:00:00,46.122801,...,00:00:00,3.030538,10,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00
3,A14847419,TH,JR,100.000000,100.0,00:00:00,100.000000,100.0,00:00:00,79.121806,...,00:00:00,10.000000,10,00:00:00,9.249126,10,00:00:00,10.000000,10,00:00:00
4,A14162943,SI,JR,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,41.823703,...,00:00:00,4.439606,10,00:00:00,4.485291,10,00:00:00,6.282712,10,00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A14490387,SI,JR,100.000000,100.0,47:26:10,82.022753,100.0,00:00:00,78.936816,...,00:00:00,10.000000,10,12:08:58,9.169447,10,00:00:00,10.000000,10,00:00:00
531,A14088257,SI,SO,100.000000,100.0,00:00:00,87.498073,100.0,00:00:00,72.076801,...,00:00:00,10.000000,10,00:00:00,10.000000,10,00:00:00,10.000000,10,00:00:00
532,A14847419,WA,JR,88.656641,100.0,00:00:00,90.326041,100.0,00:00:00,66.273252,...,00:00:00,9.878661,10,00:00:00,8.878946,10,00:00:00,10.000000,10,00:00:00
533,A14513929,TH,SR,83.799719,100.0,00:00:00,85.636947,100.0,00:00:00,63.965217,...,00:00:00,7.759434,10,00:00:00,8.655478,10,419:06:41,8.102277,10,00:00:00


In [37]:
def points_helper(grades, col):
    
    col_all = grades[list(filter(lambda x: col in x and '-' not in x, grades.columns))]
    col_max = grades[list(filter(lambda x: col in x and 'Max' in x, grades.columns))]
    col_grades = pd.DataFrame(col_all.values / col_max.values)
    col_avg_grades = col_grades.mean(axis=1)#.reset_index()
    
    return col_avg_grades

In [38]:
def total_points(grades):
    """
    total_points takes in grades and returns the final
    course grades according to the syllabus. Course grades
    should be proportions between zero and one.

    :Example:
    >>> fp = os.path.join('data', 'grades.csv')
    >>> grades = pd.read_csv(fp)
    >>> out = total_points(grades)
    >>> np.all((0 <= out) & (out <= 1))
    True
    >>> 0.7 < out.mean() < 0.9
    True
    """
    grades = grades.fillna(0)
    
    lab_grades = lab_total(process_labs(grades))
    lab_perc = .2 * lab_grades
    ##print(lab_perc.shape)
    
    project_grades = projects_total(grades)
    project_perc = .3 * project_grades
    ##print(project_perc.shape)
    
    cp_avg_grades = points_helper(grades, 'checkpoint')
    checkpoint_perc = cp_avg_grades * .025
    ##print(checkpoint_perc.shape)
    
    disc_avg_grades = points_helper(grades, 'discussion')
    discussion_perc = disc_avg_grades * .025
    ##print(discussion_perc.shape)
    
    mexam_avg_grades = points_helper(grades, 'Midterm')
    mexam_perc = mexam_avg_grades * .15
    ##print(mexam_perc.shape)
    
    fexam_avg_grades = points_helper(grades, 'Final')
    fexam_perc = fexam_avg_grades * .3
    ##print(fexam_perc.shape)
    
    finalgrades = lab_perc + project_perc + checkpoint_perc + discussion_perc + mexam_perc + fexam_perc
    #print(finalgrades.shape)
    
    return finalgrades

In [39]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = total_points(grades)
out.mean()

0.8243406154328311

In [40]:
np.all((0 <= out) & (out <= 1))

True

In [41]:
0.7 < out.mean() < 0.9

True

In [42]:
def letter_helper(value):
    grade_dict = {90: "A",80: "B",70: "C",60: "D",0: "F"}
    for key, letter in grade_dict.items():
        if value >= key:
            return letter

In [43]:
def final_grades(total):
    total = (total * 100).astype(int)
    letter_grades = total.map(letter_helper)
    
    return letter_grades

In [44]:
out = final_grades(pd.Series([0.92, 0.81, 0.41]))
out

0    A
1    B
2    F
dtype: object

In [45]:
np.all(out == ['A', 'B', 'F'])

True

In [90]:
def letter_proportions(grades):
    """
    letter_proportions takes in the dataframe grades 
    and outputs a Series that contains the proportion
    of the class that received each grade.

    :Example:
    >>> fp = os.path.join('data', 'grades.csv')
    >>> grades = pd.read_csv(fp)
    >>> out = letter_proportions(grades)
    >>> np.all(out.index == ['B', 'C', 'A', 'D', 'F'])
    True
    >>> out.sum() == 1.0
    True
    """
    num_grades = total_points(grades)
    let_grades = final_grades(num_grades)
    #print(num_grades.shape[0])
    #print(let_grades.value_counts().values.sum())
    return pd.Series(let_grades.value_counts().values / num_grades.shape[0], index=let_grades.value_counts().index)

In [91]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = letter_proportions(grades)
out

B    0.506542
C    0.224299
A    0.207477
D    0.031776
F    0.029907
dtype: float64

In [92]:
np.all(out.index == ['B', 'C', 'A', 'D', 'F'])

True

In [93]:
out.sum() == 1.0

True

### Do Seniors get worse grades?

**Question 8**

You notice that students who are seniors on average did worse in the class (if you can't verify this, you should go back and check your work!). Is this difference significant, or just due to noise?

Perform a hypothesis test, assessing the likelihood of the above statement under the null hypothesis: 
> "seniors earn grades that are roughly equal on average to the rest of the class."


Create a function `simulate_pval` which takes in `grades` and the number of simulations `N` and returns the likelihood that the grade of seniors was worse than the average of the class as a whole under the null hypothesis (i.e. calculate the p-value).

*Note:* To check your work, plot the sampling distribution and the observation. Do these values look reasonable?
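
One way such a simulation could be set up is sketched below, on synthetic data. It assumes the null hypothesis is modeled by drawing a random group the size of the senior class from all students, without replacement, and comparing its mean with the observed senior mean. `simulate_pval_sketch` and its arguments are hypothetical and differ from the required `simulate_pval(grades, N)` signature.

```python
import numpy as np
import pandas as pd

def simulate_pval_sketch(totals, levels, N, seed=0):
    """Estimate P(mean of a random group <= observed senior mean) under the null."""
    rng = np.random.default_rng(seed)
    is_senior = (levels == 'SR').to_numpy()
    values = totals.to_numpy()
    n_seniors = is_senior.sum()
    observed = values[is_senior].mean()

    # Under the null, seniors look like a random draw of the same size from the class.
    sims = np.array([
        rng.choice(values, size=n_seniors, replace=False).mean()
        for _ in range(N)
    ])
    return (sims <= observed).mean()

# Synthetic data for illustration: seniors are shifted lower on purpose.
data_rng = np.random.default_rng(1)
levels = pd.Series(['SR'] * 200 + ['JR'] * 300)
totals = pd.Series(np.concatenate([data_rng.normal(0.78, 0.08, 200),
                                   data_rng.normal(0.84, 0.08, 300)]))
pval = simulate_pval_sketch(totals, levels, N=500)
```

Plotting a histogram of the simulated means with a vertical line at the observed senior mean gives the visual check described in the note above.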

In [50]:
grades = pd.read_csv(grades_fp)
total = total_points(grades)
grades['Letter'] = final_grades(total)
grades['Final Grade'] = total
clean_gradebook = grades[['PID', 'Level', 'Letter', 'Final Grade']]
grouped_means = clean_gradebook.groupby('Level').mean().reset_index()
grouped_means
#sr_mean = grouped_means.loc[grouped_means.Level == 'SR'].values[0][1]
#sr_mean

Unnamed: 0,Level,Final Grade
0,JR,0.836108
1,SO,0.857176
2,SR,0.801045


In [51]:
# total = total_points(grades)
# grades['Letter'] = final_grades(total)
# grades['Final Grade'] = total
# clean_gradebook = grades[['PID', 'Level', 'Letter', 'Final Grade']]
# grouped_means = clean_gradebook.groupby('Level').mean().reset_index()
# observed = grouped_means.loc[grouped_means.Level == 'SR'].values[0][1]

In [52]:
total_points(grades).mean()

0.7038938873089399

In [53]:
grades.loc[grades.Level == "SR"]#.shape[0]

Unnamed: 0,PID,College,Level,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),project01,...,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S),Letter,Final Grade
2,A14164800,SI,SR,86.513369,100.0,00:00:00,47.802820,100.0,00:00:00,46.122801,...,10,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00,C,0.759665
5,A14282114,RE,SR,92.821876,100.0,00:00:00,100.000000,100.0,00:00:00,67.807382,...,10,00:00:00,10.000000,10,00:00:00,8.416273,10,00:00:00,A,0.914075
6,A14297403,MU,SR,98.120326,100.0,00:00:00,46.650293,100.0,00:00:00,56.472871,...,10,00:00:00,4.900131,10,46:35:38,7.246751,10,00:00:00,B,0.820217
8,A14137484,FI,SR,91.534841,100.0,00:00:00,100.000000,100.0,00:00:00,69.068823,...,10,00:00:00,9.059151,10,42:06:36,9.984830,10,00:00:00,B,0.887615
10,A14600543,RE,SR,97.559463,100.0,00:00:00,100.000000,100.0,00:00:00,77.493905,...,10,987:54:39,9.264759,10,00:00:00,10.000000,10,00:00:00,A,0.907341
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
527,A14282114,FI,SR,84.768706,100.0,00:00:00,98.689484,100.0,00:00:00,76.550346,...,10,00:00:00,9.786083,10,00:00:00,10.000000,10,00:00:00,A,0.924285
528,A14540066,SI,SR,84.912851,100.0,00:00:00,82.803516,100.0,00:00:00,77.385300,...,10,00:00:00,10.000000,10,00:00:00,10.000000,10,00:00:00,B,0.831665
529,A14297403,MU,SR,80.805130,100.0,00:00:00,38.413829,100.0,00:00:00,48.937602,...,10,00:00:00,7.952751,10,00:00:00,8.524640,10,00:00:00,C,0.782373
533,A14513929,TH,SR,83.799719,100.0,00:00:00,85.636947,100.0,00:00:00,63.965217,...,10,00:00:00,8.655478,10,419:06:41,8.102277,10,00:00:00,B,0.866322


In [54]:
class_grades = total_points(grades)
obs = class_grades.mean()
#obs
sr_grades = total_points(grades.loc[grades.Level == "SR"])
sr_grades
#prob = (sr_grades > obs).shape[0] / sr_grades.shape[0]
#prob
grades[grades['Level'] == 'SR'].shape[0]

215

In [55]:
total_points(grades[grades['Level'] == 'SR'])#.mean()

0           NaN
1           NaN
2      0.636448
3           NaN
4           NaN
         ...   
527         NaN
528         NaN
529         NaN
533         NaN
534         NaN
Length: 336, dtype: float64

In [99]:
def simulate_pval(grades, N):
    """
    simulate_pval takes in the number of
    simulations N and grades and returns
    the likelihood that the grade of seniors
    was worse than the class under null hypothesis conditions
    (i.e. calculate the p-value).

    :Example:
    >>> fp = os.path.join('data', 'grades.csv')
    >>> grades = pd.read_csv(fp)
    >>> out = simulate_pval(grades, 100)
    >>> 0 <= out <= 0.1
    True
    """
    
    # Attach the final grades to a copy so the caller's dataframe is not modified.
    cleaned = grades[['Level']].assign(Grade=total_points(grades))

    # Observed statistic: mean final grade among seniors.
    is_senior = cleaned['Level'] == 'SR'
    observed = cleaned.loc[is_senior, 'Grade'].mean()

    # Under the null, seniors are a random draw from the whole class:
    # sample N senior-sized groups from the empirical grade distribution.
    cat_distr = cleaned['Grade'].value_counts(normalize=True)
    samples = np.random.choice(cat_distr.index, p=cat_distr, size=(N, int(is_senior.sum())))

    # p-value: share of simulated means at least as low as the observed mean.
    averages = samples.mean(axis=1)
    pval = np.count_nonzero(averages <= observed) / N
    return pval

In [57]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = simulate_pval(grades, 10000)
out

0.0068

In [58]:
0 <= out <= 0.1

True

### What is the true distribution of grades?

The gradebook for this class only reflects one particular instance of each student's performance, subject to the effects of all the little events and hiccups that occurred throughout the quarter. Might you have done better on the midterm had your roommate kept you up all night with their coughing? Wasn't it lucky that the example you were studying just before the final happened to appear on the exam?

**Question 9**

This question will simulate these '(un)lucky, random events' by adding or subtracting random amounts to each assignment before calculating the final grades. These 'random amounts' will be drawn from a Gaussian distribution with mean 0 and standard deviation 0.02:
```
np.random.normal(0, 0.02, size=(num_rows, num_cols))
```
Intuitively, such a model says that random events may bump up or down a given grade (given as a proportion):
- which on average has no effect on the class as a whole (mean 0),
- which not uncommonly might perturb a grade by 2% (std dev 0.02).
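
A minimal sketch of the two properties listed above, assuming a made-up gradebook shape of 500 students by 20 assignments. It uses a seeded `np.random.default_rng` generator only so the result is reproducible; the project itself calls `np.random.normal`.

```python
import numpy as np

rng = np.random.default_rng(42)
num_rows, num_cols = 500, 20   # made-up gradebook shape (assumption)

# One noise value per (student, assignment) pair.
noise = rng.normal(0, 0.02, size=(num_rows, num_cols))

# Averaged over the whole class, the noise is close to 0.
mean_noise = noise.mean()

# Share of draws larger than one standard deviation (2%) in either direction.
# For a Gaussian this is roughly 0.32, so such bumps are not rare.
share_beyond_2pct = np.mean(np.abs(noise) > 0.02)
```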

Create a function `total_points_with_noise` that takes in a dataframe like `grades`, adds noise to the assignments as described above, and returns the final scores using *the same procedure* as questions 1-7.

*Note:* You should be able to reuse (or slightly modify) the code from previous problems. Try to be DRY (don't repeat yourself)!

*Note 1:* Once adding the noise to the assignment scores, use the `np.clip` function to be sure each assignment retains a score between 0% and 100%.

*Note 2:* To check your work -- what would you expect the difference between the actual scores and noisy scores to be, on average?
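
One way to reason about the two notes above is a small experiment on made-up data. This is only a sketch: the scores are synthetic proportions, and each student's total is an equal-weight average, which is an assumption and not the weighting used in the project.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up assignment scores as proportions, kept away from 0 and 1
# so that clipping almost never triggers.
props = rng.uniform(0.3, 0.9, size=(300, 10))

# Add the noise, then clip so every score stays between 0% and 100%.
noisy = np.clip(props + rng.normal(0, 0.02, size=props.shape), 0, 1)

# Equal-weight average per student (assumption: the real weights differ).
diff = noisy.mean(axis=1) - props.mean(axis=1)
avg_diff = diff.mean()   # expected to be very close to 0
```

When many scores sit exactly at 0 or 1, clipping can pull the average difference slightly away from zero, because a perfect score can only move down and a zero can only move up.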

In [60]:
def noisy_helper_dcp(grades, col):
    names = get_assignment_names(grades)[str(col)]
    col_all = grades[names]
    col_list = list(filter(lambda x: col in x and 'Max' in x, grades.columns))
    col_max = grades[col_list]
    grade_dict = {}
    
    # Normalize each assignment by its max points, add Gaussian noise, then clip to [0, 1].
    for name, max_name in zip(names, col_list):
        processed = col_all[name] / col_max[max_name]
        processed = processed + np.random.normal(0, 0.02, size=processed.shape[0])
        grade_dict[name] = np.clip(processed, 0, 1)
    points_df = pd.DataFrame(grade_dict)
    points_df = points_df.fillna(0)
    averages = points_df.mean(axis=1)
    
    return averages

In [61]:
def noisy_helper_exam(grades, exam):
    capitalized_exam = exam.capitalize()
    scores = grades[capitalized_exam]
    maxes = list(filter(lambda x: capitalized_exam in x and 'Max' in x, grades.columns))
    max_points = grades[maxes[0]]
    noisy_grades = pd.DataFrame(scores / max_points)
    noisy_grades += np.random.normal(0, 0.02, size=(noisy_grades.shape[0], noisy_grades.shape[1]))
    noisy_grades = np.clip(noisy_grades.iloc[:,0],0,1)
  
    return noisy_grades

In [71]:
def total_points_with_noise(grades):
    """
    total_points_with_noise takes in a dataframe like grades, 
    adds noise to the assignments as described in notebook, and returns
    the total scores of each student calculated with noisy grades.

    :Example:
    >>> fp = os.path.join('data', 'grades.csv')
    >>> grades = pd.read_csv(fp)
    >>> out = total_points_with_noise(grades)
    >>> np.all((0 <= out) & (out <= 1))
    True
    >>> 0.7 < out.mean() < 0.9
    True
    """
    grades = grades.fillna(0)  # fill on a copy so the caller's dataframe is not mutated
    
    proc_labs = process_labs(grades)
    proc_labs += np.random.normal(0, 0.02, size=(proc_labs.shape[0], proc_labs.shape[1]))
    noisy_lab_grades = lab_total(np.clip(proc_labs,0,1)) * 20
    
    total_percs = []
    assignments = get_assignment_names(grades)
    project_count = len(assignments['project'])
    proj = grades[list(filter(lambda x: 'project' in x and '-' not in x and 'checkpoint' not in x, grades.columns))]
    proj = proj.fillna(0)
    proj_max = grades[list(filter(lambda x: 'Max' in x and 'project' in x and 'checkpoint' not in x, grades.columns))]    
    for num in range(1, project_count+1):
        final_proj_scores = proj[list(filter(lambda x: str(num) in x, proj.columns))].sum(axis=1)
        max_proj_points = proj_max[list(filter(lambda x: str(num) in x, proj_max.columns))].sum(axis=1)
        final_proj_grades = pd.DataFrame(final_proj_scores / max_proj_points)
        
        final_proj_grades += np.random.normal(0, 0.02, size=(final_proj_grades.shape[0], final_proj_grades.shape[1]))
        noisy_proj_grades = np.clip(final_proj_grades.iloc[:,0],0,1) * 30
        total_percs.append(noisy_proj_grades)
    noisy_proj_grades = sum(total_percs) / project_count 
    
    noisy_cp_grades = noisy_helper_dcp(grades, 'checkpoint')
    
    noisy_disc_grades = noisy_helper_dcp(grades, 'disc')
    
    noisy_mexam_grades = noisy_helper_exam(grades, 'midterm') * 15
    
    noisy_fexam_grades = noisy_helper_exam(grades, 'final') * 30
    
    total = noisy_lab_grades + noisy_proj_grades + noisy_disc_grades + noisy_cp_grades + noisy_mexam_grades + noisy_fexam_grades
    
    return total / 100

In [72]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = total_points_with_noise(grades)
out
#print(out.to_string())

0      0.867007
1      0.809533
2      0.741164
3      0.879290
4      0.652831
         ...   
530    0.830078
531    0.742437
532    0.839581
533    0.840257
534    0.872296
Length: 535, dtype: float64

In [73]:
np.all((0 <= out) & (out <= 1))

True

In [74]:
0.7 < out.mean() < 0.9

True

In [75]:
out.mean()

0.801955311186975

### Short-answer questions (hard-coded)

Use your functions from above to understand the data and answer the following questions. The function below should return **hard-coded values**. It should not compute anything!

**Question 10**

Create a function `short_answer` of zero variables that returns (hard-coded) answers to the following questions in a list:
0. For the class on average, what is the difference between students' scores (`total_points`) and their scores with noise (`total_points_with_noise`)? (Remark: plot the distribution of differences; does this align with what you know about binomial distributions?)
1. What percentage of the class sees their grade change by less than $\pm 0.01$ (that is, strictly less than 0.01 in absolute value)?
2. What is the 95% confidence interval for the statistic above? (see [DSC10](https://www.inferentialthinking.com/chapters/13/3/Confidence_Intervals.html) and use `np.percentile`)
3. What proportion of the class sees a change in their letter grade?
4. The assumption behind the model in Question 9 is that:
    - The (observed) gradebook well represents the true population of students,
    - The noisy scores do not represent other possible observations drawn from the true population of students.
    - Answer `True` or `False` in a list like `[True, True]`
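
A percentile-based interval like the one asked for in item 2 could be sketched as follows. The data is made up, each student's total is an equal-weight average (an assumption), and `pct_small_change` is a hypothetical helper standing in for a comparison of `total_points` and `total_points_with_noise`.

```python
import numpy as np

rng = np.random.default_rng(7)
props = rng.uniform(0.3, 0.9, size=(300, 10))   # made-up score proportions
base = props.mean(axis=1)                       # equal weights (assumption)

def pct_small_change(rng):
    """Percentage of students whose noisy total moves by less than 0.01."""
    noisy = np.clip(props + rng.normal(0, 0.02, size=props.shape), 0, 1)
    return 100 * np.mean(np.abs(noisy.mean(axis=1) - base) < 0.01)

# Repeat the noisy simulation many times to get a distribution of the statistic.
stats = np.array([pct_small_change(rng) for _ in range(500)])

# The middle 95% of the simulated statistics.
lower, upper = np.percentile(stats, [2.5, 97.5])
```

Taking the 2.5th and 97.5th percentiles leaves 2.5% of the simulated values in each tail, so `[lower, upper]` covers the central 95% of outcomes.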

In [100]:
(total_points_with_noise(grades) - total_points(grades)).mean()

-0.021568117298272856

In [101]:
changes = abs(total_points_with_noise(grades) - total_points(grades)) < 0.01
changes.value_counts().values[1] / changes.value_counts().sum()

0.08598130841121496

In [103]:
def short_answer():
    """
    short_answer returns (hard-coded) answers to the 
    questions listed in the notebook. The answers should be
    given in a list with the same order as questions.

    :Example:
    >>> out = short_answer()
    >>> len(out) == 5
    True
    >>> len(out[2]) == 2
    True
    >>> 50 < out[2][0] < 100
    True
    >>> 0 < out[3] < 1
    True
    >>> isinstance(out[4][0], bool)
    True
    >>> isinstance(out[4][1], bool)
    True
    """

    return [0.007, 84.673, [78.00,86.17], .065, [True,False]]

# Congratulations, you finished the project!

### Before you submit:
* Be sure you run the doctests on all your code in project01.py

### To submit:
* **Upload the .py file to gradescope**