# DSC 80: Project 01

### Checkpoint Due Date: Thursday Oct 10, 11:59:59 PM (Questions 1-4)
### Due Date: Thursday, Oct 17, 11:59:59 PM

---
# Instructions

This Jupyter Notebook contains the problem statements, along with code and markdown cells for displaying your answers.  
* As in the labs, you will develop your code in the accompanying `project01.py` file, which is imported into this notebook. That code will be autograded.
* **For the checkpoint, turn in questions 1-4**

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are **encouraged to write your own additional functions** to solve the questions! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `project01.py` -- however, be sure to upload these to gradescope as well!
- Always document your code!

**Tips for testing the correctness of your answers!**
Once your work is saved in the .py file, import `project01` to try your functions out in the notebook, then inspect and analyze the output there to assess its correctness!
* Run your functions on the main dataset (`grades`) and ask yourself if the output *looks correct.*
* Run your functions on very small datasets (e.g. 1-5 row table), calculate the expected response by hand, and see if the function output matches (this *is* unit-testing your code with data).
* Run your functions on (large and small) samples of the dataset `grades` (with and without replacement). Does your code break, or does it still run as expected?
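
The second tip above can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper `normalize` and a hand-built two-row table; neither is part of the starter code, and only the column naming mirrors `grades`.

```python
import pandas as pd

# Hypothetical 2-row gradebook that follows the column naming of `grades`.
tiny = pd.DataFrame({
    'lab01': [50.0, 100.0],
    'lab01 - Max Points': [100.0, 100.0],
})

def normalize(df, name):
    """Illustrative helper (an assumption): raw score as a proportion of max points."""
    return df[name] / df[name + ' - Max Points']

expected = [0.5, 1.0]  # worked out by hand: 50/100 and 100/100
result = normalize(tiny, 'lab01').tolist()
assert result == expected
```

The same pattern extends to random samples of the real data, for example `grades.sample(5)` or `grades.sample(1000, replace=True)`.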

In [235]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [236]:
import project01 as proj

In [237]:
%matplotlib inline
import pandas as pd
import numpy as np
import datetime
import os


# The Other Side of Gradescope

The file contains the grade-book from a fictional data science course with 535 students. 

**Note: this dataset is synthetically generated; it does not contain real student grades.**

In this project, you will:
1. clean and process the data to compute total course grades according to a fictional syllabus (below),
2. qualitatively understand how students did in the course,
3. understand how student grades vary with small changes in performance on each assignment.

---

The course syllabus is as follows:

* Lab assignments 
    - Each lab is worth the same amount, regardless of its raw point total.
    - The lowest lab is dropped.
    - Each lab may be revised for one week after submission for a 10% penalty, for two weeks after submission for a 20% penalty, and beyond that for a 50% penalty. Such revisions are reflected in the `Lateness` columns in the gradebook.
    - Labs are 20% of the total grade.
* Projects 
    - Each project consists of an autograded portion, and *possibly* a free response portion.
    - The total points for a single project are the sum of the raw scores of the two portions.
    - Each project is worth the same amount, regardless of its raw point total.
    - Projects are 30% of the total grade.
* Checkpoints
    - Project checkpoints are worth 2.5% of the total grade.
* Discussion
    - Discussion notebooks are worth 2.5% of the total grade.
* Exams
    - The midterm is worth 15% of the total grade.
    - The final is worth 30% of the total grade.
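
As a quick check on the weights listed above, here is a small worked example. The component scores for the student are made up for illustration; only the weights come from the syllabus.

```python
# Syllabus weights for each course component (they should sum to 1).
weights = {
    'lab': 0.20, 'project': 0.30, 'checkpoint': 0.025,
    'disc': 0.025, 'midterm': 0.15, 'final': 0.30,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9

# Hypothetical student: each value is that component's score as a proportion.
student = {
    'lab': 0.90, 'project': 0.80, 'checkpoint': 1.00,
    'disc': 1.00, 'midterm': 0.70, 'final': 0.85,
}

# 0.18 + 0.24 + 0.025 + 0.025 + 0.105 + 0.255 = 0.83
total = sum(weights[k] * student[k] for k in weights)
print(round(total, 4))
```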


### A note on generalization

You may assume that your code will only need to work on a gradebook for a class with the syllabus given above. That is, you may assume that the dataframe `grades` looks like the given one in `data/grades.csv`.

However, such a class:
1. may have a different number of labs, projects, discussions, and project checkpoints.
2. may have a different number of students.

You may assume the course components and the naming conventions are as given in the data file.

The dataset was generated by Gradescope; you must attempt to reason about the data as given using what you know as a student who uses Gradescope.

### A note on 'putting everything together'

The goal of this project is to create and assess final grades for a fictional course; the process is broken down into functions for your convenience and guidance. Here are a few remarks and tips for approaching the project:
1. If you are having trouble figuring out what a question is asking you to do, look at the big picture and try to understand what the current step is doing to contribute to this big picture. This may clarify what's being asked!
1. These questions intentionally build on each other, and the final result matters! In fact, you can 'get a question correct' yet receive only partial credit on it because a previous answer was wrong.
    - Partial credit for a question is typically based on *how close* your answer is to correct (with some credit for a solution in the correct form). 
    - You should try to assess your answer to each question based on what you understand of the data. This might involve writing extensive code (that isn't turned in) just to check your work! Suggestions on checking your work are given in the assignment, but you should also think of your own ways of checking your work.
    - As you do this project, think about the data from the perspective of the student (which should be easy to do!)

In [238]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)


### Getting started: enumerating the assignments

First, you will list all the 'assignment names' and the part of the syllabus to which each belongs.

**Question 1:**

Create a function `get_assignment_names` that takes in a dataframe like `grades` and returns a dictionary with the following structure:
- The keys are the general areas of the syllabus: `lab, project, midterm, final, disc, checkpoint`
- The values are lists that contain the assignment names of that type. For example, the lab assignments all have names of the form `labXX`, where `XX` is a zero-padded two-digit number. See the doctests for more details.
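
One possible way to build such a dictionary is to match whole column names against a pattern per syllabus area. The sketch below is only an illustration: the function name `classify_columns` and the patterns are assumptions based on the column names visible in `grades` (for example `project03_checkpoint01` and `discussion08`), not the graded solution.

```python
import re

def classify_columns(columns):
    """Sketch: group column names by syllabus area using full-name regex matches."""
    patterns = {
        'lab': r'lab\d{2}',
        'project': r'project\d{2}',
        'midterm': r'Midterm',
        'final': r'Final',
        'disc': r'discussion\d{2}',
        'checkpoint': r'project\d{2}_checkpoint\d{2}',
    }
    # fullmatch rejects metadata columns such as 'lab01 - Max Points'.
    return {key: [c for c in columns if re.fullmatch(pat, c)]
            for key, pat in patterns.items()}

cols = ['PID', 'lab01', 'lab01 - Max Points', 'project01',
        'project01_free_response', 'project03_checkpoint01',
        'Midterm', 'Midterm - Max Points', 'Final', 'discussion01']
names = classify_columns(cols)
```

Matching the full name avoids relying on string lengths, so the same logic keeps working if the class has more labs or projects.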

In [392]:
def get_assignment_names(grades):
    lab = []
    project = []
    midterm = []
    final = []
    disc = []
    checkpoint = []
    for col in grades.columns:
        if len(col) <= 12:
            if 'lab' in col:
                lab.append(col)
            if 'project' in col:
                project.append(col)
            if 'Midterm' in col:
                midterm.append(col)
            if 'Final' in col:
                final.append(col)
            if 'disc' in col:
                disc.append(col)
        if 'checkpoint' in col:
            if not '-' in col:
                checkpoint.append(col)
    dic = {'lab': lab, 'project': project, 'midterm': midterm, 'final': final,
          'disc': disc, 'checkpoint': checkpoint}
    return dic

In [240]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)
names = get_assignment_names(grades)
set(names.keys()) == {'lab', 'project', 'midterm', 'final', 'disc', 'checkpoint'}

names['final'] == ['Final']
'project02' in names['project']
new = []
for col in grades.columns:
    if "Midterm" in col:
        new.append(col)
grades[new]

Unnamed: 0,Midterm,Midterm - Max Points,Midterm - Lateness (H:M:S)
0,47.0,47.0,00:00:00
1,44.0,47.0,00:00:00
2,37.0,47.0,00:00:00
3,44.0,47.0,00:00:00
4,18.0,47.0,00:00:00
5,44.0,47.0,00:00:00
6,35.0,47.0,00:00:00
7,43.0,47.0,00:00:00
8,32.0,47.0,00:00:00
9,45.0,47.0,00:00:00


### Computing project grades

**Question 2**

Compute the total score for the project portion of the course according to the syllabus. Create a function `projects_total` that takes in `grades` and computes the total project grade for the quarter according to the syllabus. The output Series should contain values between 0 and 1.

*Note*: Don't forget to properly handle students who didn't turn in assignments! (Use your experience and common sense).

*Note:* To check your work, try (1) calculating the score for a few types of students by hand, and (2) calculating summary statistics for class performance on each individual project to make sure they look reasonable.

In [241]:
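def projects_total(grades):
    grades = grades.fillna(0.0)
    # Keep 'projectXX' only: skip checkpoints, free-response parts, and metadata columns.
    projects = [c for c in grades.columns
                if c.startswith('project') and '_' not in c and ' - ' not in c]
    scores = []
    for p in projects:
        earned, possible = grades[p], grades[p + ' - Max Points']
        fr = p + '_free_response'
        if fr in grades.columns:  # add the free-response portion when it exists
            earned = earned + grades[fr]
            possible = possible + grades[fr + ' - Max Points']
        scores.append(earned / possible)
    return sum(scores) / len(scores)  # every project counts equally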

In [242]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)
out = projects_total(grades)
np.all((0 <= out) & (out <= 1))
0.7 < out.mean() < 0.9
grades

Unnamed: 0,PID,College,Level,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),project01,...,discussion07 - Lateness (H:M:S),discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S)
0,A14721419,SI,JR,99.0,100.0,00:00:00,86.0,100.0,00:00:00,75.0,...,00:00:00,10.0,10,00:00:00,10.0,10,780:01:28,10.0,10,00:00:00
1,A14883274,TH,JR,98.0,100.0,00:00:00,52.0,100.0,00:00:00,53.0,...,669:12:21,7.0,10,00:00:00,7.0,10,00:00:00,8.0,10,00:00:00
2,A14164800,SI,SR,86.0,100.0,00:00:00,45.0,100.0,00:00:00,44.0,...,00:00:00,6.0,10,00:04:51,6.0,10,00:00:00,7.0,10,00:00:00
3,A14847419,TH,JR,100.0,100.0,00:00:00,100.0,100.0,00:00:00,78.0,...,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00
4,A14162943,SI,JR,66.0,100.0,00:00:00,33.0,100.0,00:00:00,42.0,...,00:00:00,5.0,10,00:00:00,5.0,10,00:00:00,6.0,10,00:00:00
5,A14282114,RE,SR,91.0,100.0,00:00:00,100.0,100.0,00:00:00,70.0,...,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00
6,A14297403,MU,SR,96.0,100.0,00:00:00,47.0,100.0,00:00:00,54.0,...,00:00:00,8.0,10,00:00:00,7.0,10,46:35:38,8.0,10,00:00:00
7,A14369624,WA,JR,100.0,100.0,00:00:00,98.0,100.0,00:00:00,81.0,...,04:09:52,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00
8,A14137484,FI,SR,90.0,100.0,00:00:00,100.0,100.0,00:00:00,66.0,...,00:00:00,9.0,10,00:00:00,9.0,10,42:06:36,9.0,10,00:00:00
9,A14353945,WA,JR,100.0,100.0,00:00:00,100.0,100.0,00:00:00,79.0,...,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00


### Computing lab grades

Now, you will clean and process the lab grades, which is a little more complicated. To do this, you will develop functions that:
- 'normalize' the grades, 
- adjust for late submissions, 
- drop the lowest lab grade, and 
- create a total lab score for each student.

**Question 3**

Unfortunately, Gradescope sometimes experiences a delay in registering when an assignment is submitted during "periods of heavy usage" (i.e. near a submission deadline). You need to assess when a student's assignment was actually turned in on time, even if Gradescope did not process it in time. To do this, it is helpful to know:
* Every late submission has to be submitted by a TA (late submissions are turned off).
* TAs never submitted a late assignment "just after" the deadline. 
* The deadlines were at midnight and students had to come to staff hours to late-submit their assignment.

Create a function `last_minute_submissions` that takes in the dataframe `grades` and outputs the number of submissions on each assignment that were turned in on time by the student, yet marked 'late' by Gradescope. See the doctest for more details.

*Note:* You have to figure out what truly is a late submission by looking at the data and understanding the facts about the data generating process above. There is some ambiguity in finding which submissions are truly late; you will *make a best guess for a threshold* by looking at this dataset. This question is about 'cleaning' a messy 'data recording process'.
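
One way to look for a threshold is to convert the lateness strings to hours and inspect the small nonzero values. The sketch below uses a handful of illustrative lateness strings, and the 5-hour cutoff is only a guess; the helper name `lateness_hours` is an assumption and not part of the starter code.

```python
import pandas as pd

def lateness_hours(col):
    """Sketch: convert 'H:M:S' strings (hours may exceed 24) to hours as floats."""
    parts = col.str.split(':', expand=True).astype(int)
    return parts[0] + parts[1] / 60 + parts[2] / 3600

# Illustrative values in the gradebook's lateness format.
late = pd.Series(['00:00:00', '00:04:51', '04:09:52', '46:35:38', '780:01:28'])
hours = lateness_hours(late)

threshold_hours = 5  # assumed cutoff; choose yours by inspecting the real data
# Marked late by Gradescope, yet plausibly submitted on time.
suspect = (hours > 0) & (hours < threshold_hours)
print(suspect.sum())
```

On the real data, sorting the nonzero lateness values or plotting a histogram of `hours` may reveal a gap between "processing lag" and genuinely late submissions.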

In [243]:
def to_sec(td):
    h, m, s = td.split(':')
    sec = datetime.timedelta(hours = int(h), minutes = int(m), seconds = int(s)).total_seconds()
    return int(sec)

In [244]:
"lab" in grades.columns

False

In [245]:
threshold = 5 * 3600
lab_df = grades[['lab0%d - Lateness (H:M:S)' % d for d in range(1, 10)]]
lab = grades[['lab0%d' % d for d in range(1,10)]]
lab

Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09
0,99.0,86.0,90.0,98.0,70.0,83.0,97.0,88.0,43.0
1,98.0,52.0,73.0,77.0,70.0,85.0,89.0,94.0,43.0
2,86.0,45.0,40.0,73.0,63.0,73.0,72.0,71.0,38.0
3,100.0,100.0,92.0,91.0,62.0,57.0,100.0,95.0,39.0
4,66.0,33.0,69.0,81.0,45.0,63.0,60.0,36.0,50.0
5,91.0,100.0,100.0,97.0,70.0,78.0,91.0,100.0,43.0
6,96.0,47.0,74.0,84.0,61.0,64.0,49.0,64.0,27.0
7,100.0,98.0,97.0,100.0,68.0,81.0,91.0,92.0,49.0
8,90.0,100.0,99.0,97.0,70.0,81.0,86.0,62.0,50.0
9,100.0,100.0,82.0,89.0,63.0,78.0,92.0,50.0,37.0


In [246]:
def time_conversion1(time):
    hour, minute, second = time.split(':')
    total_sec = datetime.timedelta(hours = int(hour),
    minutes = int(minute), seconds = int(second)).total_seconds()
    return int(total_sec)






def last_minute_submissions1(grades):
    threshold = 5 * 3600  # 5 hours, in seconds
    latelabs = []
    # Copy so the assignment below does not write into a slice of `grades`.
    late = grades.filter(regex = r'^lab.{2} - Lateness').copy()
    for lab in late.columns:
        late[lab] = grades[lab].apply(time_conversion1)
        # Count submissions marked late (> 0 s) but still under the threshold.
        latelabs.append(((late[lab] > 0) & (late[lab] < threshold)).sum())
    lab_as_index = grades.filter(regex = r'^lab.{2}$').columns.tolist()
    final_series = pd.Series(latelabs)
    final_series.index = lab_as_index
    return final_series
out = last_minute_submissions1(grades)
isinstance(out, pd.Series)
np.all(out.index == ['lab0%d' % d for d in range(1,10)])
(out > 0).sum()

8

In [247]:
threshold = 5 * 3600
latelabs = []
# Copy so the assignment below does not write into a slice of `grades`.
late = grades.filter(regex = r'^lab.{2} - Lateness').copy()
for lab in late.columns:
    late[lab] = grades[lab].apply(time_conversion1)
    latelabs.append((sum(late[lab] == 0) - sum(late[lab] < threshold)) * -1)
lab_as_index = grades.filter(regex = r'^lab.{2}$').columns.tolist()
final_series = pd.Series(latelabs)
final_series


0     2
1     0
2     2
3    12
4     7
5     8
6    16
7    11
8    26
dtype: int64

In [430]:
def last_minute_submissions(grades):
    threshold = 5 * 3600
    lab_df = grades[['lab0%d - Lateness (H:M:S)' % d for d in range(1, 10)]]
    lab_df = lab_df.transpose()
    for col in lab_df.columns:
        lab_df[col] = lab_df[col].apply(to_sec)
        lab_df[col] = (lab_df[col] < threshold) & (lab_df[col] != 0)
    lab_df = lab_df.transpose().sum().tolist()
    return pd.Series(lab_df, index = ['lab0%d' % d for d in range(1, 10)])


In [432]:
out = last_minute_submissions(grades)
isinstance(out, pd.Series)
np.all(out.index == ['lab0%d' % d for d in range(1,10)])


True

In [250]:
def to_weeks(td):
    return td / pd.Timedelta('7 days') 
def penalty(i):
    if 0 < i < 1:
        return 0.9
    elif 1 <= i < 2:
        return 0.8
    elif i >= 2:  # include exactly 2 so it cannot fall through to the no-penalty branch
        return 0.5
    else:
        return 1.0
fix = lambda x : 0 if x<=18000 else x
fix(18000)

0

In [251]:

def converter(row):
    # Draft helper: penalty-weighted sum of one student's raw lab scores.
    total = 0
    row = row.fillna(0.0)
    for col in ['lab0%d' % d for d in range(1,10)]:
        # Convert this lab's 'H:M:S' lateness into (fractional) weeks late.
        weeks_late = to_sec(row[col + ' - Lateness (H:M:S)']) / (7 * 24 * 3600)
        total += row[col] * penalty(weeks_late)
    return total

**Question 4**

Now you need to adjust the lab grades for late submissions -- however, you need to take into account your investigation in the previous question, since students shouldn't be penalized by a bug in Gradescope!

Create a function `lateness_penalty` that takes in a 'Lateness' column and returns a column of penalties (represented by the values `1.0,0.9,0.8,0.5` according to the syllabus). Only *truly* late submissions should be counted as late.
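
The mapping from lateness to penalty can be sketched with `np.select`. This is an illustration under stated assumptions, not the graded solution: the name `penalty_sketch` is hypothetical, the 5-hour grace threshold is a guess carried over from Question 3, and a submission exactly one (or two) weeks late is assumed to fall in the earlier bracket.

```python
import numpy as np
import pandas as pd

WEEK = 7 * 24 * 3600   # seconds in one week
GRACE = 5 * 3600       # assumed threshold (seconds) for Gradescope processing lag

def penalty_sketch(seconds_late):
    """Sketch: map seconds late to the multipliers 1.0, 0.9, 0.8, 0.5."""
    s = pd.Series(seconds_late, dtype=float)
    weeks = s / WEEK
    # np.select picks the first condition that holds for each element.
    conditions = [s < GRACE, weeks <= 1, weeks <= 2]
    choices = [1.0, 0.9, 0.8]
    return pd.Series(np.select(conditions, choices, default=0.5), index=s.index)

day = 24 * 3600
# on time, 1 hour "late", 3 days late, 10 days late, 30 days late
example = penalty_sketch([0, 3600, 3 * day, 10 * day, 30 * day])
```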

In [252]:
def lateness_penalty(col):
    col = col.apply(to_sec)
    fix = lambda x : 0 if x<18000 else x
    col = col.apply(fix)
    col = -(-col // pd.Timedelta(7 * 24 * 3600))
    col = col.apply(penalty)
    return col

In [253]:
fp = os.path.join('data', 'grades.csv')
col = pd.read_csv(fp)['lab01 - Lateness (H:M:S)']
out = lateness_penalty(col)
print(out)
isinstance(out, pd.Series)
set(out.unique()) <= {1.0, 0.9, 0.8, 0.5}

0      1.0
1      1.0
2      1.0
3      1.0
4      1.0
5      1.0
6      1.0
7      1.0
8      1.0
9      1.0
10     1.0
11     1.0
12     1.0
13     1.0
14     1.0
15     1.0
16     1.0
17     1.0
18     1.0
19     1.0
20     0.8
21     1.0
22     1.0
23     1.0
24     1.0
25     1.0
26     1.0
27     1.0
28     1.0
29     1.0
      ... 
505    1.0
506    0.8
507    1.0
508    1.0
509    1.0
510    1.0
511    1.0
512    1.0
513    1.0
514    0.8
515    1.0
516    1.0
517    1.0
518    1.0
519    1.0
520    1.0
521    1.0
522    1.0
523    1.0
524    1.0
525    1.0
526    1.0
527    1.0
528    1.0
529    1.0
530    0.8
531    1.0
532    1.0
533    1.0
534    1.0
Name: lab01 - Lateness (H:M:S), Length: 535, dtype: float64


True

**Question 5**

Create a function `process_labs` that takes in a dataframe like `grades` and returns a dataframe of processed lab scores. The output should:
* share the same index as `grades`,
* have columns given by the lab assignment names (e.g. `lab01,...lab10`)
* have values representing the lab grades for each assignment, adjusted for Lateness and scaled to a score between 0 and 1.

In [254]:
def process_labs(grades):
    grades = grades.fillna(0.0)
    lab = ['lab0%d' % d for d in range(1,10)]
    lab_late = ['lab0%d - Lateness (H:M:S)' % d for d in range(1,10)]
    for col in lab_late:
        grades[col] = lateness_penalty(grades[col])
    for col in lab:
        grades[col] = grades[col] / (grades[col + ' - Max Points'])
    # Multiply positionally (.values): the score and penalty frames have different column names.
    lab = grades[lab] * grades[lab_late].values
    return lab

In [255]:
grades = grades.fillna(0.0)
# Copy the slices so the column assignments below do not write into views of `grades`.
lab = grades[['lab0%d' % d for d in range(1,10)]].copy()
lab_late = grades[['lab0%d - Lateness (H:M:S)' % d for d in range(1,10)]].copy()
for col in lab_late.columns:
    lab_late[col] = lateness_penalty(lab_late[col])
for col in lab.columns:
    lab[col] = lab[col] / 100
lab = lab * .5
lab


Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09
0,0.495,0.430,0.450,0.490,0.350,0.415,0.485,0.440,0.215
1,0.490,0.260,0.365,0.385,0.350,0.425,0.445,0.470,0.215
2,0.430,0.225,0.200,0.365,0.315,0.365,0.360,0.355,0.190
3,0.500,0.500,0.460,0.455,0.310,0.285,0.500,0.475,0.195
4,0.330,0.165,0.345,0.405,0.225,0.315,0.300,0.180,0.250
5,0.455,0.500,0.500,0.485,0.350,0.390,0.455,0.500,0.215
6,0.480,0.235,0.370,0.420,0.305,0.320,0.245,0.320,0.135
7,0.500,0.490,0.485,0.500,0.340,0.405,0.455,0.460,0.245
8,0.450,0.500,0.495,0.485,0.350,0.405,0.430,0.310,0.250
9,0.500,0.500,0.410,0.445,0.315,0.390,0.460,0.250,0.185


In [256]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = process_labs(grades)
#out.columns.tolist() == ['lab%02d' % x for x in range(1,10)]
#np.all((0.65 <= out.mean()) & (out.mean() <= 0.90))
out

Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09
0,0.990,0.860,0.900,0.980,1.000000,0.976471,0.485,0.88,0.86
1,0.980,0.520,0.730,0.770,1.000000,0.500000,0.890,0.94,0.86
2,0.860,0.450,0.400,0.730,0.900000,0.429412,0.720,0.71,0.76
3,1.000,1.000,0.920,0.910,0.885714,0.670588,1.000,0.95,0.78
4,0.660,0.330,0.690,0.648,0.642857,0.741176,0.600,0.36,1.00
5,0.910,1.000,1.000,0.970,1.000000,0.917647,0.910,1.00,0.86
6,0.960,0.470,0.740,0.672,0.871429,0.752941,0.490,0.64,0.54
7,1.000,0.980,0.970,1.000,0.971429,0.952941,0.910,0.92,0.98
8,0.900,1.000,0.990,0.776,1.000000,0.476471,0.860,0.62,1.00
9,1.000,1.000,0.820,0.712,0.900000,0.458824,0.920,0.50,0.74


**Question 6**

Create a function `lab_total` that takes in dataframe of processed assignments (like the output of Question 5) and computes the total lab grade for each student according to the syllabus (returning a Series). Your answers should be proportions between 0 and 1. For example, if there are only 3 labs, and a student received scores of {80%,90%,100%}, then the total score would be 0.95.

*Note*: Don't forget to properly handle students who didn't turn in assignments! (Use your experience and common sense).
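
The drop-the-lowest rule can be checked on a tiny hand-built table. This is a sketch with made-up scores; it treats a missing lab as a zero, which is one reasonable reading of the note above.

```python
import numpy as np
import pandas as pd

# Two hypothetical students; the second one never turned in lab02.
processed = pd.DataFrame({
    'lab01': [0.8, 1.0],
    'lab02': [0.9, np.nan],
    'lab03': [1.0, 0.5],
})

filled = processed.fillna(0)  # assumption: a missing lab counts as 0
# Drop each student's lowest lab, then average the remaining ones.
total = (filled.sum(axis=1) - filled.min(axis=1)) / (filled.shape[1] - 1)
# Student 0: (0.9 + 1.0) / 2 = 0.95; student 1: (1.0 + 0.5) / 2 = 0.75
```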

In [257]:
def lab_total(processed):
    idxlab = processed.transpose()
    idxlab = idxlab.fillna(0)
    for col in idxlab.columns:
        # Blank out each student's lowest lab so the mean below skips it.
        idxlab.loc[idxlab[col].idxmin(), col] = np.nan
    return idxlab.mean()
lab_total(out)

0      0.930809
1      0.836250
2      0.694926
3      0.930714
4      0.667754
5      0.963456
6      0.708296
7      0.971796
8      0.893250
9      0.824000
10     0.944853
11     0.675000
12     0.878286
13     0.931429
14     0.905000
15     0.841460
16     0.962647
17     0.596607
18     0.866250
19     0.855746
20     0.827832
21     0.882069
22     0.683036
23     0.804464
24     0.772462
25     0.826029
26     0.786588
27     0.802929
28     0.812536
29     0.849107
         ...   
505    0.880500
506    0.873143
507    0.553629
508    0.915000
509    0.972500
510    0.896250
511    0.792450
512    0.878929
513    0.752300
514    0.844643
515    0.865000
516    0.851429
517    0.827265
518    0.904375
519    0.905221
520    0.915500
521    0.956250
522    0.892000
523    0.796618
524    0.831397
525    0.915641
526    0.873687
527    0.988088
528    0.883319
529    0.728450
530    0.871750
531    0.823414
532    0.938571
533    0.845357
534    0.856460
Length: 535, dtype: float64

In [258]:
cols = 'lab01 lab02 lab03'.split()
processed = pd.DataFrame([[0.2, 0.90, 1.0]], index=[0], columns=cols)
np.isclose(lab_total(processed), 0.95).all()

True

In [259]:
discussions = ['discussion0%d' % d for d in range(1,10)]
grades[discussions].transpose().sum() / 100

0      0.90
1      0.65
2      0.58
3      0.90
4      0.52
5      0.82
6      0.69
7      0.90
8      0.80
9      0.90
10     0.89
11     0.58
12     0.90
13     0.88
14     0.75
15     0.90
16     0.90
17     0.71
18     0.90
19     0.81
20     0.89
21     0.74
22     0.80
23     0.90
24     0.88
25     0.88
26     0.82
27     0.81
28     0.76
29     0.62
       ... 
505    0.90
506    0.89
507    0.74
508    0.77
509    0.90
510    0.90
511    0.76
512    0.90
513    0.56
514    0.90
515    0.79
516    0.69
517    0.63
518    0.81
519    0.90
520    0.83
521    0.90
522    0.90
523    0.82
524    0.88
525    0.81
526    0.90
527    0.88
528    0.88
529    0.64
530    0.90
531    0.87
532    0.80
533    0.77
534    0.90
Length: 535, dtype: float64

### Putting it together

**Question 7**

Finally, you need to create the final course grades. To do this, you will add up the total of each course component according to the weights given in the syllabus. 

* Create a function `total_points` that takes in `grades` and returns the final course grades according to the syllabus. Course grades should be proportions between zero and one.
* Create a function `final_grades` that takes in the final course grades as above and returns a Series of letter grades given by the standard cutoffs (`A >= .90`, `.90 > B >= .80`, `.80 > C >= .70`, `.70 > D >= .60`, `.60 > F`). You should not use rounding to determine the letter grades.
* Create a function `letter_proportions` which takes in the dataframe `grades` and outputs a Series that contains the proportion of the class that received each grade. (This question requires you to put everything together).

*Note 1*: Don't repeat yourself when computing the checkpoint and discussion portions of the course.

*Note 2*: Only the lab portion of the course accounts for late assignments; you may assume all assignments in other portions are turned in without penalty.

To check your work, verify the course grade distribution and relevant statistics! Do the work by hand for a few students.
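
The cutoffs above can also be expressed with `pd.cut` using left-closed bins, which makes the boundary behavior explicit without any rounding. The totals below are made up for illustration; this is a sketch of one option, not the required implementation.

```python
import numpy as np
import pandas as pd

totals = pd.Series([0.95, 0.90, 0.899, 0.75, 0.60, 0.59])  # hypothetical course totals

# right=False makes each bin [low, high), so exactly 0.90 lands in 'A'.
letters = pd.cut(
    totals,
    bins=[-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf],
    labels=['F', 'D', 'C', 'B', 'A'],
    right=False,
).astype(str)

# Proportion of the (tiny, made-up) class receiving each letter.
proportions = letters.value_counts(normalize=True)
```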

In [317]:
def project_helper(grades):
    projects = ['project0%d' % d for d in range(1,5)]
    free_r = '_free_response'
    max_p = ' - Max Points'
    for col in projects:
        grades[col + '_total'] = grades[col]
        grades[col + '_max_total'] = grades[col + max_p]
        if col + '_free_response' in grades.columns:
            grades[col + '_total'] += grades[col + free_r]
            grades[col + '_max_total'] += grades[col + free_r + max_p]
        grades[col + '_total'] = np.clip(grades[col + '_total'] / grades[col + '_max_total'], 0, 1)
    proj_grades = grades[['project0%d_total' % d for d in range(1,5)]]
    return proj_grades.transpose().mean()

In [318]:
def total_points(grades):
    grades = grades.fillna(0)
    processed = process_labs(grades)
    lab_of_total = np.clip(lab_total(processed), 0, 1) * .2
    final_of_total = np.clip(grades["Final"] / grades["Final - Max Points"], 0, 1) * .3
    mid_of_total = np.clip(grades["Midterm"] / grades["Midterm - Max Points"], 0, 1) * .15
    discussions = ['discussion0%d' % d for d in range(1,10)]
    disc_max = ['discussion0%d - Max Points' % d for d in range(1,10)]
    disc_of_total = np.clip(grades[discussions].transpose().sum() / grades[disc_max].transpose().sum(), 0, 1) * .025
    grades['check'] = 0
    grades['check_max'] = 0 
    for col in grades.columns:
        if ("checkpoint" in col) and (len(col) <= 23):
            grades['check'] += grades[col]
            grades['check_max'] += grades[col + ' - Max Points']
        
    proj_of_total = np.clip(project_helper(grades), 0, 1) * .3
    check_of_total = np.clip(grades['check']/grades['check_max'], 0, 1) * .025
    total = lab_of_total + final_of_total + mid_of_total + disc_of_total+ proj_of_total+ check_of_total
    return total
def helper(x):
    if x >= .9:
        return 'A'
    elif .8 <= x < .9:
        return 'B'
    elif .7 <= x < .8:
        return 'C'
    elif .6 <= x < .7:
        return 'D'
    else:
        return 'F'
def final_grades(total):
    return total.apply(helper)
def letter_proportions(grades):
    tot = total_points(grades)
    fin = final_grades(tot)
    return fin.value_counts(normalize = True)

In [319]:
grades = pd.read_csv(fp)
out = total_points(grades)
np.all((0 <= out) & (out <= 1))
0.7 < out.mean() < 0.9
out2 = final_grades(out)
out3 = letter_proportions(grades)
out3

B    0.530841
C    0.248598
A    0.132710
D    0.054206
F    0.033645
dtype: float64

In [402]:
proj.total_points(grades)
proj.letter_proportions(grades)
proj.total_points_with_noise(grades)
proj.process_labs(grades)
proj.lab_total(proj.process_labs(grades))


0      0.956312
1      0.856699
2      0.695359
3      0.926214
4      0.668180
5      0.950932
6      0.703947
7      0.967704
8      0.919280
9      0.848801
10     0.958576
11     0.658640
12     0.867870
13     0.905008
14     0.910906
15     0.828041
16     0.950365
17     0.607826
18     0.866195
19     0.868795
20     0.836937
21     0.887531
22     0.708281
23     0.813077
24     0.778202
25     0.819979
26     0.790853
27     0.798408
28     0.817677
29     0.857375
         ...   
505    0.888178
506    0.873459
507    0.545475
508    0.913635
509    0.989324
510    0.864322
511    0.819546
512    0.894313
513    0.768683
514    0.844654
515    0.854106
516    0.841709
517    0.841668
518    0.886379
519    0.932447
520    0.926740
521    0.920187
522    0.890726
523    0.810493
524    0.845562
525    0.905366
526    0.878464
527    1.005599
528    0.875781
529    0.732783
530    0.888214
531    0.799942
532    0.938999
533    0.820437
534    0.844650
Length: 535, dtype: float64

In [263]:
grades = pd.read_csv(fp)
grades.head(5).transpose().tail(53)

Unnamed: 0,0,1,2,3,4
lab08 - Lateness (H:M:S),00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
lab09,43,43,38,39,50
lab09 - Max Points,50,50,50,50,50
lab09 - Lateness (H:M:S),00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
project03_checkpoint01,0,0,6,0,0
project03_checkpoint01 - Max Points,10,10,10,10,10
project03_checkpoint01 - Lateness (H:M:S),00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
project03,86,88,75,94,90
project03 - Max Points,100,100,100,100,100
project03 - Lateness (H:M:S),00:00:00,00:00:00,00:00:00,00:00:00,00:00:00


### Do Sophomores get better grades?

**Question 8**

You notice that students who are sophomores on average did better in the class (if you can't verify this, you should go back and check your work!). Is this difference significant, or just due to noise?

Perform a hypothesis test, assessing likelihood of the null hypothesis: 
> "sophomores earn grades that are roughly equal on average to the rest of the class."


Create a function `simulate_pval` which takes in the number of simulations `N` and `grades` and returns the likelihood that the grade of sophomores was no better on average than the class as a whole (i.e. calculate the p-value).

*Note:* To check your work, plot the sampling distribution and the observation. Do these values look reasonable?
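
One common way to simulate under this null hypothesis is to repeatedly draw a random group the same size as the sophomore group from the whole class and compare its mean with the observed sophomore mean. The sketch below follows that idea on synthetic scores; the function name, its arguments, and the data are all assumptions for illustration.

```python
import numpy as np

def simulate_pval_sketch(scores, is_soph, N, seed=0):
    """Sketch: fraction of random same-size groups whose mean is >= the sophomore mean."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    is_soph = np.asarray(is_soph, dtype=bool)
    observed = scores[is_soph].mean()
    n = int(is_soph.sum())
    # Under the null, sophomores look like a random draw of n students from the class.
    sims = np.array([rng.choice(scores, size=n, replace=False).mean()
                     for _ in range(N)])
    return float((sims >= observed).mean())

# Synthetic class of 100: the 5 "sophomores" hold the only perfect scores.
scores = np.array([1.0] * 5 + [0.0] * 95)
is_soph = np.array([True] * 5 + [False] * 95)
p = simulate_pval_sketch(scores, is_soph, N=200)
```

Plotting a histogram of `sims` with a vertical line at `observed` is a quick way to see whether the resulting p-value is plausible.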

In [440]:
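def simulate_pval(grades, N):
    level = grades.groupby('Level')
    # pd.concat replaces DataFrame.append, which newer pandas no longer provides.
    others = pd.concat([level.get_group('JR'), level.get_group('SR')])
    other = total_points(others).mean()
    # Sophomore totals do not change between simulations, so compute them once.
    so_points = total_points(level.get_group('SO'))
    j = []
    for i in range(N):
        so = so_points.sample(n = 30, replace = True)
        j.append(so.mean())
    j = np.array(j)
    j = j < other
    return sum(j) / N
print(simulate_pval(grades, 1000))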

0.001


In [436]:
level = grades.groupby('Level')
so = total_points(level.get_group('SO')).mean()
jr = letter_proportions(pd.concat([level.get_group('JR'), level.get_group('SR')]))
so

0.8406457958752065

### What is the true distribution of grades?

The gradebook for this class only reflects one particular instance of each student's performance, subject to the effects of all the little events and hiccups that occurred throughout the quarter. Might you have done better on the midterm had your roommate kept you up all night with their coughing? Wasn't it lucky that the example you were studying just before the final happened to appear on the exam?

**Question 9**

This question will simulate these '(un)lucky, random events' by adding or subtracting random amounts to each assignment before calculating the final grades. These 'random amounts' will be drawn from a Gaussian distribution with mean 0 and standard deviation 0.02:
```
np.random.normal(0, 0.02, size=(num_rows, num_cols))
```
Intuitively, such a model says that random events may bump up or down a given grade (given as a proportion):
- which on average has no effect on the class as a whole (mean 0),
- which not uncommonly might perturb a grade by 2% (std dev 0.02).

Create a function `total_points_with_noise` that takes in a dataframe like `grades`, adds noise to the assignments as described above, and returns the final scores using *the same procedure* as questions 1-7.

*Note:* You should be able to reuse (or minorly change) the code from previous problems. Try to be DRY (don't repeat yourself)!

*Note 1:* Once adding the noise to the assignment scores, use the `np.clip` function to be sure each assignment retains a score between 0% and 100%.

*Note 2:* To check your work -- what would you expect the difference between the actual scores and noisy scores to be, on average?
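
The effect of the noise model can be previewed on synthetic proportions before wiring it into the real pipeline. The sketch below uses a made-up class in which every assignment score is 0.85; it only illustrates the add-noise-then-clip step and the expectation that the average difference is close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical class: 1000 students, 20 assignments, every score 85%.
props = np.full((1000, 20), 0.85)

noise = rng.normal(0, 0.02, size=props.shape)
noisy = np.clip(props + noise, 0, 1)  # keep every score between 0% and 100%

# Per-student change in the (unweighted) average score.
diff = noisy.mean(axis=1) - props.mean(axis=1)
print(diff.mean(), diff.std())
```

Because the noise has mean 0, the class-wide average of `diff` should sit very near 0, while individual students move up or down slightly.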

In [324]:
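def total_points_with_noise(grades):
    # Work on a copy so the caller's gradebook is not modified in place.
    grades = grades.copy()
    for col in grades.columns:
        max_col = col + ' - Max Points'
        # Score columns are exactly those with a matching '- Max Points' column.
        if max_col in grades.columns:
            noise = np.random.normal(0, 0.02, size = grades.shape[0])
            # Perturb the proportion, clip it to [0, 1], then convert back to raw points.
            noisy = np.clip(grades[col] / grades[max_col] + noise, 0, 1)
            grades[col] = noisy * grades[max_col]
    return total_points(grades)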

In [433]:
total_points_with_noise(grades).describe()
proj.simulate_pval(grades, 100)

0.18691588785046728

### Short-answer questions (hard-coded)

Use your functions from above to understand the data and answer the following questions. The function below should return **hard-coded values**. It should not compute anything!

**Question 10**

Create a function `short_answer` that takes no arguments and returns (hard-coded) answers to the following questions in a list:
0. For the class on average, what is the difference between students' scores (`total_points`) and their scores with noise (`total_points_with_noise`)? (Remark: plot the distribution of differences; does this align with what you know about binomial distributions?)
1. What percentage of the class sees their grade change by less than $\pm 0.01$ (exclusive)?
2. What is the 95% confidence interval for the statistic above? (see [DSC10](https://www.inferentialthinking.com/chapters/13/3/Confidence_Intervals.html) and use `np.percentile`)
3. What proportion of the class sees a change in their letter grade?
4. The assumption behind the model in Question 9 is that:
    - The (observed) gradebook well represents the true population of students,
    - The noisy scores represent other possible observations drawn from the true population of students.
    - Answer `True` or `False`
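
For the confidence interval in item 2, one standard approach is a bootstrap percentile interval. The sketch below runs on a synthetic indicator (whether each of 535 hypothetical students saw a change smaller than 0.01); the 90% rate used to generate it is made up, so the numbers are not answers to the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic indicator: True if a student's grade changed by less than 0.01.
small_change = rng.random(535) < 0.9

# Resample the class with replacement and recompute the proportion each time.
boot = [rng.choice(small_change, size=small_change.size, replace=True).mean()
        for _ in range(1000)]

# Middle 95% of the bootstrap distribution.
lo, hi = np.percentile(boot, [2.5, 97.5])
print(lo, hi)
```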

In [380]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
difference = total_points(grades) - total_points_with_noise(grades)

(np.percentile(difference, [2.5, 97.5], axis= 0))
533/535

0.9962616822429906

In [381]:
def short_answer():
    return [4.462e-05, 1, [99.6, 100], 0.03, True]

# Congratulations, you finished the project!

### Before you submit:
* Be sure you run the doctests on all your code in project01.py

### To submit:
* **Upload the .py file to gradescope**