In [9]:
import numpy as np
from collections.abc import MutableSequence
import pandas as pd
from abc import ABC, abstractmethod
import statistics
from statistics import stdev, mean
import math

# Assignment #2 - With Bonus Stats!!

## Overview

The goal is to build a special data structure, here called a calculationList, that is a list of numbers with some extra math attached, along with the code needed to use and test it. Each calculationList has two main parts: a list of numbers and a threshold value. The basic logic is the same for every type of list, but the details differ by type. The threshold is a limit on whatever calculation the list type represents: for a stdList it applies to the standard deviation, for a meanList it applies to the mean, and so on.

Each calculationList should have a prune() method that removes values from the list until the relevant calculated value is below the threshold. Each type of list has its own way of deciding what to remove, because we want to remove the most "important" values first. For example, if the standard deviation is greater than the threshold, and one value is 3 standard deviations away from the mean while another is 10 standard deviations away, the second value should be removed first because it has the largest impact.
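
As a rough illustration of the pruning idea described above (not the required implementation), the sketch below repeatedly removes the value farthest from the mean until the standard deviation is no longer above the threshold. The function name `prune_by_std` is hypothetical, and the choice of the population standard deviation and of "stop once the value is at or under the threshold" are assumptions.

```python
from statistics import mean, pstdev


def prune_by_std(values, threshold):
    """Return a pruned copy of values whose population std. dev. is <= threshold."""
    remaining = list(values)
    # A list with fewer than two values has no spread, so stop there.
    while len(remaining) > 1 and pstdev(remaining) > threshold:
        center = mean(remaining)
        # The value farthest from the mean has the largest impact on the spread.
        farthest = max(remaining, key=lambda x: abs(x - center))
        remaining.remove(farthest)
    return remaining


print(prune_by_std([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2))
```

When two values are equally far from the mean, `max` keeps the first one it encounters, so ties are resolved by list order in this sketch.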

<b>Note: please let me know if the premise isn't clear. You should have to sort out some ambiguities as you develop, but the goal should be clear.</b>

### Classes to Create

A calculationList class that is made up of a list of float numbers as well as a few additions. This class will inherit from two things: the MutableSequence class and the ABC class. Inheriting from MutableSequence gives us the usual list methods, and inheriting from ABC lets us declare abstract methods that each subclass must implement.
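
To see what the two parent classes contribute, here is a minimal sketch. The names `DemoList` and `DemoSum` are hypothetical stand-ins, not the classes required by the assignment. Implementing the five methods that MutableSequence requires is enough to inherit `append`, `extend`, and the other list-style methods, while the abstract method prevents the base class from being instantiated.

```python
from abc import ABC, abstractmethod
from collections.abc import MutableSequence


class DemoList(MutableSequence, ABC):
    """Hypothetical stand-in for the base class; stores items in a plain list."""

    def __init__(self, iterable=()):
        self._items = list(iterable)

    # The five methods MutableSequence requires.
    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, value):
        self._items[index] = value

    def __delitem__(self, index):
        del self._items[index]

    def __len__(self):
        return len(self._items)

    def insert(self, index, value):
        self._items.insert(index, value)

    @abstractmethod
    def value(self):
        """Subclasses decide what is calculated."""


class DemoSum(DemoList):
    def value(self):
        return sum(self._items)


nums = DemoSum([1.0, 2.0])
nums.append(3.0)    # provided by MutableSequence, built on insert()
nums.extend([4.0])  # also inherited
print(list(nums), nums.value())

try:
    DemoList([1.0])  # fails: the abstract method value() is not implemented
except TypeError as err:
    print("Cannot instantiate:", err)
```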

The calculation list will be a base class that will not be implemented directly. You will need to create some subclasses that then inherit from the calculationList class. These subclasses will be the following:
<ul>
<li> stdList - this will be a calculationList that will prune values based on the standard deviation of the list. </li>
<li> meanList - this will be a calculationList that will prune values based on the mean of the list. </li>
<li> sumList - this will be a calculationList that will prune values based on the sum of the list. </li>
</ul>

Each of these classes should add only what it needs to make its unique functionality work; anything common to all of them belongs in the calculationList class. The top-level calcList class is similar to the example ListBasedSet class here: https://python.readthedocs.io/en/latest/library/collections.abc.html The other classes should be children of that class, each adding its own unique parts. One note: there may be erroneous values in the input data, so there should be some error checking to deal with broken inputs. <b>If a row has erroneous data, that row should be skipped entirely.</b>
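
One possible way to skip broken rows is sketched below. The input layout is not specified in this notebook, so the sketch assumes a hypothetical row of the form `name, list type, threshold, value, value, ...`; the function name `parse_row` and the rule that non-finite numbers count as errors are also assumptions. Adjust it to the real file format.

```python
import math


def parse_row(fields):
    """Return (name, kind, threshold, values), or None if the row is broken.

    Assumed (hypothetical) layout: name, list type, threshold, then the numbers.
    """
    if len(fields) < 4:
        return None
    name, kind = fields[0].strip(), fields[1].strip()
    if not name or kind not in {"stdList", "meanList", "sumList"}:
        return None
    try:
        numbers = [float(field) for field in fields[2:]]
    except (TypeError, ValueError):
        return None  # at least one field is not a number
    if not all(math.isfinite(n) for n in numbers):
        return None  # reject nan and inf as erroneous data
    return name, kind, numbers[0], numbers[1:]


rows = [
    ["a", "sumList", "45", "1", "2", "3"],
    ["b", "meanList", "4", "1", "oops", "3"],  # non-numeric value
    ["c", "stdList", "", "1", "2"],            # missing threshold
]
parsed = [row for row in map(parse_row, rows) if row is not None]
print(parsed)
```

Returning `None` for a bad row keeps the "skip the whole row" rule in one place, so the calling loop only has to filter.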

#### Example Results

Here are a few screenshots of the processing logic of the calculation lists:

![Calculation List Example](example_results.png "Calculation List Example")

We can also look at the inputs and outputs of the calculation lists to see some of the details:

![Input and Output Example](input_output.png "Input and Output Example")

Please check with me if the idea and the goal are not clear.

## Deliverables

For this assignment, please submit the following:
<ul>
<li> The notebook file containing your code. </li>
<li> The CSV output file, <b>generated from a test file that I'll post before the due date.</b> This file will be in the same format as the test data, but the values will be different. </li>
</ul>

## Grading

The grading for this will be broken out as follows, and will lean heavily on things working correctly.
<ul>
<li> 75% - Functionality. If yours works, this is the baseline. If it fails, I may decrease this, depending on what I can visually spot in code. </li>
<li> 25% - Code clarity and formatting. </li>
</ul>

### Notes and Hints

I will put any update notes, responses to common questions, and relevant hints in a list in the README file. Please don't edit that file; leaving it untouched lets you pull updates without merge conflicts.

In [11]:
class calcList(MutableSequence, ABC):
    def __init__(self, name, threshold, iterable, trim=3):
        self._name = name
        self._threshold = threshold
        self._trim = trim
        self.elements = list(iterable)

    def __getitem__(self, index):
        return self.elements[index]

    def __setitem__(self, index, value):
        self.elements[index] = value

    def __delitem__(self, index):
        del self.elements[index]

    def __len__(self):
        return len(self.elements)

    def insert(self, index, value):
        self.elements.insert(index, value)

    def csv_output(self):
        return f"{self._name},{len(self)},{self._threshold},{self.value()}"

    @abstractmethod
    def value(self):
        pass

    @abstractmethod
    def prune(self):
        pass

    @abstractmethod
    def isPruned(self):
        pass

    @abstractmethod
    def returnType(self):
        pass

    def setThreshold(self, threshold):
        self._threshold = threshold

    def getThreshold(self):
        return self._threshold

In [36]:
class stdList(calcList):    
    
    def value(self):
        if len(self.elements) > 1:
            return np.std(self.elements)
        else:
            return 0

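    def prune(self):
        # Drop the value farthest from the mean until the standard deviation
        # is no longer above the threshold (or fewer than two values remain).
        while len(self.elements) > 1 and self.value() > self._threshold:
            mean_val = mean(self.elements)
            max_deviation = max(self.elements, key=lambda x: abs(x - mean_val))
            self.elements.remove(max_deviation)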

    def isPruned(self):
        return self.value() <= self._threshold

    def returnType(self):
        return 'stdList'

    def __str__(self):
        return f"{self._name} - Std. Dev: {self.value()} (Thresh: {self._threshold}) {self.elements}"

In [37]:
class meanList(calcList):
        
    def value(self):
        if not self.elements:
            return 0  
        else:
            return mean(self.elements)

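    def prune(self):
        # Remove the largest values, roughly 10% of the list per pass (at least one),
        # until the mean is no longer above the threshold or the list is empty.
        while self.elements and self.value() > self._threshold:
            elements_to_remove = max(1, int(len(self.elements) * 0.1))

            for _ in range(elements_to_remove):
                self.elements.remove(max(self.elements))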

    def isPruned(self):
        return self.value() <= self._threshold

    def returnType(self):
        return 'meanList'

    def __str__(self):
        return f"{self._name} - Mean: {self.value()} (Thresh: {self._threshold}) {self.elements}"

In [38]:
class sumList(calcList):
    def value(self):
        return sum(self.elements)

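    def prune(self):
        # Remove the largest value until the sum is within the threshold
        # (with a small tolerance for float rounding) or one element remains.
        while len(self.elements) > 1 and self.value() > self._threshold + 1e-10:
            self.remove_element()

    def isPruned(self):
        # Use the same rounding tolerance as prune() so the two checks agree.
        return self.value() <= self._threshold + 1e-10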

    def returnType(self):
        return 'sumList'
    
    def __str__(self):
        return f"{self._name} - Sum: {self.value()} (Thresh: {self._threshold}) {self.elements}"
    
    def remove_element(self):
        idx_to_remove = self.elements.index(max(self.elements))
        del self.elements[idx_to_remove]

### Simple Unit Tests

These are some simple tests that you can use to check, if you want. Please feel free to change, remove, or add to these as you see fit.

In [39]:
calc = stdList("test", 2, [1,2,3,4,5,6,7,8,9,10])
print(calc)
calc.prune()
print(calc)

test - Std. Dev: 2.8722813232690143 (Thresh: 2) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test - Std. Dev: 2.0 (Thresh: 2) [4, 5, 6, 7, 8, 9, 10]


In [41]:
calc2 = meanList("test2", 4, [1,2,3,4,5,6,7,8,9,10])
print(calc2)
calc2.prune()
print(calc2)

test2 - Mean: 5.5 (Thresh: 4) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test2 - Mean: 4 (Thresh: 4) [1, 2, 3, 4, 5, 6, 7]


In [40]:
calc3 = sumList("test3", 45, [1,2,3,4,5,6,7,8,9,10])
print(calc3)
calc3.prune()
print(calc3)

test3 - Sum: 55 (Thresh: 45) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test3 - Sum: 45 (Thresh: 45) [1, 2, 3, 4, 5, 6, 7, 8, 9]


### Load Data and Test

The functions below form a simple test harness for your code: it takes a response file and an expected-results file and scores one against the other. You already have one of the two inputs, the expected results; you will need to write the rest of the code to generate your own results and pass them in to run the test.

This function can likely be wrapped in another, one that calls your code to generate that input to check against. This isn't required, but will likely make things easier to call and test repeatedly. You'd have to do everything required to get the "response" argument, which is the CSV file of your answers. 
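
A minimal sketch of the "generate the response" half might look like the following. The `Name` and `Value` headers match the default column names used by `testHarness`, while the `Length` and `Threshold` headers, the function name `write_results`, and the `StandInSumList` class are assumptions made so the example runs on its own. In the notebook you would pass your real list objects and a file opened with `open("output.csv", "w", newline="")` instead of the in-memory buffer.

```python
import csv
import io


def write_results(calc_lists, handle):
    """Prune each list and write one CSV row per list to an open text handle."""
    writer = csv.writer(handle)
    # 'Length' and 'Threshold' are assumed header names.
    writer.writerow(["Name", "Length", "Threshold", "Value"])
    for calc in calc_lists:
        calc.prune()
        writer.writerow([calc._name, len(calc), calc.getThreshold(), calc.value()])


class StandInSumList:
    """Minimal stand-in so the sketch is self-contained; use the real classes instead."""

    def __init__(self, name, threshold, values):
        self._name = name
        self._threshold = threshold
        self.elements = list(values)

    def __len__(self):
        return len(self.elements)

    def getThreshold(self):
        return self._threshold

    def value(self):
        return sum(self.elements)

    def prune(self):
        while len(self.elements) > 1 and self.value() > self._threshold:
            self.elements.remove(max(self.elements))


buffer = io.StringIO()
write_results([StandInSumList("demo", 45, range(1, 11))], buffer)
print(buffer.getvalue())
```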

In [42]:
def testHarness(response, expected, response_col="Value", expected_col="Value", match_thresh=.03, exp_name="Name", resp_name="Name"):
    '''Runs a test of the response file against the expected file. Returns a tuple of the number of correct and incorrect responses.'''
    resp = pd.read_csv(response)
    exp = pd.read_csv(expected)
    
    correct = 0
    incorrect = 0
    
    # Rows are compared by position, so both files must list results in the same order.
    for i in range(len(resp)):
        exp_val = exp.iloc[i][expected_col]
        resp_val = resp.iloc[i][response_col]
        names_match = exp.iloc[i][exp_name] == resp.iloc[i][resp_name]

        if names_match and toleranceMatch(exp_val, resp_val, match_thresh):
            correct += 1
        else:
            incorrect += 1
    
    return (correct, incorrect)
    

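def toleranceMatch(val1, val2, percent_tolerance):
    '''Returns True if val1 and val2 are within percent_tolerance of each other, False otherwise.

    percent_tolerance is a fraction of val1 (e.g. .03 means 3%).'''
    if val1 == val2:
        return True
    if val1 == 0:
        # A relative difference is undefined against zero, so only an exact match counts.
        return False
    # Use abs(val1) so a negative val1 cannot make the ratio negative and always pass.
    return abs(val1 - val2) / abs(val1) <= percent_tolerance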

In [43]:
test = testHarness("output.csv", "output.csv")
print(test)

(1000, 0)


In [None]:
# Couldn't figure this test part out for the life of me.