## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. You can add more notebook cells or edit existing notebook cells other than "# YOUR CODE HERE" to test out or debug your code. We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
2. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE"
3. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
4. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
5. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
6. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
7. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
8. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, there may be some code that you ran and deleted that is still being used and that was why your asserts were passing.

# Naive Classification Approaches


In this case study, we will build a custom classifier that takes some naive classification approaches. The approaches we'll take are:

- Guessing a user-determined class at all times
- Guessing the most common class at all times
- Guessing randomly based on the distribution of the classes
- Guessing randomly based on an equal chance of the classes


We're going to build a class, `NaiveClassifier`, that can fit and predict based on the above approaches. We will then try it out on a few datasets and see what results we get. This should help you understand the minimal performance you should expect out of your machine learning models.

The way `NaiveClassifier` should work is that we instantiate it with an `approach` and an optional `value` depending on the method.

Examples:

- always predict class 1 would be: `clf = NaiveClassifier(approach="always", value=1)`
- always predict most common class would be: `clf = NaiveClassifier(approach="most")`
- predict based on class distribution: `clf = NaiveClassifier(approach="distribution")`
- predict based on equal class distribution: `clf = NaiveClassifier(approach="equal")`

### Note that we are *not* building a Naive Bayes Classifier

A Naive Bayes Classifier is a legitimate ML model that predicts output targets based on input features. In contrast, the point of this exercise, and of the naive (non-Bayesian) classifier you are building, is to make you aware that an ML model might give *seemingly* reasonable prediction scores despite having learned no meaningful relationship between features and targets.

Your naive classifier models only the distribution of target values, or some statistic of that distribution. It is *not* a machine learning model, because there is no learning involved. Thus, any ML model that underperforms relative to such a naive classifier presumably has not learned any useful connection between the features and the targets.
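To make the point concrete, here is a minimal sketch in plain Python, using made-up labels chosen only for illustration. A "model" that ignores its inputs still reports 90% accuracy on a 90:10 dataset, while never finding a single member of the minority class. The helper names `accuracy` and `recall` are assumptions for this example and are not part of the assignment.

```python
# Hypothetical 90:10 dataset: 90 samples of class 0, 10 samples of class 1
y_true = [0] * 90 + [1] * 10

# A "model" that has learned nothing: it always predicts the majority class
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of the true members of `cls` that were predicted as `cls`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    total = sum(1 for t in y_true if t == cls)
    return hits / total


print(accuracy(y_true, y_pred))   # 0.9
print(recall(y_true, y_pred, 1))  # 0.0
```

The accuracy looks respectable, but the recall for class 1 shows that nothing about the minority class was captured.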

In addition, this exercise is intended to help you understand how scikit-learn model classes are typically constructed, by building one yourself.
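As a rough sketch of the layout scikit-learn estimators generally follow: hyperparameters are stored in `__init__()`, state is learned in `fit()` (which returns `self`), and learned attributes are named with a trailing underscore. The toy class below mimics that layout. `MajorityBaseline`, its `fallback` parameter, and its `majority_` attribute are hypothetical names used only for illustration; this is not the `NaiveClassifier` you are asked to write.

```python
from collections import Counter


class MajorityBaseline:
    """Toy estimator (hypothetical) that mimics the usual fit/predict layout."""

    def __init__(self, fallback=None):
        # Hyperparameters are only stored here; nothing is computed yet
        self.fallback = fallback

    def fit(self, X, y):
        # Learn state from the labels; the features X are ignored on purpose
        counts = Counter(y)
        self.majority_ = counts.most_common(1)[0][0] if counts else self.fallback
        return self  # returning self allows chaining: est.fit(X, y).predict(X)

    def predict(self, X):
        # One prediction per row of X, using only the state stored by fit()
        return [self.majority_] * len(X)


est = MajorityBaseline().fit([[0], [0], [0]], [1, 1, 0])
print(est.predict([[0], [0]]))  # [1, 1]
```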

Examine the imported packages below. These are the packages you may use for your solutions.

Use the ```stats``` module (imported from ```scipy```) to help you extract desired statistics from your target data. Use the ```random``` package to help you randomly select values for your naive classifier's prediction, in a manner that reflects the specified ```approach```.

In [1]:
import numpy as np
import scipy
from scipy import stats
import sklearn
from sklearn.metrics import accuracy_score, classification_report
import random

In [4]:
def select_most_common(labels):
    """Select the most common value (the mode) in an iterable of labels
    
    Args:
        labels (iterable): An iterable of integers representing the labels of a dataset
    
    Returns:
        int: The most common element in the iterable
    """
    return np.bincount(labels).argmax()


In [5]:
assert select_most_common([1,2,2,3,4,5]) == 2
assert select_most_common([1,1,1,1,1,1,2,2,2]) == 1

A solution to the function definition below will likely require some deeper thinking on your part.


<br>
<details>
<summary>Click here for a hint</summary>
    
There are multiple coding approaches. As a hint to one type of approach, consider how you might work with the cumulative distribution function (cdf) rather than the probability distribution function (pdf). Yet, don't let this hint constrain your thinking and creativity.
</details>
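To illustrate the cdf idea from the hint, the sketch below maps a few fixed numbers in [0, 1) onto classes by looking them up in the cumulative sums of the probabilities. The name `class_for` is hypothetical, and the fixed draws are chosen by hand; in a sampler, they would come from `random.random()`.

```python
from bisect import bisect_right
from itertools import accumulate

pdf = [0.5, 0.1, 0.4]
cdf = list(accumulate(pdf))  # cumulative sums, roughly [0.5, 0.6, 1.0]


def class_for(draw, cdf):
    """Return the index of the first cdf entry strictly greater than draw."""
    # The min() guards against round-off leaving the last entry just below 1
    return min(bisect_right(cdf, draw), len(cdf) - 1)


for draw in (0.05, 0.49, 0.55, 0.61, 0.99):
    print(draw, "->", class_for(draw, cdf))
# 0.05 -> 0
# 0.49 -> 0
# 0.55 -> 1
# 0.61 -> 2
# 0.99 -> 2
```

Each class owns an interval of [0, 1) whose width equals its probability, so a uniform draw lands in class `k` with probability `pdf[k]`.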


In [166]:
def predict_from_distribution(distribution):
    """Draw a sample from a specified categorical distribution
    
    Args:
        distribution (iterable): An iterable of the probabilities of each class in the distribution.
                                 The distribution probabilities must sum to 1.
    
    Returns:
        int: The 0-indexed class of the drawn sample.
    """    
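    # Compare with a tolerance: floating-point probabilities rarely sum to exactly 1
    assert np.isclose(sum(distribution), 1)
    # Inverse-CDF sampling: draw one uniform number, then return the first class
    # whose cumulative probability exceeds it
    draw = random.random()
    cumulative = 0
    for idx, value in enumerate(distribution):
        cumulative += value
        if draw < cumulative:
            return idx
    # Only reached if round-off leaves the cumulative sum slightly below 1
    return len(distribution) - 1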


In [167]:
# Example predictions
# You should see 10 results with about five 0s, one 1, and four 2s.
# You can print intermediate values inside the function to check that they are being calculated correctly
[predict_from_distribution([0.5, 0.1, 0.4]) for i in range(10)]

[0, 1, 2, 2, 0, 1, 1, 0, 0, 0]

In the ```fit()``` method in the class definition below you need to compute and store any variables that you will need for the ```approach``` that is specified in the ```__init__()``` method and the corresponding ```predict()``` method. The number of variables you need to store differs depending on the ```approach```.

<br>
<details>
<summary>Click here for a hint</summary>
    
1. Try coding the predict function first and then from there determine what you would need to know ahead of time that you can learn from the fit stage.

2. Update the fit method with that information storing it into the class using `self.some_stored_value`. You can then access `self.some_stored_value` in the predict stage.
</details>


In [168]:
class NaiveClassifier:
    """A Naive Classifier that predicts classes using simple approaches.
    """
    
    def __init__(self, approach, value=None):
        """Initialize the NaiveClassifier
        
        Args:
            approach (str): One of "always", "most", "distribution", "equal"
            value (int, optional): Defaults to None. The value of the class to select if approach is "always"
        """
        assert approach in ["always", "most", "distribution", "equal"]
        self.approach = approach
        self.value = value

    def fit(self,X,y):
        """Fit to data and labels
        Examples:
            always predict class 1 would be: clf = NaiveClassifier(approach="always", value=1)
            always predict most common class would be: clf = NaiveClassifier(approach="most")
            predict based on class distribution: clf = NaiveClassifier(approach="distribution")
            predict based on equal class distribution: clf = NaiveClassifier(approach="equal")
        
        Args:
            X (iterable): The features of the data
            y (iterable): The labels of the data
        """
        if self.approach == "always":
            # Nothing to learn: the class to predict was supplied at construction
            pass
            
        elif self.approach == "most":
            self.value = select_most_common(y)

        elif self.approach == "distribution":
            # Relative frequency of each class, ordered by sorted class label
            _, counts = np.unique(y, return_counts=True)
            self.distribution = (counts / len(y)).tolist()
            
        elif self.approach == "equal":
            # Uniform probability over the classes seen in y
            n_classes = len(np.unique(y))
            self.distribution = [1 / n_classes] * n_classes

    def predict(self,X):
        """Predict the labels of a new set of datapoints
        
        Args:
            X (iterable): The data to predict
        """
        if self.approach == "always":
            return [self.value]*len(X)
        elif self.approach == "most":
            return [self.value]*len(X)
        elif self.approach == "distribution":
            return [predict_from_distribution(self.distribution) for i in range(len(X))]
        elif self.approach == "equal":
            return [predict_from_distribution(self.distribution) for i in range(len(X))]

Let's create a few datasets that we'll use to analyze how a predictor would work with each of those approaches. Here are all the datasets we'll create:

- 2 classes (0 and 1), equally distributed
- 2 classes with 0 at 90% and 1 at 10%
- 3 classes (0, 1, and 2), equally distributed
- 3 classes with 0 at 90%, 1 at 9% and 2 at 1%

For these datasets, set the class counts to be **exactly at the desired percentages** rather than generating the labels randomly from a distribution.
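A small sketch of building labels with exact class counts follows. It assumes each proportion times `n` is meant to be a whole number, and `exact_labels` is a hypothetical helper. Rounding rather than truncating matters because a floating-point product can land just below the intended integer.

```python
def exact_labels(n, proportions):
    """Return n labels with class k repeated round(n * proportions[k]) times."""
    counts = [round(n * p) for p in proportions]
    # Fail loudly if the rounded counts do not add up to n
    assert sum(counts) == n, f"counts {counts} do not sum to {n}"
    labels = []
    for cls, count in enumerate(counts):
        labels.extend([cls] * count)
    return labels


# Truncation pitfall: 0.29 * 100 evaluates to 28.999999999999996
print(int(0.29 * 100), round(0.29 * 100))  # 28 29

labels = exact_labels(15000, [0.9, 0.09, 0.01])
print([labels.count(k) for k in range(3)])  # [13500, 1350, 150]
```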

In [169]:
# You will create the labels for each of the listed datasets (above) with length n
# Name your listed datasets as: binary_equal, binary_unequal, trinary_equal and trinary_unequal
# The datasets may be lists or numpy arrays.

labels = [0,1,2]
n = 15000
features = np.zeros((n,3))

# Use integer division and round() so float round-off cannot drop a sample
binary_equal = [labels[0]]*(n//2) + [labels[1]]*(n//2)
binary_unequal = [labels[0]]*round(n*0.9) + [labels[1]]*round(n*0.1)
trinary_equal = [labels[0]]*(n//3) + [labels[1]]*(n//3) + [labels[2]]*(n//3)
trinary_unequal = [labels[0]]*round(n*0.9) + [labels[1]]*round(n*0.09) + [labels[2]]*round(n*0.01)


In [170]:
assert np.all(np.bincount(binary_equal) == np.array([7500,7500]))
assert np.all(np.bincount(binary_unequal) == np.array([13500,1500]))
assert np.all(np.bincount(trinary_equal) == np.array([5000,5000,5000]))
assert np.all(np.bincount(trinary_unequal) == np.array([13500,1350,150]))

In [171]:
datasets = [{
    "name": "Binary Classification Equally Distributed",
    "labels": binary_equal
},{
    "name": "Binary Classification 90:10",
    "labels": binary_unequal
},{
    "name": "3-Class Classification Equally Distributed",
    "labels": trinary_equal
},{
    "name": "3-Class Classification 90:9:1",
    "labels": trinary_unequal
}]

# Testing

Let's now test out our Naive Classifiers on the above datasets. We will be training and testing on the full dataset. Since the model is actually not a machine learning algorithm and this is just for educational purposes, this lack of proper cross-validation will not be an issue. We are just using this approach to learn what the naive model would have predicted even on the data it "trained" on.

When "training" or "fitting" the model, note that since it's not actually learning anything, it doesn't matter what the features it receives are. It is just going to predict things regardless of the features of the test data. For that reason, we'll be just using a single feature for data. The models should still work fine with that.
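It can help to know what accuracy to expect before running anything. Assuming each prediction is drawn independently of the true label, and that the classifier was fit on the same class proportions `p` it is scored on, the expected accuracies are: `p[k]` for always guessing class `k`, `max(p)` for the most common class, the sum of `p[k]**2` for distribution-based guessing, and `1/K` for uniform guessing over `K` classes. The sketch below evaluates these for the 90:10 case; `expected_accuracies` is a hypothetical helper, not part of the assignment.

```python
def expected_accuracies(p, always=0):
    """Expected accuracy of each naive approach for class proportions p."""
    k = len(p)
    return {
        "always": p[always],                    # correct whenever the true class is `always`
        "most": max(p),                         # always guess the largest class
        "distribution": sum(q * q for q in p),  # guess class j with probability p[j]
        "equal": 1 / k,                         # guess each class with probability 1/K
    }


for name, acc in expected_accuracies([0.9, 0.1]).items():
    print(f"{name}: {acc:.4f}")
# always: 0.9000
# most: 0.9000
# distribution: 0.8200
# equal: 0.5000
```

Under these assumptions, distribution-based guessing never beats always guessing the most common class, since the sum of `p[k]**2` is at most `max(p)`.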

In [172]:
# Create three classifiers that always predict 0, 1, and 2, respectively.
# Name them always_zero, always_one, and always_two.

always_zero = NaiveClassifier("always", 0)
always_one = NaiveClassifier("always", 1)
always_two = NaiveClassifier("always", 2)

In [173]:
assert always_zero.approach=="always"
assert always_zero.value == 0
assert always_one.approach=="always"
assert always_one.value == 1
assert always_two.approach=="always"
assert always_two.value == 2

In [174]:
# Create a classifier that predicts the most frequent class
# Name it most_est

most_est = NaiveClassifier("most")

In [175]:
assert most_est.approach=="most"
most_est.fit([0,0,0,0,0], [0,1,1,1,0])
assert most_est.predict([0,0,0]) == [1, 1, 1]

In [176]:
# Create a classifier that predicts based on the distribution of the classes
# Name it dist_est

dist_est = NaiveClassifier("distribution")

In [177]:
## To test the predict() function, use its output to create a distribution based
## on a large number of samples/predictions, and compare that to the original
## distribution. The two distributions should be very close.

assert dist_est.approach == "distribution"
nn = 1000
X = np.zeros((nn, 1), dtype=int)  # np.int was removed from NumPy; use the built-in int
y = [0 if random.random()>0.9 else 1 for i in range(nn)]
true_distrib = np.bincount(y)
dist_est.fit(X, y)
pred_distrib = np.bincount(dist_est.predict(X))

# Distributions should be close, but not necessarily identical
earth_mover_distance = np.sum(np.abs(np.subtract(true_distrib, pred_distrib))) / nn / 2
print(earth_mover_distance)
assert earth_mover_distance < 0.1

0.016


In [178]:
# Create a classifier that predicts any of the classes with equal probability
# Name it equal_est

equal_est = NaiveClassifier("equal")

In [179]:
## To test the predict() function, use its output to create a distribution based
## on a large number of samples/predictions, and compare that to an equal (uniform)
## distribution. The two distributions should be very close.

assert equal_est.approach == "equal"
nn = 1000
X = np.zeros((nn, 1), dtype=int)  # np.int was removed from NumPy; use the built-in int
y = [0 if random.random()>0.9 else 1 for i in range(nn)]
equal_est.fit(X, y)
pred_distrib = np.bincount(equal_est.predict(X))

# Predicted distribution should be close to a uniform distribution, but not necessarily identical
target_distrib = [nn/2, nn/2]
earth_mover_distance = np.sum(np.abs(np.subtract(target_distrib, pred_distrib))) / nn / 2
print(earth_mover_distance)
assert earth_mover_distance < 0.1

0.008


In [180]:
estimators = [
    {
        "name": "Always Zero",
        "estimator": always_zero
    },
    {
        "name": "Always One",
        "estimator": always_one
    },
    {
        "name": "Always Two",
        "estimator": always_two
    },
    {
        "name": "Most Common",
        "estimator": most_est
    },
    {
        "name": "Distribution Based",
        "estimator": dist_est
    },
    {
        "name": "Equally",
        "estimator": equal_est
    }
]

In [181]:
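# For each dataset, apply each estimator to generate predictions and save them as pred

# Suppress warnings (for example, scikit-learn's undefined-metric warnings when
# a class is never predicted) by replacing warnings.warn with a no-op
import warnings
def warn(*args, **kwargs):
    pass
warnings.warn = warn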

for dataset in datasets:
    name = dataset["name"]
    labels = dataset["labels"]
    print("="*100)
    print(f"{name}")
    print("="*100)
    for est in estimators:
        estimator_name = est["name"]
        print("-"*20)
        print(f"Estimating with {estimator_name}")
        print("-"*20)
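        # predict() only uses the length of its argument; the estimators are not
        # refit here, so each keeps whatever state its last fit() call stored
        pred = est["estimator"].predict(labels)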
    
        print(f"Produced an accuracy score of {accuracy_score(labels, pred)} and the following report")
        print(classification_report(labels, pred))

Binary Classification Equally Distributed
--------------------
Estimating with Always Zero
--------------------
Produced an accuracy score of 0.5 and the following report
              precision    recall  f1-score   support

           0       0.50      1.00      0.67      7500
           1       0.00      0.00      0.00      7500

    accuracy                           0.50     15000
   macro avg       0.25      0.50      0.33     15000
weighted avg       0.25      0.50      0.33     15000

--------------------
Estimating with Always One
--------------------
Produced an accuracy score of 0.5 and the following report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      7500
           1       0.50      1.00      0.67      7500

    accuracy                           0.50     15000
   macro avg       0.25      0.50      0.33     15000
weighted avg       0.25      0.50      0.33     15000

--------------------
Estimating with Always Two

Produced an accuracy score of 0.3338 and the following report
              precision    recall  f1-score   support

           0       0.34      0.50      0.40      5000
           1       0.33      0.50      0.40      5000
           2       0.00      0.00      0.00      5000

    accuracy                           0.33     15000
   macro avg       0.22      0.33      0.27     15000
weighted avg       0.22      0.33      0.27     15000

3-Class Classification 90:9:1
--------------------
Estimating with Always Zero
--------------------
Produced an accuracy score of 0.9 and the following report
              precision    recall  f1-score   support

           0       0.90      1.00      0.95     13500
           1       0.00      0.00      0.00      1350
           2       0.00      0.00      0.00       150

    accuracy                           0.90     15000
   macro avg       0.30      0.33      0.32     15000
weighted avg       0.81      0.90      0.85     15000

-----------------

In [182]:
# Please describe your conclusions based on the above results
# You must write at least 300 characters
# This portion is worth 100 points (20% of the CS)
# Save your answer to conclusions
conclusions = '''
The NaiveClassifier uses the most basic guessing approaches; it does not actually learn
anything, so it predicts without regard to the features of the test data. The "always"
approach always guesses the same class. It includes no logic that takes the class
distribution into account, which makes it about as good as marking 'A' as the answer to
every multiple-choice question. On the equally distributed datasets, this approach
correctly guesses N of the labels, where N = num_labels / num_classes. On datasets with a
different distribution, its accuracy depends largely on the chosen class, since the
accuracy equals the share of the dataset that the chosen label represents. Again, since
nothing is learned from the fit data, choosing the value to initialize the model with is
largely arbitrary, yet it directly determines the accuracy. The "most" approach tries to
'learn' basic information about the fit dataset in order to evaluate the test dataset. By
choosing the value that appeared most often in the train set, the model tries to mimic the
best result one could achieve by guessing a single class for every label. The results on
the test set were not great, though, because the test set was mostly composed of the other
labels; this approach did relatively okay on the train set but horribly on the test set
(poor generalization). The "distribution" approach's performance was in the middle of all
the models, which suggests it might generalize to unknown data the best of all the
approaches. This model tried to 'learn' the distribution of the train set to mimic it on
the test set. The "equal" approach is naive just like the "always" approach, but is like
guessing 'A', 'B', or 'C' with equal probability. It scored about half of the labels
correctly on all datasets in this run. When the dataset distribution was optimal for the
"always" approach, that approach scored best; otherwise, the "equal" approach gave midway
results, above the "distribution" approach that was fit to the train sets.
'''
print(conclusions)

In [183]:
assert len(conclusions) > 300

## Feedback

In [184]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    return "This assignment was not difficult except for the lack of direction. If better directions were provided, much confusion over what was supposed to be done would be solved. Piazza was helpful, so maybe include some of that in the directions next time."

feedback()


'This assignment was not difficult except for the lack of direction. If better directions were provided, much confusion over what was supposed to be done would be solved. Piazza was helpful, so maybe include some of that in the directions next time.'