# Naive Classification Approaches


In this case study, we will build a custom classifier that implements several naive classification approaches. The approaches are:

- Guessing one class at all times
- Guessing the most common class at all times
- Guessing randomly based on the distribution of the classes
- Guessing randomly based on an equal chance of the classes


We're going to build a class, `NaiveClassifier`, that can fit and predict using each of the above approaches. We will then try it out on a few datasets and see what results we get. This should help you understand the baseline performance that your machine learning models need to beat.

`NaiveClassifier` is instantiated with an `approach` and an optional `value`, which is only used when the approach is `"always"`.

Examples:

- always predict class 1 would be: `clf = NaiveClassifier(approach="always", value=1)`
- always predict most common class would be: `clf = NaiveClassifier(approach="most")`
- predict based on class distribution: `clf = NaiveClassifier(approach="distribution")`
- predict based on equal class distribution: `clf = NaiveClassifier(approach="equal")`
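
For comparison only (this is not part of the exercise), scikit-learn ships a `DummyClassifier` whose strategies appear to line up with the four approaches above. The mapping below is a reading of its documented strategies rather than anything the case study prescribes, and the toy data `X` and `y` are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: the features are ignored by every dummy strategy
X = np.zeros((10, 1))
y = np.array([0] * 7 + [1] * 3)

# Assumed mapping from this case study's approaches to DummyClassifier strategies
baselines = {
    "always": DummyClassifier(strategy="constant", constant=1),
    "most": DummyClassifier(strategy="most_frequent"),
    "distribution": DummyClassifier(strategy="stratified", random_state=0),
    "equal": DummyClassifier(strategy="uniform", random_state=0),
}

predictions = {name: clf.fit(X, y).predict(X) for name, clf in baselines.items()}
for name, pred in predictions.items():
    print(name, pred)
```

Building the class by hand, as we do below, is still worthwhile because it makes explicit what each baseline actually does.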

In [0]:
import numpy as np
import scipy
from scipy import stats
import sklearn
from sklearn.metrics import accuracy_score, classification_report
import random

In [0]:
def select_most_common(labels):
    """Select the most common value in an iterable of labels
    
    Args:
        labels (iterable): An iterable of integers representing the labels of a dataset
    
    Returns:
        int: The most common element in the iterable
    """
    # YOUR CODE HERE
    counts = np.bincount(labels)
    return np.argmax(counts)

In [0]:
assert select_most_common([1,2,2,3,4,5]) == 2
assert select_most_common([1,1,1,1,1,1,2,2,2]) == 1
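
`np.bincount` only accepts non-negative integers, so the function above assumes labels of that kind. If labels could be strings or negative numbers, one possible alternative is `collections.Counter`. The sketch below uses a hypothetical helper name, `most_common_label`, and breaks ties by first occurrence, whereas `np.argmax` picks the smallest label.

```python
from collections import Counter


def most_common_label(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


print(most_common_label([1, 2, 2, 3, 4, 5]))
print(most_common_label(["cat", "dog", "dog", -1]))
```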

In [0]:
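def predict_from_distribution(distribution):
    """Sample a class index according to a probability distribution

    Args:
        distribution (iterable): The probability of each class, summing to 1

    Returns:
        int: The index of the sampled class
    """
    # Compare with a tolerance: float probabilities rarely sum to exactly 1
    assert np.isclose(sum(distribution), 1)
    # YOUR CODE HERE
    # Build the cumulative probabilities of the classes
    cumulative = 0
    cumulative_probabilities = []
    for p in distribution:
        cumulative += p
        cumulative_probabilities.append(cumulative)
    # Draw r uniformly from [0, 1] and return the first class whose
    # cumulative probability reaches r
    r = random.uniform(0, 1)
    for index, cp in enumerate(cumulative_probabilities):
        if r <= cp:
            return index
    # Guard against rounding leaving the last cumulative value just below 1
    return len(distribution) - 1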

In [6]:
# Example predictions
# You should see 10 results with about five 0s, one 1, and four 2s.
# You can print the random draw inside the function to check that it's being calculated correctly
[predict_from_distribution([0.5, 0.1, 0.4]) for i in range(10)]

[0, 0, 2, 2, 2, 0, 0, 1, 2, 0]
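
Ten draws are too few to tell whether the sampling is right. As a rough check, the sketch below re-implements the same cumulative-sum idea with the standard library (`itertools.accumulate` and `bisect`) and compares the observed frequencies against the target distribution over many draws. The helper name `sample_index` and the seed are arbitrary choices for illustration.

```python
import bisect
import itertools
import random
from collections import Counter


def sample_index(distribution, rng):
    """Draw one class index by inverting the cumulative distribution."""
    cumulative = list(itertools.accumulate(distribution))
    r = rng.random()
    # First index whose cumulative probability is >= r
    index = bisect.bisect_left(cumulative, r)
    # Guard against rounding leaving the last cumulative value just below 1
    return min(index, len(distribution) - 1)


target = [0.5, 0.1, 0.4]
rng = random.Random(42)
draws = [sample_index(target, rng) for _ in range(100_000)]
counts = Counter(draws)
frequencies = [counts[i] / len(draws) for i in range(len(target))]
print(frequencies)
```

With this many draws, each frequency should land close to its target probability.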

In [0]:
class NaiveClassifier:
    """A Naive Classifier that predicts classes using simple approaches.
    """
    
    def __init__(self, approach, value=None):
        """Initialize the NaiveClassifier
        
        Args:
            approach (str): One of "always", "most", "distribution", "equal"
            value (int, optional): Defaults to None. The value of the class to select if approach is "always"
        """
        assert approach in ["always", "most", "distribution", "equal"]
        self.approach = approach
        self.value = value

    def fit(self,X,y):
        """Fit to data and labels
        
        Args:
            X (iterable): The features of the data
            y (iterable): The labels of the data
        """
        if self.approach == "always":
            # YOUR CODE HERE
            pass
            
        elif self.approach == "most":
            # YOUR CODE HERE
            self.most = select_most_common(y)
            
        elif self.approach == "distribution":
            # YOUR CODE HERE
            self.distribution = np.bincount(y)/len(y)
            
        elif self.approach == "equal":
            # YOUR CODE HERE
            self.equal = np.unique(y)
            

    def predict(self,X):
        """Predict the labels of a new set of datapoints
        
        Args:
            X (iterable): The data to predict
        """
        if self.approach == "always":
            return [self.value]*len(X)
            
        elif self.approach == "most":
            # YOUR CODE HERE
            return [self.most]*len(X)
          
        elif self.approach == "distribution":
            # YOUR CODE HERE
            pred = np.zeros(len(X))
            for i in range(len(X)):
                pred[i] = predict_from_distribution(self.distribution)
            return pred
            
        elif self.approach == "equal":
            # YOUR CODE HERE
            pred = np.zeros(len(X))
            for i in range(len(X)):
                pred[i] = random.choice(self.equal)
            return pred

Let's create a few datasets that we'll use to analyze how a predictor would work with each of those approaches. Here are all the datasets we'll create:

- 2 classes equally distributed
- 2 classes with 0 at 90% and 1 at 10%
- 3 classes equally distributed
- 3 classes with 0 at 90%, 1 at 9% and 2 at 1%
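
Before running anything, the expected accuracy of each approach can be worked out from the class proportions alone. If class i has proportion p_i among k classes, then always guessing the most common class is right with probability max(p_i), guessing from the distribution is right with probability sum(p_i^2), and guessing with equal chance is right with probability 1/k. The sketch below (the function name `expected_accuracy` is hypothetical) applies those formulas to the four datasets.

```python
def expected_accuracy(proportions):
    """Expected accuracy of each naive approach for given class proportions."""
    k = len(proportions)
    return {
        "most": max(proportions),
        "distribution": sum(p * p for p in proportions),
        "equal": 1 / k,
    }


dataset_proportions = {
    "binary_equal": [0.5, 0.5],
    "binary_unequal": [0.9, 0.1],
    "trinary_equal": [1 / 3, 1 / 3, 1 / 3],
    "trinary_unequal": [0.9, 0.09, 0.01],
}

for name, proportions in dataset_proportions.items():
    print(name, expected_accuracy(proportions))
```

The "always" approach is not listed because its expected accuracy is simply the proportion of the class it always guesses.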

In [0]:
# We will create the labels for each of the listed datasets with length n
# Create the listed datasets as binary_equal, binary_unequal, trinary_equal and trinary_unequal
n = 15000
features = np.zeros((n,3))

# YOUR CODE HERE
# Build each list of labels from class counts that add up to n = 15000
binary_equal = [0]*7500 + [1]*7500
binary_unequal = [0]*13500 + [1]*1500
trinary_equal = [0]*5000 + [1]*5000 + [2]*5000
trinary_unequal = [0]*13500 + [1]*1350 + [2]*150


In [0]:
assert np.all(np.bincount(binary_equal) == np.array([7500,7500]))
assert np.all(np.bincount(binary_unequal) == np.array([13500,1500]))
assert np.all(np.bincount(trinary_equal) == np.array([5000,5000,5000]))
assert np.all(np.bincount(trinary_unequal) == np.array([13500,1350,150]))
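
The counts above are hard-coded for n = 15000. If n were to change, one option would be to derive the counts from the class proportions. The following is a minimal sketch, assuming the proportions split n into whole counts; `make_labels` is a hypothetical helper, not part of the exercise.

```python
import numpy as np


def make_labels(n, proportions):
    """Build a sorted label array with round(n * p) copies of each class index."""
    counts = [round(n * p) for p in proportions]
    assert sum(counts) == n, "proportions must split n into whole counts"
    return np.repeat(np.arange(len(proportions)), counts)


labels = make_labels(15000, [0.9, 0.09, 0.01])
print(np.bincount(labels))
```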

In [0]:
datasets = [{
    "name": "Binary Classification Equally Distributed",
    "labels": binary_equal
},{
    "name": "Binary Classification 90:10",
    "labels": binary_unequal
},{
    "name": "3-Class Classification Equally Distributed",
    "labels": trinary_equal
},{
    "name": "3-Class Classification 90:9:1",
    "labels": trinary_unequal
}]

# Testing

Let's now test our naive classifiers on the above datasets. We will train and test on the full dataset. That would be a mistake for a real machine learning model, but these classifiers learn nothing beyond the class frequencies and this is just for educational purposes, so it is not an issue here. We only want to see what each naive approach predicts, even on the data it was fit on.

In [0]:
# Create three classifiers that always predict 0, 1, and 2
# Name them always_zero, always_one and always_two respectively

# YOUR CODE HERE
    
always_zero = NaiveClassifier("always", 0)
always_one = NaiveClassifier("always", 1)
always_two = NaiveClassifier("always", 2)


In [0]:

assert always_zero.approach=="always"
assert always_zero.value == 0
assert always_one.approach=="always"
assert always_one.value == 1
assert always_two.approach=="always"
assert always_two.value == 2

In [0]:
# Create a classifier that predicts the most frequent class
# Name it most_est

# YOUR CODE HERE
most_est = NaiveClassifier("most")

In [0]:
assert most_est.approach=="most"
most_est.fit([0,0,0,0,0], [0,1,1,1,0])
assert most_est.predict([0,0,0]) == [1, 1, 1]

In [0]:
# Create a classifier that predicts based on the distribution of the classes
# Name it dist_est

# YOUR CODE HERE
dist_est = NaiveClassifier("distribution")

In [0]:
assert dist_est.approach == "distribution"
dist_est.fit([0,0,0,0,0], [0,0,1,1,1])
random.seed(0)
assert sum(dist_est.predict([0,0,0,0,0])) == 4

In [0]:
# Create a classifier that predicts any of the classes with equal chance
# Name it equal_est

# YOUR CODE HERE
# No value is needed: it is only used by the "always" approach
equal_est = NaiveClassifier("equal")

In [0]:
assert equal_est.approach == "equal"
equal_est.fit([0,0,0,0,0], [0,1,1,1,1])
random.seed(0)
assert sum(equal_est.predict([0,0,0,0])) == 3

In [0]:
estimators = [
    {
        "name": "Always Zero",
        "estimator": always_zero
    },
    {
        "name": "Always One",
        "estimator": always_one
    },
    {
        "name": "Always Two",
        "estimator": always_two
    },
    {
        "name": "Most Common",
        "estimator": most_est
    },
    {
        "name": "Distribution Based",
        "estimator": dist_est
    },
    {
        "name": "Equally",
        "estimator": equal_est
    }
]

In [35]:
# For each dataset, apply each estimator and save the predictions as pred
for dataset in datasets:
    name = dataset["name"]
    labels = dataset["labels"]
    print("="*20)
    print(f"{name}")
    print("="*20)
    for est in estimators:
        estimator_name = est["name"]
        print("-"*20)
        print(f"Estimating with {estimator_name}")
        print("-"*20)
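        # YOUR CODE HERE
        estimator = est["estimator"]
        # Fit on this dataset's labels so that "most", "distribution" and
        # "equal" reflect its classes rather than an earlier fit
        estimator.fit(features, labels)
        pred = estimator.predict(features)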
        
        print(f"Produced an accuracy score of {accuracy_score(labels, pred)} and the following report")
        print(classification_report(labels, pred))

Binary Classification Equally Distributed
--------------------
Estimating with Always Zero
--------------------
Produced an accuracy score of 0.5 and the following report
              precision    recall  f1-score   support

           0       0.50      1.00      0.67      7500
           1       0.00      0.00      0.00      7500

   micro avg       0.50      0.50      0.50     15000
   macro avg       0.25      0.50      0.33     15000
weighted avg       0.25      0.50      0.33     15000

--------------------
Estimating with Always One
--------------------
Produced an accuracy score of 0.5 and the following report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      7500
           1       0.50      1.00      0.67      7500

   micro avg       0.50      0.50      0.50     15000
   macro avg       0.25      0.50      0.33     15000
weighted avg       0.25      0.50      0.33     15000

--------------------
Estimating with Always Two

[...]


Produced an accuracy score of 0.5061333333333333 and the following report
              precision    recall  f1-score   support

           0       0.51      0.51      0.51      7500
           1       0.51      0.51      0.51      7500

   micro avg       0.51      0.51      0.51     15000
   macro avg       0.51      0.51      0.51     15000
weighted avg       0.51      0.51      0.51     15000

Binary Classification 90:10
--------------------
Estimating with Always Zero
--------------------
Produced an accuracy score of 0.9 and the following report
              precision    recall  f1-score   support

           0       0.90      1.00      0.95     13500
           1       0.00      0.00      0.00      1500

   micro avg       0.90      0.90      0.90     15000
   macro avg       0.45      0.50      0.47     15000
weighted avg       0.81      0.90      0.85     15000

--------------------
Estimating with Always One
--------------------
Produced an accuracy score of 0.1 and the foll
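
To see where the numbers in these reports come from, the sketch below recomputes precision, recall and F1 by hand for the "Always Zero" estimator on the 90:10 binary dataset. It treats an undefined ratio (a class that is never predicted) as 0, which is consistent with the zeros shown in the reports above. The helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, using 0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [0] * 13500 + [1] * 1500   # the 90:10 binary labels
y_pred = [0] * 15000                # "Always Zero" predictions

metrics = {label: per_class_metrics(y_true, y_pred, label) for label in (0, 1)}
# Macro average: unweighted mean over classes
macro = tuple(sum(m[i] for m in metrics.values()) / len(metrics) for i in range(3))

for label, values in metrics.items():
    print(label, [round(v, 2) for v in values])
print("macro avg", [round(v, 2) for v in macro])
```

The macro-averaged recall of 0.5 shows how little a 0.9 accuracy means here: the estimator finds every 0 and never finds a 1.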

In [43]:
# Please describe your conclusions based on the above results
# You must write at least 300 characters
# This portion is worth 100 points (20% of CS)
# Save your answer to conclusions

# YOUR CODE HERE
conclusions = "In part 1 of case study 2, we learned that an accuracy score on its own can be misleading. A method that scores highly on one dataset will not necessarily score highly on another. In a sense, a particular dataset may be biased towards a specific estimator: always predicting 0 reaches an accuracy of 0.9 on the 90:10 binary dataset but only 0.5 on the equally distributed one. As is common in statistics, an estimator is biased if it favors a specific outcome, and a sample is biased if certain groups are underrepresented or overrepresented relative to the norm. Recognizing how biased and unbiased datasets affect an algorithm is an important part of the statistics behind machine learning."
print(conclusions)

In part 1 of case study 2, we learned that an accuracy score on its own can be misleading. A method that scores highly on one dataset will not necessarily score highly on another. In a sense, a particular dataset may be biased towards a specific estimator: always predicting 0 reaches an accuracy of 0.9 on the 90:10 binary dataset but only 0.5 on the equally distributed one. As is common in statistics, an estimator is biased if it favors a specific outcome, and a sample is biased if certain groups are underrepresented or overrepresented relative to the norm. Recognizing how biased and unbiased datasets affect an algorithm is an important part of the statistics behind machine learning.


In [0]:
assert len(conclusions) > 300

## Feedback

In [0]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    raise NotImplementedError()