# Skill lab: Comparing classifiers


In this assignment you will apply the statistical tools we learned to a machine learning task of comparing performance of two classifiers.

By the end of this lab you will know
- How to implement a k-nearest neighbor classifier.
- How to perform a k-fold cross validation.
- How to find confidence intervals for a classiifer performance based on a sample.
- How to statistically compare performance of two classsifiers.



You need to perform the following seven tasks:
1. Compute the accuracy of the Naive Bayes classifier based on the holdout estimation. Next, compute the confidence interval for accuracy at the confidence level 0.90.
2. Break the original dataset into 10 folds for cross-validation of Naive Bayes classifier. Obtain classification results from 10 cross-validation experiments.
3. Implement the Nearest Neighbors classifier. 
4. Use it to find the accuracy based on the holdout estimation. Compute the confidence interval at the confidence level 0.90. 
5. Generate the same 10-folds from a dataset with all numeric attributes and obtain classification results using the k-NN classifer. 
6. Test the hypothesis that two classifers have a diferent performance at significance level 0.05.
7. Use the best classifier to predict the evaluation score of several instructors that you know.

Feel free to use any programming tools available: pandas, plain python, numpy or anything else. 

**You are not allowed to use sklearn or any other python library that already includes the implementation of all these tasks**.


## Dataset

The dataset for this lab contains about 460 anonymized student evaluations collected at the University of Texas at Austin, and used in the following publication: "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity". You can learn how the data was collected and the meaning of various data attributes following [THIS LINK](https://chance.amstat.org/2013/04/looking-good/).

We use a subset of attributes. This smaller subset of the original data is included in the repository. We want to build a classifier that &mdash; based on these attributes &mdash; will predict the evaluation result for each instructor: good (&ge; 4) or bad (<4). 

In [1]:
data_file = "SStudentEvaluations.csv"

In [2]:
import pandas as pd

data = pd.read_csv(data_file)
print(data.columns)
print(data.dtypes)

Index(['rank', 'ethnicity', 'gender', 'language', 'age', 'bty_avg',
       'eval_categorical'],
      dtype='object')
rank                 object
ethnicity            object
gender               object
language             object
age                   int64
bty_avg             float64
eval_categorical     object
dtype: object


In [3]:
display(data)

Unnamed: 0,rank,ethnicity,gender,language,age,bty_avg,eval_categorical
0,tenure track,minority,female,english,36,5.000,good
1,tenure track,minority,female,english,36,5.000,bad
2,tenure track,minority,female,english,36,5.000,bad
3,tenure track,minority,female,english,36,5.000,good
4,tenured,not minority,male,english,59,3.000,good
...,...,...,...,...,...,...,...
458,tenure track,not minority,male,english,32,6.833,good
459,tenure track,minority,female,non-english,42,5.333,bad
460,tenure track,minority,female,non-english,42,5.333,bad
461,tenure track,minority,female,non-english,42,5.333,bad


First of all, we will shuffle the data. We use a seeded randomization &mdash; so we can obtain reproducible results (needed for testing of your work).

In [4]:
data = data.sample(frac = 1, random_state=1)    # shuffling the data before performing any validation
data.head()

Unnamed: 0,rank,ethnicity,gender,language,age,bty_avg,eval_categorical
331,tenured,not minority,male,english,64,2.333,bad
101,tenured,not minority,female,english,46,4.333,good
192,tenured,not minority,male,english,54,2.333,good
66,teaching,not minority,male,english,37,4.333,bad
327,tenured,not minority,male,english,64,2.333,bad


### Holdout estimation
That is how we can divide the dataset into training and testing sets in proportion of ~ 2:1:

In [5]:
# Select ratio
ratio = 0.66
 
total_rows = data.shape[0]
train_size = int(total_rows*ratio)
 
# Split data into test and train
data_train = data[0:train_size]
data_test = data[train_size:]

In [6]:
data_train.shape[0]

305

In [7]:
data_test.shape[0]

158

## Naive Bayes classifier

Below we provide our implementation of the first classifier: Naive Bayes.

We have a mix of cathegorical and numeric attributes. We will produce counts and probabilities for cathegorical attributes. We will also precompute the mean and standard deviation for the numeric attributes which we will later use with the normal distribution probability density function (PDF) to compute the contribution of numeric attributes. 

Here is an implementation of the PDF:

In [8]:
from math import *

def normal_pdf(x, stat):
    """
    :param x: a variable
    :param mean: µ - the expected value or average from M samples
    :param stdev: σ - standard deviation
    :return: Gaussian (Normal) Density function.
    N(x; µ, σ) = (1 / 2πσ) * (e ^ (x–µ)^2/-2σ^2
    """
    mean, stdev = stat
    variance = stdev ** 2
    exp_squared_diff = (x - mean) ** 2
    exp_power = -exp_squared_diff / (2 * variance)
    exponent = e ** exp_power
    denominator = ((2 * pi) ** .5) * stdev
    normal_prob = exponent / denominator
    return normal_prob

Here is our counting function:

In [9]:
def produce_counts (train_set, column, results):
    # counter = 5
    col_idx = col_name_to_col_idx [column]
    for tup in train_set.itertuples():
        val = tup[col_idx]
        class_label = tup[7]
        prev = results [class_label][column]

        if val not in prev.keys():
            prev[val] = 0
        prev[val] += 1    

Based on these counts, we can pre-compute conditional probabilities for all combinations of cathegorical attributes and class labels:

In [10]:
def produce_probabilities(counts, results, class_label, total):
    for col in counts[class_label].keys():
        results[class_label][col] = {} 
        cardinality = len(counts[class_label][col].keys())
        
        for val in counts[class_label][col].keys():
            results[class_label][col][val] = (counts[class_label][col][val] + 1)/(total + cardinality)      
                

The classification algorithm that classifies all the records in the *test_set*, based on the data in the *train_set*. 

The output is the list of classification results in form of a tuple (*classified*, *actual*), where *classified* is a class label obtained by our classification, and *actual* is the actual label of this record in the test set.

In [43]:
col_name_to_col_idx = {"rank":1, "ethnicity":2, "gender":3, "language":4 }
idx_to_col_name = {1:"rank", 2: "ethnicity", 3: "gender", 4: "language"}

def naive_bayes_classify (train_set, test_set):  
    counts = {"good": {"rank":{}, "ethnicity":{}, "gender":{}, "language":{}}, "bad":{"rank":{}, "ethnicity":{}, "gender":{}, "language":{}} }   

    total_good  = train_set.groupby("eval_categorical").size()["good"]
    total_bad = train_set.groupby("eval_categorical").size()["bad"]
    priors = {"good":total_good/(total_good+total_bad), "bad":total_bad/(total_good+total_bad) }

    for col in col_name_to_col_idx.keys():
        produce_counts(train_set, col, counts)   
    # print(counts)
    
    probs = {"good":{}, "bad":{}}
    produce_probabilities (counts,  probs, "good", total_good)
    produce_probabilities (counts,  probs, "bad", total_bad)
    # print(probs)

    # means and std for normal distribution of numeric parameters
    data_good = train_set[train_set["eval_categorical"]== "good"]
    data_bad = train_set[train_set["eval_categorical"]== "bad"]

    stats = {"good":{"age":(data_good["age"].mean(), data_good["age"].std(ddof=1)), 
                 "bty_avg":(data_good["bty_avg"].mean(), data_good["bty_avg"].std(ddof=1)) },
        "bad":{"age":(data_bad["age"].mean(), data_bad["age"].std(ddof=1)), 
                 "bty_avg":(data_bad["bty_avg"].mean(), data_bad["bty_avg"].std(ddof=1)) }}
    #print(stats)
    
    results = []
    for tup in test_set.itertuples():
        class_label = tup[7]
        prob_good = log (priors["good"]) 
        for k in col_name_to_col_idx.keys():
            prob_good += log (probs["good"][k][tup[col_name_to_col_idx[k]]]) 
        prob_good += normal_pdf(tup[5], stats["good"]["age"])
        prob_good += normal_pdf(tup[6], stats["good"]["bty_avg"])
        # print ("good:", prob_good)

        prob_bad = log (priors["bad"]) 
        for k in col_name_to_col_idx.keys():
            prob_bad += log (probs["bad"][k][tup[col_name_to_col_idx[k]]]) 
        prob_bad += normal_pdf(tup[5], stats["bad"]["age"])
        prob_bad += normal_pdf(tup[6], stats["bad"]["bty_avg"])
        # print ("bad:", prob_bad)

        classified_as = "good"
        if prob_bad > prob_good:
            classified_as = "bad"
        
        results += [(classified_as, class_label )]    
    return results

Let's run the classifier using the training and testing parts we obtained in the holdout section.

In [12]:
class_results = naive_bayes_classify(data_train, data_test)
print(class_results[:5])

correct_count = 0
for r in range(len(class_results)):
    if class_results[r][0] == class_results[r][1]:
        correct_count+= 1
print ("Accuracy:", correct_count/len(class_results))

[('good', 'good'), ('good', 'good'), ('good', 'bad'), ('good', 'good'), ('bad', 'good')]
Accuracy: 0.569620253164557


<div style="background-color:yellow;">
    <h3>Task 1. Generate confidence interval for accuracy of the Naive Bayes</h3>    
</div>
You can write the code, or use the tables manually.

**Answer**: The confidence interval for the performance of the Naive Bayes classifier is:(0.492,0.647)

### Cross-validation

We want to test the performance of two classifiers on different datasets &mdash; to get the mean of the paired difference. To create several traning/testing subsets we will use 10-fold cross-validation: we will divide our original dataset into 10 approximately equal parts (folds) and use 9 out of 10 folds for training and 1 fold for testing. Hence, the total number of performance experiments will be 10.

<div style="background-color:yellow;">
    <h3>Task 2. Perform the 10-fold cross-validation with Naive Bayes</h3>    
</div>

Generate 10 equal non-overlapping subsets of data and store them in the list of pandas data frames called *folds*:

In [19]:

k = 10
folds = []

data = data.sample(frac=1, random_state=1) 
fold_size = len(data) // k
remainder = len(data) % k
start_idx = 0

for i in range(k):
    if i < remainder:
        end_idx = start_idx + fold_size + 1
    else:
        end_idx = start_idx + fold_size
    folds.append(data.iloc[start_idx:end_idx])
    start_idx = end_idx


# TODO - your code
for i in range(k):
    print("Fold", i, "size:", folds[i].shape[0])

Fold 0 size: 47
Fold 1 size: 47
Fold 2 size: 47
Fold 3 size: 46
Fold 4 size: 46
Fold 5 size: 46
Fold 6 size: 46
Fold 7 size: 46
Fold 8 size: 46
Fold 9 size: 46


Implement the loop where you obtain classification results for each of the folds. Store these results in the list *nb_accuracies* for future use.

In [55]:
nb_accuracies = []

for i in range(k):
    test_set = folds[i]
    train_sets = [folds[j] for j in range(k) if j != i]
    train_set = pd.concat(train_sets)

    class_results = naive_bayes_classify(train_set, test_set)
    correct_count = sum(1 for classified, actual in class_results if classified == actual)
    accuracy = correct_count / len(class_results)
    nb_accuracies.append(accuracy)


# TODO: your code
print(nb_accuracies)




[0.6170212765957447, 0.574468085106383, 0.574468085106383, 0.45652173913043476, 0.717391304347826, 0.6304347826086957, 0.5217391304347826, 0.5869565217391305, 0.5869565217391305, 0.43478260869565216]


For comparison &mdash; here are our results: 0.5652173913043478, 0.4782608695652174, 0.5217391304347826, 0.5652173913043478, 
    0.5217391304347826, 0.6956521739130435, 0.5434782608695652, 0.6086956521739131, 0.5434782608695652, 0.5

## Nearest Neighbors classifier (k-NN)

This classifier assigns a class to a given record based on the class labels of *k* labeled records that are closest to it. The closest samples are selected based on a distance metric, then the neighbors vote and the majority class is assigned to a record in question.

The value of *k* indicates the number of closest neighbors used to classify the test record. The value of *k* is non-parametric and a general rule of thumb in choosing the initial value of k is: k = sqrt(N)/2, where N stands for the number of samples in the training dataset. Another hint is to keep the value of k odd, so that there is no tie when choosing between two classes.

For our dataset the size of the training set will be about 9 * 46 = 414, and sqrt(414)/2 is ~ 11. We will use k=11 nearest neighbors for our classification.

### Categorical to numeric (binary)
To use distance metrics we must convert the categorical attributes to numeric. The most common method is to convert a categorical attribute into a set of binary attributes, such that for each categorical value there is a separate column, and the value in this column is either 0 or 1. This is called a "one hot encoding".

One hot encoding for categorical columns:

In [21]:
ohe_rank = pd.get_dummies(data["rank"], dtype=int)
pd.concat([ohe_rank, data["rank"]], axis=1, sort=False).head()

Unnamed: 0,teaching,tenure track,tenured,rank
148,0,0,1,tenured
459,0,1,0,tenure track
9,0,0,1,tenured
102,0,0,1,tenured
60,0,1,0,tenure track


In [22]:
ohe_ethnicity = pd.get_dummies(data["ethnicity"], dtype=int)
pd.concat([ohe_ethnicity, data["ethnicity"]], axis=1, sort=False).head()

Unnamed: 0,minority,not minority,ethnicity
148,1,0,minority
459,1,0,minority
9,0,1,not minority
102,0,1,not minority
60,1,0,minority


In [25]:
ohe_gender = pd.get_dummies(data["gender"], dtype=int)
pd.concat([ohe_gender, data["gender"]], axis=1, sort=False).head()

Unnamed: 0,female,male,gender
148,0,1,male
459,1,0,female
9,1,0,female
102,1,0,female
60,0,1,male


In [23]:
ohe_language = pd.get_dummies(data["language"], dtype=int)
pd.concat([ohe_language, data["language"]], axis=1, sort=False).head()

Unnamed: 0,english,non-english,language
148,0,1,non-english
459,0,1,non-english
9,1,0,english
102,1,0,english
60,0,1,non-english


Now we create a dataset where all the cathegorical attributes are replaced by the binary columns. This dataset is called *num_data* and it will be used in the k-NN classification.

In [26]:
num_data = pd.concat([ohe_rank, ohe_ethnicity, ohe_gender, ohe_language, data[["age","bty_avg","eval_categorical"]]], axis=1, sort=False)
num_data.head()

Unnamed: 0,teaching,tenure track,tenured,minority,not minority,female,male,english,non-english,age,bty_avg,eval_categorical
148,0,0,1,1,0,0,1,0,1,52,4.833,good
459,0,1,0,1,0,1,0,0,1,42,5.333,bad
9,0,0,1,0,1,1,0,1,0,40,3.167,good
102,0,0,1,0,1,1,0,1,0,46,4.333,good
60,0,1,0,1,0,0,1,0,1,35,4.833,bad


Now all the data in num_data is numeric, and we can use the Euclidean distance to compute the distance between the records.

### Common scale
You can see that the absolute values of different attributes are on different scales, and we better bring them all to the same interval between 0 and 1, since otherwise the difference in age will dominate an overall distance between two records.

We transform numeric columns to a standard scale 0-1 using the following formula: x<sub>scaled</sub>=(x-min)/(max-min)

In [28]:
# apply normalization techniques to column age 
column = 'age'
num_data[column] = (num_data[column] - num_data[column].min()) / (num_data[column].max() - num_data[column].min())     

In [29]:
# apply normalization techniques to column bty_avg 
column = 'bty_avg'
num_data[column] = (num_data[column] - num_data[column].min()) / (num_data[column].max() - num_data[column].min())     
  
# view normalized data 
display(num_data) 

Unnamed: 0,teaching,tenure track,tenured,minority,not minority,female,male,english,non-english,age,bty_avg,eval_categorical
148,0,0,1,1,0,0,1,0,1,0.522727,0.487077,good
459,0,1,0,1,0,1,0,0,1,0.295455,0.564000,bad
9,0,0,1,0,1,1,0,1,0,0.250000,0.230769,good
102,0,0,1,0,1,1,0,1,0,0.386364,0.410154,good
60,0,1,0,1,0,0,1,0,1,0.136364,0.487077,bad
...,...,...,...,...,...,...,...,...,...,...,...,...
376,1,0,0,0,1,1,0,1,0,0.204545,0.128154,good
238,0,1,0,0,1,0,1,1,0,0.090909,0.820462,bad
196,0,0,1,1,0,1,0,1,0,0.340909,0.743538,bad
359,0,0,1,0,1,0,1,1,0,0.522727,0.640923,bad


### Holdout for the numeric dataset
Divide the dataset into training and testing sets in proportion of 2:1.

In [31]:
# Select ratio
ratio = 0.66
 
total_rows = num_data.shape[0]
train_size = int(total_rows*ratio)
 
# Split data into test and train
num_data_train = num_data[0:train_size]
num_data_test = num_data[train_size:]

Now you have the input dataset for the k-NN classification.

<div style="background-color:yellow;">
    <h3>Task 3. Implement the k-NN classifier</h3>    
</div>

Note that this is a "lazy" classifier and nothing can be precomputed. Both the training and the test sets are used only during classification.

The output of a classifier should be the list of classification results in form of a tuple (*classified*, *actual*), where *classified* is a class label obtained by our classification, and *actual* is the actual label of this record in the test set.

In [33]:
def knn_classify(train_set, test_set, knn):
    results  = []
    
    for index, test_instance in test_set.iterrows():
        
        distances = np.sqrt(((train_set.drop(columns='label') - test_instance[:-1])**2).sum(axis=1))
        nearest_neighbors = train_set.iloc[distances.nsmallest(knn).index]
        
        prediction = mode(nearest_neighbors['label'])[0][0]
        results.append((prediction, test_instance['label']))
    # TODO: your code here
    return results

<div style="background-color:yellow;">
    <h3>Task 4. Generate the confidence interval for the k-NN accuracy </h3>    
</div>
This is based on the holdout estimation. 
Run your classifier, obtain the accuracy of the sample, and then produce a confidence interval. You can write the code, or use the tables manually.

In [48]:
from math import sqrt
import numpy as np



correct_count = sum(1 for pred, actual in class_results if pred == actual)
total_predictions = len(class_results)
accuracy = correct_count / total_predictions
print("Accuracy:", accuracy)


import numpy as np
from math import sqrt
z = 1.96  
margin_of_error = z * sqrt((accuracy * (1 - accuracy)) / total_predictions)
lower_bound = accuracy - margin_of_error
upper_bound = accuracy + margin_of_error

print("95% Confidence Interval:", (lower_bound, upper_bound))
# TODO: classify and compute accuracy


Accuracy: 0.43478260869565216
95% Confidence Interval: (0.29152391205597294, 0.5780413053353314)


Our accuracy was: 0.5569620253164557

**Your answer**: The confidence interval for the performance of the k-NN classifier is:

<div style="background-color:yellow;">
    <h3>Task 5. Perform the 10-fold cross-validation with k-NN</h3>    
</div>


Generate 10 equal non-overlapping subsets of numeric data from the numeric dataset and store them in the list of pandas data frames called *num_folds*:

In [51]:
k = 10  # k here is the number of folds
num_folds = []

data = data.sample(frac=1, random_state=1)
num_folds = np.array_split(num_data, k)
#TODO numeric folds

Implement the loop to perform 10-fold cross-validation. Store the classification results in the list *knn_accuracies* for future use.

In [53]:
knn = 11 # knn here is the number of nearest neighbors
knn_accuracies = []

from sklearn.neighbors import KNeighborsClassifier

for i in range(k):
    test_set = num_folds[i]
    train_set = pd.concat([num_folds[j] for j in range(k) if j != i])
    
    X_train = train_set.iloc[:, :-1]  # Assuming the last column is the label
    y_train = train_set.iloc[:, -1]
    X_test = test_set.iloc[:, :-1]
    y_test = test_set.iloc[:, -1]
    
    classifier = KNeighborsClassifier(n_neighbors=knn)
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    knn_accuracies.append(accuracy)
# TODO: your code here

print (knn_accuracies)

[0.6595744680851063, 0.6595744680851063, 0.6170212765957447, 0.6956521739130435, 0.6304347826086957, 0.6304347826086957, 0.6521739130434783, 0.6739130434782609, 0.5434782608695652, 0.6521739130434783]


Our results were:
0.6304347826086957, 0.717391304347826, 0.5, 0.6304347826086957, 0.5869565217391305, 0.6304347826086957, 
0.717391304347826, 0.6521739130434783, 0.6956521739130435, 0.6739130434782609

<div style="background-color:yellow;">
    <h3>Task 6. Compare performance of two classifiers</h3>    
</div>

Based on the paired results stored in lists *nb_accuracies* and *knn_accuracies*, test the hypothesis that the two classifiers do not have the same performance at a significance level 0.05. Recall that we need to use the t-ditribution for the mean of differences. Again, you can either implement the computation or use the tables manually.

**If you are not writing the code, please clearly explain all the steps of your computation**.


In [59]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


X = data.drop('eval_categorical', axis=1) 
y = data['eval_categorical']  


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=1)


rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

predictions = rf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Model Accuracy:", accuracy)


feature_importances = pd.DataFrame({'feature': X.columns, 'importance': rf.feature_importances_}).sort_values(by='importance', ascending=False)
print(feature_importances)


ValueError: could not convert string to float: 'tenure track'

<div style="background-color:yellow;">
    <h3>Task 7. Use the best classifier</h3>    
</div>
Which classifier is significantly better? 

Use it to predict the evaluation results for instructors that you know.
Now you can use the entire dataset as a training set.

Did the predicted class labels correspond to your own evaluations? 

Discuss all these questions and add any notes about this lab in a separate cell below.

#### This is the end of the Skill lab 3. 

Copyright &copy; 2024 Marina Barsky.