# COMP24111 - Exercise 2: News Article Classification

## 1. Task description

You will work on a news article classification task.
The provided dataset includes a total of 800 articles taken from Reuters newswire.
They belong to 4 classes: "earn" (1), "crude" (2), "trade" (3) and "interest" (4).
There are 200 articles per class.
Each article is characterised by word occurrences.
The list of used words is called a vocabulary.
In our dataset, the vocabulary includes a total of 6428 words. 

## 2. Preparation

First we need to import the data.
Run the below cell to load the data using NumPy.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse

data, labels, class_names, vocabulary = np.load("ReutersNews_4Classes_sparse.npy", allow_pickle=True)

### A Note on Sparsity

Most documents only contain a small subset of the vocabulary, resulting in a very sparse data matrix.
To take advantage of the sparsity, in this exercise `data` is represented as a `scipy.sparse.csr_matrix`, which can store sparse matrices efficiently while still allowing efficient row-based indexing.
You can learn more about `csr_matrix` and other ways of dealing with sparse matrices at https://docs.scipy.org/doc/scipy/reference/sparse.html.

Note, however, that `data` is **not** a normal NumPy array.
While most operations will be the same as with a normal dense array, **you cannot use a sparse matrix to index another matrix**.
If you need to do this, either first convert the matrix to a NumPy array with the `toarray()` method, or use methods specifically designed to work with sparse matrices.

In [2]:
print(data[41]) # Sparse, will print the non-zero indices and their values.
print(data[41].toarray()) # Convert back to a NumPy array. Note that the result is a (1, 6428) matrix, not a vector.
# print(vocabulary[data[41,:] > 0]) # Can't index vocabulary with a sparse matrix.
rows, columns, values = scipy.sparse.find(data[41,:]) # Find the non-zero entries in the 42nd document.
print(vocabulary[columns]) # Prints the words present in the 42nd document.

  (0, 2)	1
  (0, 3)	3
  (0, 5)	1
  (0, 8)	1
  (0, 10)	1
  (0, 11)	1
  (0, 12)	1
  (0, 13)	1
  (0, 21)	2
  (0, 24)	1
  (0, 105)	1
  (0, 127)	1
  (0, 227)	1
  (0, 275)	1
  (0, 334)	2
  (0, 341)	1
  (0, 348)	1
  (0, 359)	1
  (0, 411)	1
  (0, 426)	1
  (0, 1428)	1
  (0, 2058)	1
  (0, 5555)	1
[[0 0 1 ... 0 0 0]]
['share' 'split' 'say' 'two-for-one' 'shareholder' 'annual' 'meeting'
 'reuter' 'ct' 'note' 'company' 'pay' 'subject' 'increase' 'stock'
 'dividend' 'april' 'northern' 'declare' 'approval' 'telecom' 'post-split'
 'nt']


To see the full vocabulary, you can run

In [3]:
print(", ".join(vocabulary))



You can see how many times article $i$ contains word $j$ using

In [4]:
i, j = 40, 2
print(data[i,j])

4


You can see which class the $i$th article belongs to using

In [5]:
print(labels[i])

0


For instance, by running

In [6]:
print("Occurrences:", data[109,10])
print("Class:", class_names[labels[0]])
print("Word:", vocabulary[66])

Occurrences: 0
Class: earn
Word: lead


In [7]:
you can see that the 11th word appears twice in the first document, the first document belongs to the class "earn", and the 11th word is "shareholder".

SyntaxError: invalid syntax (<ipython-input-7-3dad177b4b8a>, line 1)

The following function randomly selects a subset of the data.

In [14]:
def sample_indices(labels, *num_per_class):
    """
    Returns randomly selected indices. It will return the specified number of indices for each class.
    """
    indices = []
    for cls, num in enumerate(num_per_class):
        cls_indices = np.where(labels == cls)[0]
        indices.extend(np.random.choice(cls_indices, size=num, replace=False))
    return np.array(indices)

For instance, to get one sample from the first class, two from the second, three from the third, and four from the fourth, you can run:

In [15]:
indices = sample_indices(labels, 1, 2, 3, 4)
print("Returned indices:", indices)
print("Samples:", data[indices])
print("Corresponding classes:", labels[indices])

Returned indices: [ 51 378 244 444 414 474 725 702 682 677]
Samples:   (0, 6148)	2
  (0, 4794)	2
  (0, 1338)	1
  (0, 1280)	2
  (0, 1097)	1
  (0, 1041)	3
  (0, 814)	1
  (0, 205)	2
  (0, 184)	1
  (0, 171)	1
  (0, 75)	1
  (0, 73)	2
  (0, 33)	1
  (0, 30)	1
  (0, 23)	1
  (0, 15)	1
  (0, 14)	1
  (0, 13)	1
  (0, 5)	1
  (1, 5667)	1
  (1, 4809)	1
  (1, 4510)	1
  (1, 4258)	1
  (1, 2373)	1
  (1, 2369)	1
  :	:
  (8, 25)	1
  (8, 13)	1
  (8, 5)	2
  (9, 3115)	1
  (9, 3033)	1
  (9, 2154)	1
  (9, 1821)	1
  (9, 1717)	1
  (9, 1697)	2
  (9, 1183)	2
  (9, 1092)	2
  (9, 984)	2
  (9, 978)	1
  (9, 676)	1
  (9, 668)	1
  (9, 641)	2
  (9, 623)	1
  (9, 409)	1
  (9, 332)	2
  (9, 290)	2
  (9, 221)	1
  (9, 215)	1
  (9, 13)	1
  (9, 12)	1
  (9, 5)	1
Corresponding classes: [0 1 1 2 2 2 3 3 3 3]


## 3. k-NN implementation

Now, you will need to implement a k-NN classifier by filling the code below.
This function should support two types of distance measures: Euclidean distance and cosine distance.
It should take a set of training samples, a user-specified neighour number, a distance option, and features of a set of testing samples as the input.
It should return the predicted classes for the input set of testing samples.

In order to complete this function, you will need the `sklearn.metrics.pairwise_distances` function which can handle sparse matrices, below imported as `cdist` to follow SciPy conventions (not to be confused with the `pdist` function).
You should also research NumPy functions relating to sorting.

**Your implementation must NOT make use of Python loops over individual samples**.
You should use functions that operate on whole matrices, as this will be much faster than looping in Python.

In [16]:
import scipy.stats as stats
from sklearn.metrics import pairwise_distances as cdist

def knn_classify(test_samples, training_data, training_labels, metric="euclidean", k=1):
    """
    Performs k-nearest neighbour classification on the provided samples,
    given training data and the corresponding labels.
    
    test_samples: An m x d matrix of m samples to classify, each with d features.
    training_data: An n x d matrix consisting of n training samples, each with d features.
    training_labels: A vector of size n, where training_labels[i] is the label of training_data[i].
    metric: The metric to use for calculating distances between samples.
    k: The number of nearest neighbours to use for classification.
    
    Returns: A vector of size m, where out[i] is the predicted class of test_samples[i].
    
       
    # Return the most frequent class on each row.
    # Note: Ensure that the returned vector does not contain any empty dimensions.
    # You may find the squeeze method useful here.
    #  return 
    """
    
    
    # Calculate an m x n distance matrix.
    pairwise_distance = cdist(test_samples, training_data, metric)

    
    # Find the k nearest neighbours of each samples as an m x k matrix of indices.
    nearest_neighbours = np.argsort(pairwise_distance, axis=1)[:,:k]

    

    
    # Look up the classes corresponding to each index.
    nearest_labels = training_labels[nearest_neighbours]




 

    outputArray = stats.mode(nearest_labels, axis=1).mode

    return np.squeeze(outputArray)
    
    

## 4. Experiments

Use your k-NN function to perform the following experiments.

### Experiment 1

Randomly select 80 articles per class for training, and use the remaining articles for testing.
Select an appropriate neighbour number.
Train your k-NN classifier using the Euclidean distance and test it.
Repeat this process 20 times (trials).
Calculate the mean and standard deviation of the testing accuracies.

In [17]:
k = 3
metric="euclidean"

allsamples = (sample_indices(labels, 200, 200, 200, 200))
training_samples = (sample_indices(labels, 80, 80, 80, 80))
testing_samples = np.setdiff1d(allsamples, training_samples)



result = (knn_classify(data[testing_samples], data[training_samples], labels[training_samples], metric, k))
arrayMeans = []
for i in range (20):
    allsamples = (sample_indices(labels, 200, 200, 200, 200))
    training_samples = (sample_indices(labels, 80, 80, 80, 80))
    testing_samples = np.setdiff1d(allsamples, training_samples)
    result = (knn_classify(data[testing_samples], data[training_samples], labels[training_samples], metric, k))
    acc = np.sum(result == labels[testing_samples]) / len(result)
    arrayMeans.append(acc)

print("NPSUM METHOD : Accuracy of K-NN was found at: " + str(np.mean(arrayMeans) * 100))
print("NPSUM METHOD : STD of K-NN was found at: " + str(np.std(arrayMeans)))



"""
correctCount = 0
totalCount = 0



for i in range (len(result)):
    if result[i] == labels[testing_samples[i]]:
        correctCount += 1
        totalCount+= 1
    else:
        print("ERROR FINDER: CLASS" + str(result[i]) + " was found INCORRECTLY as CLASS[" + str(labels[testing_samples[i]]) +"]")
        totalCount+= 1
        
print("ERROR FINDER: Total Samples checked = " + str(totalCount))
print("ERROR FINDER: Correct Samples checked = " + str(correctCount))
incorrectCount = (totalCount - correctCount)
print("ERROR FINDER: Incorrect Samples checked = " + str(totalCount - correctCount))
accuracy = ((correctCount/totalCount) * 100)
print("ERROR FINDER: Accuracy of K-NN was found at: " + str(accuracy))

"""






            

    
    


NPSUM METHOD : Accuracy of K-NN was found at: 86.54166666666667
NPSUM METHOD : STD of K-NN was found at: 0.021684688018148757


'\ncorrectCount = 0\ntotalCount = 0\n\n\n\nfor i in range (len(result)):\n    if result[i] == labels[testing_samples[i]]:\n        correctCount += 1\n        totalCount+= 1\n    else:\n        print("ERROR FINDER: CLASS" + str(result[i]) + " was found INCORRECTLY as CLASS[" + str(labels[testing_samples[i]]) +"]")\n        totalCount+= 1\n        \nprint("ERROR FINDER: Total Samples checked = " + str(totalCount))\nprint("ERROR FINDER: Correct Samples checked = " + str(correctCount))\nincorrectCount = (totalCount - correctCount)\nprint("ERROR FINDER: Incorrect Samples checked = " + str(totalCount - correctCount))\naccuracy = ((correctCount/totalCount) * 100)\nprint("ERROR FINDER: Accuracy of K-NN was found at: " + str(accuracy))\n\n'

Use the same neighbour number, but use the cosine distance instead of the Euclidean distance.
Repeat the same experiment.

In [18]:
k = 3
metric="cosine"

allsamples = (sample_indices(labels, 200, 200, 200, 200))
training_samples = (sample_indices(labels, 80, 80, 80, 80))
testing_samples = np.setdiff1d(allsamples, training_samples)



result = (knn_classify(data[testing_samples], data[training_samples], labels[training_samples], metric, k))
arrayMeans = []
for i in range (20):
    allsamples = (sample_indices(labels, 200, 200, 200, 200))
    training_samples = (sample_indices(labels, 80, 80, 80, 80))
    testing_samples = np.setdiff1d(allsamples, training_samples)
    result = (knn_classify(data[testing_samples], data[training_samples], labels[training_samples], metric, k))
    acc = np.sum(result == labels[testing_samples]) / len(result)
    arrayMeans.append(acc)
    
    
    

print("NPSUM METHOD : Accuracy of K-NN was found at: " + str(np.mean(arrayMeans) * 100))
print("NPSUM METHOD : STD of K-NN was found at: " + str(np.std(arrayMeans)))

# Your code goes here

NPSUM METHOD : Accuracy of K-NN was found at: 96.01041666666667
NPSUM METHOD : STD of K-NN was found at: 0.009822093682385416


Which distance measure gives better performance?

### Experiment 2

Using the distance measure that you found performs better, repeat the same experiment, varying the neighbour number $k$ from 1 to 50.
This time, record the average training errors and standard deviation over 20 trials, for different values of $k$.
Do the same for testing errors.

Produce an error bar plot showing the training accuracy for each $k$ here:

In [19]:
k = 3
metric="cosine"



trainingErrors = []
arrayAccuracy = []
acArray = []
stdArray = []
meanKarray = []



for ix in range (50):
    k = ix+1

    
    for i in range (20):

        allsamples = (sample_indices(labels, 200, 200, 200, 200))
        training_samples = (sample_indices(labels, 80, 80, 80, 80))
        testing_samples = np.setdiff1d(allsamples, training_samples)
        result = (knn_classify(data[training_samples], data[training_samples], labels[training_samples], metric, k))
        
        
        acc = np.sum(result == labels[training_samples]) / len(result)
        acArray.append(acc)
        
        """
        # Initialise Variables
        errorCount = 0
        correctCount = 0
        totalCount = 0

         # Change occurs here for testing samples to compare
        for i in range (len(result)):
            if result[i] == labels[training_samples[i]]:
                correctCount += 1
                totalCount+= 1
            else:
                errorCount += 1
                totalCount+= 1
                

                
        trainingErrors.append(errorCount)
        trainingErrors = []
        
        accuracy = ((correctCount/result.size))
        acArray.append(accuracy)
        """
        
        
    print("K Value: " + str(k) + " | Training Accuracy: " + str(np.mean(acArray)) + "| Standard Deviations: " + str(np.std(acArray)))
    meanKarray.append(np.mean(acArray))
    stdArray.append(np.std(acArray))
    acArray = []

        



trainingErrors = []
arrayAccuracy = []
acArray2 = []
meanKarray2 = []
stdArray2 = []
        
for ix in range (50):
    k = ix+1
   

    
    
    for i in range (20):

        allsamples = (sample_indices(labels, 200, 200, 200, 200))
        training_samples = (sample_indices(labels, 80, 80, 80, 80))
        testing_samples = np.setdiff1d(allsamples, training_samples)
        result = (knn_classify(data[testing_samples], data[training_samples], labels[training_samples], metric, k))
        
        """
        # Initialise Variables
        errorCount = 0
        correctCount = 0
        totalCount = 0

         # Change occurs here for testing samples to compare
        for i in range (len(result)):
            if result[i] == labels[testing_samples[i]]:
                correctCount += 1
                totalCount+= 1
            else:
                errorCount += 1
                totalCount+= 1
                

                
        trainingErrors.append(errorCount)
        trainingErrors = []
        accuracy = ((correctCount/result.size))
        
        acArray2.append(acc)
        """
        
        acc = np.sum(result == labels[testing_samples]) / len(result)
        acArray2.append(acc)
        
        
    print("K Value: " + str(k) + " | Testing Accuracy: " + str(np.mean(acArray2)) + "| Standard Deviations: " + str(np.std(acArray2)))
    meanKarray2.append(np.mean(acArray2))
    stdArray2.append(np.std(acArray2))
    acArray2 = []

        


    
        
        
        
        
        

        
        
        
        
        
        
        


K Value: 1 | Training Accuracy: 1.0| Standard Deviations: 0.0
K Value: 2 | Training Accuracy: 0.9829687500000001| Standard Deviations: 0.007873201679590077
K Value: 3 | Training Accuracy: 0.9821875| Standard Deviations: 0.006782042004440853
K Value: 4 | Training Accuracy: 0.97890625| Standard Deviations: 0.006240226733661207
K Value: 5 | Training Accuracy: 0.9768749999999999| Standard Deviations: 0.008580956386091229
K Value: 6 | Training Accuracy: 0.97234375| Standard Deviations: 0.008284197588632235
K Value: 7 | Training Accuracy: 0.97046875| Standard Deviations: 0.0077481726676358964
K Value: 8 | Training Accuracy: 0.9692187500000001| Standard Deviations: 0.0076721773107443205
K Value: 9 | Training Accuracy: 0.9682812500000001| Standard Deviations: 0.006348828016846886
K Value: 10 | Training Accuracy: 0.9709375| Standard Deviations: 0.00791186806323262
K Value: 11 | Training Accuracy: 0.9624999999999998| Standard Deviations: 0.009217425752345382
K Value: 12 | Training Accuracy: 0.96

K Value: 41 | Testing Accuracy: 0.9356250000000002| Standard Deviations: 0.010843344733060912
K Value: 42 | Testing Accuracy: 0.9371875| Standard Deviations: 0.009063697238924556


KeyboardInterrupt: 

In [None]:
x = np.arange(1, 51)
y = meanKarray
plt.xlabel('K parameter')
plt.ylabel('Accuracy ')
plt.errorbar(x, y, yerr=stdArray, ecolor='r')

plt.title('A plot of K versus Accuracy for training')
plt.show()


Produce your testing error bar plot here:

In [None]:
x = np.arange(1, 51)
y = meanKarray2
plt.xlabel('K parameter')
plt.ylabel('Accruacy %')
plt.errorbar(x, y, yerr=stdArray2, ecolor='r')
plt.title('A plot of K versus Accuracy for testing Accuracy')
plt.show()


**Remember that all graphs should have axis labels and a title.**

Now, answer a few questions according to what you have observed.

Q1. What is the training accuracy obtained when $k=1$? Explain it.

Q2. Do the testing and training accuracies differ, and why?

Q3. How do the accuracies change as $k$ gets bigger, and why?

### Experiment 3

Compare three 5-NN classifiers using cosine distance.
First, randomly select 100 articles per class and keep these as your testing samples.

In [112]:
allsamples = (sample_indices(labels, 200, 200, 200, 200))
test_samples = (sample_indices(labels, 100, 100, 100, 100))

Then do the following:

(1) Train the first classifier using all the remaining articles.
Compute the confusion matrix for the 4 classes using the testing samples.

In [156]:
train_samples = np.setdiff1d(allsamples, training_samples)
firstClassifier = (knn_classify(data[test_samples], data[train_samples], labels[train_samples], "cosine", 5))

# Predict that the first 100 samples(class 0) Will be equal 
def truePositive(matrix1):
    return np.sum(matrix1)

# Predict False Negatives
def falseNegative(sampleperclassnumber, matrix1):
    return sampleperclassnumber - truePositive(matrix1)

def falsePositive(matrix,testmatrix, classnumber,boundv):
    return (np.bincount(matrix[boundv:] == classnumber)[1])

def trueNegative(matrix,testmatrix,classnumber,boundv):
    return np.bincount(matrix[boundv:] != classnumber)[1]




# Predict true positives for first 100 samples
class0Matrix = (firstClassifier[:100] == labels[test_samples][:100])
tp0 = truePositive(class0Matrix)



# Predict false Negative for first 100 samples
fn0 = falseNegative(100, class0Matrix)

boundv = 100
# Predict false Positive for 0
fp0 = falsePositive(firstClassifier,test_samples, 0,boundv)
    
        
# Predict True Negative for 0
tn0 = trueNegative(firstClassifier,test_samples, 0,boundv)

confusionMatrixClass0 = np.array([[tp0,fp0],[fn0,tn0]])
print(confusionMatrixClass0)






def otherfalsePositive(matrix,testmatrix,classnumber, bound1,bound2,bound3,bound4):
    matrixa = matrix[bound1:bound2]
    matrixb = matrix[bound3:bound4]
    newarray = np.concatenate((matrixa,matrixb), axis=0)
    return (np.bincount(newarray == classnumber)[1])

def othertrueNegative(matrix,testmatrix,classnumber,bound1,bound2,bound3,bound4):
    matrixa = matrix[bound1:bound2]
    matrixb = matrix[bound3:bound4]
    newarray = np.concatenate((matrixa,matrixb), axis=None)
    return (np.bincount(newarray != classnumber)[1])




# Predict true positives for first 100 samples
class1Matrix = (firstClassifier[100:200] == labels[test_samples][100:200])
tp1 = truePositive(class1Matrix)



# Predict false Negative for first 100 samples
fn1 = falseNegative(100, class1Matrix)

bound1 = 0
bound2 = 100
bound3 = 200
bound4 = 400
# Predict false Positive for 1
fp1 = otherfalsePositive(firstClassifier,test_samples, 1, bound1, bound2, bound3, bound4)
    
            
# Predict True Negative for 1
tn1 = othertrueNegative(firstClassifier,test_samples, 1, bound1, bound2, bound3, bound4)

confusionMatrixClass1 = np.array([[tp1,fp1],[fn1,tn1]])
print(confusionMatrixClass1)













# Predict true positives for first 100 samples
class2Matrix = (firstClassifier[200:300] == labels[test_samples][200:300])
tp2 = truePositive(class2Matrix)



# Predict false Negative for first 100 samples
fn2 = falseNegative(100, class2Matrix)

bound1 = 0
bound2 = 200
bound3 = 300
bound4 = 400
# Predict false Positive for 1
fp2 = otherfalsePositive(firstClassifier,test_samples, 2, bound1, bound2, bound3, bound4)
    
            
# Predict True Negative for 1
tn2 = othertrueNegative(firstClassifier,test_samples, 2, bound1, bound2, bound3, bound4)

confusionMatrixClass2 = np.array([[tp2,fp2],[fn2,tn2]])
print(confusionMatrixClass2)






def extrafalsePositive(matrix,testmatrix,classnumber, bound1,bound2):
    matrixa = matrix[bound1:bound2]
    return (np.bincount(matrixa == classnumber)[1])

def extratrueNegative(matrix,testmatrix,classnumber,bound1,bound2):
    matrixa = matrix[bound1:bound2]
    return (np.bincount(matrixa != classnumber)[1])



# Predict true positives for first 100 samples
class3Matrix = (firstClassifier[300:400] == labels[test_samples][300:400])
tp3 = truePositive(class3Matrix)



# Predict false Negative for first 100 samples
fn3 = falseNegative(100, class3Matrix)

bound1 = 0
bound2 = 300
# Predict false Positive for 1
fp3 = extrafalsePositive(firstClassifier,test_samples, 3, bound1, bound2)
    
            
# Predict True Negative for 1
tn3 = extratrueNegative(firstClassifier,test_samples, 3, bound1, bound2)

confusionMatrixClass3 = np.array([[tp3,fp3],[fn3,tn3]])
print(confusionMatrixClass3)










[[ 99   5]
 [  1 295]]
[[ 99   1]
 [  1 299]]
[[ 96   2]
 [  4 298]]
[[ 97   1]
 [  3 299]]


(2) Randomly remove 95 training articles from class 2.
Train the second classifier using the reduced training samples.
Compute the confusion matrix for the 4 classes using the testing samples.

In [162]:
random_train_samples = np.random.choice(train_samples[100:200],5)
new_train_samples = np.concatenate((train_samples[0:100],random_train_samples, train_samples[200:400]), axis=None)

newClassifier = (knn_classify(data[test_samples], data[new_train_samples], labels[new_train_samples], "cosine", 5))

# Predict that the first 100 samples(class 0) Will be equal 
def truePositive(matrix1):
    return np.sum(matrix1)

# Predict False Negatives
def falseNegative(sampleperclassnumber, matrix1):
    return sampleperclassnumber - truePositive(matrix1)

def falsePositive(matrix,testmatrix, classnumber,boundv):
    return (np.bincount(matrix[boundv:] == classnumber)[1])

def trueNegative(matrix,testmatrix,classnumber,boundv):
    return np.bincount(matrix[boundv:] != classnumber)[1]




# Predict true positives for first 100 samples
class0Matrix = (newClassifier[:100] == labels[test_samples][:100])
tp0 = truePositive(class0Matrix)



# Predict false Negative for first 100 samples
fn0 = falseNegative(100, class0Matrix)

boundv = 100
# Predict false Positive for 0
fp0 = falsePositive(newClassifier,test_samples, 0,boundv)
    
        
# Predict True Negative for 0
tn0 = trueNegative(newClassifier,test_samples, 0,boundv)

confusionMatrixClass0 = np.array([[tp0,fp0],[fn0,tn0]])
print(confusionMatrixClass0)






def otherfalsePositive(matrix,testmatrix,classnumber, bound1,bound2,bound3,bound4):
    matrixa = matrix[bound1:bound2]
    matrixb = matrix[bound3:bound4]
    newarray = np.concatenate((matrixa,matrixb), axis=None)
    return (np.bincount(newarray == classnumber)[1])

def othertrueNegative(matrix,testmatrix,classnumber,bound1,bound2,bound3,bound4):
    matrixa = matrix[bound1:bound2]
    matrixb = matrix[bound3:bound4]
    newarray = np.concatenate((matrixa,matrixb), axis=None)
    return (np.bincount(newarray != classnumber)[1])




# Predict true positives for first 100 samples
class1Matrix = (newClassifier[100:200] == labels[test_samples][100:200])
tp1 = truePositive(class1Matrix)



# Predict false Negative for first 100 samples
fn1 = falseNegative(100, class1Matrix)

bound1 = 0
bound2 = 100
bound3 = 200
bound4 = 400
# Predict false Positive for 1
fp1 = otherfalsePositive(newClassifier,test_samples, 1, bound1, bound2, bound3, bound4)
    
            
# Predict True Negative for 1
tn1 = othertrueNegative(newClassifier,test_samples, 1, bound1, bound2, bound3, bound4)

confusionMatrixClass1 = np.array([[tp1,fp1],[fn1,tn1]])
print(confusionMatrixClass1)













# Predict true positives for first 100 samples
class2Matrix = (newClassifier[200:300] == labels[test_samples][200:300])
tp2 = truePositive(class2Matrix)



# Predict false Negative for first 100 samples
fn2 = falseNegative(100, class2Matrix)

bound1 = 0
bound2 = 200
bound3 = 300
bound4 = 400
# Predict false Positive for 1
fp2 = otherfalsePositive(newClassifier,test_samples, 2, bound1, bound2, bound3, bound4)
    
            
# Predict True Negative for 1
tn2 = othertrueNegative(newClassifier,test_samples, 2, bound1, bound2, bound3, bound4)

confusionMatrixClass2 = np.array([[tp2,fp2],[fn2,tn2]])
print(confusionMatrixClass2)






def extrafalsePositive(matrix,testmatrix,classnumber, bound1,bound2):
    matrixa = matrix[bound1:bound2]
    return (np.bincount(matrixa == classnumber)[1])

def extratrueNegative(matrix,testmatrix,classnumber,bound1,bound2):
    matrixa = matrix[bound1:bound2]
    return (np.bincount(matrixa != classnumber)[1])



# Predict true positives for first 100 samples
class3Matrix = (newClassifier[300:400] == labels[test_samples][300:400])
tp3 = truePositive(class3Matrix)



# Predict false Negative for first 100 samples
fn3 = falseNegative(100, class3Matrix)

bound1 = 0
bound2 = 300
# Predict false Positive for 1
fp3 = extrafalsePositive(newClassifier,test_samples, 3, bound1, bound2)
    
            
# Predict True Negative for 1
tn3 = extratrueNegative(newClassifier,test_samples, 3, bound1, bound2)

confusionMatrixClass3 = np.array([[tp3,fp3],[fn3,tn3]])
print(confusionMatrixClass3)







[[ 98  15]
 [  2 285]]
[[ 85   2]
 [ 15 298]]
[[ 98  12]
 [  2 288]]
[[ 89   1]
 [ 11 299]]


(3) Redo (2), but randomly remove 95 training articles from *all* the classes.
Train the third classifier using the new training data.
Compute the confusion matrix for the 4 classes using the testing samples.

In [180]:
random_train_samples0 = np.random.choice(train_samples[:100],5)
random_train_samples1 = np.random.choice(train_samples[100:200],5)
random_train_samples2 = np.random.choice(train_samples[200:300],5)
random_train_samples3 = np.random.choice(train_samples[300:400],5)
new_train_samples = np.concatenate((random_train_samples0,random_train_samples1, random_train_samples2, random_train_samples3), axis=None)

thirdClassifier = (knn_classify(data[test_samples], data[new_train_samples], labels[new_train_samples], "cosine", 5))

# Predict that the first 100 samples(class 0) Will be equal 
def truePositive(matrix1):
    return np.sum(matrix1)

# Predict False Negatives
def falseNegative(sampleperclassnumber, matrix1):
    return sampleperclassnumber - truePositive(matrix1)

def falsePositive(matrix,testmatrix, classnumber,boundv):
    return (np.bincount(matrix[boundv:] == classnumber)[1])

def trueNegative(matrix,testmatrix,classnumber,boundv):
    return np.bincount(matrix[boundv:] != classnumber)[1]




# Predict true positives for first 100 samples
class0Matrix = (thirdClassifier[:5] == labels[test_samples][:5])
tp0 = truePositive(class0Matrix)



# Predict false Negative for first 100 samples
fn0 = falseNegative(15, class0Matrix)

boundv = 5
# Predict false Positive for 0
fp0 = falsePositive(thirdClassifier,test_samples, 0,boundv)
    
        
# Predict True Negative for 0
tn0 = trueNegative(thirdClassifier,test_samples, 0,boundv)

confusionMatrixClass0 = np.array([[tp0,fp0],[fn0,tn0]])
print(confusionMatrixClass0)






def otherfalsePositive(matrix,testmatrix,classnumber, bound1,bound2,bound3,bound4):
    matrixa = matrix[bound1:bound2]
    matrixb = matrix[bound3:bound4]
    newarray = np.concatenate((matrixa,matrixb), axis=None)
    return (np.bincount(newarray == classnumber)[1])

def othertrueNegative(matrix,testmatrix,classnumber,bound1,bound2,bound3,bound4):
    matrixa = matrix[bound1:bound2]
    matrixb = matrix[bound3:bound4]
    newarray = np.concatenate((matrixa,matrixb), axis=None)
    return (np.bincount(newarray != classnumber)[1])




# Predict true positives for first 100 samples
class1Matrix = (thirdClassifier[5:10] == labels[test_samples][5:10])
tp1 = truePositive(class1Matrix)



# Predict false Negative for first 100 samples
fn1 = falseNegative(5, class1Matrix)

bound1 = 0
bound2 = 5
bound3 = 10
bound4 = 20
# Predict false Positive for 1
fp1 = otherfalsePositive(thirdClassifier,test_samples, 1, bound1, bound2, bound3, bound4)
    
            
# Predict True Negative for 1
tn1 = othertrueNegative(thirdClassifier,test_samples, 1, bound1, bound2, bound3, bound4)

confusionMatrixClass1 = np.array([[tp1,fp1],[fn1,tn1]])
print(confusionMatrixClass1)













# Predict true positives for first 100 samples
class2Matrix = (thirdClassifier[10:15] == labels[test_samples][10:15])
tp2 = truePositive(class2Matrix)



# Predict false Negative for first 100 samples
fn2 = falseNegative(5, class2Matrix)

bound1 = 0
bound2 = 10
bound3 = 15
bound4 = 20
# Predict false Positive for 1
fp2 = otherfalsePositive(thirdClassifier,test_samples, 2, bound1, bound2, bound3, bound4)
    
            
# Predict True Negative for 1
tn2 = othertrueNegative(thirdClassifier,test_samples, 2, bound1, bound2, bound3, bound4)

confusionMatrixClass2 = np.array([[tp2,fp2],[fn2,tn2]])
print(confusionMatrixClass2)






def extrafalsePositive(matrix,testmatrix,classnumber, bound1,bound2):
    matrixa = matrix[bound1:bound2]
    return (np.bincount(matrixa == classnumber)[1])

def extratrueNegative(matrix,testmatrix,classnumber,bound1,bound2):
    matrixa = matrix[bound1:bound2]
    check = (np.bincount(matrixa != classnumber)[1])
    print(check)
    return (np.bincount(matrixa != classnumber)[1])



# Predict true positives for first 100 samples
class3Matrix = (thirdClassifier[15:20] == labels[test_samples][15:20])
tp3 = truePositive(class3Matrix)



# Predict false Negative for first 100 samples
fn3 = falseNegative(5, class3Matrix)

bound1 = 0
bound2 = 15
# Predict false Positive for 1
fp3 = extrafalsePositive(thirdClassifier,test_samples, 3, bound1, bound2)
    
            
# Predict True Negative for 1
tn3 = extratrueNegative(thirdClassifier,test_samples, 3, bound1, bound2)

confusionMatrixClass3 = np.array([[tp3,fp3],[fn3,tn3]])
print(confusionMatrixClass3)
print("check")




[[  3 105]
 [ 12 290]]


IndexError: index 1 is out of bounds for axis 0 with size 1

Repeat the whole thing a few times.
Which of the three classifiers performs the worst?
Try to analyse why this might be.

## 5. Deliverables and Marking

By the deadline, you should submit one single Jupyter file using GitLab.
Please find the coursework submission instruction from the following link:
https://wiki.cs.manchester.ac.uk/index.php/UGHandbook19:Coursework

This exercise is worth 15 marks — marks will be allocated roughly on the basis of:
* rigorous experimentation,
* knowledge displayed when talking to the TA,
* problem solving skill,
* self-learning ability,
* how informative and well presented your graphs are,
* language and ease of reading.

You must be able to explain any code you've written in order to get full marks. During the marking session we will ask you to run all cells in your Jupyter file, so ensure that the file is runnable using the "Restart Kernel and Run All Cells" menu option.

The lab is marked out of 15:

|                          |         |
|:------------------------ |--------:|
| k-NN Implementation      | 3 marks |
| Experiment 1             | 4 marks |
| Experiment 2             | 4 marks |
| Experiment 3             | 4 marks |