# CS3244, Machine Learning, Semester 1, 2024/25

## PARITY
We will use a nearest neighbour classifier to predict **PARITY** (the label is 1 if the number of ones in the input is odd, and 0 otherwise) under two input distributions:
1. Uniform distribution, and 
2. Uniform distribution over instances with at most two nonzero inputs.

Using 20-dimensional binary inputs, we will randomly sample 10,000 training examples and 1,000 test examples.

Before running the experiment, enter your prediction of the outcome in Archipelago:<br>
A. Nearest neighbour does WELL on both (1) and (2).<br>
B. Nearest neighbour does WELL on (1) but POORLY on (2). <br>
C. Nearest neighbour does POORLY on (1) but WELL on (2).<br>
D. Nearest neighbour does POORLY on both (1) and (2).

In [29]:
import numpy as np
from sklearn import neighbors
from sklearn.metrics import accuracy_score

train_size = 10000
test_size = 1000
input_size = 20

np.random.seed(0)

# Construct training and test sets
# Uniform input distribution
train_data1 = np.random.randint(2,size=(train_size,input_size))
train_label1 = train_data1.sum(axis=1)%2
test_data1 = np.random.randint(2,size=(test_size,input_size))
test_label1 = test_data1.sum(axis=1)%2 # compute PARITY
# Inputs with at most two nonzero entries: draw the number of ones uniformly
# from {0, 1, 2}, then place the ones at random positions.
# Note: the all-zeros input gets probability 1/3, so this is not strictly
# uniform over the individual instances.
def sample_sparse(num_examples):
    data = np.zeros((num_examples, input_size))
    for i in range(num_examples):
        num_ones = np.random.randint(3)
        curr = np.array([1] * num_ones + [0] * (input_size - num_ones))
        np.random.shuffle(curr)
        data[i] = curr
    return data

train_data2 = sample_sparse(train_size)
train_label2 = train_data2.sum(axis=1)%2 # compute PARITY
test_data2 = sample_sparse(test_size)
test_label2 = test_data2.sum(axis=1)%2

# Run nearest neighbour classifier (k = 1) on each distribution
clf1 = neighbors.KNeighborsClassifier(1)
clf1.fit(train_data1, train_label1)
predict1 = clf1.predict(test_data1)
accuracy1 = accuracy_score(test_label1, predict1)

clf2 = neighbors.KNeighborsClassifier(1)
clf2.fit(train_data2, train_label2)
predict2 = clf2.predict(test_data2)
accuracy2 = accuracy_score(test_label2, predict2)

# Print accuracies
print("Test set accuracy for uniform distribution: " + "{0:.2f}".format(accuracy1))
print("Test set accuracy for uniform distribution over instances with at most 2 nonzero inputs: " + "{0:.2f}".format(accuracy2))

Test set accuracy for uniform distribution: 0.69
Test set accuracy for uniform distribution over instances with at most 2 nonzero inputs: 1.00
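
One way to make sense of these two numbers is to look at how far away the nearest training point is likely to be. Flipping a single input bit flips PARITY, so a neighbour at an odd Hamming distance has the opposite label and a neighbour at an even distance has the same label. The sketch below is a back-of-the-envelope calculation and not part of the original experiment: the helper name `expected_neighbours` is made up for illustration, and the calculation ignores how scikit-learn breaks ties between equidistant neighbours.

```python
from math import comb

input_size = 20
train_size = 10000

def expected_neighbours(d, n=input_size, m=train_size):
    """Expected number of training points at Hamming distance d from a fixed
    test point, when the m training points are uniform over {0, 1}^n."""
    return m * comb(n, d) / 2 ** n

# Expected neighbour counts at small distances under the uniform distribution
for d in range(4):
    print(d, round(expected_neighbours(d), 3))

# Number of distinct inputs with at most two ones: C(20,0) + C(20,1) + C(20,2)
num_sparse = sum(comb(input_size, k) for k in range(3))
print(num_sparse)  # 211
```

Under the uniform distribution an exact match is rare, so the predicted label depends on whether the nearest neighbour happens to sit at an even or an odd distance. In the sparse case there are only 211 distinct inputs, so with 10,000 training examples a test input has very likely been seen already.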


## Feature Selection for NN

Do you think doing feature selection would be helpful when using nearest neighbour for text classification? Why? Answer with a phrase before running the experiment.

We will use the 20 Newsgroups dataset in this experiment. Documents are represented as TF-IDF vectors, and the chi-square test is used for feature selection. Feature engineering and feature selection will be covered later in the course.

In [35]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import neighbors
from sklearn.metrics import accuracy_score

# Select only 4 categories to speed things up
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']

# Fetch training and test sets
twenty_train = fetch_20newsgroups(subset='train', remove=('headers','footers','quotes'),
                                  categories=categories, shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', remove=('headers','footers','quotes'),
                                 categories=categories, shuffle=True, random_state=42)

# Use TF-IDF features without normalization
vectorizer = TfidfVectorizer(norm=None)
vectors = vectorizer.fit_transform(twenty_train.data)
vectors_test = vectorizer.transform(twenty_test.data)

# No feature selection
clf = neighbors.KNeighborsClassifier(1)
clf.fit(vectors, twenty_train.target)
predict = clf.predict(vectors_test)
accuracy = accuracy_score(twenty_test.target, predict)
print("Accuracy with no feature selection: " + "{0:.2f}".format(accuracy))

# Feature selection with different number of features selected
fs_num = [10, 50, 100, 500, 1000, 5000]
for i in fs_num:
    fs = SelectKBest(chi2, k=i)
    vectors_fs = fs.fit_transform(vectors, twenty_train.target)
    vectors_test_fs = fs.transform(vectors_test)
    clf = neighbors.KNeighborsClassifier(1)
    clf.fit(vectors_fs, twenty_train.target)
    predict = clf.predict(vectors_test_fs)
    accuracy = accuracy_score(twenty_test.target, predict)
    print("Accuracy with " + str(i) + " features: " + "{0:.2f}".format(accuracy))  

Accuracy with no feature selection: 0.39
Accuracy with 10 features: 0.42
Accuracy with 50 features: 0.55
Accuracy with 100 features: 0.56
Accuracy with 500 features: 0.56
Accuracy with 1000 features: 0.54
Accuracy with 5000 features: 0.48
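
In the numbers above, accuracy peaks at a moderate number of selected features and is lowest with no selection at all. That pattern is consistent with nearest neighbour being sensitive to uninformative dimensions, since every feature contributes equally to the distance. The toy example below is a hand-built sketch of that effect, not drawn from the 20 Newsgroups data: the vectors, the labels `class_a` and `class_b`, and the helper `nearest_label` are all hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_label(query, train, features):
    """Label of the training point closest to query, using only the
    feature indices listed in `features`."""
    def project(v):
        return [v[j] for j in features]
    best = min(train, key=lambda item: euclidean(project(item[0]), project(query)))
    return best[1]

# Feature 0 is informative; features 1-3 are hypothetical noise.
train = [
    ([0.0, 5.0, 5.0, 5.0], "class_a"),
    ([1.0, 0.0, 0.0, 0.0], "class_b"),
]
query = [0.9, 5.0, 4.0, 5.0]  # the informative feature points to class_b

all_features = nearest_label(query, train, features=[0, 1, 2, 3])
selected = nearest_label(query, train, features=[0])
print(all_features, selected)  # class_a class_b
```

With all four features, the noise dimensions dominate the distance and the query is matched to the wrong class; restricting the distance to the informative feature recovers the expected label.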


## Normalization for NN

The previous experiment did not normalize the TF-IDF vectors (`norm=None`). Do you think normalization would help when using nearest neighbour for text classification? Why? Answer with a phrase before running the experiment.

Now, rerun with normalization.

In [None]:
# Use TF-IDF features, this time with normalization
vectorizer = TfidfVectorizer(norm='l2') # Scale each document vector to unit L2 norm
vectors = vectorizer.fit_transform(twenty_train.data)
vectors_test = vectorizer.transform(twenty_test.data)

# No feature selection
clf = neighbors.KNeighborsClassifier(1)
clf.fit(vectors, twenty_train.target)
predict = clf.predict(vectors_test)
accuracy = accuracy_score(twenty_test.target, predict)
print("Accuracy with no feature selection: " + "{0:.2f}".format(accuracy))

# Feature selection with different number of features selected
fs_num = [50, 100, 500, 1000, 5000]
for i in fs_num:
    fs = SelectKBest(chi2, k=i)
    vectors_fs = fs.fit_transform(vectors, twenty_train.target)
    vectors_test_fs = fs.transform(vectors_test)
    clf = neighbors.KNeighborsClassifier(1)
    clf.fit(vectors_fs, twenty_train.target)
    predict = clf.predict(vectors_test_fs)
    accuracy = accuracy_score(twenty_test.target, predict)
    print("Accuracy with " + str(i) + " features: " + "{0:.2f}".format(accuracy))  
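
As a rough illustration of what L2 normalization changes for nearest neighbour, the sketch below compares a short and a long document on the same topic. The term-weight vectors, the labels `topic_x` and `topic_y`, and the helper names are hypothetical, and the example says nothing about what the rerun above will print. For unit-length vectors, the squared Euclidean distance equals 2 minus twice the dot product, so after normalization the nearest neighbour is the one with the highest cosine similarity.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_label(query, train):
    """Label of the training vector closest to query in Euclidean distance."""
    return min(train, key=lambda item: euclidean(item[0], query))[1]

# Hypothetical term weights: a long document on topic_x, a short one on topic_y.
train = [
    ([10.0, 0.0], "topic_x"),
    ([0.0, 1.0], "topic_y"),
]
query = [1.0, 0.0]  # a short document that only uses the topic_x term

raw = nearest_label(query, train)
normalized = nearest_label(
    l2_normalize(query),
    [(l2_normalize(v), label) for v, label in train],
)
print(raw, normalized)  # topic_y topic_x
```

Without normalization, the difference in document length outweighs the difference in content, so the short query is matched to the other short document. After normalization only the direction of each vector matters.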