# CS3244, Machine Learning, Semester 1, 2024/25

# Credits
Authored by [Min-Yen Kan](http://www.comp.nus.edu.sg/~kanmy), Chris Boesch and Martin Strobel (2018–2021), affiliated with [WING](http://wing.comp.nus.edu.sg), [NUS School of Computing](http://www.comp.nus.edu.sg) and [ALSET](http://www.nus.edu.sg/alset). Inspired in part by Andrew Ng's Coursera course and Yaser S. Abu-Mostafa's Caltech course.
Licensed as: [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0).
Please retain and add to this credits cell if using this material as a whole or in part. Credits for photos are given in their captions.

Modified by [Xavier Bresson](https://twitter.com/xbresson), 14 Aug 2023  
Modified by Wee Sun LEE, Jan 2025

**Learning Outcomes for this Notebook**

After finishing this exercise, you should be able to:
* Understand the basic steps for carrying out a machine learning task.
* Have a high-level familiarity with Python, Jupyter notebooks, and sklearn.

# 1 Digit Recognition

_Handwritten digit samples from the [MNIST (Modified National Institute of Standards and Technology) database](https://www.nist.gov/system/files/documents/srd/nistsd19.pdf). Can you guess what this was first used for and who commissioned the work?_

![Sample images from MNIST test dataset](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

_Image by Josef Steppan via Wikipedia.  CC BY-SA 4.0_


Now let's actually try to get some time on task in class. We'll follow a modified tutorial for **[sklearn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)**, a very useful Python machine learning library which will be featured extensively in our course. Let's first load the _digits_ dataset and take a peek at the data. This small dataset of 8×8 images ships with sklearn; it is a copy of the test set of the UCI Optical Recognition of Handwritten Digits dataset, and is similar in spirit to the famous MNIST digit dataset used to improve handwritten digit recognition.

In [None]:
# import the libraries to access data
from sklearn import datasets

# load in the data into a variable 'digits'
digits = datasets.load_digits()

# separate the loaded data into the digits (feature vectors) x and their corresponding answers (labels) y
x_digits = digits.data
y_digits = digits.target

# find the number of instances in our dataset
n_samples = len(x_digits) # size of the variable x_digits.
print ('Number of examples: %d' % n_samples) # a format print statement: prints the string substituting the variable for the digit placeholder '%d'

# Look at the input (x) and label (y) for a particular jth instance
j = 0
print ('Input #%d:' % j, x_digits[j])
print ('Label #%d:' % j, y_digits[j])
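
Before reading individual vectors, it can help to check the overall shape of the data. The following is a minimal sketch, assuming only NumPy and the same `load_digits()` call as above; the names `x_all` and `y_all` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
x_all, y_all = digits.data, digits.target  # illustrative names, not from the notebook

# One row per instance, one column per pixel
print('Feature matrix shape:', x_all.shape)
print('Label vector shape:', y_all.shape)

# Which labels occur, and what range the pixel intensities cover
print('Distinct labels:', np.unique(y_all))
print('Pixel value range: %d to %d' % (x_all.min(), x_all.max()))
```

Knowing the pixel range makes the printed input vectors easier to read: larger values correspond to darker ink in the original scan.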

This dataset consists of handwritten digits, each represented by $8\times8 = 64$ entries: the rows of the image are concatenated in series to form a 1-dimensional vector. See whether you can make sense of the data. Pick another instance (1–9), run the same code, and guess the digit it shows.

In [None]:
k = 9 # Change this line to some number 1–9
print ('Input #%d:' % k, x_digits[k])

In [None]:
# Plot the image with the following steps
# 1. Convert 64-dim vector to 8x8-matrix
# 2. Use matplotlib library

k = 9
image = x_digits[k]
print(image.shape)
image = image.reshape(8,8)
print(image.shape)
print(image)
print('Label:', y_digits[k])

import matplotlib.pyplot as plt
plt.imshow(image, cmap='gray');
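
The `reshape(8,8)` step works because the 64 entries are stored row by row. As a small sketch of that correspondence (the index `r` is an arbitrary choice for illustration), the check below confirms that row `r` of the reshaped grid is just a contiguous slice of the flat vector, and that sklearn's own `digits.images` attribute holds the same 8×8 grid.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
j = 0
flat = digits.data[j]        # 1-dimensional vector with 64 entries
grid = flat.reshape(8, 8)    # 8 rows of 8 pixels, filled row by row

# Row r of the grid is the slice flat[8*r : 8*r + 8]
r = 3  # arbitrary row chosen for illustration
row_matches = np.array_equal(grid[r], flat[8 * r:8 * r + 8])
print('Row %d matches the flat slice:' % r, row_matches)

# sklearn also exposes the already-reshaped images
same_as_images = np.array_equal(grid, digits.images[j])
print('Reshaped vector equals digits.images[%d]:' % j, same_as_images)
```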


**Your Turn (Question 1)** Guess the digit that instance k=179 represents. Put your guess in Archipelago after looking at the image.

In [None]:
k = 179
image = x_digits[k]
print(image.shape)
image = image.reshape(8,8)
print(image.shape)
print(image)

plt.imshow(image, cmap='gray');


In [None]:
# NOW PRINT THE LABEL TO VERIFY YOUR GUESS
print('Label:', y_digits[k])

Next, we'll split the data instances into those used to learn the model (training instances), and those used to assess the learned model's performance (testing instances).

In [None]:
# set aside the first 90% of the data for the training and the remaining 10% for testing.
x_train = x_digits[0:int(.9 * n_samples)]
y_train = y_digits[0:int(.9 * n_samples)]
x_test = x_digits[int(.9 * n_samples):]
y_test = y_digits[int(.9 * n_samples):]
print ('Number of training examples: %d' % len(x_train))
print ('Number of testing examples: %d' % len(x_test))

Finally, we'll learn a model and assess its performance.

In [None]:
# import the library to use our desired classifier (here, nearest neighbors, which we'll see next)
from sklearn import neighbors

# create an instance of the learner
knn = neighbors.KNeighborsClassifier()

# learn a model from the training data
knn.fit(x_train, y_train)

# evaluate our model over the testing data and print its accuracy. 
acc = knn.score(x_test,y_test)
print('KNN score: %f' % acc)
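
For a classifier, `score` reports the mean accuracy: the fraction of test instances whose predicted label equals the true label. The sketch below rebuilds the same sequential 90/10 split and compares a hand-computed accuracy with `knn.score`; the names `n_train`, `y_pred` and `manual_acc` are illustrative.

```python
import numpy as np
from sklearn import datasets, neighbors

digits = datasets.load_digits()
x_digits, y_digits = digits.data, digits.target

# Same sequential split as above: first 90% for training, rest for testing
n_train = int(.9 * len(x_digits))
x_train, y_train = x_digits[:n_train], y_digits[:n_train]
x_test, y_test = x_digits[n_train:], y_digits[n_train:]

knn = neighbors.KNeighborsClassifier()  # default settings, as in the cell above
knn.fit(x_train, y_train)

# Predict a label for every test instance and count the matches by hand
y_pred = knn.predict(x_test)
manual_acc = np.mean(y_pred == y_test)

print('Manual accuracy: %f' % manual_acc)
print('knn.score:       %f' % knn.score(x_test, y_test))
```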

That was pretty easy, wasn't it? 

Now, instead of using the first part of the data as the training set and the last part as the test set, let's assign instances to the training and test sets at random.

In [None]:
# *randomly* assign 90% of the data to training and the remaining 10% to testing.
import numpy as np
index_random_perm = np.random.permutation(n_samples)
print(index_random_perm[:10])
x_train = x_digits[index_random_perm[0:int(.9 * n_samples)]]
y_train = y_digits[index_random_perm[0:int(.9 * n_samples)]]
x_test = x_digits[index_random_perm[int(.9 * n_samples):]]
y_test = y_digits[index_random_perm[int(.9 * n_samples):]]
print ('Number of training examples: %d' % len(x_train))
print ('Number of testing examples: %d' % len(x_test))
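
sklearn also offers a helper, `train_test_split`, that performs this kind of shuffled split in one call. The following is a sketch of an alternative, not the method used in this notebook; the variable names `xs_train`, `xs_test`, `ys_train`, `ys_test` are chosen to avoid overwriting the arrays above, and the seed `0` is an arbitrary assumption that makes the split repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Shuffle and split in one call; random_state fixes the shuffle for repeatability
xs_train, xs_test, ys_train, ys_test = train_test_split(
    digits.data, digits.target, test_size=0.1, shuffle=True, random_state=0)

print('Number of training examples: %d' % len(xs_train))
print('Number of testing examples: %d' % len(xs_test))

# Calling it again with the same seed reproduces exactly the same split
xs_train2, xs_test2, ys_train2, ys_test2 = train_test_split(
    digits.data, digits.target, test_size=0.1, shuffle=True, random_state=0)
same_split = np.array_equal(xs_test, xs_test2) and np.array_equal(ys_test, ys_test2)
print('Same split with the same seed:', same_split)
```

Fixing the seed is useful when results need to be compared across runs; leaving it unset gives a different split each time, as with `np.random.permutation` above.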

In [None]:
# create an instance of the learner
knn2 = neighbors.KNeighborsClassifier()

# learn a model from the training data
knn2.fit(x_train, y_train)

# evaluate our model over the testing data and print its accuracy. 
acc2 = knn2.score(x_test,y_test)  # score the newly trained knn2, not the earlier knn
print('KNN score: %f' % acc2)
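
Because the permutation is random, the score from a random split changes from run to run. A rough sketch of how much it varies, assuming five repetitions and an arbitrary seed (both are illustrative choices, as are the names `rng`, `scores` and `model`):

```python
import numpy as np
from sklearn import datasets, neighbors

digits = datasets.load_digits()
x_digits, y_digits = digits.data, digits.target
n_samples = len(x_digits)
n_train = int(.9 * n_samples)

rng = np.random.default_rng(0)  # arbitrary seed so the experiment is repeatable
scores = []
for trial in range(5):
    # Draw a fresh random 90/10 split for each trial
    perm = rng.permutation(n_samples)
    train_idx, test_idx = perm[:n_train], perm[n_train:]

    # Train a new classifier on this split and record its test accuracy
    model = neighbors.KNeighborsClassifier()
    model.fit(x_digits[train_idx], y_digits[train_idx])
    scores.append(model.score(x_digits[test_idx], y_digits[test_idx]))

print('Scores:', np.round(scores, 4))
print('Mean: %f, std: %f' % (np.mean(scores), np.std(scores)))
```

Reporting the mean and spread over several splits gives a steadier picture than any single random split.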

**Your Turn (Question 2)** The score obtained with a random training/test split differs from the score obtained with the sequential split. Speculate on the reason. Put your answer in Archipelago.