# CISC/CMPE 452/COGS 400 Assignment 3 - Unsupervised Learning (10 points)  

Please enter your name and student ID:

    FirstName LastName, #12345678

- The notebook file has clearly marked blocks where you are expected to write code. Do not write or modify any code outside of these blocks.
- Make sure to run all the cells from the beginning before submission. Do not clear out the outputs. You will only get credit for code that has been run.
- Marks will be deducted according to the late policy: 1% of the course total per day after the due date, until the end date, after which no assignments will be accepted.

## Dataset
The dataset is [Palmer Archipelago (Antarctica) penguin data](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data)  
The dataset has 6 features and 1 label called species (Chinstrap, Adélie, or Gentoo)  
The dataset is preprocessed into x_train, x_test, y_train, y_test  

## Part 1 Kohonen Learning and MaxNet (5 points)
- Build a Kohonen network (Kohonen-1) with MaxNet to cluster the preprocessed data (3 points)  
- Train the model with both train and test sets (1 point)  
- Print the confusion matrix of the predicted results (1 point)  
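
As a point of reference for the MaxNet requirement above, here is a minimal sketch of the iterative winner-take-all competition. The function name `maxnet`, the default inhibition strength `1 / (2 * m)`, and the iteration cap are assumptions made for illustration, not part of the assignment specification.

```python
import numpy as np

def maxnet(activations, epsilon=None, max_iters=1000):
    """Return the index of the winning node after mutual inhibition.

    Hypothetical helper: each node keeps its own activation and is
    inhibited by epsilon times the sum of all other activations.
    """
    # Negative activations are clipped to zero before the competition starts
    a = np.maximum(np.asarray(activations, dtype=float), 0.0)
    m = a.size
    if epsilon is None:
        epsilon = 1.0 / (2 * m)  # assumption: any value in (0, 1/m) works
    for _ in range(max_iters):
        if np.count_nonzero(a) <= 1:
            break  # at most one node is still active
        a = np.maximum(a - epsilon * (a.sum() - a), 0.0)
    # With exact ties (or all zeros) argmax falls back to the first index
    return int(np.argmax(a))

print(maxnet([0.2, 0.9, 0.5]))  # 1
```

When a unique largest positive activation exists, the result matches `np.argmax` on the initial activations, which is why an `argmax` call is often used as a shortcut for the full competition.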

## Part 2 Principal Component Analysis Network (5 points)
- Build a PCA network to reduce the number of input features from 6 to 4 (2 points)  
- Build a Kohonen network (Kohonen-2) with MaxNet to cluster the data  
- Train the model with the new train and test sets generated by PCA network (1 point)  
- Print the confusion matrix of the predicted results (1 point)  
- Compare the predicted results with Kohonen-1 and analyze the results (1 point)  
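
One possible way to build a PCA network with several outputs is Sanger's rule (the generalized Hebbian algorithm). The sketch below assumes mean-centred input and uses synthetic data; the function name `sanger_pca`, the learning rate, and the epoch count are illustrative assumptions rather than required values.

```python
import numpy as np

def sanger_pca(x, n_components, lr=0.01, n_epochs=50, seed=0):
    """Estimate principal directions with Sanger's rule.

    Hypothetical helper. Rows of the returned matrix W are the
    estimated directions; x is assumed to be mean-centred.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_components, x.shape[1]))
    for _ in range(n_epochs):
        for xi in x:
            y = W @ xi  # outputs of the linear units
            # dW = lr * (y x^T - LT[y y^T] W), LT = lower-triangular part,
            # so unit j is decorrelated from units 0..j-1
            W += lr * (np.outer(y, xi) - np.tril(np.outer(y, y)) @ W)
    return W

# Synthetic stand-in for a 6-feature dataset (not the penguin data)
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(500, 6)) * np.array([2.0, 1.5, 1.0, 0.5, 0.3, 0.1])
x_demo -= x_demo.mean(axis=0)

W = sanger_pca(x_demo, n_components=4, lr=0.002, n_epochs=50)
x_reduced = x_demo @ W.T  # project 6 features onto 4 components
print(x_reduced.shape)  # (500, 4)
```

The learning rate usually needs to be small relative to the input variance; too large a value can make the weights diverge.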

In [8]:
!pip3 install matplotlib

Defaulting to user installation because normal site-packages is not writeable
Collecting matplotlib
  Downloading matplotlib-3.6.1-cp39-cp39-macosx_10_12_x86_64.whl (7.3 MB)
Collecting pillow>=6.2.0
  Downloading Pillow-9.2.0-cp39-cp39-macosx_10_10_x86_64.whl (3.1 MB)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.37.4-py3-none-any.whl (960 kB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.4.4-cp39-cp39-macosx_10_9_x86_64.whl (65 kB)
Collecting contourpy>=1.0.1
  Downloading contourpy-1.0.5-cp39-cp39-macosx_10_9_x86_64.whl (241 kB)
Collecting cycler>=0.10
  Downloading cycler-0.11.0-py3-none-any.whl

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [10]:
# load the dataset
data = pd.read_csv('data/penguins_size.csv')
data.head()

| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |


### Data Preprocessing

In [11]:
data = data.dropna()
data = data[data['sex'] != '.']

cleanup_nums = {"species": {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2},
                "island": {"Biscoe": 0, "Dream": 1, "Torgersen": 2},
                "sex": {"MALE": 0.0, "FEMALE": 1.0}}
data = data.replace(cleanup_nums)

data.head()

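| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 39.1 | 18.7 | 181.0 | 3750.0 | 0.0 |
| 1 | 0 | 2 | 39.5 | 17.4 | 186.0 | 3800.0 | 1.0 |
| 2 | 0 | 2 | 40.3 | 18.0 | 195.0 | 3250.0 | 1.0 |
| 4 | 0 | 2 | 36.7 | 19.3 | 193.0 | 3450.0 | 1.0 |
| 5 | 0 | 2 | 39.3 | 20.6 | 190.0 | 3650.0 | 0.0 |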


In [12]:
x = np.array(data.drop(['species'], axis=1).copy())
x = (x - x.mean(axis=0)) / x.std(axis=0) # data normalization
y = np.array(data['species'].copy()).astype(int)

In [13]:
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((266, 6), (67, 6), (266,), (67,))

In [14]:
# calculate the confusion matrix
def evaluator(y, y_pred):    
    from sklearn.metrics import confusion_matrix
    print('Confusion matrix:\n', confusion_matrix(y, y_pred))

In [15]:
# setup a baseline model
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3) # n_clusters - the number of clusters
km.fit(x_train)
y_pred = km.predict(x_train)
evaluator(y_train, y_pred)
y_pred = km.predict(x_test)
evaluator(y_test, y_pred)

Confusion matrix:
 [[ 47   0  60]
 [ 26   0  32]
 [  0 101   0]]
Confusion matrix:
 [[26  0 13]
 [ 8  0  2]
 [ 0 18  0]]
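
Cluster IDs produced by a clustering model are arbitrary, so a confusion matrix like the one above need not be diagonal even when a cluster lines up with a species. The sketch below relabels clusters so that the total number of matches is maximized. The helper name `align_clusters` is hypothetical, labels are assumed to be integers in `0..n_classes-1`, and the brute-force search over permutations is only practical for a handful of clusters.

```python
from itertools import permutations

import numpy as np

def align_clusters(y_true, y_pred, n_classes):
    """Relabel cluster IDs to the class labels they overlap with most."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # counts[t, c] = number of samples with true class t in cluster c
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)
    # best[c] is the class label assigned to cluster c
    best = max(
        permutations(range(n_classes)),
        key=lambda p: sum(counts[p[c], c] for c in range(n_classes)),
    )
    return np.asarray(best)[y_pred]

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])
print(align_clusters(y_true, y_pred, n_classes=3))  # [0 0 1 1 2 2]
```

Passing the aligned labels to `evaluator` makes the confusion matrices of different models easier to compare side by side.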


### Part 1 Kohonen Learning and MaxNet


In [17]:
class Model1(object):
    """Kohonen-1: a single-layer winner-take-all network."""

    def __init__(self):
        self.history = {}
        self.history['train_acc'] = []
        self.history['test_acc'] = []

    def relu(self, x):
        # return a clipped copy instead of modifying the input in place
        return np.maximum(x, 0)

    def train(self, x, y, x_test, y_test, learning_rate=0.1, n_iters=10, verbose=True):
        input_size = x.shape[1]
        # y is used only to count the clusters and to evaluate the result
        output_size = len(np.unique(y))
        self.W = np.ones((input_size, output_size))

        for i in range(n_iters):
#             if (i+1)%2 == 0:
#                 learning_rate *= 0.5
            for xi in x:
                # similarity between the sample and each node's weight vector
                sim = self.relu(np.matmul(xi, self.W))
                # argmax is used here as a shortcut for the MaxNet competition
                index = np.argmax(sim)
                # move only the winning node's weights toward the sample
                self.W[:, index] += learning_rate * (xi - self.W[:, index])
            #print('epoch %d, learning rate %.4f' % (i + 1, learning_rate))
        y_pred = self.predict(x)
        evaluator(y, y_pred)
        y_pred = self.predict(x_test)
        evaluator(y_test, y_pred)

    def predict(self, x):
        # assign each sample to the node with the largest activation
        y_pred = np.argmax(self.relu(np.matmul(x, self.W)), axis=1)
        return y_pred

In [18]:
# initialize and train Kohonen-1
model1 = Model1()
model1.train(x_train, y_train, x_test, y_test, learning_rate=0.001, n_iters=50, verbose=True)

Confusion matrix:
 [[ 51  56   0]
 [ 26  32   0]
 [  0   0 101]]
Confusion matrix:
 [[27 12  0]
 [ 8  2  0]
 [ 0  0 18]]


In [None]:
# print the confusion matrix of both train and test sets


### Part 2 Principal Component Analysis Network

In [19]:
import numpy as np
from scipy.io import wavfile

In [56]:
# read csv into Numpy array
txtData = np.genfromtxt('data/sound.csv', delimiter=',')
txtData.shape

(50000, 2)

In [58]:
# save array to WAV audio file
scaledData = np.int16(txtData * 8000)
wavfile.write('data/sound.wav', 8000, scaledData)

In [59]:
# read waveform from audio file
samrate, data = wavfile.read('data/sound.wav')

In [60]:
samrate, data.shape

(8000, (50000, 2))

In [61]:
data

array([[ 7375, -7937],
       [  312,  -562],
       [ -187,   187],
       ...,
       [-2000,  2125],
       [-1937,  2062],
       [-1562,  1625]], dtype=int16)

In [62]:
data = data/8000
# data = data * np.max(data) #(data - data.mean(axis=0)) / data.std(axis=0)

In [63]:
data

array([[ 0.921875, -0.992125],
       [ 0.039   , -0.07025 ],
       [-0.023375,  0.023375],
       ...,
       [-0.25    ,  0.265625],
       [-0.242125,  0.25775 ],
       [-0.19525 ,  0.203125]])

In [64]:
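class PCA(object):
    """PCA network trained with Oja's rule."""

    def __init__(self, lr=0.01, epoch=10):
        self.lr = lr
        self.epoch = epoch

    def train(self, x, n_components=1):
        self.W = np.random.rand(x.shape[1], n_components)
        for k in range(self.epoch):
            for xi in x:
                # output of the linear unit(s), shape (n_components,)
                y = np.matmul(xi, self.W)
                # Oja's rule: dW = lr * (y * x - y^2 * W); the learning rate
                # scales both the Hebbian term and the decay term.
                # Each column is updated independently, so with
                # n_components > 1 every column tends toward the first component.
                self.W += self.lr * (np.outer(xi, y) - y ** 2 * self.W)

    def predict(self, x):
        # project the data onto the learned direction(s)
        return np.matmul(x, self.W)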

In [65]:
pca = PCA(lr=1, epoch=2)
pca.train(np.array(data), 1)
x = pca.predict(data)

In [66]:
x.shape

(50000, 1)

In [67]:
x

array([[-1.35429904],
       [-0.07813589],
       [ 0.03302665],
       ...,
       [ 0.36474254],
       [ 0.35361591],
       [ 0.28167391]])

In [68]:
scaledData = np.int16(x/np.max(np.abs(x))*8000)
wavfile.write('data/output.wav', 8000, scaledData)