# CISC/CMPE 452/COGS 400 Assignment 3 - Unsupervised Learning (10 points)  

Please put your name and student id

    Liam Patterson, 20121376

- Make sure to run all the cells from the beginning before submission. Do not clear out the outputs. You will only get credit for code that has been run.
- Marks will be deducted according to the late policy (-1% of the course total marks per day after the due date, until the end date, after which no assignments will be accepted).
- You can only use Numpy to build the models. Other packages such as Pandas, Sklearn and Scipy can be used for calculating evaluation metrics, processing data, and reading and writing files.

### Files to upload for this assignment: A3.ipynb, output.wav, and output.csv

## Part 1 Principal Component Analysis Network (5 points)
The dataset "data/sound.csv" contains two sounds recorded by two microphones. The goal of this part is to use a PCA network to approximate the first principal component.
- Build a PCA network (refer to Principal Component Analysis slides #22 and #23) to reduce the number of features from 2 to 1 (3 points)  
- Train the model and generate the processed data (1 point)  
- Save the data into output.wav and output.csv files (1 point)  
- Compare sound_o.wav (audio with noise) and output.wav (denoised audio)  
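
As a reference for the network described above, here is the single-neuron update written out, assuming the slides describe Oja's rule (an assumption; the slides are not reproduced here). The symbols $\mathbf{w}$ (weight vector), $\mathbf{x}$ (one two-channel sample), $y$ (neuron output) and $\eta$ (learning rate) are notation chosen for this sketch:

$$
y = \mathbf{w}^\top \mathbf{x}, \qquad \Delta \mathbf{w} = \eta \, y \, (\mathbf{x} - y \, \mathbf{w})
$$

The term $-\eta \, y^2 \mathbf{w}$ keeps $\lVert \mathbf{w} \rVert$ close to 1, so for zero-mean data $\mathbf{w}$ tends toward the direction of largest variance, which is the first principal component.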

In [232]:
import numpy as np
import pandas as pd
from scipy.io import wavfile

In [233]:
samrate = 8000

In [234]:
# read csv into array
txtData = np.genfromtxt('data/sound.csv', delimiter=',')
txtData.shape
txtData[0]

array([ 0.92188, -0.99219])

In [235]:
# save array to WAV file
scaledData = np.int16(txtData * samrate)
wavfile.write('data/sound_o.wav', samrate, scaledData)

In [236]:
# read WAV file into array
# The values in sound.csv are already scaled floats.
# If you use the array read from the WAV file instead, rescale it first: wavData = wavData / samrate
samrate, wavData = wavfile.read('data/sound_o.wav')
samrate, wavData.shape

(8000, (50000, 2))
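
The scaling in the two cells above can be checked without touching the file system. The sketch below is Numpy-only and uses a tiny array: the first row is the first sample printed earlier, and the second row is made up for illustration. It shows that multiplying by `samrate`, casting to `int16`, and dividing by `samrate` again recovers the original values up to a truncation error smaller than `1 / samrate`.

```python
import numpy as np

samrate = 8000  # the notebook uses the sample rate as the scale factor

# first row: first sample from sound.csv; second row: made-up values
samples = np.array([[0.92188, -0.99219],
                    [0.5, -0.25]])

as_int16 = np.int16(samples * samrate)  # what is written to the WAV file
restored = as_int16 / samrate           # undo the scaling after reading it back

# the cast truncates toward zero, so the error stays below one integer step
max_error = np.abs(restored - samples).max()
print(as_int16.dtype, max_error < 1 / samrate)
```

With this scale factor, values only fit into `int16` while their magnitude stays below about 4.09 (32767 / 8000), which holds for the data shown here.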

In [237]:
# save array to csv file
np.savetxt('data/sound_o.csv', txtData, delimiter=',')

In [238]:
# build PCA model and only Numpy can be used
import math
class PCA(object):
    def __init__(self, lr, epoch):
        self.learning_rate = lr
        self.epochs = epoch
        self.history = {}
        self.history['mag'], self.history['val_acc'], self.history['loss'] = 0, [], []

    def train(self, x, n_components):
        # n_components is kept for the call signature; this single-neuron
        # network only extracts the first principal component
        W = np.random.rand(x.shape[1])

        for i in range(self.epochs):
            for xi in x:
                # output of the linear neuron
                yi = np.dot(W, xi)

                # Oja-style update: delta_w = lr * y * (x - y * w).
                # The learning rate scales both terms, so W stays a 1-D vector
                # and its norm is driven toward 1.
                delta_w = self.learning_rate * yi * (xi - yi * W)
                W = W + delta_w

        # record the final weights and their magnitude
        self.history['mag'] = np.linalg.norm(W)
        self.history['W'] = W

        # W approximates the first principal component; return it as a unit vector
        return W / np.linalg.norm(W)
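
One way to sanity-check a PCA network is to compare its learned weight vector with the leading eigenvector of the data covariance matrix. The sketch below does this on synthetic data (not the assignment data), assuming Oja's rule as the update and a learning rate of 0.01 chosen for illustration. The eigen-decomposition is used only as a reference, not as part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic two-channel data: one shared source plus small independent noise
source = rng.normal(size=5000)
data = np.column_stack([source, 0.8 * source]) + 0.1 * rng.normal(size=(5000, 2))
data -= data.mean(axis=0)

# single-neuron network trained with Oja's rule
w = rng.random(2)
lr = 0.01
for x in data:
    y = w @ x
    w += lr * y * (x - y * w)

# reference: covariance eigenvector with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

# absolute cosine, because a principal direction has no preferred sign
cosine = abs(w @ pc1) / np.linalg.norm(w)
print(cosine)
```

A cosine close to 1 means the network weights point along the first principal component.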

In [239]:
# initialize and train the model
model = PCA(0.1, 2)
transform_matrix = model.train(txtData, txtData.shape[1])

In [240]:
# save the data
transform = np.dot(txtData, transform_matrix)

scaledData = np.int16(transform * samrate)
wavfile.write('data/output.wav', samrate, scaledData)
np.savetxt('data/output.csv', transform, delimiter=',')

## Part 2 K-Means Clustering Algorithm (5 points)
The dataset is the [Palmer Archipelago (Antarctica) penguin data](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data), which has 6 features and 1 label, species (Chinstrap, Adélie, or Gentoo).  
The dataset is saved in the "data/penguins_size.csv" file and preprocessed into x_train, x_test, y_train, y_test.  
- Build a K-Means clustering algorithm (refer to Unsupervised Learning slide #29) to cluster the preprocessed data (2 points)  
- Standardize the data and train the model with the training set (1 point)  
- Evaluate the model and print the confusion matrices for both the training and test sets (2 points)  


In [241]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [242]:
# load the dataset
data = pd.read_csv('data/penguins_size.csv')
data.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


In [243]:
# data preprocessing
data = data.dropna()
data = data[data['sex'] != '.']

cleanup_nums = {"species": {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2},
                "island": {"Biscoe": 0, "Dream": 1, "Torgersen": 2},
                "sex": {"MALE": 0.0, "FEMALE": 1.0}}
data = data.replace(cleanup_nums)

data.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,2,39.1,18.7,181.0,3750.0,0.0
1,0,2,39.5,17.4,186.0,3800.0,1.0
2,0,2,40.3,18.0,195.0,3250.0,1.0
4,0,2,36.7,19.3,193.0,3450.0,1.0
5,0,2,39.3,20.6,190.0,3650.0,0.0


In [244]:
x = np.array(data.drop(['species'], axis=1).copy())
y = np.array(data['species'].copy()).astype(int)

In [245]:
from sklearn.preprocessing import StandardScaler
# data standardization
# exclude species target
cols = data.columns[1:7] 

scaler = StandardScaler()

standard_data = data.copy()

standard_data[cols] = scaler.fit_transform(standard_data[cols])

x = np.array(standard_data.drop(['species'], axis=1))
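
The cell above fits the scaler on the full dataset before the split. A common variant computes the mean and standard deviation on the training rows only and reuses them for the test rows. The Numpy-only sketch below illustrates that variant on made-up data; the function name `standardize` and the sample values are hypothetical.

```python
import numpy as np

def standardize(train, test):
    """Z-score both arrays using statistics from the training array only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)   # population std (ddof=0), matching StandardScaler
    std[std == 0] = 1.0       # leave constant columns unscaled
    return (train - mean) / std, (test - mean) / std

rng = np.random.default_rng(0)
# two made-up features on very different scales
train = rng.normal(loc=[40.0, 4000.0], scale=[5.0, 800.0], size=(200, 2))
test = rng.normal(loc=[40.0, 4000.0], scale=[5.0, 800.0], size=(50, 2))

train_z, test_z = standardize(train, test)
print(train_z.mean(axis=0), train_z.std(axis=0))
```

After scaling, each training column has mean 0 and standard deviation 1. The test columns are only approximately standardized, because they reuse the training statistics.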

In [246]:
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((266, 6), (67, 6), (266,), (67,))

In [247]:
# calculate the confusion matrix
from sklearn.metrics import confusion_matrix

def evaluator(y, y_pred):    
    cf = confusion_matrix(y, y_pred)
    print('Confusion matrix:\n', cf)
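
For reference, the confusion matrix itself is simple to compute with Numpy alone. The sketch below uses a hypothetical helper, `confusion_matrix_np`, with rows as true labels and columns as predicted labels, the same orientation that `sklearn.metrics.confusion_matrix` uses. The labels are made up.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Count (true, predicted) label pairs; rows are true labels."""
    cf = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cf, (y_true, y_pred), 1)  # unbuffered add handles repeated pairs
    return cf

y_true = np.array([0, 0, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 2, 2, 0])

cf = confusion_matrix_np(y_true, y_pred, 3)
accuracy = np.trace(cf) / cf.sum()  # the diagonal holds the correct predictions
print(cf)
print(accuracy)
```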

In [248]:
# setup a baseline model
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3) # n_clusters - the number of clusters
km.fit(x_train)
y_pred = km.predict(x_train)
evaluator(y_train, y_pred)
y_pred = km.predict(x_test)
evaluator(y_test, y_pred)

Confusion matrix:
 [[  0  60  47]
 [  0  32  26]
 [101   0   0]]
Confusion matrix:
 [[ 0 13 26]
 [ 0  2  8]
 [18  0  0]]
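
The columns of the baseline matrices are cluster ids, not species labels, so the diagonal is not meaningful as it stands. A minimal sketch of one way to read them: map each cluster to its majority species and count those entries as correct. The matrix below is the training matrix from the run shown above; a rerun of the baseline may number the clusters differently.

```python
import numpy as np

# training confusion matrix from the run above
# rows: true species, columns: cluster ids
cf = np.array([[0, 60, 47],
               [0, 32, 26],
               [101, 0, 0]])

cluster_to_species = cf.argmax(axis=0)  # majority species in each cluster
correct = cf.max(axis=0).sum()          # points that match their cluster's majority
accuracy = correct / cf.sum()
print(cluster_to_species, correct, accuracy)
```

For this run, two clusters both map to species 0, and 208 of the 266 training points match their cluster's majority species.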


In [249]:
# build K-means model and only Numpy can be used
class KMeans(object):
    def __init__(self, clusters):
        self.num_clusters = clusters

        self.history = {}
        self.history['error'], self.history['train_acc'], self.history['test_acc'] = [], [], []

    def train(self, x, y, x_test, y_test, learning_rate, n_iters, verbose=True):

        n_train, input_size = x.shape
        n_test = x_test.shape[0]

        # learning_rate is accepted to match the call signature but is not used:
        # the centroids are recomputed as cluster means on every iteration

        #initialize random weights
        self.weights = [np.random.rand(x.shape[1]) for j in range(self.num_clusters)]
        
        valid = True
        i = 0
        while i < n_iters and valid:
            
            error = 0
            # points in a cluster, initially no points in any cluster
            points = [[] for j in range(self.num_clusters)]
            self.actual_cluster_labels = [[] for j in range(self.num_clusters)]

            for xi, yi in zip(x, y):
                
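                # score each centroid by its dot product with xi and pick the largest
                # (a similarity measure, not a Euclidean distance)
                distances = np.dot(self.weights, xi)
                cluster = np.argmax(distances)

                # add the point to the cluster with the highest score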
                points[cluster].append(xi)
                # actual cluster labels is used to determine what is the most common
                # label from that cluster
                self.actual_cluster_labels[cluster].append(yi)
                
            self.weights = [np.mean(points[j], axis=0) for j in range(self.num_clusters)]
            
            for k in range(self.num_clusters):

                w = self.weights[k]
                
                for point in points[k]:
                    
                    diff = point - w
                    error += diff*diff
            
            i += 1

            error = np.sum(error)

            # stop once the error no longer decreases compared with the previous iteration
            if len(self.history['error']) and self.history['error'][-1] <= error:
                valid = False
            
            self.history['error'].append(error)
            
            train_acc = (self.predict(x) == y).sum() / n_train
            self.history['train_acc'].append(train_acc)

            test_acc = (self.predict(x_test) == y_test).sum() / n_test
            self.history['test_acc'].append(test_acc)
            if verbose:
                print('epoch %d, train error %.4f, train acc %.4f, test acc %.4f' % (i, error, train_acc, test_acc))

    def predict(self, x):   
        
        # calculate the mode label for each cluster
        self.predicted_cluster_labels = [max(set(lst), key=lst.count) for lst in self.actual_cluster_labels]
        
        preds = []
        for xi in x:
            # score each centroid by its dot product with xi and pick the largest
            distances = np.dot(self.weights, xi)
            cluster = np.argmax(distances)
            
            preds.append(self.predicted_cluster_labels[cluster])
        return np.array(preds)
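
The class above assigns each point to the centroid with the largest dot product. Standard K-means instead assigns each point to the centroid with the smallest Euclidean distance. The Numpy-only sketch below shows that variant on made-up, well-separated blobs. The function name `kmeans_euclidean`, the initialisation from random data points, and the empty-cluster handling are choices made for this sketch, not part of the assignment code.

```python
import numpy as np

def kmeans_euclidean(x, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # start from k distinct data points so every centroid begins inside the data
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # squared Euclidean distance from every point to every centroid: shape (n, k)
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(c, 0.3, size=(50, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans_euclidean(blobs, 3)
print(centroids)
```

The returned `labels` are cluster ids. Mapping them to species would still need a majority vote per cluster, as the class above does in `predict`.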

In [250]:
# initialize and train the model
kMeans = KMeans(3)
kMeans.train(x_train, y_train, x_test, y_test, 0.01, 15)

epoch 1, train error 1037.6406, train acc 0.7744, test acc 0.8507
epoch 2, train error 649.3592, train acc 0.7820, test acc 0.8507
epoch 3, train error 591.0986, train acc 0.7820, test acc 0.8507
epoch 4, train error 591.0986, train acc 0.7820, test acc 0.8507


In [251]:
# evaluate the model and print the confusion matrices for both training and test sets
train_predictions = kMeans.predict(x_train)
evaluator(y_train, train_predictions)

test_predictions = kMeans.predict(x_test)
evaluator(y_test, test_predictions)

Confusion matrix:
 [[107   0   0]
 [ 58   0   0]
 [  0   0 101]]
Confusion matrix:
 [[39  0  0]
 [10  0  0]
 [ 0  0 18]]
