# CISC/CMPE 452/COGS 400 Assignment 3 - Unsupervised Learning (10 points)  

Please put your name and student id

    Jackson Kehoe, #20105920

- Make sure to run all the cells from the beginning before submission. Do not clear out the outputs. You will only get credit for code that has been run.
- Marks will be deducted based on the late policy (-1% of the course total marks per day after the due date until the end date, after which no assignments will be accepted).
- You can only use Numpy to build the models. Other packages such as Pandas, Sklearn and Scipy can be used for calculating evaluation metrics, data processing, and file reading and writing.

### Files that need to be uploaded for this assignment: A3.ipynb, output.wav, and output.csv

## Part 1 Principal Component Analysis Network (5 points)
The dataset "data/sound.csv" contains two sounds recorded by two microphones. The goal of this part is to use a PCA network to approximate the first principal component.
- Build a PCA network (refer to Principal Component Analysis slide #22 and #23) to reduce the number of features from 2 to 1 (3 points)  
- Train the model and generate the processed data (1 point)  
- Save the data into output.wav and output.csv files (1 point)  
- Compare sound_o.wav (noisy audio) with output.wav (denoised audio)  
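
A single-output PCA network is commonly trained with Oja's rule, $\Delta w = \eta\, y\,(x - y\,w)$ with $y = w^\top x$. The sketch below assumes this is the rule the referenced slides describe; it runs on synthetic data rather than sound.csv, and the names `oja_first_component`, `lr`, and `epochs` are illustrative only. It compares the learned weight vector against the leading eigenvector of the covariance matrix.

```python
import numpy as np

def oja_first_component(x, lr=0.01, epochs=5, seed=0):
    """Estimate the first principal direction of x with Oja's rule (sketch)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=x.shape[1])
    w /= np.linalg.norm(w)              # start from a unit-length weight vector
    for _ in range(epochs):
        for xi in x:
            y = w @ xi                  # neuron output for this sample
            w += lr * y * (xi - y * w)  # Hebbian term minus decay term
    return w

# Synthetic zero-mean 2-D data with one dominant direction (illustrative only)
rng = np.random.default_rng(1)
source = rng.normal(size=(2000, 1))
noise = 0.1 * rng.normal(size=(2000, 2))
data = source @ np.array([[0.8, 0.6]]) + noise
data -= data.mean(axis=0)

w = oja_first_component(data)

# Reference direction: eigenvector of the covariance matrix with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigvecs[:, -1]

# |cosine| between the learned weights and the reference direction (sign is arbitrary)
alignment = abs(w @ pc1) / np.linalg.norm(w)
print(alignment)
```

Note that the learning rate multiplies both the Hebbian term and the decay term; the decay term is what keeps the weight vector close to unit length.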

In [100]:
import numpy as np
import pandas as pd
from scipy.io import wavfile

In [101]:
samrate = 8000

In [102]:
# read csv into array
txtData = np.genfromtxt('data/sound.csv', delimiter=',')
txtData.shape

(50000, 2)

In [103]:
# save array to WAV file
scaledData = np.int16(txtData * samrate)
wavfile.write('data/sound_o.wav', samrate, scaledData)

In [104]:
# read WAV file into array
# The data in sound.csv is already scaled
# If you use the WAV data read here instead, rescale it first: wavData = wavData / samrate
samrate, wavData = wavfile.read('data/sound_o.wav')
samrate, wavData.shape

(8000, (50000, 2))
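
The cells above multiply the samples by `samrate` before casting to 16-bit integers, so data read back from the WAV file has to be divided by `samrate` again. A minimal sketch of that round trip, using made-up sample values rather than the real recording:

```python
import numpy as np

samrate = 8000  # used in this notebook as both sample rate and amplitude scale

# Hypothetical float samples standing in for rows of sound.csv
samples = np.array([[0.5, -0.25], [0.1234, -1.0001], [0.0, 2.0]])

as_int16 = np.int16(samples * samrate)  # integer values written to the WAV file
restored = as_int16 / samrate           # values to feed the model after reading back

# The integer cast drops the fractional part, so the round trip is only approximate
max_error = np.max(np.abs(restored - samples))
print(as_int16.dtype, max_error)
```

The round-trip error stays below `1 / samrate` per sample as long as the scaled values fit in the int16 range.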

In [105]:
# save array to csv file
np.savetxt('data/sound_o.csv', txtData, delimiter=',')

In [106]:
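# build PCA model and only Numpy can be used
class PCA(object):
    def __init__(self, lr, epoch):
        self.lr = lr          # learning rate
        self.epoch = epoch    # number of passes over the data

    def train(self, x, n_components):
        # one column of weights per output component
        self.W = np.random.rand(x.shape[1], n_components)
        for k in range(self.epoch):
            for xi in x:
                # output of each component for this sample
                y = self.W[0] * xi[0] + self.W[1] * xi[1]
                # Hebbian update with decay (Oja's rule): the learning rate
                # must scale both terms, dW = lr * (y * x - y^2 * W)
                dW1 = self.lr * (y * xi[0] - y ** 2 * self.W[0])
                dW2 = self.lr * (y * xi[1] - y ** 2 * self.W[1])
                self.W[0] += dW1
                self.W[1] += dW2
        return

    def predict(self, x):
        # project the samples onto the learned weights
        return np.matmul(x, self.W)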

In [108]:
# initialize and train the model
wavData = wavData/samrate
pca = PCA(lr=0.01, epoch=50)
# reduce the 2 input features to 1 component
pca.train(np.array(wavData), 1)
x = pca.predict(wavData)

In [109]:
# save the data
scaleback = np.int16(x/np.max(np.abs(x))*samrate)
wavfile.write('data/output.wav', samrate, scaleback)
# write the processed data as text; wavfile.write would produce a WAV file with a .csv name
np.savetxt('data/output.csv', x, delimiter=',')

## Part 2 K-Means Clustering Algorithm (5 points)
The dataset is [Palmer Archipelago (Antarctica) penguin data](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data) which has 6 features and 1 label called species (Chinstrap, Adélie, or Gentoo)  
The dataset is saved in the "data/penguins_size.csv" file and preprocessed into x_train, x_test, y_train, y_test  
- Build a K-Means clustering algorithm (refer to Unsupervised Learning slide #29) to cluster the preprocessed data (2 points)  
- Standardize the data and train the model with the training set (1 point)  
- Evaluate the model and print the confusion matrices with both training and test sets (2 points)  
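
For orientation, a batch K-Means loop can be sketched in NumPy as below. This is the standard assign-then-average iteration and may differ from the variant on the referenced slide; the function name `kmeans_sketch` and the farthest-first initialisation are choices made here for reproducibility, and the data are synthetic blobs rather than the penguin features.

```python
import numpy as np

def kmeans_sketch(x, k, n_iters=100):
    """Batch K-Means: assign points to the nearest centroid, then recompute means."""
    # Farthest-first initialisation: start at x[0], then repeatedly take the
    # point farthest from all centroids chosen so far (deterministic)
    centroids = [x[0]]
    for _ in range(1, k):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(x[d2.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:        # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                       # assignments have stabilised
        centroids = new_centroids
    return centroids, labels

# Three well-separated synthetic blobs (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

centroids, labels = kmeans_sketch(points, k=3)
print(np.round(centroids, 1))
```

The key difference from a similarity-based winner-take-all update is that the winner is chosen by smallest Euclidean distance (`argmin`), not by largest dot product.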


In [110]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [111]:
# load the dataset
data = pd.read_csv('data/penguins_size.csv')
data.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


In [112]:
# data preprocessing
data = data.dropna()
data = data[data['sex'] != '.']

cleanup_nums = {"species": {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2},
                "island": {"Biscoe": 0, "Dream": 1, "Torgersen": 2},
                "sex": {"MALE": 0.0, "FEMALE": 1.0}}
data = data.replace(cleanup_nums)

data.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,2,39.1,18.7,181.0,3750.0,0.0
1,0,2,39.5,17.4,186.0,3800.0,1.0
2,0,2,40.3,18.0,195.0,3250.0,1.0
4,0,2,36.7,19.3,193.0,3450.0,1.0
5,0,2,39.3,20.6,190.0,3650.0,0.0


In [113]:
x = np.array(data.drop(['species'], axis=1).copy())
y = np.array(data['species'].copy()).astype(int)

In [114]:
# data standardization
x = (x-x.mean(axis=0))/ x.std(axis=0)

In [115]:
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((266, 6), (67, 6), (266,), (67,))

In [116]:
# calculate the confusion matrix
def evaluator(y, y_pred):    
    from sklearn.metrics import confusion_matrix
    print('Confusion matrix:\n', confusion_matrix(y, y_pred))

In [117]:
# setup a baseline model
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3) # n_clusters - the number of clusters
km.fit(x_train)
y_pred = km.predict(x_train)
evaluator(y_train, y_pred)
y_pred = km.predict(x_test)
evaluator(y_test, y_pred)

Confusion matrix:
 [[ 60   0  47]
 [ 32   0  26]
 [  0 101   0]]
Confusion matrix:
 [[13  0 26]
 [ 2  0  8]
 [ 0 18  0]]


In [118]:
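# build K-means model and only Numpy can be used
# (this class shadows the sklearn KMeans imported for the baseline above)
class KMeans(object):
    def __init__(self):
        self.history = {}
        self.history['train_acc'] = []
        self.history['test_acc'] = []
    
    def relu(self, x):
        # clip negative scores to zero without modifying the input
        return np.maximum(x, 0)
        
    def train(self, x, y, x_test, y_test, learning_rate, n_iters):
        input_size = x.shape[1]
        output_size = len(np.unique(y))  # one cluster per class label
        # one weight column (cluster centre) per cluster
        self.W = np.ones((input_size, output_size))

        for i in range(n_iters):
            for xi in x:
                # similarity of the sample to each centre (dot product, not a distance)
                score = self.relu(np.matmul(xi, self.W))
                index = np.argmax(score)  # winning cluster
                # move the winning centre toward the sample
                self.W[:, index] += learning_rate * (xi - self.W[:, index])
        # report confusion matrices for the training and test sets
        y_pred = self.predict(x)
        evaluator(y, y_pred)
        y_pred = self.predict(x_test)
        evaluator(y_test, y_pred)

    def predict(self, x):
        # assign each sample to the cluster with the highest score
        y_pred = np.argmax(self.relu(np.matmul(x, self.W)), axis=1)
        return y_pred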

In [119]:
# initialize and train the model
model = KMeans()
model.train(x_train, y_train, x_test, y_test, learning_rate=0.01, n_iters=50)

Confusion matrix:
 [[ 47  60   0]
 [ 26  32   0]
 [  0   0 101]]
Confusion matrix:
 [[26 13  0]
 [ 8  2  0]
 [ 0  0 18]]
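
Cluster indices are arbitrary, so the columns of these confusion matrices need not line up with the species labels. One way to read them is to match clusters to classes before computing an accuracy. The sketch below does this with `scipy.optimize.linear_sum_assignment` (SciPy is allowed for evaluation) on the training matrix printed above; the helper name `align_clusters` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_clusters(cm):
    """Reorder confusion-matrix columns so matched clusters sit on the diagonal."""
    # Choose the one-to-one class/cluster matching that maximises the matched count
    rows, cols = linear_sum_assignment(cm, maximize=True)
    return cm[:, cols], cols

# Training confusion matrix copied from the output above
cm_train = np.array([[47, 60, 0],
                     [26, 32, 0],
                     [0, 0, 101]])

aligned, cols = align_clusters(cm_train)
accuracy = np.trace(aligned) / cm_train.sum()
print(aligned)
print(round(accuracy, 3))
```

Because the matching is chosen after seeing the labels, the resulting accuracy is an optimistic summary of cluster quality, not a classification score.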
