## EE559 Assignment 6 : Classification using SVM, Anuran Calls Dataset

### @author : Suchismita Sahu, USCID : 7688176370

## Multi-class and Multi-Label Classification Using Support Vector Machines

### (a) Download the Anuran Calls (MFCCs) Data Set from: https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs). Choose 70% of the data randomly as the training set.

Dataset Information:

This dataset was used in several classifications tasks related to the challenge of anuran species recognition through their calls. It is a multilabel dataset with three columns of labels. These coefficients were normalized between -1 mfcc 1. The amount of instances per class are: 

Families: 
Bufonidae 68 
Dendrobatidae 542 
Hylidae 2165 
Leptodactylidae 4420 

Genus: 
Adenomera 4150 
Ameerega 542 
Dendropsophus 310 
Hypsiboas 1593 
Leptodactylus 270 
Osteocephalus 114 
Rhinella 68 
Scinax 148 

Species: 
AdenomeraAndre 672 
AdenomeraHylaedactâ€¦ 3478 
Ameeregatrivittata 542 
HylaMinuta 310 
HypsiboasCinerascens 472 
HypsiboasCordobae 1121 
LeptodactylusFuscus 270 
OsteocephalusOophaâ€¦ 114 
Rhinellagranulosa 68 
ScinaxRuber 148

In [1]:
# Importing Libraries
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from sklearn import svm
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix  
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, hamming_loss
import math
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

%matplotlib inline

In [2]:
# Reading the data
data = pd.read_csv(r"C:\Users\AbsurdFantasy\Documents\EE559 Assignments\Assignment 3\Anuran Calls (MFCCs)\Frogs_MFCCs.csv")  

#Exploratory Data Analysis
#data.shape
#data.head() 

# Spliting Data and Labels

X = data.drop(['Family', 'Genus', 'Species', 'RecordID'], axis=1)
Yfamily= pd.DataFrame(data['Family'])
Ygenus = pd.DataFrame(data['Genus'])
Yspecies = pd.DataFrame(data['Species'])
#X.shape
#Yfamily.shape

#Encoding the labels as int

Yfamily["Family"] = Yfamily["Family"].astype('category')
Yfamily.dtypes
Yfamily["Label"] = Yfamily["Family"].cat.codes

Ygenus["Genus"] = Ygenus["Genus"].astype('category')
Ygenus.dtypes
Ygenus["Label"] = Ygenus["Genus"].cat.codes

Yspecies["Species"] = Yspecies["Species"].astype('category')
Yspecies.dtypes
Yspecies["Label"] = Yspecies["Species"].cat.codes


### (b) Each instance has three labels: Families, Genus, and Species. Each of the labels has multiple classes. We wish to solve a multi-class and multi-label problem. One of the most important approaches to multi-class classification is to train a classifier for each label. We first try this approach:

### i. Research exact match and hamming score/ loss methods for evaluating multi-label classification and use them in evaluating the classifiers in this problem.

### ii. Train a SVM for each of the labels, using Gaussian kernels and one versus all classifiers. Determine the weight of the SVM penalty and the width of the Gaussian Kernel using 10 fold cross validation.1 You are welcome to try to solve the problem with both normalized and raw attributes and report the results.

In [3]:
# Function Defined for Parameter Grid Search using CV to get best SVM Penalty and Width of Gaussian Kernel

def svc_param_selection(X, y):
    Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]
    gammas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2]
    param_grid = {'C': Cs, 'gamma' : gammas}
    grid_search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=10)
    grid_search.fit(X, y)
    grid_search.best_params_
    return grid_search.best_params_

In [4]:
# SVM with Gaussian Kernel and One-vs-All Classification For Label Family

X_train, X_test, Yfamily_train, Yfamily_test = train_test_split(X, Yfamily['Family'], test_size = 0.30)

best_params = svc_param_selection(X_train, Yfamily_train)
print('The best SVM Penalty for Gaussian Kernel and One-vs All for Family Label is:', best_params['C'])
print('The best width of Gaussian Kernel for Family Label is:', best_params['gamma'])
svclassifier = SVC(kernel='rbf', C = best_params['C'], gamma = best_params['gamma'])  

svclassifier.fit(X_train, Yfamily_train) 
Yfamily_predsvm = svclassifier.predict(X_test)

Accuracy = svclassifier.score(X_test, Yfamily_test)
print('The Accuracy for Gaussian Kernel One-vs-All SVM for Family Label is:', Accuracy)

hammingGaussianFam = []
hammingGaussianFam = hamming_loss(Yfamily_test, Yfamily_predsvm)

print('The Hamming Loss is: %.8f' %hamming_loss(Yfamily_test, Yfamily_predsvm))

The best SVM Penalty for Gaussian Kernel and One-vs All for Family Label is: 10
The best width of Gaussian Kernel for Family Label is: 1.9
The Accuracy for Gaussian Kernel One-vs-All SVM for Family Label is: 0.9925891616489115
The Hamming Loss is: 0.00741084


In [5]:
# SVM with Gaussian Kernel and One-vs-All Classification For Label Genus

X_train, X_test, Ygenus_train, Ygenus_test = train_test_split(X, Ygenus['Genus'], test_size = 0.30)

best_params = svc_param_selection(X_train, Ygenus_train)
print('The best SVM Penalty for Gaussian Kernel and One-vs All for Genus Label is:', best_params['C'])
print('The best width of Gaussian Kernel for Genus Label is:', best_params['gamma'])
svclassifier = SVC(kernel='rbf', C = best_params['C'], gamma = best_params['gamma'])  

svclassifier.fit(X_train, Ygenus_train) 
Ygenus_predsvm = svclassifier.predict(X_test) 

Accuracy = svclassifier.score(X_test, Ygenus_test)
print('The Accuracy Gaussian Kernel One-vs-All SVM for Family Label is:', Accuracy)

hammingGaussianGen = []
hammingGaussianGen = hamming_loss(Ygenus_test, Ygenus_predsvm)

print('The Hamming Loss is: %.8f' %hamming_loss(Ygenus_test, Ygenus_predsvm))

The best SVM Penalty for Gaussian Kernel and One-vs All for Genus Label is: 10
The best width of Gaussian Kernel for Genus Label is: 2
The Accuracy Gaussian Kernel One-vs-All SVM for Family Label is: 0.9916628068550255
The Hamming Loss is: 0.00833719


In [6]:
# SVM with Gaussian Kernel and One-vs-All Classification For Label Species

X_train, X_test, Yspecies_train, Yspecies_test = train_test_split(X, Yspecies['Species'], test_size = 0.30)

best_params = svc_param_selection(X_train, Yspecies_train)
print('The best SVM Penalty for Gaussian Kernel and One-vs All for Species Label is:', best_params['C'])
print('The best width of Gaussian Kernel for Species Label is:', best_params['gamma'])
svclassifier = SVC(kernel='rbf', C = best_params['C'], gamma = best_params['gamma'])  

svclassifier.fit(X_train, Yspecies_train) 
Yspecies_predsvm = svclassifier.predict(X_test)

Accuracy = svclassifier.score(X_test, Yspecies_test) 
print('The Accuracy Gaussian Kernel One-vs-All SVM for Species Label is:', Accuracy)

hammingGaussianSp = []
hammingGaussianSp = hamming_loss(Yspecies_test, Yspecies_predsvm)

print('The Hamming Loss is: %.8f' %hamming_loss(Yspecies_test, Yspecies_predsvm))

The best SVM Penalty for Gaussian Kernel and One-vs All for Species Label is: 10
The best width of Gaussian Kernel for Species Label is: 1.9
The Accuracy Gaussian Kernel One-vs-All SVM for Species Label is: 0.9874942102825383
The Hamming Loss is: 0.01250579


In [7]:
# Concatenaing the Test and Predicted outputs
Ypredsvc = np.column_stack((Yfamily_predsvm, Ygenus_predsvm, Yspecies_predsvm))
Ytestsvc = np.column_stack((Yfamily_test, Ygenus_test, Yspecies_test))

In [8]:
# Function for Finding Exact Match Score:

def ExactMatchscore(Ypred, Ytrue):
    score = 0
    for i in range(Ypred.shape[0]):
        if False not in (Ypred[i,:]==np.array((Ytrue))[i,:]):
            score+=1
            
    return float(score/Ypred.shape[0])

In [9]:
ExactMatchScore = ExactMatchscore(Ypredsvc, Ytestsvc)
print('The Exact Match Score for SVM Classifier with Gaussian Kernel and One-vs-all is:', ExactMatchScore)

The Exact Match Score for SVM Classifier with Gaussian Kernel and One-vs-all is: 0.9726725335803613


In [10]:
# Function for calculating Hamming Loss for the Multiclass-Multilabel Problem

def HammingLoss(Ypred, Ytrue):
    score = 0
    for i in range(Ypred.shape[0]):
        for j in range(0,3):
            if False == (Ypred[i,j]==Ytrue[i,j]):
                score+=1
            
    return float(score/Ypred.shape[0]*3)


In [11]:
# Hamming Loss for the Multiclass-Multilabel Problem in Gaussian SVM

#Hamming = ((hammingGaussianFam + hammingGaussianGen + hammingGaussianSp)/3)
#print('The Hamming Loss for the Gaussian One vs. All SVM is:', Hamming)

Hamming_Loss_SVC = HammingLoss(Ypredsvc, Ytestsvc)
print('The Hamming Loss for SVM Classifier with Gaussian Kernel and One-vs-all is:', Hamming_Loss_SVC)


The Hamming Loss for SVM Classifier with Gaussian Kernel and One-vs-all is: 0.08476146364057434


### iii. Repeat 1(b)ii with L1-penalized SVMs. Remember to normalize the attributes.

In [12]:
# Defining Function to find Best weight of SVM Penalty for L1-Penalized SVM 

def linearsvc_param_selection(X, y):
    Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]
    param_grid = {'C': Cs}
    grid_search = GridSearchCV(svm.LinearSVC(penalty = 'l1', dual = False), param_grid, cv=10)
    grid_search.fit(X, y)
    grid_search.best_params_
    return grid_search.best_params_

In [13]:
# L1-Penalized SVM for Family Label

X_train, X_test, Yfamily_train, Yfamily_test = train_test_split(X, Yfamily['Family'], test_size = 0.30)

best_params2 = linearsvc_param_selection(X_train, Yfamily_train)
print('The best Penalty for L1-Penalized SVM for Family Label is:', best_params2['C'])

svmpenalized = LinearSVC(penalty='l1', C = best_params2['C'], dual = False)  
svmpenalized.fit(X_train, Yfamily_train) 
Yfamily_predl1 = svmpenalized.predict(X_test)

Accuracy = svmpenalized.score(X_test, Yfamily_test) 
print('The Accuracy for L1-Penalized SVM for Family Label is:', Accuracy)

hammingGaussianFam1 = []
hammingGaussianFam1 = hamming_loss(Yfamily_test, Yfamily_predl1)

print('The Hamming Loss is: %.8f' %hamming_loss(Yfamily_test, Yfamily_predl1))

The best Penalty for L1-Penalized SVM for Family Label is: 10
The Accuracy for L1-Penalized SVM for Family Label is: 0.9305233904585456
The Hamming Loss is: 0.06947661


In [14]:
# L1-Penalized SVM for Genus Label

X_train, X_test, Ygenus_train, Ygenus_test = train_test_split(X, Ygenus['Genus'], test_size = 0.30)

best_params2 = linearsvc_param_selection(X_train, Ygenus_train)
print('The best Penalty for L1-Penalized SVM for Genus Label is:', best_params2['C'])

svmpenalized = LinearSVC(penalty='l1', C = best_params2['C'], dual = False)  
svmpenalized.fit(X_train, Ygenus_train) 
Ygenus_predl1 = svmpenalized.predict(X_test)

Accuracy = svmpenalized.score(X_test, Ygenus_test)  
print('The Accuracy for L1-Penalized SVM for Genus Label is:', Accuracy)

hammingGaussianGen1 = []
hammingGaussianGen1 = hamming_loss(Ygenus_test, Ygenus_predl1)

print('The Hamming Loss is: %.8f' %hamming_loss(Ygenus_test, Ygenus_predl1))

The best Penalty for L1-Penalized SVM for Genus Label is: 100
The Accuracy for L1-Penalized SVM for Genus Label is: 0.9481241315423807
The Hamming Loss is: 0.05187587


In [15]:
# L1-Penalized SVM for Species Label

X_train, X_test, Yspecies_train, Yspecies_test = train_test_split(X, Yspecies['Species'], test_size = 0.30)

best_params2 = linearsvc_param_selection(X_train, Yspecies_train)
print('The The best Penalty for L1-Penalized SVM for Species Label is:', best_params2['C'])

svmpenalized = LinearSVC(penalty='l1', C = best_params2['C'], dual = False)  
svmpenalized.fit(X_train, Yspecies_train) 
Yspecies_predl1 = svmpenalized.predict(X_test)

Accuracy = svmpenalized.score(X_test, Yspecies_test)  
print('The Accuracy for L1-Penalized SVM for Species Label is:', Accuracy)

hammingGaussianSp1 = []
hammingGaussianSp1 = hamming_loss(Yspecies_test, Yspecies_predl1)

print('The Hamming Loss is: %.8f' %hamming_loss(Yspecies_test, Yspecies_predl1))

The The best Penalty for L1-Penalized SVM for Species Label is: 10
The Accuracy for L1-Penalized SVM for Species Label is: 0.952292728114868
The Hamming Loss is: 0.04770727


In [16]:
# Concatenaing the Test and Predicted outputs

Ypredl1 = np.column_stack((Yfamily_predl1, Ygenus_predl1, Yspecies_predl1))
Ytestl1 = np.column_stack((Yfamily_test, Ygenus_test, Yspecies_test))

In [17]:
#Exact Match Score for L1-Penalized SVM

ExactMatchScore1 = ExactMatchscore(Ypredl1, Ytestl1)
print('The Exact Match Score for L1-Penalized SVM Classifier is:', ExactMatchScore1)

The Exact Match Score for L1-Penalized SVM Classifier is: 0.8388142658638258


In [18]:
# Hamming Loss for the Multiclass-Multilabel Problem using L1-Penalized SVM 

Hamming2 = ((hammingGaussianFam1 + hammingGaussianGen1 + hammingGaussianSp1)/3)
#print('The Hamming Loss for L1-Penalized SVM is:', Hamming2)

Hamming_Loss_L1 = HammingLoss(Ypredl1, Ytestl1)
print('The Mean Hamming Loss for L1-Penalized SVM Classifier is:', Hamming2)
print('The Hamming Loss for L1-Penalized SVM Classifier Combined is:', Hamming_Loss_L1)

The Mean Hamming Loss for L1-Penalized SVM Classifier is: 0.056353249961401876
The Hamming Loss for L1-Penalized SVM Classifier Combined is: 0.507179249652617


### iv. Repeat 1(b)iii by using SMOTE or any other method you know to remedy class imbalance. Report your conclusions about the classifiers you trained.

In [19]:
# L1-Penalized SVM using SMOTE for Label Family

X_train, X_test, Yfamily_train, Yfamily_test = train_test_split(X, Yfamily['Family'], test_size = 0.3)

sm = SMOTE(kind = 'svm')
X_train_res, Yfamily_train_res = sm.fit_sample(X_train, Yfamily_train)

best_params3 = linearsvc_param_selection(X_train_res, Yfamily_train_res)
print('The best SVM Penalty for L1-Penalized SVM using SMOTE for Family Label is:', best_params3['C']) 

svmpenalized = LinearSVC(penalty='l1', C = best_params3['C'], dual = False)
svmpenalized.fit(X_train_res, Yfamily_train_res)
Yfamily_predsmote = svmpenalized.predict(X_test)

Accuracy = svmpenalized.score(X_test, Yfamily_test) 
print('The Accuracy for L1-Penalized SVM using SMOTE for Family Label is:', Accuracy)

hammingGaussianFam2 = []
hammingGaussianFam2 = hamming_loss(Yfamily_test, Yfamily_predsmote)

print('The Hamming Loss is: %.8f' %hamming_loss(Yfamily_test, Yfamily_predsmote))

The best SVM Penalty for L1-Penalized SVM using SMOTE for Family Label is: 10000
The Accuracy for L1-Penalized SVM using SMOTE for Family Label is: 0.9036591014358499
The Hamming Loss is: 0.09634090


In [23]:
# L1-Penalized SVM using SMOTE for Label Genus

X_train, X_test, Ygenus_train, Ygenus_test = train_test_split(X, Ygenus['Genus'], test_size = 0.3)

sm = SMOTE(kind = 'svm')
X_train_res, Ygenus_train_res = sm.fit_sample(X_train, Ygenus_train)

best_params3 = linearsvc_param_selection(X_train_res, Ygenus_train_res)
print('The best SVM Penalty for L1-Penalized SVM using SMOTE for Genus Label is:', best_params3['C']) 

svmpenalized = LinearSVC(penalty='l1', C = best_params3['C'], dual = False)
svmpenalized.fit(X_train_res, Ygenus_train_res)
Ygenus_predsmote = svmpenalized.predict(X_test)

Accuracy = svmpenalized.score(X_test, Ygenus_test) 
print('The Accuracy for L1-Penalized SVM using SMOTE for Genus Label is:', Accuracy)

hammingGaussianGen2 = []
hammingGaussianGen2 = hamming_loss(Ygenus_test, Ygenus_predsmote)

print('The Hamming Loss is: %.8f' %hamming_loss(Ygenus_test, Ygenus_predsmote))

The best SVM Penalty for L1-Penalized SVM using SMOTE for Genus Label is: 100
The Accuracy for L1-Penalized SVM using SMOTE for Genus Label is: 0.8818897637795275
The Hamming Loss is: 0.11811024


In [24]:
# L1-Penalized SVM using SMOTE for Label Species

X_train, X_test, Yspecies_train, Yspecies_test = train_test_split(X, Yspecies['Species'], test_size = 0.3)

sm = SMOTE(kind = 'svm')
X_train_res, Yspecies_train_res = sm.fit_sample(X_train, Yspecies_train)

best_params3 = linearsvc_param_selection(X_train_res, Yspecies_train_res)
print('The best SVM Penalty for L1-Penalized SVM using SMOTE for Species Label is:', best_params3['C']) 

svmpenalized = LinearSVC(penalty='l1', C = best_params3['C'], dual = False)
svmpenalized.fit(X_train_res, Yspecies_train_res)
Yspecies_predsmote = svmpenalized.predict(X_test)

Accuracy = svmpenalized.score(X_test, Yspecies_test) 
print('The Accuracy for L1-Penalized SVM using SMOTE for Species Label is:', Accuracy)

hammingGaussianSp2 = []
hammingGaussianSp2 = hamming_loss(Yspecies_test, Yspecies_predsmote)

print('The Hamming Loss is: %.8f' %hamming_loss(Yspecies_test, Yspecies_predsmote))

The best SVM Penalty for L1-Penalized SVM using SMOTE for Species Label is: 100
The Accuracy for L1-Penalized SVM using SMOTE for Species Label is: 0.9124594719777674
The Hamming Loss is: 0.08754053


In [25]:
# Hamming Loss for the Multiclass-Multilabel Problem using L1-Penalized SVM using SMOTE

Hamming3 = ((hammingGaussianFam2 + hammingGaussianGen2 + hammingGaussianSp2)/3)
#print('The Hamming Loss for L1-Penalized SVM using SMOTE is:', Hamming3) 

Ypredsmote = np.column_stack((Yfamily_predsmote, Ygenus_predsmote, Yspecies_predsmote))
Ytestsmote = np.column_stack((Yfamily_test, Ygenus_test, Yspecies_test))

ExactMatchScore2 = ExactMatchscore(Ypredsmote, Ytestsmote )
print('The Exact Match Score for L1-Penalized SVM Classifier using SMOTE is:', ExactMatchScore2)

Hamming_Loss_smote = HammingLoss(Ypredsmote, Ytestsmote)
print('The Mean Hamming Loss for L1-Penalized SVM using SMOTE is:', Hamming3)
print('The Hamming Loss for L1-Penalized SVM using SMOTE is:', Hamming_Loss_smote)

The Exact Match Score for L1-Penalized SVM Classifier using SMOTE is: 0.7285780453913849
The Mean Hamming Loss for L1-Penalized SVM using SMOTE is: 0.10066388760228501
The Hamming Loss for L1-Penalized SVM using SMOTE is: 0.905974988420565


## Conclusion

## For Family Label

### SVM classifier with Gaussian Kernel and One-vs All :
##### The best SVM Penalty is: 
10 
##### The best width of Gaussian Kernel is : 
1.9
##### The Accuracy is: 
0.9925891616489115
##### The Hamming Loss is: 
0.00741084

### L1-Penalized SVM classifier :
##### The best Penalty is: 
10
##### The Accuracy is: 
0.9302501157943493
##### The Hamming Loss is: 
0.069497661

### L1-Penalized SVM classifier using SMOTE :
##### The best SVM Penalty is: 
10000
##### The Accuracy is: 
0.90365910143584
##### The Hamming Loss is: 
0.09634090

## For Genus Label

### SVM classifier with Gaussian Kernel and One-vs All :
##### The best SVM Penalty is: 
10
##### The best width of Gaussian Kernel is: 
2
##### The Accuracy is: 
0.9916628068550255
##### The Hamming Loss is: 
0.00833719

### L1-Penalized SVM classifier :
##### The best Penalty is: 
100
##### The Accuracy is: 
0.9481241315423
##### The Hamming Loss is: 
0.05187587

### L1-Penalized SVM classifier using SMOTE :
##### The best SVM Penalty is : 
100
##### The Accuracy is : 
0.8809634089856415
##### The Hamming Loss is: 
0.11811024


## For Species Label

### SVM classifier with Gaussian Kernel and One-vs All :
##### The best SVM Penalty is: 
10
##### The best width of Gaussian Kernel is: 
1.9
##### The Accuracy is: 
0.9874942102825383
##### The Hamming Loss is: 
0.01250579

### L1-Penalized SVM classifier :
##### The The best Penalty is: 
10
##### The Accuracy is: 
0.960629921259842
##### The Hamming Loss is: 
0.04770727

### L1-Penalized SVM classifier using SMOTE :
##### The best SVM Penalty is: 
100
##### The Accuracy is: 
0.9128392774432608
##### The Hamming Loss is: 
0.08754053

### Exact Match Scores for the Combined Multiclass and Multilabel Classifier :
#### SVM classifier with Gaussian Kernel and One-vs All : 
0.972672533580
#### L1-Penalized SVM classifier : 
0.948124131542
#### L1-Penalized SVM classifier using SMOTE :
0.7285780453913



### Hamming Loss for the Combined Multiclass and Multilabel Classifier :
#### SVM classifier with Gaussian Kernel and One-vs All : 
0.084761463640
#### L1-Penalized SVM classifier : 
0.50717924965
#### L1-Penalized SVM classifier using SMOTE : 
0.9059749884205

## v. Extra Practice: Study the Classifier Chain method and apply it to the aboveproblem.

In [None]:
# Classifier Chain Method for Multiclass-Multilabel Problem.

X_train, X_test, Yfamily_train, Yfamily_test = train_test_split(X, Yfamily['Label'], test_size = 0.3)
X_train, X_test, Ygenus_train, Ygenus_test = train_test_split(X, Ygenus['Label'], test_size = 0.3)
X_train, X_test, Yspecies_train, Yspecies_test = train_test_split(X, Yspecies['Label'], test_size = 0.3)

y_test = np.column_stack((Yfamily_test, Ygenus_test, Yspecies_test))
y_train = np.column_stack((Yfamily_train, Ygenus_train, Yspecies_train))

classifier = ClassifierChain(svm.SVC(C=100))
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

accuracy_score(y_test, predictions)
