### Multi-class and Multi-Label Classification Using Support Vector Machines on Anuran Calls (MFCCs) Data Set

### (a) Download the Anuran Calls (MFCCs) Data Set from: https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs). 

Choose 70% of the data randomly as the training set.

In [0]:
import pandas as pd
import sklearn
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import hamming_loss
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import scipy as s
from imblearn.over_sampling import SMOTE

In [0]:
df= pd.read_csv("Frogs_MFCCs.csv")

In [0]:
X= df.drop(columns=['Family','Genus','Species','RecordID'])
y= df[['Family','Genus','Species']]
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size= 0.3,shuffle= True)

### (b) Each instance has three labels: Families, Genus, and Species. Each of the labels has multiple classes. We wish to solve a multi-class and multi-label problem. One of the most important approaches to multi-class classification is to train a classifier for each label.

In [0]:
classifier = SVC()

In [0]:
cv_score=[0,0,0]
cv_score_best=[0,0,0]
c_best=[0,0,0]
g_best=[0,0,0]

In [0]:
for i in range(0,3):
    for c in range(-3,4):
        for sigma in range(1,20,1):
            cv_score[i]= cross_val_score(SVC(C= 10**c,gamma= 1/(2*(sigma/10)**2),decision_function_shape= 'ovr'),X_train,y_train.iloc[:,i], cv=10, scoring='accuracy').mean()
            if cv_score[i] > cv_score_best[i]:
                cv_score_best[i]= cv_score[i]
                c_best[i]= 10**c
                g_best[i]= 1/(2*(sigma/10)**2)

In [0]:
c_best

[100, 10, 10]

In [0]:
g_best

[3.1249999999999996, 1.0204081632653064, 1.3888888888888888]

In [0]:
classifier1 = SVC(C= c_best[0], gamma= g_best[0],decision_function_shape= 'ovr')
clf1 = classifier1.fit(X_train,y_train.iloc[:,0])
y_pred1= clf1.predict(X_test)

classifier2 = SVC(C= c_best[1], gamma= g_best[1],decision_function_shape= 'ovr')
clf2 = classifier2.fit(X_train,y_train.iloc[:,1])
y_pred2= clf2.predict(X_test)

classifier3 = SVC(C= c_best[2], gamma= g_best[2],decision_function_shape= 'ovr')
clf3 = classifier3.fit(X_train,y_train.iloc[:,2])
y_pred3= clf3.predict(X_test)

In [0]:
y_pred= [y_pred1, y_pred2, y_pred3]

### i. Exact match and hamming score/ loss methods for evaluation of multilabel classification 

In [0]:
exact_match=0
for i in range(0, len(y_pred1)):
    if (y_pred1[i]== y_test.iloc[i,0] and y_pred2[i]== y_test.iloc[i,1] and y_pred3[i]== y_test.iloc[i,2]):
        exact_match = exact_match +1
exact_match= exact_match/len(y_pred1)

In [0]:
exact_match

0.9856415006947661

In [0]:
hamming_loss= pd.Series( s.zeros( len(y_pred1)) )
for i in range(0,len(y_pred1)):
    h_loss=0
    for j in range(0,3):
        if (y_pred[j][i] != y_test.iloc[i,j]):
            h_loss= h_loss +1;
        h_loss= h_loss/3
    hamming_loss[i]= h_loss
hamming_loss_value= hamming_loss.mean()       

In [0]:
hamming_loss_value

0.005626747636937539

### iii. Repeat 1(b)ii with L1-penalized SVMs. 

Let's normalize the attributes.

In [0]:
classifier = LinearSVC()

In [0]:
cv_score=[0,0,0]
cv_score_best=[0,0,0]
c_best=[0,0,0]

In [0]:
for i in range(0,3):
    for c in range(-3,4):
        cv_score[i]= cross_val_score(LinearSVC(penalty ='l1',C= 10**c, dual=False),X_train,y_train.iloc[:,i], cv=10, scoring='accuracy').mean()
        if cv_score[i]> cv_score_best[i]:
            cv_score_best[i]= cv_score[i]
            c_best[i]= 10**c

In [13]:
c_best

[10, 100, 10]

In [0]:
classifier1 = LinearSVC(penalty='l1', C= c_best[0],dual=False)
clf1 = classifier1.fit(X_train,y_train.iloc[:,0])
y_pred1= clf1.predict(X_test)

classifier2 = LinearSVC(penalty='l1', C= c_best[1], dual= False)
clf2 = classifier2.fit(X_train,y_train.iloc[:,1])
y_pred2= clf2.predict(X_test)

classifier3 = LinearSVC(penalty='l1', C= c_best[2], dual= False)
clf3 = classifier3.fit(X_train,y_train.iloc[:,2])
y_pred3= clf3.predict(X_test)

In [0]:
y_pred= [y_pred1, y_pred2, y_pred3]

In [0]:
exact_match=0
for i in range(0, len(y_pred1)):
    if (y_pred1[i]== y_test.iloc[i,0] and y_pred2[i]== y_test.iloc[i,1] and y_pred3[i]== y_test.iloc[i,2]):
        exact_match = exact_match +1
exact_match= exact_match/len(y_pred1)

In [17]:
exact_match

0.9157017137563687

In [0]:
hamming_loss= pd.Series( s.zeros( len(y_pred1)) )
for i in range(0,len(y_pred1)):
    h_loss=0
    for j in range(0,3):
        if (y_pred[j][i] != y_test.iloc[i,j]):
            h_loss= h_loss +1;
        h_loss= h_loss/3
    hamming_loss[i]= h_loss
hamming_loss_value= hamming_loss.mean() 

In [19]:
hamming_loss_value

0.024668485066817625

### iv. Repeat 1(b)iii by using SMOTE to remedy class imbalance. 

In [0]:
smote = SMOTE( kind='svm' )
X_smote1, y_smote1 = smote.fit_sample( X_train, y_train.iloc[:,0])
X_smote2, y_smote2 = smote.fit_sample( X_train, y_train.iloc[:,1])
X_smote3, y_smote3 = smote.fit_sample( X_train, y_train.iloc[:,2])

In [0]:
X_smote= [X_smote1, X_smote2, X_smote3 ]
y_smote=[y_smote1, y_smote2, y_smote3]

In [0]:
cv_score=[0,0,0]
cv_score_best=[0,0,0]
c_best=[0,0,0]

In [0]:
for i in range(0,3):
    for c in range(-3,4):
        cv_score[i]= cross_val_score(LinearSVC(penalty ='l1',C= 10**c, dual=False),X_smote[i],y_smote[i], cv=10, scoring='accuracy').mean()
        if cv_score[i]> cv_score_best[i]:
            cv_score_best[i]= cv_score[i]
            c_best[i]= 10**c

In [0]:
c_best

[100, 100, 100]

In [0]:
classifier1 = LinearSVC(penalty='l1', C= c_best[0],dual=False)
clf1 = classifier1.fit(X_smote[0],y_smote[0])
y_pred1= clf1.predict(X_test)

classifier2 = LinearSVC(penalty='l1', C= c_best[1], dual= False)
clf2 = classifier2.fit(X_smote[1],y_smote[1])
y_pred2= clf2.predict(X_test)

classifier3 = LinearSVC(penalty='l1', C= c_best[2], dual= False)
clf3 = classifier3.fit(X_smote[2],y_smote[2])
y_pred3= clf3.predict(X_test)

In [0]:
y_pred= [y_pred1, y_pred2, y_pred3]

In [0]:
exact_match=0
for i in range(0, len(y_pred1)):
    if (y_pred1[i]== y_test.iloc[i,0] and y_pred2[i]== y_test.iloc[i,1] and y_pred3[i]== y_test.iloc[i,2]):
        exact_match = exact_match +1
exact_match= exact_match/len(y_pred1)

In [0]:
exact_match

0.8119499768411301

In [0]:
hamming_loss= pd.Series( s.zeros( len(y_pred1)) )
for i in range(0,len(y_pred1)):
    h_loss=0
    for j in range(0,3):
        if (y_pred[j][i] != y_test.iloc[i,j]):
            h_loss= h_loss +1;
        h_loss= h_loss/3
    hamming_loss[i]= h_loss
hamming_loss_value= hamming_loss.mean() 

In [0]:
hamming_loss_value

0.04662652462559831

### Conclusion

SVM with Gaussian Kernel gives the maximum exact match and minimum hamming score.
For Linear Kernel, Using SMOTE exact match reduces and hamming score increases as compared to Linear Kernel on data directly.