# Penguins and "Support Vector Machines"

We will be using the "Palmer Archipelago (Antarctica) Penguin" dataset with support vector machines. We will be looking to classify penguins such that we can identify the type of penguin. We begin by importing necessary libraries.

In [2]:
import numpy as np
from numpy.random import default_rng
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import CategoricalColorMapper
from sklearn.svm import SVC
rng = default_rng(5)
output_notebook()
import pandas as pd

We also use the following such that we can model our SVM.

In [3]:
def hyperplane(P,x,z=0):
    """Given an SVC object P and an array of vectors x, computes the hyperplane wx+b=z"""
    alphas = P.dual_coef_
    svs = P.support_vectors_
    c = P.intercept_[0]-z
    a = np.sum(alphas.T*svs,axis=0)[0]
    b = np.sum(alphas.T*svs,axis=0)[1]
    return (-c-a*x)/b

def pts(P):
    """Given an SVC object P, returns the two closest points in the associated reduced convex hulls."""
    alphas = P.dual_coef_[0]
    svs = P.support_vectors_
    plus_indices = np.where(alphas>0)
    minus_indices = np.where(alphas<=0)
    alphas = alphas.reshape(-1,1)
    pluspt = np.sum(alphas[plus_indices]*svs[plus_indices],axis=0)/np.sum(alphas[plus_indices])
    minuspt = np.sum(alphas[minus_indices]*svs[minus_indices],axis=0)/np.sum(alphas[minus_indices])
    return pluspt, minuspt

We are going to initally use the penguin's flipper length in order to classify the penguins based on their species. The three species are Adelie, Chinstrap and Gentoo. We will consider species by species in order to keep each classification consistent.

We start by grabbing the necessary data from the file.

In [4]:
data = np.genfromtxt("penguins_lter.csv", delimiter = ',', skip_header = 1)
labels = pd.read_csv('penguins_lter.csv')
labels = labels["Species"]
labels = np.array(labels)
pens = data[:,[12,13]]

In [5]:
for i in range(len(pens)):
    if np.isnan(pens[i][0]):
        print(i)

3
339


We now know which indexes have nan values. Lets remove them:

In [6]:
pens = np.delete(pens, 3, 0)
pens = np.delete(pens, 338, 0)
labels = np.delete(labels, 3, 0)
labels = np.delete(labels, 338, 0)

We must also associate colors to respective islands. We do so below:

In [7]:
colors = ['red','blue','green']

for i in range(len(labels)):
    if labels[i] == "Adelie Penguin (Pygoscelis adeliae)":
        labels[i] = 0
    elif labels[i] == 'Chinstrap penguin (Pygoscelis antarctica)':
        labels[i] = 1
    elif labels[i] == "Gentoo penguin (Pygoscelis papua)":
        labels[i] = 2
        
pen_colors = np.array([colors[i] for i in labels])

For our following figure, we have that red as the Adelie penguin, blue as the Chinstrap penguin and green as the Gentoo penguin.

In [8]:
f=figure(title='Penguin Data: Flipper length (mm) vs body mass (g)',x_range=[150,250],y_range=[2500,7000])
f.scatter(x=pens[:,0],y=pens[:,1], color = pen_colors)
show(f)

We can see that we have a solid way to distinguish who the Gentoo penguin is based on their flipper length versus their body mass. However, the Adelie and Chinstrap penguin would be much harder to classify based on our current features.

Lets continue to compare the penguins to each other. We will compare Gentoo versus Adelie and Gentoo versus Chinstrap seperately. We do this due to computing and time constraints for SVC. The constraints come from the similarities between the Adelie and Chinstrap penguins.

In [9]:
red_green = data[:, [12,13]]
blue_green = data[152:345, [12,13]]

for i in range(0,68):
    red_green = np.delete(red_green, 152, 0)


In [10]:
for i in range(len(red_green)):
    if np.isnan(red_green[i][0]):
        print(i)

for i in range(len(blue_green)):
    if np.isnan(blue_green[i][0]):
        print(i)

3
271
187


The following cleans our data of "nan" values. We create new labels arrays to use for our classification.

In [11]:
red_green = np.delete(red_green, 3, 0)
red_green = np.delete(red_green, 270, 0)

blue_green = np.delete(blue_green, 187, 0)

In [12]:
print(len(labels))
print()

342



In [13]:
labels1 = labels.copy()
labels2 = labels.copy()
labels2 = labels2[151:345]

for i in range(0,68):
    labels1 = np.delete(labels1, 152, 0)

In [14]:
print(len(labels1))
print(len(red_green))
print(len(labels2))
print(len(blue_green))

274
274
191
191


Our labels as well as our penguin comparison arrays are of the same size. We now move onto the next step in our calculations.

As we now have our two arrays that compare Adelie and Gentoo penguins as well as Chinstrap and Gentoo penguins. We continue with our calculations as we progress towards our classifications.

In [15]:
red_points = np.where(labels1==0)
green_points1 = np.where(labels1 ==2)

blue_points = np.where(labels2==1)
green_points2 = np.where(labels2 ==2)

In [16]:
red_vs_others = np.array([0 if x==0 else 1 for x in labels1])
green_vs_others1 = np.array([0 if x==2 else 1 for x in labels1])

blue_vs_others = np.array([0 if x ==1 else 1 for x in labels2])
green_vs_others2 = np.array([0 if x==2 else 1 for x in labels2])

In [17]:
Pgreen1 = SVC(kernel='linear',C=1000).fit(red_green,green_vs_others1)
Pgreen2 = SVC(kernel='linear',C=1000).fit(blue_green,green_vs_others2)

We are going to first compare the Adelie (red) penguin to the Gentoo (green) penguin.

In [18]:
f=figure(title='Penguin Data: Flipper length (mm) vs body mass (g)',x_range=[150,250],y_range=[2500,7000])
f.scatter(x=pens[:,0],y=pens[:,1], color = pen_colors)
x=np.linspace(150,250,100)
yred=hyperplane(Pgreen1,x)
y0 = hyperplane(Pgreen1,x,1)
y1 = hyperplane(Pgreen1,x,-1)
f.line(x=x,y=yred,line_width=3,color='black',line_dash='dashed',legend_label='red vs green')
f.line(x=x,y=y0,line_width=1,alpha=.5,color='black',line_dash='dashed',legend_label='red vs green')
f.line(x=x,y=y1,line_width=1,alpha=.5,color='black',line_dash='dashed',legend_label='red vs green')

show(f)

Next we compute the Chinstrap (blue) penguin versus the Gentoo (green) penguin.

In [19]:
f=figure(title='Penguin Data: Flipper length (mm) vs body mass (g)',x_range=[150,250],y_range=[2500,7000])
f.scatter(x=pens[:,0],y=pens[:,1], color = pen_colors)
x=np.linspace(150,250,100)
yblue=hyperplane(Pgreen2,x)
y0 = hyperplane(Pgreen2,x,1)
y1 = hyperplane(Pgreen2,x,-1)
f.line(x=x,y=yblue,line_width=3,color='black',line_dash='dashed',legend_label='blue vs green')
f.line(x=x,y=y0,line_width=1,alpha=.5,color='black',line_dash='dashed',legend_label='blue vs green')
f.line(x=x,y=y1,line_width=1,alpha=.5,color='black',line_dash='dashed',legend_label='blue vs green')

show(f)

We notice that the lines seperating each respective penguin from the Gentoo penguin is almost the same. This suggests we could easily classify the Gentoo penguin from others. We show that below:

We below try to predict values for the penguins using the line created by the Adelie and Gentoo penguins. We only use one of these lines due to their similarities. Here the black scatter points represents all non Gentoo penguins.

In [20]:
colors = ['green', 'black']

Pall1 = SVC(kernel='linear',C=1000).fit(red_green,green_vs_others1)
Pall1.predict(red_green)
predicted_colors = [colors[i] for i in Pall1.predict(red_green)]
f=figure(title='predicted classification',x_range=[150,250],y_range=[2500,7000])
f.scatter(x=red_green[:,0],y=red_green[:,1],color=predicted_colors)
ygreen=hyperplane(Pgreen1,x)
y0 = hyperplane(Pgreen1,x,1)
y1 = hyperplane(Pgreen1,x,-1)
f.line(x=x,y=ygreen,line_width=3,color='black',line_dash='dashed')
f.line(x=x,y=y0,line_width=1,alpha=.5,color='black',line_dash='dashed')
f.line(x=x,y=y1,line_width=1,alpha=.5,color='gray',line_dash='dashed')

show(f)

In [21]:
score = Pall1.score(red_green,green_vs_others1)
print('Classifier of Gentoo penguins yields accuracy of {:2f}%'.format(100 * score))

Classifier of Gentoo penguins yields accuracy of 98.175182%


As suggested earlier, we can pretty accurately determine if we have a Gentoo penguin or not based upon the weight and flipper length of a given penguin. What we want to find next is a way to classify the differences between the other two penguins.

We start by creating arrays that only consider the Adelie and Chinstrap penguins. We are going to attempt to classify these two only, as we have a confident method to classify the Gentoo penguin. We will need to consider new features. In interest of uniqueness, we will continue to consider flipper length. Our other feature will be culmen length.

In [22]:
rest = data[:219, [10, 12]]
rest = np.delete(rest, 3, 0)

labels3 = labels[:219]
labels3 = np.delete(labels3, 3, 0)

colors = ['red', 'blue']
rest_colors = np.array([colors[i] for i in labels3])

Red is back to being Adelie penguins, while blue is Chinstrap penguins. We find a comparitor to help predict values below.

In [23]:
red_points = np.where(labels3==0)
blue_points =np.where(labels3==1)

In [24]:
red_vs_others = np.array([0 if x==0 else 1 for x in labels3])

In [25]:
Pred = SVC(kernel='linear',C=1000).fit(rest,red_vs_others)

In [26]:
f=figure(title='Penguin Data: Flipper length (mm) vs culmen length (mm)',x_range=[25, 70],y_range=[150,250])
f.scatter(x=rest[:,0],y=rest[:,1], color = rest_colors)
x=np.linspace(25,70,100)
y=hyperplane(Pred,x)
y0 = hyperplane(Pred,x,1)
y1 = hyperplane(Pred,x,-1)
f.line(x=x,y=y,line_width=3,color='black',line_dash='dashed',legend_label='blue vs red')
f.line(x=x,y=y0,line_width=1,alpha=.5,color='black',line_dash='dashed',legend_label='blue vs red')
f.line(x=x,y=y1,line_width=1,alpha=.5,color='black',line_dash='dashed',legend_label='blue vs red')

show(f)

This line looks much more promising in terms of classifying the Adelie and Chinstrap penguins when compared to our previous classification attempt.

Lets now try to predict the Adelie and Chinstrap penguins such that we can check for accuracy of this classification model.

In [27]:
Pall2 = SVC(kernel='linear',C=1000).fit(rest,red_vs_others)
Pall2.predict(rest)
predicted_colors = [colors[i] for i in Pall2.predict(rest)]
f=figure(title='predicted classification',x_range=[25,70],y_range=[150,250])
f.scatter(x=rest[:,0],y=rest[:,1],color=predicted_colors)
y=hyperplane(Pred,x)
y0 = hyperplane(Pred,x,1)
y1 = hyperplane(Pred,x,-1)
f.line(x=x,y=y,line_width=3,color='black',line_dash='dashed')
f.line(x=x,y=y0,line_width=1,alpha=.5,color='black',line_dash='dashed')
f.line(x=x,y=y1,line_width=1,alpha=.5,color='gray',line_dash='dashed')

show(f)

In [28]:
score1 = Pall2.score(rest,red_vs_others)
print('Classifier of Adelie and Chinstrap penguins yields accuracy of {:2f}%'.format(100 * score1))

Classifier of Adelie and Chinstrap penguins yields accuracy of 95.871560%


We were able to succesfully classify Adelie and Chinstrap penguins! Our classifier is a little less efficient than the classifier used for that of the Gentoo penguins, but still has a high accuracy.

One way to classify our three given Antartica penguins from our data set would be to first check if the penguin is a Gentoo penguin or not. If it is, we are done with our classification. If it is not, we attempt to classify the penguin based upon if it is Adelie or Chinstrap. Below we show our total classifier accuracy:

In [29]:
total = score * score1 * 100
print('Total penguin classifier yields accuracy of {:2f}%'.format(total))

Total penguin classifier yields accuracy of 94.122079%
