 # 1. Identify the problem

This image set contains two phenotypes: the so-called positive cells, which generate "vesicle-type" spots, and the so-called negative cells, which are normal.

![](./exercises/svm/images/image.png)

## 1.1 Objective

Since the labels in the data are discrete, the prediction falls into one of two categories (i.e. Positive Cell or Negative Cell). In machine learning this is a classification problem.    
> Thus, the goal is to classify whether a cell is positive or negative and to evaluate the accuracy of the model with different classifiers.

## 1.2 Identify data source

We used image set [BBBC016v1](https://data.broadinstitute.org/bbbc/BBBC016/)[<sup>1</sup>](#fn1).   
We used only part of this dataset: the wells O06, O07, O16 and O22.    
Features were generated by CellProfiler and classes were annotated manually. The dataset contains **40 samples of positive and negative cells**.
* The first two columns in the dataset contain the *labels* (Positives, Negatives) and the *dose* applied to each well.
* Columns 3 - 5 contain the well position on the plate, the unique ID of the image and the object number, respectively.
* Columns 6 - 155 contain *features* computed from images of the cell nuclei and cell cytoplasm, which can be used to build a model to predict the phenotype of the cells.

## 1.3 Load libraries

In [None]:
# a) Importing libraries.
import numpy as np

# b) Replace the occurrences of ... to import the pandas library using an import statement.
... ... as pd

## 1.4 Load dataset

In [None]:
# c) Importing the dataset.

# Replace the occurrences of ... with a string path 
# to your file (i.e. dataset.csv is located in ML/exercises/svm/features). 
file = ...

# Replace the occurrences of ... to load the dataset.csv file (path assigned previously) 
# using the pandas read_csv function. 
dataset = ...

## 1.5 Inspecting the data

The first step is to visually inspect the dataset. There are multiple ways to achieve this:
* The easiest is to request the first few records using the `dataset.head()` method. By default, `dataset.head()` returns the first 5 rows.
* Alternatively, one can use `dataset.tail()` to return the last five rows of the data.
* For both methods, the number of rows can be specified by passing it between the parentheses when calling the method.
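
The behaviour of `head()` and `tail()` described above can be tried on a small throwaway table before touching the real data. This is only a sketch: the `toy` DataFrame and its column names are made up for illustration and are not part of the exercise dataset.

```python
import pandas as pd

# Hypothetical 12-row table, used only to illustrate head() and tail().
toy = pd.DataFrame({"cell_id": range(12), "area": [10.0 + i for i in range(12)]})

first_default = toy.head()   # no argument: first 5 rows
first_three = toy.head(3)    # first 3 rows
last_two = toy.tail(2)       # last 2 rows

print(first_three)
print(last_two)
```

Passing a number changes how many rows come back, and leaving it out falls back to five.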

In [None]:
# d) Print the dataset.
# After reading the notes above, try to display the first ten rows of the dataset. Fill in the occurrences of ...
...

You can check the number of rows and columns using the shape attribute.

In [None]:
# e) Replace the occurrence ... with the shape attribute to get the number of rows and number of columns
...

In the result displayed, you should have 40 records with 155 columns.   
The `info()` method provides a concise summary of the data: the type of data in each column, the number of non-null values in each column, and how much memory the data frame is using.

In [None]:
# f) Replace the occurrence ... with "info(verbose = True)" to get the data type of each column
...

From the results above, Label, Dose and Well are *categorical variables* and the others are floating-point or integer values.
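
As a complement to reading the `info()` output by eye, the sketch below shows one possible way to list the non-numeric columns programmatically. It runs on a made-up four-column table (the values and the `Area` and `Count` column names are assumptions), not on the exercise dataset.

```python
import pandas as pd

# Hypothetical miniature table mixing text and numeric columns.
toy = pd.DataFrame({
    "Label": ["Positive", "Negative", "Positive"],
    "Well": ["O06", "O07", "O16"],
    "Area": [120.5, 98.0, 143.2],
    "Count": [3, 0, 5],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# dtypes lists one data type per column.
print(toy.dtypes)

# Columns that are not numeric are the candidates for categorical variables.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(text_cols)
```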

# 2. Pre-processing the dataset

Data preprocessing is a crucial step in any data analysis problem. It is often a good idea to prepare your data in a way that best exposes the structure of the problem to the machine learning algorithms you intend to use. This involves a number of activities such as:
* Assigning numerical values to categorical data, because the SVM classifier works only with numerical values.
* Handling missing values.
* Dividing data into attribute and label sets.
* Dividing data into training and test sets.

## 2.1 Objective

> The goal here is to encode the class Label in a list y and get attributes in an array X. Then split the data into a *training set* and *test set*.

## 2.2 Split features and labels into new sets and encode the labels as integers

In [None]:
# a) To select a column from the dataset use dataset['feature']. Replace the occurrence ... to select the Label feature
y = ...

# Transform the class labels from their original string representation (positive and negative) into integers
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

# b) Use the method drop(columns=['feature_1', 'feature_2', ...]) 
# to drop the unnecessary features 'Label', 'Dose', 'Well', 'ImageNumber', 'ObjectNumber'
# and assign the output to the X variable
... = ...

# c) Replace the occurrence ... to display the first 5 rows of the X dataset with the head() method
print(..., y)

> After encoding the Label, the cell phenotypes are represented as class 1 (i.e. positive cell) and class 0 (i.e. negative cell), respectively.
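
The mapping to 0 and 1 follows from how `LabelEncoder` works: it sorts the distinct labels and uses each label's position as its integer code. The sketch below illustrates this on made-up label strings; the exact spelling of the labels in the real dataset is an assumption here.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings; the real column may spell them differently.
labels = ["Positive", "Negative", "Negative", "Positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted labels; the index of each label is its code.
print(encoder.classes_)
print(encoded)
```

Because "Negative" sorts before "Positive", negatives receive code 0 and positives code 1.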

## 2.3 Split data into training and test sets

The simplest way to evaluate the performance of a machine learning algorithm is to use different training and test datasets. Here:
* The *train_test_split* function from the model_selection module of scikit-learn splits the available data into a training set and a test set (70% training, 30% test).
* We pass three parameters to train_test_split: the first is the X dataset, the second is the y dataset and the third is the size of the test set (i.e. 30% = 0.30).
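
To see what a 70/30 split does to the row counts, here is a minimal sketch on synthetic arrays with 40 rows. The array names and the `random_state=0` argument (added only to make the shuffle reproducible) are assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 40 samples with 2 features, 20 per class.
X_toy = np.arange(80).reshape(40, 2)
y_toy = np.array([0] * 20 + [1] * 20)

# test_size=0.30 keeps 30% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0
)

print(X_tr.shape, X_te.shape)
```

With 40 samples this leaves 28 rows for training and 12 for testing.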

In [None]:
# a) Follow the instructions above to fill in the occurrences ...

from sklearn.model_selection import ...

X_train, X_test, y_train, y_test = train_test_split(..., ..., test_size = ...)

# b) Fill in the occurrences ... to display the number of rows 
# and number of columns for the two variables (X_train and X_test)
...

# 3. Predictive model using Support Vector Machine (SVM)

## 3.1 Objective

> The goal is to fit a linear model to the data using the *SVC* class from the svm module of scikit-learn.

In [None]:
# Follow the instructions and fill in the occurrences ...

# a) Create an SVM classifier and train it on 70% of the data set.
# Use the support vector classifier class, which is written as SVC in scikit-learn's svm module. 
# Here we pass one parameter, the kernel type 'linear'.
# We will see non-linear kernels in section 5.

from sklearn.svm import ...
svclassifier = ...(kernel= ... )

# b) The fit method of SVC class is called to train the algorithm on the training data (X_train, y_train).
# The training data is passed as a parameter to the fit method.

...(... , ...)

From the above result you will see the important parameters in kernel SVMs:
* Regularization parameter C.

* The choice of the kernel (linear, radial basis function(RBF) or polynomial).

* Kernel-specific parameters.
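
The parameters listed above can also be inspected directly with `get_params()`, which every scikit-learn estimator provides. The sketch below uses a separate `model` variable so as not to interfere with the exercise. Here `gamma` and `degree` are the kernel-specific parameters: `degree` is used only by the polynomial kernel, and `gamma` by the 'rbf', 'poly' and 'sigmoid' kernels.

```python
from sklearn.svm import SVC

# A separate, untrained classifier used only to look at its settings.
model = SVC(kernel="linear")
params = model.get_params()

# C is the regularization parameter; gamma and degree are kernel-specific.
print(params["C"], params["kernel"], params["gamma"], params["degree"])
```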


## 3.2 Making predictions

To make predictions, the predict method of the SVC class is used.
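
As an illustration of the fit-then-predict pattern, here is a minimal sketch on four hand-made, linearly separable points. All names and values are invented for the example.

```python
from sklearn.svm import SVC

# Four hypothetical training points: class 0 near the origin, class 1 far from it.
X_toy = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
y_toy = [0, 0, 1, 1]

toy_clf = SVC(kernel="linear")
toy_clf.fit(X_toy, y_toy)

# predict takes a 2D array-like (one row per sample) and returns one label per row.
toy_pred = toy_clf.predict([[0.0, 0.5], [3.0, 3.5]])
print(toy_pred)
```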

In [None]:
# a) Fill in the occurrences to make predictions on the test set (X_test) 
y_pred = ...(...)

# 4. Model Accuracy

## 4.1 Confusion matrix

The *confusion matrix* is one of the most commonly used tools for evaluating classification tasks. Scikit-Learn's metrics module contains the confusion_matrix function, which can be readily used to find the values of these important metrics.   
The confusion matrix is essentially a *two-dimensional table* where the classifier's prediction is on one axis (horizontal) and the ground truth is on the other (vertical), as shown below. Each axis can take two values (as depicted).   

Model says "+" | Model says "-" | Ground truth
---------------|----------------|-----------
`True positive`|`False negative`|Actual: "+"
`False positive`|`True negative`|  Actual: "-"
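
One detail worth checking when reading scikit-learn's output against the table above: `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, ordered by label value. With classes 0 and 1 the top-left cell is therefore the true negatives, not the true positives. The sketch below shows this on invented labels and uses the `labels` argument to get the same orientation as the table.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0]

# Default ordering (0 then 1): [[TN, FP], [FN, TP]].
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print(tn, fp, fn, tp)

# Listing the positive class first gives the layout of the table: [[TP, FN], [FP, TN]].
cm_table_order = confusion_matrix(y_true_toy, y_pred_toy, labels=[1, 0])
print(cm_table_order)
```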

> The goal is to create a confusion matrix in order to assess the accuracy of the model.

In [None]:
# a) Fill in the occurrences to import the confusion_matrix function from sklearn.metrics

from ... import ...  
cm = confusion_matrix(y_test, y_pred)

import matplotlib.pyplot as plt
from IPython.display import Image, display

fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(cm, cmap=plt.cm.Reds, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i,
                s=cm[i, j],
                va='center', ha='center')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()

### Observation

There are two possible predicted classes: "1" (i.e. positive cell) and "0" (i.e. negative cell).

    a) How many cells are correctly predicted as positive (true positives)?   
    b) How many cells are correctly predicted as negative (true negatives)?    
    c) How many cells are wrongly predicted as positive (false positives)?     
    d) How many cells are wrongly predicted as negative (false negatives)?

## 4.2 Rates as computed from the confusion matrix 

a) **Accuracy**: Overall, how often is the classifier correct? Calculate the accuracy of the model as a percentage.   
     $$\text{Accuracy} = \left(\frac{TP+TN}{TP+TN+FP+FN}\right)\cdot 100$$   
     Accuracy = ...%
     
b) **Misclassification Rate**: Overall, how often is it wrong? Calculate the "error rate" of the model as a percentage.   
    $$\text{Error Rate} = \left(\frac{FP+FN}{TP+TN+FP+FN}\right)\cdot 100$$   
    Error Rate = ...%
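
The two formulas can be checked numerically. The sketch below uses invented counts (3 true positives, 2 true negatives, 2 false positives, 1 false negative), not results from the exercise, and cross-checks the hand computation against scikit-learn's `accuracy_score` on toy labels that produce the same counts.

```python
from sklearn.metrics import accuracy_score

# Hypothetical counts read off a confusion matrix.
tp, tn, fp, fn = 3, 2, 2, 1
total = tp + tn + fp + fn

accuracy_pct = (tp + tn) / total * 100
error_pct = (fp + fn) / total * 100
print(accuracy_pct, error_pct)

# Toy labels that give exactly the counts above (1 = positive, 0 = negative).
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0]
check_pct = accuracy_score(y_true_toy, y_pred_toy) * 100
print(check_pct)
```

The accuracy and the error rate always add up to 100%, so computing one gives the other.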

# 5. Comparison between different kernels for non-linear classification

> We will use polynomial, Gaussian, and sigmoid kernels to see which one works better for our problem.
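
One possible way to organise such a comparison is a loop over kernel names, sketched below on synthetic data from `make_classification` rather than on the exercise dataset. In scikit-learn the Gaussian kernel is called `'rbf'`. The data, the variable names, and the `random_state` and `stratify` arguments are all assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the real features.
X_syn, y_syn = make_classification(n_samples=40, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0, stratify=y_syn
)

# Train one classifier per kernel and record its test accuracy.
scores = {}
for kernel in ("poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel)
    clf.fit(X_tr, y_tr)
    scores[kernel] = accuracy_score(y_te, clf.predict(X_te))

for kernel, score in scores.items():
    print(f"{kernel}: {score:.2%}")
```

With a test set this small, each misclassified sample moves the accuracy by several percentage points, so small differences between kernels should be read with caution.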

## 5.1 Polynomial kernel

In [None]:
# a) Replace the ... in the kernel parameter of the SVC class with 'poly'
svclassifier = SVC(kernel='...')  
svclassifier.fit(X_train, y_train)

In [None]:
# b) Make predictions on the test set
...

In [None]:
# c) Create a confusion matrix to evaluate the accuracy
...

d) Calculate the accuracy and the error rate of the polynomial model.    
Accuracy = ...%     
Error rate = ...%

e) Which kernel works best?
....

# 6. Make predictions on a new (unlabelled) dataset

> The goal is to predict the labels of an unlabelled dataset with the best classifier found above.

In [None]:
# a) Replace the ... with the kernel that worked best above
svclassifier = SVC(kernel='...')  
svclassifier.fit(X_train, y_train)

In [None]:
# b) Import the new dataset (file: unlabel_dataset.csv is located in /exercises/svm/features)
new_data = ....

In [None]:
# c) Print the first 8 rows of new_data
...

In [None]:
# d) Drop irrelevant features like 'Dose', 'Well', 'ImageNumber', 'ObjectNumber'
x_data = ...

In [None]:
# e) prediction on the new_data
x_pred = ...

In [None]:
print(x_pred)

In [None]:
labelled_data = new_data.assign(Prediction = x_pred)
print(labelled_data[['Dose', 'Well', 'ImageNumber', 'ObjectNumber', 'Prediction']])
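
The predictions printed above are integer codes. If readable class names are preferred, a fitted `LabelEncoder` can translate the codes back with `inverse_transform`. The sketch below refits a small encoder on assumed label strings and uses an invented array in place of the classifier output. In the notebook, the `le` object fitted in section 2.2 would play the role of `encoder`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings; the real dataset may spell them differently.
encoder = LabelEncoder().fit(["Negative", "Positive"])

# Stand-in for the integer predictions returned by the classifier.
predicted_codes = np.array([1, 0, 0, 1])

# Map each integer code back to its original label.
predicted_names = encoder.inverse_transform(predicted_codes)
print(predicted_names)
```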

# Footnote

<span id="fn1"> We used the image set provided by Ilya Ravkin, available from the Broad Bioimage Benchmark Collection [Ljosa et al., Nature Methods, 2012].</span>