# Practical 6b: Regression using SVM

Again we're going to read in a file from a URL.

Once you've done this you will need to process the data so that missing values are removed or replaced with sensible values. 

Then finally we can get down to the machine learning. The way you use SVM is very similar to what you did with Linear Regression, so we won't repeat all the details for that part. You can look back at that practical if you need to.

The data: This is a set of results from an experiment where the number of initial bacteria and the levels of CO2, light and sucrose excretion were varied. Four values can be predicted from these inputs; together the four values define a growth-rate curve for the bacteria.

## Reading in the data

Pandas provides a function, `read_csv`, to read in data from a CSV file. The file can either be on your local hard disk or at a URL.

Once you've read in the data the first thing to do is to have a quick look at it - print it out.

In [None]:
import pandas as pd

data = pd.read_csv("http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/fitting-results.csv")

print(data)

## What are the different variables

We can list the variables (the column names) by calling `list` on the DataFrame.

This data represents nearly 20,000 bacteria-growth experiments. There are four variables you can predict from this data: 'a', 'mu', 'tau' and 'a0'.

The variables are:

| variable | description |
|-----|------|
| n_cyanos | The number of Cyanobacteria available at the start |
| co2 | The amount of CO2 available |
| light | The amount of light available |
| SucRatio | How good the bacteria are at producing sucrose |
| Nsample | Experiment number |
| a | Maximum number of bacteria seen |
| mu | Growth rate of bacteria |
| tau | Time delay before the bacteria start growing |
| a0 | Initial level of bacteria |

In [None]:
list(data)

In this case there is no missing data. But you should still look at the data to see what we have.

In [None]:
data['a'].value_counts()
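
For datasets that do contain gaps, a minimal sketch of the usual checks is shown here. It uses a small toy DataFrame with made-up values (not the bacteria data), and the choice between dropping rows and filling with the column mean is only an illustration; which one is sensible depends on the data.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, NOT the bacteria dataset
toy = pd.DataFrame({
    "light": [1.0, np.nan, 3.0, 4.0],
    "co2": [0.5, 0.7, np.nan, 0.9],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its column
filled = toy.fillna(toy.mean())
print(filled)
```

Running `data.isna().sum()` on the real data is a quick way to confirm that every column reports zero missing values.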

## Producing the X and y data

The X data is all of the features without the variables we want to predict and the y data is just the variables we want to predict. 

We can remove Nsample here too, as it is just the experiment number and has no predictive value.

We can remove the variables we want to predict (a, mu, tau, a0) from the data to produce X, for example with the DataFrame's `drop` method.

We can keep just the variables we want to predict using `filter`. As there are four variables here we'll keep them all for now and separate them later.
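
As a rough illustration of the two operations described above, the sketch below applies `drop` and `filter` to a toy DataFrame. The column names (`feature_1`, `run_id`, `target_1` and so on) are hypothetical and chosen only for this example; they are not the columns of the bacteria data.

```python
import pandas as pd

# Toy frame with hypothetical column names, NOT the bacteria dataset
toy = pd.DataFrame({
    "feature_1": [1, 2, 3],
    "feature_2": [4, 5, 6],
    "run_id": [0, 1, 2],
    "target_1": [0.1, 0.2, 0.3],
    "target_2": [1.1, 1.2, 1.3],
})

# drop returns a new frame without the named columns
X_toy = toy.drop(columns=["run_id", "target_1", "target_2"])

# filter returns a new frame containing only the named columns
y_toy = toy.filter(items=["target_1", "target_2"])

print(list(X_toy))
print(list(y_toy))
```

Neither call modifies `toy` itself, so the original frame is still available afterwards.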

## Exercise
1. Create X to be all columns apart from 'a', 'mu', 'tau', 'a0' and 'Nsample'.
2. Create y to be just the columns 'a', 'mu', 'tau' and 'a0'.

In [4]:
X = 
y = 

## Split the data

We can now split the data into training and test data.

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

We still have four target columns in y_train and y_test. We need to split these into four separate sets of values so we can train a regressor on each of them.

In [6]:
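# filter expects a list of column names; ravel() flattens the single column into a 1-D array
y_train_a = y_train.filter(items=['a']).values.ravel()
y_test_a = y_test.filter(items=['a'])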

## Exercise
The code above produces the separate data for 'a'. Produce the separate data for 'mu', 'tau' and 'a0'.

# Now the machine learning

sklearn is really good in the sense that all of the machine learning processes work in pretty much the same way:

```
# import the appropriate regressor
from sklearn.linear_model import LinearRegression

# create an instance of that regressor
lr = LinearRegression()

# Train it on our data
lr.fit(X_train, y_train)
```

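So we just need to change `sklearn.linear_model.LinearRegression` to `sklearn.svm.SVR`, then change the instance you create to the correct regressor. Note that `SVR` fits a single target at a time, so train it on one of the 1-D arrays such as `y_train_a` rather than on all of `y_train`.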
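
A minimal sketch of that swap is given below. It uses synthetic data generated with NumPy (not the bacteria data), and the names `X_demo`, `y_demo` and `svr` are made up for this example. The regressor is created with scikit-learn's default settings, which are not necessarily good choices for the practical.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data, NOT the bacteria dataset: the target depends linearly on two features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.05, size=200)

# create an instance of the regressor (default settings: RBF kernel, C=1.0)
svr = SVR()

# Train it on the data; the target is a single 1-D array
svr.fit(X_demo, y_demo)

# Predict the target for the first five rows
predictions = svr.predict(X_demo[:5])
print(predictions)
```

The pattern is the same as for `LinearRegression`: import, create, `fit`, then `predict`.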

## Exercises
1. Create a Support Vector Regressor for this data. You'll need to produce one for each of 'a', 'mu', 'tau' and 'a0'.
2. For each regressor look at the R^2 value (returned by the regressor's `score` method). You can also plot predicted value against actual value. This should be close to a diagonal line from the bottom left to the top right.
3. Look at changing the C value and the kernel for SVR to see if you can improve the results.
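
One possible way to organise such a comparison is sketched below on synthetic data (a noisy sine curve, not the bacteria data). The kernels and `C` values listed are arbitrary illustrations rather than recommended settings, and names such as `X_demo` and `scores` are made up for this example. The `score` method of a regressor returns R^2.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic non-linear data, NOT the bacteria dataset
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Try a few illustrative (untuned) settings and record R^2 on the test set
scores = {}
for kernel in ["linear", "rbf"]:
    for C in [0.1, 1.0, 10.0]:
        model = SVR(kernel=kernel, C=C)
        model.fit(X_tr, y_tr)
        scores[(kernel, C)] = model.score(X_te, y_te)  # R^2 on held-out data

for setting, r2 in scores.items():
    print(setting, round(r2, 3))
```

SVR is sensitive to the scale of its input features, so standardising the features (for example with `sklearn.preprocessing.StandardScaler`) before fitting is often worth trying alongside changes to `C` and the kernel.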

# Write your code from here