$$
{\Huge \text{ML with Kernels - MVA class of 2020} }\\
{\Large \text{Data challenge}}\\
\textbf{Abdelhakim Benechehab, Ichraq Lemghari}\\ %You should put your name here
\text{Due: 24 March 2020} %You should write the date here.
$$

# **Kernel Ridge Regression using RBF, k-spectrum and mismatch kernels**

When multicollinearity occurs in a dataset, regression (or least squares) estimates are still unbiased.However, their variances are too large that they may be very different from the true value.

Hence, to mitigate this problem of multicollinearity in linear regression, **Ridge regression**, which is a special case of Tikhonovregularization, reduces the standard error by regularizing the parameters equally.

## Imports


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import timeit
from scipy.spatial import distance_matrix
from scipy import optimize
import cvxopt
from itertools import product
import timeit
from scipy.sparse import csr_matrix, csc_matrix, lil_matrix
from sklearn.model_selection import train_test_split

## Data

In this section, we extract the three numerical datasets (Xtr0_mat100, Xtr1_mat100, Xtr2_mat100, tr0, Ytr1, Ytr2) . And we apply some data preprocessing so a s to make the lables values in {-1,1}.  

Then, we divide the datasets into training and validation subsets so as to test the accuracy before applying the algorithm on  the test data.



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
Xtr0 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Xtr0_mat100.csv", delimiter= ' ', header=None)
Xtr1 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Xtr1_mat100.csv", delimiter= ' ', header=None)
Xtr2 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Xtr1_mat100.csv", delimiter= ' ', header=None)

Xtr0 = np.array(Xtr0)
Xtr1 = np.array(Xtr1)
Xtr2 = np.array(Xtr2)

N0 = Xtr0.shape[0]
N1 = Xtr1.shape[0]
N2 = Xtr2.shape[0]

ytr0 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Ytr0.csv", delimiter= ',')
ytr0 = ytr0.drop(columns='Id')
ytr0= ytr0.replace(0,-1)
ytr0 = np.array(ytr0).reshape(N0)

ytr1 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Ytr1.csv", delimiter= ',')
ytr1 = ytr1.drop(columns='Id')
ytr1= ytr1.replace(0,-1)
ytr1 = np.array(ytr1).reshape(N1)

ytr2 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Ytr2.csv", delimiter= ',')
ytr2= ytr2.drop(columns='Id')
ytr2= ytr2.replace(0,-1)
ytr2 = np.array(ytr2).reshape(N2)

In [4]:
Xtr0_train, Xtr0_test, ytr0_train, ytr0_test = train_test_split( Xtr0, ytr0, test_size=0.20, random_state=42)
Xtr1_train, Xtr1_test, ytr1_train, ytr1_test = train_test_split( Xtr1, ytr1, test_size=0.20, random_state=42)
Xtr2_train, Xtr2_test, ytr2_train, ytr2_test = train_test_split( Xtr2, ytr2, test_size=0.20, random_state=42)

## **Ridge regression**

In [5]:
class RidgeReg():
  '''
  This class is an implementation of the ridge regression method

  Inputs:
    - param: Regularization value (lambda)
    - lr: the learning rate used with iterative implementation
    - solver: 'closed_form' for using the close form of the estimated parameters
              or 'gradient_descent' for an iterative implementation
    - maxIter: number of iterations for the gradient stochastic

  Outputs:
    - fit: the learned parameters
    - predict: the prediction using the learned parameters

  '''

  def __init__(self, param=1.0, lr = 0.01,  solver='closed_form', maxIter = 0):
      self.param = param
      self.solver = solver
      self.maxIter = maxIter
      self.lr = lr

  def fit(self, X, y):

      X_modified = np.c_[np.ones((X.shape[0], 1)), X]

      self.X_intercept = X_modified
      self.X = X
      self.y = y

      if self.solver == 'closed_form':

          # number of columns in matrix of X including intercept
          dimension = self.X_intercept.shape[1]

          # Identity matrix of dimension compatible with our X_intercept Matrix
          I = np.identity(dimension)

          # We create a bias term corresponding to param for each column of X 
          I_biased = self.param * I

          betas = np.linalg.inv(self.X_intercept.T.dot(
              self.X_intercept) + I_biased).dot(self.X_intercept.T).dot(y)
          self.betas = betas

      if self.solver == 'gradient_descent':

          self.intercept = 0
          self.beta = np.zeros(X.shape[1])
          betas =  np.array([self.intercept] + [ a for a in self.beta] )  

          self.betas = betas

          # gradient descent learning 
                  
          for i in range( self.maxIter ) :             
              self.beta, self.intercept = self.update() 

          betas =  np.array([self.intercept] + [ a for a in self.beta] )  

          self.betas = betas

      return self.betas

  def update(self):

      Y_pred = self.predict( self.X ) 
          
        # calculate gradients       
      dbeta = ( - ( 2 * ( self.X.T ).dot( self.y - Y_pred ) ) +               
               ( 2 * self.param * self.beta ) ) / self.X.shape[0]
      dbeta = dbeta.reshape((len(dbeta), 1))
      d_intercept= - 2 * np.sum( self.y - Y_pred ) / self.X.shape[0] 
          
        # update weights     
      self.beta = self.beta - self.lr * dbeta     
      self.intercept = self.intercept - self.lr * d_intercept       

      return self.beta, self.intercept


  def predict(self, X):
      betas = self.betas
      X_predictor = np.c_[np.ones((X.shape[0], 1)), X]
      self.predictions = X_predictor.dot(betas)
      return self.predictions

### Xtr0

In [6]:
params = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = RidgeReg(param = param)
  Ridge.fit(Xtr0_train,ytr0_train)

  ypred_tr0 = np.sign(Ridge.predict(Xtr0_test))

  error= np.sum((ypred_tr0 - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
      


error:  167
loss:  0.4175
error:  167
loss:  0.4175
error:  167
loss:  0.4175
error:  165
loss:  0.4125
error:  166
loss:  0.415
error:  169
loss:  0.4225
optimal param:  0.01


### Xtr1

In [7]:
params = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = RidgeReg(param = param)
  Ridge.fit(Xtr1_train,ytr1_train)

  ypred_tr1 = np.sign(Ridge.predict(Xtr1_test))

  error= np.sum((ypred_tr1 - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])

error:  164
loss:  0.41
error:  164
loss:  0.41
error:  163
loss:  0.4075
error:  164
loss:  0.41
error:  161
loss:  0.4025
error:  161
loss:  0.4025
optimal param:  0.1


### Xtr2

In [8]:
params = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = RidgeReg(param = param)
  Ridge.fit(Xtr2_train,ytr2_train)

  ypred_tr2 = np.sign(Ridge.predict(Xtr2_test))

  error= np.sum((ypred_tr2 - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])

error:  213
loss:  0.5325
error:  212
loss:  0.53
error:  212
loss:  0.53
error:  212
loss:  0.53
error:  203
loss:  0.5075
error:  218
loss:  0.545
optimal param:  0.1


As we can see, the performance of the ridge regression isn't good. the loss in each of the datasets is higher than 50%.  
Therefore, **we won't be using this method**.

## **Kernel Ridge regression**

### Numerical data

#### RBF kernel

In [11]:
class KernelRidgeRegScratch():
  '''
  This class is an implementation of the kernel ridge regression method

  Inputs:
    - param: Regularization value (lambda)
    - var: The variance of the RBF kernel

  Outputs:
    - fit: the learned parameters
    - predict: the prediction using the learned parameters

  '''

  def __init__(self, param=1.0, var = 1):
      self.param = param
      self.var = var

  def RBF(self, xi, xj):
    diff = xi - xj
    return  np.exp(-0.5 * np.dot(diff,diff) /(self.var)**2)

  def fit(self, X, y):
      
      self.X = X
      self.y = y

      
      # number of rows  in matrix of X 
      n = self.X.shape[0]

      # Identity matrix of dimension compatible with our X_intercept Matrix
      I = np.identity(n)


      # We create a bias term corresponding to param for each column of X 
      I_biased = self.param * I

      n = self.X.shape[0]

      K = np.zeros((n,n))

      for i in range(n):
        for j in range(n):
          #print(X[i,:])
          K[i,j] = self.RBF(self.X[i,:], self.X[j,:])

      self.K = K

      betas = np.linalg.inv(K + I_biased).dot(y)
      self.betas = betas


  def predict(self, X):
      betas = self.betas
      n = X.shape[0]
      K_predictor = np.zeros(n)

      for i in range(n):
        for j in range(len(betas)):
          K_predictor[i] += betas[j] * self.RBF(self.X[j,:], X[i,:])


      self.predictions = K_predictor
      return self.predictions

##### Xtr0

In [12]:
params = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1])
vars = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1,10,100,1000,10000])

errors = np.zeros((len(params),len(vars)))
losses = np.zeros((len(params),len(vars)))

for i in range(len(params)):
  for j in range((len(vars))):

    param =params[i]
    var = vars[j]

    Ridge_kernel  = KernelRidgeRegScratch(param = param, var = var)
    Ridge_kernel.fit(Xtr0_train,ytr0_train)

    ypred_tr0 = np.sign(Ridge_kernel.predict(Xtr0_test))

    error= np.sum((ypred_tr0 - ytr0_test) != 0)
    loss = error/len(ytr0_test)

    errors[i,j] = error
    losses [i,j] = loss
    print("error: ", error)
    print("loss: ", loss)

print('optimal param: ', params[np.unravel_index(errors.argmin(), errors.shape)[0]])



error:  400
loss:  1.0
error:  400
loss:  1.0
error:  400
loss:  1.0
error:  167
loss:  0.4175
error:  175
loss:  0.4375
error:  181
loss:  0.4525
error:  167
loss:  0.4175
error:  166
loss:  0.415
error:  201
loss:  0.5025
error:  201
loss:  0.5025
error:  400
loss:  1.0
error:  400
loss:  1.0
error:  400
loss:  1.0
error:  167
loss:  0.4175
error:  175
loss:  0.4375
error:  173
loss:  0.4325
error:  165
loss:  0.4125
error:  169
loss:  0.4225
error:  201
loss:  0.5025
error:  201
loss:  0.5025
error:  400
loss:  1.0
error:  400
loss:  1.0
error:  400
loss:  1.0
error:  167
loss:  0.4175
error:  175
loss:  0.4375
error:  165
loss:  0.4125
error:  166
loss:  0.415
error:  201
loss:  0.5025
error:  201
loss:  0.5025
error:  201
loss:  0.5025
error:  400
loss:  1.0
error:  400
loss:  1.0
error:  400
loss:  1.0
error:  167
loss:  0.4175
error:  173
loss:  0.4325
error:  162
loss:  0.405
error:  169
loss:  0.4225
error:  201
loss:  0.5025
error:  201
loss:  0.5025
error:  201
loss:  0.5025

##### Xtr1

In [13]:
params = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1])
vars = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1])

errors = np.zeros((len(params),len(vars)))
losses = np.zeros((len(params),len(vars)))

for i in range(len(params)):
  for j in range((len(vars))):

    param =params[i]
    var = vars[j]

    Ridge_kernel  = KernelRidgeRegScratch(param = param, var = var)
    Ridge_kernel.fit(Xtr1_train,ytr1_train)

    ypred_tr0 = np.sign(Ridge_kernel.predict(Xtr1_test))

    error= np.sum((ypred_tr1 - ytr1_test) != 0)
    loss = error/len(ytr1_test)

    errors[i,j] = error
    losses [i,j] = loss
    print("error: ", error)
    print("loss: ", loss)

print('optimal param: ', params[np.unravel_index(errors.argmin(), errors.shape)[0]])


error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161
loss:  0.4025
error:  161


##### Xtr2

In [None]:
params = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1])
vars = np.array([ 1e-2, 1e-1, 1])

errors = np.zeros((len(params),len(vars)))
losses = np.zeros((len(params),len(vars)))

for i in range(len(params)):
  for j in range((len(vars))):

    param =params[i]
    var = vars[j]

    Ridge_kernel  = KernelRidgeRegScratch(param = param, var = var)
    Ridge_kernel.fit(Xtr2_train,ytr2_train)

    ypred_tr0 = np.sign(Ridge_kernel.predict(Xtr2_test))

    error= np.sum((ypred_tr2 - ytr2_test) != 0)
    loss = error/len(ytr2_test)

    errors[i,j] = error
    losses [i,j] = loss
    print("error: ", error)
    print("loss: ", loss)

print('optimal param: ', params[np.unravel_index(errors.argmin(), errors.shape)[0]])

As we can see, the results of the RBF kernel are also non representative. The loss is around 50%  or sometimes 100%! which is not utile for the challenge

### Sequential data

#### Data

In [None]:
Xtr0 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Xtr0.csv", sep=",", index_col=0)
Xtr0 = np.array(Xtr0)

N = Xtr0.shape[0]

ytr0 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Ytr0.csv", index_col=0)
ytr0 = np.array(ytr0).reshape(N)
ytr0 = 2*ytr0 - 1



In [None]:
Xtr1 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Xtr1.csv", sep=",", index_col=0)
Xtr1 = np.array(Xtr1)

N = Xtr1.shape[0]

ytr1 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Ytr1.csv", index_col=0)
ytr1 = np.array(ytr1).reshape(N)
ytr1 = 2*ytr1 - 1



In [None]:
Xtr2 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Xtr2.csv", sep=",", index_col=0)
Xtr2 = np.array(Xtr2)

N = Xtr2.shape[0]

ytr2 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/train/Ytr2.csv", index_col=0)
ytr2 = np.array(ytr2).reshape(N)
ytr2 = 2*ytr2 - 1



In [None]:
Xtr0_train, Xtr0_test, ytr0_train, ytr0_test = train_test_split( Xtr0, ytr0, test_size=0.25, random_state=42)
Xtr1_train, Xtr1_test, ytr1_train, ytr1_test = train_test_split( Xtr1, ytr1, test_size=0.25, random_state=42)
Xtr2_train, Xtr2_test, ytr2_train, ytr2_test = train_test_split( Xtr2, ytr2, test_size=0.25, random_state=42)

In [None]:
Xte0 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/test/Xte0.csv", sep=",", index_col=0)
Xte0 = np.array(Xte0)

Xte1 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/test/Xte1.csv", sep=",", index_col=0)
Xte1 = np.array(Xte1)

Xte2 = pd.read_csv("/content/drive/MyDrive/KM/Challenge/test/Xte2.csv", sep=",", index_col=0)
Xte2 = np.array(Xte2)

#### k-spectrum kernel

In this section, we use the k-spectrum kernel with kernel ridge regression.   
We assess the functioning of this method using several parameters: **k =6, 7, 8, 9**

In [None]:
def make_corres(corpus,k):
    '''
    make_corres
    
    Args:
        corpus: the letters that compose the sequences
        k: length of the sequences
        
    Returns:
        corres: dictionnary of correspondances between sequences of length k and indexes
    '''
    corres = dict()
    count = 0
    for u in product(corpus,repeat=k):
        s = ''
        for i in range(k):
            s += u[i]
        corres[s] = count
        count += 1
    return corres
            
def kspectrum_kernel(x1, x2, k, corres):   
    '''
    kspectrum_kernel
    
    Args:
        x1, x2: data points
        corres: dictionnary of correspondances between sequences of length k and indexes
        k: length of the sequences
        
    Returns:
        k(x1,x2): the kernel evaluated at points x1 and x2 (bormalized dot product between feature vectors)
    '''
    sample1 = x1[0]
    sample2 = x2[0]
    
    N = 4**k
    
    #sparse vectors could be used in case of high space complexity, but this is not used systematically as their appending is costly
    feature1 = np.zeros(N)
    feature2 = np.zeros(N)
    
#     feature1 = csr_matrix((1,N), dtype="int32")
#     feature2 = csr_matrix((1,N), dtype="int32")
    
    for i in range(len(sample1)-k+1):
        #extract sub-sequences
        segment1 = sample1[i:i+k]
        segment2 = sample2[i:i+k]
        
        #update feature vectors
        feature1[corres[segment1]] += 1
        feature2[corres[segment2]] += 1
        
    #norrmalize feature vectors
    feature1 = feature1/np.sqrt(feature1.dot(feature1))
    feature2 = feature2/np.sqrt(feature2.dot(feature2))
    
    return feature1.dot(feature2)

def gram_spectrum(X, X_test, corres, k):
    '''
    gram_spectrum
    
    Args:
        X, X_test: training and test data
        corres: dictionnary of correspondances between sequences of length k and indexes
        k: length of the sequences
        
    Returns:
        K: Gram matrix of X and X_test concatenated
    '''
    
    X = np.concatenate((X,X_test),axis=0)
    
    #dimensions
    N = X.shape[0]
    m = len(X[0][0])
    n = 4**k
    
    #Initialize features matrix
    #features = csr_matrix((N,n), dtype="int8")
    features = np.zeros((N,n))
    for sample in range(N):
        for i in range(m-k+1):
            segment = X[sample][0][i:i+k]
            features[sample,corres[segment]] += 1
        #normalization
        features[sample] = features[sample]/np.sqrt(features[sample].dot(features[sample]))
    
    return features.dot(features.T)

In [None]:
class KernelRidgeRegScratch_spectrum():
  '''
  This class is an implementation of the kernel ridge regression method

  Inputs:
    - param: Regularization value (lambda)
    - k: the width of the sequences
    - corres: the matrix of the different sequences

  Outputs:
    - fit: the learned parameters
    - predict: the prediction using the learned parameters

  '''

  def __init__(self, param=1.0, solver='closed_form', corres=None,  k = 6):
      self.param = param
      self.solver = solver
      self.corres = corres
      self.k = k
     

  def fit(self, X, y, X_test):
      
      self.X = X
      self.y = y
      self.X_test = X_test

      
      # number of rows  in matrix of X 
      n = self.X.shape[0]

      # Identity matrix of dimension compatible with our X_intercept Matrix
      I = np.identity(n)


      # We create a bias term corresponding to param for each column of X 
      I_biased = self.param * I

      n = self.X.shape[0]

      K = gram_spectrum(self.X, self.X_test, self.corres, self.k)
      
      threshold = n

      self.K = K[:threshold,:threshold]
      self.K_test = K[threshold:,:threshold]

      betas = np.linalg.inv(self.K + I_biased).dot(y)
      self.betas = betas


  def predict(self):
      betas = self.betas
      K_test = self.K_test.transpose()

      K_predictor = np.dot(betas,K_test)

      self.predictions = K_predictor
      return self.predictions

##### Evaluation X0

In [None]:
corpus = ['A','C','G','T']
k = 6
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  209
loss:  0.418
error:  209
loss:  0.418
error:  208
loss:  0.416
error:  206
loss:  0.412
error:  201
loss:  0.402
error:  208
loss:  0.416
error:  209
loss:  0.418
error:  187
loss:  0.374
optimal param:  1.0
optimal error:  187.0


In [None]:
corpus = ['A','C','G','T']
k = 7
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  177
loss:  0.354
error:  181
loss:  0.362
error:  198
loss:  0.396
optimal param:  1
optimal error:  177.0


In [None]:
corpus = ['A','C','G','T']
k = 8
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])
print('optimal loss: ', losses[np.argmin(errors)])

error:  181
loss:  0.362
error:  181
loss:  0.362
error:  181
loss:  0.362
error:  182
loss:  0.364
error:  184
loss:  0.368
error:  184
loss:  0.368
error:  184
loss:  0.368
error:  184
loss:  0.368
optimal param:  1e-05
optimal error:  181.0
optimal loss:  0.362


In [None]:
corpus = ['A','C','G','T']
k = 9
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])
print('optimal loss: ', losses[np.argmin(errors)])

error:  189
loss:  0.378
error:  189
loss:  0.378
error:  189
loss:  0.378
error:  188
loss:  0.376
error:  187
loss:  0.374
error:  187
loss:  0.374
error:  185
loss:  0.37
error:  178
loss:  0.356
optimal param:  1.0
optimal error:  178.0
optimal loss:  0.356


###### Submission

As we can see, the best results are given with k =7 and param =1. Hence we'll use them to apply the method on the test dataset for the challenge submission

In [None]:
corpus = ['A','C','G','T']
k = 7
corres = make_corres(corpus,k)
param = 1

Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)

Ridge.fit(Xtr0,ytr0, Xte0)

ypred = np.sign(Ridge.predict())


In [None]:
rep = ypred.astype("int8")
df_rep1 = pd.DataFrame(data=np.array([range(1000),rep]).T, columns=["id", "Bound"])
df_rep1

Unnamed: 0,id,Bound
0,0,1
1,1,1
2,2,1
3,3,1
4,4,-1
...,...,...
995,995,-1
996,996,-1
997,997,-1
998,998,1


##### Evaluation X1

In [None]:
corpus = ['A','C','G','T']
k = 6
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  198
loss:  0.396
error:  199
loss:  0.398
error:  198
loss:  0.396
error:  200
loss:  0.4
error:  197
loss:  0.394
error:  204
loss:  0.408
error:  198
loss:  0.396
error:  176
loss:  0.352
optimal param:  1.0
optimal error:  176.0


In [None]:
corpus = ['A','C','G','T']
k = 7
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  188
loss:  0.376
error:  188
loss:  0.376
error:  188
loss:  0.376
error:  187
loss:  0.374
error:  188
loss:  0.376
error:  184
loss:  0.368
error:  183
loss:  0.366
error:  171
loss:  0.342
optimal param:  1.0
optimal error:  171.0


In [None]:
corpus = ['A','C','G','T']
k = 8
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  168
loss:  0.336
error:  168
loss:  0.336
error:  168
loss:  0.336
error:  169
loss:  0.338
error:  170
loss:  0.34
error:  169
loss:  0.338
error:  168
loss:  0.336
error:  169
loss:  0.338
optimal param:  1e-05
optimal error:  168.0


In [None]:
corpus = ['A','C','G','T']
k = 9
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  188
loss:  0.376
error:  188
loss:  0.376
error:  188
loss:  0.376
error:  187
loss:  0.374
error:  186
loss:  0.372
error:  185
loss:  0.37
error:  185
loss:  0.37
error:  184
loss:  0.368
optimal param:  1.0
optimal error:  184.0


###### Submission


As we can see, the best results are given with k =8 and param = 0.1. Hence we'll use them to apply the method on the test dataset for the challenge submission

In [None]:
corpus = ['A','C','G','T']
k = 8
corres = make_corres(corpus,k)
param = 1e-1

Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)

Ridge.fit(Xtr1,ytr1, Xte1)

ypred = np.sign(Ridge.predict())

In [None]:
rep = ypred.astype("int8")
df_rep2 = pd.DataFrame(data=np.array([range(1000,2000),rep]).T, columns=["id", "Bound"])
df_rep2

Unnamed: 0,id,Bound
0,1000,-1
1,1001,1
2,1002,1
3,1003,-1
4,1004,-1
...,...,...
995,1995,-1
996,1996,1
997,1997,-1
998,1998,-1


##### Evaluation X2

In [None]:
corpus = ['A','C','G','T']
k = 6
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  162
loss:  0.324
error:  162
loss:  0.324
error:  161
loss:  0.322
error:  159
loss:  0.318
error:  153
loss:  0.306
error:  150
loss:  0.3
error:  148
loss:  0.296
error:  141
loss:  0.282
optimal param:  1.0
optimal error:  141.0


In [None]:
corpus = ['A','C','G','T']
k = 7
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  141
loss:  0.282
error:  141
loss:  0.282
error:  140
loss:  0.28
error:  138
loss:  0.276
error:  138
loss:  0.276
error:  131
loss:  0.262
error:  134
loss:  0.268
error:  130
loss:  0.26
optimal param:  1.0
optimal error:  130.0


In [None]:
corpus = ['A','C','G','T']
k = 8
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  126
loss:  0.252
error:  136
loss:  0.272
error:  141
loss:  0.282
optimal param:  1
optimal error:  126.0


In [None]:
corpus = ['A','C','G','T']
k = 9
corres = make_corres(corpus,k)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  132
loss:  0.264
error:  132
loss:  0.264
error:  132
loss:  0.264
error:  130
loss:  0.26
error:  131
loss:  0.262
error:  130
loss:  0.26
error:  132
loss:  0.264
error:  133
loss:  0.266
optimal param:  0.01
optimal error:  130.0


###### Submission

As we can see, the best results are given with k = 8 and param = 1. Hence we'll use them to apply the method on the test dataset for the challenge submission

In [None]:
corpus = ['A','C','G','T']
k = 8
corres = make_corres(corpus,k)
param = 1

Ridge = KernelRidgeRegScratch_spectrum(param = param, corres = corres, k = k)

Ridge.fit(Xtr2,ytr2, Xte2)

ypred = np.sign(Ridge.predict())

In [None]:
rep = ypred.astype("int8")
df_rep3 = pd.DataFrame(data=np.array([range(2000,3000),rep]).T, columns=["id", "Bound"])
df_rep3

Unnamed: 0,id,Bound
0,2000,-1
1,2001,-1
2,2002,1
3,2003,1
4,2004,-1
...,...,...
995,2995,1
996,2996,-1
997,2997,-1
998,2998,-1


##### Final Submission

In [None]:
df_sub = pd.concat([df_rep1,df_rep2,df_rep3])
df_sub

Unnamed: 0,id,Bound
0,0,1
1,1,1
2,2,1
3,3,1
4,4,-1
...,...,...
995,2995,1
996,2996,-1
997,2997,-1
998,2998,-1


In [None]:
df_sub = df_sub.replace(-1,0)

In [None]:
df_sub

Unnamed: 0,id,Bound
0,0,1
1,1,1
2,2,1
3,3,1
4,4,0
...,...,...
995,2995,1
996,2996,0
997,2997,0
998,2998,0


In [None]:
df_sub.to_csv('submission_kridgespectr_7.csv', index=False)

#### Mismatch Kernel

In this section, we use the mismatch kernel with kernel ridge regression.   
We assess the functioning of this method using several parameters: **k =7 and m=2, k =7 and m=1, k =7 and m=3, k =6 and m=1, k =6 and m=2** 

In [None]:
def make_corres(corpus,k):
    '''
    make_corres
    
    Args:
        corpus: the letters that compose the sequences
        k: length of the sequences
        
    Returns:
        corres: dictionnary of correspondances between sequences of length k and indexes
    '''
    corres = dict()
    count = 0
    for u in product(corpus,repeat=k):
        s = ''
        for i in range(k):
            s += u[i]
        corres[s] = count
        count += 1
    return corres

def mismatches(k,m,corres):
    '''
    mismatches
    
    Args:
        k: length of the sequences
        m: number of allowed mismatches
        corres: dictionnary of correspondances between sequences of length k and indexes
        
    Returns:
        mms: matrix of mismatches among all k-length sequences 
    '''
    N = 4**k
    #L = lil_matrix((N,N), dtype="int16")
    L = np.zeros((N,N))
    seqs = list(corres.keys())
    for i in range(N):
        seq1 = seqs[i]
        for j in range(N):
            seq2 = seqs[j]
            mm = 0
            count = 0
            while (mm <= m) and (count <= k-1):
                if seq1[count] != seq2[count]:
                    mm += 1
                count += 1
            if mm <= m:
                L[i,j] = 1
    return L
            
def mismatch_kernel(x1, x2, k, corres, mismatches):          
    '''
    mismatch_kernel
    
    Args:
        x1, x2: data points
        corres: dictionnary of correspondances between sequences of length k and indexes
        k: length of the sequences
        mismatches: matrix of mismatches among all k-length sequences
        
    Returns:
        k(x1,x2): the kernel evaluated at points x1 and x2 (normalized dot product between feature vectors)
    '''
    
    sample1 = x1[0]
    sample2 = x2[0]
    
    N = 4**k
    
    #Initialization
    
    feature1 = np.zeros(N)
    feature2 = np.zeros(N)
    
#     feature1 = csr_matrix((1,N), dtype="int32")
#     feature2 = csr_matrix((1,N), dtype="int32")
    
    for i in range(len(sample1)-k+1):
        segment1 = sample1[i:i+k]
        segment2 = sample2[i:i+k]
        
        feature1 += mismatches[corres[segment1]]
        feature2 += mismatches[corres[segment2]]
        
    #normalization
    feature1 = feature1/np.sqrt(feature1.dot(feature1))
    feature2 = feature2/np.sqrt(feature2.dot(feature2))

    return feature1.dot(feature2)

def gram_mismatch(X, X_test, corres, k, mismatches):
    '''
    gram_spectrum
    
    Args:
        X, X_test: training and test data
        corres: dictionnary of correspondances between sequences of length k and indexes
        k: length of the sequences
        mismatches: matrix of mismatches among all k-length sequences
        
    Returns:
        K: Gram matrix of X and X_test concatenated
    '''
    
    X = np.concatenate((X,X_test),axis=0)
    
    N = X.shape[0]
    m = len(X[0][0])
    n = 4**k
    
    
    #features = csr_matrix((N,n), dtype="int8")
    features = np.zeros((N,n))
    for sample in range(N):
        for i in range(m-k+1):
            segment = X[sample][0][i:i+k]
            features[sample] += mismatches[corres[segment]]
        features[sample] = features[sample]/np.sqrt(features[sample].dot(features[sample]))
        
    #features = csr_matrix(features, shape=(N,n), dtype="int8")
    
    return features.dot(features.T)

In [None]:
class KernelRidgeRegScratch_mismatch():
  '''
  This class is an implementation of the kernel ridge regression method

  Inputs:
    - param: Regularization value (lambda)
    - k: the width of the sequences
    - corres: the matrix of the different sequences
    - mms: the matrix of mismatches

  Outputs:
    - fit: the learned parameters
    - predict: the prediction using the learned parameters

  '''

  def __init__(self, param=1.0, solver='closed_form', corres=None,  k = 6, mms =None):
      self.param = param
      self.solver = solver
      self.corres = corres
      self.k = k
      self.mms = mms
     

  def fit(self, X, y, X_test):
      
      self.X = X
      self.y = y
      self.X_test = X_test

      
      # number of rows  in matrix of X 
      n = self.X.shape[0]

      # Identity matrix of dimension compatible with our X_intercept Matrix
      I = np.identity(n)


      # We create a bias term corresponding to param for each column of X 
      I_biased = self.param * I

      n = self.X.shape[0]

      K = gram_mismatch(self.X, self.X_test, self.corres, self.k, self.mms)
      
      threshold = n

      self.K = K[:threshold,:threshold]
      self.K_test = K[threshold:,:threshold]

      betas = np.linalg.inv(self.K + I_biased).dot(y)
      self.betas = betas


  def predict(self):
      betas = self.betas
      K_test = self.K_test.transpose()

      K_predictor = np.dot(betas,K_test)

      self.predictions = K_predictor
      return self.predictions

##### X0

In [None]:
corpus = ['A','C','G','T']
k = 7
m = 2

corres = make_corres(corpus,k)
mms72 = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms72)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  188
loss:  0.376
error:  188
loss:  0.376
error:  189
loss:  0.378
error:  188
loss:  0.376
error:  187
loss:  0.374
error:  176
loss:  0.352
error:  170
loss:  0.34
error:  167
loss:  0.334
optimal param:  1.0
optimal error:  167.0


In [None]:
corpus = ['A','C','G','T']
k = 7
m = 1

corres = make_corres(corpus,k)
mms71 = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms71)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  196
loss:  0.392
error:  196
loss:  0.392
error:  196
loss:  0.392
error:  196
loss:  0.392
error:  194
loss:  0.388
error:  196
loss:  0.392
error:  191
loss:  0.382
error:  161
loss:  0.322
optimal param:  1.0
optimal error:  161.0


In [None]:
corpus = ['A','C','G','T']
k = 7
m = 3

corres = make_corres(corpus,k)
mms73 = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms73)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  198
loss:  0.396
error:  198
loss:  0.396
error:  197
loss:  0.394
error:  190
loss:  0.38
error:  183
loss:  0.366
error:  180
loss:  0.36
error:  182
loss:  0.364
error:  196
loss:  0.392
optimal param:  0.07
optimal error:  180.0


In [None]:
corpus = ['A','C','G','T']
k = 6
m = 2

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  211
loss:  0.422
error:  212
loss:  0.424
error:  207
loss:  0.414
error:  199
loss:  0.398
error:  189
loss:  0.378
error:  187
loss:  0.374
error:  191
loss:  0.382
error:  188
loss:  0.376
optimal param:  0.07
optimal error:  187.0


In [None]:
corpus = ['A','C','G','T']
k = 6
m = 1

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr0_train,ytr0_train, Xtr0_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr0_test) != 0)
  loss = error/len(ytr0_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  221
loss:  0.442
error:  221
loss:  0.442
error:  223
loss:  0.446
error:  221
loss:  0.442
error:  220
loss:  0.44
error:  212
loss:  0.424
error:  207
loss:  0.414
error:  184
loss:  0.368
optimal param:  1.0
optimal error:  184.0


###### Submission


As we can see, the best results are given with k =7 and param =1 and m=1. Hence we'll use them to apply the method on the test dataset for the challenge submission

In [None]:
corpus = ['A','C','G','T']
k = 7
m = 1

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)
param = 1

Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)

Ridge.fit(Xtr0,ytr0, Xte0)

ypred = np.sign(Ridge.predict())


In [None]:
rep = ypred.astype("int8")
df_rep1 = pd.DataFrame(data=np.array([range(1000),rep]).T, columns=["id", "Bound"])
df_rep1

Unnamed: 0,id,Bound
0,0,1
1,1,1
2,2,1
3,3,1
4,4,-1
...,...,...
995,995,-1
996,996,1
997,997,1
998,998,1


##### X1

In [None]:
corpus = ['A','C','G','T']
k = 7
m = 2

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  191
loss:  0.382
error:  191
loss:  0.382
error:  191
loss:  0.382
error:  186
loss:  0.372
error:  177
loss:  0.354
error:  170
loss:  0.34
error:  168
loss:  0.336
error:  175
loss:  0.35
optimal param:  0.1
optimal error:  168.0


In [None]:
corpus = ['A','C','G','T']
k = 7
m = 1

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-9,1e-7,1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1,10,100])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  187
loss:  0.374
error:  187
loss:  0.374
error:  187
loss:  0.374
error:  187
loss:  0.374
error:  188
loss:  0.376
error:  184
loss:  0.368
error:  183
loss:  0.366
error:  176
loss:  0.352
error:  176
loss:  0.352
error:  156
loss:  0.312
error:  169
loss:  0.338
error:  203
loss:  0.406
optimal param:  1.0
optimal error:  156.0


In [None]:
corpus = ['A','C','G','T']
k = 7
m = 3

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  185
loss:  0.37
error:  185
loss:  0.37
error:  186
loss:  0.372
error:  175
loss:  0.35
error:  179
loss:  0.358
error:  178
loss:  0.356
error:  174
loss:  0.348
error:  199
loss:  0.398
optimal param:  0.1
optimal error:  174.0


In [None]:
corpus = ['A','C','G','T']
k = 6
m = 2

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  199
loss:  0.398
error:  199
loss:  0.398
error:  193
loss:  0.386
error:  185
loss:  0.37
error:  183
loss:  0.366
error:  181
loss:  0.362
error:  181
loss:  0.362
error:  187
loss:  0.374
optimal param:  0.07
optimal error:  181.0


In [None]:
corpus = ['A','C','G','T']
k = 6
m = 1

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr1_train,ytr1_train, Xtr1_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr1_test) != 0)
  loss = error/len(ytr1_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  210
loss:  0.42
error:  210
loss:  0.42
error:  207
loss:  0.414
error:  204
loss:  0.408
error:  187
loss:  0.374
error:  184
loss:  0.368
error:  183
loss:  0.366
error:  171
loss:  0.342
optimal param:  1.0
optimal error:  171.0


###### Submission

As we can see, the best results are given with k =7 and param =1 and m=1. Hence we'll use them to apply the method on the test dataset for the challenge submission

In [None]:
corpus = ['A','C','G','T']
k = 7
m = 1

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)
param = 1

Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)

Ridge.fit(Xtr1,ytr1, Xte1)

ypred = np.sign(Ridge.predict())

In [None]:
rep = ypred.astype("int8")
df_rep2 = pd.DataFrame(data=np.array([range(1000,2000),rep]).T, columns=["id", "Bound"])
df_rep2

Unnamed: 0,id,Bound
0,1000,1
1,1001,1
2,1002,1
3,1003,-1
4,1004,-1
...,...,...
995,1995,-1
996,1996,-1
997,1997,-1
998,1998,-1


##### X2

In [None]:
corpus = ['A','C','G','T']
k = 7
m = 2

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-9,1e-7,1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1,10,100])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  152
loss:  0.304
error:  152
loss:  0.304
error:  152
loss:  0.304
error:  152
loss:  0.304
error:  151
loss:  0.302
error:  144
loss:  0.288
error:  142
loss:  0.284
error:  142
loss:  0.284
error:  139
loss:  0.278
error:  129
loss:  0.258
error:  136
loss:  0.272
error:  146
loss:  0.292
optimal param:  1.0
optimal error:  129.0


In [None]:
corpus = ['A','C','G','T']
k = 7
m = 1

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-9,1e-7,1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1,10,100])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  149
loss:  0.298
error:  149
loss:  0.298
error:  149
loss:  0.298
error:  149
loss:  0.298
error:  148
loss:  0.296
error:  144
loss:  0.288
error:  137
loss:  0.274
error:  138
loss:  0.276
error:  137
loss:  0.274
error:  133
loss:  0.266
error:  129
loss:  0.258
error:  144
loss:  0.288
optimal param:  10.0
optimal error:  129.0


In [None]:
corpus = ['A','C','G','T']
k = 7
m = 3

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-9,1e-7,1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1,10,100])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  130
loss:  0.26
error:  130
loss:  0.26
error:  130
loss:  0.26
error:  130
loss:  0.26
error:  130
loss:  0.26
error:  125
loss:  0.25
error:  127
loss:  0.254
error:  123
loss:  0.246
error:  126
loss:  0.252
error:  135
loss:  0.27
error:  158
loss:  0.316
error:  197
loss:  0.394
optimal param:  0.07
optimal error:  123.0


In [None]:
corpus = ['A','C','G','T']
k = 6
m = 2

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-9,1e-7,1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1,10,100])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  164
loss:  0.328
error:  164
loss:  0.328
error:  164
loss:  0.328
error:  166
loss:  0.332
error:  163
loss:  0.326
error:  160
loss:  0.32
error:  148
loss:  0.296
error:  134
loss:  0.268
error:  134
loss:  0.268
error:  132
loss:  0.264
error:  140
loss:  0.28
error:  167
loss:  0.334
optimal param:  1.0
optimal error:  132.0


In [None]:
corpus = ['A','C','G','T']
k = 6
m = 1

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)

params = np.array([1e-9,1e-7,1e-5, 1e-4, 1e-3, 1e-2,3e-2, 7e-2, 1e-1, 1,10,100])

errors = np.zeros(len(params))
losses = np.zeros(len(params))

for i in range(len(params)):
  param =params[i]

  Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)
  Ridge.fit(Xtr2_train,ytr2_train, Xtr2_test)

  ypred = np.sign(Ridge.predict())

  error= np.sum((ypred - ytr2_test) != 0)
  loss = error/len(ytr2_test)

  errors[i] = error
  losses [i] = loss
  print("error: ", error)
  print("loss: ", loss)

print('optimal param: ', params[np.argmin(errors)])
print('optimal error: ', errors[np.argmin(errors)])

error:  168
loss:  0.336
error:  168
loss:  0.336
error:  168
loss:  0.336
error:  166
loss:  0.332
error:  163
loss:  0.326
error:  160
loss:  0.32
error:  162
loss:  0.324
error:  152
loss:  0.304
error:  153
loss:  0.306
error:  131
loss:  0.262
error:  134
loss:  0.268
error:  141
loss:  0.282
optimal param:  1.0
optimal error:  131.0


###### Submission

As we can see, the best results are given with k =7 and param =0.07 and m=3. Hence we'll use them to apply the method on the test dataset for the challenge submission

In [None]:
corpus = ['A','C','G','T']
k = 7
m = 3

corres = make_corres(corpus,k)
mms = mismatches(k,m,corres)
param = 0.07

Ridge = KernelRidgeRegScratch_mismatch(param = param, corres = corres, k = k, mms = mms)

Ridge.fit(Xtr2,ytr2, Xte2)

ypred = np.sign(Ridge.predict())

In [None]:
rep = ypred.astype("int8")
df_rep3 = pd.DataFrame(data=np.array([range(2000,3000),rep]).T, columns=["id", "Bound"])
df_rep3

Unnamed: 0,id,Bound
0,2000,-1
1,2001,-1
2,2002,1
3,2003,1
4,2004,1
...,...,...
995,2995,-1
996,2996,-1
997,2997,-1
998,2998,-1


##### Final Submission

In [None]:
df_sub = pd.concat([df_rep1,df_rep2,df_rep3])
df_sub = df_sub.replace(-1,0)
df_sub


Unnamed: 0,id,Bound
0,0,1
1,1,1
2,2,1
3,3,1
4,4,0
...,...,...
995,2995,0
996,2996,0
997,2997,0
998,2998,0


In [None]:
df_sub.to_csv('submission_kridgemismatch_7.csv', index=False)

### Thank you