# Exercise: Cross-Validation with Symmetric Pair-Input Data

This exercise consists of two tasks. The first task is compulsory: you will not get the right to take the exam if you fail the first task. The second task is optional: you do not have to complete it, but successful completion will give you an extra point in the exam.

In both tasks, use the K-nearest neighbors classifier with K=1 and Euclidean distance for learning and the concordance index for evaluation. You are encouraged to re-use your own code from the previous exercises. Use the data files `pairs.data`, `features.data`, and `labels.data` that are available in Moodle. The descriptions of these files are provided in the exercise overview, which is also available in Moodle.

Follow the general exercise guidelines of the course (listed in Moodle). In particular,

- Describe and implement your solution directly in this Jupyter notebook file.
- Remember to describe your solution in general and add detailed comments to the critical parts of your code.
- Remember to justify your design choices and discuss your results.
- Your report must be easy to follow and your code must be runnable in Jupyter notebook.

Feel free to use markdown cells and code cells as you see appropriate.

Submit the finished work to Moodle before the **deadline Wednesday 19th of February 2020 at 23:59**. Late submissions will be ignored.

## Cover page

Student name: Ismail Elnaggar

Student number: 519208

Student email: imelna@utu.fi

## Task 1 (compulsory)

**You must successfully complete this task in order to get the right to take the exam.**

1. Implement the modified leave-one-out cross-validation scheme that is described in the lecture notes.

2. Estimate and report the generalisation performance of the K-nearest neighbor classifier in predicting the functional similarity of proteins. Use both the unmodified and the modified leave-one-out cross-validation.

3. Discuss your results. In particular, answer the following questions:
 - Why do the two cross-validation schemes produce notably different estimates?
 - For which types of pairs (A, B, or C) are these schemes appropriate and why?

In [1]:
# In this cell import all libraries you need. For example: 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

In [2]:
### import data
### features, labels, pairs
pairsdf=pd.read_csv("C:/Users/imelna/anaconda/envs/emgEnv/ADA exercises/Exercise 5 data/pairs.data",header=None)
featuresdf=pd.read_csv("C:/Users/imelna/anaconda/envs/emgEnv/ADA exercises/Exercise 5 data/features.data",header=None)
labelsdf=pd.read_csv("C:/Users/imelna/anaconda/envs/emgEnv/ADA exercises/Exercise 5 data/labels.data",header=None)

print ("The dimensions of the pairs dataset are:",pairsdf.shape,"\n")
print ("The dimensions of the features dataset are:",featuresdf.shape,"\n")
print ("The dimensions of the labels dataset are:",labelsdf.shape,"\n")

The dimensions of the pairs dataset are: (95, 2) 

The dimensions of the features dataset are: (95, 41) 

The dimensions of the labels dataset are: (95, 1) 



### concordance index code

In [3]:
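### C-index code
def cindex(true_labels, pred_labels):
    """Returns C-index between true labels and predicted labels"""
    n = 0       # number of comparable pairs (pairs whose true labels differ)
    n_sum = 0   # concordance score accumulated over the comparable pairs
    for i in range(len(true_labels)):
        t = true_labels[i]
        p = pred_labels[i]
        for j in range(i+1,len(true_labels)):
            nt = true_labels[j]
            npred = pred_labels[j]  # not named `np`, which would shadow numpy inside this function
            if t != nt:
                n += 1
                # concordant pair: predictions are ordered the same way as the true labels
                if (p < npred and t < nt) or (p > npred and t > nt):
                    n_sum += 1
                # tied predictions count as half
                elif p == npred:
                    n_sum += 0.5
    return n_sum/n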
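
The pairwise counting can be checked by hand on a tiny example. The sketch below is a standalone re-statement of the same rule (ties in the predictions count as 0.5, pairs with equal true labels are skipped); the name `cindex_toy` and the toy labels are made up for illustration and are not part of the exercise data.

```python
def cindex_toy(true_labels, pred_labels):
    """Pairwise concordance index; prediction ties count as 0.5."""
    concordant = 0.0
    comparable = 0
    n = len(true_labels)
    for i in range(n):
        for j in range(i + 1, n):
            if true_labels[i] == true_labels[j]:
                continue  # pairs with equal true labels are not comparable
            comparable += 1
            true_diff = true_labels[i] - true_labels[j]
            pred_diff = pred_labels[i] - pred_labels[j]
            if pred_diff == 0:
                concordant += 0.5  # tied predictions
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0  # same ordering as the true labels
    return concordant / comparable


# Hypothetical toy labels: four samples, one of them misclassified.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# Comparable pairs: (0,2), (0,3), (1,2), (1,3).
# (0,2) and (0,3) are concordant; (1,2) and (1,3) are prediction ties.
print(cindex_toy(y_true, y_pred))  # 0.75
```

With binary labels and hard 1-NN predictions, many comparable pairs end up as ties, which is why a single misclassification already moves the score noticeably.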

### Task 1 Original unmodified LOOCV CASE A

In [4]:
k=1
knn = KNeighborsClassifier(n_neighbors=k)
my_predictions=[]
my_tests=[]

for a in range(len(pairsdf)):
    
    #set xtest and y test
    xtest=featuresdf.iloc[a,:]
    ytest=labelsdf.iloc[a,:]
    
    # set xtrain and ytrain
    xtrain=pd.concat([featuresdf.iloc[:a],featuresdf.iloc[a+1:]])       
    ytrain=pd.concat([labelsdf.iloc[:a],labelsdf.iloc[a+1:]]).values.ravel()
     
    #train classifier
    knn.fit(xtrain,ytrain)
    #get predictions
    ypred=knn.predict(np.array(xtest).reshape(1, -1))
    #append true and predicted values to lists
    my_tests.append(ytest.values)
    my_predictions.append(ypred.flatten())

caseA_cindex=cindex(my_tests,my_predictions)
print("")
print ("Original Case A LOOCV Cindex =",caseA_cindex)


Original Case A LOOCV Cindex = 0.7617702448210922


### Task 1 Modified LOOCV CASE C

1. In case C the training set should not contain any samples that share proteins with the test sample.

### Testing splitting for Case C

In [5]:
### CASE C ###
a=4 # pick an arbitrary index for testing purposes
test_pair=list(pairsdf.iloc[a,0:2])
print ("the test pair of proteins that should be excluded from the training set:",test_pair,"\n") #print the protein pair

#use conditional statements to remove any row that contains either protein
caseC_pairs = pairsdf[((pairsdf.iloc[:,0] != pairsdf.iloc[a,0]) & (pairsdf.iloc[:,0] != pairsdf.iloc[a,1])) & ((pairsdf.iloc[:,1] != pairsdf.iloc[a,0]) & (pairsdf.iloc[:,1] != pairsdf.iloc[a,1]))]

#print("The unique values of column 0 of case C:",caseC_pairs[0].unique(),"\n")
#print("The unique values of column 1 of case C:",caseC_pairs[1].unique(),"\n")

#test to see if protein 1 or protein 2 exists in the dataframe
print("Does either column contain the protein values:",test_pair,"? \n",caseC_pairs.isin(test_pair).any(),"\n")
print ("It is clear that any samples that contain either protein 1 or protein 2 have been removed from the training set for case C")


the test pair of proteins that should be excluded from the training set: ['P15', 'P17'] 

Does either column contain the protein values: ['P15', 'P17'] ? 
 0    False
1    False
dtype: bool 

It is clear that any samples that contain either protein 1 or protein 2 have been removed from the training set for case C


In [6]:
#### Case C
k=1
knn = KNeighborsClassifier(n_neighbors=k)
my_predictions=[]
my_tests=[]

for a in range(len(pairsdf)):
    
    #set xtest and ytest
    xtest=featuresdf.iloc[a,:]
    ytest=labelsdf.iloc[a,:]
    #set xtrain and ytrain
    #keep only the rows in which neither column holds one of the two test proteins
    p1, p2 = pairsdf.iloc[a,0], pairsdf.iloc[a,1]
    keep = (pairsdf.iloc[:,0] != p1) & (pairsdf.iloc[:,0] != p2) & (pairsdf.iloc[:,1] != p1) & (pairsdf.iloc[:,1] != p2)
    xtrain = featuresdf[keep]
    ytrain = labelsdf[keep].values.ravel()
    
    #train classifier
    knn.fit(xtrain,ytrain)
    #get predictions
    ypred=knn.predict(np.array(xtest).reshape(1, -1))
    #append true and predicted values to lists
    my_tests.append(ytest.values)
    my_predictions.append(ypred.flatten())
    
caseC_cindex=cindex(my_tests,my_predictions)
print("")
print ("Modified Case C LOOCV Cindex =",caseC_cindex)


Modified Case C LOOCV Cindex = 0.6313559322033898


### Discussion of results for task 1
1. Why do the two cross-validation schemes produce notably different estimates?
2. For which types of pairs (A, B, or C) are these schemes appropriate and why?


### Answer 1: 
The unmodified LOOCV used for case A gives a more optimistic estimate because both proteins of the test pair also appear in other pairs in the training set, so pairs that share a protein with the test pair leak information into the model. In the modified LOOCV for case C, every training pair that shares a protein with the test pair is removed, so the training data is independent of the test pair and no such leakage occurs. This explains the notably different estimates (0.762 versus 0.631).

### Answer 2: 
The unmodified LOOCV estimates the performance for type A pairs, because both proteins of the test pair also occur in the training set (only the test pair itself is left out). The modified LOOCV, which excludes every sample that contains either test protein, is appropriate for type C pairs, because type C pairs share no proteins with the training set.
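
The difference between the two schemes comes down to which rows are allowed into the training set for a given test row. A minimal sketch of that selection, assuming the pairs are available as an `(n, 2)` array of protein IDs; the helper `train_mask` and the toy pairs are hypothetical and only illustrate the idea:

```python
import numpy as np


def train_mask(pairs, i, scheme):
    """Boolean mask of training rows for test row i.

    pairs  : array-like of shape (n, 2) with the protein IDs of each pair
    scheme : "A" keeps every row except i (unmodified LOOCV);
             "C" also drops every row that shares a protein with row i
    """
    pairs = np.asarray(pairs)
    mask = np.ones(len(pairs), dtype=bool)
    mask[i] = False  # the test row itself is always left out
    if scheme == "C":
        # True for rows in which at least one column holds a test protein
        shares_protein = np.isin(pairs, pairs[i]).any(axis=1)
        mask &= ~shares_protein
    elif scheme != "A":
        raise ValueError("scheme must be 'A' or 'C'")
    return mask


# Hypothetical toy pairs; row 0 (P1, P2) is the test pair.
toy_pairs = [["P1", "P2"], ["P1", "P3"], ["P2", "P3"], ["P3", "P4"], ["P4", "P5"]]

print(train_mask(toy_pairs, 0, "A").tolist())  # [False, True, True, True, True]
print(train_mask(toy_pairs, 0, "C").tolist())  # [False, False, False, True, True]
```

Under scheme A, rows 1 and 2 stay in the training set even though each shares a protein with the test pair; under scheme C they are removed, which is where the lower estimate comes from.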

## Task 2 (optional)

**Successfully completing this task will give you an extra point in the exam.**

1. Design a leave-one-out cross-validation scheme that is appropriate for the type of pairs that was not covered by Task 1.

2. Explain why your cross-validation scheme is appropriate.

3. Implement your cross-validation scheme. Estimate and report the generalisation performance as in the first task.

4. Discuss your results. In particular, compare the results to those you obtained in the first task and give reasons for any similarities or differences you observe.

### Testing splitting for case B

### Case B
In case B we have to perform two rounds of training for each test pair:
1. once with every pair that contains protein 1 removed from the training set
2. once with every pair that contains protein 2 removed from the training set


In [7]:
### CASE B ###
a=4 # pick an arbitrary index for testing purposes
test_pair=list(pairsdf.iloc[a,0:2])

print ("In this case, protein 1 is {0} and protein 2 is {1} \n".format(test_pair[0],test_pair[1])) #print the protein pair

#select the indices that contain protein 1
print ("Case B protein 1 {0} removed:".format(test_pair[0]))
caseB_protien1 = pairsdf[(pairsdf.iloc[:,0] == pairsdf.iloc[a,0]) | (pairsdf.iloc[:,1] == pairsdf.iloc[a,0])].index
protien1_removed=pairsdf.drop(caseB_protien1)
#print("The unique values of column 0 of case B protein 1 removed:\n",protien1_removed[0].unique(),"\n")
#print("The unique values of column 1 of case B protein 1 removed:\n",protien1_removed[1].unique(),"\n")

#wrap the ID in a list: list("P15") would split the string into single characters and the check would always be False
print("Does either column contain the protein value:",test_pair[0],"? \n",protien1_removed.isin([test_pair[0]]).any(),"\n")


#select the indices that contain protein 2
print ("Case B protein 2 {0} removed:".format(test_pair[1]))
caseB_protien2 = pairsdf[(pairsdf.iloc[:,0] == pairsdf.iloc[a,1]) | (pairsdf.iloc[:,1] == pairsdf.iloc[a,1])].index
protien2_removed=pairsdf.drop(caseB_protien2)
#print("The unique values of column 0 of case B protein 2 removed:\n",protien2_removed[0].unique(),"\n")
#print("The unique values of column 1 of case B protein 2 removed:\n",protien2_removed[1].unique(),"\n")

print("Does either column contain the protein value:",test_pair[1],"? \n",protien2_removed.isin([test_pair[1]]).any(),"\n")
print ("")
print ("It is clear that for each protein we are able to select the indices and remove them from the training set")

In this case, protein 1 is P15 and protein 2 is P17 

Case B protein 1 P15 removed:
Does either column contain the protein value: P15 ? 
 0    False
1    False
dtype: bool 

Case B protein 2 P17 removed:
Does either column contain the protein value: P17 ? 
 0    False
1    False
dtype: bool 


It is clear that for each protein we are able to select the indices and remove them from the training set


### Task 2 Modified LOOCV CASE B

1. Train the classifier twice for each test pair.
2. Train once with the rows that contain protein 1 removed.
3. Train again with the rows that contain protein 2 removed.
4. Pool the true and predicted values from both rounds to calculate the C-index.
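
The two training sets described in the steps above can be sketched on toy data as follows. This is only an illustration of the row selection, assuming the pairs are held in an `(n, 2)` array; the helper `case_b_masks` and the toy pairs are hypothetical names, not part of the exercise files.

```python
import numpy as np


def case_b_masks(pairs, i):
    """Return the two training masks used for test row i in case B.

    The first mask drops every row containing the first test protein,
    the second mask drops every row containing the second test protein.
    The test row contains both proteins, so it is dropped by both masks.
    """
    pairs = np.asarray(pairs)
    first, second = pairs[i]
    has_first = (pairs == first).any(axis=1)
    has_second = (pairs == second).any(axis=1)
    return ~has_first, ~has_second


# Hypothetical toy pairs; row 0 (P1, P2) is the test pair.
toy_pairs = [["P1", "P2"], ["P1", "P3"], ["P2", "P3"], ["P3", "P4"], ["P4", "P5"]]

without_first, without_second = case_b_masks(toy_pairs, 0)
print(without_first.tolist())   # [False, False, True, True, True]
print(without_second.tolist())  # [False, True, False, True, True]
```

In each round exactly one of the two test proteins can still occur in the training rows (P2 in the first round, P1 in the second), which matches the definition of a type B pair.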

In [8]:
k=1
knn = KNeighborsClassifier(n_neighbors=k)
my_predictions=[]
my_tests=[]

for a in range(len(pairsdf)):
    
    #set xtest and y test
    xtest=featuresdf.iloc[a,:]
    ytest=labelsdf.iloc[a,:]
    
    #get the row indices of all pairs that contain protein 1 (this includes the test pair itself)
    caseB_protien1 = pairsdf[(pairsdf.iloc[:,0] == pairsdf.iloc[a,0]) | (pairsdf.iloc[:,1] == pairsdf.iloc[a,0])].index
    
    # set xtrain and ytrain by removing every row that contains protein 1
    xtrain=featuresdf.drop(caseB_protien1)
    ytrain= labelsdf.drop(caseB_protien1).values.ravel()
    
    #train classifier
    knn.fit(xtrain,ytrain)
    #get predictions
    ypred=knn.predict(np.array(xtest).reshape(1, -1))
    #append true and predicted values to lists
    my_tests.append(ytest.values)
    my_predictions.append(ypred.flatten())
    
    #get the row indices of all pairs that contain protein 2 (this includes the test pair itself)
    caseB_protien2 = pairsdf[(pairsdf.iloc[:,0] == pairsdf.iloc[a,1]) | (pairsdf.iloc[:,1] == pairsdf.iloc[a,1])].index
    
    # set xtrain and ytrain by removing every row that contains protein 2
    xtrain=featuresdf.drop(caseB_protien2)
    ytrain= labelsdf.drop(caseB_protien2).values.ravel()
    
    #train classifier
    knn.fit(xtrain,ytrain)
    #get predictions
    ypred=knn.predict(np.array(xtest).reshape(1, -1))
    #append true and predicted values to lists
    my_tests.append(ytest.values)
    my_predictions.append(ypred.flatten())
    
caseB_cindex=cindex(my_tests,my_predictions)
print ("")
print ("Modified Case B LOOCV Cindex =",caseB_cindex)


Modified Case B LOOCV Cindex = 0.696563088512241


### Task 2 bonus answers:

1. Explain why your cross-validation scheme is appropriate:

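Case B is the scenario in which exactly one of the two proteins in the test pair also appears in the training set. To simulate it, each test pair is predicted twice. In the first round every training pair that contains protein 1 is removed, so only protein 2 is known to the model. In the second round every pair that contains protein 2 is removed instead. The test pair itself contains both proteins, so it is excluded in both rounds. The predictions from both rounds are pooled before the C-index is computed.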


2. Discuss your results. In particular, compare the results to those you obtained in the first task and give reasons for any similarities or differences you observe:

It makes sense that the C-index of case B (0.697) lies between those of case A (0.762) and case C (0.631). In case A both proteins of the test pair also occur in the training set, which gives the best result because the test and training sets are most similar. In case C no protein is shared between the test and training sets. In case B one protein is shared, so the training set carries some information about the test pair: less than in case A, but more than in case C.