# Task 2: Why are the predictions too good (/bad)?

## Question

> I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
>
> Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
> 
> Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
>
> Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.

```R
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){    
    #get train

    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
    next

    #test from train
    test=seq(length(Y))[-train]

    #train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)
    XXX=X[test,]
    predVec=predict(mod,XXX)
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)
```

Similarly, when I implement the same thing in Python I get similar results.



In [1]:
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.preprocessing import normalize
from sklearn.model_selection import GridSearchCV, LeaveOneOut, KFold

First of all, I want to normalize the random variables X so that they are on a similar scale, because an SVM is a distance-based algorithm. Note that `normalize(x, axis=1)` rescales each row (sample) to unit L2 norm.

In [2]:
Y = np.array([1, 2]*14)
x = np.random.uniform(size=[len(Y), 100])
X_normalized = normalize(x, axis=1)
X = X_normalized
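
To make the effect of this normalization step concrete, here is a minimal sketch comparing `normalize(..., axis=1)` with column-wise standardization. `StandardScaler` is not used in the notebook; it is brought in only as one common alternative, and the names `x_demo`, `row_normalized`, and `col_standardized` as well as the seed are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
x_demo = rng.uniform(size=(28, 100))  # same shape as X in the notebook

# normalize(..., axis=1) rescales every row (sample) to unit L2 norm.
row_normalized = normalize(x_demo, axis=1)
row_norms = np.linalg.norm(row_normalized, axis=1)

# StandardScaler instead centers and scales every column (feature).
col_standardized = StandardScaler().fit_transform(x_demo)
col_means = col_standardized.mean(axis=0)
col_stds = col_standardized.std(axis=0)

print(row_norms[:3])
print(col_means[:3], col_stds[:3])
```

Which of the two is appropriate depends on whether samples or features should share a scale; the sketch only shows what each transformation does.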

Now I want to find the best parameters (C and gamma) for classifying the given dataset with `GridSearchCV`.

In [143]:
param_grid = [
    {
        'C': [1, 10, 80, 100, 1000],
        'gamma': ['scale', 1, 0.1, 0.01, 0.001, 0.0001],
        'kernel': ['rbf'],
    }
]
kfold = KFold(n_splits=5)
svc = SVC(probability=True)
optimal_params = GridSearchCV(
    svc,
    param_grid,
    cv=kfold,
    scoring='roc_auc',
    verbose=0
)

I use the parameters found below in my `SVC()` to address the problem and bring the result closer to what we expect for random data.

In [144]:
optimal_params.fit(X,Y)
print('optimal_parameter for RBF kernel is :')
print( optimal_params.best_params_)

optimal_parameter for RBF kernel is :
{'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}


In [64]:
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([i for i in range(len(Y)) if i not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))

0.7857142857142857


In [6]:
nan_counts_per_row = np.count_nonzero(np.isnan(ansMat), axis=1)
# Print the number of nan values in each row
print(nan_counts_per_row)
print(ans)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))

[490 497 505 478 497 504 512 498 512 501 500 491 503 500 517 457 523 516
 486 510 513 493 514 486 523 496 486 492]
[1.5372549  1.26242545 1.58787879 1.55555556 1.5387674  1.27620968
 1.51639344 1.37051793 1.55737705 1.32665331 1.434      1.27897839
 1.56740443 1.504      1.69151139 1.2946593  1.41509434 1.46487603
 1.58365759 1.51836735 1.45995893 1.30769231 1.68518519 1.28210117
 1.67085954 1.43452381 1.54669261 1.31692913]
0.9030612244897959
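
One way to read the printed `ans` vector is to group it by true label. The sketch below copies the values printed above and compares the mean prediction for rows labelled 1 with that for rows labelled 2. The variable names are assumptions for this illustration, and `roc_auc_score` is used in place of `roc_curve` plus `auc` for brevity.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

Y = np.array([1, 2] * 14)
# Values copied from the printed `ans` vector above.
ans = np.array([
    1.5372549, 1.26242545, 1.58787879, 1.55555556, 1.5387674, 1.27620968,
    1.51639344, 1.37051793, 1.55737705, 1.32665331, 1.434, 1.27897839,
    1.56740443, 1.504, 1.69151139, 1.2946593, 1.41509434, 1.46487603,
    1.58365759, 1.51836735, 1.45995893, 1.30769231, 1.68518519, 1.28210117,
    1.67085954, 1.43452381, 1.54669261, 1.31692913,
])

# Mean prediction per true label (predictions are averages of labels 1 and 2).
mean_pred_label1 = ans[Y == 1].mean()
mean_pred_label2 = ans[Y == 2].mean()

# AUC with label 1 treated as the positive class, as in roc_curve(..., pos_label=1).
auc_label1_positive = roc_auc_score(Y == 1, ans)

print(mean_pred_label1, mean_pred_label2)
print(auc_label1_positive)
```

In this run, rows labelled 1 receive a higher average prediction (roughly 1.56) than rows labelled 2 (roughly 1.37). Since `pos_label=1` treats larger scores as evidence for label 1, that ordering is what yields an AUC well above 0.5.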


## Your answer

The problem is that about half of `ansMat` is NaN, because each iteration of the leave-group-out cross-validation uses half of the dataset for training and half for testing. As the counts printed above show, the number of NaN entries differs noticeably from row to row, so each row's mean is computed over a different number of iterations, which causes this problem. To solve it, we need to compute probability estimates for each split and iteration.


In [128]:
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
# Entries stay 0 for iterations in which a sample is in the training half,
# so the mean below is taken over all n_iter columns.
prob_estimates = np.zeros((len(Y), n_iter))

for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    # Skip splits whose training half contains a single class
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])

    # Train model with probability estimation enabled:
    # gamma is taken from optimal_params, C was chosen by trial and error
    mod = SVC(C=80, gamma='scale', probability=True)
    mod.fit(X=X[train, :], y=Y[train])

    # Predict and collect probability estimates
    # (column 1 of predict_proba corresponds to mod.classes_[1], i.e. label 2)
    prob_estimates[test, i] = mod.predict_proba(X[test, :])[:, 1]

# Calculate the mean probability estimate across iterations for each test instance
mean_prob_estimates = np.nanmean(prob_estimates, axis=1)
#print(mean_prob_estimates)
fpr, tpr, thresholds = roc_curve(Y, mean_prob_estimates, pos_label=1)
print(auc(fpr, tpr))

0.5561224489795918
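
As a possible cross-check, each split can also be scored on its own held-out half, with the AUC values averaged afterwards instead of the predictions. This differs from the cell above and is only a sketch under stated assumptions: `per_split_auc`, the fixed seed, and the reduced `n_iter` are choices made for this illustration, and `decision_function` is used as a continuous score instead of `predict_proba`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
Y = np.array([1, 2] * 14)
X = rng.uniform(size=(len(Y), 100))
n_iter = 200  # fewer iterations than the notebook, to keep the sketch quick

per_split_auc = []
for _ in range(n_iter):
    order = rng.permutation(len(Y))
    train, test = order[: len(Y) // 2], order[len(Y) // 2:]
    # Both halves need both classes: SVC to fit, roc_auc_score to score.
    if len(np.unique(Y[train])) < 2 or len(np.unique(Y[test])) < 2:
        continue
    mod = SVC().fit(X[train], Y[train])
    # For a binary SVC, larger decision values favour mod.classes_[1] (label 2).
    scores = mod.decision_function(X[test])
    per_split_auc.append(roc_auc_score(Y[test] == 2, scores))

mean_auc = float(np.mean(per_split_auc))
print(mean_auc)
```

Because the ranking inside one held-out half does not depend on how many iterations a sample was tested in, this variant avoids mixing predictions from different models. With a dataset this small, the averaged value can still deviate from 0.5 for any single random `X`.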


## Feedback

Was this exercise difficult or not? In either case, briefly describe why.

One of the hardest challenges in this question was defining a custom scoring function for leave-group-out cross-validation with 1000 iterations.

Choosing the right kernel for a Support Vector Machine (SVM) can be a complex task, as it heavily influences the model's performance. The Radial Basis Function (RBF) kernel is often favored for its ability to handle complex, non-linear patterns, since it implicitly maps the input data into a higher-dimensional space where linear separation becomes possible. I also know that RBF usually works well for classification because it corresponds to an inner product in an infinite-dimensional feature space. However, when I switched to a linear kernel, I encountered an unexpected outcome: my Area Under the Curve (AUC) dropped toward 0.5 (0.617 in the run below). Although a linear kernel works directly on the input features without any mapping to a higher dimension, this result was puzzling. I am still trying to understand why simply calculating dot products between feature vectors led to such a significant decrease in the AUC.

In [146]:
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    # Skip splits whose training half contains a single class
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([i for i in range(len(Y)) if i not in train])
    # train model
    mod = SVC(kernel='linear', probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))

0.6173469387755102
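
To make the kernel comparison discussed above more concrete, the following sketch computes both similarity matrices directly. It assumes scikit-learn's `linear_kernel` and `rbf_kernel` helpers, which the notebook does not use, and reproduces `gamma='scale'` as `1 / (n_features * X.var())`. The names `X_demo`, `K_linear`, `K_rbf`, and `off_diag` are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
X_demo = rng.uniform(size=(28, 100))  # same shape as X in the notebook

# Linear kernel: plain dot products between samples.
K_linear = linear_kernel(X_demo)

# RBF kernel: exp(-gamma * ||a - b||^2).
# gamma='scale' in SVC corresponds to 1 / (n_features * X.var()).
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())
K_rbf = rbf_kernel(X_demo, gamma=gamma_scale)

# Similarities between distinct samples (diagonal removed).
off_diag = K_rbf[~np.eye(len(X_demo), dtype=bool)]

print(K_linear.shape, K_rbf.shape)
print(off_diag.min(), off_diag.max())
```

Printing the range of `off_diag` shows how similar the RBF kernel considers distinct random samples to be, which may help when reasoning about why the two kernels behave differently on this data.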
