# Task 2: Why are the predictions too good (/bad)?

## Question

> I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
>
> Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
>
> Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
>
> Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.

```R
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){    
    #get train

    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
    next

    #test from train
    test=seq(length(Y))[-train]

    #train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)
    XXX=X[test,]
    predVec=predict(mod,XXX)
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)
```

Similarly, when I implement the same thing in Python I get similar results.



In [None]:
import numpy as np
from sklearn.svm import SVC
from sklearn import metrics
from scipy import stats

In [None]:
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y)) == 1:
        continue
    test = np.array([i for i in range(len(Y)) if i not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = metrics.roc_curve(Y, ans, pos_label=1)
print(metrics.auc(fpr, tpr))

0.9591836734693877


## Your answer

My first concern was how `y_pred` is built from the cross-validation output. Averaging the predicted class labels across iterations and passing that average to the ROC curve as a score did not look like sound practice for a binary classification task, although it was only one of several concerns. Each cross-validation iteration should instead be treated as a separate experiment: compute the AUC on that iteration's test set, then average the per-iteration AUCs to get the final result. I tested each of the proposed fixes empirically.

In my first attempt, I tried to avoid passing the averaged (continuous) values to the ROC curve. Instead of using the plain average, I tried two alternatives: taking the mode of each sample's predictions, or thresholding the average, mapping it to 2 if it was at least 1.5 and to 1 otherwise. Neither gave satisfactory results. The AUC dropped somewhat, but it still came out well above 0.5 in most runs, so this cannot be the fix.

In my second attempt, I kept the cross-validation scheme from the question but, as noted above, treated each iteration as an independent experiment: I computed the AUC on each iteration's test set and then averaged those AUCs. This brought the result much closer to 0.5 and resolved the earlier issue, making it the preferred fix for this task.

In my third attempt, I used stratified k-fold cross-validation with five folds. I prefer this scheme because every data point is used for testing exactly once, and it has given me better results in the past. As in the second attempt, I averaged the per-fold test AUCs to get the final value, which was again close to the expected 0.5.


In summary, the behavior of the code in the question comes mainly from its unconventional use of cross-validation. Drawing random train/test halves from so few rows causes problems, especially when the training half is imbalanced: because the full label set is balanced, any class that is over-represented in the training half is under-represented in the test half. In addition, I do not think that averaging the `y_pred` values across iterations and using that average in the ROC analysis is sound. It is better to treat each cross-validation iteration as a separate experiment, compute its AUC independently, and average the individual AUCs for a more reliable result. The code for all three attempts is provided below:

### First attempt

In [None]:
def nanmode(arr):
    mask = ~np.isnan(arr)  # Create a mask to exclude NaN values
    if np.any(mask):
        # Mode of the non-NaN values; atleast_1d copes with SciPy returning a scalar or an array
        return np.atleast_1d(stats.mode(arr[mask]).mode)[0]
    else:
        return np.nan  # If all values are NaN, return NaN

In [None]:
threshold = 1.5  # Choose an appropriate threshold value

In [None]:
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y[train])) == 1:  # skip splits whose training half has a single class
        continue
    test = np.array([i for i in range(len(Y)) if i not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
binary_predictions = np.where(ans >= threshold, 2, 1)
fpr, tpr, thresholds = metrics.roc_curve(Y, binary_predictions, pos_label=1)
# ans = np.apply_along_axis(nanmode, axis=1, arr=ansMat)
# fpr, tpr, thresholds = metrics.roc_curve(Y, ans, pos_label=1)
print(metrics.auc(fpr, tpr))

0.6071428571428571
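
A possible reason the pooled score stays above 0.5 even after thresholding is that the bias comes from the split itself rather than from the SVM. The following is a minimal sketch, not part of the original experiment: it replaces the classifier with a hypothetical stand-in that ignores `X` and simply predicts the mean label of the training half, then pools those predictions per sample as the question's code does. The function name `pooled_auc_without_features` and the fixed seed are illustrative assumptions.

```python
import random


def pooled_auc_without_features(n_iter=1000, seed=0):
    """Pool predictions from a stand-in 'classifier' that never looks at X."""
    rng = random.Random(seed)      # fixed seed, chosen only for reproducibility
    y = [1, 2] * 14                # same labels as in the question
    n = len(y)
    sums = [0.0] * n               # running sum of predictions per sample
    counts = [0] * n               # how often each sample was in the test half

    for _ in range(n_iter):
        train = set(rng.sample(range(n), n // 2))
        # "Prediction" for every test sample: mean label of the training half.
        train_mean = sum(y[j] for j in train) / len(train)
        for j in range(n):
            if j not in train:
                sums[j] += train_mean
                counts[j] += 1

    scores = [s / c for s, c in zip(sums, counts)]

    # AUC with class 1 as the positive label, as in the question's roc_curve call:
    # fraction of (class 1, class 2) pairs where the class-1 score is higher.
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 2]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


pooled_auc = pooled_auc_without_features()
print("Pooled AUC without using X:", pooled_auc)
```

Because the full label vector is balanced, holding a sample out leaves its own class slightly under-represented among the candidates for the training half, so the pooled score for each sample leans toward the opposite class. Averaging over many iterations removes the noise but not this lean, which is why the pooled AUC in this sketch comes out far from 0.5 although no feature is ever used.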


### Second attempt

In [None]:
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.svm import SVC

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])

n_iter = 1000  # Number of cross-validation iterations
auc_values = []  # One AUC per iteration

# Repeat the random 50/50 split from the question
for i in range(n_iter):
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    # Skip splits whose training half contains only one class
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.setdiff1d(np.arange(len(Y)), train)

    # Train the classifier (SVM)
    clf = SVC(probability=False)
    clf.fit(X=X[train, :], y=Y[train])

    # Make predictions on the test data
    y_pred = clf.predict(X[test, :])

    # Compute ROC curve and AUC for this iteration only
    fpr, tpr, _ = roc_curve(Y[test], y_pred, pos_label=2)
    auc_values.append(auc(fpr, tpr))

# Calculate the mean AUC across all iterations
mean_auc = np.mean(auc_values)

print("Mean AUC:", mean_auc)


Mean AUC: 0.5344166666666667
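
Since `predict` returns hard labels, each per-iteration ROC curve above has only one operating point. A variant worth trying, sketched here under assumptions that are not in the original code, scores the test samples with `decision_function` and keeps both halves class-balanced with `StratifiedShuffleSplit`. The seed, the number of splits (200), and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
Y = np.array([1, 2] * 14)
X = rng.uniform(size=(len(Y), 100))

# Repeated 50/50 splits that keep both classes balanced in each half.
splitter = StratifiedShuffleSplit(n_splits=200, test_size=0.5, random_state=0)

score_aucs = []
for train_idx, test_idx in splitter.split(X, Y):
    clf = SVC(probability=False)
    clf.fit(X[train_idx], Y[train_idx])
    # Signed distance to the decision boundary; larger values favor clf.classes_[1] (label 2).
    scores = clf.decision_function(X[test_idx])
    # For binary targets, roc_auc_score treats the larger label (2) as the positive class.
    score_aucs.append(roc_auc_score(Y[test_idx], scores))

mean_score_auc = float(np.mean(score_aucs))
print("Mean AUC from decision scores:", mean_score_auc)
```

The value still depends on the single random `X` that was drawn, so one run can land some distance from 0.5 in either direction.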


### Third attempt

In [None]:
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])

n_splits = 5  # Choose an appropriate number of splits
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize list to store AUC for each fold
auc_values = []

# Perform cross-validation
for train_idx, test_idx in cv.split(X, Y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = Y[train_idx], Y[test_idx]

    # Train your classifier (e.g., SVM)
    clf = SVC(probability=False)
    clf.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = clf.predict(X_test)

    # Compute ROC curve and AUC for this fold
    fpr, tpr, _ = roc_curve(y_test, y_pred, pos_label=2)  # Adjust pos_label accordingly
    fold_auc = auc(fpr, tpr)

    # Append AUC to the list
    auc_values.append(fold_auc)

# Calculate the mean AUC across all folds
mean_auc = np.mean(auc_values)

print("Mean AUC:", mean_auc)


Mean AUC: 0.5
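
The same per-fold evaluation can probably be written more compactly with `cross_val_score`. This is a sketch rather than a replacement for the cell above: with `scoring="roc_auc"`, scikit-learn scores each fold from the SVM's continuous decision values instead of the hard labels used above, so the numbers need not match. The seed for `X` is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
Y = np.array([1, 2] * 14)
X = rng.uniform(size=(len(Y), 100))

# Same fold layout as in the third attempt.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One AUC per fold, each computed only on that fold's held-out samples.
fold_aucs = cross_val_score(SVC(probability=False), X, Y, cv=cv, scoring="roc_auc")

print("Per-fold AUC:", fold_aucs)
print("Mean AUC:", fold_aucs.mean())
```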


## Feedback

Was this exercise difficult or not? In either case, briefly describe why.

This task initially appeared quite challenging, but upon closer examination, it revealed itself to be a thought-provoking and conceptual question that required a deeper level of contemplation to answer effectively.