# Task 2: Random Data?

## Question

> I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
>
> Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
>
> Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
>
> Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mediate the problem, but I am looking for other issues.

```R
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){    
    #get train

    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
    next

    #test from train
    test=seq(length(Y))[-train]

    #train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)
    XXX=X[test,]
    predVec=predict(mod,XXX)
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)
```

Similarly, when I implement the same thing in Python I get similar results.



In [None]:
import numpy as np
from sklearn.svm import SVC
from sklearn import metrics
from scipy import stats
from sklearn.metrics import roc_curve, auc

In [None]:
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y)) == 1:
        continue
    test = np.array([i for i in range(len(Y)) if i not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))

0.9081632653061225


## Your answer

This answer walks through my problem-solving process. I explored several possibilities and ran various checks, including how the predicted values were passed to roc_curve and whether averaging per-iteration AUCs changed the picture. I concluded that the root cause of the strange results is the way the data is split during cross-validation. The explanation below details how and why I arrived at this conclusion.

In my first attempt, I addressed the issue of passing continuous values to roc_curve. Instead of averaging the predictions directly, I tried two alternatives: taking the mode of each sample's predictions, or mapping the average to 2 if it was at least 1.5 and to 1 otherwise. Neither gave satisfactory results. The outcomes improved slightly, but the AUC stayed well above 0.5 in most experiments, so this could not be the correct explanation.

After the first attempt, it was clear that another explanation was needed. I began to scrutinize how the final AUC was calculated, suspecting a flaw in the approach. I believed that each cross-validation iteration should be treated as a separate experiment, with the AUC computed independently for each one and the final result taken as the average of those AUCs. So in my second attempt, I kept the cross-validation method from the question but computed the AUC separately for each iteration and then averaged the values. This brought the mean AUC much closer to 0.5 and resolved the earlier issue, which makes it my preferred way to evaluate this task.

In my third attempt, I used k-fold cross-validation with five folds to gauge the model's performance (the code for this attempt uses StratifiedKFold). I favor this approach because every data point is tested exactly once, and I have had better results with it in the past. As in the second attempt, I averaged the AUCs obtained on the test data of each fold to get the final AUC. The result was again close to 0.5, as expected for random data.


The earlier attempts did not pinpoint the core issue. As the question itself suggests, increasing the number of rows in X or reducing the number of columns might mitigate the problem, which points to the data-splitting process as the primary issue. Randomly assigning samples to the training and test sets can produce imbalanced class distributions, resulting in a biased model that consistently predicts one class. Depending on the pos_label parameter, the measured performance can then look either exceptionally good or exceptionally bad.

To test this hypothesis, I ran a fourth experiment, changing the splitting approach to StratifiedKFold cross-validation with five folds. Every data point is tested exactly once and, importantly, StratifiedKFold addresses the issue of imbalanced train and test splits: as its documentation states, the folds are made by preserving the percentage of samples for each class. The AUC obtained in this attempt was close to 0.5. I therefore conclude that the main problem was the methodology used for data splitting.
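If one wanted to keep the repeated random half-splits from the question rather than switch to k-fold, the same stratification idea could be applied to each split. The following is a minimal NumPy-only sketch and not part of my original attempts; `stratified_half_split` is a hypothetical helper name. scikit-learn's `StratifiedShuffleSplit` is the ready-made tool for this kind of split.

```python
import numpy as np

def stratified_half_split(y, rng):
    """Return (train, test) index arrays with half of each class in train.

    Hypothetical helper, for illustration only.
    """
    train_parts = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)  # indices of this class
        picked = rng.choice(idx, size=len(idx) // 2, replace=False)
        train_parts.append(picked)
    train = np.sort(np.concatenate(train_parts))
    test = np.setdiff1d(np.arange(len(y)), train)  # everything not in train
    return train, test

rng = np.random.default_rng(0)
Y = np.array([1, 2] * 14)

# Class counts (class 1, class 2) in the training set for 1000 stratified splits
train_counts = np.array([
    np.bincount(Y[stratified_half_split(Y, rng)[0]], minlength=3)[1:]
    for _ in range(1000)
])
print(np.unique(train_counts, axis=0))  # distinct (class 1, class 2) count pairs
```

Because half of each class is drawn separately, every training set holds 7 samples of each class, so no iteration can hand the model a majority class to latch onto.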

Let me explain why an imbalanced training set can lead to unusual results. We have a matrix in which we record the predicted values (y_pred) for X_test in each of the 1000 iterations. If everything were working well, the final average for each sample (ans), obtained by taking the mean of each row of the matrix across iterations, would be around 1.5, give or take a bit.

However, due to random splitting and the small sample size, some iterations produce an imbalanced training set. This biases the model towards predicting the majority class of its training set. Because the full data set is balanced, a training set with a class 2 majority leaves a test set with a class 1 majority, so the model predicts class 2 mostly for samples that are actually class 1, and vice versa. Consequently, the averages for class 1 samples end up greater than 1.5, and those for class 2 samples end up smaller than 1.5.
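This mechanism can be reproduced without any features at all. The sketch below rests on a strong assumption: the SVM is replaced by a stand-in that learns nothing and simply predicts the majority class of its training set (splits with a 7/7 tie are skipped). It uses only NumPy, and the variable names are mine rather than taken from the notebook cells.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = np.array([1, 2] * 14)
n = len(Y)
n_iter = 1000
ans_mat = np.full((n, n_iter), np.nan)

for it in range(n_iter):
    # Same unstratified half split as in the question
    train = rng.choice(n, size=n // 2, replace=False)
    test = np.setdiff1d(np.arange(n), train)
    n1 = np.sum(Y[train] == 1)
    n2 = np.sum(Y[train] == 2)
    if n1 == n2:
        continue  # no majority class in this split, leave NaN
    # Assumed stand-in for the SVM: always predict the training majority class
    ans_mat[test, it] = 1 if n1 > n2 else 2

ans = np.nanmean(ans_mat, axis=1)  # per-sample average prediction
mean_class1 = ans[Y == 1].mean()
mean_class2 = ans[Y == 2].mean()
print("mean average for class 1 samples:", mean_class1)
print("mean average for class 2 samples:", mean_class2)
```

With this stand-in, the class 1 samples average above 1.5 and the class 2 samples below it, even though the predictor never looks at X. The real SVM is not this extreme, but the direction of the bias is the same.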

When we pass these averages to roc_curve, it evaluates up to n+1 thresholds, with n being the number of unique values in our ans list (by default, some intermediate thresholds may be dropped). The largest threshold sits above every score so that the curve starts at (0, 0): it is the maximum element in ans + 1 in older scikit-learn releases and infinity in newer ones. The smallest threshold is the minimum element in ans, and the others lie in between. Because class 1 samples have the higher averages, setting pos_label to 1 gives a high True Positive Rate (TPR) and a low False Positive Rate (FPR) at many thresholds, leading to a high Area Under the Curve (AUC). Conversely, setting pos_label to 2 yields a low TPR and a high FPR, resulting in a very low AUC.
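The effect of pos_label can be checked with a small pairwise AUC, which is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count half). This is a NumPy-only sketch; `pairwise_auc` is a hypothetical helper rather than scikit-learn's implementation, and the scores are made-up values that mimic averages mostly above 1.5 for class 1 and mostly below 1.5 for class 2.

```python
import numpy as np

def pairwise_auc(y_true, scores, pos_label):
    """AUC as P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == pos_label]
    neg = scores[y_true != pos_label]
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Made-up averages for illustration only
y = np.array([1, 2, 1, 2, 1, 2])
scores = np.array([1.62, 1.41, 1.58, 1.37, 1.49, 1.52])

auc_pos1 = pairwise_auc(y, scores, pos_label=1)
auc_pos2 = pairwise_auc(y, scores, pos_label=2)
print(auc_pos1, auc_pos2)
```

The two values always sum to 1, so an AUC far above 0.5 with pos_label=1 corresponds to one equally far below 0.5 with pos_label=2.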

To address this issue, we can use a different cross-validation technique, such as stratified k-fold. Stratified k-fold also ensures that each sample is used in a test set exactly once, which removes the need to average predictions across iterations.

In conclusion, the fundamental issue appears to stem from the unconventional data-splitting method: randomly selecting training and test sets from so few rows can lead to class imbalances, which bias the model's predictions. Moreover, averaging the 'y_pred' values and feeding that average into the ROC analysis is questionable; treating each cross-validation iteration as a separate experiment and computing its AUC independently seems to offer a more robust and reliable evaluation. The code for all attempts is provided below.

### First attempt

In [None]:
def nanmode(arr):
    mask = ~np.isnan(arr)  # Create a mask to exclude NaN values
    if np.any(mask):
        # Mode of the non-NaN values; atleast_1d copes with SciPy returning either a scalar or an array
        return np.atleast_1d(stats.mode(arr[mask]).mode)[0]
    else:
        return np.nan  # If all values are NaN, return NaN

In [None]:
threshold = 1.5  # Choose an appropriate threshold value

In [None]:
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y[train])) == 1:  # skip splits whose training set holds a single class
        continue
    test = np.setdiff1d(np.arange(len(Y)), train)  # all indices not used for training
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
binary_predictions = np.where(ans >= threshold, 2, 1)
fpr, tpr, thresholds = metrics.roc_curve(Y, binary_predictions, pos_label=1)
# ans = np.apply_along_axis(nanmode, axis=1, arr=ansMat)
# fpr, tpr, thresholds = metrics.roc_curve(Y, ans, pos_label=1)
print("AUC: ",metrics.auc(fpr, tpr))

AUC:  0.7142857142857143


### Second attempt

In [None]:
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.svm import SVC

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])

n_iter = 1000  # Number of cross-validation iterations
auc_values = []  # One AUC per iteration

# Perform the task's custom cross-validation
for i in range(n_iter):
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y[train])) == 1:  # Skip splits whose training set holds a single class
        continue
    test = np.setdiff1d(np.arange(len(Y)), train)  # All indices not used for training

    clf = SVC(probability=False)
    clf.fit(X=X[train, :], y=Y[train])

    # Make predictions on the test data
    y_pred = clf.predict(X[test, :])

    # Compute ROC curve and AUC for this iteration only
    fpr, tpr, _ = roc_curve(Y[test], y_pred, pos_label=2)
    auc_values.append(auc(fpr, tpr))

# Calculate the mean AUC across all iterations (ignoring any undefined AUCs)
mean_auc = np.nanmean(auc_values)

print("Mean AUC:", mean_auc)


Mean AUC: 0.543625


### Third attempt

In [None]:
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])

n_splits = 5  # Choose an appropriate number of splits
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize list to store AUC for each fold
auc_values = []

# Perform cross-validation
for train_idx, test_idx in cv.split(X, Y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = Y[train_idx], Y[test_idx]

    clf = SVC(probability=False)
    clf.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = clf.predict(X_test)

    # Compute ROC curve and AUC for this fold
    fpr, tpr, _ = roc_curve(y_test, y_pred, pos_label=2)
    fold_auc = auc(fpr, tpr)

    # Append AUC to the list
    auc_values.append(fold_auc)

# Calculate the mean AUC across all folds
mean_auc = np.mean(auc_values)

print("Mean AUC:", mean_auc)


Mean AUC: 0.5


### Fourth attempt

In [None]:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

# Set a random seed for reproducibility
# np.random.seed(42)

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_splits = 5
ansMat = np.full((len(Y), n_splits), np.nan)  # one column per fold

# Use StratifiedKFold for better data splitting
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

for i, (train, test) in enumerate(skf.split(X, Y)):
    # train model
    mod = SVC(probability=False)
    mod.fit(X[train, :], Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])

ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=2)
print("AUC: ",auc(fpr, tpr))


AUC:  0.5


## Feedback

Was this exercise difficult or not? In either case, briefly describe why.

I prefer not to categorize this task as either difficult or easy. Instead, I find it more fitting to describe it as an enjoyable and challenging endeavor that requires a unique approach. While I cannot guarantee that my response aligns precisely with your expectations, I appreciate the opportunity to engage in problem-solving that encourages a shift in perspective. It felt like a mental workout, and I relish the chance to navigate such stimulating paths.