# Task 2: Why are the predictions too good (/bad)?

## Question

> I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
>
> Given randomly generated data (X) and response (Y), this code performs leave-group-out cross-validation 1000 times. Each entry of `ans` is therefore the mean of the predictions across CV iterations.
>
> Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
>
> Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.

```R
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){    
    #get train

    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
    next

    #test from train
    test=seq(length(Y))[-train]

    #train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)
    XXX=X[test,]
    predVec=predict(mod,XXX)
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)
```

Similarly, when I implement the same thing in Python I get similar results.



```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    # Skip splits whose training set contains only one class (as in the R version)
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
```

```
0.7857142857142857
```


## Your answer

On purely random data the AUC should indeed be close to 0.5 on average, so a value such as 0.79 calls for an explanation. Several factors could be contributing to it:

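**Data Imbalance in Train/Test Splits:** The code draws 14 of the 28 rows at random without stratification, so the training set often contains unequal numbers of the two classes. Because the test set is the complement of the training set, it is then imbalanced in the opposite direction. A classifier that leans toward the class it saw more often in training will therefore tend to mislabel the test rows, most of which belong to the other class. Averaged over many splits, this can push the predictions systematically away from chance rather than leaving them at 0.5. In the Python code the scores are mean predicted labels (higher means class 2) while `pos_label=1`, so systematically wrong predictions show up as an AUC above 0.5; pROC's `roc()` appears to choose the comparison direction automatically by default, which would likewise report such a result as above 0.5.

**Leakage from Repeated Measures:** All 1000 iterations reuse the same 28 rows from a single random draw of X. This is not leakage in the strict sense, since no test row is used to fit the model that predicts it, but the iterations are far from independent. Averaging over them removes split-to-split noise without removing whatever chance structure that one draw of X happens to contain.

**Small Sample Size:** With only 28 rows (14 per class), the AUC itself has high variance. Scores that carry no information at all can still land well away from 0.5 in any single experiment.

**SVM's Sensitivity to Small Sample Sizes:** SVMs can be sensitive to the size and dimensionality of the data. Here each model is fit on 14 rows with 100 features, so it can easily pick up spurious patterns that do not generalize.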
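The complementary imbalance between training and test sets can be checked without fitting any model. The following is a minimal sketch, assuming the same 28 alternating labels and the same unstratified 14-row training draw as in the question; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = np.array([1, 2] * 14)
n_iter = 1000

train_ones = np.empty(n_iter, dtype=int)
test_ones = np.empty(n_iter, dtype=int)
for i in range(n_iter):
    # Unstratified split: 14 of the 28 rows go to training, the rest to test
    perm = rng.permutation(len(Y))
    train, test = perm[:14], perm[14:]
    train_ones[i] = np.sum(Y[train] == 1)
    test_ones[i] = np.sum(Y[test] == 1)

# Each class-1 row lands in exactly one of the two sets,
# so a surplus in training is always a deficit in test.
print("share of splits with unequal class counts in training:",
      np.mean(train_ones != 7))
print("correlation between train and test class-1 counts:",
      np.corrcoef(train_ones, test_ones)[0, 1])
```

If most splits turn out to be imbalanced, any majority-class tendency in the fitted model would work against the test labels in most iterations.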

Here are a few steps to diagnose and potentially mitigate the issue:

1. Check Class Distribution in Train/Test Splits
   Count the classes in each training and test set to see how often, and how strongly, the splits are imbalanced.

2. Use Stratified Sampling
   Stratified sampling gives every training and test set the same class proportions as the full data, which removes the imbalance at its source.

3. Increase Sample Size or Decrease Dimensionality
   If possible, increase the number of samples or reduce the number of features to obtain a more stable estimate of model performance.

4. Review Model Outputs
   Inspect the predictions directly to see whether they are consistent with random guessing.
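To judge whether an AUC is "consistent with random guessing", it helps to know how far from 0.5 pure noise can land with 14 rows per class. The sketch below compares the standard deviation implied by the Mann-Whitney U statistic, $\sqrt{(n_1 + n_2 + 1) / (12\, n_1 n_2)}$, with a simulation using uninformative scores. It assumes continuous scores that are independent across rows, which the averaged cross-validation predictions are not, so it is only a rough yardstick.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
Y = np.array([1, 2] * 14)
n_pos = n_neg = 14

# Standard deviation of the AUC under the null, from the Mann-Whitney U variance
analytic_sd = np.sqrt((n_pos + n_neg + 1) / (12 * n_pos * n_neg))

# Simulated check: scores that carry no information about Y
null_aucs = np.array([
    roc_auc_score(Y == 2, rng.uniform(size=len(Y)))
    for _ in range(2000)
])

print("analytic SD:", analytic_sd)
print("simulated SD:", null_aucs.std())
print("share of null AUCs at least 0.2 away from 0.5:",
      np.mean(np.abs(null_aucs - 0.5) >= 0.2))
```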

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Generate random data
Y = np.array([1, 2] * 14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)

for i in range(n_iter):
    # Get train/test index ensuring stratified split
    train, test = train_test_split(np.arange(len(Y)), test_size=0.5, stratify=Y)

    # Ensure that both classes are present in the training set
    # (always true for a stratified split; kept as a safeguard)
    if len(np.unique(Y[train])) < 2:
        continue

    # Train model
    mod = SVC(probability=False)
    mod.fit(X[train, :], y=Y[train])

    # Predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])

# Compute the mean predictions while ignoring NaNs
ans = np.nanmean(ansMat, axis=1)

# Ensure there are no NaN values in the final predictions
if np.isnan(ans).any():
    # Fill NaN values with a neutral prediction (1.5, midway between labels 1 and 2)
    ans = np.nan_to_num(ans, nan=1.5)

# Calculate and print ROC AUC.
# The scores are mean predicted labels in {1, 2}, so higher scores point to class 2;
# with pos_label=1 an AUC of a corresponds to 1 - a with pos_label=2.
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
roc_auc = auc(fpr, tpr)
print("AUC:", roc_auc)
```

```
AUC: 0.2857142857142857
```
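A single AUC from a single random X, such as the value printed above, cannot separate a systematic bias from ordinary noise. One way to tell the two apart is to repeat the whole experiment on several independent draws of X, once with unstratified and once with stratified splits. The sketch below does this with a hypothetical helper, `cv_mean_prediction_auc`, and with fewer iterations than the original to keep it quick. It treats class 2 as the positive class, so its AUC values are mirrored around 0.5 relative to the `pos_label=1` convention used above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def cv_mean_prediction_auc(X, Y, n_iter, stratify, rng):
    """Hypothetical helper: AUC of the mean predicted label over repeated half splits."""
    n = len(Y)
    preds = np.full((n, n_iter), np.nan)
    for i in range(n_iter):
        if stratify:
            seed = int(rng.integers(2**31 - 1))
            train, test = train_test_split(
                np.arange(n), test_size=0.5, stratify=Y, random_state=seed
            )
        else:
            perm = rng.permutation(n)
            train, test = perm[: n // 2], perm[n // 2:]
        if len(np.unique(Y[train])) < 2:
            continue  # SVC needs both classes to fit
        preds[test, i] = SVC().fit(X[train], Y[train]).predict(X[test])
    score = np.nanmean(preds, axis=1)
    # A higher mean label means "more class 2", so class 2 is the positive class
    return roc_auc_score(Y == 2, score)


rng = np.random.default_rng(0)
Y = np.array([1, 2] * 14)
results = {"unstratified": [], "stratified": []}
for _ in range(10):
    X = rng.uniform(size=(len(Y), 100))  # fresh random data for each repetition
    results["unstratified"].append(cv_mean_prediction_auc(X, Y, 100, False, rng))
    results["stratified"].append(cv_mean_prediction_auc(X, Y, 100, True, rng))

for name, aucs in results.items():
    print(name, "mean AUC:", np.round(np.mean(aucs), 3),
          "SD:", np.round(np.std(aucs), 3))
```

If the imbalance explanation holds, one would expect the unstratified AUCs to sit mostly on one side of 0.5, while the stratified ones scatter around it.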


## Feedback

Was this exercise difficult or not? In either case, briefly describe why.