# Task 2: Why are the predictions too good (/bad)?

## Question

> I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
>
> Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
>
> Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
>
> Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.

```R
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
    #get train
    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
        next

    #test from train
    test=seq(length(Y))[-train]

    #train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)
    XXX=X[test,]
    predVec=predict(mod,XXX)
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)
```

Similarly, when I implement the same thing in Python I get similar results.



```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

# Binary target Y of length 28 (14 instances of each class)
Y = np.array([1, 2] * 14)
# Random feature matrix X with 100 features for each of the 28 instances
X = np.random.uniform(size=[len(Y), 100])
# Number of random train/test splits
n_iter = 1000

# Matrix holding the test-set predictions of each iteration
# (NaN where a row was not in the test set)
ansMat = np.full((len(Y), n_iter), np.nan)

for i in range(n_iter):
    # Draw a random training set (50% of the rows) without replacement
    train = np.random.choice(range(len(Y)), size=int(0.5 * len(Y)), replace=False, p=None)

    # Skip the iteration if the training set contains only one class
    if len(np.unique(Y[train])) == 1:
        continue

    # The test set is every row not in the training set
    test = np.array([j for j in range(len(Y)) if j not in train])

    # Train a support vector machine (SVM) classifier on the training set
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])

    # Predict the test rows and store the predicted labels
    ansMat[test, i] = mod.predict(X[test, :])

# Mean predicted label for each instance across all iterations
ans = np.nanmean(ansMat, axis=1)
# ROC curve and AUC (area under the curve) for the mean predictions
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
```

```
0.903061224489796
```


```python
Y
```

```
array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
       1, 2, 1, 2, 1, 2])
```

```python
X
```

```
array([[0.4813329 , 0.24041517, 0.3959143 , ..., 0.38659706, 0.32077143,
        0.7105974 ],
       [0.97122314, 0.5799872 , 0.04110099, ..., 0.34391983, 0.76650396,
        0.12696706],
       [0.81492051, 0.6308246 , 0.10334558, ..., 0.07782934, 0.28957969,
        0.36377419],
       ...,
       [0.69993134, 0.57577897, 0.72294761, ..., 0.74298805, 0.83345724,
        0.34659511],
       [0.93516953, 0.17298466, 0.90804303, ..., 0.89213749, 0.72322016,
        0.55808143],
       [0.27831134, 0.08327156, 0.00862988, ..., 0.69294783, 0.93725279,
        0.07454584]])
```

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

# Binary target Y of length 28 (14 instances of each class)
Y = np.array([1, 2] * 14)

# Random feature matrix X with 100 features for each of the 28 instances
X = np.random.uniform(size=[len(Y), 100])

# Number of random train/test splits
n_iter = 1000

# Matrix holding the predictions of each iteration
ansMat = np.full((len(Y), n_iter), np.nan)

for i in range(n_iter):
    # Draw a random training set (50% of the rows) without replacement
    train = np.random.choice(range(len(Y)), size=int(0.5 * len(Y)), replace=False, p=None)

    # Skip the iteration if the training set contains only one class
    if len(np.unique(Y[train])) == 1:
        continue

    # The test set is every row not in the training set (not used below)
    test = np.array([j for j in range(len(Y)) if j not in train])

    # Train a support vector machine (SVM) classifier on the training set
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])

    # Predict every row (training and test rows alike) in this iteration
    ansMat[:, i] = mod.predict(X)

# Compute the ROC curve and AUC (area under the curve) for each iteration
for i in range(n_iter):
    fpr, tpr, thresholds = roc_curve(Y, ansMat[:, i], pos_label=1)
    print(f"Iteration {i + 1} AUC: {auc(fpr, tpr)}")
```


```
Iteration 1 AUC: 0.35714285714285715
Iteration 2 AUC: 0.3214285714285714
Iteration 3 AUC: 0.3928571428571429
Iteration 4 AUC: 0.3928571428571429
Iteration 5 AUC: 0.3928571428571428
Iteration 6 AUC: 0.2857142857142857
Iteration 7 AUC: 0.2857142857142857
Iteration 8 AUC: 0.25000000000000006
Iteration 9 AUC: 0.2857142857142857
Iteration 10 AUC: 0.2857142857142857
Iteration 11 AUC: 0.32142857142857145
Iteration 12 AUC: 0.2857142857142857
Iteration 13 AUC: 0.2857142857142857
Iteration 14 AUC: 0.32142857142857145
Iteration 15 AUC: 0.2857142857142857
Iteration 16 AUC: 0.2857142857142857
Iteration 17 AUC: 0.2857142857142857
Iteration 18 AUC: 0.2857142857142857
Iteration 19 AUC: 0.2142857142857143
Iteration 20 AUC: 0.3214285714285714
Iteration 21 AUC: 0.3928571428571428
Iteration 22 AUC: 0.32142857142857145
Iteration 23 AUC: 0.3214285714285714
Iteration 24 AUC: 0.32142857142857145
Iteration 25 AUC: 0.2857142857142857
Iteration 26 AUC: 0.2857142857142857
Iteration 27 AUC: 0.32142857142857145
...
```

## Your answer

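Area under the curve (AUC) is the area under the ROC curve (receiver operating characteristic curve), a plot of the true positive rate against the false positive rate as the classification threshold varies. It is the integral of that curve over false positive rates from 0 to 1, and it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. For scores that are unrelated to the labels, its expected value is 0.5.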
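As a small illustration of this definition, the sketch below computes AUC by hand with the trapezoidal rule and compares it with scikit-learn's `roc_auc_score`. The labels and scores are made-up toy values, not data from the exercise.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical toy labels and scores, chosen only for illustration
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve: false positive rate (x) and true positive rate (y)
fpr, tpr, _ = roc_curve(y_true, y_score)

# Trapezoidal rule: width of each strip times its mean height
auc_trapezoid = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
auc_sklearn = float(roc_auc_score(y_true, y_score))

print(auc_trapezoid, auc_sklearn)  # both 0.75 for these toy values
```

Three of the four (positive, negative) pairs are ranked correctly here, which matches the area of 0.75.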

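However, the reported AUC does not purely reflect the model's performance. AUC measures how well a set of scores ranks the positive class above the negative class across all thresholds, and here the scores do not come from a single model: each row's score is the mean of the labels predicted for it over all iterations in which it fell in the test set. In each iteration the training set (`train`) is sampled at random without replacement, and the remaining instances form the test set.
AUC is less sensitive to class imbalance than accuracy is, but how the data is split can still influence the observed AUC values. For example, whenever a row is held out, the other 27 rows contain 13 of its own class and 14 of the other, so the training half contains slightly fewer rows of the held-out row's class on average. If the classifier leans toward the majority class of its training set, as it may when the features are pure noise, the averaged votes would be systematically related to the true labels even though X carries no signal.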
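The orientation of the scores also matters when reading an AUC. In the question's Python code, `ans` is a mean of predicted labels 1 and 2, so a larger value means more votes for class 2, while `roc_curve` is called with `pos_label=1`. The sketch below uses made-up scores (the arrays and the `auc_for` helper are illustrative assumptions, not part of the exercise) to show that switching `pos_label` turns an AUC of a into 1 - a.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical averaged predictions: each value is a mean of labels in {1, 2}
y_true = np.array([1, 2, 1, 2, 1, 2])
mean_pred = np.array([1.8, 1.3, 1.6, 1.2, 1.7, 1.65])

def auc_for(pos_label):
    """AUC when `pos_label` is the class that high scores should indicate."""
    fpr, tpr, _ = roc_curve(y_true, mean_pred, pos_label=pos_label)
    return float(auc(fpr, tpr))

auc_pos1 = auc_for(1)  # high score read as evidence for class 1
auc_pos2 = auc_for(2)  # high score read as evidence for class 2

print(auc_pos1, auc_pos2, auc_pos1 + auc_pos2)
```

A value far above 0.5 under one orientation is equally far below 0.5 under the other, so it is worth checking which class the scores actually favour before calling a result "too good".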

In each iteration, a different random half of the data is used as the training set, and performance is evaluated on the remaining half (the test set). This process is repeated `n_iter` times. The mean prediction for each row across all iterations is then used to compute a single ROC curve and AUC.
This is not a strict leave-group-out cross-validation: no groups are explicitly defined, and the sampling is random with no consideration for grouping. Because the same rows are reused across many overlapping splits, the iterations are not independent of one another, which may contribute to the high AUC.
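One way to check whether the averaging step matters is to score each split separately instead of pooling averaged predictions. The sketch below is a variation on the exercise code under stated assumptions, not the original procedure: it uses `StratifiedShuffleSplit` (so every training half has 7 rows per class), a fixed seed, 200 splits, and `decision_function` scores instead of hard labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
Y = np.array([1, 2] * 14)
X = rng.uniform(size=(len(Y), 100))

# Stratified random halves: 14 training rows and 14 test rows, 7 per class each
splitter = StratifiedShuffleSplit(n_splits=200, test_size=0.5, random_state=0)

split_aucs = []
for train, test in splitter.split(X, Y):
    mod = SVC().fit(X[train], Y[train])
    # For a binary SVC, positive decision values favour the second class (2)
    scores = mod.decision_function(X[test])
    split_aucs.append(roc_auc_score((Y[test] == 2).astype(int), scores))

split_aucs = np.array(split_aucs)
print(f"mean AUC {split_aucs.mean():.3f}, std {split_aucs.std():.3f}")
```

Looking at the distribution of per-split AUCs, rather than one AUC of averaged predictions, makes it easier to see how much of the result comes from the evaluation procedure.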

1. Randomness and size: 28 instances with 100 features are generated at random, with the labels 1 and 2 occurring 14 times each. The classes are balanced overall, but the random training halves are not stratified and can be imbalanced, and with so few rows random data does not necessarily give a 50/50 relation between the true positive rate and the false positive rate.
2. Classifier sensitivity to data split: In binary classification, sensitivity to the data split refers to how the performance of a model varies when different subsets of the data are used for training and testing. This sensitivity is particularly relevant for a small or imbalanced dataset such as the 14-row training sets used here.
3. Overfitting or underfitting: Depending on the complexity of the model and the size of the training set, different splits can lead to overfitting or underfitting. A small training set with many features (14 rows, 100 columns) invites overfitting, where the model memorizes noise instead of learning general patterns.
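To get a feel for the first point, the sketch below estimates how much AUC varies when the scores are pure noise, at 28 rows and at 2800 rows. The helper `null_auc_spread`, the seed, and the row counts are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility

def null_auc_spread(n_rows, n_repeats=500):
    """Standard deviation of AUC when scores are unrelated to the labels."""
    y = np.array([0, 1] * (n_rows // 2))  # balanced labels, as in the exercise
    aucs = [roc_auc_score(y, rng.uniform(size=n_rows)) for _ in range(n_repeats)]
    return float(np.std(aucs))

spread_small = null_auc_spread(28)
spread_large = null_auc_spread(2800)

print(f"std of AUC with 28 rows:   {spread_small:.3f}")
print(f"std of AUC with 2800 rows: {spread_large:.3f}")
```

Under these assumptions the spread at 28 rows should be roughly an order of magnitude larger than at 2800 rows, so a single AUC well away from 0.5 is less surprising for such a small dataset.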


## Feedback

Was this exercise difficult or not? In either case, briefly describe why.

It was a thought provoking exercise. It required extensive research.