# Task 2: Why are the predictions too good (/bad)?

## Question

### Proposition of solution

Below, we propose code to tackle the problem described in the following question. The proposed fix modifies the original code to use stratified sampling via the `train_test_split` function from scikit-learn.

> I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
>
> Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
>
> Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
>
> Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mediate the problem, but I am looking for other issues.

```R
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){    
    #get train

    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
    next

    #test from train
    test=seq(length(Y))[-train]

    #train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)
    XXX=X[test,]
    predVec=predict(mod,XXX)
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)
```

Similarly, when I implement the same thing in Python I get similar results.



```python
# Import the libraries that the code below needs.
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.svm import SVC
```

```python
Y = np.array([1, 2] * 14)  # 1-D array repeating [1, 2] 14 times, so Y has 2 * 14 = 28 elements
X = np.random.uniform(size=[len(Y), 100])  # uniformly distributed random features, shape (28, 100)
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)  # array of shape (28, 1000) filled with NaN
for i in range(n_iter):
    # Get train/test index: draw half of the row indices at random, without replacement
    train = np.random.choice(range(len(Y)), size=int(0.5 * len(Y)), replace=False, p=None)
    # Skip the iteration if the training set contains only one class
    # (checks Y[train], matching the R version; the original checked Y, which never triggers)
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)  # mean predicted label per sample across iterations
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
```

```
0.8775510204081631
```


## Answer

The AUC-ROC being significantly higher than 0.5 on completely random data can be attributed to the way the train-test splits are generated. Let's take a closer look at the line that performs the train/test split:

```python
train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
```

In this line, we randomly select the indices of the training set using `np.random.choice`. The probability parameter `p` is set to `None`, which means every index of the response `Y` has an equal chance of being selected. However, this does not take the class distribution of the data into account.

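Because the split ignores the labels, the training set can contain a disproportionate number of instances from one class. Since the test set is the complement of the training set, it is then skewed toward the other class. On pure noise, a classifier tends to fall back on the majority class of its training set, so its test predictions become systematically anti-correlated with the true labels. Averaged over many iterations, this bias pushes the AUC-ROC away from 0.5, which here shows up as an inflated value.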
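The mirrored imbalance between training and test sets can be checked without fitting any model. The following is a minimal sketch, assuming the same 28 labels and the same 50/50 random split as in the question; the variable names `train_ones` and `test_ones` and the fixed seed are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
Y = np.array([1, 2] * 14)       # 14 samples of each class, as in the question
n_iter = 1000

train_ones = np.empty(n_iter, dtype=int)
test_ones = np.empty(n_iter, dtype=int)
for i in range(n_iter):
    # Unstratified split: half of the indices, drawn without looking at the labels
    train = rng.choice(len(Y), size=len(Y) // 2, replace=False)
    test = np.setdiff1d(np.arange(len(Y)), train)
    train_ones[i] = np.sum(Y[train] == 1)
    test_ones[i] = np.sum(Y[test] == 1)

# A balanced training set would hold exactly 7 samples of class 1
imbalanced = train_ones != 7
print("share of imbalanced training sets:", imbalanced.mean())
# Whenever class 1 is over-represented in train, it is under-represented in test
print("imbalance always mirrored:", np.all((train_ones > 7) == (test_ones < 7)))
```

Since every sample lands in exactly one of the two sets, the class-1 counts of the training and test sets always add up to 14, so any excess in one is a deficit in the other.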

To address this issue, you can modify your code to ensure a balanced distribution of classes in the training set. One approach is to use stratified sampling, which ensures that the proportion of each class is preserved in the training set.

Below is my proposed code to address this issue.


```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

Y = np.array([1, 2] * 14)
X = np.random.uniform(size=[len(Y), 100])

# Split the data into train and test sets using stratified sampling
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, stratify=Y, random_state=42)

# Train the model on the stratified training set and predict on the test set
mod = SVC(probability=False)
mod.fit(X=X_train, y=Y_train)
Y_pred = mod.predict(X_test)

# Calculate AUC-ROC on the stratified test set
fpr, tpr, thresholds = roc_curve(Y_test, Y_pred, pos_label=1)
print(auc(fpr, tpr))
```

```
0.5
```


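By using `train_test_split` with the `stratify` parameter set to `Y`, the function preserves the class proportions of `Y` in both sets, so here the training and test sets each contain 7 samples per class. Note that this code evaluates a single split rather than averaging predictions over many iterations. With balanced splits, the AUC-ROC should be more reliable, even with a small dataset.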
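To stay closer to the original procedure, which averages test predictions over many random splits, the stratified split can be repeated. The following is a minimal sketch, assuming scikit-learn's `StratifiedShuffleSplit` is an acceptable stand-in for the repeated `np.random.choice` draw; the function name `repeated_stratified_auc`, the seed, and the reduced iteration count are illustrative choices, not part of the original code.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC


def repeated_stratified_auc(n_iter=200, seed=0):
    """Average test predictions over repeated stratified 50/50 splits."""
    rng = np.random.default_rng(seed)
    Y = np.array([1, 2] * 14)
    X = rng.uniform(size=(len(Y), 100))  # random features, unrelated to Y

    preds = np.full((len(Y), n_iter), np.nan)
    train_counts = []
    splitter = StratifiedShuffleSplit(n_splits=n_iter, test_size=0.5, random_state=seed)
    for i, (train, test) in enumerate(splitter.split(X, Y)):
        # Record how many samples of class 1 and class 2 are in the training set
        train_counts.append(np.bincount(Y[train], minlength=3)[1:])
        mod = SVC(probability=False)
        mod.fit(X[train], Y[train])
        preds[test, i] = mod.predict(X[test])

    # Mean predicted label per sample, ignoring iterations where it was in train
    mean_pred = np.nanmean(preds, axis=1)
    fpr, tpr, _ = roc_curve(Y, mean_pred, pos_label=1)
    return auc(fpr, tpr), np.array(train_counts)


auc_value, train_counts = repeated_stratified_auc()
print("AUC:", auc_value)
print("every training set balanced:", bool((train_counts == 7).all()))
```

Every training set now holds 7 samples per class, which removes the systematic majority-class bias. With only 28 samples, the AUC for any single random `X` can still deviate from 0.5 by chance.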

## Feedback

Was this exercise difficult or not? In either case, briefly describe why.

This exercise was a bit difficult. Debugging someone else's code is always a difficult task. After going through each line of the code and understanding it, I finally identified what the problem could be and proposed my own suggestion.