### What is Cross-Validation?

Cross-validation is a technique used in machine learning to assess the performance and generalization ability of a model. It involves partitioning a dataset into subsets, training the model on some subsets, and evaluating its performance on the remaining subset.

The main goal of cross-validation is to estimate how well a trained model will perform on unseen data. It helps to detect issues such as overfitting or underfitting and provides a more robust evaluation of the model's performance.

*Here's a general overview of the steps involved in cross-validation:*

**Splitting the data**: The dataset is divided into two or more subsets, usually referred to as the training set and the validation set (or test set). The training set is used to fit the model, while the validation set is held back and used only to evaluate its performance.

**Training the model**: The model is trained on the training set using a specific algorithm or technique. The model learns from the input features and their corresponding labels in the training set.

**Evaluating the model**: The trained model is then used to make predictions on the validation set. The predictions are compared with the actual labels in the validation set, and various evaluation metrics (such as accuracy, precision, recall, etc.) are calculated to assess the model's performance.

**Repeating the process**: The above steps are repeated multiple times, with different subsets of the data used as the validation set each time. This allows for a more comprehensive assessment of the model's performance by considering different subsets of the data for training and validation.

**Performance aggregation**: The performance metrics obtained from each iteration of the cross-validation process are usually averaged or combined to get an overall performance measure for the model.
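The five steps above can be sketched without any library. This is a minimal illustration, not how scikit-learn implements cross-validation: the helper names (`k_fold_indices`, `fit_majority`, `cross_validate`) and the toy labels are assumptions made for this sketch, and the "model" is a trivial baseline that always predicts the most frequent training label.

```python
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_majority(labels):
    """'Train' a baseline model: remember the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(y, k=5, seed=0):
    """Return one accuracy score per fold for the majority-class baseline."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Splitting: fold i is the validation set, all other folds are training data.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Training: fit the model on the training labels only.
        predicted_label = fit_majority([y[j] for j in train_idx])
        # Evaluating: accuracy on the held-out fold.
        correct = sum(1 for j in val_idx if y[j] == predicted_label)
        scores.append(correct / len(val_idx))
    # Repeating happens in the loop; aggregation is left to the caller.
    return scores


# Hypothetical toy labels (60 of one class, 40 of the other).
y = ["B"] * 60 + ["M"] * 40
scores = cross_validate(y, k=5)
print(scores)        # one score per fold
print(mean(scores))  # aggregated performance
```

Any real estimator can replace `fit_majority`; the important part is that the validation fold is never seen during training.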

### Importing the Libraries

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
from warnings import filterwarnings
filterwarnings('ignore')

### Importing the dataset

In [3]:
df = pd.read_csv('cancer_dataset.csv')
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [4]:
# checking the shape of the dataset
df.shape

(569, 33)

In [5]:
# checking if the dataset is balanced or not
df.diagnosis.value_counts()

diagnosis
B    357
M    212
Name: count, dtype: int64

In [6]:
df.diagnosis.value_counts(normalize=True)*100

diagnosis
B    62.741652
M    37.258348
Name: proportion, dtype: float64

### Checking for Duplicated records

In [7]:
df[df.duplicated()]

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


### Checking for Null values

In [8]:
df.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

The `Unnamed: 32` column contains only null values, so we drop it. We also drop `id`, because it is just a unique identifier for each record and carries no predictive information.

In [9]:
df.drop(['id','Unnamed: 32'],axis=1,inplace=True)

In [10]:
df.head()

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


### Splitting into Independent and Dependent Variables

In [11]:
X = df.drop('diagnosis',axis=1)
y = df['diagnosis']

In [12]:
X.head()

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [13]:
y.head()

0    M
1    M
2    M
3    M
4    M
Name: diagnosis, dtype: object

### 1. Leave-One-Out Cross Validation (LOOCV)

Leave-One-Out Cross Validation (LOOCV): Each sample (observation) is used as the validation set exactly once, while all remaining samples are used for training. With n samples, the model is therefore trained and evaluated n times.

In [14]:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
leave_validation=LeaveOneOut()
results=cross_val_score(LR,X,y,cv=leave_validation)

In [15]:
X.shape

(569, 30)

In [16]:
len(results)

569

In [17]:
print(np.mean(results))

0.945518453427065


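Leave-one-out cross-validation fits the model once per sample (569 times here), so it becomes computationally expensive on larger datasets. Because each validation set contains a single observation, the resulting performance estimate can also have high variance.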
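To see why `results` holds one entry per sample, and why each entry can only be 0 or 1, here is a dependency-free sketch of the leave-one-out loop. The one-feature toy data and the helpers `leave_one_out` and `predict_1nn` are hypothetical and chosen only for illustration; a 1-nearest-neighbour rule stands in for the logistic regression used above.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_index) pairs, one pair per sample."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, i


def predict_1nn(train_X, train_y, x):
    """Predict the label of the closest training point (1-nearest neighbour)."""
    nearest = min(range(len(train_X)), key=lambda j: abs(train_X[j] - x))
    return train_y[nearest]


# Hypothetical toy data: small values are "B", large values are "M".
X = [1.0, 1.2, 1.9, 2.1, 4.8, 5.0, 5.3, 6.1]
y = ["B", "B", "B", "B", "M", "M", "M", "M"]

results = []
for train_idx, test_idx in leave_one_out(len(X)):
    pred = predict_1nn(
        [X[j] for j in train_idx],
        [y[j] for j in train_idx],
        X[test_idx],
    )
    # A single held-out sample is either right (1.0) or wrong (0.0).
    results.append(1.0 if pred == y[test_idx] else 0.0)

print(len(results))                 # one score per sample
print(sum(results) / len(results))  # fraction of samples predicted correctly
```

The mean of such 0/1 scores is simply the fraction of correctly predicted samples; the value of about 0.9455 reported above is consistent with 538 of 569 samples being predicted correctly.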

### 2. Holdout Validation Approach: Train and Test Split

Holdout Validation: The dataset is split once into two parts, a training set and a separate test set. The model is trained on the training set and evaluated on the test set. Because there is only a single split, the score depends on which samples happen to land in the test set.

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)
model = LogisticRegression()
model.fit(X_train, y_train)
result = model.score(X_test, y_test)
print(result)

0.9473684210526315
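The split sizes behind this score can be worked out by hand. The sketch below assumes the test size is rounded up, which is what `train_test_split` does for a fractional `test_size`; the helper `holdout_split` is hypothetical, and its shuffle will not reproduce the exact rows scikit-learn selects for `random_state=10`.

```python
import math
import random


def holdout_split(n_samples, test_size=0.30, seed=10):
    """Shuffle the indices once and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # test size is rounded up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(569)
print(len(train_idx), len(test_idx))

# No sample may appear on both sides of the split.
print(set(train_idx).isdisjoint(test_idx))
```

With 171 test samples, the accuracy reported above is consistent with 162 correct predictions (162 / 171).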


### 3. K Fold Cross Validation

K-Fold Cross Validation: The dataset is divided into 'k' folds of (approximately) equal size. The model is trained and evaluated 'k' times, each time using a different fold as the validation set and the remaining k-1 folds for training.

In [19]:
from sklearn.model_selection import KFold

model=LogisticRegression()
kfold_validation=KFold(10)

results=cross_val_score(model,X,y,cv=kfold_validation)
print(results)
print(np.mean(results))

[0.84210526 0.94736842 0.94736842 0.9122807  0.98245614 0.96491228
 0.96491228 0.96491228 0.92982456 0.96428571]
0.9420426065162907
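Since 569 is not divisible by 10, the folds cannot all be the same size. A small sketch of the arithmetic follows; the helper `kfold_sizes` is hypothetical and mirrors the rule that the first `n_samples % n_splits` folds receive one extra sample. The scores are copied from the output above, and the standard deviation is added as a rough measure of how much they vary between folds.

```python
from statistics import mean, stdev


def kfold_sizes(n_samples, n_splits):
    """Fold sizes when n_samples is spread as evenly as possible."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = kfold_sizes(569, 10)
print(sizes)  # nine folds of 57 samples and one of 56

# Per-fold accuracies copied from the output above.
scores = [0.84210526, 0.94736842, 0.94736842, 0.9122807, 0.98245614,
          0.96491228, 0.96491228, 0.96491228, 0.92982456, 0.96428571]
print(mean(scores))
print(stdev(scores))
```

The fold sizes explain why the last score (0.96428571, consistent with 54 / 56) has a different denominator from the others. Note also that `KFold(10)` does not shuffle by default, so each fold is a contiguous block of rows in the original file order.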


### 4. Stratified K-fold Cross Validation

Stratified K-Fold Cross Validation: Similar to K-Fold, but it ensures that each fold has approximately the same proportion of samples from each class, which is useful for imbalanced datasets.

In [20]:
from sklearn.model_selection import StratifiedKFold
skfold=StratifiedKFold(n_splits=5)
model=LogisticRegression()
scores=cross_val_score(model,X,y,cv=skfold)
print(np.mean(scores))

0.947290793355069
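The idea of stratification can be illustrated by dealing the indices of each class into the folds separately. This is a simplified sketch and not scikit-learn's exact allocation algorithm; `stratified_folds` is a hypothetical helper, and the labels are rebuilt from the class counts shown earlier (357 `B`, 212 `M`).

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Deal the indices of each class round-robin into n_splits folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


labels = ["B"] * 357 + ["M"] * 212
folds = stratified_folds(labels, 5)

for fold in folds:
    counts = Counter(labels[i] for i in fold)
    share_m = 100 * counts["M"] / len(fold)
    print(len(fold), counts["B"], counts["M"], round(share_m, 1))
```

Every fold ends up with roughly 37% malignant samples, matching the proportion in the full dataset, whereas an unshuffled plain k-fold split gives no such guarantee.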


### 5. Repeated Random Test-Train Splits
This technique combines traditional train-test splitting with the repetition used in k-fold cross-validation. We create a random train/test split of the data, evaluate the model on it, and then repeat the process multiple times with a fresh random split each time. Unlike k-fold, the test sets of different repetitions can overlap.

In [24]:
from sklearn.model_selection import ShuffleSplit
model=LogisticRegression()
ssplit=ShuffleSplit(n_splits=10,test_size=0.30)
results=cross_val_score(model,X,y,cv=ssplit)
print(results)
print(np.mean(results))

[0.95906433 0.96491228 0.92397661 0.90643275 0.95906433 0.92397661
 0.95321637 0.97076023 0.95906433 0.93567251]
0.9456140350877194
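One consequence of drawing every split independently is that some samples are tested several times while others may never be tested. The sketch below counts this for ten 70/30 splits of 569 samples; `shuffle_splits` is a hypothetical stand-in for `ShuffleSplit`, the seed is arbitrary, and the test size is assumed to be rounded up.

```python
import math
import random
from collections import Counter


def shuffle_splits(n_samples, n_splits=10, test_size=0.30, seed=0):
    """Yield n_splits independent random (train, test) index splits."""
    rng = random.Random(seed)
    n_test = math.ceil(test_size * n_samples)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


splits = list(shuffle_splits(569))

# How often does each sample land in a test set across all repetitions?
times_tested = Counter(i for _, test in splits for i in test)
never_tested = 569 - len(times_tested)

print(len(splits[0][1]))           # test-set size of a single split
print(max(times_tested.values()))  # most frequently tested sample
print(never_tested)                # samples that were never in a test set
```

Fixing a seed, as done here, also makes the splits repeatable; the `ShuffleSplit` call above sets no `random_state`, so its scores will differ from run to run.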
