# KFold Cross Validation

#### Importing the dependencies

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

##### Importing the models

In [2]:
# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

##### Data Collection & Preprocessing

In [3]:
data = pd.read_csv(r"D:\ai_ds-General\dataset\heart_v1.csv")
data.head()

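|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |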


In [4]:
data.shape

(303, 14)

In [5]:
# Checking Missing Values (whole data)
data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [7]:
# Checking the distribution of the target variable (balanced or imbalanced); 1: Defective Heart, 0: Healthy Heart
data['target'].value_counts()

target
1    165
0    138
Name: count, dtype: int64

### Splitting the features & targets 

In [9]:
X = data.drop(columns=['target'])
y = data['target']

In [10]:
X.head()

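|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |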


In [11]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [16]:
# initialize train test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=.2, random_state=3)

# check shape
print(X.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(303, 13) (242, 13) (61, 13) (242,) (61,)


In [18]:
242/303

0.7986798679867987

# Modeling

#### Compare the performance of the models

In [21]:
# list of models
models = [LogisticRegression(max_iter=1000), SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier()]

In [23]:
# compare models using a single train/test split
def compare_models_train_test():

    # train each model individually
    for model in models:

        # train the model on the training split
        model.fit(X_train, y_train)

        # predict on the held-out test split
        test_data_prediction = model.predict(X_test)

        # accuracy on the test split
        accuracy = accuracy_score(y_test, test_data_prediction)

        # print results
        print('Accuracy Score of the', model, ' = ', accuracy)

In [24]:
# calling function
compare_models_train_test()

Accuracy Score of the LogisticRegression(max_iter=1000)  =  0.7704918032786885
Accuracy Score of the SVC(kernel='linear')  =  0.7704918032786885
Accuracy Score of the KNeighborsClassifier()  =  0.6557377049180327
Accuracy Score of the RandomForestClassifier()  =  0.7868852459016393
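
The accuracies above come from a single split, so they depend on which 61 rows happened to land in the test set. K-fold cross-validation averages over several splits instead. The sketch below shows one way to do that with `cross_val_score`. It runs on a synthetic stand-in built with `make_classification` (an assumption, since the heart CSV is not bundled here), and the names `X_demo`, `y_demo`, `demo_models`, and `cv_results` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same shape as the heart data (assumption)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=3)

# 5 folds: every row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=3)

demo_models = [LogisticRegression(max_iter=1000), KNeighborsClassifier()]
cv_results = {}
for model in demo_models:
    # one accuracy value per fold
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[type(model).__name__] = scores
    print(type(model).__name__, "mean accuracy =", round(scores.mean(), 3))
```

With the real data, passing `X` and `y` in place of the synthetic arrays should give a per-model mean that is less sensitive to one lucky or unlucky split.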


<center> <h2> Notes </h2> </center>

## 1.0 Imbalanced & Balanced Datasets

A balanced data set is one where the classes are approximately equally represented. An imbalanced data set is one where there is a significant difference between the number of instances of each class.

In this data set, the target variable has two classes: 1 and 0. Class 1 has 165 instances, while class 0 has 138 instances. The difference between the two classes is not very large, so this data set can be considered balanced.

However, if the difference between the classes were much larger, such as 165 vs 15, the data set would be imbalanced. This can cause problems for some machine learning algorithms, as they may learn to favor the majority class and ignore the minority class. In such cases, you might need a technique to balance the data set, such as over-sampling, under-sampling, or synthetic data generation.

#### Q1. What is the ideal ratio for considering data balanced vs imbalanced?

There is no definitive answer to what constitutes a balanced or imbalanced data set, as it may depend on the context, the problem, and the algorithm you are using. However, some general guidelines are:

A balanced data set is one where the classes are approximately equally represented. For example, a 50:50 or 60:40 split between two classes would be considered balanced.

An imbalanced data set is one where there is a significant difference between the number of instances of each class. For example, a 10:90 or 1:99 split between two classes would be considered imbalanced.

The degree of imbalance can be mild, moderate, or extreme, depending on the proportion of the minority class. For example, a binary classification problem with a minority class of 20-40% of the data set would be mildly imbalanced, while one with a minority class of less than 1% of the data set would be extremely imbalanced.
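
As a rough illustration of these guidelines, the sketch below computes the minority-class share and maps it to a label. The helper names `minority_share` and `imbalance_degree` are hypothetical. The 1-20% band for "moderate" is an assumption that fills the gap between the two ranges stated above, and all cut-offs are rules of thumb rather than fixed standards.

```python
from collections import Counter

def minority_share(labels):
    """Fraction of samples that belong to the least frequent class."""
    counts = Counter(labels)
    return min(counts.values()) / sum(counts.values())

def imbalance_degree(share):
    """Map a minority share to a rough label (illustrative thresholds)."""
    if share > 0.40:
        return "balanced"
    if share >= 0.20:
        return "mild"
    if share >= 0.01:   # assumed band for "moderate"
        return "moderate"
    return "extreme"

# Class counts from this notebook: 165 ones, 138 zeros
labels = [1] * 165 + [0] * 138
share = minority_share(labels)
print(round(share, 3), imbalance_degree(share))  # 0.455 balanced
```

The hypothetical 165 vs 15 case mentioned earlier gives a minority share of about 8%, which this sketch labels "moderate".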

#### Q2. Explain Over Sampling Technique

Over-sampling is a technique for dealing with imbalanced data sets, where the number of instances of one class is much lower than that of the other classes. Over-sampling involves randomly duplicating examples from the minority class, with or without replacement, and adding them to the training data set. This way, the class distribution becomes more balanced and the learning algorithm can better capture the characteristics of the minority class.

However, over-sampling also has some drawbacks, such as increased computational cost and a higher risk of overfitting. Overfitting occurs when the model learns noise or specific patterns of the minority class that do not generalize to new data. To reduce overfitting, some variations of over-sampling have been proposed, such as the Synthetic Minority Oversampling Technique (SMOTE), which creates new synthetic examples from the minority class instead of simply replicating them.
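
A minimal sketch of random over-sampling, written with NumPy only. The function name `random_oversample` and the toy arrays are hypothetical, and the 165 vs 15 counts reuse the imbalanced example from earlier in these notes.

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate minority rows (sampled with replacement) until all classes match the largest."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # draw the missing rows from the existing rows of this class
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy imbalanced data: 165 samples of class 1, 15 of class 0
X_small = np.arange(180 * 2).reshape(180, 2)
y_small = np.array([1] * 165 + [0] * 15)

X_over, y_over = random_oversample(X_small, y_small)
print(np.bincount(y_over))  # [165 165]
```

In practice, resampling is usually applied to the training split only, so that duplicated rows do not leak into the test set.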

#### Q3. Explain Under Sampling Technique

Under-sampling is a technique for dealing with imbalanced data sets, where the number of instances of one class is much lower than that of the other classes. Under-sampling involves randomly removing examples from the majority class, which reduces the size of the training data set. This way, the class distribution becomes more balanced and the learning algorithm can better capture the characteristics of the minority class.

However, under-sampling also has some drawbacks, such as losing potentially useful information and increasing the risk of underfitting. Underfitting occurs when the model fails to learn the general patterns of the data and performs poorly on new data. To reduce this risk, some variations of under-sampling have been proposed, such as Tomek Links, Edited Nearest Neighbors, and One-Sided Selection, which remove only the noisy or borderline examples from the majority class.
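
A minimal sketch of random under-sampling with NumPy. The function name `random_undersample` and the toy arrays are hypothetical. This is the plain random variant, not Tomek Links or the other selective methods named above.

```python
import numpy as np

def random_undersample(X, y, random_state=0):
    """Keep a random subset of each class so all classes match the smallest one."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([
        # sample without replacement so no row is kept twice
        rng.choice(np.flatnonzero(y == cls), size=target, replace=False)
        for cls in classes
    ])
    return X[keep], y[keep]

# Toy imbalanced data: 165 samples of class 1, 15 of class 0
X_small = np.arange(180 * 2).reshape(180, 2)
y_small = np.array([1] * 165 + [0] * 15)

X_under, y_under = random_undersample(X_small, y_small)
print(np.bincount(y_under))  # [15 15]
```

Here 150 of the 165 majority rows are discarded, which makes the information loss described above concrete.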

#### Q4. Explain the synthetic data concept for handling imbalanced datasets

Synthetic data generation is a technique for handling imbalanced data sets, where the number of instances of one class is much lower than that of the other classes. It involves creating new examples of the minority class that are not present in the original data set but are similar enough to represent its characteristics. This way, the class distribution becomes more balanced and the learning algorithm can better capture the features of the minority class.

There are different methods for generating synthetic data, such as SMOTE (Synthetic Minority Oversampling Technique), MBS (Model-Based Synthetic Sampling), and SYNAuG (Synthetic Augmentation). These methods use different approaches, such as interpolation, modeling, or augmentation, to create new synthetic examples from the minority class.

Synthetic data can help overcome some of the drawbacks of traditional sampling techniques, such as data loss, data duplication, or overfitting. However, synthetic data also has its own challenges, such as ensuring the quality, validity, and diversity of the generated data, and avoiding the introduction of unwanted biases or noise.
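
To make the interpolation idea concrete, here is a simplified SMOTE-style sketch in NumPy. It is not the reference SMOTE implementation (libraries such as imbalanced-learn provide one); the function name `smote_like`, the default `k=5`, and the random toy data are all assumptions for illustration.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, random_state=0):
    """Create n_new points, each on the segment between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(random_state)
    n = len(X_min)
    # pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)        # random starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 15 points in 2 dimensions
X_minority = np.random.default_rng(1).normal(size=(15, 2))
synthetic = smote_like(X_minority, n_new=150)
print(synthetic.shape)  # (150, 2)
```

Because every new point is a convex combination of two real minority samples, the synthetic data stays inside the region the minority class already occupies. That limits noise but also limits diversity.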

## 2.0 Rules for Train Test Split

- Step 1: Import the function: `from sklearn.model_selection import train_test_split`
- Step 2: Separate X and y first
- Step 3: Initialize the train/test split by passing X, y, test_size, stratify=y (optional), and random_state
- Step 4: Check and verify the shapes: X_train and y_train must have the same number of rows, and so must X_test and y_test

##### Q1 Explain the concept of `stratify=y` in Train Test Split, y (dependent variable)

The stratify parameter in the train_test_split function splits the data in a stratified fashion, meaning that the proportion of classes in the original data is preserved in the train and test sets. For example, if you have a binary classification problem with 60% of the data belonging to class 0 and 40% to class 1, setting stratify=y ensures that the train and test sets have approximately the same ratio of 0s and 1s. This keeps the test set representative of the class distribution, so the evaluation is not skewed by an unlucky split.
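
A small check of this behavior, using labels with the same class counts as this notebook (165 ones, 138 zeros). The arrays `X_demo` and `y_demo` are placeholders rather than the real heart data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the notebook's class counts; features are dummy row ids
y_demo = np.array([1] * 165 + [0] * 138)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=3
)

# The share of class 1 should be nearly identical in all three sets
print("full :", round(y_demo.mean(), 3))
print("train:", round(y_tr.mean(), 3))
print("test :", round(y_te.mean(), 3))
```

Without `stratify`, the test-set share can drift further from the overall share, especially on small data sets like this one.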


## Models

#### Q6 What is meant by `kernel='linear'`?

The kernel parameter of SVC selects the kernel function the model uses to compare data points, which in turn determines the shape of the decision boundary that separates the classes. With kernel='linear', the boundary is a hyperplane: a straight line in two dimensions, a plane in three, and so on. A linear kernel suits data that is linearly separable or close to it. Because SVC uses a soft margin (controlled by the C parameter), it can still tolerate some points on the wrong side of the boundary.
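
The sketch below illustrates what "linear" means in practice: with a linear kernel, the fitted model exposes a weight vector `coef_` and an `intercept_`, and its decision function is just `w . x + b`. The toy data from `make_blobs` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two toy clusters in 2-D (stand-in data, not the heart data set)
X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=3)

clf = SVC(kernel="linear").fit(X_demo, y_demo)

# For a linear kernel the boundary is the set of points where w . x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
manual_scores = X_demo @ w + b

# The hand-computed scores match the model's own decision function
print(np.allclose(manual_scores, clf.decision_function(X_demo), atol=1e-6))
```

The `coef_` attribute is only available for the linear kernel. Other kernels such as `'rbf'` produce curved boundaries that cannot be summarized by a single weight vector.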