# Model Selection

## Cross Validation / k-fold / Stratified k-fold Cross Validation

Model selection in machine learning refers to the process of choosing the best model among various candidates to solve a given problem. The goal is to identify the model that performs well on unseen data, which is critical for generalization. Here are the key steps involved in model selection:

### 1. **Problem Definition**
   - **Objective:** Clearly define the task (e.g., classification, regression, clustering).
   - **Data:** Understand the characteristics of the dataset, such as the number of features, data distribution, and whether the problem is supervised or unsupervised.
   - **Performance Metrics:** Choose appropriate metrics based on the problem (e.g., accuracy, precision, recall, F1-score for classification; MSE or RMSE for regression).
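
To make the classification metrics above concrete, here is a minimal sketch that computes them from raw counts in plain Python. The function name `classification_metrics` and the toy labels are illustrative assumptions, not part of the notebook.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (made up for illustration): 2 true positives, 2 false negatives,
# 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On these labels accuracy is 5/8 while recall is only 2/4, which is why a single metric is rarely enough to compare models.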

### 2. **Data Preprocessing**
   - **Cleaning and Transformation:** Handle missing values, outliers, and normalize/standardize the data as needed.
   - **Feature Engineering:** Create new features or select the most relevant features to improve the model's performance.
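   - **Train-Test Split:** Divide the dataset into training and test sets (and sometimes a validation set). Fit any scaling or imputation on the training portion only, so that information from the test set does not leak into the model.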
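
One way to keep preprocessing and splitting in the right order is to wrap both the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; the names `X_demo`, `y_demo` and `pipe` are hypothetical).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Split first, so the test rows never influence the preprocessing statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and reuses those
# means and standard deviations when it transforms the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the scaler lives inside the pipeline, the same object can later be passed to cross-validation helpers without leaking test-fold statistics.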

### 3. **Model Selection Criteria**
   - **Model Complexity:** Choose a model that balances bias and variance. Simpler models may underfit, while more complex models may overfit.
   - **Interpretability vs. Accuracy:** Simpler models like linear regression may be more interpretable, while more complex models like deep neural networks may provide higher accuracy but be harder to interpret.
   - **Computational Efficiency:** Consider the training time and prediction speed of the model, especially for large datasets.
   - **Scalability:** Ensure that the chosen model can handle the size of the data in both training and prediction.

### 4. **Choosing Candidate Models**
   Based on the problem and data, consider the following types of models:
   - **Linear Models:** Logistic regression, linear regression.
   - **Tree-Based Models:** Decision trees, random forests, gradient boosting machines (e.g., XGBoost, LightGBM).
   - **Neural Networks:** Shallow and deep neural networks, CNNs, RNNs, etc.
   - **Support Vector Machines (SVMs)**
   - **Ensemble Methods:** Combining multiple models to improve performance, such as bagging, boosting, and stacking.
   - **K-Nearest Neighbors (KNN)**

### 5. **Model Training and Hyperparameter Tuning**
   - **Cross-Validation:** Use techniques like k-fold cross-validation to estimate how well a model generalizes to unseen data.
   - **Hyperparameter Tuning:** Use methods like grid search or random search to find the optimal hyperparameters for a model (e.g., learning rate, tree depth).
   - **Early Stopping (for deep learning):** Prevent overfitting by stopping training once the validation performance stops improving.
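
As a rough illustration of grid search combined with k-fold cross-validation, the sketch below tunes a random forest on synthetic data. The search space is a hypothetical example and the values are not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search space: 3 depths x 2 forest sizes = 6 combinations.
param_grid = {"max_depth": [2, 4, None], "n_estimators": [25, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # every combination is scored with 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

`best_score_` is itself chosen by looking at the validation folds, so it tends to be optimistic; a separate test set is still needed for the final estimate.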

### 6. **Evaluation**
   - **Performance on Validation Set:** Evaluate models using the validation set (or through cross-validation). Use the chosen metrics to compare models.
   - **Overfitting and Underfitting:** Ensure the model is neither overfitting (too complex, poor generalization) nor underfitting (too simple, poor performance).
   - **Out-of-Sample Performance:** Assess the model’s ability to generalize to unseen data using the test set.
   
### 7. **Model Comparison**
   - Compare different models based on their performance metrics, training time, and complexity.
   - **Model Selection Criteria:** Make the final selection by balancing the performance on the validation set and other practical considerations (e.g., computational cost).

### 8. **Final Model Deployment**
   - Once the best model is selected and tuned, deploy it to production.
   - Ensure the model is able to handle new, real-world data and is regularly monitored for any potential model drift.

### Common Challenges in Model Selection
   - **Overfitting:** Models that perform well on the training data but poorly on unseen data.
   - **Underfitting:** Models that fail to capture the underlying patterns in the data, leading to poor performance on both training and testing data.
   - **Bias-Variance Trade-off:** Balancing between models that are too simple (high bias) and too complex (high variance).
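
The trade-off can be observed by varying model complexity and comparing training and test accuracy. This is a minimal sketch on synthetic, deliberately noisy data (an assumption), using decision-tree depth as the complexity knob.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise (flip_y), so a tree that
# memorises the training set cannot generalise perfectly.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

results = {}
for depth in (1, 3, None):   # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    train_acc, test_acc = results[depth]
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Low accuracy on both sets points towards underfitting, while a large gap between training and test accuracy points towards overfitting.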

### Conclusion
Model selection is an iterative and complex process that requires careful consideration of data, problem type, and the trade-offs between different models. It’s essential to experiment with multiple models and approaches, using cross-validation and careful performance evaluation to ensure that the chosen model is both accurate and generalizes well to new data.

In [None]:
# import the dependencies.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Collection and Processing

In [None]:
# loading the csv data to a pandas DataFrame.

heart_data = pd.read_csv("/content/heart.csv")

heart_data.head()

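| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |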


In [None]:
# number of rows and columns in the dataset.

heart_data.shape

(303, 14)

In [None]:
# checking for the missing value.

heart_data.isnull().sum()

| | 0 |
|---|---|
| age | 0 |
| sex | 0 |
| cp | 0 |
| trestbps | 0 |
| chol | 0 |
| fbs | 0 |
| restecg | 0 |
| thalach | 0 |
| exang | 0 |
| oldpeak | 0 |


In [None]:
# Checking the distribution of the Target Variable.

heart_data['target'].value_counts()

| target | count |
|---|---|
| 1 | 165 |
| 0 | 138 |


1 --> Defective Heart

0 --> Healthy Heart

## Splitting the Feature and Target

In [None]:
X = heart_data.drop(columns='target')  # all columns except the label
Y = heart_data['target']

In [None]:
print(X)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
298   57    0   0       140   241    0        1      123      1      0.2   
299   45    1   3       110   264    0        1      132      0      1.2   
300   68    1   0       144   193    1        1      141      0      3.4   
301   57    1   0       130   131    0        1      115      1      1.2   
302   57    0   1       130   236    0        0      174      0      0.0   

     slope  ca  thal  
0        0   0     1  
1        0   0     2  
2        2   0    

In [None]:
print(Y)

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64


# Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, random_state=42, stratify=Y)

In [None]:
print(X.shape,X_train.shape,X_test.shape)

(303, 13) (242, 13) (61, 13)
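
The `stratify=Y` argument is meant to keep the class proportions similar in both splits. The sketch below checks this on a label list built with the same class counts as the target column (165 ones, 138 zeros); the placeholder row ids and the helper `positive_share` are assumptions for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Labels with the same class counts as the target column: 165 ones, 138 zeros.
labels = [1] * 165 + [0] * 138
rows = list(range(len(labels)))   # placeholder row ids standing in for the features

_, _, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)


def positive_share(values):
    """Fraction of entries that belong to class 1."""
    return Counter(values)[1] / len(values)


full_share = positive_share(labels)
train_share = positive_share(y_tr)
test_share = positive_share(y_te)
print(f"full={full_share:.3f} train={train_share:.3f} test={test_share:.3f}")
```

Without stratification, a small test set of this size could end up with a noticeably different class balance purely by chance.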


# Comparing the performance of the models using train_test_split

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

- Comparing the performance of the models.

In [None]:
# list of the models.

models = [LogisticRegression(max_iter=1000),SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier() ]

In [None]:
def compare_model_train_test():
  for model in models:
    # Train the model on the training split.
    model.fit(X_train, Y_train)

    # Evaluate the model on the held-out test split.
    test_data_prediction = model.predict(X_test)

    accuracy = accuracy_score(Y_test, test_data_prediction)
    print("Accuracy score of the", model, "=", accuracy)



In [None]:
compare_model_train_test()

Accuracy score of the LogisticRegression(max_iter=1000) = 0.8032786885245902
Accuracy score of the SVC(kernel='linear') = 0.8032786885245902
Accuracy score of the KNeighborsClassifier() = 0.5901639344262295
Accuracy score of the RandomForestClassifier() = 0.7868852459016393
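
A score from a single split depends on which rows happen to land in the test set. The sketch below repeats the split with several seeds to show how much the accuracy can move. It uses synthetic data with the same number of rows and features as the heart dataset (an assumption, not the real data), so the numbers will differ from the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 303 rows and 13 features, like the heart data.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

split_accuracies = []
for seed in range(5):
    # Same data, same model, different random partition each time.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_accuracies.append(model.score(X_te, y_te))

print(split_accuracies)
print("spread:", max(split_accuracies) - min(split_accuracies))
```

Averaging over several partitions, which is what cross validation does in a systematic way, gives a steadier basis for comparing models.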


# Cross Validation
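 - It is an alternative to evaluating on a single train_test_split.
 - We pass the full X and Y to cross_val_score, which handles the splitting internally.
 - The data is split into k subsets (folds). Each fold is used once as the test set while the model is trained on the remaining k-1 folds. For classifiers, an integer cv uses stratified folds, so each fold keeps roughly the same class proportions as the full dataset.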
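
To show what happens inside `cross_val_score`, the sketch below runs the stratified k-fold loop by hand and compares the result with the helper. It uses synthetic stand-in data (an assumption); for a classifier and `cv=5`, scikit-learn uses `StratifiedKFold(n_splits=5)` without shuffling, so the two sets of scores should agree.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 303 rows and 13 features, like the heart data.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000)

skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Fit a fresh copy on 4 folds, then score it on the held-out fold.
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
helper_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(manual_scores)
print(helper_scores)
```

Every row is used for testing exactly once, so the mean of the fold scores draws on the whole dataset rather than on one 20% slice.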

## Use cross validation with Logistic Regression.

In [None]:
from sklearn.model_selection import cross_val_score

cv_score_lr = cross_val_score(LogisticRegression(max_iter=1000),X,Y, cv = 5 )


mean_accuracy_lr = sum(cv_score_lr)/len(cv_score_lr)
mean_accuracy_lr = mean_accuracy_lr*100
mean_accuracy_lr = mean_accuracy_lr.round(2)

print(cv_score_lr) # the 5 accuracy scores, one per fold; each fold uses a different train/test subset.
print(mean_accuracy_lr) # the mean of the 5 accuracy scores, as a percentage.



[0.80327869 0.86885246 0.85245902 0.86666667 0.75      ]
82.83
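
The mean hides how much the folds disagree, so it can help to report the spread as well. A minimal sketch with the standard library, using the fold accuracies copied from the output above:

```python
from statistics import mean, stdev

# Fold accuracies copied from the Logistic Regression output above.
fold_scores = [0.80327869, 0.86885246, 0.85245902, 0.86666667, 0.75]

mean_pct = round(mean(fold_scores) * 100, 2)
std_pct = round(stdev(fold_scores) * 100, 2)   # sample standard deviation

print(f"mean accuracy = {mean_pct}%, fold-to-fold std = {std_pct}%")
```

When two models differ in mean accuracy by less than the spread across folds, the ranking between them should be read with caution.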


## Use cross validation with SVC.

In [None]:
from sklearn.model_selection import cross_val_score

cv_score_svc = cross_val_score(SVC(kernel='linear'), X, Y, cv=5)

mean_accuracy_svc = sum(cv_score_svc)/len(cv_score_svc)
mean_accuracy_svc = mean_accuracy_svc*100
mean_accuracy_svc = mean_accuracy_svc.round(2)

print(cv_score_svc) # the 5 accuracy scores, one per fold.
print(mean_accuracy_svc) # the mean of the 5 accuracy scores, as a percentage.


[0.81967213 0.8852459  0.80327869 0.86666667 0.76666667]
82.83


## Comparing multiple models at once using cross validation

In [None]:
# list of models.

models = [LogisticRegression(max_iter=1000),SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier() ]


In [None]:
def compare_models_cross_validation():
  # Run 5-fold cross validation for every model in the global `models` list.

  for model in models:

    # One accuracy score per fold.
    cv_score = cross_val_score(model, X, Y, cv=5)

    # Mean accuracy across the folds, as a percentage rounded to 2 decimals.
    mean_accuracy = sum(cv_score)/len(cv_score)
    mean_accuracy = mean_accuracy*100
    mean_accuracy = mean_accuracy.round(2)

    print("Cross Validation accuracy score for", model, "=",cv_score)
    print("Accuracy Score of the", model, "=", mean_accuracy)
    print('========================================================')

In [None]:
compare_models_cross_validation()

Cross Validation accuracy score for LogisticRegression(max_iter=1000) = [0.80327869 0.86885246 0.85245902 0.86666667 0.75      ]
Accuracy Score of the LogisticRegression(max_iter=1000) = 82.83
Cross Validation accuracy score for SVC(kernel='linear') = [0.81967213 0.8852459  0.80327869 0.86666667 0.76666667]
Accuracy Score of the SVC(kernel='linear') = 82.83
Cross Validation accuracy score for KNeighborsClassifier() = [0.60655738 0.6557377  0.57377049 0.73333333 0.65      ]
Accuracy Score of the KNeighborsClassifier() = 64.39
Cross Validation accuracy score for RandomForestClassifier() = [0.80327869 0.86885246 0.83606557 0.81666667 0.78333333]
Accuracy Score of the RandomForestClassifier() = 82.16
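
KNN scores well below the other three models here. One possible explanation, which this notebook does not verify, is that KNN relies on distances and the features are on very different scales. The sketch below shows how that hypothesis could be tested by scaling inside a pipeline. It uses synthetic data with one artificially inflated noise column (an assumption, not the heart data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in. With shuffle=False the pure-noise columns come last.
X_demo, y_demo = make_classification(
    n_samples=303, n_features=5, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo[:, -1] *= 1000.0   # inflate one noise column to mimic mismatched units

raw_knn = KNeighborsClassifier()
# The scaler is refitted on the training folds only in every CV iteration.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_scores = cross_val_score(raw_knn, X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)

print("KNN without scaling:", raw_scores.mean().round(4))
print("KNN with scaling:   ", scaled_scores.mean().round(4))
```

To try this on the heart data, add `make_pipeline(StandardScaler(), KNeighborsClassifier())` to the `models` list and rerun `compare_models_cross_validation()`.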
