<details>
  <summary>Supervised Learning Steps</summary>
    
1. Data Collection
   * 1.1\. Data Sources
   * 1.2\. Data Collection Considerations
2. Data Exploration and Preparation
   * 2.1\. Data Exploration
   * 2.2\. Data Preparation/Cleaning
3. Split Data into Training and Test Sets
   * 3.1\. Holdout Method
   * 3.2\. Cross-Validation
   * 3.3\. Data Leakage
   * 3.4\. Best Practices
4. Choose a Supervised Learning Algorithm
   * 4.1\. Consider algorithm categories
   * 4.2\. Evaluate algorithm characteristics
   * 4.3\. Try multiple algorithms
5. Train the Model
   * 5.1\. Objective Function (Loss/Cost Function)
   * 5.2\. Optimization Algorithms
   * 5.3\. Overfitting and Underfitting
6. Evaluate Model Performance
   * 6.1\. Evaluate Model Performance
   * 6.2\. Performance Metrics for Classification Models
   * 6.3\. Interpreting and Reporting Model Performance
7. Model Tuning and Selection
   * 7.1\. Hyperparameter Tuning
   * 7.2\. Ensemble Methods
</details>

# 3. Split Data into Training and Test Sets

![image](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRFvC9V823fYfoOP5jRe9uInBq6gek79wEP3P5ZeXGqMA&s)

In supervised learning, we have a dataset with labeled examples, where each instance consists of input features (independent variables) and a corresponding target variable (dependent variable). The goal is to learn a mapping function from the input features to the target variable, which can then be used to make predictions on new, unseen data.

However, if we train and evaluate our model on the same data, we cannot tell whether it has overfit, that is, memorized the training data instead of learning the underlying patterns. An overfit model can score well on the data it was trained on and still generalize poorly to new data.

To avoid this, we split the dataset into two mutually exclusive sets:

- **Training Set**: Used to train the model, allowing it to learn the patterns and relationships between the input features and the target variable.

- **Test Set**: Used to evaluate the trained model's performance on unseen data, providing an unbiased estimate of its generalization ability.

Failing to separate the data into training and test sets can result in overly optimistic performance estimates, as the model has already seen and memorized the data it's being evaluated on.
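
The gap between the two estimates can be seen in a minimal sketch, assuming scikit-learn is installed. The synthetic dataset, the unpruned decision tree, and all parameter values are illustrative assumptions chosen to make memorization easy; they are not part of the workflow described here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% randomly flipped labels (flip_y), so some
# labels are pure noise that can be memorized but not generalized.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unpruned tree keeps splitting until it fits the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # data the model has already seen
test_acc = model.score(X_test, y_test)     # held-out data

print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
```

Reporting only the training accuracy here would overstate how well the model generalizes; the test accuracy is the figure to trust.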

## 3.1. Holdout Method

The holdout method is a simple and widely used technique for splitting the dataset into training and test sets.

Here's how it works:

1. Shuffle the dataset randomly to ensure that the examples are not ordered in any systematic way. Skip this step when the order itself carries meaning, as in time-series data.

2. Determine the desired split ratio, such as 80/20 or 70/30, where the first number represents the percentage of data for the training set, and the second number represents the percentage for the test set.

3. Split the dataset according to the chosen ratio, ensuring that the training and test sets are mutually exclusive (no overlapping instances).

For classification problems, it's important to maintain the class proportions in both the training and test sets. This process is called stratification. It ensures that the target variable's distribution is similar in both sets, so that a rare class does not end up under-represented (or missing entirely) in one of them by chance.

![image](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSmihXQGQgZYemUPa0Qvb5VTcCqXunGZuxyF_pnHLbrvQ&s)
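
The steps above, including stratification, can be sketched with scikit-learn's `train_test_split()`. The toy dataset (90 examples of class 0, 10 of class 1), the 80/20 ratio, and the `random_state` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 90 examples of class 0, 10 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 80/20 split
    shuffle=True,     # shuffle before splitting (this is the default)
    stratify=y,       # keep the 90/10 class ratio in both sets
    random_state=42,  # fixed seed so the split is reproducible
)

print("train size:", len(X_train), "test size:", len(X_test))
print("share of class 1 in train:", y_train.mean())
print("share of class 1 in test: ", y_test.mean())
```

Without `stratify=y`, a random 20-example test set drawn from this data could contain very few or even no examples of class 1.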

## 3.2. Cross-Validation

While the holdout method is simple and widely used, it can lead to unstable performance estimates, especially for small datasets, because the result depends on which examples happen to land in the test set.

Cross-validation is an alternative technique that provides more reliable performance estimates and makes overfitting easier to detect.

The idea behind cross-validation is to split the dataset into multiple folds (subsets), train the model on a combination of folds, and evaluate it on the remaining fold(s). This process is repeated for different fold combinations, and the performance metrics are averaged across all iterations.

One popular cross-validation technique is k-fold cross-validation, where the dataset is divided into k folds of (roughly) equal size. The model is trained on k-1 folds and evaluated on the remaining fold. This process is repeated k times, with each fold serving as the test set exactly once.

![image](https://user-images.githubusercontent.com/25267873/74094572-746fd200-4adb-11ea-91fd-93935d51982f.png)

Cross-validation is particularly useful when working with small datasets or when the performance estimates from the holdout method are unstable. However, it comes at the cost of increased computational time, as the model needs to be trained and evaluated multiple times.
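
To make the fold mechanics concrete, here is a minimal sketch of how k-fold indices can be generated with NumPy. The helper `k_fold_indices` is a hypothetical function written for this illustration, not a library API, and model training and scoring inside each fold are left out.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle once up front
    folds = np.array_split(indices, k)    # k folds, sizes differ by at most 1
    for i in range(k):
        test_idx = folds[i]               # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: test={sorted(test_idx.tolist())} "
          f"train={sorted(train_idx.tolist())}")
```

In practice, scikit-learn's `KFold` and `StratifiedKFold` splitters produce such index pairs, and `cross_val_score()` runs the train-and-evaluate loop over them.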

## 3.3. Data Leakage

Data leakage occurs when information that would not be available at prediction time, such as information from the test set or from the target itself, is inadvertently used during the training process. It leads to overly optimistic performance estimates and poor generalization on new, unseen data.

There are two main types of data leakage:

- **Target Leakage**: This occurs when the target variable, or a feature that closely approximates it or is only known after the outcome, is used as an input feature for training the model. For example, in a credit risk prediction problem, using the customer's current credit score as a feature would be considered target leakage, as the model would essentially be learning to predict the target variable from itself.

- **Feature Leakage**: This happens when information from the test set is used to create or transform features during the training process. For example, if you calculate the mean or standard deviation of a feature using the entire dataset (including the test set) and then use these statistics to normalize the feature values, you have introduced feature leakage.

To avoid data leakage, it's crucial to ensure that the test set remains completely unseen during the model training and selection process. This includes:
- Splitting the data into training and test sets before any feature engineering or preprocessing steps.
- Performing all feature transformations (e.g., scaling, encoding) using only the training set, and then applying the same transformations to the test set.
- Avoiding peeking at the test set during model selection, hyperparameter tuning, or any other step that could potentially introduce bias.
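
The first two points can be sketched as follows, assuming scikit-learn. The synthetic dataset, the choice of `StandardScaler` and `LogisticRegression`, and the parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 1. Split first, before any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Learn the scaling statistics (mean, std) from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# 3. Apply the training statistics to the test set; never refit on it.
X_test_scaled = scaler.transform(X_test)

# A pipeline bundles the same steps, so fit() only ever sees training data.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
test_acc = pipeline.score(X_test, y_test)
print(f"test accuracy: {test_acc:.3f}")
```

The leaky version would call `StandardScaler().fit(X)` on the full dataset before splitting. Wrapping preprocessing in a pipeline also keeps it leak-free during cross-validation, because the scaler is refit on the training folds of each split.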

Data leakage can lead to overly optimistic performance estimates and poor generalization on new data, as the model has effectively seen and learned from information it should not have had access to during training.

## 3.4. Best Practices

Here are some best practices to follow when splitting data into training and test sets:

- **Shuffle the data**: Before splitting the dataset, shuffle the instances randomly to avoid any systematic ordering that could introduce bias. The exception is data whose order matters, such as time-series forecasting, where the split should preserve the chronological order.
  
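- **Use a consistent random_state**: When using functions like train_test_split(), set a fixed random_state value to ensure reproducibility of the splits across different runs. Note that cross_val_score() has no random_state parameter of its own; for cross-validation, set random_state on the splitter passed as cv (e.g., KFold(n_splits=5, shuffle=True, random_state=42)).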

- **Avoid peeking at the test set**: Never use the test set for any step that could potentially introduce bias, such as feature engineering, model selection, or hyperparameter tuning.

- **Consider cross-validation for small datasets**: If you have a small dataset or if the performance estimates from the holdout method are unstable, consider using cross-validation techniques like k-fold cross-validation to obtain more reliable performance estimates.

- **Maintain class proportions (stratification)**: For classification problems, ensure that the class proportions are maintained in both the training and test sets by using stratification (e.g., stratify=y in train_test_split()).

- **Evaluate multiple train/test splits**: To account for potential variability in the splits, consider evaluating your model's performance on multiple train/test splits and reporting the average or distribution of performance metrics.

- **Document and version control**: Maintain clear documentation and version control for your data splitting and preprocessing steps to ensure reproducibility and transparency in your modeling process.

By following these best practices, you can ensure that your data splitting process is robust, unbiased, and provides reliable performance estimates for your supervised learning models.
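
Several of these practices (stratification, a fixed seed, multiple splits, and leak-free preprocessing) can be combined in one short sketch, assuming scikit-learn. The synthetic dataset, the model, and the values of `n_splits`, `n_repeats`, and `random_state` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Preprocessing lives inside the pipeline, so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The seed is set on the splitter: 5 stratified folds, repeated 3 times
# with different shuffles, for 15 train/test splits in total.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"number of splits: {len(scores)}")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the standard deviation shows how much the estimate varies from split to split, which a single holdout split cannot reveal.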