# Splitting Data

- **Training data:** The data used to fit the machine learning model. It consists of input features and their corresponding output labels. The model learns patterns from this data and uses them to make predictions on new, unseen data.
- **Validation data:** The data used to evaluate the model while it is being developed. It is used to tune the model's hyperparameters and to detect overfitting, so it should be representative of the unseen data that the model will encounter in the real world.
- **Test data:** The data used to estimate the final performance of the model after training and tuning. It must be kept completely separate from the training and validation data to avoid data leakage.
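
As a minimal sketch of how the three sets are typically used together, the snippet below tunes one hyperparameter on the validation set and touches the test set only once, at the very end. The Iris dataset, the candidate values for `C`, and `max_iter=1000` are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the test set, then 25% of the remainder as validation
# (0.25 * 0.8 = 0.2), giving a 60/20/20 split.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Tune the regularization strength C using the validation set only.
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # illustrative candidate values
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Evaluate the chosen configuration once on the untouched test set.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"Best C: {best_C}, validation accuracy: {best_val_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the test set plays no part in choosing `C`, its score is a fairer estimate of performance on unseen data than the validation score.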

A very common issue when training a model is **overfitting**. It occurs when a model performs very well on the data used to train it but fails to generalize to new, unseen data points. This can happen for several reasons: the model may have fitted noise in the data, or it may have memorized specific inputs instead of learning the underlying relationships that are actually predictive. Typically, the more complex the model, the higher the risk of overfitting.

On the other hand, **underfitting** occurs when the model performs poorly even on the data that was used to train it. In most cases, this happens because the model is not suitable for the problem you are trying to solve. Usually, this means that the model is less complex than it needs to be in order to learn the relationships that are actually predictive.
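
One rough way to see both effects is to compare training and test accuracy for a model that is deliberately too simple and one that is free to grow as complex as it likes. The sketch below assumes a synthetic noisy dataset from `make_classification` and uses decision trees of different depths; both choices are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with label noise (flip_y), chosen only for illustration.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
# max_depth=1 is a very simple model; max_depth=None lets the tree grow fully.
for name, depth in [("shallow", 1), ("deep", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    train_acc, test_acc = results[name]
    print(f"{name}: train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

A large gap between training and test accuracy (the fully grown tree) is the usual symptom of overfitting, while low accuracy on the training set itself (the depth-1 tree) points towards underfitting.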

## Fastest and most common way to split data

In [None]:
# Splitting the dataset into the Training set and Test set
# (X and y are assumed to be an already loaded feature matrix and label vector)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

## Random Train-Test Split

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


# Load Iris dataset
iris = load_iris()


# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)


# Train logistic regression model on training set
clf = LogisticRegression().fit(X_train, y_train)


# Evaluate model on test set
score = clf.score(X_test, y_test)
print(f"Accuracy: {score:.2f}")

Accuracy: 1.00
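
A single random split gives a single accuracy figure, and that figure depends on which samples happen to land in the test set. As a small sketch, assuming ten arbitrary seeds and `max_iter=1000` (both illustrative choices), the split can be repeated to see how much the score moves:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Repeat the same 80/20 split with different seeds and record test accuracy.
accuracies = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies.append(clf.score(X_test, y_test))

print(f"min: {min(accuracies):.2f}, max: {max(accuracies):.2f}")
```

With only 30 test samples per split, each misclassified sample changes the accuracy by about 0.03, which is one reason a single split can look better or worse than the model really is.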


## Stratified Sampling

In [11]:
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression


# Load Iris dataset
iris = load_iris()


# Create stratified sampling object
strat_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)


# Split dataset into training and test sets using stratified sampling
for train_index, test_index in strat_split.split(iris.data, iris.target):
    X_train, X_test = iris.data[train_index], iris.data[test_index]
    y_train, y_test = iris.target[train_index], iris.target[test_index]


# Train logistic regression model on training set
clf = LogisticRegression().fit(X_train, y_train)


# Evaluate model on test set
score = clf.score(X_test, y_test)
print(f"Accuracy: {score:.2f}")

Accuracy: 0.97
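
For a single stratified split, `train_test_split` also accepts a `stratify` argument, which may be a shorter alternative to building a `StratifiedShuffleSplit` object. The sketch below checks the class counts in each part; the `random_state` value is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each part.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Iris has 50 samples per class, so an 80/20 stratified split places 40 samples of each class in the training set and 10 in the test set.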


# Validation

## Holdout Validation with Validation Set

In [12]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


# Load Iris dataset
iris = load_iris()


# Split dataset into training, validation, and test sets (60% / 20% / 20%):
# 20% is held out for testing, then 25% of the remaining 80% becomes validation
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)


# Train logistic regression model on training set
clf = LogisticRegression().fit(X_train, y_train)


# Evaluate model on validation set
score = clf.score(X_val, y_val)
print(f"Validation accuracy: {score:.2f}")


# Evaluate model on test set
score = clf.score(X_test, y_test)
print(f"Test accuracy: {score:.2f}")

Validation accuracy: 0.97
Test accuracy: 1.00


## Cross-Validation

In [13]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load Iris dataset
iris = load_iris()

# Train logistic regression model using cross-validation
clf = LogisticRegression()
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# Print average accuracy across folds
print(f"Accuracy: {scores.mean():.2f}")

Accuracy: 0.97


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
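
The warning above suggests two remedies: raising `max_iter` or scaling the features. A minimal sketch that applies both is shown here, assuming a `StandardScaler` inside a pipeline and `max_iter=1000`; these specific choices are illustrative and are not taken from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale the features and allow more solver iterations than the default.
# Putting the scaler in the pipeline means it is refit on each training fold,
# so no statistics from the held-out fold leak into training.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)

print(f"Accuracy: {scores.mean():.2f}")
```

Scaling inside the pipeline rather than on the full dataset beforehand keeps each validation fold unseen during preprocessing.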


## Leave-One-Out Cross-Validation

In [14]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score


# Load Iris dataset
iris = load_iris()


# Define leave-one-out cross-validation object
loo = LeaveOneOut()


# Train logistic regression model using leave-one-out cross-validation
scores = []
for train_index, test_index in loo.split(iris.data):
    X_train, X_test = iris.data[train_index], iris.data[test_index]
    y_train, y_test = iris.target[train_index], iris.target[test_index]
    clf = LogisticRegression().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))


# Compute average accuracy across all samples
score_a = sum(scores) / len(scores)
print(f"Accuracy: {score_a:.2f}")


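
The leave-one-out loop can also be written more compactly by passing a `LeaveOneOut` object as the `cv` argument of `cross_val_score`. This is a sketch of that alternative, with `max_iter=1000` added as an assumption to give the solver more iterations than the default.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per sample: each score is the accuracy on a single held-out sample,
# so every entry is either 0.0 or 1.0.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(f"Folds: {len(scores)}, accuracy: {scores.mean():.2f}")
```

Leave-one-out fits the model once per sample (150 times for Iris), so it is usually practical only for small datasets.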