#### Evaluate a logistic regression model with a train/test split on a synthetic binary classification dataset

In [18]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Data Preparation with Train and Test Sets

In [172]:
# Generate sample data with make_classification
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

In [173]:
print(X.shape, y.shape)

(5000, 20) (5000,)


In [174]:
X[:1]

array([[ 2.31769126, -3.91726945, -2.39294994, -0.16218061,  1.16487447,
        -0.37634738, -2.833378  ,  1.9359793 , -1.37426782,  1.96737752,
        -3.41222262, -0.96116369,  0.73348974,  1.36286215, -0.69056844,
         3.2143011 ,  3.88749404, -0.67684237,  0.83908542,  0.37365759]])

In [175]:
y[0:5]

array([1, 0, 1, 1, 1])

#### Naive Approach with data leakage
- Next, we evaluate the model on a scaled dataset, starting with the naive (incorrect) approach.
- The naive approach applies the data preparation method to the whole dataset first, then splits the data, and finally evaluates the model. Because the scaler sees every row, statistics from the test set leak into the training data.


In [176]:
# Normalize the dataset to the range [0, 1] (fit on ALL rows, which leaks test-set statistics)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [177]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [178]:
# Create and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [179]:
# Evaluate the model
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)

In [180]:
# Print accuracy
print('Accuracy  %.3f' % (accuracy*100))

Accuracy  87.939


Because the scaler was fit on the full dataset, the test rows influenced the minimum and maximum used for scaling, so this accuracy estimate cannot be trusted. Next, let’s explore how to prepare the data correctly to avoid data leakage.
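To make the leakage concrete, the following sketch compares the scaling statistics learned from all rows with those learned from the training rows only. It is a minimal illustration, not part of the original notebook: the names `X_raw`, `y_raw`, `leaky`, and `clean` are hypothetical and chosen so that they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Same synthetic dataset and split settings as used in this notebook
X_raw, y_raw = make_classification(n_samples=5000, n_features=20, n_informative=15,
                                   n_redundant=5, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.33, random_state=1)

# Naive: min/max computed from every row, including the future test rows
leaky = MinMaxScaler().fit(X_raw)
# Correct: min/max computed from the training rows only
clean = MinMaxScaler().fit(X_tr)

# A feature "leaks" when its overall min or max is set by a test row
min_from_test = leaky.data_min_ < clean.data_min_
max_from_test = leaky.data_max_ > clean.data_max_
n_leaky_features = int(np.sum(min_from_test | max_from_test))
print('Features whose range depends on test rows:', n_leaky_features)

# With train-only scaling, test values are allowed to fall outside [0, 1]
X_te_scaled = clean.transform(X_te)
print('Test min/max after train-only scaling:', X_te_scaled.min(), X_te_scaled.max())
```

Any feature counted here had its scaling range determined by a row the model is later tested on, which is exactly the information that should stay hidden during training.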

### Train-Test Evaluation With Correct Data Preparation

In [181]:
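# Regenerate the raw dataset, because X was overwritten with scaled values in the naive approach
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# First, split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)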

In [182]:
# Define scaler
scaler = MinMaxScaler()
# Fit on the training dataset
scaler.fit(X_train)
# Scale the training dataset
X_train = scaler.transform(X_train)
# Scale the test dataset
X_test = scaler.transform(X_test)

In [183]:
# Create and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [184]:
# Evaluate the model
preds = model.predict(X_test)

In [185]:
accuracy = accuracy_score(y_test, preds)
print('Accuracy %.3f' % (accuracy*100))

Accuracy 88.000


### K-fold Cross-Validation Evaluation With Naive Data Preparation

In [25]:
# Naive data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

In [203]:
# Define dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

In [204]:
# Normalize the dataset to the range [0, 1] (fit on ALL rows before cross-validation, which leaks)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [205]:
# Define the model
model = LogisticRegression()

In [206]:
# Define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# Evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [207]:
# Report performance as mean accuracy (standard deviation), both in percent
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Accuracy: 86.753 (1.138)



### Cross-Validation Evaluation With Correct Data Preparation

- Data preparation without data leakage is slightly more challenging with cross-validation. The data preparation method must be fit on the training folds only and then applied to both the training folds and the held-out fold, and this must be repeated for every split in the cross-validation procedure. We can achieve this by defining a modeling pipeline: a sequence of data preparation steps that ends in the model to fit and evaluate.

- The evaluation procedure changes from simply and incorrectly evaluating just the model to correctly evaluating the entire pipeline of data preparation and model together as a single atomic unit. This can be achieved using the Pipeline class. This class takes a list of steps that define the pipeline. Each step in the list is a tuple with two elements. The first element is the name of the step (a string) and the second is the configured object of the step, such as a transform or a model. The model is only supported as the final step, although we can have as many transforms as we like in the sequence.
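
As a rough sketch of what evaluating the pipeline as a single unit means, the loop below refits a fresh copy of the pipeline on the training rows of each fold and scores it on the held-out rows, then checks the result against `cross_val_score`. This is an illustration under simplifying assumptions, not the code that scikit-learn runs internally: it uses a smaller dataset and a plain 5-fold `StratifiedKFold` to keep it quick, and names such as `X_demo`, `pipe`, and `fold_scores` are hypothetical.

```python
from numpy import mean, allclose
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Smaller demo dataset (assumption: 1000 rows is enough to illustrate the idea)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=15,
                                     n_redundant=5, random_state=7)
pipe = Pipeline(steps=[('scaler', MinMaxScaler()), ('model', LogisticRegression())])
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Fresh, unfitted copy so nothing carries over between folds
    fold_pipe = clone(pipe)
    # fit() fits the scaler on the training rows only, then fits the model
    fold_pipe.fit(X_demo[train_idx], y_demo[train_idx])
    # predict() scales the held-out rows with the training-fold min/max
    fold_preds = fold_pipe.predict(X_demo[test_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], fold_preds))

# The same evaluation done by cross_val_score on the same folds
reference_scores = cross_val_score(pipe, X_demo, y_demo, scoring='accuracy', cv=folds)

print('Manual per-fold accuracy: %.3f' % (mean(fold_scores) * 100))
print('Matches cross_val_score:', allclose(fold_scores, reference_scores))
```

The key point is that the scaler never sees the held-out rows during `fit`, so each fold's score is free of the leakage present in the naive version.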

In [209]:
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [210]:
# define dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

In [211]:
# Define the pipeline: scaling followed by the model
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)

In [212]:
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [213]:
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [214]:
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Accuracy: 86.747 (1.143)
