Now that the features have been selected, we will train our classifiers on them.

# Training the models

### Data splitting

In [None]:
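from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X_new, data_train['Transported'], test_size=0.2, random_state=42)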

print(f'X_train shape: {X_train.shape}')
print(f'Y_train shape: {Y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'Y_test shape: {Y_test.shape}')

### Logistic regression

**Training the model**

We will start by training a logistic regression model.

In [None]:
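from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Fit a logistic regression with default hyperparameters on the selected features
reg = LogisticRegression()
reg.fit(X_train, Y_train)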

print(f'Accuracy: {reg.score(X_test, Y_test)}')

scores = cross_val_score(reg, X_new, data_train['Transported'], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

The logistic regression model reaches an accuracy of approximately 75.04% on the test set. In 5-fold cross-validation, the per-fold accuracies are 75.22%, 71.25%, 74.93%, 75.66%, and 72.27%.

The mean cross-validation score, the average accuracy across all folds, is approximately 73.86%. The standard deviation of the scores is approximately 1.77%, which measures how much the model's accuracy varies from fold to fold.
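As a quick sanity check, these summary statistics can be recomputed from the five fold scores quoted above. This is a small sketch using only the Python standard library; the scores are copied from the text (rounded to two decimals), and `pstdev` is chosen on the assumption that `scores` is a NumPy array, whose `.std()` defaults to the population standard deviation.

```python
from statistics import fmean, pstdev

# Fold accuracies (in %) as quoted in the text, rounded to two decimals
fold_scores = [75.22, 71.25, 74.93, 75.66, 72.27]

cv_mean = fmean(fold_scores)
# Population standard deviation (ddof=0), matching the default of NumPy's ndarray.std()
cv_std = pstdev(fold_scores)

print(f'Mean: {cv_mean:.2f}%')
print(f'Standard deviation: {cv_std:.2f}%')
```

Because the inputs are rounded, the recomputed values can differ from the quoted 73.86% and 1.77% in the last digit.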

Overall, the model performs consistently across the different subsets of the data, with an average accuracy of around 73.86%. The test-set accuracy lies within about one standard deviation of the cross-validation mean, which suggests that the model generalizes reasonably well to unseen data.

This accuracy is already reasonable, but we will try to improve it by tuning the hyperparameters.

**Tuning the hyperparameters**

Since the model above was trained on the selected features only, we will now train a model that uses all the features for comparison.

In [None]:
from sklearn.pipeline import make_pipeline

# Use every feature instead of the selected subset
X_all = data_train.drop('Transported', axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X_all, data_train['Transported'], test_size=0.2, random_state=42)

print(f'X_train shape: {X_train.shape}')
print(f'Y_train shape: {Y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'Y_test shape: {Y_test.shape}')

# Impute missing values with the column mean, then fit logistic regression.
# Putting both steps in one pipeline refits the imputer inside each cross-validation fold.
reg = make_pipeline(SimpleImputer(strategy='mean'), LogisticRegression())
reg.fit(X_train, Y_train)

print(f'Accuracy: {reg.score(X_test, Y_test)}')

scores = cross_val_score(reg, X_all, data_train['Transported'], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')
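The cell above still fits the model with default hyperparameters. One possible way to actually search over them is a grid search on the regularization strength `C`. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than on the notebook's data, and the grid of `C` values and `max_iter=1000` are arbitrary choices, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): replace with the notebook's features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Imputer and classifier share one pipeline, so each CV fold fits its own imputer
pipe = make_pipeline(SimpleImputer(strategy='mean'), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularization strength C
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validated search, scored by accuracy like the cells above
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(f'Best parameters: {search.best_params_}')
print(f'Best cross validation score: {search.best_score_}')
```

To try this on the real data, `X_demo` and `y_demo` would be swapped for the training features and `data_train['Transported']`. The selected value of `C` should then be confirmed on the held-out test set.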