# Classification example (Bank Churn dataset)

This notebook introduces classification using the scikit-learn package. We will build a logistic regression model (a classification model, despite its name). We start by loading the data with pandas, then select the columns needed to train the model, fit the model and make predictions. We will also look at how to handle categorical variables. These examples use the [bank churn dataset](course_datasets.md#bank-churn). We will visualise the data using the matplotlib package.

## Setup

The import statements below use numpy, pandas, matplotlib and several modules from scikit-learn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder


Load the bank churn dataset using pandas and display the first few rows.

In [None]:
df = pd.read_csv('https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/Churn.csv')
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
df.head()

To keep things simple, we use only three columns as the features of the model: Gender, Geography and Age. Note that:

* Gender: categorical (Male, Female)
* Geography: (France, Germany, Spain), treated here as an ordinal feature with France < Germany < Spain
* Age: an integer

In [None]:
X = df[['Gender', 'Geography', 'Age']]
y = df['Exited']

Split the data into training and test datasets, holding out 20% of the rows for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'X_train\n{X_train[:5]}')
print(f'y_train\n{y_train[:5]}')
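If the target classes turn out to be imbalanced, a plain random split can leave the training and test sets with slightly different class proportions. The sketch below uses the `stratify` argument of `train_test_split` on made-up labels (the names `X_demo` and `y_demo` and the 80/20 class mix are assumptions for illustration, not taken from the churn data) to show how the proportions can be kept equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros (stayed) and 20 ones (exited)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f'Share of positives in train: {y_tr.mean():.2f}')
print(f'Share of positives in test:  {y_te.mean():.2f}')
```

Whether stratification is worth adding to the split above depends on how uneven the Exited column is, which can be checked with `y.value_counts()`.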

## Encoding of categorical and ordinal features

Gender is a categorical variable that we encode using one-hot encoding. Geography is treated as an ordinal feature with the ordering: France < Germany < Spain. We can map these to numeric values (France=0, Germany=1, Spain=2) that preserve the ordinal relationship.

One-hot encoding for Gender with drop='first' creates 1 column:
* Gender_Male: [0 or 1]
* (Gender_Female is implicit when Gender_Male is 0)

Ordinal encoding for Geography creates 1 column:
* Geography: [0, 1, or 2] representing France, Germany, Spain respectively

In [None]:
categorical_features = ['Gender']
ordinal_features = ['Geography']
numerical_features = ['Age']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_features),
        # The order of the categories list sets the codes: France=0, Germany=1, Spain=2
        ('ord', OrdinalEncoder(categories=[['France', 'Germany', 'Spain']]), ordinal_features),
        ('num', 'passthrough', numerical_features)
    ])

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)



In [None]:
print(f'Feature names: {preprocessor.get_feature_names_out()}')
print(f'X_train\n{X_train[:5]}')
print(f'X_train_processed\n{X_train_processed[:5]}')

In [None]:
# Alternative approach using pandas instead of ColumnTransformer

# One-hot encode Gender (drop first category)
X_train_gender_encoded = pd.get_dummies(X_train['Gender'], drop_first=True, prefix='Gender')
X_test_gender_encoded = pd.get_dummies(X_test['Gender'], drop_first=True, prefix='Gender')

# Ordinal encode Geography (France=0, Germany=1, Spain=2) using lambda
geography_mapping = {'France': 0, 'Germany': 1, 'Spain': 2}
X_train_geography_encoded = X_train['Geography'].apply(lambda x: geography_mapping[x]).to_frame(name='Geography')
X_test_geography_encoded = X_test['Geography'].apply(lambda x: geography_mapping[x]).to_frame(name='Geography')

# Combine all features and convert to numpy array
X_train_processed_pandas = np.hstack([
    X_train_gender_encoded.values,
    X_train_geography_encoded.values,
    X_train[['Age']].values
])

X_test_processed_pandas = np.hstack([
    X_test_gender_encoded.values,
    X_test_geography_encoded.values,
    X_test[['Age']].values
])

print('Pandas approach results:')
print(f'X_train_processed_pandas\n{X_train_processed_pandas[:5]}')
print(f'X_test_processed_pandas\n{X_test_processed_pandas[:5]}')
print(f'Shape: {X_train_processed_pandas.shape}')

Inspect the transformed data in the output above and compare it to the original data.

Create and fit the model.

In [None]:
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_processed, y_train)
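A fitted logistic regression stores one coefficient per input column plus an intercept, which gives a rough view of how each feature pushes the prediction. The sketch below fits a separate model on a small invented matrix (the values in `X_demo` and `y_demo` are assumptions, laid out in the same column order as the processed features) and prints the coefficients next to their names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical processed matrix with columns [Gender_Male, Geography, Age]
X_demo = np.array([
    [1, 0, 25], [0, 1, 52], [1, 2, 38], [0, 0, 61],
    [1, 1, 30], [0, 2, 47], [1, 0, 58], [0, 1, 29],
])
# Hypothetical labels: in this toy data the older customers exited
y_demo = np.array([0, 1, 0, 1, 0, 1, 1, 0])

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem
feature_names = ['Gender_Male', 'Geography', 'Age']
for name, coef in zip(feature_names, demo_model.coef_[0]):
    print(f'{name}: {coef:+.3f}')
print(f'intercept: {demo_model.intercept_[0]:+.3f}')
```

The same loop should work on the notebook's `model`, with the names taken from `preprocessor.get_feature_names_out()`. Because the features are on different scales, the coefficient sizes are not directly comparable.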

Make predictions on the test set.

In [None]:
y_pred = model.predict(X_test_processed)

print(f'y_pred\n{y_pred[:5]}')
print(f'y_test\n{y_test[:5]}')
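The class labels returned by `predict` come from estimated probabilities. For a binary logistic regression, `predict_proba` returns the probability of each class, and `predict` picks class 1 when its probability exceeds 0.5. The sketch below illustrates this on invented one-feature data (`X_demo` and `y_demo` are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: larger values are more often in class 1
X_demo = np.array([[20], [25], [30], [35], [40], [45], [50], [55], [60], [65]])
y_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1
proba = demo_model.predict_proba(X_demo)[:, 1]
labels = demo_model.predict(X_demo)

# Thresholding the class-1 probability at 0.5 reproduces predict()
manual = (proba > 0.5).astype(int)
print(np.array_equal(labels, manual))
```

Looking at the probabilities rather than the labels can help when a threshold other than 0.5 is more appropriate, for example when missing a churner is costlier than a false alarm.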


Evaluate the model's performance on the test set.

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
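The numbers in the classification report can be derived from the confusion matrix. The sketch below does this by hand for a small invented set of labels (`y_true_demo` and `y_pred_demo` are assumptions, not model output) to show how accuracy, precision and recall for the positive class relate to the four cells.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that were found

print(cm)
print(f'accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}')
print(accuracy == accuracy_score(y_true_demo, y_pred_demo))
```

In this toy case accuracy is 0.60 while recall for the positive class is only 0.25, which shows why accuracy alone can hide poor performance on the minority class.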

END OF TUTORIAL