### Feature Discretization (Binning)

Discretization, also known as binning, converts a continuous numerical variable into a set of discrete intervals (bins). It is useful for models that work better with categorical inputs, and for capturing non-linear relationships between a feature and the target.
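To make the idea concrete, here is a minimal standard-library sketch of equal-width binning. The helper names `equal_width_edges` and `assign_bins` are hypothetical and written only for illustration; they are not part of any library. Values outside the fitted range are assumed to fall into the first or last bin.

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Return n_bins + 1 evenly spaced edges spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def assign_bins(values, edges):
    """Map each value to the index of the interval it falls in."""
    inner = edges[1:-1]  # interior cut points only, so out-of-range values are clipped
    return [bisect_right(inner, v) for v in values]

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
edges = equal_width_edges(ages, n_bins=3)
bins = assign_bins(ages, edges)

print(edges)  # four edges: the minimum, two interior cut points, the maximum
print(bins)   # one bin index (0, 1, or 2) per age
```

With these fifteen evenly spaced ages, each of the three bins receives five values.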

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a small sample dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95],
    'Outcome': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

# Separate features and target variable
X = df[['Age']]
y = df['Outcome']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier with a single tree
model = RandomForestClassifier(random_state=42, n_estimators=1)
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Display predictions and true labels
result_df = pd.DataFrame({'Age': X_test['Age'], 'True_Label': y_test, 'Prediction': y_pred})
print(result_df)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```


```text
    Age  True_Label  Prediction
9    70           1           0
11   80           1           0
0    25           0           0
Accuracy: 0.3333333333333333
```


```python
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import KBinsDiscretizer

# Create a sample dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95],
    'Outcome': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

# Separate features and target variable
X = df[['Age']]
y = df['Outcome']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Discretize the 'Age' feature into three bins using equal-width binning.
# Fit on the training set only, then apply the same bin edges to the test set.
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
X_train_binned = discretizer.fit_transform(X_train)
X_test_binned = discretizer.transform(X_test)

# Train a random forest classifier on the binned training set
model = RandomForestClassifier(random_state=42)
model.fit(X_train_binned, y_train)

# Make predictions on the binned testing set
y_pred = model.predict(X_test_binned)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

```text
Accuracy: 0.0
```

In this example:

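- We have a continuous numerical feature, 'Age', and a binary target variable, 'Outcome'.
- We split the dataset into training and testing sets.
- We use `KBinsDiscretizer` from scikit-learn to discretize the 'Age' feature into three bins using equal-width binning (`strategy='uniform'`; other strategies include `'quantile'` and `'kmeans'`).
- We train a random forest classifier on the binned training set and evaluate its performance on the binned testing set.

Because 'Outcome' simply alternates with 'Age' and the test set holds only three rows, the accuracy values (0.33 without binning, 0.0 with it) illustrate the workflow rather than any benefit of binning.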
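The difference between binning strategies is easier to see on skewed data. The sketch below uses a small hypothetical array (nine small values and one outlier, not taken from the example above) and compares the fitted `bin_edges_` for `'uniform'` and `'quantile'`. The `'kmeans'` strategy is left out to keep the sketch short.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed data: nine small values and one large outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)

edges = {}
counts = {}
for strategy in ("uniform", "quantile"):
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned = discretizer.fit_transform(values).ravel().astype(int)
    edges[strategy] = discretizer.bin_edges_[0]              # fitted edges for the single feature
    counts[strategy] = np.bincount(binned, minlength=3)      # samples per bin
    print(strategy)
    print("  edges :", np.round(edges[strategy], 2))
    print("  counts:", counts[strategy])
```

Equal-width binning splits the range 1 to 100 into three equal intervals, so the outlier leaves the middle bin empty and puts nine of the ten values in the first bin. Quantile binning places the edges so that each bin holds roughly the same number of samples.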

> Note that the `encode='ordinal'` argument represents each bin as an integer. The other options, `'onehot'` and `'onehot-dense'`, one-hot encode the bins, returning a sparse matrix and a dense array respectively.
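As a rough illustration of the three `encode` options, the sketch below bins three hypothetical ages (chosen so that each lands in a different bin) and compares the shape and type of the result.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical sample values, one per equal-width bin
ages = np.array([[25], [50], [95]], dtype=float)

ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
onehot_dense = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
onehot = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform")

ordinal_out = ordinal.fit_transform(ages)      # one column of bin indices
dense_out = onehot_dense.fit_transform(ages)   # one 0/1 column per bin, NumPy array
sparse_out = onehot.fit_transform(ages)        # same content as a SciPy sparse matrix

print(ordinal_out.shape, dense_out.shape, sparse_out.shape)
print(ordinal_out.ravel())
print(dense_out)
print(type(sparse_out))
```

The ordinal output has a single column holding the bin indices 0, 1, and 2, while both one-hot variants have one column per bin. Ordinal codes keep the bin order, which tree-based models can use directly; one-hot columns treat the bins as unordered categories, which tends to suit linear models better.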