In [1]:
import pandas as pd
import numpy as np


The audiobook dataset has features such as:

book length, price, reviews, ratings, minutes listened, completion %, etc.

The relationship between these features and a purchase is not purely linear.
For example:

A long book doesn’t always mean a purchase → it depends on whether the customer finished it, the rating, and the price.

Interactions like these are hard for logistic regression to capture unless interaction terms are engineered by hand, whereas a neural network can learn them from the data.
Logistic regression: simple, interpretable, fast, but limited to a linear decision boundary in the input features.

Deep learning: more flexible, can capture complex nonlinear interactions, but typically needs more data and tuning.
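
To make the point about interactions concrete, here is a minimal sketch on synthetic data (not the audiobook data). The label depends only on the product of two features, so a plain logistic regression should stay near chance, while the same model with a hand-made interaction column should do much better. The helper names `fit_logreg` and `accuracy` are hypothetical, and the gradient-descent fit is a simplified stand-in for a library implementation.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=5000):
    """Plain gradient-descent logistic regression (bias column added internally)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))      # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)      # gradient of the mean log loss
    return w

def accuracy(X, y, w):
    """Share of rows where the sign of the linear score matches the label."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return float(np.mean((Xb @ w > 0) == (y == 1)))

# Synthetic data: the label is 1 only when both features have the same sign
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

# Model 1: raw features only
w_plain = fit_logreg(X, y)
acc_plain = accuracy(X, y, w_plain)

# Model 2: raw features plus an explicit interaction column x1 * x2
X_inter = np.hstack([X, X[:, [0]] * X[:, [1]]])
w_inter = fit_logreg(X_inter, y)
acc_inter = accuracy(X_inter, y, w_inter)

print("accuracy without interaction term:", acc_plain)
print("accuracy with interaction term   :", acc_inter)
```

A neural network with hidden layers can learn such a product-like feature on its own, which is the motivation for trying one here.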

In [9]:
# Load the dataset and separate the features (x) from the target (y)

df = pd.read_csv('audioboook_scaled.csv')
x = df.drop("Target", axis=1)
y = df['Target']


In [16]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42,stratify=y)
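
The `stratify=y` argument in the split above is meant to keep the class proportions the same in the training and test sets. A small check, using the class counts this notebook prints in its later outputs (the training counts shown before SMOTE and the test-set support in the classification report); the helper name `positive_share` is hypothetical:

```python
# Class counts copied from this notebook's own outputs
train_counts = {0: 9476, 1: 1790}   # training labels, printed before SMOTE
test_counts = {0: 2370, 1: 447}     # test-set support from the classification report

def positive_share(counts):
    """Fraction of rows that belong to class 1."""
    return counts[1] / (counts[0] + counts[1])

train_share = positive_share(train_counts)
test_share = positive_share(test_counts)

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())
test_fraction = n_test / (n_train + n_test)

print(f"class-1 share  train: {train_share:.4f}  test: {test_share:.4f}")
print(f"test fraction of all rows: {test_fraction:.4f}")
```

Both shares come out at roughly 16%, and the test set holds about 20% of the rows, which is what `test_size=0.2` with stratification should produce.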

In [24]:
#Now we can build a neural network
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Now we can define the model

model = Sequential([
    Dense(64, activation='relu', input_shape=(x_train.shape[1],)),  # hidden layer 1
    Dropout(0.3),  # regularization to prevent overfitting
    Dense(32, activation='relu'),  # hidden layer 2
    Dense(1, activation='sigmoid')  # output layer (binary classification)
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])



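Dense(64, activation='relu') → first hidden layer with 64 neurons; input_shape tells Keras how many features each row has.

Dropout(0.3) → randomly zeroes 30% of the first hidden layer's outputs at each training step (to reduce overfitting); it is inactive at prediction time.

Dense(32, activation='relu') → second hidden layer with 32 neurons.

Dense(1, activation='sigmoid') → output layer (gives a probability between 0 and 1).

binary_crossentropy → loss function for binary classification.

adam → adaptive-learning-rate optimizer that is a common default and works well in practice.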
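
The pieces listed above can be sketched in plain NumPy to show what they compute. This is an illustration under simplifying assumptions, not the Keras implementation; the function names and the `eps` clipping value are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the values and rescale the rest
    so the expected sum stays the same (often called inverted dropout)."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(42)

print("sigmoid(0)   :", sigmoid(0.0))
print("confident+right loss:", binary_crossentropy(np.array([1.0, 0.0]), np.array([0.9, 0.1])))
print("confident+wrong loss:", binary_crossentropy(np.array([1.0, 0.0]), np.array([0.1, 0.9])))

hidden = np.ones(100_000)
dropped = dropout(hidden, rate=0.3, rng=rng)
print("share zeroed by dropout:", float(np.mean(dropped == 0)))
```

The loss is small when the predicted probability agrees with the label and grows quickly when a confident prediction is wrong, which is what pushes the network toward calibrated probabilities.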

In [27]:
history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=20,
    batch_size=32,
    verbose=1
)

Epoch 1/20
353/353 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.8387 - loss: 0.4202 - val_accuracy: 0.8974 - val_loss: 0.2868
Epoch 2/20
353/353 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.8991 - loss: 0.2772 - val_accuracy: 0.9049 - val_loss: 0.2627
Epoch 3/20
353/353 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8998 - loss: 0.2620 - val_accuracy: 0.9038 - val_loss: 0.2558
Epoch 4/20
353/353 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8973 - loss: 0.2651 - val_accuracy: 0.9056 - val_loss: 0.2533
Epoch 5/20
353/353 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9105 - loss: 0.2420 - val_accuracy: 0.9081 - val_loss: 0.2470
Epoch 6/20
353/353 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9052 - loss: 0.2484 - val_accuracy: 0.8981 - val_loss: 0.2508
Epoch 7/20
353/353

In [29]:
# Let's evaluate the model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, roc_auc_score

# Predictions: call predict once, then threshold the probabilities
y_prob = model.predict(x_test).ravel()          # predicted probability of class 1
y_pred = (y_prob > 0.5).astype("int32")         # class labels at threshold 0.5

# Evaluation Metrics
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
print("ROC AUC :", roc_auc_score(y_test, y_prob))

print("\nClassification Report :\n", classification_report(y_test, y_pred))
print("Confusion Matrix :\n", confusion_matrix(y_test, y_pred))

Accuracy : 0.9094781682641108
Precision : 0.9615384615384616
Recall : 0.44742729306487694
F1 Score : 0.6106870229007634
ROC AUC : 0.9043166350446955

Classification Report :
               precision    recall  f1-score   support

           0       0.91      1.00      0.95      2370
           1       0.96      0.45      0.61       447

    accuracy                           0.91      2817
   macro avg       0.93      0.72      0.78      2817
weighted avg       0.91      0.91      0.90      2817

Confusion Matrix :
 [[2362    8]
 [ 247  200]]
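
The headline metrics above can be re-derived by hand from the confusion matrix, which makes it easier to see why accuracy is high while recall is low. The four counts are copied from the output above; scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

```python
# Counts copied from the confusion matrix printed above
tn, fp = 2362, 8      # true class 0: correctly rejected / wrongly flagged
fn, tp = 247, 200     # true class 1: missed buyers / correctly found buyers

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are right
precision = tp / (tp + fp)            # of predicted buyers, how many really bought
recall = tp / (tp + fn)               # of real buyers, how many were found
f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall

print(f"accuracy : {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall   : {recall:.4f}")
print(f"f1       : {f1:.4f}")
```

The model almost never flags a non-buyer (8 false positives), but it misses 247 of the 447 real buyers, so accuracy is dominated by the large class 0.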


In [31]:
# As with logistic regression, class 1 is underrepresented and poorly detected (recall 0.45), so let's try SMOTE now


We should consider SMOTE (Synthetic Minority Over-sampling Technique) when:
A class imbalance exists (e.g., 90% vs 10%).
We care about minority-class prediction (e.g., churn, fraud, disease detection).
Plain oversampling (simply copying minority examples) may cause overfitting.
For our audiobook dataset → predicting whether a customer will purchase (Target=1) is what matters most, and only about 16% of the training rows are purchases, so SMOTE is a reasonable choice.


Instead of just copying minority-class rows, SMOTE generates new synthetic examples that lie between existing ones. Here’s how:
For each minority-class data point, SMOTE finds its k nearest neighbors within the minority class (default k=5).
It picks one of those neighbors at random.
It then creates a new synthetic sample at a random position on the line segment between the two points.

SMOTE balances the classes by creating synthetic minority samples.
Use it when you care about minority-class performance (recall, F1).
It can make the model less biased toward the majority class and improve detection of rare events, usually at some cost in precision.
Apply it to the training data only, so the test set keeps the real class distribution.
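
The interpolation steps described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_sketch` and the toy data are assumptions made for this example, and it requires more minority points than `k`.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows from minority-class rows X_min."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority points
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbors

    # Pick a base point, then one of its k neighbors at random
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the new sample at a random position on the segment between the two
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class: 20 points with 2 features
rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))
synthetic = smote_sketch(X_min, n_new=50)
print("synthetic shape:", synthetic.shape)
```

Because every synthetic row is a blend of two real minority rows, it always falls inside the region the minority class already occupies; SMOTE adds density there, not new information.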

In [45]:
import sys
!{sys.executable} -m pip install -U imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.12.4-py3-none-any.whl.metadata (8.3 kB)
Downloading imbalanced_learn-0.12.4-py3-none-any.whl (258 kB)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.12.4


In [47]:
#Apply SMOTE
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

print("Before SMOTE:", y_train.value_counts().to_dict())
print("After SMOTE:", dict(zip(*np.unique(y_train_resampled, return_counts=True))))



Before SMOTE: {0: 9476, 1: 1790}
After SMOTE: {np.int64(0): np.int64(9476), np.int64(1): np.int64(9476)}
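
A quick arithmetic check of what the counts above imply: how many synthetic rows SMOTE added, and how many batches one epoch takes at `batch_size=32` before and after resampling. The counts are copied from the output above; the variable names are chosen for this example.

```python
import math

# Class counts copied from the output above
before = {0: 9476, 1: 1790}
after = {0: 9476, 1: 9476}
batch_size = 32

n_synthetic = after[1] - before[1]        # rows SMOTE had to invent
n_before = sum(before.values())           # training rows before SMOTE
n_after = sum(after.values())             # training rows after SMOTE

# Keras runs one step per batch, and the last batch may be partial
steps_before = math.ceil(n_before / batch_size)
steps_after = math.ceil(n_after / batch_size)

print("synthetic minority rows:", n_synthetic)
print("rows before / after    :", n_before, "/", n_after)
print("steps per epoch        :", steps_before, "->", steps_after)
```

The step counts match the `353/353` and `593/593` progress counters in the two training logs, and they show that most class-1 rows in the resampled training set are synthetic (7686 of 9476).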


In [51]:
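# Retrain with the resampled data.
# Note: `model` was already fitted above, so this call continues training from those weights;
# rebuild and recompile the model first if you want a from-scratch comparison.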

history = model.fit(
    x_train_resampled, y_train_resampled,
    validation_data=(x_test, y_test),  # keep test untouched
    epochs=20,
    batch_size=32,
    verbose=1
)

Epoch 1/20
593/593 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8012 - loss: 0.3709 - val_accuracy: 0.8537 - val_loss: 0.3573
Epoch 2/20
593/593 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8125 - loss: 0.3504 - val_accuracy: 0.8576 - val_loss: 0.3267
Epoch 3/20
593/593 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8190 - loss: 0.3470 - val_accuracy: 0.8605 - val_loss: 0.3173
Epoch 4/20
593/593 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8148 - loss: 0.3503 - val_accuracy: 0.8616 - val_loss: 0.3106
Epoch 5/20
593/593 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8180 - loss: 0.3492 - val_accuracy: 0.8601 - val_loss: 0.3409
Epoch 6/20
593/593 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8205 - loss: 0.3376 - val_accuracy: 0.8573 - val_loss: 0.3323
Epoch 7/20
593/593

In [61]:
# Evaluate again

y_prob = model.predict(x_test).ravel()   # predicted probability of class 1
# Apply a stricter decision threshold of 0.7 (not the default; the first evaluation used 0.5)
y_pred_classes = (y_prob > 0.7).astype("int32")

print("Accuracy :", accuracy_score(y_test, y_pred_classes))
print("Precision :", precision_score(y_test, y_pred_classes))
print("Recall :", recall_score(y_test, y_pred_classes))
print("F1 Score :", f1_score(y_test, y_pred_classes))
print("ROC AUC :", roc_auc_score(y_test, y_prob))

Accuracy : 0.8821441249556266
Precision : 0.6280623608017817
Recall : 0.6308724832214765
F1 Score : 0.6294642857142857
ROC AUC : 0.905415380549184
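
Since this evaluation uses a 0.7 threshold while the first one used 0.5, it helps to see how the threshold alone moves precision and recall. The sketch below uses ten made-up labels and probabilities (not the model's predictions); the helper name `precision_recall_at` is hypothetical.

```python
import numpy as np

def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall for class 1 when predicting 1 if prob > threshold."""
    y_hat = y_prob > threshold
    tp = np.sum(y_hat & (y_true == 1))
    fp = np.sum(y_hat & (y_true == 0))
    fn = np.sum(~y_hat & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Made-up labels and predicted probabilities, for illustration only
y_true_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.90])

results = {}
for t in (0.3, 0.5, 0.7):
    results[t] = precision_recall_at(y_true_toy, y_prob_toy, t)
    print(f"threshold {t}: precision={results[t][0]:.3f}  recall={results[t][1]:.3f}")
```

Raising the threshold can only keep recall the same or lower it, while precision often (but not always) rises. ROC AUC does not depend on the threshold, which is why it is the easiest number to compare between the two evaluations in this notebook.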
