# Introduction
Blood transfusions are a critical component of healthcare, often serving as a life-saving intervention in situations such as major surgeries, severe injuries, and the treatment of numerous medical conditions including anemia and cancer. Maintaining a steady supply of blood is vital for healthcare systems worldwide, yet ensuring an adequate and timely supply is a persistent challenge. According to the American Red Cross, every two seconds someone in the United States needs blood, underscoring the constant demand for donations.

Our dataset comes from a mobile blood donation unit in Taiwan. This unit visits various universities to conduct blood drives, aiming to collect enough donations to meet the ongoing need. Predicting donor behavior, specifically whether a donor will give blood the next time the mobile unit visits, is crucial for optimizing these blood drives.

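The dataset, named transfusion, is structured around the RFMTC marketing model, a variation of the traditional RFM (Recency, Frequency, Monetary) model. Its columns record Recency (months since the last donation), Frequency (total number of donations), Monetary (total blood donated in c.c.), Time (months since the first donation), and a binary target indicating whether the donor gave blood in March 2007. We will explore this model and how it applies to blood donation behavior in this analysis.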

# Task 1 and 2: Load the Dataset
In this section, we import necessary libraries and load the dataset containing blood donation records. We display the first few rows of the dataset to get an initial look at the structure and contents of the data.

In [8]:
# Import necessary libraries and load the dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset and display the first few rows
transfusion = pd.read_csv('transfusion.csv')
print(transfusion.head())


   Recency (months)  Frequency (times)  Monetary (c.c. blood)  Time (months)  \
0                 2                 50                  12500             98   
1                 0                 13                   3250             28   
2                 1                 16                   4000             35   
3                 2                 20                   5000             45   
4                 1                 24                   6000             77   

   whether he/she donated blood in March 2007  
0                                           1  
1                                           1  
2                                           1  
3                                           1  
4                                           0  
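
In the five rows shown above, Monetary (c.c. blood) is exactly 250 times Frequency (times), which would be consistent with a fixed volume per donation. The sketch below re-enters just those five rows by hand and checks the ratio. Whether the relationship holds for all 748 rows is an assumption that would need the same check on the full `transfusion` frame.

```python
import pandas as pd

# The five rows printed by transfusion.head(), re-entered by hand for illustration
head_rows = pd.DataFrame({
    'Recency (months)': [2, 0, 1, 2, 1],
    'Frequency (times)': [50, 13, 16, 20, 24],
    'Monetary (c.c. blood)': [12500, 3250, 4000, 5000, 6000],
    'Time (months)': [98, 28, 35, 45, 77],
})

# Volume donated per donation for each row
cc_per_donation = head_rows['Monetary (c.c. blood)'] / head_rows['Frequency (times)']
print(cc_per_donation.unique())

# True if Monetary is a constant multiple of Frequency in these rows
is_redundant = cc_per_donation.nunique() == 1
print(is_redundant)
```

If the ratio turns out to be constant across the full dataset, the two columns carry the same information, and a linear model would gain little from having both.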


# Task 3: Dataset Information
Here, we display general information about the dataset, including the number of entries, the columns present, and their data types. This helps us understand the data's structure and check for any missing values.

In [9]:
# Get information about the dataset
print(transfusion.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB
None


# Task 4: Rename the Target Column
We rename the target column to 'target' for simplicity. This makes it easier to reference this column in subsequent steps. We then display the first two rows to confirm the change.


In [10]:
# Rename the target column for simplicity
transfusion.rename(columns={'whether he/she donated blood in March 2007': 'target'}, inplace=True)
print(transfusion.head(2))


   Recency (months)  Frequency (times)  Monetary (c.c. blood)  Time (months)  \
0                 2                 50                  12500             98   
1                 0                 13                   3250             28   

   target  
0       1  
1       1  


# Task 5: Value Counts of the Target Column
This section provides the distribution of the target variable to understand the balance between the classes (donated or not). We look at both the count and the proportion of each class to gauge if the dataset is imbalanced.


In [11]:
# Display the value counts of the target column
print(transfusion['target'].value_counts())
print(transfusion['target'].value_counts(normalize=True))


target
0    570
1    178
Name: count, dtype: int64
target
0    0.762032
1    0.237968
Name: proportion, dtype: float64


# Task 6: Split the Dataset
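We split the dataset into training and testing sets, holding out 20% of the rows for testing so that the model is evaluated on data it has not seen during training. Stratifying on the target keeps the class proportions the same in both sets. No random_state is passed, so the rows assigned to each set, and therefore the scores reported below, can change from run to run.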


In [12]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns='target'),
    transfusion.target,
    test_size=0.2,
    stratify=transfusion.target
)
print(X_train.head(2))


     Recency (months)  Frequency (times)  Monetary (c.c. blood)  Time (months)
285                11                  2                    500             14
731                14                  3                    750             79
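
The effect of stratification can be checked directly. The sketch below is a minimal illustration on synthetic labels that have the same class counts as the target column. The placeholder feature and the `random_state=42` seed are assumptions added for repeatability and are not part of the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels with the same class counts as the target column
y = np.array([0] * 570 + [1] * 178)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature: just the row index

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42,  # assumption: seed added here so the split is repeatable
)

# With stratification, both rates should sit close to the overall 0.238
print(f'Train positive rate: {y_tr.mean():.3f}')
print(f'Test positive rate:  {y_te.mean():.3f}')
```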


# Task 7: Model Selection with TPOT
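We use TPOT, an automated machine learning tool, to search for a good model for predicting blood donation. TPOT uses genetic programming to evolve candidate pipelines of preprocessing steps and models over several generations, and keeps the pipeline with the highest internal cross-validation score, here measured by ROC AUC. The 'TPOT light' configuration limits the search to simple, fast operators. The selected pipeline is then evaluated on the held-out test set using the AUC score.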


In [13]:
# Use TPOT to find the best model
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')
print('\nBest pipeline steps:')
# List each step of the fitted pipeline in order (step names are not needed here)
for idx, (_, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}. {transform}')


                                                                             
Generation 1 - Current best internal CV score: 0.7528179787657192
Generation 2 - Current best internal CV score: 0.7528179787657192
Generation 3 - Current best internal CV score: 0.7528179787657192
Generation 4 - Current best internal CV score: 0.7528179787657192
Generation 5 - Current best internal CV score: 0.7581001591041214

Best pipeline: LogisticRegression(ZeroCount(MinMaxScaler(input_matrix)), C=25.0, dual=False, penalty=l2)

AUC score: 0.7745

Best pipeline steps:
1. MinMaxScaler()
2. ZeroCount()
3. LogisticRegression
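
A rough scikit-learn approximation of the selected pipeline is sketched below on synthetic data. It keeps the MinMaxScaler and LogisticRegression(C=25.0) steps reported by TPOT but omits ZeroCount, a TPOT-specific transformer, so it is not the exact fitted pipeline. The synthetic data, the seeds, and the `max_iter` value are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four donor features (not the real data)
X, y = make_classification(
    n_samples=748, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.76, 0.24], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale each feature to [0, 1], then fit a logistic regression
# (scikit-learn's default penalty is L2, matching the TPOT report)
approx_pipeline = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(C=25.0, max_iter=1000),
)
approx_pipeline.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, approx_pipeline.predict_proba(X_te)[:, 1])
print(f'AUC on synthetic data: {auc:.4f}')
```

Wrapping the scaler and the model in one pipeline means the scaling parameters are learned from the training rows only, which avoids leaking information from the test set.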

# Task 8: Variance of Training Set Features
We calculate the variance of each feature in the training set. A feature whose variance is orders of magnitude larger than the rest can dominate the fit of a linear model such as Logistic Regression, so this check helps identify features that may need normalization or transformation.


In [14]:
# Display variance of the training set
print(X_train.var().round(3))


Recency (months)              58.042
Frequency (times)             34.472
Monetary (c.c. blood)    2154477.271
Time (months)                580.339
dtype: float64


# Task 9: Normalize 'Monetary (c.c. blood)' Feature
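We apply a log transformation to the 'Monetary (c.c. blood)' feature, whose variance is far larger than that of the other features. Taking the log compresses large values, which reduces the influence of extreme values and skew; here it brings the variance down from roughly 2.15 million to 0.826. The transformed column is stored as monetary_log and the original column is dropped.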


In [15]:
# Normalize the 'Monetary (c.c. blood)' column using log transformation
X_train_normed, X_test_normed = X_train.copy(), X_test.copy()
col_to_normalize = 'Monetary (c.c. blood)'
for df_ in [X_train_normed, X_test_normed]:
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    df_.drop(columns=col_to_normalize, inplace=True)
print(X_train_normed.var().round(3))


Recency (months)      58.042
Frequency (times)     34.472
Time (months)        580.339
monetary_log           0.826
dtype: float64
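
The effect of the log transform on variance can be seen on a handful of values. The sketch below uses only the five Monetary values printed in the first rows of the dataset, so its numbers will not match the full training-set variances above. The `np.log1p` line is an aside rather than part of this analysis: it is a common alternative when a column may contain zeros, where `np.log` would return negative infinity.

```python
import numpy as np

# Monetary values from the first five rows printed earlier (illustration only)
monetary = np.array([12500, 3250, 4000, 5000, 6000], dtype=float)

# ddof=1 gives the sample variance, matching the pandas .var() default
raw_variance = monetary.var(ddof=1)
log_variance = np.log(monetary).var(ddof=1)

print(f'Raw variance: {raw_variance:.1f}')
print(f'Log variance: {log_variance:.3f}')

# log1p(x) = log(1 + x) stays finite at zero
print(np.log1p(0.0))
```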


# Task 10: Logistic Regression Model
We train a Logistic Regression model on the normalized dataset to predict blood donation. Logistic Regression is a simple yet effective model for binary classification. We evaluate its performance using the AUC score.


In [16]:
# Logistic Regression model
from sklearn import linear_model
logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)
logreg.fit(X_train_normed, y_train)
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')



AUC score: 0.7691


# Task 11: Model Comparison
We compare the AUC scores of the models generated by TPOT and the manually implemented Logistic Regression model. This comparison helps us understand which approach provided a better predictive performance for this dataset.


In [17]:
# Compare TPOT and Logistic Regression AUC scores
from operator import itemgetter
model_scores = sorted(
    [('tpot', tpot_auc_score), ('logreg', logreg_auc_score)],
    key=itemgetter(1),
    reverse=True
)
print(model_scores)


[('tpot', 0.7744883040935673), ('logreg', 0.7691276803118908)]
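
Because the two test scores differ by only about 0.005 and come from a single split, a cross-validated comparison may give a steadier picture. The sketch below shows one way to do that with `cross_val_score`. It runs on synthetic data, and the fold count and seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the donor features (not the real data)
X, y = make_classification(
    n_samples=748, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.76, 0.24], random_state=0,
)

# Five stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(solver='liblinear', random_state=42)

# One ROC AUC value per fold
auc_scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print(f'AUC per fold: {auc_scores.round(3)}')
print(f'Mean AUC: {auc_scores.mean():.3f} (std {auc_scores.std():.3f})')
```

Comparing the mean and spread across folds for each candidate model gives a sense of whether a gap of this size is larger than the fold-to-fold variation.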


## Results and Significance:
The analysis involves predicting blood donations using a dataset of previous donation behavior. We used TPOT, an automated machine learning tool, to select the best model and compared it with a manually implemented Logistic Regression model. TPOT selected a pipeline of MinMaxScaler, ZeroCount, and Logistic Regression, which reached a test AUC score of 0.7745, while our manual Logistic Regression model, trained on the log-normalized features, achieved an AUC score of 0.7691.

- TPOT Model AUC Score: 0.7745
- Manual Logistic Regression Model AUC Score: 0.7691

These results show that both models are close in performance, with the TPOT pipeline slightly ahead by about 0.005. Because the train/test split is not seeded, a gap this small may not hold from run to run. The AUC score measures how well a model ranks donors above non-donors: 0.5 corresponds to random guessing and 1.0 to perfect separation. In this context, both models provide a reasonable ability to predict future blood donations.

Accurate forecasting of blood supply is crucial for ensuring that sufficient blood is available when needed, particularly during busy periods like holidays when donation rates may fluctuate. Therefore, implementing these models can help in making informed decisions about blood donation drives and inventory management.