# Predict Blood Donation for Future Expectancy 

Forecasting blood supply is a serious and recurrent problem for blood collection managers. In January 2019, for example, it was reported that "Nationwide, the Red Cross saw 27,000 fewer blood donations over the holidays than they see at other times of the year." Machine learning can learn patterns in donor data that help predict future blood donations and, in turn, help save more lives.

# Loading the blood donations data

In [21]:
import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/Dell/Desktop/transfusion.data")
df

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


# Overview of dataset

In [6]:
df.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [7]:
df.shape

(748, 5)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


# Renaming the target column

In [9]:
df.rename(
    columns={'whether he/she donated blood in March 2007': 'target'},
    inplace=True
)

df.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


# Checking target incidence

In [10]:
df.target.value_counts(normalize=True)

0    0.762032
1    0.237968
Name: target, dtype: float64
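
The incidence above corresponds to 178 donors out of 748 rows (178/748 ≈ 0.238). As a rough illustration of what this imbalance means for evaluation, the sketch below builds synthetic labels with the same class counts and scores a constant "always predict 0" baseline. The arrays `y` and `X` are stand-ins invented for this example, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with the same class balance as the target column
# (570 zeros, 178 ones); these are NOT the real rows.
y = np.array([0] * 570 + [1] * 178)
X = np.zeros((len(y), 1))  # features are irrelevant for a constant baseline

# Baseline that always predicts the most frequent class (0)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

baseline_accuracy = accuracy_score(y, baseline.predict(X))
baseline_auc = roc_auc_score(y, baseline.predict_proba(X)[:, 1])

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline AUC:      {baseline_auc:.3f}")
```

The baseline looks accurate because of the imbalance, yet it cannot rank donors at all, which is one reason AUC is a more informative metric than accuracy here.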

# Splitting transfusion into train and test datasets

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='target'),
    df.target,
    test_size=0.25,
    random_state=42,
    stratify=df.target  # keep the target incidence the same in both splits
)
X_train.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
334,16,2,500,16
99,5,7,1750,26
116,2,7,1750,46
661,16,2,500,16
154,2,1,250,2
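
To see what `stratify` is doing, one could compare the target incidence in the full data with the incidence in each split. The sketch below does this on synthetic labels with the same class counts as the notebook (570 zeros, 178 ones); the `feature` column and the `_demo` names are placeholders made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 570 non-donors and 178 donors
y = pd.Series([0] * 570 + [1] * 178, name="target")
X = pd.DataFrame({"feature": np.arange(len(y))})  # placeholder feature

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# With stratification, the share of 1s should be nearly identical everywhere
incidence = pd.Series(
    {"full": y.mean(), "train": y_tr_demo.mean(), "test": y_te_demo.mean()}
)
print(incidence.round(3))
```

Without `stratify`, the test set could by chance contain noticeably more or fewer donors than the training set, which would make the reported AUC less reliable.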


# Selecting model using TPOT

In [18]:
pip install tpot

Note: you may need to restart the kernel to use updated packages.


In [12]:
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

print('\nBest pipeline steps:')
for idx, (_, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}.{transform}')


Generation 1 - Current best internal CV score: 0.7422459184429089
Generation 2 - Current best internal CV score: 0.7422459184429089
Generation 3 - Current best internal CV score: 0.7422459184429089
Generation 4 - Current best internal CV score: 0.7422459184429089
Generation 5 - Current best internal CV score: 0.7423330644124079
Best pipeline: LogisticRegression(input_matrix, C=0.1, dual=False, penalty=l2)

AUC score: 0.7853

Best pipeline steps:
1.LogisticRegression(C=0.1, random_state=42)


# Checking the variance

In [13]:
X_train.var().round(3)

Recency (months)              66.929
Frequency (times)             33.830
Monetary (c.c. blood)    2114363.700
Time (months)                611.147
dtype: float64
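
One way to read these numbers: in every row displayed in this notebook, Monetary equals 250 times Frequency (for example, 50 × 250 = 12500). If that holds across the training set, the identity Var(cX) = c² Var(X) would explain why Monetary's variance is so much larger. The short check below uses only the two variances printed above; it is a sanity check on that reading, not a proof.

```python
# Variances copied from the X_train.var().round(3) output above
frequency_var = 33.830
monetary_var = 2114363.700

# If Monetary = c * Frequency, then Var(Monetary) = c**2 * Var(Frequency),
# so the implied scale factor is the square root of the variance ratio.
implied_scale = (monetary_var / frequency_var) ** 0.5
print(f"Implied scale factor: {implied_scale:.2f}")
```

The implied factor is very close to 250, which suggests that the two columns carry the same information on different scales.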

# Log normalization

In [14]:
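# Work on copies so the original train/test splits stay unchanged
X_train_norm, X_test_norm = X_train.copy(), X_test.copy()

# Column with the highest variance, to be log-transformed
col_norm = 'Monetary (c.c. blood)'

# Log normalization: replace the column with its natural logarithm
for df_ in [X_train_norm, X_test_norm]:
    df_['monetary_log'] = np.log(df_[col_norm])
    df_.drop(columns=col_norm, inplace=True)

# Check the variance again after the transformation
X_train_norm.var()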

Recency (months)      66.929017
Frequency (times)     33.829819
Time (months)        611.146588
monetary_log           0.837458
dtype: float64
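
A minimal sketch of why the log transform shrinks the variance so sharply, using made-up data. It assumes, as the displayed rows suggest, that the monetary value is 250 times the donation count; the `frequency` values here are random and purely illustrative. Note that `np.log` needs strictly positive inputs, which holds for these values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical donation counts between 1 and 50 (not the real data)
frequency = rng.integers(1, 51, size=500)
monetary = 250 * frequency  # assumed relationship: 250 c.c. per donation

var_raw = monetary.var(ddof=1)               # sample variance, as pandas uses
var_log = np.log(monetary).var(ddof=1)
var_log_freq = np.log(frequency).var(ddof=1)

print(f"Variance of raw monetary:   {var_raw:.1f}")
print(f"Variance of log(monetary):  {var_log:.3f}")
print(f"Variance of log(frequency): {var_log_freq:.3f}")
```

Because log(250 · f) = log(250) + log(f), the constant factor becomes an additive shift and drops out of the variance, so the last two values agree.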

# Logistic regression model

In [19]:
from sklearn import linear_model

logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=69
)

# Train the model
logreg.fit(X_train_norm, y_train)

# AUC score for the logistic regression model
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_norm)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')

print("Tpot AUC Score: ",tpot_auc_score.round(2))
print("Logistic Regression AUC Score: ",logreg_auc_score.round(2))


AUC score: 0.7891
Tpot AUC Score:  0.79
Logistic Regression AUC Score:  0.79
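
The two scores above come from models that differ in more than the log transform (for instance, the solver and the regularization strength `C`). To isolate the effect of the transform alone, one possible approach is to fit the same estimator on raw and on log-transformed features. The sketch below does this on synthetic data; the features, labels, and the helper `holdout_auc` are all hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 600

# Hypothetical features: one heavily skewed, one roughly standard normal
skewed = rng.lognormal(mean=6.0, sigma=1.0, size=n)
other = rng.normal(size=n)

# Hypothetical labels whose log-odds depend on the log of the skewed feature
logits = 1.5 * (np.log(skewed) - 6.0) + 0.5 * other
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)


def holdout_auc(X, y):
    """Fit the same logistic regression on a stratified split and return test AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )
    model = LogisticRegression(solver="liblinear", random_state=42).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


auc_raw = holdout_auc(np.column_stack([skewed, other]), y_demo)
auc_log = holdout_auc(np.column_stack([np.log(skewed), other]), y_demo)

print(f"AUC with raw feature: {auc_raw:.4f}")
print(f"AUC with log feature: {auc_log:.4f}")
```

Keeping the estimator, split, and seed fixed means any difference between the two scores can be attributed to the transform rather than to the model settings.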


# Conclusion


The demand for blood fluctuates throughout the year. As one prominent example, blood donations slow down during busy holiday seasons. An accurate forecast of the future blood supply allows appropriate action to be taken ahead of time, which can save more lives.

In this notebook, we explored automatic model selection using TPOT, and the resulting pipeline achieved an AUC score of 0.7853 on the test set. For context, the target incidence shows that a model that always predicts 0 would be correct about 76% of the time, but such a constant predictor cannot rank donors and has an AUC of only 0.5. We then log normalized the high-variance Monetary feature and fit a logistic regression model, which raised the AUC score to 0.7891, an improvement of about 0.5%. In the field of machine learning, even small improvements can be important, depending on the purpose.

Another benefit of using a logistic regression model is that it is interpretable. We can analyze how much of the variance in the response variable (target) can be explained by the other variables in our dataset.
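
One common way to interpret a fitted logistic regression is to exponentiate its coefficients, which gives odds ratios: values above 1 mean the feature raises the odds of the positive class, and values below 1 mean it lowers them. The sketch below illustrates this on synthetic data; the column names `recency` and `frequency` and the way labels are generated are assumptions for the example, not results from the transfusion dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical features loosely modeled on the notebook's columns
X_demo = pd.DataFrame({
    "recency": rng.integers(0, 40, size=n),    # months since last donation
    "frequency": rng.integers(1, 30, size=n),  # number of past donations
})

# Hypothetical ground truth: recent, frequent donors are more likely to donate
true_logits = -0.1 * X_demo["recency"] + 0.1 * X_demo["frequency"]
y_demo = (rng.random(n) < 1 / (1 + np.exp(-true_logits))).astype(int)

model = LogisticRegression(solver="liblinear", random_state=0).fit(X_demo, y_demo)

# exp(coefficient) = multiplicative change in odds per one-unit increase
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X_demo.columns)
print(odds_ratios.round(3))
```

Applied to the notebook's model, the same idea would use `logreg.coef_[0]` with `X_train_norm.columns`, keeping in mind that the features are on different scales, so the magnitudes are not directly comparable.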