Blood Transfusion Service Center Data Set (https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center)

Data Set Information:

To demonstrate the RFMTC marketing model (a modified version of RFM), this study 
adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu 
City, Taiwan. About every three months, the center sends its blood transfusion 
service bus to one university in Hsin-Chu City to collect donated blood. To 
build the RFMTC model, we selected 748 donors at random from the donor database. 
Each of these 748 donor records includes R (Recency - months since last 
donation), F (Frequency - total number of donations), M (Monetary - total blood 
donated in c.c.), T (Time - months since first donation), and a binary variable 
indicating whether the donor gave blood in March 2007 (1 stands for donating 
blood; 0 stands for not donating blood).

Attribute Information:

Given are the variable name, variable type, measurement unit, and a brief 
description. The "Blood Transfusion Service Center" data set poses a 
classification problem. The order of this listing corresponds to the order of 
the values in each row of the database.

 - R (Recency - months since last donation)
 - F (Frequency - total number of donations)
 - M (Monetary - total blood donated in c.c.)
 - T (Time - months since first donation)
 - a binary variable indicating whether the donor gave blood in March 2007 
   (1 stands for donating blood; 0 stands for not donating blood)

In [13]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
import dill
# import xgboost as xgb

In [2]:
df = pd.read_csv("data/webinar_6/transfusion.data", header=0, names=['recency', 'frequency', 'monetary', 'time', 'target'], dtype='int64')
df.head()

Unnamed: 0,recency,frequency,monetary,time,target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [3]:
df.drop_duplicates(inplace=True)
df.reset_index(inplace=True, drop=True)

In [4]:
# Hold out 20% of the rows for testing; all columns except the last are features
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df['target'], test_size=0.2, random_state=42)
# Save the test split (without the index column)
X_test.to_csv("./data/webinar_9/X_test.csv", index=False)
y_test.to_csv("./data/webinar_9/y_test.csv", index=False)
# Save the train split (without the index column)
X_train.to_csv("./data/webinar_9/X_train.csv", index=False)
y_train.to_csv("./data/webinar_9/y_train.csv", index=False)
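
The saved splits are presumably meant to be read back in a later step. The sketch below illustrates that round trip on a small made-up frame: the values, the temporary directory, and the names `toy`, `X_back`, and `y_back` are assumptions for illustration, not part of the notebook. It shows that writing without the index yields files that reload with a fresh 0-based index and a single `target` column for the labels.

```python
import os
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the notebook's column names (values are made up)
toy = pd.DataFrame({
    "recency": [2, 0, 1, 2, 1, 4, 2, 1],
    "frequency": [50, 13, 16, 20, 24, 4, 7, 12],
    "monetary": [12500, 3250, 4000, 5000, 6000, 1000, 1750, 3000],
    "time": [98, 28, 35, 45, 77, 4, 14, 35],
    "target": [1, 1, 1, 1, 0, 0, 1, 0],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.iloc[:, :-1], toy["target"], test_size=0.25, random_state=42
)

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_test.csv")
    y_path = os.path.join(tmp, "y_test.csv")

    # Write without the index, as in the cell above
    X_te.to_csv(x_path, index=False)
    y_te.to_csv(y_path, index=False)

    # Read back: features as a DataFrame, labels as the 'target' column
    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)["target"]

print(X_back.shape, y_back.shape)
```

Because the original row index is dropped on write, the reloaded frames match the in-memory splits only after `reset_index(drop=True)`.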

In [5]:
features = ['recency', 'frequency', 'time']
target = 'target'
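
The feature list leaves out `monetary`. The notebook does not say why; one plausible reason (an assumption) is that `monetary` carries no information beyond `frequency`. In the five preview rows printed by `df.head()`, `monetary` is exactly 250 times `frequency`. The sketch below checks this on those five rows only; it does not verify the full data set.

```python
import numpy as np
import pandas as pd

# The five rows shown in the df.head() output
preview = pd.DataFrame({
    "recency": [2, 0, 1, 2, 1],
    "frequency": [50, 13, 16, 20, 24],
    "monetary": [12500, 3250, 4000, 5000, 6000],
    "time": [98, 28, 35, 45, 77],
    "target": [1, 1, 1, 1, 0],
})

# Volume per donation implied by each row
cc_per_donation = preview["monetary"] / preview["frequency"]

# True if every preview row satisfies monetary == 250 * frequency
is_fixed_multiple = bool((preview["monetary"] == 250 * preview["frequency"]).all())

# Pearson correlation between the two columns
corr = preview["frequency"].corr(preview["monetary"])

print(cc_per_donation.tolist(), is_fixed_multiple, corr)
```

If the same relation holds for all rows, keeping both columns would give a linear model two perfectly collinear inputs, so dropping one of them is a reasonable choice.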

In [24]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
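
A short usage sketch for the selector, on a made-up two-column frame (the `toy` data is an assumption for illustration). The double brackets in `transform` return a one-column DataFrame rather than a 1-D Series, which is the 2-D shape that `StandardScaler` expects.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a single column and return it as a one-column DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]


# Made-up data with two of the notebook's column names
toy = pd.DataFrame({"recency": [2, 0, 4], "frequency": [50, 13, 16]})

# fit_transform comes from TransformerMixin: fit(), then transform()
selected = ColumnSelector(key="recency").fit_transform(toy)

# The 2-D selection can be passed straight to the scaler
scaled = StandardScaler().fit_transform(selected)

print(selected.shape, scaled.ravel())
```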

In [25]:
final_transformers = []
for feat_name in features:
    # One sub-pipeline per feature: select the column, then standardize it
    feat_transformer = Pipeline([
        ('selector', ColumnSelector(key=feat_name)),
        ('scaler', StandardScaler()),
    ])
    final_transformers.append((feat_name, feat_transformer))

In [31]:
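# Concatenate the per-feature outputs into one feature matrix
feats = FeatureUnion(final_transformers)

# Note: despite the name `regressor`, this pipeline is a binary classifier
regressor = Pipeline([
    ('features', feats),
    ('classifier', LogisticRegression()),
])
regressor.fit(X_train, y_train)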

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('recency',
                                                 Pipeline(steps=[('selector',
                                                                  ColumnSelector(key='recency')),
                                                                 ('scaler',
                                                                  StandardScaler())])),
                                                ('frequency',
                                                 Pipeline(steps=[('selector',
                                                                  ColumnSelector(key='frequency')),
                                                                 ('scaler',
                                                                  StandardScaler())])),
                                                ('time',
                                                 Pipeline(steps=[('selector',
                                

In [32]:
with open("data/webinar_9/logreg_pipeline.dill", "wb") as f:
    dill.dump(regressor, f)
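
To use the saved pipeline later, it has to be loaded back and should give the same predictions as before. The sketch below illustrates the round trip in memory on made-up data. The toy values and the names `model`, `payload`, and `restored` are assumptions, and it falls back to the standard-library `pickle` (which offers the same `dumps`/`loads` calls) if `dill` is not installed.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

try:
    import dill as serializer  # used by the notebook
except ImportError:
    import pickle as serializer  # fallback for this sketch only


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a single column and return it as a one-column DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]


# Made-up training data with the notebook's three feature names
toy_X = pd.DataFrame({
    "recency": [2, 0, 1, 2, 1, 4, 2, 1],
    "frequency": [50, 13, 16, 20, 24, 4, 7, 12],
    "time": [98, 28, 35, 45, 77, 4, 14, 35],
})
toy_y = pd.Series([1, 1, 1, 1, 0, 0, 1, 0], name="target")

# Same structure as the notebook's pipeline: per-feature select + scale
union = FeatureUnion([
    (name, Pipeline([
        ("selector", ColumnSelector(key=name)),
        ("scaler", StandardScaler()),
    ]))
    for name in toy_X.columns
])
model = Pipeline([("features", union), ("classifier", LogisticRegression())])
model.fit(toy_X, toy_y)

# Serialize to bytes and restore, instead of writing a file
payload = serializer.dumps(model)
restored = serializer.loads(payload)

# The restored pipeline should reproduce the original probabilities
same = np.allclose(restored.predict_proba(toy_X), model.predict_proba(toy_X))
print(same)
```

For the file written above, the equivalent is opening the `.dill` file in `"rb"` mode and calling `dill.load(f)`. Only load serialized files from sources you trust, since unpickling can execute arbitrary code.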