## DECISION TREE CLASSIFICATIONAL ANALYSIS OF BLOOD TRANSFUSION SERVICE 

DATASET SOURCE: "UCI MACHINE LEARNING REPOSITRY" 

### IMPORTING THE LIBRARIES 

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [2]:
dataset=pd.read_csv("transfusion.csv")

In [3]:
dataset.head(6)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
5,4,4,1000,4,0


In [4]:
dataset.shape

(748, 5)

####  DATASET DESCRIPTION
    Title: Blood Transfusion Service Center Data Set

    Abstract: Data taken from the Blood Transfusion Service Center in Hsin-Chu City 
    in Taiwan -- this is a classification problem.


    Data Set Information:

    To demonstrate the RFMTC marketing model (a modified version of RFM), this study 
    adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City 
    in Taiwan. The center passes their blood transfusion service bus to one 
    university in Hsin-Chu City to gather blood donated about every three months. To 
    build a FRMTC model, we selected 748 donors at random from the donor database. 
    These 748 donor data, each one included R (Recency - months since last 
    donation), F (Frequency - total number of donation), M (Monetary - total blood 
    donated in c.c.), T (Time - months since first donation), and a binary variable 
    representing whether he/she donated blood in March 2007 (1 stand for donating 
    blood; 0 stands for not donating blood).
    
    -----------------------------------------------------
    
    Attribute Information:
    
    Given is the variable name, variable type, the measurement unit and a brief 
    description. The "Blood Transfusion Service Center" is a classification problem. 
    The order of this listing corresponds to the order of numerals along the rows of 
    the database.
    
    R (Recency - months since last donation),
    F (Frequency - total number of donation),
    M (Monetary - total blood donated in c.c.),
    T (Time - months since first donation), and
    a binary variable representing whether he/she donated blood in March 2007 (1 
    stand for donating blood; 0 stands for not donating blood).    

In [5]:
dataset.columns

Index(['Recency (months)', 'Frequency (times)', 'Monetary (c.c. blood)',
       'Time (months)', 'whether he/she donated blood in March 2007'],
      dtype='object')

##  Checking the datatypes of the attributes

In [6]:
for i in dataset.columns:
    print('{}   dtype    "{}!"'.format(i,dataset[i].dtype)) 
    

Recency (months)   dtype    "int64!"
Frequency (times)   dtype    "int64!"
Monetary (c.c. blood)   dtype    "int64!"
Time (months)   dtype    "int64!"
whether he/she donated blood in March 2007   dtype    "int64!"


### 'whether he/she donated blood in March 2007' IS THE DEPENDENT VARIABLE 

In [7]:
print(dataset['whether he/she donated blood in March 2007'].unique())

[1 0]


## Missing values

In [8]:
dataset.isna().sum()

Recency (months)                              0
Frequency (times)                             0
Monetary (c.c. blood)                         0
Time (months)                                 0
whether he/she donated blood in March 2007    0
dtype: int64

In [9]:
dataset.sample(5)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
131,2,12,3000,95,0
596,3,1,250,3,1
217,4,1,250,4,0
451,21,3,750,38,0
415,16,1,250,16,0


### SPLITTING INTO DEPENDENT AND INDEPENDENT VARIABLES

In [10]:
X = dataset.iloc[:, :-1].values  #independent variables
y = dataset.iloc[:, 4].values #dependent variable

In [19]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X[:,[1,2,3]]=sc.fit_transform(X[:,[1,2,3]])




# Splitting the dataset into the Training set and Test set

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

## MODEL CREATION AND TRAINING WITH TRAINING SET

In [21]:
from sklearn.tree import DecisionTreeClassifier
cls = DecisionTreeClassifier(criterion='entropy',random_state=42)
cls.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')

In [22]:
# Predicting the Test set results
y_pred = cls.predict(X_test)

## Confusion Matrix

In [23]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [24]:
print(cm)

[[110   3]
 [ 33   4]]


In [25]:
## confusion matrix results
#--[110+4]=[114]correct predictions
#--[33+3]=[36]---incorrect predictions

## ACCURACY

In [26]:
print("ACCURACY OF MODEL IS : ",cls.score(X_test,y_test)*100,"%")

ACCURACY OF MODEL IS :  76.0 %


#### WE HAVE CREATED A MODEL THAT CAN ANALYSE THE BLOOD TANSFUSION SERVICE CENTER WITH AN ACCURACY OF 76%