<a href="https://colab.research.google.com/github/Hassan-293/Predict-Blood-Donation-for-Future-Expectancy/blob/main/Predict_Blood_Donation_for_Future_Expectancy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [29]:
import numpy as np
import pandas as pd 
from scipy import stats
from sklearn import preprocessing
# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

**Data Inspection**

In [30]:
df = pd.read_csv("transfusion.data")
df.head(750)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


**Data Cleaning**

In [31]:
Null_Counts = df.isnull().sum()
print("The missing values in each column are: ", Null_Counts)

The missing values in each column are:  Recency (months)                              0
Frequency (times)                             0
Monetary (c.c. blood)                         0
Time (months)                                 0
whether he/she donated blood in March 2007    0
dtype: int64


In [32]:
Recency = df['Recency (months)'].unique()
Recency.sort()
Recency

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 20, 21, 22, 23, 25, 26, 35, 38, 39, 40, 72, 74])

In [6]:
Monetary = df['Monetary (c.c. blood)'].unique()
Monetary.sort()
Monetary

array([  250,   500,   750,  1000,  1250,  1500,  1750,  2000,  2250,
        2500,  2750,  3000,  3250,  3500,  3750,  4000,  4250,  4500,
        4750,  5000,  5250,  5500,  5750,  6000,  6500,  8250,  8500,
        9500, 10250, 10750, 11000, 11500, 12500])

In [7]:
Frequency = df['Frequency (times)'].unique()
Frequency.sort()
Frequency

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 26, 33, 34, 38, 41, 43, 44, 46, 50])
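
In the rows displayed by `df.head(750)` above, `Monetary (c.c. blood)` equals 250 times `Frequency (times)`. The sketch below shows one way to check that relationship. It uses only five rows copied from that output (the `sample` frame is a hypothetical stand-in, not the full dataset), so whether the relationship holds for every row of `df` remains an assumption to verify.

```python
import pandas as pd

# Five rows copied from the df.head(750) output above; not the full dataset.
sample = pd.DataFrame({
    "Frequency (times)": [50, 13, 16, 20, 24],
    "Monetary (c.c. blood)": [12500, 3250, 4000, 5000, 6000],
})

# True when every row satisfies Monetary == 250 * Frequency.
is_scaled_copy = (
    sample["Monetary (c.c. blood)"] == 250 * sample["Frequency (times)"]
).all()
print(is_scaled_copy)
```

If the same check passes on the full `df`, the two columns carry the same information up to a constant factor, which may be worth keeping in mind when interpreting a linear model fitted on both.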

In [8]:
Time = df['Time (months)'].unique()
Time.sort()
Time

array([ 2,  3,  4,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23,
       24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
       41, 42, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 57, 58, 59,
       60, 61, 62, 63, 64, 65, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
       81, 82, 83, 86, 87, 88, 89, 93, 95, 98])

In [11]:
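# The label column is only renamed to 'target' in the next section, so use its original name here
Target = df['whether he/she donated blood in March 2007'].unique()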
Target.sort()
Target

array([0, 1])

**Renaming the Target Column**

In [10]:
df.rename(
    columns={'whether he/she donated blood in March 2007':'target'},
    inplace=True
)
df.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


**Data Split for Training and Testing**

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    df.drop(columns='target'),
    df.target,
    test_size=0.25,
    random_state=42,
    stratify=df.target
)
X_train.head(5)


Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
334,16,2,500,16
99,5,7,1750,26
116,2,7,1750,46
661,16,2,500,16
154,2,1,250,2
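
The split above passes `stratify=df.target`, which asks `train_test_split` to keep the class proportions roughly equal in the training and test sets. The sketch below illustrates that effect on hypothetical toy data (100 rows with a 24% positive rate, chosen only for illustration and not taken from the dataset).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a 24% positive rate (not the real dataset).
toy = pd.DataFrame({
    "feature": range(100),
    "target": [1] * 24 + [0] * 76,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="target"),
    toy["target"],
    test_size=0.25,
    random_state=42,
    stratify=toy["target"],
)

# With stratify, the positive share should be (nearly) identical in both splits.
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```

The same comparison can be run on `Y_train.mean()` and `Y_test.mean()` from the split above to confirm that the donor rate is preserved.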


**Installing the TPOT Package**

In [13]:
!pip install TPOT

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting TPOT
  Downloading TPOT-0.11.7-py3-none-any.whl (87 kB)
Collecting stopit>=1.1.1
  Downloading stopit-1.1.2.tar.gz (18 kB)
Collecting xgboost>=1.1.0
  Downloading xgboost-1.6.1-py3-none-manylinux2014_x86_64.whl (192.9 MB)
Collecting update-checker>=0.16
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting deap>=1.2
  Downloading deap-1.3.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (160 kB)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py) ... done
  Created wheel for stopit: filename=stopit-1.1.2-py3-none-any.whl size=11956 sha256=ff2ec330ede6dcacf60f9a6be69096511f85eb32d51

**USING THE TPOT CLASSIFIER**

In [14]:
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

tpot = TPOTClassifier(
    generations = 5,
    population_size = 20,
    verbosity = 2,
    scoring = 'roc_auc',
    random_state = 42,
    disable_update_check = True,
    config_dict = 'TPOT sparse'
)

tpot.fit(X_train, Y_train)
tpot_auc_score = roc_auc_score(Y_test, tpot.predict_proba(X_test)[:,1])
print(f'\nAUC score: {tpot_auc_score:.5f}')

# Print each step of the best pipeline found by TPOT
print('\nBest pipeline steps:')
for identity, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{identity}. {transform}')


Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7422459184429089

Generation 2 - Current best internal CV score: 0.7422459184429089

Generation 3 - Current best internal CV score: 0.7446043939340792

Generation 4 - Current best internal CV score: 0.7446043939340792

Generation 5 - Current best internal CV score: 0.7446043939340792

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.1, max_depth=5, min_child_weight=4, n_estimators=100, n_jobs=1, subsample=0.4, verbosity=0)

AUC score: 0.76120

Best pipeline steps:
1. XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.1, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=5, m
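
The AUC above is computed from predicted probabilities rather than hard class labels, which is why `tpot.predict_proba(X_test)[:, 1]` (the probability of class 1) is passed to `roc_auc_score`. A minimal sketch on hypothetical toy labels and scores, unrelated to the blood donation data, shows what the metric measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and predicted scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Positive scores: 0.35, 0.8. Negative scores: 0.1, 0.4.
# Three of the four (positive, negative) pairs are ranked correctly.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```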

**CALCULATING THE VARIANCE**

In [18]:
X_train.var().round(2)

Recency (months)              66.93
Frequency (times)             33.83
Monetary (c.c. blood)    2114363.70
Time (months)                611.15
dtype: float64

**USING LOG NORMALIZATION TO NORMALIZE THE DATA**

In [44]:
import numpy as np

# Work on copies so the original splits stay untouched
X_train_normed, X_test_normed = X_train.copy(), X_test.copy()

# Column with by far the largest variance (see the output above)
col_to_normalize = 'Monetary (c.c. blood)'

# Replace the column with its natural log in both splits
for data in [X_train_normed, X_test_normed]:
    data['monetary_log'] = np.log(data[col_to_normalize])
    data.drop(columns=col_to_normalize, inplace=True)

# Check the variance for X_train_normed
X_train_normed.var().round(2)

Recency (months)      66.93
Frequency (times)     33.83
Time (months)        611.15
monetary_log           0.84
dtype: float64
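
The drop in variance for the monetary column comes from the log transform compressing large values much more than small ones. The sketch below illustrates this on a small subset of the unique `Monetary (c.c. blood)` values listed earlier. The subset is chosen only for illustration, so its variances will not match the figures reported for `X_train`.

```python
import numpy as np
import pandas as pd

# A few of the unique Monetary values listed earlier; illustrative subset only.
monetary = pd.Series([250, 500, 1000, 2500, 5000, 12500],
                     name="Monetary (c.c. blood)")

raw_var = monetary.var()          # variance on the original scale
log_var = np.log(monetary).var()  # variance after the natural log

print(f"raw variance: {raw_var:.2f}")
print(f"log variance: {log_var:.2f}")
```

Note that `np.log` is only defined for strictly positive inputs; that is satisfied here because the smallest monetary value in the data is 250.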

**Calculating the Logistic Regression AUC Score**

In [47]:
from sklearn import linear_model

Logistic_Regression = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)

Logistic_Regression.fit(X_train_normed, Y_train)
logreg_auc_score = roc_auc_score(Y_test, Logistic_Regression.predict_proba(X_test_normed)[:,1])

print(f'\nAUC score: {logreg_auc_score}')


AUC score: 0.7890178003814368
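
With both scores available, the two models can be ranked side by side. The sketch below hard-codes the AUC values printed in the outputs above (rounded to four decimals) so that it runs on its own; in the notebook, `tpot_auc_score` and `logreg_auc_score` could be used instead. The labels `'tpot'` and `'logreg'` are illustrative names.

```python
from operator import itemgetter

# AUC values copied from the outputs above, rounded to four decimals.
scores = [("tpot", 0.7612), ("logreg", 0.7890)]

# Sort from highest to lowest AUC.
ranked = sorted(scores, key=itemgetter(1), reverse=True)
for name, auc in ranked:
    print(f"{name}: {auc:.4f}")
```

On this particular split the logistic regression on the log-normalized features scores higher than the TPOT pipeline. Because both numbers come from a single train/test split, the gap should be read with some caution.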
