In [None]:
from google.colab import drive
drive.mount('/googledrive')

Drive already mounted at /googledrive; to attempt to forcibly remount, call drive.mount("/googledrive", force_remount=True).


**1.Inspecting transfusion.data file**

The Transfusion dataset consists of data about blood donors, with the goal of predicting whether a person donated blood in March 2007. The dataset includes the following columns:

1. Recency (months): The number of months since the last donation.
2. Frequency (times): The total number of donations.
3. Monetary (c.c. blood): The total amount of blood donated in c.c.
4. Time (months): The number of months since the first donation.
5. Whether he/she donated blood in March 2007: The target variable (1 if the person donated blood in March 2007, 0 otherwise).








In [None]:
# Open and read the transfusion.data file
with open('/googledrive/MyDrive/27-06-2024/transfusion.data', 'r') as file:
    # Read the first few lines
    for i in range(5):  # Adjust the number of lines as needed
        print(file.readline())


Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"

2 ,50,12500,98 ,1

0 ,13,3250,28 ,1

1 ,16,4000,35 ,1

2 ,20,5000,45 ,1



**2.Loading the blood donations data**

Load the dataset into a pandas DataFrame.

In [None]:
#!pip install --upgrade numpy
import pandas as pd

# Load the .data file into a DataFrame
df = pd.read_csv('/googledrive/MyDrive/27-06-2024/transfusion.data')

# Display the first few rows of the dataframe
print(df.head())

   Recency (months)  Frequency (times)  Monetary (c.c. blood)  Time (months)  \
0                 2                 50                  12500             98   
1                 0                 13                   3250             28   
2                 1                 16                   4000             35   
3                 2                 20                   5000             45   
4                 1                 24                   6000             77   

   whether he/she donated blood in March 2007  
0                                           1  
1                                           1  
2                                           1  
3                                           1  
4                                           0  


**3.Inspecting transfusion DataFrame**

Inspect the DataFrame to understand its structure and content.


In [None]:
# Display basic information about the dataframe
print(df.info())

# Display summary statistics of the dataframe
print(df.describe())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB
None
       Recency (months)  Frequency (times)  Monetary (c.c. blood)  \
count        748.000000         748.000000             748.000000   
mean           9.506684           5.514706            1378.676471   
std            8.095396           5.839307            1459.826781   
min            0.000000           1.000000             250.000000   
25%       

**4.Creating target column**

Rename the last column, which indicates whether the person donated blood in March 2007, to `target`.

In [None]:
# Rename the target column
df.rename(columns={'whether he/she donated blood in March 2007': 'target'}, inplace=True)
df.head(5)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


**5.Checking target incidence**

Check the incidence of the target variable to understand the balance between donors and non-donors.

In [None]:
# Display the value counts of the target variable
print(df['target'].value_counts(normalize=True))


target
0    0.762032
1    0.237968
Name: proportion, dtype: float64
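
With roughly 76% non-donors, a model that always predicts 0 would already look accurate, which is one reason to prefer AUC over plain accuracy here. A minimal sketch of that baseline, assuming the class counts implied by the proportions above (570 non-donors and 178 donors out of 748; these counts are derived, not printed in the notebook):

```python
# Class counts implied by the proportions above (assumption: 748 rows in total)
n_total = 748
n_donors = round(0.237968 * n_total)
n_non_donors = n_total - n_donors

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = n_non_donors / n_total
print(f"Donors: {n_donors}, non-donors: {n_non_donors}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier evaluated by accuracy on this data should be compared against this baseline rather than against 50%.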


**6.Splitting transfusion into train and test datasets**

Split the data into training and testing sets, stratifying on the target so that both sets keep the same donor proportion.

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into features and target
X = df.drop(columns='target')
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


# Print out the first 2 rows of X_train
X_train.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
529,2,6,1500,22
271,16,7,1750,28
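
The effect of `stratify=y` can be illustrated in isolation. The sketch below uses synthetic labels with the class counts implied by the target incidence (assumed to be 570 zeros and 178 ones) and a placeholder feature column; `X_demo`, `y_demo`, `y_tr`, and `y_te` are illustrative names, not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the assumed class counts: 570 non-donors, 178 donors
y_demo = np.array([0] * 570 + [1] * 178)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Same split settings as in the notebook cell above
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"overall donor rate: {y_demo.mean():.3f}")
print(f"train donor rate:   {y_tr.mean():.3f}")
print(f"test donor rate:    {y_te.mean():.3f}")
```

With stratification, the donor rate in both subsets stays close to the overall rate; without it, a small test set could end up noticeably over- or under-representing donors.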


In [None]:
# Install the tpot package
!pip install tpot

Collecting numpy>=1.16.3 (from tpot)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.0
    Uninstalling numpy-2.0.0:
      Successfully uninstalled numpy-2.0.0
Successfully installed numpy-1.26.4


**7.Selecting model using TPOT**

TPOT is an automated machine learning tool that optimizes machine learning pipelines using genetic programming. We use it to select the best model.

In [None]:
# Import necessary libraries
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# Instantiate TPOTClassifier
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)

# Fit TPOTClassifier to the training data
tpot.fit(X_train, y_train)

# Calculate AUC score for the TPOT model
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

# Print best pipeline steps
print('\nBest pipeline steps:')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}. {transform}')



Generation 1 - Current best internal CV score: 0.7455534399205421

Generation 2 - Current best internal CV score: 0.7455534399205421

Generation 3 - Current best internal CV score: 0.7457088665243516

Generation 4 - Current best internal CV score: 0.7458704125174461

Generation 5 - Current best internal CV score: 0.7458704125174461

Best pipeline: LogisticRegression(MaxAbsScaler(input_matrix), C=15.0, dual=False, penalty=l2)

AUC score: 0.7863

Best pipeline steps:
1. MaxAbsScaler()
2. LogisticRegression(C=15.0, random_state=42)
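
The pipeline TPOT reports can also be rebuilt by hand with scikit-learn. A minimal sketch, assuming `make_pipeline` as the construction helper and synthetic stand-in data (the names `X_demo`, `y_demo`, and `manual_pipeline` and the data itself are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in data: 200 rows, 4 non-negative features on different scales
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(200, 4)) * np.array([1, 1, 250, 1])
y_demo = (X_demo[:, 1] > 50).astype(int)

# Same two steps as the reported best pipeline
manual_pipeline = make_pipeline(
    MaxAbsScaler(),                               # divides each column by its max absolute value
    LogisticRegression(C=15.0, random_state=42),
)
manual_pipeline.fit(X_demo, y_demo)

# Probability of the positive class, as used for the AUC calculation
proba = manual_pipeline.predict_proba(X_demo)[:, 1]
print(proba[:5])
```

In the notebook itself, the same object could be fitted on `X_train` and `y_train` instead of the synthetic arrays.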


**8.Checking the variance**

Check the variance of each feature to see whether any feature's scale is much larger than the others'.



In [None]:
# Display the variance of each feature
print(X.var())

# Same values, formatted without scientific notation
variances = X.var()
print("\nVariance of each feature without scientific notation:")
for feature, variance in variances.items():
    print(f"{feature}: {variance:,.2f}")

Recency (months)         6.553543e+01
Frequency (times)        3.409751e+01
Monetary (c.c. blood)    2.131094e+06
Time (months)            5.942242e+02
dtype: float64

Variance of each feature without scientific notation:
Recency (months): 65.54
Frequency (times): 34.10
Monetary (c.c. blood): 2,131,094.23
Time (months): 594.22
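
Monetary dwarfs the other variances largely because of its units. In the rows shown earlier, Monetary is always 250 times Frequency (for example, 50 donations and 12500 c.c.), and scaling a variable by a constant c multiplies its variance by c². The sketch below uses only the five rows printed by `df.head()`; that the 250 c.c. relation holds for every row is an assumption, although the full-dataset variances above are consistent with it (34.0975 × 250² ≈ 2.131 million):

```python
from statistics import variance

# Frequency and Monetary values from the five rows printed by df.head()
frequency = [50, 13, 16, 20, 24]
monetary = [12500, 3250, 4000, 5000, 6000]

# Each Monetary value is 250 times the matching Frequency value
assert all(m == 250 * f for f, m in zip(frequency, monetary))

var_freq = variance(frequency)  # sample variance (n - 1 denominator), like pandas' default
var_mon = variance(monetary)

print(f"var(Frequency) = {var_freq:,.2f}")
print(f"var(Monetary)  = {var_mon:,.2f}")
print(f"ratio          = {var_mon / var_freq:,.0f}")  # equals 250 ** 2
```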


**9.Log normalization**

Monetary (c.c. blood) has a far higher variance than the other features, so we apply a log transformation (`np.log1p`) to every feature whose variance exceeds a threshold.

In [None]:
import numpy as np

# Define the threshold for high variance
threshold = 1000

# Apply log normalization to high variance features if necessary
X_train_log = X_train.copy()
X_test_log = X_test.copy()

high_variance_columns = X_train.columns[X_train.var() > threshold]
print(high_variance_columns)

for column in high_variance_columns:
    X_train_log[column] = np.log1p(X_train[column])
    X_test_log[column] = np.log1p(X_test[column])

print("Transformed training data sample:")
print(X_train_log.head())
print(X_train_log.var().head())

Index(['Monetary (c.c. blood)'], dtype='object')
Transformed training data sample:
     Recency (months)  Frequency (times)  Monetary (c.c. blood)  Time (months)
529                 2                  6               7.313887             22
271                16                  7               7.467942             28
455                21                  1               5.525453             21
175                11                 10               7.824446             35
309                16                  3               6.621406             19
Recency (months)          67.473174
Frequency (times)         32.624460
Monetary (c.c. blood)      0.831421
Time (months)            599.951967
dtype: float64
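
`np.log1p(x)` computes ln(1 + x), which compresses large values far more than small ones; that is why the variance of Monetary falls below 1 after the transformation. A quick check with the standard library's `math.log1p` (used here instead of NumPy only to keep the sketch dependency-free) reproduces the first two transformed values above:

```python
import math

# Raw Monetary values of the first two training rows (1500 and 1750 c.c.)
for raw in (1500, 1750):
    print(f"log1p({raw}) = {math.log1p(raw):.6f}")

# log1p is defined at zero, unlike a plain logarithm
print(f"log1p(0) = {math.log1p(0):.1f}")
```

Using `log1p` rather than `log` is a safe default for count-like features, since a zero value would otherwise map to negative infinity.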


**10.Training the logistic regression model**

Train a logistic regression model on the log-normalized data.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize and fit the Logistic Regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_log, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test_log)
print("Prediction on the test set\n", y_pred)

# Display the classification report
print("classification report\n",classification_report(y_test, y_pred))

# Display the confusion matrix
print("confusion matrix\n",confusion_matrix(y_test, y_pred))

# Display the ROC AUC score (computed from predicted probabilities, not accuracy)
logreg_auc_score = roc_auc_score(y_test, log_reg.predict_proba(X_test_log)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')

Prediction on the test set
 [1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0]
classification report
               precision    recall  f1-score   support

           0       0.81      0.94      0.87       114
           1       0.61      0.31      0.41        36

    accuracy                           0.79       150
   macro avg       0.71      0.62      0.64       150
weighted avg       0.76      0.79      0.76       150

confusion matrix
 [[107   7]
 [ 25  11]]

AUC score: 0.7907
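
The figures in the classification report follow directly from the confusion matrix. A short sketch that recomputes them by hand, assuming scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 107, 7
fn, tp = 25, 11

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted donors, how many really donated
recall = tp / (tp + fn)      # of real donors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
```

Only 11 of the 36 actual donors in the test set are identified, so the overall accuracy hides a low recall on the donor class.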


**11.Conclusion**

We used TPOT for automatic model selection and evaluated performance with the AUC score. The TPOT-optimized pipeline achieved an AUC of 0.7863 on the test set. In comparison, the logistic regression model trained on log-normalized data achieved a slightly higher AUC of 0.7907.

The improvement is small: 0.0044 in absolute AUC, or about 0.56% relative to the TPOT score. Even minor gains in metrics such as AUC can matter in applications where predictive accuracy is critical, because they make the model more reliable in practice.
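
The relative improvement quoted above can be checked in a couple of lines, using the two AUC scores reported earlier (the variable names are illustrative):

```python
tpot_auc = 0.7863
logreg_auc = 0.7907

absolute_gain = logreg_auc - tpot_auc
relative_gain_pct = absolute_gain / tpot_auc * 100

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain_pct:.2f}%")
```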

In [None]:
from operator import itemgetter

# AUC scores reported above, hard-coded here so this cell runs on its own
tpot_auc_score = 0.7863
logreg_auc_score = 0.7907

# Sort models based on their AUC score from highest to lowest
sorted_models = sorted(
    [('tpot', tpot_auc_score), ('logreg', logreg_auc_score)],
    key=itemgetter(1),
    reverse=True
)

# Print sorted models
print("Sorted models based on AUC score:")
for model_name, auc_score in sorted_models:
    print(f"{model_name}: {auc_score:.4f}")


Sorted models based on AUC score:
logreg: 0.7907
tpot: 0.7863
