## Model Training
## Importing data and required packages



In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [51]:
df=pd.read_csv("/content/t.data")

In [52]:
df.drop_duplicates(inplace=True)

In [53]:
df.columns = ['recency_months', 'frequency_times', 'monetary_cc_blood', 'time_months', 'donated_march_2007']

## Preparing X and Y variables

In [54]:
X= df.iloc[:, :-1]
X

index,recency_months,frequency_times,monetary_cc_blood,time_months
0,2,50,12500,98
1,0,13,3250,28
2,1,16,4000,35
3,2,20,5000,45
4,1,24,6000,77
...,...,...,...,...
743,23,2,500,38
744,21,2,500,52
745,23,3,750,62
746,39,1,250,39


In [55]:
y=df.iloc[:,-1]
y

index,donated_march_2007
0,1
1,1
2,1
3,1
4,0
...,...
743,0
744,0
745,0
746,0


In [56]:
"""# Create a ColumnTransformer that standardizes the numeric features (disabled)
num_features = X.select_dtypes(exclude="object").columns

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    [
        ("StandardScaler", numeric_transformer, num_features)
    ]
)"""

In [57]:
#X = preprocessor.fit_transform(X)


In [58]:
X.shape

(533, 4)
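
The shape reports 533 rows, while the index labels in the displayed frame run up to 746. That is consistent with `drop_duplicates()` removing rows but keeping the original index labels. A minimal sketch on a toy frame (the names `toy`, `deduped`, and `reindexed` are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy frame with two pairs of duplicated rows (hypothetical data).
toy = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": [10, 10, 20, 20, 30]})

# drop_duplicates keeps the first occurrence and the original index labels.
deduped = toy.drop_duplicates()
print(deduped.shape)           # 3 rows remain
print(deduped.index.tolist())  # labels still refer to the original positions

# Optional: renumber the rows from 0 if contiguous labels are preferred.
reindexed = deduped.reset_index(drop=True)
print(reindexed.index.tolist())
```

Resetting the index is optional here, since `iloc` and `train_test_split` work on positions rather than labels.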


## Splitting transfusion into train and test datasets
We'll now use the train_test_split() function to split the transfusion DataFrame.

The target incidence tells us that 0s make up about 76% of the dataset. Ideally, the train and test sets keep the same structure, i.e., both have a 0 target incidence of about 76%. The train_test_split() function from scikit-learn supports this through its stratify parameter: passing the target column (stratify=y) preserves the class ratio in both splits. Note that the call below does not pass stratify and relies on random_state alone, so the class ratio in each split is only approximately preserved.

In [60]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape

((426, 4), (107, 4))

In [61]:
y_train.shape, y_test.shape

((426,), (107,))
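
The split above does not pass `stratify`. As a minimal sketch of what a stratified split looks like, the example below uses toy data with a 76/24 class ratio (all `*_demo` and `yd_*`/`Xd_*` names are hypothetical, and the data is synthetic, not the transfusion dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with 76% zeros and 24% ones, plus one random feature.
rng = np.random.default_rng(0)
y_demo = pd.Series([0] * 76 + [1] * 24, name="target")
X_demo = pd.DataFrame({"feature": rng.normal(size=len(y_demo))})

# stratify=y_demo asks scikit-learn to keep the class ratio in both splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Share of 1s in each split; both should sit close to 0.24.
train_rate = yd_train.mean()
test_rate = yd_test.mean()
print(f"train: {train_rate:.2f}, test: {test_rate:.2f}")
```

Applying the same idea to the notebook would mean adding `stratify=y` to the existing call, which would also change the downstream scores.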

## Selecting model using TPOT
TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

TPOT Machine Learning Pipeline

TPOT will automatically explore hundreds of possible pipelines to find the best one for our dataset. Note, the outcome of this search will be a scikit-learn pipeline, meaning it will include any pre-processing steps as well as the model.

We are using TPOT to help us zero in on one model that we can then explore and optimize further.
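
To make the idea of a pipeline search concrete, here is a hand-rolled sketch that scores a few candidate scikit-learn pipelines by cross-validated ROC AUC and keeps the best one. This is not TPOT's genetic programming, only a simplified illustration of the selection step; the candidate list and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 4 features (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# A tiny, fixed set of candidate pipelines (TPOT would generate and mutate many more).
candidates = {
    "scaled_logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gaussian_nb": make_pipeline(GaussianNB()),
    "shallow_tree": make_pipeline(DecisionTreeClassifier(max_depth=3, random_state=42)),
}

# Score each candidate with 5-fold cross-validated ROC AUC.
cv_scores = {
    name: cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    for name, pipe in candidates.items()
}

best_name = max(cv_scores, key=cv_scores.get)
for name, score in cv_scores.items():
    print(f"{name}: {score:.4f}")
print("best:", best_name)
```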

In [62]:
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score


tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

# AUC score for tpot model:
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f"\nAUC score: {tpot_auc_score:.4f}")

# Print best pipeline steps:
print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}. {transform}')

Generation 1 - Current best internal CV score: 0.7266464480874316

Generation 2 - Current best internal CV score: 0.7266464480874316

Generation 3 - Current best internal CV score: 0.728164480874317

Generation 4 - Current best internal CV score: 0.728164480874317

Generation 5 - Current best internal CV score: 0.728164480874317

Best pipeline: LogisticRegression(BernoulliNB(GaussianNB(input_matrix), alpha=0.01, fit_prior=False), C=25.0, dual=False, penalty=l2)

AUC score: 0.7775

Best pipeline steps:
1. StackingEstimator(estimator=GaussianNB())
2. StackingEstimator(estimator=BernoulliNB(alpha=0.01, fit_prior=False))
3. LogisticRegression(C=25.0, random_state=42)
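
The first two steps above are stacking steps: each wrapped estimator is fitted and its predictions are passed on as extra input features to the next step. The sketch below mimics that idea with plain scikit-learn. It assumes TPOT's StackingEstimator prepends the predicted class and class probabilities to the feature matrix; the helper `stack_features` is hypothetical and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic data with 4 features (not the transfusion dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

def stack_features(estimator, X, y):
    """Fit the estimator, then prepend its predictions to X as new columns."""
    estimator.fit(X, y)
    predicted_class = estimator.predict(X).reshape(-1, 1)
    predicted_proba = estimator.predict_proba(X)
    return np.hstack([predicted_class, predicted_proba, X])

# One stacking step: 1 class column + 2 probability columns + 4 original features.
X_stacked = stack_features(GaussianNB(), X_demo, y_demo)
print(X_stacked.shape)

# The final model is trained on the augmented feature matrix.
final_model = LogisticRegression(max_iter=1000).fit(X_stacked, y_demo)
```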




## Checking the variance
TPOT picked a pipeline that ends in LogisticRegression, preceded by two stacked naive Bayes estimators (GaussianNB and BernoulliNB) and no scaling step, giving us a test AUC score of 0.7775. This is a good starting point. Let's see if we can make it better.

One of the assumptions for linear models is that the data and the features we are giving it are related in a linear fashion, or can be measured with a linear distance metric. If a feature in our dataset has a high variance that's orders of magnitude greater than the other features, this could impact the model's ability to learn from other features in the dataset.

Correcting for high variance is called normalization. It is one of the possible transformations you do before training a model. Let's check the variance to see if such transformation is needed.

In [64]:
# X_train's variance, rounding the output to 3 decimal places:
X_train.var().round(3)

feature,variance
recency_months,64.025
frequency_times,27.455
monetary_cc_blood,1715920.393
time_months,505.246
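
One way to quantify "orders of magnitude" is to compare each variance with the median variance on a log10 scale. The sketch below reuses the four variances printed above; the threshold of 2 (a factor of 100) is an arbitrary assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Variances copied from the output above.
variances = pd.Series({
    "recency_months": 64.025,
    "frequency_times": 27.455,
    "monetary_cc_blood": 1715920.393,
    "time_months": 505.246,
})

# Orders of magnitude above (positive) or below (negative) the median variance.
orders = np.log10(variances / variances.median())
print(orders.round(2))

# Flag features more than two orders of magnitude above the median (assumed cutoff).
flagged_features = orders[orders > 2].index.tolist()
print(flagged_features)
```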


In [65]:


# Normalize 'monetary_cc_blood' using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train['monetary_cc_blood'] = scaler.fit_transform(X_train[['monetary_cc_blood']])
X_test['monetary_cc_blood'] = scaler.transform(X_test[['monetary_cc_blood']])

# Now X_train and X_test have the normalized 'monetary_cc_blood' column
print(X_train.var().round(3))

recency_months        64.025
frequency_times       27.455
monetary_cc_blood      0.020
time_months          505.246
dtype: float64
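
MinMaxScaler rescales a column as (x - min) / (max - min), where min and max come from the data it was fitted on. Because the scaler above is fitted on X_train only, test values outside the training range can land outside [0, 1]. A minimal sketch with made-up values (`train_col` and `test_col` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data; the second test value exceeds the training maximum.
train_col = np.array([[250.0], [500.0], [1000.0], [3250.0]])
test_col = np.array([[750.0], [5000.0]])

# Fit on the training column only, then apply the same mapping to the test column.
scaler_demo = MinMaxScaler().fit(train_col)
scaled_test = scaler_demo.transform(test_col)

# The same transformation written out by hand.
col_min, col_max = train_col.min(), train_col.max()
manual_test = (test_col - col_min) / (col_max - col_min)

print(scaled_test.ravel())
print(np.allclose(scaled_test, manual_test))
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step.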


## Training the logistic regression model
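The largest variance has dropped from about 1.7 million to about 505. Notice that time_months now has the largest variance, roughly one order of magnitude above recency_months and frequency_times, so we'll leave it as is. The scaled monetary_cc_blood (0.020) is now the smallest by a wide margin.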

We are now ready to train the logistic regression model.

In [74]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Initialize and train the Logistic Regression model
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)


# Make predictions on the test set
y_pred_proba = logreg.predict_proba(X_test)[:, 1]

# Calculate the AUC score
logreg_auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"\nAUC score for Logistic Regression: {logreg_auc_score:.4f}")


AUC score for Logistic Regression: 0.7813
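
For intuition about the AUC numbers reported in this section: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below checks that on made-up labels and scores (not the model's actual predictions).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc)
print(roc_auc_score(y_true, y_score))
```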
