# CUSTOMER TRANSACTION PREDICTION


This project builds a machine learning model that predicts whether a customer will make a transaction in the future. The features are anonymized and their names are not interpretable, so detailed Exploratory Data Analysis (EDA) is skipped.

In [1]:
# ============================================
# CELL 1: IMPORT LIBRARIES
# ============================================
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report
)

import joblib


In [3]:
# ============================================
# CELL 2: DATA COLLECTION
# ============================================
# Change the path below according to where the file is on your system
data_path = "train(1).csv"   # e.g. "train.csv"

df = pd.read_csv(data_path)
print("✅ Data loaded successfully!")


✅ Data loaded successfully!


In [4]:
# ============================================
# CELL 3: BASIC CHECKS
# ============================================

print("\n1. Dataset Shape (rows, columns):")
print(df.shape)

print("\n2. First 5 rows:")
display(df.head())

print("\n3. Data types:")
print(df.dtypes.head(10))  # show first 10 types

print("\n4. Statistical summary of numeric features:")
display(df.describe().T.head(10))  # show first 10 rows for readability

print("\n5. Missing values in each column (first 20 columns):")
print(df.isnull().sum().head(20))

print("\n6. Number of duplicate rows:")
print(df.duplicated().sum())

print("\n7. Target variable distribution (count):")
print(df["target"].value_counts())

print("\n8. Target variable distribution (percentage):")
print(df["target"].value_counts(normalize=True) * 100)



1. Dataset Shape (rows, columns):
(200000, 202)

2. First 5 rows:


Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104



3. Data types:
ID_code     object
target       int64
var_0      float64
var_1      float64
var_2      float64
var_3      float64
var_4      float64
var_5      float64
var_6      float64
var_7      float64
dtype: object

4. Statistical summary of numeric features:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
target,200000.0,0.10049,0.300653,0.0,0.0,0.0,0.0,1.0
var_0,200000.0,10.679914,3.040051,0.4084,8.45385,10.52475,12.7582,20.315
var_1,200000.0,-1.627622,4.050044,-15.0434,-4.740025,-1.60805,1.358625,10.3768
var_2,200000.0,10.715192,2.640894,2.1171,8.722475,10.58,12.5167,19.353
var_3,200000.0,6.796529,2.043319,-0.0402,5.254075,6.825,8.3241,13.1883
var_4,200000.0,11.078333,1.62315,5.0748,9.883175,11.10825,12.261125,16.6714
var_5,200000.0,-5.065317,7.863267,-32.5626,-11.20035,-4.83315,0.9248,17.2516
var_6,200000.0,5.408949,0.866607,2.3473,4.7677,5.3851,6.003,8.4477
var_7,200000.0,16.54585,3.418076,5.3497,13.9438,16.4568,19.1029,27.6918
var_8,200000.0,0.284162,3.332634,-10.5055,-2.3178,0.3937,2.9379,10.1513



5. Missing values in each column (first 20 columns):
ID_code    0
target     0
var_0      0
var_1      0
var_2      0
var_3      0
var_4      0
var_5      0
var_6      0
var_7      0
var_8      0
var_9      0
var_10     0
var_11     0
var_12     0
var_13     0
var_14     0
var_15     0
var_16     0
var_17     0
dtype: int64

6. Number of duplicate rows:
0

7. Target variable distribution (count):
target
0    179902
1     20098
Name: count, dtype: int64

8. Target variable distribution (percentage):
target
0    89.951
1    10.049
Name: proportion, dtype: float64


In [5]:
# ============================================
# CELL 4: DOMAIN ANALYSIS & PROBLEM DEFINITION (TEXT ONLY)
# ============================================

problem_description = """
Business Problem:
-----------------
A bank wants to identify which customers are likely to make a specific 
transaction in the future. 

Data:
-----
- ID_code: unique identifier for each customer
- target: 0 (no transaction), 1 (will make transaction)
- 200 anonymized numerical features (e.g., var_0, var_1, ..., var_199)

Goal:
-----
1. Build a classification model to predict target (0/1).
2. Compare multiple models and choose the best one for production.
"""

print(problem_description)



Business Problem:
-----------------
A bank wants to identify which customers are likely to make a specific 
transaction in the future. 

Data:
-----
- ID_code: unique identifier for each customer
- target: 0 (no transaction), 1 (will make transaction)
- 200 anonymized numerical features (e.g., var_0, var_1, ..., var_199)

Goal:
-----
1. Build a classification model to predict target (0/1).
2. Compare multiple models and choose the best one for production.



In [6]:
# ============================================
# CELL 5: FEATURE ENGINEERING (BASIC)
# ============================================

# 1. Separate target and features
TARGET_COL = "target"
ID_COL = "ID_code"

X = df.drop(columns=[TARGET_COL, ID_COL])  # drop target + ID
y = df[TARGET_COL]

print("Features shape:", X.shape)
print("Target shape:", y.shape)

# 2. Train-test split (70% train, 30% test) with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print("\nTrain shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)


Features shape: (200000, 200)
Target shape: (200000,)

Train shape: (140000, 200) (140000,)
Test shape: (60000, 200) (60000,)
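
The effect of `stratify=y` in the split above can be checked on a small example. The sketch below is illustrative only: it uses synthetic data (the `X_demo` and `y_demo` names and the 10% positive rate are assumptions chosen to resemble the target distribution printed earlier) and compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, roughly 10% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 5))
y_demo = (rng.random(10_000) < 0.10).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# With stratify=y_demo, the positive rate is nearly identical in both splits.
print(f"Overall positive rate: {y_demo.mean():.4f}")
print(f"Train positive rate  : {y_tr.mean():.4f}")
print(f"Test positive rate   : {y_te.mean():.4f}")
```

Without stratification the two rates can drift apart by chance, which matters more as the positive class gets rarer.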


In [7]:
# ============================================
# CELL 6: PREPROCESSING + FEATURE SELECTION
# ============================================
# We will create a common preprocessing + feature selection step
# - SimpleImputer: handles missing values (if any)
# - StandardScaler: scales features
# - SelectKBest: keep top K important features (here K=50, you can change)

K_FEATURES = 50  # you can tune this number later

preprocess_and_select = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=mutual_info_classif, k=K_FEATURES))
])

print("✅ Preprocessing + feature selection pipeline created.")


✅ Preprocessing + feature selection pipeline created.
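
To see which columns a pipeline like this keeps, the fitted `SelectKBest` step can be inspected with `get_support()`. This is a minimal sketch on synthetic data, not the notebook's actual run: the dataset, `k=5`, and the fixed `random_state` passed to `mutual_info_classif` (for reproducibility) are all assumptions made for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset: 20 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Fix the random state so the mutual information estimates are repeatable.
score_func = partial(mutual_info_classif, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=score_func, k=5)),
])

X_selected = demo_pipeline.fit_transform(X_demo, y_demo)

# Boolean mask over the input columns: True where the column was kept.
selected_mask = demo_pipeline.named_steps["select"].get_support()
selected_columns = [i for i, keep in enumerate(selected_mask) if keep]

print("Shape after selection:", X_selected.shape)
print("Kept column indices  :", selected_columns)
```

Because the selector sits inside the pipeline, it is refit on each training fold during cross-validation, so the held-out fold never influences which features are chosen.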


In [9]:
# ============================================
# CELL 7: DEFINE MULTIPLE MODELS FOR COMPARISON
# ============================================

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, n_jobs=None),
    "Random Forest": RandomForestClassifier(
        n_estimators=200,
        random_state=42,
        n_jobs=-1
    ),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

print("Models to compare:")
for name in models:
    print("-", name)


Models to compare:
- Logistic Regression
- Random Forest
- Gradient Boosting
- KNN


In [None]:
# ============================================
# CELL 8: MODEL COMPARISON USING CROSS-VALIDATION
# ============================================

results = []

# Stratified K-Fold to respect class imbalance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "f1": "f1"
}

for name, model in models.items():
    # Create a full pipeline: preprocess + model
    pipe = Pipeline(steps=[
        ("preprocess_select", preprocess_and_select),
        ("model", model)
    ])
    
    print(f"\n⏳ Training and evaluating: {name}")
    
    cv_scores = cross_validate(
        pipe,
        X_train, y_train,
        cv=cv,
        scoring=scoring,
        n_jobs=-1
    )
    
    results.append({
        "Model": name,
        "Accuracy_mean": cv_scores["test_accuracy"].mean(),
        "Accuracy_std": cv_scores["test_accuracy"].std(),
        "ROC_AUC_mean": cv_scores["test_roc_auc"].mean(),
        "ROC_AUC_std": cv_scores["test_roc_auc"].std(),
        "F1_mean": cv_scores["test_f1"].mean(),
        "F1_std": cv_scores["test_f1"].std()
    })

results_df = pd.DataFrame(results)
results_df_sorted = results_df.sort_values(by="ROC_AUC_mean", ascending=False)

print("\n📊 Model Comparison (sorted by ROC_AUC_mean):")
display(results_df_sorted)



⏳ Training and evaluating: Logistic Regression


In [None]:
# ============================================
# CELL 9: SELECT BEST MODEL (BASED ON ROC_AUC) AND TRAIN ON FULL TRAIN DATA
# ============================================

# Pick the model with highest ROC_AUC_mean
best_model_name = results_df_sorted.iloc[0]["Model"]
print("Best model selected:", best_model_name)

best_model = models[best_model_name]

# Build final pipeline with best model
best_pipeline = Pipeline(steps=[
    ("preprocess_select", preprocess_and_select),
    ("model", best_model)
])

# Fit on training data
best_pipeline.fit(X_train, y_train)

print("✅ Best model trained on training data.")


In [22]:
# ============================================
# CELL 10: EVALUATION ON TEST SET
# ============================================

# Predictions
y_pred = best_pipeline.predict(X_test)
y_proba = best_pipeline.predict_proba(X_test)[:, 1]  # probability for class 1

# Metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc = roc_auc_score(y_test, y_proba)

print("==== TEST SET PERFORMANCE ====")
print(f"Accuracy : {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall   : {rec:.4f}")
print(f"F1-score : {f1:.4f}")
print(f"ROC-AUC  : {roc:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


==== TEST SET PERFORMANCE ====
Accuracy : 0.9026
Precision: 0.8046
Recall   : 0.0403
F1-score : 0.0768
ROC-AUC  : 0.8081

Classification Report:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95     53971
           1       0.80      0.04      0.08      6029

    accuracy                           0.90     60000
   macro avg       0.85      0.52      0.51     60000
weighted avg       0.89      0.90      0.86     60000


Confusion Matrix:
[[53912    59]
 [ 5786   243]]
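
The reported metrics can be recomputed by hand from the confusion matrix above, which makes the gap between accuracy and recall easier to see. The four counts are copied from the printed matrix (rows are true classes, columns are predicted classes); the majority-class baseline is an added comparison, not something the notebook computes.

```python
# Counts taken from the confusion matrix printed above.
tn, fp = 53912, 59     # true class 0: predicted 0, predicted 1
fn, tp = 5786, 243     # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts class 0.
baseline_accuracy = (tn + fp) / total

print(f"Accuracy : {accuracy:.4f}")            # 0.9026
print(f"Precision: {precision:.4f}")           # 0.8046
print(f"Recall   : {recall:.4f}")              # 0.0403
print(f"F1-score : {f1:.4f}")                  # 0.0768
print(f"Baseline : {baseline_accuracy:.4f}")   # 0.8995
```

The model's accuracy is only about 0.3 percentage points above the always-predict-0 baseline, and it finds 243 of the 6,029 actual positives at the default 0.5 threshold.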


In [21]:
# ============================================
# CELL 11: SAVE BEST MODEL FOR PRODUCTION
# ============================================

model_filename = f"best_model_{best_model_name.replace(' ', '_').lower()}.joblib"
joblib.dump(best_pipeline, model_filename)

print(f"✅ Best model saved as: {model_filename}")


✅ Best model saved as: best_model_gradient_boosting.joblib
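
A saved pipeline can be loaded back with `joblib.load` and used directly for prediction. The sketch below shows the round trip on a small stand-in pipeline; the synthetic data, the logistic regression model, and the `demo_pipeline.joblib` file name are assumptions for illustration, not the notebook's saved artifact.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary directory, then load the pipeline back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")  # hypothetical file name
    joblib.dump(demo_pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original probabilities.
same = np.allclose(
    reloaded.predict_proba(X_demo), demo_pipeline.predict_proba(X_demo)
)
print("Reloaded pipeline reproduces predictions:", bool(same))
```

Because the whole pipeline is saved, the imputer, scaler, and feature selector travel with the model, so new data can be passed in raw. Loading generally works best with the same scikit-learn version that was used for saving.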


In [26]:
# ============================================
# CELL 12: SAMPLE PREDICTION ON NEW DATA (OPTIONAL DEMO)
# ============================================
# Take a few rows from X_test as "new customers" and show predicted probabilities.

sample_customers = X_test.iloc[:5]
sample_true = y_test.iloc[:5]

sample_proba = best_pipeline.predict_proba(sample_customers)[:, 1]
sample_pred = best_pipeline.predict(sample_customers)

print("Sample customers (first 5 from test set):")
display(sample_customers)

print("\nTrue target values:     ", list(sample_true.values))
print("Predicted probabilities:", np.round(sample_proba, 4))
print("Predicted classes:      ", list(sample_pred))


Sample customers (first 5 from test set):


Unnamed: 0,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
163990,8.831,-2.3016,13.2316,5.0266,9.86,10.2081,3.6479,21.8577,-3.1201,5.7758,...,-5.5464,10.3932,-0.5231,-7.0648,18.2618,-1.7615,12.5539,7.687,15.0994,-16.9054
106916,17.8659,3.1619,13.0525,9.5777,11.4522,-16.2292,5.8872,10.4428,-0.7323,8.7069,...,7.7251,6.4649,1.7492,-3.1808,19.2917,-0.6208,10.6752,7.4027,12.0198,12.9762
189758,12.2995,-0.4513,8.6624,7.7633,10.264,2.8404,6.2003,14.452,0.8639,8.166,...,0.4213,4.3329,1.4159,6.5954,15.7945,1.9027,2.8749,8.61,22.3738,10.8293
185006,9.4057,-3.1699,8.0503,5.1969,11.5919,1.9257,5.27,16.8097,1.7853,8.0217,...,13.3047,13.509,0.6232,6.221,21.4579,0.4512,4.7943,9.1904,13.6194,-24.0796
175007,13.5909,7.8904,15.9594,7.4401,11.4552,-17.8994,5.0994,17.5617,2.2557,8.0235,...,2.4529,9.6375,-0.6109,3.1691,19.6098,1.9964,7.409,9.0665,15.5921,-5.3999



True target values:      [0, 0, 0, 0, 1]
Predicted probabilities: [0.076  0.0778 0.0352 0.0759 0.078 ]
Predicted classes:       [0, 0, 0, 0, 0]
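
All five probabilities above are far below 0.5, so every sample is assigned class 0, including the true positive. One possible response is to choose the decision threshold from the data instead of using the default 0.5. The sketch below is an illustration under assumptions: it uses synthetic data and a logistic regression stand-in (not the notebook's fitted model), and it picks the threshold that maximizes F1 on a held-out validation split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_proba = clf.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1_per_threshold = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = thresholds[np.argmax(f1_per_threshold)]

f1_default = f1_score(y_val, (val_proba >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (val_proba >= best_threshold).astype(int))

print(f"Best threshold on validation data: {best_threshold:.3f}")
print(f"F1 at 0.5 threshold : {f1_default:.4f}")
print(f"F1 at best threshold: {f1_tuned:.4f}")
```

The tuned F1 here is measured on the same split used to pick the threshold, so it is optimistic. In practice the threshold would be chosen on validation data and then confirmed once on the untouched test set.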


In [32]:
# ============================================
# CELL 13: PLACEHOLDER – CHALLENGES FACED (WRITE AS TEXT/MARKDOWN)
# ============================================

challenges_text = """
Challenges Faced:
-----------------
1. Anonymized features:
   - No domain meaning for individual features.
   - Could not perform domain-specific EDA/feature engineering.
   - Solution: Treated all features as generic numeric variables and focused
     on robust models (tree-based, ensemble methods).

2. High-dimensional data (200 features):
   - Risk of overfitting and longer training time.
   - Solution: Used SelectKBest with mutual information to reduce to K
     most informative features.

3. Class imbalance (if target 1 is much smaller than target 0):
   - Model could get high accuracy by predicting mostly class 0.
   - Solution: Used stratified train-test split and ROC-AUC / F1 as key
     metrics instead of accuracy alone. (Optionally we could use
     class_weight='balanced' or resampling methods.)

4. No missing values or few missing values:
   - Still added SimpleImputer in pipeline to make it robust if data
     quality changes in production. """


print(challenges_text)



Challenges Faced:
-----------------
1. Anonymized features:
   - No domain meaning for individual features.
   - Could not perform domain-specific EDA/feature engineering.
   - Solution: Treated all features as generic numeric variables and focused
     on robust models (tree-based, ensemble methods).

2. High-dimensional data (200 features):
   - Risk of overfitting and longer training time.
   - Solution: Used SelectKBest with mutual information to reduce to K
     most informative features.

3. Class imbalance (if target 1 is much smaller than target 0):
   - Model could get high accuracy by predicting mostly class 0.
   - Solution: Used stratified train-test split and ROC-AUC / F1 as key
     metrics instead of accuracy alone. (Optionally we could use
     class_weight='balanced' or resampling methods.)

4. No missing values or few missing values:
   - Still added SimpleImputer in pipeline to make it robust if data
     quality changes in production. 
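
The `class_weight='balanced'` option mentioned under challenge 3 can be sketched as follows. This is an illustration under assumptions, not part of the notebook's run: it uses synthetic data and `LogisticRegression`, which accepts `class_weight`. `GradientBoostingClassifier` has no `class_weight` parameter, so a gradient boosting model would need a different route, such as sample weights passed to `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Same model twice: once unweighted, once with errors on the rare class
# weighted more heavily in inverse proportion to class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
precision_plain = precision_score(y_te, plain.predict(X_te))
precision_balanced = precision_score(y_te, balanced.predict(X_te))

print(f"Unweighted: recall={recall_plain:.4f}, precision={precision_plain:.4f}")
print(f"Balanced  : recall={recall_balanced:.4f}, precision={precision_balanced:.4f}")
```

Balanced weighting typically trades precision for recall, so the right setting depends on the relative cost of missed positives and false alarms.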
