## How Agglomerative Clustering Works

Agglomerative clustering is a bottom-up hierarchical clustering method: it starts with every data point as its own cluster and then merges clusters step by step until all points belong to a single cluster or the desired number of clusters is reached.

Initialization: Start with each data point as its own cluster.

Compute distances: Compute the distance between every pair of current clusters. The distance between points can be measured in several ways, such as Euclidean or Manhattan distance; a linkage criterion (for example single, average, or Ward) then defines the distance between two clusters.

Merge clusters: Find the two closest clusters and merge them into a single new cluster.

Repeat: Repeat the distance computation and the merge until the desired number of clusters is reached or all points belong to a single cluster.
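The four steps can be sketched in a few lines of plain Python. This is only an illustrative, naive version and not what `AgglomerativeClustering` does internally: the function names, the toy points, and the choice of single linkage (cluster distance = smallest pairwise point distance) are assumptions made for the example.

```python
import math
from itertools import combinations

def single_linkage(a, b, points):
    # Distance between two clusters = smallest distance between their members
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def naive_agglomerative(points, n_clusters):
    # Step 1: every point starts as its own cluster (stored as index lists)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Step 2: compare all pairs of current clusters and pick the closest pair
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        # Step 3: merge the closest pair (a < b, so deleting b keeps a valid)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        # Step 4: the loop repeats until n_clusters remain
    return clusters

# Toy data, for illustration only
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = naive_agglomerative(points, n_clusters=3)
print(result)  # [[0, 1], [2, 3], [4]]
```

Every merge re-scans all cluster pairs, so this sketch becomes very slow as the number of points grows, which is one reason the notebook later clusters the data in batches.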

In [21]:
%pip install numpy pandas scikit-learn imbalanced-learn




In [22]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import BorderlineSMOTE
from collections import Counter
from sklearn.cluster import AgglomerativeClustering

## Data exploration and Data preparation

In [23]:
# Load dataset
dataset_main = pd.read_csv('PS_20174392719_1491204439457_log.csv')
dataset_main.head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0
5,1,PAYMENT,7817.71,C90045638,53860.0,46042.29,M573487274,0.0,0.0,0,0
6,1,PAYMENT,7107.77,C154988899,183195.0,176087.23,M408069119,0.0,0.0,0,0
7,1,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0.0,0.0,0,0
8,1,PAYMENT,4024.36,C1265012928,2671.0,0.0,M1176932104,0.0,0.0,0,0
9,1,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0


In [24]:
# Drop the account identifiers and the isFlaggedFraud column in one call
dataset_main.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'], inplace=True)
dataset_main.head(10)

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,1,PAYMENT,9839.64,170136.0,160296.36,0.0,0.0,0
1,1,PAYMENT,1864.28,21249.0,19384.72,0.0,0.0,0
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,1
3,1,CASH_OUT,181.0,181.0,0.0,21182.0,0.0,1
4,1,PAYMENT,11668.14,41554.0,29885.86,0.0,0.0,0
5,1,PAYMENT,7817.71,53860.0,46042.29,0.0,0.0,0
6,1,PAYMENT,7107.77,183195.0,176087.23,0.0,0.0,0
7,1,PAYMENT,7861.64,176087.23,168225.59,0.0,0.0,0
8,1,PAYMENT,4024.36,2671.0,0.0,0.0,0.0,0
9,1,DEBIT,5337.77,41720.0,36382.23,41898.0,40348.79,0


In [25]:
dataset_main_clean = dataset_main.drop_duplicates()

# Check again to make sure the duplicate rows have been removed
duplicates_after = dataset_main_clean.duplicated()
num_duplicates_after = duplicates_after.sum()

In [26]:
from sklearn.preprocessing import LabelEncoder

# Create the LabelEncoder and encode the transaction type as integers.
# Assigning the column directly lets it take the encoder's integer dtype.
label_encoder = LabelEncoder()
dataset_main_clean['type'] = label_encoder.fit_transform(dataset_main_clean['type'])

# Show each transaction type and its integer code
print("Transaction types and their codes:")
for code, method in enumerate(label_encoder.classes_):
    print(f"{method}: {code}")

Transaction types and their codes:
CASH_IN: 0
CASH_OUT: 1
DEBIT: 2
PAYMENT: 3
TRANSFER: 4
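The codes above follow from the fact that `LabelEncoder` numbers the distinct values in sorted order. A minimal plain-Python sketch of the same mapping, using a made-up sample of the `type` column (the `types` list is hypothetical, not taken from the dataset):

```python
# Hypothetical sample of the `type` column, for illustration only
types = ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN", "PAYMENT"]

# Sort the distinct values, then number them from 0
classes = sorted(set(types))
mapping = {name: code for code, name in enumerate(classes)}

# Replace every value with its integer code
encoded = [mapping[t] for t in types]
print(mapping)
print(encoded)
```

The integers carry an ordering (`CASH_IN` < `TRANSFER`) that has no meaning for transaction types, which is worth keeping in mind when they are later fed to distance-based steps such as clustering.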


## Handle Imbalanced dataset

#### Optimize Data Types

In [27]:
# Downcast the monetary columns from float64 to float32 to reduce memory use.
# astype with a mapping returns a frame whose columns carry the new dtype.
float_columns = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
dataset_main_clean = dataset_main_clean.astype({col: np.float32 for col in float_columns})

In [28]:
dataset_main_clean.head(10)

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,1,3,9839.639648,170136.0,160296.359375,0.0,0.0,0
1,1,3,1864.280029,21249.0,19384.720703,0.0,0.0,0
2,1,4,181.0,181.0,0.0,0.0,0.0,1
3,1,1,181.0,181.0,0.0,21182.0,0.0,1
4,1,3,11668.139648,41554.0,29885.859375,0.0,0.0,0
5,1,3,7817.709961,53860.0,46042.289062,0.0,0.0,0
6,1,3,7107.77002,183195.0,176087.234375,0.0,0.0,0
7,1,3,7861.640137,176087.234375,168225.59375,0.0,0.0,0
8,1,3,4024.360107,2671.0,0.0,0.0,0.0,0
9,1,2,5337.77002,41720.0,36382.230469,41898.0,40348.789062,0


#### Train data

In [29]:
X = dataset_main_clean.drop(columns=['isFraud'])
y = dataset_main_clean['isFraud']
print(y.value_counts())

isFraud
0    6353880
1       8197
Name: count, dtype: int64
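The counts printed above show how skewed the classes are. The short calculation below uses only those two numbers; the variable names are chosen for this example.

```python
# Class counts copied from the value_counts() output above
n_legit = 6353880
n_fraud = 8197
n_total = n_legit + n_fraud

# Share of fraudulent rows and the majority-to-minority ratio
fraud_share = n_fraud / n_total
imbalance_ratio = n_legit / n_fraud

# Accuracy of a trivial model that always predicts "not fraud"
baseline_accuracy = n_legit / n_total

print(f"fraud share: {fraud_share:.4%}")
print(f"majority : minority = {imbalance_ratio:.0f} : 1")
print(f"always-0 baseline accuracy: {baseline_accuracy:.4f}")
```

Because a model that never predicts fraud would already score about 0.9987 accuracy, precision and recall on class 1 are the more informative numbers in the evaluation at the end.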


In [30]:
import warnings
# Suppress warnings for clean output
warnings.filterwarnings('ignore')

In [31]:
def tomek_links_undersampling(X, y):
    tl = TomekLinks(sampling_strategy='auto')
    X_res, y_res = tl.fit_resample(X, y)
    return X_res, y_res
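For readers unfamiliar with the term, a Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below finds such pairs by brute force in plain Python; it only illustrates the definition and is not how `imblearn` implements it, and the function name and toy data are assumptions.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def nearest(i):
        # Brute-force nearest neighbour of point i (excluding itself)
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]

# Toy 1-D data: index 2 is the only minority (label 1) sample
points = [(0.0,), (1.0,), (1.2,), (5.0,), (5.1,)]
labels = [0, 0, 1, 0, 0]
links = tomek_links(points, labels)
print(links)  # [(1, 2)]
```

Under-sampling then drops the majority-class member of each link (index 1 here), which matches the counts reported further down: the majority class shrinks slightly while the number of fraud samples stays the same.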


In [32]:
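def agglomerative_clustering_borderline_smote(X, y, n_clusters=10, affinity='euclidean', linkage='ward', batch_size=5000, min_samples=6):
    """Cluster each batch of rows, then apply Borderline-SMOTE inside every mixed-class cluster."""
    # NOTE: `affinity` and `linkage` are accepted but never forwarded, so the clusterer
    # runs with scikit-learn's defaults (Ward linkage on Euclidean distances).
    clusterer = AgglomerativeClustering(n_clusters=n_clusters)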
    
    X_res = []
    y_res = []
    
    # Cluster the data batch by batch; cluster ids restart at 0 in every batch
    for start in range(0, X.shape[0], batch_size):
        end = min(start + batch_size, X.shape[0])
        X_batch = X[start:end]
        y_batch = y[start:end]
        
        cluster_labels = clusterer.fit_predict(X_batch)
        
        for cluster in np.unique(cluster_labels):
            X_cluster = X_batch[cluster_labels == cluster]
            y_cluster = y_batch[cluster_labels == cluster]
            
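            if len(X_cluster) >= min_samples:
                if len(set(y_cluster)) > 1:
                    # NOTE: k_neighbors is capped by the cluster size, not by the number of
                    # minority samples; a cluster with too few minority samples can make
                    # fit_resample raise the ValueError handled below.
                    k_neighbors = min(len(X_cluster) - 1, 5)
                    sm = BorderlineSMOTE(sampling_strategy='minority', k_neighbors=k_neighbors, kind='borderline-1')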
                    
                    try:
                        X_resampled, y_resampled = sm.fit_resample(X_cluster, y_cluster)
                        X_res.append(X_resampled)
                        y_res.append(y_resampled)
                    except ValueError as e:
                        print(f"Skipping cluster {cluster} due to error: {e}")
                        X_res.append(X_cluster)
                        y_res.append(y_cluster)
                else:
                    X_res.append(X_cluster)
                    y_res.append(y_cluster)
            else:
                print(f"Skipping cluster {cluster} due to insufficient samples: {len(X_cluster)}")
                X_res.append(X_cluster)
                y_res.append(y_cluster)
    
    X_res = np.vstack(X_res)
    y_res = np.hstack(y_res)

    return X_res, y_res
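When the function is run, some clusters are skipped with `Expected n_neighbors <= n_samples_fit`. The message suggests that the neighbour search is fitted on the minority samples only and asks for `k_neighbors + 1` of them, so one possible remedy is to cap `k_neighbors` by the minority count instead of the cluster size. The helper below is a hypothetical sketch of that idea, not part of the notebook, and it does not address the separate `m_neighbors` setting of `BorderlineSMOTE`.

```python
from collections import Counter

def safe_k_neighbors(y_cluster, k_max=5):
    """Largest usable k_neighbors for a cluster, or 0 if oversampling is not possible.

    Hypothetical helper: assumes the neighbour search needs k_neighbors + 1
    minority samples, as the error messages in the notebook output suggest.
    """
    n_minority = min(Counter(y_cluster).values())
    return max(min(n_minority - 1, k_max), 0)

# A cluster with 20 majority and 4 minority samples: at most 3 neighbours
print(safe_k_neighbors([0] * 20 + [1] * 4))  # 3

# A single minority sample has no minority neighbours: skip oversampling
print(safe_k_neighbors([0] * 20 + [1]))      # 0

# Plenty of minority samples: the usual cap applies
print(safe_k_neighbors([0] * 50 + [1] * 12)) # 5
```

A caller could skip Borderline-SMOTE for a cluster whenever the helper returns 0 and pass the returned value as `k_neighbors` otherwise. Doing so would change which clusters get oversampled, so the counts and metrics reported in this notebook would no longer apply unchanged.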

In [33]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [34]:
# Step 1: Tomek Links Undersampling
X_tomek, y_tomek = tomek_links_undersampling(X_train, y_train)
print('After Tomek Links Undersampling:', Counter(y_tomek))

After Tomek Links Undersampling: Counter({0: 5082173, 1: 6558})


In [35]:
X_agglomerative_smote, y_agglomerative_smote = agglomerative_clustering_borderline_smote(
    X_tomek, y_tomek, n_clusters=15, affinity='manhattan', linkage='average', batch_size=5000
)
print('After Agglomerative Clustering Borderline SMOTE:', Counter(y_agglomerative_smote))

Skipping cluster 9 due to insufficient samples: 1
Skipping cluster 10 due to insufficient samples: 1
Skipping cluster 1 due to insufficient samples: 5
Skipping cluster 6 due to error: Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 4, n_samples = 4
Skipping cluster 11 due to insufficient samples: 1
Skipping cluster 12 due to error: Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 5, n_samples = 1
Skipping cluster 13 due to insufficient samples: 5
Skipping cluster 4 due to insufficient samples: 2
Skipping cluster 8 due to insufficient samples: 4
Skipping cluster 9 due to error: Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 3, n_samples = 2
Skipping cluster 13 due to insufficient samples: 1
Skipping cluster 3 due to error: Expected n_neighbors <= n_samples_fit, but n_neighbors = 6, n_samples_fit = 5, n_samples = 2
Skipping cluster 7 due to insufficient samples: 1
Skipping cluster 8 due to error: Expected

## Test model

In [36]:
# Train the model on the resampled training set
clf = RandomForestClassifier(random_state=42)
clf.fit(X_agglomerative_smote, y_agglomerative_smote)

# Predict on the untouched test set
y_pred = clf.predict(X_test)

# Print the evaluation metrics
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))

Confusion Matrix:
[[1270460     317]
 [    232    1407]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270777
           1       0.82      0.86      0.84      1639

    accuracy                           1.00   1272416
   macro avg       0.91      0.93      0.92   1272416
weighted avg       1.00      1.00      1.00   1272416


Accuracy Score:
0.9995685373337022
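The class-1 figures in the report can be re-derived from the confusion matrix by hand. The snippet below uses only the four numbers printed above; the variable names are chosen for this example.

```python
# Confusion matrix entries copied from the output above (rows = true class)
tn, fp = 1270460, 317
fn, tp = 232, 1407

# Precision: share of predicted frauds that really are fraud
precision = tp / (tp + fp)
# Recall: share of real frauds that the model caught
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.6f}")
```

The values round to 0.82, 0.86, and 0.84, matching the class-1 row of the report. The accuracy is dominated by the 1,270,460 correctly classified legitimate transactions, while 232 of the 1,639 frauds in the test set are still missed.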
