Reference:
- Guided project: https://app.dataquest.io/c/135/m/745/guided-project%3A-credit-card-customer-segmentation/
- Solution: https://github.com/dataquestio/solutions/blob/master/Mission745Solutions.ipynb
- Solution (nbviewer): https://nbviewer.org/github/dataquestio/solutions/blob/master/Mission745Solutions.ipynb
- Solutions repository: https://github.com/dataquestio/solutions

We’ll play the role of a data scientist working for a credit card company. We’ve been given a dataset containing information about the company’s clients and asked to help segment them into different groups in order to apply different business strategies for each type of customer.

The company expects a group assignment for each client, along with an explanation of each group's characteristics and the main points that differentiate the groups.

Column descriptions for df:
- customer_id, age, gender
- dependent_count: number of dependents of each customer.
- education_level
- marital_status: marital status ("Single", "Married", etc.).
- estimated_income: the estimated income for the customer projected by the data science team.
- months_on_book: time as a customer in months.
- total_relationship_count: number of times the customer contacted the company.
- months_inactive_12_mon: number of months the customer did not use the credit card in the last 12 months.
- credit_limit: customer's credit limit.
- total_trans_amount: (total transaction amount) the overall amount of money spent on the card by the customer.
- total_trans_count: (total number of transactions) the overall number of times the customer used the card.
- avg_utilization_ratio: daily average utilization ratio.

Use a GCN (graph convolutional network) for clustering: credit card customer segmentation.

In [6]:
import pandas as pd
from sklearn.cluster import KMeans


customers_df = pd.read_csv('customer_segmentation.csv')
customers_df
# Familiarize ourselves with the dataset

## Feature Engineering: dealing with the 3 categorical variables

### work on a copy of the data (`customer_id` is dropped later, after scaling)
customers_copy_df = customers_df.copy()

### transform the `gender` and `education_level` columns to numeric
customers_copy_df['gender'] = customers_copy_df['gender'].replace({'M': 1, 'F': 0})

education_level_mapping = {
    'Uneducated': 0,
    'High School': 1,
    'College': 2,
    'Graduate': 3,
    'Post-Graduate': 4,
    'Doctorate': 5
}
customers_copy_df['education_level'] = customers_copy_df['education_level'].map(education_level_mapping)

### `marital_status` has no natural order ("Single", "Married", "Divorced"), so one-hot encode it into dummy variables
marital_status_dummies = pd.get_dummies(customers_copy_df['marital_status'])
marital_status_dummies = marital_status_dummies.astype(int)
preprocessed_customers_df = pd.concat([customers_copy_df, marital_status_dummies], axis=1).drop('marital_status', axis=1)
preprocessed_customers_df

## Scaling the Data: transform the data so it's all on the same scale
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(preprocessed_customers_df)  # learn each column's mean and standard deviation
scaled_customers_array = scaler.transform(preprocessed_customers_df)  # apply the scaling; the result is a NumPy array
scaled_customers_df = pd.DataFrame(scaled_customers_array, columns=preprocessed_customers_df.columns)
scaled_customers_df
scaled_customers_dropid_df = scaled_customers_df.drop('customer_id', axis=1)

chosen_k = 6  # based on the elbow curve
kmeans_final = KMeans(n_clusters=chosen_k, random_state=42)

cluster_labels_final = kmeans_final.fit_predict(scaled_customers_dropid_df)
print(cluster_labels_final) # array containing cluster's labels



[5 2 5 ... 1 3 1]
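
The cell above picks `chosen_k = 6` from an elbow curve that is not shown in this export. A minimal sketch of how such a curve is typically produced: fit `KMeans` for a range of k and record `inertia_`. The data here is synthetic (`make_blobs`), standing in for `scaled_customers_dropid_df`, so the numbers say nothing about the real customers.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled customer matrix (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)

inertias = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k_values` and looking for the point where the decrease flattens gives a candidate value for k.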


In [11]:
# Create a CLUSTER column in the preprocessed dataset to better understand the characteristics of each cluster.
clustered_preprocessed_customers_df = preprocessed_customers_df.assign(CLUSTER = cluster_labels_final + 1) # shift labels to start from 1 instead of 0
clustered_preprocessed_customers_df.head(20)


Unnamed: 0,customer_id,age,gender,dependent_count,education_level,estimated_income,months_on_book,total_relationship_count,months_inactive_12_mon,credit_limit,total_trans_amount,total_trans_count,avg_utilization_ratio,Divorced,Married,Single,Unknown,CLUSTER
0,768805383,45,1,3,1,69000,39,5,1,12691.0,1144,42,0.061,0,1,0,0,6
1,818770008,49,0,5,3,24000,44,6,1,8256.0,1291,33,0.105,0,0,1,0,3
2,713982108,51,1,3,3,93000,36,4,1,3418.0,1887,20,0.0,0,1,0,0,6
3,769911858,40,0,4,1,37000,34,3,4,3313.0,1171,20,0.76,0,0,0,1,4
4,709106358,40,1,3,0,65000,21,5,1,4716.0,816,28,0.0,0,1,0,0,2
5,713061558,44,1,2,3,54000,36,3,1,4010.0,1088,24,0.311,0,1,0,0,2
6,810347208,51,1,4,1,166000,46,6,1,34516.0,1330,31,0.066,0,1,0,0,6
7,818906208,32,1,0,1,66000,27,2,2,29081.0,1538,36,0.048,0,0,0,1,4
8,710930508,37,1,3,0,77000,36,5,2,22352.0,1350,24,0.113,0,0,1,0,6
9,719661558,48,1,2,3,87000,36,6,3,11656.0,1441,32,0.144,0,0,1,0,6


In [20]:
clustered_preprocessed_customers_df

Unnamed: 0,customer_id,age,gender,dependent_count,education_level,estimated_income,months_on_book,total_relationship_count,months_inactive_12_mon,credit_limit,total_trans_amount,total_trans_count,avg_utilization_ratio,Divorced,Married,Single,Unknown,CLUSTER
0,768805383,45,1,3,1,69000,39,5,1,12691.0,1144,42,0.061,0,1,0,0,6
1,818770008,49,0,5,3,24000,44,6,1,8256.0,1291,33,0.105,0,0,1,0,3
2,713982108,51,1,3,3,93000,36,4,1,3418.0,1887,20,0.000,0,1,0,0,6
3,769911858,40,0,4,1,37000,34,3,4,3313.0,1171,20,0.760,0,0,0,1,4
4,709106358,40,1,3,0,65000,21,5,1,4716.0,816,28,0.000,0,1,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,772366833,50,1,2,3,51000,40,3,2,4003.0,15476,117,0.462,0,0,1,0,1
10123,710638233,41,1,2,3,40000,25,4,2,4277.0,8764,69,0.511,1,0,0,0,5
10124,716506083,44,0,1,1,33000,36,5,3,5409.0,10291,60,0.000,0,1,0,0,2
10125,717406983,30,1,2,3,47000,36,4,3,5281.0,8395,62,0.000,0,0,0,1,4
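
To describe what distinguishes the groups, one common approach is to average the numeric columns per cluster. A minimal sketch with pandas, using only the first ten rows displayed above and a subset of columns rather than the full dataset, so the resulting means are illustrative only:

```python
import pandas as pd

# First ten rows from the output above (subset of columns).
sample = pd.DataFrame({
    "age":                [45, 49, 51, 40, 40, 44, 51, 32, 37, 48],
    "estimated_income":   [69000, 24000, 93000, 37000, 65000,
                           54000, 166000, 66000, 77000, 87000],
    "credit_limit":       [12691.0, 8256.0, 3418.0, 3313.0, 4716.0,
                           4010.0, 34516.0, 29081.0, 22352.0, 11656.0],
    "total_trans_amount": [1144, 1291, 1887, 1171, 816,
                           1088, 1330, 1538, 1350, 1441],
    "CLUSTER":            [6, 3, 6, 4, 2, 2, 6, 4, 6, 6],
})

# Mean of each feature per cluster, plus the number of customers in each cluster.
profile = sample.groupby("CLUSTER").mean()
profile["n_customers"] = sample.groupby("CLUSTER").size()
print(profile)
```

On the full `clustered_preprocessed_customers_df`, the same `groupby("CLUSTER")` call would give one row per cluster to compare.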


In [13]:
import networkx as nx
import pandas as pd


G = nx.Graph() # Create graph object

# Add nodes to the graph
for _, row in clustered_preprocessed_customers_df.head(10).iterrows():  # `_` is the row index, which we don't use
    customer_id = row['customer_id']
    G.add_node(customer_id)

    G.nodes[customer_id].update(row.to_dict()) # Add customer attributes to the node
# Create edges between nodes based on the credit_limit column
for u, u_data in G.nodes(data=True):
    for v, v_data in G.nodes(data=True):
        if u != v and abs(u_data['credit_limit'] - v_data['credit_limit']) <= 5000:
            G.add_edge(u, v)


print("Nodes:", G.number_of_nodes())
print("Edges:", G.number_of_edges())


Nodes: 10
Edges: 13
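
The nested loop above visits every ordered pair, so each candidate edge is tested twice. A minimal sketch of the same thresholding rule using `itertools.combinations`, run on the ten `credit_limit` values shown in the table above; the helper name `build_threshold_edges` is hypothetical. It also lists rows that end up with no neighbours, which is worth knowing because a degree of zero makes division by the degree undefined.

```python
from itertools import combinations

def build_threshold_edges(values, threshold):
    """Return index pairs (i, j), i < j, whose values differ by at most `threshold`."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(values), 2)
        if abs(a - b) <= threshold
    ]

# credit_limit of the first ten customers, copied from the table above
credit_limits = [12691.0, 8256.0, 3418.0, 3313.0, 4716.0,
                 4010.0, 34516.0, 29081.0, 22352.0, 11656.0]

edges = build_threshold_edges(credit_limits, threshold=5000)

# Rows that never appear in an edge have degree 0
connected = {i for edge in edges for i in edge}
isolated = [i for i in range(len(credit_limits)) if i not in connected]

print("Edges:", len(edges))        # Edges: 13
print("Isolated rows:", isolated)  # Isolated rows: [6, 7, 8]
```

The edge count matches the 13 reported by the graph, and rows 6, 7 and 8 (credit limits 34516, 29081 and 22352) have no neighbour within 5000.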


In [32]:

# Build the feature matrix from the data
node_features = []
for u, u_data in G.nodes(data=True):
    node_features.append(list(u_data.values())) # Append the node's attribute list to the feature matrix

print("Node features:", node_features)

labels = []
for u, u_data in G.nodes(data=True):
    labels.append(u_data['CLUSTER']) # Append the node's label to the label list

print("Labels:", labels)


for u, u_data in G.nodes(data=True):
    print(len(u_data))

Node features: [[768805383.0, 45.0, 1.0, 3.0, 1.0, 69000.0, 39.0, 5.0, 1.0, 12691.0, 1144.0, 42.0, 0.061, 0.0, 1.0, 0.0, 0.0, 6.0], [818770008.0, 49.0, 0.0, 5.0, 3.0, 24000.0, 44.0, 6.0, 1.0, 8256.0, 1291.0, 33.0, 0.105, 0.0, 0.0, 1.0, 0.0, 3.0], [713982108.0, 51.0, 1.0, 3.0, 3.0, 93000.0, 36.0, 4.0, 1.0, 3418.0, 1887.0, 20.0, 0.0, 0.0, 1.0, 0.0, 0.0, 6.0], [769911858.0, 40.0, 0.0, 4.0, 1.0, 37000.0, 34.0, 3.0, 4.0, 3313.0, 1171.0, 20.0, 0.76, 0.0, 0.0, 0.0, 1.0, 4.0], [709106358.0, 40.0, 1.0, 3.0, 0.0, 65000.0, 21.0, 5.0, 1.0, 4716.0, 816.0, 28.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0], [713061558.0, 44.0, 1.0, 2.0, 3.0, 54000.0, 36.0, 3.0, 1.0, 4010.0, 1088.0, 24.0, 0.311, 0.0, 1.0, 0.0, 0.0, 2.0], [810347208.0, 51.0, 1.0, 4.0, 1.0, 166000.0, 46.0, 6.0, 1.0, 34516.0, 1330.0, 31.0, 0.066, 0.0, 1.0, 0.0, 0.0, 6.0], [818906208.0, 32.0, 1.0, 0.0, 1.0, 66000.0, 27.0, 2.0, 2.0, 29081.0, 1538.0, 36.0, 0.048, 0.0, 0.0, 0.0, 1.0, 4.0], [710930508.0, 37.0, 1.0, 3.0, 0.0, 77000.0, 36.0, 5.0, 2.0, 22352.

In [33]:
# Build the adjacency matrix and the adjacency list, two ways to represent this graph data
adjacency_matrix = nx.adjacency_matrix(G)
adjacency_list = nx.to_dict_of_lists(G)
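
For reference, the two representations hold the same information in different shapes. A minimal pure-Python sketch on a hypothetical four-node graph (the edge list, node count and both helper names are made up for illustration):

```python
def to_adjacency_list(num_nodes, edges):
    """Map each node to the sorted list of its neighbours (undirected graph)."""
    neighbours = {node: [] for node in range(num_nodes)}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    return {node: sorted(adj) for node, adj in neighbours.items()}

def to_adjacency_matrix(num_nodes, edges):
    """Build a dense symmetric 0/1 matrix as a list of rows."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix

# Hypothetical toy graph: edges 0-1 and 1-2, and node 3 with no edges.
toy_edges = [(0, 1), (1, 2)]
adj_list = to_adjacency_list(4, toy_edges)
adj_matrix = to_adjacency_matrix(4, toy_edges)

print(adj_list)    # {0: [1], 1: [0, 2], 2: [1], 3: []}
print(adj_matrix)  # [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
```

The matrix form is what a GCN layer consumes. Note that `nx.adjacency_matrix` returns a SciPy sparse object, so it has to be converted to a dense array before being passed to `torch.tensor`.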

In [34]:
'''
2. Build the GCN model: next, we need to build a GCN model.
 The model consists of GCN layers; each layer can have its own parameters, such as the number of features
   and the activation function.
   A common model architecture follows these steps:
   1) Define the input GCN layer,
   2) Apply GCN layers sequentially to encode the information in the graph, and
   3) Evaluate the final result.'''

import torch
from torch import nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super(GCNLayer, self).__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, adjacency_matrix, node_features):
        # Sum the edge weights for each node (its degree)
        degree_matrix = torch.sum(adjacency_matrix, dim=1)

        # Row-normalize the adjacency matrix (a node with degree 0 yields 0/0 = NaN)
        normalized_adjacency_matrix = torch.div(adjacency_matrix, degree_matrix.unsqueeze(1))

        # Compute the graph convolution
        convolution = self.linear(normalized_adjacency_matrix @ node_features)

        # Apply the activation function
        activation = F.relu(convolution)
        
        return activation

class GCNModel(nn.Module):
    def __init__(self, num_features, hidden_dim, num_classes):
        super(GCNModel, self).__init__()
        self.layer1 = GCNLayer(num_features, hidden_dim)
        self.layer2 = GCNLayer(hidden_dim, num_classes)

    def forward(self, adjacency_matrix, node_features):
        output1 = self.layer1(adjacency_matrix, node_features)
        output2 = self.layer2(adjacency_matrix, output1)

        return output2
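
Written out, each `GCNLayer` above computes the first expression below. The symbols are notation introduced here, not names from the code: $A$ is the adjacency matrix, $D$ the diagonal degree matrix, $H$ the input node features, and $W$, $b$ the weight and bias of `nn.Linear`. The second line is the widely used propagation rule from Kipf and Welling's GCN formulation, shown for comparison; it adds self-loops and normalizes symmetrically.

```latex
% This notebook's layer (row normalization, no self-loops)
H' = \mathrm{ReLU}\!\left(D^{-1} A\, H\, W^{\top} + b\right),
\qquad D_{ii} = \sum_j A_{ij}

% Common GCN rule (self-loops, symmetric normalization)
H' = \sigma\!\left(\tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, H\, W\right),
\qquad \tilde{A} = A + I,
\qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}
```

In the first form, $D^{-1}$ is undefined for any node with $D_{ii} = 0$. In the second, every $\tilde{D}_{ii} \geq 1$ because of the self-loop. Note also that `GCNModel` applies ReLU in its last layer as well, so the class scores passed to the loss are clamped to be non-negative.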

In [43]:
import pandas as pd
import networkx as nx
import torch
from torch import nn
import torch.nn.functional as F
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Read the data from the CSV file
customers_df = pd.read_csv('customer_segmentation.csv')

# Preprocess the data: convert the categorical variables to numbers and one-hot encode
customers_copy_df = customers_df.copy()

customers_copy_df['gender'] = customers_copy_df['gender'].replace({'M': 1, 'F': 0})

education_level_mapping = {
    'Uneducated': 0,
    'High School': 1,
    'College': 2,
    'Graduate': 3,
    'Post-Graduate': 4,
    'Doctorate': 5
}
customers_copy_df['education_level'] = customers_copy_df['education_level'].map(education_level_mapping)

marital_status_dummies = pd.get_dummies(customers_copy_df['marital_status'])
marital_status_dummies = marital_status_dummies.astype(int)
preprocessed_customers_df = pd.concat([customers_copy_df, marital_status_dummies], axis=1).drop('marital_status', axis=1)

# Standardize the data
scaler = StandardScaler()
scaler.fit(preprocessed_customers_df)
scaled_customers_array = scaler.transform(preprocessed_customers_df)
scaled_customers_df = pd.DataFrame(scaled_customers_array, columns=preprocessed_customers_df.columns)
scaled_customers_dropid_df = scaled_customers_df.drop('customer_id', axis=1)

# Choose the number of clusters from the elbow curve (elbow method)
chosen_k = 6
kmeans_final = KMeans(n_clusters=chosen_k, random_state=42)
cluster_labels_final = kmeans_final.fit_predict(scaled_customers_dropid_df)

# Add a 'CLUSTER' column to the original data to better understand each cluster's characteristics
clustered_preprocessed_customers_df = preprocessed_customers_df.assign(CLUSTER=cluster_labels_final + 1)

# Build the graph and extract features from the data
G = nx.Graph()
for _, row in clustered_preprocessed_customers_df.head(10).iterrows():
    customer_id = row['customer_id']
    G.add_node(customer_id)
    G.nodes[customer_id].update(row.to_dict())

for u, u_data in G.nodes(data=True):
    for v, v_data in G.nodes(data=True):
        if u != v and abs(u_data['credit_limit'] - v_data['credit_limit']) <= 5000:
            G.add_edge(u, v)

node_features = []
for u, u_data in G.nodes(data=True):
    node_features.append(list(u_data.values()))

labels = []
for u, u_data in G.nodes(data=True):
    labels.append(u_data['CLUSTER'])

# Build the GCN model
class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super(GCNLayer, self).__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, adjacency_matrix, node_features):
        degree_matrix = torch.sum(adjacency_matrix, dim=1)
        normalized_adjacency_matrix = torch.div(adjacency_matrix, degree_matrix.unsqueeze(1))
        convolution = self.linear(normalized_adjacency_matrix @ node_features)
        activation = F.relu(convolution)
        return activation

class GCNModel(nn.Module):
    def __init__(self, num_features, hidden_dim, num_classes):
        super(GCNModel, self).__init__()
        self.layer1 = GCNLayer(num_features, hidden_dim)
        self.layer2 = GCNLayer(hidden_dim, num_classes)

    def forward(self, adjacency_matrix, node_features):
        output1 = self.layer1(adjacency_matrix, node_features)
        output2 = self.layer2(adjacency_matrix, output1)
        return output2

# Instantiate the GCN model
num_features = len(node_features[0])
hidden_dim = 32
num_classes = chosen_k

gcn_model = GCNModel(num_features, hidden_dim, num_classes)

# Convert the data to PyTorch tensors
adjacency_matrix = nx.to_numpy_matrix(G)
node_features_tensor = torch.tensor(node_features, dtype=torch.float)
adjacency_matrix_tensor = torch.tensor(adjacency_matrix, dtype=torch.float)

# Train the GCN model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(gcn_model.parameters(), lr=0.01)

epochs = 100
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = gcn_model(adjacency_matrix_tensor, node_features_tensor)
    loss = criterion(outputs, torch.tensor(labels, dtype=torch.long))
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")

# Evaluate the final result
with torch.no_grad():
    gcn_model.eval()
    outputs = gcn_model(adjacency_matrix_tensor, node_features_tensor)
    _, predicted_labels = torch.max(outputs, 1)
    accuracy = torch.sum(predicted_labels == torch.tensor(labels, dtype=torch.long)) / len(labels)
    print(f"Final model accuracy: {accuracy.item()}")




AttributeError: module 'networkx' has no attribute 'to_numpy_matrix'

In [45]:
import pandas as pd
import networkx as nx
import torch
from torch import nn
import torch.nn.functional as F
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Read the data from the CSV file
customers_df = pd.read_csv('customer_segmentation.csv')

# Preprocess the data: convert the categorical variables to numbers and one-hot encode
customers_copy_df = customers_df.copy()

customers_copy_df['gender'] = customers_copy_df['gender'].replace({'M': 1, 'F': 0})

education_level_mapping = {
    'Uneducated': 0,
    'High School': 1,
    'College': 2,
    'Graduate': 3,
    'Post-Graduate': 4,
    'Doctorate': 5
}
customers_copy_df['education_level'] = customers_copy_df['education_level'].map(education_level_mapping)

marital_status_dummies = pd.get_dummies(customers_copy_df['marital_status'])
marital_status_dummies = marital_status_dummies.astype(int)
preprocessed_customers_df = pd.concat([customers_copy_df, marital_status_dummies], axis=1).drop('marital_status', axis=1)

# Standardize the data
scaler = StandardScaler()
scaler.fit(preprocessed_customers_df)
scaled_customers_array = scaler.transform(preprocessed_customers_df)
scaled_customers_df = pd.DataFrame(scaled_customers_array, columns=preprocessed_customers_df.columns)
scaled_customers_dropid_df = scaled_customers_df.drop('customer_id', axis=1)

# Choose the number of clusters from the elbow curve (elbow method)
chosen_k = 6
kmeans_final = KMeans(n_clusters=chosen_k, random_state=42)
cluster_labels_final = kmeans_final.fit_predict(scaled_customers_dropid_df)

# Add a 'CLUSTER' column to the original data to better understand each cluster's characteristics
clustered_preprocessed_customers_df = preprocessed_customers_df.assign(CLUSTER=cluster_labels_final + 1)

# Build the graph and extract features from the data
G = nx.Graph()
for _, row in clustered_preprocessed_customers_df.head(10).iterrows():
    customer_id = row['customer_id']
    G.add_node(customer_id)
    G.nodes[customer_id].update(row.to_dict())

for u, u_data in G.nodes(data=True):
    for v, v_data in G.nodes(data=True):
        if u != v and abs(u_data['credit_limit'] - v_data['credit_limit']) <= 5000:
            G.add_edge(u, v)

node_features = []
for u, u_data in G.nodes(data=True):
    node_features.append(list(u_data.values()))

labels = []
for u, u_data in G.nodes(data=True):
    labels.append(u_data['CLUSTER'])




In [52]:

# Build the GCN model
class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super(GCNLayer, self).__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, adjacency_matrix, node_features):
        degree_matrix = torch.sum(adjacency_matrix, dim=1)
        normalized_adjacency_matrix = torch.div(adjacency_matrix, degree_matrix.unsqueeze(1))
        convolution = self.linear(normalized_adjacency_matrix @ node_features)
        activation = F.relu(convolution)
        return activation

class GCNModel(nn.Module):
    def __init__(self, num_features, hidden_dim, num_classes):
        super(GCNModel, self).__init__()
        self.layer1 = GCNLayer(num_features, hidden_dim)
        self.layer2 = GCNLayer(hidden_dim, num_classes)

    def forward(self, adjacency_matrix, node_features):
        output1 = self.layer1(adjacency_matrix, node_features)
        output2 = self.layer2(adjacency_matrix, output1)
        return output2
# Shift the labels into the range 0 to 5, since CrossEntropyLoss expects zero-based class indices
labels = [label - 1 for label in labels]

# Instantiate the GCN model
num_features = len(node_features[0])
hidden_dim = 32
num_classes = chosen_k

gcn_model = GCNModel(num_features, hidden_dim, num_classes)


In [53]:

# Convert the data to PyTorch tensors
adjacency_matrix = nx.to_numpy_array(G)  # dense NumPy array; replaces nx.to_numpy_matrix, which was removed in NetworkX 3.0
node_features_tensor = torch.tensor(node_features, dtype=torch.float)
adjacency_matrix_tensor = torch.tensor(adjacency_matrix, dtype=torch.float)


In [54]:

# Train the GCN model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(gcn_model.parameters(), lr=0.01)

epochs = 100
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = gcn_model(adjacency_matrix_tensor, node_features_tensor)
    loss = criterion(outputs, torch.tensor(labels, dtype=torch.long))
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")

# Evaluate the final result
with torch.no_grad():
    gcn_model.eval()
    outputs = gcn_model(adjacency_matrix_tensor, node_features_tensor)
    _, predicted_labels = torch.max(outputs, 1)
    accuracy = torch.sum(predicted_labels == torch.tensor(labels, dtype=torch.long)) / len(labels)
    print(f"Final model accuracy: {accuracy.item()}")


Epoch 1/100, Loss: nan
Epoch 11/100, Loss: nan
Epoch 21/100, Loss: nan
Epoch 31/100, Loss: nan
Epoch 41/100, Loss: nan
Epoch 51/100, Loss: nan
Epoch 61/100, Loss: nan
Epoch 71/100, Loss: nan
Epoch 81/100, Loss: nan
Epoch 91/100, Loss: nan
Final model accuracy: 0.20000000298023224
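
A likely cause of the `nan` loss is the row normalization in `GCNLayer`. In the ten-row sample, three customers (credit limits 34516, 29081 and 22352) are more than 5000 away from every other credit limit, so they have degree 0 and `adjacency / degree` evaluates 0/0 for their rows. The sketch below reproduces this with NumPy on a hypothetical three-node graph and shows one common remedy: adding self-loops before a symmetric normalization. The function name `normalize_with_self_loops` is an assumption for illustration, not part of the notebook.

```python
import numpy as np

def normalize_with_self_loops(adjacency):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    a_tilde = adjacency + np.eye(adjacency.shape[0])
    degree = a_tilde.sum(axis=1)          # at least 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Hypothetical graph: nodes 0 and 1 are connected, node 2 is isolated.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0]])

# Row normalization as in the notebook's layer: 0/0 for the isolated node.
with np.errstate(divide="ignore", invalid="ignore"):
    row_normalized = adjacency / adjacency.sum(axis=1, keepdims=True)

print(np.isnan(row_normalized).any())                        # True
print(np.isnan(normalize_with_self_loops(adjacency)).any())  # False
```

The same arithmetic carries over to torch tensors. Separately, `node_features` is built from the unscaled rows and includes both `customer_id` and the `CLUSTER` label itself, so even with finite values the reported accuracy would probably not be a meaningful measure. Using the scaled feature columns without those two fields would be a reasonable next step.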
