# Classification Example: Credit Card Fraud

## Fundamentals: Logistic Regression and Cross-Entropy

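* Linear regression predicts a continuous value.
* Logistic regression gives a `yes` or `no` answer, in the form of the probability of `yes`.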

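**Sigmoid function:**
* `y = 1 / (1 + e^(-x))`, which maps (-∞, +∞) to (0, 1).
* The sigmoid is the cumulative distribution function of the standard logistic distribution: for any input, its output can be read as a probability.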
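
The mapping described above can be sketched in plain Python. This is an illustrative helper, not code from the notebook: the name `sigmoid` and the two-branch form (chosen so that `exp()` is never called on a large positive number) are assumptions made for this example.

```python
import math

def sigmoid(x: float) -> float:
    """Map any real number into the interval (0, 1)."""
    # Two branches so exp() only ever sees a non-positive argument,
    # which avoids OverflowError for inputs of large magnitude.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 gives exactly 0.5.
for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 4))
```

Because `sigmoid(-x) = 1 - sigmoid(x)`, the value 0.5 at `x = 0` is the natural decision threshold between `no` and `yes`.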

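**Loss function for logistic regression**
* `Squared error` penalizes a mistake only on the same scale as the error itself: for probabilities in (0, 1) it can never exceed 1, so even a confidently wrong prediction is penalized mildly.
* For classification it is better to use the `cross-entropy` loss, which produces a much larger "loss" for confident mistakes because it grows without bound as the predicted probability of the true class approaches 0.  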

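    Cross-entropy measures the distance between the actual output (a probability) and the expected output (a probability): the smaller the cross-entropy, the closer the two probability distributions are.  
    Let `distribution p` be the expected output, `distribution q` the actual output, and `H(p, q)` the cross-entropy; then `H(p, q) = -Σ p(x) log q(x)`  

    In PyTorch, `nn.BCELoss()` computes the binary cross-entropy loss (for linear regression, `nn.MSELoss()` computes the mean squared error loss).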
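
To make the comparison between the two losses concrete, here is a minimal plain-Python sketch for a single sample whose true label is 1. The function names and the `eps` clamp are assumptions made for this illustration. The per-sample formula is the two-outcome case of `H(p, q)`, which `nn.BCELoss()` averages over a batch by default.

```python
import math

def binary_cross_entropy(y_true: float, y_prob: float) -> float:
    """H(p, q) for one sample with two outcomes.

    p = (y_true, 1 - y_true) is the expected distribution,
    q = (y_prob, 1 - y_prob) is the predicted one.
    """
    eps = 1e-12  # assumption: small clamp so log() never receives 0
    y_prob = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(y_prob)
             + (1.0 - y_true) * math.log(1.0 - y_prob))

def squared_error(y_true: float, y_prob: float) -> float:
    """Squared difference between label and predicted probability."""
    return (y_true - y_prob) ** 2

# True label is 1; the predictions go from good to confidently wrong.
for y_prob in (0.9, 0.5, 0.1, 0.01):
    print(y_prob,
          round(squared_error(1.0, y_prob), 4),
          round(binary_cross_entropy(1.0, y_prob), 4))
```

As the predicted probability of the true class falls toward 0, the squared error stays below 1 while the cross-entropy keeps growing, so confident mistakes produce a much stronger training signal.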


## Example Code

In [23]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [24]:
# Load the dataset
data = pd.read_csv('./dataset/Credit.csv', header=None)
print(data.head())
print(data.info())

   0      1      2   3   4   5   6     7   8   9   10  11  12   13     14  15
0   0  30.83  0.000   0   0   9   0  1.25   0   0   1   1   0  202    0.0  -1
1   1  58.67  4.460   0   0   8   1  3.04   0   0   6   1   0   43  560.0  -1
2   1  24.50  0.500   0   0   8   1  1.50   0   1   0   1   0  280  824.0  -1
3   0  27.83  1.540   0   0   9   0  3.75   0   0   5   0   0  100    3.0  -1
4   0  20.17  5.625   0   0   9   0  1.71   0   1   0   1   2  120    0.0  -1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 653 entries, 0 to 652
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       653 non-null    int64  
 1   1       653 non-null    float64
 2   2       653 non-null    float64
 3   3       653 non-null    int64  
 4   4       653 non-null    int64  
 5   5       653 non-null    int64  
 6   6       653 non-null    int64  
 7   7       653 non-null    float64
 8   8       653 non-null    int64  
 9   9       653 non

In [25]:
# Split the dataset into X and y
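X = data.iloc[:, :-1] # all rows; the first 15 columns are the features
print(X.head())
print(X.info())
Y = data.iloc[:, -1]  # all rows; the last column is the label
print(Y.unique())
Y = data.iloc[:, -1].replace(-1, 0)  # replace -1 with 0 so that the labels are 0 and 1
# Alternatively, build the mapping explicitly:
# Y = data.iloc[:, -1].map({-1: 0, 1: 1})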
print(Y.unique())
print(Y.value_counts())

   0      1      2   3   4   5   6     7   8   9   10  11  12   13     14
0   0  30.83  0.000   0   0   9   0  1.25   0   0   1   1   0  202    0.0
1   1  58.67  4.460   0   0   8   1  3.04   0   0   6   1   0   43  560.0
2   1  24.50  0.500   0   0   8   1  1.50   0   1   0   1   0  280  824.0
3   0  27.83  1.540   0   0   9   0  3.75   0   0   5   0   0  100    3.0
4   0  20.17  5.625   0   0   9   0  1.71   0   1   0   1   2  120    0.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 653 entries, 0 to 652
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       653 non-null    int64  
 1   1       653 non-null    float64
 2   2       653 non-null    float64
 3   3       653 non-null    int64  
 4   4       653 non-null    int64  
 5   5       653 non-null    int64  
 6   6       653 non-null    int64  
 7   7       653 non-null    float64
 8   8       653 non-null    int64  
 9   9       653 non-null    int64  
 10  10

In [26]:
# Convert the data to PyTorch tensors
X = torch.from_numpy(X.values).float()  # alternatively, .type(torch.FloatTensor) or .type(torch.float32)
Y = torch.from_numpy(Y.values.reshape(-1, 1)).float()
print(X.dtype, Y.dtype)
print(X.shape, Y.shape)

torch.float32 torch.float32
torch.Size([653, 15]) torch.Size([653, 1])


In [13]:
from torch import nn

In [14]:
model = nn.Sequential(
                    nn.Linear(15, 1),
                    nn.Sigmoid()
)

In [None]:
model
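
The model above is a single linear layer followed by a sigmoid, so for one sample it computes `sigmoid(w · x + b)`. The following plain-Python sketch shows that computation. The function name `predict_proba` and the weight, bias, and feature values are hypothetical, and only 3 features are used instead of 15 to keep it short.

```python
import math

def predict_proba(features, weights, bias):
    """Logistic-regression forward pass for one sample: sigmoid(w . x + b)."""
    # Linear part: weighted sum of the features plus the bias.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # Sigmoid squashes the score into a probability.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, for illustration only.
weights = [0.5, -0.25, 0.1]
bias = -0.2
features = [1.0, 2.0, 3.0]

# z = 0.5 - 0.5 + 0.3 - 0.2 = 0.1, so the result is sigmoid(0.1).
print(predict_proba(features, weights, bias))
```

In the notebook's model the weights and bias are the parameters of `nn.Linear(15, 1)`, initialized randomly and then learned during training.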

In [16]:
loss_fn = nn.BCELoss()

In [17]:
opt = torch.optim.SGD(model.parameters(), lr=0.00001)
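
Plain SGD (without momentum) updates every parameter as `param = param - lr * grad`. As a rough sketch of what one such step does for logistic regression, the plain-Python code below performs a single update on one sample. The helper names and the toy values are assumptions for illustration, and it works on one sample at a time, whereas the notebook averages the loss over mini-batches of 16.

```python
import math

def forward(weights, bias, x):
    """Predicted probability sigmoid(w . x + b) for one sample."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross-entropy for one sample."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def sgd_step(weights, bias, x, y, lr):
    """One SGD update on a single sample."""
    p = forward(weights, bias, x)
    error = p - y  # derivative of the BCE loss w.r.t. the pre-sigmoid score
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias

# Toy values, for illustration only.
w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1.0

loss_before = bce(forward(w, b, x), y)
w, b = sgd_step(w, b, x, y, lr=0.1)
loss_after = bce(forward(w, b, x), y)
print(loss_before, loss_after)  # the loss drops after the step
```

The step size scales directly with `lr`, so with a learning rate as small as `0.00001` each update moves the parameters only slightly.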

In [18]:
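batches = 16  # batch size: number of samples per mini-batch
no_of_batches = 653//16  # 40 full batches; the remaining 13 samples are not used in training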

In [20]:
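for epoch in range(1000):
    for batch in range(no_of_batches):
        # Slice out the current mini-batch
        start = batch*batches
        end = start + batches
        x = X[start: end]
        y = Y[start: end]
        y_pred = model(x)          # forward pass: predicted probabilities
        loss = loss_fn(y_pred, y)  # binary cross-entropy against the labels
        opt.zero_grad()            # clear the gradients from the previous step
        loss.backward()            # backpropagate
        opt.step()                 # update the parameters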
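
The slicing arithmetic in the loop above can be checked in isolation. The helper below is a hypothetical sketch, not part of the notebook. It yields the `(start, end)` index pairs for 653 samples in batches of 16, with an option to keep or drop the final short batch.

```python
def batch_ranges(n_samples, batch_size, drop_last=False):
    """Yield (start, end) index pairs for consecutive mini-batches."""
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        if drop_last and end - start < batch_size:
            return  # skip the final, incomplete batch
        yield start, end

# Same behaviour as 653 // 16 in the notebook: only full batches.
full = list(batch_ranges(653, 16, drop_last=True))
# Keeping the tail adds one short batch with the last 13 samples.
with_tail = list(batch_ranges(653, 16))

print(len(full), full[-1])
print(len(with_tail), with_tail[-1])
```

With integer division the loop covers indices 0 to 639. Iterating with a step of `batch_size` and clamping `end` is one simple way to include the remaining samples as well.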

In [None]:
model.state_dict()

### Computing Accuracy

In [None]:
((model(X).detach().numpy() > 0.5).astype('int') == Y.numpy()).mean()
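
The one-liner above thresholds each predicted probability at 0.5 and takes the fraction of predictions that match the labels. The same logic, spelled out in plain Python with made-up probabilities and labels (the function name `accuracy` is also an assumption), might look like this:

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(1 for pred, y in zip(preds, labels) if pred == int(y))
    return correct / len(labels)

# Made-up values, for illustration only.
probs = [0.91, 0.42, 0.55, 0.08]
labels = [1, 0, 0, 0]

# Predictions are [1, 0, 1, 0]; three of the four match the labels.
print(accuracy(probs, labels))
```

Note that the notebook evaluates on the same 653 samples it trained on, so the resulting figure is a training accuracy rather than an estimate of performance on unseen data.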