# 0.) Import the Credit Card Fraud Data From CCLE

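The code in this notebook implements a machine learning pipeline for fraud detection. It covers data preprocessing, feature engineering, model training, handling class imbalance, model evaluation, and analysis of the results. The overall flow and the reasoning behind each step are outlined below:

### Data preprocessing and feature engineering

1. **Data loading**:
   - Load the data from the file "fraudTest.csv" with the `pandas` function `read_csv`.

2. **Selecting relevant features**:
   - Select the features relevant to the task: transaction time, transaction category, amount, city population, and the fraud flag (the target variable).

3. **Datetime conversion**:
   - Convert the transaction time from a string to `datetime` format so that time-related features can be extracted.

4. **Time feature extraction**:
   - Extract a specific time component (here, the second) from the transaction time, since it may be correlated with fraudulent behavior.

5. **One-hot encoding**:
   - One-hot encode the categorical features (such as transaction category) to turn them into numeric features the model can handle.

### Handling class imbalance

6. **Resampling strategies**:
   - Handle the class imbalance with oversampling (adding minority-class samples), undersampling (removing majority-class samples), and SMOTE (Synthetic Minority Over-sampling Technique).

### Model training and evaluation

7. **Training logistic regression models**:
   - Train one logistic regression model on each of the resampled training sets.

8. **Evaluating model performance**:
   - Evaluate each model on the test set using metrics such as accuracy.

9. **Confusion matrix and sensitivity**:
   - Compute the confusion matrix from each model's predictions, then compute sensitivity (recall) to assess how well the model identifies the minority class (fraud).

### Computing performance metrics and comparing models

10. **Performance metric function**:
    - Define the `calc_perf_metric` function to compute several metrics (sensitivity, specificity, precision, recall, and F1 score) for a fuller assessment of model performance.

11. **Looping over training and evaluation**:
    - Iterate over the resampling methods and model configurations, train each model, compute the metrics above, and collect the results so the strategies can be compared.

### Summary

The pipeline shows how to handle a common machine learning problem, class imbalance, and explores how different resampling strategies and model configurations affect performance on a fraud detection task. Examining each model's confusion matrix and key metrics in detail gives a broader view than accuracy alone and helps in choosing the model and strategy best suited to the problem. Iterating systematically over configurations and evaluation methods also gives a structured, repeatable framework for similar problems.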

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
df = pd.read_csv("fraudTest.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,28.5697,-80.8191,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,44.2529,-85.017,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


In [6]:
# Select the columns relevant to the analysis from the original DataFrame and create df_select as the basis for the analysis.
df_select = df[["trans_date_trans_time", "category", "amt", "city_pop", "is_fraud"]].copy()

# Convert the strings in the "trans_date_trans_time" column to datetime format for the time-related analysis that follows
df_select["trans_date_trans_time"] = pd.to_datetime(df_select["trans_date_trans_time"])

# Extract the second from "trans_date_trans_time" as the new feature "time_var"; it may help relate transaction time to fraud.
df_select["time_var"] = df_select["trans_date_trans_time"].dt.second
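The second within a minute may carry little signal on its own; coarser components such as the hour of day or the day of week are common alternatives. The sketch below uses only the standard library and the first timestamp from the `df.head()` output to show which components are available. The names `hour_of_day` and `day_of_week` are illustrative and not part of the notebook.

```python
from datetime import datetime

# First timestamp shown in the df.head() output
ts = datetime.strptime("2020-06-21 12:14:25", "%Y-%m-%d %H:%M:%S")

second = ts.second          # what the notebook stores in "time_var"
hour_of_day = ts.hour       # hypothetical alternative feature
day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6

print(second, hour_of_day, day_of_week)
```

In pandas the corresponding accessors are `.dt.hour` and `.dt.dayofweek`.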


In [None]:
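# Feature engineering

# One-hot encode the "category" column and drop the raw timestamp and target columns to obtain the training feature set X
X = pd.get_dummies(df_select, columns=["category"]).drop(["trans_date_trans_time", "is_fraud"], axis=1)

# Take the "is_fraud" column of df_select as the target variable y
y = df_select["is_fraud"]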
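To make the one-hot step concrete, here is a minimal pure-Python sketch of what the encoding produces, using only the categories visible in the `df.head()` output (the full dataset contains more). The `category_<value>` column naming mirrors what `pd.get_dummies` generates; the variable names are illustrative.

```python
# Categories taken from the df.head() output; the full dataset has more
categories = ["personal_care", "personal_care", "health_fitness", "misc_pos", "travel"]

# One column per distinct category, in sorted order
columns = [f"category_{c}" for c in sorted(set(categories))]

# Each row gets a 1 in the column matching its category and 0 elsewhere
rows = [[int(col == f"category_{c}") for col in columns] for c in categories]

for c, row in zip(categories, rows):
    print(c, row)
```

Recent pandas versions return boolean rather than 0/1 columns from `get_dummies` by default; `StandardScaler` converts them to floats either way.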

# 1.) Use scikit learn preprocessing to split the data into 70/30 in out of sample

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [8]:
# Split the data into training and test sets; the test set is 30% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3)

In [9]:
# Split the test set further into a final test set and a holdout set, each 50% of the original test set
X_test, X_holdout, y_test, y_holdout = train_test_split(X_test, y_test, test_size = .5)

In [10]:
# Standardize the features with StandardScaler so that differences in feature scale do not distort the model
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_holdout = scaler.transform(X_holdout)
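The scaler is fitted on the training data only and then reused for the test and holdout sets, which keeps information from those sets out of training. A minimal sketch of the same idea with the standard library, using the `amt` values from the `df.head()` output as a toy train/test split (the split itself is made up for illustration):

```python
from statistics import mean, pstdev

# "amt" values from the df.head() output, split into a toy train and test part
train_amt = [2.86, 29.84, 41.28, 60.05]
test_amt = [3.19]

# Fit: the statistics come from the training data only
mu = mean(train_amt)
sigma = pstdev(train_amt)  # population standard deviation, as StandardScaler uses

# Transform: the same mu and sigma are applied to both parts
train_scaled = [(x - mu) / sigma for x in train_amt]
test_scaled = [(x - mu) / sigma for x in test_amt]

print(train_scaled, test_scaled)
```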

# 2.) Make three sets of training data (Oversample, Undersample and SMOTE)

In [12]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

In [13]:
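# Handle class imbalance

# RandomOverSampler randomly duplicates minority-class samples until their count matches the majority class, which balances the classes.
# The method is simple and direct, but it can lead to overfitting because it only repeats existing minority samples.
ros = RandomOverSampler()
over_X, over_y = ros.fit_resample(X_train, y_train)

# RandomUnderSampler randomly discards majority-class samples until their count matches the minority class.
# This shrinks the dataset and may lose information, but it makes the model pay more attention to the minority class.
rus = RandomUnderSampler()
under_X, under_y = rus.fit_resample(X_train, y_train)

# SMOTE (Synthetic Minority Over-sampling Technique) adds minority-class samples by interpolating between existing minority samples.
# Unlike plain oversampling, it creates new samples rather than exact duplicates, which can reduce the risk of overfitting.
smote = SMOTE()
smote_X, smote_y = smote.fit_resample(X_train, y_train)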
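The interpolation idea behind SMOTE can be sketched in a few lines of pure Python. This is a simplified illustration, not imbalanced-learn's implementation: it interpolates toward the single nearest minority neighbor, whereas `SMOTE` picks among several nearest neighbors. The function names and the toy points are made up.

```python
import random

def nearest_neighbor(point, others):
    """Return the element of `others` closest to `point` (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

def synthesize(minority, n_new, rng):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        neighbor = nearest_neighbor(base, minority[:i] + minority[i + 1:])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority-class points with two features (made-up values)
minority = [[1.0, 1.0], [1.5, 2.0], [3.0, 3.5]]
new_points = synthesize(minority, n_new=4, rng=random.Random(0))
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.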

# 3.) Train three logistic regression models

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
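# Train logistic regression models
# Train one logistic regression model on each of the datasets produced by RandomOverSampler, RandomUnderSampler and SMOTE.
# Each resampling method may affect model performance differently.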

over_log = LogisticRegression().fit(over_X, over_y)

under_log = LogisticRegression().fit(under_X, under_y)

smote_log = LogisticRegression().fit(smote_X, smote_y)

# 4.) Test the three models

In [None]:
# Evaluate the model trained on each resampled dataset on the test set, using accuracy as the performance measure

In [16]:
over_log.score(X_test, y_test)

0.9186160896374673

In [17]:
under_log.score(X_test, y_test)

0.9240744739557091

In [18]:
smote_log.score(X_test, y_test)

0.9154850164351352

Undersampling gives the highest accuracy here (0.924) and SMOTE the lowest (0.915), but is accuracy really the best measure?
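One way to see why accuracy can mislead on imbalanced data: the test set contains 83,059 legitimate and 299 fraudulent transactions (the row sums of the confusion matrices in this notebook). The sketch below computes the accuracy of a hypothetical baseline that never flags fraud at all.

```python
# Class counts in the test set, taken from the confusion-matrix row sums
n_legit = 83059
n_fraud = 299

# A hypothetical baseline that predicts "not fraud" for every transaction
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_sensitivity = 0 / n_fraud  # it never catches a fraudulent transaction

print(f"accuracy={baseline_accuracy:.4f}, sensitivity={baseline_sensitivity:.1f}")
```

This baseline is more accurate than all three logistic regression models while catching no fraud, which is why the rest of the notebook looks at sensitivity and related metrics.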

# 5.) Which performed best in Out of Sample metrics?

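Before walking through the confusion-matrix and sensitivity code, let us review the four components of a confusion matrix:

- **True positives (TP)**: the number of positive samples the model correctly predicts as positive.
- **False positives (FP)**: the number of negative samples the model wrongly predicts as positive.
- **True negatives (TN)**: the number of negative samples the model correctly predicts as negative.
- **False negatives (FN)**: the number of positive samples the model wrongly predicts as negative.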
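The four counts can be computed directly from paired labels. The sketch below does so for a small set of made-up labels, with 1 as the positive (fraud) class; the function name `confusion_counts` is illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Made-up labels for illustration
y_true_toy = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true_toy, y_pred_toy)
print(tp, fp, tn, fn)
```

For binary labels, scikit-learn's `confusion_matrix` arranges the same counts as `[[TN, FP], [FN, TP]]`.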

In [19]:
from sklearn.metrics import confusion_matrix

In [20]:
y_true = y_test

In [21]:
# For the model trained on oversampled data, predict on the test set and compute the confusion matrix

# Predictions of over_log on the test set X_test
y_pred = over_log.predict(X_test)
# Confusion matrix between the true labels y_true (i.e. y_test) and the predicted labels y_pred
cm = confusion_matrix(y_true, y_pred)
cm

array([[76363,  6696],
       [   88,   211]])

In [22]:
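# Sensitivity (recall)

# Uses the second row of the confusion matrix (index 1, the positive class): cm[1,1] / (cm[1,0] + cm[1,1])
# cm[1,1] is the number of true positives (TP); cm[1,0] is the number of false negatives (FN)
# Sensitivity measures how well the model identifies positive samples
print("Over Sample Sensitivity : ", cm[1,1] /( cm[1,0] + cm[1,1]))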

Over Sample Sensitivity :  0.705685618729097


In [23]:
y_pred = under_log.predict(X_test)
cm = confusion_matrix(y_true, y_pred)
cm

array([[76820,  6239],
       [   90,   209]])

In [24]:
print("Under Sample Sensitivity : ", cm[1,1] /( cm[1,0] + cm[1,1]))

Under Sample Sensitivity :  0.6989966555183946


In [25]:
y_pred = smote_log.predict(X_test)
cm = confusion_matrix(y_true, y_pred)
cm

array([[76102,  6957],
       [   88,   211]])

In [26]:
print("SMOTE Sample Sensitivity : ", cm[1,1] /( cm[1,0] + cm[1,1]))

SMOTE Sample Sensitivity :  0.705685618729097
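Sensitivity alone barely separates the three models, so it helps to read the other cells of the same confusion matrices. The sketch below recomputes sensitivity, specificity and precision from the three matrices printed above; the dictionary layout is only for illustration.

```python
# Confusion matrices printed above, as [[TN, FP], [FN, TP]]
matrices = {
    "over": [[76363, 6696], [88, 211]],
    "under": [[76820, 6239], [90, 209]],
    "smote": [[76102, 6957], [88, 211]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    summary[name] = {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }
    print(name, {k: round(v, 4) for k, v in summary[name].items()})
```

Precision is roughly 3% for all three models, meaning most flagged transactions are legitimate. Undersampling has slightly lower sensitivity but slightly higher specificity and precision in this run.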


# 7.) We want to compare oversampling, undersampling and SMOTE across our 3 models (Logistic Regression, Logistic Regression Lasso and Decision Trees).

# Make a dataframe that has a dual index and 9 rows.
# Calculate sensitivity, specificity, precision, recall and F1 score for out-of-sample data.
# Notice any patterns in performance across these models. Does one resampling method (over/under/SMOTE) totally outperform the others, or does one model (DT, Lasso, LR) perform better?
# Choose what you think is the best model and explain why. Test it on the holdout set.

In [29]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
import pandas as pd
from imblearn.over_sampling import RandomOverSampler

In [None]:
# Extended model training

In [30]:
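# Two dictionaries define the resampling methods and the model configurations.
# They hold the different resampling strategies and the models to train, so that training and evaluation can loop over every combination.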

resampling_methods = {
    "over": RandomOverSampler(),
    "under": RandomUnderSampler(),
    "smote": SMOTE()
}

model_configs = {
    "LOG":LogisticRegression(),
    "LASSO": LogisticRegression(penalty = "l1", C = 2., solver = "liblinear"),
    "DTREE": DecisionTreeClassifier()
}

In [47]:
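# Performance metrics are computed by the custom function calc_perf_metric.
# It uses confusion_matrix and other sklearn.metrics functions to compute sensitivity, specificity, precision, recall and F1 score.
# Sensitivity and recall are the same quantity; both are kept to match the requested columns.

def calc_perf_metric(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # For binary labels, each of these sklearn functions returns a single float
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    return sensitivity, specificity, precision, recall, f1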
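F1 is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two. The sketch below works this out by hand from the oversampled logistic regression confusion matrix printed earlier (TP = 211, FP = 6696, FN = 88) and compares it with the plain arithmetic mean.

```python
# Counts from the oversampled logistic regression confusion matrix printed earlier
tp, fp, fn = 211, 6696, 88

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Arithmetic mean, for contrast only
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} mean={arithmetic_mean:.4f}")
```

Even with recall above 0.7, the very low precision keeps F1 below 0.06, which the arithmetic mean would hide.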

In [32]:
trained_models = {}
results = []

In [48]:
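# Loop over training and evaluation

# Iterate over every combination of resampling method and model configuration, train and evaluate each one,
# and store the evaluation results in the results list.
# The results are then collected into the result_df DataFrame for later analysis and comparison.

from sklearn.base import clone

for resample_key, resampler in resampling_methods.items():
    resample_X, resample_y = resampler.fit_resample(X_train, y_train)
    
    for model_key, model in model_configs.items():
        combined_key = f"{resample_key}_{model_key}"
        
        # Fit a fresh copy so each entry in trained_models keeps its own fitted estimator
        m = clone(model).fit(resample_X, resample_y)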
        
        trained_models[combined_key] = m
        
        y_pred = m.predict(X_test)
        
        sensitivity, specificity, precision, recall, f1 = calc_perf_metric(y_test,y_pred)
        
        results.append({"Model": combined_key,
                       "Sensitivity" : sensitivity,
                       "Specificity": specificity,
                       "Precision": precision,
                       "Recall": recall,
                       "F1" : f1})

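A general point about collecting fitted models in a dictionary: scikit-learn's `fit` returns the estimator itself, so fitting one estimator object repeatedly and storing the return value each time stores the same object under several keys. The toy class below is a stand-in (not a scikit-learn estimator) that shows the effect and one way around it.

```python
import copy

class ToyModel:
    """Stand-in for an estimator whose fit() returns self, as scikit-learn estimators do."""
    def __init__(self):
        self.fitted_on = None

    def fit(self, name):
        self.fitted_on = name
        return self

shared = ToyModel()
reused = {}
fresh = {}
for name in ["over", "under", "smote"]:
    reused[name] = shared.fit(name)                # same object stored three times
    fresh[name] = copy.deepcopy(shared).fit(name)  # independent copy per entry

print({k: v.fitted_on for k, v in reused.items()})
print({k: v.fitted_on for k, v in fresh.items()})
```

With real estimators, `sklearn.base.clone` returns an unfitted copy with the same parameters, which matters when a stored model is later evaluated on the holdout set.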
In [49]:
result_df = pd.DataFrame(results)
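The task asks for a dual index, while the results here use a single combined key such as `over_LOG`. One possible way to get a two-level index is to split that key, sketched below on three rows copied from the results table. The column names `Resampling` and `Classifier` are assumptions chosen for illustration.

```python
import pandas as pd

# Three rows copied from the results table, for illustration
toy = pd.DataFrame({
    "Model": ["over_LOG", "under_DTREE", "smote_DTREE"],
    "Sensitivity": [0.705686, 0.946488, 0.692308],
    "F1": [0.055872, 0.118015, 0.386555],
})

# Split "over_LOG" into its resampling and classifier parts
toy[["Resampling", "Classifier"]] = toy["Model"].str.split("_", expand=True)

# Use the two parts as a two-level (dual) index
indexed = toy.drop(columns="Model").set_index(["Resampling", "Classifier"])
print(indexed)
```

Rows can then be selected by either level, for example `indexed.loc[("under", "DTREE")]`.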

In [50]:
result_df

Unnamed: 0,Model,Sensitivity,Specificity,Precision,Recall,F1
0,over_LOG,0.705686,0.915205,0.029087,0.705686,0.055872
1,over_LASSO,0.705686,0.915277,0.029111,0.705686,0.055916
2,over_DTREE,0.51505,0.998507,0.553957,0.51505,0.533795
3,under_LOG,0.705686,0.926751,0.033519,0.705686,0.063998
4,under_LASSO,0.705686,0.926257,0.033302,0.705686,0.063602
5,under_DTREE,0.946488,0.949265,0.062931,0.946488,0.118015
6,smote_LOG,0.705686,0.914579,0.02888,0.705686,0.05549
7,smote_LASSO,0.705686,0.914663,0.028908,0.705686,0.055541
8,smote_DTREE,0.692308,0.993198,0.268135,0.692308,0.386555
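Which model counts as "best" depends on how missed fraud is weighed against false alarms. As a starting point, the sketch below ranks the nine combinations by two criteria, using the sensitivity, specificity and F1 values copied from the table above.

```python
# (model, sensitivity, specificity, F1) copied from the results table above
rows = [
    ("over_LOG", 0.705686, 0.915205, 0.055872),
    ("over_LASSO", 0.705686, 0.915277, 0.055916),
    ("over_DTREE", 0.51505, 0.998507, 0.533795),
    ("under_LOG", 0.705686, 0.926751, 0.063998),
    ("under_LASSO", 0.705686, 0.926257, 0.063602),
    ("under_DTREE", 0.946488, 0.949265, 0.118015),
    ("smote_LOG", 0.705686, 0.914579, 0.05549),
    ("smote_LASSO", 0.705686, 0.914663, 0.055541),
    ("smote_DTREE", 0.692308, 0.993198, 0.386555),
]

best_f1 = max(rows, key=lambda r: r[3])
best_sensitivity = max(rows, key=lambda r: r[1])

print("Highest F1:", best_f1[0])
print("Highest sensitivity:", best_sensitivity[0])
```

In this run the decision trees lead on both criteria, but with different resamplers: oversampling gives the best balance of precision and recall, while undersampling catches the most fraud at the cost of many more false alarms. Whichever is chosen should then be checked on the holdout set.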
