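### Identifying Tax Evasion in the Automobile Sales Industry
#### Introduction
>##### The automobile sales industry generally refers to businesses that sell complete vehicles. Common tax problems in this industry include under-invoicing and under-reporting revenue, leaving bundled services such as license plate registration, financing, and insurance off the books, and failing to recognize warranty claim payments promptly. Together these cost the government substantial tax revenue. Some operating indicators of a car sales company can, to a certain extent, be used to assess its propensity for tax evasion.  
>##### In this project, we build a model that identifies tax-evasion behavior from a set of operating indicators of automobile sales companies.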

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

# read_excel does not take an encoding argument; reading .xls files requires the xlrd package
df = pd.read_excel(u'汽车销售企业经营指标.xls')
df.head()

,纳税人编号,销售类型,销售模式,汽车销售平均毛利,维修毛利,企业维修收入占销售收入比重,增值税税负,存货周转率,成本费用利润率,整体理论税负,整体税负控制数,办牌率,单台办牌手续费收入,代办保险率,保费返还率,输出
0,1,国产轿车,4S店,0.0635,0.3241,0.0879,0.0084,8.5241,0.0018,0.0166,0.0147,0.4,0.02,0.7155,0.15,正常
1,2,国产轿车,4S店,0.052,0.2577,0.1394,0.0298,5.2782,-0.0013,0.0032,0.0137,0.3307,0.02,0.2697,0.1367,正常
2,3,国产轿车,4S店,0.0173,0.1965,0.1025,0.0067,19.8356,0.0014,0.008,0.0061,0.2256,0.02,0.2445,0.1301,正常
3,4,国产轿车,一级代理商,0.0501,0.0,0.0,0.0,1.0673,-0.3596,-0.1673,0.0,0.0,0.0,0.0,0.0,异常
4,5,进口轿车,4S店,0.0564,0.0034,0.0066,0.0017,12.847,-0.0014,0.0123,0.0095,0.0039,0.08,0.0117,0.1872,正常


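>##### The dataset has 16 columns. Three of them, '销售类型' (sales type), '销售模式' (sales mode), and '输出' (output label), are categorical and need to be converted to dummy variables.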
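
One possible way to build those dummy variables is sketched below with `pd.get_dummies`. The rows in `toy` are invented purely for illustration and only mimic a few of the real column names; treating '输出' as a 0/1 target (1 for 异常) is an assumption of this sketch, not something the notebook prescribes.

```python
import pandas as pd

# Toy rows that only mimic the real columns; the values are invented for illustration.
toy = pd.DataFrame({
    "销售类型": ["国产轿车", "进口轿车", "大客车", "国产轿车"],
    "销售模式": ["4S店", "4S店", "一级代理商", "一级代理商"],
    "增值税税负": [0.0084, 0.0017, 0.0, 0.0298],
    "输出": ["正常", "正常", "异常", "异常"],
})

# Binary target (assumed encoding): 1 for 异常 (abnormal), 0 for 正常 (normal).
toy["输出"] = (toy["输出"] == "异常").astype(int)

# One-hot encode the two categorical features; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["销售类型", "销售模式"], dtype=int)
print(encoded.columns.tolist())
```

Each category becomes its own indicator column named `<column>_<category>`, so the encoded frame can be passed directly to most classifiers.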

In [8]:
print('Dataset (rows, columns):', df.shape)
print('Number of null values:', df.isnull().sum().sum())
print('Duplicated rows:', df[df.duplicated()])

Dataset (rows, columns): (124, 16)
Number of null values: 0
Duplicated rows: Empty DataFrame
Columns: [纳税人编号, 销售类型, 销售模式, 汽车销售平均毛利, 维修毛利, 企业维修收入占销售收入比重, 增值税税负, 存货周转率, 成本费用利润率, 整体理论税负, 整体税负控制数, 办牌率, 单台办牌手续费收入, 代办保险率, 保费返还率, 输出]
Index: []


>##### The dataset is small (124 rows) and contains no null values or duplicated rows, so no further cleaning is needed.

In [9]:
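# Preliminary look at the number of abnormal and normal companies per sales type
# '输出' holds the labels '异常' / '正常', so one-hot encode it before counting
output_dummies = pd.get_dummies(df[u'输出'])
df_type1 = output_dummies[u'异常'].groupby(df[u'销售类型']).sum()
df_type2 = output_dummies[u'正常'].groupby(df[u'销售类型']).sum()
df_type12 = pd.concat([df_type1, df_type2], axis=1)
# Bar chart sorted by the number of abnormal cases, descending
df_type12.sort_values(u'异常', ascending=False).plot(kind='bar')
plt.title('Abnormal vs. normal counts by sales type')
plt.xlabel('Sales type')
plt.ylabel('Count')
plt.show()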

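>##### The chart above shows that domestic sedans (国产轿车) and large buses (大客车) have the highest abnormal counts. For large buses, abnormal cases far outnumber normal ones, so this sales type deserves particular attention.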
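
Raw counts favor the sales types with the most companies, so it can help to also look at the abnormal rate within each type. The sketch below uses `pd.crosstab` on invented toy labels (not the real 124 rows); the column name `abnormal_rate` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Toy labels invented for illustration; they are not taken from the real dataset.
toy = pd.DataFrame({
    "销售类型": ["国产轿车", "国产轿车", "国产轿车", "大客车", "大客车", "大客车", "进口轿车"],
    "输出": ["正常", "正常", "异常", "异常", "异常", "正常", "正常"],
})

# Rows: sales type; columns: number of companies with each label.
counts = pd.crosstab(toy["销售类型"], toy["输出"])

# Share of abnormal companies within each sales type (a row-wise rate).
counts["abnormal_rate"] = counts["异常"] / (counts["异常"] + counts["正常"])
print(counts.sort_values("abnormal_rate", ascending=False))
```

A type whose `abnormal_rate` exceeds 0.5 has more abnormal than normal companies, which is the pattern described for large buses.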

In [None]:
# Compute each sales type's share of all abnormal cases and of all normal cases,
# and store the results in new columns.
df_type12[u'异常比列'] = df_type12[u'异常'] / df_type12[u'异常'].sum()
df_type12[u'正常比列'] = df_type12[u'正常'] / df_type12[u'正常'].sum()
# Visualize the shares with pie charts
fig = plt.figure(figsize=(12,10))
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
ax1.pie(df_type12[u'异常比列'], labels=df_type12.index, autopct='%1.1f%%', shadow=True, startangle=90)
ax2.pie(df_type12[u'正常比列'], labels=df_type12.index, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')  # draw the pie as a circle
ax2.axis('equal')
ax1.set_title('Share of abnormal cases by sales type')  # Axes objects use set_title, not title
ax2.set_title('Share of normal cases by sales type')
plt.show()
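
The two share columns computed above divide each count by its column total, so they describe how the abnormal (or normal) cases are distributed across sales types, not how risky each type is. As a rough sketch on invented toy labels, the same shares can be obtained in one call with the `normalize="columns"` option of `pd.crosstab`:

```python
import pandas as pd

# Toy labels invented for illustration; they are not taken from the real dataset.
toy = pd.DataFrame({
    "销售类型": ["国产轿车", "国产轿车", "国产轿车", "大客车", "大客车", "大客车", "进口轿车"],
    "输出": ["正常", "正常", "异常", "异常", "异常", "正常", "正常"],
})

# normalize="columns" divides every count by its column total,
# so each label's shares across sales types sum to 1.
shares = pd.crosstab(toy["销售类型"], toy["输出"], normalize="columns")
print(shares)
```

Each column of `shares` corresponds to one pie chart: the '异常' column to the first pie and the '正常' column to the second.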