<div align="center"><h1>Chapter 6&nbsp;&nbsp;Basic Statistical Analysis of Data</h1></div>

# 1. Notes

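- Description: source code for this chapter. For errata and frequently asked questions about the book, follow the website links under "FAQ" and "Errata" at the top of the page.
- Author: 方伟 (FangWei)
- Development environment: Windows DEV Channel, Build 22533.1001, 64-bit
- Python version: 3.10.1, 64-bit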

# 2. Code

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Load the raw data
raw_data = pd.read_excel('demo.xlsx')
print(raw_data.head(3))

    DATETIME  PROVINCE  CATE  AMOUNT  VISITS  IS_PRO
0 2019-04-29        28  南方大区   585.0    2485   False
1 2019-02-02        28  北方大区   936.0    4647    True
2 2019-09-23        28  北方大区   682.0    6402   False


## 6.1 Descriptive Statistical Analysis

In [3]:
# Convert PROVINCE to string type so it is treated as a category, not a number
raw_data['PROVINCE'] = raw_data['PROVINCE'].astype(str)

In [4]:
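# Collect descriptive statistics (one row per field after transposing)
desc_data = raw_data.describe(include='all').T
# Range: maximum minus minimum
desc_data['polar_distance'] = desc_data['max'] - desc_data['min']
# Quartile deviation, i.e. half of the interquartile range (75% - 25%), despite the column name 'IQR'
desc_data['IQR'] = (desc_data['75%'] - desc_data['25%'])/2
# Date span; 'first'/'last' are reported by the pandas version used here, newer releases may omit them
desc_data['days_int'] = desc_data['last'] - desc_data['first']
desc_data['dtype'] = raw_data.dtypes
desc_data['all_count'] = raw_data.shape[0]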
print(desc_data.columns)

Index(['count', 'unique', 'top', 'freq', 'first', 'last', 'mean', 'std', 'min',
       '25%', '50%', '75%', 'max', 'polar_distance', 'IQR', 'days_int',
       'dtype', 'all_count'],
      dtype='object')
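
The derived columns above can be checked on a small series. The following is a minimal sketch on hypothetical toy values (not the book's `demo.xlsx`), showing how the range, the interquartile range, and the quartile deviation (half the IQR, which is what the `IQR` column above actually stores) relate to each other.

```python
import pandas as pd

# Hypothetical toy values standing in for a numeric column such as AMOUNT
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
value_range = s.max() - s.min()   # range: maximum minus minimum
iqr = q3 - q1                     # interquartile range
quartile_deviation = iqr / 2      # half of the interquartile range

print(value_range, iqr, quartile_deviation)
```

If the full interquartile range is wanted instead of the quartile deviation, the division by 2 would simply be dropped.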


In [5]:
desc_data.head()

| | count | unique | top | freq | first | last | mean | std | min | 25% | 50% | 75% | max | polar_distance | IQR | days_int | dtype | all_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DATETIME | 2136 | 302.0 | 2019-05-01 00:00:00 | 26.0 | 2019-01-02 00:00:00 | 2019-11-01 00:00:00 | | | | | | | | | | 303 days | datetime64[ns] | 2136 |
| PROVINCE | 2136 | 23.0 | 23 | 502.0 | | | | | | | | | | | | NaT | object | 2136 |
| CATE | 2136 | 5.0 | 北方大区 | 463.0 | | | | | | | | | | | | NaT | object | 2136 |
| AMOUNT | 2136 | | | | | | 1608.35 | 1239.32 | 357.0 | 800.35 | 1243.55 | 1993.42 | 14017.5 | 13660.5 | 596.538 | NaT | float64 | 2136 |
| VISITS | 2136 | | | | | | 3881.16 | 3301.2 | 557.0 | 1717.75 | 2893.0 | 4963.5 | 35374.0 | 34817.0 | 1622.88 | NaT | int64 | 2136 |


### 6.1.1 General Information

In [6]:
# Check total rows, non-null counts, and data types
print(desc_data[['all_count','count','dtype']])

          all_count count           dtype
DATETIME       2136  2136  datetime64[ns]
PROVINCE       2136  2136          object
CATE           2136  2136          object
AMOUNT         2136  2136         float64
VISITS         2136  2136           int64
IS_PRO         2136  2136            bool


### 6.1.2 Central Tendency

1. Mean, median, and quartiles of numeric fields

In [7]:
print(desc_data.loc[['AMOUNT','VISITS'],['25%', '50%', '75%','mean']])

            25%      50%      75%     mean
AMOUNT   800.35  1243.55  1993.42  1608.35
VISITS  1717.75     2893   4963.5  3881.16


2. Number of unique values, mode, and mode frequency of non-numeric fields

In [8]:
print(desc_data.loc[['DATETIME','PROVINCE','CATE','IS_PRO'],['unique','top','freq']])

         unique                  top  freq
DATETIME    302  2019-05-01 00:00:00    26
PROVINCE     23                   23   502
CATE          5                 北方大区   463
IS_PRO        2                 True  1102
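
The `unique`, `top`, and `freq` values reported by `describe()` can also be obtained directly from a column. A minimal sketch, assuming a hypothetical categorical series rather than the chapter's data:

```python
import pandas as pd

# Hypothetical categorical values standing in for a field such as CATE
s = pd.Series(['north', 'south', 'north', 'west', 'north', 'south'])

counts = s.value_counts()      # frequencies, sorted from most to least common
unique_n = s.nunique()         # number of distinct values
top = counts.index[0]          # mode (most frequent value)
freq = int(counts.iloc[0])     # how often the mode occurs

desc = s.describe()            # reports the same three figures
print(unique_n, top, freq)
print(desc[['unique', 'top', 'freq']])
```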


### 6.1.3 Dispersion

1. Standard deviation, minimum, maximum, range, and quartile deviation of numeric fields

In [9]:
print(desc_data.loc[['AMOUNT','VISITS'],['std','min','max','polar_distance','IQR']])

            std  min      max polar_distance      IQR
AMOUNT  1239.32  357  14017.5        13660.5  596.538
VISITS   3301.2  557    35374          34817  1622.88


2. Start date, end date, and date span of the datetime field

In [10]:
print(desc_data.loc[['DATETIME'],['first','last','days_int']])

                        first                 last days_int
DATETIME  2019-01-02 00:00:00  2019-11-01 00:00:00 303 days


## 6.2 Cross Comparison and Trend Analysis

### 6.2.1 Cross Comparison Analysis

In [11]:
# Horizontal comparison: compare different segments with one another
raw_data.pivot_table(values=['AMOUNT','VISITS'],index=['CATE'],columns='IS_PRO',aggfunc='mean')

| CATE | AMOUNT (IS_PRO=False) | AMOUNT (IS_PRO=True) | VISITS (IS_PRO=False) | VISITS (IS_PRO=True) |
|---|---|---|---|---|
| 中部大区 | 1605.869626 | 1557.756132 | 3968.88785 | 3633.867925 |
| 北方大区 | 1509.768837 | 1489.045968 | 3931.069767 | 3756.782258 |
| 南方大区 | 1526.57451 | 1651.133921 | 3590.279412 | 3977.69163 |
| 海外区 | 1676.751707 | 1831.289163 | 4175.390244 | 4126.527094 |
| 西部大区 | 1707.75102 | 1562.301415 | 3759.954082 | 3903.882075 |
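
To make the shape of a cross comparison easier to follow, here is a minimal sketch on hypothetical toy data. The column `SEGMENT` and its values are illustrative assumptions standing in for `IS_PRO`; the sketch averages `AMOUNT` for each combination of row category and column category.

```python
import pandas as pd

# Hypothetical toy data; SEGMENT plays the role of IS_PRO in the chapter
df = pd.DataFrame({
    'CATE': ['A', 'A', 'B', 'B', 'A', 'B'],
    'SEGMENT': ['pro', 'non_pro', 'pro', 'non_pro', 'pro', 'non_pro'],
    'AMOUNT': [100.0, 200.0, 300.0, 400.0, 300.0, 200.0],
})

# Rows are categories, columns are segments, cells are mean AMOUNT
pivot = df.pivot_table(values='AMOUNT', index='CATE', columns='SEGMENT', aggfunc='mean')
print(pivot)
```

Reading across a row compares segments within one category, while reading down a column compares categories within one segment.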


### 6.2.2 Cross Trend Analysis

In [12]:
# Vertical comparison: compare one segment against its own history
raw_data['MONTH'] = raw_data['DATETIME'].dt.month
overseas = raw_data[raw_data['CATE']=='海外区']  # '海外区' is the overseas region
overseas.pivot_table(values=['AMOUNT','VISITS'],index=['MONTH'],columns='IS_PRO',aggfunc='mean')

| MONTH | AMOUNT (IS_PRO=False) | AMOUNT (IS_PRO=True) | VISITS (IS_PRO=False) | VISITS (IS_PRO=True) |
|---|---|---|---|---|
| 1 | 1246.211765 | 1156.542857 | 4541.823529 | 1864.857143 |
| 2 | 1398.080952 | 1396.2 | 3163.0 | 4777.5 |
| 3 | 1566.375 | 1553.296552 | 3727.458333 | 3422.965517 |
| 4 | 1453.741176 | 2210.053846 | 3189.823529 | 4219.384615 |
| 5 | 1519.332 | 2303.238462 | 5133.96 | 5357.923077 |
| 6 | 1854.215789 | 1333.627273 | 4311.578947 | 5564.0 |
| 7 | 1748.63 | 1732.244444 | 4420.55 | 3905.296296 |
| 8 | 2009.478947 | 1722.708333 | 3924.263158 | 4499.333333 |
| 9 | 2719.622222 | 2913.995 | 5552.666667 | 5043.85 |
| 10 | 1422.516 | 1757.809524 | 3817.84 | 3402.428571 |
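
A monthly table like the one above is often read as a trend. As a hedged illustration only, the sketch below uses hypothetical toy dates and amounts to derive the month, average by month, and then compute the month-over-month change with `pct_change()`. The chapter itself does not compute this change; it is added here as one possible follow-up step.

```python
import pandas as pd

# Hypothetical toy records (not taken from demo.xlsx)
df = pd.DataFrame({
    'DATETIME': pd.to_datetime(['2019-01-05', '2019-01-20', '2019-02-03',
                                '2019-02-17', '2019-03-09']),
    'AMOUNT': [100.0, 300.0, 250.0, 350.0, 450.0],
})

df['MONTH'] = df['DATETIME'].dt.month            # month number of each record
monthly = df.groupby('MONTH')['AMOUNT'].mean()   # average amount per month
change = monthly.pct_change()                    # relative change versus the previous month

print(monthly)
print(change)
```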


## 6.3 Structure and Cumulative Contribution Analysis

### 6.3.1 Structure Analysis

In [13]:
# Sum only the numeric measures; summing DATETIME or CATE is not meaningful and can raise an error in recent pandas
com_data = raw_data.groupby(['PROVINCE'],as_index=False)[['AMOUNT','VISITS']].sum()
com_sort = com_data.sort_values(['VISITS'],ascending=False)
amount_sum = com_sort['AMOUNT'].sum()
visits_sum = com_sort['VISITS'].sum()
# Share of each province in the overall totals
com_sort['AMOUNT_PER'] = com_sort['AMOUNT']/amount_sum
com_sort['VISITS_PER'] = com_sort['VISITS']/visits_sum
print(com_sort.head())

   PROVINCE     AMOUNT   VISITS  AMOUNT_PER  VISITS_PER
10       23  1196926.5  1504144    0.348407    0.181437
18        5   361144.8   865083    0.105124    0.104351
5        14   314212.8   619030    0.091463    0.074671
13       26    70599.0   606770    0.020550    0.073192
11       24    78604.0   521749    0.022880    0.062936


### 6.3.2 80/20 Rule (Pareto) Analysis

In [14]:
amount_data = com_sort.sort_values(['AMOUNT_PER'],ascending=False)
amount_data['CUM_AMOUNT_PER'] = amount_data['AMOUNT_PER'].cumsum()
print(amount_data[['PROVINCE','AMOUNT_PER','CUM_AMOUNT_PER']].round(2).head())

   PROVINCE  AMOUNT_PER  CUM_AMOUNT_PER
10       23        0.35            0.35
7        18        0.12            0.47
18        5        0.11            0.58
5        14        0.09            0.67
9        22        0.08            0.75


In [15]:
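# Split by the 80/20 rule: provinces that make up the first 80% of cumulative AMOUNT versus the rest
# Equivalent list-comprehension form:
#amount_data['20_80'] = ['top20%' if i <=0.8 else 'others80%' for i in amount_data['CUM_AMOUNT_PER']]
amount_data['20_80']=pd.cut(amount_data['CUM_AMOUNT_PER'],bins=[0,0.8,1],labels=['top20%','others80%'])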
print(amount_data[['PROVINCE','AMOUNT_PER','CUM_AMOUNT_PER','20_80']].round(2).head(10))

   PROVINCE  AMOUNT_PER  CUM_AMOUNT_PER      20_80
10       23        0.35            0.35     top20%
7        18        0.12            0.47     top20%
18        5        0.11            0.58     top20%
5        14        0.09            0.67     top20%
9        22        0.08            0.75     top20%
3        12        0.02            0.78     top20%
16        3        0.02            0.80     top20%
11       24        0.02            0.82  others80%
21        8        0.02            0.84  others80%
13       26        0.02            0.86  others80%
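
The cumulative-share split can be sketched end to end on toy numbers. One detail worth guarding against is floating-point drift: a cumulative sum of shares can end marginally above 1, which would fall outside the last `pd.cut` bin and become `NaN`. The `clip(upper=1.0)` below is a suggested safeguard and not part of the chapter's code; the amounts are hypothetical.

```python
import pandas as pd

# Hypothetical amounts per item, already aggregated
amount = pd.Series([50.0, 20.0, 12.0, 8.0, 6.0, 4.0], index=list('abcdef'))

share = (amount / amount.sum()).sort_values(ascending=False)
# Clip so that rounding error cannot push the last value above the top bin edge
cum_share = share.cumsum().clip(upper=1.0)

group = pd.cut(cum_share, bins=[0, 0.8, 1], labels=['top20%', 'others80%'])
print(pd.DataFrame({'share': share, 'cum_share': cum_share, 'group': group}))
```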


### 6.3.3 ABC Analysis

In [16]:
# ABC classification: A up to 80% of cumulative AMOUNT, B from 80% to 95%, C from 95% to 100%
amount_data['ABC'] = pd.cut(amount_data['CUM_AMOUNT_PER'],bins=[0,0.8,0.95,1],labels=list('ABC'))
print(amount_data[['PROVINCE','AMOUNT_PER','CUM_AMOUNT_PER','20_80','ABC']].round(2).head(15))

   PROVINCE  AMOUNT_PER  CUM_AMOUNT_PER      20_80 ABC
10       23        0.35            0.35     top20%   A
7        18        0.12            0.47     top20%   A
18        5        0.11            0.58     top20%   A
5        14        0.09            0.67     top20%   A
9        22        0.08            0.75     top20%   A
3        12        0.02            0.78     top20%   A
16        3        0.02            0.80     top20%   A
11       24        0.02            0.82  others80%   B
21        8        0.02            0.84  others80%   B
13       26        0.02            0.86  others80%   B
2        11        0.02            0.88  others80%   B
4        13        0.02            0.90  others80%   B
17        4        0.02            0.92  others80%   B
15       28        0.02            0.94  others80%   B
20        7        0.02            0.96  others80%   C
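
Once the classes are assigned, a natural next step is to count how many items fall in each class. A minimal sketch, assuming a hypothetical series of cumulative shares and the same 80% / 95% thresholds as above:

```python
import pandas as pd

# Hypothetical cumulative shares, sorted from the largest contributor down
cum_share = pd.Series([0.35, 0.60, 0.78, 0.85, 0.93, 0.97, 1.00])

abc = pd.cut(cum_share, bins=[0, 0.8, 0.95, 1], labels=list('ABC'))
counts = abc.value_counts().sort_index()   # number of items in classes A, B, C

print(counts)
```

In this toy case, three items already account for the A class, which is the kind of concentration the method is meant to expose.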


### 6.3.4 Long-Tail Analysis

In [17]:
visits_data = com_sort.sort_values(['VISITS_PER'],ascending=False)
visits_data['CUM_VISITS_PER'] = visits_data['VISITS_PER'].cumsum()
print(visits_data[['PROVINCE','VISITS_PER','CUM_VISITS_PER']].round(2).head())

   PROVINCE  VISITS_PER  CUM_VISITS_PER
10       23        0.18            0.18
18        5        0.10            0.29
5        14        0.07            0.36
13       26        0.07            0.43
11       24        0.06            0.50
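
One way to quantify the head and the tail is to count how many top items are needed to reach a given cumulative share. The sketch below uses hypothetical visit counts and an assumed 50% threshold; neither comes from the chapter.

```python
import pandas as pd

# Hypothetical visit counts per item
visits = pd.Series([400, 200, 100, 80, 60, 50, 40, 30, 20, 20])

share = visits.sort_values(ascending=False) / visits.sum()
cum_share = share.cumsum()

# Number of top items needed to reach at least 50% of all visits
head_n = int((cum_share < 0.5).sum()) + 1
# Share of visits contributed by everything after the head
tail_share = 1 - cum_share.iloc[head_n - 1]

print(head_n, round(tail_share, 2))
```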


## 6.4 Grouping and Aggregation Analysis

### 6.4.1 Aggregation Analysis Using Quantiles

In [18]:
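agg_data = raw_data.copy()
# Note: pd.cut with bins=3 creates three equal-width intervals; pd.qcut would give equal-frequency (quantile) bins
agg_data['QUAN_CUT'] = pd.cut(agg_data['VISITS'],bins=3,labels=list('ABC'))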
print(agg_data[['VISITS','QUAN_CUT']].head())

   VISITS QUAN_CUT
0    2485        A
1    4647        A
2    6402        A
3   19765        B
4    2892        A
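
The difference between equal-width and quantile binning matters on skewed data. A minimal sketch on hypothetical values with one large outlier, comparing `pd.cut` (as used above) with `pd.qcut`:

```python
import pandas as pd

# Hypothetical skewed values: eight small numbers and one outlier
visits = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

equal_width = pd.cut(visits, bins=3, labels=list('ABC'))   # intervals of equal length
equal_freq = pd.qcut(visits, q=3, labels=list('ABC'))      # intervals with equal counts

print(pd.DataFrame({'visits': visits, 'cut': equal_width, 'qcut': equal_freq}))
```

With equal-width bins almost every value lands in `A`, whereas `pd.qcut` places three values in each bin. This is consistent with the output above, where most rows fall into `A`.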


### 6.4.2 Aggregation Analysis Based on Mean and Standard Deviation

In [19]:
visits_desc = agg_data['VISITS'].describe()
min_,mean_,std_,max_ = visits_desc['min'],visits_desc['mean'],visits_desc['std'],visits_desc['max']
bins = [min_-1,mean_-std_,mean_+std_,max_+1]
agg_data['CUST_CUT'] = pd.cut(agg_data['VISITS'],bins=bins,labels=list('ABC'))
print(agg_data[['VISITS','QUAN_CUT','CUST_CUT']].head())

   VISITS QUAN_CUT CUST_CUT
0    2485        A        B
1    4647        A        B
2    6402        A        B
3   19765        B        C
4    2892        A        B
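
The bin edges built from the mean and standard deviation must be strictly increasing for `pd.cut` to accept them. On heavily skewed data, `mean - std` can fall below the minimum. The helper below, `mean_std_cut`, is a hypothetical wrapper (not from the book) that checks the edges first; the labels `low`, `mid`, and `high` are likewise illustrative.

```python
import pandas as pd

def mean_std_cut(s: pd.Series) -> pd.Series:
    """Label values as low / mid / high relative to mean ± one standard deviation."""
    mean_, std_ = s.mean(), s.std()
    lower, upper = mean_ - std_, mean_ + std_
    left, right = s.min() - 1, s.max() + 1
    # pd.cut requires strictly increasing edges, so verify before cutting
    if not (left < lower < upper < right):
        raise ValueError('mean ± std does not fall strictly inside the data range')
    return pd.cut(s, bins=[left, lower, upper, right], labels=['low', 'mid', 'high'])

# Hypothetical evenly spread values
visits = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])
labels = mean_std_cut(visits)
print(labels.tolist())
```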


## 6.5 Correlation Analysis

### 6.5.1 Pearson Correlation Analysis

In [20]:
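# Encode the two binned columns as integer codes; they are used for the rank correlations in 6.5.2 and 6.5.3
cols = ['QUAN_CUT','CUST_CUT']
for i in cols:
    agg_data[i] = agg_data[i].astype('category')
    agg_data[i+'_IND'] = agg_data[i].cat.codes
# Pearson correlation between the two continuous fields
print(agg_data[['AMOUNT','VISITS']].corr(method='pearson').round(2))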

        AMOUNT  VISITS
AMOUNT    1.00    0.27
VISITS    0.27    1.00
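
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on hypothetical toy values and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.0, 4.0, 5.0, 4.0, 5.0]})

dx = df['x'] - df['x'].mean()   # deviations from the mean of x
dy = df['y'] - df['y'].mean()   # deviations from the mean of y
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = df['x'].corr(df['y'], method='pearson')
print(round(r_manual, 4), round(r_pandas, 4))
```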


### 6.5.2 Spearman Correlation Analysis

In [21]:
print(agg_data[['QUAN_CUT_IND','CUST_CUT_IND']].corr(method='spearman').round(2))

              QUAN_CUT_IND  CUST_CUT_IND
QUAN_CUT_IND          1.00          0.47
CUST_CUT_IND          0.47          1.00


### 6.5.3 Kendall Correlation Analysis

In [22]:
print(agg_data[['QUAN_CUT_IND','CUST_CUT_IND']].corr(method='kendall').round(2))

              QUAN_CUT_IND  CUST_CUT_IND
QUAN_CUT_IND          1.00          0.47
CUST_CUT_IND          0.47          1.00
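
Both rank correlations can be reproduced from their definitions. The sketch below uses hypothetical tie-free data: Spearman's coefficient equals the Pearson correlation of the ranks, and Kendall's tau is the difference between concordant and discordant pairs divided by the number of pairs. With ties, as in the binned columns above, pandas applies a tie correction that this hand calculation does not include.

```python
from itertools import combinations
import pandas as pd

# Hypothetical paired values without ties
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 1, 4, 3, 5]})

# Spearman: Pearson correlation computed on the ranks
rho = df.corr(method='spearman').loc['x', 'y']
rho_from_ranks = df.rank().corr(method='pearson').loc['x', 'y']

# Kendall: count concordant and discordant pairs by hand
x, y = df['x'].tolist(), df['y'].tolist()
concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    sign = (x[i] - x[j]) * (y[i] - y[j])
    if sign > 0:
        concordant += 1
    elif sign < 0:
        discordant += 1
n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs
tau = df.corr(method='kendall').loc['x', 'y']

print(round(rho, 2), round(tau, 2))
```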


## 6.6 Principal Component Analysis and Factor Analysis

In [23]:
from sklearn.decomposition import PCA
from sklearn.decomposition import FactorAnalysis as FA
raw_data2 = pd.read_excel('demo.xlsx',sheet_name=1,index_col='USER_ID')
print(raw_data2.head(3))

         LEVEL  CLICKS  VISITS  ORDERS  CON_RATE
USER_ID                                         
1           70  876504   85018    7416  0.569385
2           65  425884   36821    3308  0.527024
3           23  537749   47354    4636  0.899514


### 6.6.1 Principal Component Analysis

In [24]:
pca = PCA(n_components=None)
pca_data = pca.fit_transform(raw_data2)

In [25]:
# Show the first 3 rows of the principal component data
print(pca_data[:3,:].round(2))

[[ 3.904447e+05 -5.135850e+03 -3.018500e+02 -1.991000e+01 -3.000000e-02]
 [-6.270874e+04  1.942810e+03 -4.802000e+01 -1.307000e+01 -1.000000e-02]
 [ 4.965771e+04  1.579790e+03  3.189600e+02  2.706000e+01 -3.400000e-01]]


In [26]:
# Show the proportion of variance explained by each principal component
pca.explained_variance_ratio_

array([9.99944522e-01, 5.49301430e-05, 5.31880710e-07, 1.57204434e-08,
       1.47192045e-12])
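
In the ratios above, the first component explains about 99.99% of the variance. A likely reason is that PCA was fitted on unscaled columns, so a large-magnitude field such as `CLICKS` dominates. The sketch below illustrates the effect on synthetic data (not the book's dataset) by comparing PCA before and after `StandardScaler`. Standardizing first is a common practice and is offered here as an assumption, not as the author's method.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three independent columns on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(scale=1e5, size=200),   # large-scale column, comparable to click counts
    rng.normal(scale=1.0, size=200),   # unit-scale column
    rng.normal(scale=0.1, size=200),   # small-scale column, comparable to a rate
])

raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(raw_ratio.round(4))      # dominated by the large-scale column
print(scaled_ratio.round(4))   # variance spread across components
```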

### 6.6.2 Factor Analysis

In [27]:
fa = FA(n_components=None)
fa_data = fa.fit_transform(raw_data2)

In [28]:
# Show the first 3 rows of the factor scores
fa_data[:3,:].round(2)

array([[ 1.66, -2.94, -1.76, -0.67,  0.  ],
       [-0.27,  1.11, -0.28, -0.44,  0.  ],
       [ 0.21,  0.9 ,  1.86,  0.92,  0.  ]])
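
The last column of the factor scores above is all zeros, which suggests that extracting as many factors as there are input fields yields at least one factor that carries no information. Factor analysis is usually run with fewer factors than variables. The sketch below is a hedged illustration on synthetic data generated from two latent factors; the loadings, the noise level, and the choice of `n_components=2` are all assumptions.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic data: four observed variables driven by two latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
loadings = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
X = latent @ loadings.T + rng.normal(scale=0.3, size=(300, 4))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(StandardScaler().fit_transform(X))

print(scores.shape)           # one row per sample, one column per factor
print(fa.components_.shape)   # one row per factor, one column per observed variable
```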