# Descriptive Statistics (I): Tabular and Graphical Methods

## 2.1 Summarizing Data for a Categorical Variable

Dataset: DC and Marvel comic book characters

Variable | Definition
---|---------
page_id | The unique identifier for that character's page within the wikia
name | The name of the character
urlslug | The unique URL within the wikia that takes you to the character
ID | The identity status of the character (Secret Identity, Public Identity, [on Marvel only: No Dual Identity])
ALIGN | Whether the character is Good, Bad or Neutral
EYE | Eye color of the character
HAIR | Hair color of the character
SEX | Sex of the character (e.g. Male, Female, etc.)
GSM | Whether the character is a gender or sexual minority (e.g. homosexual characters, bisexual characters)
ALIVE | Whether the character is alive or deceased
APPEARANCES | The number of appearances of the character in comic books (as of Sep. 2, 2014; the number will become increasingly out of date as time goes on)
FIRST APPEARANCE | The month and year of the character's first appearance in a comic book, if available
YEAR | The year of the character's first appearance in a comic book, if available

In [1]:
import pandas as pd
# Forward slashes work on Windows, macOS and Linux alike
data_dc=pd.read_csv('Data/fivethirtyeight-comic-characters-dataset/dc-wikia-data.csv')
data_marvel=pd.read_csv('Data/fivethirtyeight-comic-characters-dataset/marvel-wikia-data.csv')
display(data_dc.head(5))

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),/wiki/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),/wiki/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),/wiki/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),/wiki/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),/wiki/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0


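Next, we tabulate the eye colors of the DC characters. The number of times each eye color occurs is called its frequency (the `频数` column below). Dividing a frequency by the total number of characters with a recorded eye color gives the relative frequency (the `频率` column below).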

In [2]:
# value_counts() excludes missing eye colors, so the total is the number of non-missing values
counts=data_dc['EYE'].value_counts()
# Column names: 频数 = frequency, 频率 = relative frequency
freq=pd.DataFrame({'频数':counts,'频率':(counts/counts.sum()).round(3)})
display(freq)

Unnamed: 0,频数,频率
Blue Eyes,1102,0.337
Brown Eyes,879,0.269
Black Eyes,412,0.126
Green Eyes,291,0.089
Red Eyes,208,0.064
White Eyes,116,0.035
Yellow Eyes,86,0.026
Photocellular Eyes,48,0.015
Grey Eyes,40,0.012
Hazel Eyes,23,0.007
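
As a cross-check on the definitions above, pandas can also compute relative frequencies directly through the `normalize=True` option of `value_counts`. The following is a minimal sketch on a small made-up Series, not on the DC dataset; the variable names `eye`, `counts`, `rel` and `freq_table` are chosen here purely for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for an eye-color column; None marks a missing value.
eye = pd.Series(['Blue', 'Brown', 'Blue', None, 'Green', 'Blue', 'Brown', None])

# Frequency: how often each category occurs (missing values are excluded by default).
counts = eye.value_counts()

# Relative frequency: each count divided by the number of non-missing values.
rel = eye.value_counts(normalize=True)

freq_table = pd.DataFrame({'frequency': counts, 'relative_frequency': rel.round(3)})
print(freq_table)
```

Because missing values are dropped before counting, the relative frequencies sum to 1 over the recorded categories rather than over all rows.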


A bar chart shows how often each eye color occurs and how the categories compare in size.

In [3]:
from pyecharts.charts import Bar
import pyecharts.options as opts

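bar=(
    Bar()
    # Drop the trailing ' Eyes' so that the axis labels stay short
    .add_xaxis(list(freq.index.str.replace(' Eyes','')))
    .add_yaxis('Frequency',list(freq['频数']))
    # The chart is a bar chart of a categorical variable, not a histogram
    .set_global_opts(title_opts=opts.TitleOpts(title='Bar chart of eye color frequencies for DC characters'),
                    xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=90)),
                    datazoom_opts=[opts.DataZoomOpts(type_='inside',range_start=0,range_end=100)],
                    )
)
display(bar.render_notebook())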

from pyecharts import options as opts
from pyecharts.charts import Pie


def pie_base(x,y) -> Pie:
    """Build a pie chart from category labels x and values y."""
    c = (
        Pie()
        .add("", [list(z) for z in zip(x,y)])
        .set_global_opts(title_opts=opts.TitleOpts(title="Pie chart of eye color",subtitle='For readability, only the five most common colors are shown separately'))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    return c

# Keep the five most common colors and pool the remaining ones into 'Other'
x=list(freq.index[0:5])
x.append('Other')
y=list(freq['频率'].iloc[0:5])
y.append(round(float(freq['频率'].iloc[5:].sum()),3))
pie=pie_base(x,y)
display(pie.render_notebook())

## 2.2 Summarizing Data for a Quantitative Variable

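Summarizing a quantitative variable takes extra care, because we usually have to group the data into classes and the choice of class width matters. If the classes are too wide, information about the shape of the distribution is lost; if they are too narrow, patterns at a larger scale become hard to see.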
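
The effect of class width can be sketched by binning the same values with different numbers of equal-width classes. The example below is an illustration only: the ages are randomly generated rather than taken from the Mall_Customers data, and `frequency_distribution` is a hypothetical helper introduced here, not a pandas function.

```python
import numpy as np
import pandas as pd

# Hypothetical ages generated for illustration only (integers from 18 to 70).
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(18, 71, size=200), name='Age')


def frequency_distribution(values, bins):
    """Count how many observations fall into each of `bins` equal-width classes."""
    return pd.cut(values, bins).value_counts().sort_index()


coarse = frequency_distribution(ages, 2)   # very wide classes hide the shape of the distribution
medium = frequency_distribution(ages, 5)   # a moderate number of classes
fine = frequency_distribution(ages, 50)    # very narrow classes give small, noisy counts

for label, dist in [('2 classes', coarse), ('5 classes', medium), ('50 classes', fine)]:
    width = round(dist.index[0].length, 1)
    print(label, '| class width about', width, '| largest class count:', dist.max())
```

Every binning accounts for the same 200 observations; only the resolution changes. With 2 classes each count is large but says little about shape, while with 50 classes most counts are too small to reveal a trend.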

In [12]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.font_manager import FontProperties
%matplotlib inline

# Use SimHei as the global sans-serif font so that Chinese labels render correctly
mpl.rcParams.update({
    'font.family': 'sans-serif',
    'font.sans-serif': ['SimHei'],
    'axes.unicode_minus': False,  # render the minus sign correctly
    })
# Custom font object; the file name comes from the list of system Chinese fonts (step 1.b)
myfont = FontProperties(fname=os.path.expanduser('~/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/simhei.ttf'))

mall_customer=pd.read_csv('Data/Mall_Customers.csv')
r=pd.cut(mall_customer['Age'],5)
print(r)

0      (17.948, 28.4]
1      (17.948, 28.4]
2      (17.948, 28.4]
3      (17.948, 28.4]
4        (28.4, 38.8]
5      (17.948, 28.4]
6        (28.4, 38.8]
7      (17.948, 28.4]
8        (59.6, 70.0]
9        (28.4, 38.8]
10       (59.6, 70.0]
11       (28.4, 38.8]
12       (49.2, 59.6]
13     (17.948, 28.4]
14       (28.4, 38.8]
15     (17.948, 28.4]
16       (28.4, 38.8]
17     (17.948, 28.4]
18       (49.2, 59.6]
19       (28.4, 38.8]
20       (28.4, 38.8]
21     (17.948, 28.4]
22       (38.8, 49.2]
23       (28.4, 38.8]
24       (49.2, 59.6]
25       (28.4, 38.8]
26       (38.8, 49.2]
27       (28.4, 38.8]
28       (38.8, 49.2]
29     (17.948, 28.4]
            ...      
170      (38.8, 49.2]
171    (17.948, 28.4]
172      (28.4, 38.8]
173      (28.4, 38.8]
174      (49.2, 59.6]
175      (28.4, 38.8]
176      (49.2, 59.6]
177    (17.948, 28.4]
178      (49.2, 59.6]
179      (28.4, 38.8]
180      (28.4, 38.8]
181      (28.4, 38.8]
182      (38.8, 49.2]
183      (28.4, 38.8]
184      (

## 2.3 Summarizing Data for Two Variables Using Tables

## 2.4 Summarizing Data for Two Variables Using Graphical Displays

## 2.5 Data Visualization: Best Practices in Creating Effective Graphical Displays