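## Contents
- [1 Company Profile of the Big Data Industry](#1)
    - [1.1 Company Overview](#1.1)
    - [1.2 Companies and Salary](#1.2)
        - [Overview](#1.2.1)
        - [Association Mining](#1.2.2)
    - [1.3 Companies and Job Openings](#1.3)
        - [Overview](#1.3.1)
        - [Association Mining](#1.3.2)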

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyecharts.charts import Bar
from pyecharts.charts import Pie
from pyecharts.charts import Geo
from pyecharts.charts import HeatMap
from pyecharts.faker import Faker
from pyecharts import options as opts
from pyecharts.charts import Scatter3D
from pyecharts.charts import Line
import random
import math

In [2]:
data = pd.read_csv('../../../data/info_o.csv')

<h3 id="1">1 Company Profile of the Big Data Industry</h3>
  
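<h4 id="1.1">1.1 Company Overview - What kinds of companies need big data talent?</h4>
We start with the basic profile of the companies that are hiring big data talent: how they are distributed by company type, company size, and location.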

In [3]:
firm = data[['c_name','c_nature','c_scale','w_place']].drop_duplicates(subset=['c_name'],keep='first')

In [4]:
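print(firm.shape[0])  # number of distinct companies
df = firm['c_nature'].value_counts()  # companies per company type
x = list(df.index.values)
y = [int(v) for v in df.values]  # convert NumPy ints to plain Python ints for pyecharts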
bar = Bar()
bar.add_xaxis(x)
bar.add_yaxis("",y,color= "#2c85ff")
bar.set_global_opts(xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-45)))
bar.render_notebook()

8979


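For the 8,979 companies with big data hiring needs, the chart above shows the distribution by company type. Private companies (民营) are by far the most numerous, which is consistent with the makeup of China's economy. The next largest groups are joint-stock companies, listed companies, state-owned enterprises, and joint ventures, each with more than 300 companies. Demand also spans the full range of organizations, from social organizations to government agencies.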

In [5]:
df = firm['c_scale'].value_counts()  # companies per size band
x = list(df.index.values)
y = [int(v) for v in df.values]  # convert NumPy ints to plain Python ints for pyecharts
pie = Pie()
pie.add("", [list(z) for z in zip(x, y)])
pie.render_notebook()

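Among these 8,979 companies, mid-sized companies (100-499 employees) form the largest group, with more than 4,000. Small companies (20-99 employees) and large companies (1,000-9,999 employees) also account for a sizable share, approaching 3,000 companies. Startups (fewer than 20 employees) and very large companies (10,000 or more employees) need big data talent as well, with more than 200 companies in each group.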

In [9]:
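df = firm.w_place.value_counts()  # companies per work location
df = df.drop('其他')  # drop the catch-all '其他' ("other") location
x = list(df.index.values)
y = [int(v) for v in df.values]  # convert NumPy ints to plain Python ints for pyecharts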

In [39]:
m = Geo()
m.add_schema(maptype="china",
             itemstyle_opts=opts.ItemStyleOpts(color="#404a59", border_color="#fff11"),
            )
m.add_coordinate_json('city_location.json')
m.add("", [list(z) for z in zip(x, y)],symbol_size=8)
m.set_series_opts(label_opts=opts.LabelOpts(is_show=False))
m.set_global_opts(
    visualmap_opts=opts.VisualMapOpts()
)
m.render_notebook()

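The map above shows the geographic distribution of these 8,979 companies. Beijing has the most big data companies, more than 2,000. Shanghai, other developed cities (Chengdu, Chongqing), and coastal cities (Shenzhen, Hangzhou) are also dense, each with more than 80 companies. Overall, the number of big data companies falls off with distance from the economic centers and from the coast toward the interior.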

<h4 id="1.2">1.2 Companies and Salary - What kinds of companies pay the most?</h4>
<h5 id="1.2.1">Overview</h5>

In [11]:
gp = data.pivot_table('s_average',index='c_nature',columns='c_scale',aggfunc='mean',margins=True,fill_value=0)
order = ['20人以下', '20-99人', '100-499人', '500-999人','1000-9999人', '10000人以上', 'All']
gp = gp[order]

In [30]:
def draw_heat(gp,v_max):
    # Each entry is [x index (row of gp), y index (column of gp), rounded cell value].
    value = [
        [i, j, round(gp.values[i][j])]
        for i in range(gp.index.shape[0])
        for j in range(gp.columns.shape[0])
    ]
    c = HeatMap()
    c.add_xaxis(list(gp.index))
    c.add_yaxis(
            "",
            list(gp.columns),
            value,
            label_opts=opts.LabelOpts(is_show=True, position="inside"),
        )
    c.set_global_opts(
            title_opts=opts.TitleOpts(title=""),
            visualmap_opts=opts.VisualMapOpts(min_=0, max_=v_max),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-45))
        )
    return c

In [31]:
c = draw_heat(gp,gp.values.max())
c.render_notebook()

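The chart above shows the average salary (in k/month) that companies of each type and size offer for big data positions. The highest value belongs, surprisingly, to a listed company with fewer than 20 employees. Listed companies of that size are rare and this one is an unusual case, so it is not representative and this data point can be ignored. Overall, salaries differ little across company types and sizes, mostly falling around 12k-15k. This suggests that big data skills are a hard requirement that companies of every type and size take seriously.
  
Comparing company types, Hong Kong, Macao, and Taiwan companies offer the highest average salary at 17k/month. Within that group, large companies pay the most, averaging 24k/month. Social organizations pay the least, averaging 6k/month.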
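
The outlier discussed above is easier to spot when the mean table is read next to a count table. The following is a minimal sketch on made-up data, assuming the same column names as the notebook (`c_nature`, `c_scale`, `s_average`); the category labels and salary values are illustrative only and do not come from `info_o.csv`.

```python
import pandas as pd

# Made-up postings; column names mirror the notebook, values are illustrative only.
toy = pd.DataFrame({
    "c_nature": ["private", "private", "private", "listed"],
    "c_scale": ["small", "small", "large", "large"],
    "s_average": [10.0, 14.0, 20.0, 30.0],
})

# Mean salary per (nature, scale) cell, plus 'All' margins.
mean_tbl = toy.pivot_table("s_average", index="c_nature", columns="c_scale",
                           aggfunc="mean", margins=True, fill_value=0)

# Number of postings behind each cell, to spot cells backed by very few rows.
count_tbl = toy.pivot_table("s_average", index="c_nature", columns="c_scale",
                            aggfunc="count", margins=True, fill_value=0)

print(mean_tbl)
print(count_tbl)
```

Two details are worth keeping in mind when reading such a table. An empty cell shows 0 only because of `fill_value=0`, not because anyone offered a salary of 0. The `All` margin is the mean over the underlying rows, so it generally differs from the mean of the cell means.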

In [15]:
print("Companies offering the highest salary (150k):")
data[data.s_max == data.s_max.max()].c_name

Companies offering the highest salary (150k):


277       深圳市快乐人生人力资源有限公司 
654                    中电科
1072     北京信赢荣智企业管理咨询有限公司 
1868      浙江数秦科技有限公司北京分公司 
3239         甘肃思扬网络教育有限公司 
3354         山东纬横数据科技有限公司 
3694         海南易建科技股份有限公司 
3963         西安泽源信息科技有限公司 
4329    上海钧钰互联网金融信息服务有限公司 
4422        湖北聚一线网络开发有限公司 
4865       北京希嘉创智教育科技有限公司 
4959       武汉凯欣隆钢结构工程有限公司 
5544         河北有为文化传媒有限公司 
8490        扬州思亿欧网络科技有限公司 
8503     江苏南大电子信息技术股份有限公司 
Name: c_name, dtype: object

In [16]:
print("Positions offering the highest salary (150k):")
data[data.s_max == data.s_max.max()].j_name

Positions offering the highest salary (150k):


277           财务大数据总监
654          首席战略官CSO
1072           技术副总经理
1868    集团副总裁/执行总裁/VP
3239     分公司总经理/区域合作方
3354        java高级工程师
3694           资深外贸业务
3963       云平台高级开发工程师
4329       理财公司（副总经理）
4422         大数据研发工程师
4865          后端开发工程师
4959          后端开发工程师
5544         科技产品城市代理
8490          博士后科研人员
8503       人工智能产品营销经理
Name: j_name, dtype: object

<h5 id="1.2.2">Association Mining</h5>

In [17]:
data_h = pd.read_csv('../../../data/info_h.csv')
c_s = data_h[[ 'c_nature', 'c_scale','s_average']]

In [18]:
from mlxtend.preprocessing import TransactionEncoder
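def deal(data):
    # Turn one DataFrame row into a list of items (one transaction).
    return data.to_list()
def encode(df):
    # One-hot encode the rows into a boolean item matrix that apriori can consume.
    df_arr = df.apply(deal,axis=1).tolist()
    te = TransactionEncoder()
    df_tf = te.fit_transform(df_arr)
    df = pd.DataFrame(df_tf,columns=te.columns_)
    return df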

In [19]:
c_s = encode(c_s)

In [20]:
from mlxtend.frequent_patterns import apriori
frequent_items = apriori(c_s, min_support=0.05, use_colnames=True, max_len=4).sort_values(by='support', ascending=False)	
frequent_items.head(10)

| | support | itemsets |
|---|---|---|
| 11 | 0.676705 | (民营) |
| 0 | 0.563652 | (100-499人) |
| 16 | 0.454678 | (民营, 100-499人) |
| 8 | 0.357085 | (a4) |
| 7 | 0.293182 | (a3) |
| 23 | 0.284722 | (民营, a4) |
| 15 | 0.255413 | (a4, 100-499人) |
| 6 | 0.231796 | (a2) |
| 26 | 0.231141 | (民营, a4, 100-499人) |
| 22 | 0.181136 | (a3, 民营) |
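
As a rough illustration of what the `support` column means, the sketch below counts itemset support by brute force over a handful of made-up transactions. It is not how `mlxtend.frequent_patterns.apriori` works internally (Apriori prunes candidates level by level); the transactions, the helper names `support` and `frequent_itemsets`, and the threshold are assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical transactions: each row is {company type, company size, salary bin}.
transactions = [
    {"民营", "100-499人", "a4"},
    {"民营", "100-499人", "a3"},
    {"民营", "20-99人", "a2"},
    {"国企", "100-499人", "a4"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= row for row in rows) / len(rows)

def frequent_itemsets(rows, min_support, max_len):
    """Brute-force enumeration of all itemsets with support >= min_support."""
    items = sorted(set().union(*rows))
    result = {}
    for k in range(1, max_len + 1):
        for candidate in combinations(items, k):
            s = support(candidate, rows)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

freq = frequent_itemsets(transactions, min_support=0.5, max_len=3)
for itemset, s in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(set(itemset), s)
```

Read this way, a support of 0.454678 for (民营, 100-499人) in the table means that roughly 45% of the encoded rows contain both items.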


In [22]:
# Evaluation measures
import math
def metrics(r,f):
    ans = []
    for i in range(r.shape[0]):
        item = r.iloc[i]
        ans.append(f(item))
    return ans
def allconf(item):
    return item.support/max(item['antecedent support'],item['consequent support'])
def cosine(item):
    return item.support/math.sqrt(item['antecedent support']*item['consequent support'])
def Jaccard(item):
    return item.support/(item['antecedent support']+item['consequent support']-item.support)
def maxconf(item):
    return max(item.support/item['antecedent support'],item.support/item['consequent support'])
def Kulczynski(item):
    return 0.5*(item.support/item['antecedent support']+item.support/item['consequent support'])
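
As a sanity check on these five measures, the sketch below applies the same formulas to plain numbers instead of DataFrame rows. The inputs are the support values that this notebook reports for the rule (100-499人) → (a4, 民营) in its rule output; the function name `null_invariant_measures` is a hypothetical helper and not part of mlxtend.

```python
import math

def null_invariant_measures(s_ab, s_a, s_b):
    """Return the five measures given support(A and B), support(A), support(B)."""
    conf_ab = s_ab / s_a  # confidence of A -> B
    conf_ba = s_ab / s_b  # confidence of B -> A
    return {
        "allconf": s_ab / max(s_a, s_b),
        "cosine": s_ab / math.sqrt(s_a * s_b),
        "jaccard": s_ab / (s_a + s_b - s_ab),
        "maxconf": max(conf_ab, conf_ba),
        "kulczynski": 0.5 * (conf_ab + conf_ba),
    }

# Support values taken from the rule output of this notebook.
m = null_invariant_measures(s_ab=0.231141, s_a=0.563652, s_b=0.284722)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

Because the inputs are rounded to six decimals, the results should agree with the `Allconf`, `cosine`, `Jaccard`, `Maxconf`, and `Kulczynski` columns only up to small rounding differences.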

In [23]:
from mlxtend.frequent_patterns import association_rules
def get_rules(frequent_items):
    rules =  association_rules(frequent_items, metric='lift')
    rules = rules.sort_values(by=['lift'], ascending=False).reset_index(drop=True)
    rules = rules.drop(['leverage','conviction'],axis = 1)
    rules['cosine'] = metrics(rules,cosine)
    rules['Jaccard'] = metrics(rules,Jaccard)
    rules['Allconf'] = metrics(rules,allconf)
    rules['Maxconf'] = metrics(rules,maxconf)
    rules['Kulczynski'] = metrics(rules,Kulczynski)
    return rules

In [25]:
rules = get_rules(frequent_items)
rules.head(10)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | cosine | Jaccard | Allconf | Maxconf | Kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (100-499人) | (a4, 民营) | 0.563652 | 0.284722 | 0.231141 | 0.410078 | 1.440277 | 0.576981 | 0.37448 | 0.410078 | 0.811815 | 0.610946 |
| 1 | (a4, 民营) | (100-499人) | 0.284722 | 0.563652 | 0.231141 | 0.811815 | 1.440277 | 0.576981 | 0.37448 | 0.410078 | 0.811815 | 0.610946 |
| 2 | (民营, 100-499人) | (a4) | 0.454678 | 0.357085 | 0.231141 | 0.508362 | 1.423643 | 0.57364 | 0.398092 | 0.508362 | 0.647299 | 0.577831 |
| 3 | (a4) | (民营, 100-499人) | 0.357085 | 0.454678 | 0.231141 | 0.647299 | 1.423643 | 0.57364 | 0.398092 | 0.508362 | 0.647299 | 0.577831 |
| 4 | (20-99人) | (a2) | 0.18033 | 0.231796 | 0.057609 | 0.319464 | 1.378213 | 0.281776 | 0.1625 | 0.248534 | 0.319464 | 0.283999 |
| 5 | (a2) | (20-99人) | 0.231796 | 0.18033 | 0.057609 | 0.248534 | 1.378213 | 0.281776 | 0.1625 | 0.248534 | 0.319464 | 0.283999 |
| 6 | (民营) | (a4, 100-499人) | 0.676705 | 0.255413 | 0.231141 | 0.341569 | 1.337317 | 0.555976 | 0.329741 | 0.341569 | 0.904968 | 0.623269 |
| 7 | (a4, 100-499人) | (民营) | 0.255413 | 0.676705 | 0.231141 | 0.904968 | 1.337317 | 0.555976 | 0.329741 | 0.341569 | 0.904968 | 0.623269 |
| 8 | (a4) | (100-499人) | 0.357085 | 0.563652 | 0.255413 | 0.715273 | 1.268997 | 0.569315 | 0.383893 | 0.45314 | 0.715273 | 0.584207 |
| 9 | (100-499人) | (a4) | 0.563652 | 0.357085 | 0.255413 | 0.45314 | 1.268997 | 0.569315 | 0.383893 | 0.45314 | 0.715273 | 0.584207 |


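What the derived association rules say:
  
- Mid-sized private companies (民营) with 100-499 employees typically offer around 17k for big data positions (salary bin a4).
- Mid-sized companies with 100-499 employees offer an average salary of around 17k for big data positions (salary bin a4).
- Small companies with 20-99 employees offer an average salary of 7k-10k for big data positions (salary bin a2).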
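
To make the reading of a single rule concrete, the sketch below recomputes confidence and lift for the rule (民营, 100-499人) → (a4) from the support values in the table above. The helper `confidence_and_lift` is hypothetical; the formulas are the standard definitions, confidence = support(A, B) / support(A) and lift = confidence / support(B).

```python
def confidence_and_lift(s_ab, s_a, s_b):
    """Confidence and lift of A -> B from the three support values."""
    confidence = s_ab / s_a
    lift = confidence / s_b
    return confidence, lift

# Support values taken from the rule table of this notebook.
conf, lift = confidence_and_lift(s_ab=0.231141, s_a=0.454678, s_b=0.357085)
print(f"confidence={conf:.4f}, lift={lift:.4f}")
```

A lift above 1 means that mid-sized private companies fall into the a4 salary bin more often than postings overall (about 51% versus about 36%). The confidence is only about one half, though, so the rule describes a tendency rather than a guarantee.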

In [26]:
def draw_scatter(rules):
    # Configuration: which rule columns map to axes, color, and symbol size
    config_xAxis3D = "support"
    config_yAxis3D = "confidence"
    config_zAxis3D = "lift"
    config_color = "support"
    config_symbolSize = "lift"
    # Build the data points, one per rule
    data = [
        [
            rules.loc[i][config_xAxis3D],
            rules.loc[i][config_yAxis3D],
            rules.loc[i][config_zAxis3D],
            rules.loc[i][config_color],
            rules.loc[i][config_symbolSize],
            i,
        ]
        for i in range(rules.shape[0])
    ]

    s3= Scatter3D(init_opts=opts.InitOpts(width="700px", height="320px"))  # bg_color="black"
    s3.add(
        series_name="",
        data=data,
        xaxis3d_opts=opts.Axis3DOpts(
            name=config_xAxis3D,
            type_="value"
        ),
        yaxis3d_opts=opts.Axis3DOpts(
            name=config_yAxis3D,
            type_="value"
        ),
        zaxis3d_opts=opts.Axis3DOpts(
            name=config_zAxis3D,
            type_="value"
        ),
        grid3d_opts=opts.Grid3DOpts(width=100, height=100, depth=100),
    )
    s3.set_global_opts(
        visualmap_opts=[
            opts.VisualMapOpts(
                type_="color",
                is_calculable=True,
                dimension=3,
                pos_top="10",
                max_= 0.5,
                range_color=[
                        "#1710c0",
                        "#0b9df0",
                        "#00fea8",
                        "#00ff0d",
                        "#f5f811",
                        "#f09a09",
                        "#fe0300",
                ],
            )
        ]
    )
    return s3

**Rule evaluation**
  
The figure below is a scatter plot of the confidence, support, and lift of the derived association rules.

In [27]:
s = draw_scatter(rules)
s.render_notebook()

<h4 id="1.3">1.3 Companies and Job Openings - What kinds of companies need big data talent the most?</h4>
<h5 id="1.3.1">Overview</h5>

In [35]:
data = data[data.vacancies < 100]
gp = data.pivot_table('vacancies',index='c_nature',columns='c_scale',aggfunc='sum',margins=True,fill_value=0)
gp = gp[order]

In [32]:
vh = draw_heat(gp,2000)
vh.render_notebook()

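Across the 19,748 job postings, private companies make up most employers in China, so they naturally show the largest demand for big data talent, with 35,180 openings in total. Listed companies and joint-stock companies also have pressing demand, each with more than 4,000 openings. Joint ventures and state-owned enterprises have close to 3,000 openings.
  
By company size, openings are not concentrated in the very large companies (10,000 or more employees). More of them are found in mid-sized and large companies (100-9,999 employees).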

<h5 id="1.3.2">Association Mining</h5>

In [36]:
c_v = data_h[[ 'c_nature', 'c_scale','vacancies']]
c_v = encode(c_v)

In [37]:
frequent_items = apriori(c_v, min_support=0.05, use_colnames=True, max_len=4).sort_values(by='support', ascending=False)	
rules = get_rules(frequent_items)
rules.head(10)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | cosine | Jaccard | Allconf | Maxconf | Kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (20-99人) | (v2, 民营) | 0.18033 | 0.151878 | 0.059724 | 0.331192 | 2.180643 | 0.360883 | 0.219183 | 0.331192 | 0.393236 | 0.362214 |
| 1 | (v2, 民营) | (20-99人) | 0.151878 | 0.18033 | 0.059724 | 0.393236 | 2.180643 | 0.360883 | 0.219183 | 0.331192 | 0.393236 | 0.362214 |
| 2 | (v2) | (20-99人, 民营) | 0.272837 | 0.132037 | 0.059724 | 0.2189 | 1.657862 | 0.314665 | 0.173038 | 0.2189 | 0.452326 | 0.335613 |
| 3 | (20-99人, 民营) | (v2) | 0.132037 | 0.272837 | 0.059724 | 0.452326 | 1.657862 | 0.314665 | 0.173038 | 0.2189 | 0.452326 | 0.335613 |
| 4 | (20-99人) | (v2) | 0.18033 | 0.272837 | 0.080421 | 0.445965 | 1.634546 | 0.362563 | 0.215752 | 0.294758 | 0.445965 | 0.370362 |
| 5 | (v2) | (20-99人) | 0.272837 | 0.18033 | 0.080421 | 0.294758 | 1.634546 | 0.362563 | 0.215752 | 0.294758 | 0.445965 | 0.370362 |
| 6 | (100-499人) | (民营, v1) | 0.563652 | 0.467268 | 0.363884 | 0.645582 | 1.381611 | 0.709046 | 0.545523 | 0.645582 | 0.778748 | 0.712165 |
| 7 | (民营, v1) | (100-499人) | 0.467268 | 0.563652 | 0.363884 | 0.778748 | 1.381611 | 0.709046 | 0.545523 | 0.645582 | 0.778748 | 0.712165 |
| 8 | (v1, 100-499人) | (民营) | 0.416558 | 0.676705 | 0.363884 | 0.873549 | 1.290887 | 0.68537 | 0.498895 | 0.537729 | 0.873549 | 0.705639 |
| 9 | (民营) | (v1, 100-499人) | 0.676705 | 0.416558 | 0.363884 | 0.537729 | 1.290887 | 0.68537 | 0.498895 | 0.537729 | 0.873549 | 0.705639 |


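What the derived association rules say:
  
- Small private companies (民营) with 20-99 employees have on average 2-5 openings per position (vacancy bin v2).
- Mid-sized private companies (民营) with 100-499 employees have on average 1 opening per position (vacancy bin v1).
- Small companies with 20-99 employees have on average 2-5 openings per position (vacancy bin v2).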

**Rule evaluation**
  
The figure below is a scatter plot of the confidence, support, and lift of the derived association rules.

In [38]:
sv = draw_scatter(rules)
sv.render_notebook()