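## Contents
- [1 Individual Profile in the Big Data Industry](#1)
    - [1.1 Overview of Individual Qualifications](#1.1)
    - [1.2 Individual Qualifications and Salary](#1.2)
        - [Overview](#1.2.1)
        - [Association Mining](#1.2.2)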


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyecharts.charts import Bar
from pyecharts.charts import Pie
from pyecharts.charts import Geo
from pyecharts.charts import HeatMap
from pyecharts.faker import Faker
from pyecharts import options as opts
from pyecharts.charts import Scatter3D
from pyecharts.charts import Line
import random
import math

In [2]:
data = pd.read_csv('../../../data/info_o.csv')

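<h3 id="1">1 Individual Profile in the Big Data Industry</h3>
  
<h4 id="1.1">1.1 Overview of Individual Qualifications - What Do Companies Require of Candidates?</h4>
We start with work experience and educational background to get a basic picture of the individual qualifications the big data industry asks for.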

In [3]:
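# Strip stray whitespace from the experience labels so identical categories match.
data['w_experience'] = data['w_experience'].str.strip()
# Keep one row per company (its first posting) to describe company-level requirements.
capacity = data[['c_name', 'w_experience', 'education']].drop_duplicates(subset=['c_name'], keep='first')
capacity[:10]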

Unnamed: 0,c_name,w_experience,education
0,纳沃克斯(北京)国际咨询有限公司 Networkers International,3-5年,不限
1,RIZE World Wide LTD,3-5年,本科
3,任子行网络技术股份有限公司,3-5年,本科
4,深圳市兴东企业管理有限公司,5-10年,本科
5,深圳杰然网络科技有限公司,3-5年,本科
6,北京渔阳信通信息技术有限公司,不限,大专
7,南京欧米伽网络科技有限公司,不限,大专
8,励牛课思(北京)信息技术有限公司济南分公司,不限,大专
9,北京百知教育科技有限公司,不限,大专
10,北京中关新才科技有限公司,不限,大专


In [4]:
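# w_experience was already stripped on `data` in the previous cell, so no second strip is needed.
print(capacity.shape[0])

# Number of companies per experience requirement.
df = capacity['w_experience'].value_counts()
x = list(df.index.values)
y = [int(v) for v in df.values]  # convert NumPy integers to plain Python ints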
bar = Bar()
bar.add_xaxis(x)
bar.add_yaxis("",y,color= "#2c85ff")
bar.set_global_opts(xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-45)))
bar.render_notebook()

8979


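For the 8,979 companies with demand for big data talent, the chart above shows their work-experience requirements. Most companies ask for 1-5 years of experience, which accounts for about half of all companies. Roughly 20% place no restriction on work experience; these more flexible postings may include campus recruitment or internal transfers.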
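
The shares quoted above can be read directly from the counts. The sketch below uses a small made-up Series (not the real `capacity` data) to show how `value_counts(normalize=True)` turns counts into proportions; on the real data the same call would presumably be applied to `capacity['w_experience']`.

```python
import pandas as pd

# Hypothetical toy data standing in for capacity['w_experience'].
toy_experience = pd.Series(
    ["1-3年", "3-5年", "不限", "1-3年", "3-5年",
     "5-10年", "不限", "1-3年", "3-5年", "1-3年"]
)

# Absolute counts per category, as plotted in the bar chart.
counts = toy_experience.value_counts()

# The same counts expressed as a share of all rows.
shares = toy_experience.value_counts(normalize=True)

# Combined share of the 1-3 year and 3-5 year categories.
share_1_to_5 = shares["1-3年"] + shares["3-5年"]

print(counts)
print(shares.round(2))
print(f"1-5 years: {share_1_to_5:.0%}")
```

Reporting the normalized shares next to the chart makes statements such as "about half" or "roughly 20%" easy to verify.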

In [5]:
df = capacity['education'].value_counts()
x = list(df.index.values)
y = [int(v) for v in df.values]
pie = Pie()
pie.add("", [list(z) for z in zip(x, y)])
pie.render_notebook()

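The chart shows that more than half of the big data companies require at least a bachelor's degree (本科), and about 20% require at least a junior college diploma (大专). This suggests that degree requirements are fairly relaxed, presumably because the industry values practical ability and most companies set no hard requirement for advanced degrees.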

<h4 id="1.2">1.2 Individual Qualifications and Salary - Who Earns the Most?</h4>
<h5 id="1.2.1">Overview</h5>

In [6]:
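# Mean of s_average per (experience, education) pair; margins=True adds the 'All' row and column.
gp = data.pivot_table('s_average',index='w_experience',columns='education',aggfunc='mean',margins=True,fill_value=0)
# Order the education columns from lowest to highest degree, then '不限' (no requirement) and 'All'.
order = ['高中', '中技', '中专', '大专','本科', '硕士','博士','不限','All']
gp = gp[order]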

In [7]:
def draw_heat(gp,v_max):
    # Each entry is [x index (experience row), y index (education column), rounded mean salary].
    value = [[i, j, round(gp.values[i][j])] for i in range(gp.index.shape[0]) for j in range(gp.columns.shape[0])]
    c = HeatMap()
    c.add_xaxis(list(gp.index))
    c.add_yaxis(
            "",
            list(gp.columns),
            value,
            label_opts=opts.LabelOpts(is_show=True, position="inside"),
        ) 
    c.set_global_opts(
            title_opts=opts.TitleOpts(title=""),
            visualmap_opts=opts.VisualMapOpts(min_=0, max_=v_max),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-45))
        )
    return c

In [11]:
c = draw_heat(gp,gp.values.max())
c.render_notebook()

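The chart above shows the average salary (k per month) for each combination of work experience and education. PhD holders (博士) with 3-5 years of experience have the highest average salary, closely followed by master's degree holders (硕士) with more than ten years of experience. So although most companies set no hard requirement for a master's degree or PhD, candidates with a higher degree tend to be paid more at the same level of experience. On the experience side, candidates with more than ten years of experience have the highest average salary, followed by 5-10 years and then 3-5 years. A likely explanation is that longer time in the big data field brings stronger technical skills and more accumulated experience, which makes new tasks easier to pick up, so companies offer these candidates higher pay.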
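
To make the structure of the heat map's input concrete, here is a minimal sketch of `pivot_table` with `margins=True` on a made-up table. The column names match the notebook, but the rows and salary values are invented for illustration. Note that the `All` entries are means over the underlying rows, not means of the cell means, and that `fill_value=0` makes empty combinations appear as 0 rather than as missing.

```python
import pandas as pd

# Hypothetical toy postings; salaries are invented, in k per month.
toy = pd.DataFrame({
    "w_experience": ["1-3年", "1-3年", "1-3年", "3-5年"],
    "education":    ["本科", "本科", "硕士", "硕士"],
    "s_average":    [10.0, 12.0, 20.0, 30.0],
})

table = toy.pivot_table(
    "s_average",
    index="w_experience",
    columns="education",
    aggfunc="mean",
    margins=True,   # adds the 'All' row and column
    fill_value=0,   # combinations with no postings show as 0, not NaN
)
print(table)

# Mean over the three underlying '1-3年' rows: (10 + 12 + 20) / 3.
row_margin = table.loc["1-3年", "All"]
# No '3-5年' posting asks for '本科', so this cell is the fill value.
empty_cell = table.loc["3-5年", "本科"]
```

Because a 0 in the heat map can mean "no postings" rather than "zero salary", cells with the fill value are best read as missing combinations.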

In [9]:
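print("Work experience and education for postings with the highest maximum salary (150k):")
data[data.s_max == data.s_max.max()][['w_experience','education']]

Work experience and education for postings with the highest maximum salary (150k):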


Unnamed: 0,w_experience,education
277,不限,不限
654,10年以上,硕士
1072,5-10年,硕士
1868,10年以上,硕士
3239,不限,不限
3354,3-5年,本科
3694,不限,本科
3963,不限,不限
4329,5-10年,大专
4422,不限,大专


<h5 id="1.2.2">Association Mining</h5>

In [12]:
data_h = pd.read_csv('../../../data/info_h.csv')
c_s = data_h[[ 'w_experience', 'education','s_average']]

In [13]:
from mlxtend.preprocessing import TransactionEncoder
def deal(data):
    return data.to_list()
def encode(df):
    df_arr = df.apply(deal,axis=1).tolist()
    te = TransactionEncoder()
    df_tf = te.fit_transform(df_arr)
    df = pd.DataFrame(df_tf,columns=te.columns_)
    return df

In [14]:
c_s = encode(c_s)

In [15]:
from mlxtend.frequent_patterns import apriori
frequent_items = apriori(c_s, min_support=0.05, use_colnames=True, max_len=4).sort_values(by='support', ascending=False)	
frequent_items.head(10)

Unnamed: 0,support,itemsets
11,0.62257,(本科)
8,0.357085,(a4)
7,0.293182,(a3)
30,0.279736,"(a4, 本科)"
9,0.266895,(不限)
6,0.231796,(a2)
10,0.210444,(大专)
3,0.208178,(3-5年)
29,0.203495,"(本科, a3)"
2,0.158123,(1-3年)


In [17]:
# Evaluation measures for association rules
import math
def metrics(r,f):
    ans = []
    for i in range(r.shape[0]):
        item = r.iloc[i]
        ans.append(f(item))
    return ans
def allconf(item):
    return item.support/max(item['antecedent support'],item['consequent support'])
def cosine(item):
    return item.support/math.sqrt(item['antecedent support']*item['consequent support'])
def Jaccard(item):
    return item.support/(item['antecedent support']+item['consequent support']-item.support)
def maxconf(item):
    return max(item.support/item['antecedent support'],item.support/item['consequent support'])
def Kulczynski(item):
    return 0.5*(item.support/item['antecedent support']+item.support/item['consequent support'])

In [18]:
from mlxtend.frequent_patterns import association_rules
def get_rules(frequent_items):
    rules =  association_rules(frequent_items, metric='lift')
    rules = rules.sort_values(by=['lift'], ascending=False).reset_index(drop=True)
    rules = rules.drop(['leverage','conviction'],axis = 1)
    rules['cosine'] = metrics(rules,cosine)
    rules['Jaccard'] = metrics(rules,Jaccard)
    rules['Allconf'] = metrics(rules,allconf)
    rules['Maxconf'] = metrics(rules,maxconf)
    rules['Kulczynski'] = metrics(rules,Kulczynski)
    return rules

In [19]:
rules = get_rules(frequent_items)
rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,cosine,Jaccard,Allconf,Maxconf,Kulczynski
0,( 3-5年),"(a4, 本科)",0.142411,0.279736,0.097089,0.681754,2.437132,0.486436,0.298683,0.347075,0.681754,0.514414
1,"(a4, 本科)",( 3-5年),0.279736,0.142411,0.097089,0.347075,2.437132,0.486436,0.298683,0.347075,0.681754,0.514414
2,"(本科, 3-5年)",(a4),0.122872,0.357085,0.097089,0.790164,2.212816,0.463509,0.253584,0.271894,0.790164,0.531029
3,(a4),"(本科, 3-5年)",0.357085,0.122872,0.097089,0.271894,2.212816,0.463509,0.253584,0.271894,0.790164,0.531029
4,(a1),(不限),0.117937,0.266895,0.069443,0.588813,2.20616,0.391411,0.220182,0.260189,0.588813,0.424501
5,(不限),(a1),0.266895,0.117937,0.069443,0.260189,2.20616,0.391411,0.220182,0.260189,0.588813,0.424501
6,( 3-5年),(a4),0.142411,0.357085,0.110585,0.776521,2.174608,0.490387,0.284345,0.309688,0.776521,0.543104
7,(a4),( 3-5年),0.357085,0.142411,0.110585,0.309688,2.174608,0.490387,0.284345,0.309688,0.776521,0.543104
8,(大专),(a1),0.210444,0.117937,0.053077,0.252213,2.138537,0.336908,0.192793,0.252213,0.450043,0.351128
9,(a1),(大专),0.117937,0.210444,0.053077,0.450043,2.138537,0.336908,0.192793,0.252213,0.450043,0.351128
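
As a sanity check on the measures defined earlier, the sketch below recomputes them by hand for rule 0 of the table, using only the three rounded support values shown there. The variable names are illustrative; the results should agree with the table row up to rounding of the inputs.

```python
import math

# Values copied from rule 0 of the table above: ( 3-5年) -> (a4, 本科).
sup_a = 0.142411   # antecedent support
sup_c = 0.279736   # consequent support
sup_ac = 0.097089  # support of antecedent and consequent together

confidence = sup_ac / sup_a                  # P(consequent | antecedent)
lift = confidence / sup_c                    # confidence relative to the baseline
cosine = sup_ac / math.sqrt(sup_a * sup_c)
jaccard = sup_ac / (sup_a + sup_c - sup_ac)
all_conf = sup_ac / max(sup_a, sup_c)
max_conf = max(sup_ac / sup_a, sup_ac / sup_c)
kulczynski = 0.5 * (sup_ac / sup_a + sup_ac / sup_c)

for name, value in [
    ("confidence", confidence),
    ("lift", lift),
    ("cosine", cosine),
    ("Jaccard", jaccard),
    ("Allconf", all_conf),
    ("Maxconf", max_conf),
    ("Kulczynski", kulczynski),
]:
    print(f"{name:<11}{value:.4f}")
```

A lift above 1 means the antecedent and consequent occur together more often than they would if independent. The other five measures depend only on the three supports, so they are unaffected by the total number of transactions.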


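The derived association rules indicate:
  
- Candidates with a bachelor's degree (本科) and 3-5 years of work experience earn an average salary of about 17k (salary bin `a4`).
- Positions with no restriction (不限) on education or work experience offer an average salary of about 6k (salary bin `a1`).
- Positions requiring a junior college diploma (大专) offer an average salary of about 6k (salary bin `a1`).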
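
One detail in the rule table is worth checking: it lists ` 3-5年` with a leading space, while the frequent-itemset table lists `3-5年` without one. Items are matched as exact strings, so this suggests that `w_experience` in `info_h.csv` was not stripped and that one category may be split in two. The sketch below illustrates the effect on a few made-up transactions; the `support` helper is a hypothetical name, not part of mlxtend.

```python
# Hypothetical transactions: (experience, education, salary bin).
transactions = [
    ["3-5年", "本科", "a4"],
    [" 3-5年", "本科", "a4"],   # same category, stray leading space
    [" 3-5年", "硕士", "a4"],
    ["1-3年", "大专", "a1"],
]

def support(itemset, rows):
    """Fraction of rows that contain every item in itemset."""
    wanted = set(itemset)
    return sum(wanted <= set(row) for row in rows) / len(rows)

# Raw labels: the category is split into two distinct items.
raw_clean_label = support(["3-5年"], transactions)
raw_space_label = support([" 3-5年"], transactions)
print(raw_clean_label)   # 0.25
print(raw_space_label)   # 0.5

# After stripping whitespace the two items merge into one.
cleaned = [[item.strip() for item in row] for row in transactions]
merged = support(["3-5年"], cleaned)
print(merged)            # 0.75
```

On the real data the equivalent step would be `data_h['w_experience'].str.strip()` before encoding, mirroring what cell `In [3]` does for `data`. The supports and rules would then need to be recomputed.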

In [16]:
def draw_scatter(rules):
    # Configuration: which rule column drives each axis, the color, and the symbol size
    config_xAxis3D = "support"
    config_yAxis3D = "confidence"
    config_zAxis3D = "lift"
    config_color = "support"
    config_symbolSize = "lift"
    # Build the data points, one per rule
    data = [
        [
            rules.loc[i][config_xAxis3D],
            rules.loc[i][config_yAxis3D],
            rules.loc[i][config_zAxis3D],
            rules.loc[i][config_color],
            rules.loc[i][config_symbolSize],
            i,
        ]
        for i in range(rules.shape[0])
    ]

    s3= Scatter3D(init_opts=opts.InitOpts(width="700px", height="320px"))  # bg_color="black"
    s3.add(
        series_name="",
        data=data,
        xaxis3d_opts=opts.Axis3DOpts(
            name=config_xAxis3D,
            type_="value"
        ),
        yaxis3d_opts=opts.Axis3DOpts(
            name=config_yAxis3D,
            type_="value"
        ),
        zaxis3d_opts=opts.Axis3DOpts(
            name=config_zAxis3D,
            type_="value"
        ),
        grid3d_opts=opts.Grid3DOpts(width=100, height=100, depth=100),
    )
    s3.set_global_opts(
        visualmap_opts=[
            opts.VisualMapOpts(
                type_="color",
                is_calculable=True,
                dimension=3,
                pos_top="10",
                max_= 0.5,
                range_color=[
                        "#1710c0",
                        "#0b9df0",
                        "#00fea8",
                        "#00ff0d",
                        "#f5f811",
                        "#f09a09",
                        "#fe0300",
                ],
            )
        ]
    )
    return s3

**Rule evaluation**
  
The figure below shows a scatter plot of the confidence, support, and lift of the derived association rules.

In [20]:
s = draw_scatter(rules)
s.render_notebook()