# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [66]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users.

In [67]:
users = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|', index_col='user_id')
users.head()

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
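The two `read_csv` arguments used above can be tried offline. This is a minimal sketch, assuming a hypothetical three-row sample in the same pipe-delimited layout as `u.user`; the names `sample` and `sample_users` are illustrative and not part of the exercise.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout of u.user (not the real file).
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# sep='|' splits fields on the pipe character;
# index_col='user_id' makes that column the row index instead of a data column.
sample_users = pd.read_csv(sample, sep="|", index_col="user_id")

print(sample_users.index.name)
print(list(sample_users.columns))
```

Because `user_id` becomes the index, it no longer appears among the data columns, and rows can be selected with `sample_users.loc[2]`.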


### Step 4. Discover the mean age per occupation

In [68]:
users.groupby('occupation')['age'].mean()

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

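*Compare each user's age with the mean age of their occupation group. The `gender_num` column in the output below comes from Step 5, which had already been run (see the cell numbers).*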

In [75]:
occupation_mean_age = users.groupby('occupation')['age'].transform('mean')
users['age_deviation'] = users['age'] - occupation_mean_age
users.head()

user_id,age,gender,occupation,zip_code,gender_num,age_deviation
1,24,M,technician,85711,1,-9.148148
2,53,F,other,94043,0,18.47619
3,23,M,writer,32067,1,-13.311111
4,24,M,technician,43537,1,-9.148148
5,33,F,other,15213,0,-1.52381
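The difference between a group-wise reduction and `transform` is easiest to see on a small frame. The following is a sketch on hypothetical toy data (the frame `toy` and its values are made up for illustration): the reduction returns one value per group, while `transform` returns one value per original row, so it can be subtracted from the `age` column directly.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame(
    {
        "occupation": ["technician", "other", "writer", "technician", "other"],
        "age": [24, 53, 23, 30, 33],
    }
)

# Reduction: one mean per occupation (3 values here).
per_group = toy.groupby("occupation")["age"].mean()

# transform: the group mean repeated for every row (5 values here),
# aligned to the original index.
per_row = toy.groupby("occupation")["age"].transform("mean")

# Row-wise deviation from the group mean.
toy["age_deviation"] = toy["age"] - per_row
print(toy)
```

A quick sanity check for this kind of feature is that the deviations within each group sum to zero.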


### Step 5. Discover the male ratio per occupation and sort it from highest to lowest

In [69]:
# Encode gender as 1 for 'M' and 0 otherwise; the group mean is then the male ratio
users['gender_num'] = (users['gender'] == 'M').astype(int)
users.groupby('occupation')['gender_num'].mean().sort_values(ascending=False)

occupation
doctor           1.000000
engineer         0.970149
technician       0.962963
retired          0.928571
programmer       0.909091
executive        0.906250
scientist        0.903226
entertainment    0.888889
lawyer           0.833333
salesman         0.750000
educator         0.726316
student          0.693878
other            0.657143
marketing        0.615385
writer           0.577778
none             0.555556
administrator    0.544304
artist           0.535714
librarian        0.431373
healthcare       0.312500
homemaker        0.142857
Name: gender_num, dtype: float64
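The ratio works because the mean of a 0/1 indicator is the fraction of ones. A minimal sketch on hypothetical toy data shows the same idea without adding a helper column: a boolean Series is grouped by the occupation column and averaged.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame(
    {
        "occupation": ["doctor", "doctor", "artist", "artist", "artist", "artist"],
        "gender": ["M", "M", "F", "M", "F", "F"],
    }
)

# eq('M') yields booleans; True counts as 1 and False as 0 when averaged,
# so each group mean is the share of rows with gender == 'M'.
male_ratio = (
    toy["gender"]
    .eq("M")
    .groupby(toy["occupation"])
    .mean()
    .sort_values(ascending=False)
)
print(male_ratio)
```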

### Step 6. For each occupation, calculate the minimum and maximum ages

In [70]:
users.groupby('occupation')['age'].agg(['max', 'min'])

occupation,max,min
administrator,70,21
artist,48,19
doctor,64,28
educator,63,23
engineer,70,22
entertainment,50,15
executive,69,22
healthcare,62,22
homemaker,50,20
lawyer,53,21


*Bin the ages and show the number of users in each occupation for each age group*

In [87]:
# Intervals are right-inclusive: (0, 18], (18, 50], (50, 70]; ages outside them become NaN
bins = [0, 18, 50, 70]
users_labels = ['child', 'adult', 'senior']

users['labels'] = pd.cut(users['age'], bins, labels=users_labels)
users.groupby(['labels', 'occupation'], observed=True).size()

labels  occupation   
child   entertainment      2
        none               3
        other              3
        salesman           1
        student           43
        writer             2
adult   administrator     66
        artist            28
        doctor             5
        educator          75
        engineer          59
        entertainment     16
        executive         29
        healthcare        14
        homemaker          7
        lawyer            10
        librarian         39
        marketing         22
        none               5
        other             91
        programmer        62
        salesman           9
        scientist         30
        student          153
        technician        26
        writer            38
senior  administrator     13
        doctor             2
        educator          20
        engineer           8
        executive          3
        healthcare         2
        lawyer             2
        librarian    
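The edge behavior of `pd.cut` matters for the bins used above. The sketch below uses hypothetical ages chosen to sit on and beyond the bin edges. By default the intervals are closed on the right, and values outside every interval are labeled `NaN`, so they silently drop out of a later `groupby`. Whether the dataset contains ages above 70 is not visible in the output shown here, so treat the open-ended variant as an optional safeguard.

```python
import pandas as pd

# Hypothetical ages placed on and around the bin edges.
ages = pd.Series([7, 18, 19, 50, 51, 70, 73])
names = ["child", "adult", "senior"]

# Default right=True: (0, 18], (18, 50], (50, 70]; 73 matches no interval.
labels = pd.cut(ages, bins=[0, 18, 50, 70], labels=names)

# Open-ended last bin, so no age is left unlabeled.
labels_open = pd.cut(ages, bins=[0, 18, 50, float("inf")], labels=names)

print(pd.DataFrame({"age": ages, "closed": labels, "open_ended": labels_open}))
```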

*Alternatively, split the ages by quantiles*

In [90]:
users['age_quartile'] = pd.qcut(users['age'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
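Unlike `pd.cut`, which takes fixed edges, `pd.qcut` derives the edges from the data so that each bin holds roughly the same number of rows. A minimal sketch on hypothetical, evenly spaced ages follows; `retbins=True` is used only to expose the computed edges.

```python
import pandas as pd

# Hypothetical ages, evenly spaced so the quartiles are easy to follow.
ages = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Four quantile-based bins with readable labels.
quartile = pd.qcut(ages, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Number of rows per bin, in category order.
counts = quartile.value_counts(sort=False)

# retbins=True also returns the bin edges that qcut computed.
_, edges = pd.qcut(ages, 4, retbins=True)

print(counts)
print(edges)
```

With heavily tied data the computed edges can coincide, in which case `pd.qcut` raises an error unless `duplicates='drop'` is passed.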

### Step 7. For each combination of occupation and gender, calculate the mean age

In [71]:
users.groupby(['occupation', 'gender'])['age'].mean()

occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.16666
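Grouping by two keys returns a Series with a two-level index, and combinations that never occur (such as female doctors in the output above) are simply absent. As a sketch on hypothetical toy data, `unstack` pivots the inner level into columns, which makes the missing combinations visible as `NaN`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame(
    {
        "occupation": ["doctor", "doctor", "artist", "artist", "artist"],
        "gender": ["M", "M", "F", "M", "F"],
        "age": [40, 50, 30, 32, 34],
    }
)

# Long format: one row per (occupation, gender) pair that actually occurs.
mean_age = toy.groupby(["occupation", "gender"])["age"].mean()

# Wide format: gender becomes the columns; absent pairs show up as NaN.
wide = mean_age.unstack("gender")
print(wide)
```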

### Step 8. For each occupation, present the percentage of women and men

In [72]:
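# Alternative (long format): users.groupby('occupation')['gender'].value_counts(normalize=True)
# normalize='index' makes each row sum to 1, so the values are proportions, not percentages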
pd.crosstab(users.occupation, users.gender, normalize='index')

occupation,F,M
administrator,0.455696,0.544304
artist,0.464286,0.535714
doctor,0.0,1.0
educator,0.273684,0.726316
engineer,0.029851,0.970149
entertainment,0.111111,0.888889
executive,0.09375,0.90625
healthcare,0.6875,0.3125
homemaker,0.857143,0.142857
lawyer,0.166667,0.833333


*Use pivot_table to compute several statistics at once*

In [93]:
pd.pivot_table(users, index='occupation', columns='gender', values='age', aggfunc=['mean', 'max'])

occupation,mean (F),mean (M),max (F),max (M)
administrator,40.638889,37.162791,62.0,70.0
artist,30.307692,32.333333,48.0,45.0
doctor,,43.571429,,64.0
educator,39.115385,43.101449,51.0,63.0
engineer,29.5,36.6,36.0,70.0
entertainment,31.0,29.0,38.0,50.0
executive,44.0,38.172414,49.0,69.0
healthcare,39.818182,45.4,53.0,62.0
homemaker,34.166667,23.0,50.0,23.0
lawyer,39.5,36.2,51.0,53.0


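*About crosstab and pivot_table<br>crosstab is essentially a simplified pivot_table specialized for frequency counts. It wraps the computation of counts and proportions (via normalize) and is mainly used to examine the relationship between two categorical variables, while pivot_table provides a more general aggregation framework that is better suited to computing specific numeric statistics.*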
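The relationship between the two functions can be checked directly. The following is a sketch on hypothetical toy data: a plain `crosstab` matches a `pivot_table` that counts rows, and a `crosstab` given `values` and `aggfunc` matches the corresponding `pivot_table` aggregation.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame(
    {
        "occupation": ["doctor", "doctor", "artist", "artist", "artist"],
        "gender": ["M", "M", "F", "M", "F"],
        "age": [40, 50, 30, 32, 34],
    }
)

# Frequency table two ways.
counts_crosstab = pd.crosstab(toy["occupation"], toy["gender"])
counts_pivot = toy.pivot_table(
    index="occupation", columns="gender", values="age", aggfunc="count", fill_value=0
)

# Mean age two ways; crosstab needs both values= and aggfunc= for this.
mean_crosstab = pd.crosstab(
    toy["occupation"], toy["gender"], values=toy["age"], aggfunc="mean"
)
mean_pivot = toy.pivot_table(
    index="occupation", columns="gender", values="age", aggfunc="mean"
)

print(counts_crosstab)
print(mean_pivot)
```

For pure counts `crosstab` is the shorter call; once a numeric column is being aggregated, `pivot_table` tends to read more naturally because it works on column names of a single DataFrame.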