## Exploring Census Data


---

In this challenge, you will use Pandas to explore the data and answer several questions about the
[<i class="fa fa-external-link-square" aria-hidden="true">Adult dataset</i>](https://archive.ics.uci.edu/ml/datasets/Adult/).<br />
The Adult dataset is a census income dataset. It contains a number of features, and its target is a categorical value.

Load the dataset and preview it.

In [2]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings("ignore")

In [3]:
data = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1283/adult.data.csv')
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


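All columns of the DataFrame are features except the last one, `salary`, which is the target. From here on, you need to write the necessary code yourself to answer each challenge question.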

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> How many men and women are in the dataset?


In [4]:
data["sex"].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What is the average age of the women in the dataset?

In [6]:
data[data["sex"]=="Female"]["age"].mean()

36.85823043357163

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What proportion of the people in the dataset are German citizens?


In [8]:
data["native-country"].value_counts(normalize=True)["Germany"]

0.004207487485028101
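Note that `value_counts(normalize=True)` divides by the number of all non-null rows, so rows whose `native-country` is recorded as `?` still count toward the denominator. The sketch below, on a made-up five-row series (the `toy` values are illustrative and not taken from the Adult data), shows how the proportion shifts if those rows are excluded first.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.Series(
    ["Germany", "Cuba", "Germany", "?", "United-States"],
    name="native-country",
)

# Share of "Germany" among all rows, "?" included.
share_all = toy.value_counts(normalize=True)["Germany"]

# Share of "Germany" among rows with a known country only.
known = toy[toy != "?"]
share_known = known.value_counts(normalize=True)["Germany"]

print(share_all, share_known)
```

Which denominator is appropriate depends on how the question is meant to be read; the cell above uses all rows.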

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What are the mean and standard deviation of age for people earning more than 50K per year and for those earning 50K or less?

In [10]:
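# String aliases avoid the deprecation warning that recent pandas versions
# raise when NumPy functions such as np.mean are passed to agg().
data.groupby(["salary"])["age"].agg(["mean", "std"])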

| salary | mean | std |
|---|---|---|
| <=50K | 36.783738 | 14.020088 |
| >50K | 44.249841 | 10.519028 |


<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Do all people earning more than 50K per year have at least a high-school education?




In [12]:
# np.all(data[data["salary"]==">50K"]["education-num"] >=9)
data[data["salary"]==">50K"]["education"].unique()

array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
       'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
       '10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)
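The list above contains levels below `HS-grad` (for example `7th-8th` and `9th`), so the answer is no. The commented-out check can also be written with `.all()`, which returns a single boolean. The sketch below runs on a made-up frame and assumes that an `education-num` of 9 corresponds to `HS-grad`, as in the `HS-grad` row of the preview.

```python
import pandas as pd

# Hypothetical toy data; `education-num` 9 is assumed to mean HS-grad,
# matching the HS-grad row in the dataset preview.
toy = pd.DataFrame(
    {
        "salary": [">50K", ">50K", "<=50K", ">50K"],
        "education": ["Bachelors", "9th", "HS-grad", "HS-grad"],
        "education-num": [13, 5, 9, 9],
    }
)

high_earners = toy.loc[toy["salary"] == ">50K"]

# True only if every high earner has at least a high-school diploma.
all_hs_or_above = bool((high_earners["education-num"] >= 9).all())

# The education levels that break the claim, if any.
below_hs = high_earners.loc[high_earners["education-num"] < 9, "education"]

print(all_hs_or_above, below_hs.tolist())
```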

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Use `groupby` and `describe` to summarize the age distribution for each combination of race and sex.

In [14]:
data.groupby(["race", "sex"])["age"].describe()

| race | sex | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Amer-Indian-Eskimo | Female | 119.0 | 37.117647 | 13.114991 | 17.0 | 27.0 | 36.0 | 46.0 | 80.0 |
| Amer-Indian-Eskimo | Male | 192.0 | 37.208333 | 12.049563 | 17.0 | 28.0 | 35.0 | 45.0 | 82.0 |
| Asian-Pac-Islander | Female | 346.0 | 35.089595 | 12.300845 | 17.0 | 25.0 | 33.0 | 43.75 | 75.0 |
| Asian-Pac-Islander | Male | 693.0 | 39.073593 | 12.883944 | 18.0 | 29.0 | 37.0 | 46.0 | 90.0 |
| Black | Female | 1555.0 | 37.854019 | 12.637197 | 17.0 | 28.0 | 37.0 | 46.0 | 90.0 |
| Black | Male | 1569.0 | 37.6826 | 12.882612 | 17.0 | 27.0 | 36.0 | 46.0 | 90.0 |
| Other | Female | 109.0 | 31.678899 | 11.631599 | 17.0 | 23.0 | 29.0 | 39.0 | 74.0 |
| Other | Male | 162.0 | 34.654321 | 11.355531 | 17.0 | 26.0 | 32.0 | 42.0 | 77.0 |
| White | Female | 8642.0 | 36.811618 | 14.329093 | 17.0 | 25.0 | 35.0 | 46.0 | 90.0 |
| White | Male | 19174.0 | 39.652498 | 13.436029 | 17.0 | 29.0 | 38.0 | 49.0 | 90.0 |


<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Among high-income men, count how many are married and how many are unmarried (the latter including divorced and separated).

In [15]:
data[(data["salary"]==">50K")&(data["sex"]=="Male")]["marital-status"].value_counts()
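`value_counts()` lists every marital status separately, so the married and unmarried totals still have to be added up by hand. One way to collapse them is to treat every status that starts with `Married` as married. This grouping rule is an assumption about what the question intends, and the frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Male", "Female", "Male"],
        "salary": [">50K", ">50K", ">50K", ">50K", "<=50K"],
        "marital-status": [
            "Married-civ-spouse",
            "Divorced",
            "Married-AF-spouse",
            "Married-civ-spouse",
            "Never-married",
        ],
    }
)

rich_men = toy.loc[(toy["salary"] == ">50K") & (toy["sex"] == "Male")]

# Assumption: any status starting with "Married" counts as married;
# everything else (divorced, separated, never married, ...) does not.
is_married = rich_men["marital-status"].str.startswith("Married")

counts = is_married.map({True: "married", False: "unmarried"}).value_counts()
print(counts)
```

If a status such as `Married-spouse-absent` should count as separated instead, the mask would need an explicit list of statuses rather than a prefix test.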

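<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Find the maximum number of hours worked per week and the number of people who work that many hours, then compute the share of that group earning more than 50K.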

In [15]:
max_work = data["hours-per-week"].max()
df_max_work = data[data["hours-per-week"]==max_work]
print(max_work, df_max_work.shape[0])
df_max_work["salary"].value_counts(normalize=True)[">50K"]
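Indexing the `value_counts()` result with `">50K"` raises a `KeyError` if nobody in the selected group earns more than 50K. A slightly more defensive variant, sketched here on made-up data, takes the mean of a boolean mask, which yields `0.0` in that case.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {
        "hours-per-week": [40, 99, 99, 60],
        "salary": ["<=50K", "<=50K", "<=50K", ">50K"],
    }
)

max_hours = toy["hours-per-week"].max()
longest = toy.loc[toy["hours-per-week"] == max_hours]

# The mean of a boolean mask is the fraction of True values, so an
# all-False mask gives 0.0 rather than a lookup error.
share_rich = (longest["salary"] == ">50K").mean()

print(max_hours, len(longest), share_rich)
```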

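<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> For each country, compute the average weekly working hours of people earning more than 50K and of those earning 50K or less.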


In [17]:
data.groupby(["native-country", "salary"])["hours-per-week"].agg(["mean"])

| native-country | salary | mean |
|---|---|---|
| ? | <=50K | 40.164760 |
| ? | >50K | 45.547945 |
| Cambodia | <=50K | 41.416667 |
| Cambodia | >50K | 40.000000 |
| Canada | <=50K | 37.914634 |
| ... | ... | ... |
| United-States | >50K | 45.505369 |
| Vietnam | <=50K | 37.193548 |
| Vietnam | >50K | 39.200000 |
| Yugoslavia | <=50K | 41.600000 |
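The long two-level result can be hard to scan. As an optional presentation step, `unstack` moves the `salary` level into columns so that each country occupies one row. The sketch below uses a made-up frame, not the Adult data.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {
        "native-country": ["Cuba", "Cuba", "Cuba", "Canada", "Canada"],
        "salary": ["<=50K", "<=50K", ">50K", "<=50K", ">50K"],
        "hours-per-week": [40, 30, 50, 38, 45],
    }
)

mean_hours = toy.groupby(["native-country", "salary"])["hours-per-week"].mean()

# Move the inner index level (salary) into columns: one row per country,
# one column per salary group.
wide = mean_hours.unstack("salary")
print(wide)
```

A country with no rows in one of the salary groups would show `NaN` in that column.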


---

[<i class="fa fa-file-code-o"></i>Reference](https://nbviewer.jupyter.org/github/shiyanlou/mlcourse-answers/tree/master/)