# Census Dataset `Adult`
> @杨冠林

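- Dataset: http://archive.ics.uci.edu/ml/datasets/Adult (`.data` is the training set, `.test` is the test set, `.names` describes the dataset)
- Goal: use the UCI census dataset to `predict whether income exceeds $50k`
- Suitable for: decision trees, logistic regression, and other classification algorithms
- Attribute information:
    > Class labels: >50K, <=50K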

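    - `age`: continuous.
    - `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
    - `fnlwgt` (final weight): continuous.
    - `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
    - `education-num`: continuous.
    - `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
    - `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
    - `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
    - `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
    - `sex`: Female, Male.
    - `capital-gain`: continuous.
    - `capital-loss`: continuous.
    - `hours-per-week`: continuous.
    - `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.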


----
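## Problem Analysis
- The main goal is to `predict whether income exceeds 50k`: analyze the attributes and select features, documenting the process in `markdown` cells
> Recommended VSCode extensions: `hediet.vscode-drawio`, `mushan.vscode-paste-image`
1. Using `draw.io`: create a document with the `XXX.dio` extension to draw flowcharts, `E-R` diagrams, and so on
2. Using `PasteImage`: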

![](pics/2022-12-29-11-05-55.png)

   - Take a screenshot by any means
   - Press `Ctrl+Alt+V` to save the screenshot from the clipboard to the `project root folder/pics` path

----
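## Data Preparation
- Importing, inspecting, cleaning, manipulating, and preprocessing the data are left to you
> Requirement: keep the unprocessed data in the `original` folder and save processed (or newly generated) files to the `post_cleaning` folder, so that the `project file tree` stays tidy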

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
import sklearn.model_selection as sk_ms
import sklearn.metrics as sk_met
import sklearn.preprocessing as sk_pre
from pyecharts.charts import Line,Bar,Scatter,Pie,MapGlobe
import pyecharts.options as opts
import seaborn as sns

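- Raw data:
    - `adult.data` is for training, `adult.test` is for testing, and `adult.names` holds the dataset description<br>
    ![adult.data](pics/2023-01-03-15-59-40.png)<br>
    These are plain text files whose values are separated by `, `; they can be read as CSV, but the column names and the separator handling must be specified<br>
    `adult.test` has the same layout, except that its first line must be skipped<br>
    ![](pics/2023-01-03-16-00-52.png)<br>
    Parameter reference: [pandas.read_csv参数超级详解，好多栗子！](https://blog.csdn.net/sinat_35562946/article/details/81058221) (a detailed Chinese-language guide to the `pandas.read_csv` parameters)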


In [2]:
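# skipinitialspace=True drops the blank after each comma, so the default C parser
# can be used instead of the regex separator ', ' (which triggers a ParserWarning)
train = pd.read_csv('./original/adult.data',sep=',',skipinitialspace=True,names=[
    'age','work_t','fnlwgt','edu','edu_n','marital','job','rel','race','sex','gain','loss','hr_week','nc','y'])
train.head()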


| | age | work_t | fnlwgt | edu | edu_n | marital | job | rel | race | sex | gain | loss | hr_week | nc | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [3]:
test = pd.read_csv('./original/adult.test',skipinitialspace=True,skiprows=1,names=[
    'age','work_t','fnlwgt','edu','edu_n','marital','job','rel','race','sex','gain','loss','hr_week','nc','y'])
test.head()

| | age | work_t | fnlwgt | edu | edu_n | marital | job | rel | race | sex | gain | loss | hr_week | nc | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K. |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K. |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K. |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K. |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K. |


In [4]:
# Test labels carry a trailing '.', e.g. '<=50K.'; strip it so they match the training labels
test['y'] = test['y'].str.rstrip('.')

- There appear to be no null values: unknown entries are marked with `?` rather than `nan`

In [5]:
datasheet = pd.concat([train,test]).replace('?',np.nan)
datasheet

| | age | work_t | fnlwgt | edu | edu_n | marital | job | rel | race | sex | gain | loss | hr_week | nc | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16276 | 39 | Private | 215419 | Bachelors | 13 | Divorced | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 36 | United-States | <=50K |
| 16277 | 64 | NaN | 321403 | HS-grad | 9 | Widowed | NaN | Other-relative | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 16278 | 38 | Private | 374983 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 16279 | 44 | Private | 83891 | Bachelors | 13 | Divorced | Adm-clerical | Own-child | Asian-Pac-Islander | Male | 5455 | 0 | 40 | United-States | <=50K |
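
As an alternative to replacing `?` after loading, `pandas.read_csv` can convert the marker to `NaN` at read time through its `na_values` parameter. The following is a minimal sketch on an in-memory sample rather than the real files: the two data rows are copied from the `adult.test` preview above, while the junk first line and the names `COLS`, `raw`, and `sample` are assumptions made for illustration.

```python
import io
import pandas as pd

COLS = ['age', 'work_t', 'fnlwgt', 'edu', 'edu_n', 'marital', 'job', 'rel',
        'race', 'sex', 'gain', 'loss', 'hr_week', 'nc', 'y']

# Two rows in the adult.test layout, preceded by a hypothetical junk first line
raw = io.StringIO(
    "|junk first line\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, "
    "Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, "
    "White, Female, 0, 0, 30, United-States, <=50K.\n"
)

# skiprows drops the junk line, skipinitialspace removes the blank after each comma,
# and na_values turns the '?' marker into NaN while parsing
sample = pd.read_csv(raw, names=COLS, skiprows=1, skipinitialspace=True, na_values='?')
sample['y'] = sample['y'].str.rstrip('.')

# Count missing values per column, showing only columns that have any
missing_per_column = sample.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

Counting `isna()` per column this way is also a quick check that no `?` markers survived the load.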


- Handle the categorical data first by label-encoding it; then split the continuous numeric data into bins

In [6]:
def value_proc(df,n_v_method='quantile',n_bin=5):
    # Validate the binning strategy before doing any work
    if n_v_method not in ('uniform','quantile','kmeans'):
        raise ValueError("Keyword n_v_method only accepts 'uniform','quantile','kmeans'!")
    # Categorical columns: encode each one as integer labels
    cat_val = df.select_dtypes('object').copy()
    for col in cat_val.columns:
        cat_val[col] = sk_pre.LabelEncoder().fit_transform(cat_val[col])
    # Continuous numeric columns: discretize each one into n_bin ordinal bins
    num_val = df.select_dtypes(exclude='object').copy()
    cont_enc = sk_pre.KBinsDiscretizer(n_bins=n_bin,encode='ordinal',strategy=n_v_method)
    for col in num_val.columns:
        num_val[col] = cont_enc.fit_transform(num_val[col].to_numpy().reshape(-1,1)).ravel()
    # Numeric columns first, then categorical ones (the target y ends up last)
    return pd.concat([num_val,cat_val],axis=1)

In [7]:
vect = value_proc(datasheet.dropna(),n_v_method='kmeans',n_bin=4)
vect

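| | age | fnlwgt | edu_n | gain | loss | hr_week | work_t | edu | marital | job | rel | race | sex | nc | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 1.0 | 5 | 9 | 4 | 0 | 1 | 4 | 1 | 38 | 0 |
| 1 | 2.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 4 | 9 | 2 | 3 | 0 | 4 | 1 | 38 | 0 |
| 2 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | 2 | 11 | 0 | 5 | 1 | 4 | 1 | 38 | 0 |
| 3 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2 | 1 | 2 | 5 | 0 | 2 | 1 | 38 | 0 |
| 4 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 2 | 9 | 2 | 9 | 5 | 2 | 0 | 4 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16275 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 1.0 | 2 | 9 | 4 | 9 | 3 | 4 | 1 | 38 | 0 |
| 16276 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 1.0 | 2 | 9 | 0 | 9 | 1 | 4 | 0 | 38 | 0 |
| 16278 | 1.0 | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 2 | 9 | 2 | 9 | 0 | 4 | 1 | 38 | 0 |
| 16279 | 2.0 | 0.0 | 3.0 | 1.0 | 0.0 | 1.0 | 2 | 9 | 0 | 0 | 3 | 1 | 1 | 38 | 0 |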
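
To make the two encodings used by `value_proc` easier to read, here is a small sketch of `LabelEncoder` and `KBinsDiscretizer` on toy values. The sample values and the `'uniform'` strategy are chosen for illustration only (the notebook call above uses `'kmeans'`), and the names `enc`, `codes`, `binner`, and `bins` are hypothetical.

```python
import numpy as np
import sklearn.preprocessing as sk_pre

# Label encoding: the distinct classes are sorted, then mapped to 0..n-1
enc = sk_pre.LabelEncoder()
codes = enc.fit_transform(['Private', 'State-gov', 'Private', 'Local-gov'])
print(list(enc.classes_), codes.tolist())

# Uniform binning: split the observed age range into 4 equal-width ordinal bins
ages = np.array([18, 25, 39, 50, 64, 90], dtype=float).reshape(-1, 1)
binner = sk_pre.KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
bins = binner.fit_transform(ages).ravel()
print(binner.bin_edges_[0].tolist(), bins.tolist())
```

Note that the integer codes produced by `LabelEncoder` follow alphabetical order, not any meaningful ranking of the categories. Tree models can usually cope with that, whereas linear models may read the codes as magnitudes.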


----
## Data Modeling
- Check the share of each target class in the sample

In [8]:
a = datasheet['y'].value_counts(normalize=True)
graph = (
    Pie(init_opts=opts.InitOpts(bg_color='white'))
    .add('',[list(z) for z in zip(a.index.tolist(),a.values.tolist())])
    .set_global_opts(title_opts=opts.TitleOpts(title="Share of target y"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="[{b}] = {c}"))
)
graph.render_notebook()
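
When the class shares are uneven, the `class_weight='balanced'` option passed to the decision tree later in this notebook reweights samples inversely to class frequency. The sketch below shows how those weights come out on toy labels, using scikit-learn's `compute_class_weight`; the four-sample label array is an assumption made for illustration and does not reflect the real class ratio.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: three majority-class samples and one minority-class sample
y = np.array(['<=50K', '<=50K', '<=50K', '>50K'])
classes = np.unique(y)

# 'balanced' assigns each class n_samples / (n_classes * count_of_that_class)
weights = compute_class_weight('balanced', classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))
```

Here the minority class receives three times the weight of the majority class, so both classes contribute equally to the impurity calculations.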

- Split off a test set

In [9]:
train_X,test_X,train_y,test_y = sk_ms.train_test_split(vect.iloc[:,:-1],vect.iloc[:,-1],test_size=0.2,random_state=20)

- Create the model and search for the best parameters
    > See: [sklearn中的GridSearchCV方法详解](https://www.cnblogs.com/dalege/p/14175192.html) (a Chinese-language guide to `GridSearchCV` in sklearn)

In [10]:
dtc = DecisionTreeClassifier(class_weight='balanced')
gs = sk_ms.GridSearchCV(dtc,cv=10,n_jobs=-1,param_grid={
    'splitter':('best','random'),
    'criterion':("gini","entropy"),
    'max_depth':[*range(1,10)],
    'min_samples_leaf':[*range(1,50,5)],
    'min_impurity_decrease':[*np.linspace(0,0.5,20)]
})
gs.fit(train_X,train_y)

In [23]:
# Accuracy of the refit best model on the held-out test split
test_acc = sk_met.accuracy_score(test_y,gs.predict(test_X))
print(
    'GridSearchCV results:\n'
    f'    Best parameters:\n{gs.best_params_}\n'
    f'    Best score:{gs.best_score_*100}%\n'
    f'    Test-set accuracy:{test_acc*100}%'
)

GridSearchCV results:
    Best parameters:
{'criterion': 'gini', 'max_depth': 9, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'splitter': 'best'}
    Best score:79.2768593367754%
    Test-set accuracy:79.38087341072416%


In [22]:
dc = DecisionTreeClassifier(
    class_weight='balanced',
    splitter='best',
    criterion='gini',
    max_depth=9,
    min_samples_leaf=1,
    min_impurity_decrease=0.0
)
dc.fit(train_X,train_y)
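
Retyping the best parameters by hand works, but with the default `refit=True`, `GridSearchCV` has already refit the best configuration on the whole training split and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data from `make_classification`, which stands in for the encoded census features. The small grid and the names `search` and `best_tree` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded census features
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=20)

search = GridSearchCV(
    DecisionTreeClassifier(class_weight='balanced', random_state=0),
    param_grid={'max_depth': [2, 4, 6], 'criterion': ['gini', 'entropy']},
    cv=5,
)
search.fit(X_tr, y_tr)

# best_estimator_ is the winning configuration, already refit on all of X_tr
best_tree = search.best_estimator_
test_acc = best_tree.score(X_te, y_te)
print(search.best_params_, round(test_acc, 3))
```

Fixing `random_state` on the tree, as done here, keeps repeated fits comparable. Without it, a tree refit by hand can differ slightly from the one chosen by the search.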

In [24]:
# Use predict on new data
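
Calling `predict` on new records requires encoding them exactly as the training data was encoded, which means reusing the fitted encoders rather than fitting fresh ones. One way to arrange this, sketched below under stated assumptions, is to wrap the preprocessing and the tree in a scikit-learn `Pipeline`. The toy rows, the use of `OrdinalEncoder` in place of `LabelEncoder`, and the names `toy`, `prep`, `model`, and `new_rows` are all illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy training frame reusing a few of the notebook's column names
toy = pd.DataFrame({
    'age':     [39, 50, 38, 53, 28, 44, 25, 61],
    'hr_week': [40, 13, 40, 40, 40, 50, 30, 45],
    'work_t':  ['State-gov', 'Private', 'Private', 'Private',
                'Private', 'Local-gov', 'Private', 'State-gov'],
    'y':       ['<=50K', '<=50K', '<=50K', '<=50K',
                '<=50K', '>50K', '<=50K', '>50K'],
})
num_cols, cat_cols = ['age', 'hr_week'], ['work_t']

# Bin numeric columns and ordinal-encode categorical ones; unseen categories map to -1
prep = ColumnTransformer([
    ('num', KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform'), num_cols),
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_cols),
])
model = Pipeline([('prep', prep),
                  ('tree', DecisionTreeClassifier(max_depth=3, random_state=0))])
model.fit(toy[num_cols + cat_cols], toy['y'])

# A new record, including a category that never appeared during training
new_rows = pd.DataFrame({'age': [47], 'hr_week': [48], 'work_t': ['Never-worked']})
pred = model.predict(new_rows)
print(pred)
```

Because the encoders live inside the fitted pipeline, the bin edges and category codes learned during training are applied unchanged to the new rows.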

- Logistic regression

In [None]:
lr = LogisticRegression()
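
The cell above only instantiates the model. A minimal sketch of how the logistic regression step could be completed follows, again on synthetic data from `make_classification` rather than the census features. The scaling step, `class_weight='balanced'`, `max_iter=1000`, and the variable names are assumptions made for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data standing in for the encoded census features
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=20)

# Standardize features, then fit a class-weighted logistic regression
lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000),
)
lr.fit(X_tr, y_tr)

lr_acc = lr.score(X_te, y_te)        # accuracy on the held-out split
proba = lr.predict_proba(X_te)       # one probability column per class
print(round(lr_acc, 3), proba.shape)
```

On the real data, a linear model would read the integer label codes as ordered magnitudes, so one-hot encoding the categorical columns is likely a better fit than the label codes used for the tree.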

----
## Data Visualization
- Covers both `visualization of the model's prediction results` and `pure EDA visualization unrelated to model predictions`
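
A common starting point for visualizing prediction results is the confusion matrix. The sketch below computes one with `sklearn.metrics` on hand-written toy labels; the arrays `y_true` and `y_pred` are assumptions made for illustration and are not model output from this notebook.

```python
import sklearn.metrics as sk_met

# Toy ground truth and predictions (0 stands for <=50K, 1 for >50K)
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = sk_met.confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1 in text form
print(sk_met.classification_report(y_true, y_pred, target_names=['<=50K', '>50K']))
```

The resulting array can then be drawn as a heatmap, for example with `sns.heatmap(cm, annot=True, fmt='d')` from the `seaborn` package imported at the top of the notebook.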

> Examples of good analyses, for reference:<br>[Explain your ML model: no more black boxes 🎁](https://www.kaggle.com/code/amidala/explain-your-ml-model-no-more-black-boxes)<br>[Who can earn more than 50K per year?](https://www.kaggle.com/code/jiashenliu/who-can-earn-more-than-50k-per-year)

> Do not copy them verbatim