### Training

1. Load the Adult dataset: https://archive.ics.uci.edu/ml/datasets/Adult
2. Train a model and save the result to disk.

**Data overview:**

Target variable:
*   target (income): >50K, <=50K.

Features:
1. age
2. workclass (employment type): Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3. fnlwgt (an estimate of the number of people with the same characteristics; spelled `fnlwg` in the code below)
4. education (education level): Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5. education-num (length of education)
6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7. occupation (field of work): Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. relationship (role in the family): Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. sex: Female, Male.
11. capital-gain.
12. capital-loss.
13. hours-per-week (number of hours worked per week).
14. native-country (country of birth): United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.


In [42]:
!pip install catboost



In [43]:
import numpy as np
import pandas as pd
import dill
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve
from sklearn.base import BaseEstimator, TransformerMixin
from catboost import Pool, CatBoostClassifier
from sklearn.linear_model import LogisticRegression

Load the data

In [44]:
df = pd.read_csv('adult.csv', header=None).fillna(0)
df.columns = ['age', 'workclass', 'fnlwg', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'target']
df.head(3)

Unnamed: 0,age,workclass,fnlwg,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
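
The column names can also be passed to `read_csv` directly instead of being assigned afterwards. A minimal sketch, using the three rows shown above as an in-memory stand-in for `adult.csv` (the `io.StringIO` buffer and the names `COLUMNS`, `SAMPLE` and `sample_df` are illustrative and not part of the notebook):

```python
import io

import pandas as pd

# Column names as used in the notebook ('fnlwg' follows the notebook's spelling).
COLUMNS = ['age', 'workclass', 'fnlwg', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'target']

# Stand-in for adult.csv: the first three rows from the output above.
SAMPLE = (
    "39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,"
    "Not-in-family,White,Male,2174,0,40,United-States,<=50K\n"
    "50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,"
    "Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K\n"
    "38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,"
    "Not-in-family,White,Male,0,0,40,United-States,<=50K\n"
)

# names= sets the headers at read time, so no separate df.columns step is needed.
sample_df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)
print(sample_df.shape)  # (3, 15)
```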


In [45]:
df.shape

(32561, 15)

In [46]:
cat_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
# Missing values are encoded as '?'; replace them with the literal string '0'.
for col in cat_columns:
    df[col] = df[col].str.replace('?', '0', regex=False)
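
The default value of the `regex` argument of `str.replace` has changed between pandas versions, and `?` is a regular-expression metacharacter. Passing `regex=False` states the literal intent explicitly. A small sketch on a toy Series (the values are made up for illustration):

```python
import pandas as pd

# Toy column with the '?' placeholder for missing values (illustrative values).
workclass = pd.Series(['Private', '?', 'State-gov', '?'])

# regex=False forces a plain substring replacement.
cleaned = workclass.str.replace('?', '0', regex=False)
print(cleaned.tolist())  # ['Private', '0', 'State-gov', '0']
```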


In [47]:
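# Convert the target variable to 0/1
df['target'] = df['target'].str.replace('<=50K','0')
df['target'] = df['target'].str.replace('>50K','1')
# Assign the result back; astype() alone does not modify df in place.
df['target'] = df['target'].astype(int)
df.target.tail()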

32556    0
32557    1
32558    0
32559    0
32560    1
Name: target, dtype: int64

In [48]:
data = df.drop(columns='target')
labels = df.target
cat_features = [1, 3, 5, 6, 7, 8, 9, 13]

Split the data into train/test

In [49]:
X_train, X_test, y_train, y_test = train_test_split(data, 
                                                    labels, test_size=0.3, random_state=42)
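
As a sanity check on the split, the sketch below uses placeholder arrays with the same number of rows as the dataset (32561) to show how `test_size=0.3` divides them. The arrays `X_demo` and `y_demo` are stand-ins, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 32561  # number of rows reported by df.shape
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_demo = np.zeros(n_rows, dtype=int)       # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# The test share is rounded up; the remaining rows go to the training set.
print(len(X_tr), len(X_te))  # 22792 9769
```

Because `random_state` is fixed, repeating the call with the same arguments selects the same rows.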

Train the model

In [50]:
train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features=cat_features)

eval_dataset = Pool(data=X_test,
                    label=y_test,
                    cat_features=cat_features)

model = CatBoostClassifier(iterations=500).fit(train_dataset, use_best_model=True, eval_set=eval_dataset)

Learning rate set to 0.092732
0:	learn: 0.6053361	test: 0.6053339	best: 0.6053339 (0)	total: 49.7ms	remaining: 24.8s
1:	learn: 0.5338783	test: 0.5336567	best: 0.5336567 (1)	total: 99.8ms	remaining: 24.9s
2:	learn: 0.4851729	test: 0.4837332	best: 0.4837332 (2)	total: 141ms	remaining: 23.4s
3:	learn: 0.4512573	test: 0.4502356	best: 0.4502356 (3)	total: 179ms	remaining: 22.1s
4:	learn: 0.4197485	test: 0.4187354	best: 0.4187354 (4)	total: 222ms	remaining: 22s
5:	learn: 0.3983656	test: 0.3977895	best: 0.3977895 (5)	total: 265ms	remaining: 21.8s
6:	learn: 0.3813135	test: 0.3807273	best: 0.3807273 (6)	total: 305ms	remaining: 21.5s
7:	learn: 0.3682415	test: 0.3676547	best: 0.3676547 (7)	total: 342ms	remaining: 21s
8:	learn: 0.3591630	test: 0.3584117	best: 0.3584117 (8)	total: 376ms	remaining: 20.5s
9:	learn: 0.3515598	test: 0.3508998	best: 0.3508998 (9)	total: 421ms	remaining: 20.6s
10:	learn: 0.3453847	test: 0.3446092	best: 0.3446092 (10)	total: 464ms	remaining: 20.6s
11:	learn: 0.3393453	tes

In [51]:
model.feature_names_

['age',
 'workclass',
 'fnlwg',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country']

In [52]:
model.feature_importances_

array([13.21410903,  2.98982021,  3.57996757,  3.61390269,  4.77827767,
        8.30890386,  6.60552476, 17.77766904,  1.20524866,  1.72428116,
       18.98021384,  6.9051418 ,  8.60765433,  1.70928538])

Next, we focus on the most important features (non-categorical ones only, for simplicity):
1. capital-gain
2. age
3. hours-per-week
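
One way to arrive at this list is to pair the feature names with their importances, drop the categorical columns, and sort. The sketch below copies the values from the two outputs above; the helper names `numeric` and `top3` are illustrative:

```python
# Feature names and importances copied from the outputs above.
feature_names = ['age', 'workclass', 'fnlwg', 'education', 'education-num',
                 'marital-status', 'occupation', 'relationship', 'race', 'sex',
                 'capital-gain', 'capital-loss', 'hours-per-week',
                 'native-country']
importances = [13.21410903, 2.98982021, 3.57996757, 3.61390269, 4.77827767,
               8.30890386, 6.60552476, 17.77766904, 1.20524866, 1.72428116,
               18.98021384, 6.9051418, 8.60765433, 1.70928538]
cat_features = [1, 3, 5, 6, 7, 8, 9, 13]  # categorical column indices

# Keep the non-categorical columns, then sort by importance, highest first.
numeric = [(name, score)
           for i, (name, score) in enumerate(zip(feature_names, importances))
           if i not in cat_features]
top3 = [name for name, _ in sorted(numeric, key=lambda p: p[1], reverse=True)[:3]]
print(top3)  # ['capital-gain', 'age', 'hours-per-week']
```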



In [53]:
data = data.rename(columns={'capital-gain': 'capital_gain', 'hours-per-week': 'hours_per_week'})

Train a logistic regression on these features

In [54]:
X_train, X_test, y_train, y_test = train_test_split(data[['capital_gain', 'age', 'hours_per_week']], 
                                                    labels, test_size=0.3, random_state=42)

# save the test split
X_test.to_csv("X_test.csv", index=False)
y_test.to_csv("y_test.csv", index=False)
# save the train split
X_train.to_csv("X_train.csv", index=False)
y_train.to_csv("y_train.csv", index=False)

logreg = LogisticRegression().fit(X_train, y_train)
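
The notebook imports `roc_auc_score` but does not show an evaluation step. One possible check is sketched here on synthetic data. The generated features and labels are invented for illustration; on the real data, `X_test` and `y_test` from the split above would take their place:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: three numeric features, label driven mostly by the first.
X_syn = rng.normal(size=(1000, 3))
y_syn = (X_syn[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42)

clf = LogisticRegression().fit(X_tr, y_tr)

# ROC AUC is computed from the predicted probability of the positive class.
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print(f"ROC AUC: {auc:.3f}")
```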


Save the model

In [55]:
with open("logreg.dill", "wb") as f:
    dill.dump(logreg, f)
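
To use the saved model later, it can be read back with `dill.load`. A round-trip sketch on a toy model follows. The training data and the temporary file path are made up for illustration, and the `pickle` fallback is an assumption added only so the sketch also runs where `dill` is not installed:

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

try:
    import dill
except ImportError:
    # Assumption: fall back to the stdlib module, which has the same dump/load calls.
    import pickle as dill

# Toy model standing in for the notebook's logreg.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Save to a temporary location (the notebook writes to "logreg.dill").
path = os.path.join(tempfile.mkdtemp(), "logreg.dill")
with open(path, "wb") as f:
    dill.dump(toy_model, f)

# Read the model back and confirm it predicts the same as the original.
with open(path, "rb") as f:
    restored = dill.load(f)

print(np.array_equal(restored.predict(X_toy), toy_model.predict(X_toy)))  # True
```

Loading a dill or pickle file can execute arbitrary code, so only files from a trusted source should be loaded this way.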

In [56]:
!pip freeze > requirements.txt