# Analysis of the UCI Adult income dataset

**In this assignment, you use Pandas to answer several questions about the UCI repository's [Adult](https://archive.ics.uci.edu/ml/datasets/Adult) dataset.**

Unique feature values (more information at the link above):
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   
- salary: >50K,<=50K

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("adult.data")
data.head(10)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
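
The categorical values in this file carry a leading space (for example `' Male'`), which is why the filters in this notebook compare against strings such as `' Female'`. A minimal sketch of one way to avoid that, assuming the file separates fields with a comma followed by a space: `read_csv` accepts `skipinitialspace=True`. The inline sample is a hypothetical three-row stand-in for the real file, not the dataset itself.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file: fields are separated by ", "
sample = io.StringIO(
    "age, sex, salary\n"
    "39, Male, <=50K\n"
    "28, Female, <=50K\n"
    "52, Male, >50K\n"
)

# skipinitialspace=True drops the blank that follows each delimiter,
# both in the header and in the values
clean = pd.read_csv(sample, skipinitialspace=True)

print(clean.columns.tolist())
print(clean["sex"].unique())
```

If the data were loaded this way, the comparisons in the other cells would need the unpadded strings (`'Female'` rather than `' Female'`).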


**1. How many men and women (feature *sex*) are represented in this dataset?**

In [3]:
print(data['sex'].value_counts())

 Male      21790
 Female    10771
Name: sex, dtype: int64


**2. What is the average age (feature *age*) of women?**

In [4]:
round(data[data['sex'] == ' Female']['age'].mean(), 2)

36.86

**3. What is the proportion of German citizens (feature *native-country*)?**

In [5]:
len(data[data['native-country'] == ' Germany']) / len(data)

0.004207487485028101
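
The same share can also be read off as the mean of a boolean mask, since `True` counts as 1 and `False` as 0. A small sketch on a hypothetical four-row frame (`toy` is illustrative data, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data with the same leading-space convention as the file
toy = pd.DataFrame(
    {"native-country": [" Germany", " United-States", " Cuba", " Germany"]}
)

# The mean of a boolean Series is the fraction of True values
share = (toy["native-country"] == " Germany").mean()
print(share)
```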

**4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (feature *salary*) and for those who earn less than 50K per year?**

In [6]:
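# String names select pandas' own mean and sample std (ddof=1);
# newer pandas versions warn when NumPy callables are passed to agg
data.groupby('salary')['age'].agg(['mean', 'std'])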

salary,mean,std
<=50K,36.783738,14.020088
>50K,44.249841,10.519028


**6. Is it true that people who earn more than 50K have at least higher education? (feature *education*: *Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters* or *Doctorate*)**

Judging by the output of the following expression, people who earn more than 50K do not always have higher education: their education levels also include, for example, *HS-grad*, *Some-college* and *9th*.

In [22]:
data[data['salary'] == ' >50K']['education'].unique()

array([' HS-grad', ' Masters', ' Bachelors', ' Some-college',
       ' Assoc-voc', ' Doctorate', ' Prof-school', ' Assoc-acdm',
       ' 7th-8th', ' 12th', ' 10th', ' 11th', ' 9th', ' 5th-6th',
       ' 1st-4th'], dtype=object)
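
Instead of inspecting the unique values by eye, the question can be answered with a direct check. The sketch below assumes the list of "higher education" labels given in the question and runs on a hypothetical four-row frame (`toy` and `higher` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "education": [" HS-grad", " Masters", " Bachelors", " 9th"],
    "salary": [" >50K", " >50K", " <=50K", " >50K"],
})

# Labels treated as higher education, taken from the question text
higher = [" Bachelors", " Prof-school", " Assoc-acdm",
          " Assoc-voc", " Masters", " Doctorate"]

rich = toy[toy["salary"] == " >50K"]
has_higher = rich["education"].isin(higher)

# True only if every high earner has one of the listed levels
all_have_higher = bool(has_higher.all())
# Fraction of high earners with one of the listed levels
share_with_higher = has_higher.mean()

print(all_have_higher, share_with_higher)
```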

**7. Display age statistics for each race (feature *race*) and each sex. Use *groupby* and *describe*. Use the result to find the maximum age of men of race *Amer-Indian-Eskimo*.**


The maximum age is 82.

In [14]:
data.groupby(['sex', 'race'])['age'].describe()

sex,race,count,mean,std,min,25%,50%,75%,max
Female,Amer-Indian-Eskimo,119.0,37.117647,13.114991,17.0,27.0,36.0,46.0,80.0
Female,Asian-Pac-Islander,346.0,35.089595,12.300845,17.0,25.0,33.0,43.75,75.0
Female,Black,1555.0,37.854019,12.637197,17.0,28.0,37.0,46.0,90.0
Female,Other,109.0,31.678899,11.631599,17.0,23.0,29.0,39.0,74.0
Female,White,8642.0,36.811618,14.329093,17.0,25.0,35.0,46.0,90.0
Male,Amer-Indian-Eskimo,192.0,37.208333,12.049563,17.0,28.0,35.0,45.0,82.0
Male,Asian-Pac-Islander,693.0,39.073593,12.883944,18.0,29.0,37.0,46.0,90.0
Male,Black,1569.0,37.6826,12.882612,17.0,27.0,36.0,46.0,90.0
Male,Other,162.0,34.654321,11.355531,17.0,26.0,32.0,42.0,77.0
Male,White,19174.0,39.652498,13.436029,17.0,29.0,38.0,49.0,90.0
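
The single value asked for can be pulled out of the `describe` result by indexing the `(sex, race)` MultiIndex with a tuple. A minimal sketch on hypothetical toy data (the labels keep the leading space used in the file):

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "sex": [" Male", " Male", " Female", " Male"],
    "race": [" Amer-Indian-Eskimo", " Amer-Indian-Eskimo", " White", " White"],
    "age": [30, 82, 45, 60],
})

stats = toy.groupby(["sex", "race"])["age"].describe()

# A tuple selects one row of the (sex, race) MultiIndex; "max" selects the column
max_age = stats.loc[(" Male", " Amer-Indian-Eskimo"), "max"]
print(max_age)
```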


**8. Among whom is the proportion of high earners (>50K) greater: married or single men (feature *marital-status*)? Consider as married those whose *marital-status* starts with *Married* (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse); consider the rest single.**

Judging by the output, the share of high earners is much larger among married men (5965 of 13541, about 44%) than among single men (697 of 8249, about 8%).

In [39]:
# Boolean flag: True when marital-status is one of the Married-* values
married = data['marital-status'].isin([' Married-civ-spouse', ' Married-spouse-absent', ' Married-AF-spouse'])
is_male = data['sex'] == ' Male'
# Use the flag directly; no 'Married' column has to exist in data
pd.crosstab(married[is_male].rename('Married'), data.loc[is_male, 'salary'])

Married,<=50K,>50K
False,7552,697
True,7576,5965
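
Because the question asks about proportions rather than raw counts, it may help to let `crosstab` normalize each row. A minimal sketch on hypothetical toy data; the `status` labels `"married"` and `"single"` are illustrative names introduced here:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "sex": [" Male"] * 6 + [" Female"] * 2,
    "marital-status": [
        " Married-civ-spouse", " Married-civ-spouse", " Never-married",
        " Divorced", " Married-AF-spouse", " Never-married",
        " Married-civ-spouse", " Never-married",
    ],
    "salary": [" >50K", " <=50K", " <=50K", " <=50K",
               " >50K", " >50K", " >50K", " <=50K"],
})

men = toy[toy["sex"] == " Male"]

# "Married" is defined as in the question: the status starts with "Married"
status = (
    men["marital-status"].str.strip().str.startswith("Married")
    .map({True: "married", False: "single"})
    .rename("status")
)

# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(status, men["salary"], normalize="index")
print(shares)
```

Each row of `shares` then gives the fraction of low and high earners within one group, so the two groups can be compared even when their sizes differ.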


**9. What is the maximum number of hours a person works per week (feature *hours-per-week*)? How many people work that many hours, and what percentage of them earn a lot?**

The maximum is 99 hours per week. 85 people work that many hours, and only about 29% of them earn a lot.

In [58]:
max_hours_per_week = data['hours-per-week'].max()

working_a_lot_people = data[data['hours-per-week'] == max_hours_per_week]
number_of_working_a_lot_people = len(working_a_lot_people)

number_of_working_a_lot_and_earning_a_lot = len(working_a_lot_people[working_a_lot_people['salary'] == ' >50K'])

print(number_of_working_a_lot_and_earning_a_lot / number_of_working_a_lot_people * 100)

29.411764705882355
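
The cell above prints only the percentage. A compact sketch that reports all three quantities asked for in the question, run on hypothetical toy data (the variable names are illustrative):

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "hours-per-week": [40, 99, 99, 60, 99, 99],
    "salary": [" <=50K", " >50K", " <=50K", " >50K", " <=50K", " <=50K"],
})

max_hours = toy["hours-per-week"].max()
at_max = toy[toy["hours-per-week"] == max_hours]

n_at_max = len(at_max)
# Mean of a boolean mask is the share of True values; scale to percent
pct_rich = (at_max["salary"] == " >50K").mean() * 100

print(max_hours, n_at_max, pct_rich)
```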


**10. Compute the average working time (*hours-per-week*) of low and high earners (*salary*) for each country (*native-country*).**

In [60]:
# The string name selects pandas' own mean; newer pandas versions warn on np.mean
data.groupby(['native-country', 'salary'])['hours-per-week'].agg(['mean'])

native-country,salary,mean
?,<=50K,40.164760
?,>50K,45.547945
Cambodia,<=50K,41.416667
Cambodia,>50K,40.000000
Canada,<=50K,37.914634
...,...,...
United-States,>50K,45.505369
Vietnam,<=50K,37.193548
Vietnam,>50K,39.200000
Yugoslavia,<=50K,41.600000
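
The long two-level result can be easier to compare country by country once *salary* is moved into the columns. A minimal sketch using `unstack` on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "native-country": [" Canada", " Canada", " Canada", " Cuba", " Cuba"],
    "salary": [" <=50K", " <=50K", " >50K", " <=50K", " >50K"],
    "hours-per-week": [30, 40, 50, 38, 45],
})

# One row per country, one column per salary group
wide = (
    toy.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack("salary")
)
print(wide)
```

A country with no rows in one salary group would show `NaN` in that column.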
