<center>
<img src="../../img/ods_stickers.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course

Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). Translated and edited by [Sergey Isaev](https://www.linkedin.com/in/isvforall/), [Artem Trunov](https://www.linkedin.com/in/datamove/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

Unique values of all features (for more information, please see the links above):
- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   
- `salary`: >50K, <=50K.

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
# to draw pictures in jupyter notebook
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('../input/uci-adult/adult.data.csv')
data.head(3)

**1. How many men and women (*sex* feature) are represented in this dataset?** 

In [None]:
# Alternative for a single category: men = (data["sex"] == "Male").sum()
# Which way is better?
num_each_sex = data.groupby("sex").sex.count()
num_each_sex
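
On the question raised in the comment above: all of these approaches give the same counts, and `value_counts()` is a common shorthand for the `groupby` version. The sketch below compares them on a small made-up frame (the `toy` data is hypothetical, not the Adult dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male", "Female"]})

# Option 1: boolean mask, counts one category at a time
men = (toy["sex"] == "Male").sum()

# Option 2: groupby + count, counts every category (sorted by label)
by_group = toy.groupby("sex")["sex"].count()

# Option 3: value_counts, same counts sorted by frequency
by_value = toy["sex"].value_counts()

print(men)
print(by_value)
```

The mask is handy when only one category matters; `value_counts()` is shorter when all categories are needed.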

**2. What is the average age (*age* feature) of women?**

In [None]:
# average = data.groupby("sex").age.mean()
average_women = round(data[data.sex == "Female"].age.mean() , 2)
average_women

**3. What is the percentage of German citizens (*native-country* feature)?**

In [None]:
100 * (((data["native-country"] == "Germany").sum())/ data["native-country"].count())
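
The same percentage can be written more compactly. The sketch below uses a made-up `toy` frame (hypothetical data) and assumes missing values are encoded as `NaN`; it also shows that the choice of denominator matters when some rows have no country.

```python
import pandas as pd

# Hypothetical toy data; None stands for a missing country
toy = pd.DataFrame({"native-country": ["Germany", "Japan", "Germany", "India", None]})

# Mean of a boolean mask: share among ALL rows (missing rows stay in the denominator)
share_all_rows = 100 * (toy["native-country"] == "Germany").mean()

# value_counts(normalize=True) drops missing values by default,
# so the denominator matches the count()-based formula
share_non_missing = 100 * toy["native-country"].value_counts(normalize=True)["Germany"]

print(share_all_rows, share_non_missing)
```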

**4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (*salary* feature) and those who earn less than 50K per year?**

In [None]:
mean_std_age = data.groupby("salary").age.agg(["mean","std"])
mean_std_age

**6. Is it true that people who earn more than 50K have at least high school education? (*education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters* or *Doctorate* feature)**

In [None]:
# Education levels of people who earn more than 50K
cou = data[data["salary"] == ">50K"].education.value_counts()

# Levels listed in the question (more than high school)
distribution_bachelor = cou.loc[["Bachelors", "Masters", "Prof-school","Assoc-voc", "Doctorate", "Assoc-acdm"]].sum()
# Remaining levels (high school or below, plus Some-college)
distribution_HS = cou.loc[["HS-grad", "Some-college", "10th", "11th", "7th-8th", "12th", "9th", "5th-6th", "1st-4th"]].sum()

salary_education = pd.Series([distribution_bachelor, distribution_HS], index=["More than HS", "At most HS"])

salary_education.plot.pie(figsize=(5, 5))

**The majority of people who earn more than 50K have more than a high school education.**
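
Read literally, the question asks whether *every* high earner has one of the listed education levels. A minimal sketch of that yes/no check with `isin()` follows. The `toy` frame is made up for illustration; the list of levels is taken from the question text.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "salary": [">50K", ">50K", "<=50K", ">50K"],
    "education": ["Bachelors", "9th", "HS-grad", "Masters"],
})

# Education levels named in the question
listed_levels = ["Bachelors", "Prof-school", "Assoc-acdm",
                 "Assoc-voc", "Masters", "Doctorate"]

high_earners = toy[toy["salary"] == ">50K"]
has_listed_level = high_earners["education"].isin(listed_levels)

# True only if no high earner falls outside the listed levels
all_have_listed_level = bool(has_listed_level.all())
# Number of high earners outside the listed levels
n_exceptions = int((~has_listed_level).sum())

print(all_have_listed_level, n_exceptions)
```

A single exception is enough to make the statement false, which a pie chart of shares does not show directly.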

**7. Display age statistics for each race (*race* feature) and each gender (*sex* feature). Use *groupby()* and *describe()*. Find the maximum age of men of *Amer-Indian-Eskimo* race.**

**The maximum age of Amer-Indian-Eskimo men is 82.**

In [None]:
data.groupby(["race", "sex"]).age.describe()
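
Instead of reading the value off the printed table, a single statistic can be pulled from the `describe()` result with `.loc`. The sketch below uses a made-up `toy` frame (hypothetical ages, not the real data).

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "race": ["White", "White", "Amer-Indian-Eskimo",
             "Amer-Indian-Eskimo", "Amer-Indian-Eskimo"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
    "age": [30, 40, 25, 61, 50],
})

# One row per (race, sex) pair, one column per statistic
stats = toy.groupby(["race", "sex"])["age"].describe()

# Row selected by the (race, sex) tuple, column by the statistic name
max_age = stats.loc[("Amer-Indian-Eskimo", "Male"), "max"]

print(max_age)
```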

**8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (*marital-status* feature)? Consider as married those who have a *marital-status* starting with *Married* (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.**

In [None]:
# Restrict to men who earn more than 50K, as the question asks about men
men_more_50k = data[(data["sex"] == "Male") & (data["salary"] == ">50K")]
marital_status_more_50k = men_more_50k["marital-status"].value_counts()

# Married = any marital-status starting with "Married"
is_married = marital_status_more_50k.index.str.startswith("Married")
married_more_50k = marital_status_more_50k[is_married].sum()

marital_status_50k_distribution = pd.Series([married_more_50k, (marital_status_more_50k.sum()-married_more_50k)], index=["Married", "Not married"])

marital_status_50k_distribution.plot.pie(figsize=(5,5))
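
The pie chart shows how high earners split by marital status. The question can also be read as asking for the share of high earners *within* each group of men. A minimal sketch of that reading follows, on a made-up `toy` frame. The data and the `Married`/`Single` labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "sex": ["Male"] * 6 + ["Female"] * 2,
    "marital-status": ["Married-civ-spouse", "Married-AF-spouse", "Never-married",
                       "Divorced", "Married-civ-spouse", "Never-married",
                       "Married-civ-spouse", "Divorced"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"],
})

men = toy[toy["sex"] == "Male"]

# Married = any status starting with "Married"; everything else counts as single
group = men["marital-status"].str.startswith("Married").map(
    {True: "Married", False: "Single"}
)

# Mean of a boolean mask per group = share of high earners in that group
share_high_earners = (men["salary"] == ">50K").groupby(group).mean()

print(share_high_earners)
```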

**9. What is the maximum number of hours a person works per week (*hours-per-week* feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?**

In [None]:
max_hours = data["hours-per-week"].max()
many_people_max = (data["hours-per-week"] == max_hours).sum()
percent_50k = round(100 * (((data["hours-per-week"] == max_hours) & (data["salary"] == ">50K")).sum())/many_people_max , 2)
print("Max number of hours: ", max_hours , "\nPeople who work the max number of hours: ", many_people_max, 
      "\nPercent of people who work the max number of hours and earn a lot: ", percent_50k)

**10. Count the average time of work (*hours-per-week*) for those who earn a little and a lot (*salary*) for each country (*native-country*). What will these be for Japan?**

The answer for Japan is shown below.

In [None]:
mean_all_countries = data.groupby(["native-country" , "salary"])["hours-per-week"].mean()
round(mean_all_countries,2)

In [None]:
round(mean_all_countries["Japan"],2)
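
For a side-by-side view, the grouped means can be pivoted with `unstack()` so that each salary bracket becomes a column. The sketch below uses a made-up `toy` frame (hypothetical hours, not the real data).

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Japan", "India"],
    "salary": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
    "hours-per-week": [40, 42, 50, 46, 38],
})

mean_hours = toy.groupby(["native-country", "salary"])["hours-per-week"].mean()

# Countries as rows, salary brackets as columns; missing combinations become NaN
table = mean_hours.unstack("salary")

japan = table.loc["Japan"]
print(japan)
```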