# Author: Yury Kashnitsky. Translated and edited by Sergey Isaev, Artem Trunov, Anastasia Manokhina, and Yuanyuan Pao. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license. https://mlcourse.ai/

In [10]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# show up to 100 columns when displaying a DataFrame
pd.set_option("display.max_columns", 100)

# draw plots inline in the Jupyter notebook
%matplotlib inline

# silence warnings; comment out the next line if you want to see them
warnings.filterwarnings("ignore")

In [11]:
DATA_URL = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"
df = pd.read_csv(DATA_URL + "adult.data.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# 1. How many men and women (sex feature) are represented in this dataset?

In [12]:
sum_data = df['sex'].value_counts().sum()
sum_data

32561

In [13]:
repartition = df['sex'].value_counts(normalize=True)
repartition

sex
Male      0.669205
Female    0.330795
Name: proportion, dtype: float64

In [14]:
print(f"The dataset contains {sum_data} people: {repartition['Male']:.1%} men and {repartition['Female']:.1%} women.")

The dataset contains 32561 people: 66.9% men and 33.1% women.
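
The question asks how many men and women there are, so absolute counts are a useful complement to the proportions. A minimal sketch, assuming a small hypothetical frame `toy` in place of `df` so that it runs without downloading the data:

```python
import pandas as pd

# Hypothetical toy data standing in for the Adult dataset; only 'sex' matters here.
toy = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male", "Female"]})

counts = toy["sex"].value_counts()                # absolute number of rows per category
shares = toy["sex"].value_counts(normalize=True)  # the same categories as fractions

# Put counts and shares side by side (aligned on the category labels).
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

On the real data, the same two calls on `df["sex"]` would give the counts behind the proportions shown above.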


# 2. What is the average age (age feature) of women?

In [15]:
female_df = df[df['sex'] == 'Female']

In [16]:
female_df['age'].mean()

36.85823043357163

# 3. What is the percentage of German citizens (native-country feature)?

In [17]:
df['native-country'].unique()

array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)

In [18]:
# share of rows whose native-country is Germany, expressed in percent
proportion_germany = df['native-country'].value_counts(normalize=True)['Germany'] * 100
proportion_germany

0.42074874850281013

# 4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn 50K or less per year?

In [34]:
df['salary'].unique()

array(['<=50K', '>50K'], dtype=object)

In [29]:
# Data for people who earn more than 50K
more = df[df['salary'] == '>50K']
more_std = more['age'].std()
more_mean = more['age'].mean()
print(f"The average age of people earning more than 50K is {more_mean:.2f} years with a standard deviation of {more_std:.2f}.")

The average age of people earning more than 50K is 44.25 years with a standard deviation of 10.52.


In [35]:
# Data for people who earn 50K or less
less = df[df['salary'] == '<=50K']
less_std = less['age'].std()
less_mean = less['age'].mean()
print(f"The average age of people earning 50K or less is {less_mean:.2f} years with a standard deviation of {less_std:.2f}.")

The average age of people earning 50K or less is 36.78 years with a standard deviation of 14.02.
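
Both salary groups can also be summarized in a single step with `groupby` and `agg`, which avoids repeating the filter for each group. A minimal sketch on a hypothetical toy frame (the ages below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Adult dataset.
toy = pd.DataFrame({
    "age": [25, 35, 45, 40, 50, 60],
    "salary": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# One row per salary group, one column per statistic.
# Series.std() and the "std" aggregation both use the sample formula (ddof=1).
age_stats = toy.groupby("salary")["age"].agg(["mean", "std"])
print(age_stats.round(2))
```

Applied to `df`, the same expression should reproduce the four numbers printed in the two cells above.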


# 6. Is it true that people who earn more than 50K have at least high school education?

In [36]:
df['education'].unique()

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)

In [39]:
more['education'].value_counts()

education
Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: count, dtype: int64
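
The counts above include levels such as 10th, 11th, and 7th-8th among people earning more than 50K, so the answer is no. The check can also be made explicit. The sketch below assumes that the set `AT_LEAST_HS` lists the levels at or above a high school diploma; that grouping is an assumption of this example (in particular, 12th is treated as below HS-grad) and is not defined in the dataset itself. The toy frame is hypothetical.

```python
import pandas as pd

# Assumption: these education levels count as "at least high school".
AT_LEAST_HS = {
    "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm",
    "Bachelors", "Masters", "Prof-school", "Doctorate",
}

# Hypothetical toy data with the same column names as the Adult dataset.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "10th", "Masters", "7th-8th"],
    "salary": [">50K", ">50K", ">50K", "<=50K", "<=50K"],
})

high_earners = toy[toy["salary"] == ">50K"]
has_hs = high_earners["education"].isin(AT_LEAST_HS)

# True only if every high earner has at least a high school education.
all_have_hs = bool(has_hs.all())
# Education levels of the high earners who do not meet the threshold.
below_hs_counts = high_earners.loc[~has_hs, "education"].value_counts()

print(all_have_hs)
print(below_hs_counts)
```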

# 7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.
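
One possible approach, sketched on a hypothetical toy frame (the ages are invented, so the printed maximum is not the answer for the real data): group by both columns, call `describe()` on `age`, and read the `max` column for the group of interest.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Adult dataset.
toy = pd.DataFrame({
    "race": ["Amer-Indian-Eskimo", "Amer-Indian-Eskimo", "Amer-Indian-Eskimo",
             "White", "White"],
    "sex": ["Male", "Male", "Female", "Male", "Female"],
    "age": [30, 52, 41, 28, 63],
})

# describe() returns count, mean, std, min, quartiles and max per (race, sex) group.
age_by_group = toy.groupby(["race", "sex"])["age"].describe()
print(age_by_group)

# The result has a (race, sex) MultiIndex, so one group is selected with a tuple.
max_age = age_by_group.loc[("Amer-Indian-Eskimo", "Male"), "max"]
print(max_age)
```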

# 8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse); the rest are considered single.
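
A possible sketch, again on hypothetical toy data: restrict to men, derive a married/single label from the `Married` prefix, and average a boolean "earns more than 50K" flag per label. The helper column names `status` and `high_income` are introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Adult dataset.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Male", "Male", "Female"],
    "marital-status": [
        "Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent",
        "Never-married", "Divorced", "Widowed", "Married-civ-spouse",
    ],
    "salary": [">50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

men = toy[toy["sex"] == "Male"].copy()

# Any status that starts with "Married" counts as married; everything else as single.
is_married = men["marital-status"].str.startswith("Married")
men["status"] = is_married.map({True: "married", False: "single"})

# The mean of a boolean column is the fraction of True values.
men["high_income"] = men["salary"] == ">50K"
share_high = men.groupby("status")["high_income"].mean()
print(share_high)
```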

# 9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?
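
A minimal sketch of the three steps (find the maximum, select the people who reach it, compute the share of high earners among them), using invented toy values in place of `df`:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Adult dataset.
toy = pd.DataFrame({
    "hours-per-week": [40, 99, 60, 99, 99, 20],
    "salary": ["<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

max_hours = toy["hours-per-week"].max()

# Everyone who works exactly the maximum number of hours.
hardest_workers = toy[toy["hours-per-week"] == max_hours]
n_people = len(hardest_workers)

# Percentage of those people who earn more than 50K.
pct_high = (hardest_workers["salary"] == ">50K").mean() * 100

print(max_hours, n_people, round(pct_high, 2))
```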

# 10. Compute the average working time (hours-per-week) of those who earn a little and those who earn a lot (salary) for each country (native-country). What are these values for Japan?
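
One way to organize this, sketched on hypothetical toy data: group by country and salary, take the mean of `hours-per-week`, and unstack the salary level so that each country becomes one row with one column per salary group.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Adult dataset.
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Cuba", "Cuba"],
    "salary": ["<=50K", ">50K", ">50K", "<=50K", ">50K"],
    "hours-per-week": [40, 45, 50, 35, 42],
})

# Rows: countries; columns: salary groups; values: mean weekly hours.
mean_hours = (
    toy.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack("salary")
)
print(mean_hours)

# Select the row for one country.
japan = mean_hours.loc["Japan"]
print(japan)
```

`pd.crosstab` with `values=` and `aggfunc="mean"` would be an alternative that produces the same table shape.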