
# Adult dataset

Unique values of all features (for more information, please see the links above):
- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   
- `salary`: >50K, <=50K.

### Step 1. Import the necessary libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# render plots inline in the notebook output
%matplotlib inline

### Step 2. Import the dataset from this [address](https://github.com/thieu1995/csv-files/blob/main/data/pandas/adult.data).

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "salary"
]
# skipinitialspace=True strips the space after each comma, so missing values arrive as "?"
df = pd.read_csv(url, names=columns, na_values="?", skipinitialspace=True)
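
The interaction between `na_values` and `skipinitialspace` can be checked offline on a few rows. The sketch below is only an illustration: the three rows and the shortened column list are made up, and they assume the `adult.data` layout of comma-plus-space separators with `?` marking a missing field.

```python
import io

import pandas as pd

# Three hypothetical rows in the adult.data layout (", " separators, "?" = missing).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors\n"
    "54, ?, 180211, Some-college\n"
    "32, Private, 186824, HS-grad\n"
)

sample_df = pd.read_csv(
    sample,
    names=["age", "workclass", "fnlwgt", "education"],
    na_values="?",          # compared against the field after the leading space is skipped
    skipinitialspace=True,  # drops the space that follows each comma
)

missing_workclass = int(sample_df["workclass"].isna().sum())
print(sample_df)
print(f"Missing workclass values: {missing_workclass}")
```

Running `df.isna().sum()` on the full dataset after loading is a quick way to confirm that the missing-value marker was actually recognized.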

**2. What is the average age (*age* feature) of women?**

In [3]:
avg_age_women = df[df["sex"] == "Female"]["age"].mean()
print(f"Average age of women: {avg_age_women}")

Average age of women: 36.85823043357163


**3. What is the percentage of German citizens (*native-country* feature)?**

In [4]:
german_count = df[df["native-country"] == "Germany"].shape[0]
total_count = df.shape[0]
german_percentage = (german_count / total_count) * 100
print(f"Percentage of German citizens: {german_percentage:.2f}%")

Percentage of German citizens: 0.42%
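
A share like this can also be computed as the mean of a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below uses a small made-up frame rather than the real data, so the number it prints is not the dataset's answer.

```python
import pandas as pd

# Hypothetical toy data: 2 of 5 rows have Germany as native country.
toy = pd.DataFrame(
    {"native-country": ["Germany", "United-States", "Japan", "Germany", "United-States"]}
)

is_german = toy["native-country"] == "Germany"  # boolean Series
german_share = is_german.mean() * 100           # fraction of True values, as a percentage
print(f"Percentage of German citizens: {german_share:.2f}%")
```

Rows with a missing `native-country` compare as `False`, so they stay in the denominator, which matches dividing by `df.shape[0]`.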


**4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (*salary* feature) and for those who earn 50K or less per year?**

In [5]:
salary_groups = df.groupby("salary")["age"].agg(["mean", "std"])
print(salary_groups)

             mean        std
salary                      
<=50K   36.783738  14.020088
>50K    44.249841  10.519028


**6. Is it true that people who earn more than 50K have at least high school education? (*education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters* or *Doctorate* feature)**

In [6]:
high_edu = ["Bachelors", "Prof-school", "Assoc-acdm", "Assoc-voc", "Masters", "Doctorate"]
high_edu_check = df[df["salary"] == ">50K"]["education"].isin(high_edu).all()
print(f"Do all >50K earners have at least high school education? {high_edu_check}")

Do all >50K earners have at least high school education? False
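
A bare `False` does not show which education levels break the rule. One way to see them, sketched here on made-up rows rather than the real data, is to count the education values of high earners that fall outside the list.

```python
import pandas as pd

# Hypothetical toy data: one >50K earner has only an HS-grad education.
toy = pd.DataFrame(
    {
        "education": ["Bachelors", "HS-grad", "Masters", "HS-grad", "11th", "Doctorate"],
        "salary": [">50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"],
    }
)
high_edu = ["Bachelors", "Prof-school", "Assoc-acdm", "Assoc-voc", "Masters", "Doctorate"]

rich = toy[toy["salary"] == ">50K"]
in_list = rich["education"].isin(high_edu)

all_high = bool(in_list.all())                               # the yes/no answer
outside = rich.loc[~in_list, "education"].value_counts()     # the counterexamples

print(f"All >50K earners in the list? {all_high}")
print(outside)
```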


**7. Display age statistics for each race (*race* feature) and each gender (*sex* feature). Use *groupby()* and *describe()*. Find the maximum age of men of *Amer-Indian-Eskimo* race.**

In [7]:
age_stats = df.groupby(["race", "sex"])["age"].describe()
max_age_amerind_male = df[(df["race"] == "Amer-Indian-Eskimo") & (df["sex"] == "Male")]["age"].max()
print(age_stats)
print(f"Maximum age of Amer-Indian-Eskimo men: {max_age_amerind_male}")

                             count       mean        std   min   25%   50%  \
race               sex                                                       
Amer-Indian-Eskimo Female    119.0  37.117647  13.114991  17.0  27.0  36.0   
                   Male      192.0  37.208333  12.049563  17.0  28.0  35.0   
Asian-Pac-Islander Female    346.0  35.089595  12.300845  17.0  25.0  33.0   
                   Male      693.0  39.073593  12.883944  18.0  29.0  37.0   
Black              Female   1555.0  37.854019  12.637197  17.0  28.0  37.0   
                   Male     1569.0  37.682600  12.882612  17.0  27.0  36.0   
Other              Female    109.0  31.678899  11.631599  17.0  23.0  29.0   
                   Male      162.0  34.654321  11.355531  17.0  26.0  32.0   
White              Female   8642.0  36.811618  14.329093  17.0  25.0  35.0   
                   Male    19174.0  39.652498  13.436029  17.0  29.0  38.0   

                             75%   max  
race               sex
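
The maximum can also be read straight from the `describe()` table by indexing its `(race, sex)` MultiIndex, and widening the display keeps the printed table from wrapping. The sketch below uses made-up ages, so its numbers are illustrative only; the width of 200 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data with two races and both sexes.
toy = pd.DataFrame(
    {
        "race": ["Amer-Indian-Eskimo", "Amer-Indian-Eskimo", "White", "White", "Amer-Indian-Eskimo"],
        "sex": ["Male", "Male", "Female", "Male", "Female"],
        "age": [25, 61, 40, 33, 52],
    }
)

stats = toy.groupby(["race", "sex"])["age"].describe()

# Select one row by its (race, sex) key and one column of the summary.
max_age = stats.loc[("Amer-Indian-Eskimo", "Male"), "max"]

# Temporarily widen the display so all summary columns fit on one line.
with pd.option_context("display.width", 200):
    print(stats)
print(f"Maximum age of Amer-Indian-Eskimo men: {max_age}")
```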

**8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (*marital-status* feature)? Consider as married those who have a *marital-status* starting with *Married* (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.**

In [8]:
df["marital-status"] = df["marital-status"].apply(lambda x: "Married" if x.startswith("Married") else "Single")
married_vs_single = df[df["sex"] == "Male"].groupby("marital-status")["salary"].value_counts(normalize=True).unstack()
print(married_vs_single)

salary             <=50K      >50K
marital-status                    
Married         0.559486  0.440514
Single          0.915505  0.084495
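
The cell above overwrites `marital-status`, so the original categories are no longer available afterwards. A possible alternative, sketched on made-up rows, builds the Married/Single label as a separate Series and leaves the column untouched.

```python
import pandas as pd

# Hypothetical toy data: four men and one woman.
toy = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Male", "Male", "Female"],
        "marital-status": [
            "Married-civ-spouse", "Never-married", "Married-AF-spouse",
            "Divorced", "Married-civ-spouse",
        ],
        "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
    }
)

men = toy[toy["sex"] == "Male"]

# Derive the label without modifying the original column.
status = men["marital-status"].str.startswith("Married").map({True: "Married", False: "Single"})

# Mean of a boolean Series per group = share of >50K earners in that group.
share_rich = (men["salary"] == ">50K").groupby(status).mean()
print(share_rich)
```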


**9. What is the maximum number of hours a person works per week (*hours-per-week* feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?**

In [9]:
max_hours = df["hours-per-week"].max()
num_max_hours = df[df["hours-per-week"] == max_hours].shape[0]
high_earners_max_hours = df[(df["hours-per-week"] == max_hours) & (df["salary"] == ">50K")].shape[0]
percentage_high_earners = (high_earners_max_hours / num_max_hours) * 100
print(f"Max work hours per week: {max_hours}, Number of people: {num_max_hours}, Percentage earning >50K: {percentage_high_earners:.2f}%")


Max work hours per week: 99, Number of people: 85, Percentage earning >50K: 29.41%


**10. Count the average time of work (*hours-per-week*) for those who earn a little and a lot (*salary*) for each country (*native-country*). What will these be for Japan?**

In [10]:
avg_hours_by_country = df.groupby(["native-country", "salary"])["hours-per-week"].mean().unstack()
print(avg_hours_by_country.loc["Japan"])

salary
<=50K    41.000000
>50K     47.958333
Name: Japan, dtype: float64
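
The same country-by-salary table can be built with `pivot_table`, which some readers find easier to read than `groupby(...).mean().unstack()`. The sketch below runs on made-up rows, so its values are not the dataset's.

```python
import pandas as pd

# Hypothetical toy data for two countries.
toy = pd.DataFrame(
    {
        "native-country": ["Japan", "Japan", "Japan", "India", "India"],
        "salary": ["<=50K", ">50K", ">50K", "<=50K", ">50K"],
        "hours-per-week": [40, 50, 46, 38, 45],
    }
)

# Rows: countries, columns: salary groups, cells: mean weekly hours.
table = toy.pivot_table(
    values="hours-per-week",
    index="native-country",
    columns="salary",
    aggfunc="mean",
)

japan = table.loc["Japan"]
print(japan)
```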
