Data source-UCI Adult dataset, also known as the Census Income dataset

Link-UCI Machine Learning Repository: Adult Data Set

Question-Predict whether income exceeds $50K/yr based on census data



1. Shape of the dataset


In [164]:
import pandas as pd
data=pd.read_csv("adult.csv")
print(data)

       age workclass  fnlwgt     education  education.num      marital.status  \
0       90         ?   77053       HS-grad              9             Widowed   
1       82   Private  132870       HS-grad              9             Widowed   
2       66         ?  186061  Some-college             10             Widowed   
3       54   Private  140359       7th-8th              4            Divorced   
4       41   Private  264663  Some-college             10           Separated   
...    ...       ...     ...           ...            ...                 ...   
32556   22   Private  310152  Some-college             10       Never-married   
32557   27   Private  257302    Assoc-acdm             12  Married-civ-spouse   
32558   40   Private  154374       HS-grad              9  Married-civ-spouse   
32559   58   Private  151910       HS-grad              9             Widowed   
32560   22   Private  201490       HS-grad              9       Never-married   

              occupation   

In [165]:
shape=data.shape
print(shape)

(32561, 15)


2. Print column names

In [166]:
columns = list(data.columns)
print(columns)

['age', 'workclass', 'fnlwgt', 'education', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income']


3. How many men and women (sex feature) are represented in this dataset?


In [167]:
gender_counts = data["sex"].value_counts()
print("The number of men and women represented in the dataset are: ",gender_counts)


The number of men and women represented in the dataset are:  sex
Male      21790
Female    10771
Name: count, dtype: int64


4. What is the average age (age feature) of women?


In [168]:
average_age_women = data.loc[data["sex"] == "Female", "age"].mean()
print("The average age of women:",average_age_women)

The average age of women: 36.85823043357163


5. What is the percentage of German citizens (native-country feature)?


In [169]:
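german_citizens = data["native.country"].value_counts().get("Germany", 0)
german_percentage = (german_citizens / len(data)) * 100
print("Number of people with native.country Germany:", german_citizens)
print(f"Percentage of the dataset: {german_percentage:.2f}%")

Number of people with native.country Germany: 137
Percentage of the dataset: 0.42%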
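
A shorter route to the same percentage is value_counts(normalize=True), which returns fractions rather than counts. The sketch below runs on a small made-up frame (toy), not on adult.csv, so the printed number is illustrative only.

```python
import pandas as pd

# Toy stand-in for adult.csv; the rows are made up for illustration.
toy = pd.DataFrame(
    {"native.country": ["United-States", "Germany", "United-States", "Mexico"]}
)

# normalize=True yields fractions of the total; multiply by 100 for percent.
country_share = toy["native.country"].value_counts(normalize=True) * 100

# .get() falls back to 0.0 if the country never appears.
german_share = country_share.get("Germany", 0.0)
print(f"Share of rows with native.country == 'Germany': {german_share:.2f}%")
```

On the real data the same two lines would be applied to data["native.country"].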


6. What are the mean and standard deviation of age for those who earn more than 50K per year (income feature) and those who earn 50K or less per year?


In [170]:
stats_earning_more_50k = data.loc[data["income"] == ">50K", "age"].agg(["mean", "std"])
stats_earning_less_50k = data.loc[data["income"] == "<=50K", "age"].agg(["mean", "std"])

print("Those earning more than 50K: \n",stats_earning_more_50k)
print("\nThose earning 50K or less: \n",stats_earning_less_50k)


Those earning more than 50K: 
 mean    44.249841
std     10.519028
Name: age, dtype: float64

Those earning 50K or less: 
 mean    36.783738
std     14.020088
Name: age, dtype: float64
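
The two filtered computations can also be written as a single groupby on the income column, which returns one row per income class. This is a minimal sketch on a made-up frame (toy); the ages are invented and do not come from adult.csv.

```python
import pandas as pd

# Toy stand-in for adult.csv; values are made up for illustration.
toy = pd.DataFrame(
    {
        "age": [30, 40, 50, 60],
        "income": ["<=50K", "<=50K", ">50K", ">50K"],
    }
)

# One pass: mean and sample standard deviation of age per income class.
age_by_income = toy.groupby("income")["age"].agg(["mean", "std"])
print(age_by_income)
```

Grouping avoids repeating the filter expression and keeps the labels attached to the numbers, which makes mislabelled output less likely.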


7. Is it true that people who earn more than 50K have at least high school education? (education-Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)


In [171]:
high_earning_education = data.loc[data["income"] == ">50K", "education"]
all_have_high_school = high_earning_education.isin(
    ["Bachelors", "Prof-school", "Assoc-acdm", "Assoc-voc", "Masters", "Doctorate"]
).all()

print("All high earners have at least high school education: ", all_have_high_school)

All high earners have at least high school education:  False
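
The result False only says that at least one high earner has an education level outside the listed ones. To see which levels those are, one option is to invert the isin() mask and count what remains. The sketch uses a made-up frame (toy); the name listed_levels is introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for adult.csv; rows are made up for illustration.
toy = pd.DataFrame(
    {
        "education": ["Bachelors", "HS-grad", "Masters", "HS-grad", "9th"],
        "income": [">50K", ">50K", ">50K", "<=50K", ">50K"],
    }
)

# Same list of levels as in the check above.
listed_levels = [
    "Bachelors", "Prof-school", "Assoc-acdm", "Assoc-voc", "Masters", "Doctorate"
]

high_earners = toy.loc[toy["income"] == ">50K", "education"]

# ~ inverts the boolean mask: keep high earners whose level is NOT listed.
outside = high_earners[~high_earners.isin(listed_levels)]
outside_counts = outside.value_counts()
print(outside_counts)
```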


8. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.


In [172]:
age_stats = data.groupby(["race", "sex"])["age"].describe()
max_age_amer_indian_eskimo_men = data.loc[
    (data["race"] == "Amer-Indian-Eskimo") & (data["sex"] == "Male"), "age"
].max()

print("Age stats by race and gender: \n", age_stats)
print("\n Max age of Amer-Indian-Eskimo men: ", max_age_amer_indian_eskimo_men)


Age stats by race and gender: 
                              count       mean        std   min   25%   50%  \
race               sex                                                       
Amer-Indian-Eskimo Female    119.0  37.117647  13.114991  17.0  27.0  36.0   
                   Male      192.0  37.208333  12.049563  17.0  28.0  35.0   
Asian-Pac-Islander Female    346.0  35.089595  12.300845  17.0  25.0  33.0   
                   Male      693.0  39.073593  12.883944  18.0  29.0  37.0   
Black              Female   1555.0  37.854019  12.637197  17.0  28.0  37.0   
                   Male     1569.0  37.682600  12.882612  17.0  27.0  36.0   
Other              Female    109.0  31.678899  11.631599  17.0  23.0  29.0   
                   Male      162.0  34.654321  11.355531  17.0  26.0  32.0   
White              Female   8642.0  36.811618  14.329093  17.0  25.0  35.0   
                   Male    19174.0  39.652498  13.436029  17.0  29.0  38.0   

                             75
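
Because describe() already contains a max column, the maximum age for one group can also be read straight from the grouped table with a (race, sex) tuple. A minimal sketch on a made-up frame (toy); the ages are invented and are not the values from adult.csv.

```python
import pandas as pd

# Toy stand-in for adult.csv; rows are made up for illustration.
toy = pd.DataFrame(
    {
        "race": ["Amer-Indian-Eskimo", "Amer-Indian-Eskimo", "White", "White"],
        "sex": ["Male", "Male", "Female", "Male"],
        "age": [25, 61, 40, 33],
    }
)

age_stats = toy.groupby(["race", "sex"])["age"].describe()

# Rows are indexed by (race, sex); columns are the describe() statistics.
max_age = age_stats.loc[("Amer-Indian-Eskimo", "Male"), "max"]
print("Max age of Amer-Indian-Eskimo men (toy data):", max_age)
```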

9. Among whom is the proportion of those who earn a lot (>50K) greater, married or single men (marital-status feature)? Consider as married those who have a marital-status value starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse); the rest are considered bachelors.


In [173]:
# Group people as 'Married' or 'Single' based on marital status
data["marital_status_group"] = data["marital.status"].apply(
    lambda x: "Married" if x.startswith("Married") else "Single"
)

# Find all married and single men
married_men = data[(data["sex"] == "Male") & (data["marital_status_group"] == "Married")]
single_men = data[(data["sex"] == "Male") & (data["marital_status_group"] == "Single")]

# Calculate proportion of high earners (>50K) for married men
if married_men.shape[0] > 0:  # Check if there are any married men
    married_high_earners = married_men[married_men["income"] == ">50K"].shape[0] / married_men.shape[0]
else:
    married_high_earners = 0

# Calculate proportion of high earners (>50K) for single men
if single_men.shape[0] > 0:  # Check if there are any single men
    single_high_earners = single_men[single_men["income"] == ">50K"].shape[0] / single_men.shape[0]
else:
    single_high_earners = 0

# Print the results
print("Proportion of high earners (married men):", married_high_earners)
print("Proportion of high earners (single men):", single_high_earners)

Proportion of high earners (married men): 0.4405139945351156
Proportion of high earners (single men): 0.08449509031397745
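
The same proportions can be obtained more compactly by taking the mean of a boolean column within each group, since True counts as 1 and False as 0. This is a sketch on a made-up frame (toy), with the hypothetical names is_married, group and high_share introduced for illustration.

```python
import pandas as pd

# Toy stand-in for adult.csv; rows are made up for illustration.
toy = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Male", "Male", "Female"],
        "marital.status": [
            "Married-civ-spouse", "Married-AF-spouse",
            "Never-married", "Divorced", "Married-civ-spouse",
        ],
        "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
    }
)

men = toy[toy["sex"] == "Male"]

# Vectorized prefix test instead of apply() with a lambda.
is_married = men["marital.status"].str.startswith("Married")
group = is_married.map({True: "Married", False: "Single"})

# Mean of a boolean Series = fraction of True values in each group.
high_share = (men["income"] == ">50K").groupby(group).mean()
print(high_share)
```

An empty group simply does not appear in the result, so no explicit zero-division guard is needed in this form.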


10. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?

In [174]:
max_hours = data["hours.per.week"].max()
people_working_max_hours = data[data["hours.per.week"] == max_hours]
num_people_max_hours = len(people_working_max_hours)
high_earners_max_hours = (len(people_working_max_hours[people_working_max_hours["income"] == ">50K"]) / num_people_max_hours) * 100

print("Max hours per week: ", max_hours)
print("Number of people working max hours: ", num_people_max_hours)
print("Percentage of high earners among them: ",high_earners_max_hours)


Max hours per week:  99
Number of people working max hours:  85
Percentage of high earners among them:  29.411764705882355
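
As a consistency check on the output above, 29.41% of 85 people corresponds to 25 high earners (25 / 85 ≈ 0.2941). The percentage itself can also be computed without nested len() calls by averaging a boolean mask. A minimal sketch on a made-up frame (toy), not on adult.csv:

```python
import pandas as pd

# Toy stand-in for adult.csv; rows are made up for illustration.
toy = pd.DataFrame(
    {
        "hours.per.week": [40, 99, 99, 60, 99, 99],
        "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"],
    }
)

max_hours = toy["hours.per.week"].max()
at_max = toy[toy["hours.per.week"] == max_hours]

# Fraction of True values in the mask, scaled to percent.
pct_high = (at_max["income"] == ">50K").mean() * 100

print("Max hours per week:", max_hours)
print("Number of people working max hours:", len(at_max))
print(f"Percentage of high earners among them: {pct_high:.2f}%")
```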
