In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database.
You must use Pandas to answer the following questions:

How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)
What is the average age of men?
What is the percentage of people who have a Bachelor's degree?
What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
What percentage of people without advanced education make more than 50K?
What is the minimum number of hours a person works per week?
What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
What country has the highest percentage of people that earn >50K and what is that percentage?
Identify the most popular occupation for those who earn >50K in India.

Begin by setting up the environment:

In [2]:
import numpy as np
import pandas as pd

In [3]:
data = pd.read_csv('adult.data.csv')

An overview of the data:

In [23]:
data.shape

(32561, 15)

In [22]:
data.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


Make sure the data is clean and ready for analysis
First, check for missing values, then for duplicate rows. The null check reports zero missing values in every column.

In [33]:
pd.isnull(data).sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64
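
A zero null count only rules out true NaN values; a placeholder string such as `?` would not be flagged by `isnull()`. The following is a minimal sketch on a made-up frame (the rows and the choice of `?` as placeholder are assumptions for illustration, not taken from the dataset) of one way to count such cells:

```python
import pandas as pd

# Hypothetical toy rows; '?' stands in for an unknown value.
toy = pd.DataFrame({
    'workclass': ['Private', '?', 'State-gov'],
    'native-country': ['Cuba', 'India', '?'],
})

# isnull() treats '?' as an ordinary string, so it reports nothing missing.
null_counts = toy.isnull().sum()

# Comparing every cell against the placeholder counts the '?' cells per column.
placeholder_counts = (toy == '?').sum()

print(null_counts)
print(placeholder_counts)
```

Whether such placeholders should be dropped, kept, or mapped to NaN depends on the question being asked.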

In [47]:
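# Only the last expression in a cell is displayed, so this count is computed but not shown.
data.duplicated().sum()
# drop_duplicates returns a new DataFrame; data is unchanged unless the result is assigned back.
data.drop_duplicates(keep='first')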

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
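
The difference between counting duplicates and removing them can be seen on a small frame. This is a sketch with hypothetical rows (the names `toy`, `n_dupes` and `deduped` are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame in which the third row repeats the first.
toy = pd.DataFrame({'age': [39, 50, 39], 'sex': ['Male', 'Male', 'Male']})

# duplicated() marks rows identical to an earlier row; sum() counts them.
n_dupes = toy.duplicated().sum()

# drop_duplicates() returns a new frame and leaves toy untouched.
deduped = toy.drop_duplicates(keep='first')

print(n_dupes, len(toy), len(deduped))
```

Assigning the result (as `deduped` does here) is what makes the removal stick; calling the method without assignment only displays the deduplicated frame.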


Let's look at the data:

In [10]:
data.info()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)

In [71]:
race=data['race']
race.value_counts()

race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64

What is the average age of men? Answer: about 39.4 years.

In [97]:
data.groupby('sex')['age'].mean()

sex
Female    36.858230
Male      39.433547
Name: age, dtype: float64
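
The grouped mean returns one value per sex. When only the figure for men is needed, a boolean mask gives it directly. A minimal sketch on hypothetical rows (the ages below are made up):

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'], 'age': [40, 30, 50]})

# Select the age column for rows where sex is 'Male', then average it.
men_mean = toy.loc[toy['sex'] == 'Male', 'age'].mean()

# The grouped mean contains the same number under the 'Male' label.
by_sex = toy.groupby('sex')['age'].mean()

print(men_mean, by_sex['Male'])
```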

What is the percentage of people who have a Bachelor's degree? Answer: 16.45%

In [99]:
edu=data['education']
edu.value_counts()/sum(edu.value_counts())*100

education
HS-grad         32.250238
Some-college    22.391818
Bachelors       16.446055
Masters          5.291607
Assoc-voc        4.244341
11th             3.608612
Assoc-acdm       3.276926
10th             2.865391
7th-8th          1.983969
Prof-school      1.768987
9th              1.578576
12th             1.329812
Doctorate        1.268389
5th-6th          1.022696
1st-4th          0.515955
Preschool        0.156629
Name: count, dtype: float64
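
Dividing the counts by their sum can also be written with the `normalize` argument of `value_counts`, or with the mean of a boolean mask when only one category is of interest. A sketch on a hypothetical four-element series:

```python
import pandas as pd

# Hypothetical toy column, not taken from the dataset.
edu = pd.Series(['HS-grad', 'Bachelors', 'HS-grad', 'Masters'])

# Manual version: counts divided by the number of rows.
manual = edu.value_counts() / len(edu) * 100

# normalize=True returns proportions directly.
shares = edu.value_counts(normalize=True) * 100

# The mean of a boolean mask is the fraction of True values.
bachelors_pct = (edu == 'Bachelors').mean() * 100

print(shares)
print(bachelors_pct)
```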

What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K? Answer: 46.54%. (By contrast, 44.46% of all >50K earners hold one of these degrees, which is a different quantity.)

In [100]:
above_50=data[data['salary']=='>50K']
ad_edu=above_50[above_50['education'].isin(['Bachelors','Masters','Doctorate'])]
# Divide by everyone with an advanced degree, not by everyone earning >50K.
n_adv=data['education'].isin(['Bachelors','Masters','Doctorate']).sum()
round(len(ad_edu) / n_adv * 100, 2)

46.54
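
A percentage like this depends on which group forms the denominator: "share of degree holders who earn >50K" and "share of >50K earners who hold a degree" are generally different numbers. A minimal sketch on eight made-up rows (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    'education': ['Bachelors', 'Bachelors', 'Masters', 'Doctorate',
                  'HS-grad', 'HS-grad', 'HS-grad', 'HS-grad'],
    'salary': ['>50K', '<=50K', '>50K', '<=50K',
               '>50K', '<=50K', '<=50K', '<=50K'],
})

advanced = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
rich = toy['salary'] == '>50K'

# Denominator = degree holders: how many of them earn >50K?
pct_rich_given_advanced = rich[advanced].mean() * 100

# Denominator = >50K earners: how many of them hold a degree?
pct_advanced_given_rich = advanced[rich].mean() * 100

# Denominator = people without a degree; not 100 minus either value above.
pct_rich_given_not_advanced = rich[~advanced].mean() * 100

print(pct_rich_given_advanced, pct_advanced_given_rich, pct_rich_given_not_advanced)
```

In this toy frame the three values are 50%, about 66.7% and 25%, so none can be obtained from another by subtracting from 100.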

What percentage of people without advanced education make more than 50K? Answer: 17.37%

In [101]:
# Not 100 minus the previous answer: the denominator is everyone without an advanced degree.
no_adv=data[~data['education'].isin(['Bachelors','Masters','Doctorate'])]
round((no_adv['salary']=='>50K').mean() * 100, 2)

17.37

What is the minimum number of hours a person works per week? Answer: 1 hour

In [102]:
data['hours-per-week'].min()

1

What percentage of the people who work the minimum number of hours per week have a salary of more than 50K? Answer: 10%

In [103]:
# Denominator: everyone who works the minimum hours, not everyone earning >50K.
min_work=data[data['hours-per-week']==data['hours-per-week'].min()]
round((min_work['salary']=='>50K').mean() * 100, 2)

10.0

What country has the highest percentage of people that earn >50K and what is that percentage? Answer: Iran at 41.86%. (United-States supplies 91.46% of all >50K earners, but that is its share of the earners, not the rate within the country.)

In [104]:
# Per-country rate: the mean of a boolean column is the fraction of True values.
rich_rate=(data['salary']=='>50K').groupby(data['native-country']).mean()*100
rich_rate.sort_values(ascending=False).head(1)

native-country
Iran    41.860465
Name: salary, dtype: float64
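
A country's share of all >50K earners and the rate of >50K earners within that country are different quantities, and a large country can dominate the first while ranking low on the second. A minimal sketch with two hypothetical countries, `A` and `B` (all rows are made up):

```python
import pandas as pd

# Hypothetical toy rows: A has 6 people (2 rich), B has 2 people (1 rich).
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'salary': ['>50K', '>50K', '<=50K', '<=50K', '<=50K', '<=50K',
               '>50K', '<=50K'],
})

rich = toy['salary'] == '>50K'

# Share of all rich people who come from each country (A leads here).
share_of_rich = toy.loc[rich, 'country'].value_counts(normalize=True) * 100

# Rate of rich people within each country (B leads here).
rate_by_country = rich.groupby(toy['country']).mean() * 100

top_country = rate_by_country.idxmax()
print(share_of_rich.idxmax(), top_country, rate_by_country[top_country])
```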

Identify the most popular occupation for those who earn >50K in India. Answer: Prof-specialty

In [106]:
above_50_india=above_50[above_50['native-country']=='India']
above_50_india['occupation'].value_counts()

occupation
Prof-specialty      25
Exec-managerial      8
Other-service        2
Tech-support         2
Transport-moving     1
Sales                1
Adm-clerical         1
Name: count, dtype: int64
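
The two filters can also be combined into one mask, with `idxmax` returning the top label instead of the full table. A sketch on hypothetical rows (the data below is made up; on ties `idxmax` returns the first label encountered):

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    'native-country': ['India', 'India', 'India', 'India', 'Cuba'],
    'salary': ['>50K', '>50K', '>50K', '<=50K', '>50K'],
    'occupation': ['Prof-specialty', 'Prof-specialty', 'Sales', 'Sales', 'Sales'],
})

# Both conditions must hold; parentheses are required around each comparison.
mask = (toy['native-country'] == 'India') & (toy['salary'] == '>50K')

# value_counts() tallies occupations; idxmax() picks the most frequent label.
top_occupation = toy.loc[mask, 'occupation'].value_counts().idxmax()
print(top_occupation)
```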