# Demographic Data Analyzer

### In this challenge we analyze demographic data with pandas. The dataset was extracted from the 1994 Census database.

In [114]:
import numpy as np
import pandas as pd

In [115]:
df = pd.read_csv("adult.data.csv")

In [116]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [117]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [118]:
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)

In [119]:
race_cnt = df['race'].value_counts()
race_cnt

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

What is the average age of men?

In [120]:
sex_cnt = df['sex'].value_counts()
sex_cnt

Male      21790
Female    10771
Name: sex, dtype: int64

In [166]:
# Average age of men, rounded to one decimal place
men = round(df.loc[df['sex'] == 'Male', 'age'].mean(), 1)
print("average age of men: ", men)

average age of men:  39.4
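
The same kind of average can also be computed for every value of `sex` at once with `groupby`. The snippet below is a minimal sketch on a hypothetical toy frame (the rows are made up for illustration and are not taken from `adult.data.csv`).

```python
import pandas as pd

# Hypothetical toy rows, not taken from adult.data.csv.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "age": [39, 28, 50, 37, 31],
})

# Mean age for each value of 'sex' in a single pass.
mean_age_by_sex = toy.groupby("sex")["age"].mean().round(1)
print(mean_age_by_sex)
```

Selecting `mean_age_by_sex["Male"]` then gives the figure for men only.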


What is the percentage of people who have a Bachelor's degree?

In [122]:
bachelor = df['education'] == 'Bachelors'
bachelor


0         True
1         True
2        False
3        False
4         True
         ...  
32556    False
32557    False
32558    False
32559    False
32560    False
Name: education, Length: 32561, dtype: bool

In [123]:
total_bachelor = bachelor.sum()  # True counts as 1, so this is the number of Bachelors rows
total_bachelor

5355

In [124]:
# Total number of people in the dataset (every row has an education value)
educated = len(df)
educated

32561

In [125]:
# Percentage of people who have a Bachelor's degree
bachelors_percentage = round(total_bachelor * 100 / educated, 1)
bachelors_percentage


16.4
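
Because a boolean mask averages to the fraction of `True` values, a percentage like this one can also be taken directly from the mask. The sketch below uses a hypothetical toy frame (made-up rows, not the census data) and also shows `value_counts(normalize=True)`, which returns the share of every category at once.

```python
import pandas as pd

# Hypothetical toy rows, not taken from adult.data.csv.
toy = pd.DataFrame(
    {"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"]}
)

# The mean of a boolean mask is the fraction of rows that are True.
is_bachelor = toy["education"] == "Bachelors"
pct_bachelor = round(is_bachelor.mean() * 100, 1)

# Share of every education level, expressed as a percentage.
shares = toy["education"].value_counts(normalize=True).mul(100).round(1)

print(pct_bachelor)
print(shares)
```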

What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?

In [126]:
master = df['education'] == 'Masters'
doctorate = df['education'] == 'Doctorate'
advanced = bachelor | master | doctorate

In [127]:
advanced_rich = (advanced & (df['salary'] == '>50K')).sum()
advanced_rich


3486

In [128]:
advanced_total = advanced.sum()
advanced_total

7491

In [129]:
# Percentage of people with advanced education who make more than 50K
advanced_rich_percentage = round(advanced_rich * 100 / advanced_total, 1)
advanced_rich_percentage

46.5

What percentage of people without advanced education make more than 50K?

In [130]:
lower = ~advanced  # everyone without a Bachelors, Masters, or Doctorate

In [131]:
lower_rich = (lower & (df['salary'] == '>50K')).sum()
lower_rich

4355

In [132]:
lower_total = lower.sum()
lower_total

25070

In [133]:
# Percentage of people without advanced education who make more than 50K
lower_rich_percentage = round(lower_rich * 100 / lower_total, 1)
lower_rich_percentage

17.4
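
The two percentages above (with and without advanced education) can also be produced in one step by grouping the ">50K" mask on an education flag. This is a minimal sketch on a hypothetical toy frame; the rows and the group labels `"advanced"` and `"other"` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows, not taken from adult.data.csv.
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "Doctorate", "11th", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# isin() replaces a chain of == comparisons joined with |.
advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = toy["salary"] == ">50K"

# Label each row, then take the mean of the boolean 'rich' mask per label.
group = advanced.map({True: "advanced", False: "other"})
pct_rich = rich.groupby(group).mean().mul(100).round(1)
print(pct_rich)
```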

What is the minimum number of hours a person works per week?

In [139]:
min_work_hr = df['hours-per-week'].min()
min_work_hr

1

What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

In [141]:
# Each comparison needs its own parentheses because & binds more tightly than ==
num_min_workers = ((df['hours-per-week'] == min_work_hr) & (df['salary'] == '>50K')).sum()
num_min_workers

2

In [143]:
# Percentage of the people working the minimum number of hours per week who have a salary of more than 50K
percentage_min_workers = num_min_workers * 100 / (df['hours-per-week'] == 1).sum()
percentage_min_workers

10.0
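
In Python, `&` binds more tightly than `==`, so each comparison in a combined filter needs its own parentheses. The sketch below uses a hypothetical toy frame to show how the two forms can diverge; it includes a made-up 0-hours row, which the census data does not contain (its minimum is 1), so on the real dataset both forms happen to select the same rows.

```python
import pandas as pd

# Hypothetical toy rows; the 0-hours row is invented to expose the difference.
toy = pd.DataFrame({
    "hours-per-week": [1, 1, 40, 0],
    "salary": [">50K", "<=50K", ">50K", "<=50K"],
})
rich = toy["salary"] == ">50K"

# Without parentheses, `1 & rich` is evaluated first, and the result is
# then compared with the hours column.
unparenthesized = toy["hours-per-week"] == 1 & rich
explicit_equivalent = toy["hours-per-week"] == (1 & rich)

# With parentheses, the filter means "works 1 hour AND earns >50K".
intended = (toy["hours-per-week"] == 1) & rich

print(unparenthesized.equals(explicit_equivalent))
print(int(unparenthesized.sum()), int(intended.sum()))
```

In the toy frame, the unparenthesized form also matches the 0-hours, "<=50K" row, because 0 compares equal to `False`.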

What country has the highest percentage of people that earn >50K and what is that percentage?

In [151]:
rich_by_coutry = df.loc[df['salary'] == '>50K', 'native-country'].value_counts()
rich_by_coutry

United-States         7171
?                      146
Philippines             61
Germany                 44
India                   40
Canada                  39
Mexico                  33
England                 30
Cuba                    25
Italy                   25
Japan                   24
China                   20
Taiwan                  20
Iran                    18
South                   16
France                  12
Poland                  12
Puerto-Rico             12
Jamaica                 10
El-Salvador              9
Greece                   8
Cambodia                 7
Hong                     6
Yugoslavia               6
Ireland                  5
Vietnam                  5
Haiti                    4
Ecuador                  4
Portugal                 4
Hungary                  3
Thailand                 3
Scotland                 3
Guatemala                3
Laos                     2
Columbia                 2
Nicaragua                2
Dominican-Republic       2
T

In [150]:
pop_by_coutry = df['native-country'].value_counts()
pop_by_coutry 

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                      

In [153]:
percentage_rich_bycountry = round(rich_by_coutry*100/pop_by_coutry,1) 
percentage_rich_bycountry

?                             25.0
Cambodia                      36.8
Canada                        32.2
China                         26.7
Columbia                       3.4
Cuba                          26.3
Dominican-Republic             2.9
Ecuador                       14.3
El-Salvador                    8.5
England                       33.3
France                        41.4
Germany                       32.1
Greece                        27.6
Guatemala                      4.7
Haiti                          9.1
Holand-Netherlands             NaN
Honduras                       7.7
Hong                          30.0
Hungary                       23.1
India                         40.0
Iran                          41.9
Ireland                       20.8
Italy                         34.2
Jamaica                       12.3
Japan                         38.7
Laos                          11.1
Mexico                         5.1
Nicaragua                      5.9
Outlying-US(Guam-USV

In [155]:
highest_earning_country = percentage_rich_bycountry.idxmax()
highest_earning_country

'Iran'

In [156]:
highest_earning_country_percentage = percentage_rich_bycountry.max()
highest_earning_country_percentage

41.9
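
Dividing one `value_counts()` result by another aligns the two Series on their index labels, so a country with no ">50K" earners has no matching label and comes out as `NaN`; `idxmax()` and `max()` skip `NaN` by default. The sketch below illustrates this on a hypothetical toy frame (the country labels "A", "B", "C" are made up) and compares it with a `groupby` variant that yields 0.0 instead of `NaN`.

```python
import pandas as pd

# Hypothetical toy rows, not taken from adult.data.csv.
toy = pd.DataFrame({
    "native-country": ["A", "A", "A", "B", "B", "C"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})
rich = toy["salary"] == ">50K"

# Ratio of two value_counts: "C" has no >50K rows, so it becomes NaN.
rich_counts = toy.loc[rich, "native-country"].value_counts()
all_counts = toy["native-country"].value_counts()
ratio = (rich_counts * 100 / all_counts).round(1)

# Mean of the boolean mask per country: "C" becomes 0.0 instead of NaN.
by_group = rich.groupby(toy["native-country"]).mean().mul(100).round(1)

print(ratio)
print(by_group)
print(ratio.idxmax(), by_group.idxmax())
```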

Identify the most popular occupation for those who earn >50K in India.

In [158]:
india = df['native-country'] == 'India'
india

0        False
1        False
2        False
3        False
4        False
         ...  
32556    False
32557    False
32558    False
32559    False
32560    False
Name: native-country, Length: 32561, dtype: bool

In [161]:
india_rich = df.loc[india & (df['salary'] == '>50K'), 'occupation'].value_counts()
india_rich

Prof-specialty      25
Exec-managerial      8
Other-service        2
Tech-support         2
Adm-clerical         1
Transport-moving     1
Sales                1
Name: occupation, dtype: int64

In [164]:
pop_occupation = india_rich.idxmax()
pop_occupation

'Prof-specialty'