Perry Fox 5/29/23

# Demographic Data Analyzer
"In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database."

This notebook was written first, then adapted into the Python file on Replit for unit testing and submission:
https://replit.com/@Pyrus277/boilerplate-demographic-data-analyzer


---

## 0. Imports and first look at the dataset

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('adult_data.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


## Questions

### 1. How many people of each race are represented in this dataset?
Return a Pandas series with race names as the index labels

In [182]:
df.groupby(['race'])['race'].count().sort_values(ascending=False)

race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
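
As an alternative to the groupby-and-count combination, `Series.value_counts()` returns the same counts already sorted in descending order. Below is a minimal sketch on a small made-up frame; the `toy` data is purely illustrative and is not taken from the census file.

```python
import pandas as pd

# Made-up sample standing in for the census data (assumption, not real rows)
toy = pd.DataFrame({"race": ["White", "Black", "White", "Other", "White", "Black"]})

# groupby + count, sorted descending, as in the cell above
by_groupby = toy.groupby("race")["race"].count().sort_values(ascending=False)

# value_counts sorts by count in descending order by default
by_value_counts = toy["race"].value_counts()

print(by_value_counts.to_dict())
```

Both approaches give a Series with the race names as index labels, which is what the question asks for.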

### 2. What is the average age of men?

In [186]:
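# Select 'age' as a Series so .mean() returns a scalar directly;
# this avoids positional [0] indexing on a label-indexed Series,
# which newer pandas versions deprecate.
round(df.loc[df.sex == 'Male', 'age'].mean(), 1)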

39.4

### 3. What is the percentage of people who have a Bachelor's degree?

In [7]:
percentage_bachelors = round(
    (df.education.value_counts().Bachelors / df.education.count() * 100)
, 1)

In [8]:
percentage_bachelors

16.4
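
The same percentage can also be read from `value_counts(normalize=True)`, which returns each category's share of the rows instead of its raw count. This is a minimal sketch on made-up data; the `toy` frame and the `shares` name are assumptions for illustration.

```python
import pandas as pd

# Made-up sample (assumption): 3 of the 8 rows are 'Bachelors'
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "Bachelors",
                  "HS-grad", "HS-grad", "11th", "Bachelors"]
})

# normalize=True returns fractions that sum to 1 instead of counts
shares = toy["education"].value_counts(normalize=True)

# Scale the fraction to a percentage and round to one decimal place
pct_bachelors = round(shares["Bachelors"] * 100, 1)
print(pct_bachelors)
```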

### 4. What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50k?

In [22]:
advanced_edu = df.loc[
    df.education.isin(['Bachelors', 'Masters', 'Doctorate'])
]
advanced_edu.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K


In [95]:
adv_edu_over50k = advanced_edu.loc[advanced_edu.salary == '>50K'].shape[0]

higher_education_rich = round(
    (adv_edu_over50k / advanced_edu.shape[0] * 100)
, 1)

In [96]:
higher_education_rich

46.5

### 5. What percentage of people without advanced education make more than 50K?

In [21]:
low_edu = df.loc[
    ~df.education.isin(['Bachelors', 'Masters', 'Doctorate'])
]
low_edu.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K


In [97]:
low_edu_over50k = low_edu.loc[low_edu.salary == '>50K'].shape[0]

low_edu_rich = round(
    (low_edu_over50k / low_edu.shape[0] * 100)
, 1)

In [98]:
low_edu_rich

17.4
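
Questions 4 and 5 repeat the same steps with opposite masks, so the calculation could be folded into one small function. The sketch below assumes a hypothetical helper named `pct_over_50k` and uses made-up rows; it relies on the fact that the mean of a boolean Series is the fraction of `True` values.

```python
import pandas as pd

def pct_over_50k(frame, mask):
    """Hypothetical helper: percentage of rows selected by `mask` with salary '>50K'."""
    selected = frame.loc[mask]
    # Mean of a boolean Series equals the fraction of True values
    return round((selected["salary"] == ">50K").mean() * 100, 1)

# Made-up sample (assumption, not real census rows)
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "11th", "Doctorate", "HS-grad"],
    "salary":    [">50K",      "<=50K",   "<=50K",   "<=50K", ">50K",     ">50K"],
})

advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])

higher = pct_over_50k(toy, advanced)    # rows with advanced education
lower = pct_over_50k(toy, ~advanced)    # all remaining rows
print(higher, lower)
```

Passing `~advanced` for the second call ensures both groups are defined from one mask and cannot drift apart.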

### 6. What is the minimum number of hours a person works per week?

In [41]:
min_hrs = df['hours-per-week'].min()
min_hrs

1

answer: 1

### 7. What percentage of the people who work the minimum number of hours per week have a salary of more than 50k?

In [86]:
df_min_work = df.loc[df['hours-per-week'] == min_hrs]

df_min_work50k = df_min_work.loc[df_min_work.salary == '>50K']

prct_min_work_sal50 = round(
    (df_min_work50k.shape[0] / df_min_work.shape[0]) * 100
, 0)

In [88]:
prct_min_work_sal50

10.0

answer: 10

### 8. What country has the highest percentage of people that earn >50k...

In [116]:
# High earner count by country
rich = df.loc[df.salary == '>50K']
high_earner_ct = rich.groupby(['native-country']).salary.count()

In [118]:
# Total count by country
total_ct = df.groupby(['native-country']).salary.count()

In [129]:
# Series of high-earner fractions per country (scaled to a percentage below)
percentages = high_earner_ct / total_ct
# Sort descending and grab the first index--the info we're looking for:
percentages.sort_values(ascending=False).index[0]

'Iran'

answer: 'Iran'

#### ...And what is that percentage?

In [133]:
round(percentages.sort_values(ascending=False).iloc[0] * 100, 1)

41.9

answer: 41.9
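
The two counts and the division can also be collapsed into a single grouped mean of a boolean column. The following is a sketch under assumptions: the country labels `X`, `Y`, `Z` and all rows are made up. One behavioral difference to be aware of: in the division approach, a country with no high earners is missing from the numerator and comes out as NaN, while here it comes out as 0.

```python
import pandas as pd

# Made-up sample with placeholder country labels (assumption)
toy = pd.DataFrame({
    "native-country": ["X", "X", "Y", "Y", "Y", "Z"],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Per-country mean of a boolean Series = fraction of high earners
share = (toy["salary"] == ">50K").groupby(toy["native-country"]).mean()

top_country = share.idxmax()            # index label of the largest fraction
top_pct = round(share.max() * 100, 1)   # that fraction as a percentage
print(top_country, top_pct)
```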

### 9. Identify the most popular occupation for those who earn >50k in India.

In [200]:
high_earn_india = df.loc[
    (df['native-country']== 'India') & (df.salary== '>50K')
    ,['occupation']
]
high_earn_india = \
    high_earn_india.groupby('occupation').occupation.count().sort_values(ascending=False)
# Used groupby + count because value_counts returned the index labels in an awkward format

In [201]:
high_earn_india.index[0]

'Prof-specialty'

answer: 'Prof-specialty'
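
The awkward index format mentioned in the code comment above likely comes from selecting with the list `['occupation']`, which yields a one-column DataFrame; `DataFrame.value_counts()` labels its result with tuples. Selecting with the plain string `'occupation'` yields a Series, whose `value_counts()` uses the occupation names directly. This is a minimal sketch on made-up rows, not a claim about what the original cell returned.

```python
import pandas as pd

# Made-up sample (assumption, not real census rows)
toy = pd.DataFrame({
    "native-country": ["India", "India", "India", "Cuba"],
    "salary": [">50K", ">50K", ">50K", ">50K"],
    "occupation": ["Prof-specialty", "Exec-managerial", "Prof-specialty", "Sales"],
})

mask = (toy["native-country"] == "India") & (toy["salary"] == ">50K")

# A string column label (not a list) selects a Series
occupations = toy.loc[mask, "occupation"]

# idxmax returns the index label with the highest count
top_occupation = occupations.value_counts().idxmax()
print(top_occupation)
```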