In [1]:
import pandas as pd
import numpy as np

# Project 2 - Demographic Data Analyzer 
---
## Given
- A dataset of demographic data that was extracted from the 1994 Census database.

## Objective
Answer the following questions:
1. How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)
2. What is the average age of men?
3. What is the percentage of people who have a Bachelor's degree?
4. What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
5. What percentage of people without advanced education make more than 50K?
6. What is the minimum number of hours a person works per week?
7. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
8. What country has the highest percentage of people that earn >50K and what is that percentage?
9. Identify the most popular occupation for those who earn >50K in India.  

## My Solution
> This notebook walks through the steps used to extract the answers. The source code that produces each answer is the same as in demographic_data_analyzer.py, which is used for testing against the test cases.

### Loading the data
First, load the data from the CSV file into a DataFrame 'df' using pd.read_csv().

In [2]:
df = pd.read_csv('adult.data.csv')

In [3]:
# Quick look at the data we are dealing with
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
# get an understanding of the number of rows and columns:
df.shape

(32561, 15)

In [5]:
# initial info about the data stored, types, no. of null values etc:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


### Overview of data:
- 32561 rows and 15 columns.  
- No column contains null values (every column shows 32561 non-null entries).  
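
A side note on the null check: `df.info()` only counts true NaN values. If a dataset marks unknown entries with a placeholder string instead, those entries are reported as non-null. The sketch below runs on a made-up frame; the toy rows and the `?` placeholder are assumptions for illustration and are not taken from the output above.

```python
import pandas as pd

# Toy frame standing in for df; the "?" placeholder is an assumption
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Tech-support"],
})

# True NaN count per column (this is what df.info() reflects)
nan_counts = toy.isna().sum()

# Count of "?" placeholder strings per column
placeholder_counts = (toy == "?").sum()

print(nan_counts.to_dict())
print(placeholder_counts.to_dict())
```

Running the same two checks on `df` would show whether any column hides placeholder values behind a clean non-null count.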

### Question 1:
> How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)  

The answer is stored in race_count

In [6]:
# value_counts() returns a Series indexed by race name
race_count = df.race.value_counts()
race_count

race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64

### Question 2:
> What is the average age of men?  

The answer is stored in average_age_men

In [7]:
average_age_men = df.loc[df.sex == "Male"].age.mean()
average_age_men 

39.43354749885268

### Question 3:
> What is the percentage of people who have a Bachelor's degree?  

The answer is stored in percentage_bachelors

In [8]:
total_people = df.shape[0]
num_people_with_bachelors = df.loc[df.education == "Bachelors"].shape[0]
# Share of all rows whose education is "Bachelors", as a percentage
percentage_bachelors = num_people_with_bachelors / total_people * 100
percentage_bachelors

16.44605509658794
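
The count-then-divide calculation above can also be written as the mean of a boolean mask, since `True` counts as 1 and `False` as 0. This is a minimal sketch on a made-up frame (the `toy` data is an assumption for illustration), not a replacement for the solution above.

```python
import pandas as pd

# Toy frame standing in for df (made-up rows)
toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "Bachelors"]})

# The mean of a boolean mask is the fraction of True values
pct = (toy["education"] == "Bachelors").mean() * 100
print(pct)  # 50.0
```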

### Question 4:
> What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?  

The answer is stored in higher_education

In [9]:
# Rows for people with a Bachelors, Masters, or Doctorate degree
higher_education_df = df.loc[df.education.isin(["Bachelors", "Masters", "Doctorate"])]

total_people_higher_education = higher_education_df.shape[0]

num_people_with_higher_educationGT50K = higher_education_df.loc[higher_education_df.salary == ">50K"].shape[0]

# Percentage of that group earning more than 50K
higher_education = num_people_with_higher_educationGT50K / total_people_higher_education * 100
higher_education

46.535843011613935

### Question 5:
> What percentage of people without advanced education make more than 50K?  

The answer is stored in lower_education

In [10]:
# Rows for people without a Bachelors, Masters, or Doctorate degree
lower_education_df = df.loc[~df.education.isin(["Bachelors", "Masters", "Doctorate"])]
total_people_lower_education = lower_education_df.shape[0]
num_people_with_lower_educationGT50K = lower_education_df.loc[lower_education_df.salary == ">50K"].shape[0]

# Percentage of that group earning more than 50K
lower_education = num_people_with_lower_educationGT50K / total_people_lower_education * 100
round(lower_education, 1)

17.4
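
Questions 4 and 5 split the data on the same condition, so one mask and its negation can serve both. The sketch below shows that idea on a made-up frame; the `toy` rows and the names `ADVANCED`, `is_advanced`, `is_rich`, `higher_pct`, and `lower_pct` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy frame standing in for df (made-up rows)
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "Doctorate", "11th", "Bachelors"],
    "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

ADVANCED = ["Bachelors", "Masters", "Doctorate"]

# One mask for "has advanced education", one for "earns more than 50K"
is_advanced = toy["education"].isin(ADVANCED)
is_rich = toy["salary"] == ">50K"

# Share of >50K earners inside each group, as a percentage
higher_pct = is_rich[is_advanced].mean() * 100
lower_pct = is_rich[~is_advanced].mean() * 100

print(higher_pct, lower_pct)
```

Because both groups come from the same mask, they are guaranteed to be complementary, which is harder to get wrong than repeating three comparisons in each cell.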

### Question 6:
> What is the minimum number of hours a person works per week?  

The answer is stored in min_work_hours

In [11]:
min_work_hours = df["hours-per-week"].values.min()
min_work_hours

1

### Question 7:
> What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

The number of people who work the minimum hours per week is stored in num_min_workers  
The answer is stored in rich_percentage

In [12]:
# People who work the minimum number of hours per week
min_hour_workers = df.loc[df["hours-per-week"] == min_work_hours]
num_min_workers = min_hour_workers.shape[0]
num_min_workersGT50K = min_hour_workers.loc[min_hour_workers["salary"] == ">50K"].shape[0]

rich_percentage = num_min_workersGT50K/num_min_workers * 100

num_min_workers,rich_percentage

(20, 10.0)

### Question 8:
> What country has the highest percentage of people that earn >50K and what is that percentage?  

The country with the highest percentage of people earning >50K is stored in highest_earning_country  
The percentage is stored in highest_earning_country_percentage

In [13]:
# Per country: people who earn more than 50K divided by total people
people_from_countryGT50K = df.loc[df.salary == ">50K"]["native-country"].value_counts()
people_from_country = df["native-country"].value_counts()
req = people_from_countryGT50K / people_from_country
highest_earning_country = req.idxmax()
highest_earning_country_percentage = round(req.max()*100,1)

highest_earning_country,highest_earning_country_percentage

('Iran', 41.9)
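
One detail of the division above: the two `value_counts()` results are aligned by country name, so a country with no >50K earners is missing from the numerator and ends up as NaN in the ratio. `idxmax()` skips NaN by default, so the answer is unaffected. The sketch below shows this on a made-up frame and compares it with a `groupby` variant that yields 0.0 instead of NaN; the `toy` rows and the names `ratio` and `share` are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for df (made-up rows)
toy = pd.DataFrame({
    "native-country": ["Iran", "Iran", "India", "India", "India", "Cuba"],
    "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# Same approach as the cell above: divide two value_counts() Series
rich_counts = toy.loc[toy["salary"] == ">50K", "native-country"].value_counts()
all_counts = toy["native-country"].value_counts()
ratio = rich_counts / all_counts  # Cuba has no >50K rows, so its ratio is NaN

# Alternative: mean of a boolean mask per country (Cuba becomes 0.0)
share = (toy["salary"] == ">50K").groupby(toy["native-country"]).mean()

print(ratio.idxmax(), share.idxmax(), round(share.max() * 100, 1))
```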

### Question 9:
> Identify the most popular occupation for those who earn >50K in India.

The occupation is stored in top_IN_occupation

In [14]:
# People from India who earn more than 50K (both conditions in a single mask)
indian_data = df.loc[(df["native-country"] == "India") & (df.salary == ">50K")]
top_IN_occupation = indian_data.occupation.value_counts().idxmax()
top_IN_occupation

'Prof-specialty'