# Demographic Data Analyzer

Source:<br>
https://www.freecodecamp.org/learn/data-analysis-with-python/data-analysis-with-python-projects/demographic-data-analyzer

### Assignment

# Demographic Data Analyzer

In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database. Here is a sample of what the data looks like:

|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status     | occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   hours-per-week | native-country   | salary   |
|---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------|
|  0 |    39 | State-gov        |    77516 | Bachelors   |              13 | Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 |              0 |               40 | United-States    | <=50K    |
|  1 |    50 | Self-emp-not-inc |    83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband        | White  | Male   |              0 |              0 |               13 | United-States    | <=50K    |
|  2 |    38 | Private          |   215646 | HS-grad     |               9 | Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  3 |    53 | Private          |   234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband        | Black  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  4 |    28 | Private          |   338409 | Bachelors   |              13 | Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 |              0 |               40 | Cuba             | <=50K    |


You must use Pandas to answer the following questions:
* How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (`race` column)
* What is the average age of men?
* What is the percentage of people who have a Bachelor's degree?
* What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
* What percentage of people without advanced education make more than 50K?
* What is the minimum number of hours a person works per week?
* What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
* What country has the highest percentage of people that earn >50K and what is that percentage?
* Identify the most popular occupation for those who earn >50K in India. 

Use the starter code in the file `demographic_data_analyzer`. Update the code so all variables set to `None` are set to the appropriate calculation or code. Round all decimals to the nearest tenth.

Unit tests are written for you under `test_module.py`.

### Dataset Source

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

In [1]:
import pandas as pd

In [2]:
# Data Loading
df = pd.read_csv('data/adult.data.csv')
df.head(5)

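|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status     | occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   hours-per-week | native-country   | salary   |
|---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------|
|  0 |    39 | State-gov        |    77516 | Bachelors   |              13 | Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 |              0 |               40 | United-States    | <=50K    |
|  1 |    50 | Self-emp-not-inc |    83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband        | White  | Male   |              0 |              0 |               13 | United-States    | <=50K    |
|  2 |    38 | Private          |   215646 | HS-grad     |               9 | Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  3 |    53 | Private          |   234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband        | Black  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  4 |    28 | Private          |   338409 | Bachelors   |              13 | Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 |              0 |               40 | Cuba             | <=50K    |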


<h4>Solutions to each question, one by one</h4><br>


In [3]:
# How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
# Expected Values {"White":27816, "Black":3124, "Asian-Pac-Islander":1039, "Amer-Indian-Eskimo":311, "Other":271}  
race_count = df['race'].value_counts()
print(race_count)

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
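
As a quick check of the shape the assignment asks for, the sketch below runs `value_counts()` on a small made-up frame and reads a count back by its race label. The rows in `toy` are hypothetical and are not taken from the census file.

```python
import pandas as pd

# Hypothetical toy rows, not the census data.
toy = pd.DataFrame({"race": ["White", "Black", "White", "Other", "White"]})

toy_race_count = toy["race"].value_counts()

# The result is a Series indexed by race name, sorted by count in descending order.
print(type(toy_race_count).__name__)  # Series
print(toy_race_count["White"])        # 3
print(toy_race_count.index[0])        # White
```

Because the index holds the race names, individual counts can be looked up by label, which is the form the assignment requests.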


In [4]:
# What is the average age of men?
# Expected Values: 39.4
filter1 = df['sex'] == 'Male'
df_male_age = df['age'].where(filter1)
average_age_men = df_male_age.mean()
print(average_age_men)


39.43354749885268
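
`Series.where` keeps the original length and replaces non-matching rows with `NaN`, which `mean()` then skips. Selecting the rows with `.loc` should give the same average without the intermediate `NaN` values. A minimal sketch on made-up rows (the frame `toy` and the name `is_male` are illustrative):

```python
import pandas as pd

# Hypothetical toy rows, not the census data.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [30, 50, 40, 20],
})
is_male = toy["sex"] == "Male"

# where() masks the female rows with NaN; mean() ignores NaN by default.
mean_via_where = toy["age"].where(is_male).mean()

# .loc selects only the male rows before averaging.
mean_via_loc = toy.loc[is_male, "age"].mean()

print(mean_via_where, mean_via_loc)  # 35.0 35.0
```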


In [5]:
# What is the percentage of people who have a Bachelor's degree?
# Expected Values: 16.4
filter2 = df['education'] == 'Bachelors'
suma = df['education'].where(filter2).dropna().count()
percentage_bachelors = (suma / df['education'].count()) * 100
print(percentage_bachelors)

16.44605509658794
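
Since `True` counts as 1 and `False` as 0, the mean of a boolean mask is the fraction of matching rows, so the percentage can also be computed in one step. The sketch below uses a hypothetical five-row frame and assumes the column has no missing values:

```python
import pandas as pd

# Hypothetical toy rows, not the census data.
toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"]})
has_bachelors = toy["education"] == "Bachelors"

# Count-based form: matching rows divided by all rows.
pct_counts = has_bachelors.sum() / len(toy) * 100

# Mean-based form: the mean of the mask is the same fraction.
pct_mean = has_bachelors.mean() * 100

print(round(pct_counts, 1), round(pct_mean, 1))  # 40.0 40.0
```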


In [6]:
# What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
# Expected Values: 46.5
filter3 = df['salary'] == '>50K' 
filter3a = (df['education'] == 'Bachelors') | (df['education'] == 'Masters') | (df['education'] == 'Doctorate')

higher_education = df['education'].where(filter3a).dropna().count()
higher_education_rich = df['education'].where(filter3 & filter3a).dropna().count() / higher_education * 100

print(higher_education_rich)

46.535843011613935



In [7]:
# What percentage of people without advanced education make more than 50K?
# Expected Values: 17.4
filter3 = df['salary'] == '>50K' 
filter3b = ~(df['education'] == 'Bachelors') & ~(df['education'] == 'Masters') & ~(df['education'] == 'Doctorate')

lower_education = df['education'].where(filter3b).dropna().count() 
lower_education_rich = df['education'].where(filter3 & filter3b).dropna().count() / lower_education * 100

print(lower_education_rich)

17.3713601914639
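
The two education masks are complements of each other, so one can be derived from the other. A hedged sketch using `Series.isin` and `~` on hypothetical rows (the names `advanced` and `rich` are illustrative, not part of the starter code):

```python
import pandas as pd

# Hypothetical toy rows, not the census data.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "Doctorate", "11th", "HS-grad"],
    "salary":    [">50K",      "<=50K",   "<=50K",   ">50K",      "<=50K", ">50K"],
})

advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = toy["salary"] == ">50K"

# Share of >50K earners in each group: mean of the salary mask over the selected rows.
higher_pct = rich[advanced].mean() * 100   # 2 of 3 advanced-education rows
lower_pct = rich[~advanced].mean() * 100   # 1 of 3 remaining rows

print(round(higher_pct, 1), round(lower_pct, 1))  # 66.7 33.3
```

Deriving the second mask with `~advanced` keeps the two groups mutually exclusive by construction.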


In [8]:
# What is the minimum number of hours a person works per week?
# Expected Values: 1
min_work_hours = df['hours-per-week'].min()
print(min_work_hours)

1
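
The follow-up question, what share of the people working that minimum number of hours earn more than 50K, builds directly on this value. A minimal sketch on hypothetical rows (the names `min_workers` and `rich_pct` are illustrative):

```python
import pandas as pd

# Hypothetical toy rows, not the census data.
toy = pd.DataFrame({
    "hours-per-week": [1, 40, 1, 20, 1, 1],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

min_hours = toy["hours-per-week"].min()

# Keep only the rows that work the minimum number of hours.
min_workers = toy[toy["hours-per-week"] == min_hours]

# Fraction of those rows earning >50K, expressed as a percentage.
rich_pct = (min_workers["salary"] == ">50K").mean() * 100

print(min_hours, round(rich_pct, 1))  # 1 25.0
```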


The function below combines all of the above so that it can be used to run the unit tests.

In [12]:
def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('data/adult.data.csv')

    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df['race'].value_counts()

    # What is the average age of men?
    filter1 = df['sex'] == 'Male'
    df_male_age = df['age'].where(filter1)
    average_age_men = round(float(df_male_age.mean()),1)

    # What is the percentage of people who have a Bachelor's degree?
    filter2 = df['education'] == 'Bachelors'
    suma2 = df['education'].where(filter2).dropna().count()
    percentage_bachelors = round((suma2 / df['education'].count()) * 100,1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    filter3 = df['salary'] == '>50K' 
    
    filter3a = (df['education'] == 'Bachelors') | (df['education'] == 'Masters') | (df['education'] == 'Doctorate')
    
    filter3b = ~(df['education'] == 'Bachelors') & ~(df['education'] == 'Masters') & ~(df['education'] == 'Doctorate')

    higher_education = df['education'].where(filter3a).dropna().count()
    lower_education = df['education'].where(filter3b).dropna().count() 

    # percentage with salary >50K
    higher_education_rich = round(df['education'].where(filter3 & filter3a).dropna().count() / higher_education * 100,1)
    lower_education_rich = round(df['education'].where(filter3 & filter3b).dropna().count() / lower_education * 100,1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    min_workers = df[df['hours-per-week'] == min_work_hours]
    min_workers_rich = min_workers[min_workers['salary'] == '>50K']

    # Rounded to the nearest tenth, as the assignment requires
    rich_percentage = round(len(min_workers_rich) / len(min_workers) * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    # The mean of the boolean >50K mask within each country is that country's share of high earners.
    high_earners_ratio = (df['salary'] == '>50K').groupby(df['native-country']).mean() * 100

    highest_earning_country = high_earners_ratio.idxmax()
    highest_earning_country_percentage = round(high_earners_ratio.max(), 1)

    # Identify the most popular occupation for those who earn >50K in India.
    df_filtered = df[(df['native-country'] == 'India') & (df['salary'] == '>50K')]
    top_IN_occupation = df_filtered['occupation'].value_counts().idxmax()

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage': highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

In [13]:
a = calculate_demographic_data()

Number of each race:
 White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
Average age of men: 39.4
Percentage with Bachelors degrees: 16.4%
Percentage with higher education that earn >50K: 46.5%
Percentage without higher education that earn >50K: 17.4%
Min work time: 1 hours/week
Percentage of rich among those who work fewest hours: 10.0%
Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.9%
Top occupations in India: Prof-specialty
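
To see how a per-country percentage and a most-common occupation can be derived, the sketch below applies `groupby` and `value_counts` to a hypothetical toy frame. The rows and the names `pct_by_country`, `top_country`, and `top_occupation` are illustrative, and the numbers are unrelated to the census results above.

```python
import pandas as pd

# Hypothetical toy rows, not the census data.
toy = pd.DataFrame({
    "native-country": ["India", "India", "India", "Cuba", "Cuba", "Iran"],
    "salary":         [">50K",  ">50K",  "<=50K", "<=50K", "<=50K", ">50K"],
    "occupation":     ["Prof-specialty", "Prof-specialty", "Sales",
                       "Sales", "Sales", "Sales"],
})

rich = toy["salary"] == ">50K"

# Mean of the boolean mask per country = share of >50K earners in that country.
pct_by_country = rich.groupby(toy["native-country"]).mean() * 100
top_country = pct_by_country.idxmax()
top_pct = round(pct_by_country.max(), 1)

# Most common occupation among >50K earners from India.
india_rich = toy.loc[rich & (toy["native-country"] == "India"), "occupation"]
top_occupation = india_rich.value_counts().idxmax()

print(top_country, top_pct, top_occupation)  # Iran 100.0 Prof-specialty
```

A country with very few rows can reach a high percentage from a single record, as Iran does in this toy frame, so the group sizes are worth checking alongside the ratio.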
