# Demographic Data Analysis

The point of this notebook is to analyze census data provided by freeCodeCamp. The goal is to create a function that answers the following questions using Pandas:

- How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels (`race` column).
- What is the average age of men?
- What is the percentage of people who have a Bachelor's degree?
- What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
- What percentage of people without advanced education make more than 50K?
- What is the minimum number of hours a person works per week?
- What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
- What country has the highest percentage of people that earn >50K and what is that percentage?
- Identify the most popular occupation for those who earn >50K in India.

<font color='blue'>Here you will find the link to the assignment and CSV file:</font> https://repl.it/@freeCodeCamp/fcc-demographic-data-analyzer#README.md

In [1]:
# Importing needed libraries
import pandas as pd
import numpy as np

In [2]:
# Importing the data
df = pd.read_csv('adult.data.txt') 

In [3]:
# Previewing the data 
df.head(8) 

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K


In [4]:
# Checking how large the data is via rows and columns 
df.shape 

(32561, 15)

In [5]:
# Checking the stats of the data
df.describe() 

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [6]:
# Checking for any missing data 
df.isna().sum() 

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64
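
`isna()` only counts values that pandas parsed as NaN. Census extracts of this kind sometimes mark unknown entries with a placeholder string instead, which the check above would not catch. The sketch below counts such placeholders per column. The `"?"` marker and the toy frame are assumptions for illustration, not values confirmed from this file.

```python
import pandas as pd

# Toy frame standing in for the census data (hypothetical values).
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Tech-support"],
    "age": [39, 50, 38, 53],
})

# Assumed placeholder for unknown entries; adjust to whatever the file uses.
PLACEHOLDER = "?"

# Element-wise comparison against the placeholder, then count matches per column.
placeholder_counts = toy.eq(PLACEHOLDER).sum()
print(placeholder_counts)
```

If the file does use such a marker, passing `na_values="?"` to `pd.read_csv` would turn those entries into NaN so that `isna()` reports them.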

In [7]:
# Checking the data types of each column
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


## Answering The Questions

### How many people of each race are represented in this dataset? 

In [8]:
race_count = df["race"].value_counts()

### What is the average age of men?

In [9]:
# Select the ages of male respondents and take the mean, rounded to one decimal
average_age_men = round(df.loc[df["sex"] == "Male", "age"].mean(), 1)

### What is the percentage of people who have a Bachelor's degree?

In [10]:
# Count the Bachelors rows, then divide by the total number of rows
num_bachelors = len(df[df["education"] == "Bachelors"])
total_education = len(df["education"])
percentage_bachelors = round(100 * num_bachelors / total_education, 1)
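
A percentage like this can also be read directly off a boolean mask, because the mean of a boolean Series is the fraction of `True` values. This is a minimal sketch on a toy Series with made-up values, not the notebook's data.

```python
import pandas as pd

# Toy education column (hypothetical values).
education = pd.Series(["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"])

# True where the row is a Bachelors degree; the mean is the share of True values.
share = (education == "Bachelors").mean()
pct = round(100 * share, 1)
print(pct)  # 2 of 5 rows match -> 40.0
```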

### What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?

In [11]:
# Rows with an advanced degree that also earn more than 50K.
# Both conditions go into one mask so pandas does not have to reindex a second filter.
advanced_degrees = ["Bachelors", "Masters", "Doctorate"]
higher_education = df.loc[df["education"].isin(advanced_degrees) & (df["salary"] == ">50K")]


### What percentage of people without advanced education make more than 50K?

In [12]:
# All >50K earners, minus those with an advanced degree, leaves the >50K earners without one
lower_education = df.loc[df["salary"] == ">50K"]
lower_education = len(lower_education) - len(higher_education)

### Percentage of each group with salary >50K

In [13]:
# Everyone with an advanced degree, regardless of salary
higher_education2 = df.loc[df["education"].isin(["Bachelors", "Masters", "Doctorate"])]
higher_education_rich = round(100 * len(higher_education) / len(higher_education2), 1)
lower_education_rich = round(100 * lower_education / (total_education - len(higher_education2)), 1)
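
The two percentages can also be derived from a pair of boolean masks, which avoids tracking separate counts. The sketch below uses `isin` for the degree list and `~` to negate the mask. The toy frame and the names `advanced`, `high_pay`, `pct_advanced`, and `pct_other` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with hypothetical rows.
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "11th", "Doctorate", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# One mask per question: who has an advanced degree, and who earns >50K.
advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
high_pay = toy["salary"] == ">50K"

# Share of >50K earners inside each group (mean of a boolean Series).
pct_advanced = round(100 * high_pay[advanced].mean(), 1)
pct_other = round(100 * high_pay[~advanced].mean(), 1)
print(pct_advanced, pct_other)
```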

### What is the minimum number of hours a person works per week (hours-per-week feature)?

In [14]:
min_work_hours = df["hours-per-week"].min()

### What percentage of the people who work the minimum number of hours per week have a salary of >50K?

In [15]:
# Use the computed minimum rather than a hard-coded 1
min_work = df.loc[df["hours-per-week"] == min_work_hours]
num_min_workers = df.loc[(df["hours-per-week"] == min_work_hours) & (df["salary"] == ">50K")]

In [16]:
rich_percentage = round(100 * len(num_min_workers) / len(min_work) , 1)

### What country has the highest percentage of people that earn >50K?

In [17]:
new_df = df.loc[(df["salary"] == ">50K")]["native-country"].value_counts()
new_df2 = df["native-country"].value_counts()

In [18]:
richest = (new_df / new_df2).max()

In [19]:
highest_earning_country = (new_df/new_df2).sort_values(ascending = False).index[0]
highest_earning_country_percentage = round(100 * richest, 1 )
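
Dividing one `value_counts()` result by another aligns the two Series on their index, so a country with no >50K earners is absent from the numerator and comes out as NaN rather than 0. That does not affect the maximum, but it is worth knowing when inspecting the full ratio. An alternative sketch groups a boolean flag by country and takes its mean, which yields 0 for such countries. The toy frame and country labels are hypothetical.

```python
import pandas as pd

# Toy frame; "A", "B", "C" are placeholder country names.
toy = pd.DataFrame({
    "native-country": ["A", "A", "A", "B", "B", "C"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Flag high earners, then average the flag within each country.
high_pay = toy["salary"] == ">50K"
share_by_country = high_pay.groupby(toy["native-country"]).mean()

# idxmax returns the index label (the country) of the largest share.
top_country = share_by_country.idxmax()
top_pct = round(100 * share_by_country.max(), 1)
print(top_country, top_pct)
```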

### Identify the most popular occupation for those who earn >50K in India.

In [20]:
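# Occupations of >50K earners whose native country is India
india = df.loc[(df["salary"] == ">50K") & (df["native-country"] == "India"), "occupation"]
# mode() returns a Series of the most frequent value(s); take the first
top_IN_occupation = india.mode()[0]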
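
`mode()` returns every value tied for most frequent, in sorted order, so indexing with `[0]` silently picks the alphabetically first one when there is a tie. To see the counts behind the answer, `value_counts()` gives the full ranking. This is a small sketch on a toy Series with made-up rows.

```python
import pandas as pd

# Toy occupation column (hypothetical rows).
occupations = pd.Series(
    ["Prof-specialty", "Exec-managerial", "Prof-specialty", "Sales"]
)

# Frequency of each occupation, most common first.
counts = occupations.value_counts()

# Label of the most frequent occupation.
top_occupation = counts.idxmax()
print(counts)
print(top_occupation)
```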

## Print the results

In [23]:
print("Number of each race:\n", race_count)
print("Average age of men:", average_age_men)
print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
print(f"Min work time: {min_work_hours} hours/week")
print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
print("Country with highest percentage of rich:", highest_earning_country)
print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
print("Top occupations in India:", top_IN_occupation)

Number of each race:
 White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
Average age of men: 39.4
Percentage with Bachelors degrees: 16.4%
Percentage with higher education that earn >50K: 46.5%
Percentage without higher education that earn >50K: 17.4%
Min work time: 1 hours/week
Percentage of rich among those who work fewest hours: 10.0%
Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.9%
Top occupations in India: Prof-specialty
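
Since the stated goal is a function, the cells above could be gathered into one along the following lines. This is a sketch under assumptions: the function name `summarize_demographics`, its dictionary return shape, and the toy frame are all hypothetical and not taken from the assignment.

```python
import pandas as pd


def summarize_demographics(df: pd.DataFrame) -> dict:
    """Compute the notebook's summary statistics for a frame with the same columns."""
    high_pay = df["salary"] == ">50K"
    advanced = df["education"].isin(["Bachelors", "Masters", "Doctorate"])
    min_hours = df["hours-per-week"].min()
    works_min = df["hours-per-week"] == min_hours
    share_by_country = high_pay.groupby(df["native-country"]).mean()
    india_jobs = df.loc[high_pay & (df["native-country"] == "India"), "occupation"]

    def pct(flags: pd.Series) -> float:
        # Share of True values as a percentage, rounded to one decimal.
        return round(100 * float(flags.mean()), 1)

    return {
        "race_count": df["race"].value_counts(),
        "average_age_men": round(float(df.loc[df["sex"] == "Male", "age"].mean()), 1),
        "percentage_bachelors": pct(df["education"] == "Bachelors"),
        "higher_education_rich": pct(high_pay[advanced]),
        "lower_education_rich": pct(high_pay[~advanced]),
        "min_work_hours": int(min_hours),
        "rich_percentage": pct(high_pay[works_min]),
        "highest_earning_country": share_by_country.idxmax(),
        "highest_earning_country_percentage": round(100 * float(share_by_country.max()), 1),
        # None when the frame has no >50K earners from India.
        "top_IN_occupation": india_jobs.mode().iloc[0] if not india_jobs.empty else None,
    }


# Toy frame with hypothetical rows, only to exercise the function.
toy = pd.DataFrame({
    "age": [30, 40, 50, 20],
    "sex": ["Male", "Female", "Male", "Male"],
    "race": ["White", "Black", "White", "White"],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
    "hours-per-week": [40, 10, 40, 10],
    "native-country": ["India", "India", "United-States", "United-States"],
    "occupation": ["Prof-specialty", "Sales", "Sales", "Sales"],
    "salary": [">50K", "<=50K", ">50K", ">50K"],
})
result = summarize_demographics(toy)
print(result["average_age_men"], result["highest_earning_country"])
```

Returning a dictionary keeps the computation separate from the printing, so the same function can feed either the `print` cell above or automated checks.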
