
# Demographic Data Analyzer: Adult Income and Demographic Data

**Author:** Hamed Ahmadinia  
**Date:** 2.1.2024

This notebook analyzes a demographic dataset of adults. Each record describes one person: age, work class, education, marital status, occupation, race, sex, native country, and whether they earn more than 50K.

The goal is to summarize the data: how demographic attributes are distributed, how averages differ between groups, and which groups have the largest share of people earning more than 50K.


In [4]:
import pandas as pd

In [6]:
### Step 1: Data Loading
# We start by loading the dataset and assigning explicit column names. The dataset contains demographic information on adults in the United States.
# The columns cover age, work class, education, marital status, occupation, race, sex, native country, and salary bracket.

# Load the dataset with explicit column names; skiprows=1 skips the file's first row (its own header)
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'salary']

df = pd.read_csv('adult.data.csv', names=column_names, skiprows=1)
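
The interaction between `names=` and `skiprows=1` can be checked on a small in-memory file. The sketch below is illustrative only: the three-column CSV text and the `toy` names are made up for the example and are not part of `adult.data.csv`.

```python
import io
import pandas as pd

# Hypothetical three-column file that carries its own header row.
raw = "Age,Sex,Income\n39,Male,<=50K\n50,Female,>50K\n"
names = ['age', 'sex', 'salary']

# names= supplies our labels; skiprows=1 discards the file's header line
# so it is not read as a data record.
toy = pd.read_csv(io.StringIO(raw), names=names, skiprows=1)

# Without skiprows=1 the header line becomes the first data row,
# and the numeric column is read as text.
toy_bad = pd.read_csv(io.StringIO(raw), names=names)

print(toy.shape)      # (2, 3)
print(toy_bad.shape)  # (3, 3)
```

Passing `header=0` together with `names=` is another way to replace an existing header row.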


In [10]:
### Step 2: Race Distribution Analysis
# In this step, we count how many people of each race are represented in the dataset.

race_count = df['race'].value_counts()
print("Race count:\n", race_count)

Race count:
 race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64
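
Raw counts are easier to compare as shares of the total. A minimal sketch on made-up data (the `toy` frame is hypothetical), using the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Hypothetical five-row sample, not taken from the real dataset.
toy = pd.DataFrame({'race': ['White', 'White', 'Black', 'White', 'Other']})

counts = toy['race'].value_counts()                # absolute frequencies
shares = toy['race'].value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares.round(2))
```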


In [12]:
### Step 3: Average Age of Men
# Next, we calculate the average age of men in the dataset. This gives a single summary figure for the age of male respondents.

average_age_men = df[df['sex'] == 'Male']['age'].mean()
print("\nAverage age of men:", average_age_men)


Average age of men: 39.43354749885268
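
The same filter-then-aggregate pattern can be written with a named boolean mask and `.loc`, which selects rows and column in one step. This is a sketch on hypothetical data, not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical sample with the same column names as the notebook.
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'age': [30, 40, 50, 20],
})

is_male = toy['sex'] == 'Male'                  # boolean mask, one entry per row
mean_age_men = toy.loc[is_male, 'age'].mean()   # rows where mask is True, 'age' column

print(round(mean_age_men, 1))  # 40.0
```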


In [14]:
### Step 4: Education and Salary Analysis
# In this step, we analyze the relationship between education and salary. Specifically, we compare the percentage earning more than 50K among individuals with a Bachelor's, Master's, or Doctorate degree against the percentage among everyone else.

higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]

higher_education_rich = higher_education[higher_education['salary'] == '>50K'].shape[0] / higher_education.shape[0]
lower_education_rich = lower_education[lower_education['salary'] == '>50K'].shape[0] / lower_education.shape[0]

print(f"Percentage with higher education that earn >50K: {higher_education_rich:.2%}")
print(f"Percentage without higher education that earn >50K: {lower_education_rich:.2%}")

Percentage with higher education that earn >50K: 46.54%
Percentage without higher education that earn >50K: 17.37%
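
Because the mean of a boolean column is the fraction of `True` values, both rates can also be obtained from one `groupby`. The following is a sketch on made-up rows; the labels `'degree'` and `'other'` are names chosen for this example.

```python
import pandas as pd

# Hypothetical sample, not taken from the real dataset.
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Masters', 'HS-grad', 'Doctorate', 'Some-college'],
    'salary':    ['>50K',      '<=50K',   '<=50K',   '<=50K',   '>50K',      '>50K'],
})

has_degree = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
labels = has_degree.map({True: 'degree', False: 'other'})  # readable group names
earns_over_50k = toy['salary'] == '>50K'

# Mean of a boolean Series within each group = share earning >50K in that group.
rates = earns_over_50k.groupby(labels).mean()

print(rates.map('{:.2%}'.format))
```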


In [16]:
### Step 5: Working Hours and Salary
# Here we take the smallest number of hours per week reported in the dataset and calculate what percentage of the people working more than that minimum earn more than 50K.

min_work_hours = df['hours-per-week'].min()
rich_percentage = df[(df['hours-per-week'] > min_work_hours) & (df['salary'] == '>50K')].shape[0] / df[df['hours-per-week'] > min_work_hours].shape[0]
print(f"Percentage of people working more than {min_work_hours} hours per week and earning >50K: {rich_percentage:.2%}")

Percentage of people working more than 1 hours per week and earning >50K: 24.09%
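
The comparison operator matters here: `>` selects everyone above the minimum, while `==` selects only the people who work exactly the minimum number of hours. The sketch below contrasts the two readings on hypothetical data; which one is intended depends on the question being asked.

```python
import pandas as pd

# Hypothetical sample, not taken from the real dataset.
toy = pd.DataFrame({
    'hours-per-week': [1, 1, 40, 40, 60],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '>50K'],
})

min_hours = toy['hours-per-week'].min()
rich = toy['salary'] == '>50K'

at_min = toy['hours-per-week'] == min_hours     # exactly the minimum
above_min = toy['hours-per-week'] > min_hours   # strictly more than the minimum

rate_at_min = rich[at_min].mean()         # share of >50K among minimum-hour workers
rate_above_min = rich[above_min].mean()   # share of >50K among everyone else

print(f"{rate_at_min:.2%} vs {rate_above_min:.2%}")
```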


In [18]:
### Step 6: Country-wise Income Analysis
# In this step, we group people by native country and find the country with the highest percentage of individuals earning more than 50K.

country_salary = df.groupby('native-country')['salary'].apply(lambda x: (x == '>50K').mean()).sort_values(ascending=False)
highest_earning_country = country_salary.idxmax()
highest_earning_country_percentage = country_salary.max()

print(f"Country with highest percentage of people earning >50K: {highest_earning_country}")
print(f"Highest percentage of people earning >50K: {highest_earning_country_percentage:.2%}")

Country with highest percentage of people earning >50K: Iran
Highest percentage of people earning >50K: 41.86%
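
A per-country rate says nothing about how many rows it rests on, so it can help to report the group size next to it. A sketch on made-up data follows; the country labels `A`, `B`, `C` and the cutoff of 2 rows are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical sample: country C has a single row.
toy = pd.DataFrame({
    'native-country': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'salary': ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '<=50K', '>50K'],
})

summary = (
    toy.assign(rich=toy['salary'] == '>50K')
       .groupby('native-country')['rich']
       .agg(rate='mean', n='count')   # share earning >50K and number of rows
)

top_overall = summary['rate'].idxmax()

# Same ranking, restricted to groups with at least 2 rows (arbitrary cutoff).
reliable = summary[summary['n'] >= 2]
top_reliable = reliable['rate'].idxmax()

print(summary)
print(top_overall, top_reliable)
```

In this toy sample the single-row country ranks first until the size filter is applied.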


In [20]:
### Step 7: Most Popular Occupation Among High Earners
# We identify the most popular occupation among individuals earning more than 50K whose native country is the United States.

top_US_occupation = df[(df['salary'] == '>50K') & (df['native-country'] == 'United-States')]['occupation'].value_counts().idxmax()
print(f"Most popular occupation for those who earn >50K in the United States: {top_US_occupation}")

Most popular occupation for those who earn >50K in the United States: Exec-managerial


In [24]:
### Step 9: Popular Occupation Among High Earners in India
# Here we find the most popular occupation among individuals earning more than 50K whose native country is India.

top_india_occupation = df[(df['salary'] == '>50K') & (df['native-country'] == 'India')]['occupation'].value_counts().idxmax()
print(f"Most popular occupation for those who earn >50K in India: {top_india_occupation}")

Most popular occupation for those who earn >50K in India: Prof-specialty
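
The United States and India steps apply the same filter with a different country value, so the logic could be wrapped in a small helper. `top_occupation` below is a hypothetical function written for this sketch, shown on made-up data:

```python
import pandas as pd

def top_occupation(frame, country, salary_label='>50K'):
    """Most frequent occupation among rows matching country and salary label, or None."""
    mask = (frame['salary'] == salary_label) & (frame['native-country'] == country)
    counts = frame.loc[mask, 'occupation'].value_counts()
    return counts.idxmax() if not counts.empty else None

# Hypothetical sample, not taken from the real dataset.
toy = pd.DataFrame({
    'native-country': ['India', 'India', 'India', 'United-States', 'United-States'],
    'occupation': ['Prof-specialty', 'Prof-specialty', 'Sales', 'Exec-managerial', 'Sales'],
    'salary': ['>50K', '>50K', '>50K', '>50K', '<=50K'],
})

print(top_occupation(toy, 'India'))
print(top_occupation(toy, 'United-States'))
```

Returning `None` for an empty selection avoids the error that `idxmax()` raises on an empty Series.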


In [26]:
### Step 10: Average Working Hours Comparison by Gender
# We will now calculate the average hours worked per week for individuals earning more than 50K, and compare the results between men and women.

average_hours_per_week = df[df['salary'] == '>50K'].groupby('sex')['hours-per-week'].mean()
print("Average hours-per-week for people earning >50K:")
print(average_hours_per_week)

Average hours-per-week for people earning >50K:
sex
Female    40.426633
Male      46.366106
Name: hours-per-week, dtype: float64
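
To see whether the gap between men and women is specific to high earners, the same average can be computed for both salary brackets at once. This is a sketch on hypothetical data, grouping by two keys and pivoting `sex` into columns with `unstack`:

```python
import pandas as pd

# Hypothetical sample, not taken from the real dataset.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'salary': ['>50K', '<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'hours-per-week': [50, 40, 42, 35, 46, 37],
})

# Rows: salary bracket; columns: sex; values: mean weekly hours.
table = toy.groupby(['salary', 'sex'])['hours-per-week'].mean().unstack('sex')
gap = table['Male'] - table['Female']   # difference in mean hours per bracket

print(table)
print(gap)
```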


In [28]:
### Step 11: Common Marital Status Among High Earners
# We find the most common marital status among people who earn more than 50K.

common_marital_status = df[df['salary'] == '>50K']['marital-status'].value_counts().idxmax()
print(f"Most common marital status for people earning >50K: {common_marital_status}")

Most common marital status for people earning >50K: Married-civ-spouse


In [30]:
### Step 12: Occupation with Highest Average Capital Gain
# Among people earning more than 50K, we find the occupation with the highest average capital gain.

highest_avg_capital_gain_occupation = df[df['salary'] == '>50K'].groupby('occupation')['capital-gain'].mean().idxmax()
print(f"Occupation with highest average capital gain: {highest_avg_capital_gain_occupation}")

Occupation with highest average capital gain: Priv-house-serv
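
A group mean can be dominated by a few large values, so a result like this one may be worth checking against the median and the group size. The sketch below uses made-up occupations `X` and `Y` to show how the two rankings can disagree; it makes no claim about the real data.

```python
import pandas as pd

# Hypothetical sample: occupation X has one very large capital gain.
toy = pd.DataFrame({
    'occupation': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'capital-gain': [0, 0, 30000, 3000, 4000, 5000],
})

stats = toy.groupby('occupation')['capital-gain'].agg(['mean', 'median', 'count'])

top_by_mean = stats['mean'].idxmax()      # pulled up by the single outlier
top_by_median = stats['median'].idxmax()  # reflects the typical row

print(stats)
print(top_by_mean, top_by_median)
```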


In [32]:
### Step 13: Average Age by Race
# Next, we will calculate the average age for individuals of each race in the dataset.

average_age_by_race = df.groupby('race')['age'].mean()
print("Average age by race:")
print(average_age_by_race)

Average age by race:
race
Amer-Indian-Eskimo    37.173633
Asian-Pac-Islander    37.746872
Black                 37.767926
Other                 33.457565
White                 38.769881
Name: age, dtype: float64


In [34]:
### Step 14: Percentage of High Earners by Race
# Lastly, we compute the share of individuals of each race who earn more than 50K. The values are printed as fractions, so 0.255860 corresponds to 25.59%.

percentage_earning_by_race = df.groupby('race')['salary'].apply(lambda x: (x == '>50K').mean())
print("Percentage of each race earning more than 50K:")
print(percentage_earning_by_race)

Percentage of each race earning more than 50K:
race
Amer-Indian-Eskimo    0.115756
Asian-Pac-Islander    0.265640
Black                 0.123880
Other                 0.092251
White                 0.255860
Name: salary, dtype: float64
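
If actual percentages are preferred over fractions, one option is a row-normalized cross-tabulation scaled by 100. This is a sketch on hypothetical data using `pd.crosstab` with `normalize='index'`:

```python
import pandas as pd

# Hypothetical sample, not taken from the real dataset.
toy = pd.DataFrame({
    'race': ['White', 'White', 'White', 'White', 'Black', 'Black'],
    'salary': ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '<=50K'],
})

# normalize='index' makes each row sum to 1; multiplying by 100 gives percentages.
pct = pd.crosstab(toy['race'], toy['salary'], normalize='index').mul(100).round(2)

print(pct)
```

Each row of `pct` sums to 100, so both salary brackets are visible side by side for every race.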
