1. Introduction

Extracting meaningful insights from raw data is a core part of modern data analysis.

Among the many datasets available for analysis, the “Adult Census Income” data can be used to understand the socio-economic factors that influence income levels.

This dataset, drawn from the 1994 U.S. Census database, includes demographic attributes such as age, education, and occupation.

The primary goal of analysing this dataset is to predict whether an individual earns more than $50K per year, a task with implications for economic policy, business strategy, and social research.

2. Data Description

age: The age of the individual.

workclass: The type of employer or self-employment status.

fnlwgt: Final weight, an estimate of the number of people in the population that the observation represents.

education: The highest level of education attained.

education-num: A numeric encoding of the education level.

marital-status: Marital status of the individual.

occupation: The type of job held by the individual.

relationship: The relationship of the individual to other members of the household.

race: The race of the individual.

sex: The gender of the individual.

capital-gain: Income from investment sources, apart from wages/salary.

capital-loss: Losses from investments.

hours-per-week: The number of hours the individual works per week.

native-country: The country of origin of the individual.

income: The income level, which is the target variable, indicating whether the income exceeds $50K or not.


In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, auc, accuracy_score, roc_auc_score, roc_curve

In [5]:
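# Load the dataset
df = pd.read_csv(r"adult_Income_dataset.csv")
df.info()   # prints column names, non-null counts, and dtypes
df.shape    # (rows, columns); displayed because it is the cell's last expression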

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


(48842, 15)
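
The non-null counts above imply missing values in three columns: 48842 - 47879 = 963 in workclass, 48842 - 47876 = 966 in occupation, and 48842 - 48568 = 274 in native-country. A minimal sketch of counting missing values per column follows. It uses a small made-up frame (`toy`, an illustrative stand-in with invented rows) rather than the real file.

```python
import pandas as pd

# Toy stand-in for df; the rows are illustrative only, not taken from the dataset
toy = pd.DataFrame({
    "age": [39, 64, 38, 44],
    "workclass": ["Private", None, "Private", "Private"],
    "occupation": ["Prof-specialty", None, "Prof-specialty", "Adm-clerical"],
})

# isna() flags missing cells; sum() counts them column by column
missing_counts = toy.isna().sum()

# Show only the columns that actually have missing values
print(missing_counts[missing_counts > 0])
```

On the real data, the same two calls would presumably be applied to `df` directly.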

In [None]:
# Encode the 'income' column as a binary variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Some labels carry a trailing period (e.g. '>50K.'); normalise them in the income column only
df['income'] = df['income'].replace({'<=50K.': '<=50K', '>50K.': '>50K'})
df['income_encoded'] = le.fit_transform(df['income'])  # '<=50K' -> 0, '>50K' -> 1
df.tail()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,income_encoded
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K,0
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K,0
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K,0
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K,0
48841,35,Self-emp-inc,182148,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,60,United-States,>50K,1
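
LabelEncoder assigns integer codes in sorted order of the labels, which is why '<=50K' becomes 0 and '>50K' becomes 1. An explicit dictionary makes that mapping visible in the code itself. The sketch below runs on a toy Series; the names `income`, `cleaned`, and `income_map` are assumptions for illustration.

```python
import pandas as pd

# Toy labels, including the trailing-period variants seen in the raw data
income = pd.Series(["<=50K", ">50K.", "<=50K.", ">50K"])

# Strip the trailing period so both spellings collapse to one label
cleaned = income.str.rstrip(".")

# Explicit mapping: the 0/1 meaning is stated rather than inferred from sort order
income_map = {"<=50K": 0, ">50K": 1}
encoded = cleaned.map(income_map)

print(encoded.tolist())  # [0, 1, 0, 1]
```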


In [None]:
# Specify the predictor variables and target variable
# Drop both 'income' and its encoded copy so the target does not leak into X
X = df.drop(['income', 'income_encoded'], axis=1)
y = df['income_encoded']
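
scikit-learn's RandomForestClassifier expects numeric features, so object columns such as workclass would need to be encoded before a model is fitted. One common option, not necessarily the one this analysis goes on to use, is one-hot encoding with `pd.get_dummies`. A small sketch on made-up rows (`toy_X` is a hypothetical stand-in for X):

```python
import pandas as pd

# Toy predictors with one numeric and one categorical column (illustrative values)
toy_X = pd.DataFrame({
    "age": [39, 64, 35],
    "workclass": ["Private", "Private", "Self-emp-inc"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded_X = pd.get_dummies(toy_X, columns=["workclass"])

print(encoded_X.columns.tolist())
```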

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
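
Passing `stratify=y` keeps the share of each income class roughly equal in the training and test sets. The sketch below checks this on a synthetic target; the 24% positive rate and the `*_toy` names are assumptions chosen for illustration, not figures from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic binary target: 240 positives out of 1000 (illustrative ratio)
y_toy = np.array([1] * 240 + [0] * 760)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification, the positive rate should be nearly identical in all three
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```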


In [None]:
# 44% of income goes to tax
# Salary: 300,000
Amount_before_tax = 300000
tax = 0.44
# Use the tax rate directly instead of hard-coding 0.56
Amount_after_tax = (1 - tax) * Amount_before_tax
print(Amount_after_tax)
print(f"The amount after tax is: {Amount_after_tax}")

168000.00000000003
The amount after tax is: 168000.00000000003
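
The trailing digits in 168000.00000000003 come from binary floating-point representation (0.56 has no exact binary form), not from an error in the tax arithmetic. For currency amounts, Python's standard `decimal` module keeps exact decimal values. A minimal sketch using the same figures, with lowercase variable names chosen here for illustration:

```python
from decimal import Decimal

# Build Decimals from strings so no float rounding is introduced
amount_before_tax = Decimal("300000")
tax_rate = Decimal("0.44")

# Exact decimal arithmetic: (1 - 0.44) * 300000
amount_after_tax = (1 - tax_rate) * amount_before_tax

print(f"The amount after tax is: {amount_after_tax}")  # The amount after tax is: 168000.00
```

If only the display matters, formatting the float with `:.2f` is a lighter alternative.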


In [2]:
# A set is unordered, so the print order can vary between runs
names = {"A", "B", "C", "D", "E"}
for name in names:
    print(name)

D
B
C
A
E
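
A set has no defined iteration order, which is why the letters above do not come out alphabetically. If a stable order matters, sorting the set (or storing the names in a list to begin with) is one option. A small sketch, where `ordered_names` is a name introduced for illustration:

```python
names = {"A", "B", "C", "D", "E"}

# sorted() returns a new list in ascending order, regardless of set ordering
ordered_names = sorted(names)

for name in ordered_names:
    print(name)
```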


In [2]:
# Print the numbers 19 through 25 with a while loop
count = 19
while count <= 25:
    print(count)
    count += 1

19
20
21
22
23
24
25
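
The same counting can be written as a for loop over `range`, which removes the manual counter. The upper bound of `range` is exclusive, so 26 is needed to include 25. The `printed` list is added here only to make the result easy to inspect.

```python
printed = []

# range(19, 26) yields 19, 20, ..., 25; the stop value 26 is excluded
for count in range(19, 26):
    print(count)
    printed.append(count)
```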
