# US Adult Income

## Purpose of the project

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">
...
</div>

## Imports

In [1]:
import pandas as pd

## Table of Contents

- [1. Dataset](#dataset) 
    - [a. Description](#description) 
    - [b. Data cleaning](#data-cleaning) 

## Dataset

### Description

<div style="background-color: rgba(255, 255, 0, 0.15); padding: 8px;">

The US Adult Income dataset, sourced from [Kaggle](https://www.kaggle.com/datasets/johnolafenwa/us-census-data), was originally extracted by Barry Becker from the 1994 US Census database. It contains anonymized data on social and economic factors such as occupation, age, native country, race, capital gain, capital loss, education, and work class.

Each entry in the dataset is labeled based on income, categorizing individuals as earning either ">50K" or "<=50K" annually. This classification allows for the analysis of how different social factors correlate with income levels.

The dataset is divided into two CSV files:
* `adult-training.csv`: Contains data used for training models.
* `adult-test.csv`: Contains data used for testing models.

This dataset is commonly used for machine learning tasks such as income prediction and the analysis of social factors.
</div>

### Data cleaning

##### Read data

In [2]:
# Define path to data files
train_file = "data/adult-training.csv"
test_file = "data/adult-test.csv"

In [3]:
# Define columns
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country", "income_bracket"]

In [4]:
# Read data into DataFrames; column names are passed explicitly via `names`.
# skiprows=1 skips the first line of the test file.
df_train = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True, engine="python")
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1, engine="python")

##### Investigate the data

In [5]:
# Shape of each set as (rows, columns)
print("Train set :", df_train.shape)
print("Test set :", df_test.shape)

Train set : (32561, 15)
Test set : (16281, 15)


In [None]:
# Check if there are any missing (NaN) values in the DataFrame
print("Any missing (NaN) values in the train set ? :", df_train.isnull().values.any())
print("Any missing (NaN) values in the test set ? :", df_test.isnull().values.any())

Any missing (NaN) values in the train set ? : False
Any missing (NaN) values in the test set ? : False
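
Note that `isnull()` only detects real NaN values. This dataset marks unknown entries with a `?` string (visible in the `workclass` and `occupation` columns of the test-set preview later in this notebook), so the check above can report `False` even though values are missing. A minimal sketch of how such placeholders could be counted and converted to NaN, assuming three rows abbreviated from that preview and a hypothetical three-column subset:

```python
import io

import numpy as np
import pandas as pd

# Three rows abbreviated from the test-set preview (hypothetical column subset)
sample = io.StringIO(
    "25, Private, Machine-op-inspct\n"
    "18, ?, ?\n"
    "64, ?, ?\n"
)
df = pd.read_csv(sample, names=["age", "workclass", "occupation"],
                 skipinitialspace=True)

# "?" is an ordinary string, so isnull() does not flag it
has_nan_before = df.isnull().values.any()

# Count "?" placeholders per column
question_marks = (df == "?").sum()

# Convert placeholders to NaN so pandas treats them as missing
df_clean = df.replace("?", np.nan)
has_nan_after = df_clean.isnull().values.any()

print(question_marks)
print("NaN before:", has_nan_before, "| NaN after:", has_nan_after)
```

Whether rows with unknown values should then be dropped or imputed is a modelling decision this sketch does not make.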



In [11]:
# Train head
df_train.head()

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [15]:
# Distribution of income values in train set
df_train.income_bracket.value_counts(normalize=True)

income_bracket
<=50K    0.75919
>50K     0.24081
Name: proportion, dtype: float64

In [16]:
# Distribution of income values in test set
df_test.income_bracket.value_counts(normalize=True)

income_bracket
<=50K.    0.763774
>50K.     0.236226
Name: proportion, dtype: float64
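
The test-set labels carry a trailing period (`<=50K.`, `>50K.`) that the training labels do not, so the two sets would not share category values if `income_bracket` were compared or encoded directly. One possible way to align them, shown here on a small hypothetical Series rather than the real `df_test`, is to strip the trailing period:

```python
import pandas as pd

# Hypothetical sample using the label spellings observed in the test set
test_labels = pd.Series(["<=50K.", ">50K.", "<=50K."], name="income_bracket")

# Remove the trailing "." so the values match the training-set spelling
normalized = test_labels.str.rstrip(".")

print(normalized.unique().tolist())  # ['<=50K', '>50K']
```

On the real data the same call would be applied as `df_test["income_bracket"].str.rstrip(".")`; this is a suggestion and not a step the notebook performs.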

In [19]:
# Descriptive statistics
df_train.describe(include='all')

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket,Label
count,32561.0,32561,32561.0,32561,32561.0,32561,32561,32561,32561,32561,32561.0,32561.0,32561.0,32561,32561,32561.0
unique,,9,,16,,7,15,6,5,2,,,,42,2,
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K,
freq,,22696,,10501,,14976,4140,13193,27816,21790,,,,29170,24720,
mean,38.581647,,189778.4,,10.080679,,,,,,1077.648844,87.30383,40.437456,,,0.24081
std,13.640433,,105550.0,,2.57272,,,,,,7385.292085,402.960219,12.347429,,,0.427581
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,,0.0
25%,28.0,,117827.0,,9.0,,,,,,0.0,0.0,40.0,,,0.0
50%,37.0,,178356.0,,10.0,,,,,,0.0,0.0,40.0,,,0.0
75%,48.0,,237051.0,,12.0,,,,,,0.0,0.0,45.0,,,0.0


##### Clean data

In [17]:
# Binary label: 1 if income is >50K, else 0 (the substring test also matches the test set's ">50K." spelling)
df_train['Label'] = (df_train["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
df_test['Label'] = (df_test["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
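
The same labeling can also be written with pandas' vectorized string methods instead of `apply` with a lambda. A minimal sketch on a hypothetical Series containing both label spellings seen in this notebook:

```python
import pandas as pd

# Labels in both formats seen in this notebook (train: ">50K", test: ">50K.")
income = pd.Series(["<=50K", ">50K", "<=50K.", ">50K."], name="income_bracket")

# regex=False treats ">50K" as a literal substring rather than a pattern
label = income.str.contains(">50K", regex=False).astype(int)

print(label.tolist())  # [0, 1, 0, 1]
```

Both forms give the same result here; the vectorized version mainly avoids a Python-level function call per row.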

In [20]:
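# Remove rows containing NaN; dropna returns a new DataFrame, so the result is assigned back
df_train = df_train.dropna(how="any", axis=0)
df_test = df_test.dropna(how="any", axis=0)
df_test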

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket,Label
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.,0
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.,1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.,1
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.,0
16277,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K.,0
16278,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.,0
16279,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.,0


In [21]:
# Shape of each set after cleaning, as (rows, columns)
print("Train set :", df_train.shape)
print("Test set :", df_test.shape)

Train set : (32561, 16)
Test set : (16281, 16)
