# UCI Data Set: Person Income Prediction

## Introduction
  This data set contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau.  The data contains 41 demographic and employment related variables.

  The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions, this field must be used. This attribute should *not* be used in the classifiers.

  One instance per line with comma delimited fields. There are 199523 instances in the data file and 99762 in the test file.

  The data was split into train/test in approximately 2/3, 1/3 proportions using MineSet's MIndUtil mineset-to-mlc.

## Prediction Task
  Prediction task is to determine the income level for the person represented by the record.  Incomes have been binned at the $50K level to present a binary classification problem, much like the original UCI/ADULT database.  The goal field of this data, however, was drawn from the "total person income" field rather than the "adjusted gross income" and may, therefore, behave differently than the orginal ADULT goal field.
  
## Feature Description
- age(continuous)
- class_of_worker(nominal)
- detailed_industry_recode(nominal)
    - Numerical representation of major industry code. This column will be ignored.
- detailed_occupation_recode(nominal)
    - Numerical representation of major occupation code. This column will be ignored.    
- education_level(nominal)
- wage_per_hour(continues)
- enrolled_in_edu_inst_last_wk(nominal)
    - Not in universe, high school, college or university
- marital_status(nominal)
- major_industry_code(nominal)
    - different kinds of job categories
- major_occupation_code(nominal)
- race(nominal)
- hispanic_origin(nominal)
- sex(nominal)
- member_of_a_laber_union(nominal)
- reason_for_unemployment(nominal)
- full_or_part_time_employment_stat(nominal)
- capital_gains(continues)
- capital_losses(continues)
- divdends_from_stocks(continues)
- tax_filer_status(nominal)
- region_of_previous_residence(nominal)
- state_of_previous_residence(nominal)
- detailed_household_and_family_stat(nominal)
    - detailed information of child and grandchild in the family
- detailed household summary in household_household(nominal)
- instance_weight 
    - *The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions, this field must be used. This attribute should **not** be used in the classifiers.*
- migration_code_change_in_msa(nominal)
    - Migration Skills Assessment
- migration_code_change_in_reg(nominal)
- migration_code_move_within_reg(nominal)
- live_in_this_house_1_year_ago(nominal)
- migration_prev_res_in_sunbelt(nominal)
- num_persons_worked_for_employer(continuous)
- family_members_under_18(nominal)
- country_of_birth_father(nominal)
- country_of_birth_mother(nominal)
- country_of_birth_self(nominal)
- citizenship(nominal)
- own_business_or_self_employed(nominal)
- fill_inc_questionnaire_for_veterans_admin(nominal)
- veterans_benefits(nominal)
- weeks_worked_in_year(continues)
- year(nominal)

## Label
- total_person_income(nominal)

In [1]:
import pandas as pd
import plotly.express as px

In [2]:
path = "census-income.data"
column_names = [
    "age",
    "class_of_worker",
    "detailed_industry_recode",
    "detailed_occupation_recode",
    "education_level",
    "wage_per_hour",
    "enrolled_in_edu_inst_last_wk",
    "marital_status",
    "major_industry_code",
    "major_occupation_code",
    "race",
    "hispanic_origin",
    "sex",
    "member_of_a_laber_union",
    "reason_for_unemployment",
    "full_or_part_time_employment_stat",
    "capital_gains",
    "capital_losses",
    "divdends_from_stocks",
    "tax_filer_status",
    "region_of_previous_residence",
    "state_of_previous_residence",
    "detailed_household_and_family_stat",
    "detailed_household_summary_in_household",
    "instance_weight",
    "migration_code_change_in_msa",
    "migration_code_change_in_reg",
    "migration_code_move_within_reg",
    "live_in_this_house_1_year_ago",
    "migration_prev_res_in_sunbelt",
    "num _persons_worked_for_employer",
    "family_members_under_18",
    "country_of_birth_father",
    "country_of_birth_mother",
    "country_of_birth_self",
    "citizenship",
    "own_business_or_self_employed",
    "fill_inc_questionnaire_for_veterans_admin",
    "veterans_benefits",
    "weeks_worked_in_year",
    "year",
    "total_person_income"
]

train_data = pd.read_csv(path, names=column_names, index_col=False)
train_data = train_data.drop(columns=['instance_weight', 'detailed_industry_recode', 'detailed_occupation_recode'])
test_data = None

In [3]:
train_data.head()

Unnamed: 0,age,class_of_worker,education_level,wage_per_hour,enrolled_in_edu_inst_last_wk,marital_status,major_industry_code,major_occupation_code,race,hispanic_origin,...,country_of_birth_father,country_of_birth_mother,country_of_birth_self,citizenship,own_business_or_self_employed,fill_inc_questionnaire_for_veterans_admin,veterans_benefits,weeks_worked_in_year,year,total_person_income
0,73,Not in universe,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,10th grade,0,High school,Never married,Not in universe or children,Not in universe,Asian or Pacific Islander,All other,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
3,9,Not in universe,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,10,Not in universe,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.


In [4]:
train_data.describe()

Unnamed: 0,age,wage_per_hour,capital_gains,capital_losses,divdends_from_stocks,num _persons_worked_for_employer,own_business_or_self_employed,veterans_benefits,weeks_worked_in_year,year
count,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0
mean,34.494199,55.426908,434.71899,37.313788,197.529533,1.95618,0.175438,1.514833,23.174897,94.499672
std,22.310895,274.896454,4697.53128,271.896428,1984.163658,2.365126,0.553694,0.851473,24.411488,0.500001
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.0
25%,15.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,94.0
50%,33.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,8.0,94.0
75%,50.0,0.0,0.0,0.0,0.0,4.0,0.0,2.0,52.0,95.0
max,90.0,9999.0,99999.0,4608.0,99999.0,6.0,2.0,2.0,52.0,95.0


In [8]:
features = list(set(train_data.columns.values.tolist())-set(['total_person_income']))
numeric_features = ['age', 
                    'wage_per_hour', 
                    'capital_gains', 
                    'capital_losses', 
                    'divdends_from_stocks', 
                    'num_persons_worked_for_employer',
                    'weeks_worked_in_year'
                   ]
categorical_features = list(set(features) - set(numeric_features))
print(features)
print(numeric_features)
print(categorical_features)

['sex', 'migration_code_change_in_msa', 'full_or_part_time_employment_stat', 'race', 'country_of_birth_mother', 'country_of_birth_father', 'hispanic_origin', 'detailed_household_summary_in_household', 'year', 'num _persons_worked_for_employer', 'capital_gains', 'capital_losses', 'migration_code_move_within_reg', 'major_industry_code', 'enrolled_in_edu_inst_last_wk', 'age', 'veterans_benefits', 'tax_filer_status', 'major_occupation_code', 'weeks_worked_in_year', 'own_business_or_self_employed', 'marital_status', 'region_of_previous_residence', 'state_of_previous_residence', 'detailed_household_and_family_stat', 'class_of_worker', 'migration_prev_res_in_sunbelt', 'family_members_under_18', 'migration_code_change_in_reg', 'wage_per_hour', 'reason_for_unemployment', 'live_in_this_house_1_year_ago', 'member_of_a_laber_union', 'citizenship', 'divdends_from_stocks', 'fill_inc_questionnaire_for_veterans_admin', 'education_level', 'country_of_birth_self']
['age', 'wage_per_hour', 'capital_gains

In [9]:
sub_data = train_data['total_person_income'].value_counts().reset_index()
sub_data_columns = ['income_type', 'count']
sub_data.columns = sub_data_columns
fig = px.bar(
    sub_data,
    x = 'income_type',
    y = 'count',
    title='Grouped by Income Types'
)
fig.show()

KeyError: 'federal_income_tax_liability'