# UCI Data Set: Person Income Prediction

## Introduction
  This data set contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau.  The data contains 41 demographic and employment-related variables.

  The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions, this field must be used. This attribute should *not* be used in the classifiers.
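
  As a small illustration of why the weight matters for analysis, the sketch below compares the unweighted and the weighted share of high-income records. The four rows and their weights are made-up toy values, not taken from the data set; only the column names `instance_weight` and `total_person_income` follow this notebook.

```python
import pandas as pd

# Toy records (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "total_person_income": [" - 50000.", " - 50000.", " 50000+.", " - 50000."],
    "instance_weight": [1000.0, 2000.0, 500.0, 1500.0],
})

# Strip the leading space before comparing the label strings.
is_high = toy["total_person_income"].str.strip() == "50000+."

# Unweighted share: every record counts once.
unweighted_share = is_high.mean()

# Weighted share: every record counts as many people as its weight says.
weighted_share = (
    toy.loc[is_high, "instance_weight"].sum() / toy["instance_weight"].sum()
)

print(f"unweighted: {unweighted_share:.3f}, weighted: {weighted_share:.3f}")
```

  In this toy case the two shares differ noticeably, which is the reason population-level conclusions should use the weight while the classifiers should not see it as a feature.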

  Each line holds one instance with comma-delimited fields. There are 199523 instances in the data file and 99762 in the test file.

  The data was split into train/test in approximately 2/3, 1/3 proportions using MineSet's MIndUtil mineset-to-mlc.
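
  The stated proportions can be checked from the two instance counts quoted above. This is only arithmetic on those two numbers; the variable names are illustrative.

```python
# Instance counts quoted in the introduction.
n_data = 199523
n_test = 99762

total = n_data + n_test
data_fraction = n_data / total
test_fraction = n_test / total

print(f"data file: {data_fraction:.4f}, test file: {test_fraction:.4f}")
```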

## Prediction Task
  The prediction task is to determine the income level of the person represented by each record.  Incomes have been binned at the $50K level to present a binary classification problem, much like the original UCI/ADULT database.  The goal field of this data, however, was drawn from the "total person income" field rather than the "adjusted gross income" field and may, therefore, behave differently from the original ADULT goal field.
  
## Data Description
| Total Count   | Feature Count |
| ------------- | ------------- |
|     199523    |       38      |
  
## Feature Description
- age(continuous)
- class_of_worker(nominal)
- detailed_industry_recode(nominal)
    - Numerical representation of major industry code. This column will be ignored.
- detailed_occupation_recode(nominal)
    - Numerical representation of major occupation code. This column will be ignored.    
- education_level(nominal)
- wage_per_hour(continuous)
- enrolled_in_edu_inst_last_wk(nominal)
    - Not in universe, high school, college or university
- marital_status(nominal)
- major_industry_code(nominal)
    - different kinds of job categories
- major_occupation_code(nominal)
- race(nominal)
- hispanic_origin(nominal)
- sex(nominal)
- member_of_a_laber_union(nominal)
- reason_for_unemployment(nominal)
- full_or_part_time_employment_stat(nominal)
- capital_gains(continuous)
- capital_losses(continuous)
- divdends_from_stocks(continuous)
- tax_filer_status(nominal)
- region_of_previous_residence(nominal)
- state_of_previous_residence(nominal)
- detailed_household_and_family_stat(nominal)
    - detailed information of child and grandchild in the family
- detailed_household_summary_in_household(nominal)
- instance_weight 
    - *The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions, this field must be used. This attribute should **not** be used in the classifiers.*
- migration_code_change_in_msa(nominal)
    - MSA stands for Metropolitan Statistical Area
- migration_code_change_in_reg(nominal)
- migration_code_move_within_reg(nominal)
- live_in_this_house_1_year_ago(nominal)
- migration_prev_res_in_sunbelt(nominal)
- num_persons_worked_for_employer(continuous)
- family_members_under_18(nominal)
- country_of_birth_father(nominal)
- country_of_birth_mother(nominal)
- country_of_birth_self(nominal)
- citizenship(nominal)
- own_business_or_self_employed(nominal)
- fill_inc_questionnaire_for_veterans_admin(nominal)
- veterans_benefits(nominal)
- weeks_worked_in_year(continuous)
- year(nominal)

## Label
- total_person_income(nominal): 50000+ or -50000
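
For modelling, the two label strings are usually turned into a 0/1 target. A minimal sketch, assuming the raw strings look like the ones printed further down in this notebook (leading space, trailing dot); the name `income_above_50k` is hypothetical and not used elsewhere in the notebook.

```python
import pandas as pd

# Raw label strings as they appear in this notebook's output.
raw = pd.Series([" - 50000.", " 50000+.", " - 50000."], name="total_person_income")

# Normalize whitespace, then map to 1 for the 50000+ class and 0 otherwise.
target = (raw.str.strip() == "50000+.").astype(int)
target = target.rename("income_above_50k")  # hypothetical column name

print(target.tolist())
```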

In [65]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split


In [52]:
path = "data/census-income.data"
column_names = [
    "age",
    "class_of_worker",
    "detailed_industry_recode",
    "detailed_occupation_recode",
    "education_level",
    "wage_per_hour",
    "enrolled_in_edu_inst_last_wk",
    "marital_status",
    "major_industry_code",
    "major_occupation_code",
    "race",
    "hispanic_origin",
    "sex",
    "member_of_a_laber_union",
    "reason_for_unemployment",
    "full_or_part_time_employment_stat",
    "capital_gains",
    "capital_losses",
    "divdends_from_stocks",
    "tax_filer_status",
    "region_of_previous_residence",
    "state_of_previous_residence",
    "detailed_household_and_family_stat",
    "detailed_household_summary_in_household",
    "instance_weight",
    "migration_code_change_in_msa",
    "migration_code_change_in_reg",
    "migration_code_move_within_reg",
    "live_in_this_house_1_year_ago",
    "migration_prev_res_in_sunbelt",
    "num_persons_worked_for_employer",  # stray space removed so the name matches numeric_features
    "family_members_under_18",
    "country_of_birth_father",
    "country_of_birth_mother",
    "country_of_birth_self",
    "citizenship",
    "own_business_or_self_employed",
    "fill_inc_questionnaire_for_veterans_admin",
    "veterans_benefits",
    "weeks_worked_in_year",
    "year",
    "total_person_income"
]

data = pd.read_csv(
    path, 
    names=column_names, 
    index_col=False,
    na_values=' ?')
data = data.drop(columns=['instance_weight', 'detailed_industry_recode', 'detailed_occupation_recode'])
train_data, val_data = train_test_split(data, train_size = 0.7, test_size = 0.3, shuffle = True)
test_data = None
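
The split above is random and unseeded, so the class ratio and the selected rows can differ between runs. One possible variant, shown here on a toy frame with hypothetical values rather than on `data`, passes `stratify` and `random_state` to `train_test_split`; the seed 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `data` (hypothetical values): 10% positive labels.
toy = pd.DataFrame({
    "age": range(100),
    "total_person_income": [" 50000+."] * 10 + [" - 50000."] * 90,
})

train_part, val_part = train_test_split(
    toy,
    train_size=0.7,
    shuffle=True,
    stratify=toy["total_person_income"],  # keep the class ratio equal in both parts
    random_state=42,                      # arbitrary seed, chosen for repeatability
)

print(train_part["total_person_income"].value_counts(normalize=True))
print(val_part["total_person_income"].value_counts(normalize=True))
```

Stratifying is a design choice rather than a requirement; it mainly helps when the positive class is rare and a plain random split could leave the validation part with too few positive records.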

In [82]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 139666 entries, 139573 to 146896
Data columns (total 39 columns):
 #   Column                                     Non-Null Count   Dtype 
---  ------                                     --------------   ----- 
 0   age                                        139666 non-null  int64 
 1   class_of_worker                            139666 non-null  object
 2   education_level                            139666 non-null  object
 3   wage_per_hour                              139666 non-null  int64 
 4   enrolled_in_edu_inst_last_wk               139666 non-null  object
 5   marital_status                             139666 non-null  object
 6   major_industry_code                        139666 non-null  object
 7   major_occupation_code                      139666 non-null  object
 8   race                                       139666 non-null  object
 9   hispanic_origin                            139666 non-null  object
 10  sex            

In [53]:
train_data.head(10)

Unnamed: 0,age,class_of_worker,education_level,wage_per_hour,enrolled_in_edu_inst_last_wk,marital_status,major_industry_code,major_occupation_code,race,hispanic_origin,...,country_of_birth_father,country_of_birth_mother,country_of_birth_self,citizenship,own_business_or_self_employed,fill_inc_questionnaire_for_veterans_admin,veterans_benefits,weeks_worked_in_year,year,total_person_income
139573,3,Not in universe,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
185289,7,Not in universe,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,95,- 50000.
7830,67,Not in universe,Bachelors degree(BA AB BS),0,Not in universe,Divorced,Not in universe or children,Not in universe,White,All other,...,Italy,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94,- 50000.
37903,80,Not in universe,High school graduate,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94,- 50000.
180135,51,Private,Associates degree-academic program,0,Not in universe,Married-civilian spouse present,Other professional services,Adm support including clerical,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,2,Not in universe,2,20,95,- 50000.
154085,19,Private,High school graduate,0,College or university,Never married,Retail trade,Other service,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,12,95,- 50000.
144020,3,Not in universe,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,Other,Mexican-American,...,United-States,Germany,United-States,Native- Born in the United States,0,Not in universe,0,0,95,- 50000.
146938,53,Not in universe,Some college but no degree,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
40289,22,Private,Bachelors degree(BA AB BS),465,College or university,Never married,Hospital services,Other service,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95,- 50000.
143445,45,Private,High school graduate,0,Not in universe,Divorced,Hospital services,Adm support including clerical,White,All other,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.


In [56]:
print('Fraction of missing values per column')
print(train_data.isnull().sum()/len(train_data))

Fraction of missing values per column
age                                          0.000000
class_of_worker                              0.000000
education_level                              0.000000
wage_per_hour                                0.000000
enrolled_in_edu_inst_last_wk                 0.000000
marital_status                               0.000000
major_industry_code                          0.000000
major_occupation_code                        0.000000
race                                         0.000000
hispanic_origin                              0.000000
sex                                          0.000000
member_of_a_laber_union                      0.000000
reason_for_unemployment                      0.000000
full_or_part_time_employment_stat            0.000000
capital_gains                                0.000000
capital_losses                               0.000000
divdends_from_stocks                         0.000000
tax_filer_status                          

In [58]:
features = train_data.columns.values.tolist()
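# Drop the label by name rather than by position, so the list stays correct if columns change.
features.remove('total_person_income')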
numeric_features = ['age', 
                    'wage_per_hour', 
                    'capital_gains', 
                    'capital_losses', 
                    'divdends_from_stocks', 
                    'num_persons_worked_for_employer',
                    'weeks_worked_in_year'
                   ]
categorical_features = list(set(features) - set(numeric_features))

In [60]:
ds = train_data['total_person_income'].value_counts().reset_index()
ds.columns = [
    'income_type', 
    'percent']

ds['percent'] /= len(train_data)
fig = px.pie(
    ds,
    names='income_type',
    values='percent',
    title='Percent of income types'
)
fig.show()

In [81]:
fig = go.Figure()
ds = train_data['age'].value_counts().reset_index()
ds.columns = [
    'age', 
    'count'
]

fig.add_trace(
    go.Bar(
        name='all',
        x=ds['age'], 
        y=ds['count'], 
    ))
ds = train_data[train_data['total_person_income'] == ' 50000+.']['age'].value_counts().reset_index()
ds.columns = [
    'age', 
    'count'
]

fig.add_trace(
    go.Bar(
        name='50000+',
        x=ds['age'], 
        y=ds['count'], 
    ))
fig.update_layout(title='Age distribution')
fig.show()

In [85]:
# Share of records labelled ' 50000+.' at each age.
# Grouping by age keeps numerator and denominator on the same age value;
# dividing two separate value_counts() frames row by row would pair unrelated ages,
# because each frame is ordered by its own counts.
is_positive = train_data['total_person_income'] == ' 50000+.'
age_positive_percent = (
    is_positive.groupby(train_data['age'])
    .mean()
    .reset_index(name='percent')
)

fig = px.bar(
    age_positive_percent, 
    x='age', 
    y='percent',  
    title='Share of 50000+ records by age', 
)

fig.show()