# Project 2: Census Income
* Objective

Your task is to build machine learning models that predict the income level (the target variable) of the collaborators in the evaluation set. A value of 0 denotes a collaborator with an annual income below 50,000 USD, and 1 denotes a collaborator with an annual income of 50,000 USD or more.

* Evaluation Criteria

Submissions are evaluated using the F1 score. How does this work?
Once you generate and submit predictions of the target variable for the evaluation dataset, they are compared with its true values. The true values are hidden on the DPhi platform so that your model's performance can be evaluated on unseen data. Finally, an F1 score for your model is generated and displayed.
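
As a rough illustration of the metric (not the DPhi platform's actual scoring code), F1 is the harmonic mean of precision and recall for the positive class. The labels below are made up for the example.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true labels and predictions (1 = income of 50,000 USD or more)
y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R)

print(precision, recall, f1)
```

Because F1 ignores true negatives, it is less flattering than accuracy when the positive class is rare.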

* About the dataset

This dataset contains 41 attributes. The target variable is the income level: 0 for a collaborator with an annual income below 50,000 USD, and 1 for a collaborator with an annual income of 50,000 USD or more.

  * age: continuous.
  * workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  * fnlwgt: continuous.
  * education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  * education-num: continuous.
  * marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  * occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  * relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  * race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  * sex: Female, Male.
  * capital-gain: continuous.
  * capital-loss: continuous.
  * hours-per-week: continuous.
  * native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.


## Import Packages

In [None]:
import numpy as np
import pandas as pd
import io
import requests
import seaborn as sns
from matplotlib import pyplot as plt
import pickle
import os
from pandas.api.types import CategoricalDtype

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_score

## Import Dataset

In [None]:
train  = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Census_Income/Training_set_census.csv" )
test = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Census_Income/Testing_set_census.csv')

# Data Exploration

We first look at the information in the two datasets.

## Training Data

In [None]:
train.head()

Unnamed: 0,age,class_of_worker,industry_code,occupation_code,education,wage_per_hour,enrolled_in_edu_inst_lastwk,marital_status,major_industry_code,major_occupation_code,race,hispanic_origin,sex,member_of_labor_union,reason_for_unemployment,full_parttime_employment_stat,capital_gains,capital_losses,dividend_from_Stocks,tax_filer_status,region_of_previous_residence,state_of_previous_residence,d_household_family_stat,d_household_summary,migration_msa,migration_reg,migration_within_reg,live_1_year_ago,migration_sunbelt,num_person_Worked_employer,family_members_under_18,country_father,country_mother,country_self,citizenship,business_or_self_employed,fill_questionnaire_veteran_admin,veterans_benefits,weeks_worked_in_year,year,income_level
0,23,Private,43,22,Some college but no degree,0,College or university,Never married,Education,Adm support including clerical,White,All other,Male,Not in universe,Not in universe,Full-time schedules,0,0,0,Single,Not in universe,Not in universe,Child 18+ never marr Not in a subfamily,Child 18 or older,,,,Not in universe under 1 year old,,4,Not in universe,Peru,Peru,United-States,Native- Born in the United States,0,Not in universe,2,30,95,0
1,24,Private,34,2,Bachelors degree(BA AB BS),0,Not in universe,Never married,Finance insurance and real estate,Executive admin and managerial,White,All other,Male,No,Not in universe,Children or Armed Forces,0,0,0,Single,West,California,Nonfamily householder,Householder,MSA to MSA,Different county same state,Different county same state,No,No,4,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,26,94,0
2,38,Private,34,2,Masters degree(MA MS MEng MEd MSW MBA),0,Not in universe,Married-civilian spouse present,Finance insurance and real estate,Executive admin and managerial,White,All other,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,250,Joint both under 65,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,4,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,1
3,33,Not in universe,0,0,High school graduate,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,All other,Female,Not in universe,Not in universe,Not in labor force,0,0,0,Joint both under 65,Not in universe,Not in universe,Child 18+ ever marr RP of subfamily,Child 18 or older,,,,Not in universe under 1 year old,,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,0
4,13,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,All other,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Child <18 never marr not in subfamily,Child under 18 never married,,,,Not in universe under 1 year old,,0,Both parents present,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,95,0


In [None]:
train.tail()

Unnamed: 0,age,class_of_worker,industry_code,occupation_code,education,wage_per_hour,enrolled_in_edu_inst_lastwk,marital_status,major_industry_code,major_occupation_code,race,hispanic_origin,sex,member_of_labor_union,reason_for_unemployment,full_parttime_employment_stat,capital_gains,capital_losses,dividend_from_Stocks,tax_filer_status,region_of_previous_residence,state_of_previous_residence,d_household_family_stat,d_household_summary,migration_msa,migration_reg,migration_within_reg,live_1_year_ago,migration_sunbelt,num_person_Worked_employer,family_members_under_18,country_father,country_mother,country_self,citizenship,business_or_self_employed,fill_questionnaire_veteran_admin,veterans_benefits,weeks_worked_in_year,year,income_level
199995,2,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,Mexican-American,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Child <18 never marr not in subfamily,Child under 18 never married,,,,Not in universe under 1 year old,,0,Both parents present,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,95,0
199996,32,Private,39,19,Associates degree-occup /vocational,0,Not in universe,Married-civilian spouse present,Personal services except private HH,Sales,White,All other,Male,No,Not in universe,Children or Armed Forces,5178,0,0,Joint both under 65,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,4,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,1
199997,18,Not in universe,0,0,11th grade,0,High school,Never married,Not in universe or children,Not in universe,White,All other,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Child 18+ never marr Not in a subfamily,Child 18 or older,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94,0
199998,45,State government,43,33,Bachelors degree(BA AB BS),0,Not in universe,Married-civilian spouse present,Education,Precision production craft & repair,White,All other,Male,Not in universe,Not in universe,Full-time schedules,0,0,200,Joint both under 65,Not in universe,Not in universe,Householder,Householder,,,,Not in universe under 1 year old,,6,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95,0
199999,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,Mexican-American,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Child <18 never marr not in subfamily,Child under 18 never married,,,,Not in universe under 1 year old,,0,Both parents present,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,95,0


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 41 columns):
 #   Column                            Non-Null Count   Dtype 
---  ------                            --------------   ----- 
 0   age                               200000 non-null  int64 
 1   class_of_worker                   200000 non-null  object
 2   industry_code                     200000 non-null  int64 
 3   occupation_code                   200000 non-null  int64 
 4   education                         200000 non-null  object
 5   wage_per_hour                     200000 non-null  int64 
 6   enrolled_in_edu_inst_lastwk       200000 non-null  object
 7   marital_status                    200000 non-null  object
 8   major_industry_code               200000 non-null  object
 9   major_occupation_code             200000 non-null  object
 10  race                              200000 non-null  object
 11  hispanic_origin                   199408 non-null  object
 12  se

In [None]:
train.describe()

Unnamed: 0,age,industry_code,occupation_code,wage_per_hour,capital_gains,capital_losses,dividend_from_Stocks,num_person_Worked_employer,business_or_self_employed,veterans_benefits,weeks_worked_in_year,year,income_level
count,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0
mean,34.662495,15.56483,11.326325,54.8357,493.56158,38.921275,212.97763,1.98378,0.177995,1.52286,23.54182,94.4998,0.07436
std,22.225765,18.104961,14.424809,272.034681,5109.900136,277.867944,2062.591247,2.372892,0.557014,0.846346,24.447497,0.500001,0.262357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.0,0.0
25%,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,94.0,0.0
50%,33.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,10.0,94.0,0.0
75%,50.0,33.0,26.0,0.0,0.0,0.0,0.0,4.0,0.0,2.0,52.0,95.0,0.0
max,90.0,51.0,46.0,9999.0,99999.0,4608.0,99999.0,6.0,2.0,2.0,52.0,95.0,1.0


In [None]:
train.isna().sum()

age                                     0
class_of_worker                         0
industry_code                           0
occupation_code                         0
education                               0
wage_per_hour                           0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       592
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
capital_gains                           0
capital_losses                          0
dividend_from_Stocks                    0
tax_filer_status                        0
region_of_previous_residence            0
state_of_previous_residence           477
d_household_family_stat                 0
d_household_summary               

## Testing Data

In [None]:
test.head()

Unnamed: 0,age,class_of_worker,industry_code,occupation_code,education,wage_per_hour,enrolled_in_edu_inst_lastwk,marital_status,major_industry_code,major_occupation_code,race,hispanic_origin,sex,member_of_labor_union,reason_for_unemployment,full_parttime_employment_stat,capital_gains,capital_losses,dividend_from_Stocks,tax_filer_status,region_of_previous_residence,state_of_previous_residence,d_household_family_stat,d_household_summary,migration_msa,migration_reg,migration_within_reg,live_1_year_ago,migration_sunbelt,num_person_Worked_employer,family_members_under_18,country_father,country_mother,country_self,citizenship,business_or_self_employed,fill_questionnaire_veteran_admin,veterans_benefits,weeks_worked_in_year,year
0,65,Not in universe,0,0,12th grade no diploma,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,All other,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Joint one under 65 & one 65+,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94
1,75,Not in universe,0,0,Bachelors degree(BA AB BS),0,Not in universe,Divorced,Not in universe or children,Not in universe,White,All other,Female,Not in universe,Not in universe,Not in labor force,0,0,0,Nonfiler,Not in universe,Not in universe,Nonfamily householder,Householder,,,,Not in universe under 1 year old,,0,Not in universe,France,,United-States,Native- Born in the United States,0,Not in universe,2,0,95
2,26,Not in universe,0,0,High school graduate,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,Puerto Rican,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Joint both under 65,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94
3,42,Self-employed-incorporated,2,43,Bachelors degree(BA AB BS),0,Not in universe,Married-civilian spouse present,Agriculture,Farming forestry and fishing,White,All other,Female,Not in universe,Not in universe,PT for non-econ reasons usually FT,0,0,115,Joint both under 65,Not in universe,Not in universe,Spouse of householder,Spouse of householder,,,,Not in universe under 1 year old,,1,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95
4,35,Private,33,26,Some college but no degree,0,Not in universe,Married-civilian spouse present,Retail trade,Adm support including clerical,White,All other,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,500,Joint both under 65,South,Louisiana,Spouse of householder,Spouse of householder,MSA to MSA,Same county,Same county,No,No,3,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94


In [None]:
test.tail()

Unnamed: 0,age,class_of_worker,industry_code,occupation_code,education,wage_per_hour,enrolled_in_edu_inst_lastwk,marital_status,major_industry_code,major_occupation_code,race,hispanic_origin,sex,member_of_labor_union,reason_for_unemployment,full_parttime_employment_stat,capital_gains,capital_losses,dividend_from_Stocks,tax_filer_status,region_of_previous_residence,state_of_previous_residence,d_household_family_stat,d_household_summary,migration_msa,migration_reg,migration_within_reg,live_1_year_ago,migration_sunbelt,num_person_Worked_employer,family_members_under_18,country_father,country_mother,country_self,citizenship,business_or_self_employed,fill_questionnaire_veteran_admin,veterans_benefits,weeks_worked_in_year,year
49995,2,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,All other,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Child <18 never marr not in subfamily,Child under 18 never married,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Both parents present,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94
49996,42,Private,26,38,High school graduate,0,Not in universe,Married-civilian spouse present,Manufacturing-nondurable goods,Transportation and material moving,White,All other,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Joint both under 65,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,6,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94
49997,41,Private,4,34,7th and 8th grade,0,Not in universe,Married-civilian spouse present,Construction,Precision production craft & repair,White,Central or South American,Male,Not in universe,Not in universe,Full-time schedules,0,0,0,Joint both under 65,Not in universe,Not in universe,Spouse of householder,Spouse of householder,?,?,?,Not in universe under 1 year old,?,2,Not in universe,Guatemala,Guatemala,Guatemala,Foreign born- Not a citizen of U S,0,Not in universe,2,26,95
49998,77,Not in universe,0,0,High school graduate,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,All other,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Spouse of householder,Spouse of householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94
49999,7,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,Other,All other,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Other Rel <18 never marr child of subfamily RP,Other relative of householder,?,?,?,Not in universe under 1 year old,?,0,Both parents present,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,95


In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 40 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   age                               50000 non-null  int64 
 1   class_of_worker                   50000 non-null  object
 2   industry_code                     50000 non-null  int64 
 3   occupation_code                   50000 non-null  int64 
 4   education                         50000 non-null  object
 5   wage_per_hour                     50000 non-null  int64 
 6   enrolled_in_edu_inst_lastwk       50000 non-null  object
 7   marital_status                    50000 non-null  object
 8   major_industry_code               50000 non-null  object
 9   major_occupation_code             50000 non-null  object
 10  race                              50000 non-null  object
 11  hispanic_origin                   49854 non-null  object
 12  sex               

In [None]:
test.describe()

Unnamed: 0,age,industry_code,occupation_code,wage_per_hour,capital_gains,capital_losses,dividend_from_Stocks,num_person_Worked_employer,business_or_self_employed,veterans_benefits,weeks_worked_in_year,year
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,34.86952,15.53144,11.38076,57.67614,481.0942,37.867,217.04162,1.99692,0.17974,1.52806,23.53932,94.5044
std,22.261519,18.055435,14.454668,288.661988,4859.057532,272.695815,2143.428371,2.377575,0.559857,0.843319,24.450718,0.499986
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.0
25%,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,94.0
50%,34.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,10.5,95.0
75%,50.0,33.0,26.0,0.0,0.0,0.0,0.0,4.0,0.0,2.0,52.0,95.0
max,90.0,51.0,46.0,9900.0,99999.0,4356.0,99999.0,6.0,2.0,2.0,52.0,95.0


In [None]:
test.isna().sum()

age                                     0
class_of_worker                         0
industry_code                           0
occupation_code                         0
education                               0
wage_per_hour                           0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       146
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
capital_gains                           0
capital_losses                          0
dividend_from_Stocks                    0
tax_filer_status                        0
region_of_previous_residence            0
state_of_previous_residence           117
d_household_family_stat                 0
d_household_summary               

Both datasets contain missing values that need to be handled before moving on to the analysis.

# Missing Values Imputer


## Column Selector 
To fill in the missing values, we first split the data into groups based on dtype. Here we collect the numeric features and the object features using a custom transformer that inherits from the BaseEstimator and TransformerMixin classes.

In [None]:
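class ColumnsSelector(BaseEstimator, TransformerMixin):
  """Select the columns of a DataFrame that have the given dtype."""
  def __init__(self, type):
    # dtype name passed to select_dtypes, e.g. 'int64' or 'object'
    self.type = type
  
  def fit(self, X, y=None):
    # Nothing to learn; the selection depends only on dtypes
    return self
  
  def transform(self,X):
    return X.select_dtypes(include=[self.type])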

Next, we look at the NA values in both groups.

In [None]:
numeric = ColumnsSelector('int64')
obj = ColumnsSelector('object')  # named obj to avoid shadowing the built-in object

In [None]:
# Split the train and test data by dtype
numeric_train = numeric.transform(train)
numeric_test = numeric.transform(test)

object_train = obj.transform(train)
object_test = obj.transform(test)

In [None]:
print("Data Train: ")
print("")
print("numeric: ")
print(numeric_train.isna().sum())
print("")
print("object: ")
print(object_train.isna().sum())

Data Train: 

numeric: 
age                           0
industry_code                 0
occupation_code               0
wage_per_hour                 0
capital_gains                 0
capital_losses                0
dividend_from_Stocks          0
num_person_Worked_employer    0
business_or_self_employed     0
veterans_benefits             0
weeks_worked_in_year          0
year                          0
income_level                  0
dtype: int64

object: 
class_of_worker                         0
education                               0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       592
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
tax_filer_status                 

In [None]:
print("Data Test: ")
print("")
print("numeric: ")
print(numeric_test.isna().sum())
print("")
print("object: ")
print(object_test.isna().sum())

Data Test: 

numeric: 
age                           0
industry_code                 0
occupation_code               0
wage_per_hour                 0
capital_gains                 0
capital_losses                0
dividend_from_Stocks          0
num_person_Worked_employer    0
business_or_self_employed     0
veterans_benefits             0
weeks_worked_in_year          0
year                          0
dtype: int64

object: 
class_of_worker                         0
education                               0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       146
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
tax_filer_status                        0
region_of_previous_resid

So the missing values are all in the object-typed features. We will explore some of these object features further.

## Exploring Object-Typed Data

In [None]:
object_train['hispanic_origin'].unique()

array(['All other', ' Mexican-American', ' All other', ' Other Spanish',
       'Mexican-American', 'Mexican (Mexicano)', ' NA',
       ' Central or South American', 'Central or South American',
       ' Mexican (Mexicano)', ' Puerto Rican', 'Other Spanish',
       'Puerto Rican', 'Cuban', ' Cuban', nan, 'Do not know', 'Chicano',
       ' Chicano', ' Do not know'], dtype=object)

In [None]:
object_test['migration_msa'].unique()

array(['Nonmover', nan, ' MSA to MSA', 'MSA to nonMSA', ' ?', ' Nonmover',
       'NonMSA to nonMSA', 'MSA to MSA', 'Abroad to MSA',
       ' Abroad to nonMSA', 'Abroad to nonMSA', ' Not identifiable',
       'Not identifiable', ' Not in universe', 'Not in universe',
       'NonMSA to MSA', ' NonMSA to nonMSA', ' MSA to nonMSA',
       ' NonMSA to MSA', ' Abroad to MSA'], dtype=object)

Looking at the output, some values are identical except for a leading space, so pandas treats them as two different values. We therefore need to strip the leading space so that these values match their counterparts without it.

There are also values such as "?", "Do not know", and "NA" that mean the same thing as a missing value (NaN), so we replace them with "Not identifiable". There are many ways to fill NaN values; one of them is to replace them with the mode of the column, that is, its most frequent value.

To make this easy, we can write a custom transformer that fills the NaN values according to the chosen strategy.
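
The cleaning steps described above can be sketched on a toy column before wrapping them in a transformer. The values below are illustrative and only mimic the kind of issues seen in `hispanic_origin`; filling with the mode is one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Toy column with leading spaces, placeholder strings, and a real NaN
s = pd.Series(['All other', ' All other', 'All other', ' NA', 'Do not know',
               np.nan, 'Cuban', ' ?', ' All other'])

# 1. Remove leading/trailing spaces so duplicate spellings collapse into one value
cleaned = s.str.strip()

# 2. Map placeholder strings to a single "unknown" label
cleaned = cleaned.replace(['NA', 'Do not know', '?'], 'Not identifiable')

# 3. Fill the remaining NaN values with the most frequent value (the mode)
most_frequent = cleaned.value_counts().index[0]
cleaned = cleaned.fillna(most_frequent)

print(cleaned.unique())
```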

## Imputer

In [None]:
class MissingValuesImputer(BaseEstimator, TransformerMixin):
  def __init__(self, columns, strategy = 'same_meaning'):
    self.columns = columns
    self.strategy = strategy

  def fit(self, X, y = None):
    if self.columns is None:
      self.columns = X.columns

    if self.strategy == 'same_meaning':
      # Treat NaN like the other "unknown" placeholders
      self.fill = {column : 'Not identifiable' for column in self.columns}
    
    elif self.strategy == 'most_frequent':
      # Strip first so ' Nonmover' and 'Nonmover' are counted as one value
      self.fill = {column : X[column].str.strip().value_counts().index[0] for column in self.columns}
    
    else :
      self.fill = {column : '0' for column in self.columns}

    return self
  
  def transform(self, X):
    # Work on a copy so the caller's DataFrame is left untouched
    X_copy = X.copy()
    for column in self.columns:
      X_copy[column] = X_copy[column].str.strip()
      X_copy[column] = X_copy[column].replace(['NA','Do not know','?'], 'Not identifiable')
      X_copy[column] = X_copy[column].fillna(self.fill[column])
    return X_copy

With this custom transformer the missing values are filled in. The next step is to build a custom transformer that converts the string features into categories and then processes them further into numeric data.

# Encode Categories

## Custom Transformer
A one-hot encoder will be used to convert the categorical features into numeric ones, with the categories taken from the train and test data combined.

In [None]:
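class CategoricalEncoder(BaseEstimator, TransformerMixin): 
  def __init__(self, dropFirst=True):
    self.categories=dict()
    self.dropFirst=dropFirst
    
  def fit(self, X, y=None):
    # Collect categories from the global train and test frames so that both
    # sets produce the same dummy columns
    join_df = pd.concat([train, test])
    join_df = join_df.select_dtypes(include=['object'])
    for column in join_df.columns:
      # Strip whitespace so ' Nonmover' and 'Nonmover' share one category
      values = join_df[column].str.strip().value_counts().index.tolist()
      # Make sure the imputer's placeholder label is a known category
      if 'Not identifiable' not in values:
        values.append('Not identifiable')
      self.categories[column] = values
    return self
    
  def transform(self, X):
    X_copy = X.copy()
    X_copy = X_copy.select_dtypes(include=['object'])
    for column in X_copy.columns:
      # Cast to the fixed category list before creating dummy columns
      X_copy[column] = X_copy[column].str.strip().astype(
                CategoricalDtype(self.categories[column]))
    return pd.get_dummies(X_copy, drop_first=self.dropFirst)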
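
The reason for casting to a fixed `CategoricalDtype` before calling `pd.get_dummies` can be seen in a small sketch. The toy frames and their values are made up; the point is only that both frames end up with the same dummy columns even when one of them lacks a category.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy train/test frames; the test frame has no 'Widowed' row
toy_train = pd.DataFrame({'marital_status': ['Never married', 'Divorced', 'Widowed']})
toy_test = pd.DataFrame({'marital_status': ['Divorced', 'Never married']})

# Collect the categories once from both frames combined
combined = pd.concat([toy_train, toy_test])
categories = combined['marital_status'].value_counts().index.tolist()
dtype = CategoricalDtype(categories)

# With a shared dtype, get_dummies emits one column per category,
# including categories that do not occur in a given frame
train_dummies = pd.get_dummies(toy_train.astype({'marital_status': dtype}))
test_dummies = pd.get_dummies(toy_test.astype({'marital_status': dtype}))

print(list(train_dummies.columns) == list(test_dummies.columns))
```

Without the shared dtype, the test frame would yield only two dummy columns and could not be fed to a model trained on three.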

# ML Pipeline
With the custom transformers in place, we can build one pipeline for the numeric data and one for the categorical data, then combine them into a single pipeline that feeds the ML model.

In [None]:
num_pipeline = Pipeline(steps = [
  ("num_dtype_selector", ColumnsSelector(type = 'int64')),
  ("scaler", StandardScaler())
])

cat_pipeline = Pipeline(steps = [
  ("obj_dtype_selector", ColumnsSelector(type = 'object')),
  ("missing_values_imputer", MissingValuesImputer(columns = object_train.columns, strategy = 'most_frequent')),
  ("encoder", CategoricalEncoder(dropFirst = True))
])

# FeatureUnion runs both pipelines on the same input and stacks their outputs side by side
full_pipeline = FeatureUnion([("num_pipe", num_pipeline), 
                              ("cat_pipeline", cat_pipeline)])
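
To make the role of `FeatureUnion` more concrete, here is a minimal sketch on a toy frame. `FunctionTransformer` stands in for the custom selector and encoder classes, and the column names and values are illustrative assumptions rather than the census data.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({'age': [20, 30, 40, 50],
                    'sex': ['Male', 'Female', 'Female', 'Male']})

# Stand-ins for the custom transformers: select by dtype, then one-hot encode
select_numeric = FunctionTransformer(lambda X: X.select_dtypes(include=['number']))
select_other = FunctionTransformer(lambda X: X.select_dtypes(exclude=['number']))
one_hot = FunctionTransformer(lambda X: pd.get_dummies(X, drop_first=True))

num_pipe = Pipeline([('select', select_numeric), ('scale', StandardScaler())])
cat_pipe = Pipeline([('select', select_other), ('encode', one_hot)])

# Each branch sees the full input; the outputs are concatenated column-wise
union = FeatureUnion([('num', num_pipe), ('cat', cat_pipe)])
features = union.fit_transform(toy)

print(features.shape)
```

The result has one scaled numeric column plus one dummy column, which is the same layout the full pipeline produces at a larger scale.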



## ML Model

### Training Model

In [None]:
# Make a copy of the training data
train_data = train.copy()

# Define X_train and y_train
X_train = train_data.drop('income_level', axis = 1)
y_train = train_data['income_level']

# Fit and transform X_train with the pipeline
X_train_processed = full_pipeline.fit_transform(X_train)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [None]:
model = LogisticRegression(random_state = 50)
model.fit(X_train_processed, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=50, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
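
The convergence warning above suggests raising `max_iter`. The sketch below shows one way to do that while scoring with F1, the competition metric, through `cross_val_score`. It runs on synthetic imbalanced data from `make_classification` as a stand-in for the processed census features, so the numbers it prints say nothing about the real dataset; `max_iter=1000` is an arbitrary assumption, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the processed features (about 10% positives)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     weights=[0.9, 0.1], random_state=50)

# A higher max_iter than the default of 100 gives lbfgs more room to converge
demo_model = LogisticRegression(max_iter=1000, random_state=50)

# Score with F1 instead of the default accuracy
scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring='f1')
print(scores.mean())
```

Since only about 7% of the training rows have `income_level` equal to 1, accuracy alone would look high even for a model that rarely predicts the positive class, which is why F1 is the more informative score here.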

### Testing Model

In [None]:
test.head()

Unnamed: 0,age,class_of_worker,industry_code,occupation_code,education,wage_per_hour,enrolled_in_edu_inst_lastwk,marital_status,major_industry_code,major_occupation_code,race,hispanic_origin,sex,member_of_labor_union,reason_for_unemployment,full_parttime_employment_stat,capital_gains,capital_losses,dividend_from_Stocks,tax_filer_status,region_of_previous_residence,state_of_previous_residence,d_household_family_stat,d_household_summary,migration_msa,migration_reg,migration_within_reg,live_1_year_ago,migration_sunbelt,num_person_Worked_employer,family_members_under_18,country_father,country_mother,country_self,citizenship,business_or_self_employed,fill_questionnaire_veteran_admin,veterans_benefits,weeks_worked_in_year,year
0,65,Not in universe,0,0,12th grade no diploma,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,All other,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Joint one under 65 & one 65+,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94
1,75,Not in universe,0,0,Bachelors degree(BA AB BS),0,Not in universe,Divorced,Not in universe or children,Not in universe,White,All other,Female,Not in universe,Not in universe,Not in labor force,0,0,0,Nonfiler,Not in universe,Not in universe,Nonfamily householder,Householder,,,,Not in universe under 1 year old,,0,Not in universe,France,,United-States,Native- Born in the United States,0,Not in universe,2,0,95
2,26,Not in universe,0,0,High school graduate,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,Puerto Rican,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Joint both under 65,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94
3,42,Self-employed-incorporated,2,43,Bachelors degree(BA AB BS),0,Not in universe,Married-civilian spouse present,Agriculture,Farming forestry and fishing,White,All other,Female,Not in universe,Not in universe,PT for non-econ reasons usually FT,0,0,115,Joint both under 65,Not in universe,Not in universe,Spouse of householder,Spouse of householder,,,,Not in universe under 1 year old,,1,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95
4,35,Private,33,26,Some college but no degree,0,Not in universe,Married-civilian spouse present,Retail trade,Adm support including clerical,White,All other,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,500,Joint both under 65,South,Louisiana,Spouse of householder,Spouse of householder,MSA to MSA,Same county,Same county,No,No,3,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94


In [None]:
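# Work on a copy so the original evaluation frame is left untouched.
X_test = test.copy()

# Reuse the preprocessing already fitted on the training data; calling
# fit_transform here would refit the pipeline on the evaluation set.
X_test_processed = full_pipeline.transform(X_test)

test['prediction'] = model.predict(X_test_processed)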

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
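When a preprocessing pipeline is applied to an evaluation set, the scaling statistics and category lists learned from the training data are normally reused rather than relearned. The sketch below illustrates the difference on two tiny made-up frames; `train_df`, `test_df`, and `prep` are hypothetical stand-ins and are not the notebook's `full_pipeline`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames; column names only loosely mirror the dataset.
train_df = pd.DataFrame({"age": [25, 40, 63, 31],
                         "sex": ["Male", "Female", "Female", "Male"]})
test_df = pd.DataFrame({"age": [52, 19],
                        "sex": ["Female", "Female"]})  # "Male" never appears

prep = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

train_X = prep.fit_transform(train_df)  # learn mean, scale and categories
test_X = prep.transform(test_df)        # reuse them unchanged

print(train_X.shape, test_X.shape)
```

Both matrices end up with three columns (one scaled numeric column plus two one-hot columns). Refitting on `test_df` alone would yield only two columns here, because `"Male"` is absent from it, and a model trained on three columns could not consume that output.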


In [None]:
test

Unnamed: 0,age,class_of_worker,industry_code,occupation_code,education,wage_per_hour,enrolled_in_edu_inst_lastwk,marital_status,major_industry_code,major_occupation_code,race,hispanic_origin,sex,member_of_labor_union,reason_for_unemployment,full_parttime_employment_stat,capital_gains,capital_losses,dividend_from_Stocks,tax_filer_status,region_of_previous_residence,state_of_previous_residence,d_household_family_stat,d_household_summary,migration_msa,migration_reg,migration_within_reg,live_1_year_ago,migration_sunbelt,num_person_Worked_employer,family_members_under_18,country_father,country_mother,country_self,citizenship,business_or_self_employed,fill_questionnaire_veteran_admin,veterans_benefits,weeks_worked_in_year,year,prediction
0,65,Not in universe,0,0,12th grade no diploma,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,All other,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Joint one under 65 & one 65+,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94,0
1,75,Not in universe,0,0,Bachelors degree(BA AB BS),0,Not in universe,Divorced,Not in universe or children,Not in universe,White,All other,Female,Not in universe,Not in universe,Not in labor force,0,0,0,Nonfiler,Not in universe,Not in universe,Nonfamily householder,Householder,,,,Not in universe under 1 year old,,0,Not in universe,France,,United-States,Native- Born in the United States,0,Not in universe,2,0,95,0
2,26,Not in universe,0,0,High school graduate,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,Puerto Rican,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Joint both under 65,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94,0
3,42,Self-employed-incorporated,2,43,Bachelors degree(BA AB BS),0,Not in universe,Married-civilian spouse present,Agriculture,Farming forestry and fishing,White,All other,Female,Not in universe,Not in universe,PT for non-econ reasons usually FT,0,0,115,Joint both under 65,Not in universe,Not in universe,Spouse of householder,Spouse of householder,,,,Not in universe under 1 year old,,1,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95,0
4,35,Private,33,26,Some college but no degree,0,Not in universe,Married-civilian spouse present,Retail trade,Adm support including clerical,White,All other,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,500,Joint both under 65,South,Louisiana,Spouse of householder,Spouse of householder,MSA to MSA,Same county,Same county,No,No,3,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,White,All other,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Child <18 never marr not in subfamily,Child under 18 never married,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Both parents present,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,0
49996,42,Private,26,38,High school graduate,0,Not in universe,Married-civilian spouse present,Manufacturing-nondurable goods,Transportation and material moving,White,All other,Male,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Joint both under 65,Not in universe,Not in universe,Householder,Householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,6,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,0
49997,41,Private,4,34,7th and 8th grade,0,Not in universe,Married-civilian spouse present,Construction,Precision production craft & repair,White,Central or South American,Male,Not in universe,Not in universe,Full-time schedules,0,0,0,Joint both under 65,Not in universe,Not in universe,Spouse of householder,Spouse of householder,?,?,?,Not in universe under 1 year old,?,2,Not in universe,Guatemala,Guatemala,Guatemala,Foreign born- Not a citizen of U S,0,Not in universe,2,26,95,0
49998,77,Not in universe,0,0,High school graduate,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,White,All other,Female,Not in universe,Not in universe,Children or Armed Forces,0,0,0,Nonfiler,Not in universe,Not in universe,Spouse of householder,Spouse of householder,Nonmover,Nonmover,Nonmover,Yes,Not in universe,0,Not in universe,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94,0


In [None]:
test['prediction'].unique()

array([0, 1])
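`unique()` confirms that both classes are predicted but not how often each one occurs. A small sketch of counting them with `value_counts`, using a made-up Series in place of `test['prediction']`:

```python
import pandas as pd

# Hypothetical predictions standing in for test['prediction'].
preds = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="prediction")

counts = preds.value_counts()                # absolute count per class
shares = preds.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

A very small share of 1s may be worth a second look before submitting, since F1 depends on how well the positive class is recovered.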

In [None]:
test[['prediction']]

Unnamed: 0,prediction
0,0
1,0
2,0
3,0
4,0
...,...
49995,0
49996,0
49997,0
49998,0


In [None]:
from google.colab import drive
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [None]:
test[['prediction']].to_csv('/drive/My Drive/predictions/prediction_2.csv')
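By default `to_csv` also writes the row index as an unnamed first column. Whether the submission format wants that column is not stated in this section, so the following is only a sketch of the alternative, assuming a file with a single `prediction` column is expected. It writes to a temporary directory rather than the Drive path above, and the frame `out` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions; the output path is a temporary directory,
# not the Drive folder used in the notebook.
out = pd.DataFrame({"prediction": [0, 1, 0]})

path = os.path.join(tempfile.mkdtemp(), "prediction_example.csv")
out.to_csv(path, index=False)  # index=False omits the row-index column

# Read the file back to confirm its columns and row count.
check = pd.read_csv(path)
print(check.columns.tolist(), len(check))
```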