<a href="https://colab.research.google.com/github/SrikanthGuggila/INeuron/blob/main/Boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Problem Statement

In this assignment, students predict whether a person earns more than
50K per year, using XGBoost on the classic Adult dataset. The
dataset is described below.

### Dataset Information

* Extraction was done by Barry Becker from the 1994 Census database.
* A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

#### Attribute Information:

* wage class (target): >50K, <=50K.
* age: continuous.
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* fnlwgt: continuous.
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* education-num: continuous.
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* sex: Female, Male.
* capital-gain: continuous.
* capital-loss: continuous.
* hours-per-week: continuous.
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

### Loading the data

#### Import Necessary Libraries

In [None]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-1.3.1-py3-none-win_amd64.whl (95.2 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.3.1
Note: you may need to restart the kernel to use updated packages.


In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

#### Train and test data

In [None]:
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)
test_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test' , skiprows = 1, header = None)

In [None]:
train_set.shape

(32561, 15)

In [None]:
test_set.shape

(16281, 15)

In [None]:
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
'marital_status', 'occupation','relationship', 'race', 'sex', 'capital_gain',
'capital_loss', 'hours_per_week', 'native_country', 'wage_class']

In [None]:
train_set.columns = col_labels
test_set.columns = col_labels

In [None]:
train_set.describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [None]:
train_set.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
wage_class        0
dtype: int64
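
The zero counts above only mean that no cell is NaN. The `workclass` counts printed further down show 1836 rows with the value ` ?`, which is how this file marks unknown entries. A minimal sketch, using a small hypothetical frame in place of `train_set`, of counting those placeholders and optionally converting them to NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_set; text values keep the leading space seen in adult.data.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " ?", " Private", " ?"],
    "occupation": [" Adm-clerical", " ?", " Sales", " Craft-repair"],
})

# Work on the non-numeric columns only: strip padding, then count "?" per column.
text_cols = sample.select_dtypes(exclude="number").columns
stripped = sample[text_cols].apply(lambda col: col.str.strip())
placeholder_counts = (stripped == "?").sum()
print(placeholder_counts)

# Optionally turn the placeholders into real missing values.
cleaned = sample.copy()
cleaned[text_cols] = stripped.replace("?", np.nan)
print(cleaned.isna().sum())
```

Stripping the padding changes the category strings, so applying this to the real data would also change the dummy column names produced later (for example `sex_Male` instead of `sex_ Male`).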

In [None]:
train_set_x = train_set.drop('wage_class', axis=1)
train_set_y = train_set.wage_class

In [None]:
train_set_x.shape

(32561, 14)

In [None]:
train_set_y.shape

(32561,)

In [None]:
test_set_x = test_set.drop('wage_class', axis=1)
test_set_y = test_set.wage_class

In [None]:
test_set_x.shape

(16281, 14)

In [None]:
test_set_y.shape

(16281,)

In [None]:
for column in train_set_x.columns:
    print(train_set_x[column].value_counts())
    print('----------------------------------------')

36    898
31    888
34    886
23    877
35    876
     ... 
83      6
85      3
88      3
87      1
86      1
Name: age, Length: 73, dtype: int64
----------------------------------------
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
----------------------------------------
164190    13
203488    13
123011    13
113364    12
121124    12
          ..
284211     1
312881     1
177711     1
179758     1
229376     1
Name: fnlwgt, Length: 21648, dtype: int64
----------------------------------------
 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              4

In [None]:
train_set_x.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


#### Converting Categorical data into Numerical data

In [None]:
sex_dummies_train = pd.get_dummies(train_set_x['sex'], prefix='sex').iloc[:,1:]
workclass_dummies_train = pd.get_dummies(train_set_x['workclass'], prefix='workclass').iloc[:,1:]
education_dummies_train = pd.get_dummies(train_set_x['education'], prefix='education').iloc[:,1:]
marital_status_dummies_train = pd.get_dummies(train_set_x['marital_status'], prefix='marital_status').iloc[:,1:]
relationship_dummies_train = pd.get_dummies(train_set_x['relationship'], prefix='relationship').iloc[:,1:]
race_dummies_train = pd.get_dummies(train_set_x['race'], prefix='race').iloc[:,1:]
occupation_dummies_train = pd.get_dummies(train_set_x['occupation'], prefix='occupation').iloc[:,1:]
native_country_dummies_train = pd.get_dummies(train_set_x['native_country'], prefix='native_country').iloc[:,1:]
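
Each line above one-hot encodes one column and then drops the first dummy with `.iloc[:, 1:]`, so a column with k categories becomes k-1 indicator columns. As a sketch on a small hypothetical frame (not the census data), `pd.get_dummies` can do the same for several columns in one call with `drop_first=True`:

```python
import pandas as pd

# Hypothetical stand-in for train_set_x.
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": [" Male", " Female", " Male"],
    "race": [" White", " Black", " White"],
})

# Notebook style: encode one column, then drop its first dummy.
sex_manual = pd.get_dummies(sample["sex"], prefix="sex").iloc[:, 1:]

# One-call equivalent: encode the listed columns and drop the first level of each.
encoded = pd.get_dummies(sample, columns=["sex", "race"], drop_first=True)
print(encoded.columns.tolist())
```

Columns not listed in `columns` (here `age`) are passed through unchanged, which removes the need for the chain of `pd.concat` calls.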

In [None]:
print(sex_dummies_train.shape)
print(workclass_dummies_train.shape)
print(education_dummies_train.shape)
print(marital_status_dummies_train.shape)
print(relationship_dummies_train.shape)
print(race_dummies_train.shape)
print(occupation_dummies_train.shape)
print(native_country_dummies_train.shape)
print(race_dummies_train.shape)


(32561, 1)
(32561, 8)
(32561, 15)
(32561, 6)
(32561, 5)
(32561, 4)
(32561, 14)
(32561, 41)
(32561, 4)


In [None]:
set_train_X = train_set_x[['age','fnlwgt','capital_gain','capital_loss','hours_per_week']]
set_train_X = pd.concat([set_train_X, sex_dummies_train], axis=1)
set_train_X = pd.concat([set_train_X, workclass_dummies_train], axis=1)
set_train_X = pd.concat([set_train_X, education_dummies_train], axis=1)
set_train_X = pd.concat([set_train_X, marital_status_dummies_train], axis=1)
set_train_X = pd.concat([set_train_X, relationship_dummies_train], axis=1)
set_train_X = pd.concat([set_train_X, race_dummies_train], axis=1)
set_train_X = pd.concat([set_train_X, occupation_dummies_train], axis=1)

In [None]:
set_train_X.shape

(32561, 58)

In [None]:
test_set_x.shape

(16281, 14)

In [None]:
sex_dummies = pd.get_dummies(test_set_x['sex'], prefix='sex').iloc[:,1:]
workclass_dummies = pd.get_dummies(test_set_x['workclass'], prefix='workclass').iloc[:,1:]
education_dummies = pd.get_dummies(test_set_x['education'], prefix='education').iloc[:,1:]
marital_status_dummies = pd.get_dummies(test_set_x['marital_status'], prefix='marital_status').iloc[:,1:]
relationship_dummies = pd.get_dummies(test_set_x['relationship'], prefix='relationship').iloc[:,1:]
race_dummies = pd.get_dummies(test_set_x['race'], prefix='race').iloc[:,1:]
occupation_dummies = pd.get_dummies(test_set_x['occupation'], prefix='occupation').iloc[:,1:]
native_country_dummies = pd.get_dummies(test_set_x['native_country'], prefix='native_country').iloc[:,1:]

In [None]:
print(sex_dummies.shape)
print(workclass_dummies.shape)
print(education_dummies.shape)
print(marital_status_dummies.shape)
print(relationship_dummies.shape)
print(race_dummies.shape)
print(occupation_dummies.shape)
print(native_country_dummies.shape)
print(race_dummies.shape)

(16281, 1)
(16281, 8)
(16281, 15)
(16281, 6)
(16281, 5)
(16281, 4)
(16281, 14)
(16281, 4)


In [None]:
set_test_X = test_set_x[['age','fnlwgt','capital_gain','capital_loss','hours_per_week']]
set_test_X = pd.concat([set_test_X, workclass_dummies], axis=1)
set_test_X = pd.concat([set_test_X, education_dummies], axis=1)
set_test_X = pd.concat([set_test_X, marital_status_dummies], axis=1)
set_test_X = pd.concat([set_test_X, relationship_dummies], axis=1)
set_test_X = pd.concat([set_test_X, race_dummies], axis=1)
set_test_X = pd.concat([set_test_X, occupation_dummies], axis=1)
set_test_X = pd.concat([set_test_X, sex_dummies], axis=1)

In [None]:
set_test_X.shape

(16281, 58)
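
Both matrices have 58 columns, but the cells above build them in a different order: the training matrix adds the `sex` dummy right after the numeric columns, while the test matrix appends it last. Encoding the two sets separately can also produce different columns when a category is absent from one of them. A minimal sketch, with small hypothetical frames standing in for `set_train_X` and `set_test_X`, of reindexing the test matrix to the training layout:

```python
import pandas as pd

# Hypothetical stand-ins for set_train_X and set_test_X.
train_X = pd.DataFrame({
    "age": [39, 50],
    "sex_ Male": [1, 0],
    "race_ White": [1, 1],
    "race_ Other": [0, 0],
})
test_X = pd.DataFrame({
    "age": [25, 38],
    "race_ White": [0, 1],
    "sex_ Male": [1, 1],  # appended last, and "race_ Other" is missing
})

# Reorder to the training columns; dummies absent from the test set become 0,
# and any test-only columns are dropped.
aligned_test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(aligned_test_X)
```

Whether a difference in column order affects predictions depends on whether the estimator matches features by name or by position, so aligning explicitly removes the ambiguity.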

In [None]:
train_set_y = pd.get_dummies(train_set_y,prefix='wage_class').iloc[:,1:]

In [None]:
train_set_y

Unnamed: 0,wage_class_ >50K
0,0
1,0
2,0
3,0
4,0
...,...
32556,0
32557,1
32558,0
32559,0


In [None]:
test_set_y = pd.get_dummies(test_set_y,prefix='wage_class').iloc[:,1:]
test_set_y

Unnamed: 0,wage_class_ >50K.
0,0
1,0
2,1
3,1
4,0
...,...
16276,0
16277,0
16278,0
16279,0
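
The two column headers above differ (`wage_class_ >50K` for the training labels, `wage_class_ >50K.` for the test labels) because the labels in `adult.test` end with a period. A minimal sketch, on short hypothetical label series, of mapping both files to the same 0/1 target; the helper name `to_binary` is an assumption introduced for illustration:

```python
import pandas as pd

# Hypothetical samples in the raw format of each file.
train_labels = pd.Series([" <=50K", " >50K", " <=50K"])
test_labels = pd.Series([" <=50K.", " >50K.", " >50K."])

def to_binary(labels: pd.Series) -> pd.Series:
    """Return 1 for '>50K' and 0 otherwise, ignoring padding and a trailing period."""
    normalized = labels.str.strip().str.rstrip(".")
    return (normalized == ">50K").astype(int)

print(to_binary(train_labels).tolist())
print(to_binary(test_labels).tolist())
```

This returns a 1-D Series rather than a single-column DataFrame, which is the shape scikit-learn estimators expect for `y`.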


### AdaBoost Classifier Algorithm

In [None]:
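ada = AdaBoostClassifier()
# Pass the labels as a 1-D array; a single-column DataFrame triggers a DataConversionWarning.
ada.fit(set_train_X, train_set_y.values.ravel())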

AdaBoostClassifier()

In [None]:
y_predict = ada.predict(set_train_X)

In [None]:
y_predict

array([0, 1, 0, ..., 0, 0, 1], dtype=uint8)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
score = accuracy_score(train_set_y,y_predict)

#### Training accuracy score

In [None]:
score

0.860876508706735

In [None]:
y_test_predict = ada.predict(set_test_X)

#### Testing accuracy score

In [None]:
score2 = accuracy_score(test_set_y,y_test_predict)

In [None]:
score2

0.7791904674160064

### Gradient Boosting Classifier Algorithm

In [None]:
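gb = GradientBoostingClassifier()
# Pass the labels as a 1-D array; a single-column DataFrame triggers a DataConversionWarning.
gb.fit(set_train_X, train_set_y.values.ravel())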

GradientBoostingClassifier()

In [None]:
y_pred_gb_train = gb.predict(set_train_X) 

#### Training accuracy score (Gradient boosting classifier)

In [None]:
score_gb_train = accuracy_score(train_set_y,y_pred_gb_train)
score_gb_train

0.8668959798531987

In [None]:
y_pred_gb_test = gb.predict(set_test_X)

#### Testing accuracy score (Gradient boosting classifier)

In [None]:
score_gb_test = accuracy_score(test_set_y,y_pred_gb_test)
score_gb_test

0.7635280388182544

### XGBoost Classifier Algorithm

In [None]:
xgboost = XGBClassifier(learning_rate= 1, max_depth= 5, n_estimators= 50)
xgboost.fit(set_train_X,train_set_y)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
y_pred_xgb_train = xgboost.predict(set_train_X)

#### Training accuracy score (XGBoost classifier)

In [None]:
score_xgb_train = accuracy_score(train_set_y,y_pred_xgb_train)
score_xgb_train

0.9013236694204724

In [None]:
y_pred_xgb_test = xgboost.predict(set_test_X)

In [None]:
score_xgb_test = accuracy_score(test_set_y,y_pred_xgb_test)

#### Testing accuracy score (XGBoost classifier)

In [None]:
score_xgb_test

0.746760027025367

#### Parameter tuning

In [None]:
xgboost = XGBClassifier(learning_rate= 0.5, max_depth= 3, n_estimators= 150,n_jobs=5)
xgboost.fit(set_train_X,train_set_y)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.5, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=150, n_jobs=5, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
y_pred_xgb_train = xgboost.predict(set_train_X)

#### Training accuracy after parameter tuning (XGBoost classifier)

In [None]:
score_xgb_train = accuracy_score(train_set_y,y_pred_xgb_train)
score_xgb_train

0.8882098215656767

In [None]:
y_pred_xgb_test = xgboost.predict(set_test_X)

#### Testing accuracy after parameter tuning (XGBoost classifier)

In [None]:
score_xgb_test = accuracy_score(test_set_y,y_pred_xgb_test)

In [None]:
score_xgb_test

0.8114980652294085
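
The settings above were changed by hand and compared on the test set. A common alternative is a cross-validated grid search that uses the training data only. The sketch below is an illustration under assumptions: it uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in so that it runs without the census files, and the grid values are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (assumption: stands in for set_train_X / train_set_y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values for the same three parameters tuned by hand above.
param_grid = {
    "learning_rate": [0.1, 0.5],
    "max_depth": [2, 3],
    "n_estimators": [20, 50],
}

# 3-fold cross-validation over every combination, scored by accuracy.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`XGBClassifier` follows the scikit-learn estimator interface, so it could be passed to `GridSearchCV` in place of the stand-in. Keeping the test set out of the search leaves it available for a final, unbiased accuracy check.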

In [None]:
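import pickle

filename = 'xgboost_model.pickle'

# Save the fitted classifier (`xgboost`), not the imported module alias `xgb`.
with open(filename, 'wb') as f:
    pickle.dump(xgboost, f)

# Load it back to confirm the file round-trips.
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)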