### Part III. Hyper-Parameter Tuning with Gradient Boosting
In Part III, you will conduct a hyper-parameter tuning experiment with Gradient Boosting. 

In [66]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

0. For this experiment, utilize the same adult dataset (adult.data) as in Part II

1. Data Preparation

1. (a) Load the designated dataset. 

In [67]:
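# 1 (a). Note: adult.data has no header row, so with the default header=0
# read_csv consumes the first record as column names; the outputs below
# therefore report 32,560 rows.
data = pd.read_csv("adult.data")
data.columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "martial-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]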
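
Because adult.data ships without a header row, the default `read_csv` call treats the first record as column names. A minimal sketch of a loader that keeps every record is shown here, assuming the same column names as the notebook. The helper `load_adult` and the two-row `SAMPLE` string are hypothetical and only stand in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Same column names as the notebook (spelling kept as-is).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "martial-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

# Hypothetical two-row stand-in written in the adult.data layout (no header line).
SAMPLE = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)


def load_adult(source):
    """Read a headerless adult.data-style file without losing the first record."""
    # header=None tells pandas that the first line is data, not column names.
    return pd.read_csv(source, header=None, names=COLUMNS)


sample = load_adult(io.StringIO(SAMPLE))
print(sample.shape)
print(sample[["age", "workclass", "income"]])
```

With the real file, `load_adult("adult.data")` would be the equivalent call. Note that the string values keep their leading space, which is what the later comparison against `" >50K"` relies on.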

1. (b) Exhibit the first few rows of the dataset and show the count of instances and descriptive features
in the original data.

In [68]:
data.head(5)

,age,workclass,fnlwgt,education,education-num,martial-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


In [69]:
data.describe()  # summary statistics (count, mean, std, quartiles) for the numeric columns

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32560.0,32560.0,32560.0,32560.0,32560.0,32560.0
mean,38.581634,189781.8,10.08059,1077.615172,87.306511,40.437469
std,13.640642,105549.8,2.572709,7385.402999,402.966116,12.347618
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117831.5,9.0,0.0,0.0,40.0
50%,37.0,178363.0,10.0,0.0,0.0,40.0
75%,48.0,237054.5,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [70]:
num_instances, num_features = data.shape
print("Total instances: "+str(num_instances))
# data.shape counts every column, so this figure includes the target column income
print("Total features: "+str(num_features))

Total instances: 32560
Total features: 15


1. (c) Eliminate instances containing missing values. 

In [71]:
data = data.dropna()
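
In the Adult dataset, missing values are written as a `?` placeholder rather than left empty, so `dropna()` alone finds nothing to remove (the later `info()` output still lists 32,560 non-null rows). A minimal sketch of one way to drop those rows follows. The helper `drop_placeholder_rows` and the toy frame are hypothetical, and the approach assumes the placeholder may carry surrounding whitespace.

```python
import pandas as pd


def drop_placeholder_rows(df, columns, placeholder="?"):
    """Drop rows where any of the given text columns holds the placeholder."""
    # Compare on stripped text so " ?" and "?" are both treated as missing,
    # while the stored values themselves are left untouched.
    is_missing = df[columns].apply(lambda col: col.str.strip() == placeholder)
    return df[~is_missing.any(axis=1)]


# Toy frame mimicking the raw values (note the leading spaces).
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "workclass": [" State-gov", " ?", " Private"],
    "occupation": [" Adm-clerical", " ?", " Sales"],
})

cleaned = drop_placeholder_rows(toy, ["workclass", "occupation"])
print(len(toy), "->", len(cleaned))
```

An alternative under the same assumption is to pass `na_values="?"` together with `skipinitialspace=True` to `read_csv`, after which `dropna()` behaves as intended; that variant also strips the leading spaces, so string comparisons later in the notebook would need to match.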

1. (d) The class feature, INCOME, has two categorical values: ‘<=50K’ and ‘>50K’. Convert the target feature to binary 0/1, although this is generally not required for the Gradient Boosting algorithm.

In [72]:

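def get_label(label):
    # Raw values keep the space that follows each comma in adult.data,
    # hence the leading space in " >50K".
    if label == " >50K":
        return 1
    else:
        return 0

data["income"] = data["income"].apply(get_label)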
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32560 non-null  int64 
 1   workclass       32560 non-null  object
 2   fnlwgt          32560 non-null  int64 
 3   education       32560 non-null  object
 4   education-num   32560 non-null  int64 
 5   martial-status  32560 non-null  object
 6   occupation      32560 non-null  object
 7   relationship    32560 non-null  object
 8   race            32560 non-null  object
 9   sex             32560 non-null  object
 10  capital-gain    32560 non-null  int64 
 11  capital-loss    32560 non-null  int64 
 12  hours-per-week  32560 non-null  int64 
 13  native-country  32560 non-null  object
 14  income          32560 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 3.7+ MB
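
The exact-match comparison in `get_label` silently maps everything to 0 if the whitespace in the file differs from what is expected. A whitespace-tolerant, vectorized variant might look like the sketch below. The function `encode_income` and the toy series are hypothetical illustrations, not part of the original notebook.

```python
import pandas as pd


def encode_income(income: pd.Series) -> pd.Series:
    """Return 1 for '>50K' and 0 otherwise, ignoring surrounding whitespace."""
    return (income.str.strip() == ">50K").astype(int)


# Toy labels with and without the leading space.
toy_income = pd.Series([" <=50K", " >50K", ">50K", " <=50K"])
encoded = encode_income(toy_income)
print(encoded.tolist())  # [0, 1, 1, 0]
```

Checking `encoded.value_counts()` afterwards is a cheap way to confirm that both classes are present.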


1. (e) Execute Label Encoding for categorical variables. 
    - I shall use one-hot encoding, as in Part II.

In [73]:
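# sparse_output replaces the older sparse argument (scikit-learn >= 1.2);
# on earlier versions, pass sparse=False instead.
encoder = OneHotEncoder(drop='first', sparse_output=False)
categorical_cols = ['workclass', 'education', 'martial-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
encoded_values = encoder.fit_transform(data[categorical_cols])
# Reuse data.index so the concat below aligns rows even if dropna() removed some.
encoded_df = pd.DataFrame(encoded_values, columns=encoder.get_feature_names_out(categorical_cols), index=data.index)
data = data.drop(categorical_cols, axis=1)
data = pd.concat([data, encoded_df], axis=1)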




1. (f) Illustrate the first few rows of the modified data. How many descriptive features does the data
contain? Explain how it differs from the data before one-hot encoding.


In [74]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Columns: 101 entries, age to native-country_ Yugoslavia
dtypes: float64(94), int64(7)
memory usage: 25.1 MB


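- Answer:
    - The number of instances is unchanged (32,560).
    - The data now has 101 columns: 100 descriptive features plus the target income, up from 14 descriptive features before encoding.
    - Every categorical field has been replaced by binary indicator columns through one-hot encoding.
    - The dataset now contains only numerical features; no object (string) columns remain.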

1. (g) Split the data for model training and testing, allocating 30% for testing and the remaining 70%
for training.

In [75]:
y = data["income"]
X = data.drop(columns=["income"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
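
The split above is unseeded, so the rows in each partition (and the reported scores) can change between runs. A minimal sketch of a reproducible, class-balanced split follows, using synthetic stand-in data. The seed value 42 and the use of `stratify` are assumptions added here, not settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 20% positive class.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.3,    # 30% held out for testing, as in the notebook
    random_state=42,  # fixed seed so the split is repeatable
    stratify=y_toy,   # keep the class ratio similar in both partitions
)

print(len(X_tr), len(X_te))
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is much rarer than the other, as with the `>50K` label here.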

2. Hyper-parameter Tuning:

- In this experiment, we are primarily altering two hyper-parameters: the number of base learners and the learning rate, for the Gradient Boosting classifier. 

- For the number of individual decision trees for the base learners, employ 5, 10, and 50, and for the learning rate, select 0.01, 0.05, and 0.1. 

- Therefore, a total of 9 combinations will be considered to identify the optimum hyper-parameters.
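
The count of 9 combinations can be checked by enumerating the grid with scikit-learn's `ParameterGrid`. The sketch below also derives the number of cross-validation fits, assuming the 3-fold setting used later in this part.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [5, 10, 50],
    "learning_rate": [0.01, 0.05, 0.1],
}

combos = list(ParameterGrid(param_grid))
cv_folds = 3  # assumed from the 3-fold cross-validation used in the search
total_fits = len(combos) * cv_folds

for combo in combos:
    print(combo)
print("combinations:", len(combos), "| cross-validation fits:", total_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the full training set.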


2. (h) Execute a grid search to find the best hyper-parameter values among the provided combinations. 

During the search, use the prepared training data and a 3-fold cross-validation scheme for training and validation. For testing, employ the prepared test data, and use accuracy as the scoring metric.

In [76]:

param_grid = {
    'n_estimators': [5, 10, 50],
    'learning_rate': [0.01, 0.05, 0.1]
}

gb = GradientBoostingClassifier(random_state=42)
grid_search = GridSearchCV(gb, param_grid, cv=3, scoring='accuracy', return_train_score=True)
grid_search.fit(X_train, y_train)

best_gb = grid_search.best_estimator_
y_pred = best_gb.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the best model on the test set: "+ str(test_accuracy*100))


Accuracy of the best model on the test set: 86.33292383292384


2. (i) For every combination of the stated parameter values, present the average test score, standard
deviation of test scores, and rank test score (1, 2, 3..)

In [83]:
results = grid_search.cv_results_

print("Results for each hyper-parameter combination:\n")
print("{:<60} | {:<15} | {:<10} | {:<5}".format('Parameters', 'Mean Test Score', 'Std', 'Rank'))
print('-'*100)
for mean_score, std_score, params, rank in zip(results['mean_test_score'], results['std_test_score'], results['params'], results['rank_test_score']):
    print("{:<60} | {:<15.4f} | {:<10.4f} | {:<5}".format(str(params), mean_score, std_score, rank))


Results for each hyper-parameter combination:

Parameters                                                   | Mean Test Score | Std        | Rank 
----------------------------------------------------------------------------------------------------
{'learning_rate': 0.01, 'n_estimators': 5}                   | 0.7594          | 0.0001     | 7    
{'learning_rate': 0.01, 'n_estimators': 10}                  | 0.7594          | 0.0001     | 7    
{'learning_rate': 0.01, 'n_estimators': 50}                  | 0.8047          | 0.0011     | 4    
{'learning_rate': 0.05, 'n_estimators': 5}                   | 0.7594          | 0.0001     | 7    
{'learning_rate': 0.05, 'n_estimators': 10}                  | 0.8047          | 0.0011     | 4    
{'learning_rate': 0.05, 'n_estimators': 50}                  | 0.8526          | 0.0050     | 2    
{'learning_rate': 0.1, 'n_estimators': 5}                    | 0.8047          | 0.0011     | 4    
{'learning_rate': 0.1, 'n_estimators': 10}          
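
The same summary can also be built by loading `cv_results_` into a DataFrame and sorting by rank, which avoids manual string formatting. The sketch below runs on a small synthetic dataset from `make_classification`, so its scores are illustrative only and unrelated to the Adult results. The names `X_demo`, `y_demo`, and `search` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the example runs quickly on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    {"n_estimators": [5, 10, 50], "learning_rate": [0.01, 0.05, 0.1]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

columns = [
    "param_learning_rate", "param_n_estimators",
    "mean_test_score", "std_test_score", "rank_test_score",
]
table = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(table.to_string(index=False))
```

In the notebook, replacing `search` with the fitted `grid_search` object would give the table for the Adult data.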

2. (j) Present the performance report of the model with the best parameter setting, including
metrics such as accuracy, precision, recall, F1-score, etc.

In [80]:
best_gb = grid_search.best_estimator_
y_pred = best_gb.predict(X_test)
report = classification_report(y_test, y_pred)
print("\nPerformance report of the model with the best hyperparameters:\n")
print(report)


Performance report of the model with the best hyperparameters:

              precision    recall  f1-score   support

           0       0.88      0.96      0.91      7410
           1       0.80      0.57      0.67      2358

    accuracy                           0.86      9768
   macro avg       0.84      0.76      0.79      9768
weighted avg       0.86      0.86      0.85      9768
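
As a sanity check on the report, the F1-score of class 1 can be recomputed from its precision and recall, and the accuracy can be compared with a majority-class baseline derived from the support counts. The sketch below uses only the rounded figures printed above, so results agree with the report to two decimals at best.

```python
# Figures copied from the classification report above (already rounded).
support = {0: 7410, 1: 2358}
precision_1, recall_1 = 0.80, 0.57

# F1 is the harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy obtained by always predicting the majority class (0).
total = sum(support.values())
baseline_accuracy = max(support.values()) / total

print("test instances:", total)
print("F1 for class 1:", round(f1_1, 2))
print("majority-class baseline accuracy:", round(baseline_accuracy, 4))
```

The tuned model's 0.86 accuracy sits roughly ten points above the baseline of about 0.76, while the recall of 0.57 for class 1 shows that many `>50K` instances are still missed.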

