# Modelling

We will load the data and fit the following tree-based methods:
- Classification trees
- Random forests
- Gradient-boosted trees

The data is loaded using the code below.

In [22]:
import pandas as pd

# Load the saved datasets from the CSV files, using the first column as the index
X_train = pd.read_csv("../data/X_train.csv", index_col=0)
y_train = pd.read_csv("../data/y_train.csv", index_col=0)
X_test = pd.read_csv("../data/X_test.csv", index_col=0)
y_test = pd.read_csv("../data/y_test.csv", index_col=0)

The two main challenges with this data are missing values and class imbalance. We stored the missing values as `NaN`, which the check below confirms for the three affected columns.

In [23]:
# Check the unique values in the columns with missing data
print("Unique values in 'workclass':\n", X_train['workclass'].unique())
print("\nUnique values in 'occupation':\n", X_train['occupation'].unique())
print("\nUnique values in 'native-country':\n", X_train['native-country'].unique())

# Count the number of NaN values in the 'workclass' column only
nan_count = X_train['workclass'].isna().sum()

# Output the results
print(f"Number of missing values: {nan_count}")

Unique values in 'workclass':
 ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' nan
 'Self-emp-inc' 'Without-pay' 'Never-worked']

Unique values in 'occupation':
 ['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Tech-support' nan 'Protective-serv'
 'Machine-op-inspct' 'Priv-house-serv' 'Armed-Forces']

Unique values in 'native-country':
 ['United-States' 'Cuba' 'Jamaica' 'India' nan 'Mexico' 'South'
 'Puerto-Rico' 'England' 'Canada' 'Germany' 'Iran' 'Philippines' 'Italy'
 'Columbia' 'Cambodia' 'Thailand' 'Ecuador' 'Laos' 'Taiwan' 'Haiti'
 'Portugal' 'El-Salvador' 'Poland' 'France' 'Dominican-Republic'
 'Honduras' 'Guatemala' 'China' 'Japan' 'Yugoslavia' 'Peru'
 'Outlying-US(Guam-USVI-etc)' 'Trinadad&Tobago' 'Nicaragua' 'Greece'
 'Hong' 'Vietnam' 'Ireland' 'Scotland' 'Hungary' 'Holand-Netherlands']
Number of missing values: 2799
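
The count above covers only `workclass`. As a minimal sketch of how the same check could be extended to all three affected columns at once, the snippet below builds a small toy frame (hypothetical values standing in for `X_train`, not the real data) and reports the number and fraction of `NaN` values per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train (hypothetical values, not the real data)
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "occupation": ["Sales", np.nan, "Adm-clerical", np.nan],
    "native-country": ["United-States", "Cuba", np.nan, "United-States"],
    "age": [39, 50, 38, 53],
})

cols_with_missing = ["workclass", "occupation", "native-country"]

# Count and fraction of NaN values in each column
missing_summary = pd.DataFrame({
    "n_missing": toy[cols_with_missing].isna().sum(),
    "frac_missing": toy[cols_with_missing].isna().mean(),
})
print(missing_summary)
```

Replacing `toy` with `X_train` should give the corresponding summary for the real training data.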


To handle the missing data, we will use the following strategies:
- Treating missingness as its own category
- Surrogate splits
- Imputation
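
As a rough illustration of the first and third strategies, the sketch below uses a toy frame (hypothetical values, not the real data). The label `"Missing"` and the choice of mode imputation are assumptions made for the example; the modelling sections may use different choices. Surrogate splits are not shown because they operate inside the tree-fitting procedure rather than as a preprocessing step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the categorical columns with missing values (hypothetical)
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "occupation": ["Sales", np.nan, "Adm-clerical", "Sales"],
})

# Missingness as its own category: replace NaN with an explicit label
as_category = toy.fillna("Missing")

# Imputation: replace NaN with the most frequent value of each column
modes = toy.mode().iloc[0]   # first mode per column; NaN is ignored by default
imputed = toy.fillna(modes)

print(as_category)
print(imputed)
```

With real data, the imputation values would normally be computed on the training set only and then reused for the test set, so that no information from the test set leaks into the model.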

To address the class imbalance, we will use SMOTE (Synthetic Minority Over-sampling Technique).
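
The core idea of SMOTE is to create synthetic minority-class rows by interpolating between a minority point and one of its nearest minority-class neighbours. The snippet below is a simplified NumPy sketch of that idea only, not the implementation used later; the function `smote_sketch` and its parameters are hypothetical names introduced for illustration. It handles numeric features only, so categorical columns like the ones in this data would need to be encoded first or handled by a variant designed for mixed data.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority points
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each point (requires k < n)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)             # pick a random minority point
        nb = rng.choice(neighbours[base])  # pick one of its neighbours
        gap = rng.random()                 # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic

# Toy minority-class points (hypothetical values)
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
X_new = smote_sketch(X_min, n_new=6)
print(X_new.shape)  # (6, 2)
```

In practice a library such as imbalanced-learn provides this through `SMOTE` and its `fit_resample` method. Oversampling is applied to the training data only, so the test set keeps the original class proportions.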

We will explain each of these strategies as we use it in the modelling.

# Deciding on a Performance Measure

