# Modelling

We will load the data, and use the following tree-based methods:
- Classification trees
- Random forests
- Gradient boosted trees

The data is loaded using the code below.

In [1]:
import pandas as pd

# Load the saved datasets from the CSV files
X_train = pd.read_csv("../data/X_train.csv", index_col=0)  # Use the first column as index
y_train = pd.read_csv("../data/y_train.csv", index_col=0)  # Use the first column as index
X_test = pd.read_csv("../data/X_test.csv", index_col=0)    # Use the first column as index
y_test = pd.read_csv("../data/y_test.csv", index_col=0)    # Use the first column as index

The two main challenges with this data are missing values and class imbalance. Missing values are stored as `NaN`.

In [2]:
# Check the unique values in the columns with missing data
print("Unique values in 'workclass':\n", X_train['workclass'].unique())
print("\nUnique values in 'occupation':\n", X_train['occupation'].unique())
print("\nUnique values in 'native-country':\n", X_train['native-country'].unique())

# Count the number of NaN values in 'workclass'
nan_count = X_train['workclass'].isna().sum()

# Output the results
print(f"Number of missing values: {nan_count}")

Unique values in 'workclass':
 ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' nan
 'Self-emp-inc' 'Without-pay' 'Never-worked']

Unique values in 'occupation':
 ['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Tech-support' nan 'Protective-serv'
 'Machine-op-inspct' 'Priv-house-serv' 'Armed-Forces']

Unique values in 'native-country':
 ['United-States' 'Cuba' 'Jamaica' 'India' nan 'Mexico' 'South'
 'Puerto-Rico' 'England' 'Canada' 'Germany' 'Iran' 'Philippines' 'Italy'
 'Columbia' 'Cambodia' 'Thailand' 'Ecuador' 'Laos' 'Taiwan' 'Haiti'
 'Portugal' 'El-Salvador' 'Poland' 'France' 'Dominican-Republic'
 'Honduras' 'Guatemala' 'China' 'Japan' 'Yugoslavia' 'Peru'
 'Outlying-US(Guam-USVI-etc)' 'Trinadad&Tobago' 'Nicaragua' 'Greece'
 'Hong' 'Vietnam' 'Ireland' 'Scotland' 'Hungary' 'Holand-Netherlands']
Number of missing values: 2799
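
The count above covers only `workclass`. A per-column summary can be sketched as follows. The small `toy` frame is a hypothetical stand-in for `X_train` with made-up values, so the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train (made-up values, not the real data)
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "occupation": ["Sales", np.nan, "Adm-clerical", np.nan],
    "native-country": ["United-States", "Cuba", np.nan, "India"],
    "age": [39, 50, 38, 53],
})

# Count and fraction of NaN values in every column
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "frac_missing": toy.isna().mean(),
})
print(missing_summary)
```

Applying the same two calls to `X_train` would give the equivalent summary for the real training set.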


# About Missing Data

Little and Rubin describe the following three types of missingness [Chapter 3.1, 1].

- Missing completely at random (MCAR): Missingness is independent of any observed or unobserved variables. For example, a measuring instrument in a hospital accidentally breaks, so data cannot be collected for some patients. The resulting missingness is independent of both the observed and the unobserved characteristics of those patients.

- Missing at random (MAR): Missingness depends on the observed data but not on the missing data. Vateekul \cite{vateekul2009tree} gives the example that "respondents with lower education may be less likely to complete the entire survey". The missingness is caused not by the missing variable itself but by other, observed variables.

- Non-ignorable missingness (also called missing not at random, MNAR): Missingness that is neither MCAR nor MAR. In other words, whether a value is missing depends on the unobserved value itself. For example, some people may decline to report their income in a survey because of the income itself.
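
The three mechanisms can be illustrated with a small simulation. This is only a sketch on synthetic data: the variables `education` and `income`, the probabilities, and the thresholds are all assumptions chosen for illustration and are unrelated to the dataset used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical synthetic variables: education is always observed,
# income is the variable that may go missing
education = rng.integers(1, 17, size=n)  # years of education, 1..16
income = 10_000 + 3_000 * education + rng.normal(0, 5_000, size=n)

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of missing income depends only on the observed education
mar_prob = np.where(education < 8, 0.4, 0.1)
mar_mask = rng.random(n) < mar_prob

# Non-ignorable: the chance of missing income depends on income itself
mnar_prob = np.where(income > np.median(income), 0.4, 0.1)
mnar_mask = rng.random(n) < mnar_prob

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("Non-ignorable", mnar_mask)]:
    observed_mean = income[~mask].mean()
    missing_mean = income[mask].mean()
    print(f"{name}: mean income observed={observed_mean:.0f}, missing={missing_mean:.0f}")
```

Under MCAR the observed and missing groups should look alike. Under MAR they differ, but the difference is explained by the observed `education`. In the non-ignorable case the difference is driven by the unobserved incomes themselves, which is why it cannot be corrected from the observed data alone.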

From our EDA, we concluded that the missingness is not MCAR and is likely related to income. To contend with the missing data, we will use the following strategies:
- Surrogate splits
- Treating missingness as its own category
- Imputation
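
Two of these strategies can be sketched directly in pandas. This is a minimal sketch on a hypothetical `toy` frame: the category label `"Missing"` and the choice of most-frequent-value imputation are assumptions for illustration, not necessarily what the later sections use. Surrogate splits are a property of the tree-fitting algorithm and are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with missing values (made-up, not the real data)
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private", np.nan],
    "occupation": ["Sales", np.nan, "Adm-clerical", "Sales", "Craft-repair"],
})

# Treat missingness as its own category: replace NaN with an explicit label
as_category = toy.fillna("Missing")

# Imputation: replace NaN with the most frequent value of each column
modes = toy.mode().iloc[0]  # first mode of every column
imputed = toy.fillna(modes)

print(as_category)
print(imputed)
```

With a train/test split, the imputation values would normally be computed on the training set only and then applied to the test set, so that no information leaks from the test data.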

# About Imbalanced Data

When data is imbalanced, accuracy can be a misleading measure of performance: a classifier that always predicts the majority class achieves high accuracy while never identifying the minority class. We will look at performance metrics that remain informative on imbalanced data, and at a method called SMOTE (Synthetic Minority Over-sampling Technique) that tries to tackle the imbalance.
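
The problem with accuracy can be seen in a small worked example. The labels below are hypothetical (a 90/10 class split), and recall and balanced accuracy are used only as examples of imbalance-aware metrics. They are an assumption here, not necessarily the metrics adopted later.

```python
# Hypothetical labels: 90 majority-class (0) and 10 minority-class (1) examples
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # baseline that always predicts the majority class


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of examples with the given true label that are predicted correctly."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, 1)
# Balanced accuracy: the mean of the per-class recalls
balanced_acc = (recall(y_true, y_pred, 0) + recall(y_true, y_pred, 1)) / 2

print(f"accuracy={acc:.2f}, minority recall={minority_recall:.2f}, "
      f"balanced accuracy={balanced_acc:.2f}")
```

The baseline reaches 90% accuracy yet finds none of the minority class, which the recall of 0 and the balanced accuracy of 0.5 make visible.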

We will explain each of these strategies as we use them in the modelling. 

In the next section, we look at classification trees, and explore these methods of dealing with missingness and imbalance.