### Setup Jupyter notebook

In [6]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install --upgrade pip

Collecting pip
  Obtaining dependency information for pip from https://files.pythonhosted.org/packages/ef/7d/500c9ad20238fcfcb4cb9243eede163594d7020ce87bd9610c9e02771876/pip-24.3.1-py3-none-any.whl.metadata
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)

In [7]:
pip install pandas


Collecting pandas
  Using cached pandas-2.2.3-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Using cached pandas-2.2.3-cp312-cp312-win_amd64.whl (11.5 MB)
Using cached pytz-2024.2-py2.py3-none-any.whl (508 kB)
Installing collected packages: pytz, pandas
Successfully installed pandas-2.2.3 pytz-2024.2
Note: you may need to restart the kernel to use updated packages.


In [8]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install joblib

Note: you may need to restart the kernel to use updated packages.


### Train ML algorithms

In [10]:
import json # will be needed for saving preprocessing details
import numpy as np # for data manipulation
import pandas as pd # for data manipulation
from sklearn.model_selection import train_test_split # will be used for data split
from sklearn.preprocessing import LabelEncoder # for preprocessing
from sklearn.ensemble import RandomForestClassifier # for training the algorithm
from sklearn.ensemble import ExtraTreesClassifier # for training the algorithm
import joblib # for saving algorithm and preprocessing objects

### Loading data

In [11]:
# load dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv', skipinitialspace=True)
x_cols = [c for c in df.columns if c != 'income']
# set input matrix and target column
X = df[x_cols]
y = df['income']
# show first rows of data
df.head()

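|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |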


In [12]:
df.shape

(32561, 15)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


The X matrix has 32,561 rows and 14 columns. This is the input data for our algorithm; each row describes one person. The y vector has 32,561 values indicating whether that person's income exceeds 50K per year.

Before starting data preprocessing, we will split our data into training and testing subsets. We will use 30% of the data for testing.

In [14]:
# data split train / test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1234)
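
To see what a 30% split produces, here is a minimal sketch on a made-up 20-row frame (the `toy` data and the `X_toy`/`y_toy` names are assumptions for illustration, not part of the notebook). It also passes `stratify`, an optional argument the cell above does not use, which keeps the class proportions the same in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 people, two balanced income classes.
toy = pd.DataFrame({
    "age": list(range(20, 40)),
    "income": ["<=50K", ">50K"] * 10,
})
X_toy = toy[["age"]]
y_toy = toy["income"]

# Same test_size and random_state as the notebook; stratify is an optional extra.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1234, stratify=y_toy
)

print(len(X_tr), len(X_te))          # sizes of the two subsets
print(y_te.value_counts().to_dict())  # class counts in the test subset
```

Fixing `random_state` makes the split reproducible, so rerunning the cell yields the same rows in each subset.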

### Data preprocessing
Our dataset contains missing values and categorical columns. To train the ML algorithm, I will use the Random Forest algorithm from the sklearn package. In its current implementation, it cannot handle missing values and categorical columns, which is why we need to apply preprocessing algorithms.

To fill the missing values, we will use the most frequent value in each column (there are many other filling methods; the one I selected is for example purposes only).

In [15]:
# fill missing values
train_mode = dict(X_train.mode().iloc[0])
X_train = X_train.fillna(train_mode)
print(train_mode)

{'age': np.float64(31.0), 'workclass': 'Private', 'fnlwgt': np.int64(121124), 'education': 'HS-grad', 'education-num': np.float64(9.0), 'marital-status': 'Married-civ-spouse', 'occupation': 'Prof-specialty', 'relationship': 'Husband', 'race': 'White', 'sex': 'Male', 'capital-gain': np.float64(0.0), 'capital-loss': np.float64(0.0), 'hours-per-week': np.float64(40.0), 'native-country': 'United-States'}


In [16]:
train_mode

{'age': np.float64(31.0),
 'workclass': 'Private',
 'fnlwgt': np.int64(121124),
 'education': 'HS-grad',
 'education-num': np.float64(9.0),
 'marital-status': 'Married-civ-spouse',
 'occupation': 'Prof-specialty',
 'relationship': 'Husband',
 'race': 'White',
 'sex': 'Male',
 'capital-gain': np.float64(0.0),
 'capital-loss': np.float64(0.0),
 'hours-per-week': np.float64(40.0),
 'native-country': 'United-States'}

As you can see from `train_mode`, the most frequent value in the `age` column, for example, is 31.0.

Let's convert the categorical columns to numbers. I will use `LabelEncoder` from the sklearn package:
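
The same dictionary has to be reused for any data the model sees later. The sketch below illustrates this on made-up frames (`train` and `new_rows` are hypothetical names and values, not taken from the notebook): the modes are computed from the training rows only and then applied to new rows.

```python
import numpy as np
import pandas as pd

# Hypothetical training rows with one missing value per column.
train = pd.DataFrame({
    "age": [31, 31, 45, np.nan],
    "workclass": ["Private", None, "Private", "State-gov"],
})
# Hypothetical rows arriving later (for example, the test subset).
new_rows = pd.DataFrame({
    "age": [np.nan, 52],
    "workclass": [None, "Private"],
})

# Most frequent value per column, computed on the training rows only.
train_mode = dict(train.mode().iloc[0])

train_filled = train.fillna(train_mode)
# Reuse the training modes; do not recompute them on the new rows.
new_filled = new_rows.fillna(train_mode)

print(train_mode)
print(new_filled)
```

When a column has several equally frequent values, `mode()` returns them sorted, so `.iloc[0]` picks the smallest one.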

In [17]:
# convert categoricals
encoders = {}
for column in ['workclass', 'education', 'marital-status',
                'occupation', 'relationship', 'race',
                'sex','native-country']:
    categorical_convert = LabelEncoder()
    X_train[column] = categorical_convert.fit_transform(X_train[column])
    encoders[column] = categorical_convert
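
The fitted encoders are kept in a dictionary so that the same mapping can be applied to other data later. A minimal sketch of that reuse, on made-up frames (the `train` and `new_rows` data are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training rows with two categorical columns.
train = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
})

# Fit one encoder per column, as in the cell above.
encoders = {}
for column in ["sex", "race"]:
    categorical_convert = LabelEncoder()
    train[column] = categorical_convert.fit_transform(train[column])
    encoders[column] = categorical_convert

# Apply the already fitted encoders to new rows: transform, not fit_transform.
new_rows = pd.DataFrame({"sex": ["Female"], "race": ["White"]})
for column, encoder in encoders.items():
    new_rows[column] = encoder.transform(new_rows[column])

print(new_rows)

# A category never seen during fitting is rejected with a ValueError.
try:
    encoders["race"].transform(["Other"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)
```

`LabelEncoder` assigns integers in sorted order of the class labels, so the codes depend only on which categories were present when the encoder was fitted.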

### Algorithms training
Data is ready, so we can train our Random Forest algorithm.

In [18]:
# train the Random Forest algorithm
rf = RandomForestClassifier(n_estimators = 100)
rf = rf.fit(X_train, y_train)

In [19]:
# train the Extra Trees algorithm
et = ExtraTreesClassifier(n_estimators = 100)
et = et.fit(X_train, y_train)
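
The held-out split can be used to compare the two models. The sketch below does this on synthetic data from `make_classification` rather than on the notebook's data, because `X_test` would first need the same missing-value fill and encoding as `X_train`. The `models` and `scores` names and the `random_state` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1234
)

# random_state is fixed here only to make the sketch reproducible.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training subset and record test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)

print(scores)
```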

As you can see, training the algorithm is easy: just 2 lines of code, far fewer than reading and preprocessing the data. Now let's save the algorithm we created. The important thing to note is that the ML algorithm is not only the `rf` and `et` variables (with the model weights); we also need to save the preprocessing variables `train_mode` and `encoders`. To save them, I will use the joblib package.

In [20]:
# save preprocessing objects and RF algorithm
joblib.dump(train_mode, "./train_mode.joblib", compress=True)
joblib.dump(encoders, "./encoders.joblib", compress=True)
joblib.dump(rf, "./random_forest.joblib", compress=True)
joblib.dump(et, "./extra_trees.joblib", compress=True)

['./extra_trees.joblib']
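
To use the saved files later, the three objects are loaded back and applied in the same order as during training: fill, encode, predict. The following is a minimal round-trip sketch on made-up data, written to a temporary directory instead of the notebook's paths (the toy frame, labels, and variable names are assumptions for illustration).

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data and labels.
train = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "sex": ["Male", "Female", "Female", "Male"],
})
labels = ["<=50K", ">50K", "<=50K", ">50K"]

# Same three artifacts as in the notebook: modes, encoders, model.
train_mode = dict(train.mode().iloc[0])
encoders = {"sex": LabelEncoder().fit(train["sex"])}
train["sex"] = encoders["sex"].transform(train["sex"])
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(train, labels)

with tempfile.TemporaryDirectory() as tmp:
    # Save everything needed for prediction.
    joblib.dump(train_mode, os.path.join(tmp, "train_mode.joblib"), compress=True)
    joblib.dump(encoders, os.path.join(tmp, "encoders.joblib"), compress=True)
    joblib.dump(rf, os.path.join(tmp, "random_forest.joblib"), compress=True)

    # Load it back, as a separate serving process would.
    loaded_mode = joblib.load(os.path.join(tmp, "train_mode.joblib"))
    loaded_encoders = joblib.load(os.path.join(tmp, "encoders.joblib"))
    loaded_rf = joblib.load(os.path.join(tmp, "random_forest.joblib"))

# Preprocess a new row with the loaded objects, then predict.
new_row = pd.DataFrame({"age": [np.nan], "sex": ["Male"]})
new_row = new_row.fillna(loaded_mode)
new_row["sex"] = loaded_encoders["sex"].transform(new_row["sex"])
prediction = loaded_rf.predict(new_row)

print(prediction)
```

Files written by `joblib.dump` are pickles, so they should only be loaded from trusted sources and preferably with the same library versions that wrote them.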