Tabular data for classification

Dataset: Titanic

Index

Feature preprocessing

Categorical

Numerical

Ordinal

Text: See the NLP session

Date or Time: See the time series session

Feature generation

Models

Validation

Classification metrics

Conclusion

Categorical features

Ordinal Encoding (aka Label Encoding)

👉 Use this encoding for tree-based models (Random Forest, Gradient Boosting...)
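As a minimal sketch of ordinal encoding, assuming scikit-learn and pandas are available: the column names mimic the Titanic dataset, but the four rows below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for two Titanic-style categorical columns (illustrative values)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Each category is replaced by an integer code.
# By default the categories of each column are sorted before being numbered.
encoder = OrdinalEncoder()
codes = encoder.fit_transform(df[["Sex", "Embarked"]])
encoded = pd.DataFrame(codes, columns=["Sex", "Embarked"]).astype(int)

print(encoded)
```

The integer codes carry no real order here, which tree-based models tolerate because they split on thresholds rather than fitting a slope.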

One-hot Encoding

👉 Use this encoding for non-tree-based models (Linear models, Neural Networks, Support Vector Machines...)
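A minimal sketch of one-hot encoding, assuming pandas is available. `pd.get_dummies` is used here for brevity; scikit-learn's `OneHotEncoder` is an alternative when the encoding must be reused on new data. The rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for a Titanic-style categorical column (illustrative values)
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One binary column per category; exactly one of them is 1 in each row
one_hot = pd.get_dummies(df, columns=["Embarked"], dtype=int)

print(one_hot)
```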

Numerical features

Integers or decimals. E.g.: age, measurements, prices, ...

Ordinal features

Categories with an order. We cannot assume that the intervals between them are equal. E.g.: driving licence class, education level, ticket type.
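For ordinal features the order has to be supplied explicitly, because alphabetical sorting rarely matches the real ranking. A minimal sketch, assuming pandas; the education levels and their order are hypothetical examples.

```python
import pandas as pd

# Hypothetical education levels, listed from lowest to highest
order = ["primary", "secondary", "university"]
education = pd.Series(["primary", "university", "secondary", "primary"])

# An ordered categorical maps each level to its position in `order`
codes = pd.Categorical(education, categories=order, ordered=True).codes

print(list(codes))
```

The codes preserve the ranking only; the gap between 0 and 1 is not guaranteed to mean the same as the gap between 1 and 2.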

Feature generation: CREATIVITY + DOMAIN KNOWLEDGE

If you have the price of a house and its area in square metres, you can add the price per square metre.
If you have the distance along the x and y axes, you can add the straight-line distance using the Pythagorean theorem.
If you have prices, you can add the fractional part, because people perceive it very subjectively.
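The three ideas above can be sketched with pandas and NumPy. The column names (`price`, `area_m2`, `dx`, `dy`) and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical columns and values, for illustration only
df = pd.DataFrame({
    "price": [250000.0, 99999.99],
    "area_m2": [100.0, 50.0],
    "dx": [3.0, 6.0],
    "dy": [4.0, 8.0],
})

# Ratio feature: price per square metre
df["price_per_m2"] = df["price"] / df["area_m2"]

# Straight-line distance from the x and y components (Pythagorean theorem)
df["distance"] = np.hypot(df["dx"], df["dy"])

# Fractional part of the price (e.g. 0.99 in 99999.99)
df["price_fraction"] = df["price"] % 1

print(df)
```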

Models

Model	Comment	Library	More info
Decision Tree	Simple and interpretable.	Sklearn
Linear models	Simple and interpretable.	Sklearn
Random Forest	Good starting point (tree ensemble)	Sklearn
Gradient Boosting	Usually the best (tree ensemble)	XGBoost, LightGBM, CatBoost
Neural Network	Good with a lot of data.	Fast.ai v2	blog

Jeremy Howard on Twitter: Our advice for tabular modeling

We have two approaches to tabular modelling: decision tree ensembles, and neural networks. And we have mentioned two different decision tree ensembles: random forests, and gradient boosting. Each is very effective, but each also has compromises:

Decision Tree

Random Forest

Random forests are the easiest to train, because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods.

Gradient Boosting

Gradient boosting machines are in theory just as fast to train as random forests, but in practice you will have to try lots of different hyperparameters. They can overfit. At inference time they will be less fast, because they cannot operate in parallel. But they are often a little bit more accurate than random forests.

	sklearn Random Forest	XGBoost Gradient Boosting	LightGBM Gradient Boosting	Try
🔷 Number of trees	n_estimators	num_round 💡	num_iterations 💡	100
🔷 Max depth of the tree	max_depth	max_depth	max_depth	7
🔶 Min cases per final tree leaf	min_samples_leaf	min_child_weight	min_data_in_leaf
🔷 % of rows used to build the tree	max_samples	subsample	bagging_fraction	0.8
🔷 % of feats used to build the tree	max_features	colsample_bytree	feature_fraction
🔷 Speed of training	NOT IN FOREST	eta	learning_rate
🔶 L1 regularization	NOT IN FOREST	alpha	lambda_l1
🔶 L2 regularization	NOT IN FOREST	lambda	lambda_l2
Random seed	random_state	seed	seed

🔷: Raising this parameter increases model capacity. Decrease it if the model overfits, increase it if it underfits.

🔶: Raising this parameter regularizes the model. Increase it if the model overfits, decrease it if it underfits.

💡: For Gradient Boosting it is usually better to use early stopping than to set a fixed number of trees.
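The early stopping idea from the 💡 note can be sketched as follows. To keep the example dependency-free it uses scikit-learn's `GradientBoostingClassifier` instead of XGBoost or LightGBM, and a synthetic dataset instead of Titanic; both are substitutions made for illustration, and the hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data as a stand-in for a real tabular dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    subsample=0.8,            # fraction of rows used per tree
    validation_fraction=0.2,  # rows held out internally to monitor the score
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
model.fit(X, y)

# Number of trees actually fitted before early stopping kicked in
n_trees_used = model.n_estimators_
print(n_trees_used)
```

XGBoost and LightGBM offer their own early stopping mechanisms based on an explicit validation set; check their documentation for the exact arguments in your installed version.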

Neural Network

Neural networks take the longest time to train, and require extra preprocessing such as normalisation; this normalisation needs to be used at inference time as well. They can provide great results, and extrapolate well, but only if you are careful with your hyperparameters, and are careful to avoid overfitting.
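The point about reusing the normalisation at inference time can be sketched with scikit-learn's `StandardScaler`: the statistics are learned once from the training data and then applied unchanged to new rows. The two columns (age, fare) and their values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training rows: [age, fare]
X_train = np.array([[20.0, 7.25], [30.0, 71.0], [40.0, 8.05]])
# A new row seen only at inference time
X_new = np.array([[30.0, 10.0]])

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
# Reuse the same statistics at inference time; do not refit on new data
X_new_scaled = scaler.transform(X_new)

print(X_new_scaled)
```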

Validation

Cross Validation
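A minimal sketch of k-fold cross validation with scikit-learn, assuming a synthetic dataset in place of Titanic. Stratified folds are chosen here so that each fold keeps the class proportions; that choice is an assumption, not something this README prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data as a stand-in for a real tabular dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5 folds: each row is used for validation exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a more stable estimate than a single train/validation split.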

Conclusion

We suggest starting your analysis with a random forest. This will give you a strong baseline, and you can be confident that it's a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data.
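A random forest baseline along these lines might look like the sketch below. It uses synthetic data rather than Titanic, takes the "Try" values from the hyperparameter table (100 trees, depth 7, 80% of rows per tree), and uses impurity-based feature importances as one possible input to feature selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data as a stand-in for a real tabular dataset
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=7, max_samples=0.8, random_state=0
)
forest.fit(X_train, y_train)

# Baseline score that later models have to beat
valid_accuracy = forest.score(X_valid, y_valid)

# Feature indices sorted from most to least important
ranking = np.argsort(forest.feature_importances_)[::-1]

print(valid_accuracy, ranking)
```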

From that foundation, you can try Gradient Boosting and Neural Nets, and if they give you significantly better results on your validation set in a reasonable amount of time, you can use them.

	Tree-based models	Non-tree-based models
Examples	Decision Tree, Random Forest, Extra Trees, AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost	Linear Models, Neural Networks, K-Nearest Neighbors, Support Vector Machines
Categorical, Ordinal	Label encoding, Frequency encoding	One-hot encoding, Embedding
Numerical	Nothing	MinMaxScaler, StandardScaler. Skewed? np.log(1+x), np.sqrt(x+2/3), Box-Cox transform

Map data to a normal distribution: Box-Cox

A Box Cox transformation is a generic way to transform non-normal variables into a normal shape.

Lambda value (λ)	Transformed data
-3	Y⁻³ = 1/Y³
-2	Y⁻² = 1/Y²
-1	Y⁻¹ = 1/Y¹
-0.5	Y⁻⁰·⁵ = 1/√Y
0	log(Y)
0.5	Y⁰·⁵ = √Y
1	Y¹
2	Y²
3	Y³
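The table above lists the simplified power for each λ. The usual definition of the Box-Cox transform also subtracts 1 and divides by λ, which keeps the transform continuous as λ approaches 0. A minimal NumPy sketch of that definition follows; the helper name `box_cox` is hypothetical.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of strictly positive data for a given lambda."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)             # limiting case for lambda = 0
    return (y ** lam - 1.0) / lam    # power transform, shifted and rescaled

# Right-skewed toy data
data = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

print(box_cox(data, 0))    # log transform
print(box_cox(data, 0.5))  # square-root-like transform
```

In practice λ is usually estimated from the data; `scipy.stats.boxcox` does this when called without a fixed λ.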
