# Exploring feature importance in a Random Forest model

This analysis aims to determine two things:
1. Whether the interaction terms add predictive power to the model.
2. Whether the categorical variable `Type`, which was one-hot encoded, is relevant to the model.

To do so, we train a `RandomForestClassifier` with the same hyperparameters each time, varying only the features used. We then compare the models' results and analyze the feature importances in each one.
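
The comparison procedure described above can be sketched as a small helper that fits a fresh copy of the same estimator on each column subset and scores it with macro recall. This is only an illustration under stated assumptions: the helper name `compare_feature_sets`, the synthetic data, and the column names `f0`..`f5` are hypothetical and are not part of the notebook.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
import pandas as pd


def compare_feature_sets(model, X_train, X_test, y_train, y_test, feature_sets):
    """Fit a fresh copy of `model` on each column subset and return macro recall."""
    scores = {}
    for name, columns in feature_sets.items():
        # clone() copies the hyperparameters but not any fitted state
        fitted = clone(model).fit(X_train[columns], y_train)
        y_pred = fitted.predict(X_test[columns])
        scores[name] = recall_score(y_test, y_pred, average="macro", zero_division=0)
    return scores


# Synthetic stand-in data (hypothetical; not the notebook's dataset)
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

feature_sets = {
    "all": list(X.columns),
    "first_four": ["f0", "f1", "f2", "f3"],
}
model = RandomForestClassifier(n_estimators=50, random_state=95)
scores = compare_feature_sets(model, X_tr, X_te, y_tr, y_te, feature_sets)
print(scores)
```

Cloning the estimator for each subset keeps the hyperparameters identical across runs while avoiding any carry-over of fitted state from one feature set to the next.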

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Data loading

In [2]:
X_train = pd.read_csv('data/X_train_extended.csv')
X_test = pd.read_csv('data/X_test_extended.csv')
y_train = pd.read_csv('data/y_train.csv')
y_test = pd.read_csv('data/y_test.csv')

y_train = y_train.drop(columns=["Machine failure"])
y_test = y_test.drop(columns=["Machine failure"])

In [3]:
X_train.head()

| | Type_L | Type_M | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Temperature difference [K] | Rotational power |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | -0.354642 | -1.014674 | -0.239039 | -0.059097 | -0.440162 | -0.792467 | -0.100567 |
| 1 | 1 | 0 | -1.65121 | -1.486322 | 0.5936 | -0.539312 | 2.045097 | 1.102016 | -0.295095 |
| 2 | 0 | 1 | -1.002926 | -1.284187 | -1.110534 | 1.161451 | 1.400188 | 0.104919 | 0.891225 |
| 3 | 1 | 0 | -0.753586 | 0.063378 | 1.498401 | -1.109568 | -1.383932 | 1.600564 | -0.751564 |
| 4 | 1 | 0 | 0.094169 | 0.400269 | -0.599849 | 0.231034 | 0.078911 | 0.404048 | 0.05469 |


In [4]:
y_train.head()

| | Failure type |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


# Model definition

In [5]:
random_forest = RandomForestClassifier(
    n_estimators=200,
    random_state=95,
    max_depth=10,
    class_weight='balanced',
    min_samples_leaf=10,
    min_samples_split=10,
    max_features=5
)
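
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces scikit-learn's balanced weighting, `n_samples / (n_classes * count_c)`, by hand and compares it with `compute_class_weight`. The class counts are borrowed from the support column of the classification reports further down, which describe the test set. Using them here is an assumption made for illustration, and the training-set counts may differ.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative class counts (taken from the test-set support column;
# the training-set counts are an assumption and may differ)
counts = np.array([1930, 9, 21, 16, 16, 3, 5])
y = np.repeat(np.arange(len(counts)), counts)

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
manual = len(y) / (len(counts) * counts)
weights = compute_class_weight(
    class_weight="balanced", classes=np.arange(len(counts)), y=y
)

for label, w in enumerate(weights):
    print(f"class {label}: weight {w:.3f}")
```

With counts this skewed, the rarest class receives a weight several hundred times larger than the majority class, which is consistent with the pattern of high recall and low precision on the minority classes in the reports.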

# Original features only

In [6]:
X_train_original = X_train.drop(columns=["Temperature difference [K]", "Rotational power"])
X_test_original = X_test.drop(columns=["Temperature difference [K]", "Rotational power"])
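
The two columns dropped here are the extended (interaction) features. The notebook does not show how they were built, so the following is only one plausible construction, assuming that `Temperature difference [K]` is process temperature minus air temperature and that `Rotational power` is torque times rotational speed, both computed before scaling. The raw values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw (unscaled) readings; values are made up for illustration
raw = pd.DataFrame({
    "Air temperature [K]": [298.1, 298.2, 298.1],
    "Process temperature [K]": [308.6, 308.7, 308.5],
    "Rotational speed [rpm]": [1551, 1408, 1498],
    "Torque [Nm]": [42.8, 46.3, 49.4],
})

# Assumed definitions: the notebook does not show how these were computed
raw["Temperature difference [K]"] = (
    raw["Process temperature [K]"] - raw["Air temperature [K]"]
)
raw["Rotational power"] = raw["Torque [Nm]"] * raw["Rotational speed [rpm]"]

print(raw[["Temperature difference [K]", "Rotational power"]])
```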

In [7]:
# Pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
random_forest.fit(X_train_original, y_train.values.ravel())


In [8]:
y_pred_original = random_forest.predict(X_test_original)

In [9]:
class_report = classification_report(y_test, y_pred_original)
conf_matrix = confusion_matrix(y_test, y_pred_original)

In [10]:
print(class_report)

              precision    recall  f1-score   support

           0       1.00      0.94      0.97      1930
           1       0.09      0.67      0.15         9
           2       0.56      0.90      0.69        21
           3       0.40      1.00      0.57        16
           4       0.46      0.69      0.55        16
           5       0.00      0.00      0.00         3
           6       0.33      0.60      0.43         5

    accuracy                           0.94      2000
   macro avg       0.40      0.69      0.48      2000
weighted avg       0.98      0.94      0.95      2000



In [11]:
print(conf_matrix)

[[1815   64   15   22   12    1    1]
 [   3    6    0    0    0    0    0]
 [   0    0   19    1    1    0    0]
 [   0    0    0   16    0    0    0]
 [   0    0    0    0   11    0    5]
 [   3    0    0    0    0    0    0]
 [   1    0    0    1    0    0    3]]
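
As a check on how the report and the matrix relate, the per-class precision and recall can be recomputed from the confusion matrix printed above. Rows are true classes and columns are predicted classes, which is the `confusion_matrix` convention. The sketch uses only NumPy and the printed numbers.

```python
import numpy as np

# Confusion matrix printed above (rows = true class, columns = predicted class)
cm = np.array([
    [1815, 64, 15, 22, 12, 1, 1],
    [3, 6, 0, 0, 0, 0, 0],
    [0, 0, 19, 1, 1, 0, 0],
    [0, 0, 0, 16, 0, 0, 0],
    [0, 0, 0, 0, 11, 0, 5],
    [3, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 3],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # correct / actual members of the class
precision = true_positives / cm.sum(axis=0)  # correct / predictions of the class

for label, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
print(f"macro recall: {recall.mean():.2f}")
```

Class 1 illustrates the trade-off: 6 of its 9 samples are recovered, but 64 samples of class 0 are also labeled as class 1, so its precision is only 6/70.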


## Feature importance

In [12]:
feat_importance_dict = {
    feature_name: importance
    for feature_name, importance in zip(X_train_original.columns, random_forest.feature_importances_)
}

### Individual feature importance

In [13]:
# Print the feature importance ordered by importance in percentage with 2 decimal places
for feature, importance in sorted(feat_importance_dict.items(), key=lambda x: x[1], reverse=True):
    print(f"{feature}: {importance*100:.2f}%")

Torque [Nm]: 35.67%
Tool wear [min]: 25.95%
Rotational speed [rpm]: 17.53%
Air temperature [K]: 10.77%
Process temperature [K]: 8.12%
Type_L: 1.18%
Type_M: 0.79%


### Cumulative feature importance

In [14]:
# Print cumulative feature importance in percentage with 2 decimal places
cumulative_importance = 0
for feature, importance in sorted(feat_importance_dict.items(), key=lambda x: x[1], reverse=True):
    cumulative_importance += importance
    print(f"{feature}: {cumulative_importance*100:.2f}%")

Torque [Nm]: 35.67%
Tool wear [min]: 61.61%
Rotational speed [rpm]: 79.14%
Air temperature [K]: 89.91%
Process temperature [K]: 98.03%
Type_L: 99.21%
Type_M: 100.00%
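
The same ranking and running total can be produced with a pandas `Series`, which also makes it easy to ask how many features are needed to reach a given share of the total importance. The values below are the rounded percentages printed above, and the 95% threshold is a hypothetical cut-off chosen only for illustration.

```python
import pandas as pd

# Importances copied from the output above (rounded percentages)
importances = pd.Series({
    "Torque [Nm]": 35.67,
    "Tool wear [min]": 25.95,
    "Rotational speed [rpm]": 17.53,
    "Air temperature [K]": 10.77,
    "Process temperature [K]": 8.12,
    "Type_L": 1.18,
    "Type_M": 0.79,
})

ranked = importances.sort_values(ascending=False)
summary = pd.DataFrame({"importance_%": ranked, "cumulative_%": ranked.cumsum()})
print(summary.round(2))

# Smallest leading set of features that reaches the chosen share of importance
threshold = 95.0  # hypothetical cut-off
n_needed = int((summary["cumulative_%"] < threshold).sum()) + 1
print(summary.index[:n_needed].tolist())
```

Because the inputs are already rounded, the running total can differ from the notebook's output in the last decimal place.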


# Original and extended features

In [15]:
# Pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
random_forest.fit(X_train, y_train.values.ravel())


In [16]:
y_pred = random_forest.predict(X_test)

In [17]:
# zero_division=0 reports 0.00 (without a warning) for classes that are never predicted
class_report = classification_report(y_test, y_pred, zero_division=0)
conf_matrix = confusion_matrix(y_test, y_pred)


In [18]:
print(class_report)

              precision    recall  f1-score   support

           0       1.00      0.96      0.98      1930
           1       0.10      0.78      0.17         9
           2       0.83      0.95      0.89        21
           3       0.94      1.00      0.97        16
           4       0.52      0.75      0.62        16
           5       0.00      0.00      0.00         3
           6       0.44      0.80      0.57         5

    accuracy                           0.95      2000
   macro avg       0.55      0.75      0.60      2000
weighted avg       0.98      0.95      0.97      2000



In [19]:
print(conf_matrix)

[[1849   65    3    1   11    0    1]
 [   2    7    0    0    0    0    0]
 [   0    0   20    0    0    0    1]
 [   0    0    0   16    0    0    0]
 [   1    0    0    0   12    0    3]
 [   3    0    0    0    0    0    0]
 [   0    0    1    0    0    0    4]]


## Feature importance

In [20]:
feat_importance_dict = {
    feature_name: importance
    for feature_name, importance in zip(X_train.columns, random_forest.feature_importances_)
}

### Individual feature importance

In [21]:
# Print the feature importance ordered by importance in percentage with 2 decimal places
for feature, importance in sorted(feat_importance_dict.items(), key=lambda x: x[1], reverse=True):
    print(f"{feature}: {importance*100:.2f}%")

Tool wear [min]: 24.15%
Rotational power: 20.52%
Torque [Nm]: 18.03%
Temperature difference [K]: 16.62%
Rotational speed [rpm]: 10.10%
Air temperature [K]: 4.17%
Process temperature [K]: 3.96%
Type_M: 1.27%
Type_L: 1.17%


### Cumulative feature importance

In [22]:
# Print cumulative feature importance in percentage with 2 decimal places
cumulative_importance = 0
for feature, importance in sorted(feat_importance_dict.items(), key=lambda x: x[1], reverse=True):
    cumulative_importance += importance
    print(f"{feature}: {cumulative_importance*100:.2f}%")

Tool wear [min]: 24.15%
Rotational power: 44.67%
Torque [Nm]: 62.70%
Temperature difference [K]: 79.32%
Rotational speed [rpm]: 89.42%
Air temperature [K]: 93.59%
Process temperature [K]: 97.56%
Type_M: 98.83%
Type_L: 100.00%
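
Note that the importance of `Torque [Nm]` falls from 35.67% to 18.03% once the extended features enter the model. `feature_importances_` is impurity-based, and related features tend to share importance between them (here presumably because `Rotational power` is derived from torque and speed, which is an assumption). Permutation importance on held-out data is a complementary check. The sketch below shows the call on synthetic stand-in data. It is a different technique from the one used in the notebook, and its data and column names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd

# Synthetic stand-in data (hypothetical; not the notebook's dataset)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=2, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=95).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in macro recall
result = permutation_importance(
    model, X_te, y_te, scoring="recall_macro", n_repeats=5, random_state=0
)

perm = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm)
```

Scoring with `recall_macro` keeps this check aligned with the metric the notebook treats as most important.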


# Original and extended features without the categorical variable

In [23]:
X_train_no_cat = X_train.drop(columns=["Type_L", "Type_M"])
X_test_no_cat = X_test.drop(columns=["Type_L", "Type_M"])

In [24]:
# Pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
random_forest.fit(X_train_no_cat, y_train.values.ravel())


In [25]:
y_pred_no_cat = random_forest.predict(X_test_no_cat)

In [26]:
# zero_division=0 reports 0.00 (without a warning) for classes that are never predicted
class_report = classification_report(y_test, y_pred_no_cat, zero_division=0)
conf_matrix = confusion_matrix(y_test, y_pred_no_cat)


In [27]:
print(class_report)

              precision    recall  f1-score   support

           0       1.00      0.95      0.98      1930
           1       0.09      0.78      0.17         9
           2       0.87      0.95      0.91        21
           3       0.94      1.00      0.97        16
           4       0.54      0.94      0.68        16
           5       0.00      0.00      0.00         3
           6       0.44      0.80      0.57         5

    accuracy                           0.95      2000
   macro avg       0.55      0.77      0.61      2000
weighted avg       0.98      0.95      0.97      2000



In [28]:
print(conf_matrix)

[[1843   68    2    1   13    0    3]
 [   2    7    0    0    0    0    0]
 [   0    0   20    0    0    0    1]
 [   0    0    0   16    0    0    0]
 [   0    0    0    0   15    0    1]
 [   3    0    0    0    0    0    0]
 [   0    0    1    0    0    0    4]]
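
To see where the change in macro recall comes from, the per-class recall of the extended model and of the model without `Type` can be compared directly from the two printed confusion matrices. The helper name `per_class_recall` is hypothetical, and the matrices are copied from the outputs above.

```python
import numpy as np


def per_class_recall(cm):
    """Recall per class: diagonal divided by row sums (rows = true class)."""
    cm = np.asarray(cm)
    return np.diag(cm) / cm.sum(axis=1)


# Extended features, with Type_L / Type_M
extended = [
    [1849, 65, 3, 1, 11, 0, 1],
    [2, 7, 0, 0, 0, 0, 0],
    [0, 0, 20, 0, 0, 0, 1],
    [0, 0, 0, 16, 0, 0, 0],
    [1, 0, 0, 0, 12, 0, 3],
    [3, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 4],
]
# Extended features, without Type_L / Type_M
no_cat = [
    [1843, 68, 2, 1, 13, 0, 3],
    [2, 7, 0, 0, 0, 0, 0],
    [0, 0, 20, 0, 0, 0, 1],
    [0, 0, 0, 16, 0, 0, 0],
    [0, 0, 0, 0, 15, 0, 1],
    [3, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 4],
]

delta = per_class_recall(no_cat) - per_class_recall(extended)
changed = np.flatnonzero(~np.isclose(delta, 0.0)).tolist()
for label in changed:
    print(f"class {label}: recall change {delta[label]:+.4f}")
```

Only classes 0 and 4 differ between the two models, and the gain in class 4 corresponds to three additional correctly classified samples out of 16.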


## Feature importance

In [29]:
feat_importance_dict = {
    feature_name: importance
    for feature_name, importance in zip(X_train_no_cat.columns, random_forest.feature_importances_)
}

### Individual feature importance

In [30]:
# Print the feature importance ordered by importance in percentage with 2 decimal places
for feature, importance in sorted(feat_importance_dict.items(), key=lambda x: x[1], reverse=True):
    print(f"{feature}: {importance*100:.2f}%")

Rotational power: 24.49%
Tool wear [min]: 23.59%
Torque [Nm]: 17.62%
Temperature difference [K]: 17.36%
Rotational speed [rpm]: 8.57%
Air temperature [K]: 4.31%
Process temperature [K]: 4.06%


### Cumulative feature importance

In [31]:
# Print cumulative feature importance in percentage with 2 decimal places
cumulative_importance = 0
for feature, importance in sorted(feat_importance_dict.items(), key=lambda x: x[1], reverse=True):
    cumulative_importance += importance
    print(f"{feature}: {cumulative_importance*100:.2f}%")

Rotational power: 24.49%
Tool wear [min]: 48.08%
Torque [Nm]: 65.70%
Temperature difference [K]: 83.06%
Rotational speed [rpm]: 91.63%
Air temperature [K]: 95.94%
Process temperature [K]: 100.00%


# Conclusions from the feature importance analysis
1. The individual importance of `Type_L` and `Type_M` is very low, which suggests that they add little predictive power to the model.
    1. With only the original features, their importances sum to `1.97%`.
    2. With the original and extended features, their importances sum to `2.44%`.
2. By macro-average recall, which is our most important metric, the models rank as follows:
    1. Extended without the categorical variable: `0.77`.
    2. Extended: `0.75`.
    3. Original: `0.69`.
3. By macro-average F1-score, the models rank as follows:
    1. Extended without the categorical variable: `0.61`.
    2. Extended: `0.60`.
    3. Original: `0.48`.

The extended features raise macro recall from `0.69` to `0.75` and macro F1 from `0.48` to `0.60`, so the interaction terms do add predictive power. Dropping `Type` then gives a further, smaller gain. We therefore conclude that **the best model is the one that uses the original and extended features without the categorical variable `Type`**.