# Hepatitis

This notebook shows the results obtained with the trees we build and compares them with the paper's own results.

The paper looks for the RF (Random Forest) and BR (Bootstrap Rate) configurations that give the highest classification accuracy.

As baseline hyperparameter values we use the scikit-learn (v1.1.3) defaults, denoted RF(base) (RF = Random Forest):
- Number of trees: `nt = 100`
- Maximum tree depth: `md = None` (no depth limit)
- Function measuring the quality of a split: `qs = "gini"` (Gini impurity)
- Min. number of instances required to split an internal node: `mn = 2`
- Min. count of obs. necessary to constitute a leaf node: `ml = 1`
- Number of attributes to consider when looking for the best split: `nf = "sqrt"` (square root of the number of features)
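
As a rough sketch, the base configuration above can be written as keyword arguments for scikit-learn's `RandomForestClassifier`. The names `PAPER_TO_SKLEARN`, `RF_BASE` and `to_sklearn_kwargs` are hypothetical helpers introduced here for illustration; only the argument names on the right-hand side of the mapping come from the estimator itself.

```python
# Hypothetical mapping from the short names used above to the
# corresponding RandomForestClassifier keyword arguments.
PAPER_TO_SKLEARN = {
    "nt": "n_estimators",
    "md": "max_depth",
    "qs": "criterion",
    "mn": "min_samples_split",
    "ml": "min_samples_leaf",
    "nf": "max_features",
}

# RF(base): the default values listed above.
RF_BASE = {"nt": 100, "md": None, "qs": "gini", "mn": 2, "ml": 1, "nf": "sqrt"}


def to_sklearn_kwargs(config):
    """Rename the short keys so the dict can be unpacked into the estimator."""
    return {PAPER_TO_SKLEARN[key]: value for key, value in config.items()}


base_kwargs = to_sklearn_kwargs(RF_BASE)
print(base_kwargs)
```

With scikit-learn installed, `RandomForestClassifier(**base_kwargs)` would then build the base model.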

In addition, 17 modifications of RF(base) are tested in search of the best accuracy:

- RF(nt 200), RF(nt 500): number of trees equals 200 or 500, respectively
- RF(md 10), RF(md 15), RF(md 20), RF(md 25): maximum depth of a tree equals 10, 15, 20, or 25, respectively
- RF(qs ent): split quality is measured using Shannon entropy (information gain)
- RF(mn 3), RF(mn 4), RF(mn 6), RF(mn 8): minimum number of observations required to split an internal node is equal to 3, 4, 6, or 8, respectively
- RF(ml 2), RF(ml 3), RF(ml 4), RF(ml 5): minimum number of instances per leaf is 2, 3, 4, or 5, respectively
- RF(nf log), RF(nf all): number of features considered in a node split equals the logarithm with base 2 of the number of attributes or all features are taken into account, respectively
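
The list above changes one hyperparameter at a time, which gives 18 configurations in total once RF(base) is included. A minimal sketch of how they could be enumerated follows; `MODIFICATIONS` and `build_configurations` are hypothetical names, and the values `"ent"`, `"log"` and `"all"` are the labels used in this notebook rather than scikit-learn values.

```python
# RF(base) values, as listed earlier in this notebook.
RF_BASE = {"nt": 100, "md": None, "qs": "gini", "mn": 2, "ml": 1, "nf": "sqrt"}

# Alternative values tried for each hyperparameter (one change at a time).
MODIFICATIONS = {
    "nt": [200, 500],
    "md": [10, 15, 20, 25],
    "qs": ["ent"],
    "mn": [3, 4, 6, 8],
    "ml": [2, 3, 4, 5],
    "nf": ["log", "all"],
}


def build_configurations(base, modifications):
    """Return {label: config} with RF(base) plus one variant per modified value."""
    configs = {"RF(base)": dict(base)}
    for name, values in modifications.items():
        for value in values:
            variant = dict(base)   # copy so the base stays untouched
            variant[name] = value  # change exactly one hyperparameter
            configs[f"RF({name} {value})"] = variant
    return configs


configs = build_configurations(RF_BASE, MODIFICATIONS)
print(len(configs))  # 18: RF(base) plus the 17 variants
```

Before passing a variant to `RandomForestClassifier`, the labels would need translating: scikit-learn spells the entropy criterion `"entropy"`, the base-2 logarithm option `"log2"`, and uses `max_features=None` to consider all features.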

Each configuration is tested under the following BRs (Bootstrap Rates): 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 2.0, 3.0, 4.0, and 5.0.
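
Assuming BR denotes the ratio between the size of each tree's bootstrap sample and the size of the training set (an interpretation that is not spelled out in this notebook), a rate of 2.0 would draw twice as many instances, with replacement, as the training set holds. Under that assumption, the sampling step might look like the standard-library sketch below; `bootstrap_indices` is a hypothetical helper, not a scikit-learn function.

```python
import random

BOOTSTRAP_RATES = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 2.0, 3.0, 4.0, 5.0]


def bootstrap_indices(n_samples, bootstrap_rate, rng):
    """Draw round(bootstrap_rate * n_samples) row indices with replacement."""
    sample_size = round(bootstrap_rate * n_samples)
    return rng.choices(range(n_samples), k=sample_size)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_train = 103           # size of the training split obtained further down

for br in BOOTSTRAP_RATES:
    indices = bootstrap_indices(n_train, br, rng)
    # Print the rate, the number of draws, and how many distinct rows were hit.
    print(br, len(indices), len(set(indices)))
```

As far as we can tell, the `max_samples` argument of `RandomForestClassifier` only covers rates up to 1.0, so rates above 1.0 would need custom sampling along these lines.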

In [6]:
# Imports and variables

from ucimlrepo import fetch_ucirepo

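params = {
  'nt': [100, 200, 500],         # number of trees
  'md': [None, 10, 15, 20, 25],  # maximum tree depth (None = no limit)
  'qs': ["gini", "ent"],         # split quality: Gini impurity or entropy
  'mn': [2, 3, 4, 6, 8],         # min. instances to split an internal node
  'ml': [1, 2, 3, 4, 5],         # min. instances per leaf
  'nf': ["sqrt", "log", "all"]   # features per split: sqrt, log2, or all
}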

In [7]:
# Fetch the dataset from the UC Irvine Machine Learning Repository

hepatitis = fetch_ucirepo(id=46)

In [17]:
X = hepatitis.data.features
y = hepatitis.data.targets

In [27]:
X.head(10)

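| | Age | Sex | Steroid | Antivirals | Fatigue | Malaise | Anorexia | Liver Big | Liver Firm | Spleen Palpable | Spiders | Ascites | Varices | Bilirubin | Alk Phosphate | Sgot | Albumin | Protime | Histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 2 | 1.0 | 2 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1 |
| 1 | 50 | 1 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1 |
| 2 | 78 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1 |
| 3 | 31 | 1 | NaN | 1 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1 |
| 4 | 34 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1 |
| 5 | 34 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 95.0 | 28.0 | 4.0 | 75.0 | 1 |
| 6 | 51 | 1 | 1.0 | 2 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | 1 |
| 7 | 23 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | 1 |
| 8 | 39 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | NaN | 48.0 | 4.4 | NaN | 1 |
| 9 | 30 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 120.0 | 3.9 | NaN | 1 |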


In [26]:
y.head(10)

| | Class |
|---|---|
| 0 | 2 |
| 1 | 2 |
| 2 | 2 |
| 3 | 2 |
| 4 | 2 |
| 5 | 2 |
| 6 | 1 |
| 7 | 2 |
| 8 | 2 |
| 9 | 2 |


In [5]:
hepatitis.variables

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | Class | Target | Categorical | | | | no |
| 1 | Age | Feature | Integer | | | | no |
| 2 | Sex | Feature | Categorical | | | | no |
| 3 | Steroid | Feature | Categorical | | | | yes |
| 4 | Antivirals | Feature | Categorical | | | | no |
| 5 | Fatigue | Feature | Categorical | | | | yes |
| 6 | Malaise | Feature | Categorical | | | | yes |
| 7 | Anorexia | Feature | Categorical | | | | yes |
| 8 | Liver Big | Feature | Categorical | | | | yes |
| 9 | Liver Firm | Feature | Categorical | | | | yes |


In [20]:
from sklearn.model_selection import train_test_split

# Hold out 33% of the rows for testing; the fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [21]:
X_train.shape, X_test.shape

((103, 19), (52, 19))
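
The shapes above follow from how `train_test_split` turns a fractional `test_size` into row counts: when no `train_size` is given, the test count is the ceiling of `test_size * n_samples` and the remaining rows go to training. A quick check, assuming the dataset has 155 rows in total (inferred from 103 + 52):

```python
import math

n_samples = 155   # total rows, inferred from 103 + 52
test_size = 0.33

n_test = math.ceil(test_size * n_samples)  # 0.33 * 155 = 51.15, rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 103 52
```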

In [72]:
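from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Random forest with 500 trees; all other hyperparameters keep their defaults
rfc = RandomForestClassifier(n_estimators=500)

# y_train is a single-column DataFrame; flatten it to 1-D so that
# fit() does not warn about receiving a column vector
rfc.fit(X_train, y_train.values.ravel())
y_pred = rfc.predict(X_test)

print(f'Model accuracy: {accuracy_score(y_test, y_pred):0.4f}')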

Model accuracy: 0.8077
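
Accuracy is the fraction of test instances whose predicted class matches the true class. With 52 test rows, the reported 0.8077 is consistent with 42 correct predictions; that count is inferred from the printed value, not taken from the run itself. A minimal sketch of the computation, using made-up label lists and a hypothetical `accuracy` helper:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels only: 52 instances, 42 of them predicted correctly.
y_true = [2] * 52
y_pred = [2] * 42 + [1] * 10

print(f"{accuracy(y_true, y_pred):0.4f}")  # 0.8077
```

Because the classifier is created without a `random_state`, the exact figure can change from one run to the next.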
