# Classification with RIPPER
In this notebook we build a classifier with RIPPER, one of the most popular rule-based classifiers.

In [1]:
import pandas as pd
from os import path
import numpy as np
from preprocessing import get_train_test_data

X_train, y_train, X_test, y_test, columns_to_keep = get_train_test_data()

To begin, we create a target variable named `top_20` based on the column `position`.
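As a rough illustration of this step, the sketch below derives a binary target from a `position` column on toy data. The threshold rule (`position <= 20` with one-based positions) is an assumption made for the example; the actual definition lives in the preprocessing module and may differ.

```python
import pandas as pd

# Toy data; in the notebook the `position` column comes from the preprocessing step.
df = pd.DataFrame({"position": [1, 5, 19, 20, 57]})

# Assumption: positions are one-based, so "top 20" means position <= 20.
df["top_20"] = (df["position"] <= 20).astype(int)

print(df)
```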

Next, we discretize the data to make it suitable for RIPPER, using the `discretize_data` function.

Finally, we will train the RIPPER classifier using the discretized data.

In [2]:
# Discretize variables by mapping each distinct value to an integer code.
# Input: the dataset and the list of variable names to discretize.
def discretize_data(dataset, variables):
    for variable in variables:
        # Get the variable's unique values in sorted order
        var = sorted(dataset[variable].unique())

        # Map each value to its rank in the sorted order (0, 1, 2, ...)
        mapping = dict(zip(var, range(len(var))))

        # Add a new column with the integer representation of the variable
        dataset[variable + '_num'] = dataset[variable].map(mapping).astype(int)
    return dataset

In [3]:
X_train = discretize_data(X_train, [col for col in columns_to_keep if col != "top_20"])

X_test = discretize_data(X_test, [col for col in columns_to_keep if col != "top_20"])

X_train.head()

Unnamed: 0,bmi,career_points,career_duration(days),debut_year,difficulty_score,competitive_age,is_tarmac,points,climbing_efficiency,startlist_quality,bmi_num,career_points_num,career_duration(days)_num,debut_year_num,difficulty_score_num,competitive_age_num,is_tarmac_num,points_num,climbing_efficiency_num,startlist_quality_num
0,23.765432,68034.221635,6233.0,1977.0,0.635375,22,True,100.0,0.006796,1241,510,2827,2580,7,483,4,1,8,394,563
1,20.897959,29429.221635,5212.0,1974.0,0.635375,27,True,100.0,0.006796,1241,254,2348,2433,4,483,9,1,8,394,563
2,22.790329,15880.0,2972.0,1977.0,0.635375,24,True,100.0,0.006796,1241,437,1720,1674,7,483,6,1,8,394,563
3,21.46915,6600.0,3606.0,1970.0,0.635375,30,True,100.0,0.006796,1241,309,924,1929,0,483,12,1,8,394,563
4,21.295295,17245.0,2192.0,1977.0,0.635375,27,True,100.0,0.006796,1241,293,1796,1268,7,483,9,1,8,394,563
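Because `discretize_data` is called separately on `X_train` and `X_test`, each call builds its own value-to-code mapping, so the same raw value can receive different codes in the two sets. A minimal sketch of one way to keep the codes aligned is to fit the mapping on the training column and reuse it on the test column. The helper names `fit_mapping` and `apply_mapping`, and the choice of `-1` for values unseen in training, are assumptions for illustration and not part of the original notebook.

```python
import pandas as pd

def fit_mapping(train_col):
    """Map each distinct training value to its rank in sorted order."""
    values = sorted(train_col.unique())
    return {value: code for code, value in enumerate(values)}

def apply_mapping(col, mapping, unseen_code=-1):
    """Encode a column with a fitted mapping; unseen values get `unseen_code`."""
    return col.map(mapping).fillna(unseen_code).astype(int)

# Toy data standing in for one column of X_train / X_test.
train = pd.DataFrame({"bmi": [20.9, 23.8, 21.5]})
test = pd.DataFrame({"bmi": [23.8, 19.0]})

mapping = fit_mapping(train["bmi"])
train["bmi_num"] = apply_mapping(train["bmi"], mapping)
test["bmi_num"] = apply_mapping(test["bmi"], mapping)

print(train)
print(test)
```

With this approach a value such as 23.8 gets the same code in both sets, while a test value that never appeared in training is flagged explicitly instead of silently receiving an unrelated rank.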


Now we search for the best hyperparameters for the RIPPER classifier using the `GridSearchCV` function.

Finally, we evaluate the model using the `classification_report` function.

As the classification report shows, the RIPPER classifier reaches an accuracy of 0.85. However, while precision and recall are high for the negative class, they are low for the positive class. This is likely due to the class imbalance in the dataset. We try to address this issue by adding a `class_weights` entry to the parameter grid of the RIPPER classifier.

In [4]:
import wittgenstein as lw
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score



model = lw.RIPPER()

param_grid = {
    'k': [1, 2],  
    'max_rules': [10, 20], 
    'prune_size': [0.33],
    'class_weights': [{0: 1, 1: 25}]  
}

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,  
    verbose=1
)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)


report = classification_report(y_test, y_pred)
best_params = grid_search.best_params_
accuracy = accuracy_score(y_test, y_pred)


# Pass values as separate arguments: concatenating str with a dict or float raises TypeError
print("best parameters:", best_params)
print("accuracy:", accuracy)
print(report)

Fitting 3 folds for each of 4 candidates, totalling 12 fits


KeyboardInterrupt: 
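The grid search above selects models by accuracy, which on an imbalanced target tends to reward predicting the majority class. The sketch below shows how `GridSearchCV` can instead select on the F1 score of the positive class. To keep it fast and self-contained it uses synthetic data and a scikit-learn `DecisionTreeClassifier` as a stand-in estimator; neither is part of the original notebook, and the parameter grid is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 15% positives), standing in for the real dataset.
X, y = make_classification(
    n_samples=600, n_features=6, weights=[0.85, 0.15], random_state=0
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # stand-in, not RIPPER
    param_grid={"max_depth": [2, 4], "class_weight": [None, "balanced"]},
    scoring="f1",  # F1 of the positive class instead of accuracy
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The only change relevant to the notebook's search is the `scoring` argument; the rest of the `GridSearchCV` call would stay as it is.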

Now we print the rules generated by the model. The rules provide insight into the decision making process of the model.

In [7]:
# Structured display of the learned rules
for rule in best_model.ruleset_:
    print(rule)


[delta_num=<69.0^profile_num=2^startlist_quality_num=<231.0]
[delta_num=<69.0^profile_num=2^is_tarmac_num=0]
[delta_num=<69.0^profile_num=2^length_num=249.0-485.0^points_num=<3.0]
[delta_num=<69.0^profile_num=2^startlist_quality_num=231.0-311.0^length_num=249.0-485.0]
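To make the printed rules concrete: each line is one rule whose conditions are joined by `^` (logical AND), and a sample is predicted positive when at least one rule matches. The sketch below evaluates the first two rules by hand on made-up rows. It assumes that `=<69.0` denotes the bin of values below 69 and that bin boundaries are exclusive; the exact boundary handling of the library is not verified here.

```python
import pandas as pd

# Toy rows using column names from the printed rules; the values are made up.
rows = pd.DataFrame({
    "delta_num": [10, 10, 80],
    "profile_num": [2, 2, 2],
    "startlist_quality_num": [100, 400, 100],
    "is_tarmac_num": [1, 0, 1],
})

# Rule 1: delta_num < 69 AND profile_num == 2 AND startlist_quality_num < 231
rule_1 = (
    (rows["delta_num"] < 69)
    & (rows["profile_num"] == 2)
    & (rows["startlist_quality_num"] < 231)
)

# Rule 2: delta_num < 69 AND profile_num == 2 AND is_tarmac_num == 0
rule_2 = (
    (rows["delta_num"] < 69)
    & (rows["profile_num"] == 2)
    & (rows["is_tarmac_num"] == 0)
)

# A row is predicted positive if any rule fires (logical OR over rules).
rows["predicted_positive"] = rule_1 | rule_2
print(rows)
```

All four printed rules share the conditions on `delta_num` and `profile_num`, so in this reading the model can only predict the positive class for rows with a low `delta_num` code and `profile_num` equal to 2.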


In [None]:
print(report)

              precision    recall  f1-score   support

           0       0.86      0.98      0.92     30219
           1       0.41      0.10      0.16      5187

    accuracy                           0.85     35406
   macro avg       0.64      0.54      0.54     35406
weighted avg       0.80      0.85      0.81     35406
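A quick check on these numbers helps put the accuracy in context. The sketch below uses only the support counts, precision, and recall printed in the report to compute the accuracy of a trivial classifier that always predicts class 0, and to recompute the F1 score of class 1.

```python
# Support counts taken from the classification report above.
support_negative = 30219
support_positive = 5187
total = support_negative + support_positive

# Accuracy of always predicting the majority class (0).
majority_baseline = support_negative / total

# F1 is the harmonic mean of precision and recall for class 1.
precision_pos, recall_pos = 0.41, 0.10
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"F1 for class 1: {f1_pos:.2f}")
```

The baseline comes out at roughly the same 0.85 as the model's accuracy, so accuracy alone says little here; the recall of 0.10 on class 1 is the more informative figure.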



Again, with the new class weights, we evaluate the model using the `classification_report` function. The accuracy of the model is still 0.85, and the per-class scores are the same as before. This suggests that the model is too simple to capture the complexity of the data.