# CHAID Algorithm

### CHAID is a decision tree algorithm based on chi-squared statistics.

The algorithm computes a chi-squared statistic between each feature in the dataset and the target, and splits the data into subsets on the feature with the highest value.
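For reference, the standard Pearson chi-squared statistic for a contingency table of one feature against the target is given below. Here $O_{ij}$ is the observed number of rows with feature value $i$ and target class $j$, $E_{ij}$ is the count expected if the feature and the target were independent, $R_i$ and $C_j$ are the row and column totals, and $N$ is the number of rows. These symbols are introduced here for illustration only, and chefboost's internal computation may differ in detail.

$$
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{N}
$$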

It then repeats the same process on each subset, splitting it further (while ignoring features already chosen higher up that branch of the tree), until every feature has been used at some point or until all rows in the subset have the same target value.
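The procedure described above can be sketched in plain Python. This is a simplified illustration, not chefboost's implementation: it scores features with the standard Pearson chi-squared statistic and omits the category merging and significance tests that full CHAID implementations use. The helper names (`chi_squared`, `build_tree`) and the toy rows are hypothetical.

```python
from collections import Counter

def chi_squared(rows, feature, target):
    """Pearson chi-squared statistic between one feature and the target."""
    n = len(rows)
    observed = Counter((r[feature], r[target]) for r in rows)
    feature_totals = Counter(r[feature] for r in rows)
    target_totals = Counter(r[target] for r in rows)
    stat = 0.0
    for f, f_total in feature_totals.items():
        for t, t_total in target_totals.items():
            expected = f_total * t_total / n  # count expected under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat

def build_tree(rows, features, target):
    """Recursively split on the feature with the highest chi-squared value."""
    labels = Counter(r[target] for r in rows)
    # Stop when the subset is pure or no features are left: return the majority label.
    if len(labels) == 1 or not features:
        return labels.most_common(1)[0][0]
    best = max(features, key=lambda f: chi_squared(rows, f, target))
    remaining = [f for f in features if f != best]  # ignore features already used
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

# Hypothetical toy data, loosely shaped like the Titanic columns
rows = [
    {"Sex": "male", "Embarked": "S", "Survived": 0},
    {"Sex": "male", "Embarked": "S", "Survived": 0},
    {"Sex": "male", "Embarked": "S", "Survived": 0},
    {"Sex": "male", "Embarked": "C", "Survived": 1},
    {"Sex": "female", "Embarked": "S", "Survived": 1},
    {"Sex": "female", "Embarked": "S", "Survived": 1},
    {"Sex": "female", "Embarked": "C", "Survived": 1},
    {"Sex": "female", "Embarked": "C", "Survived": 1},
]

tree = build_tree(rows, ["Sex", "Embarked"], "Survived")
print(tree)
# {'Sex': {'female': 1, 'male': {'Embarked': {'C': 1, 'S': 0}}}}
```

In this toy data `Sex` has the higher statistic, so it is chosen at the root. The female branch is already pure, while the male branch is split again on the only remaining feature.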

- Note - the chefboost module does not currently have a visualization feature, so I couldn't draw the tree, since it was too complicated to do with other tools.


In [1]:
from chefboost import Chefboost as cb
import pandas as pd
import numpy as np
import os

# Pass path components separately so os.path.join uses the right separator on any OS
base_data_path = "datasets"
titanic_train_path = os.path.join(base_data_path, "titanic", "train.csv")
titanic_test_path = os.path.join(base_data_path, "titanic", "test.csv")
titanic_test_results_path = os.path.join(
    base_data_path, "titanic", "gender_submission.csv")
print(titanic_train_path)
print(titanic_test_path)
print(titanic_test_results_path)


datasets\titanic\train.csv
datasets\titanic\test.csv
datasets\titanic\gender_submission.csv


# The Dataset

### The dataset contains data on the passengers of the Titanic. The model's goal is to predict who survived and who didn't, based on their features.

- Some features are irrelevant and were removed. If you remove more features, the model accuracy may improve, but this might result in overfitting.
- You can experiment with which features to remove and compare the results.
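- The dataset is split into train, test, and test results (the file is called gender_submission). Note that on Kaggle this file is a sample submission that assumes all and only female passengers survive, so accuracy measured against it shows agreement with that baseline rather than with the true outcomes.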


In [2]:
train_data = pd.read_csv(titanic_train_path)
# irrelavant_featues = ['PassengerId','Name','Ticket','Cabin','Pclass','Age','SibSp','Parch','Fare','Embarked']
irrelavant_featues = ['PassengerId', 'Name', 'Ticket', 'Cabin']
train_data = train_data.drop(irrelavant_featues, axis=1)
# Some casting was required for the chefboost module
train_data['Survived'] = train_data['Survived'].astype(object)
train_data.head(n=10)


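|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |
| 5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | Q |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S |
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.075 | S |
| 8 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 | S |
| 9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | C |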


# The Model

### The model is a basic CHAID model that tries to predict which passengers survived the Titanic.

- Note - The chefboost module is supposed to report the accuracy of the model, but for some reason it shows 0 (probably because of a data type mismatch). For that reason, I created a separate evaluation process below.
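One plausible way a type mismatch could produce 0% accuracy, not verified against chefboost's source, is that labels of different types never compare equal. The sketch below uses made-up values and a hypothetical `accuracy` helper to show the effect and why casting both sides to the same type avoids it.

```python
def accuracy(predictions, actuals):
    """Fraction of positions where prediction and actual value are equal."""
    matches = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return matches / len(actuals)

# Hypothetical values: predictions come back as strings, labels are integers
predictions = ["0", "1", "1", "0"]
actuals = [0, 1, 1, 0]

# "1" == 1 is False in Python, so nothing matches
print(accuracy(predictions, actuals))                    # 0.0

# After casting to a common type, every prediction matches
print(accuracy([int(p) for p in predictions], actuals))  # 1.0
```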


In [3]:
config = {'algorithm': "CHAID"}

tree = cb.fit(train_data, config=config, target_label="Survived")


[INFO]:  2 CPU cores will be allocated in parallel running
CHAID  tree is going to be built...
-------------------------
finished in  17.869871854782104  seconds
-------------------------
Evaluate  train set
-------------------------
Accuracy:  0.0 % on  891  instances
Labels:  [0 1]
Confusion matrix:  [[0, 0], [0, 0]]
Precision:  0.0 %, Recall:  0.0 %, F1:  0.0 %


# Evaluation

### The evaluation takes the test data, gets predictions from the model, and compares them to the real data.


In [4]:
test_data = pd.read_csv(titanic_test_path)
test_data = test_data.drop(irrelavant_featues, axis=1)
test_real_values = pd.read_csv(titanic_test_results_path)
count = 0
for index, row in test_data.iterrows():
    prediction = cb.predict(tree, row)
    real_value = test_real_values.iloc[index]['Survived']
    # Cast both sides to int so the comparison does not depend on their types
    if int(prediction) == int(real_value):
        count += 1
print(f"accuracy = {(count/len(test_data))*100}%")


accuracy = 89.95215311004785%
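The chefboost report shown earlier also lists a confusion matrix, precision, recall, and F1. As a minimal sketch, the same comparison of predictions against real values can produce those numbers too. The `binary_metrics` helper and the sample lists below are hypothetical and are not taken from the Titanic run.

```python
def binary_metrics(predictions, actuals, positive=1):
    """Confusion matrix and derived metrics for binary labels."""
    pairs = list(zip(predictions, actuals))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    tn = sum(1 for p, a in pairs if p != positive and a != positive)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: actual 0, actual 1
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up example values
predictions = [1, 0, 1, 1, 0, 0]
actuals = [1, 0, 0, 1, 1, 0]

metrics = binary_metrics(predictions, actuals)
print(metrics["confusion"])  # [[2, 1], [1, 2]]
print(metrics["accuracy"], metrics["precision"], metrics["recall"], metrics["f1"])
```

To use it with the loop above, the integer predictions and real values would first be collected into two lists instead of only being counted.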
