In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

In [2]:
df = pd.read_csv('diabetes.csv') 

In [3]:
df.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


Selecting the first 590 rows:

In [4]:
task_data = df.head(590)

Counting the rows in this sample that belong to class 1 (the patient has diabetes):

In [5]:
len(task_data[task_data['Outcome'] == 1])

204
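
The count can be put in proportion with `value_counts`: 204 of 590 rows is roughly 35% positives, so the two classes are noticeably imbalanced. A minimal sketch on a made-up label column (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Made-up labels standing in for task_data['Outcome'].
toy = pd.DataFrame({"Outcome": [1, 0, 1, 0, 1, 0, 0, 0, 0, 1]})

counts = toy["Outcome"].value_counts()                # absolute counts per class
shares = toy["Outcome"].value_counts(normalize=True)  # fractions per class

print(counts)
print(shares)
```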

Splitting the data into training and test sets: the first 70% of the rows form the training set, and the remaining 30% form the test set.

In [6]:
train = task_data.head(int(len(task_data)*0.7))
test = task_data.tail(int(len(task_data)*0.3))
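
With 590 rows the two calls above give 413 training rows and 177 test rows, which together cover the whole sample. For other sizes, truncating both fractions separately can leave a row unused (7 rows would give 4 + 2). A sketch that cuts once by position instead; the helper `split_by_position` and the toy frame are hypothetical, introduced only for illustration:

```python
import pandas as pd

def split_by_position(data, train_fraction=0.7):
    """Split rows in order: the first part for training, the rest for testing."""
    n_train = int(len(data) * train_fraction)
    return data.iloc[:n_train], data.iloc[n_train:]

# Toy frame standing in for task_data (values are made up).
toy = pd.DataFrame({"Glucose": [148, 85, 183, 89, 137, 116, 78],
                    "Outcome": [1, 0, 1, 0, 1, 0, 1]})

toy_train, toy_test = split_by_position(toy)
print(len(toy_train), len(toy_test))
```

Because the test part starts exactly where the training part ends, every row lands in one of the two sets and none in both.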

Selecting the predictors (the first 8 columns) and the response (Outcome):

In [7]:
features = list(train.columns[:8])
x = train[features]
y = train['Outcome']

Importing DecisionTreeClassifier:

In [8]:
from sklearn.tree import DecisionTreeClassifier

Setting the decision tree parameters and training the model:

In [9]:
tree = DecisionTreeClassifier(criterion='entropy',  # splitting criterion
                              min_samples_leaf=20,  # minimum number of samples per leaf
                              max_leaf_nodes=30,    # maximum number of leaves
                              random_state=2020)    # fixed seed for reproducibility
clf = tree.fit(x, y)
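
With `criterion='entropy'` the tree picks splits by the reduction in Shannon entropy of the class labels (information gain). The sketch below computes both quantities by hand; the functions `entropy` and `information_gain` and the example split are illustrative assumptions and not scikit-learn internals. The label list only mimics the class proportions of the 590-row sample, while the model itself is fitted on the training rows.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    total = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        total -= p * log2(p)
    return total

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Labels with the same proportions as the 590-row sample (204 positives).
sample = [1] * 204 + [0] * 386
print(round(entropy(sample), 3))

# A made-up split that sends most positives to the right-hand side.
left = [0] * 300 + [1] * 40
right = [0] * 86 + [1] * 164
print(round(information_gain(sample, left, right), 3))
```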

Importing the libraries for visualizing the tree, then exporting the tree to a DOT file and displaying it:

In [10]:
from sklearn.tree import export_graphviz
import graphviz
columns = list(x.columns)
export_graphviz(clf, out_file='tree.dot',
                feature_names=columns,
                class_names=['0', '1'],
                rounded=True, proportion=False,
                precision=2, filled=True, label='all')

with open('tree.dot') as f:
    dot_graph = f.read()

graphviz.Source(dot_graph)

<graphviz.files.Source at 0x7f1339589da0>
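
When Graphviz is not installed, or only a text log is needed, `sklearn.tree.export_text` prints the same split rules as plain text. A minimal sketch on made-up data (the `toy_x` frame and its values are assumptions chosen so that only Glucose is informative):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: Glucose separates the classes, BMI carries no signal.
toy_x = pd.DataFrame({
    "Glucose": [85, 89, 95, 100, 148, 160, 170, 183],
    "BMI":     [30.0, 25.0, 35.0, 28.0, 30.0, 25.0, 35.0, 28.0],
})
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

toy_clf = DecisionTreeClassifier(criterion="entropy", random_state=2020).fit(toy_x, toy_y)

# Plain-text rendering of the split rules, one line per node.
rules = export_text(toy_clf, feature_names=list(toy_x.columns))
print(rules)
```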

Printing the tree depth:

In [11]:
clf.tree_.max_depth

7
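
The depth is a consequence of the two limits set at training time: `max_leaf_nodes` caps the number of leaves and `min_samples_leaf` stops splits that would create small leaves. The sketch below checks both limits on synthetic data (the data and the name `toy_clf` are assumptions for illustration); `get_depth()` and `get_n_leaves()` are the public counterparts of `tree_.max_depth`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 400 points, 2 features, noisy labels driven by the first feature.
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

toy_clf = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=20,
                                 max_leaf_nodes=30, random_state=2020).fit(X, y)

is_leaf = toy_clf.tree_.children_left == -1          # leaves have no children
leaf_sizes = toy_clf.tree_.n_node_samples[is_leaf]   # training samples per leaf

print("depth:", toy_clf.get_depth())
print("leaves:", toy_clf.get_n_leaves())
print("smallest leaf:", leaf_sizes.min())
```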

Making predictions for the objects in the test set:

In [12]:
features = list(test.columns[:8])
x = test[features]
y_true = test['Outcome']
y_pred = clf.predict(x)

Accuracy, the share of test objects that the classifier labels correctly:

In [13]:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)

0.7909604519774012
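
The value above corresponds to 140 correct predictions out of 177 test rows. Accuracy is simply the fraction of positions where the prediction matches the true label, which the following sketch reproduces by hand on made-up arrays (`toy_true` and `toy_pred` are assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions.
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

manual = (toy_true == toy_pred).mean()   # fraction of matching positions
print(manual, accuracy_score(toy_true, toy_pred))
```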

Macro-averaged $F_1$ score (the unweighted mean of the per-class $F_1$ scores):

In [14]:
from sklearn.metrics import f1_score
f1_score(y_true, y_pred, average='macro')

0.7301940427635645
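
For one class, $F_1 = \frac{2\,TP}{2\,TP + FP + FN}$; macro averaging computes this for each class in turn and takes the plain mean, so the smaller class weighs as much as the larger one. A sketch that checks this against `f1_score` on made-up labels (the helper `f1_for_class` is hypothetical and assumes each class appears at least once):

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    return 2 * tp / (2 * tp + fp + fn)

# Made-up, imbalanced labels and predictions.
toy_true = np.array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 0])

per_class = [f1_for_class(toy_true, toy_pred, c) for c in (0, 1)]
macro = np.mean(per_class)
print(per_class, macro, f1_score(toy_true, toy_pred, average="macro"))
```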

Making a prediction for a single object of the input data (the row with index 708, which lies outside the first 590 rows used above):

In [15]:
df.loc[708, features]

Pregnancies                   9.000
Glucose                     164.000
BloodPressure                78.000
SkinThickness                 0.000
Insulin                       0.000
BMI                          32.800
DiabetesPedigreeFunction      0.148
Age                          45.000
Name: 708, dtype: float64

Assigned class:

In [16]:
# A one-row DataFrame keeps the column names the model was fitted with.
clf.predict(df.loc[[708], features])[0]

1
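
Besides the class label, a fitted tree can report how confident the leaf is: `predict_proba` returns the share of each class among the training samples in the leaf that the object reaches. A minimal sketch on made-up data (`toy_x`, `toy_y` and the chosen row are assumptions):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data; the labels overlap so that leaves stay impure.
toy_x = pd.DataFrame({"Glucose": [85, 89, 95, 100, 110, 120,
                                  140, 148, 160, 170, 183, 190]})
toy_y = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]

toy_clf = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=3,
                                 random_state=2020).fit(toy_x, toy_y)

# Selecting with a list of labels yields a one-row DataFrame with column names.
row = toy_x.loc[[7]]
proba = toy_clf.predict_proba(row)[0]   # class shares in the leaf reached by this row
print(dict(zip(toy_clf.classes_, proba)), toy_clf.predict(row)[0])
```

The predicted label is the class with the largest share, so the two outputs always agree.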