##  What are decision trees?

Decision trees are a flowchart-like structure in which each branch is the result of a test on a previous attribute. 
They allow you to ask multiple linear questions in the form of conditional logic statements. Decision tree algorithms can also be referred to as Classification and Regression Trees (CART) indicating that they can be used to solve both classification and regression problems.

### Decision tree terminology
**Root node** is the first node which gets split into several other nodes. It typically contains the entire sample or dataset being used.

**Splitting** is the process of breaking down a node into two or more other nodes.

**Decision node** is a node on which other nodes can be split from.

**Leaf node** also known as terminal node, the leaf node is one on which there are no other splits.

**Pruning** is the opposite of splitting. It involves removing subnodes to reduce the size of a tree. 

### Entropy
Entropy controls how a decision trees decides where to split the data. It is the measure of impurity. The whole point is to find split points that are as pure as possible.
The mathematical formula for entropy is:

$\sum_i (p_i)log_2 (p_i) $

*where $ (p_i) $ - is the fraction of examples given in a given class*

*and* $\sum_i $ - *is the sum of all classes*

There are two extreme situations for entrophy viz:

All the classes are the same therefore entropy is **0**.

Even split between two classes therefore entropy is **1**.

You can calculate entropy in Python using the math function.



### Information Gain
The decision tree algorithm maximises for information gain.

Information gain = $ entropy (parent) - [weighted average]  entropy(children) $

In [1]:
import pandas as pd

In [2]:
kidney_dataset = pd.read_csv("C:/Users/Studio14/Downloads/kidney disease.csv")

This dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease) at the UCI Machine Learning Repository. It can be used to predict chronic kidney disease denoted as "ckd". It was collected over a two -month period. The dataset has the following features:
1.Age(numerical)- **age** in years

2.Blood Pressure(numerical)- **bp** in mm/Hg

3.Specific Gravity(nominal) - **sg** - (1.005,1.010,1.015,1.020,1.025)

4.Albumin(nominal)- **al** - (0,1,2,3,4,5)

5.Sugar(nominal) - **su** - (0,1,2,3,4,5)

6.Red Blood Cells(nominal) - **rbc** - (normal,abnormal)

7.Pus Cell (nominal) - **pc** - (normal,abnormal)

8.Pus Cell clumps(nominal) - **pcc** - (present,notpresent)

9.Bacteria(nominal) - **ba** - (present,notpresent)

10.Blood Glucose Random(numerical) - **bgr** in mgs/dl

11.Blood Urea(numerical) - **bu** in mgs/dl

12.Serum Creatinine(numerical) - **sc** in mgs/dl

13.Sodium(numerical) - **sod** in mEq/L

14.Potassium(numerical) - **pot** in mEq/L

15.Hemoglobin(numerical) - **hemo** in gms

16.Packed Cell Volume(numerical) - **pcv**

17.White Blood Cell Count(numerical) - **wc** in cells/cumm

18.Red Blood Cell Count(numerical) - **rc** in millions/cmm

19.Hypertension(nominal) - **htn** - (yes,no)

20.Diabetes Mellitus(nominal) - **dm** - (yes,no)

21.Coronary Artery Disease(nominal) - **cad** - (yes,no)

22.Appetite(nominal) - **appet** - (good,poor)

23.Pedal Edema(nominal) - **pe** - (yes,no)

24.Anemia(nominal) - **ane** - (yes,no)

25.Class (nominal) - **classification** - (ckd,notckd) 
    
Let us take a quick look at the dataset using the head, info, and describe methods

In [3]:
kidney_dataset.head()

Unnamed: 0,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,...,38,6000,,no,no,no,good,no,no,ckd
2,2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,...,31,7500,,no,yes,no,poor,no,yes,ckd
3,3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,...,35,7300,4.6,no,no,no,good,no,no,ckd


In [4]:
kidney_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
id                400 non-null int64
age               391 non-null float64
bp                388 non-null float64
sg                353 non-null float64
al                354 non-null float64
su                351 non-null float64
rbc               248 non-null object
pc                335 non-null object
pcc               396 non-null object
ba                396 non-null object
bgr               356 non-null float64
bu                381 non-null float64
sc                383 non-null float64
sod               313 non-null float64
pot               312 non-null float64
hemo              348 non-null float64
pcv               330 non-null object
wc                295 non-null object
rc                270 non-null object
htn               398 non-null object
dm                398 non-null object
cad               398 non-null object
appet             399 non-null object
pe         

In [5]:
kidney_dataset.describe()

Unnamed: 0,id,age,bp,sg,al,su,bgr,bu,sc,sod,pot,hemo
count,400.0,391.0,388.0,353.0,354.0,351.0,356.0,381.0,383.0,313.0,312.0,348.0
mean,199.5,51.483376,76.469072,1.017408,1.016949,0.450142,148.036517,57.425722,3.072454,137.528754,4.627244,12.526437
std,115.614301,17.169714,13.683637,0.005717,1.352679,1.099191,79.281714,50.503006,5.741126,10.408752,3.193904,2.912587
min,0.0,2.0,50.0,1.005,0.0,0.0,22.0,1.5,0.4,4.5,2.5,3.1
25%,99.75,42.0,70.0,1.01,0.0,0.0,99.0,27.0,0.9,135.0,3.8,10.3
50%,199.5,55.0,80.0,1.02,0.0,0.0,121.0,42.0,1.3,138.0,4.4,12.65
75%,299.25,64.5,80.0,1.02,2.0,0.0,163.0,66.0,2.8,142.0,4.9,15.0
max,399.0,90.0,180.0,1.025,5.0,5.0,490.0,391.0,76.0,163.0,47.0,17.8


Let's see what happens when we run the try to run predictions on the dataset without doing any work on it. 

In [6]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

In [7]:
from sklearn.model_selection import train_test_split
X= kidney_dataset.drop(['classification'], axis = 1)
y= kidney_dataset.classification
X_train, y_train, x_test, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [8]:
clf = clf.fit (X_train, y_train)

ValueError: could not convert string to float: 'abnormal'

We immediately run into errors here because classification models cannot handle non-numerical data. Therefore, we will creat dummy data to replace the columns with categorical data. Running the pd.info method on the dataset will easily point to the columns with string data as they have an object type.

In [12]:
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['rbc'], prefix = 'rbc', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['pc'], prefix = 'pc', drop_first = True)], axis = 1) 
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['pcc'], prefix = 'pcc', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['ba'], prefix = 'ba', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['pcv'], prefix = 'pcv', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['wc'], prefix = 'wc', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['rc'], prefix = 'rc', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['htn'], prefix = 'htn', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['dm'], prefix = 'dm', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['cad'], prefix = 'cad', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['appet'], prefix = 'appet', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['pe'], prefix = 'pe', drop_first = True)], axis = 1)
kidney_dataset = pd.concat([kidney_dataset,pd.get_dummies(kidney_dataset['ane'], prefix = 'ane', drop_first = True)], axis = 1)

In [13]:
kidney_dataset.drop(['rbc', 'pc', 'pcc', 'ba','pcv','wc','rc', 'htn','dm', 'cad', 'appet','pe', 'ane'], axis = 1, inplace = True)

In [None]:
kidney_dataset.isnull().sum()

In [14]:
kidney_dataset['hemo'].fillna(kidney_dataset.hemo.mean(), inplace = True)

In [15]:
kidney_dataset.drop(kidney_dataset[kidney_dataset.pot == 39].index, inplace = True)

After dropping the outliers in the pot column, the max value becomes 7. We then go ahead and fill the null values with the mean.

In [16]:
kidney_dataset['pot'].fillna(kidney_dataset.pot.mean(), inplace = True)

In [None]:
kidney_dataset.isnull().sum()

In [None]:
import matplotlib.pyplot as plt
plt.hist(kidney_dataset['sod'])
plt.show

In [None]:
a = kidney_dataset[kidney_dataset['sod'] < 20]
print(a.index)

In [17]:
kidney_dataset.drop(kidney_dataset[kidney_dataset.sod < 20].index, inplace = True)

In [18]:
kidney_dataset['sod'].fillna(kidney_dataset.sod.mean(), inplace = True)

In [None]:
kidney_dataset.isnull().sum()

In [19]:
import matplotlib.pyplot as plt
plt.hist(kidney_dataset['su'])
plt.show

  keep = (tmp_a >= first_edge)
  keep &= (tmp_a <= last_edge)


<function matplotlib.pyplot.show(*args, **kw)>

The 'sg', 'al' and 'su' columns contain nominal data. Values in this column seem to fall overwhelmingly into one category.Therefore, we will fill the null values with the mode of each column.

In [20]:
kidney_dataset['sg'].fillna(kidney_dataset.sg.mode(), inplace = True)
kidney_dataset['al'].fillna(kidney_dataset.al.mode(), inplace = True)
kidney_dataset['su'].fillna(kidney_dataset.su.mode(), inplace = True)
kidney_dataset['sc'].fillna(kidney_dataset.sc.mode(), inplace = True)
kidney_dataset['sc'].fillna(kidney_dataset.bp.mode(), inplace = True)

In [None]:
import matplotlib.pyplot as plt
plt.plot(kidney_dataset['bu'])
plt.show

In [None]:
kidney_dataset.isnull().sum()

The 'bgr' and 'bu' columns do not show any pattern that supports assigning a particular value to the null columns. Therefore, the null columns will be dropped.

In [21]:
kidney_dataset.dropna(inplace = True)

In [22]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

In [32]:
from sklearn.model_selection import train_test_split
X= kidney_dataset.drop(['classification'], axis = 1)
y= kidney_dataset.classification
X_train, X_test,y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [33]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [40]:
from sklearn.metrics import recall_score, precision_score
recall = recall_score(y_test, y_pred, pos_label = 'ckd')
precision = precision_score(y_test, y_pred, pos_label = 'ckd')
print('Recall score is: ', recall, 'while Precision score is: ', precision)


Recall score is:  1.0 while Precision score is:  0.9629629629629629
