# Decision Tree Classifier

<pre>
Definition: A decision tree is a simple representation for classifying examples. It is a supervised machine learning method in which the data is repeatedly split according to the value of a chosen attribute.

A decision tree consists of:

Nodes: Test the value of a certain attribute.

Edges/Branches: Correspond to the outcome of a test and connect to the next node or leaf.

Leaf nodes: Terminal nodes that predict the outcome (they represent class labels or a class distribution).


To understand the concept, consider the following example. Say you want to predict whether a person is fit or unfit, given information such as age, eating habits and physical activity. The decision nodes are questions like ‘What’s the age?’, ‘Does he exercise?’ and ‘Does he eat a lot of pizzas?’. The leaves represent outcomes, either ‘fit’ or ‘unfit’.
</pre>

<img src="https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1545934190/1_r5ikdb.png" >
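The fit/unfit example can be written out by hand as nested `if` statements, which is essentially what a trained tree encodes. This is only a sketch: the function name `classify_person`, the order of the questions and the age cut-off of 30 are assumptions made for illustration, not values learned from data.

```python
def classify_person(age, eats_lots_of_pizza, exercises):
    """Hand-written decision tree; the age cut-off of 30 is a made-up value."""
    if age < 30:                      # root node: test on age
        if eats_lots_of_pizza:        # internal node: test on eating habits
            return "unfit"            # leaf
        return "fit"                  # leaf
    if exercises:                     # internal node: test on physical activity
        return "fit"                  # leaf
    return "unfit"                    # leaf


print(classify_person(age=25, eats_lots_of_pizza=False, exercises=False))  # fit
print(classify_person(age=45, eats_lots_of_pizza=True, exercises=False))   # unfit
```

Each `if` is a node, each branch of the `if` is an edge, and each `return` is a leaf. A learning algorithm chooses the questions and thresholds automatically instead of having them typed in.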

In [2]:
#Loading the dataset required for the Decision Tree Classifier.

In [17]:
from sklearn.datasets import load_breast_cancer

In [18]:
data = load_breast_cancer()


In [19]:
data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [21]:
#Importing the required libraries for creating the DataFrame.
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


In [22]:
df = pd.DataFrame(data['data'],columns=data['feature_names'])

In [23]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [24]:
#The DataFrame holds all the features but no target column yet, so add it.

df['target'] = data['target']

In [25]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [26]:
#Checking for null values in the dataset.

df.isna().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

In [27]:
#Information about the dataset.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [35]:
#Assigning the feature columns to X and the target column to y.

X = df.drop('target',axis=1)
y = df[['target']]

In [28]:
#Importing the algorithm and constructing the model.

In [29]:
from sklearn.tree import DecisionTreeClassifier

In [30]:
#Creating the classifier object.

dec = DecisionTreeClassifier()

In [36]:
#Splitting the data into training and testing sets.
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
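As a quick check of what `test_size=0.2` means for this dataset, the sketch below repeats the split on the raw arrays and prints the resulting shapes. The `stratify=y` argument is an addition that is not in the cell above; it is included on the assumption that keeping the class ratio equal in both parts is desirable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y is an addition, not part of the original cell: it keeps the
# malignant/benign ratio roughly equal in the training and testing parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```

Of the 569 rows, 114 go to the test set, which is also the total of the four cells in the confusion matrix further down.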

In [38]:
#Training the model.

dec.fit(X_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
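The printed parameters show `criterion='gini'`, meaning splits are chosen to reduce Gini impurity. As a rough sketch of what that quantity measures (the helper name `gini_impurity` is made up here and is not part of scikit-learn), the impurity of a node with class proportions p_k is 1 - sum(p_k ** 2):

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity([0, 0, 0, 0]))  # 0.0 (pure node)
print(gini_impurity([0, 0, 1, 1]))  # 0.5 (evenly mixed two-class node)
```

A node that contains a single class has impurity 0, so the tree keeps splitting until the leaves are pure unless a limit such as `max_depth` stops it earlier.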

In [42]:
#Checking the accuracy of the model on the training data.

dec.score(X_train,y_train)*100

100.0

In [43]:
#Checking the accuracy of the model on the test data.

dec.score(X_test,y_test)*100

92.98245614035088
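A training accuracy of 100% next to a lower test accuracy usually suggests that the unrestricted tree has memorised the training set. One common way to limit this is to cap the depth of the tree. The sketch below is a possible variant, not the model trained above: `max_depth=3` is an arbitrary illustrative value, and the scores it prints are not reported here because they were not part of the original run.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth=3 is an arbitrary illustrative value, not a tuned one.
shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow.fit(X_train, y_train)

print("depth:", shallow.get_depth())
print("train accuracy:", shallow.score(X_train, y_train))
print("test accuracy:", shallow.score(X_test, y_test))
```

Comparing the two accuracies for a few depth values gives a first impression of where the tree starts to overfit.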

In [44]:
#Confusion matrix for the model.

from sklearn.metrics import confusion_matrix

In [45]:
confusion_matrix(y_test,dec.predict(X_test))

array([[39,  4],
       [ 4, 67]], dtype=int64)

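<pre>
Rows are the true labels and columns are the predicted labels. In this dataset label 0 is malignant and label 1 is benign.
Top    4 (row 0, column 1) ---> The tumour is malignant, but the model predicts benign. With cancer as the positive class this is a Type 2 error (false negative).
Bottom 4 (row 1, column 0) ---> The tumour is benign, but the model predicts malignant. With cancer as the positive class this is a Type 1 error (false positive). </pre>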
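The four counts in the matrix are enough to recompute the accuracy and to derive recall and precision by hand. The sketch below assumes that malignant (label 0 in `load_breast_cancer`) is treated as the positive class; that choice is a convention picked for this example.

```python
# Counts copied from the confusion matrix above: rows are true labels,
# columns are predicted labels, label 0 = malignant, label 1 = benign.
matrix = [[39, 4],
          [4, 67]]

# Assumption for this sketch: malignant (label 0) is the "positive" class.
tp, fn = matrix[0]
fp, tn = matrix[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # share of malignant tumours that were caught
precision = tp / (tp + fp)     # share of malignant predictions that were right

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
```

The accuracy computed this way, 106 / 114, matches the test score printed earlier.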

In [47]:
from sklearn import tree

In [50]:
%matplotlib notebook

Visualizing the tree

In [51]:
plt.figure(figsize=(15,8))
tree.plot_tree(dec,feature_names =X.columns)
plt.show()
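When the plotted tree is too large to read, a text dump of the rules can be easier to inspect. A minimal sketch using `sklearn.tree.export_text` follows; it fits a separate, deliberately shallow tree (`max_depth=2` is an arbitrary choice) so that the printout stays short, and it is not the `dec` model plotted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A shallow tree (max_depth=2 is an arbitrary choice) keeps the printout short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(data.data, data.target)

rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one test on a feature, and the lines starting with `class:` are the leaves.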

Example 2: DecisionTreeClassifier on the house dataset

In [2]:
#Importing the libraries and the dataset.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
#Loading House Dataset.csv from the local Datasets folder.

In [44]:
df = pd.read_csv("Datasets/House Dataset.csv")

In [6]:
df.head()

Unnamed: 0,price,resid_area,air_qual,room_num,age,dist1,dist2,dist3,dist4,teachers,poor_prop,airport,n_hos_beds,n_hot_rooms,waterbody,rainfall,bus_ter,parks,Sold
0,24.0,32.31,0.538,6.575,65.2,4.35,3.81,4.18,4.01,24.7,4.98,YES,5.48,11.192,River,23,YES,0.049347,0
1,21.6,37.07,0.469,6.421,78.9,4.99,4.7,5.12,5.06,22.2,9.14,NO,7.332,12.1728,Lake,42,YES,0.046146,1
2,34.7,37.07,0.469,7.185,61.1,5.03,4.86,5.01,4.97,22.2,4.03,NO,7.394,101.12,,38,YES,0.045764,0
3,33.4,32.18,0.458,6.998,45.8,6.21,5.93,6.16,5.96,21.3,2.94,YES,9.268,11.2672,Lake,45,YES,0.047151,0
4,36.2,32.18,0.458,7.147,54.2,6.16,5.86,6.37,5.86,21.3,5.33,NO,8.824,11.2896,Lake,55,YES,0.039474,0


In [7]:
#All columns except Sold are features.
#Sold is the target column.

In [8]:
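#Checking the correlation. A correlation coefficient measures how strongly two numeric columns vary together, so the Sold column of the result shows how each feature relates to the target.
#On pandas 2.0 and later, string columns must be excluded with df.corr(numeric_only=True).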

df.corr()

Unnamed: 0,price,resid_area,air_qual,room_num,age,dist1,dist2,dist3,dist4,teachers,poor_prop,n_hos_beds,n_hot_rooms,rainfall,parks,Sold
price,1.0,-0.484754,-0.4293,0.696304,-0.377999,0.251355,0.249459,0.24665,0.2482,0.505655,-0.740836,0.109646,0.023122,-0.047426,-0.391574,-0.154698
resid_area,-0.484754,1.0,0.763651,-0.391676,0.644779,-0.706481,-0.707956,-0.707566,-0.705819,-0.383248,0.6038,0.005827,-0.000839,0.05581,0.707635,0.024404
air_qual,-0.4293,0.763651,1.0,-0.302188,0.73147,-0.768589,-0.769724,-0.769157,-0.764873,-0.188933,0.590879,-0.049954,-0.004882,0.092104,0.915544,-0.004017
room_num,0.696304,-0.391676,-0.302188,1.0,-0.240265,0.208464,0.203981,0.201907,0.205397,0.355501,-0.613808,0.032207,0.030674,-0.064694,-0.282817,0.027148
age,-0.377999,0.644779,0.73147,-0.240265,1.0,-0.746904,-0.746493,-0.747021,-0.746707,-0.261515,0.602339,-0.021102,0.00938,0.075198,0.67385,-0.016291
dist1,0.251355,-0.706481,-0.768589,0.208464,-0.746904,1.0,0.997905,0.997735,0.994073,0.232834,-0.498823,-0.03055,-0.014463,-0.036794,-0.706319,-0.035309
dist2,0.249459,-0.707956,-0.769724,0.203981,-0.746493,0.997905,1.0,0.998097,0.994003,0.233707,-0.495693,-0.031248,-0.010239,-0.038005,-0.708237,-0.040356
dist3,0.24665,-0.707566,-0.769157,0.201907,-0.747021,0.997735,0.998097,1.0,0.994126,0.233588,-0.49429,-0.028471,-0.010077,-0.04147,-0.709346,-0.035768
dist4,0.2482,-0.705819,-0.764873,0.205397,-0.746707,0.994073,0.994003,0.994126,1.0,0.228256,-0.496084,-0.021648,-0.00585,-0.032542,-0.703508,-0.043612
teachers,0.505655,-0.383248,-0.188933,0.355501,-0.261515,0.232834,0.233707,0.233588,0.228256,1.0,-0.374044,-0.00813,-0.023343,-0.045836,-0.187004,0.042525


<pre>
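The number of rooms (room_num), resid_area and n_hos_beds are positively correlated with Sold, but only weakly: the coefficients shown above for room_num and resid_area are about 0.03 and 0.02.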

</pre>
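To read off the correlation of every feature with the target in one step, the target column of the correlation matrix can be selected and sorted. The sketch below uses a tiny made-up frame named `toy`; only the column names mimic the house dataset, and the values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame; the column names mimic the house dataset, the values do not.
toy = pd.DataFrame({
    "room_num": [5.0, 6.0, 7.0, 8.0],
    "age":      [80.0, 60.0, 40.0, 20.0],
    "airport":  ["YES", "NO", "NO", "YES"],   # string column, excluded below
    "Sold":     [0, 0, 1, 1],
})

# Keep numeric columns only, then read off the correlations with the target.
corr_with_target = (
    toy.select_dtypes("number").corr()["Sold"].drop("Sold").sort_values()
)
print(corr_with_target)
```

Selecting the numeric columns first avoids errors from string columns such as `airport`, whichever pandas version is installed.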

In [40]:
#Checking for null values in the dataset. df['n_hos_beds'] has 8 null values in the raw file; the zeros below appear to come from a run made after the imputation further down.

df.isna().sum()

price          0
resid_area     0
air_qual       0
room_num       0
age            0
dist1          0
dist2          0
dist3          0
dist4          0
teachers       0
poor_prop      0
airport        0
n_hos_beds     0
n_hot_rooms    0
waterbody      0
rainfall       0
bus_ter        0
parks          0
Sold           0
dtype: int64

In [41]:
#Checking the information of the DataFrame.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   price        506 non-null    float64
 1   resid_area   506 non-null    float64
 2   air_qual     506 non-null    float64
 3   room_num     506 non-null    float64
 4   age          506 non-null    float64
 5   dist1        506 non-null    float64
 6   dist2        506 non-null    float64
 7   dist3        506 non-null    float64
 8   dist4        506 non-null    float64
 9   teachers     506 non-null    float64
 10  poor_prop    506 non-null    float64
 11  airport      506 non-null    int32  
 12  n_hos_beds   506 non-null    float64
 13  n_hot_rooms  506 non-null    float64
 14  waterbody    506 non-null    uint8  
 15  rainfall     506 non-null    int64  
 16  bus_ter      506 non-null    uint8  
 17  parks        506 non-null    float64
 18  Sold         506 non-null    int64  
dtypes: float

In [46]:
df['n_hos_beds'].isna().sum()

#n_hos_beds has 8 null values.

8

In [47]:
#Columns with dtype object hold strings. The model cannot take string data, so these columns are encoded as numbers.

In [48]:
from sklearn.preprocessing import LabelEncoder

In [49]:
encode = LabelEncoder()

In [50]:
lab=encode.fit(df['airport'])

In [51]:
df['airport'] = lab.transform(df['airport'])
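To see which number each label receives, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a short hand-typed list instead of the `airport` column; the variable names are chosen for this example only.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["YES", "NO", "NO", "YES", "NO"])  # sample values

print(encoder.classes_.tolist())                     # ['NO', 'YES']
print(codes.tolist())                                # [1, 0, 0, 1, 0]
print(encoder.inverse_transform([0, 1]).tolist())    # ['NO', 'YES']
```

The classes are sorted, so `NO` becomes 0 and `YES` becomes 1, which agrees with the encoded `airport` values in the output that follows.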

In [52]:
df.head()

Unnamed: 0,price,resid_area,air_qual,room_num,age,dist1,dist2,dist3,dist4,teachers,poor_prop,airport,n_hos_beds,n_hot_rooms,waterbody,rainfall,bus_ter,parks,Sold
0,24.0,32.31,0.538,6.575,65.2,4.35,3.81,4.18,4.01,24.7,4.98,1,5.48,11.192,River,23,YES,0.049347,0
1,21.6,37.07,0.469,6.421,78.9,4.99,4.7,5.12,5.06,22.2,9.14,0,7.332,12.1728,Lake,42,YES,0.046146,1
2,34.7,37.07,0.469,7.185,61.1,5.03,4.86,5.01,4.97,22.2,4.03,0,7.394,101.12,,38,YES,0.045764,0
3,33.4,32.18,0.458,6.998,45.8,6.21,5.93,6.16,5.96,21.3,2.94,1,9.268,11.2672,Lake,45,YES,0.047151,0
4,36.2,32.18,0.458,7.147,54.2,6.16,5.86,6.37,5.86,21.3,5.33,0,8.824,11.2896,Lake,55,YES,0.039474,0


Another encoding method is dummy variables: the waterbody and bus_ter columns are converted to numeric format with pd.get_dummies.

In [53]:
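#pd.get_dummies returns one indicator column per category; only the first one is kept so that
#each feature stays a single column (for waterbody, the output below is consistent with a 'Lake' indicator).
df['bus_ter'] = pd.get_dummies(df['bus_ter'], dtype='uint8').iloc[:, 0]
df['waterbody'] = pd.get_dummies(df['waterbody'], dtype='uint8').iloc[:, 0]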
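Keeping a single indicator column discards the distinction between the remaining categories. A fuller one-hot encoding keeps one column per category. The sketch below shows this on a small made-up frame; it is an alternative to the cell above, and adopting it would change the column layout used in the rest of this example.

```python
import pandas as pd

# Made-up sample with a missing value, mimicking the waterbody column.
sample = pd.DataFrame({
    "rainfall": [23, 42, 38, 45],
    "waterbody": ["River", "Lake", None, "Lake"],
})

# columns=[...] replaces the listed column with one indicator column per category.
encoded = pd.get_dummies(sample, columns=["waterbody"], dtype=int)

print(encoded.columns.tolist())  # ['rainfall', 'waterbody_Lake', 'waterbody_River']
print(encoded)
```

The row with the missing value gets 0 in every indicator column, because missing values are not given a column of their own by default.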

In [54]:
df.head()

Unnamed: 0,price,resid_area,air_qual,room_num,age,dist1,dist2,dist3,dist4,teachers,poor_prop,airport,n_hos_beds,n_hot_rooms,waterbody,rainfall,bus_ter,parks,Sold
0,24.0,32.31,0.538,6.575,65.2,4.35,3.81,4.18,4.01,24.7,4.98,1,5.48,11.192,0,23,1,0.049347,0
1,21.6,37.07,0.469,6.421,78.9,4.99,4.7,5.12,5.06,22.2,9.14,0,7.332,12.1728,1,42,1,0.046146,1
2,34.7,37.07,0.469,7.185,61.1,5.03,4.86,5.01,4.97,22.2,4.03,0,7.394,101.12,0,38,1,0.045764,0
3,33.4,32.18,0.458,6.998,45.8,6.21,5.93,6.16,5.96,21.3,2.94,1,9.268,11.2672,1,45,1,0.047151,0
4,36.2,32.18,0.458,7.147,54.2,6.16,5.86,6.37,5.86,21.3,5.33,0,8.824,11.2896,1,55,1,0.039474,0


In [55]:
df['n_hos_beds']=df['n_hos_beds'].fillna(df['n_hos_beds'].mean())
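The effect of mean imputation is easiest to see on a few values. The series below is made up for illustration and only borrows the column name `n_hos_beds`.

```python
import numpy as np
import pandas as pd

beds = pd.Series([5.0, np.nan, 7.0, 9.0], name="n_hos_beds")  # made-up values

print(beds.isna().sum())           # 1 missing value
filled = beds.fillna(beds.mean())  # mean() skips NaN: (5 + 7 + 9) / 3 = 7.0
print(filled.tolist())             # [5.0, 7.0, 7.0, 9.0]
```

Note that computing the mean on the full dataset before splitting lets the test rows influence the fill value. A stricter variant would compute it on the training rows only.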

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   price        506 non-null    float64
 1   resid_area   506 non-null    float64
 2   air_qual     506 non-null    float64
 3   room_num     506 non-null    float64
 4   age          506 non-null    float64
 5   dist1        506 non-null    float64
 6   dist2        506 non-null    float64
 7   dist3        506 non-null    float64
 8   dist4        506 non-null    float64
 9   teachers     506 non-null    float64
 10  poor_prop    506 non-null    float64
 11  airport      506 non-null    int32  
 12  n_hos_beds   506 non-null    float64
 13  n_hot_rooms  506 non-null    float64
 14  waterbody    506 non-null    uint8  
 15  rainfall     506 non-null    int64  
 16  bus_ter      506 non-null    uint8  
 17  parks        506 non-null    float64
 18  Sold         506 non-null    int64  
dtypes: float

Importing the DecisionTreeClassifier model

In [58]:
from sklearn.tree import DecisionTreeClassifier

In [59]:
des = DecisionTreeClassifier()

In [60]:
#Splitting the data into training and testing sets.

from sklearn.model_selection import train_test_split

In [63]:
X = df.drop('Sold',axis=1)
y = df[['Sold']]

In [64]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [66]:
#Training the model on the training data.

des.fit(X_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [68]:
#Checking the model accuracy on the training data.

des.score(X_train,y_train)*100

100.0

In [70]:
#Checking the accuracy of the model on the test data.

des.score(X_test,y_test)

0.6274509803921569
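The gap between 100% on the training data and about 63% on the test data suggests heavy overfitting. One way to investigate is to compare several depth limits with cross-validation. The sketch below runs on synthetic stand-in data from `make_classification`, because the house dataset file is not available here; the sample size, feature count and depth values are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the house dataset itself is not used here.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=18, n_informative=4, random_state=42
)

results = {}
for depth in (2, 4, 8, None):          # None means "grow until the leaves are pure"
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[depth] = scores.mean()
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```

On the real data the same loop could be run with `X_train` and `y_train` in place of the synthetic arrays, keeping the test set aside for the final check.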

In [71]:
#Checking the model accuracy using accuracy_score.

In [72]:
from sklearn.metrics import accuracy_score

In [75]:
y_pred = des.predict(X_test)

In [76]:
accuracy_score(y_test,y_pred)

0.6274509803921569

Constructing the confusion matrix

In [85]:
from sklearn.metrics import confusion_matrix

In [86]:
confusion_matrix(y_test,y_pred)

array([[36, 24],
       [14, 28]], dtype=int64)
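Accuracy alone hides how the errors are distributed between the two classes. As a sketch, label vectors that reproduce the matrix above can be rebuilt from its counts and passed to `precision_score` and `recall_score`. The vectors are reconstructed for illustration and are not the actual `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Rebuild label vectors that reproduce the matrix above
# (rows = true class, columns = predicted class).
y_true = [0] * 60 + [1] * 42
y_pred = [0] * 36 + [1] * 24 + [0] * 14 + [1] * 28

print(confusion_matrix(y_true, y_pred))
print("precision (Sold=1):", precision_score(y_true, y_pred))
print("recall    (Sold=1):", recall_score(y_true, y_pred))
```

Precision for Sold=1 is 28 / 52 and recall is 28 / 42, so a little under half of the houses predicted as sold were misclassified.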

In [87]:
#Visualizing the tree.

In [88]:
from sklearn import tree

In [89]:
%matplotlib notebook

In [92]:
plt.figure(figsize=[15,8])
tree.plot_tree(des,feature_names=X.columns)
plt.show()
