# Decision Tree
A decision tree is a tree-like model in which:

- Internal nodes test a feature (or attribute) of the dataset.

- Branches represent the outcomes of that test (the decision rule).

- Leaf nodes hold the prediction (a class label or a numeric value).

The tree is built by recursively splitting the data on feature conditions. A prediction is made by following the branches from the root down to a leaf.
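To make the structure concrete, here is a minimal hand-written sketch of a tree stored as nested dictionaries. The rules are picked by hand for illustration and are not learned from data; `toy_tree` and `predict` are hypothetical names. Note also that scikit-learn's trees split on numeric thresholds rather than on category names, so this only illustrates the node/branch/leaf idea.

```python
# A hand-written (not learned) decision tree, for illustration only.
# Internal nodes are dicts that test one feature; leaves are plain labels.
toy_tree = {
    "feature": "job",                  # internal node: test the 'job' feature
    "branches": {
        "business manager": 1,         # leaf
        "sales executive": 0,          # leaf
        "computer programmer": {       # another internal node
            "feature": "degree",
            "branches": {"bachelors": 0, "masters": 1},
        },
    },
}


def predict(node, sample):
    """Follow branches from the root until a leaf (a non-dict value) is reached."""
    while isinstance(node, dict):
        value = sample[node["feature"]]    # look up the tested feature
        node = node["branches"][value]     # take the matching branch
    return node


print(predict(toy_tree, {"job": "computer programmer", "degree": "masters"}))
```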


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("salaries.csv")
df.head()

,company,job,degree,salary_more_then_100k
0,google,sales executive,bachelors,0
1,google,sales executive,masters,0
2,google,business manager,bachelors,1
3,google,business manager,masters,1
4,google,computer programmer,bachelors,0


In [3]:
inputs = df.drop('salary_more_then_100k', axis = 'columns')
target = df['salary_more_then_100k']

In [4]:
inputs.head()

,company,job,degree
0,google,sales executive,bachelors
1,google,sales executive,masters
2,google,business manager,bachelors
3,google,business manager,masters
4,google,computer programmer,bachelors


In [5]:
target

0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64

In [6]:
from sklearn.preprocessing import LabelEncoder

Create a separate LabelEncoder instance for each categorical column.

In [7]:
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()

Apply the fit_transform method to each categorical column and store the encoded values in new columns.

In [8]:
inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
inputs.head()

,company,job,degree,company_n,job_n,degree_n
0,google,sales executive,bachelors,2,2,0
1,google,sales executive,masters,2,2,1
2,google,business manager,bachelors,2,0,0
3,google,business manager,masters,2,0,1
4,google,computer programmer,bachelors,2,1,0
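The integer codes are assigned in sorted order of the distinct labels, which is why `bachelors` becomes 0 and `masters` becomes 1. A small sketch of how the mapping could be inspected through the fitted encoder's `classes_` attribute, using a short stand-in list rather than the full column:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Stand-in values for the 'degree' column.
codes = le.fit_transform(["masters", "bachelors", "masters", "bachelors"])

# classes_ holds the distinct labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'bachelors': 0, 'masters': 1}

# inverse_transform converts codes back to the original labels.
decoded = [str(label) for label in le.inverse_transform([0, 1])]
print(decoded)  # ['bachelors', 'masters']
```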


Drop the original text columns, keeping only the encoded ones.

In [9]:
inputs_n = inputs.drop(['company', 'job', 'degree'], axis = 'columns')
inputs_n

,company_n,job_n,degree_n
0,2,2,0
1,2,2,1
2,2,0,0
3,2,0,1
4,2,1,0
5,2,1,1
6,0,2,1
7,0,1,0
8,0,0,0
9,0,0,1


Import DecisionTreeClassifier from sklearn.tree.

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
model = DecisionTreeClassifier()

In [12]:
model.fit(inputs_n, target)
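One way to see which rules a fitted tree has learned is `sklearn.tree.export_text`, which prints the splits as indented text. The sketch below is self-contained and uses only encoded rows 0 to 5 and their targets as shown in the outputs above, so the printed tree will generally differ from one fitted on the full dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded rows 0-5 and their targets, copied from the outputs shown in this notebook.
X = pd.DataFrame(
    [[2, 2, 0], [2, 2, 1], [2, 0, 0], [2, 0, 1], [2, 1, 0], [2, 1, 1]],
    columns=["company_n", "job_n", "degree_n"],
)
y = [0, 0, 1, 1, 0, 1]

# random_state is fixed only so the printed tree is the same on every run.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each indented line is a threshold test on one feature; 'class:' lines are leaves.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```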

In [13]:
model.score(inputs_n, target)

1.0
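The score of 1.0 above is computed on the same rows the model was trained on, so it mostly shows that an unrestricted tree can memorize its training data. A minimal sketch of that effect, assuming purely synthetic random data (not the salaries dataset) in which the labels carry no real signal:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: random features and random 0/1 labels, so there is nothing to learn.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on rows the tree has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out rows
print(train_acc, test_acc)
```

The training accuracy is perfect even though the labels are noise, which is why a held-out test set (as used in the exercise below) gives a more honest estimate.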

In [14]:
model.predict(pd.DataFrame([[2,2,1]], columns = ['company_n', 'job_n', 'degree_n']))

array([0])
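The values `[2, 2, 1]` above are the codes for google, sales executive, and masters. Rather than looking the codes up by hand, a raw sample can be passed through the fitted encoders. The sketch below is self-contained and rebuilds only the six google rows implied by the outputs above, so its codes differ from the notebook's (with a single company, google encodes to 0); `predict_raw` is a hypothetical helper.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Rows 0-5 reconstructed from the outputs shown in this notebook.
rows = [
    ("google", "sales executive", "bachelors", 0),
    ("google", "sales executive", "masters", 0),
    ("google", "business manager", "bachelors", 1),
    ("google", "business manager", "masters", 1),
    ("google", "computer programmer", "bachelors", 0),
    ("google", "computer programmer", "masters", 1),
]
df = pd.DataFrame(rows, columns=["company", "job", "degree", "salary_more_then_100k"])

features = ["company", "job", "degree"]
# One fitted encoder per categorical column, kept so it can be reused at predict time.
encoders = {col: LabelEncoder().fit(df[col]) for col in features}

X = pd.DataFrame({col + "_n": encoders[col].transform(df[col]) for col in features})
y = df["salary_more_then_100k"]
model = DecisionTreeClassifier(random_state=0).fit(X, y)


def predict_raw(company, job, degree):
    """Encode one raw sample with the fitted encoders, then predict its class."""
    raw = {"company": company, "job": job, "degree": degree}
    encoded = {col + "_n": encoders[col].transform([raw[col]]) for col in features}
    return int(model.predict(pd.DataFrame(encoded))[0])


print(predict_raw("google", "sales executive", "masters"))  # 0
```

Note that `transform` raises an error for a label the encoder did not see during fitting, so unseen categories need separate handling.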

## Exercise

Exercise: Build a decision tree model to predict survival based on selected passenger features.


The CSV file is available for download at https://github.com/codebasics/py/blob/master/ML/9_decision_tree/Exercise/titanic.csv

Using the following columns from this file, build a model to predict whether a person would survive:
- Pclass
- Sex
- Age
- Fare

Calculate the score of your model.

In [15]:
df = pd.read_csv('titanic.csv')
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [16]:
inputs2 = df.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin','Survived', 'Embarked'], axis = 'columns')
target = df['Survived']

In [17]:
inputs2.head(10)

,Pclass,Sex,Age,Fare
0,3,male,22.0,7.25
1,1,female,38.0,71.2833
2,3,female,26.0,7.925
3,1,female,35.0,53.1
4,3,male,35.0,8.05
5,3,male,,8.4583
6,1,male,54.0,51.8625
7,3,male,2.0,21.075
8,3,female,27.0,11.1333
9,2,female,14.0,30.0708


In [18]:
inputs2.Age = inputs2.Age.fillna(inputs2.Age.mean())

In [19]:
le_sex = LabelEncoder()

In [20]:
inputs2['Sex_n'] = le_sex.fit_transform(inputs2['Sex'])

In [21]:
inputs2.head(10)

,Pclass,Sex,Age,Fare,Sex_n
0,3,male,22.0,7.25,1
1,1,female,38.0,71.2833,0
2,3,female,26.0,7.925,0
3,1,female,35.0,53.1,0
4,3,male,35.0,8.05,1
5,3,male,29.699118,8.4583,1
6,1,male,54.0,51.8625,1
7,3,male,2.0,21.075,1
8,3,female,27.0,11.1333,0
9,2,female,14.0,30.0708,0


In [22]:
inputs2_n = inputs2.drop(['Sex'], axis = 'columns')

In [23]:
inputs2_n.head()

,Pclass,Age,Fare,Sex_n
0,3,22.0,7.25,1
1,1,38.0,71.2833,0
2,3,26.0,7.925,0
3,1,35.0,53.1,0
4,3,35.0,8.05,1


In [24]:
from sklearn.model_selection import train_test_split

In [25]:
x_train, x_test, y_train, y_test = train_test_split(inputs2_n, target, test_size = 0.2)

In [26]:
model2 = DecisionTreeClassifier()

In [27]:
model2.fit(x_train, y_train)

In [28]:
model2.score(x_test, y_test)

0.7932960893854749
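Because `train_test_split` shuffles the rows randomly, the score above will change from run to run unless a seed is fixed. A minimal sketch of two common remedies, passing `random_state` and averaging over folds with `cross_val_score`, assuming synthetic data from `make_classification` as a stand-in for the Titanic features; `holdout_score` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four features (not the Titanic dataset).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)


def holdout_score(seed):
    """Split, fit, and score with a fixed seed so the result is reproducible."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    tree = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    return tree.score(X_test, y_test)


# The same seed gives the same split and therefore the same score.
print(holdout_score(42) == holdout_score(42))  # True

# 5-fold cross-validation: one accuracy per fold, averaged for a steadier estimate.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(cv_scores.mean())
```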

In [29]:
model2.predict(pd.DataFrame([[1,22,7.25,1]], columns =['Pclass', 'Age', 'Fare', 'Sex_n'] ))

array([0])