<a href="https://colab.research.google.com/github/MFaiqKhan/classical_MachineLearning/blob/main/DecisionTree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A Medium article explaining decision trees: https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052



The rationale behind decision trees is to create a model that predicts the value of a target variable based on several input variables. It works by breaking down the data set into smaller and smaller subsets, while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

*In the context of machine learning, the associated decision tree is the model being built by the decision tree algorithm. The algorithm builds the tree incrementally by recursively partitioning the data set into smaller and smaller subsets based on the values of the input features. Each internal node represents a test on a particular input feature, and the branches leaving that node represent the possible outcomes of that test (for example, a feature value falling below or above a threshold). The goal is a tree that accurately predicts the target variable for new, unseen data.*



*In a decision tree, nodes represent features or attributes of the data and edges represent the decision rules or conditions based on the values of those features. There are two main types of nodes in a decision tree:*

- *Decision nodes (or internal nodes): These nodes represent a decision based on a feature or attribute. They have one incoming edge and two or more outgoing edges that correspond to the possible values of the feature. Each decision node splits the data into two or more subsets based on the values of the selected feature.*

- *Leaf nodes (or terminal nodes): These nodes represent a final decision or outcome. They have one incoming edge and no outgoing edges. Each leaf node represents a class label or a decision that can be made based on the values of the features in the corresponding subset of data.*

*In other words, decision nodes represent the decision-making process that occurs in the tree, while leaf nodes represent the final outcomes or predictions based on that process.*


Each decision node in the tree tests an input variable, and each leaf node holds the prediction for the target variable. The tree is "learned" by repeatedly splitting the training data into subsets with an attribute value test, so that each subset contains the rows that share a given outcome of the test.

The attribute to split on is chosen by measuring how well each candidate split separates the classes. The usual measures are Gini impurity and entropy, both of which quantify how mixed the class labels in a set are: they are zero for a set containing a single class and largest when the classes are evenly mixed. The split that yields the lowest weighted impurity in the resulting subsets is preferred.
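As a rough illustration of these two measures, the sketch below computes Gini impurity, entropy, and the size-weighted impurity of a candidate split for small label lists. The function names and the toy labels are made up for this example; they are not part of scikit-learn.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, measure=gini_impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


parent = [0, 0, 1, 1]                     # evenly mixed toy labels
print(gini_impurity(parent))              # 0.5
print(entropy(parent))                    # 1.0
print(weighted_impurity([0, 0], [1, 1]))  # 0.0 (perfect split)
print(weighted_impurity([0, 1], [0, 1]))  # 0.5 (split separates nothing)
```

A split that leaves both children pure scores 0, while a split that leaves the children as mixed as the parent brings no improvement.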

Part of the appeal of decision trees is that the resulting model is simple: it can be read as a sequence of if/else rules that predict the target from the input variables. Trees can be used for classification, where the target variable is categorical, or for regression, where it is continuous.
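To make the interpretability point concrete, here is a minimal sketch that fits a tree on four made-up rows and prints its rules with scikit-learn's `export_text`. The data and the feature names `feature_a` and `feature_b` are placeholders chosen for this example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up): two numeric features, binary target.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]  # the target simply equals the first feature

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the fitted tree as indented if/else style rules.
rules = export_text(clf, feature_names=["feature_a", "feature_b"])
print(rules)
```

The printed rules should show a single test on `feature_a`, since that feature alone separates the two classes in this toy data.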

Decision trees have the advantage of being easy to understand and interpret, as the resulting tree can be visualized and analyzed. However, they can be prone to overfitting and can be unstable when there are small changes in the data. To address these issues, ensemble methods such as random forests and boosting are often used, which combine multiple decision trees to improve the accuracy and stability of the model.
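One way to explore this is to compare a single tree with a random forest under cross-validation. The sketch below uses synthetic data from `make_classification` only to stay self-contained; the sample size, the number of trees, and the seeds are arbitrary assumptions, and the resulting scores will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the sketch runs on its own.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(single_tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print("single tree:  ", tree_scores.mean(), "+/-", tree_scores.std())
print("random forest:", forest_scores.mean(), "+/-", forest_scores.std())
```

Looking at the spread across folds, and not only the mean, gives a sense of how stable each model is.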

Gini importance is a metric used in decision trees to evaluate the importance of each feature in the classification task. It measures the total reduction of the Gini impurity achieved by a feature in all the nodes of the tree where it appears.

Gini impurity is a measure of how often a randomly chosen element from a set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. A feature with higher Gini importance is more important for the classification task, as it contributes more to the reduction of the Gini impurity.

The Gini importance of a feature is calculated as follows:

- At each split in the decision tree, calculate the decrease in Gini impurity: the impurity of the parent node minus the weighted average impurity of its child nodes.

- Multiply that decrease by the number of samples that reach the split.

- Sum the weighted decreases over all the splits that use the feature. The totals are usually normalized so that the importances of all features add up to 1.
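The steps above can be sketched by hand for a tiny tree. The two splits below are hypothetical (they are not taken from the salaries data), and the helper `gini` is defined here just for the example. In scikit-learn, a fitted tree exposes the normalized result as the `feature_importances_` attribute.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


# Hypothetical splits: (feature, parent labels, left labels, right labels).
splits = [
    ("job", [0, 0, 0, 0, 1, 1, 1, 1], [1, 1], [0, 0, 0, 0, 1, 1]),
    ("degree", [0, 0, 0, 0, 1, 1], [0, 0, 0, 0], [1, 1]),
]

importance = defaultdict(float)
for feature, parent, left, right in splits:
    # Sample-weighted impurity decrease at this split.
    decrease = (
        len(parent) * gini(parent)
        - len(left) * gini(left)
        - len(right) * gini(right)
    )
    importance[feature] += decrease

# Normalize so the importances sum to 1.
total = sum(importance.values())
normalized = {feature: value / total for feature, value in importance.items()}
print(normalized)  # job is about 0.33, degree is about 0.67
```

Here `degree` ends up more important than `job` because its split removes all remaining impurity, even though it acts on fewer samples.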

In [76]:
import pandas as pd
df = pd.read_csv("salaries.csv")
df.head()

Unnamed: 0,company,job,degree,salary_more_then_100k
0,google,sales executive,bachelors,0
1,google,sales executive,masters,0
2,google,business manager,bachelors,1
3,google,business manager,masters,1
4,google,computer programmer,bachelors,0


In [77]:
## processing target and input variables

inputs = df.drop('salary_more_then_100k', axis='columns')
target = df['salary_more_then_100k']
target

0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64

In [78]:
inputs 

Unnamed: 0,company,job,degree
0,google,sales executive,bachelors
1,google,sales executive,masters
2,google,business manager,bachelors
3,google,business manager,masters
4,google,computer programmer,bachelors
5,google,computer programmer,masters
6,abc pharma,sales executive,masters
7,abc pharma,computer programmer,bachelors
8,abc pharma,business manager,bachelors
9,abc pharma,business manager,masters


In [79]:
# using label encoder to turn the categorical into numerical form

from sklearn.preprocessing import LabelEncoder
le_job = LabelEncoder()
le_company = LabelEncoder()
le_degree = LabelEncoder()


In [80]:
inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
inputs

Unnamed: 0,company,job,degree,company_n,job_n,degree_n
0,google,sales executive,bachelors,2,2,0
1,google,sales executive,masters,2,2,1
2,google,business manager,bachelors,2,0,0
3,google,business manager,masters,2,0,1
4,google,computer programmer,bachelors,2,1,0
5,google,computer programmer,masters,2,1,1
6,abc pharma,sales executive,masters,0,2,1
7,abc pharma,computer programmer,bachelors,0,1,0
8,abc pharma,business manager,bachelors,0,0,0
9,abc pharma,business manager,masters,0,0,1


In this code snippet, the `LabelEncoder` class from scikit-learn is used to convert categorical variables in the `inputs` DataFrame to numerical labels.

`le_company`, `le_job`, and `le_degree` are instances of the `LabelEncoder` class that are used to fit and transform the `company`, `job`, and `degree` columns of the `inputs` DataFrame, respectively.

The transformed numerical labels are then stored in new columns of the `inputs` DataFrame, with `_n` added to the original column names (`company_n`, `job_n`, and `degree_n`).

Note that the `fit_transform` method is called on each `LabelEncoder` instance to both fit the encoder to the data and transform the data into numerical labels in a single step. When a transformer has to be applied to separate training and test data, the usual scikit-learn pattern is to call `fit_transform` on the training data and then `transform` on the test data, so that both use the mapping learned from the training data.

We need to fit the encoder on the data to create a mapping between the categorical values and numerical values. The mapping is learned during the fitting process, which involves scanning the data to identify all unique categorical values and assigning a unique numerical value to each one. This mapping is then stored in the encoder object.

During the transform step, the encoder applies the mapping to the input data, replacing the categorical values with their corresponding numerical values. This allows us to use categorical data in machine learning algorithms that only work with numerical data.

In the code snippet provided, the `LabelEncoder` `fit_transform()` method is used to fit and transform three categorical features (`company`, `job`, and `degree`) into numerical features (`company_n`, `job_n`, and `degree_n`). Once the transformation is complete, the original categorical features are no longer needed and can be dropped from the input data.
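The fit and transform steps described above can also be run separately, which makes the learned mapping visible. This is a small sketch on a made-up list of company names ("facebook" is a placeholder third value, not taken from the table shown above).

```python
from sklearn.preprocessing import LabelEncoder

companies = ["google", "abc pharma", "facebook", "google"]  # illustrative values

le = LabelEncoder()
le.fit(companies)                 # learn the mapping; classes are stored sorted
print(le.classes_.tolist())       # ['abc pharma', 'facebook', 'google']

codes = le.transform(companies)   # apply the mapping
print(codes.tolist())             # [2, 0, 1, 2]

# The stored mapping can also be applied in reverse.
print(le.inverse_transform([0, 2]).tolist())  # ['abc pharma', 'google']
```

Each code is the position of the value in the sorted `classes_` array, which is why `google` becomes 2 and `abc pharma` becomes 0.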

In [81]:
# drop the original categorical columns and keep only the numerically encoded ones
inputs_n = inputs.drop(['company','job','degree'], axis="columns")
inputs_n

Unnamed: 0,company_n,job_n,degree_n
0,2,2,0
1,2,2,1
2,2,0,0
3,2,0,1
4,2,1,0
5,2,1,1
6,0,2,1
7,0,1,0
8,0,0,0
9,0,0,1


In [82]:
X = inputs_n
y = target

In [83]:
from sklearn import tree
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.8)

model = tree.DecisionTreeClassifier()
model.fit(X_train,y_train)


In [84]:
X_train

Unnamed: 0,company_n,job_n,degree_n
5,2,1,1
10,1,2,0
9,0,0,1
3,2,0,1
12,1,0,0
11,1,2,1
13,1,0,1
0,2,2,0
7,0,1,0
2,2,0,0


In [85]:
X_test

Unnamed: 0,company_n,job_n,degree_n
1,2,2,1
6,0,2,1
15,1,1,1
4,2,1,0


In [86]:
y_test

1     0
6     0
15    1
4     0
Name: salary_more_then_100k, dtype: int64

In [87]:
model.score(X_test,y_test)

0.5

In [88]:
model.predict(X_test)

array([0, 1, 1, 1])
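For a classifier, `score` reports accuracy, so the 0.5 above can be checked by hand from the `y_test` values and the predictions shown in the outputs. A minimal sketch with those values copied in:

```python
y_true = [0, 0, 1, 0]  # y_test as shown above
y_pred = [0, 1, 1, 1]  # model.predict(X_test) as shown above

# Accuracy is the fraction of predictions that match the true labels.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.5
```

With only four test rows, each prediction moves the score by 0.25, so the figure is a very coarse estimate.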

# Exercise:

In [89]:
import pandas as pd
df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [90]:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.25
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.925
3,1,1,female,35.0,53.1
4,0,3,male,35.0,8.05


In [91]:
inputs = df.drop(['Survived'], axis = 'columns')
target = df.Survived
target.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [92]:
inputs.head()

Unnamed: 0,Pclass,Sex,Age,Fare
0,3,male,22.0,7.25
1,1,female,38.0,71.2833
2,3,female,26.0,7.925
3,1,female,35.0,53.1
4,3,male,35.0,8.05


In [93]:
inputs.Sex = inputs.Sex.map({'male': 1, 'female': 2})

This is a manual mapping rather than scikit-learn's `LabelEncoder`. The code explicitly maps the values 'male' and 'female' to 1 and 2, respectively, whereas `LabelEncoder` derives the numerical codes itself from the unique values found in the column.

The `map()` function replaces each value of the `Sex` column with the number given in the dictionary; any value missing from the dictionary would become `NaN`. The same kind of encoding can be achieved using `LabelEncoder` from the sklearn library.

`LabelEncoder` provides a standardized way of encoding categorical variables and avoids the typos and omissions that can occur when categories are mapped by hand, which helps most when a column has many categories. For a two-value column like this one, either approach works.



```
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
inputs['Sex'] = le.fit_transform(inputs['Sex'])

```

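Applied to the original string column, this encodes `Sex` as 0 for female and 1 for male, because `LabelEncoder` assigns codes in sorted order of the class names. The rest of this notebook keeps the `map()` version, so the outputs below show 1 and 2.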
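The two approaches also behave differently when a value was not anticipated. The sketch below uses a made-up series with an extra value, "unknown", which is an assumption for illustration and does not appear in the Titanic data shown here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sex = pd.Series(["male", "female", "female", "unknown"])  # "unknown" is made up

# Explicit mapping: values missing from the dict silently become NaN.
mapped = sex.map({"male": 1, "female": 2})
print(mapped.isna().sum())  # 1

# LabelEncoder: codes follow the sorted order of the values seen during fit.
le = LabelEncoder().fit(["male", "female"])
print(le.transform(["female", "male"]).tolist())  # [0, 1]

# A value not seen during fit raises an error instead of being encoded.
try:
    le.transform(["unknown"])
except ValueError as err:
    print("unseen label:", err)
```

Neither behaviour is better in general: a silent `NaN` is easy to miss, while an error stops the pipeline until the new category is handled.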



In [94]:
inputs.head()

Unnamed: 0,Pclass,Sex,Age,Fare
0,3,1,22.0,7.25
1,1,2,38.0,71.2833
2,3,2,26.0,7.925
3,1,2,35.0,53.1
4,3,1,35.0,8.05


In [95]:
# will return the first 10 values in the "Age" column of the dataset.

inputs.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

Note that some of the values in the "Age" column are missing, as indicated by the "NaN" value.
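A quick way to see how many values are missing, and what a mean fill would do, is sketched below on just the ten `Age` values shown above. The mean printed here is taken over these ten rows only, so it differs from the mean of the full column.

```python
import numpy as np
import pandas as pd

# The first ten Age values shown above.
age = pd.Series(
    [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0], name="Age"
)

print(age.isna().sum())  # 1 missing value
print(age.mean())        # mean() skips NaN by default

# Replace the missing entry with the mean of the observed values.
filled = age.fillna(age.mean())
print(filled.isna().sum())  # 0
```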

In [96]:
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs[:10]

Unnamed: 0,Pclass,Sex,Age,Fare
0,3,1,22.0,7.25
1,1,2,38.0,71.2833
2,3,2,26.0,7.925
3,1,2,35.0,53.1
4,3,1,35.0,8.05
5,3,1,29.699118,8.4583
6,1,1,54.0,51.8625
7,3,1,2.0,21.075
8,3,2,27.0,11.1333
9,2,2,14.0,30.0708


In [98]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.2)

from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.776536312849162
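Because `train_test_split` shuffles the rows at random, the score above changes from run to run. A minimal sketch of making the split reproducible, assuming a fixed `random_state` (the value 42, the `stratify` option, and the placeholder arrays are choices made for this example, not taken from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # placeholder features
y = np.array([0, 1] * 10)         # placeholder binary target

# The same random_state yields the same partition on every call;
# stratify=y keeps the class proportions equal in both parts.
split_a = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_test_a, X_test_b = split_a[1], split_b[1]
print(np.array_equal(X_test_a, X_test_b))  # True
print(len(X_test_a))                        # 4
```

Passing a `random_state` to `DecisionTreeClassifier` as well would remove the remaining source of randomness when comparing runs.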