# Decision Tree Tutorial
### What is an algorithm?
An algorithm is a defined sequence of steps that takes inputs and generates outputs.
### What is a Machine Learning Model?
In some algorithms, the programmer manually writes the logic that converts inputs to outputs. In these traditional algorithms, the logic is created by the programmer, tested and verified by the programmer, and then used.

An ML model is a type of algorithm. In ML models, the logic is created via training, checked via testing, and then used via inference. These steps look different for different ML models. They will make sense through examples.
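
As a rough illustration of the difference, the sketch below puts a hand-written rule next to a model that learns a similar rule from examples. The temperature data, the 25-degree cutoff, and the function name `is_hot_handwritten` are all made up for this example.

```python
from sklearn.tree import DecisionTreeClassifier


# Traditional algorithm: the programmer writes the logic by hand.
def is_hot_handwritten(temperature_c):
    return temperature_c >= 25.0


# ML model: the logic is derived from labeled examples (made-up data).
temperatures = [[5.0], [12.0], [18.0], [22.0], [27.0], [31.0], [35.0]]
labels = [0, 0, 0, 0, 1, 1, 1]  # 0 = not hot, 1 = hot

model = DecisionTreeClassifier(random_state=0)
model.fit(temperatures, labels)

# Both approaches can now answer the same question for new inputs.
handwritten_answers = [is_hot_handwritten(t) for t in (10.0, 30.0)]
learned_answers = model.predict([[10.0], [30.0]])
print(handwritten_answers, learned_answers)
```

The programmer never told the model where the cutoff is; it chose one that fits the examples it was given.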
### What is a decision tree?
A decision tree is a type of ML model. A decision tree has a single "root" node at the top, where every input enters. Each decision node (the root and every internal node) asks one yes-or-no question about the input, usually by comparing one input feature to a constant threshold (for example, "is the petal shorter than 2.5 cm?"). Depending on the answer, the input is passed to one of the node's children. This repeats until the input reaches a node with no children, which is called a terminal node or "leaf". Each leaf holds an output, which is the model's prediction. In this way, a decision tree is a directed acyclic graph (DAG) where the root receives the input, each internal node has logic, and each leaf has output.

![Visual representation of a decision tree](assets/decision-tree.png)
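
One way to picture a small, already-trained decision tree is as nested `if` statements: each `if` is a decision node that compares one input to a threshold, and each `return` is a leaf. The sketch below is hand-written for illustration; the function name and the thresholds are hypothetical and were not learned from data.

```python
def classify_flower(petal_length_cm, petal_width_cm):
    # Root node: compare one feature to a threshold.
    if petal_length_cm <= 2.5:
        return "setosa"  # leaf
    # Internal node: reached only when the root's answer was "no".
    if petal_width_cm <= 1.7:
        return "versicolor"  # leaf
    return "virginica"  # leaf


print(classify_flower(1.4, 0.2))
print(classify_flower(4.5, 1.4))
print(classify_flower(6.0, 2.2))
```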

When the decision tree is initialized, the programmer defines a set of parameters (often called hyperparameters). These limit the shape of the tree: for example, the maximum depth, the maximum number of leaves, or the minimum number of training examples allowed in a leaf. The logic in the internal nodes (which input each node tests, and against which threshold) is not set by the programmer; it is determined from the data during training.
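
In scikit-learn, for instance, limits such as `max_depth` and `min_samples_leaf` are passed in when the tree is created, and the fitted tree can report the shape it ended up with. The specific values below are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The programmer only sets limits on the tree's shape.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
tree.fit(X, y)

# The actual structure is decided during training and can be inspected afterwards.
depth = tree.get_depth()
n_leaves = tree.get_n_leaves()
n_features = tree.n_features_in_
print(f"depth={depth}, leaves={n_leaves}, input features={n_features}")
```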
### Training
To train the tree, the model needs training data. This is a set of pairs of inputs and outputs that are verified to be correct. For example, if the decision tree were trying to predict college GPA from high school GPA, the training data would consist of pairs of high school GPA and college GPA.

Training starts at the root with all of the training data. The training algorithm tries many candidate questions (each one an input feature and a threshold) and measures how well each question separates the training data into two groups with similar outputs. It keeps the best question, splits the data into two groups accordingly, and repeats the process on each group to build the child nodes. A branch stops growing when every example in its group has the same output, when the group is too small to split, or when a limit such as the maximum depth is reached. That group becomes a leaf, and its prediction is the most common output (for categories) or the average output (for numbers) of the training examples in it.
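
To make the idea of scoring a candidate split concrete, here is a minimal sketch for a single feature on a classification problem. It uses Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`; regression trees use a variance-style measure instead. The helper names `gini` and `best_threshold` and the six data points are made up for this example, and real implementations are considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_threshold(values, labels):
    """Try every midpoint between sorted values; return (threshold, impurity)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split at all is the baseline
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weighted average impurity of the two groups after the split.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]
threshold, impurity = best_threshold(petal_lengths, species)
print(threshold, impurity)
```

A full training procedure would run this search over every feature, keep the best split, and then recurse into the two resulting groups.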

### Testing
To test the tree, the programmer uses a fresh set of inputs and outputs. These must be distinct from the training set. The programmer then measures the percentage of test inputs for which the model gives the correct answer. That is the accuracy of the model. Because the model learned from a limited sample of data, it will rarely be 100% correct on data it has not seen; the goal is for it to be close enough to be useful.
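
Accuracy is simple enough to compute by hand. The sketch below uses made-up labels and a hypothetical helper named `accuracy`; scikit-learn's `accuracy_score` does the same job for real models.

```python
def accuracy(expected, predicted):
    """Fraction of predictions that exactly match the expected outputs."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)


expected_labels = ["cat", "dog", "dog", "cat", "dog"]
predicted_labels = ["cat", "dog", "cat", "cat", "dog"]  # one mistake
score = accuracy(expected_labels, predicted_labels)
print(score)
```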

### Inference
Once the model is trained and tested, it can be fed new inputs with unknown outputs. It will then give a prediction. The programmer can judge how useful that prediction is based on the accuracy of the model as measured in testing.
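
As a small sketch of inference with scikit-learn, the code below trains a tree on the iris flower dataset and then asks it about one new flower. The measurements in `new_flower` are hypothetical, and the testing step is skipped here only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(iris.data, iris.target)

# A new flower with no known label (hypothetical measurements, in cm):
# sepal length, sepal width, petal length, petal width
new_flower = [[5.0, 3.4, 1.5, 0.2]]

predicted_class = model.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_class]
print(predicted_name)
```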

### Data types
Decision trees can perform classification or regression. Classification trees assign inputs to one of a fixed set of categories; for example, take in an image and predict whether it is a cat or a dog. Regression trees predict a continuous number; for example, predicting college GPA from high school GPA as explained above.
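
scikit-learn provides a separate class for each case. The sketch below fits both on the same made-up GPA data: the classifier predicts a category and the regressor predicts a number. All values, including the "honors" labels, are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

high_school_gpa = [[2.0], [2.4], [2.8], [3.4], [3.7], [3.9]]
college_gpa = [2.1, 2.3, 2.6, 3.2, 3.5, 3.8]      # numeric output
honors = ["no", "no", "no", "yes", "yes", "yes"]  # categorical output

classifier = DecisionTreeClassifier(max_depth=1, random_state=0)
classifier.fit(high_school_gpa, honors)

regressor = DecisionTreeRegressor(max_depth=1, random_state=0)
regressor.fit(high_school_gpa, college_gpa)

label = classifier.predict([[3.8]])[0]   # a category
number = regressor.predict([[3.8]])[0]   # a number (average of the matching leaf)
print(label, number)
```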

# Decision Tree Exercise
### scikit-learn
scikit-learn is a widely used Python library of ML tools. It has premade models, datasets, and functions to train and evaluate ML models.

To get started, let's import scikit-learn, load an example dataset, train a decision tree classifier, run a test set, and then evaluate the accuracy of the model.

If you try to import scikit-learn and Python raises the error "No module named 'sklearn'", install it by running `pip install scikit-learn` in the terminal.

In [None]:
# Our dataset, which contains petal and sepal measurements for three species of iris flowers
from sklearn.datasets import load_iris
# A function which splits data into training data and testing data
from sklearn.model_selection import train_test_split
# The decision tree
from sklearn.tree import DecisionTreeClassifier
# A function for evaluating the accuracy of an ML model
from sklearn.metrics import accuracy_score

In [None]:
# Load data
iris = load_iris()
X = iris.data
y = iris.target

# This prints a lot of output (150 rows), but gives a sense of the data
print(X)
print(y)

In [None]:
# Split the data into train and test sets, with 20% going to the test set and 80% going to the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Initialize the decision tree
# max_depth is the maximum number of decisions between the root and any leaf.
# max_depth helps limit the computation required to train the model.
# A higher max_depth lets the tree fit the training data more closely,
# but a tree that is too deep can overfit and do worse on the test set.
# random_state seeds the random number generator the model uses during training.
# Try changing random_state and see how it affects accuracy later in the demonstration.
model = DecisionTreeClassifier(max_depth=1, random_state=4)

In [None]:
# Train the model
model.fit(X_train, y_train)

In [None]:
# Test the model
y_prediction = model.predict(X_test)
# Compare the test output to our expected output
accuracy = accuracy_score(y_test, y_prediction)
print(f"The model has an accuracy score of {accuracy}")
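
It can also help to look at what the tree actually learned. As a sketch, scikit-learn's `export_text` prints a fitted tree as indented text rules. The block below repeats the loading, splitting, and training steps from the cells above so that it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(max_depth=1, random_state=4)
model.fit(X_train, y_train)

# Each indented line is a decision (feature <= threshold) or a leaf (class).
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

With `max_depth=1` the tree asks only one question, so it has just two leaves even though the dataset has three classes.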

# Exploration questions
### What is the model's accuracy at a max_depth of 1? 2? 3? 4? 5? 99?


### How does changing random_state change (or not change) the model's accuracy? Why do you think that may be?


### What happens when you increase test_size from 0.2 to 0.5? 0.8?

# Generating your own model

### Outline
You will train a decision tree model on scikit-learn's diabetes dataset. This is a regression problem, unlike iris (above), which was a classification (categorical) problem.

Remember, the basic steps are:
- Import required libraries
- Prep data (split into training and testing)
- Train model
- Test model
- Compare test output to expected output
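
One thing to keep in mind for the last step: `accuracy_score` counts exact matches, which rarely happen when the outputs are continuous numbers. Regression results are usually compared with an error measure instead. The sketch below shows two options from `sklearn.metrics` on made-up numbers (not the diabetes data). `DecisionTreeRegressor` in `sklearn.tree` is the regression counterpart of the classifier used above.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Made-up expected and predicted values, only to show the metric calls.
expected = [3.0, 2.5, 4.0, 3.5]
predicted = [2.5, 2.5, 4.5, 3.5]

# Mean squared error: average of the squared differences (lower is better).
mse = mean_squared_error(expected, predicted)
# R^2: 1.0 is a perfect fit; 0.0 is no better than always predicting the mean.
r2 = r2_score(expected, predicted)
print(mse, r2)
```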

In [None]:
# Have fun!
from sklearn.datasets import load_diabetes
