<a href="https://colab.research.google.com/github/GabrielleRab/SRMPmachine/blob/main/Decision_Trees_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Decision trees with a dataset of your choice** 

Recall that we use decision trees when we have labeled data and our research question calls for categorization of our data into different groups based on their features.

### **Step 1:** Identify your question

This has already been done for you! Look at the research question for your dataset and consider whether or not it's a good fit for the decision tree approach.

### **Step 2:** Select your data

Let's import our data. First we need to load in the necessary Python libraries. Run the code below:

In [None]:
# import the necessary Python library
import pandas as pd

Next, we create a dataframe called `df` from a pre-cleaned version of your data.

**Important:** *Only run the cell for your chosen dataset. Ignore the other two cells.*

In [None]:
# Run this cell ONLY if you are using the stellar rotation dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/Stellar_rotation_clean.csv")

In [None]:
# Run this cell ONLY if you are using the dragonfly wing dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/wing_measurements_clean.csv")

In [None]:
# Run this cell ONLY if you are using the North Carolina crime dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/crime_data_clean.csv")

Let's take a look at the first 5 rows of the dataset. Make sure this is the dataset you meant to import! If it's wrong, go back and run the correct cell above. That will overwrite the dataframe.

In [None]:
df.head()

Run the code below to find out how many rows are in our dataset:

In [None]:
# return the number of rows in the dataset
len(df)

### **Step 3:** Choose your method

Review your dataset and your research question one more time to make sure that you're ready to use the decision tree model. Remember that it is a good fit for categorizing labeled data.

Run the code below to import the necessary Python libraries for creating Decision Trees.

In [None]:
#import necessary Python libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import graphviz
from numpy import unique

### **Step 4:** Prepare your data

This step has been taken care of for you! All three datasets only contain the needed columns and have all relevant data in numerical form.

Let's check our dataset for balance!

**Replace aaaa and bbbb with the names of your first and second labels.**

**Then replace xxxx with the name of the column that contains your labels.**

In [None]:
# Replace aaaa and bbbb with the names of your first and second labels:
label1 = "aaaa"
label2 = "bbbb"

# Replace xxxx with the name of the column that contains your labels:
colname = "xxxx"

# Count the rows that carry each label
print(label1+":",len(df[df[colname] == label1]))
print(label2+":",len(df[df[colname] == label2]))

**Bias Alert:** If there are more values with one label than the other, this introduces a possible source of statistical bias into our analysis. Our decision tree may end up better prepared to identify values with one label than the other. 

You can still use decision trees with unbalanced data, but if your tree is not as effective as you hoped, the imbalance is one possible explanation.
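
If you want to see the balance as proportions rather than raw counts, here is a minimal sketch using pandas `value_counts`. The toy dataframe and its `label` column are made up for illustration and are not part of any of the three datasets.

```python
import pandas as pd

# Hypothetical toy data; "label" stands in for your own label column
toy = pd.DataFrame({"label": ["a", "a", "a", "b"]})

# Count how many rows carry each label
counts = toy["label"].value_counts()

# Express the same counts as fractions of the whole dataset
fractions = toy["label"].value_counts(normalize=True)

print(counts)
print(fractions)
```

Here three of the four rows carry label `a`, so a model that always guessed `a` would already be right 75% of the time. That is worth keeping in mind when you read an accuracy score on unbalanced data.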

Now it's time to split our data into a training and a testing dataset using a random split:

In [None]:
#Get the features and labels from the data 
x = df.drop([colname], axis=1)
y = df[colname]

#Specify a 50% split
training_percentage = 50 

#Create the training and testing datasets
X_train, X_test, Y_train, Y_test = train_test_split(x, y, train_size=training_percentage/100)
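
Because the split above is random, you will get slightly different results each time you run it. As an optional variation (not the method used in this notebook), the sketch below shows two extra `train_test_split` arguments: `random_state`, which makes the split repeatable, and `stratify`, which keeps the label proportions the same in both halves. The toy data and the variable names are hypothetical and chosen so they do not overwrite your own `X_train` and `Y_train`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows labelled "a" and 4 rows labelled "b"
toy = pd.DataFrame({
    "feature": range(12),
    "label": ["a"] * 8 + ["b"] * 4,
})
toy_x = toy.drop(["label"], axis=1)
toy_y = toy["label"]

# random_state fixes the shuffle; stratify keeps the label proportions
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, train_size=0.5, random_state=0, stratify=toy_y
)

print(ty_train.value_counts())
print(ty_test.value_counts())
```

Both halves should end up with four `a` rows and two `b` rows, matching the 2:1 ratio of the full toy dataset.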

### **Step 5:** Train the model

Now it's time to make our decision tree. We will also need to set the hyperparameters (values that control how the model learns and makes decisions). In this case we will specify the maximum depth and the criterion the model will use to evaluate each feature. 

The "Gini index" measures how mixed the labels are within each node of the decision tree. A lower Gini index indicates a purer split, and a Gini index of 0 means every row in the node has the same label.
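
To make this concrete, here is a small worked sketch of the Gini calculation: one minus the sum of the squared proportions of each label in a node. The helper function `gini_impurity` and the two example nodes are written for illustration only. scikit-learn does this calculation internally when `criterion='gini'`.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 minus the sum of squared label proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# A node where every row has the same label
pure_node = ["a", "a", "a", "a"]

# A node split evenly between two labels
mixed_node = ["a", "a", "b", "b"]

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

With two labels, 0.5 is the highest possible value, which corresponds to a node that is as mixed as it can be.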

Run the code below to create your model:

In [None]:
#Create a decision tree classifier called "clf"
clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
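
If you are curious how the maximum depth affects a tree, the sketch below trains three small trees with different `max_depth` values and records their training and testing accuracy. It uses the iris dataset that ships with scikit-learn as a stand-in, not one of the three datasets in this notebook, and all of the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the iris dataset bundled with scikit-learn
iris = load_iris()
dx_train, dx_test, dy_train, dy_test = train_test_split(
    iris.data, iris.target, train_size=0.5, random_state=0
)

# Train one tree per depth and store (training, testing) accuracy
depth_scores = {}
for depth in [1, 2, 3]:
    demo = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=0)
    demo.fit(dx_train, dy_train)
    depth_scores[depth] = (demo.score(dx_train, dy_train), demo.score(dx_test, dy_test))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(depth, round(100 * train_acc, 1), round(100 * test_acc, 1))
```

A deeper tree can ask more questions, so it usually fits the training data more closely. Whether that also helps on the testing data depends on the dataset.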

Now that we have created a decision tree, it's time to train it using our training dataset. We will also evaluate the model's accuracy (the percentage of rows in the training dataset that it classifies correctly).

Run the code below to train and evaluate our model:

In [None]:
#Train the model
clf.fit(X_train, Y_train)

#Print the training accuracy 
print('\nTraining Accuracy (%): ', 100*clf.score(X_train, Y_train))

Now that we have trained our model, we can see which features it is using to make predictions. 

Run the code below to visualize the tree trained on the original 50/50 split:

In [None]:
#Create a visualization for the decision tree
dot_data = tree.export_graphviz(clf, out_file=None,
                               feature_names=X_train.columns,
                               class_names=[str(label) for label in unique(Y_train)],
                               filled=True, rounded=True,
                               special_characters=True)
graph = graphviz.Source(dot_data)
graph

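
If the graphviz picture does not render, or you would like the same information as text, scikit-learn can also print a tree's rules with `export_text` and report how much each feature contributed through `feature_importances_`. The sketch below is one possible way to do that. It uses the bundled iris dataset and a hypothetical classifier called `demo_clf` as stand-ins for your own data and `clf`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data and model; swap in your own X_train, Y_train and clf
iris = load_iris()
demo_clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
demo_clf.fit(iris.data, iris.target)

# Print the tree as indented if/else rules
rules = export_text(demo_clf, feature_names=list(iris.feature_names))
print(rules)

# Show how much each feature contributed to the splits (values sum to 1)
for name, importance in zip(iris.feature_names, demo_clf.feature_importances_):
    print(name, round(importance, 3))
```

A feature with an importance of 0 was never used for a split at this depth.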

### **Step 6:** Test the model

Run the code below to pass the other half of our data (the testing dataset) through the model we just trained and find out how accurate it is on new data:

In [None]:
#Make the prediction using the model
Y_pred = clf.predict(X_test)

print('Percentage accuracy: ', 100*accuracy_score(Y_test, Y_pred))
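
A single accuracy number does not tell you which label the model gets wrong. One optional way to look closer is a confusion matrix, sketched below with scikit-learn's `confusion_matrix`. The true and predicted labels here are made up for illustration. With your own model you would pass `Y_test` and `Y_pred` instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six test rows
true_labels = ["a", "a", "a", "a", "b", "b"]
predicted = ["a", "a", "a", "a", "a", "b"]

# Rows are the true labels, columns are the predicted labels
matrix = confusion_matrix(true_labels, predicted, labels=["a", "b"])
print(matrix)

# Overall accuracy for comparison
print(100 * accuracy_score(true_labels, predicted))
```

In this toy example every `a` row is classified correctly, but one of the two `b` rows is mistaken for `a`. The overall accuracy is about 83%, yet the model is right only half the time on the less common label.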

### **Step 7:** Evaluate the model

Did this model help you answer your research question?

What are some forms of bias that you need to be aware of in this analysis?

What questions do you still have?