<a href="https://colab.research.google.com/github/GabrielleRab/SRMPmachine/blob/main/Decision_Trees_Dragonflies_location.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploring Dragonfly wings with Decision Trees**

### **Step 1:** Identify your question

In this Colab notebook, you will build a machine learning model to find out which characteristics of a dragonfly's wing best predict where it was collected (in North America or outside of North America).

### **Step 2:** Select your data
This dataset contains wing measurements from a series of dragonfly specimens collected around the world.

![Dragonfly](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQEBt2Q-q8f4RA4x0vo4bRPw6w3_mSshyaUiHXO8cR52CfDW419Vjy_0-CfiRohgVqiWWE&usqp=CAU)

The measurements include:
*   Area of each hindwing (HW) and forewing (FW) in mm^2
*   Inner and outer length and width of each wing (in mm)
*   Widest ratio: The distance from the base of the wing to its thickest point, divided by the wing length (both measured in pixels)
*   Slope of thickness: The slope of the best-fit line through wing thickness (in pixels) plotted along the length of the wing

Let's load the data into the colab and take a look. Run the cell below to create a dataframe (table) of our data and preview the first five rows:

In [None]:
# import the necessary Python libraries
import pandas as pd

# create a dataframe called "df" with the dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/wing_measurements_location.csv")
df['North_America'] = df['North_America'].map({True: 'True', False: 'False'})

# preview the first five rows
df.head()

Let's find out how many rows are in our dataset. Each row represents a dragonfly specimen that has been collected.

Type len(df) below and run the cell:

In [2]:
# return the number of rows in the dataset. Type len(df) below and run the cell


### **Step 3:** Choose your method

We will be using the Decision Tree method today, as it is a good fit for classifying labeled data.

Run the code below to import the necessary Python libraries for creating Decision Trees.

In [3]:
#import necessary Python libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import graphviz
from numpy import unique

import warnings
warnings.filterwarnings('ignore')

### **Step 4:** Prepare your data

Decision Trees in scikit-learn need the features being analyzed to be numerical. Aside from our labels ("North_America": True or False), several other columns contain non-numerical data or information we do not need (such as the collection ID). We will remove those now.

In [None]:
#We will remove columns we don't need for this investigation
df = df.drop('Collection Unique ID', axis=1)
df = df.drop('Suborder', axis=1)
df = df.drop('Species name', axis=1)

#Return the first 5 rows of the dataframe
df.head()

Try it yourself! We only want to look at data formatted like numbers, so we will remove the Family column. Type df = df.drop('Family', axis=1) in the cell below and press play to run your code.

Check to make sure that the Family column is gone.

In [None]:
#Type df = df.drop('Family', axis=1) below to remove the 'Family' column.


#Return the first 5 rows of the dataframe
df.head()
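
One way to double-check which columns still hold non-numerical data is to ask pandas directly. The sketch below uses a small made-up table (the column "FW area" and all values are hypothetical, not taken from the dataset) and `select_dtypes` to list the columns that are not numbers:

```python
import pandas as pd

# A tiny made-up table standing in for the real dataset (values are hypothetical)
toy = pd.DataFrame({
    "Family": ["Aeshnidae", "Libellulidae"],
    "FW area": [312.5, 198.0],
    "North_America": ["True", "False"],
})

# List the columns whose values are not numbers
non_numeric = list(toy.select_dtypes(exclude="number").columns)
print(non_numeric)
```

If you try the same call on `df`, the label column North_America should appear in the list because we stored it as text. Any other column listed is a candidate for removal.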

When people collect data by hand, there are likely to be some missing values. Let's remove all of the rows with missing data. Run the code below to do so:

In [6]:
#Remove any rows with missing values
df = df.dropna()

Let's see how many rows are left. Type len(df) below and run the code:

In [None]:
#Return the remaining number of rows. Type len(df) below and run code:



Now that our dataset is ready, let's check it for balance. Does it contain the same number of specimens from North America as from outside of North America? Run the code below to find out:

In [None]:
#Print the number of specimens collected in North America and outside of North America
print("In North America:",len(df[df.North_America == 'True']))
print("Not in North America:",len(df[df.North_America == 'False']))

As it turns out, our dataset contains significantly more specimens collected in North America than outside of North America.

**Bias Alert:** This introduces a possible source of statistical bias into our analysis. Our decision tree may end up better prepared to identify North American dragonflies because it has more information about them.

One way to address this source of bias is to ensure that our training and testing data both contain enough specimens from each location. Another is to make sure that we have a large enough training dataset to include sufficient non-North American specimens.
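
As an optional illustration of the first idea, scikit-learn's `train_test_split` accepts a `stratify` argument that keeps the label proportions the same in both sets. This notebook's own split does not use it; the sketch below runs on made-up labels and placeholder measurements, not on the dragonfly data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 80 North American specimens and 20 from elsewhere
labels = ["True"] * 80 + ["False"] * 20
features = [[i] for i in range(100)]  # placeholder measurements

# stratify=labels keeps the True/False proportions the same in both halves
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, train_size=0.5, stratify=labels, random_state=0
)

train_counts = Counter(l_train)
test_counts = Counter(l_test)
print("Training set:", dict(train_counts))
print("Testing set:", dict(test_counts))
```

With an 80/20 mix and a 50/50 split, each half ends up with 40 "True" and 10 "False" specimens, so neither set runs short of the rarer label.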

Now it's time to split our data into a training and a testing dataset.

**Bias Alert:** It's important to split the data randomly so that any clustering or sorting in the dataset does not carry over into the training or testing set.

Split the data 50/50 into a training and testing set by typing the number 50 after "training_percentage =" below. Then run the code cell.

In [8]:
#Get the features and labels from the data
x = df.drop(['North_America'], axis=1)
y = df['North_America']

#Specify a 50% split: TYPE 50 AFTER THE EQUAL SIGN BELOW
training_percentage =

#Create the training and testing datasets
X_train, X_test, Y_train, Y_test = train_test_split(x, y, train_size=training_percentage/100)

### **Step 5:** Train the model

Now it's time to make our decision tree. We will also need to set the hyperparameters (values that control how the model learns and makes decisions). In this case we will specify the maximum depth and the criterion the model will use to evaluate each feature.

The "Gini index" measures how mixed the labels are at each node in the Decision Tree. A lower Gini index indicates a purer node: a value of 0 means every specimen at that node has the same label.
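
To make the idea concrete, here is a minimal sketch of the Gini calculation for a single node, written in plain Python. The helper name `gini` and the example labels are made up for illustration; scikit-learn computes this internally when `criterion='gini'`.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 1 minus the sum of squared label fractions."""
    total = len(labels)
    counts = Counter(labels)
    return 1 - sum((n / total) ** 2 for n in counts.values())

pure_node = ["True", "True", "True", "True"]
mixed_node = ["True", "True", "False", "False"]

print(gini(pure_node))   # 0.0 (all labels agree)
print(gini(mixed_node))  # 0.5 (an even two-way mix)
```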

Type the number 2 after "max_depth= " below and then run the code cell to create your model:

In [9]:
#Create a decision tree classifier called "clf"
# TYPE 2 AFTER "MAX_DEPTH= " BELOW
clf = DecisionTreeClassifier(criterion='gini', max_depth= , random_state=0)

Now that we have created a decision tree, it's time to train it using our training dataset. We will also evaluate the model's accuracy (the percentage of dragonflies it classifies correctly by where they were collected).

Run the code below to train and evaluate our model:

In [None]:
#Train the model
clf.fit(X_train, Y_train)

#Print the training accuracy
print('\nTraining Accuracy (%): ',(100*(clf.score(X_train,Y_train))))

**Bias alert:** Because our training accuracy is so high, we run the risk of overfitting our model. This means it might be better at predicting the training data than the testing data.

We can try to address this by training the model with a few different split percentages (instead of a 50/50 train/test split). Run the code below to see how they compare:

In [None]:
#Train a model with a 40% training percentage
X_train_b, X_test_b, Y_train_b, Y_test_b = train_test_split(x, y, train_size=40/100)
clf_b = DecisionTreeClassifier(criterion='gini', max_depth=2)
clf_b.fit(X_train_b, Y_train_b)
print('Training Accuracy (%) for 40/60 split: ',(100*(clf_b.score(X_train_b,Y_train_b))))

#Train a model with a 30% training percentage
X_train_c, X_test_c, Y_train_c, Y_test_c = train_test_split(x, y, train_size=30/100)
clf_c = DecisionTreeClassifier(criterion='gini', max_depth=2)
clf_c.fit(X_train_c, Y_train_c)
print('Training Accuracy (%) for 30/70 split: ',(100*(clf_c.score(X_train_c,Y_train_c))))

Rerun the code above a few times to see how the results change.

Our Decision Tree's training accuracy is always around 90%, regardless of the split percentage we use.
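
Rerunning a cell by hand can also be automated with a loop. The sketch below is one possible way to do that; it uses synthetic stand-in data from `make_classification` rather than the dragonfly measurements, so its numbers will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the dragonfly measurements)
features, labels = make_classification(n_samples=200, n_features=6, random_state=0)

scores = []
for seed in range(5):
    # A different random_state gives a different random 40/60 split each time
    f_train, f_test, l_train, l_test = train_test_split(
        features, labels, train_size=0.4, random_state=seed
    )
    model = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
    model.fit(f_train, l_train)
    scores.append(100 * model.score(f_train, l_train))

print("Training accuracy per split (%):", [round(s, 1) for s in scores])
print("Average (%):", round(sum(scores) / len(scores), 1))
```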

Now that we have trained our model, we can see which features it is using to make predictions.

Run the code below to visualize the tree based on the original 50/50 split:

In [None]:
#Create a visualization for the decision tree
dot_data = tree.export_graphviz(clf, out_file=None,
                               feature_names=X_train.columns,
                               class_names=clf.classes_,  # label order used by the trained model
                               filled=True, rounded=True,
                               special_characters=True)
graph = graphviz.Source(dot_data)
graph

Note: "True" means collected in North America and "False" means collected outside of North America.

### **Step 6:** Test the model

Run this code block to pass the other half of our data (the testing dataset) through the model we just trained and find out how accurate it is on new data:

In [None]:
#Make the prediction using the model
Y_pred = clf.predict(X_test)

print('Percentage accuracy: ', 100*accuracy_score(Y_test, Y_pred))

**Bias alert:** It's important to check whether your decision tree is more or less accurate at classifying the training data than the testing data. If it is noticeably more accurate on the training data than on the testing data, your model may be overfit.

Compare the accuracy in the training vs testing step. Is our model overfit?
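
For a sense of what overfitting can look like, the sketch below compares a shallow tree with an unrestricted one. It runs on synthetic stand-in data from `make_classification`, not on the dragonfly measurements, and the variable names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the dragonfly measurements)
features, labels = make_classification(n_samples=300, n_features=8, random_state=1)
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, train_size=0.5, random_state=1
)

results = {}
for depth in (2, None):  # None lets the tree keep splitting until the training data is fit
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(f_train, l_train)
    train_acc = 100 * model.score(f_train, l_train)
    test_acc = 100 * model.score(f_test, l_test)
    results[depth] = (train_acc, test_acc)
    print("max_depth =", depth, "| training (%):", round(train_acc, 1),
          "| testing (%):", round(test_acc, 1))
```

The unrestricted tree reaches 100% on its own training data here, which is a typical warning sign: a large gap between the training and testing columns suggests the model memorized the training set.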

### **Step 7:** Evaluate the model

Let's apply the model we just made to some newly identified dragonflies!

Look at the data for four dragonflies and see if our model would classify them correctly:
[New dragonfly records](https://docs.google.com/spreadsheets/d/1qt-X0ylXl8HSzL_rYrMPLANudCJAmSIa-uZKZQLITfU/edit?usp=sharing)

Follow the decision tree above to find out! For example, if the tree says "Slope of thickness FW ≤ 0.069" and the dragonfly has a "Slope of thickness FW" of 0.05, you should follow the "True" arrow because it is true that this dragonfly's forewing Slope of thickness is less than or equal to 0.069.
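
The same walk through the tree can be sketched in plain Python. The toy tree below is hand-written for illustration: only the first question (Slope of thickness FW ≤ 0.069) comes from the example above, while the "FW area" question, its threshold, and all of the leaf labels are made up and will differ from your trained tree.

```python
# A hand-written toy tree. "yes" is the branch to follow when the question is true.
toy_tree = {
    "feature": "Slope of thickness FW",
    "threshold": 0.069,
    "yes": {
        "feature": "FW area",      # hypothetical second question
        "threshold": 250.0,        # hypothetical threshold
        "yes": {"label": "False"},
        "no": {"label": "True"},
    },
    "no": {"label": "False"},
}

def classify(node, specimen):
    """Follow the yes/no branches until a leaf with a label is reached."""
    while "label" not in node:
        answer = specimen[node["feature"]] <= node["threshold"]
        node = node["yes"] if answer else node["no"]
    return node["label"]

# A made-up specimen: 0.05 <= 0.069 is true, then 300.0 <= 250.0 is false
specimen = {"Slope of thickness FW": 0.05, "FW area": 300.0}
print(classify(toy_tree, specimen))  # True
```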

Did the model get all four dragonflies right? Did anything about this result surprise you? Why or why not?