#Simple tutorial on Decision Trees. Full tutorial article: https://exploringaiblog.wordpress.com/2019/02/28/an-intro-to-decision-trees-branching-out-in-machine-learning/

#Get dependencies

In [None]:
import pandas as pd
import pydotplus  # pip install pydotplus (writing PNGs also requires the Graphviz binaries)
from sklearn.tree import export_graphviz
from sklearn.tree import DecisionTreeClassifier
import numpy as np

#Create mock data

We will create a mock dataset again. This time we denote male as 1 and female as 0 in our “Gender” column, and our feature column will be “P_Movies” (the number of movies as protagonist). We will also increase the range of our feature column and randomize the dataset a bit more:

Note: The model will predict an actor's gender based on the number of movies in which he/she played the protagonist.

In [None]:
data = pd.DataFrame({'P_Movies': [17, 64, 18, 20, 38, 49, 55, 25, 29, 31, 33],
                     'Gender': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]})
data = data.sort_values('P_Movies')
data

,P_Movies,Gender
0,17,1
2,18,1
3,20,0
7,25,1
8,29,1
9,31,0
10,33,1
4,38,1
5,49,0
6,55,0
1,64,0


#Helper code to visualize tree

In [None]:
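def tree_graph_to_png(tree, feature_names, png_file_to_save):
    """Render a fitted decision tree to a PNG file via Graphviz."""
    # out_file=None makes export_graphviz return the DOT source as a string
    tree_str = export_graphviz(tree, feature_names=feature_names,
                               filled=True, out_file=None)
    graph = pydotplus.graph_from_dot_data(tree_str)
    graph.write_png(png_file_to_save)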

#Train tree and get predictions

In [None]:
# Define the decision tree, using entropy as the split criterion
dt = DecisionTreeClassifier(criterion='entropy')
# X holds the features; scikit-learn expects a 2-D array, hence the reshape
X = data['P_Movies'].values.reshape(-1, 1)
# Y is the vector of target labels
Y = data['Gender'].values
# Fit the tree to the data
dt.fit(X, Y)

tree_graph_to_png(dt, feature_names=['P_Movies'],
                 png_file_to_save='dt.png')
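
To get a feel for what `criterion='entropy'` means, the cell below is a minimal sketch that scores every candidate threshold on “P_Movies” by its information gain. The helper names `entropy` and `information_gain` are hypothetical and written only for this illustration; this is not scikit-learn's actual implementation, just the idea behind how the first split could be chosen.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(x, y, threshold):
    """Entropy reduction obtained by splitting the samples on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted

# Same values as the mock dataset above
x = np.array([17, 64, 18, 20, 38, 49, 55, 25, 29, 31, 33])
y = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# Candidate thresholds: midpoints between consecutive sorted feature values
xs = np.sort(x)
candidates = (xs[:-1] + xs[1:]) / 2
gains = [information_gain(x, y, t) for t in candidates]
best = candidates[int(np.argmax(gains))]

print(f"parent entropy: {entropy(y):.3f}")
print(f"best threshold: {best}, gain: {max(gains):.3f}")
```

The threshold with the highest gain is the one that leaves the purest groups on either side, which is the kind of split a tree trained with the entropy criterion favours at its root.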

In the code below, we define our input array with the “P_Movies” values for 4 actors. The DT takes in the array and predicts each actor's gender: 1 corresponds to male, 0 corresponds to female. Again, as our data is highly randomized, these predictions will seem random too.

In [None]:
d = np.array([7, 15, 43, 45])
d = d.reshape(-1, 1)

dt.predict(d)

array([1, 1, 1, 0])
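
If Graphviz is not available, one way to see why the tree gives these answers is to print its rules as text. The cell below is a self-contained sketch that refits the same tree and uses `export_text` from `sklearn.tree` (available in recent scikit-learn versions). The `random_state=0` argument is an assumption added here only to make the run repeatable; it is not part of the original cell.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Same mock data as above, as plain NumPy arrays
X = np.array([17, 64, 18, 20, 38, 49, 55, 25, 29, 31, 33]).reshape(-1, 1)
Y = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# random_state is an assumption for repeatability, not in the original code
dt = DecisionTreeClassifier(criterion='entropy', random_state=0)
dt.fit(X, Y)

# Print the learned thresholds as indented if/else rules
rules = export_text(dt, feature_names=['P_Movies'])
print(rules)

# Predict for the same 4 actors as above
preds = dt.predict(np.array([7, 15, 43, 45]).reshape(-1, 1))
print(preds)
```

Reading the printed rules from the top, each input can be followed down to a leaf, which shows which threshold separates the inputs 43 and 45 into different classes.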