# "Assisted" Decision Tree

The data set from the initial source included SAD and GAD probabilities, which we dropped from the dataset out of concern that they give the model too much information about the outcomes, essentially handing it the answer.

With the cleaned data we had a success rate of around 68% with regular decision trees and 72% with random forest. The paper published with the data set achieved a 96% success rate using alternating decision trees. This 'assisted' decision tree is the same model we used with the regular decision trees and random forest, but with the SAD and GAD probabilities included. As expected, this made the model markedly more accurate: about 89% for the decision tree and 88% for the random forest on the test split. SAD and GAD probabilities were the most important features by a wide margin.

This supports our decision to exclude the SAD and GAD probabilities, as well as the sample weight, from our analysis.
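
The effect described above can be reproduced on toy data. The following is a minimal sketch, not the analysis in this notebook: the frame `toy`, the column `leaky_probability`, and the helper `holdout_accuracy` are all hypothetical, and the leaky column is simply the label plus a little noise, standing in for a feature that nearly restates the outcome.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
label = rng.integers(0, 2, size=n)  # synthetic binary outcome

toy = pd.DataFrame({
    'noise_a': rng.normal(size=n),  # features unrelated to the label
    'noise_b': rng.normal(size=n),
    # hypothetical stand-in for a column that almost restates the label
    'leaky_probability': label + rng.normal(scale=0.05, size=n),
})

def holdout_accuracy(features, target):
    """Fit a decision tree on a train split and score it on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, random_state=42)
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return model.score(X_test, y_test)

with_leak = holdout_accuracy(toy, label)
without_leak = holdout_accuracy(toy.drop(columns=['leaky_probability']), label)
print(f'with leaky column: {with_leak:.2f}, without: {without_leak:.2f}')
```

Comparing the two scores side by side is one way to check whether a suspicious column is doing most of the work before deciding to keep or drop it.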

In [1]:
# Dependencies
from sklearn import tree
import pandas as pd
import os 
from sklearn.model_selection import train_test_split

In [2]:
# Load in data
df = pd.read_csv('../cleaned_data/cleaned_anxiety_testing_data.csv')
df.head()

Unnamed: 0,Subject,Sex,Race,Age,Number of Bio. Parents,Number of Siblings,Poverty Status,SAD,GAD,Social Phobia,...,Frequency Temper Tantrums,Frequency Irritable Mood,Number of Sleep Disturbances,Number of Physical Symptoms,Number of Sensory Sensitivities,Family History - Substance Abuse,Family History - Psychiatric Diagnosis,GAD Probabiliy - Gamma,SAD Probability - Gamma,Sample Weight
0,Test101,F,0,2,1,1,0,0,1,1,...,27,396,4,3,2,0,0,1.596957,0.067867,0.572227
1,Test102,M,1,2,0,3,1,0,0,0,...,282,1179,6,1,0,1,1,0.03265,0.068811,0.396157
2,Test103,F,1,3,2,1,0,1,0,0,...,3,0,9,4,1,0,0,0.005782,0.270583,0.686673
3,Test104,F,0,4,2,0,0,0,0,0,...,30,45,6,0,0,0,1,1.596957,0.068811,0.400559
4,Test105,F,0,5,2,3,0,1,1,1,...,637,1170,10,5,2,0,0,1.596957,47.355899,0.314725


In [3]:
#Cleaning data
df = df.drop(['Subject'], axis = 1)
df['disorder'] = 'place holder'


In [4]:
# create disorder function
def disorder_condenser(x):
    if x.SAD == 0 and x.GAD == 0:
        return 'None'
    elif x.SAD == 1 and x.GAD == 0:
        return 'SAD'
    elif x.SAD == 0 and x.GAD == 1:
        return 'GAD'
    return 'Both'

In [5]:
# applying function
df['disorder'] = df.apply(disorder_condenser, axis=1)
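
The row-wise `apply` above is fine for a data set this size. As an alternative, the same four-way mapping can be sketched in vectorized form with `numpy.select`; the frame `toy` below is a hypothetical stand-in that only assumes the same `SAD` and `GAD` 0/1 flag columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame covering all four SAD/GAD combinations
toy = pd.DataFrame({'SAD': [0, 1, 0, 1], 'GAD': [0, 0, 1, 1]})

# Conditions are checked in order; rows matching none of them get the default
conditions = [
    (toy.SAD == 0) & (toy.GAD == 0),
    (toy.SAD == 1) & (toy.GAD == 0),
    (toy.SAD == 0) & (toy.GAD == 1),
]
choices = ['None', 'SAD', 'GAD']
toy['disorder'] = np.select(conditions, choices, default='Both')
print(toy)
```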

In [6]:
#Cleaning data
df = df.drop(['GAD','SAD'], axis=1)
df.head()

Unnamed: 0,Sex,Race,Age,Number of Bio. Parents,Number of Siblings,Poverty Status,Social Phobia,ADHD,CD,Depression,...,Frequency Irritable Mood,Number of Sleep Disturbances,Number of Physical Symptoms,Number of Sensory Sensitivities,Family History - Substance Abuse,Family History - Psychiatric Diagnosis,GAD Probabiliy - Gamma,SAD Probability - Gamma,Sample Weight,disorder
0,F,0,2,1,1,0,1,0,0,1,...,396,4,3,2,0,0,1.596957,0.067867,0.572227,GAD
1,M,1,2,0,3,1,0,1,0,0,...,1179,6,1,0,1,1,0.03265,0.068811,0.396157,None
2,F,1,3,2,1,0,0,0,0,0,...,0,9,4,1,0,0,0.005782,0.270583,0.686673,SAD
3,F,0,4,2,0,0,0,0,0,0,...,45,6,0,0,0,1,1.596957,0.068811,0.400559,None
4,F,0,5,2,3,0,1,0,1,1,...,1170,10,5,2,0,0,1.596957,47.355899,0.314725,Both


In [7]:
# convert sex into a binary int (M -> 0, F -> 1)
# map() assigns the whole column at once, which avoids the chained-assignment
# SettingWithCopyWarning; any value other than 'M' or 'F' would become NaN
df['Sex'] = df['Sex'].map({'M': 0, 'F': 1})
df.head()


Unnamed: 0,Sex,Race,Age,Number of Bio. Parents,Number of Siblings,Poverty Status,Social Phobia,ADHD,CD,Depression,...,Frequency Irritable Mood,Number of Sleep Disturbances,Number of Physical Symptoms,Number of Sensory Sensitivities,Family History - Substance Abuse,Family History - Psychiatric Diagnosis,GAD Probabiliy - Gamma,SAD Probability - Gamma,Sample Weight,disorder
0,1,0,2,1,1,0,1,0,0,1,...,396,4,3,2,0,0,1.596957,0.067867,0.572227,GAD
1,0,1,2,0,3,1,0,1,0,0,...,1179,6,1,0,1,1,0.03265,0.068811,0.396157,None
2,1,1,3,2,1,0,0,0,0,0,...,0,9,4,1,0,0,0.005782,0.270583,0.686673,SAD
3,1,0,4,2,0,0,0,0,0,0,...,45,6,0,0,0,1,1.596957,0.068811,0.400559,None
4,1,0,5,2,3,0,1,0,1,1,...,1170,10,5,2,0,0,1.596957,47.355899,0.314725,Both


In [9]:
# splitting data into input and output (column 24, the last one, is 'disorder')
data = df.values
inputs = data[:, 0:24]
outputs = data[:, 24]
outputs

array(['GAD', 'None', 'SAD', 'None', 'Both', 'GAD', 'None', 'Both',
       'None', 'None', 'None', 'None', 'None', 'SAD', 'None', 'None',
       'None', 'Both', 'None', 'None', 'GAD', 'None', 'Both', 'None',
       'None', 'SAD', 'None', 'None', 'None', 'None', 'None', 'None',
       'None', 'None', 'None', 'None', 'None', 'GAD', 'None', 'Both',
       'None', 'None', 'None', 'None', 'SAD', 'None', 'None', 'Both',
       'None', 'None', 'None', 'None', 'SAD', 'None', 'None', 'None',
       'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None',
       'None', 'GAD', 'None', 'GAD', 'None', 'None', 'None', 'None',
       'None', 'None', 'None', 'None', 'None', 'GAD', 'None', 'None',
       'None', 'Both', 'None', 'None', 'GAD', 'None', 'None', 'None',
       'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None',
       'None', 'SAD', 'Both', 'None', 'None', 'None', 'None', 'None',
       'None', 'None', 'Both', 'None', 'GAD', 'GAD', 'Both', 'None',
       'None', 'None',

In [11]:
# split data into the target (disorder) and the input features
outcomes = df['disorder']
input_factors = df.drop(['disorder'], axis=1)
feature_names = input_factors.columns

In [12]:
# Train test groups
input_train, input_test, output_train, output_test = train_test_split(input_factors, outcomes, random_state=42)

In [13]:
# decision tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(input_train, output_train)
clf.score(input_test, output_test)

0.8888888888888888

In [14]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
randomforest = RandomForestClassifier(n_estimators=200)
randomforest = randomforest.fit(input_train, output_train)
randomforest.score(input_test, output_test)

0.875
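
Since most rows in the label array above are 'None', accuracy scores like these are easier to judge next to a majority-class baseline. The following is a minimal sketch on synthetic labels: the 70/10/10/10 class mix and the random features are illustrative assumptions, not measurements from this data set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical labels, mostly 'None'; the proportions are made up for illustration
labels = rng.choice(['None', 'GAD', 'SAD', 'Both'], size=400,
                    p=[0.7, 0.1, 0.1, 0.1])
features = rng.normal(size=(400, 5))  # placeholder inputs; the baseline ignores them

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, random_state=42, stratify=labels)

# Always predicts the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
majority_share = np.mean(y_test == 'None')
print(f'baseline accuracy: {baseline_accuracy:.3f}')
```

A model is only adding information to the extent that its test accuracy exceeds this baseline.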

In [15]:
# importance of inputs
sorted(zip(randomforest.feature_importances_, feature_names), reverse=True)

[(0.25944501714288476, 'SAD Probability - Gamma'),
 (0.20706905729518024, 'GAD Probabiliy - Gamma'),
 (0.06442671020690946, 'Number of Sleep Disturbances'),
 (0.052616544327673535, 'Number of Physical Symptoms'),
 (0.04987321374167448, 'Sample Weight'),
 (0.044610743573122454, 'Frequency Temper Tantrums'),
 (0.04282796933050219, 'Frequency Irritable Mood'),
 (0.041649281929327236, 'Number of Impairments'),
 (0.03414024268679358, 'Number of Type B Stressors'),
 (0.029743413088175904, 'Social Phobia'),
 (0.02727667797599973, 'Age'),
 (0.023454521565203442, 'Number of Type A Stressors'),
 (0.01826857336858881, 'Number of Siblings'),
 (0.016446734873755663, 'Sex'),
 (0.015134481641821582, 'Number of Sensory Sensitivities'),
 (0.011744261325292365, 'CD'),
 (0.010400215217335468, 'Race'),
 (0.010041626554683667, 'Number of Bio. Parents'),
 (0.00855521890103777, 'Poverty Status'),
 (0.007953328802695213, 'ODD'),
 (0.00749604643329259, 'Family History - Psychiatric Diagnosis'),
 (0.00737165817