### Decision Trees

### Decision trees (DT) are versatile machine learning algorithms that can perform both regression and classification tasks

#### Advantages
* Easy to Understand
* Useful in Data exploration
* Less data cleaning required
* Data type is not a constraint

#### Disadvantages
* Prone to overfitting, especially when the tree is allowed to grow deep
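
Overfitting is easy to observe by comparing training and test accuracy as the tree is allowed to grow. The sketch below is an illustration only: it uses a synthetic dataset from `make_classification` (not the ADME data), and the depth values and `random_state` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for real descriptors)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

scores = {}
for depth in (2, 4, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each depth setting
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores[depth])
```

A large gap between training and test accuracy for the unrestricted tree is the typical sign of overfitting; limiting `max_depth` usually narrows it.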

#### Objective: 
##### Build a model to predict "Drug Like" properties of a single compound.


#### Data: 
##### ADME descriptors for 3 libraries: AFRODB, Biofacquim, FDA

#### Endpoint:
##### Drug Like (Binary)
* 1 -> Drug Like
* 0 -> Not Drug Like

#### Descriptors:
##### ADME descriptors:
* Aromatic heavy atoms
* H-bond acceptors
* H-bond donors
* Heavy atoms
* Rotatable bonds
* Ali Log S
* Ali Solubility (mg/ml)

#### Method: 
##### Decision Trees

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
Data = pd.read_csv("data/Data_SVM.csv", sep = ",")

In [None]:
Data.head()

In [None]:
## Exploratory Data Analysis

In [None]:
#Identify Libraries
Data.Library.unique()

In [None]:
#Identify Target
Data["Drug Like"].unique()

In [None]:
#Plot descriptors
sns.boxplot(x="Library", y="MW", data=Data)

In [None]:
#Select numerical variables
numerical_data  = Data.select_dtypes(np.number)

In [None]:
numerical_data.head(6)

In [None]:
#statistical values
numerical_data.describe()

In [None]:
#Correlation
corr = numerical_data.corr()
corr.head()

In [None]:
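#Plot the heatmap and save it in the same cell
#(with the inline backend the figure is closed when the cell finishes,
#so saving from a later cell would write an empty image)
sns.heatmap(corr)
plt.savefig("correlacion_inicial.png")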

In [None]:
#Identify highly correlated variables
corr_var = ['XLOGP3', 'iLOGP', 'log Kp (cm/s)', 'Silicos-IT LogSw',
            'Ali Solubility (mol/l)', 'Ali Solubility (mg/ml)',
            'Consensus Log P', 'ESOL Solubility (mg/ml)', 'Unnamed: 54']

In [None]:
#Drop correlated variables
numerical_data = numerical_data.drop(columns=corr_var)
#Drop Target variable
numerical_data = numerical_data.drop("Drug Like", axis =1)

In [None]:
numerical_data.head()

In [None]:
#Save target column in a new DF
df_target = pd.DataFrame(Data['Drug Like'],columns=['Drug Like'])
df_target['Drug Like'].unique()

In [None]:
df_target

## Machine Learning Model

### Decision Tree

In [None]:
#Import
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

In [None]:
#split data
X_train, X_test, y_train, y_test = train_test_split(
    numerical_data, np.ravel(df_target), test_size=0.30, random_state=1992)

In [None]:
X_train.head()

In [None]:
y_train

In [None]:
#Assign Model
dtree = DecisionTreeClassifier(criterion='gini', 
                               max_depth=4,
                               #max_leaf_nodes=8, 
                               random_state = 1992)


In [None]:
#train model
dtree.fit(X_train, y_train)

### Tree visualization

In [None]:
from IPython.display import Image  
from io import StringIO  #sklearn.externals.six was removed from recent scikit-learn releases
from sklearn.tree import export_graphviz
import pydot

In [None]:
feature_names = numerical_data.columns
#feature_names
X_train.columns

In [None]:
dot_data = StringIO()  
export_graphviz(dtree, out_file=dot_data,feature_names=feature_names,
                        filled=True,rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
image_name = "DT.png"
graph[0].write_png(image_name)
display(Image(graph[0].create_png()))

Gini Index
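
With `criterion='gini'`, the `gini` value shown in each node of the tree is the Gini impurity of the samples that reach that node: one minus the sum of the squared class fractions. A pure node has a value of 0, and a binary node split 50/50 has a value of 0.5. A minimal sketch of the calculation (the helper name `gini_impurity` is hypothetical, not part of scikit-learn):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini_impurity([1, 1, 1, 1])    # all samples in one class
mixed = gini_impurity([1, 1, 0, 0])   # samples split evenly between two classes
print(pure, mixed)
```

At each split the tree chooses the feature and threshold that give the lowest weighted impurity in the two child nodes.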



### Predictions

In [None]:
#Write a function to select a specific compound
def test_compound(data, library, name):
    #Filter by library and compound name, then keep only the model features
    mask = (data["Library"] == library) & (data["Name"] == name)
    return data.loc[mask, numerical_data.columns]

In [None]:
test = test_compound(Data, "FDA", "Acetaminophen")

In [None]:
#Descriptors
test

#### Single prediction

In [None]:
dtree.predict(test)

### Evaluate the model

In [None]:
#make predictions
predictions = dtree.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
#accuracy
accuracy_score(y_test,predictions)

In [None]:
print(confusion_matrix(y_test,predictions))
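
Accuracy alone can hide how the errors are distributed between the two classes. In scikit-learn's `confusion_matrix`, rows are the true classes and columns the predicted classes, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives precision and recall for the Drug Like class from such a matrix. The counts are made-up numbers for illustration, not results from this model, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(cm):
    """Accuracy, precision and recall from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up counts for illustration only
example_cm = [[40, 10], [5, 45]]
accuracy, precision, recall = binary_metrics(example_cm)
print(accuracy, precision, recall)
```

With real predictions, `sklearn.metrics.classification_report(y_test, predictions)` reports the same quantities per class.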

### Random Forests
#### Now let's compare the decision tree model to a random forest.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
#Fix the seed so the forest is reproducible, as was done for the single tree
rfc = RandomForestClassifier(n_estimators=100, random_state=1992)


In [None]:
rfc.fit(X_train, y_train)

In [None]:
test = test_compound(Data, "FDA", "Acetaminophen")
rfc.predict(test)

In [None]:
rfc_pred = rfc.predict(X_test)

In [None]:
rfc_pred[:10]


In [None]:
print(confusion_matrix(y_test,rfc_pred))
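
A single train/test split can favour one model by chance, so a comparison is often more stable when it is averaged over several folds. The sketch below shows one way to do that with `cross_val_score`. It is an illustration under assumptions: the data are synthetic (from `make_classification`, not the ADME descriptors), and the fold count and seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for real descriptors)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: one score per fold
    cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, cv_scores[name].mean(), cv_scores[name].std())
```

On the notebook's own data, the same loop could be run with `numerical_data` and `np.ravel(df_target)` in place of `X` and `y`.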