<a href="https://colab.research.google.com/github/Freemanlabs/giz-rwanda-ai-training/blob/master/intro-to-ai/04_decision_trees/decision_tree_classifier.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Decision Tree Classifier in Scikit-learn

## Loading Data

Let's first load the Pima Indians Diabetes dataset with pandas' `read_csv()` function.

In [None]:
# load dataset
import pandas as pd

# diabetes_df = pd.read_csv('data/diabetes_clean.csv')
diabetes_df = pd.read_csv('https://raw.githubusercontent.com/Freemanlabs/giz-rwanda-ai-training/master/intro-to-ai/04_decision_trees/data/diabetes_clean.csv')
diabetes_df.head()

## Feature Engineering

### Feature Selection

Here, you divide the columns into two kinds of variables: the dependent variable (the target) and the independent variables (the features).

In [None]:
# split dataset into features and target variable
X = diabetes_df.drop(["diabetes"], axis=1).values # Features
y = diabetes_df["diabetes"].values # Target variable
print(X.shape, y.shape)
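As a small illustration of the same feature/target split, here is a sketch on a hand-made frame. Only the `diabetes` column name comes from the notebook; `toy_df`, its other column names, and its values are hypothetical stand-ins. It also prints the class balance, which is worth checking before choosing an evaluation metric.

```python
import pandas as pd

# Tiny hand-made frame standing in for diabetes_df (values are illustrative only).
toy_df = pd.DataFrame({
    "glucose": [148, 85, 183, 89],      # hypothetical feature column
    "bmi": [33.6, 26.6, 23.3, 28.1],    # hypothetical feature column
    "diabetes": [1, 0, 1, 0],           # target column, as in the notebook
})

X_toy = toy_df.drop(columns=["diabetes"]).values  # 2-D array: one row per sample
y_toy = toy_df["diabetes"].values                 # 1-D array: one label per sample

print(X_toy.shape, y_toy.shape)
# Share of each class in the target
print(toy_df["diabetes"].value_counts(normalize=True))
```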

### Splitting Data

To measure how well a model generalizes, hold out part of the data: train on one subset and evaluate on another that the model has never seen.

Let's split the dataset with `train_test_split()`. You pass it the features, the target, and the test set size; `random_state` fixes the shuffle so that the split is reproducible.

In [None]:
# Split dataset into training set and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) # 80% training and 20% test
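When the classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. A minimal sketch of one way to guard against that, using the `stratify` argument of `train_test_split()` on synthetic stand-in data (the arrays below are made up, not the diabetes data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 30% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio roughly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive share (train, test):", y_tr.mean(), y_te.mean())
```

The notebook's own split does not stratify; whether that matters depends on how imbalanced the `diabetes` column is.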

### Scaling Data

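Standardize the features by removing the mean and scaling to unit variance. The scaler is fit on the training set only and then applied to the test set, so that no test-set statistics leak into training.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance). Logistic regression benefits from it; decision trees and random forests split on thresholds and are largely unaffected by feature scale.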

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
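To make the fit/transform distinction concrete, here is a sketch on synthetic data (the arrays and the `scaler_demo` name are illustrative assumptions). It checks that the scaled training data has zero mean and unit variance, and that the test data is scaled with the training statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a non-zero mean and non-unit spread.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler_demo.transform(test)        # reuses the training statistics

# Training columns are centered and have unit variance after scaling.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
# Test columns are only approximately standardized, by design.
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))

# The transform is (x - training mean) / training standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(manual, test_scaled))
```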

## Modelling

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import roc_auc_score

### Building Logistic Regression Model

Let's create a logistic regression model using Scikit-learn.

In [None]:
# Create the logistic regression classifier
lr_clf = LogisticRegression()

# Train the classifier on the training set
lr_clf = lr_clf.fit(X_train, y_train)

# Predict class labels for the test set
lr_pred = lr_clf.predict(X_test)

#### Evaluating the Model

Let's estimate how well the model separates patients with and without diabetes on the test set, using the area under the ROC curve (AUC).

In [None]:
print("Logistic Regression AUC: ", roc_auc_score(y_test, lr_pred))
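The cell above passes hard 0/1 predictions to `roc_auc_score()`, which reduces the ROC curve to a single operating point. The metric is usually computed from continuous scores such as the positive-class probability from `predict_proba()`. A sketch of the difference, on synthetic data from `make_classification` (an assumption for self-containment, not the diabetes data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for a binary classification problem.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: only one threshold (0.5) is evaluated.
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from probabilities: every threshold contributes to the curve.
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels:       ", auc_labels)
print("AUC from probabilities:", auc_scores)
```

The same pattern applies to the decision tree and random forest, since both also expose `predict_proba()`.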

### Building Decision Tree Model

Let's create a decision tree model using Scikit-learn.

In [None]:
# Create the decision tree classifier
tree_clf = DecisionTreeClassifier()

# Train the classifier on the training set
tree_clf = tree_clf.fit(X_train, y_train)

# Predict class labels for the test set
tree_pred = tree_clf.predict(X_test)

#### Evaluating the Model

Let's estimate how well the decision tree separates patients with and without diabetes on the test set, again using AUC.

In [None]:
print("Decision Tree AUC: ", roc_auc_score(y_test, tree_pred))
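A `DecisionTreeClassifier()` with default settings grows until its leaves are pure, which tends to memorize the training set. The sketch below compares such a tree with one limited by `max_depth=3` and prints the shallow tree's rules with `export_text()`. The data is synthetic and the feature names `f0` to `f3` are hypothetical; the depth of 3 is an arbitrary choice for illustration, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

# Unrestricted tree versus a depth-limited tree.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train acc={tree.score(X_tr, y_tr):.3f}, "
        f"test acc={tree.score(X_te, y_te):.3f}"
    )

# Human-readable if/else rules learned by the shallow tree.
rules = export_text(shallow_tree, feature_names=feature_names)
print(rules)
```

A large gap between training and test accuracy for the deep tree is the usual sign of overfitting.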

### Building Random Forest Model

Let's create a random forest model using Scikit-learn.

In [None]:
# Create the random forest classifier
rf_clf = RandomForestClassifier()

# Train the classifier on the training set
rf_clf = rf_clf.fit(X_train, y_train)

# Predict class labels for the test set
rf_pred = rf_clf.predict(X_test)

#### Evaluating the Model

Let's estimate how well the random forest separates patients with and without diabetes on the test set, again using AUC.

In [None]:
print("Random Forest AUC: ", roc_auc_score(y_test, rf_pred))
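Each model above is scored on a single train/test split, so small differences between them may come down to the split itself. One way to get a steadier comparison is k-fold cross-validation with `cross_val_score()` and `scoring="roc_auc"`. The sketch below assumes synthetic data and arbitrary settings (`cv=5`, `n_estimators=50`); it is an illustration, not the notebook's evaluation protocol.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

mean_auc = {}
for name, model in models.items():
    # 5-fold cross-validated AUC, computed from each model's scores.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    mean_auc[name] = scores.mean()
    print(f"{name}: mean AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```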