#What is a Decision Tree?
![Decision_Tree](https://raw.githubusercontent.com/PolinaRus/My_Projects/master/Decision%20Tree.png)

#Decision Tree
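A decision tree is a straightforward machine learning model: it predicts a label by applying a sequence of if/else tests to feature values. Because the model is itself a tree, it can be inspected with well-established tree visualization techniques.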

#Binary Classification
**Binary Classification** is the task of predicting a binary label. E.g., is an email spam or not spam? Should I show this ad to this user or not? Will it rain tomorrow or not? This section demonstrates algorithms for making these types of predictions.

# Data

The Country Life Expectancy dataset we are going to use contains information about 152 countries and their average life expectancy.
We will use this information to predict whether the average life expectancy of a country is <65 or >=65. The dataset is rather clean, and contains both numeric and categorical variables.

**Attribute Information:**

* **Country:** Country name
* **Code:** Country code
* **Continent:** Continent in which the country is located
* **Religion:** Main religion practiced in the country
* **Status:** Whether the country is developed, developing or poor
* **Population:** 2012 population of the country
* **Labor Force:** 2012 labor force of the country
* **GDP:** 2012 GDP per capita of the country
* **Urbanization:** 2012 urbanization % of the country
* **Literacy:** 2012 literacy % of the country
* **Population Growth Rate:** 2012 population growth rate % of the country
* **Below Poverty Line:** 2012 population below poverty line % of the country
* **Median Age:** 2012 median age of the country
* **Life Expectancy:** 2012 average life expectancy of the country
* **Age:** Whether the country's average life expectancy is below or above 65

**Target/Label:** < 65, >= 65
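
The relationship between the Life Expectancy column and the Age label can be sketched in a few lines of plain Python. This is only an illustration: the helper name `age_label`, the exact label strings, and the sample values are assumptions, not taken from the dataset file.

```python
# Hypothetical helper: derive the Age label from a life-expectancy value.
THRESHOLD = 65.0  # cut-off stated in the dataset description


def age_label(life_expectancy):
    """Return "< 65" or ">= 65" for a life expectancy given in years."""
    return ">= 65" if life_expectancy >= THRESHOLD else "< 65"


# Made-up values, for illustration only.
samples = [58.3, 65.0, 81.2]
labels = [age_label(v) for v in samples]
print(labels)
```

Since the label is a direct function of Life Expectancy, any model that is given that column as a feature can reproduce the label exactly.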

# Load Data
In this example, we will read the Life Expectancy dataset into a SQL table using the CSV data source for Spark, declaring the column names and types explicitly.

In [5]:
%sh 
mkdir -p life_pred
curl 'https://raw.githubusercontent.com/PolinaRus/CIS_8795/master/Life_Expectancy.csv' > life_pred/life_expectancy.csv
ls /databricks/driver/life_pred

In [6]:
%fs 
ls file:/databricks/driver/life_pred

In [7]:
%sql DROP TABLE IF EXISTS life_expectancy

In [8]:
%sql

CREATE TABLE life_expectancy (
  Country STRING,
  Code STRING,
  Continent STRING,
  Religion STRING,
  Status STRING,
  Population DOUBLE,
  Labor_Force DOUBLE,
  GDP DOUBLE,
  Urbanization DOUBLE,
  Literacy DOUBLE,
  Population_Growth_Rate DOUBLE,
  Below_Poverty_Line DOUBLE,
  Median_Age DOUBLE,
  Life_Expectancy DOUBLE,
  Age STRING)
USING com.databricks.spark.csv
OPTIONS (path "file:/databricks/driver/life_pred/life_expectancy.csv", header "true")

In [9]:
my_data = spark.table("life_expectancy")
cols = my_data.columns

In [10]:
display(my_data)

# Process Data
For Decision Tree analysis we need to convert the categorical variables in the dataset into numeric variables. There are two ways we can do this.

* **Category Indexing.**
This assigns a numeric value to each category from {0, 1, 2, ..., numCategories-1}. It introduces an implicit ordering among the categories, so it is more suitable for ordinal variables (e.g., Poor: 0, Average: 1, Good: 2).

* **One-Hot Encoding.**
This converts categories into binary vectors with at most one nonzero value (e.g., Blue: [1, 0], Green: [0, 1], Red: [0, 0]).
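
The two encodings can be sketched in plain Python as follows. This is a simplified illustration, not Spark's implementation: the function names `string_index` and `one_hot` are hypothetical, and the sketch assumes indices are assigned by descending frequency and that the last category is dropped (encoded as all zeros), which matches the Blue/Green/Red example above.

```python
from collections import Counter


def string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: i for i, category in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a binary vector; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0] * size
    if index < size:
        vec[index] = 1
    return vec


colors = ["Blue", "Green", "Blue", "Red", "Green", "Blue"]
mapping = string_index(colors)
encoded = {c: one_hot(i, len(mapping)) for c, i in mapping.items()}
print(mapping)
print(encoded)
```

Dropping the last category loses no information, because a row of all zeros can only mean the remaining category.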

Here, we will use a combination of StringIndexer and OneHotEncoder to convert the categorical variables. The OneHotEncoder will return a SparseVector.

Since we will have more than one stage of feature transformations, we use a Pipeline to tie the stages together. This simplifies our code.

In [12]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

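# Columns holding string categories
categoricalColumns = ["Country", "Code", "Continent", "Religion", "Status"]
stages = []  # pipeline stages, filled in order
for categoricalCol in categoricalColumns:
  # Map each category string to a numeric index
  stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
  # Turn the index into a one-hot encoded sparse vector
  encoder = OneHotEncoder(inputCol=categoricalCol+"Index", outputCol=categoricalCol+"classVec")
  # Add both stages; they run when the Pipeline is fit
  stages += [stringIndexer, encoder]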

The above code basically indexes each categorical column using the StringIndexer, and then converts the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row.

We use the StringIndexer again here to encode our labels to label indices.

In [14]:
label_stringIdx = StringIndexer(inputCol = "Age", outputCol = "label")
stages += [label_stringIdx]

Next, we will use the VectorAssembler to combine all the feature columns into a single vector column. This will include both the numeric columns and the one-hot encoded binary vector columns in our dataset.

In [16]:
numericCols = ["Population", "Labor_Force", "GDP", "Urbanization", "Literacy", "Population_Growth_Rate","Below_Poverty_Line","Median_Age", "Life_Expectancy"]
# A list comprehension works on both Python 2 and 3 (map() returns an iterator on Python 3)
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
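
Conceptually, assembling features means concatenating every input column into one flat vector per row. The plain-Python sketch below illustrates the idea only; the function `assemble` and the sample row are hypothetical, and Spark's VectorAssembler additionally handles sparse vectors and column metadata.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns of a row into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector column (e.g. a one-hot encoding): append every element.
            features.extend(float(v) for v in value)
        else:
            # Plain numeric column: append the single value.
            features.append(float(value))
    return features


# Made-up row: a one-hot vector for Status plus two numeric columns.
row = {"StatusclassVec": [1, 0], "GDP": 12000.0, "Median_Age": 29.5}
features = assemble(row, ["StatusclassVec", "GDP", "Median_Age"])
print(features)
```

The order of the input columns fixes the position of each feature in the output vector, which matters later when interpreting tree splits by feature index.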

We finally run our stages as a Pipeline. This puts the data through all of the feature transformations we described in a single call.

In [18]:
pipeline = Pipeline(stages=stages)

my_Model = pipeline.fit(my_data)
df = my_Model.transform(my_data)

selectedcols = ["label", "features"] + cols
df = df.select(selectedcols)
display(df)

Randomly split the data into training and test sets. Set a seed for reproducibility.

In [20]:
(trainingData, testData) = df.randomSplit([0.7, 0.3], seed = 100)
print(trainingData.count())
print(testData.count())
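
A seeded random split can be sketched in plain Python as below. The sketch assumes each row is assigned independently by drawing one random number, which is why the resulting sizes are only approximately 70/30; the function `random_split` is hypothetical and will not reproduce the exact rows Spark selects for the same seed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw left over by rounding.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(152))
train, test = random_split(rows, [0.7, 0.3], seed=100)
print(len(train), len(test))
```

Running the function again with the same seed returns the same split, which is the property the seed is there to provide.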

#Fit and Evaluate Models

We are now ready to try out some of the Binary Classification algorithms available in the Pipelines API.

Out of these algorithms, the following also support multiclass classification with the Python API:

* Decision Tree Classifier
* Random Forest Classifier

These are the general steps we will take to build our models:

* Create an initial model using the training set
* Tune parameters with a ParamGrid and 5-fold Cross Validation
* Evaluate the best model obtained from the Cross Validation using the test set

We will be using the BinaryClassificationEvaluator to evaluate our models. The default metric used here is areaUnderROC.
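
To make the metric concrete, the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; the function `area_under_roc` and the sample scores are illustrative assumptions, and Spark computes the value from the ROC curve itself rather than by comparing pairs.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)
```

A value of 1.0 means every positive example is scored above every negative one, while 0.5 corresponds to random ranking.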

# Decision Trees
The Decision Trees algorithm is popular because it handles categorical data and works out of the box with multiclass classification tasks.
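
At each node, a decision tree picks the feature and threshold that best separate the classes. The plain-Python sketch below shows that search for a single numeric feature using Gini impurity. It is a simplified illustration with hypothetical function names and made-up data; Spark's implementation bins continuous features (see `maxBins`) rather than trying every value.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / float(len(labels))
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    n = float(len(values))
    # Splitting at the maximum value would leave the right side empty, so skip it.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best


# Made-up median ages and labels (1 means life expectancy >= 65).
median_age = [17.0, 19.5, 24.0, 31.0, 38.5, 42.0]
label = [0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(median_age, label)
print(threshold, impurity)
```

An impurity of 0.0 means both sides of the split contain a single class, so no further splitting is needed below that node.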

In [23]:
from pyspark.ml.classification import DecisionTreeClassifier

# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)

# Train model with Training Data
dtModel = dt.fit(trainingData)

We can extract the number of nodes in our decision tree as well as the tree depth of our model.

In [25]:
print("numNodes = %d" % dtModel.numNodes)
print("depth = %d" % dtModel.depth)
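
As a sanity check on these numbers, a binary tree of a given depth has an upper bound on its node count. The small sketch below computes that bound, assuming the convention that a tree consisting of a single leaf has depth 0; the helper `max_nodes` is hypothetical.

```python
def max_nodes(depth):
    """Upper bound on the number of nodes in a binary tree whose root is at depth 0."""
    return 2 ** (depth + 1) - 1


for d in [1, 2, 3]:
    print("depth = %d, at most %d nodes" % (d, max_nodes(d)))
```

With `maxDepth=3`, the fitted tree can therefore have at most 15 nodes; a smaller count means some branches stopped splitting early.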

Make predictions on test data using the Transformer.transform() method.

In [27]:
predictions = dtModel.transform(testData)

Print the schema of the prediction table

In [29]:
predictions.printSchema()

View model's predictions and probabilities of each prediction class

In [31]:
selected = predictions.select("label", "prediction", "probability", "Life_Expectancy", "Status")
display(selected)

We will evaluate our Decision Tree model with BinaryClassificationEvaluator.

In [33]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

Create ParamGrid for Cross Validation

In [35]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1,2,6,10])
             .addGrid(dt.maxBins, [20,40,80])
             .build())
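
The grid above is the Cartesian product of the two value lists. The plain-Python sketch below enumerates the same combinations to show how many candidate settings the search covers; the dictionary layout is an illustrative assumption, not the structure ParamGridBuilder returns.

```python
from itertools import product

max_depth_values = [1, 2, 6, 10]
max_bins_values = [20, 40, 80]

# One dictionary per (maxDepth, maxBins) combination.
param_maps = [
    {"maxDepth": d, "maxBins": b}
    for d, b in product(max_depth_values, max_bins_values)
]
print(len(param_maps))
print(param_maps[0], param_maps[-1])
```

Four depths times three bin settings gives 12 candidate models, so adding values to either list grows the search multiplicatively.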

Create 5-fold CrossValidator and run cross validation

In [37]:
cv = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
cvModel = cv.fit(trainingData)
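
In k-fold cross validation, the training data is divided into k parts, and each part serves once as the validation set while the rest is used for fitting. The sketch below shows the index bookkeeping in plain Python. It assigns rows to folds by a fixed stride for readability, whereas Spark assigns them randomly; the function `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each of num_folds folds."""
    indices = list(range(n_rows))
    for fold in range(num_folds):
        # Every num_folds-th row, offset by the fold number, is held out.
        validation = indices[fold::num_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


folds = list(k_fold_indices(n_rows=10, num_folds=5))
for train, validation in folds:
    print(validation, len(train))

# 12 parameter combinations x 5 folds: models fitted during the search.
total_fits = 12 * 5
print(total_fits)
```

With the grid defined earlier, the search fits 60 models before the best parameter combination is refit on the full training set.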

In [38]:
print("numNodes = %d" % cvModel.bestModel.numNodes)
print("depth = %d" % cvModel.bestModel.depth)

Use the test set here so we can measure how well our model performs on new data. cvModel uses the best model found during Cross Validation.

In [40]:
predictions = cvModel.transform(testData)

Evaluate the best model

In [42]:
evaluator.evaluate(predictions)

View Best model's predictions and probabilities of each prediction class

In [44]:
selected = predictions.select("label", "prediction", "probability", "Age", "Status")
display(selected)

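Evaluate the accuracy of the Decision Tree model

In [46]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# BinaryClassificationEvaluator reports areaUnderROC, so use a separate evaluator for accuracy
accuracyEvaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = accuracyEvaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Accuracy = %g" % accuracy)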
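
Accuracy and areaUnderROC are different metrics: the default metric of BinaryClassificationEvaluator is areaUnderROC, as noted earlier, whereas accuracy is simply the fraction of rows whose predicted class equals the true class. The plain-Python sketch below computes accuracy from label/prediction pairs; the function and the sample values are made up for illustration.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose predicted class equals the true class."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))


# Made-up labels and predictions, for illustration only.
labels = [1.0, 1.0, 0.0, 1.0, 0.0]
predictions = [1.0, 0.0, 0.0, 1.0, 0.0]
acc = accuracy(labels, predictions)
print("Accuracy = %g" % acc)
print("Test Error = %g" % (1.0 - acc))
```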

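As shown above, our Decision Tree model achieves a perfect score on the test set. Keep in mind that Life_Expectancy is among the input features and the Age label is derived directly from it, so the tree can separate the two classes with a split on that column alone. The result therefore says little about how well the other attributes predict life expectancy.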