# Applying Decision Trees

___

## The Problem: Advertisement Banner Blocker
Here we apply decision trees, models that can be used for both classification and
regression, to build a banner advertisement blocker. We first train a single decision
tree and then an ensemble of decision trees.

The program predicts whether each of the images on a web page is an
advertisement or article content. Images that are classified as advertisements
could then be hidden using Cascading Style Sheets.

## The Data Set
The dataset for this example, the Internet Advertisements Data Set, can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/ad-dataset.zip
It contains data for 3,279 images, and we will use it to train a decision tree classifier.
The proportions of the classes are skewed; 459 of the images are advertisements and
2,820 are content. Decision tree learning algorithms can produce biased trees from data
with unbalanced class proportions; we will evaluate a model on the unaltered data set
before deciding whether it is worth balancing the training data by over- or under-sampling
instances.
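
To see how skewed the classes are, the short sketch below works out the class proportions from the counts quoted above. It also computes the accuracy of a trivial classifier that always predicts "content", which is a useful baseline to keep in mind when reading accuracy figures on unbalanced data. The variable names are illustrative only.

```python
# Class counts quoted in the text above
n_ads = 459
n_content = 2820
n_total = n_ads + n_content  # 3,279 images

# Fraction of instances in the positive (ad) class
ad_fraction = n_ads / n_total

# Accuracy obtained by always predicting the majority class ("content")
majority_baseline_accuracy = n_content / n_total

print('Fraction of ads: %.3f' % ad_fraction)
print('Majority-class baseline accuracy: %.3f' % majority_baseline_accuracy)
```

A classifier can therefore reach roughly 86 percent accuracy without identifying a single ad, which is one reason to look at precision and recall for the ad class as well.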

Information regarding the dataset can be found here:
http://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/ad.DOCUMENTATION 

The explanatory variables are the dimensions of the image, words from the
containing page's URL, words from the image's URL, the image's alt text, the image's
anchor text, and a window of words surrounding the image tag. The response variable
is the image's class. The explanatory variables have already been transformed into
feature representations. The first three features are real numbers that encode the width,
height, and aspect ratio of the images. The remaining features encode binary term
frequencies for the text variables.
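
As a rough illustration of what a binary term frequency is, the sketch below encodes a piece of text as a vector of ones and zeros, one per vocabulary term. The vocabulary, the sample alt text, and the helper name `binary_term_features` are all hypothetical; the real dataset ships with these features already computed.

```python
def binary_term_features(text, vocabulary):
    """Return 1 for each vocabulary term that occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

# Hypothetical vocabulary and alt text, for illustration only
vocabulary = ['click', 'here', 'banner', 'photo']
alt_text = 'Click here click now'

features = binary_term_features(alt_text, vocabulary)
print(features)
```

Note that "click" occurs twice in the sample text but is still encoded as 1: a binary term frequency records presence, not counts.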

We will grid search for the
hyperparameter values that produce the decision tree with the best F1 score,
and then evaluate the tree's performance on a test set.

## Managing our imports

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# train_test_split and GridSearchCV live in sklearn.model_selection;
# the older sklearn.cross_validation and sklearn.grid_search modules have been removed
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

## Exploring the Data using Pandas

In [2]:
df = pd.read_csv('datasets/ad.data', header=None)
# The last column holds the class label; every other column is an explanatory variable
response_variable_column = df[df.columns[-1]]
explanatory_variable_columns = list(df.columns[:-1])
# Encode advertisements ('ad.') as the positive class (1) and content as 0
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df[explanatory_variable_columns]



We encoded the advertisements as the positive class and the content as the negative
class. More than one quarter of the instances are missing at least one of the values
for the image's dimensions. These missing values are marked by whitespace and a
question mark. We could impute the missing values; for instance, we could replace the
missing height values with the average height value. Here we simply replace them with
negative one:

In [3]:
# Assign the result instead of using inplace=True, so that we never modify a slice of df
X = X.replace(to_replace=r' *\?', value=-1, regex=True)
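
As an alternative to the negative-one placeholder, the sketch below shows one way mean imputation could look in pandas. The miniature DataFrame and its column names are hypothetical stand-ins for the image-dimension columns; the approach is to convert the question-mark markers to NaN and then fill each gap with the mean of its own column.

```python
import pandas as pd

# Hypothetical miniature of two image-dimension columns, using the same
# "whitespace plus question mark" marker for missing values
toy = pd.DataFrame({
    'height': [' 125', '   ?', '  60', '  90'],
    'width': [' 125', ' 468', '   ?', ' 120'],
})

# Strip the padding, then turn anything non-numeric ('?') into NaN
numeric = toy.apply(lambda col: pd.to_numeric(col.str.strip(), errors='coerce'))

# Replace each missing value with the mean of the observed values in its column
imputed = numeric.fillna(numeric.mean())
print(imputed)
```

To avoid leaking information from the test set, the column means would normally be computed on the training instances only and then reused for the test instances.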


We split the data into training and test sets; by default, train_test_split holds out 25 percent of the instances for testing:

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

We create a pipeline whose only step is an instance of DecisionTreeClassifier, and we set
its criterion keyword argument to 'entropy' so that the tree is built using the information
gain heuristic:

In [5]:
pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])
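
To make the information gain heuristic concrete, the sketch below computes entropy and information gain for two hypothetical splits of eight labels. The helper names `entropy` and `information_gain` are illustrative and are not part of scikit-learn, which performs an equivalent computation internally when criterion='entropy'.

```python
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    probabilities = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probabilities)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node holding four ads (1) and four content images (0)
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# A split that only partly separates the classes
weak_gain = information_gain(parent, [[1, 1, 1, 0], [1, 0, 0, 0]])

# A split that separates the classes perfectly
perfect_gain = information_gain(parent, [[1, 1, 1, 1], [0, 0, 0, 0]])

print('weak split: %.3f bits, perfect split: %.3f bits' % (weak_gain, perfect_gain))
```

The tree-building procedure prefers the test with the larger gain, so here it would choose the split that separates the classes perfectly.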

Next, we specify the hyperparameter space for the grid search:

In [6]:
parameters = {
    'clf__max_depth': (150, 155, 160),
    # Recent scikit-learn releases require an integer min_samples_split of at least 2
    'clf__min_samples_split': (2, 3, 4),
    'clf__min_samples_leaf': (1, 2, 3)
}

We set GridSearchCV() to maximize the model's F1 score:

In [8]:
# cv=3 reproduces the 3-fold cross-validation reported in the output below
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
# Evaluate the best estimator found by the search on the held-out test set
predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))

[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:  5.6min finished


Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best score: 0.874
Best parameters set:
	clf__max_depth: 160
	clf__min_samples_leaf: 1
	clf__min_samples_split: 2
             precision    recall  f1-score   support

          0       0.99      0.98      0.98       702
          1       0.87      0.95      0.91       118

avg / total       0.97      0.97      0.97       820
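
The log line "Fitting 3 folds for each of 27 candidates, totalling 81 fits" follows directly from the size of the grid: every combination of hyperparameter values is one candidate, and each candidate is fitted once per fold. The sketch below reproduces that arithmetic with a hypothetical helper, `expand_grid`; only the number of values per hyperparameter matters for the count, so the values themselves are illustrative.

```python
from itertools import product

def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

# Three hyperparameters with three values each, mirroring the shape of the
# grid above (the specific values are illustrative)
grid = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3, 4),
    'clf__min_samples_leaf': (1, 2, 3),
}
n_folds = 3  # taken from the log output above

candidates = list(expand_grid(grid))
n_fits = len(candidates) * n_folds
print('%d candidates, %d fits' % (len(candidates), n_fits))
```

Because the number of fits grows multiplicatively with each added hyperparameter, it is worth keeping the grid small when a single fit is slow.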



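For the ad class, precision is 0.87: approximately 87 percent of the images that the classifier predicted to be ads were truly ads. Recall is 0.95: the classifier found approximately 95 percent of the true ads in the test set.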
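
The columns of the report can be recomputed by hand. The sketch below defines precision, recall, and F1 from first principles on a small set of hypothetical labels, and then checks that the precision and recall reported for the ad class are consistent with the reported F1 score. The function names and the toy labels are illustrative; scikit-learn provides its own implementations in sklearn.metrics.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)

def f1_score_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: three true ads, three predicted ads, two in common
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_precision, toy_recall = precision_recall(toy_true, toy_pred)

# Precision and recall reported above for the ad class
reported_f1 = f1_score_from(0.87, 0.95)
print('F1 for the ad class: %.2f' % reported_f1)
```

The recomputed value rounds to the 0.91 shown in the f1-score column for class 1.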

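Here we trained a decision tree by recursively splitting the training instances into subsets that reduce our uncertainty about the value of the response variable. This is the idea behind the ID3 algorithm; scikit-learn's DecisionTreeClassifier implements an optimized version of the related CART algorithm, and setting criterion='entropy' makes it choose splits by information gain.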

## Improving our model's performance using Random Forests

Ensemble learning methods combine a set of models to produce an estimator that
has better predictive performance than its individual components. A Random Forest
is a collection of decision trees that have been trained on randomly selected subsets
of the training instances and explanatory variables. Random forests usually make
predictions by returning the mode or mean of the predictions of their constituent
trees; scikit-learn's implementations return the mean of the trees' predictions.

Random forests are less prone to overfitting than decision trees. An overfit model fits the
training data almost perfectly but predicts poorly on the test set. Because no single
tree can learn from all of the instances and explanatory variables, no single tree can
memorize all of the noise in the representation.
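
The two ingredients described above, random subsets for training and combined predictions, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the helper names are hypothetical, and the per-tree predictions are made-up numbers rather than the output of fitted trees.

```python
import random
from collections import Counter

def bootstrap_indices(n_instances, rng):
    """Sample n_instances row indices with replacement."""
    return [rng.randrange(n_instances) for _ in range(n_instances)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), k))

def mode_vote(tree_labels):
    """Combine hard class predictions by returning the most common label."""
    return Counter(tree_labels).most_common(1)[0][0]

def mean_vote(tree_probabilities, threshold=0.5):
    """Combine predicted ad probabilities by averaging, then thresholding."""
    mean_probability = sum(tree_probabilities) / len(tree_probabilities)
    return int(mean_probability >= threshold)

rng = random.Random(0)
rows = bootstrap_indices(10, rng)             # training rows for one tree
columns = random_feature_subset(8, 3, rng)    # explanatory variables for one tree

# Hypothetical ad probabilities from three trees for a single image
probabilities = [0.9, 0.4, 0.4]
hard_labels = [int(p >= 0.5) for p in probabilities]

print('mode of labels:', mode_vote(hard_labels))
print('mean of probabilities:', mean_vote(probabilities))
```

In this example the two rules disagree: two of the three trees vote "content", but the one confident tree pulls the mean probability above 0.5, so averaging predicts "ad".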

Here we will update our ad blocker's classifier to use a random forest.

In [9]:
# scikit-learn's consistent estimator API lets us swap the DecisionTreeClassifier for an
# instance of RandomForestClassifier. As in the previous example, we will grid search over
# its hyperparameters. First, import the RandomForestClassifier class from the ensemble module:

from sklearn.ensemble import RandomForestClassifier

In [10]:
# Replace the DecisionTreeClassifier in the pipeline with an instance of RandomForestClassifier and update the hyperparameter space:

pipeline = Pipeline([
    ('clf', RandomForestClassifier(criterion='entropy'))
])
parameters = {
    'clf__n_estimators': (5, 10, 20, 50),
    'clf__max_depth': (50, 150, 250),
    # Recent scikit-learn releases require an integer min_samples_split of at least 2
    'clf__min_samples_split': (2, 3, 4),
    'clf__min_samples_leaf': (1, 2, 3)
}

In [None]:
# Grid Search to find the values of the hyperparameters that produce the random forest with the best predictive performance.

# cv=3 keeps the cross-validation setup the same as in the decision tree search
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

# Evaluate the best random forest on the held-out test set
predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))

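Replacing the single decision tree with a random forest gives a precision of 0.97 and a
recall of 0.83 for ads. Compared with the single tree's results above (precision 0.87,
recall 0.95), the forest makes far fewer false-positive ad predictions but misses more of
the true ads. Because train_test_split draws a random split, the exact figures vary from
run to run.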

## Summary
We trained a decision tree classifier to predict whether or not an image on a web page
is a banner advertisement. We then discussed ensemble learning methods,
which combine the predictions from a set of models to produce an estimator with
better predictive performance. Finally, we replaced the single decision tree with a random
forest in an effort to improve the ad blocker's predictions.