# Speed Dating Analysis Part 2

Here we discuss the learning algorithms we applied to the data and how we measured the accuracy of each model.

First we import *scikit-learn* and load the data set, which is the preprocessed version that we saved as a pickle.

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.ensemble as ske
from sklearn import datasets, svm, model_selection, tree, metrics, preprocessing

# Read dataframe.
data = pd.read_pickle('../dating')

# Set target column.
target = "match"

## Decision Tree

The first learning model that we will apply is a simple Decision Tree. It works by splitting the data recursively. For every feature, the data is split on that feature's possible values, and the entropy of the target feature *match* is calculated within each resulting subset. The information gain of a split is the entropy of the data set before the split minus the weighted sum of the entropies of the new subsets, where each subset is weighted by its share of the rows. The feature with the highest information gain becomes the root node of our tree, and the data is split on it. In the next iteration the same calculation is repeated on each subset, without the feature that was just used. This continues until every path in the tree ends in a leaf. Note that the scikit-learn classifier used below builds binary splits and uses Gini impurity rather than entropy by default, so it differs in these details, but the principle of choosing the split that most reduces impurity is the same.
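The entropy and information gain calculation described above can be sketched in plain Python. This is only an illustration: the three small lists are made-up toy values (they are not taken from the speed dating data), and the helper names `entropy` and `information_gain` are our own, not part of scikit-learn.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy before the split minus the weighted entropy of the subsets."""
    total = len(labels)
    # Group the labels by the value the feature takes.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: two candidate features and the target.
samerace = [1, 1, 1, 0, 0, 0, 1, 0]
coin     = [1, 0, 1, 0, 1, 0, 1, 0]
match    = [1, 1, 0, 0, 0, 0, 1, 0]

print(round(information_gain(samerace, match), 3))  # 0.549
print(round(information_gain(coin, match), 3))      # 0.049
```

In this toy example the first feature separates the target far better than the second, so a tree built this way would split on it first.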

In [2]:
# Drop the target label, which we save separately.
X = data.drop([target], axis=1).values
y = data[target].values

# Split the data 80/20 into training and test sets.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

# Initialize a decision tree with a max_depth of 10.
clf_tree = tree.DecisionTreeClassifier(max_depth=10)

# Fit the model with the training data.
clf_tree.fit(X_train, y_train)

# Use the test data to calculate the score.
print("Decision Tree Score (No Cross-validation):", clf_tree.score(X_test, y_test))

Decision Tree Score (No Cross-validation): 0.896181384248


The accuracy is good, but a single train/test split can be lucky or unlucky. To make the result more meaningful, we apply cross-validation to the model.

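The scheme we use is closely related to K-fold Cross Validation and is known as shuffle-split, or repeated random sub-sampling. In our instance, it shuffles the data set and splits it 80/20 into training data and test data, and it repeats this 10 times with a fresh shuffle each time. Unlike true K-fold, where every row lands in exactly one test fold, the test sets of different iterations may overlap. The model is fitted and scored on every split, and we then calculate the mean accuracy across all of them.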
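To make the difference between the two splitting schemes concrete, here is a minimal plain-Python sketch. It is not scikit-learn's implementation; the helpers `k_fold_indices` and `shuffle_split_indices` are hypothetical names written for this illustration, and the data set is just the indices 0 to 9.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k):
    """Partition range(n_samples) into k disjoint test folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any leftover rows.
        stop = n_samples if i == k - 1 else start + fold_size
        folds.append((indices[:start] + indices[stop:], indices[start:stop]))
    return folds


def shuffle_split_indices(n_samples, n_splits, test_size, seed=0):
    """Draw n_splits independent random train/test splits."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((sorted(indices[n_test:]), sorted(indices[:n_test])))
    return splits


kfold = k_fold_indices(10, 5)
shuffled = shuffle_split_indices(10, 5, 0.2)

# Count how often each row index is used as test data under each scheme.
print("K-fold:       ", Counter(i for _, test in kfold for i in test))
print("Shuffle-split:", Counter(i for _, test in shuffled for i in test))
```

With K-fold every index is tested exactly once, whereas with shuffle-split some indices can be tested several times and others not at all.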

In [3]:
def unique_permutations_cross_val(X, y, model):

    # Create 10 random 80/20 train/test splits (shuffle-split cross-validation).
    shuffle_validator = model_selection.ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    
    # Calculate the score of the model after Cross Validation has been applied to it. 
    scores = model_selection.cross_val_score(model, X, y, cv=shuffle_validator)

    # Print the mean score and its standard deviation across the splits.
    print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))

We will create a function for our Decision Tree and run it with cross-validation.

In [4]:
def decision_tree(target, data):

    # Drop the target label, which we save separately.
    X = data.drop([target], axis=1).values
    y = data[target].values

    # Split the data 80/20 into training and test sets.
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

    # Initialize a decision tree with a max_depth of 10. All other parameters
    # are left at their scikit-learn defaults; min_impurity_split and presort
    # are omitted because recent scikit-learn releases no longer accept them.
    clf_tree = tree.DecisionTreeClassifier(criterion='gini', max_depth=10)

    # Fit the model with the training data.
    clf_tree.fit(X_train, y_train)

    # Use the test data to calculate the score.
    print("Decision Tree Score (No Cross-validation):", clf_tree.score(X_test, y_test))

    # Evaluate the decision tree again using cross-validation.
    unique_permutations_cross_val(X, y, clf_tree)

    return clf_tree

# Store the fitted tree under a new name so the function is not overwritten.
dec_tree = decision_tree(target, data)

Decision Tree Score (No Cross-validation): 0.897374701671
Accuracy: 0.9056 (+/- 0.01)


## Random Forest

The next learning model we will implement is the Random Forest algorithm. The problem with Decision Trees is that they tend to overfit the training data. The Random Forest algorithm counters this by adding randomness to the training process. The basic idea is that it takes many individual learners and combines them to create a stronger one. In essence, the Random Forest algorithm creates many Decision Trees, each trained on a random sample of the rows of the data set and considering a random subset of the features at each split. To make a prediction, the predictions of all the Decision Trees are combined, by majority vote or by averaging the predicted class probabilities. When implementing the Random Forest algorithm, we will evaluate it with cross-validation as well.
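The two ingredients described above, random resampling of the rows and combining the trees' predictions, can be sketched as follows. This is a simplified illustration with made-up values and our own helper names (`bootstrap_sample`, `majority_vote`); scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as each tree's training set."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class predicted by most trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))            # stand-in for the row indices of a data set

# Each tree would be trained on its own resampled copy of the rows.
sample = bootstrap_sample(rows, rng)
print(sample)

# Hypothetical predictions of five trees for one participant.
tree_votes = [1, 0, 1, 1, 0]
print(majority_vote(tree_votes))  # 1
```

Because every tree sees a slightly different data set, their individual errors are less correlated, which is why the combined prediction tends to be more stable than that of a single tree.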

In [5]:
def random_forest(target, data):

    # Drop the target label, which we save separately.
    X = data.drop([target], axis=1).values
    y = data[target].values

    # Random Forest Classifier with 50 trees. max_features='sqrt' is what the
    # former 'auto' setting meant for classifiers; min_impurity_split is omitted
    # because recent scikit-learn releases no longer accept it.
    clf_forest = ske.RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
    
    # Evaluate the random forest using cross-validation.
    unique_permutations_cross_val(X, y, clf_forest)

    # Fit the final model on the full data set.
    clf_forest.fit(X, y)
    
    return clf_forest

rnd_forest = random_forest(target, data)

Accuracy: 0.9203 (+/- 0.01)


## Gradient Boosting

The next model we will implement is Gradient Boosting. The idea behind a Gradient Boosting algorithm is that we take a weak learner and make it stronger by adding further learners that correct its mistakes. First a shallow Decision Tree is trained, and the loss between its predictions and the actual values is calculated using some loss function, for example the Mean Squared Error. The next tree is then trained on the errors of the current model, so the predictions that are far off receive the largest corrections and those that are already close receive only small ones. Each new tree is added to the ensemble, scaled by a learning rate, and the process is repeated so that the total loss shrinks step by step. For classification, as here, the loss is the log loss rather than the Mean Squared Error.
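The boosting loop can be illustrated with a small plain-Python sketch. To keep it short it makes several simplifying assumptions: it is a regression on one made-up feature with the Mean Squared Error as loss, and each weak learner is a one-split "stump". The helpers `fit_stump` and `gradient_boost` are our own names, and this is not how scikit-learn's `GradientBoostingClassifier` is implemented.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    # Every distinct value except the largest is a candidate threshold,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Repeatedly fit stumps to the residuals of the current model."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    predictions = [base] * len(xs)
    stumps, errors = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, errors


# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

predict, errors = gradient_boost(xs, ys)
print("MSE after first round:", errors[0])
print("MSE after last round: ", errors[-1])
```

The Mean Squared Error shrinks from round to round because each stump only has to explain what the previous ones got wrong.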

In [6]:
def gradient_boosting(target, data):

    # Drop the target label, which we save separately.
    X = data.drop([target], axis=1).values
    y = data[target].values

    # Gradient Boosting with 50 trees of depth 3. The loss is left at its
    # default (log loss, formerly called 'deviance'); min_impurity_split and
    # presort are omitted because recent scikit-learn releases no longer accept them.
    clf_gradient = ske.GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=50,
              random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

    # Evaluate the gradient boosting model using cross-validation.
    unique_permutations_cross_val(X, y, clf_gradient)

    # Fit the final model on the full data set.
    clf_gradient.fit(X, y)
    
    return clf_gradient

grd_boost = gradient_boosting(target, data)

Accuracy: 0.9236 (+/- 0.01)


All our models are now constructed, and Gradient Boosting scores highest (0.9236), although its lead over Random Forest (0.9203) is smaller than the standard deviation of about 0.01. It would also be neat to create a fake participant and see whether the model predicts a match for this particular participant.

In [7]:
# Extract all column headers.
header_list = list(data.columns.values)

# Get the mean values for all columns and store them in a list.
mean_values = [data[header].mean() for header in header_list]

# Map each column header to its respective mean.
header_dict = dict(zip(header_list, mean_values))

# Remove the target header, as it is not one of the input features.
del header_dict[target]

print(header_dict)

{'shar4_1': 9.2024063356685968, 'intel1_1': 17.698318214370971, 'fun': 5.5750527229872224, 'dec_o': 0.37168775364048701, 'race': 2.3728813559322033, 'sports': 5.6056084468720204, 'pf_o_fun': 15.24855887310378, 'shar1_1': 10.233034986820034, 'pid': 244.7923288325965, 'exphappy': 4.8323360459550022, 'music': 6.837192647409883, 'shar1_2': 10.888342126088105, 'samerace': 0.36070661255669612, 'attr2_1': 27.137091191215085, 'wave': 9.7849128670327055, 'sinc': 6.243430746561887, 'goal': 1.8694199092862258, 'attr': 5.3761150377772369, 'prob_o': 4.5220987654320979, 'amb2_1': 10.096976605395083, 'exercise': 5.492957746478873, 'tvsports': 4.0216042014800673, 'id': 7.8634355974692607, 'like': 5.331541569731848, 'intel1_2': 15.311486415193883, 'idg': 15.204941513487706, 'hiking': 4.9541656719980907, 'met': 0.90187554341075638, 'attr1_1': 19.71357885926103, 'fun1_1': 15.30945094294581, 'field_cd': 6.6911113769589665, 'pf_o_amb': 9.2083429742106535, 'amb3_2': 6.4692693220786071, 'pf_o_att': 19.732108

We created a dictionary of the data set where the *keys* are the column names and the *values* are the mean values of each column. The idea is to create an average participant, and because it is stored in a dictionary, we can alter any feature to whatever we want and see if the prediction changes.

In [8]:
# Extract the mean values in the same column order the model was trained on,
# rather than relying on the ordering of the dictionary.
extract_values = [header_dict[header] for header in header_list if header != target]

# Random Forest can predict match or no match if you give a value for every column.
# In this case the values represent the mean of every column.
predictions = rnd_forest.predict([extract_values])
print(predictions)

[0]


As you can see above, the random forest predicted class 0, i.e. no match. We can alter some features in order to try and achieve a prediction of 1, i.e. a match.
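Note that `predict` returns a class label (0 or 1), not a rate. If an actual probability is wanted, scikit-learn classifiers such as `RandomForestClassifier` also offer `predict_proba`. The following self-contained sketch shows the difference on made-up random data; it assumes NumPy and scikit-learn are installed, and the feature values and the rule used to generate the labels are purely hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 200 participants with 3 features each.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
# Made-up rule: a "match" whenever the first two features sum to more than 1.
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

participant = [[0.9, 0.8, 0.5]]

# Hard class label: 0 (no match) or 1 (match).
label = forest.predict(participant)
# One probability per class, in the order given by forest.classes_.
proba = forest.predict_proba(participant)

print("Predicted class:", label[0])
print("Class probabilities:", dict(zip(forest.classes_, proba[0])))
```

The predicted class is simply the class with the highest probability, so the probabilities show how confident the forest is in a prediction.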

In [9]:
# Key = Column Header,
# Value = Mean of Column,
# Storing it in a dictionary makes it easier to change values so
# as to see how the prediction changes.
header_dict['exercise'] = 10
header_dict['age'] = 23
header_dict['movies'] = 10
header_dict['reading'] = 10
header_dict['music'] = 10
header_dict['samerace'] = 1

# Extract the values, in training column order, after some have been changed.
extract_values = [header_dict[header] for header in header_list if header != target]

# Predict again. The vector now holds the column means, except for the
# features overridden above.
predictions = rnd_forest.predict([extract_values])
print(predictions)

[1]


We made an ideal candidate. He is 23 years old and likes movies, reading, music, and exercising. He is also of the same race as his partner. With these values the random forest predicts class 1, i.e. a match.