This notebook is a scratchpad for trying out techniques on 'Iris_Classification.ipynb' separately, to avoid conflicts with the stable version that is currently running. I started it mainly because I cannot reproduce the results of Randal S. Olson's popular notebook, so it contains the analysis I did to understand why the results are not reproducible.

In [1]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

What I can't get my head around is the behaviour of GridSearchCV (sklearn.model_selection.GridSearchCV, formerly sklearn.grid_search.GridSearchCV), and whether the problem is related to the cross-validation sets generated by StratifiedKFold (sklearn.model_selection.StratifiedKFold, formerly in sklearn.cross_validation) being random.

In [2]:
iris_data = pd.read_csv('data/iris_data_clean.csv')
iris_data.head()

Unnamed: 0,sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [3]:
inputs = iris_data[['sepal_length_cm', 'sepal_width_cm','petal_length_cm', 'petal_width_cm']].values
targets = iris_data['class'].values

# Parenthesised print works under both Python 2 and Python 3
print(inputs[:5])
print(targets[:5])

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa']


In [91]:
from sklearn.model_selection import StratifiedKFold

# Build the same 10-fold split five times to check whether it is deterministic
cv_lists = []

for i in range(5):
    data_splits = []
    skf = StratifiedKFold(n_splits=10)    # shuffle=False by default
    for train, test in skf.split(inputs, targets):
        # Mask over all samples: 1 marks a test sample, 0 a training sample
        data_split = np.zeros(len(train) + len(test))
        data_split[test] = 1
        data_splits.append(data_split)    # append 10 folds
    data_splits = np.array(data_splits)    # convert into a ndarray
    cv_lists.append(data_splits)   # append current cv_split

cv_lists = np.array(cv_lists)
print("cv_lists shape: {}".format(cv_lists.shape))
first_cv_list = cv_lists[0]
print("first_cv_list.shape: {}".format(first_cv_list.shape))
# Check if all the cv_lists are actually equal when shuffle = False
bul = np.all(cv_lists == first_cv_list, axis=0)
print('cv_lists_equality: {}'.format(bul))
print('Are all the cv_lists same? : {}'.format(np.all(bul)))

cv_lists shape: (5, 10, 149)
first_cv_list.shape: (10, 149)
cv_lists_equality: [[ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 ..., 
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]]
Are all the cv_lists same? : True


So, it is confirmed that StratifiedKFold returns the same train-test splits on every call if <code>shuffle = False</code> (the default). If <code>shuffle</code> is set to <code>True</code>, the data is shuffled before splitting, and that shuffling is controlled by <code>random_state</code>: if it is <code>None</code>, numpy's global RNG (random number generator) is used, and if it is an <code>int</code>, that value seeds a pseudo-random generator. So the cross-validation splits are not the source of the varying results I got from GridSearchCV.
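
As a quick check of the <code>shuffle</code> / <code>random_state</code> behaviour described above, here is a minimal sketch. It uses a small made-up dataset (the arrays <code>X</code> and <code>y</code> and the helper <code>fold_test_indices</code> are hypothetical, not part of this notebook's data) and only compares the test indices of each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in data: 30 samples in 3 balanced classes
X = np.arange(60, dtype=float).reshape(30, 2)
y = np.repeat([0, 1, 2], 10)


def fold_test_indices(shuffle, seed=None):
    """Return the test indices of every fold as a list of lists."""
    skf = StratifiedKFold(n_splits=5, shuffle=shuffle, random_state=seed)
    return [test.tolist() for _, test in skf.split(X, y)]


# Without shuffling, two independent splitters give identical folds
unshuffled_same = fold_test_indices(False) == fold_test_indices(False)

# With shuffling and the same integer seed, the folds are also identical
seeded_same = fold_test_indices(True, seed=0) == fold_test_indices(True, seed=0)

# With different seeds, the folds will usually (not provably always) differ
seeds_differ = fold_test_indices(True, seed=0) != fold_test_indices(True, seed=1)

print("shuffle=False, repeated split identical: {}".format(unshuffled_same))
print("shuffle=True, same seed identical: {}".format(seeded_same))
print("shuffle=True, different seeds differ: {}".format(seeds_differ))
```

The first two comparisons are the ones that matter for reproducibility; the third is only indicative, since two seeds could in principle produce the same folds.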

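So the randomness must come from the decision tree itself: unless <code>random_state</code> is fixed, the features considered at each split are randomly permuted, so every fit can produce a different tree.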
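
To see the tree's own randomness in isolation, a rough sketch follows. The data is synthetic and the helper <code>split_features</code> is a hypothetical name introduced only for this illustration; it records which feature each node of the fitted tree splits on, so two fits can be compared structurally.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 4 features, binary label
rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)


def split_features(seed):
    """Fit a tree and return the feature index used at each node (-2 marks a leaf)."""
    tree = DecisionTreeClassifier(max_features=1, random_state=seed)
    tree.fit(X, y)
    return tree.tree_.feature.tolist()


# The same seed on the same data reproduces the same tree structure
same_seed_same_tree = split_features(0) == split_features(0)

# Count how many distinct structures ten different seeds produce
n_distinct = len({tuple(split_features(seed)) for seed in range(10)})

print("same seed gives same tree: {}".format(same_seed_same_tree))
print("distinct trees over 10 seeds: {}".format(n_distinct))
```

With <code>max_features=1</code> only one randomly chosen feature is considered per split, which makes the effect of the seed easy to see; with all features considered, the seed only matters when several splits are equally good.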

In [92]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

dtc = DecisionTreeClassifier()

# Same deterministic 10-fold splitter as above (shuffle=False)
cv = StratifiedKFold(n_splits=10)

# random_state is searched over as if it were a hyperparameter
parameter_grid = {'criterion': ['gini', 'entropy'],
                  'random_state': list(range(20)),
                  'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}

# Pass the splitter itself rather than a one-shot generator from cv.split(...)
grid_search = GridSearchCV(dtc, param_grid=parameter_grid, cv=cv)

grid_search.fit(inputs, targets)

print("Best Score: {}".format(grid_search.best_score_))
print("Best params: {}".format(grid_search.best_params_))
grid_search.best_estimator_

Best Score: 0.973154362416
Best params: {'max_features': 2, 'random_state': 6, 'criterion': 'gini', 'max_depth': 5}


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=2, max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=6,
            splitter='best')

Now I am sure that setting <code>random_state</code> to a fixed value will give me reproducible results, but what should this fixed value be? Depending on it, the <code>best_params_</code> found by <code>GridSearchCV</code> are different, so the choice of this value seems to matter a lot.
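
One way to get a feel for how much the seed matters is to take <code>random_state</code> out of the grid, fix it on the estimator, and repeat the search for a few seeds. The sketch below is only an illustration under assumptions: it uses scikit-learn's bundled iris data via <code>load_iris</code> as a stand-in for 'data/iris_data_clean.csv', and three arbitrary seeds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the bundled iris set, not the cleaned CSV
X, y = load_iris(return_X_y=True)

# Same grid as above, but without random_state
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [1, 2, 3, 4, 5],
              'max_features': [1, 2, 3, 4]}

cv = StratifiedKFold(n_splits=10)    # deterministic folds (shuffle=False)

results = []
for seed in range(3):    # arbitrary seeds, chosen only for illustration
    search = GridSearchCV(DecisionTreeClassifier(random_state=seed),
                          param_grid=param_grid, cv=cv)
    search.fit(X, y)
    results.append((seed, search.best_score_, search.best_params_))

for seed, score, params in results:
    print("seed={} best_score={:.4f} best_params={}".format(seed, score, params))
```

If the best scores are close to each other while the best parameters jump around, that suggests several parameter settings are roughly equally good and the seed is mostly breaking ties between them.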