    GRID-SEARCH
    
    Grid search is a technique for improving a model's performance on a given dataset
    by searching through a space of parameters and keeping the combination that gives the
    best score.
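
    To make "a space of parameters" concrete, here is a minimal sketch that only enumerates
    the candidates of a small grid using sklearn's ParameterGrid. The two parameter names and
    their values are illustrative assumptions, not a recommended grid.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space: two hyperparameters with a few candidate values each
param_grid = {
    "max_depth": [3, 6, 12],
    "max_leaf_nodes": [4, 8],
}

# ParameterGrid yields one dict per combination (3 x 2 = 6 here)
candidates = list(ParameterGrid(param_grid))

for params in candidates:
    print(params)

print("Number of candidates:", len(candidates))
```

    Grid search then amounts to scoring a model once per candidate and keeping the best one.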

In [6]:
import numpy as np

    Sample illustration:
    
    Let's train a model that identifies the species of an iris flower, using the iris
    dataset built into sklearn.

In [3]:
from sklearn.datasets import load_iris  # Load iris dataset
from sklearn.model_selection import train_test_split  # Split data into train and test sets
from sklearn.model_selection import cross_val_score  # Cross-validated scores

# These are the models we want to train
from sklearn.tree import DecisionTreeClassifier  # Decision Tree Model
from sklearn.linear_model import LogisticRegression  # Logistic Regression Model
from sklearn.svm import SVC  # Support Vector Classifier

In [2]:
data, target = load_iris(return_X_y=True)

In [8]:
# Decision Tree

tree = DecisionTreeClassifier()
tree_scores = cross_val_score(tree, data, target, cv=5)

print("Model Accuracies: {}".format(tree_scores))
print("Mean Accuracy: {}".format(np.mean(tree_scores)))

Model Accuracies: [0.96666667 0.96666667 0.9        0.96666667 1.        ]
Mean Accuracy: 0.9600000000000002


In [7]:
# Logistic Regression

log_reg = LogisticRegression(max_iter=1000)
log_scores = cross_val_score(log_reg, data, target, cv=5)

print("Model Accuracies: {}".format(log_scores))
print("Mean Accuracy: {}".format(np.mean(log_scores)))

Model Accuracies: [0.96666667 1.         0.93333333 0.96666667 1.        ]
Mean Accuracy: 0.9733333333333334


In [9]:
# SVC

svc = SVC()
svm_scores = cross_val_score(svc, data, target, cv=5)

print("Model Accuracies: {}".format(svm_scores))
print("Mean Accuracy: {}".format(np.mean(svm_scores)))

Model Accuracies: [0.96666667 0.96666667 0.96666667 0.93333333 1.        ]
Mean Accuracy: 0.9666666666666666


    The mean accuracy scores above give us an idea of how well our models generalize to new
    data. Once we understand how well our models are performing, the next question to ask is:
    
        "Can we improve their performance?"
    
    Yes, we can, by searching through a space of model parameters and evaluating
    the accuracy score (or a regression metric, for regression) of each candidate.

In [33]:
# Improving Decision Trees

# Maximum depth is a hyperparameter of decision trees
# Let's say we want to see how our model's performance
# changes by tweaking this parameter

list_max_depth = [3, 6, 12, 18, 24, 30, 40, 50]  # Does this parameter influence our model's accuracy?

best_score_ = 0  # Accuracy scores lie in [0, 1]

for max_depth in list_max_depth:
    clf_tree = DecisionTreeClassifier(max_depth=max_depth)
    scores = cross_val_score(clf_tree, data, target, cv=5)
    mean_score = np.mean(scores)
    
    if mean_score > best_score_:
        best_score_ = mean_score
        best_params_ = {
            "max_depth": max_depth
        }
        
print("Best Score: {}".format(best_score_))
print("Best Params: {}".format(best_params_))

Best Score: 0.9733333333333334
Best Params: {'max_depth': 3}


    We observe that the mean accuracy of the tree classifier improved from 96% to about 97.3%
    with a max_depth of 3.
    
    Can we improve this further by adding another search parameter, and hence searching through
    combinations of max_depth and the additional parameter?

In [39]:
# Adding another parameter

list_max_depth = [3, 6, 12, 18, 24, 30, 40, 50]
list_max_leaf_nodes = [4, 5, 6, 7, 8, 9, 10]

best_score_ = 0

for max_depth in list_max_depth:
    for max_leaf_nodes in list_max_leaf_nodes:
        
        tree = DecisionTreeClassifier(max_depth=max_depth, max_leaf_nodes=max_leaf_nodes)
        scores = cross_val_score(tree, data, target, cv=5)
        mean_score = np.mean(scores)
        
        if mean_score > best_score_:
            best_score_ = mean_score
            best_params_ = {
                "max_depth": max_depth,
                "max_leaf_nodes": max_leaf_nodes
            }
            
print("Best Score: {}".format(best_score_))
print("Best Params: {}".format(best_params_))

Best Score: 0.9733333333333334
Best Params: {'max_depth': 3, 'max_leaf_nodes': 5}


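    Here, we observe that the best accuracy score hasn't improved with the addition of
    max_leaf_nodes. For this grid, tuning 'max_leaf_nodes' did not raise the best score beyond
    what max_depth alone achieved; that alone does not show the parameter has no effect at all.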
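
    One way to judge how much max_leaf_nodes matters is to keep the score of every combination
    rather than only the best one. The sketch below does that for the same two lists; the
    results dictionary and the fixed random_state are assumptions added here so that repeated
    runs are comparable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data, target = load_iris(return_X_y=True)

list_max_depth = [3, 6, 12, 18, 24, 30, 40, 50]
list_max_leaf_nodes = [4, 5, 6, 7, 8, 9, 10]

# Keep the mean CV accuracy of every combination, not only the best one
results = {}
for max_depth in list_max_depth:
    for max_leaf_nodes in list_max_leaf_nodes:
        # random_state is an added assumption, used to make runs repeatable
        tree = DecisionTreeClassifier(
            max_depth=max_depth, max_leaf_nodes=max_leaf_nodes, random_state=0
        )
        scores = cross_val_score(tree, data, target, cv=5)
        results[(max_depth, max_leaf_nodes)] = np.mean(scores)

# Scores across max_leaf_nodes at a fixed max_depth of 3
for max_leaf_nodes in list_max_leaf_nodes:
    print(max_leaf_nodes, results[(3, max_leaf_nodes)])
```

    If the printed scores barely move across max_leaf_nodes, the parameter has little influence
    in this range; if they vary, it matters even though the best score stayed the same.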

`Improve Logistic Regression`

    Without any hyperparameter tuning, the accuracy of this model is about 97%. Here, I introduce
    a search space of parameters that will be used in the model to evaluate its performance.

In [43]:
# # Let's test these parameters; here, we choose to be
# # ignorant about them, as the focus is on finding
# # parameters

# penalty_space = ['l1', 'elasticnet']
# C = [0.01, 0.1, 0.5, 1.0, 1.3, 1.5]

# best_score_ = 0

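# # Loop through all parameters; 2 x 6 = 12 combinations
# # (these penalties need solver='saga'; the default 'lbfgs' solver rejects them,
# # and 'elasticnet' also needs an l1_ratio)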
# for penalty in penalty_space:
#     for c in C:
#         log_reg = LogisticRegression(penalty=penalty, C=c, max_iter=1000)
#         mean_score = np.mean(cross_val_score(log_reg, data, target, cv=5))
        
#         if mean_score > best_score_:
#             best_score_ = mean_score
#             best_params_ = {
#                 "penalty": penalty,
#                 "C": c
#             }
            
# print("Best Score: {}".format(best_score_))
# print("Best Params: {}".format(best_params_))

`Improve Support Vector Machine`

In [46]:
# Parameter Space
C_space = [0.001, 0.01, 0.1, 1.0, 10, 100]
gamma_space = [0.001, 0.01, 0.1, 1.0, 10, 100]

best_score_ = 0

for C in C_space:  # Loop through 'C_space'
    for gamma in gamma_space:  # Loop through gamma_space
        
        # Instantiate SVM Classifier with params
        svm_clf = SVC(C=C, gamma=gamma)
        # Cross Validation Scores - Model Generalization
        scores = cross_val_score(svm_clf, data, target, cv=5)
        # Mean CV Scores
        mean_score = np.mean(scores)
        
        if mean_score > best_score_:
            best_score_ = mean_score
            best_params_ = {
                'C': C,
                'gamma': gamma
            }
            
print("Best Score: {}".format(best_score_))
print("Best Params: {}".format(best_params_))

Best Score: 0.9800000000000001


    Observe how the model's accuracy has improved from about 96.7% to 98% by searching through a
    parameter space for parameters that yield better performance.
    
    This is grid-search CV in a nutshell.
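
    For comparison, sklearn ships this loop as GridSearchCV. The sketch below runs it over the
    same C and gamma values as the SVM search; it is an alternative to the manual loop, and its
    handling of tied scores may differ from the "strictly greater" rule used there.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

data, target = load_iris(return_X_y=True)

# Same candidate values as the manual SVM search
param_grid = {
    "C": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
}

# cv=5 scores every combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(data, target)

print("Best Score: {}".format(search.best_score_))
print("Best Params: {}".format(search.best_params_))
```

    After fitting, search.cv_results_ holds the score of every combination, which is useful when
    the best score alone does not tell the whole story.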

In [51]:
from grid_search_implementation import GridSearch

In [53]:
svc = SVC()
params = {
    'gamma': [0.001, 0.01, 0.1, 1.0, 10, 100]
}

grid_search = GridSearch(model=svc, params=params, cv=5).fit(X_train=data, y_train=target)

TypeError: 'builtin_function_or_method' object is not subscriptable
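
    The code of grid_search_implementation is not shown, so the cause of this error cannot be
    determined from the cell. In general, this TypeError appears when a built-in method is
    indexed without being called, for example params.keys[0] instead of list(params.keys())[0].
    Below is a minimal sketch of a class with the same call signature; SimpleGridSearch is a
    hypothetical stand-in written for illustration, not the author's GridSearch.

```python
from itertools import product

import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


class SimpleGridSearch:
    """Hypothetical stand-in; not the contents of grid_search_implementation."""

    def __init__(self, model, params, cv=5):
        self.model = model
        self.params = params
        self.cv = cv

    def fit(self, X_train, y_train):
        names = list(self.params)  # parameter names, in a fixed order
        self.best_score_ = -np.inf
        self.best_params_ = None
        # product(...) yields every combination of the candidate values
        for values in product(*(self.params[name] for name in names)):
            candidate = dict(zip(names, values))
            # clone() returns a fresh, unfitted copy; set_params applies the candidate
            estimator = clone(self.model).set_params(**candidate)
            score = np.mean(
                cross_val_score(estimator, X_train, y_train, cv=self.cv)
            )
            if score > self.best_score_:
                self.best_score_ = score
                self.best_params_ = candidate
        return self


data, target = load_iris(return_X_y=True)
params = {"gamma": [0.001, 0.01, 0.1, 1.0, 10, 100]}

grid_search = SimpleGridSearch(model=SVC(), params=params, cv=5).fit(
    X_train=data, y_train=target
)
print("Best Score: {}".format(grid_search.best_score_))
print("Best Params: {}".format(grid_search.best_params_))
```

    Returning self from fit is what allows the chained GridSearch(...).fit(...) call style used
    in the cell above.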