Part 1: Get the wine dataset

*Describe data set

In [23]:
# Split data into training and test set
import numpy as np
import pandas as pd

df = pd.read_csv('wine\wine.csv')
print("Total rows: ", len(df))
# print(df)

# wine = np.array(df)
# print(wine)

train_idxs = np.random.choice(range(len(df)), size=int(0.8*len(df)))
print(train_idxs)
train = df.iloc[train_idxs]
print("Training set rows: ",len(train))

test_idxs = np.full(len(df), True)
test_idxs[train_idxs] = False
test = df.iloc[test_idxs]
print("Test set rows: ",len(test))

train.to_csv('wine-train.csv')
test.to_csv('wine-test.csv')


Total rows:  177
[ 48 117 106 114  50  61 117 101 155  87 102 127  85  29  71 130 171  96
 121 148  31 125 113  76 133 113  31  93 134  31 166  75  28 118 151  53
  70 139  37 131 116 102 143  14  14 124 176  46 118  44  98  26   5 165
  32  49 155  90 108 112  66 154   8   3  71  86  14  31 169 147 114 144
 166 134 122  21  94  75  53 165 150 144  50 113  42 158   6  23  23   6
  53  80  78 104 173  64  41  54 161  48 142 123  77 106 100 107  17  18
  34 145 107  57  66 121   0 133  97 126  79  88 164 104  26 149   0 133
 114 139  95  87 158 126 155   1   4 153  60 147   9 122  13]
Training set rows:  141
Test set rows:  77


In [24]:
# Split training data into training and cross validation sets
from sklearn.model_selection import train_test_split
from collections import Counter

df_train = pd.read_csv('wine-train.csv')
train = np.array(df_train, dtype=float)
print(train)

train, valid = train_test_split(train, shuffle=True)

# Split test and validation data into X and Y (inputs and labels)
train_y, train_X, valid_y, valid_X = train[:, 1], train[:, 2 : ], valid[:, 1], valid[:, 2 : ] # The labels are in column number 2, the Xs are column 3 onwards

# Split test data into X and Y (inputs and labels)
test = np.array(pd.read_csv('wine-test.csv'))
test_y, test_X = test[:, 1], test[:, 2 : ]

# Look at the balance of classes to make sure that evaluating models on just their accuracy is okay
print(Counter(test_y))
print(Counter(train_y))


[[4.800e+01 1.000e+00 1.394e+01 ... 1.120e+00 3.100e+00 1.260e+03]
 [1.170e+02 2.000e+00 1.277e+01 ... 7.000e-01 2.120e+00 3.720e+02]
 [1.060e+02 2.000e+00 1.272e+01 ... 8.800e-01 2.420e+00 4.880e+02]
 ...
 [9.000e+00 1.000e+00 1.410e+01 ... 1.250e+00 3.170e+00 1.510e+03]
 [1.220e+02 2.000e+00 1.305e+01 ... 7.300e-01 3.100e+00 3.800e+02]
 [1.300e+01 1.000e+00 1.438e+01 ... 1.200e+00 3.000e+00 1.547e+03]]
Counter({2.0: 28, 1.0: 27, 3.0: 22})
Counter({2.0: 43, 1.0: 32, 3.0: 30})


Part 2: Fit models to the wine dataset and test performance

Using a classification tree on the model. Evaluate the model's performance by comparing its predicted labels with the test labels using accuracy (the fraction of the predictions that were correct).


In [25]:
# run a classification tree on the dataset
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier()
tree.fit(test_X, test_y)

# Evaluate performance on cross validation set
pred_y = tree.predict(valid_X)
print(accuracy_score(valid_y, pred_y))

# Evaluate performance by comparing with test data
pred_y = tree.predict(test_X)
print(accuracy_score(test_y, pred_y))


0.9722222222222222
1.0


Part 3: Ensembling to improve performance

Ensemble the classification tree model used above buy using random forests. Evaluate model by looking at its accuracy. 

In [26]:
from sklearn.ensemble import RandomForestClassifier

randomForest = RandomForestClassifier(n_estimators=100, max_depth=2, max_samples=10)
randomForest.fit(train_X, train_y) # fit random forest of decision trees

# Evaluate the ensemble's performance
score = randomForest.score(test_X, test_y) # use the model's score method to compute it's accuracy
print(score)



0.922077922077922


Part 4: Finding the best models and hyperparameters

We have used the following models for supervised learning classification problems so far: Logistic Regression, RandomForests, Support Vector Machines, and K nearest-neighbours. Using sklearn's VotingClassifier, we can ensemble different models and using sklearn's accuracy_score, we can compare the accuracies to find the best single model, or combination of models. 

In [88]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from numpy import random
import itertools


def update_best_model(model, train_X, train_y, test_X, test_y, best_accuracy, best_model):
    # A function to see if this current model gives a better accuracy than any previous models, and then to update the new best model
    model.fit(train_X, train_y)
    pred_y = model.predict(test_X)
    accuracy = accuracy_score(test_y, pred_y) # If this accuracy is highest so far, then update the best model with this model
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model
    return best_accuracy, best_model

def parameter_search(model_name, train_X, train_y, test_X, test_y,  estimators=[], n_samples=100,):
            """
            Hyperparameter search function.

            Finds the best parameters of a given model based on whichever parameters give the highest accuracy score.
            Works with the following scikit-learn models: RandomForestClassifier, LogisticRegression, SVC and KNeighborsClassifier.

            Parameters 

            model_name:{"RandomForestClassifier", "LogisticRegression", "SVC", "KNeighborsClassifier"}
                Depending on the model chosen, different hyperparameters will be returned. 
                For "RandomForestClassifier", the n_estimators, max_depth and max_samples will be returned
                For "LogisticRegression", the C will be returned
                For "SVC", the C and gamma will be returned
                For "KNeighborsClassifier", the n_neighbours will be returned.

            train_X, train_y, test_X, test_y: numpy array
                This will take in the training data and test data in the order of the training Xs then ys, then the test Xs then ys.

            n_samples: int, default=1000
                How many samples of each hyperparameter are required, meaning a total of (n_samples**number of hyperparameters) samples will be taken.

            """
            # Initialise best accuracy and best model variables
            best_accuracy = 0
            best_model = None

            # Search for different parameteres depending on the model
            if model_name == "RandomForestClassifier":
                for i in range(1, min(len(train_X), n_samples) + 1): # There can't be more samples than there are examples in data
                    for j in range(1, n_samples + 1): 
                        model = RandomForestClassifier(max_samples=i,max_depth=j)
                        best_accuracy, best_model = update_best_model(model, train_X, train_y, test_X, test_y, best_accuracy, best_model)
            elif model_name in ["LogisticRegression","SVC"]:
                half = int(n_samples/2)
                C = random.exponential(scale=1, size=(half)) # let half of the values for C be small (roughly between 0 and 1)
                C = np.append(C, 10*random.exponential(scale=100, size=(n_samples-half))) # let other half be big (roughly between 10 and 10000)
                for c in C:
                    if model_name == "LogisticRegression":
                        model = LogisticRegression(C=c)
                        best_accuracy, best_model = update_best_model(model, train_X, train_y, test_X, test_y, best_accuracy, best_model)
                    else:
                        for g in C:
                            model = SVC(C=c, gamma=g)
                            best_accuracy, best_model = update_best_model(model, train_X, train_y, test_X, test_y, best_accuracy, best_model)
            elif model_name == "KNeighborsClassifier":
                for i in range(1, min(len(train_X), n_samples) + 1): # There can't be more neighbours than there are examples in the data
                    # Initialise and test the model on these parameters
                    model = KNeighborsClassifier(n_neighbors=i)
                    best_accuracy, best_model = update_best_model(model, train_X, train_y, test_X, test_y, best_accuracy, best_model)
            elif model_name == "VotingClassifier":
                total_combs_of_estimators = []
                for i in range(2, len(estimators) + 1): # get all the different combinations of the models for ensembling
                    total_combs_of_estimators.extend(list(itertools.combinations(estimators, i)))
                for estimator in total_combs_of_estimators:
                    model = VotingClassifier(estimators=estimator, voting='hard')
                    best_accuracy, best_model = update_best_model(model, train_X, train_y, test_X, test_y, best_accuracy, best_model)
            else:
                print("Please check the documentation and specify a relevant model")
                return
            print("The best parameters have been found for", best_model.__class__.__name__, "with an accuracy of", best_accuracy, "on the cross validation data, with parameters:",best_model.get_params(),"\n")
            return best_model



def BestModelAndParameter(train_X, train_y, valid_X, valid_y, test_X, test_y):
    
    # initialise every model with their best parameters, using VALIDATION data not test data
    ran_for = parameter_search("RandomForestClassifier",train_X, train_y, valid_X, valid_y, n_samples=10)
    log_reg = parameter_search("LogisticRegression",train_X, train_y, valid_X, valid_y)
    sup_vec = parameter_search("SVC",train_X, train_y, valid_X, valid_y)
    K_near = parameter_search("KNeighborsClassifier",train_X, train_y, valid_X, valid_y)
    
    estimators=[('rf', ran_for), ('lr', log_reg), ('svc', sup_vec), ('Knear', K_near)]
    voting = parameter_search("VotingClassifier",train_X, train_y, valid_X, valid_y,estimators=estimators)

    best_accuracy = 0
    best_model = None
    for model in (ran_for, log_reg, sup_vec, K_near, voting):
        best_accuracy, best_model = update_best_model(model, train_X, train_y, test_X, test_y, best_accuracy, best_model)
        print("On the training data, model", model.__class__.__name__, "has an accuracy of:", accuracy_score(test_y, model.predict(test_X)))
    print("\nTherefore, the best model was", best_model.__class__.__name__, "with an accuracy of", best_accuracy)
    print("\nThis model has the following parameters:", best_model.get_params())
    return best_model

model = BestModelAndParameter(train_X, train_y, valid_X, valid_y, test_X, test_y)


The best parameters have been found for RandomForestClassifier with an accuracy of 1.0 on the cross validation data, with parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 5, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': 3, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False} 

The best parameters have been found for LogisticRegression with an accuracy of 0.9166666666666666 on the cross validation data, with parameters: {'C': 0.9974236128941865, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False} 

The best p

Part 5: Visualising results and summarise

In [61]:
import sys
sys.path.append('..')
from utils import get_classification_data, show_data, visualise_predictions, colors

show_data(train_X, train_y)

TypeError: 'numpy.float64' object cannot be interpreted as an integer

Part 6: "A stakeholder asks you which features most affect the response variable (output). Describe how you would organise a test to determine this."

I would test this by manipulating the input data for the models, such that all the Xs for one feature are set to zero, and then repeating this until every feature has had a chance to be set to zero. I would then compare which result ends up with the biggest difference from the original result which had all features included. This feature when set to zero that correspends to the biggest difference would therefore be the feature that has the greatest influence over the response variable. 