Part 1: Get the wine dataset

*Describe data set

In [11]:
# Split data into training and test set
import numpy as np
import pandas as pd

df = pd.read_csv('wine\wine.csv')
print("Total rows: ", len(df))
# print(df)

# wine = np.array(df)
# print(wine)

train_idxs = np.random.choice(range(len(df)), size=int(0.8*len(df)))
print(train_idxs)
train = df.iloc[train_idxs]
print("Training set rows: ",len(train))

test_idxs = np.full(len(df), True)
test_idxs[train_idxs] = False
test = df.iloc[test_idxs]
print("Test set rows: ",len(test))

train.to_csv('wine-train.csv')
test.to_csv('wine-test.csv')


Total rows:  177
[ 23  14  95  26  11   0  31  81  49  93   2  33  96  13 124 173  49  73
  39 119  32 125 149  66 148  64 113  73  64  12  41  67 143 176  45   2
 155 171  61 103  96  12  76 139 172 103 109   5 155 109 107 133   4  21
   6 114  20  58 154  62 149 123 130  80  72  10 175  37  77 102 150 140
 128 109 163 154 104 139 153  28  43 124 138  57 159 120 119  92 170  55
 106  73  60  57 100  63  32  74 118 159  11  69   2  80 165 156  56  22
 107  86 173 141  32 115 158   5 140 100  17 171 128 132  61 122 176  92
  76 134 163 101 102  72 170 114  14  42  75 133  85 160   9]
Training set rows:  141
Test set rows:  77


In [12]:
# Split training data into training and cross validation sets
from sklearn.model_selection import train_test_split

df_train = pd.read_csv('wine-train.csv')
train = np.array(df_train)

train, valid = train_test_split(train, shuffle=True)

# Split test and validation data into X and Y (inputs and labels)
train_y, train_X, valid_y, valid_X = train[:, 1], train[:, 2 : ], valid[:, 1], valid[:, 2 : ] # The labels are in column number 2, the Xs are column 3 onwards

# Split test data into X and Y (inputs and labels)
test = np.array(pd.read_csv('wine-test.csv'))
test_y, test_X = test[:, 1], test[:, 2 : ]

Part 2: Fit models to the wine dataset and test performance

Using a classification tree on the model. Evaluate the model's performance by comparing its predicted labels with the test labels using mean squared error.


In [13]:
# run a classification tree on the dataset
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error

tree = DecisionTreeClassifier()
tree.fit(test_X, test_y)

# Evaluate performance on cross validation set
pred_y = tree.predict(valid_X)
print(mean_squared_error(valid_y, pred_y))
# print(tree.get_depth(), tree.get_n_leaves())

# Evaluate performance by comparing with test data
pred_y = tree.predict(test_X)
print(mean_squared_error(test_y, pred_y))


0.16666666666666666
0.0


Part 3: Ensembling to improve performance

Ensemble the classification tree model used above buy using random forests. Evaluate model by looking at its accuracy. 

In [14]:
from sklearn.ensemble import RandomForestClassifier

randomForest = RandomForestClassifier(n_estimators=100, max_depth=2, max_samples=10)
randomForest.fit(train_X, train_y) # fit random forest of decision trees

# Evaluate the ensemble's performance
score = randomForest.score(test_X, test_y) # use the model's score method to compute it's accuracy
print(score)



0.948051948051948


Part 4: Finding the best models and hyperparameters

We have used the following models for supervised learning classification problems so far: Logistic Regression, RandomForests, Support Vector Machines, and K nearest-neighbours. Using sklearn's VotingClassifier, we can ensemble different models and using sklearn's accuracy_score, we can compare the accuracies to find the best single model, or combination of models. 

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

class BestEnsembleParameter():
    def __init__(self):
        pass
    def __call__(self, train_X, train_y, test_X, test_y):
        # initialise every model
        ran_for = RandomForestClassifier()
        log_reg = LogisticRegression()
        sup_vec = SVC()
        K_near = KNeighborsClassifier()
        voting = VotingClassifier(estimators=[('rf', ran_for), ('lr', log_reg), ('svc', sup_vec), ('Knear', K_near)], voting='hard', verbose=True)
        # fit every model
        best_accuracy = 0
        best_model = ''
        for model in (ran_for, log_reg, sup_vec, K_near, voting):
            model.fit(train_X, train_y)
            pred_y = model.predict(test_X)  # do predictions for every model
            accuracy = accuracy_score(test_y, pred_y) # get the accuracy of each model
            print(model.__class__.__name__, accuracy) # print each model's accuracy
            # update the best accuracy score if it is the highest score so far
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_model = model.__class__.__name__
        print("\nTherefore, the best model was", best_model, "with an accuracy of", best_accuracy)

        def parameter_search():
            # random or grid search over parameters
            return
finder = BestEnsembleParameter()
finder(train_X, train_y, test_X, test_y)


RandomForestClassifier 0.974025974025974
LogisticRegression 0.922077922077922
SVC 0.6493506493506493
KNeighborsClassifier 0.6363636363636364
[Voting] ....................... (1 of 4) Processing rf, total=   0.1s
[Voting] ....................... (2 of 4) Processing lr, total=   0.0s
[Voting] ...................... (3 of 4) Processing svc, total=   0.0s
[Voting] .................... (4 of 4) Processing Knear, total=   0.0s
VotingClassifier 0.8181818181818182

Therefore the best model was RandomForestClassifier with an accuracy of 0.974025974025974


Part 5: Visualising results and summarise

Part 6: "A stakeholder asks you which features most affect the response variable (output). Describe how you would organise a test to determine this."

I would test this by manipulating the input data for the models, such that all the Xs for one feature are set to zero, and then repeating this until every feature has had a chance to be set to zero. I would then compare which result ends up with the biggest difference from the original result which had all features included. This feature when set to zero that correspends to the biggest difference would therefore be the feature that has the greatest influence over the response variable. 