# Lab 11

# Data Set

This week, we will be looking at two machine learning datasets. The first, `wine.txt`, contains a chemical analysis of wines from three different cultivars. There are 178 wines with 13 features in this dataset. The task is to predict which cultivar each wine belongs to.

The second dataset, `mystery.txt`, is a mystery. There are 10 features for each data point, and we are trying to predict a numerical result `R`. There are over 25,000 data points in this data set. This is a very simple task, which could be performed perfectly by about 15 lines of Python code. You’ll find that our machine learning classifiers have a very hard time with it. Play with the different classifiers and see if you can get one to do reasonably well.

# Decision Trees
The program below builds a decision tree classifier for the wine dataset. It requires the file `wine.txt` (included in the lab). It runs a 20-fold cross validation and prints the mean accuracy along with the elapsed time. It then fits the tree on the full dataset and writes it to a file called `dtree.txt`. To view the tree, download this file and paste its contents into this website: `www.webgraphviz.com`

In [None]:
import pandas as pd
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from timeit import default_timer as timer

# The first column holds the cultivar label; the remaining columns are the features.
train=pd.read_csv('wine.txt')
X=train.drop(train.columns[0],axis=1)
Y=train[train.columns[0]]

model= DecisionTreeClassifier(criterion='entropy')

start=timer()
accuracy=(cross_val_score(model,X,Y,cv=20,scoring='accuracy')).mean()
end=timer()
time=end-start
print("Decision tree accuracy: %f" % accuracy)
print("Decision tree training time: %f seconds" % time)


# Fit on the full dataset and write the tree to a dot file.
# Paste the contents of dtree.txt into http://www.webgraphviz.com to view it.
model.fit(X,Y)
with open("dtree.txt", 'w') as dotfile:
    # Pass the feature names as a plain list of strings.
    tree.export_graphviz(model, out_file=dotfile, feature_names=list(X.columns))
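
The program above passes `criterion='entropy'`, so the tree picks each split by how much it reduces the entropy of the class labels. The sketch below shows that calculation in plain Python. It is only an illustration: the helper names `entropy` and `information_gain` and the toy data are made up for this example and are not part of scikit-learn or of `wine.txt`.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the data on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Made-up toy data (not taken from wine.txt): one feature, two classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = [1, 1, 2, 2]

print(entropy(labels))                        # 1.0 bit: two equally likely classes
print(information_gain(values, labels, 2.5))  # 1.0: this split separates the classes perfectly
print(information_gain(values, labels, 1.5))  # a smaller gain: the right side is still mixed
```

The `entropy = ...` value printed in each node of the exported tree is this same impurity measure, computed over the training wines that reach that node.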

# Nearest Neighbor
The program below builds a k-nearest neighbor classifier and runs a cross validation on the wine dataset to produce an accuracy score.

In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from timeit import default_timer as timer

train=pd.read_csv('wine.txt')
X=train.drop(train.columns[0],axis=1)
Y=train[train.columns[0]]
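# Standardize each feature to zero mean and unit variance so that
# no single feature dominates the distance calculation.
scaler = preprocessing.StandardScaler().fit(X)
X=scaler.transform(X)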

model=KNeighborsClassifier(n_neighbors=5)

start=timer()
accuracy=(cross_val_score(model,X,Y,cv=20,scoring='accuracy')).mean()
end=timer()
time=end-start
print("Nearest neighbor accuracy: %f" % accuracy)
print("Nearest neighbor time: %f seconds" % time)
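
To see what the classifier and the scaling step above are doing, here is a minimal plain-Python sketch of k-nearest neighbor voting, run with and without standardization. The helper names `standardize` and `knn_predict` and the toy data are assumptions made for this illustration; scikit-learn's `KNeighborsClassifier` and `StandardScaler` are far more general.

```python
from collections import Counter
from math import dist, sqrt


def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Use a scale of 1.0 for constant columns to avoid dividing by zero.
    stds = [sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: the first feature separates the classes,
# the second is on a much larger scale and carries no class information.
train_X = [[1.0, 1000.0], [1.2, 3000.0], [0.9, 5000.0],
           [3.0, 1100.0], [3.1, 2900.0], [2.9, 5100.0]]
train_y = ["A", "A", "A", "B", "B", "B"]
query = [1.1, 3000.0]

raw_prediction = knn_predict(train_X, train_y, query, k=3)

# Scale the training rows and the query together, as the lab program scales all of X at once.
scaled = standardize(train_X + [query])
scaled_prediction = knn_predict(scaled[:-1], train_y, scaled[-1], k=3)

print(raw_prediction)     # B: the large-scale second feature dominates the distance
print(scaled_prediction)  # A: after scaling, the first feature counts equally
```

Without scaling, the query is voted into class B even though its first feature clearly matches class A. This is why the program standardizes `X` before building the classifier.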

# Support Vector Machines

The program below builds an SVM classifier with a linear kernel and runs a cross validation on the wine dataset to produce an accuracy score.

In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn import svm
from timeit import default_timer as timer

train=pd.read_csv('wine.txt')
X=train.drop(train.columns[0],axis=1)
Y=train[train.columns[0]]
scaler = preprocessing.StandardScaler().fit(X)
X=scaler.transform(X)

model= svm.SVC(kernel='linear')

start=timer()
accuracy=(cross_val_score(model,X,Y,cv=20,scoring='accuracy')).mean()
end=timer()
time=end-start
print("SVM accuracy: %f" % accuracy)
print("SVM training time: %f seconds" % time)
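
Each of these programs calls `cross_val_score` with `cv=20`, which splits the data into 20 folds, trains on 19 of them, tests on the remaining one, and repeats until every fold has been the test set once. The sketch below shows that loop in plain Python under simplifying assumptions: the folds are contiguous and unshuffled, whereas scikit-learn uses stratified folds for classifiers, and a hypothetical majority-class baseline stands in for a real model. All function names here are made up for this example.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def majority_class(labels):
    """Hypothetical stand-in for a real classifier: always predict the commonest label."""
    return Counter(labels).most_common(1)[0][0]


def cross_val_accuracy(labels, k):
    """Mean test accuracy of the majority-class baseline over k folds."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        prediction = majority_class([labels[i] for i in train])
        correct = sum(1 for i in test if labels[i] == prediction)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Made-up labels (not the wine data): 12 samples, two classes, interleaved.
labels = [1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2]
folds = list(k_fold_indices(len(labels), 4))

print([len(test) for _, test in folds])  # [3, 3, 3, 3]
print(cross_val_accuracy(labels, 4))     # about 0.667: the baseline gets 2 of 3 right in every fold
```

With 178 wines and 20 folds, each test fold holds only about nine wines, so a single fold's accuracy is noisy. Averaging over all folds, as `.mean()` does in the programs above, gives a steadier estimate.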


# Neural Networks

The program below builds a neural network classifier and runs a cross validation on the wine dataset to produce an accuracy score. If it prints a warning about "max iterations", you can ignore it.

In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from timeit import default_timer as timer

train=pd.read_csv('wine.txt')
X=train.drop(train.columns[0],axis=1)
Y=train[train.columns[0]]
scaler = preprocessing.StandardScaler().fit(X)
X=scaler.transform(X)

model=MLPClassifier(activation='logistic')

start=timer()
accuracy=(cross_val_score(model,X,Y,cv=20,scoring='accuracy')).mean()
end=timer()
time=end-start
print("Neural net accuracy: %f" % accuracy)
print("Neural net training time: %f seconds" % time)

# Mystery

This section is a challenge. We want to see who can train the best classifier for the mystery dataset. `mystery.txt` contains the mystery dataset and the program below loads it for you. Your task is to try different classifiers on the mystery dataset. Feel free to tweak the parameters of the algorithms. You can also do some preprocessing of the data if you like.

In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from timeit import default_timer as timer

# Unlike wine.txt, the value to predict is in the last column; the other columns are the features.
train=pd.read_csv('mystery.txt')
X=train.drop(train.columns[-1],axis=1)
Y=train[train.columns[-1]]

model= DecisionTreeClassifier(criterion='entropy')

start=timer()
accuracy=(cross_val_score(model,X,Y,cv=20,scoring='accuracy')).mean()
end=timer()
time=end-start
print("Mystery accuracy: %f" % accuracy)
print("Mystery training time: %f seconds" % time)


# Lab Questions

1. What feature was most important in the decision tree classifier? What leads you to conclude that this is the most important feature?
2. The decision tree classifier can split most of the first cultivar from the rest of the wines by making a series of decisions. What are those decisions?
3. The decision tree classifier can identify most of the third cultivar. What series of decisions would lead to a classification as the third cultivar?
4. Which of the machine learning algorithms performs the best on the wine data set? Which performs the worst? Which takes the longest time to train? Which takes the least?
5. What classifiers and parameters did you try on the mystery dataset?
6. What was your best performance on the mystery dataset? What algorithm and parameters did you use to get this result?
7. Can you guess what the mystery task is?