---
title: "Topics in Econometrics and Data Science: Tutorial 10"

---

#### General Note

You will very likely find the solution to these exercises online. We, however, strongly encourage you to work on these exercises without doing so. Understanding someone else’s solution is very different from coming up with your own. Use the lecture notes and try to solve the exercises independently.

## Exercise 1: Wine classification

First, load the wine dataset, which contains information on the chemical composition of wines. You can load the data via:

In [10]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
dataset = load_wine()

1. Make yourself familiar with the data. How many different wine types are contained in the sample? How many different features and observations are included?
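One possible starting point for this step is sketched below. It only inspects the object returned by `load_wine`; the variable names (`X`, `y`, `n_obs`, `n_features`, `n_classes`) are illustrative choices, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_wine

dataset = load_wine()

# Feature matrix and class labels as NumPy arrays
X, y = dataset.data, dataset.target

# Number of observations (rows) and features (columns)
n_obs, n_features = X.shape

# Number of distinct wine types in the sample
n_classes = len(np.unique(y))

print("observations:", n_obs)
print("features:", n_features)
print("wine types:", n_classes)
print("feature names:", dataset.feature_names)
print("class names:", list(dataset.target_names))
```

The attribute `dataset.DESCR` additionally holds a plain-text description of the data.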

2. Use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function (`test_size = 0.2`, `random_state=0`) to split your data into a training and a testing sample.

In [11]:
from sklearn.model_selection import train_test_split
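A minimal sketch of the split, using the settings stated in the exercise. The names `X_train`, `X_test`, `y_train`, and `y_test` are a common convention and are assumed here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# return_X_y=True returns the feature matrix and the labels directly
X, y = load_wine(return_X_y=True)

# 20% of the observations are held out for testing;
# random_state=0 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("training sample:", X_train.shape)
print("testing sample:", X_test.shape)
```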

3. Classify your data with $k$-nearest neighbors using the [`neighbors.KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) class. Try different weights and numbers of neighbors to minimize your empirical error rate.

In [12]:
from sklearn import neighbors
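One way to approach the search is a simple loop over both tuning parameters, sketched below. The grid (`k` from 1 to 15, the two built-in weighting schemes) and the dictionary `error_rates` are assumptions made for illustration. The error rate is measured on the testing sample, so the best value found this way is an optimistic estimate.

```python
from sklearn import neighbors
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Empirical error rate for every combination of weights and k
error_rates = {}
for weights in ("uniform", "distance"):
    for k in range(1, 16):
        clf = neighbors.KNeighborsClassifier(n_neighbors=k, weights=weights)
        clf.fit(X_train, y_train)
        # score() returns the accuracy, so the error rate is 1 - accuracy
        error_rates[(weights, k)] = 1.0 - clf.score(X_test, y_test)

# Combination with the smallest error rate on the testing sample
best_weights, best_k = min(error_rates, key=error_rates.get)
print("best weights:", best_weights)
print("best k:", best_k)
print("error rate:", error_rates[(best_weights, best_k)])
```

Since $k$-nearest neighbors relies on distances, it may also be worth checking whether standardizing the features changes the result.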

4. Try to improve on your result by using a random forest via the [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) class.

In [13]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
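A minimal sketch of a random forest fit on the same split. The settings `n_estimators=100` and `random_state=0` are assumptions chosen for reproducibility, not values prescribed by the exercise.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a forest of 100 trees on the training sample
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Empirical error rate on the testing sample
error_rate = 1.0 - forest.score(X_test, y_test)
print("error rate:", error_rate)
```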

5. Can you determine the two most important features (as measured by the attribute [`.feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_)) that can be used to separate the wine types?
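The importances can be ranked, for example, by pairing them with the feature names in a pandas `Series`. The sketch below refits a forest with assumed settings (`n_estimators=100`, `random_state=0`); the ranking may change with other settings or seeds.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Pair each importance with its feature name and sort in descending order
importances = pd.Series(
    forest.feature_importances_, index=dataset.feature_names
).sort_values(ascending=False)

# Names of the two features with the largest importance
top_two = importances.index[:2].tolist()
print(importances)
print("two most important features:", top_two)
```

A scatter plot of the two selected features, colored by wine type, is one way to check visually how well they separate the classes.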

## Exercise 2: Digits classification

The (famous) MNIST dataset contains 70,000 observations, each an image of a handwritten digit. Each observation consists of $784$ features (grey levels), which correspond to a $28\times28$ image. The MNIST dataset is very popular for training and testing machine learning algorithms (see [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)).


![Example, MNIST.](MNIST_1.png)

Due to time constraints, we work with a much smaller set of digit images with a reduced number of features. First, load the digits dataset.

In [14]:
import pandas as pd
from sklearn.datasets import load_digits
dataset = load_digits()

1. Again, make yourself familiar with the data. How many different features and observations are included?

In [15]:
import numpy as np
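A possible way to inspect the digits data is sketched below. The variable names are illustrative. The attribute `dataset.images` holds the same grey levels as `dataset.data`, arranged as two-dimensional images.

```python
import numpy as np
from sklearn.datasets import load_digits

dataset = load_digits()
X, y = dataset.data, dataset.target

# Number of observations (rows) and features (columns)
n_obs, n_features = X.shape

# Shape of a single image; its pixels are the features of one observation
image_shape = dataset.images[0].shape

# Distinct digits and how often each one occurs
digits, counts = np.unique(y, return_counts=True)

print("observations:", n_obs)
print("features:", n_features)
print("image shape:", image_shape)
print("digit counts:", dict(zip(digits.tolist(), counts.tolist())))
```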

2. Use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function (`test_size = 0.2`, `random_state=0`) to split your data into a training and a testing sample.


In [16]:
from sklearn.model_selection import train_test_split

3. Classify your data with support vector machines using the [`svm.SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class. Try different kernels and other tuning parameters to minimize your empirical error rate.

In [17]:
from sklearn import svm
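As a starting point, the sketch below compares the four built-in kernels of `svm.SVC` on the split from the previous step. The values `C=1.0` and `gamma="scale"` are the library defaults and are written out only to show where further tuning could happen; the dictionary `error_rates` is an illustrative name.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Empirical error rate on the testing sample for each kernel
error_rates = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = svm.SVC(kernel=kernel, C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    error_rates[kernel] = 1.0 - clf.score(X_test, y_test)

best_kernel = min(error_rates, key=error_rates.get)
for kernel, error in error_rates.items():
    print(kernel, error)
print("best kernel:", best_kernel)
```

Other parameters worth varying include `C`, `gamma`, and, for the polynomial kernel, `degree`.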