# A larger example

Here we will use a dataset on Italian wine. The dataset is taken from
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/

The actual data is in the file wine.data, and a description of the data can be found in wine.names.

If we look at the beginning of the data file we see:<br>
`1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
...`

Each row starts with the class of the wine, followed by the 13 attributes (taken from wine.names):<br>
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline 
    
    
The data is clearly a simple CSV file, so we start by reading it.

---

* Author: Troels C. Petersen (NBI) & Brian Vinter (formerly NBI, now AU)
* Email:  petersen@nbi.dk
* Date:   25th of April 2024

In [1]:
import csv
import numpy

with open('data_Wine.csv') as input_file:
    # numpy.float was removed in NumPy 1.24; the builtin float is equivalent here
    raw_data = numpy.array([row for row in csv.reader(input_file)]).astype(float)

labels = raw_data[:, 0 ]
data   = raw_data[:, 1:]

This time the data are not confined to the simple [0, 1] range, so we normalize each column by dividing it by its maximum value.

In [2]:
_, num_c = data.shape
for i in range(num_c):
    data[:, i] = data[:, i] / numpy.max(data[:, i])
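The column loop above can also be written as a single broadcast operation. The sketch below uses a small made-up array (`toy_raw`, `toy_labels` and `toy_data` are hypothetical names, not the wine data). It also shows a detail worth keeping in mind: a slice such as `raw[:, 1:]` is a view, so normalizing it in place changes the original array as well.

```python
import numpy

# Hypothetical toy array: first column is a label, the rest are features.
toy_raw = numpy.array([[1.0, 2.0, 10.0],
                       [2.0, 4.0, 40.0],
                       [1.0, 8.0, 20.0]])

toy_labels = toy_raw[:, 0]
toy_data = toy_raw[:, 1:]   # basic slicing returns a view, not a copy

# max(axis=0) gives one maximum per column; the in-place division
# broadcasts it over all rows and keeps toy_data a view of toy_raw.
toy_data /= toy_data.max(axis=0)

print(toy_data)
print(numpy.shares_memory(toy_data, toy_raw))
```

Writing `toy_data = toy_data / toy_data.max(axis=0)` instead would create a new array and leave `toy_raw` unnormalized, which matters if later code reads the features from the raw array.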

We know that there are 13 columns in data, but at this point we may as well make our distance measure independent of the number of dimensions.

In [3]:
def all_distances(point, db):
    result = []
    for entry in db:
        distance = 0.0
        for p, e in zip(point, entry):
            distance += (p - e)**2
        result.append(numpy.sqrt(distance))
    return numpy.array(result)
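For larger datasets the same Euclidean distance can be computed without explicit Python loops. This is a minimal sketch of an alternative, not a replacement for the function above; the name `all_distances_vectorized` and the toy points are assumptions made for illustration.

```python
import numpy

def all_distances_vectorized(point, db):
    """Euclidean distance from `point` to every row of `db`, for any dimension."""
    # Broadcasting subtracts `point` from each row of `db`.
    diff = numpy.asarray(db, dtype=float) - numpy.asarray(point, dtype=float)
    # Sum the squared differences along each row, then take the square root.
    return numpy.sqrt((diff ** 2).sum(axis=1))

# Hypothetical toy database of three 2D points.
toy_db = numpy.array([[0.0, 0.0],
                      [3.0, 4.0],
                      [1.0, 1.0]])
toy_distances = all_distances_vectorized([0.0, 0.0], toy_db)
print(toy_distances)
```

Like the loop version, it returns one distance per row of the database and does not depend on the number of columns.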

# Clustering:

Your challenge is to see if you can cluster the wine data using the various algorithms. Use the labels to check how well you and the algorithms do.

In [None]:
# Your code...
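One possible way to check a clustering against the labels is cluster purity: each cluster is credited with its most common true label, and the score is the fraction of points that carry the majority label of their cluster. The helper `cluster_purity` and the toy assignments below are hypothetical and only sketch the idea; other measures may suit your algorithm better.

```python
import collections
import numpy

def cluster_purity(cluster_ids, true_labels):
    """Fraction of points whose label is the majority label of their cluster."""
    matched = 0
    for cluster in numpy.unique(cluster_ids):
        members = true_labels[cluster_ids == cluster]
        # Count of the most common true label inside this cluster.
        matched += collections.Counter(members.tolist()).most_common(1)[0][1]
    return matched / len(true_labels)

# Hypothetical result of some clustering algorithm on eight points.
toy_clusters = numpy.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_truth = numpy.array([1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 1.0])
purity = cluster_purity(toy_clusters, toy_truth)
print(purity)
```

A purity of 1.0 means every cluster contains a single class. Note that purity rises trivially as the number of clusters grows, so compare runs with the same number of clusters.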

# Sidetrack: k-Nearest-Neighbor (i.e. classification):

This part is not clustering, but classification based on the k nearest neighbors. We can reuse the simple election mechanism.

In [4]:
import collections
def classify(point, k=5):
    distances = all_distances(point, data)
    votes = []
    for _ in range(k):
        winner = numpy.argmin(distances)
        votes.append(labels[winner])
        distances[winner] = numpy.inf  # exclude this neighbor from the next round
    return collections.Counter(votes).most_common(1)[0][0]
    

Now we can test the result against the database itself.

In [5]:
score = 0
for point in raw_data:
    if point[0] == classify(point[1:], 6):
        score += 1
print('Matched', score, 'of', len(raw_data))

Matched 176 of 178


The result is quite satisfactory. However, since we are matching against the database itself, each tested point is also present in the reference data and is always its own nearest neighbor (at distance zero), which is an unfair advantage compared to a real-world scenario. Eliminating this bias is quite simple and is left as an exercise.

You should also play around with the value of k to find the best number of neighbors to match against.
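If you want to compare your own attempt with something, the sketch below shows one possible approach: leave-one-out testing, where the tested point is excluded from its own neighbor search, repeated for several values of k. It runs on a small made-up dataset; `classify_loo`, `toy_data` and `toy_labels` are hypothetical names, and the distance is computed with broadcasting rather than with `all_distances`.

```python
import collections
import numpy

def classify_loo(index, data, labels, k=5):
    """Classify data[index] by its k nearest neighbors, excluding the point itself."""
    diff = data - data[index]
    distances = numpy.sqrt((diff ** 2).sum(axis=1))
    distances[index] = numpy.inf           # the point may not vote for itself
    nearest = numpy.argsort(distances)[:k]  # indices of the k closest other points
    return collections.Counter(labels[nearest].tolist()).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of six points each.
group = numpy.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                     [0.1, 0.1], [0.05, 0.05], [0.05, 0.0]])
toy_data = numpy.vstack([group, group + 1.0])
toy_labels = numpy.array([1.0] * 6 + [2.0] * 6)

scores = {}
for k in (1, 3, 5):
    scores[k] = int(sum(classify_loo(i, toy_data, toy_labels, k) == toy_labels[i]
                        for i in range(len(toy_data))))
    print('k =', k, 'matched', scores[k], 'of', len(toy_data))
```

Odd values of k avoid tied votes when there are only two classes. With the three wine classes ties remain possible, so how a tie is broken is a design choice you may want to make explicit.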