In [2]:
import numpy as np

In [3]:
import math

In [4]:
import os

In [5]:
import pandas as pd

Linear regression is a definite classic among the countless machine learning methods borrowed from statistics. Among the more recent methods invented by computer scientists, the so-called nearest neighbor method is an equally classic technique.

The nearest neighbor method is just about the simplest imaginable method. However, it is not to be trifled with: an aspiring machine learning researcher tends to be humbled now and then by having their latest inventions beaten by the good old nearest neighbor method.

The nearest neighbor method can be used for both regression and classification tasks. In regression, the task is to predict a continuous value, for example the price of a cabin, whereas in classification the output is a label chosen from a finite set of alternatives, for example sick or healthy.

In order to quantify how close one item is to another, we need to define a _distance metric_. The analogy implied by the word is pretty obvious: in everyday life you can say that Helsinki is closer to Stockholm than it is to New York by calculating how many meters apart they are.

The distance is calculated over a two-dimensional terrain. To simplify matters, we can pretend that the surface is flat and that the location of each city is expressed in a so-called Cartesian coordinate system. This way, distances can be calculated by adding up the squares of the differences in the x and y coordinates and taking the square root of the sum. For example, the distance between Helsinki and New York would then be:
$$
D_{\text{HEL, NY}} = \sqrt{(X_{\text{HEL}} - X_{\text{NY}})^2 + (Y_{\text{HEL}} - Y_{\text{NY}})^2}
$$


where $X_{\text{HEL}}$, $Y_{\text{HEL}}$ are the coordinates of Helsinki and $X_{\text{NY}}$, $Y_{\text{NY}}$ are the coordinates of New York. This is called _Euclidean distance_. Especially for long distances, it doesn't actually give the real distance because of the curvature of the surface of the Earth, but this is of no consequence for us because we will be using it on abstract, non-geographical coordinates.

_Euclidean distance_ is the measure of the straight-line distance between two points in Euclidean space. It is the most commonly used distance metric and corresponds to our intuitive understanding of distance in the physical world.

Perhaps you'll find it easier to read this in terms of the following Python expression:

In [6]:
x_hel = 34
x_ny  = 49
y_hel = 10
y_ny  = 50 
# Square root of the sum of the squared coordinate differences
D = math.sqrt((x_hel - x_ny) ** 2 + (y_hel - y_ny) ** 2)

They mean exactly the same thing.

Analogously, we could calculate the difference, or "distance", between cabin1, with 34 $m^2$ and 10 meters from the lake, and cabin2, with 49 $m^2$ and 50 meters from a lake, by treating the size in square meters, the distance to a lake, and any other input features as coordinates. With the above two features, the distance between cabin1 and cabin2 would become: $$D_{1,2} = \sqrt{(34-49)^2 + (10-50)^2}$$

In [7]:
D1_2 = math.sqrt(((34 - 49) ** 2) + ((10 - 50) ** 2))

In [8]:
round(D1_2, 3)

42.72

In [9]:
# In Python code, we could write the above as
cabin1 = np.array([34, 10])
cabin2 = np.array([49, 50])


In [10]:
D = math.sqrt((cabin1[0] - cabin2[0]) ** 2 + (cabin1[1] - cabin2[1]) ** 2)

In [11]:
round(D, 3)

42.72

We could also incorporate the number of bathrooms, et cetera, into the distance calculation just by including the squared difference in the number of bathrooms in the sum inside the square root.

Mathematically, we'd call the lists of input features vectors. In coding terms, they are lists or one-dimensional arrays.
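
As a minimal sketch of this vector view, the distance can be written once for feature vectors of any length. The helper name `euclidean` and the third feature (a number of bathrooms, with made-up values 1 and 2) are assumptions introduced here purely for illustration.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Size (m^2) and distance to the lake (m), as in the example above
d_two = euclidean([34, 10], [49, 50])

# Same cabins with a hypothetical third feature: number of bathrooms
d_three = euclidean([34, 10, 1], [49, 50, 2])

print(round(d_two, 3))  # 42.72
print(round(d_three, 3))
```

Adding a feature only adds one more squared difference under the square root, so the function body does not change with the number of features.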



Here is a short piece of code, using the same cabin pricing data as in the previous section. It uses training data from four cabins to predict the prices of two more cabins in the test data.

In [12]:
import math
import numpy as np

In [13]:

x_train = np.array([
    [25, 2, 50, 1, 500],
    [39, 3, 10, 1, 1000],
    [82, 5, 20, 2, 120],
    [130, 6, 10, 2, 600]
])

y_train = [127900, 222100, 268000, 460700]

x_test  = np.array([[115, 6, 10, 1, 560], [13, 2, 13, 1, 1000]])

def dist(a, b):
    # Euclidean distance; "total" avoids shadowing the built-in sum()
    total = 0
    for ai, bi in zip(a, b):
        total = total + (ai - bi) ** 2
    return np.sqrt(total)

n_train = len(x_train) # number of data points in the training data set

for test_item in x_test:
    d = np.empty(n_train) # d will hold the distances between this test data point and all the training data points
    for i, train_item in enumerate(x_train):
        d[i] = dist(test_item, train_item)
    # np.argmin() returns the index of the smallest value in an array,
    # so the nearest neighbor's price is y_train[nearest_index]
    nearest_index = np.argmin(d)
    print(y_train[nearest_index])

460700
222100


Let's dissect the code a bit.

The outer for loop goes through the two test data items (x_test). For each test data item we calculate the distance to each of the training data items in the inner loop.

To calculate the distances, we've defined the function (dist). We then find the index of the item with the shortest distance using the np.argmin() function. This is the nearest neighbor.
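
The same steps can also be sketched without explicit loops by letting NumPy compute all distances for one test item at once. This is only an alternative formulation of the idea, not a replacement for the code above; the function name `predict_nearest` is introduced here for illustration.

```python
import numpy as np

x_train = np.array([
    [25, 2, 50, 1, 500],
    [39, 3, 10, 1, 1000],
    [82, 5, 20, 2, 120],
    [130, 6, 10, 2, 600],
])
y_train = [127900, 222100, 268000, 460700]
x_test = np.array([[115, 6, 10, 1, 560], [13, 2, 13, 1, 1000]])

def predict_nearest(test_item, x_train, y_train):
    """Return the target value of the training item closest to test_item."""
    # Broadcasting subtracts test_item from every row of x_train
    d = np.sqrt(np.sum((x_train - test_item) ** 2, axis=1))
    return y_train[np.argmin(d)]

predictions = [predict_nearest(item, x_train, y_train) for item in x_test]
print(predictions)  # [460700, 222100]
```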

If you look at the output of the above program, you'll notice that the prices predicted for the two test data items are 460700 euros, and 222100 euros.

When comparing the cabins, the price for the first test item seems reasonable: it is the price of the last cabin in the training set, which is a similarly large cabin.

But the price of the second test cabin is the price of the second training set cabin, even though it would intuitively make more sense for the first cabin in the training set to be the closest, since they are of a similar size. Why, then, was the second cabin selected as the nearest neighbor?

As we mentioned, the distances are calculated using the Euclidean distance, which is the common straight-line distance of geometry. In this case, the vectors whose distance we are evaluating are defined by the numerical values of the five features.

In the case of the cabins, the sizes range between 13 and 130 square meters, while the distances to the closest neighboring cabin range between 120 and 1000 meters, so it is clear that when comparing just the numerical values, the differences in distances are much bigger than the differences in sizes.

In fact, in the above case, the nearest neighbor of the 13 square meter cabin would have been the same 39 square meter cabin if we had done the comparison based on the proximity of neighboring cabins alone.
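
To see how strongly one feature dominates, the squared differences can be inspected feature by feature. The sketch below assumes that the last column holds the distance to the closest neighboring cabin, as the value ranges quoted above suggest; the variable names are introduced here for illustration.

```python
import numpy as np

x_train = np.array([
    [25, 2, 50, 1, 500],
    [39, 3, 10, 1, 1000],
    [82, 5, 20, 2, 120],
    [130, 6, 10, 2, 600],
])
test_item = np.array([13, 2, 13, 1, 1000])  # the 13 square meter test cabin

# Squared difference per feature (columns) for each training cabin (rows)
sq_diff = (x_train - test_item) ** 2

# Share of each squared distance that comes from the last column,
# assumed here to be the distance to the closest neighboring cabin
neighbor_share = sq_diff[:, -1] / sq_diff.sum(axis=1)
print(np.round(neighbor_share, 3))

# Nearest neighbor using all features vs. using the last feature alone
nearest_all = np.argmin(sq_diff.sum(axis=1))
nearest_neighbor_only = np.argmin(sq_diff[:, -1])
print(nearest_all, nearest_neighbor_only)  # 1 1
```

Both comparisons pick index 1, the 39 square meter cabin, because its neighbor distance of 1000 meters matches the test cabin exactly.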

### Note: Which distance?

The slightly odd thing about using the Euclidean distance to compare cabins is that each dimension (or feature) is compared using the same scale. In other words, the distance between two otherwise identical cabins whose sizes differ by 100 square meters is the same as the distance between two otherwise identical cabins where one has neighbors 900 meters away and the other 1000 meters away.

This might feel a little strange, since a 100-meter difference in this case may be considered relatively insignificant compared to a 100 square meter size difference. Moreover, we could have decided to measure the area in square feet, in which case a difference of 100 units would have been even less significant, or in fact in square inches (one square meter is about 1550 square inches).

This is indeed an issue that has great practical significance. The definition of "distance" can make a big difference to the accuracy of the nearest neighbor method. Often the features are scaled so that they all have the same variance and therefore roughly the same weight or importance in the distance calculations.
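
A minimal sketch of such scaling follows, assuming standardization with the mean and standard deviation of the training data. This is one common choice and not the only one; the text above does not prescribe a specific scheme, and the variable names are introduced here for illustration.

```python
import numpy as np

x_train = np.array([
    [25, 2, 50, 1, 500],
    [39, 3, 10, 1, 1000],
    [82, 5, 20, 2, 120],
    [130, 6, 10, 2, 600],
], dtype=float)
x_test = np.array([[115, 6, 10, 1, 560], [13, 2, 13, 1, 1000]], dtype=float)

# Standardize using statistics from the training data only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std

# After scaling, every training feature has variance 1
print(x_train_scaled.var(axis=0))

# Nearest training cabin for each test cabin in the scaled feature space
nearest = []
for test_item in x_test_scaled:
    d = np.sqrt(np.sum((x_train_scaled - test_item) ** 2, axis=1))
    nearest.append(int(np.argmin(d)))
    print(nearest[-1], np.round(d, 2))
```

With only four training cabins the nearest neighbor may or may not change after scaling; the point of the sketch is that no single feature dominates the sum merely because of its units.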

The Euclidean distance is of course just one of many different distance metrics. One simple and easily understandable (though maybe not the most often used) metric is the Manhattan (or taxicab) metric, where the distance is calculated by adding up the absolute differences in the coordinates. Think of a grid-like city with equal-sized blocks: it does not matter which route you choose, as long as you are always moving closer to your goal in either direction.
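
As a small sketch, the Manhattan distance between the two example cabins used earlier (34 and 49 square meters, 10 and 50 meters from the lake) could be computed like this. The function name `manhattan` is introduced here for illustration.

```python
import numpy as np

def manhattan(a, b):
    """Sum of the absolute coordinate differences (taxicab distance)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b))

cabin1 = np.array([34, 10])
cabin2 = np.array([49, 50])

print(manhattan(cabin1, cabin2))  # 55.0
```

The result, 15 + 40 = 55, is larger than the Euclidean distance of about 42.72 for the same pair, since the taxicab route cannot cut diagonally across the grid.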

# The Yle Areena content recommender:

All of us are familiar with content recommendation systems. Usually they recommend content similar to what we have already watched.
This is fine for most for-profit entertainment companies. But as a public broadcaster, the mission of the Finnish public broadcasting company Yle is to serve the public interest, which requires a broader range of content. With that in mind, Yle decided to build a system that recommends a diverse but still relevant range of shows to viewers who use Yle Areena, its online streaming service.

#### Personalization that serves a purpose:

In order to create relevant but diverse content recommendations, AI methods based on optimization and machine learning were key. The starting point was to use an existing algorithm based on collaborative filtering to take care of the content recommendations. The algorithm was then adjusted to ensure that it didn't recommend only the most popular content from generic categories like drama, but would also suggest a broader range of shows.


#### AI methods used:

* Collaborative filtering algorithms
* Deep learning
* Reinforcement learning

#### Insights from a data scientist:

"When you work on an ongoing project that lasts for many years, you encounter different kinds of challenges than those in a one-off project. In our case, the content recommender was working well for our viewers, but we found ourselves having to explain to people inside the company why the algorithm was recommending what it did. In the end, we built a dashboard for internal use so the editorial staff at Yle could see in real time why certain types of content are recommended.

It is a good lesson to remember - it is not enough to just create a good algorithm and a clear UI, you also need to be able to explain how the algorithm works and why it is trustworthy.

This project also raises interesting ethical questions. One of our goals was to ensure that more diverse content is recommended. When we first started, we used a commonly known algorithm used by, for example, the most popular platforms, but we noticed in our A/B tests that it worked best only with drama and entertainment content. That is why we had to develop and implement a more Yle-specific recommendation system that would not ignore educational and cultural content. And it worked - last year we managed to increase our diversity index (a measure of how many types of content are being viewed) while still delivering relevant content to our users."

                           _Jaakko Lempinen, Head of AI at Yle_