## Writing our first machine learning algorithm

Machine Learning is not magical. It's just logic and mathematics.

### 1. Standing on the shoulders of giants

As you already know, it's just not worth writing out ALL the code ourselves. For instance, plotting a graph would take ages if we did it from scratch.

Let's use some libraries to make things easier.

In [None]:
#1. import the graph library
from graphLibrary import * #import all the possible functions
#2. make the graphs display in and amongst the code to make it easy to see!
%matplotlib inline 
#3. make the graphs and charts bigger!
pylab.rcParams['figure.figsize'] = 12, 8  
print ("graph library loaded!")
#4. Spreadsheet library
import pandas as pd
print ("spreadsheet loaded!")
#numpy is a maths library
import numpy as np
print("maths loaded! 5 * 5 = %d" % (5 * 5))

### 2. We need some data!

Import our data from a CSV file.

In [None]:
dataset = pd.read_csv('wine.csv')
dataset.head().transpose()
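
If you don't have `wine.csv` to hand, the same calls can be tried on a tiny hand-made table. This is only a sketch: the rows below are made up for illustration and are not values from the real file.

```python
import io
import pandas as pd

# Made-up rows for illustration only; these are NOT values from wine.csv.
csv_text = """Alcohol,Magnesium
13.0,100
12.5,95
14.1,120
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows (5 by default); transpose() swaps rows and
# columns so each variable becomes a row, which is easier to read when a
# table has many columns.
preview = toy.head().transpose()
print(preview)
```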

### 3. Visualising data

Now that we can see our data, let's choose which two variables we want to cluster on, and plot them against each other on a scatter plot.

In [None]:
column1 = "Alcohol"    # name of the first column to cluster on
column2 = "Magnesium"  # name of the second column to cluster on
# let's have a look at the data!
plt.scatter(dataset[column1], dataset[column2])

### 4. Pick a random starting point

OK, so to do k-means clustering, we need to give our cluster centers random starting points. By default, we'll try 3 clusters.

We will place the random cluster centers somewhere within the range of the 2 columns we have chosen above.

In [None]:
clusterDF = createRandomClusterCenters(dataset, range1=column1, range2=column2)
clusterDF.head()
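
`createRandomClusterCenters` comes from `graphLibrary`, so its internals aren't shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch. The function name `sketch_random_centers`, its `k` and `seed` parameters, and the 1-based index are all assumptions made for illustration, not the library's actual code.

```python
import numpy as np
import pandas as pd

def sketch_random_centers(data, range1, range2, k=3, seed=None):
    """Hypothetical stand-in for createRandomClusterCenters (not the real helper)."""
    rng = np.random.RandomState(seed)
    # Draw each coordinate uniformly between the column's minimum and maximum,
    # so every center starts somewhere inside the cloud of data points.
    cx = rng.uniform(data[range1].min(), data[range1].max(), size=k)
    cy = rng.uniform(data[range2].min(), data[range2].max(), size=k)
    # Index labels 1..k are an assumption, chosen to match the clusterDF.loc[1]
    # style of lookup used later in the notebook.
    return pd.DataFrame({"CX": cx, "CY": cy}, index=range(1, k + 1))

# Made-up values, just to exercise the function.
toy = pd.DataFrame({"Alcohol": [12.0, 13.5, 14.2], "Magnesium": [90, 110, 130]})
centers = sketch_random_centers(toy, "Alcohol", "Magnesium", seed=0)
print(centers)
```

Passing a `seed` makes the "random" positions repeatable, which is handy when you want to compare runs.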

### 5. Take another look

As a data scientist, it's always important to just have a look at the data.

In [None]:
#draw a scatter plot of the original data
plt.scatter(dataset[column1], dataset[column2])

#put the random cluster centers on the graph
plt.scatter(clusterDF['CX'], clusterDF['CY'], color='orange', s=100)
plt.title(column1 + " vs " + column2)
#label both axes so we know which variable is which
plt.xlabel(column1)
plt.ylabel(column2)

In [None]:
#just for reference, let's have a look again
print("cluster centers:")
print(clusterDF.head())
print("")
print("dataset: ")
print(dataset.head().transpose())

### 6. Writing an ML algorithm!

OK... Now over to you. We know that we are interested in the distance from each point to the nearest cluster center. 

You are now going to write the algorithm that calculates the distance from each point to each cluster center. Remember, at the moment the cluster centers are just randomly positioned, but this makes no difference to the calculation!

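Pythagoras' theorem gives us the straight-line distance, where `x` and `y` are the differences between the point and the cluster center along each axis: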

```
distance = √(x*x + y*y)
```

... to calculate a square root we'll need that maths library from earlier, NumPy!
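
As a quick sanity check of the formula, here is a small worked example. The point and center coordinates are made up so that the answer comes out as the classic 3-4-5 triangle.

```python
import numpy as np

# A data point and a cluster center, both made up for this example.
point_x, point_y = 4.0, 6.0
center_x, center_y = 1.0, 2.0

# x and y in the formula are the differences along each axis.
x = point_x - center_x  # 3.0
y = point_y - center_y  # 4.0

# Pythagoras: the square root of the sum of the squared differences.
distance = np.sqrt(x * x + y * y)
print(distance)  # 5.0
```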

In [None]:
def calculateDistances():
    print("running calculate distances...")

    #note: this assumes clusterDF's index labels are 1, 2 and 3
    #X differences between every data point and each cluster center
    x1 = dataset[column1] - clusterDF.loc[1]["CX"]
    x2 = dataset[column1] - clusterDF.loc[2]["CX"]
    x3 = dataset[column1] - clusterDF.loc[3]["CX"]

    #Y differences between every data point and each cluster center
    y1 = dataset[column2] - clusterDF.loc[1]["CY"]
    y2 = dataset[column2] - clusterDF.loc[2]["CY"]
    y3 = dataset[column2] - clusterDF.loc[3]["CY"]

    #Pythagoras: distance from every data point to each cluster center
    dataset["distance1"] = np.sqrt(x1 * x1 + y1 * y1)
    dataset["distance2"] = np.sqrt(x2 * x2 + y2 * y2)
    dataset["distance3"] = np.sqrt(x3 * x3 + y3 * y3)
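
The function above writes out each of the 3 clusters by hand. If you wanted to try a different number of clusters, one option is to loop over the rows of the cluster table instead. The sketch below is an illustrative variant, not part of the notebook's library: the name `sketch_distances` and its arguments are assumptions, and the toy numbers are made up so the results are easy to check.

```python
import numpy as np
import pandas as pd

def sketch_distances(data, centers, col_x, col_y):
    """Add one 'distanceN' column per row of centers (illustrative helper)."""
    for label, center in centers.iterrows():
        dx = data[col_x] - center["CX"]  # difference along the first axis
        dy = data[col_y] - center["CY"]  # difference along the second axis
        data["distance" + str(label)] = np.sqrt(dx * dx + dy * dy)
    return data

# Two made-up points and two made-up centers sitting exactly on those points.
toy = pd.DataFrame({"a": [0.0, 3.0], "b": [0.0, 4.0]})
toy_centers = pd.DataFrame({"CX": [0.0, 3.0], "CY": [0.0, 4.0]}, index=[1, 2])

sketch_distances(toy, toy_centers, "a", "b")
print(toy)
```

Each point is distance 0 from the center sitting on top of it and distance 5 from the other one, which matches the 3-4-5 triangle.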


### 7. Put each point in a cluster

OK, so intuitively, for each data point, whichever cluster center is the shortest distance away must be the closest cluster, so we make the point a member of that cluster.

In [None]:
def whichCluster(datapoint):
    if datapoint["distance1"] < datapoint["distance2"] and datapoint["distance1"] < datapoint["distance3"]:
        return "C1"
    elif datapoint["distance2"] < datapoint["distance3"] and datapoint["distance2"] < datapoint["distance1"]:
        return "C2"
    else:
        return "C3"
print ("function to cluster datapoints ready!")
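
`assignToNewCluster` is another `graphLibrary` helper whose code isn't shown here. One plausible way to apply a function like `whichCluster` to every row is pandas' `apply` with `axis=1`. The sketch below assumes the result is stored in a column called `cluster` (a hypothetical name) and uses made-up distances; it restates the comparison as `which_cluster` so the block runs on its own.

```python
import pandas as pd

def which_cluster(row):
    # Same comparisons as whichCluster, restated so this block is self-contained.
    if row["distance1"] < row["distance2"] and row["distance1"] < row["distance3"]:
        return "C1"
    elif row["distance2"] < row["distance3"] and row["distance2"] < row["distance1"]:
        return "C2"
    return "C3"

# Made-up distances for three data points.
toy = pd.DataFrame({
    "distance1": [0.5, 2.0, 3.0],
    "distance2": [1.5, 0.2, 3.0],
    "distance3": [2.5, 1.0, 0.1],
})

# apply(..., axis=1) calls the function once per row.
toy["cluster"] = toy.apply(which_cluster, axis=1)

# For comparison: idxmin(axis=1) returns the name of the smallest column per row.
nearest = toy[["distance1", "distance2", "distance3"]].idxmin(axis=1)

print(toy["cluster"].tolist())
print(nearest.tolist())
```

The two approaches agree unless there is an exact tie: the if/elif version sends ties to `C3`, while `idxmin` picks the first column that holds the minimum.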

### 8. Improving our clusters

Now we need to run a few steps to calculate the clusters. Each time we run this cell, we can see the cluster centers move
and each data point is reassigned to its nearest cluster... machine learning in action!

In [None]:
print("Cluster Centers BEFORE running calculation")
print(clusterDF.head())

#1. calculate the distances to each center
calculateDistances()

#2. assign each point to a cluster
assignToNewCluster(dataset, whichCluster)

#3. move the cluster centers to the middle of their clusters
updateClusterCenterPositions(dataset, clusterDF, column1, column2)

#4. plot the data and let's see it move!
plotClusters(dataset, clusterDF, column1, column2)
print("")
print("Cluster Centers AFTER running calculation")
clusterDF.head()
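
Step 3 is where the "learning" happens: each center moves to the average position of the points currently assigned to it. `updateClusterCenterPositions` is a `graphLibrary` helper, so the following is only a sketch of the idea. The function name, the `cluster` column, and the `"C" + label` naming are assumptions, and the data is made up.

```python
import pandas as pd

def sketch_update_centers(data, centers, col_x, col_y, label_col="cluster"):
    """Move each center to the mean of its assigned points (illustrative)."""
    # Average position of the points in each cluster.
    means = data.groupby(label_col)[[col_x, col_y]].mean()
    for label in centers.index:
        key = "C" + str(label)
        if key in means.index:  # a center with no points stays where it is
            centers.loc[label, "CX"] = means.loc[key, col_x]
            centers.loc[label, "CY"] = means.loc[key, col_y]
    return centers

# Made-up points already assigned to clusters C1 and C2 (none in C3).
toy = pd.DataFrame({
    "x": [0.0, 2.0, 10.0, 12.0],
    "y": [0.0, 2.0, 10.0, 12.0],
    "cluster": ["C1", "C1", "C2", "C2"],
})
centers = pd.DataFrame({"CX": [5.0, 6.0, 7.0], "CY": [5.0, 6.0, 7.0]}, index=[1, 2, 3])

sketch_update_centers(toy, centers, "x", "y")
print(centers)
```

Here center 1 moves to (1, 1), center 2 moves to (11, 11), and center 3 stays at (7, 7) because no points were assigned to it. Repeating the distance, assignment, and update steps until the centers stop moving is the heart of k-means.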