## A Machine Learning example
* the steps in a machine learning pipeline
* a few recurring pieces of Python code that we will use and reuse during the course

## 1. Analyse your Data

* import the libraries that have the tools that we need

In [None]:
import numpy as np
import pandas as pd

* Load data from file

In [None]:
iris_dataset = pd.read_csv("data/iris.csv")

* show the first 5 rows of our dataset

In [None]:
iris_dataset.head()

* show the column names

In [None]:
iris_dataset.columns

* list the unique values of our target column

In [None]:
iris_dataset["Species"].unique()

* convert the strings into numbers using a scikit-learn class called LabelEncoder
* in most cases the target will already be stored as numbers in the original CSV file, but this is a good example of data preprocessing

In [None]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
iris_dataset["Species"] = LE.fit_transform( iris_dataset["Species"] )


* show our dataset after the LabelEncoder operation

In [None]:
iris_dataset.head()

* so we have 3 classes : class 0, class 1 and class 2
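* to see which number was assigned to which name, a fitted LabelEncoder keeps the original names, in sorted order, in its `classes_` attribute: the position of each name is its code. A minimal sketch on a small hand-made list of labels (the label strings are assumed for illustration and may differ from those in your CSV file):

```python
from sklearn.preprocessing import LabelEncoder

# hand-made labels, assumed for illustration only
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the original names in sorted order; the index of each name is its code
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}

print(mapping)
print(codes)
```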


* visualise the data to check for relevant, useful features

In [None]:
pd.plotting.scatter_matrix(iris_dataset, c=iris_dataset["Species"], figsize=(15, 15))

## 2. Define the features X and the target y

#### Which feature is certainly irrelevant?

* the ID column is irrelevant: it is only a row identifier, not a measurement of the flower, and keeping it as a feature would cause issues.
* we select columns from the dataset to build X (the features) and y (the target)

In [None]:
X = iris_dataset[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
X.shape

In [None]:
y = iris_dataset['Species']
y.shape
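* an equivalent way to build X is to drop the columns we do not want instead of listing the ones we keep. A small sketch on a toy DataFrame; the column name `Id` is an assumption about how the identifier column is called in the CSV file:

```python
import pandas as pd

# toy stand-in for iris_dataset; "Id" is an assumed name for the identifier column
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": [0, 1, 2],
})

# drop the identifier and the target: everything left is a feature
X_toy = toy.drop(columns=["Id", "Species"])
y_toy = toy["Species"]

print(X_toy.shape)
print(y_toy.shape)
```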

## 3. Divide the data into 2 splits: training set and testing set

* we need to divide our data into a training set and a testing set, because what we want is an algorithm that generalises well to unseen data.
* it is easy to build a model that performs very well on examples for which it has already been given the correct answers. What we really want is a model that learns a general mapping between the input and the desired output, so that it can apply what it learned to data it has never seen.
* the only way to check this is to keep part of the data aside: we use the training set to teach the algorithm and the testing set to evaluate whether it performs well on unseen data.
* note: by default the function train_test_split shuffles the data before splitting.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.30)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
print(y_train.shape)
print(y_test.shape)
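* because the split is random, the three classes may not be equally represented in the testing set. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. A sketch, assuming the copy of the iris data bundled with scikit-learn (`load_iris`) rather than our CSV file:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# bundled copy of the iris data (150 flowers, 50 per class), used here instead of the CSV
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions equal in the training and testing sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# number of test flowers per class
print(np.bincount(y_te))
```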

## 4. Create the model

* we use the k-nearest neighbors (KNN) algorithm (we will cover the theory of this algorithm in the next lesson)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
# notice that when we create the model we decide a parameter (in this case the number of neighbors)

## 5. Train the model

In [None]:
knn.fit(X_train, y_train)

## 6. Evaluate the model

* show the accuracy score of the model that we trained (the number is the fraction of flowers that the model classifies correctly, between 0 and 1)

In [None]:
knn.score(X_train, y_train)

#### Evaluating the model
* the score above tells us that when we feed the model data that it has already seen, it is 97% accurate in making predictions.
* how good is the model on unseen data?
* this is the time to use the data that we set aside for testing.
* if we found that the predictions were not accurate enough, we would then try to change and refine our model (for instance by changing the number of neighbours in the KNN).

In [None]:
knn.score(X_test, y_test)
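* the value returned by `score` is the mean accuracy: the fraction of flowers whose predicted class equals the true class. The sketch below checks this by hand, assuming the iris data bundled with scikit-learn (`load_iris`) instead of our CSV file:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# bundled copy of the iris data, used here instead of the CSV
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, test_size=0.30)

model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

# accuracy computed by hand: share of test flowers predicted correctly
manual_accuracy = np.mean(model.predict(X_te) == y_te)

print(manual_accuracy)
print(model.score(X_te, y_te))
```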

## 7. Tune the parameters of the model to increase the performance

* the model did very well in this case, so there is no specific need to change the n_neighbors parameter
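* if the score had been disappointing, one simple way to explore the parameter would be to train one model per candidate value and compare the scores. A sketch, assuming the iris data bundled with scikit-learn (`load_iris`) instead of our CSV file; the list of candidate values is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# bundled copy of the iris data, used here instead of the CSV
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, test_size=0.30)

# candidate values for n_neighbors (an arbitrary choice for illustration)
candidates = [1, 3, 5, 7, 9]

# train one model per candidate and record its accuracy on the testing set
scores = {}
for k in candidates:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)

for k, accuracy in scores.items():
    print(k, round(accuracy, 3))
```

* in practice this comparison is usually done on a separate validation set or with cross-validation, so that the testing set stays untouched until the final evaluation.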


## 8. Make predictions

* pretend that we sent the model to a junior botanist and that she collected a flower
* she took these measurements from the flower:
    * SepalLengthCm = 5 cm
    * SepalWidthCm = 2.9 cm
    * PetalLengthCm = 1 cm
    * PetalWidthCm = 0.2 cm
* these are exactly the features that she needs to run the model that we sent to her

In [None]:
# first row: the botanist's flower; second row: a second, made-up flower
flower = np.array([[5, 2.9, 1, 0.2], [7, 1, 2, 0.5]])

In [None]:
predictions = knn.predict(flower)
print(predictions)

* the first prediction, for the botanist's flower, is class zero
* we can use the inverse transform of the LabelEncoder to recover the original name of the flower
* in your exercise you will not need this step since the targets are already numbers

In [None]:
LE.inverse_transform(predictions)
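* since the model was trained on a DataFrame, recent versions of scikit-learn may warn that a plain NumPy array has no feature names. Passing the new measurements as a DataFrame with the same column names also removes any doubt about the order of the columns. A sketch, assuming the iris data bundled with scikit-learn (`load_iris`), renamed to the column names used in this notebook:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

# bundled copy of the iris data, used here instead of the CSV
iris = load_iris()
X_demo = pd.DataFrame(iris.data, columns=columns)
y_demo = iris.target

model = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# the botanist's flower, with each measurement labelled by its column name
new_flower = pd.DataFrame([[5, 2.9, 1, 0.2]], columns=columns)

prediction = model.predict(new_flower)
print(prediction)

# the bundled dataset stores the class names in target_names
print(iris.target_names[prediction])
```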