# Machine Learning / Aprendizagem Automática

## Sara C. Madeira, 2019/20

# Practical 02 + Practical 03 - Introduction to Scikit-learn

## 0. Getting Started

In this tutorial we use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org), [Scikit-learn](http://scikit-learn.org/stable/) and other Python technical libraries such as [Pandas](http://pandas.pydata.org). 

**In the lab** Python 3, Jupyter and Scikit-learn should be installed on both Windows and Linux. 

**At home** you can get Python 3, Jupyter and Scikit-learn by installing [Anaconda](https://www.anaconda.com). 

Once you have Python 3, Jupyter and Scikit-learn, you can follow this tutorial by running the examples interactively in the Jupyter notebook (file AA_201920_TP02_TP03.ipynb) or by running them in an IDE such as Spyder, which is installed with Anaconda.

The tutorial has two main parts:

* **PART 1. Python Technical Libraries**: providing an overview of Pandas, NumPy, Scikit-learn, Matplotlib and SciPy. The focus is on Pandas, a powerful tool for data analysis, and on the relationship between Pandas, NumPy, and Scikit-learn. 

You should complete this part in **Practical 02**.

* **PART 2. Scikit-learn**: providing a first introduction to Scikit-learn by using a classical supervised learning problem as example. 

You should complete this part in **Practical 03**.

## 1. Python's Technical Libraries (Practical 02)

Python has an excellent suite of libraries available that make it a first‐class environment for technical computing. The main ones are the following ([The Data Science Handbook, F. Cady, Wiley, 2017](http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1119092949.html)):
* **Pandas**: This is a very important library for data analysis. It stores and operates on data in data frames, very efficiently and with a sleek, intuitive API. More info on Pandas [here](http://pandas.pydata.org). 
* **NumPy**: A library for dealing with numerical arrays in ways that are fast and memory efficient, but it is clunky and low level for a user. Under the hood, Pandas operates on NumPy arrays. More info on NumPy [here](http://www.numpy.org).
* **Scikit‐learn**: This is the main machine learning library and it operates on NumPy arrays. You can take Pandas objects, turn them into NumPy arrays, and then plug them into scikit‐learn. More info on Scikit-learn [here](http://scikit-learn.org/stable/).
* **Matplotlib**: This is the main plotting and visualization library. Similar to NumPy, it is low level and a bit clunky to use directly. Pandas provides human‐friendly wrappers that call matplotlib routines. More info on Matplotlib [here](https://matplotlib.org).
* **SciPy**: This provides a suite of functions that perform fancy numerical operations on NumPy arrays. More info on SciPy [here](https://www.scipy.org).

Note that these are not the only technical computing libraries available in Python, but they are by far the most popular, and together they form a cohesive, powerful tool suite for data analysis and machine learning. **NumPy** is the most fundamental library; it defines the core numerical arrays that everything else operates on. However, most of your actual code (especially data munging and feature extraction) will be working within **Pandas**, only switching to the other libraries as needed, and the machine learning algorithms will be executed using **Scikit‐learn**. 

### 1.1. NumPy

NumPy is one of the fundamental packages for scientific computing in Python. It provides multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudo-random number generators.

**The NumPy array is the fundamental data structure in scikit-learn**. Scikit-learn takes in data in the form of NumPy arrays. Any data you are using will have to be converted to a NumPy array. 

The core of NumPy is the **``ndarray``** class: an array with n dimensions in which all elements must be of the same type. 

A NumPy array looks like this:

In [None]:
import numpy as np

np

In [None]:
# creating a ndarray

x = np.array([[1, 2, 3], [4, 5, 6]])
x

In [None]:
type(x)

An **ndarray** can be used like a matrix, with index access to rows and to individual values.

In [None]:
# first row

x[0]

In [None]:
# value on first row and second column

x[0][1]
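
The properties mentioned above (number of dimensions, a single element type, index access) can be inspected directly. The following is a small sketch; the array `mixed` is made up for illustration, and the exact integer dtype printed for `x` may depend on your platform.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

# shape is (rows, columns); ndim is the number of dimensions
print(x.shape, x.ndim)

# all elements share a single dtype (an integer type here)
print(x.dtype)

# x[0, 1] selects the same element as x[0][1], in a single indexing step
print(x[0, 1])

# mixing integers and floats converts every element to a float type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```

The form `x[0, 1]` is usually preferred over `x[0][1]` because it does not build the intermediate row first.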

**Both Pandas and Scikit-learn use NumPy functions and the ndarray data structure.**

### 1.2. SciPy

SciPy is a collection of functions for scientific computing. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. **Scikit-learn draws from SciPy’s collection of functions for implementing its algorithms.**

The most important part of SciPy for us is scipy.sparse, which provides **sparse matrices**, **another representation that is used for data in Scikit-learn**. Sparse matrices are used whenever we want to store a 2D array that contains mostly zeros and we want to efficiently store all the non-zero values without wasting memory storing the zeros:

In [None]:
from scipy import sparse

sparse

In [None]:
# create a 2d numpy array with a diagonal of ones, and zeros everywhere else

eye = np.eye(4)

print("Numpy array:\n", eye)

In [None]:
# convert the numpy array to a scipy sparse matrix in CSR format
# In CSR (Compressed Sparse Row matrix) format only the non-zero entries are stored

sparse_matrix = sparse.csr_matrix(eye)

print("\nScipy sparse CSR matrix:\n", sparse_matrix)
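
A sparse matrix can also be built without creating the dense array first, which is what you would typically want when the dense version does not fit in memory. The sketch below assumes the COO (coordinate) format, where you pass the non-zero values together with their row and column positions; the variable names are chosen only for this example.

```python
import numpy as np
from scipy import sparse

# COO format: the non-zero values plus their row and column positions
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))

# nnz is the number of stored entries
print(eye_coo.nnz)

# toarray() converts the sparse matrix back to a dense NumPy array
dense = eye_coo.toarray()
print(dense)
```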

### 1.3. Matplotlib

Matplotlib is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, scatter plots, and so on. Visualizing your data and any aspects of your analysis can give you important insights.

In [None]:
# We use the 'inline' backend only in Jupyter
# This allows the matplotlib graphs to be included in the notebook next to the code
%matplotlib inline

import matplotlib.pyplot as plt
# Generate a sequence of integers
x = np.arange(20)

# create a second array using the sine function
y = np.sin(x)

# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")

### 1.4. Pandas

[Pandas](http://pandas.pydata.org) is a powerful and widely used Python library for data wrangling and analysis. It is built around a data structure called **DataFrame**, that is modeled after the R DataFrame. Simply put, a Pandas DataFrame is a table, similar to an Excel Spreadsheet. Pandas provides a great range of methods to modify and operate on this table, in particular it allows SQL-like queries and joins of tables. Another valuable tool provided by Pandas is its ability to ingest from a great variety of file formats and databases, such as SQL, Excel files, and comma separated value (CSV) files.

The rest of this section is a quick crash course on the **basic data structures of Pandas: Data Frames and Series**, following ideas and examples in [The Data Science Handbook, F. Cady, Wiley, 2017](http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1119092949.html). You can read more on Pandas [here](http://pandas.pydata.org). A very good reference on Pandas is the book [Python for Data Analysis](http://wesmckinney.com/pages/book.html), written by the creator of Pandas, Wes McKinney.

### Data Frames

The central data structure in Pandas is called a **DataFrame**, which is a table with rows and columns, where each column holds data of a particular type (such as integers, strings, or floats). 

DataFrames make it easy and efficient to apply a function to every element in a column or to calculate aggregates such as the sum of a column. Some of the basic operations on data frames are shown in what follows (**more info** [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)).

In [None]:
# import Pandas and alias it as pd

import pandas as pd

pd

In [None]:
# Making a DataFrame from a Dictionary that maps column names to their values

df = pd.DataFrame({
  "name": ["Bob", "Alex", "Janice"],
  "age": [60, 25, 33]
  })

df

In [None]:
# size: total number of elements (rows x columns)

df.size

In [None]:
# number of columns 

len(df.columns)

In [None]:
# column identifiers

df.columns

In [None]:
# number of rows 

len(df)

In [None]:
# Checking values of a specific column

df["age"]

In [None]:
# Checking values of a specific column

df["name"]

In [None]:
# Making new columns from old ones is really easy

df["age_plus_one"] = df["age"] + 1 

df["age_times_two"] = 2 * df["age"] 

df["age_squared"] = df["age"]**2

df["over_30"] = (df["age"] > 30) # this column is bools

df

In [None]:
# Checking values of column "over_30"

df["over_30"]

In [None]:
# The columns can have several built-in aggregate functions such as:

#sum
total_age = df["age"].sum()

total_age

In [None]:
# The columns can have several built-in aggregate functions such as:

# median
median_age = df["age"].quantile(0.5)

median_age

In [None]:
# The columns can have several built-in aggregate functions such as:

# mean
mean_age = df["age"].mean()

mean_age
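
As a small complement to the aggregates above, the sketch below shows two alternatives that, to the best of our knowledge, behave the same way: `median()` as a shorthand for `quantile(0.5)`, and `agg()` to compute several aggregates in one call. The variable name `summary` is chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Bob", "Alex", "Janice"],
    "age": [60, 25, 33]
})

# median() is equivalent to quantile(0.5)
median_age = df["age"].median()
print(median_age)

# agg() computes several aggregates at once and returns a Series
summary = df["age"].agg(["min", "max", "mean"])
print(summary)
```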

In [None]:
# You can select several rows of the DataFrame and make a new DataFrame out of them

print(df)

df_age_below_50 = df[df["age"] < 50]

df_age_below_50

In [None]:
# You can select several rows of the DataFrame and make a new DataFrame out of them

df_below_30 = df[df["over_30"] == False]

df_below_30

In [None]:
# You can also make selections using more complex logic expressions

# all rows with 'age' above 30, but without the name  "Bob"

df_30_notBob = df[(df['age']>30) & ~(df['name'] == 'Bob')]
df_30_notBob
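
The same kind of selection can be written in other ways. The sketch below shows two options that should give the same rows on this small example: `query`, which takes the condition as a string, and `loc`, which accepts a boolean mask together with the columns to keep. The variable names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Bob", "Alex", "Janice"],
    "age": [60, 25, 33]
})

# query(): the condition is written as a string using column names
older_not_bob = df.query("age > 30 and name != 'Bob'")
print(older_not_bob)

# loc[]: boolean mask for the rows, column label for the columns
names_only = df.loc[(df["age"] > 30) & (df["name"] != "Bob"), "name"]
print(names_only)
```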

In [None]:
# You can also apply a custom function to a column 

def f(x):
    return x**2

df["age_squared"] = df["age"].apply(f)

df

In [None]:
# Apply a custom function to a column - another way

df["age_squared"] = df["age"].apply(lambda x: x**2)

df

In [None]:
# Creating the index

df = pd.DataFrame({
  "name": ["Bob", "Alex", "Jane"],
  "age": [60, 25, 33]
  })

df.index

In [None]:
# prints 0 to 2, the row labels of the default index

nr = len(df) # number of rows

for i in range(nr):
    print(df.index[i])

In [None]:
# Get row 0 (data in df is indexed by row number starting at 0)

df.iloc[0]

In [None]:
# Create a DataFrame containing the same data, but where column "name" is the index

df_w_name_as_index = df.set_index("name")

df_w_name_as_index

In [None]:
# data in df is now indexed by name

df_w_name_as_index.index 

In [None]:
# Get the row for Bob (row where "name"='Bob')

bobs_row = df_w_name_as_index.loc["Bob"] 

bobs_row

In [None]:
# Get value in column "age" in the row where "name" is Bob

bobs_row["age"]
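
The difference between the two accessors used above can be summarized as follows: `loc` selects by label, `iloc` selects by integer position. The sketch below illustrates this on the same small table; the variable names are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Bob", "Alex", "Jane"],
    "age": [60, 25, 33]
}).set_index("name")

# loc: row label and column label in a single step
bob_age_by_label = df.loc["Bob", "age"]

# iloc: row position and column position (both start at 0)
first_age_by_position = df.iloc[0, 0]

# loc also accepts a list of labels and then returns a DataFrame
several = df.loc[["Alex", "Jane"]]

print(bob_age_by_label, first_age_by_position)
print(several)
```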

#### Exercise 1

Modify the following code to create a DataFrame from Dictionary data and then try some of the functions described [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [None]:
# create a simple dataset of people

data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "London"],
        'Age' : [24, 13, 53, 33]
       }
data

# create a DataFrame from the dictionary data
# ...

#### Exercise 2

Create a **.csv** file (comma separated file) named **``myfile.csv``** with 4 rows and 2 columns. The first row should have the column names: **``name``** and **``age``**, and rows 1 to 3 the following values:  Bob, 60; Alex, 25; and Jane, 33. 

Create a Data Frame from your .csv file as follows:

In [None]:
# Reading a DataFrame from a file

other_df = pd.read_csv("myfile.csv")

other_df
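
If you want to check the expected layout of the file before creating it on disk, one option is to parse the same text from memory. This is only a sketch: it assumes the file content described in the exercise and uses `io.StringIO` as a stand-in for the file.

```python
import io
import pandas as pd

# the text that myfile.csv is expected to contain: a header row plus three data rows
csv_text = "name,age\nBob,60\nAlex,25\nJane,33\n"

# read_csv accepts file-like objects as well as file names
other_df = pd.read_csv(io.StringIO(csv_text))

print(other_df)

# the column types are inferred from the values
print(other_df.dtypes)
```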

### Series

Besides DataFrames, the other important data structure in Pandas is **Series**: a column in a DataFrame is a Series. Conceptually, a Series is just an array of data objects with an index associated. The columns of a DataFrame are Series objects that all happen to share the same index. The following code shows some  basic Series operations (**more info** [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)):

In [None]:
# import Pandas and alias it as pd
import pandas as pd

pd

In [None]:
# Make Series from list

s = pd.Series([1,2,3]) 

# display the values in s. Note that the index is on the left.
s

In [None]:
# Add a number to each element of s

s+2 

In [None]:
# Note that s did not change

s

In [None]:
# if you want the change to be effective, assign the result back to s

s = s+2

s

In [None]:
# Adding two series will add corresponding elements to each other

s + pd.Series([4,4,5])

In [None]:
# Note again that s did not change

s

In [None]:
# As above, s only changes when the result is assigned back to it

s = s + pd.Series([4,4,5])

s
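
Because a Series carries an index, arithmetic between two Series matches elements by index label rather than by position. The sketch below uses two made-up Series with partly different labels to illustrate this; labels that appear in only one of them produce a missing value (NaN) unless a fill value is given.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# elements are matched by label: only "y" and "z" exist in both Series
total = a + b
print(total)

# add() with fill_value treats a missing label as 0 instead of producing NaN
filled = a.add(b, fill_value=0)
print(filled)
```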

### Joining and Grouping

So far we have focused on the following DataFrame operations: 
* Creating data frames
* Adding new columns that are derived from basic operations on existing columns
* Using simple conditions to select rows in a DataFrame
* Aggregating columns
* Setting columns to function as an index, and using the index to pull out rows of the data.

This section discusses two more advanced operations: **joining and grouping**. 

These may be familiar to you from working with SQL. You can read more on **join** [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) and more on **groupby** [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html).

[Joining](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) is used if you want to combine two separate data frames into a single frame containing all the data. 

We take two data frames, match up rows that have a common index, and combine them into a single frame.

In [None]:
df_w_age = pd.DataFrame({
  "name": ["Tom", "Tyrell", "Claire"],
  "age": [60, 25, 33]
  })

df_w_age

In [None]:
df_w_height = pd.DataFrame({
  "name": ["Tom", "Tyrell", "Claire"],
  "height": [6.2, 4.0, 5.5]
  })

df_w_height

In [None]:
# Join df_w_age and df_w_height using the common index "name"
# (set_index turns the "name" column of each DataFrame into its index)

joined = df_w_age.set_index("name").join(df_w_height.set_index("name"))

joined

In [None]:
# You can then create an index on row number

joined.reset_index()
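
An alternative to setting the index and calling `join` is `pd.merge`, which matches rows on a column directly. The sketch below also illustrates what happens when a name is missing from one table; `df_w_height_partial` is a made-up variant of the height table used only for this example.

```python
import pandas as pd

df_w_age = pd.DataFrame({
    "name": ["Tom", "Tyrell", "Claire"],
    "age": [60, 25, 33]
})

# hypothetical table in which the height of Tyrell is missing
df_w_height_partial = pd.DataFrame({
    "name": ["Tom", "Claire"],
    "height": [6.2, 5.5]
})

# inner merge (the default): keeps only names present in both tables
inner = pd.merge(df_w_age, df_w_height_partial, on="name")
print(inner)

# left merge: keeps every row of the left table, filling gaps with NaN
left = pd.merge(df_w_age, df_w_height_partial, on="name", how="left")
print(left)
```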

[Grouping](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) is used when we want to group the rows based on some property and aggregate each group separately. 

This is done with the **groupby** function.

In [None]:
# Create a DataFrame from a Dictionary

df = pd.DataFrame({"name": ["Tom", "Tyrell", "Claire"], 
                    "age": [60, 25, 33],
                   "height": [6.2, 4.0, 5.5],
                   "gender": ["M", "M", "F"]})
df

In [None]:
# Use built-in aggregation functions: groupby gender and compute mean
# numeric_only=True skips the text column "name", which has no mean

means = df.groupby("gender").mean(numeric_only=True)

means

In [None]:
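# Use built-in aggregation functions: groupby gender and compute median
# Select the numeric columns first: a quantile is not defined for the text column "name"

medians = df.groupby("gender")[["age", "height"]].quantile(0.5)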

medians

In [None]:
# Append two extra rows to df

df_new = pd.DataFrame({"name": ["Peter", "Susan"], 
                    "age": [60, 25],
                   "height": [6.2,  5.5],
                   "gender": ["M", "F"]})

# pd.concat stacks the rows of both DataFrames (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df_new])

df

In [None]:
# Groupby age and compute mean
# numeric_only=True skips the text columns "name" and "gender"

mean = df.groupby("age").mean(numeric_only=True)

mean

In [None]:
# Use a custom aggregation function

def agg(df):
    return pd.Series({"name": max(df["name"]), 
                  "oldest": max(df["age"]),
                  "mean_height": df["height"].mean()})

# groupby gender and apply function agg

df.groupby("gender").apply(agg)
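
For simple cases, a custom function is not strictly needed: `agg` with named aggregations can compute one value per output column. The sketch below is one possible formulation, using the same five rows as above; the output column names `oldest` and `mean_height` are chosen to mirror the custom function.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Tyrell", "Claire", "Peter", "Susan"],
                   "age": [60, 25, 33, 60, 25],
                   "height": [6.2, 4.0, 5.5, 6.2, 5.5],
                   "gender": ["M", "M", "F", "M", "F"]})

# each keyword is an output column: (input column, aggregation function)
summary = df.groupby("gender").agg(
    oldest=("age", "max"),
    mean_height=("height", "mean"),
)

print(summary)
```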

### Converting a DataFrame to a NumPy array

**To convert a DataFrame into a NumPy array use the .values property. This is important since Scikit-learn works with NumPy arrays.**

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                     'y': [0.01, -0.01, 0.25, -4.1, 0.],
                     'z': [-1.5, 0., 3.6, 1.3, -2.]})
df

In [None]:
type(df)

In [None]:
df.columns

In [None]:
df.values

In [None]:
dfValuesAsArray = df.values

type(dfValuesAsArray)

In [None]:
dfValuesAsArray

For some machine learning models, you may only wish to **use a subset of the columns**. 

In this case you can use **loc indexing with values** to get the desired subset of columns:

In [None]:
model_cols = ['x', 'y']

df.loc[:, model_cols].values
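
Recent versions of Pandas also offer the `to_numpy()` method, which the Pandas documentation recommends over `.values`. The sketch below assumes a Pandas version where `to_numpy()` is available (0.24 or later) and shows that both give the same array on this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [0.01, -0.01, 0.25, -4.1, 0.],
                   'z': [-1.5, 0., 3.6, 1.3, -2.]})

# whole DataFrame as a 2D array
arr = df.to_numpy()
print(arr.shape, arr.dtype)

# a list of column names selects a subset of the columns before converting
subset = df[['x', 'y']].to_numpy()
print(subset.shape)
```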

**To convert back to a DataFrame you can pass a two-dimensional ndarray with optional column names:**

In [None]:
df2 = pd.DataFrame(dfValuesAsArray)

df2

In [None]:
df2 = pd.DataFrame(dfValuesAsArray, columns=['x', 'y', 'z'])

df2

When all columns are numeric, as in the example above, .values is a numeric ndarray. **When the columns have mixed types, as in the example below, .values is an ndarray of Python objects.**

In either case **you get an ndarray from .values**, but note that most Scikit-learn estimators expect numeric input, so non-numeric columns usually have to be encoded or dropped before the array is used in Scikit-learn.

In [None]:
df3 = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                     'y': [0.01, -0.01, 0.25, -4.1, 0.],
                     'z': [-1.5, 0., 3.6, 1.3, -2.]})

df3

In [None]:
# add a column named "strings" to the DataFrame

df3['strings'] = ['a', 'b', 'c', 'd', 'e']

df3

In [None]:
df3.values

In [None]:
type(df3.values)

In [None]:
df3AsArray = df3.values

df3AsArray
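
One way to see the difference between the two cases is to look at the `dtype` of the resulting array. The sketch below also shows one possible way to keep only the numeric columns, using `select_dtypes`; whether dropping the text column is appropriate depends on your problem.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                    'y': [0.01, -0.01, 0.25, -4.1, 0.],
                    'z': [-1.5, 0., 3.6, 1.3, -2.],
                    'strings': ['a', 'b', 'c', 'd', 'e']})

# mixed numeric and text columns give an array of generic Python objects
mixed_array = df3.to_numpy()
print(mixed_array.dtype)

# keeping only the numeric columns gives a floating-point array
numeric_array = df3.select_dtypes(include="number").to_numpy()
print(numeric_array.dtype, numeric_array.shape)
```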

## 2. Scikit-learn (Practical 03)

This section is based almost entirely on Chapter 1 from the book [Introduction to Machine Learning with Python: A Guide for Data Scientists, Sarah Guido & Andreas Müller, 2016](https://www.safaribooksonline.com/library/view/introduction-to-machine/9781449369880/). We will go through a simple machine learning application and create our first model.

### A First Application: Classifying iris species

Let’s assume that a hobby botanist is interested in identifying the species of some iris flowers that she has found. She has collected some measurements for each iris: the length and width of the petals, and the length and width of the sepals, all measured in centimeters.

She also has the measurements of some irises that have been previously identified by an expert botanist as belonging to the species Setosa, Versicolor or Virginica. For these measurements, she can be certain of which species each iris belongs to. Let’s assume that these are the only species our hobby botanist will encounter in the wild.

**Our goal is to build a machine learning model that can learn from the measurements of these irises whose species is known, so that we can predict the species for a new iris.** Thus, the desired output for a single data point (an iris) is the predicted species of this flower. This is a classical **supervised learning problem**.

### 2.1. Meet the Data

We use the **iris dataset**, a classical dataset in machine learning and statistics. It is included in scikit-learn in the **datasets module** and can be loaded by calling the load_iris function:

In [None]:
from sklearn.datasets import load_iris

iris_dataset = load_iris()

The iris_dataset object that is returned by load_iris is a **Bunch object**, which is very similar to a dictionary. 

In [None]:
type(iris_dataset)

**Take a look at the dataset:**

In [None]:
iris_dataset

Similarly to a Dictionary **it contains keys and values**:

In [None]:
iris_dataset.keys()

In [None]:
#You can see a description of the dataset by printing the DESCR key:

print(iris_dataset.DESCR)

In [None]:
iris_dataset.values()

The value with key target_names is an array of strings, containing the species of flower we want to predict:

In [None]:
# values in key target_names

iris_dataset['target_names']

**The data itself is contained in the target and data fields.**

The data contains the numeric measurements of sepal length, sepal width, petal length, and petal width in a numpy array:

In [None]:
# values in key data

iris_dataset['data']

In [None]:
type(iris_dataset['data'])

The rows in the data array correspond to flowers, while the columns represent the four measurements that were taken for each flower. 

Take a look at the **size of the dataset**:

In [None]:
iris_dataset['data'].shape

**The shape of the data array is (number of samples, number of features).
This is a convention in scikit-learn and your data will always be assumed to be in this shape.**

In [None]:
# feature values of the first 5 examples

iris_dataset['data'][:5]

The target array contains the species of each of the flowers that were measured, also as a numpy array:

In [None]:
# target

type(iris_dataset['target'])

In [None]:
# target of the first 5 learning examples

iris_dataset['target'][:5]

**The target is a one-dimensional array**, with one entry per flower:

In [None]:
iris_dataset['target'].shape

The species are encoded as integers from 0 to 2. The meaning of the numbers is given by the iris_dataset['target_names'] array: 0 means Setosa, 1 means Versicolor and 2 means Virginica.

In [None]:
# target

iris_dataset['target']
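
The integer codes can be translated back into species names by using the target array as an index into target_names. The sketch below also counts how many flowers there are per class with `np.bincount`; the variable names `species` and `counts` are chosen only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']

# indexing target_names with the integer codes gives one name per flower
species = iris_dataset['target_names'][target]
print(species[:5])

# bincount counts how often each integer code occurs
counts = np.bincount(target)
print(counts)
```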

### 2.2. Measuring Success: Training and Testing data

To assess the model’s performance, we show the model new data (that it hasn’t seen before) for which we have labels. This is usually done by splitting the labeled data we have collected (here our 150 flower measurements) into two parts.
One part of the data is used to build our machine learning model and is called the **training data or training set**. The rest of the data will be used to assess how well the model works and is called **test data, test set or hold-out set**.

**Scikit-learn contains a function that shuffles the dataset and splits it for you, the train_test_split function.** By default, this function extracts 75% of the rows in the data as the training set, together with the corresponding labels for this data. The remaining 25% of the data, together with the remaining labels, is declared as the test set. How much data you want to put into the training and the test set respectively depends on the problem, but using a test set containing 25% of the data is a good rule of thumb.

**In scikit-learn, data is usually denoted with a capital X, while labels are denoted by a lower-case y.**

Let's call train_test_split on our data and assign the outputs using this nomenclature:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], 
                                                    iris_dataset['target'], 
                                                    random_state=0)

To make sure that we will get the same output if we run the same function several times, we provide the pseudo random number generator with a fixed seed using the **random_state parameter**. This will make the outcome deterministic, so this line will always have the same outcome.

The train_test_split function outputs X_train, X_test, y_train and y_test, which are all numpy arrays. X_train contains 75% of the rows of the dataset, and X_test contains the remaining 25%:

In [None]:
X_train.shape

In [None]:
X_test.shape
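
The 75%/25% proportion is only the default. The sketch below shows, as an illustration rather than a recommendation, how the `test_size` parameter changes the proportion and how the `stratify` parameter keeps the class proportions the same in both parts; the values 0.2 and 0 are arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# 80% training, 20% test, with the same class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)

# number of test flowers per species
print(np.bincount(y_test))
```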

### 2.3. First Things First: Look at Your Data

Before building a machine learning model, it is often a good idea to inspect the data, to see if the task is easily solvable without machine learning, or if the desired information might not be contained in the data. One of the best ways to inspect data is to visualize it. One way to do this is by using a **scatter plot**. 

A scatter plot of the data puts one feature along the x-axis, one feature along the y-axis, and draws a dot for each data point. Unfortunately, computer screens have only two dimensions, which allows us to plot only two (or maybe three) features at a time. It is difficult to plot datasets with more than three features this way. One way around this problem is to **do a pair plot, which looks at all pairs of two features**. If you have a small number of features, such as the four we have here, this is quite reasonable. You should keep in mind that a pair plot does not show the interaction of all of the features at once, so some interesting aspects of the data may not be revealed when visualizing it this way.

Here is a pair plot of the features in the training set (don't worry about the details of the code; what matters here are the plots). The data points are colored according to the species the iris belongs to:

In [None]:
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names

iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# create a scatter matrix from the dataframe, color by y_train

pd.plotting.scatter_matrix(iris_dataframe, 
                           c=y_train, 
                           figsize=(15, 15),
                           marker='o', 
                           hist_kwds={'bins': 20}, 
                           s=60, 
                           alpha=.8)

From the plots, we can see that the three classes seem to be relatively well separated using the sepal and petal measurements. This means that a simple machine learning model will likely be able to learn to separate them.

### 2.4. Building your First model: k-Nearest Neighbors

Now we can start building the actual machine learning model. There are many classification algorithms in scikit-learn that we could use. Here we will use a **k-nearest neighbors classifier**, which is easy to understand: this algorithm stores the training set and, in order to make a prediction for a new data point, finds the point in the training set that is closest to the new point. Then, it simply assigns the label of this closest training point to the new data point.

The k in k-nearest neighbors stands for the fact that instead of using only the closest neighbor to the new data point, we can consider any fixed number k of neighbors in the training set (for example, the closest three or five neighbors). Then, we can make a prediction using the majority class among these neighbors. We use only a single neighbor for now.
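
The procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not how scikit-learn implements it: the function name `knn_predict` and the toy data are made up for this example, Euclidean distance is assumed, and ties in the vote go to the smallest label.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # count the labels of those neighbors and return the most frequent one
    votes = np.bincount(y_train[nearest])
    return votes.argmax()

# toy training set (hypothetical): two points of class 0 and two of class 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])

pred_one_neighbor = knn_predict(X_toy, y_toy, np.array([0.2, 0.1]), k=1)
pred_three_neighbors = knn_predict(X_toy, y_toy, np.array([0.8, 0.9]), k=3)
print(pred_one_neighbor, pred_three_neighbors)
```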

**All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator classes.**

The k-nearest neighbors classification algorithm is implemented in the **KNeighborsClassifier** class in the neighbors module. Before we can use the model, we need to instantiate the class into an object. This is when we set any parameters of the model. The most important parameter of KNeighborsClassifier is the number of neighbors, which we will set to one:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# use k nearest neighbors with k=1
knn = KNeighborsClassifier(n_neighbors=1)

The knn object encapsulates the algorithm to build the model from the training data, as well as the algorithm to make predictions on new data points. It will also hold the information the algorithm has extracted from the training data. 

**To build the model on the training set, we call the fit method of the knn object**, which takes as arguments the numpy array X_train containing the training data and the NumPy array y_train of the corresponding training labels:

In [None]:
# training the model
knn.fit(X_train, y_train)

### 2.5. Making Predictions

**We can now use the trained model to make predictions on new data, for which we might not know the correct labels.** 

Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm and a petal width of 0.2 cm. What species of iris would this be?

In [None]:
# store the new data in a NumPy array

X_new = np.array([[5, 2.9, 1, 0.2]])

X_new

**To make a prediction we call the predict method** of the knn object:

In [None]:
prediction = knn.predict(X_new)

prediction

Our model predicts that this new iris belongs to the class 0, meaning its species is Setosa.

In [None]:
iris_dataset['target_names'][prediction]

In [None]:
# what happens if we use 3 neighbors?

knn3 = KNeighborsClassifier(n_neighbors=3)

knn3.fit(X_train, y_train)

prediction2 = knn3.predict(X_new)

iris_dataset['target_names'][prediction2]

**The prediction does not change, although it could change since the model is not the same.**

**How do we know whether we can trust the model?** We don’t know the correct species of this sample, and we shouldn't need to, since predicting it is the whole point of building the model. But can we somehow evaluate how good the model is, that is, how likely it is to behave well on unseen data?

### 2.6. Evaluating the Model

**This is where the test set that we created earlier comes in.** 

This data was not used to build the model, but we do know what the correct species are for each iris in the test set.

In this context, we can make a prediction for an iris in the test data, and compare it against its label (the known species). 

We can then measure how well the model works by computing, for example, its **accuracy**, which is the fraction of flowers for which the right species was predicted:

In [None]:
# accuracy of knn with 1 neighbor

y_pred = knn.predict(X_test)

np.mean(y_pred == y_test)

We can also use the score method of the knn object, which will compute the **accuracy in the test set**:

In [None]:
knn.score(X_test, y_test)

For this model, the test set accuracy is about 0.97, which means we made the right prediction for 97% of the irises in the test set. 

Under some mathematical assumptions, this means that we can expect our model to be correct 97% of the time for new irises. For our hobby botanist application, this high level of accuracy means that our model may be trustworthy enough to use.
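
Accuracy can also be computed with the `accuracy_score` function from the sklearn.metrics module. The sketch below repeats the split and the model from this section so that it can be run on its own, and checks that the three ways of computing accuracy agree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# three equivalent ways to compute the accuracy on the test set
manual = np.mean(y_pred == y_test)
via_metric = accuracy_score(y_test, y_pred)
via_score = knn.score(X_test, y_test)

print(manual, via_metric, via_score)
```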

Note that in this case the accuracy on the training set is 100%. **Note also that the accuracy on the training set is usually higher than the accuracy on the test set, although not 100% in most cases.**

In [None]:
y_pred = knn.predict(X_train)

np.mean(y_pred == y_train)
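
To compare training and test accuracy for more than one model, the two scores can be collected in a loop. The sketch below tries three values of k chosen arbitrarily for illustration; the dictionary `results` is a name made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# map each k to a (training accuracy, test accuracy) pair
results = {}
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(k, results[k])
```

With a single neighbor, every training point is its own nearest neighbor, which is why the training accuracy is 100% in that case.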

### 2.7. Summary
   
**The following code contains the core steps for applying any machine learning algorithm using scikit-learn. The fit, predict and score methods are the common interface to supervised models in scikit-learn.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], 
                                                    iris_dataset['target'],
                                                    random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train, y_train)

knn.score(X_test, y_test)

**Once we have trained the model (classifier) we can use it to classify new examples using the predict method**.

In [None]:
X_new = np.array([[5, 2.9, 1, 0.2]])

prediction = knn.predict(X_new)

prediction

## 3. Classes and Functions in Scikit-learn

[Scikit-learn organization](http://scikit-learn.org/stable/) is as follows:

* [Classification](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning)

* [Regression](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning)

* [Clustering](http://scikit-learn.org/stable/modules/clustering.html#clustering)

* [Dimensionality reduction](http://scikit-learn.org/stable/modules/decomposition.html#decompositions)

* [Model selection](http://scikit-learn.org/stable/model_selection.html#model-selection)

* [Preprocessing](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)

Classification and regression are forms of [Supervised Learning](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning), while clustering is a form of [Unsupervised Learning](http://scikit-learn.org/stable/modules/clustering.html#clustering). 

**The class and function reference of scikit-learn is available in the [API Reference](http://scikit-learn.org/stable/modules/classes.html).**

**For further details refer to the [full user guide](http://scikit-learn.org/stable/user_guide.html).**