# Lab Session 1 - Introduction to Machine Learning

For this lab session, we will go through a simple machine learning application and create our first model. We will be using the **Fruit Dataset** to create a classifier that can predict Fruit Type (apple, mandarin, orange, and lemon).

## Import required modules

In [1]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import cm



## First Things First: Look at Your Data

### Question 0
 
**Using `read_csv`, create a dataframe. Keep in mind that the dataset file "fruits.txt" should be in the same folder as your Python file.**

In [2]:
df=pd.read_csv("1.fruits.txt")
df.head()

# Each row corresponds to a single data instance (sample)
# The features are "mass", "width", "height" and "color_score"; they describe each data instance (sample)
# In supervised ML the table should also have a "label" column, so I need a training label:
# e.g. match apple with one number, mandarin with another, and so on

# The main goal is to build a classifier from this data that can predict the type of fruit
# from a given observation of its features

Unnamed: 0,name,subtype,mass,width,height,color_score
0,apple,granny_smith,192,8.4,7.3,0.55
1,apple,granny_smith,180,8.0,6.8,0.59
2,apple,granny_smith,176,7.4,7.2,0.6
3,mandarin,mandarin,86,6.2,4.7,0.8
4,mandarin,mandarin,84,6.0,4.6,0.79


### Question 1

How many data points (**Number of Instances**) and features (**Number of Attributes**) does the fruit dataset have?

(Hint: use `shape`)

What is the class distribution? (i.e. how many instances of `apple`, `mandarin`, `orange`, and `lemon`)

(Hint: use `value_counts()`)

Using `head` display the first 8 instances of the fruit dataset.



In [3]:


df["name"].value_counts()

apple       19
orange      19
lemon       16
mandarin     5
Name: name, dtype: int64

In [4]:
df.shape

(59, 6)
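
The three inspection calls from Question 1 can be tried together on a tiny table. This is only a sketch: the `toy` dataframe and its values are made up for illustration and are not the lab dataset. Note that `shape` counts every column, including label columns such as `name`.

```python
import pandas as pd

# Hypothetical mini-dataset; the values are illustrative, not the lab data.
toy = pd.DataFrame({
    "name": ["apple", "apple", "mandarin", "lemon", "orange", "lemon"],
    "mass": [192, 180, 86, 194, 160, 116],
    "width": [8.4, 8.0, 6.2, 7.2, 7.0, 5.9],
    "height": [7.3, 6.8, 4.7, 10.3, 7.4, 8.1],
})

n_instances, n_columns = toy.shape          # (rows, columns)
class_counts = toy["name"].value_counts()   # number of instances per class
first_rows = toy.head(4)                    # first 4 rows of the table

print(n_instances, n_columns)
print(class_counts)
print(first_rows)
```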

## Building a Model

### Question 2
Split the DataFrame into `X` (the data) and `y` (the labels).

*This function should return* 
* `X` *has shape* `(59, 3)`
* `y` *has shape* `(59,)`.

**For this example, only use `mass`, `width`, and `height` features**

In [5]:
X = df[["mass","width","height"]]

y = df["name"]
#df_new=df
#print(X,y)


In [6]:
X.shape

(59, 3)

In [7]:
y.shape

(59,)

### Question 3
Using `train_test_split`, split `X` and `y` into training and test sets `(X_train, X_test, y_train, and y_test)`.

**Set the random number generator state to 0 using `random_state=0`**
(This way, the same selection of test data is done each time)

This function should return a tuple of length 4: `(X_train, X_test, y_train, y_test)`
Print the shape of each of these 4 elements


In [8]:
# To estimate how well the classifier will do on future samples, I split the labelled samples in two:
# the training set is used to train the classifier, and the remaining labelled samples
# go into a test set that is used to evaluate the trained classifier

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# To split the data, sklearn provides the train_test_split function.
# The random_state parameter provides a seed for the function's random number generator, so we always get the same split.
# By default the function gives a 75/25 training/test split, which is what I use here.
# X holds the features and y holds the labels.


Seed value: Conceptually, the seed value is used to initialize the random number generator. Every time you use the same seed value, you get the same sequence of random values. Choosing a different seed value for `random_state` results in a different random split. To get the same split every time, set the seed to a fixed number.
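
A quick way to check the effect of the seed is to split the same data twice with the same `random_state` and compare the results. The sketch below uses small made-up arrays (`X_demo`, `y_demo`) rather than the fruit data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, labels 0..9.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same seed.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X_demo, y_demo, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_demo, y_demo, random_state=0)

# The same seed selects the same test samples both times.
same_split = np.array_equal(y_test_a, y_test_b)
print(same_split)

# Default split is 75/25: sizes of the training and test parts.
print(len(y_train_a), len(y_test_a))
```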

### Checking how the data was split

In [10]:
X_train.shape


(44, 3)

In [11]:
X_test.shape

(15, 3)

In [12]:
y_train.shape

(44,)

In [13]:
y_test.shape

(15,)

In [14]:
X_train

Unnamed: 0,mass,width,height
42,154,7.2,7.2
48,174,7.3,10.1
7,76,5.8,4.0
14,152,7.6,7.3
32,164,7.2,7.0
49,132,5.8,8.7
29,160,7.0,7.4
37,154,7.3,7.3
56,116,5.9,8.1
18,162,7.5,7.1


## Building Your First Model: k-Nearest Neighbors

### Question 4
Using `KNeighborsClassifier` create a classifier object using five nearest neighbors (`n_neighbors = 5`).

*This function should return a* `sklearn.neighbors.classification.KNeighborsClassifier`.

In [15]:
knn = KNeighborsClassifier(n_neighbors = 5)
# This creates a classifier object

### K-NN classifier Algorithm

Given a training set X_train with labels y_train, and given a new instance (sample) x_test to be classified:
1. Find the most similar instances (let's call them X_NN) to x_test that are in X_train
2. Get the labels y_NN for the instances in X_NN
3. Predict the label for x_test by combining the labels y_NN e.g. simple majority vote
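
The three steps can be written out directly with NumPy. This is a minimal sketch for illustration, not how scikit-learn implements it: the helper `knn_predict` and the toy values in `X_toy` and `y_toy` are hypothetical, and Euclidean distance with a simple majority vote is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    # Step 1: Euclidean distance from x_test to every training instance,
    # then the indices of the k closest ones (X_NN).
    distances = np.linalg.norm(X_train - x_test, axis=1)
    nn_idx = np.argsort(distances)[:k]
    # Step 2: labels y_NN of those neighbors.
    y_nn = [y_train[i] for i in nn_idx]
    # Step 3: simple majority vote.
    return Counter(y_nn).most_common(1)[0][0]

# Toy training set (mass, width, height); values are illustrative only.
X_toy = [[86, 6.2, 4.7], [84, 6.0, 4.6], [80, 5.8, 4.3],
         [192, 8.4, 7.3], [180, 8.0, 6.8]]
y_toy = ["mandarin", "mandarin", "mandarin", "apple", "apple"]

prediction = knn_predict(X_toy, y_toy, [82, 6.0, 4.5], k=3)
print(prediction)
```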

#### A nearest neighbor algorithm needs four things specified
1. A distance metric --> Typically Euclidean (Minkowski with p=2)
2. How many "nearest" neighbors to look at? (At least one) --> e.g. five
3. Optional weighting function on the neighbor points (not used here)
4. Method for aggregating the classes of neighbor points --> Simple majority vote (class with the most representatives among the nearest neighbors)
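
As a rough guide, the first three choices correspond to constructor parameters of `KNeighborsClassifier`, while the aggregation step (the vote) is built into `predict`. The sketch below spells the choices out explicitly; the variable name `knn_explicit` is only for illustration, and the values shown are the ones this lab relies on.

```python
from sklearn.neighbors import KNeighborsClassifier

knn_explicit = KNeighborsClassifier(
    metric="minkowski", p=2,   # 1. distance metric: Minkowski with p=2 is Euclidean
    n_neighbors=5,             # 2. number of nearest neighbors to look at
    weights="uniform",         # 3. no distance weighting: every neighbor counts equally
)

# Inspect the settings stored on the classifier object.
params = knn_explicit.get_params()
print(params["metric"], params["p"], params["n_neighbors"], params["weights"])
```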

### Training the Classifier
Using your knn classifier object `knn` and `X_train`, `y_train` train the classifier (fit the estimator).

In [16]:
# How do we train the classifier? By passing the training data X_train and the labels y_train to the
# classifier's fit method. The classifier is the estimator: fit updates the state of the knn variable,
# which means that in the case of k-nearest neighbors it memorizes the training set examples in some kind
# of internal storage for future use

knn.fit(X_train, y_train)

KNeighborsClassifier()

### Question 5
Use the trained k-NN classifier model to classify new, previously unseen objects

**Use the following inputs:**
* a small fruit with mass `20 g`, width `4.3 cm`, height `5.5 cm`
* a larger fruit with mass `100 g`, width `6.3 cm`, height `8.5 cm`


In [17]:
small_fruit_prediction=knn.predict([[20, 4.3, 5.5]])
print("The label of the small fruit is: ",small_fruit_prediction)
larger_fruit_prediction=knn.predict([[100, 6.3, 8.5]])
print("The label of the larger fruit is: ", larger_fruit_prediction)

The label of the small fruit is:  ['mandarin']
The label of the larger fruit is:  ['lemon']




## Evaluating the Model

### Question 6
We can measure how well the model works by computing the accuracy on the test data. This is the fraction of fruits for which the right fruit type was predicted:

**Use `score` to evaluate the accuracy of the classifier, using the test data**

In [18]:
test_score=knn.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))

Test set score: 0.53


### Accuracy
Accuracy is the fraction of test samples whose labels were correctly predicted by the classifier.
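
To make the definition concrete, accuracy can be computed by hand as the mean of a "prediction equals true label" comparison. The labels below are made up for illustration; `accuracy_score` from `sklearn.metrics` is used only as a cross-check.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for five test samples.
y_true = np.array(["apple", "lemon", "orange", "mandarin", "lemon"])
y_pred = np.array(["apple", "lemon", "lemon", "mandarin", "orange"])

# Fraction of samples whose predicted label matches the true label (3 of 5 here).
accuracy = np.mean(y_true == y_pred)
print(accuracy)

# The library function computes the same quantity.
library_accuracy = accuracy_score(y_true, y_pred)
print(library_accuracy)
```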

## Improving the Model

### Question 7
Try to improve the accuracy, by changing the number of neighbors. What is the optimal number of neighbors?

Now try adding distance weighting by changing the default value of the weights parameter (as described here https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

What is the best accuracy you get? What is the optimal number of neighbors with distance weighting? 

In [19]:
knn = KNeighborsClassifier(n_neighbors = 4, weights="distance")

knn.fit(X_train, y_train)

test_score=knn.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))

train_score=knn.score(X_train, y_train)
print("Train set score: {:.2f}".format(train_score))

Test set score: 0.73
Train set score: 1.00


### Answer

The best number of neighbors I found is 4 (with distance weighting), giving a test accuracy of 73.33%.
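
One way to search for a good setting is to loop over candidate values and record the test score of each. The sketch below is an assumption-laden example: it uses the iris dataset bundled with scikit-learn as a stand-in (so it runs without the fruit file), and the range of `k` values is arbitrary. Keep in mind that picking the setting with the best test score tends to give an optimistic accuracy estimate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (not the fruit dataset).
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Test accuracy for every (k, weights) combination tried.
scores = {}
for k in range(1, 11):
    for weights in ("uniform", "distance"):
        model = KNeighborsClassifier(n_neighbors=k, weights=weights)
        model.fit(X_tr, y_tr)
        scores[(k, weights)] = model.score(X_te, y_te)

# Setting with the highest test accuracy (first one found in case of ties).
best_setting = max(scores, key=scores.get)
print(best_setting, round(scores[best_setting], 2))
```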

#### Distance weighting:
`weights = 'distance'` is in contrast to the default, `weights = 'uniform'`. When weights are uniform, a simple majority vote of the nearest neighbors is used to assign the class label.

With distance weighting, each neighbor's vote is weighted by the inverse of its distance to the query point. Nearby points have a greater influence than more distant points (even if the counts of the different classes are similar).

Distance weighting can be useful for sparse data, where some of the k nearest neighbors may lie far from the query point.
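
The difference between the two voting schemes can be seen on a small hand-made example. This is a sketch with hypothetical neighbor labels and distances; the helper `vote` is not part of scikit-learn, and it assumes no neighbor is at distance zero.

```python
from collections import defaultdict

# Hypothetical 3 nearest neighbors of a query point.
neighbor_labels = ["lemon", "orange", "orange"]
neighbor_distances = [0.5, 2.0, 2.5]

def vote(labels, distances, weighted):
    """Return the winning class under uniform or inverse-distance voting."""
    totals = defaultdict(float)
    for label, dist in zip(labels, distances):
        # Uniform: every neighbor contributes 1. Weighted: contributes 1/distance.
        totals[label] += 1.0 / dist if weighted else 1.0
    return max(totals, key=totals.get)

# Uniform vote: orange has 2 votes, lemon has 1.
uniform_winner = vote(neighbor_labels, neighbor_distances, weighted=False)
# Distance-weighted vote: lemon scores 1/0.5 = 2.0, orange scores 1/2.0 + 1/2.5 = 0.9.
distance_winner = vote(neighbor_labels, neighbor_distances, weighted=True)
print(uniform_winner, distance_winner)
```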


### Question 8

Try to include the color_score feature in the data. To do this, you need to reassign X with the fruits dataframe, this time including color_score. Then re-do the train-test split, and fit the model again. Print the score for the test data.

In [20]:

X = df[["mass","width","height","color_score"]] #Add one more feature to the training set

y = df["name"]


In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


In [37]:
knn = KNeighborsClassifier(n_neighbors = 4,  weights="distance")
knn.fit(X_train, y_train)
test_score=knn.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))
train_score=knn.score(X_train, y_train)
print("Train set score: {:.2f}".format(train_score))

Test set score: 0.73
Train set score: 1.00


### Question 9

Is the result different? Our results might not be very reliable, since we have such a small amount of test data. Try doing the train-test split 5 times, but this time allowing random variation by removing the random_state argument. Make a loop where you compute the average test score, with and without color_score. Based on this, do you think color_score adds useful information?

In [54]:
def loopTest(n_runs=5):
    """ 1. Split the data n_runs times; random_state is not fixed, so each split differs
        2. Fit the classifier on each training set
        3. Score it on each test set
        4. Print the average test score
        Uses the global X, y and knn defined in earlier cells."""
    tot=0
    for _ in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(X,y) # no random_state, so the split varies
        knn.fit(X_train, y_train)
        test_score=knn.score(X_test,y_test)
        tot=tot+test_score
        print("Test set score: {:.2f}".format(test_score))
    res=tot/n_runs
    print("\navg {:.2f}".format(res),"\n \n")


In [55]:
X = df[["mass","width","height","color_score"]] # y is already defined above; only X changes between the runs with and without color_score
print("Result with color_score\n")
loopTest()
X= df[["mass","width","height"]]
print("Result without color_score\n")
loopTest()

Result with color_score

Test set score: 0.67
Test set score: 0.60
Test set score: 0.60
Test set score: 0.67
Test set score: 0.60

avg 0.63 
 

Result without color_score

Test set score: 0.53
Test set score: 0.67
Test set score: 0.87
Test set score: 0.67
Test set score: 0.80

avg 0.71 
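
With only 5 random splits of 59 samples, the two averages can differ from run to run, so the comparison is worth repeating. As an alternative way to write the same experiment, scikit-learn's `ShuffleSplit` and `cross_val_score` can do the repeated splitting and scoring. The sketch below is a hedged example on the bundled iris data (a stand-in, not the fruit dataset), and dropping the last two columns plays the role of leaving out a feature.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (not the fruit dataset).
X_demo, y_demo = load_iris(return_X_y=True)

# 5 random 75/25 splits; no fixed random_state, so the splits vary between runs.
splitter = ShuffleSplit(n_splits=5, test_size=0.25)
model = KNeighborsClassifier(n_neighbors=4, weights="distance")

# One test score per split, with all features and with a reduced feature set.
scores_all = cross_val_score(model, X_demo, y_demo, cv=splitter)
scores_reduced = cross_val_score(model, X_demo[:, :2], y_demo, cv=splitter)

print("avg with all features: {:.2f}".format(scores_all.mean()))
print("avg with first two features: {:.2f}".format(scores_reduced.mean()))
```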
 

