# Lab Session 1 - Introduction to Machine Learning

For this lab session, we will go through a simple machine learning application and create our first model. We will be using the **Fruit Dataset** to create a classifier that can predict Fruit Type (apple, mandarin, orange, and lemon).

## Import required modules

In [6]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


## First Things First: Look at Your Data

### Question 0
 
**Using `read_csv`, create a dataframe. The dataset file "fruits.txt" should be in the same folder as your Python file.**

In [7]:
df=pd.read_csv("fruits.txt")
df.head(5)

# Each row corresponds to a single data instance (sample).
# The features are "mass", "width", "height" and "color_score"; they describe each data instance (sample).
# For a supervised ML model, the table should also have a "label" column.

Unnamed: 0,name,subtype,mass,width,height,color_score
0,apple,granny_smith,192,8.4,7.3,0.55
1,apple,granny_smith,180,8.0,6.8,0.59
2,apple,granny_smith,176,7.4,7.2,0.6
3,mandarin,mandarin,86,6.2,4.7,0.8
4,mandarin,mandarin,84,6.0,4.6,0.79


### Question 1

How many data points (**Number of Instances**) and features (**Number of Attributes**) does the fruit dataset have?

(Hint: use `shape`)

What is the class distribution? (i.e. how many instances of `apple`, `mandarin`, `orange`, and `lemon`)

(Hint: use `value_counts()`)

Using `head` display the first 8 instances of the fruit dataset.



In [3]:
# Class distribution: number of instances of each fruit type

df["name"].value_counts()

apple       19
orange      19
lemon       16
mandarin     5
Name: name, dtype: int64
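
The three calls asked for in Question 1 can be tried out on a small in-memory frame first. The sketch below is only an illustration: it rebuilds the five rows shown by `df.head(5)` above (the name `toy` is made up here), so it runs without `fruits.txt` and its numbers are not the answers for the full dataset.

```python
import pandas as pd

# The five rows displayed by df.head(5) earlier in this notebook,
# rebuilt in memory so the sketch runs without fruits.txt.
toy = pd.DataFrame({
    "name": ["apple", "apple", "apple", "mandarin", "mandarin"],
    "subtype": ["granny_smith", "granny_smith", "granny_smith", "mandarin", "mandarin"],
    "mass": [192, 180, 176, 86, 84],
    "width": [8.4, 8.0, 7.4, 6.2, 6.0],
    "height": [7.3, 6.8, 7.2, 4.7, 4.6],
    "color_score": [0.55, 0.59, 0.60, 0.80, 0.79],
})

n_rows, n_cols = toy.shape                   # (number of rows, number of columns)
class_counts = toy["name"].value_counts()    # instances per fruit type
first_rows = toy.head(3)                     # first 3 rows of the frame

print(n_rows, n_cols)
print(class_counts)
print(first_rows)
```

Note that `shape` counts every column, including `name` and `subtype`, so the number of columns is not automatically the number of features.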

## Building a Model

### Question 2
Split the DataFrame into `X` (the data) and `y` (the labels).

*After the split:*
* `X` *has shape* `(59, 3)`
* `y` *has shape* `(59,)`.

**For this example, only use `mass`, `width`, and `height` features**

In [15]:
# Labels: the fruit type
y = df["name"]
# Features: only mass, width and height for this example
X = df[["mass", "width", "height"]]

print(X, y)


    mass  width  height
0    192    8.4     7.3
1    180    8.0     6.8
2    176    7.4     7.2
3     86    6.2     4.7
4     84    6.0     4.6
5     80    5.8     4.3
6     80    5.9     4.3
7     76    5.8     4.0
8    178    7.1     7.8
9    172    7.4     7.0
10   166    6.9     7.3
11   172    7.1     7.6
12   154    7.0     7.1
13   164    7.3     7.7
14   152    7.6     7.3
15   156    7.7     7.1
16   156    7.6     7.5
17   168    7.5     7.6
18   162    7.5     7.1
19   162    7.4     7.2
20   160    7.5     7.5
21   156    7.4     7.4
22   140    7.3     7.1
23   170    7.6     7.9
24   342    9.0     9.4
25   356    9.2     9.2
26   362    9.6     9.2
27   204    7.5     9.2
28   140    6.7     7.1
29   160    7.0     7.4
30   158    7.1     7.5
31   210    7.8     8.0
32   164    7.2     7.0
33   190    7.5     8.1
34   142    7.6     7.8
35   150    7.1     7.9
36   160    7.1     7.6
37   154    7.3     7.3
38   158    7.2     7.8
39   144    6.8     7.4
40   154    7.1 

### Question 3
Using `train_test_split`, split `X` and `y` into training and test sets `(X_train, X_test, y_train, y_test)`.

**Set the random number generator state to 0 using `random_state=0`**
(This way, the same selection of test data is done each time)

The call returns four objects, which you can unpack as `X_train, X_test, y_train, y_test`.
Print the shape of each of these 4 elements.
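
As a rough sketch of what the split does, the block below runs `train_test_split` on a synthetic stand-in for the fruit table. The names `X_toy` and `y_toy` and all the numbers in them are invented for illustration; only the call itself mirrors the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fruit data (NOT the real measurements).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "mass": rng.normal(150, 30, size=20),
    "width": rng.normal(7, 0.5, size=20),
    "height": rng.normal(7, 0.8, size=20),
})
y_toy = pd.Series(["apple"] * 10 + ["lemon"] * 10, name="name")

# By default 25% of the rows go to the test set;
# random_state=0 makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, random_state=0)

for part in (X_train, X_test, y_train, y_test):
    print(part.shape)
```

With 20 rows and the default test size, 15 rows end up in the training set and 5 in the test set. Rows of `X` and `y` stay aligned because both are shuffled with the same permutation.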


## Building Your First Model: k-Nearest Neighbors

### Question 4
Using `KNeighborsClassifier` create a classifier object using five nearest neighbors (`n_neighbors = 5`).

*The result should be a `sklearn.neighbors.KNeighborsClassifier` object.*

Using your k-NN classifier object `knn`, train the classifier on `X_train` and `y_train` (fit the estimator).
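
Creating and fitting the estimator takes two calls. The sketch below shows them on two synthetic clusters; the training arrays are invented for illustration and are not the fruit data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic clusters in (mass, width, height); NOT the real fruit data.
rng = np.random.default_rng(0)
light = rng.normal([85, 6.0, 4.5], [5, 0.2, 0.2], size=(10, 3))
heavy = rng.normal([170, 7.5, 7.3], [5, 0.2, 0.2], size=(10, 3))
X_train = np.vstack([light, heavy])
y_train = np.array(["mandarin"] * 10 + ["apple"] * 10)

# Create the classifier object, then fit it on the training data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(type(knn).__name__)
print(knn.classes_)   # class labels seen during fit, in sorted order
```

For k-NN, fitting mostly means storing the training samples so that neighbors can be looked up at prediction time.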

### Question 5
Use the trained k-NN classifier model to classify new, previously unseen objects.

**Use the following inputs:**
* **a fruit with mass `20 g`, width `4.3 cm`, height `5.5 cm`**
* **a small fruit with mass `100 g`, width `6.3 cm`, height `8.5 cm`**
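
`predict` expects a two-dimensional input with one row per sample and the same feature order used for training. The sketch below passes both inputs from this question to a classifier trained on synthetic clusters (invented for illustration), so the printed labels say nothing about what the model trained on the real fruit data would predict.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

columns = ["mass", "width", "height"]

# Synthetic training data (NOT the real fruit measurements).
rng = np.random.default_rng(0)
light = rng.normal([85, 6.0, 4.5], [5, 0.2, 0.2], size=(10, 3))
heavy = rng.normal([170, 7.5, 7.3], [5, 0.2, 0.2], size=(10, 3))
X_train = pd.DataFrame(np.vstack([light, heavy]), columns=columns)
y_train = np.array(["mandarin"] * 10 + ["apple"] * 10)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# One row per new fruit, columns in the same order as the training data.
new_fruits = pd.DataFrame([[20, 4.3, 5.5], [100, 6.3, 8.5]], columns=columns)
predictions = knn.predict(new_fruits)
print(predictions)
```

Building the new samples as a DataFrame with the training column names keeps the feature order explicit.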


## Evaluating the Model

### Question 6
We can measure how well the model works by computing the accuracy on the test data. This is the fraction of fruits for which the right fruit type was predicted:

**Use `score` to evaluate the accuracy of the classifier, using the test data**
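
To make the definition concrete, the sketch below compares `score` with the fraction of correct predictions computed by hand. The data are synthetic clusters invented for illustration, so the accuracy value itself is not meaningful for the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, partly overlapping clusters in (mass, width, height);
# NOT the real fruit data.
rng = np.random.default_rng(0)
centers = {
    "apple": [165, 7.4, 7.3],
    "lemon": [150, 6.5, 8.5],
    "mandarin": [85, 6.0, 4.5],
}
X = np.vstack([rng.normal(c, [15, 0.4, 0.4], size=(20, 3)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 20)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# score() returns the mean accuracy on the given data and labels ...
accuracy = knn.score(X_test, y_test)
# ... which is the fraction of test samples predicted correctly.
manual_accuracy = np.mean(knn.predict(X_test) == y_test)
print(accuracy, manual_accuracy)
```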

## Improving the Model

### Question 7
Try to improve the accuracy by changing the number of neighbors. What is the optimal number of neighbors?

Now try adding distance weighting by changing the default value of the `weights` parameter (as described here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

What is the best accuracy you get? What is the optimal number of neighbors with distance weighting? 
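
One possible way to organize this search is a loop over `n_neighbors` and `weights` that records the test score for each setting. The sketch below does this on synthetic data invented for illustration; the range of 1 to 10 neighbors and the names `results` and `best_setting` are choices made here, not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, partly overlapping clusters (NOT the real fruit data).
rng = np.random.default_rng(0)
centers = {
    "apple": [165, 7.4, 7.3],
    "lemon": [150, 6.5, 8.5],
    "mandarin": [85, 6.0, 4.5],
}
X = np.vstack([rng.normal(c, [15, 0.4, 0.4], size=(20, 3)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 20)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Test accuracy for every (n_neighbors, weights) combination.
results = {}
for weights in ("uniform", "distance"):
    for k in range(1, 11):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        knn.fit(X_train, y_train)
        results[(k, weights)] = knn.score(X_test, y_test)

# Setting with the highest test accuracy (first one wins on ties).
best_setting = max(results, key=results.get)
print(best_setting, results[best_setting])
```

With `weights="uniform"` (the default) every neighbor has an equal vote, while `weights="distance"` gives closer neighbors more influence. Keep in mind that a setting chosen by looking at the test score tends to make that score look better than it would on new data.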

### Question 8

Try to include the `color_score` feature in the data. To do this, you need to reassign `X` with the fruits dataframe, this time including `color_score`. Then re-do the train-test split, and fit the model again. Print the score for the test data.

### Question 9

Is the result different? Our results might not be very reliable, since we have such a small amount of test data. Try doing the train-test split 5 times, but this time allowing random variation by removing the `random_state` argument. Make a loop where you compute the average test score, with and without `color_score`. Based on this, do you think `color_score` adds useful information?
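
The repeated-split comparison could be structured roughly as follows. This is a sketch on synthetic data: the helper `mean_test_score`, the frame `X_all`, and every number in it (including the `color_score` values) are invented for illustration, so the printed averages do not answer the question for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with four features (NOT the real fruit data).
rng = np.random.default_rng(0)
columns = ["mass", "width", "height", "color_score"]
centers = {
    "apple": [165, 7.4, 7.3, 0.60],
    "lemon": [150, 6.5, 8.5, 0.72],
    "mandarin": [85, 6.0, 4.5, 0.80],
}
spread = [15, 0.4, 0.4, 0.03]
X_all = pd.DataFrame(
    np.vstack([rng.normal(c, spread, size=(20, 4)) for c in centers.values()]),
    columns=columns,
)
y = np.repeat(list(centers.keys()), 20)


def mean_test_score(X, y, n_repeats=5):
    """Average test accuracy over several random train-test splits."""
    scores = []
    for _ in range(n_repeats):
        # No random_state: each repetition draws a different split.
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
        scores.append(knn.score(X_test, y_test))
    return float(np.mean(scores))


score_without = mean_test_score(X_all[["mass", "width", "height"]], y)
score_with = mean_test_score(X_all, y)
print(score_without, score_with)
```

Because the splits are random, the two averages change from run to run; increasing `n_repeats` makes the comparison steadier.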