# ARTificial INtelligence ARTIN
**Master CORO - Centrale Nantes**

**Author: Diana Mateus**


# Introductory TD1

Scikit-learn is a very popular Machine Learning library for Python. In this notebook we will study how to put into practice the three simple machine learning models

*   Linear Regression
*   K-nearest neighbors
*   Naive Bayes 

seen during the lecture relying on scikit-learn.

**GOALS**: 

*   Splitting data for Machine Learning
*   Discovering the simple three-step mechanism to load, train and test an existing model from the scikit-learn library


**DELIVERABLES**:
There is no code to deliver. However, while running the code, you should make sure you:

*   Understand the purpose of data splitting.
*   Understand what each line of the code is doing, especially what happens each time you call ``fit``.
*   Can answer the questions at the end of the notebook, for which you are invited to modify the code.


# Importing Modules

In [None]:
import numpy as np #scientific computing (in ML it handles and operates on multi-dimensional arrays)
import matplotlib.pyplot as plt #for data visualization
import sklearn #for Machine Learning
import pandas as pd #for reading, writing and processing databases 


# Loading and Splitting Data

The objective of Supervised Machine Learning methods is

*   to learn from examples
*   to be able to make predictions
*   for unseen data!!!

The primary rule when training a model is therefore to split all the available labeled data into groups:
*  **Training set** : used to fit the model parameters.
*  **Validation set** : used to set the model hyper-parameters.
*  **Test set** : used only after training and validation have been finished to evaluate the performance of the method.

For real-life problems, it is important to keep the use of the test set to a minimum, so that the measured performance remains an honest estimate of how well the model generalizes.
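The three-way split described above can be sketched with two successive calls to `train_test_split`. The 60/20/20 proportions, the variable names (`X_tr`, `X_val`, `X_te`, chosen so they do not overwrite the variables used later in this notebook) and the use of the breast cancer dataset are assumptions made for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# First hold out 20% of the samples as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.20, random_state=0)

# Then split the remainder: 25% of the remaining 80% is about 20% of the total.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print('train', X_tr.shape, 'validation', X_val.shape, 'test', X_te.shape)
```

The test part is set aside first, so nothing done later with the training and validation parts can touch it.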



### Datasets for Regression


In [None]:
from sklearn import datasets 
diabetes = datasets.load_diabetes()



# Explore on your own the dimensions of the dataset and their meaning.
X = diabetes.data[:, np.newaxis, 2] 

# SPLIT: what is the effect of the following lines?
X_train = X[:-30]
X_test = X[-30:]
y_train = diabetes.target[:-30]
y_test = diabetes.target[-30:]

# Explore the data
# print(diabetes.DESCR) #Uncomment to see the dataset description
print(diabetes.feature_names)
print(diabetes.target.shape)
print('X train', X_train.shape)
print('X test',X_test.shape)
print('y train',y_train.shape)
print('y test',y_test.shape)



### Datasets for Classification

In [None]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

print(label_names)
print(labels.shape)
print(feature_names)
print(features.shape)
#print(data.DESCR) #Uncomment to see the dataset description

In [None]:
from sklearn.model_selection import train_test_split

X2_train, X2_test, y2_train, y2_test = train_test_split(features,labels,test_size = 0.40, random_state = 42)
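As a quick sanity check on the call above, the following sketch repeats the split with the same arguments and inspects what comes back. It reloads the data so that it runs on its own; the names ending in `_chk` and `split_a`/`split_b` are hypothetical helpers, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

features_chk, labels_chk = load_breast_cancer(return_X_y=True)

# Same arguments as in the cell above, called twice.
split_a = train_test_split(features_chk, labels_chk, test_size=0.40, random_state=42)
split_b = train_test_split(features_chk, labels_chk, test_size=0.40, random_state=42)

# A fixed random_state gives the same shuffled split on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print('identical splits:', same_split)

Xa_train, Xa_test, ya_train, ya_test = split_a
print('train/test sizes:', Xa_train.shape[0], Xa_test.shape[0])
print('fraction of class 1 in train/test:', ya_train.mean(), ya_test.mean())
```

Unlike the manual slicing used for the diabetes data, `train_test_split` shuffles the samples before splitting, which is why `random_state` matters for reproducibility.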


# Training a ML model

When relying on the scikit-learn library, training a model is very simple. You need to:
*   Load the model from scikit
*   Declare a new instance of the model 
*   Fit the parameters
*   Make predictions for new data

Identify the above steps in the example code.



### Model 1. Linear Regression

In [None]:
#Load and declare a new instance
from sklearn import linear_model
regr = linear_model.LinearRegression()

In [None]:
#Fit (train) the model 
regr.fit(X_train, y_train)
print('Coefficients: \n', regr.coef_)
print('Intercept: \n', regr.intercept_)
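To get a feel for what `fit` computes for this model, the sketch below solves the same ordinary least-squares problem directly with NumPy and compares the result with scikit-learn. It reloads the data so that it is self-contained; the names `diab`, `X_bmi`, `y_bmi`, `A`, `slope`, `intercept` and `model` are illustrative assumptions, and this is not a description of scikit-learn's internal implementation.

```python
import numpy as np
from sklearn import datasets, linear_model

diab = datasets.load_diabetes()
X_bmi = diab.data[:-30, np.newaxis, 2]   # same feature and training rows as above
y_bmi = diab.target[:-30]

# Append a column of ones so the intercept is estimated together with the slope.
A = np.hstack([X_bmi, np.ones((X_bmi.shape[0], 1))])
(slope, intercept), *_ = np.linalg.lstsq(A, y_bmi, rcond=None)

model = linear_model.LinearRegression().fit(X_bmi, y_bmi)
print('least squares :', slope, intercept)
print('scikit-learn  :', model.coef_[0], model.intercept_)
```

Both lines should print nearly identical numbers: fitting a linear regression amounts to finding the slope and intercept that minimize the sum of squared errors on the training set.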

In [None]:
#Make predictions
y_pred = regr.predict(X_test)

In [None]:
#Evaluate the performance
from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print('R^2 score (coefficient of determination): %.2f' % r2_score(y_test, y_pred))
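The two metrics used above, the mean squared error and `r2_score` (the coefficient of determination), can be reproduced by hand. The sketch below does so on a small set of made-up values (`y_true` and `y_hat` are toy numbers chosen for illustration, not results from the diabetes data).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen only for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: average of the squared residuals.
mse = np.mean((y_true - y_hat) ** 2)

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, mean_squared_error(y_true, y_hat))
print(r2, r2_score(y_true, y_hat))
```

An R^2 of 1 means perfect predictions, while 0 corresponds to always predicting the mean of the targets; the value can be negative for a model that does worse than that.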

In [None]:
#Visualize 
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=3)
#plt.xticks(())
#plt.yticks(())
plt.xlabel('bmi')
plt.ylabel('diabetes progression')
plt.show()

### Model 2. K-Nearest Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knnClassifier = KNeighborsClassifier(n_neighbors=2)
knnClassifier.fit(X2_train, y2_train)
y2_pred_knn = knnClassifier.predict(X2_test)
print(y2_pred_knn)

In [None]:
#Compute accuracy on the training set
train_accuracy = knnClassifier.score(X2_train, y2_train)
    
#Compute accuracy on the test set
test_accuracy = knnClassifier.score(X2_test, y2_test) 

print('train accuracy', train_accuracy)
print('test accuracy',test_accuracy)

### Model 3. Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
gnbClassifier = GaussianNB()
gnbClassifier.fit(X2_train,y2_train)
y2_pred_gnb = gnbClassifier.predict(X2_test)
print(y2_pred_gnb)

In [None]:
#Compute accuracy on the training set (using the Naive Bayes classifier)
train_accuracy = gnbClassifier.score(X2_train, y2_train)
    
#Compute accuracy on the test set
test_accuracy = gnbClassifier.score(X2_test, y2_test) 

print('train accuracy', train_accuracy)
print('test accuracy',test_accuracy)

# QUESTIONS


1.   Why is it wrong to use all available data for training (no data split)?
2.   What is the effect of increasing the amount of training data? (In the regression example, try different amounts and plot the train and test errors for each case.)
3.   Only one feature, 'bmi', was used during training. Does using more information help? (Try adding new features.)
4.   How do we know if the learning was successful in each case?
5.   Which classification model is better, and why?
6.   For which of the above models is it useful to split the data into three groups (instead of two) to also consider a validation set?
7.   What is the best neighborhood size for the KNN classifier? What is the appropriate methodology to find this number?
8.   Naive Bayes classifiers are probabilistic classifiers. How do we recover the probabilistic information associated with this model? What quantities can we recover?
9.   We saw in the lecture that linear models were used for regression, while KNN and Naive Bayes were used for classification. Can we use linear models for classification? KNN or Naive Bayes for regression problems?



