# Training a Machine Learning Model With Scikit-Learn 

Based on Machine Learning Practices by Kevin Markham

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Reading Datasets
- Dataset: [Iris Dataset](http://archive.ics.uci.edu/ml/datasets/Iris)  

50 samples of each of 3 different species of Iris (150 samples in total)

Measurements:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

Species:
- Setosa
- Versicolor
- Virginica

- 150 Observations
- 4 Features: Sepal Length, Sepal Width, Petal Length, Petal Width
- Response is the iris species
- CLASSIFICATION problem: Response is categorical

In [2]:
# Import the load_iris function from the sklearn.datasets module

from sklearn.datasets import load_iris

In [3]:
iris = load_iris()

In [23]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
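The integer codes in `iris.target` are easier to read next to the species names. The following is a minimal sketch, assuming the `target_names` and `feature_names` attributes of the object returned by `load_iris`; the variable names `code_to_species` and `counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Map each integer code (0, 1, 2) to its species name
code_to_species = {code: str(name) for code, name in enumerate(iris.target_names)}
print(code_to_species)

# Count how many samples carry each code
counts = np.bincount(iris.target)
print(counts)

# Column order of the feature matrix iris.data
print(iris.feature_names)
```

The counts confirm that the dataset is balanced, with the same number of samples for each species.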

In [4]:
# load_iris returns a Bunch, scikit-learn's dictionary-like container

type(iris)

sklearn.utils.Bunch

In [5]:
X = iris.data
y = iris.target

In [8]:
X.shape, y.shape

((150, 4), (150,))

In [9]:
type(X), type(y)

(numpy.ndarray, numpy.ndarray)

### K-Nearest Neighbors (KNN) Classification Model

1. Pick a value for K
2. Search for the K observations in the training data that are nearest to the measurements of the unknown iris
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris
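
The three steps above can be sketched from scratch with NumPy. This is an illustration under stated assumptions, not scikit-learn's implementation: the helper `knn_predict` and the toy data are hypothetical, distance is plain Euclidean, and ties are broken by whichever class `Counter` encounters first.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from x_new to every training observation,
    # then keep the indices of the k smallest distances
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Step 3: the most popular response among those neighbors is the prediction
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated groups
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

# Step 1: pick a value for K
pred_a = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
pred_b = knn_predict(X_toy, y_toy, [5.1, 5.0], k=3)
print(pred_a, pred_b)  # 0 1
```

Because the two toy groups sit far apart, every one of the 3 nearest neighbors of each query point belongs to the same class, so the vote is unanimous.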

KNN is __VERY USEFUL__ when the classes in the dataset have __VERY DISSIMILAR FEATURE VALUES__, because observations of the same class then sit close together.

### Scikit-Learn 4-Step Modelling Pattern

#### STEP 1: Import the class you plan to use

In other words, import the model that you want to use.

In [11]:
from sklearn.neighbors import KNeighborsClassifier

#### STEP 2: Instantiate the Estimator

- Estimator: Scikit-Learn Model
- Instantiate: Make an instance of this model

In [31]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka 'hyperparameters') during this step
- All parameters not specified are set to their defaults
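
One way to check which values an estimator ended up with is `get_params()`, which scikit-learn estimators expose. A small sketch follows; the variable names `knn_default` and `knn_one` are chosen here for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# One estimator with every parameter left at its default,
# and one with n_neighbors set explicitly
knn_default = KNeighborsClassifier()
knn_one = KNeighborsClassifier(n_neighbors=1)

default_params = knn_default.get_params()
one_params = knn_one.get_params()

print(default_params["n_neighbors"])  # default number of neighbors
print(one_params["n_neighbors"])      # the value passed in
print(one_params["weights"])          # unspecified, so still the default
```

This is handy when the printed representation of an estimator only shows the parameters that differ from the defaults, as some scikit-learn versions do.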

In [32]:
knn

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

#### STEP 3: Fit the Model with Data (aka 'Model Training')
- Model is learning the relationship between X and y
- Occurs in place
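
"In place" means that `fit` stores what it learned on the estimator itself, so there is no need to assign the result to a new variable. A minimal sketch of that behavior, assuming `load_iris(return_X_y=True)` as a shortcut for obtaining `X` and `y`:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=1)
returned = knn.fit(X, y)

# fit() returns the very same object it was called on
print(returned is knn)

# The fitted state lives on the estimator, e.g. the class labels it has seen
print(knn.classes_)
```

Returning the estimator itself is also why a notebook cell that ends with `knn.fit(X, y)` displays the estimator as its output.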

In [33]:
knn.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

#### STEP 4: Predict the response for a new observation
- New observations are called 'out of sample' data
- Uses the information it learned during the model training process

In [34]:
knn.predict([[3,5,4,2]])

array([2])

In [35]:
knn.predict_proba([[3,5,4,2]])

array([[0., 0., 1.]])

- Returns a NumPy array
- Can predict for multiple observations at once

In [36]:
X_new = [[3,5,4,2],[5,4,3,2]]
knn.predict(X_new)

array([2, 1])

In [37]:
knn.predict_proba(X_new)

array([[0., 0., 1.],
       [0., 1., 0.]])
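
The predicted codes can be turned back into species names by indexing `iris.target_names` with the prediction array. A short sketch, assuming the same `X_new` as above; `predicted_species` is a name introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Integer codes, one per row of X_new
predicted_codes = knn.predict(X_new)

# Index the array of names with the codes to get readable labels
predicted_species = [str(name) for name in iris.target_names[predicted_codes]]
print(predicted_species)

# With K=1 the single nearest neighbor decides, so each row is all 0s and one 1
probabilities = knn.predict_proba(X_new)
print(probabilities)
```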

### Using a different value for K

In [40]:
# K=5 Neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X,y)
knn.predict(X_new)

array([1, 1])
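
To see how the choice of K changes the result, the same fit-and-predict pattern can be run in a loop. This is a sketch only: the value K=15 is an arbitrary extra value picked for illustration, and `predictions_by_k` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Refit a fresh model for each K and record its predictions
predictions_by_k = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    predictions_by_k[k] = model.predict(X_new).tolist()
    print(k, predictions_by_k[k])

# With uniform weights, each probability is the share of the K neighbors
# that voted for the class, so with K=5 every entry is a multiple of 0.2
proba_k5 = KNeighborsClassifier(n_neighbors=5).fit(X, y).predict_proba(X_new)
print(proba_k5)
```

Predictions that change with K are a sign that the new observation lies near a boundary between classes.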

### Using a different Classification Model

### Logistic Regression (Classification Model)

In [42]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X,y)
logreg.predict(X_new)

array([2, 0])

In [43]:
logreg.predict_proba(X_new)

array([[0.11183585, 0.00264262, 0.88552152],
       [0.87855906, 0.03542857, 0.08601237]])
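
Each row of `predict_proba` holds one probability per class, and the predicted class is the one with the highest probability. The sketch below checks both properties. Note that `max_iter=1000` is an assumption added here to give the solver room to converge, and that the default solver differs between scikit-learn versions, so the exact numbers may not match the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# max_iter=1000 is an assumption, not a setting taken from the notebook
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)

proba = logreg.predict_proba(X_new)

# Each row sums to 1 across the three classes
row_sums = proba.sum(axis=1)
print(row_sums)

# The class with the largest probability matches what predict() returns
most_likely = logreg.classes_[proba.argmax(axis=1)]
predicted = logreg.predict(X_new)
print(most_likely, predicted)
```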

In [45]:
logreg

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)