# Training a machine learning model with scikit-learn

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

# Store feature matrix in "X"
X = iris.data

# Store response vector in "y"
y = iris.target

## K-nearest neighbors (KNN) classification
1. Pick a value for K
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris
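
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it: the function name `knn_predict`, the toy data, and the use of Euclidean distance are all assumptions made for the example, and ties between classes are not handled in any principled way.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(X_train, y_train, x_new, k):
    """Predict a label for x_new by majority vote among its k nearest training points."""
    # Step 1: k is chosen by the caller.
    # Step 2: rank training observations by distance to x_new and keep the k nearest.
    ranked = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], x_new))
    nearest = ranked[:k]
    # Step 3: the most common response value among those neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data made up for illustration (not the iris dataset)
X_toy = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 1, 1, 1]

pred_near_zero = knn_predict(X_toy, y_toy, [1.1, 0.9], k=3)
pred_near_one = knn_predict(X_toy, y_toy, [5.0, 5.0], k=3)
print(pred_near_zero, pred_near_one)
```

With `k=3`, the first query point has two class-0 neighbors and one class-1 neighbor, so the vote goes to class 0; the second query point is surrounded by class-1 observations.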

In [3]:
print(X.shape)
print(y.shape)

(150, 4)
(150,)


## scikit-learn 4-step modeling pattern
**Step 1:** Import the class you plan to use

In [4]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"
- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

In [5]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to defaults

In [6]:
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
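
As a complement to printing the estimator, the parameters can also be read programmatically. The sketch below uses `get_params()`, which scikit-learn estimators provide; the variable name `knn_demo` is chosen here only to avoid overwriting `knn`.

```python
from sklearn.neighbors import KNeighborsClassifier

# A separate estimator, so the notebook's `knn` object is left untouched
knn_demo = KNeighborsClassifier(n_neighbors=1)

# get_params() returns a dict of parameter names and their current values
params = knn_demo.get_params()

print(params["n_neighbors"])  # set explicitly above
print(params["weights"])      # not specified, so it holds the default
```

The exact set of keys in the dictionary may vary between scikit-learn versions.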


**Step 3:** Fit the model with data (aka "model training")
- Model is learning the relationship between X and y
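- Occurs in-place: `fit` updates the estimator object itself, so the result does not need to be assigned to a new variable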

In [7]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
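
A small check can make the "in-place" behavior concrete. The sketch below assumes a fresh estimator named `knn_demo` (a name introduced here for illustration) and confirms that `fit` returns the same object it was called on, which is why the notebook prints the estimator after fitting.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn_demo = KNeighborsClassifier(n_neighbors=1)
returned = knn_demo.fit(X, y)

# fit() modifies knn_demo and returns that same object
same_object = returned is knn_demo
print(same_object)

# After fitting, learned attributes such as classes_ are available
classes = knn_demo.classes_
print(classes)
```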

**Step 4:** Predict the response for a new observation
- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process

In [9]:
# predict expects a 2D input (one row per observation), so wrap the single observation in a list
knn.predict([[3, 5, 4, 2]])



array([2])

- Returns a NumPy array
- Can predict for multiple observations at once

In [11]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)

array([2, 1])
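
The predicted values are integer class codes. One way to make them readable is to index into `iris.target_names`, as in this sketch; it refits its own estimator (`knn_demo`, a name used only here) so it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
pred = knn_demo.predict(X_new)

# target_names is a NumPy array of species names, so it can be indexed
# directly with the array of predicted class codes
names = iris.target_names[pred]
print(names)
```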

## Using a different value for K

In [12]:
# instantiate the model (using the value of K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)

array([1, 1])
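
Because only the `n_neighbors` argument changes between the two models, the comparison can be written as a loop. This is a minimal sketch; the dictionary `predictions_by_k` is a name introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

predictions_by_k = {}
for k in (1, 5):
    # Same 4-step pattern each time; only the hyperparameter differs
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X, y)
    predictions_by_k[k] = model.predict(X_new).tolist()

print(predictions_by_k)
```

Seeing the predictions side by side shows where the choice of K changes the outcome for a given observation.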

## Using a different classification model

In [13]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)

array([2, 0])
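
Logistic regression can also report a probability for each class through `predict_proba`, which may help when interpreting predictions that differ from KNN's. In the sketch below, `max_iter=1000` is an assumption added to help the solver converge on this data; it is not part of the original cell, and the exact probabilities depend on the scikit-learn version and solver.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# max_iter raised from the default as an assumption for this example
logreg_demo = LogisticRegression(max_iter=1000)
logreg_demo.fit(X, y)

# One row per observation, one column per class; each row sums to 1
proba = logreg_demo.predict_proba(X_new)
print(proba.round(3))
```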