## Dataset Loading

The variables of a dataset are called its *features*.
*Feature matrix* − The collection of features, when there is more than one.
*Feature names* − The list of the names of all the features.

*Response* − The output variable, which depends on the feature variables.
*Target names* − The possible values taken by the response vector.

[tutorial source](https://www.tutorialspoint.com/scikit_learn/scikit_learn_modelling_process.htm)

In [1]:
from sklearn.datasets import load_iris

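> Scikit-learn ships with a few example datasets, such as iris and digits for classification and the Boston house prices for regression. (Note: the Boston dataset was removed in scikit-learn 1.2; `load_diabetes` is a regression dataset that is still bundled.)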

In [2]:
iris_flower = load_iris()
x = iris_flower.data # features: sepal length and width, petal length and width; 150 samples, so a 150 x 4 matrix
y = iris_flower.target # class: setosa, versicolor, virginica
feature_names = iris_flower.feature_names
target_names = iris_flower.target_names
print("Features Names: ", feature_names)
print("Target Names: ", target_names)
print("x first 10 rows\n", x[:10])
print("y first 10 rows\n", y[:10])

Features Names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target Names:  ['setosa' 'versicolor' 'virginica']
x first 10 rows
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
y first 10 rows
 [0 0 0 0 0 0 0 0 0 0]
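
The integer labels in `y` index into `target_names`, so the response vector can be mapped back to readable class names. The following is a minimal sketch, assuming the same `load_iris()` call as above; the names `label_names` and `class_counts` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_flower = load_iris()
x = iris_flower.data                      # feature matrix, one row per flower
y = iris_flower.target                    # response vector of integer class labels
target_names = iris_flower.target_names   # array of class names

# Index target_names with the integer labels to recover class names
label_names = target_names[y]

# Count how many samples belong to each class
class_counts = np.bincount(y)

print(x.shape, y.shape)
print(label_names[:3])
print(dict(zip(target_names, class_counts)))
```

The first ten labels printed above are all 0 because the rows of this dataset are ordered by class, which is one reason the data should be shuffled before it is split.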


### Splitting the dataset
To check how accurate the model is on data it has not seen, we split the dataset into two pieces: a **training set** and a **testing set**.

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
# x is a feature matrix
# y is a response vector
# test_size - ratio of the test data to the total given data
# 150 rows * 0.3 = 45 rows in this case
# random_state=1 - fixes the seed of the shuffle so that the split is the same on every run; this gives reproducible results, which is useful when debugging.
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

# NOTE: the training data is used to learn patterns; the testing data is used to check accuracy

(105, 4)
(45, 4)
(105,)
(45,)
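
The effect of `random_state` can be checked directly by splitting twice with the same seed and comparing the results. This is a small sketch rather than part of the original notebook; the suffixed names (`x_test_a`, `x_test_b`, `x_test_c`) and the second seed value `2` are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# Two calls with the same random_state produce the same split
_, x_test_a, _, y_test_a = train_test_split(x, y, test_size=0.3, random_state=1)
_, x_test_b, _, y_test_b = train_test_split(x, y, test_size=0.3, random_state=1)
same_split = np.array_equal(x_test_a, x_test_b) and np.array_equal(y_test_a, y_test_b)

# A different seed shuffles the rows differently
_, x_test_c, _, _ = train_test_split(x, y, test_size=0.3, random_state=2)
different_split = not np.array_equal(x_test_a, x_test_c)

print(same_split, different_split)
```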


### Training the Model

In [5]:
from sklearn.neighbors import KNeighborsClassifier

# KNeighborsClassifier is a machine learning model from scikit-learn. It classifies a new sample by looking at the labeled training samples most similar to it (its nearest neighbors) and taking a majority vote.

In [6]:
model = KNeighborsClassifier(n_neighbors=3)  # looks at 3 neighbors
model.fit(x_train, y_train)  # train from training data, model fitting

# now use the fitted model to predict
# model.predict([[5.1, 3.5, 1.4, 0.2]]) # new flower data
# output: array([0]), i.e. class setosa
model.predict([[2.7, 2.3, 3.3, 2.8]]) # versicolor

array([1])
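
To make the "3 neighbors" idea concrete, here is a rough hand-written version of the nearest-neighbor vote using NumPy. It is a sketch under simplifying assumptions (plain Euclidean distance, ties broken in favor of the smallest label), not scikit-learn's actual implementation, and `knn_predict` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def knn_predict(x_train, y_train, query, k=3):
    """Hypothetical helper: classify one sample by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training row
    distances = np.linalg.norm(x_train - query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbors (ties go to the smallest label)
    return int(np.bincount(y_train[nearest]).argmax())


x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

# Same sample as the commented-out prediction above
predicted = knn_predict(x_train, y_train, np.array([5.1, 3.5, 1.4, 0.2]))
print(predicted)
```

`KNeighborsClassifier` does the same kind of lookup with more efficient neighbor search, so for real work the library class is the better choice.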

##### Where do we use testing data then?

We pass `x_test` to `model.predict` and then compare the predictions with `y_test` to check whether they are right.

In [7]:
from sklearn.metrics import accuracy_score

In [8]:
predictions = model.predict(x_test)
accuracy = accuracy_score(y_test, predictions)

print("Model's Accuracy:", accuracy)  # Model's Accuracy: 0.9777777777777777

Model's Accuracy: 0.9777777777777777
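
Accuracy is simply the fraction of test samples whose predicted label matches the true label. The sketch below recomputes it by hand and compares it with `accuracy_score`; the names `manual_accuracy`, `library_accuracy`, and `n_wrong` are illustrative, and the setup repeats the cells above so that the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)
predictions = model.predict(x_test)

# Fraction of predictions that equal the true labels
manual_accuracy = np.mean(predictions == y_test)
library_accuracy = accuracy_score(y_test, predictions)

# Number of misclassified test samples
n_wrong = int(np.sum(predictions != y_test))

print(manual_accuracy, library_accuracy, n_wrong)
```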
