## The best way to think about data within Scikit-Learn is in terms of tables of data. For example, consider the Iris dataset, famously analyzed by Ronald Fisher in 1936. 

## In general, we will refer to the rows of the matrix as samples, and the number of rows as n_samples.

## Likewise, we will refer to the columns of the matrix as features, and the number of columns as n_features.

## This table layout makes clear that the information can be thought of as a two-dimensional numerical array or matrix, which we will call the features matrix. By convention, this features matrix is often stored in a variable named X.

## The features matrix is assumed to be two-dimensional, with shape [n_samples, n_features], and is most often contained in a NumPy array or a Pandas DataFrame, though some Scikit-Learn models also accept SciPy sparse matrices.

## The samples (i.e., rows) always refer to the individual objects described by the dataset.

## For example, the sample might be a flower, a person, a document, an image, a sound file, a video, an astronomical object, or anything else you can describe with a set of quantitative measurements.

## The features (i.e., columns) always refer to the distinct observations that describe each sample in a quantitative manner. Features are generally real-valued, but may be Boolean or discrete-valued in some cases.

## In addition to the features matrix X, we also generally work with a label or target array, which by convention we will usually call y. The target array is usually one-dimensional, with length n_samples, and is generally contained in a NumPy array or Pandas Series.

In [27]:
import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
X_iris = iris_dataset['data']
y_iris = iris_dataset['target']
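
## As a quick check of the layout described above, the following sketch loads the same dataset and prints the shapes of the features matrix and target array. The variable names iris, X, and y are chosen here for illustration only.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris["data"], iris["target"]

# Features matrix: one row per flower, one column per measurement.
print(X.shape)                # (150, 4), i.e. (n_samples, n_features)
print(iris["feature_names"])  # names of the four columns

# Target array: one integer class label per row of X.
print(y.shape)                # (150,)
print(iris["target_names"])   # the three species encoded as 0, 1, 2
```

## Here n_samples is 150 and n_features is 4, and the target array has exactly one entry per row of X.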



# Supervised learning example: Iris classification

## Our question will be this: given a model trained on a portion of the Iris data, how well can we predict the remaining labels?

## For this task, we will use an extremely simple generative model known as Gaussian naive Bayes, which proceeds by assuming each class is drawn from an axis-aligned Gaussian distribution. 

## Because it is so fast and has almost no hyperparameters to tune, Gaussian naive Bayes is often a good model to use as a baseline classifier, before you explore whether improvements can be found through more sophisticated models.
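
## To make the "axis-aligned Gaussian" assumption concrete, here is a minimal NumPy sketch of the idea: each class gets one mean and one variance per feature, with no covariance terms, and a sample is assigned to the class with the highest log prior plus log likelihood. This is an illustration rather than Scikit-Learn's actual implementation (which, among other things, adds a small smoothing term to the variances); the helper log_joint and the other names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Per-class statistics: one mean and one variance per feature, no covariances.
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])
log_priors = np.log(np.array([np.mean(y == c) for c in classes]))


def log_joint(samples):
    """Return log p(class) + log p(x | class) for every sample and class."""
    # diff has shape (n_samples, n_classes, n_features)
    diff = samples[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :]
        + diff ** 2 / variances[None, :, :]
    ).sum(axis=2)
    return log_lik + log_priors[None, :]


# Pick the most probable class for each sample.
manual_pred = classes[np.argmax(log_joint(X), axis=1)]

# Compare against the library model fit on the same data.
sklearn_pred = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sklearn_pred)
print(agreement)
```

## Summing the per-feature log densities is what treating the features as independent within each class amounts to, and it is the reason the model is so cheap to fit.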

## We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set. This could be done by hand, but it is more convenient to use the train_test_split utility function.

In [28]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)
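
## It can help to confirm what the split produced. The sketch below repeats the split on a fresh copy of the data and prints the resulting shapes; when test_size is not given, train_test_split holds out 25% of the samples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same call as above: default test_size (0.25) and a fixed seed for repeatability.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)

print(Xtrain.shape, Xtest.shape)  # (112, 4) (38, 4)
print(ytrain.shape, ytest.shape)  # (112,) (38,)
```

## Fixing random_state makes the shuffle reproducible, so repeated runs use the same 112 training and 38 testing samples.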

## With the data arranged, we can follow our recipe to predict the labels.

In [32]:
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(Xtrain, ytrain) # 3. fit model to data
y_model = model.predict(Xtest) # 4. predict on new data
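
## The predictions in y_model are integer class codes. As a small illustration, the sketch below reruns the same steps and maps the codes back to species names through the dataset's target_names array; the names predicted_species and true_species are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    iris["data"], iris["target"], random_state=1
)
model = GaussianNB().fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

# Integer labels index directly into the array of species names.
predicted_species = iris["target_names"][y_model]
true_species = iris["target_names"][ytest]

# Show the first few predictions next to the true species.
for predicted, true in zip(predicted_species[:5], true_species[:5]):
    print(f"predicted={predicted:<12} true={true}")
```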

## Finally, we can use the accuracy_score utility to see the fraction of predicted labels that match their true value:

In [34]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

0.9736842105263158
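
## The accuracy is simply the fraction of test samples whose predicted label equals the true label. The sketch below recomputes it by hand with NumPy and also prints a confusion matrix, which shows which classes are mistaken for which. The variable names manual_accuracy, library_accuracy, and cm are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)
y_model = GaussianNB().fit(Xtrain, ytrain).predict(Xtest)

# Fraction of matching labels, computed by hand and by the library.
manual_accuracy = np.mean(y_model == ytest)
library_accuracy = accuracy_score(ytest, y_model)
print(manual_accuracy, library_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(ytest, y_model)
print(cm)
```

## With 38 test samples, the value 0.9736842105263158 reported above equals 37/38, meaning a single flower was mislabeled.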

## With an accuracy topping 97%, we see that even this very naive classification algorithm is effective for this particular dataset!