## Machine learning: the problem setting

In general, an ML problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number (for instance, a multi-dimensional entry, also known as multivariate data), it is said to have several attributes or features.

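Learning problems fall into a few categories:
* supervised learning, in which the data comes with the attributes we want to predict:
  + classification: predicting a discrete category
  + regression: predicting a continuous value
* unsupervised learning, in which the training data has no corresponding target values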

__Training set and testing set__

Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
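
The split described above can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration, not the only way to split data: the Iris dataset, the 25% test fraction, and the fixed `random_state` are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a small built-in dataset as (features, targets).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for testing (an arbitrary choice for this sketch).
# stratify=y keeps the class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
```

The model is fitted only on `X_train` and `y_train`; `X_test` and `y_test` are kept aside to check how well the learned properties carry over to unseen data.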

To solve these problems, we use scikit-learn.



# What is Scikit-learn?

- simple and efficient tool for data analysis and data mining
- built on NumPy, SciPy, and matplotlib
- open source and commercially usable under the Berkeley Software Distribution (BSD) license

# What can we achieve using Scikit-learn?

Scikit-learn covers several categories of problems:
- Classification: identifying which category an object belongs to. For example, spam detection
- Regression: predicting a continuous-valued attribute associated with an object. For example, stock price prediction
- Clustering: automatically grouping similar objects into sets. For example, customer segmentation
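
As a rough sketch of how the three task types look in code, the example below fits one estimator per task. The built-in Iris and diabetes datasets stand in for the real-world examples above, and the specific estimators (`LogisticRegression`, `LinearRegression`, `KMeans`) and their parameters are assumptions picked for brevity, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
X_diab, y_diab = load_diabetes(return_X_y=True)

# Classification: predict a discrete class label.
clf = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)
print("Predicted class:", clf.predict(X_iris[:1]))

# Regression: predict a continuous value.
reg = LinearRegression().fit(X_diab, y_diab)
print("Predicted value:", reg.predict(X_diab[:1]))

# Clustering: group samples without using any labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_iris)
cluster_labels = km.labels_
print("Cluster of first sample:", cluster_labels[0])
```

All three estimators share the same `fit` interface; the supervised ones take both `X` and `y`, while the clustering estimator takes only `X`.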

Furthermore, scikit-learn provides some more advanced features:
- Model selection: comparing, validating, and choosing models and parameters
- Dimensionality reduction: reducing the number of random variables to consider. In practice, it is used to increase model efficiency
- Preprocessing: feature extraction and normalization. With this, we can transform input data such as text for use with machine learning algorithms
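
These three features are often combined. The sketch below chains normalization (`StandardScaler`), dimensionality reduction (`PCA`), and a classifier in a `Pipeline`, then uses `GridSearchCV` to choose a parameter by cross-validation. The dataset, the step names (`"scale"`, `"pca"`, `"knn"`), and the candidate parameter values are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),      # preprocessing: normalize each feature
    ("pca", PCA(n_components=2)),     # dimensionality reduction: 4 features -> 2
    ("knn", KNeighborsClassifier()),  # the model being tuned
])

# Model selection: try several neighbor counts with 5-fold cross-validation.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Putting the scaler and PCA inside the pipeline means they are refitted on each training fold, so the validation folds do not leak into the preprocessing.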

## Start with Scikit-Learn

### Install

* `pip install -U scikit-learn`
  
### Import

* `import sklearn`
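
A quick way to confirm that the installation worked is to print the installed version. This is only a sanity check; the version string you see depends on your environment.

```python
import sklearn

# Print the installed scikit-learn version to confirm the import works.
version = sklearn.__version__
print("scikit-learn version:", version)
```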
  
### Loading a dataset with scikit-learn

A collection of data is called a dataset. It has the following two components:

- Features − The variables of the data are called its features. They are also known as predictors, inputs, or attributes.

  + Feature matrix − The collection of features, when there is more than one.

  + Feature names − The list of the names of all the features.

- Response − The output variable, which depends on the feature variables. It is also known as the target, label, or output.

  + Response vector − It represents the response column. Generally, there is just one response column.

  + Target names − The possible values taken by the response vector.
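
To make these terms concrete, here is a tiny hand-built dataset in NumPy. All names and values (`height_cm`, `weight_kg`, `small`, `large`) are hypothetical and exist only for this sketch.

```python
import numpy as np

# Feature names: one name per column of the feature matrix.
feature_names = ["height_cm", "weight_kg"]

# Feature matrix: one row per sample, one column per feature.
X = np.array([
    [170.0, 65.0],
    [160.0, 55.0],
    [180.0, 80.0],
    [175.0, 72.0],
])

# Target names: the possible values of the response.
target_names = np.array(["small", "large"])

# Response vector: one integer label per sample, indexing into target_names.
y = np.array([0, 0, 1, 1])

print("Feature matrix shape (n_samples, n_features):", X.shape)
print("Response vector shape (n_samples,):", y.shape)
print("Labels as names:", target_names[y])
```

Scikit-learn's built-in datasets follow the same layout: a 2D feature matrix, a 1D response vector, and separate lists of feature and target names.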

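Scikit-learn ships with a few example datasets, such as `iris` and `digits` for classification and `diabetes` for regression. Older tutorials use the Boston house prices dataset for regression, but it was removed in scikit-learn 1.2.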
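
The bundled datasets are all loaded through `sklearn.datasets`. As a small sketch, the code below loads `digits` (classification) and `diabetes` (regression); `diabetes` is used here as the regression example because the Boston house prices loader is no longer available in scikit-learn 1.2 and later.

```python
from sklearn.datasets import load_diabetes, load_digits

# Classification example: 8x8 images of handwritten digits, flattened to 64 features.
digits = load_digits()
print("digits data shape:", digits.data.shape)
print("digits classes:", digits.target_names)

# Regression example: disease progression predicted from 10 baseline variables.
diabetes = load_diabetes()
print("diabetes data shape:", diabetes.data.shape)
print("diabetes features:", diabetes.feature_names)
```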

Let’s start by loading a simple dataset named `Iris` using scikit-learn.


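```python
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn.
iris = load_iris()

# Feature matrix (one row per flower) and response vector (species index).
X = iris.data
y = iris.target

# Names of the feature columns and of the target classes.
feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])
```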


```
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 10 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
```


`Iris` is one of the classic and most studied datasets in machine learning and statistics. It contains measurements of 150 iris flowers belonging to three species: Iris setosa, Iris versicolor, and Iris virginica. Each flower is described by four features: sepal length, sepal width, petal length, and petal width, all in centimeters.

Using the Iris dataset in machine learning helps us understand how a model can learn to recognize and classify real-world objects (in this case, flowers) from measurements of their physical characteristics.
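
As a hedged illustration of that idea, the sketch below trains a k-nearest-neighbors classifier on part of the Iris data, checks its accuracy on the held-out part, and predicts the species of one new flower. The choice of `KNeighborsClassifier`, `n_neighbors=5`, the 25% test fraction, and the measurements in `new_flower` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Keep 25% of the flowers aside to test the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# Fit the classifier on the training flowers only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Fraction of test flowers whose species is predicted correctly.
accuracy = model.score(X_test, y_test)
print("Test accuracy:", accuracy)

# Hypothetical new flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted = iris.target_names[model.predict(new_flower)[0]]
print("Predicted species:", predicted)
```

The model outputs an integer class index; indexing `iris.target_names` with it turns the prediction back into a species name.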

![](https://th.bing.com/th/id/OIP.d5wc_Y_h7MEeF8vQhTm5cwAAAA?rs=1&pid=ImgDetMain)

Source: 
- [scikit-learn](https://scikit-learn.org/stable/tutorial/basic/tutorial.html#machine-learning-the-problem-setting)
- [tutorialspoint](https://www.tutorialspoint.com/scikit_learn/scikit_learn_modelling_process.htm)