## Introduction to sklearn

scikit-learn (https://scikit-learn.org/stable/, also known as sklearn) is an excellent machine learning library written in Python.

In this section, we illustrate how to train a machine learning model using sklearn. The modules that we use include:
* [sklearn.datasets](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets): utilities to load various machine learning datasets.
* [sklearn.model_selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection): a collection of utilities for model selection, including dataset splitting and cross-validation.
* [sklearn.linear_model](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model): a collection of linear models, including ordinary least squares (OLS) and logistic regression.
* [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics): a collection of performance metrics, including accuracy and MSE.

**Predicting house prices using linear regression**

We will build a linear regression model using the Boston house price dataset. The dataset has 506 examples and 13 features.

In [1]:
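# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston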
X, y = load_boston(return_X_y=True)
X.shape, y.shape

((506, 13), (506,))

We randomly split the data into a 70% training set and a 30% test set, fixing `random_state` so that the split is reproducible.

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.shape, X_test.shape

((354, 13), (152, 13))

Now we train and evaluate an ordinary least squares model.

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

reg = LinearRegression()
reg.fit(X_train, y_train)
print("R2 (train) = ", reg.score(X_train, y_train))
print("MSE (train) = ", mean_squared_error(y_train, reg.predict(X_train)))
print("MSE (test) = ", mean_squared_error(y_test, reg.predict(X_test)))

R2 (train) =  0.7434997532004697
MSE (train) =  22.545481487421423
MSE (test) =  21.517444231176903
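
To make the two reported metrics concrete, here is a minimal sketch that recomputes MSE and R2 by hand and compares them with sklearn's functions. The small arrays `y_true` and `y_pred` are made-up illustration values, not taken from the Boston data, and `r2_score` is used here as the standalone counterpart of `reg.score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE is the mean of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2 is one minus the ratio of the residual sum of squares
# to the total sum of squares around the mean of y_true.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R2: ", r2_manual, r2_score(y_true, y_pred))
```

A lower MSE is better, while an R2 closer to 1 is better. Comparing the training and test values of the same metric gives a rough indication of how well the model generalizes.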


We can see that sklearn offers a highly streamlined process for building a machine learning model: create an estimator, call `fit` on the training data, then call `predict` or `score`. Because estimators share this interface, different models can be swapped in a plug-and-play fashion. For example, to train a ridge regression model, we simply need to replace the line `reg = LinearRegression()` by `reg = Ridge()` (after importing `Ridge` from `sklearn.linear_model`). sklearn also contains implementations of many more sophisticated regression models, such as `sklearn.svm.SVR` for support vector regression, `sklearn.tree.DecisionTreeRegressor` for decision tree regression, and `sklearn.ensemble.RandomForestRegressor` for random forest regression.
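
The plug-and-play idea can be sketched as a loop over several estimators that share the same `fit`/`predict` interface. This is only an illustration under stated assumptions: it uses `load_diabetes` instead of the Boston data (an assumption made so that the sketch runs on recent scikit-learn releases), and the dictionary names `models` and `test_mse` are hypothetical.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: the diabetes dataset stands in for the Boston dataset.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Every estimator exposes fit() and predict(), so only this dict changes.
models = {
    "ols": LinearRegression(),
    "ridge": Ridge(),
    "tree": DecisionTreeRegressor(random_state=0),
}

test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE (test) = {test_mse[name]:.2f}")
```

Adding another model, such as `sklearn.svm.SVR`, should only require one more entry in `models`.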

Most algorithms have hyperparameters (for example, the regularization strength `alpha` of `Ridge`) that often need to be tuned. If you use the default values, make sure that you know what they are, and tune them if the model does not perform well.
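
One possible way to inspect defaults and tune a hyperparameter is a cross-validated grid search, sketched below. The choices here are assumptions for illustration: `load_diabetes` replaces the Boston data so the sketch runs on recent scikit-learn releases, and the candidate `alpha` values and 5-fold setting are arbitrary rather than recommended.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: the diabetes dataset stands in for the Boston dataset.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# get_params() lists the hyperparameters and their current (default) values.
defaults = Ridge().get_params()
print("default alpha:", defaults["alpha"])

# Arbitrary illustrative grid of regularization strengths.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error"
)
search.fit(X_train, y_train)  # tuning uses the training set only

print("best alpha:", search.best_params_["alpha"])
print("cross-validated MSE:", -search.best_score_)
print("R2 (test):", search.score(X_test, y_test) if False else search.best_estimator_.score(X_test, y_test))
```

The test set is kept out of the search, so it can still serve as an unbiased check once the hyperparameter has been chosen.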

**Going further**

You will find it helpful to browse the following links:

* A full list of the sklearn modules: https://scikit-learn.org/stable/modules/classes.html. 
* A user guide: https://scikit-learn.org/stable/user_guide.html.
* A gallery of examples: https://scikit-learn.org/stable/auto_examples/index.html.
