<div style="font-size:18pt; padding-top:20px; text-align:center">SEMINAR. <b>Regularization and Cross-Validation</b></div><hr>
<div style="text-align:right;">Папулин С.Ю. <span style="font-style: italic;font-weight: bold;">(papulin.study@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">Regularization</a></li>
        <li><a href="#2">Cross-validation</a></li>
        <li><a href="#3">Regularization and cross-validation</a></li>
        <li><a href="#4">Model selection</a></li>
        <li><a href="#5">References</a>
        </li>
    </ol>
</div>

<p><b>Importing libraries</b></p>

In [None]:
from sklearn import datasets
from scipy import stats
import numpy as np

from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn import linear_model
import matplotlib.pyplot as plt

%matplotlib inline

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:16pt; font-weight:bold">1. Regularization</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

### Regression

In [None]:
from sklearn.linear_model import Ridge, Lasso

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from numpy.polynomial.polynomial import polyval

In [None]:
def regression_dataset(n=100):
    x = stats.uniform.rvs(size=n, loc=0, scale=6, random_state=0)
    y = stats.norm.rvs(size=n, loc=0, scale=0.5, random_state=0) + np.sin(x)
    return (x, y)

In [None]:
x, y = regression_dataset()

In [None]:
plt.plot(x, y, "o")
plt.title("Initial data")
plt.grid(True)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

<p>Building the training and test subsets</p>

In [None]:
X_ = x.reshape(-1, 1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.3, random_state=200)

<p>Training</p>

In [None]:
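# alpha=0 disables the penalty, so this is an ordinary least-squares fit.
# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell assumes an older release.
pl_lr = Pipeline([("plF", PolynomialFeatures(degree=15)), ("lr", Ridge(alpha=0, 
                                                                       normalize=True, 
                                                                       fit_intercept=True))])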

pl_lr.fit(X_train, y_train)

<p>Training result</p>

In [None]:
print("Coefficients: ", pl_lr.named_steps["lr"].coef_)
print("Intercept: ", pl_lr.named_steps["lr"].intercept_)

Prediction function:

In [None]:
f_x = lambda x : polyval(x, pl_lr.named_steps["lr"].coef_) + pl_lr.named_steps["lr"].intercept_

or

In [None]:
y_train_pred = pl_lr.predict(X_train)
y_train_pred[:5]

Regression function plot:

In [None]:
xx = np.linspace(0, 6, 100).reshape(-1, 1)

plt.figure(figsize=[12, 4])

plt.subplot(1,2,1)

plt.title("Train data")
plt.plot(X_train, y_train, "o")
plt.plot(xx, pl_lr.predict(xx), color="red", lw=2)
plt.plot(X_train, y_train_pred, "o", color="red", lw=2)
plt.vlines(X_train, ymin=y_train, ymax=f_x(X_train), colors="black", linestyles="dotted")
plt.xlabel("X")
plt.ylabel("Y")
plt.grid(True)


plt.subplot(1,2,2)

plt.title("Test data")
plt.plot(X_test, y_test, "o")
plt.plot(xx, pl_lr.predict(xx), color="red", lw=2)
plt.plot(X_test, pl_lr.predict(X_test), "o", color="red", lw=2)
plt.vlines(X_test, ymin=y_test, ymax=pl_lr.predict(X_test), colors="black", linestyles="dotted")
plt.xlabel("X")
plt.ylabel("Y")
plt.grid(True)

plt.tight_layout()

plt.show()

<p>Evaluation on the test subset</p>

In [None]:
pl_lr.score(X_test, y_test)

<p><b>Regularization</b></p>

In [None]:
pl_reg = Pipeline([("plF", PolynomialFeatures(degree=15)), ("lr", Ridge(alpha=0.01, 
                                                                        normalize=True, 
                                                                        fit_intercept=True))])
#pl_reg = Pipeline([("plF", PolynomialFeatures(degree=20)), ("lr", Lasso(alpha=0.1, fit_intercept=True))])

pl_reg.fit(X_train, y_train)

In [None]:
print("Coefficients: ", pl_reg.named_steps["lr"].coef_)
print("Intercept: ", pl_reg.named_steps["lr"].intercept_)

Regression function plot:

In [None]:
xx = np.linspace(0, 6, 100).reshape(-1, 1)

plt.figure(figsize=[12, 4])

plt.subplot(1,2,1)

plt.title("Train data")
plt.plot(X_train, y_train, "o")
plt.plot(xx, pl_reg.predict(xx), color="red", lw=2)
plt.plot(X_train, pl_reg.predict(X_train), "o", color="red", lw=2)
plt.vlines(X_train, ymin=y_train, ymax=pl_reg.predict(X_train), colors="black", linestyles="dotted")
plt.xlabel("X")
plt.ylabel("Y")
plt.grid(True)


plt.subplot(1,2,2)

plt.title("Test data")
plt.plot(X_test, y_test, "o")
plt.plot(xx, pl_reg.predict(xx), color="red", lw=2)
plt.plot(X_test, pl_reg.predict(X_test), "o", color="red", lw=2)
plt.vlines(X_test, ymin=y_test, ymax=pl_reg.predict(X_test), colors="black", linestyles="dotted")
plt.xlabel("X")
plt.ylabel("Y")
plt.grid(True)

plt.tight_layout()

plt.show()

Coefficient of determination (R^2):

In [None]:
pl_reg.score(X_train, y_train)

In [None]:
pl_reg.score(X_test, y_test)
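
To see what the penalty does to the model, the sketch below refits a degree-15 polynomial for a few values of `alpha` and prints the L2 norm of the fitted coefficients. It is an illustration under stated assumptions, not the seminar's own code: the data is regenerated with NumPy's random generator, and a `StandardScaler` step is assumed in place of the `normalize` argument, which newer scikit-learn releases no longer accept. The names `X_demo`, `y_demo`, `coef_norm`, and `demo_alphas` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical demo data: noisy sine on [0, 6], similar in spirit to regression_dataset()
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 6, size=100)
y_demo = np.sin(x_demo) + rng.normal(0, 0.5, size=100)
X_demo = x_demo.reshape(-1, 1)


def coef_norm(alpha):
    """Fit a degree-15 polynomial ridge model and return the L2 norm of its coefficients."""
    model = Pipeline([
        ("plF", PolynomialFeatures(degree=15, include_bias=False)),
        ("scale", StandardScaler()),  # assumed replacement for normalize=True
        ("lr", Ridge(alpha=alpha)),
    ])
    model.fit(X_demo, y_demo)
    return float(np.linalg.norm(model.named_steps["lr"].coef_))


demo_alphas = [1e-3, 1e-1, 10.0]
coef_norms = [coef_norm(a) for a in demo_alphas]
for a, n in zip(demo_alphas, coef_norms):
    print(f"alpha={a:g}  ||coef||_2={n:.3f}")
```

The coefficient norm shrinks as `alpha` grows, which is what smooths the fitted curve. Note that the coefficients here refer to standardized features, so they are not directly comparable with the values printed above.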

### Classification

In [None]:
from sklearn.linear_model import LogisticRegression, RidgeClassifier

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">LogisticRegression</a>

Loading the source data:

In [None]:
digits = datasets.load_digits()
print(digits.DESCR)

In [None]:
digits.keys()

In [None]:
IMAGE_INDX = 3

print("Target value:", digits.target[IMAGE_INDX])

plt.imshow(digits.images[IMAGE_INDX])
plt.show()

Transforming the source data:

In [None]:
X = digits["images"].reshape(len(digits["images"]), -1)
X.shape

In [None]:
y = digits["target"]
y.shape

Number of samples in each class:

In [None]:
np.unique(y, return_counts=True)

Building the training and test subsets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=200)

Training the model:

In [None]:
logr_model = LogisticRegression(penalty="l2", C=1, multi_class="ovr", solver="liblinear", random_state=12345)
logr_model.fit(X_train, y_train)

Evaluating model quality:

In [None]:
# train accuracy (score() returns mean accuracy, not an error rate)
train_accuracy = logr_model.score(X_train, y_train)
train_accuracy

In [None]:
# test accuracy
test_accuracy = logr_model.score(X_test, y_test)
test_accuracy

In [None]:
print("Target value:", digits.target[IMAGE_INDX])
print("Predicted value:", logr_model.predict(digits["images"][IMAGE_INDX].reshape(1, -1)))
plt.imshow(digits.images[IMAGE_INDX])
plt.show()
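
In `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values mean a stronger L2 penalty. The sketch below illustrates this on the digits data by printing the norm of the weight matrix for three values of `C`. It is only a sketch: it assumes the default solver rather than the `liblinear` setup used above, scales pixels to [0, 1] to help convergence, and fits on the full dataset for brevity. The names `digits_demo`, `X_demo`, `y_demo`, `demo_Cs`, and `weight_norms` are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

digits_demo = datasets.load_digits()
X_demo = digits_demo.data / 16.0  # pixel intensities lie in 0..16
y_demo = digits_demo.target

demo_Cs = [0.001, 0.1, 10.0]
weight_norms = []
for C in demo_Cs:
    # Default L2 penalty; smaller C means stronger regularization
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_demo, y_demo)
    weight_norms.append(float(np.linalg.norm(clf.coef_)))
    print(f"C={C:g}  ||coef||={weight_norms[-1]:.2f}  "
          f"train accuracy={clf.score(X_demo, y_demo):.3f}")
```

Weights grow as `C` increases, because the penalty on them becomes weaker.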

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:16pt; font-weight:bold">2. Cross-validation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

### Holdout

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x = np.array([0,1,2,3,4,5,6,7,8,9])
y = np.array([1,1,0,0,1,0,1,1,0,0])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

In [None]:
x_train.shape, y_train.shape

In [None]:
x_train, y_train

### K-Folds

In [None]:
from sklearn.model_selection import cross_val_score, cross_validate, KFold

In [None]:
kf = KFold(n_splits=3, shuffle=True, random_state=0)
kf

In [None]:
splits = kf.split(x, y)

In [None]:
for train_index, test_index in splits:
    print(train_index, test_index)
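
The index pairs printed above are all that K-fold cross-validation needs: fit on the training indices, score on the held-out indices, and repeat for each fold. The sketch below does this by hand and compares the result with `cross_val_score` on the same splitter. The data and the names `X_demo`, `y_demo`, `kf_demo`, `manual_mse`, and `auto_mse` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical demo data: noisy sine on [0, 6]
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(60, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, size=60)

kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

# Manual loop: one model per fold, scored on the held-out part
manual_mse = []
for train_index, test_index in kf_demo.split(X_demo):
    model = LinearRegression().fit(X_demo[train_index], y_demo[train_index])
    pred = model.predict(X_demo[test_index])
    manual_mse.append(mean_squared_error(y_demo[test_index], pred))
manual_mse = np.array(manual_mse)

# Same procedure via cross_val_score; the scorer returns negated MSE
auto_mse = -cross_val_score(LinearRegression(), X_demo, y_demo,
                            cv=kf_demo, scoring="neg_mean_squared_error")

print(manual_mse)
print(auto_mse)
```

Because `kf_demo` has a fixed `random_state`, both runs use identical folds and the two arrays agree.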

#### cross_val_score

In [None]:
x, y = regression_dataset()
x_ = x.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x_, y, test_size=0.3, random_state=200)

In [None]:
lr_model = linear_model.LinearRegression(fit_intercept=True)

In [None]:
scores = cross_val_score(lr_model, x_train, y_train, cv=4, scoring="neg_mean_squared_error")
scores

#### cross_validate

In [None]:
scores = cross_validate(lr_model, x_train, y_train, cv=4, return_train_score=True, 
                        scoring=["neg_mean_squared_error", "r2"])
scores

In [None]:
print("Confidence interval: %0.2f (+/- %0.2f)" % (scores["test_r2"].mean(), scores["test_r2"].std() * 1.96))

### Leave-One-Out - LOO

In [None]:
from sklearn.model_selection import LeaveOneOut

x = np.array([0,1,2,3,4,5,6,7,8,9])
y = np.array([1,1,0,0,1,0,1,1,0,0])

loo = LeaveOneOut()
for train, test in loo.split(x):
    print("%s %s" % (train, test))
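
A `LeaveOneOut` splitter can also be passed as `cv` to `cross_val_score`, giving one score per sample. The sketch below assumes a small synthetic linear dataset and uses mean squared error, since R^2 is not well defined on a single-sample test set. The names `X_demo`, `y_demo`, `loo_demo`, `loo_scores`, and `loo_mse` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical demo data: a noisy line y = 2x
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(20, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 0.5, size=20)

loo_demo = LeaveOneOut()

# One fit per sample: 20 samples give 20 single-element test sets
loo_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=loo_demo, scoring="neg_mean_squared_error")
loo_mse = -loo_scores.mean()

print("Number of folds:", len(loo_scores))
print("LOO estimate of MSE:", loo_mse)
```

LOO fits the model once per sample, so its cost grows linearly with the dataset size.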

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:16pt; font-weight:bold">3. Regularization and cross-validation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

#### Ridge regression with cross-validation

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html">RidgeCV</a>

In [None]:
from sklearn.linear_model import RidgeCV

In [None]:
x, y = regression_dataset()

In [None]:
plt.plot(x, y, "o")
plt.title("Initial data")
plt.grid(True)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

<p>Building the training and test subsets</p>

In [None]:
X_ = x[:, np.newaxis]  # or x.reshape(-1, 1)
X_[:5]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.3, random_state=200)

<p>Training</p>

In [None]:
alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]  # duplicate 0.1 removed

pl_ridge = Pipeline([("plF", PolynomialFeatures(degree=15)), ("lr", RidgeCV(alphas=alphas, 
                                                                            normalize=True,
                                                                            fit_intercept=True, cv=4, 
                                                                            store_cv_values=False))])

pl_ridge.fit(X_train, y_train)

<p>Training result</p>

In [None]:
print("Coefficients: ", pl_ridge.named_steps["lr"].coef_)
print("Intercept: ", pl_ridge.named_steps["lr"].intercept_)
print("Alpha: ", pl_ridge.named_steps["lr"].alpha_ )

Prediction function:

In [None]:
f_x = lambda x : polyval(x, pl_ridge.named_steps["lr"].coef_) + pl_ridge.named_steps["lr"].intercept_

or

In [None]:
y_train_pred = pl_ridge.predict(X_train)
y_train_pred[:5]

In [None]:
xx = np.linspace(0,6,100).reshape(-1, 1)

plt.title("Train data")
plt.plot(X_train, y_train, "o")
plt.plot(xx, pl_ridge.predict(xx), color="red", lw=2)
plt.plot(X_train, y_train_pred, "o", color="red", lw=2)
plt.vlines(X_train, ymin=y_train, ymax=y_train_pred, colors="black", linestyles="dotted")
plt.xlabel("X")
plt.ylabel("Y")
plt.grid(True)
plt.show()

<p>Evaluation on the test subset</p>

In [None]:
pl_ridge.score(X_test, y_test)

### Classification with cross-validation

In [None]:
from sklearn.linear_model import LogisticRegressionCV

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html">LogisticRegressionCV</a>

In [None]:
X = digits["images"].reshape(len(digits["images"]), -1)
X.shape

In [None]:
Cs = [0.01, 0.1, 1, 10, 100, 1000]
X = digits["images"].reshape(len(digits["images"]), -1)
y = digits["target"]

Building the training and test subsets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=200)

In [None]:
logr_model = LogisticRegressionCV(penalty="l2", Cs=Cs, cv=10, multi_class="ovr", solver="liblinear", 
                                  random_state=12345)
logr_model.fit(X_train, y_train)

In [None]:
print("Coefficients: ", logr_model.coef_)
print("Intercept: ", logr_model.intercept_)
print("Regularization parameter C: ", logr_model.C_)

Evaluating model quality:

In [None]:
logr_model.score(X_train, y_train)

In [None]:
logr_model.score(X_test, y_test)

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. Model selection</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

### Grid

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">GridSearchCV</a>
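
As a minimal sketch of grid search, the example below tunes the polynomial degree and the ridge penalty together with `GridSearchCV`. The data, the parameter grid, and the names `X_demo`, `y_demo`, `pipe`, `param_grid`, and `grid` are hypothetical and chosen only for illustration. A `StandardScaler` step is assumed so that the penalty acts on comparable feature scales.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical demo data: noisy sine on [0, 6]
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(100, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, size=100)

pipe = Pipeline([
    ("plF", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("lr", Ridge()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "plF__degree": [1, 3, 5, 9],
    "lr__alpha": [1e-3, 1e-1, 1.0, 10.0],
}

# 4-fold cross-validation over all 16 combinations
grid = GridSearchCV(pipe, param_grid, cv=4, scoring="neg_mean_squared_error")
grid.fit(X_demo, y_demo)

print("Best parameters:", grid.best_params_)
print("Best CV MSE:", -grid.best_score_)
```

After fitting, `grid.best_estimator_` holds the pipeline refit on all of the data with the selected parameters, and `grid.cv_results_` keeps the per-combination scores.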

<a name="5"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">5. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

<a href="http://scikit-learn.org/stable/modules/cross_validation.html">3.1. Cross-validation: evaluating estimator performance</a>