
# Linear Regression

In [9]:
import pandas as pd 
import numpy as np 
%matplotlib inline
from matplotlib import pyplot as plt
df = pd.read_csv("/content/housing.csv")

In [None]:
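# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions.
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
y = boston.target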

Now the data can be split into training and test sets, for example in a 70% / 30% ratio:

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Next, create a linear regression model object and fit it on the training set:

In [12]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)


LinearRegression()

Now we can make predictions on the test set and evaluate the model, for example by mean squared error (MSE):

In [13]:
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)

MSE: 21.5174442311769
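
For reference, MSE is the average of the squared differences between true and predicted values. The sketch below computes it by hand in plain Python; the helper name `mse_by_hand` and the toy values are illustrative assumptions and are not part of the notebook.

```python
import math


def mse_by_hand(y_true, y_pred):
    """Mean of squared residuals; illustrative helper, not from the notebook."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values chosen for illustration only.
y_true_toy = [3.0, 5.0, 2.0, 7.0]
y_pred_toy = [2.0, 5.0, 4.0, 8.0]

mse_toy = mse_by_hand(y_true_toy, y_pred_toy)  # (1 + 0 + 4 + 1) / 4
rmse_toy = math.sqrt(mse_toy)                  # back in the units of the target

print("MSE:", mse_toy)
print("RMSE:", rmse_toy)
```

Taking the square root gives RMSE, which is easier to interpret because it is in the same units as the target. For the MSE of 21.5174 reported above, that is about 4.64.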


The model can also be evaluated by the coefficient of determination (R²):

In [14]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print("R²:", r2)

R²: 0.7112260057484974
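
R² compares the model's squared error with that of a baseline that always predicts the mean of the true values: R² = 1 - SS_res / SS_tot. A minimal plain-Python sketch follows, assuming a hypothetical helper `r2_by_hand` and toy values that are not from the notebook.

```python
def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot; illustrative helper, not from the notebook."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values chosen for illustration only.
y_true_toy = [3.0, 5.0, 2.0, 7.0]
y_pred_toy = [2.0, 5.0, 4.0, 8.0]

r2_toy = r2_by_hand(y_true_toy, y_pred_toy)

# A "model" that always predicts the mean of y_true scores R² = 0.
mean_toy = sum(y_true_toy) / len(y_true_toy)
r2_mean = r2_by_hand(y_true_toy, [mean_toy] * len(y_true_toy))

print("R² (toy model):", r2_toy)
print("R² (mean baseline):", r2_mean)
```

Read this way, the R² of about 0.71 above means the model removes roughly 71% of the squared error of the mean-only baseline on the test set.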


We can also compute the mean absolute error (MAE):

In [15]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", mae)

MAE: 3.1627098714573685
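
MAE and MSE react differently to large individual errors, which is one reason to report both. The sketch below uses toy numbers and hypothetical helper names (none of them from the notebook) to show how a single badly missed prediction affects each metric.

```python
def mae_by_hand(y_true, y_pred):
    """Mean of absolute residuals; illustrative helper."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_by_hand(y_true, y_pred):
    """Mean of squared residuals; illustrative helper."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true_toy = [10.0, 10.0, 10.0, 10.0]
pred_even = [9.0, 11.0, 9.0, 11.0]      # every prediction is off by 1
pred_outlier = [9.0, 11.0, 9.0, 20.0]   # the last prediction is off by 10

mae_even = mae_by_hand(y_true_toy, pred_even)
mse_even = mse_by_hand(y_true_toy, pred_even)
mae_outlier = mae_by_hand(y_true_toy, pred_outlier)
mse_outlier = mse_by_hand(y_true_toy, pred_outlier)

print("even errors:  MAE =", mae_even, " MSE =", mse_even)
print("one outlier:  MAE =", mae_outlier, " MSE =", mse_outlier)
```

The single large miss raises MAE from 1.0 to 3.25 but raises MSE from 1.0 to 25.75, because squaring weights large residuals much more heavily.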


# Ridge Regression

Load the data and split it into training and test sets:

In [None]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Next, import the Ridge class and create a model object with a given alpha parameter, which controls the regularization strength:

In [24]:
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # alpha can be tuned
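
To make the role of alpha concrete, here is a rough NumPy sketch of the textbook closed-form ridge solution on centered data, w = (XᵀX + alpha·I)⁻¹ Xᵀy. The helper `ridge_closed_form` and the synthetic data are assumptions made for illustration; this is not a description of how scikit-learn's solvers work internally.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y on centered data (hypothetical helper)."""
    X_c = X - X.mean(axis=0)   # centering plays the role of an unpenalized intercept
    y_c = y - y.mean()
    gram = X_c.T @ X_c + alpha * np.eye(X_c.shape[1])
    return np.linalg.solve(gram, X_c.T @ y_c)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

norms = []
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_closed_form(X_toy, y_toy, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:>6}: ||w|| = {norms[-1]:.4f}")
```

With alpha = 0 this reduces to ordinary least squares. As alpha grows, the coefficient vector shrinks toward zero, trading some bias for lower variance.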


Fit the model on the training set:

In [25]:
model.fit(X_train, y_train)


Ridge(alpha=2.0)

Make predictions on the test set:

In [26]:
y_pred = model.predict(X_test)


Evaluate the model by mean squared error:

In [27]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)


MSE: 22.29941335701529


Evaluate the model by the coefficient of determination:

In [28]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print("R²:", r2)


R²: 0.7007316205685639


Evaluate the model by mean absolute error:

In [29]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", mae)


MAE: 3.1968877193452783


In [31]:
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best alpha:", grid.best_params_['alpha'])

y_pred = grid.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)

r2 = r2_score(y_test, y_pred)
print("R²:", r2)

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", mae)


Best alpha: 0.1
MSE: 21.5851159150243
R²: 0.7103178206391327
MAE: 3.1623967756895497


# Lasso Regression

In [32]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import GridSearchCV

# load the data
boston = load_boston()
X = boston.data
y = boston.target

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# create the model object
model = Lasso(alpha=1.0)

# fit the model
model.fit(X_train, y_train)

# predict on the test set
y_pred = model.predict(X_test)

# evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("MSE:", mse)
print("R²:", r2)
print("MAE:", mae)

# tune the model's alpha with cross-validation
param_grid = {'alpha': [0.1, 1.0, 10.0]}
grid = GridSearchCV(Lasso(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best alpha:", grid.best_params_['alpha'])

y_pred = grid.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("MSE:", mse)
print("R²:", r2)
print("MAE:", mae)

MSE: 25.63950292804399
R²: 0.655906082915434
MAE: 3.6587976291978808
Best alpha: 0.1
MSE: 22.96383361575593
R²: 0.6918147952283057
MAE: 3.267373797848226



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h
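
Unlike ridge, the L1 penalty used by lasso can set coefficients exactly to zero, so a larger alpha usually means a sparser model. The sketch below counts nonzero coefficients for the same three alpha values on synthetic data; `make_regression` and its settings are assumptions standing in for the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data with only 3 informative features out of 10.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

nonzero_counts = []
for alpha in [0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_nonzero = int(np.sum(lasso.coef_ != 0))  # coefficients kept by the model
    nonzero_counts.append(n_nonzero)
    print(f"alpha={alpha}: {n_nonzero} of {lasso.coef_.size} coefficients are nonzero")
```

Applying the same check to the fitted model in this notebook (via `model.coef_`) would show which features lasso discards at alpha=1.0.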

# Comparison

In [33]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import GridSearchCV

# load the data
boston = load_boston()
X = boston.data
y = boston.target

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# linear regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred_lr = linear_reg.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)

print("Linear Regression:")
print("MSE:", mse_lr)
print("R²:", r2_lr)
print("MAE:", mae_lr)

# ridge regression
param_grid = {'alpha': [0.1, 1.0, 10.0]}
ridge_reg = Ridge()
grid_ridge = GridSearchCV(ridge_reg, param_grid, cv=5)
grid_ridge.fit(X_train, y_train)
y_pred_rr = grid_ridge.predict(X_test)
mse_rr = mean_squared_error(y_test, y_pred_rr)
r2_rr = r2_score(y_test, y_pred_rr)
mae_rr = mean_absolute_error(y_test, y_pred_rr)

print("Ridge Regression:")
print("Best alpha:", grid_ridge.best_params_['alpha'])
print("MSE:", mse_rr)
print("R²:", r2_rr)
print("MAE:", mae_rr)

# lasso regression
param_grid = {'alpha': [0.1, 1.0, 10.0]}
lasso_reg = Lasso()
grid_lasso = GridSearchCV(lasso_reg, param_grid, cv=5)
grid_lasso.fit(X_train, y_train)
y_pred_lsr = grid_lasso.predict(X_test)
mse_lsr = mean_squared_error(y_test, y_pred_lsr)
r2_lsr = r2_score(y_test, y_pred_lsr)
mae_lsr = mean_absolute_error(y_test, y_pred_lsr)

print("Lasso Regression:")
print("Best alpha:", grid_lasso.best_params_['alpha'])
print("MSE:", mse_lsr)
print("R²:", r2_lsr)
print("MAE:", mae_lsr)


Linear Regression:
MSE: 21.5174442311769
R²: 0.7112260057484974
MAE: 3.1627098714573685
Ridge Regression:
Best alpha: 0.1
MSE: 21.5851159150243
R²: 0.7103178206391327
MAE: 3.1623967756895497
Lasso Regression:
Best alpha: 0.1
MSE: 22.96383361575593
R²: 0.6918147952283057
MAE: 3.267373797848226




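The results show that plain linear regression gives the best quality on the test set by MSE (21.52) and R² (0.711). Ridge regression with the selected alpha of 0.1 is nearly identical: its MAE is marginally lower (3.1624 vs. 3.1627), while its MSE is slightly higher (21.59) and its R² slightly lower (0.710). Lasso regression gives the worst results on all three metrics, with the highest MSE (22.96) and MAE (3.27) and the lowest R² (0.692). On this split, regularization therefore brings no measurable improvement over ordinary least squares.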
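
The ranking can be checked directly from the printed metrics. The plain-Python sketch below copies the test-set numbers from the output above and picks the best model per metric; the dictionary layout and labels are illustrative choices.

```python
# Test-set metrics copied from the comparison output above.
results = {
    "Linear": {
        "MSE": 21.5174442311769, "R2": 0.7112260057484974, "MAE": 3.1627098714573685,
    },
    "Ridge (alpha=0.1)": {
        "MSE": 21.5851159150243, "R2": 0.7103178206391327, "MAE": 3.1623967756895497,
    },
    "Lasso (alpha=0.1)": {
        "MSE": 22.96383361575593, "R2": 0.6918147952283057, "MAE": 3.267373797848226,
    },
}

# Lower is better for MSE and MAE; higher is better for R2.
higher_is_better = {"MSE": False, "R2": True, "MAE": False}

best = {}
for metric, maximize in higher_is_better.items():
    pick = max if maximize else min
    best[metric] = pick(results, key=lambda name: results[name][metric])
    print(f"Best {metric}: {best[metric]} ({results[best[metric]][metric]:.4f})")
```

The differences between linear and ridge regression are in the third or fourth significant digit, so on a single 70/30 split they should be treated as effectively tied.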