№ 1.1



Для задачи линейной регрессии:
$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$$


Целевая функция (MSE):
$$J(w) = \frac{1}{N} \|Xw - y\|^2$$

Для нахождения минимума:
$$\frac{\partial J}{\partial w} = \frac{2}{N} X^T (Xw - y) = 0$$

Аналитическое решение:
$$w = (X^T X)^{-1} X'^T y$$

№ 1.2


Регуляризация добавляет штраф за слишком большие значения весов:

L2 — then the linear model is called Ridge model
$$R(\boldsymbol{\theta}) = \| \boldsymbol{\theta}\|_2^2 = \sum_{i=1}^d \theta_i^2$$

L1 — then the linear model is called Lasso model
$$R(\boldsymbol{\theta}) = \| \boldsymbol{\theta}\|_1 = \sum_{i=1}^d |\theta_i|$$

№ 1.3


L1-регуляризация поощряет разреженность решения. Из-за формы функции $|\theta_i|$, при оптимизации часть весов зануляется полностью. Это означает:
Модель автоматически выбирает наиболее значимые признаки  
Все остальные получают нулевые веса и исключаются  

№ 1.4


Нелинейные зависимости можно смоделировать, применив преобразование признаков:  
Заменить $x$ на $\phi(x)$ — функцию от $x$  
Например, добавить полиномиальные признаки: $x^2$, $x^3$, ...  
Это называется полиномиальная регрессия, и можно использовать те же алгоритмы (LinearRegression, Ridge и др.), но на расширенном признаковом пространстве.


№ 2


In [202]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm
import seaborn  as sns
import statsmodels
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures,StandardScaler,MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score
from collections import Counter
from sklearn.linear_model import Lasso,Ridge,ElasticNet

In [203]:
data = pd.read_json("data/train.json")
data_test = pd.read_json("data/test.json")

In [204]:
data.head(3)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium


In [205]:
print("Количество строк",len(data))
print("Количество столбцов",len(data.columns))

Количество строк 49352
Количество столбцов 15


In [206]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49352 entries, 4 to 124009
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   bathrooms        49352 non-null  float64
 1   bedrooms         49352 non-null  int64  
 2   building_id      49352 non-null  object 
 3   created          49352 non-null  object 
 4   description      49352 non-null  object 
 5   display_address  49352 non-null  object 
 6   features         49352 non-null  object 
 7   latitude         49352 non-null  float64
 8   listing_id       49352 non-null  int64  
 9   longitude        49352 non-null  float64
 10  manager_id       49352 non-null  object 
 11  photos           49352 non-null  object 
 12  price            49352 non-null  int64  
 13  street_address   49352 non-null  object 
 14  interest_level   49352 non-null  object 
dtypes: float64(3), int64(3), object(9)
memory usage: 6.0+ MB


In [207]:
data.isna().sum()

bathrooms          0
bedrooms           0
building_id        0
created            0
description        0
display_address    0
features           0
latitude           0
listing_id         0
longitude          0
manager_id         0
photos             0
price              0
street_address     0
interest_level     0
dtype: int64

In [208]:
data.describe()

Unnamed: 0,bathrooms,bedrooms,latitude,listing_id,longitude,price
count,49352.0,49352.0,49352.0,49352.0,49352.0,49352.0
mean,1.21218,1.54164,40.741545,7024055.0,-73.955716,3830.174
std,0.50142,1.115018,0.638535,126274.6,1.177912,22066.87
min,0.0,0.0,0.0,6811957.0,-118.271,43.0
25%,1.0,1.0,40.7283,6915888.0,-73.9917,2500.0
50%,1.0,1.0,40.7518,7021070.0,-73.9779,3150.0
75%,1.0,2.0,40.7743,7128733.0,-73.9548,4100.0
max,10.0,8.0,44.8835,7753784.0,0.0,4490000.0


In [209]:
data[["bathrooms","bedrooms","latitude","listing_id","longitude","price"]].corr()

Unnamed: 0,bathrooms,bedrooms,latitude,listing_id,longitude,price
bathrooms,1.0,0.533446,-0.009657,0.000776,0.010393,0.069661
bedrooms,0.533446,1.0,-0.004745,0.011968,0.006892,0.051788
latitude,-0.009657,-0.004745,1.0,0.001712,-0.966807,-0.000707
listing_id,0.000776,0.011968,0.001712,1.0,-0.000907,0.00809
longitude,0.010393,0.006892,-0.966807,-0.000907,1.0,-8.7e-05
price,0.069661,0.051788,-0.000707,0.00809,-8.7e-05,1.0


In [210]:
data_short = data[["bathrooms","bedrooms","interest_level","features","price"]]
data_short.head()

Unnamed: 0,bathrooms,bedrooms,interest_level,features,price
4,1.0,1,medium,"[Dining Room, Pre-War, Laundry in Building, Di...",2400
6,1.0,2,low,"[Doorman, Elevator, Laundry in Building, Dishw...",3800
9,1.0,2,medium,"[Doorman, Elevator, Laundry in Building, Laund...",3495
10,1.5,3,medium,[],3000
15,1.0,0,low,"[Doorman, Elevator, Fitness Center, Laundry in...",2795


In [211]:
lable = LabelEncoder()
data_short["interest_level"] = lable.fit_transform(data_short["interest_level"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_short["interest_level"] = lable.fit_transform(data_short["interest_level"])


In [212]:
data_short.head()

Unnamed: 0,bathrooms,bedrooms,interest_level,features,price
4,1.0,1,2,"[Dining Room, Pre-War, Laundry in Building, Di...",2400
6,1.0,2,1,"[Doorman, Elevator, Laundry in Building, Dishw...",3800
9,1.0,2,2,"[Doorman, Elevator, Laundry in Building, Laund...",3495
10,1.5,3,2,[],3000
15,1.0,0,1,"[Doorman, Elevator, Fitness Center, Laundry in...",2795


In [None]:
class FeaturePipeline:
        
    def __init__(self, column_name: str = 'features', top_n: int = 20):
        self.column_name = column_name
        self.top_n = top_n
        self.top_features = []

    @staticmethod
    def _clean_feature(feature: str) -> str:
        return feature.replace(" ", "")

    def _normalize_features_list(self, features_list):
        return set(self._clean_feature(f) for f in features_list if len(self._clean_feature(f)) >= 5)

    def fit(self, data: pd.DataFrame):
        all_features = []

        for _, row in data.iterrows():
            cleaned = [
                self._clean_feature(f)
                for f in row[self.column_name]
                if len(self._clean_feature(f)) >= 5
            ]
            all_features.extend(cleaned)

        counter = Counter(all_features)
        self.top_features = [key for key, _ in counter.most_common(self.top_n)]

    def transform(self, data: pd.DataFrame) -> pd.DataFrame:
        new_df = pd.DataFrame()
        for feature in self.top_features:
            new_df[feature] = data[self.column_name].apply(
                lambda feats: int(feature in self._normalize_features_list(feats))
            )
        return new_df

    def fit_transform(self, data: pd.DataFrame) -> pd.DataFrame:
        self.fit(data)
        return self.transform(data)

In [214]:
my_conv = FeaturePipeline()

X_train = my_conv.fit_transform(data_short)
X_test = my_conv.fit_transform(data_test)

In [215]:
X_train["bathrooms"] = data_short["bathrooms"].astype(int)
X_train["bedrooms"] = data_short["bedrooms"]
X_train["interest_level"] = data_short["interest_level"]

X_test["bathrooms"] = data_test["bathrooms"].astype(int)
X_test["bedrooms"] = data_test["bedrooms"]
X_test["interest_level"] = data_short["interest_level"].mode()[0]


In [216]:
X_test["interest_level"]

0         1
1         1
2         1
3         1
5         1
         ..
124003    1
124005    1
124006    1
124007    1
124010    1
Name: interest_level, Length: 74659, dtype: int64

In [217]:
X_train.head()

Unnamed: 0,Elevator,CatsAllowed,HardwoodFloors,DogsAllowed,Doorman,Dishwasher,NoFee,LaundryinBuilding,FitnessCenter,Pre-War,...,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,bathrooms,bedrooms,interest_level
4,0,1,1,1,0,1,0,1,0,1,...,1,0,0,0,0,0,0,1,1,2
6,1,0,1,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,1,2,1
9,1,0,1,0,1,1,0,1,0,0,...,0,0,0,0,0,0,0,1,2,2
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,3,2
15,1,0,0,0,1,0,0,1,1,0,...,0,0,0,0,0,0,0,1,0,1


In [218]:
X_test.head()

Unnamed: 0,Elevator,CatsAllowed,HardwoodFloors,DogsAllowed,Doorman,Dishwasher,NoFee,LaundryinBuilding,FitnessCenter,Pre-War,...,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,bathrooms,bedrooms,interest_level
0,1,0,1,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,1,1,1
1,0,1,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,2,1
2,0,1,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,1
3,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,1
5,1,1,0,1,1,0,0,1,1,1,...,0,1,0,0,0,0,0,1,1,1


In [219]:
# len(set(array_all_features)) размер уникальных значений

In [220]:
len(X_train.columns)

23

In [None]:
import numpy as np

class MYLinearRegression:
    def __init__(self, method='sgd', learning_rate=0.01, epochs=1000):
        self.method = method
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y)
        n_samples, n_features = X.shape

        if self.method == 'analytic':
            X_b = np.hstack((np.ones((n_samples, 1)), X))
            # w = (X^T*X)^-1*X^T*y
            theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
            self.bias = theta_best[0]
            self.weights = theta_best[1:]
        
        elif self.method == 'gradient':
            self.weights = np.zeros(n_features)
            self.bias = 0

            for _ in range(self.epochs):
                y_pred = np.dot(X, self.weights) + self.bias
                error = y_pred - y

                # Градиенты
                dw = (1/n_samples) * np.dot(X.T, error)
                db = (1/n_samples) * np.sum(error)

                # Обновление весов
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db

        elif self.method == 'sgd':
            self.weights = np.zeros(n_features)
            self.bias = 0

            for _ in range(self.epochs):
                for i in range(n_samples):
                    xi = X[i].reshape(1, -1)
                    yi = y[i]
                    y_pred = np.dot(xi, self.weights) + self.bias
                    error = y_pred - yi

                    dw = xi.T * error
                    db = error

                    self.weights -= self.learning_rate * dw.flatten()
                    self.bias -= self.learning_rate * db.item()
        else:
                

    def predict(self, X):
        X = np.array(X)
        return np.dot(X, self.weights) + self.bias

    @staticmethod
    def r2_score(y_true, y_pred):
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1 - (ss_res / ss_tot)

    @staticmethod
    def mean_squared_error(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)
    
    @staticmethod
    def mean_absolute_error(y_true, y_pred):
         return np.mean(np.abs(y_true - y_pred))

In [222]:
y_train = data_short["price"]
y_test = data_test["price"] 

In [223]:
result_MAE = pd.DataFrame(columns=["model","train","test"])
result_RMSE = pd.DataFrame(columns=["model","train","test"])
result_R2 = pd.DataFrame(columns=["model","train","test"])

In [224]:
my_lin_reg = MYLinearRegression(method="gradient")

my_lin_reg.fit(X_train,y_train)

y_pred_train = my_lin_reg.predict(X_train)
y_pred_test = my_lin_reg.predict(X_test)

In [225]:
mae_my_lin_reg_train = my_lin_reg.mean_absolute_error(y_train,y_pred_train)
rmse_my_lin_reg_train = np.sqrt(my_lin_reg.mean_squared_error(y_train,y_pred_train))
r2_score_my_lin_reg_train = my_lin_reg.r2_score(y_train,y_pred_train)

mae_my_lin_reg_test = my_lin_reg.mean_absolute_error(y_test,y_pred_test)
rmse_my_lin_reg_test = np.sqrt(my_lin_reg.mean_squared_error(y_test,y_pred_test))
r2_score_my_lin_reg_test = my_lin_reg.r2_score(y_test,y_pred_test)

In [226]:
print("My_MAE_train: ",mae_my_lin_reg_train)
print("My_rmse_train: ",rmse_my_lin_reg_train)
print("My_r2_score_train: ",r2_score_my_lin_reg_train)

print()

print("My_MAE_test: ",mae_my_lin_reg_test)
print("My_rmse_test: ",rmse_my_lin_reg_test)
print("My_r2_score_test: ",r2_score_my_lin_reg_test)

My_MAE_train:  1117.400828154916
My_rmse_train:  21997.58474034918
My_r2_score_train:  0.006249208036482212

My_MAE_test:  1052.6024991258419
My_rmse_test:  9610.876851300412
My_r2_score_test:  0.02092294668329986


In [227]:
result_MAE.loc[0] = ["my_linear_regression", mae_my_lin_reg_train, mae_my_lin_reg_test]
result_RMSE.loc[0] = ["my_linear_regression", rmse_my_lin_reg_train, rmse_my_lin_reg_test]
result_R2.loc[0] = ["my_linear_regression", r2_score_my_lin_reg_train, r2_score_my_lin_reg_test]

In [228]:
ling_model = LinearRegression()

ling_model.fit(X_train,y_train)

y_pred_train = ling_model.predict(X_train)
y_pred_test = ling_model.predict(X_test)

In [229]:
mae_lin_reg_train = mean_absolute_error(y_train,y_pred_train)
rmse_lin_reg_train = np.sqrt(mean_squared_error(y_train,y_pred_train))
r2_score_lin_reg_train = r2_score(y_train,y_pred_train)

mae_lin_reg_test = mean_absolute_error(y_test,y_pred_test)
rmse_lin_reg_test = np.sqrt(mean_squared_error(y_test,y_pred_test))
r2_score_lin_reg_test = r2_score(y_test,y_pred_test)

In [230]:
print("MAE_train: ",mae_lin_reg_train)
print("rmse_train: ",rmse_lin_reg_train)
print("r2_score_train: ",r2_score_lin_reg_train)

print()

print("MAE_test: ",mae_lin_reg_test)
print("rmse_test: ",rmse_lin_reg_test)
print("r2_score_test: ",r2_score_lin_reg_test)

MAE_train:  1163.4942384393003
rmse_train:  21995.961609060236
r2_score_train:  0.006395853999143886

MAE_test:  1096.9281440031493
rmse_test:  9621.501183269518
r2_score_test:  0.018757111190920384


In [231]:
result_MAE.loc[1] = ["linear_regression", mae_lin_reg_train, mae_lin_reg_test]
result_RMSE.loc[1] = ["linear_regression", rmse_lin_reg_train, rmse_lin_reg_test]
result_R2.loc[1] = ["linear_regression", r2_score_lin_reg_train, r2_score_lin_reg_test]

In [232]:
class RegularizedLinearRegression:
    def __init__(self, method='ridge', alpha=1.0, l1_ratio=0.5, learning_rate=0.01, epochs=1000):
        self.method = method  # 'ridge', 'lasso', 'elasticnet'
        self.alpha = alpha
        self.l1_ratio = l1_ratio  # только для elasticnet
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y)
        n_samples, n_features = X.shape

        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.epochs):
            y_pred = np.dot(X, self.weights) + self.bias
            error = y_pred - y

            # Базовые градиенты
            dw = (1/n_samples) * np.dot(X.T, error)
            db = (1/n_samples) * np.sum(error)

            # Регуляризация
            if self.method == 'ridge':
                dw += (self.alpha / n_samples) * self.weights  # L2
            elif self.method == 'lasso':
                dw += (self.alpha / n_samples) * np.sign(self.weights)  # L1
            elif self.method == 'elasticnet':
                l1 = self.l1_ratio * np.sign(self.weights)
                l2 = (1 - self.l1_ratio) * self.weights
                dw += (self.alpha / n_samples) * (l1 + l2)
            else:
                raise ValueError("Метод должен быть 'ridge', 'lasso' или 'elasticnet'")

            # Обновление параметров
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        X = np.array(X)
        return np.dot(X, self.weights) + self.bias

    def r2_score(self, y_true, y_pred):
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1 - ss_res / ss_tot

    def mae(self, y_true, y_pred):
        return np.mean(np.abs(y_true - y_pred))

    def rmse(self, y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

### ridge

In [233]:
ridge = RegularizedLinearRegression(method='ridge')

ridge.fit(X_train,y_train)

y_pred_train = ridge.predict(X_train)
y_pred_test = ridge.predict(X_test)

In [234]:
mae_train = ridge.mae(y_train,y_pred_train)
rmse_train = ridge.rmse(y_train,y_pred_train)
r2_score_train = ridge.r2_score(y_train,y_pred_train)

mae_test =  ridge.mae(y_test,y_pred_test)
rmse_test = ridge.rmse(y_test,y_pred_test)
r2_score_test = ridge.r2_score(y_test,y_pred_test)

In [235]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1117.387804388276
rmse_train:  21997.5853138229
r2_score_train:  0.00624915622261446

MAE_test:  1052.589571118567
rmse_test:  9610.873532961808
r2_score_test:  0.02092362277324933


In [236]:
result_MAE.loc[2] = ["my_ridge", mae_train, mae_test]
result_RMSE.loc[2] = ["my_ridge", rmse_train, rmse_test]
result_R2.loc[2] = ["my_ridge", r2_score_train, r2_score_test]

### lasso

In [237]:
lasso = RegularizedLinearRegression(method='lasso')

lasso.fit(X_train,y_train)

y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)

In [238]:
mae_train = lasso.mae(y_train,y_pred_train)
rmse_train = lasso.rmse(y_train,y_pred_train)
r2_score_train = lasso.r2_score(y_train,y_pred_train)

mae_test =  lasso.mae(y_test,y_pred_test)
rmse_test = lasso.rmse(y_test,y_pred_test)
r2_score_test = lasso.r2_score(y_test,y_pred_test)

In [239]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1117.4007984315779
rmse_train:  21997.584741344413
r2_score_train:  0.006249207946561808

MAE_test:  1052.602470978175
rmse_test:  9610.876849646223
r2_score_test:  0.020922947020330485


In [240]:
result_MAE.loc[3] = ["my_lasso", mae_train, mae_test]
result_RMSE.loc[3] = ["my_lasso", rmse_train, rmse_test]
result_R2.loc[3] = ["my_lasso", r2_score_train, r2_score_test]

### elasticnet

In [241]:
elasticnet = RegularizedLinearRegression(method='elasticnet')

elasticnet.fit(X_train,y_train)

y_pred_train = elasticnet.predict(X_train)
y_pred_test = elasticnet.predict(X_test)

In [242]:
mae_train = elasticnet.mae(y_train,y_pred_train)
rmse_train = elasticnet.rmse(y_train,y_pred_train)
r2_score_train = elasticnet.r2_score(y_train,y_pred_train)

mae_test =  elasticnet.mae(y_test,y_pred_test)
rmse_test = elasticnet.rmse(y_test,y_pred_test)
r2_score_test = elasticnet.r2_score(y_test,y_pred_test)

In [243]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1117.3943009601837
rmse_train:  21997.58502756656
r2_score_train:  0.006249182086132787

MAE_test:  1052.5960208995825
rmse_test:  9610.87519118517
r2_score_test:  0.020923284921033014


In [244]:
result_MAE.loc[4] = ["my_elasticnet", mae_train, mae_test]
result_RMSE.loc[4] = ["my_elasticnet", rmse_train, rmse_test]
result_R2.loc[4] = ["my_elasticnet", r2_score_train, r2_score_test]

In [245]:
result_MAE

Unnamed: 0,model,train,test
0,my_linear_regression,1117.400828,1052.602499
1,linear_regression,1163.494238,1096.928144
2,my_ridge,1117.387804,1052.589571
3,my_lasso,1117.400798,1052.602471
4,my_elasticnet,1117.394301,1052.596021


In [246]:
result_RMSE

Unnamed: 0,model,train,test
0,my_linear_regression,21997.58474,9610.876851
1,linear_regression,21995.961609,9621.501183
2,my_ridge,21997.585314,9610.873533
3,my_lasso,21997.584741,9610.87685
4,my_elasticnet,21997.585028,9610.875191


In [247]:
result_R2

Unnamed: 0,model,train,test
0,my_linear_regression,0.006249,0.020923
1,linear_regression,0.006396,0.018757
2,my_ridge,0.006249,0.020924
3,my_lasso,0.006249,0.020923
4,my_elasticnet,0.006249,0.020923


### ridge lasso elasticnet in sklearn

In [248]:
lasso = Lasso()
ridge = Ridge()
elasticnet = ElasticNet()

lasso.fit(X_train,y_train)
ridge.fit(X_train,y_train)
elasticnet.fit(X_train,y_train)

y_pred_train_lasso = lasso.predict(X_train)
y_pred_test_lasso = lasso.predict(X_test)

y_pred_train_ridge = ridge.predict(X_train)
y_pred_test_ridge = ridge.predict(X_test)

y_pred_train_elasticnet = elasticnet.predict(X_train)
y_pred_test_elasticnet = elasticnet.predict(X_test)

In [249]:
mae_train = mean_absolute_error(y_train,y_pred_train_lasso)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_lasso))
r2_score_train = r2_score(y_train,y_pred_train_lasso)

mae_test =  mean_absolute_error(y_test,y_pred_test_lasso)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_lasso))
r2_score_test = r2_score(y_test,y_pred_test_lasso)

In [250]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1159.7205988001822
rmse_train:  21995.97048250695
r2_score_train:  0.006395052334240092

MAE_test:  1093.1939528922276
rmse_test:  9621.280924084105
r2_score_test:  0.018802036672178057


In [251]:
result_MAE.loc[5] = ["lasso", mae_train, mae_test]
result_RMSE.loc[5] = ["lasso", rmse_train, rmse_test]
result_R2.loc[5] = ["lasso", r2_score_train, r2_score_test]

In [252]:
mae_train = mean_absolute_error(y_train,y_pred_train_ridge)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_ridge))
r2_score_train = r2_score(y_train,y_pred_train_ridge)

mae_test =  mean_absolute_error(y_test,y_pred_test_ridge)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_ridge))
r2_score_test = r2_score(y_test,y_pred_test_ridge)

In [253]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1163.4434957140838
rmse_train:  21995.961609924885
r2_score_train:  0.006395853921027483

MAE_test:  1096.878139914704
rmse_test:  9621.490128733061
r2_score_test:  0.018759365969846953


In [254]:
result_MAE.loc[6] = ["ridge", mae_train, mae_test]
result_RMSE.loc[6] = ["ridge", rmse_train, rmse_test]
result_R2.loc[6] = ["ridge", r2_score_train, r2_score_test]

In [255]:
mae_train = mean_absolute_error(y_train,y_pred_train_elasticnet)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_elasticnet))
r2_score_train = r2_score(y_train,y_pred_train_elasticnet)

mae_test =  mean_absolute_error(y_test,y_pred_test_elasticnet)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_elasticnet))
r2_score_test = r2_score(y_test,y_pred_test_elasticnet)

In [256]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1125.0250378822082
rmse_train:  22015.968597015977
r2_score_train:  0.004587515971194889

MAE_test:  1060.1918954980622
rmse_test:  9613.70686028619
r2_score_test:  0.02034626573221987


In [257]:
result_MAE.loc[7] = ["elasticnet", mae_train, mae_test]
result_RMSE.loc[7] = ["elasticnet", rmse_train, rmse_test]
result_R2.loc[7] = ["elasticnet", r2_score_train, r2_score_test]

In [258]:
result_MAE

Unnamed: 0,model,train,test
0,my_linear_regression,1117.400828,1052.602499
1,linear_regression,1163.494238,1096.928144
2,my_ridge,1117.387804,1052.589571
3,my_lasso,1117.400798,1052.602471
4,my_elasticnet,1117.394301,1052.596021
5,lasso,1159.720599,1093.193953
6,ridge,1163.443496,1096.87814
7,elasticnet,1125.025038,1060.191895


In [259]:
result_RMSE

Unnamed: 0,model,train,test
0,my_linear_regression,21997.58474,9610.876851
1,linear_regression,21995.961609,9621.501183
2,my_ridge,21997.585314,9610.873533
3,my_lasso,21997.584741,9610.87685
4,my_elasticnet,21997.585028,9610.875191
5,lasso,21995.970483,9621.280924
6,ridge,21995.96161,9621.490129
7,elasticnet,22015.968597,9613.70686


In [260]:
result_R2

Unnamed: 0,model,train,test
0,my_linear_regression,0.006249,0.020923
1,linear_regression,0.006396,0.018757
2,my_ridge,0.006249,0.020924
3,my_lasso,0.006249,0.020923
4,my_elasticnet,0.006249,0.020923
5,lasso,0.006395,0.018802
6,ridge,0.006396,0.018759
7,elasticnet,0.004588,0.020346


## MinMaxScaler

Обязательна нормализация:  
1) Градиентный спуск (логистическая регрессия, нейросети)
Без нормализации признаки с разным масштабом (например, доход в $ и возраст в годах) могут привести к медленной или нестабильной сходимости.  
2) Методы, основанные на расстоянии (KNN,KMeans)  
3) Методы регуляризации (L1, L2)

Не требуется нормализация:  
1) Деревья решений, случайные леса, бустинг (не зависят от масштаба признаков)  
2) Наивный Байес (Метод опирается на вероятности)

Формула:  
$$f(\mathbf{x}) = \frac{\mathbf{x} - \mathbf{x}_{\min}}{\mathbf{x}_{\max} - \mathbf{x}_{\min}}$$

In [261]:
def my_minmax_scaler(X):
    X = np.array(X, dtype=float)
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    return (X - X_min) / (X_max - X_min)

In [262]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

my_X_train_scaled = my_minmax_scaler(X_train)

print("My:      ",my_X_train_scaled[0])
print("sklearn: ",X_train_scaled[0])

My:       [0.    1.    1.    1.    0.    1.    0.    1.    0.    1.    0.    0.
 0.    1.    0.    0.    0.    0.    0.    0.    0.1   0.125 1.   ]
sklearn:  [0.    1.    1.    1.    0.    1.    0.    1.    0.    1.    0.    0.
 0.    1.    0.    0.    0.    0.    0.    0.    0.1   0.125 1.   ]


## StandardScaler

Формула:  
$$f(\mathbf{x}) = \frac{\mathbf{x} - \boldsymbol{\mu}}{\boldsymbol{\sigma}}$$


In [263]:
def my_standard_scaler(X):
    X = np.array(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std


In [264]:
scaler_std = StandardScaler()
X_train_standard = scaler_std.fit_transform(X_train)

my_X_train_standard = my_standard_scaler(X_train)

print("My:      ",my_X_train_standard[0])
print("sklearn: ",X_train_standard[0])

My:       [-1.05153709  1.04714687  1.04769987  1.11342245 -0.85699976  1.19001525
 -0.75976649  1.42111894 -0.60588069  2.09638746 -0.46383994 -0.39091529
 -0.34568647  2.93411559 -0.30890281 -0.25404408 -0.24198357 -0.23548793
 -0.23385394 -0.22023456 -0.41204122 -0.48577234  1.59859723]
sklearn:  [-1.05153709  1.04714687  1.04769987  1.11342245 -0.85699976  1.19001525
 -0.75976649  1.42111894 -0.60588069  2.09638746 -0.46383994 -0.39091529
 -0.34568647  2.93411559 -0.30890281 -0.25404408 -0.24198357 -0.23548793
 -0.23385394 -0.22023456 -0.41204122 -0.48577234  1.59859723]


### models by MinMaxScaler

#### LinearRegression

In [265]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

In [266]:
ling_model = LinearRegression()

ling_model.fit(X_train_scaled,y_train)

y_pred_train = ling_model.predict(X_train_scaled)
y_pred_test = ling_model.predict(X_test_scaled)

In [267]:
mae_lin_reg_train = mean_absolute_error(y_train,y_pred_train)
rmse_lin_reg_train = np.sqrt(mean_squared_error(y_train,y_pred_train))
r2_score_lin_reg_train = r2_score(y_train,y_pred_train)

mae_lin_reg_test = mean_absolute_error(y_test,y_pred_test)
rmse_lin_reg_test = np.sqrt(mean_squared_error(y_test,y_pred_test))
r2_score_lin_reg_test = r2_score(y_test,y_pred_test)

In [268]:
print("MAE_train: ",mae_lin_reg_train)
print("rmse_train: ",rmse_lin_reg_train)
print("r2_score_train: ",r2_score_lin_reg_train)

print()

print("MAE_test: ",mae_lin_reg_test)
print("rmse_test: ",rmse_lin_reg_test)
print("r2_score_test: ",r2_score_lin_reg_test)

MAE_train:  1163.4942384393007
rmse_train:  21995.961609060236
r2_score_train:  0.006395853999143886

MAE_test:  1940.9012848845273
rmse_test:  9811.201559724537
r2_score_test:  -0.020317282450297514


In [269]:
result_MAE.loc[8] = ["linear_regression_MinMaxScaler", mae_lin_reg_train, mae_lin_reg_test]
result_RMSE.loc[8] = ["linear_regression_MinMaxScaler", rmse_lin_reg_train, rmse_lin_reg_test]
result_R2.loc[8] = ["linear_regression_MinMaxScaler", r2_score_lin_reg_train, r2_score_lin_reg_test]

#### ridge lasso elasticnet in sklearn MinMaxScaler

In [270]:
lasso = Lasso()
ridge = Ridge()
elasticnet = ElasticNet()

lasso.fit(X_train_scaled,y_train)
ridge.fit(X_train_scaled,y_train)
elasticnet.fit(X_train_scaled,y_train)

y_pred_train_lasso = lasso.predict(X_train_scaled)
y_pred_test_lasso = lasso.predict(X_test_scaled)

y_pred_train_ridge = ridge.predict(X_train_scaled)
y_pred_test_ridge = ridge.predict(X_test_scaled)

y_pred_train_elasticnet = elasticnet.predict(X_train_scaled)
y_pred_test_elasticnet = elasticnet.predict(X_test_scaled)

#### lasso MinMaxScaler

In [271]:
mae_train = mean_absolute_error(y_train,y_pred_train_lasso)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_lasso))
r2_score_train = r2_score(y_train,y_pred_train_lasso)

mae_test =  mean_absolute_error(y_test,y_pred_test_lasso)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_lasso))
r2_score_test = r2_score(y_test,y_pred_test_lasso)

In [272]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1158.9903337050857
rmse_train:  21995.97987909111
r2_score_train:  0.006394203406522525

MAE_test:  1898.3808318688946
rmse_test:  9802.220285926809
r2_score_test:  -0.01845011979463851


In [273]:
result_MAE.loc[9] = ["lasso_MinMaxScaler", mae_train, mae_test]
result_RMSE.loc[9] = ["lasso_MinMaxScaler", rmse_train, rmse_test]
result_R2.loc[9] = ["lasso_MinMaxScaler", r2_score_train, r2_score_test]

#### ridge MinMaxScaler

In [274]:
mae_train = mean_absolute_error(y_train,y_pred_train_ridge)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_ridge))
r2_score_train = r2_score(y_train,y_pred_train_ridge)

mae_test =  mean_absolute_error(y_test,y_pred_test_ridge)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_ridge))
r2_score_test = r2_score(y_test,y_pred_test_ridge)

In [275]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1163.3896829748328
rmse_train:  21995.96401428261
r2_score_train:  0.006395636701159946

MAE_test:  1914.1324308216113
rmse_test:  9805.240861177792
r2_score_test:  -0.019077891691514415


In [276]:
result_MAE.loc[10] = ["ridge_MinMaxScaler", mae_train, mae_test]
result_RMSE.loc[10] = ["ridge_MinMaxScaler", rmse_train, rmse_test]
result_R2.loc[10] = ["ridge_MinMaxScaler", r2_score_train, r2_score_test]

#### elasticnet MinMaxScaler

In [277]:
mae_train = mean_absolute_error(y_train,y_pred_train_elasticnet)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_elasticnet))
r2_score_train = r2_score(y_train,y_pred_train_elasticnet)

mae_test =  mean_absolute_error(y_test,y_pred_test_elasticnet)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_elasticnet))
r2_score_test = r2_score(y_test,y_pred_test_elasticnet)

In [278]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1446.45580786325
rmse_train:  22051.661302654105
r2_score_train:  0.001357336500337314

MAE_test:  1380.4671885608777
rmse_test:  9685.354816879499
r2_score_test:  0.0056897462365281815


In [279]:
result_MAE.loc[11] = ["elasticnet_MinMaxScaler", mae_train, mae_test]
result_RMSE.loc[11] = ["elasticnet_MinMaxScaler", rmse_train, rmse_test]
result_R2.loc[11] = ["elasticnet_MinMaxScaler", r2_score_train, r2_score_test]

In [280]:
result_MAE

Unnamed: 0,model,train,test
0,my_linear_regression,1117.400828,1052.602499
1,linear_regression,1163.494238,1096.928144
2,my_ridge,1117.387804,1052.589571
3,my_lasso,1117.400798,1052.602471
4,my_elasticnet,1117.394301,1052.596021
5,lasso,1159.720599,1093.193953
6,ridge,1163.443496,1096.87814
7,elasticnet,1125.025038,1060.191895
8,linear_regression_MinMaxScaler,1163.494238,1940.901285
9,lasso_MinMaxScaler,1158.990334,1898.380832


### models by StandardScaler

In [281]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

#### LinearRegression

In [282]:
ling_model = LinearRegression()

ling_model.fit(X_train_scaled,y_train)

y_pred_train = ling_model.predict(X_train_scaled)
y_pred_test = ling_model.predict(X_test_scaled)

In [283]:
mae_lin_reg_train = mean_absolute_error(y_train,y_pred_train)
rmse_lin_reg_train = np.sqrt(mean_squared_error(y_train,y_pred_train))
r2_score_lin_reg_train = r2_score(y_train,y_pred_train)

mae_lin_reg_test = mean_absolute_error(y_test,y_pred_test)
rmse_lin_reg_test = np.sqrt(mean_squared_error(y_test,y_pred_test))
r2_score_lin_reg_test = r2_score(y_test,y_pred_test)

In [284]:
print("MAE_train: ",mae_lin_reg_train)
print("rmse_train: ",rmse_lin_reg_train)
print("r2_score_train: ",r2_score_lin_reg_train)

print()

print("MAE_test: ",mae_lin_reg_test)
print("rmse_test: ",rmse_lin_reg_test)
print("r2_score_test: ",r2_score_lin_reg_test)

MAE_train:  1163.4942384392998
rmse_train:  21995.961609060236
r2_score_train:  0.006395853999143886

MAE_test:  1081.501277161148
rmse_test:  9607.635669144172
r2_score_test:  0.021583205269289474


In [285]:
result_MAE.loc[12] = ["linear_regression_StandardScaler", mae_lin_reg_train, mae_lin_reg_test]
result_RMSE.loc[12] = ["linear_regression_StandardScaler", rmse_lin_reg_train, rmse_lin_reg_test]
result_R2.loc[12] = ["linear_regression_StandardScaler", r2_score_lin_reg_train, r2_score_lin_reg_test]

#### ridge lasso elasticnet in sklearn StandardScaler

In [286]:
lasso = Lasso()
ridge = Ridge()
elasticnet = ElasticNet()

lasso.fit(X_train_scaled,y_train)
ridge.fit(X_train_scaled,y_train)
elasticnet.fit(X_train_scaled,y_train)

y_pred_train_lasso = lasso.predict(X_train_scaled)
y_pred_test_lasso = lasso.predict(X_test_scaled)

y_pred_train_ridge = ridge.predict(X_train_scaled)
y_pred_test_ridge = ridge.predict(X_test_scaled)

y_pred_train_elasticnet = elasticnet.predict(X_train_scaled)
y_pred_test_elasticnet = elasticnet.predict(X_test_scaled)

#### lasso StandardScaler

In [287]:
mae_train = mean_absolute_error(y_train,y_pred_train_lasso)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_lasso))
r2_score_train = r2_score(y_train,y_pred_train_lasso)

mae_test =  mean_absolute_error(y_test,y_pred_test_lasso)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_lasso))
r2_score_test = r2_score(y_test,y_pred_test_lasso)

In [288]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1161.8112686195448
rmse_train:  21995.96300548255
r2_score_train:  0.006395727840436072

MAE_test:  1079.9503571642913
rmse_test:  9607.554809766767
r2_score_test:  0.02159967422016773


In [289]:
result_MAE.loc[13] = ["lasso_StandardScaler", mae_train, mae_test]
result_RMSE.loc[13] = ["lasso_StandardScaler", rmse_train, rmse_test]
result_R2.loc[13] = ["lasso_StandardScaler", r2_score_train, r2_score_test]

#### ridge StandardScaler

In [290]:
mae_train = mean_absolute_error(y_train,y_pred_train_ridge)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_ridge))
r2_score_train = r2_score(y_train,y_pred_train_ridge)

mae_test =  mean_absolute_error(y_test,y_pred_test_ridge)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_ridge))
r2_score_test = r2_score(y_test,y_pred_test_ridge)

In [291]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1163.482824404066
rmse_train:  21995.961609094986
r2_score_train:  0.006395853996004286

MAE_test:  1081.491786928143
rmse_test:  9607.634353851012
r2_score_test:  0.02158347316138587


In [292]:
result_MAE.loc[14] = ["ridge_StandardScaler", mae_train, mae_test]
result_RMSE.loc[14] = ["ridge_StandardScaler", rmse_train, rmse_test]
result_R2.loc[14] = ["ridge_StandardScaler", r2_score_train, r2_score_test]

#### elasticnet StandardScaler

In [293]:
mae_train = mean_absolute_error(y_train,y_pred_train_elasticnet)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_elasticnet))
r2_score_train = r2_score(y_train,y_pred_train_elasticnet)

mae_test =  mean_absolute_error(y_test,y_pred_test_elasticnet)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_elasticnet))
r2_score_test = r2_score(y_test,y_pred_test_elasticnet)

In [294]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1079.9190794040558
rmse_train:  22001.963915377397
r2_score_train:  0.005853506247168072

MAE_test:  1027.4767539058855
rmse_test:  9603.25982805277
r2_score_test:  0.022474251010745405


In [295]:
result_MAE.loc[15] = ["elasticnet_StandardScaler", mae_train, mae_test]
result_RMSE.loc[15] = ["elasticnet_StandardScaler", rmse_train, rmse_test]
result_R2.loc[15] = ["elasticnet_StandardScaler", r2_score_train, r2_score_test]

In [296]:
result_RMSE

Unnamed: 0,model,train,test
0,my_linear_regression,21997.58474,9610.876851
1,linear_regression,21995.961609,9621.501183
2,my_ridge,21997.585314,9610.873533
3,my_lasso,21997.584741,9610.87685
4,my_elasticnet,21997.585028,9610.875191
5,lasso,21995.970483,9621.280924
6,ridge,21995.96161,9621.490129
7,elasticnet,22015.968597,9613.70686
8,linear_regression_MinMaxScaler,21995.961609,9811.20156
9,lasso_MinMaxScaler,21995.979879,9802.220286


### models by PolynomialFeatures

In [297]:
poly = PolynomialFeatures(degree=10)
X_train_poly = poly.fit_transform(X_train[["bathrooms","bedrooms","interest_level"]])
X_test_poly = poly.fit_transform(X_test[["bathrooms","bedrooms","interest_level"]])

#### LinearRegression

In [298]:
ling_model = LinearRegression()

ling_model.fit(X_train_poly,y_train)

y_pred_train = ling_model.predict(X_train_poly)
y_pred_test = ling_model.predict(X_test_poly)

In [299]:
mae_lin_reg_train = mean_absolute_error(y_train,y_pred_train)
rmse_lin_reg_train = np.sqrt(mean_squared_error(y_train,y_pred_train))
r2_score_lin_reg_train = r2_score(y_train,y_pred_train)

mae_lin_reg_test = mean_absolute_error(y_test,y_pred_test)
rmse_lin_reg_test = np.sqrt(mean_squared_error(y_test,y_pred_test))
r2_score_lin_reg_test = r2_score(y_test,y_pred_test)

In [300]:
print("MAE_train: ",mae_lin_reg_train)
print("rmse_train: ",rmse_lin_reg_train)
print("r2_score_train: ",r2_score_lin_reg_train)

print()

print("MAE_test: ",mae_lin_reg_test)
print("rmse_test: ",rmse_lin_reg_test)
print("r2_score_test: ",r2_score_lin_reg_test)

MAE_train:  1044.5106256408
rmse_train:  21989.62948654086
r2_score_train:  0.0069678424117909366

MAE_test:  2854906274098659.5
rmse_test:  7.80068837928682e+17
r2_score_test:  -6.449955308057991e+27


In [301]:
result_MAE.loc[16] = ["linear_regression_PolynomialFeatures", mae_lin_reg_train, mae_lin_reg_test]
result_RMSE.loc[16] = ["linear_regression_PolynomialFeatures", rmse_lin_reg_train, rmse_lin_reg_test]
result_R2.loc[16] = ["linear_regression_PolynomialFeatures", r2_score_lin_reg_train, r2_score_lin_reg_test]

#### ridge lasso elasticnet in sklearn StandardScaler

In [302]:
lasso = Lasso()
ridge = Ridge()
elasticnet = ElasticNet()

lasso.fit(X_train_poly,y_train)
ridge.fit(X_train_poly,y_train)
elasticnet.fit(X_train_poly,y_train)

y_pred_train_lasso = lasso.predict(X_train_poly)
y_pred_test_lasso = lasso.predict(X_test_poly)

y_pred_train_ridge = ridge.predict(X_train_poly)
y_pred_test_ridge = ridge.predict(X_test_poly)

y_pred_train_elasticnet = elasticnet.predict(X_train_poly)
y_pred_test_elasticnet = elasticnet.predict(X_test_poly)

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


#### lasso PolynomialFeatures

In [303]:
mae_train = mean_absolute_error(y_train,y_pred_train_lasso)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_lasso))
r2_score_train = r2_score(y_train,y_pred_train_lasso)

mae_test =  mean_absolute_error(y_test,y_pred_test_lasso)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_lasso))
r2_score_test = r2_score(y_test,y_pred_test_lasso)

In [304]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1041.7675886265724
rmse_train:  21992.125186934536
r2_score_train:  0.006742422389559799

MAE_test:  52712370346.29089
rmse_test:  14403020845045.258
r2_score_test:  -2.1988622613626378e+18


In [305]:
result_MAE.loc[17] = ["lasso_PolynomialFeatures", mae_train, mae_test]
result_RMSE.loc[17] = ["lasso_PolynomialFeatures", rmse_train, rmse_test]
result_R2.loc[17] = ["lasso_PolynomialFeatures", r2_score_train, r2_score_test]

#### ridge PolynomialFeatures

In [306]:
mae_train = mean_absolute_error(y_train,y_pred_train_ridge)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_ridge))
r2_score_train = r2_score(y_train,y_pred_train_ridge)

mae_test =  mean_absolute_error(y_test,y_pred_test_ridge)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_ridge))
r2_score_test = r2_score(y_test,y_pred_test_ridge)

In [307]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1044.5045390344176
rmse_train:  21989.62968246809
r2_score_train:  0.006967824715991333

MAE_test:  2730145943872455.5
rmse_test:  7.45979576538629e+17
r2_score_test:  -5.898542625558537e+27


In [308]:
result_MAE.loc[18] = ["ridge_PolynomialFeatures", mae_train, mae_test]
result_RMSE.loc[18] = ["ridge_PolynomialFeatures", rmse_train, rmse_test]
result_R2.loc[18] = ["ridge_PolynomialFeatures", r2_score_train, r2_score_test]

#### elasticnet PolynomialFeatures

In [309]:
mae_train = mean_absolute_error(y_train,y_pred_train_elasticnet)
rmse_train = np.sqrt(mean_squared_error(y_train,y_pred_train_elasticnet))
r2_score_train = r2_score(y_train,y_pred_train_elasticnet)

mae_test =  mean_absolute_error(y_test,y_pred_test_elasticnet)
rmse_test = np.sqrt(mean_squared_error(y_test,y_pred_test_elasticnet))
r2_score_test = r2_score(y_test,y_pred_test_elasticnet)

In [310]:
print("MAE_train: ",mae_train)
print("rmse_train: ",rmse_train)
print("r2_score_train: ",r2_score_train)

print()

print("MAE_test: ",mae_test)
print("rmse_test: ",rmse_test)
print("r2_score_test: ",r2_score_test)

MAE_train:  1041.1703053883489
rmse_train:  21992.616731258375
r2_score_train:  0.006698021443636737

MAE_test:  66893102994.73334
rmse_test:  18277735544100.74
r2_score_test:  -3.5410790819766124e+18


In [311]:
result_MAE.loc[19] = ["elasticnet_PolynomialFeatures", mae_train, mae_test]
result_RMSE.loc[19] = ["elasticnet_PolynomialFeatures", rmse_train, rmse_test]
result_R2.loc[19] = ["elasticnet_PolynomialFeatures", r2_score_train, r2_score_test]

In [312]:
result_RMSE

Unnamed: 0,model,train,test
0,my_linear_regression,21997.58474,9610.877
1,linear_regression,21995.961609,9621.501
2,my_ridge,21997.585314,9610.874
3,my_lasso,21997.584741,9610.877
4,my_elasticnet,21997.585028,9610.875
5,lasso,21995.970483,9621.281
6,ridge,21995.96161,9621.49
7,elasticnet,22015.968597,9613.707
8,linear_regression_MinMaxScaler,21995.961609,9811.202
9,lasso_MinMaxScaler,21995.979879,9802.22


In [313]:
result_MAE

Unnamed: 0,model,train,test
0,my_linear_regression,1117.400828,1052.602
1,linear_regression,1163.494238,1096.928
2,my_ridge,1117.387804,1052.59
3,my_lasso,1117.400798,1052.602
4,my_elasticnet,1117.394301,1052.596
5,lasso,1159.720599,1093.194
6,ridge,1163.443496,1096.878
7,elasticnet,1125.025038,1060.192
8,linear_regression_MinMaxScaler,1163.494238,1940.901
9,lasso_MinMaxScaler,1158.990334,1898.381


In [314]:
result_R2

Unnamed: 0,model,train,test
0,my_linear_regression,0.006249,0.02092295
1,linear_regression,0.006396,0.01875711
2,my_ridge,0.006249,0.02092362
3,my_lasso,0.006249,0.02092295
4,my_elasticnet,0.006249,0.02092328
5,lasso,0.006395,0.01880204
6,ridge,0.006396,0.01875937
7,elasticnet,0.004588,0.02034627
8,linear_regression_MinMaxScaler,0.006396,-0.02031728
9,lasso_MinMaxScaler,0.006394,-0.01845012


лучшая модель - elasticnet_PolynomialFeatures

naive_mean and naive_median

In [318]:
mean_price_train = y_train.mean()
median_price_train = y_train.median()

mean_price_test = y_test.mean()
median_price_test = y_test.median()

In [319]:
X_train["naive_mean"] = mean_price_train
X_train["naive_median"] = median_price_train
X_test["naive_mean"] = mean_price_test
X_test["naive_median"] = median_price_test

In [321]:
mae_train_mean = mean_absolute_error(y_train,X_train["naive_mean"])
rmse_train_mean = np.sqrt(mean_squared_error(y_train,X_train["naive_mean"]))
r2_score_train_mean = r2_score(y_train,X_train["naive_mean"])

mae_train_median = mean_absolute_error(y_train,X_train["naive_median"])
rmse_train_median = np.sqrt(mean_squared_error(y_train,X_train["naive_median"]))
r2_score_train_median = r2_score(y_train,X_train["naive_median"])

mae_test_mean =  mean_absolute_error(y_test,X_test["naive_mean"])
rmse_test_mean = np.sqrt(mean_squared_error(y_test,X_test["naive_mean"]))
r2_score_test_mean = r2_score(y_test,X_test["naive_mean"])

mae_test_median =  mean_absolute_error(y_test,X_test["naive_median"])
rmse_test_median = np.sqrt(mean_squared_error(y_test,X_test["naive_median"]))
r2_score_test_median = r2_score(y_test,X_test["naive_median"])

In [322]:
print("MAE_train: ",mae_train_mean)
print("rmse_train: ",rmse_train_mean)
print("r2_score_train: ",r2_score_train_mean)

print()

print("MAE_test: ",mae_test_mean)
print("rmse_test: ",rmse_test_mean)
print("r2_score_test: ",r2_score_test_mean)

print()

print("MAE_train: ",mae_train_median)
print("rmse_train: ",rmse_train_median)
print("r2_score_train: ",r2_score_train_median)

print()

print("MAE_test: ",mae_test_median)
print("rmse_test: ",rmse_test_median)
print("r2_score_test: ",r2_score_test_median)

MAE_train:  1549.6424487275003
rmse_train:  22066.642317478563
r2_score_train:  0.0

MAE_test:  1440.9612985665638
rmse_test:  9713.026562495552
r2_score_test:  0.0

MAE_train:  1400.3444034689576
rmse_train:  22077.122545433856
r2_score_train:  -0.0009500962148851766

MAE_test:  1322.640672926238
rmse_test:  9731.481148020575
r2_score_test:  -0.0038035759720667084


In [323]:
result_MAE.loc[20] = ["naive_mean", mae_train_mean, mae_test_mean]
result_RMSE.loc[20] = ["naive_mean", rmse_train_mean, rmse_test_mean]
result_R2.loc[20] = ["naive_mean", r2_score_train_mean, r2_score_test_mean]

In [324]:
result_MAE.loc[21] = ["naive_median", mae_train_median, mae_test_median]
result_RMSE.loc[21] = ["naive_median", rmse_train_median, rmse_test_median]
result_R2.loc[21] = ["naive_median", r2_score_train_median, r2_score_test_median]

In [325]:
result_MAE

Unnamed: 0,model,train,test
0,my_linear_regression,1117.400828,1052.602
1,linear_regression,1163.494238,1096.928
2,my_ridge,1117.387804,1052.59
3,my_lasso,1117.400798,1052.602
4,my_elasticnet,1117.394301,1052.596
5,lasso,1159.720599,1093.194
6,ridge,1163.443496,1096.878
7,elasticnet,1125.025038,1060.192
8,linear_regression_MinMaxScaler,1163.494238,1940.901
9,lasso_MinMaxScaler,1158.990334,1898.381


In [326]:
result_RMSE

Unnamed: 0,model,train,test
0,my_linear_regression,21997.58474,9610.877
1,linear_regression,21995.961609,9621.501
2,my_ridge,21997.585314,9610.874
3,my_lasso,21997.584741,9610.877
4,my_elasticnet,21997.585028,9610.875
5,lasso,21995.970483,9621.281
6,ridge,21995.96161,9621.49
7,elasticnet,22015.968597,9613.707
8,linear_regression_MinMaxScaler,21995.961609,9811.202
9,lasso_MinMaxScaler,21995.979879,9802.22


In [327]:
result_R2

Unnamed: 0,model,train,test
0,my_linear_regression,0.006249,0.02092295
1,linear_regression,0.006396,0.01875711
2,my_ridge,0.006249,0.02092362
3,my_lasso,0.006249,0.02092295
4,my_elasticnet,0.006249,0.02092328
5,lasso,0.006395,0.01880204
6,ridge,0.006396,0.01875937
7,elasticnet,0.004588,0.02034627
8,linear_regression_MinMaxScaler,0.006396,-0.02031728
9,lasso_MinMaxScaler,0.006394,-0.01845012


лучшая модель - elasticnet_PolynomialFeatures

самая стабильная модель - ridge