### Подключим необходимые библиотеки:

In [508]:
import pandas                   as pd
import numpy                    as np
import matplotlib.pyplot        as plt
import seaborn                  as sns
import sklearn.linear_model     as skllinear
import sklearn.model_selection  as sklmodsel
import sklearn.metrics          as sklmetrics
import re                       as re
import random                   as rnd
import sklearn.preprocessing    as sklprep

### Смысл признаков:

|Признак|Оригинальное описание|
|:---------|:--------|
|BHK                | Number of Bedrooms, Hall, Kitchen.
|Rent               | Rent of the Houses/Apartments/Flats. 
|Size               | Size of the Houses/Apartments/Flats in Square Feet.
|Floor              | Houses/Apartments/Flats situated in which Floor and Total Number of Floors (Example: Ground out of 2, 3 out of 5, etc.)
|Area               | Type: Size of the Houses/Apartments/Flats calculated on either Super Area or Carpet Area or Build Area.
|Area               | Locality: Locality of the Houses/Apartments/Flats.
|City               | City where the Houses/Apartments/Flats are Located.
|Furnishing Status  | Furnishing Status of the Houses/Apartments/Flats, either it is Furnished or Semi-Furnished or Unfurnished.
|Tenant             | Preferred: Type of Tenant Preferred by the Owner or Agent.
|Bathroom           | Number of Bathrooms.
|Point of Contact   | Whom should you contact for more information regarding the Houses/Apartments/Flats.

### Считаем сырые данные: 

In [509]:
data_frame = pd.read_csv('./House_Rent_Dataset.csv')
data_frame = data_frame.astype({
    'Posted On'         : 'datetime64[ns]',
    'Floor'             : 'string',
    'Area Type'         : 'string',
    'Area Locality'     : 'string',
    'City'              : 'string',
    'Furnishing Status' : 'string',
    'Tenant Preferred'  : 'string',
    'Point of Contact'  : 'string'})
data_frame

Unnamed: 0,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2022-05-18,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2022-05-13,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2022-05-16,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2022-07-04,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2022-05-09,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner
...,...,...,...,...,...,...,...,...,...,...,...,...
4741,2022-05-18,2,15000,1000,3 out of 5,Carpet Area,Bandam Kommu,Hyderabad,Semi-Furnished,Bachelors/Family,2,Contact Owner
4742,2022-05-15,3,29000,2000,1 out of 4,Super Area,"Manikonda, Hyderabad",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Owner
4743,2022-07-10,3,35000,1750,3 out of 5,Carpet Area,"Himayath Nagar, NH 7",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Agent
4744,2022-07-06,3,45000,1500,23 out of 34,Carpet Area,Gachibowli,Hyderabad,Semi-Furnished,Family,2,Contact Agent


### Начальный анализ признаков:

Численные признаки (__BHK__, __Size__, __Bathroom__) очевидным образом будут включаться в анализ.
Касательно остальных признаков проведём отдельный анализ.

#### __Posted_On__:

Отметим относительно небольшой разброс дат подачи завлений - 3 месяца:

In [510]:
post_days_span = (data_frame['Posted On'].max() - data_frame['Posted On'].min()).days
print(f'Между последним и первым заявлением прошло {post_days_span} дня(/ей)')

Между последним и первым заявлением прошло 89 дня(/ей)


Основных причины изменения цен на рынке недвижимости с течением времени три:
1) __Инфляция__: даёт экспоненциальный рост, который сложно учесть с помощью модели линейной регрессии, однако инфляционные колебания не особо заметны на таком небольшом промежутке времени.
2) __Сезонное изменение цен__: вносит сильный вклад в цены, однако учёт сезонных колебаний предполагает анализ данных на интервале длиной не менее нескольких лет.
3) __Случайные колебания__: несут абсолютно случайный характер и никак не учитываются в коэффициентах модели.

Из вышеуказанного следует, что учёт данного признака в рамках модели линейной регрессии возможен только в категориальном смысле (разбиением всего интервала времени на подинтервалы и последующий сравнительный анализ данных, принадлежащим разным интервалам), однако такой анализ требует вспомогательной оценки интервалов разбиения и выходит за рамки данной работы, поэтому __анализ данных по данному признаку проводиться не будет__. 

Отбросим признак:

In [511]:
data_frame = data_frame.drop(['Posted On'], axis=1)
data_frame

Unnamed: 0,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner
...,...,...,...,...,...,...,...,...,...,...,...
4741,2,15000,1000,3 out of 5,Carpet Area,Bandam Kommu,Hyderabad,Semi-Furnished,Bachelors/Family,2,Contact Owner
4742,3,29000,2000,1 out of 4,Super Area,"Manikonda, Hyderabad",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Owner
4743,3,35000,1750,3 out of 5,Carpet Area,"Himayath Nagar, NH 7",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Agent
4744,3,45000,1500,23 out of 34,Carpet Area,Gachibowli,Hyderabad,Semi-Furnished,Family,2,Contact Agent


#### __Floor__:

Данный признак, очевидно, влияет на стоимость жилья. Проанализируем дополнительно имеющиеся данные:

In [512]:
data_frame.groupby(['Floor'])['Floor'].count()

Floor
1                             2
1 out of 1                  134
1 out of 10                   4
1 out of 11                   1
1 out of 12                   2
                           ... 
Upper Basement out of 4       3
Upper Basement out of 40      1
Upper Basement out of 5       1
Upper Basement out of 7       2
Upper Basement out of 9       2
Name: Floor, Length: 480, dtype: int64

Как видно, присутствует достаточно серьёзный разброс количества данных по данному признаку. Явная категоризация не подойдёт. 

Проверим, сколько данных подчиняется формату "__%level%__ out of __%max_level%__":

In [513]:
data_frame[data_frame['Floor'].str.contains(r'out of')]

Unnamed: 0,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner
...,...,...,...,...,...,...,...,...,...,...,...
4741,2,15000,1000,3 out of 5,Carpet Area,Bandam Kommu,Hyderabad,Semi-Furnished,Bachelors/Family,2,Contact Owner
4742,3,29000,2000,1 out of 4,Super Area,"Manikonda, Hyderabad",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Owner
4743,3,35000,1750,3 out of 5,Carpet Area,"Himayath Nagar, NH 7",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Agent
4744,3,45000,1500,23 out of 34,Carpet Area,Gachibowli,Hyderabad,Semi-Furnished,Family,2,Contact Agent


Как видно, только 4 строки не подчиняются шаблону, отбросим их.

In [514]:
data_frame = data_frame[data_frame['Floor'].str.contains(r'out of')]
data_frame.reset_index(drop = True, inplace = True)

С помощью регулярных выражений заменим фактор __Floor__ двумя факторами __Level__ и __Max Level__:

In [515]:
data_frame.insert(5, 'Level', data_frame['Floor'].str.extract(r'(.*?) out of .*?'))
data_frame.insert(6, 'Max Level', data_frame['Floor'].str.extract(r'.*? out of (.*)'))
data_frame = data_frame.drop(['Floor'], axis=1)
data_frame

Unnamed: 0,BHK,Rent,Size,Area Type,Level,Max Level,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2,10000,1100,Super Area,Ground,2,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2,20000,800,Super Area,1,3,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2,17000,1000,Super Area,1,3,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2,10000,800,Super Area,1,2,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2,7500,850,Carpet Area,1,2,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner
...,...,...,...,...,...,...,...,...,...,...,...,...
4737,2,15000,1000,Carpet Area,3,5,Bandam Kommu,Hyderabad,Semi-Furnished,Bachelors/Family,2,Contact Owner
4738,3,29000,2000,Super Area,1,4,"Manikonda, Hyderabad",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Owner
4739,3,35000,1750,Carpet Area,3,5,"Himayath Nagar, NH 7",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Agent
4740,3,45000,1500,Carpet Area,23,34,Gachibowli,Hyderabad,Semi-Furnished,Family,2,Contact Agent


Пересчитаем количество по новым признакам:

In [516]:
data_frame.groupby(['Level', 'Max Level'])['Level'].count()
print(data_frame.groupby(['Level', 'Max Level'])['Level'].count().to_string())

Level           Max Level
1               1            134
                10             4
                11             1
                12             2
                13             1
                14             2
                15             1
                16             1
                19             1
                2            379
                20             1
                22             1
                24             1
                3            293
                35             1
                4            200
                5             87
                6             17
                7             18
                8             11
                9              3
10              10             3
                11             3
                12             7
                13             4
                14            10
                15             3
                16             4
                18             5
                1

Как видно, количество категорий __Lower Basement__ и __Upper Basement__ небольшое, отбросим их, а __Ground__ заменим на 0:

In [517]:
data_frame = data_frame[data_frame['Level'] != 'Lower Basement']
data_frame = data_frame[data_frame['Level'] != 'Upper Basement']
data_frame.loc[data_frame['Level'] == 'Ground', 'Level'] = '0'
data_frame.reset_index(drop = True, inplace = True)
data_frame

Unnamed: 0,BHK,Rent,Size,Area Type,Level,Max Level,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2,10000,1100,Super Area,0,2,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2,20000,800,Super Area,1,3,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2,17000,1000,Super Area,1,3,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2,10000,800,Super Area,1,2,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2,7500,850,Carpet Area,1,2,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner
...,...,...,...,...,...,...,...,...,...,...,...,...
4703,2,15000,1000,Carpet Area,3,5,Bandam Kommu,Hyderabad,Semi-Furnished,Bachelors/Family,2,Contact Owner
4704,3,29000,2000,Super Area,1,4,"Manikonda, Hyderabad",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Owner
4705,3,35000,1750,Carpet Area,3,5,"Himayath Nagar, NH 7",Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Agent
4706,3,45000,1500,Carpet Area,23,34,Gachibowli,Hyderabad,Semi-Furnished,Family,2,Contact Agent


#### __Area Type__:

Посчитаем количество категорий:

In [518]:
data_frame.groupby(['Area Type'])['Area Type'].count()

Area Type
Built Area        2
Carpet Area    2268
Super Area     2438
Name: Area Type, dtype: int64

Видно, что присутствуют всего два объекта с категорией __Built Area__. Посчитаем их выбросом и удалим:

In [519]:
data_frame = data_frame[data_frame['Area Type'] != 'Built Area']
data_frame.loc[data_frame['Level'] == 'Ground', 'Level'] = '0'
data_frame.reset_index(drop = True, inplace = True)

#### __Area Locality__:

Посчитаем количество категорий:

In [520]:
data_frame.groupby(['Area Locality'])['Area Locality'].count()

Area Locality
 Beeramguda, Ramachandra Puram, NH 9     1
 in Boduppal, NH 2 2                     1
 in Erragadda, NH 9                      1
 in Miyapur, NH 9                        1
117 Residency, Chembur East              1
                                        ..
vanamali chs ghatla, Ghatla              1
venkatapuram                             1
venkatesa perumal nagar                  1
villvam towers tnhb colony               1
whitefield                              12
Name: Area Locality, Length: 2218, dtype: int64

Снова видим большое количество разносортных категорий. Линейная регрессия по признаку не имеет смысла и __не будет осуществляться__. 

Отбросим признак:

In [521]:
data_frame = data_frame.drop(['Area Locality'], axis=1)
data_frame

Unnamed: 0,BHK,Rent,Size,Area Type,Level,Max Level,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2,10000,1100,Super Area,0,2,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2,20000,800,Super Area,1,3,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2,17000,1000,Super Area,1,3,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2,10000,800,Super Area,1,2,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2,7500,850,Carpet Area,1,2,Kolkata,Unfurnished,Bachelors,1,Contact Owner
...,...,...,...,...,...,...,...,...,...,...,...
4701,2,15000,1000,Carpet Area,3,5,Hyderabad,Semi-Furnished,Bachelors/Family,2,Contact Owner
4702,3,29000,2000,Super Area,1,4,Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Owner
4703,3,35000,1750,Carpet Area,3,5,Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Agent
4704,3,45000,1500,Carpet Area,23,34,Hyderabad,Semi-Furnished,Family,2,Contact Agent


#### __City__:

Посчитаем количество категорий:

In [522]:
data_frame.groupby(['City'])['City'].count()

City
Bangalore    883
Chennai      885
Delhi        603
Hyderabad    857
Kolkata      522
Mumbai       956
Name: City, dtype: int64

Видно, что распределение по городам довольно равномерное, имеется смысл проводить анализ по данному признаку.

#### __Furnishing Status__:

Посчитаем категории:

In [523]:
data_frame.groupby(['Furnishing Status'])['Furnishing Status'].count()

Furnishing Status
Furnished          673
Semi-Furnished    2236
Unfurnished       1797
Name: Furnishing Status, dtype: int64

Аналогично __City__, распределение по наличию мебели довольно равномерное, имеется смысл проводить анализ по данному признаку.

#### __Tenant Preferred__:

Посчитаем категории:

In [524]:
data_frame.groupby(['Tenant Preferred'])['Tenant Preferred'].count()

Tenant Preferred
Bachelors            826
Bachelors/Family    3417
Family               463
Name: Tenant Preferred, dtype: int64

Аналогично предыдущему признаку.

#### __Point of Contact__

Посчитаем категории:

In [525]:
data_frame.groupby(['Point of Contact'])['Point of Contact'].count()

Point of Contact
Contact Agent      1513
Contact Builder       1
Contact Owner      3192
Name: Point of Contact, dtype: int64

Видно, что присутствует всего один объект с признаком __Builder__. Посчитаем его выбросом и удалим:

In [526]:
data_frame = data_frame[data_frame['Point of Contact'] != 'Builder']
data_frame.reset_index(drop = True, inplace = True)

#### Итог анализа дискретных признаков:

Получена следующая таблица признаков:

In [527]:
data_frame

Unnamed: 0,BHK,Rent,Size,Area Type,Level,Max Level,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2,10000,1100,Super Area,0,2,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2,20000,800,Super Area,1,3,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2,17000,1000,Super Area,1,3,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2,10000,800,Super Area,1,2,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2,7500,850,Carpet Area,1,2,Kolkata,Unfurnished,Bachelors,1,Contact Owner
...,...,...,...,...,...,...,...,...,...,...,...
4701,2,15000,1000,Carpet Area,3,5,Hyderabad,Semi-Furnished,Bachelors/Family,2,Contact Owner
4702,3,29000,2000,Super Area,1,4,Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Owner
4703,3,35000,1750,Carpet Area,3,5,Hyderabad,Semi-Furnished,Bachelors/Family,3,Contact Agent
4704,3,45000,1500,Carpet Area,23,34,Hyderabad,Semi-Furnished,Family,2,Contact Agent


Посчитаем количество итоговых категорий по дискретным признакам:

In [528]:
print(data_frame.groupby(['City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact'])['City'].count().to_string())

City       Furnishing Status  Tenant Preferred  Point of Contact
Bangalore  Furnished          Bachelors         Contact Agent         7
                                                Contact Owner         4
                              Bachelors/Family  Contact Agent        10
                                                Contact Owner        63
                              Family            Contact Agent         3
                                                Contact Owner         4
           Semi-Furnished     Bachelors         Contact Agent        40
                                                Contact Owner        29
                              Bachelors/Family  Contact Agent        86
                                                Contact Owner       384
                              Family            Contact Agent        16
                                                Contact Owner        27
           Unfurnished        Bachelors         Contact Agent        16

Как видно, получено достаточно равномерное распределение по категориям. Последующий анализ предполагает замену категориальных признаков на бинарные:

In [529]:
target          = 'Rent'
num_features    = ['BHK', 'Size', 'Level', 'Max Level', 'Bathroom']
cat_features    = ['Area Type', 'City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact']
data_frame      = pd.get_dummies(data_frame, columns=cat_features, dtype='int64')
all_features    = list(data_frame.columns)
data_frame

Unnamed: 0,BHK,Rent,Size,Level,Max Level,Bathroom,Area Type_Carpet Area,Area Type_Super Area,City_Bangalore,City_Chennai,...,City_Mumbai,Furnishing Status_Furnished,Furnishing Status_Semi-Furnished,Furnishing Status_Unfurnished,Tenant Preferred_Bachelors,Tenant Preferred_Bachelors/Family,Tenant Preferred_Family,Point of Contact_Contact Agent,Point of Contact_Contact Builder,Point of Contact_Contact Owner
0,2,10000,1100,0,2,2,0,1,0,0,...,0,0,0,1,0,1,0,0,0,1
1,2,20000,800,1,3,1,0,1,0,0,...,0,0,1,0,0,1,0,0,0,1
2,2,17000,1000,1,3,1,0,1,0,0,...,0,0,1,0,0,1,0,0,0,1
3,2,10000,800,1,2,1,0,1,0,0,...,0,0,0,1,0,1,0,0,0,1
4,2,7500,850,1,2,1,1,0,0,0,...,0,0,0,1,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4701,2,15000,1000,3,5,2,1,0,0,0,...,0,0,1,0,0,1,0,0,0,1
4702,3,29000,2000,1,4,3,0,1,0,0,...,0,0,1,0,0,1,0,0,0,1
4703,3,35000,1750,3,5,3,1,0,0,0,...,0,0,1,0,0,1,0,1,0,0
4704,3,45000,1500,23,34,2,1,0,0,0,...,0,0,1,0,0,0,1,1,0,0


Разобъём отдельно на входные данные и выходные:

In [530]:
y_df = data_frame[target]
y_df

0       10000
1       20000
2       17000
3       10000
4        7500
        ...  
4701    15000
4702    29000
4703    35000
4704    45000
4705    15000
Name: Rent, Length: 4706, dtype: int64

In [531]:
A_df = data_frame[[*all_features]]
A_df = A_df.drop(['Rent'], axis=1)
A_df

Unnamed: 0,BHK,Size,Level,Max Level,Bathroom,Area Type_Carpet Area,Area Type_Super Area,City_Bangalore,City_Chennai,City_Delhi,...,City_Mumbai,Furnishing Status_Furnished,Furnishing Status_Semi-Furnished,Furnishing Status_Unfurnished,Tenant Preferred_Bachelors,Tenant Preferred_Bachelors/Family,Tenant Preferred_Family,Point of Contact_Contact Agent,Point of Contact_Contact Builder,Point of Contact_Contact Owner
0,2,1100,0,2,2,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,1
1,2,800,1,3,1,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,1
2,2,1000,1,3,1,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,1
3,2,800,1,2,1,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,1
4,2,850,1,2,1,1,0,0,0,0,...,0,0,0,1,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4701,2,1000,3,5,2,1,0,0,0,0,...,0,0,1,0,0,1,0,0,0,1
4702,3,2000,1,4,3,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,1
4703,3,1750,3,5,3,1,0,0,0,0,...,0,0,1,0,0,1,0,1,0,0
4704,3,1500,23,34,2,1,0,0,0,0,...,0,0,1,0,0,0,1,1,0,0


In [532]:
A_train, A_test, y_train, y_test = sklmodsel.train_test_split(A_df, y_df, test_size=0.2, random_state=3)
y_train = y_train.values.reshape(-1,1)
y_test = y_test.values.reshape(-1,1)

Произведём Scaling для большей численной стабильности:

In [533]:
A_scaler = sklprep.StandardScaler()
A_train = A_scaler.fit_transform(A_train)
A_test = A_scaler.transform(A_test)

### Явное решение:

Общий вид нахождения коэффициентов линейной регрессии при L2-регуляризации имеет вид: 
$$\vec{x} = (\hat{A}^T \hat{A}    + \alpha \hat{I})^{-1} \hat{A}^T \vec{y}$$
При $\alpha = 0$ получаем обыкновенную линейную регрессию.

Однако в случае существования нелинейного сдвига (прямая регрессии исходит не из нуля) необходимо видоизменить матрицу признаков:

$$\vec{\chi} = (\hat{B}^T \hat{B}    + \alpha \hat{I})^{-1} \hat{B}^T \vec{y}$$
$$\vec{\chi}    =  \begin{bmatrix} \vec{x} \\   y_0     \end{bmatrix}$$
$$\hat{B}      =  \begin{bmatrix} \hat{A} &    \vec{1} \end{bmatrix}$$

Рассмотрим результаты при $\alpha = 0$:

In [534]:
alpha = 0
B       = np.hstack((A_train, np.ones((len(A_train), 1))))
INV     = np.linalg.inv((B.T @ B + alpha * np.eye(len(B[0])))) @ B.T
sol     = (INV @ y_train).reshape(-1)

y0      = sol[-1]
x       = sol[:-1]

x, y0

(array([ 1.71175520e+04,  8.91395526e+03,  5.35760971e+02,  5.24167383e+04,
         1.18908468e+04,  1.68462272e+20,  1.68462272e+20,  1.48278002e+18,
         1.47321190e+18,  1.27062764e+18,  1.45108793e+18,  1.19135274e+18,
         1.50979271e+18,  6.79918452e+16,  9.71619875e+16,  9.45473901e+16,
        -5.20711699e+18, -6.13624327e+18, -4.12869160e+18,  1.03937688e+05,
         3.30783681e+03,  9.72866838e+04]),
 34961.720510095605)

Метрики:

In [535]:
y_predicted = np.dot(A_test, x) + y0
mae     = sklmetrics.mean_absolute_error            (y_test, y_predicted)
mape    = sklmetrics.mean_absolute_percentage_error (y_test, y_predicted)
mse     = sklmetrics.mean_squared_error             (y_test, y_predicted)
mape    = sklmetrics.mean_absolute_percentage_error (y_test, y_predicted)
r2      = sklmetrics.r2_score                       (y_test, y_predicted)
rmse    =  np.sqrt(mse)

print('Метрики:')
print('MAE:', round(mae, 3))
print('MSE:', round(mse, 3))
print('RMSE:', round(rmse, 3))
print('MAPE:', round(mape, 4))
print('R2:', round(r2, 4))

Метрики:
MAE: 148719.446
MSE: 28749092642.015
RMSE: 169555.574
MAPE: 10.5811
R2: -6.7485


Аналогично при $\alpha = 1$:

In [536]:
alpha = 1
B       = np.hstack((A_train, np.ones((len(A_train), 1))))
INV     = np.linalg.inv((B.T @ B + alpha * np.eye(len(B[0])))) @ B.T
sol     = (INV @ y_train).reshape(-1)

y0      = sol[-1]
x       = sol[:-1]

x, y0

(array([ 2608.02155333, 26276.0853552 ,  5647.59283492,  1264.91514558,
         6929.03583793,   985.85006401,  -985.85006401, -2140.3256728 ,
        -4490.96869264,  1139.69536444, -8208.10816424, -2268.78550335,
        15204.24792211,  2486.73944126,  -967.0566853 ,  -794.48891702,
          720.83593052,  1196.77869345, -2687.82541519,  1566.68262021,
          490.66822832, -1583.51038118]),
 34952.43452855246)

Метрики:

In [537]:
y_predicted = np.dot(A_test, x) + y0
mae     = sklmetrics.mean_absolute_error            (y_test, y_predicted)
mape    = sklmetrics.mean_absolute_percentage_error (y_test, y_predicted)
mse     = sklmetrics.mean_squared_error             (y_test, y_predicted)
mape    = sklmetrics.mean_absolute_percentage_error (y_test, y_predicted)
r2      = sklmetrics.r2_score                       (y_test, y_predicted)
rmse    =  np.sqrt(mse)

print('Метрики:')
print('MAE:', round(mae, 3))
print('MSE:', round(mse, 3))
print('RMSE:', round(rmse, 3))
print('MAPE:', round(mape, 4))
print('R2:', round(r2, 4))

Метрики:
MAE: 22214.693
MSE: 1919871033.347
RMSE: 43816.333
MAPE: 1.1687
R2: 0.4826


Как видно из метрик, решение с регуляризацией позволяет избавиться от переобучения.

### Градиентный спуск:

Найдём решение с помощью градиентного спуска:

In [538]:
w_init=None 
eta=1e-2
eps=1e-8 
max_iter=1e5

B = np.array(B)
y = np.array(y_train)
t_init = np.zeros(B.shape[1])
weight_dist = np.inf
i = 0
t = t_init
t = t.reshape(-1,1)
while (weight_dist > eps) and (i < max_iter):
    gradient_step = (2/len(B)) * eta * np.dot(B.T, (np.dot(B, t) - y))
    t_new = t - gradient_step
    weight_dist = np.linalg.norm(t_new - t)
    i += 1
    t = t_new

y0 = t[-1].reshape(-1)
x  = t[:-1].reshape(-1)
x, y0

(array([ 2601.99303813, 26295.19064959,  5652.83225915,  1257.06480868,
         6923.01602907,   985.88968866,  -985.88968866, -2142.7623604 ,
        -4493.55023027,  1141.53356832, -8212.57002685, -2270.30041191,
        15213.09675018,  2486.08365168,  -967.2147858 ,  -793.85484663,
          721.29053194,  1196.95744771, -2688.6644322 ,  1564.02758107,
          491.01223075, -1580.86798042]),
 array([34961.7205101]))

Метрики:

In [539]:
y_predicted = np.dot(A_test, x) + y0
mae     = sklmetrics.mean_absolute_error            (y_test, y_predicted)
mape    = sklmetrics.mean_absolute_percentage_error (y_test, y_predicted)
mse     = sklmetrics.mean_squared_error             (y_test, y_predicted)
mape    = sklmetrics.mean_absolute_percentage_error (y_test, y_predicted)
r2      = sklmetrics.r2_score                       (y_test, y_predicted)
rmse    =  np.sqrt(mse)

print('Метрики:')
print('MAE:', round(mae, 3))
print('MSE:', round(mse, 3))
print('RMSE:', round(rmse, 3))
print('MAPE:', round(mape, 4))
print('R2:', round(r2, 4))

Метрики:
MAE: 22218.284
MSE: 1919964505.952
RMSE: 43817.4
MAPE: 1.1688
R2: 0.4825


Как видно, результат градиентного спуска с точностью до погрешности соответствует явному решению с L2-регуляризацией.