**Task**
* Using the data from the training dataset (train.csv), build a model to predict real estate (apartment) prices.
Using the resulting model, predict prices for the apartments in the test dataset (test.csv).

* Target variable:
Price

* Quality metric:
R2 - coefficient of determination (sklearn.metrics.r2_score)


* Solution requirements:

1) R2 > 0.6

2) A Jupyter Notebook with the code of your solution, named after the pattern {FullName}_solution.ipynb, for example SShirkin_solution.ipynb

3) A CSV file with predictions of the target variable for the test dataset, named after the pattern {FullName}_predictions.csv, for example SShirkin_predictions.csv

The file must contain two fields, Id and Price, and must have 5001 lines (header + 5000 predictions).
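
The R2 threshold above can be checked by hand. The following is a minimal sketch, assuming invented toy prices and a hypothetical helper name `r2_manual`; it follows the usual definition R2 = 1 - SS_res / SS_tot, which is the quantity `sklearn.metrics.r2_score` from the task statement computes.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only (not taken from train.csv)
y_true_toy = [150000, 200000, 250000, 300000]
y_pred_toy = [160000, 190000, 260000, 290000]

score = r2_manual(y_true_toy, y_pred_toy)
print(round(score, 3))  # 0.968
print(score > 0.6)      # True: this toy prediction would pass the requirement
```

A model that always predicts the mean of `y_true` scores 0, so the 0.6 threshold asks for a clear improvement over that baseline.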

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('train.csv')
df.head()

,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
0,14038,35,2.0,47.981561,29.442751,6.0,7,9.0,1969,0.08904,B,B,33,7976,5,,0,11,B,184966.93073
1,15053,41,3.0,65.68364,40.049543,8.0,7,9.0,1978,7e-05,B,B,46,10309,1,240.0,1,16,B,300009.450063
2,4765,53,2.0,44.947953,29.197612,0.0,8,12.0,1968,0.049637,B,B,34,7759,0,229.0,1,3,B,220925.908524
3,5809,58,2.0,53.352981,52.731512,9.0,8,17.0,1977,0.437885,B,B,23,5735,3,1084.0,0,5,B,175616.227217
4,10783,99,1.0,39.649192,23.776169,7.0,11,12.0,1976,0.012339,B,B,35,5776,1,2078.0,2,4,B,150226.531644


**Dataset description**:

* **Id** - apartment identification number
* **DistrictId** - district identification number
* **Rooms** - number of rooms
* **Square** - area
* **LifeSquare** - living area
* **KitchenSquare** - kitchen area
* **Floor** - floor
* **HouseFloor** - number of floors in the building
* **HouseYear** - year the building was built
* **Ecology_1, Ecology_2, Ecology_3** - environmental indicators of the area
* **Social_1, Social_2, Social_3** - social indicators of the area
* **Healthcare_1, Helthcare_2** - healthcare-related indicators of the area
* **Shops_1, Shops_2** - indicators related to the availability of shops and shopping centres
* **Price** - apartment price

In [3]:
df.shape

(10000, 20)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             10000 non-null  int64  
 1   DistrictId     10000 non-null  int64  
 2   Rooms          10000 non-null  float64
 3   Square         10000 non-null  float64
 4   LifeSquare     7887 non-null   float64
 5   KitchenSquare  10000 non-null  float64
 6   Floor          10000 non-null  int64  
 7   HouseFloor     10000 non-null  float64
 8   HouseYear      10000 non-null  int64  
 9   Ecology_1      10000 non-null  float64
 10  Ecology_2      10000 non-null  object 
 11  Ecology_3      10000 non-null  object 
 12  Social_1       10000 non-null  int64  
 13  Social_2       10000 non-null  int64  
 14  Social_3       10000 non-null  int64  
 15  Healthcare_1   5202 non-null   float64
 16  Helthcare_2    10000 non-null  int64  
 17  Shops_1        10000 non-null  int64  
 18  Shops_2

In [5]:
df.dtypes

Id                 int64
DistrictId         int64
Rooms            float64
Square           float64
LifeSquare       float64
KitchenSquare    float64
Floor              int64
HouseFloor       float64
HouseYear          int64
Ecology_1        float64
Ecology_2         object
Ecology_3         object
Social_1           int64
Social_2           int64
Social_3           int64
Healthcare_1     float64
Helthcare_2        int64
Shops_1            int64
Shops_2           object
Price            float64
dtype: object

In [6]:
df.describe()

,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
count,10000.0,10000.0,10000.0,10000.0,7887.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,5202.0,10000.0,10000.0,10000.0
mean,8383.4077,50.4008,1.8905,56.315775,37.199645,6.2733,8.5267,12.6094,3990.166,0.118858,24.687,5352.1574,8.0392,1142.90446,1.3195,4.2313,214138.857399
std,4859.01902,43.587592,0.839512,21.058732,86.241209,28.560917,5.241148,6.775974,200500.3,0.119025,17.532614,4006.799803,23.831875,1021.517264,1.493601,4.806341,92872.293865
min,0.0,0.0,0.0,1.136859,0.370619,0.0,1.0,0.0,1910.0,0.0,0.0,168.0,0.0,0.0,0.0,0.0,59174.778028
25%,4169.5,20.0,1.0,41.774881,22.769832,1.0,4.0,9.0,1974.0,0.017647,6.0,1564.0,0.0,350.0,0.0,1.0,153872.633942
50%,8394.5,36.0,2.0,52.51331,32.78126,6.0,7.0,13.0,1977.0,0.075424,25.0,5285.0,2.0,900.0,1.0,3.0,192269.644879
75%,12592.5,75.0,2.0,65.900625,45.128803,9.0,12.0,17.0,2001.0,0.195781,36.0,7227.0,5.0,1548.0,2.0,6.0,249135.462171
max,16798.0,209.0,19.0,641.065193,7480.592129,2014.0,42.0,117.0,20052010.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0,633233.46657
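
Several extremes in this summary look like data-entry errors rather than real apartments: HouseYear has a maximum of 20052010, Rooms ranges from 0 to 19, and KitchenSquare reaches 2014 while Square never exceeds about 641. The following is a sketch of one way to repair such values. The helper name `clean_outliers`, the thresholds (`max_year`, `max_rooms`) and the toy rows are assumptions chosen for illustration, not rules taken from the notebook.

```python
import pandas as pd

def clean_outliers(frame, max_year=2020, max_rooms=6):
    """Return a copy with implausible values replaced by the median of the plausible ones.

    The thresholds are illustrative assumptions, not part of the dataset description.
    """
    out = frame.copy()

    # Years in the future are treated as typos (int() truncates a fractional median)
    bad_year = out['HouseYear'] > max_year
    out.loc[bad_year, 'HouseYear'] = int(out.loc[~bad_year, 'HouseYear'].median())

    # Zero rooms or an unusually large room count
    bad_rooms = (out['Rooms'] < 1) | (out['Rooms'] > max_rooms)
    out.loc[bad_rooms, 'Rooms'] = out.loc[~bad_rooms, 'Rooms'].median()

    # A kitchen cannot be larger than the whole apartment
    bad_kitchen = out['KitchenSquare'] > out['Square']
    out.loc[bad_kitchen, 'KitchenSquare'] = out.loc[~bad_kitchen, 'KitchenSquare'].median()

    return out

# Toy rows for illustration only
toy = pd.DataFrame({
    'Rooms': [2.0, 19.0, 0.0, 1.0, 3.0],
    'Square': [48.0, 65.7, 44.9, 39.6, 80.0],
    'KitchenSquare': [6.0, 8.0, 2014.0, 7.0, 9.0],
    'HouseYear': [1969, 20052010, 1977, 1977, 2001],
})

cleaned = clean_outliers(toy)
print(cleaned)
```

Replacing with the median keeps every row, which matters here because the test file must yield exactly 5000 predictions; dropping rows is only an option on the training side.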


In [7]:
df.describe(include='object')

,Ecology_2,Ecology_3,Shops_2
count,10000,10000,10000
unique,2,2,2
top,B,B,B
freq,9903,9725,9175


In [8]:
df['Ecology_2'].value_counts()

B    9903
A      97
Name: Ecology_2, dtype: int64

In [9]:
df['Ecology_3'].value_counts()

B    9725
A     275
Name: Ecology_3, dtype: int64

In [10]:
df['Shops_2'].value_counts()

B    9175
A     825
Name: Shops_2, dtype: int64
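
All three object columns take only the values A and B, so they can be turned into 0/1 flags before being passed to a model that expects numeric input. A minimal sketch, assuming a hypothetical helper `encode_binary` and toy rows; the choice of A -> 0, B -> 1 is arbitrary.

```python
import pandas as pd

BINARY_COLUMNS = ['Ecology_2', 'Ecology_3', 'Shops_2']

def encode_binary(frame, columns=BINARY_COLUMNS):
    """Return a copy where 'A'/'B' columns are mapped to 0/1 integers."""
    out = frame.copy()
    for col in columns:
        # Any value other than 'A' or 'B' becomes NaN, and astype then raises,
        # so unexpected categories are not silently encoded
        out[col] = out[col].map({'A': 0, 'B': 1}).astype('int8')
    return out

# Toy rows for illustration only
toy_cat = pd.DataFrame({
    'Ecology_2': ['B', 'A', 'B'],
    'Ecology_3': ['B', 'B', 'A'],
    'Shops_2': ['A', 'B', 'B'],
})

encoded = encode_binary(toy_cat)
print(encoded)
```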

In [11]:
df.isna().sum()

Id                  0
DistrictId          0
Rooms               0
Square              0
LifeSquare       2113
KitchenSquare       0
Floor               0
HouseFloor          0
HouseYear           0
Ecology_1           0
Ecology_2           0
Ecology_3           0
Social_1            0
Social_2            0
Social_3            0
Healthcare_1     4798
Helthcare_2         0
Shops_1             0
Shops_2             0
Price               0
dtype: int64

In [12]:
df['Healthcare_1'].value_counts()

540.0     511
30.0      348
1046.0    245
750.0     163
229.0     148
         ... 
370.0      14
32.0       12
1815.0     10
35.0        2
0.0         1
Name: Healthcare_1, Length: 79, dtype: int64

In [13]:
df['LifeSquare'].value_counts()

35.812832    1
58.218079    1
35.213655    1
23.656629    1
4.289714     1
            ..
82.418226    1
43.005439    1
56.867287    1
87.018830    1
33.743934    1
Name: LifeSquare, Length: 7887, dtype: int64

In [14]:
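# Assign the result back; fillna(inplace=True) on a column selection is deprecated in recent pandas
df['Healthcare_1'] = df['Healthcare_1'].fillna(df['Healthcare_1'].median())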

In [15]:
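# Assign the result back; fillna(inplace=True) on a column selection is deprecated in recent pandas
df['LifeSquare'] = df['LifeSquare'].fillna(df['LifeSquare'].median())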
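
The same gaps will have to be filled in test.csv before predicting. One common approach is to compute the medians once on the training frame and reuse them for the test frame, so that both are filled with identical values. A minimal sketch, assuming hypothetical helpers `fit_medians` and `apply_medians` and toy frames in place of the real files:

```python
import numpy as np
import pandas as pd

def fit_medians(train, columns):
    """Per-column medians computed on the training frame only."""
    return train[columns].median()

def apply_medians(frame, medians):
    """Return a copy with NaNs replaced by the stored medians, matched by column name."""
    return frame.fillna(medians)

# Toy frames for illustration only
train_toy = pd.DataFrame({
    'LifeSquare': [30.0, np.nan, 50.0, 40.0],
    'Healthcare_1': [240.0, np.nan, np.nan, 1084.0],
})
test_toy = pd.DataFrame({
    'LifeSquare': [np.nan, 25.0],
    'Healthcare_1': [229.0, np.nan],
})

medians = fit_medians(train_toy, ['LifeSquare', 'Healthcare_1'])
train_filled = apply_medians(train_toy, medians)
test_filled = apply_medians(test_toy, medians)
print(test_filled)
```

Passing a Series to `DataFrame.fillna` fills each column with the value stored under its own name, so columns that are not in `medians` are left untouched.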

In [17]:
df.isna().sum()

Id               0
DistrictId       0
Rooms            0
Square           0
LifeSquare       0
KitchenSquare    0
Floor            0
HouseFloor       0
HouseYear        0
Ecology_1        0
Ecology_2        0
Ecology_3        0
Social_1         0
Social_2         0
Social_3         0
Healthcare_1     0
Helthcare_2      0
Shops_1          0
Shops_2          0
Price            0
dtype: int64

In [18]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
Id,10000.0,8383.4077,4859.01902,0.0,4169.5,8394.5,12592.5,16798.0
DistrictId,10000.0,50.4008,43.587592,0.0,20.0,36.0,75.0,209.0
Rooms,10000.0,1.8905,0.839512,0.0,1.0,2.0,2.0,19.0
Square,10000.0,56.315775,21.058732,1.136859,41.774881,52.51331,65.900625,641.0652
LifeSquare,10000.0,36.26604,76.609981,0.370619,25.527399,32.78126,41.427234,7480.592
KitchenSquare,10000.0,6.2733,28.560917,0.0,1.0,6.0,9.0,2014.0
Floor,10000.0,8.5267,5.241148,1.0,4.0,7.0,12.0,42.0
HouseFloor,10000.0,12.6094,6.775974,0.0,9.0,13.0,17.0,117.0
HouseYear,10000.0,3990.1663,200500.261427,1910.0,1974.0,1977.0,2001.0,20052010.0
Ecology_1,10000.0,0.118858,0.119025,0.0,0.017647,0.075424,0.195781,0.5218671


In [24]:
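# numeric_only=True skips the A/B object columns, which pandas 2.x would otherwise fail on
df.cov(numeric_only=True)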

,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
Id,23610070.0,2747.544,-23.850342,-1030.511,6223.198644,2758.896565,34.322197,-275.77803,4875103.0,10.466502,-65.729663,-39584.21,-1083.644646,-10342.34,10.89933,-203.59416,4458647.0
DistrictId,2747.544,1899.878,2.613849,-24.42765,-39.561561,50.241786,-27.499151,-44.02185,117370.8,0.338747,188.348085,29249.61,141.371526,8630.521,19.930937,36.497345,1073148.0
Rooms,-23.85034,2.613849,0.70478,11.71932,7.751859,0.122839,-0.002927,-0.166687,-1786.198,-0.003232,1.118338,239.9547,0.256318,22.74487,0.079693,0.216349,42904.73
Square,-1030.511,-24.42765,11.719321,443.4702,268.631062,5.003894,12.669665,11.630168,-38134.25,-0.161618,-26.099869,-3638.403,17.686202,-548.0143,-0.722163,2.161707,1017148.0
LifeSquare,6223.199,-39.56156,7.751859,268.6311,5869.089131,1.964886,7.204339,11.287731,-32058.18,-0.180225,-41.55527,-7845.561,14.416688,-356.1595,-1.851038,-1.235468,557320.3
KitchenSquare,2758.897,50.24179,0.122839,5.003894,1.964886,815.72598,-1.706018,0.150966,5487.465,-0.019111,21.722015,4326.322,-10.305444,187.5044,1.841565,1.402326,76562.69
Floor,34.3222,-27.49915,-0.002927,12.66967,7.204339,-1.706018,27.469634,14.879817,975.1562,-0.010064,-4.127156,-347.7545,-0.279375,-415.3078,-0.513032,0.611235,62653.13
HouseFloor,-275.778,-44.02185,-0.166687,11.63017,11.287731,0.150966,14.879817,45.913823,-1174.024,-0.003518,-2.471205,195.304,-1.31402,-529.0903,-0.695573,0.855831,55554.48
HouseYear,4875103.0,117370.8,-1786.19801,-38134.25,-32058.176732,5487.465297,975.156225,-1174.023646,40200350000.0,34.951017,10638.261178,1582617.0,3912.516233,-1534512.0,3367.644632,3547.540189,80170400.0
Ecology_1,10.4665,0.3387471,-0.003232,-0.1616182,-0.180225,-0.019111,-0.010064,-0.003518,34.95102,0.014167,0.055226,4.41792,-0.351932,0.3888005,0.005488,-0.043906,-645.3494


In [25]:
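# numeric_only=True skips the A/B object columns, which pandas 2.x would otherwise fail on
df.corr(numeric_only=True)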

,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
Id,1.0,0.012973,-0.005847,-0.010071,0.016718,0.01988,0.001348,-0.008376,0.005004,0.018097,-0.000772,-0.002033,-0.009358,-0.002851,0.001502,-0.008718,0.00988
DistrictId,0.012973,1.0,0.071432,-0.026613,-0.011847,0.040358,-0.120373,-0.149051,0.01343,0.065294,0.246463,0.167479,0.136095,0.265185,0.306147,0.174214,0.2651
Rooms,-0.005847,0.071432,1.0,0.662893,0.12053,0.005123,-0.000665,-0.029302,-0.010612,-0.032347,0.07598,0.071335,0.012811,0.036285,0.063557,0.053618,0.550291
Square,-0.010071,-0.026613,0.662893,1.0,0.166509,0.00832,0.114791,0.081505,-0.009032,-0.064479,-0.07069,-0.04312,0.035241,-0.034853,-0.02296,0.021357,0.520075
LifeSquare,0.016718,-0.011847,0.12053,0.166509,1.0,0.000898,0.017942,0.021745,-0.002087,-0.019765,-0.030938,-0.025559,0.007896,-0.006226,-0.016177,-0.003355,0.078331
KitchenSquare,0.01988,0.040358,0.005123,0.00832,0.000898,1.0,-0.011397,0.00078,0.000958,-0.005622,0.043379,0.037805,-0.01514,0.008793,0.04317,0.010216,0.028864
Floor,0.001348,-0.120373,-0.000665,0.114791,0.017942,-0.011397,1.0,0.418986,0.000928,-0.016133,-0.044914,-0.01656,-0.002237,-0.106125,-0.065537,0.024264,0.128715
HouseFloor,-0.008376,-0.149051,-0.029302,0.081505,0.021745,0.00078,0.418986,1.0,-0.000864,-0.004362,-0.020801,0.007194,-0.008137,-0.104576,-0.068728,0.026279,0.08828
HouseYear,0.005004,0.01343,-0.010612,-0.009032,-0.002087,0.000958,0.000928,-0.000864,1.0,0.001465,0.003026,0.00197,0.000819,-0.01025,0.011245,0.003681,0.004305
Ecology_1,0.018097,0.065294,-0.032347,-0.064479,-0.019765,-0.005622,-0.016133,-0.004362,0.001465,1.0,0.026464,0.009264,-0.124068,0.004375,0.030873,-0.076749,-0.058381
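
The correlation matrix is the covariance matrix rescaled by the standard deviations of the two variables: corr(X, Y) = cov(X, Y) / (std(X) * std(Y)). As a quick check, the Rooms/Price entry can be reproduced from numbers already printed in the outputs above. The variable names below are introduced only for this check.

```python
# Values copied from the outputs above
cov_rooms_price = 42904.73   # df.cov(): Rooms row, Price column
std_rooms = 0.839512         # df.describe(): std of Rooms
std_price = 92872.293865     # df.describe(): std of Price

# Pearson correlation = covariance divided by the product of standard deviations
corr_rooms_price = cov_rooms_price / (std_rooms * std_price)
print(round(corr_rooms_price, 4))  # 0.5503
```

This agrees with the 0.550291 reported by `df.corr()` for Rooms and Price. It also explains why the covariance entries involving HouseYear are so large: covariance is scale-dependent, and the HouseYear column contains the value 20052010.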


In [27]:
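# numeric_only=True skips the A/B object columns, which pandas 2.x would otherwise fail on
df.corr(numeric_only=True)['Price'].sort_values()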

Ecology_1       -0.058381
HouseYear        0.004305
Id               0.009880
KitchenSquare    0.028864
Social_3         0.074878
LifeSquare       0.078331
HouseFloor       0.088280
Healthcare_1     0.128059
Floor            0.128715
Shops_1          0.180876
Social_2         0.239226
Helthcare_2      0.253090
Social_1         0.263286
DistrictId       0.265100
Square           0.520075
Rooms            0.550291
Price            1.000000
Name: Price, dtype: float64

In [28]:
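# Saves the cleaned training frame; index=False keeps the pandas index out of the file
df.to_csv('Christina.Yarochkina_predictions.csv', index=False, encoding='utf-8')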
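
The task asks for a file with exactly two fields, Id and Price, holding model predictions for test.csv. The cells above stop before the modelling step, so the following is only a sketch of what that step could look like. The choice of `RandomForestRegressor`, the helper names `toy_frame` and `make_submission`, the feature list, and the synthetic data are all assumptions made for illustration; none of them come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def toy_frame(n, with_price):
    """Synthetic stand-in for train.csv / test.csv (illustration only)."""
    square = rng.uniform(30, 100, size=n)
    frame = pd.DataFrame({
        'Id': np.arange(n),
        'Rooms': np.clip(np.round(square / 25), 1, 4),
        'Square': square,
    })
    if with_price:
        frame['Price'] = 3500 * square + rng.normal(0, 5000, size=n)
    return frame

def make_submission(train, test, features, target='Price', random_state=42):
    """Validate on a hold-out split, refit on all training rows, predict the test rows."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        train[features], train[target], test_size=0.3, random_state=random_state
    )
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(X_tr, y_tr)
    valid_r2 = r2_score(y_val, model.predict(X_val))  # metric named in the task

    model.fit(train[features], train[target])  # use every labelled row for the final model
    submission = pd.DataFrame({
        'Id': test['Id'].to_numpy(),
        'Price': model.predict(test[features]),
    })
    return submission, valid_r2

train_toy = toy_frame(300, with_price=True)
test_toy = toy_frame(50, with_price=False)

submission, valid_r2 = make_submission(train_toy, test_toy, ['Rooms', 'Square'])
print(submission.head())
print(valid_r2 > 0.6)
```

On the real data the same call would take the cleaned `df` and an identically preprocessed frame read from test.csv. Writing the result with `submission.to_csv(path, index=False)` then produces a header plus one line per test apartment, which is the 5001-line layout the task describes.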