#  Real Estate Price Prediction
[Geekbrains Python for Data Science course competition](https://www.kaggle.com/c/real-estate-price-prediction-moscow)

Course project for the topic "Python Libraries for Data Science: Numpy, Matplotlib, Scikit-learn".

**Task:** predict real estate prices.

**Metric:** R2, the coefficient of determination (sklearn.metrics.r2_score)
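
For reference, the coefficient of determination computed by `r2_score` can be written as below. The symbols are notation introduced here rather than taken from the competition page: $y_i$ are the true prices, $\hat{y}_i$ the predicted prices, $\bar{y}$ the mean of the true prices, and $n$ the number of objects.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

A model that always predicts $\bar{y}$ scores 0, a perfect model scores 1, and a model that does worse than the mean scores below 0.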

## Author: [Посягин Константин](https://gb.ru/users/1024991), group 1114

Let's begin!

_________________________________________

# 0. Import libraries

In [45]:
import numpy as np
import pandas as pd
import math

import matplotlib as mpl
from matplotlib import pyplot as plt
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

import seaborn as sns
sns.set_style('whitegrid')

import warnings
warnings.filterwarnings("ignore") 

import re
from pylab import rcParams

mpl.rcParams.update({'font.size': 14})

import sklearn as skl
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse, r2_score as r2

# 1. Load the data

In [2]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [3]:
df_train.columns

Index(['Id', 'DistrictId', 'Rooms', 'Square', 'LifeSquare', 'KitchenSquare',
       'Floor', 'HouseFloor', 'HouseYear', 'Ecology_1', 'Ecology_2',
       'Ecology_3', 'Social_1', 'Social_2', 'Social_3', 'Healthcare_1',
       'Helthcare_2', 'Shops_1', 'Shops_2', 'Price'],
      dtype='object')

## Dataset description

**Id** - apartment identifier

**DistrictId** - district identifier

**Rooms** - number of rooms

**Square** - total area

**LifeSquare** - living area

**KitchenSquare** - kitchen area

**Floor** - floor

**HouseFloor** - number of floors in the building

**HouseYear** - year the building was built

**Ecology_1, Ecology_2, Ecology_3** - environmental indicators of the area

**Social_1, Social_2, Social_3** - social indicators of the area

**Healthcare_1, Helthcare_2** - healthcare-related indicators of the area

**Shops_1, Shops_2** - indicators related to the presence of shops and shopping centers

**Price** - apartment price

In [4]:
df_train.head()

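| | Id | DistrictId | Rooms | Square | LifeSquare | KitchenSquare | Floor | HouseFloor | HouseYear | Ecology_1 | Ecology_2 | Ecology_3 | Social_1 | Social_2 | Social_3 | Healthcare_1 | Helthcare_2 | Shops_1 | Shops_2 | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11809 | 27 | 3.0 | 115.027311 | NaN | 10.0 | 4 | 10.0 | 2014 | 0.075424 | B | B | 11 | 3097 | 0 | NaN | 0 | 0 | B | 305018.871089 |
| 1 | 3013 | 22 | 1.0 | 39.832524 | 23.169223 | 8.0 | 7 | 8.0 | 1966 | 0.118537 | B | B | 30 | 6207 | 1 | 1183.0 | 1 | 0 | B | 177734.553407 |
| 2 | 8215 | 1 | 3.0 | 78.342215 | 47.671972 | 10.0 | 2 | 17.0 | 1988 | 0.025609 | B | B | 33 | 5261 | 0 | 240.0 | 3 | 1 | B | 282078.72085 |
| 3 | 2352 | 1 | 1.0 | 40.409907 | NaN | 1.0 | 10 | 22.0 | 1977 | 0.007122 | B | B | 1 | 264 | 0 | NaN | 0 | 1 | B | 168106.00763 |
| 4 | 13866 | 94 | 2.0 | 64.285067 | 38.562517 | 9.0 | 16 | 16.0 | 1972 | 0.282798 | B | B | 33 | 8667 | 2 | NaN | 0 | 6 | B | 343995.102962 |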


In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             10000 non-null  int64  
 1   DistrictId     10000 non-null  int64  
 2   Rooms          10000 non-null  float64
 3   Square         10000 non-null  float64
 4   LifeSquare     7887 non-null   float64
 5   KitchenSquare  10000 non-null  float64
 6   Floor          10000 non-null  int64  
 7   HouseFloor     10000 non-null  float64
 8   HouseYear      10000 non-null  int64  
 9   Ecology_1      10000 non-null  float64
 10  Ecology_2      10000 non-null  object 
 11  Ecology_3      10000 non-null  object 
 12  Social_1       10000 non-null  int64  
 13  Social_2       10000 non-null  int64  
 14  Social_3       10000 non-null  int64  
 15  Healthcare_1   5202 non-null   float64
 16  Helthcare_2    10000 non-null  int64  
 17  Shops_1        10000 non-null  int64  
 18  Shops_2

Right away we can see:

1. The LifeSquare and Healthcare_1 columns have missing values.
2. The Ecology_2, Ecology_3 and Shops_2 columns have dtype object.
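
Both observations can also be obtained programmatically instead of reading them off `info()`. The snippet below is a minimal sketch on a made-up toy frame; `toy`, `missing` and `object_cols` are hypothetical names, and the values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train (values are made up).
toy = pd.DataFrame({
    "LifeSquare": [23.2, np.nan, 47.7, np.nan],
    "Healthcare_1": [np.nan, 1183.0, np.nan, np.nan],
    "Ecology_2": ["B", "B", "A", "B"],
    "Floor": [4, 7, 2, 10],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]

# Names of the non-numeric columns (reported as object in the info() output).
object_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing)
print(object_cols)
```

On the real data the same two expressions would be applied to `df_train`.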

Let's take a closer look at the summary statistics.

In [6]:
df_train.describe(include='all')

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
count,10000.0,10000.0,10000.0,10000.0,7887.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000,10000,10000.0,10000.0,10000.0,5202.0,10000.0,10000.0,10000,10000.0
unique,,,,,,,,,,,2,2,,,,,,,2,
top,,,,,,,,,,,B,B,,,,,,,B,
freq,,,,,,,,,,,9903,9725,,,,,,,9175,
mean,8383.4077,50.4008,1.8905,56.315775,37.199645,6.2733,8.5267,12.6094,3990.166,0.118858,,,24.687,5352.1574,8.0392,1142.90446,1.3195,4.2313,,214138.857399
std,4859.01902,43.587592,0.839512,21.058732,86.241209,28.560917,5.241148,6.775974,200500.3,0.119025,,,17.532614,4006.799803,23.831875,1021.517264,1.493601,4.806341,,92872.293865
min,0.0,0.0,0.0,1.136859,0.370619,0.0,1.0,0.0,1910.0,0.0,,,0.0,168.0,0.0,0.0,0.0,0.0,,59174.778028
25%,4169.5,20.0,1.0,41.774881,22.769832,1.0,4.0,9.0,1974.0,0.017647,,,6.0,1564.0,0.0,350.0,0.0,1.0,,153872.633942
50%,8394.5,36.0,2.0,52.51331,32.78126,6.0,7.0,13.0,1977.0,0.075424,,,25.0,5285.0,2.0,900.0,1.0,3.0,,192269.644879
75%,12592.5,75.0,2.0,65.900625,45.128803,9.0,12.0,17.0,2001.0,0.195781,,,36.0,7227.0,5.0,1548.0,2.0,6.0,,249135.462171


## Type conversion

The Id and DistrictId columns have dtype *int64*.

These values are identifiers, so comparing them arithmetically is meaningless; we cast them to *str*.

In [7]:
df_train['Id'] = df_train['Id'].astype(str)
df_train['DistrictId'] = df_train['DistrictId'].astype(str)

# 2. Baseline solution

Let's compute the metric on almost raw data.

We only fill the missing values with the median and get rid of gross outliers.

## Fill in missing values

In [8]:
ls_condition = df_train['LifeSquare'].isna()
ls_condition.value_counts()

False    7887
True     2113
Name: LifeSquare, dtype: int64

In [9]:
h1_condition = df_train['Healthcare_1'].isna()
h1_condition.value_counts()

False    5202
True     4798
Name: Healthcare_1, dtype: int64

The LifeSquare column has 2113 missing values. We replace them with the median.

The Healthcare_1 column has 4798 missing values. We replace them with the median as well.

In [10]:
ls_med = df_train["LifeSquare"].median().round(2)
print(ls_med)

32.78


In [11]:
h1_med = df_train["Healthcare_1"].median().round(2)
print(h1_med)

900.0


In [12]:
df_train.loc[ls_condition, 'LifeSquare'] = ls_med
df_train.loc[h1_condition, 'Healthcare_1'] = h1_med
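
The same median fill can be expressed in one step with `DataFrame.fillna`, which accepts a Series of per-column fill values. This is a sketch on made-up numbers, not the notebook's own code; `toy`, `medians` and `filled` are hypothetical names, and unlike the cells above it does not round the medians.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up, not from train.csv).
toy = pd.DataFrame({
    "LifeSquare": [20.0, np.nan, 30.0, 40.0, np.nan],
    "Healthcare_1": [np.nan, 100.0, 900.0, np.nan, 1500.0],
})

# Median of each column; NaN values are skipped by default.
medians = toy[["LifeSquare", "Healthcare_1"]].median()

# Passing a Series fills each column with the value stored under its name.
filled = toy.fillna(medians)

print(medians)
print(filled)
```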

In [13]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             10000 non-null  object 
 1   DistrictId     10000 non-null  object 
 2   Rooms          10000 non-null  float64
 3   Square         10000 non-null  float64
 4   LifeSquare     10000 non-null  float64
 5   KitchenSquare  10000 non-null  float64
 6   Floor          10000 non-null  int64  
 7   HouseFloor     10000 non-null  float64
 8   HouseYear      10000 non-null  int64  
 9   Ecology_1      10000 non-null  float64
 10  Ecology_2      10000 non-null  object 
 11  Ecology_3      10000 non-null  object 
 12  Social_1       10000 non-null  int64  
 13  Social_2       10000 non-null  int64  
 14  Social_3       10000 non-null  int64  
 15  Healthcare_1   10000 non-null  float64
 16  Helthcare_2    10000 non-null  int64  
 17  Shops_1        10000 non-null  int64  
 18  Shops_2

## Get rid of outliers

In [17]:
for i in df_train.columns:
    print(f'{i}\n{df_train[i].value_counts()}\n{"-"*50}')
    print()

Id
13067    1
3636     1
11262    1
2173     1
2019     1
        ..
13703    1
15300    1
9150     1
7651     1
13351    1
Name: Id, Length: 10000, dtype: int64
--------------------------------------------------

DistrictId
27     851
1      652
23     565
6      511
9      294
      ... 
117      1
199      1
196      1
205      1
209      1
Name: DistrictId, Length: 205, dtype: int64
--------------------------------------------------

Rooms
2.0     3880
1.0     3705
3.0     2235
4.0      150
5.0       18
0.0        8
10.0       2
6.0        1
19.0       1
Name: Rooms, dtype: int64
--------------------------------------------------

Square
52.327165    1
34.785487    1
45.823093    1
57.607965    1
57.925603    1
            ..
60.776683    1
72.956943    1
51.770111    1
41.843220    1
64.226361    1
Name: Square, Length: 10000, dtype: int64
--------------------------------------------------

LifeSquare
32.780000    2113
20.151696       1
22.922376       1
23.884805       1
51.82687

**Rooms** Replace 0 with 1, and 19 with 10.

In [19]:
df_train.loc[df_train['Rooms'] == 0, 'Rooms'] = 1
df_train.loc[df_train['Rooms'] == 19, 'Rooms'] = 10
df_train['Rooms'].value_counts()

2.0     3880
1.0     3713
3.0     2235
4.0      150
5.0       18
10.0       3
6.0        1
Name: Rooms, dtype: int64

**Square** Replace everything below 10 with 10 and everything above 250 with 250.

In [20]:
df_train.loc[df_train['Square'] < 10, 'Square'] = 10
df_train.loc[df_train['Square'] > 250, 'Square'] = 250

We do roughly the same for **LifeSquare** and **KitchenSquare**.

In [21]:
df_train.loc[df_train['LifeSquare'] < 5, 'LifeSquare'] = 5
df_train.loc[df_train['LifeSquare'] > 250, 'LifeSquare'] = 250

df_train.loc[df_train['KitchenSquare'] < 2, 'KitchenSquare'] = 2
df_train.loc[df_train['KitchenSquare'] > 100, 'KitchenSquare'] = 100
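
Replacing values below a lower bound and above an upper bound is what `Series.clip` does, so the pairs of `.loc` assignments can be condensed. The sketch below uses the bounds from the cells above on a made-up toy frame; `toy`, `bounds` and `clipped` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame with one too-small, one normal and one too-large value per column.
toy = pd.DataFrame({
    "Square": [1.1, 52.5, 641.0],
    "LifeSquare": [0.4, 32.8, 7480.0],
    "KitchenSquare": [0.0, 6.0, 2014.0],
})

# (lower, upper) bounds, the same numbers as in the .loc assignments.
bounds = {
    "Square": (10, 250),
    "LifeSquare": (5, 250),
    "KitchenSquare": (2, 100),
}

clipped = toy.copy()
for col, (lower, upper) in bounds.items():
    # Values below `lower` become `lower`, values above `upper` become `upper`.
    clipped[col] = clipped[col].clip(lower=lower, upper=upper)

print(clipped)
```

Keeping the bounds in one dictionary also makes it easier to apply exactly the same limits to `df_test` later.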

**HouseYear** gets a similar fix: values are limited to the range from 1941 to 2021.

In [24]:
df_train.loc[df_train['HouseYear'] < 1941, 'HouseYear'] = 1941
df_train.loc[df_train['HouseYear'] > 2021, 'HouseYear'] = 2021

In the **Ecology_2**, **Ecology_3** and **Shops_2** columns, replace 'B' and 'A' with 1 and 0.

In [33]:
df_train.loc[df_train['Ecology_2'] == 'B', 'Ecology_2'] = 1
df_train.loc[df_train['Ecology_2'] == 'A', 'Ecology_2'] = 0

df_train.loc[df_train['Ecology_3'] == 'B', 'Ecology_3'] = 1
df_train.loc[df_train['Ecology_3'] == 'A', 'Ecology_3'] = 0

df_train.loc[df_train['Shops_2'] == 'B', 'Shops_2'] = 1
df_train.loc[df_train['Shops_2'] == 'A', 'Shops_2'] = 0
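
Assigning through `.loc` changes the values but leaves the columns with dtype object, which is why `describe(include='all')` still reports `unique`, `top` and `freq` for them. A possible alternative, shown here as a sketch on made-up data with hypothetical names (`toy`, `mapping`, `encoded`), is `Series.map` followed by an explicit integer cast.

```python
import pandas as pd

# Hypothetical toy frame with the three binary columns (values are made up).
toy = pd.DataFrame({
    "Ecology_2": ["B", "A", "B"],
    "Ecology_3": ["B", "B", "A"],
    "Shops_2": ["A", "B", "B"],
})

binary_cols = ["Ecology_2", "Ecology_3", "Shops_2"]
mapping = {"A": 0, "B": 1}

encoded = toy.copy()
for col in binary_cols:
    # map() turns any label missing from `mapping` into NaN;
    # the int64 cast would then raise, which surfaces unexpected labels early.
    encoded[col] = encoded[col].map(mapping).astype("int64")

print(encoded.dtypes)
```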

In [34]:
df_train.describe(include='all')

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
unique,10000.0,205.0,,,,,,,,,2.0,2.0,,,,,,,2.0,
top,13067.0,27.0,,,,,,,,,1.0,1.0,,,,,,,1.0,
freq,1.0,851.0,,,,,,,,,9903.0,9725.0,,,,,,,9175.0,
mean,,,1.8904,56.23025,35.498879,6.2756,8.5267,12.6094,1984.9451,0.118858,,,24.687,5352.1574,8.0392,1026.3589,1.3195,4.2313,,214138.857399
std,,,0.824534,19.395,16.531925,4.889114,5.241148,6.775974,18.208774,0.119025,,,17.532614,4006.799803,23.831875,746.662828,1.493601,4.806341,,92872.293865
min,,,1.0,10.0,5.0,2.0,1.0,0.0,1941.0,0.0,,,0.0,168.0,0.0,0.0,0.0,0.0,,59174.778028
25%,,,1.0,41.774881,25.527399,2.0,4.0,9.0,1974.0,0.017647,,,6.0,1564.0,0.0,830.0,0.0,1.0,,153872.633942
50%,,,2.0,52.51331,32.78,6.0,7.0,13.0,1977.0,0.075424,,,25.0,5285.0,2.0,900.0,1.0,3.0,,192269.644879
75%,,,2.0,65.900625,41.427234,9.0,12.0,17.0,2001.0,0.195781,,,36.0,7227.0,5.0,990.0,2.0,6.0,,249135.462171


## Train the model

### Split the dataset into train and valid

In [35]:
X = df_train.drop(columns='Price')
y = df_train['Price']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=73)
X_train.index

Int64Index([6953, 3959,  270, 9723, 8157, 9362, 7046, 8461, 6353, 9175,
            ...
            2414, 2017, 4458, 1702, 8513, 4419, 8586, 4014, 8338, 5014],
           dtype='int64', length=6700)
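
With `test_size=0.33` and 10000 rows, the split leaves 6700 rows for training and 3300 for validation, and fixing `random_state` makes it reproducible. The sketch below checks both properties on synthetic arrays; `X_toy`, `y_toy` and the other names are hypothetical stand-ins for the real features and prices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same number of rows as df_train.
X_toy = np.arange(10000).reshape(-1, 1)
y_toy = np.arange(10000)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.33, shuffle=True, random_state=73
)

# Repeating the call with the same random_state yields the same split.
X_tr2, X_va2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.33, shuffle=True, random_state=73
)

print(X_tr.shape, X_va.shape)
```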

### Build the model

We will use linear regression.

In [36]:
lr = LinearRegression()

Fit the model on the training data.

In [37]:
lr.fit(X_train, y_train)

LinearRegression()

Now that the model is trained, we can get predictions for the objects in X_valid using the .predict method:

In [39]:
y_pred = lr.predict(X_valid)

y_pred.shape

(3300,)
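
To make the fit/predict cycle concrete, here is a minimal sketch on synthetic data with a known linear relationship. The single feature, the coefficients 50000 and 3000, and the noise level are all assumptions chosen for illustration and have nothing to do with the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# One hypothetical feature (an "area") and a price that depends on it linearly plus noise.
X_toy = rng.uniform(20, 120, size=(200, 1))
y_toy = 50000 + 3000 * X_toy[:, 0] + rng.normal(0, 1000, size=200)

model = LinearRegression()
model.fit(X_toy, y_toy)           # estimates the intercept and one coefficient

pred = model.predict(X_toy[:5])   # one prediction per row, returned as a 1-D array

print(model.coef_, model.intercept_)
print(pred.shape)
```

The fitted slope should land close to the 3000 used to generate the data, and `predict` returns an array with one value per input row, just as `y_pred.shape` shows for the validation set.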

In [46]:
check_test = pd.DataFrame({
    "y_valid": y_valid,
    "y_pred": y_pred,
})

check_test

| | y_valid | y_pred |
|---|---|---|
| 1495 | 201571.982947 | 203895.833589 |
| 9273 | 245401.576583 | 254159.282403 |
| 57 | 200932.350329 | 228816.507238 |
| 7745 | 347583.008050 | 312200.887976 |
| 1937 | 136831.390023 | 136639.297038 |
| ... | ... | ... |
| 6226 | 219007.091189 | 232670.339342 |
| 371 | 214944.619599 | 242302.577330 |
| 9231 | 301800.156753 | 338728.726205 |
| 4695 | 198191.744288 | 227831.596757 |


## Compute the quality metric

In [49]:
print("R2:\t" + str(round(r2(y_valid, y_pred), 3)))

R2:	0.526
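
To see what this number measures, R2 can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below does this on four made-up prices and compares the result with `r2_score`; the arrays and variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted prices, for illustration only.
y_true = np.array([200000.0, 250000.0, 150000.0, 300000.0])
y_hat = np.array([210000.0, 240000.0, 170000.0, 280000.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# Predicting the mean for every object gives R2 = 0 by construction.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_hat), r2_mean)
```

Read this way, a score of 0.526 means the baseline removes roughly half of the squared error that a constant mean-price prediction would make on the validation set.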


Not great. But that is what a baseline is for.

# 3. Exploratory data analysis (EDA)

## Examine the target variable

### Plot 1. Distribution

### Plot 2. Classification

## Examine the features

### Correlations

### Filling in missing values

#### LifeSquare

#### Healthcare_1

### Removing outliers

## Effect of the features on the target variable

### Plot 1

### Plot 2

### Plot 3

## Creating new features

# 5. Data preprocessing

## Split the data into train and valid

## Scale the data

## Fill in missing values

## Handle outliers

## Generate new features

## Data preprocessing class:

# 6. Model training and validation

## Train the model with baseline hyperparameters

## Tune the hyperparameters

## Automated hyperparameter selection with cross-validation: GridSearchCV/RandomizedSearchCV

## Watch for overfitting; if it appears, look for better regularization parameters

## Compute the metrics

# 7. Save