---
# Data Science and Artificial Intelligence Practicum
## Module 5. Machine Learning
---

## 5.5 - Predicting house prices in Tashkent.

---
**CRISP-DM:**
<img src="https://i.imgur.com/dzZnnYi.png" alt="CRISP-DM" width="800"/>

---

**STEPS:**
1. Data Exploration
  1. Data Understanding
  2. Data Cleaning
  3. Analyzing Data
  4. Data Preparation
2. Pipeline for Feature Engineering
3. Modeling / Machine Learning
4. Evaluation
5. Saving the Model

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("https://github.com/anvarnarz/praktikum_datasets/blob/main/housing_data_08-02-2021.csv?raw=True")
df

Unnamed: 0,location,district,rooms,size,level,max_levels,price
0,"город Ташкент, Юнусабадский район, Юнусабад 8-...",Юнусабадский,3,57,4,4,52000
1,"город Ташкент, Яккасарайский район, 1-й тупик ...",Яккасарайский,2,52,4,5,56000
2,"город Ташкент, Чиланзарский район, Чиланзар 2-...",Чиланзарский,2,42,4,4,37000
3,"город Ташкент, Чиланзарский район, Чиланзар 9-...",Чиланзарский,3,65,1,4,49500
4,"город Ташкент, Чиланзарский район, площадь Актепа",Чиланзарский,3,70,3,5,55000
...,...,...,...,...,...,...,...
7560,"город Ташкент, Яшнободский район, Городок Авиа...",Яшнободский,1,38,5,5,24500
7561,"город Ташкент, Яшнободский район, 1-й проезд А...",Яшнободский,2,49,1,4,32000
7562,"город Ташкент, Шайхантахурский район, Зульфиях...",Шайхантахурский,2,64,3,9,40000
7563,"город Ташкент, Мирзо-Улугбекский район, Буюк И...",Мирзо-Улугбекский,1,18,1,4,11000


### Definition of columns:

- `location` - address of the house for sale
- `district` - district where the house is located
- `rooms` - number of rooms
- `size` - house area (sq.m)
- `level` - level (floor) on which the house is located
- `max_levels` - total number of levels
- `price` - price of the house

### Data Exploration | Exploratory Data Analysis

In [None]:
df.info()

We can see that there are no `NaN` values in the dataset. However, although the `size` and `price` columns hold numbers, their data type is `object`. First, we convert these columns to numeric values.
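One way to locate the entries that block such a conversion is to let pandas coerce anything it cannot parse to `NaN` and then look at the rows where that happened. The sketch below uses a small made-up frame (`toy`), not the Tashkent dataset, and is only meant to illustrate the idea.

```python
import pandas as pd

# Made-up sample: numbers stored as strings plus one unparseable entry
toy = pd.DataFrame({"size": ["57", "52", "Площадьземли:1сот", "42.5"]})

# errors="coerce" turns anything that cannot be parsed into NaN
as_number = pd.to_numeric(toy["size"], errors="coerce")

# Rows that had a value originally but failed to parse
bad_rows = toy[as_number.isna() & toy["size"].notna()]
print(bad_rows)
```

The rows in `bad_rows` can then be inspected and fixed by hand.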

#### Data Cleaning

In [None]:
size_col = np.array(df['size'], dtype='float64')
size_col

In [None]:
df[df['size']=='Площадьземли:1сот']

The `size` column contains a non-numeric value: **'Площадьземли:1сот'** (roughly *"land area: 1 sotka"*).\
The Russian word *«сотка»* denotes an area equal to `100 m²`. [Wiki reference](https://ru.wikipedia.org/wiki/%D0%A1%D0%BE%D1%82%D0%BA%D0%B0#:~:text=%D0%90%D1%80%20(%D0%B2%20%D1%80%D0%B0%D0%B7%D0%B3%D0%BE%D0%B2%D0%BE%D1%80%D0%BD%D0%BE%D0%B9%20%D1%80%D0%B5%D1%87%D0%B8%20%D1%82%D0%B0%D0%BA%D0%B6%D0%B5%20%C2%AB%D1%81%D0%BE%D1%82%D0%BA%D0%B0%C2%BB%2C%20%D0%BE%D1%82%201/100%20%D0%B3%D0%B5%D0%BA%D1%82%D0%B0%D1%80%D0%B0)%C2%A0%E2%80%94%20%D0%BC%D0%B5%D1%82%D1%80%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B0%D1%8F%20%D0%B5%D0%B4%D0%B8%D0%BD%D0%B8%D1%86%D0%B0%20%D0%B8%D0%B7%D0%BC%D0%B5%D1%80%D0%B5%D0%BD%D0%B8%D1%8F%20%D0%BF%D0%BB%D0%BE%D1%89%D0%B0%D0%B4%D0%B8%2C%20%D1%80%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F%20100%C2%A0%D0%BC%C2%B2.)\
This means that we can convert this value to a number.
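If more entries of this kind appeared, the conversion could be automated rather than patched by row index. The helper below, `parse_size`, is hypothetical and not part of the notebook. It assumes such values always end in a number followed by `сот`.

```python
import re

SOTKA_IN_SQM = 100.0  # 1 sotka = 100 square metres

def parse_size(value):
    """Return the area in m² for a plain number or a '<n>сот' entry (hypothetical helper)."""
    text = str(value).strip()
    # Look for a number (decimal point or comma allowed) followed by "сот"
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*сот", text)
    if match:
        return float(match.group(1).replace(",", ".")) * SOTKA_IN_SQM
    return float(text)

print(parse_size("Площадьземли:1сот"))  # 100.0
print(parse_size("57"))                  # 57.0
```

Applied with `df['size'].map(parse_size)`, this would convert the whole column in one step.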

In [2]:
df.loc[5347, 'size'] = 100
df.loc[[5347]]

Unnamed: 0,location,district,rooms,size,level,max_levels,price
5347,"город Ташкент, Яшнободский район, Дархон",Яшнободский,4,100,3,5,150000


In [3]:
size_col = np.array(df['size'], dtype='float64')
df['size'] = size_col

In [None]:
price_col = np.array(df['price'], dtype='float64')
price_col

In [None]:
df[df['price']=='Договорная']

The `price` column also has a non-numeric value: **'Договорная'**, which translates as *'negotiable'*. We convert it to `NaN` so that these rows can be handled later.

In [4]:
indices = df[df['price'] == 'Договорная'].index
df.loc[indices, 'price'] = np.nan
df.loc[indices]

Unnamed: 0,location,district,rooms,size,level,max_levels,price
202,"город Ташкент, Яккасарайский район, Баходыра",Яккасарайский,3,119.0,3,9,
411,"город Ташкент, Яккасарайский район, Баходыра",Яккасарайский,4,160.0,4,9,
439,"город Ташкент, Мирзо-Улугбекский район, улица ...",Мирзо-Улугбекский,3,105.0,5,6,
460,"город Ташкент, Чиланзарский район, Чиланзар 1-...",Чиланзарский,3,90.0,6,8,
507,"город Ташкент, Яшнободский район, 1-й проезд А...",Яшнободский,2,48.0,4,4,
...,...,...,...,...,...,...,...
7039,"город Ташкент, Яшнободский район, Городок Авиа...",Яшнободский,1,38.7,3,8,
7196,"город Ташкент, Чиланзарский район, Чиланзар-16",Чиланзарский,2,51.0,3,4,
7323,"город Ташкент, Мирзо-Улугбекский район, жилой ...",Мирзо-Улугбекский,6,208.0,1,7,
7403,"город Ташкент, Учтепинский район, Чиланзар 14-...",Учтепинский,2,35.0,2,9,


In [5]:
price_col = np.array(df['price'], dtype='float64')
df['price'] = price_col

In [None]:
df.info()

In [None]:
df.describe().T

We can see *outliers* in the `price` and `size` columns.

#### Visualization

In [None]:
df.hist(bins=10, figsize=[20, 15]);

In [None]:
plt.figure(figsize=[15, 7])
sns.histplot(x=df[df['price'] < 500_000]['price'], kde=True);

There are many outliers in the `price` column, so we filter them out and keep only prices from `5000` up to (but not including) `150_000`.

In [6]:
price_mask = (df['price'] >= 5000) & (df['price'] < 150_000)
df = df[price_mask]

In [None]:
plt.figure(figsize=[15, 7])
sns.histplot(x=df['price'], kde=True, bins=100);

In [None]:
plt.figure(figsize=[15, 7])
sns.histplot(x=df[df['size'] < 3000]['size'], kde=True);

We remove outliers in the `size` column, keeping only values between `10` and `155` inclusive.

In [7]:
size_mask = (df['size'] >= 10) & (df['size'] <= 155)
df = df[size_mask]
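The thresholds above were chosen by eye from the histograms. A common alternative, which is not the approach used in this notebook, is Tukey's 1.5 × IQR rule. The sketch below applies it to made-up values. The helper name `iqr_bounds` is an assumption for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sizes with one obvious outlier
sizes = pd.Series([38, 42, 49, 52, 57, 64, 65, 70, 3000])
low, high = iqr_bounds(sizes)

# Keep only the values inside the fences
kept = sizes[sizes.between(low, high)]
print(low, high, kept.tolist())
```

On heavily skewed data such as prices, the fences may still need manual adjustment.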

In [None]:
plt.figure(figsize=[15, 7])
sns.histplot(x=df['size'], kde=True, bins=70);

In [None]:
df.describe().T

In [None]:
avg_price = df.groupby('district')['price'].mean().sort_values(ascending=False)
avg_price

In [None]:
plt.figure(figsize=[16, 8])
sns.barplot(x=avg_price.index, y=avg_price)
plt.title("Average Price of Houses")
plt.xticks(rotation=15);

In [None]:
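# Restrict to numeric columns: recent pandas versions raise an error on text columns
df.select_dtypes('number').corr().style.background_gradient(cmap='Blues')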

We can see a clear positive correlation between the `price` and `size` columns.

In [None]:
plt.figure(figsize=[15, 7])
sns.scatterplot(data=df, x='size', y='price');

In [None]:
import warnings
warnings.simplefilter('ignore')

plt.figure(figsize=[18, 8])
sns.swarmplot(data=df, x='rooms', y='price');

In [None]:
plt.figure(figsize=[10, 5])
sns.histplot(x=df.rooms);

In [None]:
plt.figure(figsize=[18, 8])
sns.swarmplot(data=df, x='max_levels', y='price');

In [None]:
plt.figure(figsize=[18, 8])
sns.swarmplot(data=df, x='level', y='price');

In [None]:
df.select_dtypes('number').corrwith(df['price']).sort_values(ascending=False)

In [None]:
df.head()

### Data Preparation | Feature Engineering

**Step by Step:**

**1.** Handling Missing Values \
**2.** Exploring and Creating New Features \
**3.** Encoding Categorical Values \
**4.** Feature Scaling

Since this is just an experiment, we copy the original `df` to a temporary DataFrame to avoid unexpected modifications to it (the actual feature engineering is performed with the pipeline):

---

In [None]:
data = df.copy()
data.head()

#### Handling Missing Values

In [None]:
## Handling missing values (if we had NaN values)

# from sklearn.impute import SimpleImputer

# imputer = SimpleImputer(strategy="mean")
# num_cols = data.drop(['location', 'district'], axis=1).columns
# num_values = imputer.fit_transform(data[num_cols])
# data_num = pd.DataFrame(num_values, columns=num_cols, index=data[num_cols].index)
# data_num

We do not have any missing values. This is because the price filter applied while handling outliers in the visualization part also removed the rows with `NaN` prices.
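The reason is that any comparison with `NaN` evaluates to `False`, so a row with a missing price never passes a range mask. A minimal illustration with made-up prices:

```python
import numpy as np
import pandas as pd

# Made-up prices: one missing value and one extreme value
prices = pd.Series([52000.0, np.nan, 37000.0, 900000.0])

# Comparisons with NaN are False, so the NaN row fails the mask
mask = (prices >= 5000) & (prices < 150_000)
print(mask.tolist())              # [True, False, True, False]
print(prices[mask].isna().sum())  # 0
```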

#### Exploring New Features

We can create additional features from the current dataset in order to improve model accuracy:
- `room_size_ratio`: The ratio of the number of rooms to the size of the house. It could help identify houses that have more rooms than their size would suggest, or vice versa.

- `level_size_ratio`: The ratio of the floor the house is on (`level`) to the size of the house.

- `price_per_sqrt`: The price of the house divided by its size, i.e. the price per square meter, which can be a useful comparison metric. Because it is computed from `price`, the target, it is only available for listings whose price is already known.

- `price_per_room`: The price of the house divided by the number of rooms, i.e. the price per room. Like `price_per_sqrt`, it is derived from the target.

- `level_maxlevels_ratio`: The ratio of the floor the house is on to the total number of floors in the building. It could help identify houses located near the top or the bottom of their building.

- `district_density`: We count the number of houses in each district and divide it by the total number of houses, which gives the share of listings that fall in that district.

- `location_density`: Similar to `district_density`, but counted per `location`: the share of listings that have the same address.

In [None]:
data['room_size_ratio'] = data['rooms'] / data['size']
data['level_size_ratio'] = data['level'] / data['size']
data['price_per_sqrt'] = data['price'] / data['size']
data['price_per_room'] = data['price'] / data['rooms']
data['level_maxlevels_ratio'] = data['level'] / data['max_levels']
# Create a new column for district density
district_counts = data['district'].value_counts()
data['district_density'] = data['district'].map(district_counts) / len(data)
# Create a new column for location density
location_counts = data['location'].value_counts()
data['location_density'] = data['location'].map(location_counts) / len(data)
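When a count-based feature like `district_density` is used for modeling, the frequencies are usually learned from the training rows only and then mapped onto new rows. The sketch below shows that idea on made-up frames (`train`, `test`). It is an illustration under those assumptions, not the notebook's own procedure.

```python
import pandas as pd

# Made-up data: district "D" does not occur in the training rows
train = pd.DataFrame({"district": ["A", "A", "B", "C"]})
test = pd.DataFrame({"district": ["A", "D"]})

# Relative frequency of each district, learned from the training rows only
district_freq = train["district"].value_counts(normalize=True)

train["district_density"] = train["district"].map(district_freq)
# Districts unseen during training get a frequency of 0
test["district_density"] = test["district"].map(district_freq).fillna(0.0)
print(test)
```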

In [None]:
data.info()

In [None]:
data.select_dtypes('number').corrwith(data['price']).sort_values(ascending=False)

#### Encoding Categorical Values

In [None]:
data.head()

In [None]:
data['location'].value_counts()

This free-text address column is hard to use directly, so we remove it.

In [None]:
data.drop('location', axis=1, inplace=True)
data.head()

We only have the `district` categorical column that has to be encoded.

In [None]:
data[['district']].value_counts()

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
data_cat1hot = cat_encoder.fit_transform(data[['district']])
data_cat1hot.toarray()
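One detail worth knowing when the encoder is later applied to new data: by default `OneHotEncoder` raises an error on a category it did not see during `fit`. The sketch below, on made-up frames, shows the `handle_unknown="ignore"` option, which encodes such a value as an all-zero row instead.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"district": ["Юнусабадский", "Чиланзарский", "Юнусабадский"]})
# The second value below was not seen during fitting
new = pd.DataFrame({"district": ["Чиланзарский", "Учтепинский"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["district"]])

print(encoder.categories_[0])
encoded = encoder.transform(new[["district"]]).toarray()
print(encoded)
```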

#### Feature Scaling
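We need to bring all numeric features to a comparable scale. `StandardScaler` does this by standardizing each column to zero mean and unit variance.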

In [None]:
data.describe().T

In [None]:
data.head()

In [None]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
# Skip the first column (`district`), which is categorical
standard_scaler.fit_transform(data.iloc[:, 1:])
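To make the scaling step concrete, the sketch below checks on a small made-up array that `StandardScaler` gives the same result as subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two columns (rooms, size)
X = np.array([[1.0, 38.0], [2.0, 52.0], [3.0, 75.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: subtract the column mean, divide by the population std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, X_manual))  # True
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates only because of its units.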

### Pipeline for Feature Engineering

We can combine all the steps so far into a single pipeline that automates feature engineering:
  - Handling missing(`NaN`) values (`SimpleImputer`)
  - Encoding categorical values (`OneHotEncoder`)
  - Transformer for adding extra features
  - Scaling numeric values (`StandardScaler`)

The pipeline combines all of these steps and returns a single prepared dataset.
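As a rough, self-contained sketch of how these four steps can fit together, the example below builds such a pipeline on made-up rows. The frame `toy`, the helper `add_ratios`, and the use of `FunctionTransformer` in place of a custom transformer class are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Made-up rows with the same column names as the notebook's features
toy = pd.DataFrame({
    "district": ["A", "B", "A", "B"],
    "rooms": [2, 3, 1, 2],
    "size": [52.0, np.nan, 28.0, 75.0],
    "level": [5, 3, 2, 4],
    "max_levels": [5, 4, 4, 7],
})
num_cols = ["rooms", "size", "level", "max_levels"]

def add_ratios(X):
    """Append level / max_levels as one extra column (illustrative feature only)."""
    return np.c_[X, X[:, 2] / X[:, 3]]

numeric = make_pipeline(
    SimpleImputer(strategy="median"),  # fill NaN values
    FunctionTransformer(add_ratios),   # add the extra feature
    StandardScaler(),                  # scale numeric values
)
preprocessor = make_column_transformer(
    (numeric, num_cols),
    (OneHotEncoder(handle_unknown="ignore"), ["district"]),
)

prepared = preprocessor.fit_transform(toy)
print(prepared.shape)  # (4, 7): 4 numeric + 1 ratio + 2 one-hot columns
```

Calling `preprocessor.transform` on new rows then reuses the medians, means, and categories learned during `fit_transform`.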

#### Transformer
Transformer for adding extra features for `X_train` set.

In [8]:
df.shape

(7070, 7)

In [9]:
from sklearn.model_selection import train_test_split

# Split train and test set
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (5656, 6)
y_train: (5656,)
X_test: (1414, 6)
y_test: (1414,)


In [10]:
X_train.head()

Unnamed: 0,location,district,rooms,size,level,max_levels
5495,"город Ташкент, Учтепинский район, Чиланзар 26-...",Учтепинский,2,52.0,5,5
552,"город Ташкент, Юнусабадский район, Юнусабад 18...",Юнусабадский,2,48.0,3,4
5876,"город Ташкент, Мирабадский район, Фергана Йули",Мирабадский,1,28.0,2,4
6077,"город Ташкент, Шайхантахурский район, Самаркан...",Шайхантахурский,2,75.0,4,7
743,"город Ташкент, Яшнободский район, Карасу 5",Яшнободский,3,67.0,4,4


In [11]:
from sklearn.base import BaseEstimator, TransformerMixin

# Positions of the numeric columns passed to the transformer
rooms_ix, size_ix, level_ix, max_levels_ix = 0, 1, 2, 3

class ExtraFeaturesAdder(BaseEstimator, TransformerMixin):
    """Append ratio features computed from the numeric columns only."""

    def fit(self, X, y=None):
        return self  # stateless transformer: nothing to learn

    def transform(self, X):
        if not isinstance(X, np.ndarray):
            X = X.values

        # Price-based features (price_per_sqrt, price_per_room) are left out:
        # they are derived from the target, so they would leak it and could
        # not be computed for X_test or for new listings.
        # district_density is left out as well: the district column is not
        # passed to this transformer and is one-hot encoded separately.
        level_size_ratio = X[:, level_ix] / X[:, size_ix]
        level_maxlevels_ratio = X[:, level_ix] / X[:, max_levels_ix]
        room_size_ratio = X[:, rooms_ix] / X[:, size_ix]

        return np.c_[X, level_size_ratio, level_maxlevels_ratio, room_size_ratio]

In [12]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer, ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

num_cols = make_column_selector(dtype_include='number')
cat_col = ['district']

preprocessor_pipeline = make_column_transformer(
    (make_pipeline(ExtraFeaturesAdder(), StandardScaler()), num_cols),
    (OneHotEncoder(), cat_col)
)

In [15]:
preprocessor_pipeline.fit_transform(X_train)

array([[-0.54588418, -0.65033721,  0.59226343, ...,  0.        ,
         0.        ,  0.        ],
       [-0.54588418, -0.81498273, -0.303536  , ...,  0.        ,
         0.        ,  0.        ],
       [-1.55356072, -1.63821032, -0.75143572, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.55356072, -1.35008066, -0.75143572, ...,  0.        ,
         0.        ,  0.        ],
       [-1.55356072, -1.22659652, -0.75143572, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.46179236,  1.73702281,  0.14436372, ...,  0.        ,
         0.        ,  0.        ]])

The dataset is ready for Machine Learning!

---

### Modeling / Machine Learning

### Evaluation

### Saving the Model