# Data Science project:
### 1. Formulation of the problem
### 2. Retrieving data
### 3. Data analysis
### 4. Data visualization
### 5. Data preprocessing
### 6. Generation of new features
### 7. Building the model
### 8. Quality assessment
### 9. Implementation of the model
### 10. Monitoring quality and improving the model



## 1. Importing libraries and scripts

In [1]:
import numpy as np
import pandas as pd

from scipy.stats import mode

import warnings
warnings.filterwarnings('ignore')

**Paths to directories and files**

In [2]:
DATASET_PATH = './data/housing.csv'
PREPARED_DATASET_PATH = './data/housing_prepared.csv'

## 1. Loading data
**Description of the task**

The goal is to predict the value of a house.

Why?

*In banks, insurance companies:*

* Find out the true value of the property (collateral)
* Make a decision on the issuance of a mortgage / insurance
* Set the mortgage interest rate / insurance premium

*On ad sites (Avito, Cyan, ...):*

* Find undervalued apartments (good deals) and show them to users
* Show the market value of an apartment to users
* For those who are selling an apartment, recommend the sale price

*For real estate investors:*

* Determine the market value of apartments
* Search for undervalued assets
* Real estate trading

*Dataset description*

Statistics for a number of homes in California based on the 1990 Census.

- **longitude** - longitude
- **latitude** - latitude
- **housing_median_age** - median house age
- **total_rooms** - total number of rooms
- **total_bedrooms** - total number of bedrooms
- **population** - number of residents
- **households** - number of households
- **ocean_proximity** - proximity to the ocean
- **median_income** - median income
- **median_house_value** - median house value

Read the data. Rows are observations, columns are features.

In [3]:
df = pd.read_csv(DATASET_PATH, sep=',')
df.head(5)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,id
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,1
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,2
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,3
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,4


In [4]:
df.tail(2)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,id
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND,20638
20639,-121.24,39.37,16.0,2785.0,616.0,1387.0,530.0,2.3886,89400.0,INLAND,20639


In [5]:
df.sample(frac=0.01, random_state=10)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,id
20303,-119.18,34.16,12.0,460.0,101.0,405.0,103.0,5.2783,167400.0,NEAR OCEAN,20303
16966,-122.31,37.55,,3931.0,933.0,1877.0,851.0,3.9722,354100.0,NEAR OCEAN,16966
10623,-117.77,33.67,12.0,4329.0,1068.0,1913.0,978.0,4.5094,160200.0,<1H OCEAN,10623
6146,-117.95,34.11,29.0,1986.0,448.0,2013.0,432.0,3.1034,140800.0,INLAND,6146
2208,-119.87,36.81,6.0,1891.0,341.0,969.0,330.0,4.6726,107800.0,INLAND,2208
...,...,...,...,...,...,...,...,...,...,...,...
15132,-116.93,32.83,19.0,3038.0,529.0,1463.0,509.0,3.9440,172500.0,<1H OCEAN,15132
19013,-122.01,38.36,15.0,476.0,67.0,213.0,73.0,7.1053,315200.0,INLAND,19013
2366,-119.57,36.72,11.0,2510.0,460.0,1248.0,445.0,3.6161,99500.0,INLAND,2366
5534,-118.43,33.96,38.0,1104.0,216.0,415.0,163.0,6.1985,422000.0,<1H OCEAN,5534


In [6]:
df.shape

(20640, 11)

In [7]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity', 'id'],
      dtype='object')

In [8]:
df.index

RangeIndex(start=0, stop=20640, step=1)

In [9]:
df['total_rooms']

0         880.0
1        7099.0
2        1467.0
3        1274.0
4        1627.0
          ...  
20635    1665.0
20636     697.0
20637    2254.0
20638    1860.0
20639    2785.0
Name: total_rooms, Length: 20640, dtype: float64

## 2. Data type casting


In [10]:
df.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
id                      int64
dtype: object

In [11]:
type(df.longitude)

pandas.core.series.Series

In [12]:
df['id'] = df['id'].astype('str')

In [13]:
df_num_features = df.select_dtypes(include=['float64','int64'])
df_num_features.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0
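
Casting `id` to a string is what keeps it out of the numeric selection above: `select_dtypes` filters by dtype, and a string column is not a numeric dtype. A minimal sketch on a made-up three-row frame (the values are illustrative, only the column names mirror the dataset); it uses `include="number"` as a shorter alternative to listing `float64` and `int64` explicitly:

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
    "id": [0, 1, 2],
})

# While id is an integer, it is treated as a numeric feature.
numeric_before = toy.select_dtypes(include="number").columns.tolist()

# After the cast, id is a string identifier and is no longer selected.
toy["id"] = toy["id"].astype(str)
numeric_after = toy.select_dtypes(include="number").columns.tolist()

print(numeric_before)
print(numeric_after)
```

This matters for `describe()` as well: statistics such as the mean of an identifier carry no meaning, so it helps to exclude it early.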


In [14]:
df_num_features.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,19918.0,20640.0,20433.0,20041.0,20640.0,20640.0,20640.0
mean,-119.471242,35.036934,28.65363,2635.763081,537.870553,1425.418243,499.53968,3.870671,206855.816909
std,5.041408,94.903955,12.576796,2181.615252,421.38507,1135.185798,382.329753,1.899822,115395.615874
min,-124.35,-13534.03,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,786.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1165.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1726.0,605.0,4.74325,264725.0
max,122.03,1327.13,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [15]:
df['ocean_proximity'].value_counts()

<1H OCEAN     9127
INLAND        6542
NEAR OCEAN    2655
NEAR BAY      2288
-               23
ISLAND           5
Name: ocean_proximity, dtype: int64

In [16]:
df.groupby(["ocean_proximity"]).agg({"median_house_value": ["mean", "count"]})

ocean_proximity,median_house_value (mean),median_house_value (count)
-,177352.173913,23
<1H OCEAN,240097.253424,9127
INLAND,124792.727453,6542
ISLAND,380440.0,5
NEAR BAY,259320.058566,2288
NEAR OCEAN,249505.390584,2655
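
The dictionary form of `agg` returns columns with a two-level header (`median_house_value` over `mean` and `count`). If flat column names are preferred, pandas' named aggregation is one alternative. A small sketch on made-up data; the values and the output names `mean_value` and `n_rows` are hypothetical:

```python
import pandas as pd

# Toy data with hypothetical values.
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR BAY"],
    "median_house_value": [100000.0, 120000.0, 300000.0, 350000.0, 250000.0],
})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = toy.groupby("ocean_proximity").agg(
    mean_value=("median_house_value", "mean"),
    n_rows=("median_house_value", "count"),
)
print(summary)
```

Looking at the count next to the mean helps judge how reliable each group's mean is: a category with only a handful of rows gives a noisy estimate.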


## 3. Handling missing values
What can you do with them?

1. Drop the affected rows or columns
2. Fill the gaps using some method (median, mean, etc.)
3. Optionally add indicator features that flag where values were missing
4. Do nothing
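
The first three options can be sketched on a toy column. The values below are made up; only the column name mirrors the dataset:

```python
import numpy as np
import pandas as pd

# Toy column with two gaps (hypothetical values).
toy = pd.DataFrame({"population": [322.0, np.nan, 496.0, 558.0, np.nan]})

# Option 1: drop the rows that have a missing value.
dropped = toy.dropna(subset=["population"])

# Option 3: add an indicator feature before filling, so the information
# "this value was missing" is not lost.
filled = toy.copy()
filled["population_nan"] = filled["population"].isna().astype(int)

# Option 2: replace the gaps with the median of the observed values.
filled["population"] = filled["population"].fillna(filled["population"].median())

print(len(dropped))
print(filled)
```

The indicator has to be created before the fill; afterwards the gaps can no longer be told apart from genuine values.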

In [17]:
df.isna().sum()

longitude               0
latitude                0
housing_median_age    722
total_rooms             0
total_bedrooms        207
population            599
households              0
median_income           0
median_house_value      0
ocean_proximity         0
id                      0
dtype: int64

In [18]:
df['housing_median_age_nan'] = 0

In [19]:
df.loc[df["housing_median_age"].isnull(), 'housing_median_age_nan'] = 1
df["housing_median_age_nan"].agg(['mean', 'sum'])

mean      0.034981
sum     722.000000
Name: housing_median_age_nan, dtype: float64

### Separately

In [20]:
median = df['housing_median_age'].median()
df["housing_median_age"] = df["housing_median_age"].fillna(median)

In [21]:
median = int(df["total_bedrooms"].median())
df["total_bedrooms"] = df["total_bedrooms"].fillna(median)

In [22]:
median = df['population'].median()
df["population"] = df["population"].fillna(median)

### Together

In [23]:
medians = df[['housing_median_age', 'total_bedrooms', 'population']].median()
df[['housing_median_age', 'total_bedrooms', 'population']] = \
        df[['housing_median_age', 'total_bedrooms', 'population']].fillna(medians)
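
The "together" form works because `fillna` accepts a Series of per-column fill values and matches them by column name. A hedged sketch on toy data that finds the columns with gaps instead of listing them by hand; the helper name `fill_with_medians` and the values are hypothetical:

```python
import numpy as np
import pandas as pd

def fill_with_medians(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs in numeric columns replaced by column medians."""
    result = frame.copy()
    numeric_cols = result.select_dtypes(include="number").columns
    # Keep only the numeric columns that actually contain missing values.
    cols_with_gaps = [c for c in numeric_cols if result[c].isna().any()]
    if not cols_with_gaps:
        return result
    # medians is a Series indexed by column name; fillna aligns on that index.
    medians = result[cols_with_gaps].median()
    result[cols_with_gaps] = result[cols_with_gaps].fillna(medians)
    return result

toy = pd.DataFrame({
    "housing_median_age": [41.0, np.nan, 52.0],
    "total_bedrooms": [129.0, 1106.0, np.nan],
    "households": [126.0, 1138.0, 177.0],
})
clean = fill_with_medians(toy)
print(clean)
```

Returning a copy leaves the input frame untouched, which makes it easy to compare the data before and after the fill.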

In [24]:
df.isna().sum()

longitude                 0
latitude                  0
housing_median_age        0
total_rooms               0
total_bedrooms            0
population                0
households                0
median_income             0
median_house_value        0
ocean_proximity           0
id                        0
housing_median_age_nan    0
dtype: int64

**ocean proximity**

In [25]:
df['ocean_proximity_nan'] = 0
df.loc[df['ocean_proximity'] == '-', "ocean_proximity_nan"] = 1

In [26]:
df[df["ocean_proximity_nan"] == 1].head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,id,housing_median_age_nan,ocean_proximity_nan
1153,-121.46,39.54,14.0,5549.0,1000.0,1822.0,919.0,2.9562,142300.0,-,1153,0,1
2435,-119.59,36.57,19.0,1733.0,303.0,911.0,281.0,3.5987,131700.0,-,2435,0,1
2636,-124.15,40.59,39.0,1186.0,238.0,539.0,212.0,2.0938,79600.0,-,2636,0,1
5980,-117.74,34.1,26.0,2723.0,604.0,1847.0,498.0,2.6779,136000.0,-,5980,0,1
6373,-118.02,34.15,44.0,2419.0,437.0,1045.0,432.0,3.875,280800.0,-,6373,0,1


In [27]:
df['ocean_proximity'].mode()

0    <1H OCEAN
dtype: object

In [28]:
df['ocean_proximity'].value_counts()

<1H OCEAN     9127
INLAND        6542
NEAR OCEAN    2655
NEAR BAY      2288
-               23
ISLAND           5
Name: ocean_proximity, dtype: int64

In [29]:
df['ocean_proximity'].mode()[0]

'<1H OCEAN'

In [30]:
df.replace(
    {'ocean_proximity':
      {'-': df['ocean_proximity'].mode()[0],
      }
    },
    inplace=True)

In [31]:
df['ocean_proximity'].value_counts()

<1H OCEAN     9150
INLAND        6542
NEAR OCEAN    2655
NEAR BAY      2288
ISLAND           5
Name: ocean_proximity, dtype: int64
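
An alternative, not used in this notebook, is to tell `read_csv` that `-` means "missing" for this column, so the placeholder shows up in `isna()` together with the other gaps. A sketch using an in-memory CSV with made-up rows:

```python
import io
import pandas as pd

# Hypothetical four-row file; "-" stands for an unknown category.
csv_text = """ocean_proximity,median_house_value
NEAR BAY,452600.0
-,142300.0
INLAND,84700.0
INLAND,89400.0
"""

# Passing na_values as a dict limits the rule to a single column.
toy = pd.read_csv(io.StringIO(csv_text), na_values={"ocean_proximity": ["-"]})
n_missing = int(toy["ocean_proximity"].isna().sum())

# Flag the gap, then fill it with the most frequent category.
toy["ocean_proximity_nan"] = toy["ocean_proximity"].isna().astype(int)
most_frequent = toy["ocean_proximity"].mode()[0]
toy["ocean_proximity"] = toy["ocean_proximity"].fillna(most_frequent)
print(toy)
```

Restricting `na_values` to one column avoids accidentally turning a legitimate `-` in some other text column into a missing value.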

## 4. Handling outliers
*Outliers* are observations that do not follow the general pattern of the data: abnormal values that lie far from the other observations.

What can you do with them?

1. Drop the affected rows
2. Replace the outliers using some method (median, mean, etc.)
3. Optionally add an indicator feature that flags the outliers
4. Do nothing
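
One common way to flag candidate outliers automatically is the 1.5 × IQR rule. It is not the method this notebook uses (the cells below rely on domain knowledge about valid coordinates); it is shown only as a sketch, with made-up values and a hypothetical helper name `iqr_outlier_mask`:

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Five plausible latitudes and one obviously broken value (hypothetical sample).
latitude = pd.Series([37.88, 37.86, 37.85, 34.26, 33.93, 1327.13])
mask = iqr_outlier_mask(latitude)
print(latitude[mask])
```

The rule is only a heuristic: on skewed features it can flag many legitimate values, so the flagged rows are worth inspecting before anything is dropped or replaced.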

In [32]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,housing_median_age_nan,ocean_proximity_nan
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,-119.471242,35.036934,28.665746,2635.763081,536.838857,1417.860562,499.53968,3.870671,206855.816909,0.034981,0.001114
std,5.041408,94.903955,12.355019,2181.615252,419.391878,1119.445348,382.329753,1.899822,115395.615874,0.183735,0.033364
min,-124.35,-13534.03,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0,0.0,0.0
25%,-121.8,33.93,19.0,1447.75,297.0,797.0,280.0,2.5634,119600.0,0.0,0.0
50%,-118.49,34.26,29.0,2127.0,435.0,1165.0,409.0,3.5348,179700.0,0.0,0.0
75%,-118.01,37.71,37.0,3148.0,643.25,1701.0,605.0,4.74325,264725.0,0.0,0.0
max,122.03,1327.13,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0,1.0,1.0


In [33]:
# Positive longitudes are sign errors: California lies in the western hemisphere.
df.loc[df["longitude"] > 0, 'longitude'] = df.loc[df["longitude"] > 0, 'longitude'] * -1

In [34]:
df.loc[df["longitude"] == 0, 'longitude'] = df["longitude"].median()

In [35]:
df[df['longitude'] >= 0]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,id,housing_median_age_nan,ocean_proximity_nan


In [36]:
df[(df["latitude"] <= 0) | (df["latitude"] > 50)]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,id,housing_median_age_nan,ocean_proximity_nan
8283,-118.13,-13534.03,45.0,1016.0,172.0,361.0,163.0,7.5,434500.0,NEAR OCEAN,8283,0,0
12772,-121.42,1327.13,29.0,2217.0,536.0,1203.0,507.0,1.9412,73100.0,INLAND,12772,1,0


In [37]:
df["latitude_outlier"] = 0
df.loc[(df["latitude"] <= 0) | (df["latitude"] > 50), "latitude_outlier"] = 1

In [38]:
df[(df["latitude"] <= 0) | (df["latitude"] > 50)]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,id,housing_median_age_nan,ocean_proximity_nan,latitude_outlier
8283,-118.13,-13534.03,45.0,1016.0,172.0,361.0,163.0,7.5,434500.0,NEAR OCEAN,8283,0,0,1
12772,-121.42,1327.13,29.0,2217.0,536.0,1203.0,507.0,1.9412,73100.0,INLAND,12772,1,0,1


In [39]:
df.loc[df['latitude_outlier'] == 1, "latitude"] = df["latitude"].median()

In [40]:
df[(df["latitude"] <= 0) | (df["latitude"] > 50)]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,id,housing_median_age_nan,ocean_proximity_nan,latitude_outlier
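
The flag-then-replace pattern used for latitude can be wrapped in a small function. This is a sketch under the assumption that valid values lie in a known interval; the helper name `flag_and_replace_out_of_range` and the toy values are hypothetical. Unlike the cells above, it computes the median from the in-range values only, so the extreme values cannot influence it:

```python
import pandas as pd

def flag_and_replace_out_of_range(frame, column, low, high):
    """Flag values outside (low, high] and replace them with the in-range median."""
    result = frame.copy()
    out_of_range = (result[column] <= low) | (result[column] > high)
    # Indicator feature, named like latitude_outlier in the cells above.
    result[f"{column}_outlier"] = out_of_range.astype(int)
    # Median of the valid values only.
    result.loc[out_of_range, column] = result.loc[~out_of_range, column].median()
    return result

toy = pd.DataFrame({"latitude": [37.88, -13534.03, 34.26, 1327.13, 33.93]})
fixed = flag_and_replace_out_of_range(toy, "latitude", low=0, high=50)
print(fixed)
```

The bounds `low=0` and `high=50` repeat the condition used above; for another feature they would have to come from domain knowledge about that feature.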
