# Technical Requirements

- Python (preferably 3.9)
- Jupyter
- Anaconda
- git (just for your convenience)
- Stable internet connection

# Anaconda Walkthrough

[![anaconda](media/anaconda.png)](https://www.anaconda.com/products/distribution)

    conda install notebook
    conda install nb_conda_kernels

# How to Get Started?

## Problem Definition

<img src="media/student.png"/>

### How to choose the right place to rent an apartment as a student in Poland?

![learning](media/learning.png)

## The Most Important... DATA

[![kaggle](media/kaggle.png)](https://www.kaggle.com/datasets/dawidcegielski/house-prices-in-poland)

In [4]:
# imports
import os
import json

# define city
city = 'warsaw'

In [21]:
# load data

with open(os.path.join('data', f'{city}.json'), 'r') as json_file:
    data = json.load(json_file)

In [6]:
# look for columns
columns = data.keys()
columns

dict_keys(['floor', 'price', 'rooms', 'sq', 'year'])

## Data Quality 

In [19]:
# count values

for column, values in data.items():
    print(f'{column}: {len(values)}')

floor: 8030
price: 8030
rooms: 8030
sq: 8030
year: 8030


In [8]:
# count nans

from numpy import isnan

def nans_summary(data):
    
    for column, values in data.items():
        print(f'{column}: {sum([isnan(value) for value in values])}')

In [9]:
nans_summary(data)

floor: 0
price: 835
rooms: 0
sq: 0
year: 0


In [10]:
# sign check

def negatives_summary(data):
    
    for column, values in data.items():
        print(f'{column}: {sum([value < 0 for value in values])}')

In [11]:
negatives_summary(data)

floor: 0
price: 0
rooms: 0
sq: 0
year: 0


In [12]:
# time check

def time_summary(data):
    # print the earliest and latest construction year in the data
    print(f"min year: {min(data['year'])}\nmax year: {max(data['year'])}")

In [13]:
time_summary(data)

min year: 75
max year: 2980


## Data Processing

In [22]:
# delete nans

nan_index = [i for i, price in enumerate(data['price']) if isnan(price)]
for i, index in enumerate(nan_index):
    for values in data.values():
        values.pop(index - i)

In [23]:
# check nans again
nans_summary(data)

floor: 0
price: 0
rooms: 0
sq: 0
year: 0
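
The pop-based deletion above has to correct every index by the number of rows already removed. As a minimal sketch of an alternative, the same filtering can be done in one pass by building new column lists. The helper name `keep_rows` and the toy values are assumptions for illustration and are not part of the notebook or the dataset.

```python
from math import isnan


def keep_rows(data, predicate):
    """Return a new column dict holding only the rows for which predicate(row) is true."""
    columns = list(data)
    n_rows = len(data[columns[0]]) if columns else 0
    kept = {column: [] for column in columns}
    for i in range(n_rows):
        # rebuild row i as a small dict so the predicate can use column names
        row = {column: data[column][i] for column in columns}
        if predicate(row):
            for column in columns:
                kept[column].append(row[column])
    return kept


# toy values, not taken from the dataset
toy = {'price': [500000.0, float('nan'), 620000.0], 'year': [2005, 1998, 2980]}
toy = keep_rows(toy, lambda row: not isnan(row['price']))  # drop missing prices
toy = keep_rows(toy, lambda row: row['year'] <= 2022)      # drop future years
print(toy)
```

Because nothing is removed in place, no index correction is needed and the original dict stays untouched until it is reassigned.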


In [24]:
# delete future

future_index = [i for i, year in enumerate(data['year']) if year > 2022]
for i, index in enumerate(future_index):
    for values in data.values():
        values.pop(index - i)

In [25]:
# check time again
time_summary(data)

min year: 75
max year: 2022


## Get Insights

In [26]:
# import stats

import statistics

In [28]:
# get statistics

for column, values in data.items():
    print(column)
    for stat in (statistics.mean, statistics.median, statistics.stdev):
        print(f'\t{stat.__name__}:', round(stat(values), 3))

floor
	mean: 3.284
	median: 3
	stdev: 2.815
price
	mean: 781558.535
	median: 595000.0
	stdev: 706680.291
rooms
	mean: 2.632
	median: 3
	stdev: 1.009
sq
	mean: 63.108
	median: 54.39
	stdev: 104.903
year
	mean: 1995.974
	median: 2008
	stdev: 42.18


In [29]:
# quartiles

def quartiles(data, column):
    
    for i, bucket in enumerate(statistics.quantiles(data[column])):
        print(f'\t{round((1 + i) / 4 * 100)}%: {round(bucket, 3)}')

In [30]:
quartiles(data, 'price')

	25%: 468750.0
	50%: 595000.0
	75%: 820000.0


In [31]:
quartiles(data, 'sq')

	25%: 43.0
	50%: 54.39
	75%: 71.0


In [32]:
# boundary values

def boundaries(data, column):
    # print the smallest and largest value of the given column
    print(f"min: {min(data[column])}\nmax: {max(data[column])}")

In [33]:
boundaries(data, 'price')

min: 5000.0
max: 15000000.0


In [34]:
boundaries(data, 'sq')

min: 11.0
max: 9000.0


In [35]:
# delete outliers

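def delete_outliers(data, column, percent=.1):
    # percent is the total share of rows targeted, split evenly between both tails
    n = len(data[column])
    k = int(percent / 2 * n)
    ordered = sorted(data[column])  # sort once instead of twice
    low = ordered[k]
    # ordered[-0] would be the minimum, so fall back to the maximum when k == 0
    high = ordered[-k] if k else ordered[-1]
    to_delete = [i for i, value in enumerate(data[column]) if value < low or value > high]
    # every pop shifts the remaining rows left by one, hence delete_index - i
    for i, delete_index in enumerate(to_delete):
        for values in data.values():
            values.pop(delete_index - i)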

In [36]:
delete_outliers(data, 'price')

In [38]:
delete_outliers(data, 'sq')

In [39]:
# check boundary again
boundaries(data, 'price')

min: 330000.0
max: 1800000.0


In [40]:
boundaries(data, 'sq')

min: 30.9
max: 104.0
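
To see what percentile trimming does, it may help to run it on a list small enough to check by hand. The sketch below is a simplified variant rather than the notebook's exact indexing: the hypothetical helper `trim_bounds` drops the `k` smallest and the `k` largest values symmetrically, and the prices are made up.

```python
def trim_bounds(values, percent=0.1):
    """Return (low, high) cut-offs that leave int(percent / 2 * n) values outside each tail."""
    ordered = sorted(values)
    k = int(percent / 2 * len(ordered))
    # ordered[k] is the first value kept at the bottom,
    # ordered[-(k + 1)] is the last value kept at the top (also valid for k == 0)
    return ordered[k], ordered[-(k + 1)]


# 20 made-up prices: one implausibly low, one implausibly high
toy_prices = [5000.0] + [float(p) for p in range(400000, 580000, 10000)] + [15000000.0]
low, high = trim_bounds(toy_prices, percent=0.1)
kept = [p for p in toy_prices if low <= p <= high]
print(low, high, len(kept))
```

With 20 values and `percent=0.1`, one value is cut from each end, which removes exactly the two extreme prices.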


# What Next?

## Feature Engineering

In [42]:
# correlation test
from scipy.stats import pearsonr

for column, values in data.items():
    print(f'{column}')
    results = pearsonr(data['price'], values)
    print(f'\tcorrelation: {round(results[0], 3)}')
    print(f'\tp_value: {round(results[1], 3)}')

floor
	correlation: 0.03
	p_value: 0.017
price
	correlation: 1.0
	p_value: 0.0
rooms
	correlation: 0.413
	p_value: 0.0
sq
	correlation: 0.716
	p_value: 0.0
year
	correlation: 0.015
	p_value: 0.239
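
For readers who want to see what the correlation value reported by `pearsonr` measures, here is a minimal sketch that computes the coefficient by hand with the standard library only. The function name `pearson` and the toy values are assumptions for illustration; the p-value is not reproduced here.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # sum of products of deviations (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# toy values, not taken from the dataset
sq = [30.0, 45.0, 52.0, 70.0, 95.0]
price = [350000.0, 480000.0, 560000.0, 700000.0, 990000.0]
r = pearson(sq, price)
print(round(r, 3))
```

The normalisation factors cancel between numerator and denominator, so it does not matter whether sample or population variance is used.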


## Build a Model

$$y\approx\overline{y}+\beta_1\left(x_1-\overline{x_1}\right)+\beta_2\left(x_2-\overline{x_2}\right),$$
where $\beta_i=\frac{\operatorname{cov}(y,x_i)\sigma^2_{x_j}-\operatorname{cov}(y,x_j)\operatorname{cov}(x_i,x_j)}{\sigma^2_{x_i}\sigma^2_{x_j}-\left(\operatorname{cov}(x_i,x_j)\right)^2}$ for $i\in\{1,2\}$, with $j\neq i$ denoting the other predictor.
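
The closed form above can be sketched in plain Python as follows. This is one possible implementation, not necessarily the one intended for the empty model cell; the names `covariance`, `fit_two_predictors` and `predict` are hypothetical, and the notebook does not say which two columns serve as $x_1$ and $x_2$ (`sq` and `rooms` would be natural candidates given the correlations). The covariance is written out by hand because `statistics.covariance` only exists from Python 3.10.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance with an n - 1 denominator; covariance(xs, xs) is the variance."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def fit_two_predictors(y, x1, x2):
    """Estimate the means and both betas of the two-predictor linear model."""
    var1, var2 = covariance(x1, x1), covariance(x2, x2)
    cov12 = covariance(x1, x2)
    cov_y1, cov_y2 = covariance(y, x1), covariance(y, x2)
    denom = var1 * var2 - cov12 ** 2  # zero when x1 and x2 are perfectly collinear
    return {
        'mean_y': statistics.fmean(y),
        'mean_x1': statistics.fmean(x1),
        'mean_x2': statistics.fmean(x2),
        'beta1': (cov_y1 * var2 - cov_y2 * cov12) / denom,
        'beta2': (cov_y2 * var1 - cov_y1 * cov12) / denom,
    }


def predict(model, x1, x2):
    """Apply y = mean_y + beta1 * (x1 - mean_x1) + beta2 * (x2 - mean_x2)."""
    return (model['mean_y']
            + model['beta1'] * (x1 - model['mean_x1'])
            + model['beta2'] * (x2 - model['mean_x2']))


# toy data generated from y = 100 + 2 * x1 + 3 * x2, not taken from the dataset
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [100 + 2 * a + 3 * b for a, b in zip(x1, x2)]
model = fit_two_predictors(y, x1, x2)
print(model['beta1'], model['beta2'], predict(model, 6, 5))
```

On noise-free toy data the fit should recover the generating coefficients up to floating-point error, which gives a quick sanity check before trying it on real columns.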

In [52]:
# model



## Evaluate

In [53]:
# train loss


In [54]:
# test loss
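
The notebook leaves the evaluation cells empty and does not name a loss. As a hedged sketch, the code below assumes mean squared error and a random train/test split, and uses the mean training price as a stand-in for the fitted model. The helpers `split_rows` and `mse`, the 80/20 split, the seed and the toy prices are all assumptions for illustration.

```python
import random
import statistics


def split_rows(data, test_fraction=0.2, seed=0):
    """Shuffle row indices and split a column dict into (train, test) column dicts."""
    n_rows = len(next(iter(data.values())))
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(test_fraction * n_rows)

    def pick(selected):
        return {column: [values[i] for i in selected] for column, values in data.items()}

    return pick(indices[n_test:]), pick(indices[:n_test])


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return statistics.fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


# toy values, not taken from the dataset
toy = {
    'sq': [30.0, 35.0, 42.0, 48.0, 55.0, 60.0, 68.0, 75.0, 88.0, 100.0],
    'price': [350000.0, 390000.0, 450000.0, 500000.0, 560000.0,
              610000.0, 690000.0, 760000.0, 880000.0, 990000.0],
}
train, test = split_rows(toy)

# stand-in model: always predict the mean training price
baseline = statistics.fmean(train['price'])
train_loss = mse(train['price'], [baseline] * len(train['price']))
test_loss = mse(test['price'], [baseline] * len(test['price']))
print(train_loss, test_loss)
```

A real model would replace the constant baseline with its own predictions; a test loss far above the train loss would then hint at overfitting.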


# Answer Questions

In [55]:
# what is the final cost


# Homework

[![avocado](media/avocado.jpg)](https://www.kaggle.com/datasets/neuromusic/avocado-prices)