# Sklearn Homework

In this homework you will work on 3 files that can be found in the `data/housing/*` directory:

1. `housing_train.csv` - training data
2. `housing_validation.csv` - validation data, you will evaluate your model on this data
3. `housing_example_submission.csv` - example submission of your model

You will split your work into two parts.

## Train a simple linear regression model

First, apply the following preprocessing to the whole `housing_train.csv`:

1. For each column with more than 70% `NA` values, create a column `NA_in_col_*` by following the instructions in the presentation part. **Remove the original column**.
2. Fill the remaining `NA`s in the other columns with the **median** of the respective column.
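
The two steps above could be sketched as follows. This is only a minimal illustration on a toy dataframe, not the method from the presentation: the helper name `preprocess`, the `na_threshold` parameter, and the choice to encode the indicator as 1 for a missing value and 0 otherwise are all assumptions.

```python
import numpy as np
import pandas as pd


def preprocess(df, na_threshold=0.7):
    """Hypothetical helper: NA indicators for sparse columns, median fill for the rest."""
    out = df.copy()
    na_share = out.isna().mean()  # fraction of missing values per column
    for col in na_share[na_share > na_threshold].index:
        # Assumed encoding: 1 where the original value was missing, 0 otherwise
        out[f"NA_in_col_{col}"] = out[col].isna().astype(int)
        out = out.drop(columns=col)
    # Fill the remaining gaps with each column's median
    return out.fillna(out.median(numeric_only=True))


# Toy stand-in for housing_train.csv (not the real data)
toy = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0, 5.0],
    "AGE": [np.nan, np.nan, np.nan, 50.0],  # 75% missing
    "MEDV": [20.0, 22.0, 30.0, 18.0],
})
clean = preprocess(toy)
print(clean)
```

If the same preprocessing is later applied to other data, it is usually safer to reuse the medians computed on the training data rather than recompute them.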

Train a linear regression model on the resulting dataframe to predict the `MEDV` column.
Using this model, answer the following questions:

1. Predict the price of the observation `rec` (defined below).
2. How much would the prediction change if we increased the `RM` column by 2?

```python
import pandas as pd

rec = pd.DataFrame({
    'CRIM': [3.69],
    'ZN': [11.37],
    'INDUS': [11.15],
    'CHAS': [0.07],
    'NOX': [0.87],
    'RM': [6.29],
    'AGE': [68.91],
    'DIS': [3.77],
    'RAD': [9.50],
    'TAX': [410.95],
    'PTRATIO': [18.37],
    'B': [354.47],
    'LSTAT': [0.79],
})
rec
```

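| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.69 | 11.37 | 11.15 | 0.07 | 0.87 | 6.29 | 68.91 | 3.77 | 9.5 | 410.95 | 18.37 | 354.47 | 0.79 |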
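
Both questions can be approached as in the sketch below. It uses a synthetic two-column dataset as a stand-in for the preprocessed training data, so the numbers it prints are not the homework answers; the names `rec_toy`, `shifted`, and `delta` are illustrative only. The point it shows is that for a linear model, increasing one feature by 2 changes the prediction by twice that feature's coefficient.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the preprocessed training frame (not the real data)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "RM": rng.uniform(4, 8, 50),
    "LSTAT": rng.uniform(1, 30, 50),
})
y = 5.0 * X["RM"] - 0.5 * X["LSTAT"] + rng.normal(0, 0.1, 50)

model = LinearRegression().fit(X, y)

# A single observation with the same columns as the training features
rec_toy = pd.DataFrame({"RM": [6.29], "LSTAT": [0.79]})
base = model.predict(rec_toy)[0]

# Raise RM by 2 and predict again
shifted = rec_toy.assign(RM=rec_toy["RM"] + 2)
delta = model.predict(shifted)[0] - base

# For a linear model the change equals 2 * (coefficient of RM)
rm_coef = model.coef_[list(X.columns).index("RM")]
print(delta, 2 * rm_coef)
```

In the homework itself, the columns of `rec` should match the features the model was trained on, in name and order, after preprocessing.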


## Train the best regression model you can

Now you are tasked with a real ML case.
Using the knowledge you have acquired and the [(brilliant) `scikit-learn` documentation](https://scikit-learn.org/stable/), train the best model possible!

Using the `housing_train.csv` file, train the best model you can by minimizing the **Mean Squared Error** metric, then make predictions on `housing_validation.csv`.
Save the predictions to a file with the same format as `housing_example_submission.csv` (i.e. ensure the correct column name and no index; `s.to_csv(filename, index=False)` should work).
Of course, `housing_validation.csv` doesn't contain the `MEDV` column; you have to predict it.
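
Writing the submission file might look like the sketch below. The predictions are made-up numbers, the file is written to a temporary directory, and the single column name `MEDV` is an assumption; check `housing_example_submission.csv` for the actual expected layout.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions; in the homework they would come from model.predict(...)
preds = [24.1, 19.8, 31.5]

# Assumption: the example submission has a single column named MEDV
submission = pd.DataFrame({"MEDV": preds})

# Written to a temporary directory here; use your own path in practice
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(path, index=False)  # index=False leaves out the row index

# Read the file back to confirm the layout
check = pd.read_csv(path)
print(check)
```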

After saving the results, save the model as well.
This can be done, for example, with `pickle`.
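
A minimal sketch of saving and restoring a fitted model with `pickle` is shown below. The tiny model and the file name `model.pkl` are placeholders, and the file goes to a temporary directory only to keep the example self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny placeholder model; in the homework this would be your trained model
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")  # hypothetical file name

# Save the fitted model
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that it predicts the same values
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Pickled models are best loaded with the same `scikit-learn` version that created them, and only from sources you trust.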


### Tips

When you work on filling `NA`s with the median, take a look at how I did it in the presentation.

In the second part, experiment with both preprocessing and modeling.

Remember that training on `housing_train.csv` and checking the model performance on the same dataset can be very misleading.

You probably want to split `housing_train.csv` into train and test sets (independently of `housing_validation.csv`).
Create the best model using those two datasets and only then make predictions on `housing_validation.csv`.
You may also want to read about [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html) here.
How good is your prediction? You will check it in the next lesson!
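
The split and cross-validation ideas above could be tried out roughly as follows. The data is synthetic, and the 80/20 split, the 5 folds, and the plain `LinearRegression` are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the features and target from housing_train.csv
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 0.5, 200)

# Hold out 20% of the rows as a local test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))

# scikit-learn scorers follow "higher is better", so the MSE scorer is negated
cv_mse = -cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring="neg_mean_squared_error",
)
print(test_mse, cv_mse.mean())
```

Comparing the held-out MSE with the cross-validated MSE gives a rough sense of how stable the estimate is before you predict on `housing_validation.csv`.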

Apart from `sklearn`, you can try [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn) and [lightgbm](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html); they should work as drop-in replacements.


### Context

On websites like [Kaggle](https://www.kaggle.com/) you can challenge yourself with ML tasks.
Competitions there usually look similar to this task.
You get one CSV file with training data and have to make predictions with your best model on another.

## Additional information on dataset

This is a modified Boston housing dataset, a _classic_ dataset used for learning ML.

(Not fully accurate) information on columns' content:

```
1. CRIM      per capita crime rate by town
2. ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS     proportion of non-retail business acres per town
4. CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX       nitric oxides concentration (parts per 10 million)
6. RM        average number of rooms per dwelling
7. AGE       proportion of owner-occupied units built prior to 1940
8. DIS       weighted distances to five Boston employment centres
9. RAD       index of accessibility to radial highways
10. TAX      full-value property-tax rate per $10,000
11. PTRATIO  pupil-teacher ratio by town
12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT    % lower status of the population
14. MEDV     Median value of owner-occupied homes in $1000's
```

