# Task

When working with AI, there are plenty of opportunities to improve processes or suggest new ways of doing things. Because time is a scarce resource, it is often smart and efficient to start with a POC (Proof of Concept): a small demo that checks whether an idea is worth pursuing further. A POC is also something concrete, which facilitates discussions; do not underestimate the power of that.

In this example, you work at a company that sells houses, where prices are currently set manually by humans. As a Data Scientist, you can improve this process by using Machine Learning. Your task is to create a POC that you will present to your colleagues and use as a basis for discussing whether or not to continue with more detailed modelling.

Two quotes to facilitate your reflection on the value of creating a PoC: 

"*Premature optimization is the root of all evil*". 

"*Fail fast*".



**More specifically, do the following:**

1. A short EDA (Exploratory Data Analysis) of the housing data set.
- You can use the info, head and describe methods on the housing dataframe. 
2. Drop the column "ocean_proximity" so that only numeric columns remain, which simplifies the analysis. The code in the next step refers to the resulting dataframe as housing_num. Remember, this is a POC!
3. Split your data into train and test set. You can use the following code:

```python
train_set, test_set = train_test_split(housing_num, test_size=0.2, random_state=42)

X_train_pre = train_set.drop('median_house_value', axis=1)
y_train = train_set['median_house_value'].copy()

X_test_pre = test_set.drop('median_house_value', axis=1)
y_test = test_set['median_house_value'].copy()
```

4. Create a pipeline containing a SimpleImputer, `SimpleImputer(strategy="median")`, and a StandardScaler. Fit-transform your train set (X_train_pre) and call the transformed data X_train.
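
As a minimal sketch of such a pipeline, the example below runs on a tiny made-up dataframe instead of the real housing data. The values in `X_train_pre` and the step names `"imputer"` and `"std_scaler"` are assumptions chosen for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train_pre, with one missing value.
X_train_pre = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Step 1 fills missing values with the column median,
# step 2 standardizes each column to zero mean and unit variance.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform learns the medians, means and standard deviations
# from the training data and applies them in one call.
X_train = num_pipeline.fit_transform(X_train_pre)

print(X_train.shape)
print(np.isnan(X_train).any())
```

After this call the pipeline remembers the statistics it learned, so the same object can later be reused to transform the test set.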

5. Use GridSearchCV when choosing your model. You will look at a Lasso regression with different alpha values. More specifically, use the following code: 

```python
param_grid = [{'alpha': [0.1, 0.5, 1, 2, 5]}]

lasso_reg = linear_model.Lasso()

grid_search = GridSearchCV(lasso_reg, param_grid, cv=3,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(X_train, y_train)

pd.DataFrame(grid_search.cv_results_)
```
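
Because the scoring is `neg_mean_squared_error`, the scores in `cv_results_` are negative MSE values, which can be hard to read. The sketch below shows one way to convert them to RMSE. It uses synthetic data from `make_regression` as a stand-in for the housing data, and the names `X_demo`, `y_demo` and `mean_test_rmse` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only so the example runs on its own.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

param_grid = [{'alpha': [0.1, 0.5, 1, 2, 5]}]
grid_search = GridSearchCV(linear_model.Lasso(), param_grid, cv=3,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(X_demo, y_demo)

results = pd.DataFrame(grid_search.cv_results_)

# Flip the sign to get the mean MSE per alpha, then take the square root.
results["mean_test_rmse"] = np.sqrt(-results["mean_test_score"])
print(results[["param_alpha", "mean_test_rmse", "rank_test_score"]])

# The best alpha and its cross-validated RMSE.
best_rmse = np.sqrt(-grid_search.best_score_)
print(grid_search.best_params_, best_rmse)
```

Note that this is the square root of the mean MSE over the folds, which is close to, but not exactly the same as, the mean of the per-fold RMSE values.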

6. Evaluate your model on the test set using the Root Mean Squared Error as the metric. What conclusions can you draw? (Remember, you have already fitted your pipeline above, so now you only transform the test set without fitting the pipeline on it; otherwise it is "cheating".)
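
The difference between fitting and transforming can be sketched as follows. The dataframe `demo`, its coefficients and the fixed `alpha=1.0` are assumptions made only so the example runs on its own; in the exercise you would use the housing data and the best estimator from the grid search.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features and a noisy linear target.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "housing_median_age": rng.uniform(1, 50, n),
})
demo["median_house_value"] = (40000 * demo["median_income"]
                              + 500 * demo["housing_median_age"]
                              + rng.normal(0, 20000, n))
# Introduce a few missing values so the imputer has something to do.
demo.loc[demo.index % 25 == 0, "housing_median_age"] = np.nan

train_set, test_set = train_test_split(demo, test_size=0.2, random_state=42)
X_train_pre = train_set.drop("median_house_value", axis=1)
y_train = train_set["median_house_value"].copy()
X_test_pre = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

X_train = pipeline.fit_transform(X_train_pre)  # learn statistics on train only
X_test = pipeline.transform(X_test_pre)        # reuse them, no fitting here

model = linear_model.Lasso(alpha=1.0).fit(X_train, y_train)
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE: {test_rmse:.0f}")
```

Calling `fit_transform` on the test set would let information from the test data leak into the preprocessing, which makes the test score look better than it really is.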

7. If you were to present this PoC to your colleagues, these are some of the things to think about:
- What do you want to highlight/present?
- What is your conclusion?
- What could be the next step? Is the POC convincing enough, or is it not worthwhile continuing? Do we need to dig deeper into this before making any decisions?


### What is RMSE?

RMSE stands for Root Mean Squared Error.\
It is a metric for evaluating regression problems and measures how far, on average, the predictions are from the true, observed values, with large errors weighted more heavily because of the squaring.

The mathematical formula for RMSE is:

$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i-y_i)^2}$

The idea behind RMSE is simple:
- Take the difference between each prediction and its corresponding observed value: $\hat{y}_i-y_i$;    this is called the __Error__.
- We do not care whether the difference is positive or negative, so we square it: $(\hat{y}_i-y_i)^2$;   this is called the __Squared Error__.
- We compute the mean of the Squared Errors: $\frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i-y_i)^2$;    this is called the __Mean Squared Error__.
- We take the square root of the Mean Squared Error, so that the metric is on the same scale as the data and therefore easier to interpret: $\sqrt{\frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i-y_i)^2}$
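
The four steps can be traced in a few lines of NumPy. The arrays `y_true` and `y_pred` below are made-up values chosen for illustration, not taken from the housing data.

```python
import numpy as np

# Hypothetical observed values and predictions.
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

error = y_pred - y_true          # Error: [-1, 0, 2]
squared_error = error ** 2       # Squared Error: [1, 0, 4]
mse = squared_error.mean()       # Mean Squared Error: 5/3
rmse = np.sqrt(mse)              # Root Mean Squared Error

print(f"RMSE = {rmse:.4f}")      # prints "RMSE = 1.2910"
```

The result is in the same unit as the target, so for house prices an RMSE can be read directly as a typical prediction error in currency.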

# POC

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.model_selection import GridSearchCV
from sklearn import linear_model

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [3]:
# Below, set your own path where you have stored the data file. 
housing = pd.read_csv(r'C:\Users\Antonio Prgomet\Documents\ec_utbildning\kursframställning\sthlm_gbg\ml_sthlm_gbg\exercises_and_examinations\housing.csv')


In [None]:
housing

## EDA
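
A short EDA could look roughly like the sketch below. Since the real file path is machine-specific, the example uses a tiny made-up dataframe called `housing_demo`; its values are hypothetical and only mimic a few columns of the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the housing dataframe.
housing_demo = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

housing_demo.info()                   # column types and non-null counts
print(housing_demo.head())            # first rows
summary = housing_demo.describe()     # summary statistics, numeric columns only
print(summary)

# Count missing values per column, useful before choosing an imputer.
missing = housing_demo.isna().sum()
print(missing)
```

Things worth noting in the output are columns with missing values, the text column that `describe` leaves out by default, and large differences in scale between columns.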

# Drop the column "ocean_proximity"

# Splitting train & test

# Create a pipeline 

# GridSearchCV

# Evaluate model on the test set