# Evaluation criteria

The goal of this assignment is to get a view on your hands-on "data engineering" skills.  
At our company, our data scientists and engineers collaborate on projects.  
Your main focus will be creating performant & robust data flows.  
For a take-home-assignment, we cannot grant you access to our infrastructure.  
The assignment below measures your proficiency in general programming, data science & engineering tasks using Python.  
Completion should not take more than half a day.

**We expect you to be proficient in:**
 * SQL queries (Sybase IQ system)
 * ETL flows (In collaboration with existing teams)
 * General python to glue it all together
 * Python data science ecosystem (Pandas + SKlearn)
 
**In this exercise we expect you to demonstrate your ability to / knowledge of:**
 * Building a data science runtime
 * PEP8 / Google python styleguide
 * Efficiently getting the job done
 * Choosing meaningful names for variables & functions
 * Writing maintainable code (yes, you might need to document some steps)
 * Helping a data scientist present interactive results
 * Offering predictions via a REST API

# Setting-up a data science workspace

We allow you full freedom in setting up a data science runtime.  
The main objective is having a runtime where you can run this notebook and the code you will develop.  
You can choose for a local setup on your pc, or even a cloud setup if you're up for it.   

**In your environment, you will need things for:**
 * HTTPS requests
 * python3 (not python2 !!)
 * (geo)pandas
 * interactive maps (e.g. folium, altair, ...)
 * REST APIs
 
**Deliverables we expect**:
 * notebook with the completed assignment
 * list of packages for your runtime (e.g. yml or txt file)
 * evidence of a working API endpoint

# Importing packages

We would like you to put all your import statements here, together in 1 place.  
Before submitting, please make sure you remove any unused imports :-)  

In [None]:
## Working with dataframes
import pandas as pd

## Requests lets us make HTTP calls to a web server that stores the data we are interested in
import requests

## Needed for quality checks
import unittest

## NumPy is the fundamental package for scientific computing in Python (needed for astype function, ...)
import numpy as np

## Joblib is necessary to load an object from disk (eg. Trained model in our case)
import joblib

## Serializing request payloads to JSON
import json

## Map functions
import folium 

## Mathematic functions
from math import radians, cos, sin, asin, sqrt

## Geographic dataframe
import geopandas as gpd

## To download shape file
import urllib.request
import zipfile






# Data ingestion exercises

## Getting store location data from an API

**Goal:** Obtain a pandas dataframe  
**Hint:** You will need to normalise/flatten the json, because it contains multiple levels  
**API call:** https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places  
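
To illustrate the hint, here is a minimal sketch of how `pd.json_normalize` flattens nested JSON into dotted column names. The sample records are made up for illustration; only the `address.postalcode` and `geoCoordinates.*` keys mirror columns used later in this notebook, and the `name` field is a hypothetical placeholder.

```python
import pandas as pd

# Hypothetical sample records that imitate the nested structure of the API response
sample_places = [
    {
        "name": "Store A",
        "address": {"postalcode": "2000"},
        "geoCoordinates": {"latitude": 51.22, "longitude": 4.40},
    },
    {
        "name": "Store B",
        "address": {"postalcode": "9000"},
        "geoCoordinates": {"latitude": 51.05, "longitude": 3.72},
    },
]

# Nested dictionaries become columns such as "address.postalcode"
flat = pd.json_normalize(sample_places)
print(sorted(flat.columns))
```

Each nested level is joined with a dot, which is why the quality checks further down refer to columns such as `geoCoordinates.latitude`.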

In [None]:
## Below function retrieves a list from an API (JSON format) and converts it to a normalized pandas dataframe
def get_clp_places(url: str):
    shops = requests.get(url)
    return pd.json_normalize(shops.json())

## Testing the function with a list of CLP shops
df_clp = get_clp_places("https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places")
df_clp.head(10)

### Quality checks

We would like you to add several checks on this data based on these constraints:  
 * records > 200
 * latitude between 49 and 52
 * longitude between 2 and 7
 
We don't want you to create a full-blown test suite here; we're just going to use the assert methods from unittest.

In [None]:
## Quality checks
qc = unittest.TestCase('__init__')

# len() returns the number of records in our dataframe
qc.assertGreater(len(df_clp), 200, "Dataframe must contain over 200 records")
qc.assertTrue(df_clp['geoCoordinates.latitude'].between(49,52).all(), "Latitude is not between 49 and 52")
qc.assertTrue(df_clp['geoCoordinates.longitude'].between(2,7).all(), "Longitude is not between 2 and 7")

### Feature creation

Create a new column "antwerpen" which is 1 for all stores in Antwerpen (province) and 0 for all others 

In [None]:
## Creating the new column
## First astype converts the postalcode to an integer
## Second astype converts TRUE to 1 and FALSE to 0
df_clp["antwerpen"]=(df_clp['address.postalcode']
                     .astype(int)
                     .between(2000,2999)
                     .astype(int)
)

df_clp["antwerpen"].value_counts()


## Predict used car value

A data scientist in our team made a basic model to predict car prices.  
The model was saved to disk ('lgbr_cars.model') using joblib's dump functionality.  
Documentation states the model is a LightGBM Regressor, trained using the scikit-learn API.  

**As engineer, your task is to expose this model as a REST API.** 

First, retrieve the model via the function below.  
Change the path according to your setup.  

In [None]:
## Below function loads the trained model from disk (Stored with Joblib's dump functionality)
def retrieve_model(path: str):
    trained_model=joblib.load(path)
    return trained_model

## Loading the file from disk 
lgbr_cars = retrieve_model("./lgbr_cars.model")

Now that you have your trained model, let's do a functional test based on the parameters below.  
You have to present the parameters in this order.  

* vehicleType: coupe
* gearbox: manuell
* powerPS: 190
* model: NaN
* kilometer: 125000
* monthOfRegistration: 5 
* fuelType: diesel
* brand: audi

Based on these parameters, you should get a predicted value of 14026.35068804.  
However, the model doesn't accept string inputs; see the integer encoding below:

In [None]:
model_test_input = [[3,1,190,-1,125000,5,3,1]]

In [None]:
## Below function predicts used car value based on a trained model. Only the first element of the array of results is returned
def make_prediction(trained_model, single_input):
    predicted_value=trained_model.predict(single_input)[0].round(8)
    return predicted_value

predicted_value = make_prediction(lgbr_cars, model_test_input)
print(predicted_value)

Now that you've got this model up and running, we want you to **expose it as a REST API.**  
We don't expect you to set up any authentication.  
We're not looking for beautiful inputs, just make it work.  
**Building this endpoint should NOT be done in a notebook, but in proper .py file(s)**

Once it's up and running, use it to predict the following input:
* [-1,1,0,118,150000,0,1,38] ==> prediction should be 13920.70
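
As a minimal sketch of what such a `.py` file could look like, assuming Flask (suggested by the `flask run` command in the test cell) and a module named `rest_api.py`: the `/predict` route and the `instance` key follow the test call in this notebook, while `create_app`, `model_path` and the error message are hypothetical choices, not a prescribed implementation.

```python
from flask import Flask, jsonify, request


def create_app(model=None, model_path="./lgbr_cars.model"):
    """Build the Flask app; `model` is any object with a scikit-learn style predict()."""
    if model is None:
        # Imported lazily so the app can also be built with a stub model
        import joblib
        model = joblib.load(model_path)

    app = Flask(__name__)

    @app.route("/predict", methods=["GET", "POST"])
    def predict():
        # force=True parses the body as JSON even without a JSON Content-Type header
        payload = request.get_json(force=True, silent=True)
        instance = payload.get("instance") if isinstance(payload, dict) else None
        if not isinstance(instance, list) or not instance:
            return jsonify(error="expected a JSON body like {'instance': [...]}"), 400
        # The model expects a 2-D input: one row per car
        prediction = model.predict([instance])[0]
        return jsonify(prediction=round(float(prediction), 2))

    return app
```

Passing the model in as an argument keeps the endpoint testable without the real model file; when the app is started with `flask run`, the factory is called without arguments and loads the model from `model_path`.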


In [None]:
## First start the REST API as stated below from a separate Anaconda prompt
## set FLASK_APP=rest_api
## flask run

## Testing the API (the variable is not called `input`, to avoid shadowing the Python builtin)
api_test_input = [-1,1,0,118,150000,0,1,38]
response = requests.get('http://127.0.0.1:5000/predict',data=json.dumps({'instance': api_test_input}))
response.content

## Geospatial data exercise
The goal of this exercise is to read in some data from a shape file and visualize it on a map.
- The map should be dynamic. I want to zoom in and out to see more interesting aspects of the map.
- We want you to visualize the statistical sectors within a distance of 2 km of your home location.

Specific steps to take:
- Read in the shape file
- Transform to WGS coordinates
- Create a distance function (Haversine)
- Create variables for home_lat, home_lon and perimeter_distance
- Calculate centroid for each nis district
- Calculate the distance to home for each nis district centroid 
- Figure out which nis districts are near your home
- Create dynamic zoomable map
- Visualize the nis districts near you (centroid <2km away), on the map


In [None]:
## part 1: Reading in the data
## get this file from https://statbel.fgov.be/sites/default/files/files/opendata/Statistische%20sectoren/sh_statbel_statistical_sectors_20200101.shp.zip 
## The World Geodetic System (WGS) is a standard used in cartography, geodesy, and satellite navigation including GPS. 

## Download the file and unzip it
url = "https://statbel.fgov.be/sites/default/files/files/opendata/Statistische%20sectoren/sh_statbel_statistical_sectors_20200101.shp.zip"
extract_dir = "shape"

zip_path, _ = urllib.request.urlretrieve(url)
with zipfile.ZipFile(zip_path, "r") as f:
    f.extractall(extract_dir)

## Load the file in a geopandas dataframe 
df = gpd.read_file('shape/sh_statbel_statistical_sectors_20200101.shp')
df = df.to_crs(epsg=4326) # change projection to wgs84 

## Add columns with centroid latitudes/longitudes
## Centroids are computed once in an equal-area projection, then converted back to WGS84
centroids = df.to_crs('+proj=cea').centroid.to_crs(df.crs)
df['centroid_lon'] = centroids.x
df['centroid_lat'] = centroids.y

## Verification
print(df[['T_SEC_NL','centroid_lon','centroid_lat']].head(10))

In [None]:
## Let's create some variables to indicate the location of your interest  
## My home location: Waterstraat 5 Wielsbeke
home_lat = 50.91453631612504
home_lon = 3.3705045538732263
perimeter_distance = 2 # km

In [None]:
## Haversine formula - https://www.igismap.com/haversine-formula-calculate-geographic-distance-earth/
def haversine(lat1, lon1, lat2, lon2):
    R = 6372.8  # Earth radius in kilometers
    
    ## Convert latitude and longitude to radians
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    
    ## Calculate differences between the two points
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    
    ## Apply Haversine formula
    a = sin(dlat/2)**2 + cos(lat1)*cos(lat2)*sin(dlon/2)**2
    c = 2*asin(sqrt(a))
    
    ## Return distance in kilometers
    return R*c
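
As a hedged alternative to the scalar function above, the same formula can be written with NumPy so that whole columns of centroids are processed at once instead of row by row. This is a sketch only; `haversine_vectorized` and `radius_km` are hypothetical names, and the Earth radius is the same value used in the cell above.

```python
import numpy as np


def haversine_vectorized(lat1, lon1, lat2, lon2, radius_km=6372.8):
    """Great-circle distance in km; accepts scalars or NumPy arrays (degrees)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating point overshoots outside [0, 1]
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Example: distances from two points to one reference point
distances = haversine_vectorized(np.array([50.0, 52.52]), np.array([4.0, 13.405]), 50.0, 4.0)
print(distances)
```

With a dataframe, the centroid columns could be passed directly, for example `haversine_vectorized(df['centroid_lat'], df['centroid_lon'], home_lat, home_lon)`.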


Next, implement some sanity checks for your distance function 

In [None]:
# implement sanity checks here
qc_haversine = unittest.TestCase('__init__')

# distance between the same point should be zero
qc_haversine.assertEqual(haversine(home_lat, home_lon, home_lat, home_lon), 0)
# distance between Berlin and Paris should be approximately 878 km
qc_haversine.assertAlmostEqual(haversine(52.5200, 13.4050, 48.8566, 2.3522), 878, delta=10)
# distance between New York and Los Angeles should be approximately 3945 km
qc_haversine.assertAlmostEqual(haversine(40.7128, -74.0060, 34.0522, -118.2437), 3945, delta=50)



Now, create a dynamic map 

In [None]:
## First we calculate the distance to home for each nis district - we put this in a new column
df['distance_to_home'] = (df[['centroid_lon', 'centroid_lat']]
                                        .apply(lambda x: haversine(x['centroid_lat'],
                                                                   x['centroid_lon'],
                                                                   home_lat,
                                                                   home_lon),
                                        axis=1))
## Verification
print(df[['T_SEC_NL','distance_to_home']].head(10))

In [None]:
## Keep only the sectors whose centroid is less than perimeter_distance (2 km) from home - put these in a new dataframe
df_close_to_home = df[df['distance_to_home'] < perimeter_distance]

## Initialize a map on the location of my home town
my_map = folium.Map(location=[home_lat,home_lon],zoom_start=13,tiles="Cartodb dark_matter")

## add my home location to the map
folium.Circle(location=[home_lat, home_lon],
              radius=15,
              fill=True,
              color='red',
              popup='home location').add_to(my_map)

## add nis districts near my home (perimeter of less than 2 km)
df_close_to_home.apply(lambda x: folium.Marker(location=[x['centroid_lat'], x['centroid_lon']],
                                                popup=x['T_SEC_NL'])
                        .add_to(my_map), axis=1)



## Show the map on the screen
my_map
