![title](img/logo_white_full.png)

# Intro

---
## Welcome!

A specialist working for Quantee strives to be an excellent statistician or actuary, has outstanding programming skills, and acquires the soft skills needed to cooperate smoothly with clients in the insurance and banking sectors. This tutorial aims to improve your data science and programming knowledge. It:
* introduces you to the most important topics of **data science**,
* is oriented towards **insurance and banking data problems**, so that you can apply it in Quantee projects,
* serves as a **knowledge base** on which you can build your own code for a project.

The tutorial is divided into six parts:
1. **Data Processing Tutorial** - examples of how to perform data analysis, do feature engineering and handle missing entries.
2. **Regression Models Tutorial** - an explanation of the most common machine learning estimators used for regression problems.
3. **Hyperoptimization Tutorial** - how to tune machine learning estimators, which is crucial for model performance.
4. **Stacking Tutorial** - how to combine many machine learning estimators to improve predictive performance.
5. **Dashboards Tutorial** - how to create interactive plots and beautiful web-based applications.
6. **Deployment Tutorial** - how to deploy your dashboard application on a server, so that our clients can use a professional solution.

---
## Plan
1. Read about the **technology** we use at Quantee.
2. Fulfill the **requirements** by installing all libraries needed for your work.
3. Build your **data science knowledge base**.
     1. Go through the Udemy data science course.
     2. Go through tutorials 1-6 to apply AI to insurance data.
4. Discover **Kaggle competitions** similar to projects at Quantee.

---
## Technology
#### Python
![Python logo](img/logo_python.png)
All tutorials use Python, the leading technology at Quantee. Although we use R or MATLAB for some clients, we aim to convince clients to use Python. Why is that?
* Python is the most popular language for data science. The most important packages, such as scikit-learn or tensorflow, are written for this programming language.
* Unlike R or MATLAB, it is a fully object-oriented programming language. This allows us to write beautiful code that generalizes well.
* It is a natural choice for many data scientists coming from the IT world, as it integrates much better than R into an engineering environment.
* It is production-ready, i.e. we can easily develop and deploy a web application with a front end for our data science projects.
* Most Python libraries are MIT or BSD licensed, which offers much more flexibility in commercial use than the GPL license of many R packages.

#### Visualizations
![Plotly logo](img/logo_plotly.png)
Python has great packages dedicated to visualization, such as ```matplotlib``` or ```seaborn```. You can of course use them, as static images are good for reports. However, the really powerful library is [Plotly](https://plot.ly/python/), which lets you create interactive graphs like this:

In [1]:
from utils import plot_example_surface # Custom Utilities written for this tutorial

# Produce Plotly graph
plot_example_surface()

When building visualizations for our clients, remember to use the Quantee colors or the [Viridis](https://bhaskarvk.github.io/colormap/README-eg1-1.png) color palette. The Quantee colors are defined here:

In [None]:
colors = {'blue': '#0b578e', 'light-blue': '#e6f2ff', 'dark-blue': '#264e86', 
          'intense-blue': '#119dff', 'yellow': '#f4b400'}
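
As a minimal sketch of how the palette might be applied, the cell below builds a Plotly figure specification as a plain dictionary. The helper name `quantee_bar_spec`, the choice of which color goes where, and the numbers are assumptions made for illustration; they are not part of `utils` or of any Quantee style guide.

```python
# Quantee palette, copied from the cell above so this sketch runs on its own.
colors = {'blue': '#0b578e', 'light-blue': '#e6f2ff', 'dark-blue': '#264e86',
          'intense-blue': '#119dff', 'yellow': '#f4b400'}


def quantee_bar_spec(categories, values, title):
    """Return a Plotly figure specification (a plain dict) styled with Quantee colors.

    Hypothetical helper written for this example only.
    """
    return {
        'data': [{
            'type': 'bar',
            'x': list(categories),
            'y': list(values),
            'marker': {'color': colors['blue']},      # bars in the main Quantee blue
        }],
        'layout': {
            'title': {'text': title},
            'plot_bgcolor': colors['light-blue'],     # light background behind the bars
            'font': {'color': colors['dark-blue']},   # labels and title in dark blue
        },
    }


# Made-up numbers, for illustration only.
spec = quantee_bar_spec(['Motor', 'Property', 'Life'], [120, 80, 45],
                        'Example claims count')

# To render the figure in the notebook (requires plotly):
# import plotly.io as pio
# pio.show(spec)
```

Keeping the specification as a dictionary means the styling can be inspected and reused without importing Plotly; the commented lines show one way to display it.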

#### Dashboards and Deployment
![Dash logo](img/logo_dash.png)
To build a beautiful web application, all we need is Dash. Dash is a product of Plotly and comes with all the components required to build a modern UX application (sliders, check-boxes, buttons, etc.), together with HTML objects for creating the graphical layout of a dashboard.  
You can see [an example of a Dash application here](https://quantee-dash-yield-curves-demo.herokuapp.com/), deployed on Heroku, which we show to our clients as a demo.

#### IDE
![Spyder logo](img/logo_spyder.png)
Generally, we use the Anaconda distribution, which bundles the most important libraries (numpy, scipy, scikit-learn). It also comes with Jupyter and Spyder.

For people coming from a statistical background, Spyder is ideal, as it resembles RStudio or the MATLAB IDE.

Jupyter is good for presentations and tutorials, just like this one, but I would not advise building applications in it.

#### Databases
![PostgreSQL logo](img/logo_postgresql.png)
For Quantee products, PostgreSQL is strongly recommended. You can communicate with the database from Python via the psycopg2 library. Moreover, the Heroku environment comes with a PostgreSQL installation, so when deploying a Dash application on Heroku you can also export your local database to the Heroku server!
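
The sketch below shows one way to talk to PostgreSQL through psycopg2. It assumes the connection details arrive as a `postgres://` URL, which is how Heroku Postgres exposes them in the `DATABASE_URL` config variable. The helper names and the local credentials are placeholders invented for this example.

```python
import os
from urllib.parse import urlparse


def connection_kwargs(database_url):
    """Split a postgres://user:password@host:port/dbname URL
    into keyword arguments for psycopg2.connect."""
    parts = urlparse(database_url)
    return {
        'dbname': parts.path.lstrip('/'),
        'user': parts.username,
        'password': parts.password,
        'host': parts.hostname,
        'port': parts.port or 5432,  # 5432 is the default PostgreSQL port
    }


def fetch_postgres_version(database_url):
    """Open a connection, run a trivial query and return the server version string."""
    import psycopg2  # imported here so connection_kwargs works without the driver

    conn = psycopg2.connect(**connection_kwargs(database_url))
    try:
        with conn.cursor() as cur:
            cur.execute('SELECT version();')
            return cur.fetchone()[0]
    finally:
        conn.close()


# Placeholder credentials for a local database; replace them with your own.
LOCAL_URL = 'postgres://quantee:secret@localhost:5432/tutorial'

# Use DATABASE_URL when it is set (e.g. on Heroku), otherwise the local URL.
kwargs = connection_kwargs(os.environ.get('DATABASE_URL', LOCAL_URL))
```

Calling `fetch_postgres_version(LOCAL_URL)` only works once a PostgreSQL server with matching credentials is running, so it is left out of the cell above.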

#### Git
![Github logo](img/logo_github.png)
For version control of our code, we use Git. Currently, our Git server is [GitHub](https://github.com/quanteeai) for organizations.

---
## Requirements
* Since you were able to open this notebook, you have presumably already installed the [Anaconda distribution](https://www.anaconda.com/distribution/). Please remember to [add the Python executable folder](https://geek-university.com/python/add-python-to-the-windows-path/) to the ```PATH``` environment variable.
* Open the Anaconda Prompt and install the following packages:
    * ```$ pip install lightgbm```
    * ```$ pip install tensorflow```
    * ```$ pip install keras```
    * ```$ pip install plotly```
    * ```$ pip install dash```
    * ```$ pip install psycopg2```
* These are the basic requirements. Depending on your version of Anaconda and Python, you might encounter some errors, particularly when you try to enable GPU support for lightgbm or tensorflow. For further technical information, you can ask your colleagues.
* _Advanced_: Install [git](https://git-scm.com/download/win).
* _Advanced_: Install [heroku](https://devcenter.heroku.com/articles/heroku-cli).
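
To check that the installation worked, a short cell like the one below can help. It is only a sketch: it tests whether each package from the list above can be found by Python, without importing it, and the helper name `missing_packages` is made up for this example.

```python
from importlib.util import find_spec

# Import names of the packages installed with pip above.
REQUIRED = ['lightgbm', 'tensorflow', 'keras', 'plotly', 'dash', 'psycopg2']


def missing_packages(names):
    """Return the names that Python cannot locate in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print('Still to install: ' + ', '.join(missing))
else:
    print('All required packages are available.')
```

A package that is found can still fail at import time (for example because of a GPU driver mismatch), so treat this as a first check only.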

---
## Data Science Knowledge Base
You are encouraged to take this [excellent Udemy course](https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp/). It is not free, but ask your CEO for hints on how to get it cheaper and get a complete refund.

The Udemy course is great from a programming point of view, but it is light on statistics. To fill that gap, there are two excellent sources:
* [Introduction to Statistical Learning](https://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) - a data scientist's bible
* [Andrew Ng Neural Networks course](https://www.coursera.org/learn/neural-networks-deep-learning?specialization=deep-learning) - basic and advanced materials on neural networks

Additionally or alternatively, you can follow Tutorials 1-6, which are oriented towards real insurance data.

Other sources?
* [scikit-learn documentation](https://scikit-learn.org/stable/documentation.html)
* [Dawid Kopczyk blog](http://dkopczyk.quantee.co.uk/category/machine_learning/)
* Kaggle kernels (see the next section)

---
## Kaggle Competitions
Here is a list of Kaggle competitions that deal with problems similar to the projects carried out by Quantee:
#### Insurance
* [Claims severity prediction](https://www.kaggle.com/c/allstate-claims-severity) - in this competition you need to improve the prediction of claims severity in vehicle insurance. The competition is organized by the US insurer Allstate.
* [Underwriting](https://www.kaggle.com/c/prudential-life-insurance-assessment) - Prudential wants to quickly give a quote for their life insurance sold online. Your role is to predict the risk of an application.
* [Insurance cost prediction](https://www.kaggle.com/mirichoi0218/insurance) - this is a dataset, not a competition. With a few variables such as age, smoker, region, etc., you need to predict the cost of medical insurance.
* [Loss prediction](https://www.kaggle.com/c/liberty-mutual-fire-peril) - fire losses account for a significant portion of total property losses for corporate lines of insurance. In this challenge, your task is to predict the target, a transformed ratio of loss to total insured value.
* [Claim frequency prediction](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) - a large competition from a Brazilian insurer to predict the probability of a claim.
* [Quote conversion](https://www.kaggle.com/c/homesite-quote-conversion) - using an anonymized database of information on customer and sales activity, including property and coverage information, Homesite is challenging you to predict which customers will purchase a given quote. 
* [Claims severity prediction](https://www.kaggle.com/c/ClaimPredictionChallenge) - a fairly old competition from (again) Allstate. The goal of this competition is to better predict Bodily Injury Liability Insurance claim payments based on the characteristics of the insured customer’s vehicle.
* [Policyholder lapses](https://www.kaggle.com/c/customer-retention) - predict policyholder retention (Update: unfortunately, the competition is closed).

#### Banking
* [Credit scoring](https://www.kaggle.com/c/GiveMeSomeCredit) - improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
* [Loan default prediction](https://www.kaggle.com/c/loan-default-prediction) - this competition asks you to determine whether a loan will default, as well as the loss incurred if it does default.
* [Credit scoring](https://www.kaggle.com/c/risky-business) - improve credit risk models by predicting the probability of default on a consumer credit product in the next 18 months.

## Let's code!
OK, now that you have read the intro, it is time to dive deep into the materials and Kaggle competitions. If you find something new, feel free to commit the changes to the GitHub repo of this tutorial.
![funny kermit](img/letscode.gif)