In [154]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import os
import sys
sys.path.append(os.getcwd())  # make the local descriptive_analysis module importable
from descriptive_analysis import *



# Building an Optimal Premium Model for an Insurance Company #

## Problem description ##

We want to solve a CRM problem for an insurance company. The tasks are:

* Finding the ideal target: the people who are most likely to buy the insurance.
* Determining the premium to offer each client, that is, the optimal price for that client.
* Measuring the difference between assigning premiums at random and assigning them optimally using the information obtained from the model.
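
One way to read "optimal price" is the premium that maximizes expected revenue, i.e. the premium multiplied by the probability that the client accepts it. The sketch below illustrates that idea only. The demand curve `purchase_probability`, its coefficients, and the helper `optimal_premium` are hypothetical, and the three candidate prices are simply the minimum, median and maximum of `Premium Offered` from the summary statistics further down, not a confirmed price list.

```python
import numpy as np

def purchase_probability(premium, intercept=1.0, slope=-0.25):
    """Hypothetical logistic demand curve: acceptance falls as the premium rises."""
    return 1.0 / (1.0 + np.exp(-(intercept + slope * premium)))

def optimal_premium(candidates, prob_fn):
    """Return the candidate premium with the highest expected revenue, and that revenue."""
    candidates = np.asarray(candidates, dtype=float)
    expected_revenue = candidates * prob_fn(candidates)
    best = int(np.argmax(expected_revenue))
    return candidates[best], expected_revenue[best]

candidate_premiums = [11.12, 14.5, 21.85]  # illustrative price points (assumption)
best_premium, best_revenue = optimal_premium(candidate_premiums, purchase_probability)
```

In a real model, `prob_fn` would be replaced by the fitted classifier's predicted purchase probability for one client at each candidate premium.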


## Working with data ##

Two databases with client information are available.

The first contains information on 20,000 clients who have already been contacted; roughly 9% of them bought the product.
 
It includes important data such as the premium offered, the number of products each client has already bought, the number of years they have been clients of the company, and their socioeconomic status (a combined economic and sociological measure of a person's or family's position relative to others, based on income, education, and occupation).

The second database holds the same information for 10,000 clients who have not yet been contacted, but only 5,000 of them are going to be contacted due to mechanical restrictions.

Is it worthwhile to offer the same premium to all clients? Is it better to focus on people with certain characteristics than to choose clients at random?

In [8]:
xls = pd.ExcelFile('Database.xlsx')
variable_description = xls.parse(0)
db1 = xls.parse(1)
db2 = xls.parse(2)

In [9]:
variable_description

,Variable Name,Meaning
0,Obs,Number of Observations
1,Sales,It indicates whether the client bought a produ...
2,Price Sensitivity,It indicates the client's sensitivity to the p...
3,PhoneType,Client's phone type: Fixed or Mobile
4,Email,It indicates whether the client's email is ava...
5,Tenure,Client's tenure (year when the person became a...
6,NumberofCampaigns,Number of times the client has been called
7,ProdActive,Number of active products
8,ProdBought,Number of different products previously bought
9,Premium Offered,Premium offered to the client


In [10]:
db1.describe()



Statistic,Obs,Sales,Price Sensitivity,Email,Tenure,NumberofCampaigns,ProdActive,ProdBought,Premium Offered,Birthdate,...,Income,yearBuilt,House Insurance,Pension Plan,Credit,Savings,Number of Mobile Phones,Number of Fixed Lines,ADSL,3G Devices
count,20000.0,20000.0,1475.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,9512.0,...,14860.0,14865.0,14860.0,14860.0,14860.0,14845.0,7164.0,7164.0,7164.0,7164.0
mean,10000.5,0.08575,3.792542,0.07445,2007.11895,3.7092,0.0507,0.3202,13.831877,1966.409062,...,58021.74,1979.502657,7364.270664,37287.15,16967.86,30558.75,1.504467,1.005444,0.502233,0.503769
std,5773.647028,0.280002,1.694535,0.262508,6.715032,4.156429,0.238186,0.706397,2.774808,11.478364,...,66440.6,23.073381,8542.363258,42395.43,19457.61,78724.96,1.120473,0.817475,0.50003,0.500021
min,1.0,0.0,1.0,0.0,1990.0,2.0,0.0,0.0,11.12,1944.0,...,2190.805,1900.0,186.0138,1661.723,617.3887,0.0,0.0,0.0,0.0,0.0
25%,5000.75,0.0,,0.0,2004.0,2.0,0.0,0.0,11.12,,...,,,,,,,,,,
50%,10000.5,0.0,,0.0,2010.0,2.0,0.0,0.0,14.5,,...,,,,,,,,,,
75%,15000.25,0.0,,0.0,2012.0,4.0,0.0,0.0,14.5,,...,,,,,,,,,,
max,20000.0,1.0,6.0,1.0,2013.0,32.0,3.0,6.0,21.85,1984.0,...,4106372.0,2012.0,527866.4178,2620520.0,1202556.0,4884174.0,3.0,2.0,1.0,1.0


In [11]:
db2.describe()



Statistic,Obs,Price Sensitivity,Email,Tenure,NumberofCampaigns,ProdActive,ProdBought,Birthdate,Living Area (m^2),House Price,Income,yearBuilt,House Insurance,Pension Plan,Credit,Savings,Number of Mobile Phones,Number of Fixed Lines,ADSL,3G Devices
count,10000.0,2694.0,10000.0,10000.0,10000.0,10000.0,10000.0,4541.0,7831.0,7830.0,7831.0,7831.0,7831.0,7831.0,7831.0,7831.0,3760.0,3760.0,3760.0,3760.0
mean,5000.5,3.399777,0.0469,2011.2483,3.5714,0.0223,0.2421,1966.269764,203.41663,219885.5,60026.17,1978.753288,7648.644968,38492.67,17561.62,32296.42,1.520479,0.992287,0.509309,0.499202
std,2886.89568,1.766382,0.211435,5.019897,4.237747,0.156221,0.64802,11.59721,597.405984,311224.7,85461.63,23.229961,10987.924223,54532.66,25028.05,83283.43,1.116566,0.813359,0.49998,0.500066
min,1.0,1.0,0.0,1950.0,2.0,0.0,0.0,1944.0,0.0,7057.584,1580.315,1900.0,134.1774,1198.663,445.3454,0.0,0.0,0.0,0.0,0.0
25%,2500.75,,0.0,2012.0,2.0,0.0,0.0,,,,,,,,,,,,,
50%,5000.5,,0.0,2012.0,2.0,0.0,0.0,,,,,,,,,,,,,
75%,7500.25,,0.0,2013.0,4.0,0.0,0.0,,,,,,,,,,,,,
max,10000.0,6.0,1.0,2013.0,34.0,2.0,5.0,1984.0,22943.212,16119640.0,4426344.0,2012.0,569032.3134,2824619.0,1296269.0,2998083.0,3.0,2.0,1.0,1.0


## Descriptive analysis ##

With all the data available in the first database, it is important to carry out a complete descriptive analysis of the variables. This shows what type of information we are dealing with and gives an idea of which variables are relevant to our problem.

For each variable, we fit a separate single-variable logistic regression to assess its importance, and then keep only a subset of the features for the final model. This method carries a risk: it can miss effects that come not from individual variables but from the relationships between them.
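
The per-variable screening described above can be sketched as follows. This is a minimal illustration, not the notebook's `analyze_feature` helper, whose implementation is not shown here. The function `univariate_auc`, the choice of ROC AUC as the score, and the synthetic `toy` data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def univariate_auc(df, feature, target, train_idx, test_idx):
    """Fit a logistic regression on a single feature and return its test-set ROC AUC."""
    model = LogisticRegression()
    model.fit(df.loc[train_idx, [feature]], df.loc[train_idx, target])
    test_probs = model.predict_proba(df.loc[test_idx, [feature]])[:, 1]
    return roc_auc_score(df.loc[test_idx, target], test_probs)

# Synthetic stand-in data (assumption): one feature drives purchases, one is pure noise.
rng = np.random.default_rng(0)
n = 2000
informative = rng.normal(size=n)
noise = rng.normal(size=n)
buy_prob = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * informative)))
toy = pd.DataFrame({
    "informative": informative,
    "noise": noise,
    "Sales": (rng.random(n) < buy_prob).astype(int),
})

train_idx, test_idx = train_test_split(toy.index, test_size=0.2, random_state=42)
toy_scores = {
    feature: univariate_auc(toy, feature, "Sales", train_idx, test_idx)
    for feature in ["informative", "noise"]
}
```

A feature whose score stays close to 0.5 carries little signal on its own, although, as noted above, it might still matter in combination with other features.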

First, we split the data into train and test sets:

In [100]:
db1_train, db1_test = train_test_split(db1.index, test_size=0.2, random_state=42)
scores = {}
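
Note that the split above is applied to the index labels rather than to the rows themselves. A small sketch of how such label sets can be used to select rows follows; the `toy` frame is made up for illustration. One consequence of this design is that columns modified later on the full frame (for example by imputation) are seen by both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data standing in for db1 (assumption)
toy = pd.DataFrame({"Tenure": range(2004, 2014), "Sales": [0, 1] * 5})

# Split the index labels, not the rows
train_idx, test_idx = train_test_split(toy.index, test_size=0.2, random_state=42)

# Select the corresponding rows whenever they are needed
train_rows = toy.loc[train_idx]
test_rows = toy.loc[test_idx]
```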

### Price sensitivity ###

It indicates the client's sensitivity to price, from 1 (least sensitive) to 6 (most sensitive).

Clients with no value for the *Price Sensitivity* feature have a lower probability of buying the insurance, so we include them in the analysis with the value 7, which treats them as the most price-sensitive group.
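
A claim like this can be checked by comparing purchase rates for rows with and without a value before imputing. The sketch below uses a tiny made-up frame, so the rates it produces are illustrative and are not results from the real database.

```python
import numpy as np
import pandas as pd

# Made-up data (assumption): half of the rows have no Price Sensitivity value
toy = pd.DataFrame({
    "Price Sensitivity": [1, 2, np.nan, 6, np.nan, 3, np.nan, np.nan],
    "Sales":             [1, 0, 0,      1, 0,      1, 0,      1],
})

missing = toy["Price Sensitivity"].isna()

# Purchase rate among clients with and without a recorded value
rate_missing = toy.loc[missing, "Sales"].mean()
rate_present = toy.loc[~missing, "Sales"].mean()

# Encode "no data" as its own level, one step beyond the 1-6 scale
toy.loc[missing, "Price Sensitivity"] = 7
```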

In [173]:
# Encode missing values as 7, one step beyond the most sensitive level (6)
db1.loc[db1['Price Sensitivity'].isna(), 'Price Sensitivity'] = 7
scores['Price Sensitivity'], fig = analyze_feature('Price Sensitivity', db1, db1_train, db1_test)
iplot(fig)

### PhoneType ###

Client's phone type: Fixed or Mobile

In [176]:
scores['PhoneType'], fig = analyze_feature('PhoneType', db1, db1_train, db1_test, categorical=True)
iplot(fig)
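
A categorical feature such as this one has to be turned into numbers before a logistic regression can use it. What `analyze_feature` does when `categorical=True` is not shown here, so the following is only a sketch of one common approach, one-hot encoding with `pd.get_dummies`, on made-up data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data (assumption): category labels and outcomes are illustrative
toy = pd.DataFrame({
    "PhoneType": ["Fixed", "Mobile", "Mobile", "Fixed", "Mobile", "Fixed", "Mobile", "Fixed"],
    "Sales":     [0, 1, 1, 0, 1, 0, 0, 1],
})

# One indicator column per category
encoded = pd.get_dummies(toy["PhoneType"], prefix="PhoneType", dtype=float)

# Fit a logistic regression on the indicator columns only
model = LogisticRegression().fit(encoded, toy["Sales"])
probs = model.predict_proba(encoded)[:, 1]
```

Each coefficient in `model.coef_` then corresponds to one category, so its sign shows whether that category is associated with a higher or lower purchase probability in the toy data.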