# Data Challenge - Prediction of transaction claims status

Authors : Fabien ROUILLON / Vincent CHAMBRIN

This notebook contains the Python code used to predict transaction claims status as part of the data challenge proposed by PriceMinister Rakuten as presented [here](https://challengedata.ens.fr/en/challenge/39/prediction_of_transaction_claims_status.html). 

This notebook is structured as follows:
- importation of Python libraries
- loading the data
- first observations on the dataset
- in-depth study of the dataset
- claims prediction using machine learning algorithms.

Access to the dataset is reserved to people participating in the challenge; the dataset is therefore not provided in the git repository.

## Import Libraries

We will mostly use `pandas` to load and manipulate the data and several predictors from `scikit-learn` to predict claims.

In [None]:
# data analysis 
import pandas as pd
import numpy as np
import random as rnd
import re

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import interp  # scipy.interp is a deprecated alias of numpy.interp
from itertools import cycle
%matplotlib inline

# preprocessing
from sklearn.preprocessing import LabelBinarizer

# machine learning models
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
#import xgboost as xgb


# Evaluating and tuning the models
from sklearn.model_selection import train_test_split , StratifiedKFold, GridSearchCV, KFold
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix, roc_curve, auc, roc_auc_score
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import label_binarize

## Load data

In [None]:
# Acquire data
train_X = pd.read_csv('train_X.csv', sep=',')
train_Y = pd.read_csv('train_Y.csv', sep=';')
test_X = pd.read_csv('test_X.csv', sep=',')

## Statistical summaries and first observations

In [None]:
# preview the data
train_X.head(3)

In [None]:
train_Y.head(3)

In [None]:
# Merge train_X and train_Y
train_df = pd.merge(train_X, train_Y, on='ID')
train_df.head(3)

In [None]:
# ID feature is useless, we remove it from train set
train_df = train_df.drop(['ID'], axis=1)
train_df.head(3)

In [None]:
train_df.info()
print('_'*50,'\n') # prints a horizontal rule
test_X.info()

In [None]:
train_df.describe()

In [None]:
# Including only string columns
train_df.describe(include=[object])  # the np.object alias was removed in NumPy 1.24

In [None]:
# Count number of null elements in a dataframe column
def count_null(col):
    return col.isnull().sum()

# Describe a dataframe column
def describe(col):
    description = col.value_counts()
    description['NULL']=count_null(col)
    print(description, '\n')

In [None]:
# Description of each column
for col in train_df.columns.values:
    describe(train_df[col])

In [None]:
# Percentage of claims
claims = train_df.CLAIM_TYPE.value_counts()
claims = round(claims.div(claims.sum()/100),1)
claims

##### In particular, we can note that:

Most of the variables are categorical:
* Nominal variables:
    * SHIPPING_MODE (11 levels)
    * BUYER_DEPARTMENT (100 levels)
    * BUYING_DATE (10 levels)
    * SELLER_COUNTRY (39 levels)
    * SELLER_DEPARTMENT (98 levels)
    * PRODUCT_TYPE (137 levels)
    * CLAIM_TYPE (8 levels)
* Dichotomous variables:
    * WARRANTIES_FLG
    * CARD_PAYMENT
    * COUPON_PAYMENT
    * RSP_PAYMENT
    * WALLET_PAYMENT
* Ordinal variables:
    * SHIPPING_PRICE (5 levels)
    * WARRANTIES_PRICE (5 levels)
    * PRICECLUB_STATUS (5 levels)
    * PURCHASE_COUNT (6 levels)
    * SELLER_SCORE_COUNT (5 levels)
    * ITEM_PRICE (8 levels)
* Quantitative variables:
    * REGISTRATION_DATE (17 distinct values)
    * BUYER_BIRTHDAY_DATE (107 distinct values)
    * SELLER_SCORE_AVERAGE (31 distinct values)

Quantitative variables could also be considered as categorical.

Some variables have many missing values:
* WARRANTIES_PRICE : 96603 / 96.6% missing values
* SHIPPING_PRICE : 67610 / 67.6% missing values
* BUYER_BIRTHDAY_DATE : 5836 / 5.8% missing values

Dichotomous variables are not well balanced (hence the use of the weighted AUC metric):
* WARRANTIES_FLG : 96603 / 96.6% zeros
* COUPON_PAYMENT : 94271 / 94.3% zeros
* CARD_PAYMENT : 89593 / 89.6% ones (10407 zeros)
* WALLET_PAYMENT : 87045 / 87.0% zeros
* RSP_PAYMENT : 82942 / 82.9% zeros
* CLAIM_TYPE : 49977 / 50.0% '-' (no claim)

Some categorical variables have a high cardinality:
* PRODUCT_TYPE : 137 distinct values
* BUYER_DEPARTMENT 100 distinct values
* SELLER_DEPARTMENT 98 distinct values
* SELLER_COUNTRY 39 distinct values

The dataset has 3238 duplicate rows: some users may appear in several rows (cf. introduction video).
Some numerical values are discretized for anonymisation purposes, but it might still be possible to identify individuals.

Features are quite heterogeneous and complex.
Thus it could be interesting to find new socio-demographic features based on the current ones to feed our model.
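
The figures above (missing values, number of levels, duplicate rows) can be recomputed in one pass. The following is a minimal sketch on a made-up toy frame; the helper name `summarize_quality` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_quality(df):
    """Return per-column missing counts/percentages and the number of duplicated rows."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': missing,
        'missing_pct': (100 * missing / len(df)).round(1),
        'n_levels': df.nunique(),  # number of distinct non-null values
    })
    n_duplicates = int(df.duplicated().sum())
    return summary, n_duplicates

# Toy frame standing in for train_df (values are made up for illustration)
toy = pd.DataFrame({
    'SHIPPING_PRICE': ['<1', np.nan, '<1', '1<5'],
    'WARRANTIES_FLG': [False, False, False, True],
})
summary, n_duplicates = summarize_quality(toy)
print(summary)
print('duplicate rows:', n_duplicates)
```

On the real data, `summarize_quality(train_df)` would be the corresponding call.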

## In-depth dataset study

In this section we will study the dataset in greater detail. This will help us create new variables / features to train the classifiers (feature engineering).

In [None]:
# Percentage of claims for each level of categorical variable col, sorted by level importance
def claim_percentage_crosstab(col):
    df = pd.crosstab(train_df[col], train_df.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    return round(df.div(df.sum(axis=1)/100, axis=0),1)

# Plot correlation map given a dataframe df
# Note: works only on quantitative variables
def plot_correlation_map(df):
    corr = df.select_dtypes(include=['number', 'bool']).corr()  # keep numeric/boolean columns only
    _, ax = plt.subplots(figsize=(12, 10))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    _ = sns.heatmap(
        corr, 
        cmap = cmap,
        square=True, 
        cbar_kws={'shrink':.9}, 
        ax=ax, 
        annot = True, 
        annot_kws = {'fontsize':12})

In [None]:
# A correlation heat map may give us an understanding of which variables are important
plot_correlation_map(train_df)

#### SHIPPING_MODE and SHIPPING_PRICE :

In [None]:
# Percentage of each claim given SHIPPING_MODE
claim_percentage_crosstab('SHIPPING_MODE')

In [None]:
# Percentage of each claim given SHIPPING_PRICE
claim_percentage_crosstab('SHIPPING_PRICE')

In [None]:
# Link between SHIPPING_PRICE and SHIPPING_MODE
pd.crosstab(train_df.SHIPPING_MODE, train_df.SHIPPING_PRICE)

Observations:

* Unpopular delivery modes have on average more claims than the others, especially PICKUP with 38% of WITHDRAWAL claims.

* MONDIAL_RELAY_PREPAYE (54%), SUIVI (54%) and NORMAL (53%) have the highest percentages of claim-free deliveries, while RECOMMANDE (37%), PICKUP (26%) and Kiala (0%) have the lowest, and therefore the highest percentages of claims.

* NORMAL delivery has the highest percentage of NOT_RECEIVED claims (16%) but the lowest percentages of WITHDRAWAL (4%) and UNDEFINED (2%) claims.

* SUIVI has the lowest percentage of NOT_RECEIVED claims among the popular delivery modes.

* CHRONOPOST has the highest percentage of DAMAGED claims.

* SO_POINT_RELAIS, MONDIAL_RELAY and SO_RECOMMANDE have roughly the same percentages of claims.

* Paradoxically, the percentage of claims grows with the shipping price. In particular, UNDEFINED and WITHDRAWAL claims are more likely with a high shipping price.

* There is a clear correlation between SHIPPING_MODE and SHIPPING_PRICE. For example, deliveries with a SHIPPING_PRICE over 20 are more likely to be RECOMMANDE. However, prices can vary for a given delivery mode, so SHIPPING_PRICE must be kept in our model.

* As SHIPPING_MODE is not an ordinal variable, we need to turn it into dummy variables.

#### WARRANTIES_FLG and WARRANTIES_PRICE

In [None]:
# Percentage of each claim given WARRANTIES_FLG
claim_percentage_crosstab('WARRANTIES_FLG')

In [None]:
# Percentage of each claim given WARRANTIES_PRICE
claim_percentage_crosstab('WARRANTIES_PRICE')

Observations:

* Buyers who take a warranty are slightly more likely to file a claim. In particular, WITHDRAWAL claims happen more frequently when a warranty is taken. This makes sense since a right of withdrawal can be included in a warranty.

* There is no clear link between warranty prices and claims.

* As the number of null values for WARRANTIES_PRICE is equal to the number of False values for WARRANTIES_FLG, null values are taken to correspond to the absence of a warranty and can be set to 0 during mapping.
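
Equal counts alone do not prove that the null prices fall on exactly the rows without a warranty. A row-wise check is cheap; here is a minimal sketch on a toy frame (the helper name `nulls_match_flag` and the toy values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def nulls_match_flag(df, price_col='WARRANTIES_PRICE', flag_col='WARRANTIES_FLG'):
    """Return True if price_col is null exactly on the rows where flag_col is False."""
    price_missing = df[price_col].isnull()
    no_warranty = ~df[flag_col].astype(bool)
    return bool((price_missing == no_warranty).all())

# Toy frames standing in for train_df (values are made up)
consistent = pd.DataFrame({
    'WARRANTIES_FLG': [False, True, False, True],
    'WARRANTIES_PRICE': [np.nan, '<5', np.nan, '5<20'],
})
inconsistent = pd.DataFrame({
    'WARRANTIES_FLG': [False, True],
    'WARRANTIES_PRICE': ['<5', np.nan],  # same counts, but on the wrong rows
})
print(nulls_match_flag(consistent))
print(nulls_match_flag(inconsistent))
```

The second toy frame has as many nulls as False flags, yet fails the check, which is why the row-wise comparison is the safer test.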

#### PRICECLUB_STATUS

In [None]:
# Percentage of each claim given PRICECLUB_STATUS
claim_percentage_crosstab('PRICECLUB_STATUS').reindex(['UNSUBSCRIBED', 'REGULAR', 'PLATINUM', 'SILVER', 'GOLD'])

Observations:

* PRICECLUB_STATUS is linked to a number of points accumulated through actions such as selling products, referring a friend or using the PriceMinister application. With these points the customer can occasionally benefit from free purchases and gifts.

* There is no clear link between PRICECLUB_STATUS and claims.

* As the UNSUBSCRIBED level represents over 60% of PRICECLUB_STATUS values, null values can be mapped to the same value as UNSUBSCRIBED.

* PRICECLUB_STATUS can be considered an ordinal variable since there is a ranking between the status types: UNSUBSCRIBED < REGULAR < PLATINUM < SILVER < GOLD.

#### REGISTRATION_DATE and PURCHASE_COUNT

In [None]:
# Percentage of each claim given REGISTRATION_DATE
new_index = [i for i in range(2001,2018)]
df = claim_percentage_crosstab('REGISTRATION_DATE').reindex(new_index)
plot = df.plot(figsize=(10, 6));
plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

In [None]:
# Percentage of each claim given PURCHASE_COUNT
new_index = ['<5','5<20','20<50','50<100','100<500','>500']
claim_percentage_crosstab('PURCHASE_COUNT').reindex(new_index)

In [None]:
# Link between PURCHASE_COUNT and REGISTRATION_DATE
pd.crosstab(train_df.REGISTRATION_DATE, train_df.PURCHASE_COUNT).reindex(columns = new_index).plot(figsize=(10, 6));

Observations:

* The percentages of each claim type are fairly constant across registration years. However, recent users tend to complain more often than long-standing users, as the share of the '-' level decreases slightly with the year of registration.

* Experienced buyers are noticeably less likely to file WITHDRAWAL and UNDEFINED claims. Indeed, a buyer with a large number of orders is less likely to withdraw an order, being accustomed to buying online. Experienced buyers are also less likely to file a claim for damage or non-reception. This can be explained by the fact that these customers have their habits and may often order from sellers whose reliability they have already tested. Experienced buyers may also have good practices for avoiding potentially unreliable sellers, which beginners do not have. This feature will definitely be valuable for our model.

* As expected, there is a link between these two features, as buyers with a small number of orders are more likely to be recent users.

* There is also a clear gap between the least experienced users (<5 items) and the other users. For this reason, creating a new feature UNEXPERIENCED_USER could be valuable.

#### BUYER_BIRTHDAY_DATE

In [None]:
# Percentage of each claim given BUYER_BIRTHDAY_DATE
new_index = [i for i in range(1902, 1981)]
plot = claim_percentage_crosstab('BUYER_BIRTHDAY_DATE').reindex(new_index).dropna()[30:].plot(); # Taking off extreme values
plot.legend(loc='upper left', bbox_to_anchor=(1.0, 1.0))

Observations:

* We can observe a correlation between age and claims: claims tend to increase as age decreases. In particular, young people seem more likely to file NOT_RECEIVED claims, which is not obvious.

* Replacing the birthday date by the age would be simpler for our model.

* There are many null values for this feature, but using the mode or the mean would not be effective here since the levels are well balanced. The first option is to use a dynamic fill method such as ffill, which propagates the last valid observation forward to the next valid one. The second option is to use KNN on other features that may be correlated with BUYER_BIRTHDAY_DATE to fill in the missing data.
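
The second option can be sketched with scikit-learn's `KNNImputer`. This is only an illustration on a made-up matrix, not the method used later in this notebook: the choice of the companion feature, `n_neighbors=2` and the toy values are all assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: column 0 stands in for a feature correlated with age
# (e.g. an encoded PURCHASE_COUNT), column 1 for the buyer's age.
# The second row has a missing age.
X = np.array([
    [1.0, 30.0],
    [1.0, np.nan],
    [5.0, 60.0],
    [5.0, 62.0],
    [1.1, 32.0],
])

# Each missing value is replaced by the mean of that feature over the
# n_neighbors closest rows (distances use the features both rows have).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Unlike ffill, the imputed age here depends on the other features of the row rather than on the order of the rows in the file.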

In [None]:
# This code shows that not all ages are well represented.
# However, for a range of reasonable ages, we have enough individuals.
values = train_df['BUYER_BIRTHDAY_DATE'].value_counts()
pd.set_option('display.max_rows', len(values))
print(values)
pd.reset_option('display.max_rows')

In [None]:
# As we can see from the graph, age has an influence on the claims.
# However, considering each age separately might be too expensive and prone to overfitting,
# especially for birth years that are poorly represented.
# Instead, we can study things by grouping people of similar age
def group_by_age(min_threshold, max_threshold, step, birthday_series):
    from math import ceil
    age_series = (2017 - birthday_series).rename('BUYER_AGE')
    nb_group = 1 + 1 + int(ceil((max_threshold - min_threshold) / step))
    def get_group(x):
        if x <= min_threshold:
            return 0
        elif x > max_threshold:
            return nb_group-1
        x = x - min_threshold
        group = 0
        while x > 0:
            group = group + 1
            x = x - step
        return group
    return age_series.map(get_group)

In [None]:
# Turns the categorical variable CLAIM_TYPE into a binary variable
# indicating whether there is a claim (value=1) or not (value=0).
def binarize_claims(series):
    def claim_filter(c):
        return 0 if c == '-' else 1
    return series.map(claim_filter)

In [None]:
pd.crosstab(group_by_age(20, 60, 5, train_df.BUYER_BIRTHDAY_DATE), binarize_claims(train_df.CLAIM_TYPE))
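
The same grouping can be expressed with `pd.cut`, which avoids the Python-level loop. This is a sketch under two assumptions: `max_threshold - min_threshold` is a multiple of `step`, and the name `group_by_age_cut` is hypothetical. One difference with `group_by_age`: missing birthdays stay missing here, whereas the loop version silently puts them in group 0.

```python
import numpy as np
import pandas as pd

def group_by_age_cut(min_threshold, max_threshold, step, birthday_series, ref_year=2017):
    """Bin ages into (-inf, min], (min, min+step], ..., (max, +inf) and return the bin index."""
    age = ref_year - birthday_series
    edges = [-np.inf] + list(range(min_threshold, max_threshold + 1, step)) + [np.inf]
    # labels=False returns the integer index of the bin each age falls into
    return pd.cut(age, bins=edges, labels=False, right=True).rename('BUYER_AGE_GROUP')

# Birth years chosen so that ages are 18, 20, 21, 25, 26, 60, 61 and 90
birthdays = pd.Series([1999, 1997, 1996, 1992, 1991, 1957, 1956, 1927])
groups = group_by_age_cut(20, 60, 5, birthdays)
print(groups.tolist())
```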

#### BUYER_DEPARTMENT

In [None]:
# This code shows that not all departments are well represented
values = train_df['BUYER_DEPARTMENT'].value_counts()
pd.set_option('display.max_rows', len(values))
print(values)
pd.reset_option('display.max_rows')

In [None]:
pd.crosstab(train_df.BUYER_DEPARTMENT, binarize_claims(train_df.CLAIM_TYPE))

To simplify the model we might want to group buyer locations by region instead of by department.

In [None]:
# Build a mapping dictionary from departments to regions
DEPARTMENT_mapping = {}

Bretagne = [22,29,35,56]
Normandie = [14, 27, 50, 61, 76]
Hauts_De_France = [2,59,60,62,80]
Ile_De_France = [77,78,91,95]
Paris = [75,92,93,94]
Grand_Est =[8,10,51,52,54,55,57,67,68,88]
Bourgogne_France_Compte = [21,25,39,58,70,71,89,90]
Nouvelle_Aquitaine = [16,17,19,23,24,33,40,47,64,79,86,87]
Auvergne_Rhone_Alpes = [1,3,7,15,26,38,42,43,63,69,73,74]
Occitanie = [9,11,12,30,31,32,34,46,48,65,66,81,82]
Centre_Val_De_Loire = [18,28,36,37,41,45]
PACA = [4,5,6,13,83,84]
Pays_De_La_Loire = [44,49,53,72,85]
Etranger = [-1,20,97] # Departments outside mainland France go into the foreign list (20=Corse, 97=DOM/TOM)
Null = [0,96,98,99,100] # null or invalid values (departments 96 and 98 do not exist)

Regions = [Bretagne,Normandie,Hauts_De_France,Ile_De_France,Paris,Grand_Est,Bourgogne_France_Compte,Nouvelle_Aquitaine,
           Auvergne_Rhone_Alpes,Occitanie,Centre_Val_De_Loire,PACA,Pays_De_La_Loire,Etranger,Null]

Region_names = ['Bretagne','Normandie','Hauts_De_France','Ile_De_France','Paris','Grand_Est','Bourgogne_France_Compte',
                'Nouvelle_Aquitaine','Auvergne_Rhone_Alpes','Occitanie','Centre_Val_De_Loire','PACA',
                'Pays_De_La_Loire','Etranger','Null']

for ind, region in enumerate(Regions):
    for i in region:
        DEPARTMENT_mapping[i]=Region_names[ind]

In [None]:
# Percentage of each claim given BUYER_REGION
train_df['BUYER_REGION'] = train_df['BUYER_DEPARTMENT'].map(DEPARTMENT_mapping)
claim_percentage_crosstab('BUYER_REGION')

Observations:

* There is no clear correlation between BUYER_REGION and CLAIM_TYPE when the buyer lives in France. We can only notice that buyers from the Ile_De_France region (including Paris) and from PACA are slightly more likely to file NOT_RECEIVED claims.

* However, as expected, claims happen more often abroad, as delivery is more complex. Indeed, orders placed from abroad are more likely to receive NOT_RECEIVED or SELLER_CANCEL_POSTERIORI claims, while DAMAGED or DIFFERENT claims happen less frequently (this point is less obvious).

* Thus, it seems valuable to add a feature indicating whether the order was placed in France or not.

* We could also turn BUYER_REGION into dummy variables to keep information about regions.

#### BUYING_DATE

In [None]:
# Mapping BUYING_DATE into numerical variable: '3/2017' -> 3
# Using regular expression to isolate months
train_df['BUYING_DATE'] = train_df['BUYING_DATE'].apply( lambda s : int(re.findall('[0-9]*',s)[0]) )

In [None]:
# Percentage of each claim given BUYING_DATE
claim_percentage_crosstab('BUYING_DATE').reindex([i for i in range(1,11)])

In [None]:
# Number of orders for each month
train_df.BUYING_DATE.value_counts()

Observations:

* There is no clear correlation between BUYING_DATE and CLAIM_TYPE.

* We can still notice that the month with the highest number of orders (January) has the highest percentage of claims, while the month with the lowest number of orders (October) has the lowest percentage of claims.

* It is unfortunate that the data for November and December, around Christmas, are not available. These data would have brought valuable information about what might be a peak period.

* Given the previous notes, keeping only 3 levels, separating the particular months of January and October from the other months (which have roughly the same statistics), could make a more valuable feature.
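
The 3-level feature suggested in the last point could look like the sketch below. The function names and the level labels are hypothetical; only the January / October / other split comes from the observations above.

```python
import re

def buying_month(buying_date):
    """Extract the month from a raw BUYING_DATE string such as '3/2017'."""
    return int(re.match(r'\d+', buying_date).group())

def buying_period(month):
    """Collapse the month into 3 levels: January, October, and all other months."""
    if month == 1:
        return 'JANUARY'
    if month == 10:
        return 'OCTOBER'
    return 'OTHER'

for raw in ['1/2017', '3/2017', '10/2017']:
    print(raw, '->', buying_period(buying_month(raw)))
```

With pandas, `train_df['BUYING_DATE'].map(buying_period)` would apply it to the column once the months have been converted to integers.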

#### SELLER_SCORE_COUNT and SELLER_SCORE_AVERAGE

In [None]:
# Percentage of each claim given SELLER_SCORE_COUNT
claim_percentage_crosstab('SELLER_SCORE_COUNT')

In [None]:
train_df.SELLER_SCORE_AVERAGE.value_counts()

In [None]:
# Percentage of each claim given SELLER_SCORE_AVERAGE
df=claim_percentage_crosstab('SELLER_SCORE_AVERAGE').reindex([i for i in range(50,40,-1)]) # Taking off extremely low values
df

In [None]:
ax = df['-'].plot();
ax.set_ylabel("Percentage of delivery with no claim");

In [None]:
# Link between SELLER_SCORE_COUNT and SELLER_SCORE_AVERAGE
pd.crosstab(train_df.SELLER_SCORE_COUNT, train_df.SELLER_SCORE_AVERAGE[train_df.SELLER_SCORE_AVERAGE>40] )

Observations:

* Logically, the number of claims decreases as the number of products sold by the seller grows. This is true for all claim types except one: the SELLER_CANCEL_POSTERIORI percentage tends to be higher for the 100000<1000000 range of SELLER_SCORE_COUNT.

* The null values of SELLER_SCORE_COUNT will be set to 0 in the mapping, as they can correspond to sellers without any product sold.

* As for the previous feature, sellers with the highest scores are clearly the most reliable and lead to fewer claims. In particular, sellers with a score of 49, close to the maximum, have a significantly lower number of claims. Thus it could be useful to add a feature to single out these top sellers.

* However, an important point is that the best possible score, 50, is reached by only a few sellers (51, compared with 3994 for 49 and 18006 for 48), who have on average very poor reliability. We can also notice that these profiles have all sold fewer than 100 items. Thus, we can assume that some of these profiles are fake and somehow manage to get the maximum score to trick the system and gain people's trust. It could also be that these profiles have only a few sales, hence the maximum score, which is impossible to keep over a large number of sales. In any case, these profiles will be put in the same level as sellers with bad scores.

* Unsurprisingly, there is a strong correlation between SELLER_SCORE_AVERAGE and SELLER_SCORE_COUNT, as most reliable sellers have sold a lot of items.
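
One possible encoding of these remarks is a 3-level seller feature. The sketch below is hypothetical: the function name, the `low_threshold` of 40 (taken from the "extremely low values" cut used in the plots above) and the handling of missing scores are assumptions, not choices documented in this notebook.

```python
import math

def seller_score_level(score, low_threshold=40, top_score=49, max_score=50):
    """Return 2 for top sellers, 1 for regular sellers, 0 for unreliable ones."""
    # Assumption: a missing score is treated like a bad score
    if score is None or (isinstance(score, float) and math.isnan(score)):
        return 0
    # The maximum score is grouped with bad scores, as discussed above
    if score >= max_score or score <= low_threshold:
        return 0
    if score == top_score:
        return 2
    return 1

for s in [30, 45, 48, 49, 50, float('nan')]:
    print(s, '->', seller_score_level(s))
```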

#### SELLER_COUNTRY

In [None]:
claim_percentage_crosstab('SELLER_COUNTRY')[:30] # Leaving out countries with only a few deliveries

Observations:

* Most European countries have a decent percentage of claim-free deliveries. For instance: SWITZERLAND (50.1% no claim), GERMANY (60.9% no claim), BELGIUM (58.2% no claim), LUXEMBOURG (58.5% no claim).

* However, some European countries perform badly, such as UNITED KINGDOM, which has a notably high percentage of NOT_RECEIVED claims (19.2%).

* It is hard to group these countries into classes as many of them have specificities (very high percentage of WITHDRAWAL claims for ITALY; overall high percentage of no claim for UNITED STATES but very high percentage of NOT_RECEIVED claims; etc.). We therefore need to add as many dummy variables as there are countries.

We might want to create a new feature representing the distance between the seller country and buyer country.

In [None]:
# This code generates the list of all countries present in at least one dataset.
# The list is then used to build, by hand, a mapping from SELLER_COUNTRY to a distance to France:
countries = set()
for dataset in [train_df, test_X]:
    for country in dataset['SELLER_COUNTRY']:
        countries.add(country)
countries = list(countries)
print(len(countries))
print(countries)
SELLER_COUNTRY_LIST = countries

In [None]:
# We build the mapping by hand with the following rule of thumb
# France métropolitaine -> 0
# Western Europe -> 1
# Eastern Europe, North Africa -> 2
# Far countries -> 3
# Countries at the edge of the map -> 4

SELLER_COUNTRY_DISTANCE_mapping = {'ANDORRA': 0, 'AUSTRIA': 1, 'BELGIUM': 0, 'BULGARIA': 2, 'CANADA': 4, 'CHINA': 4,'CYPRUS': 2,
                                   'ESTONIA': 2,'FRANCE, METROPOLITAN': 0,'GERMANY': 1,'GIBRALTAR': 2,'GREECE': 2,'GUYANA': 3,
                                   'HONG KONG': 4,'IRELAND': 1,'ISRAEL': 3,'ITALY': 1,'JAPAN': 4,'JERSEY': 1,'LATVIA': 2,
                                   'LIECHTENSTEIN': 1,'LITHUANIA': 2,'LUXEMBOURG': 1,'MALAYSIA': 4,'MALTA': 1,'MARTINIQUE': 3,
                                   'NETHERLANDS': 1,'POLAND': 2,'PORTUGAL': 1,'ROMANIA': 2,'SINGAPORE': 4,
                                   'SLOVAKIA (Slovak Republic)': 2,'SLOVENIA': 2,'SPAIN': 1,'SWEDEN': 1,'SWITZERLAND': 1,
                                   'UNITED KINGDOM': 1,'UNITED STATES': 3,'VATICAN CITY STATE (HOLY SEE)': 1}

#### SELLER_DEPARTMENT

In [None]:
# Creating SELLER_REGION : (we use the same department mapping as for BUYER_REGION)
train_df['SELLER_REGION'] = train_df['SELLER_DEPARTMENT'].map(DEPARTMENT_mapping)

In [None]:
# Percentage of each claim given SELLER_REGION
claim_percentage_crosstab('SELLER_REGION')

Observations:

* Unlike BUYER_REGION, which showed no noticeable correlation with CLAIM_TYPE, the region the seller ships from makes some difference: orders sent from Paris (59.9% no claim) are overall more reliable than orders from Pays_De_La_Loire (34.2% no claim), for example. Thus we need to turn SELLER_REGION into dummy variables.

* What's more, it could be interesting to create an additional feature flagging orders sent and received in the same region (where SELLER_REGION = BUYER_REGION), since these orders may be more reliable on average.
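
A minimal sketch of such a same-region flag is given below. The column name `SAME_REGION` and the function name are hypothetical; merging Paris into Ile_De_France and ignoring the 'Etranger' / 'Null' pseudo-regions are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def add_same_region_flag(df):
    """Add a 0/1 column telling whether buyer and seller are in the same French region."""
    buyer = df['BUYER_REGION'].replace({'Paris': 'Ile_De_France'})
    seller = df['SELLER_REGION'].replace({'Paris': 'Ile_De_France'})
    # 'Etranger' and 'Null' are not real regions, so they never count as a match
    known = buyer.notnull() & ~buyer.isin(['Etranger', 'Null'])
    df['SAME_REGION'] = (known & (buyer == seller)).astype(int)
    return df

# Toy frame standing in for train_df (values are made up)
toy = pd.DataFrame({
    'BUYER_REGION': ['Bretagne', 'Paris', 'Etranger', 'Null', 'PACA'],
    'SELLER_REGION': ['Bretagne', 'Ile_De_France', 'Etranger', 'Null', 'Occitanie'],
})
add_same_region_flag(toy)
print(toy)
```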

We might want to go further and create a new feature representing the distance between the seller and buyer.

In [None]:
Region_names_list = ['Bretagne','Normandie','Hauts_De_France','Ile_De_France','Grand_Est','Bourgogne_France_Compte',
                'Nouvelle_Aquitaine','Auvergne_Rhone_Alpes','Occitanie','Centre_Val_De_Loire','PACA',
                'Pays_De_La_Loire']

# Distance in kms between seller region and buyer region:
distance_matrix = pd.DataFrame(np.zeros((12,12)), columns=Region_names_list, index=Region_names_list)
distance_matrix['Bretagne']=[0,311,564,349,830,617,466,738,705,302,1046,113]
distance_matrix['Normandie']=[0,0,257,136,639,444,655,595,787,241,904,387]
distance_matrix['Hauts_De_France']=[0,0,0,225,525,502,800,692,895,348,1001,600]
distance_matrix['Ile_De_France']=[0,0,0,0,492,315,584,466,679,133,775,385]
distance_matrix['Grand_Est']=[0,0,0,0,0,330,969,493,971,587,802,865]
distance_matrix['Bourgogne_France_Compte']=[0,0,0,0,0,0,671,195,673,315,504,639]
distance_matrix['Nouvelle_Aquitaine']=[0,0,0,0,0,0,0,556,246,468,646,353]
distance_matrix['Auvergne_Rhone_Alpes']=[0,0,0,0,0,0,0,0,537,466,314,685]
distance_matrix['Occitanie']=[0,0,0,0,0,0,0,0,0,555,404,585]
distance_matrix['Centre_Val_De_Loire']=[0,0,0,0,0,0,0,0,0,0,758,335]
distance_matrix['PACA']=[0,0,0,0,0,0,0,0,0,0,0,986]
distance_matrix['Pays_De_La_Loire']=[0,0,0,0,0,0,0,0,0,0,0,0]

mat = distance_matrix.values
mat = mat + np.transpose(mat)
distance_matrix = pd.DataFrame(mat, columns=Region_names_list, index=Region_names_list)
distance_matrix

In [None]:
# We do some rescaling
distance_matrix = distance_matrix.apply(lambda x: round(x/100,0)).astype(int)
distance_matrix

In [None]:
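def create_SELLER_BUYER_REGION_DISTANCE(df):
    def region_distance(buyer_region, seller_region):
        # Paris is merged with Ile_De_France, the region present in distance_matrix
        if buyer_region == "Paris":
            buyer_region = "Ile_De_France"
        if seller_region == "Paris":
            seller_region = "Ile_De_France"
        if buyer_region == "Etranger" or seller_region == "Etranger":
            return 11  # abroad: larger than any value of the rescaled matrix
        if buyer_region == "Null" or seller_region == "Null":
            return 5   # unknown department: arbitrary mid-range value
        return distance_matrix.loc[buyer_region, seller_region]

    # Iterate over values rather than labels so that the function
    # also works when df does not have a default 0..n-1 index
    df['SELLER_BUYER_REGION_DISTANCE'] = [
        region_distance(buyer, seller)
        for buyer, seller in zip(df['BUYER_REGION'], df['SELLER_REGION'])
    ]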

#### PRODUCT_TYPE and PRODUCT_FAMILY

In [None]:
# Percentage of each claim given PRODUCT_TYPE
claim_percentage_crosstab('PRODUCT_TYPE')[:40]

In [None]:
# Percentage of each claim given PRODUCT_FAMILY
claim_percentage_crosstab('PRODUCT_FAMILY')

Observations:

* As PRODUCT_FAMILY already condenses most of the information about products, we keep only a few labels from PRODUCT_TYPE which stand out from the others and are not already in PRODUCT_FAMILY (Books is in both, for example). Among the PRODUCT_TYPE labels we can keep: PLAY CARDS (83.9% no claim), CD (60.5% no claim), COSMETIC (37.4% no claim), CELLPHONE (31.2% no claim).

* ELECTRONICS are more likely to lead to DAMAGED claims. Indeed, electronic devices such as televisions are more fragile and can be damaged during transport. On the contrary, wine products are unlikely to be damaged, probably thanks to the special care and protection provided for transport, since wines are luxury products.

#### ITEM_PRICE

In [None]:
describe(train_df['ITEM_PRICE'])

In [None]:
# Percentage of each claim given ITEM_PRICE
claim_percentage_crosstab('ITEM_PRICE').reindex(['<10','10<20','20<50','50<100','100<500','500<1000','1000<5000','>5000'])

Observations:

* Cheap products are less likely to lead to claims.

* Pricey products are more likely to lead to WITHDRAWAL/UNDEFINED claims.

* Cheap products are more likely to lead to FAKE/NOT_RECEIVED claims.

* Products in the 100<500 range lead to more DAMAGED claims on average.
This corresponds to the price range of electronics, which are more likely to lead to DAMAGED claims as we just saw.

## Feature engineering utilities

The following functions will be used to create new variables and remove some others.

In [None]:
# Maps a categorical variable 'col' given a dictionary 'mapping'.
# For example, calling this function with col = 'SHIPPING_PRICE' and 
# the mapping {"<1": 1, "1<5": 2, "5<10": 3, "10<20": 4, ">20": 5, 0.0:0} 
# will change all rows having value "<1" to 1, "1<5" to 2, etc.
# Null values are first replaced by 0, so the mapping needs a 0 key when the column has nulls.
def categorical_mapping(dataset, col, mapping):
    dataset[col] = dataset[col].fillna(0).map(mapping).astype(int)
def categorical_mapping_continuous(dataset, col, mapping):
    dataset[col] = dataset[col].fillna(0).map(mapping).astype(float)

# Turns a categorical variable into dummy variables.
# For example calling this with col = 'SHIPPING_MODE'
# will replace the 'SHIPPING_MODE' column with new columns 
# named 'SHIPPING_MODE_MONDIAL_RELAY_PREPAYE', 'SHIPPING_MODE_NORMAL'
# 'SHIPPING_MODE_PICKUP', 'SHIPPING_MODE_RECOMMANDE', etc...
# which values are all zero save for one.
def categorical_to_dummy(dataset, col, dummy_na=False):
    dummy = pd.get_dummies(dataset[col], prefix=col, dummy_na=dummy_na)
    for column in dummy.columns.values:
        dataset[column] = dummy[column]
    del dataset[col]

In [None]:
# Some categorical variables have levels that are not represented in both the 
# training and testing sets (or not in both the training and validation sets).
# We group these levels into 'OTHER' before turning the variables into dummies
def SELLER_COUNTRY_map(c):
    delete = ['IRELAND','LIECHTENSTEIN','SLOVAKIA (Slovak Republic)','ISRAEL','GIBRALTAR',
              'PORTUGAL','SLOVENIA','LATVIA','MALTA','MALAYSIA','SWEDEN',
              'VATICAN CITY STATE (HOLY SEE)','MARTINIQUE','LITHUANIA','JERSEY','ROMANIA',
              'GUYANA','GREECE','ESTONIA','CYPRUS',]
    if c in delete:
        return 'OTHER'
    return c

def PRODUCT_TYPE_map(c):
    delete = ['ACTIVITE_LOISIRS','AMPLIFICATOR',
              'AUTORADIOS','CYCLE', 'INSOLITE', 'RADIO RECEPTOR', 'TELESCOPE',
              'ANSWERING MACHINE', 'AUTOGRAPHES', 'INPUT ADAPTERS',
              'MUSIC LOT']
    if c in delete:
        return 'OTHER'
    return c
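
Grouping rare levels into 'OTHER' is one way to keep the dummy columns of the training and testing sets consistent. Another option, shown here as a sketch and not as what this notebook does, is to build the dummies separately and align the test columns on the training columns. The helper name `aligned_dummies` and the toy values are assumptions.

```python
import pandas as pd

def aligned_dummies(train_col, test_col, prefix):
    """One-hot encode both columns and give the test frame the training columns."""
    train_dummies = pd.get_dummies(train_col, prefix=prefix).astype(int)
    test_dummies = pd.get_dummies(test_col, prefix=prefix)
    # Levels unseen in training are dropped, levels missing in test are filled with 0
    test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0).astype(int)
    return train_dummies, test_dummies

# Toy columns standing in for train_df / test_X (values are made up)
train_col = pd.Series(['FRANCE, METROPOLITAN', 'GERMANY', 'CHINA'])
test_col = pd.Series(['GERMANY', 'JAPAN'])
train_d, test_d = aligned_dummies(train_col, test_col, prefix='SELLER_COUNTRY')
print(test_d)
```

A test row whose level was never seen in training (JAPAN here) ends up with zeros everywhere, which plays the same role as the 'OTHER' bucket without a dedicated column.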

## Model evaluation

The AUC score is used for this challenge:

In [None]:
# Computes the ROC AUC score.
# Dummy encoding is used to represent both the predictions 
# and the true values, i.e. the inputs are matrices and each 
# row is a vector containing a single 1 and zeros elsewhere.
def auc_weighted(y_true, y_pred):
    score = roc_auc_score(y_true, y_pred, average='weighted')
    return score

# Version used for training, with y given as a vector of integers
def auc_weighted_bis(y_true, y_pred):
    lb = LabelBinarizer()
    y_true_ =  lb.fit_transform(y_true)
    y_pred_ = lb.transform(y_pred)
    score = roc_auc_score(y_true_, y_pred_, average='weighted')
    return score
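
To make the metric concrete: with one-hot targets, `average='weighted'` computes one one-vs-rest AUC per class and averages them with weights proportional to the number of true instances of each class. The pure-Python sketch below reproduces that computation on made-up values. The function names are hypothetical and it assumes every class has at least one positive and one negative row.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def weighted_ovr_auc(y_true_onehot, y_score):
    """Average the per-class AUCs, weighting each class by its support."""
    n_classes = len(y_true_onehot[0])
    total, weight_sum = 0.0, 0
    for k in range(n_classes):
        truth = [row[k] for row in y_true_onehot]
        support = sum(truth)
        total += support * binary_auc(truth, [row[k] for row in y_score])
        weight_sum += support
    return total / weight_sum

# Toy example with 3 classes and 4 rows (values are made up)
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_score = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6], [0.2, 0.7, 0.1]]
result = weighted_ovr_auc(y_true, y_score)
print(result)
```

Here the per-class AUCs are 0.625, 2/3 and 1.0 with supports 2, 1 and 1. Since the AUC is a ranking metric, feeding it hard one-hot predictions instead of probabilities discards most of the ranking information.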

## Predictions

The goal of the challenge is to make the best predictions on a data set for which we don't have the `CLAIM_TYPE` column.

In [None]:
def save_claims_prediction(preds):
    classes = ["-","WITHDRAWAL","DAMAGED","DIFFERENT","SELLER_CANCEL_POSTERIORI",
             "NOT_RECEIVED", "UNDEFINED","FAKE"]
    def claim_type(p):
        return classes[np.argmax(p)]
    df = pd.DataFrame(index=np.arange(0, len(preds)), columns=('ID', 'CLAIM_TYPE') )
    df['ID'] = [i+100000 for i in range(len(preds))]
    df['CLAIM_TYPE'] = [claim_type(p) for p in preds]
    df.to_csv('prediction.csv', sep=';', index=False)

## Model #1 (by Fabien)

This is our first attempt. We start by preprocessing the dataset (creating new features, removing columns), then we use a RandomForest classifier (provided by scikit-learn) to predict the claims for the testing set (for which we really don't know the output).

### Data preparation

What we need to do :
- Categorical variables need to be transformed to numeric variables
- Fill missing values in variables
- Creation of new variables /  Feature Engineering

In [None]:
def model1_feature_engineering(df):
    # Turning SHIPPING_MODE into dummy variables:
    categorical_to_dummy(df, 'SHIPPING_MODE')
    
    # Mapping SHIPPING_PRICE:
    # Considering that Null values correspond to the case when shipping is free
    SHIPPING_PRICE_mapping = {"<1": 1, "1<5": 2, "5<10": 3, "10<20": 4, ">20": 5, 0.0:0} 
    categorical_mapping(df, 'SHIPPING_PRICE', SHIPPING_PRICE_mapping)
    
    # Mapping WARRANTIES_PRICE :
    WARRANTIES_PRICE_mapping = {"<5": 1, "5<20": 2, "20<50": 3, "50<100": 4, "100<500": 5, 0.0:0}
    categorical_mapping(df, 'WARRANTIES_PRICE', WARRANTIES_PRICE_mapping)
    
    # Mapping WARRANTIES_FLG :
    WARRANTIES_FLG_mapping = {True: 1, False: 0}
    categorical_mapping(df, 'WARRANTIES_FLG', WARRANTIES_FLG_mapping)
    
    # Mapping PRICECLUB_STATUS :
    PRICECLUB_STATUS_mapping = {"UNSUBSCRIBED": 0, "REGULAR": 1, "PLATINUM": 2, "SILVER": 3, "GOLD": 4, 0.0:0}
    categorical_mapping(df, 'PRICECLUB_STATUS', PRICECLUB_STATUS_mapping)
    
    # Mapping PURCHASE_COUNT :
    PURCHASE_COUNT_mapping = {'<5':0,'5<20':1,'20<50':2,'50<100':3,'100<500':4,'>500':5}
    categorical_mapping(df, 'PURCHASE_COUNT', PURCHASE_COUNT_mapping)
    
    # Creation of UNEXPERIENCED_BUYER :
    df['UNEXPERIENCED_BUYER'] = np.where(df['PURCHASE_COUNT']==0, 1, 0)

    # Simplification of feature REGISTRATION_DATE, 
    # replaced by BUYER_SENIORITY which corresponds to the number of years the buyer is using the service
    df['BUYER_SENIORITY'] = 2017 - df['REGISTRATION_DATE']
    del df['REGISTRATION_DATE']
    
    # Simplification of feature BUYER_BIRTHDAY_DATE, replaced by BUYER_AGE
    df['BUYER_AGE'] = 2017 - df['BUYER_BIRTHDAY_DATE']
    del df['BUYER_BIRTHDAY_DATE']
        
    # Filling missing data with ffill method:
    df['BUYER_AGE'] = df['BUYER_AGE'].ffill().astype(int)
        
    # TODO: filling missing data with KNN

    # Creating BUYER_REGION 
    df['BUYER_REGION'] = df['BUYER_DEPARTMENT'].map(DEPARTMENT_mapping)    
            
    # Creation of BUYER_IS_ABROAD
    df['BUYER_IS_ABROAD'] = np.where(df['BUYER_REGION']=='Etranger', 1, 0)

    # Turning BUYER_REGION into dummy_variables
    dummy = pd.get_dummies(df['BUYER_REGION'], prefix='BUYER_REGION')
    for column in dummy.columns.values:
        df[column] = dummy[column]

    # Turning BUYER_DEPARTMENT into dummy variables:
    dummy = pd.get_dummies(df['BUYER_DEPARTMENT'], prefix='BUYER_DEPARTMENT')
    for column in dummy.columns.values:
        df[column] = dummy[column] 
    
    # Creating SELLER_REGION : (we use the same department mapping as for BUYER_REGION)
    df['SELLER_REGION'] = df['SELLER_DEPARTMENT'].map(DEPARTMENT_mapping)
            
    # Creation of SELLER_IS_ABROAD
    df['SELLER_IS_ABROAD'] = np.where(df['SELLER_REGION']=='Etranger', 1, 0)   

    # Turning SELLER_REGION into dummy_variables
    dummy = pd.get_dummies(df['SELLER_REGION'], prefix='SELLER_REGION')
    for column in dummy.columns.values:
        df[column] = dummy[column]

    # Turning SELLER_DEPARTMENT into dummy variables:
    dummy = pd.get_dummies(df['SELLER_DEPARTMENT'], prefix='SELLER_DEPARTMENT')
    for column in dummy.columns.values:
        df[column] = dummy[column]

    # Creating SAME_REGION_BUYER_SELLER equal to 1 when Buyer and Seller are from the same region
    df['SAME_REGION_BUYER_SELLER'] = np.where(df['SELLER_REGION']==df['BUYER_REGION'], 1, 0)

    # Creating SAME_DEPARTMENT_BUYER_SELLER equal to 1 when Buyer and Seller are from the same department
    df['SAME_DEPARTMENT_BUYER_SELLER'] = np.where(df['SELLER_DEPARTMENT']==df['BUYER_DEPARTMENT'], 1, 0)

    # Creation of SELLER_BUYER_REGION_DISTANCE
    create_SELLER_BUYER_REGION_DISTANCE(df)
    
    # Now we can delete BUYER_REGION, BUYER_DEPARTMENT, SELLER_REGION and SELLER_DEPARTMENT
    del df['BUYER_REGION']
    del df['BUYER_DEPARTMENT']
    del df['SELLER_REGION']
    del df['SELLER_DEPARTMENT']
    
    
    # Mapping BUYING_DATE into numerical variable: '3/2017' -> 3
    # Using regular expression to isolate months
    df['BUYING_DATE'] = df['BUYING_DATE'].apply( lambda s : int(re.findall('[0-9]*',s)[0]) )
    
    categorical_to_dummy(df, 'BUYING_DATE')
    
    # Mapping SELLER_SCORE_COUNT
    SELLER_SCORE_COUNT_mapping = {0.0:0, '<100':1, '100<1000':2, '1000<10000':3, '10000<100000':4, '100000<1000000':5}
    categorical_mapping(df, 'SELLER_SCORE_COUNT', SELLER_SCORE_COUNT_mapping)
    
    # Mapping SELLER_SCORE_AVERAGE
    def SELLER_SCORE_AVERAGE_map(score):
        if score == 50:
            return -1
        elif score < 44:
            return 0
        elif score < 46:
            return 1
        elif score == 46:
            return 2
        elif score == 47:
            return 3
        elif score == 48:
            return 4
        elif score == 49:
            return 5
        else:
            return 0

    df['SELLER_SCORE_AVERAGE'] = df['SELLER_SCORE_AVERAGE'].apply(SELLER_SCORE_AVERAGE_map)
    
    # Create VIP_SELLER feature for seller with a score of 49 (label 5)
    df['VIP_SELLER'] = np.where(df['SELLER_SCORE_AVERAGE']==5, 1, 0)
    
    # Turning SELLER_COUNTRY into dummy variables:
    dummy = pd.get_dummies(df['SELLER_COUNTRY'], prefix='SELLER_COUNTRY')
    for column in dummy.columns.values:
        df[column] = dummy[column]
    
    # Creation of a feature representing seller country distance to France by mapping SELLER_COUNTRY:    
    categorical_mapping(df, 'SELLER_COUNTRY', SELLER_COUNTRY_DISTANCE_mapping)
    
    # Turning PRODUCT_FAMILY into dummy_variables
    categorical_to_dummy(df, 'PRODUCT_FAMILY')

    # Turning PRODUCT_TYPE into dummy_variables
    categorical_to_dummy(df, 'PRODUCT_TYPE')
    
    # Mapping ITEM_PRICE:
    ITEM_PRICE_mapping = {"<10" : 1, "10<20" : 2, "20<50" : 3, "50<100" : 4, "100<500" : 5, "500<1000" : 6,
                          "1000<5000" : 7, ">5000" : 8} 
    categorical_mapping(df, 'ITEM_PRICE', ITEM_PRICE_mapping)
    

In [None]:
#Create all datasets that are necessary to train, validate and test models

claim_list = ['-', 'NOT_RECEIVED', 'SELLER_CANCEL_POSTERIORI', 'WITHDRAWAL', 'DAMAGED', 'DIFFERENT', 
                  'UNDEFINED', 'FAKE']

input_train_X = pd.read_csv('train_X.csv', sep=',')
input_train_Y = pd.read_csv('train_Y.csv', sep=';')
input_test_X = pd.read_csv('test_X.csv', sep=',')

training_df = pd.merge(input_train_X, input_train_Y, on='ID').drop(['ID'], axis=1)
testing_df = input_test_X

model1_feature_engineering(training_df)
model1_feature_engineering(testing_df)

#train_X = training_df.drop('CLAIM_TYPE', axis=1).astype(int)
train_X = training_df.drop('CLAIM_TYPE', axis=1)
train_Y = training_df.CLAIM_TYPE
test_X = testing_df.drop('ID', axis=1)

# Label encoding :
claim_mapping = {}
for i in range(8):
    claim_mapping[claim_list[i]]=i

# Mapping
train_Y_mapped = pd.DataFrame(train_Y)
train_Y_mapped = train_Y_mapped.CLAIM_TYPE.map(claim_mapping)
train_Y_mapped = np.array(list(train_Y_mapped))

train_X_full, train_Y_full, train_Y_multiclass_full = train_X, train_Y, train_Y_mapped

In [None]:
train_Y_mapped

In [None]:
print(train_X_full.shape, test_X.shape)

In [None]:
# train_X and test_X do not have the same shape !
# We sanitize our datasets by removing columns that are present in only 
# one of the two datasets.
def del_residual_columns(train, test):
    l_train = train.columns.values
    l_test = test.columns.values
    l_diff_train = []
    l_diff_test = []
    for s in l_train:
        if s not in l_test:
            print(s)
            l_diff_train.append(s)
    for s in l_test:
        if s not in l_train:
            print(s)
            l_diff_test.append(s)
    # Deleting these variables from training and test set
    for s in l_diff_train:
        del train[s]
    for s in l_diff_test:
        del test[s]

In [None]:
del_residual_columns(train_X_full, test_X)

In [None]:
print(train_X_full.shape, test_X.shape)

In [None]:
train_X, valid_X, train_Y, valid_Y = train_test_split(train_X_full, train_Y_multiclass_full, 
                                                      train_size=.7, random_state=7)

print(train_X_full.shape, train_Y_full.shape, train_X.shape, valid_X.shape, 
       train_Y.shape , valid_Y.shape , test_X.shape)

#### MACHINE LEARNING MODELS

In [None]:
index_Y = ["-","WITHDRAWAL","DAMAGED","DIFFERENT","SELLER_CANCEL_POSTERIORI","NOT_RECEIVED","UNDEFINED","FAKE"]
index_Y_claims = ["WITHDRAWAL","DAMAGED","DIFFERENT","SELLER_CANCEL_POSTERIORI","NOT_RECEIVED","UNDEFINED","FAKE"]
index_Y_bin = ["no claim", "claim"]

# Fit the model and make predictions
def pred(model):
    import time
    begin = time.time()
    model.fit(train_X, train_Y)
    pred_Y = model.predict(valid_X)
    print(round(time.time() - begin, 2), " s")
    return pred_Y

# Display the confusion matrix
def display_cm(pred_Y, valid_Y):
    cm = confusion_matrix(valid_Y, pred_Y)
    cm_df = pd.DataFrame(cm, index=index_Y, columns=index_Y)
    return cm_df

# Return average AUC with KFold cross-validation 
def run_kfold(model):
    import time
    begin = time.time()
    kf = KFold(n_splits=10, shuffle=False)  # random_state only applies when shuffle=True
    outcomes = []
    fold = 0
    for train_index, test_index in kf.split(train_X_full):
        fold += 1
        X_train, X_valid = train_X_full.values[train_index], train_X_full.values[test_index]
        #y_train, y_valid = train_Y_multiclass[train_index], train_Y_multiclass[test_index] # With dummy variables
        y_train, y_valid = train_Y_mapped[train_index], train_Y_mapped[test_index] # Without dummy variables
        model.fit(X_train, y_train)
        y_pred = model.predict(X_valid)
        
        # Without dummy variables
        lb = LabelBinarizer()
        y_valid =  lb.fit_transform(y_valid)
        y_pred = lb.transform(y_pred)
        
        
        auc = auc_weighted(y_valid, y_pred)
        outcomes.append(auc)
        print("Fold {0}, AUC: {1}".format(fold, round(auc,3)))     
    mean_outcome = round(np.mean(outcomes),3)
    print("Mean AUC: {0}".format(mean_outcome)) 
    print("\n",round(time.time()-begin,2)," s")
    return mean_outcome

In [None]:
# UPDATE :
# Using Vincent's model 2 ideas with claim/no claim predictions:

# First predict claim/no claim, then predict the type of claim if there is claim

def auc_2_steps(model_1, model_2):    
    # Train test split with all data
    global train_X, valid_X, train_Y, valid_Y, pred_Y
    
    # With dummy variables :
    
    #train_X, valid_X, train_Y_multi, valid_Y = train_test_split(train_X_full, train_Y_full, 
     #                                                           train_size=.7, random_state=7)
    #train_Y = np.where(train_Y_multi!='-', 1, 0)
    #valid_Y_bin = np.where(valid_Y!='-', 1, 0)
        
    # Without dummy variables :
    
    train_X, valid_X, train_Y_multi, valid_Y = train_test_split(train_X_full, train_Y_mapped, 
                                                                train_size=.7, random_state=7)
    train_Y = np.where(train_Y_multi!=0, 1, 0)
    valid_Y_bin = np.where(valid_Y!=0, 1, 0)   

    # First prediction : claim or no claim
    pred_Y_bin = pred(model_1)
    auc = auc_weighted(valid_Y_bin, pred_Y_bin)
    print("auc claim/no claim : ", auc)

    # With dummy variables :
    
    #train_X = train_X[train_Y_multi != '-']
    #train_Y = train_Y_multi[train_Y_multi != '-']
    #lb2 = LabelBinarizer()
    #train_Y = lb2.fit_transform(train_Y)
    
    # Without dummy variables :
    
    train_X = train_X[train_Y_multi != 0]
    train_Y = train_Y_multi[train_Y_multi != 0]
    
    # Second prediction : if claim, predict type of claim
    pred_Y_claims = pred(model_2)
    #pred_Y_claims = lb2.inverse_transform(pred_Y_claims) # With dummy variables

    pred_Y = []
    for i in range(len(pred_Y_claims)):
        if pred_Y_bin[i]==0:
            #pred_Y.append('-') # With dummy variables
            pred_Y.append(0) # Without dummy variables
        else:
            pred_Y.append(pred_Y_claims[i])
    pred_Y = np.array(pred_Y)

    lb3 = LabelBinarizer()
    valid_Y = lb3.fit_transform(valid_Y)
    pred_Y = lb3.transform(pred_Y)

    return auc_weighted(valid_Y, pred_Y)

In [None]:
model_1 = RandomForestClassifier(n_estimators=100)
model_2 = RandomForestClassifier(n_estimators=100)
auc_rf_rf = auc_2_steps(model_1, model_2)
auc_rf_rf

In [None]:
model_1 = RandomForestClassifier(n_estimators=100)
model_2 = KNeighborsClassifier(n_neighbors = 5)
auc_rf_knn = auc_2_steps(model_1, model_2)
auc_rf_knn

In [None]:
model_1 = KNeighborsClassifier(n_neighbors = 5)
model_2 = RandomForestClassifier(n_estimators=100)
auc_knn_rf = auc_2_steps(model_1, model_2)
auc_knn_rf

In [None]:
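# Imports needed by this cell (xgboost is an optional third-party package)
from sklearn.multiclass import OneVsRestClassifier
import xgboost as xgb

model_1 = RandomForestClassifier(n_estimators=100)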
model_2 = OneVsRestClassifier(xgb.XGBClassifier(n_estimators=200))
auc_rf_gbm = auc_2_steps(model_1, model_2)
auc_rf_gbm

In [None]:
# Displays the models auc

models = pd.DataFrame({
    'Model 1': ['Random Forest', 'Random Forest', 'Random Forest', 'KNN'],
    'Model 2': ['Random Forest', 'Gradient Boosting', 'KNN', 'Random Forest' ],
    'AUC': [auc_rf_rf, auc_rf_gbm, auc_rf_knn, auc_knn_rf]
})

models.sort_values(by='AUC', ascending=False).style.set_table_styles([
    {'selector': '.row_heading, .blank', 'props': [('display', 'none;')]}
])

#### Predictions with all labels (no distinction claim/no claim)

In [None]:
train_X, valid_X, train_Y, valid_Y = train_test_split(train_X_full, train_Y_multiclass_full, 
                                                      train_size=.7, random_state=7)

print(train_X_full.shape, train_Y_full.shape, train_X.shape, valid_X.shape, 
       train_Y.shape , valid_Y.shape , test_X.shape)

In [None]:
# Gradient Boosting with XGboost

gbm = OneVsRestClassifier(xgb.XGBClassifier(n_estimators=50))
pred_Y = pred(gbm)
auc_gbm = round(auc_weighted(valid_Y, pred_Y),3)
auc_gbm

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
auc_rf = run_kfold(random_forest)

In [None]:
# MLP
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier()
pred_Y = pred(mlp)
auc_mlp = round(auc_weighted(valid_Y, pred_Y),3)
auc_mlp

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
auc_decision_tree = run_kfold(decision_tree)

In [None]:
# KNN

knn = KNeighborsClassifier(n_neighbors = 3)
auc_knn = run_kfold(knn)

In [None]:
# Displays the models auc

models = pd.DataFrame({
    'Model': ['Gradient Boosting', 'Random Forest',  
              'MLP', 'Decision Tree', 'KNN'],
    'AUC': [auc_gbm, auc_rf, auc_mlp, auc_decision_tree, auc_knn]
})

models.sort_values(by='AUC', ascending=False).style.set_table_styles([
    {'selector': '.row_heading, .blank', 'props': [('display', 'none;')]}
])

In [None]:
# Tuning models 

model = RandomForestClassifier()

parameters = {'n_estimators': [20, 100, 200], 
              'criterion': ['entropy', 'gini'],
              'max_depth': [5, 10, None],
              'max_features': ['log2', 'sqrt']  # 'auto' was an alias of 'sqrt' for classifiers
             }

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(auc_weighted_bis)

# Run the grid search
grid_obj = GridSearchCV(model, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(train_X, train_Y)

# Set the clf to the best combination of parameters
best_rf = grid_obj.best_estimator_

In [None]:
pred_Y = pred(best_rf)
auc_best_rf = auc_weighted_bis(valid_Y, pred_Y)
auc_best_rf

In [None]:
train_Y

In [None]:
# Only for dummy variables predictions:

#plot roc curves
def plot_roc():
    # Compute ROC curve and ROC area for each class
    n_classes = train_Y.shape[1]
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(valid_Y[:, i], pred_Y[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Compute micro-average ROC curve and ROC area
    fpr["micro"], tpr["micro"], _ = roc_curve(valid_Y.ravel(), pred_Y.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
    
    # Compute macro-average ROC curve and ROC area

    lw=2

    # First aggregate all false positive rates
    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

    # Then interpolate all ROC curves at these points
    mean_tpr = np.zeros_like(all_fpr)
    for i in range(n_classes):
        mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])

    # Finally average it and compute AUC
    mean_tpr /= n_classes

    fpr["macro"] = all_fpr
    tpr["macro"] = mean_tpr
    roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

    # Plot all ROC curves
    plt.figure(figsize=(12, 6))
    plt.plot(fpr["micro"], tpr["micro"],
             label='micro-average ROC curve (area = {0:0.2f})'
                   ''.format(roc_auc["micro"]),
             color='deeppink', linestyle=':', linewidth=4)

    plt.plot(fpr["macro"], tpr["macro"],
             label='macro-average ROC curve (area = {0:0.2f})'
                   ''.format(roc_auc["macro"]),
             color='navy', linestyle=':', linewidth=4)

    colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
    for i, color in zip(range(n_classes), colors):
        plt.plot(fpr[i], tpr[i], color=color, lw=lw,
                 label='ROC curve of class {0} (area = {1:0.2f})'
                 ''.format(i, roc_auc[i]))

    plt.plot([0, 1], [0, 1], 'k--', lw=lw)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Some extension of Receiver operating characteristic to multi-class')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
# Confusion matrix :

pred_Y = pd.DataFrame(pred_Y)
valid_Y = pd.DataFrame(valid_Y)

display_cm(pred_Y, valid_Y)

#### Submit results with one model

In [None]:
model = RandomForestClassifier(n_estimators=100)

# The labels are integer-encoded with claim_mapping (no dummy variables)
model.fit(train_X_full, train_Y_multiclass_full)
pred_Y = model.predict(test_X)

# Map the integer labels back to the claim names
pred_Y = [claim_list[i] for i in pred_Y]

submission = pd.DataFrame({
        "ID": testing_df['ID'],
        "CLAIM_TYPE": pred_Y
    })
submission = submission[['ID', 'CLAIM_TYPE']]

# Creates csv file:
submission.to_csv( 'prediction.csv', sep=';', index = False )

#### Submit results with 2 models (claim vs no claim first) :

In [None]:
claim_mapping_reversed = {}
for key, value in claim_mapping.items():
    claim_mapping_reversed[value] = key

train_Y_full_bin = np.where(train_Y_mapped!=0, 1, 0)
train_X_full_claims = train_X_full[train_Y_mapped != 0]
train_Y_full_claims = train_Y_mapped[train_Y_mapped != 0]    

In [None]:
# Tuning the first classifier

model = RandomForestClassifier()

parameters = {'n_estimators': [150, 180, 200, 230, 250]} # This time we only optimize n_estimators to save time

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(auc_weighted)

# Run the grid search
grid_obj = GridSearchCV(model, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(train_X_full, train_Y_full_bin)

# Set the clf to the best combination of parameters
best_rf_1 = grid_obj.best_estimator_

In [None]:
best_rf_1

In [None]:
# Tuning the second classifier

model = RandomForestClassifier()

parameters = {'n_estimators': [150, 180, 200, 230, 250]} # This time we only optimize n_estimators to save time

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(auc_weighted_bis)

# Run the grid search
grid_obj = GridSearchCV(model, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(train_X_full_claims, train_Y_full_claims)

# Set the clf to the best combination of parameters
best_rf_2 = grid_obj.best_estimator_

In [None]:
best_rf_2

In [None]:
# Model 1 : Random Forest
model_1 = best_rf_1
# Model 2 : Random Forest
model_2 = best_rf_2

In [None]:
# WITHOUT DUMMY VARIABLES VERSION

# claim/no claim prediction :
model_1.fit(train_X_full, train_Y_full_bin)
pred_Y_bin = model_1.predict(test_X)

# If claim predict claim type : 
model_2.fit(train_X_full_claims, train_Y_full_claims)
pred_Y_claims = model_2.predict(test_X)

pred_Y = []
for i in range(len(test_X)):
    if pred_Y_bin[i]==0:
        pred_Y.append(0)
    else:
        pred_Y.append(pred_Y_claims[i])
pred_Y = np.array(pred_Y)

In [None]:
# Mapping
pred_Y = pd.DataFrame(pred_Y)
pred_Y = pred_Y[0].map(claim_mapping_reversed)
pred_Y = np.array(list(pred_Y))

In [None]:
submission = pd.DataFrame({
        "ID": testing_df['ID'],
        "CLAIM_TYPE": pred_Y
    })
submission = submission[['ID', 'CLAIM_TYPE']]
submission.head()

In [None]:
# Creates csv file:
submission.to_csv( 'prediction.csv', sep=';', index = False )

## Model #2 (by Vincent)

In his very first model (which is scarcely visible now), Fabien tried to predict the `CLAIM_TYPE` field directly, using a Random Forest classifier on some new features such as `VIP_SELLER`, `BUYER_REGION`, `SAME_REGION_BUYER_SELLER`, etc.

As I didn't know where to start, I decided to use the same feature set, but with a $k$-neighbors classifier. The results were worse. One problem with this approach is that the `-` level in the `CLAIM_TYPE` field is by far the most frequent (around 50% of orders did not lead to a claim); therefore, when using 5 or 10 neighbors, `-` wins quite often (even with `weights='distance'`). Knowing that, I tried randomly removing a proportion of the `-` rows from the training set: this led to better results on both the validation set and the submitted test set, and confirmed that the problem was indeed due to the classes not being well balanced. However, randomly removing elements from the training set is not satisfactory, and it requires some knowledge of the testing set (which we are not supposed to have).
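
As an illustration of the random removal described above, here is a minimal sketch on a toy frame. The function name `undersample_no_claim` and the `keep_fraction` parameter are assumptions made for this example; the fraction actually used in the experiments is not recorded in this notebook.

```python
import numpy as np
import pandas as pd

def undersample_no_claim(df, keep_fraction, label_column='CLAIM_TYPE', seed=0):
    """Keep every claim row and a random `keep_fraction` of the '-' rows."""
    no_claim = df[df[label_column] == '-']
    claims = df[df[label_column] != '-']
    kept = no_claim.sample(frac=keep_fraction, random_state=seed)
    # Shuffle so that the two groups are not stored one after the other
    return pd.concat([claims, kept]).sample(frac=1, random_state=seed)

toy = pd.DataFrame({
    'FEATURE': np.arange(10),
    'CLAIM_TYPE': ['-'] * 6 + ['DAMAGED', 'NOT_RECEIVED', 'WITHDRAWAL', 'DAMAGED'],
})

balanced = undersample_no_claim(toy, keep_fraction=0.5)
print(balanced['CLAIM_TYPE'].value_counts())
```

Only the training set should be resampled this way; the validation set has to keep its original class proportions, otherwise the validation score is no longer comparable to the challenge score.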

In my second attempt, I split the problem into two subproblems that ought to be simpler:
- claim vs. non-claim prediction;
- claim type prediction.

I used a Random Forest to predict the presence of a claim and a KNeighborsClassifier to predict the claim type. 
I used the same features as before.
This led to a significant improvement on both the validation set and the testing set.
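
The two-step idea can be sketched as follows on random data. This is only an outline under simplifying assumptions (integer labels with `0` standing for `-`, default hyperparameters, hypothetical function names `fit_two_step` and `predict_two_step`), not the exact code used for the submissions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def fit_two_step(X, y, no_claim_label=0):
    """Fit a claim detector on all rows and a claim-type classifier on claim rows only."""
    is_claim = (y != no_claim_label).astype(int)
    detector = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, is_claim)
    claim_rows = y != no_claim_label
    typer = KNeighborsClassifier(n_neighbors=5).fit(X[claim_rows], y[claim_rows])
    return detector, typer

def predict_two_step(detector, typer, X, no_claim_label=0):
    """Use the claim type only where the detector predicts a claim."""
    has_claim = detector.predict(X) == 1
    return np.where(has_claim, typer.predict(X), no_claim_label)

# Random data, for illustration only: label 0 plays the role of '-'
rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = rng.randint(0, 4, size=300)

detector, typer = fit_two_step(X, y)
pred = predict_two_step(detector, typer, X)
print(np.unique(pred))
```

Because the second classifier never sees a `-` row, it cannot be dominated by the majority class, which was the main weakness of the single $k$-neighbors model.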

I tried to improve the results with the same configuration using a custom metric defined as a Python function for the KNeighborsClassifier (the `metric` parameter of the class). This was too slow to be used.
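
For reference, this is roughly what passing a Python function as `metric` looks like. The weighted Manhattan distance and the weights below are hypothetical (the metric I actually tried is not reproduced here). When the custom metric is just a per-feature weighting of a built-in one, scaling the columns beforehand gives the same neighbors much faster, because the built-in metrics are compiled.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical per-feature weights, chosen only for this illustration
FEATURE_WEIGHTS = np.array([1.0, 0.25, 4.0])

def weighted_manhattan(a, b):
    """Weighted L1 distance between two 1-D feature vectors."""
    return np.sum(FEATURE_WEIGHTS * np.abs(a - b))

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = rng.randint(0, 3, size=200)

# Slow: the Python function is called for every pair of points that is compared
slow_knn = KNeighborsClassifier(n_neighbors=5, metric=weighted_manhattan).fit(X, y)
slow_pred = slow_knn.predict(X[:20])

# Fast: same distance, obtained by scaling the columns and using the built-in metric
X_scaled = X * FEATURE_WEIGHTS
fast_knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan').fit(X_scaled, y)
fast_pred = fast_knn.predict(X_scaled[:20])

print(np.mean(slow_pred == fast_pred))
```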

I then decided to change the features used with the KNeighborsClassifier. Indeed, some features were encoded in a single variable using integers, which implied a misleading proximity between some levels. For example, `WARRANTIES_PRICE` was encoded as follows:
- `Na` : `0` (absence of warranty)
- `'< 5'` : `1`  
- `'5 < 20'` : `2`  
- `'20 < 50'` : `3`  
- `'50 < 100'` : `4`  

This implies that an order with warranty price `'50 < 100'` is closer to one with `'20 < 50'` than to another with warranty price `'< 5'`. However, as far as the proportion of claims is concerned, this is not necessarily true. I switched most of the variables encoded this way into a list of rates indicating the proportion of each claim type.
Also, some variables, such as `BUYER_AGE`, have a much bigger span than the other variables. The $k$ nearest neighbors were therefore often orders made by people of similar age, which is not bad per se, but age might not be the most relevant variable to match on. I did some empirical weighting to improve the results. 
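
To make the rate encoding concrete, here is a small sketch that replaces one categorical column with one column per claim type, each holding the share of that claim type observed in the training set for the level. The function name `claim_rate_features` is hypothetical, and `pd.crosstab(..., normalize='index')` is used as a shortcut; the transformation function further down computes the same ratios by hand and then rescales them.

```python
import pandas as pd

def claim_rate_features(train, df, column, label_column='CLAIM_TYPE'):
    """One column per claim type: share of that claim type in `train`
    among the rows having the same level of `column`."""
    rates = pd.crosstab(train[column], train[label_column], normalize='index')
    features = pd.DataFrame(index=df.index)
    for claim_type in rates.columns:
        if claim_type == '-':
            continue  # the share of '-' is implied by the other shares
        name = column + '_RATE_' + claim_type
        # Levels never seen in the training set get a rate of 0
        features[name] = df[column].map(rates[claim_type]).fillna(0)
    return features

toy_train = pd.DataFrame({
    'WARRANTIES_PRICE': ['<5', '<5', '<5', '<5', '50<100', '50<100'],
    'CLAIM_TYPE':       ['-', '-', 'DAMAGED', 'DAMAGED', '-', 'NOT_RECEIVED'],
})

rate_features = claim_rate_features(toy_train, toy_train, 'WARRANTIES_PRICE')
print(rate_features)
```

With this encoding, two levels are close for the $k$-neighbors classifier when they lead to similar claim profiles, not when their integer codes happen to be adjacent.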

All these tweaks allowed me to reach rank #4 of the challenge at the time, outclassing Fabien by a significant margin (which is now reversed :'()
The features used in the latest version haven't evolved much since then.

After a significant number of tries, I can say that the rate of correct prediction for claim vs. no-claim seems to top out at about 65%.
For the sake of curiosity, I tried to use a neural network (Keras) with the features used by the Random Forest. The accuracy on the training set progresses slowly to a bit more than 70% with 40 epochs, and the accuracy on the validation seems to get stuck at 64% thus confirming that it will be hard to get more than 65% without adding useful external features.

Since the majority of orders lead to no claim, predicting the presence of a claim correctly is important to get a good AUC score. Predicting the correct type of claim, if any, is of less significance.
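
The weighting can be seen on a tiny example. The sketch below assumes the score is computed as in our `auc_weighted` helpers, i.e. on binarized hard labels with `average='weighted'`; the function name `weighted_auc_from_labels` and the toy labels are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def weighted_auc_from_labels(y_true, y_pred, classes):
    """One-vs-rest AUC on hard labels, each class weighted by its support."""
    y_true_bin = label_binarize(y_true, classes=classes)
    y_pred_bin = label_binarize(y_pred, classes=classes)
    return roc_auc_score(y_true_bin, y_pred_bin, average='weighted')

classes = [0, 1, 2]                        # 0 plays the role of '-'
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])

# Claim presence always right, claim type always wrong
right_presence = np.array([0, 0, 0, 0, 2, 2, 1, 1])
# Everything predicted as "no claim"
always_no_claim = np.zeros(8, dtype=int)

score_presence = weighted_auc_from_labels(y_true, right_presence, classes)
score_no_claim = weighted_auc_from_labels(y_true, always_no_claim, classes)
print(round(score_presence, 3), round(score_no_claim, 3))
```

In the first case the `-` column alone contributes half of the total weight with a perfect per-class AUC, which is why the score stays well above 0.5 even though every claim type is wrong.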

I slightly improved the results by reducing the number of neighbors used to predict the claim type to 1, and using a BaggingClassifier.
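
A minimal sketch of that last configuration, on random data: a `BaggingClassifier` wrapping a 1-nearest-neighbor classifier. The number of estimators and the sampling fractions below are placeholder values, not the ones used for the submission.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Random data, for illustration only: labels 1..3 stand for three claim types
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = rng.randint(1, 4, size=300)

bagged_knn = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=1),  # base estimator, passed positionally
    n_estimators=10,     # placeholder value
    max_samples=0.5,     # each 1-NN sees a random half of the rows
    max_features=0.8,    # and a random 80% of the columns
    random_state=0,
)
bagged_knn.fit(X, y)
bagged_pred = bagged_knn.predict(X[:10])
print(bagged_pred)
```

A single 1-NN model is very sensitive to noise; averaging several of them, each trained on a different subsample, smooths the decision while keeping the locality that worked well for the claim type.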

In [None]:
train_X = pd.read_csv('train_X.csv', sep=',')
train_Y = pd.read_csv('train_Y.csv', sep=';')
test_X = pd.read_csv('test_X.csv', sep=',')
train = pd.merge(train_X, train_Y, on='ID')

### Predicting the presence of a claim

In [None]:
without_claims = []
for i in range(len(train)):
    if train.CLAIM_TYPE[i] == '-':
        without_claims.append(i)
train_claims = train.drop(without_claims)

In [None]:
print(len(train_claims), len(train)) # The two classes are well balanced ('good news everyone')

In [None]:
train_full = train
train, valid = train_test_split(train.copy(), train_size=.7, random_state=42)
#train, valid = train_test_split(train.copy(), train_size=.7, random_state=57)

In [None]:
# We start by defining a function that will transform the dataset (feature engineering)
def data_transformation_claim_predictor(df):
    if 'ID' in df.columns:
        del df['ID']
       
    def CLAIM_TYPE_filter(c):
        if c == '-':
            return 0
        return 1
    if 'CLAIM_TYPE' in df.columns:
        df['CLAIM'] = df['CLAIM_TYPE'].apply(CLAIM_TYPE_filter)
        del df['CLAIM_TYPE']
    
    # Turning SHIPPING_MODE into dummy variables:
    categorical_to_dummy(df, 'SHIPPING_MODE')
    
    # Mapping SHIPPING_PRICE:
    # Considering that Null values correspond to the case when shipping is free
    df['SHIPPING_PRICE'] = df['SHIPPING_PRICE'].fillna(value=0)
    categorical_to_dummy(df, 'SHIPPING_PRICE')
    
    # Mapping WARRANTIES_PRICE :
    df['WARRANTIES_PRICE'] = df['WARRANTIES_PRICE'].fillna(value='0')
    categorical_to_dummy(df, 'WARRANTIES_PRICE')
    
    # Mapping WARRANTIES_FLG :
    WARRANTIES_FLG_mapping = {True: 1, False: 0}
    categorical_mapping(df, 'WARRANTIES_FLG', WARRANTIES_FLG_mapping)
    
    # Mapping PRICECLUB_STATUS :
    df['PRICECLUB_STATUS'] = df['PRICECLUB_STATUS'].fillna(value='NA')
    categorical_to_dummy(df, 'PRICECLUB_STATUS')
    
    # Mapping PURCHASE_COUNT :
    categorical_to_dummy(df, 'PURCHASE_COUNT')
        
    df['BUYER_SENIORITY'] = (2017 - df['REGISTRATION_DATE'])
    del df['REGISTRATION_DATE']
    categorical_to_dummy(df, 'BUYER_SENIORITY')

    # Filling missing data with ffill method:
    df['BUYER_AGE_GROUP'] = group_by_age(20, 60, 5, df.BUYER_BIRTHDAY_DATE)
    df['BUYER_AGE_GROUP'] = df['BUYER_AGE_GROUP'].ffill().astype(int)
    categorical_to_dummy(df, 'BUYER_AGE_GROUP')
    del df['BUYER_BIRTHDAY_DATE']
    
    # Mapping BUYER_REGION
    df['BUYER_REGION'] = df['BUYER_DEPARTMENT'].map(DEPARTMENT_mapping)
    del df['BUYER_DEPARTMENT']
    categorical_to_dummy(df, 'BUYER_REGION')
    
    df['BUYING_DATE'] = df['BUYING_DATE'].apply( lambda s : int(re.findall('[0-9]*',s)[0]) )
    categorical_to_dummy(df, 'BUYING_DATE')

    # Mapping SELLER_SCORE_COUNT
    df['SELLER_SCORE_COUNT'] = df['SELLER_SCORE_COUNT'].fillna(value='NA')
    categorical_to_dummy(df, 'SELLER_SCORE_COUNT')
    
    df['SELLER_SCORE_AVERAGE'] = df['SELLER_SCORE_AVERAGE'].fillna(value=0).astype(int)

    # Processing SELLER_COUNTRY variable:
    df['SELLER_COUNTRY'] = df['SELLER_COUNTRY'].fillna(value='OTHER').apply(SELLER_COUNTRY_map)
    categorical_to_dummy(df, 'SELLER_COUNTRY')

    df['SELLER_REGION'] = df['SELLER_DEPARTMENT'].map(DEPARTMENT_mapping)
    categorical_to_dummy(df, 'SELLER_REGION')
    del df['SELLER_DEPARTMENT']

    # Turning PRODUCT_FAMILY into dummy_variables
    categorical_to_dummy(df, 'PRODUCT_FAMILY')

    # PRODUCT_TYPE
    df['PRODUCT_TYPE'] = df['PRODUCT_TYPE'].fillna(value='OTHER').apply(PRODUCT_TYPE_map)
    categorical_to_dummy(df, 'PRODUCT_TYPE')
    
    
    # Mapping ITEM_PRICE:
    categorical_to_dummy(df, 'ITEM_PRICE')
    if 'ITEM_PRICE_>5000' in df.columns:
        del df['ITEM_PRICE_>5000']


In [None]:
#  Then we can fit either a RandomForest or a neural network
def build_random_forest_regressor():
    t = train.copy()
    data_transformation_claim_predictor(t)
    v = valid.copy()
    data_transformation_claim_predictor(v)
    
    t_X = t.drop('CLAIM', axis=1)
    t_Y = t.CLAIM
    v_X = v.drop('CLAIM', axis=1)
    v_Y = v.CLAIM
    
    # Since there are only two classes (claim vs non-claim), we can use a Regressor
    from sklearn.ensemble import RandomForestRegressor
    #rf = RandomForestRegressor(verbose=1, n_jobs=2, n_estimators = 200, max_features=int(len(t_X.columns)/3), max_depth=30)
    #rf = RandomForestRegressor(verbose=1, n_jobs=2, n_estimators = 100, max_features=int(len(t_X.columns)/3), max_depth=50)
    rf = RandomForestRegressor(verbose=1, n_jobs=2, n_estimators = 20, max_features=int(len(t_X.columns)/3), min_samples_split=100)
    rf.fit(t_X, t_Y)
    s = rf.score(v_X, v_Y)
    print("Score (R^2) : ", s)  # RandomForestRegressor.score returns R^2, not the MSE
    pred_Y = rf.predict(v_X)
    in_middle = ((pred_Y > 0.45) * (pred_Y < 0.55))
    pred_Y = (pred_Y > 0.5) * 1
    s = (v_Y == pred_Y).sum()
    print("Score (accuracy) : ", (s/len(v_X)))
    print("Proportion of claims:", (pred_Y.sum() / len(pred_Y)))
    print("Proportion of in-between:", (in_middle.sum() / len(pred_Y)))
    return rf

def build_deep_claim_predictor():
    t = train.copy()
    data_transformation_claim_predictor(t)
    v = valid.copy()
    data_transformation_claim_predictor(v)
    
    t_X = t.drop('CLAIM', axis=1)
    t_Y = t.CLAIM
    v_X = v.drop('CLAIM', axis=1)
    v_Y = v.CLAIM
    
    from keras.utils import to_categorical
    t_Y_cat = to_categorical(t_Y)
    v_Y_cat = to_categorical(v_Y)
    
    # Keras has some problems with Pandas DataFrame, so we extract the raw values
    t_X = t_X.values
    v_X = v_X.values

    print(len(t_X[0]), "input variables")
    
    from keras.models import Sequential
    from keras.layers import Dropout
    from keras.layers import Dense
    classifier = Sequential()
    classifier.add(Dense(input_shape=(len(t_X[0]),), units=128, activation='relu'))
    classifier.add(Dropout(0.3)) 
    classifier.add(Dense(units = 64, activation='relu'))
    classifier.add(Dropout(0.3)) 
    classifier.add(Dense(units = 32, activation='relu'))
    classifier.add(Dropout(0.3)) # Overfitting reduction - Dropout
    classifier.add(Dense(units = 2, activation='softmax'))
    classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    classifier.fit(t_X, t_Y_cat, validation_data = (v_X, v_Y_cat), epochs = 64, batch_size=256)
    score = classifier.evaluate(v_X, v_Y_cat, batch_size=32)    
    print("Score (loss, acc) : ", score)
    pred_Y = classifier.predict(v_X)
    print("Proportion of claims:", (pred_Y[:,1].sum() / len(pred_Y)))
    return classifier


In [None]:
rf = build_random_forest_regressor()

In [None]:
# Displays features importance
df = train.copy()
data_transformation_claim_predictor(df)
feature_names = df.drop('CLAIM', axis=1).columns
imp = rf.feature_importances_ 
print(len(imp), "features")
for i in range(len(imp)):
    print(feature_names[i], ':', imp[i])

In [None]:
deep = build_deep_claim_predictor()

In [None]:
type(deep).__name__

As we can see, both models seem to plateau at around 64% accuracy on the claim vs. no-claim prediction. 
It seems it is going to be hard doing better without adding more variables that introduce additional information.
As the "-" class is the most present one, this score will considerably limit the score we can get on the challenge data website.

Trying to add distances between seller and buyer as Fabien did seems relevant. However, I am not convinced that it will fundamentally change the maximum score we can get. Indeed, predicting claims is hard as two extremely similar orders may lead to different claims (or no claims) depending on external factors that are not present in the dataset and difficult to evaluate.

Moreover, all this work may be a waste of time! By using only three variables (`SELLER_DEPARTMENT`, `SELLER_SCORE_COUNT` and `SELLER_SCORE_AVERAGE`), I correctly predicted around 62% of orders leading to a claim (by creating a map between these three variables and the percentage of claims corresponding to the tuple). 
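
The lookup-table predictor mentioned above can be sketched like this. The function names, the 0.5 decision threshold and the default rate for unseen tuples are assumptions made for the example; the toy rows are invented.

```python
import pandas as pd

KEYS = ['SELLER_DEPARTMENT', 'SELLER_SCORE_COUNT', 'SELLER_SCORE_AVERAGE']

def fit_claim_rate_table(train):
    """Share of orders with a claim for each (department, score count, score average) tuple."""
    has_claim = (train['CLAIM_TYPE'] != '-').astype(int)
    return has_claim.groupby([train[k] for k in KEYS]).mean().rename('CLAIM_RATE')

def predict_claim(table, df, default_rate=0.5):
    """Predict 1 (claim) when the tuple's claim rate in the table exceeds 0.5."""
    merged = df[KEYS].merge(table.reset_index(), on=KEYS, how='left')
    rate = merged['CLAIM_RATE'].fillna(default_rate)  # tuples never seen in training
    return (rate > 0.5).astype(int).to_numpy()

toy_train = pd.DataFrame({
    'SELLER_DEPARTMENT': [75, 75, 75, 13, 13],
    'SELLER_SCORE_COUNT': ['<100', '<100', '<100', '100<1000', '100<1000'],
    'SELLER_SCORE_AVERAGE': [46, 46, 46, 49, 49],
    'CLAIM_TYPE': ['DAMAGED', 'FAKE', '-', '-', '-'],
})
toy_new = pd.DataFrame({
    'SELLER_DEPARTMENT': [75, 13, 99],
    'SELLER_SCORE_COUNT': ['<100', '100<1000', '<100'],
    'SELLER_SCORE_AVERAGE': [46, 49, 50],
})

table = fit_claim_rate_table(toy_train)
claim_pred = predict_claim(table, toy_new)
print(claim_pred)
```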

### Prediction of claim type

In [None]:
def data_transformation_claim_type_classifier(df):
    if 'ID' in df.columns:
        del df['ID']
    
    # Mapping SHIPPING_MODE
    crosstab = pd.crosstab(train['SHIPPING_MODE'], train.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    for i in range(1, 8):
        values = crosstab.values[:,i]
        SHIPPING_MODE_mapping = values / crosstab.sum(axis = 1)
        df['SHIPPING_MODE_NOTE_' + str(i)] = df['SHIPPING_MODE'].map(SHIPPING_MODE_mapping).fillna(0)
        df['SHIPPING_MODE_NOTE_' + str(i)] = df['SHIPPING_MODE_NOTE_' + str(i)]*400
    del df['SHIPPING_MODE']
    
    # Mapping SHIPPING_PRICE:
    # Considering that Null values correspond to the case when shipping is free
    SHIPPING_PRICE_mapping = {"<1": 1, "1<5": 2, "5<10": 3, "10<20": 4, ">20": 5, 0.0:0} 
    categorical_mapping(df, 'SHIPPING_PRICE', SHIPPING_PRICE_mapping)
    df['SHIPPING_PRICE'] = df['SHIPPING_PRICE'] * 4
    
    # Mapping WARRANTIES_PRICE :
    WARRANTIES_PRICE_mapping = {"<5": 1, "5<20": 2, "20<50": 3, "50<100": 4, "100<500": 5, 0.0:0}
    categorical_mapping(df, 'WARRANTIES_PRICE', WARRANTIES_PRICE_mapping)
    df['WARRANTIES_PRICE'] = df['WARRANTIES_PRICE'] * 4
    
    # Removing WARRANTIES_FLG :
    del df['WARRANTIES_FLG']
    
    df['CARD_PAYMENT'] = df['CARD_PAYMENT'] * 0.25
    df['COUPON_PAYMENT'] = df['COUPON_PAYMENT'] * 0.25
    df['RSP_PAYMENT'] = df['RSP_PAYMENT'] * 0.25
    df['WALLET_PAYMENT'] = df['WALLET_PAYMENT'] * 0.25
    
    # Mapping PRICECLUB_STATUS :
    PRICECLUB_STATUS_mapping = {"UNSUBSCRIBED": 0, "REGULAR": 0.25, "PLATINUM": 0.5, "SILVER": 0.75, "GOLD": 1, 0.0:0}
    categorical_mapping_continuous(df, 'PRICECLUB_STATUS', PRICECLUB_STATUS_mapping)
    df['PRICECLUB_STATUS'] = df['PRICECLUB_STATUS'] * 10

    # Mapping PURCHASE_COUNT :
    PURCHASE_COUNT_mapping = {'<5':0,'5<20':1,'20<50':2,'50<100':3,'100<500':4,'>500':5}
    categorical_mapping(df, 'PURCHASE_COUNT', PURCHASE_COUNT_mapping)
    df['PURCHASE_COUNT'] = df['PURCHASE_COUNT'] * 4

    # Simplification of feature REGISTRATION_DATE, 
    # replaced by BUYER_SENIORITY which corresponds to the number of years the buyer is using the service
    # ranges from 0 to 16 I believe
    df['BUYER_SENIORITY'] = 2017 - df['REGISTRATION_DATE']
    seniority = 2017 - train['REGISTRATION_DATE']
    crosstab = pd.crosstab(seniority, train.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    for i in range(1, 8):
        values = crosstab.values[:,i]
        BUYER_SENIORITY_mapping = values / crosstab.sum(axis = 1)
        df['BUYER_SENIORITY_NOTE_' + str(i)] = df['BUYER_SENIORITY'].map(BUYER_SENIORITY_mapping).fillna(0)
        df['BUYER_SENIORITY_NOTE_' + str(i)] = df['BUYER_SENIORITY_NOTE_' + str(i)]*400
    del df['BUYER_SENIORITY']
    
    # Simplification of feature BUYER_BIRTHDAY_DATE, replaced by BUYER_AGE_GROUP
    df['BUYER_AGE_GROUP'] = group_by_age(20, 60, 5, df.BUYER_BIRTHDAY_DATE)
    train_copy = train.copy()
    train_copy['BUYER_AGE_GROUP'] = group_by_age(20, 60, 5, train_copy.BUYER_BIRTHDAY_DATE)
    crosstab = pd.crosstab(train_copy['BUYER_AGE_GROUP'], train_copy.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    for i in range(1, 8):
        values = crosstab.values[:,i]
        BUYER_AGE_GROUP_mapping = values / crosstab.sum(axis = 1)
        df['BUYER_AGE_GROUP_NOTE_' + str(i)] = df['BUYER_AGE_GROUP'].map(BUYER_AGE_GROUP_mapping).fillna(0)
        df['BUYER_AGE_GROUP_NOTE_' + str(i)] = df['BUYER_AGE_GROUP_NOTE_' + str(i)]*1000
    del df['BUYER_AGE_GROUP']
    del df['BUYER_BIRTHDAY_DATE']
    
    # Mapping BUYER_REGION
    df['BUYER_REGION'] = df['BUYER_DEPARTMENT'].map(DEPARTMENT_mapping)
    del df['BUYER_DEPARTMENT']
    train_copy = train.copy()
    train_copy['BUYER_REGION'] = train_copy['BUYER_DEPARTMENT'].map(DEPARTMENT_mapping)
    crosstab = pd.crosstab(train_copy['BUYER_REGION'], train_copy.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    for i in range(1, 8):
        values = crosstab.values[:,i]
        BUYER_REGION_mapping = values / crosstab.sum(axis = 1)
        df['BUYER_REGION_NOTE_' + str(i)] = df['BUYER_REGION'].map(BUYER_REGION_mapping).fillna(0)
        df['BUYER_REGION_NOTE_' + str(i)] = df['BUYER_REGION_NOTE_' + str(i)]*400
    del df['BUYER_REGION']

    # Mapping BUYING_DATE
    df['BUYING_DATE'] = df['BUYING_DATE'].apply( lambda s : int(re.findall('[0-9]*',s)[0]) )
    df['BUYING_DATE'] = df['BUYING_DATE'].ffill()
    df['BUYING_DATE'] = df['BUYING_DATE'] * 10

    # Mapping SELLER_SCORE_COUNT
    SELLER_SCORE_COUNT_mapping = {0.0:0, '<100':1, '100<1000':2, '1000<10000':3, '10000<100000':4, '100000<1000000':5}
    categorical_mapping(df, 'SELLER_SCORE_COUNT', SELLER_SCORE_COUNT_mapping)
    df['SELLER_SCORE_COUNT'] = df['SELLER_SCORE_COUNT'] * 20

    
    # Mapping SELLER_SCORE_AVERAGE
    df['SELLER_SCORE_AVERAGE'] = df['SELLER_SCORE_AVERAGE'].fillna(value=0)
    df['SELLER_SCORE_AVERAGE'] = df['SELLER_SCORE_AVERAGE'] * 2
    
    # Processing SELLER_COUNTRY variable:
    crosstab = pd.crosstab(train['SELLER_COUNTRY'], train.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    for i in range(1, 8):
        values = crosstab.values[:,i]
        SELLER_COUNTRY_mapping = values / crosstab.sum(axis = 1)
        df['SELLER_COUNTRY_NOTE_' + str(i)] = df['SELLER_COUNTRY'].map(SELLER_COUNTRY_mapping).fillna(0)
        df['SELLER_COUNTRY_NOTE_' + str(i)] = df['SELLER_COUNTRY_NOTE_' + str(i)]*400
    del df['SELLER_COUNTRY']
    
    # Processing SELLER_DEPARTMENT variable:
    crosstab = pd.crosstab(train['SELLER_DEPARTMENT'], train.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    # crosstab columns: - (column 0, skipped), DAMAGED, DIFFERENT, FAKE, NOT_RECEIVED,
    # SELLER_CANCEL_POSTERIORI, UNDEFINED, WITHDRAWAL
    # One scaling coefficient per claim type (columns 1 to 7)
    coeffs = [400, 400, 400, 400, 400, 800, 1000]
    for i in range(1, 8):
        values = crosstab.values[:,i]
        SELLER_DEPARTMENT_mapping = values / crosstab.sum(axis = 1)
        df['SELLER_DEPARTMENT_NOTE_' + str(i)] = df['SELLER_DEPARTMENT'].map(SELLER_DEPARTMENT_mapping).fillna(0)
        df['SELLER_DEPARTMENT_NOTE_' + str(i)] = df['SELLER_DEPARTMENT_NOTE_' + str(i)] * coeffs[i-1]
    del df['SELLER_DEPARTMENT']

    # Processing PRODUCT_FAMILY 
    crosstab = pd.crosstab(train['PRODUCT_FAMILY'], train.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    for i in range(1, 8):
        values = crosstab.values[:,i]
        PRODUCT_FAMILY_mapping = values / crosstab.sum(axis = 1)
        df['PRODUCT_FAMILY_NOTE_' + str(i)] = df['PRODUCT_FAMILY'].map(PRODUCT_FAMILY_mapping).fillna(0)
        df['PRODUCT_FAMILY_NOTE_' + str(i)] = df['PRODUCT_FAMILY_NOTE_' + str(i)]*400
    del df['PRODUCT_FAMILY']

    # Processing PRODUCT_TYPE
    crosstab = pd.crosstab(train['PRODUCT_TYPE'], train.CLAIM_TYPE).sort_values(by=['-'], ascending=False)
    for i in range(1, 8):
        values = crosstab.values[:,i]
        PRODUCT_TYPE_mapping = values / crosstab.sum(axis = 1)
        df['PRODUCT_TYPE_NOTE_' + str(i)] = df['PRODUCT_TYPE'].map(PRODUCT_TYPE_mapping).fillna(0)
        df['PRODUCT_TYPE_NOTE_' + str(i)] = df['PRODUCT_TYPE_NOTE_' + str(i)]*400
    del df['PRODUCT_TYPE']

    # Mapping ITEM_PRICE:
    ITEM_PRICE_mapping = {"<10" : 1, "10<20" : 2, "20<50" : 3, "50<100" : 4, "100<500" : 5, "500<1000" : 6,
                          "1000<5000" : 7, ">5000" : 8} 
    categorical_mapping(df, 'ITEM_PRICE', ITEM_PRICE_mapping)
    df['ITEM_PRICE'] = df['ITEM_PRICE'] * 10
    
    

In [None]:
def build_knn_classifier():
    t = train_claims.copy()
    data_transformation_claim_type_classifier(t)
    t, v = train_test_split(t, test_size=0.5, random_state=42)

    t_X = t.drop('CLAIM_TYPE', axis=1)
    t_Y = t.CLAIM_TYPE
    v_X = v.drop('CLAIM_TYPE', axis=1)
    v_Y = v.CLAIM_TYPE

    # Fixed column order used to one-hot encode the claim types
    claim_types = ["WITHDRAWAL", "DAMAGED", "DIFFERENT", "SELLER_CANCEL_POSTERIORI",
                   "NOT_RECEIVED", "UNDEFINED", "FAKE"]
    t_Y_multiclass = label_binarize(t_Y, classes=claim_types)
    v_Y_multiclass = label_binarize(v_Y, classes=claim_types)

    # 1-nearest-neighbour classifier trained on the one-hot targets
    knn = KNeighborsClassifier(n_neighbors=1, weights='distance')
    knn.fit(t_X, t_Y_multiclass)
    s = knn.score(v_X, v_Y_multiclass)
    print("Score (accuracy) : ", s)
    pred_Y = knn.predict(v_X)
    print("AUC weighted", auc_weighted(v_Y_multiclass, pred_Y))
    return knn
    
    

def build_bagging_classifier():
    t = train_claims.copy()
    data_transformation_claim_type_classifier(t)
    t, v = train_test_split(t, test_size=0.3, random_state=42)

    t_X = t.drop('CLAIM_TYPE', axis=1)
    t_Y = t.CLAIM_TYPE
    v_X = v.drop('CLAIM_TYPE', axis=1)
    v_Y = v.CLAIM_TYPE

    # Fixed column order used to one-hot encode the claim types
    claim_types = ["WITHDRAWAL", "DAMAGED", "DIFFERENT", "SELLER_CANCEL_POSTERIORI",
                   "NOT_RECEIVED", "UNDEFINED", "FAKE"]
    v_Y_multiclass = label_binarize(v_Y, classes=claim_types)

    from sklearn.ensemble import BaggingClassifier
    # Bag 1-nearest-neighbour classifiers, each fitted on 10% of the training rows
    knn = KNeighborsClassifier(n_neighbors=1, weights='distance')
    cla = BaggingClassifier(knn, n_estimators=50, max_samples=0.1, verbose=1, n_jobs=2)
    #cla = BaggingClassifier(knn, n_estimators=100, max_samples=0.1, verbose=1, n_jobs=2)
    # The bagging classifier is trained on the string labels, not on the one-hot matrix
    cla.fit(t_X, t_Y)
    s = cla.score(v_X, v_Y)
    print("Score (accuracy) : ", s)
    # Predicted labels are one-hot encoded only to compute the weighted AUC
    pred_Y = label_binarize(cla.predict(v_X), classes=claim_types)
    print("AUC weighted", auc_weighted(v_Y_multiclass, pred_Y))
    return cla

In [None]:
knn = build_knn_classifier()

In [None]:
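# Show the nearest neighbors of one order
index = 42
df = train.copy()
data_transformation_claim_type_classifier(df)
df = df.drop('CLAIM_TYPE', axis=1)
X = df[index:index+1]
distances, indexes = knn.kneighbors(X, 15)
distances, indexes = distances[0], indexes[0]
print(distances)
# Caveat: kneighbors returns positions in the data the classifier was fitted on
# (the training half of train_claims), which are not row positions in df
df.iloc[indexes].head(100)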

In [None]:
bagging = build_bagging_classifier()

The BaggingClassifier is really slow but correctly predicts the claim type nearly 40% of the time.
Considering that there are seven classes, this is not bad at all.
In my latest experiment, I got the same performance using a Random Forest instead of the BaggingClassifier. The Random Forest is clearly faster, so Fabien is probably right to use it as his default choice!
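
The Random Forest experiment mentioned above is not shown in this notebook. As an illustration only, here is a minimal sketch of what such a drop-in replacement might look like. It runs on synthetic features, so the accuracy it prints says nothing about the challenge data, and the hyperparameters (`n_estimators=100`, `n_jobs=2`) are assumptions rather than the values actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

claim_types = np.array(["WITHDRAWAL", "DAMAGED", "DIFFERENT", "SELLER_CANCEL_POSTERIORI",
                        "NOT_RECEIVED", "UNDEFINED", "FAKE"])

# Synthetic data: 5 numeric features whose mean shifts with the class (assumption).
rng = np.random.RandomState(0)
y_idx = rng.randint(0, len(claim_types), size=700)
X = rng.normal(size=(700, 5)) + y_idx[:, None]
y = claim_types[y_idx]

t_X, v_X, t_Y, v_Y = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# The forest is trained directly on the string labels, like the BaggingClassifier.
forest = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=42)
forest.fit(t_X, t_Y)

accuracy = forest.score(v_X, v_Y)
print("Score (accuracy) :", accuracy)

# Unlike a hard prediction, predict_proba gives one score per claim type,
# with columns ordered as in forest.classes_.
proba = forest.predict_proba(v_X)
```

Since an AUC is computed from scores rather than hard labels, the per-class probabilities are likely a better input to the weighted AUC than one-hot predictions, provided the columns are reordered to match the order used by `label_binarize`.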

### Global model evaluation
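
The evaluation in this section chains the two models: the claim predictor decides whether an order leads to a claim, and the claim type classifier supplies the type only for orders flagged as claims. The result is an 8-column matrix whose first column is the "-" class. Here is a minimal vectorised sketch of that combination step. The name `combine_predictions` and the toy inputs are assumptions, and the notebook's own functions build the same matrix with an explicit loop.

```python
import numpy as np


def combine_predictions(claim_or_no, claim_type, threshold=0.5):
    """Hypothetical helper: merge binary claim scores with one-hot claim types."""
    claim_or_no = np.asarray(claim_or_no, dtype=float)
    claim_type = np.asarray(claim_type, dtype=float)
    has_claim = claim_or_no > threshold
    preds = np.zeros((len(claim_or_no), claim_type.shape[1] + 1))
    # Column 0 is the "-" class: set for orders predicted as claim-free.
    preds[~has_claim, 0] = 1.0
    # Remaining columns carry the claim type, only for orders predicted as claims.
    preds[has_claim, 1:] = claim_type[has_claim]
    return preds


# Toy inputs: three orders, seven claim types (one-hot rows).
claim_or_no = [0.9, 0.2, 0.7]
claim_type = np.array([
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
])
combined = combine_predictions(claim_or_no, claim_type)
print(combined.argmax(axis=1))
```

The second order has a claim score below the threshold, so its predicted claim type is discarded and it is assigned to the "-" class.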

In [None]:
def eval_classifier(claim_predictor, claim_type_predictor):
    v = valid.copy()
    data_transformation_claim_predictor(v)
    v = v.drop('CLAIM', axis=1)
    if type(claim_predictor).__name__ == 'Sequential':
        v = v.values
    claim_or_no = claim_predictor.predict(v)
    if len(claim_or_no.shape) == 2: # when the predictor is the Keras deep network
        claim_or_no = claim_or_no[:,1]
    
    v = valid.copy()
    data_transformation_claim_type_classifier(v)
    v = v.drop('CLAIM_TYPE', axis=1)
    claim_type = claim_type_predictor.predict(v)
    if type(claim_type_predictor).__name__ == 'BaggingClassifier':
        claim_type = label_binarize(claim_type, classes=["WITHDRAWAL","DAMAGED","DIFFERENT",
                                                           "SELLER_CANCEL_POSTERIORI","NOT_RECEIVED",
                                                           "UNDEFINED","FAKE"])
    
    preds = []
    no_claim_Y = np.zeros(8); no_claim_Y[0] = 1
    for i in range(len(claim_or_no)):
        if claim_or_no[i] > 0.5:
            preds.append(np.concatenate(([0], claim_type[i])))
        else:
            preds.append(no_claim_Y)
    v_Y = valid.copy().CLAIM_TYPE
    v_Y_multiclass = label_binarize(v_Y, classes=["-", "WITHDRAWAL","DAMAGED","DIFFERENT",
                                                           "SELLER_CANCEL_POSTERIORI","NOT_RECEIVED",
                                                           "UNDEFINED","FAKE"])
    s = auc_weighted(v_Y_multiclass, preds)
    print("AUC weighted :", s)
    
    def claim_type_from_categorical(claims):
        classes = ["-","WITHDRAWAL","DAMAGED","DIFFERENT","SELLER_CANCEL_POSTERIORI",
             "NOT_RECEIVED", "UNDEFINED","FAKE"]
        def claim_type(p):
            return classes[np.argmax(p)]
        return np.array([claim_type(c) for c in claims])

    from sklearn.metrics import confusion_matrix
    preds = claim_type_from_categorical(preds)
    return confusion_matrix(v_Y, preds), v_Y, preds
    

In [None]:
cm, v_Y, pred_Y = eval_classifier(rf, knn)
print(cm)
pd.crosstab(v_Y, pred_Y)

In [None]:
eval_classifier(deep, knn)

In [None]:
eval_classifier(rf, bagging)

In [None]:
eval_classifier(deep, bagging)

### Performing predictions on the test set

In [None]:
def prediction_on_test(claim_predictor, claim_type_predictor):
    t = test_X.copy()
    data_transformation_claim_predictor(t)
    if type(claim_predictor).__name__ == 'Sequential':
        t = t.values
    claim_or_no = claim_predictor.predict(t)
    if len(claim_or_no.shape) == 2: # when the predictor is the Keras deep network
        claim_or_no = claim_or_no[:,1]
    
    t = test_X.copy()
    data_transformation_claim_type_classifier(t)
    claim_type = claim_type_predictor.predict(t)
    if type(claim_type_predictor).__name__ == 'BaggingClassifier':
        claim_type = label_binarize(claim_type, classes=["WITHDRAWAL","DAMAGED","DIFFERENT",
                                                           "SELLER_CANCEL_POSTERIORI","NOT_RECEIVED",
                                                           "UNDEFINED","FAKE"])
    
    preds = []
    no_claim_Y = np.zeros(8); no_claim_Y[0] = 1
    for i in range(len(claim_or_no)):
        if claim_or_no[i] > 0.5:
            preds.append(np.concatenate(([0], claim_type[i])))
        else:
            preds.append(no_claim_Y)
    return np.array(preds)

In [None]:
preds = prediction_on_test(rf, bagging)

In [None]:
save_claims_prediction(preds)