
# <center>Predicting the returns of orders  for a retail shoe seller</center>




## Introduction
###  Challenge SD210 2018
#### Authors :  Florence D'Alché & Umut Şimşekli & Moussab Djerrab


**Context of the challenge:**

An e-commerce company sells shoes and has a high return rate on its products, above 20%. This large number of returns and exchanges has a negative impact on its margin. To remedy this problem, the company wants to better understand the phenomenon and to have tools that quantify the probability of return for a given product. It makes available its database of orders placed between October 2011 and October 2015, its product feedback data, and its customer and product databases (a data dictionary is provided).

**Goal of the challenge:**
<ul>
<li>Identify the conditions that favor product returns (e.g. what type of product is usually returned, which customers are more likely to return a product, what type of order or purchase context most often leads to returns).</li>
<li>Build a model that forecasts the return of each product in a shopping cart.</li>
</ul>

To go further: this project aims to bring out purchasing behaviors. With this knowledge, the e-merchant wishes to better plan its activity. In particular, it wants to forecast the turnover generated by its clients.



**Training data:**

The training dataset contains $N = 1067290$ order lines. For each order line, it reports whether the order has been returned (***ReturnQuantityBin***) and the quantity returned (***ReturnQuantity***). The target column is ***ReturnQuantityBin***, a binary column ($y = 1$ if returned and $y = 0$ otherwise).
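
As a quick check on the target described above, the share of returned lines can be read directly from the binary column. The sketch below uses a small made-up frame in place of `y_train.csv`: the column names come from the challenge, the values are hypothetical, and the relation between the two columns is an assumption rather than something the challenge text states.

```python
import pandas as pd

# Hypothetical stand-in for y_train.csv: one row per order line
labels_toy = pd.DataFrame({
    "ReturnQuantity":    [0, 1, 0, 0, 2, 0, 0, 1, 0, 0],
    "ReturnQuantityBin": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Assumption: the binary target is 1 exactly when at least one unit was returned
rebuilt = (labels_toy["ReturnQuantity"] > 0).astype(int)
assert (rebuilt == labels_toy["ReturnQuantityBin"]).all()

# The mean of a 0/1 column is the fraction of returned lines
return_rate = labels_toy["ReturnQuantityBin"].mean()
print("Return rate: %.2f" % return_rate)
```

On the real data, a return rate far from the "more than 20%" quoted in the context would be a sign that the labels were loaded incorrectly.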

**Test data:**

The test data contain $N_\text{test} = 800468$ lines of orders. Everything else is similar to the training data.


## Additional Data

As part of the challenge, two additional datasets are available, namely **customers.csv** and **products.csv**. These two sets contain information on the customers and on the products. A good prediction model will necessarily require extracting information from these datasets. Students are free to use these data as they see fit. Please keep in mind that both sets also contain customers and products that are not present in the training or test sets.
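
One possible way to bring this information into the order lines is a left join on the identifier columns. The sketch below works on hypothetical miniature tables. It assumes that the customer and product files share the keys `CustomerId` and `VariantId` with the order lines, which should be checked against the data dictionary. The `Gender` and `Season` columns are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
orders_toy = pd.DataFrame({
    "OrderNumber": [1, 1, 2],
    "CustomerId":  [10, 10, 11],
    "VariantId":   [100, 101, 100],
})
customers_toy = pd.DataFrame({
    "CustomerId": [10, 11, 12],      # customer 12 never appears in the orders
    "Gender":     ["F", "M", "F"],   # hypothetical column
})
products_toy = pd.DataFrame({
    "VariantId": [100, 101, 102],    # variant 102 never appears in the orders
    "Season":    ["Winter", "Summer", "Winter"],  # hypothetical column
})

# Left joins keep exactly one row per order line, in the original order,
# and ignore customers or products that are absent from the orders.
# validate="many_to_one" raises an error if a key is duplicated on the right.
enriched = (
    orders_toy
    .merge(customers_toy, on="CustomerId", how="left", validate="many_to_one")
    .merge(products_toy, on="VariantId", how="left", validate="many_to_one")
)
print(enriched)
```

Keeping the row order unchanged matters here, because the predictions must be submitted in the order of the test set.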

A dictionary of variables (**dictionnary.xlsx**) is available in the folder containing the datasets. Please refer to it for the definition of each variable.


## The goal and the performance criterion

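In this challenge, we use an evaluation metric that is common in binary prediction, namely the ROC AUC criterion. **The closer to 1 the better (be worried if it is below 0.5, the score of a random guess).**
Since this metric is computed from scores rather than hard labels, the file to submit contains one predicted probability per line, in the following form: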


| <center> probability </center>  |
| ------------- |
| <center> .90  </center>         |
| <center> ...  </center>         |
| <center> .42  </center>         |


The order of the probabilities needs to respect the order in the test set.
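
To make the metric concrete, here is a minimal sketch on made-up labels and probabilities (none of these numbers come from the challenge data). It illustrates that ROC AUC only depends on how the lines are ranked: constant scores give 0.5, and reversing the ranking turns a score $a$ into $1 - a$.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities of return
y_true = np.array([0, 0, 1, 1, 0, 1])
p_good = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

# Fraction of (returned, not returned) pairs ranked in the right order
auc_good = roc_auc_score(y_true, p_good)

# Uninformative scores: every line gets the same probability
auc_constant = roc_auc_score(y_true, np.full(len(y_true), 0.5))

# Reversed scores: the ranking is inverted
auc_reversed = roc_auc_score(y_true, 1.0 - p_good)

print(auc_good, auc_constant, auc_reversed)
```

A score clearly below 0.5 therefore usually points to a bug, such as submitting the probability of the wrong class or misaligned rows, rather than to a weak model.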



# Training Data

https://www.dropbox.com/sh/uo4oudw43j45mp3/AACA0UqkitNKSWdE_7fs2Wbla?dl=0


In [1]:
from __future__ import division  # no effect on Python 3, kept for Python 2 compatibility
import os
import random

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

## Loading the data

In [2]:
customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")
X_train = pd.read_csv("X_train.csv")
X_test   = pd.read_csv("X_test.csv")
y_train = pd.read_csv("y_train.csv")

## Defining a feature transformation

In [3]:
def funk_mask(d):
    """Drop identifier/date columns, parse prices and one-hot encode text columns."""
    # Identifier, date and postal-code columns are left out of the features
    columns_ext = ["OrderCreationDate", "OrderNumber", "VariantId", "CustomerId",
                   "OrderShipDate", "BillingPostalCode"]
    X1 = d.loc[:, [xx for xx in d.columns if xx not in columns_ext]].copy()
    # UnitPMPEUR is stored as text with a decimal comma: "12,5" becomes 12.5
    X1["UnitPMPEUR"] = (X1["UnitPMPEUR"].astype(str)
                        .str.replace(",", ".", regex=False).astype(np.float64))
    # One-hot encode the remaining text (object) columns
    columns2bin = [x for x in X1.columns if X1[x].dtype == np.dtype('O')]
    X2 = pd.get_dummies(X1.loc[:, columns2bin])
    X1 = X1.loc[:, [xx for xx in X1.columns if xx not in columns2bin]]
    res = pd.concat([X1, X2], axis=1)
    # Missing values are replaced by 0
    res = res.fillna(0)
    return res
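
The main steps of `funk_mask` can be illustrated on a toy frame. In the sketch below, only `OrderNumber` and `UnitPMPEUR` are column names taken from the code above. `Channel` and `Quantity` are hypothetical columns added to show how text and numeric features are treated.

```python
import numpy as np
import pandas as pd

# Hypothetical order lines
toy = pd.DataFrame({
    "OrderNumber": [1, 2, 3],
    "UnitPMPEUR":  ["12,5", "30", "7,25"],    # prices written with a decimal comma
    "Channel":     ["web", "mobile", "web"],  # hypothetical text column
    "Quantity":    [1, 2, 1],                 # hypothetical numeric column
})

# 1. Drop identifier columns
features = toy.drop(columns=["OrderNumber"])

# 2. Turn "12,5" into the float 12.5
features["UnitPMPEUR"] = (
    features["UnitPMPEUR"].str.replace(",", ".", regex=False).astype(np.float64)
)

# 3. One-hot encode the remaining text columns; numeric columns are kept as-is
features = pd.get_dummies(features)
print(features)
```

Each distinct value of a text column becomes its own indicator column, so columns with many distinct values can make the feature matrix very wide.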

## Applying the mask

In [4]:
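x1 = funk_mask(X_train)
x2 = funk_mask(X_test)
# One-hot encoding can create different columns in train and test:
# keep only the columns present in both
seleckt_columns = np.intersect1d(x1.columns, x2.columns)
x1 = x1.loc[:, seleckt_columns]
x2 = x2.loc[:, seleckt_columns]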
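
The intersection is needed because a text column may take values in one set that never occur in the other, so the two encoded frames do not have the same columns. The sketch below shows this on a hypothetical `Color` column. It also shows an alternative, `reindex`, which keeps every training column instead of dropping the unmatched ones. This is only one possible design choice, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical text column with different values in train and test
train_toy = pd.get_dummies(pd.DataFrame({"Color": ["black", "red", "blue"]}))
test_toy = pd.get_dummies(pd.DataFrame({"Color": ["black", "green"]}))

# Option used above: keep only the columns present in both sets
common = train_toy.columns.intersection(test_toy.columns)
print(list(common))

# Alternative: keep every training column, fill those missing from the
# test set with 0, and drop test-only columns such as Color_green
test_aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(list(test_aligned.columns))
```

With the intersection, information about values seen only in training is lost. With `reindex`, it is kept, at the price of columns that are all zero in the test set.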

## Supervised learning : Logistic regression model

In [12]:
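# Fit on the first 50000 order lines only
clf = LogisticRegression()
clf.fit(x1.iloc[:50000], y_train.ReturnQuantityBin.iloc[:50000])
# Column 0 holds P(y = 0) and column 1 holds P(y = 1) for each test line
y_tosubmit = clf.predict_proba(x2.loc[:, x1.columns])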

## Score of our prediction on the training data

In [13]:
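# Score the first 100000 training lines; the first 50000 were used for
# fitting, so this estimate is optimistic
yres = clf.predict_proba(x1.iloc[:100000])
roc_auc_score(y_train.ReturnQuantityBin.iloc[:100000], yres[:, 1])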
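
About half of the lines scored above were also used to fit the model, so this score tends to overestimate what the model will achieve on the test set. A common remedy is to hold out rows that the model never sees during fitting. The sketch below illustrates the idea on synthetic data generated with `make_classification`, which stands in for `x1` and `y_train.ReturnQuantityBin`. The split ratio and `max_iter` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the binary target
X_syn, y_syn = make_classification(n_samples=2000, n_features=10, random_state=0)

# Hold out 25% of the rows; stratify keeps the same share of positives in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# Column 1 of predict_proba is the probability of the positive class (y = 1)
auc_fit = roc_auc_score(y_fit, model.predict_proba(X_fit)[:, 1])
auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print("AUC on fitted rows: %.3f, on held-out rows: %.3f" % (auc_fit, auc_val))
```

The held-out score is the one to compare between models, since it is measured on rows that played no part in the fit.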

# Submission to the system
np.savetxt('y_pred.txt', y_tosubmit[:,1], fmt='%f')
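
Before uploading, it can be worth reading the file back to confirm that it holds one valid probability per test line, in the original order. The sketch below does this with three made-up probabilities and a temporary file. The file name `y_pred_example.txt` is hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical probabilities standing in for y_tosubmit[:, 1]
proba = np.array([0.90, 0.13, 0.42])

# Same savetxt call as above, written to a temporary location
path = os.path.join(tempfile.mkdtemp(), "y_pred_example.txt")
np.savetxt(path, proba, fmt="%f")

# Read the file back and check what the scoring system would receive
reloaded = np.loadtxt(path)
assert reloaded.shape == proba.shape              # one value per line
assert np.all((reloaded >= 0) & (reloaded <= 1))  # valid probabilities
assert np.allclose(reloaded, proba, atol=1e-6)    # order and values preserved
```

On the real submission, the expected number of lines is $N_\text{test} = 800468$.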


# <center> That's all folks; Good Luck! </center>