<a href="https://colab.research.google.com/github/PaulinaHeine/Online-Shop-Return-Shipment-Prediction/blob/main/Online_Shop_Return_Shipment_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Import Dataset and packages***

In [1]:
import csv
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import numpy as np
import sklearn

df_train = pd.read_csv('/content/drive/My Drive/training.csv')
df_test = pd.read_csv('/content/drive/My Drive/test.csv')
df_sample = pd.read_csv('/content/drive/My Drive/sample_prediction.csv')

Mounted at /content/drive


# ***Feature Engineering & Preprocessing***

Below I add my engineered features, first to the training set and then to the test set. Both data frames must end up with the same columns so that I can fit a model on the training data and predict return probabilities for the test data.

## Feature 1 Age

This feature is the customer's age. Age might influence the probability of sending an item back. Older people might be put off by the technical side of returning items, or they might think longer about what they order, so their chance of returning something is lower. The younger generation uses delivery services a lot and grew up with these processes, so they may send items back more often.

In [2]:
import datetime

In [3]:
#Train
# replace impossible dates of birth with NaN
df_train["dateOfBirth"] = df_train["dateOfBirth"].replace(["1655-04-19", "?"], np.nan)

# add age in whole years
now = pd.Timestamp('now')
df_train['dateOfBirth'] = pd.to_datetime(df_train['dateOfBirth'], format='%Y-%m-%d')
# birth dates in the future are shifted back by 100 years
# (year-unit timedeltas are not supported by recent pandas versions, hence DateOffset and a day count)
df_train['dateOfBirth'] = df_train['dateOfBirth'].where(df_train['dateOfBirth'] < now, df_train['dateOfBirth'] - pd.DateOffset(years=100))
df_train['age'] = (now - df_train['dateOfBirth']).dt.days // 365.25

# fill NaNs with the mean age
df_train.age = df_train.age.fillna(df_train["age"].mean())


In [4]:
#Test
# replace impossible dates of birth with NaN
df_test["dateOfBirth"] = df_test["dateOfBirth"].replace(["?"], np.nan)

# add age in whole years
now = pd.Timestamp('now')
df_test['dateOfBirth'] = pd.to_datetime(df_test['dateOfBirth'], format='%Y-%m-%d')
# birth dates in the future are shifted back by 100 years
# (year-unit timedeltas are not supported by recent pandas versions, hence DateOffset and a day count)
df_test['dateOfBirth'] = df_test['dateOfBirth'].where(df_test['dateOfBirth'] < now, df_test['dateOfBirth'] - pd.DateOffset(years=100))
df_test['age'] = (now - df_test['dateOfBirth']).dt.days // 365.25

# fill NaNs with the mean age
df_test.age = df_test.age.fillna(df_test["age"].mean())
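
The train and test cells above repeat the same steps, so they could be folded into one function. The following is only a sketch under assumptions: `add_age` and the `demo` frame are hypothetical names, the reference date is fixed so the result is reproducible, and a year is approximated as 365.25 days.

```python
import numpy as np
import pandas as pd

def add_age(df, now, invalid=("?", "1655-04-19")):
    """Return a copy of df with an 'age' column in whole years (hypothetical helper)."""
    out = df.copy()
    # Treat placeholder and impossible dates as missing
    dob = out["dateOfBirth"].replace(list(invalid), np.nan)
    dob = pd.to_datetime(dob, format="%Y-%m-%d", errors="coerce")
    # Birth dates in the future are assumed to be off by one century
    dob = dob.where(dob < now, dob - pd.DateOffset(years=100))
    # Approximate whole years from the number of elapsed days
    age = (now - dob).dt.days // 365.25
    out["dateOfBirth"] = dob
    # Missing ages are replaced by the mean age of this frame
    out["age"] = age.fillna(age.mean())
    return out

demo = pd.DataFrame({"dateOfBirth": ["1980-06-15", "?", "1655-04-19", "1990-01-01"]})
demo = add_age(demo, now=pd.Timestamp("2020-12-31"))
print(demo["age"].tolist())
```

Called once for `df_train` and once for `df_test` with the same `now`, such a helper would keep both frames in step.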

## Feature 2 Difference between mean price per itemID and real price

It may help to know how much an item costs on average, based on its itemID. The difference between the real price and this mean price tells us whether a particular order was expensive or cheap for that item. This might influence the probability of a return, because an expensive item has to convince the customer more: the threshold for keeping it is higher.
Clearly, price first influences the decision to order, but our dataset contains only ordered products. If a product is cheaper than expected, customers will accept minor quality issues, whereas more expensive items need to be nearly perfect. The decision to keep the product is therefore made once its quality and fit can be checked.

In [5]:
# Train
#add mean price

# select the price column before averaging so that non-numeric columns are not averaged
df = df_train.groupby(by="itemID")["price"].mean()
df_train = pd.merge(df_train,df,on='itemID',how="right",suffixes=["_plain","_meaniID"])
#calculate the difference
df_train["diff_rrpID_price"] = df_train["price_plain"]-df_train["price_meaniID"]


In [6]:
#Test

#add mean price
# select the price column before averaging so that non-numeric columns are not averaged
df = df_test.groupby(by="itemID")["price"].mean()
df_test = pd.merge(df_test,df,on='itemID',how="right",suffixes=["_plain","_meaniID"])
#calculate the difference
df_test["diff_rrpID_price"] = df_test["price_plain"]-df_test["price_meaniID"]
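
As an alternative to the merge, the same difference can be sketched with `groupby(...).transform`, which keeps the original row order. The `orders` frame and its values are made up for illustration, and the column is called `price` here rather than `price_plain`.

```python
import pandas as pd

# Hypothetical toy orders: two items with different price levels
orders = pd.DataFrame({
    "itemID": [1, 1, 2, 2, 2],
    "price": [10.0, 20.0, 30.0, 30.0, 60.0],
})

# Mean price of each itemID, broadcast back to every order row
orders["price_meaniID"] = orders.groupby("itemID")["price"].transform("mean")

# Positive values: this order was more expensive than usual for the item
orders["diff_rrpID_price"] = orders["price"] - orders["price_meaniID"]
print(orders)
```

Because `transform` returns a result aligned with the input index, no later re-sorting would be needed for this feature.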


## Feature 3 Target-encoded itemID and returnShipment

This feature shows how likely an item is to be sent back, based on its itemID. My thinking was that some items are returned more often than others. For example, swimwear, underwear, and especially cosmetics might be returned only rarely, whereas clothing such as jackets and T-shirts is returned more often. This is why I wanted the per-item return probability in the model.

In [7]:

#Train

df = df_train.groupby(by="itemID")["returnShipment"].mean()
# the suffixes yield the columns returnShipment_plain (label) and returnShipment_prob_iID (encoding)
df_train = pd.merge(df_train,df,on='itemID',how="right",suffixes=["_plain","_prob_iID"])

In [8]:
#Test

# the return rate per item is learned on the training set
df = df_train.groupby(by="itemID")["returnShipment_plain"].mean()
df_test = df_test.merge(df,how="left",on="itemID")
df_test = df_test.rename(columns = {"returnShipment_plain":"returnShipment_prob_iID"})
# items that never occur in the training set get the mean of the encoded column
df_test["returnShipment_prob_iID"] = df_test["returnShipment_prob_iID"].fillna(df_test["returnShipment_prob_iID"].mean())
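
The same target-encoding pattern is used for several columns, so it can be sketched as one reusable function. `target_encode` and the toy frames are hypothetical. Unlike the cell above, this sketch fills unseen categories with the overall return rate of the training set, which is an assumption rather than what the notebook does.

```python
import pandas as pd

def target_encode(train, test, key, target, new_col):
    """Add the per-category mean of `target`, learned on train, to both frames."""
    rates = train.groupby(key)[target].mean()
    global_rate = train[target].mean()
    train = train.assign(**{new_col: train[key].map(rates)})
    # Categories unseen in training fall back to the overall return rate
    test = test.assign(**{new_col: test[key].map(rates).fillna(global_rate)})
    return train, test

train = pd.DataFrame({"itemID": [1, 1, 2, 2], "returnShipment": [1, 0, 1, 1]})
test = pd.DataFrame({"itemID": [1, 2, 3]})
train, test = target_encode(train, test, "itemID", "returnShipment",
                            "returnShipment_prob_iID")
print(test)
```

Using `map` instead of a merge keeps the row order of both frames unchanged.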

## Feature 4 Target encoding manufacturerID

My assumption was that manufacturers produce at different levels of quality, so clothing from fast-fashion manufacturers might be returned more often because of quality issues than high-quality clothing.

In [9]:
#Train
# select the label column before averaging so that non-numeric columns are not averaged
df = df_train.groupby(by="manufacturerID")["returnShipment_plain"].mean()

df_train = pd.merge(df_train,df,on='manufacturerID',how="right",suffixes=["","_prob_manuf"])
df_train = df_train.rename(columns = {"returnShipment_plain_prob_manuf":"returnShipment_prob_manuf"})


In [10]:
#Test
# the return rate per manufacturer is learned on the training set
df = df_train.groupby(by="manufacturerID")["returnShipment_plain"].mean()

df_test = df_test.merge(df,on="manufacturerID",how="left")
df_test = df_test.rename(columns = {"returnShipment_plain":"returnShipment_prob_manuf"})

## Feature 5 Target encoding color

Some colors are neutral (like grey, blue, black) and others are more unusual (like lemon, leopard, bronze). Items in neutral colors may be more likely to be kept because they suit nearly everyone, whereas extreme colors need to suit the individual customer. Maybe the lemon-colored jacket looked nice on the website, but in real life it looks horrible on the customer. This is less likely to happen with a black jacket: other issues can still lead to a return, but the color is unlikely to be the reason. So I calculate the probability of returning an item per color on the training dataset and transfer these probabilities to the test set.

In [11]:
#Train

# fix misspelled color
df_train["color"]=df_train["color"].replace(["brwon"], ["brown"])

# target-encode color
df=df_train.groupby(by="color")["returnShipment_plain"].mean()

df_train = df_train.merge(df,on='color',how="left",suffixes=["","_prob_col"])
df_train = df_train.rename(columns = {"returnShipment_plain_prob_col":"returnShipment_prob_col"})


In [12]:
#Test

# fix misspelled color
df_test["color"]=df_test["color"].replace(["brwon"], ["brown"])

# target-encode color with the rates learned on the training set
df=df_train.groupby(by="color")["returnShipment_plain"].mean()

df_test = df_test.merge(df,on="color",how="left")
df_test = df_test.rename(columns = {"returnShipment_plain":"returnShipment_prob_col"})


## Feature 6 Target-encoded size

The size column contains sizes in many units: children's sizes, shoe sizes, EU sizes, and more. At first I wanted to categorize the sizes by unit, but this could lead to mistakes, because there is, for example, a shoe size 39 and an EU clothing size 39. I also suspect that one itemID can carry several spellings of the same size. If shoes have itemID 1, there could be a 39 and a 39+ that mean the same thing and differ only because of the person who entered them. So I simply group by size to learn which sizes are more likely to be returned, independently of the itemID.

In [13]:
#Train

# select the label column before averaging so that non-numeric columns are not averaged
df = df_train.groupby(by="size")["returnShipment_plain"].mean()

df_train = pd.merge(df_train,df,on='size',how="right",suffixes=["","_prob_size"])
df_train = df_train.rename(columns = {"returnShipment_plain_prob_size":"returnShipment_prob_size"})


In [14]:
#Test

# the return rate per size is learned on the training set
df = df_train.groupby(by="size")["returnShipment_plain"].mean()

df_test = df_test.merge(df,on="size",how="left")
df_test = df_test.rename(columns = {"returnShipment_plain":"returnShipment_prob_size"})

# sizes that never occur in the training set get the mean of the encoded column
df_test["returnShipment_prob_size"] = df_test["returnShipment_prob_size"].fillna(df_test["returnShipment_prob_size"].mean())


## Feature 7 Delivery Days 

Even though this is more a preprocessing step, the delivery duration may have a huge impact on whether an item is returned. If delivery takes too long, the customer may already have ordered a replacement, or may simply be unhappy about the long delivery duration and not want to keep the item. If it arrives a few days after ordering, the customer gets it in time and is hopefully not annoyed about the wait.

In [15]:
#Train

df_train["DeliveryDuration"] = df_train["DeliveryDuration"].str.replace('days', '')
df_train['DeliveryDuration'] = df_train['DeliveryDuration'].astype(float)
df_train["DeliveryDuration"] = df_train["DeliveryDuration"].fillna(df_train["DeliveryDuration"].mean())



In [16]:
#Test

df_test["DeliveryDuration"] = df_test["DeliveryDuration"].str.replace('days', '')
df_test['DeliveryDuration'] = df_test['DeliveryDuration'].astype(float)
df_test.DeliveryDuration = df_test.DeliveryDuration.fillna(df_test["DeliveryDuration"].mean())
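
A slightly more defensive way to parse the duration strings is sketched below. The raw values, including the `"?"` placeholder, are assumptions about what the column might contain, and `raw` and `days` are illustrative names.

```python
import pandas as pd

# Hypothetical raw values; the real column may look different
raw = pd.Series(["3 days", "12 days", "?", None])

# Pull out the (optionally negative) integer; entries without digits become NaN
days = pd.to_numeric(raw.str.extract(r"(-?\d+)")[0], errors="coerce")

# Replace missing durations with the mean duration
days = days.fillna(days.mean())
print(days.tolist())
```

Extracting the number with a regular expression avoids a conversion error when a value contains something other than digits and the word "days".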


## Feature 8 Customer ID

# ***Prepare the data***

I won't standardize or normalize the data, because I am going to use the Gradient Boosting classifier. It is built from decision trees and does not use any distance measure, so standardization or normalization is not necessary. Scaling might make the data easier to interpret, but because this homework focuses on prediction, I try to keep things as simple as possible.
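
The claim that scaling does not matter for tree-based boosting can be checked on synthetic data. This is only an illustrative sketch: the data come from `make_classification`, not from the shop dataset, and all variable names are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_scaled = X_demo * 8.0  # rescale every feature by a constant factor

p_raw = (GradientBoostingClassifier(n_estimators=20, random_state=0)
         .fit(X_demo, y_demo).predict_proba(X_demo))
p_scaled = (GradientBoostingClassifier(n_estimators=20, random_state=0)
            .fit(X_scaled, y_demo).predict_proba(X_scaled))

# Tree splits depend only on the ordering of feature values,
# so the two models should produce the same probabilities
same = np.allclose(p_raw, p_scaled)
print(same)
```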

Now I sort the test data frame, because the merges used to add features changed the row order. I need it ordered by Id because I will drop the Id column: it should not be a feature, since it should not influence the probability of a return shipment. I add the Id back after making the predictions, so the rows must be in the correct order.


I also drop some columns from the train and test data frames. I drop size, color, itemID and manufacturerID because these are unencoded categorical data. I drop dateOfBirth because the model already contains age. I drop state and salutation because I don't think that a person's salutation or home state influences how likely they are to return an item. I also drop price_meaniID, because only the difference from the real price might influence the probability of a return shipment.



In [17]:
df_test = df_test.sort_values(by="Id")

In [18]:
df_train = df_train.drop([ "size","color","dateOfBirth","state","salutation","price_meaniID","Id","itemID","manufacturerID"], axis =1)
df_test = df_test.drop([ "size","color","dateOfBirth","state","salutation","price_meaniID","Id","itemID","manufacturerID"], axis =1)


Now, the two datasets contain the same columns, apart from the return shipment column, which appears only in the training dataset. Because of that, I am going to split the training dataset into X and y. X now contains the exact same columns as df_test. y contains the outcome of X. 

In [19]:
y = df_train["returnShipment_plain"]
X = df_train.drop(["returnShipment_plain"], axis =1)
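
Since the model is fitted on `X` and later applied to `df_test`, it can be worth checking that both frames have the same columns in the same order. The sketch below uses two small made-up frames (`X_demo`, `test_demo`) in place of the real ones.

```python
import pandas as pd

# Hypothetical stand-ins for X and df_test with the columns in different order
X_demo = pd.DataFrame({"age": [30.0], "price_plain": [19.9], "DeliveryDuration": [3.0]})
test_demo = pd.DataFrame({"DeliveryDuration": [5.0], "age": [41.0], "price_plain": [59.9]})

# Same set of columns, regardless of order?
same_columns = set(X_demo.columns) == set(test_demo.columns)

# Reorder the test columns so they match the training matrix exactly
test_demo = test_demo[X_demo.columns]
print(same_columns, list(test_demo.columns))
```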

In the next step, I tune the hyperparameters of my model. To do that, I set aside 3% of the training data (12,000 data points) and tune the hyperparameters on it. I use the remaining 97% of df_train to fit the model.

In [20]:
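from sklearn.model_selection import train_test_split
# test_size=0.97: the 3% "train" part (X_train_h, y_train_h) is used for hyperparameter tuning,
# the 97% "test" part (X_test_rest, y_test_rest) is used later to fit the final model
X_train_h, X_test_rest, y_train_h, y_test_rest = train_test_split(X,y,test_size=0.97, random_state=42)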
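
How `test_size=0.97` divides the rows can be seen on a small synthetic array. The arrays and the names `X_small` and `X_large` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 hypothetical rows with two features and a binary label
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

# test_size=0.97 leaves about 3% in the first pair and about 97% in the second
X_small, X_large, y_small, y_large = train_test_split(
    X_demo, y_demo, test_size=0.97, random_state=42)
print(len(X_small), len(X_large))
```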

# ***Hyperparameter tuning***

First, I'd like to explain why I use Gradient Boosting for this task. The Gradient Boosting classifier is very flexible and provides many hyperparameters to tune, which can improve the final score. It may overfit the training set, depending on the values of "subsample" and "learning_rate". But with the default values of these parameters and 10-fold cross-validation on a subset of this size, the danger of overfitting is relatively low.

Now I set up a grid of hyperparameter values and search for the best combination.

In [21]:
param_f={
    'loss': ["deviance", "exponential"],  # "deviance" is called "log_loss" from scikit-learn 1.1 on
    'n_estimators': [75,50],
    'random_state':[42], # fixed at 42, not tuned; a fixed random_state kept the scores comparable during my work
    'max_depth':[3,4,2]
    }
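
The number of fits that a grid search over this grid needs can be worked out in advance with `ParameterGrid`. This is a small sketch: the grid is copied from the cell above, and `n_folds = 10` mirrors the `cv=10` setting used in the search.

```python
from sklearn.model_selection import ParameterGrid

param_f = {
    "loss": ["deviance", "exponential"],
    "n_estimators": [75, 50],
    "random_state": [42],
    "max_depth": [3, 4, 2],
}

# 2 losses x 2 estimator counts x 1 random state x 3 depths
n_candidates = len(ParameterGrid(param_f))
n_folds = 10
print(n_candidates, n_candidates * n_folds)
```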

In [22]:
'''
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gbc = GradientBoostingClassifier()

gbc_cv=GridSearchCV(gbc,param_f,cv=10,scoring="roc_auc",verbose=1)

gbc_cv.fit(X_train_h,y_train_h)

print(gbc_cv.best_params_,gbc_cv.best_score_)

Fitting 10 folds for each of 12 candidates, totalling 120 fits
{'loss': 'exponential', 'max_depth': 3, 'n_estimators': 75, 'random_state': 42} 0.7292578987717302
'''


The best values are:

 loss: 'exponential'

 max_depth: 3

 n_estimators: 75

 random_state: 42

 with a roc_auc of: 0.7292578987717302

Now I hard-code these values and fit a model with them on the rest of the dataset, which was not used for hyperparameter tuning. Before I do that I split the remaining data again, so that I have a very small test set for evaluation. I know that you are going to evaluate the model yourself, but I want an impression of how good it is.

# ***Fitting the model***

In [23]:
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier

#insert the best values
gbc = GradientBoostingClassifier(
    loss="exponential",
    max_depth=3,
    n_estimators=75,
    random_state=42)

# fit on the 97% of the training data that was not used for tuning
gbc = gbc.fit(X_test_rest, y_test_rest)
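
A small hold-out evaluation with `roc_auc_score` could look like the sketch below. It runs on synthetic data from `make_classification`, so the printed score says nothing about the shop data. The 10% validation share and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 97% part of the training data
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)

# Hold back a small validation set that the model never sees during fitting
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo)

model = GradientBoostingClassifier(
    loss="exponential", max_depth=3, n_estimators=75, random_state=42)
model.fit(X_fit, y_fit)

# ROC AUC is computed from the predicted probability of the positive class
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation ROC AUC: {val_auc:.3f}")
```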


Now the model is fitted, and the following code shows which features are important for the model.

In [24]:
for fn, fi in sorted(zip(X.columns, gbc.feature_importances_), key=lambda xx: xx[1], reverse=True):
  print(f"{fn}: {fi:.3f}")

DeliveryDuration: 0.523
returnShipment_prob_iID: 0.422
age: 0.021
price_plain: 0.020
diff_rrpID_price: 0.010
returnShipment_prob_size: 0.002
returnShipment_prob_col: 0.001
returnShipment_prob_manuf: 0.001


# ***Create prediction file***

In [None]:
y_pred = gbc.predict_proba(df_test)

df_pred = pd.DataFrame(y_pred[:,1], columns = ['prediction'])

df_pred["Id"] = range(400000,531170)

df_pred = df_pred[['Id','prediction']]

#df_pred.to_csv(r'/content/drive/My Drive/HW5_PaulinaHeine.csv',index=False)
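
Instead of rebuilding the Id column from a hard-coded range, the Ids could be remembered before the column is dropped. The following sketch uses a tiny made-up frame and a stand-in for `gbc.predict_proba`, so every name and value in it is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical test frame whose rows are out of order after the merges
test_demo = pd.DataFrame({"Id": [400002, 400000, 400001], "age": [30.0, 41.0, 25.0]})
test_demo = test_demo.sort_values(by="Id")

ids = test_demo["Id"].to_numpy()       # remember the Ids before dropping them
features = test_demo.drop(columns="Id")

# Stand-in for gbc.predict_proba(features): columns are P(class 0) and P(class 1)
proba = np.column_stack([1 - features["age"] / 100, features["age"] / 100])

# Each prediction is paired with the Id of the row it was computed from
pred = pd.DataFrame({"Id": ids, "prediction": proba[:, 1]})
print(pred)
```

This way the pairing of Id and prediction does not depend on the test set having exactly the expected number of rows.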


This is the prediction file that I have uploaded as well.

In [None]:
df_pred


| | Id | prediction |
|---|---|---|
| 0 | 400000 | 0.347593 |
| 1 | 400001 | 0.484777 |
| 2 | 400002 | 0.630273 |
| 3 | 400003 | 0.614342 |
| 4 | 400004 | 0.000588 |
| ... | ... | ... |
| 131165 | 531165 | 0.073278 |
| 131166 | 531166 | 0.567946 |
| 131167 | 531167 | 0.277272 |
| 131168 | 531168 | 0.527310 |
