# DataCo Smart Supply Chain Model Testing

## Importing libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Inspecting the data structure

In [2]:
df = pd.read_csv('DataCoSupplyChainDataset.csv', encoding='unicode_escape')
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,Type,Days for shipping (real),Days for shipment (scheduled),Benefit per order,Sales per customer,Delivery Status,Late_delivery_risk,Category Id,Category Name,Customer City,Customer Country,Customer Email,Customer Fname,Customer Id,Customer Lname,Customer Password,Customer Segment,Customer State,Customer Street,Customer Zipcode,Department Id,Department Name,Latitude,Longitude,Market,Order City,Order Country,Order Customer Id,order date (DateOrders),Order Id,Order Item Cardprod Id,Order Item Discount,Order Item Discount Rate,Order Item Id,Order Item Product Price,Order Item Profit Ratio,Order Item Quantity,Sales,Order Item Total,Order Profit Per Order,Order Region,Order State,Order Status,Order Zipcode,Product Card Id,Product Category Id,Product Description,Product Image,Product Name,Product Price,Product Status,shipping date (DateOrders),Shipping Mode
0,DEBIT,3,4,91.25,314.640015,Advance shipping,0,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Cally,20755,Holloway,XXXXXXXXX,Consumer,PR,5365 Noble Nectar Island,725.0,2,Fitness,18.251453,-66.037056,Pacific Asia,Bekasi,Indonesia,20755,1/31/2018 22:56,77202,1360,13.11,0.04,180517,327.75,0.29,1,327.75,314.640015,91.25,Southeast Asia,Java Occidental,COMPLETE,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,2/3/2018 22:56,Standard Class
1,TRANSFER,5,4,-249.089996,311.359985,Late delivery,1,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Irene,19492,Luna,XXXXXXXXX,Consumer,PR,2679 Rustic Loop,725.0,2,Fitness,18.279451,-66.037064,Pacific Asia,Bikaner,India,19492,1/13/2018 12:27,75939,1360,16.389999,0.05,179254,327.75,-0.8,1,327.75,311.359985,-249.089996,South Asia,Rajastán,PENDING,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,1/18/2018 12:27,Standard Class
2,CASH,4,4,-247.779999,309.720001,Shipping on time,0,73,Sporting Goods,San Jose,EE. UU.,XXXXXXXXX,Gillian,19491,Maldonado,XXXXXXXXX,Consumer,CA,8510 Round Bear Gate,95125.0,2,Fitness,37.292233,-121.881279,Pacific Asia,Bikaner,India,19491,1/13/2018 12:06,75938,1360,18.030001,0.06,179253,327.75,-0.8,1,327.75,309.720001,-247.779999,South Asia,Rajastán,CLOSED,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,1/17/2018 12:06,Standard Class
3,DEBIT,3,4,22.860001,304.809998,Advance shipping,0,73,Sporting Goods,Los Angeles,EE. UU.,XXXXXXXXX,Tana,19490,Tate,XXXXXXXXX,Home Office,CA,3200 Amber Bend,90027.0,2,Fitness,34.125946,-118.291016,Pacific Asia,Townsville,Australia,19490,1/13/2018 11:45,75937,1360,22.940001,0.07,179252,327.75,0.08,1,327.75,304.809998,22.860001,Oceania,Queensland,COMPLETE,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,1/16/2018 11:45,Standard Class
4,PAYMENT,2,4,134.210007,298.25,Advance shipping,0,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Orli,19489,Hendricks,XXXXXXXXX,Corporate,PR,8671 Iron Anchor Corners,725.0,2,Fitness,18.253769,-66.037048,Pacific Asia,Townsville,Australia,19489,1/13/2018 11:24,75936,1360,29.5,0.09,179251,327.75,0.45,1,327.75,298.25,134.210007,Oceania,Queensland,PENDING_PAYMENT,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,1/15/2018 11:24,Standard Class


After viewing the first five rows, we dig deeper into the data. First, we count the numerical and categorical features based on each column's dtype.

In [3]:
num_feat = [n for n in df.columns if df[n].dtypes!='O']
cat_feat = [c for c in df.columns if df[c].dtypes=='O']
print('Number of numerical features: ', len(num_feat))
print('Number of categorical features: ', len(cat_feat))

Number of numerical features:  29
Number of categorical features:  24
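
The dtype-based split above can also be expressed with `DataFrame.select_dtypes`. The sketch below uses a small toy frame (its values are illustrative, not taken from the dataset) and the hypothetical names `toy`, `num_cols` and `cat_cols`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "Type": ["DEBIT", "CASH", "TRANSFER"],
    "Sales": [327.75, 309.72, 311.36],
    "Late_delivery_risk": [0, 1, 0],
})

# select_dtypes splits the columns by dtype without a manual comprehension.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical features:", num_cols)
print("Categorical features:", cat_cols)
```

Selecting on `"number"` rather than comparing against `'O'` also keeps working if string columns are stored with a dedicated string dtype instead of `object`.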


#### *Choosing Columns for Late Delivery Risk Prediction*

Before choosing the columns used to predict late delivery, we need to look at the description of each feature.

1. Type	:  Type of transaction made
2. Days for shipping (real)     	:  Actual shipping days of the purchased product
3. Days for shipment (scheduled)	:  Days of scheduled delivery of the purchased product
4. Benefit per order	:  Earnings per order placed
5. Sales per customer	:  Total sales made per customer
6. Delivery Status	:  Delivery status of orders: Advance shipping , Late delivery , Shipping canceled , Shipping on time
7. Late_delivery_risk           	:  Categorical variable that indicates whether the delivery is late (1) or not late (0).
8. Category Id	:  Product category code
9. Category Name	:  Description of the product category
10. Customer City	:  City where the customer made the purchase
11. Customer Country	:  Country where the customer made the purchase
12. Customer Email	:  Customer's email
13. Customer Fname	:  Customer first name
14. Customer Id	:  Customer ID
15. Customer Lname	:  Customer last name
16. Customer Password	:  Masked customer key
17. Customer Segment	:  Types of Customers: Consumer , Corporate , Home Office
18. Customer State	:  State to which the store where the purchase is registered belongs
19. Customer Street	:  Street to which the store where the purchase is registered belongs
20. Customer Zipcode	:  Customer Zipcode
21. Department Id	:  Department code of store
22. Department Name	:  Department name of store
23. Latitude	:  Latitude corresponding to location of store
24. Longitude	:  Longitude corresponding to location of store
25. Market	:  Market to where the order is delivered : Africa , Europe , LATAM , Pacific Asia , USCA
26. Order City	:  Destination city of the order
27. Order Country	:  Destination country of the order
28. Order Customer Id	:  Customer order code
29. order date (DateOrders)	:  Date on which the order is made
30. Order Id	:  Order code
31. Order Item Cardprod Id	:  Product code generated through the RFID reader
32. Order Item Discount	:  Order item discount value
33. Order Item Discount Rate     	:  Order item discount percentage
34. Order Item Id	:  Order item code
35. Order Item Product Price     	:  Price of products without discount
36. Order Item Profit Ratio	:  Order Item Profit Ratio
37. Order Item Quantity	:  Number of products per order
38. Sales	:  Value in sales
39. Order Item Total  	:  Total amount per order
40. Order Profit Per Order	:  Order Profit Per Order
41. Order Region	:  Region of the world where the order is delivered 
42. Order State	:  State of the region where the order is delivered
43. Order Status	:  Order Status 
44. Product Card Id	:  Product code
45. Product Category Id	:  Product category code
46. Product Description	:  Product Description
47. Product Image	:  Link of visit and purchase of the product
48. Product Name	:  Product Name
49. Product Price	:  Product Price
50. Product Status	:  Status of the product stock: 1 if the product is not available, 0 if it is available
51. Shipping date (DateOrders)   	:  Exact date and time of shipment
52. Shipping Mode	:  The following shipping modes are presented : Standard Class , First Class , Second Class , Same Day

Two factors commonly determine whether a delivery is late:
1. Distance travelled
2. Shipping Mode

These two are the most common causes of late delivery, but we also want to look at other factors:
1. Customer Segment
2. Total Order
3. Payment Type
4. Earnings
5. Product Category

We want to see how each of these factors correlates with late delivery.
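
One simple way to check how a categorical factor relates to lateness is to compare the share of late orders per category. The sketch below is an illustration on toy data, not part of the original analysis; `late_rate_by` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

# Toy data: two orders per shipping mode (illustrative values only).
toy = pd.DataFrame({
    "Shipping Mode": ["Standard Class", "Standard Class",
                      "First Class", "First Class",
                      "Same Day", "Same Day"],
    "Late_delivery_risk": [0, 1, 1, 1, 0, 1],
})

def late_rate_by(frame, column, target="Late_delivery_risk"):
    """Return the share of late orders for each value of `column`, highest first."""
    return frame.groupby(column)[target].mean().sort_values(ascending=False)

rates = late_rate_by(toy, "Shipping Mode")
print(rates)
```

Because the target is 0/1, the group mean is the fraction of late orders in that group, which makes categories directly comparable.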

In [4]:
late_pred = ['Longitude', 'Latitude', 'Order Country', 'Customer Segment', 'Type', 'Benefit per order','Shipping Mode','Late_delivery_risk']

In [5]:
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df_ship = df[late_pred].copy()
df_ship.head()

Unnamed: 0,Longitude,Latitude,Order Country,Customer Segment,Type,Benefit per order,Shipping Mode,Late_delivery_risk
0,-66.037056,18.251453,Indonesia,Consumer,DEBIT,91.25,Standard Class,0
1,-66.037064,18.279451,India,Consumer,TRANSFER,-249.089996,Standard Class,1
2,-121.881279,37.292233,India,Consumer,CASH,-247.779999,Standard Class,0
3,-118.291016,34.125946,Australia,Home Office,DEBIT,22.860001,Standard Class,0
4,-66.037048,18.253769,Australia,Corporate,PAYMENT,134.210007,Standard Class,0


In [6]:
df_ship.shape

(180519, 8)

In [7]:
df_ship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180519 entries, 0 to 180518
Data columns (total 8 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Longitude           180519 non-null  float64
 1   Latitude            180519 non-null  float64
 2   Order Country       180519 non-null  object 
 3   Customer Segment    180519 non-null  object 
 4   Type                180519 non-null  object 
 5   Benefit per order   180519 non-null  float64
 6   Shipping Mode       180519 non-null  object 
 7   Late_delivery_risk  180519 non-null  int64  
dtypes: float64(3), int64(1), object(4)
memory usage: 11.0+ MB


In [8]:
print(df_ship['Shipping Mode'].value_counts())

Standard Class    107752
Second Class       35216
First Class        27814
Same Day            9737
Name: Shipping Mode, dtype: int64


In [9]:
import warnings
warnings.filterwarnings('ignore')

# Count: number of orders in the dataset for each row's destination country
df_ship['Count'] = df_ship['Order Country'].map(df_ship['Order Country'].value_counts())
df_ship.head()

Unnamed: 0,Longitude,Latitude,Order Country,Customer Segment,Type,Benefit per order,Shipping Mode,Late_delivery_risk,Count
0,-66.037056,18.251453,Indonesia,Consumer,DEBIT,91.25,Standard Class,0,4204
1,-66.037064,18.279451,India,Consumer,TRANSFER,-249.089996,Standard Class,1,4783
2,-121.881279,37.292233,India,Consumer,CASH,-247.779999,Standard Class,0,4783
3,-118.291016,34.125946,Australia,Home Office,DEBIT,22.860001,Standard Class,0,8497
4,-66.037048,18.253769,Australia,Corporate,PAYMENT,134.210007,Standard Class,0,8497


In [12]:
df_ship.Count.dtypes

dtype('int64')

In [13]:
# Keep only the rows whose country order count lies between the first and third quartiles
Q1 = df_ship.Count.quantile(0.25)
Q3 = df_ship.Count.quantile(0.75)

df_ship = df_ship[~((df_ship.Count < Q1) | (df_ship.Count > Q3))]
df_ship.shape

(98313, 9)

In [14]:
# Drop 'Benefit per order' outliers that fall more than 1.5 * IQR outside the quartiles
Q1 = df_ship['Benefit per order'].quantile(0.25)
Q3 = df_ship['Benefit per order'].quantile(0.75)
IQR = Q3-Q1

df_ship = df_ship[~((df_ship['Benefit per order'] < (Q1 - 1.5 * IQR)) | (df_ship['Benefit per order'] > (Q3 + 1.5 * IQR)))]
df_ship.shape

(88066, 9)
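
The two filters above differ: the `Count` step keeps only rows between Q1 and Q3, while the `Benefit per order` step uses the usual 1.5 * IQR fences. A minimal sketch of the second rule as a reusable function follows; `iqr_filter` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_filter(frame, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends, matching the ~((x < lo) | (x > hi)) form
    return frame[frame[column].between(lower, upper)]

# Toy data with one obvious outlier (illustrative values only).
toy = pd.DataFrame({"Benefit per order": [10.0, 12.0, 11.0, 13.0, 12.5, -500.0]})
filtered = iqr_filter(toy, "Benefit per order")
print(filtered.shape)
```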

In [15]:
df_ship = df_ship.drop('Count', axis=1)

In [16]:
# Encode shipping modes as integers, ordered from slowest to fastest
cleanup = {'Shipping Mode': {"Standard Class": 0, "Second Class": 1, "First Class": 2, "Same Day": 3}}

df_ship = df_ship.replace(cleanup)
df_ship.head(2)

Unnamed: 0,Longitude,Latitude,Order Country,Customer Segment,Type,Benefit per order,Shipping Mode,Late_delivery_risk
0,-66.037056,18.251453,Indonesia,Consumer,DEBIT,91.25,0,0
3,-118.291016,34.125946,Australia,Home Office,DEBIT,22.860001,0,0


In [17]:
df_ship['Customer Segment'].value_counts()

Consumer       45381
Corporate      26612
Home Office    16073
Name: Customer Segment, dtype: int64

In [18]:
# Encode customer segments as integer codes
cleanup1 = {'Customer Segment': {"Consumer": 0, "Corporate": 1, "Home Office": 2}}

df_ship = df_ship.replace(cleanup1)
df_ship.head(2)

Unnamed: 0,Longitude,Latitude,Order Country,Customer Segment,Type,Benefit per order,Shipping Mode,Late_delivery_risk
0,-66.037056,18.251453,Indonesia,0,DEBIT,91.25,0,0
3,-118.291016,34.125946,Australia,2,DEBIT,22.860001,0,0


In [20]:
# One-hot encode the remaining object columns (Order Country and Type)
print(df_ship.shape)
df_ship = pd.get_dummies(df_ship)
df_ship.shape

(88066, 8)


(88066, 29)
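
The encoding steps above combine an integer mapping for `Shipping Mode` with one-hot encoding for the remaining text columns. The sketch below reproduces that pattern on a toy frame so the resulting columns are easy to inspect; `toy`, `shipping_order` and `encoded` are hypothetical names, and `Series.map` is used here in place of `DataFrame.replace`.

```python
import pandas as pd

# Toy frame with one ordered and one unordered categorical column.
toy = pd.DataFrame({
    "Shipping Mode": ["Standard Class", "Same Day", "First Class"],
    "Type": ["DEBIT", "CASH", "DEBIT"],
})

# Integer codes for shipping mode, same mapping as in the notebook.
shipping_order = {"Standard Class": 0, "Second Class": 1, "First Class": 2, "Same Day": 3}
toy["Shipping Mode"] = toy["Shipping Mode"].map(shipping_order)

# One-hot encode the unordered column; each category becomes its own column.
encoded = pd.get_dummies(toy, columns=["Type"])
print(encoded.columns.tolist())
```

One-hot encoding is why the column count grows from 8 to 29 in the notebook: every distinct value of `Order Country` and `Type` becomes a separate indicator column.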

In [22]:
df_ship['Late_delivery_risk'].value_counts()

1    48129
0    39937
Name: Late_delivery_risk, dtype: int64

In [23]:
from sklearn.model_selection import train_test_split

y=df_ship['Late_delivery_risk']
X=df_ship.drop('Late_delivery_risk',axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
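
The split above is a plain random split. Since the two classes are not perfectly balanced (48,129 late vs 39,937 not late), one option is to stratify on the target so both parts keep the same class ratio. This is a sketch on toy arrays, not what the notebook does; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 samples, 60% positive class.
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([1] * 30 + [0] * 20)

# stratify=y_toy preserves the 60/40 class ratio in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(y_tr.mean(), y_va.mean())
```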

In [25]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

logreg = LogisticRegression()
gnb = GaussianNB()
rfc = RandomForestClassifier()
tree = DecisionTreeClassifier()
sgd = SGDClassifier()
knc = KNeighborsClassifier()

In [26]:
from sklearn.metrics import roc_auc_score

In [29]:
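# Logistic Regression
# Note: roc_auc_score receives hard 0/1 predictions here and in the cells below,
# so each score is the area under a single-threshold ROC curve.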

logreg.fit(X_train, y_train)
ypred = logreg.predict(X_valid)
roc_auc_score(y_valid, ypred)

0.7010242762413136
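
The scores in this notebook are computed from hard class predictions. `roc_auc_score` is normally given continuous scores such as `predict_proba(...)[:, 1]`; with 0/1 labels it reduces to balanced accuracy. The sketch below shows both variants on synthetic data from `make_classification`, which is an assumption for illustration and unrelated to the DataCo dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Variant used in the notebook: hard labels as the score.
labels = model.predict(X_va)
auc_from_labels = roc_auc_score(y_va, labels)

# Usual variant: probability of the positive class as the score.
auc_from_scores = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

print("AUC from labels:", auc_from_labels)
print("Balanced accuracy:", balanced_accuracy_score(y_va, labels))
print("AUC from probabilities:", auc_from_scores)
```

The first two printed values coincide, which is why the notebook's numbers are best read as balanced accuracy rather than as a ranking-quality measure.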

In [30]:
#GaussianNB

gnb.fit(X_train, y_train)
ypred = gnb.predict(X_valid)
roc_auc_score(y_valid, ypred)

0.6228195827419153

In [32]:
#DecisionTree

tree.fit(X_train, y_train)
ypred = tree.predict(X_valid)
roc_auc_score(y_valid, ypred)

0.8628984882606454

In [33]:
#SGDClassifier

sgd.fit(X_train, y_train)
ypred = sgd.predict(X_valid)
roc_auc_score(y_valid, ypred)

0.499607424189823
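
A score of about 0.5 is what random guessing would give. One possible cause, stated here as an assumption rather than something verified on this dataset, is that `SGDClassifier` is sensitive to feature scale, and the features mix coordinates, profit values and 0/1 indicators. The sketch below shows how scaling could be added with a pipeline, using synthetic data with one deliberately inflated feature.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is inflated to mimic mixed scales (illustrative only).
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_toy[:, 0] *= 1000.0
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Standardize the features before the linear model; the scaler is fit on training data only.
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
scaled_sgd.fit(X_tr, y_tr)

# decision_function returns a continuous margin, suitable as a ROC score.
auc = roc_auc_score(y_va, scaled_sgd.decision_function(X_va))
print("SGD with scaling, ROC AUC:", auc)
```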

In [34]:
#RandomForest

rfc.fit(X_train, y_train)
ypred = rfc.predict(X_valid)
roc_auc_score(y_valid, ypred)

0.8574299776915302

In [51]:
#KNeighbors

knc.fit(X_train, y_train)
ypred = knc.predict(X_valid)
roc_auc_score(y_valid, ypred)

0.6107902636319618

The decision tree achieves the highest score (0.863), just ahead of the random forest (0.857), so it seems to be the best of the tested models for predicting late delivery.
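
The gap between the two best models is small and comes from a single train/validation split, so a cross-validated comparison may give a steadier picture. The sketch below is a hedged illustration on synthetic data; the `models` dictionary, the `n_estimators=50` setting and the fixed `random_state` values are assumptions, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean ROC AUC over 5 folds for each model; scoring="roc_auc" uses continuous scores.
results = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```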