# UPS BR Data Analyst Case:
## Python Challenge

**By Sebastian Montero Segura**

In this challenge, you are presented with another fictitious data sample that mocks customer
purchases and deliveries. In this hypothetical scenario, the company wants to develop a
machine-learning model to predict a delivery delay and notify the customer in advance.
Based on the file provided (predict-model-dataset.csv), the minimum request for this
challenge is to answer the following questions using Python:

>1. Which Warehouse has more delays? Absolute and relative (%)

>2. Which Transportation mode has more delays? Absolute and relative (%)

>3. Generate a visual to illustrate the distributions of “Gender” x “Customer_Rating”

>4. Based on the data provided, does package weight seem to be related to the delivery
delay?

The extra request for this challenge is:

>►Provide an overall data exploration, using Python. Metrics such as mean, max, min,
std_deviation;

>►Predict model and its respective scores (Accuracy, Precision, Recall, F1-Score, ROC,
Confusion Matrix, etc).

### 1. Import the Data
►Import Python Libraries:

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

►Read CSV file from Github Repository:

In [24]:
#file_URL = "https://raw.githubusercontent.com/juanrodriguesups/UPS-BR-Data-Analytics-Case/main/predict-model-dataset.csv"
file_URL = "./predict-model-dataset.csv"
df_predict_model = pd.read_csv(file_URL)

### 2. Clean the Data
►Verify that the imported data types of the DataFrame columns match the types we expect:

In [None]:
df_predict_model.info()

►Check for null Values in the data:

In [None]:
pd.isnull(df_predict_model).sum()

### 3. Analysis of the Data
#### 1. Which Warehouse has more delays? Absolute and relative (%)

First, we count the delays registered in the data, grouped by warehouse, to get the absolute number of delays.
The variable 'result_warehouse_delays' stores the count of records where 'DELIVERY_DELAYED' equals 0 (the value this analysis treats as a delay), grouped by each unique 'WAREHOUSE_BLOCK' in the data.


In [27]:
result_warehouse_delays = df_predict_model.loc[df_predict_model['DELIVERY_DELAYED'] == 0, ['WAREHOUSE_BLOCK','DELIVERY_DELAYED']].groupby('WAREHOUSE_BLOCK').count()

Second, to get the relative percentage of delays for each warehouse, we need the total number of delays. Each warehouse's delay count is then divided by that total and multiplied by 100. The total is stored in the variable 'total_delays':

In [28]:
total_delays = result_warehouse_delays['DELIVERY_DELAYED'].sum()

To store the relative delays per warehouse, we create a new column in the dataframe called 'RELATIVE_DELAY':

In [29]:
result_warehouse_delays['RELATIVE_DELAY'] = (result_warehouse_delays['DELIVERY_DELAYED'] / total_delays) * 100

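When creating the chart, we need the warehouse labels for the 'X' axis. They are taken from the index of 'result_warehouse_delays' so that each label stays aligned with its count. The order returned by unique() follows first appearance in the data, which may differ from the sorted order produced by groupby.

In [30]:
warehouse = result_warehouse_delays.index.tolist()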

Finally, with the absolute and relative measures for each warehouse available, we can visualize the results:

**Bar chart / Absolute Delays per Warehouse:**

In [None]:
plt.bar(warehouse, result_warehouse_delays['DELIVERY_DELAYED'])
plt.title('Absolute Delays per Warehouse')
plt.ylabel('Total Delays')
plt.xlabel('Warehouse')
plt.show()

**Pie chart / Relative % Delays per Warehouse:**

In [None]:
plt.style.use('ggplot')
plt.pie(result_warehouse_delays['RELATIVE_DELAY'], labels = warehouse, autopct = '%.2f %%')
plt.title('Relative % Delays per Warehouse')
plt.show()
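
The relative figure above is each warehouse's share of all delays. Another possible reading of "relative" is the delay rate within each warehouse, that is, delayed shipments divided by that warehouse's own shipment total, which does not favor busier warehouses. The following is a minimal sketch on a made-up frame, assuming (as in the cells above) that 0 in 'DELIVERY_DELAYED' marks a delay. The toy values and the names `toy`, `is_delayed`, and `delay_rate` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from predict-model-dataset.csv).
toy = pd.DataFrame({
    "WAREHOUSE_BLOCK": ["A", "A", "B", "B", "B", "C"],
    "DELIVERY_DELAYED": [0, 1, 0, 0, 1, 1],
})

# Assumption carried over from the notebook: 0 marks a delayed delivery.
is_delayed = toy["DELIVERY_DELAYED"].eq(0)

# The mean of a boolean per group is the fraction of that group's shipments delayed.
delay_rate = is_delayed.groupby(toy["WAREHOUSE_BLOCK"]).mean() * 100
print(delay_rate.round(2))
```

On the real data, the same two lines could be applied to `df_predict_model` to compare warehouses independently of their volume.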

#### 2. Which Transportation mode has more delays? Absolute and relative (%)

As in the previous case, we count the delays registered in the data, grouped by transportation mode, to get the absolute number of delays.
The variable 'result_transportation_delays' stores the count of records where 'DELIVERY_DELAYED' equals 0 (the value this analysis treats as a delay), grouped by each unique 'MODE_OF_SHIPMENT' in the data.

In [33]:
result_transportation_delays = df_predict_model.loc[df_predict_model['DELIVERY_DELAYED'] == 0, ['MODE_OF_SHIPMENT','DELIVERY_DELAYED']].groupby('MODE_OF_SHIPMENT').count()

Second, to get the relative percentage of delays for each transportation mode, we need the total number of delays. Each mode's delay count is then divided by that total and multiplied by 100. The total is stored in the variable 'total_delays':

In [34]:
total_delays = result_transportation_delays['DELIVERY_DELAYED'].sum()

To store the relative delays per shipment mode, we create a new column in the dataframe called 'RELATIVE_DELAY':

In [35]:
result_transportation_delays['RELATIVE_DELAY'] = (result_transportation_delays['DELIVERY_DELAYED'] / total_delays) * 100

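When creating the chart, we need the shipment mode labels for the 'X' axis. They are taken from the index of 'result_transportation_delays' so that each label stays aligned with its count. The order returned by unique() follows first appearance in the data, which may differ from the sorted order produced by groupby.

In [36]:
transportation = result_transportation_delays.index.tolist()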

Finally, with the absolute and relative measures for each transportation mode available, we can visualize the results:

**Bar chart / Absolute Delays per Transportation Mode:**

In [None]:
plt.bar(transportation, result_transportation_delays['DELIVERY_DELAYED'])
plt.title('Absolute Delays per Transportation Mode')
plt.ylabel('Total Delays')
plt.xlabel('Transportation Mode')
plt.show()

**Pie chart / Relative % Delays per Transportation Mode:**

In [None]:
plt.style.use('ggplot')
plt.pie(result_transportation_delays['RELATIVE_DELAY'], labels = transportation, autopct = '%.2f %%')
plt.title('Relative % Delays per Transportation Mode')
plt.show()
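
Questions 1 and 2 repeat the same steps on different columns, so the computation could be wrapped in one small function. The sketch below is one way to do that. `delay_summary` is a hypothetical helper, not something from the original notebook, and it assumes 0 marks a delay, as in the cells above. The toy values are made up.

```python
import pandas as pd

def delay_summary(df, group_col, delayed_value=0):
    """Absolute delays and share of all delays per category (hypothetical helper)."""
    # Keep only the grouping column of the delayed rows.
    delayed = df.loc[df["DELIVERY_DELAYED"] == delayed_value, group_col]
    # Count delays per category, sorted by category label.
    absolute = delayed.value_counts().sort_index()
    return pd.DataFrame({
        "ABSOLUTE_DELAY": absolute,
        "RELATIVE_DELAY": absolute / absolute.sum() * 100,
    })

# Toy data (hypothetical values, not taken from predict-model-dataset.csv).
toy = pd.DataFrame({
    "MODE_OF_SHIPMENT": ["Ship", "Ship", "Flight", "Road", "Ship", "Road"],
    "DELIVERY_DELAYED": [0, 0, 0, 1, 1, 0],
})
summary = delay_summary(toy, "MODE_OF_SHIPMENT")
print(summary)
```

Because the labels live in `summary.index`, passing `summary.index` and `summary['ABSOLUTE_DELAY']` to `plt.bar` keeps each bar matched to its category.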

#### 3. Generate a visual to illustrate the distributions of “Gender” x “Customer_Rating”

To generate a chart that displays the distribution of customer ratings per gender, we create a new dataframe that stores the number of ratings for each gender, grouped by rating.

In [39]:
count_ratingxgender =  df_predict_model.loc[df_predict_model['GENDER'] == 'F',['CUSTOMER_RATING','GENDER']].groupby('CUSTOMER_RATING').count()
count_ratingxgender = count_ratingxgender.rename(columns={'GENDER': 'Female'})
count_ratingxgender['Male'] =  df_predict_model.loc[df_predict_model['GENDER'] == 'M',['CUSTOMER_RATING','GENDER']].groupby('CUSTOMER_RATING').count()

Then display the chart using the previous data set with a plot:

**Bar chart / Customer ratings per Gender:**

In [None]:
count_ratingxgender.plot(kind = 'bar')
# Keep the tick positions set by the bar plot and only rotate the labels
plt.xticks(rotation = 'horizontal')
plt.xlabel('Ratings')
plt.ylabel('Number of Ratings')
plt.title('Customer Ratings per Gender')
plt.show()
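
The rating-by-gender table built above can also be produced in a single call with `pd.crosstab`. This is a minimal sketch on made-up values, offered as an alternative rather than as the notebook's method. The names `toy` and `rating_by_gender` are illustrative.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from predict-model-dataset.csv).
toy = pd.DataFrame({
    "GENDER": ["F", "M", "F", "M", "F", "M"],
    "CUSTOMER_RATING": [1, 1, 2, 3, 3, 3],
})

# Rows are ratings, columns are genders, cells are counts.
# Combinations that never occur are filled with 0.
rating_by_gender = pd.crosstab(toy["CUSTOMER_RATING"], toy["GENDER"])
print(rating_by_gender)
```

Calling `rating_by_gender.plot(kind='bar')` on the result would give the same kind of grouped bar chart as above.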

#### 4. Based on the data provided, does package weight seem to be related to the delivery delay?

Using a new DataFrame to store the weight and delay columns, we compute the correlation coefficient with the pandas corr() method:

In [None]:
df_Weight_x_Delay = pd.DataFrame()
df_Weight_x_Delay['Weight'] = df_predict_model.WEIGHT_IN_GMS
df_Weight_x_Delay['Delay'] = df_predict_model.DELIVERY_DELAYED
corr_matrix = df_Weight_x_Delay.corr()
print(corr_matrix['Delay'])

As shown above, the correlation coefficient between weight and the delay flag is about -0.27, which indicates a weak negative correlation: as the weight increases, the value of 'DELIVERY_DELAYED' tends to decrease, and vice versa. Since a value of 0 represents a delay in this analysis, heavier packages are slightly more likely to be delayed.
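
A correlation between a numeric column and a 0/1 column is easier to interpret next to the average weight in each group. The sketch below shows both on made-up values. The numbers are illustrative only and do not reproduce the -0.27 reported above, and the names `toy`, `r`, and `mean_weight` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from predict-model-dataset.csv).
toy = pd.DataFrame({
    "WEIGHT_IN_GMS": [1000, 1500, 2000, 4000, 4500, 5000],
    "DELIVERY_DELAYED": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation between a numeric column and a 0/1 flag
# (this special case is also known as the point-biserial correlation).
r = toy["WEIGHT_IN_GMS"].corr(toy["DELIVERY_DELAYED"])

# Average weight for each value of the delay flag.
mean_weight = toy.groupby("DELIVERY_DELAYED")["WEIGHT_IN_GMS"].mean()
print(round(r, 3))
print(mean_weight)
```

A negative `r` together with a higher mean weight in the 0 group points the same way as the interpretation above: heavier packages fall more often in the group flagged 0.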

### Extra
#### ►Provide an overall data exploration, using Python. Metrics such as mean, max, min, std_deviation;

For an overall analysis of the numerical values in the data, we can use the pandas describe() function. It shows the count, mean, standard deviation, min, max, and quartiles for the columns that contain numerical values.

In [None]:
df_predict_model.describe()
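
When only the metrics named in the request are wanted, `agg` can return just those rows. This is a minimal sketch on made-up values. The names `toy`, `numeric`, and `stats` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from predict-model-dataset.csv).
toy = pd.DataFrame({
    "WEIGHT_IN_GMS": [1000, 2000, 3000],
    "COST_OF_THE_PRODUCT": [100, 200, 300],
    "GENDER": ["F", "M", "F"],
})

# Restrict to numeric columns, then compute only the requested metrics.
numeric = toy.select_dtypes(include="number")
stats = numeric.agg(["mean", "max", "min", "std"])
print(stats)
```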

#### ►Predict model and its respective scores (Accuracy, Precision, Recall, F1-Score, ROC, Confusion Matrix, etc).

To create a prediction model, we first need to find the columns in our data set that are going to be useful for the prediction, to avoid unnecessary information interfering with our scores.

Once the columns are identified, we need to check whether any of them contains categorical data, since the model cannot read this information directly.

To avoid this issue, each categorical column is dropped and replaced by one column per category, which uses "True" and "False" to indicate whether a row belongs to that category.

In [43]:
Train_data = df_predict_model.join(pd.get_dummies(df_predict_model.WAREHOUSE_BLOCK)).drop(['WAREHOUSE_BLOCK'], axis=1)
Train_data = Train_data.join(pd.get_dummies(df_predict_model.MODE_OF_SHIPMENT)).drop(['MODE_OF_SHIPMENT'], axis=1)
Train_data = Train_data.join(pd.get_dummies(df_predict_model.PRODUCT_IMPORTANCE)).drop(['PRODUCT_IMPORTANCE'], axis=1)
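
The same encoding can be done in one call by passing `columns=` to `pd.get_dummies`, which also prefixes each new column with the name of its source column. Prefixed names would avoid a clash if two categorical columns ever shared a category label. This is a minimal sketch on made-up values. The names `toy` and `encoded` are illustrative.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from predict-model-dataset.csv).
toy = pd.DataFrame({
    "WAREHOUSE_BLOCK": ["A", "B", "A"],
    "MODE_OF_SHIPMENT": ["Ship", "Road", "Ship"],
    "WEIGHT_IN_GMS": [1200, 3400, 2100],
})

# Encode the listed columns; the originals are dropped automatically and
# the new columns are named <source column>_<category>.
encoded = pd.get_dummies(toy, columns=["WAREHOUSE_BLOCK", "MODE_OF_SHIPMENT"])
print(encoded.columns.tolist())
```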

Once the categorical columns are replaced, it's time to drop the columns that are not needed for the model, along with the column we are trying to predict ('DELIVERY_DELAYED'):

In [44]:
# Drop unused features and the target column in a single call
Train_data = Train_data.drop(
    columns=['ID', 'CUSTOMER_RATING', 'COST_OF_THE_PRODUCT', 'DISCOUNT_OFFERED',
             'GENDER', 'DELIVERY_DELAYED', 'CUSTOMER_CARE_CALLS']
)

Next, we import the scikit-learn functions and classes needed for the prediction model:

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

Now we split the data into a training set and a test set. The model is trained on the training set, and its predictions are then evaluated on the test set.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(Train_data, df_predict_model['DELIVERY_DELAYED'], test_size=0.2, random_state=42)
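
If the two classes of the target are not equally frequent, passing `stratify=` to `train_test_split` keeps the class proportions the same in both subsets. The sketch below uses made-up arrays. Whether stratification matters for this dataset is not something the notebook establishes, so treat it as an optional variation. All names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels (hypothetical): 12 samples of class 0, 8 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# stratify=y preserves the 60/40 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```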

Now we create the prediction model, in this case a logistic regression, and fit it to the training data:

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
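
Logistic regression in scikit-learn is fitted iteratively, and it may emit a convergence warning when features are on very different scales, for example a weight in grams next to 0/1 dummy columns. One common remedy is to standardize the features in a pipeline. The sketch below illustrates this on synthetic data. The data, the threshold, and the name `scaled_model` are assumptions for illustration, not the notebook's model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): one large-scale feature and one 0/1 flag.
rng = np.random.default_rng(0)
weight = rng.uniform(1000, 6000, size=200)
flag = rng.integers(0, 2, size=200)
X = np.column_stack([weight, flag])
# Made-up labelling rule, used only to have something to fit.
y = (weight < 3500).astype(int)

# Scale the features first, then fit the classifier on the scaled values.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X, y)
print(scaled_model.score(X, y))
```

The pipeline exposes the same `fit`, `predict`, and `predict_proba` methods as the plain model, so it could be swapped in without changing the later cells.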

With the model fitted on the training data, we make predictions on the test data:

In [48]:
y_pred = model.predict(X_test)

Finally, we compute the evaluation metrics by comparing the predictions with the true labels of the test set:

In [None]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
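# ROC AUC is computed from the predicted probability of class 1 rather than from hard labels
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])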
confusion_mat = confusion_matrix(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("ROC AUC:", roc_auc)
print("Confusion Matrix:")
print(confusion_mat)
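
By default, scikit-learn's `precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class. Since this analysis treats 0 as a delay, the scores above describe the class labelled 1. The sketch below shows, on made-up labels, how `pos_label=0` and `classification_report` give the per-class view. The arrays are illustrative and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import classification_report, recall_score, roc_auc_score

# Toy labels, predictions and class-1 probabilities (hypothetical values).
y_true = np.array([0, 0, 0, 1, 1, 1])
y_hat = np.array([0, 0, 1, 1, 1, 0])
p_class1 = np.array([0.2, 0.3, 0.6, 0.9, 0.7, 0.4])

# Recall for the class labelled 0 (assumed here to mean "delayed").
recall_delay = recall_score(y_true, y_hat, pos_label=0)

# ROC AUC from probabilities of class 1.
auc = roc_auc_score(y_true, p_class1)

# Precision, recall and F1 for every class in one table.
print(classification_report(y_true, y_hat))
```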