#  Case Data Scientist

In this notebook, we're going to go through an example machine learning project with the goal of predicting the number of transports that may start in a given five-digit postal code on a certain day of the year.

Problem 1:

The first problem will be solved using the following steps:
1. Problem definition.
2. Data.
3. EDA.
4. Preprocessing
5. Model
6. Evaluating the Model
7. Feature Importance

Problem 2:

In this problem, we'll answer ***some*** critical questions to gain insights that can improve business decisions.

# Problem 1: Data Science

## 1. Problem definition

> Developing a Machine Learning model that is able to **predict the number of transports** that may start in a given postal code of five digits on a certain day of the year.

## 2. Data

There are 2 main datasets:

* **Innrikes Paket kost 2019.09 - 2020.08.pickle** 
  * Dataset of transports within Sweden.
  * The start and end of the journey for the merchandise are described by the Swedish 5-digit postal code (more granular than the 3-digit one).
  * Also included are the start date of the transport, and various KPIs, such as the duration, weight, volume, and cost.
  
* **df_postal_code_sweden.pickle:**
  presents the latitude and longitude of one point (maybe the center) within each postal code of 5 digits in Sweden.


### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pickle

from zipfile import ZipFile
import sys

%matplotlib inline
import seaborn as sns
import datetime

sys.path.append('../')
from handling_missing_data import CleanData
# to impute missing data with Feature-engine:
from feature_engine.imputation import RandomSampleImputer

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)



### Dataset

#### Import files

In [None]:
# import files

df_postal_code_sweden = pickle.load(open('./Data/df_postal_code_sweden.pickle', 'rb'))
gdf_geocode_sweden = pickle.load(open('./Data/gdf_geocode_sweden.pickle', 'rb'))

with ZipFile('./Data/Innrikes Paket kost 2019.09 - 2020.08.pickle.zip', 'r') as zip_ref:
    zip_ref.extractall('./Data/')
df_Innrikes_Paket_kost = pickle.load(open('./Data/Innrikes Paket kost 2019.09 - 2020.08.pickle', 'rb'))

#### Convert file Pickle to CSV

In [None]:
# Convert the pickle files to csv files
df_postal_code_sweden.to_csv(r'./Data/df_postal_code_sweden.csv',index=False)
gdf_geocode_sweden.to_csv(r'./Data/gdf_geocode_sweden.csv',index=False)
df_Innrikes_Paket_kost.to_csv(r'./Data/df_Innrikes_Paket_kost.csv',index=False)

In [None]:
# Read file as csv
df_postal_code_sweden = pd.read_csv('./Data/df_postal_code_sweden.csv')
gdf_geocode_sweden = pd.read_csv('./Data/gdf_geocode_sweden.csv')
df_Innrikes_Paket_kost = pd.read_csv('./Data/df_Innrikes_Paket_kost.csv', parse_dates=['DepartureDate'])

## 3. EDA
Exploratory Data Analysis (EDA) is the process of visualizing and analyzing data to extract insights from it. In other words, EDA is the process of summarizing important characteristics of data in order to gain a better understanding of the dataset.
Since EDA has no real set methodology, the following is a short checklist you might want to walk through:

1. What kind of data do you have and how do you treat different types?
2. What’s missing from the data and how do you deal with it?
3. Where are the outliers and why should you care about them?
4. How can you add, change or remove features to get more out of your data?

The checks in this section cover:

- Count of unique values
- Numeric columns
- Missing values
- Summary stats
- Outliers:
    - Considerably higher or lower
    - Require further investigation
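
As a rough illustration of the checklist above, here is a minimal sketch on hypothetical toy data (the frame `toy` and its column names are made up for this example and are not part of the transport dataset). The 1.5 x IQR rule used for the outlier flag is one common convention, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "zone": ["11121", "50464", None, "50464", "16979", "50464"],
    "weight": [1.0, 2.0, 2.5, np.nan, 3.0, 250.0],
})

# 1. Data types and number of unique values per column
print(toy.dtypes)
print(toy.nunique())

# 2. Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# 3. Summary statistics for the numeric columns
print(toy.describe())

# 4. Simple outlier flag: values further than 1.5 * IQR from the quartiles
q1, q3 = toy["weight"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["weight"] < q1 - 1.5 * iqr) | (toy["weight"] > q3 + 1.5 * iqr)
print(toy.loc[is_outlier, "weight"])
```

Flagged values are only candidates for further investigation; whether to cap, drop, or keep them depends on the business context.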


In [None]:
# files dim
df_postal_code_sweden.shape, gdf_geocode_sweden.shape, df_Innrikes_Paket_kost.shape

In [None]:
# Check the first 5 rows
df_postal_code_sweden.head()

In [None]:
gdf_geocode_sweden.head()

In [None]:
df_Innrikes_Paket_kost.head()
df_Innrikes_Paket_kost.columns

In [None]:
# Check the type of variables
df_postal_code_sweden.info()
gdf_geocode_sweden.info()
df_Innrikes_Paket_kost.info()

In [None]:
# calculating some statistical data like percentile, mean and std of the numerical values
df_Innrikes_Paket_kost.describe()

### Missing Value

In [None]:
# Function to find the percentage of missing values in each column
def find_missing_percentage(df):
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_value_df = pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing})
    missing_value_df.sort_values('percent_missing', inplace=True)
    return missing_value_df

In [None]:
# Finding the missing values in the df_postal_code_sweden
find_missing_percentage(df_postal_code_sweden)

In [None]:
# Finding the missing values in the gdf_geocode_sweden
find_missing_percentage(gdf_geocode_sweden)

In [None]:
find_missing_percentage(df_Innrikes_Paket_kost)

### Uniqueness

In [None]:
# Check the different unique values of different columns
df_Innrikes_Paket_kost.nunique()

In [None]:
# Check the different unique values of different columns
df_postal_code_sweden.nunique()

In [None]:
df_Innrikes_Paket_kost['PlaceOfDestination'].value_counts()

#### Checking the Numerical & Categorical variables including the missing values

In [None]:
# From the DataFrame, select only the numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_numerics = df_Innrikes_Paket_kost.select_dtypes(include=numerics)
Col_num = df_Innrikes_Paket_kost.select_dtypes(include=np.number).columns.tolist()
df_numerics.head()
df_numerics.isnull().sum()

In [None]:
list_singular_matrix = []
for i in range(len(Col_num)):
    if len(df_Innrikes_Paket_kost[Col_num[i]].unique()) == 2:
        print(f'Variable_name: {Col_num[i]}')
        list_singular_matrix.append(Col_num[i])
df_Innrikes_Paket_kost.drop(list_singular_matrix,inplace=True,axis=1)
Col_num = df_Innrikes_Paket_kost.select_dtypes(include=np.number).columns.tolist()
df_Innrikes_Paket_kost.shape

In [None]:
# find numerical variables

numerical = [var for var in df_Innrikes_Paket_kost.columns if df_Innrikes_Paket_kost[var].dtype!='O']

print('There are {} numerical variables'.format(len(numerical)))

# Check for columns which are numeric
for label, content in df_Innrikes_Paket_kost.items():
    if  pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
df_categorical = df_Innrikes_Paket_kost.select_dtypes(include= object)
Col_cat = df_Innrikes_Paket_kost.select_dtypes(include=np.object_).columns.tolist()
df_categorical.head()
df_categorical.isnull().mean()

In [None]:
# find categorical variables

categorical = [var for var in df_Innrikes_Paket_kost.columns if df_Innrikes_Paket_kost[var].dtype=='O']

print('There are {} categorical variables'.format(len(categorical)))

# Check for columns which aren't numeric
for label, content in df_Innrikes_Paket_kost.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
pd.crosstab(df_Innrikes_Paket_kost.DepartureDate, df_Innrikes_Paket_kost.PlaceOfDeparture)

#### Sort DataFrame by DepartureDate

As we're working on a time series problem and trying to predict future examples given past examples, it makes sense to sort our data by date.

In [None]:
# Sort DataFrame in date order
df_Innrikes_Paket_kost.sort_values(by=["DepartureDate"], inplace=True, ascending=True)
df_Innrikes_Paket_kost.DepartureDate.head(10)

In [None]:
df_Innrikes_Paket_kost.head()

### Generating Report and Plotting 

In [None]:
#!pip install -U dataprep

In [None]:
# Import DataPrep library
import dataprep as dpr
from dataprep.eda import plot
from dataprep.eda import plot_missing
from dataprep.eda import plot_correlation
#from dataprep.eda import create_report

In [None]:
#plot(df_Innrikes_Paket_kost)

In [None]:
#plot_missing(df_Innrikes_Paket_kost)

In [None]:
#plot_correlation(df_Innrikes_Paket_kost)

## 4. Preprocessing

**Steps to clean the data:**
 - Remove duplicated features in each file.
 - Remove Constant and Quasi Features.
 - Remove low/high correlated features.
 - Convert the features to the correct format and type.
 - Imputing the missing data in the numerical and categorical variables.
 - Removing Outliers.
 - Scaling/transforming the data. (*Solved in the Model section*)

#### Make a copy of the original DataFrame

Since we're going to be manipulating the data, we'll make a copy of the original DataFrame and perform our changes there.

This will keep the original DataFrame intact if we need it again.

In [None]:
clean_data = CleanData()

In [None]:
# copy the original DataFrame for preprocessing
df_tmp = df_Innrikes_Paket_kost.copy()

#### Dropping Columns

In [None]:
df_remove_signMat_dup_cons_quas_row_col = clean_data.drop_const_quasi_dupl(df_tmp)

In [None]:
df_remove_signMat_dup_cons_quas_row_col.columns

#### Converting Numerical <==> Categorical

In [None]:
col_conv_num_to_cat = ["ToZone", "FromZone"]
col_conv_cat_to_num = ['GrossWeight']
df_tmp = clean_data.convert_num_to_cat(df_remove_signMat_dup_cons_quas_row_col,col_conv_num_to_cat)
# Chain on df_tmp so the result of the first conversion is not discarded
df_tmp = clean_data.convert_cat_to_num(df_tmp,col_conv_cat_to_num)

In [None]:
df_numerics = df_tmp.select_dtypes(include=numerics)
Col_num = df_tmp.select_dtypes(include=np.number).columns.tolist()
Col_cat = df_tmp.select_dtypes(include=np.object_).columns.tolist()

### Imputing Missing Data

After investigating the missing data in the EDA section, we notice that the missing values are only in the categorical columns:
- ToZone --> 0.679478%
- FromZone --> 3.476942%
- PlaceOfDestination --> 17.366724%
- PlaceOfDeparture --> 20.131676%
- ConsigneeName --> 23.878576%

To impute missing data, we are going to use the `RandomSampleImputer()` from feature_engine, which works for both numerical and categorical variables. It replaces missing data with a random sample extracted from the same variables in the training set.

In [None]:
#!pip install feature-engine
imputer = RandomSampleImputer(variables=Col_cat)
imputer.fit(df_tmp)
df_tmp_impute = imputer.transform(df_tmp)

In [None]:
# No missing values
find_missing_percentage(df_tmp_impute)
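
To make the idea behind random sample imputation concrete, here is a minimal sketch in plain pandas and NumPy. It is not the feature_engine implementation; the toy series `s` and the fixed seed are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with missing values
s = pd.Series(["A", None, "B", "A", None, "C"], name="PlaceOfDeparture")

rng = np.random.default_rng(0)  # fixed seed so the draw is reproducible

observed = s.dropna()           # values we are allowed to sample from
n_missing = int(s.isnull().sum())

# Draw one observed value (with replacement) for every missing entry
draws = rng.choice(observed.to_numpy(), size=n_missing, replace=True)

imputed = s.copy()
imputed[s.isnull()] = draws

print(imputed.tolist())
```

Because the replacements are drawn from the observed values, the imputed column roughly preserves the original category frequencies, which is the main appeal of this approach when a sizeable share of a column is missing.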

### Remove Outliers

In [None]:
# create the capper
from feature_engine.outliers import Winsorizer
windsoriser = Winsorizer(capping_method='iqr', # choose iqr for IQR rule boundaries or gaussian for mean and std
                          tail='both', # cap left, right or both tails 
                          fold=3)
windsoriser.fit(df_tmp_impute)
df_remov_outliers = windsoriser.transform(df_tmp_impute)

In [None]:
windsoriser.left_tail_caps_

In [None]:
windsoriser.right_tail_caps_
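
The capping values can be reproduced by hand. The sketch below assumes the usual IQR rule (lower cap = Q1 - fold x IQR, upper cap = Q3 + fold x IQR) on a hypothetical toy series; it is meant to illustrate the rule, not to replicate the `Winsorizer` internals exactly.

```python
import pandas as pd

# Hypothetical toy values with one extreme observation
x = pd.Series([1.0, 2.0, 2.5, 3.0, 4.0, 250.0], name="GrossWeight")

fold = 3  # same multiplier as in the Winsorizer cell
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1

lower_cap = q1 - fold * iqr
upper_cap = q3 + fold * iqr

# Values beyond the caps are replaced by the caps; no rows are dropped
x_capped = x.clip(lower=lower_cap, upper=upper_cap)
print(lower_cap, upper_cap)
print(x_capped.tolist())
```

Note that winsorizing caps extreme values rather than removing the rows, so the number of observations stays the same.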

In [None]:
# function to create histogram, Q-Q plot and boxplot
# (scipy.stats, imported above as `stats`, is used for the Q-Q plot)
import seaborn as sns
def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins=30)
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()

In [None]:
diagnostic_plots(df_tmp_impute, 'GrossWeight')

In [None]:
diagnostic_plots(df_remov_outliers, 'GrossWeight')

In [None]:
# No missing data
df_remov_outliers.isnull().sum()

#### Saving a copy of cleaned data

In [None]:
df_cleaned_data = df_remov_outliers.copy()
df_cleaned_data

## 5. Model


In [None]:
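# We can count per day how many packets were sent for every postal code by aggregating.
# Named aggregation keeps the grouping key 'FromZone' intact and stores the
# number of transports per (DepartureDate, FromZone) pair in a new 'Count' column.
data = df_cleaned_data.groupby(['DepartureDate','FromZone'],as_index = False).agg(
    GrossWeight=('GrossWeight','sum'),
    ChargeWeight=('ChargeWeight','sum'),
    Volume=('Volume','sum'),
    NumberOfPieces=('NumberOfPieces','sum'),
    CostTotalAmount=('CostTotalAmount','sum'),
    Count=('GrossWeight','size'))

In [None]:
# Check the aggregated data, including the 'Count' column
data.head()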
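
To make the aggregation concrete, here is a small sketch on hypothetical toy rows (the values are invented). It uses pandas named aggregation to sum a KPI and count the transports per day and departure zone.

```python
import pandas as pd

# Hypothetical toy transports: three on 2020-01-01, one on 2020-01-02
toy = pd.DataFrame({
    "DepartureDate": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"]),
    "FromZone": ["50464", "50464", "11121", "50464"],
    "GrossWeight": [10.0, 5.0, 2.0, 7.0],
})

# One row per (day, zone): total weight and number of transports
daily = (
    toy.groupby(["DepartureDate", "FromZone"], as_index=False)
       .agg(GrossWeight=("GrossWeight", "sum"),
            Count=("GrossWeight", "size"))
)
print(daily)
```

Each row of `daily` corresponds to one day and one departure postal code, which is the granularity at which the target is defined.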

#### Separate Target Variable and Predictor Variables
`NumberOfPieces` is used as a predictor alongside the weight (`GrossWeight`, `ChargeWeight`), cost (`CostTotalAmount`) and volume (`Volume`). The reasoning is that a higher number of pieces, higher weight, higher volume, and possibly lower cost may correspond to a higher number of transports the company has to arrange. The number of transports is important because the company needs more drivers to handle the packages.



In [None]:
Target = ['Count']
Predictors=['GrossWeight', 'ChargeWeight', 'Volume', 'NumberOfPieces', 'CostTotalAmount']

X = data[Predictors].values
y = data[Target].values

#### Standardization of data

In [None]:
Predictor_Scaler=StandardScaler()
Target_Scaler=StandardScaler()

# Storing the fit object for later reference
Predictor_ScalerFit = Predictor_Scaler.fit(X)
Target_ScalerFit = Target_Scaler.fit(y)
 
# Generating the standardized values of X and y
X = Predictor_ScalerFit.transform(X)
y = Target_ScalerFit.transform(y)

#### Splitting the data into training and testing set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Quick sanity check with the shapes of Training and testing datasets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
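
A common alternative to standardizing the full dataset before splitting is to compute the scaling statistics on the training rows only, which avoids leaking information from the test rows. The sketch below illustrates that idea with plain NumPy on hypothetical random data; it is an optional variation, not the procedure used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # hypothetical predictors

# Simple split: first 80 rows for training, last 20 for testing
X_tr, X_te = X_toy[:80], X_toy[80:]

# Scaling statistics come from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma   # test rows reuse the training statistics

# Undo the scaling to get back to the original units
X_te_back = X_te_std * sigma + mu
```

The same inverse step applies to a standardized target: predictions must be scaled back before they can be read as a number of transports.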


#### 5.1 Creating an ANN Model to train our data

In [None]:
# create ANN model
model = Sequential()
 
# Defining the Input layer 
model.add(Dense(units=5, input_dim=5, kernel_initializer='normal', activation='relu'))
 
# Defining the Second layer of the model
model.add(Dense(units=5, kernel_initializer='normal', activation='tanh'))
 
# The output neuron is a single fully connected node 
# Since we will be predicting a single number
model.add(Dense(1, kernel_initializer='normal'))
 
# Compiling the model
model.compile(loss='mean_squared_error', optimizer='adam')
 
# Fitting the ANN to the Training set
model.fit(X_train, y_train ,batch_size = 20, epochs = 50, verbose=1)

#### Generating Predictions

In [None]:
# Generating Predictions on testing data
test_pred = model.predict(X_test)

# Generating Predictions on training data
train_pred = model.predict(X_train)

#### 5.2 RandomForest Regressor

In [None]:
# Let's build a machine learning model 
from sklearn.ensemble import RandomForestRegressor

# Change max_samples value
model_rfr = RandomForestRegressor(n_jobs=-1,
                              random_state=42,
                              max_samples=10000)


In [None]:
%%time
model_rfr.fit(X_train, y_train)

#### 5.3 Hyperparameter tuning with RandomizedSearchCV


In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

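# Different RandomForestRegressor hyperparameters
# Note: max_features=1 means a single feature, while 1.0 means all features
# (1.0 stands in for the "auto" option, which recent scikit-learn releases removed)
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           "max_features": [0.5, 1, "sqrt", 1.0],
           "max_samples": [10000]}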

# Instantiate RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=2,
                              cv=5,
                              verbose=True)

# Fit the RandomizedSearchCV model
rs_model.fit(X_train, y_train)

In [None]:
# Find the best model hyperparameters
rs_model.best_params_

In [None]:
%%time

# Most ideal hyperparameters
ideal_model = RandomForestRegressor(n_estimators=20,
                                    min_samples_leaf=7,
                                    min_samples_split=4,
                                    max_features=1.0,  # all features; stands in for the removed 'auto' option
                                    n_jobs=-1,
                                    max_samples=None,
                                    random_state=42) # random state so our results are reproducible

# Fit the ideal model
ideal_model.fit(X_train, y_train)

## 6. Evaluating the Model

### Evaluating Model Performance on Train & Test Data

In [None]:
# Create function to evaluate model on a few different levels
def show_scores(model):
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    scores = {"MAE for Train Data is: ": mean_absolute_error(y_train, train_preds),
              "MAE for Test Data is: ": mean_absolute_error(y_test, test_preds),
              "MSE for Train Data is: ": mean_squared_error(y_train, train_preds),
              "MSE for Test Data is: ": mean_squared_error(y_test, test_preds),
              "R2 Score for Train Data is": r2_score(y_train, train_preds)*100,
              "R2 Score for Test Data is": r2_score(y_test, test_preds)*100}
    return scores
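
For reference, the three metrics returned by `show_scores` can be computed by hand. The sketch below uses hypothetical toy values and plain NumPy to show what each metric measures.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                   # mean absolute error
mse = np.mean(errors ** 2)                      # mean squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination

print(mae, mse, r2)
```

Keep in mind that the target was standardized before the split, so the MAE and MSE reported by `show_scores` are in standardized units rather than in numbers of transports; R2 is unaffected by this scaling.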

#### 6.1 ANN

In [None]:
show_scores(model)

**Conclusion**: The ANN performed very well, reaching an R2 score of about 99% on both the training and the test data.

#### 6.2 RandomForest Regressor

In [None]:
show_scores(model_rfr)

#### 6.3 Hyperparameter tuning with RandomizedSearchCV


In [None]:
# Evaluate the RandomizedSearch model
show_scores(rs_model)

#### Train a model with the best hyperparameters


In [None]:
# Scores for ideal_model (trained on the training set)
show_scores(ideal_model)

#### Make predictions on test data


In [None]:
# Make predictions on the test dataset
test_preds = ideal_model.predict(X_test)

In [None]:
# Find feature importance of our best model
ideal_model.feature_importances_

#### Final Result

| ML/DL model | R2 Score (train/test) % | MAE (train/test) | MSE (train/test) |
| :--- | :----: | ---: | ---: |
| Neural Network (ANN) | 99.78/99.80 | 0.0080/0.0073 | 0.0022/0.0014 |
| Random forest | 99.9753/99.9744 | 0.1190/0.118 | 7.171/5.226 |
| Random forest with best hyperparameters | 99.63/99.46 | 0.4366/0.397 | 106.18/108.45 |



**Conclusion**: All the models performed very well, each reaching an R2 score above 99%. The untuned random forest has the highest R2 score on both the training and the test data, so by that metric it is the best model for this case.

## 7. Feature Importance

Feature selection is a critical step for most data science projects, as it enables models to train faster, reduces complexity, and makes the models easier to interpret. It has the potential to improve model performance and reduce overfitting if a good set of features is chosen. In this data science assignment, we did not apply this step before modelling, because the data is not too large and it is fairly easy to find the most relevant columns in the dataset.

The next step is to perform feature selection using the following techniques:

- Select feature importance from random forest
- Select the features identified by Lasso regression
- Select features based on absolute value of beta coefficients of features


In [None]:
#DataFrame
X_df = data[Predictors]
y_df = data[Target]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)


In [None]:
# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importances": importances})
          .sort_values("feature_importances", ascending=False)
          .reset_index(drop=True))
    
    # Plot the top n features
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature importance")
    ax.invert_yaxis()

In [None]:
plot_features(X_train.columns, ideal_model.feature_importances_)


#### Method 1: Variable Importance from Random Forest
Random forests consist of multiple decision trees, each of them built over a random sample of the observations from the dataset and a random sample of the features. This randomness reduces the correlation between the trees and therefore makes the ensemble less susceptible to overfitting. For forecasting exercises, we use the variable importance of the random forest. Note that scikit-learn's `feature_importances_` attribute is impurity-based: it measures how much, on average across the trees, each variable reduces the split criterion, rather than how much the accuracy decreases when a variable is excluded.

In [None]:
# 1. Select the top n features based on feature importance from random forest

# define the model with a fixed seed so the result is reproducible
# (random.seed(10) returns None, so it cannot be used as random_state)
model = RandomForestRegressor(random_state=10)
# fit the model (ravel turns the single-column target into a 1-D array)
model.fit(X_df, y_df.values.ravel())

# get importance
importances = model.feature_importances_

feat_importances = pd.Series(importances, index=X_df.columns)
feat_importances.nlargest(30).plot(kind='barh')

In [None]:
#Final Features from Random Forest (Select Features with highest feature importance)
rf_top_features = pd.DataFrame(feat_importances.nlargest(4)).axes[0].tolist()
rf_top_features
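
The `feature_importances_` attribute of a scikit-learn forest is impurity-based. If the question is how much the score drops when a variable's information is destroyed, permutation importance is a closer match. The sketch below compares the two on hypothetical synthetic data; the data, the coefficients, and the small forest size are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
# Hypothetical target: strong effect of column 0, weak effect of column 1,
# no effect of column 2
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Impurity-based importance (what feature_importances_ reports)
print(forest.feature_importances_)

# Permutation importance: drop in R2 when a single column is shuffled
result = permutation_importance(forest, X_toy, y_toy, n_repeats=5, random_state=0)
print(result.importances_mean)
```

In practice, permutation importance is usually computed on held-out data so that it reflects generalization rather than fit to the training rows.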

#### Method 2: L1 regularisation using Lasso regression
Lasso, or L1 regularisation, has the property that it can shrink some of the coefficients in a linear regression to exactly zero. Such features can therefore be removed from the model. This is another example of an embedded method of feature selection.

In [None]:
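from sklearn.linear_model import Lasso, LassoCV
from sklearn.feature_selection import SelectFromModel

np.random.seed(10)

# The `normalize` argument is no longer available in recent scikit-learn releases,
# so the predictors are standardized explicitly (a close, not identical, substitute).
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_df), columns=X_df.columns)

estimator = LassoCV(cv=5)

sfm = SelectFromModel(estimator, prefit=False, norm_order=1, max_features=None)

# ravel turns the single-column target into a 1-D array
sfm.fit(X_scaled, y_df.values.ravel())

feature_idx = sfm.get_support()
Lasso_features = X_df.columns[feature_idx].tolist()
Lasso_features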
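
For reference, the objective that scikit-learn's Lasso minimizes can be written as follows, where $n$ is the number of samples, $X$ the predictor matrix, $y$ the target, $\beta$ the coefficient vector, and $\alpha \ge 0$ the regularisation strength (selected by cross-validation in `LassoCV`):

$$
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1
$$

The $\lVert \beta \rVert_1$ penalty is what drives some coefficients to exactly zero as $\alpha$ grows; features whose coefficients are zero are the ones dropped by the selection step.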


#### Method 3: Beta Coefficients
The absolute value of the coefficients of a standardized regression, also known as beta coefficients, can be considered a proxy for feature importance. This is a type of filter method of feature selection.

In [None]:
from sklearn.linear_model import LinearRegression

# 3. Fit a linear regression on the standardized predictors and target (X, y)
# and rank the features by the absolute value of their coefficients
sr_reg = LinearRegression(fit_intercept = False).fit(X, y)
coef_table = pd.DataFrame(list(X_df.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",sr_reg.coef_.ravel())
coef_table = coef_table.iloc[coef_table.Coefs.abs().argsort()] 


sr_data2 = coef_table.tail(10)
sr_top_features = sr_data2.iloc[:,0].tolist()
sr_top_features

#### Combining Feature Selection Methods


In [None]:
# Combining features from all the models

combined_feature_list = sr_top_features + Lasso_features + rf_top_features

combined_feature = {x:combined_feature_list.count(x) for x in combined_feature_list}
combined_feature_data = pd.DataFrame.from_dict(combined_feature,orient='index')

combined_feature_data.rename(columns={ combined_feature_data.columns[0]: "number_of_models" }, inplace = True)


combined_feature_data = combined_feature_data.sort_values(['number_of_models'], ascending=[False])

combined_feature_data.head(10)

In [None]:
#Final Features: features which were selected in at least 3 models

combined_feature_data = combined_feature_data.loc[combined_feature_data['number_of_models'] > 2]
final_features = combined_feature_data.axes[0].tolist()
final_features
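
The voting logic above can be expressed compactly with `collections.Counter`. The sketch below uses hypothetical feature lists (not the actual outputs of the three methods) to show how features selected by all three methods are kept.

```python
from collections import Counter

# Hypothetical outputs of three feature selection methods
method_a = ["NumberOfPieces", "Volume", "GrossWeight"]
method_b = ["NumberOfPieces", "GrossWeight"]
method_c = ["NumberOfPieces", "CostTotalAmount", "GrossWeight"]

# Count how many methods selected each feature
votes = Counter(method_a + method_b + method_c)

# Keep the features chosen by all three methods
selected = sorted(name for name, n in votes.items() if n == 3)
print(selected)
```

Lowering the threshold (for example to two votes) gives a less strict selection when the methods disagree.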

**The results suggest that the number of pieces is the most important feature affecting the number of transports.**

# Problem 2 : Data Analytics

## 1. Problem definition

> Do a data exploration of several transport KPIs as a function of space and time, and derive some data-driven insights that may improve business decisions.


In [None]:
#cleaned_data.sort_values(by=["DepartureDate"], inplace=True, ascending=True)
df_cleaned_data['DepartureDate'] = df_cleaned_data['DepartureDate'].apply(pd.to_datetime)

### Add datetime parameters for DepartureDate column

Why?

So we can enrich our dataset with as much information as possible.

Because we imported the data using `read_csv()` and asked pandas to parse the dates using `parse_dates=["DepartureDate"]`, we can now access the [different datetime attributes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) of the `DepartureDate` column.

In [None]:
# Add datetime parameters for DepartureDate
df_cleaned_data["DepartureDate_Year"] = df_cleaned_data.DepartureDate.dt.year
df_cleaned_data["DepartureDate_Month"] = df_cleaned_data.DepartureDate.dt.month_name()
df_cleaned_data["DepartureDate_Dayofweek"] = df_cleaned_data.DepartureDate.dt.dayofweek
df_cleaned_data["DepartureDate_Dayofyear"] = df_cleaned_data.DepartureDate.dt.dayofyear
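
As a quick illustration of the `.dt` accessor, the sketch below extracts the same attributes from three hypothetical dates inside the data period. Note that `dayofweek` counts Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Hypothetical dates within the September 2019 to August 2020 period
dates = pd.Series(pd.to_datetime(["2019-09-02", "2020-02-29", "2020-08-31"]))

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month_name(),
    "dayofweek": dates.dt.dayofweek,   # Monday = 0, Sunday = 6
    "dayofyear": dates.dt.dayofyear,
})
print(parts)
```

Derived columns such as the day of week make it straightforward to group KPIs by weekday or month in the questions that follow.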

In [None]:
df_cleaned_data.DepartureDate.min(), df_cleaned_data.DepartureDate.max()

##### Who are the top 5 Senders..?

In [None]:
df_cleaned_data['FromZone'] = df_cleaned_data['FromZone'].apply(lambda x : x[:5])
df_cleaned_data['ToZone'] = df_cleaned_data['ToZone'].apply(lambda x : x[:5])

In [None]:
# Plot the top 5 senders
sns.set(rc={'figure.figsize':(9.7,6.27)})
sns.set_style("ticks")

ax = sns.countplot(y=df_cleaned_data['FromZone'], order=df_cleaned_data['FromZone'].value_counts().iloc[:5].index)
ax.tick_params(axis='y', length=0)
plt.xlabel("", size=12)
plt.ylabel("Sender", size=12)
plt.title("Top 5 Senders", size=15)
plt.tight_layout()
plt.show()


The top five senders are clearly `50464`, `55652`, `19560`, `63346`, `43437`.


##### Who are the top 5 Receivers..?

In [None]:
ax = sns.countplot(x = df_cleaned_data['ToZone'], order = df_cleaned_data['ToZone'].value_counts().iloc[:5].index)
ax.tick_params(axis='y', length = 0)
plt.xlabel("Receiver", size = 12)
plt.ylabel("", size = 12)
plt.title("Top 5 Receivers", size = 15)
plt.tight_layout()
plt.show()

The top five receivers are clearly `50464`, `16979`, `26036`, `18334`, `11121`.



##### What are the top 5 Busiest Routes..?


In [None]:
df_cleaned_data['Route'] = df_cleaned_data['FromZone'] + "-" + df_cleaned_data['ToZone']

ax = sns.countplot(y=df_cleaned_data['Route'], order=df_cleaned_data['Route'].value_counts().iloc[:5].index)
ax.tick_params(axis='y', length=0)
plt.xlabel("", size=12)
plt.ylabel("Routes", size=12)
plt.title("Top 5 Busiest Routes", size=15)
plt.show()

**The busiest route has `50464` as the sender and `11121` as the receiver.**

##### How many Total Routes are there?

In [None]:
#Total Routes are 42120
df_cleaned_data['Route'].nunique()

##### Which Month is Busiest for the Busiest Route..?

In [None]:
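# .copy() avoids pandas' SettingWithCopyWarning when adding the Month column
Busy_Route = df_cleaned_data[df_cleaned_data['Route']=='50464-11121'].copy()
Busy_Route['Month'] = Busy_Route['DepartureDate'].dt.month_name()
# Count the transports on this route per month; the count is stored in 'Route'
Busy_Route = Busy_Route.groupby('Month', as_index=False).agg(Route=('Route', 'count'))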

In [None]:
df_cleaned_data.head()

In [None]:
import plotly.express as px

fig = px.bar(Busy_Route, x='Month', y='Route',  title="Busiest Month for the Busiest Route")
fig.show()

The busiest months for the busiest route `50464-11121` are `July` and `August`, with 536 transports in July and 535 in August.

##### Which cities are the most frequent places of destination?

In [None]:
df_Innrikes_Paket_kost['PlaceOfDestination'].value_counts().iloc[:5].index

### Conclusion
- Knowing the busiest routes helps the company make more resources available where they are needed. Recognizing where the business load is concentrated can also help attract more business.

- Any transport business has different requirements on the loading and unloading sides. Recognizing the most important postal codes on the sending and receiving ends is therefore important.

- The company can do a root cause analysis or identify busy months in certain regions to plan extra resources. Such peaks could be caused by special months in which people send gifts or cards, by tax filing deadlines, and so on; many reasons are possible.

- `STOCKHOLM` is the most frequent destination city in Sweden in this dataset, which may mean more transport-related pollution there than in other cities.

# Problem 3: Statistics

> Q: A warehouse has a fixed operating cost of 1000 euros per day. For every truck it ships, it has an additional cost of 100 euros, but also a revenue after tax of 400 euros. The number of truck shipment orders the warehouse receives every day is modelled by a Poisson distribution with lambda = 4. Due to limited capacity in trucks, space and personnel, the maximum shipments the warehouse is able to deliver every given day is 8. What is the expected profit or loss in a month of 30 days?

> A: The number of orders received per day, $N$, follows a Poisson distribution with $\lambda = 4$:

$$
P(x) = \frac{{e^{-\lambda}} \cdot {\lambda^x}}{x!}
$$

- The warehouse can ship at most 8 trucks per day, so the number of trucks shipped on a given day is $\min(N, 8)$ rather than $N$ itself.
- **Probability of receiving exactly 8 orders in a day**: $P(8) \approx$ `0.0298`. The probability of receiving 8 or more orders, in which case the capacity limit is reached, is $1 - \sum_{x=0}^{7} P(x) \approx$ `0.0511`.
- **Expected number of trucks shipped per day**: $E[\min(N, 8)] = \sum_{x=0}^{7} x \cdot P(x) + 8 \cdot P(N \ge 8) \approx 3.557 + 0.409 =$ `3.966`.
- **Profit per truck shipped**: revenue after tax minus the additional cost, `400 - 100 = 300` euros.
- **Expected profit per day**: `300 x 3.966 - 1000 ≈ 189.9` euros, after subtracting the fixed operating cost.
- **Expected result in a month of 30 days**: `30 x 189.9 ≈ 5 697` euros, which is an expected *profit*.
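
As a numerical check, the expected monthly result can be computed directly from the Poisson probabilities. The sketch below assumes that the number of trucks shipped on a day is the smaller of the orders received and the capacity of 8, and that each shipped truck contributes 400 - 100 = 300 euros; the variable names are illustrative.

```python
import math

lam = 4             # mean number of orders per day
capacity = 8        # maximum number of trucks shipped per day
fixed_cost = 1000   # euros per day
margin = 400 - 100  # revenue minus additional cost per truck, in euros
days = 30

def poisson_pmf(k, lam):
    """P(N = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# P(N < capacity) and the partial expectation over k < capacity
p_below = sum(poisson_pmf(k, lam) for k in range(capacity))
partial = sum(k * poisson_pmf(k, lam) for k in range(capacity))

# E[min(N, capacity)]: days with orders at or above capacity ship `capacity` trucks
expected_shipped = partial + capacity * (1 - p_below)

daily_profit = margin * expected_shipped - fixed_cost
monthly_profit = days * daily_profit
print(round(expected_shipped, 4), round(monthly_profit, 2))
```

Under these assumptions the expected number of trucks shipped per day is just under 4, because the capacity limit of 8 is rarely reached, and the expected monthly result is a profit of roughly 5 700 euros.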