#### Author: **Álvaro M.D.**
#### Twitter: **@alvarojonsson**
#### Master's Degree in Data Science - University of Alicante

# 1. Dataset

The selected data was published by the United Nations Statistics Division. This dataset covers import and export volumes for 5,000 commodities across most countries on Earth over the last 30 years.

Personally, I find commodities quite interesting because they not only represent one of the biggest sources of income for a country but also reveal the international behaviours and relations between countries.

Moreover, this dataset is already quite clean, so we don't have to repeat the whole cleaning process from the previous task and can focus on the main goals.

The original dataset can be found here: http://data.un.org/Explorer.aspx

In [None]:
# first, we will import all the necessary libraries

# utils
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import MiniBatchKMeans

# pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

# model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2; newer versions provide ConfusionMatrixDisplay.from_estimator

# general configuration
import warnings
warnings.filterwarnings('ignore')
pd.options.display.float_format = '{:.2f}'.format

In [None]:
# dataset loaded from Kaggle, which makes importing easier since the file is >1 GB
df_raw = pd.read_csv('../input/global-commodity-trade-statistics/commodity_trade_statistics_data.csv', sep=',', header=0)
df_raw = df_raw.sample(frac=0.05)  # from 8.2 million rows, we'll sample ~400k to reduce computing time

# 2. Exploring and preprocessing

In [None]:
# total number of rows and columns in the dataset
df_raw.shape

The original dataset contains 10 columns and 8,225,871 rows, but we have sampled a fraction (5%) and got 411,294 rows.

Description of the columns:
- **country_or_area**: country name of record
- **year**: year in which the trade has taken place
- **comm_code**: commodity code under the Harmonized Commodity Description and Coding System (generally referred to as the HS)
- **commodity**: description of a particular commodity code
- **flow**: flow of trade i.e. export, import, others
- **trade_usd**: value of the trade in usd
- **weight_kg**: weight of the commodity in kilograms
- **quantity_name**: description of the quantity measurement type given the type of item (e.g. number of items, weight, etc.)
- **quantity**: count of the quantity of a given item based on the quantity name
- **category**: category to identify commodity

In [None]:
df_raw.head()

In [None]:
# pandas info() allows us to get column names, types and non-null counts all in just one command
df_raw.info()

We find some ints and floats, and some objects. Most of the objects are plain strings, such as the country name or the commodity. Let's explore first.

In [None]:
# check how many null values we got
df_raw.isnull().sum()

Only 'weight_kg' and 'quantity' have null values. We'll compute the percentage of missing values to decide whether we can simply remove these rows, but first we'll explore a bit to get a better understanding of the dataset.

Now we will check how balanced the classes are.

In [None]:
df_raw['country_or_area'].value_counts().plot(kind='bar')

In [None]:
df_raw['category'].value_counts().plot(kind='bar')

In [None]:
df_raw['flow'].value_counts().plot(kind='bar')

Since 'flow' is the target of one of the problems we have chosen to tackle, we will balance its classes with an under-sampling step before training. Libraries such as imblearn provide this, although in this notebook the resampling is done directly with pandas.

In [None]:
# groups null values by country, gets size of each group, sort in descending order, get first 10 rows
total_nulls_country = df_raw[df_raw['weight_kg'].isnull()].groupby('country_or_area').size().sort_values(ascending=False)
total_nulls_country.head(10)

In [None]:
# repeat but grouping categories
total_nulls_category = df_raw[df_raw['weight_kg'].isnull()].groupby('category').size().sort_values(ascending=False)
total_nulls_category.head(10)

In [None]:
# % of missing rows for each column
missing_percentage = df_raw.isnull().sum() * 100 / len(df_raw)
missing_percentage

In [None]:
# number of unique values for each variable
df_raw.nunique(axis=0)

After some exploration, we checked that missing values only correspond to certain categories, where we assume data is more difficult to obtain, retain or update.

Moreover, since the total percentage of missing values of both 'weight_kg' and 'quantity' is lower than 4%, and taking into account that we cannot discover or assume the data (i.e. we cannot randomly write the number of kilograms that were sent 20 years ago), we will just remove these rows.

In [None]:
# drop rows with at least 1 NA value, get new shape
df = df_raw.dropna()

In [None]:
# cleaning results
print(f'Rows before cleaning: {df_raw.shape[0]}')
print(f'Rows after cleaning: {df.shape[0]}')
print(f'Rows deleted: {df_raw.shape[0] - df.shape[0]}.')
print(f'Percentage of rows deleted: {((df_raw.shape[0] - df.shape[0]) * 100) / df_raw.shape[0]:.2f}%')

In [None]:
# statistical summary of numeric variables
df.describe()

At first sight, values look quite normal with the exception of the maximum values. Since we're not experts on international commodity trading, we can use plots to discover how far the outliers are from the other values.

In [None]:
sns.boxplot(x=df['trade_usd'])

In [None]:
sns.boxplot(x=df['quantity'])

In [None]:
sns.boxplot(x=df['weight_kg'])

This confirms there is at least one outlier in the data. However, apparently only one value is much larger than the rest. What if we plot the same data after removing only the maximum value?

In [None]:
aux = df[~(df['quantity']>=df['quantity'].max())]
sns.boxplot(x=aux['quantity'])

In [None]:
aux = df[~(df['weight_kg']>=df['weight_kg'].max())]
sns.boxplot(x=aux['weight_kg'])

In [None]:
# remove the max outlier
df_clean = df[~(df['quantity']>=df['quantity'].max())]

Another way of removing outliers is to use the IQR score. We will first calculate the IQR (Q3 - Q1) and then remove every row with a value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
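
To make the rule concrete, here is a minimal sketch on a toy column. The values are hypothetical and are not taken from the UN dataset; the sketch only illustrates how the two bounds are derived and applied.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the UN dataset)
toy = pd.DataFrame({"quantity": [10, 12, 11, 13, 12, 500]})

q1 = toy["quantity"].quantile(0.25)
q3 = toy["quantity"].quantile(0.75)
iqr = q3 - q1

# Keep only the rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
toy_filtered = toy[toy["quantity"].between(lower, upper)]
print(toy_filtered["quantity"].tolist())
```

Only the value 500 falls outside the bounds in this toy example, so it is the only row dropped.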

In [None]:
Q1 = df.quantile(0.25, numeric_only=True)  # quantiles are only defined for numeric columns
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)

In [None]:
# get cleaned dataframe without outliers based on IQR (numeric columns only)
num_cols = Q1.index
df_iqr = df[~((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

In [None]:
sns.boxplot(x=df_iqr['quantity'])

In [None]:
sns.boxplot(x=df_clean['trade_usd'])

Outliers have been successfully removed. Let's compare the shape of our current dataset versus the previous one.

In [None]:
# cleaning results
print(f'Rows before cleaning: {df.shape[0]}')
print(f'Rows after cleaning: {df_iqr.shape[0]}')
print(f'Rows deleted: {df.shape[0] - df_iqr.shape[0]}.')
print(f'Percentage of rows deleted: {((df.shape[0] - df_iqr.shape[0]) * 100) / df.shape[0]:.2f}%')

Oops! We have just removed about 20% of the rows. How is this possible? They were supposed to be outliers.

If we go back to the exploration step, we can see there are several commodities in different categories. Each category has its own range of weight and quantity, which means we cannot generalize the IQR score to the whole dataset. This can be seen if we plot the number of rows per category.

In [None]:
df.groupby('category').size().sort_values(ascending=False).plot.density();

The plot shows that different categories have different densities of rows in the dataset. Hence, we will discard the previous IQR filter and keep working with the dataset from which only the maximum value was removed.
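
The effect of category-specific ranges can be reproduced on a toy example. This is only a sketch under assumptions: the category names and weights below are made up, and the notebook itself does not apply a per-category filter. It compares a single global IQR filter with one computed separately inside each category.

```python
import pandas as pd

# Toy data (hypothetical): a frequent light category and a rare heavy one
toy = pd.DataFrame({
    "category": ["grain"] * 2 + ["jewels"] * 8,
    "weight_kg": [1000, 1100, 1, 2, 1.5, 2.5, 2, 1.8, 2.2, 1.2],
})

# Global bounds: one Q1/Q3 pair for the whole column
q1, q3 = toy["weight_kg"].quantile(0.25), toy["weight_kg"].quantile(0.75)
iqr = q3 - q1
global_mask = toy["weight_kg"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Per-category bounds: Q1/Q3 computed inside each category
grouped = toy.groupby("category")["weight_kg"]
q1_c = grouped.transform(lambda s: s.quantile(0.25))
q3_c = grouped.transform(lambda s: s.quantile(0.75))
iqr_c = q3_c - q1_c
per_category_mask = toy["weight_kg"].between(q1_c - 1.5 * iqr_c, q3_c + 1.5 * iqr_c)

print(f"Rows kept with global bounds: {global_mask.sum()}")
print(f"Rows kept with per-category bounds: {per_category_mask.sum()}")
```

With global bounds, every row of the rare heavy category is flagged as an outlier even though its values are normal for that category.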

After that, we can scale the data to avoid the noise and extreme values from different categories and ranges.

# 3. Definition of the problem

Now that the data was processed and cleaned, we can define the problem or questions that we can answer with the dataset.

The requirements for the job are:

**Part 1**
- perform a regression problem
- perform a classification problem

**Part 2**
- perform an ensemble problem
- perform a clustering problem

**General tasks**
- try a variety of algorithms and compare results
- interpret the results with error values, metrics, confusion matrix and ROC curve (confusion matrix and ROC curve only for classification)

Now that requirements are clear, let's define our goals for each of the algorithms to implement.

- **Regression**: we will focus on the value of the "trade_usd" to check how much a product or category will increase its import/export for the next years.
- **Classification**: sometimes a row will have missing data about the flow of the trade. We will do a multiclass classification to predict which class of trade was done, such as "Import", "Export", "Re-Import" or "Re-Export".
- **Ensemble**: we will repeat the classification problem but using an ensemble algorithm that improves previous results.
- **Clustering**: the "commodity" column has 5031 different unique values that correspond to the specific description of each category. We will cluster look-alike commodities by their category description in order to have more specific information of the import/export trades each country is operating.

# 4. Preparation of the Data for Machine Learning Algorithms

Before applying any ML algorithm we have to prepare the data, in our case by:

- handling text and categorical attributes
- performing feature scaling

This will allow us to create a transformation pipeline that will execute the needed steps in the right order every time.

In [None]:
# let's check the data again
df = df_clean.copy()
df.head()

### Handling text and categorical attributes

We have several attributes to handle: "country_or_area", "flow", "category". There's no need to handle "commodity" since we also have "comm_code".

In [None]:
cat_encoder = OneHotEncoder()

# one-hot encode text/categorical attributes
country_cat_1hot = cat_encoder.fit_transform(df[['country_or_area']])
flow_cat_1hot = cat_encoder.fit_transform(df[['flow']])
category_cat_1hot = cat_encoder.fit_transform(df[['category']])
category_cat_1hot

### Feature scaling
Since many Machine Learning algorithms don't perform well when the input numerical attributes have very different scales, and we have already explored the data to confirm this is our case, we will perform feature scaling. For this problem we will use a normalization technique called min-max scaling.
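
Min-max scaling maps each value to x' = (x - min) / (max - min), so every column ends up in the range [0, 1]. The sketch below, on made-up numbers, checks that the manual formula matches `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column (hypothetical values)
x = np.array([[2.0], [4.0], [10.0]])

# Manual min-max scaling, column by column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Same transformation with scikit-learn
scaled_toy = MinMaxScaler().fit_transform(x)

print(manual.ravel(), scaled_toy.ravel())
```

Note that a single extreme value stretches the denominator and squeezes all the other values towards 0, which is one reason to look at outliers before scaling.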

In [None]:
scaler = MinMaxScaler()
data = df[['trade_usd', 'weight_kg', 'quantity']]
scaled = scaler.fit_transform(data)
print(scaled)

### Transformation Pipeline

So far, we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a pipeline to handle all the transformations. We will use Scikit-Learn ColumnTransformer for this purpose.

In [None]:
def transform_data(num_at, cat_at, dataframe):
    """Fit a scaling + one-hot encoding pipeline on the input df; return the transformed data and the fitted pipeline."""
    pipeline = ColumnTransformer([
        ('num', MinMaxScaler(), num_at),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_at), #ignore errors because dataset is huge and might encounter new categories
    ])
    return pipeline.fit_transform(dataframe), pipeline
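
As a small illustration of what `handle_unknown='ignore'` does, the sketch below fits the same kind of `ColumnTransformer` on a toy training frame and then transforms a row whose category was never seen during fitting. The frames are hypothetical, and `sparse_threshold=0` is set here only to force a dense output that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frames (hypothetical values)
train = pd.DataFrame({"weight_kg": [1.0, 5.0, 9.0],
                      "flow": ["Import", "Export", "Import"]})
test = pd.DataFrame({"weight_kg": [3.0], "flow": ["Re-Export"]})

ct = ColumnTransformer([
    ("num", MinMaxScaler(), ["weight_kg"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["flow"]),
], sparse_threshold=0)  # always return a dense array

X_train_toy = ct.fit_transform(train)  # learns min/max and the known categories
X_test_toy = ct.transform(test)        # reuses them; the unseen category becomes all zeros
print(X_test_toy)
```

The test row keeps the same number of columns as the training data, which is why the fitted pipeline, and not a newly fitted one, should be reused on the test set.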

In [None]:
# utility functions to improve prints
def display_scores(scores):
    print(f"Scores: {np.round(scores/1000000, decimals=2)}")
    print(f"RMSE: {to_millions(scores.mean()):.2f}")
    print(f"Standard deviation: {scores.std()/1000000:.2f}")

def to_millions(usd):
    return round(usd/1000000, 2)

### Train and Test set

We will use Scikit-Learn's native function "train_test_split" to split the dataset into training and test subsets.

In [None]:
train_set, test_set = train_test_split(df, test_size=0.3, random_state=42)

With the next function, we will choose which columns we want as inputs and outputs for each model and prepare the data for the algorithm fitting and evaluation steps.

In [None]:
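def prepare_data(dataset, chosen_column, df_num_attribs, df_cat_attribs, test=False):
    """Split `dataset` into inputs and target (`chosen_column`).

    For training data (test=False) the inputs are also passed through a newly
    fitted transformation pipeline, which is returned for later reuse.
    For test data (test=True) the raw inputs are returned untransformed.
    """
    df_input = dataset.drop(chosen_column, axis=1)
    df_output = dataset[chosen_column].copy()

    if not test:
        df_prepared, pipeline = transform_data(df_num_attribs, df_cat_attribs, df_input)
        return df_prepared, df_output, pipeline
    else:
        return df_input, df_output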

# 5. Regression

We will prepare the data for the regression problem and use the previously designed pipeline to perform scaling and one-hot encoding. As defined in section 3, we will perform regression on the 'trade_usd' column.

In [None]:
df_prepared, df_output, pipeline = prepare_data(train_set,
                          'trade_usd',
                          ['weight_kg', 'quantity'],
                          ['country_or_area', 'flow', 'category'])

Before fitting any model, let's check what's the mean value of the column we want to predict. This will help us with the interpretation of the evaluation results.

In [None]:
# Mean of 'trade_usd' in millions
print(f"Mean of 'trade_usd': {round(df['trade_usd'].mean()/1000000, 2)} millions")

### Linear Regression

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(df_prepared, df_output)

With the model fitted, let's evaluate:

In [None]:
# evaluation of the model
df_predictions = lin_reg.predict(df_prepared)
lin_mse = mean_squared_error(df_output, df_predictions)
lin_rmse = np.sqrt(lin_mse)
print(f'R2: {r2_score(df_output, df_predictions):.3f}')
print(f'RMSE: {to_millions(lin_rmse)}')

From the R2 we can say the model is not capturing all the variation in the data. Moreover, the RMSE is quite high. However, we'll also perform cross-validation to get the mean RMSE across all the data splits, together with its standard deviation, for a better understanding of the model results.
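
A note on the scoring convention: scikit-learn scorers follow "greater is better", so `neg_mean_squared_error` returns the negated MSE of each fold. The sketch below, on synthetic data from `make_regression` (an assumption for illustration, not the trade data), shows the sign flip needed before taking the square root.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, used only for illustration
X_toy, y_toy = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          scoring="neg_mean_squared_error", cv=5)
rmse_per_fold = np.sqrt(-neg_mse)  # negate first: the scores are <= 0

print(f"RMSE per fold: {np.round(rmse_per_fold, 2)}")
print(f"Mean: {rmse_per_fold.mean():.2f}, std: {rmse_per_fold.std():.2f}")
```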

In [None]:
scores = cross_val_score(lin_reg, df_prepared, df_output, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-scores)
display_scores(lin_rmse_scores)

With a mean of 21 million, an RMSE of 503.85 million is really bad.

We can already foresee that the results will be bad, but let's evaluate our model on the test set anyway.

In [None]:
X_test, y_test = prepare_data(test_set,
                          'trade_usd',
                          ['weight_kg', 'quantity'],
                          ['country_or_area', 'flow', 'category'], True)
X_test_prepared = pipeline.transform(X_test)
final_predictions = lin_reg.predict(X_test_prepared)
lin_mse = mean_squared_error(y_test, final_predictions)
lin_rmse = np.sqrt(lin_mse)
print(f'R2: {r2_score(y_test, final_predictions):.3f}')
print(f'RMSE: {to_millions(lin_rmse)}')

In [None]:
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                        loc=squared_errors.mean(),
                        scale=stats.sem(squared_errors)))

print(f'Confidence interval: {np.round(interval/1000000, decimals=2)}')

With the given R2, the model only explains ~10% of the variance. Taking into consideration the confidence interval, an RMSE of 481 million is too large.

Therefore, we can conclude that the model is not generalizing well and that Linear Regression is not a valid model for this dataset.

### Support Vector Regression

In [None]:
svr_reg = SVR(max_iter=1000, coef0=2, C=50, kernel="poly")
svr_reg.fit(df_prepared, df_output)

In [None]:
# evaluation of the model
df_predictions = svr_reg.predict(df_prepared)
svr_mse = mean_squared_error(df_output, df_predictions)
svr_rmse = np.sqrt(svr_mse)
print(f'R2: {r2_score(df_output, df_predictions):.3f}')
print(f'RMSE: {to_millions(svr_rmse)}')

With the SVR model, R2 tells us the model can only explain a limited amount of the variance in the data, and the RMSE is too high. Let's repeat cross-validation for a more exhaustive exploration to be sure this result is accurate.

In [None]:
scores = cross_val_score(svr_reg, df_prepared, df_output, scoring="neg_mean_squared_error", cv=10)
svr_rmse_scores = np.sqrt(-scores)
display_scores(svr_rmse_scores)

In [None]:
X_test, y_test = prepare_data(test_set,
                          'trade_usd',
                          ['weight_kg', 'quantity'],
                          ['country_or_area', 'flow', 'category'], True)
X_test_prepared = pipeline.transform(X_test)
final_predictions = svr_reg.predict(X_test_prepared)
svr_mse = mean_squared_error(y_test, final_predictions)
svr_rmse = np.sqrt(svr_mse)
print(f'R2: {r2_score(y_test, final_predictions):.3f}')
print(f'RMSE: {to_millions(svr_rmse)}')

In [None]:
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                        loc=squared_errors.mean(),
                        scale=stats.sem(squared_errors)))

print(f'Confidence interval: {np.round(interval/1000000, decimals=2)}')

Again, the model cannot fit the data properly: it shows a low R2 on both the training and test sets and a huge RMSE, which makes the model useless so far.

Although the results are not good, we can draw some conclusions from this regression problem:
- Data is not showing linear dependencies
- There must be other techniques we could apply to improve the results, such as bucketing ranges of 'trade_usd' or using a deep learning model like an LSTM
- Huge datasets make the job even harder and are more difficult to assess
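
As a rough idea of what bucketing 'trade_usd' could look like, the sketch below uses `pd.qcut` to split a toy series into three equally populated ranges. The values, the number of buckets and the labels are all assumptions for illustration; this step is not part of the notebook.

```python
import pandas as pd

# Toy trade values in USD (hypothetical)
trade = pd.Series([1_000, 5_000, 20_000, 150_000, 2_000_000, 90_000_000],
                  name="trade_usd")

# Quantile-based buckets: each bucket receives the same number of rows
buckets = pd.qcut(trade, q=3, labels=["low", "medium", "high"])
print(buckets.tolist())
```

Predicting the bucket would turn the regression into a classification problem, which is less sensitive to the extreme values seen in this column.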

# 6. Classification

As stated in section 3, sometimes a row will have missing data about the flow of the trade. We will do a multiclass classification to predict which class of trade was done, such as "Import", "Export", "Re-Import" or "Re-Export".

In [None]:
# choosing attributes for classification
df_prepared, df_output, pipeline = prepare_data(train_set,
                          'flow',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'category'])

X_test, y_test = prepare_data(test_set,
                          'flow',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'category'], True)

### Logistic Regression

In [None]:
lr_clf = LogisticRegression(C=100, class_weight='balanced')
lr_clf.fit(df_prepared, df_output)

In [None]:
# predict train and test
X_test_prepared = pipeline.transform(X_test)
test_predicted = lr_clf.predict(X_test_prepared)

# evaluate with confusion matrix
plot_confusion_matrix(lr_clf, X_test_prepared, y_test)

In [None]:
precision = precision_score(y_test, test_predicted, average='weighted')
accuracy = lr_clf.score(X_test_prepared, y_test)
f1_score_ = f1_score(y_test, test_predicted, average='weighted')

print(f'Accuracy: {round(accuracy, 2)}')
print(f'Precision: {round(precision, 2)}')
print(f'F1: {round(f1_score_, 2)}')

Looking at the confusion matrix, precision and accuracy, we can say the model is not good at all. Most of the classes are misclassified. This is due to an imbalanced dataset: even though we set the class_weight hyperparameter to 'balanced' to adjust the weights automatically, the model is generalizing to the two most frequent labels, Export and Import.
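
Weighted averages can hide this kind of failure, because rare classes contribute little to the average. The sketch below uses made-up labels in which the rarest class is never predicted and compares the per-class F1 (`average=None`) with the weighted F1.

```python
from sklearn.metrics import f1_score

# Toy labels (hypothetical): the rarest class is never predicted
y_true_toy = ["Import"] * 6 + ["Export"] * 3 + ["Re-Import"]
y_pred_toy = ["Import"] * 6 + ["Export"] * 3 + ["Import"]

labels = ["Import", "Export", "Re-Import"]
per_class_f1 = f1_score(y_true_toy, y_pred_toy, labels=labels,
                        average=None, zero_division=0)
weighted_f1 = f1_score(y_true_toy, y_pred_toy, labels=labels,
                       average="weighted", zero_division=0)

for name, score in zip(labels, per_class_f1):
    print(f"F1 for {name}: {score:.2f}")
print(f"Weighted F1: {weighted_f1:.2f}")
```

The weighted F1 stays high even though the F1 for 'Re-Import' is zero, so it is worth inspecting per-class scores next to the confusion matrix.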

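To tackle the imbalanced dataset, we are going to resample it so that every class has the same number of rows as the least frequent one. Bear in mind that a smaller sample may contain fewer distinct countries and categories, so one-hot encoding it can produce fewer features than the original dataset, and a model cannot be applied to inputs with a different number of features.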

First, let's check how imbalanced the classes are.

In [None]:
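# value_counts() sorts by frequency; the unpacking assumes the order Import, Export, Re-Export, Re-Import
imp, exp, rexp, reimp = df['flow'].value_counts()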
df['flow'].value_counts().plot(kind='bar')

Second, we will split the dataset by class and sample each class down to the size of the least frequent one.

In [None]:
df_import = df[df['flow'] == 'Import']
df_export = df[df['flow'] == 'Export']
df_re_import = df[df['flow'] == 'Re-Import']
df_re_export = df[df['flow'] == 'Re-Export']

In [None]:
df_re_export_under = df_re_export.sample(reimp, replace=True)
df_import_under = df_import.sample(reimp, replace=True)
df_export_under = df_export.sample(reimp, replace=True)
df_under = pd.concat([df_import_under, df_export_under, df_re_export_under, df_re_import], axis=0)
df_under['flow'].value_counts().plot(kind='bar');
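
The same balancing can be written more compactly with `DataFrameGroupBy.sample` (available in pandas 1.1 and later). This is a sketch on a toy frame with hypothetical values; unlike the cell above, it samples without replacement.

```python
import pandas as pd

# Toy frame (hypothetical): three classes with 6, 4 and 2 rows
toy = pd.DataFrame({
    "flow": ["Import"] * 6 + ["Export"] * 4 + ["Re-Import"] * 2,
    "trade_usd": range(12),
})

# Size of the least frequent class
n_min = toy["flow"].value_counts().min()

# Draw n_min rows from every class
balanced = toy.groupby("flow").sample(n=n_min, random_state=42)
print(balanced["flow"].value_counts())
```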

Now that the classes have been balanced, we can train and evaluate the model again.

In [None]:
cls_train_set, cls_test_set = train_test_split(df_under, random_state=42)

cls_df_prepared, cls_df_output, cls_pipeline = prepare_data(df,
                          'flow',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'category'])

cls_X_test, cls_y_test = prepare_data(cls_test_set,
                          'flow',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'category'], True)

In [None]:
lr_clf = LogisticRegression(C=100, class_weight='balanced')
lr_clf.fit(cls_df_prepared, cls_df_output)

# predict train and test
X_test_prepared = cls_pipeline.transform(cls_X_test)  # use the pipeline fitted together with this model's training data
test_predicted = lr_clf.predict(X_test_prepared)

# evaluate with confusion matrix
plot_confusion_matrix(lr_clf, X_test_prepared, cls_y_test)

In [None]:
precision = precision_score(cls_y_test, test_predicted, average='weighted')
accuracy = lr_clf.score(X_test_prepared, cls_y_test)
f1_score_ = f1_score(cls_y_test, test_predicted, average='weighted')

print(f'Precision: {round(precision, 2)}')
print(f'Accuracy: {round(accuracy, 2)}')
print(f'F1: {round(f1_score_, 2)}')

After balancing the dataset, we find that precision is even lower than before. The main reason is that the classes were so imbalanced that the model now overfits to the rarest class ('Re-Import'); nevertheless, the resampling has improved both accuracy and F1.

### Gaussian Naive Bayes

In [None]:
gn_clf = GaussianNB()
gn_clf.fit(df_prepared.toarray(), df_output)  # GaussianNB needs a dense ndarray; np.matrix from todense() is rejected by recent scikit-learn

In [None]:
# predict train and test
train_predicted = gn_clf.predict(df_prepared.toarray())
X_test_prepared = pipeline.transform(X_test)
test_predicted = gn_clf.predict(X_test_prepared.toarray())

plot_confusion_matrix(gn_clf, X_test_prepared.toarray(), y_test)

In [None]:
precision = precision_score(y_test, test_predicted, average='weighted')
accuracy = gn_clf.score(X_test_prepared.toarray(), y_test)
f1_score_ = f1_score(y_test, test_predicted, average='weighted')

print(f'Precision: {round(precision, 2)}')
print(f'Accuracy: {round(accuracy, 2)}')
print(f'F1: {round(f1_score_, 2)}')

With a probabilistic approach, imbalanced classes make the model overfit to the most frequent ones. Both accuracy and F1 are really low and therefore the model is not valid for our purpose.

### Stochastic Gradient Descent Classifier

We're going to use SGD instead of SVC because the fit time of SVC scales at least quadratically with the number of samples and may be impractical with our dataset, which contains more than 400k samples.

In [None]:
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3)
sgd_clf.fit(df_prepared, df_output)

# predict train and test
X_test_prepared = pipeline.transform(X_test)
test_predicted = sgd_clf.predict(X_test_prepared)

# evaluate with confusion matrix
plot_confusion_matrix(sgd_clf, X_test_prepared, y_test)

In [None]:
precision = precision_score(y_test, test_predicted, average='weighted')
accuracy = sgd_clf.score(X_test_prepared, y_test)
f1_score_ = f1_score(y_test, test_predicted, average='weighted')

print(f'Precision: {round(precision, 2)}')
print(f'Accuracy: {round(accuracy, 2)}')
print(f'F1: {round(f1_score_, 2)}')

Let's try it again with the balanced dataset:

In [None]:
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3)
sgd_clf.fit(cls_df_prepared, cls_df_output)

# predict train and test
X_test_prepared = cls_pipeline.transform(cls_X_test)  # use the pipeline fitted together with this model's training data
test_predicted = sgd_clf.predict(X_test_prepared)

# evaluate with confusion matrix
plot_confusion_matrix(sgd_clf, X_test_prepared, cls_y_test);

In [None]:
precision = precision_score(cls_y_test, test_predicted, average='weighted')
accuracy = sgd_clf.score(X_test_prepared, cls_y_test)
f1_score_ = f1_score(cls_y_test, test_predicted, average='weighted')

print(f'Precision: {round(precision, 2)}')
print(f'Accuracy: {round(accuracy, 2)}')
print(f'F1: {round(f1_score_, 2)}')

According to the metrics, with proper balancing of the data and tuning of the SGD classifier we could further improve the results.

However, as with regression, we can confirm this dataset is difficult to work with and overall results are not good.

# 7. Ensemble

For the ensemble problem, we will apply an averaging method (Bagging) that builds several estimators and averages their predictions. In theory, this performs better than any single base estimator because its variance is reduced.
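
The variance-reduction idea can be tried on synthetic data. The sketch below compares a single decision tree with a bagged ensemble of trees using cross-validation; the data comes from `make_classification` and the hyperparameters are arbitrary choices for illustration. The ensemble often scores higher, but this is not guaranteed for every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem, used only for illustration
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=25, random_state=0)

# Mean accuracy over 5 folds for each model
tree_acc = cross_val_score(single_tree, X_toy, y_toy, cv=5).mean()
bagged_acc = cross_val_score(bagged_trees, X_toy, y_toy, cv=5).mean()

print(f"Single tree accuracy: {tree_acc:.3f}")
print(f"Bagged trees accuracy: {bagged_acc:.3f}")
```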

In [None]:
# choosing attributes for classification
df_prepared, df_output, pipeline = prepare_data(train_set,
                          'flow',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'category'])

X_test, y_test = prepare_data(test_set,
                          'flow',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'category'], True)

In [None]:
num_models=10
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2),
                            n_estimators=num_models)
ens_clf = bagging.fit(df_prepared, df_output)

In [None]:
# predict train and test
X_test_prepared = pipeline.transform(X_test)
test_predicted = ens_clf.predict(X_test_prepared)

# evaluate with confusion matrix
plot_confusion_matrix(ens_clf, X_test_prepared, y_test);

In [None]:
precision = precision_score(y_test, test_predicted, average='weighted')
accuracy = ens_clf.score(X_test_prepared, y_test)
f1_score_ = f1_score(y_test, test_predicted, average='weighted')

print(f'Precision: {round(precision, 2)}')
print(f'Accuracy: {round(accuracy, 2)}')
print(f'F1: {round(f1_score_, 2)}')

In [None]:
train_err = (ens_clf.predict(df_prepared) != df_output).mean()
print(f'Train error: {train_err:.1%}')

Given this train error and the overall results, we can try under-sampling the data to see if we can improve them. To do so, we have to balance the data to the least frequent class, 'Re-Import'.

In [None]:
df['flow'].value_counts().plot(kind='bar');

In [None]:
df_import = df[df['flow'] == 'Import']
df_export = df[df['flow'] == 'Export']
df_re_import = df[df['flow'] == 'Re-Import']
df_re_export = df[df['flow'] == 'Re-Export']

In [None]:
df_re_export_under = df_re_export.sample(reimp, replace=True)
df_import_under = df_import.sample(reimp, replace=True)
df_export_under = df_export.sample(reimp, replace=True)
df_under = pd.concat([df_import_under, df_export_under, df_re_export_under, df_re_import], axis=0)
df_under['flow'].value_counts().plot(kind='bar');

Now that the dataset was under-sampled we can try to fit the model again and compare results. But we have to prepare the data first.

In [None]:
cls_train_set, cls_test_set = train_test_split(df_under, random_state=42)

df_prepared, df_output, pipeline = prepare_data(cls_train_set,
                          'flow',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'category'])

X_test, y_test = prepare_data(cls_test_set,
                          'flow',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'category'], True)

In [None]:
num_models=10
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=num_models)
ens_clf = bagging.fit(df_prepared, df_output)

In [None]:
# predict train and test
X_test_prepared = pipeline.transform(X_test)
test_predicted = ens_clf.predict(X_test_prepared)

# evaluate with confusion matrix
plot_confusion_matrix(ens_clf, X_test_prepared, y_test);

In [None]:
precision = precision_score(y_test, test_predicted, average='weighted')
accuracy = ens_clf.score(X_test_prepared, y_test)
f1_score_ = f1_score(y_test, test_predicted, average='weighted')

print(f'Precision: {round(precision, 2)}')
print(f'Accuracy: {round(accuracy, 2)}')
print(f'F1: {round(f1_score_, 2)}')

Now that the dataset is balanced, we have achieved a better result not only in accuracy but also across all the classes, since F1 is higher than before.

# 8. Clustering

We will cluster look-alike commodities by their category description in order to have more specific information of the import/export trades each country is operating.

Since our dataset is huge, we will reduce computation times by applying the Mini Batch version of KMeans.

In [None]:
# choosing attributes for clustering
df_prepared, df_output, pipeline = prepare_data(train_set,
                          'category',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'flow'])

X_test, y_test = prepare_data(test_set,
                          'category',
                          ['weight_kg', 'trade_usd', 'quantity'],
                          ['country_or_area', 'flow'], True)

### K-Means

KMeans needs to know in advance how many clusters to form. To determine the best K, we can compare the results for different K values with the Elbow method, which uses the sum of squared distances between each point and the centroid of its cluster.
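
The Elbow method can be sketched on synthetic blobs, where the true number of clusters is known. The data from `make_blobs` and the range of K below are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 underlying groups, used only for illustration
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# inertia_ is the sum of squared distances of samples to their closest centroid
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"K={k}: SSE={value:.1f}")
```

The SSE keeps shrinking as K grows, so the value itself is not the criterion; the "elbow" is the K after which adding clusters brings only small improvements.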

In [None]:
sse = {}
for k in range(1, 10):
    kmeans = MiniBatchKMeans(n_clusters=k, max_iter=1000).fit(df_prepared)
    sse[k] = kmeans.inertia_  # sum of squared distances of samples to their closest centroid

plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

With the elbow method we can determine that the best K is 3 or 5. Since this is a learning exercise, we will use 3 to save computing time.

Since we're not sure which K to choose between 3 and 5, a more precise approach is to use the silhouette score, which is the mean silhouette coefficient over all the instances.
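
A minimal sketch of the silhouette comparison on synthetic blobs follows. The data and the range of K are assumptions for illustration. Because the score relies on pairwise distances, `silhouette_score` also accepts a `sample_size` argument to evaluate it on a random subset, which can help on large datasets.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data, used only for illustration
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    # Evaluate on a random subset of 200 points to bound the cost
    scores[k] = silhouette_score(X_toy, labels, sample_size=200, random_state=0)

best_k = max(scores, key=scores.get)  # higher silhouette is better
print(scores, best_k)
```

The score ranges from -1 to 1; values close to 1 mean that points sit well inside their own cluster and far from the neighbouring ones.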

In [None]:
from sklearn.metrics import silhouette_score

sl = {}
for k in range(3, 6):
    kmeans = MiniBatchKMeans(n_clusters=k, max_iter=1000).fit(df_prepared)
    sl[k] = silhouette_score(df_prepared, kmeans.labels_)

plt.figure()
plt.plot(list(sl.keys()), list(sl.values()))
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette score")
plt.show()

In [None]:
kmeans = MiniBatchKMeans(n_clusters = 3, max_iter=1000).fit(df_prepared)

In [None]:
X_test_prepared = pipeline.transform(X_test)
cluster_predictions = kmeans.predict(X_test_prepared)

Now that we have checked that everything is working, we can perform a PCA to reduce the feature space to two dimensions in order to plot the data and the clusters. Without this reduction we would not be able to interpret the clustering results visually.
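
Before trusting a 2-D plot, it helps to check how much of the original variance the two components retain. The sketch below does this on random toy data (an assumption for illustration) through the `explained_variance_ratio_` attribute of a fitted `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (hypothetical): 5 features, two of them strongly correlated
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
X_toy[:, 1] = 2 * X_toy[:, 0] + 0.1 * X_toy[:, 1]

pca = PCA(n_components=2).fit(X_toy)
X_2d = pca.transform(X_toy)

# Fraction of the total variance kept by the two components
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")
```

If the retained fraction is low, clusters that look mixed in the plot may still be well separated in the original feature space.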

In [None]:
reduced_data = PCA(n_components=2).fit_transform(df_prepared.toarray())  # dense ndarray; np.matrix from todense() is rejected by recent scikit-learn
kmeans = MiniBatchKMeans(n_clusters=3, max_iter=1000).fit(reduced_data)

In [None]:
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation="nearest",
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired, aspect="auto", origin="lower")

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169, linewidths=3,
            color="w", zorder=10)
plt.title("K-means clustering on the commodity dataset (PCA-reduced data, K=3)\n"
          "Centroids are marked with white cross")
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

We will now try K=5, which gives a lower SSE while still being consistent with the elbow method.

In [None]:
kmeans = MiniBatchKMeans(n_clusters=5, max_iter=1000).fit(reduced_data)

In [None]:
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation="nearest",
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired, aspect="auto", origin="lower")

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169, linewidths=3,
            color="w", zorder=10)
plt.title("K-means clustering on the commodity dataset (PCA-reduced data, K=5)\n"
          "Centroids are marked with white cross")
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

The results show that with K=5 two of the centroids lie very close together. It is therefore worth trying K=4 to check whether those two clusters merge into one.

In [None]:
kmeans = MiniBatchKMeans(n_clusters=4, max_iter=1000).fit(reduced_data)

In [None]:
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation="nearest",
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired, aspect="auto", origin="lower")

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169, linewidths=3,
            color="w", zorder=10)
plt.title("K-means clustering on the commodity dataset (PCA-reduced data, K=4)\n"
          "Centroids are marked with white cross")
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
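
The three plotting cells above differ only in the value of K, so they could be folded into one helper. The following is a sketch rather than the notebook's original code: `plot_kmeans_regions` and `demo_points` are hypothetical names, and the random points stand in for `reduced_data`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans


def plot_kmeans_regions(points_2d, n_clusters, step=0.02, ax=None):
    """Fit MiniBatchKMeans on 2-D points and draw its decision regions."""
    model = MiniBatchKMeans(n_clusters=n_clusters, max_iter=1000,
                            n_init=3, random_state=42).fit(points_2d)

    x_min, x_max = points_2d[:, 0].min() - 1, points_2d[:, 0].max() + 1
    y_min, y_max = points_2d[:, 1].min() - 1, points_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))

    # Label every mesh point with the index of its nearest centroid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    if ax is None:
        _, ax = plt.subplots()
    ax.imshow(Z, interpolation="nearest",
              extent=(xx.min(), xx.max(), yy.min(), yy.max()),
              cmap=plt.cm.Paired, aspect="auto", origin="lower")
    ax.plot(points_2d[:, 0], points_2d[:, 1], "k.", markersize=2)

    # Centroids as white crosses
    centroids = model.cluster_centers_
    ax.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169,
               linewidths=3, color="w", zorder=10)
    ax.set_title(f"K-means with K={n_clusters} (PCA-reduced data)")
    ax.set_xticks(())
    ax.set_yticks(())
    return model


# Usage on random 2-D points (a stand-in for reduced_data)
demo_points = np.random.default_rng(0).normal(size=(300, 2))
demo_model = plot_kmeans_regions(demo_points, n_clusters=4, step=0.1)
```

In the notebook this would be called as `plot_kmeans_regions(reduced_data, k)` for each K of interest, followed by `plt.show()`. Returning the fitted model keeps the centroids available for later inspection.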

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

dists = euclidean_distances(kmeans.cluster_centers_)

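# pairwise Euclidean distances between the centroids
# (upper triangle only, so each pair is counted once and the diagonal is skipped)
n_centroids = len(kmeans.cluster_centers_)
tri_dists = dists[np.triu_indices(n_centroids, 1)]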
max_dist, avg_dist, min_dist = tri_dists.max(), tri_dists.mean(), tri_dists.min()
print(max_dist)
print(avg_dist)
print(min_dist)

As expected, the data now looks properly clustered, and the Euclidean distances between centroids show how far apart the clusters are. Let's train the model again with K=4, this time on the full (non-reduced) data, and predict on the test set to check that the algorithm assigns clusters.

In [None]:
kmeans = MiniBatchKMeans(n_clusters=4, max_iter=1000).fit(df_prepared)
X_test_prepared = pipeline.transform(X_test)
kmeans_predicted = kmeans.predict(X_test_prepared)
print(kmeans_predicted)

### DBSCAN

DBSCAN defines clusters as continuous regions of high density. We can use it to find clusters of arbitrary shape, modelled as dense regions in the data space, separated by sparse regions.

Let's train our model selecting 0.3 for eps and setting min_samples to 5.

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan.fit(df_prepared)

The labels_ attribute contains the cluster index assigned to each training instance; instances labelled -1 are considered noise.

In [None]:
dbscan.labels_
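
Since the raw label array is hard to read for a large dataset, a per-cluster count is often more useful. Here is a minimal sketch on a hypothetical toy array (`X_demo` is an assumption for illustration, with `min_samples` lowered to 3 to suit its size), using `np.unique` to summarise the labels.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one isolated point
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                   [20.0, 20.0]])

demo = DBSCAN(eps=0.3, min_samples=3).fit(X_demo)

# One label per sample; -1 marks noise
cluster_ids, counts = np.unique(demo.labels_, return_counts=True)
for cluster_id, count in zip(cluster_ids, counts):
    name = "noise" if cluster_id == -1 else f"cluster {cluster_id}"
    print(f"{name}: {count} points")
```

The same two lines with `dbscan.labels_` would show at a glance how many clusters were found and how much of the data was discarded as noise.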

Next, we will map each cluster to a color and plot the results.

In the chart further below, the points labelled -1 (drawn in black) are considered noise.

Since DBSCAN does not have a predict method, we can train a KNeighborsClassifier to predict which cluster a new instance belongs to.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

Now we can predict which cluster each test instance most likely belongs to, and even estimate a probability for each cluster.

In [None]:
X_test_prepared = pipeline.transform(X_test)
predict = knn.predict(X_test_prepared)
print(predict)
print(knn.predict_proba(X_test_prepared))
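
Note that a classifier trained this way assigns every new instance to some cluster, even one that lies far from all of them. One possible workaround, sketched below on hypothetical toy data, is to look up the distance to the nearest core sample with `kneighbors` and mark anything farther than a threshold as noise. Reusing `eps` as that threshold is an assumption made here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy training data: two tight groups
X_train_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                         [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

demo_dbscan = DBSCAN(eps=0.3, min_samples=3).fit(X_train_demo)
core_labels = demo_dbscan.labels_[demo_dbscan.core_sample_indices_]
demo_knn = KNeighborsClassifier(n_neighbors=3).fit(demo_dbscan.components_,
                                                   core_labels)

X_new = np.array([[0.05, 0.05], [5.05, 5.05], [50.0, 50.0]])

# Distance from each new point to its single nearest core sample
dist, idx = demo_knn.kneighbors(X_new, n_neighbors=1)
pred = core_labels[idx.ravel()]

# Assumption: anything farther than eps from every core sample is noise
pred[dist.ravel() > demo_dbscan.eps] = -1
print(pred)
```

The third point is far from both groups, so it ends up labelled -1 instead of being forced into the closest cluster.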

In [None]:
from sklearn import metrics

labels = dbscan.labels_
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[dbscan.core_sample_indices_] = True

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(df_output, labels))
print("Completeness: %0.3f" % metrics.completeness_score(df_output, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(df_output, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(df_output, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(df_output, labels))

In [None]:
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]

# Densify once, outside the loop; only the first two feature columns are plotted
X_dense = df_prepared.toarray()

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core samples: large markers
    xy = X_dense[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    # Non-core (border or noise) samples: small markers
    xy = X_dense[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

As the plot above shows, DBSCAN produces a large number of clusters, which does not match the KMeans results. Hyperparameter tuning (mainly of eps and min_samples) would be needed to improve this model if we wanted to use it. However, since this notebook is just for learning, we will not explore those improvements further.
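
For reference, a common heuristic for choosing eps is the k-distance curve: compute, for every point, the distance to its k-th nearest neighbour, sort those distances, and look for the knee. The sketch below runs on synthetic blobs (`X_demo` is a hypothetical stand-in for `df_prepared`) and is only one possible starting point for tuning, not a method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in data
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                       random_state=0)

min_samples = 5

# Querying the training set itself returns each point as its own nearest
# neighbour (distance 0), so the last column is the distance to the
# (min_samples - 1)-th other point
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

k_distances = np.sort(distances[:, -1])
print("median k-distance:", np.median(k_distances))
print("largest k-distance:", k_distances[-1])
```

Plotting `k_distances` with `plt.plot` and picking the value where the curve bends sharply upwards gives a candidate eps; points beyond the bend are the ones DBSCAN would tend to treat as noise.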

We have shown that commodities can be grouped into K=4 clusters with KMeans and by density with DBSCAN, which derives the number of clusters from eps and min_samples rather than taking it as an input.

The chosen dataset is quite complex for someone without specific knowledge of international trade. We will therefore skip the interpretation phase that normally follows a clustering problem and conclude the notebook with a piece of advice: huge datasets are not ideal for learning purposes, since a beginner can easily miss many details across all the different techniques and models used in this notebook.