<a href="https://colab.research.google.com/github/tokien1998/Company_Recommendation_System/blob/master/Capstone_Project_To_Trong_Kien.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Company Peer Discovery using Autoencoder

Autoencoders are a neural network used for dimensionality reduction. The neural network learn the latent features of the dataset to transform input features into a compressed representation. 

It can be used for encoding high dimensionality data to reduce noise and compute the similarity between high dimensionality data points.

In this case, we have a list of compnay with their profiles. The problem statement is to find the peer of these companies. 

Traditionally, we would engage an expert and define ruleset to filter and sort these companies into specific buckets which is rather subjective. It would be prohibitive if the list is long, we have 12,491 companies in this list.

We can use autoencoders to encode companies into their latent vector and programmatically search for their peers objectively. Hence, machine learning enabled the discovery of peer companies in a scalable manner.

In this project, you will use an autoencoder to encode company profiles and create a machine learning system for peer discovery. You will go through the applied data science journey:

Data wrangling >> Machine Learning model training >> Serving machine learning model >> evaluating the effectiveness of the machine learning system.

Data file can be downloaded at:
https://drive.google.com/file/d/1FrqsCW758NbZgfKbEoCMDDLmf7kUPf2j/view?usp=sharing



## Adding GPU

We can add a GPU by going to the menu and selecting:

Edit -> Notebook Settings -> Add accelerator (GPU)

Then run the following cell to confirm that the GPU is detected.

In [0]:
!pip uninstall tensorflow -y
!pip install tensorflow-gpu==2.0.0



# Data Cleaning and Encoding

In [0]:
import tensorflow as tf
print("Tensorflow version: {}".format(tf.__version__))
device_name = tf.test.gpu_device_name()

if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Tensorflow version: 2.0.0
Found GPU at: /device:GPU:0


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving data.f to data.f
User uploaded file "data.f" with length 2068848 bytes


## Read data into dataframe and do some simple cleaning


In [0]:
import pandas as pd
import numpy as np

In [0]:
data_path = 'data.f'
raw_df = pd.read_feather(data_path)
raw_df.head(10)

Unnamed: 0,index,Country,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Name,OS,Organization type,Region,Sector,Size,year
0,42434,Italy,OECD,No,No,Listed,Acea,,Private company,Europe,Energy Utilities,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
1,36391,United States of America,OECD,No,No,Listed,Bristol-Myers Squibb Company,,Private company,Northern America,Healthcare Products,Large,"2016,2016,2014,2012,2011,2010,2009,2008,2007,2..."
2,29899,United Kingdom of Great Britain and Northern I...,OECD,No,,Non-listed,British Airways,No,Subsidiary,Europe,Aviation,Large,20152014201320001999
3,44375,Sweden,OECD,No,No,Listed,Electrolux,,Private company,Europe,Consumer Durables,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
4,22,Sweden,OECD,,,,ESAB,,,Europe,Construction Materials,Large,20001999
5,50196,United States of America,OECD,No,Yes,Listed,General Motors Company,,Private company,Northern America,Automotive,MNE,"2018,2017,2016,2015,2014,2013,2012,2011,2004,2..."
6,51019,Japan,OECD,No,No,Listed,Panasonic Corporation,,Private company,Asia,Consumer Durables,Large,"2018,2017,2016,2015,2014,2013,2012,2011,2010,2..."
7,47087,United States of America,OECD,No,No,Listed,Procter & Gamble,,Private company,Northern America,Household and Personal Products,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
8,1176,Netherlands,OECD,,,,Procter & Gamble Netherlands,,,Europe,Household and Personal Products,Large,200520032002200120001999
9,51465,Canada,OECD,No,No,Listed,Suncor Energy,,Private company,Northern America,Energy,Large,"2018,2017,2016,2015,2014,2013,2012,2011,2010,2..."


In [0]:
# duplicate raw dataframe for build ML model
dup_raw_df = raw_df.copy()
dup_raw_df.head(10)

Unnamed: 0,index,Country,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Name,OS,Organization type,Region,Sector,Size,year
0,42434,Italy,OECD,No,No,Listed,Acea,,Private company,Europe,Energy Utilities,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
1,36391,United States of America,OECD,No,No,Listed,Bristol-Myers Squibb Company,,Private company,Northern America,Healthcare Products,Large,"2016,2016,2014,2012,2011,2010,2009,2008,2007,2..."
2,29899,United Kingdom of Great Britain and Northern I...,OECD,No,,Non-listed,British Airways,No,Subsidiary,Europe,Aviation,Large,20152014201320001999
3,44375,Sweden,OECD,No,No,Listed,Electrolux,,Private company,Europe,Consumer Durables,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
4,22,Sweden,OECD,,,,ESAB,,,Europe,Construction Materials,Large,20001999
5,50196,United States of America,OECD,No,Yes,Listed,General Motors Company,,Private company,Northern America,Automotive,MNE,"2018,2017,2016,2015,2014,2013,2012,2011,2004,2..."
6,51019,Japan,OECD,No,No,Listed,Panasonic Corporation,,Private company,Asia,Consumer Durables,Large,"2018,2017,2016,2015,2014,2013,2012,2011,2010,2..."
7,47087,United States of America,OECD,No,No,Listed,Procter & Gamble,,Private company,Northern America,Household and Personal Products,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
8,1176,Netherlands,OECD,,,,Procter & Gamble Netherlands,,,Europe,Household and Personal Products,Large,200520032002200120001999
9,51465,Canada,OECD,No,No,Listed,Suncor Energy,,Private company,Northern America,Energy,Large,"2018,2017,2016,2015,2014,2013,2012,2011,2010,2..."


In [0]:
#df preprocessing
ncols_raw = len(raw_df.columns)

#drop duplicate index
raw_df.drop(raw_df.columns[0], axis=1, inplace=True)
raw_df.drop(labels=['year', 'Name', 'OS'], axis=1, inplace=True)
raw_df.replace(to_replace=[None], value=np.nan, inplace=True)

In [0]:
raw_df.head()

Unnamed: 0,Country,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size
0,Italy,OECD,No,No,Listed,Private company,Europe,Energy Utilities,Large
1,United States of America,OECD,No,No,Listed,Private company,Northern America,Healthcare Products,Large
2,United Kingdom of Great Britain and Northern I...,OECD,No,,Non-listed,Subsidiary,Europe,Aviation,Large
3,Sweden,OECD,No,No,Listed,Private company,Europe,Consumer Durables,Large
4,Sweden,OECD,,,,,Europe,Construction Materials,Large


In [0]:
#find columns that has na
raw_df.columns[raw_df.isna().any()]

Index(['Featured Report?', 'GOLD Community', 'Listed/Non-listed',
       'Organization type', 'Size'],
      dtype='object')

In [0]:
raw_df.describe()

Unnamed: 0,Country,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size
count,12491,12491,9802,9114,11833,11730,12491,12491,12458
unique,130,6,2,2,3,7,6,39,3
top,Mainland China,OECD,No,No,Listed,Private company,Asia,Financial Services,Large
freq,1337,6392,9788,8897,7018,8423,4488,1481,7468


In [0]:
#fill na with majorituy category
raw_df.fillna({'Featured Report?':'No', \
               'GOLD Community': 'No', \
               'Listed/Non-listed': 'Listed', \
               'OS':'No', \
               'Organization type': 'Private Company', 
               'Size': 'Large'}, inplace=True)

In [0]:
n_countries = raw_df[raw_df.columns[0]].unique().size
print("Total number of countries: {}".format(n_countries))

Total number of countries: 130


In [0]:
data_clean = raw_df.loc[:, 'Country Status':'Size'].copy()
countries = raw_df.loc[:, 'Country'].unique()

## Encode data

In [0]:
#Creates feature tokens for each feature (Hot-One-Encoding)
features_dataframe = data_clean.copy()
for feature in data_clean.columns:
      dfDummies = pd.get_dummies(data_clean[feature], prefix = feature)
      features_dataframe = pd.concat([features_dataframe, dfDummies], axis=1)

In [0]:
features_dataframe.head()

Unnamed: 0,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,OECD,No,No,Listed,Private company,Europe,Energy Utilities,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,OECD,No,No,Listed,Private company,Northern America,Healthcare Products,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,OECD,No,No,Non-listed,Subsidiary,Europe,Aviation,Large,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,OECD,No,No,Listed,Private company,Europe,Consumer Durables,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,OECD,No,No,Listed,Private Company,Europe,Construction Materials,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [0]:
ncolumns_data = len(data_clean.columns) 
ncolumns_code = len(features_dataframe.columns) - ncolumns_data

print("Original data columns: {}".format(ncolumns_data))
print("Encoding columns: {}".format(ncolumns_code))

Original data columns: 8
Encoding columns: 69


In [0]:
# Removes encoding from the list of variables, the operation 
code_columns = list(features_dataframe.columns)[ncolumns_data:]

In [0]:
# Takes only the columns which have been one-hot-encoded
training_df = features_dataframe[code_columns].copy() 

In [0]:
training_df.head(2)

Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [0]:
#Save File 
training_df.to_csv("encoded_data.csv")

# TF Dataset definition

In [0]:
import numpy as np
import tensorflow as tf
import os

np.random.seed(1)
tf.random.set_seed(1)

In [0]:
batch_size = 128

In [0]:
train_array = training_df.to_numpy(dtype=np.float32,copy=True)

In [0]:
train_dataset = tf.data.Dataset.from_tensor_slices(train_array)

In [0]:
train_dataset = train_dataset.batch(batch_size=batch_size)

In [0]:
train_dataset = train_dataset.shuffle(train_array.shape[0])

In [0]:
train_dataset = train_dataset.prefetch(batch_size*4)

# Model Definition

## Encoder

In [0]:
intermediate_dim = 23
original_dim = 69

In [0]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, intermediate_dim, original_dim):
    super(Encoder, self).__init__()
    hidden_dim_1 = int((original_dim + intermediate_dim)/2)
    self.hidden_layer_1 = tf.keras.layers.Dense(
      units=hidden_dim_1,
      activation=tf.nn.relu,
      kernel_initializer='he_uniform'
    )
    self.hidden_layer_2 = tf.keras.layers.Dense(
      units=intermediate_dim,
      activation=tf.nn.relu,
      kernel_initializer='he_uniform'
    )
    self.output_layer = tf.keras.layers.Dense(
      units=intermediate_dim,
      activation=tf.nn.sigmoid
    )
    
  def call(self, input_features):
    activation = self.hidden_layer_1(input_features)
    activation = self.hidden_layer_2(activation)
    return self.output_layer(activation)

## Decoder

In [0]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, intermediate_dim, original_dim):
    super(Decoder, self).__init__()
    hidden_dim_2 = int((original_dim + intermediate_dim)/2)
    self.hidden_layer_1 = tf.keras.layers.Dense(
      units=intermediate_dim,
      activation=tf.nn.relu,
      kernel_initializer='he_uniform'
    )
    self.hidden_layer_2 = tf.keras.layers.Dense(
      units=hidden_dim_2,
      activation=tf.nn.relu,
      kernel_initializer='he_uniform'
    )
    self.output_layer = tf.keras.layers.Dense(
      units=original_dim,
      activation=tf.nn.sigmoid
    )
  
  def call(self, code):
    activation = self.hidden_layer_1(code)
    activation = self.hidden_layer_2(activation)
    return self.output_layer(activation)

## Autoencoder

In [0]:
class Autoencoder(tf.keras.Model):
  def __init__(self, intermediate_dim, original_dim):
    super(Autoencoder, self).__init__()
    self.encoder = Encoder(
        intermediate_dim=intermediate_dim, 
        original_dim=original_dim
    )
    self.decoder = Decoder(
        intermediate_dim=intermediate_dim,
        original_dim=original_dim
    )
  
  def call(self, input_features):
    code = self.encoder(input_features)
    reconstructed = self.decoder(code)
    return reconstructed

# Model and Training Setup

In [0]:
autoencoder = Autoencoder(
  intermediate_dim=intermediate_dim,
  original_dim=original_dim
)

In [0]:
learning_rate = 0.5e-2
opt = tf.optimizers.Adam(learning_rate=learning_rate)

In [0]:
def loss(model, original):
  reconstruction_error = tf.reduce_mean(tf.square(tf.subtract(model(original), original)))
  return reconstruction_error
  
def train(loss, model, opt, original):
  with tf.GradientTape() as tape:
    gradients = tape.gradient(loss(model, original), model.trainable_variables)
  gradient_variables = zip(gradients, model.trainable_variables)
  opt.apply_gradients(gradient_variables)

# Training

In [0]:
epochs = 50
save_path = "drive/My Drive/model_checkpoints"

In [0]:
def make_folder(path):
  try:
    os.mkdir(path)
  except FileExistsError:
    print("Directory already exist.")
  return

In [0]:
make_folder(save_path)

Directory already exist.


In [0]:
save_freq = 5
writer = tf.summary.create_file_writer('{}/tmp'.format(save_path))
with writer.as_default():
  with tf.summary.record_if(True):
    for epoch in range(epochs):
      for step, batch_features in enumerate(train_dataset):
        train(loss, autoencoder, opt, batch_features)
        loss_values = loss(autoencoder, batch_features)
        original = batch_features
        reconstructed = autoencoder(tf.constant(batch_features))
        tf.summary.scalar('loss', loss_values, step=step)
        tf.summary.write('original', original, step=step)
        tf.summary.write('reconstructed', reconstructed, step=step)
      
      if epoch%save_freq == 0:
        tf.print("Epoch: {}".format(epoch))
        tf.print(" . Loss:", loss_values)
        path = "{}/epoch_{:03d}".format(save_path, epoch)
        make_folder(path)
        autoencoder.save_weights(path)


Epoch: 0
 . Loss: 0.0375696234
Directory already exist.
Epoch: 5
 . Loss: 0.0159702674
Directory already exist.
Epoch: 10
 . Loss: 0.0152507937
Directory already exist.
Epoch: 15
 . Loss: 0.0133871352
Directory already exist.
Epoch: 20
 . Loss: 0.010589866
Directory already exist.
Epoch: 25
 . Loss: 0.00961121637
Directory already exist.
Epoch: 30
 . Loss: 0.0131718908
Directory already exist.
Epoch: 35
 . Loss: 0.0118441591
Directory already exist.
Epoch: 40
 . Loss: 0.0115874736
Directory already exist.
Epoch: 45
 . Loss: 0.0116139008
Directory already exist.


# Capstone Project Statement

## Instructions:
1. Create a copy of this collab notebook and suffix it with your name
2. Run through the code example above
3. Work on the project objectives below inside the copy of the notebook

## 1. Use the autoencoder as a basis to create a machine learning model to find the Top 5 similar company given an target company. 

Note: Implement an API/function to query for Top 5 similar company given an target company. Assuming the target company is within the data.f file.

### Get information of a specific company

In [0]:
#@title Write down a company name
company_name = "Electrolux" #@param {type:"string"}


In [0]:
if(company_name in dup_raw_df['Name'].values):
    print('The Company is founded!')
else:
    raise NameError('This company name is not founded!')

The Company is founded!


In [0]:
# df preprocessing
target_company_df = dup_raw_df[dup_raw_df.Name == company_name].copy()
other_companies_df = dup_raw_df[dup_raw_df.Name != company_name].copy()

### Get the target vector of target company

In [0]:
func_features_df = features_dataframe.copy()

def df_preprocessing(dataframe):
    # drop duplicate index
    dataframe.drop(labels='index', axis=1, inplace=True)
    dataframe.drop(labels=['year', 'Name', 'OS'], axis=1, inplace=True)
    dataframe.replace(to_replace=[None], value=np.nan, inplace=True)

    # find columns that has na
    dataframe.columns[dataframe.isna().any()]

    #fill na with majorituy category
    dataframe.fillna({'Featured Report?':'No', \
                'GOLD Community': 'No', \
                'Listed/Non-listed': 'Listed', \
                'OS':'No', \
                'Organization type': 'Private Company', 
                'Size': 'Large'}, inplace=True)
    
    dataframe_clean = dataframe.loc[:, 'Country Status':'Size'].copy()

    dataframe_ori_cols = dataframe_clean.shape[1] # Number of cleaned original columns
    return dataframe_clean, dataframe_ori_cols

In [0]:
# Create a function to get one-hot code dataframe
def get_code_df(dataframe,df_type):
    dataframe_clean, dataframe_ori_cols = df_preprocessing(dataframe)
    if(df_type == 'target'):
        # create a dictionary of origin values of the target company 
        target_com_dict = dataframe_clean.to_dict(orient='records')[0]

        # show a data list with satisfying conditions
        for key, value in target_com_dict.items():
            global func_features_df
            func_features_df = func_features_df[func_features_df[key] == value].copy()
            target_features_df = func_features_df

        # drop duplicates
        target_features_df = target_features_df.drop_duplicates()

        # Takes only the columns which have been one-hot-encoded
        target_vector = target_features_df[target_features_df.columns[dataframe_ori_cols:]]
        return dataframe_clean, dataframe_clean.head(), target_vector, target_vector.head()
        
    if(df_type == 'remaining'):
        # One-hot-encoding
        token_dataframe = dataframe_clean.copy()
        for feature in dataframe_clean.columns:
            dfDummies = pd.get_dummies(dataframe_clean[feature], prefix = feature)
            token_dataframe = pd.concat([token_dataframe, dfDummies], axis=1)

        # Take only the columns which have been one-hot-encoded
        token_df = token_dataframe[token_dataframe.columns[dataframe_ori_cols:]]
        return dataframe_clean, dataframe_clean.head(), token_df, token_df.head()

In [0]:
# preview target company
target_company, prv_target_company, encoded_target_company, prv_encoded_target_company = get_code_df(target_company_df, df_type='target')
display(prv_target_company, prv_encoded_target_company, prv_encoded_target_company.shape)

Unnamed: 0,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size
3,OECD,No,No,Listed,Private company,Europe,Consumer Durables,Large


Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
3,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


(1, 69)

In [0]:
# preview other companies
other_companies, prv_other_companies, encoded_other_companies, prv_encoded_other_companies = get_code_df(other_companies_df, df_type='remaining')
display(prv_other_companies, prv_encoded_other_companies, prv_encoded_other_companies.shape)

Unnamed: 0,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size
0,OECD,No,No,Listed,Private company,Europe,Energy Utilities,Large
1,OECD,No,No,Listed,Private company,Northern America,Healthcare Products,Large
2,OECD,No,No,Non-listed,Subsidiary,Europe,Aviation,Large
4,OECD,No,No,Listed,Private Company,Europe,Construction Materials,Large
5,OECD,No,Yes,Listed,Private company,Northern America,Automotive,MNE


Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
5,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


(5, 69)

In [0]:
target_vector_69 = encoded_target_company.to_numpy(dtype=np.float32, copy=True)
target_vector_69.shape

numpy.ndarray

In [0]:
other_vectors_69 = encoded_other_companies.to_numpy(dtype=np.float32, copy=True)
other_vectors_69.shape

(12490, 69)

### Dimension Reducing by Encoder

In [0]:
encoder = Encoder(
    intermediate_dim=intermediate_dim,
    original_dim=original_dim
)

decoder = Decoder(
    intermediate_dim=intermediate_dim,
    original_dim=original_dim
)

In [0]:
# list of 23-dimensions vectors
target_vector_23 = encoder(target_vector_69)
target_vector_23.shape

TensorShape([1, 23])

In [0]:
other_vectors_23 = encoder(other_vectors_69)
other_vectors_23.shape

TensorShape([12490, 23])

### Calculate the distance between the company vector and vectors in dataset

In [0]:
def vector_loss(target_vector, selected_vector):
    norm_value = np.linalg.norm(target_vector - selected_vector)
    return norm_value

In [0]:
train_dict = {}
loss_dict = {} # for checking

for step, vector in enumerate(other_vectors_23):
    loss_value = vector_loss(target_vector=target_vector_23, selected_vector=vector)
    train_dict.update({vector.experimental_ref() : loss_value})
    loss_dict.update({step : loss_value})

    if step%1000 == 0:
        print('Step: ',step)
        print('. Vector: ', vector)
        print('. Loss: {}'.format(loss_value))
        print('________\n')

Step:  0
. Vector:  tf.Tensor(
[0.6593223  0.52991694 0.5300096  0.5109138  0.53552294 0.46297282
 0.5756589  0.5640135  0.54374576 0.63916785 0.5523897  0.48967662
 0.55614585 0.5690693  0.4235473  0.53759193 0.46717918 0.3715962
 0.48032174 0.6213877  0.6203355  0.5142615  0.47182268], shape=(23,), dtype=float32)
. Loss: 0.1594184935092926
________

Step:  1000
. Vector:  tf.Tensor(
[0.5431844  0.46220088 0.5206056  0.49174005 0.5605048  0.476045
 0.47863996 0.54090977 0.546007   0.5605823  0.52877456 0.47826016
 0.5262644  0.49506944 0.51521266 0.5023218  0.4935623  0.4925796
 0.46676314 0.5455513  0.48484713 0.45840585 0.48071623], shape=(23,), dtype=float32)
. Loss: 0.22992609441280365
________

Step:  2000
. Vector:  tf.Tensor(
[0.5971109  0.49322364 0.528209   0.44503143 0.4925152  0.4251901
 0.53922117 0.5500984  0.51584595 0.5727555  0.4814058  0.40835887
 0.58696675 0.5533949  0.4201365  0.5372612  0.48332745 0.39367378
 0.41537273 0.6015871  0.6133498  0.43220583 0.5167773 ]

### Choose 5 smallest distances (5 other companies)

In [0]:
top5_smallest_loss_step = sorted(loss_dict.items(), key= lambda item: item[1])[:5]
top5_smallest_loss_step

[(373, 1.074538e-07),
 (5650, 1.074538e-07),
 (9096, 1.074538e-07),
 (9579, 1.074538e-07),
 (1734, 0.068126455)]

In [0]:
top5_23dim_vector = sorted(train_dict.items(), key= lambda item: item[1])[:5]
top5_23dim_vector[0][0]

    #convert Reference object to ndarray

<Reference wrapping <tf.Tensor: id=870329, shape=(23,), dtype=float32, numpy=
array([0.58664703, 0.5345533 , 0.5219954 , 0.473352  , 0.48925287,
       0.44958842, 0.54683065, 0.5773044 , 0.53952235, 0.5658525 ,
       0.55383265, 0.4812393 , 0.53506076, 0.5888219 , 0.47412458,
       0.530871  , 0.46594116, 0.39890587, 0.48466036, 0.5873571 ,
       0.58020765, 0.47325674, 0.5062909 ], dtype=float32)>>

### Decode 5 vectors to 69-dimensions vectors

### Find the information of Top 5

## 2. Improve on the company encoder by using more features or use a different architecture to improve performance of the Top 5 similar company API.