<a href="https://colab.research.google.com/github/tokien1998/Company_Recommendation_System/blob/master/Capstone_Project_To_Trong_Kien.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Company Peer Discovery using Autoencoder

Autoencoders are a neural network used for dimensionality reduction. The neural network learn the latent features of the dataset to transform input features into a compressed representation. 

It can be used for encoding high dimensionality data to reduce noise and compute the similarity between high dimensionality data points.

In this case, we have a list of compnay with their profiles. The problem statement is to find the peer of these companies. 

Traditionally, we would engage an expert and define ruleset to filter and sort these companies into specific buckets which is rather subjective. It would be prohibitive if the list is long, we have 12,491 companies in this list.

We can use autoencoders to encode companies into their latent vector and programmatically search for their peers objectively. Hence, machine learning enabled the discovery of peer companies in a scalable manner.

In this project, you will use an autoencoder to encode company profiles and create a machine learning system for peer discovery. You will go through the applied data science journey:

Data wrangling >> Machine Learning model training >> Serving machine learning model >> evaluating the effectiveness of the machine learning system.

Data file can be downloaded at:
https://drive.google.com/file/d/1FrqsCW758NbZgfKbEoCMDDLmf7kUPf2j/view?usp=sharing



## Adding GPU

We can add a GPU by going to the menu and selecting:

Edit -> Notebook Settings -> Add accelerator (GPU)

Then run the following cell to confirm that the GPU is detected.

In [779]:
!pip uninstall tensorflow -y
!pip install tensorflow-gpu==2.0.0



# Data Cleaning and Encoding

In [780]:
import tensorflow as tf
print("Tensorflow version: {}".format(tf.__version__))
device_name = tf.test.gpu_device_name()

if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Tensorflow version: 2.0.0
Found GPU at: /device:GPU:0


In [781]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [785]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving data.f to data.f
User uploaded file "data.f" with length 2068848 bytes


## Read data into dataframe and do some simple cleaning


In [0]:
import pandas as pd
import numpy as np

In [787]:
data_path = 'data.f'
raw_df = pd.read_feather(data_path)
raw_df.head(10)

Unnamed: 0,index,Country,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Name,OS,Organization type,Region,Sector,Size,year
0,42434,Italy,OECD,No,No,Listed,Acea,,Private company,Europe,Energy Utilities,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
1,36391,United States of America,OECD,No,No,Listed,Bristol-Myers Squibb Company,,Private company,Northern America,Healthcare Products,Large,"2016,2016,2014,2012,2011,2010,2009,2008,2007,2..."
2,29899,United Kingdom of Great Britain and Northern I...,OECD,No,,Non-listed,British Airways,No,Subsidiary,Europe,Aviation,Large,20152014201320001999
3,44375,Sweden,OECD,No,No,Listed,Electrolux,,Private company,Europe,Consumer Durables,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
4,22,Sweden,OECD,,,,ESAB,,,Europe,Construction Materials,Large,20001999
5,50196,United States of America,OECD,No,Yes,Listed,General Motors Company,,Private company,Northern America,Automotive,MNE,"2018,2017,2016,2015,2014,2013,2012,2011,2004,2..."
6,51019,Japan,OECD,No,No,Listed,Panasonic Corporation,,Private company,Asia,Consumer Durables,Large,"2018,2017,2016,2015,2014,2013,2012,2011,2010,2..."
7,47087,United States of America,OECD,No,No,Listed,Procter & Gamble,,Private company,Northern America,Household and Personal Products,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
8,1176,Netherlands,OECD,,,,Procter & Gamble Netherlands,,,Europe,Household and Personal Products,Large,200520032002200120001999
9,51465,Canada,OECD,No,No,Listed,Suncor Energy,,Private company,Northern America,Energy,Large,"2018,2017,2016,2015,2014,2013,2012,2011,2010,2..."


In [788]:
# duplicate raw dataframe for build ML model
dup_raw_df = raw_df.copy()
dup_raw_df.head(10)

Unnamed: 0,index,Country,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Name,OS,Organization type,Region,Sector,Size,year
0,42434,Italy,OECD,No,No,Listed,Acea,,Private company,Europe,Energy Utilities,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
1,36391,United States of America,OECD,No,No,Listed,Bristol-Myers Squibb Company,,Private company,Northern America,Healthcare Products,Large,"2016,2016,2014,2012,2011,2010,2009,2008,2007,2..."
2,29899,United Kingdom of Great Britain and Northern I...,OECD,No,,Non-listed,British Airways,No,Subsidiary,Europe,Aviation,Large,20152014201320001999
3,44375,Sweden,OECD,No,No,Listed,Electrolux,,Private company,Europe,Consumer Durables,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
4,22,Sweden,OECD,,,,ESAB,,,Europe,Construction Materials,Large,20001999
5,50196,United States of America,OECD,No,Yes,Listed,General Motors Company,,Private company,Northern America,Automotive,MNE,"2018,2017,2016,2015,2014,2013,2012,2011,2004,2..."
6,51019,Japan,OECD,No,No,Listed,Panasonic Corporation,,Private company,Asia,Consumer Durables,Large,"2018,2017,2016,2015,2014,2013,2012,2011,2010,2..."
7,47087,United States of America,OECD,No,No,Listed,Procter & Gamble,,Private company,Northern America,Household and Personal Products,Large,"2017,2016,2015,2014,2013,2012,2011,2010,2009,2..."
8,1176,Netherlands,OECD,,,,Procter & Gamble Netherlands,,,Europe,Household and Personal Products,Large,200520032002200120001999
9,51465,Canada,OECD,No,No,Listed,Suncor Energy,,Private company,Northern America,Energy,Large,"2018,2017,2016,2015,2014,2013,2012,2011,2010,2..."


In [0]:
#df preprocessing
ncols_raw = len(raw_df.columns)

#drop duplicate index
raw_df.drop(raw_df.columns[0], axis=1, inplace=True)
raw_df.drop(labels=['year', 'Name', 'OS'], axis=1, inplace=True)
raw_df.replace(to_replace=[None], value=np.nan, inplace=True)

In [790]:
raw_df.head()

Unnamed: 0,Country,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size
0,Italy,OECD,No,No,Listed,Private company,Europe,Energy Utilities,Large
1,United States of America,OECD,No,No,Listed,Private company,Northern America,Healthcare Products,Large
2,United Kingdom of Great Britain and Northern I...,OECD,No,,Non-listed,Subsidiary,Europe,Aviation,Large
3,Sweden,OECD,No,No,Listed,Private company,Europe,Consumer Durables,Large
4,Sweden,OECD,,,,,Europe,Construction Materials,Large


In [791]:
#find columns that has na
raw_df.columns[raw_df.isna().any()]

Index(['Featured Report?', 'GOLD Community', 'Listed/Non-listed',
       'Organization type', 'Size'],
      dtype='object')

In [792]:
raw_df.describe()

Unnamed: 0,Country,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size
count,12491,12491,9802,9114,11833,11730,12491,12491,12458
unique,130,6,2,2,3,7,6,39,3
top,Mainland China,OECD,No,No,Listed,Private company,Asia,Financial Services,Large
freq,1337,6392,9788,8897,7018,8423,4488,1481,7468


In [0]:
#fill na with majorituy category
raw_df.fillna({'Featured Report?':'No', \
               'GOLD Community': 'No', \
               'Listed/Non-listed': 'Listed', \
               'OS':'No', \
               'Organization type': 'Private Company', 
               'Size': 'Large'}, inplace=True)

In [794]:
n_countries = raw_df[raw_df.columns[0]].unique().size
print("Total number of countries: {}".format(n_countries))

Total number of countries: 130


In [0]:
data_clean = raw_df.loc[:, 'Country Status':'Size'].copy()
countries = raw_df.loc[:, 'Country'].unique()

## Encode data

In [0]:
#Creates feature tokens for each feature (Hot-One-Encoding)
features_dataframe = data_clean.copy()
for feature in data_clean.columns:
      dfDummies = pd.get_dummies(data_clean[feature], prefix = feature)
      features_dataframe = pd.concat([features_dataframe, dfDummies], axis=1)

In [797]:
features_dataframe.head()

Unnamed: 0,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,OECD,No,No,Listed,Private company,Europe,Energy Utilities,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,OECD,No,No,Listed,Private company,Northern America,Healthcare Products,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,OECD,No,No,Non-listed,Subsidiary,Europe,Aviation,Large,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,OECD,No,No,Listed,Private company,Europe,Consumer Durables,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,OECD,No,No,Listed,Private Company,Europe,Construction Materials,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [798]:
ncolumns_data = len(data_clean.columns) 
ncolumns_code = len(features_dataframe.columns) - ncolumns_data

print("Original data columns: {}".format(ncolumns_data))
print("Encoding columns: {}".format(ncolumns_code))

Original data columns: 8
Encoding columns: 69


In [0]:
# Removes encoding from the list of variables, the operation 
code_columns = list(features_dataframe.columns)[ncolumns_data:]

In [0]:
# Takes only the columns which have been one-hot-encoded
training_df = features_dataframe[code_columns].copy() 

In [801]:
training_df.head(2)

Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [0]:
#Save File 
training_df.to_csv("encoded_data.csv")

# TF Dataset definition

In [0]:
import numpy as np
import tensorflow as tf
import os

np.random.seed(1)
tf.random.set_seed(1)

In [0]:
batch_size = 128

In [0]:
train_array = training_df.to_numpy(dtype=np.float32,copy=True)

In [0]:
train_dataset = tf.data.Dataset.from_tensor_slices(train_array)

In [0]:
train_dataset = train_dataset.batch(batch_size=batch_size)

In [0]:
train_dataset = train_dataset.shuffle(train_array.shape[0])

In [0]:
train_dataset = train_dataset.prefetch(batch_size*4)

# Model Definition

## Encoder

In [0]:
intermediate_dim = 23
original_dim = 69

In [0]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, intermediate_dim, original_dim):
    super(Encoder, self).__init__()
    hidden_dim_1 = int((original_dim + intermediate_dim)/2)
    self.hidden_layer_1 = tf.keras.layers.Dense(
      units=hidden_dim_1,
      activation=tf.nn.relu,
      kernel_initializer='he_uniform'
    )
    self.hidden_layer_2 = tf.keras.layers.Dense(
      units=intermediate_dim,
      activation=tf.nn.relu,
      kernel_initializer='he_uniform'
    )
    self.output_layer = tf.keras.layers.Dense(
      units=intermediate_dim,
      activation=tf.nn.sigmoid
    )
    
  def call(self, input_features):
    activation = self.hidden_layer_1(input_features)
    activation = self.hidden_layer_2(activation)
    return self.output_layer(activation)

## Decoder

In [0]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, intermediate_dim, original_dim):
    super(Decoder, self).__init__()
    hidden_dim_2 = int((original_dim + intermediate_dim)/2)
    self.hidden_layer_1 = tf.keras.layers.Dense(
      units=intermediate_dim,
      activation=tf.nn.relu,
      kernel_initializer='he_uniform'
    )
    self.hidden_layer_2 = tf.keras.layers.Dense(
      units=hidden_dim_2,
      activation=tf.nn.relu,
      kernel_initializer='he_uniform'
    )
    self.output_layer = tf.keras.layers.Dense(
      units=original_dim,
      activation=tf.nn.sigmoid
    )
  
  def call(self, code):
    activation = self.hidden_layer_1(code)
    activation = self.hidden_layer_2(activation)
    return self.output_layer(activation)

## Autoencoder

In [0]:
class Autoencoder(tf.keras.Model):
  def __init__(self, intermediate_dim, original_dim):
    super(Autoencoder, self).__init__()
    self.encoder = Encoder(
        intermediate_dim=intermediate_dim, 
        original_dim=original_dim
    )
    self.decoder = Decoder(
        intermediate_dim=intermediate_dim,
        original_dim=original_dim
    )
  
  def call(self, input_features):
    code = self.encoder(input_features)
    reconstructed = self.decoder(code)
    return reconstructed

# Model and Training Setup

In [0]:
autoencoder = Autoencoder(
  intermediate_dim=intermediate_dim,
  original_dim=original_dim
)

In [0]:
learning_rate = 0.5e-2
opt = tf.optimizers.Adam(learning_rate=learning_rate)

In [0]:
def loss(model, original):
  reconstruction_error = tf.reduce_mean(tf.square(tf.subtract(model(original), original)))
  return reconstruction_error
  
def train(loss, model, opt, original):
  with tf.GradientTape() as tape:
    gradients = tape.gradient(loss(model, original), model.trainable_variables)
  gradient_variables = zip(gradients, model.trainable_variables)
  opt.apply_gradients(gradient_variables)

# Training

In [0]:
epochs = 50
save_path = "drive/My Drive/model_checkpoints"

In [0]:
def make_folder(path):
  try:
    os.mkdir(path)
  except FileExistsError:
    print("Directory already exist.")
  return

In [819]:
make_folder(save_path)

Directory already exist.


In [820]:
save_freq = 5
writer = tf.summary.create_file_writer('{}/tmp'.format(save_path))
with writer.as_default():
  with tf.summary.record_if(True):
    for epoch in range(epochs):
      for step, batch_features in enumerate(train_dataset):
        train(loss, autoencoder, opt, batch_features)
        loss_values = loss(autoencoder, batch_features)
        original = batch_features
        reconstructed = autoencoder(tf.constant(batch_features))
        tf.summary.scalar('loss', loss_values, step=step)
        tf.summary.write('original', original, step=step)
        tf.summary.write('reconstructed', reconstructed, step=step)
      
      if epoch%save_freq == 0:
        tf.print("Epoch: {}".format(epoch))
        tf.print(" . Loss:", loss_values)
        path = "{}/epoch_{:03d}".format(save_path, epoch)
        make_folder(path)
        autoencoder.save_weights(path)


Epoch: 0
 . Loss: 0.0375696234
Directory already exist.
Epoch: 5
 . Loss: 0.015970286
Directory already exist.
Epoch: 10
 . Loss: 0.0153024541
Directory already exist.
Epoch: 15
 . Loss: 0.0133663164
Directory already exist.
Epoch: 20
 . Loss: 0.0107426587
Directory already exist.
Epoch: 25
 . Loss: 0.00959133916
Directory already exist.
Epoch: 30
 . Loss: 0.0133537054
Directory already exist.
Epoch: 35
 . Loss: 0.0126118558
Directory already exist.
Epoch: 40
 . Loss: 0.0113625349
Directory already exist.
Epoch: 45
 . Loss: 0.0112214265
Directory already exist.


# Capstone Project Statement

## Instructions:
1. Create a copy of this collab notebook and suffix it with your name
2. Run through the code example above
3. Work on the project objectives below inside the copy of the notebook

## 1. Use the autoencoder as a basis to create a machine learning model to find the Top 5 similar company given an target company. 

Note: Implement an API/function to query for Top 5 similar company given an target company. Assuming the target company is within the data.f file.

### Get information of a specific company

In [0]:
#@title Write down a company name
company_name = "British Airways" #@param {type:"string"}


In [822]:
if(company_name in dup_raw_df['Name'].values):
    print('The Company is founded!')
else:
    raise NameError('This company name is not founded!')

The Company is founded!


In [0]:
# df preprocessing
target_company_df = dup_raw_df[dup_raw_df.Name == company_name].copy()
other_companies_df = dup_raw_df[dup_raw_df.Name != company_name].copy()

### Get the target vector of target company

In [0]:
func_features_df = features_dataframe.copy()

def df_preprocessing(dataframe):
    # drop duplicate index
    dataframe.drop(labels='index', axis=1, inplace=True)
    dataframe.drop(labels=['year', 'Name', 'OS'], axis=1, inplace=True)
    dataframe.replace(to_replace=[None], value=np.nan, inplace=True)

    # find columns that has na
    dataframe.columns[dataframe.isna().any()]

    #fill na with majorituy category
    dataframe.fillna({'Featured Report?':'No', \
                'GOLD Community': 'No', \
                'Listed/Non-listed': 'Listed', \
                'OS':'No', \
                'Organization type': 'Private Company', 
                'Size': 'Large'}, inplace=True)
    
    dataframe_clean = dataframe.loc[:, 'Country Status':'Size'].copy()

    dataframe_ori_cols = dataframe_clean.shape[1] # Number of cleaned original columns
    return dataframe_clean, dataframe_ori_cols

In [0]:
# Create a function to get one-hot code dataframe
def get_code_df(dataframe,df_type):
    dataframe_clean, dataframe_ori_cols = df_preprocessing(dataframe)
    if(df_type == 'target'):
        # create a dictionary of origin values of the target company 
        target_com_dict = dataframe_clean.to_dict(orient='records')[0]

        # show a data list with satisfying conditions
        for key, value in target_com_dict.items():
            global func_features_df
            func_features_df = func_features_df[func_features_df[key] == value].copy()
            target_features_df = func_features_df

        # drop duplicates
        target_features_df = target_features_df.drop_duplicates()

        # Takes only the columns which have been one-hot-encoded
        target_vector = target_features_df[target_features_df.columns[dataframe_ori_cols:]]
        return dataframe_clean, dataframe_clean.head(), target_vector, target_vector.head()
        
    if(df_type == 'the rest'):
        # One-hot-encoding
        token_dataframe = dataframe_clean.copy()
        for feature in dataframe_clean.columns:
            dfDummies = pd.get_dummies(dataframe_clean[feature], prefix = feature)
            token_dataframe = pd.concat([token_dataframe, dfDummies], axis=1)

        # Get the full dataframe for the last step
        full_df = token_dataframe 

        # Take only the columns which have been one-hot-encoded
        token_df = token_dataframe[token_dataframe.columns[dataframe_ori_cols:]]
        return dataframe_clean, dataframe_clean.head(), token_df, token_df.head(), full_df

In [826]:
# preview target company
target_company, prv_target_company, encoded_target_company, prv_encoded_target_company = get_code_df(target_company_df, df_type='target')
display(prv_target_company, prv_encoded_target_company, prv_encoded_target_company.shape)

Unnamed: 0,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size
2,OECD,No,No,Non-listed,Subsidiary,Europe,Aviation,Large


Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
2,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


(1, 69)

In [827]:
# preview other companies
other_companies, prv_other_companies, encoded_other_companies, prv_encoded_other_companies, df_for_last_step = get_code_df(other_companies_df, df_type='the rest')
display(prv_other_companies, prv_encoded_other_companies, prv_encoded_other_companies.shape)

Unnamed: 0,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size
0,OECD,No,No,Listed,Private company,Europe,Energy Utilities,Large
1,OECD,No,No,Listed,Private company,Northern America,Healthcare Products,Large
3,OECD,No,No,Listed,Private company,Europe,Consumer Durables,Large
4,OECD,No,No,Listed,Private Company,Europe,Construction Materials,Large
5,OECD,No,Yes,Listed,Private company,Northern America,Automotive,MNE


Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
5,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


(5, 69)

In [828]:
target_vector_69 = encoded_target_company.to_numpy(dtype=np.float32, copy=True)
target_vector_69.shape

(1, 69)

In [829]:
other_vectors_69 = encoded_other_companies.to_numpy(dtype=np.float32, copy=True)
other_vectors_69.shape

(12490, 69)

### Dimension Reducing by Encoder

In [0]:
encoder = Encoder(
    intermediate_dim=intermediate_dim,
    original_dim=original_dim
)

decoder = Decoder(
    intermediate_dim=intermediate_dim,
    original_dim=original_dim
)

In [831]:
# list of 23-dimensions vectors
target_vector_23 = encoder(target_vector_69)
target_vector_23.shape

TensorShape([1, 23])

In [832]:
other_vectors_23 = encoder(other_vectors_69)
other_vectors_23.shape

TensorShape([12490, 23])

### Calculate the distance between the company vector and vectors in dataset

In [0]:
def vector_loss(vector_1, vector_2):
    norm_value = np.linalg.norm(vector_1 - vector_2)
    return norm_value

In [834]:
loss_vector_dict = {}
loss_step_dict = {} # for checking

for step, vector in enumerate(other_vectors_23):
    loss_value = vector_loss(vector_1=target_vector_23, vector_2=vector)
    loss_vector_dict.update({loss_value : vector})
    loss_step_dict.update({loss_value : step})

    if step%1000 == 0:
        print('Step: ',step)
        print('. Vector: ', vector)
        print('. Loss: {}'.format(loss_value))
        print('________\n')

Step:  0
. Vector:  tf.Tensor(
[0.6593223  0.52991694 0.5300096  0.5109138  0.53552294 0.46297282
 0.5756589  0.5640135  0.54374576 0.63916785 0.5523897  0.48967662
 0.55614585 0.5690693  0.4235473  0.53759193 0.46717918 0.3715962
 0.48032174 0.6213877  0.6203355  0.5142615  0.47182268], shape=(23,), dtype=float32)
. Loss: 0.3034869134426117
________

Step:  1000
. Vector:  tf.Tensor(
[0.5431844  0.46220088 0.5206056  0.49174005 0.5605048  0.476045
 0.47863996 0.54090977 0.546007   0.5605823  0.52877456 0.47826016
 0.5262644  0.49506944 0.51521266 0.5023218  0.4935623  0.4925796
 0.46676314 0.5455513  0.48484713 0.45840585 0.48071623], shape=(23,), dtype=float32)
. Loss: 0.2211018204689026
________

Step:  2000
. Vector:  tf.Tensor(
[0.5971109  0.49322364 0.528209   0.44503143 0.4925152  0.4251901
 0.53922117 0.5500984  0.51584595 0.5727555  0.4814058  0.40835887
 0.58696675 0.5533949  0.4201365  0.5372612  0.48332745 0.39367378
 0.41537273 0.6015871  0.6133498  0.43220583 0.5167773 ],

### Choose 5 smallest distances (5 other companies)

In [835]:
top5_smallest_loss_step = sorted(loss_step_dict.items(), key= lambda item: item[0])[:5]
top5_smallest_loss_step

[(6.664002e-08, 8629),
 (0.06611242, 11636),
 (0.092745095, 6996),
 (0.10762594, 4897),
 (0.10815629, 6229)]

In [836]:
top5_23dim_vector = sorted(loss_vector_dict.items(), key= lambda item: item[0])[:5]
top5_23dim_vector

[(6.664002e-08, <tf.Tensor: id=10847136, shape=(23,), dtype=float32, numpy=
  array([0.5148056 , 0.463833  , 0.56756145, 0.56216437, 0.5218636 ,
         0.48853582, 0.5170681 , 0.49759245, 0.52543557, 0.53032786,
         0.5548152 , 0.53945565, 0.53138876, 0.56581974, 0.52506435,
         0.4860138 , 0.4878537 , 0.45924512, 0.49402517, 0.58200496,
         0.55666625, 0.5419357 , 0.57768947], dtype=float32)>),
 (0.06611242, <tf.Tensor: id=10862171, shape=(23,), dtype=float32, numpy=
  array([0.5190928 , 0.46817294, 0.5640065 , 0.55139035, 0.535544  ,
         0.46466538, 0.52147174, 0.47730443, 0.52209085, 0.5334885 ,
         0.57206506, 0.53430617, 0.5150491 , 0.5457676 , 0.51291525,
         0.49515146, 0.49775103, 0.47232762, 0.51543653, 0.57319415,
         0.5512842 , 0.5219977 , 0.5544543 ], dtype=float32)>),
 (0.092745095, <tf.Tensor: id=10838971, shape=(23,), dtype=float32, numpy=
  array([0.5129862 , 0.48400128, 0.56624514, 0.5496834 , 0.5097616 ,
         0.5224654 , 0.538

### Decode 5 vectors to 69-dimensions vectors

In [837]:
# edit tensor shape (23,) -> (x, 23)
tensor_shape = np.array([top5_23dim_vector[0][1].numpy()])
for tensor in top5_23dim_vector[1:]:
    tensor_shape = np.append(tensor_shape, tf.constant([tensor[1].numpy()]), axis=0)

tensor_shape.shape

(5, 23)

In [838]:
decoded_top5 = decoder(tensor_shape)
decoded_top5.shape

TensorShape([5, 69])

### Find the information of Top 5

In [0]:
# value for each features
value_dummie = decoded_top5.numpy()

In [840]:
# Create Dataframe with decoded top5 for preview results
top5_df = pd.DataFrame(data=value_dummie, columns=encoded_other_companies.columns)
top5_df

Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,0.387195,0.4405,0.422762,0.523005,0.499232,0.57396,0.50224,0.584104,0.27841,0.572296,0.369857,0.510886,0.63474,0.551159,0.507123,0.655974,0.511022,0.595664,0.610833,0.63072,0.523265,0.339854,0.405282,0.427493,0.570726,0.50355,0.516882,0.311831,0.474316,0.560003,0.558769,0.563665,0.449343,0.536849,0.431539,0.582813,0.619454,0.449128,0.404407,0.583231,0.447648,0.488812,0.562898,0.491915,0.274824,0.536991,0.319151,0.599873,0.393694,0.423067,0.514632,0.553942,0.682917,0.447662,0.477604,0.341271,0.615375,0.540994,0.358535,0.443974,0.590299,0.53938,0.619303,0.641555,0.443299,0.543601,0.662312,0.445283,0.428862
1,0.390247,0.446983,0.426128,0.524257,0.499272,0.571041,0.500265,0.583549,0.281209,0.56663,0.368098,0.507695,0.629221,0.550867,0.502939,0.649318,0.509205,0.591835,0.613376,0.627758,0.528637,0.344292,0.40604,0.427913,0.57092,0.502453,0.518752,0.315931,0.471982,0.560512,0.556489,0.563102,0.451623,0.535653,0.429416,0.579355,0.61798,0.446726,0.406229,0.581314,0.44608,0.483485,0.565704,0.491936,0.280047,0.5389,0.320575,0.598972,0.397813,0.430217,0.512094,0.554223,0.677854,0.450378,0.475019,0.342999,0.613004,0.542766,0.362905,0.443803,0.589073,0.536665,0.617981,0.638475,0.442964,0.538707,0.655581,0.447179,0.432805
2,0.38903,0.439522,0.423268,0.520669,0.501045,0.576903,0.507329,0.585443,0.281926,0.57923,0.373238,0.513111,0.636049,0.545769,0.510752,0.659298,0.517104,0.595522,0.604928,0.631649,0.513694,0.338092,0.408671,0.427646,0.567568,0.50457,0.512954,0.313541,0.473287,0.553843,0.562728,0.558296,0.445389,0.537246,0.433586,0.578788,0.622826,0.454233,0.405204,0.582238,0.451928,0.496593,0.556961,0.494395,0.274042,0.534177,0.32272,0.597293,0.39415,0.417362,0.511599,0.547699,0.682577,0.44194,0.478434,0.345201,0.619375,0.536525,0.359749,0.4467,0.587727,0.536216,0.620963,0.642936,0.446466,0.548863,0.662895,0.444777,0.42866
3,0.386754,0.448036,0.425032,0.534177,0.494453,0.583634,0.494052,0.592622,0.279686,0.56085,0.366404,0.496644,0.628821,0.542197,0.504237,0.652641,0.515197,0.588517,0.616138,0.62839,0.532769,0.345091,0.410633,0.420297,0.56631,0.504384,0.526343,0.320333,0.468584,0.561669,0.550671,0.571728,0.451935,0.543653,0.43025,0.588178,0.618278,0.446701,0.399691,0.588185,0.442395,0.474824,0.559555,0.497798,0.27707,0.540824,0.315673,0.605913,0.386975,0.435613,0.513396,0.555189,0.679214,0.449528,0.480358,0.330364,0.62124,0.543715,0.358453,0.451042,0.592912,0.527574,0.627779,0.643491,0.436578,0.543118,0.651689,0.457999,0.431889
4,0.397288,0.449897,0.428717,0.53256,0.494681,0.577687,0.494724,0.588956,0.287973,0.566256,0.372131,0.502115,0.623094,0.547752,0.499799,0.650657,0.51713,0.590709,0.614052,0.626653,0.523883,0.345794,0.414058,0.423079,0.56435,0.504924,0.518418,0.3216,0.466139,0.555523,0.551835,0.559659,0.452267,0.53798,0.435009,0.575825,0.619632,0.447796,0.401447,0.582777,0.447792,0.485005,0.561845,0.497576,0.275513,0.535053,0.325576,0.59863,0.394336,0.433958,0.508503,0.550303,0.675165,0.449478,0.472647,0.34625,0.617972,0.534112,0.363096,0.446843,0.590696,0.531864,0.623504,0.640703,0.444716,0.541755,0.646904,0.453307,0.431648


In [841]:
origin_df_clean = other_companies_df.loc[:, 'Country Status':'Size'].copy()
origin_df_clean_cols = list(origin_df_clean.columns)
print('Unique values for each column of cleaned dataframe:')
unique_value = 0
col_size_dict = {}
for col in origin_df_clean_cols:
    print('. {column}: {unique_values}'.format(column=col, unique_values=origin_df_clean[col].nunique()))
    col_size_dict.update({col : origin_df_clean[col].nunique()})
    unique_value += origin_df_clean[col].nunique()
print('\n Total unique values: ', unique_value)

Unique values for each column of cleaned dataframe:
. Country Status: 6
. Featured Report?: 2
. GOLD Community: 2
. Listed/Non-listed: 3
. Organization type: 8
. Region: 6
. Sector: 39
. Size: 3

 Total unique values:  69


In [0]:
# get column indexes of each features
country_stt_idx     = col_size_dict['Country Status']
feature_rp_idx      = country_stt_idx   + col_size_dict['Featured Report?']
gold_idx            = feature_rp_idx    + col_size_dict['GOLD Community']
lst_idx             = gold_idx          + col_size_dict['Listed/Non-listed']
org_type_idx        = lst_idx           + col_size_dict['Organization type']
region_idx          = org_type_idx      + col_size_dict['Region']
sector_idx          = region_idx        + col_size_dict['Sector']
size_idx            = sector_idx        + col_size_dict['Size']

In [0]:
# split dataframe of each features
country_stt_df   = top5_df.iloc[:, 0:country_stt_idx]
feature_rp_df    = top5_df.iloc[:, country_stt_idx:feature_rp_idx]
gold_df          = top5_df.iloc[:, feature_rp_idx:gold_idx]
lst_df           = top5_df.iloc[:, gold_idx:lst_idx]
org_type_df      = top5_df.iloc[:, lst_idx:org_type_idx]
region_df        = top5_df.iloc[:, org_type_idx:region_idx]
sector_df        = top5_df.iloc[:, region_idx:sector_idx]
size_df          = top5_df.iloc[:, sector_idx:size_idx]

In [0]:
# create function to replace min value of each row to 1 and the rest to 0
def replace_1_0(dataframe):
    replaced_df = dataframe.eq(dataframe.where(dataframe != 0).max(1), axis=0).astype(int)
    return replaced_df

In [845]:
display(country_stt_df)
country_stt_rdf = replace_1_0(country_stt_df)
display(country_stt_rdf)

Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD
0,0.387195,0.4405,0.422762,0.523005,0.499232,0.57396
1,0.390247,0.446983,0.426128,0.524257,0.499272,0.571041
2,0.38903,0.439522,0.423268,0.520669,0.501045,0.576903
3,0.386754,0.448036,0.425032,0.534177,0.494453,0.583634
4,0.397288,0.449897,0.428717,0.53256,0.494681,0.577687


Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD
0,0,0,0,0,0,1
1,0,0,0,0,0,1
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,0,0,0,0,0,1


In [846]:
display(feature_rp_df)
feature_rp_rdf = replace_1_0(feature_rp_df)
display(feature_rp_rdf)

Unnamed: 0,Featured Report?_No,Featured Report?_Yes
0,0.50224,0.584104
1,0.500265,0.583549
2,0.507329,0.585443
3,0.494052,0.592622
4,0.494724,0.588956


Unnamed: 0,Featured Report?_No,Featured Report?_Yes
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1


In [847]:
display(gold_df)
gold_rdf = replace_1_0(gold_df)
display(gold_rdf)

Unnamed: 0,GOLD Community_No,GOLD Community_Yes
0,0.27841,0.572296
1,0.281209,0.56663
2,0.281926,0.57923
3,0.279686,0.56085
4,0.287973,0.566256


Unnamed: 0,GOLD Community_No,GOLD Community_Yes
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1


In [848]:
display(lst_df)
lst_rdf = replace_1_0(lst_df)
display(lst_rdf)

Unnamed: 0,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable
0,0.369857,0.510886,0.63474
1,0.368098,0.507695,0.629221
2,0.373238,0.513111,0.636049
3,0.366404,0.496644,0.628821
4,0.372131,0.502115,0.623094


Unnamed: 0,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1


In [849]:
display(org_type_df)
org_type_rdf = replace_1_0(org_type_df)
display(org_type_rdf)

Unnamed: 0,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary
0,0.551159,0.507123,0.655974,0.511022,0.595664,0.610833,0.63072,0.523265
1,0.550867,0.502939,0.649318,0.509205,0.591835,0.613376,0.627758,0.528637
2,0.545769,0.510752,0.659298,0.517104,0.595522,0.604928,0.631649,0.513694
3,0.542197,0.504237,0.652641,0.515197,0.588517,0.616138,0.62839,0.532769
4,0.547752,0.499799,0.650657,0.51713,0.590709,0.614052,0.626653,0.523883


Unnamed: 0,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary
0,0,0,1,0,0,0,0,0
1,0,0,1,0,0,0,0,0
2,0,0,1,0,0,0,0,0
3,0,0,1,0,0,0,0,0
4,0,0,1,0,0,0,0,0


In [850]:
display(region_df)
region_rdf = replace_1_0(region_df)
display(region_rdf)

Unnamed: 0,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania
0,0.339854,0.405282,0.427493,0.570726,0.50355,0.516882
1,0.344292,0.40604,0.427913,0.57092,0.502453,0.518752
2,0.338092,0.408671,0.427646,0.567568,0.50457,0.512954
3,0.345091,0.410633,0.420297,0.56631,0.504384,0.526343
4,0.345794,0.414058,0.423079,0.56435,0.504924,0.518418


Unnamed: 0,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania
0,0,0,0,1,0,0
1,0,0,0,1,0,0
2,0,0,0,1,0,0
3,0,0,0,1,0,0
4,0,0,0,1,0,0


In [851]:
display(sector_df)
sector_rdf = replace_1_0(sector_df)
display(sector_rdf)

Unnamed: 0,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities
0,0.311831,0.474316,0.560003,0.558769,0.563665,0.449343,0.536849,0.431539,0.582813,0.619454,0.449128,0.404407,0.583231,0.447648,0.488812,0.562898,0.491915,0.274824,0.536991,0.319151,0.599873,0.393694,0.423067,0.514632,0.553942,0.682917,0.447662,0.477604,0.341271,0.615375,0.540994,0.358535,0.443974,0.590299,0.53938,0.619303,0.641555,0.443299,0.543601
1,0.315931,0.471982,0.560512,0.556489,0.563102,0.451623,0.535653,0.429416,0.579355,0.61798,0.446726,0.406229,0.581314,0.44608,0.483485,0.565704,0.491936,0.280047,0.5389,0.320575,0.598972,0.397813,0.430217,0.512094,0.554223,0.677854,0.450378,0.475019,0.342999,0.613004,0.542766,0.362905,0.443803,0.589073,0.536665,0.617981,0.638475,0.442964,0.538707
2,0.313541,0.473287,0.553843,0.562728,0.558296,0.445389,0.537246,0.433586,0.578788,0.622826,0.454233,0.405204,0.582238,0.451928,0.496593,0.556961,0.494395,0.274042,0.534177,0.32272,0.597293,0.39415,0.417362,0.511599,0.547699,0.682577,0.44194,0.478434,0.345201,0.619375,0.536525,0.359749,0.4467,0.587727,0.536216,0.620963,0.642936,0.446466,0.548863
3,0.320333,0.468584,0.561669,0.550671,0.571728,0.451935,0.543653,0.43025,0.588178,0.618278,0.446701,0.399691,0.588185,0.442395,0.474824,0.559555,0.497798,0.27707,0.540824,0.315673,0.605913,0.386975,0.435613,0.513396,0.555189,0.679214,0.449528,0.480358,0.330364,0.62124,0.543715,0.358453,0.451042,0.592912,0.527574,0.627779,0.643491,0.436578,0.543118
4,0.3216,0.466139,0.555523,0.551835,0.559659,0.452267,0.53798,0.435009,0.575825,0.619632,0.447796,0.401447,0.582777,0.447792,0.485005,0.561845,0.497576,0.275513,0.535053,0.325576,0.59863,0.394336,0.433958,0.508503,0.550303,0.675165,0.449478,0.472647,0.34625,0.617972,0.534112,0.363096,0.446843,0.590696,0.531864,0.623504,0.640703,0.444716,0.541755


Unnamed: 0,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [852]:
display(size_df)
size_rdf = replace_1_0(size_df)
display(size_rdf)

Unnamed: 0,Size_Large,Size_MNE,Size_SME
0,0.662312,0.445283,0.428862
1,0.655581,0.447179,0.432805
2,0.662895,0.444777,0.42866
3,0.651689,0.457999,0.431889
4,0.646904,0.453307,0.431648


Unnamed: 0,Size_Large,Size_MNE,Size_SME
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [0]:
# Create function to concatenate two dataframes
def concat_dataframes(df_1, df_2):
    new_df = pd.concat([df_1, df_2.reindex(df_1.index)], axis=1)
    return new_df

In [0]:
# Concatenate dataframes
country_report_ = concat_dataframes(country_stt_rdf, feature_rp_rdf)
_gold           = concat_dataframes(country_report_, gold_rdf)
_lst            = concat_dataframes(_gold, lst_rdf)
_org            = concat_dataframes(_lst, org_type_rdf)
_region         = concat_dataframes(_org, region_rdf)
_sector         = concat_dataframes(_region, sector_rdf)
replaced_top5   = concat_dataframes(_sector, size_rdf)

In [855]:
display(replaced_top5)
print('\nShape:', replaced_top5.shape)

Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0



Shape: (5, 69)


In [856]:
unique_encoded = replaced_top5.drop_duplicates()
unique_encoded

Unnamed: 0,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [857]:
checking_source = df_for_last_step
checking_source.head()

Unnamed: 0,Country Status,Featured Report?,GOLD Community,Listed/Non-listed,Organization type,Region,Sector,Size,Country Status_DAC-LDC,Country Status_DAC-LMICT,Country Status_DAC-OLIC,Country Status_DAC-UMICT,Country Status_Non-OECD / Non-DAC,Country Status_OECD,Featured Report?_No,Featured Report?_Yes,GOLD Community_No,GOLD Community_Yes,Listed/Non-listed_Listed,Listed/Non-listed_Non-listed,Listed/Non-listed_Not applicable,Organization type_Cooperative,Organization type_Non-profit organization,Organization type_Partnership,Organization type_Private Company,Organization type_Private company,Organization type_Public institution,Organization type_State-owned company,Organization type_Subsidiary,Region_Africa,Region_Asia,Region_Europe,Region_Latin America & the Caribbean,Region_Northern America,Region_Oceania,Sector_=,Sector_Agriculture,Sector_Automotive,Sector_Aviation,Sector_Chemicals,Sector_Commercial Services,Sector_Computers,Sector_Conglomerates,Sector_Construction,Sector_Construction Materials,Sector_Consumer Durables,Sector_Energy,Sector_Energy Utilities,Sector_Equipment,Sector_Financial Services,Sector_Food and Beverage Products,Sector_Forest and Paper Products,Sector_Healthcare Products,Sector_Healthcare Services,Sector_Household and Personal Products,Sector_Logistics,Sector_Media,Sector_Metals Products,Sector_Mining,Sector_Non-Profit / Services,Sector_Other,Sector_Public Agency,Sector_Railroad,Sector_Real Estate,Sector_Retailers,Sector_Technology Hardware,Sector_Telecommunications,Sector_Textiles and Apparel,Sector_Tobacco,Sector_Tourism/Leisure,Sector_Toys,Sector_Universities,Sector_Waste Management,Sector_Water Utilities,Size_Large,Size_MNE,Size_SME
0,OECD,No,No,Listed,Private company,Europe,Energy Utilities,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,OECD,No,No,Listed,Private company,Northern America,Healthcare Products,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,OECD,No,No,Listed,Private company,Europe,Consumer Durables,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,OECD,No,No,Listed,Private Company,Europe,Construction Materials,Large,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
5,OECD,No,Yes,Listed,Private company,Northern America,Automotive,MNE,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [0]:
# create a dictionary of values of the 'unique_encoded'
unique_encoded_to_dict = unique_encoded.to_dict(orient='records')

# show a data list with satisfying conditions
for record in range(len(unique_encoded_to_dict)):
    dict_item = unique_encoded_to_dict[record]
    for key, value in dict_item.items():
        global checking_source
        checking_source = checking_source[checking_source[key] == value].copy()
        final_df = checking_source

In [859]:
if final_df.shape[0] == 0:
    print('No companies are founded!')
else:
    display(final_df)

No companies are founded!


## 2. Improve on the company encoder by using more features or use a different architecture to improve performance of the Top 5 similar company API.