# The objective of this project is to classify companies based on their Industry Classification Tags using attributes provided in the datasets

Inputs provided:
* Training dataset
* Test dataset

Models/Frameworks/Libraries used:
* simpletransformers library
* BERT transformer model for tokenization and text classification
* pandas and numpy for data processing and linear algebra
* scikit-learn library for metrics evaluation
* IPython library to generate file links
* CUDA and the apex library for training on an NVIDIA GPU

The broad workflow followed in this notebook is as follows:
* Splitting the labeled training dataset provided into training data and validation data, saved and referred to in this notebook as 'train.csv' and 'valid.csv' respectively
* Installing the simpletransformers library and using its classification model package for multiclass classification
* Preprocessing the training and validation data using standard text cleaning techniques for NLP and preparing the data in compliance with the requirements of the simpletransformers model
* Initializing and training the model
* Evaluating on the validation data and determining the metrics
* Preprocessing the test dataset using standard text cleaning techniques for NLP and preparing the data in compliance with the requirements of the simpletransformers model
* Running the model on the test dataset and generating predictions
* Generating the final output dataset with Industry Classification Tags for the data provided in the test dataset



### Importing the necessary linear algebra and data processing libraries

In [1]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


### Simpletransformers is an NLP library mainly used for text classification, language model training, token classification (NER) and conversational AI, along with many other tasks. It was chosen for this task because it provides a simple pipeline on top of the powerful and sophisticated Transformers library developed by Hugging Face, through which industry-standard transformer-based text classification models such as BERT, RoBERTa and XLNet can be used to build our NLP models without much hassle.

### The pip package manager is used to download and install the simpletransformers library

In [2]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.40.2-py3-none-any.whl (190 kB)
     |████████████████████████████████| 190 kB 3.7 MB/s eta 0:00:01
Collecting seqeval
  Downloading seqeval-0.0.12.tar.gz (21 kB)
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-0.0.12-py3-none-any.whl size=7423 sha256=09b0ac2e2d581aef63843268530bff441a155acd359109aefd4572094fc2be00
  Stored in directory: /root/.cache/pip/wheels/dc/cc/62/a3b81f92d35a80e39eb9b2a9d8b31abac54c02b21b2d466edc
Successfully built seqeval
Installing collected packages: seqeval, simpletransformers
Successfully installed seqeval-0.0.12 simpletransformers-0.40.2


### As the main aim is to classify companies based on their business description, the task primarily would be to implement a multiclass classification which can be achieved using the classification model package of the simpletransformers library

In [3]:
from simpletransformers.classification import ClassificationModel




### Initialize the model with BERT as the transformer, bert-base-uncased as the pretrained weights and tokenizer, 2 training epochs and 62 labels

In [4]:
model = ClassificationModel('bert', 'bert-base-uncased', num_labels=62, args={'reprocess_input_data': True, 'overwrite_output_dir': True,"num_train_epochs": 2},use_cuda=True)

### The training dataset provided for the task is shuffled and split into two datasets named 'train.csv' and 'valid.csv'. train.csv file contains data that the model will be training on and valid.csv file will be treated as validation dataset on which the evaluation will be performed in order to obtain the performance metrics of the model
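
The split itself is not shown in this notebook. A minimal sketch of one way to shuffle and split a labeled DataFrame with pandas follows; the toy rows, the 75/25 ratio and the `random_state` value are assumptions for illustration, not the settings actually used.

```python
import pandas as pd

# Toy stand-in for the labeled dataset (hypothetical rows)
full_df = pd.DataFrame({
    "Business Description": [f"company {i} description" for i in range(8)],
    "Industry Classification Tag": ["Banks", "Steel"] * 4,
})

# Shuffle all rows reproducibly, then cut at an assumed 75/25 boundary
shuffled = full_df.sample(frac=1, random_state=42).reset_index(drop=True)
cut = int(len(shuffled) * 0.75)
train_part = shuffled.iloc[:cut]
valid_part = shuffled.iloc[cut:]

# The two parts would then be written out, e.g.
# train_part.to_csv("train.csv"); valid_part.to_csv("valid.csv")
print(len(train_part), len(valid_part))
```

Every row ends up in exactly one of the two parts, so the validation data is never seen during training.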

### pandas is used to read in the train.csv file containing the data to be trained on

In [5]:
train_df = pd.read_csv('../input/traincsv/train.csv')

### Preprocessing of data is done through some standard text cleaning techniques

In [6]:
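# Clean the business descriptions (expects a pandas Series of strings)
def normalise_business_description(description):
    description = description.str.lower()  # lowercase
    description = description.str.replace("#", "", regex=False)  # strip '#' characters
    description = description.str.replace(r"http\S+", "URL", regex=True)  # replace URLs with a placeholder token
    description = description.str.replace("@", "", regex=False)  # strip '@' characters
    # replace every character outside the allowed set with a space
    description = description.str.replace(r"[^A-Za-z0-9()!?\'\`\"]", " ", regex=True)
    description = description.str.replace(r"\s{2,}", " ", regex=True)  # collapse repeated whitespace
    return description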

In [7]:
train_df['Business Description'] = normalise_business_description(train_df['Business Description'])
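
To make the effect of these cleaning steps concrete, here is a small sketch that applies the same sequence to a single string with the standard `re` module. The helper name `clean_text` and the sample sentence are made up for illustration.

```python
import re

def clean_text(text):
    """Apply the notebook's cleaning steps to one plain string."""
    text = text.lower()                                # lowercase
    text = text.replace("#", "")                       # strip '#'
    text = re.sub(r"http\S+", "URL", text)             # URLs become a placeholder
    text = text.replace("@", "")                       # strip '@'
    text = re.sub(r"[^A-Za-z0-9()!?'`\"]", " ", text)  # other symbols become spaces
    text = re.sub(r"\s{2,}", " ", text)                # collapse repeated whitespace
    return text

sample = "Visit http://example.com for #AI & more!!"
cleaned = clean_text(sample)
print(cleaned)  # visit URL for ai more!!
```

Note that the `URL` placeholder stays uppercase because it is inserted after the lowercasing step.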

### As the simpletransformers model requires labels to be of integer type, the 'Industry Classification Tag' attribute is treated as categorical data. Categorizing the data maps each unique entry in the Industry Classification Tag attribute to a unique integer code. The code-to-tag mapping is stored in a dictionary (class_weights) so that the predicted codes can be mapped back to tags once predictions are obtained. A new column 'Target' is created that holds the integer code of each row's 'Industry Classification Tag'
*Note: The aforementioned procedure is implemented on both training and validation datasets*

In [8]:
train_df.drop('Company Name',axis=1,inplace=True)
train_df['Industry Classification Tag'] = train_df['Industry Classification Tag'].astype('category')
class_weights = dict(enumerate(train_df['Industry Classification Tag'].cat.categories)) 
train_df['Target'] = train_df['Industry Classification Tag'].cat.codes.values

In [9]:
valid_df = pd.read_csv('../input/validationcsv/valid.csv')
valid_df['Business Description'] = normalise_business_description(valid_df['Business Description'])

In [10]:
valid_df.drop('Company Name',axis=1,inplace=True)
valid_df['Industry Classification Tag'] = valid_df['Industry Classification Tag'].astype('category')
valid_df['Target'] = valid_df['Industry Classification Tag'].cat.codes.values
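
One caveat: converting each DataFrame to `category` independently yields matching codes only when both contain exactly the same set of tags. A hedged sketch of sharing one category definition across both frames is shown here; the toy frames, the column name `tag` and the variable names are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.DataFrame({"tag": ["Banks", "Steel", "Airlines", "Banks"]})
valid = pd.DataFrame({"tag": ["Steel", "Banks"]})  # 'Airlines' is absent here

# Encoding valid on its own assigns different codes than the training data uses
independent_codes = valid["tag"].astype("category").cat.codes.tolist()

# A shared dtype built from the training tags keeps codes consistent
tag_dtype = CategoricalDtype(categories=sorted(train["tag"].unique()))
train["Target"] = train["tag"].astype(tag_dtype).cat.codes
valid["Target"] = valid["tag"].astype(tag_dtype).cat.codes

# Code-to-tag lookup for mapping predictions back later
code_to_tag = dict(enumerate(tag_dtype.categories))
print(independent_codes, valid["Target"].tolist(), code_to_tag)
```

With a shared dtype, a tag that appears only in the validation data would receive the code -1, which makes such rows easy to detect.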

### Drop the attributes not required for training and validation

In [12]:
train_df.drop(['Unnamed: 0','Industry Classification Tag',],axis=1,inplace=True)
valid_df.drop(['Unnamed: 0','Industry Classification Tag',],axis=1,inplace=True)

### The simpletransformers model requires the first column to be of type string, so the data type of the 'Business Description' attribute is converted from object to string

In [14]:
train_df['Business Description'] = train_df['Business Description'].astype('string')

### Drop any entries in the dataset with null values

In [15]:
train_df.dropna(inplace=True)

### Check whether the attributes of the training data comply with the requirements of the simpletransformers model

In [16]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4517 entries, 0 to 4532
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Business Description  4517 non-null   string
 1   Target                4517 non-null   int8  
dtypes: int8(1), string(1)
memory usage: 75.0 KB


### As with the training data, the 'Business Description' attribute of the validation data is converted from object to string, and any null entries in the dataset are dropped

In [17]:
valid_df['Business Description'] = valid_df['Business Description'].astype('string')
valid_df.dropna(inplace=True)

### Check whether the attributes of the validation data comply with the requirements of the simpletransformers model

In [18]:
valid_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1505 entries, 0 to 1511
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Business Description  1505 non-null   string
 1   Target                1505 non-null   int8  
dtypes: int8(1), string(1)
memory usage: 25.0 KB


### Download and install the apex library required for working on an NVIDIA GPU with CUDA

In [19]:
%%writefile setup.sh

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir ./

Writing setup.sh


In [20]:
!sh setup.sh

Cloning into 'apex'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 7293 (delta 20), reused 19 (delta 6), pack-reused 7255
Receiving objects: 100% (7293/7293), 13.87 MiB | 18.52 MiB/s, done.
Resolving deltas: 100% (4920/4920), done.
Non-user install because site-packages writeable
Created temporary directory: /tmp/pip-ephem-wheel-cache-e4leullk
Created temporary directory: /tmp/pip-req-tracker-zojojaz9
Initialized build tracking at /tmp/pip-req-tracker-zojojaz9
Created build tracker: /tmp/pip-req-tracker-zojojaz9
Entered build tracker: /tmp/pip-req-tracker-zojojaz9
Created temporary directory: /tmp/pip-install-mt5vuy_b
Processing /kaggle/working/apex
  Created temporary directory: /tmp/pip-req-build-3w9vxub1
  Added file:///kaggle/working/apex to build tracker '/tmp/pip-req-tracker-zojojaz9'
    Running setup.py (path:/tmp/pip-req-build-3w9vxub1/setup.py) egg_info for pack

## Train the model on the training data

In [21]:
model.train_model(train_df)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."



Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


Running loss: 4.234992



Running loss: 4.057360
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Running loss: 4.207181
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Running loss: 4.005649



Running loss: 2.243986




Running loss: 1.813078



### Run evaluation on validation data

In [22]:
result, model_outputs, wrong_predictions = model.eval_model(valid_df)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


### Model evaluation on certain specific metrics

In [23]:

from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')


    
result, model_outputs, wrong_predictions = model.eval_model(valid_df, f1=f1_multiclass, acc=accuracy_score)

### The results of evaluation

In [24]:
print(result)

{'mcc': 0.6702921752109217, 'f1': 0.6803986710963456, 'acc': 0.6803986710963456, 'eval_loss': 1.3807645705011156}
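
The 'f1' and 'acc' values above are identical. That is expected when every example has exactly one label and F1 is micro-averaged: each wrong prediction counts once as a false positive and once as a false negative, so precision, recall and accuracy coincide. A small pure-Python sketch on made-up labels (the helper names are illustrative, not part of any library):

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)

def micro_f1(labels, preds):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    classes = set(labels) | set(preds)
    tp = fp = fn = 0
    for c in classes:
        for y, p in zip(labels, preds):
            tp += (p == c and y == c)
            fp += (p == c and y != c)
            fn += (p != c and y == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Assumes at least one correct prediction, otherwise this divides by zero
    return 2 * precision * recall / (precision + recall)

toy_labels = [0, 1, 2, 2, 1, 0]
toy_preds = [0, 2, 2, 2, 1, 1]
print(accuracy(toy_labels, toy_preds), micro_f1(toy_labels, toy_preds))
```

A macro- or weighted-averaged F1 would generally differ from accuracy and can reveal weak performance on rare classes.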


### Raw model outputs obtained after evaluation on validation data

In [25]:
print(model_outputs)

[[-1.1308594e+00 -2.8784180e-01 -1.0406494e-01 ... -9.0625000e-01
   7.1435547e-01 -5.1318359e-01]
 [ 3.6303711e-01  2.5131226e-02  7.2509766e-01 ... -4.2895508e-01
  -5.1464844e-01 -1.1396484e+00]
 [-9.7119141e-01  2.3327637e-01 -8.9794922e-01 ... -1.6967773e-01
   2.2619629e-01  1.0771484e+00]
 ...
 [ 3.9428711e-01  1.1152344e+00  4.3261719e-01 ... -4.6850586e-01
  -7.6611328e-01  6.1816406e-01]
 [-2.5878906e-01  2.0058594e+00 -3.0468750e-01 ...  1.2500000e-01
  -3.5351562e-01  1.4833984e+00]
 [-3.6499023e-01  2.9611588e-04 -6.3037109e-01 ...  3.1518555e-01
  -7.7099609e-01 -1.2048340e-01]]
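
Assuming each row of the raw outputs holds one unnormalized score (logit) per class, the predicted class is the index of the largest score, and a softmax turns the scores into probabilities. A minimal numpy sketch on made-up three-class logits:

```python
import numpy as np

# Hypothetical logits for two examples and three classes
logits = np.array([
    [-1.13, -0.29, 0.71],
    [0.36, 0.73, -1.14],
])

# Predicted class = position of the largest logit in each row
predicted = logits.argmax(axis=1)

# Numerically stable softmax: subtract the row maximum before exponentiating
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(predicted, probs.round(3))
```

The softmax does not change which class wins; it only rescales the scores so that each row sums to 1.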


### Reading in the test data through the pandas framework

In [26]:
test_df = pd.read_csv('../input/testcsv/test.csv',encoding = "ISO-8859-1")
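
The `encoding` argument matters because bytes above 0x7F decode differently under ISO-8859-1 and UTF-8. A short standard-library illustration with made-up sample bytes:

```python
# 'é' is the single byte 0xE9 in ISO-8859-1
raw = "Café".encode("ISO-8859-1")

# Decoding with the matching encoding recovers the text
decoded = raw.decode("ISO-8859-1")

# The same bytes are not valid UTF-8
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(raw, decoded, utf8_ok)
```

A file saved in ISO-8859-1 can therefore fail to load with a UTF-8 default, which is a common reason for passing the encoding explicitly.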

In [27]:
test_df.head()

| | Company | Business Description |
|---|---|---|
| 0 | 3rd Rock Multimedia Ltd | 3rd Rock Multimedia Limited is an India-based ... |
| 1 | Andhra Petrochemicals Ltd | The Andhra Petrochemicals Limited is an India-... |
| 2 | Force Motors Ltd | Force Motors Limited is a holding company. The... |
| 3 | Diamines And Chemicals Ltd | Diamines and Chemicals Limited is a holding co... |
| 4 | Insilco Ltd | Insilco Limited is engaged in manufacturing an... |


In [28]:
test_df.columns

Index(['Company ', 'Business Description'], dtype='object')

### Preprocessing the test data with NLP cleaning techniques and converting the 'Business Description' attribute to string type in compliance with the simpletransformers model

In [29]:
test_df['Business Description'] = normalise_business_description(test_df['Business Description'])
test_df['Business Description'] = test_df['Business Description'].astype('string')

### Generating predictions for the test dataset

In [30]:
predictions, raw_outputs = model.predict(test_df['Business Description'])

### Predictions are category codes, which are then remapped to the appropriate Industry Classification Tags

In [31]:
print(predictions)

[40 10 13 10 10  5  8 13 10 10 33 10 56 10  5 10  5  5  8  5 13 19 19 19
 19 56  8 19 56 13 12 13 13  8 13 13 14 13 13 12  5  5 10 19 10  3 10 10
 10 10 10 10 31 10 14 13  8 19 14  4 10 13 10 56  5  5 14 14 14 12 14 51
  5  4  5  5  8  8 10  8  8  8 51 19 11 19  8 19 19 19 10  5 13 13 13 13
 56 13 13  1 12 56 19 13 13 33 13 14  1 13  8 10 10 10  8 10 10 10 10 10
 10 10 10  7 10  7 54 10 10  8  8 10  8 10  4 19 10 10 11 28 28  4  8 14
 56 14  8  5 14 19 10 17  5  5  8 13 21  0 51  4 14 11 10 14  0 10 14 51
 14 56  5 14  5 56 10 14  5 12  8  4 51 14 13  5 22  8 28 56 19 55 47  5
  5  4  4  5  4 51  4  4 11 14 31  8 13 10  8 14  8  7 13  4 11 13 51  5
 51  8  5 13 14  4 28  3  5 14 14 51  8 14 14 19 33 13 19 21 13  5 14  5
 14 14  8  8  5 14 19  8 19  4 14  8 12  5 10 14 13 14 14  5 10 56  8 51
 14 13 52 45  5 19 19 14 10  5 18 10 19 10  3 51 51 10  8 54 17 31 51  3
 19 33  8 19  4 17 18 19 17 51 13 19 51  3  8 13 10 13 51 13 18 18 19  4
  8 51 13 19  3 18 51  4 21 13 13 13  0 56  8 28 51

### Converting the predictions, which are a numpy array, into a pandas Series and then mapping the category codes to Industry Classification Tags using the class_weights dictionary created in an earlier step

In [32]:
predictions = pd.Series(predictions)
test_df['Predicted Industry Tags'] = predictions.map(class_weights)
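
Assigning a Series to a DataFrame column aligns on the index, so this works here because `test_df` carries the default 0..n-1 index. A hedged sketch of the same mapping that stays correct for any index follows; the frame, the codes and the `code_to_tag` dictionary are toy values.

```python
import numpy as np
import pandas as pd

code_to_tag = {0: "Airlines", 1: "Banks", 2: "Steel"}  # hypothetical mapping
frame = pd.DataFrame({"Business Description": ["a", "b", "c"]}, index=[10, 11, 12])
preds = np.array([2, 0, 1])

# Build the Series on the frame's own index so rows line up positionally
frame["Predicted Industry Tags"] = pd.Series(preds, index=frame.index).map(code_to_tag)

# A Series with a fresh 0..n-1 index would not match the labels 10..12
misaligned = pd.Series(preds).map(code_to_tag).reindex(frame.index)

print(frame["Predicted Industry Tags"].tolist(), misaligned.isna().all())
```

If rows had been dropped from the test data before prediction, the explicit index would avoid silently producing missing tags.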

In [33]:
test_df.columns

Index(['Company ', 'Business Description', 'Predicted Industry Tags'], dtype='object')

In [34]:
test_df.head()

| | Company | Business Description | Predicted Industry Tags |
|---|---|---|---|
| 0 | 3rd Rock Multimedia Ltd | 3rd rock multimedia limited is an india based ... | Movies & Entertainment |
| 1 | Andhra Petrochemicals Ltd | the andhra petrochemicals limited is an india ... | Commodity Chemicals |
| 2 | Force Motors Ltd | force motors limited is a holding company the ... | Construction Machinery & Heavy Trucks |
| 3 | Diamines And Chemicals Ltd | diamines and chemicals limited is a holding co... | Commodity Chemicals |
| 4 | Insilco Ltd | insilco limited is engaged in manufacturing an... | Commodity Chemicals |


### Saving the output to a csv file

In [36]:
test_df.to_csv(r'Output_dataset_BertModel.csv')

### Generating a downloadable link to the output csv file

In [37]:
from IPython.display import FileLink
FileLink(r'Output_dataset_BertModel.csv')