<a href="https://colab.research.google.com/github/Mosh094/MultiLabel_Classification_Project/blob/main/Multi_Label_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Solution to MultiLabel Classification Project

## Recommendation of relevant work categories offered by customers based of their description

The project focuses on building a machine learning algorithm that recommends possible offerings from clients, based on the details extracted from thier provided description


## Table of Contents
 - Problem Statement
 - Data set
 - Preprocessing
     - Cleaning
     - Label Encoding
     - Validation split
 - Model building
     - Binary Relevace
     - Classifier Chains
     - Label Powerset

### Problem Statement

- The task is obviously not a recommender system activity, as it focuses on extracting categories of business from the provided details, rather follow any historical or inherent pattern as required for recommender system
- The fact that it is a supervised learning approach removes any element of clustering.
- The approach employed (Classification of recommend relevant work categories to customers based of their description) is closely related to topic modelling, as we are required to use description provided by clients, to predict the possible category of their activities, by extracting key topics from their details. This is however rulled out a the presented data has features and label, erasing any unsupervised learning approach
- The eventual approach deployed model is multi-label classification. As a classification approach, the model helps to classify categories bassed on the corresponding inherent relationship with the description. Muli-label classification deiffers from binary and multi-class classification. 
- Binary refers to classification done with two possible outcomes, i.e yes/no. Multiclass refers to classification involving more than two possible outcomes. 
- However, while, binary and multi-class refers to possibility of outputing only one result per obseration, multi-label classification refers to the case of possibly outputing one or more results for every observation, as it is in our case
- There is also a case of class imbalance, which will require more investigations, but a quick fix is the train/validation/test split approach

### Import Relevant Libraries

In [7]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import SGDClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
import ast
from ast import literal_eval
import seaborn as sns
from nltk.corpus import stopwords
from wordcloud import wordcloud 
from socket import socket
from nltk.stem.snowball import SnowballStemmer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from skmultilearn.problem_transform import ClassifierChain
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import hamming_loss
from skmultilearn.problem_transform import BinaryRelevance

### Import and Describe Dataset

Mount Google Drive to access file

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
DATA_PATH = '/content/drive/MyDrive/GitHub/Sample_Multi_Label_Classification_Poject/Sample_Multi_Label_Classification_Poject/descriptions-data.csv'
df_Description = pd.read_csv(DATA_PATH)
df_Description

Unnamed: 0,CompanyID,Description,TargetWorkCatergoryIDs
0,1,Celebrating 40 years in business! Established ...,"[61, 62, 63, 70, 89, 91]"
1,2,High Point Construction is a well-established ...,[47]
2,3,West Yorkshire Render & Flooring Ltd are a spe...,"[3, 4]"
3,4,We are access experts based in Manchester and ...,[25]
4,5,Premier PAT Testing Ltd has established a well...,"[89, 91, 69, 71]"
...,...,...,...
1803,1804,We are a highly adaptive electrical contractor...,[64]
1804,1805,Horizon Construction Firm has been around for ...,"[16, 3, 4]"
1805,1806,The Smith family purchased Berkyn Manor Farm &...,"[50, 51, 52]"
1806,1807,West Scotland Joinery Solutions is a joinery c...,"[9, 14, 15]"



The Description dataset consists of 1,808 observations and 3 columns.As shown above, the "TargetWorkCatergoryIDs" consists of "multiple labels" in individual column, triggering our decision for a multi-label approach.

Below, we check to confrim the data type of the target column

In [10]:
type(df_Description['TargetWorkCatergoryIDs'].iloc[0])

str

The target column is of string data type. This confirms the earlier and was confirmed below, as it is outputed in quotes.

In [11]:
df_Description['TargetWorkCatergoryIDs'].iloc[0]

'[61, 62, 63, 70, 89, 91]'

Using the "literal_eval" library, we confirm that the target column can still maintain its list type, but not expressly seen as same by python, without this fuction from ast library 

In [12]:
ast.literal_eval(df_Description['TargetWorkCatergoryIDs'].iloc[0])

[61, 62, 63, 70, 89, 91]

We then convert the target column to list and apply the way its displayed above to python

In [13]:
df_Description['TargetWorkCatergoryIDs'] = df_Description.TargetWorkCatergoryIDs.apply(lambda x: literal_eval(str(x)))
df_Description.head()

Unnamed: 0,CompanyID,Description,TargetWorkCatergoryIDs
0,1,Celebrating 40 years in business! Established ...,"[61, 62, 63, 70, 89, 91]"
1,2,High Point Construction is a well-established ...,[47]
2,3,West Yorkshire Render & Flooring Ltd are a spe...,"[3, 4]"
3,4,We are access experts based in Manchester and ...,[25]
4,5,Premier PAT Testing Ltd has established a well...,"[89, 91, 69, 71]"


In [14]:
target = df_Description['TargetWorkCatergoryIDs']
target

0       [61, 62, 63, 70, 89, 91]
1                           [47]
2                         [3, 4]
3                           [25]
4               [89, 91, 69, 71]
                  ...           
1803                        [64]
1804                  [16, 3, 4]
1805                [50, 51, 52]
1806                 [9, 14, 15]
1807                    [61, 63]
Name: TargetWorkCatergoryIDs, Length: 1808, dtype: object

### Multi-Label Encoding

Next we carry out multi label encoding usign the Multi Label Binarizer library, to turn the target column to one-hot encoding (where its allocated 1 when present in an observation, and 0 when not present)

In [15]:
targetlabel = MultiLabelBinarizer()
y = targetlabel.fit_transform(df_Description['TargetWorkCatergoryIDs'])
y

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Quick confirmation carried out below, with a few sample of the target column, against the dervied one hot encoding

In [16]:
df_Description['TargetWorkCatergoryIDs'].iloc[2]

[3, 4]

In [17]:
y[2]

array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

In [18]:
df_Description['TargetWorkCatergoryIDs'].iloc[25]

[0, 9, 14, 15, 18]

In [19]:
y[25]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

Calling the column headers for the dummy columns derived from multi label encoding

In [20]:
targetlabel.classes_

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
       53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
       70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,
       87, 88, 89, 90, 91, 92, 93, 94, 95])

### Conactinating the target colums with the other variables (description)

This is mapped into a table for the target columns alone. Meanwhile, it will be observed that when compared with the categories table, category 29 and 30 does not feature in any of the observation. This was inputed back to balance up the target columns with the category table

In [21]:
cleaned_y = pd.DataFrame(y, columns=targetlabel.classes_)
cleaned_y

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,86,87,88,89,90,91,92,93,94,95
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1803,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1804,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1805,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1806,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Columns mapped to including missing columns (29 and 30)

In [22]:
y_Col = cleaned_y.reindex([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
       53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
       70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,
       87, 88, 89, 90, 91, 92, 93, 94, 95], axis=1)

In [23]:
y_Col

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,86,87,88,89,90,91,92,93,94,95
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1803,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1804,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1805,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1806,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


As shown above, the newly added columns are displaying NaN, this is further confirmed belwo and treated acocordingly

In [24]:
y_Col29 = y_Col[29]
y_Col29

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
1803   NaN
1804   NaN
1805   NaN
1806   NaN
1807   NaN
Name: 29, Length: 1808, dtype: float64

In [25]:
y_Col30 = y_Col[30]
y_Col30

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
1803   NaN
1804   NaN
1805   NaN
1806   NaN
1807   NaN
Name: 30, Length: 1808, dtype: float64

In [26]:
y_Col28 = y_Col[28]
y_Col28

0       0
1       0
2       0
3       0
4       0
       ..
1803    0
1804    0
1805    0
1806    0
1807    0
Name: 28, Length: 1808, dtype: int64

In [27]:
y_Col32 = y_Col[32]
y_Col32

0       0
1       0
2       0
3       0
4       0
       ..
1803    0
1804    0
1805    0
1806    0
1807    0
Name: 32, Length: 1808, dtype: int64

Replace NaN with 0 (that is no observation for all rows)

In [28]:
y_Col[29] = y_Col[29].fillna(0)
y_Col[30] = y_Col[30].fillna(0)

In [29]:
pd.set_option("display.max_rows", 10, "display.max_columns", None)
print(y_Col)

      0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  \
0      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   
1      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   
2      0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   
3      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   
4      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   
...   ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..   
1803   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   
1804   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   1   0   
1805   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   
1806   0   0   0   0   0   0   0   0   0   1   0   0   0   0   1   1   0   0   
1807   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   

      18  19  20  21  22  23  24  25  2

In [30]:
y_Col

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1803,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1804,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1805,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1806,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


The added columns have "0" values, but float data types, for uniformity, this will be converted to "int" 

In [31]:
y_Col = y_Col.astype({29:'int', 30:'int'}) 
  
# displaying the datatypes
display(y_Col.dtypes)

0     int64
1     int64
2     int64
3     int64
4     int64
      ...  
91    int64
92    int64
93    int64
94    int64
95    int64
Length: 96, dtype: object

Load Category Table 

In [32]:
DATA_PATH2 = '/content/drive/MyDrive/GitHub/Sample_Multi_Label_Classification_Poject/Sample_Multi_Label_Classification_Poject/work-catergories-data.csv'
df_Categories = pd.read_csv(DATA_PATH2)
df_Categories 

Unnamed: 0,WorkCatID,WorkCatName,Annotation
0,0,Building finishes > Doors > Doors,Build_Doors
1,1,Building finishes > Doors > Doors (fire proof),Build_Door_FireProof
2,2,Building finishes > External walls > Cladding,Build_Ext_Cladding
3,3,Building finishes > External walls > Rendering,Buidl_Ext_Rendering
4,4,Building finishes > Flooring > Flooring (cemen...,Build_Floor_CementScreeds
...,...,...,...
91,91,Electrical > Power supply > Emergency lighting...,Electrical_Power_Lighting-Maintenance
92,92,Building general > Sector > Listed building work,Build_Sector_Listed
93,93,Other services > Security > Cctv installation ...,Security_CCTV_Installation
94,94,Mechanical > Boilers > Gas-fired condensing bo...,Mechanical_Boilers_Condensing


We only need the annotation column, which was manually derived to enable us have unique column titles

In [33]:
df_Categories['Annotation']

0                               Build_Doors
1                      Build_Door_FireProof
2                        Build_Ext_Cladding
3                       Buidl_Ext_Rendering
4                 Build_Floor_CementScreeds
                      ...                  
91    Electrical_Power_Lighting-Maintenance
92                      Build_Sector_Listed
93               Security_CCTV_Installation
94            Mechanical_Boilers_Condensing
95               Electrical_LightingSystems
Name: Annotation, Length: 96, dtype: object

Retrieve list of values in Anotation column 

In [34]:
col_list = df_Categories.Annotation.values.tolist()
col_list 

['Build_Doors',
 'Build_Door_FireProof',
 'Build_Ext_Cladding',
 'Buidl_Ext_Rendering',
 'Build_Floor_CementScreeds',
 'Build_Flooring',
 'Build_Floor_Carpet',
 'Build_Floor_Tiles',
 'Build_Furniture_Fitting',
 'Build_Joinery',
 'Build_Fitout',
 'Build_Decorating',
 'Build_Glazing',
 'Build_PVCU',
 'Build_Ceiling',
 'Build_Partition',
 'Plastering',
 'Build_Tiling',
 'Build_Painting',
 'Build_Roofing',
 'Build_Roofing_Felt',
 'Build_Roofing_Slating',
 'Building_Type_Blockwork',
 'Building_Type_Concrete',
 'Building_Type_Demolition',
 'Building_Type_Scaffolding',
 'Building_Type_Clearance',
 'Building_Type_Steelwork',
 'Buiding_construct',
 'Building',
 'Building_Refurbishment',
 'Building_Improvements',
 'Electrical_Services',
 'Building_Contracting',
 'Building_Sector_Hospital',
 'Building_Sector_Industrial',
 'Building_Sector_Offices',
 'Building_Sector_Private',
 'Building_Sector_Schools',
 'Building_Sector_Social',
 'Building_Sector_Leisure',
 'Building_Sector_Occupied',
 'Civil_Dr

Map the values above to the respective column title (0-95). This gives more sensible description of the columns, rather than numbers which doesnt interprete anything

In [35]:
y_Col = y_Col.set_axis(df_Categories.Annotation.values.tolist(), axis=1, inplace=False)
y_Col

Unnamed: 0,Build_Doors,Build_Door_FireProof,Build_Ext_Cladding,Buidl_Ext_Rendering,Build_Floor_CementScreeds,Build_Flooring,Build_Floor_Carpet,Build_Floor_Tiles,Build_Furniture_Fitting,Build_Joinery,Build_Fitout,Build_Decorating,Build_Glazing,Build_PVCU,Build_Ceiling,Build_Partition,Plastering,Build_Tiling,Build_Painting,Build_Roofing,Build_Roofing_Felt,Build_Roofing_Slating,Building_Type_Blockwork,Building_Type_Concrete,Building_Type_Demolition,Building_Type_Scaffolding,Building_Type_Clearance,Building_Type_Steelwork,Buiding_construct,Building,Building_Refurbishment,Building_Improvements,Electrical_Services,Building_Contracting,Building_Sector_Hospital,Building_Sector_Industrial,Building_Sector_Offices,Building_Sector_Private,Building_Sector_Schools,Building_Sector_Social,Building_Sector_Leisure,Building_Sector_Occupied,Civil_Drain_Inspection,Civil_Drainage,Civil_Sewer,Civil_Renovation,Civil_Underground,Civil,Civil_Repairs,Civil_Earthwork,Civil_Fencing,Civil_GroundMaintenance,Civil_Landscapping,Civil_Road_Asphalt,Civil_Road_Kerbing,Civil_Road_Construction,Civil_Road_Maintenance,Civil_Road_Paving,Electrical_Comms,Electrical_Cabling,Electrical_Floodlighting,Electrical_FireAlarm,Electrical_Installation_Maintenance,Electrical_Installation_construct,Electrical_Services.1,Electrical_External_Install,Electrical_Internal_Install,Electrical_Power_Lighting,Electrical_Security_Intruder,Electrical_Testing_Appliance,Electrical_Testing_Install,Electrical_Testing_Contracts,Electrical_Boiler_Domestic,Mechanical_Equipment_Services,Mechanical_maintenance,Mechanical_HeatingVent_AC,Mechanical_HeatingVent_Boiler,Mechanical_HeatingVent_GasInstall,Mechanical_HeatingVent_CentInstallation,Mechanical_HeatingVent_CentMaintenance,Mechanical_HeatingVent_ServiceInstallation,Mechanical_HeatingVent_Ventilation,Mechanical_Access_ControlSys,Mechanical_Pipework,Mechanical_WaterTreatment_Plumbing,BuildingService_Maintenance_M-E,BuildingService_Maintenance_Fabric,Building_Type_Masonry,Electical_TempInstallation,Electrical_FireAlarm_Maintenance,Architectural_Design,Electrical_Power_Lighting-Maintenance,Build_Sector_Listed,Security_CCTV_Installation,Mechanical_Boilers_Condensing,Electrical_LightingSystems
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1803,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1804,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1805,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1806,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


A few test run to validate that the mapping was correctly done

In [36]:
Col_Building = y_Col['Building']
Col_Building

0       0
1       0
2       0
3       0
4       0
       ..
1803    0
1804    0
1805    0
1806    0
1807    0
Name: Building, Length: 1808, dtype: int64

The "Building" column was compared to the raw data and it validates our position

### Intergrating the table back

In [37]:
result = pd.concat([df_Description, y_Col], axis=1, join='inner')
display(result)

Unnamed: 0,CompanyID,Description,TargetWorkCatergoryIDs,Build_Doors,Build_Door_FireProof,Build_Ext_Cladding,Buidl_Ext_Rendering,Build_Floor_CementScreeds,Build_Flooring,Build_Floor_Carpet,Build_Floor_Tiles,Build_Furniture_Fitting,Build_Joinery,Build_Fitout,Build_Decorating,Build_Glazing,Build_PVCU,Build_Ceiling,Build_Partition,Plastering,Build_Tiling,Build_Painting,Build_Roofing,Build_Roofing_Felt,Build_Roofing_Slating,Building_Type_Blockwork,Building_Type_Concrete,Building_Type_Demolition,Building_Type_Scaffolding,Building_Type_Clearance,Building_Type_Steelwork,Buiding_construct,Building,Building_Refurbishment,Building_Improvements,Electrical_Services,Building_Contracting,Building_Sector_Hospital,Building_Sector_Industrial,Building_Sector_Offices,Building_Sector_Private,Building_Sector_Schools,Building_Sector_Social,Building_Sector_Leisure,Building_Sector_Occupied,Civil_Drain_Inspection,Civil_Drainage,Civil_Sewer,Civil_Renovation,Civil_Underground,Civil,Civil_Repairs,Civil_Earthwork,Civil_Fencing,Civil_GroundMaintenance,Civil_Landscapping,Civil_Road_Asphalt,Civil_Road_Kerbing,Civil_Road_Construction,Civil_Road_Maintenance,Civil_Road_Paving,Electrical_Comms,Electrical_Cabling,Electrical_Floodlighting,Electrical_FireAlarm,Electrical_Installation_Maintenance,Electrical_Installation_construct,Electrical_Services.1,Electrical_External_Install,Electrical_Internal_Install,Electrical_Power_Lighting,Electrical_Security_Intruder,Electrical_Testing_Appliance,Electrical_Testing_Install,Electrical_Testing_Contracts,Electrical_Boiler_Domestic,Mechanical_Equipment_Services,Mechanical_maintenance,Mechanical_HeatingVent_AC,Mechanical_HeatingVent_Boiler,Mechanical_HeatingVent_GasInstall,Mechanical_HeatingVent_CentInstallation,Mechanical_HeatingVent_CentMaintenance,Mechanical_HeatingVent_ServiceInstallation,Mechanical_HeatingVent_Ventilation,Mechanical_Access_ControlSys,Mechanical_Pipework,Mechanical_WaterTreatment_Plumbing,BuildingService_Maintenance_M-E,BuildingService_Maintenance_Fabric,Building_Type_Masonry,Electical_TempInstallation,Electrical_FireAlarm_Maintenance,Architectural_Design,Electrical_Power_Lighting-Maintenance,Build_Sector_Listed,Security_CCTV_Installation,Mechanical_Boilers_Condensing,Electrical_LightingSystems
0,1,Celebrating 40 years in business! Established ...,"[61, 62, 63, 70, 89, 91]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
1,2,High Point Construction is a well-established ...,[47],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,3,West Yorkshire Render & Flooring Ltd are a spe...,"[3, 4]",0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,We are access experts based in Manchester and ...,[25],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,5,Premier PAT Testing Ltd has established a well...,"[89, 91, 69, 71]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1803,1804,We are a highly adaptive electrical contractor...,[64],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1804,1805,Horizon Construction Firm has been around for ...,"[16, 3, 4]",0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1805,1806,The Smith family purchased Berkyn Manor Farm &...,"[50, 51, 52]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1806,1807,West Scotland Joinery Solutions is a joinery c...,"[9, 14, 15]",0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Checking the total number of times each categories appeared in all observations

In [38]:
k = result.iloc[:,3:].sum()
k

Build_Doors                              30
Build_Door_FireProof                     23
Build_Ext_Cladding                       28
Buidl_Ext_Rendering                      35
Build_Floor_CementScreeds                23
                                         ..
Electrical_Power_Lighting-Maintenance    46
Build_Sector_Listed                      34
Security_CCTV_Installation               49
Mechanical_Boilers_Condensing            22
Electrical_LightingSystems               43
Length: 96, dtype: int64

Comparing the total number of observation, to the total number of times all variables were repeated. This validates our reason for multi-label classification as total label output exceed total observation

In [39]:
rowsums=result.iloc[:,2:].sum(axis=0)
no_label_count = 0
for sum in rowsums.items():
    if sum==0:
        no_label_count +=1

print("Total number of articles = ",len(result))
print("Total number of articles without label = ",no_label_count)
print("Total labels = ",k.sum()) 

Total number of articles =  1808
Total number of articles without label =  0
Total labels =  5014


Check for missing values

In [40]:
print("Check for missing values in Train dataset")
print(result.isnull().sum().sum())

Check for missing values in Train dataset
0


Confirm data type of columns

In [41]:
result.dtypes

CompanyID                                 int64
Description                              object
TargetWorkCatergoryIDs                   object
Build_Doors                               int64
Build_Door_FireProof                      int64
                                          ...  
Electrical_Power_Lighting-Maintenance     int64
Build_Sector_Listed                       int64
Security_CCTV_Installation                int64
Mechanical_Boilers_Condensing             int64
Electrical_LightingSystems                int64
Length: 99, dtype: object

### Data Cleaning

Clean data, remove stopwords and do stemming

- Next we remove all the stop-words present in the comments using the default set of stop-words that can be downloaded from NLTK library. We also add few stop-words to the standard list. Stop words are basically a set of commonly used words in any language, not just English. The reason why stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead.

- Next we do stemming. There exist different kinds of stemming which basically transform words with roughly the same semantics to one standard form. For example, for amusing, amusement, and amused, the stem would be amus.

In [42]:
import nltk
import re 
nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [43]:
stop_words = set(stopwords.words('english'))

# function to remove stopwords
def remove_stopwords(Description):
    no_stopword_Description = [w for w in Description.split() if not w in stop_words]
    return ' '.join(no_stopword_Description)

result['Description'] = result['Description'].apply(lambda x: remove_stopwords(x))

In [44]:
stemmer = SnowballStemmer("english")
def stemming(sentence):
    stemSentence = ""
    for word in sentence.split():
        stem = stemmer.stem(word)
        stemSentence += stem
        stemSentence += " "
    stemSentence = stemSentence.strip()
    return stemSentence

result['Description'] = result['Description'].apply(stemming)

In [45]:
result['Description'][5]

'smith construct primari busi activ plan respons mainten trade oper divers public sector property. our expertis success within sector much attribut strong partner relationship develop 40 year work public sector organis local councils. we adopt partner approach long term client exist arrang success kpi demonstr approach. we promot earli contract involv prepar work schedul assist client priorities. we high skill stabl workforc vast experi work within multi-cultur divers community. smith construct carri mani general build work programm year includ plan mainten rang of; corpor buildings, schools, occupi domest properties, residenti care home well shelter hous scheme . our project rang general refurbish work new build across area work categori including; general building, adaptations, instal kitchen & bathroom , window & door , electr wire , plumb fixtur , gas servic , plaster job , glaze repairs/instal pre-paint repair paint & decor task tile roof floor joineri piec plus structur repair se

### Display Cleaned Data

In [46]:
result

Unnamed: 0,CompanyID,Description,TargetWorkCatergoryIDs,Build_Doors,Build_Door_FireProof,Build_Ext_Cladding,Buidl_Ext_Rendering,Build_Floor_CementScreeds,Build_Flooring,Build_Floor_Carpet,Build_Floor_Tiles,Build_Furniture_Fitting,Build_Joinery,Build_Fitout,Build_Decorating,Build_Glazing,Build_PVCU,Build_Ceiling,Build_Partition,Plastering,Build_Tiling,Build_Painting,Build_Roofing,Build_Roofing_Felt,Build_Roofing_Slating,Building_Type_Blockwork,Building_Type_Concrete,Building_Type_Demolition,Building_Type_Scaffolding,Building_Type_Clearance,Building_Type_Steelwork,Buiding_construct,Building,Building_Refurbishment,Building_Improvements,Electrical_Services,Building_Contracting,Building_Sector_Hospital,Building_Sector_Industrial,Building_Sector_Offices,Building_Sector_Private,Building_Sector_Schools,Building_Sector_Social,Building_Sector_Leisure,Building_Sector_Occupied,Civil_Drain_Inspection,Civil_Drainage,Civil_Sewer,Civil_Renovation,Civil_Underground,Civil,Civil_Repairs,Civil_Earthwork,Civil_Fencing,Civil_GroundMaintenance,Civil_Landscapping,Civil_Road_Asphalt,Civil_Road_Kerbing,Civil_Road_Construction,Civil_Road_Maintenance,Civil_Road_Paving,Electrical_Comms,Electrical_Cabling,Electrical_Floodlighting,Electrical_FireAlarm,Electrical_Installation_Maintenance,Electrical_Installation_construct,Electrical_Services.1,Electrical_External_Install,Electrical_Internal_Install,Electrical_Power_Lighting,Electrical_Security_Intruder,Electrical_Testing_Appliance,Electrical_Testing_Install,Electrical_Testing_Contracts,Electrical_Boiler_Domestic,Mechanical_Equipment_Services,Mechanical_maintenance,Mechanical_HeatingVent_AC,Mechanical_HeatingVent_Boiler,Mechanical_HeatingVent_GasInstall,Mechanical_HeatingVent_CentInstallation,Mechanical_HeatingVent_CentMaintenance,Mechanical_HeatingVent_ServiceInstallation,Mechanical_HeatingVent_Ventilation,Mechanical_Access_ControlSys,Mechanical_Pipework,Mechanical_WaterTreatment_Plumbing,BuildingService_Maintenance_M-E,BuildingService_Maintenance_Fabric,Building_Type_Masonry,Electical_TempInstallation,Electrical_FireAlarm_Maintenance,Architectural_Design,Electrical_Power_Lighting-Maintenance,Build_Sector_Listed,Security_CCTV_Installation,Mechanical_Boilers_Condensing,Electrical_LightingSystems
0,1,"celebr 40 year business! establish 1976, provi...","[61, 62, 63, 70, 89, 91]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
1,2,high point construct well-establish famili run...,[47],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,3,west yorkshir render & floor ltd specialist re...,"[3, 4]",0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,we access expert base manchest northampton. we...,[25],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,5,premier pat test ltd establish well-respect re...,"[89, 91, 69, 71]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1803,1804,we high adapt electr contractor commit provid ...,[64],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1804,1805,horizon construct firm around two decad provid...,"[16, 3, 4]",0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1805,1806,the smith famili purchas berkyn manor farm & m...,"[50, 51, 52]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1806,1807,west scotland joineri solut joineri contractor...,"[9, 14, 15]",0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Drop Irrelevant Columns

In [47]:
cleaned_result = result.drop(["CompanyID", "TargetWorkCatergoryIDs"], axis =1)
cleaned_result

Unnamed: 0,Description,Build_Doors,Build_Door_FireProof,Build_Ext_Cladding,Buidl_Ext_Rendering,Build_Floor_CementScreeds,Build_Flooring,Build_Floor_Carpet,Build_Floor_Tiles,Build_Furniture_Fitting,Build_Joinery,Build_Fitout,Build_Decorating,Build_Glazing,Build_PVCU,Build_Ceiling,Build_Partition,Plastering,Build_Tiling,Build_Painting,Build_Roofing,Build_Roofing_Felt,Build_Roofing_Slating,Building_Type_Blockwork,Building_Type_Concrete,Building_Type_Demolition,Building_Type_Scaffolding,Building_Type_Clearance,Building_Type_Steelwork,Buiding_construct,Building,Building_Refurbishment,Building_Improvements,Electrical_Services,Building_Contracting,Building_Sector_Hospital,Building_Sector_Industrial,Building_Sector_Offices,Building_Sector_Private,Building_Sector_Schools,Building_Sector_Social,Building_Sector_Leisure,Building_Sector_Occupied,Civil_Drain_Inspection,Civil_Drainage,Civil_Sewer,Civil_Renovation,Civil_Underground,Civil,Civil_Repairs,Civil_Earthwork,Civil_Fencing,Civil_GroundMaintenance,Civil_Landscapping,Civil_Road_Asphalt,Civil_Road_Kerbing,Civil_Road_Construction,Civil_Road_Maintenance,Civil_Road_Paving,Electrical_Comms,Electrical_Cabling,Electrical_Floodlighting,Electrical_FireAlarm,Electrical_Installation_Maintenance,Electrical_Installation_construct,Electrical_Services.1,Electrical_External_Install,Electrical_Internal_Install,Electrical_Power_Lighting,Electrical_Security_Intruder,Electrical_Testing_Appliance,Electrical_Testing_Install,Electrical_Testing_Contracts,Electrical_Boiler_Domestic,Mechanical_Equipment_Services,Mechanical_maintenance,Mechanical_HeatingVent_AC,Mechanical_HeatingVent_Boiler,Mechanical_HeatingVent_GasInstall,Mechanical_HeatingVent_CentInstallation,Mechanical_HeatingVent_CentMaintenance,Mechanical_HeatingVent_ServiceInstallation,Mechanical_HeatingVent_Ventilation,Mechanical_Access_ControlSys,Mechanical_Pipework,Mechanical_WaterTreatment_Plumbing,BuildingService_Maintenance_M-E,BuildingService_Maintenance_Fabric,Building_Type_Masonry,Electical_TempInstallation,Electrical_FireAlarm_Maintenance,Architectural_Design,Electrical_Power_Lighting-Maintenance,Build_Sector_Listed,Security_CCTV_Installation,Mechanical_Boilers_Condensing,Electrical_LightingSystems
0,"celebr 40 year business! establish 1976, provi...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
1,high point construct well-establish famili run...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,west yorkshir render & floor ltd specialist re...,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,we access expert base manchest northampton. we...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,premier pat test ltd establish well-respect re...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1803,we high adapt electr contractor commit provid ...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1804,horizon construct firm around two decad provid...,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1805,the smith famili purchas berkyn manor farm & m...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1806,west scotland joineri solut joineri contractor...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Separated Train and Test Data

In [48]:
train, test = train_test_split(cleaned_result,test_size=0.30, random_state=40, shuffle=True)

Confirm shape

In [49]:
print(train.shape)
print(test.shape)

(1265, 97)
(543, 97)


Identify feature

In [50]:
train_Description = train['Description']
test_Description = test['Description']

- After splitting the dataset into train & test sets, we want to summarize our comments and convert them into numerical vectors.
- One technique is to pick the most frequently occurring terms (words with high term frequency or tf). However, the most frequent word is a less useful metric since some words like ‘this’, ‘a’ occur very frequently across all documents.
- Hence, we also want a measure of how unique a word is i.e. how infrequently the word occurs across all documents (inverse document frequency or idf).
- So, the product of tf & idf (TF-IDF) of a word gives a product of how frequent this word is in the document multiplied by how unique the word is w.r.t. the entire corpus of documents.
- Words in the document with a high tfidf score occur frequently in the document and provide the most information about that specific document.
- TF-IDF is easy to compute but its disadvantage is that it does not capture position in text, semantics, co-occurrences in different documents, etc.

In [51]:
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
vectorizer.fit(train_Description)
vectorizer.fit(test_Description)

TfidfVectorizer(ngram_range=(1, 3), strip_accents='unicode')

In [52]:
x_train = vectorizer.transform(train_Description)
y_train = train.drop(labels = ['Description'], axis=1)

x_test = vectorizer.transform(test_Description)
y_test = test.drop(labels = ['Description'], axis=1)

### Binary Relevance
- For Binary Relevance an ensemble of single-label binary classifiers is trained, one for each class. Each classifier predicts either the membership or the non-membership of one class. The union of all classes that were predicted is taken as the multi-label output. This approach is popular because it is easy to implement, however it also ignores the possible correlations between class labels.
- In other words, if there’s q labels, the binary relevance method create q new data sets from the images, one for each label and train single-label classifiers on each new data set. One classifier may answer yes/no to the question “does it contain trees?”, thus the “binary” in “binary relevance”. This is a simple approach but does not work well when there’s dependencies between the labels.

In [53]:

# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
classifier = BinaryRelevance(GaussianNB())

# train
classifier.fit(x_train, y_train)

# predict
predictions = classifier.predict(x_test)

# accuracy
print('Accuracy = ', accuracy_score(y_test,predictions))
print('Hamming Loss is ', hamming_loss(y_test, predictions))

Accuracy =  0.001841620626151013
Hamming Loss is  0.031978974831184774


In [54]:
classifier = BinaryRelevance(LogisticRegression())

# train
classifier.fit(x_train, y_train)

# predict
predictions = classifier.predict(x_test)


from sklearn.metrics import accuracy_score
print('Accuracy = ', accuracy_score(y_test,predictions))
print('Hamming Loss is ', hamming_loss(y_test, predictions))

ValueError: ignored

The above error is a typical case in clasification problem, when the data is problematic, it affects the predictions, as radomn selection of test data might bring about otputs of same class, affecting the predictability

It is also an impact of class imbalance, more data to train and test with can help here

In [57]:
y_train

Unnamed: 0,Build_Doors,Build_Door_FireProof,Build_Ext_Cladding,Buidl_Ext_Rendering,Build_Floor_CementScreeds,Build_Flooring,Build_Floor_Carpet,Build_Floor_Tiles,Build_Furniture_Fitting,Build_Joinery,Build_Fitout,Build_Decorating,Build_Glazing,Build_PVCU,Build_Ceiling,Build_Partition,Plastering,Build_Tiling,Build_Painting,Build_Roofing,Build_Roofing_Felt,Build_Roofing_Slating,Building_Type_Blockwork,Building_Type_Concrete,Building_Type_Demolition,Building_Type_Scaffolding,Building_Type_Clearance,Building_Type_Steelwork,Buiding_construct,Building,Building_Refurbishment,Building_Improvements,Electrical_Services,Building_Contracting,Building_Sector_Hospital,Building_Sector_Industrial,Building_Sector_Offices,Building_Sector_Private,Building_Sector_Schools,Building_Sector_Social,Building_Sector_Leisure,Building_Sector_Occupied,Civil_Drain_Inspection,Civil_Drainage,Civil_Sewer,Civil_Renovation,Civil_Underground,Civil,Civil_Repairs,Civil_Earthwork,Civil_Fencing,Civil_GroundMaintenance,Civil_Landscapping,Civil_Road_Asphalt,Civil_Road_Kerbing,Civil_Road_Construction,Civil_Road_Maintenance,Civil_Road_Paving,Electrical_Comms,Electrical_Cabling,Electrical_Floodlighting,Electrical_FireAlarm,Electrical_Installation_Maintenance,Electrical_Installation_construct,Electrical_Services.1,Electrical_External_Install,Electrical_Internal_Install,Electrical_Power_Lighting,Electrical_Security_Intruder,Electrical_Testing_Appliance,Electrical_Testing_Install,Electrical_Testing_Contracts,Electrical_Boiler_Domestic,Mechanical_Equipment_Services,Mechanical_maintenance,Mechanical_HeatingVent_AC,Mechanical_HeatingVent_Boiler,Mechanical_HeatingVent_GasInstall,Mechanical_HeatingVent_CentInstallation,Mechanical_HeatingVent_CentMaintenance,Mechanical_HeatingVent_ServiceInstallation,Mechanical_HeatingVent_Ventilation,Mechanical_Access_ControlSys,Mechanical_Pipework,Mechanical_WaterTreatment_Plumbing,BuildingService_Maintenance_M-E,BuildingService_Maintenance_Fabric,Building_Type_Masonry,Electical_TempInstallation,Electrical_FireAlarm_Maintenance,Architectural_Design,Electrical_Power_Lighting-Maintenance,Build_Sector_Listed,Security_CCTV_Installation,Mechanical_Boilers_Condensing,Electrical_LightingSystems
1385,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
808,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1085,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
356,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1319,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1016,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
165,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
219,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [58]:
y_train_logit=y_train.drop(["Building", "Building_Refurbishment"], axis =1)
y_train_logit

Unnamed: 0,Build_Doors,Build_Door_FireProof,Build_Ext_Cladding,Buidl_Ext_Rendering,Build_Floor_CementScreeds,Build_Flooring,Build_Floor_Carpet,Build_Floor_Tiles,Build_Furniture_Fitting,Build_Joinery,Build_Fitout,Build_Decorating,Build_Glazing,Build_PVCU,Build_Ceiling,Build_Partition,Plastering,Build_Tiling,Build_Painting,Build_Roofing,Build_Roofing_Felt,Build_Roofing_Slating,Building_Type_Blockwork,Building_Type_Concrete,Building_Type_Demolition,Building_Type_Scaffolding,Building_Type_Clearance,Building_Type_Steelwork,Buiding_construct,Building_Improvements,Electrical_Services,Building_Contracting,Building_Sector_Hospital,Building_Sector_Industrial,Building_Sector_Offices,Building_Sector_Private,Building_Sector_Schools,Building_Sector_Social,Building_Sector_Leisure,Building_Sector_Occupied,Civil_Drain_Inspection,Civil_Drainage,Civil_Sewer,Civil_Renovation,Civil_Underground,Civil,Civil_Repairs,Civil_Earthwork,Civil_Fencing,Civil_GroundMaintenance,Civil_Landscapping,Civil_Road_Asphalt,Civil_Road_Kerbing,Civil_Road_Construction,Civil_Road_Maintenance,Civil_Road_Paving,Electrical_Comms,Electrical_Cabling,Electrical_Floodlighting,Electrical_FireAlarm,Electrical_Installation_Maintenance,Electrical_Installation_construct,Electrical_Services.1,Electrical_External_Install,Electrical_Internal_Install,Electrical_Power_Lighting,Electrical_Security_Intruder,Electrical_Testing_Appliance,Electrical_Testing_Install,Electrical_Testing_Contracts,Electrical_Boiler_Domestic,Mechanical_Equipment_Services,Mechanical_maintenance,Mechanical_HeatingVent_AC,Mechanical_HeatingVent_Boiler,Mechanical_HeatingVent_GasInstall,Mechanical_HeatingVent_CentInstallation,Mechanical_HeatingVent_CentMaintenance,Mechanical_HeatingVent_ServiceInstallation,Mechanical_HeatingVent_Ventilation,Mechanical_Access_ControlSys,Mechanical_Pipework,Mechanical_WaterTreatment_Plumbing,BuildingService_Maintenance_M-E,BuildingService_Maintenance_Fabric,Building_Type_Masonry,Electical_TempInstallation,Electrical_FireAlarm_Maintenance,Architectural_Design,Electrical_Power_Lighting-Maintenance,Build_Sector_Listed,Security_CCTV_Installation,Mechanical_Boilers_Condensing,Electrical_LightingSystems
1385,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
808,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1085,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
356,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1319,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1016,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
165,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
219,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [59]:
y_test_logit=y_test.drop(["Building", "Building_Refurbishment"], axis =1)
y_test_logit

Unnamed: 0,Build_Doors,Build_Door_FireProof,Build_Ext_Cladding,Buidl_Ext_Rendering,Build_Floor_CementScreeds,Build_Flooring,Build_Floor_Carpet,Build_Floor_Tiles,Build_Furniture_Fitting,Build_Joinery,Build_Fitout,Build_Decorating,Build_Glazing,Build_PVCU,Build_Ceiling,Build_Partition,Plastering,Build_Tiling,Build_Painting,Build_Roofing,Build_Roofing_Felt,Build_Roofing_Slating,Building_Type_Blockwork,Building_Type_Concrete,Building_Type_Demolition,Building_Type_Scaffolding,Building_Type_Clearance,Building_Type_Steelwork,Buiding_construct,Building_Improvements,Electrical_Services,Building_Contracting,Building_Sector_Hospital,Building_Sector_Industrial,Building_Sector_Offices,Building_Sector_Private,Building_Sector_Schools,Building_Sector_Social,Building_Sector_Leisure,Building_Sector_Occupied,Civil_Drain_Inspection,Civil_Drainage,Civil_Sewer,Civil_Renovation,Civil_Underground,Civil,Civil_Repairs,Civil_Earthwork,Civil_Fencing,Civil_GroundMaintenance,Civil_Landscapping,Civil_Road_Asphalt,Civil_Road_Kerbing,Civil_Road_Construction,Civil_Road_Maintenance,Civil_Road_Paving,Electrical_Comms,Electrical_Cabling,Electrical_Floodlighting,Electrical_FireAlarm,Electrical_Installation_Maintenance,Electrical_Installation_construct,Electrical_Services.1,Electrical_External_Install,Electrical_Internal_Install,Electrical_Power_Lighting,Electrical_Security_Intruder,Electrical_Testing_Appliance,Electrical_Testing_Install,Electrical_Testing_Contracts,Electrical_Boiler_Domestic,Mechanical_Equipment_Services,Mechanical_maintenance,Mechanical_HeatingVent_AC,Mechanical_HeatingVent_Boiler,Mechanical_HeatingVent_GasInstall,Mechanical_HeatingVent_CentInstallation,Mechanical_HeatingVent_CentMaintenance,Mechanical_HeatingVent_ServiceInstallation,Mechanical_HeatingVent_Ventilation,Mechanical_Access_ControlSys,Mechanical_Pipework,Mechanical_WaterTreatment_Plumbing,BuildingService_Maintenance_M-E,BuildingService_Maintenance_Fabric,Building_Type_Masonry,Electical_TempInstallation,Electrical_FireAlarm_Maintenance,Architectural_Design,Electrical_Power_Lighting-Maintenance,Build_Sector_Listed,Security_CCTV_Installation,Mechanical_Boilers_Condensing,Electrical_LightingSystems
1762,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
176,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
652,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1398,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1088,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
357,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
468,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
1058,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
215,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [60]:
classifier = BinaryRelevance(LogisticRegression())

# train
classifier.fit(x_train, y_train_logit)

# predict
predictions = classifier.predict(x_test)


from sklearn.metrics import accuracy_score
print('Accuracy = ', accuracy_score(y_test_logit,predictions))
print('Hamming Loss is ', hamming_loss(y_test, predictions))

Accuracy =  0.0


ValueError: ignored

The above error again evidences why the data is not sufficient enough for the project, splitting to balance the initial error led to a further error as shown above.

The above shows that given more time, a lot more can be achieved as discribed in the summary below

### Evaluation Metrics Used

- Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:  The accuracy of the model is very low
    
- Hamming loss is the fraction of wrong labels to the total number of labels. In multi-class classification, hamming loss is calculated as the hamming distance between y_true and y_pred . In multi-label classification, hamming loss penalizes only the individual labels.
- accuracy counts no of correctly classified data instance, Hamming Loss calculates loss generated in the bit string of class labels during prediction,

### Answers to questions

Discuss some of the limitations you found in this task?

- As touched on in various section of the solution above, there are a few observations worthy of note about the data set

    - Firstly, the dataset shows that not all the features have been utilised and this affected the logistic regression model. typically, logistic regression requires 2 possible output to be present in a target column, however, the Building and
Building_Refurbishment columns, given that it never appeared in the dataset, returnsed onlu "0" value, affecting the model and pushing out an error. This was corrected by removign the columns, however the model still returned an error as it is not suitable for multi-label classification
    - Secondly, the accuracy and harming loss are very low and typifies a poor model. Typically, In multi-label classification, a misclassification is no longer a hard wrong or right. A prediction containing a subset of the actual classes should be considered better than a prediction that contains none of them. 
    - The task is obviously not a recommender system activity, as it focuses on extracting categories of business from the provided details, rather follow any historical or inherent pattern as required for recommender system
    - The fact that it is a supervised learning approach removes any element of clustering.
    - The approach employed (Classification of recommend relevant work categories to customers based of their description) is closely related to topic modelling, as we are required to use description provided by clients, to predict the possible category of their activities, by extracting key topics from their details. This is however rulled out a the presented data has features and label, erasing any unsupervised learning approach
    - The eventual approach deployed model is multi-label classification. As a classification approach, the model helps to classify categories bassed on the corresponding inherent relationship with the description. Muli-label classification deiffers from binary and multi-class classification.
    - Binary refers to classification done with two possible outcomes, i.e yes/no. Multiclass refers to classification involving more than two possible outcomes.
    - However, while, binary and multi-class refers to possibility of outputing only one result per obseration, multi-label classification refers to the case of possibly outputing one or more results for every observation, as it is in our case
    - Spot checks on the data suggests that the collection process is a bit faulty. taking the first observation for instance, many of the key words in the description aligns with a lot of categories that were not alloted to this section. This inaccurate anottation of data affects the model
    - Time constrain was a huge limiting factor to exploring different other models asides the 2 used
- Given more time, I will spend more time in cleaning the data and presenting better annotation
- With more time, I will explore using other models like OneVsRest, Label Powerset, and Adapted Algorithms
- With more time I will explore deep learning application using LSTM, GRU etc and attemp extensive prunning to ensure improved model performance