# Problem

The dataset provided corresponds to a set of companies identified by their `company_id`. They come from different groups, which you can consider as sectors or industries, identified by the field `group_id`.

We are interested in knowing which companies are consultancies (in a very broad sense). To do so, we labelled the companies in the field `is_consultancy` as `1` if the company is a consultancy and `0` otherwise.

We also have a previous tool that predicts the consultancy status. It has been applied to this dataset and its output is stored in the field `predicted_consultancy` (`1` meaning that the company is predicted to be a consultancy).

Finally, the field `text` contains text extracted from each company's website.

The objective of this challenge is to build a solution to detect the companies that are consultancies. The main points that will be assessed are your analysis and understanding of the dataset, your approach to solving the problem and how you evaluate the performance of your solution.

Note that this is not a Kaggle competition and you will not be assessed on the performance of your solution. We are mindful of your own time and do not expect you to dedicate too much of it to this challenge.

🍀 Good luck!

In [1]:
import pandas as pd
df = pd.read_csv("nlp_data_scientist_challenge_dataset.csv")
df.head(20)

Unnamed: 0,company_id,group_id,is_consultancy,predicted_consultancy,text
0,d2d78cb1fe31d2c8fb7cc84bb1cab3ef,654,0,0.0,Legal Disclaimer MICRO-ALGAE PRODUCTION FACILI...
1,defd325d9bfa59e9843c23da0b0c2d8c,414,0,0.0,Home What We Do Technology Products EktoTherix...
2,a29dda8947c86bd7206137744fa201ef,414,0,0.0,top of page HOME ABOUT ADVISORS NEWSLETTER IMP...
3,e99d0615faa5a5405519c3c5dedc24ab,521,1,0.0,Skip to content Ayruz Data Marketing: Data Dri...
4,ee1951864f9ed6812b5350ccb04ef83b,654,0,0.0,top of page HOME OUR STORY PRODUCTS SUPERPLANT...
5,4f8e7202dea25498b0b4bfe671f27816,414,0,0.0,Login Register Item ID Product Name Search Tog...
6,059093b92dd58eb0e8b9108bee5c2fa4,647,0,0.0,Skip to content Oyster Technologies MENU MENU ...
7,2e36fb297fd572bb833261cb8cf330f4,521,1,0.0,About Services Cloud Quality Assurance and Aut...
8,247dffbff111e47ed4d7c5abeeb7b004,654,0,1.0,"Skip to content IMPACT Agronomics, Inc. Making..."
9,c7e5a48bc256d8c7d81caf80404c6809,668,0,0.0,Achiko About Products Aptamex Teman Sehat Team...


In [2]:
df["is_consultancy"].value_counts() # moderately imbalanced problem: about 27% positives

0    576
1    216
Name: is_consultancy, dtype: int64

In [3]:
df["text"][3]

"Skip to content Ayruz Data Marketing: Data Driven Digital Marketing and Analytics Agency Home About Us Products & Services SERVICES SERVICES DIGITAL STRATEGY CONSUMER JOURNEY MAPPING DEVELOP COHESIVE STRATEGY CREATE DIGITAL EXPERIENCE GDPR COMPLIANCE CONSULTING DIGITAL MARKETING Paid Search & Display Retargeting SEO & Content Marketing Email Social Media MANAGE CHANNEL E-Learning Webinars Lead Management Segmentation MARKETING-AUTOMATION Digital Analytics Google Analytics Tag Management Business Analytics Conversion Rate Optimization GOOGLE TAG MANAGERCLIENT-SIDE TAGGING GOOGLE TAG MANAGERSERVER-SIDE TAGGING GOOGLE UNIVERSAL ANALYTICS MIGRATION TO GA4 PRODUCTS PRODUCTS DYNAQR Engage your customers and build loyality with ready-to-use QR solutions. Insights Managed Remote Team Contact Us +1 267 908 9290 Philadelphia, USA +91 98468 31128 Trivandrum, INDIA +91 99471 06111 Kochi, INDIA 3 rd PARTY COOKIES ARE DEAD: HOW TO MARKET AND MEASURE IN THE COOKIE-LESS WORLD LEARN MORE FREE WHITE PA

Basic analysis:
- 792 entries, of which 576 (73%) are class 0 and 216 (27%) are class 1
- The language seems to be English
- Lots of numbers and special characters; the text is sometimes not very meaningful, and paragraph breaks have been stripped

# Steps to reproduce
# 1. Preprocessing
## 1.1 Use transformer embeddings and perform classification
## 1.2 Clean the dataset
- remove words that contain numbers or special characters
- set everything to lowercase
- remove stopwords and some other grammatical categories
- reduce words to their root form (lemma), except for proper nouns
- apply TF-IDF to the result
- package all this preprocessing into a custom preprocessor sklearn style
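
The last bullet could look roughly like the sketch below. `SimpleTextCleaner` is a hypothetical name, and its regex-only cleaning (lowercasing, dropping tokens that contain digits or special characters, removing short words) is a simplified stand-in for the spaCy-based steps listed above. It is not the preprocessor used later in this notebook.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class SimpleTextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical sklearn-style cleaner: lowercase, keep purely alphabetic words."""

    def __init__(self, min_word_length=3):
        self.min_word_length = min_word_length

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data.
        return self

    def transform(self, X):
        return [self._clean(doc) for doc in X]

    def _clean(self, doc):
        # Keep only whitespace-separated tokens made purely of letters.
        kept = [w for w in doc.lower().split() if re.fullmatch(r"[a-z]+", w)]
        return " ".join(w for w in kept if len(w) >= self.min_word_length)


# The cleaner plugs into a Pipeline like any other sklearn step.
cleaner_pipeline = Pipeline([
    ("clean", SimpleTextCleaner()),
    ("tfidf", TfidfVectorizer()),
])

docs = ["Digital STRATEGY consulting +1 267 908", "GA4 migration & tag management"]
cleaned = SimpleTextCleaner().transform(docs)
features = cleaner_pipeline.fit_transform(docs)
print(cleaned)
print(features.shape)
```

Wrapping the cleaning in a transformer means it is re-applied inside each cross-validation fold together with the vectorizer, and its parameters (here `min_word_length`) become tunable through a grid search.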
## 1.3 Try word-based classifiers?
- NLTK or spaCy methods with Bayesian inference or LDA?
- Other classic methods like logistic regression

# Implement cross-validation

# Good sources of information
[sklearn guidelines for text](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

In [4]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [5]:
! pip install -U spacy
! pip install transformers[sentencepiece]
! pip install -U pip setuptools wheel
! python -m spacy download en_core_web_lg
! pip install inflect nltk swifter


In [None]:
import nltk
nltk.download('words')

## Testing document conversion to vectors

In [7]:
import spacy
from thinc.api import set_gpu_allocator, require_gpu

set_gpu_allocator("pytorch")
# require_gpu(0)
nlp = spacy.load("en_core_web_lg")
tokens = nlp("whatever this is a complicated text")
tokens.vector.shape

(300,)

## Preprocessing

In [8]:
import numpy as np
import re
from string import punctuation
import inflect
from nltk.corpus import words

singular_converter = inflect.engine()
# Build the vocabulary once as a set: calling words.words() for every token
# rebuilds the whole word list and scans it linearly, which is very slow.
english_vocab = set(words.words())

def custom_singularize(spacy_lemma:str, spacy_tag:str)->str:
    # "NNS"/"NNPS" are fine-grained Penn Treebank tags (token.tag_), not coarse
    # POS labels (token.pos_), so the caller must pass token.tag_.
    if spacy_tag in ["NNS", "NNPS"]:
        # singular_noun() returns False when the word is already singular
        return singular_converter.singular_noun(spacy_lemma) or spacy_lemma
    return spacy_lemma

def basic_preprocessor(doc:str)->str:
    """
    Converts the text to a more readable format and keeps only words that make sense.
    Word order and exact context do not matter for now, because spaCy's document
    vector is simply the average of the token vectors.
    """
    # Remove punctuation and set everything to lowercase
    doc = doc.lower()
    doc = re.sub(rf"[{re.escape(punctuation)}]", '', doc)

    # Tokenize with spaCy; stray whitespace and leftover symbols are filtered out below
    tokens = nlp(doc)

    # Keep only lemmas, after:
    # - dropping unwanted grammatical categories (coarse POS) and words that are too short (like pronouns)
    # - dropping default English stopwords
    # - keeping only words found in an English dictionary
    # - converting plural nouns to their singular form
    # Repeated words are kept at this stage so that TF-IDF can take word frequency into account

    unwanted_grammatical_categories = ["NUM", "SYM", "PUNCT", "CCONJ", "ADP", "DET", "PRON"]
    minimal_word_length = 2
    lemmas = np.sort([custom_singularize(token.lemma_, token.tag_) for token in tokens
                        if token.pos_ not in unwanted_grammatical_categories
                        and len(token.lemma_) > minimal_word_length
                        and not token.is_stop
                        and token.text in english_vocab
                        ])
    return " ".join(lemmas)

sample_text = df["text"][3]
sample_text, basic_preprocessor(sample_text)

("Skip to content Ayruz Data Marketing: Data Driven Digital Marketing and Analytics Agency Home About Us Products & Services SERVICES SERVICES DIGITAL STRATEGY CONSUMER JOURNEY MAPPING DEVELOP COHESIVE STRATEGY CREATE DIGITAL EXPERIENCE GDPR COMPLIANCE CONSULTING DIGITAL MARKETING Paid Search & Display Retargeting SEO & Content Marketing Email Social Media MANAGE CHANNEL E-Learning Webinars Lead Management Segmentation MARKETING-AUTOMATION Digital Analytics Google Analytics Tag Management Business Analytics Conversion Rate Optimization GOOGLE TAG MANAGERCLIENT-SIDE TAGGING GOOGLE TAG MANAGERSERVER-SIDE TAGGING GOOGLE UNIVERSAL ANALYTICS MIGRATION TO GA4 PRODUCTS PRODUCTS DYNAQR Engage your customers and build loyality with ready-to-use QR solutions. Insights Managed Remote Team Contact Us +1 267 908 9290 Philadelphia, USA +91 98468 31128 Trivandrum, INDIA +91 99471 06111 Kochi, INDIA 3 rd PARTY COOKIES ARE DEAD: HOW TO MARKET AND MEASURE IN THE COOKIE-LESS WORLD LEARN MORE FREE WHITE P

In [9]:
from tqdm.notebook import tqdm
from swifter import swifter
tqdm.pandas()
# df["preprocessed"] = df["text"].swifter.allow_dask_on_strings(enable=True).force_parallel(enable=True).progress_bar(True).apply(basic_preprocessor)
# df["preprocessed"] = df["text"].progress_apply(basic_preprocessor) # Too slow on a single process

In [10]:
from tqdm.contrib.concurrent import process_map
r = process_map(basic_preprocessor, list(df["text"]), max_workers=8)
df["preprocessed"] = r
df.to_csv("preprocessed_df.csv") # back up the result: preprocessing is slow (not optimized for batch execution) and Colab may crash

  0%|          | 0/792 [00:00<?, ?it/s]

## Using Vectorization with basic prediction model

In [71]:
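# apply nlp(...).vector to each preprocessed text, and get the result as a matrix with 300 columns

def spacy_vectorizer(preprocessed_text:str)->np.ndarray:
    # Doc.vector is the average of the token vectors (300 dimensions for en_core_web_lg)
    t = nlp(preprocessed_text).vector
    return np.array(t)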

r = process_map(spacy_vectorizer, list(df["preprocessed"]), max_workers=8)

  0%|          | 0/792 [00:00<?, ?it/s]

In [79]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X = np.array(r)
y = df.is_consultancy

clf = SGDClassifier(class_weight="balanced", loss='log_loss', penalty='l2', alpha=1e-3, random_state=42, max_iter=100, tol=None)
clf.fit(X,y)
cross_val_f1 = cross_val_score(clf, X, y, cv=5, scoring='f1').mean()
cross_val_precision = cross_val_score(clf, X, y, cv=5, scoring='precision').mean()
cross_val_recall = cross_val_score(clf, X, y, cv=5, scoring='recall').mean()
cross_val_auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
cross_val_auc, cross_val_f1, cross_val_precision, cross_val_recall

(0.8287726115800449,
 0.6014940052444674,
 0.6480588014887718,
 0.5891120507399576)

## Using TF-IDF features with an sklearn linear classifier

In [65]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

clf = SGDClassifier(class_weight="balanced", loss='log_loss', penalty='l2', alpha=1e-3, random_state=42, max_iter=100, tol=None)

text_clf = Pipeline([('tfidf', TfidfVectorizer(max_features=2000, min_df=.0004, max_df=.9)),('clf', clf)])
text_clf.fit(df.preprocessed, df.is_consultancy)
cross_val_f1 = cross_val_score(text_clf, df.preprocessed, df.is_consultancy, cv=5, scoring='f1').mean()
cross_val_precision = cross_val_score(text_clf, df.preprocessed, df.is_consultancy, cv=5, scoring='precision').mean()
cross_val_recall = cross_val_score(text_clf, df.preprocessed, df.is_consultancy, cv=5, scoring='recall').mean()
cross_val_auc = cross_val_score(text_clf, df.preprocessed, df.is_consultancy, cv=5, scoring='roc_auc').mean()
cross_val_auc, cross_val_f1, cross_val_precision, cross_val_recall

(0.8701173409067137,
 0.6816206846742742,
 0.6482909303732105,
 0.7268498942917547)

In [56]:
# Comparing this to the original scorer
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
mask = ~df.predicted_consultancy.isna()
original_f1 = f1_score(df.is_consultancy[mask], df.predicted_consultancy[mask])
original_precision = precision_score(df.is_consultancy[mask], df.predicted_consultancy[mask])
original_recall = recall_score(df.is_consultancy[mask], df.predicted_consultancy[mask])
original_auc = roc_auc_score(df.is_consultancy[mask], df.predicted_consultancy[mask])
original_auc, original_f1, original_precision, original_recall

(0.6507796038769489,
 0.48179271708683474,
 0.5850340136054422,
 0.4095238095238095)
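
One caveat when reading the first number: `predicted_consultancy` contains hard 0/1 labels, not scores. In that case `roc_auc_score` reduces to the mean of sensitivity and specificity, i.e. balanced accuracy. With the recall of 0.41 reported above, an AUC of 0.65 therefore implies a specificity of about 0.89. The sketch below checks the identity on made-up labels (illustrative values only, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score, roc_auc_score

# Made-up ground truth and hard 0/1 predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])

auc_on_labels = roc_auc_score(y_true, y_pred)
balanced_acc = balanced_accuracy_score(y_true, y_pred)
sensitivity = recall_score(y_true, y_pred)               # recall on class 1
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall on class 0

print(auc_on_labels, balanced_acc, (sensitivity + specificity) / 2)
```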

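So we considerably improved on the original tool: AUC rises from 0.65 to 0.87 (cross-validated), F1 from 0.48 to 0.68, and recall from 0.41 to 0.73, with precision moving from 0.59 to 0.65. An AUC of 0.87 is rather good. The comparison is only indicative: the original tool is scored on its hard 0/1 predictions, and only on the rows where a prediction is available.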

## Basic hyperparameter tuning

In [70]:
from sklearn.model_selection import GridSearchCV
parameters = {"tfidf__max_features": [2000],
              "clf__alpha": [1e-2],
              "clf__loss": ["modified_huber", "log_loss", "perceptron"],
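              "clf__epsilon": (1e-2, 1e-1),  # only used by the huber/epsilon-insensitive losses, so it has no effect on the losses above
              "clf__penalty":["l2", "l1"],
              "clf__tol":[1e-4, 1e-3],
              "clf__power_t":[.5, 1.]  # only used with learning_rate="invscaling", so it has no effect with the default schedule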
              }
gs_clf = GridSearchCV(text_clf, parameters, scoring="roc_auc" , cv=5, n_jobs=-1)
gs_clf = gs_clf.fit(df.preprocessed, df.is_consultancy)
pd.DataFrame(gs_clf.cv_results_)[["mean_test_score", "params", "rank_test_score"]].sort_values(by=["rank_test_score"]).head(10)

Unnamed: 0,mean_test_score,params,rank_test_score
33,0.871283,"{'clf__alpha': 0.01, 'clf__epsilon': 0.1, 'clf...",1
35,0.871283,"{'clf__alpha': 0.01, 'clf__epsilon': 0.1, 'clf...",1
11,0.871283,"{'clf__alpha': 0.01, 'clf__epsilon': 0.01, 'cl...",1
9,0.871283,"{'clf__alpha': 0.01, 'clf__epsilon': 0.01, 'cl...",1
34,0.870878,"{'clf__alpha': 0.01, 'clf__epsilon': 0.1, 'clf...",5
32,0.870878,"{'clf__alpha': 0.01, 'clf__epsilon': 0.1, 'clf...",5
10,0.870878,"{'clf__alpha': 0.01, 'clf__epsilon': 0.01, 'cl...",5
8,0.870878,"{'clf__alpha': 0.01, 'clf__epsilon': 0.01, 'cl...",5
25,0.868664,"{'clf__alpha': 0.01, 'clf__epsilon': 0.1, 'clf...",9
27,0.868664,"{'clf__alpha': 0.01, 'clf__epsilon': 0.1, 'clf...",9


In [96]:
# vocab = gs_clf.best_estimator_.named_steps["tfidf"].vocabulary_
# tfidf = gs_clf.best_estimator_.named_steps["tfidf"]
# tfidf_matrix = tfidf.transform(df.preprocessed).toarray()
# reverse_vocab = {v:k for k,v in vocab.items()}
# feature_names = tfidf.get_feature_names_out()
# df_tfidf = pd.DataFrame(tfidf_matrix, columns = feature_names)
# idx = tfidf_matrix.argsort(axis=1)
# tfidf_max2 = idx[:,-2:]
# print ([[(reverse_vocab.get(item), tfidf_matrix[i, item])  for item in row] for i, row in enumerate(tfidf_max2) ])

[[('composition', 0.2686161000656953), ('algae', 0.4280576287446159)], [('wound', 0.4216152240667197), ('repair', 0.5351848102104416)], [('medical', 0.3111613837282349), ('doctor', 0.38800760035489684)], [('marketing', 0.40749097955776276), ('digital', 0.4377863366325991)], [('plant', 0.32013149452922657), ('organic', 0.4669565350843272)], [('medical', 0.45700536125146446), ('equipment', 0.4804442313704832)], [('isolation', 0.31322855616332873), ('security', 0.6073366424097489)], [('delivery', 0.2715984358388708), ('agile', 0.4790484277763655)], [('research', 0.30931681395795013), ('production', 0.32619230614005645)], [('class', 0.25755164301646494), ('investor', 0.27656928338745207)], [('explore', 0.313048660798688), ('datum', 0.3759419300823129)], [('aid', 0.44052136429470123), ('water', 0.5744899520442677)], [('coverage', 0.2661214161482182), ('testing', 0.6347105182970897)], [('digital', 0.28699988247960506), ('sap', 0.45849654474198054)], [('digital', 0.34257230859721144), ('marke

## Conclusion

It took approximately 2.5 hours:
- 10 min to review the data and check it
- 45 min to review possibilities for text preprocessing and re-study NLTK/spaCy
- 45 min to implement the text preprocessing
- 20 min to organize the hyperparameter tuning and sklearn pipeline
- 30 min to wrap things up

We have a model with AUC = 0.87. With the log-loss classifier, which outputs class probabilities, the decision threshold can then be tweaked to trade precision against recall as needed.

In the end, on so little data, "traditional" data science (TF-IDF with a linear classifier, AUC 0.87) worked better than generic embeddings (averaged spaCy vectors, AUC 0.83).