## Competitor Analysis

The purpose of this notebook is to extract the key competitor companies and products of a target company (Snowflake) from proprietary expert interviews. There are two tasks:
1. Extract company and product names from expert interview transcripts.
2. Classify whether each company or product is a competitor to the target company.

Desired output:
* A list of products and companies, each flagged as a competitor to Snowflake or not, ranked in order of importance

In [2]:
import json
import logging
import numpy as np
from pprint import pprint
from tqdm import tqdm
import warnings

import pandas as pd

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    BartForSequenceClassification,
    BartTokenizer
)
from transformers import pipeline

warnings.filterwarnings("ignore")
logging.disable(logging.WARNING)



### Load Data

In [3]:
# Read JSON file
with open('../data/ml_exercise_conversation_cleaned.json', 'r') as json_file:
    interview = json.load(json_file)

# Pretty print the JSON data
pprint(interview)

{'utterances': [{'paragraphs': ['So currently researching the data analytics '
                                "space. So I have a bunch of questions I'd "
                                'love to go through. And I have a bit of your '
                                'background, but it would be helpful if we '
                                'could just spend a second with you telling '
                                'you about your background. And also, I guess, '
                                "yes, I don't know if there's something you "
                                'are doing currently or just if we want to '
                                'talk about Verve for this conversation? And '
                                'if so, like what Verve does?'],
                 'speaker': 'Tegus Client'},
                {'paragraphs': ['Sure. Well so if you have my resume, Verve '
                                'had an exit in January, where the company '
                             

## I. Named Entity Recognition

Pull out named entities from the interview using a pretrained NER model from Hugging Face, keeping only organization (`ORG`) entities with a confidence score above 0.75.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# aggregation_strategy="max" merges sub-word tokens into whole words;
# each word takes the entity label of its highest-scoring token.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")

In [4]:
entities = set()

for thought in tqdm(interview['utterances']):
    _thought = thought['paragraphs']
    for paragraph in _thought:
        ner_results = nlp(paragraph)
        orgs = [org['word'] for org in ner_results if org['entity_group']=='ORG' and org['score'] > 0.75]
        entities.update(orgs)

print(entities)
print(len(entities))
        


100%|██████████| 61/61 [00:31<00:00,  1.93it/s]

{'MapReduce', 'Dynamo', 'MongoDB', 'LA CTO', 'Snowflakes', 'Redshift', 'Azure', 'Google', 'Google Cloud', 'Oracle', 'Amazon', 'Excel', 'Snowflake', 'UI', 'EMR', 'Rubicon', 'Apache', 'Verve', 'Cassandra', 'Elastic', 'Couchbase', 'Gartner', 'Databricks'}
23
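
To make the filter in the loop above concrete: with an aggregation strategy set, the pipeline returns one dictionary per detected entity, with keys including `entity_group`, `score`, and `word`. The sketch below applies the same `ORG` and score filter to hand-written results. The `toy_ner_results` list and its scores are made up for illustration and are not model output.

```python
# Hand-written stand-in for the list returned by nlp(paragraph); values are illustrative only.
toy_ner_results = [
    {"entity_group": "ORG", "score": 0.99, "word": "Snowflake", "start": 0, "end": 9},
    {"entity_group": "ORG", "score": 0.62, "word": "Lake", "start": 20, "end": 24},
    {"entity_group": "PER", "score": 0.97, "word": "Alice", "start": 30, "end": 35},
    {"entity_group": "ORG", "score": 0.91, "word": "Redshift", "start": 40, "end": 48},
]

min_score = 0.75  # same cutoff as the notebook cell

# Keep only organizations whose score clears the cutoff; a set removes repeats.
toy_orgs = {
    result["word"]
    for result in toy_ner_results
    if result["entity_group"] == "ORG" and result["score"] > min_score
}
print(sorted(toy_orgs))
```

As the real output above shows, a score cutoff alone does not remove strings such as 'UI' or 'LA CTO', so the extracted list may still need a manual review.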





## II. Competitor Classification 

Now that we have our list of products and companies, we want to classify whether they are competitors to the target company (Snowflake).

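Since we only have one interview to work with, I rely on a pretrained model rather than training one. Each batch of interview text is posed as an NLI premise, and the hypothesis is a statement that the entity and Snowflake are competitors. The model has a maximum input length (1,024 tokens here), so I split the interview into batches and pass each batch through the model in turn.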

In [5]:
# load model pretrained on MNLI (Multi-Genre Natural Language Inference)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

In [6]:
def classify_competitor(batched_lists, possible_competitor, target_org):
    """
    Score a possible competitor company or product as a competitor to `target_org`, based on the
    interview context, using a pretrained sequence classification (NLI) model.
    Returns the highest entailment probability (in percent) across all batches.
    """
    competitor_probability = []
    for batch in batched_lists:
        # pose the batch as an NLI premise
        premise = " ".join(batch)

        hypothesis = f'{possible_competitor} and {target_org} are competitors.'

        # run the premise/hypothesis pair through the model pre-trained on MNLI to get classification logits
        input_ids = tokenizer.encode(premise, hypothesis, return_tensors='pt', max_length=1024, truncation=True)
        logits = model(input_ids)[0]  # logits - activations from final layer of the model

        # we throw away "neutral" (dim 1) and take the probability of "entailment" (2) as the probability
        # of the label being true
        entail_contradiction_logits = logits[:,[0,2]]
        probs = entail_contradiction_logits.softmax(dim=1)  # softmax over the contradiction and entailment logits
        true_prob = probs[:,1].item() * 100
        competitor_probability.append(true_prob)

    return max(competitor_probability)
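
The scoring step inside `classify_competitor` can be checked by hand. The sketch below is a plain-Python illustration, assuming the label order contradiction (0), neutral (1), entailment (2) that the indexing in the function relies on. The logits are made-up numbers, and `entailment_probability` is a hypothetical helper, not part of the notebook.

```python
import math

def entailment_probability(logits):
    """Drop the neutral logit, then softmax over contradiction vs. entailment.

    `logits` is assumed to be ordered [contradiction, neutral, entailment].
    Returns the entailment probability as a percentage.
    """
    contradiction, _neutral, entailment = logits
    shift = max(contradiction, entailment)  # subtract the max for numerical stability
    exp_c = math.exp(contradiction - shift)
    exp_e = math.exp(entailment - shift)
    return 100 * exp_e / (exp_c + exp_e)

# One made-up logit triple per interview batch.
toy_logits_per_batch = [
    [2.0, 0.5, -1.0],   # leans contradiction
    [0.0, 3.0, 0.0],    # mostly neutral: the two remaining logits tie
    [-1.0, 0.2, 2.0],   # leans entailment
]

per_batch = [entailment_probability(logits) for logits in toy_logits_per_batch]
score = max(per_batch)  # same aggregation as classify_competitor

for prob in per_batch:
    print(f"{prob:0.2f}%")
print(f"final score: {score:0.2f}%")
```

Two things stand out in this toy case. A batch that is mostly neutral still yields 50% once the neutral logit is discarded, and taking the maximum means a single strongly entailing batch sets the final score. Both effects may contribute to most entities scoring above 80% in the results.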

In [7]:
# Keep only utterances that mention Snowflake, to tease out each entity's relationship to Snowflake
context = [thought['paragraphs'] for thought in interview['utterances']]

sf_context = []
for chunk in context:
    concat_chunk = " ".join(chunk)
    if "Snowflake" in concat_chunk:
        sf_context.append(concat_chunk)
len(sf_context)

31

In [8]:
# split the text into batches to avoid max token length issues
batch_size = 4
batched_lists = [sf_context[i:i + batch_size] for i in range(0, len(sf_context), batch_size)]
len(batched_lists)

8
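
A fixed `batch_size` of 4 utterances does not guarantee that every batch fits within the model's token limit, since utterances vary in length. One possible alternative is to group paragraphs greedily against a token budget. The sketch below is a minimal illustration under stated assumptions: `batch_by_token_budget` is a hypothetical helper, and whitespace word counts stand in for the real tokenizer's token counts.

```python
def batch_by_token_budget(paragraphs, max_tokens, count_tokens=None):
    """Greedily group consecutive paragraphs so each batch stays within max_tokens.

    `count_tokens` is any callable returning a token count for a string.
    It defaults to a whitespace word count, which is only a rough stand-in.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    batches, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n_tokens = count_tokens(paragraph)
        # Start a new batch when adding this paragraph would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            batches.append(current)
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_tokens
    if current:
        batches.append(current)
    return batches

toy_paragraphs = ["a b c", "d e", "f g h i", "j"]
toy_batches = batch_by_token_budget(toy_paragraphs, max_tokens=5)
print(toy_batches)
```

A paragraph that is longer than the budget on its own still ends up in a batch by itself, so truncation would still apply to it. The budget would also need to leave room for the hypothesis and special tokens.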

In [10]:
# Alternative batching method: tokenize the full context once, then split it into fixed-size token chunks
tokens = tokenizer.encode_plus(" ".join(sf_context), add_special_tokens=False, return_tensors='pt')
input_id_chunks = tokens['input_ids'][0].split(1022)  # 1022 leaves room for the start and end special tokens added later
mask_chunks = tokens['attention_mask'][0].split(1022)

for tensor in input_id_chunks:
    print(len(tensor))

# Next step (not done here): add special tokens, pad, stack the chunks, and run them through the model to generate probabilities

1022
1022
1022
1022
1022
1022
162
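
The remaining step for this chunking approach, adding special tokens and padding so the chunks can be stacked, can be sketched with plain lists. This is a hedged illustration, not the notebook's implementation: `frame_chunks` is a hypothetical helper, and the special-token IDs are passed in as parameters (with a real tokenizer they would come from `tokenizer.bos_token_id`, `tokenizer.eos_token_id`, and `tokenizer.pad_token_id`). The toy IDs below are arbitrary.

```python
def frame_chunks(token_ids, chunk_size, bos_id, eos_id, pad_id):
    """Split token_ids into chunks, wrap each in start/end tokens, and pad to equal length.

    Returns (input_ids, attention_masks), two lists of equal-length lists,
    each of length chunk_size + 2.
    """
    framed_ids, framed_masks = [], []
    for start in range(0, len(token_ids), chunk_size):
        chunk = [bos_id] + token_ids[start:start + chunk_size] + [eos_id]
        mask = [1] * len(chunk)                # real tokens get attention
        padding = chunk_size + 2 - len(chunk)  # only the last chunk can be short
        framed_ids.append(chunk + [pad_id] * padding)
        framed_masks.append(mask + [0] * padding)  # padding is masked out
    return framed_ids, framed_masks

toy_ids = list(range(10, 17))  # 7 arbitrary token IDs
ids, masks = frame_chunks(toy_ids, chunk_size=3, bos_id=0, eos_id=2, pad_id=1)
for row_ids, row_mask in zip(ids, masks):
    print(row_ids, row_mask)
```

Because every row has the same length, the rows could then be stacked into a single tensor. For the NLI setup used here, each chunk would also need to be shorter than 1022 tokens so the hypothesis fits within the 1,024-token limit.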


In [10]:
# remove the target company (including variants such as 'Snowflakes') from the entity list
filtered_entities = {item for item in entities if not item.startswith("Snowflake")}

In [11]:
results = {}
for org in tqdm(filtered_entities):
    competitor_probability = classify_competitor(batched_lists=batched_lists,
                                                 possible_competitor=org,
                                                 target_org='Snowflake')
    print(f"{org}: {competitor_probability:0.2f}%")
    results[org] = competitor_probability

100%|██████████| 21/21 [31:20<00:00, 89.56s/it]

MapReduce: 83.62%
Dynamo: 97.90%
MongoDB: 80.83%
LA CTO: 88.32%
Redshift: 84.50%
Azure: 83.56%
Google: 98.40%
Google Cloud: 99.04%
Oracle: 72.04%
Amazon: 97.70%
Excel: 89.59%
UI: 87.71%
EMR: 92.72%
Rubicon: 85.82%
Apache: 83.10%
Verve: 91.67%
Cassandra: 95.28%
Elastic: 83.26%
Couchbase: 91.75%
Gartner: 87.03%
Databricks: 94.64%

In [13]:
# build a DataFrame of entities and their competitor probabilities
results_df = pd.DataFrame(results.items(), columns=['entity', 'competitor_probability'])

# create binary classification: competitor if the rounded probability is at least 95
# (rounding means a score such as 94.64 is treated as 95)
results_df['competitor'] = np.where(results_df['competitor_probability'].round() >= 95, 1, 0)

# sort results in descending order of probability
results_df.sort_values('competitor_probability', ascending=False, inplace=True)
results_df

| | entity | competitor_probability | competitor |
|---|---|---|---|
| 7 | Google Cloud | 99.035293 | 1 |
| 6 | Google | 98.396146 | 1 |
| 1 | Dynamo | 97.901756 | 1 |
| 9 | Amazon | 97.704494 | 1 |
| 16 | Cassandra | 95.284843 | 1 |
| 20 | Databricks | 94.639993 | 1 |
| 12 | EMR | 92.721331 | 0 |
| 18 | Couchbase | 91.748416 | 0 |
| 15 | Verve | 91.669631 | 0 |
| 10 | Excel | 89.589107 | 0 |
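
The thresholding cell rounds each probability before comparing it with 95, which matters for borderline scores. The plain-Python sketch below uses four of the scores from the table to compare the rounded rule with a strict one. `flag_competitors` is a hypothetical helper, not part of the notebook.

```python
def flag_competitors(scores, threshold=95):
    """Return {entity: (rounded_flag, strict_flag)}, ranked by descending score.

    rounded_flag mirrors the notebook's rule (round first, then compare);
    strict_flag compares the raw score with the threshold.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {
        entity: (int(round(prob) >= threshold), int(prob >= threshold))
        for entity, prob in ranked
    }

# Scores copied from the results table above.
scores = {
    "EMR": 92.721331,
    "Databricks": 94.639993,
    "Google Cloud": 99.035293,
    "Cassandra": 95.284843,
}

flags = flag_competitors(scores)
for entity, (rounded_flag, strict_flag) in flags.items():
    print(f"{entity}: rounded={rounded_flag}, strict={strict_flag}")
```

Databricks is the one entity here whose label depends on the rounding: 94.64 rounds up to 95 and is flagged as a competitor, whereas a strict comparison would leave it out.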


In [14]:
# save as csv
results_df.to_csv('results.csv', index=False)