In [1]:
import sys
sys.path.insert(1, 'Utilities')

# 1.) Model Type Detection

## Strategy for matching the keywords to Sales Description:

#### From what I understand:
- Every request body has exactly one model code. If a request mentions several model codes (or an ambiguous one), we'll need to create multiple request bodies.
- The model names overlap heavily, which could cause problems when people misspell them.

### Potential approach:
- We need a way to extract the keywords from the request text and compute a similarity score against each Sales Description.
- We can compute the score with cosine similarity and then apply a softmax over the results.
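
As a minimal sketch of the scoring step described above, the two building blocks can be written in plain Python. The function names and the toy 3-dimensional vectors are made up for illustration; they are stand-ins, not the helpers imported from `utilities`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "embeddings" (hypothetical values, for illustration only)
query = [1.0, 0.0, 1.0]
candidates = {"A": [1.0, 0.0, 1.0], "B": [0.0, 1.0, 0.0], "C": [1.0, 1.0, 0.0]}

sims = [cosine_similarity(query, vec) for vec in candidates.values()]
probs = softmax(sims)
for name, sim, prob in zip(candidates, sims, probs):
    print(name, round(sim, 3), round(prob, 3))
```

The candidate with the highest cosine similarity also gets the highest softmax weight; the softmax only rescales the scores so that they can be compared as shares of a whole.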

#### How to get the score:

##### Method 1:
- Encode the sales descriptions in the table with BERT (SBERT would work here) and compare those embeddings with the embeddings of the extracted keywords to find the best match.
- Why BERT? It should tolerate spelling and typographical errors, and if a request is ambiguous we should get very similar softmax scores for the competing sales descriptions.


In [2]:
from utilities import *

In [4]:
model_names = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']
target_embeddings = [get_sbert_embeddings(name) for name in model_names]
test_emb = get_sbert_embeddings('iX xDrive50')

In [5]:
softmax(np.array([get_cosine_similarity(i, test_emb) for i in target_embeddings]))

array([0.16695336, 0.16686176, 0.16672765, 0.16676882, 0.16661336,
       0.16607507], dtype=float32)

##### Problems:
- BERT (or FastText, Word2Vec, etc.) is heavy and overkill for this problem.
- On top of that, the softmax scores barely differ even between completely different model codes (about 0.1670 for the exact match versus 0.1661 for '318i' in the output above), let alone for minor variations (typing errors, partial names, etc.). Trying different pooling strategies does not help either.
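
Part of the flatness comes from the softmax itself: cosine similarities lie in [-1, 1], so no two softmax outputs can differ by more than a factor of e^2, and similarities that sit close together produce an almost uniform distribution. The sketch below illustrates this with hypothetical similarity values (not the ones produced by SBERT here). The `temperature` parameter is an assumption added for illustration; it sharpens the output but cannot compensate for embeddings that barely separate the model codes in the first place.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax with an optional (hypothetical) temperature to sharpen the output."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities: exact match first, then near misses
sims = [1.00, 0.99, 0.98, 0.98, 0.96, 0.95]

flat = softmax(sims)                     # nearly uniform
sharp = softmax(sims, temperature=0.01)  # differences amplified 100x

print([round(p, 4) for p in flat])
print([round(p, 4) for p in sharp])
```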

##### Method 2:
- We can score each candidate by the length of its longest common subsequence (LCS) with the sales description extracted from the text, expressed as a fraction of the length of that description.
- This is much more interpretable, and since a typographical error changes the score for every candidate, the softmax output stays consistent.
- Everything else is the same as in the BERT strategy.
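
A minimal sketch of such an LCS score is shown below. The real `lcs_similarity` lives in `utilities` and is not shown in this notebook, so the names `lcs_length` and `lcs_similarity_sketch` are hypothetical, and dividing by the length of the shorter string is an assumption. That normalization was chosen because it appears to reproduce the softmax outputs in the cells that follow; the actual helper may differ.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (case-sensitive)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity_sketch(candidate, query):
    """LCS length divided by the length of the shorter string (assumed normalization)."""
    shorter = min(len(candidate), len(query))
    return lcs_length(candidate, query) / shorter if shorter else 0.0

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

targets = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']
scores = [lcs_similarity_sketch(t, 'Drive40') for t in targets]
probs = softmax(scores)
print([round(p, 4) for p in probs])
```

Because the comparison is case-sensitive, a query such as 'ix' does not fully match 'iX'; lowercasing both strings first would be one way to make the score more forgiving.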

In [12]:
target_string = ['iX xDrive50', 'iX xDrive40','X7 xDrive40i', 'X7 xDrive40d','M8','318i']

In [13]:
test_string = 'X7 xDrive40i'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.18498475, 0.20258965, 0.22186999, 0.20413024, 0.08162141,
       0.10480396])

In [18]:
# With spelling error
test_string = 'ix Drve40'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.19354117, 0.21628595, 0.19354117, 0.19354117, 0.08891781,
       0.11417273])

In [20]:
# With ambiguity
test_string = 'Drive40'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.18416297, 0.21244395, 0.21244395, 0.21244395, 0.07815376,
       0.10035142])

##### Potential Selection method: 
- If the top two scores are equal to the 2nd decimal place, we can declare the request ambiguous enough to create multiple request bodies.
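
The selection rule above can be sketched as follows. The function name `select_candidates` is hypothetical, and interpreting "equal to the 2nd decimal place" as rounding (rather than truncating) is an assumption. The probabilities are copied from the three softmax outputs shown earlier.

```python
def select_candidates(names, probabilities, decimals=2):
    """Return every name whose rounded probability ties with the best rounded probability."""
    rounded = [round(p, decimals) for p in probabilities]
    best = max(rounded)
    return [name for name, r in zip(names, rounded) if r == best]

targets = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']

exact = [0.18498475, 0.20258965, 0.22186999, 0.20413024, 0.08162141, 0.10480396]
typo = [0.19354117, 0.21628595, 0.19354117, 0.19354117, 0.08891781, 0.11417273]
ambiguous = [0.18416297, 0.21244395, 0.21244395, 0.21244395, 0.07815376, 0.10035142]

print(select_candidates(targets, exact))
print(select_candidates(targets, typo))
print(select_candidates(targets, ambiguous))
```

A single returned name means one request body; several names mean the request is ambiguous and one request body is created per name.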

## 2.) Keyword detection
- Essentially, I need to find all the key terms in the text body and determine which of them match the sales description of a model type code.

#### Here is an idea: 
- The boolean formula task is going to be a POS-tagging job. If I am going to extract nouns there anyway, why not use the same information to find the potential model type codes as well?

## 2 (redefined) Keyword detection with some POS and conjunction context:
- spaCy's 'en_core_web_sm' might be good enough here.

In [227]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

# text = "Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024"
text = 'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge along with the M Sport Package and right-hand drive on 12th April 2018. Is this configuration possible?'
doc = nlp(text)
displacy.render(doc, style = "dep")

- Most of these keywords look like compound NOUNs or PROPNs. Some are joined with amod though (like the model names or the ones separated with a hyphen).
- The one with "panorama glass roof sky lounge" produces a weird compound structure (this can be solved with a recursive function that extracts the compounds).

- This also requires some sort of reduction, because the lookup function for compound words creates redundant lists. We can check whether a term is a subset of another term and, if so, remove it.

- The with/without etc. are ADP and can be taken directly from the head of the NOUN/PROPN. That gives me my AND / AND NOT boolean logic. (However, due to a grammatical mistake, the head might not end up being a direct "with" or "without". In that case I'll need some way to determine whether the head has a positive or negative connotation.)

- Now about OR. <b>Here is an idea.</b> You technically cannot ask for two configurations of the same thing (like a sunroof and a panorama glass roof). So if I find two things that belong to the roof configuration, they go into a bracket with OR / OR NOT. (To reinforce this, I can also check for any child conjunctions that these terms have.)

- Another observation: if the sentence is "A or B", then the head of B is usually wrong, because B attaches to A as a conjunct. If both belong to the same class, we can assume B has the same head as A.
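
To make the AND / AND NOT / OR idea concrete, here is a rough sketch that builds a boolean formula from already extracted terms. Everything in it is hypothetical: the `NEGATIVE_HEADS` set, the `CATEGORY` lookup, and the function `build_formula` are illustrative stand-ins, and the heads are assumed to be already resolved to the governing preposition (so both roof options carry "with").

```python
# Hypothetical set of heads that signal exclusion
NEGATIVE_HEADS = {"without", "excluding", "except"}

# Hypothetical lookup from a key term to its configuration class
CATEGORY = {
    "sunroof": "roof",
    "panorama glass roof sky lounge": "roof",
    "M Sport Package": "package",
    "right hand drive": "steering",
}

def build_formula(terms):
    """terms: list of (text, head) pairs. Same-class terms are OR-ed, classes are AND-ed."""
    groups = {}
    for text, head in terms:
        prefix = "NOT " if head.lower() in NEGATIVE_HEADS else ""
        literal = prefix + '"' + text + '"'
        # Terms without a known class form their own group
        groups.setdefault(CATEGORY.get(text, text), []).append(literal)
    clauses = []
    for literals in groups.values():
        if len(literals) > 1:
            clauses.append("(" + " OR ".join(literals) + ")")
        else:
            clauses.append(literals[0])
    return " AND ".join(clauses)

terms = [
    ("sunroof", "with"),
    ("panorama glass roof sky lounge", "with"),
    ("M Sport Package", "with"),
    ("right hand drive", "with"),
]
print(build_formula(terms))
```

A term governed by "without" simply becomes a negated literal, which turns its clause into AND NOT (or OR NOT inside a bracket).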

In [235]:
def get_key_terms_with_pos():
    # Note: reads the global `doc` created by nlp(text) above
    all_tags = []
    for token in doc:
        if token.pos_ in ("NOUN", "PROPN"):

            # Collect the tokens that make up the full term, then add the token itself
            compound = recursive_compound_extraction(token, [])
            compound.append(token)
            
            tags = {'values': compound,
                    'text': ' '.join([i.text for i in compound]),
                    'child_conj': [i for i in token.children if i.pos_ == 'CCONJ'],
                    'pos': token.pos_,
                    'dep': token.dep_,
                    'head': token.head,}
            
            all_tags.append(tags)

    # Reduction logic: drop every term that is fully contained in a longer term
    all_values = [i['values'] for i in all_tags]

    unique_tags = []
    for t in all_tags:
        # Count the tags whose values contain all the values of the current tag
        number_of_supersets = sum(set(t['values']) <= set(j) for j in all_values)

        # Every tag is a superset of itself, so exactly one match means no longer term contains it
        if number_of_supersets == 1:
            unique_tags.append(t)

    # Return both lists so that later cells can inspect them
    return all_tags, unique_tags

In [233]:
for t in all_tags:
    print(t)

{'values': [BMW], 'text': 'BMW', 'child_conj': [], 'pos': 'PROPN', 'dep': 'compound', 'head': M8}
{'values': [BMW, M8], 'text': 'BMW M8', 'child_conj': [], 'pos': 'PROPN', 'dep': 'dobj', 'head': order}
{'values': [sunroof], 'text': 'sunroof', 'child_conj': [or], 'pos': 'NOUN', 'dep': 'pobj', 'head': with}
{'values': [panorama], 'text': 'panorama', 'child_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': lounge}
{'values': [glass], 'text': 'glass', 'child_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': roof}
{'values': [glass, roof], 'text': 'glass roof', 'child_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': sky}
{'values': [glass, roof, sky], 'text': 'glass roof sky', 'child_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': lounge}
{'values': [panorama, glass, roof, sky, lounge], 'text': 'panorama glass roof sky lounge', 'child_conj': [], 'pos': 'NOUN', 'dep': 'conj', 'head': sunroof}
{'values': [M], 'text': 'M', 'child_conj': [], 'pos': 'PROPN', 'dep': 'compound', 'he

In [234]:
for t in unique_tags:
    print(t)

{'values': [BMW, M8], 'text': 'BMW M8', 'child_conj': [], 'pos': 'PROPN', 'dep': 'dobj', 'head': order}
{'values': [sunroof], 'text': 'sunroof', 'child_conj': [or], 'pos': 'NOUN', 'dep': 'pobj', 'head': with}
{'values': [panorama, glass, roof, sky, lounge], 'text': 'panorama glass roof sky lounge', 'child_conj': [], 'pos': 'NOUN', 'dep': 'conj', 'head': sunroof}
{'values': [M, Sport, Package], 'text': 'M Sport Package', 'child_conj': [and], 'pos': 'PROPN', 'dep': 'pobj', 'head': with}
{'values': [right, hand, drive], 'text': 'right hand drive', 'child_conj': [], 'pos': 'NOUN', 'dep': 'conj', 'head': Package}
{'values': [12th, April], 'text': '12th April', 'child_conj': [], 'pos': 'PROPN', 'dep': 'pobj', 'head': on}
{'values': [possible, configuration], 'text': 'possible configuration', 'child_conj': [], 'pos': 'NOUN', 'dep': 'nsubj', 'head': Is}
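
The reduction step can be illustrated in isolation with plain strings instead of spaCy tokens. This is only a sketch; the function name `drop_subsumed` is hypothetical, and it compares sets of words, so word order is ignored.

```python
def drop_subsumed(terms):
    """Keep only the terms whose word set is not contained in another term's word set."""
    word_sets = [set(term.split()) for term in terms]
    kept = []
    for term, words in zip(terms, word_sets):
        # Every term is a superset of itself, so a count of 1 means nothing else contains it
        supersets = sum(words <= other for other in word_sets)
        if supersets == 1:
            kept.append(term)
    return kept

terms = ['BMW', 'BMW M8', 'sunroof', 'panorama', 'glass', 'glass roof',
         'glass roof sky', 'panorama glass roof sky lounge']
print(drop_subsumed(terms))
```

One caveat of this counting approach: if the same term appears twice, each copy counts the other as a superset and both are dropped, so exact duplicates should be removed beforehand.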


In [47]:
# Print the dependency label, head, and children of every token in the sentence
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])
    

I nsubj planning VERB []
am aux planning VERB []
planning ROOT planning VERB [I, am, order, ,, and, Package, .]
to aux order VERB []
order xcomp planning VERB [to, M8, with]
the det M8 PROPN []
BMW compound M8 PROPN []
M8 dobj order VERB [the, BMW]
with prep order VERB [sunroof]
a det sunroof NOUN []
sunroof pobj with ADP [a, or, lounge]
or cc sunroof NOUN []
panorama compound lounge NOUN []
glass compound roof NOUN []
roof compound sky NOUN [glass]
sky compound lounge NOUN [roof]
lounge conj sunroof NOUN [panorama, sky]
, punct planning VERB []
and cc planning VERB []
the det Package PROPN []
M compound Package PROPN []
Sport compound Package PROPN []
Package conj planning VERB [the, M, Sport, on]
on prep Package PROPN [April]
12th amod April PROPN []
April pobj on ADP [12th, 2018]
2018 nummod April PROPN []
. punct planning VERB []
Is ROOT Is AUX [configuration, ?]
this det configuration NOUN []
configuration nsubj Is AUX [this, possible]
possible amod configuration NOUN []
? punct I

In [35]:
compound_nouns

[]

## Method - 2 Topic Modeling / Text Segmentation

In [3]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
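# Model name inferred from the cache path in the log output below
embedding_model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')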

Downloading:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

No sentence-transformers model found with name C:\Users\GP65/.cache\torch\sentence_transformers\T-Systems-onsite_cross-en-de-roberta-sentence-transformer. Creating a new one with MEAN pooling.


In [4]:
topic_model = BERTopic(language='english',
                      embedding_model = embedding_model)

In [7]:
topic_model.fit_transform(["I want to order a BMW iX with right-hand drive configuration. I will be ordering it at the start of October 2022."])

ValueError: Transform unavailable when model was fit with only a single data sample.