In [1]:
import sys
sys.path.insert(1, 'Utilities')

# 1.) Model Type Detection

## Strategy for matching the keywords to Sales Description:

#### From what I understand:
- Every request body has just one model code. If a request contains several (or maybe an ambiguous one), we'll need to create multiple request bodies.
- The model names also overlap heavily, which might cause problems if people misspell them.

### Potential approach:
- We'll need to find some way to extract the keywords from the request text and find a similarity score with the Sales description.
- We can calculate the score with cosine similarity, and then softmax the results. 
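
A minimal sketch of the scoring step described above, using toy vectors instead of real embeddings. The helper names and the three "embeddings" are made up for illustration and are not the versions in `utilities`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
query = [1.0, 0.0, 1.0]
candidates = {
    "A": [1.0, 0.0, 1.0],  # same direction as the query
    "B": [1.0, 1.0, 0.0],  # partially aligned
    "C": [0.0, 1.0, 0.0],  # orthogonal to the query
}

sims = [cosine_similarity(vec, query) for vec in candidates.values()]
weights = softmax(sims)
print(dict(zip(candidates, weights)))
```

One thing to keep in mind: cosine similarity is bounded to [-1, 1], so after a plain softmax the largest weight can be at most e^2 (about 7.4) times the smallest, no matter how different the inputs are.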

#### How to get the score:

##### Method 1:
- Maybe encode the table with BERT (SBERT would work here) and compare the encodings of the keywords with the encodings of the sales descriptions to find the best match.
- Why BERT? It should absorb spelling and typographical errors, and if there is some ambiguity, we'll get very similar softmax scores for the different sales descriptions.


In [10]:
from utilities import *

In [261]:
target_embeddings = [get_sbert_embeddings('iX xDrive50'), get_sbert_embeddings('iX xDrive40'), get_sbert_embeddings('X7 xDrive40i'), get_sbert_embeddings('X7 xDrive40d'), get_sbert_embeddings('M8'), get_sbert_embeddings('318i')]
test_emb = get_sbert_embeddings('iX xDrive50')

In [262]:
softmax(np.array([get_cosine_similarity(i, test_emb) for i in target_embeddings]))

array([0.16695336, 0.16686176, 0.16672765, 0.16676882, 0.16661336,
       0.16607507], dtype=float32)

##### Problems:
- BERT (or FastText, Word2Vec, etc.) is really heavy and overkill for this problem.
- On top of that, the similarity scores barely differ even for completely different model type codes (the softmax output above is almost uniform), let alone for minor variations (typing errors, partial names, etc.). Even pooling strategies don't work well here.

##### Method 2:
- We can calculate the score as the length of the longest common subsequence (LCS), expressed as a fraction of the length of the sales description extracted from the text.
- This is a lot more interpretable, and since a typographical error will change the score for every match, the softmax output will remain consistent.
- Everything else is the same as in the BERT strategy.
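
A rough sketch of what the LCS score could look like. This is not the `lcs_similarity` from `utilities`; `lcs_length` and `lcs_score` are hypothetical names, matching is case-sensitive, and dividing by the shorter string's length is an assumption (it appears to be consistent with the scores printed in the cells below).

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (case-sensitive)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_score(target, query):
    """LCS length divided by the shorter string's length (assumed normalisation)."""
    shorter = min(len(target), len(query))
    return lcs_length(target, query) / shorter if shorter else 0.0

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

targets = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']

# 'Drive40' is fully contained in three of the targets, so those three tie at 1.0
scores = [lcs_score(t, 'Drive40') for t in targets]
weights = softmax(scores)
print(scores)
print(weights)
```

Because the score is a simple ratio of matched characters, it is easy to see why a given candidate wins: each missing or extra character moves the score by one step of 1/length.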

In [263]:
target_string = ['iX xDrive50', 'iX xDrive40','X7 xDrive40i', 'X7 xDrive40d','M8','318i']

In [264]:
test_string = 'X7 xDrive40i'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.18498475, 0.20258965, 0.22186999, 0.20413024, 0.08162141,
       0.10480396])

In [265]:
# With spelling error
test_string = 'ix Drve40'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.19354117, 0.21628595, 0.19354117, 0.19354117, 0.08891781,
       0.11417273])

In [266]:
# With ambiguity
test_string = 'Drive40'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.18416297, 0.21244395, 0.21244395, 0.21244395, 0.07815376,
       0.10035142])

##### Potential selection method:
- If two scores are equal to the second decimal place, we can declare the request ambiguous enough to create multiple request bodies.
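
The rule could be sketched like this. `select_candidates` is a hypothetical helper, and I am assuming "equal" means "equal to the top score after rounding"; the two weight lists are copied from the outputs above.

```python
def select_candidates(labels, weights, decimals=2):
    """Return every label whose weight matches the top weight after rounding."""
    top = round(max(weights), decimals)
    return [label for label, w in zip(labels, weights) if round(w, decimals) == top]

targets = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']

# Softmax output for the ambiguous query 'Drive40'
ambiguous = [0.18416297, 0.21244395, 0.21244395, 0.21244395, 0.07815376, 0.10035142]
# Softmax output for the exact query 'X7 xDrive40i'
clear = [0.18498475, 0.20258965, 0.22186999, 0.20413024, 0.08162141, 0.10480396]

print(select_candidates(targets, ambiguous))  # three candidates -> three request bodies
print(select_candidates(targets, clear))      # a single candidate
```

A caveat with rounding: two scores that are very close but fall on either side of a rounding boundary would not be treated as tied, so an absolute tolerance might turn out to be a safer choice.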

## 2.) Keyword detection
- Essentially, I need to find all the key terms in the text body and work out which of them match the sales descriptions of the model type codes.

#### Here is an idea: 
- The boolean formula task is going to be a POS-tagging job. If I am going to extract nouns there anyway, why not use the same information to find the potential model type codes as well?

## 2.) (redefined) Keyword detection with some POS and conjunction context:
- spaCy's 'en_core_web_sm' might be good enough here.

In [68]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

# text = "Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024"
# text = 'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge along with the M Sport Package and right-hand drive on 12th April 2018. Is this configuration possible?'
# text = "I want to order a BMW iX with right-hand drive configuration. I will be ordering it at the start of October 2022."
text = "I want to order a BMW iX. Please don't add a right-hand drive configuration. I will be ordering it at the start of October 2022."

doc = nlp(text)
displacy.render(doc, style = "dep")

- Most of these keywords look like they are compound NOUNs or PROPNs. Some are joined with amod though (like the model names or ones separated with -).
- The one with "panorama glass roof sky lounge" gives a weird compound structure (this can be solved with a recursive function that extracts the compounds).

- This is also going to require some sort of reduction, as the lookup function for compound words creates redundant lists. We can check whether a term is a subset of another term and, if so, remove it.

- The with/without etc. are ADPs and can be taken directly from the head of the NOUN/PROPN. That gives me my AND / AND NOT boolean logic. (However, due to some grammatical mistake, the head might not end up being a direct "with" or "without". Here, I'll need to find some way to determine whether the head has a positive or negative connotation.)

- Now about OR. <b>Here is an idea</b>. You technically cannot ask for two configurations of the same thing (like a sunroof and a panorama glass roof). So if I find two things that belong to the roof configuration, they'll go into a bracket with OR / OR NOT. (To reinforce this, however, I can check for any child conjunctions that these terms have.)

- Another observation: If the sentence has the form "A or B", then the head of A is usually wrong. If both belong to the same class, we can assume both have the same head as the one for B.
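
The subset reduction mentioned above could look roughly like this. It is only a sketch on plain strings: `drop_subsumed` is a hypothetical helper, not the logic in `pos_tagging`, and it compares lower-cased word sets rather than spaCy tokens.

```python
def drop_subsumed(terms):
    """Keep only the terms whose word set is not strictly contained in another term's."""
    word_sets = [set(term.lower().split()) for term in terms]
    kept = []
    for i, words in enumerate(word_sets):
        # `<` on sets is the strict-subset test, so exact duplicates are both kept
        subsumed = any(words < other for j, other in enumerate(word_sets) if j != i)
        if not subsumed:
            kept.append(terms[i])
    return kept

terms = ['panorama', 'glass', 'panorama glass roof', 'EU', 'Comfort', 'EU Comfort Package']
print(drop_subsumed(terms))
```

Comparing word sets ignores word order, which seems acceptable here because the partial terms come from the same compound as the full term.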

In [3]:
from pos_tagging import *

In [57]:
samples = ["Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024",
'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge along with the M Sport Package and right-hand drive on 12th April 2018. Is this configuration possible?',
"I want to order a BMW iX with right-hand drive configuration. I will be ordering it at the start of October 2022.",
"I want to order a BMW iX. I do not want a right-hand drive configuration. I will be ordering it at the start of October 2022."]

for text in samples:
    print("Prompt: ", text, '\n')
    unique_tags, all_tags = get_key_terms_with_pos(text)

    for t in all_tags:
        print(t)

    print('\n------------------\nUnique:')

    for t in unique_tags:
        print(t)

    print('\n\n\n')

Prompt:  Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024 

{'values': [available, M8], 'main_token': M8, 'text': 'available M8', 'child_conj': [], 'pos': 'NOUN', 'dep': 'attr', 'head': is}
{'values': [panorama], 'main_token': panorama, 'text': 'panorama', 'child_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': roof}
{'values': [glass], 'main_token': glass, 'text': 'glass', 'child_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': roof}
{'values': [panorama, glass, roof], 'main_token': roof, 'text': 'panorama glass roof', 'child_conj': [], 'pos': 'NOUN', 'dep': 'pobj', 'head': without}
{'values': [EU], 'main_token': EU, 'text': 'EU', 'child_conj': [], 'pos': 'PROPN', 'dep': 'compound', 'head': Package}
{'values': [Comfort], 'main_token': Comfort, 'text': 'Comfort', 'child_conj': [], 'pos': 'PROPN', 'dep': 'compound', 'head': Package}
{'values': [EU, Comfort, Package], 'main_token': Package, 't

#### Let's try to find the model type code

In [58]:
MODEL_TYPE_CODE = { 'iX xDrive50': '21CF',
                    'iX xDrive40': '11CF',
                    'X7 xDrive40i': '21EM',
                    'X7 xDrive40d': '21EN',
                    'M8': 'DZ01',
                    '318i': '28FF',}

STEERING_CONFIG_CODE = { 'Left-Hand Drive': 'LL',
                         'Right-Hand Drive': 'RL',}

PACKAGE_CODE = {    'M Sport Package': 'P337A',
                    'M Sport Package Pro': 'P33BA',
                    'Comfort Package EU': 'P7LGA'}

ROOF_CONFIG_CODE = {'Panorama Glass Roof': 'S402A',
                    'Panorama Glass Roof Sky Lounge': 'S407A',
                    'Sunroof': 'S403A'}

In [112]:
tags,_ = get_key_terms_with_pos(samples[1])

In [124]:
segregated = segregated_tags(tags)
for s in segregated:
    print(s)
    print('\n\n')

[{'values': [BMW, M8], 'main_token': M8, 'text': 'BMW M8', 'child_conj': [], 'pos': 'PROPN', 'dep': 'dobj', 'head': order}, {'values': [12th, April], 'main_token': April, 'text': '12th April', 'child_conj': [], 'pos': 'PROPN', 'dep': 'pobj', 'head': on}]



[{'values': [right, hand, drive], 'main_token': drive, 'text': 'right hand drive', 'child_conj': [], 'pos': 'NOUN', 'dep': 'conj', 'head': Package}]



[{'values': [M, Sport, Package], 'main_token': Package, 'text': 'M Sport Package', 'child_conj': [and], 'pos': 'PROPN', 'dep': 'pobj', 'head': with}]



[{'values': [sunroof], 'main_token': sunroof, 'text': 'sunroof', 'child_conj': [or], 'pos': 'NOUN', 'dep': 'pobj', 'head': with}, {'values': [panorama, glass, roof, sky, lounge], 'main_token': lounge, 'text': 'panorama glass roof sky lounge', 'child_conj': [], 'pos': 'NOUN', 'dep': 'conj', 'head': sunroof}, {'values': [possible, configuration], 'main_token': configuration, 'text': 'possible configuration', 'child_conj': [], 'pos': 'N

In [70]:
def segregated_tags(tags):
    """Segregate the tags into four lists: model type, steering config, package, roof config.

    Args:
        tags (list): List of dictionaries containing the tags and their POS tags

    Returns:
        list: Four lists of tags, in the order listed above
    """
    # Lookup tables, in the same order as the output lists
    lookups = [MODEL_TYPE_CODE, STEERING_CONFIG_CODE, PACKAGE_CODE, ROOF_CONFIG_CODE]

    # Four independent lists ([[]] * 4 would repeat one shared list object)
    segregated = [[] for _ in lookups]

    for t in tags:
        tag_text = t['text']

        # Best match score of this tag against the keys of each dictionary
        match_score = [max(lcs_similarity(tag_text, key, type='min') for key in lookup)
                       for lookup in lookups]

        # Assign the tag to its best-matching category if the match is strong enough
        best = int(np.argmax(match_score))
        if match_score[best] >= 0.5:
            segregated[best].append(t)

    return segregated
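
Once a tag has landed in a category, the remaining step is to pick the actual code from that category's table. A possible sketch follows; `best_code` is a hypothetical helper, and it deliberately differs from `type='min'` above in two assumed ways: it compares lower-cased text and divides by the longer string's length. With shorter-string normalisation and lower-cased text, "panorama glass roof" would score 1.0 against both roof entries that start with that phrase.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def best_code(term, table):
    """Return (key, code, score) for the table key that is closest to term."""
    def score(key):
        a, b = term.lower(), key.lower()
        # Divide by the longer string so extra words in the key are penalised
        return lcs_length(a, b) / max(len(a), len(b), 1)
    key = max(table, key=score)
    return key, table[key], score(key)

ROOF_CONFIG_CODE = {'Panorama Glass Roof': 'S402A',
                    'Panorama Glass Roof Sky Lounge': 'S407A',
                    'Sunroof': 'S403A'}

for term in ['sunroof', 'panorama glass roof', 'panorama glass roof sky lounge']:
    print(term, '->', best_code(term, ROOF_CONFIG_CODE))
```

Note that `best_code` always returns something, even for a poor match, so a minimum score threshold would probably still be needed on top of it.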