In [1]:
import sys
sys.path.insert(1, 'Utilities')

# 1.) Model Type Detection

## Strategy for matching the keywords to Sales Description:

#### From what I understand:
- Every request body has just one model code. If a request contains several (or an ambiguous one), we'll need to create multiple request bodies.
- There is also a lot of overlap in their names, which might cause a problem if people misspell them.

### Potential approach:
- We'll need to find some way to extract the keywords from the request text and find a similarity score with the Sales description.
- We can calculate the score with cosine similarity, and then softmax the results. 
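To make the scoring step concrete, here is a minimal, dependency-free sketch of cosine similarity followed by softmax. The function names and the toy 3-dimensional vectors are made up for illustration; they stand in for the helpers imported from `utilities` and for real embeddings, neither of which is shown in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "embeddings" (assumed values, not real model output)
candidates = {"A": [1.0, 0.0, 0.0], "B": [0.8, 0.6, 0.0], "C": [0.0, 0.0, 1.0]}
query = [1.0, 0.1, 0.0]

sims = [cosine_similarity(vec, query) for vec in candidates.values()]
probs = softmax(sims)
best = max(zip(candidates, probs), key=lambda kv: kv[1])[0]
print(best, probs)
```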

#### How to get the score:

##### Method 1:
- Maybe encode the table with BERT (SBERT would work here) and compare the encodings of the keywords with the encodings of the sales descriptions to find the best match.
- Why BERT? It should take care of spelling and typographical errors, and if there is some ambiguity, we'll get very similar softmax scores for the different sales descriptions.


In [2]:
from utilities import *

In [3]:
target_embeddings = [get_sbert_embeddings('iX xDrive50'), get_sbert_embeddings('iX xDrive40'), get_sbert_embeddings('X7 xDrive40i'), get_sbert_embeddings('X7 xDrive40d'), get_sbert_embeddings('M8'), get_sbert_embeddings('318i')]
test_emb = get_sbert_embeddings('iX xDrive50')

In [4]:
softmax(np.array([get_cosine_similarity(i, test_emb) for i in target_embeddings]))

array([0.16695336, 0.16686176, 0.16672765, 0.16676882, 0.16661336,
       0.16607507], dtype=float32)

##### Problems:
- BERT (or FastText, Word2Vec, etc.) is really heavy and overkill for this problem.
- On top of that, the similarity scores barely differ even for big changes in the model type code, let alone for minor ones (typing errors, partial names, etc.). Even pooling strategies don't work well here.
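One thing that may be worth checking alongside the points above: similarity scores live in a narrow range (at most [-1, 1] for cosine), so a plain softmax over them can never be very peaked, whatever the encoder. The sketch below uses made-up scores and a hypothetical `temperature` parameter (not part of the `softmax` helper used in this notebook) to show the effect.

```python
import math

def softmax_t(scores, temperature=1.0):
    """Softmax with a temperature; smaller values sharpen the distribution."""
    scaled = [s / temperature for s in scores]
    shift = max(scaled)  # for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative similarity scores (assumed, not taken from a real model)
scores = [1.0, 0.95, 0.6, 0.6, 0.3, 0.3]

plain = softmax_t(scores)                     # temperature = 1
sharp = softmax_t(scores, temperature=0.05)   # much more peaked

print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharp])
```

Even with a perfect score of 1.0, the plain softmax gives the best candidate less than a quarter of the mass here, so flat-looking outputs are not by themselves proof that the underlying scores are uninformative.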

##### Method 2:
- We can calculate the score as the length of the longest common subsequence, expressed as a fraction of the length of the sales description extracted from the text.
- This is a lot more interpretable, and since a typographical error will change the score for every match, the softmax output will remain consistent.
- Everything else is the same as in the BERT strategy.
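The actual `lcs_similarity` lives in `utilities` and is not shown in this notebook. The following is only a sketch of how such a score could be computed. The names `lcs_length` and `lcs_score`, the `norm` argument, the case-insensitive comparison, and dividing by the shorter (or longer) string are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_score(candidate, query, norm="min"):
    """LCS length divided by the shorter ('min') or longer ('max') string length."""
    a, b = candidate.lower(), query.lower()  # assumption: ignore case
    denom = min(len(a), len(b)) if norm == "min" else max(len(a), len(b))
    return lcs_length(a, b) / denom if denom else 0.0

print(lcs_score("X7 xDrive40i", "X7 xDrive40i"))  # identical strings
print(lcs_score("iX xDrive40", "ix Drve40"))      # misspelled query
print(lcs_score("M8", "X7 xDrive40i"))            # no characters in common
```

Because the score is a plain ratio of character counts, it is easy to explain why one candidate outranks another, which is the interpretability argument made above.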

In [5]:
target_string = ['iX xDrive50', 'iX xDrive40','X7 xDrive40i', 'X7 xDrive40d','M8','318i']

In [6]:
test_string = 'X7 xDrive40i'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.18498475, 0.20258965, 0.22186999, 0.20413024, 0.08162141,
       0.10480396])

In [7]:
# With spelling error
test_string = 'ix Drve40'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.19776486, 0.221006  , 0.19776486, 0.19776486, 0.08130357,
       0.10439584])

In [8]:
# With ambiguity
test_string = 'Drive40'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.18416297, 0.21244395, 0.21244395, 0.21244395, 0.07815376,
       0.10035142])

##### Potential Selection method: 
- If two scores are equal to the 2nd decimal place, we can declare that the request is ambiguous enough for us to create multiple request bodies.
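A minimal sketch of this selection rule, using the softmax scores printed for the ambiguous query 'Drive40' above. The helper name `ambiguous_matches` and its `decimals` parameter are hypothetical.

```python
def ambiguous_matches(names, probs, decimals=2):
    """Return every candidate whose score ties with the best one after rounding."""
    top = round(max(probs), decimals)
    return [n for n, p in zip(names, probs) if round(p, decimals) == top]

names = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']
# Softmax output for test_string = 'Drive40', copied from the cell above
probs = [0.18416297, 0.21244395, 0.21244395, 0.21244395, 0.07815376, 0.10035142]

matches = ambiguous_matches(names, probs)
print(matches)  # more than one entry means: create one request body per match
```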

## 2.) Keyword detection
- Essentially, I need to find all the key terms in the text body and work out which of them match the sales descriptions of the model type codes.

#### Here is an idea: 
- The boolean formula task is going to be a POS-tagging job. If I am going to extract nouns there anyway, why not use the same information to find the potential model type codes as well?

## 2 (redefined) Keyword detection with some POS and conjunction context:
- spaCy's 'en_core_web_sm' might be good enough here.

In [39]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

# text = "Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024"
text = 'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge or with the M Sport Package or right-hand drive on 12th April 2018. Is this configuration possible?'
# text = "I want to order a BMW iX with right-hand drive configuration. I will be ordering it at the start of October 2022."
# text = "I want to order a BMW iX. Please don't add a right-hand drive configuration. I will be ordering it at the start of October 2022."
# text = "I want to order a BMW iX. I do not want a right-hand drive configuration. I will be ordering it at the start of October 2022."

doc = nlp(text)
displacy.render(doc, style = "dep")

- Most of these keywords look like compound NOUNs or PROPNs. Some are joined with amod, though (like the model names or terms separated by a hyphen).
- The one with "panorama glass roof sky lounge" gives a weird compound structure (this can be solved with a recursive function that extracts the compounds).

- This will also require some sort of reduction, as the lookup function for compound words creates redundant lists. We can check whether a term is a subset of another term and, if so, remove it.

- The with/without etc. are ADP and can be taken directly from the head of the NOUN/PROPN. That gives me my AND / AND NOT boolean logic. (However, due to a grammatical mistake, the head might not end up being a direct with or without. In that case, I'll need some way to determine whether the head has a positive or negative connotation.)

- Now about OR. <b>Here is an idea</b>. You technically cannot ask for two configurations of the same thing (like sunroof and panorama glass roof). So if I find two things that belong to the roof configuration, they'll go into a bracket with OR/OR-NOT. (To reinforce this, however, I can check for any child conjunctions that these terms have.)

- Another observation: If the sentence is A or B, then the head of A is usually wrong. If both belong to the same class, we can assume both have the same head as the one for B.
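The reduction step mentioned in the notes above (dropping terms that are contained in a longer term) could look roughly like this. It is a sketch on plain strings, not the implementation inside `pos_tagging`. The name `drop_subset_terms` is hypothetical, and it compares token sets, so word order is ignored.

```python
def drop_subset_terms(terms):
    """Keep only terms whose tokens are not a proper subset of another term's tokens."""
    token_sets = [set(t.split()) for t in terms]
    kept = []
    for i, term in enumerate(terms):
        is_redundant = any(
            i != j and token_sets[i] < token_sets[j]  # '<' is proper-subset on sets
            for j in range(len(terms))
        )
        if not is_redundant and term not in kept:
            kept.append(term)
    return kept

terms = ["panorama", "glass", "panorama glass roof",
         "EU", "Comfort", "EU Comfort Package"]
unique_terms = drop_subset_terms(terms)
print(unique_terms)
```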

In [10]:
from pos_tagging import *

In [11]:
samples = ["Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024",
'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge along with the M Sport Package and right-hand drive on 12th April 2018. Is this configuration possible?',
"I want to order a BMW iX with right-hand drive configuration. I will be ordering it at the start of October 2022.",
"I want to order a BMW iX. I do not want a right-hand drive configuration. I will be ordering it at the start of October 2022."]

for text in samples:
    print("Prompt: ", text, '\n')
    unique_tags, all_tags = get_key_terms_with_pos(text)

    for t in all_tags:
        print(t)

    print('\n------------------\nUnique:')

    for t in unique_tags:
        print(t)

    print('\n\n\n')

Prompt:  Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024 

{'values': [available, M8], 'main_token': M8, 'text': 'available M8', 'child_conj': [], 'head_conj': [], 'pos': 'NOUN', 'dep': 'attr', 'head': is}
{'values': [panorama], 'main_token': panorama, 'text': 'panorama', 'child_conj': [], 'head_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': roof}
{'values': [glass], 'main_token': glass, 'text': 'glass', 'child_conj': [], 'head_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': roof}
{'values': [panorama, glass, roof], 'main_token': roof, 'text': 'panorama glass roof', 'child_conj': [], 'head_conj': [and], 'pos': 'NOUN', 'dep': 'pobj', 'head': without}
{'values': [EU], 'main_token': EU, 'text': 'EU', 'child_conj': [], 'head_conj': [], 'pos': 'PROPN', 'dep': 'compound', 'head': Package}
{'values': [Comfort], 'main_token': Comfort, 'text': 'Comfort', 'child_conj': [], 'head_conj': [], 'pos': 

#### Let's try to create the boolean rule

Methods that have failed so far:
- TextBlob for getting the sentiment
- vaderSentiment

One possible way here is to lemmatize the term and build a lookup dictionary.

In [12]:
from request_body_creation import *
import numpy as np

In [13]:
def segregated_tags(tags):
    # Adding the types to the tags
    
    segregated = []

    for t in tags:
        tag_text = t['text']

        # Getting the match score with each dictionary
        match_score = [ max([lcs_similarity(tag_text,i,type='min') for i in list(MODEL_TYPE_CODE.keys())]),
                        max([lcs_similarity(tag_text,i,type='min') for i in list(STEERING_CONFIG_CODE.keys())]),
                        max([lcs_similarity(tag_text,i,type='min') for i in list(PACKAGE_CODE.keys())]),
                        max([lcs_similarity(tag_text,i,type='min') for i in list(ROOF_CONFIG_CODE.keys())])]
        
        
        if max(match_score) >= 0.5:
            segregated.append(np.argmax(match_score))
        else:
            segregated.append(-1)

    return segregated

In [14]:
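def revert_tense(word, tense):
    # Lemmatize the word with TextBlob; `tense` is passed as the part-of-speech hint (e.g. 'v' for verbs)
    blob = TextBlob(word)
    return blob.words[0].lemmatize(tense)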

def get_word_sentiment(word):
    # Create a custom sentiment lexicon
    custom_lexicon = {'with': 0.5, 'include': 0.5, 'want': 0.5, 'have':0.5, 
                  'without': -0.5, 'exclude': -0.5}
    
    word = revert_tense(word, 'v')
    if word in custom_lexicon:
        return custom_lexicon[word]
    else:
        blob = TextBlob(word, analyzer=NaiveBayesAnalyzer())
        return blob.sentiment.p_pos - blob.sentiment.p_neg

In [63]:
def get_boolean_logic_datastruct(tags, segregated):
    code_dictionaries = [STEERING_CONFIG_CODE, PACKAGE_CODE, ROOF_CONFIG_CODE]
    # Keep the previous tag so that we can reuse its sentiment as well (if there is a conjunction)
    prev_tag = None
    prev_cat = None
    logic = []
    logic_sentiment = []
    for s,t in zip(segregated, tags):
        if s in [1,2,3]:
            print(t)

            ## Getting the sentiment of the head
            head_sentiment = get_word_sentiment(t['head'].text)

            if abs(head_sentiment) < 0.2:
                ## We use the previous one
                t['connotation'] = prev_tag['connotation'] if prev_tag else 'pos'
            else:
                # Checking if there is a negation in the children of head
                if 'neg' in [i.dep_ for i in t['head'].children]:
                    head_sentiment = -head_sentiment

                t['connotation'] = 'pos' if head_sentiment > 0 else 'neg'

            # Fetch the code dictionary
            code_dict = code_dictionaries[s-1]

            # Get the correct code from the dictionary
            code = None
            try:
                # First try to fetch with direct indexing
                code = code_dict[t['text']]
            except KeyError:
                # If not found, try to use the LCS similarity (using type='max' in this case)
                similarity_score = [lcs_similarity(t['text'],i,type='max') for i in list(code_dict.keys())]
                if max(similarity_score) >= 0.5:
                    code = list(code_dict.values())[np.argmax(similarity_score)]

            # If the code ends up being None, then the tag was misclassified

            ''' The crux of the logic:

            - The goal is to divide the boolean string into 2 parts: Sentiment and Code.
            - The code(s) will be joined with either / or + based on the conjunction of the previous term.
            - The code(s) will carry the - sign if the connotation is negative.
            - Since you can't ask for 'and' of multiple components of the same category, we are assuming 
            that they will be joined with a '/' only. Codes that are to be put in parentheses are kept in 
            the same list [].
            
            Example : "+(A/-B)+-C" will have the data structures: [[A,-B],[-C]] and signs: [+,+]
            
            '''
            if code is not None:
                if prev_tag is not None:
                    if prev_cat == s:
                        t['code'] = code if t['connotation']=='pos' else '-'+code
                        logic[-1].append(t)
                    else:
                        t['code'] = code if t['connotation']=='pos' else '-'+code
                        logic.append([t])
                        # The 'or' conjunction between the two categories comes from the head of the first tag of the previous group or the head of the current tag (if singular).
                        # This can be seen in the displacy image above
                        if (len(logic[-2][0]['head_conj']) > 0 and logic[-2][0]['head_conj'][0].text == 'or') or (len(t['head_conj']) > 0 and t['head_conj'][0].text == 'or'):
                            logic_sentiment.append('/')
                        else:
                            logic_sentiment.append('+')

                else:
                    t['code'] = code if t['connotation']=='pos' else '-'+code
                    logic.append([t])
                    logic_sentiment.append('+')

                # print([[i['code'] for i in j] for j in logic])
                # print(logic_sentiment)

            prev_tag = t
            prev_cat = s

    return logic, logic_sentiment


In [64]:
samples

['Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024',
 'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge along with the M Sport Package and right-hand drive on 12th April 2018. Is this configuration possible?',
 'I want to order a BMW iX with right-hand drive configuration. I will be ordering it at the start of October 2022.',
 'I want to order a BMW iX. I do not want a right-hand drive configuration. I will be ordering it at the start of October 2022.']

In [71]:
custom_sample = 'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge or a panorama glass roof or with the M Sport Package or right-hand drive on 12th April 2018. Is this configuration possible?'
tags,_ = get_key_terms_with_pos(custom_sample)
segregated = segregated_tags(tags)

In [72]:
logic, logic_sentiment = get_boolean_logic_datastruct(tags, segregated)

{'values': [sunroof], 'main_token': sunroof, 'text': 'sunroof', 'child_conj': [or], 'head_conj': [or], 'pos': 'NOUN', 'dep': 'pobj', 'head': with}
{'values': [panorama, glass, roof, sky, lounge], 'main_token': lounge, 'text': 'panorama glass roof sky lounge', 'child_conj': [or], 'head_conj': [or], 'pos': 'NOUN', 'dep': 'conj', 'head': sunroof}
{'values': [panorama, glass, roof], 'main_token': roof, 'text': 'panorama glass roof', 'child_conj': [], 'head_conj': [or], 'pos': 'NOUN', 'dep': 'conj', 'head': lounge}
{'values': [M, Sport, Package], 'main_token': Package, 'text': 'M Sport Package', 'child_conj': [or], 'head_conj': [], 'pos': 'PROPN', 'dep': 'pobj', 'head': with}
{'values': [right, hand, drive], 'main_token': drive, 'text': 'right hand drive', 'child_conj': [], 'head_conj': [or], 'pos': 'NOUN', 'dep': 'conj', 'head': Package}
{'values': [possible, configuration], 'main_token': configuration, 'text': 'possible configuration', 'child_conj': [], 'head_conj': [], 'pos': 'NOUN', 'de

In [73]:
print([[i['code'] for i in j] for j in logic])
print(logic_sentiment)

[['S403A', 'S407A', 'S402A'], ['P337A'], ['RL']]
['+', '/', '/']
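The two lists printed above still need to be joined into the final boolean string. A possible sketch follows, matching the "+(A/-B)+-C" example from the docstring. The helper name `render_boolean` is hypothetical, and it takes lists of code strings (as printed in the previous cell) rather than the tag dictionaries.

```python
def render_boolean(groups, signs):
    """Join code groups into a boolean string such as '+(A/-B)+-C'."""
    parts = []
    for codes, sign in zip(groups, signs):
        body = "/".join(codes)       # codes within one category are alternatives
        if len(codes) > 1:
            body = f"({body})"       # only multi-code groups need parentheses
        parts.append(sign + body)
    return "".join(parts)

# Example from the docstring
docstring_rule = render_boolean([["A", "-B"], ["-C"]], ["+", "+"])

# Output of the cell above
rule = render_boolean([['S403A', 'S407A', 'S402A'], ['P337A'], ['RL']], ['+', '/', '/'])
print(docstring_rule)
print(rule)
```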


In [37]:
for t in tags:
    print(t)

{'values': [BMW, iX.], 'main_token': iX., 'text': 'BMW iX.', 'child_conj': [], 'pos': 'PROPN', 'dep': 'dobj', 'head': order}
{'values': [right, hand, drive, configuration], 'main_token': configuration, 'text': 'right hand drive configuration', 'child_conj': [], 'pos': 'NOUN', 'dep': 'dobj', 'head': want, 'connotation': 'neg'}
{'values': [start], 'main_token': start, 'text': 'start', 'child_conj': [], 'pos': 'NOUN', 'dep': 'pobj', 'head': at, 'connotation': 'neg'}
{'values': [October], 'main_token': October, 'text': 'October', 'child_conj': [], 'pos': 'PROPN', 'dep': 'pobj', 'head': of, 'connotation': 'neg'}


#### Getting the Date

In [148]:
import datefinder
text = samples[3]
print(text)
matches = datefinder.find_dates(text)
for match in matches:
    print(match)

I want to order a BMW iX. I do not want a right-hand drive configuration. I will be ordering it at the start of October 2022.
2022-10-13 00:00:00


In [128]:
samples[1]

'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge along with the M Sport Package and right-hand drive on 12th April 2018. Is this configuration possible?'

In [135]:
from sutime import SUTime
import json
from dateparser.search import search_dates

In [137]:
search_dates(samples[3])

[('of October 2022', datetime.datetime(2022, 10, 13, 0, 0))]

In [147]:
!mvn dependency:copy-dependencies -DoutputDirectory=./jars -f $(python3 -c 'import importlib; import pathlib; print(pathlib.Path(importlib.util.find_spec("sutime").origin).parent / "pom.xml")')

'mvn' is not recognized as an internal or external command,
operable program or batch file.


In [145]:
from dateutil.parser import parse

text = samples[3]
date = parse(text, fuzzy=True)
print(date)

2022-10-13 00:00:00
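All three parsers return the 13th for "the start of October 2022". This is presumably the day of the month on which the cell was run, since these libraries fill missing fields from the current date. One option is a small pre-check for phrases like "start of <month> <year>" before falling back to a general parser. The sketch below uses only the standard library. The helper name, the list of recognised keywords, and the choice of the 15th for "middle" are assumptions.

```python
import calendar
import re
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

PATTERN = re.compile(
    r"\b(start|beginning|middle|mid|end)\s+of\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)

def resolve_relative_month(text):
    """Return a date for phrases like 'start of October 2022', else None."""
    m = PATTERN.search(text)
    if not m:
        return None
    part = m.group(1).lower()
    month = MONTHS.index(m.group(2).lower()) + 1
    year = int(m.group(3))
    last_day = calendar.monthrange(year, month)[1]  # handles leap years
    day = {"start": 1, "beginning": 1, "middle": 15, "mid": 15, "end": last_day}[part]
    return date(year, month, day)

resolved = resolve_relative_month(
    "I will be ordering it at the start of October 2022."
)
print(resolved)
```

If the helper returns `None` (for example for "on 12th April 2018"), the text can still be handed to one of the parsers tried above.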
