In [1]:
import sys
sys.path.insert(1, 'Utilities')

# 1.) Model Type Detection

## Strategy for matching the keywords to Sales Description:

#### From what I understand:
- Every request body has just one model code. If a request contains several (or maybe an ambiguous one), we'll need to create multiple request bodies.
- There is also a lot of overlap in the model names, which might cause a problem if people misspell them.

### Potential approach:
- We'll need some way to extract the keywords from the request text and compute a similarity score against each Sales Description.
- We can calculate the score with cosine similarity and then apply softmax to the results.
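
For reference, here are the two quantities named above, written out. The symbols are my own notation and are not defined anywhere else in this notebook: $\mathbf{q}$ is the vector for the extracted keyword, $\mathbf{d}_i$ the vector for the $i$-th sales description, and $N$ the number of sales descriptions.

$$
s_i = \frac{\mathbf{q} \cdot \mathbf{d}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d}_i \rVert},
\qquad
p_i = \frac{e^{s_i}}{\sum_{j=1}^{N} e^{s_j}}
$$

Since $s_i \in [-1, 1]$, any two softmax outputs can differ by a factor of at most $e^{2}$.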

#### How to get the score:

##### Method 1:
- Maybe encode the table with BERT (SBERT would work here) and compare the BERT encodings of the keywords with those of the sales descriptions to find the best match.
- Why BERT? This takes care of spelling and typographical errors, AND if there is some ambiguity, we'll get very similar softmax scores for different sales descriptions.


In [2]:
from utilities import *

In [6]:
model_names = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']
target_embeddings = [get_sbert_embeddings(name) for name in model_names]
test_emb = get_sbert_embeddings('iX xDrive50')

In [7]:
softmax(np.array([get_cosine_similarity(i, test_emb) for i in target_embeddings]))

array([0.16695336, 0.16686176, 0.16672765, 0.16676882, 0.16661336,
       0.16607507], dtype=float32)

##### Problems:
- BERT (or FastText, Word2Vec, etc.) is really heavy and overkill for this problem.
- On top of that, the similarity scores barely differ even for big changes in the model type code, let alone for minor ones (typing errors, partial names, etc.). Even pooling strategies don't work well here.
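
Part of the flatness in the output above probably comes from softmax itself: cosine similarities live in [-1, 1], so closely spaced scores turn into almost uniform probabilities. The sketch below illustrates this with made-up similarity values (not measured from SBERT) and a plain-Python `softmax` that stands in for whatever `utilities` provides.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for six candidates (illustrative values only).
similarities = [1.00, 0.97, 0.95, 0.96, 0.93, 0.90]
probs = softmax(similarities)

# A uniform distribution would give 1/6 per candidate; the spread stays small.
print(probs)
print("max - min:", max(probs) - min(probs))

# Even the extreme pair (+1 vs -1) is capped at a ratio of e**2.
extreme = softmax([1.0, -1.0])
print("largest possible ratio:", extreme[0] / extreme[1])
```

So even a perfect match cannot stand out much unless the raw similarities are well separated, which they are not for these short model names.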

##### Method 2:
- We can calculate the score as the length of the longest common subsequence, expressed as a percentage of the length of the sales description extracted from the text.
- This is a lot more interpretable, and since a typographical error will change the score for every match, the softmax output will remain consistent.
- The rest is the same as the BERT strategy.
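
A minimal sketch of how such a score could be computed. This is not the `lcs_similarity` from `utilities`, whose source is not shown here. `lcs_length` and `lcs_ratio` are hypothetical names, and dividing by the shorter string's length while ignoring case is my assumption. It appears consistent with the numbers printed for 'X7 xDrive40i' in the cells below, but the real helper may normalise differently.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_ratio(candidate, query):
    """LCS length over the shorter string's length, case-insensitive (assumed normalisation)."""
    a, b = candidate.lower(), query.lower()
    shortest = min(len(a), len(b))
    return lcs_length(a, b) / shortest if shortest else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']
scores = [lcs_ratio(c, 'X7 xDrive40i') for c in candidates]
probs = softmax(scores)
best = candidates[probs.index(max(probs))]
print(best)
```

Because the score is a ratio of character counts, it is easy to see why a given candidate ranks where it does, which is the interpretability argument made above.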

In [8]:
target_string = ['iX xDrive50', 'iX xDrive40','X7 xDrive40i', 'X7 xDrive40d','M8','318i']

In [9]:
test_string = 'X7 xDrive40i'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.18498475, 0.20258965, 0.22186999, 0.20413024, 0.08162141,
       0.10480396])

In [10]:
# With spelling error
test_string = 'ix Drve40'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.19776486, 0.221006  , 0.19776486, 0.19776486, 0.08130357,
       0.10439584])

In [11]:
# With ambiguity
test_string = 'Drive40'
softmax(np.array([lcs_similarity(i, test_string) for i in target_string]))

array([0.18416297, 0.21244395, 0.21244395, 0.21244395, 0.07815376,
       0.10035142])

##### Potential Selection method: 
- If two scores are equal up to the 2nd decimal place, we can declare the request ambiguous enough for us to create multiple request bodies.
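
One way this rule could look in code, as a sketch only: `ambiguous_candidates` is a hypothetical helper, and truncating to two decimals (rather than rounding) is my reading of "equal up to the 2nd decimal place". The probabilities are copied from the outputs above.

```python
import math

def ambiguous_candidates(names, probs, decimals=2):
    """Return every name whose score matches the top score up to `decimals` places."""
    factor = 10 ** decimals
    truncated = [math.floor(p * factor) for p in probs]
    top = max(truncated)
    return [n for n, t in zip(names, truncated) if t == top]

names = ['iX xDrive50', 'iX xDrive40', 'X7 xDrive40i', 'X7 xDrive40d', 'M8', '318i']

# Softmax output of the 'Drive40' cell above: three candidates tie.
drive40 = [0.18416297, 0.21244395, 0.21244395, 0.21244395, 0.07815376, 0.10035142]
print(ambiguous_candidates(names, drive40))

# Softmax output of the 'ix Drve40' cell above: a single winner.
misspelt = [0.19776486, 0.221006, 0.19776486, 0.19776486, 0.08130357, 0.10439584]
print(ambiguous_candidates(names, misspelt))
```

Truncation has a boundary case (0.2199 and 0.2201 land in different buckets), so a tolerance on the difference might be the safer variant.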

## 2.) Keyword detection
- Essentially, I need to find all the key terms in the text body and find which of them match the sales descriptions of the model type codes.

#### Here is an idea: 
- The boolean formula task is going to be a POS-tagging job. If I am going to extract nouns there anyway, why not use the same information to find the potential model type codes as well?

## 2 (redefined) Keyword detection with some POS and conjunction context:
- spaCy's 'en_core_web_sm' might be good enough here (the cell below loads 'en_core_web_md').

In [26]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")

# text = "Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024"
# text = 'I want to order a iX xDrive40 with a sunroof or a panorama glass roof sky lounge or with the M Sport Package or right-hand drive on 12th April 2018. Is this configuration possible?'
# text = "I want to order a BMW iX with right-hand drive configuration. I will be ordering it at the start of October 2022."
# text = "I want to order a BMW iX. Please don't add a right-hand drive configuration. I will be ordering it at the start of October 2022."
# text = "I want to order a BMW iX. I do not want a right-hand drive configuration. I will be ordering it at the start of October 2022."
# text = "I want to order a X7 XDrive50 with right-hand drive configuration. I will be ordering it at the start of October 2022."
# text =  'I want to order a BMW xDrive which does Not include panaroma glass roof or sunroof along with a right-hand drive config. Please do not add the m sport package or the m sport package pro. Is a delivery on 13th of December possible?'
text = 'Hello, is the X7 xDrive40i available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024.'.lower()
doc = nlp(text)
displacy.render(doc, style = "dep")

- Most of these keywords look like they are compound NOUNs or PROPNs. Some are joined with amod though (like the model names or the ones separated with -).
- The one with "panorama glass roof sky lounge" gives a weird compound structure (this can be solved with a recursive function that extracts the compounds).

- This is also going to require some sort of reduction, as the lookup function for compound words creates redundant lists. We can check whether a term is a subset of another term and, if so, remove it.

- The with/without etc. are ADP tokens and can be taken directly from the head of the NOUN/PROPN. That gives me my AND / AND-NOT boolean logic. (However, with a grammatical mistake in the request, the head might not end up being a direct "with" or "without". In that case, I'll need some way to determine whether the head has a positive or negative connotation.)

- Now about OR. <b>Here is an idea</b>: you technically cannot ask for two configurations of the same thing (like a sunroof and a panorama glass roof). So if I find two terms that belong to the roof configuration, they'll go into a bracket with OR/OR-NOT. (To reinforce this, I can also check for any child conjunctions that these terms have.)

- Another observation: if the sentence is "A or B", then the head of A is usually wrong. If both belong to the same class, we can assume both have the same head as the one for B.
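
The recursive compound extraction and the subset reduction from the notes above could be sketched as follows. `Node` is a hand-built stand-in for a parsed token so the example runs without spaCy. With a real `doc`, the same walk would use `Token.children`, `Token.dep_` and `Token.i`. `collect_compound` and `drop_subsets` are hypothetical names, not the functions in `pos_tagging`.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Stand-in for a parsed token: text, position, dependency label, children."""
    text: str
    i: int
    dep: str = ""
    children: list = field(default_factory=list)

def collect_compound(node, labels=("compound", "amod")):
    """Recursively gather node plus every descendant attached via `labels`."""
    found = [node]
    for child in node.children:
        if child.dep in labels:
            found.extend(collect_compound(child, labels))
    return sorted(found, key=lambda n: n.i)

def drop_subsets(terms):
    """Keep only terms whose word set is not strictly contained in another term's."""
    word_sets = [set(t.split()) for t in terms]
    return [t for t, ws in zip(terms, word_sets)
            if not any(ws < other for other in word_sets)]

# Hand-built tree for "panorama glass roof" with nested compounds.
panorama = Node("panorama", 0, "compound")
glass = Node("glass", 1, "compound", [panorama])
roof = Node("roof", 2, "pobj", [glass])

term = " ".join(n.text for n in collect_compound(roof))
print(term)

unique = drop_subsets(["glass", "panorama glass roof", "EU", "Comfort", "EU Comfort Package"])
print(unique)
```

Comparing word sets ignores word order, which is enough to drop partial compounds like 'glass' or 'EU'.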

In [3]:
from pos_tagging import *

In [35]:
samples = ["Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024",
'I am planning to order the BMW M8 with a sunroof or a panorama glass roof sky lounge along with the M Sport Package and right-hand drive on 12th April 2018. Is this configuration possible?',
"I want to order a BMW iX XDrive40 model with right-hand drive configuration. I will be ordering it at the start of October 2022.",
"I want to order a BMW iX. I do not want a right-hand drive configuration. I will be ordering it at the start of October 2022."]


for text in samples:
    print("Prompt: ", text, '\n')
    unique_tags, all_tags = get_key_terms_with_pos(text)

    for t in all_tags:
        print(t)

    print('\n------------------\nUnique:')

    for t in unique_tags:
        print(t)

    print('\n\n\n')

Prompt:  Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024 

{'values': [available, M8], 'main_token': M8, 'text': 'available M8', 'child_conj': [], 'head_conj': [], 'pos': 'PROPN', 'dep': 'attr', 'head': is}
{'values': [glass], 'main_token': glass, 'text': 'glass', 'child_conj': [], 'head_conj': [], 'pos': 'NOUN', 'dep': 'compound', 'head': roof}
{'values': [panorama, glass, roof], 'main_token': roof, 'text': 'panorama glass roof', 'child_conj': [], 'head_conj': [and], 'pos': 'NOUN', 'dep': 'pobj', 'head': without}
{'values': [EU], 'main_token': EU, 'text': 'EU', 'child_conj': [], 'head_conj': [], 'pos': 'PROPN', 'dep': 'compound', 'head': Package}
{'values': [Comfort], 'main_token': Comfort, 'text': 'Comfort', 'child_conj': [], 'head_conj': [], 'pos': 'PROPN', 'dep': 'compound', 'head': Package}
{'values': [EU, Comfort, Package], 'main_token': Package, 'text': 'EU Comfort Package', 'child_conj': []

#### Let's try to extract the boolean rule

Methods that have failed so far:
- TextBlob for getting the sentiment
- vaderSentiment

One possible way here is to lemmatize the head term and look it up in a dictionary of positive and negative terms.
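
A rough sketch of that lookup idea, combined with the "same category means OR" rule from earlier. Everything named here is hypothetical: the `POLARITY` and `CATEGORY` tables are tiny made-up examples, unknown heads are assumed positive, and option names are used instead of real option codes. The output only mimics the shape of the formulas printed in this notebook (`+` to join groups, `/` inside a bracket, `-` for a negated group).

```python
# Hypothetical lemma -> polarity table; a real one would need many more entries.
POLARITY = {"with": True, "include": True, "add": True,
            "without": False, "exclude": False, "remove": False}

# Hypothetical option -> category table (not taken from the real catalogue).
CATEGORY = {"sunroof": "roof", "panorama glass roof": "roof",
            "eu comfort package": "comfort"}

def polarity(head_lemma, negated=False):
    """Polarity of a head lemma; unknown heads default to positive (assumption)."""
    base = POLARITY.get(head_lemma.lower(), True)
    return base != negated  # an attached negation flips the polarity

def to_formula(options):
    """options: (name, head_lemma) pairs. Same-category options are OR-ed, groups joined with '+'."""
    groups = {}
    for name, head in options:
        key = (CATEGORY.get(name, name), polarity(head))
        groups.setdefault(key, []).append(name)
    parts = []
    for (_, positive), names in groups.items():
        body = names[0] if len(names) == 1 else "(" + "/".join(names) + ")"
        parts.append("+" + ("" if positive else "-") + body)
    return "".join(parts)

formula = to_formula([("sunroof", "with"),
                      ("panorama glass roof", "with"),
                      ("eu comfort package", "without")])
print(formula)
```

A dictionary like this is easy to inspect and extend, though it will miss heads that never made it into the table.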

In [4]:
from request_body_creation import *

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\GP65\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\GP65\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\GP65\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\GP65\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\GP65\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\GP65\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [20]:
custom_sample = 'On December 6th, I need a BMW M8 with a sunroof or panaroma glass roof sky lounge or with the M Sport Package. How much would it cost?'#.title()
# custom_sample = "I'm looking to buy the X7 xDrive40i without the M Sport Package or a sunroof on November 4th, 2027. Is that configuration possible?"
tags,_ = get_key_terms_with_pos(custom_sample)
segregated = segregated_tags(tags)

In [21]:
logic, logic_sentiment = get_boolean_logic_datastruct(tags, segregated)

In [11]:
boolean_formula = convert_to_boolean_formula(logic, logic_sentiment)
print(boolean_formula)

+S407A/P337A


#### Getting the Model Type Code


In [16]:
model_type_codes = get_model_type_codes(tags,segregated)
print(model_type_codes)

['DZ01', '21CF', '11CF']


### Getting the Date

In [36]:
import datefinder
text = samples[0]
print(text)
matches = datefinder.find_dates(text)
for match in matches:
    print(match.date().strftime("%Y-%m-%d"))

Hello, is the M8 available without a panorama glass roof and with the EU Comfort Package. I need the vehicle on the 8th of November 2024
2024-11-08


#### Problem:
- This cannot deal with statements like "Start of ....."; it uses the middle of the month instead.
- Other libraries like dateutils also fail here.
- SUTime might be able to identify this, but the dependency is too large to download for this application.
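
A lightweight workaround could be a small pre-processing step that resolves "start of" / "end of" phrases before handing the text to datefinder. This is only a sketch using the standard library: `resolve_relative_date` is a hypothetical helper, and it assumes English month names and an explicit four-digit year.

```python
import calendar
import re
from datetime import date

# Map "january" -> 1, ..., "december" -> 12 (index 0 of month_name is empty).
MONTHS = {name.lower(): idx for idx, name in enumerate(calendar.month_name) if name}
PATTERN = re.compile(
    r"\b(start|beginning|end)\s+of\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)

def resolve_relative_date(text):
    """Return a date for 'start/end of <Month> <Year>', or None if no such phrase exists."""
    match = PATTERN.search(text)
    if not match:
        return None
    anchor = match.group(1).lower()
    month = MONTHS[match.group(2).lower()]
    year = int(match.group(3))
    if anchor == "end":
        day = calendar.monthrange(year, month)[1]  # number of days in that month
    else:
        day = 1
    return date(year, month, day)

start = resolve_relative_date("I will be ordering it at the start of October 2022.")
end = resolve_relative_date("Delivery at the end of October 2022, please.")
print(start, end)
```

Texts without such a phrase return `None`, so the existing datefinder path can stay as the fallback.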

### Getting the Request Body

In [32]:
custom_sample = 'I am planning to order the BMW xDrive with a sunroof or a panorama glass roof sky lounge along with the M Sport Package or M Sport Package Pro and right-hand drive on 12th April. Is this configuration possible?'
get_request_body(custom_sample)

([{'modelTypeCodes': ['21CF'],
   'booleanFormulas': ['+(S403A/S407A)+(P337A/P33BA)+RL'],
   'dates': ['2023-04-12']},
  {'modelTypeCodes': ['11CF'],
   'booleanFormulas': ['+(S403A/S407A)+(P337A/P33BA)+RL'],
   'dates': ['2023-04-12']},
  {'modelTypeCodes': ['21EM'],
   'booleanFormulas': ['+(S403A/S407A)+(P337A/P33BA)+RL'],
   'dates': ['2023-04-12']},
  {'modelTypeCodes': ['21EN'],
   'booleanFormulas': ['+(S403A/S407A)+(P337A/P33BA)+RL'],
   'dates': ['2023-04-12']}],
 [['21CF', '11CF', '21EM', '21EN'],
  '+(S403A/S407A)+(P337A/P33BA)+RL',
  '2023-04-12'])
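
The first element of the result above has one request body per model type code, all sharing the same formula and date. A minimal sketch of that fan-out, assuming the three values are already known (`build_request_bodies` is a hypothetical helper, not part of `request_body_creation`):

```python
def build_request_bodies(model_codes, formula, iso_date):
    """One body per model code; the formula and date are shared by all of them."""
    return [
        {"modelTypeCodes": [code], "booleanFormulas": [formula], "dates": [iso_date]}
        for code in model_codes
    ]

bodies = build_request_bodies(
    ["21CF", "11CF", "21EM", "21EN"],
    "+(S403A/S407A)+(P337A/P33BA)+RL",
    "2023-04-12",
)
for body in bodies:
    print(body)
```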

### Running on Test Cases (courtesy of ChatGPT)

In [5]:
import pandas as pd
from tqdm.notebook import tqdm
test_case = pd.read_csv('./test_cases.csv')
# Fixing the datetime format
test_case['Date'] = test_case['Date'].apply(lambda x: list(datefinder.find_dates(' '.join(x.split(' ')[1:])))[0])

In [6]:
prompts,values = [],[]

for text in tqdm(test_case["Prompts"]):
    try:
        request_body, request_values = get_request_body(text, exception_handeling= False)

        prompts.append(request_body)
        values.append(request_values)
    except Exception as e:
        print(e)
        prompts.append([None])
        values.append([None,None,None])


  0%|          | 0/39 [00:00<?, ?it/s]

I'm looking to buy the X7 xDrive40i without the M Sport Package or a sunroof on November 4th, 2027. Is that configuration possible?
{'values': [X7, xDrive40i], 'main_token': xDrive40i, 'text': 'X7 xDrive40i', 'child_conj': [], 'head_conj': [], 'pos': 'PROPN', 'dep': 'dobj', 'head': buy} 0
{'values': [M, Sport, Package], 'main_token': Package, 'text': 'M Sport Package', 'child_conj': [or], 'head_conj': [], 'pos': 'PROPN', 'dep': 'pobj', 'head': without} 2
{'values': [sunroof], 'main_token': sunroof, 'text': 'sunroof', 'child_conj': [], 'head_conj': [or], 'pos': 'NOUN', 'dep': 'conj', 'head': Package} 3
{'values': [November, 4th], 'main_token': 4th, 'text': 'November 4th', 'child_conj': [], 'head_conj': [], 'pos': 'NOUN', 'dep': 'pobj', 'head': on} 0
{'values': [configuration], 'main_token': configuration, 'text': 'configuration', 'child_conj': [], 'head_conj': [], 'pos': 'NOUN', 'dep': 'nsubj', 'head': Is} -1

I'm planning to buy the BMW M8 without a Panorama Glass Roof Sky Lounge and l

In [7]:
result = pd.DataFrame(values)
result.iloc[:,1] = result.iloc[:,1].astype(str).apply(lambda a : str([a]))

In [8]:
print(f"ModelTypeCode accuracy: {(test_case.iloc[:,1] == result.iloc[:,0].astype(str)).sum()}/{len(test_case)}")
print(f"booleanFormulas accuracy: {(test_case.iloc[:,2] == result.iloc[:,1].astype(str)).sum()}/{len(test_case)}")
print(f"dates accuracy: {(pd.to_datetime(test_case.iloc[:,3]) == pd.to_datetime(result.iloc[:,2])).sum()}/{len(test_case)}")

ModelTypeCode accuracy: 39/39
booleanFormulas accuracy: 39/39
dates accuracy: 38/39


In [12]:
# Checking the error
test_case[~(pd.to_datetime(test_case.iloc[:,3]) == pd.to_datetime(result.iloc[:,2]))]

| Unnamed: 0 | Prompts | Model | Boolean | Date |
|---|---|---|---|---|
| 20 | Can you tell me if it's possible to order a BM... | ['11CF'] | ['+-(S403A/S402A)+P33BA'] | 2022-10-31 |


#### Observations:

- All the model types and boolean formulas from these test cases are detected correctly.
- The one date mismatch occurs because datefinder does not recognize "End of October" as 31st October.