```
Why Text Search 2.0?


The existing Text Search is largely obsolete for the following reasons:

1. It cannot establish the relation between categories and features, which is essential for contextual search

Eg - Red Salwar with Blue Dupatta
Co - [‘Red’, ‘Blue’], Ca - [‘Salwar’, ‘Dupatta’]

What it should actually extract is
Tag Extracted - [{“Co”:[“Red”], “Ca”:[“Salwar”]}, {“Co”:[“Blue”], “Ca”:[“Dupatta”]}]

So that we can form a query like
fq:[(“color:Red” AND “category:Salwar”) OR (“color:Blue” AND “category:Dupatta”)]


2. It cannot be used to understand semantic context

Eg - Android Phones under 20000
Here "under 20000" means a price under 20000, so that we can perform range boosting
bq:(Price:[1 TO 20000])^5

3. spaCy and other modern NLP libraries offer state-of-the-art solutions and facilities for customized training.
Also, different architectures can be quickly iterated on and tested using modern NLP libraries.
```
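
As a rough illustration of points 1 and 2 above, the sketch below turns grouped tags into a filter query and a price limit into a range boost. The helper names `build_filter_query` and `build_range_boost` are hypothetical, and the quoting style (`field:"value"`) is an assumption about how the search backend expects terms, not the production query builder.

```python
# Hypothetical mapping from tag abbreviations to index field names.
FIELD_NAMES = {"Co": "color", "Ca": "category"}


def build_filter_query(tag_groups):
    """Join each group's tags with AND, then join the groups with OR."""
    group_clauses = []
    for group in tag_groups:
        terms = []
        for tag, values in group.items():
            for value in values:
                terms.append(f'{FIELD_NAMES[tag]}:"{value}"')
        group_clauses.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(group_clauses)


def build_range_boost(field, low, high, boost):
    """Build a boosted range clause such as (Price:[1 TO 20000])^5."""
    return f"({field}:[{low} TO {high}])^{boost}"


tags = [{"Co": ["Red"], "Ca": ["Salwar"]}, {"Co": ["Blue"], "Ca": ["Dupatta"]}]
print("fq:", build_filter_query(tags))
print("bq:", build_range_boost("Price", 1, 20000, 5))
```

Keeping each colour next to its own category in one group is what prevents "Red Dupatta" from matching the example query.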

Pip Package Installations

In [0]:
%sh pip install word2number

In [0]:
%sh pip install -U pip setuptools wheel

In [0]:
%sh pip install -U spacy

In [0]:
%sh python -m spacy download en_core_web_sm

In [0]:
%sh python -m spacy download en_core_web_lg

In [0]:
%sh python -m spacy download en_core_web_trf

In [0]:
%sh pip install networkx

In [0]:
import re
import pandas as pd
import bs4
import requests
import spacy
from spacy import displacy
from spacy.matcher import Matcher 
from spacy.tokens import Span 

import networkx as nx

import matplotlib.pyplot as plt
from tqdm import tqdm
from collections import defaultdict

pd.set_option('display.max_colwidth', 200)
%matplotlib inline

Performance comparison of spaCy English models, and of spaCy against other NER/POS models

```
Different Spacy English Models
sm - small
lg - large
trf - transformer model


PIPELINE                      PARSER   TAGGER   NER
en_core_web_trf (spaCy v3)    95.1     97.8     89.8
en_core_web_lg (spaCy v3)     92.0     97.4     85.5
en_core_web_lg (spaCy v2)     91.9     97.2     85.5
Full pipeline accuracy on the OntoNotes 5.0 corpus (reported on the development set).

NAMED ENTITY RECOGNITION SYSTEM   ONTONOTES   CONLL ‘03
spaCy RoBERTa (2020)              89.8        91.6
Stanza (StanfordNLP)              88.8        92.1
Flair                             89.7        93.1
```

In [0]:
# nlp = spacy.load('en_core_web_sm')
nlp = spacy.load('en_core_web_lg')
# nlp = spacy.load('en_core_web_trf')

In [0]:
import pandas as pd
# candidate_sentences = spark.read.format("csv").load("dbfs:/FileStore/shared_uploads/t_karthik.ragunath@tatadigital.com/qa.csv", delimiter='\t', encoding='utf-8', index_col=0)
candidate_sentences = pd.read_csv("/dbfs/FileStore/shared_uploads/t_karthik.ragunath@tatadigital.com/qa.csv", delimiter='\t', encoding='utf-8', index_col=0)
check_df = pd.read_csv("/dbfs/FileStore/shared_uploads/t_karthik.ragunath@tatadigital.com/data_cleanser.csv",  header=0, delimiter='\t', error_bad_lines=False)

(1) Preprocessing: convert unit and currency words to symbols

```
Why?

POS and NER taggers tend to work better when units and currencies are written as explicit symbols
```

In [0]:
symbol_extraction_dictionary = {
    "length_symbols":{
        "km",
        "hm",
        "dam",
        "m",
        "dm",
        "cm",
        "mm"
    },

    "length_values":{
        "kilometre",
        "hectometre",
        "decametre",
        "metre",
        "decimetre",
        "centimetre",
        "millimetre",
        "kilometer",
        "hectometer",
        "decameter",
        "meter",
        "decimeter",
        "centimeter",
        "millimeter"
    },

    "weight_symbols":{
        "t",
        "kg",
        "hg",
        "dag",
        "g",
        "dg",
        "cg", 
        "mg"
    },

    "weight_values":{
        "tonne",
        "kilogram",
        "hectogram",
        "decagram",
        "gram",
        "decigram",
        "centigram",
        "milligram"
    },

    "volume_symbols":{
        "kL",
        "hL",
        "daL",
        "L",
        "dL",
        "cL",
        "mL"
    },

    "volume_values":{
        "kilolitre",
        "hectolitre",
        "decalitre",
        "litre",
        "decilitre",
        "centilitre",
        "millilitre",
        "kiloliter",
        "hectoliter",
        "decaliter",
        "liter",
        "deciliter",
        "centiliter",
        "milliliter"
    }
}

value_symbol_conversion = {
    "kilometre": "km",
    "hectometre": "hm",
    "decametre": "dam",
    "metre": "m",
    "decimetre": "dm",
    "centimetre": "cm",
    "millimetre": "mm",
    "kilometer": "km",
    "hectometer": "hm",
    "decameter": "dam",
    "meter": "m",
    "decimeter": "dm",
    "centimeter": "cm",
    "millimeter": "mm",
    "kilometres": "km",
    "hectometres": "hm",
    "decametres": "dam",
    "metres": "m",
    "decimetres": "dm",
    "centimetres": "cm",
    "millimetres": "mm",
    "kilometers": "km",
    "hectometers": "hm",
    "decameters": "dam",
    "meters": "m",
    "decimeters": "dm",
    "centimeters": "cm",
    "millimeters": "mm",

    "tonne": "t",
    "kilogram":	"kg",
    "hectogram": "hg",
    "decagram":	"dag",
    "gram":	"g",
    "decigram":	"dg",
    "centigram": "cg",
    "milligram": "mg",
    "tonnes": "t",
    "kilograms": "kg",
    "hectograms": "hg",
    "decagrams": "dag",
    "grams": "g",
    "decigrams": "dg",
    "centigrams": "cg",
    "milligrams": "mg",

    "kilolitre": "kL",
    "hectolitre": "hL",
    "decalitre": "daL",
    "litre": "L",
    "decilitre": "dL",
    "centilitre": "cL",
    "millilitre": "mL",
    "kiloliter": "kL",
    "hectoliter": "hL",
    "decaliter": "daL",
    "liter": "L",
    "deciliter": "dL",
    "centiliter": "cL",
    "milliliter": "mL",
    "kilolitres": "kL",
    "hectolitres": "hL",
    "decalitres": "daL",
    "litres": "L",
    "decilitres": "dL",
    "centilitres": "cL",
    "millilitres": "mL",
    "kiloliters": "kL",
    "hectoliters": "hL",
    "decaliters": "daL",
    "liters": "L",
    "deciliters": "dL",
    "centiliters": "cL",
    "milliliters": "mL",
    
    "dollars": "$",
    "Dollars": "$",
    "dollar": "$",
    "Dollar": "$",
    "Euro": "€",
    "euro": "€",
    "Euros": "€",
    "euros": "€",
    "Pound": "£",
    "pound": "£",
    "Pounds": "£",
    "pounds": "£",
    "Rupee": "₹",
    "rupee": "₹",
    "Rupees": "₹",
    "rupees": "₹"
}

In [0]:
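def replace_with_symbols(sent):
    # Replace unit and currency words with their symbols. Longer keys go first so
    # that "kilometres" wins over "metres"; the letter lookarounds stop words such
    # as "parameter" or "compound" from being rewritten by a substring match.
    keys_sorted = sorted(value_symbol_conversion.keys(), key=len, reverse=True)
    for key in keys_sorted:
        pattern = r"(?<![A-Za-z])" + re.escape(key) + r"(?![A-Za-z])"
        sent = re.sub(pattern, value_symbol_conversion[key], sent)
    return sent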

In [0]:
string = "Puma Shoes above ten thousand and two hundred dollars"
replacement_string = replace_with_symbols(string)
print(replacement_string)

In [0]:
string = "Playstation between four thousand and five thousand rupees"
replacement_string = replace_with_symbols(string)
print(replacement_string)

(2) Convert number words to numbers for semantic understanding and query parsing

Eg: fifty thousand and five hundred - 50500, 
      Four thousand and five thousand - [4000, 5000]

In [0]:
from word2number import w2n

In [0]:
def convert_word_to_number(sent):
    doc_num = nlp(sent)
    doc_num_len = len(doc_num)
    k = 0
    word_2_num_dict = {}  # maps each number phrase found in the text to its integer value
    while k < doc_num_len:
        tok_n = doc_num[k]
        if tok_n.pos_.lower() == 'num':
            if not tok_n.text.isnumeric():
                word_num = tok_n.text
                l = k + 1
                while l < doc_num_len:
                    tok_n_next = doc_num[l]
                    if (tok_n_next.pos_.lower() == 'num' and not tok_n_next.text.isnumeric()):
                        word_num += " " + tok_n_next.text
                    elif tok_n_next.pos_.lower() == 'cconj':
                        if l + 1 < doc_num_len and (doc_num[l + 1].pos_.lower() == 'num' and not doc_num[l + 1].text.isnumeric()):
                            word_num += " " + tok_n_next.text
                        else:
                            break
                    else:
                        break
                    l += 1
                try:
                    word_2_num_dict[word_num] = w2n.word_to_num(word_num)
                except:
                    nlp_word_num = nlp(word_num)
                    nlp_word_num_len = len(nlp_word_num)
                    word_num = ""
                    
                    for index in range(nlp_word_num_len):
                        word_tok = nlp_word_num[index]
                        if word_tok.pos_.lower() == 'num':
                            word_num += " " + word_tok.text
                            if index == (nlp_word_num_len - 1):
                                try:
                                    word_num = word_num.strip()
                                    word_2_num_dict[word_num] = w2n.word_to_num(word_num)
                                    word_num = ""
                                except:
                                    print("exception", "word num:", word_num)   
                        else:
                            try:
                                word_num = word_num.strip()
                                word_2_num_dict[word_num] = w2n.word_to_num(word_num)
                                word_num = ""
                            except:
                                print("exception", "word num:", word_num)

                k = l - 1
        k += 1        
    for key, val in word_2_num_dict.items():
        sent = sent.replace(key, str(val))
    return sent

In [0]:
word_num_to_be_converted = "forty thousand four hundred and eighty six"
number_string = convert_word_to_number(word_num_to_be_converted)
print("Number String:", number_string)

In [0]:
word_num_to_be_converted = "four thousand eighty six and five thousand sixty nine"
number_string = convert_word_to_number(word_num_to_be_converted)
print("Number String:", number_string)

In [0]:
# TODO: make the code generic enough to handle a string containing two number phrases that both include 'and'
# TODO: handle Indian-system number words (e.g. lakh, crore)
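
One possible heuristic for the open case above, where a string holds two number phrases and both may contain 'and', is sketched below. It is a standalone illustration and not how `word2number` works: a chunk after 'and' is treated as a continuation of the previous number only when its value is smaller than the place value of the previous chunk's last word. The names `segment_value`, `place_of` and `split_number_words` are hypothetical, and the input is assumed to contain only number words and 'and'.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"hundred": 100, "thousand": 1000, "million": 1000000}


def segment_value(words):
    """Value of a number phrase that contains no 'and'."""
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        else:  # "thousand" or "million"; any other word raises KeyError
            total += max(current, 1) * SCALES[word]
            current = 0
    return total + current


def place_of(word):
    """Place value of a number word: 1, 10, 100, 1000, ..."""
    if word in SCALES:
        return SCALES[word]
    return 10 if word in TENS else 1


def split_number_words(phrase):
    """Return the list of integers expressed by a phrase of number words."""
    values = []
    last_place = None
    for chunk in phrase.lower().split(" and "):
        words = chunk.split()
        value = segment_value(words)
        if values and value < last_place:
            values[-1] += value    # 'and' continues the same number
        else:
            values.append(value)   # 'and' separates two numbers
        last_place = place_of(words[-1])
    return values


print(split_number_words("fifty thousand and five hundred"))
print(split_number_words("four thousand and five thousand"))
```

The two calls correspond to the examples given for this step: the first yields a single number and the second a pair.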

(3) Extract Preposition and Contextual Meaning Around Them
 
 Eg - Red Shirts between 4000 and 5000 dollars
 
  Preposition - between
  
  Preposition Meaning - 4000 and 5000 $

In [0]:
def get_preposition_meaning(token, cur_string):
    left_string = ""
    right_string = ""
    for left_val in token.lefts:
        left_string += " " + left_val.text
        left_string = get_preposition_meaning(left_val, left_string)
        
    for right_val in token.rights:
        right_string += " " + right_val.text
        right_string = get_preposition_meaning(right_val, right_string)
        
    cur_string = left_string + " " + cur_string + " " + right_string
    cur_string = cur_string.strip()
    cur_string = " ".join(cur_string.split())
#     print(cur_string, list(token.lefts), list(token.rights))
    return cur_string

In [0]:
def get_preposition_and_preposition_meaning(sent):
    sent = replace_with_symbols(sent)
    sent = convert_word_to_number(sent)
    doc = nlp(sent)
    doc_len = len(doc)
    i = 0
    preposition_list = []
    preposition_meaning_list = []
    while i < doc_len:
        tok = doc[i]
        if tok.dep_ == 'prep':
            prep_string = ""
            preposition_list.append(tok.text)
            preposition_meaning_list.append(get_preposition_meaning(tok, prep_string))
        i += 1
    return preposition_list, preposition_meaning_list

In [0]:
string = "Playstation between 4000 and 5000 ₹" 
preposition_list, preposition_meaning_list = get_preposition_and_preposition_meaning(string)
print(preposition_list, preposition_meaning_list)

In [0]:
string = "Puma Shoes above ten thousand and two hundred dollars"
preposition_list, preposition_meaning_list = get_preposition_and_preposition_meaning(string)
print(preposition_list, preposition_meaning_list)

In [0]:
string = "Preeti grinder under ten kilograms"
preposition_list, preposition_meaning_list = get_preposition_and_preposition_meaning(string)
print(preposition_list, preposition_meaning_list)
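
Once a preposition and its meaning are extracted, they can be mapped to a range clause for the search query. The sketch below is one hedged way to do that: the preposition sets, the helper name `preposition_to_range` and the default `Price` field are assumptions for illustration, and open-ended bounds use `*` as in Solr-style range syntax.

```python
import re

# Hypothetical grouping of prepositions by the kind of bound they imply.
UPPER_BOUND_PREPS = {"under", "below"}
LOWER_BOUND_PREPS = {"above", "over"}


def preposition_to_range(preposition, meaning, field="Price"):
    """Build a range clause from a preposition and its extracted meaning."""
    numbers = [int(n) for n in re.findall(r"\d+", meaning)]
    prep = preposition.lower()
    if prep == "between" and len(numbers) >= 2:
        low, high = sorted(numbers[:2])
        return f"{field}:[{low} TO {high}]"
    if prep in UPPER_BOUND_PREPS and numbers:
        return f"{field}:[* TO {numbers[0]}]"
    if prep in LOWER_BOUND_PREPS and numbers:
        return f"{field}:[{numbers[0]} TO *]"
    return None  # no range can be derived from this preposition


print(preposition_to_range("between", "4000 and 5000 ₹"))
print(preposition_to_range("above", "10200 $"))
print(preposition_to_range("under", "10 kg", field="Weight"))
```

In practice the field would have to be chosen from the unit or currency symbol found in the meaning (for example a weight field for `kg`); here it is simply passed in by the caller.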

(4) Reverse map facet values to facet fields and field types

Taking the case of brand data from thredup

Note: I tried doing this with the NER tagger, but not enough data is available and I end up corrupting the pre-trained weights. I still have to work on training a custom NER.

In [0]:
from collections import defaultdict
def create_brand_dict():
    check_df = pd.read_csv("/dbfs/FileStore/shared_uploads/t_karthik.ragunath@tatadigital.com/data_cleanser.csv",  header=0, delimiter='\t', error_bad_lines=False)
    check_df['data_checker'] = check_df['data_checker'].fillna('').str.lower()
    data_checker_list = list(check_df['data_checker'])
    max_len = 0
    for ind_data in data_checker_list:
        comma_sep_data = ind_data.split(',')
        for datum in comma_sep_data:
            datum = ' '.join(datum.split())
            try:
                length = len(datum.split(' '))
            except:
                print(datum, comma_sep_data)
            if length > max_len:
                max_len = length

    data_checker_dict = defaultdict(set)
    data_checker_set = set(data_checker_list)
    words_to_remove = ['dress', 'skirt', 'black']  # lowercase, because the data is lowercased above
    data_checker_set = set(filter(lambda x:(len(x)!=1 and x not in words_to_remove), data_checker_set))
    for ind_data in data_checker_list:
        comma_sep_data = ind_data.split(',')
        for datum in comma_sep_data:
            datum = ' '.join(datum.split())
            datum_split = datum.split(' ')
            length = len(datum_split)
            initial_key = ""
            for i in range(length):
                data_checker_dict[initial_key].add(' '.join(datum_split[0:(i + 1)]))
                initial_key = ' '.join(datum_split[0:(i + 1)])
    return data_checker_dict, data_checker_set

In [0]:
data_checker_dict, data_checker_set = create_brand_dict()

In [0]:
def fetch_brand_overlaps():
    candidate_sentences['Query'] = candidate_sentences['Query'].fillna('').str.lower()
    brand_list = []
    sentence_cleaned_of_brands = []
    for k in tqdm(candidate_sentences["Query"]):
        sentence_split = k.split()
        split_len = len(sentence_split)
        brand_str = ""
        initial_key = ""

        for i in range(split_len):
            if sentence_split[i] in data_checker_dict[""]:
                start_index = i
                j = 2
                loop_index = i + 1
                brand_tuple_index = tuple()
                while loop_index <= split_len and " ".join(sentence_split[start_index:loop_index]) in data_checker_dict[" ".join(sentence_split[start_index:(loop_index - 1)])]:
                    if " ".join(sentence_split[start_index:loop_index]) in data_checker_set:
                        brand_tuple_index = (start_index, loop_index) 
                    j += 1
                    loop_index += 1
                if brand_tuple_index:
                    brand_str = ' '.join(sentence_split[brand_tuple_index[0]:brand_tuple_index[1]])
                    brand_list.append(brand_str)
#                     print('brand:', brand_str, 'sentence:', k)
                    sentence_split[brand_tuple_index[0]:brand_tuple_index[1]] = []
                    break
        if not brand_str:
            brand_list.append("")
        sentence_cleaned_of_brands.append(' '.join(sentence_split))
    return brand_list, sentence_cleaned_of_brands

In [0]:
brand_list, sentence_cleaned_of_brands = fetch_brand_overlaps()
brand_set = set(brand_list)

In [0]:
brand_set
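
The brand lookup above relies on a prefix dictionary: each word-level prefix maps to the set of prefixes that are one word longer, which lets the matcher extend a candidate word by word and keep the longest complete brand. A minimal standalone sketch of that data structure follows. The brand names are made-up toy data, and `build_prefix_map` and `longest_match` are hypothetical names, not the functions used in this notebook.

```python
from collections import defaultdict


def build_prefix_map(phrases):
    """Map each word-level prefix to the set of prefixes one word longer."""
    prefix_map = defaultdict(set)
    for phrase in phrases:
        words = phrase.split()
        prefix = ""
        for i in range(len(words)):
            longer = " ".join(words[: i + 1])
            prefix_map[prefix].add(longer)
            prefix = longer
    return prefix_map


def longest_match(words, start, prefix_map, phrases):
    """Return (start, end) of the longest phrase beginning at `start`, or None."""
    best = None
    end = start + 1
    while end <= len(words):
        candidate = " ".join(words[start:end])
        parent = " ".join(words[start:end - 1])
        if candidate not in prefix_map.get(parent, ()):
            break  # no known phrase continues with this word
        if candidate in phrases:
            best = (start, end)
        end += 1
    return best


brands = {"puma", "calvin klein", "calvin klein jeans"}  # toy data
prefix_map = build_prefix_map(brands)
words = "blue calvin klein jeans jacket".split()
span = longest_match(words, 1, prefix_map, brands)
print(" ".join(words[span[0]:span[1]]))
```

Using `prefix_map.get` instead of indexing avoids adding empty entries to the `defaultdict` during lookups.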

Unsupervised feature extraction: extract categories and qualities separately and create a dependency mapping between them

Built entirely on the spaCy English language model (large)

For more details on the accuracy and performance of this model, see

https://spacy.io/models/en

In [0]:
def clean_text(str_data):
    str_data = str_data.strip()
    str_data = " ".join(str_data.split())
    return str_data

In [0]:
from collections import defaultdict
def create_brand_dict_case_sensitive():
    check_df = pd.read_csv("/dbfs/FileStore/shared_uploads/t_karthik.ragunath@tatadigital.com/data_cleanser.csv",  header=0, delimiter='\t', error_bad_lines=False)
    check_df['data_checker'] = check_df['data_checker'].fillna('')
    data_checker_list = list(check_df['data_checker'])
    max_len = 0
    for ind_data in data_checker_list:
        comma_sep_data = ind_data.split(',')
        for datum in comma_sep_data:
            datum = ' '.join(datum.split())
            try:
                length = len(datum.split(' '))
            except:
                print(datum, comma_sep_data)
            if length > max_len:
                max_len = length

    data_checker_dict = defaultdict(set)
    data_checker_set = set(data_checker_list)
    words_to_remove = ['Dress', 'Skirt', 'Black']
    data_checker_set = set(filter(lambda x:(len(x)!=1 and x not in words_to_remove), data_checker_set))
    for ind_data in data_checker_list:
        comma_sep_data = ind_data.split(',')
        for datum in comma_sep_data:
            datum = ' '.join(datum.split())
            datum_split = datum.split(' ')
            length = len(datum_split)
            initial_key = ""
            for i in range(length):
                data_checker_dict[initial_key].add(' '.join(datum_split[0:(i + 1)]))
                initial_key = ' '.join(datum_split[0:(i + 1)])
    return data_checker_dict, data_checker_set

In [0]:
def fetch_brand_overlaps_case_sensitive(data_checker_dict=None, data_checker_set=None):
    candidate_sentences['Query'] = candidate_sentences['Query'].fillna('')
    brand_list = []
    sentence_cleaned_of_brands = []
    for k in tqdm(candidate_sentences["Query"]):
        sentence_split = k.split()
        split_len = len(sentence_split)
        brand_str = ""
        initial_key = ""

        for i in range(split_len):
            if sentence_split[i] in data_checker_dict[""]:
                start_index = i
                j = 2
                loop_index = i + 1
                brand_tuple_index = tuple()
                while loop_index <= split_len and " ".join(sentence_split[start_index:loop_index]) in data_checker_dict[" ".join(sentence_split[start_index:(loop_index - 1)])]:
                    if " ".join(sentence_split[start_index:loop_index]) in data_checker_set:
                        brand_tuple_index = (start_index, loop_index) 
                    j += 1
                    loop_index += 1
                if brand_tuple_index:
                    brand_str = ' '.join(sentence_split[brand_tuple_index[0]:brand_tuple_index[1]])
                    brand_list.append(brand_str)
                    sentence_split[brand_tuple_index[0]:brand_tuple_index[1]] = []
                    break
        if not brand_str:
            brand_list.append("")
        sentence_cleaned_of_brands.append(' '.join(sentence_split))
    return brand_list, sentence_cleaned_of_brands

Learning Rules

```
Modifiers, Compound Words (Not Nouns) - Qualities
Nouns or Pronouns - Categories

Map Categories -> Qualities with dependency parser
```
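
To make the rules concrete without loading a model, the sketch below applies them to hand-written tokens. The `Tok` tuples are an illustrative stand-in for parser output (they are not actual spaCy predictions), and `map_qualities_to_categories` is a hypothetical helper: modifiers and compounds count as qualities, and each one is attached to its head when the head is a noun.

```python
from collections import namedtuple

# Stand-in for a parsed token: text, part of speech, dependency label, head index.
Tok = namedtuple("Tok", "text pos dep head")


def map_qualities_to_categories(tokens):
    """Return 'category:quality' pairs by following each quality to its head."""
    relations = []
    for tok in tokens:
        is_quality = tok.dep.endswith("mod") or tok.dep == "compound"
        if is_quality:
            head = tokens[tok.head]
            if head.pos in ("NOUN", "PROPN"):
                relations.append(f"{head.text}:{tok.text}")
    return relations


# Hand-annotated example for "Red Salwar with Blue Dupatta".
tokens = [
    Tok("Red", "ADJ", "amod", 1),
    Tok("Salwar", "NOUN", "ROOT", 1),
    Tok("with", "ADP", "prep", 1),
    Tok("Blue", "ADJ", "amod", 4),
    Tok("Dupatta", "NOUN", "pobj", 2),
]
print(map_qualities_to_categories(tokens))
```

With a real pipeline the same idea would read `tok.dep_`, `tok.pos_` and `tok.head` from the parsed document instead of these tuples.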

In [0]:
def unsupervised_feature_extraction(sent):
    sent = replace_with_symbols(sent)
    sent = convert_word_to_number(sent)
    
    is_prev_tok_prep = False
    category_tok_indices = []
    category_list = []
    quality_list = []
    preposition_list = []
    preposition_meaning_list = []
    
    prefix = ""
    modifier = ""
    category = ""
    doc = nlp(sent)
    doc_len = len(doc)
    i = 0
    while i < doc_len:
        tok = doc[i]        
        ## chunk 1: check if token is a modifier or not
        if tok.dep_.endswith("mod") and (tok.dep_.lower() != 'nummod' or (not is_prev_tok_prep and (not preposition_meaning_list or tok.text not in preposition_meaning_list[-1]))):
            modifier = tok.text
            j = i + 1
            while j < doc_len:
                tok_next = doc[j]
                if tok_next.dep_.endswith("mod"):
                    cleaned_text = clean_text(modifier)
                    quality_list.append(cleaned_text)
                    modifier = tok_next.text
                    j += 1
                else:
                    break
            i = j - 1
            cleaned_text = clean_text(modifier)
            quality_list.append(cleaned_text)
            prefix = ""
            modifier = ""
            category = ""
            is_prev_tok_prep = False
            
        ## chunk 2: check if token is a compound word or not
        elif tok.dep_ != "punct" and (tok.dep_.lower() == 'compound' or (tok.pos_.lower() != 'propn' and tok.pos_.lower() != 'noun' and tok.dep_.lower() == 'root')):
            prefix = tok.text
            j = i + 1
            while j < doc_len:
                tok_next = doc[j]
                if tok_next.dep_.lower() == 'compound' or (tok_next.pos_.lower() != 'propn' and tok_next.pos_.lower() != 'noun' and tok_next.dep_.lower() == 'root'):
                    prefix += " " + tok_next.text
                    j += 1
                else:
                    break
            i = j - 1
            quality_list.append(prefix)
            prefix = ""
            modifier = ""
            category = ""
            is_prev_tok_prep = False
            
        ## chunk 3: check if token is a noun or not    
        elif (tok.pos_.lower() == 'propn' or tok.pos_.lower() == 'noun'):
            category = prefix + " " + tok.text
            j = i + 1
            while j < doc_len:
                tok_next = doc[j]
                if (tok_next.pos_.lower() == 'propn' or tok_next.pos_.lower() == 'noun'):
                    category += " " + tok_next.text
                    j += 1
                else:
                    break
            
            category_tok_indices.append((i, j))
            i = j - 1
            category = category.strip()
            category = " ".join(category.split())
            category_list.append(category)
            prefix = ""
            modifier = ""
            category = ""
            is_prev_tok_prep = False
            
                  
        ## chunk 4: check if token is a prep or not and to extract meaning around preposition
        elif tok.dep_ == 'prep':
            prep_string = ""
            preposition_list.append(tok.text)
            preposition_meaning_list.append(get_preposition_meaning(tok, prep_string))
            is_prev_tok_prep = True
        
        else:
            is_prev_tok_prep = False
        
        i += 1

    category_feature_relation = []
    for cat_index, cat_tuple in enumerate(category_tok_indices):
        quality_set = set()
        for tok_index in range(cat_tuple[0], cat_tuple[1]):
            tok = doc[tok_index]
            for child in tok.children:
                for quality in quality_list:
                    if child.text in quality and quality not in quality_set:
                        category_feature_relation.append(category_list[cat_index] + ":" + quality)
                        quality_set.add(quality)
                        break
            
    
    return quality_list, category_list, preposition_list, preposition_meaning_list, category_feature_relation

In [0]:
unsupervised_feature_extraction("Floral print midi dress")

Test with Fashion Search Queries used by our QA Team

In [0]:
quality_tags_list = []
category_tags_list = []
preposition_list = [] 
preposition_meaning_list = []
category_feature_relation_list = []
data_checker_dict, data_checker_set = create_brand_dict_case_sensitive()
brand_list, sentence_cleaned_of_brands = fetch_brand_overlaps_case_sensitive(data_checker_dict=data_checker_dict, data_checker_set=data_checker_set)
for k in sentence_cleaned_of_brands:
    quality_tags, category_tags, preposition_tags, preposition_meaning_tags, category_feature_relation = unsupervised_feature_extraction(k)
    quality_tags_list.append(quality_tags)
    category_tags_list.append(category_tags)
    preposition_list.append(preposition_tags)
    preposition_meaning_list.append(preposition_meaning_tags)
    category_feature_relation_list.append(category_feature_relation)

In [0]:
qa_tag_extraction = pd.DataFrame(columns=['Query','Brand', 'Quality', 'Category', 'Preposition', 'Preposition Meaning'])
qa_tag_extraction['Query'] = list(candidate_sentences["Query"])
qa_tag_extraction['Brand'] = brand_list
qa_tag_extraction['Quality'] = quality_tags_list
qa_tag_extraction['Category'] = category_tags_list
qa_tag_extraction['Preposition'] = preposition_list
qa_tag_extraction['Preposition Meaning'] = preposition_meaning_list
qa_tag_extraction['Category_Feature_Relation'] = category_feature_relation_list
qa_tag_extraction.to_csv('/dbfs/mnt/nemo/qa_queries_tagger_demo_sm.csv', encoding='utf-8')
# qa_tag_extraction.to_csv('/dbfs/mnt/nemo/qa_queries_tagger_demo_lg.csv', encoding='utf-8')
# qa_tag_extraction.to_csv('/dbfs/mnt/nemo/qa_queries_tagger_demo_trf.csv', encoding='utf-8')

Location where files are written
```
https://statddevdemsdci02.blob.core.windows.net/client-data/qa_queries_tagger_demo_lg.csv
https://statddevdemsdci02.blob.core.windows.net/client-data/qa_queries_tagger_demo_sm.csv
https://statddevdemsdci02.blob.core.windows.net/client-data/qa_queries_tagger_demo_trf.csv
```

```
'en_core_web_lg' results look better than those of the sm and trf models
```

In [0]:
# string = "Floral print midi dress"
# string = "denim mini skirts"
# string = "blue bucket bag"
# string = "crocodile print bag"
# string = "off shoulder dress"
# string = "red sheath dress"
string = "Playstation between four thousand and five thousand ₹"
# string = "Playstation between four thousand and five thousand $"

doc = nlp(string)
for tok in doc:
    print(tok.text, tok.dep_, tok.pos_, tok.lemma_)
    for child in tok.children:
        print('child:', child, 'type:', type(child))
    for child in tok.rights:
        print('rights:', child, 'type:', type(child))
    for child in tok.lefts:
        print('lefts:', child, 'type:', type(child))
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Custom NER Training

In [0]:
nlp_update = spacy.load('en_core_web_lg')

string = "red dress less than 20000$"
doc = nlp_update(string)
for tok in doc:
    print(tok.text, tok.dep_, tok.pos_, tok.lemma_)
    for child in tok.children:
        print('child:', child, 'type:', type(child))
    for child in tok.rights:
        print('rights:', child, 'type:', type(child))
    for child in tok.lefts:
        print('lefts:', child, 'type:', type(child))
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [0]:
color_dataset = pd.read_csv("/dbfs/FileStore/shared_uploads/t_karthik.ragunath@tatadigital.com/colours_rgb_shades.csv")
color_dataset['Color Name'] = color_dataset['Color Name'].fillna('')
color_names = []
for color in color_dataset['Color Name']:
    color_names.append(''.join(' ' + c if c.isupper() else c for c in color).lower().strip())
color_dataset['Color Names Cleaned'] = color_names
color_dataset['Color Names Cleaned'][:10]

In [0]:
max_len = 0
for ind_color in color_names:
    comma_sep_colors = ind_color.split(',')
    for color in comma_sep_colors:
        color = ' '.join(color.split())
        try:
            length = len(color.split(' '))
        except:
            print(color, comma_sep_colors)
        if length > max_len:
            max_len = length

color_dictionary = defaultdict(set)
color_set = set(color_names)
for ind_color in color_names:
    comma_sep_colors = ind_color.split(',')
    for color in comma_sep_colors:
        color = ' '.join(color.split())
        split_color = color.split(' ')
        length = len(split_color)
        initial_key = ""
        for i in range(length):
            color_dictionary[initial_key].add(' '.join(split_color[0:(i + 1)]))
            initial_key = ' '.join(split_color[0:(i + 1)])

In [0]:
color_set

In [0]:
training_color_data = []
# Note: annotation should really be done in context (full sentences), not on isolated words
for color_name in color_set:
    training_color_data.append((color_name, {"entities": [(0, len(color_name), "COLOR")]}))
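
One way to get annotations in context, rather than on isolated colour words, is to drop each colour name into a few template sentences and compute the character offsets. The sketch below assumes hypothetical templates and a hypothetical helper name, `make_contextual_examples`; real queries from the catalog would be a better source of context.

```python
def make_contextual_examples(color_names, templates, label="COLOR"):
    """Create (text, annotations) pairs with the colour annotated in context."""
    examples = []
    for template in templates:
        # Text before the placeholder gives the start offset of the colour.
        prefix = template.split("{color}")[0]
        for color in color_names:
            text = template.format(color=color)
            start = len(prefix)
            end = start + len(color)
            examples.append((text, {"entities": [(start, end, label)]}))
    return examples


# Hypothetical templates; each must contain exactly one {color} placeholder.
templates = ["I bought a {color} salwar", "{color} dress under 2000"]
contextual_data = make_contextual_examples(["red", "navy blue"], templates)
for text, annotation in contextual_data:
    start, end, label = annotation["entities"][0]
    print(text, "->", text[start:end], label)
```

The output has the same `(text, {"entities": [...]})` shape as `training_color_data`, so it could be fed to the same training loop.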

In [0]:
example_doc = nlp_update('In USA I bought a red salwar and a blue kurta')
for ent in example_doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [0]:
import random
from spacy.training import Example
path_to_model = '/dbfs/mnt/nemo/'
batch_sizes = spacy.util.compounding(4.0, 32.0, 1.001)
optimizer = nlp_update.resume_training()
for i in range(1):
    random.shuffle(training_color_data)
    for batch in spacy.util.minibatch(training_color_data, size=batch_sizes):
        # Reference: https://www.youtube.com/watch?v=THduWAnG97k
        # spaCy v2 style, kept for reference:
        #     texts = [text for text, annotation in batch]
        #     annotations = [annotation for text, annotation in batch]
        #     nlp_update.update(texts, annotations)
        # spaCy v3 expects Example objects, see
        # https://stackoverflow.com/questions/66377634/convert-code-from-spacy2-to-spacy3-nlp-update-not-working
        examples = []
        for text, annotations in batch:
            doc = nlp_update.make_doc(text)
            example = Example.from_dict(doc, annotations)
            examples.append(example)
        losses = {}
        nlp_update.update(examples, sgd=optimizer, drop=0.35, losses=losses)
nlp_update.to_disk(path_to_model)

In [0]:
example_doc = nlp_update('In USA I bought a red salwar and a blue kurta')
for ent in example_doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Problem: the model forgets its pre-trained NER labels

In [0]:
string = "red dress less than 20000$"
doc = nlp_update(string)
for tok in doc:
    print(tok.text, tok.dep_, tok.pos_, tok.lemma_)
    for child in tok.children:
        print('child:', child, 'type:', type(child))
    for child in tok.rights:
        print('rights:', child, 'type:', type(child))
    for child in tok.lefts:
        print('lefts:', child, 'type:', type(child))
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Spell Checker

In [0]:
%sh sudo apt-get install -y libhunspell-dev

In [0]:
%sh sudo killall apt apt-get

In [0]:
%sh pip install spacy_hunspell

In [0]:
%sh pip install hunspell

In [0]:
import hunspell
hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')

In [0]:
from spacy_hunspell import spaCyHunSpell
# Named hunspell_component so that it does not shadow the imported hunspell module
hunspell_component = spaCyHunSpell(nlp_update, ('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff'))

In [0]:
doc = nlp_update('I have rer scarf.')
red = doc[2]
# red._.hunspell_spell
red._.hunspell_suggest

In [0]:
for token in doc:
    if not hobj.spell(token.text):
        print("text:", token.text)
        possible_candidates = []
        print('suggest:', token._.hunspell_suggest)
        for tok in token._.hunspell_suggest:
            nlp_local = nlp_update(tok)
            for loc in nlp_local:
                print('loc_text:', loc.text, 'loc_pos:', loc.pos_, 'tag:', loc.tag_)
                if loc.pos_ in (token.pos_,'ADJ') and hobj.spell(loc.text):
                    possible_candidates.append(loc.text)
        print('possible_candidates:', possible_candidates)
    print()

In [0]:
# Two ideas for picking the best alternative
# (1)
# Since all the candidate sentences have the same length, we can train an LSTM language model and pick the sentence with the highest probability at the last word.
# Since all words are the same except the misspelled one, we could use the `pos` or a `pos_ner` concatenation of each word instead of the actual words, but this has its defects too.
# (2)
# Or we can train a simple N-gram probability model. This could also be done on the `pos` or `pos_ner` concatenation of each word, which has its own defects.
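
Idea (2) can be sketched with a bigram model and add-one smoothing, as below. This is an illustration on a three-sentence toy corpus working on the words themselves, not the N-gram notebook referenced later; the function names and the corpus are assumptions.

```python
from collections import Counter
import math


def train_bigram_counts(sentences):
    """Count unigrams and bigrams, with sentence start and end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_log_prob(words, unigrams, bigrams):
    """Log probability of a word list under the add-one smoothed bigram model."""
    padded = ["<s>"] + words + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        log_prob += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return log_prob


def pick_best_candidate(words, index, candidates, unigrams, bigrams):
    """Replace words[index] with each candidate and keep the most probable one."""
    def score(candidate):
        replaced = words[:index] + [candidate] + words[index + 1:]
        return sentence_log_prob(replaced, unigrams, bigrams)
    return max(candidates, key=score)


# Toy corpus; a real model would be trained on catalog text or query logs.
corpus = ["i have a red scarf", "she wore a red dress", "he sat in the rear seat"]
unigrams, bigrams = train_bigram_counts(corpus)
query = "i have a rer scarf".split()
print(pick_best_candidate(query, 3, ["rear", "red", "her"], unigrams, bigrams))
```

The candidates would come from the Hunspell suggestions filtered in the previous cell; only the bigrams touching the replaced word differ between candidates, so the comparison is cheap.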

```
I have done some initial work on the LSTM model but need some technical guidance to get it right
```

Things to discuss

(1) Work on improving the POS and NER models for custom catalogs

(2) Work on an LSTM or probability model to pick the best alternative when there are spelling mistakes

(3) Possible usage of the Prodigy annotation tool - similar to our 2-Face for annotating image models

Discussion (1) - Doc Reference

Steps to build a pipeline to train and experiment with different architectures on custom catalogs

```
https://spacy.io/usage/training
https://spacy.io/usage/processing-pipelines
https://spacy.io/api/architectures
```

Catastrophic Forgetting Problem

```
https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
```

Solution for Catastrophic Forgetting Problem

```
https://deepnote.com/@isaac-aderogba/Spacy-Food-Entities-LMLRnMOsQyGIUwvPLvVlsw#
```
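
The pseudo-rehearsal idea from the Explosion post linked above can be sketched roughly as follows: label some raw text with the original model's own predictions, then mix those examples with the new `COLOR` examples so that updates do not drift away from the labels the model already knows. This is a minimal sketch under assumptions: `build_rehearsal_examples`, `mix_training_data` and the stand-in predictor `fake_predict` are hypothetical, and no spaCy model is loaded here.

```python
import random


def build_rehearsal_examples(raw_texts, predict_entities):
    """Label raw texts with the ORIGINAL model's own entity predictions."""
    return [(text, {"entities": list(predict_entities(text))}) for text in raw_texts]


def mix_training_data(new_examples, rehearsal_examples, seed=0):
    """Shuffle new-label examples together with the rehearsal examples."""
    mixed = list(new_examples) + list(rehearsal_examples)
    random.Random(seed).shuffle(mixed)
    return mixed


# Stand-in for the original model. With spaCy this would be something like
# [(ent.start_char, ent.end_char, ent.label_) for ent in original_nlp(text).ents].
def fake_predict(text):
    start = text.find("USA")
    return [(start, start + 3, "GPE")] if start >= 0 else []


new_examples = [("red", {"entities": [(0, 3, "COLOR")]})]
rehearsal = build_rehearsal_examples(["In USA I bought a salwar"], fake_predict)
mixed = mix_training_data(new_examples, rehearsal)
print(len(mixed), "examples, for instance:", rehearsal[0])
```

The ratio of rehearsal to new examples is a tuning choice; the sketch simply concatenates both lists.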

Discussion (2) - Doc Reference

Work on the LSTM model and the N-gram model

https://spacy.io/usage/layers-architectures#frameworks

Work done so far
```
https://statddevdemsdci02.blob.core.windows.net/client-data/rnn-lstm.ipynb
https://statddevdemsdci02.blob.core.windows.net/client-data/rnn-lstm.xpynb

https://statddevdemsdci02.blob.core.windows.net/client-data/n-gram-model.ipynb
https://statddevdemsdci02.blob.core.windows.net/client-data/n-gram-model.xpynb
```

Discussion (3) - Doc Reference

Spacy Projects Documentation
```
https://spacy.io/usage/projects
```