# Product Search Relevance in E-commerce
### Group members
Tianyi, Li (tli76) ; Nanzhu, Liu (nliu11) ; Yuqi, Kang (yuqik2) ; Jiaxing, Li (jli132) ; Wei Chen, Lin (wclin2).

# Outline 
#### - Introduction
#### - Data preprocessing 
#### - Feature Engineering
#### - Modeling 
#### - Evaluation and Discussion

# Introduction 
In E-commerce, accurately matching products to the search terms a user enters is essential. When users are led directly to the products they are looking for, their experience improves and the company can make more profit.
## Background and Motivation
In text-based search, the common problems are how to extract useful information from unstructured data and how to efficiently recommend the most relevant products out of millions of candidates. 
Moving from manual scoring to automatic scoring is the main motivation for our project: it saves labor cost and processing time and improves accuracy at the same time. Manual scoring takes the average score of three human raters, using criteria such as: a search for "AA battery" is considered highly relevant to a pack of size AA batteries (relevance = 3), mildly relevant to a cordless drill battery (relevance = 2), and not relevant to a snow shovel (relevance = 1). A platform like Spark enables us to handle large datasets and run computationally expensive algorithms to reach these goals.
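
As a small illustration of the manual scoring described above, the label of one query-product pair is the mean of its three ratings. The function name and the ratings below are hypothetical and chosen only for illustration; the 1-to-3 scale and the three-rater average come from the text.

```python
def average_relevance(ratings):
    """Average the 1-3 scores given by human raters for one query-product pair."""
    if not all(1 <= r <= 3 for r in ratings):
        raise ValueError("each rating must be between 1 and 3")
    return sum(ratings) / len(ratings)

# Hypothetical ratings from three raters for one query-product pair
print(round(average_relevance([3, 3, 2]), 2))  # 2.67
print(round(average_relevance([1, 1, 1]), 2))  # 1.0
```

Averaging is why the relevance column contains non-integer values such as 2.67 and 2.33.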

# Data source 
This project analyzes home improvement product data from Home Depot together with user search input data. We then build a pipeline that predicts the relevance of each query-product pair. 

https://www.kaggle.com/c/home-depot-product-search-relevance/data

# Import data and libraries

In [6]:
import pandas as pd
import numpy as np

import gensim
import gensim.parsing.preprocessing as gsp

import nltk
nltk.download('all')
from gensim import utils
from nltk.stem import PorterStemmer 
from nltk.corpus import stopwords

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[24]").getOrCreate()

from pyspark.sql import functions as F
from pyspark.sql.functions import desc
from pyspark.sql.types import StringType, IntegerType, ArrayType, FloatType, MapType, DoubleType

from itertools import product
from collections import defaultdict

from scipy.spatial.distance import euclidean, cosine
import pulp
import re

# Read Data

In [8]:
# File location and type
file_location = "/FileStore/tables/depot_train.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

title_df = spark.read.format(file_type) \
  .option("header", first_row_is_header) \
  .option("delimiter", delimiter) \
  .option("escape", "\"") \
  .load(file_location)

file_location = "/FileStore/tables/product_descriptions.csv"
desc_df = spark.read.format(file_type) \
  .option("header", first_row_is_header) \
  .option("delimiter", delimiter) \
  .option("escape", "\"") \
  .load(file_location)

# permanent_table_name = "depot"
# df.write.format("parquet").saveAsTable(permanent_table_name)

# Raw data exploration
#### Our raw data includes 6 columns (74067 observations in total):
1. Product UID
2. Product Title
3. Product Description 
4. Search Term 
5. Relevance Score
6. id

In [10]:
alldata = title_df.join(desc_df, on=['product_uid'], how='left')
alldata = alldata.withColumn('relevance', alldata.relevance.cast(FloatType()))
display(alldata.take(5))
alldata.count()

product_uid,id,product_title,search_term,relevance,product_description
100010,34,Valley View Industries Metal Stakes (4-Pack),steele stake,2.6700000762939453,"Valley View Industries Metal Stakes (4-Pack) are 9 in. galvanized steel stakes for use with all Valley View lawn edgings and brick and paver edgings. These utility stakes can also be used for many other purposes. It is recommended that anchor stakes are used every five feet on designs that have the edging in straight lengths. Where there are curved designs for edgings, additional anchor stakes are recommended at the curve points. Anchor stakes should be staked in at a 45 degree angle. Gloves and eye protection are recommended.Can be used with all valley View lawn edgings and brick/ paver edgingsUtility stakes can be used for many purposesGalvanized steel for strength9 in. lengthPriced competitively yet provides much more value in product"
100140,811,Honda GCV190 21 in. Variable Speed Self-Propelled Walk-Behind Gas Mower,lawn mower- electic,2.3299999237060547,"The Honda 21 in. GCV190 Gas Variable Speed Self-Propelled Walk-Behind Mower features hydrostatic cruise control that provides gradual speed adjustment to match your mowing conditions. The Blade Stop System (Roto-Stop) is a safety feature that stops the blades without stopping the engine. So you can safely step away without having to restart the motor.Assembled dimensions: 24 in. W x 24 in. D x 18.37 in. HNeXite deck carries a lifetime warranty and its 21 in. cutting width provides durabilityVariable speed, hydrostatic cruise-control system offers self-propelled operation with a control lever for precise speed adjustment7 mowing heights ranging from 3/4 in. to 4 in. to manicure the lawn to your specificationsTwin-blade micro-cut design provides fine grass clippings4-in-1 Versamow System with Clip Director provides adjustable mulching, bagging, discharge and leaf shredding and helps to prevent clogging of the bag chute9 in. wheels with rear ball bearings provide smooth movement over varied terrainHandle offers 3 positions height and a comfortable grip for maneuvering the lightweight mower, handle folds for easy storageRoto-Stop system stops the blades without stopping the engine so you can safely step away without having to restart the motorManual fuel-shutoff valve for your convenienceLightweight and maneuverable5-year residential warrantyDelivered to your door mostly assembled, simply attach the handle using basic instructions and add gas/oilHome Depot Protection Plan:"
100140,812,Honda GCV190 21 in. Variable Speed Self-Propelled Walk-Behind Gas Mower,Lawnmowers,3.0,"The Honda 21 in. GCV190 Gas Variable Speed Self-Propelled Walk-Behind Mower features hydrostatic cruise control that provides gradual speed adjustment to match your mowing conditions. The Blade Stop System (Roto-Stop) is a safety feature that stops the blades without stopping the engine. So you can safely step away without having to restart the motor.Assembled dimensions: 24 in. W x 24 in. D x 18.37 in. HNeXite deck carries a lifetime warranty and its 21 in. cutting width provides durabilityVariable speed, hydrostatic cruise-control system offers self-propelled operation with a control lever for precise speed adjustment7 mowing heights ranging from 3/4 in. to 4 in. to manicure the lawn to your specificationsTwin-blade micro-cut design provides fine grass clippings4-in-1 Versamow System with Clip Director provides adjustable mulching, bagging, discharge and leaf shredding and helps to prevent clogging of the bag chute9 in. wheels with rear ball bearings provide smooth movement over varied terrainHandle offers 3 positions height and a comfortable grip for maneuvering the lightweight mower, handle folds for easy storageRoto-Stop system stops the blades without stopping the engine so you can safely step away without having to restart the motorManual fuel-shutoff valve for your convenienceLightweight and maneuverable5-year residential warrantyDelivered to your door mostly assembled, simply attach the handle using basic instructions and add gas/oilHome Depot Protection Plan:"
100227,1287,Adjust-A-Gate 2 Rail Gate Frame Kit,fence gates,2.6700000762939453,"Ask for the adjustable, Adjust-A-Gate system every time for your gate needs. This line is ideal for repairing an old gate or building a new one. Including all necessary hardware & parts, our gate is assembled onsite to get the job done right, the first time. The Adjust-A-Gate system comes with all the components needed to build a sturdy, reliable gate that will serve you for years to come. You simply add rails and fence boards, you will have a gate for life. Accessories include an optional drop rod for double drive applications.2-rail Adjust-A-Gate kit, for openings of 36 in.-60 in. wide, 36 in. high frame for 3 ft., 4 ft. and 5 ft. high fencesUse for wood or composite fence gates; single or double drive useOur exclusive adjustable diagonal truss cable will keep your gate true and aligned year after yearKit includes: Vertical frame, spreader bars, frame hinges, post hinges, gate latch kit, truss cable, screws and gate stopEasy to install; perfect for the DIYerHeavy-duty hinges prevent sagging and dragging2-step powder coated all-steel frame, fade and rust resistant, square tube design"
100227,1289,Adjust-A-Gate 2 Rail Gate Frame Kit,gate frame block,2.6700000762939453,"Ask for the adjustable, Adjust-A-Gate system every time for your gate needs. This line is ideal for repairing an old gate or building a new one. Including all necessary hardware & parts, our gate is assembled onsite to get the job done right, the first time. The Adjust-A-Gate system comes with all the components needed to build a sturdy, reliable gate that will serve you for years to come. You simply add rails and fence boards, you will have a gate for life. Accessories include an optional drop rod for double drive applications.2-rail Adjust-A-Gate kit, for openings of 36 in.-60 in. wide, 36 in. high frame for 3 ft., 4 ft. and 5 ft. high fencesUse for wood or composite fence gates; single or double drive useOur exclusive adjustable diagonal truss cable will keep your gate true and aligned year after yearKit includes: Vertical frame, spreader bars, frame hinges, post hinges, gate latch kit, truss cable, screws and gate stopEasy to install; perfect for the DIYerHeavy-duty hinges prevent sagging and dragging2-step powder coated all-steel frame, fade and rust resistant, square tube design"


## Distinct products and search terms

In [12]:
## The number of distinct search terms
## The number of distinct products 
print ("number of distinct search items", alldata.select('search_term').distinct().count())
print ("number of distinct products", alldata.select('product_uid').distinct().count())

## The number of times that each product showed up in our data

In [14]:
## Number of rows per product, most frequent products first
product_result = alldata.groupBy('product_uid').count().orderBy('count', ascending=False)
## Show the result as a table
display(product_result)

product_uid,count
102893,21
101959,21
101892,18
104691,17
101539,17
102456,17
103763,15
102234,14
100409,14
101312,14


## Distribution of relevance scores

In [16]:
## A histogram plot on relevance 
## Shows how accurate the match is 
import matplotlib.pyplot as plt
var = 'relevance'
plot_data = alldata.select(var).toPandas()
x= plot_data[var]

bins =[0,0.5,1,1.5,2,2.5,3,3.5,4]

hist, bin_edges = np.histogram(x,bins,weights=np.zeros_like(x) + 100. / x.size) # make the histogram

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1,alpha=0.8,ec ='black',color = 'gold')

# Bar i is centered at i with width 1, so the bin edges sit at i - 0.5
ax.set_xticks([i - 0.5 for i in range(len(bins))])

# Label every tick with its bin edge (one label per tick)
labels = ['{}'.format(edge) for edge in bins]
ax.set_xticklabels(labels)
#plt.text(-0.6, -1.4,'0')
plt.xlabel(var)
plt.ylabel('percentage')
plt.show()
display(fig)

# Data preprocessing using Spark RDD APIs

### Change all words into lower case

In [19]:
# Change all words to lower case
tokens_in_desc_list = alldata.select('product_description').rdd.flatMap(lambda x: x)
tokens_in_desc_list = tokens_in_desc_list.collect()
tokens_in_desc_list = [sent.lower() for sent in tokens_in_desc_list]
tokens_in_desc_list = ' '.join(tokens_in_desc_list).split(' ')

tokens_in_title_list = alldata.select('product_title').rdd.flatMap(lambda x: x)
tokens_in_title_list = tokens_in_title_list.collect()
tokens_in_title_list = [sent.lower() for sent in tokens_in_title_list]
tokens_in_title_list = ' '.join(tokens_in_title_list).split(' ')

In [20]:
## Load pre-trained 50-dimensional GloVe vectors, used to map each word to a vector
# File location and type
file_location = "/FileStore/tables/glove_50d.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

df = df.toPandas()
df = df.set_index('index')
df['combined']= df.values.tolist()
df = df.reset_index()
glove_dict = dict(zip(df['index'], df['combined']))

### Represent words with vectors

In [22]:
# every token (word) has a unique vector representation
print(glove_dict['obama'])
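
The lookup table built above can be pictured with a toy example. This is only a sketch: the rows, the 3-dimensional vectors and the names `toy_glove_dict` and `lookup` are hypothetical, whereas the real table holds 50 values per word. The zero-vector fallback mirrors the default used for unknown words in the Euclidean and cosine helpers defined below.

```python
# Hypothetical rows in the layout of the GloVe table: a word followed by its vector components
rows = [
    ("drill", 0.9, 0.1, 0.0),
    ("tulip", 0.0, 0.1, 0.9),
]

# Map each word to its vector, similar in shape to what glove_dict holds
toy_glove_dict = {word: list(values) for word, *values in rows}

def lookup(word, table, dim=3):
    """Return the vector of `word`, or a zero vector for out-of-vocabulary words."""
    return table.get(word, [0.0] * dim)

print(lookup("drill", toy_glove_dict))  # [0.9, 0.1, 0.0]
print(lookup("zzz", toy_glove_dict))    # [0.0, 0.0, 0.0]
```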

# Helper Functions for feature engineering

In [24]:
# Word Mover's Distance
def tokens_to_fracdict(tokens):
    cntdict = defaultdict(lambda : 0)
    for token in tokens:
        cntdict[token] += 1
    totalcnt = sum(cntdict.values())
    return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens):
    first_sent_tokens  = [token for token in set(first_sent_tokens) if token in glove_dict]
    second_sent_tokens = [token for token in set(second_sent_tokens) if token in glove_dict]
    all_tokens = set(first_sent_tokens + second_sent_tokens)
    if len(first_sent_tokens) == 0 or len(second_sent_tokens) == 0:
      return -9999
    
    wordvecs = {token: glove_dict[token] for token in all_tokens}
    
    # normalized weight of each token within its own sentence
    # (debug prints removed: this function runs inside a Spark UDF, once per row)
    first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
    second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

    T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)
    
    # define it as a minimization problem
    prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
    
    # the equation that we want to minimize - 
    # (distance b/w tokens) * 
    # (how much of word i in the first document - d travels to word j in the new document - d')
    # therefore, we want to minimize the traveling distance.
    prob += pulp.lpSum([T[token1, token2] * euclidean(wordvecs[token1], wordvecs[token2])
                        for token1, token2 in product(all_tokens, all_tokens)])
    
    # add constraints
    for token2 in second_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets]) == second_sent_buckets[token2]
    
    for token1 in first_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets]) == first_sent_buckets[token1]

    prob.solve()
    return prob

def word_mover_distance(first_sent_tokens, second_sent_tokens):
    prob = word_mover_distance_probspec(first_sent_tokens, second_sent_tokens)
    if prob == -9999:
      return 20.0
    elif pulp.value(prob.objective) is None:
      return 0.0
    else:
      res = pulp.value(prob.objective) / len(second_sent_tokens)
    return res

word_mover_distance_udf = F.udf(word_mover_distance, FloatType())

# --------------------------------------------------------------------------------------------------------
# Euclidean Distance
def euclidean_distance(first_tokens, second_tokens):

  default = np.array([0] * 50)
  first_vectors  = [glove_dict[token] if token in glove_dict else default for token in set(first_tokens)]
  second_vectors = [glove_dict[token] if token in glove_dict else default for token in set(second_tokens)]
  
  first_vectors = np.array([sum(x) for x in zip(*first_vectors)]) / len(first_vectors)
  second_vectors = np.array([sum(x) for x in zip(*second_vectors)]) / len(second_vectors)

  return(euclidean(first_vectors, second_vectors))

euclidean_distance_udf = F.udf(euclidean_distance, FloatType())

# --------------------------------------------------------------------------------------------------------
# Cosine Similarity (higher means more similar, unlike the distances above)
def cos_sim(a,b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cos_sim_distance(first_tokens, second_tokens):

  default = np.array([0] * 50)
  first_vectors  = [glove_dict[token] if token in glove_dict else default for token in set(first_tokens)]
  second_vectors = [glove_dict[token] if token in glove_dict else default for token in set(second_tokens)]

  first_vectors = np.array([sum(x) for x in zip(*first_vectors)]) / len(first_vectors)
  second_vectors = np.array([sum(x) for x in zip(*second_vectors)]) / len(second_vectors)

  return(cos_sim(first_vectors, second_vectors))

cos_sim_distance_udf = F.udf(cos_sim_distance, FloatType())

# --------------------------------------------------------------------------------------------------------
# Extract Noun Phrases
stop_words=set(stopwords.words('english'))

def extractPhraseFunct(x):
    def leaves(tree):
        """Finds NP (nounphrase) leaf nodes of a chunk tree."""
        for subtree in tree.subtrees(filter = lambda t: t.label()=='NP'):
            yield subtree.leaves()
    
    def get_terms(tree):
        for leaf in leaves(tree):
            term = [w for w,t in leaf if not w in stop_words]
            yield term

    sentence_re = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:[\+|-]\w+)*       # words with optional internal plus
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

    grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    """
    chunker = nltk.RegexpParser(grammar)
    tokens = nltk.regexp_tokenize(x,sentence_re)
    postoks = nltk.tag.pos_tag(tokens) #Part of speech tagging 
    tree = chunker.parse(postoks) #chunking
    terms = get_terms(tree)
    temp_phrases = []
    for term in terms:
        if len(term):
            temp_phrases.append(' '.join(term))
    
    finalPhrase = [w for w in temp_phrases if w] #remove empty lists
    finalPhrase = ' '.join(finalPhrase)

    return finalPhrase

extractPhraseFunct_udf = F.udf(extractPhraseFunct, StringType())

# --------------------------------------------------------------------------------------------------------
# Brand Extraction
file_location = "/FileStore/tables/attr_brands.txt"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

brand_list = df.select('_c0').rdd.flatMap(lambda x: x).collect()
brand_list = [token.lower() for token in brand_list]

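# in brand list
brand_set = set(brand_list)  # build the lookup set once instead of on every call

def in_brand_list(title_string):
  # only the first four words are considered as a potential brand name
  title_list = title_string.lower().split(' ')[:4]

  # try the longest prefix first so that multi-word brands win over their shorter prefixes
  for i in range(4, 0, -1):
    title = ' '.join(title_list[:i])
    
    if title in brand_set:
      return(title)
    
  return('none')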
  
in_brand_list_udf = F.udf(in_brand_list, StringType())

# share brand
def share_brand(brand, search_brand):
  if search_brand == 'none':
    return(0)
  elif search_brand not in brand:
    return(0)
  else:
    return(1)

share_brand_udf = F.udf(share_brand, IntegerType())

# --------------------------------------------------------------------------------------------------------
# Attribute Extraction
file_location = "/FileStore/tables/most_common_attrs.txt"
file_type = "csv"

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

attr_list = df.select('_c0').rdd.flatMap(lambda x: x).collect()
attr_list = [re.findall(r'[a-zA-Z]+', sent.lower()) for sent in attr_list]
attr_list = [item for sublist in attr_list for item in sublist]

# --------------------------------------------------------------------------------------------------------
# Color Extraction
file_location = "/FileStore/tables/most_common_colors.txt"
file_type = "csv"

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

color_list = df.select('_c0').rdd.flatMap(lambda x: x).collect()
color_list = [i.split()[1] for i in color_list]

# in color list
def in_color_list(string):
  l = string.lower().split(' ')
  res = []

  for token in l:
    if token in color_list:
      res.append(token)
  
  if len(res) > 0:
    return(res)
  else:
    return(['none'])
  
in_color_list_udf = F.udf(in_color_list, ArrayType(StringType()))

# share color
def share_color(color, search_color):
  if ('none' in color) or ('none' in search_color):
    return(0)
  elif len(set(color) & set(search_color)) > 0:
    return(1)
  else:
    return(0)

share_color_udf = F.udf(share_color, IntegerType())

# --------------------------------------------------------------------------------------------------------
# Spelling Checker
# http://norvig.com/spell-correct.html
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())
word_list = brand_list + attr_list + color_list + tokens_in_desc_list + tokens_in_title_list
# words(open('big.txt').read())

# tokens_in_search_list
# stop_words = set(list(stop_words) + ['electic'])
WORDS = Counter([w for w in word_list if w not in ['electic']])

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def spell_correct(string):
  res = [correction(word) for word in string.split(' ')]
  return(' '.join(res))

spell_correct_udf = F.udf(spell_correct, StringType())
  
# --------------------------------------------------------------------------------------------------------
# text filtering
filters = [
           gsp.strip_tags, 
           gsp.strip_punctuation,
           gsp.strip_multiple_whitespaces, 
           gsp.strip_short, 
           gsp.remove_stopwords,
           # gsp.stem_text,
           # gsp.strip_numeric
          ]

def clean_text(x):
    s = x
    s = s.lower()
    s = utils.to_unicode(s)
    for f in filters:
        s = f(s)
    return (s)

clean_text_udf = F.udf(clean_text, StringType())

# --------------------------------------------------------------------------------------------------------
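# word stemming (implemented with the WordNet lemmatizer)
def lemmatize_tokens(token_list):
  # make sure the WordNet data is available on the worker that runs this UDF
  nltk.download('wordnet', quiet=True)
  from nltk.stem import WordNetLemmatizer 
  lemmatizer = WordNetLemmatizer()
  token_list = [lemmatizer.lemmatize(w) for w in token_list]
  return(token_list)

# a separate function name avoids shadowing the function with its own UDF
stem_udf = F.udf(lemmatize_tokens, ArrayType(StringType()))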

# --------------------------------------------------------------------------------------------------------
# Split Concatenated Words
def split_concatenated_words(text):
  return re.sub('([a-z]+)([A-Z]+)([a-z]+)', lambda matched: matched.group(1) + ' ' + matched.group(2) + matched.group(3), text)
split_concatenated_words_udf = F.udf(split_concatenated_words, StringType())

# Text Preprocessing
- Split concatenated words
- Spelling check
- Remove uninformative text (e.g. stopwords, tags, multiple spaces)

In [26]:
texts = ["concrete surfaceActual", "storesOnline", 'informationRevives']
for i in texts:
  print(i, '->', split_concatenated_words(i))

In [27]:
alldata = alldata.withColumn('product_description', split_concatenated_words_udf(alldata.product_description))

### Clean Text
1. Remove stop words, tags, punctuation, short words and repeated spaces, and convert the text to lower case 
2. Use the WordNet lemmatizer to reduce each word to its base form

In [29]:
print('punctuations,', '->', clean_text('punctuations,'))
print('hello  world', '->', clean_text('hello  world'))
print('an apple', '->', clean_text('an apple'))

In [30]:
# remove tags, punctuation, multiple_whitespaces, short words, stop words, lower
a = alldata.withColumn('product_title_list', clean_text_udf(alldata.product_title))
a = a.withColumn('search_term_list', clean_text_udf(a.search_term))
a = a.withColumn('product_desc_list', clean_text_udf(a.product_description))

In [31]:
# rearrange column
a = a.select('id', 'product_uid', 
             'product_title', 'product_title_list', 
             'product_description', 'product_desc_list', 
             'search_term', 'search_term_list', 'relevance')

### Spelling Check

1. Build a dictionary with known words as keys and the probability (relative frequency) of each word as values
2. For a misspelled word, generate candidates by editing it (inserting, deleting, replacing or transposing letters), keep the candidates that are in the dictionary, and pick the one with the highest probability
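
The two steps above can be traced on a toy dictionary. This is a minimal sketch, not the notebook's helper: the word counts and the names `one_edit_away` and `toy_correct` are hypothetical, and only candidates one edit away are considered, whereas the helper also tries two edits.

```python
from collections import Counter

# Hypothetical word counts; the notebook builds its counts from titles,
# descriptions, brands, attributes and colors
toy_counts = Counter({"electric": 50, "elastic": 5, "mower": 30})

def one_edit_away(word):
    """All strings reachable from `word` by one deletion, transposition, replacement or insertion."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def toy_correct(word, counts):
    """Keep known words; otherwise return the most frequent known word one edit away."""
    if word in counts:
        return word
    known = [w for w in one_edit_away(word) if w in counts]
    return max(known, key=counts.get) if known else word

print(toy_correct("electic", toy_counts))  # electric
print(toy_correct("mower", toy_counts))    # mower
```

Here "electic" becomes "electric" because inserting one letter yields a known word, while "elastic" is two edits away and is therefore never considered.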

In [33]:
display(alldata.take(5))


In [34]:
# spelling check example
spell_correct('electic')

In [35]:
a = a.withColumn('product_title_list', spell_correct_udf(a.product_title_list))
a = a.withColumn('product_desc_list', spell_correct_udf(a.product_desc_list))
a = a.withColumn('search_term_list', spell_correct_udf(a.search_term_list))

# Feature Engineering

### Brand / Color Extraction

In [38]:
a = a.withColumn('brand', in_brand_list_udf(a.product_title))
a = a.withColumn('search_brand', in_brand_list_udf(a.search_term_list))
a = a.withColumn('share_brand', share_brand_udf(a.brand, a.search_brand))

In [39]:
p1 = a.groupBy('share_brand').agg({'relevance':'avg'})
display(p1)

share_brand,avg(relevance)
0,2.3645800298774837
1,2.553855073509585


In [40]:
a = a.withColumn('color', in_color_list_udf(a.product_desc_list))
a = a.withColumn('search_color', in_color_list_udf(a.search_term_list))
a = a.withColumn('share_color', share_color_udf(a.color, a.search_color))

In [41]:
p2 = a.groupBy('share_color').agg({'relevance':'avg'})
display(p2)

share_color,avg(relevance)
1,2.41682217020662
0,2.3694262402279294


### Split Into List

In [43]:
# split into list
a = a.withColumn('product_title_list', F.split(a.product_title_list, ' '))
a = a.withColumn('product_desc_list', F.split(a.product_desc_list, ' '))
a = a.withColumn('search_term_list', F.split(a.search_term_list, ' '))

### Word Stemming

In [45]:
aa = a.withColumn('product_title_list', stem_udf(a.product_title_list))
aa = aa.withColumn('product_desc_list', stem_udf(aa.product_desc_list))
aa = aa.withColumn('search_term_list', stem_udf(aa.search_term_list))
aa.cache()

### Euclidean / Cosine / Word Mover's Distance

#### Euclidean / Cosine

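1. Transform each document (search term, description, title) into a single vector by averaging the word vectors of all words in the document.
2. Calculate the Euclidean distance and the cosine similarity between the search-term vector and the title or description vector.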
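
The two steps above can be sketched in plain Python. This is only an illustration, not the notebook's `euclidean_distance_udf` or `cos_sim_distance_udf`: the three-dimensional `toy_vectors` and the helper names are hypothetical, and a real pipeline would look the vectors up in a trained embedding model.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for illustration only.
toy_vectors = {
    "steel": [1.0, 0.0, 0.0],
    "stake": [0.0, 1.0, 0.0],
    "metal": [1.0, 0.0, 1.0],
}

def document_vector(words, vectors):
    """Average the vectors of all words that have an embedding."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None  # no word of the document is in the vocabulary
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def euclidean_distance(u, v):
    """Straight-line distance between two vectors (smaller = closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (larger = closer)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

search_vec = document_vector(["steel", "stake"], toy_vectors)
title_vec = document_vector(["metal", "stake"], toy_vectors)

dist = euclidean_distance(search_vec, title_vec)
sim = cosine_similarity(search_vec, title_vec)
print(dist, sim)
```

The two measures point in opposite directions: a close pair has a small Euclidean distance but a large cosine similarity.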

In [48]:
aaa = aa.withColumn('search_desc_euclidean', euclidean_distance_udf(aa.search_term_list, aa.product_desc_list))
aaa = aaa.withColumn('search_title_euclidean', euclidean_distance_udf(aaa.search_term_list, aaa.product_title_list))

In [49]:
aaaa  = aaa.withColumn('search_title_cosine', cos_sim_distance_udf(aaa.search_term_list, aaa.product_title_list))
aaaaa = aaaa.withColumn('search_desc_cosine', cos_sim_distance_udf(aaaa.search_term_list, aaaa.product_desc_list))

#### Word Mover's Distance (WMD)
WMD measures the minimum total distance that the words of one document have to travel in the embedding space to reach the words of the other document. The smaller the distance, the more similar the two documents are. Take, for example, two headlines that say the same thing in completely different words:
1. Obama speaks to the media in Illinois
2. The President greets the press in Chicago

Moving each word of the first headline to its closest counterpart in the second headline gives:
- Obama -> President (distance: 0.45)
- speaks -> greets (distance: 0.24)
- media -> press (distance: 0.20)
- Illinois -> Chicago (distance: 0.18)

Total distance: 0.45 + 0.24 + 0.20 + 0.18 = 1.07
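
As a rough sketch of the idea, the matching above can be reproduced with a brute-force search over one-to-one word pairings. This is a simplification and not the notebook's `word_mover_distance_udf`: it assumes both documents have the same number of content words, it sums the distances without normalizing them (as the example does), and `UNQUOTED_DISTANCE` is a hypothetical placeholder for every word pair the example does not quote.

```python
from itertools import permutations

# Pairwise distances quoted in the headline example.
quoted = {
    ("obama", "president"): 0.45,
    ("speaks", "greets"): 0.24,
    ("media", "press"): 0.20,
    ("illinois", "chicago"): 0.18,
}
# Hypothetical placeholder distance for all pairs the example does not quote.
UNQUOTED_DISTANCE = 1.0

def word_distance(a, b):
    return quoted.get((a, b), UNQUOTED_DISTANCE)

def simple_wmd(source, target):
    """Return the cheapest one-to-one matching between two equally long word lists."""
    if len(source) != len(target):
        raise ValueError("this simplified version needs equally long word lists")
    best_cost, best_match = float("inf"), None
    # Try every way of pairing the source words with the target words.
    for perm in permutations(target):
        cost = sum(word_distance(s, t) for s, t in zip(source, perm))
        if cost < best_cost:
            best_cost, best_match = cost, list(zip(source, perm))
    return best_cost, best_match

# Stop words ("to", "the", "in") are already removed.
headline_1 = ["obama", "speaks", "media", "illinois"]
headline_2 = ["president", "greets", "press", "chicago"]

total, matching = simple_wmd(headline_1, headline_2)
print(round(total, 2), matching)
```

The brute-force search is only practical for very short texts. The general Word Mover's Distance also allows a word to be split across several target words and weights each word by its frequency, which requires an optimal-transport solver.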

In [51]:
aaaaaa = aaaaa.withColumn('search_desc_wmd', word_mover_distance_udf('product_desc_list', 'search_term_list'))
aaaaaa = aaaaaa.withColumn('search_title_wmd', word_mover_distance_udf('product_title_list', 'search_term_list'))

## Percentage of words from the search term that match the title/description

**Search term**: ***angle*** bracket

**Product title**: Simpson Strong-Tie 12-Gauge ***Angle***

**Feature value**: 0.50 (one of the two search words, *angle*, appears in the title)
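
The ratio in the example above can be checked with a small plain-Python sketch. The helper name `search_match_ratio` and the lowercase tokenization of the title are assumptions made for illustration; the sketch divides by the number of unique search words, as the heading describes.

```python
def search_match_ratio(search_words, target_words):
    """Fraction of unique search words that also occur in the target text."""
    search_set = set(search_words)
    if not search_set:
        return 0.0  # avoid dividing by zero for an empty search term
    return len(search_set & set(target_words)) / len(search_set)

search = ["angle", "bracket"]
# Assumed tokenization of "Simpson Strong-Tie 12-Gauge Angle".
title = ["simpson", "strong", "tie", "12", "gauge", "angle"]

ratio = search_match_ratio(search, title)
print(ratio)
```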

In [53]:
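def word_match_percentage(search_term_list, title_or_desc_term_list):
  # Shared unique words divided by the unique words of the title/description
  # (not of the search term), so values are lower than the example ratio above.
  target_words = set(title_or_desc_term_list)
  return len(set(search_term_list) & target_words) / len(target_words) if target_words else 0.0
word_match_percentage_udf = F.udf(word_match_percentage, DoubleType())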

In [54]:
aaaaaaa = aaaaaa.withColumn('search_title_match', word_match_percentage_udf(aaaaaa.search_term_list, aaaaaa.product_title_list))
aaaaaaa = aaaaaaa.withColumn('search_desc_match', word_match_percentage_udf(aaaaaaa.search_term_list, aaaaaaa.product_desc_list))

In [55]:
aaaaaaa.cache()
display(aaaaaaa.take(5))

id,product_uid,product_title,product_title_list,product_description,product_desc_list,search_term,search_term_list,relevance,brand,search_brand,share_brand,color,search_color,share_color,search_desc_euclidean,search_title_euclidean,search_title_cosine,search_desc_cosine,search_desc_wmd,search_title_wmd,search_title_match,search_desc_match
34,100010,Valley View Industries Metal Stakes (4-Pack),"List(valley, view, industry, metal, stake, pack)","Valley View Industries Metal Stakes (4-Pack) are 9 in. galvanized steel stakes for use with all Valley View lawn edgings and brick and paver edgings. These utility stakes can also be used for many other purposes. It is recommended that anchor stakes are used every five feet on designs that have the edging in straight lengths. Where there are curved designs for edgings, additional anchor stakes are recommended at the curve points. Anchor stakes should be staked in at a 45 degree angle. Gloves and eye protection are recommended.Can be used with all valley View lawn edgings and brick/ paver edgings Utility stakes can be used for many purposes Galvanized steel for strength9 in. length Priced competitively yet provides much more value in product","List(valley, view, industry, metal, stake, pack, galvanized, steel, stake, use, valley, view, lawn, edging, brick, paver, edging, utility, stake, purpose, recommended, anchor, stake, foot, design, edging, straight, length, curved, design, edging, additional, anchor, stake, recommended, curve, point, anchor, stake, staked, degree, angle, glove, eye, protection, recommended, valley, view, lawn, edging, brick, paver, edging, utility, stake, purpose, galvanized, steel, strength9, length, priced, competitively, provides, value, product)",steele stake,"List(steel, stake)",2.6700000762939453,valley view industries,none,0,"List(valley, metal, galvanized, steel, valley, brick, edging, straight, curved, valley, brick, galvanized, steel)",List(steel),1,3.456138849258423,3.013258457183838,0.7697099447250366,0.6893013715744019,2.7549731731414795,2.2653465270996094,0.1666666666666666,0.054054054054054
811,100140,Honda GCV190 21 in. Variable Speed Self-Propelled Walk-Behind Gas Mower,"List(honda, gcv190, variable, speed, self, propelled, walk, gas, mower)","The Honda 21 in. GCV190 Gas Variable Speed Self-Propelled Walk-Behind Mower features hydrostatic cruise control that provides gradual speed adjustment to match your mowing conditions. The Blade Stop System (Roto-Stop) is a safety feature that stops the blades without stopping the engine. So you can safely step away without having to restart the motor.Assembled dimensions: 24 in. W x 24 in. D x 18.37 in. HNe Xite deck carries a lifetime warranty and its 21 in. cutting width provides durability Variable speed, hydrostatic cruise-control system offers self-propelled operation with a control lever for precise speed adjustment7 mowing heights ranging from 3/4 in. to 4 in. to manicure the lawn to your specifications Twin-blade micro-cut design provides fine grass clippings4-in-1 Versamow System with Clip Director provides adjustable mulching, bagging, discharge and leaf shredding and helps to prevent clogging of the bag chute9 in. 
wheels with rear ball bearings provide smooth movement over varied terrain Handle offers 3 positions height and a comfortable grip for maneuvering the lightweight mower, handle folds for easy storage Roto-Stop system stops the blades without stopping the engine so you can safely step away without having to restart the motor Manual fuel-shutoff valve for your convenience Lightweight and maneuverable5-year residential warranty Delivered to your door mostly assembled, simply attach the handle using basic instructions and add gas/oil Home Depot Protection Plan:","List(honda, gcv190, gas, variable, speed, self, propelled, walk, mower, feature, hydrostatic, cruise, control, provides, gradual, speed, adjustment, match, mowing, condition, blade, stop, roto, stop, safety, feature, stop, blade, stopping, engine, safely, step, away, having, restart, motor, assembled, dimension, one, lite, deck, carry, lifetime, warranty, cutting, width, provides, durability, variable, speed, hydrostatic, cruise, control, offer, self, propelled, operation, control, lever, precise, speed, adjustment7, mowing, height, ranging, manicure, lawn, specification, twin, blade, micro, cut, design, provides, fine, grass, clipping, versamow, clip, director, provides, adjustable, mulching, bagging, discharge, leaf, shredding, help, prevent, clogging, bag, chute9, wheel, rear, ball, bearing, provide, smooth, movement, varied, terrain, handle, offer, position, height, comfortable, grip, maneuvering, lightweight, mower, handle, fold, easy, storage, roto, stop, stop, blade, stopping, engine, safely, step, away, having, restart, motor, manual, fuel, shutoff, valve, convenience, lightweight, maneuverable, year, residential, warranty, delivered, door, assembled, simply, attach, handle, basic, instruction, add, gas, oil, home, depot, protection, plan)",lawn mower- electic,"List(lawn, mower, electric)",2.3299999237060547,honda,none,0,"List(match, safety, one, deck, grass, leaf, smooth, varied, terrain, handle, 
comfortable, handle, easy, door, simply, handle, oil, home)",List(none),0,3.1914854049682617,2.9629976749420166,0.6668685674667358,0.5986939668655396,1.8375784158706665,1.612456440925598,0.1111111111111111,0.018018018018018
812,100140,Honda GCV190 21 in. Variable Speed Self-Propelled Walk-Behind Gas Mower,"List(honda, gcv190, variable, speed, self, propelled, walk, gas, mower)","The Honda 21 in. GCV190 Gas Variable Speed Self-Propelled Walk-Behind Mower features hydrostatic cruise control that provides gradual speed adjustment to match your mowing conditions. The Blade Stop System (Roto-Stop) is a safety feature that stops the blades without stopping the engine. So you can safely step away without having to restart the motor.Assembled dimensions: 24 in. W x 24 in. D x 18.37 in. HNe Xite deck carries a lifetime warranty and its 21 in. cutting width provides durability Variable speed, hydrostatic cruise-control system offers self-propelled operation with a control lever for precise speed adjustment7 mowing heights ranging from 3/4 in. to 4 in. to manicure the lawn to your specifications Twin-blade micro-cut design provides fine grass clippings4-in-1 Versamow System with Clip Director provides adjustable mulching, bagging, discharge and leaf shredding and helps to prevent clogging of the bag chute9 in. 
wheels with rear ball bearings provide smooth movement over varied terrain Handle offers 3 positions height and a comfortable grip for maneuvering the lightweight mower, handle folds for easy storage Roto-Stop system stops the blades without stopping the engine so you can safely step away without having to restart the motor Manual fuel-shutoff valve for your convenience Lightweight and maneuverable5-year residential warranty Delivered to your door mostly assembled, simply attach the handle using basic instructions and add gas/oil Home Depot Protection Plan:","List(honda, gcv190, gas, variable, speed, self, propelled, walk, mower, feature, hydrostatic, cruise, control, provides, gradual, speed, adjustment, match, mowing, condition, blade, stop, roto, stop, safety, feature, stop, blade, stopping, engine, safely, step, away, having, restart, motor, assembled, dimension, one, lite, deck, carry, lifetime, warranty, cutting, width, provides, durability, variable, speed, hydrostatic, cruise, control, offer, self, propelled, operation, control, lever, precise, speed, adjustment7, mowing, height, ranging, manicure, lawn, specification, twin, blade, micro, cut, design, provides, fine, grass, clipping, versamow, clip, director, provides, adjustable, mulching, bagging, discharge, leaf, shredding, help, prevent, clogging, bag, chute9, wheel, rear, ball, bearing, provide, smooth, movement, varied, terrain, handle, offer, position, height, comfortable, grip, maneuvering, lightweight, mower, handle, fold, easy, storage, roto, stop, stop, blade, stopping, engine, safely, step, away, having, restart, motor, manual, fuel, shutoff, valve, convenience, lightweight, maneuverable, year, residential, warranty, delivered, door, assembled, simply, attach, handle, basic, instruction, add, gas, oil, home, depot, protection, plan)",Lawnmowers,List(lawnmowers),3.0,honda,none,0,"List(match, safety, one, deck, grass, leaf, smooth, varied, terrain, handle, comfortable, handle, easy, door, simply, 
handle, oil, home)",List(none),0,4.535178184509277,4.248767375946045,0.1082649901509285,-0.0508459731936454,6.145384311676025,5.962362766265869,0.0,0.0
1287,100227,Adjust-A-Gate 2 Rail Gate Frame Kit,"List(adjust, gate, rail, gate, frame, kit)","Ask for the adjustable, Adjust-A-Gate system every time for your gate needs. This line is ideal for repairing an old gate or building a new one. Including all necessary hardware & parts, our gate is assembled onsite to get the job done right, the first time. The Adjust-A-Gate system comes with all the components needed to build a sturdy, reliable gate that will serve you for years to come. You simply add rails and fence boards, you will have a gate for life. Accessories include an optional drop rod for double drive applications.2-rail Adjust-A-Gate kit, for openings of 36 in.-60 in. wide, 36 in. high frame for 3 ft., 4 ft. and 5 ft. high fences Use for wood or composite fence gates; single or double drive use Our exclusive adjustable diagonal truss cable will keep your gate true and aligned year after year Kit includes: Vertical frame, spreader bars, frame hinges, post hinges, gate latch kit, truss cable, screws and gate stop Easy to install; perfect for the DIYer Heavy-duty hinges prevent sagging and dragging2-step powder coated all-steel frame, fade and rust resistant, square tube design","List(ask, adjustable, adjust, gate, time, gate, need, line, ideal, repairing, old, gate, building, new, including, necessary, hardware, part, gate, assembled, onsite, job, right, time, adjust, gate, come, component, needed, build, sturdy, reliable, gate, serve, year, come, simply, add, rail, fence, board, gate, life, accessory, include, optional, drop, rod, double, drive, application, rail, adjust, gate, kit, opening, wide, high, frame, high, fence, use, wood, composite, fence, gate, single, double, drive, use, exclusive, adjustable, diagonal, truss, cable, gate, true, aligned, year, year, kit, includes, vertical, frame, spreader, bar, frame, hinge, post, hinge, gate, latch, kit, truss, cable, screw, gate, stop, easy, install, perfect, diyer, heavy, duty, hinge, prevent, sagging, 
dragging, step, powder, coated, steel, frame, fade, rust, resistant, square, tube, design)",fence gates,"List(fence, gate)",2.6700000762939453,adjust-a-gate,none,0,"List(old, new, simply, life, kit, high, high, wood, composite, true, kit, kit, easy, heavy, powder, steel, rust)",List(none),0,3.4648513793945312,3.0835936069488525,0.7139539122581482,0.6182488203048706,2.6316978931427,2.1272389888763428,0.2,0.0235294117647058
1289,100227,Adjust-A-Gate 2 Rail Gate Frame Kit,"List(adjust, gate, rail, gate, frame, kit)","Ask for the adjustable, Adjust-A-Gate system every time for your gate needs. This line is ideal for repairing an old gate or building a new one. Including all necessary hardware & parts, our gate is assembled onsite to get the job done right, the first time. The Adjust-A-Gate system comes with all the components needed to build a sturdy, reliable gate that will serve you for years to come. You simply add rails and fence boards, you will have a gate for life. Accessories include an optional drop rod for double drive applications.2-rail Adjust-A-Gate kit, for openings of 36 in.-60 in. wide, 36 in. high frame for 3 ft., 4 ft. and 5 ft. high fences Use for wood or composite fence gates; single or double drive use Our exclusive adjustable diagonal truss cable will keep your gate true and aligned year after year Kit includes: Vertical frame, spreader bars, frame hinges, post hinges, gate latch kit, truss cable, screws and gate stop Easy to install; perfect for the DIYer Heavy-duty hinges prevent sagging and dragging2-step powder coated all-steel frame, fade and rust resistant, square tube design","List(ask, adjustable, adjust, gate, time, gate, need, line, ideal, repairing, old, gate, building, new, including, necessary, hardware, part, gate, assembled, onsite, job, right, time, adjust, gate, come, component, needed, build, sturdy, reliable, gate, serve, year, come, simply, add, rail, fence, board, gate, life, accessory, include, optional, drop, rod, double, drive, application, rail, adjust, gate, kit, opening, wide, high, frame, high, fence, use, wood, composite, fence, gate, single, double, drive, use, exclusive, adjustable, diagonal, truss, cable, gate, true, aligned, year, year, kit, includes, vertical, frame, spreader, bar, frame, hinge, post, hinge, gate, latch, kit, truss, cable, screw, gate, stop, easy, install, perfect, diyer, heavy, duty, hinge, prevent, sagging, 
dragging, step, powder, coated, steel, frame, fade, rust, resistant, square, tube, design)",gate frame block,"List(gate, frame, block)",2.6700000762939453,adjust-a-gate,none,0,"List(old, new, simply, life, kit, high, high, wood, composite, true, kit, kit, easy, heavy, powder, steel, rust)",List(none),0,2.239935874938965,1.8830853700637815,0.8707652688026428,0.8107318878173828,1.5310730934143066,0.976401388645172,0.4,0.0235294117647058


# Final DataFrame before Modeling
## Ten features and one response variable
- share_brand<br>
- share_color<br>
- search_title_euclidean<br>
- search_desc_euclidean<br>
- search_title_cosine<br>
- search_desc_cosine<br>
- search_title_wmd<br>
- search_desc_wmd<br>
- search_title_match<br>
- search_desc_match<br>
- relevance<br>

In [57]:
# Keep relevant columns
final_df = aaaaaaa.select('share_brand', 'share_color', 'search_desc_euclidean', 'search_title_euclidean', 'search_title_cosine', 
                         'search_desc_cosine', 'search_title_wmd', 'search_desc_wmd', 'search_title_match','search_desc_match' ,'relevance')

In [58]:
final_df = final_df.na.fill(0)
final_df.cache()
display(final_df.take(5))

share_brand,share_color,search_desc_euclidean,search_title_euclidean,search_title_cosine,search_desc_cosine,search_title_wmd,search_desc_wmd,search_title_match,search_desc_match,relevance
0,1,3.456138849258423,3.013258457183838,0.7697099447250366,0.6893013715744019,2.2653465270996094,2.7549731731414795,0.1666666666666666,0.054054054054054,2.6700000762939453
0,0,3.1914854049682617,2.9629976749420166,0.6668685674667358,0.5986939668655396,1.612456440925598,1.8375784158706665,0.1111111111111111,0.018018018018018,2.3299999237060547
0,0,4.535178184509277,4.248767375946045,0.1082649901509285,-0.0508459731936454,5.962362766265869,6.145384311676025,0.0,0.0,3.0
0,0,3.4648513793945312,3.0835936069488525,0.7139539122581482,0.6182488203048706,2.1272389888763428,2.6316978931427,0.2,0.0235294117647058,2.6700000762939453
0,0,2.239935874938965,1.8830853700637815,0.8707652688026428,0.8107318878173828,0.976401388645172,1.5310730934143066,0.4,0.0235294117647058,2.6700000762939453


In [59]:
final_df.printSchema()

In [60]:
# 0 means the search term and the product title share no brand; 1 means they do
display(final_df.select("share_brand"))

share_brand
0
0
0
0
0
0
0
0
0
0


In [61]:
display(final_df.select("share_color"))

share_color
1
0
0
0
0
1
0
0
0
0


# Modeling
- Linear regression<br>
- Decision Tree<br>
- Gradient Boosting<br>

## Evaluation
- RMSE<br>
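
RMSE is the square root of the mean squared difference between the predicted and the actual relevance scores, so it is expressed in the same units as the relevance score. A minimal plain-Python sketch of the metric follows; the `rmse` helper and the sample numbers are made up for illustration and are not results from this project.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long lists of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up relevance labels and predictions, for illustration only.
labels = [3.0, 2.0, 1.0, 2.0]
predictions = [2.0, 2.0, 2.0, 2.0]

error = rmse(labels, predictions)
print(error)
```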

In [63]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler 
assembler = VectorAssembler(inputCols=['search_desc_euclidean', 'search_title_euclidean', 
                                       'search_title_cosine', 'search_desc_cosine', 
                                       'search_title_wmd', 'search_desc_wmd', 'search_title_match','search_desc_match',
                                       'share_brand', 'share_color'], 
                            outputCol="features")

In [64]:
dataset = assembler.transform(final_df)
dataset = dataset.select("features", "relevance")

In [65]:
splits = dataset.randomSplit([0.7, 0.3], 123)
train_df = splits[0]
test_df = splits[1]

In [66]:
# Import the three regression estimators
from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import GBTRegressor
# Define the gradient-boosted tree (lr1), linear regression (lr2) and decision tree (lr3) estimators
lr1 = GBTRegressor(featuresCol = 'features', labelCol='relevance')
# regParam=0.0 disables regularization, so elasticNetParam has no effect here
lr2 = LinearRegression(featuresCol = 'features', labelCol='relevance', maxIter=10, regParam=0.0, elasticNetParam=0.8)
lr3 = DecisionTreeRegressor(featuresCol = 'features', labelCol='relevance')

In [67]:
modelA = lr1.fit(train_df)

In [68]:
modelB = lr2.fit(train_df)

In [69]:
modelC = lr3.fit(train_df)

# Evaluation

In [71]:
from pyspark.ml.evaluation import RegressionEvaluator
pred_gbt = modelA.transform(test_df)
evaluator = RegressionEvaluator(labelCol="relevance", 
                                predictionCol="prediction", 
                                metricName="rmse")
evaluator.evaluate(pred_gbt)

In [72]:
pred_linear = modelB.transform(test_df)
evaluator = RegressionEvaluator(labelCol="relevance", 
                                predictionCol="prediction", 
                                metricName="rmse")
evaluator.evaluate(pred_linear)

In [73]:
pred_tree = modelC.transform(test_df)
evaluator = RegressionEvaluator(labelCol="relevance", 
                                predictionCol="prediction", 
                                metricName="rmse")
evaluator.evaluate(pred_tree)

# Conclusion
The results above are somewhat encouraging. We use root mean square error (RMSE) as our evaluation metric, and all three models achieve a small RMSE on the test set. Of the three, the gradient-boosted tree model performs best, with the lowest RMSE of 0.4976. At the beginning, we used only Euclidean distance as a feature and obtained a high RMSE and a low R-squared. To improve the models, we added more features: cosine distance, Word Mover's Distance, whether the brand in the search term matches the product title, and whether the color in the search term matches the product description. After adding these features, the RMSE dropped by 30%.

# Discussion
The goal of this project is to predict how well a product matches a user's search input. With such a model, users can be led directly to the products they are looking for. The approach applies broadly to product-matching problems in e-commerce, for example product search on Amazon. A more accurate model improves the user experience because finding a product on the website becomes more efficient. Moreover, the model can replace the human raters who currently assign relevance scores manually.

# References

- https://vene.ro/blog/word-movers-distance-in-python.html
- https://medium.com/towards-artificial-intelligence/multi-class-text-classification-using-pyspark-mllib-doc2vec-dbfcee5b39f2
- http://norvig.com/spell-correct.html