# CA684 Machine Learning Assignment Spring 2022
**Tanmay Potbhare (21262012)**

## Introduction

As a customer proposition, Zalando strives for “trustworthy” prices. That is, the company wants to offer competitive prices in each of its dynamic market environments, to relieve its customers of having to compare prices, and to drive revenue growth. To do that for its hundreds of thousands of individual products, Zalando needs to identify exact product matches across the relevant European competitors.

A very similar use case exists at stores like Amazon or Walmart, which allow multiple sellers to offer the same product on their platform: identical products need to be grouped together, even when the names, descriptions, images, etc. are not exactly the same.

## Challenge

Barcode systems like EAN allow for unique identification of every product. Unfortunately, reliable EAN information is not always available. Zalando uses multi-modal data to solve the problem, relying on images and text. For this challenge, we ask you to make intelligent use of text data (such as product titles, colors and descriptions). As these are not standardized, and are often manually written or changed for marketing purposes, matching products is a non-trivial task.

This challenge has a direct business impact for a retailer like Zalando. It is also closely related to many other problems, like record deduplication in heterogeneous catalogues, document retrieval, and many more.

## Getting Started

Here is some sample code to get you started on the challenge!

Happy Hacking!

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Installing the libraries needed
!pip install thefuzz
!pip install googletrans
!pip install fuzzywuzzy
!pip install python-Levenshtein
!pip install scikit-learn~=0.22
!pip install gensim~=3.8
!pip install nltk~=3.4
!pip install jellyfish
!pip install sentence-transformers
!pip install tensorflow


# Importing the Libraries 

In [4]:
# libraries
import os
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image
import urllib
import json
from random import choices
# Levenshtein Distance in Python
# https://github.com/seatgeek/thefuzz
from thefuzz import fuzz, process
import jellyfish as jf
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import re
import itertools
from itertools import chain
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
#from tensorflow.keras.applications.vgg16 import VGG16
#from tensorflow.keras.preprocessing import image
#from tensorflow.keras.applications.vgg16 import preprocess_input
from io import BytesIO
import tensorflow as tf


# Matplotlib configuration
font = { 'family': 'DejaVu Sans', 'weight': 'bold', 'size': 16 }
plt.rc('font', **font)

# Pandas config
pd.options.mode.chained_assignment = None  # default='warn'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
# set random seed
np.random.seed(seed=42)

# Reading, Converting to CSV and Analysing Datasets

In [6]:
# Reading and Converting the parquet dataset to CSV

offers_training_df = pd.read_parquet('/content/drive/MyDrive/CA684MachineLearning/offers_training.parquet')
offers_training_df.to_csv('TrainingOffers.csv')

offers_test_df = pd.read_parquet('/content/drive/MyDrive/CA684MachineLearning/offers_test.parquet')
offers_test_df.to_csv('TestOffers.csv')

matches_training_df = pd.read_parquet('/content/drive/MyDrive/CA684MachineLearning/matches_training.parquet')
matches_training_df.to_csv('TrainingMatches.csv')

In [18]:
offers_training_df.count()

offer_id       102884
shop           102884
lang           102884
brand          102884
color          102882
title          102884
description    102884
price          102882
url            102884
image_urls     102858
dtype: int64

In [19]:
pd.value_counts(offers_test_df['shop'], sort=True, ascending=False)

aboutyou    70105
zalando     36636
Name: shop, dtype: int64

In [9]:
matches_training_df.head(2)


Unnamed: 0,zalando,aboutyou,brand
0,b33f55d6-0149-4063-8b63-3eeae63562a2,ad5ceb87-0254-4171-b650-1d4d09f48efc,10
1,f04bef4a-f771-4749-914c-1b22718523b8,b68dd42a-9bda-46e2-aa4e-3d7c50881bb2,10


In [10]:
# Checking Unique values and taking Count 
print(matches_training_df['brand'].unique())
print(matches_training_df['brand'].count())

[ 10  31  33  39  54  65  66  77  78  81  97 101 109 118 158 148 155 174
 718 196 211 212 213 220 247 250 275 291 294 300 319 329 342 344 351 361
 375 380 394 396 404 407 420 425 429 430 434 443 462 464 470 473 478 482
 486 494 530 531 572 579 584 591 599 609 613 621 652 676 687 695 735 738
 739 744 748 750]
15170


In [38]:
brands_training = offers_training_df['brand'].unique()

In [37]:
brands_test = offers_test_df['brand'].unique()

## Exploratory Data Analysis

It is important to familiarize yourself with the dataset by using measures of centrality (e.g. mean) and statistical dispersion (e.g. variance) and data visualization methods. The following is just some Pandas preprocessing and Matplotlib visualizations to get you started. Feel free to explore the data much further and come up with ideas that might help you in the matching task!
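
As a small illustration of the measures mentioned above, the following sketch computes the mean and the sample variance of the price per shop. The frame `toy_offers` and its values are hypothetical stand-ins for `offers_training_df`; only the `shop` and `price` column names come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the offers frame (the real one has 102,884 rows)
toy_offers = pd.DataFrame({
    'shop': ['aboutyou', 'aboutyou', 'zalando', 'zalando'],
    'price': [20.0, 30.0, 40.0, 60.0],
})

# Mean (centrality) and sample variance (dispersion) of the price per shop
price_stats = toy_offers.groupby('shop')['price'].agg(['mean', 'var'])
print(price_stats)
```

The same `groupby(...).agg(...)` call should work on the real frame, assuming the `price` column is numeric.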

### Offers of Products

In [11]:
# Checking the Number of products in Training Offers data frame
f'Number of products in training: {len(offers_training_df):,}'

'Number of products in training: 102,884'

In [12]:
# Check the list of the columns
list(offers_training_df.columns)

['offer_id',
 'shop',
 'lang',
 'brand',
 'color',
 'title',
 'description',
 'price',
 'url',
 'image_urls']

In [13]:
# Taking the counts of the values in the shop column of training offers DF
pd.value_counts(offers_training_df['shop'], sort=True, ascending=False)

aboutyou    61980
zalando     40904
Name: shop, dtype: int64

# Converting to Normal String

In [14]:
# Concatenates the title and description of the offers_training dataframe
# TD_string is the title-and-description string

def TD_string(title,description):
    return  str(title) + ' ' + str(description)

In [43]:
# These functions convert the raw description into a plain string, which is stored in two newly created columns.

# Extracts alphabetic words (no digits or underscores) from the part of the text before its last comma
def ConvertedwordsJson(text):
    return [word for word in re.findall(r"[^\W\d_]+", text.rsplit(',', 1)[0])]

# Extracts words (no digits) from the part of the text before its last space
def Convertedwords(text):
    final = [word for word in re.findall(r"[^\W\d]+", text.rsplit(' ', 1)[0])]
    return final

# Extracts the values of a JSON description as a single string
def extractJsonstr(text):
    items = text.replace('\n', '').split(':')[1:]
    return ' '.join(chain(*[ConvertedwordsJson(i) for i in items if '<table' not in i]))

# Extracts the words of a Zalando description, skipping attribute keys (tokens containing '_')
def extractZolandoStr(text):
    final_items = text.split(' ')
    return ' '.join(chain(*[Convertedwords(i) for i in final_items if '_' not in i ]))

def simple_Descprtion(var):
  if var.startswith('{'):
    final_String = extractJsonstr(var)
  else:
    final_String = extractZolandoStr(var)
  return final_String


# simple_Descprtion(jsonvar)
offers_training_df['DescriptionString'] = offers_training_df.apply(lambda x:simple_Descprtion(x.description), axis=1)
offers_training_df['Ti_Desc_String'] = offers_training_df.apply(lambda x:TD_string(x.title,x.DescriptionString), axis=1)
# offers_training_df = offers_training_df.drop(["title_Descprtion_string"], axis = 1)
# offers_training_df = offers_training_df.drop(["description_String"], axis = 1)
# offers_training_df = offers_training_df.drop(["TDString"], axis = 1)
# offers_training_df = offers_training_df.drop(["T_D_String"], axis = 1)
offers_training_df.head(1)


Unnamed: 0,offer_id,shop,lang,brand,color,title,description,price,url,image_urls,DescriptionString,Ti_Desc_String
0,d8e0dba8-98e8-48db-9850-dd30cff374e0,aboutyou,de,PIECES,hellblau | Blau,Kleid,"{""Material"": [""Baumwolle""], ""\u00c4rmell\u00e4...",24.99,https://www.aboutyou.de/p/pieces/kleid-6732409,[https://cdn.aboutstatic.com/file/images/06728...,Baumwolle u c rmellos Normaler Tr u e ger Rund...,Kleid Baumwolle u c rmellos Normaler Tr u e ge...
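
The `DescriptionString` value above contains fragments such as `u c rmellos`, which suggests that the `\u00c4`-style escapes in the AboutYou JSON are being split by the regular expression rather than decoded. One possible alternative is to parse the description with `json.loads`. This is sketched here under the assumption that each AboutYou description is a valid JSON object whose values are lists of strings. The helper name `flatten_json_description` and the sample string are hypothetical.

```python
import json

def flatten_json_description(text):
    """Join all values of a JSON description into one string (hypothetical helper)."""
    data = json.loads(text)  # decodes \uXXXX escapes, e.g. \u00c4 becomes the letter Ä
    words = []
    for value in data.values():
        # Values are assumed to be lists of strings; anything else is wrapped in a list
        items = value if isinstance(value, list) else [value]
        words.extend(str(item) for item in items)
    return ' '.join(words)

# Hypothetical sample in the same shape as the AboutYou descriptions
sample = '{"Material": ["Baumwolle"], "\\u00c4rmell\\u00e4nge": ["\\u00c4rmellos"]}'
flat = flatten_json_description(sample)
print(flat)
```

Unlike `extractJsonstr`, this sketch keeps digits and punctuation inside the values, so it is not a drop-in replacement.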


In [74]:
offers_training_df[offers_training_df['shop'].str.lower().str.contains("zalando", na=False)]

Unnamed: 0,offer_id,shop,lang,brand,color,title,description,price,url,image_urls,DescriptionString,Ti_Desc_String
5,02df5ca3-8adc-48fa-bf42-91b41c3ea5a9,zalando,de,guess,white,junior reversible hooded long wintermantel,skirt_details Eingrifftaschen | Ziersteine $ n...,150.685455,https://www.zalando.de/lookup/article/GU123L05...,[https://img01.ztat.net/article/1ec35ff491c54e...,Eingrifftaschen Ziersteine shiny pearl pattern...,JUNIOR REVERSIBLE HOODED LONG Wintermantel Ein...
8,08c47691-4160-41df-81c5-ea108f2ae539,zalando,de,ellesse,white,hollina shirt & legging pyjama nachtwäsche set,name_suffix white $ pattern print $ material.u...,39.958182,https://www.zalando.de/lookup/article/EL981P00...,[https://img01.ztat.net/article/511a8191c10549...,white pattern print Elasthan Baumwolle weiß no...,HOLLINA SHIRT & LEGGING Pyjama Nachtwäsche Set...
11,96fc5065-3a31-42f0-bfbd-34ee94324807,zalando,de,selected,blue,slhslimmark washed businesshemd,main_supplier_code K70240 $ name_suffix dark s...,39.990909,https://www.zalando.de/lookup/article/SE622D0Y...,[https://img01.ztat.net/article/4ca1292819fa35...,K dark sapphire pattern print Baumwolle blau K...,SLHSLIMMARK WASHED Businesshemd K dark sapphir...
12,04341c9e-4043-4478-9a64-266ac38480b5,zalando,de,ellesse,black,nurra fashion trunks 5 pack panties,main_supplier_code K10441 $ name_suffix black ...,32.958182,https://www.zalando.de/lookup/article/EL982O00...,[https://img01.ztat.net/article/ca67693a608332...,K black Elasthan Baumwolle schwarz normal Jers...,NURRA FASHION TRUNKS 5 PACK Panties K black El...
13,346abc38-d9ae-431a-8e6a-64f7bb3ade7e,zalando,de,selected,brown,slfolive cardigan strickjacke,name_suffix carafe $ pattern meliert $ materia...,89.990909,https://www.zalando.de/lookup/article/SE521I0P...,[https://img01.ztat.net/article/a797f668349e4d...,carafe pattern meliert Wolle Alpaka Polyacryl ...,SLFOLIVE CARDIGAN Strickjacke carafe pattern m...
...,...,...,...,...,...,...,...,...,...,...,...,...
102875,6a50227d-de59-42ce-b1e7-a86d9020b7fd,zalando,de,hust & claire,,almer longsleeve unisex langarmshirt,main_supplier_code K70388 $ name_suffix ombre ...,29.867273,https://www.zalando.de/lookup/article/HC326G00...,[https://img01.ztat.net/article/ac1fbf86d17c4f...,K ombre blue pattern unifarben Baumwolle blaug...,ALMER LONGSLEEVE UNISEX Langarmshirt K ombre b...
102876,ff5888dc-0eff-421a-b3c2-4033b56d7cb8,zalando,de,more & more,black,t-shirt basic,main_supplier_code K75573 $ name_suffix black ...,29.990909,https://www.zalando.de/lookup/article/M5821D0K...,[https://img01.ztat.net/article/25124eed82da33...,K black pattern unifarben Elasthan Viskose sch...,T-Shirt basic K black pattern unifarben Elasth...
102877,f6f9eb9e-b3f0-4a9c-8471-92471e65be43,zalando,de,libertine-libertine,blue,sunday strickkleid,name_suffix dark navy $ pattern unifarben $ ma...,199.958182,https://www.zalando.de/lookup/article/LIQ21C02...,[https://img01.ztat.net/article/af834b0991dd45...,dark navy pattern unifarben Polyamid Viskose W...,SUNDAY Strickkleid dark navy pattern unifarben...
102881,60e571e0-461b-456d-85f9-82ec7e0720ea,zalando,de,vero moda,blue light,vmsophia jeans skinny fit,main_supplier_code K70240 $ name_suffix light ...,34.990909,https://www.zalando.de/lookup/article/VEB21N01...,[https://img01.ztat.net/article/a243abb8917a3d...,K light blue denim Elasthan Polyester Baumwoll...,VMSOPHIA Jeans Skinny Fit K light blue denim E...


In [80]:
brands_training = offers_training_df['brand'].unique()
brands_test = offers_test_df['brand'].unique()

In [81]:
# Intersection between brands in training and test
f'Number of brands in train and test: {sum(np.in1d(brands_training, brands_test, assume_unique=True)):,}'

'Number of brands in train and test: 0'

# Matching

In [82]:
f'Number of groundtruth matches: {len(matches_training_df):,}'

'Number of groundtruth matches: 15,170'

In [83]:
matches_training_df.head()

Unnamed: 0,zalando,aboutyou,brand
0,b33f55d6-0149-4063-8b63-3eeae63562a2,ad5ceb87-0254-4171-b650-1d4d09f48efc,10
1,f04bef4a-f771-4749-914c-1b22718523b8,b68dd42a-9bda-46e2-aa4e-3d7c50881bb2,10
2,396c292a-cda8-4477-ac67-86701fc8ab95,7d19213c-b3ea-406a-ac8e-8299823c7bb4,10
3,e72b5d05-fd06-46e9-a183-5e2e26ed18bb,22344dcd-2eca-4576-a89d-916cc47f6cb4,10
4,87b7841b-f44e-4652-ace4-2ac975510226,c2f1a132-c013-4e78-8582-6d3001e05cbf,10


In [85]:
matches_training_df.iloc[0]

zalando     b33f55d6-0149-4063-8b63-3eeae63562a2
aboutyou    ad5ceb87-0254-4171-b650-1d4d09f48efc
brand       10                                  
Name: 0, dtype: object

In [86]:
def get_offer(products, match, shop):
    return products[
        products['offer_id'] == match[shop]
    ].iloc[0]

# Creating the DataFrames for Zalando and AboutYou

In [70]:
AboutYou_df = offers_training_df.loc[offers_training_df['shop'] == 'aboutyou'] 
AboutYou_df.to_csv('AboutYouDF.csv')
#Zolando_df = offers_training_df.loc[offers_training_df['shop'] == 'zalando']

In [63]:
len(AboutYou_df)

61980

# Translating the Color Column Values to English

In [47]:
import pickle
with open(r'/content/drive/MyDrive/CA684MachineLearning/color_map.pickle', 'rb') as f:
    color_map = pickle.load(f)
translation = str.maketrans({'|': '', ' ': ' ', "'": '', '"': ''})

def ColorConvertEN(colors):
  return ' '.join(set(color_map.get(color_de, '') for color_de in colors.translate(translation).split())).casefold()


offers_training_df[['brand', 'color', 'title']] = offers_training_df[
        ['brand', 'color', 'title']].applymap(lambda x: x.casefold() if x else '')
offers_training_df['color'] = offers_training_df['color'].apply(ColorConvertEN)

#offers_training_df.head(3)
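
`ColorConvertEN` joins a `set`, so the order of the translated colors is not guaranteed. The sketch below shows one way to remove duplicates while keeping the original order. The names `demo_color_map`, `demo_translation` and `color_convert_en_ordered` are hypothetical, and the three dictionary entries are made-up examples, not the contents of `color_map.pickle`.

```python
# Hypothetical German -> English mapping; the real color_map is loaded from the pickle
demo_color_map = {'hellblau': 'light blue', 'blau': 'blue', 'schwarz': 'black'}
demo_translation = str.maketrans({'|': '', "'": '', '"': ''})

def color_convert_en_ordered(colors):
    """Translate each German color token, dropping duplicates but keeping order."""
    tokens = colors.casefold().translate(demo_translation).split()
    translated = [demo_color_map.get(tok, '') for tok in tokens]
    # dict.fromkeys removes duplicates while preserving first-seen order;
    # unknown colors map to '' and are filtered out
    unique = [t for t in dict.fromkeys(translated) if t]
    return ' '.join(unique)

result = color_convert_en_ordered('hellblau | Blau')
print(result)
```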

In [None]:
# Failed experiment
# Do not run this code: it can easily take 5 to 6 hours to finish

from googletrans import Translator
import pandas as pd

translator=Translator()
li1=[]
for i in range(len(offers_training_df['title'])):
    try:
        offers_training_df['title'][i]=translator.translate(offers_training_df['color'][i], src='de', dest='en').text
    except:
        li1.append(i)
offers_training_df.head()

# Similarity Scores

1. Cosine Similarity

In [69]:
from nltk.corpus import stopwords

def _tfidf_cosine(text1, text2, language):
  # converts the two texts into vectors with TF-IDF (vocabulary fitted on this pair only)
  vect = TfidfVectorizer(stop_words=stopwords.words(language))
  tfidf = vect.fit_transform([text1, text2])
  # computes the cosine similarity
  cosine_similarity_Score = cosine_similarity(tfidf[0], tfidf[1])
  return cosine_similarity_Score[0][0]

def cosineTitle(main_Offer_str, df_title):
  # titles are compared with German stop words removed
  return _tfidf_cosine(main_Offer_str, df_title, 'german')

def cosineDesc(main_Offer_str, df_description):
  # descriptions are compared with German stop words removed
  return _tfidf_cosine(main_Offer_str, df_description, 'german')

def cosineColor(main_Offer_str, df_Color):
  # colors were translated to English, so English stop words are removed
  return _tfidf_cosine(main_Offer_str, df_Color, 'english')

#offers_training_df['cosine_similarity_Score'] = offers_training_df.apply(lambda x:cosineDesc(x.cosine_similarity_Score,x.description_String), axis=1)
#offers_training_df.head(1)
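
To make the cosine computation itself concrete, here is a dependency-free sketch of cosine similarity, cos(a, b) = a·b / (‖a‖ ‖b‖). Note that it works on raw term counts rather than TF-IDF weights and does no stop-word removal, so its scores will differ from those of the functions above. The name `count_cosine` and the example strings are hypothetical.

```python
import math
from collections import Counter

def count_cosine(text1, text2):
    """Cosine similarity between the raw term-count vectors of two strings."""
    c1 = Counter(text1.casefold().split())
    c2 = Counter(text2.casefold().split())
    # Dot product over the terms both texts share
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm1 * norm2)

same = count_cosine('kleid blau', 'Kleid Blau')         # identical after casefolding
partial = count_cosine('kleid blau', 'kleid rot')       # one of two terms shared
disjoint = count_cosine('kleid blau', 'jeans schwarz')  # no shared terms
```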

2. Euclidean Distance

In [66]:
import math

def euclidean_distance(main_Offer_str, AYstr, Zolstr):
  # collect the items that both entries have in common
  common_products = {}
  for item in main_Offer_str[AYstr]:
    if item in main_Offer_str[Zolstr]:
      common_products[item] = True
  # if no item is common
  if len(common_products) == 0:
    return 0

  # calculate distance over the common items
  # √((x1-x2)^2 + (y1-y2)^2)
  distance = sum([math.pow(main_Offer_str[AYstr][itm] - main_Offer_str[Zolstr][itm], 2) for itm in common_products.keys()])
  distance = math.sqrt(distance)
  # return a similarity in (0, 1]; identical entries give 1
  return 1/(distance + 1)
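
The mapping from a Euclidean distance d to the similarity 1 / (1 + d) can be checked on a small worked example. This is a minimal sketch that assumes each offer is represented as a dictionary of numeric features. The names `distance_similarity`, `offer_a` and `offer_b` and their values are hypothetical.

```python
import math

def distance_similarity(vec_a, vec_b):
    """Map the Euclidean distance over shared keys to a similarity in (0, 1]."""
    shared = vec_a.keys() & vec_b.keys()
    if not shared:
        return 0.0  # nothing in common to compare
    distance = math.sqrt(sum((vec_a[k] - vec_b[k]) ** 2 for k in shared))
    return 1 / (1 + distance)

# Hypothetical per-offer feature values keyed by feature name
offer_a = {'title': 1.0, 'color': 4.0}
offer_b = {'title': 4.0, 'color': 8.0}

# Differences are 3 and 4, so the distance is 5 and the similarity is 1/6
sim = distance_similarity(offer_a, offer_b)
```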
  

In [71]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

#Sentences we want to encode. Example:
#sentence1 = 'This framework generates embeddings for each input sentence'
#sentence2 = 'generates sentence embeddings'

def _embedding_distance(sentence1, sentence2):
  #Sentences are encoded by calling model.encode()
  embedding1 = model.encode([sentence1])
  embedding2 = model.encode([sentence2])
  #Euclidean distance between the two embeddings (smaller means more similar)
  result = euclidean_distances(
    [embedding1[0]],
    [embedding2[0]]
    )
  return (result[0])[0]


def EucladianDESC_B(sentence1,sentence2):
  return _embedding_distance(sentence1, sentence2)


def EucladianTITLE_B(sentence1,sentence2):
  return _embedding_distance(sentence1, sentence2)


def EucladianCOLOR_B(sentence1,sentence2):
  return _embedding_distance(sentence1, sentence2)



3. Fuzz Score

In [72]:
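# Note: DataFrame.where keeps every row and fills the rows of the other shop with NaN,
# so at any index i only one of the two frames holds real text; the other holds 'nan' after astype('U').
AboutYou_df=offers_training_df.where(offers_training_df['shop']=='aboutyou')
Zolando_df=offers_training_df.where(offers_training_df['shop']=='zalando')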

fuzz_desc=[]
fuzz_title=[]
fuzz_color = []
row=len(Zolando_df)

Zolando_df['title']=Zolando_df['title'].values.astype('U')
AboutYou_df['title']=AboutYou_df['title'].values.astype('U')

Zolando_df['description']=Zolando_df['description'].values.astype('U')
AboutYou_df['description']=AboutYou_df['description'].values.astype('U')

Zolando_df['color']=Zolando_df['color'].values.astype('U')
AboutYou_df['color']=AboutYou_df['color'].values.astype('U')
    
#fuzz similarity for description
for i in range(row):
    text1 = Zolando_df['description'][i]
    text2 = AboutYou_df['description'][i]
    fuzz_desc.append(fuzz.partial_ratio(text1, text2))  
    
#fuzz similarity for title
for i in range(row):
    text1 = Zolando_df['title'][i]
    text2 = AboutYou_df['title'][i]
    fuzz_title.append(fuzz.partial_ratio(text1, text2))   

#fuzz similarity for color
for i in range(row):
    text1 = Zolando_df['color'][i]
    text2 = AboutYou_df['color'][i]
    fuzz_color.append(fuzz.partial_ratio(text1, text2))  


fuzz_desc=np.array(fuzz_desc)
fuzz_title=np.array(fuzz_title)
fuzz_color = np.array(fuzz_color)
FuzzFile_df=pd.DataFrame(data={'fuzzDescription':fuzz_desc, 'fuzzTitle':fuzz_title, 'fuzzColor':fuzz_color})
offers_training_df=offers_training_df.join(FuzzFile_df,how='left')
# offers_training_df['fuzz_score'] = offers_training_df.apply(lambda x:fuzz_score(str1,x.fuzz_score), axis=1)
# offers_training_df.head(1)
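
A fuzzy score is most informative when it compares a Zalando offer with the AboutYou offer it is paired with, rather than two rows that happen to share an index. The sketch below scores known match pairs by looking both texts up by offer id. It uses `difflib.SequenceMatcher` from the standard library as a stand-in, so the numbers are not the same as those of `fuzz.partial_ratio`. All names and sample titles are hypothetical.

```python
from difflib import SequenceMatcher

def ratio_0_100(text1, text2):
    """Similarity on a 0-100 scale using difflib (a stand-in, not thefuzz itself)."""
    return round(100 * SequenceMatcher(None, text1, text2).ratio())

# Hypothetical titles keyed by offer id, and known (zalando, aboutyou) match pairs
zalando_titles = {'z1': 'sunday strickkleid', 'z2': 't-shirt basic'}
aboutyou_titles = {'a1': 'strickkleid sunday', 'a2': 't-shirt basic'}
pairs = [('z1', 'a1'), ('z2', 'a2')]

# Score each matched pair by looking up both sides through their ids
scores = {
    (zal_id, ay_id): ratio_0_100(zalando_titles[zal_id], aboutyou_titles[ay_id])
    for zal_id, ay_id in pairs
}
print(scores)
```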

# Matching

In [20]:
def get_shops_for_brand(offers, brands):
    """ Get mapping for brands in between the two shops """
    
    mapping = {}
    for brand in brands:
        shops = offers[offers["brand"] == brand]["shop"].unique()
        for shop in shops:
            mapping.setdefault(shop, [])
            mapping[shop].append(brand)
        print(f'Brand: "{brand}" is in {", ".join(shops)}')
    return mapping

In [21]:
def get_offers_by_shop(offers, mapping):
    """ Get offers per shop """
    
    offers_zal = offers[
        (offers['shop'] == 'zalando') & 
        (offers['brand'].isin(mapping['zalando']))
    ]
    offers_comp = offers[
        (offers['shop'] == 'aboutyou') &
        (offers['brand'].isin(mapping['aboutyou']))
    ]
    return offers_zal, offers_comp

In [22]:
def get_features(offers):
    """ Extract some text features using title and color """
    
    offers['text'] = offers[
        ['title','color']
    ].apply(lambda x : f"{x[0]} {x[1].split('|')[0]}", axis=1)
    
    return offers[['offer_id', 'text']].values

In [24]:
def matcher(zal_offers, comp_offers, brand_block_index):
    
    
    # Get text from offers
    comp_text = comp_offers[:, 1]
    choices_dict = {idx: el for idx, el in enumerate(comp_text)}
    
    predicted_matches = []

    # For each zalando offer
    for zal_offer_id, zal_text in zal_offers:
        
        # Extract the best match using TheFuzz's package
        title, score, index = process.extractOne(zal_text, choices_dict) 
        comp_offer_id = comp_offers[index][0]

        # Add predicted match
        predicted_matches.append(
            {
                'zalando': zal_offer_id,
                'aboutyou': comp_offer_id,
                'brand': brand_block_index
            }
        )

    return pd.DataFrame(predicted_matches)

In [25]:
def get_brand_predictions(brand_pattern, brand_unique_index):
    """ 
    Custom pipeline to get the brand mapping, offers per shop, extract the features and generate predictions
    """

    list_brands = [
        brand
        for brand in brands_training
        if brand_pattern in brand.lower()
    ]

    # Get brand mapping
    brand_mapping = get_shops_for_brand(offers_training_df, list_brands)
    print(f'Mapping: {brand_mapping}')

    # Get offers
    brand_offers_zal, brand_offers_comp = get_offers_by_shop(offers_training_df, brand_mapping)
    
    print(f'Number of "{brand_pattern}" products: {len(brand_offers_zal) + len(brand_offers_comp):,} (' + \
          f'Zalando: {len(brand_offers_zal):,} ' + \
          f'and AboutYou: {len(brand_offers_comp):,})')

    # Get features
    brand_offers_zal_features = get_features(brand_offers_zal)
    brand_offers_comp_features = get_features(brand_offers_comp)

    # Match!
    predictions = matcher(
        brand_offers_zal_features, 
        brand_offers_comp_features, 
        brand_unique_index
    )
    
    print(f'Number of predicted matches for "{brand_pattern}": {len(predictions):,}')
    
    return brand_offers_zal, brand_offers_comp, predictions

# Evaluation

In [26]:
def explore_match(match):
    """ Explore a match with the offers' images """
    
    # get offer ids
    zal_offer_id = match['zalando']
    comp_offer_id = match['aboutyou']
    
    # get offers
    zalando_offer = offers_training_df[offers_training_df['offer_id'] == zal_offer_id].iloc[0]
    comp_offer = offers_training_df[offers_training_df['offer_id'] == comp_offer_id].iloc[0]
    
    # show images and text
    print(f"Zalando: {zalando_offer['title']} {zalando_offer['color']}")
    plot_images(zalando_offer)
    print(f"AboutYou: {comp_offer['title']} {comp_offer['color']}")
    plot_images(comp_offer)

In [27]:
def get_true_matches_brand(zal_offers):
    """ Get true matches based on their brand block """
    
    # get brand block / mapping index from the training matches
    indexes = zal_offers.merge(
        matches_training_df,
        left_on='offer_id',
        right_on='zalando',
        suffixes=['offer', 'match']
    )['brandmatch'].unique()
    
    return matches_training_df[matches_training_df['brand'].isin(indexes)]

In [28]:
def get_metrics(true_matches, predicted_matches, offers_comp):
    """ Calculate performance metrics """
    
    # True Positives
    TP = len(
        true_matches.merge(
            predicted_matches, 
            on=['zalando', 'aboutyou'], 
            how='inner', 
        )
    )
    
    # False Negatives
    FN = len(true_matches) - TP
    
    # Actual Positives
    positives = len(true_matches)
    assert positives == TP + FN
    
    # Actual Negatives (with respect to the competitor)
    negatives = len(offers_comp) - positives
    
    # Actual negative values (with respect to the competitor)
    offers_comp_with_matches = offers_comp.merge(
        true_matches, 
        left_on='offer_id',
        right_on='aboutyou',
        how='outer',
        indicator=True
    )
    negative_values = offers_comp_with_matches[
        offers_comp_with_matches['_merge'] == 'left_only'
    ]['offer_id'].unique()
    
    assert negatives == len(negative_values)
    
    # Competitor predictions
    comp_preds = predicted_matches['aboutyou'].unique()
    
    # False Positives (with respect to the competitor)
    FP = len(np.intersect1d(negative_values, comp_preds))
    
    # True Negatives
    TN = negatives - FP
    
    # Precision, Recall and F1 metrics (guarding against division by zero)
    precision = TP / (TP + FP) if TP + FP > 0 else 0
    recall = TP / (TP + FN) if TP + FN > 0 else 0
    F1 = 0
    if precision + recall > 0:
        F1 = 2 * precision * recall / (precision + recall)
    
    metrics = dict(
        TP=TP,
        FN=FN,
        FP=FP,
        TN=TN,
        positives=positives,
        negatives=negatives,
        precision=precision,
        recall=recall,
        F1=F1,
    )
        
    return metrics
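
The precision, recall and F1 formulas can be verified by hand on a tiny example. The sketch below is a simplification: it counts a false positive for every predicted pair that is not a true pair, whereas `get_metrics` counts false positives with respect to competitor offers that have no true match. The function name `pair_metrics` and the sample ids are hypothetical.

```python
def pair_metrics(true_pairs, predicted_pairs):
    """Precision, recall and F1 over sets of (zalando_id, aboutyou_id) pairs."""
    tp = len(true_pairs & predicted_pairs)   # predicted and correct
    fp = len(predicted_pairs - true_pairs)   # predicted but wrong
    fn = len(true_pairs - predicted_pairs)   # true but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'TP': tp, 'FP': fp, 'FN': fn,
            'precision': precision, 'recall': recall, 'F1': f1}

# Three true matches; two predictions, one of which is correct
true_pairs = {('z1', 'a1'), ('z2', 'a2'), ('z3', 'a3')}
predicted_pairs = {('z1', 'a1'), ('z2', 'a9')}
m = pair_metrics(true_pairs, predicted_pairs)
```

Here precision is 1/2 and recall is 1/3, so F1 = 2 · (1/2) · (1/3) / (1/2 + 1/3) = 0.4.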

In [29]:
def get_brand_metrics(brand_offers_zal, brand_offers_comp, brand_predicted_matches):

    # Get groundtruth
    brand_true_matches = get_true_matches_brand(brand_offers_zal)
    
    print(f'Number of true matches: {len(brand_true_matches):,}')

    # Get metrics
    brand_metrics = get_metrics(brand_true_matches, brand_predicted_matches, brand_offers_comp)
    
    return brand_true_matches, brand_metrics

In [31]:
# Explore a particular predicted match
#predicted_match = quiksilver_predicted_matches.iloc[27]
#explore_match(predicted_match)

In [33]:
#get_offer(offers_training_df, predicted_match, 'zalando')#

In [34]:
def get_Top_product(df):
  # Keep only the rows whose scores exceed every threshold
  df_test = df.loc[df['EucladianBERT'] > 0.10]
  df_test = df_test.loc[df_test['EucladianDESC_B'] > 0.50]
  df_test = df_test.loc[df_test['EucladianTITLE_B'] > 0.10]
  df_test = df_test.loc[df_test['EucladianCOLOR_B'] > 0.10]
  Matched_count_aboutyou = df_test['offer_id'].count()
  print(f"Number of matches: {Matched_count_aboutyou:,}")

  # Sort by the three scores in descending order (ascending expects booleans, not strings)
  column_List = ['EucladianDESC_B','EucladianTITLE_B','EucladianCOLOR_B']
  ascending_Order_List = [False, False, False]
  ListWScores_df = df_test.sort_values(by = column_List, ascending = ascending_Order_List)
  ListWScores_df.to_csv('ListWithScores.csv')
  return ListWScores_df

# Submissions

In [39]:
dkny_brands = [
    brand
    for brand in brands_test
    if 'dkny' in brand.lower()
]

dkny_brand_mapping = get_shops_for_brand(offers_test_df, dkny_brands)
dkny_brand_mapping

Brand: "DKNY" is in aboutyou, zalando
Brand: "DKNY Performance" is in aboutyou
Brand: "DKNY Intimates" is in aboutyou
Brand: "DKNY Sport" is in aboutyou


{'aboutyou': ['DKNY', 'DKNY Performance', 'DKNY Intimates', 'DKNY Sport'],
 'zalando': ['DKNY']}

In [41]:
gant_brands = [
    brand
    for brand in brands_test
    if 'gant' in brand.lower()
]

gant_brand_mapping = get_shops_for_brand(offers_test_df, gant_brands)
gant_brand_mapping

Brand: "GANT" is in aboutyou
Brand: "Gant" is in zalando


{'aboutyou': ['GANT'], 'zalando': ['Gant']}
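
The output above shows why the brand filter lowercases before comparing: the same brand is spelled `GANT` in one shop and `Gant` in the other. The following sketch reproduces that grouping on plain tuples, without pandas. The function name `brands_by_shop` and the sample rows are hypothetical.

```python
from collections import defaultdict

def brands_by_shop(offers, pattern):
    """Group brand names containing `pattern` (case-insensitive) by shop."""
    mapping = defaultdict(list)
    for shop, brand in offers:
        # casefold() makes 'GANT' and 'Gant' both match the pattern 'gant'
        if pattern in brand.casefold() and brand not in mapping[shop]:
            mapping[shop].append(brand)
    return dict(mapping)

# Hypothetical (shop, brand) rows mirroring the GANT / Gant output
offers = [('aboutyou', 'GANT'), ('zalando', 'Gant'),
          ('zalando', 'Gant'), ('aboutyou', 'DKNY')]
gant_mapping = brands_by_shop(offers, 'gant')
print(gant_mapping)
```

The original spellings are kept as values so that they can still be used to filter the offers of each shop.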

In [43]:
# Brand mappings for brands in test offers
test_mapping = [
    dkny_brand_mapping,
    gant_brand_mapping
]

In [45]:
test_mapping1 = [
    dkny_brand_mapping
]

test_mapping2 = [
    gant_brand_mapping
]

In [None]:
# Failed Experiments
def PREDICTION():
