# Earbuds Market: Bol vs CoolBlue

In this part of the project I use the data scraped by the other programs to:
    1. Identify the key differences between the product offerings at Bol and CoolBlue.
    2. Identify the marketing strategies of the two companies by analysing review ratings.
    3. Try to predict the outcome of a review using NLP.

Let's start by importing the required libraries.

In [1]:
# Analysis
import pandas as pd
import numpy as np

# Visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# sns.set() resets the style and context to its defaults, so pass everything in one call
sns.set(style='whitegrid', context='notebook', font='times new roman', font_scale=1, palette='Greens')

# NLP
from nltk.corpus import stopwords
stop_words=stopwords.words('dutch')
import string as string_module


In [2]:
def simplify_string(string,remove_stops):
    nopunc = [char for char in string if char not in string_module.punctuation]
    nopunc=''.join(nopunc) 
    nopunc=nopunc.lower()
    if remove_stops:
        clean_string = [word for word in nopunc.split() if word.lower() not in stop_words]
        return ' '.join(clean_string)
    else:
        return nopunc
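
To make the effect of `remove_stops` concrete, here is a minimal stand-alone sketch of the same cleaning steps. It assumes a tiny hand-written stop-word set (`DEMO_STOPS`, hypothetical) instead of the full nltk Dutch list, and `simplify_demo` is an illustrative name, not the function used in the rest of this notebook.

```python
import string

# Hypothetical mini stop-word list; the notebook uses nltk's full Dutch list instead.
DEMO_STOPS = {"met", "de", "het", "een"}

def simplify_demo(text, remove_stops, stops=DEMO_STOPS):
    """Lower-case `text`, drop ASCII punctuation, optionally drop stop words."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation).lower()
    if not remove_stops:
        return no_punct
    # Splitting on whitespace also collapses the gaps left by removed punctuation
    return " ".join(word for word in no_punct.split() if word not in stops)

kept = simplify_demo("AirPods Pro - met Draadloze Oplaadcase", remove_stops=False)
dropped = simplify_demo("AirPods Pro - met Draadloze Oplaadcase", remove_stops=True)
print(repr(kept))
print(repr(dropped))
```

Note that without stop-word removal the deleted hyphen leaves a double space behind, which a character-level distance will count as an extra edit.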

In [3]:
EarBuds_Bol=pd.read_pickle('EarBuds_Bol')
EarBuds_Bol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Manufacturer   288 non-null    object 
 1   Name           288 non-null    object 
 2   Price [EUR]    286 non-null    float64
 3   Discount       288 non-null    object 
 4   Ret P [EUR]    286 non-null    float64
 5   Stars [x/5.0]  288 non-null    object 
 6   S_count        288 non-null    int64  
 7   Description    288 non-null    object 
 8   Pros           288 non-null    object 
 9   Cons           288 non-null    object 
 10  Reviews        288 non-null    object 
dtypes: float64(2), int64(1), object(8)
memory usage: 24.9+ KB


In [4]:
EarBuds_CoolBlue=pd.read_pickle('EarBuds_CoolBlue')
EarBuds_CoolBlue.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164 entries, 0 to 163
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Manufacturer   164 non-null    object 
 1   Name           164 non-null    object 
 2   Price [EUR]    164 non-null    float64
 3   Discount       164 non-null    object 
 4   Ret P [EUR]    164 non-null    float64
 5   Stars [x/5.0]  164 non-null    object 
 6   S_count        164 non-null    int64  
 7   Description    164 non-null    object 
 8   Pros           164 non-null    object 
 9   Cons           164 non-null    object 
 10  Reviews        164 non-null    object 
dtypes: float64(2), int64(1), object(8)
memory usage: 14.2+ KB


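Nice, both data sets have the same columns with the same types. Most of the data cleaning was already done during the scraping process; the only gaps left are two missing values in each of the Bol price columns (286 of 288 non-null).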
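
The comparison of the two `info()` outputs can also be done programmatically. The sketch below uses two tiny made-up frames (`bol_demo` and `coolblue_demo`, both hypothetical) rather than the real pickles, and only shows one possible way to check that the schemas agree and to count missing prices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the scraped frames, with a subset of the real columns
bol_demo = pd.DataFrame({"Name": ["a", "b"],
                         "Price [EUR]": [19.99, np.nan],
                         "S_count": [12, 0]})
coolblue_demo = pd.DataFrame({"Name": ["c"],
                              "Price [EUR]": [29.99],
                              "S_count": [3]})

# True only if both frames have the same column names with the same dtypes
same_schema = bol_demo.dtypes.equals(coolblue_demo.dtypes)

# Number of rows without a price in the Bol-like frame
missing_prices = int(bol_demo["Price [EUR]"].isna().sum())

print(same_schema, missing_prices)
```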

In [5]:
EarBuds_CoolBlue['Name']

0      AirPods Pro met Draadloze Oplaadcase
1                  AirPods 2 met oplaadcase
2                      Powerbeats Pro Zwart
3                  Elite 75t Titanium Zwart
4        AirPods 2 met draadloze oplaadcase
                       ...                 
159             Ink'd+ Active Wireless Rood
160                Minor II Bluetooth Zwart
161                             Tarah Zwart
162                         LIVE 300TWS Wit
163                                  Zilver
Name: Name, Length: 164, dtype: object

In [6]:
EarBuds_Bol['Name']

0           Tune 220TWS - Volledig draadloze oordopjes -
1              Tune 120TWS - Zwart - Volledige draadloze
2                         Airpods Pro - met Active Noise
3                                         Galaxy Buds+ -
4                                          Galaxy Buds -
                             ...                        
283    CALIBER Oordopjes MAC070BT/W witte stereo true...
284                  TWS1 - In-ear TWS koptelefoon / Wit
285      Happy Plugs Hoofdtelefoon Air 1 plus Earbud wit
286    HOCO ES32 Plus - Volledig Draadloze Oordopjes ...
287    ES42 - Draadloze oordopjes - Bluetooth oortjes...
Name: Name, Length: 288, dtype: object

In [7]:
import Levenshtein as lev


ModuleNotFoundError: No module named 'Levenshtein'

In [10]:
import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Populate the first column and first row with the prefix lengths
    # (this also works when one of the strings is empty)
    distance[:, 0] = np.arange(rows)
    distance[0, :] = np.arange(cols)

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        # The bottom-right cell holds the distance between the full strings
        Ratio = ((len(s)+len(t)) - distance[-1][-1]) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string s to string t
        return "The strings are {} edits away".format(distance[-1][-1])

In [11]:
levenshtein_ratio_and_distance('rens', 'redafsansdadafsz', ratio_calc = True)

0.4
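
The value 0.4 can be checked by hand: every letter of 'rens' appears in order inside the longer string, so the cheapest alignment is 12 insertions and the ratio is (20 - 12) / 20. The compact sketch below reproduces this with a two-row variant of the same dynamic programme; `edit_distance` is an illustrative helper, not the function defined above.

```python
def edit_distance(s, t, sub_cost=1):
    """Levenshtein distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else sub_cost
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

s, t = "rens", "redafsansdadafsz"
dist = edit_distance(s, t, sub_cost=2)  # substitution cost 2, as in the ratio mode
total = len(s) + len(t)
ratio = (total - dist) / total
print(dist, ratio)
```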

In [12]:
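# Preview the normalised names. The results are not assigned, so the 'Name' columns stay unchanged,
# and only the last expression (CoolBlue) is displayed.
EarBuds_Bol['Name'].apply(lambda x: simplify_string(x,False))
EarBuds_CoolBlue['Name'].apply(lambda x: simplify_string(x,False))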

0      airpods pro met draadloze oplaadcase
1                  airpods 2 met oplaadcase
2                      powerbeats pro zwart
3                  elite 75t titanium zwart
4        airpods 2 met draadloze oplaadcase
                       ...                 
159               inkd active wireless rood
160                minor ii bluetooth zwart
161                             tarah zwart
162                         live 300tws wit
163                                  zilver
Name: Name, Length: 164, dtype: object

In [39]:
# Similarity matrix: rows are Bol products, columns are CoolBlue products
Name_corr = np.zeros((len(EarBuds_Bol['Name']),len(EarBuds_CoolBlue['Name'])))
# Only the first 10 Bol products are compared here; the remaining rows stay zero
for j in range(0,10):
    for i in range(0,len(EarBuds_CoolBlue['Name'])):
        Name_corr[j,i]=levenshtein_ratio_and_distance(EarBuds_Bol['Name'].iloc[j],EarBuds_CoolBlue['Name'].iloc[i],True)

    # Report the best-scoring CoolBlue name for this Bol product
    row_max=np.argmax(Name_corr[j,:])
    print(np.max(Name_corr[j,:]))
    print(EarBuds_Bol['Name'].iloc[j], 'PREDICT', EarBuds_CoolBlue['Name'].iloc[row_max])

0.46153846153846156
Tune 220TWS - Volledig draadloze oordopjes - PREDICT AirPods 2 met draadloze oplaadcase
0.576271186440678
Tune 120TWS - Zwart - Volledige draadloze PREDICT Tune 120 TWS Zwart
0.5757575757575758
Airpods Pro - met Active Noise PREDICT AirPods Pro met Draadloze Oplaadcase
0.8666666666666667
Galaxy Buds+ - PREDICT Galaxy Buds+ Wit
0.8571428571428571
Galaxy Buds - PREDICT Galaxy Buds Wit
0.48175182481751827
Draadloze Oordopjes - Met Draadloze Oplaadcase - Alternatief Airpods - Airpods - Oortjes - PREDICT AirPods 2 met oplaadcase + AirPods Leren Hoesje
0.4594594594594595
SHB2505 - Volledig draadloze oordopjes - PREDICT AirPods 2 met draadloze oplaadcase
0.5747126436781609
Professional Draadloze Oordopjes - Met Oplaadcase - PREDICT AirPods Pro met Draadloze Oplaadcase
0.5681818181818182
Professional+ Draadloze Oordopjes - Met Oplaadcase - PREDICT AirPods Pro met Draadloze Oplaadcase
0.43137254901960786
S80 - Draadloze Bluetooth Oordopjes - Oortjes Met Oplaadcase - Wit PRED
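
Several of the best matches above score below 0.5 and pair unrelated products (for example the SHB2505 with an AirPods 2 listing), which suggests accepting a match only above some minimum score. The sketch below illustrates that idea with `difflib.get_close_matches` from the standard library. Note that difflib's similarity ratio is a different metric from the Levenshtein ratio used in this notebook, and the candidate list, the `best_match` helper and the cutoff of 0.6 are all assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical, already normalised CoolBlue names
coolblue_names = [
    "galaxy buds wit",
    "airpods 2 met draadloze oplaadcase",
    "tune 120 tws zwart",
]

def best_match(name, candidates, cutoff=0.6):
    """Return the most similar candidate, or None if nothing reaches the cutoff."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_match("galaxy buds", coolblue_names))  # close to one candidate
print(best_match("qqqq", coolblue_names))         # shares no characters with any candidate
```

Returning `None` instead of the nearest name makes it explicit that a Bol product may simply have no counterpart at CoolBlue.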

In [34]:
Name_corr

array([[0.375     , 0.29411765, 0.25      , ..., 0.18181818, 0.23728814,
        0.16      ],
       [0.33766234, 0.33846154, 0.26229508, ..., 0.26923077, 0.25      ,
        0.17021277],
       [0.57575758, 0.55555556, 0.32      , ..., 0.19512195, 0.13333333,
        0.16666667],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [20]:
EarBuds_CoolBlue['Name'].iloc[0]

'AirPods Pro met Draadloze Oplaadcase'

In [26]:
a='JBL Tune 220TWS - Volledig draadloze oordopjes - Zwart'
b='JBL TUNE220TWS'
c='sssssssssssss'

a=simplify_string(a,False)
b=simplify_string(b,False)

In [24]:
levenshtein_ratio_and_distance(a,b,True)

0.5555555555555556

In [27]:
levenshtein_ratio_and_distance(b,c,True)

0.07407407407407407