<h1 style="font-family:verdana;"> <center> Customer Segmentation on Online Retail V2 and Data Visualization with Plotly </center> </h1>


***

<p style="font-size:15px; font-family:verdana; line-height: 1.7em"> Two years after working on the "Online Retail Dataset", I wanted to come back to this project and see if I could push it further and get the most out of this data. <span style="color:green;"> Luckily </span> for me, a V2 of the dataset has been released and it now includes the 2009 data. More data equals more fun, so let's give it a try. 
    
<div style="font-size:15px; font-family:verdana;"> This new project is divided into 6 parts: <br><br>
    
<ol>
    <li>Data cleaning </li>
    <li>Product Tagging </li>
    <li>Feature Engineering </li>
    <li>Customer Segmentation </li>
    <li>Training a supervised model on customer categories </li>
    <li><span style="color:green;">Data visualization with dash for jupyter </span></li>
     
</ol>

</div>

<br>
    
<br><br>


***

In [None]:
!pip install texthero

In [None]:
!pip install jupyter_dash

In [None]:
# Basic libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from scipy import stats
import json

# NLP libraries
import texthero as hero
from nltk.tokenize import ToktokTokenizer

# Data visualization libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots



# Sklearn libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import model_selection

from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from xgboost import XGBClassifier
from xgboost import plot_importance



In [None]:
df = pd.read_csv('/kaggle/input/online-retail-ii-uci/online_retail_II.csv')

***

<h1 id="clean" style="font-family:verdana;"> 
    <center>1. Cleaning Data 🧹
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#clean">¶</a>
    </center>
</h1>

> <h2 id="missing" style="font-family:verdana;"> 
>          1.1 Missing data 👻
>         <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#missing">¶</a>
> 
</h2>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em"> After an exploratory analysis of the dataset, it appears that <span style="color:crimson;"> 22% </span>of the customer IDs are missing, which is very problematic since I want to do a <span style="color:crimson;"> Customer </span> Segmentation later. I tried to recover them from the invoice number or the date, without success. It's a shame to lose 22% of the data, but we don't have a choice.

In [None]:
df = df.dropna(subset=["Customer ID"])

<h2 id="duplicates" style="font-family:verdana;"> 
         1.2 Duplicate values
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#duplicates">¶</a>

</h2>

In [None]:
print('Duplicate entries: {}'.format(df.duplicated().sum()))
df.drop_duplicates(inplace = True)

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b>Since the duplicated rows don't follow one another in the dataset, one could think that the customer added the same product to their basket several times without updating the quantity. Here too, the choice is hard to make. But after trying this on a few websites, it seems that the quantity is always updated when you add the same product again. So I'll treat them as duplicates, even though this data is from 2010.</b><br><br>
</div>
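One way to check the "duplicates aren't adjacent" observation is to flag every copy with `duplicated(keep=False)` and compare row positions. This is only a minimal sketch on a made-up frame: the column names mirror the dataset, but the rows and the crude adjacency test are my own assumptions.

```python
import pandas as pd

# Toy invoice lines (made-up values); rows 0 and 2 are exact copies.
toy = pd.DataFrame({
    "Invoice": ["489434", "489434", "489434", "489435"],
    "StockCode": ["85048", "79323P", "85048", "22350"],
    "Quantity": [12, 12, 12, 6],
})

# keep=False marks every member of a duplicated group, not only the repeats.
dup_mask = toy.duplicated(keep=False)
dup_positions = [i for i, flag in enumerate(dup_mask) if flag]

# Crude check: are any two flagged rows sitting right next to each other?
adjacent = any(b - a == 1 for a, b in zip(dup_positions, dup_positions[1:]))

# drop_duplicates keeps the first copy of each group.
deduplicated = toy.drop_duplicates()
print(dup_positions, adjacent, len(deduplicated))
```

On the real data, the same mask can be used to eyeball a few duplicated groups before deciding to drop them.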

<h2 id="dupplicates" style="font-family:verdana;"> 
         1.3 Stock Code
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#dupplicates">¶</a>

</h2>

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b> In this dataset, there are several special transactions which aren't products. For example, a line can have 'Discount' as its description, which probably means that the customer received a discount on that purchase. Before deleting the lines that aren't products, I'll create 2 features, Discount and Postage, in which I'll store the discount and postage amounts attached to each invoice.</b><br><br>
</div>
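The invoice-level lookup described here can also be sketched without a row-by-row loop. The helper name `invoice_level_price` and the toy rows are hypothetical; the sketch assumes that when an invoice has several special lines, the last one wins, which is how a sequential loop over those lines behaves.

```python
import pandas as pd

# Made-up invoice lines: invoice A1 has a discount line, B2 a postage line.
toy = pd.DataFrame({
    "Invoice": ["A1", "A1", "A1", "B2", "B2"],
    "StockCode": ["85048", "D", "22350", "21232", "POST"],
    "Price": [6.95, 5.00, 2.10, 1.25, 18.00],
})

def invoice_level_price(frame, code):
    """For every row, return the price of the `code` line on the same invoice (0 if none)."""
    special = frame[frame["StockCode"] == code]
    # One value per invoice; if there are several special lines, keep the last one.
    per_invoice = (special.drop_duplicates("Invoice", keep="last")
                          .set_index("Invoice")["Price"])
    return frame["Invoice"].map(per_invoice).fillna(0)

toy["Discount"] = invoice_level_price(toy, "D")
toy["Postage"] = invoice_level_price(toy, "POST")
print(toy)
```

The `map` call does one lookup per row instead of one full-frame scan per special line, which matters on a dataset of this size.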

In [None]:
df['Discount'] = 0
for index, col in  df[df['StockCode']=='D'].iterrows():
    invoice = col['Invoice']
    price = col['Price']
    
    df.loc[(df.Invoice == invoice), 'Discount'] = price
    

In [None]:
df['Postage'] = 0
for index, col in  df[df['StockCode']=='POST'].iterrows():
    invoice = col['Invoice']
    price = col['Price']
    
    df.loc[(df.Invoice == invoice), 'Postage'] = price
    

In [None]:
list_special_codes = df[df['StockCode'].str.contains('^[a-zA-Z]+', regex=True)]['StockCode'].unique()
list_special_codes 

In [None]:
# Drop every row whose stock code is one of the special (non-product) codes
df = df[~df['StockCode'].isin(list_special_codes)]

<h2 id="canceled" style="font-family:verdana;"> 
         1.4 Canceled Orders
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#canceled">¶</a>

</h2>

In [None]:
#This part was inspired by Fabien Daniel's brilliant work in his Notebook on customer segmentation.

df_cleaned = df.copy(deep = True)
df_cleaned['QuantityCanceled'] = 0

entry_to_remove = [] ; doubtfull_entry = []

for index, col in  df.iterrows():
    if (col['Quantity'] > 0) or col['Description'] == 'Discount': continue        
    df_test = df[(df['Customer ID'] == col['Customer ID']) &
                         (df['StockCode']  == col['StockCode']) & 
                         (df['InvoiceDate'] < col['InvoiceDate']) & 
                         (df['Quantity']   > 0)].copy()
    #_________________________________
    # Cancelation WITHOUT counterpart
    if (df_test.shape[0] == 0): 
        doubtfull_entry.append(index)
    #________________________________
    # Cancelation WITH a counterpart
    elif (df_test.shape[0] == 1): 
        index_order = df_test.index[0]
        df_cleaned.loc[index_order, 'QuantityCanceled'] = -col['Quantity']
        entry_to_remove.append(index)        
    #______________________________________________________________
    # Various counterparts exist in orders: we delete the last one
    elif (df_test.shape[0] > 1): 
        df_test.sort_index(axis=0 ,ascending=False, inplace = True)        
        for ind, val in df_test.iterrows():
            if val['Quantity'] < -col['Quantity']: continue
            df_cleaned.loc[ind, 'QuantityCanceled'] = -col['Quantity']
            entry_to_remove.append(index) 
            break    

In [None]:
print("entry_to_remove: {}".format(len(entry_to_remove)))
print("doubtfull_entry: {}".format(len(doubtfull_entry)))

In [None]:
df_cleaned.drop(entry_to_remove, axis = 0, inplace = True)
df_cleaned.drop(doubtfull_entry, axis = 0, inplace = True)
remaining_entries = df_cleaned[(df_cleaned['Quantity'] < 0) & (df_cleaned['StockCode'] != 'D')]
print("nb of entries to delete: {}".format(remaining_entries.shape[0]))
remaining_entries[:5]

In [None]:
df_cleaned.drop(remaining_entries.index, axis = 0, inplace = True)

***

<h1 id="products_tag" style="font-family:verdana;"> 
    <center>2. Product Tagging 🏪
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#products_tag">¶</a>
    </center>
</h1>

<h2 id="desc_clean" style="font-family:verdana;"> 
         2.1 Cleaning the description
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#desc_clean">¶</a>

</h2>

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b>  I'll take this opportunity to try a new NLP library I recently discovered, TextHero, in order to clean my data. It is pretty convenient since a single line of code applies several processing steps such as lowercasing, removing stop words, removing punctuation, ... </b> <br><br>
    
</div>

In [None]:
product_df = df_cleaned.drop(columns=['StockCode', 'Invoice', 'Customer ID', 'Price', 'Quantity', 'InvoiceDate', 'Country'])
# Clean only the descriptions that survived the cleaning step
product_df['Description'] = product_df['Description'].pipe(hero.clean)

In [None]:
tw = hero.visualization.top_words(product_df['Description']).head(40)

fig = px.bar(tw)
fig.show()

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b>  From this graph, I decided to make 3 different features out of the product description: color, category and design. I'll group several things under design since they are more like product characteristics, for example "Retrospot", "Vintage", "Feltcraft", ...
    In categories we'll have something like "Cake", "Christmas", "Bottle", ...
         I also took a look at bi-grams even though I didn't put them in this notebook. </b> <br><br>
</div>

<h2 id="prod_color" style="font-family:verdana;"> 
         2.2 Product's color 🌈
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#prod_color">¶</a>

</h2>

In [None]:
token = ToktokTokenizer()

In [None]:
def TagExtractor(text, tags):
    
    words=token.tokenize(text)
    
    filtered = [w for w in words if  w in tags]
    
    return ' '.join(map(str, filtered))

In [None]:
def TagRemove(text, tags):
    
    words=token.tokenize(text)
    
    filtered = [w for w in words if w not in tags]
    
    return ' '.join(map(str, filtered))

In [None]:
colors = ['black', 'blue', 'brown', 'gold', 'green', 'grey', 'orange', 'pink', 'purple', 'red', 'silver', 'white', 'yellow', 'ivory']

In [None]:
product_df['ProductColor'] = product_df['Description'].apply(lambda x: TagExtractor(x, colors)) 

In [None]:
product_df['Description'] = product_df['Description'].apply(lambda x: TagRemove(x, colors)) 

In [None]:
tw = hero.visualization.top_words(product_df['ProductColor']).head(20)

fig = px.bar(tw)
fig.show()

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b>  To extract colors, I made a list of colors and wrote a function that splits each product description into words and keeps the words that appear in that list. This feature could be very interesting: if a customer has a favourite color, the company could use it for personalized marketing campaigns. </b><br><br>
</div>
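The extract-then-remove idea can be shown on a single description. This is a minimal sketch, assuming a plain whitespace split instead of the `ToktokTokenizer` used in the notebook; the function names `extract_tags` and `remove_tags` and the sample description are made up for illustration.

```python
def extract_tags(text, tags):
    """Keep only the words of `text` that appear in `tags`, in their original order."""
    return " ".join(word for word in text.split() if word in tags)

def remove_tags(text, tags):
    """Drop the words of `text` that appear in `tags`."""
    return " ".join(word for word in text.split() if word not in tags)

# A set gives constant-time membership tests, which helps on many rows.
colors = {"red", "white", "blue", "pink"}
description = "red retrospot cake stand white"

found = extract_tags(description, colors)   # words that are colors
rest = remove_tags(description, colors)     # what is left for the next tagging step
print(found, "|", rest)
```

Removing the matched words after each pass keeps the remaining description shorter, so the later design and category passes only see words that haven't been tagged yet.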

<h2 id="prod_des" style="font-family:verdana;"> 
         2.3 Product's design
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#prod_des">¶</a>

</h2>

In [None]:
# 'feltcraft' was misspelled ('feltrcraft') and could never match; duplicate 'spaceboy' removed
Design = ['gingham', 'butterfly', 'chocolate', 'zinc', 'hearts', 'star', 'skull', 'dolly', 'wood', 'retro', 'strawberry',
         'mini', 'polkadot', 'spot', 'cream', 'rose', 'spaceboy', 'ceramic', 'glasse', 'vintage', 'retrospot', 'heart',
         'spots', 'skulls', 'scandinavian', 'london', 'french', 'wooden', 'woodland', 'bakelike', 'feltcraft', 'porcelain',
         'glass', 'traditional', 'bird', 'birds', 'flower', 'antique', 'tube']

In [None]:
product_df['Design'] = product_df['Description'].apply(lambda x: TagExtractor(x, Design)) 

In [None]:
stop_words = ['set', 'pack', 'small', 'large']

In [None]:
product_df['Description'] = product_df['Description'].apply(lambda x: TagRemove(x, (Design+stop_words))) 

In [None]:
tw = hero.visualization.top_words(product_df['Design']).head(20)

fig = px.bar(tw)
fig.show()

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b>  I used the same process for the design and the category. We can see that there are several kinds of collections in there. I don't know if this feature will be useful, but it wouldn't hurt to keep it for later, especially for the future dashboard. </b><br><br>
</div>

<h2 id="prod_cat" style="font-family:verdana;"> 
         2.4 Product's category
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#prod_cat">¶</a>

</h2>

<h3 id="prod_cat_man" style="font-family:verdana;"> 
         2.4.1 Semi manually tagging products
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#prod_cat_man">¶</a>

</h3>

In [None]:
Categories = ['bag', 'box', 'cake', 'christmas', 'hanging', 'light', 'holder', 'sign', 'jumbo', 'lunch', 'paper', 'tea', 'card',
              'cases', 'decoration', 'water', 'bottle', 'mug', 'party', 'garden', 'wrap', 'bowl', 'birthday', 
              'photo', 'frame', 'candle', 'key', 'ring', 'travel', 'egg', 'cup', 
              'lights', 'cutlery', 'candles', 'door', 'gift', 'clock', 'trinket', 
              'drawer', 'stand', 'pencils', 'ribbons', 'napkins', 'notebook', 'photo', 'alarm', 'dog',
             'kitchen', 'storage', 'childrens', 'cup', 'cat', 'wall', 'art', 'cushion', 'cover', 'popcorn', 'soap', 'baking', 'door']


In [None]:
product_df['Categories'] = product_df['Description'].apply(lambda x: TagExtractor(x, Categories)) 

In [None]:
pd.DataFrame(product_df['Categories'].value_counts()).to_excel('product_categories.xlsx')

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b>  From here, I moved to Excel, where I renamed and grouped the categories into labels in order to have something more reliable for the classifiers later on. </b><br><br>
    
</div>

In [None]:
product_tags = pd.read_excel('/kaggle/input/product-categories/product_categories V2.xlsx')

In [None]:
product_tags.head()

In [None]:
product_df = product_df.reset_index().merge(product_tags, on='Categories', how='left').set_index('index')

In [None]:
product_df.loc[(product_df.Description =='wicker'), 'Labels'] = 'Wicker'

product_df.loc[(product_df.Description =='assorted colour ornament'), 'Labels'] = 'Home Decoration'

product_df.loc[(product_df.Description =='tissues'), 'Labels'] = 'Essentials'

product_df.loc[(product_df.Description =='chalkboard'), 'Labels'] = 'Stationary'

product_df.loc[(product_df.Description =='milk jug'), 'Labels'] = 'Tableware'

product_df.loc[(product_df.Description =='measuring spoons'), 'Labels'] = 'Baking'

product_df.loc[(product_df.Description =='snap cards'), 'Labels'] = 'Cards'

In [None]:
product_df.loc[(product_df.Description =='regency cakestand tier'), 'Labels'] = 'Cake Decoration'

product_df.loc[(product_df.Description =='heart wicker small'), 'Labels'] = 'Hanging Decoration'

product_df.loc[(product_df.Description =='heart wicker large'), 'Labels'] = 'Hanging Decoration'

product_df.loc[(product_df.Description =='edwardian parasol'), 'Labels'] = 'Essentials'

product_df.loc[(product_df.Description =='regency teacup saucer'), 'Labels'] = 'Tea'

product_df.loc[(product_df.Description =='natural slate heart chalkboard'), 'Labels'] = 'Home Decoration'

product_df.loc[(product_df.Description =='french metal door sign'), 'Labels'] = 'Door Sign'

product_df.loc[(product_df.Description =='love building block word'), 'Labels'] = 'Home Decoration'

product_df.loc[(product_df.Description =='vintage snap cards'), 'Labels'] = 'Cards'

product_df.loc[(product_df.Description =='scottie dog hot water bottle'), 'Labels'] = 'Water Bottle'

product_df.loc[(product_df.Labels =='Holders'), 'Labels'] = 'Holding Decoration'

In [None]:
# 'Door Sign' (singular) is the label assigned manually above, so it is listed next to 'Door Signs'
for label in ['Decorative Storage', 'Hanging Decoration', 'Lights', 'Candles', 'Door Sign', 'Door Signs', 'Wall Signs', 'Wicker', 'Clocks',
             'Storage', 'Frame', 'Photo Frame', 'Wall Art', 'Holding Decoration', 'Popcorn Holder']:
    product_df.loc[(product_df.Labels ==label), 'Labels'] = 'Home Decoration'

In [None]:
for label in ['Lunch Bags', 'Jumbo Bags', 'Jumbo Shopper']:
    product_df.loc[(product_df.Labels ==label), 'Labels'] = 'Bags'

In [None]:
for label in ['Cards', 'Paper', 'Cushions', 'Wraps', 'Gift Wraps']:
    product_df.loc[(product_df.Labels ==label), 'Labels'] = 'Gifts'

In [None]:
for label in ['Water Bottle', 'Essentials', 'Travel', 'Pets', 'Jewelry']:
    product_df.loc[(product_df.Labels ==label), 'Labels'] = 'Other'

In [None]:
for label in ['Cake Decoration', 'Birthday']:
    product_df.loc[(product_df.Labels ==label), 'Labels'] = 'Party'

In [None]:
for label in ['Tea', 'Baking', 'Kitchen', 'Soap']:
    product_df.loc[(product_df.Labels ==label), 'Labels'] = 'Tableware'

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b>   I decided that I had too many categories and wanted to squeeze them into 10 new labels. </b><br><br>
</div>
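The grouping of fine-grained labels into a handful of broader ones can also be written as one dictionary lookup, which makes the mapping easier to read and review in one place. A minimal sketch: the groups below are a small subset taken from the cells above, while 'Toys' and the variable names are hypothetical.

```python
import pandas as pd

# Target label -> fine-grained labels it absorbs (subset, for illustration only).
label_groups = {
    "Home Decoration": ["Lights", "Candles", "Clocks"],
    "Bags": ["Lunch Bags", "Jumbo Bags"],
    "Party": ["Cake Decoration", "Birthday"],
}

# Invert to {fine label: broad label} so each value needs a single lookup.
fine_to_broad = {fine: broad
                 for broad, fines in label_groups.items()
                 for fine in fines}

labels = pd.Series(["Lights", "Lunch Bags", "Birthday", "Toys"])

# Labels missing from the mapping (here the made-up 'Toys') are left untouched.
grouped = labels.replace(fine_to_broad)
print(grouped.tolist())
```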

In [None]:
product_df['Labels'].value_counts().sum() /product_df['Labels'].shape[0]*100

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b> One last finishing touch and we're done. With this semi-manual work, almost 60% of the data is tagged, which should give us enough data for the next step. </b><br><br>
</div>

<h3 id="prod_cat_auto" style="font-family:verdana;"> 
         2.4.2 Training classification algorithms 
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#prod_cat_auto">¶</a>

</h3>

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b> I've now landed on the fun part of this section. I'll use my labeled data to train classifiers and validate their results. Once this is done, I'll simply choose the best classifier and predict labels for the unlabeled (test) data. This was the plan, but it didn't go that smoothly. </b><br><br>
</div>

In [None]:
X = product_df.dropna(subset=['Labels']).drop_duplicates(subset=['Description'])['Description']
X_test = product_df[product_df['Labels'].isnull()]['Description']

In [None]:
X.shape, X_test.shape

In [None]:
le = preprocessing.LabelEncoder()

In [None]:
y = le.fit_transform(product_df.dropna(subset=['Labels']).drop_duplicates(subset=['Description'])['Labels'])

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.4, random_state = 46) # Do 60/40 split

In [None]:
X_vectorizer = TfidfVectorizer(analyzer = 'word',
                            )

In [None]:
X_train = X_vectorizer.fit_transform(X_train)
X_val =  X_vectorizer.transform(X_val)
X_test_tfidf = X_vectorizer.transform(X_test)

In [None]:
def print_score(y_pred, clf):
    print("Clf: ", clf.__class__.__name__)
    print("Accuracy score: {}".format(accuracy_score(y_val, y_pred)))
    print("---")    

In [None]:
dummy = DummyClassifier(strategy='prior')
sgd = SGDClassifier()
mn = MultinomialNB()
svc = LinearSVC()
perceptron = Perceptron()
pac = PassiveAggressiveClassifier()
mlpc = MLPClassifier()
rfc = RandomForestClassifier()
xgb = XGBClassifier()


for classifier in [dummy, sgd, mn, svc, perceptron, pac, mlpc, rfc, xgb]:
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_val)
    print_score(y_pred, classifier)

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b> We could get better results by tuning some parameters, but for now I'm going to keep it that way and come back to it later. </b><br><br>
</div>

In [None]:
y_pred_test = rfc.predict(X_test_tfidf) 

In [None]:
prods_non_labeled = pd.DataFrame()

In [None]:
prods_non_labeled['Description']= X_test

In [None]:
prods_non_labeled['Labels'] = le.inverse_transform(y_pred_test)

In [None]:
prods_non_labeled

In [None]:
product_df.loc[(product_df.Labels.isnull()), 'Labels'] = prods_non_labeled['Labels']

In [None]:
product_df

In [None]:
product_df.shape, df.shape

In [None]:
df_cleaned['ProductColor'] = product_df['ProductColor']
df_cleaned['Design'] = product_df['Design']
df_cleaned['Labels'] = product_df['Labels']

In [None]:
df_cleaned['Labels'].value_counts(dropna=False)

***

<h1 id="Feature_engin" style="font-family:verdana;"> 
    <center>3. Feature Engineering
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#Feature_engin">¶</a>
    </center>
</h1>

<h2 id="total_price" style="font-family:verdana;"> 
         3.1 Total Price
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/total_price">¶</a>

</h2>

In [None]:
# Total price feature

df_cleaned['TotalPrice'] = df_cleaned['Price'] * (df_cleaned['Quantity'] - df_cleaned['QuantityCanceled'])

In [None]:
df_cleaned['TotalPrice'].describe()

In [None]:
df_cleaned[df_cleaned['TotalPrice']<0]

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
     <b> We can't have more quantities canceled than bought initially. I'm deleting these lines. </b><br><br>
</div>

In [None]:
df_cleaned.drop(df_cleaned[df_cleaned['TotalPrice']<0].index, axis = 0, inplace = True)

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> Let's clean the outliers real quick. I'll delete transactions whose total price is more than 10 standard deviations away from the mean (absolute z-score above 10). </b><br><br>
</div>
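To make the threshold concrete, here is a minimal sketch of a z-score filter on made-up totals. It uses plain NumPy with the population standard deviation, which is also the default of `scipy.stats.zscore`; the values and variable names are assumptions for illustration.

```python
import numpy as np

# Made-up order totals: 200 ordinary values between 10 and 30, plus one extreme value.
totals = np.append(np.linspace(10.0, 30.0, 200), 5000.0)

# z-score: distance from the mean, expressed in standard deviations.
z = np.abs((totals - totals.mean()) / totals.std())

threshold = 10
kept = totals[z <= threshold]

print(round(float(z.max()), 2), len(totals), len(kept))
```

Note that the outlier itself inflates the mean and the standard deviation, so a threshold of 10 is very permissive and only removes truly extreme rows.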

In [None]:
z = np.abs(stats.zscore(df_cleaned['TotalPrice']))
threshold = 10

df_cleaned_outliers = df_cleaned.copy(deep=True)
df_cleaned_outliers['Outliers'] = z

df_cleaned_outliers[df_cleaned_outliers['Outliers']>threshold]

In [None]:
df_cleaned.drop(df_cleaned_outliers[df_cleaned_outliers['Outliers']>threshold].index, axis = 0, inplace = True)

<h2 id="time_features" style="font-family:verdana;"> 
         3.2 Time Features
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/time_features">¶</a>

</h2>

In [None]:
df_cleaned['InvoiceDate'] = pd.to_datetime(df_cleaned['InvoiceDate'])

In [None]:
# Vectorized datetime accessors; same values as the row-by-row lambdas
df_cleaned['Year'] = df_cleaned['InvoiceDate'].dt.year
df_cleaned['Month'] = df_cleaned['InvoiceDate'].dt.month
df_cleaned['MonthYear'] = df_cleaned['InvoiceDate'].dt.strftime("%B %Y")
df_cleaned['Weekday'] = df_cleaned['InvoiceDate'].dt.weekday  # Monday=0 ... Sunday=6
df_cleaned['Day'] = df_cleaned['InvoiceDate'].dt.day
df_cleaned['Hour'] = df_cleaned['InvoiceDate'].dt.hour

<h2 id="rfm" style="font-family:verdana;"> 
         3.3 RFM Principle
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#rfm**">¶</a>

</h2>

In [None]:
df_cleaned['InvoiceDate'].min()

In [None]:
df_cleaned['InvoiceDate'].max()

In [None]:
NOW = dt.datetime(2011,12,10)

In [None]:
df_cleaned.shape[0] / df_cleaned['Invoice'].value_counts().count() 

In [None]:
custom_aggregation = {}
custom_aggregation["InvoiceDate"] = lambda x:x.iloc[0]
custom_aggregation["Customer ID"] = lambda x:x.iloc[0]
custom_aggregation["TotalPrice"] = "sum"

In [None]:
rfmTable = df_cleaned.groupby("Invoice").agg(custom_aggregation)

In [None]:
# Whole days between the reference date and each invoice
# (.dt.days avoids the "timedelta64[D]" cast, which recent pandas versions reject)
rfmTable["Recency"] = (NOW - rfmTable["InvoiceDate"]).dt.days

In [None]:
rfmTable.head(5)

In [None]:
custom_aggregation = {}

custom_aggregation["Recency"] = ["min", "max"]
custom_aggregation["InvoiceDate"] = lambda x: len(x)
custom_aggregation["TotalPrice"] = "sum"

In [None]:
rfmTable_final = rfmTable.groupby("Customer ID").agg(custom_aggregation)

In [None]:
rfmTable_final.columns = ["min_recency", "max_recency", "frequency", "monetary_value"]

In [None]:
rfmTable_final.head(5)

In [None]:
first_customer = df_cleaned[df_cleaned['Customer ID']==12346.0]
first_customer.head(5)

In [None]:
quantiles = rfmTable_final.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()

In [None]:
segmented_rfm = rfmTable_final

In [None]:
def RScore(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4
    
def FMScore(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1

In [None]:
segmented_rfm['r_quartile'] = segmented_rfm['min_recency'].apply(RScore, args=('min_recency',quantiles,))
segmented_rfm['f_quartile'] = segmented_rfm['frequency'].apply(FMScore, args=('frequency',quantiles,))
segmented_rfm['m_quartile'] = segmented_rfm['monetary_value'].apply(FMScore, args=('monetary_value',quantiles,))
segmented_rfm.head()

In [None]:
segmented_rfm['RFMScore'] = segmented_rfm.r_quartile.map(str) + segmented_rfm.f_quartile.map(str) + segmented_rfm.m_quartile.map(str)
segmented_rfm.head()

In [None]:
segmented_rfm[segmented_rfm['RFMScore']=='111'].sort_values('monetary_value', ascending=False)

In [None]:
segmented_rfm.head(5)

In [None]:
segmented_rfm = segmented_rfm.reset_index()

In [None]:
segmented_rfm.head(5)

In [None]:
df_cleaned.shape

In [None]:
df_cleaned = pd.merge(df_cleaned,segmented_rfm, on='Customer ID')

In [None]:
df_cleaned = df_cleaned.drop(columns=['r_quartile', 'f_quartile', 'm_quartile'])

<h2 id="prod_exp" style="font-family:verdana;"> 
         3.4 Product categories expenses 
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#prod_exp">¶</a>

</h2>

In [None]:
for label in df_cleaned['Labels'].unique():
    col = 'Label_{}'.format(label)
    # Amount spent on this label for the row, 0 for rows of any other label
    # (direct assignment instead of a chained inplace fillna, which may not update the frame)
    df_cleaned[col] = df_cleaned['TotalPrice'].where(df_cleaned['Labels'] == label, 0)

***

<h1 id="cust_segm" style="font-family:verdana;"> 
    <center>4. Customer Segmentation
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#cust_segm">¶</a>
    </center>
</h1>

In [None]:
df_cleaned['RFMScore'] = df_cleaned['RFMScore'].astype(int)

In [None]:
df_cleaned.loc[(df_cleaned.RFMScore == 111), 'Segment'] = 'Best Customers'

In [None]:
df_cleaned.loc[(df_cleaned.RFMScore == 311), 'Segment'] = 'Almost Lost'

In [None]:
df_cleaned.loc[(df_cleaned.RFMScore == 411), 'Segment'] = 'Lost Customers'

In [None]:
df_cleaned.loc[(df_cleaned.RFMScore == 444), 'Segment'] = 'Bad Customers'

In [None]:
for code in [112, 113, 114, 212, 213, 214, 312, 313, 314, 412, 413, 414] : 
    df_cleaned.loc[(df_cleaned.RFMScore == code), 'Segment'] = 'Loyal Customers'

In [None]:
for code in [121, 131, 141, 221, 231, 241, 321, 331, 341, 421, 431, 441] : 
    df_cleaned.loc[(df_cleaned.RFMScore == code), 'Segment'] = 'Big Spenders'

In [None]:
for code in [211, 222, 122, 123, 223] : 
    df_cleaned.loc[(df_cleaned.RFMScore == code), 'Segment'] = 'Good Customers'

In [None]:
for code in [322, 232, 132, 242, 142, 224, 124] : 
    df_cleaned.loc[(df_cleaned.RFMScore == code), 'Segment'] = 'Average Customer'

In [None]:
df_cleaned.loc[df_cleaned.Segment.isnull()]['RFMScore'].value_counts()

In [None]:
for code in df_cleaned.loc[df_cleaned.Segment.isnull()]['RFMScore'].value_counts().index : 
    df_cleaned.loc[(df_cleaned.RFMScore == code), 'Segment'] = 'Not So Good Customers'

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> When you're working on a professional project, this step should be done in close collaboration with the client (or the marketing team). </b><br><br>
</div>

***

<h1 id="sup_learn" style="font-family:verdana;"> 
    <center>5. Supervised Learning 
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#sup_learn">¶</a>
    </center>
</h1>

<h2 id="feature_prep" style="font-family:verdana;"> 
         5.1 Preparing my features
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#feature_prep">¶</a>

</h2>

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> Here, I decided to split my data into train data and test data. My test data will be the new customers who first appear in the last 2 months of this dataset. Train data will be the remaining customers. </b><br><br>
</div>

In [None]:
# New customers: seen on or after the cutoff and never before it
cutoff = '2011-10-01 07:45:00'
recent_cust = set(df_cleaned.loc[df_cleaned['InvoiceDate'] >= cutoff, 'Customer ID'])
earlier_cust = set(df_cleaned.loc[df_cleaned['InvoiceDate'] < cutoff, 'Customer ID'])
new_cust = list(recent_cust - earlier_cust)

In [None]:
df_cleaned_new_cust = df_cleaned[df_cleaned['Customer ID'].isin(new_cust)]

In [None]:
df_cleaned_old_cust = df_cleaned[~df_cleaned['Customer ID'].isin(new_cust)]

In [None]:
(df_cleaned_new_cust.shape, df_cleaned_old_cust.shape)

In [None]:
custom_aggregation = {}
custom_aggregation["Customer ID"] = lambda x:x.iloc[0]
for label in df_cleaned['Labels'].unique():
    col = 'Label_{}'.format(label)  
    custom_aggregation[col] = "sum"

custom_aggregation["Quantity"] = 'sum'
custom_aggregation["Price"] = 'mean'
custom_aggregation["TotalPrice"] = 'sum'
custom_aggregation["QuantityCanceled"] = "sum"
custom_aggregation["Postage"] = lambda x:x.iloc[0]
custom_aggregation["Discount"] = lambda x:x.iloc[0]


custom_aggregation["min_recency"] = lambda x:x.iloc[0]
custom_aggregation["max_recency"] = lambda x:x.iloc[0]
custom_aggregation["frequency"] = lambda x:x.iloc[0]
custom_aggregation["monetary_value"] = lambda x:x.iloc[0]

custom_aggregation["Segment"] = lambda x:x.iloc[0]

In [None]:
df_grouped_train = df_cleaned_old_cust.groupby("Invoice").agg(custom_aggregation)

In [None]:
df_grouped_test = df_cleaned_new_cust.groupby("Invoice").agg(custom_aggregation)

In [None]:
custom_aggregation = {}

for label in df_cleaned['Labels'].unique():
    col = 'Label_{}'.format(label)  
    custom_aggregation[col] = "sum"

custom_aggregation["Quantity"] = 'mean'
custom_aggregation["Price"] = 'mean'
custom_aggregation["TotalPrice"] = 'mean'
custom_aggregation["QuantityCanceled"] = "sum"
custom_aggregation["Postage"] = "sum"
custom_aggregation["Discount"] = "sum"


custom_aggregation["min_recency"] = lambda x:x.iloc[0]
custom_aggregation["max_recency"] = lambda x:x.iloc[0]
custom_aggregation["frequency"] = lambda x:x.iloc[0]
custom_aggregation["monetary_value"] = lambda x:x.iloc[0]
custom_aggregation["Segment"] = lambda x:x.iloc[0]


In [None]:
df_grouped_final_train = df_grouped_train.groupby("Customer ID").agg(custom_aggregation)

In [None]:
df_grouped_final_test = df_grouped_test.groupby("Customer ID").agg(custom_aggregation)

In [None]:
X_train = df_grouped_final_train.drop(columns=['Segment'])

In [None]:
y_train = df_grouped_final_train['Segment']

In [None]:
le_label = preprocessing.LabelEncoder()

In [None]:
y_train = le_label.fit_transform(y_train)

In [None]:
X_test = df_grouped_final_test.drop(columns=['Segment'])

In [None]:
y_test = df_grouped_final_test['Segment']

In [None]:
y_test = le_label.transform(y_test)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 46) # Do 80/20 split

In [None]:
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

<h2 id="model_train" style="font-family:verdana;"> 
         5.2 Training models and comparing performance
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#model_train">¶</a>

</h2>

In [None]:
def print_score(y_pred, clf):
    print("Clf: ", clf.__class__.__name__)
    print("Accuracy score: {}".format(accuracy_score(y_val, y_pred)))
    print("---")    

In [None]:
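dummy = DummyClassifier(strategy='prior')  # baseline: always predicts the most frequent class
sgd = SGDClassifier()
mn = MultinomialNB()  # not fitted below: MultinomialNB rejects the negative values produced by StandardScaler
svc = LinearSVC()
perceptron = Perceptron()
pac = PassiveAggressiveClassifier()
mlpc = MLPClassifier()
rfc = RandomForestClassifier()
xgb = XGBClassifier()


# Fit every model on the training split and report accuracy on the validation split.
for classifier in [dummy, sgd, svc, perceptron, pac, mlpc, rfc, xgb]:
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_val)
    print_score(y_pred, classifier)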

In [None]:
y_pred = xgb.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
plot_importance(xgb)

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> This score was predictable: I built the segments from the RFM score alone, so XGBoost can easily recreate them. In a real context this model would help speed up customer segmentation, but here it isn't that useful. On the positive side, we tested it on new customers from the last 2 months, who have only a limited number of transactions. This means we can classify customers from just a few transactions, which is quite powerful. </b><br><br>
</div>
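
One way to check how much of this score comes from the RFM columns themselves would be to retrain without them. Below is a minimal sketch, assuming the feature table is still a DataFrame (that is, before scaling) and that the four RFM columns are named as in the aggregation above. The helper `drop_rfm_features` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# RFM-derived columns that were used to build the segments
# (names taken from the aggregation dictionaries above).
RFM_COLUMNS = ["min_recency", "max_recency", "frequency", "monetary_value"]


def drop_rfm_features(features: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the feature table without the RFM columns (hypothetical helper)."""
    present = [col for col in RFM_COLUMNS if col in features.columns]
    return features.drop(columns=present)


# Toy data for illustration only.
toy = pd.DataFrame({
    "Quantity": [10.0, 3.5],
    "TotalPrice": [25.0, 7.2],
    "min_recency": [5, 200],
    "max_recency": [300, 210],
    "frequency": [12, 1],
    "monetary_value": [900.0, 15.0],
})

reduced = drop_rfm_features(toy)
print(list(reduced.columns))
```

If accuracy stays high on the reduced table, the transaction features carry segment information on their own. If it collapses, the model was mostly reading the RFM values back.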

***

<h1 id="data_vis" style="font-family:verdana;"> 
    <center>6. Data Visualization with Dash for Jupyter
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#data_vis">¶</a>
    </center>
</h1>

<h2 id="sales" style="font-family:verdana;"> 
         6.1 Total Sales
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#sales">¶</a>

</h2>

In [None]:
custom_aggregation = {}

custom_aggregation["InvoiceDate"] = lambda x:x.iloc[0]
custom_aggregation["MonthYear"] = lambda x:x.iloc[0]

custom_aggregation["TotalPrice"] = 'sum'

In [None]:
sales_invoices_monthly = df_cleaned.groupby('MonthYear').agg(custom_aggregation).sort_values(by='InvoiceDate')
sales_invoices_monthly.head()

In [None]:
data = [go.Scatter(x=sales_invoices_monthly.index, 
                   y=sales_invoices_monthly['TotalPrice'])]

layout = go.Layout(title="Total sales", title_x=0.5)

fig = go.Figure(data=data, layout=layout)
fig.update_xaxes(type='category')

fig.show()

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> This graph clearly shows that seasonality, and Christmas in particular, has a massive impact on sales. In both 2010 and 2011, sales grow from August to December. </b><br><br>
</div>
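
To put a number on this seasonality, one option is to compute each month's share of its year's sales. Below is a minimal sketch on made-up figures. The column names `Year` and `TotalPrice` follow `df_cleaned`, while `Month` and `share_of_year` are assumptions introduced for the example.

```python
import pandas as pd

# Made-up monthly totals for illustration only.
monthly = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2010],
    "Month": [8, 9, 10, 11],
    "TotalPrice": [100.0, 150.0, 250.0, 500.0],
})

# Total sales of the year each row belongs to, aligned row by row.
yearly_total = monthly.groupby("Year")["TotalPrice"].transform("sum")

# Share of each month in the total of its year.
monthly["share_of_year"] = monthly["TotalPrice"] / yearly_total

print(monthly)
```

A share that climbs steadily toward the end of each year would confirm what the line chart suggests visually.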

<h2 id="countries" style="font-family:verdana;"> 
         6.2 Customers through the world map
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#countries">¶</a>

</h2>

In [None]:
year_options = []
for year in df_cleaned['Year'].unique():
    year_options.append({'label':str(year), 'value':year})
year_options.append({'label':'All', 'value':'All'})

In [None]:
customer_country=df_cleaned[['Country','Customer ID']].drop_duplicates()
df_cleaned_grouped = customer_country.groupby(['Country'])['Customer ID'].aggregate('count').reset_index().sort_values('Customer ID', ascending=False)


filtered_df_2009 = df_cleaned[df_cleaned['Year']==2009]
customer_country_2009=filtered_df_2009[['Country','Customer ID']].drop_duplicates()
filtered_df_2009_grouped = customer_country_2009.groupby(['Country'])['Customer ID'].aggregate('count').reset_index().sort_values('Customer ID', ascending=False)

filtered_df_2010 = df_cleaned[df_cleaned['Year']==2010]
customer_country_2010=filtered_df_2010[['Country','Customer ID']].drop_duplicates()
filtered_df_2010_grouped = customer_country_2010.groupby(['Country'])['Customer ID'].aggregate('count').reset_index().sort_values('Customer ID', ascending=False)

filtered_df_2011 = df_cleaned[df_cleaned['Year']==2011]
customer_country_2011=filtered_df_2011[['Country','Customer ID']].drop_duplicates()
filtered_df_2011_grouped = customer_country_2011.groupby(['Country'])['Customer ID'].aggregate('count').reset_index().sort_values('Customer ID', ascending=False)

In [None]:
data = [go.Choropleth(
                locations = df_cleaned_grouped['Country'],
                locationmode = 'country names',
                z = df_cleaned_grouped['Customer ID'],
                text = df_cleaned_grouped['Country'],
                colorscale = 'Rainbow',
                marker_line_color='darkgray',
                marker_line_width=0.5,
                colorbar_title = 'Customers',
                ),
        go.Choropleth(
                locations = filtered_df_2009_grouped['Country'],
                locationmode = 'country names',
                z = filtered_df_2009_grouped['Customer ID'],
                text = filtered_df_2009_grouped['Country'],
                colorscale = 'Rainbow',
                marker_line_color='darkgray',
                marker_line_width=0.5,
                colorbar_title = 'Customers',
                visible = False,  # hidden until selected in the dropdown
                ),
        go.Choropleth(
                locations = filtered_df_2010_grouped['Country'],
                locationmode = 'country names',
                z = filtered_df_2010_grouped['Customer ID'],
                text = filtered_df_2010_grouped['Country'],
                colorscale = 'Rainbow',
                marker_line_color='darkgray',
                marker_line_width=0.5,
                colorbar_title = 'Customers',
                visible = False,  # hidden until selected in the dropdown
                ),
        go.Choropleth(
                locations = filtered_df_2011_grouped['Country'],
                locationmode = 'country names',
                z = filtered_df_2011_grouped['Customer ID'],
                text = filtered_df_2011_grouped['Country'],
                colorscale = 'Rainbow',
                marker_line_color='darkgray',
                marker_line_width=0.5,
                colorbar_title = 'Customers',
                visible = False,  # hidden until selected in the dropdown
                ),
       ]

In [None]:
layout = go.Layout(
                title_text='Our customers',
                title_x=0.5,
                geo=dict(
                    showframe=False,
                    showcoastlines=False,
                    projection_type='equirectangular'
                ),
                )

In [None]:
fig = go.Figure(data=data, layout=layout)

In [None]:
# Add dropdown 
fig.update_layout( 
    updatemenus=[ 
        dict( 
            active=0, 
            buttons=list([ 
                dict(label="All", 
                     method="update", 
                     args=[{"visible": [True, False, False, False]}, 
                           {"title": "All customers"}]), 
                dict(label="2009", 
                     method="update", 
                     args=[{"visible": [False, True, False, False]}, 
                           {"title": "Customers in 2009", 
                            }]), 
                dict(label="2010", 
                     method="update", 
                     args=[{"visible": [False, False, True, False]}, 
                           {"title": "Customers in 2010", 
                            }]), 
                dict(label="2011", 
                     method="update", 
                     args=[{"visible": [False, False, False, True]}, 
                           {"title": "Customers in 2011", 
                            }]), 
            ]), 
        ) 
    ]) 
  
fig.show() 

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> This online e-commerce platform has expanded internationally over the years. It would be interesting to see how they're doing today.  </b><br><br>
</div>
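
The international expansion can also be checked numerically by counting distinct customers per year and country in a single pass. Below is a minimal sketch on toy rows. It assumes the `Year`, `Country` and `Customer ID` columns of `df_cleaned`, and the `Customers` column name is introduced for the example.

```python
import pandas as pd

# Toy transactions for illustration only.
toy = pd.DataFrame({
    "Year": [2009, 2009, 2010, 2010, 2010, 2011],
    "Country": ["United Kingdom", "France", "France", "France", "Germany", "Germany"],
    "Customer ID": [1, 2, 2, 3, 4, 4],
})

# Distinct customers per (year, country).
customers_per_year = (
    toy.groupby(["Year", "Country"])["Customer ID"]
       .nunique()
       .reset_index(name="Customers")
)

# Number of countries with at least one customer, per year.
countries_per_year = customers_per_year.groupby("Year")["Country"].nunique()

print(customers_per_year)
print(countries_per_year)
```

Filtering `customers_per_year` on a given year yields the same kind of table as the per-year frames used for the map.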

In [None]:
df_cleaned.groupby('Country')['TotalPrice'].sum().sort_values(ascending=False)[:6]

In [None]:
countries = ['EIRE', 'Netherlands', 'Germany', 'France', 'Australia'] 

countries_options = []
data = []
for country in countries:
    countries_options.append({'label':str(country), 'value':country})

for country in countries:
    df_segment = df_cleaned[df_cleaned['Country']==country]
    df_segment_grouped = df_segment.groupby('MonthYear').agg(custom_aggregation).sort_values(by='InvoiceDate')
    
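    # Only the first country is visible until another one is picked in the dropdown.
    data.append(go.Bar(x=df_segment_grouped.index, 
                   y=df_segment_grouped['TotalPrice'],
                   visible=(country == countries[0])))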
    


In [None]:
# Bar charts do not use the geo settings of the map above, so only the title is set here.
layout = go.Layout(
                title_text='Monthly sales by country',
                title_x=0.5,
                )

In [None]:
fig = go.Figure(data=data, layout=layout)

In [None]:
# Add dropdown 
fig.update_layout( 
    updatemenus=[ 
        dict( 
            active=0, 
            buttons=list([ 
                dict(label=countries[0], 
                     method="update", 
                     args=[{"visible": [True, False, False, False, False]}, 
                           {"title": "{} sales".format(countries[0])}]), 
                dict(label=countries[1], 
                     method="update", 
                     args=[{"visible": [False, True, False, False, False]}, 
                           {"title": "{} sales".format(countries[1])}]), 
                dict(label=countries[2], 
                     method="update", 
                     args=[{"visible": [False, False, True, False, False]}, 
                           {"title": "{} sales".format(countries[2])}]), 
                dict(label=countries[3], 
                     method="update", 
                     args=[{"visible": [False, False, False, True, False]}, 
                           {"title": "{} sales".format(countries[3])}]), 
                dict(label=countries[4], 
                     method="update", 
                     args=[{"visible": [False, False, False, False, True]}, 
                           {"title": "{} sales".format(countries[4])}]), 
                
            ]), 
        ) 
    ]) 
  
fig.show() 
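
The dropdown menus in this section repeat one button dictionary per trace, each with a hand-written visibility list. Below is a minimal sketch of how those dictionaries could be generated instead. The helper `make_visibility_buttons` is hypothetical (it is not part of Plotly) and only builds plain dictionaries in the shape that `updatemenus` expects.

```python
def make_visibility_buttons(labels, title_template):
    """Build one 'update' button per trace; each button shows only its own trace."""
    buttons = []
    for index, label in enumerate(labels):
        # True at the position of this trace, False everywhere else.
        visible = [position == index for position in range(len(labels))]
        buttons.append(
            dict(
                label=str(label),
                method="update",
                args=[{"visible": visible}, {"title": title_template.format(label)}],
            )
        )
    return buttons


countries = ["EIRE", "Netherlands", "Germany", "France", "Australia"]
buttons = make_visibility_buttons(countries, "{} sales")
print(buttons[1]["args"])
```

The resulting list could then be passed as `fig.update_layout(updatemenus=[dict(active=0, buttons=buttons)])`. It relies on the traces being added to the figure in the same order as `labels`.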

<h2 id="countries" style="font-family:verdana;"> 
         6.3 Customers' Segments
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#countries">¶</a>

</h2>

In [None]:
i = 1
j = 1
data = []
segment_names = ['Best Customers', 'Big Spenders', 'Good Customers', 'Loyal Customers', 'Average Customer', 
                 'Not So Good Customers', 'Almost Lost', 'Lost Customers', 'Bad Customers']
for segment in segment_names:
    df_segment = df_cleaned[df_cleaned['Segment']==segment]
    df_segment_grouped = df_segment.groupby('MonthYear').agg(custom_aggregation).sort_values(by='InvoiceDate')
    
    data.append(go.Scatter(x=df_segment_grouped.index, 
                   y=df_segment_grouped['TotalPrice']))
    
fig = make_subplots(rows=3, cols=3, shared_yaxes=True, vertical_spacing=0.19, subplot_titles=segment_names)
    
k = 0
for i in range(1,4):
    for j in range(1,4):
        fig.add_trace(data[k], row=i, col=j)
        k+=1


fig.update_layout(height=1000, width=1000, title_text="Sales by segment", title_x=0.5
                  )
for i in fig['layout']['annotations']:
    i['font'] = dict(size=10,color='#ff0000')
fig.show()

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> It seems that our segment categories can be improved. As I said earlier, this should be done in partnership with the marketing team. For example, the segment "Not So Good Customers" seems to have better sales than the "Average Customer" segment. The problem seems to be December 2010, when there was a massive drop in sales. The "Best Customers" segment is well represented, since it has the biggest sales. Lastly, the graphs show why we labeled these categories "Almost Lost" and "Lost Customers". </b><br><br>
</div>

<h2 id="prod_sells" style="font-family:verdana;"> 
         6.4 Most sold products
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#prod_sells">¶</a>

</h2>

In [None]:
most_sold_prod = df_cleaned.groupby(['Description'])['Quantity'].sum().sort_values(ascending=False)[:10]

data = [go.Bar(x=most_sold_prod.index, 
               y=most_sold_prod.values)]

layout = go.Layout(title="TOP 10 most sold products", title_x=0.5)

fig = go.Figure(data=data, layout=layout)

fig.show()

In [None]:
i = 1
j = 1
data = []
prod_names= []
for product in most_sold_prod.index[0:9]:
    df_product = df_cleaned[df_cleaned['Description']==product]
    df_product_grouped = df_product.groupby('MonthYear').agg(custom_aggregation).sort_values(by='InvoiceDate')
    
    data.append(go.Scatter(x=df_product_grouped.index, 
                   y=df_product_grouped['TotalPrice']))
    prod_names.append(product)
    
fig = make_subplots(rows=3, cols=3, vertical_spacing=0.19, subplot_titles=prod_names)
    
k = 0
for i in range(1,4):
    for j in range(1,4):
        fig.add_trace(data[k], row=i, col=j)
        k+=1


fig.update_layout(height=1000, width=1000, title_text="Best products monthly sales", title_x=0.5
                  )
for i in fig['layout']['annotations']:
    i['font'] = dict(size=10,color='#ff0000')
fig.show()

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> This graph would be much more interesting in Dash, since we could select any product we're interested in and see its sales through the year. I'll incorporate it in my future dashboard. The main information we can derive from this graph is that we shouldn't keep the products "Paper craft" and "Storage Jar" in the database, since they have so many canceled orders. </b><br><br>
</div>
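
Before dropping products because of cancellations, it may help to measure the cancellation ratio per product rather than read it off the curves. Below is a minimal sketch on toy rows with made-up product names. It assumes `QuantityCanceled` holds a positive count of canceled units on the original order line, and the `cancel_ratio` column is introduced for the example.

```python
import pandas as pd

# Toy order lines for illustration only.
toy = pd.DataFrame({
    "Description": ["PRODUCT A", "PRODUCT B", "PRODUCT C", "PRODUCT C"],
    "Quantity": [100, 50, 10, 30],
    "QuantityCanceled": [100, 25, 0, 4],
})

# Total ordered and canceled quantities per product.
per_product = toy.groupby("Description")[["Quantity", "QuantityCanceled"]].sum()

# Fraction of ordered units that were later canceled.
per_product["cancel_ratio"] = per_product["QuantityCanceled"] / per_product["Quantity"]
per_product = per_product.sort_values("cancel_ratio", ascending=False)

print(per_product)
```

Products at the top of this table would be the first candidates to review with the business before removing anything.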

In [None]:
prod_categories = df_cleaned.groupby('Labels')['TotalPrice'].sum()

In [None]:
data = [go.Pie(labels=prod_categories.index, 
                   values=prod_categories.values)]

layout = go.Layout(title="Total sales through product categories", title_x=0.5)

fig = go.Figure(data=data, layout=layout)

fig.show()

<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> The work done on product categories doesn't seem to be very representative of actual sales. I'll rework this part in a new version. </b><br><br>
</div>

<h2 id="time_graphs" style="font-family:verdana;"> 
         6.5 Time Features
        <a class="anchor-link" href="https://www.kaggle.com/miljan/product-tagging-for-e-commerce-work-in-progress/#time_graphs">¶</a>

</h2>

In [None]:
hourly_sales = df_cleaned.groupby('Hour')['TotalPrice'].sum().sort_index(ascending=True)
hourly_sales

In [None]:
data = [go.Bar(x=hourly_sales.index, 
               y=hourly_sales.values)]

layout = go.Layout(title="Hourly sales", title_x=0.5)

fig = go.Figure(data=data, layout=layout)

fig.show()

In [None]:
weekday_sales = df_cleaned.groupby('Weekday')['TotalPrice'].sum().sort_index(ascending=True)

data = [go.Bar(x=weekday_sales.index, 
               y=weekday_sales.values)]

layout = go.Layout(title="Weekday sales", title_x=0.5)

fig = go.Figure(data=data, layout=layout)

fig.show()


<br><br>
<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 1.7em">
    <b> I really can't believe that there's a day of the week where sales drop that drastically. It must be an error. </b><br><br>
</div>
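
A quick way to tell an error from a closed day would be to count how many distinct trading days fall on each weekday. Below is a minimal sketch on toy rows. It assumes `Weekday` is encoded as in `Series.dt.dayofweek` (Monday=0), which may differ from the encoding used in `df_cleaned`. The `Date`, `trading_days` and `sales_per_trading_day` names are introduced for the example.

```python
import pandas as pd

# Toy invoices for illustration only.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 10:00",  # Wednesday
        "2010-12-02 11:30",  # Thursday
        "2010-12-03 09:15",  # Friday
        "2010-12-05 14:00",  # Sunday
        "2010-12-08 16:45",  # Wednesday
    ]),
    "TotalPrice": [20.0, 35.0, 12.5, 8.0, 40.0],
})

toy["Weekday"] = toy["InvoiceDate"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["Date"] = toy["InvoiceDate"].dt.date

summary = toy.groupby("Weekday").agg(
    trading_days=("Date", "nunique"),
    total_sales=("TotalPrice", "sum"),
)

# Reindex so that weekdays with no rows at all show up with zeros.
summary = summary.reindex(range(7), fill_value=0)

# Average sales per day actually traded (NaN when there was no trading day).
summary["sales_per_trading_day"] = summary["total_sales"] / summary["trading_days"]

print(summary)
```

If the weak weekday has very few or no trading days, the drop reflects missing activity on that day rather than a computation error.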

In [None]:
day_sales = df_cleaned.groupby('Day')['TotalPrice'].sum().sort_index(ascending=True)

data = [go.Bar(x=day_sales.index, 
               y=day_sales.values)]

layout = go.Layout(title="Day of the month sales", title_x=0.5)

fig = go.Figure(data=data, layout=layout)

fig.show()


<center style="font-family:cursive; font-size:18px; color:#159364;">This is a work in progress, so feel free to suggest any graphics that you would like to see. <br>I'm going to add more visualizations and then create a dashboard that I'll share with you.<br> Thank you for taking the time to read my notebook 🙏 </center>

***