# Data preparation
In this section we extract the data relevant to customer clustering.

## Import dataset

In [91]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [92]:
df = pd.read_csv("dataset/DQ-dataset.csv", sep="\t",index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 456644 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   BasketID         456644 non-null  int64 
 1   BasketDate       456644 non-null  object
 2   Sale             456644 non-null  object
 3   CustomerID       456644 non-null  object
 4   CustomerCountry  456644 non-null  object
 5   ProdID           456644 non-null  object
 6   ProdDescr        456644 non-null  object
 7   Qta              456644 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 31.4+ MB


## Extracting new features

We extract the following features for each customer:
* I: the total number of items purchased by a customer during the period of
observation.

⚠️ Does this mean the sum of the quantities? To be clarified in class.

* Iu: the number of distinct items bought by a customer in the period of
observation.

* Imax: the maximum number of items purchased by a customer during a
shopping session

⚠️ Does this mean the maximum quantity? To be clarified in class.

* E: the Shannon entropy on the purchasing behaviour of the customer

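We use the Shannon entropy given by the following formula, where $m$ is the number of purchase records of a customer and $m_{i}$ is the number of those records that involve the $i$-th of the $n$ distinct products: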
$H(X)=-\sum_{i=1}^{n} \frac{m_{i}}{m} \log _{2} \frac{m_{i}}{m}$
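
As a quick illustration of the formula (a made-up customer, not taken from the dataset): with $m = 4$ purchase records split as $m_{1} = 2$, $m_{2} = 1$, $m_{3} = 1$ over three distinct products, the entropy works out to

$H(X)=-\left(\frac{2}{4} \log _{2} \frac{2}{4}+\frac{1}{4} \log _{2} \frac{1}{4}+\frac{1}{4} \log _{2} \frac{1}{4}\right)=0.5+0.5+0.5=1.5$

A customer who always buys the same product has $H(X)=0$.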


In [93]:
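import math
import numpy as np

def entropy(labels, base=None):
  """Shannon entropy of a sequence of labels (natural log unless base is given)."""
  m = len(labels)
  base = math.e if base is None else base

  # Zero or one observation carries no uncertainty
  if m <= 1:
    return 0

  # mi[i] = number of occurrences of the i-th distinct label
  _, mi = np.unique(labels, return_counts=True)
  probs = mi / m

  # A single distinct label also means zero entropy
  if np.count_nonzero(probs) <= 1:
    return 0

  ent = 0.
  for p in probs:
    ent -= p * math.log(p, base)

  return ent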
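
A dependency-free cross-check can help confirm the two extremes of the measure: zero for a customer who buys a single product, and $\log_2 n$ for a customer whose purchases are spread evenly over $n$ products. The sketch below is only an illustration; `entropy_bits` and the toy label lists are hypothetical names and data, not part of the notebook.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy (base 2) of a sequence of hashable labels."""
    m = len(labels)
    if m == 0:
        return 0.0
    counts = Counter(labels)
    # sum of p * log2(1/p), which equals -sum of p * log2(p)
    return sum((c / m) * math.log2(m / c) for c in counts.values())

# Made-up purchase histories, not taken from the dataset
same_product = ["A"] * 8          # one product bought eight times
eight_products = list("ABCDEFGH") # eight different products, once each

print(entropy_bits(same_product))
print(entropy_bits(eight_products))
```

The second value is the upper bound for eight distinct products, so any real customer with eight products falls between the two.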

In [94]:
df_pil = pd.read_csv("dataset/Country_PIL.csv", sep=";",index_col=0)
print(df_pil)

                 Country     PIL
1              Australia   1,271
2                Austria   0,411
3                Bahrain   0,027
4                Belgium   0,501
5                 Brazil   2,412
6                 Canada   1,701
7        Channel Islands   0,030
8                 Cyprus   0,026
9         Czech Republic   0,217
10               Denmark   0,333
11                  EIRE   0,229
12               Finland   0,262
13                France   2,752
14               Germany   3,570
15                Greece   0,293
16             Hong Kong   0,238
17               Iceland   0,014
18                Israel   0,247
19                 Italy   2,213
20                 Japan   5,928
21               Lebanon   0,039
22             Lithuania   0,040
23                 Malta   0,009
24           Netherlands   0,875
25                Norway   0,463
26                Poland   0,504
27              Portugal   0,241
28                   RSA   0,395
29          Saudi Arabia   0,599
30        

In [95]:
# Country -> PIL lookup table
dict_pil = df_pil.set_index("Country")["PIL"].to_dict()
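
The PIL values printed above use a comma as decimal separator, so pandas most likely loads them as strings rather than numbers. A minimal sketch of one way to get floats, assuming the file has the layout shown in the printout: pass `decimal=","` to `pd.read_csv`. The toy CSV and the names `toy_pil` and `toy_dict_pil` are made up for illustration.

```python
import io
import pandas as pd

# Two toy rows mimicking the assumed layout of Country_PIL.csv (not the real file)
toy_csv = io.StringIO(";Country;PIL\n1;Australia;1,271\n2;Austria;0,411\n")

# decimal="," tells pandas to treat the comma as the decimal separator
toy_pil = pd.read_csv(toy_csv, sep=";", index_col=0, decimal=",")

# Country -> PIL lookup with float values
toy_dict_pil = toy_pil.set_index("Country")["PIL"].to_dict()

print(toy_pil.dtypes)
print(toy_dict_pil)
```

With numeric values the PIL column can be scaled or fed to a distance-based clustering algorithm directly.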

We then process the data and create one entry per customer.

## Indicators

In [96]:
new_data = []
for elem in df.CustomerID.unique():
    # Select this customer's rows once and reuse them for every indicator
    cust = df[df["CustomerID"] == elem]
    I = cust.Qta.sum()                # total quantity purchased
    Iu = cust.ProdID.nunique()        # number of distinct products
    Imax = cust.Qta.max()             # largest quantity on a single row
    E = entropy(cust.ProdID, 2)       # Shannon entropy (base 2) over product IDs

    country = cust.CustomerCountry.iloc[0]
    PIL = dict_pil[country]
    new_data.append([elem, I, Iu, Imax, E, PIL])
new_df = pd.DataFrame(new_data, columns=['CustomerID', 'I', 'Iu', 'Imax', 'E', 'PIL'])
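
The loop above filters the whole DataFrame once per customer, which can be slow with several thousand customers. A possible alternative, sketched here on made-up data with the same column names, is a single `groupby`. The sketch also shows a second reading of `Imax` (the largest total quantity in one basket, grouping by `BasketID`), since the definition above leaves open which one is meant. The names `toy`, `per_customer`, `Imax_row` and `Imax_session` are hypothetical.

```python
import pandas as pd

# Toy transactions with the same column names as df (values are made up)
toy = pd.DataFrame({
    "CustomerID": ["c1", "c1", "c1", "c2", "c2"],
    "BasketID":   [10,   10,   11,   20,   20],
    "ProdID":     ["A",  "B",  "A",  "C",  "C"],
    "Qta":        [2,    3,    4,    1,    5],
})

# I, Iu and the row-level Imax in one pass over the data
per_customer = toy.groupby("CustomerID").agg(
    I=("Qta", "sum"),
    Iu=("ProdID", "nunique"),
    Imax_row=("Qta", "max"),
)

# Alternative reading of Imax: total quantity per basket,
# then the largest basket of each customer
basket_totals = toy.groupby(["CustomerID", "BasketID"])["Qta"].sum()
per_customer["Imax_session"] = basket_totals.groupby(level="CustomerID").max()

print(per_customer)
```

The entropy column could be added in the same style by aggregating the `ProdID` group with `lambda s: entropy(s, 2)`.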

In [97]:
new_df.to_csv('dataset/_temp_CS-dataset.csv', sep='\t', index=False)

Other proposed indicators:
* Preferred time of day for each customer
* Average spend across sessions
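
A rough sketch of how these two indicators might be computed, on made-up data. It assumes that `BasketDate` has already been parsed to datetime, that `Sale` is a numeric unit price, and that one `BasketID` corresponds to one session; none of this is confirmed by the dataset summary above, where both columns are still of type `object`. The names `toy`, `sessions`, `preferred_hour` and `avg_spend` are hypothetical.

```python
import pandas as pd

# Toy transactions (made-up values); Sale is assumed to be a numeric unit price
toy = pd.DataFrame({
    "CustomerID": ["c1", "c1", "c1", "c2"],
    "BasketID":   [10, 10, 11, 20],
    "BasketDate": pd.to_datetime([
        "2011-01-05 09:15", "2011-01-05 09:15",
        "2011-01-07 14:30", "2011-01-06 18:00",
    ]),
    "Sale": [2.0, 1.5, 4.0, 10.0],
    "Qta":  [3, 2, 1, 2],
})

# Row-level helpers: hour of the purchase and amount spent on the row
toy["hour"] = toy["BasketDate"].dt.hour
toy["spend"] = toy["Sale"] * toy["Qta"]

# One row per shopping session: its hour and its total spend
sessions = toy.groupby(["CustomerID", "BasketID"]).agg(
    hour=("hour", "first"),
    spend=("spend", "sum"),
).reset_index()

# Preferred hour = most frequent session hour (ties go to the earliest hour)
preferred_hour = sessions.groupby("CustomerID")["hour"].agg(lambda s: s.mode().iloc[0])

# Average spend across the customer's sessions
avg_spend = sessions.groupby("CustomerID")["spend"].mean()

print(preferred_hour)
print(avg_spend)
```

Aggregating to one row per session first keeps large baskets from dominating the preferred hour.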


In [98]:
print(new_df)

     CustomerID     I   Iu  Imax         E    PIL
0       17850.0  1707   21    12  4.335643  2,567
1       13047.0  1386  105    32  6.503112  2,567
2       12583.0  4992  113   360  6.504979  2,752
3       13748.0   439   24    80  4.494680  2,567
4       15100.0    80    1    32  0.000000  2,567
...         ...   ...  ...   ...       ...    ...
5570     10660N   801  297    70  8.214319  2,567
5571     39129N   922  351    19  8.455327  2,567
5572     11819N   192   27    48  4.754888  2,567
5573     15094N   325   98    32  6.614710  2,567
5574    12713.0   505   37    25  5.209453  3,570

[5575 rows x 6 columns]
