# Data preparation
In questa sezione estrarremo i dati rilevanti per il clustering dei clienti.

## Import dataset

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [47]:
df = pd.read_csv("dataset/customer_supermarket.csv", sep="\t",index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 471910 entries, 0 to 541909
Data columns (total 8 columns):
BasketID           471910 non-null object
BasketDate         471910 non-null object
Sale               471910 non-null object
CustomerID         406830 non-null float64
CustomerCountry    471910 non-null object
ProdID             471910 non-null object
ProdDescr          471157 non-null object
Qta                471910 non-null int64
dtypes: float64(1), int64(1), object(6)
memory usage: 32.4+ MB


## Estrazione nuove features

Estraiamo le feature per ogni utente:
* I: the total number of items purchased by a customer during the period of
observation.

⚠️Si intende la somma delle quantità? Si chiarirà anche a lezione

* Iu: the number of distinct items bought by a customer in the period of
observation.

* Imax: the maximum number of items purchased by a customer during a
shopping session

⚠️Si intende la quantità massima? si chiarirà anche a lezione

* E: the Shannon entropy on the purchasing behaviour of the customer

Utilizzeremo la shanno entropy indicata nella seguente formula:
$H(X)=-\sum_{i=1}^{n} \frac{m_{i}}{m} \log _{2} \frac{m_{i}}{m}$


In [48]:
import numpy as np
def entropy(labels, base=None):

  m = len(labels)
  ent = 0.
  base = e if base is None else base

  if m <= 1:
    return 0

  _,mi = np.unique(labels, return_counts=True)
  probs = mi / m

  n_classes = np.count_nonzero(probs)
  if n_classes <= 1:
    return 0

  
  for p in probs:
    ent -= p * np.math.log(, base)

  return ent

Quindi processiamo i dati e creiamo un entrata per ogni utente

In [49]:
for elem in df.CustomerID.unique():
    I=df[df["CustomerID"]==elem].Qta.sum()
    Iu=len(df[df["CustomerID"]==elem].ProdID.unique())
    Imax=df[df["CustomerID"]==elem].Qta.max()
    E=entropy(df[df["CustomerID"]==elem].ProdID,2)
    print("Customer id:\t"+str(elem))
    print("l:\t"+str(i))
    print("lu:\t"+str(iu))
    print("lmax:\t"+str(imax))
    print("E:\t"+str(E))
    

	17274.0
l:	3570
lu:	95
lmax:	9
E:	4.532665279941249
Customer id:	15597.0
l:	3570
lu:	95
lmax:	9
E:	4.436605434317882
Customer id:	17233.0
l:	3570
lu:	95
lmax:	9
E:	4.643856189774723
Customer id:	16887.0
l:	3570
lu:	95
lmax:	9
E:	5.101345257520843
Customer id:	13938.0
l:	3570
lu:	95
lmax:	9
E:	2.8073549220576046
Customer id:	17084.0
l:	3570
lu:	95
lmax:	9
E:	7.12065386088259
Customer id:	14882.0
l:	3570
lu:	95
lmax:	9
E:	2.550340709546388
Customer id:	15017.0
l:	3570
lu:	95
lmax:	9
E:	4.321928094887363
Customer id:	12882.0
l:	3570
lu:	95
lmax:	9
E:	3.5465935642949384
Customer id:	17387.0
l:	3570
lu:	95
lmax:	9
E:	6.105915039375075
Customer id:	13727.0
l:	3570
lu:	95
lmax:	9
E:	4.221928094887362
Customer id:	13532.0
l:	3570
lu:	95
lmax:	9
E:	6.044394119358462
Customer id:	16445.0
l:	3570
lu:	95
lmax:	9
E:	5.945533891753889
Customer id:	15156.0
l:	3570
lu:	95
lmax:	9
E:	5.635905168070111
Customer id:	16372.0
l:	3570
lu:	95
lmax:	9
E:	6.373108718475798
Customer id:	17608.0
l:	3570
lu:	95


Altre proposte indicatori: 
* Orario del giorni preferito da ogni utente
* Spesa media tra le varie sessioni
