# Data preparation
In this section we extract the data relevant to customer clustering.

## Import dataset

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [9]:
df = pd.read_csv("dataset/TC-dataset.csv", sep="\t", index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 471910 entries, 0 to 541909
Data columns (total 8 columns):
BasketID           471910 non-null object
BasketDate         471910 non-null object
Sale               471910 non-null float64
CustomerID         406830 non-null float64
CustomerCountry    471910 non-null object
ProdID             471910 non-null object
ProdDescr          471157 non-null object
Qta                471910 non-null int64
dtypes: float64(2), int64(1), object(5)
memory usage: 32.4+ MB


## Extracting new features

We extract the following features for each customer:
* I: the total number of items purchased by a customer during the period of
observation.

⚠️ Does this mean the sum of the quantities? To be clarified in class.

* Iu: the number of distinct items bought by a customer in the period of
observation.

* Imax: the maximum number of items purchased by a customer during a
shopping session

⚠️ Does this mean the maximum quantity? To be clarified in class.

* E: the Shannon entropy on the purchasing behaviour of the customer

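We will use the Shannon entropy given by the following formula, where $m$ is the total number of observations, $n$ is the number of distinct values and $m_{i}$ is the number of occurrences of the $i$-th distinct value:
$H(X)=-\sum_{i=1}^{n} \frac{m_{i}}{m} \log _{2} \frac{m_{i}}{m}$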
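
As a quick sanity check of the formula, consider a hypothetical customer (not taken from the dataset) whose four purchase rows involve the products A, A, B and C. The counts are then $m_A=2$ and $m_B=m_C=1$ out of $m=4$ rows:

$$H(X) = -\left(\tfrac{2}{4}\log_2\tfrac{2}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) = 0.5 + 0.5 + 0.5 = 1.5 \text{ bits}$$

A customer who always buys the same product has $H(X)=0$, while one who spreads purchases evenly over $n$ products reaches the maximum $\log_2 n$.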


In [10]:
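import math
import numpy as np

def entropy(labels, base=None):
  """Shannon entropy of the empirical distribution of `labels`."""
  m = len(labels)
  ent = 0.
  # Default to the natural logarithm when no base is given.
  base = math.e if base is None else base

  # With at most one observation there is no uncertainty.
  if m <= 1:
    return 0.0

  # mi[i] is the number of occurrences of the i-th distinct label.
  _, mi = np.unique(labels, return_counts=True)
  probs = mi / m

  # A single distinct label also gives zero entropy.
  n_classes = np.count_nonzero(probs)
  if n_classes <= 1:
    return 0.0

  for p in probs:
    ent -= p * math.log(p, base)

  return ent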
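
To build some confidence in the entropy computation, the sketch below re-implements the same formula using only the standard library and evaluates it on a few hand-made label lists. The helper name `entropy_bits` and the sample labels are illustrative assumptions and do not come from the dataset.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical distribution of `labels`.

    Hypothetical helper used only as a cross-check of the formula.
    """
    counts = Counter(labels)          # m_i for every distinct label
    m = sum(counts.values())          # total number of observations
    if m == 0:
        return 0.0
    h = -sum((c / m) * math.log2(c / m) for c in counts.values())
    return h + 0.0                    # turns a possible -0.0 into 0.0

# Made-up label lists covering the two extreme cases and a simple one.
single_product = entropy_bits(["A", "A", "A"])       # no uncertainty
two_products = entropy_bits(["A", "B"])              # one fair binary choice
four_products = entropy_bits(["A", "B", "C", "D"])   # uniform over 4 labels

print(single_product, two_products, four_products)
```

The single-product case should give zero and the uniform cases should give $\log_2$ of the number of distinct labels, which matches the bounds of the formula.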

We then process the data and create one entry per customer.

In [11]:
new_data = []
# Skip missing CustomerID values: NaN never equals itself, so it would select no rows.
for elem in df.CustomerID.dropna().unique():
    cust = df[df["CustomerID"] == elem]   # all rows for this customer
    I = cust.Qta.sum()                    # total quantity purchased
    Iu = cust.ProdID.nunique()            # number of distinct products
    Imax = cust.Qta.max()                 # largest quantity on a single row
    E = entropy(cust.ProdID, 2)           # entropy (in bits) over purchased products
    new_data.append([elem, I, Iu, Imax, E])
new_df = pd.DataFrame(new_data, columns=['CustomerID', 'I', 'Iu', 'Imax', 'E'])
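
The per-customer loop filters the whole frame once per customer, which may be slow on several hundred thousand rows. As a possible alternative, the sketch below computes the same four features with `groupby` on a small made-up frame that reuses the dataset's column names. It also adds a hypothetical `Imax_session` column, which reads "shopping session" as one `BasketID` and takes the largest basket total. That reading is an assumption, since the definition of Imax is still open.

```python
import numpy as np
import pandas as pd

def entropy_bits(values):
    """Shannon entropy (bits) of the empirical distribution of a Series."""
    counts = values.value_counts()
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum()) + 0.0  # avoid -0.0

# Made-up rows, only to illustrate the mechanics (not real data).
toy = pd.DataFrame({
    "BasketID": ["b1", "b1", "b2", "b3", "b3"],
    "CustomerID": [1.0, 1.0, 1.0, 2.0, None],
    "ProdID": ["A", "B", "A", "C", "C"],
    "Qta": [2, 3, 4, 1, 5],
})

# Rows without a CustomerID cannot be attributed to anyone.
known = toy.dropna(subset=["CustomerID"])
grouped = known.groupby("CustomerID")

features = pd.DataFrame({
    "I": grouped["Qta"].sum(),                 # total quantity
    "Iu": grouped["ProdID"].nunique(),         # distinct products
    "Imax": grouped["Qta"].max(),              # largest single-row quantity
    "E": grouped["ProdID"].agg(entropy_bits),  # entropy over products
})

# Assumed alternative reading of Imax: largest total quantity in one basket.
basket_totals = known.groupby(["CustomerID", "BasketID"])["Qta"].sum()
features["Imax_session"] = basket_totals.groupby(level="CustomerID").max()

features = features.reset_index()
print(features)
```

On the toy frame, customer 1.0 has a largest single-row quantity of 4 but a largest basket total of 5, which shows how the two readings of Imax can differ.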


In [12]:
new_df.to_csv('dataset/_temp_CS-dataset.csv', sep='\t', index=False)

Other proposed indicators:
* Preferred time of day for each customer
* Average spend across shopping sessions
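
One possible way to derive these two indicators is sketched below on made-up rows. It rests on several assumptions that the data above does not confirm: that `BasketDate` can be parsed by `pd.to_datetime`, that `Sale` is a unit price (so the spend of a row is `Sale * Qta`), and that one `BasketID` corresponds to one shopping session. The output column names `PreferredHour` and `MeanSessionSpend` are hypothetical.

```python
import pandas as pd

# Made-up rows reusing the dataset's column names (not real data).
toy = pd.DataFrame({
    "BasketID": ["b1", "b1", "b2", "b3"],
    "BasketDate": ["2011-01-05 09:15", "2011-01-05 09:15",
                   "2011-01-07 09:40", "2011-01-06 14:00"],
    "CustomerID": [1.0, 1.0, 1.0, 2.0],
    "Sale": [2.0, 1.0, 3.0, 10.0],
    "Qta": [2, 3, 4, 1],
})

toy["BasketDate"] = pd.to_datetime(toy["BasketDate"])
toy["Hour"] = toy["BasketDate"].dt.hour
toy["Spend"] = toy["Sale"] * toy["Qta"]   # assumes Sale is a unit price

# Collapse to one row per session (basket) first.
baskets = (
    toy.groupby(["CustomerID", "BasketID"])
       .agg(Hour=("Hour", "first"), Spend=("Spend", "sum"))
       .reset_index()
)

# Then summarise the sessions of each customer.
extra = baskets.groupby("CustomerID").agg(
    PreferredHour=("Hour", lambda h: h.mode().iloc[0]),  # most frequent hour
    MeanSessionSpend=("Spend", "mean"),
)
print(extra)
```

Taking the mode of the hour is only one reading of "preferred time of day". When several hours are tied, `mode()` returns them sorted and the sketch keeps the earliest.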
