# Data preparation
In this section we extract the data relevant to customer clustering.

## Import dataset

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv("dataset/TC-dataset.csv", sep="\t",index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 471910 entries, 0 to 541909
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   BasketID         471910 non-null  object 
 1   BasketDate       471910 non-null  object 
 2   Sale             471910 non-null  object 
 3   CustomerID       406830 non-null  float64
 4   CustomerCountry  471910 non-null  object 
 5   ProdID           471910 non-null  object 
 6   ProdDescr        471157 non-null  object 
 7   Qta              471910 non-null  int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 32.4+ MB
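
The `df.info()` output shows that `CustomerID` is populated for 406830 of the 471910 rows, so 65080 rows cannot be attributed to a customer (and 753 rows lack `ProdDescr`). The sketch below shows one way such gaps could be counted and filtered. The `toy` frame and its values are made up for illustration, and dropping these rows is an assumption rather than a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; only the column names come from the dataset.
toy = pd.DataFrame({
    "BasketID": ["b1", "b1", "b2", "b3"],
    "CustomerID": [17850.0, 17850.0, np.nan, 13047.0],
    "Qta": [6, 8, 2, 3],
})

# Number of missing values in each column
missing = toy.isna().sum()

# Keep only the rows that can be attributed to a customer
toy_known = toy.dropna(subset=["CustomerID"])
```

Applied to the real `df`, the same two calls would report the missing counts and leave only the rows usable for per-customer features.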


## Extracting new features

We extract the following features for each customer:
* I: the total number of items purchased by a customer during the period of
observation.

⚠️ Does this mean the sum of the quantities? To be clarified in class.

* Iu: the number of distinct items bought by a customer in the period of
observation.

* Imax: the maximum number of items purchased by a customer during a
shopping session

⚠️ Does this mean the maximum quantity? To be clarified in class.

* E: the Shannon entropy on the purchasing behaviour of the customer

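We will use the Shannon entropy given by the following formula:
$H(X)=-\sum_{i=1}^{n} \frac{m_{i}}{m} \log _{2} \frac{m_{i}}{m}$

Here $n$ is the number of distinct products bought by the customer, $m_{i}$ is the number of purchase rows for the $i$-th product, and $m$ is the customer's total number of purchase rows.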
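
As a small worked example of the formula (the product labels A, B and C are hypothetical), take a customer with $m=4$ purchase rows spread over three products with counts $m_{A}=2$, $m_{B}=1$, $m_{C}=1$:

$H(X)=-\left(\frac{2}{4} \log _{2} \frac{2}{4}+\frac{1}{4} \log _{2} \frac{1}{4}+\frac{1}{4} \log _{2} \frac{1}{4}\right)=0.5+0.5+0.5=1.5$

A customer who always buys the same product has a single term with $\frac{m_{1}}{m}=1$, so $H(X)=0$.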


In [4]:
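import math
import numpy as np

def entropy(labels, base=None):
  """Shannon entropy of the empirical distribution of `labels`.

  Uses the natural logarithm when `base` is None.
  """
  m = len(labels)
  ent = 0.
  base = math.e if base is None else base

  # Zero or one observation carries no uncertainty
  if m <= 1:
    return 0

  # mi[i] is the number of occurrences of the i-th distinct label
  _, mi = np.unique(labels, return_counts=True)
  probs = mi / m

  # A single distinct label also has zero entropy
  n_classes = np.count_nonzero(probs)
  if n_classes <= 1:
    return 0

  for p in probs:
    ent -= p * math.log(p, base)

  return ent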
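
As a sanity check on the definition, the same quantity can be computed in vectorized form. This is a minimal sketch, not the notebook's function: `entropy_bits` is a hypothetical name, and it fixes the base to 2 via `np.log2`.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy in bits of the empirical distribution of labels."""
    labels = np.asarray(labels)
    if labels.size <= 1:
        return 0.0
    # counts[i] is the number of occurrences of the i-th distinct label
    _, counts = np.unique(labels, return_counts=True)
    if counts.size <= 1:
        return 0.0
    probs = counts / labels.size
    return float(-(probs * np.log2(probs)).sum())

# Two equally likely products give exactly one bit of entropy
h_two = entropy_bits(["A", "B"])
# Counts 2, 1, 1 over four rows
h_three = entropy_bits(["A", "A", "B", "C"])
# A single repeated product has no uncertainty
h_one = entropy_bits(["A", "A", "A"])
```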

We then process the data and create one entry per customer.

In [5]:
new_data = []
# Skip missing CustomerIDs: NaN never compares equal, so it would only add an empty row
for elem in df.CustomerID.dropna().unique():
    # Select this customer's rows once instead of re-filtering for each feature
    customer = df[df["CustomerID"] == elem]
    I = customer.Qta.sum()             # total quantity purchased
    Iu = customer.ProdID.nunique()     # number of distinct products
    Imax = customer.Qta.max()          # largest quantity in a single row
    E = entropy(customer.ProdID, 2)    # Shannon entropy (base 2) over ProdID
    new_data.append([elem, I, Iu, Imax, E])
new_df = pd.DataFrame(new_data, columns=['CustomerID', 'I', 'Iu', 'Imax', 'E'])
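
Filtering the full frame once per customer can be slow on several hundred thousand rows. The sketch below shows how the same four features could be computed with a single `groupby`. It is an illustration under assumptions: the `toy` frame and its values are invented, `entropy_bits` is a hypothetical helper, and rows without a `CustomerID` are dropped.

```python
import numpy as np
import pandas as pd

def entropy_bits(products):
    """Shannon entropy in bits of a Series of product labels."""
    counts = products.value_counts().to_numpy()
    if counts.size <= 1:
        return 0.0
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

# Toy frame standing in for df; only the column names come from the dataset.
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0, 2.0],
    "ProdID": ["A", "A", "B", "C", "C"],
    "Qta": [2, 3, 5, 1, 4],
})

features = (
    toy.dropna(subset=["CustomerID"])
       .groupby("CustomerID")
       .agg(I=("Qta", "sum"),             # total quantity purchased
            Iu=("ProdID", "nunique"),     # number of distinct products
            Imax=("Qta", "max"),          # largest quantity in a single row
            E=("ProdID", entropy_bits))   # entropy over purchased products
       .reset_index()
)
```

Here `Imax` is the largest quantity in a single row. If it is instead meant as the largest shopping session, one option would be to sum `Qta` per `CustomerID` and `BasketID` first and then take the maximum per customer.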


In [6]:
new_df.to_csv('dataset/_temp_CS-dataset.csv', sep='\t', index=False)

Other proposed indicators:
* Each customer's preferred time of day
* Average spend across shopping sessions
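
The two proposed indicators could be derived roughly as follows. This is only a sketch under several assumptions that the notebook does not confirm: `Sale` is a unit price stored as text with a decimal comma, the spend of a row is `Sale * Qta`, a session is one `BasketID`, and `BasketDate` parses with `pd.to_datetime`. The `toy` frame and the `Hour`, `Amount`, `PreferredHour` and `AvgSpend` names are hypothetical.

```python
import pandas as pd

# Toy frame standing in for df; values and formats are assumptions.
toy = pd.DataFrame({
    "BasketID": ["b1", "b1", "b2", "b3"],
    "BasketDate": ["2010-12-01 08:26", "2010-12-01 08:26",
                   "2010-12-03 08:40", "2010-12-02 15:10"],
    "Sale": ["2,55", "3,39", "1,25", "4,00"],
    "CustomerID": [1.0, 1.0, 1.0, 2.0],
    "Qta": [6, 2, 4, 1],
})

# Parse timestamps and convert decimal-comma prices to floats
toy["Hour"] = pd.to_datetime(toy["BasketDate"]).dt.hour
toy["Sale"] = toy["Sale"].str.replace(",", ".", regex=False).astype(float)
toy["Amount"] = toy["Sale"] * toy["Qta"]

# One row per session: its hour and its total spend
sessions = (
    toy.groupby(["CustomerID", "BasketID"])
       .agg(Hour=("Hour", "first"), Spend=("Amount", "sum"))
       .reset_index()
)

# Per customer: most frequent session hour and mean spend per session
indicators = (
    sessions.groupby("CustomerID")
            .agg(PreferredHour=("Hour", lambda h: h.mode().iloc[0]),
                 AvgSpend=("Spend", "mean"))
            .reset_index()
)
```

When several hours are tied, `mode()` returns them sorted, so this sketch picks the earliest one.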


## Per capita income indicator

In [9]:
dfCountry = pd.read_csv("dataset/CountryRedditoProCap.csv", sep=";",index_col=0)
print(dfCountry)

                  Country          2010          2011
1               Australia    52.022,10     62.517,80 
2                 Austria    46.858,00     51.375,00 
3                 Bahrain    20.722,10     22.514,30 
4                 Belgium    44.141,90     47.348,50 
5                  Brazil    11.286,20     13.245,60 
6                  Canada    47.450,30     52.101,80 
7         Channel Islands    74.463,10     75.863,10 
8                  Cyprus    30.818,50     32.233,60 
9          Czech Republic    19.808,10     21.717,50 
10                Denmark    58.041,40     61.753,60 
11                   EIRE    48.715,20     51.848,90 
12                Finland    46.460,00     51.082,00 
13                 France    40.638,30     43.790,70 
14               Germany     41.531,90     46.644,80 
15                 Greece    26.917,80     25.916,30 
16              Hong Kong    32.550,00     35.142,50 
17                Iceland    43.024,90     47.516,90 
18                 Israel   