# Market segmentation

Let's work through a realistic scenario: analyzing data from an email marketing campaign. The data can be found [here](https://blog.minethatdata.com/search/label/MineThatData).

This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test.
- 1/3 were randomly chosen to receive an e-mail campaign featuring Mens merchandise.
- 1/3 were randomly chosen to receive an e-mail campaign featuring Womens merchandise.
- 1/3 were randomly chosen to not receive an e-mail campaign. 

Customer attributes include:
- **Recency**: Months since last purchase.
- **History_Segment**: Categorization of dollars spent in the past year.
- **History**: Actual dollar value spent in the past year.
- **Mens**: 1/0 indicator, 1 = customer purchased Mens merchandise in the past year.
- **Womens**: 1/0 indicator, 1 = customer purchased Womens merchandise in the past year.
- **Zip_Code**: Classifies zip code as Urban, Suburban, or Rural.
- **Newbie**: 1/0 indicator, 1 = New customer in the past twelve months.
- **Channel**: Describes the channels the customer purchased from in the past year.
- **Segment**: Describes the e-mail campaign the customer received.
    - *Mens E-Mail*: received an e-mail campaign featuring Mens merchandise.
    - *Womens E-Mail*: received an e-mail campaign featuring Womens merchandise.
    - *No E-Mail*: did not receive an e-mail campaign.

During a period of two weeks following the e-mail campaign, results were tracked:
- **Visit**: 1/0 indicator, 1 = Customer visited website in the following two weeks.
- **Conversion**: 1/0 indicator, 1 = Customer purchased merchandise in the following two weeks.
- **Spend**: Actual dollars spent in the following two weeks.
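
Because `Visit`, `Conversion`, and `Spend` are recorded per customer, a natural first look at the campaign is to average them by `Segment`. The sketch below uses a tiny hypothetical table: the values are invented for illustration, and the lowercase column names are an assumption about how the file is laid out.

```python
import pandas as pd

# Hypothetical toy data; values are invented, not taken from the real dataset.
toy = pd.DataFrame({
    "segment":    ["Mens E-Mail", "Mens E-Mail", "Womens E-Mail",
                   "Womens E-Mail", "No E-Mail", "No E-Mail"],
    "visit":      [1, 0, 1, 1, 0, 0],
    "conversion": [1, 0, 0, 1, 0, 0],
    "spend":      [30.0, 0.0, 0.0, 50.0, 0.0, 0.0],
})

# The mean of a 1/0 indicator is a rate; the mean of spend is dollars per customer.
response = toy.groupby("segment")[["visit", "conversion", "spend"]].mean()
print(response)
```

Comparing each e-mail group against the *No E-Mail* row gives a rough idea of the lift attributable to the campaign.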

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Load data

In [None]:
# load the file
marketing_data = pd.read_csv("./data/marketing_data.csv",sep=',') 
marketing_data.head(5)

Since `history` already holds the exact dollar amount spent, the binned `history_segment` column is redundant, so we drop it.
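
One way to check that `history_segment` adds nothing beyond `history` is to confirm that the spend ranges of the segments do not overlap. This is a minimal sketch on hypothetical data (the segment labels and dollar values below are invented), not a check run on the real file.

```python
import pandas as pd

# Hypothetical toy data; labels and amounts are invented for illustration.
toy = pd.DataFrame({
    "history": [30.0, 95.0, 120.0, 180.0, 250.0, 340.0],
    "history_segment": ["low", "low", "mid", "mid", "high", "high"],
})

# Spend range covered by each segment, ordered by its lower bound.
ranges = (toy.groupby("history_segment")["history"]
             .agg(["min", "max"])
             .sort_values("min"))

# If each segment starts above the previous segment's maximum, the segment
# is fully determined by `history` and carries no extra information.
lower = ranges["min"].iloc[1:].to_numpy()
upper = ranges["max"].iloc[:-1].to_numpy()
is_redundant = bool((lower > upper).all())

print(ranges)
print("history_segment is redundant:", is_redundant)
```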

In [None]:
marketing_data = marketing_data.drop('history_segment',axis=1)
marketing_data.dtypes

## 1.1 Convert categorical variables to numerical ones
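
As a small illustration of what one-hot encoding with `drop_first=True` does, the sketch below encodes a hypothetical three-level column. The first level in sorted order (`Rural` here) is dropped and becomes the implicit baseline. Passing `dtype=int` is an optional choice made for this sketch so the indicators come out as 0/1 integers rather than booleans.

```python
import pandas as pd

# Hypothetical toy column with three categories.
toy = pd.DataFrame({"zip_code": ["Urban", "Rural", "Suburban", "Urban"]})

# Three levels become two indicator columns; a row of zeros means "Rural".
encoded = pd.get_dummies(toy, columns=["zip_code"], prefix="is",
                         drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids a perfectly collinear set of indicators, at the cost of making the baseline category implicit.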

In [None]:
# get the list of categorical variables
categorical_features = marketing_data.columns[marketing_data.dtypes == 'object'].to_list()

# encode data
marketing_data_encoded = pd.get_dummies(marketing_data, 
                                        columns = categorical_features, 
                                        prefix = 'is', 
                                        drop_first=True)

marketing_data_encoded.head()

In [None]:
# rename columns
cols = ['recency', 'history', 'mens','womens','newbie','visit','conversion','spend',
        'is_suburban','is_urban','phone','web','no_email','womens_email']

marketing_data_encoded.columns = cols

# reorder columns so the three outcome variables come last
reordering_cols = ['recency', 'history', 'mens','womens','newbie','is_suburban','is_urban',
                   'phone','web','no_email','womens_email','visit','conversion','spend']

marketing_data_encoded = marketing_data_encoded[reordering_cols]
marketing_data_encoded.head(5)

## 1.2 From pandas to scikit  
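
`StandardScaler` centers each feature to zero mean and rescales it to unit variance, which keeps large-valued features such as `history` from dominating the Euclidean distances used by the clustering algorithms. A minimal NumPy sketch of the same transformation, on a hypothetical two-column array:

```python
import numpy as np

# Hypothetical features: a dollar amount and a 1/0 indicator.
X_toy = np.array([[100.0, 0.0],
                  [200.0, 1.0],
                  [300.0, 1.0],
                  [400.0, 0.0]])

# z = (x - mean) / std, computed per column with the population standard deviation.
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # 1 for every column
```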

In [None]:
from sklearn import preprocessing

# convert the DataFrame to the float array format scikit-learn expects
data = marketing_data_encoded.to_numpy(dtype=float)

# the last three columns are the outcomes tracked after the campaign
y_visit      = data[:,-3]
y_conversion = data[:,-2]
y_spend      = data[:,-1]
X = data[:,0:-3]    # keep the remaining columns as features

feature_names = marketing_data_encoded.columns[0:-3].to_list()

scaler = preprocessing.StandardScaler().fit(X)
Xs = scaler.transform(X)

## 1.3 Take a look at the data

In [None]:
from sklearn.manifold import TSNE

#Take a sample and plot it
N = 5000
random_idx = np.random.choice(Xs.shape[0], N, replace=False)

X_tsne = TSNE(n_components=2, random_state=0).fit_transform(Xs[random_idx,:])

In [None]:
plt.figure(figsize=(6, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c = 'b', marker='o', alpha=0.2)
plt.xticks([])
plt.yticks([])
plt.show();

# 2. K-means

A possible strategy would be:

- Plot the `inertia` to determine the number of clusters.
- Analyze the number of samples in each cluster and the sum of squared distances to its centroid.
- For each cluster, `display` the $n$ closest and the $n$ furthest examples from its centroid.
- Analyze the feature distributions for each cluster.
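
For reference, the `inertia` in the first step is the sum of squared distances from every sample to the centroid of the cluster it is assigned to, and the per-cluster sums in the second step are its individual terms. A small NumPy sketch with hypothetical points, labels, and centroids:

```python
import numpy as np

# Hypothetical toy clustering: two clusters of two points each.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_toy = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared distance from each point to its own centroid.
sq_dist = np.sum((points - centroids[labels_toy]) ** 2, axis=1)

# Sum within each cluster, then over all clusters.
per_cluster = np.bincount(labels_toy, weights=sq_dist)
inertia_toy = per_cluster.sum()

print(per_cluster)
print(inertia_toy)
```

Inertia always decreases as the number of clusters grows, so the plot is read by looking for the point where the decrease flattens out rather than for a minimum.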

In [None]:
from sklearn.cluster import KMeans

K = range(1,20)

inertia = []
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0).fit(Xs)
    inertia.append(kmeans.inertia_)
    
plt.plot(K,inertia,'.-')
plt.xlabel('# of clusters')
plt.ylabel('Inertia')
plt.show()

In [None]:
k = 10
kmeans = KMeans(n_clusters=k, random_state=0)
labels_km = kmeans.fit_predict(Xs)

print("Cluster sizes k-means: {}".format(np.bincount(labels_km)))

# sum of squared distances from each cluster's own members to its centroid
distances = []
for i, c in enumerate(kmeans.cluster_centers_):
    d = np.sum((Xs[labels_km == i] - c) ** 2)
    distances.append(d.round(2))
    
print("Cluster distances k-means: {}".format(distances))

plt.figure(figsize=(12,4))
plt.subplot(121)
plt.bar(range(k),np.bincount(labels_km))

plt.subplot(122)
plt.bar(range(k),distances)
plt.show()

In [None]:
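def close_to_far_from_center(X, labels, centroids, cluster, n=5):
    # restrict to the members of the chosen cluster, keeping their row positions
    members = np.flatnonzero(labels == cluster)
    distance = np.sum((X[members] - centroids[cluster]) ** 2, axis=1)
    order = members[np.argsort(distance)]

    print('Close to center')
    display(marketing_data_encoded.iloc[order[:n]])

    print('Far from center')
    display(marketing_data_encoded.iloc[order[-n:]])


In [None]:
close_to_far_from_center(Xs, labels_km, kmeans.cluster_centers_, cluster=9)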

In [None]:
feature = 'history'
col_number = feature_names.index(feature)

plt.figure(figsize=(15,10))
for l in np.unique(labels_km):
    
    plt.subplot(2,5,l+1)
    plt.hist(X[labels_km == l,col_number],bins = 50, density=True)
    plt.xlabel(feature)
    plt.title('Cluster #' + str(l))

plt.show()

# 3. DBSCAN
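
scikit-learn's `DBSCAN` assigns the label `-1` to noise points, so that label should not be counted as a cluster. The sketch below works on a hypothetical label array rather than real output, to show how cluster counts and sizes can be read from the labels.

```python
import numpy as np

# Hypothetical DBSCAN-style labels: -1 is noise, 0 and 1 are clusters.
labels_toy = np.array([-1, 0, 0, 1, 1, 1, -1, 0])

# Count distinct labels, excluding the noise label if it is present.
n_clusters = len(set(labels_toy)) - (1 if -1 in labels_toy else 0)

# Shifting by one makes the labels non-negative: index 0 counts the noise points.
counts = np.bincount(labels_toy + 1)
n_noise, cluster_sizes = counts[0], counts[1:]

print(n_clusters, n_noise, cluster_sizes)
```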

In [None]:
from sklearn.cluster import DBSCAN

for eps in [1, 3, 5, 7]:
    print("\neps={}".format(eps))
    dbscan = DBSCAN(eps=eps, min_samples=10)
    labels = dbscan.fit_predict(Xs)
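    # label -1 marks noise points, so it is not counted as a cluster
    print("Number of clusters: {}".format(len(np.unique(labels[labels != -1]))))
    # first entry is the number of noise points, the rest are cluster sizes
    print("Cluster sizes: {}".format(np.bincount(labels + 1)))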

# Some other examples

- A. Müller and S. Guido, [Comparing Clustering Algorithms in the Faces Dataset](https://github.com/amueller/introduction_to_ml_with_python/blob/master/03-unsupervised-learning.ipynb).

- J. Martínez-Heras, [Clustering Dow Jones stocks](https://github.com/jmartinezheras/2018-MachineLearning-Lectures-ESA/blob/master/5_UnsupervisedLearning/5_Unsupervised_DowJones.ipynb).

- P. Mercatoris, [Hierarchical clustering of Exchange-Traded Funds](https://quantdare.com/hierarchical-clustering-of-etfs/).

- Google Machine Learning Course [Clustering with Manual Similarity Measure](https://developers.google.com/machine-learning/clustering/programming-exercise).