# Identifying the Groups of Customers 

As presented in the dataset's description:

    "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

What are the main goals of this notebook?

## Goals
**First step:**

* EDA and Data Cleaning
* Preprocessing

**Second step, we'll try to answer these questions:**

* Find the "behaviour" groups / clusters (Hint: you can use the purchase behaviour for that)
* Explain, if possible, what clusters you have found (for example, customers purchasing furniture, purchasing jewellery, etc.)
* How can you use the clusters from the given dataset to make actionable business insights, and what will these insights be?

## First Step

### EDA and Data Cleaning

In [None]:
import pandas as pd #import pandas
import numpy as np #import numpy
import matplotlib.pyplot as plt #import matplotlib.pyplot with plt as alias
%matplotlib inline

In [None]:
df = pd.read_csv("../input/ecommerce-data/data.csv", encoding = 'ISO-8859-1') # import csv file

In [None]:
df.head(1)

In [None]:
df.info()

The dataset has 541,909 rows and 8 columns of different types. A first look reveals that:
* *Some column types are not suitable, such as InvoiceDate, which is 'object' instead of datetime.*
* *CustomerID has only 406,829 non-null rows, which means that some invoices have no customer ID.*

In response to these cases:
* *Change the type of InvoiceDate.*
* *We will accept that rows without a customer ID cannot be processed and should be dropped, as we want to study customer behaviour.*
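
The two fixes listed above can be tried on a toy frame first. This is only a sketch: the values are made up, `errors="coerce"` is an optional choice that turns unparseable dates into `NaT`, and `dropna(subset=...)` is shown as a narrower alternative that drops only rows lacking a customer ID.

```python
import pandas as pd

# Toy frame mimicking the two problem columns (values are made up).
toy = pd.DataFrame({
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:28", "not a date"],
    "CustomerID": [17850.0, None, 13047.0],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# errors="coerce" turns unparseable strings into NaT instead of raising.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], errors="coerce")

# Drop only the rows that lack a customer identifier.
toy_clean = toy.dropna(subset=["CustomerID"])
print(toy_clean.dtypes)
```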

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate']) #InvoiceDate to datetime
df = df.dropna() #drop NA

As we are working with an invoice table, we have one row for each invoice line. But are we sure that we don't have any duplicates?

Let's check whether we have any:

In [None]:
df[df.duplicated(keep=False)].sort_values(by=['InvoiceNo', 'StockCode']).head(2)  # select duplicated rows and show the first 2

In [None]:
# keep=False counts every row that belongs to a group of identical rows
print('We have {} duplicates.'.format(df.duplicated(keep=False).sum()))
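
Note that `keep=False` marks every member of a duplicate group, so the number printed above is larger than the number of rows `drop_duplicates()` removes. A minimal sketch on made-up rows shows the difference:

```python
import pandas as pd

# Made-up invoice lines: the first two rows are identical.
toy = pd.DataFrame({
    "InvoiceNo": ["536409", "536409", "536409", "536412"],
    "StockCode": ["21866", "21866", "22111", "22327"],
    "Quantity": [1, 1, 1, 1],
})

# Every row that has at least one identical twin.
rows_in_duplicate_groups = toy.duplicated(keep=False).sum()

# Only the repeats (the first occurrence is kept), i.e. what gets dropped.
rows_dropped = toy.duplicated().sum()

deduplicated = toy.drop_duplicates()
print(rows_in_duplicate_groups, rows_dropped, len(deduplicated))
```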

Let's drop them:

In [None]:
df = df.drop_duplicates()
df.shape

We can add the total value of each purchase as 'Total':

In [None]:
df['Total'] = df['Quantity'] * df['UnitPrice']

In [None]:
df.describe()

We have negative values for quantity (min). How can this be interpreted?

Let's suppose that these are returns or cancelled purchases.
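
This supposition can be checked rather than assumed. The dataset's documentation describes invoice numbers starting with 'C' as cancellations, so one possible check is to compare that prefix with the sign of the quantity. The sketch below uses made-up rows; on the real data the two flags may not agree perfectly.

```python
import pandas as pd

# Made-up rows: two regular invoices and two cancellations.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "C536383"],
    "Quantity": [6, -1, 2, -12],
})

# Flag rows whose invoice number carries the cancellation prefix.
is_cancellation = toy["InvoiceNo"].astype(str).str.startswith("C")
is_negative = toy["Quantity"] < 0

# Cross-tabulate the two flags to see where they agree or disagree.
agreement = pd.crosstab(is_cancellation, is_negative)
print(agreement)

all_match = bool((is_cancellation == is_negative).all())
print("Prefix and sign agree on every row:", all_match)
```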

We can decide to split our dataset in two (one for valid purchases, another one for negative quantities) or add a new feature giving categories of purchases.

To study purchase behaviour, we need to know exactly, beyond the invoice-level detail, how many products were bought and the amount of each purchase.

In [None]:
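# First option: split valid purchases from returns/cancellations
dfp = df[df['Quantity'] > 0]
dfn = df[df['Quantity'] <= 0]

# Second option: flag each row (1 = purchase, 0 = return/cancellation)
# Note: dfd is another name for df, not a copy, so 'Purchase' is added to df too
dfd = df
dfd['Purchase'] = (df['Quantity'] > 0).astype(int)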

Aggregating Quantity Purchased and Total value purchased for each Customer, Year, Month and Product could help us get more insight into how customers act.
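
Such an aggregation is not built in this notebook, but a minimal sketch could look like the following. The rows are made up, and the output column names `QuantityPurchased` and `TotalValue` are illustrative choices.

```python
import pandas as pd

# Made-up invoice lines for two customers.
toy = pd.DataFrame({
    "CustomerID": [17850, 17850, 17850, 13047],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26", "2010-12-15 09:00", "2011-01-03 10:00", "2010-12-01 08:34"]
    ),
    "Quantity": [6, 2, 4, 3],
    "UnitPrice": [2.55, 3.39, 1.00, 4.25],
})
toy["Total"] = toy["Quantity"] * toy["UnitPrice"]

# One row per customer, year and month, with summed quantity and value.
monthly = (
    toy.assign(Year=toy["InvoiceDate"].dt.year, Month=toy["InvoiceDate"].dt.month)
       .groupby(["CustomerID", "Year", "Month"], as_index=False)
       .agg(QuantityPurchased=("Quantity", "sum"), TotalValue=("Total", "sum"))
)
print(monthly)
```

Adding `Description` or `StockCode` to the grouping keys would give the per-product view.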

In [None]:
df.nunique()

In [None]:
n_unique = df.nunique()  # compute the unique counts once
print('As we can see, we have {} unique customers and {} distinct product descriptions across {} countries.'.format(n_unique['CustomerID'], n_unique['Description'], n_unique['Country']))

In [None]:
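# One-hot encode Description and Country (one column per distinct value, so this is memory-hungry)
recod_df = pd.get_dummies(df[['Description', 'Country']], drop_first = True)
dfc = pd.concat([df, recod_df], axis=1)
# Drop identifiers and the original text columns, keeping the numeric and one-hot columns
dfc = dfc.drop(['InvoiceNo', 'StockCode', 'Description', 'Country', 'Purchase', 'InvoiceDate','CustomerID'], axis=1)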

In [None]:
dfc.head(1)

> To study customers' behaviour, we will focus on Customer ID, Price, Quantity and Total.

In [None]:
from sklearn.preprocessing import normalize

a = dfc[['Quantity', 'UnitPrice', 'Total']]
# normalize() rescales each row to unit L2 norm; it does not standardise each column
dfc_scaled = normalize(a)
dfc_scaled = pd.DataFrame(dfc_scaled, columns=a.columns)
dfc_scaled.head(1)
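
It is worth being clear about what `normalize` does: by default it rescales each row (sample) to unit L2 norm, which is different from scaling each column (feature). The sketch below contrasts it with `StandardScaler` on made-up numbers. `StandardScaler` is not used in this notebook and appears here only for comparison.

```python
import numpy as np
from sklearn.preprocessing import normalize, StandardScaler

# Made-up rows of (Quantity, UnitPrice, Total).
rows = np.array([
    [6.0, 2.55, 15.30],
    [2.0, 3.39, 6.78],
    [12.0, 1.00, 12.00],
])

# Row-wise: every sample is divided by its own L2 norm.
row_normalized = normalize(rows)
row_norms = np.linalg.norm(row_normalized, axis=1)
print("Row norms after normalize:", row_norms)

# Column-wise: every feature gets zero mean and unit variance.
column_scaled = StandardScaler().fit_transform(rows)
col_means = column_scaled.mean(axis=0)
col_stds = column_scaled.std(axis=0)
print("Column means after StandardScaler:", col_means)
print("Column stds after StandardScaler:", col_stds)
```

With row-wise normalisation, the magnitude of a purchase is discarded and only the proportions between Quantity, UnitPrice and Total within a row remain, which affects how the clusters should be interpreted.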

In [None]:
y = dfc_scaled['Total']
y1 = dfc_scaled[['Total']]
X = dfc_scaled.drop('Total', axis=1)

1 - Find the "behaviour" groups / clusters (Hint: you can use the purchase behaviour for that)

As asked, we are going to focus on total purchases as the main feature for clustering.
For this we will try k-means.

In [None]:
#Using Kmeans we will try to find first the optimal number of clusters with elbow method
from sklearn.cluster import KMeans #import kmeans

listkm = [] # define an empty list to add inertia at each number of clusters

for i in range(1,8):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(y1)
    listkm.append(km.inertia_)
    
# Plot it
plt.plot(range(1,8),listkm, marker ='s')
plt.title('Plot showing inertia versus number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
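
The elbow can also be read from numbers instead of the plot, by looking at how much the inertia falls each time a cluster is added. The sketch below does this on synthetic blobs (three made-up, well-separated centres), not on the retail data, so the values it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points around three made-up centres.
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# Inertia for k = 1..7, mirroring the loop used for the elbow plot.
inertias = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(points)
    inertias.append(model.inertia_)

# Drop in inertia when moving from k-1 to k clusters.
drops = -np.diff(inertias)
for k, drop in zip(range(2, 8), drops):
    print(f"k={k}: inertia fell by {drop:.1f}")
```

The elbow is the value of k after which the drops become small compared with the earlier ones.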

The optimal number of clusters is 3. Let's now fit a 3-cluster k-means model:

In [None]:
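# Fit k-means on X (normalised Quantity and UnitPrice); the elbow above was computed on Total only
kms = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
predict = kms.predict(X)  # cluster label for each row
ctds = kms.cluster_centers_  # centroid coordinates
print(ctds)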

In [None]:
print("Cluster label of each row:")
print(predict)
unique_elements, counts_elements = np.unique(predict, return_counts=True)
print("Number of rows per cluster:")
print(np.asarray((unique_elements, counts_elements)))

In [None]:
plt.scatter(X.iloc[predict == 0, 0], X.iloc[predict == 0, 1], s = 50, c = 'green', label = 'Group 1')
plt.scatter(X.iloc[predict == 1, 0], X.iloc[predict == 1, 1], s = 50, c = 'yellow', label = 'Group 2')
plt.scatter(X.iloc[predict == 2, 0], X.iloc[predict == 2, 1], s = 50, c = 'red', label = 'Group 3')
plt.scatter(kms.cluster_centers_[:, 0], kms.cluster_centers_[:, 1], s = 100, c = 'purple', label = 'Centroids')
plt.title('Customer clusters on price and quantity')
plt.xlabel('Quantity')
plt.ylabel('Unit Price')
plt.legend()
plt.show()

Test 2 : PCA

In [None]:
#from sklearn.decomposition import PCA
#pca = PCA().fit(dfc)
#pca_ax2 = pca.transform(dfc)

In [None]:
#import scipy.cluster.hierarchy as sch

#dist = sch.distance.pdist(dfc, lambda u, v: u != v)
#merging = sch.linkage(dfc_scaled, method='ward')

#plt.figure(figsize=(10,10))
#sch.dendrogram(merging,leaf_font_size=6, leaf_rotation=90)

2 - Explain, if possible, what clusters you have found (for example, customers purchasing furniture, purchasing jewellery, etc.)

In [None]:
from wordcloud import WordCloud, STOPWORDS

df_word1 = df[predict == 0]['Description']
df_word2 = df[predict == 1]['Description']
df_word3 = df[predict == 2]['Description']

# Concatenate all product descriptions of each cluster into one string
patchwork1 = " ".join(df_word1)
patchwork2 = " ".join(df_word2)
patchwork3 = " ".join(df_word3)

# Generate a word cloud image
wordcloud1 = WordCloud(background_color="white").generate(patchwork1)
wordcloud2 = WordCloud(background_color="white").generate(patchwork2)
wordcloud3 = WordCloud(background_color="white").generate(patchwork3)

In [None]:
#plot each cluster
plt.figure(figsize=(36, 12))
plt.subplots_adjust(top=1.2)

plt.subplot(131)
plt.imshow(wordcloud1, interpolation='bilinear')
plt.axis("off")
plt.title('Cluster 1 : Home furniture', fontsize=30)
plt.subplot(132)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.title('Cluster 2 : Travel Bags and others', fontsize=30)
plt.subplot(133)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.title('Cluster 3 : Birthday events', fontsize=30)

plt.suptitle('Wordcloud by Cluster', fontsize=70)
plt.show()

3 - How can you use the clusters from the given dataset to make actionable business insights, and what will these insights be?

The clusters above are a good hint of the kinds of products customers are fond of. They could be used to recommend other products of the same kind to customers in the same cluster.
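
One way to turn this idea into something concrete is to list the best-selling products within each cluster and use them as recommendation candidates. The sketch below is only an illustration: the product names and quantities are made up, and the `Cluster` column is a hypothetical column holding the k-means label of each row.

```python
import pandas as pd

# Made-up invoice lines with a hypothetical cluster label per row.
toy = pd.DataFrame({
    "Description": ["RED MUG", "RED MUG", "BLUE MUG",
                    "PARTY BUNTING", "PARTY BUNTING", "CAKE CASES"],
    "Quantity": [6, 4, 2, 12, 3, 24],
    "Cluster": [0, 0, 0, 1, 1, 1],
})

top_n = 1  # how many products to keep per cluster

# Total quantity sold per product within each cluster.
totals = toy.groupby(["Cluster", "Description"], as_index=False)["Quantity"].sum()

# Best sellers first within each cluster, then keep the top_n rows per cluster.
top_products = (
    totals.sort_values(["Cluster", "Quantity"], ascending=[True, False])
          .groupby("Cluster")
          .head(top_n)
)
print(top_products)
```

A customer in a given cluster who has not yet bought one of that cluster's top products would then be a natural target for a recommendation.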