# Customer Segmentation using Clustering
***
This mini-project is based on [this blog post]() by yhat. Feel free to refer to the post for additional information and solutions.

In [3]:
import pandas as pd
from ggplot import *
import seaborn as sns

ImportError: No module named 'ggplot'

In [None]:
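# pip.get_installed_distributions() was removed in pip 10;
# importlib.metadata (Python 3.8+) lists installed packages instead.
from importlib import metadata
installed_packages_list = sorted("%s==%s" % (dist.metadata["Name"], dist.version)
     for dist in metadata.distributions())
print(installed_packages_list)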

In [None]:
!pip install ggplot

In [None]:
from jupyter_core.paths import jupyter_data_dir
print(jupyter_data_dir())

In [None]:
import sys
sys.executable

In [None]:
import sys
sys.path 

In [None]:
gg_link ="/Users/rogerhuang/Downloads/ENTER/pkgs/ggplot-0.11.1-py35_1/lib/python3.5/site-packages"
sys.path.append(gg_link) 

In [None]:
cyc_link= "/Users/rogerhuang/downloads/ENTER/lib/python3.5/site-packages"
sys.path.append(cyc_link)

In [None]:
import ggplot

In [None]:
!pip install cycler

In [None]:
%pylab inline

## Data

The dataset contains both information on marketing newsletters/e-mail campaigns (e-mail offers sent) and transaction level data from customers (which offer customers responded to and what they bought).

In [None]:
# Newer pandas versions spell this keyword `sheet_name` (the old `sheetname` was removed)
df_offers = pd.read_excel("http://localhost:8888/files/Downloads/clustering/Clustering/WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head(8)

In [None]:
len(df_offers)
df_offers.origin.value_counts()

In [None]:
df_france = df_offers[df_offers.origin == "France"]
len(df_france)

In [None]:
df_france.min_qty.mean()
df_france.min_qty.std()
df_france.min_qty.describe()

In [None]:
# Second sheet holds the transactions; `sheet_name` replaces the removed `sheetname` keyword
df_transactions = pd.read_excel("http://localhost:8888/files/Downloads/clustering/Clustering/WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head(8)

In [None]:
len(df_transactions)

In [None]:
df_transactions.n.value_counts()

## Data wrangling

We're trying to learn more about how our customers behave, so we can use their behavior (whether or not they purchased something based on an offer) as a way to group like-minded customers together. We can then study those groups for patterns and trends that can help us formulate future offers.

The first thing we need is a way to compare customers. To do this, we're going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer. 

**Your turn:** Create a data frame where each row has the following columns (Use the pandas [`merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`pivot_table`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) functions for this purpose):

* customer_name
* One column for each offer, with a 1 if the customer responded to the offer
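
To make the target shape concrete, here is a minimal sketch of the merge-then-pivot step on a tiny made-up dataset. The names `toy_transactions`, `toy_offers` and `toy_matrix`, and the customers and offers in them, are hypothetical stand-ins for the real sheets. Using `aggfunc="max"` and `fill_value=0` is one way to end up with 0/1 indicators; it is not the only one.

```python
import pandas as pd

# Hypothetical toy data standing in for the real transactions sheet
toy_transactions = pd.DataFrame({
    "customer_name": ["Smith", "Smith", "Jones", "Lee"],
    "offer_id": [1, 2, 2, 3],
})
toy_transactions["n"] = 1  # indicator: this customer responded to this offer

# Hypothetical toy data standing in for the real offers sheet
toy_offers = pd.DataFrame({
    "offer_id": [1, 2, 3],
    "varietal": ["Malbec", "Pinot Noir", "Merlot"],
})

# Attach the offer details to every transaction
merged = toy_transactions.merge(toy_offers, on="offer_id", how="inner")

# One row per customer, one column per offer, 1 if they responded, else 0
toy_matrix = (
    merged.pivot_table(index="customer_name", columns="offer_id",
                       values="n", aggfunc="max", fill_value=0)
    .reset_index()
)
print(toy_matrix)
```

Each customer appears once, and the offer columns can be picked out afterwards with something like `toy_matrix.columns[1:]`.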

In [None]:
#your turn
df_final = df_transactions.merge(df_offers, how ='inner', on = "offer_id")
len(df_final)
df_final.tail()
df_final.offer_id.value_counts()

df_30 = df_final[df_final.offer_id == 30]
len(df_30)
df_30

In [None]:
df = df_final 
matrix = df.pivot_table(index=['customer_name'], columns=['offer_id'] , values = 'n')
matrix.head(3)

In [None]:
matrix = matrix.fillna(0).reset_index()
matrix.head(3)

In [None]:
x_cols = matrix.columns[1:]
print(x_cols)

## K-Means Clustering

**Your turn:** 

* Create a numpy matrix `x_cols` with only the columns representing the offers (i.e. the 0/1 columns)
* Apply the [`KMeans`](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering method from scikit-learn to this matrix. Use `n_clusters=5` (but feel free to play with this)
* Print the number of points in each cluster 
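
As a rough illustration of these three steps, the sketch below runs `KMeans` on a small made-up 0/1 matrix and counts the points per cluster. The array `toy_x` is hypothetical, and `n_clusters=2`, `n_init=10` and `random_state=0` are choices made for this example only, not values taken from the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 0/1 matrix: 6 customers (rows) x 4 offers (columns)
toy_x = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])

# random_state is fixed only so that repeated runs give the same labels
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_x)

# Number of points assigned to each cluster
cluster_ids, cluster_sizes = np.unique(toy_labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, cluster_sizes):
    print("cluster {}: {} points".format(cluster_id, size))
```

Note that the cluster ids themselves are arbitrary: only which points share a label is meaningful.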

In [None]:
#your turn
from sklearn.cluster import KMeans
cluster = KMeans(n_clusters = 5)

In [None]:
# Fit on all offer columns; columns[2:] would silently drop the first offer
matrix['cluster'] = cluster.fit_predict(matrix[x_cols])
matrix.cluster.value_counts()

matrix

## Visualizing clusters using PCA

How do we visualize clusters? Principal Component Analysis (PCA) will help. PCA has many uses, but here we're going to use it to transform our multi-dimensional dataset into a 2-dimensional dataset. Why, you ask? Once the data is in 2 dimensions (put simply, it has 2 columns), it becomes much easier to plot!

**Your turn:** Use PCA to plot your clusters:

* Use scikit-learn's [`PCA`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) function to reduce the dimensionality of your clustering data to 2 components
* Create a data frame with the following fields:
  * customer name
  * cluster id the customer belongs to
  * the two PCA components (label them `x` and `y`)
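
The sketch below shows one possible shape for that data frame on made-up data. `toy_x`, `toy_names` and `toy_clusters` are hypothetical; in the exercise they would come from the offer columns, the customer names and the `KMeans` labels. PCA is fitted once and both components are read from the same result.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical 0/1 matrix: 6 customers x 4 offers
toy_x = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])
toy_names = ["Adams", "Baker", "Clark", "Davis", "Evans", "Frank"]  # made up
toy_clusters = [0, 0, 0, 1, 1, 1]  # made-up labels, e.g. from KMeans

# Project the 4 offer columns down to 2 components in a single fit
components = PCA(n_components=2).fit_transform(toy_x)

toy_df = pd.DataFrame({
    "customer_name": toy_names,
    "cluster": toy_clusters,
    "x": components[:, 0],  # first principal component
    "y": components[:, 1],  # second principal component
})
print(toy_df)
```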

In [None]:
#your turn
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
# Fit PCA once and reuse the result for both components
components = pca.fit_transform(matrix[x_cols])
matrix['x'] = components[:, 0]
matrix['y'] = components[:, 1]
matrix = matrix.reset_index()

matrix.head(3)


In [None]:
# Keep the cluster id as well; the plots below color points by it
customer_clusters = matrix[["customer_name", "cluster", "x", "y"]]

customer_clusters.head(3)

We've taken those columns of 0/1 indicator variables and transformed them into a 2-D dataset. We took one column and arbitrarily called it `x`, then called the other `y`. Now we can throw each point into a scatterplot. We'll color-code each point based on its cluster so the groups are easier to see.

**Your turn:**

* Plot a scatterplot of the `x` vs `y` columns
* Color-code points differently based on cluster ID

How do the clusters look?
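
If `ggplot` is not importable in your environment, a scatterplot colored by cluster can also be sketched with matplotlib. This is a swapped-in alternative, not the approach the original post uses. The data frame `toy_df` and its values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame with the same columns as the PCA output
toy_df = pd.DataFrame({
    "customer_name": ["Adams", "Baker", "Clark", "Davis"],
    "cluster": [0, 0, 1, 1],
    "x": [-1.0, -0.8, 0.9, 1.1],
    "y": [0.2, -0.1, 0.3, -0.4],
})

fig, ax = plt.subplots()
# One scatter call per cluster, so each gets its own color and legend entry
for cluster_id, group in toy_df.groupby("cluster"):
    ax.scatter(group["x"], group["y"], label="cluster {}".format(cluster_id))

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Customers Grouped by Cluster")
ax.legend()
```

In a notebook with inline plotting enabled, the figure should render below the cell.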

In [None]:
#your turn
df = pd.merge(df_transactions,customer_clusters)
df.head(3)


In [None]:
df = pd.merge(df_offers, df)
df.head(3)

In [None]:
from ggplot import *

ggplot(df, aes(x='x', y = 'y', color = 'cluster')) + \
    geom_point(size=75) + \
    ggtitle("Customers grouped by Roger clusters")

In [None]:
ggplot(df, aes(x='x', y='y', color='cluster')) + \
    geom_point(size=75) + \
    ggtitle("Customers Grouped by Cluster")

In [None]:
df = pd.merge(df_transactions, customer_clusters)
df = pd.merge(df_offers, df)

from ggplot import *

ggplot(df, aes(x='x', y='y', color="cluster")) + \
    geom_point(size=75) + \
    ggtitle("Customers Grouped by Cluster")

**Your turn (extra credit):** Play with the following: 

* Different initializations for `KMeans`
* Other clustering algorithms in scikit-learn
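
As a starting point for both bullets, the sketch below compares two `KMeans` initializations by inertia and then tries `AgglomerativeClustering` on the same made-up matrix. The array `toy_x` and all parameter values are assumptions chosen for this example, and agglomerative clustering is just one of the alternatives scikit-learn offers.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Hypothetical 0/1 matrix: 6 customers x 4 offers
toy_x = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])

# Compare two initialization strategies by within-cluster sum of squares
inertias = {}
for init in ("k-means++", "random"):
    model = KMeans(n_clusters=2, init=init, n_init=10, random_state=0).fit(toy_x)
    inertias[init] = model.inertia_
print(inertias)

# A different algorithm on the same data: bottom-up hierarchical clustering
agg_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(toy_x)
print(agg_labels)
```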