### Welcome to my project! Today we will study three fundamental clustering algorithms of unsupervised machine learning: K-means, hierarchical clustering, and DBSCAN.

# Customer Segmentation with K-Means
Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy because it lets a business target specific groups of customers and allocate marketing resources effectively. For example, one group might contain customers who are high-profit and low-risk, that is, more likely to purchase products or subscribe to a service. A business task is then to retain those customers. Another group might include customers from non-profit organizations, and so on.

In [None]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
%matplotlib inline

## Reading file 

In [None]:
cust_df=pd.read_csv('../input/clustering/Cust_Segmentation.csv')
cust_df.head()

In [None]:
cust_df.shape

## Pre-processing 

Address in this dataset is a categorical variable. The k-means algorithm isn't directly applicable to categorical variables because the Euclidean distance function isn't meaningful for discrete, unordered values. So, let's drop this feature and run the clustering.

In [None]:
df = cust_df.drop('Address', axis=1)
df.head()
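
The point about categorical variables can be illustrated with a minimal sketch. The three labels below are hypothetical (they are not taken from the dataset), and integer coding is just one assumed way of forcing categories into numbers. The resulting distances reflect only the arbitrary order of the codes:

```python
import pandas as pd

# Hypothetical category labels, for illustration only.
addresses = pd.Series(['A', 'B', 'C'], dtype='category')

# Integer codes follow the sorted order of the labels: A -> 0, B -> 1, C -> 2.
codes = addresses.cat.codes.to_numpy().astype(float)

# "Distance" between A and B versus A and C under this coding.
d_ab = abs(codes[0] - codes[1])
d_ac = abs(codes[0] - codes[2])

# C appears twice as far from A as B does, although the labels have no order.
print(d_ab, d_ac)
```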

Now let's normalize the dataset. But why do we need normalization in the first place? Normalization is a statistical method that helps distance-based algorithms **interpret features with different magnitudes and distributions equally**. We use StandardScaler(), which rescales each feature to zero mean and unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler
ss=StandardScaler()
X = df.values[:,1:]   # Select all columns except CustomerId
X = np.nan_to_num(X)  # Convert all NaN values to zero
Clust_dataset=ss.fit_transform(X)
Clust_dataset
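
As a rough illustration of what StandardScaler does, the sketch below applies it to a small made-up array (the ages and incomes are assumptions, not rows from the dataset) and compares the result with a manual z-score:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is an age in years, column 1 an income in dollars.
toy = np.array([[25.0, 20000.0],
                [35.0, 50000.0],
                [45.0, 80000.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual z-score: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns contribute on a comparable scale to any Euclidean distance, regardless of their original units.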

## Modeling
The KMeans class has many parameters that can be used, but we will be using these three:

* init: Initialization method for the centroids. Value will be: "k-means++", which selects the initial cluster centers for k-means clustering in a smart way to speed up convergence.
* n_clusters: The number of clusters to form, which is also the number of centroids to generate. Value will be: 3
* n_init: Number of times the k-means algorithm will be run with different centroid seeds. The final result is the best of the n_init consecutive runs in terms of inertia. Value will be: 12

Let's initialize KMeans with these parameters, apply it to our dataset, and take a look at the cluster labels.

In [None]:
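kmeans = KMeans(init='k-means++', n_clusters=3, n_init=12)
# Note: the model is fit on the raw matrix X; pass Clust_dataset instead
# to cluster on the standardized features.
kmeans.fit(X)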

In [None]:
print(kmeans.labels_)
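
To make the term inertia concrete, here is a minimal sketch on synthetic data. The blob centers, sample count, and random_state are assumptions chosen for illustration. Inertia is the sum of squared distances from each point to its assigned centroid, and this is the quantity used to pick the best of the n_init runs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): three well-separated blobs.
pts, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                    cluster_std=0.5, random_state=0)

km = KMeans(init='k-means++', n_clusters=3, n_init=12, random_state=0).fit(pts)

# Recompute inertia by hand: squared distance of each point to its own centroid.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((pts - assigned_centers) ** 2).sum()

print(km.inertia_, manual_inertia)
```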

## Insights
We assign the labels to each row in our dataframe:

In [None]:
df['k-means']=kmeans.labels_
df.head()

We can check the centroid of each cluster by averaging its features:

In [None]:
df.groupby('k-means').mean()

Fortunately, the fitted KMeans object also exposes the centroids directly through its cluster_centers_ attribute:

In [None]:
kmeans.cluster_centers_
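
The link between the two views of the centroids can be checked on synthetic data. This is a sketch under stated assumptions: the blobs, the column names f0 and f1, and the random_state are made up for illustration. Once k-means has converged, each row of cluster_centers_ is the mean of the points assigned to that cluster:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated data (an assumption for illustration).
pts, _ = make_blobs(n_samples=90, centers=[[0, 0], [6, 6], [0, 6]],
                    cluster_std=0.4, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(pts)

toy_df = pd.DataFrame(pts, columns=['f0', 'f1'])
toy_df['k-means'] = km.labels_

# groupby sorts by label, which matches the row order of cluster_centers_.
group_means = toy_df.groupby('k-means').mean().to_numpy()

print(np.abs(group_means - km.cluster_centers_).max())  # largest difference
```

The two only agree when the grouped columns are exactly the data the model was fit on, in the same units.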

### Now, let's look at the distribution of customers based on their age and income:

In [None]:
plt.scatter(X[:, 0], X[:, 3], c=kmeans.labels_.astype(float), alpha=0.5)  # np.float was removed from NumPy
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)

plt.show()

Color distinguishes the clusters. The blue cluster corresponds to customers with the highest income, who are also middle-aged or older. The yellow cluster covers middle incomes across young to old customers. Finally, the defining characteristic of the purple cluster is relatively low income at all ages.
We could add another interesting feature to this plot, such as education, which we might expect to be positively correlated with income. We can do this by making the size of each point grow with the education level.

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

scatter = ax.scatter(X[:, 0], X[:, 3], s=(X[:, 1]**3), c=kmeans.labels_.astype(float), alpha=0.5)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()

In the figure above, the big circles show a slight tendency to appear at middle to old ages. On the other hand, there are also tiny circles at old ages, some of them with high income.
K-means partitions our customers into mutually exclusive groups. The customers in each cluster are similar to each other demographically. Now we can create a profile for each group, considering the common characteristics of each cluster. For example, the 3 clusters could be:

* AFFLUENT, EDUCATED AND OLD AGED
* MIDDLE AGED AND MIDDLE INCOME
* YOUNG AND LOW INCOME

# Clustering on Vehicle dataset with Hierarchical agglomerative clustering
A famous automobile manufacturer has developed prototypes for a new vehicle. Before introducing the new model into its range, the manufacturer wants to determine which existing vehicles on the market are most like the prototypes, that is, how vehicles can be grouped, which group is most similar to the model, and therefore which models it will be competing against.

Our objective here is to use clustering methods to find the most distinctive clusters of vehicles. This will summarize the existing vehicles and help manufacturers make decisions about the supply of new models.

In [None]:
from scipy import ndimage 
import pylab
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 

In [None]:
pdf=pd.read_csv('../input/clustering/cars_clus.csv')
pdf.head()

The features include price in thousands (price), engine size (engine_s), horsepower (horsepow), wheelbase (wheelbas), width (width), length (length), curb weight (curb_wgt), fuel capacity (fuel_cap) and fuel efficiency (mpg).

In [None]:
pdf.dtypes

Clearly, the column types do not make sense, so we have to convert the columns that should be numerical to float or int and drop the rows with non-numerical values.

In [None]:
pdf.size #Number of rows x columns

In [None]:
num_cols = pdf.columns[2:]  # Select all columns which should be numerical
pdf[num_cols] = pdf[num_cols].apply(pd.to_numeric, errors='coerce')  # Non-numerical values become NaN
pdf = pdf.dropna()  # Drop every row that contains a NaN
pdf = pdf.reset_index(drop=True)  
pdf.head()  

In [None]:
pdf.dtypes

Now the column types make sense. Below we see that the size of the new dataframe is 1872, so we dropped 672 values (size counts rows x columns, not rows).

In [None]:
pdf.size
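
The clean-up steps above can be sketched on a tiny made-up table. The column names and the '$null$' placeholder are assumptions for illustration. The sketch shows how errors='coerce' turns non-numerical strings into NaN, and how size relates to shape:

```python
import pandas as pd

# Hypothetical table; '$null$' stands in for any non-numerical placeholder.
toy = pd.DataFrame({'model': ['a', 'b', 'c'],
                    'price': ['21.5', '$null$', '30.0'],
                    'mpg': ['28', '25', '$null$']})

num_cols = ['price', 'mpg']
# Strings that cannot be parsed as numbers become NaN.
toy[num_cols] = toy[num_cols].apply(pd.to_numeric, errors='coerce')

size_before = toy.size  # rows x columns = 3 x 3
cleaned = toy.dropna().reset_index(drop=True)

print(size_before, cleaned.shape, cleaned.size)
```

Only the first row survives, so the size drops from 9 to 3 although just two cells were invalid.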

## Feature selection
Let's select our feature set:

In [None]:
featureset = pdf[['engine_s',  'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]
featureset.head()

Now we can normalize the feature set. MinMaxScaler transforms features by scaling each feature to a given range, which is (0, 1) by default. That is, this estimator scales and translates each feature individually so that it lies between zero and one.

In [None]:
from sklearn.preprocessing import MinMaxScaler
x = featureset.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
feature_mtx [0:5]
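
For reference, the min-max transformation can be written out by hand. The sketch below uses a small made-up array (the values are assumptions) and checks that MinMaxScaler with its default range matches (x - min) / (max - min) per column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales.
toy = np.array([[100.0, 1.5],
                [150.0, 3.0],
                [200.0, 4.5]])

scaled = MinMaxScaler().fit_transform(toy)

# Manual version of the same transformation, column by column.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

print(scaled)
```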

## Distance Measurements
We begin the agglomerative clustering process by measuring the distance between the data points using the Euclidean distance.

In [None]:
from sklearn.metrics.pairwise import euclidean_distances
dist_matrix = euclidean_distances(feature_mtx,feature_mtx) 
print(dist_matrix)

In [None]:
dist_matrix.shape
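
As a small check of what the distance matrix contains, the sketch below uses three made-up points (an assumption for illustration) and compares euclidean_distances with the formula written out in NumPy:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical points forming 3-4-5 right triangles along a line.
toy = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [6.0, 8.0]])

d = euclidean_distances(toy, toy)

# Manual version: square root of the summed squared coordinate differences.
diff = toy[:, None, :] - toy[None, :, :]
manual = np.sqrt((diff ** 2).sum(axis=2))

print(d)
```

The matrix is square (one row and one column per point), symmetric, and has zeros on the diagonal.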

In [None]:
from scipy.spatial.distance import squareform
# linkage expects a condensed distance vector, so convert the square matrix first
condensed_dist = squareform(dist_matrix, checks=False)
Z_using_dist_matrix = hierarchy.linkage(condensed_dist, 'complete')
plt.figure()
dn = hierarchy.dendrogram(Z_using_dist_matrix)

In [None]:
fig = pylab.figure(figsize=(18,50))
def llf(id):
    # Label each leaf with manufacturer, model and vehicle type
    return '[%s %s %s]' % (pdf['manufact'][id], pdf['model'][id], int(float(pdf['type'][id])) )
    
dendro = hierarchy.dendrogram(Z_using_dist_matrix,  leaf_label_func=llf, leaf_rotation=0, leaf_font_size =12, orientation = 'right')
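
A dendrogram can also be cut into flat clusters. The following is a minimal sketch on made-up points, not the vehicle data. It assumes pdist for the condensed distance vector and fcluster with the maxclust criterion as one reasonable way to cut the tree:

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Hypothetical points forming two obvious groups.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# pdist returns the condensed (upper-triangle) distance vector that linkage expects.
Z = hierarchy.linkage(pdist(toy, metric='euclidean'), method='complete')

# Cut the tree so that at most 2 flat clusters remain.
flat = hierarchy.fcluster(Z, t=2, criterion='maxclust')

print(flat)
```

Each row of Z records one merge, so a tree over n points has n - 1 rows.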

The AgglomerativeClustering class will be given two parameters:

* n_clusters: The number of clusters to find. Value will be: 6
* linkage: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion. Let's see the outcome for 'ward' and then 'complete'.

In [None]:
# Fit on the normalized features: 'ward' linkage works on Euclidean distances between observations
agglom = AgglomerativeClustering(n_clusters = 6, linkage = 'ward')
agglom.fit(feature_mtx)

agglom.labels_

In [None]:
agglom = AgglomerativeClustering(n_clusters = 6, linkage = 'complete')
agglom.fit(feature_mtx)

agglom.labels_

Compare the labels printed above for the two linkage methods. The cluster numbers themselves are arbitrary, so the same group of vehicles can receive a different number under each method, but we can still see patterns and groups.
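
Comparing two labelings by eye is error-prone because the numbering is arbitrary. One possible check, sketched here on made-up points (the data and the choice of the adjusted Rand index are assumptions, not part of the original analysis), is a score that ignores label order:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical points forming three clearly separated groups.
toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
                [4.0, 4.0], [4.2, 4.1], [4.1, 4.2],
                [8.0, 0.0], [8.2, 0.1], [8.1, 0.2]])

ward_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(toy)
complete_labels = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(toy)

# The adjusted Rand index is 1.0 when two partitions are identical up to renumbering.
ari = adjusted_rand_score(ward_labels, complete_labels)

print(ward_labels, complete_labels, ari)
```

A score clearly below 1.0 on real data would mean the two linkage methods group some vehicles differently, not just number them differently.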

In [None]:
pdf['cluster']=agglom.labels_
pdf.head()

Let's group by cluster to find the cluster centers:

In [None]:
pdf.groupby('cluster').mean(numeric_only=True)  # Skip text columns such as manufact and model

As with K-means, we can make a scatter plot of 2 main features, with a third feature controlling the size of the points, and use color to distinguish the clusters and their distribution.

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

scatter = ax.scatter(pdf['horsepow'], pdf['mpg'], s=(pdf['price']*3), c=pdf.cluster.astype(float), alpha=0.7)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
plt.xlabel('Horsepower', fontsize=18)
plt.ylabel('mpg', fontsize=16)
plt.show()

In the figure above, it is not very clear where the centroid of each cluster lies. Moreover, there are 2 types of vehicles in our dataset, "truck" (value of 0 in the type column) and "car" (value of 1 in the type column). So, we use them to distinguish the classes and summarize each cluster:

In [None]:
agg_cars = pdf.groupby(['cluster','type'])[['horsepow','engine_s','mpg','price']].mean()  # A list is needed to select several columns
agg_cars

Let's make a similar scatter plot as before, but using this new data grouped by type of automobile:

In [None]:
import matplotlib.cm as cm
n_clusters = max(agglom.labels_)+1
colors = cm.rainbow(np.linspace(0, 1, n_clusters))
cluster_labels = list(range(0, n_clusters))

plt.figure(figsize=(16,10))
for color, label in zip(colors, cluster_labels):
    subset = agg_cars.loc[(label,),]
    for i in subset.index:
        row = subset.loc[i]  # Mean features for this cluster and vehicle type
        plt.text(row['horsepow']+5, row['mpg'], 'type='+str(int(i)) + ', price='+str(int(row['price']))+'k')
    plt.scatter(subset.horsepow, subset.mpg, s=subset.price*20, color=color, label='cluster'+str(label))
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')

# Clustering weather stations with DBSCAN

Most of the traditional clustering techniques, such as k-means, hierarchical and fuzzy clustering, can be used to group data without supervision. However, when applied to tasks with arbitrarily shaped clusters, or clusters within clusters, the traditional techniques might be unable to achieve good results. That is, elements in the same cluster might not share enough similarity, or the performance may be poor. In contrast, density-based clustering locates regions of high density that are separated from one another by regions of low density. Density, in this context, is defined as the number of points within a specified radius.

DBSCAN is especially good for tasks like class identification in a spatial context. The wonderful attribute of the DBSCAN algorithm is that it can find clusters of arbitrary shape without being affected by noise. For example, the following example clusters the locations of weather stations in Canada. As we will see, it not only finds differently shaped clusters, but can also find the denser parts of the data by ignoring less dense areas or noise.
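
The difference on arbitrarily shaped clusters can be sketched with two concentric rings. Everything here is an assumption for illustration: the synthetic rings, the eps and min_samples values, and the use of the adjusted Rand index to score each result against the known ring membership:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic data: two concentric rings with 100 evenly spaced points each.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])  # radius 1
outer = 3.0 * inner                                        # radius 3
rings = np.vstack([inner, outer])
truth = np.array([0] * 100 + [1] * 100)

# Neighbors on the same ring are closer than eps; the rings are 2 units apart.
db_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(rings)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(rings)

db_score = adjusted_rand_score(truth, db_labels)
km_score = adjusted_rand_score(truth, km_labels)

print(db_score, km_score)
```

K-means separates its two clusters with a straight boundary, so it cannot isolate the inner ring from the ring that surrounds it, whereas DBSCAN follows the dense path along each ring.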

In [None]:
from sklearn.cluster import DBSCAN 

In [None]:
pdf=pd.read_csv('../input/clustering/weather-stations20140101-20141231.csv')
pdf.head()

Let's remove rows that don't have any value in the Tm field.

In [None]:
pdf.size

In [None]:
pdf.head()

In [None]:
pdf['Tm'].isnull().sum()

In [None]:
pdf = pdf.dropna(subset=['Tm'])  # Keep only rows that have a Tm value
pdf = pdf.reset_index(drop=True)
pdf.head()

In [None]:
pdf.dtypes

In [None]:
pdf.size

## Visualization
Visualization of stations on a map using the basemap package. The matplotlib basemap toolkit is a library for plotting 2D data on maps in Python. Basemap does not do any plotting on its own, but provides the facilities to transform coordinates to a map projection.

In [None]:
# Notice: For visualization of the map, you need the basemap package.
# If you don't have basemap installed on your machine, you can use the following line to install it
!conda install -c conda-forge  basemap matplotlib==3.1 -y
# Notice: you might have to refresh your page and re-run the notebook after installation

In [None]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14,10)

llon=-140
ulon=-50
llat=40
ulat=65

# Copy the filtered rows so that new columns can be added without a pandas warning
pdf = pdf[(pdf['Long'] > llon) & (pdf['Long'] < ulon) & (pdf['Lat'] > llat) &(pdf['Lat'] < ulat)].copy()

my_map = Basemap(projection='merc',
            resolution = 'l', area_thresh = 1000.0,
            llcrnrlon=llon, llcrnrlat=llat, #min longitude (llcrnrlon) and latitude (llcrnrlat)
            urcrnrlon=ulon, urcrnrlat=ulat) #max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
# my_map.drawmapboundary()
my_map.fillcontinents(color = 'white', alpha = 0.3)
my_map.shadedrelief()

# To collect data based on stations        

xs,ys = my_map(np.asarray(pdf.Long), np.asarray(pdf.Lat))
pdf['xm']= xs.tolist()
pdf['ym'] =ys.tolist()

#Visualization1
for index,row in pdf.iterrows():
#   x,y = my_map(row.Long, row.Lat)
   my_map.plot(row.xm, row.ym,markerfacecolor =([1,0,0]),  marker='o', markersize= 5, alpha = 0.75)
#plt.text(x,y,stn)
plt.show()


## Clustering of stations based on their location i.e. Lat & Lon
DBSCAN from the sklearn library can run DBSCAN clustering from a vector array or a distance matrix. In our case, we pass it the NumPy array Clus_dataSet to find core samples of high density and expand clusters from them.
It works based on two parameters: Epsilon and Minimum Samples
* Epsilon: Determines the radius of a neighborhood. If the neighborhood contains enough points, we call it a dense area. Value = 0.15.
* Minimum Samples: Determines the minimum number of data points we want in a neighborhood for it to be defined as a cluster. Value = 10.
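
The effect of the two parameters can be seen on a tiny made-up example. The points and the eps and min_samples values below are assumptions for illustration and differ from the values used on the weather stations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical points: five close together and one far away.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
                [5.0, 5.0]])

# A point is a core sample if at least min_samples points (itself included)
# lie within distance eps of it.
db = DBSCAN(eps=0.5, min_samples=3).fit(toy)

print(db.labels_)               # cluster id per point, -1 for noise
print(db.core_sample_indices_)  # indices of the core samples
```

The five nearby points form one cluster, while the isolated point has no neighbors within eps and is labeled as noise.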

In [None]:
from sklearn.preprocessing import StandardScaler
Clus_dataSet = pdf[['xm','ym']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)

# Compute DBSCAN
db = DBSCAN(eps=0.15, min_samples=10).fit(Clus_dataSet)
labels = db.labels_
pdf["Clus_Db"]=labels

realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels)) 


# A sample of clusters
pdf[["Stn_Name","Tx","Tm","Clus_Db"]].head(5)

We can group by cluster label and see the centroid of each cluster based on location:

In [None]:
pdf[["xm","ym","Tx","Tm","Clus_Db"]].groupby('Clus_Db').mean()

As we know, DBSCAN detects outliers and assigns them the cluster label -1:

In [None]:
set(labels)
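
The number of real clusters and the number of outliers can be read from the label array. A minimal sketch, using a hand-written label array as an assumption in place of actual DBSCAN output:

```python
import numpy as np

# Hypothetical DBSCAN labels: -1 marks noise, other integers are cluster ids.
toy_labels = np.array([0, 0, 1, -1, 1, -1, 2])

# Count distinct labels, leaving out the noise label if it is present.
n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_noise = int(np.sum(toy_labels == -1))

print(n_clusters, n_noise)
```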

## Visualization of clusters based on location:

In [None]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14,10)

my_map = Basemap(projection='merc',
            resolution = 'l', area_thresh = 1000.0,
            llcrnrlon=llon, llcrnrlat=llat, #min longitude (llcrnrlon) and latitude (llcrnrlat)
            urcrnrlon=ulon, urcrnrlat=ulat) #max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
#my_map.drawmapboundary()
my_map.fillcontinents(color = 'white', alpha = 0.3)
my_map.shadedrelief()

# To create a color map
colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, clusterNum))



#Visualization1
for clust_number in set(labels):
    c=(([0.4,0.4,0.4]) if clust_number == -1 else colors[int(clust_number)])  # np.int was removed from NumPy
    clust_set = pdf[pdf.Clus_Db == clust_number]                    
    my_map.scatter(clust_set.xm, clust_set.ym, color =c,  marker='o', s= 20, alpha = 0.85)
    if clust_number != -1:
        cenx=np.mean(clust_set.xm) 
        ceny=np.mean(clust_set.ym) 
        plt.text(cenx,ceny,str(clust_number), fontsize=25, color='red',)
        print ("Cluster "+str(clust_number)+', Avg Temp: '+ str(np.mean(clust_set.Tm)))